# **Machine Learning Project - Predicting Titanic Survival :**
- ### Write a program that takes Titanic data and uses machine-learning algorithms to predict whether or not a person will survive. The program should be able to handle preprocessing of the data, such as cleaning up missing values and creating new features if required.
- ### Your task is to choose multiple machine learning algorithms and compare their accuracy in predicting survival rates. You can use metrics such as accuracy, precision, recall, or F1 score to evaluate the performance of each model.

---
## Importing required modules :

In [92]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

## Loading the data from the **'titanic.csv'** file into a Pandas DataFrame called **'data'** :

In [93]:
data = pd.read_csv('/content/drive/MyDrive/Data_Science/Projects/ML Projects/Datasets/titanic.csv')
data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Checking the details of the dataset :

In [94]:
data.shape

(891, 12)

In [95]:
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [96]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [97]:
data.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


## Calculating null values :

In [98]:
data.isnull().sum()

Column,Null count
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
Cabin,687
Embarked,2


## Replacing null values :

In [99]:
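# Age: fill the 177 missing values with the column mean
data['Age'] = data['Age'].fillna(data['Age'].mean())
# Cabin and Embarked: fill missing values with the most frequent value (mode)
data['Cabin'] = data['Cabin'].fillna(data['Cabin'].mode()[0])
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])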

In [100]:
data.isnull().sum().sum()

0
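
Since only 204 of the 891 `Cabin` values are present, filling the rest with the mode gives most passengers the same cabin. One possible alternative, shown here only as a sketch on a toy frame, is to record whether a cabin was listed at all. The column name `HasCabin` and the frame `cabin_demo` are hypothetical and not part of the notebook; on the real data this step would have to run before the `fillna` call above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Cabin column (values taken from the first five rows shown above)
cabin_demo = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123", np.nan]})

# Hypothetical feature: 1 if a cabin was recorded, 0 if it was missing
cabin_demo["HasCabin"] = cabin_demo["Cabin"].notna().astype(int)
print(cabin_demo)

# Share of missing Cabin values in the full dataset, from the info() counts
missing_share = 1 - 204 / 891
print(f"{missing_share:.1%}")  # 77.1%
```

Whether such an indicator actually helps the models here is untested; it is one way to address the "creating new features" part of the task.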

## Converting the **'Sex'** column to numerical values using **LabelEncoder** and printing the transformed column :

In [101]:
encoder = LabelEncoder()
data['Sex'] = encoder.fit_transform(data['Sex'])
print(data['Sex'])

0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: Sex, Length: 891, dtype: int64
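
`LabelEncoder` assigns integer codes in sorted order of the labels, which is why `female` becomes 0 and `male` becomes 1 in the output above. A small sketch on a toy series (the names `sex_demo` and `encoder_demo` are illustrative only) shows how to read the mapping back from the fitted encoder:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'Sex' column (first five rows of the dataset)
sex_demo = pd.Series(["male", "female", "female", "female", "male"])

encoder_demo = LabelEncoder()
codes = encoder_demo.fit_transform(sex_demo)

# classes_ holds the original labels; the position of each label is its code
mapping = {label: code for code, label in enumerate(encoder_demo.classes_)}
print(mapping)      # {'female': 0, 'male': 1}
print(list(codes))  # [1, 0, 0, 0, 1]
```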


## Separating features and target variable :

In [102]:
x = data[['Pclass','Sex','Age','SibSp','Parch','Fare']]
y = data['Survived']

## Checking class balance and oversampling the minority class with **SMOTE** :

In [103]:
print("Before SMOTE :\n", y.value_counts())

# !pip install imbalanced-learn -- installs the package that provides imblearn
from imblearn.over_sampling import SMOTE

# Oversample the minority class (survivors) with synthetic samples
smote = SMOTE()
x_resampled, y_resampled = smote.fit_resample(x, y)

print("After SMOTE :\n", y_resampled.value_counts())

Before SMOTE :
 Survived
0    549
1    342
Name: count, dtype: int64
After SMOTE :
 Survived
0    549
1    549
Name: count, dtype: int64
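
The notebook resamples the full dataset before splitting, so synthetic rows can end up in the test set as well as the training set. A commonly recommended alternative is to split first and oversample only the training part, so the test set contains only original passengers. The sketch below illustrates that ordering on synthetic stand-in data with the same class counts (549 vs 342). To keep it dependent on scikit-learn only, plain random oversampling via `sklearn.utils.resample` stands in for SMOTE; with imbalanced-learn installed, the same ordering would apply by calling `fit_resample` on the training split. All `*_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data with the notebook's class counts (not the Titanic features)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"feature": rng.normal(size=891)})
y_demo = pd.Series([0] * 549 + [1] * 342, name="Survived")

# 1. Split first; stratify so both parts keep the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2. Oversample the minority class in the training part only
train = X_tr.assign(Survived=y_tr)
majority = train[train["Survived"] == 0]
minority = train[train["Survived"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Survived"].value_counts())  # balanced training classes
print(y_te.value_counts())                        # test set keeps the original imbalance
```

Metrics obtained this way would not be directly comparable with the numbers reported later in this notebook, which come from a test set drawn from the resampled data.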


## Splitting the resampled data into training and testing sets :

In [104]:
x_train, x_test, y_train, y_test = train_test_split(x_resampled, y_resampled, test_size = 0.2, random_state = 42)

In [105]:
x_train

,Pclass,Sex,Age,SibSp,Parch,Fare
2,3,0,26.000000,0,0,7.925000
6,1,1,54.000000,0,0,51.862500
578,3,0,29.699118,1,0,14.458300
636,3,1,32.000000,0,0,7.925000
844,3,1,17.000000,0,0,8.662500
...,...,...,...,...,...,...
466,2,1,29.699118,0,0,0.000000
121,3,1,29.699118,0,0,8.050000
1044,1,0,28.041533,0,0,221.023107
1095,2,0,32.398833,0,0,13.000000


## Standardizing the training and testing features :

In [106]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

x_train

array([[ 0.90841961, -1.13160041, -0.26062997, -0.45963024, -0.47522086,
        -0.5110742 ],
       [-1.44436039,  0.88370417,  1.85678998, -0.45963024, -0.47522086,
         0.32621331],
       [ 0.90841961, -1.13160041,  0.01910522,  0.4702208 , -0.47522086,
        -0.38657347],
       ...,
       [-1.44436039, -1.13160041, -0.10624486, -0.45963024, -0.47522086,
         3.54979371],
       [-0.26797039, -1.13160041,  0.22326346, -0.45963024, -0.47522086,
        -0.41436332],
       [ 0.90841961,  0.88370417,  0.87370214,  1.40007184, -0.47522086,
        -0.39324319]])
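
`StandardScaler` rescales each column to `(value - mean) / standard deviation`, with the mean and standard deviation learned from the training data only and then reused for the test data. The sketch below checks this on a few toy `Age`/`Fare` pairs (rounded values from the first rows of the dataset); `train_demo`, `test_demo`, and `scaler_demo` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy [Age, Fare] rows: four "training" rows and one "test" row
train_demo = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93], [35.0, 53.10]])
test_demo = np.array([[35.0, 8.05]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean/std from train
test_scaled = scaler_demo.transform(test_demo)        # reuses the train statistics

# Manual version of the same formula (population std, ddof=0)
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(np.allclose(test_scaled, manual))           # True
print(np.allclose(train_scaled.mean(axis=0), 0))  # True
```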

# Logistic Regression :

In [107]:
# Defining the model :
model_1 = LogisticRegression()

# Fitting the model :
model_1.fit(x_train, y_train)

# Predicting based on test features :
y_pred = model_1.predict(x_test)

# Evaluating the model :
# score() returns the mean accuracy, i.e. the same value as accuracy_score(y_test, y_pred)
score_1 = model_1.score(x_test, y_test)
precision_1 = precision_score(y_test, y_pred)
recall_1 = recall_score(y_test, y_pred)
print("Accuracy :", score_1)
print("Precision :", precision_1)
print("Recall :", recall_1)

Accuracy : 0.8363636363636363
Precision : 0.8188976377952756
Recall : 0.8888888888888888


# Decision Tree Classifier :

In [108]:
model_2 = DecisionTreeClassifier()
model_2.fit(x_train, y_train)
y_pred = model_2.predict(x_test)
score_2 = model_2.score(x_test, y_test)
precision_2 = precision_score(y_test, y_pred)
recall_2 = recall_score(y_test, y_pred)
print("Accuracy :",score_2)
print("Precision :",precision_2)
print("Recall :",recall_2)

Accuracy : 0.7909090909090909
Precision : 0.8034188034188035
Recall : 0.8034188034188035


# Random Forest Classifier :

In [109]:
model_3 = RandomForestClassifier()
model_3.fit(x_train, y_train)
y_pred = model_3.predict(x_test)
score_3 = model_3.score(x_test, y_test)
precision_3 = precision_score(y_test, y_pred)
recall_3 = recall_score(y_test, y_pred)
print("Accuracy :",score_3)
print("Precision :",precision_3)
print("Recall :",recall_3)

Accuracy : 0.8272727272727273
Precision : 0.8495575221238938
Recall : 0.8205128205128205


# Support Vector Classifier :

In [110]:
model_4 = SVC()
model_4.fit(x_train, y_train)
y_pred = model_4.predict(x_test)
score_4 = model_4.score(x_test, y_test)
precision_4 = precision_score(y_test, y_pred)
recall_4 = recall_score(y_test, y_pred)
print("Accuracy :",score_4)
print("Precision :",precision_4)
print("Recall :",recall_4)

Accuracy : 0.8636363636363636
Precision : 0.9142857142857143
Recall : 0.8205128205128205
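
The four cells above repeat the same fit, predict, and score steps. As a sketch of how that could be condensed, the loop below evaluates the same four model classes with one helper. It runs on synthetic stand-in data from `make_classification`, not the Titanic set, so its numbers will not match the results above; the helper `evaluate` and all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with six features, like the notebook's feature matrix
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
scaler_demo = StandardScaler()
X_tr = scaler_demo.fit_transform(X_tr)
X_te = scaler_demo.transform(X_te)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVC": SVC(),
}

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit one model and return its test-set metrics as a dict."""
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    return {
        "accuracy": accuracy_score(y_te, y_pred),
        "precision": precision_score(y_te, y_pred),
        "recall": recall_score(y_te, y_pred),
        "f1": f1_score(y_te, y_pred),
    }

results = {name: evaluate(m, X_tr, y_tr, X_te, y_te) for name, m in models.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Fixing `random_state` on the tree-based models makes repeated runs of the sketch give the same numbers.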


## So we can see :
- ### **Logistic Regression :** Shows good accuracy (83.6%) and the highest recall (88.9%), meaning it correctly identifies the largest share of true survivors. Its lower precision (81.9%) means that more of its predicted survivors are false positives.

- ### **Decision Tree :** Has the lowest accuracy (79.1%), with identical precision and recall (80.3%) that are also the lowest of the four models. It classifies both survivors and non-survivors reasonably well, but leaves room for improvement.

- ### **Random Forest :** Offers better accuracy (82.7%) than the Decision Tree, along with good precision (85.0%). Its recall (82.1%) is lower than Logistic Regression's, meaning it misses more true survivors (false negatives).

- ### **Support Vector Classifier (SVC) :** Achieves the highest accuracy (86.4%) and the highest precision (91.4%), indicating strong overall performance with few false positives. Its recall (82.1%) matches Random Forest's and is below Logistic Regression's (88.9%), so it still misses some true survivors.

- ### **In summary**, SVC gives the best accuracy and precision, but Logistic Regression might be preferred if identifying as many true survivors as possible is crucial (high recall).
---
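
The task statement also mentions the F1 score, which the cells above do not print. Because F1 is the harmonic mean of precision and recall, it can be derived from the values already reported. The sketch below does that using the printed outputs; the helper name `f1_from` is hypothetical.

```python
# Precision and recall exactly as printed by the four model cells above
reported = {
    "Logistic Regression": (0.8188976377952756, 0.8888888888888888),
    "Decision Tree": (0.8034188034188035, 0.8034188034188035),
    "Random Forest": (0.8495575221238938, 0.8205128205128205),
    "SVC": (0.9142857142857143, 0.8205128205128205),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}
for name, f1 in f1_scores.items():
    print(f"{name}: F1 = {f1:.3f}")

# Logistic Regression: F1 = 0.852
# Decision Tree: F1 = 0.803
# Random Forest: F1 = 0.835
# SVC: F1 = 0.865
```

On these numbers SVC also has the highest F1, with Logistic Regression second.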