# Titanic Machine Learning Model

## Data Dictionary

Variable|Definition|Key
------------- | ------------- | -------------
survival|Survival|0 = No, 1 = Yes
pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd
sex|Sex|
age|Age in years|Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp|# of siblings / spouses aboard the Titanic|
parch|# of parents / children aboard the Titanic|
ticket|Ticket number|
fare|Passenger fare|
cabin|Cabin number|
embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton

<br>

## Variable Notes
**pclass:** A proxy for socio-economic status (SES)
* 1st = Upper
* 2nd = Middle
* 3rd = Lower
<br>

**age:** Age is fractional if less than 1. If the age is estimated, it is in the form xx.5<br>

**sibsp:** The dataset defines family relations in this way...<br>
* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife (mistresses and fiancés were ignored)

**parch:** The dataset defines family relations in this way...<br>
* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson
* Some children travelled only with a nanny, therefore parch=0 for them.

In [1]:
import pandas as pd

In [5]:
df = pd.read_csv("C:/Users/ertug/Desktop/Titanic ML Project/titanic_ml/dataset/train.csv")

In [17]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [18]:
df_train = df.drop(['PassengerId', 'Name', 'Survived', 'Ticket'], axis=1)

In [24]:
df_train

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked_Q,Embarked_S
0,3,0,22.0,1,0,7.2500,,0,1
1,1,1,38.0,1,0,71.2833,C,0,0
2,3,1,26.0,0,0,7.9250,,0,1
3,1,1,35.0,1,0,53.1000,C,0,1
4,3,0,35.0,0,0,8.0500,,0,1
...,...,...,...,...,...,...,...,...,...
886,2,0,27.0,0,0,13.0000,,0,1
887,1,1,19.0,0,0,30.0000,B,0,1
888,3,1,28.0,1,2,23.4500,,0,1
889,1,0,26.0,0,0,30.0000,C,0,0


In [20]:
# Fill missing Age values with the median
# (assign the result back; inplace=True on a selected column is deprecated in recent pandas)
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].median())

In [21]:
df_train['Cabin'] = df_train['Cabin'].str[0]  # First letter of cabin

In [22]:
# Fill missing Embarked values with the mode
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode()[0])

In [23]:
# Convert Sex to numerical values
df_train['Sex'] = df_train['Sex'].map({'male': 0, 'female': 1})

# One-hot encode Embarked
df_train = pd.get_dummies(df_train, columns=['Embarked'], drop_first=True)
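
The effect of `drop_first=True` in the cell above can be seen on a tiny example. This is a minimal sketch on made-up rows (the `toy` frame is hypothetical, not data from `train.csv`): the first category in sorted order, `C`, is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy rows for illustration only; not taken from train.csv.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Categories are sorted (C, Q, S) and the first one is dropped,
# so a row with Embarked_Q == 0 and Embarked_S == 0 stands for Cherbourg.
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

This is why the processed frame contains only `Embarked_Q` and `Embarked_S`: two indicator columns are enough to encode three ports.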

In [28]:
# Mark passengers without a recorded cabin as 'U' (unknown)
df_train['Cabin'] = df_train['Cabin'].fillna('U')

# Optionally, you can convert it to a categorical variable
df_train['Cabin'] = df_train['Cabin'].astype('category')
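
The two Cabin steps (keep the first letter, then label missing values as `U`) can be sketched on a few made-up cabin strings. The `toy` frame below is hypothetical and only illustrates the string handling.

```python
import pandas as pd

# Made-up cabin values for illustration; None stands for a missing cabin.
toy = pd.DataFrame({"Cabin": ["C85", None, "B42", "F G73"]})

# .str[0] keeps the first character and leaves missing values missing;
# fillna('U') then labels those rows as "unknown".
deck = toy["Cabin"].str[0].fillna("U")

print(deck.tolist())
```

Note that an entry listing several cabins is reduced to the letter of the first one only.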

In [29]:
# Check for any missing values
print(df_train.isnull().sum())

# Check data types
print(df_train.dtypes)

Pclass        0
Sex           0
Age           0
SibSp         0
Parch         0
Fare          0
Cabin         0
Embarked_Q    0
Embarked_S    0
dtype: int64
PassengerId       int64
Survived          int64
Pclass            int64
Name             object
Sex              object
Age             float64
SibSp             int64
Parch             int64
Ticket           object
Fare            float64
Cabin          category
Embarked         object
dtype: object


In [32]:
df_train['Cabin'].unique()

['U', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T']
Categories (9, object): ['A', 'B', 'C', 'D', ..., 'F', 'G', 'T', 'U']

In [33]:
# One-hot encode the 'Cabin' column (now holding the cabin's first letter)
df_train = pd.get_dummies(df_train, columns=['Cabin'], drop_first=True)

In [35]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Pclass      891 non-null    int64  
 1   Sex         891 non-null    int64  
 2   Age         891 non-null    float64
 3   SibSp       891 non-null    int64  
 4   Parch       891 non-null    int64  
 5   Fare        891 non-null    float64
 6   Embarked_Q  891 non-null    uint8  
 7   Embarked_S  891 non-null    uint8  
 8   Cabin_B     891 non-null    uint8  
 9   Cabin_C     891 non-null    uint8  
 10  Cabin_D     891 non-null    uint8  
 11  Cabin_E     891 non-null    uint8  
 12  Cabin_F     891 non-null    uint8  
 13  Cabin_G     891 non-null    uint8  
 14  Cabin_T     891 non-null    uint8  
 15  Cabin_U     891 non-null    uint8  
dtypes: float64(2), int64(4), uint8(10)
memory usage: 50.6 KB


In [41]:
# Scale the Age and Fare columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = df_train.copy()  # Make a copy to avoid modifying original dataframe

# Apply scaling to Age and Fare
df_scaled[['Age', 'Fare']] = scaler.fit_transform(df_train[['Age', 'Fare']])

In [42]:
df_scaled

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Cabin_U
0,3,0,-0.565736,1,0,-0.502445,0,1,0,0,0,0,0,0,0,1
1,1,1,0.663861,1,0,0.786845,0,0,0,1,0,0,0,0,0,0
2,3,1,-0.258337,0,0,-0.488854,0,1,0,0,0,0,0,0,0,1
3,1,1,0.433312,1,0,0.420730,0,1,0,1,0,0,0,0,0,0
4,3,0,0.433312,0,0,-0.486337,0,1,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,2,0,-0.181487,0,0,-0.386671,0,1,0,0,0,0,0,0,0,1
887,1,1,-0.796286,0,0,-0.044381,0,1,1,0,0,0,0,0,0,0
888,3,1,-0.104637,1,2,-0.176263,0,1,0,0,0,0,0,0,0,1
889,1,0,-0.258337,0,0,-0.044381,0,0,0,1,0,0,0,0,0,0


In [43]:
from sklearn.model_selection import train_test_split

# Separate features and target
X = df_scaled  # Features only; 'Survived' was already dropped when df_train was built
y = df['Survived']  # The target column

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
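
In the cells above the scaler is fitted on all 891 rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that order on a small made-up frame; the `toy` data, `target`, and the `X_tr`/`X_te` names are assumptions for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for df_train; values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 19.0, 28.0, 26.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 30.00, 23.45, 30.00, 51.86],
    "Sex": [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
})
target = pd.Series([0, 1, 1, 1, 0, 0, 1, 0, 1, 0], name="Survived")

# Split first, then copy so the assignments below do not touch the toy frame.
X_tr, X_te, y_tr, y_te = train_test_split(toy, target, test_size=0.2, random_state=42)
X_tr, X_te = X_tr.copy(), X_te.copy()

num_cols = ["Age", "Fare"]
scaler = StandardScaler()
X_tr[num_cols] = scaler.fit_transform(X_tr[num_cols])  # statistics learned from training rows only
X_te[num_cols] = scaler.transform(X_te[num_cols])      # same statistics reused on the test rows

print(X_tr[num_cols].mean().round(6).tolist())
```

For the tree-based models used later the difference is likely small, since tree splits do not depend on feature scale, but the split-first order keeps the test set untouched by preprocessing.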


### Logistic Regression

In [44]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

LogisticRegression(random_state=42)

In [45]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred = model.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')

[[90 15]
 [19 55]]
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       105
           1       0.79      0.74      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179

Accuracy: 0.8100558659217877
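
The figures in the report follow directly from the confusion matrix. As a short worked sketch (plain Python, using only the four counts printed above, with class 1 = survived as the positive class):

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 90, 15   # true class 0: correctly rejected / wrongly predicted as survived
fn, tp = 19, 55   # true class 1: missed survivors / correctly predicted survivors

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The values match the class 1 row of the report (0.79, 0.74, 0.76) and the overall accuracy of 0.81.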


### Random Forest

In [48]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

RandomForestClassifier(random_state=42)

In [49]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred = rf_model.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')

[[86 19]
 [19 55]]
              precision    recall  f1-score   support

           0       0.82      0.82      0.82       105
           1       0.74      0.74      0.74        74

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179

Accuracy: 0.7877094972067039


### Gradient Boosting Classifier

In [51]:
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

GradientBoostingClassifier(random_state=42)

In [52]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred = gb_model.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')

[[92 13]
 [19 55]]
              precision    recall  f1-score   support

           0       0.83      0.88      0.85       105
           1       0.81      0.74      0.77        74

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179

Accuracy: 0.8212290502793296


In [53]:
### Since the aim is to reduce false positives (FP), precision plays a crucial role here.

In [55]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Define the parameter grid
param_dist = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.001, 0.01, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.8, 1.0]
}

# Initialize the Gradient Boosting Classifier
gbc = GradientBoostingClassifier(random_state=42)

# Set up the RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=gbc,
    param_distributions=param_dist,
    n_iter=50,  # Number of different combinations to try
    scoring='precision',  # To focus on decreasing FP
    n_jobs=-1,
    cv=5,
    verbose=1,
    random_state=42
)

# Fit the model to find the best parameters
random_search.fit(X_train, y_train)

# Print the best parameters
print("Best Parameters Found:")
print(random_search.best_params_)

# Predict with the best estimator
best_gbc = random_search.best_estimator_
y_pred = best_gbc.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters Found:
{'subsample': 1.0, 'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': 3, 'learning_rate': 0.001}
[[101   4]
 [ 37  37]]
              precision    recall  f1-score   support

           0       0.73      0.96      0.83       105
           1       0.90      0.50      0.64        74

    accuracy                           0.77       179
   macro avg       0.82      0.73      0.74       179
weighted avg       0.80      0.77      0.75       179
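
To see what optimizing for precision cost, the four confusion matrices reported so far can be put side by side. This sketch only re-uses the counts printed in this notebook; the `reported` and `summary` names are illustrative.

```python
# Confusion matrices as printed in this notebook: [[TN, FP], [FN, TP]].
reported = {
    "Logistic Regression": [[90, 15], [19, 55]],
    "Random Forest": [[86, 19], [19, 55]],
    "Gradient Boosting (default)": [[92, 13], [19, 55]],
    "Gradient Boosting (tuned)": [[101, 4], [37, 37]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in reported.items():
    summary[name] = {
        "false_positives": fp,
        "false_negatives": fn,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

for name, row in summary.items():
    print(f"{name:28s} FP={row['false_positives']:3d} FN={row['false_negatives']:3d} "
          f"precision={row['precision']:.2f} recall={row['recall']:.2f}")
```

Compared with the default Gradient Boosting model, the tuned model cuts false positives from 13 to 4, while false negatives rise from 19 to 37 and recall for class 1 drops from 0.74 to 0.50.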



In [56]:
### Final Model

In [57]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Define the best parameters
best_params = {
    'subsample': 1.0,
    'n_estimators': 300,
    'min_samples_split': 5,
    'min_samples_leaf': 1,
    'max_depth': 3,
    'learning_rate': 0.001
}

# Rebuild the Gradient Boosting Classifier with the best parameters
best_gbc_model = GradientBoostingClassifier(**best_params, random_state=42)

# Train the model with the best parameters
best_gbc_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_best = best_gbc_model.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred_best))
print(classification_report(y_test, y_pred_best))

[[101   4]
 [ 37  37]]
              precision    recall  f1-score   support

           0       0.73      0.96      0.83       105
           1       0.90      0.50      0.64        74

    accuracy                           0.77       179
   macro avg       0.82      0.73      0.74       179
weighted avg       0.80      0.77      0.75       179



### Drop the Cabin Column (687 of 891 Values Missing) and Rebuild the Model

In [58]:
df_final = df.drop(['PassengerId', 'Name', 'Survived', 'Ticket', 'Cabin'], axis=1)

In [59]:
df_final

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.2500,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.9250,S
3,1,female,35.0,1,0,53.1000,S
4,3,male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...
886,2,male,27.0,0,0,13.0000,S
887,1,female,19.0,0,0,30.0000,S
888,3,female,,1,2,23.4500,S
889,1,male,26.0,0,0,30.0000,C


In [61]:
df_final['Sex'] = df_final['Sex'].map({'male': 0, 'female': 1})
df_final['Age'] = df_final['Age'].fillna(df_final['Age'].median())
df_final['Embarked'] = df_final['Embarked'].fillna(df_final['Embarked'].mode()[0])

# One-hot encode Embarked
df_final = pd.get_dummies(df_final, columns=['Embarked'], drop_first=True)

# Check for any missing values
print(df_final.isnull().sum())

Pclass        0
Sex           0
Age           0
SibSp         0
Parch         0
Fare          0
Embarked_Q    0
Embarked_S    0
dtype: int64


In [63]:
# Scale the Age and Fare columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled1 = df_final.copy()  # Make a copy to avoid modifying original dataframe

# Apply scaling to Age and Fare
df_scaled1[['Age', 'Fare']] = scaler.fit_transform(df_scaled1[['Age', 'Fare']])

In [66]:
df_scaled1

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S
0,3,0,-0.565736,1,0,-0.502445,0,1
1,1,1,0.663861,1,0,0.786845,0,0
2,3,1,-0.258337,0,0,-0.488854,0,1
3,1,1,0.433312,1,0,0.420730,0,1
4,3,0,0.433312,0,0,-0.486337,0,1
...,...,...,...,...,...,...,...,...
886,2,0,-0.181487,0,0,-0.386671,0,1
887,1,1,-0.796286,0,0,-0.044381,0,1
888,3,1,-0.104637,1,2,-0.176263,0,1
889,1,0,-0.258337,0,0,-0.044381,0,0


In [67]:
from sklearn.model_selection import train_test_split

# Separate features and target
X = df_scaled1  # Features only; 'Survived' was already dropped when df_final was built
y = df['Survived']  # The target column

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [68]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Define the best parameters
best_params = {
    'subsample': 1.0,
    'n_estimators': 300,
    'min_samples_split': 5,
    'min_samples_leaf': 1,
    'max_depth': 3,
    'learning_rate': 0.001
}

# Rebuild the Gradient Boosting Classifier with the best parameters
best_gbc_model = GradientBoostingClassifier(**best_params, random_state=42)

# Train the model with the best parameters
best_gbc_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_best = best_gbc_model.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred_best))
print(classification_report(y_test, y_pred_best))

[[101   4]
 [ 37  37]]
              precision    recall  f1-score   support

           0       0.73      0.96      0.83       105
           1       0.90      0.50      0.64        74

    accuracy                           0.77       179
   macro avg       0.82      0.73      0.74       179
weighted avg       0.80      0.77      0.75       179



### The confusion matrix is identical with and without Cabin, so the Cabin info does not seem to add any value for this model and split
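
The comparison above rests on a single train/test split. One way to check it more broadly is to compare cross-validated precision for the two feature sets. The sketch below is an assumption-laden illustration, not the notebook's method: `mean_cv_precision` is a hypothetical helper, it uses a default `GradientBoostingClassifier` rather than the tuned one, and it runs on synthetic data from `make_classification` so that it is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def mean_cv_precision(X, y, folds=5):
    """Mean precision over `folds` cross-validation folds (hypothetical helper)."""
    model = GradientBoostingClassifier(random_state=42)
    return cross_val_score(model, X, y, cv=folds, scoring="precision").mean()


# Synthetic stand-in data; in the notebook the two inputs would be
# df_scaled (with Cabin_* columns) and df_scaled1 (without), each with df['Survived'].
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=42
)

with_extra = mean_cv_precision(X_demo, y_demo)            # all 8 columns
without_extra = mean_cv_precision(X_demo[:, :6], y_demo)  # last 2 columns dropped

print(f"with extra columns:    {with_extra:.3f}")
print(f"without extra columns: {without_extra:.3f}")
```

If the two scores stay close across folds on the real data, that would support dropping the Cabin columns; a clear gap would suggest the single split hid a difference.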