# Problem: Titanic


# Dataset Description
The dataset, taken from the Kaggle [website](https://www.kaggle.com/c/titanic/data), has been split into two groups:
1. training set (train.csv)
2. test set (test.csv)
    

- The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

- The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

- We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
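
The rule behind `gender_submission.csv` can be sketched in a few lines. This is a minimal illustration, assuming a made-up five-row stand-in for `test.csv`; the `Sex` values below are placeholders, not read from the real file.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of test.csv; only the columns used here.
test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Sex": ["male", "female", "male", "male", "female"],
})

# Rule behind gender_submission.csv: all and only female passengers survive.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})

print(submission.to_csv(index=False))
```

The two-column layout (`PassengerId`, `Survived`) is the same one the later cells write out with `to_csv`.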

##### **Variable Notes** according to the [reference](https://www.kaggle.com/c/titanic/data)

__pclass__: A proxy for socio-economic status (SES)
    1st = Upper
    2nd = Middle
    3rd = Lower

__age__: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

__sibsp__: The dataset defines family relations in this way...
    Sibling = brother, sister, stepbrother, stepsister
    Spouse = husband, wife (mistresses and fiancés were ignored)

__parch__: The dataset defines family relations in this way...
    Parent = mother, father
    Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.
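
As an example of the feature engineering mentioned earlier, `sibsp` and `parch` can be combined into a family-size feature. The sketch below uses made-up passengers, and the column names `FamilySize` and `IsAlone` are illustrative assumptions, not part of the dataset or of this notebook's models.

```python
import pandas as pd

# Hypothetical passengers; SibSp and Parch follow the definitions above.
passengers = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 2, 0],
})

# FamilySize and IsAlone are illustrative names, not columns of the dataset.
# The +1 counts the passenger themselves.
passengers["FamilySize"] = passengers["SibSp"] + passengers["Parch"] + 1
passengers["IsAlone"] = (passengers["FamilySize"] == 1).astype(int)

print(passengers)
```

Note that a child travelling only with a nanny would be counted as alone under this definition, since `parch=0` for them.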

In [26]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text, export_graphviz
from sklearn.metrics import accuracy_score

In [27]:
df = pd.read_csv('titanic datasets\\train.csv', index_col='PassengerId')  ## We are also changing the Index to the "PassengerId" column
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [28]:
df.info();

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [29]:
df.describe()

,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [30]:
print(f"The Columns of the dataset are: \n\t{df.columns}")

The Columns of the dataset are: 
	Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')


### Data Preprocessing
Let's check the data more:

In [31]:
df.isnull().sum() # To check for missing values

# Handle missing values (either by dropping or filling them with appropriate values)
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Convert categorical data to numerical data
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Drop unnecessary columns
df = df.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)
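
An alternative to the `fillna` call above is scikit-learn's `SimpleImputer`, which learns the column means once and can reuse them on another frame. The sketch below is not what this notebook does (the notebook fills each file with its own mean); the two miniature frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature frames standing in for train.csv and test.csv.
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0],
                      "Fare": [7.25, 71.28, 7.92, 53.10]})
test = pd.DataFrame({"Age": [np.nan, 47.0],
                     "Fare": [7.83, np.nan]})

imputer = SimpleImputer(strategy="mean")
# Learn the column means from the training rows only ...
train[["Age", "Fare"]] = imputer.fit_transform(train[["Age", "Fare"]])
# ... and reuse those same means for the test rows.
test[["Age", "Fare"]] = imputer.transform(test[["Age", "Fare"]])

print(imputer.statistics_)  # per-column means learned from the training frame
print(test)
```

Reusing the training means keeps the test rows from influencing the preprocessing, at the cost of one extra object to carry around.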

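Shuffle the dataset before splitting. `train_test_split` also shuffles by default, so this step is optional, but a fixed `random_state` keeps the row order reproducible.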

In [32]:
# Shuffling the Dataset
df = df.sample(
    frac=1,
    random_state=10  ## To become able to reproduce a same result
)
df.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
591,0,3,0,35.0,0,0,7.125
132,0,3,0,20.0,0,0,7.05
629,0,3,0,26.0,0,0,7.8958
196,1,1,1,58.0,0,0,146.5208
231,1,1,1,35.0,1,0,83.475


Let's split the dataset into train and test subsets: about 80% of the rows form the training set and the remaining 20% form the test set.

In [33]:
X = df.drop('Survived', axis=1)
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(712, 6) (179, 6) (712,) (179,)
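
The 712/179 split printed above follows from how `train_test_split` handles a fractional `test_size`: the test count is rounded up and the training set receives the rest. A small sketch with placeholder arrays (only the shape matters, the zeros are not real data):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features = 891, 6               # shape of X in the cell above
X_demo = np.zeros((n_samples, n_features))   # placeholder values
y_demo = np.zeros(n_samples, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

n_test = math.ceil(0.2 * n_samples)          # fractional test sizes are rounded up
print(X_tr.shape, X_te.shape)
print(n_samples - n_test, n_test)
```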


Next, we fit the model and then make predictions:

In [34]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

# Check accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Model Accuracy: {accuracy}')

Model Accuracy: 0.664804469273743
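
To put this accuracy in context, it helps to compare it with a trivial baseline that predicts non-survival for everyone. The label counts below are derived from the `describe()` output (a mean of 0.383838 over 891 rows corresponds to 342 survivors). The baseline is computed over the whole training file rather than the 179-row hold-out, so the comparison is only approximate.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Label counts implied by describe(): mean(Survived) = 0.383838 over 891 rows,
# i.e. 342 survivors and 549 non-survivors in train.csv.
y_all = np.array([1] * 342 + [0] * 549)

# A "model" that predicts non-survival for every passenger.
always_zero = np.zeros_like(y_all)

baseline = accuracy_score(y_all, always_zero)
print(f"Majority-class baseline: {baseline:.4f}")
```

`accuracy_score` is simply the fraction of labels that match, so the baseline equals 549/891.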


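In [None]:
filled = True
feature_names = list(X_train.columns)
class_names = ['Unsurv', 'Surv']  # ordered to match model.classes_: 0 = did not survive, 1 = survived
rounded = True
proportion = True

fig = plt.figure(figsize=(70, 95))
plot_tree(
    model,
    feature_names=feature_names,
    class_names=class_names,
    filled=filled,
    rounded=rounded,
    proportion=proportion,
);
fig.savefig('titanic_fig.png')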

In [None]:
print(export_text(model))  ## Showing the output with the Text tree
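
By default `export_text` labels the splits `feature_0`, `feature_1`, and so on. Passing `feature_names` makes the rules readable. The sketch below uses a made-up six-row toy set, not the Titanic data, in which survival depends only on `Sex`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [Sex, Pclass], with Sex encoded as in the notebook
# (male = 0, female = 1).
X_toy = [[0, 3], [0, 1], [1, 3], [1, 1], [0, 2], [1, 2]]
y_toy = [0, 0, 1, 1, 0, 1]   # in this toy set, survival depends only on Sex

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# feature_names replaces the generic feature_0, feature_1 labels.
report = export_text(toy_tree, feature_names=["Sex", "Pclass"])
print(report)
```

For the notebook's model, the equivalent call would pass `list(X_train.columns)` as `feature_names`.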

The tree can also be rendered with the `graphviz` library:

In [None]:
import graphviz

dot_data = export_graphviz(model, out_file=None, 
                           feature_names=feature_names, 
                           class_names=class_names,
                           filled=filled)

# Draw graph
graph = graphviz.Source(dot_data, format="png") 
graph.render("decision_tree_graphivz")
graph

### Testing the model

In [96]:
df2 = pd.read_csv(r'titanic datasets\test.csv')
passengerid = df2['PassengerId']
df2 = pd.read_csv(r'titanic datasets\test.csv', index_col=['PassengerId'])


df2.isnull().sum() # To check for missing values

# Handle missing values (either by dropping or filling them with appropriate values)
df2['Age'] = df2['Age'].fillna(df2['Age'].mean())

# Convert categorical data to numerical data
df2['Sex'] = df2['Sex'].map({'male': 0, 'female': 1})

# Drop unnecessary columns
df2 = df2.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)


main_pred = model.predict(df2)
o_df = pd.DataFrame({'PassengerId': passengerid, 'Survived': main_pred})
o_df = o_df.set_index('PassengerId')
o_df.to_csv('o.csv')
o_df.head()


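ValueError: Input X contains NaN.
RandomForestClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

The error comes from `test.csv`, which has a missing `Fare` value that the cell above does not fill; the later sections impute `Fare` as well. The message names `RandomForestClassifier` rather than `DecisionTreeClassifier`, most likely because this cell (`In [96]`) was run after the random forest cells below had reassigned `model`.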
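
One way to avoid this kind of error is to send every file through a single preprocessing function that fills all numeric gaps and then checks that no NaN is left. The sketch below assumes a hypothetical helper named `preprocess` and made-up rows shaped like `test.csv`; it mirrors the notebook's steps and adds a `Fare` fill.

```python
import numpy as np
import pandas as pd

def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the notebook's steps, plus a Fare fill."""
    out = frame.copy()
    for column in ("Age", "Fare"):
        out[column] = out[column].fillna(out[column].mean())
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    return out.drop(columns=["Name", "Ticket", "Cabin", "Embarked"])

# Made-up rows shaped like test.csv, with one missing Age and one missing Fare.
raw = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "male"],
    "Age": [34.5, np.nan, 62.0],
    "SibSp": [0, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["1", "2", "3"],
    "Fare": [7.83, 82.27, np.nan],
    "Cabin": [np.nan, "B45", np.nan],
    "Embarked": ["Q", "S", "Q"],
})

clean = preprocess(raw)
# Fail early, before predict(), if any NaN survived preprocessing.
assert not clean.isnull().values.any(), "NaN values would make predict() fail"
print(clean)
```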

# Implementation of the Titanic Problem with Random Forest

Step 1: Importing Libraries and loading the dataset

In [77]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv(r'titanic datasets\train.csv')

Step 2: Exploratory Data Analysis

In [78]:
data.head();print()
data.info();print()
data.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB



,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Step 3: Data Preprocessing 

In [79]:
data.isnull().sum() # To check for missing values

# Handle missing values
data['Age'] = data['Age'].fillna(data['Age'].mean())

# Convert categorical data to numerical data
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# Drop unnecessary columns
data = data.drop(['Name', 'Ticket', 'Cabin', 'Embarked', 'PassengerId'], axis=1)

Step 4: Splitting the dataset

In [80]:
X = data.drop('Survived', axis=1)
y = data['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=10)


Step 5: Building the Random Forest model

In [93]:
model = RandomForestClassifier(n_estimators=110) # n_estimators is the number of trees in the forest
model.fit(X_train, y_train)
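
A fitted forest exposes its individual trees and an impurity-based importance score per feature. The sketch below uses synthetic data from `make_classification` as a stand-in for the six Titanic features, and sets `random_state`, which the cell above does not, so that repeated runs give the same forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the six Titanic features; not the real data.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=110, random_state=0)
forest.fit(X_syn, y_syn)

print(len(forest.estimators_))       # one fitted tree per n_estimators
print(forest.feature_importances_)   # impurity-based importances, normalised to sum to 1
```

On the notebook's model, pairing `model.feature_importances_` with `X_train.columns` would show which of the six features the forest relies on most.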

Step 6: Making predictions and Checking accuracy

In [106]:
predictions = model.predict(X_test)
print(data.isnull().sum())  # confirm that no missing values remain

# Check accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Model Accuracy: {accuracy}')

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
dtype: int64
Model Accuracy: 0.8156424581005587


### Testing the Model

In [104]:
df2 = pd.read_csv(r'titanic datasets\test.csv')
passengerid = df2['PassengerId']
df2 = pd.read_csv(r'titanic datasets\test.csv', index_col=['PassengerId'])


df2.isnull().sum() # To check for missing values

# Handle missing values (either by dropping or filling them with appropriate values)
df2['Age'] = df2['Age'].fillna(df2['Age'].mean())
df2['Fare'] = df2['Fare'].fillna(df2['Fare'].mean())

# Convert categorical data to numerical data
df2['Sex'] = df2['Sex'].map({'male': 0, 'female': 1})


# Drop unnecessary columns
df2 = df2.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)

print(df2.isnull().sum())  # confirm that no missing values remain

main_pred = model.predict(df2)
o_df = pd.DataFrame({'PassengerId': passengerid, 'Survived': main_pred})
o_df = o_df.set_index('PassengerId')
o_df.to_csv('output_RandomForest.csv')
o_df.head()

Pclass    0
Sex       0
Age       0
SibSp     0
Parch     0
Fare      0
dtype: int64


PassengerId,Survived
892,0
893,0
894,0
895,1
896,0
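
Before uploading, the written file can be read back and checked. The sketch below assumes a hypothetical helper `check_submission`; its rules (two columns named `PassengerId` and `Survived`, labels restricted to 0 and 1) are assumptions based on the `gender_submission.csv` layout described at the top. The five rows are copied from the `head()` output above, and an in-memory buffer stands in for the CSV file.

```python
import io

import pandas as pd

def check_submission(frame: pd.DataFrame, expected_ids) -> None:
    """Hypothetical sanity check; the rules are assumptions about the layout."""
    assert list(frame.columns) == ["PassengerId", "Survived"]
    assert frame["PassengerId"].tolist() == list(expected_ids)
    assert frame["Survived"].isin([0, 1]).all()

# Values copied from the head() output above.
o_df = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 0, 0, 1, 0],
})

buffer = io.StringIO()
o_df.to_csv(buffer, index=False)   # index=False avoids writing an extra index column
buffer.seek(0)

round_trip = pd.read_csv(buffer)
check_submission(round_trip, range(892, 897))
print(round_trip)
```

The notebook reaches the same two-column file by setting `PassengerId` as the index before calling `to_csv`.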


# Hyperparameter Tuning

Step 1: Import additional required libraries

In [108]:
from sklearn.model_selection import GridSearchCV


Step 2: Define the parameter grid

The parameter grid is a dictionary where the keys are the parameters and the values are the settings to be tested.

In [112]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt'],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

Step 3: Initialize the classifier and GridSearchCV

We will initialize the Random Forest classifier and the GridSearchCV, and then fit it to our data.

In [113]:
rf = RandomForestClassifier()
grid_search = GridSearchCV(
    estimator=rf, 
    param_grid=param_grid, 
    cv=3, 
    n_jobs=-1, 
    verbose=2
    )

grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 216 candidates, totalling 648 fits
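
The 216 candidates reported above are the product of the list lengths in `param_grid` (3 × 1 × 4 × 3 × 3 × 2), and each candidate is fitted once per fold. This can be confirmed with scikit-learn's `ParameterGrid` without training anything:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in Step 2.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt'],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

n_candidates = len(ParameterGrid(param_grid))  # every combination of the listed values
cv = 3
print(n_candidates, n_candidates * cv)
```

Adding one more value to any list multiplies the number of fits, which is worth checking before launching a long search.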


In the GridSearchCV, the arguments passed are:

- estimator: The model to tune

- param_grid: The dictionary of parameter values to try; every combination is evaluated

- cv: The cross-validation splitting strategy (here, 3 folds)

- n_jobs: Number of jobs to run in parallel (-1 uses all available processors)

- verbose: Controls how much progress output is printed while fitting
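
The same arguments can be seen at work on a much smaller search. This is a minimal sketch on synthetic data with a deliberately tiny grid; the data, the grid values, and the explicit `scoring="accuracy"` are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the Titanic features.
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

small_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}  # 2 x 2 = 4 candidates
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
    scoring="accuracy",   # stated explicitly; classifiers score by accuracy by default
    n_jobs=1,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))        # mean cross-validated accuracy of the best candidate
print(len(search.cv_results_["params"]))   # number of candidates tried
```

`best_score_` is a cross-validated estimate on the training data, so it can differ from the accuracy measured later on a separate hold-out set.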

Step 4: Get the best parameters

After fitting, we can get the parameters that give the best results:



In [114]:
grid_search.best_params_

{'bootstrap': False,
 'max_depth': 10,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 50}

Step 5: Evaluate the model

`best_estimator_` is the model refit on the whole training set with these parameters (GridSearchCV's default `refit=True`), so we can use it directly to check the accuracy.


In [115]:
best_grid = grid_search.best_estimator_
predictions = best_grid.predict(X_test)

# Check accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Model Accuracy: {accuracy}')

Model Accuracy: 0.8324022346368715


## Testing the Model

In [118]:
df2 = pd.read_csv(r'titanic datasets\test.csv')
passengerid = df2['PassengerId']
df2 = pd.read_csv(r'titanic datasets\test.csv', index_col=['PassengerId'])


df2.isnull().sum() # To check for missing values

# Handle missing values (either by dropping or filling them with appropriate values)
df2['Age'] = df2['Age'].fillna(df2['Age'].mean())
df2['Fare'] = df2['Fare'].fillna(df2['Fare'].mean())

# Convert categorical data to numerical data
df2['Sex'] = df2['Sex'].map({'male': 0, 'female': 1})


# Drop unnecessary columns
df2 = df2.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)

# print(df2.isnull().sum())

main_pred = best_grid.predict(df2)
o_df = pd.DataFrame({'PassengerId': passengerid, 'Survived': main_pred})
o_df = o_df.set_index('PassengerId')
o_df.to_csv('output_RandomForest_tuned.csv')
o_df.head()

PassengerId,Survived
892,0
893,0
894,0
895,0
896,0


The output files were submitted to Kaggle for scoring.