TASK 2: Predictive Analysis, Supervised Learning – Titanic
This task is about classifying a large set of data based on a set of pre-classified samples.

In [1]:
import pandas as pd
import numpy as np
pd.set_option("display.max_rows",None,"display.max_columns", None)

Load the training dataset and explore the data

In [2]:
df_train = pd.read_csv('C:/Users/Study/OneDrive/Desktop/DU/Business_Intelligence/Labs/Lab3/titanic_data/titanic_train.csv')
df_train.head(5)



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Exploring the dataset
Check the DataFrame's dimensions, structure, and summary statistics

In [3]:
df_train.shape  # not displayed: a notebook cell shows only its last expression
df_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,889.0,889.0,889.0,712.0,889.0,889.0,889.0
mean,446.0,0.382452,2.311586,29.642093,0.524184,0.382452,32.096681
std,256.998173,0.48626,0.8347,14.492933,1.103705,0.806761,49.697504
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,224.0,0.0,2.0,20.0,0.0,0.0,7.8958
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.0,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [4]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  889 non-null    int64  
 1   Survived     889 non-null    int64  
 2   Pclass       889 non-null    int64  
 3   Name         889 non-null    object 
 4   Sex          889 non-null    object 
 5   Age          712 non-null    float64
 6   SibSp        889 non-null    int64  
 7   Parch        889 non-null    int64  
 8   Ticket       889 non-null    object 
 9   Fare         889 non-null    float64
 10  Cabin        202 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.5+ KB


Remove columns not useful for prediction

PassengerId and Name have too many unique values to be helpful for predicting survival.
Ticket is treated as redundant, since passenger class is already given by Pclass, so it is not required.
Cabin is not useful either: it has too many null fields (only 202 of 889 values are non-null).
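The reasoning above (too many unique values, too many nulls) can be checked numerically before dropping anything. The following is a minimal sketch on a hypothetical toy frame, not the real df_train; the helper name column_profile is an assumption introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_train; values are illustrative only.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Pclass": [3, 1, 3, 1],
    "Cabin": [None, "C85", None, None],
})

def column_profile(df):
    """Return per-column unique-value counts and the fraction of missing values."""
    return pd.DataFrame({
        "n_unique": df.nunique(),          # distinct non-null values per column
        "missing_frac": df.isnull().mean() # share of null entries per column
    })

profile = column_profile(toy)
print(profile)
```

Columns whose n_unique is close to the row count (such as an identifier) or whose missing_frac is high are candidates for removal. On the real data, info() above shows Cabin with 202 non-null values out of 889, so roughly 77% of it is missing.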


In [5]:
d_train = df_train.drop(["PassengerId","Name","Ticket","Cabin"],axis=1)
d_train.head(5)
#d_train.describe()
#d_train.info()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


Groom the data before applying a model

Age has null fields, so replace them with a value close to the mean.
Column Sex can be converted to 0/1.
Column Embarked can be changed to numeric format: 1 = C, 2 = Q, 3 = S (the model gave an error when the values were strings).

In [6]:
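d_train['Age'].isnull().sum()  # count of missing ages (not displayed; only the last expression is shown)
d_train["Age"].mean()
# Fill only the Age column. Note: 28.2 is hard-coded; describe() above reports a mean of about 29.64.
d_train["Age"] = d_train["Age"].fillna(28.2)
d_train['Age'].isnull().sum()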

0

In [7]:
d_train.loc[d_train["Sex"] == "male", "Sex"] = 1
d_train.loc[d_train["Sex"] == "female", "Sex"] = 0
d_train["Sex"].unique()

array([1, 0], dtype=object)

In [8]:
d_train.loc[d_train["Embarked"] == "C", "Embarked"] = 1
d_train.loc[d_train["Embarked"] == "Q", "Embarked"] = 2
d_train.loc[d_train["Embarked"] == "S", "Embarked"] = 3
d_train["Embarked"].unique()


array([3, 1, 2], dtype=object)
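The three grooming steps above can also be written more compactly. This is a sketch on hypothetical toy rows rather than d_train. It assumes the same integer codes as above, computes the fill value from the column instead of hard-coding it, and casts the encoded columns to integers so they no longer have dtype object.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for d_train; values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Embarked": ["S", "C", "Q", "S"],
})

# Fill missing ages with the mean computed from the column itself,
# touching only the Age column.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Map categories to the same integer codes used above.
# astype raises if any value was not covered by the mapping (it would be NaN).
toy["Sex"] = toy["Sex"].map({"male": 1, "female": 0}).astype("int64")
toy["Embarked"] = toy["Embarked"].map({"C": 1, "Q": 2, "S": 3}).astype("int64")

print(toy.dtypes)
```

Using a dictionary with map keeps each encoding in one place, and the explicit cast makes an unexpected category fail loudly instead of passing silently into the model.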

In [9]:
d_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  889 non-null    int64  
 1   Pclass    889 non-null    int64  
 2   Sex       889 non-null    object 
 3   Age       889 non-null    float64
 4   SibSp     889 non-null    int64  
 5   Parch     889 non-null    int64  
 6   Fare      889 non-null    float64
 7   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.7+ KB


In [10]:
d_train

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,3
1,1,1,0,38.0,1,0,71.2833,1
2,1,3,0,26.0,0,0,7.925,3
3,1,1,0,35.0,1,0,53.1,3
4,0,3,1,35.0,0,0,8.05,3
5,0,3,1,28.2,0,0,8.4583,2
6,0,1,1,54.0,0,0,51.8625,3
7,0,3,1,2.0,3,1,21.075,3
8,1,3,0,27.0,0,2,11.1333,3
9,1,2,0,14.0,1,0,30.0708,1


We now fit the models. The code is based on the following references:
decision trees ref https://stackoverflow.com/questions/35097003/cross-validation-decision-trees-in-sklearn
svm ref https://stackoverflow.com/questions/47663694/how-to-run-svc-classifier-after-running-10-fold-cross-validation-in-sklearn


We import the necessary libraries for implementing decision trees and SVM. Column "Survived" is our response variable. We fit each model and check its accuracy.

In [11]:
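# Decision tree
from sklearn import tree
from sklearn.model_selection import GridSearchCV

# Predictors and response variable
x_train = d_train[["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]]
y_train = d_train[["Survived"]]

# Search max_depth from 3 to 19 with cross-validation, using 4 parallel jobs
parameters = {'max_depth':range(3,20)}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
clf.fit(X=x_train, y=y_train)
tree_model = clf.best_estimator_  # tree refit on all training data with the best max_depth
print(clf.best_score_, clf.best_params_)  # mean cross-validated score and best parameters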

0.8200723671681585 {'max_depth': 6}


So the decision tree reaches a good cross-validated accuracy of about 82%, with a best max_depth of 6. Let's see how SVM does.

In [12]:
# SVM
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Linear-kernel SVM scored with 10-fold cross-validation
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, x_train, np.ravel(y_train), cv=10)
scores.mean()  # mean accuracy over the 10 folds

0.7862487231869254

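SVM also performs well, with a mean accuracy of about 78.6% over 10 folds, but the decision tree (about 82%) performs somewhat better. Note that the two scores come from different cross-validation setups: the grid search used its default splitting, while the SVM used cv=10.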
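For a like-for-like comparison, both models can be scored on identical folds. The sketch below is an illustration under stated assumptions, not the analysis run above: it uses synthetic data from make_classification in place of x_train and y_train, fixes max_depth at 6 as found by the grid search, and the shared StratifiedKFold splitter and random_state values are choices made here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x_train / y_train; this is not the Titanic data.
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# One shared splitter so both models are scored on exactly the same folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "tree": DecisionTreeClassifier(max_depth=6, random_state=0),
    "svm": SVC(kernel="linear", C=1),
}

# Mean accuracy per model across the 10 shared folds.
mean_scores = {
    name: cross_val_score(model, X, y, cv=cv).mean()
    for name, model in models.items()
}
print(mean_scores)
```

Passing the same splitter object to both calls removes fold assignment as a source of difference between the two scores, so any remaining gap is more plausibly due to the models themselves.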