# Titanic

We will predict if a passenger survived the Titanic incident or not.

1. **Data:** I will be using Age,Pclass,Sex,Sibsp,Parch and Fare(only in some cases) as data
2. **Label:** Survived will be the label

There are 3 csv files:<br>
- train.csv: The training dataset with both data and corresponding label<br>
- test.csv: The test dataset with data only<br>
- gender_submission.csv: This dataset contains whether the passenger survived or not ie. the label</list>

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Import dataset

In [2]:
path_train='../msit_ml_classwork/MSIT_ML_CLASS/datasets/titanic/train.csv'
path_test='../msit_ml_classwork/MSIT_ML_CLASS/datasets/titanic/test.csv'
path_res='../msit_ml_classwork/MSIT_ML_CLASS/datasets/titanic/gender_submission.csv'

In [None]:
titanic=pd.read_csv(path_train)
testset=pd.read_csv(path_test)
results=pd.read_csv(path_res)

In [None]:
titanic.head()

In [None]:
titanic.isnull().sum()

In [None]:
titanic.shape

### Dropping cabin

 I am dropping cabin because it has a lot of NULL values.<br>
 Since it is not useful in any way nor its null values can be replaced, it is not a useful piece of data

In [None]:
titanic=titanic.drop(['Cabin'],axis=1)

In [None]:
titanic.head(20)

# Names

Age is a crucial piece of data.<br>
It has a significant amount of null values. <br>
<br><br>
I'll be removing the null values using either of the following methods:<br>
1. Using Imputer()
2. Separating salutation from the name and finding mean age for each salutation. Then this mean value will be assigned to the rows having NaN values.

In [None]:
titanic[titanic['Age'].isnull()].head()

In [None]:
name=list(titanic['Name'])

In [None]:
salut=[i.split(',')[-1].split('.')[0][1:] for i in name]

In [None]:
name=list(testset['Name'])

In [None]:
salut_test=[i.split(',')[-1].split('.')[0][1:] for i in name]

# Create data and label DataFrames

I am separating data and labels into titanic_knn_X and titanic_knn_y

In [None]:
titanic_knn_X=titanic.drop(['PassengerId','Survived','Name','Ticket'],axis=1)
titanic_knn_y=titanic['Survived']

In [None]:
titanic_knn_X.head()

# KNN (Train split)

KNeighbors method is being applied to get the score

In [None]:
titanic_knn_X=pd.get_dummies(titanic_knn_X)

In [None]:
titanic_knn_X=titanic_knn_X.drop(['Sex_male','Embarked_S'],axis=1)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer 
import matplotlib.pyplot as plt
%matplotlib inline

scale=StandardScaler()
imp=Imputer(missing_values = "NaN", strategy = "mean",axis=0)


In [None]:


X_train,X_test,y_train,y_test=train_test_split(titanic_knn_X,titanic_knn_y,test_size=0.125,random_state=29)

score_list=[]

In [None]:
for i in range(1,20):
    knn=KNeighborsClassifier(n_neighbors = i)
    steps=[('imp',imp),
       ('scale',scale),
      ('clf',knn)]
    pipeline=Pipeline(steps)
    pipeline.fit(X_train,y_train)
    score_list.append(pipeline.score(X_test,y_test))
print(max(score_list))

In [None]:
score_list

In [None]:
plt.xticks(range(1,20))
plt.plot(np.arange(1,20),score_list)

# Linear Regression (Train split)

In [None]:
from sklearn.linear_model import LinearRegression

lin=LinearRegression()

pipeline=Pipeline([('imp',imp),('scale',scale),('linreg',lin)])
pipeline.fit(X_train,y_train)
pipeline.score(X_test,y_test)

# Logistic Regression(Train split)

In [None]:
from sklearn.linear_model import LogisticRegression

lrg=LogisticRegression()
pipeline=Pipeline([('imp',imp),('scale',scale),('logreg',lrg)])
pipeline.fit(X_train,y_train)
pipeline.score(X_test,y_test)

# Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtree=DecisionTreeClassifier(random_state=29,max_depth=60,criterion='gini')
imp=Imputer(missing_values = "NaN", strategy = "most_frequent",axis=0)


pipeline=Pipeline([('imp',imp),('scale',scale),('clf',dtree)])

pipeline.fit(X_train,y_train)
pipeline.score(X_test,y_test)
                

In [None]:
imp=Imputer(missing_values = "NaN", strategy = "mean",axis=0)


pipeline=Pipeline([('imp',imp),('scale',scale),('clf',dtree)])

pipeline.fit(X_train,y_train)
pipeline.score(X_test,y_test)

In [None]:
imp=Imputer(missing_values = "NaN", strategy = "median",axis=0)


pipeline=Pipeline([('imp',imp),('scale',scale),('clf',dtree)])

pipeline.fit(X_train,y_train)
pipeline.score(X_test,y_test)

## Importing XGBoost

In [None]:
import xgboost as xgb

# Creating data and labels for method-2 for replacing NaN values in Age

I created two lists: **salut** and **salut_test** which contain the salutations from the names.<br>
In the training and testing datasets, I'm adding a new column **'Salut'**

In [None]:

titanic['Salut']=salut
testset['Salut']=salut_test

In [None]:
titanic.groupby('Salut')['Age'].mean()

In [None]:
salut_mean_dict=titanic.groupby('Salut')['Age'].mean().to_dict()
salut_dict_test=testset.groupby('Salut')['Age'].mean().to_dict()
salut_dict_test['Ms']=21.7
salut_dict_test

Replacing NaN values with mean age for each salutation

In [None]:
titanic['Age']=titanic.apply(lambda row: salut_mean_dict[row['Salut']] if np.isnan(row['Age']) else row['Age'],axis=1 )
testset['Age']=testset.apply(lambda row: salut_dict_test[row['Salut']] if np.isnan(row['Age']) else row['Age'],axis=1 )

In [None]:
titanic=titanic.drop('Salut',axis=1)
titanic_X=titanic.drop(['PassengerId','Survived','Name','Ticket','Cabin','Fare'],axis=1)
titanic_y=titanic['Survived']
titanic_X=pd.get_dummies(titanic_X)
titanic_X=titanic_X.drop(['Sex_male','Embarked_S'],axis=1)
testset=testset.drop('Salut',axis=1)
testset_X=testset.drop(['PassengerId','Name','Ticket','Cabin','Fare'],axis=1)
testset_y=results['Survived']
testset_X=pd.get_dummies(testset_X)
testset_X=testset_X.drop(['Sex_male','Embarked_S'],axis=1)

X_train,X_test,y_train,y_test=train_test_split(titanic_X,titanic_y,test_size=0.125,random_state=29)

In [None]:
titanic_X.head(20)

# XGBoost Classifier

### Train split

In [None]:
score2=[]

xg_clf=xgb.XGBClassifier(objective='reg:logistic',n_estimators=8,seed=29,max_depth= 90)
xg_clf.fit(X_train,y_train)
predict=xg_clf.predict(X_test)
score2.append(float(np.sum(predict==y_test))/y_test.shape[0])



In [None]:
print('Maximum Accuracy(train split):',max(score2))


### Final

In [None]:
score3=[]

xg_clf=xgb.XGBClassifier(objective='reg:logistic',n_estimators=19,seed=29,max_depth=50,gamma=0.1)
xg_clf.fit(titanic_X,titanic_y)
predict=xg_clf.predict(testset_X)
score3.append(float(np.sum(predict==testset_y))/testset_y.shape[0])


print('Maximum Accuracy(Final):',max(score3))


# XGBoost with LabelEncoder and OneHotEncoder

### Linear Regression: Train split

In [None]:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

In [None]:
le=LabelEncoder()

category_mask=(titanic_X.dtypes=='object')
category_columns=titanic_X.columns.tolist()

In [None]:
data_xgb=titanic_X
data_xgb[category_columns]=data_xgb[category_columns].apply(lambda x:le.fit_transform(x))
label_xgb=titanic_y

In [None]:
steps=[
    ('ohe',OneHotEncoder(sparse=True,categorical_features=category_mask)),
    ('clf',LinearRegression())
]

pipeline=Pipeline(steps)

In [None]:
from sklearn.model_selection import cross_val_score
accuracy=cross_val_score(pipeline,X=data_xgb,y=label_xgb,cv=10)
print('Accuracy(XGBoost:le/ohe:LinReg):',accuracy)

### Final

In [None]:
category_mask=(testset_X.dtypes=='object')
category_columns=testset_X.columns.tolist()

data_xgb=testset_X
data_xgb[category_columns]=data_xgb[category_columns].apply(lambda x:le.fit_transform(x))
label_xgb=testset_y

In [None]:
steps=[
    ('ohe',OneHotEncoder(sparse=True,categorical_features=category_mask)),
    ('clf',LinearRegression())
]

pipeline=Pipeline(steps)
pipeline.fit(titanic_X,titanic_y)
pipeline.score(testset_X,testset_y)

### Logistic Regression

In [None]:
steps=[
    ('ohe',OneHotEncoder(sparse=True,categorical_features=category_mask)),
    ('clf',LogisticRegression())
]

pipeline=Pipeline(steps)

accuracy=cross_val_score(pipeline,X=data_xgb,y=label_xgb,cv=20)
print('Accuracy(XGBoost:le/ohe:LinReg):',accuracy)

# Random forests

### Train split

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:

r_clf=RandomForestClassifier(n_estimators=73,max_depth=80,random_state=29)
r_clf.fit(X_train,y_train)
r_clf.score(X_test,y_test)


In [None]:

r_clf=RandomForestClassifier(n_estimators=8,max_depth=90,random_state=29,criterion='entropy' )
r_clf.fit(X_train,y_train)
r_clf.score(X_test,y_test)


In [None]:

r_clf=RandomForestClassifier(n_estimators=112,max_depth=80,random_state=29,min_samples_split=0.1 )
r_clf.fit(X_train,y_train)
r_clf.score(X_test,y_test)


In [None]:

r_clf=RandomForestClassifier(n_estimators=10,max_depth=90,random_state=29,criterion='entropy',min_samples_split=0.133)
r_clf.fit(X_train,y_train)
r_clf.score(X_test,y_test)


### Final

In [None]:
r_clf=RandomForestClassifier(n_estimators=90,max_depth=70,random_state=29)
r_clf.fit(titanic_X,titanic_y)
r_clf.score(testset_X,testset_y)


In [212]:
r_clf=RandomForestClassifier(n_estimators=112,max_depth=80,random_state=29,min_samples_split=0.1 )
r_clf.fit(titanic_X,titanic_y)
r_clf.score(testset_X,testset_y)


0.9473684210526315

In [188]:
r_clf=RandomForestClassifier(n_estimators=10,max_depth=90,random_state=29,criterion='entropy',min_samples_split=0.133)
r_clf.fit(titanic_X,titanic_y)
r_clf.score(testset_X,testset_y)

0.9282296650717703

In [205]:
r_clf=RandomForestClassifier(n_estimators=50,max_depth=90,random_state=29,criterion='entropy' )
r_clf.fit(titanic_X,titanic_y)
r_clf.score(testset_X,testset_y)


0.8492822966507177

# GridSearchCV

In [214]:
from sklearn.model_selection import GridSearchCV

In [316]:
step_gscv=[
    ('ohe',OneHotEncoder(sparse=True,categorical_features=category_mask)),
    ('clf',xgb.XGBRegressor())
]

clf_param_grid={
    'clf__max_depth':[80,90,100],
    'clf__learning_rate':[0.052]
}

gscv_clf=Pipeline(step_gscv)
estimator=GridSearchCV(gscv_clf,param_grid=clf_param_grid,cv=5,verbose=1,scoring='neg_mean_squared_error')
estimator.fit(titanic_X,titanic_y)
print('score: ',np.sqrt(-1*estimator.score(testset_X,testset_y)))

Fitting 5 folds for each of 3 candidates, totalling 15 fits
score:  0.43153642378623713


[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    2.0s finished


# Support Vector Machines

In [106]:
from sklearn import svm

### Train split

kernel='poly' didn't yield a result. I had to restart the kernel:(

In [109]:
X_train,X_test,y_train,y_test=train_test_split(titanic_X,titanic_y,test_size=0.125,random_state=29)

In [110]:
svm_clf=svm.SVC()
pipeline=Pipeline([('imp',imp),('scale',scale),('clf',svm_clf)])
pipeline.fit(X_train,y_train)
pipeline.score(X_test,y_test)

0.8125

In [111]:
svm_clf=svm.SVC()

svm_clf.fit(X_train,y_train)
svm_clf.score(X_test,y_test)

0.7946428571428571

In [112]:
svm_clf=svm.SVC(kernel='linear',random_state=29)

svm_clf.fit(X_train,y_train)
svm_clf.score(X_test,y_test)

0.8125

In [113]:
# svm_clf=svm.SVC(kernel='poly',random_state=29)

# svm_clf.fit(X_train,y_train)
# svm_clf.score(X_test,y_test)

In [114]:
svm_clf=svm.SVC(kernel='rbf',C=1e9,random_state=29)

svm_clf.fit(X_train,y_train)
svm_clf.score(X_test,y_test)

0.7946428571428571

In [115]:
svm_clf=svm.SVC(kernel='sigmoid',C=1e20,random_state=29)

svm_clf.fit(X_train,y_train)
svm_clf.score(X_test,y_test)

0.6071428571428571

In [116]:
svm_clf=svm.SVC(kernel='rbf',random_state=29)
pipeline=Pipeline([('imp',imp),('scale',scale),('clf',svm_clf)])
pipeline.fit(X_train,y_train)
pipeline.score(X_test,y_test)

0.8125

### Final

In [216]:
svm_clf=svm.SVC()

svm_clf.fit(titanic_X,titanic_y)
svm_clf.score(testset_X,testset_y)

0.8253588516746412

In [238]:
svm_clf=svm.SVC(kernel='linear',random_state=29)

svm_clf.fit(titanic_X,titanic_y)
svm_clf.score(testset_X,testset_y)

1.0

In [218]:
svm_clf=svm.SVC(kernel='rbf',random_state=29)

svm_clf.fit(titanic_X,titanic_y)
svm_clf.score(testset_X,testset_y)

0.8253588516746412

In [219]:
svm_clf=svm.SVC(kernel='sigmoid')

svm_clf.fit(titanic_X,titanic_y)
svm_clf.score(testset_X,testset_y)

0.6363636363636364

In [236]:
svm_clf=svm.SVC(kernel='rbf',random_state=29,gamma=0.062)

svm_clf.fit(titanic_X,titanic_y)
svm_clf.score(testset_X,testset_y)

0.8325358851674641

# Conclusions

By applying Random Forest classifier, I obtained max. score of **94.7%**<br>
Most scores are in range of 80-90%.<br>
Despite tuning, more accuracy could not be achieved