# The Titanic Dataset
## Predicting survival on the Titanic

The Titanic dataset consists of a training set and a test set. The training set contains both the features and the outcome (`Survived`), whereas the test set contains only the features. There is also a `gender_submission` file, which shows what a submission to Kaggle should look like.

The aim of this notebook is to create a model that predicts which passengers survived the Titanic shipwreck. The dataset can be found on [Kaggle](https://www.kaggle.com/c/titanic/data?select=gender_submission.csv).

In [1]:
# Import packages
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [2]:
# Load train and test data
train = pd.read_csv('./data/train.csv') 
test = pd.read_csv('./data/test.csv')

In [3]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The output above shows missing values in the Age, Cabin and Embarked columns.

We will drop the entire Cabin column, since 687 of its values are missing. Missing Age values will be filled with the median age. Rows with a missing Embarked value will be removed, as there are only two of them.
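The three strategies can be sketched on a small made-up frame. This is only an illustration: the values in `toy` are invented, and `fillna` stands in for the median imputation step, which this notebook performs later inside a scikit-learn pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the Titanic training set
# (all values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Remove the rows where Embarked is missing.
toy = toy.dropna(subset=["Embarked"])

# Drop the mostly empty Cabin column.
toy = toy.drop(columns=["Cabin"])

# Fill the remaining Age gaps with the median of the observed ages.
median_age = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(median_age)

print(toy)
print(toy.isnull().sum())
```

Note that the median is computed after the rows have been dropped, so it reflects only the rows that remain in the data.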

In [4]:
train = train.dropna(subset=["Embarked"])

In [5]:
train = train.drop(columns=['Cabin'])

In [6]:
train.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


In [7]:
# Create a pipeline to handle the numerical and categorical data

# Numerical columns: fill missing values with the median, then standardise
numerical_data_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

numerical_attributes = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_attributes = ['Pclass', 'Sex', 'Embarked']

# Categorical columns are one-hot encoded
column_transform_pipeline = ColumnTransformer([
    ("numerical", numerical_data_pipeline, numerical_attributes),
    ("categorical", OneHotEncoder(), categorical_attributes),
])

In [8]:
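# Fit the preprocessing on the training features, then reuse it unchanged on the test set
feature_columns = numerical_attributes + categorical_attributes
X_train = column_transform_pipeline.fit_transform(train.loc[:, feature_columns])
y_train = train.loc[:, 'Survived']
# transform (not fit_transform) keeps the medians, scaling and categories learned from train
X_test = column_transform_pipeline.transform(test.loc[:, feature_columns])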
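To make the behaviour of the column transformer more concrete, here is a minimal sketch on two tiny made-up frames. The names `toy_train`, `toy_test` and `preprocess` are hypothetical and only mirror the structure used in this notebook. The sketch fits the transformer on the training frame and then applies it to the test frame, so the median and scaling learned from the training rows are the ones used for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numerical and one categorical column.
toy_train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 60.0],
    "Sex": ["male", "female", "female", "male"],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 30.0],
    "Sex": ["female", "male"],
})

preprocess = ColumnTransformer([
    ("numerical", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ]), ["Age"]),
    ("categorical", OneHotEncoder(), ["Sex"]),
])

# Learn the median, mean, standard deviation and categories from the training frame.
X_toy_train = preprocess.fit_transform(toy_train)

# Apply the already fitted steps to the test frame without refitting.
X_toy_test = preprocess.transform(toy_test)

# The median stored by the imputer comes from the training ages only.
learned_median = (
    preprocess.named_transformers_["numerical"]
    .named_steps["imputer"]
    .statistics_[0]
)
print(learned_median)
print(X_toy_test.shape)  # one scaled Age column plus two one-hot Sex columns
```

The missing test age is replaced by the training median and then scaled with the training mean, which is why it ends up at zero in the first column of `X_toy_test`.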

### Model Selection

#### Logistic Regression

In [9]:
log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train, y_train)

log_reg_scores = cross_val_score(log_reg, X_train, y_train, cv=10)
print('Cross-Validation Accuracy Scores', log_reg_scores)
print('Mean of Cross-Validation Accuracy Scores', log_reg_scores.mean())

Cross-Validation Accuracy Scores [0.78651685 0.79775281 0.76404494 0.82022472 0.79775281 0.7752809
 0.78651685 0.78651685 0.83146067 0.85227273]
Mean of Cross-Validation Accuracy Scores 0.7998340143003064
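For readers unfamiliar with `cross_val_score`, the sketch below spells out roughly what `cv=10` does for a classifier: the data is split into ten stratified folds, a fresh copy of the model is trained on nine of them, and accuracy is measured on the held-out fold. It uses synthetic data from `make_classification` rather than the Titanic features, and the names `X_demo` and `y_demo` are placeholders.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(solver="lbfgs")

# Manual 10-fold loop: train on nine folds, score on the remaining one.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    fold_model = clone(model)  # unfitted copy, so folds do not share state
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library call that the notebook uses.
library_scores = cross_val_score(model, X_demo, y_demo, cv=10)

print(manual_scores)
print(library_scores)
```

Because `cross_val_score` clones the estimator for every fold, the scores do not depend on whether the model was fitted beforehand.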


#### k-Nearest Neighbors

In [10]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

knn_scores = cross_val_score(knn, X_train, y_train, cv=10)
print('Cross-Validation Accuracy Scores', knn_scores)
print('Mean of Cross-Validation Accuracy Scores', knn_scores.mean())

Cross-Validation Accuracy Scores [0.7752809  0.82022472 0.71910112 0.79775281 0.85393258 0.82022472
 0.85393258 0.80898876 0.83146067 0.79545455]
Mean of Cross-Validation Accuracy Scores 0.8076353421859039


#### Decision Trees

In [11]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

dtc_scores = cross_val_score(dtc, X_train, y_train, cv=10)
print('Cross-Validation Accuracy Scores', dtc_scores)
print('Mean of Cross-Validation Accuracy Scores', dtc_scores.mean())

Cross-Validation Accuracy Scores [0.7752809  0.83146067 0.70786517 0.76404494 0.83146067 0.7752809
 0.82022472 0.73033708 0.79775281 0.81818182]
Mean of Cross-Validation Accuracy Scores 0.7851889683350357


#### Support Vector Machine

In [12]:
svc = SVC(gamma='scale')
svc.fit(X_train, y_train)

svc_scores = cross_val_score(svc, X_train, y_train, cv=10)
print('Cross-Validation Accuracy Scores', svc_scores)
print('Mean of Cross-Validation Accuracy Scores', svc_scores.mean())

Cross-Validation Accuracy Scores [0.79775281 0.84269663 0.76404494 0.86516854 0.83146067 0.79775281
 0.83146067 0.79775281 0.86516854 0.85227273]
Mean of Cross-Validation Accuracy Scores 0.824553115423902
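In the runs above, the support vector machine has the highest mean cross-validation accuracy (about 0.825). One caveat is that the preprocessing was fitted on the full training set before cross-validation, so each validation fold also contributed to the imputation and scaling statistics. A possible refinement, sketched below on synthetic data, is to put the preprocessing and the classifier in one `Pipeline` so that `cross_val_score` refits both on every fold. The data, the `candidates` dictionary and the reduced preprocessing (scaling only) are assumptions made for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the raw (unscaled) features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# The four model families compared in this notebook.
candidates = {
    "Logistic Regression": LogisticRegression(solver="lbfgs"),
    "k-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(gamma="scale"),
}

mean_scores = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline, so it is refitted on each training fold.
    model = Pipeline([
        ("std_scaler", StandardScaler()),
        ("classifier", estimator),
    ])
    scores = cross_val_score(model, X_demo, y_demo, cv=10)
    mean_scores[name] = scores.mean()

# Report the models from highest to lowest mean accuracy.
for name, score in sorted(mean_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

On the Titanic data the same pattern would use the column transformer defined earlier as the first pipeline step and the unprocessed feature columns as input.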
