This is my solution for predicting survival on the Titanic dataset. 

It is also my very first time programming in Python and doing a data science project. 

Any thoughts, suggestions and comments are greatly appreciated. 

The first submission yielded an accuracy of 0.78468.

**Question of interest**

*Use a Machine Learning model to predict which passengers survived the shipwreck given various features of the passengers.*

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score

In [None]:
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')

In [None]:
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')

In [None]:
trdata = train_data.copy()
tesdata = test_data.copy()
dsets = [trdata,tesdata]
print(trdata.head())
print(trdata.info())
print(tesdata.info())

In [None]:
trdata.describe(include='all')

Let's assess the relationship between the features and our question of interest.
- A few things to note: 
    - it may make more sense to analyze age and fare in groups 
    - I expect the passenger id and ticket columns to not have an association with survival 

In [None]:
trdata.groupby('Sex').Survived.mean()

- A much higher proportion of females survived the incident

In [None]:
trdata.groupby('Pclass')['Survived'].mean()

- Passengers of higher socioeconomic status (lower Pclass number) were more likely to survive 

In [None]:
trdata.groupby('Embarked')['Survived'].mean()

In [None]:
sns.countplot(x='Survived', data=trdata, hue='Embarked')

- A higher proportion of passengers who embarked at Cherbourg survived compared with the other ports of embarkation. 
- Most passengers embarked at Southampton, so Southampton also accounts for the most deaths in absolute terms.
- Embarked will be included as a feature.

In [None]:
print(train_data.groupby('SibSp')['Survived'].mean())
print(trdata.groupby('Parch').Survived.mean())

- The features SibSp and Parch do not seem to be that useful in predicting survival.
- Perhaps creating a 'Family' feature indicating the size of the family on board could be useful 
- Based on the family feature, an 'Alone' feature can be created indicating whether the passenger was travelling alone 

In [None]:
sns.catplot(x='Survived', hue='Sex', data=trdata, kind='count', col='Pclass')

* In all Pclasses, females were more likely than males to survive, which justifies including Sex and Pclass as features in our model. 

In [None]:
sns.catplot(x='Survived', y='Fare', data=train_data)

In [None]:
g=sns.FacetGrid(trdata, col='Survived')
g.map(plt.hist, 'Age')

* The conclusion from the graphs above is that bins should be created for age and fare.
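
To make the binning idea concrete, here is a minimal sketch on a handful of hypothetical ages (not rows from the dataset). It assumes the same age edges that the next cell uses and relies on `pd.cut` treating intervals as right-closed by default, so an age of exactly 20 lands in the first bin.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([5, 20, 21, 40, 61, 80])

# Age edges as used in this notebook; intervals are right-closed by default,
# i.e. (-inf, 20], (20, 40], (40, 60], (60, inf).
age_bins = [-np.inf, 20, 40, 60, np.inf]
labs = [0, 1, 2, 3]

# pd.cut returns a categorical; cast to int to get plain integer codes.
age_codes = pd.cut(ages, bins=age_bins, labels=labs).astype(int)
print(age_codes.tolist())  # [0, 0, 1, 1, 3, 3]
```

Equal-width edges like these are only one option; quantile-based bins (`pd.qcut`) would be an alternative if most values pile up in one bin.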

In [None]:
# fill in the missing values in age and fare with the median value. 
# the median is used as there are a few outliers for these features.
# there is an 80 year old on board, along with a passenger who paid $512.

age_med=trdata['Age'].median()
fare_med=trdata['Fare'].median()
# assign the result back instead of calling fillna(inplace=True) on a column
tesdata['Fare'] = tesdata['Fare'].fillna(fare_med)

# impute the training median into both datasets
for dset in dsets:
    dset['Age'] = dset['Age'].fillna(age_med)
    
# create bins for age and fare
age_bins = [-np.inf, 20, 40, 60, np.inf]
fare_bins = [-np.inf, 128.082, 256.165, 384.247, np.inf]
labs = [0, 1, 2, 3]

for dset in dsets:
    # pd.cut returns a categorical; the cast must be assigned back,
    # otherwise astype() has no effect on the stored column
    dset['Agebin'] = pd.cut(dset['Age'], bins=age_bins, labels=labs).astype('int64')
    dset['Farebin'] = pd.cut(dset['Fare'], bins=fare_bins, labels=labs).astype('int64')
    
    
print(trdata['Agebin'].unique())
print(trdata['Farebin'].unique())

# sns.countplot(x='Survived', data=trdata, hue='Agebin')
# plt.figure()
# sns.countplot(x='Survived', data=trdata, hue='Farebin')

In [None]:
trdata.groupby('Agebin')['Survived'].mean()

In [None]:
trdata.groupby('Farebin')['Survived'].sum()

In [None]:
trdata.groupby('Farebin')['Survived'].mean()

* Young passengers were more likely to survive. 
* The 3 highest-paying passengers (the top fare bin) all survived.

In [None]:
# create a family feature

for dset in dsets:
    dset['Family'] = dset['SibSp'] + dset['Parch'] + 1

trdata.groupby('Family')['Survived'].mean()

* In general, large families were less likely to survive, perhaps because family members were unwilling to leave one another behind.

In [None]:
# create an 'Alone' feature 
# Alone = 1 if the passenger is onboard by him/herself
for dset in dsets:
    dset['Alone']=1
    dset.loc[dset['Family'] > 1, 'Alone'] = 0
    
trdata.groupby('Alone')['Survived'].mean()

* On average, passengers who were not alone were more likely to survive.

**Drop the Family feature to prevent collinearity with Alone**.  
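
As a rough illustration of why the two columns overlap, the sketch below builds a small hypothetical table of family sizes (made-up values, not the Titanic data), derives Alone from it the same way as above, and checks their correlation. The name `toy` is just for this example.

```python
import pandas as pd

# Hypothetical family sizes (SibSp + Parch + 1), not taken from the dataset.
toy = pd.DataFrame({'Family': [1, 1, 2, 3, 1, 5, 2, 1]})

# Alone is fully determined by Family: 1 when the passenger travels solo.
toy['Alone'] = (toy['Family'] == 1).astype(int)

# Pearson correlation between the two columns; a strongly negative value
# suggests the features carry overlapping information.
corr = toy['Family'].corr(toy['Alone'])
print(round(corr, 3))
```

Running the same check on `trdata[['Family', 'Alone']]` would show how strong the overlap is on the real data.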

Fill in the missing values for Embarked with the most frequent value.

In [None]:
mode_emb = trdata['Embarked'].mode()[0]
# assign the result back instead of calling fillna(inplace=True) on a column
trdata['Embarked'] = trdata['Embarked'].fillna(mode_emb)

* Drop PassengerId, Name, Age, SibSp, Parch, Ticket, Fare, Cabin, Family
* Dummy code Sex, Embarked

In [None]:
emb_dummy_tr = pd.get_dummies(trdata['Embarked'], drop_first=True)
sex_dummy_tr = pd.get_dummies(trdata['Sex'], drop_first=True)
emb_dummy_te = pd.get_dummies(tesdata['Embarked'], drop_first=True)
sex_dummy_te = pd.get_dummies(tesdata['Sex'], drop_first=True)

trdata_enc = pd.concat([trdata, emb_dummy_tr, sex_dummy_tr], axis=1)
tesdata_enc = pd.concat([tesdata, emb_dummy_te, sex_dummy_te], axis=1)

trdata_enc.head()
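
One thing to watch when dummy coding train and test separately: if a category is missing from one of them, the two frames end up with different columns. A minimal sketch of one way to guard against that, using hypothetical Embarked values where the test set deliberately lacks port 'Q' (the variable names here are only for the example):

```python
import pandas as pd

# Hypothetical train/test columns; the test set is missing the 'Q' port on purpose.
train_emb = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')
test_emb = pd.Series(['S', 'C', 'S'], name='Embarked')

train_dummies = pd.get_dummies(train_emb, drop_first=True)
test_dummies = pd.get_dummies(test_emb, drop_first=True)

# Reindex the test dummies to the training columns,
# filling categories absent from the test set with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(train_dummies.columns), list(test_aligned.columns))
```

After the reindex both frames have the same columns in the same order, which is what the classifiers expect at prediction time.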

In [None]:
drop_col = ['PassengerId', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Family']
dset_enc = [trdata_enc, tesdata_enc]
for dset in dset_enc:
    dset.drop(drop_col, axis=1, inplace=True)
    
trdata_enc.head()

In [None]:
tesdata_enc.head()

In [None]:
y_train = trdata_enc.Survived
X_train = trdata_enc.drop('Survived', axis=1)
X_test = tesdata_enc

Now we will fit and make predictions with our models. 
* K Neighbours
* Logistic Regression
* Decision Trees
* Random Forest
* Voting classifier
* Gradient boosting 

In [None]:
rs = 99
lr = LogisticRegression(random_state=rs)
knn = KNN()
dt=DecisionTreeClassifier(random_state=rs)
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbors', knn), ('Decision Tree', dt)]

for clfname, clf in classifiers:
    # fit the model
    clf.fit(X_train, y_train)
    
    # predict
    pred = clf.predict(X_test)
    
    # accuracy on the training data (not a held-out estimate)
    print('The training accuracy of {} is {:.4f}'.format(clfname, clf.score(X_train, y_train)))

In [None]:
vc=VotingClassifier(estimators=classifiers)
vc.fit(X_train, y_train)
vc_pred = vc.predict(X_test)
print('The training accuracy of the Voting classifier is: {:.4f}'.format(vc.score(X_train, y_train)))


In [None]:
# fix the seed so the forest (and the submission) is reproducible
rf=RandomForestClassifier(random_state=rs)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print('The training accuracy of the Random Forest classifier is: {:.4f}'.format(rf.score(X_train, y_train)))


In [None]:
# fix the seed so the boosted model is reproducible
gb=GradientBoostingClassifier(random_state=rs)
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)
print('The training accuracy of the Gradient Boosting classifier is: {:.4f}'.format(gb.score(X_train, y_train)))

**Choose the Random Forest Classifier**
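
The scores printed above are computed on the same rows the models were fitted on, so they tend to favour flexible models such as trees and forests. A possible sanity check is cross-validation. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and the names `X_demo`, `y_demo` and `rf_demo` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=99)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=99)

# Accuracy on the data the model was fitted on
# (the same kind of number clf.score(X_train, y_train) reports).
train_acc = rf_demo.fit(X_demo, y_demo).score(X_demo, y_demo)

# 5-fold cross-validated accuracy: each fold is scored on rows
# that were held out while fitting.
cv_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring='accuracy')

print('Training accuracy: {:.4f}'.format(train_acc))
print('Mean CV accuracy:  {:.4f}'.format(cv_scores.mean()))
```

Swapping in `X_train`, `y_train` and each classifier from the cells above would give a fairer basis for picking the final model than training accuracy alone.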

In [None]:
output = pd.DataFrame({'PassengerId': tesdata.PassengerId, 'Survived': rf_pred})
output.to_csv('my_submission.csv', index=False)
print("Submission was successfully saved!")