## Exercise notebook for the fourth session

This is the exercise notebook for the fourth session of the [Machine Learning workshop series at Harvey Mudd College](http://www.aashitak.com/ML-Workshops/). Please feel free to ask for help from the instructor and/or TAs.

First, we import the Python modules:

In [0]:
import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.simplefilter('ignore')

For your convenience, the data preprocessing and feature engineering from the previous sessions are summarized below.

In [46]:
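from google.colab.files import upload
upload()  # Colab only: opens a file picker to upload train.csv and test.csv
train = pd.read_csv('train.csv')
target = train.Survived.astype('category')  # categories are unordered by default
train = train.drop('Survived', axis=1)  # drop returns a new frame, so assign it back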

test = pd.read_csv('test.csv')
PassengerId = test.PassengerId

def get_Titles(df):
    # A title looks like " Mr. ": whitespace, non-space characters ending in a period, whitespace.
    df.Name = df.Name.apply(lambda name: re.findall(r"\s\S+[.]\s", name)[0].strip())
    df = df.rename(columns = {'Name': 'Title'})
    # Merge spelling variants and group uncommon titles as 'Rare'; 'Mme.' (Madame) maps to 'Mrs.'.
    df['Title'] = df['Title'].replace({'Ms.': 'Miss.', 'Mlle.': 'Miss.', 'Dr.': 'Rare', 'Mme.': 'Mrs.', 'Major.': 'Rare', 'Lady.': 'Rare', 'Sir.': 'Rare', 'Col.': 'Rare', 'Capt.': 'Rare',
                                       'Countess.': 'Rare', 'Jonkheer.': 'Rare', 'Dona.': 'Rare', 'Don.': 'Rare', 'Rev.': 'Rare'})
    return df

def fill_Age(df):
    df.Age = df.Age.fillna(df.groupby("Title").Age.transform("median"))
    return df

def get_Group_size(df):
    Ticket_counts = df.Ticket.value_counts()
    df['Ticket_counts'] = df.Ticket.apply(lambda x: Ticket_counts[x])
    df['Family_size'] = df['SibSp'] + df['Parch'] + 1
    df['Group_size'] = df[['Family_size', 'Ticket_counts']].max(axis=1)
    return df

def process_features(df):
    df.Sex = df.Sex.astype('category').cat.codes  # female -> 0, male -> 1
    features_to_keep = ['Age', 'Fare', 'Group_size', 'Pclass', 'Sex']
    df = df[features_to_keep].copy()  # copy so that later column assignments act on a real frame
    return df

def process_data(df):
    df = df.copy()
    df = get_Titles(df)
    df = fill_Age(df)
    df = get_Group_size(df)
    df = process_features(df)
    medianFare = df['Fare'].median()
    df['Fare'] = df['Fare'].fillna(medianFare)
    return df

X_train, X_test = process_data(train), process_data(test)

Saving test.csv to test (5).csv
Saving train.csv to train (5).csv
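
To make the title-extraction step in `get_Titles` easier to follow, here is a minimal sketch that applies the same regular expression to three passenger names written in the dataset's "Surname, Title. Given names" layout. The helper name `extract_title` and the `sample_names` list are illustrative and not part of the notebook.

```python
import re

# Same pattern as in get_Titles: whitespace, a run of non-space characters
# ending in a period, then whitespace.
TITLE_PATTERN = re.compile(r"\s\S+[.]\s")

def extract_title(name):
    """Return the first title-like token (for example 'Mr.') found in a name."""
    return TITLE_PATTERN.findall(name)[0].strip()

# Illustrative names in the "Surname, Title. Given names" layout.
sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
]

titles = [extract_title(name) for name in sample_names]
print(titles)  # ['Mr.', 'Miss.', 'Mrs.']
```

Note that `findall(...)[0]` raises an `IndexError` for a name with no such token, so this sketch assumes every name contains a title.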


First, split the data into training and validation sets using `train_test_split`, and name the variables `X_train, X_valid, y_train, y_valid`.

In [0]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, target, random_state=0)
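
As a rough illustration of what `train_test_split` does when no `test_size` is given, the sketch below splits a toy array (made up for this example, not the Titanic data) and prints the resulting shapes. The optional `stratify` argument, which the cell above does not use, is shown as one way to keep class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features, balanced binary labels (purely illustrative).
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# With no test_size given, scikit-learn holds out 25% of the rows.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
print(X_tr.shape, X_va.shape)  # (75, 2) (25, 2)

# stratify=y asks for (approximately) the same class proportions in both parts.
X_tr_s, X_va_s, y_tr_s, y_va_s = train_test_split(X, y, random_state=0, stratify=y)
print('Share of class 1 in the stratified validation part:', y_va_s.mean())
```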

In [44]:
X_train.head()
X_test.head()

    Age     Fare  Group_size  Pclass  Sex
0  22.0   7.2500           2       3    1
1  38.0  71.2833           2       1    0
2  26.0   7.9250           1       3    0
3  35.0  53.1000           2       1    0
4  35.0   8.0500           1       3    1


In [5]:
y_train.head()

105    0
68     1
253    0
320    0
706    1
Name: Survived, dtype: category
Categories (2, int64): [0, 1]

Train a logistic regression classifier on `X_train, y_train` and test its accuracy on both `X_train, y_train` and `X_valid, y_valid`.

In [6]:
LDG_clf = LogisticRegression().fit(X_train,y_train)

print('Accuracy of logistic regression classifier on training set: {:.3f}'.format(LDG_clf.score(X_train, y_train)))
print('Accuracy of logistic regression classifier on validation set: {:.3f}'.format(LDG_clf.score(X_valid, y_valid)))

Accuracy of logistic regression classifier on training set: 0.799
Accuracy of logistic regression classifier on validation set: 0.780


In [7]:
from sklearn.tree import DecisionTreeClassifier

DT_clf = DecisionTreeClassifier().fit(X_train,y_train)

print('Accuracy of decision tree classifier on training set: {:.5f}'.format(DT_clf.score(X_train, y_train)))
print('Accuracy of decision tree classifier on validation set: {:.5f}'.format(DT_clf.score(X_valid, y_valid)))

Accuracy of decision tree classifier on training set: 0.97754
Accuracy of decision tree classifier on validation set: 0.78924


In [8]:
from sklearn.neighbors import KNeighborsClassifier

KNN_clf = KNeighborsClassifier().fit(X_train,y_train)

print('Accuracy of k-nearest neighbors classifier on training set: {:.3f}'.format(KNN_clf.score(X_train, y_train)))
print('Accuracy of k-nearest neighbors classifier on validation set: {:.3f}'.format(KNN_clf.score(X_valid, y_valid)))

Accuracy of k-nearest neighbors classifier on training set: 0.813
Accuracy of k-nearest neighbors classifier on validation set: 0.726


In [9]:
from sklearn.svm import SVC

SVC_clf = SVC().fit(X_train,y_train)

print('Accuracy of support vector classifier on training set: {:.3f}'.format(SVC_clf.score(X_train, y_train)))
print('Accuracy of support vector classifier on validation set: {:.3f}'.format(SVC_clf.score(X_valid, y_valid)))

Accuracy of support vector classifier on training set: 0.906
Accuracy of support vector classifier on validation set: 0.717


In [48]:
from sklearn.ensemble import RandomForestClassifier

RF_clf = RandomForestClassifier().fit(X_train,y_train)

print('Accuracy of random forest classifier on training set: {:.3f}'.format(RF_clf.score(X_train, y_train)))
print('Accuracy of random forest classifier on validation set: {:.3f}'.format(RF_clf.score(X_valid, y_valid)))

Accuracy of random forest classifier on training set: 0.970
Accuracy of random forest classifier on validation set: 0.803
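
The five cells above repeat the same fit-and-score pattern. One possible way to tidy this up is to keep the models in a dictionary and loop over it, so that each score is printed next to the right model name. The sketch below uses synthetic data from `make_classification` as a stand-in for the processed Titanic features, so its numbers are not comparable to the results above; the names `classifiers` and `scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the processed features (an assumption, not the real data).
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# One entry per model; the key doubles as the label in the printed report.
classifiers = {
    'Logistic regression': LogisticRegression(max_iter=1000),
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'k-nearest neighbors': KNeighborsClassifier(),
    'Support vector': SVC(),
    'Random forest': RandomForestClassifier(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for each model.
    scores[name] = (clf.score(X_tr, y_tr), clf.score(X_va, y_va))
    print('{:<20} train: {:.3f}  validation: {:.3f}'.format(name, *scores[name]))
```

A large gap between the two numbers for a model, as seen for the decision tree above, is a common sign of overfitting.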


In [12]:
from sklearn.ensemble import VotingClassifier

eclf1 = VotingClassifier(estimators=[('lr', LDG_clf),('dt', DT_clf), ('knn', KNN_clf),('rf', RF_clf), ('svc', SVC_clf)],voting='hard')

eclf1 = eclf1.fit(X_train, y_train)

print('Accuracy of voting classifier on training set: {:.2f}'.format(eclf1.score(X_train, y_train)))
print('Accuracy of voting classifier on validation set: {:.2f}'.format(eclf1.score(X_valid, y_valid)))

Accuracy of voting classifier on training set: 0.94
Accuracy of voting classifier on validation set: 0.82
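
With `voting='hard'`, the ensemble predicts the class that most of its member models predict for each sample. The sketch below reproduces that idea in plain Python for three hypothetical models; it is an illustration of majority voting, not scikit-learn's actual implementation. Ties are given to the smallest label here, which mirrors the ascending-order tie rule described in the scikit-learn user guide.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority label per sample; ties go to the smallest label."""
    voted = []
    # zip(*...) regroups the predictions so each tuple holds one sample's votes.
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top_count = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top_count))
    return voted

# Predictions of three hypothetical models on four samples.
model_predictions = [
    [0, 1, 1, 0],  # model A
    [0, 1, 0, 0],  # model B
    [1, 1, 0, 0],  # model C
]

print(hard_vote(model_predictions))  # [0, 1, 0, 0]
```

With five models and two classes, as in the cell above, a tie cannot occur, so the tie rule only matters for an even number of voters or more than two classes.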


We select the random forest classifier (`RF_clf`) to make predictions on the test set.

In [50]:
y_test = RF_clf.predict(X_test)
y_test[:10]

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

We create a dataframe for submission using the predictions from `y_test` and save it to a CSV file. It is important that our submission file is in the correct format so that it can be graded without errors.

In [0]:
submission = pd.DataFrame({'PassengerId': PassengerId, 'Survived': y_test})
submission.to_csv('submission.csv', index=False)
from google.colab.files import download
download('submission.csv')
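
Before uploading, it can help to sanity-check the file format. The sketch below is one possible check, written with the standard `csv` module: it verifies the header, the row count, and that every prediction is 0 or 1. The function `check_submission` and the two sample strings are hypothetical; for the real file you would pass the contents of `submission.csv` and `expected_rows=418`, the number of passengers in the Titanic test set.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission; an empty list means none."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ['PassengerId', 'Survived']:
        problems.append('unexpected header: {}'.format(reader.fieldnames))
        return problems
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append('expected {} rows, found {}'.format(expected_rows, len(rows)))
    if any(row['Survived'] not in ('0', '1') for row in rows):
        problems.append('Survived must be 0 or 1')
    return problems

# Two tiny made-up submissions: one well-formed, one with an invalid label.
good = "PassengerId,Survived\n892,0\n893,1\n"
bad = "PassengerId,Survived\n892,0\n893,yes\n"

print(check_submission(good, expected_rows=2))  # []
print(check_submission(bad, expected_rows=2))   # ['Survived must be 0 or 1']
```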

Finally, I [submitted my predictions to the competition's leaderboard](https://www.kaggle.com/c/titanic/submit).