The purpose of this kernel is to reinforce my understanding of using logistic regression, k-nearest neighbors, and support vector machines to make predictions on a binary response. I'd also like to practice completing an end-to-end machine learning project. This kernel contains the following sections:
1. Data Cleaning
2. Exploratory Data Analysis
3. Feature Engineering
4. Model training and selection using a training and validation data set.
5. Submitting predictions.

In [1]:
#Imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # import seaborn
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

# 1. Data Cleaning

In [2]:
titanic_train = pd.read_csv('train.csv')
titanic_test = pd.read_csv('test.csv')

In [3]:
titanic_train.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Some observations about the head of the data frame
* Looks like there are 11 features and one response variable, "Survived".
* PassengerId could represent the index of the data frame
* Pclass looks like it could be a categorical variable
* Cabin appears to have some missing data.

In [4]:
titanic_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


Looks like the following columns have some missing values:

* Age
* Cabin - which aligns with the previous observation
* Embarked

In [5]:
titanic_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


The test data set had missing values for the following columns
* Age
* Cabin - which aligns with the previous observation
* Fare
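To put numbers on how much of each column is missing, the share of null values per column can be computed directly. This is a minimal sketch on a made-up miniature frame (`demo` and its values are assumptions for illustration); in the notebook the same expression would be applied to `titanic_train` and `titanic_test`.

```python
import numpy as np
import pandas as pd

# Made-up miniature frame; the notebook would use titanic_train / titanic_test instead.
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10, 8.05],
})

# isnull() marks missing cells; the column mean of that boolean mask
# is the fraction of missing values in each column.
missing_fraction = demo.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

Sorting by the fraction makes it easy to see which columns are candidates for dropping and which for imputing.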

In [6]:
titanic_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [7]:
titanic_train.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Isham, Miss. Ann Elizabeth",male,1601,B96 B98,S
freq,1,577,7,4,644


A couple things here:
* No repeated names
* 3 different values for embarked

# Data Cleaning
In order to clean this dataset, I'd like to make sure that each column is free from NaN values and is of the correct type. As noted previously, the Age, Embarked, and Cabin columns are missing values in the training set, and the Age, Cabin, and Fare columns are missing values in the test set.

Let's take a look at the age column

In [8]:
print("Age broken down by P-class")
# Select Age before aggregating so non-numeric columns (Name, Sex, ...) are never averaged
titanic_train.groupby('Pclass')[['Age']].mean()

Age broken down by P-class


Pclass,Age
1,38.233441
2,29.87763
3,25.14062


I'm going to impute missing ages with the mean age of the passenger's Pclass, in both the training and testing data sets. Both sets have missing ages, but in each case less than 25% of the column is missing, so imputing seems preferable to dropping the column.

In [9]:
titanic_train.loc[titanic_train.Age.isnull(), 'Age'] = titanic_train.groupby('Pclass')['Age'].transform('mean')
titanic_test.loc[titanic_test.Age.isnull(), 'Age'] = titanic_test.groupby('Pclass')['Age'].transform('mean')

Check rows 5 and 17 to ensure that an age of ~25 was imputed in row 5 (Pclass 3) and ~29 was imputed in row 17 (Pclass 2). Looks good, and the .info() method confirms there are no missing values left in the Age column.

In [10]:
titanic_train.iloc[[5, 17]]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,25.14062,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,29.87763,0,0,244373,13.0,,S
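As a variation on the cell above (not what this kernel does), the per-class means can be computed on the training data only and then reused for the test data, so the test set is filled with statistics learned from training rows. This is a sketch on made-up frames; `train`, `test`, and `class_means` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up miniature frames standing in for titanic_train / titanic_test.
train = pd.DataFrame({'Pclass': [1, 1, 2, 2, 3, 3],
                      'Age': [40.0, np.nan, 30.0, 28.0, np.nan, 20.0]})
test = pd.DataFrame({'Pclass': [1, 3, 2],
                     'Age': [np.nan, np.nan, 35.0]})

# Mean age per class, learned from the training rows only.
class_means = train.groupby('Pclass')['Age'].mean()

# Map each row's Pclass to its class mean and use it only where Age is missing.
train_filled = train.assign(Age=train['Age'].fillna(train['Pclass'].map(class_means)))
test_filled = test.assign(Age=test['Age'].fillna(test['Pclass'].map(class_means)))

print(train_filled)
print(test_filled)
```

Whether this matters in practice depends on how different the two sets are; for a dataset of this size the imputed values are likely to be close either way.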


Due to the large number of missing entries for the Cabin column in both the training and testing datasets, I'm going to drop it from both.

In [11]:
titanic_train = titanic_train.drop('Cabin', axis=1)
titanic_test = titanic_test.drop('Cabin', axis=1)

Because Embarked is only missing two entries in the training dataset and Fare is only missing one entry in the test dataset, I'm just going to impute these values with the mode of Embarked and the median of Fare, respectively.

In [12]:
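# Assign the result back instead of using inplace=True on a column selection,
# which is not guaranteed to modify the original frame in recent pandas versions
titanic_train['Embarked'] = titanic_train['Embarked'].fillna(titanic_train['Embarked'].mode()[0])
titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())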

Ensure all columns have no null values

In [13]:
print('Training Data Null Values')
print(titanic_train.isnull().sum())
print("-" * 30)
print('Test Data Null Values')
print(titanic_test.isnull().sum())

Training Data Null Values
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
------------------------------
Test Data Null Values
PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64


Looks like no column has missing values anymore.

## Exploratory Data Analysis

In [14]:
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


Because the goal is to predict the Survived column I want to take a look at the class balance in that column

In [15]:
sns.countplot(x='Survived', data=titanic_train)

<matplotlib.axes._subplots.AxesSubplot at 0x223d06cf860>

There is a class imbalance: in our training dataset, more passengers did not survive the Titanic than did (the mean of Survived in the summary above is about 0.38).
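The imbalance also gives a useful baseline for the models later on: a classifier that always predicts the majority class already reaches a certain accuracy. The sketch below rebuilds the Survived column from the summary statistics shown earlier (891 rows with a mean of 0.383838, which corresponds to 342 survivors and 549 non-survivors); the variable names are illustrative.

```python
import pandas as pd

# Counts reconstructed from describe(): 891 rows, Survived mean 0.383838,
# i.e. 342 survivors (1) and 549 non-survivors (0).
survived = pd.Series([0] * 549 + [1] * 342, name='Survived')

# Share of each class in the column.
class_share = survived.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_share.max()

print(class_share)
print(round(baseline_accuracy, 4))
```

Any model worth keeping should beat this baseline on the validation data.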

Next, I want to look at how ticket fares differ between passengers who survived and those who did not.

In [16]:
sns.boxplot(x = 'Survived', y = 'Fare', data = titanic_train)

<matplotlib.axes._subplots.AxesSubplot at 0x223d06cf860>

In [17]:
# Select Fare before aggregating so non-numeric columns are never averaged
titanic_train.groupby('Survived')[['Fare']].mean()

Survived,Fare
0,22.117887
1,48.395408


Looks like the median fare is larger for those who survived. The average fare is also much higher for survivors, but that may be partly due to the outliers above 500, so I want to investigate them. Looking below, three individuals purchased tickets at a fare of about 512. Money must not have been a problem for these folks!

In [18]:
titanic_train.loc[titanic_train['Fare'] > 500, :]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,C
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,C
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,C


In [19]:
titanic_no_500s = titanic_train.loc[titanic_train['Fare'] < 500, :]
sns.boxplot(x = 'Survived', y = 'Fare', data = titanic_no_500s, palette = 'RdBu_r')
titanic_no_500s.groupby('Survived')[['Fare']].mean()

Survived,Fare
0,22.117887
1,44.289799


With the fares of 500+ removed, the boxplots are more readable. The mean and median fares are still clearly higher for those who survived, so I will include Fare as a feature for model training.
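To see why a few very large fares pull the mean much more than the median, both statistics can be computed side by side per group. This is a small sketch on made-up fares (the `demo` frame is an assumption, not the Titanic data).

```python
import pandas as pd

# Made-up fares: the survivor group contains one extreme value.
demo = pd.DataFrame({
    'Survived': [0, 0, 0, 1, 1, 1],
    'Fare': [7.0, 8.0, 10.0, 10.0, 30.0, 512.0],
})

# Mean and median fare per group in one table.
summary = demo.groupby('Survived')['Fare'].agg(['mean', 'median'])
print(summary)
```

In this toy example the single large fare lifts the survivor mean far above the survivor median, which is the same pattern the boxplots show for the real data.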

Now I want to take a look at the effect of male vs female passengers

In [20]:
sns.countplot(x = 'Sex', data = titanic_train, hue = 'Survived')

<matplotlib.axes._subplots.AxesSubplot at 0x223d06cf860>

Looking at this chart, a much larger proportion of male passengers didn't survive compared to female passengers. I will consider Sex an important feature for model training and building.

Let's take a look at the Age column

In [21]:
# histplot replaces the deprecated distplot(..., kde=False)
hist = sns.histplot(titanic_train['Age'], color='b', bins=30)
hist.set(xlim=(0, 100), title = "Distribution of Passenger Age's")

[(0, 100), Text(0.5,1,"Distribution of Passenger Age's")]

In [22]:
titanic_train.Age.describe()

count    891.000000
mean      29.292875
std       13.210527
min        0.420000
25%       22.000000
50%       26.000000
75%       37.000000
max       80.000000
Name: Age, dtype: float64

In [23]:
age_box = sns.boxplot(y = 'Age', x = 'Survived',data = titanic_train, palette='coolwarm')
age_box.set(title='Boxplot of Age')

[Text(0.5,1,'Boxplot of Age')]

Based on the description and histogram, passenger ages are roughly normally distributed, with a mean of 29 and a median of 26 years. Looking at the boxplots of ages for passengers who did and didn't survive, the two distributions look relatively similar. Based on this, I'm debating whether to include the Age column in model training.

Embarked Column

In [24]:
titanic_train.groupby(['Embarked']).count()

Embarked,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare
C,168,168,168,168,168,168,168,168,168,168
Q,77,77,77,77,77,77,77,77,77,77
S,646,646,646,646,646,646,646,646,646,646


In [25]:
sns.countplot(x = 'Embarked', hue = 'Survived', data=titanic_train)

<matplotlib.axes._subplots.AxesSubplot at 0x223d06cf860>

Looks like passengers who boarded at S were less likely to survive than those who boarded at C or Q.

PClass column

In [26]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass', data=titanic_train, palette = 'rainbow')

<matplotlib.axes._subplots.AxesSubplot at 0x223d06cf860>

Looks like a majority of those who didn't survive were in the 3rd P-class. Would definitely be worth including as a feature in the model.
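Count plots show absolute numbers, so groups of different sizes are hard to compare. A normalized cross-tabulation gives the survival rate within each group instead. The sketch below uses a made-up frame (`demo` is an assumption); the same call could be pointed at the Pclass, Sex, or Embarked column of `titanic_train`.

```python
import pandas as pd

# Made-up miniature data: passenger class and survival flag.
demo = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# normalize='index' turns each row into shares that sum to 1,
# so column 1 is the survival rate within each class.
rates = pd.crosstab(demo['Pclass'], demo['Survived'], normalize='index')
print(rates)
```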

## Feature Engineering

First step is to make copies of each dataframe

In [27]:
#Make copies of both dataframes.
traindf = titanic_train.copy()
testdf = titanic_test.copy()

Next I'm going to put the copied dataframes into a list so I can perform the same actions to both dataframes.

In [28]:
#Create list of both data frames to apply similar functions to.
all_data = [traindf, testdf]


### Drop Name and Ticket Columns

In [29]:
#Drop name and ticket columns
for dat in all_data:
    dat.drop(['Name', 'Ticket'], axis=1, inplace=True)

### Bin Fare Column
Next I'm going to bin the fare column based on the summary statistics for that column

In [30]:
traindf.describe()['Fare']

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

Looks like some good cutoff points will be 0, 8, 15, 31, and 515 to include the max fare value of 512.

In [31]:
#Bins and group names are the same for both frames, so define them once
bins = (0, 8, 15, 31, 515)
group_names = ['Fare_Group_1', 'Fare_Group_2', 'Fare_Group_3', 'Fare_Group_4']

#Perform operation on both frames
for dat in all_data:
    #Bin the Fare column; include_lowest=True keeps fares of exactly 0 in the first bin
    #(pd.cut intervals are open on the left by default, which would turn them into NaN)
    dat['Fare'] = pd.cut(dat.Fare, bins, labels=group_names, include_lowest=True)
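One detail of `pd.cut` is worth checking when the lowest bin edge equals the minimum of the data, as it does here (the minimum fare is 0). By default the intervals are closed on the right and open on the left, so a value equal to the first edge falls outside every bin. The sketch below uses a few made-up fares to show the difference that `include_lowest=True` makes.

```python
import pandas as pd

# A few made-up fares, including one that is exactly 0.
fares = pd.Series([0.0, 7.25, 8.0, 14.45, 31.0, 512.33])
bins = [0, 8, 15, 31, 515]
labels = ['Fare_Group_1', 'Fare_Group_2', 'Fare_Group_3', 'Fare_Group_4']

# Default: intervals are (0, 8], (8, 15], ... so 0.0 gets no bin (NaN).
default_cut = pd.cut(fares, bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(fares, bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'fare': fares, 'default': default_cut, 'inclusive': inclusive_cut}))
```

A NaN category does not raise an error in one-hot encoding; the row simply gets zeros in all fare columns, so the problem is easy to miss.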


### Bin Age Column

In [32]:
traindf.describe()['Age']

count    891.000000
mean      29.292875
std       13.210527
min        0.420000
25%       22.000000
50%       26.000000
75%       37.000000
max       80.000000
Name: Age, dtype: float64

I'm going to try binning ages into 15-year intervals.

In [33]:
#Age bins (15-year intervals) and their group names, shared by both frames
bins = (0, 15, 30, 45, 60, 75, 90)
group_names = ['Child', 'Young Adult', 'Adult', 'Experienced', 'Senior', 'Elderly']

#Perform operation on both frames
for dat in all_data:
    #Bin the Age column and replace it with the age group
    dat['Age'] = pd.cut(dat.Age, bins, labels=group_names)

In [34]:
traindf.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,male,Young Adult,1,0,Fare_Group_1,S
1,2,1,1,female,Adult,1,0,Fare_Group_4,C
2,3,1,3,female,Young Adult,0,0,Fare_Group_1,S
3,4,1,1,female,Adult,1,0,Fare_Group_4,S
4,5,0,3,male,Adult,0,0,Fare_Group_2,S


### Create Family Size Feature. SibSp + Parch

In [35]:
for dat in all_data:
    dat['Fam_Size'] = dat['SibSp'] + dat['Parch']

### Use one hot encoding to code categorical variables.

In [36]:
traindf = pd.get_dummies(traindf)
traindf.head()

Unnamed: 0,PassengerId,Survived,Pclass,SibSp,Parch,Fam_Size,Sex_female,Sex_male,Age_Child,Age_Young Adult,...,Age_Experienced,Age_Senior,Age_Elderly,Fare_Fare_Group_1,Fare_Fare_Group_2,Fare_Fare_Group_3,Fare_Fare_Group_4,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,1,0,1,0,1,0,1,...,0,0,0,1,0,0,0,0,0,1
1,2,1,1,1,0,1,1,0,0,0,...,0,0,0,0,0,0,1,1,0,0
2,3,1,3,0,0,0,1,0,0,1,...,0,0,0,1,0,0,0,0,0,1
3,4,1,1,1,0,1,1,0,0,0,...,0,0,0,0,0,0,1,0,0,1
4,5,0,3,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,1


In [37]:
testdf = pd.get_dummies(testdf)
testdf.head()

Unnamed: 0,PassengerId,Pclass,SibSp,Parch,Fam_Size,Sex_female,Sex_male,Age_Child,Age_Young Adult,Age_Adult,Age_Experienced,Age_Senior,Age_Elderly,Fare_Fare_Group_1,Fare_Fare_Group_2,Fare_Fare_Group_3,Fare_Fare_Group_4,Embarked_C,Embarked_Q,Embarked_S
0,892,3,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0
1,893,3,1,0,1,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1
2,894,2,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0
3,895,3,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1
4,896,3,1,1,2,1,0,0,1,0,0,0,0,0,1,0,0,0,0,1
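Encoding the two frames separately works here because both contain the same categories. If the test data were missing a category, the encoded frames would end up with different columns. One hedged way to guard against that is to reindex the encoded test frame to the training columns, as in this sketch on made-up frames (`train`, `test`, and the `_enc` names are illustrative).

```python
import pandas as pd

# Made-up frames: the test frame happens to contain only one port.
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q'], 'Fam_Size': [1, 0, 2]})
test = pd.DataFrame({'Embarked': ['S', 'S'], 'Fam_Size': [0, 3]})

# dtype=int gives 0/1 columns instead of booleans.
train_enc = pd.get_dummies(train, dtype=int)

# Encode the test frame, then force it to have the training columns;
# categories that never occur in the test data are filled with 0.
test_enc = pd.get_dummies(test, dtype=int).reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```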


# Machine Learning
In order to predict whether a passenger survived the Titanic or not, a classification algorithm is needed. For this kernel, I've decided to try the following methods:
* Logistic Regression
* Support Vector Machine
* K Nearest Neighbors

The steps I'm going to take to find the best model are outlined below
1. Split data into training, validation, and test sets
2. Train and fit each model to training data
3. Test each model on validation data
4. Pick model with highest prediction accuracy on validation set.
5. Use model from step 4 on test dataset.

In [38]:
#Import libraries
from sklearn.metrics import confusion_matrix #confusion matrix
from sklearn.linear_model import LogisticRegression #Logistic Regression
from sklearn.ensemble import RandomForestClassifier #Random Forest Classifier
from sklearn.svm import SVC #Support Vector Machine
from sklearn.preprocessing import StandardScaler #For scaling data
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.model_selection import train_test_split #Split data into training and validation sets.
from sklearn.metrics import accuracy_score  #Accuracy Score

### 1. Split data into training and validation sets
Because we already have the test dataset provided to us, all we need to do is split the training dataset into a training and validation set.

In [40]:
#Split data into training and validation set
X = traindf.drop(['PassengerId', 'Survived'], axis=1)
y = traindf['Survived']

#Note they are labeled as test sets but I'm treating them as validation data sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
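The split above is random, so accuracy numbers will change from run to run. A possible variation (an assumption on my part, not what produced the results in this kernel) is to fix `random_state` for repeatability and pass `stratify` so both parts keep the same class balance. The sketch uses synthetic arrays rather than the Titanic features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 60 of class 0 and 40 of class 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# random_state makes the split repeatable; stratify keeps the 60/40 balance
# in both the training and the validation part.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

print(len(X_tr), len(X_val), y_tr.mean(), y_val.mean())
```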

### 2. Train and fit each model to training data, test on validation data.
I will do this for each model listed above. The dataframe below will hold the validation results.

In [41]:
results = pd.DataFrame(columns=['Validation'], index=['Logistic Regression', 'Support Vector Machine', 'KNN', 'Random Forest'])

### Logistic Regression
First, create a function that fits a logistic regression model on the training data and scores it on the validation data.

In [42]:
def log_reg(X_train, X_test, y_train, y_test):
    #Create logmodel object
    logmodel = LogisticRegression()

    #fit logistic regression model
    logmodel.fit(X_train, y_train)

    #Make predictions on validation data
    predictions = logmodel.predict(X_test)
    
    #Compute validation accuracy once, then print it
    accuracy = accuracy_score(y_test, predictions)
    print(accuracy)

    #Return validation accuracy (not the predictions themselves)
    return accuracy

In [43]:
#Get prediction accuracy for model.
LR_preds = log_reg(X_train, X_test, y_train, y_test)

#Add to dataframe.
results.loc['Logistic Regression', 'Validation'] = LR_preds

0.779850746269
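Accuracy is a single number; a confusion matrix shows which kind of mistakes make it up. The sketch below uses made-up labels and predictions (not the model output above) to show how the two relate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for eight passengers.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)

# Accuracy is the share of predictions on the diagonal.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```

With an imbalanced response like Survived, the off-diagonal cells help to see whether a model mostly misses survivors or non-survivors.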


### Support Vector Machine
First, create a function that fits a support vector machine on the training data and scores it on the validation data. For SVM we will need to scale the input features.


In [44]:
def svm(X_train, X_test, y_train, y_test):
    
    #Scale data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    
    #Create list of c values to try
    c_vals = [.0001, .001, .01, .1, 1, 10, 100]
    
    #Accuracy list
    accuracy = [0,0,0,0,0,0,0]
    
    #Loop through c_values
    for i, c in enumerate(c_vals):
        #Create support vector machine object
        svc_model = SVC(C=c)
        
        #fit support vector machine model
        svc_model.fit(X_train, y_train)
        
        #Make predictions
        predictions = svc_model.predict(X_test)
        
        #add accuracy score to accuracy list
        accuracy[i] = accuracy_score(y_test, predictions)
    
    print("Best C Value:", c_vals[accuracy.index(max(accuracy))])
    
    print("Prediction Accuracy: ", max(accuracy))
    
    return max(accuracy)
        
        

In [45]:
#Get support vector machine results
svm_preds = svm(X_train, X_test, y_train, y_test)

#Add to dataframe.
results.loc['Support Vector Machine', 'Validation'] = svm_preds
results.head()


Best C Value: 10
Prediction Accuracy:  0.85447761194


Unnamed: 0,Validation
Logistic Regression,0.779851
Support Vector Machine,0.854478
KNN,
Random Forest,
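Picking the best C on the validation set and then reporting the accuracy on that same set tends to give a slightly optimistic number. An alternative, sketched here on synthetic data from `make_classification` and not used for the results in this kernel, is to tune C with cross-validation. Putting the scaler in a pipeline means it is refit on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data as a stand-in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaling and the SVM in one estimator; make_pipeline names the SVC step 'svc'.
pipe = make_pipeline(StandardScaler(), SVC())

# Same C values as tried in the loop above.
param_grid = {'svc__C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validated accuracy for every C value.
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

After fitting, `search` itself can be used for prediction; by default it is refit on all the data with the best C.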


### K Nearest Neighbors
First, create a function that fits a K Nearest Neighbors model on the training data and scores it on the validation data. For KNN we will need to scale the input features.
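The reason scaling matters for a distance-based method can be seen on a tiny made-up example with one 0/1 column and one column on a much larger scale. Before scaling, the large-scale column dominates the Euclidean distance; after standardizing, a difference in the 0/1 column counts as well. The points and column meanings below are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up points: column 0 is a 0/1 indicator, column 1 is on a much larger scale.
# Rows: A, B, C, D
X_demo = np.array([[0, 10.0],
                   [1, 11.0],
                   [0, 20.0],
                   [1, 150.0]])

def dist(X, i, j):
    """Euclidean distance between rows i and j of X."""
    return np.linalg.norm(X[i] - X[j])

# Raw features: B looks far closer to A than C does, although B differs in the indicator.
raw_ab, raw_ac = dist(X_demo, 0, 1), dist(X_demo, 0, 2)

# Standardized features: every column has mean 0 and unit variance.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_ab, scaled_ac = dist(X_scaled, 0, 1), dist(X_scaled, 0, 2)

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

On the raw features the nearest neighbor of A is B; after scaling it is C, which shares the indicator value with A.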

In [46]:
def knn(X_train, X_test, y_train, y_test):
    
    #Scale data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    
    #Create list of k values to try (1 through 20)
    ks = [i + 1 for i in range(20)]
    
    #Accuracy list
    accuracy = [0 for i in range(20)]
    
    #Loop through k values
    for i, k in enumerate(ks):
        #Create KNN classifier object (named knn_model to avoid shadowing this function)
        knn_model = KNeighborsClassifier(n_neighbors = k)
        
        #fit KNN model
        knn_model.fit(X_train, y_train)
        
        #Make predictions on validation data
        predictions = knn_model.predict(X_test)
        
        #add accuracy score to accuracy list
        accuracy[i] = accuracy_score(y_test, predictions)
    
    print(ks)
    print(accuracy)
    print("Best k Value:", ks[accuracy.index(max(accuracy))])
    
    print("Prediction Accuracy: ", max(accuracy))
    
    return max(accuracy)

In [47]:
knn_preds = knn(X_train, X_test, y_train, y_test)
results.loc['KNN', 'Validation'] = knn_preds
results.head()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[0.79104477611940294, 0.78731343283582089, 0.80223880597014929, 0.80223880597014929, 0.80970149253731338, 0.77985074626865669, 0.80223880597014929, 0.78358208955223885, 0.80597014925373134, 0.78358208955223885, 0.77985074626865669, 0.79850746268656714, 0.80597014925373134, 0.80970149253731338, 0.81343283582089554, 0.79850746268656714, 0.80223880597014929, 0.79850746268656714, 0.79850746268656714, 0.80597014925373134]
Best k Value: 15
Prediction Accuracy:  0.813432835821


Unnamed: 0,Validation
Logistic Regression,0.779851
Support Vector Machine,0.854478
KNN,0.813433
Random Forest,


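Use SVM with C = 1 to make predictions on the testing data. (The validation run shown above selected C = 10; because the train/validation split is not seeded, the selected value can differ between runs.)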

In [48]:
scaler = StandardScaler()
scaler.fit(X)
test_feats = testdf.drop('PassengerId', axis=1)
X = scaler.transform(X)
test_feats = scaler.transform(test_feats)


In [49]:
svc_model = SVC(C = 1)
svc_model.fit(X, y)
predictions = svc_model.predict(test_feats)
output = pd.DataFrame({ 'PassengerId' : testdf['PassengerId'], 'Survived': predictions })
output.to_csv('titanic-predictions.csv', index=False)
output.head()


Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,0


Conclusion: This model achieved 78.947% accuracy, which ranks in the top half of submissions on the Kaggle leaderboard. As this was intended to be a simple notebook to reinforce learning concepts, I'm pretty happy with this result. As I continue to improve my feature engineering skills and learn how more advanced machine learning models work, I will update this kernel to try to improve on the work here.

If you made it this far, thanks for reading! Any feedback is appreciated :)