# Classification

Let's start by importing the necessary libraries and viewing the first five rows of the dataset.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv('input/train.csv')

data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


We next check for missing values using the ```.info()``` method of the dataframe.

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
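
Reading the non-null counts above against the 891 entries, *Age* is missing 177 values, *Cabin* 687, and *Embarked* 2. As a complement to ```.info()```, the missing entries can also be counted directly. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# isna() marks missing cells; summing the booleans counts them per column
missing = toy.isna().sum()
print(missing)
```

Applied to the real dataframe, the same call would be `data.isna().sum()`.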


## Preprocessing

Let's first handle the missing values in the *Age* column by filling them with the placeholder value -0.5.

In [3]:
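# assign the result back; an in-place fillna on a selected column may not update the frame in newer pandas
data["Age"] = data["Age"].fillna(-0.5)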

In [4]:
data["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

For the *Embarked* feature, we fill in the NaNs with 'S', the most frequent value.

In [4]:
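# assign the result back; an in-place fillna on a selected column may not update the frame in newer pandas
data["Embarked"] = data["Embarked"].fillna("S")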
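
Rather than hard-coding 'S', the most frequent value can be looked up from the column itself. Here is a small sketch of that idea, assuming a made-up column (`toy` and `most_frequent` are illustrative names):

```python
import pandas as pd

# Hypothetical toy column with one missing entry
toy = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q", "S"]})

# mode() ignores missing values and returns the most frequent value(s);
# [0] picks the first one
most_frequent = toy["Embarked"].mode()[0]

# fill the gaps and assign the result back to the column
toy["Embarked"] = toy["Embarked"].fillna(most_frequent)
print(most_frequent, toy["Embarked"].tolist())
```

This keeps the imputation correct if the data changes, at the cost of hiding the chosen value from the reader.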

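We then encode the categorical columns *Sex* and *Embarked* as integers, since scikit-learn estimators expect numeric input.

In [5]:
from sklearn.preprocessing import LabelEncoder

# replace each categorical column with integer codes
le = LabelEncoder()
categorical = ["Sex","Embarked"]
for col in categorical:
    data[col] = le.fit_transform(data[col])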
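
To see which integer each category receives, the fitted encoder's `classes_` attribute can be inspected. `LabelEncoder` sorts the labels, so the codes follow alphabetical order. A minimal sketch on a made-up list of values (`codes` and `mapping` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# fit and transform in one step on a hypothetical sample of the Sex column
codes = le.fit_transform(["male", "female", "female", "male"])

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)         # {'female': 0, 'male': 1}
print(codes.tolist())  # [1, 0, 0, 1]
```

For a nominal feature such as *Embarked*, these codes imply an ordering (C < Q < S) that distance-based models will take literally; one-hot encoding, for example with `pd.get_dummies`, is a common alternative.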

Next, we drop the *Name*, *Ticket*, and *Cabin* columns, which are not used as features.

In [6]:
data.drop(["Name","Ticket","Cabin"],axis = 1, inplace=True)

Let's call the ```.info()``` method again as a sanity check to ensure that everything went as expected.

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null int64
dtypes: float64(2), int64(7)
memory usage: 62.7 KB


## Dataset preparation and model evaluation

We next split the dataset into independent variables (features) and the dependent (target) variable, which the scikit-learn API expects as separate inputs. Then, we generate separate training and testing sets using the ```train_test_split()``` function from scikit-learn.

In [7]:
X = data.drop("Survived",axis=1)
y = data["Survived"]

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
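
With `test_size=0.33` and 891 rows, the test set size is rounded up, giving 295 test rows and 596 training rows. The sketch below checks this on placeholder arrays of the same length (`X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset
X_demo = np.arange(891 * 2).reshape(891, 2)
y_demo = np.arange(891) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

print(X_tr.shape, X_te.shape)  # (596, 2) (295, 2)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.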

We now fit a couple of models (K-nearest neighbors and Random Forest) and evaluate their performance.

In [11]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [13]:
# initialize classifiers by creating objects
knn = KNeighborsClassifier()
rf = RandomForestClassifier()

# fit to training data
knn.fit(X_train,y_train)
rf.fit(X_train,y_train)

print(f"Score of knn on test set: {knn.score(X_test,y_test)}")
print(f"Score of rf on test set: {rf.score(X_test,y_test)}")

Score of knn on test set: 0.6576271186440678
Score of rf on test set: 0.7898305084745763
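
For classifiers, `score()` returns the mean accuracy on the given data. A single train/test split can give a noisy estimate, and K-nearest neighbors relies on distances, so unscaled features with large ranges (here, for instance, *PassengerId* and *Fare*) may dominate. One possible refinement, sketched here on synthetic data rather than the Titanic set, is to standardize the features in a pipeline and average the accuracy over several cross-validation folds. The names `X_demo`, `y_demo`, and `knn_scaled` are illustrative, and no claim is made about how this would change the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Standardize features, then classify; the scaler is refit on each training fold
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Accuracy on each of 5 folds
scores = cross_val_score(knn_scaled, X_demo, y_demo, cv=5)
print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Putting the scaler inside the pipeline keeps the held-out fold from influencing the scaling statistics.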
