## K Nearest Neighbors (KNN) algorithm on Kaggle data.
(https://www.kaggle.com/c/titanic)

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Here is our initial setup:

In [146]:
import pandas
import numpy
#from scipy.stats import mode
from sklearn import neighbors
from sklearn.neighbors import DistanceMetric 
from pprint import pprint

T_TRAIN = 'raw_data/train_titanic.csv'
T_TEST = 'raw_data/test_titanic.csv'

titanic_dataframe = pandas.read_csv(T_TRAIN, header=0)
titanic_dataframe.head(5)

print('length: {0} '.format(len(titanic_dataframe)))

length: 891 


### Hypothesis:

Women, children and first class passengers will be more likely to survive.
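
One quick way to sanity-check a hypothesis like this before modeling is to compare survival rates across groups. The sketch below uses a handful of made-up rows (not the real Kaggle data) that borrow the `Sex`, `Pclass` and `Survived` column names; on the real data the same `groupby` calls could be applied to `titanic_dataframe`.

```python
import pandas

# Toy rows for illustration only; NOT taken from the Titanic dataset.
toy = pandas.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass':   [1, 3, 1, 3, 3, 2],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
rate_by_class = toy.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_class)
```

Checking children would need one more step, for example binning `Age` with `pandas.cut` before grouping.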

Here we'll drop the columns that don't appear to help us with our analysis:

In [147]:
titanic_dataframe.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
print('dropped')

titanic_dataframe.describe()

dropped


| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


### Data Clean up:

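Below we'll clean up some of the data points.

Replace the unknown ages with the mean age. Note that the training file is re-read first, so the columns dropped above are present again until they are removed in the next step: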

In [148]:
titanic_dataframe = pandas.read_csv(T_TRAIN, header=0)
avg_age_all_raw = titanic_dataframe['Age'].mean()
titanic_dataframe['Age'] = titanic_dataframe['Age'].fillna(avg_age_all_raw)

titanic_dataframe.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 13.002015 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 22.0 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 29.699118 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 35.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
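
The drop in the `Age` standard deviation between the two summaries (from about 14.53 to 13.00) is an expected side effect of mean imputation: the mean is unchanged, but every filled value sits exactly at the mean, so the spread shrinks. A minimal sketch on a toy series (the values are illustrative, not from the dataset):

```python
import numpy
import pandas

# Toy ages with two missing entries; illustrative values only.
ages = pandas.Series([20.0, 30.0, numpy.nan, 40.0, numpy.nan])

mean_age = ages.mean()            # NaN values are skipped by default
filled = ages.fillna(mean_age)

print(mean_age, filled.mean())    # the mean is preserved
print(ages.std(), filled.std())   # the standard deviation shrinks
```

If that shrinkage is a concern, filling with the median or adding a separate "age was missing" indicator column are alternatives that could be tried.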


### Next

1) Replace NaN 'Embarked' values with the placeholder code 'E'.

2) Create a new column 'Port' with integer values that represent the port.

3) Create a new column 'Gender' with integer values that represent the sex.

We then remove the original, now unused, columns:

In [149]:
titanic_dataframe['Embarked'].unique()
titanic_dataframe['Embarked'] = titanic_dataframe['Embarked'].fillna('E')
titanic_dataframe['Port'] = titanic_dataframe['Embarked'].map({'S':1, 'C':2, 'Q':3, 'E':4}).astype(int)

titanic_dataframe['Sex'].unique()
titanic_dataframe['Gender'] = titanic_dataframe['Sex'].map({'female': 0, 'male': 1}).astype(int)
titanic_dataframe = titanic_dataframe.drop(['Sex', 'Embarked', 'Name', 'Ticket', 'Cabin'], axis=1)


print('length: {0} '.format(len(titanic_dataframe)))
print(titanic_dataframe.info())

length: 891 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Port           891 non-null int64
Gender         891 non-null int64
dtypes: float64(2), int64(7)
memory usage: 62.7 KB
None
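
The `map(...).astype(int)` pattern used above relies on every value being a key in the mapping: `Series.map` yields NaN for anything the dict does not cover, and `astype(int)` then raises on the NaN. The sketch below shows one way to check for that before casting; the toy values and the stray code `'X'` are assumptions made up for illustration.

```python
import pandas

port_codes = {'S': 1, 'C': 2, 'Q': 3, 'E': 4}

# Toy values; 'X' is a hypothetical code that is not in the mapping.
embarked = pandas.Series(['S', 'C', None, 'X'])

mapped = embarked.fillna('E').map(port_codes)
unmapped = mapped.isna()
print(embarked[unmapped].tolist())   # values the mapping does not cover

# Only cast once nothing is left unmapped (here the stray row is dropped).
ports = mapped[~unmapped].astype(int)
print(ports.tolist())
```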


### Clean up the "TEST data" in the same way

In the same manner that we cleaned up the training data, we'll now make those same changes to the test data. The one additional step is filling missing `Fare` values with the mean fare.

In [150]:
test_dataframe = pandas.read_csv(T_TEST, header=0)

test_dataframe.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
avg_age_all_raw = test_dataframe['Age'].mean()
test_dataframe['Age'] = test_dataframe['Age'].fillna(avg_age_all_raw)
avg_fare_mean = test_dataframe.Fare.mean()
test_dataframe.Fare = test_dataframe.Fare.fillna(avg_fare_mean)

test_dataframe['Embarked'].unique()
test_dataframe['Embarked'] = test_dataframe['Embarked'].fillna('E')
test_dataframe['Port'] = test_dataframe['Embarked'].map({'S':1, 'C':2, 'Q':3, 'E':4}).astype(int)

test_dataframe['Sex'].unique()
test_dataframe['Gender'] = test_dataframe['Sex'].map({'female': 0, 'male': 1}).astype(int)
test_dataframe = test_dataframe.drop(['Sex', 'Embarked'], axis=1)

# test_dataframe.describe()
test_data = test_dataframe.values
print(test_dataframe.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           418 non-null float64
Port           418 non-null int64
Gender         418 non-null int64
dtypes: float64(2), int64(6)
memory usage: 26.2 KB
None
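
Because the same steps are applied to both files, they could be collected in a single function. The sketch below is one possible arrangement, not the notebook's own code: `clean` is a hypothetical helper, and passing the fill values in as parameters is a design assumption that would let the test set reuse the training means (the cell above instead computes the means from the test file itself). The toy frame only mimics the Kaggle column names.

```python
import numpy
import pandas

def clean(frame, age_fill, fare_fill):
    """Hypothetical helper: apply the notebook's cleaning steps to a copy."""
    out = frame.drop(['Name', 'Ticket', 'Cabin'], axis=1)
    out['Age'] = out['Age'].fillna(age_fill)
    out['Fare'] = out['Fare'].fillna(fare_fill)
    out['Port'] = (out['Embarked'].fillna('E')
                   .map({'S': 1, 'C': 2, 'Q': 3, 'E': 4}).astype(int))
    out['Gender'] = out['Sex'].map({'female': 0, 'male': 1}).astype(int)
    return out.drop(['Sex', 'Embarked'], axis=1)

# Toy frame with the Kaggle column names; values are illustrative only.
toy = pandas.DataFrame({
    'PassengerId': [1, 2], 'Pclass': [3, 1],
    'Name': ['A', 'B'], 'Sex': ['male', 'female'],
    'Age': [22.0, numpy.nan], 'SibSp': [1, 0], 'Parch': [0, 0],
    'Ticket': ['t1', 't2'], 'Fare': [7.25, numpy.nan],
    'Cabin': [None, 'C85'], 'Embarked': ['S', None],
})

cleaned = clean(toy, age_fill=toy['Age'].mean(), fare_fill=toy['Fare'].mean())
print(cleaned)
```

With a helper like this, the training and test frames go through identical steps, which guards against the two cleaning cells drifting apart.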


In [151]:
# Convert the column index to a list, then split it into feature and target columns
columns = titanic_dataframe.columns.tolist()
titanic_dataframe = titanic_dataframe[columns]

train_columns = [columns[0], *columns[2:]]
target_columns = [columns[1]]
print(train_columns, target_columns)

train_data = titanic_dataframe[train_columns]
target_data = titanic_dataframe[target_columns]

['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Port', 'Gender'] ['Survived']


In [152]:
model = neighbors.KNeighborsClassifier()
model.fit(train_data.values, [value[0] for value in target_data.values])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
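
KNN with the Euclidean distance (the `metric='minkowski'`, `p=2` defaults shown above) is sensitive to feature scale, so a wide-ranging column such as `PassengerId` (1 to 891) or `Fare` can dominate the neighbor search. A common alternative, which is not what this notebook does, is to leave identifiers out of the features and standardize the rest. The sketch below illustrates the scaling effect on synthetic data only; the data, the choice of `StandardScaler`, and 5-fold cross-validation are all assumptions for the example.

```python
import numpy
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; NOT the Titanic features.
rng = numpy.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)     # label depends only on the first column
X[:, 2] *= 1000.0                 # an uninformative, large-scale column

raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Mean accuracy over 5 folds, with and without standardization.
raw_score = cross_val_score(raw_knn, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled_knn, X, y, cv=5).mean()
print(raw_score, scaled_score)
```

Putting the scaler inside the pipeline means it is fit on the training folds only, so the held-out fold does not leak into the scaling statistics.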

In [153]:
output = model.predict(test_data).astype(int)
print(output[:5])

[0 0 0 0 0]


In [154]:
results = numpy.c_[test_dataframe.PassengerId.astype(int), output]

In [155]:
knn_results = pandas.DataFrame(results[:,0:2], columns=['PassengerId', 'Survived'])
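
To submit the predictions, the results frame still has to be written out as a CSV with the `PassengerId` and `Survived` columns. A minimal sketch, assuming toy stand-ins for the IDs and predictions; it writes to an in-memory buffer, and the file name in the comment is hypothetical.

```python
import io
import numpy
import pandas

# Toy stand-ins for test_dataframe.PassengerId and the model output.
passenger_ids = numpy.array([892, 893, 894])
output = numpy.array([0, 1, 0])

knn_results = pandas.DataFrame({'PassengerId': passenger_ids,
                                'Survived': output})

# index=False keeps the pandas row index out of the file.
buffer = io.StringIO()            # swap for a path such as 'knn_results.csv'
knn_results.to_csv(buffer, index=False)
print(buffer.getvalue())
```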
