This notebook walks through a quick, simple process for training an ML model and submitting the predictions it generates.


In [1]:
# Import the usual libraries
import pandas as pd
import numpy as np
import graphviz as gv
import matplotlib.pyplot as plt
%matplotlib inline
print(pd.__version__, np.__version__, gv.__version__)

We will load both the train and test data (the latter is actually evaluation data) and concatenate them so we can work on both at the same time. Note that the test data is missing the _Survived_ feature.

In [2]:
train_df = pd.read_csv('../input/train.csv', index_col='PassengerId')
test_df = pd.read_csv('../input/test.csv', index_col='PassengerId')

df = pd.concat([train_df, test_df], sort=True)

Let's look at 10 random examples (if Survived is NaN, the row comes from the test/evaluation data).

In [3]:
df.sample(10)

You can refer to [its data dictionary](https://www.kaggle.com/c/titanic/data) to know more about these features.

Notice that the original features start with an uppercase letter. The new features we add later will be in lowercase.

First let's see if the dataset has missing values.

In [4]:
df[['Age', 'Sex']].isnull().sum()

So we need to fill in the missing Age values for 263 examples, while the Sex feature has no missing values.
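The check above looks at two columns only. As a minimal sketch, assuming a small made-up frame (the `toy` data and its `Fare` values are illustrative, not taken from the Titanic files), the same idea can be applied to every column at once:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined train/test frame (values are made up).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Sex': ['male', 'female', 'female', 'male'],
    'Fare': [7.25, 71.28, np.nan, 8.05],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running `df.isnull().sum()` on the real frame would also report Survived as missing for every test row, which is expected.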

Using the pandas __.describe()__ method we can see summary statistics for a feature, here Age.

In [5]:
df['Age'].describe()

In [6]:
# Number of passengers per age (one histogram bin per year)
max_age = df['Age'].max()
df['Age'].hist(bins=int(max_age))

In [7]:
# Survival ratio per decade, ignoring NaN with dropna()
df['decade'] = df['Age'].dropna().apply(lambda x: int(x/10))
df[['decade', 'Survived']].groupby('decade').mean().plot()

The younger the passenger, the better the chances of survival. There is an outlier at age 80, however.
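A group mean can be misleading when the group is very small, which may be what happens at the oldest ages. The sketch below, on made-up data (the `toy` frame is an assumption, not the real passengers), shows the per-decade survival ratio next to the number of passengers behind it:

```python
import numpy as np
import pandas as pd

# Made-up sample: ages and survival flags, not the real Titanic data.
toy = pd.DataFrame({
    'Age': [4, 8, 25, 27, 29, 80, np.nan],
    'Survived': [1, 1, 0, 1, 0, 1, 0],
})

# Floor-divide by 10 to get the decade; a NaN age stays NaN
# and groupby drops that row by default.
toy['decade'] = toy['Age'] // 10

# Survival ratio and group size per decade.
summary = toy.groupby('decade')['Survived'].agg(['mean', 'count'])
print(summary)
```

In this toy sample the ratio for decade 8 rests on a single passenger, so it says little about the trend.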

We need to fill in the missing values of Age. Let's do this using the mean value.

In [9]:
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
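To make the imputation step concrete, here is a minimal sketch on a made-up Series. The `age_was_missing` indicator is a hypothetical extra feature, not something this notebook uses; it only shows one way to remember which rows were filled:

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])  # made-up values

# Series.mean() skips NaN by default, so only known ages contribute.
mean_age = ages.mean()
filled = ages.fillna(mean_age)

# Hypothetical indicator: 1 where the age was imputed, 0 otherwise.
age_was_missing = ages.isnull().astype(int)
print(mean_age, filled.tolist(), age_was_missing.tolist())
```

Note that in the notebook the mean is computed over the combined train and test rows, since `df` holds both.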

Sex is stored as "male" or "female", but an ML algorithm needs numerical values as input. So let's create a new feature "male".

In [10]:
df['male'] = df['Sex'].map({'male': 1, 'female': 0})
df.sample(5)

In [11]:
df[['male','Survived']].groupby('male').mean()

So 74% of women survived, while men had a survival rate of just 18.9%.
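The `male` feature is built with `Series.map` and a dict, which returns NaN for any label that is not a key of the dict. The sketch below uses a made-up Series with one deliberately dirty value (`'Female'`) to show how such labels could be detected; the real Sex column may well contain no such values:

```python
import pandas as pd

# 'Female' is a made-up dirty value used only for illustration.
sex = pd.Series(['male', 'female', 'Female', 'male'])

male = sex.map({'male': 1, 'female': 0})

# Labels missing from the dict come back as NaN instead of raising an error.
unmapped = sex[male.isnull()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would mean every label was encoded.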

First we will prepare the training examples for the algorithm.

In [12]:
train = df[df['Survived'].notnull()]

features = ['Age', 'male']
train_X = train[features]
train_y = train['Survived']

Let's train a k-nearest neighbors (KNN) classifier, which is really easy to understand.

In [13]:

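from sklearn.neighbors import KNeighborsClassifier

# Fit a 3-nearest-neighbors classifier on the labeled (train) rows
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_X, train_y)

# Rows without a Survived label are the test/evaluation data
test = df[df['Survived'].isnull()]
test_X = test[features]
test_y = knn.predict(test_X)

# Accuracy (in %) on the training data itself, so an optimistic estimate
acc_knn = round(knn.score(train_X, train_y) * 100, 2)
acc_knn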
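The accuracy above is measured on the same rows the model was fitted on, so it tends to overstate how well the model generalizes. One common alternative is to hold out part of the labeled data. The sketch below is not part of the original workflow: it uses synthetic data (random ages and a toy labeling rule) in place of `train_X` and `train_y`, and the 25% split size is an arbitrary assumption:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for train_X / train_y (random values, not Titanic data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=200),
    'male': rng.integers(0, 2, size=200),
})
y = (X['male'] == 0).astype(int)  # toy rule: the label depends only on 'male'

# Keep 25% of the labeled rows aside for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_fit, y_fit)

fit_acc = knn.score(X_fit, y_fit)  # accuracy on rows the model has seen
val_acc = knn.score(X_val, y_val)  # accuracy on held-out rows
print(round(fit_acc, 3), round(val_acc, 3))
```

Since KNN relies on distances between rows, a feature with a wide range such as Age can outweigh a 0/1 feature such as male; scaling the features first is one option worth trying.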



Let's put the predictions into a DataFrame and preview the first rows.

In [14]:
submit = pd.DataFrame(test_y.astype(int),
                      index=test_X.index,
                      columns=['Survived'])
submit.head()

Let's save these predictions in a file that Kaggle will use to evaluate them.

In [15]:
submit.to_csv('smuni_knearest_submit.csv')
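Before uploading, it can help to check that the file has the expected shape: a `PassengerId` column followed by an integer `Survived` column. The sketch below writes a made-up three-row frame to an in-memory buffer instead of a real file (the ids and labels are assumptions for illustration) and reads it back:

```python
import io
import pandas as pd

# Toy predictions indexed by PassengerId (ids and labels are made up).
submit = pd.DataFrame({'Survived': [0, 1, 0]},
                      index=pd.Index([892, 893, 894], name='PassengerId'))

buffer = io.StringIO()
submit.to_csv(buffer)  # the named index is written as the first column
header = buffer.getvalue().splitlines()[0]
print(header)

# Read it back to confirm the labels are integers, not floats like 0.0 / 1.0.
check = pd.read_csv(io.StringIO(buffer.getvalue()))
print(check.dtypes)
```

The same read-back check could be run on the real file with `pd.read_csv('smuni_knearest_submit.csv')`.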