# Homework 7 - Let's have another look at the Titanic

The objective of this homework is to practice k-means clustering. To successfully complete this homework, you may use any resources available to you. 

Last week, we used supervised classification to understand what drives survivability. This week we explore whether the machine can figure it out on its own.

Get the `titanic3.csv` data (Source: [Link](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt)).

1. Analyze the data using pandas.
    * Create a column `child` that specifies whether the person is a child (age <= 10).
    * Create a column `family_size` that specifies the size of the family of that person (please note that there are two relevant columns for this step).
2. Develop clusters for the dataset.
    * Impute the `age` column with the **median** (please note that this is a very simplified step; imputation is usually far more complex).
    * Drop all remaining NaN values.
    * Preprocess the `sex` column using LabelEncoder.
    * Preprocess the `child` column using LabelEncoder (not actually necessary but for systematic purposes).
    * Standardize the dataset using `sklearn.preprocessing.scale`.
    * Run a KMeans cluster analysis. Pick an appropriate number of clusters.
    * Interpret the results.
3. Implement a search for the best number of clusters using the silhouette score.
    * Set the parameters to 2,3,4,5,6 clusters.
    * Interpret the best results.
    * Try to find names for the clusters.
    
Hints:
* Explain what you are doing.
* Use references.

In [243]:
import numpy as np
import pandas as pd
import seaborn as sns

In [244]:
import sklearn as sk
import sklearn.tree as tree
import sklearn.preprocessing as pp
import sklearn.metrics as sm

In [245]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [246]:
%matplotlib inline
import matplotlib.pyplot as plt

In [247]:
raw = pd.read_csv("https://raw.githubusercontent.com/mschermann/msis2802winter2018homework/master/\
titanic3.csv")

In [248]:
tc = raw.copy()

## Description of the dataset

Each row is a passenger on the Titanic. The columns report attributes for the passengers:

| Column | Description | Relevant for this homework|
|--------|-------------|--------|
|`survived`|1 = survived, 0 = died | **X**|
|`pclass`| 1 = first class, 2 = second class, 3 = third class |**X**|
|`name`| Name of the passenger| |
|`sex`| male or female|**X**|
|`age`| age in years|**X**|
|`sibsp`| The number of siblings or spouses that are also traveling on the Titanic| **X**|
|`parch`| The number of parents or children that are also traveling on the Titanic| **X**|
|`ticket`|The ticket number| |
|`fare`| The ticket price | |
|`cabin`| The cabin number | |
|`embarked`| The starting city | |
|`boat`| The emergency boat number | |
|`body`| The identification number of the body | |
|`home.dest`| The destination of the passenger | |

In [249]:
tc['child'] = tc['age'] <= 10

In [250]:
tc['family_size'] = 1 + tc['sibsp'] + tc['parch']
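
One detail of the two derived columns is worth checking: a comparison against a missing age evaluates to `False`, so passengers with an unknown age are silently counted as non-children. The following is a minimal sketch on a hypothetical toy frame (the three rows are made up for illustration and are not taken from `titanic3.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy passengers; NaN marks an age that was not recorded.
toy = pd.DataFrame({
    "age": [4.0, 35.0, np.nan],
    "sibsp": [1, 0, 0],
    "parch": [2, 0, 0],
})

# Comparisons with NaN are False, so the unknown age is flagged as "not a child".
toy["child"] = toy["age"] <= 10

# Family size counts the passenger plus siblings/spouses plus parents/children.
toy["family_size"] = 1 + toy["sibsp"] + toy["parch"]

# Number of rows for which the child flag rests on a missing age.
n_unknown = int(toy["age"].isna().sum())
```

Because `child` is computed before the median imputation in the next cell, the flag reflects only recorded ages.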

In [251]:
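from sklearn.impute import SimpleImputer  # pp.Imputer was removed from scikit-learn; SimpleImputer replaces it
tc['age'] = SimpleImputer(strategy='median').fit_transform(tc[['age']]).ravel()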

In [252]:
t = tc[['survived', 'pclass', 'sex', 'age', 'child', 'family_size']].copy().dropna()

In [253]:
ge = pp.LabelEncoder()
t['sex'] = ge.fit_transform(t['sex'].astype(str))
dict(zip(ge.classes_, ge.transform(ge.classes_)))

{'female': 0, 'male': 1}

In [254]:
ce = pp.LabelEncoder()
t['child'] = ce.fit_transform(t['child'].astype(str))
dict(zip(ce.classes_, ce.transform(ce.classes_)))

{'False': 0, 'True': 1}

In [255]:
X = t.drop('survived', axis=1)
X = pp.scale(X)
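
As a quick sanity check of what `pp.scale` does, the sketch below standardizes a small made-up array (the values are hypothetical) and confirms that every column ends up with mean 0 and standard deviation 1. Note that `pp.scale` returns a NumPy array, so the column names of the DataFrame are lost after this step.

```python
import numpy as np
import sklearn.preprocessing as pp

# Hypothetical toy matrix: two features on very different scales.
toy = np.array([[1.0, 20.0],
                [2.0, 30.0],
                [3.0, 70.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
scaled = pp.scale(toy)

col_means = scaled.mean(axis=0)  # approximately 0 for every column
col_stds = scaled.std(axis=0)    # approximately 1 for every column
```

Standardizing matters for k-means because the algorithm relies on Euclidean distances, so an unscaled feature with a large range (such as `age`) would otherwise dominate the clustering.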

In [256]:
from sklearn.cluster import KMeans

In [257]:
cluster_label = KMeans(n_clusters=5).fit_predict(X)
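
K-means starts from random centroids, so the cluster numbering (and sometimes the clusters themselves) can change from run to run. A minimal sketch, assuming made-up toy data with two well-separated blobs, of how fixing `random_state` makes the labels reproducible:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(5.0, 0.5, size=(20, 2)),
])

# The same seed yields the same initialization and therefore the same labels.
labels_a = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(toy)
labels_b = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(toy)

same_labels = bool((labels_a == labels_b).all())
```

The seed value 42 is arbitrary. Without a fixed seed, cluster numbers quoted in a written interpretation may no longer match the tables after the notebook is re-run.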

In [258]:
t['cluster_label'] = pd.Series(cluster_label, index=t.index)

In [259]:
check = t.groupby('cluster_label').agg({'survived':['count','sum', 'mean'], 'pclass':'mean', 'sex':'sum', 'age':'mean', 'child':'sum', 'family_size':'mean'}).reset_index()
check

| | cluster_label | survived count | survived sum | survived mean | pclass mean | sex sum | age mean | child sum | family_size mean |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 554 | 75.0 | 0.135379 | 2.790614 | 554 | 27.472924 | 0 | 1.274368 |
| 1 | 1 | 86 | 50.0 | 0.581395 | 2.651163 | 45 | 4.304264 | 86 | 4.116279 |
| 2 | 2 | 344 | 257.0 | 0.747093 | 2.22093 | 0 | 27.534884 | 0 | 1.767442 |
| 3 | 3 | 276 | 108.0 | 0.391304 | 1.206522 | 223 | 44.068841 | 0 | 1.608696 |
| 4 | 4 | 49 | 10.0 | 0.204082 | 2.714286 | 21 | 28.459184 | 0 | 7.22449 |


In [260]:
from sklearn.pipeline import Pipeline

In [261]:
pipe = Pipeline([('cluster', KMeans())])

In [262]:
cluster__n_clusters = [2,3,4,5,6]

In [263]:
from sklearn.metrics import silhouette_score
rows = []
for n_cluster in cluster__n_clusters:
    # Reconfigure the KMeans step of the pipeline for the current number of clusters.
    pipe = pipe.set_params(cluster__n_clusters=n_cluster)
    labels = pipe.fit_predict(X)
    silhouette_avg = silhouette_score(X, labels)
    rows.append({'clusters': n_cluster, 'silhouette_score': silhouette_avg, 'labels': labels})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows.
score = pd.DataFrame(rows, columns=['clusters', 'silhouette_score', 'labels'])

In [264]:
score

| | clusters | silhouette_score | labels |
|---|---|---|---|
| 0 | 2 | 0.559778 | [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1 | 3 | 0.386265 | [1, 0, 0, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, ... |
| 2 | 4 | 0.4129 | [0, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... |
| 3 | 5 | 0.438838 | [3, 1, 1, 2, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, ... |
| 4 | 6 | 0.457964 | [4, 2, 2, 0, 4, 0, 4, 0, 4, 0, 0, 4, 4, 4, 0, ... |
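
The row with the highest silhouette score can also be picked programmatically instead of by eye. The sketch below rebuilds a small frame from the scores reported in the output above (the `labels` column is omitted for brevity; the name `score_summary` is made up for this example) and selects the best row with `idxmax`:

```python
import pandas as pd

# Silhouette scores copied from the output above.
score_summary = pd.DataFrame({
    "clusters": [2, 3, 4, 5, 6],
    "silhouette_score": [0.559778, 0.386265, 0.4129, 0.438838, 0.457964],
})

# idxmax returns the index label of the row with the highest score.
best = score_summary.loc[score_summary["silhouette_score"].idxmax()]
best_n_clusters = int(best["clusters"])
```

For this run, the two-cluster solution scores highest, with six clusters second. A higher silhouette score indicates tighter, better-separated clusters, but it does not guarantee that the clusters are the most informative ones to interpret.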


In [265]:
t['cluster_label_2'] = pd.Series(score.iloc[0,2], index=t.index)
t['cluster_label_3'] = pd.Series(score.iloc[1,2], index=t.index)
t['cluster_label_4'] = pd.Series(score.iloc[2,2], index=t.index)
t['cluster_label_5'] = pd.Series(score.iloc[3,2], index=t.index)
t['cluster_label_6'] = pd.Series(score.iloc[4,2], index=t.index)

In [266]:
check = t.groupby('cluster_label_2').agg({'survived':['count','sum', 'mean'], 'pclass':'mean', 'sex':'sum', 'age':'mean', 'child':'sum', 'family_size':'mean'}).reset_index()

In [267]:
check

| | cluster_label_2 | survived count | survived sum | survived mean | pclass mean | sex sum | age mean | child sum | family_size mean |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1206 | 449.0 | 0.372305 | 2.259536 | 789 | 31.403814 | 0 | 1.613599 |
| 1 | 1 | 103 | 51.0 | 0.495146 | 2.708738 | 54 | 7.249191 | 86 | 5.048544 |


In [275]:
check = t.groupby('cluster_label_6').agg({'pclass':['count','mean'], 'sex':'sum', 'age':['mean','median'], 'child':'sum', 'family_size':'mean'}).reset_index()

In [276]:
check

| | cluster_label_6 | pclass count | pclass mean | sex sum | age mean | age median | child sum | family_size mean |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 240 | 1.3 | 240 | 41.664583 | 40.0 | 0 | 1.520833 |
| 1 | 1 | 47 | 2.765957 | 21 | 27.244681 | 28.0 | 0 | 7.297872 |
| 2 | 2 | 86 | 2.651163 | 45 | 4.304264 | 4.0 | 86 | 4.116279 |
| 3 | 3 | 230 | 2.717391 | 0 | 25.895652 | 28.0 | 0 | 1.7 |
| 4 | 4 | 169 | 1.171598 | 0 | 37.893491 | 36.0 | 0 | 1.95858 |
| 5 | 5 | 537 | 2.81378 | 537 | 27.205773 | 28.0 | 0 | 1.270019 |


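Interpretation of the six clusters, using the survival statistics in the table below:

* Cluster 0: Men in higher classes (mean `pclass` of 1.3) have poor chances of survival (25 percent).
* Cluster 1: Members of large families (mean `family_size` of 7.3) have dismal chances of survival (17 percent).
* Cluster 2: Children have a 58 percent chance of survival, even though most of them traveled in the lower classes (mean `pclass` of 2.65).
* Cluster 3: Women in lower classes have a 64 percent chance of survival.
* Cluster 4: Women in higher classes have the best chances of survival (95 percent).
* Cluster 5: Men in lower classes have dismal chances of survival (14 percent).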

In [274]:
t.groupby('cluster_label_6').agg({'survived':['count','sum', 'mean'], 'pclass':'mean', 'sex':'sum', 'age':['mean','median'], 'child':'sum', 'family_size':'mean'}).reset_index()

| | cluster_label_6 | survived count | survived sum | survived mean | pclass mean | sex sum | age mean | age median | child sum | family_size mean |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 240 | 60.0 | 0.25 | 1.3 | 240 | 41.664583 | 40.0 | 0 | 1.520833 |
| 1 | 1 | 47 | 8.0 | 0.170213 | 2.765957 | 21 | 27.244681 | 28.0 | 0 | 7.297872 |
| 2 | 2 | 86 | 50.0 | 0.581395 | 2.651163 | 45 | 4.304264 | 4.0 | 86 | 4.116279 |
| 3 | 3 | 230 | 147.0 | 0.63913 | 2.717391 | 0 | 25.895652 | 28.0 | 0 | 1.7 |
| 4 | 4 | 169 | 161.0 | 0.952663 | 1.171598 | 0 | 37.893491 | 36.0 | 0 | 1.95858 |
| 5 | 5 | 537 | 74.0 | 0.137803 | 2.81378 | 537 | 27.205773 | 28.0 | 0 | 1.270019 |
