# Homework 7 - Let's have another look at the Titanic

The objective of this homework is to practice k-means clustering. To successfully complete this homework, you may use any resources available to you. 

Last week, we used supervised classification to understand what drives survivability. This week we explore whether the machine can figure it out on its own.

Get the `titanic3.csv` data (Source: [Link](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt)).

1. Analyze the data using pandas.
    * Create a column `child` that specifies whether the person is a child (age <= 10).
    * Create a column `family_size` that specifies the size of the family of that person (Please note that there are two relevant columns for this step).
2. Develop clusters for the dataset.
    * Impute the `age` column with the **median** (Please note that this is a very simplified step. Imputing is usually far more complex).
    * Drop all remaining NaN values.
    * Preprocess the `sex` column using LabelEncoder.
    * Preprocess the `child` column using LabelEncoder (not actually necessary but for systematic purposes).
    * Standardize the dataset using `sklearn.preprocessing.scale`.
    * Run a KMeans cluster analysis. Pick an appropriate number of clusters.
    * Interpret the results.
3. Implement a search for the best number of clusters using the silhouette score from `sklearn.metrics`.
    * Set the parameters to 2,3,4,5,6 clusters.
    * Interpret the best results.
    * Try to find names for the clusters.
    
Hints:
* Explain what you are doing.
* Use references.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns

In [3]:
import sklearn as sk
import sklearn.tree as tree
import sklearn.preprocessing as pp
import sklearn.metrics as sm

In [4]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [5]:
%matplotlib inline

In [6]:
raw = pd.read_csv("https://raw.githubusercontent.com/mschermann/msis2802winter2018homework/master/\
titanic3.csv")

In [7]:
raw.shape

(1310, 14)

In [8]:
tc = raw.copy()

## Description of the dataset

Each row is a passenger on the Titanic. The columns report attributes for the passengers:

| Column | Description | Relevant for this homework|
|--------|-------------|--------|
|`survived`|1 = survived, 0 = died | **X**|
|`pclass`| 1 = first class, 2 = second class, 3 = third class |**X**|
|`name`| Name of the passenger| |
|`sex`| male or female|**X**|
|`age`| age in years|**X**|
|`sibsp`| The number of siblings or spouses that are also traveling on the Titanic| **X**|
|`parch`| The number of parents or children that are also traveling on the Titanic| **X**|
|`ticket`|The ticket number| |
|`fare`| The ticket price | |
|`cabin`| The cabin number | |
|`embarked`| The starting city | |
|`boat`| The emergency boat number | |
|`body`| The identification number of the body | |
|`home.dest`| The destination of the passenger | |

In [45]:
tc.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,child,family_size
0,1.0,1.0,0,29.0,0.0,0.0,0,1.0
1,1.0,1.0,1,0.9167,1.0,2.0,1,4.0
2,0.0,1.0,0,2.0,1.0,2.0,1,4.0
3,0.0,1.0,1,30.0,1.0,2.0,0,4.0
4,0.0,1.0,0,25.0,1.0,2.0,0,4.0


In [9]:
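# keep only the relevant columns; .copy() avoids SettingWithCopyWarning on later assignments
tc=tc[['survived','pclass','sex','age','sibsp','parch']].copy()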

In [10]:
tc['child']=tc.apply(lambda x: 1 if x['age']<=10 else 0, axis=1)

In [11]:
# number of siblings or spouses and number of parents or children, plus him/herself
tc['family_size']=tc['sibsp']+tc['parch']+1
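The two derived columns can be checked on a small made-up frame. The sketch below uses a hypothetical `toy` DataFrame (its values are not from `titanic3.csv`) and a vectorized comparison as an alternative to the row-wise `apply`; both treat a missing age as "not a child".

```python
import numpy as np
import pandas as pd

# Toy passengers (made-up values); the second row has a missing age.
toy = pd.DataFrame({
    "age":   [29.0, np.nan, 2.0, 10.0],
    "sibsp": [0, 1, 1, 0],
    "parch": [0, 0, 2, 1],
})

# A comparison with NaN evaluates to False, so a missing age is flagged
# as "not a child", like the row-wise lambda does.
toy["child"] = (toy["age"] <= 10).astype(int)

# Siblings/spouses plus parents/children plus the passenger.
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

print(toy)
```

Because `child` is created before the ages are imputed, passengers with an unknown age end up in the non-child group.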

In [12]:
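# fill NaN age values with the median
# (assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas)
tc["age"] = tc["age"].fillna(tc["age"].median())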
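The task description notes that imputing is usually more complex than a single median. As a hedged illustration only, the sketch below contrasts the global median with a group-wise median on a made-up `toy` frame; grouping by `pclass` is an assumption chosen for the example, not something the homework asks for.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value per class.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age":    [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# Global median: every missing age receives the same value.
global_fill = toy["age"].fillna(toy["age"].median())

# Group-wise median: each missing age receives the median of its own class.
group_median = toy.groupby("pclass")["age"].transform("median")
group_fill = toy["age"].fillna(group_median)

print(pd.DataFrame({"global": global_fill, "by_pclass": group_fill}))
```

The notebook keeps the global median, as the homework requires.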

In [13]:
tc.isnull().sum()

survived       1
pclass         1
sex            1
age            0
sibsp          1
parch          1
child          0
family_size    1
dtype: int64

In [14]:
# drop remaining NaN values
tc.dropna(axis=0,how='any', inplace=True)

In [16]:
# change sex to a dummy variable
le_sex = pp.LabelEncoder()
tc['sex'] = le_sex.fit_transform(tc['sex'].astype(str))
tc['sex'].head()

0    0
1    1
2    0
3    1
4    0
Name: sex, dtype: int64
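To read the encoded `sex` column later, it helps to know which integer stands for which label. A minimal sketch on a made-up series (the variable names are illustrative): `LabelEncoder` sorts the classes, so `female` becomes 0 and `male` becomes 1.

```python
import pandas as pd
import sklearn.preprocessing as pp

# Made-up values standing in for the `sex` column.
sex = pd.Series(["female", "male", "female", "male"])

le = pp.LabelEncoder()
codes = le.fit_transform(sex)

# classes_ is sorted; the position of each class is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels.
print(le.inverse_transform([0, 1]))
```

With this mapping, a cluster mean of `sex` close to 0 indicates a mostly female cluster.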

In [17]:
# change child column to dummies
le_child = pp.LabelEncoder()
tc['child'] = le_child.fit_transform(tc['child'].astype(str))
tc['child'].head()

0    0
1    1
2    1
3    0
4    0
Name: child, dtype: int64

In [18]:
# standardize the dataset
tc_scaled= pp.scale(tc)
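Standardizing matters for k-means because the algorithm works with Euclidean distances, so a column with a large range (such as `age`) would otherwise dominate. The sketch below, on a made-up array `X`, checks that `pp.scale` yields columns with mean 0 and standard deviation 1 and gives the same result as `StandardScaler`.

```python
import numpy as np
import sklearn.preprocessing as pp

# Made-up data: the second column has a much larger range than the first.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

X_scaled = pp.scale(X)

# After scaling, every column has mean 0 and (population) standard deviation 1.
means = X_scaled.mean(axis=0)
stds = X_scaled.std(axis=0)

# StandardScaler performs the same transformation but remembers the fitted
# mean and scale, which allows inverse_transform later.
same = np.allclose(X_scaled, pp.StandardScaler().fit_transform(X))
print(means, stds, same)
```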

In [19]:
# run KMeans cluster=3 
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
y_kmeans = kmeans.fit_predict(tc_scaled)

In [24]:
tc1=tc.copy()

In [25]:
# store the k-means cluster labels back in the original dataset
tc1['cluster']=y_kmeans

In [26]:
# groupby the cluster and calculate the mean for each column
tc1.groupby('cluster').mean()

cluster,survived,pclass,sex,age,sibsp,parch,child,family_size
0,0.076531,2.510204,0.894133,30.762117,0.229592,0.108418,0.0,1.33801
1,0.976982,1.731458,0.191816,32.704604,0.42711,0.398977,0.0,1.826087
2,0.432836,2.679104,0.5,12.79602,2.283582,1.962687,0.641791,5.246269


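The second cluster (cluster 1) has the highest survival rate (0.98). Its members are mostly female (mean `sex` of 0.19, where 1 = male) and travel mostly in first and second class (mean `pclass` of 1.73). The average age is 32.7, the average family size is 1.8, and none of them are children.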
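The cluster profiles can also be read directly from the k-means centroids. The following sketch runs on a made-up `toy` frame and assumes a `StandardScaler` in place of `pp.scale` so that the centroids can be mapped back to the original units; `n_init=10` and `random_state=0` are assumptions added for reproducibility and are not set in the notebook.

```python
import numpy as np
import pandas as pd
import sklearn.preprocessing as pp
from sklearn.cluster import KMeans

# Made-up data with two obvious groups: young/large family and older/small family.
toy = pd.DataFrame({
    "age":         [2.0, 4.0, 6.0, 30.0, 40.0, 50.0],
    "family_size": [5, 6, 4, 1, 2, 1],
})

scaler = pp.StandardScaler()
X = scaler.fit_transform(toy)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Centroids live in the scaled space; inverse_transform returns them to years and persons.
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_), columns=toy.columns)

# The per-cluster means of the unscaled data give the same profile.
group_means = toy.groupby(labels).mean()
print(centers)
print(group_means)
```

Without a fixed `random_state`, the cluster numbers (0, 1, 2) can change between runs, so an interpretation should refer to the cluster profile rather than to the number.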

In [27]:
from sklearn.metrics import silhouette_score

In [37]:
from sklearn.pipeline import Pipeline

In [38]:
# pipeline
pipe = Pipeline([('scale', pp.StandardScaler()),('cluster', KMeans())])

In [39]:
cluster__n_clusters = [2,3,4,5,6]

In [43]:
# search for the best number of clusters
# note: the pipeline clusters on scaled features, but the silhouette score below is computed on the unscaled tc
rows = []
for n_cluster in cluster__n_clusters:
    pipe = pipe.set_params(cluster__n_clusters = n_cluster)
    labels = pipe.fit_predict(tc)
    silhouette_avg = silhouette_score(tc, labels)
    rows.append({'clusters':n_cluster, 'silhouette_score': silhouette_avg, 'labels': labels})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
score = pd.DataFrame(rows, columns=['clusters', 'silhouette_score', 'labels'])

In [44]:
score.sort_values(by='silhouette_score')

Unnamed: 0,clusters,silhouette_score,labels
1,3,0.081974,"[0, 2, 2, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, ..."
4,6,0.086892,"[0, 2, 2, 3, 3, 4, 0, 4, 0, 4, 4, 0, 0, 0, 4, ..."
2,4,0.095143,"[1, 3, 3, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, ..."
3,5,0.112884,"[3, 1, 1, 0, 3, 0, 3, 0, 3, 0, 0, 3, 3, 3, 0, ..."
0,2,0.401508,"[0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


According to the silhouette score, 2 clusters is the best choice for this dataset: its score of 0.40 is far above the scores for 3 to 6 clusters (0.08 to 0.11).
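The search above clusters on scaled features but scores the labels against the unscaled `tc`, where `age` dominates the distances. A hedged alternative is to compute the silhouette score in the same scaled space that k-means used. The sketch below does this on synthetic blobs from `make_blobs` (not the Titanic data); the blob centers, `n_init=10`, and `random_state=0` are assumptions for the example.

```python
import pandas as pd
import sklearn.preprocessing as pp
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
X_scaled = pp.StandardScaler().fit_transform(X)

rows = []
for k in [2, 3, 4, 5, 6]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    # Score in the scaled space, i.e. the space in which the clusters were formed.
    rows.append({"clusters": k, "silhouette_score": silhouette_score(X_scaled, labels)})

scores = pd.DataFrame(rows).sort_values("silhouette_score", ascending=False)
best_k = int(scores.iloc[0]["clusters"])
print(scores)
print(best_k)
```

Applied to the Titanic data, this variant may rank the cluster counts differently from the table above, so the two results should not be mixed.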

In [33]:
kmeans = KMeans(n_clusters=2)

In [34]:
cluster = kmeans.fit_predict(tc_scaled)

In [35]:
tc1['cluster']=cluster

In [36]:
tc1.groupby('cluster').mean()

cluster,survived,pclass,sex,age,sibsp,parch,child,family_size
0,0.493506,2.571429,0.441558,14.777056,2.136364,1.987013,0.558442,5.123377
1,0.3671,2.258009,0.670996,31.466667,0.280519,0.171429,0.0,1.451948


Similar to the last homework, the two clusters differ in `pclass`, with averages on either side of 2.5 (2.57 versus 2.26). The first cluster can be called the big-family cluster: its average family size is about 5, it has more females (mean `sex` of 0.44, where 1 = male), and it is much younger (average age 14.8). This cluster has a higher survival rate than the second cluster (0.49 versus 0.37). The second cluster can be called the non-child cluster: all of its members are older than 10, the family size is much smaller (1.5 on average), the average age is 31, and most of its members are male (mean `sex` of 0.67).
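Since k-means cluster numbers are arbitrary, names are safer when derived from the cluster profile than from the number. A minimal sketch, using rounded means from the 2-cluster table and a hypothetical rule (average family size above 3) that is an assumption for illustration:

```python
import pandas as pd

# Rounded cluster means copied from the 2-cluster table.
profile = pd.DataFrame(
    {"age": [14.8, 31.5], "family_size": [5.1, 1.5]},
    index=pd.Index([0, 1], name="cluster"),
)

# Hypothetical naming rule: an average family size above 3 marks the big-family cluster.
names = profile["family_size"].gt(3).map({True: "big family", False: "non-child"})
print(names)
```

A series like `names` could then be attached to each passenger with `tc1['cluster'].map(names)`.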