# Titanic Clustering Example

Clustering is an example of "unsupervised learning": we do not use the ground truth to train the algorithm.  
Instead, the algorithm looks at the data and groups together the records it "thinks" are alike.  
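
To make the grouping idea concrete, here is a minimal sketch of the k-means procedure (Lloyd's algorithm) using only the standard library. It is an illustration, not the scikit-learn implementation: the function names (assign, update, kmeans), the toy points, and the choice of starting centroids are all assumptions made for this example.

```python
# Minimal k-means (Lloyd's algorithm) on 2-D points, standard library only.
from math import dist


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:  # an empty cluster keeps its previous centroid
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids


# Hypothetical toy data: two visibly separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

No ground truth appears anywhere in this sketch: the labels come purely from distances between points, which is what makes the method unsupervised.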

## Import necessary modules 
Also set some pandas display options so that wide tables print without truncation.

In [1]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import preprocessing
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## 1. Get data into Python kernel
Here we read a CSV file and peek at the first few rows with the head() method.

In [2]:
df = pd.read_csv('../../Data/titanic.csv')

print(df.head())

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S


## 2. Profile the data

Here we use the pandas DataFrame describe() method.  
Note that the count for Age is 714 rather than 891, so 177 ages are missing.  
For more thorough profiling, look at the pandas-profiling library and run a profile_report.

In [3]:
df.describe()

#import pandas_profiling
#df.profile_report()

       PassengerId   Survived     Pclass        Age      SibSp      Parch       Fare
count        891.0      891.0      891.0      714.0      891.0      891.0      891.0
mean         446.0   0.383838   2.308642  29.699118   0.523008   0.381594  32.204208
std     257.353842   0.486592   0.836071  14.526497   1.102743   0.806057  49.693429
min            1.0        0.0        1.0       0.42        0.0        0.0        0.0
25%          223.5        0.0        2.0     20.125        0.0        0.0     7.9104
50%          446.0        0.0        3.0       28.0        0.0        0.0    14.4542
75%          668.5        1.0        3.0       38.0        1.0        0.0       31.0
max          891.0        1.0        3.0       80.0        8.0        6.0   512.3292


## 3. Perform common dataset operations

- Select desired/necessary fields
- Fill holes in data
- Normalize data
- Map strings to values

In [4]:
# Select fields of use - drop PassengerId, Name and Ticket, which do not contribute to clustering/classification
df = df.drop(columns=['PassengerId', 'Name', 'Ticket'])

# Map categorical literals to numbers
df['Sex']      = df['Sex'].map({'female': 1, 'male': 0})
# Embarked also contains 'Q' and a few missing values; both become NaN here
# and are set to 0 by the general fill below
df['Embarked'] = df['Embarked'].map({'C': 1, 'S': 0})
# Cabin becomes an indicator: 1.0 if a cabin is recorded, 0.0 if it is missing
df['Cabin']    = df['Cabin'].notna().astype(float)

# Fill the remaining holes with 0 (this also turns missing ages into age 0)
df.fillna(0, inplace=True)
print(df.head(20))

    Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Cabin  Embarked
0          0       3    0  22.0      1      0   7.2500    0.0       0.0
1          1       1    1  38.0      1      0  71.2833    1.0       1.0
2          1       3    1  26.0      0      0   7.9250    0.0       0.0
3          1       1    1  35.0      1      0  53.1000    1.0       0.0
4          0       3    0  35.0      0      0   8.0500    0.0       0.0
5          0       3    0   0.0      0      0   8.4583    0.0       0.0
6          0       1    0  54.0      0      0  51.8625    1.0       0.0
7          0       3    0   2.0      3      1  21.0750    0.0       0.0
8          1       3    1  27.0      0      2  11.1333    0.0       0.0
9          1       2    1  14.0      1      0  30.0708    0.0       1.0
10         1       3    1   4.0      1      1  16.7000    1.0       0.0
11         1       1    1  58.0      0      0  26.5500    1.0       0.0
12         0       3    0  20.0      0      0   8.0500    0.0   
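
The operations list in this section mentions normalizing the data, but the cell shown does not rescale any column, and the imported preprocessing module is not used in the cells shown. K-means relies on Euclidean distance, so a wide-ranging column such as Fare can outweigh a 0/1 column such as Sex. The sketch below illustrates this with min-max scaling on three columns (Sex, Age, Fare) taken from the first four rows of the output above. The helper min_max_scale is hypothetical and written only for this example; scikit-learn's preprocessing module offers ready-made scalers such as MinMaxScaler.

```python
from math import dist


def min_max_scale(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        # A constant column (span 0) is mapped to 0.0 to avoid dividing by zero.
        tuple((v - lo) / span if span else 0.0
              for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]


# (Sex, Age, Fare) for rows 0-3 of the printed DataFrame.
rows = [
    (0.0, 22.0, 7.2500),
    (1.0, 38.0, 71.2833),
    (1.0, 26.0, 7.9250),
    (1.0, 35.0, 53.1000),
]
scaled = min_max_scale(rows)

# Rows 0 and 2 differ in Sex but have similar Age and Fare;
# rows 1 and 3 share Sex but differ in Fare by about 18.
raw_02, raw_13 = dist(rows[0], rows[2]), dist(rows[1], rows[3])
scaled_02, scaled_13 = dist(scaled[0], scaled[2]), dist(scaled[1], scaled[3])

print("raw distances:   ", raw_02, raw_13)
print("scaled distances:", scaled_02, scaled_13)
```

On the raw values the fare gap makes rows 1 and 3 look farther apart than rows 0 and 2; after scaling, the ordering flips because the difference in Sex now carries comparable weight. Whether scaling improves the clusters for this dataset is something to test rather than assume.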

## 4. K-Means Model for Firm Clustering

Let's create 2 clusters, one intended for each survival outcome.

In [6]:
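# Features: every column except the ground-truth label, which is held back for evaluation only
X = np.array(df.drop(columns=['Survived']).astype(float))
y = np.array(df['Survived'])

# No random_state is set, so cluster assignments can differ between runs
clf = KMeans(n_clusters=2)
clf.fit(X)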

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

## 5. Test accuracy of clusters against ground truth

In [8]:
# Test clusters against ground truth of real survivorship.
# K-means numbers its clusters arbitrarily, so cluster 1 need not mean "survived":
# a rerun can report 1 minus this value for the same grouping.
predictions = clf.predict(X)  # predict all passengers in one call
correct = int((predictions == y).sum())

print(correct/len(X))

0.6442199775533108
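
Because k-means assigns cluster ids arbitrarily, the same grouping can score 0.644 on one run and 1 - 0.644 = 0.356 on another, depending on which cluster happens to be called 1. A minimal sketch of a label-independent score for the two-cluster case follows. The function aligned_accuracy and the sample lists are hypothetical, and the sketch assumes both the cluster ids and the ground truth are coded as 0/1.

```python
def aligned_accuracy(cluster_ids, truth):
    """Accuracy of a 2-cluster labelling under the better of the two
    possible mappings from cluster id to class."""
    matches = sum(c == t for c, t in zip(cluster_ids, truth))
    direct = matches / len(truth)
    # Swapping the two cluster ids turns every match into a mismatch
    # and vice versa, so the swapped accuracy is 1 - direct.
    return max(direct, 1 - direct)


# Hypothetical example: the clusters mostly agree with the truth,
# but the ids are flipped.
cluster_ids = [1, 1, 0, 0, 1, 0]
truth = [0, 0, 1, 1, 0, 0]
print(aligned_accuracy(cluster_ids, truth))
```

With more than two clusters the mapping has to be chosen among all permutations of the ids, so this max-of-two shortcut no longer applies.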
