# Clustering Titanic Passengers with K Means

For this project, we apply K Means clustering to the famous Titanic passenger dataset to see whether the algorithm can give us any insight into which factors led to a passenger's death or survival.

We will start by importing some libraries to use.

In [1]:
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import numpy as np
from sklearn.cluster import KMeans
from sklearn import preprocessing
import pandas as pd

We can now load our dataframe. A brief description of each column is listed below.

In [2]:
'''
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survived Survival (0 = No; 1 = Yes)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare (British pound)
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat Lifeboat
body Body Identification Number
home.dest Home/Destination
'''

In [3]:
df = pd.read_excel('titanic.xls')
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


A body identification number is only recorded when a passenger died, so the 'body' column gives away the outcome. We don't want this value tainting our clusters, so we will remove it, along with 'name'.

Once the unwanted columns are dropped, we can fill in the null values of the data.

In [4]:
df.drop(columns=['body', 'name'], inplace=True)  # keyword form; a positional axis argument is rejected by recent pandas
df.fillna(0, inplace=True)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
pclass       1309 non-null int64
survived     1309 non-null int64
sex          1309 non-null object
age          1309 non-null float64
sibsp        1309 non-null int64
parch        1309 non-null int64
ticket       1309 non-null object
fare         1309 non-null float64
cabin        1309 non-null object
embarked     1309 non-null object
boat         1309 non-null object
home.dest    1309 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 122.8+ KB
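
Before filling, it can help to see how many values are missing in each column. The following is a minimal sketch on a small made-up frame; the `toy` data and the variable names are hypothetical and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a few Titanic-like columns
toy = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, np.nan],
    'cabin': ['B5', None, 'C22 C26', None],
    'fare': [211.3375, 151.55, 151.55, 7.25],
})

# Count missing values per column before filling
missing_before = toy.isnull().sum()
print(missing_before)

# Same strategy as the notebook: replace every null with 0
filled = toy.fillna(0)
missing_after = filled.isnull().sum()
print(missing_after)
```

One thing to keep in mind with this strategy: an unknown age becomes 0, which the model cannot tell apart from the age of a newborn.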


Looks like all the nulls have been filled. Let's move on. We will create a function for handling text, since K Means can only work with numeric data. The function maps each unique text value in a column to a unique integer, so the column keeps the same distinctions in numeric form. For example, in the 'sex' column, 'female' and 'male' become 0 and 1 in some order. The assignment depends on the iteration order of a Python set, so it can differ between runs (in the output further down, 'female' is 1 and 'male' is 0).

In [5]:
def handle_non_numerical_data(df):
    columns = df.columns.values
    for column in columns:
        text_digit_vals = {}
        # e.g. {'female': 1, 'male': 0}; the exact numbers depend on set iteration order
        def convert_to_int(val):
            return text_digit_vals[val]
        #this is asking if the column is numerical. If not, it will populate the dict above
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents) #This will give us all unique non-repetitive values
            x = 0
            #if not numerical, converts to list, gets the set, populates the dict with the unique elements and changes to ints
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x+=1
            df[column] = list(map(convert_to_int, df[column]))
        
    return df
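
Because the function above relies on set ordering, the integer assigned to each label is not guaranteed to be the same from run to run. As a point of comparison, here is a small sketch using `pd.factorize`, which numbers labels in order of first appearance. This is a swapped-in alternative, not the method used in this notebook, and the `sex` series below is a hypothetical toy column.

```python
import pandas as pd

# Hypothetical toy column; the notebook applies the same idea to every text column
sex = pd.Series(['female', 'male', 'female', 'male', 'male'])

# factorize numbers the labels in order of first appearance,
# so the mapping is the same on every run
codes, uniques = pd.factorize(sex)

# Rebuild the label -> integer mapping for inspection
mapping = {label: code for code, label in enumerate(uniques)}
print(codes.tolist())
print(mapping)
```

Either approach yields arbitrary integers for categories. The numbers carry no order, yet K Means will treat them as distances, which matters most for columns with many distinct values such as 'ticket'.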

Great, we have our function. Now let's run it on our dataframe.

In [6]:
df = handle_non_numerical_data(df)

df.drop(columns=['cabin', 'sibsp', 'embarked', 'home.dest'], inplace=True)  # came back to drop these columns for better accuracy

print(df.head())

   pclass  survived  sex      age  parch  ticket      fare  boat
0       1         1    1  29.0000      0     752  211.3375     1
1       1         1    0   0.9167      2     522  151.5500    18
2       1         0    1   2.0000      2     522  151.5500     0
3       1         0    0  30.0000      2     522  151.5500     0
4       1         0    1  25.0000      2     522  151.5500     0


Looks like everything worked, and we now have only quantifiable data to work with. Let's set up and train the model.

We will start by determining our X and y values, preprocessing (scaling) the X data, and fitting the model. Since this is unsupervised learning, there is no splitting the data for training and testing.

We will set the number of clusters to 2, hoping that the model will separate the passengers into survived and deceased clusters.

In [7]:
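X = np.array(df.drop(columns=['survived']).astype(float))  # features: every column except the outcome
X = preprocessing.scale(X)  # standardize each feature to zero mean and unit variance
y = np.array(df['survived'])  # kept only for comparison; K Means never sees it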

clf = KMeans(n_clusters=2)
clf.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
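
To make the scaling step more concrete, the sketch below standardizes a tiny made-up feature matrix and clusters it. The `X_toy` values are hypothetical, and `random_state=0` and `n_init=10` are set here only so the sketch is repeatable; the notebook's own model leaves `random_state` unset.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.cluster import KMeans

# Hypothetical toy features: [pclass, fare] for six made-up passengers
X_toy = np.array([
    [1, 200.0], [1, 150.0], [1, 180.0],
    [3, 8.0], [3, 7.5], [3, 9.0],
])

# Standardize each column to zero mean and unit variance,
# so that 'fare' does not dominate the distance just because of its units
X_scaled = preprocessing.scale(X_toy)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))

# Fit two clusters and read back the cluster id assigned to each row
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_model.fit_predict(X_scaled)
print(toy_labels)
```

Without scaling, a column measured in pounds would outweigh a column that only ranges from 1 to 3 when distances to the cluster centers are computed.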

Let's take a look at the two clusters using describe().

In [8]:
df['predicted'] = clf.predict(X)

print(df[df['predicted'] ==0]['survived'].describe())

print(df[df['predicted'] ==1]['survived'].describe())

count    393.000000
mean       0.707379
std        0.455546
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: survived, dtype: float64
count    916.000000
mean       0.242358
std        0.428744
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000
Name: survived, dtype: float64


The first cluster (393 passengers) has a survival rate of about 71%, and the second (916 passengers) has a survival rate of about 24% (the mean of 'survived'). This is a discernible difference, but not as large as we were hoping for. The clusters are not distinct 'survived' and 'deceased' groups, but we can still learn from them. Note that the cluster numbers themselves are arbitrary and can swap between runs. Overall, the model was somewhat successful.
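
One way to put a single number on this result is to measure how often the cluster id agrees with 'survived', allowing for the two ids being swapped. The helper `cluster_agreement` below is a hypothetical name introduced for illustration, and the toy arrays are made up. The final lines recover approximate counts from the describe() output above (count times mean, rounded), treating cluster 0 as the 'survived' group.

```python
import numpy as np

def cluster_agreement(y_true, labels):
    """Share of rows where a 2-cluster labeling matches y_true,
    allowing for the two cluster ids being swapped."""
    y_true = np.asarray(y_true)
    labels = np.asarray(labels)
    match = np.mean(y_true == labels)
    return max(match, 1.0 - match)

# Hypothetical toy example: cluster ids are flipped relative to the outcome
toy_truth = [1, 1, 0, 0, 0]
toy_labels = [0, 0, 1, 1, 0]
print(cluster_agreement(toy_truth, toy_labels))

# Counts recovered from the describe() output above (count * mean, rounded)
survivors_c0 = round(393 * 0.707379)   # survivors in cluster 0
survivors_c1 = round(916 * 0.242358)   # survivors in cluster 1
deceased_c1 = 916 - survivors_c1       # deceased passengers in cluster 1
agreement = (survivors_c0 + deceased_c1) / (393 + 916)
print(round(agreement, 3))
```

On the real data the same helper could be called as `cluster_agreement(y, clf.predict(X))`, assuming `y`, `clf`, and `X` are defined as in the cells above.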