# Titanic Passenger Clusters with Mean Shift

In this notebook, we will use the well-known Titanic passengers dataset to try out a mean shift model and see whether the model's clusters give us any insight into what determined whether a passenger survived or died.

Let's start by importing the libraries we will be using and loading the dataset.

In [1]:
import numpy as np
from sklearn.cluster import MeanShift, KMeans
from sklearn import preprocessing, model_selection
import matplotlib.pyplot as plt
import pandas as pd


'''
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survived Survival (0 = No; 1 = Yes)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare (British pound)
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat Lifeboat
body Body Identification Number
home.dest Home/Destination
'''

df = pd.read_excel('titanic.xls')

Let's make a copy of the df so that we can see values as text instead of ints at the end. We must convert categorical (text) values to ints for the model, and will use the copy of the df to look up the text values afterwards. We will define a function for this conversion below.

In [2]:
original_df = df.copy()
# axis must be passed by keyword in recent pandas versions
df.drop(['body','name'], axis=1, inplace=True)
df.fillna(0, inplace=True)

def handle_non_numerical_data(df):
    
    columns = df.columns.values

    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents) 
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x+=1 
            df[column] = list(map(convert_to_int,df[column]))

    return df
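
One caveat with the function above: it numbers the values by iterating over a `set`, and the iteration order of a set of strings can differ between Python runs, so the same text may receive a different integer each time. As a hedged alternative (not the approach this notebook uses), the sketch below applies `pd.factorize`, which numbers values in order of first appearance. The `encode_text_columns` helper and the `toy` frame are made up for illustration.

```python
import pandas as pd

def encode_text_columns(frame):
    """Return a copy with every non-numeric column replaced by integer codes."""
    encoded = frame.copy()
    for column in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[column]):
            # factorize assigns 0, 1, 2, ... in order of first appearance
            codes, _uniques = pd.factorize(encoded[column])
            encoded[column] = codes
    return encoded

# Hypothetical toy data, not rows from titanic.xls
toy = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'female'],
    'embarked': ['S', 'C', 'S', 'Q'],
    'age': [29.0, 35.0, 2.0, 48.0],
})
encoded_toy = encode_text_columns(toy)
print(encoded_toy)
```

Note that `pd.factorize` codes missing values as -1, so whether to fill them beforehand (as the notebook does with `fillna(0)`) remains a separate decision.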

We will now run the conversion function on the df and drop the fields we do not need.

In [3]:
df = handle_non_numerical_data(df)
df.drop(['ticket','home.dest'], axis=1, inplace=True)

Now we will create and train our model.

In [4]:
X = np.array(df.drop(['survived'], axis=1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])

clf = MeanShift()
clf.fit(X)

MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, min_bin_freq=1,
     n_jobs=1, seeds=None)
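
To make it clearer what the model is doing, here is a minimal sketch of flat-kernel mean shift in plain NumPy. It is not scikit-learn's implementation: the `mean_shift_flat` function, the two synthetic blobs, and the bandwidth of 2.0 are all assumptions chosen for illustration. Each seed repeatedly moves to the mean of the points within `bandwidth` of it, and seeds that end up in the same place form one cluster, which is why the number of clusters does not have to be specified in advance.

```python
import numpy as np

def mean_shift_flat(points, bandwidth, n_iter=50, tol=1e-6):
    """Toy flat-kernel mean shift: move each seed to the mean of its neighbours."""
    seeds = points.copy()
    for _ in range(n_iter):
        # distances between every current seed and every original point
        dists = np.linalg.norm(seeds[:, None, :] - points[None, :, :], axis=2)
        within = dists <= bandwidth
        # mean of the points that fall inside each seed's window
        sums = (within[:, :, None] * points[None, :, :]).sum(axis=1)
        new_seeds = sums / within.sum(axis=1, keepdims=True)
        shift = np.abs(new_seeds - seeds).max()
        seeds = new_seeds
        if shift < tol:
            break
    # merge seeds that converged to (almost) the same location
    centers = []
    labels = np.empty(len(points), dtype=int)
    for i, seed in enumerate(seeds):
        for k, center in enumerate(centers):
            if np.linalg.norm(seed - center) < bandwidth / 2:
                labels[i] = k
                break
        else:
            centers.append(seed)
            labels[i] = len(centers) - 1
    return np.array(centers), labels

# Two well-separated synthetic blobs (hypothetical data)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(30, 2))
toy_points = np.vstack([blob_a, blob_b])

toy_centers, toy_labels = mean_shift_flat(toy_points, bandwidth=2.0)
print(len(toy_centers))
```

The bandwidth controls how many clusters come out: a larger window merges more points into one cluster. In the output above, `bandwidth=None` means scikit-learn estimates a bandwidth from the data rather than using a value we supplied.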

We will now add a column to the original df that holds the cluster group the model assigned to each passenger, and then count how many clusters the model created.

In [5]:
labels = clf.labels_
cluster_centers = clf.cluster_centers_

# labels is in the same row order as X, and therefore as original_df,
# so the whole column can be assigned at once (no chained indexing)
original_df['cluster_group'] = labels

n_clusters_ = len(np.unique(labels))
print(n_clusters_)


4


We would like to check if the model's clusters reflect chances of survival in any way. Let's calculate survival rates per cluster.

In [6]:
survival_rates = {}
for i in range(n_clusters_):
    temp_df = original_df[ (original_df['cluster_group']==float(i)) ]
    survival_cluster = temp_df[  (temp_df['survived'] == 1) ]

    survival_rate = len(survival_cluster) / len(temp_df)
    print(i,survival_rate)
    survival_rates[i] = survival_rate
    
#print(survival_rates)

0 0.37170263788968827
1 1.0
2 0.6818181818181818
3 0.1
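
The same per-cluster rates can also be computed with a single `groupby`, which has the side benefit of showing how many passengers each cluster contains. The sketch below runs on a small made-up frame (the `toy` data is hypothetical), assuming the same `cluster_group` and `survived` column names used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names
toy = pd.DataFrame({
    'cluster_group': [0, 0, 0, 0, 1, 1, 2],
    'survived':      [1, 0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the survival rate; size is the cluster's row count
summary = toy.groupby('cluster_group')['survived'].agg(
    survival_rate='mean',
    passengers='size',
)
print(summary)
```

Seeing the size next to the rate matters when reading extreme values: a cluster with only a handful of passengers can reach 0% or 100% much more easily than a large one.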


We should now explore these clusters to see if we can get any valuable insight. We can do this with a boolean filter, keeping only the rows whose assigned cluster group matches the one we want.

In [7]:
print(original_df[(original_df['cluster_group']==2)])
print(original_df[(original_df['cluster_group']==2)].describe())

     pclass  survived                                               name  \
1         1         1                     Allison, Master. Hudson Trevor   
2         1         0                       Allison, Miss. Helen Loraine   
3         1         0               Allison, Mr. Hudson Joshua Creighton   
4         1         0    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
10        1         0                             Astor, Col. John Jacob   
11        1         1  Astor, Mrs. John Jacob (Madeleine Talmadge Force)   
16        1         0                           Baxter, Mr. Quigg Edmond   
17        1         1    Baxter, Mrs. James (Helene DeLaudeniere Chaput)   
23        1         1                              Bidois, Miss. Rosalie   
24        1         1                                  Bird, Miss. Ellen   
35        1         1                           Bowen, Miss. Grace Scott   
66        1         1                        Chaudanson, Miss. Victorine   
78        1 

We can create separate dataframes using the same logic to separate clusters. From there, we can get even more specific, for example, 1st class passengers in cluster 2. The summary below shows that the 44 first class passengers in cluster 2 had a 68% survival rate. Breaking this down further and finding what caused this clustering could then give us some valuable insight.

In [8]:
cluster_2 = original_df[(original_df['cluster_group']==2)]
cluster_2_fc = cluster_2[(cluster_2['pclass']==1)]
cluster_2_fc.describe()
# This shows that the first class passengers in cluster 2 had a survival rate of about 68% (mean of survived = 0.681818)

       pclass  survived        age     sibsp     parch        fare       body  cluster_group
count    44.0      44.0       44.0      44.0      44.0        44.0        5.0           44.0
mean      1.0  0.681818  36.964016  0.795455  1.272727  208.069316      104.4            2.0
std       0.0  0.471155  17.847674  0.929605  1.107346   52.453703  36.156604            0.0
min       1.0       0.0     0.9167       0.0       0.0     83.1583       45.0            2.0
25%       1.0       0.0       24.0       0.0       0.0      151.55       96.0            2.0
50%       1.0       1.0       36.0       1.0       1.0    221.7792      122.0            2.0
75%       1.0       1.0       50.0       1.0       2.0     262.375      124.0            2.0
max       1.0       1.0       67.0       3.0       4.0       263.0      135.0            2.0
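
The two-step filtering used for cluster 2 can also be written as one combined condition. The sketch below shows this on a small made-up frame (the `toy` data is hypothetical), assuming the notebook's column names. Each comparison needs its own parentheses because `&` binds more tightly than `==` in Python.

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names
toy = pd.DataFrame({
    'cluster_group': [2, 2, 2, 0, 0],
    'pclass':        [1, 1, 3, 1, 2],
    'survived':      [1, 0, 1, 0, 1],
})

# Combine both conditions element-wise into a single boolean mask
mask = (toy['cluster_group'] == 2) & (toy['pclass'] == 1)
first_class_in_2 = toy[mask]

# The mean of the 0/1 survived column is the survival rate of the subset
rate = first_class_in_2['survived'].mean()
print(len(first_class_in_2), rate)
```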


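You may have noticed that group 1 has a 100% survival rate, while group 3's survival rate is as low as 10%. Without being given the survival labels, the model grouped the passengers in a way that separates high-survival groups from low-survival ones, which gives us a starting point for investigating which factors were associated with death or survival. Two cautions apply before reading much into this. First, check how many passengers a cluster contains, since a very small cluster can reach an extreme rate easily. Second, the `boat` column was left among the features, and having a lifeboat recorded is closely tied to surviving, so part of the separation may come from that column.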