# K-Means Clustering of Titanic Passengers

Modified and updated from [this example](https://pythonprogramming.net/k-means-titanic-dataset-machine-learning-tutorial/).

The original idea and data come from the Titanic contest at [Kaggle](https://www.kaggle.com/c/titanic/overview).

[Link to Data](https://pythonprogramming.net/static/downloads/machine-learning-data/titanic.xls)

In [1]:
import numpy as np
from sklearn.cluster import KMeans
import pandas as pd

In [3]:
df = pd.read_excel('./data/titanic.xls')
df.head()

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |


In [4]:
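# Prepare the data: drop the columns that are not used as clustering features
df.drop(columns=['name','ticket','cabin','boat','body','home.dest'], inplace=True)
# Encode the two categorical columns as integers
df['sex'] = df['sex'].replace({'male':0,'female':1})
df['embarked'] = df['embarked'].replace({'C':1, 'Q':2, 'S':3})
# Drop every row that still has a missing value
df.dropna(inplace=True)
df.shape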

(1043, 8)
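
Mapping `embarked` to 1, 2, 3 tells K-Means that Queenstown lies halfway between Cherbourg and Southampton, and the algorithm will treat that spacing as a real distance. One alternative, not used in this notebook, is one-hot encoding. The following is a minimal sketch assuming `pd.get_dummies`; the `toy` frame is made up for illustration and is not the Titanic data.

```python
import pandas as pd

# Made-up rows standing in for the real passenger table.
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'embarked': ['S', 'C', 'Q', 'S'],
})

# Binary column: a 0/1 code carries no artificial ordering.
toy['sex'] = toy['sex'].map({'male': 0, 'female': 1})

# Three-valued column: one indicator column per port instead of 1/2/3.
encoded = pd.get_dummies(toy, columns=['embarked'], dtype=int)
print(encoded)
```

Each port then contributes its own 0/1 column, so no port is treated as numerically "between" the other two.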

In [5]:
X = df.drop(columns=['survived']).to_numpy()
Y = df['survived'].to_numpy()

In [7]:
# Fit K-Means with two clusters and store each passenger's cluster label
kmeans = KMeans(n_clusters=2, random_state=100)
df['predict'] = kmeans.fit_predict(X)
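
K-Means assigns points by Euclidean distance, so a column with a wide range such as `fare` (above 200 in the first row shown) can outweigh 0/1 columns such as `sex`. The correlation matrix later in this notebook, where `predict` correlates about 0.82 with `fare`, is consistent with that. A common remedy is to standardize the features first. The sketch below assumes scikit-learn's `StandardScaler`; the toy array is invented and this step is not part of the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up toy data: column 0 is a 0/1 flag, column 1 is a fare-like amount.
X_toy = np.array([
    [0, 7.25],
    [1, 71.28],
    [1, 7.92],
    [0, 53.10],
    [0, 8.05],
    [1, 211.34],
])

# Rescale every column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_toy)

# Cluster the scaled features; n_init is set explicitly for reproducibility.
kmeans_scaled = KMeans(n_clusters=2, n_init=10, random_state=100)
labels_scaled = kmeans_scaled.fit_predict(X_scaled)

print(X_scaled.std(axis=0))
print(labels_scaled)
```

In the notebook, the same idea would apply to `X` before calling `fit_predict`. Whether it improves the match with `survived` would need to be checked.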

In [10]:
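# Share of passengers whose cluster label equals `survived`.
# Cluster IDs are arbitrary, so a value below 0.5 would mean the labels are flipped.
df[df['survived']==df['predict']].shape[0] / df.shape[0]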

0.6222435282837967
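
K-Means numbers its clusters arbitrarily, so cluster 0 is not guaranteed to correspond to `survived == 0`. Below is a hedged sketch of an accuracy check that tries both labelings for the two-cluster case. The helper `aligned_accuracy` and the toy lists are hypothetical and not part of the original notebook.

```python
def aligned_accuracy(y_true, y_pred):
    """Accuracy for two-cluster labels, taking the better of the two labelings."""
    matches = sum(int(t == p) for t, p in zip(y_true, y_pred))
    raw = matches / len(y_true)
    # Swapping cluster IDs 0 and 1 turns every match into a mismatch and vice versa.
    return max(raw, 1.0 - raw)

# Made-up labels where the clusters are mostly right but numbered the other way round.
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [0, 0, 1, 1, 0, 0]

print(aligned_accuracy(y_true, y_pred))
```

In the notebook this could be called as `aligned_accuracy(df['survived'], df['predict'])`.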

Data:

- `pclass`: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- `survived`: Survival (0 = No; 1 = Yes)
- `name`: Name
- `sex`: Sex
- `age`: Age
- `sibsp`: Number of Siblings/Spouses Aboard
- `parch`: Number of Parents/Children Aboard
- `ticket`: Ticket Number
- `fare`: Passenger Fare (British pounds)
- `cabin`: Cabin
- `embarked`: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- `boat`: Lifeboat
- `body`: Body Identification Number
- `home.dest`: Home/Destination

# For Homework

Compute the correlation matrix with `df.corr()` and identify the variables most strongly correlated with `survived`.
Re-cluster the data using only one, two, or three of those variables, then check the prediction accuracy again.

In [7]:
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')

| | pclass | survived | sex | age | sibsp | parch | fare | embarked | predict |
|---|---|---|---|---|---|---|---|---|---|
| pclass | 1.0 | -0.317737 | -0.141032 | -0.409082 | 0.046333 | 0.016342 | -0.564558 | 0.276547 | -0.37698 |
| survived | -0.317737 | 1.0 | 0.536332 | -0.057416 | -0.011403 | 0.115436 | 0.247858 | -0.202258 | 0.172692 |
| sex | -0.141032 | 0.536332 | 1.0 | -0.066007 | 0.096464 | 0.222531 | 0.1864 | -0.109425 | 0.147441 |
| age | -0.409082 | -0.057416 | -0.066007 | 1.0 | -0.242345 | -0.149311 | 0.177205 | -0.083269 | 0.106192 |
| sibsp | 0.046333 | -0.011403 | 0.096464 | -0.242345 | 1.0 | 0.37396 | 0.142131 | 0.04551 | 0.052322 |
| parch | 0.016342 | 0.115436 | 0.222531 | -0.149311 | 0.37396 | 1.0 | 0.21765 | 0.01123 | 0.170967 |
| fare | -0.564558 | 0.247858 | 0.1864 | 0.177205 | 0.142131 | 0.21765 | 1.0 | -0.301455 | 0.817999 |
| embarked | 0.276547 | -0.202258 | -0.109425 | -0.083269 | 0.04551 | 0.01123 | -0.301455 | 1.0 | -0.191188 |
| predict | -0.37698 | 0.172692 | 0.147441 | 0.106192 | 0.052322 | 0.170967 | 0.817999 | -0.191188 | 1.0 |
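
To pick the variables for the homework, the `survived` row of the matrix above can be ranked by absolute value, since a strong negative correlation is as informative as a strong positive one. This is a small sketch using the values copied from that row. `predict` is left out because it is the clustering output rather than an input feature, and the variable names are illustrative.

```python
import pandas as pd

# Correlations with `survived`, copied from the matrix above.
corr_with_survived = pd.Series({
    'pclass': -0.317737,
    'sex': 0.536332,
    'age': -0.057416,
    'sibsp': -0.011403,
    'parch': 0.115436,
    'fare': 0.247858,
    'embarked': -0.202258,
})

# Rank by magnitude; the sign only gives the direction of the relationship.
ranked = corr_with_survived.abs().sort_values(ascending=False)
top_three = ranked.index[:3].tolist()
print(top_three)  # ['sex', 'pclass', 'fare']
```

In the notebook itself, the same series could be taken directly from the matrix with `corr['survived'].drop(['survived', 'predict'])`.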


In [25]:
# Reload and prepare the data again so the `predict` column from the first run is gone
df = pd.read_excel('./data/titanic.xls')
df.drop(columns=['name','ticket','cabin','boat','body','home.dest'], inplace=True)
df['sex']=df['sex'].replace({'male':0,'female':1})
df['embarked'] = df['embarked'].replace({'C':1, 'Q':2, 'S':3})
df.dropna(inplace=True)
df.head()

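| | pclass | survived | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 29.0 | 0 | 0 | 211.3375 | 3.0 |
| 1 | 1 | 1 | 0 | 0.9167 | 1 | 2 | 151.55 | 3.0 |
| 2 | 1 | 0 | 1 | 2.0 | 1 | 2 | 151.55 | 3.0 |
| 3 | 1 | 0 | 0 | 30.0 | 1 | 2 | 151.55 | 3.0 |
| 4 | 1 | 0 | 1 | 25.0 | 1 | 2 | 151.55 | 3.0 |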


In [26]:
# Keep only `survived` and `sex`, the feature most strongly correlated with survival
df.drop(columns=['embarked','fare','pclass','age','sibsp','parch'], inplace=True)
df.head()

| | survived | sex |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 0 |
| 4 | 0 | 1 |


In [27]:
X = df.drop(columns=['survived']).to_numpy()
Y = df['survived'].to_numpy()

In [28]:
# Fit K-Means again, this time on the single `sex` feature
kmeans = KMeans(n_clusters=2, random_state=100)
df['predict'] = kmeans.fit_predict(X)

In [29]:
# Share of passengers whose cluster label equals `survived`
df[df['survived']==df['predict']].shape[0] / df.shape[0]

0.7785234899328859

In [30]:
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')

| | survived | sex | predict |
|---|---|---|---|
| survived | 1.0 | 0.536332 | 0.536332 |
| sex | 0.536332 | 1.0 | 1.0 |
| predict | 0.536332 | 1.0 | 1.0 |


In [31]:
df.head()

| | survived | sex | predict |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 |
| 4 | 0 | 1 | 1 |
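
The correlation of 1.0 between `sex` and `predict` shows that, with a single binary feature, the two clusters are simply the two values of that feature. The 0.7785 accuracy is therefore the accuracy of the rule "women survived, men did not". A minimal sketch of this behaviour on a made-up 0/1 array (not the Titanic data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 0/1 feature standing in for the encoded `sex` column.
flag = np.array([1, 0, 1, 0, 1, 0, 0, 1])
X_one = flag.reshape(-1, 1)  # K-Means expects a 2-D array

labels = KMeans(n_clusters=2, n_init=10, random_state=100).fit_predict(X_one)

# The labels reproduce the feature, up to the arbitrary numbering of the clusters.
same = np.array_equal(labels, flag)
flipped = np.array_equal(labels, 1 - flag)
print(same or flipped)
```

So in this single-feature case, K-Means does no more than split the passengers on `sex`.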
