# Challenge: Boston Marathon

---
There is a lot of information on [runners and their performance for the Boston Marathon](https://github.com/llimllib/bostonmarathon). Pick a year (post-2012 has more info) and do some clustering.

Specifically, use the tools at hand to determine which clustering solution, including number of clusters and algorithm used, is best for the marathon data. Once you have a solution you like, write a data story, including visualizations, where you teach the reader something about the Boston Marathon based on your clusters.
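One common way to compare clustering solutions is the silhouette score, which is higher when points sit close to their own cluster and far from the others. The sketch below is illustrative only: it runs on synthetic data from `make_blobs` rather than the marathon file, and the names `X_demo`, `X_scaled`, and `scores` are assumptions introduced for this example. The same loop could be pointed at the scaled marathon features.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the marathon features (assumption: three separated groups).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=42,
)
X_scaled = StandardScaler().fit_transform(X_demo)

scores = {}

# Score k-means for several candidate cluster counts.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[f'kmeans_k={k}'] = silhouette_score(X_scaled, labels)

# Score mean shift, which picks its own number of clusters from the bandwidth.
bandwidth = estimate_bandwidth(X_scaled, quantile=0.2, random_state=42)
ms_labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(X_scaled)
if len(np.unique(ms_labels)) > 1:  # silhouette needs at least two clusters
    scores['meanshift'] = silhouette_score(X_scaled, ms_labels)

# Print candidates from best to worst silhouette score.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {score:.3f}')
```

The silhouette score is only one criterion. A solution that scores slightly lower but yields clusters that are easier to describe may still be the better choice for a data story.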

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import scipy
import seaborn as sns

%matplotlib inline

In [5]:
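# Load the results file; adjust the path to wherever results.csv is stored locally.
try:
    file = 'C:/Users/18047/Downloads/results.csv'
    df = pd.read_csv(file)
except FileNotFoundError:
    # Fallback path (left empty here); set it to an alternate location of results.csv.
    file = ''
    df = pd.read_csv(file)

# Drop split times and identifying fields that are not used for clustering.
df = df.drop(columns=['5k', '10k', '20k', '25k', '30k', '35k', '40k', 'name', 'city', 'state', 'ctz', 'bib', 'half', 'pace'])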

df.head()

Unnamed: 0,age,division,gender,official,country,overall,genderdiv
0,28,9,M,90.9,CAN,9,9
1,30,5,M,132.5,KEN,5,5
2,23,1,M,130.37,ETH,1,1
3,32,5,M,88.43,AUS,5,5
4,39,3,M,87.22,JPN,3,3


In [6]:
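# Convert country from string to number.
# Sorting makes the mapping identical on every run (iteration order of a set of strings is not stable).
# These integers are arbitrary labels, so numeric distances between them carry no meaning.

countries = sorted(df['country'].dropna().unique())
c_list = {c: i for i, c in enumerate(countries, start=1)}

df['country'] = df['country'].map(c_list)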

# Convert gender to boolean
df['gender'] = (df['gender'] == 'M')
df = df.rename(columns={'gender':'male'})
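Because distance-based clustering treats the integer country codes as if they were ordered quantities, one-hot encoding is a possible alternative. The sketch below uses a small made-up frame (`toy`, with hypothetical values) rather than the marathon data, and `pd.get_dummies` to produce one indicator column per country.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the marathon results.
toy = pd.DataFrame({
    'country': ['KEN', 'ETH', 'USA', 'KEN'],
    'official': [130.4, 132.5, 210.0, 128.9],
})

# One indicator column per country; no artificial ordering between countries.
encoded = pd.get_dummies(toy, columns=['country'], dtype=int)
print(encoded)
```

With many distinct countries this adds many sparse columns, so grouping rare countries into an "other" bucket first may be worth considering.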

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.cluster import MeanShift, estimate_bandwidth

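# Use every column except age as clustering features; keep age aside for later comparison.
X = df.drop(columns='age')
Y = df['age']

# test_size=0.9 keeps only 10% of the rows for fitting, which keeps mean shift
# (roughly quadratic in the number of samples) tractable.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.9, random_state=42)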

bandwidth = estimate_bandwidth(X_train, quantile=0.2, n_samples=500)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X_train)
labels = ms.labels_

n_clusters = len(np.unique(labels))
print('Number of clusters: ', n_clusters)

Number of clusters:  3


In [14]:
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

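# normalize() rescales each row (runner) to unit length; it does not standardize columns.
X_norm = normalize(X)

# Project onto two principal components before clustering.
X_pca = PCA(2).fit_transform(X_norm)

# Reuse the cluster count that mean shift found on the 10% subsample.
y_pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(X_pca)

# confusion_matrix expects true and predicted labels drawn from the same label set.
# Here it compares ages with cluster ids, so most cells are zero.
print(confusion_matrix(Y, y_pred))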

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]]


In [15]:
print(pd.crosstab(y_pred, Y))

age    18  19  20  21  22  23   24   25   26   27 ...  69  70  71  72  73  74  \
row_0                                             ...                           
0       1   0   1   6   3   9    8   14   12   17 ...   8  10   2   4   3   0   
1       5  14  22  33  53  98  152  153  167  205 ...   0   1   0   0   0   1   
2       7  12  27  39  45  62   88  112  147  150 ...   0   0   0   0   0   0   

age    75  76  78  80  
row_0                  
0       3   1   1   1  
1       0   0   0   0  
2       0   0   0   0  

[3 rows x 61 columns]


In [None]:
# Open question: how should the confusion matrix and crosstab above be interpreted?
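One possible way to read a cluster-by-age crosstab is to bin ages into a few groups and convert counts to row shares, so each row shows how a cluster's runners are distributed across age groups. A confusion matrix is less suited here, since ages and cluster ids are not the same set of labels. The sketch below uses randomly generated stand-in data; `ages`, `clusters`, and the bin edges are assumptions for illustration, and in the notebook they would correspond to `Y` and `y_pred`.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): a younger cluster 0 and an older cluster 1.
rng = np.random.default_rng(42)
ages = pd.Series(
    np.concatenate([rng.integers(18, 40, 200), rng.integers(55, 80, 50)]),
    name='age',
)
clusters = pd.Series(
    np.concatenate([np.zeros(200, dtype=int), np.ones(50, dtype=int)]),
    name='cluster',
)

# Bin ages so the table has a handful of columns instead of one per year.
age_group = pd.cut(
    ages,
    bins=[17, 29, 39, 49, 59, 90],
    labels=['18-29', '30-39', '40-49', '50-59', '60+'],
).rename('age_group')

# normalize='index' turns counts into shares within each cluster (rows sum to 1).
shares = pd.crosstab(clusters, age_group, normalize='index')
print(shares.round(2))

# Summary statistics of age per cluster give a compact second view.
print(ages.groupby(clusters).describe())
```

Read this way, the crosstab shown earlier suggests that cluster 0 holds nearly all runners aged 69 and over, while clusters 1 and 2 are dominated by runners in their twenties.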