# Exercise 9

In this exercise, you will apply the k-means clustering algorithm to the *diabetes* dataset. The goal is to cluster the data using its attributes and to compare the resulting clusters against the true class (tested_positive or tested_negative). In other words, the 'class' attribute serves as ground truth for evaluating the clustering results.

**1.** Load the dataset into a pandas DataFrame object named *data*. Extract the last column (*data['class']*) of the DataFrame and store it as a pandas Series object named *classes*. Remove the last column from *data* and display the resulting DataFrame.

**Solution:**

In [1]:
import pandas as pd

data = pd.read_csv("diabetes.csv")
classes = pd.Series(data['class'])  # extract the 'class' column 
data = data.drop(columns='class')    # drop the 'class' column (a positional axis argument is rejected by recent pandas)
data

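|  | preg | plas | pres | skin | insu | mass | pedi | age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 |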
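
An alternative way to separate the label column is `DataFrame.pop`, which removes a column from the frame and returns it as a Series in one step. The sketch below assumes a small hand-typed frame named `toy` (a hypothetical stand-in, not the full diabetes.csv file):

```python
import pandas as pd

# Hypothetical stand-in for the diabetes data: three hand-typed rows.
toy = pd.DataFrame({
    "preg": [6, 1, 8],
    "plas": [148, 85, 183],
    "class": ["tested_positive", "tested_negative", "tested_positive"],
})

# pop() drops the column from `toy` in place and returns it as a Series.
toy_classes = toy.pop("class")

print(list(toy.columns))     # remaining feature columns
print(toy_classes.tolist())  # extracted labels
```

Unlike `drop`, `pop` modifies the frame in place, so no reassignment is needed.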


**2.** Convert the class values from strings to integers: tested_positive maps to 1 and tested_negative maps to 0. Change the data type of the series to 'int32' using the astype() function, then display the resulting *classes* series.

In [2]:
classes = classes.replace("tested_positive",1)
classes = classes.replace("tested_negative",0)
classes = classes.astype('int32')
classes

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: class, Length: 768, dtype: int32
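
The same conversion can also be written with a single dictionary lookup through `Series.map`. This is a minimal sketch on a made-up three-element series; the names `labels`, `mapping`, and `encoded` are illustrative:

```python
import pandas as pd

# Made-up labels standing in for the 'class' column.
labels = pd.Series(
    ["tested_positive", "tested_negative", "tested_positive"], name="class"
)

# One dictionary describes the whole encoding.
mapping = {"tested_positive": 1, "tested_negative": 0}

# map() looks up each value; values missing from the dictionary become NaN,
# so the cast to int32 would raise instead of silently keeping a stray label.
encoded = labels.map(mapping).astype("int32")

print(encoded.tolist())
print(encoded.dtype)
```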

**3.** Standardize the *data* by subtracting from each column its mean and dividing by its standard deviation. Display the DataFrame after standardization.

In [3]:
def standardize(i):
    result = (data[i] - data[i].mean()) / data[i].std()
    return result

data.preg = standardize('preg')
data.plas = standardize('plas')
data.pres = standardize('pres')
data.skin = standardize('skin')
data.insu = standardize('insu')
data.mass = standardize('mass')
data.pedi = standardize('pedi')
data.age = standardize('age')

data

|  | preg | plas | pres | skin | insu | mass | pedi | age |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.639530 | 0.847771 | 0.149543 | 0.906679 | -0.692439 | 0.203880 | 0.468187 | 1.425067 |
| 1 | -0.844335 | -1.122665 | -0.160441 | 0.530556 | -0.692439 | -0.683976 | -0.364823 | -0.190548 |
| 2 | 1.233077 | 1.942458 | -0.263769 | -1.287373 | -0.692439 | -1.102537 | 0.604004 | -0.105515 |
| 3 | -0.844335 | -0.997558 | -0.160441 | 0.154433 | 0.123221 | -0.493721 | -0.920163 | -1.040871 |
| 4 | -1.141108 | 0.503727 | -1.503707 | 0.906679 | 0.765337 | 1.408828 | 5.481337 | -0.020483 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 1.826623 | -0.622237 | 0.356200 | 1.721613 | 0.869464 | 0.115094 | -0.908090 | 2.530487 |
| 764 | -0.547562 | 0.034575 | 0.046215 | 0.405181 | -0.692439 | 0.609757 | -0.398023 | -0.530677 |
| 765 | 0.342757 | 0.003299 | 0.149543 | 0.154433 | 0.279412 | -0.734711 | -0.684747 | -0.275580 |
| 766 | -0.844335 | 0.159683 | -0.470426 | -1.287373 | -0.692439 | -0.240048 | -0.370859 | 1.169970 |
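
Because pandas broadcasts column-wise statistics across rows, the per-column calls can be collapsed into one expression. The sketch below uses a hypothetical numeric frame `toy` and checks that every standardized column ends up with mean 0 and standard deviation 1:

```python
import pandas as pd

# Hypothetical numeric frame; the values are arbitrary.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 60.0],
})

# mean() and std() return one value per column, and the arithmetic is
# aligned on column names, so every column is standardized at once.
# Note: pandas std() uses the sample standard deviation (ddof=1).
standardized = (toy - toy.mean()) / toy.std()

print(standardized.mean().round(6))  # approximately 0 for each column
print(standardized.std().round(6))   # approximately 1 for each column
```

Since `std()` defaults to `ddof=1`, the result differs slightly from scalers that divide by the population standard deviation (`ddof=0`).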


**4.** Apply k-means clustering to partition the data into 2 clusters. To ensure repeatability, set the random_state of the clustering function to 1. Create a dataframe object to store the two cluster centroids and display them. 

**Solution:**

In [4]:
from sklearn import cluster

# n_init=10 matches the older scikit-learn default; newer releases use 'auto'
k_means = cluster.KMeans(n_clusters=2, n_init=10, random_state=1)
k_means.fit(data)

# one row per cluster centroid, one column per standardized attribute
centroids = k_means.cluster_centers_
pd.DataFrame(centroids, columns=data.columns)

|  | preg | plas | pres | skin | insu | mass | pedi | age |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.528901 | -0.247804 | -0.225729 | 0.086659 | 0.018109 | -0.067391 | -0.016456 | -0.575867 |
| 1 | 0.948174 | 0.444244 | 0.40467 | -0.155356 | -0.032464 | 0.120813 | 0.029501 | 1.032372 |
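
To see what the centroids represent, it can help to run k-means on data where the answer is obvious. The sketch below assumes a hypothetical array `X` with two tight groups of points; each fitted centroid should be the mean of one group:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two tight groups near (0, 0) and (10, 10).
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

# Cluster ids are arbitrary, so sort the centroids to get a stable order.
centers = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]

print(centers)      # the mean of each group of three points
print(km.labels_)   # one cluster id per row of X
print(km.inertia_)  # sum of squared distances to the nearest centroid
```

In the exercise, the data are standardized, so each centroid coordinate reads as a number of standard deviations above or below the column mean.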


**5.** Store the cluster labels generated by k-means into a pandas Series object named *clusters*. Show the distribution of the clusters using value_counts().

In [5]:
clusters = pd.Series(k_means.labels_)  # the exercise asks for a Series, not an Index
clusters.value_counts()

0    493
1    275
dtype: int64
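
The same distribution can be read as proportions by passing `normalize=True` to `value_counts()`. The sketch below rebuilds a label vector from the counts shown above (493 and 275) instead of refitting the model; the names `labels`, `counts`, and `shares` are illustrative:

```python
import numpy as np
import pandas as pd

# Rebuild a label vector with 493 zeros and 275 ones (counts from the output above).
labels = pd.Series(np.repeat([0, 1], [493, 275]))

counts = labels.value_counts()                # absolute cluster sizes
shares = labels.value_counts(normalize=True)  # fraction of points per cluster

print(counts)
print(shares.round(3))
```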

**6.** To evaluate the clusters, you will compare the cluster assignments against the ground-truth classes (from step 2 above). Compute the confusion matrix that shows the number of points from each class assigned to each cluster. Store the confusion matrix as a pandas DataFrame object (where the rows correspond to the true classes and the columns are the cluster labels).

In [6]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(classes, clusters)
pd.DataFrame(cm,columns=['Cluster 1','Cluster 2'],index=['tested negative','tested positive'])

|  | Cluster 1 | Cluster 2 |
|---|---|---|
| tested negative | 372 | 128 |
| tested positive | 121 | 147 |
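
One way to condense the confusion matrix into a single number is cluster purity: assign each cluster to its majority class and count the fraction of points that agree. Purity is not part of the original exercise; the sketch below simply applies it to the counts shown above, under the assumption that rows are true classes and columns are clusters:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = cluster.
cm = np.array([[372, 128],
               [121, 147]])

# For each cluster (column), take the size of its majority class,
# then divide the total of those by the number of points.
majority_per_cluster = cm.max(axis=0)
purity = majority_per_cluster.sum() / cm.sum()

print(majority_per_cluster)
print(round(purity, 4))
```

Cluster ids produced by k-means are arbitrary, which is why a measure such as purity looks at the majority class per cluster instead of assuming that cluster 0 corresponds to class 0.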
