# DAT 19: Homework 3 Assignment - Clustering with K-means

## Instructions

In this homework assignment, we will get practice with our first unsupervised learning technique, clustering. We will analyze wholesale purchases by 440 clients of a wholesale distributor. 

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work.

**Please submit your completed notebook by 6:30PM on Wednesday, January 20.**

## About the Data

The [Wholesale Customers dataset](http://archive.ics.uci.edu/ml/datasets/Wholesale+customers) and a description of the data are available from the UCI ML Repository.

## Homework Assignment

**1) Load the dataset. Check for missing values, perform any normalization that you think is necessary (remember that K-means uses the Euclidean Distance function).**

In [None]:
#Import libraries
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
#Load the dataset
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv')

In [None]:
#Look at column names
print(list(df.columns.values))

In [None]:
#Lower-case all DataFrame column names
df.columns = df.columns.str.lower()

#List the column names and show dataframe attributes
print(list(df.columns.values))
print("\n...\n")
df.info()  #info() prints its summary directly and returns None

In [None]:
#Get quick count of rows in the DataFrame
len(df.index)
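
Question 1 also asks for a missing-value check. A minimal sketch of one way to do that is shown here on a small made-up DataFrame (the `toy` frame and its values are hypothetical, not taken from the wholesale data). On the real data, the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the wholesale DataFrame
toy = pd.DataFrame({
    'fresh': [100.0, 250.0, np.nan],
    'milk': [80.0, 90.0, 70.0],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True if any value in the frame is missing
has_missing = bool(toy.isnull().values.any())
print("Any missing values:", has_missing)
```

If every per-column count is zero, no imputation or row dropping is needed before scaling.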

In [None]:
#List unique values in the categorical variables
print("channel unique values:", df['channel'].unique())
print("region unique values:", df['region'].unique())

In [None]:
#List count of each unique value in each categorical variable
print(df['channel'].value_counts())
print(df['region'].value_counts())

In [None]:
#Rename channel to channel_1 so the column name still makes sense once it is binary
df = df.rename(columns={'channel' : 'channel_1'})

#Convert the channel codes to binary (1 -> 1, 2 -> 0)
df['channel_1'] = df.channel_1.map({1:1, 2:0})

In [None]:
#Create two binary indicator columns for region (region 2 is the baseline, coded 0 in both)
df['region_3'] = df['region'].map({1:0, 2:0, 3:1})
df['region_1'] = df['region'].map({1:1, 2:0, 3:0})

In [None]:
#Order columns
cols = list(df)
cols.insert(1, cols.pop(cols.index('region_3')))
cols.insert(2, cols.pop(cols.index('region_1')))
df = df.loc[:, cols]

#Remove the original 'region' column
del df['region']
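
As an alternative to mapping each category by hand, pandas can build the indicator columns in one call. The sketch below assumes a toy frame (`toy`, with hypothetical values) that has a `region` column coded 1 to 3. Note that `drop_first=True` drops the first level, so the baseline here is region 1, whereas the manual mapping above treats region 2 as the baseline.

```python
import pandas as pd

# Hypothetical toy data with a three-level categorical column
toy = pd.DataFrame({'region': [1, 2, 3, 3], 'fresh': [10, 20, 30, 40]})

# One indicator column per remaining level; the original 'region' column is removed
dummies = pd.get_dummies(toy, columns=['region'], drop_first=True, dtype=int)
print(dummies)
```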

In [None]:
#StandardScaler practice

#StandardScaler expects 2-D input, so select the column with double brackets
#std_scale = StandardScaler().fit(df[['fresh']])
#df_std = std_scale.transform(df[['fresh']])
#print('Mean after standardization:\nfresh={:.2f}'.format(df_std.mean()))
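
For reference, here is a minimal runnable sketch of standardizing a single column. The `toy` frame and its values are made up for illustration. The key detail is that `StandardScaler` expects a 2-D input, so the column is selected with double brackets.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for one spending column
toy = pd.DataFrame({'fresh': [100.0, 200.0, 300.0, 400.0]})

# Double brackets keep the selection 2-D, here with shape (4, 1)
scaled = StandardScaler().fit_transform(toy[['fresh']])

print('Mean after standardization: {:.2f}'.format(scaled.mean()))
print('Std after standardization: {:.2f}'.format(scaled.std()))
```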

**2.1) Look at the dataset. There are both continuous and categorical variables. What are the categorical variables? From a business perspective, what do those categorical variables represent?**

The categorical variables are 'Channel' and 'Region'. 'Channel' identifies the customer's sales channel: Horeca (Hotel/Restaurant/Café) or Retail. 'Region' identifies the customer's location: Lisbon, Oporto, or Other.

**2.2) What results might we expect from the k-means clustering if we were to run it on the dataset as-is? Explain your thinking in words.**

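Because K-means computes each centroid as the mean of its members, the binary features take fractional centroid values that do not correspond to any real category. This pulls the centroids closer together and makes each cluster center less meaningful.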
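
Feature scale is a related concern when running K-means on the data as-is. The sketch below compares how much a spending column and a binary column each contribute to a squared Euclidean distance. The two customers and their values are made up for illustration.

```python
import numpy as np

# Hypothetical customers: [annual spending on one category, binary channel flag]
a = np.array([12000.0, 1.0])
b = np.array([3000.0, 0.0])

squared_diffs = (a - b) ** 2
distance = np.sqrt(squared_diffs.sum())

# Share of the squared distance contributed by each feature
shares = squared_diffs / squared_diffs.sum()
print("Euclidean distance:", distance)
print("Share from spending:", shares[0])
print("Share from channel flag:", shares[1])
```

With unscaled inputs like these, the spending column accounts for almost all of the distance, which is why standardizing the continuous features matters before clustering.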

**3) Using ONLY the continuous features in the dataset, apply the K-means algorithm to find clusters in the data.**

In [None]:
#Select the six continuous spending columns by position
X = df.iloc[:,3:9]
X.head(5)

In [None]:
X = StandardScaler().fit_transform(X)

In [None]:
#KMeans() with no arguments uses the scikit-learn default of n_clusters=8
km = KMeans()
km.fit(X)

In [None]:
centers = km.cluster_centers_
centers

**4.1) Plot the Silhouette Coefficient as a function of the number of clusters (remember that you set the number of clusters as an input to K-means).**

In [None]:
labels = km.labels_
silhouette_score(X,labels,metric='euclidean')

In [None]:
#Fit a 2-cluster model for comparison
manykm = KMeans(n_clusters=2)
manykm.fit(X)
manycenters = manykm.cluster_centers_

manylabels = manykm.labels_
silhouette_score(X,manylabels,metric='euclidean')

In [None]:
for i in range(2):
    # select only data observations assigned to cluster i by the 2-cluster model
    ds = X[manylabels == i]
    # plot the data observations (first two standardized features only)
    plt.plot(ds[:,0],ds[:,1],'o')
    # plot the centroid of cluster i
    lines = plt.plot(manycenters[i,0],manycenters[i,1],'kx')
    # make the centroid x's bigger
    plt.setp(lines,ms=15.0)
    plt.setp(lines,mew=2.0)
plt.show()
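
Question 4.1 asks for the Silhouette Coefficient across several values of k. One way to sketch that is a loop over candidate values. The example below uses synthetic blobs from `make_blobs` rather than the wholesale data, and the range of 2 to 7, `n_init=10`, and `random_state=0` are assumptions chosen for illustration. With the real data, `X_demo` would be replaced by the scaled `X`.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated groups
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

# Fit one model per candidate k and record its silhouette score
k_values = list(range(2, 8))
scores = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    scores.append(silhouette_score(X_demo, model.labels_, metric='euclidean'))

best_k = k_values[scores.index(max(scores))]
print("Best k by silhouette:", best_k)

# Plot the score as a function of the number of clusters
plt.plot(k_values, scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Coefficient')
plt.show()
```

The loop starts at 2 because the Silhouette Coefficient is undefined for a single cluster.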

**4.2) What is the ideal value for k, the number of clusters? Why?**

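k = 2. Of the values tried here (the default of 8 and 2), it gives the highest Silhouette Coefficient, which indicates tighter and better-separated clusters.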

**4.3) How does your answer for 4.2 compare with your thoughts from 2.2 above?**

In [None]:
#your text based answer here. Feel free to convert this cell to markdown.

### Extra Credit Questions
**The following questions are strongly encouraged, but not required for this homework assignment.**

**5) Read the scikit-learn user guide section about [clustering](http://scikit-learn.org/stable/modules/clustering.html). Pay particular attention to the section about [assumptions](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html#example-cluster-plot-kmeans-assumptions-py).**

**6) PCA & PLOTTING:** <br> With six continuous features, plotting our clusters in two dimensions will be challenging. We can use [Principal Components Analysis](http://scikit-learn.org/stable/modules/decomposition.html#pca) and then plot only the "top two" dimensions. More technically, these are the dimensions that capture most of the variance in our data set. For this extra credit question, read about [PCA in the sklearn.decomposition module](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), apply it to the wholesale dataset, repeat the k-means clustering, and plot your results using only the top two principal components.
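
A rough sketch of that workflow is shown here on synthetic six-feature data from `make_blobs`, not on the wholesale dataset. The choice of three clusters, `n_init=10`, and `random_state=0` are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with six continuous features
X_demo, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

# Project onto the two components that capture the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Cluster in the reduced space and plot points colored by cluster label
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=km_demo.labels_)
plt.scatter(km_demo.cluster_centers_[:, 0], km_demo.cluster_centers_[:, 1],
            marker='x', s=150, c='k')
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.show()
```

The explained variance ratio gives a quick check on how much of the original variation the two-dimensional plot actually represents.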

In [None]:
#your code here, should you choose to attempt it