# Unsupervised Learning: Clustering Lab

In [8]:
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
import numpy as np
import pandas as pd
from scipy.io import arff

## 1. Initial practice with the K-means and HAC algorithms

### 1.1 (10%) K-means
Run K-means on this [Abalone Dataset.](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/abalone.arff)
The dataset was modified to be smaller. The last datapoint should be on line 359 or the point 0.585,0.46,0.185,0.922,0.3635,0.213,0.285,10. The remaining points are commented out. Treat the output class (last column) as an additional input feature. Create your K-means model with the parameters KMeans(n_clusters=3, init='random', n_init=1).

Output the following:
- Class label for each point (labels_)
- The k=3 cluster centers (cluster_centers_)
- Number of iterations it took to converge (n_iter_)
- Total sum squared error of each point from its cluster center (inertia_)
- The total average silhouette score (see sklearn.metrics silhouette_score)
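
For reference, the last two quantities in the list above can be computed by hand. The sketch below is a minimal NumPy version, assuming Euclidean distance and at least two clusters; the function names `inertia` and `silhouette_mean` and the toy data are illustrative and not part of the lab. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the silhouette is `(b - a) / max(a, b)`.

```python
import numpy as np

def inertia(X, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))

def silhouette_mean(X, labels):
    """Mean silhouette over all points (Euclidean distance, >= 2 clusters assumed)."""
    n = len(X)
    # Pairwise Euclidean distances, shape (n, n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton cluster is given a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Toy data (hypothetical): two tight, well-separated groups on a line
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.5], [10.5]])

sse = inertia(X, labels, centers)
sil = silhouette_mean(X, labels)
print("inertia:", sse)
print("mean silhouette:", round(sil, 4))
```

On real data, `model.inertia_` and `silhouette_score(X, model.labels_)` should be preferred; the sketch is only meant to make the definitions concrete.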

In [99]:
from sklearn.preprocessing import MinMaxScaler

# Read the ARFF file and convert it to a DataFrame.
# The output class (last column) is kept as an additional input feature.
abalone_data = arff.loadarff('abalone.arff')
abalone_df = pd.DataFrame(abalone_data[0])

# Min-max normalized copy of the data. It is computed here but not used below:
# the model is fit on the raw abalone_df, whose last column has much larger
# values than the other features (see the cluster centers in the output).
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(abalone_df)

# K-means with Abalone
model = KMeans(n_clusters=3, init='random', n_init=1)
model.fit(abalone_df)
print("The array of the class label for each point is: ")
print(model.labels_)
print("\nThe centers of each cluster are as shown: ")
print(model.cluster_centers_)
print("\nIt took " + str(model.n_iter_) + " iterations to converge.")
print("The total sum squared error of each point from its cluster center is ", model.inertia_)
print("The total average silhouette score is ", silhouette_score(abalone_df, model.labels_))


The array of the class label for each point is: 
[0 1 1 1 1 1 0 0 1 0 2 1 2 1 1 2 1 1 1 1 2 1 2 1 1 2 2 2 0 2 1 0 0 0 2 1 0
 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 0 2
 2 0 1 1 2 2 1 2 0 0 2 2 2 1 1 2 0 2 2 1 0 2 1 1 1 1 1 0 0 1 2 2 2 1 1 1 1
 1 1 1 2 2 2 1 1 1 1 1 2 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 2 1 1 1 1
 1 1 0 1 1 1 2 1 2 0 2 2 2 2 2 0 0 2 0 0 2 2 2 1 1 1 1 1 1 1 1 2 2 0 1 1 2
 2 2 1 2 1 2 2 2 1 1 2 2 0 0 1]

The centers of each cluster are as shown: 
[[ 0.61366667  0.48933333  0.16716667  1.29283333  0.48815     0.25873333
   0.45488333 17.06666667]
 [ 0.44221239  0.3439823   0.11331858  0.49762832  0.20678319  0.11430973
   0.15534513  8.24778761]
 [ 0.58122807  0.45649123  0.1604386   1.05523684  0.41045614  0.231
   0.33986842 12.33333333]]

It took 4 iterations to converge.
The total sum squared error of each point from its cluster center is  540.2109608826968
The total average silhouette score is  0.5010728634549578


*Discussion*: The three clusters have different population sizes, with cluster '0' being the smallest and cluster '1' holding the most datapoints. The centroids of cluster '0' and cluster '2' are fairly close together, while the centroid of cluster '1' is a little further away. The model took only 4 iterations to converge, which was a lot fewer than I expected. I suspect this is because we are working with a smaller dataset and perhaps the clusters are fairly distinct. However, the sum squared error and the silhouette score work against this hypothesis: the silhouette score was only about .50, which is not a strong score. This means there were likely points that K-means assigned to one cluster that really should have been assigned to another. My guess is that the silhouette score would improve if the K-means model were trained with a different number of centroids.

### 1.2 (10%) Hierarchical Agglomerative Clustering (HAC) 

Run HAC on the same [Abalone Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/abalone.arff) using complete linkage and k=3.

Output the following:
- Class label for each point (labels_)
- The total average silhouette score

In [131]:
# HAC with Abalone
model = AgglomerativeClustering(linkage='complete', n_clusters=3)
model.fit(abalone_df)
print("The array of the class label for each point is: ")
print(model.labels_)

print("\nThe total average silhouette score is ",
      silhouette_score(abalone_df, model.labels_))

The array of the class label for each point is: 
[1 0 0 0 0 0 2 1 0 2 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 2 2 1 0 1
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 2 0
 1 1 0 0 0 1 0 1 1 2 1 1 1 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 2 2 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 1 0 0 0 1 0 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 0 0 0 0 0 0 0 0 0 1 2 0 0 1
 1 1 0 0 0 1 1 1 0 0 1 0 1 1 0]

The total average silhouette score is  0.5589106353312348


*Discussion*: The HAC model clustered the points differently than the K-means model did. However, the two ended with roughly the same silhouette score: HAC scored about .56, which was around .06 better than the K-means model. In class we discussed that complete linkage in HAC often finds more compact clusters, which I would assume applies to the clusters in this dataset. We also discussed that ward linkage is often the most suitable method for quantitative features, so my prediction is that a HAC model trained with ward linkage would do better than both the K-means and the complete linkage models.

## 2. K-means Clustering with the [Iris Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/iris.arff)
Use the Iris data set for 2.1 and 2.2.  Don't include the output label as one of the input features.

### 2.1 (20%) K-means Initial Centroids Experiments
K-means results differ based on the initial centroids used.
- Run K-means 5 times with *k*=4, each time with different initial random centroids (init='random') and with n_init=1.  Give inertia and silhouette scores for each run and discuss any variations in the results.
- Sklearn has a parameter that does this automatically (n_init).  n_init = z runs K-means z times, each with different random centroids, and returns the clustering with the best SSE (inertia) of the z runs. Try it out and discuss how it does and how it compares with your 5 runs above.
- Sklearn also has a parameter (init='k-means++') which first picks initial centroids that are spread out across the data, and then runs regular K-means with these centroids.  Try it out (with n_init = 1) and discuss.
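
To make the idea behind k-means++ style seeding concrete, here is a minimal NumPy sketch. It assumes the textbook procedure: the first centroid is drawn uniformly, and each later centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen. The function name `kmeans_pp_init` and the toy data are hypothetical, and scikit-learn's own implementation may differ in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids from the rows of X (k-means++ style seeding)."""
    n = len(X)
    centers = [X[rng.integers(n)]]  # first centroid: uniform at random
    for _ in range(1, k):
        chosen = np.array(centers)
        # Squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2;
        # points that are already centroids have d2 == 0 and cannot be re-picked
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
    return np.array(centers)

# Toy data (hypothetical): three small groups of distinct points
X = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
    [10.0, 0.0], [10.2, 0.1], [9.9, 0.3],
])
rng = np.random.default_rng(0)
centers = kmeans_pp_init(X, 3, rng)
print(centers)
```

Because far-away points are more likely to be sampled, the seeds tend to land in different groups, which is why a single k-means++ run is often competitive with several random restarts.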

In [153]:
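# K-means initial centroid experiments
from IPython.display import HTML, display
from tabulate import tabulate

# NOTE: this cell runs on abalone_df with n_clusters=3, not on Iris with k=4
# as the prompt above asks; the table below reflects that.
table = []
for i in range(5):
    # A fresh random initialization on every trial
    model = KMeans(n_clusters=3, init='random', n_init=1)
    model.fit(abalone_df)
    table.append([i+1, model.inertia_, silhouette_score(abalone_df, model.labels_)])

headers = ["Trial", "Inertia", "Silhouette Score"]

# n_init=5 keeps the best (lowest-inertia) of 5 runs.
# n_clusters and init are left at the KMeans defaults (8 and 'k-means++'),
# so this inertia is not directly comparable with the k=3 trials above.
model = KMeans(n_init=5)
model.fit(abalone_df)
table.append(['n_init=5', model.inertia_, silhouette_score(abalone_df, model.labels_)])

# A single k-means++ initialization, also with the default n_clusters=8
model = KMeans(init='k-means++', n_init=1)
model.fit(abalone_df)
table.append(['K-means++', model.inertia_, silhouette_score(abalone_df, model.labels_)])

display(HTML(tabulate(table, headers=headers, tablefmt='html')))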

| Trial | Inertia | Silhouette Score |
|---|---|---|
| 1 | 514.399 | 0.511755 |
| 2 | 529.254 | 0.518423 |
| 3 | 529.254 | 0.518423 |
| 4 | 529.254 | 0.518423 |
| 5 | 529.254 | 0.518423 |
| n_init=5 | 78.0255 | 0.532056 |
| K-means++ | 82.3285 | 0.537351 |


Results and Discussion

### 2.2 (20%) Silhouette Graphs
In this part you will show silhouette graphs for different *k* values.  Install the [Yellowbrick visualization package](https://www.scikit-yb.org/en/latest/quickstart.html) and import the [Silhouette Visualizer](https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html).  This library includes lots of visualization packages which you might find useful. (Note: The YellowBrick silhouette visualizer does not currently support HAC).
- Show Silhouette graphs for clusterings with *k* = 2-6. Print the SSE (inertia) and total silhouette score for each.
- Learn with the default n_init = 10 to help ensure a decent clustering.
- Using the silhouette graphs, choose which *k* you think is best and discuss why. Think about and discuss more than just the total silhouette score.

In [None]:
# Iris Clustering with K-means and silhouette graphs
from yellowbrick.cluster import SilhouetteVisualizer

Discuss your results and justify which clustering is best based on the silhouette graphs

## 3 (20%) Iris Clustering with HAC

- Use the same dataset as above and learn with HAC clustering
- Create one table with silhouette scores for k=2-6 for each of the linkage options single, average, complete, and ward
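
One possible way to build such a table is sketched below. It assumes scikit-learn's bundled `load_iris` as a stand-in for the `iris.arff` file linked above, and it uses pandas only for display; the variable names are illustrative.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Assumption: load_iris stands in for iris.arff; the class label is not used
X = load_iris().data

linkages = ["single", "average", "complete", "ward"]
ks = range(2, 7)

# One row per linkage, one column per k
rows = {}
for linkage in linkages:
    rows[linkage] = [
        silhouette_score(
            X, AgglomerativeClustering(n_clusters=k, linkage=linkage).fit_predict(X)
        )
        for k in ks
    ]

table = pd.DataFrame(rows, index=[f"k={k}" for k in ks]).T
print(table.round(3))
```

HAC is deterministic for a fixed linkage, so unlike K-means there is no need to repeat runs with different seeds.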

In [None]:
#HAC with Iris

*Discussion and linkage comparison*

## 4 (20%) Run both algorithms on a real-world data set
- Choose any real world data set which you have not used previously
- Use parameters of your choosing
- Try each algorithm a few times with different parameters and output one typical example of labels and silhouette scores for each algorithm
- Show the silhouette graph for at least one reasonable *k* value for K-means

In [None]:
# Run both algorithms on a data set of your choice

*Discussion and comparison*

## 5. Extra Credit for Coding Your Own Clustering Algorithms
### 5.1 (Optional 10% extra credit) Code up the K-means clustering algorithm 
Below is a scaffold you could use if you want. As above, you only need to support numeric inputs, but think about how you would support nominal inputs and unknown values. Requirements for this task:
- Your model should support the methods shown in the example scaffold below.
- Ability to choose *k* and specify the *k* initial centroids.
- Run and show the cluster label for each point with both the Iris data set and the data set of your choice above.

### 5.2 (Optional 10% extra credit) Code up the HAC clustering algorithm 

- Your model should support the methods shown in the example scaffold below.
- HAC should support both single link and complete link options.
- HAC automatically generates all clusterings from *n* to 2.  You just need to output results for the current chosen *k*.
- Run and show the cluster label for each point with both the Iris data set and the data set of your choice above.

Discussion and comparison of each model implemented

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin, ClusterMixin

class KMEANSClustering(BaseEstimator,ClusterMixin):

    def __init__(self,k=3,debug=False): ## add parameters here
        """
        Args:
            k = how many final clusters to have
            debug = if debug is true use the first k instances as the initial centroids otherwise choose random points as the initial centroids.
        """
        self.k = k
        self.debug = debug

    def fit(self, X, y=None):
        """ Fit the data; In this lab this will make the K clusters :D
        Args:
            X (array-like): A 2D numpy array with the training data
            y (array-like): An optional argument. Clustering is usually unsupervised so you don't need labels
        Returns:
            self: this allows this to be chained, e.g. model.fit(X,y).predict(X_test)
        """
        return self
    
    def print_labels(self): # Print the cluster label for each data point
        pass
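
As a starting point for `fit`, the core loop of K-means (Lloyd's algorithm) could look roughly like the sketch below. It is a standalone function rather than a method, and it assumes numeric inputs, Euclidean distance, and that an empty cluster simply keeps its previous centroid; `lloyd_kmeans` and the toy data are hypothetical names. Passing `X[:k]` as the initial centroids mirrors the `debug=True` behavior described in the scaffold.

```python
import numpy as np

def lloyd_kmeans(X, init_centers, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centers = np.array(init_centers, dtype=float)
    labels = None
    for n_iter in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest centroid
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                centers[j] = members.mean(axis=0)
    return labels, centers, n_iter

# Toy data (hypothetical); the first k rows serve as the initial centroids
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
k = 2
labels, centers, n_iter = lloyd_kmeans(X, X[:k])
print(labels)
print(centers)
```

Even with both initial centroids in the same group, the second centroid is pulled toward the far group after the first update, and the labels settle on the two natural pairs.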

In [None]:
class HACClustering(BaseEstimator,ClusterMixin):

    def __init__(self,k=3,link_type='single'): ## add parameters here
        """
        Args:
            k = how many final clusters to have
            link_type = single or complete. when combining two clusters use complete link or single link
        """
        self.link_type = link_type
        self.k = k
        
    def fit(self, X, y=None):
        """ Fit the data; In this lab this will make the K clusters :D
        Args:
            X (array-like): A 2D numpy array with the training data
            y (array-like): An optional argument. Clustering is usually unsupervised so you don't need labels
        Returns:
            self: this allows this to be chained, e.g. model.fit(X,y).predict(X_test)
        """
        return self
    
    def print_labels(self): # Print the cluster label for each data point
        pass
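
For the HAC scaffold, a naive version of the merge loop is sketched below. It assumes Euclidean distance and supports only the single and complete link options; the function name `hac_labels` and the toy data are hypothetical, and the brute-force search over cluster pairs is written for clarity rather than speed.

```python
import numpy as np

def hac_labels(X, k, link_type="single"):
    """Merge the two closest clusters repeatedly until only k remain."""
    if link_type not in ("single", "complete"):
        raise ValueError("link_type must be 'single' or 'complete'")
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(n)]  # start with every point on its own
    # single link: closest pair of members; complete link: farthest pair
    reduce_fn = np.min if link_type == "single" else np.max
    while len(clusters) > k:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce_fn(dists[np.ix_(clusters[a], clusters[b])])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Toy data (hypothetical): a 1-D chain with slowly growing gaps
X = np.array([[0.0], [2.0], [4.5], [7.5], [11.0]])
single = hac_labels(X, 2, "single")
complete = hac_labels(X, 2, "complete")
print("single:  ", single)
print("complete:", complete)
```

On this chain, single link keeps absorbing the next nearest point and leaves only the last point on its own, while complete link splits the data into a group of two and a group of three. That is the chaining behavior that usually distinguishes the two options.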