## Project 6: Clustering
- Name: Eli Billinger
- Date: 10/4/2020

## Instructions

### Description

Practice clustering using the well-known and very popular `Iris` dataset! The Iris flower data set is fun for learning supervised classification algorithms, and is known as a difficult case for unsupervised learning.
https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html
<br><br>Yes, there are many examples out there, but see if you can do it yourself :). We can easily hypothesize about how many clusters would yield the best result, so let us test it through a simple experiment that you could repeat with additional data sets.

### Grading

For grading purposes, we will clear all outputs from all your cells and then run them all from the top.  Please test your notebook in the same fashion before turning it in.

### Submitting Your Solution

To submit your notebook, first clear the outputs of all the cells (this won't matter too much this time, but for larger data sets in the future, it will make the file smaller). Then use File->Download As->Notebook to obtain the notebook file. Finally, submit the notebook file on Canvas.

### Setup

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
from sklearn.decomposition import PCA
from sklearn import datasets
import sklearn as sk
import sklearn.metrics  # makes sk.metrics.silhouette_score available explicitly
import numpy as np
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

### Problem 1: Data Generation (5 points)
Reference for more information: Chapter 5.11 K-Means in the online course book.

1. Load the `iris` dataset and separate into `X` and `y` variables (our ground truth labels will just be used for visualization).
2. Write a hypothesis on how many clusters will yield the best labeling.

In [None]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

**Hypothesis**
>
>The best labeling will occur with 4 clusters because there are 4 features.

### Problem 2: Data exploration (10 points)

This is the step where you would normally conduct any needed preprocessing, data wrangling, and investigation of the data.
<br>**Note:** `print(iris.DESCR)` prints the iris dataset description, provided you loaded it into a variable named `iris`

a. Using your skills from previous projects, provide code below to produce answers to the following questions (edit this cell with your answers): 

    1. How many features are provided?

    There are 4 features.

    2. How many total observations?
    
    There are 150 observations.

    3. How many different labels are included, what are they called, and is it a balanced dataset with the same number of observations for each class?
        There are three different labels, called Iris-Setosa, Iris-Versicolour, and Iris-Virginica. The dataset is balanced: each class has 50 observations.
    
        
b. Create a 2D or 3D scatter plot of two or three of the features and use the y labels for color coding. Do not reduce the data or number of features in any way (you will do this by applying PCA in problem 5).

c. Since clusters can be influenced by the magnitudes of the variables, normalize the feature data and plot a histogram of the normalized features data.

In [None]:
# a
print(iris.DESCR)
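
The answers in part (a) can also be derived directly from the arrays instead of being read off the description text. The following is a minimal sketch, assuming the dataset is loaded with `datasets.load_iris()` as in Problem 1; the variable names `n_observations`, `n_features`, and `class_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Rows are observations, columns are features
n_observations, n_features = X.shape

# Count how many observations carry each integer label (0, 1, 2)
class_counts = np.bincount(y)

print(f"features: {n_features}")
print(f"observations: {n_observations}")
for name, count in zip(iris.target_names, class_counts):
    print(f"{name}: {count}")
```

If every entry of `class_counts` is equal, the dataset is balanced.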

In [None]:
# b
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
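plt.scatter(X[:,0], X[:,1], c=y)  # c=y colors by label; a third positional argument would set marker size
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.tight_layout()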

In [None]:
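#c. Normalization
# Normalizer rescales each observation (row) to unit L2 norm; it is stateless, so no fit is needed
transformed = Normalizer().transform(X)
plt.hist(transformed, label=iris.feature_names)
plt.legend()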
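
Note that `Normalizer` works per observation (row), while the concern in part (c) is about the magnitudes of the variables (columns). As a hedged comparison, the sketch below contrasts it with `StandardScaler`, which is not used elsewhere in this notebook and is shown only as one possible per-feature alternative. The names `row_normalized` and `col_standardized` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import Normalizer, StandardScaler

X = datasets.load_iris().data

# Row-wise: every observation is rescaled to have unit L2 norm
row_normalized = Normalizer().fit_transform(X)

# Column-wise: every feature is shifted to mean 0 and scaled to unit variance
col_standardized = StandardScaler().fit_transform(X)

row_norms = np.linalg.norm(row_normalized, axis=1)
col_means = col_standardized.mean(axis=0)
col_stds = col_standardized.std(axis=0)

print("row norms (first 3):", row_norms[:3])
print("column means:", col_means.round(6))
print("column stds:", col_stds.round(6))
```

Which scaling is appropriate depends on the experiment; the clustering cells in this notebook run on the raw `X` either way.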

### Problem 3: Unsupervised Learning - Clustering (15 points)
Conduct clustering experiments with one of the algorithms discussed in class (e.g., k-means) for number of clusters k = 2-10. Create another 2D or 3D scatter plot utilizing the <b>cluster assignments</b> for color coding (this output can be a plot for each of the values of k or just one final plot using the value of k from your best Silhouette result obtained in Problem 4 below).

#### Steps:
Repeat for each value of k (maybe a loop here would be appropriate):
1. Create model object
2. Train or fit the model
3. Predict cluster assignments
4. Calculate Silhouette width (see Problem 4)
5. Plot points color coded by class labels predicted by the model.

In [None]:
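silhouette = []
for k in range(2,11):
    model = KMeans(n_clusters=k)
    model.fit(X)  # unsupervised: the labels y are not used for fitting
    prediction = model.predict(X)  # predict takes only the data, not y
    silhouette.append(sk.metrics.silhouette_score(X,prediction))
    plt.scatter(X[:,0],X[:,1],c=prediction)
    plt.title(f"k = {k}")
    plt.show()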


### Problem 4: Evaluate results (20 points)

As we have discussed, validating an unsupervised problem is difficult. There is a metric that can be used to determine the density or separation of cluster assignments, called Silhouette width. In this step, perform analysis of results using the above `k = 2-10` and compute the Silhouette width (Hint: possibly you can just add code to your loop in problem 3 and store the results in a list of values).

Scikit Learn has a great example for Silhouette analysis [here](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)
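
To make the metric concrete: for a point i, let a(i) be its mean distance to the other points in its own cluster and b(i) the smallest mean distance to the points of any other cluster. The Silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), and the overall score is the mean of s(i) over all points. The sketch below computes this by hand and compares it with `silhouette_score`. The helper `manual_silhouette` is hypothetical and written for illustration only, and the `n_init` and `random_state` values are assumptions chosen for repeatability.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances, silhouette_score

X = datasets.load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)


def manual_silhouette(X, labels):
    """Mean Silhouette width computed directly from the definition."""
    dist = pairwise_distances(X)  # Euclidean by default
    clusters = np.unique(labels)
    widths = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton cluster gets width 0 by convention
        # Distance to itself is 0, so divide by the number of *other* members
        a = dist[i, own].sum() / (n_own - 1)
        # Smallest mean distance to any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        widths[i] = (b - a) / max(a, b)
    return widths.mean()


manual = manual_silhouette(X, labels)
library = silhouette_score(X, labels)
print(manual, library)
```

Values near 1 indicate compact, well-separated clusters; values near 0 indicate overlapping clusters.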

1. For each k (k = 2-10), what are the Silhouette width values?

- k=2: 0.681046169211746
- k=3: 0.5528190123564091
- k=4: 0.4980505049972867
- k=5: 0.4887488870931048
- k=6: 0.3678464984712235
- k=7: 0.3571558397875374
- k=8: 0.3471194328049025
- k=9: 0.3455246689352856
- k=10: 0.3258225826755834

2. Discuss if your best number of clusters (highest Silhouette width value) matches your hypothesis from Problem 1.

The best number of clusters is 2, based on the silhouette values, which does not match my hypothesis of 4. Looking at the scatter plots, 2 clusters makes sense: as more clusters are added, they sit closer together and their points start to overlap.

In [None]:
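# Pair each score with the k it was computed for (k = 2..10)
for k, score in zip(range(2, 11), silhouette):
    print(f"k={k}: {score}")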

### Problem 5 (15 points): Principal Component Analysis (PCA)
PCA is the most popular form of dimensionality reduction. It rotates and transforms the data into a new subspace, such that the resultant matrix has:
- Most relevance (variation) now associated with first feature
- Second feature gets the next most, etc.
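
The ordering described above can be checked through the `explained_variance_ratio_` attribute of a fitted `PCA` object. This is a small sketch, assuming all four components are kept so that the ratios sum to 1; the name `ratios` is illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data

# Keep all 4 components so the full variance is accounted for
pca = PCA(n_components=4).fit(X)
ratios = pca.explained_variance_ratio_

print("variance share per component:", ratios.round(4))
print("cumulative share:", np.cumsum(ratios).round(4))
```

The cumulative share is one common way to decide how many components to keep.
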
#### Steps:
1. Reduce the feature data (X) using PCA
2. Repeat the same experiment from problem 3 above (remember your plots are now the 1st, 2nd, and possibly 3rd principal component vs. the raw feature data like before).
3. Compare and contrast results to those from previous/non-PCA problems; does it perform better/worse/same? Provide discussion below (this could vary, depending on setup).

In [None]:
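# Clustering with PCA
silhouettePCA = []

# Note: k sets both the number of principal components and the number of clusters
for k in range(2,5):
    pca = PCA(n_components=k)
    xPCA = pca.fit_transform(X)
    model = KMeans(n_clusters=k)
    model.fit(xPCA)  # unsupervised: the labels y are not used for fitting
    prediction = model.predict(xPCA)
    # The silhouette is scored on the original features X, not on xPCA
    silhouettePCA.append(sk.metrics.silhouette_score(X,prediction))
    plt.scatter(xPCA[:,0],xPCA[:,1],c=prediction)
    plt.show()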


In [None]:
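# Pair each score with the k it was computed for (k = 2..4)
for k, score in zip(range(2, 5), silhouettePCA):
    print(f"k={k}: {score}")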
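
One possible variation on this experiment, shown here only as a sketch and not as the setup used for the results in this notebook, is to fix the number of principal components and vary only k over the full 2-10 range, scoring the silhouette in the reduced space. The choice of 2 components, the `n_init` and `random_state` values, and the names `X_2d`, `scores`, and `best_k` are all assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data

# Reduce once, outside the loop, so only k changes between runs
X_2d = PCA(n_components=2).fit_transform(X)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_2d)
    # Score in the same space the clustering was fitted in
    scores[k] = silhouette_score(X_2d, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: {score:.4f}")
print("best k:", best_k)
```

Separating the two parameters makes it easier to attribute a change in the score to the dimensionality reduction rather than to the cluster count.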

**Discuss new results** (Edit this cell)
>
>The silhouette values of the PCA and non-PCA solutions are almost identical to each other. The scatter plots look different, but they still give the same results.

## You Finished! Treat yourself by taking this questionnaire
### Questionnaire
1) How long did you spend on this assignment?
<br><br>
I spent about 5 hours on this assignment.

2) What did you like about it? What did you not like about it?
<br><br>
I liked how it demonstrated the different numbers of clusters in problem 3.

3) Did you find any errors or is there anything you would like changed?
<br><br>
I found no errors and would not change anything.