# Lab 3 - Part 2: PCA and Clustering (12 marks)
### Due Date: Monday, March 13 at 12pm

Author: Christopher DiMattia

The purpose of this portion of the assignment is to practice applying PCA and clustering techniques to a given dataset.

In [1]:
import numpy as np
import pandas as pd

## 0. Function definitions (2 marks)

In [2]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_fn(X, n_clusters, n_components=0):
    '''Calculate the silhouette score for a given dataset, number of clusters,
       and number of principal components using K-means clustering (random_state=0)

        X (numpy.array or pandas.DataFrame): unlabelled dataset
        n_clusters (int): number of clusters to use for K-means
        n_components (int): number of principal components (optional; 0 skips PCA)

        returns: silhouette score

    '''
    # TODO: Implement function body

    # Project onto the leading principal components only when requested
    if n_components > 0:
        X = PCA(n_components=n_components).fit_transform(X)

    # Fit K-means with a fixed seed and score the resulting labels
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
    return silhouette_score(X, kmeans.labels_)
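To make the metric returned by `cluster_fn` concrete, here is a minimal pure-Python sketch of the silhouette calculation on a toy example. For each sample, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the sample's silhouette value is `(b - a) / max(a, b)`. The helper name `silhouette_values` and the four toy points are illustrative assumptions, not part of the assignment; the sketch also assumes Euclidean distance and at least two clusters.

```python
import math
from statistics import fmean

def silhouette_values(points, labels):
    """Return one silhouette value per sample (hypothetical helper)."""
    values = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Collect distances from sample i to every other sample, keyed by cluster
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        if label not in dists:
            # A cluster with a single member gets a silhouette value of 0
            values.append(0.0)
            continue
        a = fmean(dists[label])  # cohesion: mean distance within own cluster
        b = min(fmean(d) for k, d in dists.items() if k != label)  # separation
        values.append((b - a) / max(a, b))
    return values

# Toy data (assumed): two tight pairs that are far apart
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = fmean(silhouette_values(points, [0, 0, 1, 1]))  # labels match the pairs
bad = fmean(silhouette_values(points, [0, 1, 0, 1]))   # labels split the pairs
print(good, bad)
```

A labelling that matches the natural groups scores close to 1, while a labelling that cuts across them scores below 0, which is why higher silhouette scores are treated as better in the sections that follow.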

## 1. Load data (2 marks)

For this assignment, we will use the dataset found below:

https://archive.ics.uci.edu/ml/datasets/Chemical+Composition+of+Ceramic+Samples

In [3]:
# TODO: Import dataset
df = pd.read_csv('Chemical Composion of Ceramic.csv')

In [4]:
display(df)

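| | Ceramic Name | Part | Na2O | MgO | Al2O3 | SiO2 | K2O | CaO | TiO2 | Fe2O3 | MnO | CuO | ZnO | PbO2 | Rb2O | SrO | Y2O3 | ZrO2 | P2O5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FLQ-1-b | Body | 0.62 | 0.38 | 19.61 | 71.99 | 4.84 | 0.31 | 0.07 | 1.18 | 630 | 10 | 70 | 10 | 430 | 0 | 40 | 80 | 90 |
| 1 | FLQ-2-b | Body | 0.57 | 0.47 | 21.19 | 70.09 | 4.98 | 0.49 | 0.09 | 1.12 | 380 | 20 | 80 | 40 | 430 | -10 | 40 | 100 | 110 |
| 2 | FLQ-3-b | Body | 0.49 | 0.19 | 18.60 | 74.70 | 3.47 | 0.43 | 0.06 | 1.07 | 420 | 20 | 50 | 50 | 380 | 40 | 40 | 80 | 200 |
| 3 | FLQ-4-b | Body | 0.89 | 0.30 | 18.01 | 74.19 | 4.01 | 0.27 | 0.09 | 1.23 | 460 | 20 | 70 | 60 | 380 | 10 | 40 | 70 | 210 |
| 4 | FLQ-5-b | Body | 0.03 | 0.36 | 18.41 | 73.99 | 4.33 | 0.65 | 0.05 | 1.19 | 380 | 40 | 90 | 40 | 360 | 10 | 30 | 80 | 150 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83 | DY-M-3-g | Glaze | 0.34 | 0.55 | 12.37 | 70.70 | 5.33 | 8.06 | 0.06 | 1.61 | 1250 | 10 | 90 | 30 | 250 | 520 | 30 | 140 | 690 |
| 84 | DY-QC-1-g | Glaze | 0.72 | 0.34 | 12.20 | 72.19 | 6.19 | 6.06 | 0.04 | 1.27 | 1700 | 60 | 110 | 10 | 270 | 540 | 40 | 120 | 630 |
| 85 | DY-QC-2-g | Glaze | 0.23 | 0.24 | 12.99 | 71.81 | 5.25 | 7.15 | 0.05 | 1.29 | 750 | 40 | 100 | 0 | 240 | 470 | 40 | 120 | 480 |
| 86 | DY-QC-3-g | Glaze | 0.14 | 0.46 | 12.62 | 69.16 | 4.34 | 11.03 | 0.05 | 1.20 | 920 | 40 | 90 | 20 | 230 | 470 | 40 | 130 | 1100 |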


Two of the columns (`Ceramic Name` and `Part`) are non-numeric. For this assignment, we will remove those two columns and cluster the ceramic samples using only the numerical measurements.

In [5]:
# TODO: Remove non-numeric columns
df = df.drop(["Ceramic Name",'Part'],axis=1)

In [6]:
display(df)

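| | Na2O | MgO | Al2O3 | SiO2 | K2O | CaO | TiO2 | Fe2O3 | MnO | CuO | ZnO | PbO2 | Rb2O | SrO | Y2O3 | ZrO2 | P2O5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.62 | 0.38 | 19.61 | 71.99 | 4.84 | 0.31 | 0.07 | 1.18 | 630 | 10 | 70 | 10 | 430 | 0 | 40 | 80 | 90 |
| 1 | 0.57 | 0.47 | 21.19 | 70.09 | 4.98 | 0.49 | 0.09 | 1.12 | 380 | 20 | 80 | 40 | 430 | -10 | 40 | 100 | 110 |
| 2 | 0.49 | 0.19 | 18.60 | 74.70 | 3.47 | 0.43 | 0.06 | 1.07 | 420 | 20 | 50 | 50 | 380 | 40 | 40 | 80 | 200 |
| 3 | 0.89 | 0.30 | 18.01 | 74.19 | 4.01 | 0.27 | 0.09 | 1.23 | 460 | 20 | 70 | 60 | 380 | 10 | 40 | 70 | 210 |
| 4 | 0.03 | 0.36 | 18.41 | 73.99 | 4.33 | 0.65 | 0.05 | 1.19 | 380 | 40 | 90 | 40 | 360 | 10 | 30 | 80 | 150 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83 | 0.34 | 0.55 | 12.37 | 70.70 | 5.33 | 8.06 | 0.06 | 1.61 | 1250 | 10 | 90 | 30 | 250 | 520 | 30 | 140 | 690 |
| 84 | 0.72 | 0.34 | 12.20 | 72.19 | 6.19 | 6.06 | 0.04 | 1.27 | 1700 | 60 | 110 | 10 | 270 | 540 | 40 | 120 | 630 |
| 85 | 0.23 | 0.24 | 12.99 | 71.81 | 5.25 | 7.15 | 0.05 | 1.29 | 750 | 40 | 100 | 0 | 240 | 470 | 40 | 120 | 480 |
| 86 | 0.14 | 0.46 | 12.62 | 69.16 | 4.34 | 11.03 | 0.05 | 1.20 | 920 | 40 | 90 | 20 | 230 | 470 | 40 | 130 | 1100 |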


## 2. Implement clustering (8 marks)

### 2.1 Cluster using raw data (1 mark)

Implement K-means clustering using the raw data. Compare the silhouette scores for 2, 3, 4, 5 and 6 clusters.

In [7]:
# TODO: Implement clustering with raw data using cluster_fn above

clusters = [2, 3, 4, 5, 6]

df_21 = pd.DataFrame(index=clusters, columns=["Silhouette Score"])
df_21.index.name = 'Cluster'

# Print scores for comparison and store each one in the results table
for cluster in clusters:
    score = cluster_fn(df, cluster)  # compute once, reuse for printing and storing
    print("cluster " + str(cluster) + ": silhouette score: " + str(score))
    df_21.at[cluster, 'Silhouette Score'] = score

cluster 2: silhouette score: 0.5840130686182087
cluster 3: silhouette score: 0.5616399840165864
cluster 4: silhouette score: 0.543410860697891
cluster 5: silhouette score: 0.5080642704990367
cluster 6: silhouette score: 0.5103991098981858


### 2.2 Cluster using PCA-transformed data (2 marks)

Implement K-means clustering using the PCA-transformed data. Compare the silhouette scores for 2, 3, 4, 5 and 6 clusters, each with 2, 3, 4, 5 and 6 principal components.

In [8]:
# TODO: Implement clustering with PCA-transformed data using cluster_fn above
clusters = [2, 3, 4, 5, 6]
components = [2, 3, 4, 5, 6]

df_22 = pd.DataFrame(index=clusters, columns=components)
df_22.index.name = 'Cluster'
df_22.columns.name = "Components"

# Print scores for comparison and store each one in the results table
for cluster in clusters:
    for component in components:
        score = cluster_fn(df, cluster, component)  # compute once, reuse below
        print("cluster " + str(cluster) + " components: " + str(component) + ": silhouette score: " + str(score))
        df_22.at[cluster, component] = score



cluster 2 components: 2: silhouette score: 0.6194422739820585
cluster 2 components: 3: silhouette score: 0.5999612865384472
cluster 2 components: 4: silhouette score: 0.5899549664230884
cluster 2 components: 5: silhouette score: 0.5874723776389512
cluster 2 components: 6: silhouette score: 0.5859629695809577
cluster 3 components: 2: silhouette score: 0.6116248458476962
cluster 3 components: 3: silhouette score: 0.5866089718929351
cluster 3 components: 4: silhouette score: 0.5709488853404947
cluster 3 components: 5: silhouette score: 0.5674695588445686
cluster 3 components: 6: silhouette score: 0.5647248342959945
cluster 4 components: 2: silhouette score: 0.6007517657297899
cluster 4 components: 3: silhouette score: 0.5705309604281562
cluster 4 components: 4: silhouette score: 0.5537151283872483
cluster 4 components: 5: silhouette score: 0.5492856440931447
cluster 4 components: 6: silhouette score: 0.5467520920972184
cluster 5 components: 2: silhouette score: 0.567087772402513
cluster 5

### 2.3 Display results (2 marks)

Print the results for 2.1 and 2.2 in a table. Include column and row labels.

In [9]:
# TODO: Display results
display(df_21)



| Cluster | Silhouette Score |
|---|---|
| 2 | 0.584013 |
| 3 | 0.56164 |
| 4 | 0.543411 |
| 5 | 0.508064 |
| 6 | 0.510399 |


In [10]:
display(df_22)

| Cluster \ Components | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|
| 2 | 0.619442 | 0.599961 | 0.589955 | 0.587472 | 0.585963 |
| 3 | 0.611625 | 0.586609 | 0.570949 | 0.56747 | 0.564725 |
| 4 | 0.600752 | 0.570531 | 0.553715 | 0.549286 | 0.546752 |
| 5 | 0.567088 | 0.545911 | 0.521348 | 0.515809 | 0.512537 |
| 6 | 0.56932 | 0.550951 | 0.529829 | 0.524216 | 0.515274 |


**Question**: Which combination of number of clusters and number of components produced the best results? What is the silhouette score for this combination? **(3 marks)**

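The best result comes from 2 clusters with 2 principal components, which gives a silhouette score of about 0.619. This is higher than every raw-data score in 2.1 (the best of which is about 0.584, also with 2 clusters). For every number of clusters tested, the score is highest with 2 components and falls as more components are added.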
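The best combination can also be picked out programmatically rather than by eye. The sketch below copies the rounded scores from the 2.2 results table into a plain dictionary (an illustrative stand-in for `df_22`, so the snippet runs on its own) and takes the maximum over all cluster/component pairs.

```python
# Silhouette scores copied from the 2.2 results table
# (keys: number of clusters, list positions: number of principal components)
components = [2, 3, 4, 5, 6]
scores = {
    2: [0.619442, 0.599961, 0.589955, 0.587472, 0.585963],
    3: [0.611625, 0.586609, 0.570949, 0.56747, 0.564725],
    4: [0.600752, 0.570531, 0.553715, 0.549286, 0.546752],
    5: [0.567088, 0.545911, 0.521348, 0.515809, 0.512537],
    6: [0.56932, 0.550951, 0.529829, 0.524216, 0.515274],
}

# Tuples compare element by element, so putting the score first makes
# max() return the highest-scoring (score, clusters, components) triple
best_score, best_clusters, best_components = max(
    (score, n_clusters, n_components)
    for n_clusters, row in scores.items()
    for n_components, score in zip(components, row)
)

print("best: clusters =", best_clusters,
      "components =", best_components,
      "score =", best_score)
```

Inside the notebook, something like `df_22.astype(float).stack().idxmax()` should give the same cluster/component pair directly from the DataFrame; the `astype(float)` call is there because a frame filled cell by cell holds generic Python objects.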

## 3. Improve results (Bonus - 3 marks)

Think about how you could improve the results from the previous section. Two potential methods are preprocessing the data and selecting a different clustering algorithm. Repeat section 2 with your selected improvement method to determine the new silhouette scores.
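One preprocessing option worth considering is standardization. In the data preview, columns such as `Na2O` take values below 1 while columns such as `MnO` take values in the hundreds, so the large-valued columns dominate both the PCA variance and the K-means distances. The sketch below is only an illustration of that effect, not a completed answer to this section: it uses the first five `Na2O` and `MnO` values from the preview and a hypothetical `standardize` helper that rescales a column to zero mean and unit variance, the same transformation that scikit-learn's `StandardScaler` applies by default.

```python
import statistics

# First five rows of two columns, copied from the data preview
na2o = [0.62, 0.57, 0.49, 0.89, 0.03]
mno = [630, 380, 420, 460, 380]

def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Before scaling, MnO's variance is several orders of magnitude larger
print("raw variances:", statistics.pvariance(na2o), statistics.pvariance(mno))

# After scaling, both columns contribute on the same footing
na2o_scaled = standardize(na2o)
mno_scaled = standardize(mno)
print("scaled variances:",
      statistics.pvariance(na2o_scaled), statistics.pvariance(mno_scaled))
```

Whether this actually raises the silhouette scores on the full dataset has to be checked by rerunning sections 2.1 to 2.3 on the scaled data; scores computed in a rescaled feature space are also not directly comparable in meaning to the raw-data scores.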

In [None]:
# TODO: Repeat steps 2.1-2.3 using a different method/preprocessing/etc.

**Question**: Why did you select this improvement method? Which combination of number of clusters and number of components produced the best results? Did you improve the silhouette scores? If yes, how much of an improvement did you get over the previous results?

*ANSWER HERE*