# Lab 3 - Part 2: PCA and Clustering (12 marks)
### Due Date: Monday, March 13 at 12pm

Author: Ameen Yarkhan

The purpose of this part of the assignment is to practice applying PCA and clustering techniques to a given dataset.

In [14]:
import numpy as np
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## 0. Function definitions (2 marks)

In [15]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_fn(n_clusters, X, n_components=0):
    '''Calculate the silhouette score for a given dataset, number of clusters,
       and number of principal components using K-means clustering (random_state=0)

        n_clusters (int): number of clusters to use for K-means
        X (numpy.array or pandas.DataFrame): unlabelled dataset
        n_components (int): number of principal components (optional; 0 skips PCA)

        returns: silhouette score

    '''

    # TODO: Implement function body

    # Optionally project the data onto the first n_components principal components
    if n_components > 0:
        pca = PCA(n_components=n_components)
        X = pca.fit_transform(X)

    # Fit K-means on the (possibly PCA-transformed) data and score the clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    labels = kmeans.fit_predict(X)

    return silhouette_score(X, labels)
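
As a rough illustration of what `silhouette_score` reports, the sketch below recomputes it by hand as the mean of s(i) = (b - a) / max(a, b), where a is the mean distance from sample i to the other members of its cluster and b is the mean distance to the nearest other cluster. The data comes from `make_blobs` and the helper `manual_silhouette` is hypothetical; neither is part of the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

# Synthetic data (an assumption for illustration, not the ceramic dataset)
X_demo, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)


def manual_silhouette(X, labels):
    """Hypothetical helper: mean of s(i) = (b - a) / max(a, b) over all samples."""
    dist = pairwise_distances(X)  # Euclidean distances between all pairs
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)  # a cluster with a single sample scores 0
            continue
        same[i] = False  # exclude the sample itself from its own cluster
        a = dist[i, same].mean()
        # b: smallest mean distance to the samples of any other cluster
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


print(manual_silhouette(X_demo, labels), silhouette_score(X_demo, labels))
```

The two printed values should agree up to floating-point error. Scores near 1 indicate compact, well-separated clusters, and scores near 0 indicate overlapping clusters.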
    

## 1. Load data (2 marks)

For this assignment, we will use the dataset found below:

https://archive.ics.uci.edu/ml/datasets/Chemical+Composition+of+Ceramic+Samples

In [16]:
# TODO: Import dataset

df = pd.read_csv('ceramic.csv' )

df.head()

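|  | Ceramic Name | Part | Na2O | MgO | Al2O3 | SiO2 | K2O | CaO | TiO2 | Fe2O3 | MnO | CuO | ZnO | PbO2 | Rb2O | SrO | Y2O3 | ZrO2 | P2O5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FLQ-1-b | Body | 0.62 | 0.38 | 19.61 | 71.99 | 4.84 | 0.31 | 0.07 | 1.18 | 630 | 10 | 70 | 10 | 430 | 0 | 40 | 80 | 90 |
| 1 | FLQ-2-b | Body | 0.57 | 0.47 | 21.19 | 70.09 | 4.98 | 0.49 | 0.09 | 1.12 | 380 | 20 | 80 | 40 | 430 | -10 | 40 | 100 | 110 |
| 2 | FLQ-3-b | Body | 0.49 | 0.19 | 18.6 | 74.7 | 3.47 | 0.43 | 0.06 | 1.07 | 420 | 20 | 50 | 50 | 380 | 40 | 40 | 80 | 200 |
| 3 | FLQ-4-b | Body | 0.89 | 0.3 | 18.01 | 74.19 | 4.01 | 0.27 | 0.09 | 1.23 | 460 | 20 | 70 | 60 | 380 | 10 | 40 | 70 | 210 |
| 4 | FLQ-5-b | Body | 0.03 | 0.36 | 18.41 | 73.99 | 4.33 | 0.65 | 0.05 | 1.19 | 380 | 40 | 90 | 40 | 360 | 10 | 30 | 80 | 150 |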


Two of the columns (`Ceramic Name` and `Part`) are non-numeric. For this assignment, we will remove those two columns and focus on clustering the ceramic samples based on the numerical measurements.

In [17]:
# TODO: Remove non-numeric columns
df = df.drop(['Ceramic Name','Part'],axis=1)

df.head()

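|  | Na2O | MgO | Al2O3 | SiO2 | K2O | CaO | TiO2 | Fe2O3 | MnO | CuO | ZnO | PbO2 | Rb2O | SrO | Y2O3 | ZrO2 | P2O5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.62 | 0.38 | 19.61 | 71.99 | 4.84 | 0.31 | 0.07 | 1.18 | 630 | 10 | 70 | 10 | 430 | 0 | 40 | 80 | 90 |
| 1 | 0.57 | 0.47 | 21.19 | 70.09 | 4.98 | 0.49 | 0.09 | 1.12 | 380 | 20 | 80 | 40 | 430 | -10 | 40 | 100 | 110 |
| 2 | 0.49 | 0.19 | 18.6 | 74.7 | 3.47 | 0.43 | 0.06 | 1.07 | 420 | 20 | 50 | 50 | 380 | 40 | 40 | 80 | 200 |
| 3 | 0.89 | 0.3 | 18.01 | 74.19 | 4.01 | 0.27 | 0.09 | 1.23 | 460 | 20 | 70 | 60 | 380 | 10 | 40 | 70 | 210 |
| 4 | 0.03 | 0.36 | 18.41 | 73.99 | 4.33 | 0.65 | 0.05 | 1.19 | 380 | 40 | 90 | 40 | 360 | 10 | 30 | 80 | 150 |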
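
Dropping columns by name works when the non-numeric columns are known in advance. As an alternative sketch, pandas can select columns by dtype with `select_dtypes`. The small frame below reuses a few values from the first two rows of the `df.head()` output as a stand-in for the full file; the variable names `sample` and `numeric_only` are illustrative.

```python
import pandas as pd

# Stand-in for the loaded dataset: a few columns from the first two rows shown above
sample = pd.DataFrame({
    'Ceramic Name': ['FLQ-1-b', 'FLQ-2-b'],
    'Part': ['Body', 'Body'],
    'Na2O': [0.62, 0.57],
    'MgO': [0.38, 0.47],
    'MnO': [630, 380],
})

# Keep only integer and float columns, whatever they are called
numeric_only = sample.select_dtypes(include='number')
print(list(numeric_only.columns))
```

This keeps the cell robust if the file gains or renames a text column, at the cost of being less explicit about which columns were removed.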


## 2. Implement clustering (8 marks)

### 2.1 Cluster using raw data (1 mark)

Implement K-means clustering using the raw data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters.

In [23]:
# TODO: Implement clustering with raw data using cluster_fn above

clusters = [2,3,4,5,6]

# Collect one row per cluster count, then build the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0)
raw_rows = []

for cluster in clusters:
    ss = cluster_fn(cluster, df)
    raw_rows.append({'Number of Clusters': cluster, 'Silhouette Score': ss})

rawclusterdf = pd.DataFrame(raw_rows, columns=['Number of Clusters', 'Silhouette Score'])

rawclusterdf

|  | Number of Clusters | Silhouette Score |
|---|---|---|
| 0 | 2.0 | 0.584013 |
| 1 | 3.0 | 0.56164 |
| 2 | 4.0 | 0.543411 |
| 3 | 5.0 | 0.508064 |
| 4 | 6.0 | 0.510399 |


### 2.2 Cluster using PCA-transformed data (2 marks)

Implement K-means clustering using the PCA-transformed data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters and 2, 3, 4, 5 and 6 principal components.

In [22]:
# TODO: Implement clustering with PCA-transformed data using cluster_fn above

clusters = [2,3,4,5,6]
components = [2,3,4,5,6]

# Collect one row per (clusters, components) pair, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
pca_rows = []

for cluster in clusters:
    for component in components:
        silhouetteScore = cluster_fn(cluster, df, n_components=component)
        pca_rows.append({'Number of Clusters': cluster,
                         'Silhouette Score': silhouetteScore,
                         'PCA components': component})

pcaclusterdf = pd.DataFrame(pca_rows, columns=['Number of Clusters', 'Silhouette Score', 'PCA components'])

pcaclusterdf


|  | Number of Clusters | Silhouette Score | PCA components |
|---|---|---|---|
| 0 | 2.0 | 0.619442 | 2.0 |
| 1 | 2.0 | 0.599961 | 3.0 |
| 2 | 2.0 | 0.589955 | 4.0 |
| 3 | 2.0 | 0.587472 | 5.0 |
| 4 | 2.0 | 0.585963 | 6.0 |
| 5 | 3.0 | 0.611625 | 2.0 |
| 6 | 3.0 | 0.586609 | 3.0 |
| 7 | 3.0 | 0.570949 | 4.0 |
| 8 | 3.0 | 0.56747 | 5.0 |
| 9 | 3.0 | 0.564725 | 6.0 |
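
When comparing different numbers of principal components, it can help to check how much of the variance each choice retains. The sketch below does this with `explained_variance_ratio_` on synthetic data; the array `X_demo` and its column scales are assumptions for illustration and say nothing about the ceramic measurements.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 samples, 6 features with very different scales
X_demo = rng.normal(size=(100, 6)) * np.array([100.0, 50.0, 10.0, 1.0, 0.5, 0.1])

# Fit PCA with all components so the ratios cover the full variance
pca = PCA(n_components=6).fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for k, total in enumerate(cumulative, start=1):
    print(f"{k} components: {total:.3f} of variance retained")
```

If the first two components already retain nearly all of the variance, adding more components changes the transformed data very little, which would be consistent with silhouette scores that move only slightly as components are added.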


### 2.3 Display results (2 marks)

Print the results for 2.1 and 2.2 in a table. Include column and row labels.

In [20]:
# TODO: Display results


# The results for 2.1 and 2.2 were already displayed above as the DataFrames
# rawclusterdf and pcaclusterdf.

# Find the maximum 'Silhouette Score' in pcaclusterdf
maxSilhouetteScore = pcaclusterdf['Silhouette Score'].max()

# Find the row with the maximum 'Silhouette Score' in pcaclusterdf
maxSilhouetteScoreRow = pcaclusterdf.loc[pcaclusterdf['Silhouette Score'] == maxSilhouetteScore]

maxSilhouetteScoreRow.head()


|  | Number of Clusters | Silhouette Score | PCA components |
|---|---|---|---|
| 0 | 2.0 | 0.619442 | 2.0 |
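
One possible way to present the 2.2 results with row and column labels is to pivot the long table into a grid of clusters by components. The sketch below rebuilds a small frame from the ten rows displayed in the 2.2 output and is only an illustration; the names `results`, `grid` and `best` are hypothetical.

```python
import pandas as pd

# Rows copied from the 2.2 output above: (clusters, components, silhouette score)
rows = [
    (2, 2, 0.619442), (2, 3, 0.599961), (2, 4, 0.589955),
    (2, 5, 0.587472), (2, 6, 0.585963),
    (3, 2, 0.611625), (3, 3, 0.586609), (3, 4, 0.570949),
    (3, 5, 0.56747), (3, 6, 0.564725),
]
results = pd.DataFrame(
    rows, columns=['Number of Clusters', 'PCA components', 'Silhouette Score'])

# One row per cluster count, one column per number of principal components
grid = results.pivot(index='Number of Clusters',
                     columns='PCA components',
                     values='Silhouette Score')
print(grid)

# Row of the long table with the highest silhouette score
best = results.loc[results['Silhouette Score'].idxmax()]
print(int(best['Number of Clusters']), int(best['PCA components']),
      best['Silhouette Score'])
```

The grid layout makes it easier to read across a row (fixed number of clusters) or down a column (fixed number of components) than the long table does.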


**Question**: Which combination of number of clusters and number of components produced the best results? What is the silhouette score for this combination? **(3 marks)**

The best combination is 2 clusters with 2 principal components, which gives the highest silhouette score of 0.619 (0.619442).

## 3. Improve results (Bonus - 3 marks)

Think about how you could improve the results from the previous section. Two potential methods include preprocessing the data or selecting a different clustering algorithm. Repeat section 2 with your selected improvement method to determine what the new silhouette scores would be.
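
As one example of the preprocessing option, the `df.head()` output in section 1 shows columns with values below 100 next to columns with values in the hundreds. K-means relies on Euclidean distance, so the larger-scale columns dominate unless the features are rescaled. The sketch below standardizes synthetic data with `StandardScaler` before clustering. It is a minimal illustration under assumed data, not a result for the ceramic dataset, and it does not show whether standardization would raise the scores there.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 4 features, the last two inflated to mimic mixed units
X_demo, _ = make_blobs(n_samples=90, centers=3, n_features=4, random_state=0)
X_demo = X_demo * np.array([1.0, 1.0, 500.0, 500.0])

# Rescale every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)

for name, data in [('raw', X_demo), ('standardized', X_scaled)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    print(name, round(silhouette_score(data, labels), 3))
```

Silhouette scores computed on differently scaled versions of the data are not measured in the same feature space, so a comparison between them should be read with that in mind.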

In [21]:
# TODO: Repeat steps 2.1-2.3 using a different method/preprocessing/etc.

**Question**: Why did you select this improvement method? Which combination of number of clusters and number of components produced the best results? Did you improve the silhouette scores? If yes, how much of an improvement did you get over the previous results?

*ANSWER HERE*