# Lab 3 - Part 2: PCA and Clustering (12 marks)
### Due Date: Monday, March 13 at 12pm

Author: Nikhil Naikar

The purpose of this portion of the assignment is to practice using PCA and clustering techniques on a given dataset.

In [1]:
import numpy as np
import pandas as pd

## 0. Function definitions (2 marks)

In [20]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_fn(X, n_clusters, n_components=0):
    '''Calculate the silhouette score for a given dataset, number of clusters,
       and number of principal components using K-means clustering (random_state=47)

        X (numpy.array or pandas.DataFrame): unlabelled dataset
        n_clusters (int): number of clusters to use for K-means
        n_components (int): number of principal components (optional;
            0 skips PCA and clusters X as given)

        returns: silhouette score

    '''
    # TODO: Implement function body
    # Optionally project the data onto its first n_components principal components
    if n_components != 0:
        pca_model = PCA(n_components=n_components)
        X = pca_model.fit_transform(X)
    # Fixed seed so repeated runs give the same clusters
    kmeans_model = KMeans(n_clusters=n_clusters, random_state=47)
    kmeans_model.fit(X)
    # Mean silhouette coefficient over all samples
    avg_score = silhouette_score(X, kmeans_model.labels_)

    return avg_score
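
As a quick sanity check, the function can be exercised on data where the right answer is known. The sketch below is an illustration only: it redefines a compact version of `cluster_fn` so that it runs on its own, and it uses synthetic blobs from `make_blobs` rather than the ceramic data. The blob centres, `n_init=10`, and `random_state=0` are assumptions made for this example, not values taken from the lab.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def cluster_fn(X, n_clusters, n_components=0):
    """Silhouette score of K-means labels, optionally after PCA."""
    if n_components != 0:
        X = PCA(n_components=n_components).fit_transform(X)
    # n_init and random_state are pinned here as assumptions of this sketch
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)


# Synthetic stand-in data (not the ceramic dataset):
# 90 points in 5 dimensions drawn tightly around 3 far-apart centres.
centers = np.array([
    [0, 0, 0, 0, 0],
    [10, 10, 10, 10, 10],
    [-10, 10, -10, 10, -10],
])
X_demo, _ = make_blobs(n_samples=90, centers=centers,
                       cluster_std=0.5, random_state=0)

score_raw = cluster_fn(X_demo, n_clusters=3)
score_pca = cluster_fn(X_demo, n_clusters=3, n_components=2)
print(f"raw: {score_raw:.2f}, PCA(2): {score_pca:.2f}")
```

With clusters this well separated, both scores should come out close to 1, which is the upper end of the silhouette range.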

## 1. Load data (2 marks)

For this assignment, we will use the dataset found below:

https://archive.ics.uci.edu/ml/datasets/Chemical+Composition+of+Ceramic+Samples

In [8]:
# TODO: Import dataset
data = pd.read_csv('Chemical Composion of Ceramic.csv')
print(data.head())
print(data.shape)

  Ceramic Name  Part  Na2O   MgO  Al2O3   SiO2   K2O   CaO  TiO2  Fe2O3  MnO  \
0      FLQ-1-b  Body  0.62  0.38  19.61  71.99  4.84  0.31  0.07   1.18  630   
1      FLQ-2-b  Body  0.57  0.47  21.19  70.09  4.98  0.49  0.09   1.12  380   
2      FLQ-3-b  Body  0.49  0.19  18.60  74.70  3.47  0.43  0.06   1.07  420   
3      FLQ-4-b  Body  0.89  0.30  18.01  74.19  4.01  0.27  0.09   1.23  460   
4      FLQ-5-b  Body  0.03  0.36  18.41  73.99  4.33  0.65  0.05   1.19  380   

   CuO  ZnO  PbO2  Rb2O  SrO  Y2O3  ZrO2  P2O5  
0   10   70    10   430    0    40    80    90  
1   20   80    40   430  -10    40   100   110  
2   20   50    50   380   40    40    80   200  
3   20   70    60   380   10    40    70   210  
4   40   90    40   360   10    30    80   150  
(88, 19)


Two of the columns are non-numeric. For this assignment, we will remove those two columns and focus on clustering the ceramic samples based on the numerical measurements.

In [9]:
# TODO: Remove non-numeric columns
# Keep only the numeric columns (drops 'Ceramic Name' and 'Part')
data = data.select_dtypes(include='number')
print(data.head())
print(data.shape)

   Na2O   MgO  Al2O3   SiO2   K2O   CaO  TiO2  Fe2O3  MnO  CuO  ZnO  PbO2  \
0  0.62  0.38  19.61  71.99  4.84  0.31  0.07   1.18  630   10   70    10   
1  0.57  0.47  21.19  70.09  4.98  0.49  0.09   1.12  380   20   80    40   
2  0.49  0.19  18.60  74.70  3.47  0.43  0.06   1.07  420   20   50    50   
3  0.89  0.30  18.01  74.19  4.01  0.27  0.09   1.23  460   20   70    60   
4  0.03  0.36  18.41  73.99  4.33  0.65  0.05   1.19  380   40   90    40   

   Rb2O  SrO  Y2O3  ZrO2  P2O5  
0   430    0    40    80    90  
1   430  -10    40   100   110  
2   380   40    40    80   200  
3   380   10    40    70   210  
4   360   10    30    80   150  
(88, 17)


## 2. Implement clustering (8 marks)

### 2.1 Cluster using raw data (1 mark)

Implement K-means clustering using the raw data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters.
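
For reference, the quantity being compared is the mean silhouette coefficient, which is what `silhouette_score` returns. For a sample $i$, let $a(i)$ be its mean distance to the other samples in its own cluster and $b(i)$ its mean distance to the samples in the nearest other cluster. Then, for $N$ samples:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
\text{score} = \frac{1}{N} \sum_{i=1}^{N} s(i)
$$

Each $s(i)$ lies in $[-1, 1]$: values near 1 indicate a sample that sits well inside its cluster, values near 0 indicate a sample on the boundary between two clusters, and negative values suggest a sample that may belong to a different cluster.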

In [35]:
# TODO: Implement clustering with raw data using cluster_fn above
results = []
num_clusters = [2, 3, 4, 5, 6]
for num in num_clusters:
    # Compute the score once and reuse it for both the table and the printout
    score = cluster_fn(data, num)
    results.append([num, score])
    print('Silhouette score for', num, 'clusters = {:.2f}'.format(score))

Silhouette score for 2 clusters = 0.58
Silhouette score for 3 clusters = 0.56
Silhouette score for 4 clusters = 0.54
Silhouette score for 5 clusters = 0.51
Silhouette score for 6 clusters = 0.49


### 2.2 Cluster using PCA-transformed data (2 marks)

Implement K-means clustering using the PCA-transformed data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters and 2, 3, 4, 5 and 6 principal components.

In [39]:
# TODO: Implement clustering with PCA-transformed data using cluster_fn above
num_clusters = [2, 3, 4, 5, 6]
num_components = [2, 3, 4, 5, 6]
result_PCA = []
for num in num_clusters:
    print('Silhouette score for', num, 'clusters with:')
    for n in num_components:
        # Compute the score once and reuse it for both the table and the printout
        score = cluster_fn(data, num, n)
        result_PCA.append([num, n, score])
        print(n, 'principal components = {:.2f}'.format(score))

Silhouette score for 2 clusters with:
2 principal components = 0.62
3 principal components = 0.60
4 principal components = 0.59
5 principal components = 0.59
6 principal components = 0.59
Silhouette score for 3 clusters with:
2 principal components = 0.61
3 principal components = 0.59
4 principal components = 0.57
5 principal components = 0.57
6 principal components = 0.57
Silhouette score for 4 clusters with:
2 principal components = 0.60
3 principal components = 0.57
4 principal components = 0.55
5 principal components = 0.55
6 principal components = 0.55
Silhouette score for 5 clusters with:
2 principal components = 0.57
3 principal components = 0.55
4 principal components = 0.52
5 principal components = 0.51
6 principal components = 0.51
Silhouette score for 6 clusters with:
2 principal components = 0.58
3 principal components = 0.54
4 principal components = 0.53
5 principal components = 0.52
6 principal components = 0.50


### 2.3 Display results (2 marks)

Print the results for 2.1 and 2.2 in a table. Include column and row labels.

In [51]:
# TODO: Display results
df = pd.DataFrame(results, columns = ['num_of_clusters', 'silhouette score'])
df.set_index('num_of_clusters', inplace=True)
df_PCA = pd.DataFrame(result_PCA, columns = ['num_of_clusters', 'num_of_principal_components', 'silhouette score'])
df_PCA.set_index('num_of_clusters', inplace=True)
print(df)
print(df_PCA)

                 silhouette score
num_of_clusters                  
2                        0.584013
3                        0.562567
4                        0.543411
5                        0.506608
6                        0.489695
                 num_of_principal_components  silhouette score
num_of_clusters                                               
2                                          2          0.619442
2                                          3          0.599961
2                                          4          0.589955
2                                          5          0.587472
2                                          6          0.585963
3                                          2          0.611625
3                                          3          0.586609
3                                          4          0.571447
3                                          5          0.567714
3                                          6          0.565490
4     

**Question**: Which combination of number of clusters and number of components produced the best results? What is the silhouette score for this combination? **(3 marks)**

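The combination of 2 clusters and 2 principal components produced the best result, with a silhouette score of 0.62 (0.619442 before rounding). For comparison, the best score on the raw data was 0.58, also with 2 clusters.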
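
The best combination can also be picked out programmatically instead of by reading the printout. The sketch below is one possible way to do it: it re-enters the rounded scores from the 2.2 output by hand so that it runs on its own, and the names `scores`, `long_form`, and `grid` are illustrative rather than part of the lab code.

```python
import pandas as pd

# Rounded silhouette scores copied from the 2.2 output.
# Keys are cluster counts; each list covers 2, 3, 4, 5, 6 principal components.
components = [2, 3, 4, 5, 6]
scores = {
    2: [0.62, 0.60, 0.59, 0.59, 0.59],
    3: [0.61, 0.59, 0.57, 0.57, 0.57],
    4: [0.60, 0.57, 0.55, 0.55, 0.55],
    5: [0.57, 0.55, 0.52, 0.51, 0.51],
    6: [0.58, 0.54, 0.53, 0.52, 0.50],
}

# One row per (clusters, components) combination, like result_PCA
rows = [(k, c, s) for k, row in scores.items() for c, s in zip(components, row)]
long_form = pd.DataFrame(rows, columns=["n_clusters", "n_components", "silhouette"])

# Grid view: clusters as row labels, components as column labels
grid = long_form.pivot(index="n_clusters", columns="n_components", values="silhouette")
print(grid)

# idxmax() gives the row label of the highest score
best = long_form.loc[long_form["silhouette"].idxmax()]
best_clusters = int(best["n_clusters"])
best_components = int(best["n_components"])
best_score = float(best["silhouette"])
print(best_clusters, best_components, best_score)
```

The pivoted grid also makes the overall trend easier to see: in these results the score tends to fall as either the number of clusters or the number of components increases.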

## 3. Improve results (Bonus - 3 marks)

Think about how you could improve the results from the previous section. Two potential methods are preprocessing the data or selecting a different clustering algorithm. Repeat section 2 with your selected improvement method to determine the new silhouette scores.
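
One preprocessing option is to standardize the features first, since the printed rows show columns on very different numeric scales (for example, TiO2 values below 0.1 next to MnO values in the hundreds), and both PCA and K-means are sensitive to scale. The sketch below is a possible starting point, not a worked answer: `scaled_cluster_fn` is a hypothetical variant of `cluster_fn`, the data are synthetic, and `n_init=10` and `random_state=0` are assumptions of the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def scaled_cluster_fn(X, n_clusters, n_components=0):
    """Hypothetical variant of cluster_fn: standardize, optional PCA, K-means."""
    # Rescale every column to zero mean and unit variance
    X_scaled = StandardScaler().fit_transform(X)
    if n_components != 0:
        X_scaled = PCA(n_components=n_components).fit_transform(X_scaled)
    # n_init and random_state are pinned here as assumptions of this sketch
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_scaled)
    # The score is measured in the scaled (and possibly projected) space
    return silhouette_score(X_scaled, labels)


# Synthetic stand-in data, NOT the ceramic measurements: column 0 separates
# two groups on a small scale, column 1 is large-scale noise.
rng = np.random.default_rng(0)
small = np.concatenate([rng.normal(0.0, 0.1, 40), rng.normal(1.0, 0.1, 40)])
large = rng.normal(300.0, 100.0, 80)
X_demo = np.column_stack([small, large])

for k in [2, 3, 4]:
    print(k, round(scaled_cluster_fn(X_demo, k), 2))
```

Because the silhouette score depends on the distances between samples, scores computed on standardized data are measured in a different feature space from the raw-data scores in section 2, so any comparison between the two should be made with that in mind.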

In [None]:
# TODO: Repeat steps 2.1-2.3 using a different method/preprocessing/etc.

**Question**: Why did you select this improvement method? Which combination of number of clusters and number of components produced the best results? Did you improve the silhouette scores? If yes, how much of an improvement did you get over the previous results?

*ANSWER HERE*