# Dimension Reduction and Clustering

#### DS 862 - Project I
#### Authors
* Syed Asim
* Divya Raghunathan

### Table of Contents


* [I: Project Overview](#one)
* [II: Introduction](#two)
* [Part I : Dimension Reduction](#1)
    * [1.1 Split the data into train, valid , train+valid and test set](#1.1)
    * [1.2 Split the four datasets into numerical and categorical variables](#1.2)
    * [1.3 Maintaining the levels of the categorical variables](#1.3)
    * [1.4 Train and Valid set](#1.4)
        * [1.4.1 Scaling the data- numerical train and valid](#1.4.1)
        * [1.4.2 Setting hyperparameters](#1.4.2)
        * [1.4.3 Perform PCA,MCA and ridge regression on the train set and evaluate on the valid set](#1.4.3)
    * [1.5 Train+Valid and Test set](#1.5)
        * [1.5.1 Rescaling the data - train+valid and test numerical set](#1.5.1)
        * [1.5.2 Perform PCA, MCA and Ridge regression on the train+valid using the best parameters and evaluate on the test set](#1.5.2)
    * [1.6 Ridge regression on the original dataset](#1.6)
    * [1.7 Findings of Part I](#1.7)
* [Part II : Clustering Analysis](#2)
    * [2.1 Compute Gower distance on the predictor set](#2.1)
    * [2.2 Apply K-medoids using the gower distance matrix as input ](#2.2)
    * [2.3 Comparing clustering results ](#2.3)
    * [2.4 Findings of Part II](#2.4)

# **I: Project Overview** <a class="anchor" id="one"></a>

The project is divided into two parts: dimension reduction and clustering. For dimension reduction, we apply PCA to the numerical features and MCA to the categorical features, tuning both with a train/validation/test split. We then tune and fit a ridge regression on the reduced features and report the mean squared error. 
For clustering, we compute the Gower distance on the mixed numerical and categorical dataset, apply the k-medoids clustering method and obtain the clusters.
We evaluate the two parts as follows:


1.   For dimension reduction, we will compare the results by performing ridge regression on the original dataset.
2.   For clustering, we will compare the result with the ground truth.





# **II: Introduction** <a class="anchor" id="two"></a>

In this notebook, we apply dimension reduction techniques and a clustering method to the housing dataset, which can be found [here](https://drive.google.com/file/d/1F9Z03GYHppfBbmDFvNsA89i2OQm8N-4T/view?usp=sharing). The response variable in this dataset is the sale price of the house. The dataset contains 80 features (including the response variable) and 1460 observations. After removing columns with more than 30 null values and dropping the remaining rows with missing values, we are left with 1451 observations and 64 features (including the response). Out of the 63 predictor variables, 34 are numerical and 29 are categorical.

In [4]:
# General imports

import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore')

# Part 1 - Dimension Reduction imports

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import linear_model
import prince
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import itertools

#Part 2- Clustering imports

import gower
from pyclustering.cluster.kmedoids import kmedoids
from sklearn.metrics.cluster import normalized_mutual_info_score

In [5]:
# Load data set
data = pd.read_csv('train.csv')
data = data.drop('Id', axis = 1)
print("Shape of the dataset : {}".format(data.shape))

# Remove columns that have too many missing values
data = data.drop(data.columns[data.isnull().sum() > 30], axis = 1)

# Remove missing values
data.dropna(inplace = True)

Shape of the dataset : (1460, 80)


In [6]:
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,20,RL,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,60,RL,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,70,RL,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,60,RL,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,...,0,0,0,0,0,12,2008,WD,Normal,250000


In [7]:
print("The number of rows and columns in the dataset : {}".format(data.shape))

The number of rows and columns in the dataset : (1451, 64)


In [8]:
# Defining the predictors and response variables
X = data.drop('SalePrice', axis = 1)
y = data.SalePrice

In [9]:
print("The number of numerical variables in the dataset : {}".format(X.select_dtypes(include=[np.number]).shape[1]))
print("The number of categorical variables in the dataset : {}".format(X.select_dtypes(exclude=[np.number]).shape[1]))

The number of numerical variables in the dataset : 34
The number of categorical variables in the dataset : 29


# **Part I : Dimension Reduction** <a class="anchor" id="1"></a>

To perform dimension reduction on our dataset, we use Principal Component Analysis (PCA) on the numerical variables and Multiple Correspondence Analysis (MCA) on the categorical variables. Both come from the Prince library, https://github.com/MaxHalford/prince. The code proceeds step by step as follows:



1.   Split the data into train, valid and test sets (60%-20%-20% split).

2.   Split the training and validation sets into categorical and numerical features (2 sets become four): X training numerical, X training categorical, X validation numerical, X validation categorical.

3.  Prepare the categorical data for MCA. The training and validation data must have the same number of levels and the same levels; otherwise the algorithm throws a "dimension mismatch" error. To avoid the error, we drop any categorical variable whose levels differ between the two sets. The data is then ready for MCA.

4.   Prepare the numerical data for PCA by scaling: `fit` the scaler on the numerical training data, then `transform` the numerical training and validation data.

5.   Define the hyperparameter sets using `itertools`. We tune 3 parameters here: the number of PCA components, the number of MCA components and the number of iterations.

6. For each combination of hyperparameters, `fit` PCA on the numerical training data and `fit` MCA on the categorical training data, then `transform` each. The result is two numerical dataframes, which we combine into one.

7. Tune `alphas` for ridge regression and `fit` it on the combined dataset.

8. `Transform` the validation numerical data with PCA and the validation categorical data with MCA, then combine the results.

9. Predict using ridge regression for the various alpha values and compute the mean squared error on the validation set. This step yields the best hyperparameter combination: number of PCA components, number of MCA components, iterations and alpha.

10. Combine the training and validation sets into what we call the train+valid set.

11. Repeat steps 4, 6, 7, 8 and 9, but without tuning the PCA, MCA and ridge regression hyperparameters: we directly use the best values from the steps above. Instead of the train set, we fit on the train+valid set and then evaluate on the test set.
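The fit-on-train, transform-on-valid pattern in the steps above can be sketched on synthetic data. This is only an illustration under stated assumptions: it uses scikit-learn's `PCA` rather than Prince, omits the MCA branch entirely, and the names `X_demo`, `y_demo`, `scaler_demo` and `pca_demo` as well as the small grid are made up for the example.

```python
import itertools

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numerical data (hypothetical stand-in for the housing features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_va_s = scaler_demo.transform(X_va)

results = []
for n_comp, a in itertools.product([2, 4, 6], [0.01, 1.0, 100.0]):
    # PCA is fitted on the training split and reused to project the validation split
    pca_demo = PCA(n_components=n_comp).fit(X_tr_s)
    model = Ridge(alpha=a).fit(pca_demo.transform(X_tr_s), y_tr)
    mse = mean_squared_error(y_va, model.predict(pca_demo.transform(X_va_s)))
    results.append((mse, n_comp, a))

# The tuple with the smallest validation MSE gives the selected hyperparameters
best_mse, best_n_comp, best_alpha = min(results)
print(best_n_comp, best_alpha, best_mse)
```

The key design point is that the scaler and the PCA never see the validation rows during fitting, so the validation error is an honest estimate for choosing the hyperparameters.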






#### 1.1 Split the data into train, valid , train+valid and test set <a class="anchor" id="1.1"></a>

In [10]:
# Using 60%-20%-20% split.

X_train_valid, X_test, y_train_valid, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid, test_size = 0.25, random_state = 42)
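The two-stage split gives 60%-20%-20% because the second call takes 25% of the remaining 80%, which is 20% of the full data. A small arithmetic sketch, assuming `train_test_split` rounds the test portion up (this assumption reproduces the shapes printed later in the notebook):

```python
import math

n_total = 1451  # observations left after dropping missing values

# First split: 20% held out as the test set
n_test = math.ceil(0.2 * n_total)
n_train_valid = n_total - n_test

# Second split: 25% of the remainder becomes the validation set
n_valid = math.ceil(0.25 * n_train_valid)
n_train = n_train_valid - n_valid

print(n_train, n_valid, n_test)
print(round(n_train / n_total, 3), round(n_valid / n_total, 3), round(n_test / n_total, 3))
```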

#### 1.2 Split the four datasets into numerical and categorical variables <a class="anchor" id="1.2"></a>

In [11]:
X_train_valid_num = X_train_valid.select_dtypes(include=[np.number])
X_train_valid_cat = X_train_valid.select_dtypes(exclude=[np.number])

X_test_num = X_test.select_dtypes(include=[np.number])
X_test_cat = X_test.select_dtypes(exclude=[np.number])

X_train_num = X_train.select_dtypes(include=[np.number])
X_train_cat = X_train.select_dtypes(exclude=[np.number])

X_valid_num = X_valid.select_dtypes(include=[np.number])
X_valid_cat = X_valid.select_dtypes(exclude=[np.number])

#### 1.3 Maintaining the levels of the categorical variables <a class="anchor" id="1.3"></a>

In [12]:
#Keeping the variables with the same number of levels in each of the categorical feature in train+valid set and test set
keep = X_train_valid_cat.nunique() == X_test_cat.nunique() 
X_train_valid_cat = X_train_valid_cat[X_train_valid_cat.columns[keep]]
X_test_cat = X_test_cat[X_test_cat.columns[keep]]
X_train_cat = X_train_cat[X_train_cat.columns[keep]]
X_valid_cat = X_valid_cat[X_valid_cat.columns[keep]]

In [13]:
#Keeping the variables with the same number of levels in each of the categorical feature in train set and valid set
keep2 = X_train_cat.nunique() == X_valid_cat.nunique()
X_train_valid_cat = X_train_valid_cat[X_train_valid_cat.columns[keep2]]
X_test_cat = X_test_cat[X_test_cat.columns[keep2]]
X_train_cat = X_train_cat[X_train_cat.columns[keep2]]
X_valid_cat = X_valid_cat[X_valid_cat.columns[keep2]]

In [14]:
# For the categorical features that have same number of levels, we will keep the variables which have the same classes in the train+valid set and test set 
keep = []
for i in range(X_train_valid_cat.shape[1]):
    keep.append(all(np.sort(X_train_valid_cat.iloc[:,i].unique()) == np.sort(X_test_cat.iloc[:,i].unique())))
X_train_valid_cat = X_train_valid_cat[X_train_valid_cat.columns[keep]]
X_test_cat = X_test_cat[X_test_cat.columns[keep]]
X_train_cat = X_train_cat[X_train_cat.columns[keep]]
X_valid_cat = X_valid_cat[X_valid_cat.columns[keep]]

In [15]:
# For the categorical features that have same number of levels, we will keep the variables which have the same classes in the train set and valid set
keep2 = []
for i in range(X_train_cat.shape[1]):
    keep2.append(all(np.sort(X_train_cat.iloc[:,i].unique()) == np.sort(X_valid_cat.iloc[:,i].unique())))
X_train_valid_cat = X_train_valid_cat[X_train_valid_cat.columns[keep2]]
X_test_cat = X_test_cat[X_test_cat.columns[keep2]]
X_train_cat = X_train_cat[X_train_cat.columns[keep2]]
X_valid_cat = X_valid_cat[X_valid_cat.columns[keep2]]

In [16]:
print('Shape of all categorical datasets:', X_train_valid_cat.shape, X_train_cat.shape, X_valid_cat.shape, X_test_cat.shape)
print('Shape of all numerical datasets  :', X_train_valid_num.shape, X_train_num.shape, X_valid_num.shape, X_test_num.shape)

Shape of all categorical datasets: (1160, 10) (870, 10) (290, 10) (291, 10)
Shape of all numerical datasets  : (1160, 34) (870, 34) (290, 34) (291, 34)
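The level check above can also be expressed with set comparison, which covers both conditions (same number of levels and same level names) in one pass. A minimal sketch on toy data; the frames `train_cat_demo` and `valid_cat_demo` are hypothetical and not part of the housing data:

```python
import pandas as pd

# Toy categorical frames (hypothetical values for illustration)
train_cat_demo = pd.DataFrame(
    {"Street": ["Pave", "Grvl", "Pave"], "LotShape": ["Reg", "IR1", "Reg"]}
)
valid_cat_demo = pd.DataFrame(
    {"Street": ["Pave", "Grvl"], "LotShape": ["Reg", "IR2"]}
)

# Keep a column only if both frames contain exactly the same set of levels
shared_cols = [
    col
    for col in train_cat_demo.columns
    if set(train_cat_demo[col].unique()) == set(valid_cat_demo[col].unique())
]

print(shared_cols)  # ['Street']
```

Here `LotShape` is dropped because `IR1` appears only in the first frame and `IR2` only in the second, even though both frames have two levels for that column.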



#### 1.4 Train and Valid set <a class="anchor" id="1.4"></a>

##### 1.4.1 Scaling the data- numerical train and valid <a class="anchor" id="1.4.1"></a>



In [17]:
#scaling the numerical train and valid sets
scaler = StandardScaler() # Instantiate
scaler.fit(X_train_num) # Fitting the data
X_train_numeric_S = pd.DataFrame(scaler.transform(X_train_num)) # transforming the training data
X_valid_numeric_S = pd.DataFrame(scaler.transform(X_valid_num)) # transforming the validation data

##### 1.4.2 Setting hyperparameters <a class="anchor" id="1.4.2"></a>



We will use "Number of components" and "Number of iterations" as tuning parameter for PCA and MCA

In [19]:
comps_pca = np.arange(3,35) #Number of components for PCA
comps_mca = np.arange(3,11) #Number of components for MCA
iters = np.arange(1,10) #Number of iterations for PCA & MCA
# tuning parameters for Regression
alphas = np.logspace(-10,10,21) # lambda values

In [20]:
# Forming all possible combinations of number of PCA components, number of MCA components, number of iterations and alpha
hyperparameter_set = list(itertools.product(comps_pca,comps_mca,iters,alphas))
print("The number of hyperparameter sets in total: {}".format(len(hyperparameter_set)))

The number of hyperparameter sets in total: 48384
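The size of the grid is simply the product of the lengths of the four axes. A quick check using only the standard library, where `range(21)` is a stand-in for the 21 alpha values produced by `np.logspace(-10, 10, 21)`:

```python
import itertools
import math

grid_axes = {
    "pca_components": range(3, 35),  # 32 values
    "mca_components": range(3, 11),  # 8 values
    "iterations": range(1, 10),      # 9 values
    "alphas": range(21),             # stand-in for the 21 alpha values
}

expected_size = math.prod(len(axis) for axis in grid_axes.values())
actual_size = sum(1 for _ in itertools.product(*grid_axes.values()))

print(expected_size, actual_size)  # 48384 48384
```

Since every combination refits PCA, MCA and the ridge model, the grid size directly determines the running time of the tuning loop.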


##### 1.4.3 Perform PCA,MCA and ridge regression on the train set and evaluate on the valid set<a class="anchor" id="1.4.3"></a>


In [22]:
MSE_valid = []
comp_pca = []
comp_mca = []
iter = []
alpha = []

#For-loop for each of the hyperparameter set

for val in hyperparameter_set:
    
#For each combination of hyperparameters, fit and transform PCA on numerical training data and fit and transform MCA on the categorical training data. 

    pca = prince.PCA(n_components=val[0], n_iter=val[2], random_state=42) 
    pca = pca.fit(X_train_numeric_S)
    X_train_pca = pca.transform(X_train_numeric_S)
    X_train_pca = X_train_pca.reset_index() #Resetting index to concat in a future step (without drop=True the old index is kept as an extra 'index' column)
    
    mca = prince.MCA(n_components=val[1], n_iter=val[2], random_state=42)
    mca = mca.fit(X_train_cat)
    X_train_mca = mca.transform(X_train_cat)
    X_train_mca = X_train_mca.reset_index() #Resetting index to concat in a future step
    
#Combine the PCA and MCA results from above into one dataframe
    X_train_pmca = pd.concat([X_train_pca, X_train_mca], axis=1) # X_train_pmca contains numerical values of the reduced features 

#Transform the valid-numerical and categorical data using PCA and MCA respectively
    X_valid_pca = pca.transform(X_valid_numeric_S)
    X_valid_pca = X_valid_pca.reset_index()
    X_valid_mca = mca.transform(X_valid_cat)
    X_valid_mca = X_valid_mca.reset_index()

#Combine the PCA and MCA results from above into one dataframe    
    X_valid_pmca = pd.concat([X_valid_pca,X_valid_mca], axis=1)
    
    #Appending the hyperparameters to a list for each run
    alpha.append(val[3])
    comp_pca.append(val[0])
    comp_mca.append(val[1])
    iter.append(val[2])
    #Fit ridge regression on the reduced features for each alpha value
    lm = linear_model.Ridge(alpha=val[3])
    lm.fit(X_train_pmca,y_train)
    
    #Evaluate the results using mean squared error and store the results in MSE_valid  
    MSE_valid.append(metrics.mean_squared_error(lm.predict(X_valid_pmca),y_valid))

In [23]:
#printing the results 
valid_score = pd.DataFrame(data=comp_pca)
valid_score['Number of components-MCA'] = comp_mca
valid_score['Number of iterations'] = iter
valid_score['Alpha']=alpha
valid_score['Validation scores'] = MSE_valid
valid_score = valid_score.rename(columns = {0:'Number of components-PCA'})
valid_score

Unnamed: 0,Number of components-PCA,Number of components-MCA,Number of iterations,Alpha,Validation scores
0,3,3,1,1.000000e-10,1.055596e+09
1,3,3,1,1.000000e-09,1.055596e+09
2,3,3,1,1.000000e-08,1.055596e+09
3,3,3,1,1.000000e-07,1.055596e+09
4,3,3,1,1.000000e-06,1.055596e+09
...,...,...,...,...,...
48379,34,10,9,1.000000e+06,6.541368e+09
48380,34,10,9,1.000000e+07,6.581531e+09
48381,34,10,9,1.000000e+08,6.545761e+09
48382,34,10,9,1.000000e+09,6.521434e+09


In [24]:
#The hyperparameter set that produces the least validation error
best_score = valid_score[valid_score['Validation scores'] == min(valid_score['Validation scores'])]
best_score = best_score.iloc[0]
best_score

Number of components-PCA    3.200000e+01
Number of components-MCA    1.000000e+01
Number of iterations        9.000000e+00
Alpha                       1.000000e-10
Validation scores           8.174067e+08
Name: 45339, dtype: float64

From the above results, the best score is given by 32 PCA components, 10 MCA components, 9 iterations for PCA and MCA, and an alpha of 1e-10. Let's now use these values on the train+valid dataset and evaluate on the test set.
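Selecting the best row of a score table can also be done with `idxmin`, which returns the index label of the smallest value. A minimal sketch on a toy table; the rows in `scores_demo` are made-up values, not the notebook's actual results:

```python
import pandas as pd

# Toy score table (hypothetical values for illustration)
scores_demo = pd.DataFrame(
    {
        "n_pca": [3, 3, 32],
        "alpha": [1e-10, 1.0, 1e-10],
        "mse": [1.06e9, 1.05e9, 8.17e8],
    }
)

# idxmin gives the label of the row with the smallest validation error
best_row = scores_demo.loc[scores_demo["mse"].idxmin()]

print(int(best_row["n_pca"]), best_row["alpha"], best_row["mse"])
```

As in the notebook, the selected row comes back as a float Series, so integer hyperparameters need to be cast with `int(...)` before being passed to a model.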


#### 1.5 Train+Valid and Test set <a class="anchor" id="1.5"></a>

##### 1.5.1 Rescaling the data - train+valid and test set <a class="anchor" id="1.5.1"></a>


In [25]:
#scaling the data
scaler = StandardScaler() # Instantiate
scaler.fit(X_train_valid_num) # Fitting the data
X_train_valid_numeric_S = pd.DataFrame(scaler.transform(X_train_valid_num)) # transforming the train+valid data
X_test_numeric_S = pd.DataFrame(scaler.transform(X_test_num)) # transforming the test set


##### 1.5.2 Perform PCA, MCA and Ridge regression on the train+valid using the best parameters and evaluate on the test set <a class="anchor" id="1.5.2"></a>

In [26]:
#Using the best hyperparameter set from above, fit and transform PCA on numerical training+validation data and fit and transform MCA on the categorical training+validation data. 

pca = prince.PCA(n_components=int(best_score['Number of components-PCA']),n_iter=int(best_score['Number of iterations']),random_state=42)
pca = pca.fit(X_train_valid_numeric_S)
X_train_valid_pca = pca.transform(X_train_valid_numeric_S)
X_train_valid_pca = X_train_valid_pca.reset_index()

mca = prince.MCA(n_components=int(best_score['Number of components-MCA']),n_iter=int(best_score['Number of iterations']),random_state=42)
mca = mca.fit(X_train_valid_cat) 
X_train_valid_mca = mca.transform(X_train_valid_cat)
X_train_valid_mca = X_train_valid_mca.reset_index()

#Combine the results into one dataframe    
X_train_valid_pmca = pd.concat([X_train_valid_pca,X_train_valid_mca],axis=1)

# Fit ridge regression on the reduced features for the alpha value as identified above      
lm = linear_model.Ridge(alpha = best_score['Alpha'])
lm.fit(X_train_valid_pmca, y_train_valid)
    
#Transform the test-numerical and categorical data using PCA and MCA respectively    
X_test_pca = pca.transform(X_test_numeric_S)
X_test_pca = X_test_pca.reset_index()
X_test_mca = mca.transform(X_test_cat)
X_test_mca = X_test_mca.reset_index()
    
#Combine the PCA and MCA results from above into one dataframe  
X_test_pmca = pd.concat([X_test_pca,X_test_mca],axis=1)

#Evaluate the results on the test set and find the MSE   
MSE_test = metrics.mean_squared_error(lm.predict(X_test_pmca),y_test)

In [27]:
print("The prediction error on the test set is : {}".format(MSE_test))

The prediction error on the test set is : 800322359.1532334


Let's now compare the above results with ridge regression on the original dataset, i.e. without dimension reduction.

#### 1.6 Ridge regression on the original dataset <a class="anchor" id="1.6"></a>

In [28]:
#Since we removed the variables with different levels, we will concat numerical and categorical from above to use the same features.
X_train = pd.concat([X_train_num,X_train_cat], axis=1)
X_valid = pd.concat([X_valid_num,X_valid_cat], axis=1)
X_train_valid = pd.concat([X_train_valid_num,X_train_valid_cat], axis=1)
X_test = pd.concat([X_test_num,X_test_cat], axis=1)

In [29]:
#Getting dummy variables for the categorical variables and dropping the first dummy variable
X_train = pd.get_dummies(X_train,drop_first=True)
X_valid = pd.get_dummies(X_valid,drop_first=True)
X_train_valid = pd.get_dummies(X_train_valid,drop_first=True)
X_test = pd.get_dummies(X_test,drop_first=True)

In [30]:
print('Shape of the datasets:', X_train_valid.shape, X_train.shape, X_valid.shape, X_test.shape)

Shape of the datasets: (1160, 59) (870, 59) (290, 59) (291, 59)


In [31]:
#scaling the data
scaler = StandardScaler() # Instantiate
scaler.fit(X_train) # Fitting the data
X_train = pd.DataFrame(scaler.transform(X_train)) # transforming the training data
X_valid = pd.DataFrame(scaler.transform(X_valid)) # transforming the validation data

In [32]:
Validation_Scores = []
# Perform ridge regression for every alpha value: fit on the training data, predict on the validation set and record the MSE
for a in alphas: # using the same alphas we defined above
    lm_ridge = linear_model.Ridge(alpha=a)
    lm_ridge.fit(X_train,y_train)
    Validation_Scores.append(metrics.mean_squared_error(y_valid, lm_ridge.predict(X_valid)))

In [33]:
minerror_M1 = min(Validation_Scores) # minimum validation MSE
besttrio_M1 = alphas[np.argmin(Validation_Scores)] # the alpha value that gives the least error

In [34]:
# Displaying the results for least error
bestparam = pd.DataFrame(index=['Alpha','Validation Error'],data=[besttrio_M1,minerror_M1])
bestparam = bestparam.rename(columns = {0:'Values'})
bestparam

Unnamed: 0,Values
Alpha,1e-10
Validation Error,786921200.0


The alpha value is the same as above. We will now use this alpha value, fit our ridge regression and evaluate it on the test set.

In [35]:
# Scaling the data
scaler = StandardScaler()
scaler.fit(X_train_valid)
X_train_valid = pd.DataFrame(scaler.transform(X_train_valid))
X_test = pd.DataFrame(scaler.transform(X_test)) # transforming the test set

In [36]:
# Refit model with train + validation set, perform prediction on test set
lm1 = linear_model.Ridge(alpha = float(bestparam.loc['Alpha', 'Values'])) # pass alpha as a scalar, not a Series
lm1.fit(X_train_valid, y_train_valid)

Ridge(alpha=1e-10)

In [37]:
M1_terror = metrics.mean_squared_error(lm1.predict(X_test), y_test)
print("The prediction error for the test set is : {}".format(M1_terror))

The prediction error for the test set is : 774598123.6178343


#### 1.7 Findings of Part I <a class="anchor" id="1.7"></a>

In [38]:
print("Prediction error on the test set- with PCA and MCA : {}".format(MSE_test))
print("Prediction error on the test set- without PCA and MCA : {}".format(M1_terror))

Prediction error on the test set- with PCA and MCA : 800322359.1532334
Prediction error on the test set- without PCA and MCA : 774598123.6178343


With all other settings held the same, the prediction error is lower without dimension reduction.
The tuned PCA and MCA together keep 42 components (32 + 10), whereas the dataset has 44 features (34 numerical and 10 categorical), so very little is actually reduced. Dimension reduction is most useful for high-dimensional data; with only 44 features it brought no benefit here. Alternatives such as random-forest feature importance or forward feature selection could also be tried and may give different results.
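To put the two test errors on the scale of the response, we can take square roots (RMSE is in the same units as the sale price) and compute the relative gap. The two MSE values below are copied from the output above; the variable names are chosen for this sketch only.

```python
import math

mse_reduced = 800322359.1532334   # test MSE with PCA and MCA
mse_original = 774598123.6178343  # test MSE without PCA and MCA

rmse_reduced = math.sqrt(mse_reduced)
rmse_original = math.sqrt(mse_original)

# Relative increase in MSE caused by the dimension reduction step
relative_gap = (mse_reduced - mse_original) / mse_original

print(f"RMSE with PCA and MCA    : {rmse_reduced:.0f}")
print(f"RMSE without PCA and MCA : {rmse_original:.0f}")
print(f"Relative MSE gap         : {relative_gap:.1%}")
```

The RMSE values come out at roughly 28,300 and 27,800, and the MSE gap at about 3%, so the two models are close even though the unreduced one is slightly better.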


# **Part II : Clustering Analysis** <a class="anchor" id="2"></a>

In the above section, we performed dimension reduction separately for numerical and categorical data. 
In this section, we perform clustering analysis on the same dataset using the `Gower Distance`, a metric that measures how different two records are by computing the dissimilarity for numerical and categorical features separately and then combining them into a single distance. The distance is always a number between 0 (identical) and 1 (maximally dissimilar).
The clustering analysis proceeds as follows:

1. Compute the Gower distance matrix `X_dist` on the predictor set.
2. Define the range of cluster counts to try.
3. Pick random initial medoids, one per cluster.
4. Apply K-medoids using the Gower distance matrix as input. K-medoids uses existing points from the input data as medoids.
5. Run the cluster analysis.
6. Use the `get_clusters` and `get_medoids` functions to retrieve the observations in each cluster and the medoids respectively.
7. Loop over the clusters to build a list `clustering_result` that records the cluster membership of each observation.
8. Bin the response variable into the same number of categories as there are clusters.
9. Compute and record the `Normalized Mutual Information` (a metric for clustering performance) between the clustering results and the binned categories. It normalizes the mutual information score so that results range between 0 (no mutual information) and 1 (perfect agreement).
10. Create a DataFrame to report the `NMI` score for each number of clusters.
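To make the Gower distance concrete, here is a hand-rolled sketch for a single pair of records. It assumes the common formulation: a range-normalized absolute difference for numerical features, a 0/1 mismatch for categorical features, and an unweighted average over all features. The two records and the feature ranges are hypothetical, and the `gower` package may differ in details such as missing-value handling.

```python
def gower_distance(rec_a, rec_b, numeric_ranges):
    """Gower distance between two records given as dicts.

    numeric_ranges maps each numerical feature to its (non-zero) range
    over the whole dataset; every other feature is treated as categorical.
    """
    parts = []
    for name in rec_a:
        if name in numeric_ranges:
            # Numerical feature: absolute difference scaled to [0, 1]
            parts.append(abs(rec_a[name] - rec_b[name]) / numeric_ranges[name])
        else:
            # Categorical feature: 0 if equal, 1 otherwise
            parts.append(0.0 if rec_a[name] == rec_b[name] else 1.0)
    return sum(parts) / len(parts)


# Hypothetical records and feature ranges, for illustration only
house_a = {"LotArea": 8450, "YrSold": 2008, "Street": "Pave", "LotShape": "Reg"}
house_b = {"LotArea": 9600, "YrSold": 2007, "Street": "Pave", "LotShape": "IR1"}
ranges_demo = {"LotArea": 10000, "YrSold": 4}

d_ab = gower_distance(house_a, house_b, ranges_demo)
print(d_ab)  # (0.115 + 0.25 + 0 + 1) / 4 = 0.34125
```

Because every per-feature term lies in [0, 1], the average also lies in [0, 1], which is what lets numerical and categorical features be mixed in one distance.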


#### 2.1 Compute gower distance on the predictor set <a class="anchor" id="2.1"></a>

In [39]:
# Compute the pairwise Gower distance matrix for the predictors
X_dist = gower.gower_matrix(X)

#### 2.2 Apply K-medoids using the gower distance matrix as input<a class="anchor" id="2.2"></a>

In [40]:
Correlation_score = [] # empty list to record NMI score
Clusters = [] # empty list to record cluster value

# Let's define the numbers of clusters that we want to create
K = np.arange(1,11,1)

# writing script to loop through different cluster values and evaluate NMI score for each value
for val in K:

    # Setting random initial medoids (sampled without replacement so that no medoid is repeated)
    center_index = np.random.choice(len(X_dist), size=val, replace=False)

    # Creating instance of K-Medoids algorithm
    kmedoids_instance = kmedoids(X_dist, center_index, data_type='distance_matrix')

    # Running cluster analysis to obtain results
    kmedoids_instance.process()
    clusters = kmedoids_instance.get_clusters()
    #medoids = kmedoids_instance.get_medoids()

    # creating a list 'clustering_result' that records the cluster membership of each observation
    clustering_result = [-1]*len(y) # creating a list of same length as that of response variable
    for c, members in enumerate(clusters): # 'c' is the cluster label, 'members' the row positions in that cluster
        for i in members:
            clustering_result[i] = c

    # Binning the response variable into the same number of quantile-based categories as there are clusters
    ground_truth = pd.qcut(y, q=val, labels=range(val))

    # recording the cluster value
    Clusters.append(val) 

    # Computing the normalized mutual information (NMI) between the clustering results and the binned categories
    Correlation_score.append(normalized_mutual_info_score(ground_truth, clustering_result))
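For intuition about what K-medoids does with a precomputed distance matrix, here is a minimal sketch of the simple alternating variant: assign each point to its nearest medoid, then move each medoid to the cluster member with the smallest total distance to the others. This is an illustration under that assumption and not necessarily the procedure `pyclustering` uses internally; the function names and the toy data are made up.

```python
def assign_to_medoids(dist, medoids):
    """Label each point with the position of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda k: dist[i][medoids[k]])
        for i in range(len(dist))
    ]


def k_medoids_sketch(dist, initial_medoids, max_iter=100):
    """Alternating K-medoids on a precomputed distance matrix (list of lists)."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign_to_medoids(dist, medoids)
        new_medoids = []
        for k, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == k]
            if not members:
                # Keep the old medoid if its cluster happens to be empty
                new_medoids.append(old)
                continue
            # New medoid: the member with the smallest total distance to its cluster
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign_to_medoids(dist, medoids)


# Toy 1-D data with two obvious groups (hypothetical values)
points = [0, 1, 2, 10, 11, 12]
dist_demo = [[abs(p - q) for q in points] for p in points]

medoids_demo, labels_demo = k_medoids_sketch(dist_demo, initial_medoids=[0, 1])
print(medoids_demo, labels_demo)  # [1, 4] [0, 0, 0, 1, 1, 1]
```

Even from a poor start with both initial medoids in the same group, the medoids move to the middle point of each group, and they are always actual observations rather than averages.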

#### 2.3 Comparing clustering results<a class="anchor" id="2.3"></a>



In [41]:
# creating a dataframe to evaluate the NMI score with respect to each cluster value
score = pd.DataFrame(Clusters)
score['NMI_Score'] = Correlation_score
score = score.rename(columns = {0:'Clusters'})
score

Unnamed: 0,Clusters,NMI_Score
0,1,1.0
1,2,0.394213
2,3,0.276993
3,4,0.236005
4,5,0.243624
5,6,0.224866
6,7,0.195392
7,8,0.18164
8,9,0.191165
9,10,0.185328


#### 2.4 Findings of part II <a class="anchor" id="2.4"></a>

From the above table, the NMI is exactly 1 when all the data is in one group. This is a trivial case: with a single cluster and a single bin both labelings are constant, so the score says nothing about clustering quality. As the number of clusters increases, the NMI score decreases most of the time. One possible reason is that the probabilities in NMI are approximated by frequencies, which introduces a finite-size effect that tends to favor a large number of partitions and is more pronounced in smaller datasets. In addition, as we increase the number of clusters, the members of each price class are split across more clusters, so the assignment becomes less complete and the NMI drops. The small size of the dataset, about 1,450 observations, could also contribute.
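The behavior of NMI at the extremes can be checked with a small standard-library implementation. This sketch assumes normalization by the arithmetic mean of the two entropies (to our understanding, the default in scikit-learn) and adopts the convention that two constant labelings score 1.0; the function names are chosen for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information between two labelings of the same observations."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(
        (c / n) * math.log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi_sketch(labels_a, labels_b):
    """NMI normalized by the arithmetic mean of the two entropies."""
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0 and h_b == 0:
        return 1.0  # convention assumed here: two constant labelings agree trivially
    return mutual_information(labels_a, labels_b) / ((h_a + h_b) / 2)


one_cluster = nmi_sketch([0] * 6, [0] * 6)             # single cluster vs single bin
relabelled = nmi_sketch([0, 0, 1, 1], [1, 1, 0, 0])    # same grouping, different names
independent = nmi_sketch([0, 0, 1, 1], [0, 1, 0, 1])   # groupings share no information
print(one_cluster, relabelled, independent)
```

The first case reproduces the score of 1 seen for a single cluster, the second shows that NMI ignores how clusters are named, and the third gives 0 because knowing one labeling tells us nothing about the other.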