<a href="https://colab.research.google.com/github/alicezil/38615-Lab-2/blob/main/Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#38615 Lab 2: Clustering

Tasks:
1. Load the data and explore it.
2. Perform k-means clustering. Determine the optimal number of clusters using, e.g., the elbow method.
3. Perform clustering with different clustering methods implemented in scikit-learn.
4. Now, try clustering with another distance metric (e.g. cosine, Jaccard). Hint: consider whether the default distance metric is appropriate for your data.
5. Visualize the results with a dimensionality reduction technique (UMAP or t-SNE), colored by cluster label.
6. Compare clustering results. Try to rationalize observed commonalities or differences with respect to clustering methods and distance metrics.

#1. Loading the Data and Exploring It

**1.1 Importing the necessary libraries**

In [7]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import scipy

from sklearn import manifold
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.metrics import accuracy_score

%matplotlib inline 
sns.set(color_codes=True)

**1.2 Loading the data into the dataframe and completing basic data preparation**

In [8]:
df = pd.read_csv("/content/Lab2_clustering_dataset.csv")
df.shape

(969, 1025)

In [9]:
# removing duplicate rows if there are any
df = df.drop_duplicates()    

# dropping the missing values
df = df.dropna()

df.shape

(969, 1025)
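
As an optional check, duplicates and missing values can also be counted directly instead of being inferred from the shape. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in for `df`), so its numbers do not come from the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df:
# row 2 repeats row 0, and row 3 contains one missing value
toy = pd.DataFrame({
    "D_0": [1, 0, 1, 1],
    "D_1": [0, 1, 0, np.nan],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
n_missing = int(toy.isna().sum().sum())     # NaN cells across the whole frame

# Same cleaning steps as in the cell above
cleaned = toy.drop_duplicates().dropna()
print(n_duplicates, n_missing, cleaned.shape)
```

Running the same two counts on `df` before the cleaning step would show whether any rows are actually affected.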

The shape is unchanged, so there were no duplicate rows or missing values. Let's take a look at the summary table to get a better sense of the data.

In [10]:
df_summary = df.describe()
df_summary.head(8) 

| | D_0 | D_1 | D_2 | D_3 | D_4 | D_5 | D_6 | D_7 | D_8 | D_9 | ... | D_1014 | D_1015 | D_1016 | D_1017 | D_1018 | D_1019 | D_1020 | D_1021 | D_1022 | D_1023 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | ... | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 |
| mean | 1.0 | 0.900929 | 1.0 | 1.0 | 0.684211 | 1.0 | 0.764706 | 0.68937 | 0.570691 | 0.373581 | ... | 0.860681 | 0.737874 | 1.0 | 0.881321 | 0.854489 | 0.681115 | 0.553148 | 0.809082 | 0.821465 | 0.912281 |
| std | 0.0 | 0.298912 | 0.0 | 0.0 | 0.46507 | 0.0 | 0.424402 | 0.46299 | 0.495233 | 0.484004 | ... | 0.346458 | 0.440018 | 0.0 | 0.323577 | 0.352797 | 0.466285 | 0.497424 | 0.393228 | 0.38316 | 0.283032 |
| min | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 50% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


Several columns have a standard deviation of 0, which means every row holds the same value (in this case 1.0). Such columns cannot help distinguish one data point from another, so they can be eliminated. Let's do so:

In [11]:
#removing the columns that contain the same value in every row (standard deviation of 0)
constant_cols = [col for col in df_summary.columns if df_summary.loc['std', col] == 0]
df = df.drop(columns=constant_cols)

df.shape

(969, 951)
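
A possible alternative, shown here only as a sketch, is to count distinct values per column rather than compare a floating-point standard deviation with 0. The frame `toy` and its column values are made up for illustration and are not taken from the lab dataset.

```python
import pandas as pd

# Hypothetical toy frame: D_0 and D_2 are constant, D_1 and D_3 vary
toy = pd.DataFrame({
    "D_0": [1, 1, 1, 1],
    "D_1": [0, 1, 1, 0],
    "D_2": [1, 1, 1, 1],
    "D_3": [0, 0, 1, 0],
})

# A column is constant when it holds at most one distinct value
constant_cols = toy.columns[toy.nunique() <= 1].tolist()

# Drop all constant columns in a single call
reduced = toy.drop(columns=constant_cols)
print(constant_cols, reduced.shape)
```

Counting distinct values also works for non-numeric columns, which `describe()` leaves out of its default summary.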

Next, let's construct a correlation matrix and remove highly correlated variables (absolute correlation above 0.95):

In [12]:
#create a matrix of absolute pairwise correlations
corr_matrix = df.corr().abs()

#mask only the diagonal (each column's correlation with itself);
#both triangles are kept, so both members of a highly correlated pair are flagged below
off_diagonal = corr_matrix.where(~np.eye(len(corr_matrix), dtype=bool))

#make a list of columns with correlation larger than .95
drop_list = []
for col in off_diagonal.columns:
    if any(off_diagonal[col] > 0.95):
        drop_list.append(col)

#drop all the columns from the list
df.drop(drop_list, axis=1, inplace=True)

df.shape

(969, 913)
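
Note that the cell above flags both members of every highly correlated pair, so neither column of such a pair is kept. A common variant looks only at the strict upper triangle, so that one representative of each pair survives. The sketch below illustrates that variant on made-up data (`toy` and its column names are hypothetical). Applied to `df`, it would drop fewer columns and therefore give a different shape from the one reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is an exact copy of "a", "c" is only weakly related
toy = pd.DataFrame({
    "a": [0, 1, 0, 1, 1, 0],
    "b": [0, 1, 0, 1, 1, 0],
    "c": [1, 1, 0, 0, 1, 0],
})

corr = toy.corr().abs()

# Keep only entries strictly above the diagonal, so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is flagged only if it correlates strongly with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Which column of a pair is kept depends on column order: the earlier one stays and the later one is dropped.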