# <font color='Blue'>Hierarchical Clustering for the Market Research Data on Electronics Purchases</font>

<b> Hierarchical clustering </b>

    Steps in executing hierarchical clustering
       1. Decide the number of clusters, k (parameter 'n_clusters')
       2. Decide the distance type using parameter 'affinity' (named 'metric' in scikit-learn 1.2 and later) - “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”
       3. Decide the linkage type using parameter 'linkage' - “ward”, “complete”, “average”, “single” (“ward” works only with “euclidean”)
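
The three decisions above can be sketched on toy data. This is a minimal illustration, not the notebook's own code: it assumes SciPy is installed and uses `scipy.cluster.hierarchy` rather than scikit-learn, and the synthetic points in `X` are hypothetical. It shows that the distance and linkage choices determine the merge tree, while k only decides where the tree is cut.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three well-separated groups of 2-D points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Steps 2 and 3: distance type and linkage type define the merge tree.
# Ward linkage is only defined for Euclidean distance.
Z = linkage(X, method="ward", metric="euclidean")

# Step 1: k is applied last, by cutting the same tree into k flat clusters.
labels_k3 = fcluster(Z, t=3, criterion="maxclust")
labels_k2 = fcluster(Z, t=2, criterion="maxclust")

# fcluster labels start at 1, so drop the empty count for label 0.
print(np.bincount(labels_k3)[1:])  # cluster sizes for k = 3
print(np.bincount(labels_k2)[1:])  # cluster sizes for k = 2
```

Because the tree is built once, solutions for several values of k can be compared without refitting.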

# <font color='Blue'>Loading Libraries</font>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

## <font color='Blue'>1.0 Loading Data</font>

In [2]:
mktres = pd.read_csv( "research_data.csv" )
# Work on a copy so that the raw data in mktres stays unchanged
data = mktres.copy()
data.head(10)

,ID,Gender,Marital_Status,Annual_Income,Age,Monthly_Electronics_Spend,Purchasing_Frequency,Technology_Adoption,Viewing_hours_day
0,1,male,married,49,30,35,13,late,2
1,2,male,single,46,36,35,26,late,10
2,3,male,married,58,66,64,13,early,0
3,4,male,married,51,78,33,22,late,5
4,5,female,single,46,52,45,47,late,2
5,6,female,married,31,72,14,32,early,1
6,7,male,married,33,62,18,41,early,0
7,8,male,married,29,30,23,9,early,1
8,9,male,married,57,60,74,1,early,0
9,10,female,married,30,59,16,25,early,0
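
When keeping a second dataframe, it helps to remember that plain assignment in pandas only creates a second name for the same object, while `.copy()` creates an independent dataframe. A small sketch on a hypothetical two-row frame (the names `raw`, `alias` and `parked` are illustrative only):

```python
import pandas as pd

# Hypothetical two-row frame standing in for the loaded data.
raw = pd.DataFrame({"Monthly_Electronics_Spend": [35, 64]})

alias = raw          # same object under a second name
parked = raw.copy()  # independent copy

# A column added through the alias also appears in raw, but not in the copy.
alias["Annual_Electronics_Spend"] = alias["Monthly_Electronics_Spend"] * 12

print(alias is raw)
print(list(raw.columns))
print(list(parked.columns))
```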


## <font color='Blue'>1.1 Get the column names</font>

In [3]:
columns = list(data.columns) 
print(columns)
print("")

['ID', 'Gender', 'Marital_Status', 'Annual_Income', 'Age', 'Monthly_Electronics_Spend', 'Purchasing_Frequency', 'Technology_Adoption', 'Viewing_hours_day']



## <font color='Blue'>1.2 Adding derived data</font>

In [4]:
data['Annual_Electronics_Spend'] = data['Monthly_Electronics_Spend']*12
# Number of rows and columns
print("#Rows and #Columns",data.shape)
print("")

#Rows and #Columns (1000, 10)



## <font color='Blue'>1.3 Drop columns not needed for clustering</font>

In [5]:
data = data.drop(['ID','Monthly_Electronics_Spend'],axis=1)
print("#Rows and #Columns",data.shape)
print("")
columns = list(data.columns) 
print(columns)

#Rows and #Columns (1000, 8)

['Gender', 'Marital_Status', 'Annual_Income', 'Age', 'Purchasing_Frequency', 'Technology_Adoption', 'Viewing_hours_day', 'Annual_Electronics_Spend']


## <font color='Blue'>1.4 Dummy Coding Variables</font>

In [6]:
dummy      = ['Gender', 'Marital_Status', 'Technology_Adoption']
dummydata  = pd.get_dummies(data, columns=dummy)
dummydata.head()

# Shape and column names after dummy coding
print("#Rows and #Columns",dummydata.shape)
print("")
columns = list(dummydata.columns) 
print(columns)

#Rows and #Columns (1000, 11)

['Annual_Income', 'Age', 'Purchasing_Frequency', 'Viewing_hours_day', 'Annual_Electronics_Spend', 'Gender_female', 'Gender_male', 'Marital_Status_married', 'Marital_Status_single', 'Technology_Adoption_early', 'Technology_Adoption_late']
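
To see what dummy coding does, here is a small sketch on a hypothetical three-row frame. Each categorical column is replaced by one 0/1 column per category, named `<column>_<category>`. Passing `dtype=int` is an assumption made here so that the result is 0/1 regardless of the pandas version's default dtype.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Gender": ["male", "female", "male"],
    "Technology_Adoption": ["late", "early", "early"],
    "Age": [30, 52, 66],
})

# One indicator column per category; untouched columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["Gender", "Technology_Adoption"], dtype=int)

print(list(encoded.columns))
print(encoded)
```

Note that the two indicator columns of a binary variable always sum to 1, so one of them is redundant; `pd.get_dummies` has a `drop_first=True` option for cases where that matters.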


## <font color='Blue'>1.5 Examining Data</font>

In [7]:
dummydata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Annual_Income              1000 non-null   int64
 1   Age                        1000 non-null   int64
 2   Purchasing_Frequency       1000 non-null   int64
 3   Viewing_hours_day          1000 non-null   int64
 4   Annual_Electronics_Spend   1000 non-null   int64
 5   Gender_female              1000 non-null   uint8
 6   Gender_male                1000 non-null   uint8
 7   Marital_Status_married     1000 non-null   uint8
 8   Marital_Status_single      1000 non-null   uint8
 9   Technology_Adoption_early  1000 non-null   uint8
 10  Technology_Adoption_late   1000 non-null   uint8
dtypes: int64(5), uint8(6)
memory usage: 45.0 KB


In [8]:
dummydata.head()

,Annual_Income,Age,Purchasing_Frequency,Viewing_hours_day,Annual_Electronics_Spend,Gender_female,Gender_male,Marital_Status_married,Marital_Status_single,Technology_Adoption_early,Technology_Adoption_late
0,49,30,13,2,420,0,1,1,0,0,1
1,46,36,26,10,420,0,1,0,1,0,1
2,58,66,13,0,768,0,1,1,0,1,0
3,51,78,22,5,396,0,1,1,0,0,1
4,46,52,47,2,540,1,0,0,1,0,1


## <font color='Blue'>1.6 Normalizing Non-Categorical Variables</font>

In [9]:
scaler = StandardScaler()
# Standardize each numeric column to mean 0 and unit variance
# (StandardScaler treats every column independently, so one call is enough).
# Note: Annual_Electronics_Spend is not in this list, so it stays on its original
# scale and weighs far more in the distance calculations than the other columns.
num_cols = ["Annual_Income", "Age", "Purchasing_Frequency", "Viewing_hours_day"]
dummydata[num_cols] = scaler.fit_transform(dummydata[num_cols])
dummydata.head()

,Annual_Income,Age,Purchasing_Frequency,Viewing_hours_day,Annual_Electronics_Spend,Gender_female,Gender_male,Marital_Status_married,Marital_Status_single,Technology_Adoption_early,Technology_Adoption_late
0,0.322666,-1.027302,-0.719876,-0.1272,420,0,1,1,0,0,1
1,0.225798,-0.69129,0.217275,2.623078,420,0,1,0,1,0,1
2,0.613268,0.988773,-0.719876,-0.81477,768,0,1,1,0,1,0
3,0.387244,1.660798,-0.071079,0.904154,396,0,1,1,0,0,1
4,0.225798,0.204744,1.731134,-0.1272,540,1,0,0,1,0,1
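
In the table above, `Annual_Electronics_Spend` is still on its original scale (values in the hundreds), while the four scaled columns are of order 1. The sketch below uses only the five numeric values of rows 0 and 2 from that table (the dummy columns are left out for simplicity) to show how much of the squared Euclidean distance between the two rows comes from that single column.

```python
import numpy as np

# Rows 0 and 2 of the table above, numeric columns only:
# Annual_Income, Age, Purchasing_Frequency, Viewing_hours_day, Annual_Electronics_Spend
row0 = np.array([0.322666, -1.027302, -0.719876, -0.1272, 420.0])
row2 = np.array([0.613268, 0.988773, -0.719876, -0.81477, 768.0])

# Per-column contribution to the squared Euclidean distance.
sq_diff = (row0 - row2) ** 2
spend_share = sq_diff[-1] / sq_diff.sum()

print(f"Share of squared distance from Annual_Electronics_Spend: {spend_share:.5f}")
```

If the intent is for all numeric variables to contribute comparably, this column would need to be standardized as well; doing so would change the cluster solutions reported below.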


## <font color='Blue'>2.0 Generating Hierarchical Clustering Solutions</font>

### <font color='Blue'>2.1 Generate 3-, 4-, 5- and 6-cluster solutions</font>

    Let's use different combinations of distance and linkage types
    For each combination, we generate solutions with 3 to 6 clusters
    Then we observe the cluster sizes and, based on them, decide which solution to retain
    Judging by the sizes, the Euclidean/Ward 3-cluster solution looks acceptable

In [10]:
# Note: scikit-learn 1.2 renamed 'affinity' to 'metric', and 1.4 removed the old name;
# on those versions, pass metric='euclidean' instead.
clusterid3 = AgglomerativeClustering(n_clusters=3,affinity='euclidean',linkage='ward').fit(dummydata).labels_
clusterid4 = AgglomerativeClustering(n_clusters=4,affinity='euclidean',linkage='ward').fit(dummydata).labels_
clusterid5 = AgglomerativeClustering(n_clusters=5,affinity='euclidean',linkage='ward').fit(dummydata).labels_
clusterid6 = AgglomerativeClustering(n_clusters=6,affinity='euclidean',linkage='ward').fit(dummydata).labels_

# You may try other combinations not listed here.

# Solution sizes to compare (distance, linkage, number of clusters)

# Euclidean Ward     3 , Euclidean Ward     4
# Euclidean complete 3 , Euclidean complete 4
# Euclidean average  3 , Euclidean average  4
# Euclidean single   3 , Euclidean single   4

# Manhattan complete 3 , Manhattan complete 4
# Manhattan average  3 , Manhattan average  4
# Manhattan single   3 , Manhattan single   4
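
The combinations listed above can be run in a loop rather than one at a time. The following is a sketch under stated assumptions: the helper `cluster_sizes`, the constant `DIST_KW` and the random matrix `X_demo` are hypothetical names introduced here, and `X_demo` is only a stand-in for the prepared feature matrix. The keyword for the distance type is looked up at run time because it is `affinity` in older scikit-learn releases and `metric` in newer ones.

```python
import inspect
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Pick whichever distance keyword this scikit-learn installation accepts.
params = inspect.signature(AgglomerativeClustering.__init__).parameters
DIST_KW = "metric" if "metric" in params else "affinity"

def cluster_sizes(X, k, distance, linkage):
    """Fit one solution and return its cluster sizes, largest first."""
    model = AgglomerativeClustering(n_clusters=k, linkage=linkage,
                                    **{DIST_KW: distance})
    labels = model.fit_predict(X)
    return sorted(np.bincount(labels).tolist(), reverse=True)

# Hypothetical stand-in for the prepared feature matrix.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))

# Ward is paired with Euclidean only, since it is not defined for other distances.
combos = [("euclidean", "ward"), ("euclidean", "complete"),
          ("euclidean", "average"), ("euclidean", "single"),
          ("manhattan", "complete"), ("manhattan", "average"),
          ("manhattan", "single")]

for distance, linkage in combos:
    for k in (3, 4):
        print(distance, linkage, k, cluster_sizes(X_demo, k, distance, linkage))
```

Solutions in which one cluster holds nearly all observations and the others only a handful are usually the ones to discard when judging by size.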

###  <font color='Blue'>2.2 Assign Cluster Labels</font>

In [11]:
data["clusterid3"] = clusterid3
data["clusterid4"] = clusterid4
data["clusterid5"] = clusterid5
data["clusterid6"] = clusterid6
cluster_size3 = data.groupby(['clusterid3']).size() 
cluster_size4 = data.groupby(['clusterid4']).size() 
cluster_size5 = data.groupby(['clusterid5']).size() 
cluster_size6 = data.groupby(['clusterid6']).size() 
print(cluster_size3)
print("")
print(cluster_size4)
print("")
print(cluster_size5)
print("")
print(cluster_size6)

clusterid3
0    145
1    498
2    357
dtype: int64

clusterid4
0    357
1    498
2     62
3     83
dtype: int64

clusterid5
0    498
1    185
2     62
3     83
4    172
dtype: int64

clusterid6
0     62
1    185
2    331
3     83
4    172
5    167
dtype: int64


###  <font color='Blue'>2.3 Performance Measure: Silhouette Score</font>

In [12]:
print("Silhouette Coefficient: %0.3f"% metrics.silhouette_score(dummydata, clusterid3))
print("Silhouette Coefficient: %0.3f"% metrics.silhouette_score(dummydata, clusterid4))
print("Silhouette Coefficient: %0.3f"% metrics.silhouette_score(dummydata, clusterid5))
print("Silhouette Coefficient: %0.3f"% metrics.silhouette_score(dummydata, clusterid6))
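# The silhouette score lies between -1 and 1; higher values mean better-separated clusters.
# The four lines printed are for the 3-, 4-, 5- and 6-cluster solutions, in that order.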

Silhouette Coefficient: 0.696
Silhouette Coefficient: 0.665
Silhouette Coefficient: 0.655
Silhouette Coefficient: 0.541
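
For reference, the silhouette value of a single point is s = (b - a) / max(a, b), where a is its mean distance to the other points of its own cluster and b is its mean distance to the points of the nearest other cluster; the reported score is the mean of s over all points. The sketch below is a plain NumPy illustration of that definition on four hypothetical 1-D points. The function name `silhouette_mean` is made up for this example, and it is not meant to replace `metrics.silhouette_score`.

```python
import numpy as np

def silhouette_mean(X, labels):
    """Mean silhouette value using Euclidean distance (illustrative only)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is excluded via the -1).
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical toy data: two tight pairs far apart.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_mean(X_toy, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_mean(X_toy, [0, 1, 0, 1])   # pairs split across clusters
print(good, bad)
```

A grouping that keeps nearby points together scores close to 1, while one that splits them scores below 0.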


###  <font color='Blue'>2.4 Performance Measure: Calinski-Harabasz Index</font>

In [13]:
print("Calinski-Harabasz index: %0.3f"% metrics.calinski_harabasz_score(dummydata, clusterid3))
print("Calinski-Harabasz index: %0.3f"% metrics.calinski_harabasz_score(dummydata, clusterid4))
print("Calinski-Harabasz index: %0.3f"% metrics.calinski_harabasz_score(dummydata, clusterid5))
print("Calinski-Harabasz index: %0.3f"% metrics.calinski_harabasz_score(dummydata, clusterid6))

Calinski-Harabasz index: 4269.508
Calinski-Harabasz index: 4328.010
Calinski-Harabasz index: 5706.676
Calinski-Harabasz index: 6018.239
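
The Calinski-Harabasz index is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom: CH = [B / (k - 1)] / [W / (n - k)]. Higher is better, and unlike the silhouette score it has no upper bound. A minimal NumPy sketch of the definition on hypothetical toy data follows; the function name `calinski_harabasz` is chosen for this example only.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index computed from its definition (illustrative only)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    between = 0.0  # B: size-weighted squared distance of centroids to the overall mean
    within = 0.0   # W: squared distance of points to their own centroid
    for c in np.unique(labels):
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy data: two tight pairs far apart.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
ch_good = calinski_harabasz(X_toy, [0, 0, 1, 1])
ch_bad = calinski_harabasz(X_toy, [0, 1, 0, 1])
print(ch_good, ch_bad)
```

In the outputs above the two measures disagree: the silhouette score is highest for the 3-cluster solution, while the Calinski-Harabasz index is highest for the 6-cluster solution, so neither should be read in isolation.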


<b> In the 5-cluster solution above, the smallest cluster (label 2, 62 observations) can be omitted, so that the result is treated as a four-cluster solution </b>

## <font color='Blue'>3.0 Examining Characteristics</font>

In [14]:
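# Annual_Electronics_Spend is listed in values but has no entry in aggfunc,
# so it does not appear in the resulting table.
values=['Annual_Income','Age','Purchasing_Frequency','Viewing_hours_day','Annual_Electronics_Spend']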
index =['clusterid4']
aggfunc={'Annual_Income': np.mean,
         'Age': np.mean,
         'Purchasing_Frequency':np.mean,
         'Viewing_hours_day':np.mean}
result = pd.pivot_table(data,values=values,
                             index =index,
                             aggfunc=aggfunc,
                             fill_value=0)
result['cluster_size'] = cluster_size4
result = result.round(2)
result

clusterid4,Age,Annual_Income,Purchasing_Frequency,Viewing_hours_day,cluster_size
0,42.01,42.04,25.42,4.46,357
1,52.0,29.97,24.73,1.03,498
2,51.63,66.03,8.39,1.05,62
3,51.16,60.01,12.96,2.43,83
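
The same kind of cluster profile can also be produced with `groupby` and named aggregation, which computes the cluster size in the same call. This is an alternative sketch on a hypothetical toy frame, not the notebook's own approach; the column name `clusterid` and the values are made up.

```python
import pandas as pd

# Hypothetical toy frame: a cluster label plus two numeric variables.
toy = pd.DataFrame({
    "clusterid": [0, 0, 1, 1, 1],
    "Age": [30, 40, 60, 62, 64],
    "Annual_Income": [50, 54, 30, 31, 32],
})

# One row per cluster: the mean of each variable and the number of members.
profile = toy.groupby("clusterid").agg(
    Age=("Age", "mean"),
    Annual_Income=("Annual_Income", "mean"),
    cluster_size=("Age", "size"),
).round(2)

print(profile)
```

Profiles are easier to interpret when they are computed on the original, unscaled values, as is done with `data` in the cell above.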


###  <font color='Blue'>3.1 Examining Characteristics - Cont'd</font>

In [15]:
dummydata['clusterid5'] = clusterid5
values=['Gender_female','Gender_male','Marital_Status_married','Marital_Status_single']
index =['clusterid5']
aggfunc={'Gender_female': np.mean,
         'Gender_male': np.mean,
         'Marital_Status_married':np.mean,
         'Marital_Status_single':np.mean}
result = pd.pivot_table(dummydata,values=values,
                             index =index,
                             aggfunc=aggfunc,
                             fill_value=0)
result['cluster_size'] = cluster_size5
result = result.round(2)
result

clusterid5,Gender_female,Gender_male,Marital_Status_married,Marital_Status_single,cluster_size
0,0.48,0.52,0.8,0.2,498
1,0.5,0.5,0.82,0.18,185
2,0.26,0.74,0.69,0.31,62
3,0.25,0.75,0.86,0.14,83
4,0.55,0.45,0.33,0.67,172


###  <font color='Blue'>3.2 Examining Characteristics - Cont'd</font>

In [16]:
dummydata['clusterid5'] = clusterid5
values=['Technology_Adoption_early','Technology_Adoption_late']
index =['clusterid5']
aggfunc={'Technology_Adoption_early': np.mean,
         'Technology_Adoption_late': np.mean}
result = pd.pivot_table(dummydata,values=values,
                             index =index,
                             aggfunc=aggfunc,
                             fill_value=0)
result['cluster_size'] = cluster_size5
result = result.round(2)
result

clusterid5,Technology_Adoption_early,Technology_Adoption_late,cluster_size
0,1.0,0.0,498
1,0.26,0.74,185
2,1.0,0.0,62
3,0.78,0.22,83
4,0.74,0.26,172
