**Part 3: Clustering with Frequent Itemsets**

In this notebook, we run two clustering algorithms (K-Means and agglomerative clustering) on two separate datasets, after generating frequent itemsets from each dataset with the Apriori implementation in the efficient_apriori package.

Both datasets come with ground-truth class labels (binary, 0 or 1, for the first; four vehicle classes for the second), which is why we pick clustering algorithms where the number of clusters has to be pre-specified. The datasets are linked below.

Dataset 1: https://archive.ics.uci.edu/dataset/267/banknote+authentication

Dataset 2: https://archive.ics.uci.edu/dataset/149/statlog+vehicle+silhouettes

In [12]:
!pip install efficient_apriori
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder
from mpl_toolkits.mplot3d import Axes3D
import networkx as nx
from efficient_apriori import apriori as ap
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score




In [2]:
df = pd.read_csv("/content/data_banknote_authentication.csv", header=None)
df.columns = ['variance','skewness','curtosis','entropy','class'] # Rename columns for clarity
true_labels = df['class'] # Ground truth
df.head()


Unnamed: 0,variance,skewness,curtosis,entropy,class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


In [3]:
df = df.drop("class", axis=1) # Drop labels for clustering


In the following cell, we convert the dataset to categorical form to make it suitable for the Apriori algorithm. We do this with the pandas cut() function, which, with bins=3, splits each column's observed range into three equal-width intervals labeled 'low', 'medium', and 'high'. We then one-hot encode this new dataset so that each row represents a basket: items with value 1 are present in the basket and items with value 0 are not.

In [4]:
df['variance'] = pd.cut(df['variance'], bins=3, labels=['low', 'medium', 'high'])
df['skewness'] = pd.cut(df['skewness'], bins=3, labels=['low', 'medium', 'high'])
df['curtosis'] = pd.cut(df['curtosis'], bins=3, labels=['low', 'medium', 'high'])
df['entropy'] = pd.cut(df['entropy'], bins=3, labels=['low', 'medium', 'high'])

# Convert the columns to categorical
df = df.astype('category')

# Use one-hot encoding to convert categorical variables into binary format
df_encoded = pd.get_dummies(df)

# Each basket holds the one-hot columns that are active in that row
baskets = [tuple(row.index[row == 1]) for _, row in df_encoded.iterrows()]

# Get frequent itemsets using efficient_apriori (the returned rules are not used)
min_support = 0.1
itemsets, _ = ap(baskets, min_support=min_support)

# Flatten {length: {itemset: count}} into {itemset: count}.
# ap() has already applied min_support, so no further filtering is needed.
frequent_itemsets = {itemset: count for length, items in itemsets.items() for itemset, count in items.items()}

# Create a binary-encoded dataset based on frequent itemsets
encoded_dataset = pd.DataFrame(0, columns=[str(itemset) for itemset in frequent_itemsets.keys()], index=df_encoded.index)

for idx, basket in enumerate(baskets):
    for itemset in frequent_itemsets.keys():
        if set(itemset).issubset(basket):
            encoded_dataset.loc[idx, str(itemset)] = 1
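
The binning and basket construction used in this cell can be illustrated on a toy column. This is a minimal sketch: the values in `toy` are made up for illustration and are not taken from the banknote data.

```python
import pandas as pd

# Toy column standing in for one numeric feature (values are invented).
toy = pd.DataFrame({"variance": [-6.0, -1.0, 0.5, 3.0, 6.0]})

# bins=3 splits the observed range [min, max] into three equal-width intervals.
toy["variance"] = pd.cut(toy["variance"], bins=3, labels=["low", "medium", "high"])

# One-hot encode, then read each row's active columns as its basket.
onehot = pd.get_dummies(toy)
toy_baskets = [
    tuple(onehot.columns[row.to_numpy(dtype=bool)]) for _, row in onehot.iterrows()
]
print(toy_baskets)
```

Because every row falls into exactly one bin per column, each basket here contains exactly one item per original feature.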


In [5]:
display(encoded_dataset)


Unnamed: 0,"('variance_high',)","('skewness_high',)","('curtosis_low',)","('entropy_high',)","('entropy_medium',)","('skewness_medium',)","('variance_medium',)","('curtosis_medium',)","('skewness_low',)","('variance_low',)",...,"('curtosis_low', 'entropy_medium', 'skewness_high')","('curtosis_low', 'entropy_medium', 'variance_medium')","('curtosis_low', 'skewness_high', 'variance_high')","('curtosis_low', 'skewness_high', 'variance_medium')","('curtosis_low', 'skewness_medium', 'variance_medium')","('curtosis_medium', 'entropy_high', 'skewness_medium')","('curtosis_medium', 'entropy_high', 'variance_medium')","('entropy_high', 'skewness_medium', 'variance_high')","('entropy_high', 'skewness_medium', 'variance_medium')","('curtosis_low', 'entropy_high', 'skewness_medium', 'variance_medium')"
0,1,1,1,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,1,1,1,0,1,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
2,1,0,1,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,1,1,1,0,1,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
4,0,0,0,1,0,1,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1367,0,0,1,1,0,1,1,0,0,0,...,0,0,0,0,1,0,0,0,1,1
1368,0,0,0,1,0,0,1,1,1,0,...,0,0,0,0,0,0,1,0,0,0
1369,0,0,0,0,1,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1370,0,0,0,0,1,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# Scale data for clustering
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(encoded_dataset)


In [10]:
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(scaled_data)

# Add cluster labels to the original DataFrame
df['KMeansCluster'] = clusters

ari_kmeans = adjusted_rand_score(true_labels, df['KMeansCluster'])
nmi_kmeans = normalized_mutual_info_score(true_labels, df['KMeansCluster'])
fmi_kmeans = fowlkes_mallows_score(true_labels, df['KMeansCluster'])

print("KMeans Clustering Results:")
print("ARI:", ari_kmeans)
print("NMI:", nmi_kmeans)
print("FMI:", fmi_kmeans)


KMeans Clustering Results:
ARI: 0.13657394120799035
NMI: 0.15119767387982547
FMI: 0.5915766449301924


In [9]:
agg_clustering = AgglomerativeClustering(n_clusters=2)
clusters = agg_clustering.fit_predict(scaled_data)

# Add cluster labels to the original DataFrame
df['AggCluster'] = clusters

ari_agg = adjusted_rand_score(true_labels, df['AggCluster'])
nmi_agg = normalized_mutual_info_score(true_labels, df['AggCluster'])
fmi_agg = fowlkes_mallows_score(true_labels, df['AggCluster'])

print("\nAgglomerative Clustering Results:")
print("ARI:", ari_agg)
print("NMI:", nmi_agg)
print("FMI:", fmi_agg)


Agglomerative Clustering Results:
ARI: 0.05011456650653853
NMI: 0.03154611184871104
FMI: 0.547167464452367
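
When reading these scores, it may help to keep two properties of the metrics in mind: cluster IDs are arbitrary, so a labeling with 0 and 1 swapped scores the same, and ARI is adjusted for chance while FMI is not. The sketch below illustrates both on synthetic labels; the arrays `truth`, `flipped`, and `random_guess` are hypothetical and unrelated to the datasets in this notebook.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

rng = np.random.default_rng(0)

# Synthetic balanced binary ground truth (not the banknote labels).
truth = np.repeat([0, 1], 1000)

# Same partition with the two cluster IDs swapped.
flipped = 1 - truth

# Labels drawn independently of the ground truth.
random_guess = rng.integers(0, 2, size=truth.size)

ari_flipped = adjusted_rand_score(truth, flipped)
ari_random = adjusted_rand_score(truth, random_guess)
fmi_random = fowlkes_mallows_score(truth, random_guess)

print("ARI, swapped IDs:  ", ari_flipped)
print("ARI, random labels:", ari_random)
print("FMI, random labels:", fmi_random)
```

With two balanced classes, a random labeling lands near 0 on ARI but near 0.5 on FMI, so FMI values in that range are not by themselves evidence of agreement with the labels.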



We now repeat this with another dataset, vehicle.csv, which contains numeric shape features for different types of vehicles.

In [17]:
df2 = pd.read_csv("/content/vehicle.csv")
display(df2.head())


Unnamed: 0,COMPACTNESS,CIRCULARITY,'DISTANCE CIRCULARITY','RADIUS RATIO','PR.AXIS ASPECT RATIO','MAX.LENGTH ASPECT RATIO','SCATTER RATIO',ELONGATEDNESS,'PR.AXIS RECTANGULARITY','MAX.LENGTH RECTANGULARITY','SCALED VARIANCE_MAJOR','SCALED VARIANCE_MINOR','SCALED RADIUS OF GYRATION','SKEWNESS ABOUT_MAJOR','SKEWNESS ABOUT_MINOR','KURTOSIS ABOUT_MAJOR','KURTOSIS ABOUT_MINOR','HOLLOWS RATIO',Class
0,95,48,83,178,72,10,162,42,20,159,176,379,184,70,6,16,187,197,van
1,91,41,84,141,57,9,149,45,19,143,170,330,158,72,9,14,189,199,van
2,104,50,106,209,66,10,207,32,23,158,223,635,220,73,14,9,188,196,saab
3,93,41,82,159,63,9,144,46,19,143,160,309,127,63,6,10,199,207,van
4,85,44,70,205,103,52,149,45,19,144,241,325,188,127,9,11,180,183,bus


In the next cell, we map each of the four classes (labels) to integer values for comparing clustering results.

In [18]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df2['Class_encoded'] = label_encoder.fit_transform(df2['Class'])

true_labels2 = df2['Class_encoded']

category_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

# Show how each class label is encoded. We map the classes to integers here for simplicity instead of using one-hot encoding
print(category_mapping)

df2 = df2.drop(['Class', 'Class_encoded'], axis=1) # Drop ground truth labels for clustering
display(df2.head())

{'bus': 0, 'opel': 1, 'saab': 2, 'van': 3}



Unnamed: 0,COMPACTNESS,CIRCULARITY,'DISTANCE CIRCULARITY','RADIUS RATIO','PR.AXIS ASPECT RATIO','MAX.LENGTH ASPECT RATIO','SCATTER RATIO',ELONGATEDNESS,'PR.AXIS RECTANGULARITY','MAX.LENGTH RECTANGULARITY','SCALED VARIANCE_MAJOR','SCALED VARIANCE_MINOR','SCALED RADIUS OF GYRATION','SKEWNESS ABOUT_MAJOR','SKEWNESS ABOUT_MINOR','KURTOSIS ABOUT_MAJOR','KURTOSIS ABOUT_MINOR','HOLLOWS RATIO'
0,95,48,83,178,72,10,162,42,20,159,176,379,184,70,6,16,187,197
1,91,41,84,141,57,9,149,45,19,143,170,330,158,72,9,14,189,199
2,104,50,106,209,66,10,207,32,23,158,223,635,220,73,14,9,188,196
3,93,41,82,159,63,9,144,46,19,143,160,309,127,63,6,10,199,207
4,85,44,70,205,103,52,149,45,19,144,241,325,188,127,9,11,180,183


Now, we binarize our dataframe to make it suitable for the Apriori algorithm. For each column, a row gets the value 1 if its entry is greater than or equal to the column's mean and 0 otherwise. This serves as an indicator of whether each column is present in a "basket", i.e., the columns are the items and the rows are the baskets.

In [19]:
column_averages = df2.mean()

for col in df2.columns:
    df2[col] = (df2[col] >= column_averages[col]).astype(int)

display(df2)


Unnamed: 0,COMPACTNESS,CIRCULARITY,'DISTANCE CIRCULARITY','RADIUS RATIO','PR.AXIS ASPECT RATIO','MAX.LENGTH ASPECT RATIO','SCATTER RATIO',ELONGATEDNESS,'PR.AXIS RECTANGULARITY','MAX.LENGTH RECTANGULARITY','SCALED VARIANCE_MAJOR','SCALED VARIANCE_MINOR','SCALED RADIUS OF GYRATION','SKEWNESS ABOUT_MAJOR','SKEWNESS ABOUT_MINOR','KURTOSIS ABOUT_MAJOR','KURTOSIS ABOUT_MINOR','HOLLOWS RATIO'
0,1,1,1,1,1,1,0,1,0,1,0,0,1,0,0,1,0,1
1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,1,1,1
2,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,0,1
3,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,1
4,0,0,0,1,1,1,0,1,0,0,1,0,1,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
841,0,0,1,1,1,0,1,0,0,0,1,0,0,0,1,1,0,0
842,0,1,1,0,1,1,0,1,0,1,0,0,1,0,0,1,0,1
843,1,1,1,1,1,1,1,0,1,1,1,1,1,0,0,0,0,1
844,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0
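
The mean-threshold binarization above can be checked by hand on a small frame. This is a sketch on invented values; the column names `A` and `B` are placeholders, not columns of vehicle.csv.

```python
import pandas as pd

# Toy numeric frame (values are invented for illustration).
toy = pd.DataFrame({"A": [1, 2, 9], "B": [10, 20, 30]})

# Column means are A=4 and B=20; entries at or above the mean become 1.
binary = (toy >= toy.mean()).astype(int)

# Each row's basket is the set of columns flagged with 1.
toy_baskets = [set(binary.columns[binary.loc[i] == 1]) for i in binary.index]

print(binary)
print(toy_baskets)
```

Note that a row whose entries are all below their column means yields an empty basket, as the first toy row does here.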


Next, we mine the frequent itemsets and build a new dataframe with one binary indicator column per frequent itemset, which is what we cluster on.

In [20]:
# Run efficient_apriori

baskets = [set(df2.columns[df2.loc[idx] == 1]) for idx in range(len(df2))]
min_support = 0.1
itemsets, _ = ap(baskets, min_support=min_support)

# Flatten {length: {itemset: count}} into {itemset: count}; ap() has already applied min_support
frequent_itemsets_dict = {itemset: count for length, items in itemsets.items() for itemset, count in items.items()}

# Create a binary-encoded dataset based on frequent itemsets
encoded_dataset2 = pd.DataFrame(0, columns=[str(itemset) for itemset in frequent_itemsets_dict.keys()], index=df2.index)

for idx, basket in enumerate(baskets):
    for itemset, support in frequent_itemsets_dict.items():
        if all(item in basket for item in itemset):
            encoded_dataset2.loc[idx, str(itemset)] = 1
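
The nested loop above sets one cell at a time, which can be slow when there are many itemsets. A possible vectorized alternative is sketched below on a toy binary frame. The itemsets in `toy_itemsets` are hand-picked for illustration rather than mined with efficient_apriori, and the names `binary` and `features` are hypothetical.

```python
import pandas as pd

# Toy binarized data: rows are baskets, columns are items.
binary = pd.DataFrame({"A": [1, 1, 0, 1], "B": [1, 0, 0, 1], "C": [0, 1, 1, 1]})

# Hand-picked itemsets standing in for the mined ones (assumption).
toy_itemsets = [("A",), ("A", "B"), ("A", "C")]

# A basket contains an itemset when all of the itemset's columns are 1 in that row.
features = pd.DataFrame(
    {str(itemset): binary[list(itemset)].all(axis=1).astype(int) for itemset in toy_itemsets},
    index=binary.index,
)

# The mean of each indicator column is that itemset's relative support.
support = features.mean()

print(features)
print(support)
```

Applied to the real data, `binary` would correspond to the binarized `df2` and `toy_itemsets` to the keys of `frequent_itemsets_dict`; the resulting frame should match `encoded_dataset2` under those assumptions.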


In [21]:
display(encoded_dataset2)


Unnamed: 0,"('CIRCULARITY',)","('ELONGATEDNESS',)","(""'RADIUS RATIO'"",)","(""'DISTANCE CIRCULARITY'"",)","(""'MAX.LENGTH ASPECT RATIO'"",)","(""'KURTOSIS ABOUT_MAJOR'"",)","(""'MAX.LENGTH RECTANGULARITY'"",)","(""'PR.AXIS ASPECT RATIO'"",)","('COMPACTNESS',)","(""'SCALED RADIUS OF GYRATION'"",)",...,"(""'PR.AXIS RECTANGULARITY'"", ""'RADIUS RATIO'"", ""'SCALED RADIUS OF GYRATION'"", ""'SCALED VARIANCE_MAJOR'"", ""'SCATTER RATIO'"", ""'SKEWNESS ABOUT_MAJOR'"", 'CIRCULARITY', 'COMPACTNESS')","(""'PR.AXIS RECTANGULARITY'"", ""'RADIUS RATIO'"", ""'SCALED RADIUS OF GYRATION'"", ""'SCALED VARIANCE_MAJOR'"", ""'SCATTER RATIO'"", ""'SKEWNESS ABOUT_MINOR'"", 'CIRCULARITY', 'COMPACTNESS')","(""'PR.AXIS RECTANGULARITY'"", ""'RADIUS RATIO'"", ""'SCALED RADIUS OF GYRATION'"", ""'SCALED VARIANCE_MINOR'"", ""'SCATTER RATIO'"", ""'SKEWNESS ABOUT_MAJOR'"", 'CIRCULARITY', 'COMPACTNESS')","(""'PR.AXIS RECTANGULARITY'"", ""'RADIUS RATIO'"", ""'SCALED RADIUS OF GYRATION'"", ""'SCALED VARIANCE_MINOR'"", ""'SCATTER RATIO'"", ""'SKEWNESS ABOUT_MINOR'"", 'CIRCULARITY', 'COMPACTNESS')","(""'PR.AXIS RECTANGULARITY'"", ""'RADIUS RATIO'"", ""'SCALED VARIANCE_MAJOR'"", ""'SCALED VARIANCE_MINOR'"", ""'SCATTER RATIO'"", ""'SKEWNESS ABOUT_MAJOR'"", 'CIRCULARITY', 'COMPACTNESS')","(""'PR.AXIS RECTANGULARITY'"", ""'RADIUS RATIO'"", ""'SCALED VARIANCE_MAJOR'"", ""'SCALED VARIANCE_MINOR'"", ""'SCATTER RATIO'"", ""'SKEWNESS ABOUT_MINOR'"", 'CIRCULARITY', 'COMPACTNESS')","(""'PR.AXIS RECTANGULARITY'"", ""'SCALED RADIUS OF GYRATION'"", ""'SCALED VARIANCE_MAJOR'"", ""'SCALED VARIANCE_MINOR'"", ""'SCATTER RATIO'"", ""'SKEWNESS ABOUT_MAJOR'"", 'CIRCULARITY', 'COMPACTNESS')","(""'PR.AXIS RECTANGULARITY'"", ""'SCALED RADIUS OF GYRATION'"", ""'SCALED VARIANCE_MAJOR'"", ""'SCALED VARIANCE_MINOR'"", ""'SCATTER RATIO'"", ""'SKEWNESS ABOUT_MINOR'"", 'CIRCULARITY', 'COMPACTNESS')","(""'RADIUS RATIO'"", ""'SCALED RADIUS OF GYRATION'"", ""'SCALED VARIANCE_MAJOR'"", ""'SCALED VARIANCE_MINOR'"", ""'SCATTER RATIO'"", ""'SKEWNESS ABOUT_MAJOR'"", 'CIRCULARITY', 'COMPACTNESS')","(""'RADIUS RATIO'"", ""'SCALED RADIUS OF GYRATION'"", ""'SCALED VARIANCE_MAJOR'"", ""'SCALED VARIANCE_MINOR'"", ""'SCATTER RATIO'"", ""'SKEWNESS ABOUT_MINOR'"", 'CIRCULARITY', 'COMPACTNESS')"
0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,1,1,1,0,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
3,0,1,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,1,0,1,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
841,0,0,1,1,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
842,1,1,0,1,1,1,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
843,1,0,1,1,1,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
844,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
from sklearn.preprocessing import StandardScaler
scaler2 = StandardScaler()
scaled_data2 = scaler2.fit_transform(encoded_dataset2)


In [23]:
# Perform K-Means clustering

kmeans = KMeans(n_clusters=4, random_state=42)
clusters2 = kmeans.fit_predict(scaled_data2)

# Add cluster labels to the original DataFrame
df2['KMeansCluster'] = clusters2

ari_kmeans = adjusted_rand_score(true_labels2, df2['KMeansCluster'])
nmi_kmeans = normalized_mutual_info_score(true_labels2, df2['KMeansCluster'])
fmi_kmeans = fowlkes_mallows_score(true_labels2, df2['KMeansCluster'])

print("KMeans Clustering Results:")
print("ARI:", ari_kmeans)
print("NMI:", nmi_kmeans)
print("FMI:", fmi_kmeans)


KMeans Clustering Results:
ARI: 0.05865998325270368
NMI: 0.165890093550524
FMI: 0.39885889788318735


In [25]:
# Perform Agglomerative clustering

agg_clustering = AgglomerativeClustering(n_clusters=4)
clusters2 = agg_clustering.fit_predict(scaled_data2)

# Add cluster labels to the original DataFrame
df2['AggCluster'] = clusters2

ari_agg = adjusted_rand_score(true_labels2, df2['AggCluster'])
nmi_agg = normalized_mutual_info_score(true_labels2, df2['AggCluster'])
fmi_agg = fowlkes_mallows_score(true_labels2, df2['AggCluster'])

print("\nAgglomerative Clustering Results:")
print("ARI:", ari_agg)
print("NMI:", nmi_agg)
print("FMI:", fmi_agg)



Agglomerative Clustering Results:
ARI: 0.061214537417000905
NMI: 0.16963181008545594
FMI: 0.39543602723347054
