# Assignment: Clustering and association rule mining

## Business understanding

Besides optimizing its delivery hubs, the company wants to increase revenue by recommending products to customers based on their purchase history. For this purpose, it plans to use association rule mining to find interesting relationships between products that are frequently purchased together.
In the initial phase of the project, the company focuses on 20 internal product groups and wants to find out which product groups act as triggers for buying items from other product groups.

## Task

The task is split into two sub-tasks: finding optimal hub locations and finding interesting relationships between product groups.

## Task 1: Finding optimal hub locations

### Business understanding

The company operates a drone fleet for product delivery and plans to establish depots for stationing, servicing, and loading. The goal is to optimize operations by identifying the best hub locations by analyzing order location data.

### Data understanding

Here we import the dataset into a pandas DataFrame and display the first 5 rows to check the data format.

In [None]:
import pandas as pd

df = pd.read_csv('drone_cust_locations.csv', sep=";")

display(df.head())

The dataset includes customer IDs with corresponding x and y coordinates. The coordinates are expressed as floating-point values in the range 0–1000, rather than in degrees.
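The coordinate range and the absence of missing values can be checked directly before modeling. The following is a minimal sketch; the `sample` rows are hypothetical stand-ins for the real file, and only the column names `clientid`, `x` and `y` are taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from drone_cust_locations.csv
sample = pd.DataFrame({
    "clientid": [1, 2, 3, 4],
    "x": [622.8, 416.4, 292.7, 737.9],
    "y": [164.8, 630.2, 567.3, 166.3],
})

coords = sample[["x", "y"]]

# Count missing coordinates and confirm every value lies in the 0-1000 range
missing = int(coords.isna().sum().sum())
in_range = bool(((coords >= 0) & (coords <= 1000)).all().all())

print(coords.describe().loc[["min", "max"]])
print(f"missing values: {missing}, all within 0-1000: {in_range}")
```

Running the same checks on `df` would confirm the range read off the first rows.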

### Data preparation

We clean the data by removing the ID column and then display the results.

In [None]:
df.drop(columns="clientid", inplace=True)

display(df.head(5))

Next, we visualize the data using a seaborn scatter plot.

In [None]:
import seaborn as sns

sns.scatterplot(data=df, x="x", y="y")

From the plot, we can see that the residential area is divided by a river or another uninhabitable zone, such as a highway. However, there are still a few individual customers located within this area.

### Modeling

We perform K-Means clustering with 3, 5, 7 and 10 clusters to identify candidate locations for the drone depots. Each customer is labeled with its cluster in a `depot` column, and the cluster centers are plotted on top as red X markers. The function also prints its runtime for each cluster count so that we can compare them.

In [None]:
import time
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

clusters_list = [3, 5, 7, 10]

def plot_clusters(num, df):
    start_time = time.time()
    
    dff = df.copy()
    
    model = KMeans(init="random", n_clusters=num, random_state=123)
    
    model.fit(dff)
    
    dff["depot"] = model.labels_
    
    columns = ["x", "y"]
    centers = pd.DataFrame(model.cluster_centers_, columns=columns)
            
    sns.scatterplot(data=dff, x="x", y="y", hue="depot", style="depot", palette="pastel")
    
    sns.scatterplot(data=centers, x="x", y="y", color="red", marker="X", s=150)
    
    plt.show()
    
    end_time = time.time()
    runtime = end_time - start_time
    
    display(f"Runtime: {runtime:.4f} seconds")
    
for n in clusters_list:
    plot_clusters(n, df)

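In our runs, the runtime seemed to decrease as the number of clusters grew. The timer also covers plotting, so these figures do not isolate the cost of fitting the model. From the plots we can also see how the clusters form around the uninhabitable zone in the middle.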
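To compare fitting cost alone, the timer can wrap only the call to `fit`. The following is a minimal sketch under stated assumptions: the synthetic `points` array stands in for the customer coordinates, and the helper `time_kmeans_fit` is hypothetical, not part of the notebook.

```python
import time

import numpy as np
from sklearn.cluster import KMeans


def time_kmeans_fit(points, n_clusters, seed=123):
    """Fit K-Means and return (fitted model, seconds spent in fit only)."""
    model = KMeans(init="random", n_clusters=n_clusters, n_init=10, random_state=seed)
    start = time.perf_counter()
    model.fit(points)
    return model, time.perf_counter() - start


# Synthetic stand-in for the customer coordinates (assumption, not the real data)
rng = np.random.default_rng(0)
points = rng.uniform(0, 1000, size=(500, 2))

fit_times = {}
for k in [3, 5, 7, 10]:
    model, seconds = time_kmeans_fit(points, k)
    fit_times[k] = seconds
    print(f"k={k}: fit took {seconds:.4f} s")
```

Timings for a single run are noisy, so repeating each measurement a few times would give a fairer comparison.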

### Evaluation

We evaluate the K-Means results using WCSS (within-cluster sum of squares) for cluster counts ranging from 1 to 10.

In [None]:
wcss = []
for i in range(1,11):
    model = KMeans(init='random', n_clusters=i, random_state=42).fit(df)
    wcss.append(model.inertia_)
    
plt.plot(range(1,11), wcss, 'o-')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Based on the graph, 5-6 clusters appear optimal, as the WCSS decreases only marginally beyond this point.
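WCSS is the sum of squared distances from each point to the center of its assigned cluster, which is the value scikit-learn exposes as `inertia_`. The following sketch recomputes it by hand on synthetic data (the `points` array is an assumption, not the customer data) to show what the elbow plot measures.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer coordinates (assumption)
rng = np.random.default_rng(1)
points = rng.uniform(0, 1000, size=(300, 2))

model = KMeans(init="random", n_clusters=4, n_init=10, random_state=42).fit(points)

# Look up the center assigned to each point, then sum the squared distances
assigned_centers = model.cluster_centers_[model.labels_]
wcss_manual = float(((points - assigned_centers) ** 2).sum())

print(f"manual WCSS: {wcss_manual:.2f}, inertia_: {model.inertia_:.2f}")
```

Because adding clusters can only lower this sum, the elbow method looks for the point where the decrease flattens rather than for a minimum.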

In [None]:
from sklearn.metrics import silhouette_score

model = KMeans(init='random', n_clusters=3, random_state=42).fit(df)
labels = model.labels_
print('Silhouette score = %.2f' % silhouette_score(df, labels))

This result tells us that the clustering is acceptable but not ideal. The silhouette score ranges from -1 to 1, with higher values indicating better-separated clusters. Testing other cluster counts gave no large increase in the score.
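Comparing several cluster counts can be done in a single loop. This is a minimal sketch on synthetic data: the three blobs in `points` are a hypothetical stand-in, so the scores it prints say nothing about the customer dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (assumption, not the real data)
rng = np.random.default_rng(2)
blob_centers = np.array([[200, 200], [800, 200], [500, 800]])
points = np.vstack([c + rng.normal(scale=40, size=(100, 2)) for c in blob_centers])

# The silhouette score needs at least 2 clusters, so start from k=2
scores = {}
for k in range(2, 8):
    labels = KMeans(init="random", n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette = {score:.2f}")
print(f"highest score at k={best_k}")
```

Replacing `points` with `df` would produce the same comparison for the customer coordinates.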

### Agglomerative hierarchical clustering

We applied agglomerative clustering to group customers and explore potential locations for drone hubs as an alternative to K-Means.

In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

def hierarchical_clustering(num, df):
    start_time = time.time()
    dff = df.copy()
    
    # Agglomerative clustering
    model = AgglomerativeClustering(n_clusters=num, metric="euclidean", linkage="ward")
    dff["depot"] = model.fit_predict(dff[["x", "y"]])
    
    # Plot clusters
    sns.scatterplot(data=dff, x="x", y="y", hue="depot", style="depot", palette="pastel")
    plt.title(f"Agglomerative Clustering with {num} depots")
    plt.show()
    
    runtime = time.time() - start_time
    print(f"Runtime for {num} depots: {runtime:.4f} seconds")
    
    # Plot dendrogram
    Z = linkage(dff[["x", "y"]], method="ward")
    plt.figure(figsize=(8, 4))
    dendrogram(Z, labels=dff.index.tolist())
    plt.title(f"Dendrogram for {num} depots")
    plt.xlabel("Samples")
    plt.ylabel("Distance")
    plt.axhline(y=Z[-num+1, 2], color="red", linestyle="--")
    plt.show()
    
for n in clusters_list:
    hierarchical_clustering(n, df)


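From the plots, we conclude that depot locations should follow the K-Means results, as the clusters produced by hierarchical clustering often covered areas that were less well placed for a depot. The dendrogram itself is the same in every run; only the red cut line moves with the number of depots.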
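Unlike `KMeans`, `AgglomerativeClustering` does not expose cluster centers, so a depot position has to be derived from the labels. One option is the mean position of each cluster. This is a sketch under assumptions: the `customers` frame is synthetic, and using the mean as the depot location is an illustrative choice, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the customer coordinates (assumption)
rng = np.random.default_rng(3)
customers = pd.DataFrame(rng.uniform(0, 1000, size=(200, 2)), columns=["x", "y"])

# Ward linkage with the default Euclidean distance
model = AgglomerativeClustering(n_clusters=5, linkage="ward")
customers["depot"] = model.fit_predict(customers[["x", "y"]])

# Mean position of each cluster as a candidate depot location
depot_locations = customers.groupby("depot")[["x", "y"]].mean()
print(depot_locations)
```

Plotting these means over the scatter plot would make the comparison with the K-Means centers more direct.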

## Task 2: Finding interesting relationships between product groups

### Business understanding

Next we analyze orders by the product groups they contain. The idea is to find associations between product groups, that is, which groups tend to be ordered together.

### Data understanding

Now we load a new dataset into a pandas dataframe.

In [None]:
df2 = pd.read_csv('drone_prod_groups.csv')

display(df2.head(5))

The dataset has one binary column for each of the 20 product groups, plus an order ID column.

### Data preparation

We prepare the data by dropping the ID column and converting the remaining 0/1 values to booleans, so that only purchased product groups (True) count toward the item sets. Dropping the ID column first keeps the conversion from touching the order IDs.

In [None]:
# Drop the order ID first so that only product-group columns are converted
df2 = df2.drop(columns="ID").astype(bool)

display(df2.head())

### Modeling

Here we apply the Apriori algorithm to the dataset. We try different values of `min_support` to get a result that is suitable to continue with.

In [None]:
from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(df2, min_support=0.005, use_colnames=True)
display(frequent_itemsets)

We ended up using a `min_support` value of 0.005, meaning an item set must appear in at least 0.5% of orders. This produced 339 frequent item sets.

Next we generate association rules from the item sets. Here we vary `min_threshold` to control how many rules are produced.

In [None]:
from mlxtend.frequent_patterns import association_rules

# generate association rules
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)

# sort in descending order of confidence
rules = rules.sort_values(by='confidence', ascending=False)

display(rules)

This produced 133 rules with a confidence of at least 0.5.

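Next, we rank the rules by lift, which measures how much more often the antecedent and consequent occur together than would be expected if they were independent. A lift above 1 indicates a positive association; with `min_threshold=0.5`, rules with a lift below 1 are kept as well.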

In [None]:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=0.5)

# sort in descending order of lift
rules = rules.sort_values(by='lift', ascending=False)

display(rules)
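The support, confidence and lift columns in the rule tables can be reproduced by hand for a single rule. The sketch below does this for a rule A → B on a hypothetical toy basket; the columns `A` and `B` and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy basket data (True = product group present in the order)
basket = pd.DataFrame({
    "A": [True, True, True, False, True, False, True, False],
    "B": [True, True, False, False, True, False, False, True],
})

# Support: share of orders containing the item set
support_a = basket["A"].mean()                    # 5 of 8 orders
support_b = basket["B"].mean()                    # 4 of 8 orders
support_ab = (basket["A"] & basket["B"]).mean()   # 3 of 8 orders

# Confidence of A -> B: share of orders with A that also contain B
confidence = support_ab / support_a

# Lift of A -> B: confidence relative to the baseline frequency of B
lift = confidence / support_b

print(f"support={support_ab:.3f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

Here the confidence is 0.6 and the lift is 1.2, so orders containing A include B somewhat more often than orders in general.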