# Assignment: Clustering and association rule mining

## 1. Business Understanding

Another optimization idea is to increase the revenue of the company by recommending products to customers based on their purchase history. For this purpose, the company is looking to use association rule mining to find interesting relationships between products that are frequently purchased together.
In the initial phase of the project, the company focuses on 20 internal product groups, and wants to find out which product groups act as triggers for buying items from some other product groups.

## Task

The task is split into two sub-tasks: finding optimal hub locations and finding interesting relationships between product groups.

## Task 1: Finding optimal hub locations

### Business understanding

The company has a fleet of drones that deliver products to customers.
The company intends to create a set of hubs, or depots, where the drones are stationed, serviced, and loaded with products for delivery.
The company is looking to optimize its operations by finding the optimal locations for its hubs.

### Data understanding

First we load the customer location data into a dataframe and look at the first rows.

In [None]:
import pandas as pd

df = pd.read_csv('drone_cust_locations.csv', sep=";")

display(df.head())

The dataset contains the customers' id numbers and their x and y coordinates. The coordinates do not appear to be geographic coordinates in degrees, but floating-point numbers between 0 and 1000.

### Data preparation

Here we clean the data by removing the id column.

In [None]:
df.drop(columns="clientid", inplace=True)

display(df.head(5))

In [None]:
import seaborn as sns

sns.scatterplot(data=df, x="x", y="y")

From the plot we can infer that the residential area is crossed by a river or some other uninhabitable strip, such as a motorway. Even so, a few individual customers are located inside this strip.

### Modeling

Here we fit KMeans clustering for 3, 5, 7, and 10 clusters. For each cluster count, the model splits the customers into that many groups, and each cluster center is a candidate location for a drone depot.

We collect the cluster centers into a separate dataframe so that they can be plotted as red crosses on top of the clustered customers. We define a function that fits the model, plots the result, and prints the runtime.

In [None]:
import time
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


clusters_list = [3, 5, 7, 10]

def plot_clusters(num, df):
    start_time = time.time()
    
    dff = df.copy()
    
    model = KMeans(init="random", n_clusters=num, random_state=123)
    
    model.fit(dff)
    
    dff["depot"] = model.labels_
    
    columns = ["x", "y"]
    centers = pd.DataFrame(model.cluster_centers_, columns=columns)
            
    sns.scatterplot(data=dff, x="x", y="y", hue="depot", style="depot", palette="pastel")
    
    sns.scatterplot(data=centers, x="x", y="y", color="red", marker="X", s=150)
    
    plt.show()
    
    end_time = time.time()
    runtime = end_time - start_time
    
    display(f"Runtime: {runtime:.4f} seconds")
    
for n in clusters_list:
    plot_clusters(n, df)

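In our runs, the measured runtime decreased as the number of clusters increased. Note that the timer in `plot_clusters` also covers plotting, so the figure is not a pure measure of clustering time.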
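The timer in `plot_clusters` starts before the fit and stops after `plt.show()`, so it measures plotting as well as clustering. A minimal sketch of timing a single call on its own is shown below; the helper name `timed` is hypothetical and the workload is a stand-in, not the assignment data.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) for that call only."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this could be timed(model.fit, dff).
result, elapsed = timed(sum, range(1_000_000))
print(f"Runtime: {elapsed:.4f} seconds")
```

`time.perf_counter()` is used here because it is intended for measuring short durations; a single measurement is still noisy, so repeating the call a few times would give a steadier comparison.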

### Evaluation

Now we evaluate the results of the KMeans algorithm.

In [None]:
wcss = []
for i in range(1,11):
    model = KMeans(init='random', n_clusters=i, random_state=42).fit(df)
    wcss.append(model.inertia_)
    
plt.plot(range(1,11), wcss, 'o-')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Based on the graph, 4 clusters looks like the sweet spot (the elbow), because the WCSS decreases only slowly after that.
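The WCSS plotted here is the value KMeans reports as `inertia_`: the sum of squared distances from each sample to its assigned cluster center. The sketch below uses made-up points to show why it drops as clusters are added; the function name `wcss` and the data are illustrative and not part of the assignment.

```python
def wcss(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        cx, cy = centers[label]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# Four toy points forming two tight pairs on the x-axis.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

# One cluster: every point is measured against the overall mean (6, 0).
one_cluster = wcss(points, [0, 0, 0, 0], [(6.0, 0.0)])

# Two clusters: centers at the mean of each pair.
two_clusters = wcss(points, [0, 0, 1, 1], [(1.0, 0.0), (11.0, 0.0)])

print(one_cluster, two_clusters)
```

Because WCSS keeps shrinking as centers are added, the elbow method looks for the point where the decrease levels off rather than for the minimum.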

In [None]:
from sklearn.metrics import silhouette_score

model = KMeans(init='random', n_clusters=3, random_state=42).fit(df)
labels = model.labels_
print('Silhouette score = %.2f' % silhouette_score(df, labels))

This result tells us that the clustering is not ideal but acceptable. Testing other values of `n_clusters` gave no large increase in the score.
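For reference, the silhouette coefficient of a sample is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. Scores range from -1 to 1, with higher values meaning better-separated clusters. The sketch below is a plain-Python illustration on made-up points, not a replacement for `silhouette_score`; the function name and the data are assumptions.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.get(labels[i])
        if not own:
            scores.append(0.0)  # a singleton cluster is scored as 0
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette(points, [0, 1, 0, 1])   # groups distant points together
print(f"good = {good:.2f}, bad = {bad:.2f}")
```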

### Agglomerative hierarchical clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage


def hierarchical_clustering(num, df):
    start_time = time.time()
    dff = df.copy()
    
    # Agglomerative clustering
    model = AgglomerativeClustering(n_clusters=num, metric="euclidean", linkage="ward")
    dff["depot"] = model.fit_predict(dff[["x", "y"]])
    
    # Plot clusters
    sns.scatterplot(data=dff, x="x", y="y", hue="depot", style="depot", palette="pastel")
    plt.title(f"Agglomerative Clustering with {num} depots")
    plt.show()
    
    runtime = time.time() - start_time
    print(f"Runtime for {num} depots: {runtime:.4f} seconds")
    
    # Plot dendrogram
    Z = linkage(dff[["x", "y"]], method="ward")
    plt.figure(figsize=(8, 4))
    dendrogram(Z, labels=dff.index.tolist())
    plt.title(f"Dendrogram for {num} depots")
    plt.xlabel("Samples")
    plt.ylabel("Distance")
    plt.axhline(y=Z[-num+1, 2], color="red", linestyle="--")
    plt.show()
    
# for n in clusters_list:
#     hierarchical_clustering(n, df)


We recommend placing the depots according to the KMeans result, that is, at the cluster centers. In most of our runs, hierarchical clustering produced several service areas that were less well optimized.

## Task 2: Finding interesting relationships between product groups

### Business understanding

Here we have a dataset of orders, each containing several products. We will analyze it to find relationships between product groups that are frequently purchased together.

### Data understanding

Next we load the second dataset into a dataframe.

In [None]:
df2 = pd.read_csv('drone_prod_groups.csv')

display(df2.head(5))

Each row is one order: an order id plus 20 binary columns, one per product group, indicating whether the order contained a product from that group.

### Data preparation

We prepare the data by converting the zeroes to False and the ones to True. Boolean columns are the input format that mlxtend's `apriori` recommends, and only True values (purchased product groups) count toward an itemset. We also drop the id column.

In [None]:
# Drop the order id first so it is not treated as a product group.
df2.drop(columns="ID", inplace=True)

# Convert the 0/1 indicators to booleans in one step.
df2 = df2.astype(bool)

display(df2.head())

### Modeling

Here we apply the Apriori algorithm to the dataset.

In [None]:
from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(df2, min_support=0.005, use_colnames=True)
display(frequent_itemsets)

In [None]:
from mlxtend.frequent_patterns import association_rules

# generate association rules
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)

# sort in descending order of confidence
rules = rules.sort_values(by='confidence', ascending=False)

display(rules)

In [None]:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=2)

# sort in descending order of lift
rules = rules.sort_values(by='lift', ascending=False)

display(rules)
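The `min_support`, `confidence`, and `lift` thresholds used above can be illustrated on a handful of made-up orders. This is a minimal sketch using the standard definitions of the three metrics; the item names `A`, `B`, `C` and the helper functions are hypothetical and not taken from the dataset.

```python
# Five toy orders, each a set of purchased product groups.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"B", "C"},
    {"C"},
]

def support(itemset):
    """Fraction of orders that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Share of orders with the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent) / support(consequent)

a, b = {"A"}, {"B"}
print(f"support(A, B)    = {support(a | b):.3f}")
print(f"confidence(A->B) = {confidence(a, b):.3f}")
print(f"lift(A->B)       = {lift(a, b):.3f}")
```

A lift above 1 means the antecedent makes the consequent more likely than its baseline frequency, which is why the last cell filters on `lift` with a threshold of 2.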