# Lab 1.2 - Clustering Load Profiles

Load profiling is a crucial aspect of energy consumption analysis that involves collecting and analyzing data on the energy usage of a system or building. Clustering, on the other hand, is a machine learning technique used to group similar data points together.

In this exercise session, we will learn how to use Python to cluster load profiles. We will start by exploring what load profiles are, how they are collected and why they are important. Next, we will delve into the theory behind clustering and its applications in load profiling. We will then work through a step-by-step guide to using Python's scikit-learn library to perform load profile clustering.

By the end of this exercise session, you will have a good understanding of load profiles, clustering techniques and how to apply these techniques in Python to perform load profile clustering. You will also be able to interpret the results of the clustering and use them to make informed decisions about energy consumption management. So, let's get started!

### Importing Libraries and Data

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as mdates
import numpy as np
from sklearn.cluster import KMeans
import seaborn as sns

df = pd.read_csv("data/LoadTimeSeriesData.csv", parse_dates=['timestamp'])

**Task**: use one of the NaN replacement methods seen previously to fill the missing values

In [None]:
# Assign the result back: inplace=True on a column selection may not update df (copy-on-write)
df['power'] = df['power'].interpolate(method='spline', order=3)
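As a point of comparison for the task above, the sketch below fills the gaps of a small made-up series with two simple alternatives, linear interpolation and forward fill. The `toy` DataFrame and its values are hypothetical and only stand in for the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series with three missing readings
toy = pd.DataFrame({
    "timestamp": pd.Timestamp("2023-01-01") + pd.to_timedelta(np.arange(6), unit="h"),
    "power": [1.0, np.nan, 3.0, np.nan, np.nan, 6.0],
})

n_missing_before = int(toy["power"].isna().sum())

# Linear interpolation draws a straight line between the known neighbours
toy["power_linear"] = toy["power"].interpolate(method="linear")

# Forward fill repeats the last known value
toy["power_ffill"] = toy["power"].ffill()

n_missing_after = int(toy["power_linear"].isna().sum())
print(toy)
print(f"missing before: {n_missing_before}, after: {n_missing_after}")
```

Linear interpolation tends to suit short gaps in smooth signals, while forward fill keeps the last reading flat; which one is appropriate depends on how long the gaps are.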

To cluster load profiles, we need to reorganize the dataset into an MxN matrix, with one row per daily profile, where:
* M is the number of days in our dataset
* N is the number of samples per day (i.e. 24 for hourly data)

**Task**: Assign new columns for date and hour

In [None]:
df['date'] = df['timestamp'].dt.date
df['hour'] = df['timestamp'].dt.hour

**Task**: generate the MxN matrix using the new columns.

*hint*: use the pivot function of pandas

In [None]:
df_matrix = df.pivot(index='date', columns='hour', values='power')
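The reshaping step can be checked on a small synthetic series. The sketch below builds three days of made-up hourly data (the `toy` names and values are hypothetical), pivots them into a days-by-hours matrix, and lists any day with missing hours, since KMeans does not accept NaN values.

```python
import numpy as np
import pandas as pd

# Three days of synthetic hourly readings (values are just a running counter)
n_days = 3
timestamps = pd.Timestamp("2023-01-01") + pd.to_timedelta(np.arange(n_days * 24), unit="h")
toy = pd.DataFrame({"timestamp": timestamps, "power": np.arange(n_days * 24, dtype=float)})

toy["date"] = toy["timestamp"].dt.date
toy["hour"] = toy["timestamp"].dt.hour

# One row per day, one column per hour of the day
toy_matrix = toy.pivot(index="date", columns="hour", values="power")
print(toy_matrix.shape)

# Days with at least one missing hour would show up here
incomplete_days = toy_matrix.index[toy_matrix.isna().any(axis=1)]
print(f"incomplete days: {len(incomplete_days)}")
```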

**Task**: perform clustering using the KMeans method. Select a value for K (the desired number of clusters), then extract the cluster labels from the fitted model.

In [None]:
# Perform clustering
K = 5
kmeans = KMeans(n_clusters=K, random_state=0).fit(df_matrix)
labels = kmeans.labels_
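The value of K is left to the reader. One common heuristic, not prescribed by this lab, is to compare the silhouette score over a range of K values. The sketch below does this on synthetic profiles drawn around three made-up load levels; `toy_matrix`, `levels` and the range of K are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic daily profiles: 15 days around each of three hypothetical load levels
rng = np.random.default_rng(0)
levels = [1.0, 5.0, 9.0]
toy_matrix = np.vstack([level + 0.1 * rng.standard_normal((15, 24)) for level in levels])

# Fit KMeans for several K and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_matrix)
    scores[k] = silhouette_score(toy_matrix, labels_k)
    print(f"K={k}: silhouette={scores[k]:.3f}")

# The K with the highest score is a candidate, not a definitive answer
best_k = max(scores, key=scores.get)
print(f"best K by silhouette: {best_k}")
```

On real load data the scores are usually much closer together, so the result is best read alongside a visual inspection of the clusters.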

**Task**: add the labels to the original dataframe (timeseries) 

In [None]:
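# Add the cluster labels to the original dataset.
# Mapping by date avoids relying on exactly 24 sorted rows per day, as np.repeat(labels, 24) would
df['cluster'] = df['date'].map(pd.Series(labels, index=df_matrix.index))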

### Centroids

In the context of load profiles clustering, centroids are usually represented as the average load profile of the cluster.
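For KMeans, this average profile and the centroid stored by the model should coincide once the algorithm has converged. The following sketch checks this on synthetic data; the two made-up groups of profiles (`low`, `high`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two well-separated groups of synthetic daily profiles (10 days each)
rng = np.random.default_rng(0)
low = 1.0 + 0.1 * rng.standard_normal((10, 24))
high = 5.0 + 0.1 * rng.standard_normal((10, 24))
toy_matrix = pd.DataFrame(np.vstack([low, high]), columns=np.arange(24))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_matrix.to_numpy())

# Average profile of the days assigned to each cluster (rows ordered by label)
mean_profiles = toy_matrix.groupby(kmeans.labels_).mean()

# Largest difference between the average profiles and the fitted centroids
max_gap = np.abs(mean_profiles.to_numpy() - kmeans.cluster_centers_).max()
print(f"largest gap between mean profiles and cluster_centers_: {max_gap:.2e}")
```

If the fit stops early (for example by reaching `max_iter`), the two can differ slightly, which is one reason to compute the average profile explicitly.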

**Task**: evaluate the average profile (i.e. evaluate the mean of power for each cluster and each hour)

In [None]:
# creating a new DataFrame with the average power for each hour of the day and for each cluster
centroids = df.groupby(['cluster', 'hour'])['power'].mean().reset_index()

**Task**: plot the load profiles for each cluster and the centroid.

*hint*: reuse the code from the previous lab

In [None]:
# generating load profiles
g = sns.FacetGrid(data=df, col='cluster', hue='date', col_wrap=3, height=3, aspect=2, sharey=False)
g.map(sns.lineplot, 'hour', 'power', color='gray')

# adding average values
for ax, cluster in zip(g.axes.flatten(), centroids['cluster'].unique()):
    sns.lineplot(x='hour', y='power', data=centroids[centroids['cluster'] == cluster], color='r', ax=ax, label='Average profile', legend=False)
    ax.set_ylim(bottom=0, top=df['power'].max())
    ax.set_xticks(range(0, 24))
    ax.grid(True, linestyle='--')

plt.show()

In [None]:
# Perform clustering
K = 3
kmeans = KMeans(n_clusters=K, random_state=0).fit(df_matrix)
labels = kmeans.labels_

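# Add the cluster labels to the original dataset.
# Mapping by date avoids relying on exactly 24 sorted rows per day, as np.repeat(labels, 24) would
df['cluster'] = df['date'].map(pd.Series(labels, index=df_matrix.index))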

# creating a new DataFrame with the average power for each hour of the day and for each cluster
centroids = df.groupby(['cluster', 'hour'])['power'].mean().reset_index()

# generating load profiles
g = sns.FacetGrid(data=df, col='cluster', hue='date', col_wrap=3, height=3, aspect=2, sharey=False)
g.map(sns.lineplot, 'hour', 'power', color='gray')

# adding average values
for ax, cluster in zip(g.axes.flatten(), centroids['cluster'].unique()):
    sns.lineplot(x='hour', y='power', data=centroids[centroids['cluster'] == cluster], color='r', ax=ax, label='Average profile', legend=False)
    ax.set_ylim(bottom=0, top=df['power'].max())
    ax.set_xticks(range(0, 24))
    ax.grid(True, linestyle='--')

plt.show()

### BONUS: Hierarchical clustering

Agglomerative clustering is a popular hierarchical clustering method used to group similar data points into clusters. It works by initially treating each data point as its own cluster and then iteratively merging the closest clusters until only one cluster containing all data points remains. The merging process is guided by a linkage method that determines how the distance between two clusters is measured.
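The merging sequence can be traced on a handful of points. The sketch below uses SciPy's `linkage` function with single linkage on five made-up one-dimensional points; each row of the returned matrix describes one merge (the two clusters joined, their distance, and the size of the new cluster).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five hypothetical 1-D points; indices 0..4 are the initial singleton clusters
points = np.array([[0.0], [1.0], [5.0], [7.0], [20.0]])

# Each row of Z: [cluster_i, cluster_j, merge distance, size of merged cluster]
Z = linkage(points, method="single")

for step, (left, right, distance, size) in enumerate(Z, start=1):
    print(
        f"step {step}: merge clusters {int(left)} and {int(right)} "
        f"at distance {distance:.1f} -> {int(size)} points"
    )
```

Clusters created by a merge receive new indices starting at 5 (the number of points), and the last row always contains all points.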

There are several linkage methods used in agglomerative clustering, each with its own advantages and disadvantages.

The **single** linkage method, also known as the nearest-neighbor method, defines the distance between two clusters as the distance between their two closest points (one from each cluster) and merges the pair of clusters for which this distance is smallest. This method tends to produce long, chain-like clusters and can be sensitive to noise and outliers.

The **complete** linkage method, also known as the farthest-neighbor method, defines the distance between two clusters as the distance between their two farthest points (one from each cluster) and merges the pair of clusters for which this distance is smallest. This method tends to produce compact clusters but can also be sensitive to outliers.

The **average** linkage method defines the distance between two clusters as the average distance over all pairs of points taken one from each cluster, and merges the pair of clusters with the smallest average distance. This method is often less sensitive to noise and outliers than the single and complete linkage methods and tends to produce more balanced clusters.

The **Ward** linkage method merges, at each step, the pair of clusters whose union gives the smallest increase in within-cluster variance. It tends to produce compact, roughly spherical clusters of similar size and is often less sensitive to noise than single linkage.

Overall, the choice of linkage method in agglomerative clustering depends on the nature of the data and the goals of the analysis. Each method has its own strengths and weaknesses, and selecting the appropriate method can lead to more meaningful and accurate clustering results.
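To make the four definitions concrete, the sketch below computes each cluster-to-cluster distance for two small made-up clusters (`cluster_a` and `cluster_b` are hypothetical). For Ward, it computes the increase in within-cluster sum of squares that merging the two clusters would cause.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise distances between a point of A and a point of B
pairwise = cdist(cluster_a, cluster_b)

single_dist = pairwise.min()     # closest pair
complete_dist = pairwise.max()   # farthest pair
average_dist = pairwise.mean()   # mean over all pairs

# Ward criterion: increase in within-cluster sum of squares caused by the merge
n_a, n_b = len(cluster_a), len(cluster_b)
centroid_gap = cluster_a.mean(axis=0) - cluster_b.mean(axis=0)
ward_increase = (n_a * n_b) / (n_a + n_b) * np.sum(centroid_gap ** 2)

print(f"single:   {single_dist}")    # 2.0
print(f"complete: {complete_dist}")  # 5.0
print(f"average:  {average_dist}")   # 3.5
print(f"ward increase: {ward_increase}")  # 12.25
```

In every case the algorithm merges the pair of clusters with the smallest value of the chosen criterion; only the criterion changes.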

In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram

model = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage="single").fit(df_matrix)

In [None]:
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [None]:
plt.title("Hierarchical Clustering Dendrogram")
# plot the top (p) levels of the dendrogram
plot_dendrogram(model, truncate_mode="level", p=4)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

**Task**: Try different *linkage methods* and *values of K* to understand how the resulting clusters differ.

In [None]:
# Perform clustering
K = 5
linkage_method = "ward" # "complete", "single", "average"
model = AgglomerativeClustering(distance_threshold=None, n_clusters=K, linkage=linkage_method).fit(df_matrix)
labels = model.labels_

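# Add the cluster labels to the original dataset.
# Mapping by date avoids relying on exactly 24 sorted rows per day, as np.repeat(labels, 24) would
df['cluster'] = df['date'].map(pd.Series(labels, index=df_matrix.index))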

# creating a new DataFrame with the average power for each hour of the day and for each cluster
centroids = df.groupby(['cluster', 'hour'])['power'].mean().reset_index()

# generating load profiles
g = sns.FacetGrid(data=df, col='cluster', hue='date', col_wrap=3, height=3, aspect=2, sharey=False)
g.map(sns.lineplot, 'hour', 'power', color='gray')

# adding average values
for ax, cluster in zip(g.axes.flatten(), centroids['cluster'].unique()):
    sns.lineplot(x='hour', y='power', data=centroids[centroids['cluster'] == cluster], color='r', ax=ax, label='Average profile', legend=False)
    ax.set_ylim(bottom=0, top=df['power'].max())
    ax.set_xticks(range(0, 24))
    ax.grid(True, linestyle='--')

plt.show()
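Besides the plots, a quick way to compare linkage methods is to look at how many days fall into each cluster. The sketch below loops over the four methods on synthetic profiles with one made-up outlier day; the data and the choice of three clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic profiles: two groups of 20 days plus one hypothetical outlier day
rng = np.random.default_rng(0)
base_low = 1.0 + 0.2 * rng.standard_normal((20, 24))
base_high = 4.0 + 0.2 * rng.standard_normal((20, 24))
outlier = np.full((1, 24), 12.0)
toy_matrix = np.vstack([base_low, base_high, outlier])

# Cluster sizes (largest first) for each linkage method
sizes_by_linkage = {}
for linkage_method in ["ward", "complete", "average", "single"]:
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage_method).fit(toy_matrix)
    sizes = sorted(np.bincount(model.labels_).tolist(), reverse=True)
    sizes_by_linkage[linkage_method] = sizes
    print(f"{linkage_method:>8}: cluster sizes = {sizes}")
```

On real load data the methods typically disagree more than on this clean example; a method that puts almost every day in one cluster and isolates a few days in the others is usually reacting to outliers.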

### BONUS: Classifying cluster labels with a supervised model