# Spotify Recommendation System: Genre Pattern Analysis

This notebook compares how genres are distributed across song clusters produced by two methods: hierarchical clustering and Decision Tree-based segmentation. The goal is to understand how well the clusters separate genres, to inform a recommendation engine focused on `mpb`, `rock`, and `death metal`.

### Step 1: Import Necessary Libraries

We begin by importing essential libraries for data manipulation and visualization.


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 10)


### Step 2: Load DataFrames

We load:
- `DecisionTreeDF`: The output of a Decision Tree classifier where each song is assigned to a leaf node (interpreted as a cluster).
- `df_cluster_PCA_gp`: The result of hierarchical clustering with PCA annotations and genre information.


In [2]:
DecisionTreeDF = pd.read_csv("Data for Modeling/AftDTsMPBROCKMETAL_KGDf.csv")
df_cluster_PCA_gp = pd.read_csv("Data for Modeling/df_cluster_PCA_gp.csv")


FileNotFoundError: [Errno 2] No such file or directory: 'Data for Modeling/AftDTsMPBROCKMETAL_KGDf.csv'
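
The `FileNotFoundError` above usually means the relative path does not resolve from the notebook's current working directory. As a minimal sketch, a small loader can report both the missing path and the working directory; `load_csv_checked` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


# Hypothetical helper (not in the original notebook).
def load_csv_checked(path):
    """Read a CSV, raising a descriptive error if the file is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Include the working directory so a wrong relative path is easy to spot.
        raise FileNotFoundError(
            f"{csv_path} not found (working directory: {Path.cwd()})"
        )
    return pd.read_csv(csv_path)
```

It could be called in place of `pd.read_csv`, for example `load_csv_checked("Data for Modeling/df_cluster_PCA_gp.csv")`.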

### Step 3: Drop Unnecessary Columns

Both DataFrames contain leftover columns that carry no information: an `Unnamed: 0` column (a row index saved into the CSV) and, in `DecisionTreeDF`, a column whose name is a single space. We drop them for clarity.


In [None]:
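# Drop the blank-named column and the leftover CSV index columns.
DecisionTreeDF = DecisionTreeDF.drop(columns=[" ", "Unnamed: 0"])
df_cluster_PCA_gp = df_cluster_PCA_gp.drop(columns=["Unnamed: 0"])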
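
As a sketch of the same cleanup on toy data: the frame below is invented for illustration and only imitates the stray columns described above. Passing `errors="ignore"` to `DataFrame.drop` makes the cell safe to re-run, because labels that are already gone are skipped.

```python
import pandas as pd

# Toy frame imitating the stray columns (values are made up).
toy = pd.DataFrame({
    " ": [0, 1],
    "Unnamed: 0": [0, 1],
    "track_genre": ["mpb", "rock"],
})

# errors="ignore" skips labels that are absent instead of raising KeyError.
cleaned = toy.drop(columns=[" ", "Unnamed: 0"], errors="ignore")

# Running the same drop a second time leaves the frame unchanged.
cleaned_again = cleaned.drop(columns=[" ", "Unnamed: 0"], errors="ignore")
```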


### Step 4: Preview Cleaned DataFrames

In [None]:
DecisionTreeDF

In [None]:
df_cluster_PCA_gp

### Step 5: Merge Genre and Track Info with Clustering Data

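We merge the `track_genre` and `track_id` columns from the Decision Tree DataFrame into the hierarchical clustering DataFrame to allow direct comparison of cluster compositions. Both merges join on the row index, so they assume the two DataFrames list the songs in the same order.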


In [None]:
df_cluster_PCA_gp_Genre = pd.merge(DecisionTreeDF["track_genre"], df_cluster_PCA_gp, right_index=True, left_index=True)
df_cluster_PCA_gp_Genre = pd.merge(DecisionTreeDF["track_id"], df_cluster_PCA_gp_Genre, right_index=True, left_index=True)
df_cluster_PCA_gp_Genre.sort_values(by="cluster_x")
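
Because the merge pairs rows by index, it may be worth checking that the two indexes match before combining them. The sketch below uses invented toy frames (`tree_df` and `cluster_df` are hypothetical stand-ins for the real DataFrames) and attaches both columns in one step with `DataFrame.join`.

```python
import pandas as pd

# Toy stand-ins for the two frames; all values are invented for illustration.
tree_df = pd.DataFrame({
    "track_id": ["a1", "b2", "c3"],
    "track_genre": ["mpb", "rock", "death metal"],
})
cluster_df = pd.DataFrame({"cluster_x": [2, 0, 2]})

# An index-based merge only pairs rows correctly if both indexes match.
assert tree_df.index.equals(cluster_df.index), "row indexes differ"

# join aligns on the index and attaches both columns in a single step.
merged = tree_df[["track_id", "track_genre"]].join(cluster_df)
```

If the assertion fails on real data, merging on a shared key such as `track_id` (when both files carry it) would be a safer choice than relying on row order.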


### Step 6: Define Cluster Counting Function

This function counts how many tracks of each genre fall into each cluster, for either the Decision Tree leaf nodes or the hierarchical clusters. It reports `mpb` and `rock` separately and counts every other genre as `metal`. It also returns the total number of tracks in the input DataFrame.


In [None]:
def countingOcorencies(df, columnname, clusterIDs):
    """Count mpb, rock, and other ("metal") tracks per cluster; also return the total track count."""
    grouped = df.groupby([columnname, "track_genre"]).size().unstack(fill_value=0)
    grouped = grouped.reindex(clusterIDs, fill_value=0)
    grouped = grouped.rename(columns={"mpb": "QntMPB", "rock": "QntROCK"})

    if "QntMPB" not in grouped.columns:
        grouped["QntMPB"] = 0
    if "QntROCK" not in grouped.columns:
        grouped["QntROCK"] = 0

    grouped["QntMETAL"] = (
        df.groupby(columnname).size()
        .reindex(clusterIDs, fill_value=0)
        - grouped["QntMPB"]
        - grouped["QntROCK"]
    )

    grouped = grouped.reset_index().rename(columns={columnname: "Cluster"}).drop_duplicates()
    ids_count = len(df)

    return grouped, ids_count

# Pass each cluster ID once; repeated IDs only create duplicate rows that are dropped again later.
d, i = countingOcorencies(df_cluster_PCA_gp_Genre, "cluster_x", sorted(df_cluster_PCA_gp_Genre["cluster_x"].unique()))
d1, i2 = countingOcorencies(DecisionTreeDF, "node_id", sorted(DecisionTreeDF["node_id"].unique()))
d_sorted = d.sort_values("Cluster")
d1_sorted = d1.sort_values("Cluster")
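
As a cross-check on the counting logic, the same per-cluster table can be sketched with `pd.crosstab`. The toy data below is invented; the column names `QntMPB`, `QntROCK`, and `QntMETAL` follow the function above, and "metal" again means every genre that is neither `mpb` nor `rock`.

```python
import pandas as pd

# Invented toy data: six tracks spread over three clusters.
toy = pd.DataFrame({
    "cluster_x": [0, 0, 1, 1, 1, 2],
    "track_genre": ["mpb", "rock", "rock", "death metal", "death metal", "mpb"],
})

# Rows are clusters, columns are genres, cells are track counts.
counts = pd.crosstab(toy["cluster_x"], toy["track_genre"])

# Make sure both named genres exist as columns even if absent from the data.
counts = counts.reindex(columns=counts.columns.union(["mpb", "rock"]), fill_value=0)

summary = pd.DataFrame({
    "QntMPB": counts["mpb"],
    "QntROCK": counts["rock"],
    # Everything that is neither mpb nor rock is counted as metal.
    "QntMETAL": counts.sum(axis=1) - counts["mpb"] - counts["rock"],
})
```

On real data, comparing `summary` with the function's output is a quick way to confirm that both approaches agree.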


### Step 7: Heatmap for Hierarchical Clustering (106 Clusters)

We generate a heatmap showing how many songs from each genre fall into each cluster obtained from hierarchical clustering.


In [None]:
for col in ["QntMPB", "QntROCK", "QntMETAL"]:
    d_sorted[col] = d_sorted[col].apply(lambda x: list(x)[0] if isinstance(x, set) else x)

heatmap_data = d_sorted.set_index("Cluster")[["QntMPB", "QntROCK", "QntMETAL"]]

plt.figure(figsize=(8, 20))
sns.heatmap(heatmap_data, annot=True, cmap="coolwarm", fmt="d")
plt.title("Heatmap of Song Counts per Genre per Hierarchical Cluster")
plt.xlabel("Genre")
plt.ylabel("Cluster")
plt.tight_layout()
plt.show()


### Step 8: Heatmap for Decision Tree Leaf Nodes (106 Nodes)

This heatmap visualizes the genre distribution across each leaf node (interpreted as a cluster) in the Decision Tree.


In [None]:
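# Unwrap single-element sets if present; plain integer counts pass through unchanged.
for col in ["Cluster", "QntMPB", "QntROCK", "QntMETAL"]:
    d1_sorted[col] = d1_sorted[col].apply(lambda x: list(x)[0] if isinstance(x, set) else x)

# Use the sorted table, as in Step 7, so the leaf nodes appear in order.
heatmap_data = d1_sorted.set_index("Cluster")[["QntMPB", "QntROCK", "QntMETAL"]]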

plt.figure(figsize=(8, 20))
sns.heatmap(heatmap_data, annot=True, cmap="coolwarm", fmt="d")
plt.title("Heatmap of Song Counts per Genre per Decision Tree Leaf Node")
plt.xlabel("Genre")
plt.ylabel("Leaf Node")
plt.tight_layout()
plt.show()
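
The heatmaps show separability visually. One possible way to summarize it as a number is cluster purity: the share of each cluster taken by its dominant genre. This is a sketch on made-up counts, not a metric used in the original notebook, and `counts` is a hypothetical table in the same layout as the ones plotted above.

```python
import pandas as pd

# Made-up counts in the same layout as the tables plotted above.
counts = pd.DataFrame(
    {"QntMPB": [8, 1, 0], "QntROCK": [2, 1, 5], "QntMETAL": [0, 8, 5]},
    index=pd.Index([0, 1, 2], name="Cluster"),
)

totals = counts.sum(axis=1)

# Share of the dominant genre in each cluster (1.0 means a single-genre cluster).
purity = counts.max(axis=1) / totals

# Size-weighted average purity across all clusters.
overall_purity = (purity * totals).sum() / totals.sum()
```

Computing `overall_purity` for both the hierarchical clusters and the Decision Tree leaf nodes would give one comparable figure per method, with higher values indicating cleaner genre separation.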
