# Spotify Challenge - Clustering Songs - Unsupervised Learning

🎯 The goal of this recap is to **cluster songs** using **KMeans _(clustering algorithm)_** combined with a **PCA _(dimensionality reduction)_**.

In [1]:
# Data Manipulation
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

# Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Pipeline and Column Transformers
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn import set_config
set_config(display = "diagram")

# Scaling
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler

# Cross Validation
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict

# Unsupervised Learning
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans 

# STATISTICS
from statsmodels.graphics.gofplots import qqplot
# This function plots your sample against a Normal distribution, 
# to see whether your sample is normally distributed or not

## (1) The Spotify Dataset

In [2]:
spotify = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_spotify_data.csv')
spotify.head()

### (1.1) Basic Info

In [3]:
# YOUR CODE HERE

### (1.2) Working on the numerical features

🔢  Let's focus on the numerical features:

In [4]:
# YOUR CODE HERE
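
One possible way to isolate the numerical features is `select_dtypes`. The sketch below runs on a toy DataFrame whose column names and values are made up for illustration; on the real data the result would play the role of the `spotify_num` variable used later in this recap.

```python
import pandas as pd

# Toy stand-in for the Spotify dataset (made-up values, hypothetical columns)
toy = pd.DataFrame({
    "name": ["song_a", "song_b", "song_c"],
    "artists": ["x", "y", "z"],
    "danceability": [0.71, 0.35, 0.52],
    "energy": [0.80, 0.22, 0.61],
    "tempo": [120.0, 87.5, 140.2],
})

# Keep only the numeric columns; text columns are dropped
toy_num = toy.select_dtypes(include="number")
print(toy_num.columns.tolist())
```
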

### (1.3) Correlations between features

፨ Visualize the correlation matrix (`PuRd` is a nice cmap choice)

In [5]:
# YOUR CODE HERE
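
A minimal sketch of the correlation step, assuming synthetic data in which one feature is deliberately built to correlate with another (the column names are hypothetical). In a notebook, the resulting matrix could then be drawn with `sns.heatmap(corr, cmap="PuRd", annot=True)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 200)

# Synthetic features: loudness is built from energy, tempo is independent
toy_num = pd.DataFrame({
    "energy": energy,
    "loudness": -20 + 15 * energy + rng.normal(0, 1, 200),
    "tempo": rng.uniform(60, 180, 200),
})

# Pearson correlation between every pair of columns
corr = toy_num.corr()
print(corr.round(2))
```
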

### (1.4) 3D Visualisation

🎨 Let's select 3 features of these songs and visualise them in a 3D scatter plot with `plotly`:

In [7]:
# YOUR CODE HERE

🎯 _Remember: our goal is to cluster these songs_

## (2) Unsupervised Learning: Clustering

### (2.1) Vanilla KMeans

💫 Let's try our first _Unsupervised Algorithm_, **`KMeans`**, directly on `spotify_num` without any transformation.

In [8]:
kmeans_vanilla = KMeans(n_clusters = 8) # 8 is the default number of clusters in the KMeans implemented by SKLearn
kmeans_vanilla.fit(spotify_num)

In [9]:
labels_vanilla = kmeans_vanilla.labels_
labels_vanilla

In [10]:
fig = px.scatter_3d(spotify_num, 
                    x = 'danceability',
                    y = 'energy',
                    z = 'speechiness',
                    color = labels_vanilla)
fig.show()

😭 It doesn't look good at all... who would pay 10-15 euros a month to Spotify, Apple Music, Deezer, Amazon Music or Tidal for that?

### (2.2) Scaling + KMeans

🪜 Remember that KMeans is a distance-based algorithm: a feature measured on a large scale (hundreds or thousands) dominates the Euclidean distance over features bounded between 0 and 1. We should therefore scale the features so that they start with an "equal chance" of impacting the clusters.
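
To make the point about distances concrete, here is a small sketch with four made-up songs described by a 0-1 feature and a feature in the hundreds (both columns and values are assumptions for illustration). Before scaling, the large-scale feature decides which song is the nearest neighbour; after scaling, the other feature gets its say.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: [feature bounded in 0-1, feature in the hundreds] (made-up values)
X = np.array([
    [0.10, 100.0],  # song A
    [0.90, 104.0],  # song B: very different on feature 1, close on feature 2
    [0.15, 110.0],  # song C: close on feature 1, a bit further on feature 2
    [0.85, 180.0],  # song D: widens the range of feature 2
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return np.linalg.norm(a - b)

d_ab_raw, d_ac_raw = dist(X[0], X[1]), dist(X[0], X[2])

X_scaled = StandardScaler().fit_transform(X)
d_ab_scaled, d_ac_scaled = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])

print(f"raw:    d(A,B)={d_ab_raw:.2f}  d(A,C)={d_ac_raw:.2f}")
print(f"scaled: d(A,B)={d_ab_scaled:.2f}  d(A,C)={d_ac_scaled:.2f}")
```

On the raw values A looks closer to B, while on the scaled values A is closer to C.
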

In [11]:
# YOUR CODE HERE

In [12]:
# YOUR CODE HERE
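
The next cell relies on a `scalers` object and on three lists named `features_robust`, `features_standard` and `features_minmax`. One plausible way to build them is a `ColumnTransformer`. The sketch below is an assumption, not the recap's actual solution: the data are synthetic and the assignment of each column to a scaler is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler

rng = np.random.default_rng(42)
toy_num = pd.DataFrame({
    "duration_ms": rng.lognormal(12, 0.5, 100),  # skewed, with large values
    "tempo": rng.normal(120, 25, 100),           # roughly bell-shaped
    "danceability": rng.uniform(0, 1, 100),      # bounded
})

# Hypothetical assignment of features to scalers
features_robust = ["duration_ms"]
features_standard = ["tempo"]
features_minmax = ["danceability"]

scalers = ColumnTransformer([
    ("robust", RobustScaler(), features_robust),
    ("standard", StandardScaler(), features_standard),
    ("minmax", MinMaxScaler(), features_minmax),
])

# Output columns follow the order of the transformers listed above
toy_scaled = pd.DataFrame(
    scalers.fit_transform(toy_num),
    columns=features_robust + features_standard + features_minmax,
    index=toy_num.index,
)
print(toy_scaled.describe().round(2))
```

Because the `ColumnTransformer` outputs its columns in the order of its transformers, the column names have to be rebuilt in that same order, which is what the concatenation of the three lists does.
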

In [14]:
spotify_scaled = pd.DataFrame(scalers.fit_transform(spotify_num))
spotify_scaled.columns = features_robust + features_standard + features_minmax
spotify_scaled

🙏 Now that our data are scaled, let's cluster our songs again and store the resulting cluster labels.

In [15]:
kmeans_scaled = KMeans(n_clusters = 8)
kmeans_scaled.fit(spotify_scaled)

In [16]:
labels_scaled = kmeans_scaled.labels_
labels_scaled

🧪 Okay, our songs' clustering looks better, even if we can't yet rival the data science team at Spotify!

In [17]:
fig_scaled = px.scatter_3d(spotify_scaled,
                           x = 'danceability',
                           y = 'energy',
                           z = 'speechiness',
                           color = labels_scaled)
fig_scaled.show()

🤪 It is a bit better but still messy...

### (2.3) Scaling + PCA + KMeans

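🧑🏻‍🏫 What if we perform a PCA before running our clustering algorithm? We could:
* work with principal components, which are orthogonal (uncorrelated) and ordered by explained variance, so that KMeans computes distances along the directions that carry the most information
* potentially reduce dimensionality by dropping the low-variance components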
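
A caveat worth checking numerically: a PCA that keeps every component is only a centering plus a rotation, so pairwise Euclidean distances (and therefore what KMeans sees) are unchanged. Distances only change once some components are dropped. The sketch below illustrates this on random data, which is an assumption made purely for the demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

D_original = pairwise_distances(X)

# Full PCA: as many components as features -> rotation of the centered data
X_rotated = PCA().fit_transform(X)
D_rotated = pairwise_distances(X_rotated)
distances_preserved = np.allclose(D_original, D_rotated)

# Truncated PCA: only 2 components kept -> distances can only shrink
X_reduced = PCA(n_components=2).fit_transform(X)
D_reduced = pairwise_distances(X_reduced)

print("distances preserved by full PCA:", distances_preserved)
print("mean distance, original vs reduced:", D_original.mean(), D_reduced.mean())
```
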

In [18]:
# YOUR CODE HERE

🕵🏻 Print these components

In [19]:
# Print the PCs (as rows)
# Each PC is expressed as a linear combination of the initial features (10 PCs, one column per original feature)

pass  # YOUR CODE HERE
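
A possible sketch for fitting the PCA and displaying its components as rows. The data are random and the feature names `f1` to `f4` are hypothetical; the name `Wt` simply mirrors the variable inspected in the next cell.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
toy_scaled = pd.DataFrame(rng.normal(size=(200, 4)),
                          columns=["f1", "f2", "f3", "f4"])

pca = PCA().fit(toy_scaled)

# components_ has shape (n_components, n_features): one principal component per row
Wt = pd.DataFrame(
    pca.components_,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
    columns=toy_scaled.columns,
)
print(Wt.round(2))
print(Wt.shape)
```
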

In [20]:
Wt.shape

🤔 How many principal components should we keep?

In [21]:
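# 'seaborn-deep' was renamed 'seaborn-v0_8-deep' in matplotlib 3.6: use whichever is available
deep_style = 'seaborn-v0_8-deep' if 'seaborn-v0_8-deep' in plt.style.available else 'seaborn-deep'
with plt.style.context(deep_style):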
    # figsize
    plt.figure(figsize=(10,6))
    # getting axes
    ax = plt.gca()
    # plotting
    explained_variance_ratio_cumulated = np.cumsum(pca.explained_variance_ratio_)
    x_axis_ticks = np.arange(1,explained_variance_ratio_cumulated.shape[0]+1)
    ax.plot(x_axis_ticks,explained_variance_ratio_cumulated,label="cumulated variance ratio",color="purple",linestyle=":",marker="D",markersize=10)
    # customizing
    ax.set_xlabel('Number of Principal Components')
    ax.set_ylabel('Cumulated explained variance ratio')
    ax.legend(loc="upper left")
    ax.set_title('The Elbow Method')
    ax.set_xticks(x_axis_ticks)
    ax.scatter(4,explained_variance_ratio_cumulated[4-1],c='blue',s=400)
    ax.scatter(5,explained_variance_ratio_cumulated[5-1],c='blue',s=400)
    ax.scatter(6,explained_variance_ratio_cumulated[6-1],c='blue',s=400)
    ax.grid(axis="x",linewidth=0.5)
    ax.grid(axis="y",linewidth=0.5)

🔮 Project your $10000 \times 10$ `spotify_num` dataset into this new space with the number of principal components you decided to keep.

In [22]:
# YOUR CODE HERE
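
One way to turn the cumulated-variance curve into a projection is sketched below. Everything here is an assumption for illustration: the data are synthetic (6 features driven by 2 latent factors) and the 90% threshold is an arbitrary target, not a value taken from this recap.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic data: 6 observed features generated from 2 latent factors plus noise
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))

pca = PCA().fit(X)
cumulated = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.90  # hypothetical target share of explained variance
# Index of the first component reaching the threshold, converted to a count
n_kept = int(np.argmax(cumulated >= threshold) + 1)

# Refit with the chosen number of components and project the data
X_proj = PCA(n_components=n_kept).fit_transform(X)
print(n_kept, X_proj.shape)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, in which case it keeps just enough components to reach that share of explained variance.
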

### (2.4) Ideal number of clusters?

In [29]:
nb_clusters_to_try = np.arange(1,20+1,1)
nb_clusters_to_try

<details>
    <summary><i>Are there numbers of clusters that are useless to try?</i></summary>

* $K = 1$ means that you would have only 1 cluster with the whole dataset of $10000$ songs
* $K = 10000$ means that each of the $10000$ songs would be its own cluster!
* $K = 2$ means that you would have only 2 clusters (not necessarily of $5000$ songs each), which is likely too coarse to build playlists...
    
        
</details>        

In [30]:
# Apply the elbow method to find the optimal number of clusters.
wcss = []

for K in nb_clusters_to_try:
    print('working with ' + str(K) + ' clusters...', flush=True)
    kmeans = KMeans(n_clusters = K)
    kmeans.fit(spotify_proj)
    wcss.append(kmeans.inertia_)
print("DONE !")
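
The `inertia_` collected in this loop is the within-cluster sum of squares: the sum, over all samples, of the squared distance to the centroid of their assigned cluster. The sketch below recomputes it by hand on synthetic blobs (an assumption for illustration) and compares it with the value reported by `KMeans`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with 4 groups
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Centroid assigned to each sample, shape (n_samples, n_features)
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# Within-cluster sum of squares computed manually
wcss_manual = np.sum((X - assigned_centroids) ** 2)

print(wcss_manual, kmeans.inertia_)
```

Since adding clusters can only bring samples closer to a centroid, the inertia keeps decreasing as $K$ grows, which is why we look for an elbow rather than a minimum.
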

In [48]:
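# 'seaborn-deep' was renamed 'seaborn-v0_8-deep' in matplotlib 3.6: use whichever is available
deep_style = 'seaborn-v0_8-deep' if 'seaborn-v0_8-deep' in plt.style.available else 'seaborn-deep'
with plt.style.context(deep_style):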
    # figsize
    plt.figure(figsize=(20,10))
    # getting axes
    ax = plt.gca()
    # plotting
    ax.plot(nb_clusters_to_try, wcss,color="blue",linestyle=":",marker="D",label="Inertia")
    # customizing
    ax.legend(loc="upper right")
    ax.set_title('The Elbow Method')
    ax.set_xticks(nb_clusters_to_try)
    ax.set_xlabel('Number of clusters')
    ax.set_ylabel('Within-Cluster Sums of Squares')
    
    # highlighting the elbows and some interesting values
    ax.scatter(4,wcss[4-1],c='red',s=400)
    
    ax.scatter(6,wcss[6-1],c='red',s=400)
    ax.scatter(8,wcss[8-1],c='red',s=400)
    ax.scatter(9,wcss[9-1],c='red',s=400)
    
    # annotate
    ax.annotate("Spotify", 
                (6,wcss[6-1]),
                (6+0.50,wcss[6-1]+5000), 
                arrowprops=dict(facecolor='black'),
                fontsize=16,
                horizontalalignment='right', 
                verticalalignment='top')
    
    ax.grid(axis="y",linewidth=0.5)
    plt.show()

### (2.5) Yellowbrick and Elbow Method

📚 There is a nice ***Data Visualisation*** library dedicated to Machine Learning algorithms called **`Yellowbrick`**.

⚙️ Install the library

In [49]:
# !pip install yellowbrick

In [50]:
# !pip install --upgrade yellowbrick

6️⃣ Try to find the Elbow of the KMeans algorithm on Spotify using the ***KElbowVisualizer***

In [34]:
from yellowbrick.cluster import KElbowVisualizer

In [35]:
# YOUR CODE HERE

👉 This `KElbowVisualizer` was able to detect the elbow at $K = 4$, but one could argue that a service costing 10-15 USD deserves more finely curated playlists. Let's use $K = 6$ instead and build the playlists.

### (2.6) Spotify : 6 daily mixes

In [36]:
spotify_clusters = 6

print('Working with ' + str(spotify_clusters) + ' clusters as in Spotify', flush=True)
print("-"*80)

kmeans = KMeans(n_clusters = spotify_clusters, max_iter = 300)

kmeans.fit(spotify_proj)

labelling = kmeans.labels_

fig_scaled = px.scatter_3d(spotify_proj,
                           x = 0,
                           y = 1,
                           z = 2,
                           color=labelling)
fig_scaled.show()

In [37]:
# Attach the cluster label of each song as a new "label" column
spotify_labelled = spotify.assign(label=labelling)
spotify_labelled

In [38]:
np.unique(labelling)

In [39]:
spotify_labelled.label.value_counts()

In [40]:
daily_mixes = {}

for numero_cluster in np.unique(labelling):
    daily_mixes[numero_cluster] = spotify_labelled[spotify_labelled.label == numero_cluster]

In [41]:
for key,value in daily_mixes.items():
    print("-"*100)
    print(f"Here are some songs for the playlist number {key}")
    print("-"*100)
    display(value.sample(20))

### (2.7) Pipelining the labelling process

In [42]:
from sklearn import set_config; set_config(display="diagram")  

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pipeline_spotify = Pipeline([
    ("scaler",StandardScaler()),
    ("pca",PCA()),
    ("kmeans",KMeans())
])

pipeline_spotify

In [43]:
spotify_num

In [44]:
pipeline_spotify.fit(spotify_num)

In [45]:
labels_spotify_with_pipeline = pipeline_spotify.predict(spotify_num)
labels_spotify_with_pipeline

In [51]:
fig_scaled = px.scatter_3d(spotify_proj,
                           x = 0, 
                           y = 1,
                           z = 2,
                           color = labels_spotify_with_pipeline)
fig_scaled.show()
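
Note that the pipeline above is built with default settings: `PCA()` keeps every component and `KMeans()` uses 8 clusters, so its labels are not necessarily comparable with the 6 playlists built earlier, and `spotify_proj` was not produced by this pipeline. A hedged variant is sketched below on synthetic blobs; the number of components (3) and of clusters (6) are assumptions chosen for illustration. It also shows how to recover the projection computed by the pipeline's own scaler and PCA steps.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 10 numerical features
X, _ = make_blobs(n_samples=500, n_features=10, centers=6, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=3)),  # hypothetical number of components
    ("kmeans", KMeans(n_clusters=6, n_init=10, random_state=0)),
])
pipe.fit(X)
labels = pipe.predict(X)

# All steps except the last one: scaling + PCA projection used by this pipeline
X_proj = pipe[:-1].transform(X)
print(X_proj.shape, np.unique(labels))
```

Plotting `X_proj` coloured by `labels` then guarantees that the coordinates and the clusters come from the same preprocessing.
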

## (3) (Optional) To go the extra mile...

* Making sense of `PCA`: <a href="https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579">Cross Validated (Stats Stack Exchange), Eigenelements of a PCA</a>

<br/>

* `TruncatedSVD` : <a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html">Doc</a>

```quote
Dimensionality reduction using truncated SVD (aka LSA).

This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently.
```