# Creating Cohorts of Songs

## Problem Scenario
Customers expect personalized treatment, whether they are shopping on an e-commerce website or watching Netflix. They want content that matches their preferences, so companies must consistently surface the most relevant content to keep them engaged.

Spotify, a Swedish audio streaming and media service provider, had over 456 million monthly active users, including more than 195 million paid subscribers, as of September 2022. The company aims to create cohorts of songs to improve its song recommendations. These cohorts will be based on relevant song features, so that each group contains similar types of songs.

## Problem Objective
As a data scientist, you will perform exploratory data analysis and cluster analysis to create cohorts of songs. The goal is to better understand the factors that define each cohort.

## Data Description
The dataset contains information from Spotify's API on all Rolling Stones albums available on Spotify. Each song has a unique ID.

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("rolling_stones_spotify.csv")
df.head()

In [None]:
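# Inspect column types, missing values per column, and duplicate rows.
# df.info() prints its report and returns None, so call it on its own line.
df.info()
print(df.isnull().sum())
print("Duplicate rows:", df.duplicated().sum())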

In [None]:
df.describe()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Treat songs with popularity >= 50 as "popular" and count them per album
popular_songs = df[df['popularity'] >= 50]
top_albums = popular_songs['album'].value_counts().head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_albums.values, y=top_albums.index, palette="Blues_d")
plt.title('Top 10 Albums by Number of Popular Songs (popularity >= 50)')
plt.xlabel('Number of Popular Songs')
plt.ylabel('Album')
plt.tight_layout()
plt.show()

In [None]:
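# Drop identifier and text columns, then keep only numeric columns so that
# corr() and the scaling step later on do not fail on leftover non-numeric data
features = df.drop(columns=["name", "album", "release_date", "id", "uri"])
features = features.select_dtypes(include="number")
corr = features.corr()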

plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix of Song Features (including Popularity)")
plt.tight_layout()
plt.show()
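
To make the heatmap easier to read, here is a minimal sketch of the number in each cell. By default, `DataFrame.corr()` computes the Pearson coefficient for every pair of columns. The helper name `pearson` and the toy values below are illustrative assumptions, not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

# Toy values (hypothetical, not from the Rolling Stones dataset)
energy = [0.2, 0.4, 0.6, 0.8]
loudness = [-12.0, -10.0, -8.0, -6.0]
acousticness = [0.9, 0.7, 0.5, 0.3]

r_up = pearson(energy, loudness)        # columns rise together
r_down = pearson(energy, acousticness)  # one rises as the other falls
print(round(r_up, 3), round(r_down, 3))
```

Values near +1 or -1 indicate a strong linear relationship. Pairs of features that correlate strongly carry overlapping information, which is worth keeping in mind when interpreting the clusters.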

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

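# Exclude popularity so clusters are driven by the remaining features only,
# then standardize each column to zero mean and unit variance
X = features.drop(columns=["popularity"])
X_scaled = StandardScaler().fit_transform(X)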

pca = PCA(n_components=2)
pca_components = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=pca_components[:, 0], y=pca_components[:, 1], alpha=0.6)
plt.title("PCA Projection of Song Features")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.tight_layout()
plt.show()
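
The scaling step matters because both PCA and K-means are sensitive to the scale of each column. As a rough sketch of what `StandardScaler` does to a single column, the hypothetical helper `standardize` below subtracts the mean and divides by the population standard deviation. The toy tempo and danceability values are assumptions made for illustration.

```python
import statistics

def standardize(values):
    """Z-score a column: zero mean, unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, as StandardScaler uses
    if std == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Toy columns on very different scales (hypothetical values)
tempo = [100.0, 120.0, 140.0]   # e.g. beats per minute
danceability = [0.3, 0.5, 0.7]  # e.g. a 0-1 score

tempo_z = standardize(tempo)
dance_z = standardize(danceability)
print([round(v, 3) for v in tempo_z])
print([round(v, 3) for v in dance_z])
```

After standardization both toy columns map to the same z-scores, so neither dominates a distance calculation simply because its raw numbers are larger.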

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

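# Fit K-means for k = 2..10 and record the mean silhouette score for each k.
# n_init is set explicitly because its default differs across scikit-learn versions.
scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))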

plt.plot(range(2, 11), scores, marker='o')
plt.title("Silhouette Scores for Various Cluster Counts")
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.grid(True)
plt.show()
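
For intuition about what the plotted score measures, here is a minimal pure-Python sketch of the mean silhouette coefficient with Euclidean distance. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the nearest other cluster, and the point's score is `(b - a) / max(a, b)`. The function name `mean_silhouette` and the toy points are illustrative assumptions; the notebook itself relies on `sklearn.metrics.silhouette_score`.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, p in enumerate(points):
        # Accumulate distances from point i to every cluster
        dist_sum = {c: 0.0 for c in clusters}
        count = {c: 0 for c in clusters}
        for j, q in enumerate(points):
            if i != j:
                dist_sum[labels[j]] += math.dist(p, q)
                count[labels[j]] += 1
        own = labels[i]
        if count[own] == 0:
            continue  # a singleton cluster contributes a score of 0
        a = dist_sum[own] / count[own]
        b = min(dist_sum[c] / count[c] for c in clusters if c != own)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Two well-separated toy groups (hypothetical data)
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(toy_points, [0, 0, 1, 1])  # labels match the groups
bad = mean_silhouette(toy_points, [0, 1, 0, 1])   # labels cut across them
print(round(good, 3), round(bad, 3))
```

Scores close to 1 mean points sit much nearer to their own cluster than to any other, while negative scores suggest points have been assigned to the wrong cluster. The value of `k` with the highest score in the plot is a reasonable starting candidate, not a guarantee of meaningful cohorts.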

In [None]:
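# Fit the final model with k=2; check this choice against the silhouette plot
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)

# Attach labels to a copy of the data rather than to `features`, so that
# re-running earlier cells does not feed the cluster label back in as a feature
df_clustered = df.copy()
df_clustered['cluster'] = kmeans.fit_predict(X_scaled)

# Average feature values per cluster
df_clustered.groupby('cluster').mean(numeric_only=True)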