# Lab | Unsupervised learning intro (GNOD - part 4)

### Instructions

It's time to cluster the songs you collected. Remember that the ultimate goal of this little project is to improve the song recommendations so that the user is more likely to enjoy the suggested song. Clustering lets the recommendation system restrict its suggestions to songs in the same cluster, that is, songs with similar audio features.

The activities you did with the Spotify API and the PopVortex web scraping will allow you to create a pipeline such that when the user enters a song, you:

1. Check whether the input song is in the PopVortex Hot 100.
2. If it is, recommend another Hot 100 song.
3. If it is NOT in the Hot 100, collect the audio features for the input song from the Spotify API.
4. Send those audio features to the clustering model, which returns a cluster number.
5. Recommend a song from the same cluster.

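The pipeline above can be sketched as a single function. This is a minimal sketch, not the lab's reference solution: `recommend`, `get_features`, `predict_cluster`, and `songs_by_cluster` are hypothetical names, and the toy titles and stand-in functions exist only so the example runs without network access or a trained model.

```python
import random


def recommend(title, hot_titles, get_features, predict_cluster, songs_by_cluster, rng=random):
    """Return (source, recommendation) for the given song title."""
    key = title.strip().lower()
    hot_lookup = {t.lower(): t for t in hot_titles}
    if key in hot_lookup:
        # The song is in the hot list: suggest a different hot song.
        others = [t for k, t in hot_lookup.items() if k != key]
        return "hot", rng.choice(others)
    # Otherwise: fetch features, ask the model for a cluster, pick from that cluster.
    features = get_features(title)
    cluster_id = predict_cluster(features)
    return "cluster", rng.choice(songs_by_cluster[cluster_id])


# Toy stand-ins (assumptions, not real data or a real model).
hot = ["Flowers", "Kill Bill"]
clusters = {0: ["Yesterday", "Hallelujah"], 1: ["Thunderstruck"]}


def fake_features(title):
    return {"energy": 0.3}


def fake_model(features):
    return 0 if features["energy"] < 0.5 else 1


print(recommend("flowers", hot, fake_features, fake_model, clusters))
print(recommend("Imagine", hot, fake_features, fake_model, clusters))
```

In the real notebook, `get_features` would wrap the Spotify call and `predict_cluster` would wrap the scaler and the fitted K-Means model.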
The more songs you use to build the clusters, the more representative the clusters tend to be, so collect as many songs as possible. You can find larger datasets on Kaggle whose audio features have already been scraped and saved as .csv files, and you can add these to your own scraped data. Here are some datasets to try:

* Datasets 1960 - 2019
* Datasets 1921 - 2020
* General Song Datasets
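Merging an external dataset with your own scraped data mostly comes down to aligning column names and removing songs that appear in both. The sketch below uses two tiny made-up frames; the column names `id` and `year` on the Kaggle side are assumptions, so check the actual files before reusing it.

```python
import pandas as pd

# Toy frames standing in for your scraped data and a Kaggle download.
scraped = pd.DataFrame(
    {"song_id": ["a1", "b2"], "danceability": [0.48, 0.52], "energy": [0.72, 0.76]}
)
kaggle = pd.DataFrame(
    {"id": ["b2", "c3"], "danceability": [0.52, 0.36], "energy": [0.76, 0.46], "year": [1958, 1967]}
)

# Align the ID column name, keep only the columns both sources share,
# then stack the rows and drop songs that appear twice.
kaggle = kaggle.rename(columns={"id": "song_id"})
shared = [c for c in scraped.columns if c in kaggle.columns]
combined = (
    pd.concat([scraped[shared], kaggle[shared]], ignore_index=True)
    .drop_duplicates(subset="song_id")
    .reset_index(drop=True)
)
print(combined)
```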

In [185]:
from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import pickle

In [186]:
import numpy as np
import pandas as pd
import requests
import spotipy
from pandas import json_normalize
from spotipy.oauth2 import SpotifyClientCredentials

## Lab | Get data

In [187]:
import pandas as pd

# read csv file from the previous lab (playlists)
file = 'df_w_audio_ft2.csv'
Audio = pd.read_csv(file)
Audio.shape

(50609, 21)

In [188]:
Audio.isna().sum().sum()

0

## Lab | Clustering

In [189]:
Audio.head()

Unnamed: 0,track.name,name,song_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,Like a Rolling Stone,Bob Dylan,3AhXZa8sUQht0UEdBJgpGc,0.482,0.721,0,-6.839,1,0.0321,0.731,0.0,0.189,0.557,95.263,audio_features,3AhXZa8sUQht0UEdBJgpGc,spotify:track:3AhXZa8sUQht0UEdBJgpGc,https://api.spotify.com/v1/tracks/3AhXZa8sUQht...,https://api.spotify.com/v1/audio-analysis/3AhX...,369600,4
1,Smells Like Teen Spirit,Nirvana,3oTlkzk1OtrhH8wBAduVEi,0.485,0.863,1,-9.027,1,0.0495,1.2e-05,0.0162,0.138,0.767,116.835,audio_features,3oTlkzk1OtrhH8wBAduVEi,spotify:track:3oTlkzk1OtrhH8wBAduVEi,https://api.spotify.com/v1/tracks/3oTlkzk1Otrh...,https://api.spotify.com/v1/audio-analysis/3oTl...,300977,4
2,A Day In The Life - Remastered,The Beatles,3ZFBeIyP41HhnALjxWy1pR,0.364,0.457,4,-14.162,0,0.0675,0.29,0.000106,0.922,0.175,163.219,audio_features,3ZFBeIyP41HhnALjxWy1pR,spotify:track:3ZFBeIyP41HhnALjxWy1pR,https://api.spotify.com/v1/tracks/3ZFBeIyP41Hh...,https://api.spotify.com/v1/audio-analysis/3ZFB...,337413,4
3,Good Vibrations (Mono),The Beach Boys,5Qt4Cc66g24QWwGP3YYV9y,0.398,0.413,1,-10.934,1,0.0388,0.0822,2.5e-05,0.0891,0.331,133.574,audio_features,5Qt4Cc66g24QWwGP3YYV9y,spotify:track:5Qt4Cc66g24QWwGP3YYV9y,https://api.spotify.com/v1/tracks/5Qt4Cc66g24Q...,https://api.spotify.com/v1/audio-analysis/5Qt4...,219147,4
4,Johnny B Goode,Chuck Berry,7MH2ZclofPlTrZOkPzZKhK,0.518,0.756,10,-10.851,1,0.0915,0.735,6.2e-05,0.317,0.968,166.429,audio_features,7MH2ZclofPlTrZOkPzZKhK,spotify:track:7MH2ZclofPlTrZOkPzZKhK,https://api.spotify.com/v1/tracks/7MH2ZclofPlT...,https://api.spotify.com/v1/audio-analysis/7MH2...,160893,4


In [190]:
Audio.columns

Index(['track.name', 'name', 'song_id', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href',
       'analysis_url', 'duration_ms', 'time_signature'],
      dtype='object')

In [191]:
Audio_ft = Audio.select_dtypes(include='number')

In [192]:
Audio_ft.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
danceability,50609.0,0.576502,0.160818,0.0,0.468,0.586,0.694,0.988
energy,50609.0,0.665728,0.211479,2e-05,0.523,0.697,0.837,0.999
key,50609.0,5.276077,3.570118,0.0,2.0,5.0,9.0,11.0
loudness,50609.0,-8.014889,3.528075,-60.0,-10.031,-7.367,-5.414,2.769
mode,50609.0,0.657571,0.474527,0.0,0.0,1.0,1.0,1.0
speechiness,50609.0,0.067014,0.067462,0.0,0.0333,0.043,0.0675,0.954
acousticness,50609.0,0.226327,0.277448,0.0,0.0117,0.0915,0.367,0.996
instrumentalness,50609.0,0.123533,0.260593,0.0,1e-06,0.000277,0.0452,0.993
liveness,50609.0,0.191982,0.160546,0.00829,0.0929,0.127,0.247,0.998
valence,50609.0,0.533088,0.251775,0.0,0.332,0.535,0.741,0.995


In [193]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(Audio_ft)
Audio_scaled = scaler.transform(Audio_ft)
pd.DataFrame(Audio_scaled,columns=Audio_ft.columns).head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,-0.587635,0.26136,-1.477858,0.333298,0.721629,-0.517542,1.818998,-0.47405,-0.018572,0.094973,-0.993402,1.394934,0.188735
1,-0.56898,0.932828,-1.197753,-0.286876,0.721629,-0.259616,-0.815708,-0.411884,-0.336242,0.929058,-0.193964,0.626151,0.188735
2,-1.32139,-0.987002,-0.357436,-1.742359,-1.385753,0.007203,0.229498,-0.473644,4.547152,-1.422267,1.524985,1.034343,0.188735
3,-1.109969,-1.195062,-1.197753,-0.827403,0.721629,-0.418226,-0.519478,-0.473953,-0.640831,-0.802661,0.426369,-0.290591,0.188735
4,-0.363778,0.426863,1.323197,-0.803877,0.721629,0.362962,1.833415,-0.473812,0.778717,1.727396,1.643945,-0.943211,0.188735


In [194]:
Audio_scaled

array([[-0.58763499,  0.26136026, -1.47785845, ..., -0.99340235,
         1.39493423,  0.18873466],
       [-0.56898022,  0.93282785, -1.19775292, ..., -0.19396361,
         0.62615065,  0.18873466],
       [-1.32138961, -0.98700203, -0.35743633, ...,  1.52498532,
         1.03434322,  0.18873466],
       ...,
       [-1.50171914,  0.84771224, -0.63754186, ..., -0.04572724,
         1.26080069, -3.10949473],
       [-0.32024984, -1.13359003,  1.04309132, ...,  0.23577362,
        -0.12209784,  0.18873466],
       [ 0.89231076, -1.11940409, -1.19775292, ..., -1.18914847,
        -0.41413769,  0.18873466]])
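Scaling matters here because K-Means relies on Euclidean distance: without it, a column like `duration_ms` (hundreds of thousands) would swamp columns like `danceability` (0 to 1). As a small check on toy data, the sketch below confirms that `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The array values and the name `scaler_demo` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy columns: danceability, tempo
X = np.array([[0.2, 90.0], [0.5, 120.0], [0.8, 150.0]])
scaler_demo = StandardScaler().fit(X)
scaled = scaler_demo.transform(X)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std().
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(scaled, manual))
```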

In [195]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=1234)
kmeans.fit(Audio_scaled)

KMeans(n_clusters=3, random_state=1234)

In [196]:
kmeans.inertia_

530008.8588523065

In [197]:
%%time
kmeans = KMeans(n_clusters=8,
                init="random",
                n_init= 1,  # try with 1, 4, 8, 20, 30, 100...
                max_iter=10,
                tol=0,
                random_state=1234)
kmeans.fit(Audio_scaled)
print(kmeans.inertia_)

403352.7596460913
CPU times: total: 109 ms
Wall time: 42.8 ms
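The `n_init` comment in the cell above suggests trying several values. K-Means repeats the whole algorithm `n_init` times from different starting centroids and keeps the run with the lowest inertia, so more restarts usually cost more time in exchange for an equal or better result. A small sketch on random toy data (not the song dataset) for comparing the values:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # unstructured toy data

inertias = {}
for n_init in (1, 5, 20):
    model = KMeans(n_clusters=8, init="random", n_init=n_init, random_state=1234).fit(X)
    inertias[n_init] = model.inertia_

print(inertias)
```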


### Lab | Clustering | choosing k | kmeans.inertia_

In [198]:
# import numpy as np
# K = range(2, 20)
# inertia = []

# for k in K:
#     kmeans = KMeans(n_clusters=k,
#                     random_state=1234)
#     kmeans.fit(Audio_scaled)
#     inertia.append(kmeans.inertia_)

# import matplotlib.pyplot as plt
# %matplotlib inline
# plt.figure(figsize=(16,8))
# plt.plot(K, inertia, 'bx-')
# plt.xlabel('k')
# plt.ylabel('inertia')
# plt.xticks(np.arange(min(K), max(K)+1, 1.0))
# plt.title('Elbow Method showing the optimal k')
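To see what an elbow looks like without plotting, the sketch below runs the same idea on synthetic data with three well-separated groups and prints how much the inertia drops each time `k` grows by one. The blob centers are arbitrary assumptions chosen to make the elbow obvious; real song data is rarely this clean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three clearly separated groups of points.
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]], cluster_std=0.5, random_state=0
)

ks = list(range(1, 7))
inertia = [KMeans(n_clusters=k, n_init=10, random_state=1234).fit(X).inertia_ for k in ks]

# How much inertia falls when k grows by one; the elbow is where the drops flatten out.
drops = -np.diff(inertia)
for k, drop in zip(ks[1:], drops):
    print(k, round(float(drop), 1))
```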

### Lab | Clustering | choosing k | Silhouette score

In [199]:
# from sklearn.metrics import silhouette_score
# K = range(2, 20)
# silhouette = []
# for k in K:
#     kmeans = KMeans(n_clusters=k,
#                     random_state=1234)
#     kmeans.fit(Audio_scaled)
#     silhouette.append(silhouette_score(Audio_scaled, kmeans.predict(Audio_scaled)))
# plt.figure(figsize=(16,8))
# plt.plot(K, silhouette, 'bx-')
# plt.xlabel('k')
# plt.ylabel('silhouette score')
# plt.xticks(np.arange(min(K), max(K)+1, 1.0))
# plt.title('Silhouette score showing the optimal k')
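The silhouette score can also be read directly: the `k` with the highest score is the candidate. The sketch below shows this on synthetic data with three groups (the blob centers are assumptions picked for illustration, not the song features).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]], cluster_std=0.5, random_state=0
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1234).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

The silhouette score compares every pair of points, so it gets slow on tens of thousands of rows. `silhouette_score` accepts a `sample_size` argument (together with `random_state`) to score a random subset instead.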

### Lab | Clustering | Clustering

In [200]:
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

kmeans = KMeans(n_clusters=19, random_state=1234).fit(Audio_scaled)
clusters_1 = kmeans.predict(Audio_scaled)
pd.Series(clusters_1).value_counts().sort_index()

0     2384
1     3780
2     3346
3     3899
4     3391
5     5108
6     2236
7     2483
8     2548
9     4367
10    1382
11    4677
12    3473
13    1278
14     193
15    2148
16     260
17    2412
18    1244
dtype: int64
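`pickle` is imported at the top of the notebook but not used yet. One plausible use is saving the fitted scaler and model together so the recommender can load them without refitting. The sketch below round-trips toy objects through memory; the names `scaler_demo` and `model_demo` are hypothetical, and writing to a file works the same way with `pickle.dump` and a file opened in `"wb"` mode.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((100, 3))  # toy feature matrix

scaler_demo = StandardScaler().fit(X)
model_demo = KMeans(n_clusters=4, n_init=10, random_state=1234).fit(scaler_demo.transform(X))

# Serialize both objects together so they cannot get out of sync.
blob = pickle.dumps({"scaler": scaler_demo, "kmeans": model_demo})
restored = pickle.loads(blob)

same = np.array_equal(
    model_demo.predict(scaler_demo.transform(X)),
    restored["kmeans"].predict(restored["scaler"].transform(X)),
)
print(same)
```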

In [351]:
# Attach each song's cluster label; the recommendation cells below filter on this column.
Audio['cluster'] = clusters_1

In [350]:
# Audio.head(50)

### Lab | Recommender | Authentication with secrets

In [203]:
secrets_file = open("secrets.txt","r")

In [204]:
string = secrets_file.read()
secrets_file.close()  # release the file handle once the contents are read

In [205]:
string

'clientid:<redacted>\nclientsecret:<redacted>'

In [206]:
secrets_dict={}
for line in string.split('\n'):
    if len(line) > 0:
        # Split on the first colon only, so a value containing ':' stays intact
        key, value = line.split(':', 1)
        secrets_dict[key.strip()] = value.strip()

In [207]:
secrets_dict

{'clientid': '<redacted>',
 'clientsecret': '<redacted>'}
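The parsing step can be wrapped in a small helper that tolerates blank lines and splits only on the first colon. This is a sketch: `parse_secrets` is a hypothetical helper and the sample values are made up. Printing only the key names is a simple way to confirm the file was read without echoing credentials into the notebook output.

```python
def parse_secrets(text):
    """Parse 'key:value' lines into a dict, skipping blank or malformed lines."""
    secrets = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # blank line or a line without a key/value separator
        key, value = line.split(":", 1)  # split on the first colon only
        secrets[key.strip()] = value.strip()
    return secrets


# Made-up sample contents; real values belong in a file kept out of version control.
sample = "clientid:abc123\n\nclientsecret:s3cr:et\n"
parsed = parse_secrets(sample)
print(sorted(parsed))  # show the key names only, never the values
```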

In [208]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=secrets_dict['clientid'],
                                                           client_secret=secrets_dict['clientsecret']))

### Lab | Recommender | Getting Artist and Song_Id

In [333]:
song_title = 'Imagine' #regex
# results = sp.search(q=song_title, limit=10, offset=1) # alternative to loop song searchs
results = sp.search(q=song_title, limit=50)
# results

### Input Title - Artist choice

In [334]:
pd.set_option("display.max_columns", 0)
tracks = pd.json_normalize(results["tracks"]["items"])
# filtered_tracks = tracks[tracks['name'].lower().str.startswith(song_title)].copy()
filtered_tracks = tracks[tracks['name'].str.lower().str.contains(song_title.lower())].copy()
# The 'artists_dfs' column is added in the next cell, once expand_list_dict has been defined.
# filtered_tracks

In [335]:
def expand_list_dict(row):
    df = json_normalize(row['artists'])
    df['song_id'] = row['id']
    return df
filtered_tracks['artists_dfs'] = filtered_tracks.apply(expand_list_dict, axis=1)
# tracks['artists_dfs'] = tracks.apply(expand_list_dict, axis=1)

In [336]:
artist_df = pd.DataFrame(columns=['external_urls.spotify', 'href', 'id', 'name', 'type', 'uri', 'song_id'])
# for mini_df in tracks['artists_dfs']:
for mini_df in filtered_tracks['artists_dfs']:
    artist_df = pd.concat([artist_df, mini_df], axis=0)
# artist_df

In [337]:
optional_singer = artist_df['name'].value_counts().head(10)
optional_singer

John Lennon               2
Doja Cat                  2
Ariana Grande             1
yvngxchris                1
Jack Johnson              1
John Lennon Experience    1
Ben Platt                 1
Clémence Alheoux          1
Pentatonix                1
Kevin Gates               1
Name: name, dtype: int64
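The per-row helper and the concatenation loop can also be expressed in one call: `pd.json_normalize` accepts `record_path` to expand a nested list and `meta` to carry parent fields along. The sketch below uses made-up items shaped like the `results["tracks"]["items"]` list. The artist and song values are placeholders, and `meta_prefix="song_"` is needed because both the song and the artist have `id` and `name` keys.

```python
import pandas as pd

# Made-up search results with the same nesting as the Spotify response.
items = [
    {"id": "s1", "name": "Imagine", "artists": [{"id": "a1", "name": "Artist A"}]},
    {
        "id": "s2",
        "name": "Imagine (Live)",
        "artists": [{"id": "a1", "name": "Artist A"}, {"id": "a2", "name": "Artist B"}],
    },
]

# One row per (song, artist) pair; song fields are prefixed to avoid name clashes.
artist_demo = pd.json_normalize(
    items, record_path="artists", meta=["id", "name"], meta_prefix="song_"
)
print(artist_demo)
print(artist_demo["name"].value_counts())
```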

### Lab | Recommender | Get song id

In [338]:
input_artist = "John Lennon"
filtered_df = artist_df[artist_df['name'] == input_artist]
choosen_song_id = filtered_df['song_id'].sample()
choosen_song_id

0    7pKfPomDEeI4TPT6EOYjn9
Name: song_id, dtype: object

### Lab | Recommender | Getting features for new song

In [346]:
new_song=sp.audio_features(choosen_song_id)
new_song_df = pd.DataFrame(new_song)
new_song_ft = new_song_df.select_dtypes(include='number')
new_song_ft

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,0.578,0.339,0,-10.686,1,0.0263,0.924,0.0491,0.0988,0.14,75.661,186213,4


### Lab | Recommender | Getting recommendation

In [347]:
# Scale the new song with the scaler fitted on the training songs, then predict its cluster
new_song_scaled = scaler.transform(new_song_ft)
predicted_cluster = kmeans.predict(new_song_scaled)
filtered_audio = Audio[Audio['cluster'] == predicted_cluster[0]]
filtered_audio.sample()

Unnamed: 0,track.name,name,song_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,cluster
35251,Beneden Alle Peil,Boudewijn de Groot,4qxliTKFH7xpsHmjssxlwp,0.623,0.256,2,-17.674,1,0.0483,0.816,0.00813,0.111,0.557,116.217,audio_features,4qxliTKFH7xpsHmjssxlwp,spotify:track:4qxliTKFH7xpsHmjssxlwp,https://api.spotify.com/v1/tracks/4qxliTKFH7xp...,https://api.spotify.com/v1/audio-analysis/4qxl...,188560,4,7
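Two details are easy to miss at this step: the new song's feature columns must be in the same order as the columns the scaler was fitted on, and the recommendation should not be the input song itself. The sketch below shows both on a four-song toy catalog. All names (`catalog`, `feature_cols`, `scaler_demo`, `model_demo`) and values are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

feature_cols = ["danceability", "energy", "tempo"]
catalog = pd.DataFrame(
    {
        "song_id": ["a", "b", "c", "d"],
        "danceability": [0.20, 0.25, 0.80, 0.85],
        "energy": [0.30, 0.35, 0.90, 0.95],
        "tempo": [80.0, 85.0, 140.0, 150.0],
    }
)
scaler_demo = StandardScaler().fit(catalog[feature_cols])
model_demo = KMeans(n_clusters=2, n_init=10, random_state=1234).fit(
    scaler_demo.transform(catalog[feature_cols])
)
catalog["cluster"] = model_demo.labels_

# A new song whose columns arrive in a different order, plus an extra column.
new_song = pd.DataFrame([{"tempo": 145.0, "song_id": "c", "energy": 0.92, "danceability": 0.82}])
aligned = new_song[feature_cols]  # same columns, same order as at fit time
cluster_id = model_demo.predict(scaler_demo.transform(aligned))[0]

# Same cluster, but never the song the user asked about.
candidates = catalog[
    (catalog["cluster"] == cluster_id) & (catalog["song_id"] != new_song.loc[0, "song_id"])
]
print(candidates["song_id"].tolist())
```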


## Recommend a song | Top 200

In [341]:
allsongs_file = 'allsongs.csv'
allsongs_df = pd.read_csv(allsongs_file)
song_list = allsongs_df['Title'].str.lower()
song_to_check = input('Enter the Song to check: ').lower()

if song_to_check in song_list.values:
    filtered_list = allsongs_df[allsongs_df['Title'].str.lower() != song_to_check]
    random_song = filtered_list.sample(1)
    recommended_title = random_song['Title'].values[0]
    recommended_artist = random_song['Artist'].values[0]
    print(f"Entered song '{song_to_check}'. Here's a recommendation:")
    print(f"Title: {recommended_title}, Artist: {recommended_artist}")
else:
    print("It's not in the hot songs list.")

Enter the Song to check: Imagine
It's not in the hot songs list.
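The exact-match check above misses titles with stray spaces or small typos. One possible refinement, sketched below, normalizes the input and falls back to the standard library's `difflib.get_close_matches`. The helper name `find_hot_title`, the cutoff of 0.8, and the sample titles are assumptions for illustration.

```python
import difflib


def find_hot_title(user_input, hot_titles, cutoff=0.8):
    """Return the best-matching hot title, or None if nothing is close enough."""
    lookup = {t.strip().lower(): t for t in hot_titles}
    matches = difflib.get_close_matches(
        user_input.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


# Made-up hot list for the example.
hot_titles = ["Flowers", "Kill Bill", "Anti-Hero"]
print(find_hot_title("  flowrs ", hot_titles))
print(find_hot_title("Bohemian Rhapsody", hot_titles))
```

A lower cutoff accepts looser matches at the risk of matching the wrong song, so it is worth showing the matched title back to the user.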


## Recommend a song | Playlists

In [344]:
results = sp.search(q=song_to_check, limit=50)

pd.set_option("display.max_columns", 0)
tracks = pd.json_normalize(results["tracks"]["items"])

# Filter on the song the user just entered (song_to_check is already lowercase),
# not on the song_title variable from the earlier cells
filtered_tracks = tracks[tracks['name'].str.lower().str.contains(song_to_check, regex=False)].copy()
filtered_tracks['artists_dfs'] = filtered_tracks.apply(expand_list_dict, axis=1)

artist_df = pd.DataFrame(columns=['external_urls.spotify', 'href', 'id', 'name', 'type', 'uri', 'song_id'])

for mini_df in filtered_tracks['artists_dfs']:
    artist_df = pd.concat([artist_df, mini_df], axis=0)
    
optional_singer = artist_df['name'].value_counts().head(10)
optional_singer

John Lennon         2
Doja Cat            2
Ariana Grande       1
Ben Platt           1
yvngxchris          1
Jack Johnson        1
Kirk Franklin       1
Kevin Gates         1
Clémence Alheoux    1
Ragged Insomnia     1
Name: name, dtype: int64

## Recommend a song | Artist choice

In [349]:
input_artist = input('Enter the artist to check: ')
filtered_df = artist_df[artist_df['name'] == input_artist]
choosen_song_id = filtered_df['song_id'].sample()
new_song = sp.audio_features(choosen_song_id)
new_song_df = pd.DataFrame(new_song)
new_song_ft = new_song_df.select_dtypes(include='number')
new_song_scaled = scaler.transform(new_song_ft)  # reuse the scaler fitted on the training songs
predicted_cluster = kmeans.predict(new_song_scaled)
filtered_audio = Audio[Audio['cluster'] == predicted_cluster[0]]
recommended_song = filtered_audio.sample()

for index, row in recommended_song.iterrows():
    print('Title:', row['track.name'], ' ////  Artist:', row['name'],' ////  Cluster:', row['cluster'])

Enter the artist to check: John Lennon
Title: Divine  ////  Artist: Antony and the Johnsons  ////  Cluster: 7
