# Instructions

To move forward with the project, you need to create a collection of songs with their audio features - as large as possible!

These are the songs that we will cluster. And, later, when the user inputs a song, we will find the cluster to which the song belongs and recommend a song from the same cluster. The more songs you have, the more accurate and diverse recommendations you'll be able to give. Although... you might want to make sure the collected songs are "curated" in a certain way. Try to find playlists of songs that are diverse, but also that meet certain standards.

The process of sending hundreds or thousands of requests can take some time - it's normal if you have to wait a few minutes (or, if you're ambitious, even hours) to get all the data you need.

One idea for collecting as many songs as possible is to start with all the songs of a big, diverse playlist, then go to every artist present in the playlist and grab every song of every album of that artist. The number of songs you collect per playlist will grow very quickly!
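
The playlist-to-artists-to-albums idea can be sketched as follows. This is a minimal sketch, not a definitive implementation: `collect_track_ids` is a hypothetical helper name, `sp` is assumed to be an authenticated `spotipy.Spotify` client, and the playlist id is whatever seed playlist you choose. It relies only on the spotipy calls `playlist_items`, `artist_albums`, `album_tracks` and `next`.

```python
def collect_track_ids(sp, playlist_id):
    """Return the ids of a playlist's tracks plus every album track of its artists."""
    track_ids, artist_ids = set(), set()

    # 1) tracks of the seed playlist (results are paginated)
    page = sp.playlist_items(playlist_id)
    while page:
        for item in page["items"]:
            track = item.get("track")
            if not track or not track.get("id"):
                continue  # removed or local tracks have no usable id
            track_ids.add(track["id"])
            artist_ids.update(a["id"] for a in track["artists"] if a.get("id"))
        page = sp.next(page) if page.get("next") else None

    # 2) every album of every artist, then every track of each album
    for artist_id in artist_ids:
        albums = sp.artist_albums(artist_id, album_type="album")
        while albums:
            for album in albums["items"]:
                tracks = sp.album_tracks(album["id"])
                while tracks:
                    track_ids.update(t["id"] for t in tracks["items"] if t.get("id"))
                    tracks = sp.next(tracks) if tracks.get("next") else None
            albums = sp.next(albums) if albums.get("next") else None

    return track_ids
```

Using a set means a song that appears both in the playlist and on an album is stored only once. Expect this to take a while on a large playlist, since it sends one request per album.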


#### Importing all libraries & Spotify authentication

In [None]:
!pip install spotipy

In [None]:
#import lib
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import getpass # to hide the password
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans

In [None]:
client_id = getpass.getpass('id?')
client_secret = getpass.getpass('secret?')

In [None]:
# building and saving an API connection.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=client_id,
                                                           client_secret=client_secret))

### Creating a collection of songs from different artists

In [None]:
# for each artist in this list we search Spotify (up to 50 tracks per query) and later extract the track ids and audio features
artists = ["8Ball & MJG", "Three 6 Mafia","Outkast","T.I.", "Youngbloodz", "Devin The Dude", "Baby"
          ,"Clipse", "Goodie Mob", "Young Dro", "Master P", "OJ Da Juiceman", "Gucci Mane", "Lil Wayne",
           "Project Pat", "Bun B", "UGK", "Snoop Dogg", "Juicy J", "Rick Ross", "Drake", "French Montana",
           "Soulja Boy", "Jeezy", "Mike Jones", "Paul Wall", "Slim Thug", "Waka Flocka Flame", "Ludacris", "Styles P", "Ying Yang Twins", "Future", "Redman", "Travis Scott",
           "Meek Mill","Young Thug",
           "Gunna",
           "Moneybagg Yo",
           "Baby Keem",
           "Kendrick Lamar",
           "Da Baby",
           "J. Cole",
           "Kanye West",
           "21 Savage",
           "Tyga",
           "2 Chainz",
           "Big Sean",
           "Kodak Black",
           "Megan Thee Stallion",
           "Jadakiss",
           "JAY-Z"] 
# loop the artist search
# run sp.search for every artist in the list
my_20_artists = [sp.search(q= artist , limit = 50) for artist in artists]
# create the dictionary
#def artist_to_dict(artists):
   # return{artist:sp.search(q= artist , limit = 50) for artist in artists}
#len(artist_to_dict(artists))
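
A free-text query such as `sp.search(q=artist)` may also return tracks by other artists whose metadata happens to match the name. The sketch below shows one way to keep only tracks credited to the artist you asked for. `tracks_by_artist` is a hypothetical helper, and `fake_result` is hand-made data that only mimics the shape of a search result.

```python
def tracks_by_artist(search_result, artist_name):
    """Keep only the tracks whose credited artists include artist_name."""
    wanted = artist_name.casefold()
    return [
        track
        for track in search_result["tracks"]["items"]
        if any(a["name"].casefold() == wanted for a in track["artists"])
    ]


# Tiny hand-made result in the same shape as the output of sp.search(...)
fake_result = {"tracks": {"items": [
    {"id": "1", "name": "Track A", "artists": [{"name": "Outkast"}]},
    {"id": "2", "name": "Track B", "artists": [{"name": "Someone Else"}]},
]}}

kept = tracks_by_artist(fake_result, "outkast")
print([t["name"] for t in kept])
```

The comparison ignores case, so `"JAY-Z"` and `"Jay-Z"` are treated as the same artist.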


### Taking a look at the data that we have

In [None]:
# inspect the search result for one artist (index 2 is Outkast)
# Each result is a dict in and of itself.
my_20_artists[2]['tracks'].keys() # keys of the 'tracks' object, e.g. 'items' and 'total'

In [None]:
my_20_artists[2]['tracks']['total'] # total number of tracks matching the search

In [None]:
# Outkast, track item number 1
# We can get the id, uri, and so on from here.
my_20_artists[2]['tracks']['items'][0] 

In [None]:
# song title
my_20_artists[2]['tracks']['items'][0]['name']

In [None]:
my_20_artists[2]['tracks']['items'][0]['id']

In [None]:
my_20_artists[2]['tracks']['items'][0].keys()

In [None]:
# 'items' holds the list of tracks, so we must iterate over it to obtain all tracks for each artist.
my_20_artists[2]['tracks']['items']

In [None]:
#total items 

tot_items = [my_20_artists[i]['tracks']['items'] for i in  range(len(my_20_artists))]

In [None]:
# double indexing for artist and songs
# first artist, first song - Id
tot_items[0][0]['id']

### Obtaining all of the ids for each artist and track

In [None]:
# obtain all of the ids (iterate over the items directly: a search may return fewer than 50 tracks)
tot_ids = [track['id'] for items in tot_items for track in items]
len(tot_ids)
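
Because several artists in the list collaborate with each other, the same track id can show up in more than one search result. A minimal sketch for removing duplicates while keeping the original order (`unique_in_order` is a hypothetical helper name):

```python
def unique_in_order(ids):
    """Drop repeated ids, keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(ids))


print(unique_in_order(["a", "b", "a", "c", "b"]))  # ['a', 'b', 'c']
```

Duplicated songs would otherwise be counted several times when clustering.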

In [None]:
tot_ids

In [None]:
# Obtain the audio features
sp.audio_features(tot_ids[:10])

In [None]:
# chunking tot_ids is required because one audio_features request only accepts a limited
# number of track ids (100 per the Spotify Web API; chunks of 50 stay safely below that).
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


audio_feats = []
for chunk in chunks(tot_ids, 50):
    audio_feats.append(sp.audio_features(chunk))
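
The chunking and the request loop can also be folded into one helper that returns a flat list and skips missing entries. This is a sketch under assumptions: `fetch_audio_features` is a hypothetical name, `sp` is assumed to be an authenticated spotipy client, and the `pause` between requests is an arbitrary courtesy delay, not a documented requirement.

```python
import time


def fetch_audio_features(sp, track_ids, chunk_size=50, pause=0.0):
    """Request audio features in batches and return one flat list of dicts."""
    features = []
    for start in range(0, len(track_ids), chunk_size):
        batch = sp.audio_features(track_ids[start:start + chunk_size])
        # the API returns None for ids it has no features for
        features.extend(f for f in (batch or []) if f is not None)
        time.sleep(pause)  # optional pause between requests
    return features
```

A call such as `fetch_audio_features(sp, tot_ids, pause=0.2)` would then yield a list that can be passed straight to `pd.DataFrame`.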

In [None]:
type(audio_feats)

In [None]:
audio_feats

In [None]:
# Some of the returned values were None, so we keep only the entries where 'song is not None'.

chunk_list = [song for artist in audio_feats for song in artist if song is not None]

In [None]:
df = pd.DataFrame(chunk_list)
df

In [None]:
max(df['danceability'])

In [None]:
df.to_csv('/Users/edudarrelljockers/Desktop/Ironhack/wrapper_lab.csv', index=False)

In [None]:
df.keys()

In [None]:
df_num = df._get_numeric_data()
df_num.head()

In [None]:
df_num

In [None]:
col_names = df_num.columns

In [None]:
from sklearn.preprocessing import StandardScaler
#X_prep = StandardScaler().fit_transform(df_num)

In [None]:
scaler = StandardScaler()
scaler.fit(df_num)
X_prep = scaler.transform(df_num)
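
To see what the scaler does to a single column, here is a minimal sketch of standardization by hand. The `tempo` values are made up and `standardize` is a hypothetical helper. Each value becomes `(x - mean) / std`, using the population standard deviation as `StandardScaler` does.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    # a constant column has std 0; fall back to 1.0 to avoid dividing by zero
    std = pstdev(values) or 1.0
    return [(v - mean) / std for v in values]


tempo = [90.0, 120.0, 150.0]  # made-up values
scaled = standardize(tempo)
print(scaled)
```

After scaling, features measured on very different ranges (for example `loudness` and `duration_ms`) contribute comparably to the distances K-Means computes.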

In [None]:
X_prep_df = pd.DataFrame(X_prep, columns=col_names)

In [None]:
kmeans = KMeans(n_clusters=8, random_state=1234)
kmeans.fit(X_prep_df)

In [None]:
kmeans.cluster_centers_

In [None]:
kmeans.inertia_

In [None]:
clusters = kmeans.predict(X_prep_df)  # same DataFrame the model was fitted on
clusters

In [None]:
pd.Series(clusters).value_counts().sort_index()

In [None]:
X_df = pd.DataFrame(X_prep)
X_df['cluster'] = clusters
X_df.head()

In [None]:
X_df['cluster'].plot(kind='hist')

### Inertia

One way to find good centroids is to run the algorithm several times with different random initializations and keep the best result. The `n_init` hyperparameter controls the number of random initializations: by default it is 10 in older scikit-learn releases (newer releases default to `'auto'`), which means that the entire algorithm runs 10 times when we call `fit()`, and scikit-learn keeps the best solution. "Best" is measured by the inertia, defined as the sum of squared distances between each instance and its nearest centroid. The `KMeans` class runs the procedure `n_init` times and keeps the model with the lowest inertia.
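
As an illustration of the definition, the sketch below computes the inertia by hand on made-up 2-D points, once for well-placed centroids and once for poorly placed ones. The `inertia` function is a hypothetical helper written for this example, not scikit-learn code.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centroids
        )
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]  # made-up data
good = [(0.0, 1.0), (10.0, 1.0)]   # one centroid per visible group
bad = [(5.0, 0.0), (5.0, 2.0)]     # centroids stuck between the groups

print(inertia(points, good))  # 4.0
print(inertia(points, bad))   # 100.0
```

Keeping the run with the lowest inertia is how a bad initialization like the second one gets discarded.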

In [None]:
# I want to iterate over a range of n_clusters and for every value, I want to return the inertia
def get_kmeans_inertia_varying_cluster_n(n_clusters):
    
    # setup the model
    kmeans = KMeans(n_clusters=n_clusters,
                    random_state=1234,
                    n_init=10,
                    #algorithm='elkan',
                   )
    # train the model
    kmeans.fit(X_prep_df)
    
    # return the resulting inertia
    return kmeans.inertia_

# Plot for a range of cluster numbers
import matplotlib.pyplot as plt

cluster_range = range(1,20)

plt.plot(cluster_range,
         [get_kmeans_inertia_varying_cluster_n(c_number) for c_number in cluster_range],
         marker="o",
         ms=10,
        )
plt.xlabel('Cluster Number')
plt.ylabel('inertia')

In [None]:
# I want to iterate over a range of max_iter and for every value, I want to return the inertia
def get_kmeans_inertia_varying_max_iter(max_iter):
    kmeans = KMeans(n_clusters=10,
                    random_state=1234,
                    n_init=10,
                    algorithm='elkan',
                    max_iter=max_iter,
                   )
    kmeans.fit(X_prep_df)

    return kmeans.inertia_

max_iter_list = [1, 5, 10, 20, 30, 40, 50, 100]

plt.plot(max_iter_list,
         [get_kmeans_inertia_varying_max_iter(x) for x in max_iter_list],
        )
plt.xlabel('Max iter')
plt.ylabel('inertia')

### Silhouette coefficient
The silhouette coefficient of an instance is equal to (b - a) / max(a, b), where a is the mean distance to the other instances in the same cluster and b is the mean nearest-cluster distance, that is, the mean distance to the instances of the closest other cluster. The silhouette coefficient can range from -1 to +1. A coefficient close to +1 indicates that the instance is well inside its own cluster and far from other clusters, a coefficient close to 0 indicates that it is close to a cluster border, and a coefficient close to -1 indicates that the instance may have been assigned to the wrong cluster.
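
The formula can be checked by hand on a toy example. The sketch below is an illustration with made-up points, not scikit-learn's implementation: it computes the coefficient for one instance, first with sensible labels and then with that instance deliberately put in the wrong cluster. It assumes at least two clusters and Python 3.8+ for `math.dist`.

```python
import math


def silhouette_coefficient(i, points, labels):
    """Silhouette coefficient (b - a) / max(a, b) of instance i."""
    def mean_dist_to(cluster):
        dists = [math.dist(points[i], p)
                 for j, p in enumerate(points)
                 if labels[j] == cluster and j != i]
        return sum(dists) / len(dists) if dists else None

    a = mean_dist_to(labels[i])           # mean intra-cluster distance
    if a is None:
        return 0.0                        # convention for a single-member cluster
    b = min(mean_dist_to(c) for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)


points = [(0, 0), (0, 1), (5, 0), (5, 1)]  # made-up data: two tight pairs
s_good = silhouette_coefficient(0, points, [0, 0, 1, 1])
s_bad = silhouette_coefficient(0, points, [1, 0, 1, 1])  # point 0 mislabeled
print(round(s_good, 3), round(s_bad, 3))
```

The first value is close to +1 and the second close to -1, matching the interpretation above. `silhouette_score` reports the mean of this coefficient over all instances.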

In [None]:
from sklearn.metrics import silhouette_score

K = range(2, 20)

silhouettes = []

for k in K:
    kmeans = KMeans(n_clusters=k,
                   random_state=1234)
    kmeans.fit(X_prep)
    silhouettes.append(silhouette_score(X_prep, kmeans.predict(X_prep)))

In [None]:
import matplotlib.pyplot as plt


plt.figure(figsize=(16,8))
plt.plot(K, silhouettes, 'bo-')
plt.xlabel('k (number of clusters)')
plt.ylabel('silhouette score')

In [None]:
kmeans = KMeans(n_clusters=10,
             random_state=1234)

kmeans.fit(X_prep)

clusters = kmeans.predict(X_prep)
clusters

In [None]:
clusters.shape

In [None]:
features_clustered = pd.DataFrame(X_prep, columns=col_names)

In [None]:
features_clustered['cluster_id'] = clusters

In [None]:
features_clustered.head()

In [None]:
features_clustered['cluster_id'].value_counts()

In [None]:
kmeans.cluster_centers_

In [None]:
cluster_centers_df = pd.DataFrame(kmeans.cluster_centers_, columns=col_names)

In [None]:
cluster_centers_df

In [None]:
cluster_centers_df['cluster_id'] = range(0,10)

In [None]:
cluster_centers_df

In [None]:
# this contains my cluster centers
cluster_center_sub_df = cluster_centers_df[['danceability', 'loudness', 'cluster_id']]

features_clustered_sub_df= features_clustered[['danceability', 'loudness', 'cluster_id']]

In [None]:
cluster_center_sub_df

In [None]:
features_clustered_sub_df

In [None]:
import seaborn as sns

sns.scatterplot(data=features_clustered_sub_df,
               x='danceability',
               y='loudness',
               hue='cluster_id')

# plot centroids
sns.scatterplot(data=cluster_center_sub_df,
               x="danceability",
               y="loudness",
               hue='cluster_id',
                legend=False,
                # marker=u'8',
                marker='+',
                s=500,
               )

In [None]:
song = np.array([[-0.35992001,  0.42882697, -0.14838292,  0.17284094,  0.16457822,
        -0.03195258, -0.1271788 , -0.20547284,  3.02816589, -0.26732747,
         0.02835173,  0.09672491,  0.07415629]])

In [None]:
kmeans.predict(song)
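
For a fitted K-Means model, predicting a cluster amounts to finding the centroid closest to the (scaled) feature vector. A minimal sketch with made-up 2-D centroids, where `predict_cluster` is a hypothetical helper:

```python
import math


def predict_cluster(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(point, centroids[k]))


centroids = [(0.0, 0.0), (3.0, 3.0), (-2.0, 4.0)]  # made-up centroids
print(predict_cluster((2.5, 2.0), centroids))  # 1
```

This is why a new song must be scaled with the same fitted `scaler` before calling `kmeans.predict`: the centroids live in the scaled feature space.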

### Importing Billboard scraping csv

In [None]:
pwd

In [None]:
hot_songs = pd.read_csv('/Users/edudarrelljockers/Desktop/Ironhack/hot100_songs.csv')

In [None]:
hot_songs

In [None]:
## Part 4
# from the input to the cluster recommendation
df_chunk = pd.DataFrame(chunk_list)
df_chunk.head()

In [None]:
df_chunk['clusters_id'] = clusters
df_chunk

In [None]:
import random

# from a song title to a recommendation from the same cluster
def cluster_song(title):
    # look up the best-matching track for the title
    results = sp.search(q=title, limit=1)
    title_id = results['tracks']['items'][0]['id']
    # fetch its audio features and keep the numeric columns the scaler was fitted on
    title_features = sp.audio_features(title_id)
    df = pd.DataFrame(title_features, index=[0])
    df_title_scaled = scaler.transform(df[col_names])
    # predict() returns an array with a single label; take the scalar
    cluster_num = kmeans.predict(df_title_scaled)[0]
    # pick a random song from the same cluster
    sub_df = df_chunk.loc[df_chunk['clusters_id'] == cluster_num]
    title_name_id = random.choice(sub_df['id'].tolist())
    name = sp.track(title_name_id)['name']
    return f"I recommend you: {name}"

In [None]:
# return a random recommendation

title = input('name a title: ').title()

if title in list(hot_songs['title']):
    print('I recommend you to listen to: ' + random.choice(hot_songs['title'].tolist()))
else:
    print(cluster_song(title))
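
The branching logic can also be written as a small function that is easy to test without the API. This is a sketch under assumptions: `recommend` is a hypothetical helper, `hot_titles` stands for a plain list such as `hot_songs['title'].tolist()`, and `cluster_recommender` is any callable such as `cluster_song`. It compares titles case-insensitively and avoids recommending the song the user just typed.

```python
import random


def recommend(title, hot_titles, cluster_recommender, rng=random):
    """Recommend another hot song if title is hot, else defer to the cluster model."""
    hot_lookup = {t.casefold(): t for t in hot_titles}
    key = title.strip().casefold()
    if key in hot_lookup:
        # prefer a different hot song when one is available
        others = [t for k, t in hot_lookup.items() if k != key]
        return rng.choice(others or list(hot_lookup.values()))
    return cluster_recommender(title)


# usage with made-up titles and a stand-in for cluster_song
print(recommend("song a", ["Song A", "Song B"], lambda t: f"cluster pick for {t}"))
```

Case-insensitive matching avoids relying on `.title()`, which would miss chart titles that are not written in title case.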

In [None]:
df_chunk.to_csv('/Users/edudarrelljockers/Desktop/Ironhack/df_chunk.csv', index=False)