# **Deliverable 2: Music recommendation system.**

**Course: 09481 - Artificial Intelligence.**

**Universidad ICESI.**

**Instructor: Breyner Posso.**

**Team members: Daniela Olarte Borja, Carlos Jimmy Pantoja, Juan Esteban Caicedo, and Carlos Andres Tafurt.**

In this deliverable we apply the k-means clustering algorithm and build the music recommendation system.

# Part 1.

# 0. Import libraries

In [None]:
import re                                                       #Regular expressions.
import itertools                                                #Iteration tools.
import numpy as np                                              #Math and array handling.
import pandas as pd                                             #Dataset handling.
import warnings as wrg                                          #Warning control.

from copy import deepcopy                                       #Deep copies of DataFrames.

from collections import defaultdict                             #Dictionary construction.

from sklearn.cluster import KMeans                              #Data clustering.
from sklearn.preprocessing import MinMaxScaler                  #Data normalization.
from sklearn.feature_extraction.text import TfidfVectorizer     #TF-IDF vectorizer: weights terms by their frequency of occurrence.

from scipy.spatial.distance import cdist                        #Computes distances between each pair of rows from two collections of vectors.

In [2]:
#Suppress warnings.
wrg.filterwarnings("ignore")

# 1. Load data

In [3]:
#Song-level dataset with the full information of each song.
data = pd.read_csv('data.csv')
#Artist-level dataset with the full information of each artist.
data_w_genres = pd.read_csv('data_w_genres.csv')

# 2. Exploratory data analysis.

The artists of each song are stored as a single string, so we convert them into a list.

In [8]:
data.artists[0]

"['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']"

In [9]:
data.artists[0][0]

'['

In [10]:
#Lambda that converts the str into a list: strips brackets and quotes, then splits on commas.
data.artists = data.artists.apply(lambda x: [i.strip() for i in re.sub(r'[\]\["\']', '', x).split(',')])

In [11]:
data.artists[0][0]

'Sergei Rachmaninoff'
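
Splitting on commas works for entries like the one above, but it would also split an artist name that itself contains a comma. As a hedged alternative sketch (not what this notebook does), the string can be parsed as a Python literal with `ast.literal_eval` from the standard library. The helper name `parse_list_string` and the second sample string are illustrative assumptions.

```python
import ast

def parse_list_string(text):
    """Parse a string such as "['A', 'B']" into a Python list of strings."""
    parsed = ast.literal_eval(text)                 # Evaluates Python literals only, never arbitrary code.
    return [str(item).strip() for item in parsed]

raw = "['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']"
artists = parse_list_string(raw)
print(artists[0])

# A name containing a comma stays in one piece (made-up sample entry).
tricky = "['Earth, Wind & Fire', 'The Emotions']"
print(parse_list_string(tricky))
```

Unlike the regular-expression approach, this keeps quotes and apostrophes that are part of a name, so the two methods can produce different strings for the same entry.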

Likewise, we remove special characters from the artist names in data_w_genres.

In [12]:
data_w_genres.artists[0]

'"Cats" 1981 Original London Cast'

In [13]:
#Lambda that removes the special characters (brackets and quotes).
data_w_genres.artists = data_w_genres.artists.apply(lambda x: re.sub(r'[\]\["\']', '', x).strip())

In [14]:
data_w_genres.artists[0]

'Cats 1981 Original London Cast'

We convert the genres of each artist into a list.

In [15]:
data_w_genres.genres[3717]

'["canadian children\'s music", \'cartoon\', "children\'s music", \'musique pour enfant quebecois\']'

In [16]:
data_w_genres.genres[3717][0]

'['

In [17]:
#Lambda that converts the str into a list and replaces spaces inside each genre with underscores.
data_w_genres.genres = data_w_genres.genres.apply(lambda x: [re.sub(' ','_',i.strip()) for i in re.sub(r'[\]\["\']', '', x).split(',')])

In [18]:
data_w_genres.genres[3717][0]

'canadian_childrens_music'

Next we derive the genres of each song from the genres of its artists.

In [19]:
#Generate one row per artist, keeping the song id.
artists_exploded = data[['artists','id']].explode('artists')

In [20]:
#Attach the artist-level information by joining on the artist name.
artists_exploded_enriched = artists_exploded.merge(data_w_genres, how = 'left', left_on = 'artists', right_on = 'artists')
#Drop rows whose genres are null.
artists_exploded_enriched_nonnull = artists_exploded_enriched[~artists_exploded_enriched.genres.isnull()]

In [21]:
#Group by song id to collect, for each song, the genre lists of all its artists.
artists_genres_consolidated = artists_exploded_enriched_nonnull.groupby('id')['genres'].apply(list).reset_index()

In [22]:
#Flatten each list of lists and remove duplicate genres.
artists_genres_consolidated['genres'] = artists_genres_consolidated['genres'].apply(lambda x: list(set(list(itertools.chain.from_iterable(x)))))

In [23]:
#Add the genres column, holding the list of genres of each song, to the main dataset.
data = data.merge(artists_genres_consolidated[['id','genres']], on = 'id',how = 'left')

In [24]:
#Replace missing genre lists (NaN after the left merge) with empty lists.
data.genres = data.genres.apply(lambda x: x if isinstance(x, list) else [])

In [25]:
#A sample of the dataset with the genres of each song.
data.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,genres
0,0.0594,1921,0.982,"[Sergei Rachmaninoff, James Levine, Berliner P...",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,1921,0.0366,80.954,"[classical, ukrainian_classical, orchestral_pe..."
1,0.963,1921,0.732,[Dennis Day],0.819,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.936,[]
2,0.0394,1921,0.961,[KHP Kridhamardawa Karaton Ngayogyakarta Hadin...,0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339,[]
3,0.165,1921,0.967,[Frank Parker],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.8e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109,[]
4,0.253,1921,0.957,[Phil Regan],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,2e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665,[]
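
The explode, merge, group, and flatten steps above can be traced on a tiny example. This is a minimal sketch: the frames `songs` and `artist_genres` are made-up stand-ins for `data` and `data_w_genres`, and the genres are sorted here only to make the printed result deterministic.

```python
import itertools
import pandas as pd

# Toy stand-ins for `data` and `data_w_genres` (values are made up for illustration).
songs = pd.DataFrame({
    'id': ['s1', 's2'],
    'artists': [['A', 'B'], ['C']],
})
artist_genres = pd.DataFrame({
    'artists': ['A', 'B'],
    'genres': [['rock', 'blues_rock'], ['rock']],
})

exploded = songs[['artists', 'id']].explode('artists')          # One row per (song, artist).
enriched = exploded.merge(artist_genres, how='left', on='artists')
enriched = enriched[enriched['genres'].notnull()]               # Artist 'C' has no genres and is dropped.

per_song = enriched.groupby('id')['genres'].apply(list).reset_index()
per_song['genres'] = per_song['genres'].apply(
    lambda lists: sorted(set(itertools.chain.from_iterable(lists))))   # Flatten and deduplicate.

songs = songs.merge(per_song, on='id', how='left')
songs['genres'] = songs['genres'].apply(lambda g: g if isinstance(g, list) else [])
print(songs)
```

Song `s1` ends up with the union of its two artists' genres, while `s2`, whose only artist has no genre information, gets an empty list, which mirrors the `[]` entries in the sample above.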


# 3. Building the model

## Estimating the most suitable value of k.

We vectorize the genres to see which ones are most common and use that to estimate our k.

In [26]:
tfidf = TfidfVectorizer()                                                                   #Initialize the vectorizer.
tfidf_matrix =  tfidf.fit_transform(data['genres'].apply(lambda x: " ".join(x).strip()))    #Fit the vectorizer on the genres of each song.
genre_df = pd.DataFrame(tfidf_matrix.toarray())                                             #Build a dataframe from the result.
genre_df.columns = tfidf.get_feature_names_out()                                            #Name the columns (get_feature_names() was removed in scikit-learn 1.2).
genre_df.reset_index(drop = True, inplace=True)                                             #Reset the index of the dataframe.
genre_df

Unnamed: 0,21st_century_classical,432hz,_hip_hop,a_cappella,abstract,abstract_beats,abstract_hip_hop,accordeon,accordion,acid_house,...,yugoslav_rock,zambian_pop,zeuhl,zhongguo_feng,zimdancehall,zolo,zouglou,zouk,zurich_indie,zydeco
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170648,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
170649,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
170650,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
170651,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
#By trial and error we found that the 201 highest-weighted genres account for about 70% of the total TF-IDF weight across all songs.
(genre_df.sum().sort_values(ascending=False)[0:201].sum() / genre_df.sum().sum()) * 100

70.07401936464258
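
Instead of trial and error, the cutoff can be computed directly from the cumulative share of weight. The following is a sketch under the assumption that the input is a dataframe of non-negative weights like `genre_df`; the helper name `n_columns_for_coverage` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def n_columns_for_coverage(weights_df, target=0.70):
    """Return the smallest number of top columns whose summed weight reaches `target`."""
    totals = weights_df.sum().sort_values(ascending=False)   # Total weight per column, largest first.
    coverage = totals.cumsum() / totals.sum()                # Cumulative share of the total weight.
    position = int(np.searchsorted(coverage.values, target)) + 1
    return min(position, len(totals))                        # Guard against floating-point round-off at target=1.0.

# Toy weights: column totals are 1.0, 0.6, 0.2, 0.2 (made-up values).
toy = pd.DataFrame({'rock': [0.5, 0.5, 0.0],
                    'pop':  [0.3, 0.0, 0.3],
                    'jazz': [0.0, 0.2, 0.0],
                    'folk': [0.1, 0.0, 0.1]})
print(n_columns_for_coverage(toy, target=0.70))
```

On the toy data the two largest columns cover 80% of the weight, so a 70% target needs two columns.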

In [28]:
#Keep only the 201 highest-weighted genres.
genre_df = genre_df[genre_df.sum().sort_values(ascending=False)[0:201].index]

In [29]:
#Add the 201 vectorized genre columns to the main dataset.
data = pd.concat([data, genre_df], axis=1)

#### The data will be partitioned into a training set and a test set. The training set is used to build the k-means model and perform the clustering, and the test set is used to check which group or category the model assigns to these new observations.

In [30]:
#In this first example we use all the variables:
nombre_de_las_variables_de_entrada = ['valence','year','acousticness','danceability','duration_ms','energy','explicit',
                                      'instrumentalness','key','liveness','loudness','mode','popularity','speechiness','tempo']

nombre_de_las_variables_de_entrada.extend(list(genre_df.columns))

#Now we build a dataframe with the input variables we are going to use.
df_x = deepcopy(data[nombre_de_las_variables_de_entrada])

In [31]:
rango_de_salida_de_las_variables_escaladas = (0,1)  #Tuple with the format: (desired minimum, desired maximum).
scaler = MinMaxScaler(feature_range = rango_de_salida_de_las_variables_escaladas)  #Instantiate the object that scales the data.

df_x_norm = deepcopy(df_x)  #Initialize this object with a deep copy of the input columns of interest.
df_x_norm[nombre_de_las_variables_de_entrada] = scaler.fit_transform(df_x) #Fit the scaler and transform the data.

print('-----------------------------------------------------------------------')
print('Training-set input data BEFORE scaling')
print('-----------------------------------------------------------------------')
print(df_x[nombre_de_las_variables_de_entrada[:4]].describe(),'\n')

print('-----------------------------------------------------------------------')
print('Training-set input data AFTER scaling')
print('-----------------------------------------------------------------------')
print(df_x_norm[nombre_de_las_variables_de_entrada[:4]].describe())

-----------------------------------------------------------------------
Training-set input data BEFORE scaling
-----------------------------------------------------------------------
             valence           year   acousticness   danceability
count  170653.000000  170653.000000  170653.000000  170653.000000
mean        0.528587    1976.787241       0.502115       0.537396
std         0.263171      25.917853       0.376032       0.176138
min         0.000000    1921.000000       0.000000       0.000000
25%         0.317000    1956.000000       0.102000       0.415000
50%         0.540000    1977.000000       0.516000       0.548000
75%         0.747000    1999.000000       0.893000       0.668000
max         1.000000    2020.000000       0.996000       0.988000 

-----------------------------------------------------------------------
Training-set input data AFTER scaling
------------------------------------------------------
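
Min-max scaling maps each column independently with x' = (x - min) / (max - min), so that every feature lies in [0, 1] and large-valued columns such as `year` or `duration_ms` do not dominate the Euclidean distances used by k-means. The sketch below checks this formula against `MinMaxScaler` on three illustrative values; the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values within the range of the `year` column.
years = np.array([[1921.0], [1977.0], [2020.0]])

scaler_demo = MinMaxScaler(feature_range=(0, 1))
scaled = scaler_demo.fit_transform(years)

# The same result computed by hand: (x - min) / (max - min), column by column.
manual = (years - years.min(axis=0)) / (years.max(axis=0) - years.min(axis=0))

print(scaled.ravel())
print(np.allclose(scaled, manual))
```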

### Training the final model.

In [32]:
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# MODEL HYPERPARAMETERS
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------

init              ='k-means++'  #Initialization method. Another valid option is 'random'.
n_init            = 10          #Number of random initializations. scikit-learn keeps the run with the lowest inertia
                                #(i.e., the sum over all points of the squared distance from each point to the centroid of its group).
                                #https://scikit-learn.org/stable/modules/clustering.html
max_iter          = 300         #MAXIMUM number of iterations for a single run.
random_seed       = 42          #Random seed. Makes the results reproducible across runs.

#-------------------------------------------------------------------------------
#K-means
#-------------------------------------------------------------------------------
k = 201  #Number of groups, chosen after the preceding analysis.

#Now we instantiate the object that performs k-means clustering.
#For all constructor options, see: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
#Note: the k-means algorithm available in scikit-learn works only with the Euclidean distance.
#To apply k-means with other distance metrics, see the PyClustering library: https://github.com/annoviko/pyclustering
kmeans = KMeans(n_clusters   = k,            #Number of groups.
                init         = init,         #Initialization method. Another option is 'random'.
                n_init       = n_init,       #Number of random initializations.
                max_iter     = max_iter,     #MAXIMUM number of iterations for a single run.
                random_state = random_seed)  #Random seed.

#Fit the model (i.e., find the centroids).
kmeans.fit(df_x_norm)

#Add the cluster column to the data for later grouping.
data['cluster'] = kmeans.predict(df_x_norm)

#Inspect the centroids of each group.
centroides = kmeans.cluster_centers_
print('Centroids:', centroides)

Centroids: [[ 4.69145934e-01  8.69130869e-01  4.59880746e-02 ...  8.67361738e-19
   1.30104261e-18  1.73472348e-18]
 [ 3.08523376e-01  2.52913200e-01  9.33709708e-01 ... -2.60208521e-18
   1.08420217e-17  8.04044027e-03]
 [ 6.73816122e-01  5.89104119e-01  1.60510120e-01 ...  7.37257477e-18
   8.67361738e-19 -2.60208521e-18]
 ...
 [ 2.12872524e-01  2.73275890e-01  9.45096681e-01 ...  8.67361738e-19
   1.30104261e-18  8.67361738e-19]
 [ 8.47696145e-01  5.29718958e-01  6.12990966e-01 ...  1.30104261e-18
   4.33680869e-19  2.16840434e-18]
 [ 4.33989831e-01  8.24529519e-01  4.82120237e-02 ...  5.63785130e-18
  -4.33680869e-19 -4.33680869e-19]]
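
The cell above fixes k at 201, the number of genres retained earlier. A common complementary check, which this notebook does not perform, is to compare the inertia for several values of k and look for the point where it stops dropping sharply (the "elbow"). The sketch below runs on synthetic data from `make_blobs`, an assumption made purely for illustration, and reuses the same hyperparameter values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (an assumption made for illustration).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=42)

inertias = {}
for k_candidate in range(1, 8):
    model = KMeans(n_clusters=k_candidate, init='k-means++', n_init=10,
                   max_iter=300, random_state=42)
    model.fit(X)
    inertias[k_candidate] = model.inertia_    # Sum of squared distances to the nearest centroid.

for k_candidate, inertia in inertias.items():
    print(k_candidate, round(inertia, 1))
```

On the full dataset, with about 170,000 rows and more than 200 columns, repeating the fit for many values of k would be slow, so a subsample may be needed in practice.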


# 4. Music recommendation system

Function that retrieves the data of a song.

In [33]:
def get_song_data(song, data):
    try:
        #Look up the song by name and year, and take the first match.
        song_data = data[(data['name'] == song['name']) & (data['year'] == song['year'])].iloc[0]   
        return song_data
    
    except IndexError:
        #No match found: the song is not in the dataset.
        return None

Function that computes the mean vector of a list of songs.

In [34]:
def get_mean_vector(song_list, data):
    
    song_vectors = []                                                               #List that collects the feature vector of each song.
    
    for song in song_list:
        song_data = get_song_data(song, data)                                       #Get the data (row) of each song.
        
        if song_data is None:                                                       #If a song is not found in the main dataset,
            print('Warning: {} does not exist in database'.format(song['name']))    #print a warning without stopping the process;
            continue                                                                #the song is simply skipped.
        
        song_vector = song_data[nombre_de_las_variables_de_entrada].values          #Keep only the input columns of interest
        song_vectors.append(song_vector)                                            #and store the resulting vector.
    
    song_matrix = np.array(list(song_vectors))                                      #Stack the vectors into a matrix (one row per song).
    
    return np.mean(song_matrix, axis=0)                                             #Return the column-wise mean vector of the matrix.

Function that flattens a list of dictionaries into a single dictionary of lists.

In [35]:
def flatten_dict_list(dict_list):
    flattened_dict = defaultdict()                  #Initialize the dictionary.
    
    for key in dict_list[0].keys():                 
        flattened_dict[key] = []                    #Add the keys, taken from the first dictionary.
    
    for dictionary in dict_list:                    
        for key, value in dictionary.items():
            flattened_dict[key].append(value)       #Append each value to the list of its key.
            
    return flattened_dict                           

Function that recommends songs.

In [36]:
def recommend_songs(song_list, data, n_songs=10):   #Recommends up to n_songs songs (10 by default) from a list of input songs and the main dataset.
    
    metadata_cols = ['name', 'artists', 'genres', 'year']       #Information returned for each recommended song.
    song_dict = flatten_dict_list(song_list)                    #Build a dictionary from the list of input songs.
    
    song_center = get_mean_vector(song_list, data)              #Mean vector of the input songs.

    scaled_song_center = scaler.transform(song_center.reshape(1, -1))   #Normalize the mean vector with the fitted scaler.

    cluster = kmeans.predict(scaled_song_center)    #Predict the cluster that the mean vector belongs to.

    #Normalize the data of the songs that belong to the same cluster.
    scaled_data = scaler.transform(data[data.cluster == cluster[0]][nombre_de_las_variables_de_entrada])    
    
    distances = cdist(scaled_song_center, scaled_data, 'cosine')    #Cosine distance from the mean vector to each song in the cluster.
    
    index = list(np.argsort(distances)[:, :n_songs][0])     #Positions of the n_songs songs with the smallest cosine distance.

    rec_songs = data[data.cluster == cluster[0]].reset_index().iloc[index]   #Retrieve the full information of those songs by position.
    rec_songs = rec_songs[~rec_songs['name'].isin(song_dict['name'])]   #Omit recommendations that match an input song, so fewer than n_songs may be returned.
    return rec_songs[metadata_cols].to_dict(orient='records')   #Return a list of dictionaries with the selected information of each recommended song.
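
Because `recommend_songs` removes the input songs after it has already taken the `n_songs` nearest neighbours, it can return fewer than `n_songs` results. One possible variation, shown here only as a sketch, is to exclude unwanted candidates before truncating the ranking. The helper name `closest_indices` and the toy vectors are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def closest_indices(center, candidates, n, exclude=()):
    """Return the indices of the `n` candidates closest to `center` by cosine distance,
    skipping any index listed in `exclude`."""
    distances = cdist(center.reshape(1, -1), candidates, 'cosine')[0]
    order = np.argsort(distances)                       # Closest first.
    excluded = set(exclude)
    kept = [int(i) for i in order if int(i) not in excluded]
    return kept[:n]                                     # Truncate only after filtering.

center = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0],    # Same direction as the center (distance 0).
                       [1.0, 0.1],    # Almost the same direction.
                       [0.0, 1.0],    # Orthogonal to the center (distance 1).
                       [1.0, 1.0]])   # 45 degrees away.

print(closest_indices(center, candidates, n=2, exclude=[0]))
```

Cosine distance depends only on the direction of the vectors, not on their length, so two songs with proportional feature vectors are treated as identical.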

# 5. Recommendation tests

In [37]:
#Test with reggaeton songs
songs = recommend_songs([{'name': 'China', 'year': 2019},
                         {'name': 'Mi Gente', 'year': 2017},
                         {'name': 'Gasolina', 'year': 2004}],  data)
 
pd.DataFrame(songs)

Unnamed: 0,name,artists,genres,year
0,Dembow 2020,"[Yandel, Rauw Alejandro]","[latin_hip_hop, tropical, latin, trap_latino, ...",2020
1,Sigues Con El,"[Dímelo Flow, Arcangel, Sech]","[latin_hip_hop, tropical, reggaeton_flow, lati...",2019
2,Guaya Guaya,[Don Omar],"[latin_hip_hop, tropical, latin, trap_latino, ...",2015
3,Pasarela,[Ñejo & Dalmata],"[latin_hip_hop, tropical, reggaeton_flow, lati...",2007
4,No Es Justo,"[J Balvin, Zion & Lennox]","[latin_hip_hop, tropical, latin, reggaeton_col...",2018
5,La Tóxica (feat. Jay Wheeler & Tempo) - Remix,"[Farruko, Sech, Myke Towers, Jay Wheeler, Tempo]","[latin_hip_hop, tropical, reggaeton_flow, lati...",2020
6,"Agua (with J Balvin) - Music From ""Sponge On T...","[Tainy, J Balvin]","[latin_hip_hop, tropical, pop_reggaeton, latin...",2020
7,Que Tire Pa Lante,[Daddy Yankee],"[latin_hip_hop, tropical, latin, trap_latino, ...",2019


In [38]:
#Test with pop songs
songs = recommend_songs([{'name': '7 rings', 'year': 2019},
                         {'name': 'New Rules', 'year': 2017},
                         {'name': 'Lover', 'year': 2019}],  data)
                        
pd.DataFrame(songs)

Unnamed: 0,name,artists,genres,year
0,Trampoline (with ZAYN),"[SHAED, ZAYN]","[dance_pop, uk_pop, pop_dance, electropop, pos...",2019
1,Love Me Like You Do,[Ellie Goulding],"[indietronica, dance_pop, metropopolis, uk_pop...",2018
2,Trampoline (with ZAYN),"[SHAED, ZAYN]","[dance_pop, uk_pop, pop_dance, electropop, pos...",2019
3,Now Or Never,[Halsey],"[dance_pop, pop_dance, electropop, etherpop, i...",2017
4,Easy,[Camila Cabello],"[post-teen_pop, dance_pop, pop, pop_dance]",2019
5,Dear Society,[Madison Beer],"[post-teen_pop, dance_pop, pop, pop_dance]",2019
6,Look What You Made Me Do,[Taylor Swift],"[post-teen_pop, dance_pop, pop, pop_dance]",2017
7,Call You Mine,"[The Chainsmokers, Bebe Rexha]","[dance_pop, edm, pop_dance, tropical_house, el...",2019
8,Control,[Halsey],"[dance_pop, pop_dance, electropop, etherpop, i...",2015
9,Love Me or Leave Me,[Little Mix],"[talent_show, dance_pop, uk_pop, pop_dance, gi...",2015


In [39]:
#Test with rock songs
songs = recommend_songs([{'name': 'Jailhouse Rock', 'year': 1958},
                         {'name': 'We Will Rock You', 'year': 1977},
                         {'name': 'Hey Jude', 'year': 2006}],  data)
                        
pd.DataFrame(songs)

Unnamed: 0,name,artists,genres,year
0,Hold On,[Santana],"[rock, mexican_classic_rock, blues_rock, class...",1983
1,Riders on the Storm,[The Doors],"[rock, classic_rock, acid_rock, album_rock, ps...",1971
2,I'll Be Waiting,[Santana],"[rock, mexican_classic_rock, blues_rock, class...",1977
3,Help! - Remastered 2015,[The Beatles],"[rock, beatlesque, british_invasion, merseybea...",2000
4,Riders on the Storm - New Stereo Mix,[The Doors],"[rock, classic_rock, acid_rock, album_rock, ps...",2008
5,The Soft Parade,[The Doors],"[rock, classic_rock, acid_rock, album_rock, ps...",1969
6,Money (That's What I Want) - Remastered 2009,[The Beatles],"[rock, beatlesque, british_invasion, merseybea...",1963
7,Catfish Blues,[Jimi Hendrix],"[rock, classic_rock, acid_rock, psychedelic_ro...",1994
8,Maggie M'Gill,[The Doors],"[rock, classic_rock, acid_rock, album_rock, ps...",1970
9,Have You Ever Been (To Electric Ladyland),[Jimi Hendrix],"[rock, classic_rock, acid_rock, psychedelic_ro...",1968


In [40]:
#Test with electronic music songs
songs = recommend_songs([{'name': 'Tsunami', 'year': 2013},
                         {'name': 'Astronomia', 'year': 2016},
                         {'name': 'Tremor - Sensation 2014 Anthem', 'year': 2014}],  data)
                        
pd.DataFrame(songs)

Unnamed: 0,name,artists,genres,year
0,Come With Me - Radio Mix,[Nora En Pure],"[deep_tropical_house, progressive_house, pop_d...",2013
1,Pizza Guy,[Touch Sensitive],"[australian_indie, alternative_dance, nu_disco...",2014
2,Future Club,[Perturbator],"[darksynth, cyberpunk, synthwave, dark_clubbing]",2014
3,Spaceman,[Hardwell],"[dance_pop, edm, progressive_house, pop_dance,...",2012
4,Red Stars,[The Birthday Massacre],"[canadian_rock, industrial_rock, slayer, gothi...",2007
5,We Are Mirage,"[Eric Prydz, Empire of the Sun]","[dance-punk, edm, progressive_house, new_rave,...",2014
6,Five Hours,[Deorro],"[melbourne_bounce_international, dance_pop, ed...",2014
7,Dive,[Tycho],"[downtempo, chillwave, electronica, intelligen...",2011
