**CS4710 Final Project Music Recommendation System**

Group Members: Carter Bassler (cnb8mq), Elise Galletta (eag9sb), Gabe Levy (gel4xet), Kate Newton (krn4jc)

Sites consulted:
- https://www.geeksforgeeks.org/music-recommendation-system-using-machine-learning/
- https://youtube.com/watch?v=tooddaC14q4 

Dataset: 
- https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks

Import dependencies


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import re
import sys
import itertools
import json

 
# from sklearn.metrics.pairwise import cosine_similarity
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.manifold import TSNE
 
import warnings
warnings.filterwarnings('ignore')


Read in tracks.csv<br>
Dataset contains 600,000+ tracks from 1921-2020<br>
Includes basic information (id, title, artist, length) as well as features of each track (danceability, energy, liveness, acousticness...)

In [None]:
tracks = pd.read_csv('tracks.csv')
tracks.head()

"Artists" data set contains genre info<br>
We need to pull this genre info and combine it with the tracks data to assign a genre to the tracks

In [None]:
artists = pd.read_csv('artists.csv')
artists.head()

Regex from source to format genre data as a list of strings

In [None]:
artists['genres_upd'] = artists['genres'].apply(lambda x: [re.sub(' ','_',i) for i in re.findall(r"'([^']*)'", x)])
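
To see what this regex does, here is a minimal sketch on a made-up genres string. The sample value is an assumption for illustration, not a row taken from the dataset, and it assumes the 'genres' column stores each list as a Python-style string.

```python
import re

# Hypothetical raw value, in the string form the 'genres' column is assumed to use
raw = "['dance pop', 'pop', 'post-teen pop']"

# Capture the text between each pair of single quotes,
# then replace the spaces inside each genre with underscores
genres = [re.sub(' ', '_', g) for g in re.findall(r"'([^']*)'", raw)]
print(genres)

# A string holding an empty list produces an empty Python list
empty = [re.sub(' ', '_', g) for g in re.findall(r"'([^']*)'", "[]")]
print(empty)
```

Replacing spaces with underscores keeps a multi-word genre together as one token.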


Now that the genres are stored as lists, we can retrieve a single genre by indexing into the list. <br><br>

Note that we have not yet removed artists with empty genre lists. For this example we use index 9434, an arbitrarily chosen row whose genre list is not empty, so that the lookup returns a genre for testing purposes.

In [None]:
artists['genres_upd'].values[9434][0]

The tracks data is at a song level, but it does not include genre information, which is important to the model.

The artists data is at an artist level and includes the important genre information that we want to combine with the track data. 

<br>
We create a column called 'artists_upd' that formats the artists like the genres_upd column, as a list of strings. This takes two regexes: one for ordinary names, which are wrapped in single quotes, and one for names containing an apostrophe, which are wrapped in double quotes. The two intermediate columns (v1, v2) are then combined into the final artists_upd, using v2 only when v1 is empty.


In [None]:
tracks['artists_upd_v1'] = tracks['artists'].apply(lambda x: re.findall(r"'([^']*)'", x))
tracks['artists_upd_v2'] = tracks['artists'].apply(lambda x: re.findall('\"(.*?)\"',x))
tracks['artists_upd'] = np.where(tracks['artists_upd_v1'].apply(lambda x: not x), tracks['artists_upd_v2'], tracks['artists_upd_v1'] )
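
The fallback logic can be checked on a toy frame. This is a sketch under the assumption that names containing an apostrophe are stored with double quotes; both rows are invented for illustration.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical rows: the second name contains an apostrophe, so it is
# assumed to be wrapped in double quotes in the raw string
demo = pd.DataFrame({'artists': ["['Queen', 'David Bowie']", "[\"Guns N' Roses\"]"]})

# v1 captures single-quoted names, v2 captures double-quoted names
demo['v1'] = demo['artists'].apply(lambda x: re.findall(r"'([^']*)'", x))
demo['v2'] = demo['artists'].apply(lambda x: re.findall(r'"(.*?)"', x))

# Use v2 only where v1 came back empty
demo['artists_upd'] = np.where(demo['v1'].apply(lambda x: not x), demo['v2'], demo['v1'])
print(demo['artists_upd'].tolist())
```

One limitation of this approach: a row that mixes both quoting styles would keep only the single-quoted names, because v1 is not empty in that case.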


Now that the artists are in the format we want to work in, we need to remove any duplicate songs. We do this by creating a new column called 'artists_song', which concatenates the first listed artist and the song title. Example: The Dave Brubeck QuartetLover, Taylor SwiftLover, etc. We then sort by this key and by release date in descending order, so the most recent release of each song comes first.

In [None]:
tracks['artists_song'] = tracks.apply(lambda row: row['artists_upd'][0]+str(row['name']),axis = 1)
tracks.sort_values(['artists_song','release_date'], ascending = False, inplace = True)
tracks[tracks['name']=='Lover']

Now let's drop the duplicates, which keeps only the first row for each artists_song value. To confirm that this worked, we again show the tracks named Lover.

In [None]:
tracks.drop_duplicates('artists_song',inplace = True)

print(tracks[tracks['name']=='Lover'])
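
As a small sketch of why sorting first matters, the toy frame below repeats one artist/title key with two release dates. The rows and dates are made up for illustration, and the string comparison assumes all dates share the same YYYY-MM-DD format.

```python
import pandas as pd

# Hypothetical rows: the same artist/title key appears with two release dates
demo = pd.DataFrame({
    'artists_song': ['Taylor SwiftLover', 'Taylor SwiftLover', 'QueenUnder Pressure'],
    'release_date': ['2019-08-23', '2019-11-13', '1981-10-26'],
})

# Descending sort puts the latest release of each key first
demo = demo.sort_values(['artists_song', 'release_date'], ascending=False)

# drop_duplicates keeps the first row per key by default (keep='first')
deduped = demo.drop_duplicates('artists_song')
print(deduped)
```

Because only the first artist is part of the key, two versions of a song that list different lead artists are not treated as duplicates.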

For some songs, multiple artists are stored as a list in a single row, so we need to explode this column.

Exploding artists_upd distributes any song with multiple artists across separate rows, with one artist name per row in the artists_upd column.

In [None]:
# Print explicitly: a notebook cell only displays its last expression
print(tracks[tracks['name']=='Under Pressure'])
tracks = tracks.explode('artists_upd')

In [None]:
tracks[tracks['name']=='Under Pressure']
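
A minimal sketch of what explode does, using a made-up two-row frame rather than the real tracks data:

```python
import pandas as pd

# Hypothetical frame: one song with two artists, one song with a single artist
demo = pd.DataFrame({
    'name': ['Under Pressure', 'Lover'],
    'artists_upd': [['Queen', 'David Bowie'], ['Taylor Swift']],
})

# Each list element becomes its own row; the other columns are repeated
exploded = demo.explode('artists_upd')
print(exploded)
```

Note that explode repeats the original index label for every row it creates, so the index is no longer unique afterwards. Calling reset_index(drop=True) is one way to restore a unique index if later steps need it.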

Now do a left join with tracks as the left table and artists as the right table. The artist's name is 'artists_upd' in the left table and 'name' in the right table, so we merge on those columns. Rows with no matching artist, and therefore no genre list, are then filtered out.

In [None]:
artists_exploded = tracks.merge(artists, how = 'left', left_on = 'artists_upd',right_on = 'name')

artists_exploded = artists_exploded[~artists_exploded.genres_upd.isnull()]
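
The sketch below reproduces the join on two tiny made-up frames (the ids, names, and genre labels are hypothetical). It shows where the id_x and id_y columns come from and how unmatched rows are removed.

```python
import pandas as pd

# Hypothetical stand-ins for the tracks and artists tables
tracks_demo = pd.DataFrame({'id': ['t1', 't2'],
                            'artists_upd': ['Queen', 'Unknown Artist']})
artists_demo = pd.DataFrame({'id': ['a1'],
                             'name': ['Queen'],
                             'genres_upd': [['glam_rock', 'rock']]})

# Both tables have an 'id' column, so pandas suffixes them as id_x and id_y
merged = tracks_demo.merge(artists_demo, how='left',
                           left_on='artists_upd', right_on='name')
print(merged.columns.tolist())

# A left join keeps unmatched tracks with NaN genres; filter those out
matched = merged[~merged['genres_upd'].isnull()]
print(matched)
```

Here id_x is the track id from the left table and id_y is the artist id from the right table.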

In [None]:
print(artists_exploded[artists_exploded['id_x'] =='5oidljiMjeJTWUGZ4TfFea'])

In [None]:
artists_genres = artists_exploded.groupby('id_x')['genres_upd'].apply(list).reset_index()
artists_genres['genre_list'] = artists_genres['genres_upd'].apply(lambda x: list(set(list(itertools.chain.from_iterable(x)))))
artists_genres.head()
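
To illustrate the consolidation step, here is a sketch on invented rows in which one track (t1) has two artists with overlapping genres. It uses sorted(set(...)) instead of list(set(...)) only so that the printed order is deterministic; that substitution is ours, not part of the cell above.

```python
import itertools
import pandas as pd

# Hypothetical exploded rows: track t1 appears once per artist
demo = pd.DataFrame({
    'id_x': ['t1', 't1', 't2'],
    'genres_upd': [['glam_rock', 'rock'], ['art_rock', 'rock'], ['pop']],
})

# Collect every artist's genre list for a track into one list of lists
grouped = demo.groupby('id_x')['genres_upd'].apply(list).reset_index()

# Flatten the nested lists and remove repeated genres
grouped['genre_list'] = grouped['genres_upd'].apply(
    lambda x: sorted(set(itertools.chain.from_iterable(x)))
)
print(grouped[['id_x', 'genre_list']])
```

The result has one row per track id, with the union of the genres of all its artists.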

Now do a left join with tracks as the left table and artists_genres as the right table. The song ID is 'id' in the left table and 'id_x' in the right table, so we merge on those columns. It is id_x because both tables in the previous left join had an 'id' column, so pandas suffixed them as id_x and id_y.

In [None]:
tracks = tracks.merge(artists_genres[['id_x','genre_list']], how = 'left', left_on = 'id', right_on='id_x')

Now we have the following:
- all data in the dataframe 'tracks'
- each artist name formatted as a single string in 'artists_upd'
- songs with multiple artists distributed across one row per artist
- the genres for each song listed under 'genre_list'


In [None]:
tracks[tracks['artists_upd']=='Taylor Swift']

In [None]:
tracks.info()

After all of this, we can check how many rows have no genre list associated with them. It's a small number (~10,000 entries), so we will just drop these rows. We drop them rather than keep them because genre is an important predictor in our model.

In [None]:
tracks.isnull().sum()

In [None]:
tracks.dropna(subset=['genre_list'], inplace=True)

In [None]:
tracks.isnull().sum()
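
The same check-then-drop pattern on a made-up three-row frame, as a sketch of what the cells above do to the 'genre_list' column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second track has no genre list
demo = pd.DataFrame({
    'id': ['t1', 't2', 't3'],
    'genre_list': [['rock'], np.nan, ['pop']],
})

# Count the missing genre lists before dropping
missing = demo['genre_list'].isnull().sum()
print(missing)

# Drop only the rows where 'genre_list' is missing
cleaned = demo.dropna(subset=['genre_list'])
print(cleaned)
```

Passing subset restricts the check to 'genre_list', so rows with missing values in other columns are left in place.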