# Artist Recommendation System #

Our objective is to build an artist recommendation system based on the data available in [Kaggle's Spotify Dataset](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data_by_artist.csv).

The user will input an artist, and the program will return the 10 most similar artists. 

**Methodology**:

The recommendation system compares artists on three attributes:
1. The genre of the artist
2. The artist's popularity
3. The artist's debut year

**Steps**:

1. We first filter the data by genre. Artists who share no genre with the input artist are eliminated. Genre similarity is measured with the Jaccard score (the size of the intersection of two genre sets divided by the size of their union).
2. Next, for each remaining artist, we calculate the absolute difference from the input artist in genre Jaccard score, popularity, and debut year.
3. After scaling these differences with a min-max scaler, we combine them into a weighted sum that measures each artist's proximity to the input artist. The lower the final score, the closer the artist.
4. The 10 closest artists are returned as the recommendations.
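
Steps 2 to 4 can be written compactly as a formula. The symbols are notation introduced here for illustration and do not appear in the notebook: $a$ is the input artist, $b$ a candidate, $J(a, b)$ the Jaccard score of their genre sets, $y$ the debut year, $p$ the popularity, $w_g, w_y, w_p$ the weights, and $\tilde{d}$ a difference after min-max scaling over all candidates $c$.

```latex
\begin{align*}
d_g(b) &= \lvert J(a, a) - J(a, b) \rvert = 1 - J(a, b), \qquad
d_y(b) = \lvert y_a - y_b \rvert, \qquad
d_p(b) = \lvert p_a - p_b \rvert \\
\tilde{d}_k(b) &= \frac{d_k(b) - \min_c d_k(c)}{\max_c d_k(c) - \min_c d_k(c)},
\qquad k \in \{g, y, p\} \\
\mathrm{score}(b) &= w_g \, \tilde{d}_g(b) + w_y \, \tilde{d}_y(b) + w_p \, \tilde{d}_p(b)
\end{align*}
```

The input artist itself has a score of 0, and the recommendations are the candidates with the smallest scores.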


In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
from ast import literal_eval

In [3]:
df_songs = pd.read_csv("data/data.csv", sep=',')

In [4]:
df_artists = pd.read_csv("data/data_w_genres.csv", sep=',')

In [5]:
# Contains information of each song
df_songs.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.995,['Carl Woitschach'],0.708,158648,0.195,0,6KbQ3uYMLKb5jDxLF7wYDD,0.563,10,0.151,-12.428,1,Singende Bataillone 1. Teil,0,1928,0.0506,118.469,0.779,1928
1,0.994,"['Robert Schumann', 'Vladimir Horowitz']",0.379,282133,0.0135,0,6KuQTIu1KoTTkLXKrwlLPV,0.901,8,0.0763,-28.454,1,"Fantasiestücke, Op. 111: Più tosto lento",0,1928,0.0462,83.972,0.0767,1928
2,0.604,['Seweryn Goszczyński'],0.749,104300,0.22,0,6L63VW0PibdM1HDSBoqnoM,0.0,5,0.119,-19.924,0,Chapter 1.18 - Zamek kaniowski,0,1928,0.929,107.177,0.88,1928
3,0.995,['Francisco Canaro'],0.781,180760,0.13,0,6M94FkXd15sOAOQYRnWPN8,0.887,1,0.111,-14.734,0,Bebamos Juntos - Instrumental (Remasterizado),0,1928-09-25,0.0926,108.003,0.72,1928
4,0.99,"['Frédéric Chopin', 'Vladimir Horowitz']",0.21,687733,0.204,0,6N6tiFZ9vLTSOIxkj8qKrd,0.908,11,0.098,-16.829,1,"Polonaise-Fantaisie in A-Flat Major, Op. 61",1,1928,0.0424,62.149,0.0693,1928


In [6]:
# Contains aggregated information for each artist, including genres
df_artists.head()

Unnamed: 0,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count,genres
0,"""Cats"" 1981 Original London Cast",0.575083,0.44275,247260.0,0.386336,0.022717,0.287708,-14.205417,0.180675,115.9835,0.334433,38.0,5,1,12,['show tunes']
1,"""Cats"" 1983 Broadway Cast",0.862538,0.441731,287280.0,0.406808,0.081158,0.315215,-10.69,0.176212,103.044154,0.268865,33.076923,5,1,26,[]
2,"""Fiddler On The Roof” Motion Picture Chorus",0.856571,0.348286,328920.0,0.286571,0.024593,0.325786,-15.230714,0.118514,77.375857,0.354857,34.285714,0,1,7,[]
3,"""Fiddler On The Roof” Motion Picture Orchestra",0.884926,0.425074,262890.962963,0.24577,0.073587,0.275481,-15.63937,0.1232,88.66763,0.37203,34.444444,0,1,27,[]
4,"""Joseph And The Amazing Technicolor Dreamcoat""...",0.605444,0.437333,232428.111111,0.429333,0.037534,0.216111,-11.447222,0.086,120.329667,0.458667,42.555556,11,1,9,[]


In [7]:
# Convert genres in the df_artists dataframe from a string to a list
pattern = re.compile(r"\'(.*?)\'", re.IGNORECASE)
df_artists['genres'] = df_artists['genres'].map(lambda x: re.findall(pattern, x))

# Replace empty genre lists with NaN
df_artists['genres'] = df_artists['genres'].map(lambda x: np.nan if len(x) == 0 else x)
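
The regex above only captures values wrapped in single quotes. When a genre contains an apostrophe, Python's list representation wraps it in double quotes, so the regex can miss it. One possible alternative, sketched below on a made-up string, is to parse the value with `ast.literal_eval`. The helper name `parse_genres` and the sample input are illustrative and not part of the notebook.

```python
import re
from ast import literal_eval

import numpy as np


def parse_genres(raw):
    """Parse a stringified list of genres; return NaN when the list is empty."""
    genres = literal_eval(raw)
    return genres if genres else np.nan


# Made-up sample: the second genre contains an apostrophe, so it is
# wrapped in double quotes in the stringified list.
sample = "['rock', \"children's music\"]"

regex_result = re.findall(r"'(.*?)'", sample)   # regex approach
parsed_result = parse_genres(sample)            # literal_eval approach

print(regex_result)
print(parsed_result)
```

On this sample the regex returns only the first genre, while `literal_eval` recovers both.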

In [8]:
# Drop all artists that do not have a genre
df_artists = df_artists.dropna(subset=['genres']).reset_index(drop=True)

In [9]:
# Artists should be lowercase for both the dataframes
df_songs['artists'] = df_songs['artists'].map(lambda x: x.lower())

df_artists['artists'] = df_artists['artists'].map(lambda x: x.lower())

In [10]:
# Convert artist values to a list
df_songs['artists'] = df_songs['artists'].map(literal_eval)

In [12]:
# Compute the debut year
# Instead of using a third-party API, we define the debut year as the year of the artist's earliest song in
# our database

df_expanded = df_songs.explode('artists')
debut_years = df_expanded.groupby('artists')['year'].min()
debut_years.index = debut_years.index.map(lambda x: x.lower())
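
To see what `explode` and `groupby(...).min()` do here, the sketch below runs the same two steps on a tiny made-up table. The rows and artist names are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df_songs: each row is a song with a list of artists.
toy_songs = pd.DataFrame({
    "artists": [["artist a", "artist b"], ["artist a"], ["artist b"]],
    "year": [1965, 1960, 1970],
})

# One row per (song, artist) pair, then the earliest year per artist.
toy_expanded = toy_songs.explode("artists")
toy_debut_years = toy_expanded.groupby("artists")["year"].min()

print(toy_debut_years.to_dict())
```

The shared song from 1965 counts for both artists, so "artist a" gets 1960 (its earlier solo song) and "artist b" gets 1965.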

In [18]:
# Add in the debut years
df_artists = pd.merge(df_artists, debut_years, left_on='artists', right_index=True, how='left')

# Rename year as debut year
df_artists = df_artists.rename(columns={'year':'debut_year'})

In [19]:
artist_name = input('Enter artist name \n').lower()

Enter artist name 
the beatles


In [20]:
# Check if artist is in the database

if artist_name not in df_artists['artists'].unique():
    raise ValueError('Artist not found')


In [21]:
def __jaccard_score(list_1, list_2):
    '''
    Returns the Jaccard score (intersection / union) of two iterables
    '''
    set_1 = set(list_1)
    set_2 = set(list_2)
    return len(set_1.intersection(set_2))/len(set_1.union(set_2))
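
As a quick check of the Jaccard score, the sketch below restates the function under the name `jaccard_score` (without the leading underscores, so the block is self-contained) and applies it to two invented genre lists.

```python
def jaccard_score(list_1, list_2):
    """Return |intersection| / |union| of two iterables."""
    set_1, set_2 = set(list_1), set(list_2)
    return len(set_1 & set_2) / len(set_1 | set_2)


# Invented genre lists: 2 genres in common, 5 distinct genres overall.
genres_a = ["rock", "classic rock", "british invasion"]
genres_b = ["rock", "classic rock", "album rock", "hard rock"]

score = jaccard_score(genres_a, genres_b)
print(score)  # 2 / 5
```

A score of 1 means the two genre sets are identical, and a score of 0 means they share nothing.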

In [22]:
'''
Only keep those artists that have at least one genre that matches with the input artist 
'''
# The genres of the input artist
input_artist_genres = df_artists[df_artists['artists']==artist_name]['genres'].values[0]

# Calculate each artist's Jaccard score against the genres of the input artist
similarity_scores = df_artists['genres'].map(lambda x: __jaccard_score(x, input_artist_genres))

# Filter out the 0 values (i.e. no common genres)
similarity_scores = similarity_scores[similarity_scores>0]
similar_artists = pd.DataFrame(similarity_scores.values, columns=['genre_similarity'], 
                                       index=similarity_scores.index)

In [23]:
# Add the debut year, popularity, and artist name to the similarity dataframe
similar_artists = similar_artists.join(df_artists[['artists', 'debut_year', 'popularity']]) 

# Set the artist as the index
similar_artists = similar_artists.set_index('artists', drop=True)

In [24]:
similar_artists

artists,genre_similarity,debut_year,popularity
10cc,0.125000,1973,41.190476
1910 fruitgum company,0.100000,1962,24.750000
311,0.066667,1993,40.327869
38 special,0.153846,1977,32.288889
? & the mysterians,0.090909,1964,24.181818
...,...,...,...
a-ha,0.181818,1985,46.888889
cleopatrick,0.100000,2017,61.000000
grandson,0.125000,2016,65.142857
half•alive,0.111111,2017,63.750000


In [31]:
weights = [0.2, 0.4, 0.4]  # Order is [genre, debut_year, popularity]
n = 10  # Number of recommendations to return

In [26]:
# Calculate similarity based on the sum of weighted scores for genre, debut year and popularity
# The closer the value is to 0, the more similar the artist is to the selected artist
my_artist_specs = similar_artists.loc[artist_name]
differences = abs(similar_artists - my_artist_specs)

# Using a min-max scaler, to ensure that all columns are of equal scale
differences = differences.apply(lambda x: (x-x.min())/(x.max()-x.min()))
differences = differences*weights

# The score is the sum of all columns. The lowest score is the most similar
# The input artist will have a score of 0
differences['score'] = differences.sum(axis=1)
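
The scaling and weighting steps are easier to follow on a small example. The sketch below uses invented numbers and a hypothetical helper `min_max`, which also guards against a constant column (where `x.max() - x.min()` is zero), a case the cell above does not handle.

```python
import pandas as pd


def min_max(column):
    """Scale a column to [0, 1]; return zeros if all values are equal."""
    span = column.max() - column.min()
    if span == 0:
        return column * 0.0
    return (column - column.min()) / span


# Invented absolute differences from an input artist (first row).
toy_differences = pd.DataFrame(
    {
        "genre_similarity": [0.0, 0.5, 1.0],
        "debut_year": [0, 10, 40],
        "popularity": [0.0, 20.0, 5.0],
    },
    index=["input artist", "artist x", "artist y"],
)

weights = [0.2, 0.4, 0.4]  # order: genre, debut_year, popularity

scaled = toy_differences.apply(min_max)   # column-wise scaling
weighted = scaled * weights               # one weight per column
weighted["score"] = weighted.sum(axis=1)

print(weighted["score"])
```

Here "artist x" ends up closer than "artist y": its large popularity gap is outweighed by its smaller gaps in debut year and genre.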


In [29]:
final_recommendations = differences.drop(artist_name).sort_values(by='score', ascending=True).head(n).index
final_recommendations = final_recommendations.map(lambda x: x.title()).values
final_recommendations

array(['The Beat', 'John Lennon', 'The Doors', 'Pink Floyd',
       'Jim Morrison', 'Janis Joplin', 'The Zombies',
       'Big Brother & The Holding Company', 'Jimi Hendrix',
       'The Rolling Stones'], dtype=object)
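
For reuse, the cells above could be collected into a single function. The sketch below is one possible arrangement, not the notebook's own code: the function name `recommend_artists`, its parameters, and the toy DataFrame are assumptions made for illustration. It expects the columns `artists`, `genres`, `debut_year`, and `popularity` as prepared above, assumes artist names are unique, and removes the input artist from the results by name.

```python
import pandas as pd


def recommend_artists(df, artist_name, weights=(0.2, 0.4, 0.4), n=10):
    """Return up to n artist names closest to artist_name, lowest score first."""
    artist_name = artist_name.lower()
    if artist_name not in df["artists"].values:
        raise ValueError("Artist not found")

    input_genres = set(df.loc[df["artists"] == artist_name, "genres"].iloc[0])

    # Jaccard score between each artist's genres and the input artist's genres.
    genre_similarity = df["genres"].map(
        lambda g: len(set(g) & input_genres) / len(set(g) | input_genres)
    )

    # Keep artists sharing at least one genre; the column order matches the weights.
    candidates = df.assign(genre_similarity=genre_similarity)
    candidates = candidates[candidates["genre_similarity"] > 0]
    candidates = candidates.set_index("artists")[
        ["genre_similarity", "debut_year", "popularity"]
    ]

    # Absolute differences from the input artist, min-max scaled per column.
    differences = (candidates - candidates.loc[artist_name]).abs()
    span = (differences.max() - differences.min()).replace(0, 1)
    scaled = (differences - differences.min()) / span

    score = (scaled * list(weights)).sum(axis=1)
    return score.drop(artist_name).sort_values().head(n).index.tolist()


# Invented data: "d" shares no genre with "a" and is filtered out.
toy_artists = pd.DataFrame({
    "artists": ["a", "b", "c", "d"],
    "genres": [["rock", "pop"], ["rock"], ["rock", "pop"], ["jazz"]],
    "debut_year": [1960, 1962, 1990, 1961],
    "popularity": [60.0, 55.0, 58.0, 60.0],
})

toy_result = recommend_artists(toy_artists, "a", n=2)
print(toy_result)
```

With the default weights, changing the balance between genre, debut year, and popularity only requires passing a different `weights` tuple.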