# Spotify Artist Recommendation Project

In this project, we explore the [Spotify DataSet 1921-2020](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks) from Kaggle and perform the following tasks:

* Time-series analysis of song and artist trends over the years
* Create *super* genres based on common audio characteristics
* Create a recommendation system for artists



Before we begin, we import the libraries used throughout the project and read the data into pandas DataFrames.

In [123]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os
import ast

%matplotlib inline

In [124]:
data = pd.read_csv('data/data.csv')

data.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.95,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,1920,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,1920-01-01,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,2,1920-10-01,0.0768,122.076,0.299,1920


In [125]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174389 entries, 0 to 174388
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   acousticness      174389 non-null  float64
 1   artists           174389 non-null  object 
 2   danceability      174389 non-null  float64
 3   duration_ms       174389 non-null  int64  
 4   energy            174389 non-null  float64
 5   explicit          174389 non-null  int64  
 6   id                174389 non-null  object 
 7   instrumentalness  174389 non-null  float64
 8   key               174389 non-null  int64  
 9   liveness          174389 non-null  float64
 10  loudness          174389 non-null  float64
 11  mode              174389 non-null  int64  
 12  name              174389 non-null  object 
 13  popularity        174389 non-null  int64  
 14  release_date      174389 non-null  object 
 15  speechiness       174389 non-null  float64
 16  tempo             17

In [126]:
genre = pd.read_csv('data/data_by_genres.csv')

genre.head()

Unnamed: 0,genres,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode
0,21st century classical,0.7546,0.2841,352593.2,0.15958,0.484374,0.16858,-22.1534,0.06206,91.351,0.14338,6.6,4,1
1,432hz,0.485515,0.312,1047430.0,0.391678,0.47725,0.26594,-18.131267,0.071717,118.900933,0.236483,41.2,11,1
2,8-bit,0.0289,0.673,133454.0,0.95,0.63,0.069,-7.899,0.292,192.816,0.997,0.0,5,1
3,[],0.535793,0.546937,249531.2,0.48543,0.278442,0.22097,-11.624754,0.101511,116.06898,0.486361,12.35077,7,1
4,a cappella,0.694276,0.516172,201839.1,0.330533,0.03608,0.222983,-12.656547,0.083627,105.506031,0.454077,39.086248,7,1


In [127]:
genre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3232 entries, 0 to 3231
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   genres            3232 non-null   object 
 1   acousticness      3232 non-null   float64
 2   danceability      3232 non-null   float64
 3   duration_ms       3232 non-null   float64
 4   energy            3232 non-null   float64
 5   instrumentalness  3232 non-null   float64
 6   liveness          3232 non-null   float64
 7   loudness          3232 non-null   float64
 8   speechiness       3232 non-null   float64
 9   tempo             3232 non-null   float64
 10  valence           3232 non-null   float64
 11  popularity        3232 non-null   float64
 12  key               3232 non-null   int64  
 13  mode              3232 non-null   int64  
dtypes: float64(11), int64(2), object(1)
memory usage: 353.6+ KB


In [128]:
artist_genre = pd.read_csv('data/data_w_genres.csv')

artist_genre.head()

Unnamed: 0,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count,genres
0,"""Cats"" 1981 Original London Cast",0.5985,0.4701,267072.0,0.376203,0.010261,0.28305,-14.4343,0.20915,114.1288,0.35832,38.2,5,1,10,['show tunes']
1,"""Cats"" 1983 Broadway Cast",0.862538,0.441731,287280.0,0.406808,0.081158,0.315215,-10.69,0.176212,103.044154,0.268865,31.538462,5,1,26,[]
2,"""Fiddler On The Roof” Motion Picture Chorus",0.856571,0.348286,328920.0,0.286571,0.024593,0.325786,-15.230714,0.118514,77.375857,0.354857,34.571429,0,1,7,[]
3,"""Fiddler On The Roof” Motion Picture Orchestra",0.884926,0.425074,262890.962963,0.24577,0.073587,0.275481,-15.63937,0.1232,88.66763,0.37203,34.407407,0,1,27,[]
4,"""Joseph And The Amazing Technicolor Dreamcoat""...",0.510714,0.467143,270436.142857,0.488286,0.0094,0.195,-10.236714,0.098543,122.835857,0.482286,42.0,5,1,7,[]


In [129]:
artist_genre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32539 entries, 0 to 32538
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artists           32539 non-null  object 
 1   acousticness      32539 non-null  float64
 2   danceability      32539 non-null  float64
 3   duration_ms       32539 non-null  float64
 4   energy            32539 non-null  float64
 5   instrumentalness  32539 non-null  float64
 6   liveness          32539 non-null  float64
 7   loudness          32539 non-null  float64
 8   speechiness       32539 non-null  float64
 9   tempo             32539 non-null  float64
 10  valence           32539 non-null  float64
 11  popularity        32539 non-null  float64
 12  key               32539 non-null  int64  
 13  mode              32539 non-null  int64  
 14  count             32539 non-null  int64  
 15  genres            32539 non-null  object 
dtypes: float64(11), int64(3), object(2)
memo

In [130]:
artist = pd.read_csv('data/data_by_artist.csv')

artist.head()

Unnamed: 0,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
0,"""Cats"" 1981 Original London Cast",0.5985,0.4701,267072.0,0.376203,0.010261,0.28305,-14.4343,0.20915,114.1288,0.35832,38.2,5,1,10
1,"""Cats"" 1983 Broadway Cast",0.862538,0.441731,287280.0,0.406808,0.081158,0.315215,-10.69,0.176212,103.044154,0.268865,31.538462,5,1,26
2,"""Fiddler On The Roof” Motion Picture Chorus",0.856571,0.348286,328920.0,0.286571,0.024593,0.325786,-15.230714,0.118514,77.375857,0.354857,34.571429,0,1,7
3,"""Fiddler On The Roof” Motion Picture Orchestra",0.884926,0.425074,262890.962963,0.24577,0.073587,0.275481,-15.63937,0.1232,88.66763,0.37203,34.407407,0,1,27
4,"""Joseph And The Amazing Technicolor Dreamcoat""...",0.510714,0.467143,270436.142857,0.488286,0.0094,0.195,-10.236714,0.098543,122.835857,0.482286,42.0,5,1,7


In [131]:
artist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32539 entries, 0 to 32538
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artists           32539 non-null  object 
 1   acousticness      32539 non-null  float64
 2   danceability      32539 non-null  float64
 3   duration_ms       32539 non-null  float64
 4   energy            32539 non-null  float64
 5   instrumentalness  32539 non-null  float64
 6   liveness          32539 non-null  float64
 7   loudness          32539 non-null  float64
 8   speechiness       32539 non-null  float64
 9   tempo             32539 non-null  float64
 10  valence           32539 non-null  float64
 11  popularity        32539 non-null  float64
 12  key               32539 non-null  int64  
 13  mode              32539 non-null  int64  
 14  count             32539 non-null  int64  
dtypes: float64(11), int64(3), object(1)
memory usage: 3.7+ MB


In [132]:
year = pd.read_csv('data/data_by_year.csv')

year.head()

Unnamed: 0,year,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode
0,1920,0.631242,0.51575,238092.997135,0.4187,0.354219,0.216049,-12.65402,0.082984,113.2269,0.49821,0.610315,2,1
1,1921,0.862105,0.432171,257891.762821,0.241136,0.337158,0.205219,-16.81166,0.078952,102.425397,0.378276,0.391026,2,1
2,1922,0.828934,0.57562,140135.140496,0.226173,0.254776,0.256662,-20.840083,0.464368,100.033149,0.57119,0.090909,5,1
3,1923,0.957247,0.577341,177942.362162,0.262406,0.371733,0.227462,-14.129211,0.093949,114.01073,0.625492,5.205405,0,1
4,1924,0.9402,0.549894,191046.707627,0.344347,0.581701,0.235219,-14.231343,0.092089,120.689572,0.663725,0.661017,10,1


In [133]:
year.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   year              102 non-null    int64  
 1   acousticness      102 non-null    float64
 2   danceability      102 non-null    float64
 3   duration_ms       102 non-null    float64
 4   energy            102 non-null    float64
 5   instrumentalness  102 non-null    float64
 6   liveness          102 non-null    float64
 7   loudness          102 non-null    float64
 8   speechiness       102 non-null    float64
 9   tempo             102 non-null    float64
 10  valence           102 non-null    float64
 11  popularity        102 non-null    float64
 12  key               102 non-null    int64  
 13  mode              102 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 11.3 KB


At first glance, the *artist* and *artist_genre* dataframes are nearly identical: both have 32539 rows, and *artist_genre* has the same columns as *artist* plus a `genres` column at the end. We will therefore use the *artist_genre* dataframe for our analysis and ignore the *artist* dataframe.
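
The "nearly identical" observation above can be checked rather than eyeballed. The following is a minimal sketch, assuming the two frames share row order; the helper name `extra_columns_only` and the toy frames are hypothetical stand-ins for the real dataframes.

```python
import pandas as pd

def extra_columns_only(base: pd.DataFrame, extended: pd.DataFrame) -> list:
    """Return the columns found only in `extended`.

    Raises AssertionError if the columns shared by both frames differ.
    """
    shared = [c for c in extended.columns if c in base.columns]
    # assert_frame_equal raises AssertionError on any mismatch in the shared columns
    pd.testing.assert_frame_equal(base[shared], extended[shared])
    return [c for c in extended.columns if c not in base.columns]

# Toy stand-ins for the artist and artist_genre dataframes
toy_artist = pd.DataFrame({"artists": ["A", "B"], "energy": [0.4, 0.7]})
toy_artist_genre = toy_artist.assign(genres=["['show tunes']", "[]"])

extra = extra_columns_only(toy_artist, toy_artist_genre)
print(extra)  # ['genres']
```

On the real data, the equivalent call would presumably be `extra_columns_only(artist, artist_genre)`.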

#### A first glance at the data features

We see that the *data* dataframe contains the majority of the information, including the artists, songs, audio features, and release date, among a few other features. What it lacks is the genre, which appears only in the *genre* and *artist_genre* dataframes.
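
One possible way to bring genres onto the track-level data is sketched below. It assumes, as the previews above suggest, that the `artists` and `genres` columns hold Python-style list strings. The toy frames `tracks` and `artist_genres` are hypothetical stand-ins for *data* and *artist_genre*.

```python
import ast
import pandas as pd

# Toy stand-ins for the track-level and artist-level dataframes
tracks = pd.DataFrame({
    "name": ["Song 1", "Song 2"],
    "artists": ["['A']", "['A', 'B']"],
})
artist_genres = pd.DataFrame({
    "artists": ["A", "B"],
    "genres": ["['show tunes']", "[]"],
})

# Turn the list-like strings into real Python lists
tracks["artists"] = tracks["artists"].apply(ast.literal_eval)
artist_genres["genres"] = artist_genres["genres"].apply(ast.literal_eval)

# One row per (track, artist) pair, then attach each artist's genres
exploded = tracks.explode("artists")
tracks_with_genres = exploded.merge(artist_genres, on="artists", how="left")
print(tracks_with_genres)
```

A left merge keeps every track row, so artists missing from the genre table would show up with a missing `genres` value instead of being dropped.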

In order to create a proper recommendation system, we will need to narrow down our feature list to those that we imagine will be useful. Additionally, we will need to create a way of finding similar artists, genres, or songs based on previous listening history.
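
As an illustration of what "finding similar artists" could look like, here is a minimal sketch that ranks artists by cosine similarity of standardized audio features. This is one common choice and an assumption on our part, not necessarily the method used later in the project. The function name `most_similar`, the feature subset, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def most_similar(frame, name, features, top_n=3):
    """Rank the other artists in `frame` by cosine similarity to `name`,
    computed on z-scored feature columns."""
    values = frame[features].to_numpy(dtype=float)
    std = values.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    scaled = (values - values.mean(axis=0)) / std
    norms = np.linalg.norm(scaled, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = scaled / norms[:, None]
    names = frame["artists"].to_numpy()
    target = unit[names == name][0]
    scores = unit @ target  # cosine similarity of every artist to the target
    order = np.argsort(scores)[::-1]
    return [(names[i], float(scores[i])) for i in order if names[i] != name][:top_n]

# Toy stand-in for the artist-level dataframe
toy = pd.DataFrame({
    "artists": ["A", "B", "C", "D"],
    "energy": [0.9, 0.8, 0.1, 0.3],
    "acousticness": [0.1, 0.2, 0.9, 0.6],
})
ranking = most_similar(toy, "A", ["energy", "acousticness"])
print(ranking)
```

Standardizing first keeps wide-range features such as `tempo` or `loudness` from dominating the 0-to-1 features.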

We will start this process by defining a few functions that will help with this information gathering process.

In [120]:
# Borrowed from another notebook to evaluate an artist's collaborations
def get_collab(artist):
    # Candidate rows: the artists string contains the name (literal match, not a regex)
    rows = data[data['artists'].str.contains(artist, regex=False)]['artists']
    # Parse each stringified list safely (ast.literal_eval instead of eval)
    parsed = [ast.literal_eval(row) for row in rows]
    # Keep only tracks that really list this artist, then collect everyone else on them
    collaborators = [name for names in parsed if artist in names
                     for name in names if name != artist]
    # Unique collaborators and the number of tracks each shares with the artist
    artists, counts = np.unique(collaborators, return_counts=True)
    # If there are no collaborations, return an empty list
    if len(artists) == 0:
        return []
    # Scale counts by the highest collaboration total, so the top collaborator scores 1.0
    counts = counts / np.max(counts)
    # Sort collaborators by score, highest first
    indices = np.argsort(counts)[::-1]
    return list(zip(artists[indices], counts[indices]))
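
To make the scoring step in `get_collab` concrete, here is a small worked example of the count-and-normalize logic on a hypothetical list of collaborator names. It mirrors the `np.unique` and `np.argsort` steps above without needing the dataset.

```python
import numpy as np

# Hypothetical collaborators gathered from one artist's tracks
collaborators = ["B", "C", "B", "D", "B", "C"]

# Unique names with the number of times each appears
names, counts = np.unique(collaborators, return_counts=True)

# Scale by the largest count so the most frequent collaborator scores 1.0
scores = counts / counts.max()

# Highest score first
order = np.argsort(scores)[::-1]
ranked = [(str(names[i]), float(scores[i])) for i in order]
print(ranked)
```

Here "B" appears three times, "C" twice, and "D" once, so the scores come out as 1.0, 2/3, and 1/3.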

In [121]:
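def get_genre(a):
    # artist_genre (not artist) is the dataframe that carries the genres column
    return artist_genre[artist_genre['artists']==a]['genres'].to_numpy()[0]

In [122]:
def get_artist_w_genre(a):
    # Parse the stringified genre list for the given artist
    genres = ast.literal_eval(get_genre(a))
    artist_list = []
    for g in genres:
        # Literal (non-regex) match on the quoted genre name
        artist_list += artist_genre[artist_genre['genres'].str.contains('\'' + g + '\'', regex=False)]['artists'].to_list()
    if len(artist_list) == 0:
        return []
    # Count how many of these genres each matching artist shares, scaled so the top score is 1.0
    artists, counts = np.unique(artist_list, return_counts=True)
    counts = counts / np.max(counts)
    indices = np.argsort(counts)
    return list(zip(artists[indices[::-1]], counts[indices[::-1]]))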
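
The genre-overlap idea behind `get_artist_w_genre` can be tried on toy data. The sketch below is a variant, not the function above: it compares parsed genre sets instead of matching substrings, and it leaves the queried artist out of the result. The name `shared_genre_scores` and the toy frame are hypothetical.

```python
import ast
import pandas as pd

# Toy stand-in for the artist_genre dataframe
toy = pd.DataFrame({
    "artists": ["A", "B", "C", "D"],
    "genres": ["['rock', 'pop']", "['rock', 'pop', 'jazz']", "['pop']", "['folk']"],
})
toy["genre_list"] = toy["genres"].apply(ast.literal_eval)

def shared_genre_scores(frame, name):
    """Score other artists by how many of `name`'s genres they share,
    scaled so the best match scores 1.0."""
    target = set(frame.loc[frame["artists"] == name, "genre_list"].iloc[0])
    others = frame[frame["artists"] != name]
    overlap = others["genre_list"].apply(lambda g: len(target & set(g)))
    overlap = overlap[overlap > 0]  # drop artists with no genre in common
    if overlap.empty:
        return []
    ranked = (overlap / overlap.max()).sort_values(ascending=False)
    return list(zip(others.loc[ranked.index, "artists"], ranked))

result = shared_genre_scores(toy, "A")
print(result)
```

Artist "B" shares both of A's genres and "C" shares one, so B scores 1.0 and C scores 0.5, while "D" is dropped for having no overlap.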