# Content-based Filtering Spotify Song Recommendation System

This notebook describes is a content-based Filtering approach for Spotify Song recommendation. 
The code accompanys a [medium article]() called "Part III: Build a Recommendation System with Spotify Datasets".
This notebook is the thrid article in a [Spotify Song Recommendation System series]() by the ENCA team.

## Structure

- Package Setup
- Preprocessing
- Feature Generation
- Content-based Filtering
- Recommendation

## Setup

**Downloaded Package**
- TextBlob

**Imported Packages**

- Pandas
- Scikit-learn
- re
- Spotipy (refer to Part I of the series)

## Credits

This notebook builds on top of Madhav Thaker's [spotify-recommendation-system tutorial](https://github.com/madhavthaker/spotify-recommendation-system).





### Package Setup
#### Download Dependencies

In [50]:
!pip install textblob



#### Import Dependencies

In [51]:
# Import library
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import re

#### Data Import
The data here is not raw data, it is imported after the retrieving Spotify data in Part I. Please refer to [Part I]() for more information.

In [52]:
# Import processed data
playlistDF = pd.read_csv("../data/processed_data.csv")
print(playlistDF.columns)
playlistDF.drop(columns=["Unnamed: 0",'Unnamed: 0.1'], inplace = True)
playlistDF.head()

Index(['Unnamed: 0', 'Unnamed: 0.1', 'pos', 'artist_name', 'track_uri',
       'artist_uri', 'track_name', 'album_uri', 'duration_ms_x', 'album_name',
       'name', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url',
       'duration_ms_y', 'time_signature', 'artist_pop', 'genres', 'track_pop'],
      dtype='object')


Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,danceability,...,type,id,uri,track_href,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop
0,0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
1,73,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,w o r k o u t,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
2,14,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,party playlist,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
3,42,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Dance mix,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
4,1,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,spin,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69


### Preprocessing

The following cells conducts further preprocessing for the imported data to cater the data specifically for the content-based filtering.

Here is the general pipeline:
1. Useful data Selection
2. List concatenation

#### Useful Data Selection

Due to the nature of playlist, there will be duplicates in songs across multiple playlists. Therefore, I combined the song and the artist and used the `drop_duplicates()` function in `pandas` to remove duplicate songs when building the base dataframe with all unique songs.

In [53]:
# Show that there are duplicates of songs accross playlists
playlistDF[['artist_name','track_name','name']]

Unnamed: 0,artist_name,track_name,name
0,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),Throwbacks
1,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),w o r k o u t
2,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),party playlist
3,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),Dance mix
4,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),spin
...,...,...,...
4461,Built By Titan,Collide (feat. Jonathan Thulin),Mom's playlist
4462,Astoria Kings,Come Alive,Mom's playlist
4463,Anthem Lights,Best of 2012: Payphone / Call Me Maybe / Wide ...,Mom's playlist
4464,Anthem Lights,Best of 2012: Payphone / Call Me Maybe / Wide ...,Favorite Songs


Now, I drop the duplicates with `pandas` by combining the artist name and track name. This is to prevent droping songs from different artists but with the same names.

In [143]:
# Drop song duplicates
def drop_duplicates(df):
    '''
    Drop duplicate songs
    '''
    df['artists_song'] = df.apply(lambda row: row['artist_name']+row['track_name'],axis = 1)
    return df.drop_duplicates('artists_song')

songDF = drop_duplicates(playlistDF)
print("Are all songs unique: ",len(pd.unique(songDF.artists_song))==len(songDF))


Are all songs unique:  True


Finally, I select the features I would use later on. The following is a short list of them in categories:
1. Metadata
    - id
    - genres
    - artist_pop
    - track_pop
2. Audio
    - **Mood**: Danceability, Valence, Energy, Tempo
    - **Properties**: Loudness, Speechiness, Instrumentalness
    - **Context**: Liveness, Acousticness
    - **metadata**: duration_ms_x, duration_ms_y, key, mode, time_signature
3. Text
    - artist_name
    - track_name
    - album_name

In [144]:
# Select useful columns
def select_cols(df):
       '''
       Select useful columns
       '''
       return df[['artist_name','id','track_name','album_name','danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo','time_signature', "artist_pop", "genres", "track_pop"]]
songDF = select_cols(songDF)
songDF.head()

Unnamed: 0,artist_name,id,track_name,duration_ms_x,duration_ms_y,album_name,danceability,energy,key,loudness,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artist_pop,genres,track_pop
0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,Lose Control (feat. Ciara & Fat Man Scoop),226863,226864,The Cookbook,0.904,0.813,4,-7.105,...,0.121,0.0311,0.00697,0.0471,0.81,125.461,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
6,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,Toxic,198800,198800,In The Zone,0.774,0.838,5,-3.914,...,0.114,0.0249,0.025,0.242,0.924,143.04,4,84,dance_pop pop post-teen_pop,83
19,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,Crazy In Love,235933,235933,Dangerously In Love (Alben für die Ewigkeit),0.664,0.758,2,-6.583,...,0.21,0.00238,0.0,0.0598,0.701,99.259,4,86,dance_pop pop r&b,25
46,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,Rock Your Body,267266,267267,Justified,0.892,0.714,4,-6.055,...,0.141,0.201,0.000234,0.0521,0.817,100.972,4,81,dance_pop pop,79
55,Shaggy,1lzr43nnXAijIGYnCT8M8H,It Wasn't Me,227600,227600,Hot Shot,0.853,0.606,0,-4.596,...,0.0713,0.0561,0.0,0.313,0.654,94.759,4,74,pop_rap reggae_fusion,2


#### List Concatenation

After selecting the useful data, due to the import format of a dataframe, we need to convert the `genres` columns back into a list. This is done by using the `split()` function:

In [129]:
def genre_preprocess(df):
    '''
    Preprocess the genre data
    '''
    df['genres_list'] = df['genres'].apply(lambda x: x.split(" "))
    return df
songDF = genre_preprocess(songDF)
songDF['genres_list'].head()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


0     [dance_pop, hip_hop, hip_pop, pop, pop_rap, r&...
6                       [dance_pop, pop, post-teen_pop]
19                                [dance_pop, pop, r&b]
46                                     [dance_pop, pop]
55                             [pop_rap, reggae_fusion]
Name: genres_list, dtype: object

Lastly, I created a pipeline for preprocessing any new playlist as below:

In [149]:
def playlist_preprocess(df):
    '''
    Preprocess imported playlist
    '''
    df = drop_duplicates(df)
    df = select_cols(df)
    df = genre_preprocess(df)

    return df

### Feature Generation
Now that the data is usable, we can now feature-engineer the data for the purpose of the recommendation system. In this project, the following process is conducted into a pipeline for feature generation.

1. Sentiment Analysis
2. One-hot Encoding
3. TF-IDF
4. Normalization

#### Sentiment Analysis

In our data, we will perform a simply sentiment analysis using subjectivity and polarity form `TextBlob` package.
- **Subjectivity** (0,1): The amount of personal opinion and factual information contained in the text.
- **Polarity** (-1,1): The degree of strong or clearly defined sentiment accounting for negation.

We will then use one-hot encoding to list the sentiment of the song titles as one of the input.

In [102]:
def getSubjectivity(text):
  '''
  Getting the Subjectivity using TextBlob
  '''
  return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
  '''
  Getting the Polarity using TextBlob
  '''
  return TextBlob(text).sentiment.polarity

def getAnalysis(score, task="polarity"):
  '''
  Categorizing the Polarity & Subjectivity score
  '''
  if task == "subjectivity":
    if score < 1/3:
      return "low"
    elif score > 1/3:
      return "high"
    else:
      return "medium"
  else:
    if score < 0:
      return 'Negative'
    elif score == 0:
      return 'Neutral'
    else:
      return 'Positive'

def sentiment_analysis(df, text_col):
  '''
  Perform sentiment analysis on text
  ---
  Input:
  df (pandas dataframe): Dataframe of interest
  text_col (str): column of interest
  '''
  df['subjectivity'] = df[text_col].apply(getSubjectivity).apply(lambda x: getAnalysis(x,"subjectivity"))
  df['polarity'] = df[text_col].apply(getPolarity).apply(getAnalysis)
  return df

In [103]:
# Show result
sentiment = sentiment_analysis(songDF, "track_name")
sentiment.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,danceability,...,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop,artists_song,genres_list,subjectivity,polarity
0,0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks,0.904,...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69,Missy ElliottLose Control (feat. Ciara & Fat M...,"[dance_pop, hip_hop, hip_pop, pop, pop_rap, r&...",low,Neutral
6,1,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Toxic,spotify:album:0z7pVBGOD7HCIB7S8eLkLI,198800,In The Zone,Throwbacks,0.774,...,https://api.spotify.com/v1/audio-analysis/6I9V...,198800,4,84,dance_pop pop post-teen_pop,83,Britney SpearsToxic,"[dance_pop, pop, post-teen_pop]",low,Neutral
19,2,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Crazy In Love,spotify:album:25hVFAxTlDvXbx2X2QkUkE,235933,Dangerously In Love (Alben für die Ewigkeit),Throwbacks,0.664,...,https://api.spotify.com/v1/audio-analysis/0WqI...,235933,4,86,dance_pop pop r&b,25,BeyoncéCrazy In Love,"[dance_pop, pop, r&b]",high,Negative
46,3,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,spotify:artist:31TPClRtHm23RisEBtV3X7,Rock Your Body,spotify:album:6QPkyl04rXwTGlGlcYaRoW,267266,Justified,Throwbacks,0.892,...,https://api.spotify.com/v1/audio-analysis/1AWQ...,267267,4,81,dance_pop pop,79,Justin TimberlakeRock Your Body,"[dance_pop, pop]",low,Neutral
55,4,Shaggy,1lzr43nnXAijIGYnCT8M8H,spotify:artist:5EvFsr3kj42KNv97ZEnqij,It Wasn't Me,spotify:album:6NmFmPX56pcLBOFMhIiKvF,227600,Hot Shot,Throwbacks,0.853,...,https://api.spotify.com/v1/audio-analysis/1lzr...,227600,4,74,pop_rap reggae_fusion,2,ShaggyIt Wasn't Me,"[pop_rap, reggae_fusion]",low,Neutral


#### One-hot encoding

One-hot encoding is a method to transform categorical variables into a machine-understandable langauge. This is done by converting each category into a column so that each category can be represented as either True or False.


![ohe_img](https://cdn-images-1.medium.com/max/1600/0*KVGWy9c3eo2RiAe3.png) 

In [104]:
def ohe_prep(df, column, new_name): 
    ''' 
    Create One Hot Encoded features of a specific column
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    column (str): Column to be processed
    new_name (str): new column name to be used
        
    Output: 
    tf_df: One-hot encoded features 
    '''
    
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    return tf_df

In [105]:
subject_ohe = ohe_prep(sentiment, 'subjectivity','subject')
subject_ohe.iloc[0]

subject|high      0
subject|low       1
subject|medium    0
Name: 0, dtype: uint8

### Feature Generation

#### TF-IDF



In [106]:
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(songDF['genres_list'].apply(lambda x: " ".join(x)))
genre_df = pd.DataFrame(tfidf_matrix.toarray())
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names()]
genre_df.reset_index(drop = True, inplace=True)
genre_df.iloc[0]



genre|_hip_hop            0.0
genre|abstract_hip_hop    0.0
genre|acoustic_pop        0.0
genre|adult_standards     0.0
genre|aesthetic_rap       0.0
                         ... 
genre|world_devotional    0.0
genre|world_worship       0.0
genre|worship             0.0
genre|yacht_rock          0.0
genre|zolo                0.0
Name: 0, Length: 527, dtype: float64

#### Normalization

In [108]:
print(songDF['artist_pop'].describe())

count    1000.000000
mean       64.627000
std        17.250085
min         0.000000
25%        55.000000
50%        67.000000
75%        77.000000
max        98.000000
Name: artist_pop, dtype: float64


In [115]:
pop = songDF[["artist_pop"]].reset_index(drop = True)
scaler = MinMaxScaler()
pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns)
pop_scaled.head()

Unnamed: 0,artist_pop
0,0.755102
1,0.857143
2,0.877551
3,0.826531
4,0.755102


#### Feature Generation
Finially, we generate all features mentioned below using the following cell and concatenate all variables into a new dataframe.

In [125]:
def create_feature_set(df, float_cols):
    '''
    Process spotify df to create a final set of features that will be used to generate recommendations
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    float_cols (list(str)): List of float columns that will be scaled
            
    Output: 
    final (pandas dataframe): Final set of features 
    '''
    
    # Tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['genres_list'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names()]
    genre_df.reset_index(drop = True, inplace=True)
    
    # Sentiment analysis
    df = sentiment_analysis(df, "track_name")

    # One-hot Encoding
    subject_ohe = ohe_prep(df, 'subjectivity','subject') * 0.3
    polar_ohe = ohe_prep(df, 'polarity','polar') * 0.5
    key_ohe = ohe_prep(df, 'key','key') * 0.5
    mode_ohe = ohe_prep(df, 'mode','mode') * 0.5

    # Normalization
    # Scale popularity columns
    pop = df[["artist_pop","track_pop"]].reset_index(drop = True)
    scaler = MinMaxScaler()
    pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns) #* 0.2 

    # Scale audio columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) #* 0.2

    # Concanenate all features
    final = pd.concat([genre_df, floats_scaled, pop_scaled, subject_ohe, polar_ohe, key_ohe, mode_ohe], axis = 1)
    
    # Add song id
    final['id']=df['id'].values
    
    return final

In [126]:
float_cols = songDF.dtypes[songDF.dtypes == 'float64'].index.values
songDF.to_csv("../data/allsong_data.csv", index = False)
complete_feature_set = create_feature_set(songDF, float_cols=float_cols)#.mean(axis = 0)
complete_feature_set.to_csv("../data/complete_feature.csv", index = False)
complete_feature_set


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,genre|_hip_hop,genre|abstract_hip_hop,genre|acoustic_pop,genre|adult_standards,genre|aesthetic_rap,genre|afrofuturism,genre|alabama_indie,genre|album_rock,genre|albuquerque_indie,genre|alt_z,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0UaMYEvWZi0ZqiDOoHU3YI
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,6I9VzXrHxO9rA9A5euc8Ak
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0WqIKmW4BTrj3eJFmnCKMv
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1AWQoqb9bSvzTjaLralEkT
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1lzr43nnXAijIGYnCT8M8H
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,5qOvJSBgSGUFEYKjrcxIH4
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,581gLYhF5OxQzgfIMGlAvu
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,7zPzpfKVpNPlk1qKhie2JZ
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,1dKDRs99KkNbtC9AHM7TLm


### Playlist Preprocessing



In [156]:
# playlistDF_test = pd.read_csv("../data/processed_data.csv")
# playlistDF_test = playlist_preprocess(playlistDF_test)
# playlistDF_test.head()
playlistDF_test = playlistDF[playlistDF['name']=="Mom's playlist"]
playlistDF_test.head()


Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,danceability,...,id,uri,track_href,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop,artists_song
413,59,The Killers,7oK9VyNzrYvRFo7nQEYkWN,spotify:artist:0C0XlULifJtAgn6ZNCW2eu,Mr. Brightside,spotify:album:4undIeGmofnAYKhnDclN1w,222586,Hot Fuss,Mom's playlist,0.356,...,7oK9VyNzrYvRFo7nQEYkWN,spotify:track:7oK9VyNzrYvRFo7nQEYkWN,https://api.spotify.com/v1/tracks/7oK9VyNzrYvR...,https://api.spotify.com/v1/audio-analysis/7oK9...,222587,4,80,alternative_rock dance_rock modern_rock perman...,78,The KillersMr. Brightside
1234,18,Rihanna,6qn9YLKt13AGvpq9jfO8py,spotify:artist:5pKCCKE2ajJHZ9KAiaK11H,We Found Love,spotify:album:2g1EakEaW7fPTZC6vBmBCn,215226,Talk That Talk,Mom's playlist,0.734,...,6qn9YLKt13AGvpq9jfO8py,spotify:track:6qn9YLKt13AGvpq9jfO8py,https://api.spotify.com/v1/tracks/6qn9YLKt13AG...,https://api.spotify.com/v1/audio-analysis/6qn9...,215227,4,90,barbadian_pop dance_pop pop pop_rap urban_cont...,77,RihannaWe Found Love
1363,32,American Authors,5j9iuo3tMmQIfnEEQOOjxh,spotify:artist:0MlOPi3zIDMVrfA9R04Fe3,Best Day Of My Life,spotify:album:2AAVQqcejMEgNpdg2raPYE,194240,"Oh, What A Life",Mom's playlist,0.67,...,5j9iuo3tMmQIfnEEQOOjxh,spotify:track:5j9iuo3tMmQIfnEEQOOjxh,https://api.spotify.com/v1/tracks/5j9iuo3tMmQI...,https://api.spotify.com/v1/audio-analysis/5j9i...,194240,4,70,indie_poptimism modern_alternative_rock modern...,0,American AuthorsBest Day Of My Life
1579,38,Clean Bandit,5HuqzFfq2ulY1iBAW5CxLe,spotify:artist:6MDME20pz9RveH9rEXvrOM,Rather Be (feat. Jess Glynne),spotify:album:2xVeccmEU0zklK4XSKiDCW,227833,I Cry When I Laugh,Mom's playlist,0.799,...,5HuqzFfq2ulY1iBAW5CxLe,spotify:track:5HuqzFfq2ulY1iBAW5CxLe,https://api.spotify.com/v1/tracks/5HuqzFfq2ulY...,https://api.spotify.com/v1/audio-analysis/5Huq...,227833,4,80,dance_pop edm pop pop_dance tropical_house uk_...,53,Clean BanditRather Be (feat. Jess Glynne)
1732,17,Sia,4VrWlk8IQxevMvERoX08iC,spotify:artist:5WUlDfRSoLAfcVSX1WnrxN,Chandelier,spotify:album:3xFSl9lIRaYXIYkIn3OIl9,216120,1000 Forms Of Fear,Mom's playlist,0.399,...,4VrWlk8IQxevMvERoX08iC,spotify:track:4VrWlk8IQxevMvERoX08iC,https://api.spotify.com/v1/tracks/4VrWlk8IQxev...,https://api.spotify.com/v1/audio-analysis/4VrW...,216120,5,89,australian_dance australian_pop pop,81,SiaChandelier


The next step is to generate all the features.

In [175]:
def generate_playlist_feature(complete_feature_set, playlist_df, weight_factor):
    """ 
    Summarize a user's playlist into a single vector

    Parameters: 
        complete_feature_set (pandas dataframe): Dataframe which includes all of the features for the spotify songs
        playlist_df (pandas dataframe): playlist dataframe
        weight_factor (float): float value that represents the recency bias. The larger the recency bias, the most priority recent songs get. Value should be close to 1. 
        
    Returns: 
        playlist_feature_set_weighted_final (pandas series): single feature that summarizes the playlist
        complete_feature_set_nonplaylist (pandas dataframe): 
    """
    
    complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]
    complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]
    complete_feature_set_playlist_final = complete_feature_set_playlist.drop(columns = "id")
    return complete_feature_set_playlist_final.sum(axis = 0), complete_feature_set_nonplaylist

In [176]:
complete_feature_set_playlist_vector, complete_feature_set_nonplaylist = generate_playlist_feature(complete_feature_set, playlistDF_test, 1.09)

In [179]:
complete_feature_set_nonplaylist

Unnamed: 0,genre|_hip_hop,genre|abstract_hip_hop,genre|acoustic_pop,genre|adult_standards,genre|aesthetic_rap,genre|afrofuturism,genre|alabama_indie,genre|album_rock,genre|albuquerque_indie,genre|alt_z,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0UaMYEvWZi0ZqiDOoHU3YI
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,6I9VzXrHxO9rA9A5euc8Ak
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0WqIKmW4BTrj3eJFmnCKMv
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1AWQoqb9bSvzTjaLralEkT
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1lzr43nnXAijIGYnCT8M8H
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
926,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.5,0V1xJXwwuXsr5oW5nSBVOC
927,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,4CzUdbxR8UJAXqG6JYM3ma
928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,70cpuGFNENOHuqNhtLVFJY
929,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,6FxbwB6o1MKOy0dHvxNr2W


In [180]:
complete_feature_set_playlist_vector

genre|_hip_hop             0.0
genre|abstract_hip_hop     0.0
genre|acoustic_pop         0.0
genre|adult_standards      0.0
genre|aesthetic_rap        0.0
                          ... 
key|9                      3.0
key|10                     2.5
key|11                     3.0
mode|0                     8.0
mode|1                    29.0
Length: 558, dtype: float64

In [182]:
def generate_playlist_recos(df, features, nonplaylist_features):
    """ 
    Pull songs from a specific playlist.
    
    Parameters: 
        df (pandas dataframe): spotify dataframe
        features (pandas series): summarized playlist feature
        nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist
        
    Returns: 
        non_playlist_df_top_40: Top 40 recommendations for that playlist
    """
    
    non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)]
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40)
    
    return non_playlist_df_top_40

In [183]:
recommend = generate_playlist_recos(songDF, complete_feature_set_playlist_vector, complete_feature_set_nonplaylist)
recommend

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,artist_name,id,track_name,duration_ms_x,duration_ms_y,album_name,danceability,energy,key,loudness,...,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artist_pop,genres,track_pop,sim
3796,Young the Giant,7w5Ww1cW8U9v8Q3g4qLpVD,Repeat,185200,185200,Home of the Strange,0.656,0.806,1,-5.354,...,0.0454,2.3e-05,0.341,0.572,149.975,4,69,modern_alternative_rock modern_rock pop_rock r...,45,0.894501
2737,blackbear,4dAUGcD3zfZhJPyLXFSAUW,Sniffing Vicodin In Paris (Danny Olson Remix) ...,192428,192429,Sniffing Vicodin In Paris (Danny Olson Remix) ...,0.654,0.825,7,-4.744,...,0.0736,4e-06,0.242,0.73,140.012,4,85,pop,38,0.886185
289,Jason Derulo,67T6l4q3zVjC5nZZPXByU8,Whatcha Say,221253,221253,Jason Derulo,0.615,0.711,11,-5.507,...,0.0444,0.0,0.145,0.711,144.036,4,85,dance_pop pop pop_rap post-teen_pop,68,0.885532
1335,Bruno Mars,6SKwQghsR8AISlxhcwyA9R,Marry You,230120,230192,Doo-Wops & Hooligans,0.621,0.82,10,-4.865,...,0.332,0.0,0.104,0.452,144.905,4,92,dance_pop pop,77,0.884425
2881,AJR,2pwnEzgIzYL4AOw4ousjkB,Let the Games Begin,201004,201004,Let the Games Begin,0.664,0.698,10,-5.084,...,0.127,0.0,0.118,0.573,135.023,4,78,modern_rock,62,0.882888
227,Beyoncé,6d8A5sAx9TfdeseDvfWNHd,Check On It - feat. Bun B and Slim Thug,210453,210453,B'Day,0.705,0.796,7,-6.845,...,0.0708,0.0,0.388,0.864,166.042,4,86,dance_pop pop r&b,34,0.882798
3183,James Bay,7tmtOEDxPN7CWaQWBsG1DY,Hold Back The River,238746,238747,Chaos And The Calm,0.715,0.715,5,-7.364,...,0.0526,0.0,0.0936,0.506,134.923,4,76,modern_rock neo_mellow pop pop_rock,65,0.881821
2109,Cage The Elephant,43O3Iu8mDJy10i6k8SVRXX,Take It or Leave It,207320,207320,Melophobia,0.71,0.847,0,-3.009,...,0.0289,0.00033,0.0613,0.65,119.944,4,77,modern_rock punk_blues rock,58,0.881352
1464,Taio Cruz,2CEgGE6aESpnmtfiZwYlbV,Dynamite,202613,202613,The Rokstarr Hits Collection,0.751,0.783,4,-3.724,...,0.00379,0.0,0.036,0.816,119.975,4,73,dance_pop pop pop_rap,82,0.879675
1511,Flo Rida,3bC1ahPIYt1btJzSSEyyrF,Whistle,225000,224653,Wild Ones,0.747,0.937,0,-5.746,...,0.0208,0.0,0.29,0.739,103.976,4,81,dance_pop miami_hip_hop pop pop_rap,80,0.878783
