# Content-based Filtering Spotify Song Recommendation System

This notebook describes a content-based filtering approach to Spotify song recommendation.
The code accompanies a [Medium article]() called "Part III: Build a Recommendation System with Spotify Datasets".
This notebook belongs to the third article in a [Spotify Song Recommendation System series]() by the ENCA team.

## Structure

- Package Setup
- Preprocessing
- Feature Generation
- Content-based Filtering Recommendation

## Setup

**Downloaded Package**
- TextBlob

**Imported Packages**

- Pandas
- Scikit-learn
- re
- Spotipy (refer to Part I of the series)

## Credits

This notebook builds on top of Madhav Thaker's [spotify-recommendation-system tutorial](https://github.com/madhavthaker/spotify-recommendation-system).





### Package Setup
#### Download Dependencies

In [1]:
!pip install textblob



#### Import Dependencies

In [2]:
# Import library
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import re

#### Data Import
The data here is not raw data; it is the output of the Spotify data retrieval in Part I. Please refer to [Part I]() for more information.

In [3]:
# Import processed data
playlistDF = pd.read_csv("../data/processed_data.csv")
print(playlistDF.columns)
playlistDF.drop(columns=["Unnamed: 0",'Unnamed: 0.1'], inplace = True)
playlistDF.head()

Index(['Unnamed: 0', 'Unnamed: 0.1', 'pos', 'artist_name', 'track_uri',
       'artist_uri', 'track_name', 'album_uri', 'duration_ms_x', 'album_name',
       'name', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url',
       'duration_ms_y', 'time_signature', 'artist_pop', 'genres', 'track_pop'],
      dtype='object')


Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,danceability,...,type,id,uri,track_href,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop
0,0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
1,73,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,w o r k o u t,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
2,14,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,party playlist,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
3,42,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Dance mix,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
4,1,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,spin,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69


### Preprocessing

The following cells further preprocess the imported data to tailor it to content-based filtering.

Here is the general pipeline:
1. Useful data Selection
2. List concatenation

#### Useful Data Selection

Because the same song can appear in several playlists, the data contains duplicate songs. Therefore, I combined the song and the artist and used the `drop_duplicates()` function in `pandas` to remove duplicate songs when building the base dataframe with all unique songs.

In [4]:
# Show that there are duplicates of songs across playlists
playlistDF[['artist_name','track_name','name']]

Unnamed: 0,artist_name,track_name,name
0,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),Throwbacks
1,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),w o r k o u t
2,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),party playlist
3,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),Dance mix
4,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),spin
...,...,...,...
4461,Built By Titan,Collide (feat. Jonathan Thulin),Mom's playlist
4462,Astoria Kings,Come Alive,Mom's playlist
4463,Anthem Lights,Best of 2012: Payphone / Call Me Maybe / Wide ...,Mom's playlist
4464,Anthem Lights,Best of 2012: Payphone / Call Me Maybe / Wide ...,Favorite Songs


Now, I drop the duplicates with `pandas` by combining the artist name and track name. This prevents dropping songs that share a name but come from different artists.

In [5]:
# Drop song duplicates
def drop_duplicates(df):
    '''
    Drop duplicate songs
    '''
    df['artists_song'] = df.apply(lambda row: row['artist_name']+row['track_name'],axis = 1)
    return df.drop_duplicates('artists_song')

songDF = drop_duplicates(playlistDF)
print("Are all songs unique: ",len(pd.unique(songDF.artists_song))==len(songDF))


Are all songs unique:  True
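
As an alternative sketch, `drop_duplicates()` also accepts a `subset` of columns, which avoids building a concatenated key (plain string concatenation can in principle merge distinct pairs such as `"ab" + "c"` and `"a" + "bc"`). The toy dataframe below is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data: the same song appears in two playlists
toy = pd.DataFrame({
    "artist_name": ["Missy Elliott", "Missy Elliott", "Britney Spears"],
    "track_name": ["Lose Control", "Lose Control", "Toxic"],
    "name": ["Throwbacks", "party playlist", "Throwbacks"],
})

# Keep the first row for each (artist, track) pair
unique_songs = toy.drop_duplicates(subset=["artist_name", "track_name"])
print(unique_songs[["artist_name", "track_name"]])
```

Unlike the function above, this variant does not add an `artists_song` column, so later code that relies on that column would need adjusting.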


Finally, I select the features I would use later on. The following is a short list of them in categories:
1. Metadata
    - id
    - genres
    - artist_pop
    - track_pop
2. Audio
    - **Mood**: Danceability, Valence, Energy, Tempo
    - **Properties**: Loudness, Speechiness, Instrumentalness
    - **Context**: Liveness, Acousticness
    - **metadata**: key, mode
3. Text
    - track_name

In [6]:
# Select useful columns
def select_cols(df):
       '''
       Select useful columns
       '''
       return df[['artist_name','id','track_name','danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', "artist_pop", "genres", "track_pop"]]
songDF = select_cols(songDF)
songDF.head()

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,artist_pop,genres,track_pop
0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,Lose Control (feat. Ciara & Fat Man Scoop),0.904,0.813,4,-7.105,0,0.121,0.0311,0.00697,0.0471,0.81,125.461,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
6,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,Toxic,0.774,0.838,5,-3.914,0,0.114,0.0249,0.025,0.242,0.924,143.04,84,dance_pop pop post-teen_pop,83
19,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,Crazy In Love,0.664,0.758,2,-6.583,0,0.21,0.00238,0.0,0.0598,0.701,99.259,86,dance_pop pop r&b,25
46,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,Rock Your Body,0.892,0.714,4,-6.055,0,0.141,0.201,0.000234,0.0521,0.817,100.972,81,dance_pop pop,79
55,Shaggy,1lzr43nnXAijIGYnCT8M8H,It Wasn't Me,0.853,0.606,0,-4.596,1,0.0713,0.0561,0.0,0.313,0.654,94.759,74,pop_rap reggae_fusion,2


#### List Concatenation

After selecting the useful data, we need to convert the `genres` column back into a list, because the imported dataframe stores it as a space-separated string. This is done by using the `split()` function:

In [7]:
def genre_preprocess(df):
    '''
    Preprocess the genre data
    '''
    df['genres_list'] = df['genres'].apply(lambda x: x.split(" "))
    return df
songDF = genre_preprocess(songDF)
songDF['genres_list'].head()


0     [dance_pop, hip_hop, hip_pop, pop, pop_rap, r&...
6                       [dance_pop, pop, post-teen_pop]
19                                [dance_pop, pop, r&b]
46                                     [dance_pop, pop]
55                             [pop_rap, reggae_fusion]
Name: genres_list, dtype: object

Lastly, I created a pipeline for preprocessing any new playlist as below:

In [8]:
def playlist_preprocess(df):
    '''
    Preprocess imported playlist
    '''
    df = drop_duplicates(df)
    df = select_cols(df)
    df = genre_preprocess(df)

    return df

### Feature Generation
Now that the data is usable, we can feature-engineer it for the recommendation system. In this project, the following steps are combined into a feature generation pipeline.

1. Sentiment Analysis
2. One-hot Encoding
3. TF-IDF
4. Normalization

#### Sentiment Analysis

On our data, we will perform a simple sentiment analysis using subjectivity and polarity from the `TextBlob` package.
- **Subjectivity** (0,1): The amount of personal opinion and factual information contained in the text.
- **Polarity** (-1,1): The degree of strong or clearly defined sentiment accounting for negation.

We will then use one-hot encoding to include the sentiment of the song titles as one of the inputs.

In [9]:
def getSubjectivity(text):
  '''
  Getting the Subjectivity using TextBlob
  '''
  return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
  '''
  Getting the Polarity using TextBlob
  '''
  return TextBlob(text).sentiment.polarity

def getAnalysis(score, task="polarity"):
  '''
  Categorizing the Polarity & Subjectivity score
  '''
  if task == "subjectivity":
    # Note: "medium" is returned only when score is exactly 1/3, since both thresholds are 1/3
    if score < 1/3:
      return "low"
    elif score > 1/3:
      return "high"
    else:
      return "medium"
  else:
    if score < 0:
      return 'Negative'
    elif score == 0:
      return 'Neutral'
    else:
      return 'Positive'

def sentiment_analysis(df, text_col):
  '''
  Perform sentiment analysis on text
  ---
  Input:
  df (pandas dataframe): Dataframe of interest
  text_col (str): column of interest
  '''
  df['subjectivity'] = df[text_col].apply(getSubjectivity).apply(lambda x: getAnalysis(x,"subjectivity"))
  df['polarity'] = df[text_col].apply(getPolarity).apply(getAnalysis)
  return df
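
In `getAnalysis` above, every subjectivity score other than exactly 1/3 lands in `low` or `high`. If three evenly sized bands are intended, a variant with an upper threshold of 2/3 might look like the sketch below. The function name `bucket_subjectivity` and its threshold defaults are hypothetical and not part of the original notebook.

```python
def bucket_subjectivity(score, low=1/3, high=2/3):
    """Map a subjectivity score in [0, 1] to 'low', 'medium' or 'high'."""
    if score < low:
        return "low"
    if score > high:
        return "high"
    return "medium"

# Example scores chosen for illustration
for s in (0.0, 0.5, 0.75):
    print(s, bucket_subjectivity(s))
```

Switching to this variant would change the `subject|...` columns, so the outputs shown in this notebook would no longer match exactly.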

In [10]:
# Show result
sentiment = sentiment_analysis(songDF, "track_name")
sentiment.head()

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,artist_pop,genres,track_pop,genres_list,subjectivity,polarity
0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,Lose Control (feat. Ciara & Fat Man Scoop),0.904,0.813,4,-7.105,0,0.121,0.0311,0.00697,0.0471,0.81,125.461,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69,"[dance_pop, hip_hop, hip_pop, pop, pop_rap, r&...",low,Neutral
6,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,Toxic,0.774,0.838,5,-3.914,0,0.114,0.0249,0.025,0.242,0.924,143.04,84,dance_pop pop post-teen_pop,83,"[dance_pop, pop, post-teen_pop]",low,Neutral
19,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,Crazy In Love,0.664,0.758,2,-6.583,0,0.21,0.00238,0.0,0.0598,0.701,99.259,86,dance_pop pop r&b,25,"[dance_pop, pop, r&b]",high,Negative
46,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,Rock Your Body,0.892,0.714,4,-6.055,0,0.141,0.201,0.000234,0.0521,0.817,100.972,81,dance_pop pop,79,"[dance_pop, pop]",low,Neutral
55,Shaggy,1lzr43nnXAijIGYnCT8M8H,It Wasn't Me,0.853,0.606,0,-4.596,1,0.0713,0.0561,0.0,0.313,0.654,94.759,74,pop_rap reggae_fusion,2,"[pop_rap, reggae_fusion]",low,Neutral


#### One-hot encoding

One-hot encoding is a method to transform categorical variables into a machine-readable form. This is done by converting each category into a column so that each category can be represented as either True or False.


![ohe_img](https://cdn-images-1.medium.com/max/1600/0*KVGWy9c3eo2RiAe3.png) 

In [11]:
def ohe_prep(df, column, new_name): 
    ''' 
    Create One Hot Encoded features of a specific column
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    column (str): Column to be processed
    new_name (str): new column name to be used
        
    Output: 
    tf_df: One-hot encoded features 
    '''
    
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    return tf_df

In [12]:
# One-hot encoding for the subjectivity 
subject_ohe = ohe_prep(sentiment, 'subjectivity','subject')
subject_ohe.iloc[0]

subject|high      0
subject|low       1
subject|medium    0
Name: 0, dtype: uint8
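
To make the idea concrete, here is a minimal sketch of one-hot encoding on a made-up polarity column. The toy data is hypothetical, and `dtype=int` is passed only so that the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"polarity": ["Neutral", "Negative", "Positive", "Neutral"]})

# One indicator column per category, sorted alphabetically by category name
ohe = pd.get_dummies(toy["polarity"], dtype=int)
ohe.columns = ["polar|" + str(c) for c in ohe.columns]
print(ohe)
```

Each row contains exactly one 1, in the column that matches the row's category.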

#### TF-IDF
TF-IDF, also known as Term Frequency-Inverse Document Frequency, is a tool to quantify words in a set of documents. The goal of TF-IDF is to show the importance of a word within a document and across the corpus. The general formula for calculating TF-IDF is:
$$ \text{Term Frequency}\times\text{Inverse Document Frequency}$$
- **Term Frequency (TF)**: The number of times a term appears in a document divided by the total word count of that document.
- **Inverse Document Frequency (IDF)**: The log of the total number of documents N divided by the document frequency. Document frequency is the number of documents in which the term is present.

The motivation is to find words that are important in a single document while also accounting for the entire corpus. The log is taken to dampen the impact of a large N, which would otherwise lead to a very large IDF compared to TF. TF focuses on the importance of a word in a document, while IDF focuses on the importance of a word across documents.

In this project, the documents are analogous to songs. Therefore, we measure how prominent each genre is in a song and how prevalent it is across songs to determine the weight of the genre. This works better than plain one-hot encoding, which carries no weights for how important and widespread each genre is, leading to overweighting on uncommon genres.
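
The definitions above can be worked through by hand. The sketch below uses a made-up three-song corpus and the plain textbook formulas (TF as a share of the document, IDF as log of N over document frequency). The helper names `tf` and `idf` are hypothetical.

```python
import math

# Hypothetical toy corpus: each "document" is one song's genre list
songs = [
    ["pop", "dance_pop"],
    ["pop", "rock"],
    ["pop", "dance_pop", "r&b"],
]

def tf(term, doc):
    # Share of the document's terms that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / document frequency); assumes the term occurs in at least one document
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)

# Score the genres of the second song
for term in ("pop", "rock"):
    score = tf(term, songs[1]) * idf(term, songs)
    print(term, round(score, 3))
```

Here `pop` appears in every song, so its IDF is zero and it carries no weight, while `rock` is unique to one song and scores highest. Note that `TfidfVectorizer` with default settings uses a smoothed IDF and L2-normalizes each row, so its values will differ from this hand calculation.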

![tfidf_img](https://miro.medium.com/max/1400/1*V9ac4hLVyms79jl65Ym_Bw.jpeg)

In [13]:
# TF-IDF implementation
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(songDF['genres_list'].apply(lambda x: " ".join(x)))
genre_df = pd.DataFrame(tfidf_matrix.toarray())
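# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
genre_df.drop(columns='genre|unknown')  # note: the result is not assigned, so the column is kept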
genre_df.reset_index(drop = True, inplace=True)
genre_df.iloc[0]



genre|_hip_hop            0.0
genre|abstract_hip_hop    0.0
genre|acoustic_pop        0.0
genre|adult_standards     0.0
genre|aesthetic_rap       0.0
                         ... 
genre|world_devotional    0.0
genre|world_worship       0.0
genre|worship             0.0
genre|yacht_rock          0.0
genre|zolo                0.0
Name: 0, Length: 527, dtype: float64

#### Normalization
Lastly, we need to normalize some variables. As shown below, the popularity variables are not scaled to the range 0 to 1, which would be problematic in the cosine similarity function later on. The audio features are not normalized either.

To solve this problem, we used the `MinMaxScaler()` function from `scikit-learn`, which automatically scales all values from their min and max into a range of 0 to 1.

In [14]:
# artist_pop distribution descriptive stats
print(songDF['artist_pop'].describe())

count    1000.000000
mean       64.627000
std        17.250085
min         0.000000
25%        55.000000
50%        67.000000
75%        77.000000
max        98.000000
Name: artist_pop, dtype: float64


In [15]:
# Normalization
pop = songDF[["artist_pop"]].reset_index(drop = True)
scaler = MinMaxScaler()
pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns)
pop_scaled.head()

Unnamed: 0,artist_pop
0,0.755102
1,0.857143
2,0.877551
3,0.826531
4,0.755102
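
The scaled values above can be checked by hand with the min-max formula (x - min) / (max - min). The sketch below is plain Python rather than `MinMaxScaler`. The helper name `min_max_scale` is hypothetical, and returning 0.0 for a constant column is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale, so map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# artist_pop of the first five songs, plus the column's min (0) and max (98)
artist_pop = [74, 84, 86, 81, 74, 0, 98]
print([round(v, 6) for v in min_max_scale(artist_pop)[:5]])
# [0.755102, 0.857143, 0.877551, 0.826531, 0.755102]
```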


#### Feature Generation
Finally, we generate all features mentioned above using the following cell and concatenate all variables into a new dataframe.

In [16]:
def create_feature_set(df, float_cols):
    '''
    Process spotify df to create a final set of features that will be used to generate recommendations
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    float_cols (list(str)): List of float columns that will be scaled
            
    Output: 
    final (pandas dataframe): Final set of features 
    '''
    
    # Tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['genres_list'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
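    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
    genre_df.drop(columns='genre|unknown') # intended to drop the unknown genre; the result is not assigned, so the column is kept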
    genre_df.reset_index(drop = True, inplace=True)
    
    # Sentiment analysis
    df = sentiment_analysis(df, "track_name")

    # One-hot Encoding
    subject_ohe = ohe_prep(df, 'subjectivity','subject') * 0.3
    polar_ohe = ohe_prep(df, 'polarity','polar') * 0.5
    key_ohe = ohe_prep(df, 'key','key') * 0.5
    mode_ohe = ohe_prep(df, 'mode','mode') * 0.5

    # Normalization
    # Scale popularity columns
    pop = df[["artist_pop","track_pop"]].reset_index(drop = True)
    scaler = MinMaxScaler()
    pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns) * 0.2 

    # Scale audio columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    # Concatenate all features
    final = pd.concat([genre_df, floats_scaled, pop_scaled, subject_ohe, polar_ohe, key_ohe, mode_ohe], axis = 1)
    
    # Add song id
    final['id']=df['id'].values
    
    return final

In [17]:
# Save the data and generate the features
float_cols = songDF.dtypes[songDF.dtypes == 'float64'].index.values
songDF.to_csv("../data/allsong_data.csv", index = False)

# Generate features
complete_feature_set = create_feature_set(songDF, float_cols=float_cols)
complete_feature_set.to_csv("../data/complete_feature.csv", index = False)
complete_feature_set.head()




Unnamed: 0,genre|_hip_hop,genre|abstract_hip_hop,genre|acoustic_pop,genre|adult_standards,genre|aesthetic_rap,genre|afrofuturism,genre|alabama_indie,genre|album_rock,genre|albuquerque_indie,genre|alt_z,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0UaMYEvWZi0ZqiDOoHU3YI
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,6I9VzXrHxO9rA9A5euc8Ak
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0WqIKmW4BTrj3eJFmnCKMv
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1AWQoqb9bSvzTjaLralEkT
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1lzr43nnXAijIGYnCT8M8H


### Content-based Filtering Recommendation
The next step is to perform content-based filtering based on the song features we have. To do so, we combine all songs in a playlist into one summary vector. Then, we compute the similarity between this playlist vector and every song in the database that is not in the playlist. Finally, we use the similarity scores to retrieve the most relevant songs outside the playlist and recommend them.

There are three steps in this section:
1. **Choose playlist**: In this part, we retrieve a playlist.
2. **Extract features**: In this part, we retrieve the features of the songs in the playlist of interest and of the songs outside it.
3. **Find similarity**: In this part, we compare the summarized playlist features with all other songs.

#### Choose Playlist
In this part, we test the data with *Mom's playlist* in the dataset.


In [18]:
### This is the test data
# playlistDF_test = pd.read_csv("../data/test_playlist.csv")
# playlistDF_test = playlist_preprocess(playlistDF_test)
# playlistDF_test.head()

# Test playlist:  Mom's playlist
playlistDF_test = playlistDF[playlistDF['name']=="Mom's playlist"]
playlistDF_test.head()
playlistDF_test.to_csv("../data/test_playlist.csv")

#### Extract features
The next step is to generate all the features. We first use the `id` to separate the songs that are in the playlist from those that are not. Then, we simply add the features of all songs in the playlist together into a summary vector, as illustrated in the figure below, which is a modified version of the work by [Madhav Thaker](https://github.com/madhavthaker/spotify-recommendation-system/blob/main/spotify-recommendation-engine.ipynb).

![pipeline_img](flow.png)


In [19]:
def generate_playlist_feature(complete_feature_set, playlist_df):
    '''
    Summarize a user's playlist into a single vector
    ---
    Input: 
    complete_feature_set (pandas dataframe): Dataframe which includes all of the features for the spotify songs
    playlist_df (pandas dataframe): playlist dataframe
        
    Output: 
    complete_feature_set_playlist_final (pandas series): single vector feature that summarizes the playlist
    complete_feature_set_nonplaylist (pandas dataframe): features (including id) of all songs that are not in the playlist
    '''
    
    # Find song features in the playlist
    complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]
    # Find all non-playlist song features
    complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]
    complete_feature_set_playlist_final = complete_feature_set_playlist.drop(columns = "id")
    return complete_feature_set_playlist_final.sum(axis = 0), complete_feature_set_nonplaylist
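
The split-and-sum logic of `generate_playlist_feature` can be seen on a tiny example. The feature table, column names, and ids below are made up for illustration.

```python
import pandas as pd

# Hypothetical feature table: two numeric features plus the song id
features = pd.DataFrame({
    "genre|pop": [0.8, 0.0, 0.6, 0.1],
    "mode|1": [0.5, 0.5, 0.0, 0.5],
    "id": ["a", "b", "c", "d"],
})
playlist_ids = ["a", "c"]  # songs that belong to the playlist

in_playlist = features["id"].isin(playlist_ids)
# Sum the playlist songs column by column into one summary vector
playlist_vector = features[in_playlist].drop(columns="id").sum(axis=0)
# Everything else is a candidate for recommendation
candidates = features[~in_playlist]

print(playlist_vector)
print(candidates["id"].tolist())
```

Because cosine similarity ignores vector length, summing the playlist songs should rank candidates the same way as averaging them would.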

In [20]:
# Generate the features
complete_feature_set_playlist_vector, complete_feature_set_nonplaylist = generate_playlist_feature(complete_feature_set, playlistDF_test)

In [21]:
# Non-playlist features
complete_feature_set_nonplaylist.head()

Unnamed: 0,genre|_hip_hop,genre|abstract_hip_hop,genre|acoustic_pop,genre|adult_standards,genre|aesthetic_rap,genre|afrofuturism,genre|alabama_indie,genre|album_rock,genre|albuquerque_indie,genre|alt_z,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0UaMYEvWZi0ZqiDOoHU3YI
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,6I9VzXrHxO9rA9A5euc8Ak
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0WqIKmW4BTrj3eJFmnCKMv
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1AWQoqb9bSvzTjaLralEkT
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1lzr43nnXAijIGYnCT8M8H


In [22]:
# Summarized playlist features
complete_feature_set_playlist_vector

genre|_hip_hop             0.0
genre|abstract_hip_hop     0.0
genre|acoustic_pop         0.0
genre|adult_standards      0.0
genre|aesthetic_rap        0.0
                          ... 
key|9                      3.0
key|10                     2.5
key|11                     3.0
mode|0                     8.0
mode|1                    29.0
Length: 558, dtype: float64

#### Find similarity
The last piece of the puzzle is to find the similarity between the summarized playlist vector and all other songs. There are many similarity measures, but one of the most common is **cosine similarity**.

Cosine similarity measures the similarity between two vectors as the cosine of the angle between them. If we imagine our song vectors as only two-dimensional, the visual representation would look similar to the figure below. 

The mathematical formula can be expressed as:
$$\text{Cosine Sim}(A,B)=\frac{A\cdot B}{||A||\times||B||}=\frac{\sum_{i=1}^n A_i\times B_i}{\sqrt{\sum_{i=1}^n A_i^2}\times \sqrt{\sum_{i=1}^n B_i^2}}$$

In our code, we used the `cosine_similarity()` function from `scikit-learn` to measure the similarity between each song and the summarized playlist vector.

One big advantage of doing this is that the time complexity of the whole algorithm is that of a matrix multiplication, since we compute the cosine similarity between each row vector (song) and the column vector of summarized playlist features.

![cossim_img](https://images.deepai.org/glossary-terms/cosine-similarity-1007790.jpg)
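
The formula above can be written directly with NumPy as one matrix-vector product followed by a division by the norms. This is a minimal sketch on made-up two-dimensional vectors, not the `scikit-learn` implementation. The function name `cosine_sim_to_vector` is hypothetical.

```python
import numpy as np

def cosine_sim_to_vector(song_matrix, playlist_vector):
    """Cosine similarity between each row of song_matrix and playlist_vector."""
    dots = song_matrix @ playlist_vector  # A . B for every row at once
    norms = np.linalg.norm(song_matrix, axis=1) * np.linalg.norm(playlist_vector)
    return dots / norms  # assumes no all-zero vectors

# Hypothetical 2-D features for three candidate songs
songs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
playlist = np.array([2.0, 0.0])

sims = cosine_sim_to_vector(songs, playlist)
ranking = np.argsort(-sims)  # indices from most to least similar
print(sims, ranking)
```

The first song points in the same direction as the playlist vector and scores 1, the third is orthogonal to it and scores 0, and the length of each vector plays no role.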

In [23]:
def generate_playlist_recos(df, features, nonplaylist_features):
    '''
    Generate recommendations based on the songs in a specific playlist.
    ---
    Input: 
    df (pandas dataframe): spotify dataframe
    features (pandas series): summarized playlist feature (single vector)
    nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist
        
    Output: 
    non_playlist_df_top_40: Top 40 recommendations for that playlist
    '''
    
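    # Copy the filtered rows so that adding the 'sim' column does not trigger a SettingWithCopyWarning
    non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)].copy()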
    # Find cosine similarity between the playlist and the complete song set
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40)
    
    return non_playlist_df_top_40

In [24]:
# Generate top 10 recommendations
recommend = generate_playlist_recos(songDF, complete_feature_set_playlist_vector, complete_feature_set_nonplaylist)
recommend.head(10)



Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,artist_pop,genres,track_pop,genres_list,subjectivity,polarity,sim
3796,Young the Giant,7w5Ww1cW8U9v8Q3g4qLpVD,Repeat,0.656,0.806,1,-5.354,1,0.0433,0.0454,...,0.341,0.572,149.975,69,modern_alternative_rock modern_rock pop_rock r...,45,"[modern_alternative_rock, modern_rock, pop_roc...",low,Neutral,0.714529
3183,James Bay,7tmtOEDxPN7CWaQWBsG1DY,Hold Back The River,0.715,0.715,5,-7.364,1,0.0904,0.0526,...,0.0936,0.506,134.923,76,modern_rock neo_mellow pop pop_rock,65,"[modern_rock, neo_mellow, pop, pop_rock]",low,Neutral,0.69995
3812,Kings of Leon,4jyZ3I1hYRcOkI8RJhxgCb,Over,0.414,0.799,2,-8.253,1,0.0335,0.00199,...,0.114,0.283,126.852,77,modern_rock rock,52,"[modern_rock, rock]",low,Neutral,0.698399
227,Beyoncé,6d8A5sAx9TfdeseDvfWNHd,Check On It - feat. Bun B and Slim Thug,0.705,0.796,7,-6.845,1,0.267,0.0708,...,0.388,0.864,166.042,86,dance_pop pop r&b,34,"[dance_pop, pop, r&b]",low,Neutral,0.690241
1335,Bruno Mars,6SKwQghsR8AISlxhcwyA9R,Marry You,0.621,0.82,10,-4.865,1,0.0367,0.332,...,0.104,0.452,144.905,92,dance_pop pop,77,"[dance_pop, pop]",low,Neutral,0.68943
3821,Kings of Leon,6qV3OEpN6uFCZnzNSslbn1,Conversation Piece,0.425,0.593,4,-7.419,1,0.0241,0.35,...,0.216,0.653,170.506,77,modern_rock rock,53,"[modern_rock, rock]",low,Neutral,0.689029
493,Demi Lovato,5c1sfI6wIQEsSUw0xrkFdl,This Is Me,0.485,0.823,1,-2.816,1,0.0362,0.00981,...,0.116,0.555,91.005,83,dance_pop pop post-teen_pop,0,"[dance_pop, pop, post-teen_pop]",low,Neutral,0.686943
1160,P!nk,12lZTPlXwUtrQuhEty6098,Raise Your Glass,0.7,0.695,7,-4.973,1,0.0897,0.00629,...,0.0319,0.633,122.028,83,dance_pop pop,0,"[dance_pop, pop]",low,Neutral,0.678105
1398,The Black Eyed Peas,70cTMpcgWMcR18t9MRJFjB,I Gotta Feeling,0.743,0.766,0,-6.375,1,0.0265,0.0873,...,0.509,0.61,127.96,82,dance_pop pop pop_rap,0,"[dance_pop, pop, pop_rap]",low,Neutral,0.678001
3830,Night Riots,6VVd4kRfzBsZqFbvEAjloh,All For You,0.475,0.881,2,-3.208,1,0.0574,0.0103,...,0.0739,0.427,140.912,47,indie_poptimism modern_alternative_rock modern...,49,"[indie_poptimism, modern_alternative_rock, mod...",low,Neutral,0.677305
