# Content-based Filtering Spotify Song Recommendation System

This notebook describes is a content-based filtering approach for Spotify Song recommendation. 
The code accompanys a [medium article]() called "Part III: Build a Recommendation System with Spotify Datasets".
This notebook is the thrid article in a [Spotify Song Recommendation System series]() by the ENCA team.

## Structure

- Package Setup
- Preprocessing
- Feature Generation
- Content-based Filtering Recommendation

## Setup

**Downloaded Package**
- TextBlob

**Imported Packages**

- Pandas
- Scikit-learn
- re
- Spotipy (refer to Part I of the series)

## Credits

This notebook builds on top of Madhav Thaker's [spotify-recommendation-system tutorial](https://github.com/madhavthaker/spotify-recommendation-system).





### Package Setup
#### Download Dependencies

In [None]:
# !pip install textblob

#### Import Dependencies

In [1]:
# Import library
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import re
from scripts.get_track_info import get_track_info

#### Data Import
The data here is not raw data, it is imported after the retrieving Spotify data in Part I. Please refer to [Part I]() for more information.

In [2]:
# Import processed data
playlistDF = pd.read_csv("../data/processed_data.csv")
print(playlistDF.columns)
playlistDF.drop(columns=["Unnamed: 0","Unnamed: 0.1"], inplace = True)
playlistDF.drop(columns=["track_name_x"], inplace=True)
playlistDF.rename(columns={"track_name_y": "track_name"}, inplace=True)
playlistDF.head()

Index(['Unnamed: 0.1', 'Unnamed: 0', 'track_id', 'artists', 'album_name_x',
       'track_name_x', 'popularity', 'duration_ms', 'explicit', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'track_genre', 'id', 'artist_pop', 'artist_name', 'artist_uri',
       'track_name_y', 'album_uri', 'album_name_y', 'image_uri', 'track_pop',
       'genres'],
      dtype='object')


Unnamed: 0,track_id,artists,album_name_x,popularity,duration_ms,explicit,danceability,energy,key,loudness,...,id,artist_pop,artist_name,artist_uri,track_name,album_uri,album_name_y,image_uri,track_pop,genres
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,73,230666,False,0.676,0.461,1,-6.746,...,5SuOikwiRyPMVoIQDJUgSV,58,Gen Hoshino,spotify:artist:1S2S00lgLYLGHWA44qGEUs,Comedy,spotify:album:41ERrwfzos93Xlf6hFBiDn,Comedy,https://i.scdn.co/image/ab67616d0000b27326573d...,65,j-acoustic j-pop japanese_singer-songwriter
1,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,73,230666,False,0.676,0.461,1,-6.746,...,5SuOikwiRyPMVoIQDJUgSV,58,Gen Hoshino,spotify:artist:1S2S00lgLYLGHWA44qGEUs,Comedy,spotify:album:41ERrwfzos93Xlf6hFBiDn,Comedy,https://i.scdn.co/image/ab67616d0000b27326573d...,65,j-acoustic j-pop japanese_singer-songwriter
2,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,73,230666,False,0.676,0.461,1,-6.746,...,5SuOikwiRyPMVoIQDJUgSV,58,Gen Hoshino,spotify:artist:1S2S00lgLYLGHWA44qGEUs,Comedy,spotify:album:41ERrwfzos93Xlf6hFBiDn,Comedy,https://i.scdn.co/image/ab67616d0000b27326573d...,65,j-acoustic j-pop japanese_singer-songwriter
3,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,73,230666,False,0.676,0.461,1,-6.746,...,5SuOikwiRyPMVoIQDJUgSV,58,Gen Hoshino,spotify:artist:1S2S00lgLYLGHWA44qGEUs,Comedy,spotify:album:41ERrwfzos93Xlf6hFBiDn,Comedy,https://i.scdn.co/image/ab67616d0000b27326573d...,65,j-acoustic j-pop japanese_singer-songwriter
4,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),55,149610,False,0.42,0.166,1,-17.235,...,4qPNDBW1i3p13qLCt0Ki3A,43,Ben Woodward,spotify:artist:142VT1MtWzaD13CnOiKFDn,Ghost - Acoustic,spotify:album:7eVgjZsbqcqTIAyYUpOJuR,Ghost (Acoustic),https://i.scdn.co/image/ab67616d0000b273935a71...,45,acoustic_chill


### Preprocessing

The following cells conducts further preprocessing for the imported data to cater the data specifically for the content-based filtering.

Here is the general pipeline:
1. Useful data Selection
2. List concatenation

#### Useful Data Selection

Due to the nature of playlist, there will be duplicates in songs across multiple playlists. Therefore, I combined the song and the artist and used the `drop_duplicates()` function in `pandas` to remove duplicate songs when building the base dataframe with all unique songs.

In [3]:
# Show that there are duplicates of songs accross playlists
playlistDF[['artist_name','track_name']]

Unnamed: 0,artist_name,track_name
0,Gen Hoshino,Comedy
1,Gen Hoshino,Comedy
2,Gen Hoshino,Comedy
3,Gen Hoshino,Comedy
4,Ben Woodward,Ghost - Acoustic
...,...,...
7163,Mrs. GREEN APPLE,私
7164,Adrian Barba,Sola Nunca Estarás
7165,The Covers Duo,Atraparlos Ya! (Pokemon)
7166,Doblecero,Rap de Buda


Now, I drop the duplicates with `pandas` by combining the artist name and track name. This is to prevent droping songs from different artists but with the same names.

In [4]:
# Drop song duplicates
def drop_duplicates(df):
    '''
    Drop duplicate songs
    '''
    df['artists_song'] = df.apply(lambda row: row['artist_name']+row['track_name'],axis = 1)
    return df.drop_duplicates('artists_song')

songDF = drop_duplicates(playlistDF)
print("Are all songs unique: ",len(pd.unique(songDF.artists_song))==len(songDF))


Are all songs unique:  True


In [5]:
display(songDF)

Unnamed: 0,track_id,artists,album_name_x,popularity,duration_ms,explicit,danceability,energy,key,loudness,...,artist_pop,artist_name,artist_uri,track_name,album_uri,album_name_y,image_uri,track_pop,genres,artists_song
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,73,230666,False,0.676,0.4610,1,-6.746,...,58,Gen Hoshino,spotify:artist:1S2S00lgLYLGHWA44qGEUs,Comedy,spotify:album:41ERrwfzos93Xlf6hFBiDn,Comedy,https://i.scdn.co/image/ab67616d0000b27326573d...,65,j-acoustic j-pop japanese_singer-songwriter,Gen HoshinoComedy
4,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),55,149610,False,0.420,0.1660,1,-17.235,...,43,Ben Woodward,spotify:artist:142VT1MtWzaD13CnOiKFDn,Ghost - Acoustic,spotify:album:7eVgjZsbqcqTIAyYUpOJuR,Ghost (Acoustic),https://i.scdn.co/image/ab67616d0000b273935a71...,45,acoustic_chill,Ben WoodwardGhost - Acoustic
6,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,57,210826,False,0.438,0.3590,0,-9.734,...,54,Ingrid Michaelson,spotify:artist:2vm8GdHyrJh2O2MfbQFYG0,To Begin Again,spotify:album:5rrWBCnnYiFaT5EoyCeikd,To Begin Again,https://i.scdn.co/image/ab67616d0000b273ed344b...,51,acoustic_pop ectofolk lilith neo_mellow,Ingrid MichaelsonTo Begin Again
7,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,71,201933,False,0.266,0.0596,0,-18.515,...,57,Kina Grannis,spotify:artist:7h4j9YTJJuAHzLCc3KCvYu,Can't Help Falling In Love,spotify:album:2wMz3oVNS1bMXaEWY6QWmA,Crazy Rich Asians (Original Motion Picture Sou...,https://i.scdn.co/image/ab67616d0000b273bb3df7...,67,acoustic_pop viral_pop,Kina GrannisCan't Help Falling In Love
8,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,82,198853,False,0.618,0.4430,2,-9.681,...,59,Chord Overstreet,spotify:artist:5D3muNJhYYunbRkh3FKgX0,Hold On,spotify:album:2EfmyRWheMtmVTCIsptsLi,Hold On,https://i.scdn.co/image/ab67616d0000b273c60473...,79,acoustic_pop singer-songwriter_pop,Chord OverstreetHold On
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7161,4QYGss1JbmGTxte5nA4JNX,Mrs. GREEN APPLE,TWELVE,37,367506,False,0.484,0.6180,7,-5.623,...,75,Mrs. GREEN APPLE,spotify:artist:4QvgGvpgzgyUOo8Yp8LDm9,私,spotify:album:4seExhof6lZ2yg5dZfengb,TWELVE,https://i.scdn.co/image/ab67616d0000b273566489...,46,anime_rock j-pop j-rock,Mrs. GREEN APPLE私
7164,7pqcJ1n6HiGx7vOiDcb1Ml,Adrian Barba,Somos Una Sola Mente,36,240000,False,0.721,0.4860,1,-12.614,...,46,Adrian Barba,spotify:artist:5KK1FO30lzYPqnPYyS9bu5,Sola Nunca Estarás,spotify:album:2b16yWqku8m9us1Y2zkEPI,Somos Una Sola Mente,https://i.scdn.co/image/ab67616d0000b273ce69fe...,32,anime_latino,Adrian BarbaSola Nunca Estarás
7165,0ijeNNh5BfPrZb2mPIUuR2,The Covers Duo,Anime Openings 1,36,60010,False,0.654,0.8560,10,-5.207,...,36,The Covers Duo,spotify:artist:0vlbXMsO1PRqmfJv5tAJ8G,Atraparlos Ya! (Pokemon),spotify:album:2MWOz473PC86eJ4uasKHHK,Anime Openings 1,https://i.scdn.co/image/ab67616d0000b273fab27a...,35,anime_latino,The Covers DuoAtraparlos Ya! (Pokemon)
7166,2ccfTK4zy5LEZWsWmdPush,Doblecero,Rap de Buda,37,219536,False,0.743,0.7430,4,-6.982,...,44,Doblecero,spotify:artist:6qqvdLm9ZVjxCgHxGcL5ZW,Rap de Buda,spotify:album:2b2J2SRfyGCgNsU2KEI8gB,Rap de Buda,https://i.scdn.co/image/ab67616d0000b2732af3f6...,33,latin_viral_rap rap_anime,DobleceroRap de Buda


Finally, I select the features I would use later on. The following is a short list of them in categories:
1. Metadata
    - id
    - genres
    - artist_pop
    - track_pop
2. Audio
    - **Mood**: Danceability, Valence, Energy, Tempo
    - **Properties**: Loudness, Speechiness, Instrumentalness
    - **Context**: Liveness, Acousticness
    - **metadata**: key, mode
3. Text
    - track_name

In [6]:
# Select useful columns
def select_cols(df):
       '''
       Select useful columns
       '''
       return df[['artist_name','id','track_name','danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', "artist_pop", "genres", "track_pop","image_uri"]]
songDF = select_cols(songDF)
songDF.head()

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,artist_pop,genres,track_pop,image_uri
0,Gen Hoshino,5SuOikwiRyPMVoIQDJUgSV,Comedy,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,58,j-acoustic j-pop japanese_singer-songwriter,65,https://i.scdn.co/image/ab67616d0000b27326573d...
4,Ben Woodward,4qPNDBW1i3p13qLCt0Ki3A,Ghost - Acoustic,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,43,acoustic_chill,45,https://i.scdn.co/image/ab67616d0000b273935a71...
6,Ingrid Michaelson,1iJBSr7s7jYXzM8EGcbK5b,To Begin Again,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,54,acoustic_pop ectofolk lilith neo_mellow,51,https://i.scdn.co/image/ab67616d0000b273ed344b...
7,Kina Grannis,6lfxq3CG4xtTiEg7opyCyx,Can't Help Falling In Love,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,57,acoustic_pop viral_pop,67,https://i.scdn.co/image/ab67616d0000b273bb3df7...
8,Chord Overstreet,5vjLSffimiIP26QG5WcN2K,Hold On,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,59,acoustic_pop singer-songwriter_pop,79,https://i.scdn.co/image/ab67616d0000b273c60473...


#### List Concatenation

After selecting the useful data, due to the import format of a dataframe, we need to convert the `genres` columns back into a list. This is done by using the `split()` function:

In [7]:
def genre_preprocess(df):
    '''
    Preprocess the genre data
    '''
    df['genres_list'] = df['genres'].apply(lambda x: x.split(" "))
    return df
songDF = genre_preprocess(songDF)
songDF['genres_list'].head()


0    [j-acoustic, j-pop, japanese_singer-songwriter]
4                                   [acoustic_chill]
6       [acoustic_pop, ectofolk, lilith, neo_mellow]
7                          [acoustic_pop, viral_pop]
8              [acoustic_pop, singer-songwriter_pop]
Name: genres_list, dtype: object

Lastly, I created a pipeline for preprocessing any new playlist as below:

In [8]:
def playlist_preprocess(df):
    '''
    Preprocess imported playlist
    '''
    df = drop_duplicates(df)
    df = select_cols(df)
    df = genre_preprocess(df)

    return df

### Feature Generation
Now that the data is usable, we can now feature-engineer the data for the purpose of the recommendation system. In this project, the following process is conducted into a pipeline for feature generation.

1. Sentiment Analysis
2. One-hot Encoding
3. TF-IDF
4. Normalization

#### Sentiment Analysis

In our data, we will perform a simply sentiment analysis using subjectivity and polarity form `TextBlob` package.
- **Subjectivity** (0,1): The amount of personal opinion and factual information contained in the text.
- **Polarity** (-1,1): The degree of strong or clearly defined sentiment accounting for negation.

We will then use one-hot encoding to list the sentiment of the song titles as one of the input.

In [9]:
def getSubjectivity(text):
  '''
  Getting the Subjectivity using TextBlob
  '''
  return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
  '''
  Getting the Polarity using TextBlob
  '''
  return TextBlob(text).sentiment.polarity

def getAnalysis(score, task="polarity"):
  '''
  Categorizing the Polarity & Subjectivity score
  '''
  if task == "subjectivity":
    if score < 1/3:
      return "low"
    elif score > 1/3:
      return "high"
    else:
      return "medium"
  else:
    if score < 0:
      return 'Negative'
    elif score == 0:
      return 'Neutral'
    else:
      return 'Positive'

def sentiment_analysis(df, text_col):
  '''
  Perform sentiment analysis on text
  ---
  Input:
  df (pandas dataframe): Dataframe of interest
  text_col (str): column of interest
  '''
  df['subjectivity'] = df[text_col].apply(getSubjectivity).apply(lambda x: getAnalysis(x,"subjectivity"))
  df['polarity'] = df[text_col].apply(getPolarity).apply(getAnalysis)
  return df

In [10]:
# Show result
sentiment = sentiment_analysis(songDF, "track_name")
sentiment.head()

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,artist_pop,genres,track_pop,image_uri,genres_list,subjectivity,polarity
0,Gen Hoshino,5SuOikwiRyPMVoIQDJUgSV,Comedy,0.676,0.461,1,-6.746,0,0.143,0.0322,...,0.358,0.715,87.917,58,j-acoustic j-pop japanese_singer-songwriter,65,https://i.scdn.co/image/ab67616d0000b27326573d...,"[j-acoustic, j-pop, japanese_singer-songwriter]",low,Neutral
4,Ben Woodward,4qPNDBW1i3p13qLCt0Ki3A,Ghost - Acoustic,0.42,0.166,1,-17.235,1,0.0763,0.924,...,0.101,0.267,77.489,43,acoustic_chill,45,https://i.scdn.co/image/ab67616d0000b273935a71...,[acoustic_chill],low,Neutral
6,Ingrid Michaelson,1iJBSr7s7jYXzM8EGcbK5b,To Begin Again,0.438,0.359,0,-9.734,1,0.0557,0.21,...,0.117,0.12,76.332,54,acoustic_pop ectofolk lilith neo_mellow,51,https://i.scdn.co/image/ab67616d0000b273ed344b...,"[acoustic_pop, ectofolk, lilith, neo_mellow]",low,Neutral
7,Kina Grannis,6lfxq3CG4xtTiEg7opyCyx,Can't Help Falling In Love,0.266,0.0596,0,-18.515,1,0.0363,0.905,...,0.132,0.143,181.74,57,acoustic_pop viral_pop,67,https://i.scdn.co/image/ab67616d0000b273bb3df7...,"[acoustic_pop, viral_pop]",high,Positive
8,Chord Overstreet,5vjLSffimiIP26QG5WcN2K,Hold On,0.618,0.443,2,-9.681,1,0.0526,0.469,...,0.0829,0.167,119.949,59,acoustic_pop singer-songwriter_pop,79,https://i.scdn.co/image/ab67616d0000b273c60473...,"[acoustic_pop, singer-songwriter_pop]",low,Neutral


#### One-hot encoding

One-hot encoding is a method to transform categorical variables into a machine-understandable langauge. This is done by converting each category into a column so that each category can be represented as either True or False.


![ohe_img](https://cdn-images-1.medium.com/max/1600/0*KVGWy9c3eo2RiAe3.png) 

In [11]:
def ohe_prep(df, column, new_name): 
    ''' 
    Create One Hot Encoded features of a specific column
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    column (str): Column to be processed
    new_name (str): new column name to be used
        
    Output: 
    tf_df: One-hot encoded features 
    '''
    
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    return tf_df

In [12]:
# One-hot encoding for the subjectivity 
subject_ohe = ohe_prep(sentiment, 'subjectivity','subject')
subject_ohe.iloc[0]

subject|high      False
subject|low        True
subject|medium    False
Name: 0, dtype: bool

#### TF-IDF
TF-IDF, also known as Term Frequency-Inverse Document Frequency, is a tool to quantify words in a set of documents. The goal of TF-IDF is to show the importance of a word in the documents and the corpus. The general formula for calculating TF-IDF is:
$$ \text{Term Frequency}\times\text{Inverse Document Frequency}$$
- **Term Frequency (TF)**: The number of times a term appears in each document divided by the total word count in the document.
- **Inverse Document Frequency (IDF)**: The log value of the document frequency. Document frequency is the total number of documents where one term is present.

The motivation is to find words that are not only important in each document but also accounting for the entire corpus. The log value was taken to decrease the impact of a large N, which would lead a very large IDF compared to TF. TF is focused on importance of a word in a document, while IDF is focused on the importance of a word across documents.

In this project, the documents are analogous to songs. Therefore, we are calculating the most prominent genre in each song and their prevelent across songs to determine the weight of the genre. This is much better than simply one-hot encoding since there is no weights to determine how important and widespread each genre is, leading to overweighting on uncommon genres.

![tfidf_img](https://miro.medium.com/max/1400/1*V9ac4hLVyms79jl65Ym_Bw.jpeg)

In [13]:
# TF-IDF implementation
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(songDF['genres_list'].apply(lambda x: " ".join(x)))
genre_df = pd.DataFrame(tfidf_matrix.toarray())
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
# genre_df.drop(columns='genre|unknown')
genre_df.reset_index(drop = True, inplace=True)
genre_df.iloc[0]

genre|21st_century_classical    0.0
genre|_hip_hop                  0.0
genre|_indie                    0.0
genre|abstract                  0.0
genre|abstract_idm              0.0
                               ... 
genre|worship                   0.0
genre|yaoi                      0.0
genre|zambian_gospel            0.0
genre|zambian_hip_hop           0.0
genre|zambian_pop               0.0
Name: 0, Length: 703, dtype: float64

#### Normalization
Lastly, we need to normalize some variables. As shown below, the popularity variables are not normalized to 0 to 1, which would be problematic in the consine similarity function later on. In addition, the audio features are also not normalized. 

To solve this problem, we used the `MinMaxScaler()` function from `scikit learn` which automatically scales all values from the min and max into a range of 0 to 1.

In [14]:
# artist_pop distribution descriptive stats
print(songDF['artist_pop'].describe())

count    4269.000000
mean       50.886625
std        15.335571
min         0.000000
25%        41.000000
50%        52.000000
75%        61.000000
max        83.000000
Name: artist_pop, dtype: float64


In [15]:
# Normalization
pop = songDF[["artist_pop"]].reset_index(drop = True)
scaler = MinMaxScaler()
pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns)
pop_scaled.head()

Unnamed: 0,artist_pop
0,0.698795
1,0.518072
2,0.650602
3,0.686747
4,0.710843


#### Feature Generation
Finially, we generate all features mentioned above using the following cell and concatenate all variables into a new dataframe.

In [16]:
def create_feature_set(df, float_cols):
    '''
    Process spotify df to create a final set of features that will be used to generate recommendations
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    float_cols (list(str)): List of float columns that will be scaled
            
    Output: 
    final (pandas dataframe): Final set of features 
    '''
    
    # Tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['genres_list'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
    genre_df.drop(columns='genre|unknown') # drop unknown genre
    genre_df.reset_index(drop = True, inplace=True)
    
    # Sentiment analysis
    df = sentiment_analysis(df, "track_name")

    # One-hot Encoding
    subject_ohe = ohe_prep(df, 'subjectivity','subject') * 0.3
    polar_ohe = ohe_prep(df, 'polarity','polar') * 0.5
    key_ohe = ohe_prep(df, 'key','key') * 0.5
    mode_ohe = ohe_prep(df, 'mode','mode') * 0.5

    # Normalization
    # Scale popularity columns
    pop = df[["artist_pop","track_pop"]].reset_index(drop = True)
    scaler = MinMaxScaler()
    pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns) * 0.2 

    # Scale audio columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    # Concanenate all features
    final = pd.concat([genre_df, floats_scaled, pop_scaled, subject_ohe, polar_ohe, key_ohe, mode_ohe], axis = 1)
    
    # Add song id
    final['id']=df['id'].values
    
    return final

In [17]:
# Save the data and generate the features
float_cols = songDF.dtypes[songDF.dtypes == 'float64'].index.values
songDF.to_csv("../data/allsong_data.csv", index = False)

# Generate features
complete_feature_set = create_feature_set(songDF, float_cols=float_cols)
complete_feature_set.to_csv("../data/complete_feature.csv", index = False)
complete_feature_set.head()


Unnamed: 0,genre|21st_century_classical,genre|_hip_hop,genre|_indie,genre|abstract,genre|abstract_idm,genre|acoustic,genre|acoustic_blues,genre|acoustic_chill,genre|acoustic_cover,genre|acoustic_guitar_cover,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.519584,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,5SuOikwiRyPMVoIQDJUgSV
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,4qPNDBW1i3p13qLCt0Ki3A
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1iJBSr7s7jYXzM8EGcbK5b
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,6lfxq3CG4xtTiEg7opyCyx
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,5vjLSffimiIP26QG5WcN2K


### Content-based Filtering Recommendation
The next step is to perform content-based filtering based on the song features we have. To do so, we concatenate all songs in a playlist into one summarization vector. Then, we find the similarity between the summarized playlist vector with all songs (not including the songs in the playlist) in the database. Then, we use the similarity measure retrieved the most relevant song that is not in the playlist to recommend it.

There are thre steps in this section:
1. **Choose playlist**: In this part, we retrieve a playlist
2. **Extract features**: In this part, we retireve playlist-of-interest features and non-playlist-of-interest features.
3. **Find similarity**: In this part, we compare the summarized playlist features with all other songs.

#### Choose Playlist
In this part, we test the data with *Mom's playlist* in the dataset.


In [None]:
## This is the test data
# playlistDF_test = pd.read_csv("../data/test_playlist.csv")
# playlistDF_test = playlist_preprocess(playlistDF_test)
# playlistDF_test.head()

## Test playlist:  Mom's playlist
# playlistDF_test = playlistDF[playlistDF['name']=="Mom's playlist"]
# playlistDF_test.head()
# playlistDF_test.to_csv("../data/test_playlist.csv")

#### Extract features
The next step is to generate all the features. We need to first use the `id` to differentiate songs that are in the playlist and those that are not. Then, we simply add the features for all songs in the playlist together as a summary vector, which is similar to the figure below that was modified version of the work by [Madhav Thaker](https://github.com/madhavthaker/spotify-recommendation-system/blob/main/spotify-recommendation-engine.ipynb).

![pipeline_img](flow.png)


In [None]:
# def generate_playlist_feature(complete_feature_set, playlist_df):
#     '''
#     Summarize a user's playlist into a single vector
#     ---
#     Input: 
#     complete_feature_set (pandas dataframe): Dataframe which includes all of the features for the spotify songs
#     playlist_df (pandas dataframe): playlist dataframe
        
#     Output: 
#     complete_feature_set_playlist_final (pandas series): single vector feature that summarizes the playlist
#     complete_feature_set_nonplaylist (pandas dataframe): 
#     '''
    
#     # Find song features in the playlist
#     complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]
#     # Find all non-playlist song features
#     complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]
#     complete_feature_set_playlist_final = complete_feature_set_playlist.drop(columns = "id")
#     return complete_feature_set_playlist_final.sum(axis = 0), complete_feature_set_nonplaylist

In [None]:
# Generate the features
# complete_feature_set_playlist_vector, complete_feature_set_nonplaylist = generate_playlist_feature(complete_feature_set, playlistDF_test)

In [None]:
# Non-playlist features
# complete_feature_set_nonplaylist.head()

In [None]:
# Summarized playlist features
# complete_feature_set_playlist_vector

Try it with a single song:

In [18]:
# Genreate top 10 recommendation
def generate_song_feature(complete_feature_set, song_id):
    '''
    Extract features for a specific song
    ---
    Input: 
    complete_feature_set (pandas dataframe): Dataframe containing all song features
    song_id (str): ID of the song
        
    Output: 
    song_features (pandas series): Features of the specified song
    '''
    
    # Find features for the specified song
    song_features = complete_feature_set[complete_feature_set['id'] == song_id].drop(columns='id').iloc[0]
    
    return song_features


In [19]:
song_features = generate_song_feature(complete_feature_set, "2sYFi9xVSZ56WHKSY2fN1K")
display(song_features)

genre|21st_century_classical    0.0
genre|_hip_hop                  0.0
genre|_indie                    0.0
genre|abstract                  0.0
genre|abstract_idm              0.0
                               ... 
key|9                           0.0
key|10                          0.0
key|11                          0.0
mode|0                          0.5
mode|1                          0.0
Name: 39, Length: 734, dtype: float64

#### Find similarity
The last puzzle is to find the similarities between the summarized playlist vector and all other songs. There are many similarity measures but one of the most common measures is **cosine similarity**.

Cosine similarity is a mathematical value that measures the similarities between vectors. Imagining our songs vectors as only two dimensional, the visual representation would look similar to the figure below. 

The mathematical formula can be expressed as:
$$\text{Cosine Sim}(A,B)=\frac{A\cdot B}{||A||\times||B||}=\frac{\sum_{i=1}^n A_i\times B_i}{\sqrt{\sum_{i=1}^n A_i^2}\times \sqrt{\sum_{i=1}^n B_i^2}}$$

In our code, we used the `cosine_similarity()` function from `scikit learn` to measure the similarity between each song and the summarized playlist vector.

One big advatange of doing this is the time complexity of the whole algorithm is equal to a matrix multiplication since we are performing the cosine similarity measure between each row vector (song) and the column vector of summarized playlist feature.

![cossim_img](https://images.deepai.org/glossary-terms/cosine-similarity-1007790.jpg)

In [None]:
# def generate_playlist_recos(df, features, nonplaylist_features):
#     '''
#     Generated recommendation based on songs in aspecific playlist.
#     ---
#     Input: 
#     df (pandas dataframe): spotify dataframe
#     features (pandas series): summarized playlist feature (single vector)
#     nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist
        
#     Output: 
#     non_playlist_df_top_40: Top 40 recommendations for that playlist
#     '''
    
#     non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)]
#     # Find cosine similarity between the playlist and the complete song set
#     non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
#     non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40)
    
#     return non_playlist_df_top_40

In [None]:
# # Genreate top 10 recommendation
# recommend = generate_playlist_recos(songDF, complete_feature_set_playlist_vector, complete_feature_set_nonplaylist)
# recommend.head(10)

In [None]:
# playlistDF_test[["artist_name","track_name"]][:20]

Try with 1 song only:

In [20]:
def generate_song_recos(df, song_id, complete_feature_set, top_n=10):
    '''
    Generate recommendations based on a specific song.
    ---
    Input: 
    df (pandas dataframe): Spotify dataframe
    song_id (str): ID of the specific song
    complete_feature_set (pandas dataframe): Complete feature set of all songs
    top_n (int): Number of recommendations to return
        
    Output: 
    recommendations (pandas dataframe): Top recommendations for the specified song
    '''
    
    # Check if song_id is in complete_feature_set
    if song_id in complete_feature_set['id'].values:
        # Extract features of the specific song
        song_features = complete_feature_set[complete_feature_set['id'] == song_id].drop(columns='id').iloc[0].values.reshape(1, -1)
    else:
        print("No track found in dataset")
    
    # Compute cosine similarity between the specific song and all songs in the dataset
    similarities = cosine_similarity(song_features, complete_feature_set.drop(columns='id').values)
    
    # Sort the similarities and get indices of top recommendations
    top_indices = similarities.argsort()[0][-top_n:][::-1]
    
    # Get top recommendations
    recommendations = df.iloc[top_indices]
    
    return recommendations


In [21]:
# Lấy đặc trưng của bài hát được chỉ định
recommendations = generate_song_recos(songDF, "2sYFi9xVSZ56WHKSY2fN1K", complete_feature_set)

# Hiển thị 10 gợi ý đầu tiên
recommendations.head(5)

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,artist_pop,genres,track_pop,image_uri,genres_list,subjectivity,polarity
101,Adam Christopher,2sYFi9xVSZ56WHKSY2fN1K,So Far Away - Acoustic,0.576,0.331,6,-9.389,0,0.0306,0.894,...,0.129,0.407,149.02,45,acoustic_cover,47,https://i.scdn.co/image/ab67616d0000b27350a641...,[acoustic_cover],high,Positive
1159,Vendredi,1DxwWEVXzZB2IqGxI6GYRK,Seven Nation Army - Acoustic Covers Versions o...,0.773,0.208,2,-9.482,0,0.0438,0.956,...,0.149,0.538,109.822,47,acoustic_cover,23,https://i.scdn.co/image/ab67616d0000b273e02585...,[acoustic_cover],high,Positive
1186,Son&Dad,0Stf1ND7zaL3TEo8kZgld1,I Want It That Way - Acoustic Covers of Popula...,0.392,0.253,8,-6.563,0,0.0354,0.746,...,0.321,0.368,203.571,43,acoustic_cover,39,https://i.scdn.co/image/ab67616d0000b2731c5482...,[acoustic_cover],high,Positive
256,Eden Elf,1fI1TMpz7FVkpoYBhmiywp,get better,0.639,0.201,7,-16.613,0,0.0526,0.907,...,0.124,0.4,88.107,44,acoustic_cover,47,https://i.scdn.co/image/ab67616d0000b273441fd2...,[acoustic_cover],high,Positive
821,Hannah's Yard,0wUVwtui3Se9kGUfSD7gaU,I Want to Know What Love Is - Acoustic,0.49,0.263,5,-9.248,0,0.0462,0.92,...,0.103,0.361,73.834,40,acoustic_cover,50,https://i.scdn.co/image/ab67616d0000b2739b19ad...,[acoustic_cover],high,Positive


In [35]:
import pandas as pd
from IPython.display import display, HTML

# Extract the required columns from the recommendations DataFrame
recommendations_table = recommendations[['track_name', 'artist_name', 'image_uri', 'id']].copy()

# Replace the image_uri column with the actual images
recommendations_table.loc[:, 'Image'] = recommendations_table['image_uri'].apply(lambda x: f'<img src="{x}" width="100" height="100">')

# Add Spotify link column
recommendations_table.loc[:, 'Spotify Link'] = recommendations_table['id'].apply(lambda x: f'<a href="https://open.spotify.com/track/{x}" target="_blank">Listen on Spotify</a>')

# Drop the image_uri and id columns
recommendations_table = recommendations_table.drop(['image_uri', 'id'], axis=1)

# Convert the DataFrame to HTML
recommendations_html = recommendations_table.to_html(escape=False, index=False)

# Display the table
display(HTML(recommendations_html))

track_name,artist_name,Image,Spotify Link
So Far Away - Acoustic,Adam Christopher,,Listen on Spotify
Seven Nation Army - Acoustic Covers Versions of Popular Songs,Vendredi,,Listen on Spotify
I Want It That Way - Acoustic Covers of Popular Songs,Son&Dad,,Listen on Spotify
get better,Eden Elf,,Listen on Spotify
I Want to Know What Love Is - Acoustic,Hannah's Yard,,Listen on Spotify
You're Not Sorry - Acoustic Covers Versions of Popular Songs,Covers Culture,,Listen on Spotify
Lost Cause - Acoustic Covers Versions of Popular Songs,Covers Culture,,Listen on Spotify
Heat Waves - Acoustic,Daniel Robinson,,Listen on Spotify
Leave The Door Open - Acoustic,Blame Jones,,Listen on Spotify
Have I Told You Lately That I Love You - Acoustic,John Adams,,Listen on Spotify
