# **2. Measure Similarity & Content Based Recommender System**

---
## Outline:

1. Background
2. Simplified Workflows.
3. Importing Data
4. Data Preparations
5. Data Preprocessing
6. Modeling
7. Decision Process (Recommendation Process)

# **Background**
---

## Problem Description
---

- A music streaming platform, **denger-yuk.com**, has a problem with user engagement.
- Currently, **denger-yuk.com** recommends popular songs, e.g. Billboard-style charts, editor picks, etc.
- However, 3 months after implementation, user session time had dropped by ~20%, affecting revenue.
- After the company surveyed its users with a questionnaire, it found that users feel bored and do not find the current song recommendations helpful.

## Business Objective
---

Our business objective is to **increase session time** by **20%** (an assumed target, of course) within 3 months.

## Constraint
---


- Our platform launched only **six** months ago, so we have not collected **enough** interaction data, such as likes or dislikes.
- However, we have collected **enough** item data, including the song properties.

## Solution
---


- We can improve **our current recommendations** to help **users discover** relevant songs **easily** by using more **personalized** recommendations.

Some approaches to personalized recommendation:

1. **Collaborative Filtering**
   
   <img src="../assets/collaborative-filtering-shown-visually.png" width="600" height="500" img>
   
  
   [source](https://i0.wp.com/analyticsarora.com/wp-content/uploads/2022/03/collaborative-filtering-shown-visually.png?resize=800%2C600&ssl=1)

Collaborative filtering **uses interaction data**, such as **the preferences users express through ratings**, to find similar users and then recommends what those similar users consumed.

Since our streaming platform, **denger-yuk.com**, has not collected enough interaction data such as likes/dislikes, we **can't implement this approach yet**.

2. **Content Based Filtering**
   
   <img src="../assets/content-based_filtering.png" width="600" height="500" img>
   
   [source](https://cdn.sanity.io/images/oaglaatp/production/a2fc251dcb1ad9ce9b8a82b182c6186d5caba036-1200x800.png?w=1200&h=800&auto=format)

As the name suggests, **content**-based filtering recommends items that are similar to what the user has already consumed. To measure the similarity between candidate items and consumed items, we can use item features such as, but not limited to, **metadata** (tags, descriptions, etc.).
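The idea can be sketched in a few lines. This is a minimal illustration, not the implementation built later: the item names, the three-column feature vectors, and the helper `recommend_similar` are all hypothetical, and cosine similarity is just one possible measure.

```python
import numpy as np

# Hypothetical toy catalogue: one row per item, one column per content feature
item_ids = ["song_a", "song_b", "song_c", "song_d"]
item_vectors = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.9],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
])

def recommend_similar(last_played, top_n=2):
    """Return the top_n items most similar to the last played item."""
    query = item_vectors[item_ids.index(last_played)]

    # Cosine similarity between the query vector and every item vector
    norms = np.linalg.norm(item_vectors, axis=1) * np.linalg.norm(query)
    scores = item_vectors @ query / norms

    # Sort by descending similarity and skip the item itself
    order = np.argsort(-scores)
    return [item_ids[i] for i in order if item_ids[i] != last_played][:top_n]

print(recommend_similar("song_a"))  # ['song_b', 'song_d']
```

No interaction data is needed here: the ranking depends only on the item features, which is why this approach fits the constraint described above.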



## Data Description
---

- The data is obtained from [Spotify Playlist Dataset](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset).

The dataset that we use is:

**Tracks dataset**: `tracks_dataset.csv`

<center>

|Features|Descriptions|Data Type|
|:--|:--|:--:|
|`track_id`|The Spotify ID for the track|`obj`|
|`artists`|The artists' names who performed the track. If there is more than one artist, they are separated by a `;`|`obj`|
|`album_name`|The album name in which the track appears|`obj`|
|`track_name`|Name of the track|`obj`|
|`popularity`|The popularity of a track is a value between 0 and 100, with 100 being the most popular.|`int`|
|`duration_ms`|The track length in milliseconds|`int`|
|`explicit`|Whether or not the track has explicit lyrics (true = yes it does; false = no it does not OR unknown)|`bool`|
|`danceability`|Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable|`float`|
|`energy`|Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.|`float`|
|`key`|The key the track is in.|`int`|
|`loudness`|The overall loudness of a track in decibels (dB)|`float`|
|`mode`|Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0|`int`|
|`speechiness`|Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. |`float`|
|`acousticness`|1.0 represents high confidence the track is acoustic|`float`|
|`instrumentalness`|The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content|`float`|
|`liveness`|Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live|`float`|
|`valence`|A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)|`float`|
|`tempo`|The overall estimated tempo of a track in beats per minute (BPM).|`float`|
|`time_signature`|An estimated time signature.|`int`|
|`track_genre`|The genre in which the track belongs|`obj`|

</center>

# **Recommender System Workflow** (Simplified)
---

## 1. Importing Data

1. Load the data.
2. Check the shape & type of data.
3. Handle duplicate data to maintain data validity.

## 2. Data Preparation

Create several data representations
1. numerical data
2. categorical data (with OHE)
3. numerical + categorical data

## 3. Content-Based Recommendations: Using Last Activity Data

1. Create similarity functions
    - Jaccard similarity
    - Euclidean similarity
    - Cosine similarity
    - Pearson correlation similarity
2. Find the top-n `track_id`s most similar to the latest played `track_id`
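As a rough preview of step 1, here is one common way to write the four measures with NumPy. The function names are illustrative, and turning a Euclidean distance `d` into a similarity via `1 / (1 + d)` is an assumption, since other conversions are possible.

```python
import numpy as np

def jaccard_similarity(a, b):
    """Jaccard similarity for binary (0/1) vectors: |A and B| / |A or B|."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def euclidean_similarity(a, b):
    """Map the Euclidean distance d to (0, 1] with 1 / (1 + d) (assumed form)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 / (1.0 + np.linalg.norm(a - b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def pearson_similarity(a, b):
    """Pearson correlation, i.e. cosine similarity of mean-centered vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return cosine_similarity(a - a.mean(), b - b.mean())

print(jaccard_similarity([1, 1, 0], [1, 0, 1]))
print(euclidean_similarity([0.2, 0.4], [0.2, 0.4]))
print(cosine_similarity([1, 0], [0, 1]))
print(pearson_similarity([1, 2, 3], [2, 4, 6]))
```

Jaccard is meant for binary vectors such as one-hot encoded categories, while the other three work on real-valued vectors such as scaled audio features.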

# **1. Importing Data**
---

What do we do?
1. Load the data.
2. Check the shape & type of data.
3. Handle duplicate data to maintain data validity.

## Load the data
---

In [1]:
# Load this library
import numpy as np
import pandas as pd

Load the data from the given data path.

In [2]:
tracks_data_path = '../data/sampled_tracks.csv'
tracks_data = pd.read_csv(tracks_data_path)

In [3]:
tracks_data.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5ujaK9hLZsIwMRBYbG2y7V,Kenny Larkin;Richie Hawtin,We Shall Overcome,We Shall Overcome - Richie's Loonie Mix,9,342217,False,0.679,0.749,6,-7.429,1,0.0472,0.000892,0.000711,0.104,0.886,124.281,4,chicago-house
1,46HD0Iv7eIF0uUEzAuxIxZ,Oleg Pogudin,Любовь и разлука. Песни Исаака Шварца,Дождик осенний,6,191146,False,0.449,0.185,4,-14.767,0,0.0532,0.889,7e-06,0.0837,0.302,137.821,4,romance
2,6KDG8vvFerlCU63IXdw1zb,Elvis Presley,Elvis 30 #1 Hits,Suspicious Minds,60,272360,False,0.542,0.722,7,-5.065,1,0.0262,0.0516,1.2e-05,0.131,0.609,115.21,4,rockabilly
3,6m1IAMCbpGECDzCc1qQ3Fa,Kelli Hand;Jacob Korn,Uncanny Valley 002,Dance Away,7,407816,False,0.815,0.574,2,-10.93,1,0.233,0.000196,0.384,0.101,0.442,120.006,4,detroit-techno
4,12xfRpp5r3akA2BZFPEVwv,Volbeat,Halloween Metal Nights,Heaven Nor Hell,1,322613,False,0.411,0.767,3,-4.211,1,0.056,0.000621,0.00842,0.134,0.659,134.408,4,metal


## Check data shapes & types
---

First, we check the data shape.

In [4]:
# Check the shape of data
tracks_data.shape

(56999, 20)

There are about 57k tracks in the dataset, with 20 columns.

In [5]:
# Check the type of data
tracks_data.dtypes

track_id             object
artists              object
album_name           object
track_name           object
popularity            int64
duration_ms           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
time_signature        int64
track_genre          object
dtype: object

## Handling duplicates data
---

What do we know so far?
- A song can appear on multiple albums or under multiple genres.
- Because of that, two different `track_id`s can represent a single song.

Hence, we first check for duplicates on the combination of `artists` and `track_name`.

In [6]:
# Check duplicate data
tracks_data.duplicated(subset = ['artists', 'track_name']).sum()

11393

Yes, we have duplicated data. Let's look at it.

In [7]:
# Check the duplicated data
tracks_data[tracks_data.duplicated(subset = ['artists', 'track_name'])].iloc[:30]

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
314,2jXvdJ0tDtbksFOK85Vie0,Jhayco;J Balvin;Bad Bunny,FELIZ CUMPLEAÑOS,No Me Conoce - Remix,2,309120,False,0.804,0.786,10,-3.837,0,0.0735,0.144,0.0,0.0928,0.575,91.992,4,latin
413,7F7kL3i6SDEIbDcoJZGiig,Daddy Yankee,Sígueme Y Te Sigo,Sígueme Y Te Sigo,66,209786,False,0.736,0.878,1,-3.604,0,0.0428,0.069,6.4e-05,0.756,0.663,100.025,4,reggaeton
525,551qy5vUgrUfEUc4dCNfht,Nirvana,MTV Unplugged In New York,Where Did You Sleep Last Night,69,306066,False,0.573,0.59,3,-8.43,1,0.0557,0.667,0.000397,0.691,0.364,108.789,3,alt-rock
620,3IExS9dCfDAVCRY0Tq4qw7,AFI,Halloween Kids Party 2022,Halloween,0,238106,False,0.286,0.929,11,-8.494,0,0.091,0.00231,0.00074,0.0946,0.127,99.762,4,emo
672,2gLUbowsRFFwhmlAS3hZ5i,Jack Harlow,New Rap 20s,First Class,0,173947,True,0.905,0.563,8,-6.135,1,0.102,0.0254,1e-05,0.113,0.324,106.998,4,k-pop
710,2GKvSb11LvOsZWkuSqlhqe,Eric Burdon,East New York The Ultimate Fantasy Playlist,Don't Let Me Be Misunderstood,0,157360,False,0.673,0.712,11,-7.858,0,0.0273,0.159,0.0,0.183,0.912,113.061,4,british
763,27pyhbC00sCtwKPhlQtOGe,My Chemical Romance,Timeless Rock Hits,Teenagers,2,161920,False,0.459,0.858,4,-3.063,1,0.0626,0.0506,0.0,0.184,0.855,111.682,4,punk
816,7MLWLQSOUQc9sAghrNim6j,Stevie Wonder,Rock Christmas 2022 - The Very Best Of,What Christmas Means To Me - Single Version / ...,0,148146,False,0.545,0.7,5,-6.961,1,0.0323,0.0982,0.0,0.1,0.799,83.506,4,funk
846,3RU4IbiteV8v6Haa7sDrOV,Burna Boy,Your Perfect Soundtrack,Last Last,0,172342,False,0.795,0.565,3,-4.457,0,0.0948,0.131,0.0,0.0802,0.55,87.925,4,dancehall
958,6OfmhEOHcKhOUlh3vNE2zh,The Band,Home For Christmas 2022,Christmas Must Be Tonight - Remastered 2001,0,218666,False,0.668,0.665,7,-8.55,1,0.0267,0.0519,6.7e-05,0.109,0.766,128.619,4,folk


In [8]:
# Check more detail
tracks_data[(tracks_data['artists'] == 'Deep Purple')
            & (tracks_data['track_name'] == 'Perfect Strangers')]

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
189,2JhJOPGvtqMpj5RQC8cIYf,Deep Purple,Perfect Strangers,Perfect Strangers,66,328426,False,0.521,0.721,2,-7.699,1,0.0254,0.00348,0.0113,0.391,0.611,97.578,4,blues
1100,2JhJOPGvtqMpj5RQC8cIYf,Deep Purple,Perfect Strangers,Perfect Strangers,66,328426,False,0.521,0.721,2,-7.699,1,0.0254,0.00348,0.0113,0.391,0.611,97.578,4,hard-rock
45340,2JhJOPGvtqMpj5RQC8cIYf,Deep Purple,Perfect Strangers,Perfect Strangers,66,328426,False,0.521,0.721,2,-7.699,1,0.0254,0.00348,0.0113,0.391,0.611,97.578,4,metal


In [9]:
# Check more detail
tracks_data[(tracks_data['artists'] == 'ABBA')
            & (tracks_data['track_name'] == 'Little Things')]

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
1356,7vCtRoVwJDKyP6o7PT1s0I,ABBA,GOD JUL,Little Things,0,188320,False,0.647,0.165,10,-16.328,1,0.0315,0.894,0.00187,0.116,0.59,75.978,4,swedish
1704,0ECr8qukqLKU14g3eRNX54,ABBA,Holidays Are Coming,Little Things,0,188320,False,0.647,0.165,10,-16.328,1,0.0315,0.894,0.00187,0.116,0.59,75.978,4,swedish
7350,31cxwMfgThnUoyHbi0KSPM,ABBA,JULMYS,Little Things,0,188320,False,0.647,0.165,10,-16.328,1,0.0315,0.894,0.00187,0.116,0.59,75.978,4,swedish
22864,356BtzUMw3wWvyDlGzrotn,ABBA,JULMUSIK,Little Things,1,188320,False,0.647,0.165,10,-16.328,1,0.0315,0.894,0.00187,0.116,0.59,75.978,4,swedish
23164,1vLPyotXnRlM2EJIHK16LW,ABBA,Voyage,Little Things,57,188320,False,0.647,0.165,10,-16.328,1,0.0315,0.894,0.00187,0.116,0.59,75.978,4,swedish
29802,11Wlp3uhPE94k8hhI5LLne,ABBA,All I want For Christmas Is Music,Little Things,0,188320,False,0.647,0.165,10,-16.328,1,0.0315,0.894,0.00187,0.116,0.59,75.978,4,swedish
30100,7Bm1rOWQ8fD6CBtmstnBaW,ABBA,Weihnachten Musik,Little Things,0,188320,False,0.647,0.165,10,-16.328,1,0.0315,0.894,0.00187,0.116,0.59,75.978,4,swedish
37412,6u2DMKOq3A57msjbMPjZhF,ABBA,Weihnachtspop 2022,Little Things,0,188320,False,0.647,0.165,10,-16.328,1,0.0315,0.894,0.00187,0.116,0.59,75.978,4,swedish
43430,2nqhQaJH4zJ6F2X2wBQ1IT,ABBA,Beste Weihnachtslieder,Little Things,0,188320,False,0.647,0.165,10,-16.328,1,0.0315,0.894,0.00187,0.116,0.59,75.978,4,swedish


To sum up, the same song appears in several rows with identical audio features. How do we handle this? For each song, we keep the row with **the highest popularity**.

In [10]:
# Easy way to drop
# 1. sort the data based on popularity first (descending order)
# 2. then drop the duplicates and keep first
tracks_data_dropped = (tracks_data.sort_values('popularity', ascending=False)
                                  .drop_duplicates(subset = ['artists', 'track_name'],
                                                   keep = 'first'))

tracks_data_dropped.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
24620,3nqQXoyQOWXiESFLlDF1hG,Sam Smith;Kim Petras,Unholy (feat. Kim Petras),Unholy (feat. Kim Petras),100,156943,False,0.714,0.472,2,-7.375,1,0.0864,0.013,5e-06,0.266,0.238,131.121,4,dance
2100,5ww2BF9slyYgNOk37BlC4u,Manuel Turizo,La Bachata,La Bachata,98,162637,False,0.835,0.679,7,-5.329,0,0.0364,0.583,2e-06,0.218,0.85,124.98,4,reggaeton
10203,1IHWl5LamUGEuP4ozKQSXZ,Bad Bunny,Un Verano Sin Ti,Tití Me Preguntó,97,243716,False,0.65,0.715,5,-5.198,0,0.253,0.0993,0.000291,0.126,0.187,106.672,4,reggaeton
7414,4h9wh7iOZ0GGn8QVp4RAOB,OneRepublic,I Ain’t Worried (Music From The Motion Picture...,I Ain't Worried,96,148485,False,0.704,0.797,0,-5.927,1,0.0475,0.0826,0.000745,0.0546,0.825,139.994,4,pop
9961,5Eax0qFko2dh7Rl2lYs3bx,Bad Bunny,Un Verano Sin Ti,Efecto,96,213061,False,0.801,0.475,7,-8.797,0,0.0516,0.141,1.7e-05,0.0639,0.234,98.047,4,latin


In [11]:
tracks_data_dropped[(tracks_data_dropped['artists'] == 'ABBA')
            & (tracks_data_dropped['track_name'] == 'Little Things')]

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
23164,1vLPyotXnRlM2EJIHK16LW,ABBA,Voyage,Little Things,57,188320,False,0.647,0.165,10,-16.328,1,0.0315,0.894,0.00187,0.116,0.59,75.978,4,swedish


Check the number of duplicates again, this time on the deduplicated data

In [12]:
# Check duplicate data
tracks_data_dropped.duplicated(subset = ['artists', 'track_name']).sum()

0

In [13]:
tracks_data_dropped.shape

(45606, 20)

Now, let's check whether there are any duplicates in `track_id`.

In [14]:
tracks_data_dropped['track_id'].duplicated().sum()

0

Great! Our `tracks_data_dropped` is free from duplicated data
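To see why sorting before dropping matters, here is a small sketch on a toy frame. The `track_id` values are hypothetical placeholders and the popularity values only loosely mirror the ABBA example above.

```python
import pandas as pd

# Toy data: three rows for the same song, one row for another song
toy = pd.DataFrame({
    "track_id": ["a1", "a2", "a3", "b1"],
    "artists": ["ABBA", "ABBA", "ABBA", "Volbeat"],
    "track_name": ["Little Things", "Little Things",
                   "Little Things", "Heaven Nor Hell"],
    "popularity": [0, 57, 1, 1],
})

# Sorting by popularity (descending) puts the most popular copy first,
# so keep='first' retains exactly that copy for each (artists, track_name)
deduped = (toy.sort_values("popularity", ascending=False)
              .drop_duplicates(subset=["artists", "track_name"],
                               keep="first"))

print(deduped[["track_id", "popularity"]])
```

Without the sort, `keep='first'` would retain whichever copy happens to come first in the file, here the row with popularity 0. When several copies tie on popularity, which one is kept depends on the sort order among the ties.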

## Set `track_id` as index
---

To make later lookups easier, we set `track_id` as the index so that we can retrieve a track directly by its ID.

In [15]:
tracks_data_final = tracks_data_dropped.set_index(keys=['track_id'])
tracks_data_final.head()

Unnamed: 0_level_0,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
3nqQXoyQOWXiESFLlDF1hG,Sam Smith;Kim Petras,Unholy (feat. Kim Petras),Unholy (feat. Kim Petras),100,156943,False,0.714,0.472,2,-7.375,1,0.0864,0.013,5e-06,0.266,0.238,131.121,4,dance
5ww2BF9slyYgNOk37BlC4u,Manuel Turizo,La Bachata,La Bachata,98,162637,False,0.835,0.679,7,-5.329,0,0.0364,0.583,2e-06,0.218,0.85,124.98,4,reggaeton
1IHWl5LamUGEuP4ozKQSXZ,Bad Bunny,Un Verano Sin Ti,Tití Me Preguntó,97,243716,False,0.65,0.715,5,-5.198,0,0.253,0.0993,0.000291,0.126,0.187,106.672,4,reggaeton
4h9wh7iOZ0GGn8QVp4RAOB,OneRepublic,I Ain’t Worried (Music From The Motion Picture...,I Ain't Worried,96,148485,False,0.704,0.797,0,-5.927,1,0.0475,0.0826,0.000745,0.0546,0.825,139.994,4,pop
5Eax0qFko2dh7Rl2lYs3bx,Bad Bunny,Un Verano Sin Ti,Efecto,96,213061,False,0.801,0.475,7,-8.797,0,0.0516,0.141,1.7e-05,0.0639,0.234,98.047,4,latin


In [16]:
tracks_data_final.shape

(45606, 19)
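With `track_id` as the index, a track can be fetched by its ID using `.loc`. The sketch below uses a toy frame with made-up IDs rather than the real data.

```python
import pandas as pd

# Toy frame with hypothetical IDs
toy = pd.DataFrame({
    "track_id": ["id_1", "id_2", "id_3"],
    "track_name": ["Song A", "Song B", "Song C"],
    "popularity": [10, 55, 90],
})
toy_indexed = toy.set_index("track_id")

# Look up a single track by its ID (returns a Series)
row = toy_indexed.loc["id_2"]
print(row["track_name"])

# Look up several tracks at once, in the requested order
subset = toy_indexed.loc[["id_3", "id_1"], ["track_name"]]
print(subset)
```

This is the access pattern a recommender needs: given the IDs of the most similar tracks, pull their names and metadata in ranked order.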

## Create load function
---

Finally, we can wrap these steps in a function that loads the data.

In [17]:
def load_tracks_data(tracks_path, sample_frac=0.05, seed=42):
    """
    Load the tracks data, remove duplicated tracks,
        & draw a random sample

    Parameters
    ----------
    tracks_path : str
        The path of tracks data (.csv)

    sample_frac : float, default=0.05
        Fraction of the deduplicated data to sample
        Only for this course purposes

    seed : int, default=42
        For reproducibility

    Returns
    -------
    tracks_data : pandas DataFrame
        Sampled tracks data, indexed by track_id
    """
    # Load data
    tracks_data = pd.read_csv(tracks_path)
    print('Original data shape                 :', tracks_data.shape)

    # Drop duplicates & keep the first item with the highest popularity
    tracks_data = (tracks_data.sort_values('popularity', ascending=False)
                              .drop_duplicates(subset = ['artists', 'track_name'],
                                               keep = 'first'))

    print('Data shape after dropping duplicate :', tracks_data.shape)

    # Set track_id as index
    tracks_data = tracks_data.set_index(keys=['track_id'])

    # Sample the data
    tracks_data = tracks_data.sample(frac=sample_frac,
                                     replace=False,
                                     random_state=seed)
    print('Data shape final                    :', tracks_data.shape)

    return tracks_data


In [18]:
tracks_data = load_tracks_data(tracks_path = tracks_data_path)

Original data shape                 : (56999, 20)
Data shape after dropping duplicate : (45606, 20)
Data shape final                    : (2280, 19)


In [19]:
tracks_data.head()

Unnamed: 0_level_0,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
07tYyJZJyu7SYYG4EWKkGS,Jeff Bernat,This Time,This Time,59,185991,False,0.685,0.43,4,-9.586,1,0.0324,0.723,3e-06,0.135,0.482,107.042,4,chill
7kxfWvj6u9oWQ5C36kMtGb,Alter Bridge,Blackbird,Watch Over You,60,259133,False,0.276,0.658,7,-5.418,1,0.0329,0.0836,0.000553,0.161,0.16,172.326,4,metal
7GDwYUuvu1qjah9rAH15AH,Chilam;Xu Qiu Yi,EMI 星聲傳集之張智霖,現代愛情故事,46,280773,False,0.804,0.529,0,-8.57,1,0.0305,0.277,0.0,0.0916,0.723,106.908,4,cantopop
5GDtkgG9T1BDknHHyDtghv,Cory Asbury,The Father's House,The Father's House - Studio,62,253640,False,0.408,0.707,1,-5.4,1,0.0341,0.189,0.0,0.626,0.333,161.962,4,world-music
20iJjrQc6GRWDncFtrqWnd,Rota,Kukla,Kukla,41,157714,False,0.779,0.458,4,-13.562,0,0.133,0.0981,0.0289,0.0878,0.0543,104.903,4,turkish


# **2. Data Preparation**
---

- We will experiment with how to find similarity between items.

- To find similarity between items, we have to characterize items, i.e. create a vector that represents an item.

- We will explore several ways to vectorize an item such as
  - Numerical features only
  - Categorical features only
  - Numerical + Categorical features
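For the third option, one simple approach is to place the two representations side by side. This is a minimal sketch, assuming both parts share the same `track_id` index; the IDs, column names, and values are hypothetical.

```python
import pandas as pd

ids = ["id_1", "id_2"]

# Hypothetical scaled numerical features
num_part = pd.DataFrame({"energy": [0.43, 0.66],
                         "valence": [0.49, 0.16]}, index=ids)

# Hypothetical one-hot encoded categorical features
cat_part = pd.DataFrame({"genre_chill": [1, 0],
                         "genre_metal": [0, 1]}, index=ids)

# Concatenate column-wise; rows are aligned on the shared index
combined = pd.concat([num_part, cat_part], axis=1)
print(combined)
```

Each row of `combined` is then a single vector describing one track with both kinds of features.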

## Numerical Features Only
---

We will vectorize each item using numerical features only.

In [20]:
tracks_data.head().T

track_id,07tYyJZJyu7SYYG4EWKkGS,7kxfWvj6u9oWQ5C36kMtGb,7GDwYUuvu1qjah9rAH15AH,5GDtkgG9T1BDknHHyDtghv,20iJjrQc6GRWDncFtrqWnd
artists,Jeff Bernat,Alter Bridge,Chilam;Xu Qiu Yi,Cory Asbury,Rota
album_name,This Time,Blackbird,EMI 星聲傳集之張智霖,The Father's House,Kukla
track_name,This Time,Watch Over You,現代愛情故事,The Father's House - Studio,Kukla
popularity,59,60,46,62,41
duration_ms,185991,259133,280773,253640,157714
explicit,False,False,False,False,False
danceability,0.685,0.276,0.804,0.408,0.779
energy,0.43,0.658,0.529,0.707,0.458
key,4,7,0,1,4
loudness,-9.586,-5.418,-8.57,-5.4,-13.562


In [21]:
# Set the numerical feature columns
num_cols = ['popularity', 'duration_ms', 'danceability', 'energy',
            'loudness', 'speechiness', 'acousticness', 'instrumentalness',
            'liveness', 'valence', 'tempo']

Then, let's select the numerical features of the tracks.

In [22]:
tracks_data_num = tracks_data[num_cols]
tracks_data_num.head()

Unnamed: 0_level_0,popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
07tYyJZJyu7SYYG4EWKkGS,59,185991,0.685,0.43,-9.586,0.0324,0.723,3e-06,0.135,0.482,107.042
7kxfWvj6u9oWQ5C36kMtGb,60,259133,0.276,0.658,-5.418,0.0329,0.0836,0.000553,0.161,0.16,172.326
7GDwYUuvu1qjah9rAH15AH,46,280773,0.804,0.529,-8.57,0.0305,0.277,0.0,0.0916,0.723,106.908
5GDtkgG9T1BDknHHyDtghv,62,253640,0.408,0.707,-5.4,0.0341,0.189,0.0,0.626,0.333,161.962
20iJjrQc6GRWDncFtrqWnd,41,157714,0.779,0.458,-13.562,0.133,0.0981,0.0289,0.0878,0.0543,104.903


In [23]:
tracks_data_num.shape

(2280, 11)

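Finally, we normalize the data. We choose **MinMaxScaler** because it rescales every feature to the range [0, 1], so all feature values are non-negative and similarity scores such as cosine similarity stay non-negative.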

In [24]:
# Normalize
from sklearn.preprocessing import MinMaxScaler

In [25]:
# Create & fit the scaler object
scaler = MinMaxScaler()
scaler.fit(tracks_data_num)

In [26]:
# Get the normalized data
tracks_data_num_norm = pd.DataFrame(scaler.transform(tracks_data_num))

tracks_data_num_norm.index = tracks_data_num.index
tracks_data_num_norm.columns = tracks_data_num.columns

tracks_data_num_norm.head()

Unnamed: 0_level_0,popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
07tYyJZJyu7SYYG4EWKkGS,0.614583,0.033018,0.695431,0.429965,0.77891,0.033962,0.725903,3e-06,0.114173,0.485398,0.486375
7kxfWvj6u9oWQ5C36kMtGb,0.625,0.048581,0.280203,0.657979,0.876084,0.034486,0.083935,0.000554,0.141111,0.161128,0.783012
7GDwYUuvu1qjah9rAH15AH,0.479167,0.053186,0.816244,0.528971,0.802597,0.031971,0.278112,0.0,0.069208,0.728097,0.485767
5GDtkgG9T1BDknHHyDtghv,0.645833,0.047413,0.414213,0.706982,0.876504,0.035744,0.189758,0.0,0.622876,0.335347,0.73592
20iJjrQc6GRWDncFtrqWnd,0.427083,0.027,0.790863,0.457966,0.686212,0.139413,0.098493,0.028929,0.065271,0.054683,0.476656


In [27]:
# Validate
tracks_data_num_norm.describe()

Unnamed: 0,popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
count,2280.0,2280.0,2280.0,2280.0,2280.0,2280.0,2280.0,2280.0,2280.0,2280.0,2280.0
mean,0.369335,0.043193,0.568348,0.63746,0.803813,0.089417,0.324365,0.184407,0.196116,0.461794,0.553941
std,0.210621,0.031001,0.181371,0.255241,0.1241,0.118204,0.336496,0.33138,0.203918,0.267555,0.136125
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.21875,0.030595,0.456853,0.472717,0.761995,0.037107,0.013352,0.0,0.07457,0.238671,0.453023
50%,0.375,0.039401,0.586802,0.67798,0.8335,0.049843,0.184738,0.00011,0.107957,0.440081,0.55886
75%,0.520833,0.049816,0.703807,0.848991,0.883154,0.088365,0.610441,0.149399,0.250932,0.682779,0.636084
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
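For reference, with its default `feature_range=(0, 1)`, `MinMaxScaler` applies `(x - min) / (max - min)` to each column. The NumPy sketch below reproduces that formula on a tiny array whose values are copied from the `popularity` and `loudness` columns of the first three tracks shown earlier; the variable names are illustrative.

```python
import numpy as np

# Toy data: columns are popularity and loudness
toy = np.array([[59.0, -9.586],
                [60.0, -5.418],
                [46.0, -8.570]])

col_min = toy.min(axis=0)
col_max = toy.max(axis=0)

# Min-max scaling, computed per column
toy_scaled = (toy - col_min) / (col_max - col_min)

print(toy_scaled.round(3))
```

Even the negative `loudness` values end up in [0, 1], which matches the `min` and `max` rows of the `describe()` output above. Note that this bare formula divides by zero for a constant column, so such columns would need special handling.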


Great! Let's create a function

In [28]:
def numerical_vectorizer(data, num_cols):
    """
    Create numerical vector from a given data

    Parameters
    ----------
    data : pandas DataFrame
        The sample data

    num_cols : list
        The chosen numerical columns

    Returns
    -------
    data_num_clean : pandas DataFrame
        The normalized sample data with the chosen numerical columns
    """
    data = data.copy()

    # Filter data
    data_num = data[num_cols]

    # Transform data
    scaler = MinMaxScaler()
    scaler.fit(data_num)

    data_num_clean = pd.DataFrame(scaler.transform(data_num))
    data_num_clean.index = data_num.index
    data_num_clean.columns = data_num.columns

    print('Shape of original data  :', data.shape)
    print('Shape of numerical data :', data_num_clean.shape)

    return data_num_clean


In [29]:
tracks_data_num = numerical_vectorizer(data = tracks_data,
                                       num_cols = num_cols)

tracks_data_num.head()

Shape of original data  : (2280, 19)
Shape of numerical data : (2280, 11)


Unnamed: 0_level_0,popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
07tYyJZJyu7SYYG4EWKkGS,0.614583,0.033018,0.695431,0.429965,0.77891,0.033962,0.725903,3e-06,0.114173,0.485398,0.486375
7kxfWvj6u9oWQ5C36kMtGb,0.625,0.048581,0.280203,0.657979,0.876084,0.034486,0.083935,0.000554,0.141111,0.161128,0.783012
7GDwYUuvu1qjah9rAH15AH,0.479167,0.053186,0.816244,0.528971,0.802597,0.031971,0.278112,0.0,0.069208,0.728097,0.485767
5GDtkgG9T1BDknHHyDtghv,0.645833,0.047413,0.414213,0.706982,0.876504,0.035744,0.189758,0.0,0.622876,0.335347,0.73592
20iJjrQc6GRWDncFtrqWnd,0.427083,0.027,0.790863,0.457966,0.686212,0.139413,0.098493,0.028929,0.065271,0.054683,0.476656


Great!

## Categorical Features Only
---

Now, we try to represent an item with categorical features only.

In [30]:
tracks_data.dtypes

artists              object
album_name           object
track_name           object
popularity            int64
duration_ms           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
time_signature        int64
track_genre          object
dtype: object

There are five categorical features:
- `artists`
- `album_name`
- `track_name`
- `explicit`
- `track_genre`


In [31]:
# Create categorical columns
cat_cols = ['artists', 'album_name', 'track_name', 'explicit', 'track_genre']

In [32]:
# Filter data
tracks_data_cat = tracks_data[cat_cols]

print(tracks_data_cat.shape)
tracks_data_cat.head()

(2280, 5)


Unnamed: 0_level_0,artists,album_name,track_name,explicit,track_genre
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
07tYyJZJyu7SYYG4EWKkGS,Jeff Bernat,This Time,This Time,False,chill
7kxfWvj6u9oWQ5C36kMtGb,Alter Bridge,Blackbird,Watch Over You,False,metal
7GDwYUuvu1qjah9rAH15AH,Chilam;Xu Qiu Yi,EMI 星聲傳集之張智霖,現代愛情故事,False,cantopop
5GDtkgG9T1BDknHHyDtghv,Cory Asbury,The Father's House,The Father's House - Studio,False,world-music
20iJjrQc6GRWDncFtrqWnd,Rota,Kukla,Kukla,False,turkish


Let's explore a bit

In [33]:
# Check number of unique value in each columns
for col in tracks_data_cat:
    print(f'Col-{col}: {len(set(tracks_data_cat[col]))}')

print('')
print('----------------------------------')
print('Full data shape :', tracks_data_cat.shape)

Col-artists: 1976
Col-album_name: 2160
Col-track_name: 2270
Col-explicit: 2
Col-track_genre: 114

----------------------------------
Full data shape : (2280, 5)


What do we observe?
- Some songs are sung by multiple artists.
- Some songs share the same `artists`, `album_name`, `track_name`, or `track_genre` values.

To vectorize these data numerically, we can use One Hot Encoding (OHE).

**Note**
- If we one-hot encode all the categorical features, we get one column per unique value: `1976 + 2160 + 2270 + 2 + 114 = 6522` features for only 2280 rows in this sample, and even more once multi-artist entries are split. That is too many for our current experiments.
- In this experiment, we only preprocess `artists` and `track_genre`.

In [34]:
# Create NEW categorical columns
cat_cols = ['artists', 'track_genre']

# Filter data
tracks_data_cat = tracks_data[cat_cols]

print(tracks_data_cat.shape)
tracks_data_cat.head()

(2280, 2)


Unnamed: 0_level_0,artists,track_genre
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1
07tYyJZJyu7SYYG4EWKkGS,Jeff Bernat,chill
7kxfWvj6u9oWQ5C36kMtGb,Alter Bridge,metal
7GDwYUuvu1qjah9rAH15AH,Chilam;Xu Qiu Yi,cantopop
5GDtkgG9T1BDknHHyDtghv,Cory Asbury,world-music
20iJjrQc6GRWDncFtrqWnd,Rota,turkish


**Preprocess: Artists**

We observed that a single song can be sung by multiple artists, separated by `;`. Let's collect all the unique artists.

In [35]:
# Define the storage
artists_list = []

# Iterate over the data
for artists in tracks_data_cat['artists'].values:
    # Split the artists by ;
    artists_split = artists.split(';')

    # Append
    artists_list += artists_split

# Get the unique artists
artists_unique = set(artists_list)
artists_unique = sorted(artists_unique)

print('Number of unique artists : ', len(artists_unique))

Number of unique artists :  2570


Then, we can transform the data.

First, we create artist-to-index and index-to-artist mappings to make the encoding easier.

In [36]:
# Create the mapping
artist_to_idx = {artist : idx for idx, artist in enumerate(artists_unique)}
idx_to_artist = {idx : artists for artists, idx in artist_to_idx.items()}

Next, we create a placeholder matrix of zeros to fill in.

In [37]:
# Placeholder to map
n_rows = tracks_data_cat.shape[0]
n_cols = len(artists_unique)

mapped_artists = np.zeros((n_rows, n_cols), dtype=int)
mapped_artists

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

After that, we iterate over all the values

In [38]:
# Iterate over the whole data
for i, artists in enumerate(tracks_data_cat['artists']):
    # Split the artists
    artists_split = artists.split(';')

    # Convert the artists to index for mapping
    artists_split_idx = [artist_to_idx[artist] for artist in artists_split]

    # Assign value 1 on the index in the placeholder
    mapped_artists[i, artists_split_idx] = 1

# Print the results
mapped_artists

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Finally, convert it to pandas DataFrame

In [39]:
artists_data = pd.DataFrame(mapped_artists)
artists_data.columns = artists_unique
artists_data.index = tracks_data_cat.index

artists_data.head()

Unnamed: 0_level_0,$pyda,&ME,'Bat Out Of Hell' Original Cast,-M-,10cc,1349,14 Bis,19,3 Doors Down,311,...,二口魔菜 Futakuchi Mana,小畑貴裕,尤雅,李龍基,櫻坂46,盧業瑂,菲道尔,邱芸子,陳中,陳慧珊
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
07tYyJZJyu7SYYG4EWKkGS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7kxfWvj6u9oWQ5C36kMtGb,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7GDwYUuvu1qjah9rAH15AH,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5GDtkgG9T1BDknHHyDtghv,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20iJjrQc6GRWDncFtrqWnd,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
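
As a cross-check on the manual steps above, pandas can build the same kind of multi-hot matrix directly with `Series.str.get_dummies`. The sketch below is only an illustration: the toy `artists` values and the index labels are assumptions, not rows from the real catalogue.

```python
import pandas as pd

# Toy stand-in for tracks_data_cat['artists'] (hypothetical values)
toy_artists = pd.Series(
    ['Jeff Bernat', 'Chilam;Xu Qiu Yi', 'Rota;Jeff Bernat'],
    index=['id_1', 'id_2', 'id_3'],
    name='artists',
)

# Split each entry on ';' and build one 0/1 column per unique artist
toy_artists_ohe = toy_artists.str.get_dummies(sep=';')

print(toy_artists_ohe)
```

Each row has a 1 in every column whose artist appears in that row, which is the same encoding the loop above produces.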


Finally, we wrap these operations in a class.

**Note**: We want to create a sklearn-like OHE encoder. The code is slightly more complex, but the operations are the same as above.

In [40]:
class OHEVectorizer:
    """
    Transform data to OHE
    """
    def __init__(self, sep=';'):
        self.sep = sep

    def _generate_unique_values(self, data):
        """Generate unique values from a given data for each columns"""
        # Set a placeholder
        self.values_unique = {}

        # Iterate over columns
        for col in data.columns:
            values_list = []

            # Iterate over data
            for values in data[col].values:
                # Split the values by `sep`
                values_split = values.split(self.sep)

                # Append
                values_list += values_split

            # Get the unique values
            values_unique = set(values_list)
            self.values_unique[col] = sorted(values_unique)

    def _generate_values_map(self, data):
        """Get the value to index and index to value map"""
        # Set a placeholder
        self.values_mapping = {}

        # Iterate over columns
        for col in data.columns:
            # Get the unique values
            values_unique = self.values_unique[col]

            # Create the mapping
            val_to_idx = {val:idx for idx, val in enumerate(values_unique)}
            idx_to_val = {idx:val for val, idx in val_to_idx.items()}

            # Save
            self.values_mapping[col] = {
                'val_to_idx': val_to_idx,
                'idx_to_val': idx_to_val
            }

    def fit(self, data):
        """
        Fit the OHE of all data
        """
        # 1. Generate the unique values
        self._generate_unique_values(data)

        # 2. Generate the mapping id
        self._generate_values_map(data)

    def transform(self, data):
        """
        Transform a sample data
        """
        # Set a placeholder
        mapped_cols = []
        n_rows = data.shape[0]

        # Iterate over columns
        for col in data.columns:
            # Extract encoder
            values_unique = self.values_unique[col]
            val_to_idx = self.values_mapping[col]['val_to_idx']

            # Create a mapped value for col data
            n_cols = len(values_unique)
            mapped_col = np.zeros((n_rows, n_cols), dtype=int)

            # Iterate over data
            for i, values in enumerate(data[col].values):
                # Split the values by `sep`
                values_split = values.split(self.sep)

                # Convert the values to index for mapping
                values_split_idx = [val_to_idx[val] for val in values_split]

                # Assign value 1 on the index in the placeholder
                mapped_col[i, values_split_idx] = 1

            # Convert to pandas DataFrame
            mapped_ohe = pd.DataFrame(mapped_col)
            mapped_ohe.columns = values_unique
            mapped_ohe.index = data.index

            # Append
            mapped_cols.append(mapped_ohe)

        # Concat the new data
        mapped_data = pd.concat(mapped_cols, axis = 1)

        return mapped_data


Let's create the ohe vectorizer

In [41]:
# Create an object
ohe_vectorizer = OHEVectorizer()

# Fit the object with our current data
ohe_vectorizer.fit(data = tracks_data_cat)

In [42]:
# Transform current data with ohe vectorizer
tracks_data_ohe = ohe_vectorizer.transform(tracks_data_cat)

print('Original data shape :', tracks_data_cat.shape)
print('OHE data shape      :', tracks_data_ohe.shape)

Original data shape : (2280, 2)
OHE data shape      : (2280, 2684)
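
One thing to keep in mind: `transform` looks up every value in `val_to_idx`, so a value that was not present during `fit` (for example, a newly added artist) would raise a `KeyError`. Below is a minimal sketch of one possible way to handle this, assuming we simply skip unknown values. The helper name `multi_hot_row` and the toy vocabulary are hypothetical and not part of the class above.

```python
import numpy as np

def multi_hot_row(values, val_to_idx, sep=';'):
    """Encode one `sep`-separated string, skipping values unseen at fit time."""
    row = np.zeros(len(val_to_idx), dtype=int)
    for val in values.split(sep):
        idx = val_to_idx.get(val)   # None if the value is unknown
        if idx is not None:
            row[idx] = 1
    return row

# Hypothetical vocabulary learned during fit
val_to_idx = {'Alter Bridge': 0, 'Jeff Bernat': 1, 'Rota': 2}

known_row = multi_hot_row('Rota;Jeff Bernat', val_to_idx)
unseen_row = multi_hot_row('Rota;Brand New Artist', val_to_idx)
print(known_row, unseen_row)
```

Skipping unknown values is only one design choice; raising an explicit error can be preferable when unseen values indicate a data problem.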


In [43]:
tracks_data_cat.head()

Unnamed: 0_level_0,artists,track_genre
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1
07tYyJZJyu7SYYG4EWKkGS,Jeff Bernat,chill
7kxfWvj6u9oWQ5C36kMtGb,Alter Bridge,metal
7GDwYUuvu1qjah9rAH15AH,Chilam;Xu Qiu Yi,cantopop
5GDtkgG9T1BDknHHyDtghv,Cory Asbury,world-music
20iJjrQc6GRWDncFtrqWnd,Rota,turkish


In [44]:
tracks_data_ohe.head()

Unnamed: 0_level_0,$pyda,&ME,'Bat Out Of Hell' Original Cast,-M-,10cc,1349,14 Bis,19,3 Doors Down,311,...,spanish,study,swedish,synth-pop,tango,techno,trance,trip-hop,turkish,world-music
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
07tYyJZJyu7SYYG4EWKkGS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7kxfWvj6u9oWQ5C36kMtGb,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7GDwYUuvu1qjah9rAH15AH,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5GDtkgG9T1BDknHHyDtghv,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
20iJjrQc6GRWDncFtrqWnd,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


## Numerical with Categorical Features
---

In this section, we represent items with both numerical and categorical features.

In [45]:
# Numerical features
tracks_data_num.head()

Unnamed: 0_level_0,popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
07tYyJZJyu7SYYG4EWKkGS,0.614583,0.033018,0.695431,0.429965,0.77891,0.033962,0.725903,3e-06,0.114173,0.485398,0.486375
7kxfWvj6u9oWQ5C36kMtGb,0.625,0.048581,0.280203,0.657979,0.876084,0.034486,0.083935,0.000554,0.141111,0.161128,0.783012
7GDwYUuvu1qjah9rAH15AH,0.479167,0.053186,0.816244,0.528971,0.802597,0.031971,0.278112,0.0,0.069208,0.728097,0.485767
5GDtkgG9T1BDknHHyDtghv,0.645833,0.047413,0.414213,0.706982,0.876504,0.035744,0.189758,0.0,0.622876,0.335347,0.73592
20iJjrQc6GRWDncFtrqWnd,0.427083,0.027,0.790863,0.457966,0.686212,0.139413,0.098493,0.028929,0.065271,0.054683,0.476656


In [46]:
# Categorical (OHE) features
tracks_data_ohe.head()

Unnamed: 0_level_0,$pyda,&ME,'Bat Out Of Hell' Original Cast,-M-,10cc,1349,14 Bis,19,3 Doors Down,311,...,spanish,study,swedish,synth-pop,tango,techno,trance,trip-hop,turkish,world-music
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
07tYyJZJyu7SYYG4EWKkGS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7kxfWvj6u9oWQ5C36kMtGb,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7GDwYUuvu1qjah9rAH15AH,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5GDtkgG9T1BDknHHyDtghv,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
20iJjrQc6GRWDncFtrqWnd,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


Now, let's join them

In [47]:
tracks_data_num_ohe = pd.concat((tracks_data_num, tracks_data_ohe),
                                axis = 1)

print('Combined data shape :', tracks_data_num_ohe.shape)
tracks_data_num_ohe.head()

Combined data shape : (2280, 2695)


Unnamed: 0_level_0,popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,...,spanish,study,swedish,synth-pop,tango,techno,trance,trip-hop,turkish,world-music
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
07tYyJZJyu7SYYG4EWKkGS,0.614583,0.033018,0.695431,0.429965,0.77891,0.033962,0.725903,3e-06,0.114173,0.485398,...,0,0,0,0,0,0,0,0,0,0
7kxfWvj6u9oWQ5C36kMtGb,0.625,0.048581,0.280203,0.657979,0.876084,0.034486,0.083935,0.000554,0.141111,0.161128,...,0,0,0,0,0,0,0,0,0,0
7GDwYUuvu1qjah9rAH15AH,0.479167,0.053186,0.816244,0.528971,0.802597,0.031971,0.278112,0.0,0.069208,0.728097,...,0,0,0,0,0,0,0,0,0,0
5GDtkgG9T1BDknHHyDtghv,0.645833,0.047413,0.414213,0.706982,0.876504,0.035744,0.189758,0.0,0.622876,0.335347,...,0,0,0,0,0,0,0,0,0,1
20iJjrQc6GRWDncFtrqWnd,0.427083,0.027,0.790863,0.457966,0.686212,0.139413,0.098493,0.028929,0.065271,0.054683,...,0,0,0,0,0,0,0,0,1,0


# **3. Content Based Recommendations: Using Last Activity Data**
---

Let's say we want to recommend items based on the last item a user interacted with. For this, we can use Content Based Recommendation.

<center>
<img src="../assets/similarity_system.png" width=300 height=300>
</center>

How to do it?
1. Calculate the similarity between items in the item catalogue
2. Find similar items based on the user's last consumed item
3. Recommend the top `N` most similar items

## Calculate Similarity Between Items in Catalogue
---

There are several similarity functions we can use:
1. Jaccard
2. Euclidean
3. Cosine
4. Pearson Correlation

Let's try to build all of them

### Jaccard Similarity
---

Jaccard similarity measures similarity by counting the features that are active in both items, divided by the number of features that are active in at least one of them.

The similarity ranges between 0 and 1.
- Not similar : 0
- Similar : 1

**Note**: Use Jaccard only for OHE features

$$
J(A,B) = \frac{|A \cap B|}{|A \cup  B |}
$$

With :

- $A$ = item A vector
- $B$ = item B vector
- $|A \cap B|$ = number of active features shared by $A$ and $B$
- $|A \cup B|$ = number of features that are active in $A$ or $B$ (or both)
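
To make the formula concrete, here is a small sketch that treats each item as a Python set of its active features. The feature names and the helper `jaccard_from_sets` are made up for illustration.

```python
def jaccard_from_sets(active_A, active_B):
    """Jaccard similarity between two non-empty sets of active feature names."""
    n_intersect = len(active_A & active_B)   # |A ∩ B|
    n_union = len(active_A | active_B)       # |A ∪ B|
    return n_intersect / n_union

# Hypothetical items described by their active OHE features
item_A = {'artist_1', 'artist_2', 'pop'}
item_B = {'artist_3', 'pop'}

sim_sets = jaccard_from_sets(item_A, item_B)
print(sim_sets)
```

The two items share one active feature (`pop`) out of four distinct active features, so the similarity is 1/4.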




Next, we implement the Jaccard similarity in Python

In [48]:
def jaccard_similarity(vec_A, vec_B):
    """Calculate Jaccard similarity"""
    # Convert to vector
    vec_A = np.array(vec_A)
    vec_B = np.array(vec_B)

    # Find intersection & number of items
    n_intersect = np.sum((vec_A==1) & (vec_B==1))


    n_item = np.sum((vec_A==1) | (vec_B==1))

    # Calculate the similarity
    sim = n_intersect / n_item

    return sim


Let's validate

In [49]:
# Let
A = [1, 0, 0, 1, 0, 1, 1]
B = [1, 1, 0, 0, 0, 1, 0]

# Then the Jaccard similarity would be
# n_intersect = 2 (on features 1 and 6)
# n_item = 5 (on features 1, 2, 4, 6, and 7)
# Jaccard sim = 2/5 = 0.4

# Calculate automatically
sim = jaccard_similarity(vec_A = A, vec_B = B)
sim

0.4

For a more complex example, let's compare real tracks from our catalogue.

Let's check the similarity of a **pop** target track, **Nashe Si Chadh Gayi** by Vishal-Shekhar, Arijit Singh and Jaideep Sahni (`0biCSADTAblvLTLtJz4pXO`), with
- option A: **Out of Time** by The Weeknd (`2eyXrWRxc7g5yx0D4J3WTS`), another **pop** track
- option B: **Tides** by Armored Dawn (`0h20ZtzbQa7H9YzfYvV8pd`), a **metal** track

The cells below first browse the data, then select these tracks by genre.

Sanity Check:
- The target and option A are **pop** songs while option B is a **metal** song.
- Hence option A should be more similar to the target than option B.

In [50]:
tracks_data.loc[tracks_data['artists']=='BTS']

Unnamed: 0_level_0,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
4ncTUgTfUV3wrjTPzKvn01,BTS,Yet To Come (Hyundai Ver.),Yet To Come (Hyundai Ver.),75,223535,False,0.508,0.842,1,-3.992,1,0.0612,0.078,0.0,0.0858,0.44,86.041,4,k-pop
2Ygw4CPjg1lg4zxTITYY2V,BTS,Proof,DNA,63,223441,False,0.595,0.754,1,-4.527,0,0.0554,0.0245,0.0,0.0726,0.672,129.951,4,k-pop
5rNZsITjXf23iFkRA924FV,BTS,Love Yourself 轉 'Tear',Intro: Singularity,68,196998,False,0.775,0.272,1,-9.126,0,0.091,0.56,0.00278,0.112,0.198,103.991,3,k-pop


In [51]:
tracks_data

Unnamed: 0_level_0,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
07tYyJZJyu7SYYG4EWKkGS,Jeff Bernat,This Time,This Time,59,185991,False,0.685,0.430,4,-9.586,1,0.0324,0.723000,0.000003,0.1350,0.4820,107.042,4,chill
7kxfWvj6u9oWQ5C36kMtGb,Alter Bridge,Blackbird,Watch Over You,60,259133,False,0.276,0.658,7,-5.418,1,0.0329,0.083600,0.000553,0.1610,0.1600,172.326,4,metal
7GDwYUuvu1qjah9rAH15AH,Chilam;Xu Qiu Yi,EMI 星聲傳集之張智霖,現代愛情故事,46,280773,False,0.804,0.529,0,-8.570,1,0.0305,0.277000,0.000000,0.0916,0.7230,106.908,4,cantopop
5GDtkgG9T1BDknHHyDtghv,Cory Asbury,The Father's House,The Father's House - Studio,62,253640,False,0.408,0.707,1,-5.400,1,0.0341,0.189000,0.000000,0.6260,0.3330,161.962,4,world-music
20iJjrQc6GRWDncFtrqWnd,Rota,Kukla,Kukla,41,157714,False,0.779,0.458,4,-13.562,0,0.1330,0.098100,0.028900,0.0878,0.0543,104.903,4,turkish
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0bLavFKcMljHzTJAx3sHNS,Capital Inicial;Seu Jorge,Capital Inicial Acústico NYC (Ao Vivo) [Deluxe],À Sua Maneira (De Música Ligera) (feat. Seu Jo...,49,267120,False,0.582,0.813,0,-4.558,1,0.0295,0.081900,0.001190,0.9020,0.3450,109.934,4,brazil
4MRylYuw7JcYVuSo3YFy5f,The Replacements,All for Nothing / Nothing for All,"Another Girl, Another Planet - Live",21,151893,False,0.228,0.980,4,-6.158,1,0.2510,0.000651,0.000259,0.5450,0.2260,160.512,4,power-pop
6jQPolTBq89z9AHrMrmYpS,SLANDER;Eric Leva,Superhuman,Superhuman,54,268611,False,0.569,0.727,7,-6.415,1,0.0437,0.028000,0.000000,0.9490,0.1600,127.908,4,dub
2XG8wjGeLpSeE3iublMgcG,Agents Of Time,Tsugi Club,Polina,0,471140,False,0.761,0.635,9,-7.386,1,0.0379,0.000302,0.918000,0.1040,0.1060,124.556,4,techno


In [52]:
# Get the target ID

tracks_target = tracks_data.loc[tracks_data['track_genre']=='pop',:].head(1)

tracks_target.T

track_id,0biCSADTAblvLTLtJz4pXO
artists,Vishal-Shekhar;Arijit Singh;Jaideep Sahni
album_name,Befikre
track_name,Nashe Si Chadh Gayi
popularity,69
duration_ms,237875
explicit,False
danceability,0.799
energy,0.784
key,4
loudness,-5.207


In [53]:
# Get the tracks option A
tracks_pop_other = tracks_data.loc[tracks_data['track_genre']=='pop',:].tail(1)

tracks_pop_other.T

track_id,2eyXrWRxc7g5yx0D4J3WTS
artists,The Weeknd
album_name,HÖST
track_name,Out of Time
popularity,6
duration_ms,214190
explicit,False
danceability,0.629
energy,0.763
key,0
loudness,-4.279


In [54]:
# Get the tracks option B
metal_tracks = tracks_data.loc[tracks_data['track_genre']=='metal',:].tail(1)

metal_tracks.T

track_id,0h20ZtzbQa7H9YzfYvV8pd
artists,Armored Dawn
album_name,Tides
track_name,Tides
popularity,53
duration_ms,244808
explicit,False
danceability,0.444
energy,0.97
key,10
loudness,-3.638


In [55]:
tracks_data_ohe

Unnamed: 0_level_0,$pyda,&ME,'Bat Out Of Hell' Original Cast,-M-,10cc,1349,14 Bis,19,3 Doors Down,311,...,spanish,study,swedish,synth-pop,tango,techno,trance,trip-hop,turkish,world-music
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
07tYyJZJyu7SYYG4EWKkGS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7kxfWvj6u9oWQ5C36kMtGb,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7GDwYUuvu1qjah9rAH15AH,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5GDtkgG9T1BDknHHyDtghv,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
20iJjrQc6GRWDncFtrqWnd,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0bLavFKcMljHzTJAx3sHNS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4MRylYuw7JcYVuSo3YFy5f,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6jQPolTBq89z9AHrMrmYpS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2XG8wjGeLpSeE3iublMgcG,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


Let's calculate the similarity using Jaccard

In [56]:
# Calculate the similarity
sim_A = jaccard_similarity(vec_A = tracks_data_ohe.loc[tracks_target.index],
                           vec_B = tracks_data_ohe.loc[tracks_pop_other.index])

sim_B = jaccard_similarity(vec_A = tracks_data_ohe.loc[tracks_target.index],
                           vec_B = tracks_data_ohe.loc[metal_tracks.index])

sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

(0.2, 0.0, True, 0.2)

Great! Option A (Out of Time by The Weeknd) is more similar to the target track than option B (Tides by Armored Dawn)

What if we calculate the Jaccard similarity using numerical features?

In [57]:
tracks_data_num.loc[tracks_target.index]

Unnamed: 0_level_0,popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0biCSADTAblvLTLtJz4pXO,0.71875,0.044058,0.811168,0.783987,0.881003,0.054403,0.110441,3e-06,0.20949,0.261833,0.440533


In [58]:
# Calculate the similarity
sim_A = jaccard_similarity(vec_A = tracks_data_num.loc[tracks_target.index],
                           vec_B = tracks_data_num.loc[tracks_pop_other.index])

sim_B = jaccard_similarity(vec_A = tracks_data_num.loc[tracks_target.index],
                           vec_B = tracks_data_num.loc[metal_tracks.index])

sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

  sim = n_intersect / n_item


(nan, nan, False, nan)

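We got `nan` (with a runtime warning) because of a **division by zero**. Why?
- Because the numerical features are not one-hot encoded: no value in these vectors equals 1, so both the intersection and the union are empty (0/0)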
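
If we want the function to stay well-defined in this situation, one option is to guard the empty-union case. The sketch below is a hypothetical variant (`jaccard_similarity_safe` is not used elsewhere in this notebook), and returning `0.0` for an empty union is an assumption, not the only reasonable convention.

```python
import numpy as np

def jaccard_similarity_safe(vec_A, vec_B):
    """Jaccard similarity that returns 0.0 when no feature is active."""
    vec_A = np.asarray(vec_A).ravel()
    vec_B = np.asarray(vec_B).ravel()

    n_intersect = np.sum((vec_A == 1) & (vec_B == 1))
    n_item = np.sum((vec_A == 1) | (vec_B == 1))

    # Assumption: with an empty union we define the similarity as 0.0
    if n_item == 0:
        return 0.0
    return float(n_intersect / n_item)

# Numerical (non-binary) toy vectors: no entry equals 1, so the union is empty
sim_num = jaccard_similarity_safe([0.72, 0.04, 0.81], [0.06, 0.03, 0.63])

# Binary vectors from the earlier validation example
sim_bin = jaccard_similarity_safe([1, 0, 0, 1, 0, 1, 1], [1, 1, 0, 0, 0, 1, 0])

print(sim_num, sim_bin)
```

Even with the guard, the score carries no information for numerical features, so Jaccard should still be reserved for OHE data.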

### Euclidean Similarity
---

Euclidean similarity measures similarity based on the Euclidean distance.

The similarity ranges between 0 and 1.
- Not similar : close to 0
- Similar : 1

**Note**: You can use this similarity function with numerical or categorical (OHE) vectors

First, calculate the Euclidean distance

$$
\text{Euclidean}_{\text{distance}}(A, B)
=
\sqrt{
    \sum_{i=1}^{n}
    \left (
        A_{i} - B_{i}
    \right )^{2}
}
$$

Finally we can calculate the similarity

$$
\text{Euclidean}_{\text{similarity}}(A, B)
=
\frac
{1.0}
{1.0 + \text{Euclidean}_{\text{distance}}(A, B)}
$$

Let's implement these similarity functions

In [59]:
def euclidean_distance(vec_A, vec_B):
    """Calculate the Euclidean distance between vec A and vec B"""
    # Transform to vector
    vec_A = np.array(vec_A)
    vec_B = np.array(vec_B)

    # Calculate the distance
    inside_the_root = np.sum((vec_A - vec_B)**2)
    dist = np.sqrt(inside_the_root)

    return dist

def euclidean_similarity(vec_A, vec_B):
    """Compute the Euclidean similarity between vec A and vec B"""
    # Calculate the Euclidean distance
    dist_AB = euclidean_distance(vec_A, vec_B)

    # Calculate the similarity
    sim = 1.0 / (1.0 + dist_AB)

    return sim


Let's validate

In [60]:
# Let
A = [1, 0, 0, 1, 1, 1, 1]
B = [1, 1, 0, 0, 0, 1, 0]

# Then the Euclidean distance would be
# dist_AB = 2.0
# And the similarity would be
# sim_AB = 1.0 / (1.0 + 2.0) = 1/3 ~ 0.33

# Calculate automatically
sim = euclidean_similarity(vec_A = A, vec_B = B)
sim

0.3333333333333333

Let's validate with more complex vectors using the previous examples

In [61]:
# Calculate the similarity
sim_A = euclidean_similarity(vec_A = tracks_data_num.loc[tracks_target.index],
                             vec_B = tracks_data_num.loc[tracks_pop_other.index])

sim_B = euclidean_similarity(vec_A = tracks_data_num.loc[tracks_target.index],
                             vec_B = tracks_data_num.loc[metal_tracks.index])


sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

(0.5289160951881118, 0.6223155326851512, False, -0.09339943749703938)

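Oops! We have a **wrong** result: with numerical features only, the metal track (option B) scores higher than the pop track (option A).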

Let's explore with the OHE data

In [62]:
# Calculate the similarity
sim_A = euclidean_similarity(vec_A = tracks_data_ohe.loc[tracks_target.index],
                             vec_B = tracks_data_ohe.loc[tracks_pop_other.index])

sim_B = euclidean_similarity(vec_A = tracks_data_ohe.loc[tracks_target.index],
                             vec_B = tracks_data_ohe.loc[metal_tracks.index])


sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

(0.3333333333333333, 0.28989794855663564, True, 0.04343538477669767)

Now we have a good result

Let's explore with numerical + OHE data

In [63]:
# Calculate the similarity
sim_A = euclidean_similarity(vec_A = tracks_data_num_ohe.loc[tracks_target.index],
                             vec_B = tracks_data_num_ohe.loc[tracks_pop_other.index])

sim_B = euclidean_similarity(vec_A = tracks_data_num_ohe.loc[tracks_target.index],
                             vec_B = tracks_data_num_ohe.loc[metal_tracks.index])




sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

(0.31354306040913904, 0.2838042820120055, True, 0.029738778397133514)

Great results!

**Quick Question**
- What if you did not normalize the numerical features?
- Would you get similar results?
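
As a rough illustration for the first question, the sketch below computes the Euclidean distance on raw values, restricted to four features copied from the track tables above (an arbitrary subset chosen for brevity). Because `duration_ms` is in the hundreds of thousands while the other features are small, it should dominate the distance almost entirely.

```python
import numpy as np

# Raw (unnormalized) values copied from the track tables above:
# [popularity, duration_ms, danceability, energy]
target   = np.array([69, 237875, 0.799, 0.784])  # pop target
option_A = np.array([ 6, 214190, 0.629, 0.763])  # other pop track
option_B = np.array([53, 244808, 0.444, 0.970])  # metal track

# Euclidean distance on the raw values
dist_A = np.linalg.norm(target - option_A)
dist_B = np.linalg.norm(target - option_B)

# Gap in duration_ms alone, for comparison
gap_A = abs(target[1] - option_A[1])
gap_B = abs(target[1] - option_B[1])

print(f'option A: distance = {dist_A:.2f}, duration gap = {gap_A:.0f}')
print(f'option B: distance = {dist_B:.2f}, duration gap = {gap_B:.0f}')
```

In this sketch each distance is nearly identical to the duration gap, which suggests that without normalization the similarity would mostly rank tracks by how close their lengths are.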

### Cosine Similarity
---

Instead of measuring distance, cosine similarity measures the angle formed between two vectors.

The similarity ranges between -1 and 1.
- Opposite : -1
- Not similar : 0
- Similar : 1

**Note**: You can calculate the cosine similarity with numerical / categorical features

$$
\cos (\theta)
=
\cfrac
{A \cdot B}
{||A|| \ ||B||}
$$

$$
||A|| =
\sqrt{
    \sum_{i=1}^{m} A_{i}^{2}
}
$$

$$
A \cdot B
=
\sum_{i=1}^{m} A_{i} B_{i}
$$



with :
- $||A||$ = Norm of vector from entity A
- $||B||$ = Norm of vector from entity B
- $A \cdot B$ = Dot product between vector A and B

Let's create a function for this similarity

In [64]:
def cosine_similarity(vec_A, vec_B):
    """Calculate the cosine similarity between vec A and vec B"""
    # Find the norm
    norm_A = np.linalg.norm(vec_A)
    norm_B = np.linalg.norm(vec_B)

    # Find the dot
    dot_AB = np.dot(vec_A, vec_B)

    # Calculate the similarity
    sim = dot_AB / (norm_A * norm_B)

    return sim


Let's validate

In [65]:
# Let
A = [1, 0, 0, 0, 0, 1, 1]
B = [1, 1, 0, 0, 0, 1, 0]

# Then the Cosine Similarity would be
# Norm A = sqrt(3),     Norm B = sqrt(3)
# Dot AB = 2
# Cosine similarity = 2/(sqrt(3) * sqrt(3))
# Cosine similarity = 2/3 = 0.67

# Calculate automatically
sim = cosine_similarity(vec_A = A, vec_B = B)
sim

0.6666666666666667
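
The three reference values of the range (-1, 0, and 1) can also be checked on small vectors. The sketch below uses a separate helper, `cosine_sketch`, and made-up vectors; it also shows that scaling a vector does not change the cosine similarity, since only the direction matters.

```python
import numpy as np

def cosine_sketch(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

base = np.array([1.0, 2.0, 3.0])          # hypothetical feature vector
scaled = 10 * base                        # same direction, 10x the magnitude
opposite = -base                          # opposite direction
orthogonal = np.array([3.0, 0.0, -1.0])   # dot product with base is 0

cos_scaled = cosine_sketch(base, scaled)
cos_opposite = cosine_sketch(base, opposite)
cos_orthogonal = cosine_sketch(base, orthogonal)

print(cos_scaled, cos_opposite, cos_orthogonal)
```

Note that our normalized numerical features and OHE features are all non-negative, so in this notebook the cosine similarity stays between 0 and 1.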

Let's validate with more complex vectors using the previous examples

In [66]:
tracks_data_num

Unnamed: 0_level_0,popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
07tYyJZJyu7SYYG4EWKkGS,0.614583,0.033018,0.695431,0.429965,0.778910,0.033962,0.725903,0.000003,0.114173,0.485398,0.486375
7kxfWvj6u9oWQ5C36kMtGb,0.625000,0.048581,0.280203,0.657979,0.876084,0.034486,0.083935,0.000554,0.141111,0.161128,0.783012
7GDwYUuvu1qjah9rAH15AH,0.479167,0.053186,0.816244,0.528971,0.802597,0.031971,0.278112,0.000000,0.069208,0.728097,0.485767
5GDtkgG9T1BDknHHyDtghv,0.645833,0.047413,0.414213,0.706982,0.876504,0.035744,0.189758,0.000000,0.622876,0.335347,0.735920
20iJjrQc6GRWDncFtrqWnd,0.427083,0.027000,0.790863,0.457966,0.686212,0.139413,0.098493,0.028929,0.065271,0.054683,0.476656
...,...,...,...,...,...,...,...,...,...,...,...
0bLavFKcMljHzTJAx3sHNS,0.510417,0.050281,0.590863,0.812988,0.896134,0.030922,0.082228,0.001191,0.908827,0.347432,0.499516
4MRylYuw7JcYVuSo3YFy5f,0.218750,0.025762,0.231472,0.979999,0.858831,0.263103,0.000653,0.000259,0.538956,0.227593,0.729331
6jQPolTBq89z9AHrMrmYpS,0.562500,0.050598,0.577665,0.726983,0.852840,0.045807,0.028111,0.000000,0.957522,0.161128,0.581186
2XG8wjGeLpSeE3iublMgcG,0.000000,0.093694,0.772589,0.634977,0.830201,0.039727,0.000302,0.918919,0.082056,0.106747,0.565955


In [67]:
# Calculate the similarity


sim_A = cosine_similarity(vec_A = tracks_data_num.loc[tracks_target.index].values.tolist()[0],
                          vec_B = tracks_data_num.loc[tracks_pop_other.index].values.tolist()[0])


sim_B = cosine_similarity(vec_A = tracks_data_num.loc[tracks_target.index].values.tolist()[0],
                          vec_B = tracks_data_num.loc[metal_tracks.index].values.tolist()[0])

sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

(0.8607391185890462, 0.9379887129460561, False, -0.07724959435700995)

We do not get a good result here either

Let's change it to OHE features

In [68]:
# Calculate the similarity
sim_A = cosine_similarity(vec_A = tracks_data_ohe.loc[tracks_target.index].values.tolist()[0],
                          vec_B = tracks_data_ohe.loc[tracks_pop_other.index].values.tolist()[0])

sim_B = cosine_similarity(vec_A = tracks_data_ohe.loc[tracks_target.index].values.tolist()[0],
                          vec_B = tracks_data_ohe.loc[metal_tracks.index].values.tolist()[0])



sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

(0.35355339059327373, 0.0, True, 0.35355339059327373)

This result is correct. Great!

What if we use both features?

In [69]:
# Calculate the similarity
sim_A = cosine_similarity(vec_A = tracks_data_num_ohe.loc[tracks_target.index].values.tolist()[0],
                          vec_B = tracks_data_num_ohe.loc[tracks_pop_other.index].values.tolist()[0])

sim_B = cosine_similarity(vec_A = tracks_data_num_ohe.loc[tracks_target.index].values.tolist()[0],
                          vec_B = tracks_data_num_ohe.loc[metal_tracks.index].values.tolist()[0])

sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

(0.5996673789650707, 0.47133750950195125, True, 0.12832986946311947)

Nice! The result is also correct

### Pearson Correlation Similarity
---

Let's improve our similarity function by normalizing the data (subtracting each vector's mean) before computing its similarity, i.e. the Pearson Correlation

The similarity ranges between -1 and 1.
- Opposite (strong negative correlation) : -1
- Not similar (not correlated) : 0
- Similar (strong positive correlation) : 1

**Note**: You can calculate the Pearson similarity with numerical / categorical features

$$
r
=
\cfrac
{\text{cov}(A, B)}
{
\sqrt{\text{cov}(A, A) \cdot \text{cov}(B, B)}
}
$$

$$
\text{cov}(A, B)
=
\cfrac{1}{m-1}
\sum_{i=1}^{m}
\left( A_{i} - \bar{A} \right)
\left( B_{i} - \bar{B} \right)
$$

with :
- $r$ = Pearson Correlation Coefficient
- $\text{cov}(A, B)$ = covariance between vector A and vector B
- $\bar{A}, \bar{B}$ = average of vector A and vector B, respectively
- $m$ = number of features in the vectors

Let's create a function for this similarity

In [70]:
def pearson_similarity(vec_A, vec_B):
    """Calculate the Pearson similarity between vec A and vec B"""
    # Calculate the numerator & denominator
    numerator = np.cov(vec_A, vec_B)[0, 1]
    denominator = np.sqrt(np.cov(vec_A) * np.cov(vec_B))

    # Calculate the similarity
    sim = numerator / denominator

    return sim


Let's validate

In [71]:
# Let
A = [1, 0, 0, 0, 0, 1, 1]
B = [1, 1, 0, 0, 0, 1, 0]

# Then the Pearson Similarity would be
# numerator = (7*2) - (3)*(3) = 14 - 9 = 5
# denominator = sqrt( ((7*3)-(3^2)) * ((7*3)-(3^2)) )
# denominator = sqrt(12 * 12)
# denominator = 12
# Pearson similarity = numerator / denominator = 5/12 ~ 0.417

# Calculate automatically
sim = pearson_similarity(vec_A = A, vec_B = B)
sim

0.4166666666666667
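
The Pearson correlation can also be read as the cosine similarity of the mean-centered vectors. The sketch below checks this on the same toy vectors and compares the value against NumPy's `np.corrcoef`; the variable names are chosen for illustration only.

```python
import numpy as np

A = np.array([1, 0, 0, 0, 0, 1, 1], dtype=float)
B = np.array([1, 1, 0, 0, 0, 1, 0], dtype=float)

# Center each vector by subtracting its own mean
A_centered = A - A.mean()
B_centered = B - B.mean()

# Cosine similarity of the centered vectors
r_centered = np.dot(A_centered, B_centered) / (
    np.linalg.norm(A_centered) * np.linalg.norm(B_centered)
)

# Reference value from NumPy's correlation matrix
r_numpy = np.corrcoef(A, B)[0, 1]

print(r_centered, r_numpy)
```

Both values agree with the hand calculation of 5/12 above.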

Let's validate with more complex vectors using the previous examples

In [72]:
# Calculate the similarity
sim_A = pearson_similarity(vec_A = tracks_data_num.loc[tracks_target.index],
                           vec_B = tracks_data_num.loc[tracks_pop_other.index])

sim_B = pearson_similarity(vec_A = tracks_data_num.loc[tracks_target.index],
                           vec_B = tracks_data_num.loc[metal_tracks.index])

sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

(0.6655910530396849, 0.8499038161513984, False, -0.18431276311171352)

No good results from the numerical data.

Let's change it to OHE features

In [73]:
# Calculate the similarity
sim_A = pearson_similarity(vec_A = tracks_data_ohe.loc[tracks_target.index],
                           vec_B = tracks_data_ohe.loc[tracks_pop_other.index])

sim_B = pearson_similarity(vec_A = tracks_data_ohe.loc[tracks_target.index],
                           vec_B = tracks_data_ohe.loc[metal_tracks.index])

sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

(0.3528940466398676, -0.0010549896760534126, True, 0.353949036315921)

This result is better. Great!

What if we use both features?

In [74]:
# Calculate the similarity
sim_A = pearson_similarity(vec_A = tracks_data_num_ohe.loc[tracks_target.index],
                           vec_B = tracks_data_num_ohe.loc[tracks_pop_other.index])

sim_B = pearson_similarity(vec_A = tracks_data_num_ohe.loc[tracks_target.index],
                           vec_B = tracks_data_num_ohe.loc[metal_tracks.index])

sim_A, sim_B, sim_A > sim_B, sim_A - sim_B

(0.5983361441217601, 0.4695590278659101, True, 0.12877711625584998)

With both feature types, option A (the pop track) is again more similar to the target than option B (the metal track)

**Quick questions**
- Which similarity function to use?
  - Pearson Similarity
- Which data to use?
  - Numerical + OHE features

## Find Similar Items based on User Last Consumed Item
---

Now, let's create our recommendations.

First, we define our latest played item, e.g. **Superhuman** by SLANDER and Eric Leva

In [75]:
tracks_data

Unnamed: 0_level_0,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
track_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
07tYyJZJyu7SYYG4EWKkGS,Jeff Bernat,This Time,This Time,59,185991,False,0.685,0.430,4,-9.586,1,0.0324,0.723000,0.000003,0.1350,0.4820,107.042,4,chill
7kxfWvj6u9oWQ5C36kMtGb,Alter Bridge,Blackbird,Watch Over You,60,259133,False,0.276,0.658,7,-5.418,1,0.0329,0.083600,0.000553,0.1610,0.1600,172.326,4,metal
7GDwYUuvu1qjah9rAH15AH,Chilam;Xu Qiu Yi,EMI 星聲傳集之張智霖,現代愛情故事,46,280773,False,0.804,0.529,0,-8.570,1,0.0305,0.277000,0.000000,0.0916,0.7230,106.908,4,cantopop
5GDtkgG9T1BDknHHyDtghv,Cory Asbury,The Father's House,The Father's House - Studio,62,253640,False,0.408,0.707,1,-5.400,1,0.0341,0.189000,0.000000,0.6260,0.3330,161.962,4,world-music
20iJjrQc6GRWDncFtrqWnd,Rota,Kukla,Kukla,41,157714,False,0.779,0.458,4,-13.562,0,0.1330,0.098100,0.028900,0.0878,0.0543,104.903,4,turkish
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0bLavFKcMljHzTJAx3sHNS,Capital Inicial;Seu Jorge,Capital Inicial Acústico NYC (Ao Vivo) [Deluxe],À Sua Maneira (De Música Ligera) (feat. Seu Jo...,49,267120,False,0.582,0.813,0,-4.558,1,0.0295,0.081900,0.001190,0.9020,0.3450,109.934,4,brazil
4MRylYuw7JcYVuSo3YFy5f,The Replacements,All for Nothing / Nothing for All,"Another Girl, Another Planet - Live",21,151893,False,0.228,0.980,4,-6.158,1,0.2510,0.000651,0.000259,0.5450,0.2260,160.512,4,power-pop
6jQPolTBq89z9AHrMrmYpS,SLANDER;Eric Leva,Superhuman,Superhuman,54,268611,False,0.569,0.727,7,-6.415,1,0.0437,0.028000,0.000000,0.9490,0.1600,127.908,4,dub
2XG8wjGeLpSeE3iublMgcG,Agents Of Time,Tsugi Club,Polina,0,471140,False,0.761,0.635,9,-7.386,1,0.0379,0.000302,0.918000,0.1040,0.1060,124.556,4,techno


In [76]:
latest_track_id = ['6jQPolTBq89z9AHrMrmYpS']
tracks_data.loc[tracks_data.index.isin(latest_track_id)].T

track_id,6jQPolTBq89z9AHrMrmYpS
artists,SLANDER;Eric Leva
album_name,Superhuman
track_name,Superhuman
popularity,54
duration_ms,268611
explicit,False
danceability,0.569
energy,0.727
key,7
loudness,-6.415


To find the recommendations, we:
1. Iterate over all tracks
2. Calculate the similarity of each track with our target track (Superhuman)
3. Return the top n most similar tracks

In [77]:
# Add a progress bar, since the loop can take some time to finish.
from tqdm import tqdm

In [78]:
# Generate the similarity score
n_tracks = len(tracks_data_num.index)
similarity_score = np.zeros(n_tracks)

# 1. Iterate over all tracks (tracks_data_num)
track_target = tracks_data_num.loc[latest_track_id]
for i, track_id_i in enumerate(tqdm(tracks_data_num.index)):
    # Extract track_i
    track_i = tracks_data_num.loc[track_id_i]

    # 2. Calculate the similarity (we use Euclidean similarity here)
    sim_i = euclidean_similarity(vec_A = track_target,
                                 vec_B = track_i)

    # Store the score at position i
    similarity_score[i] = sim_i

100%|██████████| 2280/2280 [00:00<00:00, 10816.10it/s]


In [79]:
# Positions that would sort the similarity scores in ascending order
np.argsort(similarity_score)

array([1940, 1516,  623, ...,  152, 2275, 2277])

In [80]:
# 3.1 Sort indices in descending order of similarity_score
sorted_idx = np.argsort(similarity_score)[::-1]

# 3.2 Take the n most similar track ids; position 0 is skipped because
#     it is expected to be the target track itself
n = 5
top_tracks_id = tracks_data_num.index[sorted_idx[1:n+1]]
top_tracks_id

Index(['0bLavFKcMljHzTJAx3sHNS', '4nOn70NoLWKOwqSaSwbjR4',
       '6deng8FUWMyGHyGgqs1VYm', '0tV0CmbHg3YkzCbPdJK35e',
       '5vJC6EHHjEfEqFSkAcLIg5'],
      dtype='object', name='track_id')
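
To make the slicing in the cell above concrete, here is a minimal sketch on toy scores. The values are made up for illustration and are not taken from the dataset. Position 2 plays the role of the target track, so it gets the highest score and is skipped.

```python
import numpy as np

# Toy similarity scores (hypothetical); index 2 stands in for the target itself
scores = np.array([0.10, 0.75, 1.00, 0.40])

ascending = np.argsort(scores)   # positions from least to most similar
descending = ascending[::-1]     # positions from most to least similar
top_2 = descending[1:3]          # drop position 0 of the ranking (the target)

print(ascending)    # [0 3 1 2]
print(descending)   # [2 1 3 0]
print(top_2)        # [1 3]
```

Note that skipping the first ranked position only removes the target when no other track ties with it for the highest score.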

Finally, we can show the top n tracks alongside the latest played track.

In [81]:
latest_track_id

['6jQPolTBq89z9AHrMrmYpS']

In [82]:
top_tracks_id = list(top_tracks_id)
top_tracks_id.append(latest_track_id[0])

tracks_data.loc[tracks_data.index.isin(top_tracks_id)].T

track_id,4nOn70NoLWKOwqSaSwbjR4,6deng8FUWMyGHyGgqs1VYm,0tV0CmbHg3YkzCbPdJK35e,5vJC6EHHjEfEqFSkAcLIg5,0bLavFKcMljHzTJAx3sHNS,6jQPolTBq89z9AHrMrmYpS
artists,Eluveitie,Hillsong Worship,E-girls,-M-,Capital Inicial;Seu Jorge,SLANDER;Eric Leva
album_name,Everything Remains (As It Never Was),OPEN HEAVEN / River Wild,LIVE×ONLINE BEYOND THE BORDER (Live at Ariake ...,En Tête-A-Tête,Capital Inicial Acústico NYC (Ao Vivo) [Deluxe],Superhuman
track_name,Isara,In God We Trust,THE NEVER ENDING STORY ~君に秘密を教えよう~ - Live at A...,Mama Sam - Live 2005,À Sua Maneira (De Música Ligera) (feat. Seu Jo...,Superhuman
popularity,39,41,19,33,49,54
duration_ms,164413,241263,90848,388159,267120,268611
explicit,False,False,False,False,False,False
danceability,0.392,0.38,0.56,0.575,0.582,0.569
energy,0.772,0.742,0.726,0.855,0.813,0.727
key,4,2,6,4,0,7
loudness,-4.349,-5.677,-8.693,-6.726,-4.558,-6.415


How's the recommendation? Check it for yourself.

Let's create a function that wraps all of these steps.

In [83]:
def track_recommendation(track_id, n, track_data, similarity_func):
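    """
    Recommend n tracks based on the latest played track_id.

    track_id is a list holding one track id, track_data is the feature
    DataFrame indexed by track_id, and similarity_func accepts vec_A and vec_B.
    Returns an Index with the n most similar track ids.
    """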
    # Generate the similarity score
    n_tracks = len(track_data.index)
    similarity_score = np.zeros(n_tracks)

    # Iterate the whole tracks
    track_target = track_data.loc[track_id]
    for i, track_id_i in enumerate(tqdm(track_data.index)):
        # Extract track_i
        track_i = track_data.loc[track_id_i]

        # Calculate the similarity
        sim_i = similarity_func(vec_A = track_target,
                                vec_B = track_i)

        # Append
        similarity_score[i] = sim_i

    # Sort in descending orders of similarity_score
    sorted_idx = np.argsort(similarity_score)[::-1]

    # Return the top n similar track ids (position 0 is skipped, as it is
    # expected to be the target track itself)
    top_tracks_id = track_data.index[sorted_idx[1:n+1]]

    return top_tracks_id
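
The loop above calls the similarity function once per track. As a rough alternative, the same ranking can be computed in one vectorized step. The sketch below is only an illustration under stated assumptions: `top_n_euclidean` and the toy DataFrame are hypothetical, it takes a single track id rather than a list, and it assumes a Euclidean similarity of the form 1 / (1 + distance), which may differ from the notebook's `euclidean_similarity`.

```python
import numpy as np
import pandas as pd

def top_n_euclidean(track_id, n, track_data):
    """Return the n track ids closest to track_id (hypothetical helper)."""
    target = track_data.loc[track_id].to_numpy(dtype=float)
    # Broadcasting subtracts the target row from every row at once
    diffs = track_data.to_numpy(dtype=float) - target
    dist = np.sqrt((diffs ** 2).sum(axis=1))
    # Assumed similarity form: 1 / (1 + distance)
    sim = pd.Series(1.0 / (1.0 + dist), index=track_data.index)
    # Drop the target by label instead of relying on it being ranked first
    return sim.drop(index=track_id).nlargest(n).index

# Toy feature table (made-up values, not from the dataset)
toy = pd.DataFrame(
    {"energy": [0.9, 0.8, 0.1], "valence": [0.5, 0.6, 0.9]},
    index=pd.Index(["a", "b", "c"], name="track_id"),
)
print(list(top_n_euclidean("a", 2, toy)))  # ['b', 'c']
```

Dropping the target by its label avoids the case where a tied score pushes the target out of the first position.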


Let's test it!

Say the latest song I played is **Watch Over You** by Alter Bridge. What should you recommend to me?

In [84]:
tracks_data

track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
07tYyJZJyu7SYYG4EWKkGS,Jeff Bernat,This Time,This Time,59,185991,False,0.685,0.430,4,-9.586,1,0.0324,0.723000,0.000003,0.1350,0.4820,107.042,4,chill
7kxfWvj6u9oWQ5C36kMtGb,Alter Bridge,Blackbird,Watch Over You,60,259133,False,0.276,0.658,7,-5.418,1,0.0329,0.083600,0.000553,0.1610,0.1600,172.326,4,metal
7GDwYUuvu1qjah9rAH15AH,Chilam;Xu Qiu Yi,EMI 星聲傳集之張智霖,現代愛情故事,46,280773,False,0.804,0.529,0,-8.570,1,0.0305,0.277000,0.000000,0.0916,0.7230,106.908,4,cantopop
5GDtkgG9T1BDknHHyDtghv,Cory Asbury,The Father's House,The Father's House - Studio,62,253640,False,0.408,0.707,1,-5.400,1,0.0341,0.189000,0.000000,0.6260,0.3330,161.962,4,world-music
20iJjrQc6GRWDncFtrqWnd,Rota,Kukla,Kukla,41,157714,False,0.779,0.458,4,-13.562,0,0.1330,0.098100,0.028900,0.0878,0.0543,104.903,4,turkish
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0bLavFKcMljHzTJAx3sHNS,Capital Inicial;Seu Jorge,Capital Inicial Acústico NYC (Ao Vivo) [Deluxe],À Sua Maneira (De Música Ligera) (feat. Seu Jo...,49,267120,False,0.582,0.813,0,-4.558,1,0.0295,0.081900,0.001190,0.9020,0.3450,109.934,4,brazil
4MRylYuw7JcYVuSo3YFy5f,The Replacements,All for Nothing / Nothing for All,"Another Girl, Another Planet - Live",21,151893,False,0.228,0.980,4,-6.158,1,0.2510,0.000651,0.000259,0.5450,0.2260,160.512,4,power-pop
6jQPolTBq89z9AHrMrmYpS,SLANDER;Eric Leva,Superhuman,Superhuman,54,268611,False,0.569,0.727,7,-6.415,1,0.0437,0.028000,0.000000,0.9490,0.1600,127.908,4,dub
2XG8wjGeLpSeE3iublMgcG,Agents Of Time,Tsugi Club,Polina,0,471140,False,0.761,0.635,9,-7.386,1,0.0379,0.000302,0.918000,0.1040,0.1060,124.556,4,techno


In [85]:
target_track_id = ['7kxfWvj6u9oWQ5C36kMtGb']

tracks_data.loc[tracks_data.index.isin(target_track_id)].T

track_id,7kxfWvj6u9oWQ5C36kMtGb
artists,Alter Bridge
album_name,Blackbird
track_name,Watch Over You
popularity,60
duration_ms,259133
explicit,False
danceability,0.276
energy,0.658
key,7
loudness,-5.418


Try your own experiments by changing:
1. The track data features
2. The similarity function

### **OHE features + Jaccard Similarity Function**
---

In [86]:
top_tracks_id = track_recommendation(track_id = target_track_id,
                                     n = 10,
                                     track_data = tracks_data_ohe,
                                     similarity_func = jaccard_similarity)

top_tracks_id

100%|██████████| 2280/2280 [00:02<00:00, 948.37it/s]


Index(['2cJDsHTLgDQvf5G73RCNX5', '5uMTA9SIe0sE4dnHHSHaYz',
       '3MNZ79grelbBXsQcKUQZAh', '5m9uiFH9sK5wxRZdfN62n9',
       '1qUHD9oPIMFHKpR12NY2KC', '4CvQVmA1wVX4gKOJ7aQOZQ',
       '6DaHEkkaHiAhRZVEFZOMJ8', '28IEbk5a7twNTbUEvWslUb',
       '7wTJ3ciKwjGcdENc5AOLZt', '4Y2bR875LdPq9JILrY2FSw'],
      dtype='object', name='track_id')

In [87]:
top_tracks_id = list(top_tracks_id)
top_tracks_id.append(target_track_id[0])

tracks_data.loc[tracks_data.index.isin(top_tracks_id)]

track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
7kxfWvj6u9oWQ5C36kMtGb,Alter Bridge,Blackbird,Watch Over You,60,259133,False,0.276,0.658,7,-5.418,1,0.0329,0.0836,0.000553,0.161,0.16,172.326,4,metal
5m9uiFH9sK5wxRZdfN62n9,Breaking Benjamin,Shallow Bay: The Best Of Breaking Benjamin (Ex...,The Diary Of Jane,59,200213,False,0.39,0.958,10,-4.455,0,0.0622,5.7e-05,0.0437,0.121,0.348,167.045,4,metal
3MNZ79grelbBXsQcKUQZAh,Alter Bridge,Pawns & Kings,Pawns & Kings,52,378313,False,0.447,0.866,6,-5.588,1,0.0775,1.1e-05,4e-06,0.185,0.285,86.469,4,grunge
28IEbk5a7twNTbUEvWslUb,Finger Eleven,Them Vs. You Vs. Me (Deluxe Edition),Paralyzer,75,208106,False,0.644,0.939,11,-3.486,0,0.0456,0.157,0.0,0.233,0.861,106.031,4,metal
6DaHEkkaHiAhRZVEFZOMJ8,P.O.D.,Metal,Alive - Chris Lord-Alge Mix,0,205026,False,0.421,0.95,10,-3.985,1,0.0769,1.2e-05,0.000217,0.314,0.553,80.919,4,metal
4Y2bR875LdPq9JILrY2FSw,Bad Wolves,Zombie EP,Zombie (Acoustic),55,259253,False,0.499,0.494,2,-6.069,0,0.0264,0.0646,0.0,0.182,0.129,77.011,4,metal
7wTJ3ciKwjGcdENc5AOLZt,NŪR,The Snake,The Snake,32,314760,False,0.434,0.822,1,-7.785,0,0.0671,4.7e-05,0.853,0.135,0.109,134.888,3,metal
5uMTA9SIe0sE4dnHHSHaYz,System Of A Down,Hypnotize,Vicinity Of Obscenity,65,171800,False,0.546,0.957,1,-3.189,1,0.164,0.0533,3e-06,0.683,0.514,109.775,4,metal
1qUHD9oPIMFHKpR12NY2KC,Thousand Foot Krutch,The End Is Where We Begin,War of Change,67,231665,False,0.531,0.887,5,-4.417,0,0.0329,0.00261,0.000512,0.0657,0.626,104.055,4,metal
2cJDsHTLgDQvf5G73RCNX5,Creed,Father's Day 2020,With Arms Wide Open,56,278440,False,0.415,0.533,0,-8.445,1,0.0308,0.0037,0.000732,0.176,0.126,138.929,4,metal


What do you think?
- How can we get multiple similar songs to recommend?

### **Numerical features + Euclidean Similarity Function**
---

In [88]:
top_tracks_id = track_recommendation(track_id = target_track_id,
                                     n = 10,
                                     track_data = tracks_data_num,
                                     similarity_func = euclidean_similarity)

top_tracks_id

100%|██████████| 2280/2280 [00:00<00:00, 13868.81it/s]


Index(['6olS0TmHmsGr0hXtcBsiVM', '49EYyEIx0xpGuPLWxSsNgX',
       '6WVgEuHcBrrTWFGMJHF38I', '7ePjtuqKqa0ZbzznCNqiEf',
       '2cJDsHTLgDQvf5G73RCNX5', '7D5SHB5Qr6otVfXUh0jveD',
       '4RAOI1etsgbh5NP3T5R8rN', '7LJqcIIf1x2BQK0ARS9pQQ',
       '6H8qW0UoLvVWFaE0sms6NK', '20Dw7Jar5hUbwX5FwHdQoG'],
      dtype='object', name='track_id')

In [89]:
top_tracks_id = list(top_tracks_id)
top_tracks_id.append(target_track_id[0])

tracks_data.loc[tracks_data.index.isin(top_tracks_id)]

track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
7kxfWvj6u9oWQ5C36kMtGb,Alter Bridge,Blackbird,Watch Over You,60,259133,False,0.276,0.658,7,-5.418,1,0.0329,0.0836,0.000553,0.161,0.16,172.326,4,metal
7ePjtuqKqa0ZbzznCNqiEf,Alter Bridge,One Day Remains,Broken Wings,54,306600,False,0.449,0.72,10,-5.382,0,0.0355,0.00656,6.8e-05,0.127,0.114,138.482,4,grunge
7LJqcIIf1x2BQK0ARS9pQQ,Gabriela Rocha,Correrei (Ao Vivo),Correrei - Ao Vivo,42,331302,False,0.36,0.653,4,-6.506,1,0.0388,0.0256,0.0,0.0792,0.16,137.909,4,gospel
6olS0TmHmsGr0hXtcBsiVM,Megadeth,Youthanasia (Expanded Edition),A Tout Le Monde - Remastered 2004,66,262133,False,0.272,0.747,5,-4.86,0,0.0401,0.000393,4e-05,0.16,0.255,202.777,4,metal
4RAOI1etsgbh5NP3T5R8rN,My Chemical Romance,The Black Parade,I Don't Love You,73,238680,False,0.289,0.796,0,-4.22,1,0.0494,0.00837,2.6e-05,0.202,0.33,169.835,4,punk
6WVgEuHcBrrTWFGMJHF38I,SLANDER;NGHTMRE,Thrive,Monster,49,234000,False,0.361,0.773,4,-5.039,0,0.0444,0.000563,0.000353,0.089,0.0652,159.553,4,dub
7D5SHB5Qr6otVfXUh0jveD,Casa Worship;Julliany Souza,Seu Amor Me Persegue (Studio),Seu Amor Me Persegue (Studio),51,445970,False,0.364,0.603,6,-9.079,1,0.0356,0.0217,7.7e-05,0.0898,0.0886,133.784,4,gospel
2cJDsHTLgDQvf5G73RCNX5,Creed,Father's Day 2020,With Arms Wide Open,56,278440,False,0.415,0.533,0,-8.445,1,0.0308,0.0037,0.000732,0.176,0.126,138.929,4,metal
49EYyEIx0xpGuPLWxSsNgX,Passion;Kristian Stanfill;Tasha Cobbs Leonard;...,Burn Bright,What He's Done - Live From Passion 2022,50,397216,False,0.255,0.558,1,-7.974,1,0.0359,0.0169,2.2e-05,0.0879,0.183,141.486,4,world-music
6H8qW0UoLvVWFaE0sms6NK,GG Magree,My Wicked,My Wicked,42,233048,False,0.333,0.805,0,-4.376,1,0.0433,0.0856,0.0,0.285,0.175,163.969,4,dub


What do you think?

### **Numerical + OHE features + Pearson Similarity Function**
---

In [90]:
# This is slower than the previous runs and could take minutes on a larger dataset
top_tracks_id = track_recommendation(track_id = target_track_id,
                                     n = 10,
                                     track_data = tracks_data_num_ohe,
                                     similarity_func = pearson_similarity)

top_tracks_id

100%|██████████| 2280/2280 [00:03<00:00, 599.92it/s]


Index(['6olS0TmHmsGr0hXtcBsiVM', '5m9uiFH9sK5wxRZdfN62n9',
       '0h20ZtzbQa7H9YzfYvV8pd', '7ePjtuqKqa0ZbzznCNqiEf',
       '7DT6NuQBHmsSvA1RWuxoeX', '4FEr6dIdH6EqLKR0jB560J',
       '6Uaqf3LCwuVebfm5u3ndq9', '2cJDsHTLgDQvf5G73RCNX5',
       '1qUHD9oPIMFHKpR12NY2KC', '3MNZ79grelbBXsQcKUQZAh'],
      dtype='object', name='track_id')

In [91]:
top_tracks_id = list(top_tracks_id)
top_tracks_id.append(target_track_id[0])

tracks_data.loc[tracks_data.index.isin(top_tracks_id)]

track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
7kxfWvj6u9oWQ5C36kMtGb,Alter Bridge,Blackbird,Watch Over You,60,259133,False,0.276,0.658,7,-5.418,1,0.0329,0.0836,0.000553,0.161,0.16,172.326,4,metal
7ePjtuqKqa0ZbzznCNqiEf,Alter Bridge,One Day Remains,Broken Wings,54,306600,False,0.449,0.72,10,-5.382,0,0.0355,0.00656,6.8e-05,0.127,0.114,138.482,4,grunge
5m9uiFH9sK5wxRZdfN62n9,Breaking Benjamin,Shallow Bay: The Best Of Breaking Benjamin (Ex...,The Diary Of Jane,59,200213,False,0.39,0.958,10,-4.455,0,0.0622,5.7e-05,0.0437,0.121,0.348,167.045,4,metal
3MNZ79grelbBXsQcKUQZAh,Alter Bridge,Pawns & Kings,Pawns & Kings,52,378313,False,0.447,0.866,6,-5.588,1,0.0775,1.1e-05,4e-06,0.185,0.285,86.469,4,grunge
6olS0TmHmsGr0hXtcBsiVM,Megadeth,Youthanasia (Expanded Edition),A Tout Le Monde - Remastered 2004,66,262133,False,0.272,0.747,5,-4.86,0,0.0401,0.000393,4e-05,0.16,0.255,202.777,4,metal
1qUHD9oPIMFHKpR12NY2KC,Thousand Foot Krutch,The End Is Where We Begin,War of Change,67,231665,False,0.531,0.887,5,-4.417,0,0.0329,0.00261,0.000512,0.0657,0.626,104.055,4,metal
7DT6NuQBHmsSvA1RWuxoeX,Disidente,Y si tuviera disquera,Decidir,42,277533,False,0.368,0.872,5,-4.323,0,0.0318,0.00475,0.0,0.0877,0.558,168.035,4,metal
4FEr6dIdH6EqLKR0jB560J,Deftones,Koi No Yokan,Rosemary,73,413346,False,0.285,0.613,5,-6.412,1,0.0421,0.0185,0.1,0.114,0.0772,126.628,4,metal
2cJDsHTLgDQvf5G73RCNX5,Creed,Father's Day 2020,With Arms Wide Open,56,278440,False,0.415,0.533,0,-8.445,1,0.0308,0.0037,0.000732,0.176,0.126,138.929,4,metal
6Uaqf3LCwuVebfm5u3ndq9,Bad Wolves,N.A.T.I.O.N.,Sober,57,195477,False,0.456,0.813,6,-6.663,1,0.047,0.0111,0.0,0.174,0.619,157.031,4,metal


Great!

What do you think? **Which configuration gives the best recommendations?**
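
One rough way to compare configurations is to check how often the recommended tracks share the target's genre. The sketch below is an assumption-laden illustration, not part of the original workflow: `genre_match_rate` and the toy table are hypothetical, and only the `track_genre` column name comes from `tracks_data`.

```python
import pandas as pd

def genre_match_rate(target_id, recommended_ids, tracks):
    """Share of recommended tracks whose genre equals the target's genre."""
    target_genre = tracks.loc[target_id, "track_genre"]
    rec_genres = tracks.loc[recommended_ids, "track_genre"]
    return float((rec_genres == target_genre).mean())

# Toy metadata (made-up ids and genres, not from the dataset)
toy = pd.DataFrame(
    {"track_genre": ["metal", "metal", "grunge", "gospel"]},
    index=pd.Index(["t0", "t1", "t2", "t3"], name="track_id"),
)

print(genre_match_rate("t0", ["t1", "t2"], toy))  # 0.5
```

Treat this number with care: when genre is already one of the OHE input features, a high match rate partly reflects the input rather than recommendation quality, so listening to the results remains a useful check.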