# Spotify Recommender Model

This notebook will explore making recommendations based on the vectorization of songs.  Using Word2Vec, a variety of different recommender model options are explored.  

1. **Content-Based**
> Predicts based on what a user has listened to in the past.
> Uses features of songs to find similar songs.

2. **Collaborative**
> Predicts based on what other listeners like
> Focuses on what songs other users liked who also liked a chosen song. 

2 collaborative-based approaches are presented in this notebook as well as 2 content-based models.

## Word2Vec
Word2Vec is a library that will create a vector space.  As its name implies, Word2Vec was originally intended to convert Words to Vectors.  Here, we will use that intended functionality to convert Songs to Vectors.  

### Embeddings
Word2Vec is a process that uses vectorized words to predict other words.  It does this by ingesting a series of documents, parsing out the words, vectorizing the words and then using the vector representations to predict other words.  The vectors are built in such a way that each word has a unique vector that is based on its usage in the documents.  The result is a vector space filled with words where related words have vectors that are similar.  This vector space is referred to an an **embedding**.  This embedding is used in two common word prediction tasks: `Skip-Gram` and `Continuous Bag of Words`.

> **Skip-Gram** <br>
> The Skip-Gram model attempt to find words that surround a given word or set of words.

> **Bag-of-Words** <br>
> The bag-of-words model asks for a series of words and will return words that appear to be missing from the provided context.

<a name='metrics'></a>
### Metrics
<a href=#r-prec>R-precision</a> and <a href=#ndgc>Normialzed Discounted Cumulative Gain (NDGC)</a> are used as metrics.  Methods are included below that calculate these metrics.  In order to test a model, 100 playlists are passed through each metric and a mean of the results is returned.


### Making a Playlist
What does this have to do with playlists?  Good question.  If we can consider a Song as a Word and a Playlist as a document, the applicability is more evident.

To make a playlist, we simply convert Songs to Vectors and then find new songs by finding other songs with similar vectors.  To schieve this, we can use the Bag-of-Words or Skip-Gram approach as mentioned above.  

Various approaches using these concepts are explored below:

<a name='index'></a>
### <a href=#1>1. Embeddings from Playlists - Content-Based</a>
> Here, we will take data from Spotify that included 1M playlists and the songs in each playlist.  We'll use the Word2Vec process supplying playlists as documents and each song's unique id is used as the word. <br><br>
Word2Vec will create an embedding of song vectors that can subsequently be used to create a Skip-Gram or Bag-of_words model; however, in this first approach, we will simply use the embeddings to find a playlist. <br><br>This approach is unsupervised.  No process is used to 'guide' the model into determining if it's output is correct or not.

### <a href=#2>2. Bag-of-Words Model - Collaborative</a>
> In this approach, we will use the embedding used above to train a Bag-of-Words model to create a recommended playlist.  Unlike the first approach, this model is supervised and requires a training process.


### <a href=#3>3. Skip-Gram Model - Collaborative</a>
> Using the same embedding as before, we will train a Skip-Gram model to create recommendations.  Like the Bag-of_words model, Skip-Gram is a supervised model and will require a training process.


### <a href=#4>4. Song Features Embedding - 'Home-Made Vectors' - Content-Based</a>
> Here, we can take a break from Word2Vec and get very basic.  We create our own vectors based on Spotify accoustive feature data.  We have a series of fields available for all of our songs that numerically represent various characteristics of the songs; dancebaility, loudness, temp, key, energy, etc.  We can create vectors for each song based on these values and cast them into the Word2Vec format so that we can exploit some of the Word2Vec functionality. <br><br> Like our first model, this approach is unsupervised.  The model will be built without any feedback on whether is is achieving a certain result or not.



<br>

**References:**

https://www.analyticsvidhya.com/blog/2019/07/how-to-build-recommendation-system-word2vec-python/

https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484

https://towardsdatascience.com/how-to-build-a-simple-song-recommender-296fcbc8c85



### Import libraries

In [254]:
# Basic Imports
import warnings;
warnings.filterwarnings('ignore')

import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import time
import random
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import normalize, Normalizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

from gensim.models import Word2Vec
from gensim import utils
import gensim.models
from gensim.models import KeyedVectors

# For the Spotify Dataset
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Table, Column, Integer, String, Float, MetaData, and_, or_, func
from sqlalchemy import create_engine
import sqlite3
from sqlalchemy.orm import sessionmaker
from sqlalchemy import exc

from sklearn.model_selection import train_test_split

sys.path.append('../../')
from spotify_api import get_spotify_data, get_tracks, get_artists, get_audiofeatures
from spotify_database import get_session, display_time
from spotify_utils import Table_Generator, List_Generator, pickle_load, pickle_save

# !pip install ipywidgets 
# !jupyter nbextension enable --py widgetsnbextension
# !jupyter labextension install @jupyter-widgets/jupyterlab-manager

# %%capture
from tqdm import tqdm_notebook as tqdm


## Metrics
Before creating any playlists, let's setup some metrics so that we can evaluate our models.

<a name='r-prec'></a>
### R-Precision
<a href=#metrics>back to index</a>

Compares the a recommended list to a ground truth list.  This metric will simply calculate the percentage of tracks that match a ground truth.

We can compare the intersection of tracks and the intersection of artists.

let: <br>
> $G$ = ground truth (validation playlist) <br>
> $R$ = recommendation list<br>

> $R-precision = \frac{|G \bigcap R_{1:|G|}|}{|G|}$


In [108]:
def calc_rPrecision(ground_truth, recommendation)->float:
    """
    Calculates r-precision based on a list of ground truth 
    items and a list of recommended items.
    Each argument is a list of items to compare.
    """
    G=set(ground_truth)
    R=set(recommendation)
    
    return len(G&R)/len(G)

<a name='ndgc'></a>
### NDCG - Normalized Discounted Cumulative Gain
<a href=#metrics>back to index</a>

NDCG will incorporate not only the relevance, but also the order of the items in the recommended playlist.

To caluclate, we need the the DCG (discounted cumulative gain)  which measures the ranking quality.  We also need the IDCG (ideal discounted cumulative gain).

> $DCG = rel_1 + \sum^{|R|}_{i=2}\frac{rel_i}{log_2(i+1)} $ <br>
> $IDCG = 1 + \sum^{|G|}_{i=2}\frac{1}{log_2(i+1)} $

NDGC is calculated as follows: <br>
> $NDCG=\frac{DCG}{IDCG}$

Where $rel_{{i}}$ is the graded relevance of the result at position $i$.  Relevance = 1 when the recommended track is in the ground truth playlist.

In [109]:
def calc_track_NDCG(ground_truth, recommendation)->float:
    list_len = len(recommendation)
    scores = [(track in recommendation) for track in ground_truth[0:list_len]]
    
    DCG  = scores[0] + np.sum(scores[1:]/np.log2( np.arange(1,list_len)+1) )
    IDCG = 1 + np.sum(1/np.log2( np.arange(1,list_len)+1 ))
    NDGC = DCG/IDCG
    
    return NDGC

In [110]:
def calc_track_metrics(ground_truth, recommendation, display=True)-> (float,float):
    """
    Prints relevant metrics give a grount truth playlist and a 
    recommended playlist.
    """
    r_prec = calc_rPrecision(ground_truth, recommendation)
    NDGC = calc_track_NDCG(ground_truth, recommendation)
    
    if display:
        print("Track R-Precision: {}".format(r_prec))
        print("Track NDGC       : {}".format(NDGC))
    
    return r_prec, NDGC

In [222]:
def eval_model(model, df_withheld, df_given, display=True, num_test:int=100) -> (float,float):
    """
    Will perform an r_precision and NDCG calculation on a
    restricted set of test playlists and return a mean of each score
    for all scored playlists.
    """
    r_precs = []
    NDCGs = []
    playlists = np.unique(df_given.playlist_id.values)
    playlists = np.random.choice(playlists, size=num_test)
    
    for plistID in tqdm(playlists, desc="calculating metrics"):
        df_seed_tracks = df_given[df_given.playlist_id.isin([plistID])]
        seed_uris = df_seed_tracks.track_uri.values
        
        try:
            # for embedding, use the first track as the seed
            if type(model) == gensim.models.keyedvectors.Word2VecKeyedVectors:
                playlist_rec = np.array(embedding.similar_by_word(seed_uris[0], 
                                                              topn=10))

            elif type(model) == gensim.models.word2vec.Word2Vec:
                playlist_rec = np.array(BOW_model.predict_output_word(seed_uris, 
                                                                      topn=10))

            withheld_uris = df_withheld[df_withheld.playlist_id.isin([plistID])].track_uri.values

            r_prec, NDGC = calc_track_metrics(  withheld_uris, 
                                                    playlist_rec[:,0],
                                                    display=False)
        except:
            continue
        
        r_precs.append(r_prec)
        NDCGs.append(NDGC)
    
    if display:
        print("Mean Track R-Precision: {}".format(np.mean(r_precs)))
        print("Mean Track NDGC       : {}".format(np.mean(NDCGs)))
        
    return r_precs, NDCGs
        
    

### Set Data Path Variables

In [5]:
data_path = '../../data/SpotifyDataSet'
db_path = '../../data/SpotifyDataSet/spotify_songs.db'

# Get sesion
session = get_session(db_path)
engine = create_engine('sqlite:///' + db_path)

# Get Songs class
Playlists = getattr(get_session, "Playlists")
Artists = getattr(get_session, "Artists")
Tracks = getattr(get_session, "Tracks")

In [14]:
# takes 5 minutes
df_playlists_test_withheld = pd.read_csv(os.path.join(data_path, "df_playlists_test_withheld.csv"), index_col='index')#.drop('Unnamed: 0', axis=1)
df_playlists_test_given    = pd.read_csv(os.path.join(data_path, "df_playlists_test_given.csv"), index_col='index')#.drop('Unnamed: 0', axis=1)
df_playlists_train         = pd.read_csv(os.path.join(data_path, "df_playlists_train.csv"), index_col='index').drop('Unnamed: 0', axis=1)


## Review the Given Set of Songs
For reference, let's look at the songs that will be used in our models to produce recommendations.  This will help us to determin if the recommendations are reasonable.

We will use the `df_playlists_test_given` and `df_playlists_test_withheld` dataframes to test the models.  The `given` dataframe inlcludes 10 tracks from each of 10,000 playlists.  Any playlist of 10 tracks should be used to predict the corresponding `withheld` tracks from each playlist.

In [202]:
# pick a random playlist
test_playlistIDs = np.unique(df_playlists_test_given.playlist_id.values)
test_playlist_ID = np.random.choice(test_playlistIDs, size=1)
test_playlist_ID

array([295069])

In [203]:
# define 'given' and 'withheld' portions of test set
test_given = df_playlists_test_given[df_playlists_test_given.playlist_id.isin(test_playlist_ID)].track_uri.values
test_withheld = df_playlists_test_withheld[df_playlists_test_withheld.playlist_id.isin(test_playlist_ID)]

In [242]:
def print_recommended_playlist(playlist:np.array, df_withheld:pd.DataFrame=None)->None:
    """
    Print a playlist recommendation. Display to terminal output.
    """
    
    if len(playlist)>50:
        playlist = playlist[0:50]
    
    note = ""
    match = np.zeros(len(playlist))
    if df_withheld is not None:
        note = "(* indicates a match)"
        for i, t in enumerate(playlist):
            if t in df_withheld.track_uri.values:
                match[i] = 1
        print ("{} tracks matches.".format(np.sum(match).astype(int)))
    
    sp_playlist = get_tracks(playlist)
    print("RECOMMENDED PLAYLIST {}".format(note))
    print("{:1}{:20}{:30}{:30}".format("","Artist","Track","URI"))
    for i, t in enumerate(sp_playlist):
        print("{0:1}{1:20}{2:30}{3:30}".format( "*" if match[i] else "", t['artists'][0]['name'],  t['name'], t['uri']))
        print(" {}".format("<no preview>" if t['preview_url']==None else t['preview_url']))
        print()

In [243]:
print_recommended_playlist(test_given)

RECOMMENDED PLAYLIST 
 Artist              Track                         URI                           
 Louis Armstrong     La vie en rose - Single Versionspotify:track:0AX6pLXtyR2vPLv2KYErAg
 <no preview>

 Bobby Darin         Dream Lover - 2006 Remaster   spotify:track:4GqMYg91LJXiLjvQBFc3s0
 https://p.scdn.co/mp3-preview/60b4d507dace4a9505a04d480b0f299e43d9994d?cid=72413f75d4db4ec79c6caaf02523959e

 Dean Martin         Everybody Loves Somebody      spotify:track:78VG6M1i7JQXBdygmWFwye
 https://p.scdn.co/mp3-preview/d2b12a99c2d07cc5f085acf174f75889678ecc0a?cid=72413f75d4db4ec79c6caaf02523959e

 Chuck Berry         School Day (Ring Ring Goes The Bell)spotify:track:3hNcrk8Ypht0x5CuT7pJnS
 <no preview>

 Otis Redding        Stand by Me                   spotify:track:1aj4GXfmEYXfdVZohCpNKu
 https://p.scdn.co/mp3-preview/94e1e88bd6e967752a5030b1cca3fff3beae8dce?cid=72413f75d4db4ec79c6caaf02523959e

 Dion                Life Is But A Dream           spotify:track:6C5YGS0bYivffnEH3EMC7h
 

<a name='1'></a>
## 1. Embeddings from Playlists - Content-Based
<a href=#index>back to index</a>


The baseline model will use embeddings to find similarities between songs.  The embeddings are built from playlists, where the playlist serves as a sentence made up of songs.

Similarities between songs are determined by their cosine distance with other songs.

To speed up to building of the enbedding, an extract is made from the database which will serve as documents for the embedding.  Each 'sentence' is a playlist and each 'word' is a song in the playlist.

From the DB, the following view is created which is subseqntly extracted as a CSV file:<br>

`CREATE VIEW playlist_tracks_uris AS SELECT t.playlist_id, group_concat(t.track_uri, ' ')  FROM playlists t GROUP BY  t.playlist_id;`


In [21]:
# get playlists CSV file into DataFrame - take 5 minutes
corpus_file = 'playlist_tracks.csv'
corpus_filepath = os.path.join(data_path, corpus_file)

df_playlists = pd.read_csv(corpus_filepath, sep='\t', header=None)
df_playlists.columns = ["playlistID","tracks"]
df_playlists.set_index('playlistID', drop=True, inplace=True)
df_playlists.head()

Unnamed: 0_level_0,tracks
playlistID,Unnamed: 1_level_1
1,spotify:track:2d7LPtieXdIYzf7yHPooWd spotify:t...
2,spotify:track:5j9iuo3tMmQIfnEEQOOjxh spotify:t...
3,spotify:track:4HBVGSeSPpSZ1QmMBhEtqp spotify:t...
4,spotify:track:1f5AW15GV76mk8JNxaPJIx spotify:t...
5,spotify:track:4Sj3djQIFuaH3VICDN3uAA spotify:t...


In [22]:
# remove test playlists from corpus
playlists_train = df_playlists[df_playlists.index.isin(np.unique(df_playlists_train.playlist_id))]


In [23]:
# Iterator that yields the songs for each playlist in a Dataframe
class Playlist_URIs_df(object):
    """
    Playlist generator that yileds the track uris in a playlist.
    Yields one playlist at a time.
    """
    def __init__(self,
                 dataframe:pd.DataFrame=None,
                 name:str=None,
                 iters:int=None):
        self.dataframe    = dataframe
        self.length       = len(dataframe)
        self.name         = name
        self.count        = 0
        self.iters        = iters
        print("Creating Playlist Track Listing Generator:")
        print("\tlength     : ", self.length)
    
    def __iter__(self):
        
        self.count += 1
        progbar = tqdm(total=self.length, desc="{}:{}/{}".format(self.name, self.count, self.iters+1))
        
        for plId, line in self.dataframe.itertuples():
            progbar.update(1)
            yield line.split(' ') # space-delimited tracks
            
        progbar.close()    

In [24]:
iters = 5
playlists_gen = Playlist_URIs_df(dataframe=playlists_train,
                                 name="Building Vectors",
                                 iters=iters) 

Creating Playlist Track Listing Generator:
	length     :  989001


### Build embedding and BOW model

In [25]:
# Build a gensim BOW model including a word embedding
model_BOW = gensim.models.Word2Vec(sentences=playlists_gen,
                               workers = 8,    # number of processors
                               sg = 0,         # 1=skip-gram, 0=CBOW
                               iter=iters      # training iterations - default=5
                              )

# save the model
model_filepath = os.path.join(data_path, 'playlists_BOW.model')
model_BOW.save(model_filepath)

# Save the embedding
kv_filepath = os.path.join(data_path, 'playlists.embedding')
model_BOW.wv.save(kv_filepath)

# about 5 minutes per iteration with 4 processors // 2 min per iteration with 8

HBox(children=(IntProgress(value=0, description='Building Vectors:1/6', max=989001, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:2/6', max=989001, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:3/6', max=989001, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:4/6', max=989001, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:5/6', max=989001, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:6/6', max=989001, style=ProgressStyle(descri…




### Build Skip-Gram Model

In [26]:
# Build a gensim Skip-Gram model including a word embedding
model_SG = gensim.models.Word2Vec(sentences=playlists_gen,
                               workers = 8,    # number of processors
                               sg = 1,         # 1=skip-gram, 0=CBOW
                               iter=iters      # training iterations - default=5
                              )

# save the model
model_filepath = os.path.join(data_path, 'playlists_SG.model')
model_SG.save(model_filepath)

# NOTE: No need to save the embedding again, it is the same as in the BOW model

# about 5 minutes per iteration with 4 processors // 2 min per iteration with 8

HBox(children=(IntProgress(value=0, description='Building Vectors:7/6', max=989001, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:8/6', max=989001, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:9/6', max=989001, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:10/6', max=989001, style=ProgressStyle(descr…




HBox(children=(IntProgress(value=0, description='Building Vectors:11/6', max=989001, style=ProgressStyle(descr…




HBox(children=(IntProgress(value=0, description='Building Vectors:12/6', max=989001, style=ProgressStyle(descr…




### Model Attributes
Now, we have created an embedding and an associated model which follows the BOW approach.

#### wv
> This object essentially contains the mapping between words and embeddings. It can be used directly to query the embeddings in various ways. 

#### vocabulary
> This object represents the vocabulary (sometimes called Dictionary in gensim) of the model. 

### Review the Embedding
Get 10 'words' from the embedding's vocabulary.

In [27]:
# reload saved embedding
kv_filepath = os.path.join(data_path, 'playlists.embedding')
embedding = KeyedVectors.load(kv_filepath, mmap='r')

In [28]:
# display list of random track along with a count of appearances
songids = list(embedding.wv.vocab.keys())
idxs_rnd = [random.randint(0,len(songids)) for x in range(10)]

for i in idxs_rnd:
    print("Track: {}".format(songids[i]))
    print("\t",embedding.wv.vocab[songids[i]])

Track: spotify:track:4rcQyhmbqea985kxXURyfM
	 Vocab(count:7, index:446332, sample_int:4294967296)
Track: spotify:track:2EsOy9Yk7Du8RvkA0lhm4L
	 Vocab(count:125, index:58093, sample_int:4294967296)
Track: spotify:track:7Di7t9yGoxdZRLAt5a4pi0
	 Vocab(count:2006, index:5245, sample_int:4294967296)
Track: spotify:track:1SOFGwoT6i5cB7bg4PtX2Y
	 Vocab(count:8, index:398504, sample_int:4294967296)
Track: spotify:track:4Ix57FqMyPMD1SouENEtV8
	 Vocab(count:91, index:74450, sample_int:4294967296)
Track: spotify:track:2mTKpYDTvqIkD42LRNob9F
	 Vocab(count:18, index:244527, sample_int:4294967296)
Track: spotify:track:4EU1Su5dQtnkRO7gSG35Ui
	 Vocab(count:10, index:348047, sample_int:4294967296)
Track: spotify:track:4XOkjFJ0qXZr61Lv116ybV
	 Vocab(count:296, index:29179, sample_int:4294967296)
Track: spotify:track:4XBjG0TN3iK2WKmz71GsmH
	 Vocab(count:77, index:84238, sample_int:4294967296)
Track: spotify:track:0BlNkCzdgYi9eJXwnYUho9
	 Vocab(count:165, index:46640, sample_int:4294967296)


### Display the Embedded Vector

In [201]:
# Show the vector value for the first word in in the list above
rnd = random.randint(0, len(embedding.wv.vocab))
print(songids[idxs_rnd[0]])
display(embedding.wv.get_vector(songids[idxs_rnd[0]]))

spotify:track:4rcQyhmbqea985kxXURyfM


memmap([ 2.54702047e-02, -9.36156362e-02,  9.72104743e-02,
        -4.94626276e-02, -8.21211338e-02,  1.19572319e-01,
        -1.75503939e-01, -1.24044880e-01,  1.66824795e-02,
        -4.97681387e-02,  4.08215784e-02,  9.62732956e-02,
         6.78425506e-02, -1.66927110e-02,  3.74965742e-02,
         2.13508070e-01,  9.58365761e-03,  1.95487574e-01,
        -8.74877442e-03, -1.10318847e-01, -1.53332502e-01,
        -3.12477536e-02,  9.62833911e-02,  5.39107509e-02,
        -3.35698132e-03,  8.46106857e-02,  1.01292215e-01,
        -1.47143617e-01,  1.00250535e-01, -8.91840458e-02,
         1.59551408e-02, -4.48449515e-02, -2.31906459e-01,
         1.02777421e-01, -3.16157639e-02, -1.13585718e-01,
        -1.22808926e-01, -1.27203807e-01,  8.03158730e-02,
        -1.66301671e-02, -1.66764006e-01, -7.57816620e-03,
        -1.70734935e-02, -2.64590490e-03, -6.29904866e-02,
        -4.23168589e-04,  4.00907360e-02, -4.87653948e-02,
        -1.13768861e-01,  1.10321514e-01,  4.28889990e-0

### Use the Word2Vec embedding to find similar songs based on single Song
Using the Word2Vec function `similar_by_word()`.  When using embeddings to predict a playlist, we are limited to inputting a single 'word' or song.

In [204]:
seed_uri = test_given[0]

In [205]:
# Find similar songs based on a single song
playlist_rec = np.array(embedding.similar_by_word(seed_uri, topn=10, restrict_vocab=None))

In [245]:
# Get the similar songs from Spotify to show their details, including preview link (if available)
sp_playlist = get_tracks(playlist_rec[:,0])
print("SEED TRACK")
sp_seed_track = get_tracks([seed_uri]) # get spotify data for track

print("Artist       : ",   sp_seed_track[0]['artists'][0]['name'])
print("Track        : ",   sp_seed_track[0]['name'])
print("Track Preview: \n", sp_seed_track[0]['preview_url'] )
print()

print_recommended_playlist(playlist_rec[:,0], test_withheld)


SEED TRACK
Artist       :  Louis Armstrong
Track        :  La vie en rose - Single Version
Track Preview: 
 None

1 tracks matches.
RECOMMENDED PLAYLIST (* indicates a match)
 Artist              Track                         URI                           
 Ella Fitzgerald     Dream A Little Dream Of Me    spotify:track:3Bbbz0IGORWZSLf9UqsAL4
 <no preview>

 Nat King Cole       L-O-V-E - Remastered          spotify:track:7E3rc13GL2I5wA6CIFXaxs
 <no preview>

 Frank Sinatra       Fly Me To The Moon            spotify:track:2y8Eez5cFFf2JzD546LThM
 <no preview>

 Frank Sinatra       I've Got You Under My Skin    spotify:track:1hSlBR0fAUCyB7jNtztE1s
 <no preview>

 Frank Sinatra       You Make Me Feel So Young     spotify:track:0RKWU3hF0xNGS5RJXAtzF5
 <no preview>

 Bobby Darin         Beyond the Sea                spotify:track:3KzgdYUlqV6TOG7JCmx2Wg
 https://p.scdn.co/mp3-preview/265a0fffe03b363973ddf23f0c8f4e55a3d0b45a?cid=72413f75d4db4ec79c6caaf02523959e

*Louis Armstrong     What A Wo

In [220]:
# Calculate the metrics on just this playlist
r_prec, NDCG = calc_track_metrics(test_withheld.track_uri.values, playlist_rec[:,0])

Track R-Precision: 0.1
Track NDGC       : 0.06779095235709004


In [223]:
# calculate metrics on 100 playlists
r_precs, NDCGs = eval_model(embedding, df_playlists_test_withheld, df_playlists_test_given)

HBox(children=(IntProgress(value=0, description='calculating metrics', style=ProgressStyle(description_width='…


Mean Track R-Precision: 0.01875
Mean Track NDGC       : 0.020612728023655536


### Results:
Based on the playlists that were supplied when buidling the embedding, the list of Top 10 most similar songs is presented above.  The results don't score well; however, the recommended tracks seem like reasonable recommendations based on the seed track.

This playlist is based on the top songs that other users placed in playlists that include the song we selected as our seed.  This is a good example of 'collaborative filtering' as it uses preferences from others to recommend songs.

<a name='2'></a>
## 2. Embeddings from Playlists - Song ID - BOW
<a href=#index>back to index</a>

Now, let's use the model that was created when we initially established our embedding.  The BOW model can be used to predict a song from a song or a list of supplied songs.

The same seed playlist will be used as previously.

In [144]:
# Load the BOW model saved previously
model_filepath = os.path.join(data_path, 'playlists_BOW.model')
BOW_model = Word2Vec.load(model_filepath)

In [244]:
recommended_songs = np.array(BOW_model.predict_output_word(test_given, topn=10))
print_recommended_playlist(recommended_songs[:,0], test_withheld)


1 tracks matches.
RECOMMENDED PLAYLIST (* indicates a match)
 Artist              Track                         URI                           
 Sam Cooke           (What A) Wonderful World - Remasteredspotify:track:27K3ZDS5B4fwjhwyihrdzC
 <no preview>

 Otis Redding        (Sittin' On) the Dock of the Bayspotify:track:3zBhihYUHBmGd2bcQIobrF
 https://p.scdn.co/mp3-preview/99bfd8043cd20bc3dc4b7aa9461ac268954efdf7?cid=72413f75d4db4ec79c6caaf02523959e

 King Harvest        Dancing In The Moonlight      spotify:track:55GxhCTq6SY3tFTVh7z1nR
 <no preview>

 Otis Redding        Stand by Me                   spotify:track:1aj4GXfmEYXfdVZohCpNKu
 https://p.scdn.co/mp3-preview/94e1e88bd6e967752a5030b1cca3fff3beae8dce?cid=72413f75d4db4ec79c6caaf02523959e

 Frank Sinatra       Fly Me To The Moon            spotify:track:2y8Eez5cFFf2JzD546LThM
 <no preview>

 The Temptations     My Girl                       spotify:track:6RrXd9Hph4hYR4bf3dbM6H
 <no preview>

 Van Morrison        Moondance - 2013 Re

In [225]:
r_precs, NDCGs = eval_model(BOW_model, df_playlists_test_withheld, df_playlists_test_given)

HBox(children=(IntProgress(value=0, description='calculating metrics', style=ProgressStyle(description_width='…


Mean Track R-Precision: 0.046
Mean Track NDGC       : 0.044119563042770704


### Result:
Only one song matched the ground truth; however, the returned tracks look very reasonable given the inputs.

<a name='3'></a>
## 3. Embeddings from Playlists - Song ID - Skip-Gram
<a href=#index>back to index</a>

Here, we create a playlist from the Skip-Gram model.  For consistency, we will use the same song, "Free Fallin'", again to see if our results differ.

In [226]:
# load the skip-gram model we previously created
model_filepath = os.path.join(data_path, 'playlists_SG.model')
SG_model = Word2Vec.load(model_filepath)

In [293]:
recommended_songs = np.array(SG_model.predict_output_word([seed_uri], topn=10))
# sp_tracks = get_tracks(recommended_songs[:,0])

print_recommended_playlist(recommended_songs[:,0], test_withheld)


1 tracks matches.
RECOMMENDED PLAYLIST (* indicates a match)
 Artist              Track                         URI                           
 Louis Armstrong     La vie en rose - Single Versionspotify:track:0AX6pLXtyR2vPLv2KYErAg
 <no preview>

 Louis Armstrong     A Kiss To Build A Dream On - Single Versionspotify:track:55qmtyvnYHkLsnEZGzrj8C
 <no preview>

 Billie Holiday      I'll Be Seeing You            spotify:track:6MIa10mpQQ3dEiaw8TA8JA
 <no preview>

 Ella Fitzgerald     Dream A Little Dream Of Me    spotify:track:3Bbbz0IGORWZSLf9UqsAL4
 <no preview>

 Ella Fitzgerald     Cheek To Cheek                spotify:track:6pPr1KLZit9FgFNhp7xE5m
 <no preview>

 Billie Holiday      All of Me                     spotify:track:5EsA2BJ1X2BCnbVvo9OByx
 <no preview>

 Glenn Miller        Moonlight Serenade - 2005 Remastered Versionspotify:track:3HRMOZk689zaR3z6NpEdfu
 https://p.scdn.co/mp3-preview/921dc76faae376d382ef25b066bf2ad5c2195d83?cid=72413f75d4db4ec79c6caaf02523959e

 Frank Sinatr

In [228]:
r_precs, NDCGs = eval_model(SG_model, df_playlists_test_withheld, df_playlists_test_given)

HBox(children=(IntProgress(value=0, description='calculating metrics', style=ProgressStyle(description_width='…


Mean Track R-Precision: 0.04152637485970819
Mean Track NDGC       : 0.037546862778789714


### Result:
Again, only one tracked matched producing a low evaluation result.  But like in the BOW model, the Skip-Gram model is producing song suggestions are are very much in line with the given set of tracks.

<a name='4'></a>
## 4. Embeddings from Song Features - Unsupervised
### 'Home-Made' Vectors
<a href=#index>back to index</a>

Here, we can take a break from Word2Vec and get very basic.  We create our own 'home-made' vectors based on Spotify accoustive feature data.  We have a series of fields available for all of our songs that numerically represent various characteristics of the songs; dancebaility, loudness, temp, key, energy, etc.

Instead of letter Word2Vec create an embedding, we create our own using these values.

In [303]:
# fetch all db tracks with accoustic features
db_tracks = display_time(session.query(Tracks).all)
session.close()

Time to Execute: 81.84 seconds


In [304]:
# create a Pandas dataframe
df_all_tracks = pd.DataFrame([x.__dict__ for x in db_tracks]).drop('_sa_instance_state', axis=1).set_index(['track_uri'])
df_all_tracks.head()

Unnamed: 0_level_0,acousticness,artist_uri,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,track_popularity,valence
track_uri,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
spotify:track:2d7LPtieXdIYzf7yHPooWd,0.974,spotify:artist:0MeLMJJcouYXCymQSHPn8g,0.467,242564,0.157,1e-06,11,0.0816,-9.649,1,0.0336,108.13,4,65,0.277
spotify:track:0y4TKcc7p2H6P0GJlt01EI,0.961,spotify:artist:7w0qj2HiAPIeUcoPogvOZ6,0.312,253933,0.207,0.00818,10,0.0773,-13.367,1,0.0347,93.778,4,36,0.278
spotify:track:6q4c1vPRZREh7nw3wG7Ixz,0.991,spotify:artist:32ogthv0BdaSMPml02X9YB,0.412,103920,0.159,0.772,9,0.083,-14.214,1,0.0278,85.462,4,54,0.0389
spotify:track:54KFQB6N4pn926IUUYZGzK,0.885,spotify:artist:32ogthv0BdaSMPml02X9YB,0.264,371320,0.122,0.349,9,0.094,-15.399,1,0.0349,148.658,4,72,0.0735
spotify:track:0NeJjNlprGfZpeX2LQuN6c,0.689,spotify:artist:3qnGvpP8Yth1AqSBMqON5x,0.658,238560,0.179,0.0,8,0.17,-10.866,1,0.0448,128.128,4,75,0.191


In [305]:
# define features that we will use to create our custom vectors
vector_features= [
    'acousticness',
    'danceability',
    'duration_ms',
    'energy',
    'instrumentalness',
    'key',
    'liveness',
    'loudness',
    'mode',
    'speechiness',
    'tempo',
    'time_signature',
    'valence'
]

In [306]:
# Create a dataframe with the vectors for simplicity
drop_cols = set(df_all_tracks.columns) - set(vector_features)
df = df_all_tracks.drop(drop_cols, axis=1)
df.head()

Unnamed: 0_level_0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
track_uri,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
spotify:track:2d7LPtieXdIYzf7yHPooWd,0.974,0.467,242564,0.157,1e-06,11,0.0816,-9.649,1,0.0336,108.13,4,0.277
spotify:track:0y4TKcc7p2H6P0GJlt01EI,0.961,0.312,253933,0.207,0.00818,10,0.0773,-13.367,1,0.0347,93.778,4,0.278
spotify:track:6q4c1vPRZREh7nw3wG7Ixz,0.991,0.412,103920,0.159,0.772,9,0.083,-14.214,1,0.0278,85.462,4,0.0389
spotify:track:54KFQB6N4pn926IUUYZGzK,0.885,0.264,371320,0.122,0.349,9,0.094,-15.399,1,0.0349,148.658,4,0.0735
spotify:track:0NeJjNlprGfZpeX2LQuN6c,0.689,0.658,238560,0.179,0.0,8,0.17,-10.866,1,0.0448,128.128,4,0.191


### Normalize Data
The values will need to be normalized to avoid placing greater importance on predictors that have naturally larger values.

In [307]:
df.columns

Index(['acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence'],
      dtype='object')

In [308]:
for col in df.columns:
    data = np.array(df[col]).reshape(-1,1)
    scaled_data = MinMaxScaler(feature_range=(0,1)).fit(data).transform(data)
    df[col] = scaled_data

In [309]:
# Confirm scaling is reasonable
df[vector_features].head()

Unnamed: 0_level_0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
track_uri,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
spotify:track:2d7LPtieXdIYzf7yHPooWd,0.977912,0.468876,0.03995,0.157,1e-06,1.0,0.0816,0.775549,1.0,0.034604,0.432542,0.8,0.277
spotify:track:0y4TKcc7p2H6P0GJlt01EI,0.964859,0.313253,0.04183,0.207,0.00818,0.909091,0.0773,0.718282,1.0,0.035736,0.375132,0.8,0.278
spotify:track:6q4c1vPRZREh7nw3wG7Ixz,0.99498,0.413655,0.017021,0.159,0.772,0.818182,0.083,0.705235,1.0,0.02863,0.341866,0.8,0.0389
spotify:track:54KFQB6N4pn926IUUYZGzK,0.888554,0.26506,0.061243,0.122,0.349,0.818182,0.094,0.686983,1.0,0.035942,0.594663,0.8,0.0735
spotify:track:0NeJjNlprGfZpeX2LQuN6c,0.691767,0.660643,0.039288,0.179,0.0,0.727273,0.17,0.756804,1.0,0.046138,0.512539,0.8,0.191


### Convert 'home-made' Vectors into Word2Vec format
In order to exploit some of the Word2Vec built-in functionality, we can cast our vectors into a KeyedVectors object.  KeyedVectors is the Word2Vec object that hold all of the vectors in an embedded space.

In [310]:
# create a KeyedVectors object from the Word2Vec library
# This will allow us to use the built-in Word2Vec functions
accoustic_vectors = KeyedVectors(len(vector_features))

# weights are the vectors for each track
weights = np.array(df)

# entities are the trackuris
entities = np.array(df.index)

# add the vectors to the dataset
accoustic_vectors.add(entities, weights)

### How to calculate similarity
Here, we use Cosine similairity to determine recommended songs.  Cosine similarity will ignore the magnitude of a vector.  This is important in our usage.  Two completely different songs may have the exact same magnitude of the elements of the vector are similar in valud but in a different order.  Cosine similarity considers where each value is in a vector when making comparisons.

Cosine similairty is the default for Word2Vec.
(https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html#sphx-glr-auto-examples-core-run-similarity-queries-py)



In [311]:
# calculate recommendations based on 'home-made' vector
recommended_songs = np.array(accoustic_vectors.most_similar(df_playlists_test_given.track_uri.values, topn=10))[:,0]
print_recommended_playlist(recommended_songs, test_withheld)

0 tracks matches.
RECOMMENDED PLAYLIST (* indicates a match)
 Artist              Track                         URI                           
 Tommy Lee Sparta    Fi Get a 4Ward                spotify:track:5pXW9mNcfiy6Q0a3XnPvwR
 <no preview>

 Luscious Jackson    Hula Hoop                     spotify:track:2lhxvMG1gbH4gItZSFxe6W
 https://p.scdn.co/mp3-preview/a26555bb8b5abfeade1cf245aa823c0b93a730f4?cid=72413f75d4db4ec79c6caaf02523959e

 Kwesi Arthur        Grind Day                     spotify:track:1V3faphnJ5BVkflGrTlWf4
 https://p.scdn.co/mp3-preview/7c7317c7ac125e573e2b070b1237441c2b426e73?cid=72413f75d4db4ec79c6caaf02523959e

 J-Rio               Sors Ça                       spotify:track:27ECSowSIahjZVp3mTKQFU
 https://p.scdn.co/mp3-preview/d10fc955445868c2bd31dba385d85a36b3125bf0?cid=72413f75d4db4ec79c6caaf02523959e

 Joha                Shot for Me (La Contestacion) spotify:track:0LAewMyMUWcagznuGzzSVB
 <no preview>

 Khago               Me Blood Ah Boil              spotif

In [312]:
r_precs, NDCGs = eval_model(accoustic_vectors, df_playlists_test_withheld, df_playlists_test_given)

HBox(children=(IntProgress(value=0, description='calculating metrics', style=ProgressStyle(description_width='…


Mean Track R-Precision: 0.018367346938775512
Mean Track NDGC       : 0.017752074453489662


### Result:
This does not look so great (frankly, it is a disaster!).  The scores reflect this.  Not a single track was matched.  This list has nothing in common with our previous recommendations and the tracks selected are not similary to the seed playlist.  What happened?

This approach takes independent songs and creates vectors from the songs features.  These vectors have nothing to do with the playlists and have no other relationship to other vectors other than their cosine similarity.  If the cosine similarity defines a song then why doesn't this work?  We are overestimating our assumption that the accoustic vectors will determine a song.  We also don't account for the fact that some songs are missing features.  When a value is 0, it will have a significant impact on the calculated cosine similarity.

Our attempt with this approach was to create a 'content-based recommender' as described in the intoduction.  We are seeing that this is more challenging than expected.  

## Conclusion:
Collaborative-based recommendations appear to be easier to implement.  Predicting based on content requires a careful screening of the values that define the content and ensuring that all items have reasonable values.  This is a "pro" in favor of collarborative models.  However, collaborative models will enforce popularity trends that will compel already-popular items to be even more popular.  Another term for collaborative-based could be 'popularity-based'.  If someone wants something that is 'off-the-beaten-path', a collaborative model will not likely give them what they are looking for.  A content-based approach may be more valuable, but, as mentioned, is harder to implement.
