# Spotify Recommender Model

Using Word2Vec, a variety of different recommender model options are explored.

1. **Content-Based**
> Predicts based on what a user has listened to in the past.
> Uses features of songs to find similar songs.

2. **Collaborative**
> Predicts based on what other listeners like
> Focuses on what songs other users liked who also liked a chosen song. 


## Word2Vec
In both types of recommender models, a 'vectorized' representation of a song is used to find similar songs.  Word2Vec was originally intended to do what it's name implies; convert Words to Vectors.  Here, we will use that intended functionality to convert Songs to Vectors.

### Embeddings
Word2Vec is a process that uses vectorized words to predict other words.  It does this by ingesting a series of documents, parsing out the words, vectorizing the words and then using the vector representations to predict other words.  The vectors are built in such a way that each word has a unique vector that is based on its usage in the documents.  The result is a vector space filled with words where related words have vectors that are similar.  This vector space is referred to an an **embedding**.  This embedding is used in two common word prediction tasks: Skip-Gram and Continuous Bag of Words.

> **Skip-Gram** <br>
> The Skip-Gram model asks for a single word and then predicts words surrounding the word.

> **Bag-of-Words** <br>
> The bag-of-words model asks for a series of words and will return the missing word.


### Making a Playlist
What does this have to do with playlists?  Good question.  If we can consider a Song as a Word and a Playlist as related words or sentences, the applicability is evident.

To make a playlist, we simply convert Song to Vectors and then find new songs by finding other songs with similar vectors.  To schieve this, we can use the Bag-of-Words or Skip-Gram approach as mentioned above.  Provide a song, a Skip-Gram model can supply a playlist.  Provide a list of songs, and Bag-of-Words model can give you the next song.

### That Was Easy!
Not so fast!  How we build the embedding of Songs will have an impact on how the new songs are predicted.  When we make the embedding, what are we giving as a song?  The title of the song?  The genre?  The artist? What are we providing as documents?  Playlists?  Albums? These choices will provide different results.  Three embedding options are explored:

<a name='index'></a>
### <a href=#1>1. Embeddings from Playlists - Song ID - Unsupervised</a>
> Here, we will take data from Spotify that included 1M playlists and the songs in each playlist.  We'll use the Word2Vec process supplying playlists as documents and each song's unique id is used as the word. <br><br>
After the embedding is created, we can skip the creation of building and training a BOW or Skip-Gram model.  All we need to do is find vectors that are similar to a song or a list of songs.  

### <a href=#2>2. Embeddings from Playlists - Song ID - BOW</a>
> We can use the same embedding to create a BOW model.


### <a href=#3>3. Embeddings from Playlists - Song ID - Skip-Gram</a>
> Let's use the embedding from the playlists and use Word2Vec to create a Skip-Gram model.


### <a href=#4>4. Embeddings from Song Features - Unsupervised</a>
> Here, we can take a break from Word2Vec and get very basic.  We create our own vectors based on Spotify accoustive feature data.  We have a series of fields available for all of our songs that numerically represent various characteristics of the songs; dancebaility, loudness, temp, key, energy, etc.



<br>

**References:**

https://www.analyticsvidhya.com/blog/2019/07/how-to-build-recommendation-system-word2vec-python/

https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484

https://towardsdatascience.com/how-to-build-a-simple-song-recommender-296fcbc8c85



### Import libraries

In [15]:
# Basic Imports
import warnings;
warnings.filterwarnings('ignore')

import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import time
import random
import matplotlib.pyplot as plt
%matplotlib inline


from gensim.models import Word2Vec
from gensim import utils
import gensim.models
from gensim.models import KeyedVectors

# For the Spotify Dataset
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Table, Column, Integer, String, Float, MetaData, and_, or_, func
from sqlalchemy import create_engine
import sqlite3
from sqlalchemy.orm import sessionmaker
from sqlalchemy import exc

from sklearn.model_selection import train_test_split

sys.path.append('../../')
from spotify_api import get_spotify_data, get_tracks, get_artists, get_audiofeatures
from spotify_database import get_session, display_time
from spotify_utils import Table_Generator, List_Generator, pickle_load, pickle_save

# !pip install ipywidgets 
# !jupyter nbextension enable --py widgetsnbextension
# !jupyter labextension install @jupyter-widgets/jupyterlab-manager

# %%capture
from tqdm import tqdm_notebook as tqdm

### Set Data Path Variables

In [2]:
data_path = '../../data/SpotifyDataSet'
db_path = '../../data/SpotifyDataSet/spotify_songs.db'

# Get sesion
session = get_session(db_path)
engine = create_engine('sqlite:///' + db_path)

# Get Songs class
Playlists = getattr(get_session, "Playlists")
Artists = getattr(get_session, "Artists")
Tracks = getattr(get_session, "Tracks")

<a name='1'></a>
## 1. Embeddings from Playlists using Song ID - Unsupervised
<a href=#index>back to index</a>


The baseline model will use embeddings to find similarities between songs.  The embeddings are built from playlists, where the playlist serves as a sentence made up of songs.

Similarities between songs are determined by their cosine distance with other songs.

To speed up to building of the enbedding, an extract is made from the database which will serve as documents for the embedding.  Each 'sentence' is a playlist and each 'word' is a song in the playlist.

From the DB, the following view is created which is subseqntly extracted as a CSV file:<br>

`CREATE VIEW playlist_tracks_uris AS SELECT t.playlist_id, group_concat(t.track_uri, ' ')  FROM playlists t GROUP BY  t.playlist_id;`


In [9]:
# get playlists CSV file into DataFrame
corpus_file = 'playlist_tracks.csv'
corpus_filepath = os.path.join(data_path, corpus_file)

df_playlists = pd.read_csv(corpus_filepath, sep='\t', header=None)
df_playlists.columns = ["playlistID","tracks"]
df_playlists.set_index('playlistID', drop=True, inplace=True)
df_playlists.head()

In [73]:
# Create train and validation sets
playlists_train, val = train_test_split(df_playlists, test_size=10)

In [74]:
# Build validation playlists - list of lists
playlists_val = []
for plId, line in val.itertuples():
    playlists_val.append(line.split(' ')) # space-delimited tracks

In [79]:
# Iterator that yields the songs for each playlist in a Dataframe
class Playlist_URIs_df(object):
    """
    Playlist generator that yileds the track uris in a playlist.
    Yields one playlist at a time.
    """
    def __init__(self,
                 dataframe:pd.DataFrame=None,
                 name:str=None,
                 iters:int=None):
        self.dataframe    = dataframe
        self.length       = len(dataframe)
        self.name         = name
        self.count        = 0
        self.iters        = iters
        print("Creating Playlist Track Listing Generator:")
        print("\tlength     : ", self.length)
    
    def __iter__(self):
        
        self.count += 1
        progbar = tqdm(total=self.length, desc="{}:{}/{}".format(self.name, self.count, self.iters+1))
        
        for plId, line in self.dataframe.itertuples():
            progbar.update(1)
            yield line.split(' ') # space-delimited tracks
            
        progbar.close()    

In [80]:
iters = 5
playlists_gen = Playlist_URIs_df(dataframe=playlists_train,
                                 name="Building Vectors",
                                 iters=iters) 

Creating Playlist Track Listing Generator:
	length     :  998991


In [81]:
# Build a gensim BOW model including a word embedding
model_BOW = gensim.models.Word2Vec(sentences=playlists_gen,
                               workers = 8,    # number of processors
                               sg = 0,         # 1=skip-gram, 0=CBOW
                               iter=iters      # training iterations - default=5
                              )

# save the model
model_filepath = os.path.join(data_path, 'playlists_BOW.model')
model_BOW.save(model_filepath)

# Save the embedding
kv_filepath = os.path.join(data_path, 'playlists.embedding')
model_BOW.wv.save(kv_filepath)

# about 5 minutes per iteration with 4 processors // 2 min per iteration with 8

HBox(children=(IntProgress(value=0, description='Building Vectors:1/6', max=998991, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:2/6', max=998991, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:3/6', max=998991, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:4/6', max=998991, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:5/6', max=998991, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:6/6', max=998991, style=ProgressStyle(descri…




In [82]:
# Build a gensim Skip-Gram model including a word embedding
model_SG = gensim.models.Word2Vec(sentences=playlists_gen,
                               workers = 8,    # number of processors
                               sg = 1,         # 1=skip-gram, 0=CBOW
                               iter=iters      # training iterations - default=5
                              )

# save the model
model_filepath = os.path.join(data_path, 'playlists_SG.model')
model_SG.save(model_filepath)

# NOTE: No need to save the embedding again, it is the same as in the BOW model

# about 5 minutes per iteration with 4 processors // 2 min per iteration with 8

HBox(children=(IntProgress(value=0, description='Building Vectors:7/6', max=998991, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:8/6', max=998991, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:9/6', max=998991, style=ProgressStyle(descri…




HBox(children=(IntProgress(value=0, description='Building Vectors:10/6', max=998991, style=ProgressStyle(descr…




HBox(children=(IntProgress(value=0, description='Building Vectors:11/6', max=998991, style=ProgressStyle(descr…




HBox(children=(IntProgress(value=0, description='Building Vectors:12/6', max=998991, style=ProgressStyle(descr…




### Model Attributes
Now, we have created an embedding and an associated model which follows the BOW approach.

#### wv
> This object essentially contains the mapping between words and embeddings. It can be used directly to query the embeddings in various ways. 

#### vocabulary
> This object represents the vocabulary (sometimes called Dictionary in gensim) of the model. 

## Test the model
Get 10 'words' from the embedding's vocabulary.

In [84]:
# reload saved embedding
kv_filepath = os.path.join(data_path, 'playlists.embedding')
embedding = KeyedVectors.load(kv_filepath, mmap='r')

In [85]:
# display list of random track along with a count of appearances
songids = list(embedding.wv.vocab.keys())
idxs_rnd = [random.randint(0,len(songids)) for x in range(10)]

for i in idxs_rnd:
    print("Track: {}".format(songids[i]))
    print("\t",embedding.wv.vocab[songids[i]])

Track: spotify:track:3VovAXS4PthsLIh1E12YpN
	 Vocab(count:217, index:38036, sample_int:4294967296)
Track: spotify:track:5ku5j7ad7iF2pkBZSLvwaP
	 Vocab(count:6, index:515594, sample_int:4294967296)
Track: spotify:track:4wyT0AR1eZv7upSLu9igOn
	 Vocab(count:8, index:423116, sample_int:4294967296)
Track: spotify:track:4EYFoCd6syEWOiIriYzCqw
	 Vocab(count:9, index:389406, sample_int:4294967296)
Track: spotify:track:4362FPQ2tIiSbUOWQLS0zG
	 Vocab(count:29, index:174300, sample_int:4294967296)
Track: spotify:track:1XqYMw1Zzinyce2dIUqybS
	 Vocab(count:11, index:336668, sample_int:4294967296)
Track: spotify:track:3Cspnktoidrjtj2rqoAP5N
	 Vocab(count:6, index:478544, sample_int:4294967296)
Track: spotify:track:2T8si2zr5MsObOJkfLvpyS
	 Vocab(count:20, index:226351, sample_int:4294967296)
Track: spotify:track:33Z5Kc0n1eQDAxCa1KWo3M
	 Vocab(count:8, index:427531, sample_int:4294967296)
Track: spotify:track:1f7l0TJOWugKWJsSgGAJuO
	 Vocab(count:32, index:162379, sample_int:4294967296)


In [86]:
# Show the vector value for the first word in in the list above
rnd = random.randint(0, len(embedding.wv.vocab))
print(songids[idxs_rnd[0]])
display(embedding.wv.get_vector(songids[idxs_rnd[0]]))

spotify:track:3VovAXS4PthsLIh1E12YpN


memmap([ 8.0540013e-01, -5.0874066e-01,  1.1287875e+00,  6.4198679e-01,
        -9.0542966e-01, -4.2664385e-01, -4.7181049e-01, -1.3354428e+00,
         4.9593821e-03,  3.0341092e-01, -8.0220771e-01,  7.9279709e-01,
        -6.7764327e-02,  5.7272369e-01,  6.3917673e-01,  1.1619152e-01,
         9.2092216e-01,  2.9572269e-01, -5.4103029e-01, -7.1117826e-02,
        -8.3022183e-01,  1.1645370e-01,  2.0017964e-01, -1.1606492e+00,
        -3.2457730e-01, -5.4774272e-01,  1.1806794e-01,  5.0981635e-01,
         2.1882086e-01,  1.3184129e+00,  7.3587263e-01, -4.8899615e-01,
         1.4229004e-01, -9.6439630e-01,  1.0739563e-01, -4.7043553e-01,
        -9.8947376e-01,  2.8670225e-01, -1.4417414e-01,  1.8540988e-02,
         2.6174539e-01,  7.4254292e-01,  2.4746957e-01, -1.2949882e+00,
         6.6327482e-01,  1.3422848e-01,  8.3162546e-01,  2.9623902e-01,
        -5.4799962e-01, -1.0918818e-01,  1.3129501e+00, -2.0395775e-01,
        -5.5571031e-02,  7.9051864e-01,  1.5948552e-01,  9.00400

### Make a Playlist
Find the ID of a specific song of your choice.  Here, we use the Spotify API to get the song we are interested in.

In [87]:
# # Get a song as a seed for a test playlist
# db_track = display_time(session.query(Playlists.track_name, 
#                                 Playlists.track_uri,
#                                 Playlists.artist_uri,
#                                 Playlists.album_uri).filter(Playlists.track_name=="Free Fallin'").distinct().first)
# seed_uri = db_track.track_uri

# print("Artist:    {}".format(get_artists([db_track.artist_uri])[0]['name']))
# print("Track:     {}".format(db_track.track_name))
# print("Track URI: {}".format(seed_uri))

In [92]:
sp_seed_track #['artists'] #[0]['name']

[{'album': {'album_type': 'album',
   'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/4BxCuXFJrSWGi1KHcVqaU4'},
     'href': 'https://api.spotify.com/v1/artists/4BxCuXFJrSWGi1KHcVqaU4',
     'id': '4BxCuXFJrSWGi1KHcVqaU4',
     'name': 'Kodaline',
     'type': 'artist',
     'uri': 'spotify:artist:4BxCuXFJrSWGi1KHcVqaU4'}],
   'available_markets': ['AD',
    'AE',
    'AR',
    'AT',
    'AU',
    'BE',
    'BG',
    'BH',
    'BO',
    'BR',
    'CA',
    'CH',
    'CL',
    'CO',
    'CR',
    'CY',
    'CZ',
    'DE',
    'DK',
    'DO',
    'DZ',
    'EC',
    'EE',
    'EG',
    'ES',
    'FI',
    'FR',
    'GB',
    'GR',
    'GT',
    'HK',
    'HN',
    'HU',
    'ID',
    'IE',
    'IL',
    'IN',
    'IS',
    'IT',
    'JO',
    'JP',
    'KW',
    'LB',
    'LI',
    'LT',
    'LU',
    'LV',
    'MA',
    'MC',
    'MT',
    'MX',
    'MY',
    'NI',
    'NL',
    'NO',
    'NZ',
    'OM',
    'PA',
    'PE',
    'PH',
    'PL',
    'PS',
    'P

In [94]:
# Get a song from one of the validation playlists
rnd_val = random.randint(0,len(playlists_val))
seed_uri = playlists_val[rnd_val][0]

sp_seed_track = get_tracks([seed_uri]) # get spotify data for track

print("Artist       : ",   sp_seed_track[0]['artists'][0]['name'])
print("Track        : ",   sp_seed_track[0]['name'])
print("Track Preview: \n", sp_seed_track[0]['preview_url'] )
print()


Artist       :  Drake
Track        :  One Dance
Track Preview: 
 None



### Use the Word2Vec embedding to find similar songs
Using the Word2Vec function `similar_by_word()`

In [95]:
# Find similar songs 
playlist_rec = np.array(embedding.similar_by_word(seed_uri, topn=10, restrict_vocab=None))
playlist_rec

array([['spotify:track:11KJSRSgaDxqydKYiD2Jew', '0.8229084014892578'],
       ['spotify:track:6F609ICg9Spjrw1epsAnpa', '0.8087855577468872'],
       ['spotify:track:0azC730Exh71aQlOt9Zj3y', '0.7671030759811401'],
       ['spotify:track:5mPSyjLatqB00IkPqRlbTE', '0.7500139474868774'],
       ['spotify:track:27PmvZoffODNFW2p7ehZTQ', '0.7456628084182739'],
       ['spotify:track:5OOkp4U9P9oL23maHFHL1h', '0.7453163266181946'],
       ['spotify:track:6r2jK1A6oFRPREZfxjc5d1', '0.716140866279602'],
       ['spotify:track:1Tt4sE4pXi57mTD1GCzsqm', '0.7118954658508301'],
       ['spotify:track:4c0rkFPszqQTyC753tsCMU', '0.7113115191459656'],
       ['spotify:track:4tCtwWceOPWzenK2HAIJSb', '0.7020060420036316']],
      dtype='<U36')

In [98]:
# Get the similar songs from Spotify to show their details, including preview link (if available)
sp_playlist = get_tracks(playlist_rec[:,0])

print("RECOMMENDED PLAYLIST")
for t in sp_playlist:
    print("Artist/Track : {} / {}".format( t['artists'][0]['name'],t['name']))
    print("Preview: \n", t['preview_url'] )
    print()

RECOMMENDED PLAYLIST
Artist/Track : Drake / Too Good
Preview: 
 None

Artist/Track : Drake / Controlla
Preview: 
 None

Artist/Track : Calvin Harris / This Is What You Came For
Preview: 
 https://p.scdn.co/mp3-preview/16e42342599e423f7da57c9d2ce0c6758fc430e6?cid=72413f75d4db4ec79c6caaf02523959e

Artist/Track : Drake / Pop Style
Preview: 
 None

Artist/Track : Kent Jones / Don't Mind
Preview: 
 https://p.scdn.co/mp3-preview/d239d792c6a8c706ff0e57aa7b38eaa1b3a5ad48?cid=72413f75d4db4ec79c6caaf02523959e

Artist/Track : Desiigner / Panda
Preview: 
 None

Artist/Track : Desiigner / Panda
Preview: 
 None

Artist/Track : Rihanna / Needed Me
Preview: 
 None

Artist/Track : Ghost Town DJs / My Boo - Hitman's Club Mix
Preview: 
 https://p.scdn.co/mp3-preview/cf5c29df8e73f6c8301fdfc78e1849a02795cf7d?cid=72413f75d4db4ec79c6caaf02523959e

Artist/Track : Fifth Harmony / Work from Home (feat. Ty Dolla $ign)
Preview: 
 https://p.scdn.co/mp3-preview/b2d6e6ea9163eb361f5406f75264de490243964a?cid=72413f75d

In [108]:
# Get the similar songs from Spotify to show their details, including preview link (if available)
sp_playlist = get_tracks(playlists_val[0][0:10])

print("RECOMMENDED PLAYLIST")
for t in sp_playlist:
    print("Artist/Track : {} / {}".format( t['artists'][0]['name'],t['name']))
    print("Preview: \n", t['preview_url'] )
    print()

RECOMMENDED PLAYLIST
Artist/Track : Crown The Empire / Oh, Catastrophe
Preview: 
 https://p.scdn.co/mp3-preview/79b9cbaab59bd2fe691c4d4c2fa6e26ce873f0a0?cid=72413f75d4db4ec79c6caaf02523959e

Artist/Track : Crown The Empire / The Fallout
Preview: 
 https://p.scdn.co/mp3-preview/dc41b6cbb3d4113e9d74e4e089a8d7fb6806f289?cid=72413f75d4db4ec79c6caaf02523959e

Artist/Track : Crown The Empire / Memories Of A Broken Heart
Preview: 
 https://p.scdn.co/mp3-preview/74ff619ed37919a46b9727492f401000efca7ee1?cid=72413f75d4db4ec79c6caaf02523959e

Artist/Track : Crown The Empire / Makeshift Chemistry
Preview: 
 https://p.scdn.co/mp3-preview/2b4bdbc2fb5644082a90d2e603ee218a51f57ab2?cid=72413f75d4db4ec79c6caaf02523959e

Artist/Track : Crown The Empire / The One You Feed
Preview: 
 https://p.scdn.co/mp3-preview/80db629185ef4a8a9ea520135ed5696460385b30?cid=72413f75d4db4ec79c6caaf02523959e

Artist/Track : Crown The Empire / Menace
Preview: 
 https://p.scdn.co/mp3-preview/4a08b1d1d6e2f051aa1a68377859b61bdf9

## Metrics

### R-Precision
Lets' compare the list to the validation list.
We can compare the intersection of tracks and the intersection of artists.

let: <br>
> $G$ = ground truth (validation playlist) <br>
> $R$ = recommendation list<br>

> $R-precision = \frac{|G \bigcap R_{1:|G|}|}{|G|}$


In [155]:
def calc_track_rPrecision(ground_truth, recommendation)->float:
    """
    Calculates r-precision based on a ground truth playlist and a 
    recommendation playlist.
    Each playlist is execpexted to be a list of lists where 
    each item in the list is a playlist and the corresponding 
    list is a list of the tracks in that playlist.
    """
    G=set(ground_truth)
    R=set(recommendation)
    
    return len(G&R)/len(G)

### NDCG - Normalized Discounted Cumulative Gain
NDCG will incorporate not only the relevance, but also the order of the items in the recommended playlist.

To caluclate, we need the the DCG (discounted cumulative gain)  which measures the ranking quality.  We also need the IDCG (ideal discounted cumulative gain).

> $DCG = rel_1 + \sum^{|R|}_{i=2}\frac{rel_i}{log_2(i+1)} $ <br>
> $IDCG = 1 + \sum^{|G|}_{i=2}\frac{1}{log_2(i+1)} $

NDGC is calculated as follows: <br>
> $NDCG=\frac{DCG}{IDCG}$

Where $rel_{{i}}$ is the graded relevance of the result at position $i$.  Relevance = 1 when the recommended track is in the ground truth playlist.

In [156]:
def calc_track_NDCG(ground_truth, recommendation)->float:
    list_len = len(recommendation)
    scores = [(track in recommendation) for track in ground_truth[0:list_len]]
    
    DCG  = scores[0] + np.sum(scores[1:]/np.log2( np.arange(1,list_len)+1) )
    IDCG = 1 + np.sum(1/np.log2( np.arange(1,list_len)+1 ))
    NDGC = DCG/IDCG
    
    return NDGC

In [171]:
def calc_track_metrics(ground_truth, recommendation, display=True)-> (float,float):
    """
    Prints relevant metrics give a grount truth playlist and a 
    recommended playlist.
    """
    r_prec = calc_track_rPrecision(ground_truth, recommendation)
    NDGC = calc_track_NDCG(ground_truth, recommendation)
    
    if display:
        print("Track R-Precision: {}".format(r_prec))
        print("Track NDGC       : {}".format(NDGC))
    
    return r_prec, NDGC

In [172]:
r_prec, NDCG = calc_metrics(playlists_val[rnd_val], playlist_rec[:,0])

R-Precision: 0.028169014084507043
NDGC       : 0.19031326377064928


In [164]:
playlist_rec = np.array(embedding.similar_by_word(seed_uri, 
                                                          topn=10, 
                                                          restrict_vocab=None))

In [165]:
playlist_rec

array([['spotify:track:11KJSRSgaDxqydKYiD2Jew', '0.8229084014892578'],
       ['spotify:track:6F609ICg9Spjrw1epsAnpa', '0.8087855577468872'],
       ['spotify:track:0azC730Exh71aQlOt9Zj3y', '0.7671030759811401'],
       ['spotify:track:5mPSyjLatqB00IkPqRlbTE', '0.7500139474868774'],
       ['spotify:track:27PmvZoffODNFW2p7ehZTQ', '0.7456628084182739'],
       ['spotify:track:5OOkp4U9P9oL23maHFHL1h', '0.7453163266181946'],
       ['spotify:track:6r2jK1A6oFRPREZfxjc5d1', '0.716140866279602'],
       ['spotify:track:1Tt4sE4pXi57mTD1GCzsqm', '0.7118954658508301'],
       ['spotify:track:4c0rkFPszqQTyC753tsCMU', '0.7113115191459656'],
       ['spotify:track:4tCtwWceOPWzenK2HAIJSb', '0.7020060420036316']],
      dtype='<U36')

In [200]:
def eval_model(model, playlists_val, display=True) -> (float,float):
    """
    Will perform an r_precision and NDCG calculation on all
    validation playlists and return a mean of each score
    for all scored playlists.
    """
    r_precs = []
    NDCGs = []
    for plist in playlists_val:
        # get a recommended playlist based on first song
        seed_uri = plist[0]
        
        if type(model) == gensim.models.keyedvectors.Word2VecKeyedVectors:
            playlist_rec = np.array(embedding.similar_by_word(seed_uri, 
                                                          topn=10))
        
        elif type(model) == gensim.models.word2vec.Word2Vec:
            playlist_rec = np.array(BOW_model.predict_output_word([seed_uri], 
                                                                  topn=10))
        
        r_prec, NDGC = calc_track_metrics(plist, 
                                    playlist_rec[:,0],
                                    display=False)
        
        r_precs.append(r_prec)
        NDCGs.append(NDGC)
    
    if display:
        print("Mean Track R-Precision: {}".format(np.mean(r_precs)))
        print("Mean Track NDGC       : {}".format(np.mean(NDCGs)))
        
    return r_precs, NDCGs
        
    

In [202]:
r_precs, NDCGs = eval_model(embedding, playlists_val)

Mean Track R-Precision: 0.021781417850502447
Mean Track NDGC       : 0.06778209214352757


### Results:
Based on the playlists that were supplied when buidling the embedding, the list of Top 10 most similar songs is presented above.  Several other Tom Petty songs appeared as well as other artists that are similar in genre, attitude, etc.

This playlist is based on the top songs that other users placed in playlists that include the song we selected as our seed.  This is a good example of 'collaborative filtering' as it uses preferences from others to recommend songs.

<a name='2'></a>
## 2. Embeddings from Playlists - Song ID - BOW
<a href=#index>back to index</a>

Now, let's use the model that was created when we initially established our embedding.  The BOW model can be used to predict a song from a song or a list of supplied songs.

Let's use the seed song "Free Fallin'" again and see what is returned.

In [181]:
model_filepath = os.path.join(data_path, 'playlists_BOW.model')
BOW_model = Word2Vec.load(model_filepath)

In [183]:
recommended_songs = np.array(BOW_model.predict_output_word([seed_uri], topn=10))
sp_tracks = get_tracks(recommended_songs[:,0])

for track in sp_tracks:
    print("\nArtist: {}".format(track['artists'][0]['name']))
    print("Track Name: {}".format(track['name']))
    print("Preview: {}".format(track['preview_url']))

token():INFO:   Token refreshed

Artist: Drake
Track Name: Controlla
Preview: None

Artist: Drake
Track Name: Pop Style
Preview: None

Artist: Drake
Track Name: Too Good
Preview: None

Artist: Drake
Track Name: One Dance
Preview: None

Artist: Calvin Harris
Track Name: This Is What You Came For
Preview: https://p.scdn.co/mp3-preview/16e42342599e423f7da57c9d2ce0c6758fc430e6?cid=72413f75d4db4ec79c6caaf02523959e

Artist: Drake
Track Name: Views
Preview: None

Artist: Drake
Track Name: Hotline Bling
Preview: None

Artist: Kent Jones
Track Name: Don't Mind
Preview: https://p.scdn.co/mp3-preview/d239d792c6a8c706ff0e57aa7b38eaa1b3a5ad48?cid=72413f75d4db4ec79c6caaf02523959e

Artist: Desiigner
Track Name: Panda
Preview: None

Artist: Sia
Track Name: Cheap Thrills
Preview: https://p.scdn.co/mp3-preview/88816b2040a092aa99d5b0e42945d79dc5027c1a?cid=72413f75d4db4ec79c6caaf02523959e


In [203]:
r_precs, NDCGs = eval_model(BOW_model, playlists_val)

Mean Track R-Precision: 0.0356446392278155
Mean Track NDGC       : 0.23164306360256548


### Result:
Most of the songs are the same, but the BOW model only returned songs from Tom Petty where our basic embedding-only apprach found other artists as well.

<a name='3'></a>
## 3. Embeddings from Playlists - Song ID - Skip-Gram
<a href=#index>back to index</a>

Here, we create a playlist from the Skip-Gram model.  For consistency, we will use the same song, "Free Fallin'", again to see if our results differ.

In [205]:
# load the skip-gram model we previously created
model_filepath = os.path.join(data_path, 'playlists_SG.model')
SG_model = Word2Vec.load(model_filepath)

In [207]:
recommended_songs = np.array(SG_model.predict_output_word([seed_uri], topn=10))
sp_tracks = get_tracks(recommended_songs[:,0])

for track in sp_tracks:
    print("\nArtist: {}".format(track['artists'][0]['name']))
    print("Track Name: {}".format(track['name']))
    print("Preview: {}".format(track['preview_url']))


Artist: Drake
Track Name: One Dance
Preview: None

Artist: Drake
Track Name: Controlla
Preview: None

Artist: Drake
Track Name: Pop Style
Preview: None

Artist: Drake
Track Name: Too Good
Preview: None

Artist: Calvin Harris
Track Name: This Is What You Came For
Preview: https://p.scdn.co/mp3-preview/16e42342599e423f7da57c9d2ce0c6758fc430e6?cid=72413f75d4db4ec79c6caaf02523959e

Artist: Drake
Track Name: Grammys
Preview: None

Artist: Kent Jones
Track Name: Don't Mind
Preview: https://p.scdn.co/mp3-preview/d239d792c6a8c706ff0e57aa7b38eaa1b3a5ad48?cid=72413f75d4db4ec79c6caaf02523959e

Artist: Rihanna
Track Name: Needed Me
Preview: None

Artist: Drake
Track Name: Childs Play
Preview: None

Artist: Drake
Track Name: Views
Preview: None


In [208]:
r_precs, NDCGs = eval_model(SG_model, playlists_val)

Mean Track R-Precision: 0.0356446392278155
Mean Track NDGC       : 0.23164306360256548


### Result:
This list looks much like our original playlist based on the embeddings, but it isn't the same.  Here, different artists are suggested.

<a name='4'></a>
## 4. Embeddings from Song Features - Unsupervised
<a href=#index>back to index</a>

Here, we can take a break from Word2Vec and get very basic.  We create our own vectors based on Spotify accoustive feature data.  We have a series of fields available for all of our songs that numerically represent various characteristics of the songs; dancebaility, loudness, temp, key, energy, etc.

In [210]:
# fetch all db tracks
db_tracks = display_time(session.query(Tracks).all)
session.close()

Time to Execute: 83.7 seconds


In [211]:
# create a Pandas dataframe
df_all_tracks = pd.DataFrame([x.__dict__ for x in db_tracks]).drop('_sa_instance_state', axis=1).set_index(['track_uri'])
df_all_tracks.head()

Unnamed: 0_level_0,danceability,time_signature,tempo,liveness,instrumentalness,speechiness,loudness,energy,artist_uri,track_popularity,duration_ms,valence,acousticness,mode,key
track_uri,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
spotify:track:2d7LPtieXdIYzf7yHPooWd,0.467,4,108.13,0.0816,1e-06,0.0336,-9.649,0.157,spotify:artist:0MeLMJJcouYXCymQSHPn8g,65,242564,0.277,0.974,1,11
spotify:track:0y4TKcc7p2H6P0GJlt01EI,0.312,4,93.778,0.0773,0.00818,0.0347,-13.367,0.207,spotify:artist:7w0qj2HiAPIeUcoPogvOZ6,36,253933,0.278,0.961,1,10
spotify:track:6q4c1vPRZREh7nw3wG7Ixz,0.412,4,85.462,0.083,0.772,0.0278,-14.214,0.159,spotify:artist:32ogthv0BdaSMPml02X9YB,54,103920,0.0389,0.991,1,9
spotify:track:54KFQB6N4pn926IUUYZGzK,0.264,4,148.658,0.094,0.349,0.0349,-15.399,0.122,spotify:artist:32ogthv0BdaSMPml02X9YB,72,371320,0.0735,0.885,1,9
spotify:track:0NeJjNlprGfZpeX2LQuN6c,0.658,4,128.128,0.17,0.0,0.0448,-10.866,0.179,spotify:artist:3qnGvpP8Yth1AqSBMqON5x,75,238560,0.191,0.689,1,8


In [212]:
# define features that we will use to create our custom vectors
vector_features= [
    'acousticness',
    'danceability',
    'duration_ms',
    'energy',
    'instrumentalness',
    'key',
    'liveness',
    'loudness',
    'mode',
    'speechiness',
    'tempo',
    'time_signature',
    'valence'
]

In [213]:
# Create a dataframe with the vectors for simplicity
drop_cols = set(df_all_tracks.columns) - set(vector_features)
df = df_all_tracks.drop(drop_cols, axis=1)
df.head()

Unnamed: 0_level_0,danceability,time_signature,tempo,liveness,instrumentalness,speechiness,loudness,energy,duration_ms,valence,acousticness,mode,key
track_uri,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
spotify:track:2d7LPtieXdIYzf7yHPooWd,0.467,4,108.13,0.0816,1e-06,0.0336,-9.649,0.157,242564,0.277,0.974,1,11
spotify:track:0y4TKcc7p2H6P0GJlt01EI,0.312,4,93.778,0.0773,0.00818,0.0347,-13.367,0.207,253933,0.278,0.961,1,10
spotify:track:6q4c1vPRZREh7nw3wG7Ixz,0.412,4,85.462,0.083,0.772,0.0278,-14.214,0.159,103920,0.0389,0.991,1,9
spotify:track:54KFQB6N4pn926IUUYZGzK,0.264,4,148.658,0.094,0.349,0.0349,-15.399,0.122,371320,0.0735,0.885,1,9
spotify:track:0NeJjNlprGfZpeX2LQuN6c,0.658,4,128.128,0.17,0.0,0.0448,-10.866,0.179,238560,0.191,0.689,1,8


In [214]:
# create a KeyedVectors object from the Word2Vec library
# This will allow us to use the built-in Word2Vec functions
accoustic_vectors = KeyedVectors(len(vector_features))

# weights are the vectors for each track
weights = np.array(df)

# entities are the trackuris
entities = np.array(df.index)

# add the vectors to the dataset
accoustic_vectors.add(entities, weights)

In [216]:
# seed_uri = db_track.track_uri
playlist = np.array(accoustic_vectors.most_similar(seed_uri, topn=10))[:,0]
# playlist = np.array(accoustic_vectors.similar_by_word(seed_uri, topn=10, restrict_vocab=None))[:,0]
seed_track = get_tracks([seed_uri])[0]
sp_playlist = get_tracks(playlist)

In [217]:
print("Playlist Seed:")
print("\tArtist       : ", seed_track['artists'][0]['name'])
print("\tTrack        : ", seed_track['name'])
print("\tTrack Preview: ", seed_track['preview_url'] )
print()
for t in sp_playlist:
    print("Artist       : ", t['artists'][0]['name'])
    print("Track        : ", t['name'])
    print("Track Preview: ", t['preview_url'] )
    print() 

Playlist Seed:
	Artist       :  Drake
	Track        :  One Dance
	Track Preview:  None

Artist       :  Carlos Baute
Track        :  No me abandones amiga mía
Track Preview:  https://p.scdn.co/mp3-preview/ec833f1d7441b9452b719ae76d8c6070b464331e?cid=72413f75d4db4ec79c6caaf02523959e

Artist       :  Santiago Torres Jr
Track        :  Mi Milagro Viene De Camino (feat. Deborah Pruneda)
Track Preview:  https://p.scdn.co/mp3-preview/564d9574faf7759506d37c591f99f9f725e69d49?cid=72413f75d4db4ec79c6caaf02523959e

Artist       :  Sunny Lax
Track        :  Aeons [ABGTN2016]
Track Preview:  https://p.scdn.co/mp3-preview/3ebbb74e37dbc020633ca56baccb918379f7f5e2?cid=72413f75d4db4ec79c6caaf02523959e

Artist       :  Magic Castles
Track        :  Silent
Track Preview:  https://p.scdn.co/mp3-preview/23e01a2b695450b4e869a0366b3eddbe52324154?cid=72413f75d4db4ec79c6caaf02523959e

Artist       :  Metric
Track        :  Celebrate
Track Preview:  None

Artist       :  Las Pelotas
Track        :  Será
Track 

In [218]:
r_precs, NDCGs = eval_model(accoustic_vectors, playlists_val)

Mean Track R-Precision: 0.021781417850502447
Mean Track NDGC       : 0.06778209214352757


### Result:
This does not look so great.  It looks nothing like our previous lists and the songs are not as similar to the seed song as I would like.  What happened?

I can tell you what happened.  This approach takes independent songs and creates vectors from the songs features.  These vectors have nothing to do with the playlists and have no other relationship to other vectors other than their mathmatical distance.  Why doesn't this work?  Because 2 totally different songs can have a combination of feature values that result in the same/similar vector magnitude but not have any commonality other than that.  

This approach is more like the 'content-based recommender' approach described in the intoduction.  