# Spotify Recommender Model

This notebook uses Word2Vec to explore several recommender model options.

1. **Content-Based**
> Predicts based on what a user has listened to in the past.
> Uses features of songs to find similar songs.

2. **Collaborative**
> Predicts based on what other listeners like.
> Focuses on which songs were liked by other users who also liked a chosen song.


## Word2Vec
In both types of recommender model, a 'vectorized' representation of a song is used to find similar songs: for a given song, we look for other songs whose vectors are close to it. Collaborative recommendations require an extra step, because a song's vector is not determined by the song's own properties but by its appearance in playlists.

For both models the selection approach is the same: we find other songs with similar vector values. What differs is how the vector values are calculated in each model.
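To make the distinction concrete, here is a minimal sketch, not the notebook's actual method. The song names, the audio-feature values, and the helper `membership_vectors` are all made up for illustration, and a simple playlist-membership vector stands in for the learned Word2Vec embedding. It shows a content-based vector built from a song's own properties and a collaborative vector built only from playlist membership, with the same cosine-based selection applied to both.

```python
import math

# Hypothetical content-based vectors: made-up audio features per song
content_vectors = {
    "song_a": [0.80, 0.10, 0.60],
    "song_b": [0.20, 0.90, 0.40],
    "song_c": [0.75, 0.15, 0.62],
}

# Hypothetical playlists used to derive the collaborative vectors
playlists = [
    ["song_a", "song_b"],
    ["song_a", "song_b", "song_c"],
    ["song_c"],
]

def membership_vectors(playlists):
    """One 0/1 entry per playlist: does the song appear in it?
    A crude stand-in for a learned embedding."""
    songs = sorted({s for p in playlists for s in p})
    return {s: [1 if s in p else 0 for p in playlists] for s in songs}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(seed, vectors):
    """Same selection step for both models: pick the closest other song."""
    others = [s for s in vectors if s != seed]
    return max(others, key=lambda s: cosine(vectors[seed], vectors[s]))

collab_vectors = membership_vectors(playlists)
print(most_similar("song_a", content_vectors))  # closest by song properties
print(most_similar("song_a", collab_vectors))   # closest by playlist membership
```

With this toy data the two approaches disagree: the content-based vectors pick the song with the most similar features, while the membership vectors pick the song that shares the most playlists with the seed.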

References:

https://www.analyticsvidhya.com/blog/2019/07/how-to-build-recommendation-system-word2vec-python/

https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484

https://towardsdatascience.com/how-to-build-a-simple-song-recommender-296fcbc8c85



In [1]:
# Basic Imports
import warnings
warnings.filterwarnings('ignore')

import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import time
import random
import matplotlib.pyplot as plt
%matplotlib inline


from gensim.models import Word2Vec
from gensim import utils
import gensim.models
from gensim.models import KeyedVectors


In [2]:
# For the Spotify Dataset
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Table, Column, Integer, String, Float, MetaData, and_, or_, func
from sqlalchemy import create_engine
import sqlite3
from sqlalchemy.orm import sessionmaker
from sqlalchemy import exc

sys.path.append('../../')
from spotify_api import get_spotify_data, get_tracks, get_artists, get_audiofeatures
from spotify_database import get_session, display_time
from spotify_utils import Table_Generator, List_Generator, pickle_load, pickle_save

In [3]:
# !pip install ipywidgets 
# !jupyter nbextension enable --py widgetsnbextension
# !jupyter labextension install @jupyter-widgets/jupyterlab-manager

# %%capture
from tqdm import tqdm_notebook as tqdm

In [4]:
data_path = '../../data/SpotifyDataSet'
db_path = '../../data/SpotifyDataSet/spotify_songs.db'

# Get session
session = get_session(db_path)
engine = create_engine('sqlite:///' + db_path)

# Get Songs class
Playlists = getattr(get_session, "Playlists")
Artists = getattr(get_session, "Artists")
Tracks = getattr(get_session, "Tracks")

## Building a Vocabulary of Songs
The baseline model will use embeddings to find similarities between songs.  The embeddings are built from playlists, where the playlist serves as a sentence made up of songs.

Similarity between two songs is determined by the cosine distance between their vectors.
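For reference, the standard definitions for two song vectors $\mathbf{u}$ and $\mathbf{v}$ (symbols introduced here for illustration) are:

$$
\operatorname{sim}(\mathbf{u}, \mathbf{v}) = \cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert},
\qquad
\operatorname{dist}(\mathbf{u}, \mathbf{v}) = 1 - \operatorname{sim}(\mathbf{u}, \mathbf{v})
$$

A similarity near 1 (distance near 0) means the two vectors point in almost the same direction. gensim's similarity queries report the cosine similarity, so larger scores mean closer songs.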

To speed up building the embedding, an extract is made from the database to serve as a document of sentences: each line in the file is a space-delimited list of the songs in one playlist.

From the DB, the following view is created, which is subsequently extracted as a CSV file:

`CREATE VIEW playlist_tracks_uris AS SELECT t.playlist_id, group_concat(t.track_uri, ' ')  FROM playlists t GROUP BY  t.playlist_id;`
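As a rough sketch of that extraction step, the snippet below builds the same kind of view in an in-memory SQLite database and writes it out as a tab-delimited file, which is the layout the reader class further down appears to expect (`playlist_id`, a tab, then space-separated URIs). The toy rows, the output filename `playlist_tracks_demo.csv`, and the column alias `track_uris` are assumptions for illustration; the real schema and export procedure may differ.

```python
import csv
import os
import sqlite3
import tempfile

# Toy stand-in for the real `playlists` table (one row per playlist/track pair)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE playlists (playlist_id INTEGER, track_uri TEXT)")
conn.executemany(
    "INSERT INTO playlists VALUES (?, ?)",
    [(1, "spotify:track:aaa"), (1, "spotify:track:bbb"), (2, "spotify:track:ccc")],
)

# Same shape as the view above, with an alias added for the aggregated column
conn.execute(
    "CREATE VIEW playlist_tracks_uris AS "
    "SELECT t.playlist_id, group_concat(t.track_uri, ' ') AS track_uris "
    "FROM playlists t GROUP BY t.playlist_id"
)

# Write one playlist per line: playlist_id <TAB> space-separated URIs
out_path = os.path.join(tempfile.gettempdir(), "playlist_tracks_demo.csv")
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerows(
        conn.execute(
            "SELECT playlist_id, track_uris FROM playlist_tracks_uris ORDER BY playlist_id"
        )
    )

# Read the file back to check the layout
with open(out_path) as f:
    lines = f.read().splitlines()
print(lines)
```

One caveat worth keeping in mind: SQLite does not guarantee the order in which `group_concat` joins rows, so the track order within a line may not match the original playlist order. Since Word2Vec learns from neighbouring items, that ordering could matter.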


In [51]:
# Iterable that yields the songs for each playlist in the CSV file
class Playlist_URIs(object):
    """
    Restartable playlist iterable that yields the track URIs in a playlist.
    Yields one playlist (a list of track URIs) at a time.
    """
    def __init__(self,
                 filename:str=os.path.join(data_path,'playlist_tracks.csv'),
                 name:str=None,
                 iters:int=None):
        self.filename     = filename
        # Count the lines without loading the whole file into memory
        with open(self.filename, 'r') as f:
            self.length   = sum(1 for _ in f)
        self.name         = name
        self.count        = 0
        self.iters        = iters
        print("Creating Playlist Track Listing Generator:")
        print("\tlength     : ", self.length)
    
    def __iter__(self):
        
        self.count += 1
        # Word2Vec makes one extra pass to build the vocabulary before the training passes
        total_passes = self.iters + 1 if self.iters is not None else "?"
        progbar = tqdm(total=self.length, desc="{}:{}/{}".format(self.name, self.count, total_passes))
        
        # yield one sentence per line; sentences look like [["cat", "say", "meow"], ["dog", "say", "woof"]]
        with open(self.filename, 'r') as f:
            for line in f:
                progbar.update(1)
                # tab-delimited CSV file: playlist_id <TAB> space-separated track URIs
                # strip the newline so the last URI on each line is not stored with '\n' attached
                yield line.rstrip('\n').split('\t')[1].split(' ')
            
        progbar.close()    
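
A class with `__iter__` is used here rather than a plain generator function because the corpus is read more than once: gensim's Word2Vec scans it once to build the vocabulary and then once per training iteration. A generator would be exhausted after the first pass. The sketch below illustrates the difference with an in-memory toy corpus; `ToyPlaylists`, `toy_generator` and the data are hypothetical and stand in for the file-backed class above.

```python
class ToyPlaylists:
    """Restartable iterable: every call to __iter__ starts a fresh pass."""
    def __init__(self, rows):
        self.rows = rows
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        for row in self.rows:
            yield row.split(" ")

def toy_generator(rows):
    """Plain generator: can only be consumed once."""
    for row in rows:
        yield row.split(" ")

rows = ["track_1 track_2 track_3", "track_2 track_4"]

restartable = ToyPlaylists(rows)
first_pass = list(restartable)
second_pass = list(restartable)   # same data again

one_shot = toy_generator(rows)
gen_first = list(one_shot)
gen_second = list(one_shot)       # empty: the generator is already exhausted

print(len(first_pass), len(second_pass), restartable.passes)
print(len(gen_first), len(gen_second))
```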
        

In [52]:
# Create a generator object to use while building the embedding
corpus_file = 'playlist_tracks.csv'
corpus_filepath = os.path.join(data_path, corpus_file)

iters = 5
playlists_gen = Playlist_URIs(filename=corpus_filepath,
                              name="Building Vectors",
                              iters=iters) 

Creating Playlist Track Listing Generator:
	length     :  999001


In [53]:
# Build a gensim model including a word embedding
model = gensim.models.Word2Vec(sentences=playlists_gen,
                               workers = 8,    # number of worker threads
                               sg = 0,         # 1=skip-gram, 0=CBOW
                               iter=iters      # training iterations - default=5 (gensim 3.x name; renamed to `epochs` in gensim 4.x)
                              )

# about 5 minutes per iteration with 4 processors // 2 min per iteration with 8
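
The `sg` flag selects between the two Word2Vec training objectives. As a rough illustration only (gensim's real implementation adds details such as randomly shrunk windows and negative sampling), the sketch below lists the training examples each objective would draw from one playlist. The function names, the toy playlist and the window size of 2 are assumptions for the example; gensim's default window is 5.

```python
def cbow_examples(playlist, window=2):
    """CBOW (sg=0): predict the centre song from its surrounding songs."""
    examples = []
    for i, target in enumerate(playlist):
        context = playlist[max(0, i - window):i] + playlist[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples

def skipgram_examples(playlist, window=2):
    """Skip-gram (sg=1): predict each surrounding song from the centre song."""
    examples = []
    for i, centre in enumerate(playlist):
        context = playlist[max(0, i - window):i] + playlist[i + 1:i + 1 + window]
        examples.extend((centre, other) for other in context)
    return examples

playlist = ["track_a", "track_b", "track_c", "track_d"]

# One CBOW example per song: (surrounding songs) -> centre song
for context, target in cbow_examples(playlist):
    print(context, "->", target)

# Skip-gram yields one (centre, neighbour) pair per neighbour instead
print(len(skipgram_examples(playlist)), "skip-gram pairs")
```

CBOW produces one example per song and averages the context, which tends to make it the faster of the two; skip-gram produces one pair per neighbour.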

### Model Attributes
#### wv
> This object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways. See the module level docstring for examples.

#### vocabulary
> This object represents the vocabulary (sometimes called Dictionary in gensim) of the model. Besides keeping track of all unique words, this object provides extra functionality, such as constructing a huffman tree (frequent words are closer to the root), or discarding extremely rare words.

## Test the model
Get 10 'words' from the model's vocabulary.

In [55]:
# gensim 3.x API; in gensim 4.x, iterate over model.wv.index_to_key instead
for i, word in enumerate(model.wv.vocab):
    if i == 10:
        break
    print(word)

spotify:track:2d7LPtieXdIYzf7yHPooWd
spotify:track:0y4TKcc7p2H6P0GJlt01EI
spotify:track:6q4c1vPRZREh7nw3wG7Ixz
spotify:track:54KFQB6N4pn926IUUYZGzK
spotify:track:0NeJjNlprGfZpeX2LQuN6c
spotify:track:2kuFVY6hWX6yavTiWHE3SQ
spotify:track:66mmvchQ4C3LnPzq4DiAI3
spotify:track:4gFxywaJejXWxo0NjlWzgg
spotify:track:6wQSrFnJYm3evLsavFeCVT
spotify:track:3ZjnFYlal0fXN6t61wdxhl


In [58]:
# Show the vector value for one word (`word` is the value left over from the loop above)
model.wv.get_vector(word)

array([-0.7621344 , -0.03114427, -0.8835936 ,  0.36794358, -1.1382755 ,
        0.8673884 ,  1.5209368 , -0.08404895, -0.03294052, -0.09659185,
       -0.41151473,  0.6711556 , -0.1226874 , -0.39080906, -0.1941422 ,
        0.37506995, -0.57488316,  1.6192638 ,  1.0594045 , -0.80221117,
        1.1853622 , -0.64177287,  0.36271974,  0.2080922 , -0.03302405,
       -0.08925119,  0.28702548,  1.094027  , -0.75001746,  0.32850793,
        1.3414407 , -0.2149944 , -0.03232091,  0.97403723,  0.1955177 ,
        0.73916644,  0.06372388, -0.5650051 , -0.6680082 ,  0.46441278,
       -0.19055939, -1.1615809 , -0.7388865 ,  0.34048718, -0.30984044,
        0.15592104,  0.3560942 ,  0.19793318,  0.26497325,  1.8629072 ,
        0.6173503 ,  0.41372672,  0.2560014 ,  1.0204144 , -0.29449797,
        1.8490099 , -0.8553438 , -0.42974585,  0.0733617 ,  0.03775372,
       -0.10994548,  0.76888317,  0.8226318 , -0.32209927, -0.4137955 ,
        0.6056174 , -0.20869644, -0.4498154 ,  0.40722492, -0.48

In [None]:
# you can save the whole model - not necessary - will be large and includes things we don't need
# model.save("playlists_1.model")
# model = Word2Vec.load("playlists_1.model")

In [60]:
# only the embedding vectors are needed - so avoid saving the whole model

kv_filepath = os.path.join(data_path, 'playlists_BOW1')
model.wv.save(kv_filepath)

# reload saved vectors
model_v = KeyedVectors.load(kv_filepath, mmap='r')


In [61]:
type(model_v)

gensim.models.keyedvectors.Word2VecKeyedVectors

In [66]:
# Get a song as a seed for a test playlist
db_track = display_time(session.query(Playlists.track_name, 
                                Playlists.track_uri,
                                Playlists.artist_uri,
                                Playlists.album_uri).filter(Playlists.track_name=="Free Fallin'").distinct().first)
print("Artist: {}".format(get_artists([db_track.artist_uri])[0]['name']))
print("Track:  {}".format(db_track.track_name))

Time to Execute: 0.02 seconds
Artist: Tom Petty
Track:  Free Fallin'


In [70]:
# Find similar songs 
playlist = np.array(model_v.similar_by_word(db_track.track_uri, topn=10, restrict_vocab=None))
playlist

array([['spotify:track:7gSQv1OHpkIoAdUiRLdmI6', '0.8619670867919922'],
       ['spotify:track:43btz2xjMKpcmjkuRsvxyg', '0.781667947769165'],
       ['spotify:track:7MRyJPksH3G2cXHN8UKYzP', '0.7804632186889648'],
       ['spotify:track:5xS9hkTGfxqXyxX6wWWTt4', '0.7684037685394287'],
       ['spotify:track:17S4XrLvF5jlGvGCJHgF51', '0.7546130418777466'],
       ['spotify:track:7MooGz4ZPE4bNxjFegR6Jx', '0.7539862990379333'],
       ['spotify:track:2HsjJJL4DhPCzMlnaGv7ap', '0.696096658706665'],
       ['spotify:track:67eX1ovaHyVPUinMHeUtIM', '0.693360447883606'],
       ['spotify:track:6N1EjQjnvhOjFrF6oUmGPa', '0.6904664039611816'],
       ['spotify:track:1fDsrQ23eTAVFElUMaf38X', '0.6783772110939026']],
      dtype='<U36')
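
Note that wrapping the result in `np.array` turns the similarity scores into strings (the `dtype='<U36'` above), because a NumPy array holds a single type. If the scores are needed as numbers, one option is to keep the URIs and scores apart. The sketch below does this in plain Python, using the first three pairs from the output above as stand-in data; the variable names and the threshold are illustrative assumptions.

```python
# Stand-in for the list of (uri, score) tuples a similarity query returns;
# values copied from the first three rows of the output above
similar = [
    ("spotify:track:7gSQv1OHpkIoAdUiRLdmI6", 0.8619670867919922),
    ("spotify:track:43btz2xjMKpcmjkuRsvxyg", 0.781667947769165),
    ("spotify:track:7MRyJPksH3G2cXHN8UKYzP", 0.7804632186889648),
]

# Split into parallel sequences so the scores stay floats
uris, scores = zip(*similar)

# Keep only matches at or above a hypothetical similarity threshold
threshold = 0.781
strong_matches = [uri for uri, score in similar if score >= threshold]

print(uris[0], round(scores[0], 3))
print(len(strong_matches))
```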

In [74]:
# Get the similar songs from Spotify to show their details, including preview link (if available)
sp_playlist = get_tracks(playlist[:,0])

for t in sp_playlist:
    print("Artist       : ", t['artists'][0]['name'])
    print("Track        : ", t['name'])
    print("Track Preview: ", t['preview_url'] )
    print()

Artist       :  Tom Petty
Track        :  I Won't Back Down
Track Preview:  None

Artist       :  John Mellencamp
Track        :  Jack & Diane
Track Preview:  None

Artist       :  Tom Petty and the Heartbreakers
Track        :  American Girl
Track Preview:  https://p.scdn.co/mp3-preview/36d69a9c5b7a78b378f349e319ca49075993717b?cid=72413f75d4db4ec79c6caaf02523959e

Artist       :  Tom Petty and the Heartbreakers
Track        :  Mary Jane's Last Dance
Track Preview:  None

Artist       :  Tom Petty and the Heartbreakers
Track        :  Learning To Fly
Track Preview:  None

Artist       :  Tom Petty
Track        :  You Don't Know How It Feels
Track Preview:  https://p.scdn.co/mp3-preview/920e1367344e020100e499744890d45f0f8729ce?cid=72413f75d4db4ec79c6caaf02523959e

Artist       :  John Mellencamp
Track        :  Small Town
Track Preview:  None

Artist       :  John Mellencamp
Track        :  Hurts So Good
Track Preview:  None

Artist       :  Bryan Adams
Track        :  Summer Of '69
Tra