# Deezer playlist dataset and song recommendation with word2vec

In this mini project we will build a word2vec network and use it to create a playlist completion tool (song suggestion). The data is hosted in the following repository: http://github.com/comeetie/deezerplay.git. To learn more about word2vec and this dataset, you can read the following two references:

- Efficient estimation of word representations in vector space, Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (https://arxiv.org/abs/1301.3781)
- Word2vec applied to Recommendation: Hyperparameters Matter, H. Caselles-Dupré, F. Lesaint and J. Royo-Letelier. (https://arxiv.org/pdf/1804.04212.pdf)

The parts you have to complete are highlighted in red.

## Preparation of data

The data is a list of playlists. Each playlist is a list in which the Deezer ID of each song is followed by the ID of its artist.

In [1]:
import numpy as np
data = np.load("./music_2.npy",allow_pickle=True)
[len(data), np.mean([len(p) for p in data])]

The dataset we are going to work on contains 100,000 playlists with an average of 24.1 songs each. We will start by keeping only the song identifiers.

In [2]:
playlist_track = [list(filter(lambda w: w.split("_")[0]==u"track",playlist)) for playlist in data]
playlist_artist = [list(filter(lambda w: w.split("_")[0]==u"artist",playlist)) for playlist in data]
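
As a small illustration of this filtering step, here is a sketch on a toy playlist. It assumes that every entry is a string of the form `track_<id>` or `artist_<id>`, which is what the `split("_")` test above implies; the IDs and the `keep_prefix` helper are invented for the example.

```python
# Toy playlist; the IDs are invented for illustration only.
toy_playlist = ["track_1", "artist_10", "track_2", "artist_11"]

def keep_prefix(playlist, prefix):
    """Keep the entries whose text before the first underscore equals prefix."""
    return [w for w in playlist if w.split("_")[0] == prefix]

toy_tracks = keep_prefix(toy_playlist, "track")
toy_artists = keep_prefix(toy_playlist, "artist")
print(toy_tracks)
print(toy_artists)
```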

In [50]:
# number of distinct songs
tracks = np.unique(np.concatenate(playlist_track))
Vt = len(tracks)
Vt

338509

The number of distinct songs in this dataset is quite high: more than 300,000.

## Creating a song dictionary
We will assign to each song an integer that will serve as a unique identifier and input for our network. In order to save a little bit of resources we will only work in this project on songs that appear in at least two playlists.

In [4]:
# how many occurrences for each track?
track_counts = dict((tracks[i],0) for i in range(0, Vt))
for p in playlist_track:
    for a in p:
        track_counts[a]=track_counts[a]+1

In [51]:
# Filter very rare songs to save resources
playlist_track_filter = [list(filter(lambda a : track_counts[a]> 1, playlist)) for playlist in playlist_track]
# get the counts
counts  =  np.array(list(track_counts.values()))
# sort by decreasing count
order = np.argsort(-counts)
# deezer_id array
tracks_list_ordered = np.array(list(track_counts.keys()))[order]
# Vocabulary size = number of kept songs
Vt=np.where(counts[order]==1)[0][0]
# dict construction: deezer track id -> integer id in [0,Vt)
track_dict = dict((tracks_list_ordered[i],i) for i in range(0, Vt))
# playlist conversion to list of integers
corpus_num_track = [[track_dict[track] for track in play ] for play in playlist_track_filter]
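
The same dictionary construction can be sketched on a toy corpus with `collections.Counter`. This is only an illustration under simplified assumptions (invented track names, three tiny playlists), not a replacement for the cell above: tracks are counted, the ones seen at least twice are kept and numbered by decreasing frequency, and the playlists are converted to integers.

```python
from collections import Counter

# Toy corpus; track names are invented for illustration only.
toy_playlists = [["a", "b", "c"], ["a", "b"], ["a", "d"]]

# Count occurrences of every track over all playlists.
toy_counts = Counter(t for p in toy_playlists for t in p)

# Keep tracks seen at least twice, most frequent first.
kept = [t for t, c in toy_counts.most_common() if c > 1]

# Map each kept track to an integer in [0, V).
toy_dict = {t: i for i, t in enumerate(kept)}

# Convert playlists to integer lists, dropping the rare tracks.
toy_corpus = [[toy_dict[t] for t in p if t in toy_dict] for p in toy_playlists]
print(toy_dict, toy_corpus)
```

Numbering songs by decreasing frequency means that small integers correspond to popular songs, which is convenient when inspecting the results later.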

### Creation of training, validation and test sets

To learn the parameters of our method we will keep the first l-1 songs of each playlist (with l the length of the playlist) for training. To evaluate the completion performance of our method we keep the last two songs of each playlist. The objective will be to find the last one from the next-to-last one.



In [53]:

# playlist main part used for training
play_app  = [corpus_num_track[i][:(len(corpus_num_track[i])-1)] 
             for i in range(len(corpus_num_track)) if len(corpus_num_track[i])>1]
# the last two elements are used for validation and test
# test indices are drawn without replacement so that no playlist is repeated
index_tst = np.random.choice(100000,20000,replace=False)
index_val = np.setdiff1d(range(100000),index_tst)

play_tst  = np.array([corpus_num_track[i][(len(corpus_num_track[i])-2):len(corpus_num_track[i])] 
             for i in index_tst if len(corpus_num_track[i])>3])
play_val  = np.array([corpus_num_track[i][(len(corpus_num_track[i])-2):len(corpus_num_track[i])] 
             for i in index_val if len(corpus_num_track[i])>3])[:10000]
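
To make the split concrete, here is a minimal sketch for a single playlist. The helper name `split_playlist` and the integer ids are made up for the example, and the sketch assumes the playlist has at least two songs.

```python
def split_playlist(playlist):
    """Return (training part, (seed, target)) for one playlist of integer ids.

    The training part is every song except the last one; the evaluation pair
    is the next-to-last song (seed) and the last song (target).
    Assumes len(playlist) >= 2.
    """
    train_part = playlist[:-1]
    seed, target = playlist[-2], playlist[-1]
    return train_part, (seed, target)

train_part, pair = split_playlist([4, 8, 15, 16, 23])
print(train_part, pair)
```

Note that the seed belongs to the training part while the target never does, so the model has to recover a song it has not seen in that playlist.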


In [7]:
# import Keras
from keras.models import Sequential, Model
from keras.layers import Embedding, Reshape, Activation, Input, Dense,Flatten
from keras.layers import Dot
from keras.preprocessing.sequence import skipgrams

### word2vec hyperparameters

The word2vec method involves a number of hyperparameters. We will define them and give them initial values that we will refine later:


In [54]:
# latent space dimension
vector_dim = 30
# window size
window_width = 3
# number of negative sample per positive sample
neg_sample = 5
# mini-batch size
min_batch_size = 50
# smoothing factor for the sampling table of negative pairs 
samp_coef = 0.5
# parameter to sub-sample frequent songs
sub_samp = 0.00001

### Creation of the sampling probability tables (smoothed and unsmoothed)

To draw the negative examples we need the smoothed frequencies of each song in our dataset. Likewise, to sub-sample very frequent songs we need the raw frequencies. We will compute these two vectors.

In [55]:
# get the counts
counts = np.array(list(track_counts.values()),dtype='float')[order[:Vt]]
# normalization
st =  counts/np.sum(counts)
# smoothing
st_smooth = np.power(st,samp_coef)
st_smooth = st_smooth/np.sum(st_smooth)
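
The effect of the smoothing exponent is easier to see on a toy example. The sketch below uses three invented counts and plain Python lists; it only mirrors the normalize, power, renormalize steps of the cell above.

```python
# Toy occurrence counts for three songs (invented values).
toy_counts = [8.0, 1.0, 1.0]
samp_coef = 0.5  # same role as the hyperparameter defined above

total = sum(toy_counts)
freq = [c / total for c in toy_counts]             # raw frequencies
powered = [f ** samp_coef for f in freq]
freq_smooth = [p / sum(powered) for p in powered]  # smoothed frequencies

print([round(f, 3) for f in freq])
print([round(f, 3) for f in freq_smooth])
```

With an exponent of 0.5 the dominant song's share drops from 0.8 to roughly 0.59, so rare songs are drawn as negatives more often than their raw frequency would suggest. An exponent of 1 leaves the frequencies unchanged and an exponent of 0 gives a uniform table.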

### Building the word2vec network

A word2vec network takes as input two integers corresponding to two songs. These are embedded in a latent space of dimension (vector_dim) by an embedding layer (you will have to use the same layer to project both songs). Once these two vectors have been extracted, the network must compute their normalized scalar product, called the cosine similarity:

$$\cos(\theta_{ij})=\frac{z_i \cdot z_j}{\|z_i\|\,\|z_j\|}$$

To do this you will use a "dot" layer (for "dot product"). The model then uses a sigmoid layer to produce the output. This output should be 0 when both songs are randomly drawn from the whole dataset and 1 when they were extracted from the same playlist. <span style="color:red">You have to create the Keras Track2Vec model corresponding to this architecture.</span>


In [10]:
# inputs
input_target = Input((1,), dtype='int32')
input_context = Input((1,), dtype='int32')

# TODO

#output = Dense(1, activation='sigmoid',name="classif")(dot_product)

#Track2Vec = Model(inputs=[input_target, input_context], outputs=output)
#Track2Vec.compile(loss='binary_crossentropy', optimizer='adam',metrics=["accuracy"])
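
Before writing the Keras layers, it can help to see the computation the network performs on one pair of songs. The following plain-Python sketch is not the Keras model: the embedding table, the weight `w` and the bias `b` are invented values that stand in for the shared embedding layer and the final one-unit sigmoid layer. In Keras, the `Dot` layer accepts a `normalize=True` argument which L2-normalizes its inputs, so that its output is this cosine.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def forward(embedding, i, j, w=1.0, b=0.0):
    """Score a (target, context) pair: sigmoid(w * cos(z_i, z_j) + b).

    embedding is a list of rows, one vector per song; w and b stand in for
    the weight and bias of the final sigmoid layer (hypothetical values).
    """
    c = cosine(embedding[i], embedding[j])
    return 1.0 / (1.0 + math.exp(-(w * c + b)))

# Toy embedding table with three songs in two dimensions (invented values).
toy_embedding = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
print(cosine(toy_embedding[0], toy_embedding[1]))  # parallel vectors
print(cosine(toy_embedding[0], toy_embedding[2]))  # orthogonal vectors
print(forward(toy_embedding, 0, 2))
```

Because the same table is used for both inputs, each song has a single position in the latent space, which is what the nearest-neighbor search relies on later.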

In [14]:
Track2Vec.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 1, 30)        3697230     input_3[0][0]                    
                                                                 input_4[0][0]                    
__________________________________________________________________________________________________
dot (Dot)                       (None, 1, 1)         0           embedding[0][0]       

### Creation of the data generator

To learn the projection layer at the heart of our model we will build a generator of positive and negative example pairs (close songs or random songs) from our training data. The following function generates such examples from a playlist (seq) provided as input. It first builds all the pairs of songs of the playlist that are within (window) positions of each other. These are the positive pairs. Pairs involving very frequent songs are removed with a probability that depends on the frequency of the song. Finally, a number of negative examples (neg_samples times the number of positive examples) is drawn at random using the neg_sampling_table.
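
The extraction of positive pairs can be sketched in a few lines of plain Python. This is a simplified stand-in for the Keras `skipgrams` helper used below, not its implementation: in particular, the Keras helper also shuffles the pairs and treats index 0 as a non-word and skips it, which matters here because index 0 is assigned to the most frequent song. The function name `window_pairs` is invented for the example.

```python
def window_pairs(seq, window):
    """All ordered pairs (seq[i], seq[j]) with i != j and |i - j| <= window."""
    pairs = []
    for i, target in enumerate(seq):
        lo = max(0, i - window)
        hi = min(len(seq), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, seq[j]))
    return pairs

toy_pairs = window_pairs([5, 7, 9, 11], window=1)
print(toy_pairs)
```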

In [57]:
# function to generate word2vec positive and negative pairs 
# from an array of int that represents a text or, here, a playlist
# params 
# seq : input text or playlist (array of int)
# window : maximum distance between the two songs of a positive pair
# neg_samples : number of negative samples to generate per positive one
# neg_sampling_table : sampling table for negative samples
# sub_sampling_table : table of raw frequencies used to sub-sample common words / songs
# sub_t : sub sampling parameter
def word2vecSampling(seq,window,neg_samples,neg_sampling_table,sub_sampling_table,sub_t):
    # vocab size
    V = len(neg_sampling_table)
    # extract positive pairs 
    positives = skipgrams(sequence=seq, vocabulary_size=V, window_size=window,negative_samples=0)
    ppairs    = np.array(positives[0])
    # sub sampling
    if (ppairs.shape[0]>0):
        f = sub_sampling_table[ppairs[:,0]]
        subprob = ((f-sub_t)/f)-np.sqrt(sub_t/f)
        tokeep = (subprob<np.random.uniform(size=subprob.shape[0])) | (subprob<0)
        ppairs = ppairs[tokeep,:]
    nbneg     = ppairs.shape[0]*neg_samples
    # sample negative pairs
    if (nbneg > 0):
        negex     = np.random.choice(V, nbneg, p=neg_sampling_table)
        negexcontext = np.repeat(ppairs[:,0],neg_samples)
        npairs    = np.transpose(np.stack([negexcontext,negex]))
        pairs     = np.concatenate([ppairs,npairs],axis=0)
        labels    = np.concatenate([np.repeat(1,ppairs.shape[0]),np.repeat(0,nbneg)])
        perm      = np.random.permutation(len(labels))
        res = [pairs[perm,:],labels[perm]]
    else:
        res=[[],[]]
    return res
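
To get a feel for the sub-sampling step, the sketch below evaluates the same expression as `subprob` in the function above for a few frequencies. The helper name `discard_probability` and the list of frequencies are invented for the example.

```python
import math

def discard_probability(f, t):
    """Probability of dropping a pair whose first song has frequency f.

    Mirrors the expression in word2vecSampling: (f - t) / f - sqrt(t / f).
    Negative values mean the pair is always kept.
    """
    return (f - t) / f - math.sqrt(t / f)

sub_t = 1e-5  # same value as sub_samp above
for f in (1e-6, 1e-5, 1e-4, 1e-3):
    p = discard_probability(f, sub_t)
    print(f, round(max(p, 0.0), 3))
```

Songs whose frequency is at or below `sub_t` are never dropped, while a song a hundred times more frequent than `sub_t` loses close to 90% of its pairs.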

<span style="color:red">Use this function to build a "track_ns_generator" data generator which will generate positive and negative examples from "nbm" playlists randomly drawn from the "corpus_num" dataset provided as input. </span>

In [56]:
import random
# def track_ns_generator(corpus_num,nbm):
    #while 1:
        # TODO
        # draw nbm playlists from corpus_num
        # build the x and y data
        #yield (x,y)
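
The general shape of such a generator can be sketched as follows. This is one possible structure, not the expected solution: `pair_batch_generator` and the `sample_pairs` callable are hypothetical names, and the stand-in sampler below only produces adjacent positive pairs. In the notebook the sampler would be a call to word2vecSampling with the tables built earlier, and the lists would typically be converted to NumPy arrays before being passed to Keras.

```python
import random

def pair_batch_generator(corpus, nbm, sample_pairs):
    """Endlessly yield ([targets, contexts], labels) built from nbm random playlists.

    sample_pairs is a hypothetical callable: playlist -> (pairs, labels), where
    pairs is a list of (target, context) tuples and labels a list of 0/1.
    """
    while True:
        targets, contexts, labels = [], [], []
        for seq in random.sample(corpus, nbm):
            pairs, labs = sample_pairs(seq)
            for (t, c), y in zip(pairs, labs):
                targets.append(t)
                contexts.append(c)
                labels.append(y)
        if labels:  # skip draws that produced no pair at all
            yield [targets, contexts], labels

# Stand-in sampler: adjacent songs as positives, no negatives (illustration only).
def adjacent_positives(seq):
    pairs = list(zip(seq[:-1], seq[1:]))
    return pairs, [1] * len(pairs)

gen = pair_batch_generator([[1, 2, 3], [4, 5], [6, 7, 8, 9]], 2, adjacent_positives)
x, y = next(gen)
print(len(x[0]), len(x[1]), len(y))
```

The `yield` sits inside the infinite loop so that Keras can keep requesting batches for as many steps and epochs as needed.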

## Learning 
You should now be able to train your first model with the following code. This should take between 15 and 30 min.

In [18]:
hist=Track2Vec.fit(track_ns_generator(play_app,min_batch_size),steps_per_epoch = 200,epochs=60)

Epoch 1/60
...
Epoch 60/60


## Save latent space
Once the learning is done, we can save the position of the songs in the latent space with the following code:

In [19]:
# retrieve the positions of the songs in the latent space
vectors_tracks = Track2Vec.get_weights()[0]
with open('latent_positions.npy', 'wb') as f:
    np.save(f, vectors_tracks)

And later load it with:

In [20]:
vectors_tracks=np.load("latent_positions.npy")

## Use in completion and evaluation
We can now use this space to make suggestions. <span style="color:red">Build a predict_batch function that takes as input a vector of song indices (seeds), the number k of suggestions to make per seed, the vectors X of the songs in the latent space and a kd-tree built on X. This function will return the indices of the k nearest neighbors of each seed.</span> To keep these predictions fast, you will use a kd-tree (available in scikit-learn) to speed up the nearest-neighbor search.

In [24]:
from sklearn.neighbors import KDTree
kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')

In [25]:
def predict_batch(seeds,k,X,kdt):
    # TODO
    return 0
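
What the nearest-neighbor query should return can be checked on a toy example with a brute-force search. The sketch below is deliberately naive (it scans every row) and uses invented positions; it is meant as a reference to compare against, not as the kd-tree solution. With scikit-learn, `kdt.query(points, k=...)` returns the distances and the indices of the neighbors; since each seed is its own nearest neighbor at distance 0, one option is to ask for one extra neighbor and drop the seed.

```python
import math

def nearest_neighbors(seed, k, X):
    """Indices of the k rows of X closest to X[seed], excluding seed itself."""
    dists = [(math.dist(X[seed], row), idx) for idx, row in enumerate(X) if idx != seed]
    dists.sort()
    return [idx for _, idx in dists[:k]]

# Toy latent positions (invented values).
toy_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
print(nearest_neighbors(0, 2, toy_X))
```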

<span style="color:red">Use this function to propose songs to complete the playlists of the validation dataset (the seeds correspond to the first column of play_val).</span>

In [28]:
indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)

<span style="color:red">Compare these suggestions with the second column of play_val (the songs actually present). To do this you will compute the hit@10, which is 1 if the song actually present in the playlist is one of the 10 suggestions (this score is averaged over the validation set), and the NDCG@10 (Normalized Discounted Cumulative Gain), which takes into account the order of the suggestions. This second score is worth $1/\log_2(k+1)$ if proposal k (k between 1 and 10) is the correct one and 0 if no proposal is correct. As before, you will compute the average score over the validation set. </span>
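
The two scores can be sketched in plain Python as follows. The function name `hit_and_ndcg` and the toy suggestions are invented for the example; the sketch assumes each row of suggestions is already sorted from the best to the worst proposal.

```python
import math

def hit_and_ndcg(suggestions, targets):
    """Average hit@k and NDCG@k, with k the number of suggestions per row.

    suggestions[i] is the ranked list proposed for example i and targets[i]
    the song actually present in the playlist.
    """
    hits, gains = 0.0, 0.0
    for ranked, target in zip(suggestions, targets):
        if target in ranked:
            rank = ranked.index(target) + 1  # ranks start at 1
            hits += 1.0
            gains += 1.0 / math.log2(rank + 1)
    n = len(targets)
    return hits / n, gains / n

# Toy check: target ranked 1st, ranked 3rd, and missing (invented values).
toy_suggestions = [[7, 2, 9], [4, 5, 6], [1, 2, 3]]
toy_targets = [7, 6, 8]
hit, ndcg = hit_and_ndcg(toy_suggestions, toy_targets)
print(hit, ndcg)
```

In this toy case two targets out of three are found, and the per-example gains are 1, 0.5 and 0. The NDCG can never exceed the hit rate, since each hit contributes at most 1.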


In [34]:
NDGCatK

0.0757820569218288

In [35]:
HitatK

0.134

## Hyperparameter tuning

<span style="color:red">You can now try to vary the hyperparameters to improve your performance. Pay attention to the computing time: prepare a grid with about ten different configurations and evaluate each of them on your validation set.
Evaluate the final performance of the best configuration on the test set. Don't forget to save your results.</span>
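
One way to prepare such a grid is sketched below. The candidate values are illustrative choices, not tuned recommendations, and drawing a random subset of the full grid is only one option among others for staying within about ten configurations.

```python
import itertools
import random

# Candidate values per hyperparameter (illustrative choices, not tuned).
grid = {
    "vector_dim": [30, 50, 100],
    "window_width": [3, 5],
    "neg_sample": [5, 10],
    "samp_coef": [0.0, 0.5, 0.75],
}

# Full Cartesian product, then a reproducible random subset of ten configurations.
all_configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
rng = random.Random(0)
configs = rng.sample(all_configs, 10)

for cfg in configs:
    print(cfg)
```

Each configuration would then be trained and scored on the validation set only, keeping the test set for the single best configuration.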



## Bonus, a little music

The TracksArtists file contains metadata on the tracks and the artists for a subset of the 300,000 tracks in the dataset. We can use it to look up the index of a song from its title:

In [36]:
import pandas as pd
tr_meta=pd.read_csv("./TracksArtists.csv")
joindf = pd.DataFrame({"track_id":tracks_list_ordered[:Vt],"index":range(Vt)})
meta = tr_meta.merge(joindf, left_on="track_id",right_on="track_id")
meta.set_index("index",inplace=True)
meta[["title","name","preview","track_id"]]

| index | title | name | preview | track_id |
|---|---|---|---|---|
| 14086 | Alone | Petit Biscuit | http://cdn-preview-8.deezer.com/stream/c-89176... | track_100001884 |
| 1519 | Memories | Petit Biscuit | http://cdn-preview-8.deezer.com/stream/c-883c9... | track_102400504 |
| 1127 | Sunset Lover | Petit Biscuit | | track_102400506 |
| 22812 | Night Trouble | Petit Biscuit | http://cdn-preview-b.deezer.com/stream/c-b1808... | track_102400604 |
| 12644 | Palms | Petit Biscuit | http://cdn-preview-3.deezer.com/stream/c-3e57c... | track_102420192 |
| ... | ... | ... | ... | ... |
| 22784 | Donde Estés Llegaré | Alexis y Fido | http://cdn-preview-5.deezer.com/stream/c-542bf... | track_9975788 |
| 13071 | Camuflaje | Alexis y Fido | http://cdn-preview-b.deezer.com/stream/c-b249e... | track_9975792 |
| 22782 | Mala Conducta | Alexis y Fido | http://cdn-preview-a.deezer.com/stream/c-af834... | track_9975794 |
| 31160 | The Name of the Wave | Strange Cargo | http://cdn-preview-d.deezer.com/stream/c-d20b3... | track_99961270 |


In [37]:
def find_track(title):
    return meta.loc[meta["title"]==title,:].index[0]

tr=find_track("Hexagone")
tr

19492

## Radio

The Deezer API allows you to retrieve information about the tracks of the dataset from their Deezer ID. When it is available, this information includes a URL to listen to a free sample.

In [38]:
import urllib.request, json 
def gettrackinfo(number):
    track_url =  "https://api.deezer.com/track/{}".format(tracks_list_ordered[number].split("_")[1])
    with urllib.request.urlopen(track_url) as url:
        data = json.loads(url.read().decode())
    return data
track_apidata = gettrackinfo(find_track("Hexagone"))
track_apidata

{'id': 128093263,
 'readable': True,
 'title': 'Hexagone',
 'title_short': 'Hexagone',
 'title_version': '',
 'isrc': 'FRZ027500460',
 'link': 'https://www.deezer.com/track/128093263',
 'share': 'https://www.deezer.com/track/128093263?utm_source=deezer&utm_content=track-128093263&utm_term=0_1606918492&utm_medium=web',
 'duration': 330,
 'track_position': 4,
 'disk_number': 1,
 'rank': 669927,
 'release_date': '2016-07-08',
 'explicit_lyrics': False,
 'explicit_content_lyrics': 0,
 'explicit_content_cover': 0,
 'preview': 'https://cdns-preview-9.dzcdn.net/stream/c-93c768b47b54c1d295f92f59990f732a-6.mp3',
 'bpm': 125.66,
 'gain': -12.5,
 'available_countries': ['AE', 'AF', 'AG', 'AI', 'AL', 'AM', 'AO', 'AQ', 'AR', 'AS', 'AT', 'AU', 'AZ', 'BA',
  'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BN', 'BO', 'BQ', 'BR', 'BT', 'BV', 'BW', 'BY',
  'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL', 'CM', 'CO', 'CR', 'CU',
  ...

So we can use it to listen to a preview:

In [39]:
from IPython.display import display, Audio, clear_output
display(Audio(track_apidata["preview"],autoplay=True))

<span style="color:red">Create a radio function that takes as input a track number in the dataset and plays a series of nb_track tracks, each time drawing the next track at random in the neighborhood of the current one. The size of the neighborhood will be configurable and you will remove from the proposals the songs already listened to. You will handle exceptions when a track has no available preview. You can clear the display of the current song with the clear_output function.</span>

In [49]:
import time
def start_radio(seed,nb_candidates,duration,nbsteps=20):
    print(meta.loc[seed,"title"])
    display(Audio(meta.loc[seed,"preview"],autoplay=True))
    time.sleep(duration)
    clear_output()
    already_played = [seed]
    for i in range(nbsteps):
        try:
            # TODO
            pass
        except Exception:
            print("track not found")
        clear_output()
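
The selection step at the heart of the radio can be sketched independently of the audio playback. In the sketch below `next_track` and the `neighbors` callable are hypothetical names: in the notebook the neighbors would come from the kd-tree, whereas here a toy ring of six tracks is used so that the example runs on its own.

```python
import random

def next_track(current, neighbors, already_played, rng=random):
    """Pick the next track at random among unplayed neighbors of current.

    neighbors is a hypothetical callable: track index -> list of nearby indices.
    Returns None when every neighbor has already been played.
    """
    candidates = [n for n in neighbors(current) if n not in already_played]
    if not candidates:
        return None
    return rng.choice(candidates)

# Toy neighborhood on a ring of 6 tracks (illustration only).
def ring_neighbors(i):
    return [(i - 1) % 6, (i + 1) % 6]

played = [0]
current = 0
for _ in range(3):
    current = next_track(current, ring_neighbors, played)
    played.append(current)
print(played)
```

Returning None when the neighborhood is exhausted lets the caller decide what to do, for example widening the neighborhood or stopping the radio.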

In [47]:
start_radio(find_track("Hexagone"),5,5,10)