# Deezer playlist dataset and song recommendation with word2vec

In this mini project we will develop a word2vec network and use it to build a playlist completion tool (song suggestion). The data is hosted on the following repository: http://github.com/comeetie/deezerplay.git. To learn more about word2vec and this dataset, you can read the following two references:

- Efficient estimation of word representations in vector space, Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (https://arxiv.org/abs/1301.3781)
- Word2vec applied to Recommendation: Hyperparameters Matter, H. Caselles-Dupré, F. Lesaint and J. Royo-Letelier. (https://arxiv.org/pdf/1804.04212.pdf)

The tasks you have to complete are highlighted in red.

## Preparation of data

The data is a list of playlists. Each playlist is a list containing, for each song, the Deezer ID of the song followed by the ID of its artist.

In [4]:
import numpy as np
data = np.load("data/music_2.npy", allow_pickle=True)
[len(data), np.mean([len(p) for p in data])]

[100000, 24.21338]

The dataset we are going to work on contains 100000 playlists, each holding about 24.2 items on average (song and artist identifiers combined). We will start by keeping only the song identifiers.

In [5]:
playlist_track = [list(filter(lambda w: w.split("_")[0] == u"track", playlist)) for playlist in data]
playlist_artist = [list(filter(lambda w: w.split("_")[0] == u"artist", playlist)) for playlist in data]

In [6]:
# unique songs over all playlists
tracks = np.unique(np.concatenate(playlist_track))
Vt = len(tracks)
Vt

338509

The number of different songs in this data-set is quite high with more than 300,000 songs.

## Creating a song dictionary
We will assign to each song an integer that will serve as a unique identifier and input for our network. In order to save a little bit of resources we will only work in this project on songs that appear in at least two playlists.

In [7]:
# counting occurrences of each track
track_counts = dict((tracks[i], 0) for i in range(0, Vt))
for p in playlist_track:
    for track in p:
        track_counts[track] = track_counts[track] + 1

In [8]:
# Filter very rare songs to save resources
playlist_track_filter = [list(filter(lambda track : track_counts[track] > 1, playlist)) for playlist in playlist_track]
# get the counts
counts  =  np.array(list(track_counts.values()))
# sort by decreasing count
order = np.argsort(-counts)
# array of Deezer track ids, most frequent first
tracks_list_ordered = np.array(list(track_counts.keys()))[order]
# vocabulary size = number of kept songs (position of the first song seen only once)
Vt = np.where(counts[order] == 1)[0][0]
# dictionary mapping each kept Deezer track id to an integer in [0, Vt)
track_dict = dict((tracks_list_ordered[i], i) for i in range(0, Vt))
# convert the playlists to lists of integers
corpus_num_track = [[track_dict[track] for track in play ] for play in playlist_track_filter]
print(Vt)


123241
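The count, filter and index steps above can be sketched on a toy corpus with `collections.Counter`. The playlists and track names below are invented for illustration, and this is an alternative formulation under those assumptions rather than the code used in this notebook.

```python
from collections import Counter

# Toy playlists (hypothetical identifiers, not taken from the dataset).
toy_playlists = [
    ["track_a", "track_b", "track_c"],
    ["track_a", "track_b"],
    ["track_a", "track_d"],
]

# Count how many times each track occurs over all playlists.
toy_counts = Counter(t for p in toy_playlists for t in p)

# Keep the tracks seen more than once, most frequent first.
kept = [t for t, c in toy_counts.most_common() if c > 1]

# Map each kept track to an integer in [0, vocabulary size).
toy_dict = {t: i for i, t in enumerate(kept)}

# Convert the playlists to integer lists, dropping the rare tracks.
toy_corpus = [[toy_dict[t] for t in p if t in toy_dict] for p in toy_playlists]

print(toy_dict)
print(toy_corpus)
```

Because the most frequent songs receive the smallest integers, small indices correspond to popular tracks.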


### Creation of the training, validation and test sets

To learn the parameters of our method we keep the first l-1 songs of each playlist (where l is the length of the playlist). To evaluate the completion performance of our method we keep the last two songs of each playlist. The objective will be to find the last song given the next-to-last one.



In [9]:
# main part of each playlist, used for training
play_app  = [corpus_num_track[i][:(len(corpus_num_track[i])-1)] for i in range(len(corpus_num_track)) if len(corpus_num_track[i]) > 1]

# the last two elements of each playlist are used for validation and test
# replace=False draws 20000 distinct playlists (the default samples with replacement)
index_tst = np.random.choice(100000, 20000, replace=False)
index_val = np.setdiff1d(range(100000),index_tst)

play_tst  = np.array([corpus_num_track[i][(len(corpus_num_track[i])-2):len(corpus_num_track[i])] 
             for i in index_tst if len(corpus_num_track[i])>3])
play_val  = np.array([corpus_num_track[i][(len(corpus_num_track[i])-2):len(corpus_num_track[i])] 
             for i in index_val if len(corpus_num_track[i])>3])[:10000]

print(play_val[:,1])
print(play_tst[:,0])


[ 2016 33361 79201 ... 97379  3910  6500]
[50008   175  3920 ... 10659 39650 37384]
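The split described above can be illustrated on a few toy playlists. The integer values are invented; the slicing mirrors the rules used in this section (all songs but the last for training, the last two songs as an evaluation pair for playlists with more than three songs).

```python
# Toy integer playlists (hypothetical values).
toy_corpus = [[5, 8, 2, 9, 4], [7, 1], [3, 6, 0, 2]]

# Training part: everything except the last song of each playlist.
toy_train = [p[:-1] for p in toy_corpus if len(p) > 1]

# Evaluation pairs: (next-to-last, last) for playlists with more than 3 songs.
toy_pairs = [p[-2:] for p in toy_corpus if len(p) > 3]

print(toy_train)
print(toy_pairs)
```

The seed of each pair (the next-to-last song) is part of the training data, while the song to predict is never seen in that playlist during training.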


In [10]:
# import Keras
from keras.models import Sequential, Model
# Dot is exported directly by keras.layers
from keras.layers import Embedding, Reshape, Activation, Input, Dense, Flatten, Dot
from keras.preprocessing.sequence import skipgrams, make_sampling_table
import tensorflow as tf

### Hyper-parameters of word2vec

The word2vec method needs several hyper-parameters. We give them initial values here and will refine them later:


In [11]:
# latent space dimension
# the size of each of our word embedding vectors
vector_dim = 30
# window size
# the window of words around the target word that will be used to draw the context words from
window_width = 3
# number of negative samples per positive sample
neg_sample = 5
# mini-batch size (number of playlists drawn per batch)
min_batch_size = 50
# smoothing factor for the sampling table of negative pairs
samp_coef = 0.5
# parameter used to sub-sample frequent songs
sub_samp = 0.00001

### Creation of the sampling probability tables (smoothed and unsmoothed)

To draw the negative examples we need the smoothed frequencies of each song in our dataset. Likewise, to sub-sample very frequent songs we need the raw frequencies. We will compute these two vectors.

In [12]:
# get the counts
counts = np.array(list(track_counts.values()),dtype='float')[order[:Vt]]
# normalization
st =  counts/np.sum(counts)
# smoothing
st_smooth = np.power(st,samp_coef)
st_smooth = st_smooth/np.sum(st_smooth)
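To see what the smoothing does, here is a small worked example with invented counts for four songs and the same exponent as `samp_coef`. It only illustrates the computation and does not use the real frequency table.

```python
import numpy as np

# Hypothetical raw counts for four songs.
toy_counts = np.array([81.0, 9.0, 9.0, 1.0])

# Raw frequencies.
toy_st = toy_counts / toy_counts.sum()

# Smoothed frequencies with exponent 0.5 (the value of samp_coef above).
toy_smooth = np.power(toy_st, 0.5)
toy_smooth = toy_smooth / toy_smooth.sum()

print(toy_st)      # raw: the most popular song holds 81% of the mass
print(toy_smooth)  # smoothed: square-root weights 9, 3, 3, 1 normalised by 16
```

The most popular song drops from 81% to about 56% of the probability mass, so rare songs are drawn more often as negative examples than their raw frequency would suggest.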

### Building the word2vec network

A word2vec network takes as input two integers corresponding to two songs. These are embedded in a latent space of dimension (vector_dim) by an Embedding layer (you will have to use the same layer to project both songs). Once these two vectors have been extracted, the network must compute their normalized scalar product, called the cosine similarity:

$$\cos(\theta_{ij})=\frac{z_i \cdot z_j}{\lVert z_i\rVert\,\lVert z_j\rVert}$$

To do this you will use a "Dot" layer (for "dot product"). The model then uses a sigmoid layer to produce the output. This output should be 0 when the two songs were drawn at random from the whole dataset and 1 when they were extracted from the same playlist. <span style="color:red">You have to create the Keras Track2Vec model corresponding to this architecture.</span>
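The cosine formula and the sigmoid output can be checked numerically with plain NumPy. The vectors, the weight `w` and the bias `b` below are invented for illustration; in the model they would be learned by the embedding layer and the final dense unit.

```python
import numpy as np

def cosine(z_i, z_j):
    # Normalized scalar product of two latent vectors.
    return float(np.dot(z_i, z_j) / (np.linalg.norm(z_i) * np.linalg.norm(z_j)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z_a = np.array([1.0, 2.0, 0.0])
z_b = np.array([2.0, 4.0, 0.0])   # same direction as z_a
z_c = np.array([-2.0, 1.0, 0.0])  # orthogonal to z_a

w, b = 4.0, 0.0  # hypothetical weight and bias of the final dense unit
print(cosine(z_a, z_b), sigmoid(w * cosine(z_a, z_b) + b))
print(cosine(z_a, z_c), sigmoid(w * cosine(z_a, z_c) + b))
```

Since the cosine lies in [-1, 1], the dense unit placed before the sigmoid is what lets the output approach 0 or 1.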


In [13]:
# inputs

input_target = Input((1,), dtype='int32')
input_context = Input((1,), dtype='int32')

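# a single embedding layer shared by the two inputs
embedding = Embedding(Vt, vector_dim, input_length=1, name='embedding')
target = embedding(input_target)
context = embedding(input_context)
# normalize=True L2-normalizes both vectors, so the dot product is the cosine similarity
dot_product = Dot(axes=2, normalize=True)([target, context])
flatten = Flatten()(dot_product)
# sigmoid unit producing the probability that the two songs share a playlist
output = Dense(1, activation='sigmoid',name="classif")(flatten)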

Track2Vec = Model(inputs=[input_target, input_context], outputs=output)
Track2Vec.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])

In [14]:
Track2Vec.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 1, 30)        3697230     input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
dot (Dot)                       (None, 1, 1)         0           embedding[0][0]              

### Creation of the data generator

To learn the projection layer at the heart of our model we will build a generator of positive and negative example pairs (close songs or random songs) from our training data. The following function generates such examples from a playlist (seq) provided as input. It first builds all the pairs of songs from the playlist that lie within (window) positions of each other. These are the positive pairs. Pairs involving very frequent songs are removed with a probability that depends on the song frequency. Finally, negative examples (neg_samples times the number of remaining positive pairs) are drawn at random using the neg_sampling_table.

In [15]:
# function to generate word2vec positive and negative pairs
# from an array of ints that represents a text or, here, a playlist
# params
# seq : input text or playlist (array of int)
# window : maximum distance between the two songs of a positive pair
# neg_samples : number of negative samples to generate per positive one
# neg_sampling_table : sampling table for negative samples
# sub_sampling_table : table of raw frequencies used to sub-sample common songs
# sub_t : sub-sampling parameter
def word2vecSampling(seq, window, neg_samples, neg_sampling_table, sub_sampling_table, sub_t):
    # vocab size
    V = len(neg_sampling_table)
    # extract positive pairs 
    positives = skipgrams(sequence=seq, vocabulary_size=V, window_size=window, negative_samples=0) #return couples, labels: where couples are int pairs and labels are either 0 or 1.
    ppairs    = np.array(positives[0]) #couples
    # sub sampling
    if (ppairs.shape[0]>0):
        f = sub_sampling_table[ppairs[:,0]]
        subprob = ((f-sub_t)/f)-np.sqrt(sub_t/f)
        tokeep = (subprob<np.random.uniform(size=subprob.shape[0])) | (subprob<0)
        ppairs = ppairs[tokeep,:]
    nbneg     = ppairs.shape[0]*neg_samples
    # sample negative pairs
    if (nbneg > 0):
        negex     = np.random.choice(V, nbneg, p=neg_sampling_table)
        negexcontext = np.repeat(ppairs[:,0],neg_samples)
        npairs    = np.transpose(np.stack([negexcontext,negex]))
        pairs     = np.concatenate([ppairs,npairs],axis=0)
        labels    = np.concatenate([np.repeat(1,ppairs.shape[0]),np.repeat(0,nbneg)])
        perm      = np.random.permutation(len(labels))
        res = [pairs[perm,:],labels[perm]]
    else:
        res=[[],[]]
    return res
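The sub-sampling step can be examined in isolation. The sketch below evaluates the same expression as `subprob` for a few invented song frequencies, using the value of `sub_samp` defined earlier. The helper name `discard_probability` is hypothetical.

```python
import numpy as np

def discard_probability(f, t):
    # Same expression as `subprob` in word2vecSampling.
    # Negative values mean the pair is always kept.
    return ((f - t) / f) - np.sqrt(t / f)

t = 1e-5  # value of sub_samp above
freqs = np.array([1e-2, 1e-4, 1e-5, 1e-6])  # hypothetical song frequencies
p = np.clip(discard_probability(freqs, t), 0.0, 1.0)
for f, q in zip(freqs, p):
    print(f"frequency {f:.0e}: discarded with probability {q:.3f}")
```

A song with frequency at or below `sub_t` is never discarded, while pairs starting from a very frequent song are dropped most of the time.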

<span style="color:red">Use this function to build a "track_ns_generator" of data which will generate positive and negative examples from "nbm" playlists randomly drawn from the "corpus_num" dataset provided as input. </span>

In [16]:
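import random
def track_ns_generator(corpus_num, nbm):
    while True:
        # accumulators for the pairs and labels of the current batch
        x = np.ndarray((0,2), dtype=np.int32)
        y = np.ndarray((0), dtype=np.int32)
        for i in range(nbm):
            # randrange excludes len(corpus_num); randint(0, len(corpus_num)) could index past the end
            idx = random.randrange(len(corpus_num))
            a, b = word2vecSampling(corpus_num[idx], window_width, neg_sample, st_smooth, st, sub_samp)
            # skip playlists that produced no pair
            if len(a) > 0:
                x = np.vstack((x, a))
                y = np.append(y, b)
        # two input arrays (target, context) and the labels
        yield ((x[:,0], x[:,1]), y)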
        

## Learning 
You should now be able to train your first model with the following code. This should take between 15 and 30 min.

In [3]:
hist=Track2Vec.fit(track_ns_generator(play_app,min_batch_size),steps_per_epoch = 200,epochs=60)


## Save latent space
Once the learning is done, we can save the position of the songs in the latent space with the following code:

In [17]:
# retrieve the positions of the tracks in the latent space
vectors_tracks = Track2Vec.get_weights()[0]

with open('latent_positions.npy', 'wb') as f:
    np.save(f, vectors_tracks)

And later load it with:

In [18]:
vectors_tracks=np.load("latent_positions.npy")
print(len(vectors_tracks))

123241


## Use in completion and evaluation
We can now use this space to make suggestions. <span style="color:red">Build a predict_batch function that takes as input a vector of song numbers (seeds), the number s of suggestions to make per seed, the vectors X of the songs in the latent space, and a kd-tree. As its suggestions, this function will return the indices of the s closest neighbors of each seed.</span> To keep prediction time reasonable, you will use a kd-tree (available in scikit-learn) to speed up the nearest-neighbor search.

In [19]:
from sklearn.neighbors import KDTree
kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')
print(len(kdt.data))
print(play_val[:,0])

123241
[  532 50537 79202 ... 97136  1386  6500]


In [20]:
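def predict_batch(seeds, k, X, kdt):
    # indices of the k nearest neighbours of each seed in the latent space
    # (a seed is at distance 0 from itself, so it is normally returned in column 0)
    ind = kdt.query(X[seeds], k, return_distance = False)
    return ind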
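Because every seed is itself a point of the latent space, it is normally its own nearest neighbour. A variant that never suggests the seed can be sketched as follows. The helper `predict_batch_excluding_seed` is hypothetical, and brute-force NumPy distances stand in for the kd-tree so that the example is self-contained; this is only practical for a toy-sized space.

```python
import numpy as np

def predict_batch_excluding_seed(seeds, k, X):
    """Return, for each seed, the indices of its k nearest songs other than itself.

    Hypothetical helper: brute-force distances replace the kd-tree here.
    """
    seeds = np.asarray(seeds)
    # Squared Euclidean distances between each seed vector and every song vector.
    d = ((X[seeds][:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Make sure a seed can never be returned as its own suggestion.
    d[np.arange(len(seeds)), seeds] = np.inf
    # Indices of the k smallest distances, nearest first.
    return np.argsort(d, axis=1)[:, :k]

# Toy latent space: five songs on a line (hypothetical coordinates).
X_toy = np.array([[0.0], [1.0], [3.0], [7.0], [8.0]])
toy_suggestions = predict_batch_excluding_seed([0, 3], 2, X_toy)
print(toy_suggestions)
```

With the kd-tree, a similar effect could probably be obtained by querying k + 1 neighbours and discarding the seed from each row.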

<span style="color:red">Use this function to propose songs to complete the playlist of the validation dataset (the seeds correspond to the first column of play_val).</span>

In [18]:
indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)
print(indexes)

<span style="color:red">Compare these suggestions with the second column of play_val (the songs actually present). To do this you will calculate the hit@10, which is 1 if the song actually present in the playlist is one of the 10 suggestions (this score is averaged over the validation set), and the NDCG@10 (Normalized Discounted Cumulative Gain), which takes into account the order of the suggestions. This second score is worth $1/\log_2(k+1)$ if proposal k (k between 1 and 10) is the correct proposal and 0 if no proposal is correct. As before you will calculate the average score on the validation set. </span>


In [21]:
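import math
def NDGCatK(indexes):
    # NDCG@k against the second column of play_val (the song actually present)
    score = 0
    for i in range(len(play_val[:,1])):
        for j in range(len(indexes[i])):
            if play_val[i,1] == indexes[i][j]:
                # j is 0-based, so rank k = j+1 contributes 1/log2(k+1) = 1/log2(j+2)
                score += 1/(math.log(j+2,2))
    # average over the validation set
    return score/len(play_val[:,1])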

In [22]:
def HitatK(indexes):
    # hit@k: proportion of validation playlists whose true song is among the suggestions
    hits = 0
    for i in range(len(indexes)):
        if play_val[i,1] in indexes[i]:
            hits += 1
    return hits/len(play_val[:,1])
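The two scores can be checked by hand on a tiny example. The functions below take the suggestions and the expected songs as explicit arguments instead of reading `play_val`; the names `hit_at_k` and `ndcg_at_k` and the toy values are hypothetical.

```python
import math

def hit_at_k(suggestions, targets):
    # Fraction of rows whose target appears among the suggestions.
    hits = sum(1 for s, t in zip(suggestions, targets) if t in s)
    return hits / len(targets)

def ndcg_at_k(suggestions, targets):
    total = 0.0
    for s, t in zip(suggestions, targets):
        s = list(s)
        if t in s:
            rank = s.index(t) + 1  # 1-based position of the correct song
            total += 1.0 / math.log2(rank + 1)
    return total / len(targets)

toy_suggestions = [[4, 7, 9], [1, 2, 3], [5, 6, 8]]
toy_targets = [4, 3, 0]
print(hit_at_k(toy_suggestions, toy_targets))
print(ndcg_at_k(toy_suggestions, toy_targets))
```

Here two of the three targets are found, at ranks 1 and 3, so the hit score is 2/3 and the NDCG is (1 + 1/2 + 0)/3 = 0.5. Passing the targets explicitly also makes the same functions usable on the test set.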

## Hyper-parameter tuning

<span style="color:red">You can now try to vary the hyper parameters to improve your performance. Pay attention to the computing time : prepare a grid with about ten different configurations and evaluate each of them on your validation set.
Evaluate the final performance of the best configuration found on the test set. Don't forget to save your results.</span>



We can change: epochs (32, ...), steps_per_epoch, batch_size, optimizer, and perhaps also the loss function.
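One way to organise such a grid is to enumerate the configurations with `itertools.product` and collect one result dictionary per configuration. This is only a sketch: `run_grid` and `dummy_train_and_score` are hypothetical names, and the dummy scorer returns an invented number so that the example runs without Keras. In a real `train_and_score`, it is probably safer to build new Input, Embedding and Dense layers for every configuration, because wrapping the existing `output` tensor in a new `Model` reuses the weights that were already trained.

```python
import itertools

def run_grid(grid, train_and_score):
    """Call train_and_score(config) for every combination in grid and collect the results."""
    keys = sorted(grid)
    results = []
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        scores = train_and_score(config)  # expected to return a dict of metrics
        results.append({**config, **scores})
    return results

def dummy_train_and_score(config):
    # Placeholder score so the sketch runs without Keras; replace with real training.
    return {"NDCG@10": 1.0 / (config["vector_dim"] + config["window_width"])}

toy_grid = {"vector_dim": [30, 50], "window_width": [3, 5]}
toy_results = run_grid(toy_grid, dummy_train_and_score)
best = max(toy_results, key=lambda r: r["NDCG@10"])
print(len(toy_results), best)
```

Keeping every configuration and its scores in one list makes it straightforward to save the results and to pick the best configuration afterwards.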

In [20]:
result = []  # one dict of settings and scores per configuration


#1
min_batch_size = 32
hist1=Track2Vec.fit(track_ns_generator(play_app,min_batch_size),steps_per_epoch = 400,epochs=50)

vectors_tracks = Track2Vec.get_weights()[0]
kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')
indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)

result1 = {"min_batch_size":min_batch_size, "steps_per_epoch":400, "epochs":50, "loss":"binary_crossentropy", "optimizer":"adam", "accuracy":hist1.history['accuracy'][-1], "NDGC@K":NDGCatK(indexes), "Hit@K":HitatK(indexes)}
result.append(result1)

Epoch 1/50
...
Epoch 50/50
0.04649957713599054
0.0474


In [21]:
#2
min_batch_size = 64
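# note: wrapping the same input/output tensors creates a new Model object but reuses
# the existing embedding layer, so training continues from the weights learned so far
Track2Vec = Model(inputs=[input_target, input_context], outputs=output)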
Track2Vec.compile(loss='binary_crossentropy', optimizer='SGD', metrics=["accuracy"])
hist2=Track2Vec.fit(track_ns_generator(play_app,min_batch_size),steps_per_epoch = 350,epochs=32)

vectors_tracks = Track2Vec.get_weights()[0]
kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')
indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)

result2 = {"min_batch_size":min_batch_size, "steps_per_epoch":350, "epochs":32, "loss":"binary_crossentropy", "optimizer":"SGD", "accuracy":hist2.history['accuracy'][-1], "NDGC@K":NDGCatK(indexes), "Hit@K":HitatK(indexes)}
result.append(result2)

Epoch 1/32
...
Epoch 32/32
0.04650077365292515
0.0474


In [23]:
#3
min_batch_size = 64
Track2Vec = Model(inputs=[input_target, input_context], outputs=output)
Track2Vec.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
hist3=Track2Vec.fit(track_ns_generator(play_app,min_batch_size),steps_per_epoch = 350,epochs=32)

vectors_tracks = Track2Vec.get_weights()[0]
kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')
indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)

result3 = {"min_batch_size":min_batch_size, "steps_per_epoch":350, "epochs":32, "loss":"binary_crossentropy", "optimizer":"adam", "accuracy":hist3.history['accuracy'][-1], "NDGC@K":NDGCatK(indexes), "Hit@K":HitatK(indexes)}
result.append(result3)

Epoch 1/32
...
Epoch 32/32



In [23]:

#4
min_batch_size = 100
Track2Vec = Model(inputs=[input_target, input_context], outputs=output)
Track2Vec.compile(loss='hinge', optimizer='adam', metrics=["accuracy"])
hist4=Track2Vec.fit(track_ns_generator(play_app,min_batch_size),steps_per_epoch = 200,epochs=50)

vectors_tracks = Track2Vec.get_weights()[0]
kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')
indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)

result4 = {"min_batch_size":min_batch_size, "steps_per_epoch":200, "epochs":50, "loss":"hinge", "optimizer":"adam", "accuracy":hist4.history['accuracy'][-1], "NDGC@K":NDGCatK(indexes), "Hit@K":HitatK(indexes)}
result.append(result4)

Epoch 1/50
...
Epoch 50/50
0.04695716978988141
0.0483


In [24]:

#5
min_batch_size = 100
Track2Vec_5 = Model(inputs=[input_target, input_context], outputs=output)
Track2Vec_5.compile(loss='hinge', optimizer='adam', metrics=["accuracy"])
hist5=Track2Vec_5.fit(track_ns_generator(play_app,min_batch_size),steps_per_epoch = 200,epochs=50)

# read the weights from the model trained in this cell
vectors_tracks = Track2Vec_5.get_weights()[0]
kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')
indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)

result5 = {"min_batch_size":min_batch_size, "steps_per_epoch":200, "epochs":50, "loss":"hinge", "optimizer":"adam", "accuracy":hist5.history['accuracy'][-1], "NDGC@K":NDGCatK(indexes), "Hit@K":HitatK(indexes)}
result.append(result5)

Epoch 1/50
...
Epoch 50/50
0.046950237445688756
0.0482


In [25]:
#6
min_batch_size = 64
Track2Vec = Model(inputs=[input_target, input_context], outputs=output)
Track2Vec.compile(loss='squared_hinge', optimizer='adam', metrics=["accuracy"])
hist6=Track2Vec.fit(track_ns_generator(play_app,min_batch_size),steps_per_epoch = 200,epochs=32)

vectors_tracks = Track2Vec.get_weights()[0]
kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')
indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)

result6 = {"min_batch_size":min_batch_size, "steps_per_epoch":200, "epochs":32, "loss":"squared_hinge", "optimizer":"adam", "accuracy":hist6.history['accuracy'][-1], "NDGC@K":NDGCatK(indexes), "Hit@K":HitatK(indexes)}
result.append(result6)

print(result)

Epoch 1/32
...
Epoch 32/32
0.04694670906775902
0.0482
[{'min_batch_size': 32, 'steps_per_epoch': 400, 'epochs': 50, 'loss': 'binary_crossentropy', 'optimizer': 'adam', 'accuracy': [0.8299152851104736, 0.8411121964454651, 0.9107281565666199, 0.9451447129249573, 0.9633159041404724, 0.9688215851783752, 0.971632719039917, 0.9733495712280273, 0.9744848608970642, 0.9756481647491455, 0.9760657548904419, 0.9767351746559143, 0.9773838520050049, 0.9776550531387329, 0.978101372718811, 0.9780935645103455, 0.9782416820526123, 0.9788303375244141, 0.9788075685501099, 0.9792325496673584, 0.9794274568557739, 0.9795085191726685, 0.9798117280006409, 0.97989773750

In [62]:
#7
min_batch_size = 100
Track2Vec = Model(inputs=[input_target, input_context], outputs=output)
Track2Vec.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
hist7=Track2Vec.fit(track_ns_generator(play_app,min_batch_size),steps_per_epoch = 200,epochs=50)

vectors_tracks = Track2Vec.get_weights()[0]
kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')
indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)

result7 = {"min_batch_size":min_batch_size, "steps_per_epoch":200, "epochs":50, "loss":"binary_crossentropy", "optimizer":"adam", "accuracy":hist7.history['accuracy'][-1], "NDGC@K":NDGCatK(indexes), "Hit@K":HitatK(indexes)}
result.append(result7)

Epoch 1/50
...
Epoch 50/50
0.04644111531320798
0.0472


In [64]:
#8
min_batch_size = 100
Track2Vec = Model(inputs=[input_target, input_context], outputs=output)
Track2Vec.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
hist8=Track2Vec.fit(track_ns_generator(play_app,min_batch_size),steps_per_epoch = 200,epochs=32)

vectors_tracks = Track2Vec.get_weights()[0]
kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')
indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)

result8 = {"min_batch_size":min_batch_size, "steps_per_epoch":200, "epochs":32, "loss":"binary_crossentropy", "optimizer":"adam", "accuracy":hist8.history['accuracy'][-1], "NDGC@K":NDGCatK(indexes), "Hit@K":HitatK(indexes)}
result.append(result8)

Epoch 1/32
...
Epoch 32/32
0.046363785428859015
0.0471


In [69]:
m = 0
for r in result:
    print(r)
    print()
    if m <= r["NDGC@K"]:
        m = r["NDGC@K"]
print(m)

{'min_batch_size': 32, 'steps_per_epoch': 400, 'epochs': 50, 'loss': 'binary_crossentropy', 'optimizer': 'adam', 'accuracy': 0.9894142150878906, 'NDGC@K': 0.04649957713599054, 'Hit@K': 0.0474}

{'min_batch_size': 64, 'steps_per_epoch': 350, 'epochs': 32, 'loss': 'binary_crossentropy', 'optimizer': 'SGD', 'accuracy': 0.9894244074821472, 'NDGC@K': 0.04650077365292515, 'Hit@K': 0.0474}

{'min_batch_size': 64, 'steps_per_epoch': 350, 'epochs': 32, 'loss': 'binary_crossentropy', 'optimizer': 'adam', 'accuracy': 0.9922094345092773, 'NDGC@K': 0.04642999207812832, 'Hit@K': 0.0472}

{'min_batch_size': 100, 'steps_per_epoch': 200, 'epochs': 50, 'loss': 'hinge', 'optimizer': 'adam', 'accuracy': 0.994513213634491, 'NDGC@K': 0.04695716978988141, 'Hit@K': 0.0483}

{'min_batch_size': 100, 'steps_per_epoch': 200, 'epochs': 50, 'loss': 'hinge', 'optimizer': 'adam', 'accuracy': 0.995614230632782, 'NDGC@K': 0.046950237445688756, 'Hit@K': 0.0482}

{'min_batch_size': 64, 'steps_per_epoch': 200, 'epochs': 3

It seems that min_batch_size = 100, steps_per_epoch = 200, epochs = 50, loss = hinge, optimizer = adam is the best configuration.
<br><br>It now has to be evaluated on play_tst, i.e. the test set.

In [34]:
indexes_test = predict_batch(play_tst[:,0],2,vectors_tracks,kdt)
print(indexes_test)

[[ 50008 112275]
 [   175  48073]
 [  3920  12561]
 ...
 [ 10659   7230]
 [ 39650   6733]
 [ 37384  11994]]


In [36]:
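# predict_batch returns the seed itself in column 0 (distance 0), so column 1 is the
# nearest other song: this score is the proportion of exact matches on that single suggestion
score = 0
for i in range(len(play_tst)):
    if play_tst[i,1] == indexes_test[i,1]:
        score += 1
accuracy = score / len(play_tst)
print(accuracy)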

0.015051801655046588


## Bonus, a little music

The TrackArtists file contains metadata on the tracks and artists for a subset of the 300,000 tracks in the dataset. We can use it to find the number of a song from its title:

In [97]:
import pandas as pd
tr_meta=pd.read_csv("./data/tracks_proj.csv")
joindf = pd.DataFrame({"track_id":tracks_list_ordered[:Vt],"index":range(Vt)})
meta = tr_meta.merge(joindf, left_on="id",right_on="track_id")
meta.set_index("index",inplace=True)
meta[["title","artist_name","preview","id"]]

| index | title | artist_name | preview | id |
|---|---|---|---|---|
| 14086 | Alone | Petit Biscuit | http://cdn-preview-8.deezer.com/stream/c-89176... | track_100001884 |
| 9768 | It Was Always You | Maroon 5 | http://cdn-preview-e.deezer.com/stream/c-e24ca... | track_100004586 |
| 11888 | Unkiss Me | Maroon 5 | http://cdn-preview-4.deezer.com/stream/c-42340... | track_100004588 |
| 321 | Sugar | Maroon 5 | http://cdn-preview-b.deezer.com/stream/c-b3342... | track_100004590 |
| 12477 | Leaving California | Maroon 5 | http://cdn-preview-5.deezer.com/stream/c-53dbb... | track_100004592 |
| ... | ... | ... | ... | ... |
| 5338 | Hometown | Twenty One Pilots | http://cdn-preview-2.deezer.com/stream/c-2d107... | track_99976972 |
| 9202 | Not Today | Twenty One Pilots | http://cdn-preview-9.deezer.com/stream/c-9d2b0... | track_99976974 |
| 8386 | Goner | Twenty One Pilots | http://cdn-preview-2.deezer.com/stream/c-242d7... | track_99976976 |
| 11491 | Nobody Has To Know (feat. Ty Dolla $ign) | Kranium | http://cdn-preview-b.deezer.com/stream/c-b2d78... | track_99976980 |


In [98]:
def find_track(title):
    return meta.loc[meta["title"]==title,:].index[0]

tr=find_track("Hexagone")
tr

19492

## Radio

The Deezer API allows you to retrieve information about the tracks of the dataset from their Deezer id. When available, this information includes a URL for listening to a free sample.

In [99]:
import urllib.request, json 
def gettrackinfo(number):
    track_url =  "https://api.deezer.com/track/{}".format(tracks_list_ordered[number].split("_")[1])
    with urllib.request.urlopen(track_url) as url:
        data = json.loads(url.read().decode())
    return data
track_apidata = gettrackinfo(find_track("Hexagone"))
track_apidata

{'id': 128093263,
 'readable': True,
 'title': 'Hexagone',
 'title_short': 'Hexagone',
 'title_version': '',
 'isrc': 'FRZ027500460',
 'link': 'https://www.deezer.com/track/128093263',
 'share': 'https://www.deezer.com/track/128093263?utm_source=deezer&utm_content=track-128093263&utm_term=0_1610988483&utm_medium=web',
 'duration': 330,
 'track_position': 4,
 'disk_number': 1,
 'rank': 700311,
 'release_date': '2016-07-08',
 'explicit_lyrics': False,
 'explicit_content_lyrics': 0,
 'explicit_content_cover': 0,
 'preview': 'https://cdns-preview-9.dzcdn.net/stream/c-93c768b47b54c1d295f92f59990f732a-6.mp3',
 'bpm': 125.66,
 'gain': -12.5,
 'available_countries': ['AE',
  'AF',
  'AG',
  'AI',
  'AL',
  'AM',
  'AO',
  'AQ',
  'AR',
  'AS',
  'AT',
  'AU',
  'AZ',
  'BA',
  'BB',
  'BD',
  'BE',
  'BF',
  'BG',
  'BH',
  'BI',
  'BJ',
  'BN',
  'BO',
  'BQ',
  'BR',
  'BT',
  'BV',
  'BW',
  'BY',
  'CC',
  'CD',
  'CF',
  'CG',
  'CH',
  'CI',
  'CK',
  'CL',
  'CM',
  'CO',
  'CR',
  'CU'
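The URL construction used by `gettrackinfo` and the lookup of the preview field can be isolated in two small helpers that need no network access. Both helper names are hypothetical, and treating an empty `preview` string as "no sample available" is an assumption about the API response rather than documented behaviour.

```python
def deezer_track_url(track_key):
    """Build the API URL for a dataset key such as 'track_128093263'."""
    prefix, _, number = track_key.partition("_")
    if prefix != "track" or not number.isdigit():
        raise ValueError(f"unexpected track key: {track_key!r}")
    return "https://api.deezer.com/track/{}".format(number)

def preview_or_none(track_info):
    """Return the preview URL from a response dict, or None when it is missing or empty."""
    return track_info.get("preview") or None

print(deezer_track_url("track_128093263"))
print(preview_or_none({"id": 128093263, "preview": ""}))
```

Validating the key before calling the API gives a clearer error than a failed HTTP request when an artist key is passed by mistake.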

So we can use it to listen to a preview:

In [100]:
from IPython.display import display, Audio, clear_output
display(Audio(track_apidata["preview"],autoplay=True))

<span style="color:red">Create a radio function that takes as input a track number in the dataset and plays a series of nb_track tracks, each time picking the next track at random in the neighborhood of the current one. The size of the neighborhood will be configurable, and you will remove from the candidates the songs already listened to. You will handle exceptions if a track does not have an available sample. You can clear the current song with the clear_output function.</span>

In [126]:
ind = kdt.query(vectors_tracks, 5, return_distance = False)

In [1]:
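import random
import time
def start_radio(seed, nb_candidates, duration, nbsteps=20):
    # play the seed track first
    print(meta.loc[seed,"title"])
    display(Audio(meta.loc[seed,"preview"],autoplay=True))
    time.sleep(duration)
    clear_output()
    already_played = [seed]
    for i in range(nbsteps):
        # nb_candidates nearest neighbours of the current track (column 0 is normally the track itself)
        neighbours = kdt.query(vectors_tracks[[seed]], nb_candidates + 1, return_distance=False)[0][1:]
        # drop the tracks already listened to
        candidates = [int(n) for n in neighbours if n not in already_played]
        if not candidates:
            print("no new track in the neighbourhood")
            break
        seed = random.choice(candidates)
        already_played.append(seed)
        try:
            print(meta.loc[seed,"title"])
            display(Audio(meta.loc[seed,"preview"],autoplay=True))
            time.sleep(duration)
        except Exception:
            # no metadata or no usable preview for this track
            print("track not found")
        clear_output()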


In [130]:
start_radio(find_track("Hexagone"),5,5,10)

13813


KeyboardInterrupt: 
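The neighbourhood walk behind the radio can be separated from audio playback, which makes it easy to test. In this sketch `radio_walk` and the neighbour lists are hypothetical; with the real data the lists would come from nearest-neighbour queries in the latent space.

```python
import random

def radio_walk(seed, neighbours, nb_track, rng=random):
    """Return the list of tracks visited, starting from seed, without repeats."""
    played = [seed]
    current = seed
    for _ in range(nb_track):
        # candidates are the neighbours of the current track not yet played
        candidates = [n for n in neighbours[current] if n not in played]
        if not candidates:
            break  # every neighbour has already been played
        current = rng.choice(candidates)
        played.append(current)
    return played

# Hypothetical neighbour lists for five tracks.
toy_neighbours = {0: [1, 2], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
walk = radio_walk(0, toy_neighbours, nb_track=10, rng=random.Random(0))
print(walk)
```

Stopping when no unplayed neighbour remains avoids the endless loop that can occur when a new candidate is redrawn until it is unseen.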