
# Spotify Sequential Skip Prediction Challenge

### Anna Martignano, Daniele Montesi, ID2223
With  this  project  we  would  like  to  use  Deep  Learning  to  model  a  solution  to  the  challenge  ”  Spotify Sequential  Skip  Prediction  Challenge  -  Predict  if  users  will  skip  or  listen  to  the  music  they’re  streamed” published by Spotify in collaboration with WSDM and CrowdAI.

Citing the CrowdAI Challenge Website:The task is to predict whether individual tracks encountered in a listening session will beskippedby a particular user.  In order to do this, complete information about the first half of a user’slistening session is provided, while the prediction is to be carried out on the second half.  

The output of a prediction is abinary variable for each track in the second half of the session indicating if it was skipped or not, with a 1 indicating that the track skipped, and a 0 indicating that the track was not skipped.The  problem  corresponds  hence  to  a  binary  classification.   However,  the  problem  is  hard  and  must take into account all the history of the user tracks listened, making the model that best fits this problem aRecurrent Neural Networks.


### Code organization


The problem will present the following sections:

1. Data exploration 
2. Data filtering/preprocessing 
3. Dataset creation
4. Model reasoning and creation and fitting (First Baseline)
5. Re-Implementation of the 5th solution
6. Further model improvement


## 0 - Libraries import and useful paths

In [0]:
!pip install -q pyyaml h5py  # Required to save models in HDF5 format

In [2]:
%tensorflow_version 2.x

TensorFlow 2.x selected.


In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
from tensorflow.keras import layers, utils, Model, Input
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import model_from_json, load_model

from pickle import dump
import pickle
from tqdm import tqdm
import string
import numpy as np
import random
from PIL import Image
import time
import os
import pandas as pd
from glob import glob
from joblib import delayed, Parallel
import matplotlib.pyplot as plt
import datetime
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GroupKFold, KFold

In [4]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [5]:
dirpath = "/content/drive/My Drive/Colab Notebooks/Project"
modelpath = dirpath + "/Model"
datapath = dirpath + "/Dataset"
processedpath = dirpath + "/ProcessedDataset"
testpath = datapath + "/test_set"
trainpath = datapath + "/training_set"
featurespath = datapath + "/track_features"

print(featurespath)

/content/drive/My Drive/Colab Notebooks/Project/Dataset/track_features


## 1 - Data Exploration

For the sake of order, we will perform the exploration phase on a different notebook:
`SpotifySequentialChallenge - DataExploration`

Here we upload the Training dataset and the Test Dataset.

- The Train and the Test are belonging to the **same day**. This avoid to suffer from the **concept drift phenomena**
- Only a small subset of the training has been employed (the total Spotify Dataset contained 100+GB of data!)
- The day is a Summer day (15 July 2018)

In [0]:
# Train

data = pd.DataFrame()

for i in range(2):
  data = pd.concat([data, pd.read_csv(trainpath+"/log_{}_20180715_000000000000.csv".format(i))])

# Setting the flag will be used after we concatenate train-test into one dataset
data['train_flag'] = True
data.head(5)

Unnamed: 0,session_id,session_position,session_length,track_id_clean,skip_1,skip_2,skip_3,not_skipped,context_switch,no_pause_before_play,short_pause_before_play,long_pause_before_play,hist_user_behavior_n_seekfwd,hist_user_behavior_n_seekback,hist_user_behavior_is_shuffle,hour_of_day,date,premium,context_type,hist_user_behavior_reason_start,hist_user_behavior_reason_end,train_flag
0,0_00006f66-33e5-4de7-a324-2d18e439fc1e,1,20,t_0479f24c-27d2-46d6-a00c-7ec928f2b539,False,False,False,True,0,0,0,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone,True
1,0_00006f66-33e5-4de7-a324-2d18e439fc1e,2,20,t_9099cd7b-c238-47b7-9381-f23f2c1d1043,False,False,False,True,0,1,0,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone,True
2,0_00006f66-33e5-4de7-a324-2d18e439fc1e,3,20,t_fc5df5ba-5396-49a7-8b29-35d0d28249e0,False,False,False,True,0,1,0,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone,True
3,0_00006f66-33e5-4de7-a324-2d18e439fc1e,4,20,t_23cff8d6-d874-4b20-83dc-94e450e8aa20,False,False,False,True,0,1,0,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone,True
4,0_00006f66-33e5-4de7-a324-2d18e439fc1e,5,20,t_64f3743c-f624-46bb-a579-0f3f9a07a123,False,False,False,True,0,1,0,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone,True


In [0]:
# Test

test_data_hist  = pd.read_csv(testpath + '/log_prehistory_20180715_000000000000.csv')
test_data_hist['train_flag']  = False

data = pd.concat([data,test_data_hist])
data.reset_index(drop=True, inplace=True)

test_data_input = pd.read_csv(testpath + '/log_input_20180715_000000000000.csv')
test_data_input['train_flag']  = False

# Feature engineering

We have observed that the  tracks suggested by Spotify are sometimes repeated. We will take that into account encoding that as a feature.

In [0]:
# Check for presence of duplicate tracks into a session
grouped_sess_tracks = data.groupby('session_id')['track_id_clean'].apply(lambda x: x.tolist()).tolist()

In [0]:
session_track_count = data.groupby(['session_id','track_id_clean'], as_index = False).size().reset_index().rename(columns = {0 : 'duplicated_time'})
data = data.merge(session_track_count, on = ['session_id', 'track_id_clean'], how = 'left')
print("On average, the number of duplicate songs in those sessions is --> ")
session_track_count.duplicated_time.mean()

On average, the number of duplicate songs in those sessions is --> 


1.1278782639423945

### We now load the Tracks dataset and joint it with the dataset:

In [0]:
tf = pd.read_csv(featurespath+"/tf_0.csv")
tf = pd.concat([tf, pd.read_csv(featurespath+"/tf_1.csv")])
tf.reset_index(drop=True, inplace=True)

# 2- Dataset Preprocessing
As we can see, the dataset is made of these columns:

- SessionPosition: the position of the track on the session (which can be long up to 20)
- skip_1	skip_2	skip_3	not_skipped: The possible values for skipping. For this purpose, we will drop all of these except for the skip_2, which is the type of skip we are interested on

In [0]:
data["label"] = data["skip_2"]
data.drop(['skip_1', 'skip_2','skip_3', 'not_skipped'], axis=1, inplace=True)

In [0]:

for i in tqdm(data.index):
    leng = int(data.at[i, "session_length"])
    hour = int(data.at[i, "hour_of_day"])
    if leng < 15:
        data.at[i, "session_length"] = 0
    elif leng >= 15 and leng<20:
        data.at[i, "session_length"] = 1
    else:
        data.at[i, "session_length"] = 2
        
        
    if hour >7 and hour<=13:
        hour = data.at[i, "hour_of_day"] = 0
    elif hour >13 and hour<20:
        hour = data.at[i, "hour_of_day"] = 1
    elif hour >=20:
        hour = data.at[i, "hour_of_day"] = 2
    else:
        hour = data.at[i, "hour_of_day"] = 3

   

100%|██████████| 9645404/9645404 [09:23<00:00, 17103.32it/s]


In [0]:
# TO DO: Optimize or remove

for i in tqdm(tf.index):
    yr = int(tf.at[i, "release_year"])
    if yr < 1960:
        tf.at[i, "release_year"] = 0
    elif yr >= 1960 and yr < 1970:
        tf.at[i, "release_year"] = 1
    elif yr > 1970 and yr <= 1980:
        tf.at[i, "release_year"] = 2
    elif yr > 1980 and yr <= 1990:
        tf.at[i, "release_year"] = 3
    elif yr > 1990 and yr <= 2000:
        tf.at[i, "release_year"] = 4
    elif yr > 2000 and yr <= 2005:
        tf.at[i, "release_year"] = 5
    elif yr > 2005 and yr <= 2010:
        tf.at[i, "release_year"] = 6
    elif yr > 2010 and yr <= 2015:
        tf.at[i, "release_year"] = 7

100%|██████████| 3706388/3706388 [01:27<00:00, 42382.51it/s]


### Scaling and one-hot-encoding

In [0]:
# Generate dummy columns for categorical variables
# Scaler
print('{0}\t\tNormalising context features...'.format(datetime.datetime.now()))
features = list(set(data.columns) - set(['session_id','track_id_clean','date', 'label', 'session_position', 'train_flag', "context_switch", 'session_length', 'no_pause_before_play','short_pause_before_play','long_pause_before_play','hist_user_behavior_n_seekfwd','hist_user_behavior_is_shuffle','premium','hour_of_day', 'hist_user_behavior_reason_start', 'hist_user_behavior_reason_end', 'session_length', 'context_type']))
data[features] = StandardScaler().fit_transform(data[features])

features_t = list(set(tf.columns) - set(['track_id' , 'release_year', 'time_signature', 'mode', 'key']))
tf[features_t] = StandardScaler().fit_transform(tf[features_t])

features.append('label')
features.append('session_position')

print('{0}\t\tOne-hot-encoding categorical variables...'.format(datetime.datetime.now()))
data = pd.get_dummies(data, columns = ['context_type'], prefix = ['context_type'])
data = pd.get_dummies(data, columns = ['hist_user_behavior_reason_start'], prefix = ['hist_user_behavior_reason_start'])
data = pd.get_dummies(data, columns = ['hist_user_behavior_reason_end'], prefix = ['hist_user_behavior_reason_end'])
data = pd.get_dummies(data, columns = ['session_length'], prefix = ['session_len'])
data = pd.get_dummies(data, columns = ['hour_of_day'], prefix = ['hour_of_day'])
# Features to use

tf = pd.get_dummies(tf, columns = ['release_year'], prefix = ['rel_year'])
tf = pd.get_dummies(tf, columns = ['mode'], prefix = ['mode'])
tf = pd.get_dummies(tf, columns = ['key'], prefix = ['key'])
tf = pd.get_dummies(tf, columns = ['time_signature'], prefix = ['time_signature'])


2020-01-10 19:00:59.746477		Normalising context features...
2020-01-10 19:01:15.228463		One-hot-encoding categorical variables...


In [0]:
missing_tdi_feat = set(data.columns.values) - set(test_data_input.columns.values) 
for f in list(missing_tdi_feat):
  test_data_input[f] = 0.0

In [0]:
data = data[:10000]

In [0]:
data = pd.concat([data,test_data_input], sort=False)
data.fillna(.0,inplace=True)

# 3- Dataset Creation

The dataset is formed analogously to whast reported in the 5th place solution paper: http://adrem.uantwerpen.be//bibrem/pubs/WSDMCupJeunen2019.pdf

![alt test](pics/model.png)

(https://github.com/olivierjeunen/sequential-skip-prediction/blob/master/src/RNN.py in generate_model() )

where Xh and Xf are respectively 2 matrices for the session history and future. 
### The steps to the dataset creation are:

1. Merge track features and session features
2. Split based on the time, and drop that feature (with other useless features)
3. create the History-Matrix and Future Matrix. Create duplicate files for each new song to predict its label 

## Keep only tracks present in the training set

Then, we will explore how frequent they are :)

In [0]:
set_tracks_present = set(data.track_id_clean.values)
tracks_set = set(tf.track_id)
print("There are", len(set_tracks_present), "tracks in the dataset \nInstead, in the tracks_data there are", len(tracks_set) )
print("Missing tracks on dataset:", len(set(set_tracks_present)-tracks_set)*100/len(set(set_tracks_present)), "%")

There are 701495 tracks in the dataset 
Instead, in the tracks_data there are 3706388
Missing tracks on dataset: 0.0 %


In [0]:
#TODO Correct: must output a number
data['date'] = pd.to_datetime(data['date'])
data["hist_user_behavior_is_shuffle"] = data['hist_user_behavior_is_shuffle'].astype(np.int64)
data["premium"] = data['premium'].astype(np.int64)
data["label"] = data['label'].astype(np.int64)

In [0]:
# Sorting by session & position of track in session
data.sort_values(['session_id','session_position', "date"], inplace = True)

In [0]:
tf["track_id_clean"] = tf["track_id"]
tf.drop(["track_id"], axis=1, inplace=True)

data.drop("date", axis=1, inplace=True)

In [0]:
dataset = data.merge(tf, on = 'track_id_clean', how = 'left')

# Feature engineering:

Feature engineering is the key for winning competitions of Data Science. Now that we have the whole dataset we are going to compute:
1. Popularity of the songs liked/disliked
2. Duration of songs liked/disliked
3. Release date estimate
4. Acoustic vector correlation


Following the work of the 7th place solution, we are going to first, select some tracks:
- Positive tracks (i.e. the ones that are listened by the user)
- Negative tracks (i.e. the ones that are skipped by the user)
- Song considered' feature

Then, we are saving the features as the DELTA between the
- positive_feat - song-feat
- negative-feat - song_feat

Where positive_feat is the feature (i.e. populatiry) **belonging to ALL the songs LIKED by the user, averaged**.

Moreover, we are going to compute the **dot product** between the acoustic vector of the song and the one of the positive/negative tracks, trying to see whether there is a correlation between the 2.


As addiction, we are computing some similarities of those (Cosine Similarity, Manhattan Distance, ...)

We are choosing to compute Cosine Similarity


In [0]:
features = ['us_popularity_estimate', 'acousticness', 'beat_strength', 'bounciness', 'danceability',
       'dyn_range_mean', 'energy', 'flatness', 'instrumentalness',
       'liveness', 'loudness', 'mechanism','tempo' , 'organism', 'speechiness', 'valence']

"""
'pos_year': year_pos,
'neg_year': year_neg,
"""

def extract_features_session_track(song, positive_tracks, negative_tracks):
    mean_duration_pos = (positive_tracks['duration'] - song['duration']).mean()
    mean_duration_neg = (negative_tracks['duration'] - song['duration']).mean()
    """
    year_pos = (positive_tracks['release_date_estimate'] - song['release_date_estimate']).mean()
    year_neg = (negative_tracks['release_date_estimate'] - song['release_date_estimate']).mean()
   """
    pop_pos = (positive_tracks['us_popularity_estimate'] - song['us_popularity_estimate']).mean()
    pop_neg = (negative_tracks['us_popularity_estimate'] - song['us_popularity_estimate']).mean()
 
    latent_vectors = ['acoustic_vector_0', 'acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3', 
                      'acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6', 'acoustic_vector_7']
    mean_dot_vector_pos = positive_tracks[latent_vectors].dot(song[latent_vectors]).mean()
    mean_dot_vector_neg = negative_tracks[latent_vectors].dot(song[latent_vectors]).mean()

    return {'pos_duration': mean_duration_pos, 
        'neg_duration': mean_duration_neg,
        'pop_neg': pop_neg,
        'pop_pos': pop_pos,
        'pos_dot': mean_dot_vector_pos,
        'neg_dot': mean_dot_vector_neg
        }


In [0]:
tf_old = tf.copy()
tf.set_index('track_id_clean', inplace=True)

dataset.reset_index(inplace=True, drop=True)

In [0]:
current_index = 0
num_lines_completed = 0
completed_index = 0

"""
'pos_year': [],
'neg_year': [],
"""

# Initialize the list
dict_feat = {'pos_duration': [], 
        'neg_duration': [],
        'pop_neg': [],
        'pop_pos': [],
        'pos_dot': [],
        'neg_dot': []
        }

# Iterating over the whole dataset:

pbar = tqdm(total=len(dataset)-1)

while current_index < len(dataset)-1:
  
  partial_length = dataset['session_length'].iloc[current_index]-dataset['session_position'].iloc[current_index]+1
  half = int(np.floor(partial_length / 2.0))
  last_session_tracks = dataset.loc[current_index+(half):current_index+partial_length-1]

  """
  if completed_index < num_lines_completed:
      current_index += partial_length 
      completed_index += len(last_session_tracks)
      continue
  """

  if current_index>9980:
    print(current_index, partial_length)
  first_session_tracks = dataset.loc[current_index:current_index+(half)-1]
  last_session_item = first_session_tracks.iloc[-1]
            
  skipped = first_session_tracks[first_session_tracks['label'] == 1]['track_id_clean']
  completed = first_session_tracks[first_session_tracks['label'] == 0]['track_id_clean']
  
  skipped = tf.loc[skipped]
  completed = tf.loc[completed]
  
  # Put at zero the new engineered features
  for k,v in dict_feat.items():
    dict_feat[k] = dict_feat[k]+[0]*len(first_session_tracks)

  for j, session_track_row in last_session_tracks.iterrows():
    song_row = tf.loc[session_track_row['track_id_clean']]
    track_features = extract_features_track(song_row)
    track_features["distance"] = session_track_row["session_position"] - last_session_item["session_position"]

    session_track_features = extract_features_session_track(song_row, completed, skipped)

    # appending to the whole feature dictionary
    for k,v in session_track_features.items():
      dict_feat[k] = dict_feat[k]+[v]

  current_index += partial_length
  pbar.update(partial_length)

pbar.close()

for k,v in dict_feat.items():
  dataset.loc[:, k] = v

NameError: ignored

## Organize the dataset into samples. 

Let n be the maximal session length. With k = ⌊n/2].

As session lengths can vary, shorter sessions are padded with zeroes to match the size of the longest ones. Historical tracks Xh are padded before the track sequence, and future tracks Xf are padded after the track sequence.

Input of the network is made of 2 np.arrays-matrices:
- **session_histories** old_matrix R^ (f x k x t), where k is the first half of the features. Made of all the features of thefirst half of the tracks
- **session_futures** new_matrix: R^ (f x (n-k) x t), where n is the total length of features. It is made of all the features in the second half of the tracks (without their labels)

To do so, create 4 lists containing:
1. session_histories as np.array
2. session_futures as np.array
3. session_labels as lists
4. sizes as list of int

In [0]:
max_seq_len = 20

def process_sessions(data, sess_features, track_feature_list, label_index):
    session_histories = []
    session_futures   = []
    session_labels    = []
    session_sizes     = []
    

    for key, subgroup in tqdm(data.groupby('session_id', sort = False)):
        # Split session in two
        half = int(np.floor(len(subgroup) / 2.0))
        first = subgroup[sess_features + track_feature_list].head(half).values
        second = subgroup[sess_features + track_feature_list].tail(subgroup.shape[0] - half)
        current_labels = second[label_index].astype(int).values

        second = second.values
        
        # Generate padding for session history
        padding = np.zeros((int(max_seq_len / 2.0) - first.shape[0], len(sess_features) + len(track_feature_list)))
        
        #stack all in a np array
        session_history = np.vstack([padding,first])
        
        # Generate padding for session future
        padding = np.zeros((int(max_seq_len / 2.0) - second.shape[0], len(track_feature_list)))
        session_future = np.vstack([second[:,-len(track_feature_list):],padding])
        
        # Save results
        session_histories.append(session_history)
        session_futures.append(session_future)
        
        s_l = list(current_labels)
        s_l = [x[0] for x in current_labels]
        times=int(max_seq_len / 2.0) - current_labels.shape[0]
        if times > 0:
           for i in range(times):
              np.array(s_l.append(0))
        session_labels.append(s_l)
        session_sizes.append(len(current_labels))
            
    return session_histories, session_futures, session_labels, session_sizes 
        


In [0]:
max_seq_len = 20

def process_sessions_train(data, sess_features, track_feature_list, label_index):
    session_histories = []
    session_futures   = []
    session_labels    = []
    session_sizes     = []
    
    for key, subgroup in tqdm(data.groupby('session_id', sort = False)):
        # Split session in two
        half = int(np.floor(len(subgroup) / 2.0))
        first = subgroup[sess_features + track_feature_list].head(half).values
        second = subgroup[sess_features + track_feature_list].tail(subgroup.shape[0] - half)
        
         
        # Generate labels
        current_labels = second[label_index].astype(int).values
        
        second = second.values
        # Generate padding for session history
        padding = np.zeros((int(max_seq_len / 2.0) - first.shape[0], len(sess_features) + len(track_feature_list)))
        
        #stack all in a np array
        session_history = np.vstack([padding,first])
        
        # Generate padding for session future
        padding = np.zeros((int(max_seq_len / 2.0) - second.shape[0], len(track_feature_list)))
        session_future = np.vstack([second[:,-len(track_feature_list):],padding])
        
        # Save results
        session_histories.append(session_history)
        session_futures.append(session_future)
        
        
        s_l = list(current_labels)
        s_l = [x[0] for x in current_labels]
        times=int(max_seq_len / 2.0) - current_labels.shape[0]
        if times > 0:
            for i in range(times):
                np.array(s_l.append(0))
        session_labels.append(s_l)
        session_sizes.append(len(current_labels))

    return session_histories, session_futures, session_labels, session_sizes 



In [0]:
max_seq_len = 20

def process_sessions_test(data, sess_features, track_feature_list, label_index):
    session_histories = []
    session_futures   = []
    session_labels    = []
    session_sizes     = []
    
    for key, subgroup in tqdm(data.groupby('session_id', sort = False)):
        # Split session in two
        half = int(np.floor(len(subgroup) / 2.0))
        first = subgroup[sess_features + track_feature_list].head(half).values
        second = subgroup[sess_features + track_feature_list].tail(subgroup.shape[0] - half)
        
        label_shape = second.shape[0]
        second = second.values
        # Generate padding for session history
        padding = np.zeros((int(max_seq_len / 2.0) - first.shape[0], len(sess_features) + len(track_feature_list)))
        
        #stack all in a np array
        session_history = np.vstack([padding,first])
        
        # Generate padding for session future
        padding = np.zeros((int(max_seq_len / 2.0) - second.shape[0], len(track_feature_list)))
        session_future = np.vstack([second[:,-len(track_feature_list):],padding])
        
        # Save results
        session_histories.append(session_history)
        session_futures.append(session_future)
        
        session_sizes.append(label_shape)

    return session_histories, session_futures, session_sizes


In [0]:
sf = list(data.columns.values)
for el in ["track_id_clean", "session_id", "train_flag"]:
    sf.remove(el)
tf_f = list(tf.columns.values)
for el in ["track_id_clean"]: 
    tf_f.remove(el)


In [0]:
#Prepare training dataset and save it

train_dataset = dataset[dataset.train_flag == True]
history_train, future_train, labels_train, session_len_train = process_sessions_train(train_dataset, sf, tf_f, ["label"])
#Save

# Encode sessions as numpy arrays
history_train = np.asarray(history_train)
future_train = np.asarray(future_train)
labels_train = np.asarray(labels_train)
session_len_train = np.asarray(session_len_train)

# Dump to file
np.savez_compressed(processedpath + "/trainprova.npz",
    history_train  = history_train,
    future_train = future_train,
    labels_train = labels_train,
    session_len_train = session_len_train)

In [0]:
#Prepare training dataset and save it
test_dataset = dataset.loc[dataset['train_flag'] == False]
history_test, future_test, session_len_test = process_sessions_test(test_dataset, sf, tf_f, ["label"])

#Save

# Encode sessions as numpy arrays
history_test = np.asarray(history_test)
future_test = np.asarray(future_test)
session_len_test = np.asarray(session_len_test)

# Dump to file
np.savez_compressed(processedpath + "/test.npz",
    history_test  = history_test,
    future_test = future_test,
    session_len_test = session_len_test)

100%|██████████| 58/58 [00:00<00:00, 271.49it/s]


### Problem: concept drift: must order by data and split according to that

As we are dealing with temporal data, new songs start to appear throughout the data, songs’ popularities change over time, et cetera. This phenomenon, widely known as concept drift, can heavily influence the performance of learning algorithms, if not dealt with properly.

Links: G. Widmer and M. Kubat. 1996. Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning 23, 1 (01 Apr 1996), 69–101.

In [0]:
""" #split todo
def getData(percentage=0.9):
    restored = np.load(processedpath + "/train2.npz")
    history = restored['history']
    future = restored['future']
    labels = restored['labels']
    session_len = restored['session_len']
    
    # Make sure we shuffle classes
    indices = np.arange(len(history))
    np.random.shuffle(indices)
    history = history[indices]
    future = future[indices]
    labels = labels[indices]
    session_len = session_len[indices]
        
    num = int(len(indices)*percentage)
    
    del restored
    return history[:num], history[num:], future[:num], future[num:], labels[:num], labels[num:], session_len[:num], session_len[num:]

train_history, test_history, train_future, test_future, train_labels, test_labels, train_session_len, test_session_len = getData(percentage=0.6)

"""

' #split todo\ndef getData(percentage=0.9):\n    restored = np.load(processedpath + "/train2.npz")\n    history = restored[\'history\']\n    future = restored[\'future\']\n    labels = restored[\'labels\']\n    session_len = restored[\'session_len\']\n    \n    # Make sure we shuffle classes\n    indices = np.arange(len(history))\n    np.random.shuffle(indices)\n    history = history[indices]\n    future = future[indices]\n    labels = labels[indices]\n    session_len = session_len[indices]\n        \n    num = int(len(indices)*percentage)\n    \n    del restored\n    return history[:num], history[num:], future[:num], future[num:], labels[:num], labels[num:], session_len[:num], session_len[num:]\n\ntrain_history, test_history, train_future, test_future, train_labels, test_labels, train_session_len, test_session_len = getData(percentage=0.6)\n\n'

# Loading saving dataset

In [0]:
import numpy as np
# save np.load
np_load_old = np.load

In [0]:
np.load = np_load_old

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)


restored = np.load(processedpath + "/trainJuly.npz")
train_history = restored['history_train']
train_future = restored['future_train']
train_labels = restored['labels_train']
train_session_len = restored['session_len_train']

del restored

# restore np.load for future normal usage
np.load = np_load_old

In [0]:
np.load = np_load_old
import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

restored2 = np.load(processedpath + "/test.npz")
test_history = restored2['history_test']
test_future = restored2['future_test']
test_session_len = restored2['session_len_test']

del restored2

# restore np.load for future normal usage
np.load = np_load_old

# 4- Model Fitting ad Evaluation Metrics 

To evaluate the performance of our solution we are going to use the same metrics proposed for this competition. 

The primary one is the Mean Average Accuracy:

\begin{equation*}
AA   =  \sum_{i=1}^{T} \frac{A(i)L(i)}{T}
\end{equation*}

The metric aims to weight higher the correct good prediction, instead of predicting the


- T is the number of tracks to be predicted for the given session
- A(i) is the accuracy at position i of the sequence
- L(i) is the boolean indicator for if the i‘th prediction was correct

Moreover, we are going to use the accuracy for the second half of the dataset


In [0]:
def evaluation_MAP_FPA(sizes, label_matrix, prediction_matrix):
    '''
    Return MAA and first-prediction-accuracy
    :param sizes:    A list of session sizes
    :param labels:      The correct labels list [ [1,0,1,1,0,... size_of_truth_labels], ...]
    :param predictions: The predicted labels list [ [1,0,1,0,0,... size_of_predicted_labels], ...]
    :returns:           Mean Average Precision and First Prediction Accuracy
    '''
    # Set variables
    score = 0.0
    first_acc = 0.0
    # For every session
    for size, label_row, prediction_row in zip(sizes, label_matrix, prediction_matrix):
        # Set variables
        n_correct    = 0
        session      = 0.0
        # For the first 'size' predictions
        for i in range(size):        
            # If the prediction is correct:
            if label_row[i] == prediction_row[i]:
                # Increase counter of correct predictions
                n_correct += 1
                session += n_correct / (i + 1)
                # If first prediction
                if i == 0:
                    first_acc += 1
        # Save session score
        score += session / size
    return score/sizes.shape[0], first_acc/sizes.shape[0]


## Model Description

RNN architecture: to capture the time-correlation among data input LSTM layers are used to model the input sequences.
Instead, dense layers are connecet to future input tracks, the ones needed to be predicted.
Input sequences are modelled using doubly
The outputs of these LSTM and dense layers are concatenated and fed into the tail of the network. All dense layers are followed by
exponential linear units (ELUs).

Avoid overfitting: dropout is applied, and as well batch normalization to allow more effective weights updates. When training, we have also adopted early stopping, when the results do not improve for 5 epochs consecutively the best weights are 

Personalization of the loss measure: since the challenge gives more importance to the accuracy of the prediction of the first tracks in the session, we have opt to use a weighted variant of binary cross-entropy as loss function.

In [0]:
from tensorflow.keras import losses
from tensorflow.keras.utils import plot_model

def generate_model(history, future):
    h_input  = tf.keras.layers.Input(shape = (history.shape[1], history.shape[2]))
    h_subnet = tf.keras.layers.Dropout(.35)(h_input)
    h_subnet = tf.compat.v1.keras.layers.CuDNNLSTM(256, return_sequences = True)(h_subnet)
    h_subnet = tf.compat.v1.keras.layers.CuDNNLSTM(256)(h_subnet)

    f_input  = tf.keras.layers.Input(shape = (future.shape[1], future.shape[2]))
    f_subnet = tf.keras.layers.Dropout(.35)(f_input)
    f_subnet = tf.compat.v1.keras.layers.CuDNNLSTM(128, return_sequences = True)(f_subnet)
    f_subnet = tf.compat.v1.keras.layers.CuDNNLSTM(128)(f_subnet)

    td_subnet = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(32))(f_input)
    td_subnet = tf.keras.layers.Flatten()(td_subnet)

    concat = tf.keras.layers.concatenate([h_subnet, f_subnet, td_subnet])
    concat = tf.keras.layers.Dropout(.25)(concat)

    dense1  = tf.keras.layers.Dense(512, activation = 'elu')(concat)
    dense1  = tf.keras.layers.BatchNormalization()(dense1)
    dense2  = tf.keras.layers.Dense(512, activation = 'elu')(dense1)

    outputs = [tf.keras.layers.Dense(1, activation = 'sigmoid')(dense2) for _ in range(10)]

    model = tf.keras.Model(inputs = [h_input, f_input], outputs = outputs)
    
    return model

In [12]:
from tensorflow.keras import layers, utils, Model, Input
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout

# k-fold Cross-validation grouped on sessions
k = 5
n_epochs = 50
test_predictions = []
all_maps = []

# Generate model
#plot_model(model)
for fold_id, (train_idx, valid_idx) in enumerate(KFold(n_splits = k).split(train_history)):
    print('{0}\t----- FOLD {1} -----'.format(datetime.datetime.now(),fold_id))
    # Filter out training and testing data
    h_train = train_history[train_idx]
    h_valid = train_history[valid_idx]
    f_train = train_future[train_idx]
    f_valid = train_future[valid_idx]
    l_train = train_labels[train_idx]
    l_valid = train_labels[valid_idx]
    s_train = train_session_len[train_idx]
    s_valid = train_session_len[valid_idx]

    # Loss weights
    weights = np.asarray([(1 / val + sum((1 / (2*n)) for n in range(val + 1,11))) for val in range(1,11)])
    weights /= weights.max()

    model = generate_model(train_history, train_future)
    model.compile(loss='binary_crossentropy',
                                   optimizer = tf.keras.optimizers.Adam(lr = 0.002, amsgrad = True),
                                   metrics=['binary_accuracy'],
                                   loss_weights = weights.tolist())
    
    # Early stopping
    best_map = .0
    best_weights = None
    best_epoch = 0

    # For every epoch
    for epoch_id in range(n_epochs):
        model.fit([h_train, f_train], [l_train[:,i] for i in range(10)],
                  validation_data = ([h_valid, f_valid], [l_valid[:,i] for i in range(10)]),
                  batch_size = 2048, epochs = 1, verbose = 2)
        p_valid = model.predict([h_valid, f_valid], batch_size = 4096)
        MAP, FPA = evaluation_MAP_FPA(s_valid, l_valid, np.swapaxes(np.round(p_valid),0,1))
        print('{0}\t\tValid\tMAP:\t{1}\tFPA:\t{2}'.format(datetime.datetime.now(),MAP, FPA))
        if MAP > best_map:
            best_map = MAP
            best_epoch = epoch_id
            best_weights = parallel_model.get_weights()
            # save model architecture on file
            model.save_weights(modelpath + '/model_best_epoch_{}.h5'.format(epoch_id+1))
        elif epoch_id - best_epoch >= 5:
            break

        print('=========================================================')
        print('{0}\t\tStopping at epoch {1}, best epoch was {2} with MAP {3}'.format(datetime.datetime.now(),epoch_id, best_epoch, best_map))
        print('=========================================================')
        all_maps.append(best_map)

        print('{0}\t\tPredicting for test set...'.format(datetime.datetime.now()))
        # Reload best weights
        model.set_weights(best_weights)

    # Predict for test set
    p_test = model.predict([test_history, test_future], batch_size = 4096)
    test_predictions.append(np.swapaxes(p_test,0,1))

    print('=========================================================')
    print('{0}\t\tAverage best MAP over all folds:\t\t{1}...'.format(datetime.datetime.now(), np.mean(all_maps)))
    print('=========================================================')
    #print('{0}\t\tGenerating submission...'.format(datetime.datetime.now()))
    # Geometric mean of predictions over folds
    p_test = np.prod(test_predictions, axis = 0) ** (1.0 / len(test_predictions))

2020-01-12 14:32:47.954826	----- FOLD 0 -----
Train on 285011 samples, validate on 71253 samples
285011/285011 - 20s - loss: nan - dense_16_loss: nan - dense_17_loss: nan - dense_18_loss: nan - dense_19_loss: nan - dense_20_loss: nan - dense_21_loss: nan - dense_22_loss: nan - dense_23_loss: nan - dense_24_loss: nan - dense_25_loss: nan - dense_16_binary_accuracy: 0.4545 - dense_17_binary_accuracy: 0.4566 - dense_18_binary_accuracy: 0.4616 - dense_19_binary_accuracy: 0.4758 - dense_20_binary_accuracy: 0.5032 - dense_21_binary_accuracy: 0.5552 - dense_22_binary_accuracy: 0.6128 - dense_23_binary_accuracy: 0.6579 - dense_24_binary_accuracy: 0.6955 - dense_25_binary_accuracy: 0.7229 - val_loss: nan - val_dense_16_loss: nan - val_dense_17_loss: nan - val_dense_18_loss: nan - val_dense_19_loss: nan - val_dense_20_loss: nan - val_dense_21_loss: nan - val_dense_22_loss: nan - val_dense_23_loss: nan - val_dense_24_loss: nan - val_dense_25_loss: nan - val_dense_16_binary_accuracy: 0.4545 - val_

TypeError: ignored

In [13]:
#Import the model with the highest performances
epochs = 14

weights = np.asarray([(1 / val + sum((1 / (2*n)) for n in range(val + 1,11))) for val in range(1,11)])
weights /= weights.max()

new_model = generate_model(train_history, train_future)
new_model.load_weights(modelpath + '/model_weights_epoch_{}.h5'.format(epochs))
new_model.compile(loss='binary_crossentropy',
                                   optimizer = tf.keras.optimizers.Adam(lr = 0.002, amsgrad = True),
                                   metrics=['accuracy'])


ValueError: ignored

In [0]:
prediction = new_model.predict([test_history, test_future], batch_size = 4096)
MAP, FPA = evaluation_MAP_FPA(train_session_len, train_labels, np.swapaxes(np.round(prediction),0,1))

In [0]:
print("Performances--> MAP: {} | FPA: {}".format(MAP,FPA))

Performances--> MAP: 0.5793433557729396 | FPA: 0.7734857745231073


### Not bad!

With only few samples from the dataset and a good model we are able to perform a score similar to ~10th position of the rank in the global Leaderboard!

# 5- Concept Drift

*Definition*

Concept drift is a phenomena which occurs when dataset statistic properties change over time. This negative influnce predictive models that assume a statistic relathionship between input and output variables. When the concept drift occurs, the predictive performance can degrade significantly.


Even in this music streaming context, the presence of concept drift is tangible. For example, everyday new songs are published, the popularity vary, taste of users evolve over time, etc.

To tackle this challange, and as well limit the training size, the solution proposed by the 5th classified of the Spotify challange suggests to train the model every day only on the log for that specific day.

In order to understand how concept drift effects the performances of our model, we have decided to compare MAP e FPA metrics on both the proper test set, which date corresponds to the training dataset used to compute the optimal weights, and a test set obtained with the information of two months later.

In [0]:
del train_future_a
del train_history_a
del train_labels_a
del train_session_len_a

del test_future_a
del test_history_a
del test_session_len_a

In [0]:
#Load dataset test of September

np.load = np_load_old

import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)


restored = np.load(processedpath + "/trainSept.npz")
train_history_a = restored['history_train']
train_future_a = restored['future_train']
train_labels_a = restored['labels_train']
train_session_len_a = restored['session_len_train']

del restored

# restore np.load for future normal usage
np.load = np_load_old

In [0]:
train_future_a.shape

(175989, 10, 45)

In [0]:
np.load = np_load_old

import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

restored2 = np.load(processedpath + "/testSept.npz")
test_history_a = restored2['history_test']
test_future_a = restored2['future_test']
test_session_len_a = restored2['session_len_test']

del restored2

# restore np.load for future normal usage
np.load = np_load_old

## Performance on 2018/07/15

In [16]:
train_future.shape

(178342, 10, 45)

In [17]:
#Import the model with the highest performances
best_epochs = 14

weights = np.asarray([(1 / val + sum((1 / (2*n)) for n in range(val + 1,11))) for val in range(1,11)])
weights /= weights.max()

new_model = generate_model(train_history, train_future)
new_model.load_weights(modelpath + '/model_weights_epoch_{}.h5'.format(best_epochs))

new_model.compile(loss='binary_crossentropy',
                                   optimizer = tf.keras.optimizers.Adam(lr = 0.002, amsgrad = True),
                                   metrics=['accuracy'])


ValueError: ignored

In [0]:
prediction = new_model.predict([test_history, test_future], batch_size = 4096)
MAP, FPA = evaluation_MAP_FPA(train_session_len, train_labels, np.swapaxes(np.round(prediction),0,1))

In [0]:
print("Performances--> MAP: {} | FPA: {}".format(MAP,FPA))

Performances--> MAP: 0.5793433557729396 | FPA: 0.7734857745231073


# CONCEPT DRIFT - Prediction on a later day
## Performance on 2018/09/15


In [0]:
#Import the model with the highest performances
epochs = 1

weights = np.asarray([(1 / val + sum((1 / (2*n)) for n in range(val + 1,11))) for val in range(1,11)])
weights /= weights.max()

new_model = generate_model(test_history_a, test_future_a)
new_model.load_weights(modelpath + '/model_weights_epoch_{}.h5'.format(epochs))
new_model.compile(loss='binary_crossentropy',
                                   optimizer = tf.keras.optimizers.Adam(lr = 0.002, amsgrad = True),
                                   metrics=['accuracy'])

In [0]:
prediction = new_model.predict([test_history_a, test_future_a], batch_size = 4096)
MAPa, FPAa = evaluation_MAP_FPA(train_session_len_a, train_labels_a, np.swapaxes(np.round(prediction),0,1))

In [0]:
print("Performances--> MAP: {} | FPA: {}".format(MAPa,FPAa))

Performances--> MAP: 0.0 | FPA: 0.0


# Feature engineered dataset: New training & predictions

todo