**Table of contents**<a id='toc0_'></a>    
- 1. [Download data](#toc1_)    
- 2. [Read and display data](#toc2_)    
- 3. [Matrix Factorization model using alternating least squares](#toc3_)    
  - 3.1. [Results](#toc3_1_)    
- 4. [Multiclass classification model using neural networks](#toc4_)    
  - 4.1. [extending item model considering tracks features](#toc4_1_)    
    - 4.1.1. [retrieve data from spotify api](#toc4_1_1_)    
    - 4.1.2. [prepare features](#toc4_1_2_)    
    - 4.1.3. [build the item model](#toc4_1_3_)    
  - 4.2. [extending user model considering playlist name](#toc4_2_)    
    - 4.2.1. [prepare features](#toc4_2_1_)    
    - 4.2.2. [build the user model](#toc4_2_2_)    
  - 4.3. [Define metrics and loss](#toc4_3_)    
  - 4.4. [Combine user and item models](#toc4_4_)    
  - 4.5. [Results](#toc4_5_)    
    - 4.5.1. [Comparison to MF model](#toc4_5_1_)    
    - 4.5.2. [Using activation function](#toc4_5_2_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

– Describe the problem and the solution
– Describe the results
– Compare it with other approaches
– Discuss benefits and limitations

# 1. <a id='toc1_'></a>[Download data](#toc0_)

The Spotify Million Playlist Dataset, available at https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge, contains 1,000,000 playlists, including playlist titles and track titles, created by users on the Spotify platform between January 2010 and October 2017.

In [1]:
import tensorflow as tf
import os
import json
import numpy as np
import pandas as pd
from pathlib import Path
from scipy.sparse import coo_matrix
from implicit.als import AlternatingLeastSquares
from implicit.nearest_neighbours import bm25_weight
from implicit import evaluation
from tqdm import tqdm
from zipfile import ZipFile




# 2. <a id='toc2_'></a>[Read and display data](#toc0_)


The playlists are extracted into the `mdp/data` folder.

In [3]:
# with ZipFile('spotify_million_playlist_dataset.zip', 'r') as zObject:
#     zObject.extractall(
#         path='mdp')

slices_list = sorted(os.listdir('mdp/data'))
len(slices_list)

1000

Take the first slice (1 × 1000 = 1000 playlists) and count the number of unique songs.

In [55]:
items = set()   # unique track IDs
users = []      # one list of track IDs per playlist
for slice_name in slices_list[:1]:
    with open(os.path.join(os.getcwd(), 'mdp/data', slice_name), 'r') as f:
        js = json.load(f)
    for pl in js['playlists']:
        # track_uri looks like 'spotify:track:<id>'; [14:] drops the 14-character prefix
        track_ids = [tr['track_uri'][14:] for tr in pl['tracks']]
        users.append(track_ids)
        items.update(track_ids)
print(len(users), len(items))


1000 34443
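The `[14:]` slice used above removes the `spotify:track:` prefix, which is 14 characters long. A minimal sketch of the same extraction on a hand-made slice follows; the toy playlists and the helper name `track_id` are hypothetical, and only the `playlists` / `tracks` / `track_uri` layout is taken from the loading code.

```python
PREFIX = "spotify:track:"

def track_id(track_uri):
    """Return the bare track ID, checking the prefix instead of slicing blindly."""
    if not track_uri.startswith(PREFIX):
        raise ValueError(f"unexpected URI: {track_uri!r}")
    return track_uri[len(PREFIX):]

# Hypothetical slice with two tiny playlists.
toy_slice = {
    "playlists": [
        {"tracks": [{"track_uri": "spotify:track:7KXjTSCq5nL1LoYtL7XAwS"},
                    {"track_uri": "spotify:track:1xznGGDReH1oQq0xzbwXa3"}]},
        {"tracks": [{"track_uri": "spotify:track:7KXjTSCq5nL1LoYtL7XAwS"}]},
    ]
}

# Same structure as `users` and `items`: a list per playlist and a set of unique IDs.
toy_users = [[track_id(t["track_uri"]) for t in pl["tracks"]]
             for pl in toy_slice["playlists"]]
toy_items = {tid for playlist in toy_users for tid in playlist}

print(len(PREFIX), len(toy_users), len(toy_items))
```

Checking the prefix explicitly makes the code fail loudly if a slice ever contains a URI of another kind, such as an episode.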


In [44]:
util_matrices_dir = os.path.join(os.getcwd(),'util_matrices')
if not os.path.isdir(util_matrices_dir):
    os.makedirs(util_matrices_dir)

In [None]:
files = list(Path(util_matrices_dir).glob(f'*.csv'))
for f in files:
    os.remove(f)  

In [8]:
ratings = pd.DataFrame(columns=list(items))
for i,lst in enumerate(tqdm(users)):
    ratings.loc[len(ratings)]=0
    ratings.iloc[len(ratings)-1,np.where(ratings.columns.isin(lst))] = 1
    filename = str(len(users))+'x'+str(len(items))+'_part_'+str(i)+'.csv'
    ratings.to_csv(Path(util_matrices_dir)/filename)
    ratings.drop(ratings.index,inplace=True) 
len(ratings)

100%|██████████| 1000/1000 [01:11<00:00, 13.93it/s]


0

In [9]:
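files = list(Path(util_matrices_dir).glob('*.csv'))
# Without index_col=0 the saved row index is read back as an extra 'Unnamed: 0' column,
# which is why the shape below has 34444 columns instead of 34443.
ratings = pd.concat([pd.read_csv(f) for f in tqdm(files)])
ratings.shape 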

100%|██████████| 1000/1000 [12:19<00:00,  1.35it/s]


(1000, 34444)

We assume that each playlist corresponds to a single user (the MPD description does not state this), so the playlists × tracks table can serve as the utility (user/item) matrix.
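The utility matrix can also be described by the coordinates of its non-zero entries alone, which avoids materializing a dense 1000 × 34443 table. A minimal sketch in plain Python with hypothetical toy playlists follows; the three lists it produces have the `(data, (row, col))` layout that `scipy.sparse.coo_matrix` accepts.

```python
toy_users = [["a", "b"], ["b", "c", "d"], ["a"]]   # hypothetical playlists

# Fix a column order for the tracks.
all_tracks = sorted({tid for playlist in toy_users for tid in playlist})
item_index = {tid: j for j, tid in enumerate(all_tracks)}

rows, cols, data = [], [], []
for u, playlist in enumerate(toy_users):
    for tid in set(playlist):          # a track counts once per playlist
        rows.append(u)
        cols.append(item_index[tid])
        data.append(1)

shape = (len(toy_users), len(item_index))
print(shape, len(data))
# With SciPy available: coo_matrix((data, (rows, cols)), shape=shape).tocsr()
```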

In [47]:
print(f"Matrix density is {np.count_nonzero(ratings) / (ratings.shape[0]*ratings.shape[1]):%}")

Matrix density is 0.193709%
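The value printed above is the fraction of non-zero cells, that is the density; the sparsity is its complement, about 99.81% here. A small worked example on a hypothetical 3 × 4 matrix:

```python
toy = [
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
]

n_cells = len(toy) * len(toy[0])
n_nonzero = sum(1 for row in toy for v in row if v != 0)

density = n_nonzero / n_cells      # share of cells that hold an interaction
sparsity = 1 - density             # share of empty cells

print(f"density = {density:.2%}, sparsity = {sparsity:.2%}")
```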


Rank the 30 most popular songs

In [36]:
pop_songs = ratings[ratings!=0].count(axis=0)

pop_songs = pd.DataFrame( data= pop_songs.sort_values(ascending=False)[:30], columns =['Nr. of playlists'])
display(pop_songs)

Track ID,Nr. of playlists
7KXjTSCq5nL1LoYtL7XAwS,52
1xznGGDReH1oQq0xzbwXa3,50
7yyRTcZmCiyzzJlNzGC9Ol,49
7BKLCZ1jbUBVqRi2FVlTVw,45
3a1lNhkSLSkpJE4MSHpDu9,44
0QsvXIfqM0zZoerQfsI9lm,41
6O6M7pJLABmfBRoGZMu76Y,39
2EEeOnHehOozLq4aS0n6SL,39
0VgkVdmE4gld66l8iyGjgx,38
0SGkqnVQo9KPytSri1H6cF,38


# 3. <a id='toc3_'></a>[Matrix Factorization model using alternating least squares](#toc0_)


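The model follows the implicit-feedback formulation of Hu, Koren and Volinsky. Every user $u$ and item $i$ gets a latent factor vector ($X_u$ and $Y_i$), and the factors are fitted by minimizing a confidence-weighted squared error. Alternating least squares solves for the user factors with the item factors fixed, then for the item factors with the user factors fixed, and repeats.

$$loss = \sum_u \sum_i C_{ui}(P_{ui} - X_u^T Y_i)^2 + \lambda \left(\sum_u \|X_u\|^2 + \sum_i \|Y_i\|^2\right)$$
$$C_{ui} = 1 + \alpha r_{ui}$$

Here $P_{ui}$ is the binary preference (1 if $r_{ui} > 0$, 0 otherwise), $C_{ui}$ is the confidence in that preference, and $\lambda$ is the regularization strength.

In the reference paper, $r_{ui}$ indicates how many times $u$ fully watched show $i$. For example, $r_{ui} = 0.7$ indicates that $u$ watched 70% of the show, while for a user who watched the show twice $r_{ui} = 2$. In this notebook, $r_{ui} = 1$ if track $i$ appears in playlist $u$ and 0 otherwise.

References:
http://yifanhu.net/PUB/cf.pdf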
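To make the objective concrete, here is a minimal sketch that evaluates the confidence-weighted loss for given factors in plain Python. The function name `als_loss` and the toy matrices are hypothetical, and the `implicit` library does not compute the loss this way internally; the sketch only mirrors the formula.

```python
def als_loss(R, X, Y, alpha, lam):
    """Confidence-weighted squared error plus L2 regularization."""
    loss = 0.0
    for u, row in enumerate(R):
        for i, r_ui in enumerate(row):
            p_ui = 1.0 if r_ui > 0 else 0.0           # binary preference
            c_ui = 1.0 + alpha * r_ui                  # confidence
            pred = sum(xf * yf for xf, yf in zip(X[u], Y[i]))
            loss += c_ui * (p_ui - pred) ** 2
    reg = sum(v * v for vec in X for v in vec) + sum(v * v for vec in Y for v in vec)
    return loss + lam * reg

R = [[1, 0], [0, 1]]             # hypothetical 2 playlists x 2 tracks
X = [[1.0, 0.0], [0.0, 1.0]]     # user (playlist) factors
Y = [[1.0, 0.0], [0.0, 1.0]]     # item (track) factors

# The factors reproduce R exactly, so only the regularization term remains.
print(als_loss(R, X, Y, alpha=2.0, lam=0.05))
```

With `alpha = 0` every cell has confidence 1, so observed and unobserved entries weigh the same in the fit.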

In [18]:
# transform into a SciPy sparse matrix (COO, then CSR)
ratings_sparse = coo_matrix(ratings.values).tocsr()

# BM25 weighting for popular items (computed here, but the split below uses the unweighted matrix)
ratings_sparse_weighted =  bm25_weight(ratings_sparse, K1=100, B=0.8)

# split train/test
train,test = evaluation.train_test_split(ratings_sparse)

## 3.1. <a id='toc3_1_'></a>[Results](#toc0_)

In [17]:
results=[]
for i in range(10):
    print('Alpha = '+str(i))
    model = AlternatingLeastSquares(factors=64, regularization=0.05, alpha=i)
    model.fit(train)
    results.append(evaluation.precision_at_k(model,train,test,show_progress=True, K=100))
results

Alpha = 0


100%|██████████| 15/15 [00:15<00:00,  1.01s/it]
100%|██████████| 983/983 [00:00<00:00, 2590.33it/s]


Alpha = 1


100%|██████████| 15/15 [00:09<00:00,  1.60it/s]
100%|██████████| 983/983 [00:00<00:00, 2027.34it/s]


Alpha = 2


100%|██████████| 15/15 [00:09<00:00,  1.60it/s]
100%|██████████| 983/983 [00:00<00:00, 2123.07it/s]


Alpha = 3


100%|██████████| 15/15 [00:09<00:00,  1.60it/s]
100%|██████████| 983/983 [00:00<00:00, 2423.64it/s]


Alpha = 4


100%|██████████| 15/15 [00:09<00:00,  1.58it/s]
100%|██████████| 983/983 [00:00<00:00, 2436.20it/s]


Alpha = 5


100%|██████████| 15/15 [00:09<00:00,  1.51it/s]
100%|██████████| 983/983 [00:00<00:00, 2018.21it/s]


Alpha = 6


100%|██████████| 15/15 [00:11<00:00,  1.32it/s]
100%|██████████| 983/983 [00:00<00:00, 1869.45it/s]


Alpha = 7


100%|██████████| 15/15 [00:14<00:00,  1.03it/s]
100%|██████████| 983/983 [00:00<00:00, 1832.91it/s]


Alpha = 8


100%|██████████| 15/15 [00:10<00:00,  1.48it/s]
100%|██████████| 983/983 [00:00<00:00, 1923.77it/s]


Alpha = 9


100%|██████████| 15/15 [00:10<00:00,  1.47it/s]
100%|██████████| 983/983 [00:00<00:00, 1772.33it/s]


[0.0032784442291930557,
 0.14395350570002236,
 0.16727516578496388,
 0.1724163624171075,
 0.17569480664630058,
 0.1794948215483198,
 0.1788987407793756,
 0.18061247299009014,
 0.18143208404738842,
 0.1817301244318605]
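The list above holds precision@100 for `alpha = 0 ... 9`. A short sketch that picks the best value and looks at the step-to-step gains, using exactly the numbers printed above:

```python
precision_at_100 = [
    0.0032784442291930557, 0.14395350570002236, 0.16727516578496388,
    0.1724163624171075, 0.17569480664630058, 0.1794948215483198,
    0.1788987407793756, 0.18061247299009014, 0.18143208404738842,
    0.1817301244318605,
]

# Index of the highest score; the index equals alpha because alpha ran over range(10).
best_alpha = max(range(len(precision_at_100)), key=precision_at_100.__getitem__)

# Gain obtained by moving from alpha to alpha + 1.
gains = [b - a for a, b in zip(precision_at_100, precision_at_100[1:])]

print(best_alpha, round(precision_at_100[best_alpha], 4))
print([round(g, 4) for g in gains])
```

The largest jump is from `alpha = 0` to `alpha = 1`; after `alpha = 3` each step changes the score by less than 0.005, and the step from 5 to 6 is slightly negative.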

# 4. <a id='toc4_'></a>[Multiclass classification model using neural networks](#toc0_)

EXPLANATION


## 4.1. <a id='toc4_1_'></a>[extending item model considering tracks features](#toc0_)

### 4.1.1. <a id='toc4_1_1_'></a>[retrieve data from spotify api](#toc0_)

reference: https://developer.spotify.com/documentation/web-api/reference/get-several-audio-features

In [48]:
api_download_dir = os.path.join(os.getcwd(), 'api_download/')
if not os.path.isdir(api_download_dir):
    os.makedirs(api_download_dir)

In [49]:
BATCH_SIZE = 50   # the endpoint accepts at most 100 IDs per request
OAUTH_TOKEN = ''  # paste a valid Spotify Web API access token here; never commit it

item_list = list(items)
chunks = (len(item_list) - 1) // BATCH_SIZE + 1
for i in range(chunks):
    batch = ','.join(item_list[i*BATCH_SIZE:(i+1)*BATCH_SIZE])
    filename = 'chunk_' + str(i) + '.json'
    # IPython substitutes {batch}, {OAUTH_TOKEN} and {filename} before running the shell command
    !curl -s -X "GET" "https://api.spotify.com/v1/audio-features?ids={batch}" -H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer {OAUTH_TOKEN}" | jq --raw-output '.' > api_download/{filename}
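The batching logic of the download cell can be checked without calling the API. The sketch below only splits IDs into batches and builds the request URLs; the helper names `chunked` and `request_url` and the toy IDs are hypothetical, while the endpoint is the one used above.

```python
from urllib.parse import urlencode

AUDIO_FEATURES_URL = "https://api.spotify.com/v1/audio-features"

def chunked(ids, batch_size):
    """Yield consecutive slices of at most batch_size IDs."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

def request_url(batch):
    # safe="," keeps the comma-separated ID list unescaped in the query string
    return AUDIO_FEATURES_URL + "?" + urlencode({"ids": ",".join(batch)}, safe=",")

toy_ids = [f"id{n}" for n in range(7)]     # hypothetical track IDs
urls = [request_url(b) for b in chunked(toy_ids, batch_size=3)]

print(len(urls))
for url in urls:
    print(url)
```

Seven IDs with a batch size of 3 give three requests, which matches `(len(items) - 1) // BATCH_SIZE + 1` from the cell above.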

### 4.1.2. <a id='toc4_1_2_'></a>[prepare features](#toc0_)

In [51]:
track_features_csv_name = 'track_features.csv'
feature_columns = ['key', 'loudness', 'tempo']

if os.path.exists(os.path.join(os.getcwd(), track_features_csv_name)):
    track_features = pd.read_csv(track_features_csv_name, index_col=0)
else:
    rows = {}  # track ID -> selected audio features
    for chunk_file in os.listdir('api_download'):
        with open(os.path.join(os.getcwd(), 'api_download', chunk_file), 'r') as f:
            for features in json.load(f)['audio_features']:
                # the API returns null for tracks without audio features; skip those
                if features:
                    rows[features['id']] = {c: features[c] for c in feature_columns}
    # 33754 < 34443 because of nulls
    track_features = pd.DataFrame.from_dict(rows, orient='index', columns=feature_columns)
    track_features.to_csv(track_features_csv_name)
display(track_features)

Track ID,key,loudness,tempo
4Fpq4QkR06QRDkujBUk0JY,4,-7.371,138.389
5x5AACQgT2yK4VWwNn1bjO,0,-6.246,93.716
5IcMce4p9SbPwWeN4g4Oip,9,-12.957,99.494
6cWlmDC8vU2WOMEID6ZL5K,4,-13.794,175.797
3DTy1olOsyxfLmgcym1PRC,8,-6.843,83.212
...,...,...,...
0NNKB1vQpy4eTeRFdalIDA,5,-12.224,99.965
0B7wvvmu9EISAwZnOpjhNI,0,-8.688,145.494
2cViIXIe8Pbd1sOJExMJlK,1,-4.861,115.993
1EZ0V24RlTzzwTM6FCNpWO,2,-9.138,152.921
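Loudness and tempo live on very different scales, so they are usually standardized before being fed to a network, while `key` is a category rather than a quantity. A minimal sketch of the standardization on the first five rows displayed above, using only the standard library; the helper name `standardize` is hypothetical, and a Keras `Normalization` layer would learn the mean and variance from the full column instead.

```python
from statistics import mean, pstdev

# First five rows of the table above: track ID -> (key, loudness, tempo)
rows = {
    "4Fpq4QkR06QRDkujBUk0JY": (4, -7.371, 138.389),
    "5x5AACQgT2yK4VWwNn1bjO": (0, -6.246, 93.716),
    "5IcMce4p9SbPwWeN4g4Oip": (9, -12.957, 99.494),
    "6cWlmDC8vU2WOMEID6ZL5K": (4, -13.794, 175.797),
    "3DTy1olOsyxfLmgcym1PRC": (8, -6.843, 83.212),
}

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

loudness_z = standardize([r[1] for r in rows.values()])
tempo_z = standardize([r[2] for r in rows.values()])

for tid, lz, tz in zip(rows, loudness_z, tempo_z):
    print(tid, round(lz, 3), round(tz, 3))
```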


In [59]:
list(range(len(users)))


In [60]:
# Playlist indices are integers, so IntegerLookup is used here;
# StringLookup would need a vocabulary of strings.
tf.keras.Sequential([
        tf.keras.layers.IntegerLookup(
            vocabulary=list(range(len(users))), mask_token=None),
        tf.keras.layers.Embedding(len(users) + 1, 32),
    ])
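A lookup layer turns each ID into a row index of the embedding table. With one out-of-vocabulary slot, N known IDs need N + 1 rows, which is where `len(users) + 1` comes from. The plain-Python sketch below mimics that mapping; the helper `build_lookup` is hypothetical and is not the Keras implementation. `StringLookup` itself expects strings, so integer IDs have to be converted with `str` or handled by `IntegerLookup`.

```python
def build_lookup(vocabulary):
    """Map each vocabulary entry to 1..N; index 0 is reserved for unknown values."""
    table = {token: idx for idx, token in enumerate(vocabulary, start=1)}
    return lambda token: table.get(token, 0)

playlist_ids = [str(i) for i in range(5)]   # string IDs; the size is hypothetical
lookup = build_lookup(playlist_ids)

indices = [lookup(t) for t in ["0", "4", "unseen"]]
embedding_rows = len(playlist_ids) + 1      # N known IDs plus the unknown slot

print(indices, embedding_rows)
```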

### 4.1.3. <a id='toc4_1_3_'></a>[build the item model](#toc0_)

In [None]:
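class ItemModel(tf.keras.Model):
  # Track (item) tower. Assumption: `inputs` is a dict with the keys
  # "track_id", "key", "loudness" and "tempo", matching `track_features`.

  def __init__(self):
    super().__init__()

    track_ids = track_features.index.astype(str).tolist()
    self.track_embedding = tf.keras.Sequential([
        tf.keras.layers.StringLookup(vocabulary=track_ids, mask_token=None),
        tf.keras.layers.Embedding(len(track_ids) + 1, 32),
    ])

    # `key` is a pitch class in 0..11, or -1 when no key was detected;
    # shifting it by one gives the 13 embedding indices 0..12.
    self.key_embedding = tf.keras.layers.Embedding(13, 32)

    # Continuous features are standardized with statistics learned from the data.
    self.normalized_loudness = tf.keras.layers.Normalization(axis=None)
    self.normalized_loudness.adapt(track_features["loudness"].to_numpy(dtype="float32"))
    self.normalized_tempo = tf.keras.layers.Normalization(axis=None)
    self.normalized_tempo.adapt(track_features["tempo"].to_numpy(dtype="float32"))

  def call(self, inputs):

    return tf.concat([
        self.track_embedding(inputs["track_id"]),
        self.key_embedding(tf.cast(inputs["key"], tf.int32) + 1),
        tf.reshape(self.normalized_loudness(inputs["loudness"]), (-1, 1)),
        tf.reshape(self.normalized_tempo(inputs["tempo"]), (-1, 1)),
    ], axis=1)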

## 4.2. <a id='toc4_2_'></a>[extending user model considering playlist name](#toc0_)

### 4.2.1. <a id='toc4_2_1_'></a>[prepare features](#toc0_)

In [None]:
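title_text = tf.keras.layers.TextVectorization()
# Assumption: `playlist_names` is a list with the `name` of every playlist,
# collected while reading the slices (it is not built earlier in this notebook).
title_text.adapt(tf.constant(playlist_names))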

### 4.2.2. <a id='toc4_2_2_'></a>[build the user model](#toc0_)

In [None]:
class UserModel(tf.keras.Model):
  
  def __init__(self, use_timestamps):
    super().__init__()

    self._use_timestamps = use_timestamps

    self.user_embedding = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=unique_user_ids, mask_token=None),
        tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
    ])

    if use_timestamps:
      self.timestamp_embedding = tf.keras.Sequential([
          tf.keras.layers.Discretization(timestamp_buckets.tolist()),
          tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32),
      ])
      self.normalized_timestamp = tf.keras.layers.Normalization(
          axis=None
      )

      self.normalized_timestamp.adapt(timestamps)

  def call(self, inputs):
    if not self._use_timestamps:
      return self.user_embedding(inputs["user_id"])

    return tf.concat([
        self.user_embedding(inputs["user_id"]),
        self.timestamp_embedding(inputs["timestamp"]),
        tf.reshape(self.normalized_timestamp(inputs["timestamp"]), (-1, 1)),
    ], axis=1)

## 4.3. <a id='toc4_3_'></a>[Define metrics and loss](#toc0_)

In [None]:

import tensorflow_recommenders as tfrs

metrics = tfrs.metrics.FactorizedTopK(
  candidates=movies.batch(128).map(movie_model)
)
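`FactorizedTopK` reports how often the true candidate is among the K highest-scoring candidates for a query. A minimal plain-Python sketch of that idea with dot-product scores follows; the embeddings, the helper names and the tiny candidate set are hypothetical, and TFRS computes this far more efficiently over batches.

```python
def top_k_hit(query, true_index, candidates, k):
    """True if the matching candidate is among the k best dot-product scores."""
    scores = [sum(a * b for a, b in zip(query, c)) for c in candidates]
    ranked = sorted(range(len(candidates)), key=lambda j: scores[j], reverse=True)
    return true_index in ranked[:k]

candidates = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]   # hypothetical track embeddings
examples = [                                         # (query embedding, index of true track)
    ([1.0, 0.1], 0),
    ([0.1, 1.0], 1),
    ([0.9, 0.5], 2),
]

def top_k_accuracy(k):
    hits = [top_k_hit(q, t, candidates, k) for q, t in examples]
    return sum(hits) / len(hits)

for k in (1, 2, 3):
    print(k, round(top_k_accuracy(k), 3))
```

The accuracy can only grow with K, and it reaches 1 once K equals the number of candidates.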

The retrieval task is a Keras layer whose default loss function is categorical cross-entropy.
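In a retrieval setting this loss is typically applied in-batch: every query is scored against every candidate of the same batch, and the matching pair is treated as the correct class. The sketch below works through that computation in plain Python, assuming dot-product scores used as logits and a sum over the batch; the function name and the toy embeddings are hypothetical.

```python
import math

def in_batch_softmax_loss(queries, candidates):
    """Sum over the batch of -log softmax(score of the matching candidate)."""
    total = 0.0
    for row, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, c)) for c in candidates]
        m = max(scores)                                   # for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[row]                   # candidate `row` is the positive
    return total

queries = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical playlist embeddings
candidates = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical track embeddings, row-aligned

loss = in_batch_softmax_loss(queries, candidates)
sharper = in_batch_softmax_loss([[3.0, 0.0], [0.0, 3.0]], candidates)
print(round(loss, 4), round(sharper, 4))
```

Scaling the query embeddings so that the matching pairs score higher lowers the loss, which is the behaviour the optimizer exploits during training.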

In [None]:
task = tfrs.tasks.Retrieval(
  metrics=metrics,
  batch_metrics=[tf.keras.metrics.Precision(top_k=2)]
)

## 4.4. <a id='toc4_4_'></a>[Combine user and item models](#toc0_)

In [None]:
class MovielensModel(tfrs.models.Model):

  def __init__(self, use_timestamps):
    super().__init__()
    self.query_model = tf.keras.Sequential([
      UserModel(use_timestamps),
      tf.keras.layers.Dense(32)
    ])
    self.candidate_model = tf.keras.Sequential([
      MovieModel(),
      tf.keras.layers.Dense(32)
    ])
    self.task = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=movies.batch(128).map(self.candidate_model),
        ),
    )

  def compute_loss(self, features, training=False):
    # We only pass the user id and timestamp features into the query model. This
    # is to ensure that the training inputs would have the same keys as the
    # query inputs. Otherwise the discrepancy in input structure would cause an
    # error when loading the query model after saving it.
    query_embeddings = self.query_model({
        "user_id": features["user_id"],
        "timestamp": features["timestamp"],
    })
    movie_embeddings = self.candidate_model(features["movie_title"])

    return self.task(query_embeddings, movie_embeddings)

## 4.5. <a id='toc4_5_'></a>[Results](#toc0_)

### 4.5.1. <a id='toc4_5_1_'></a>[Comparison to MF model](#toc0_)

### 4.5.2. <a id='toc4_5_2_'></a>[Using activation function](#toc0_)