# Collaborative Filtering and the Million Playlist Dataset

### Justin Moczynski, Gabe Pesco

## 1. Introduction

Many famous artists are heard frequently on the radio and on streaming services such as Spotify and Apple Music; however, these are not the only artists in the world. What about artists whose work is more local, or who are in the early stages of their careers? How does popularity affect an artist's appearance in music recommendations? In this project, we begin to examine how a song's popularity affects how often it is surfaced to listeners who rely on a recommender system to guide their future choices of music.

## 2. Procedure

The dataset used for this project was the Million Playlist Dataset (MPD) provided by Spotify for their Recommender Systems Challenge. The first obstacle was importing the data and converting it into a suitable form for use within the program.

### 2.1 Importing Data

<div class="alert alert-block alert-warning">
This code is run only once for the entire project: its purpose is to extract the data and convert it into an npz file for use during the program. Once the npz file has been created, it does not need to be regenerated, because later cells load the saved file.
</div>

First, the following libraries were imported to use in the data conversion process:

In [1]:
import json
import os
import sys
import numpy as np
import implicit
import scipy.sparse
import progressbar as pb
import concurrent.futures
import threading
import time
from sklearn import *
from random import *

#### 2.1.1 Importing Training Data

After this, we began importing the data from the .json files. To do this, we read and parsed the contents of each file with the json module and stored the results in dictionary, list, and set data structures. For this project, we are using 2 of the given JSON files.

In [None]:
path = "Data\\data_big\\"  # backslashes escaped explicitly (Windows-style path)
files = os.listdir(path)
master_playlists = list()
master_songs_set = set()
master_songs_records = list()

def process_file(file):
    """Read one JSON file and record its playlists and songs in the master collections."""
    start_time = time.time()
    file_row = files.index(file)
    # Close the file handle as soon as the JSON has been parsed
    with open(path + file) as file_contents_json:
        file_contents = json.load(file_contents_json)
    playlists_in_file = file_contents['playlists']

    for playlist in playlists_in_file:
        playlist_name = playlist['name']

        for song in playlist['tracks']:
            song_name = song['track_name']
            master_songs_set.add(song_name)
            # Each record pairs a song with the playlist it appears in
            master_songs_records.append((song_name, playlist_name))

        master_playlists.append(playlist_name)

    del playlists_in_file
    print(str(file_row) + "," + time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)) + "\t")

def process_all_files(file_list):
    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Consume the iterator so an exception raised in a worker is not silently dropped
        list(executor.map(process_file, file_list))
    print("all files processed in " + time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)) + "\t(" + str(len(master_playlists)) + " playlists, " + str(len(master_songs_set)) + " songs)")

process_all_files(files)

0,00:00:03	2,00:00:03	4,00:00:03	


1,00:00:03	3,00:00:03	

5,00:00:03	6,00:00:03	

7,00:00:03	8,00:00:03	
9,00:00:02	

10,00:00:03	
11,00:00:03	12,00:00:03	13,00:00:03	


14,00:00:03	
15,00:00:02	
18,00:00:02	16,00:00:03	

19,00:00:02	
17,00:00:04	
20,00:00:02	
22,00:00:03	21,00:00:03	

23,00:00:03	24,00:00:02	
25,00:00:01	

26,00:00:02	
27,00:00:02	28,00:00:00	

29,00:00:02	
30,00:00:03	
32,00:00:02	31,00:00:03	
33,00:00:02	

35,00:00:02	
36,00:00:02	
34,00:00:03	37,00:00:02	

38,00:00:03	
39,00:00:01	
40,00:00:02	
44,00:00:02	41,00:00:02	

43,00:00:02	42,00:00:03	

45,00:00:00	
46,00:00:02	
48,00:00:03	47,00:00:03	

49,00:00:03	50,00:00:03	

51,00:00:02	52,00:00:03	

53,00:00:03	54,00:00:02	
55,00:00:03	

56,00:00:01	57,00:00:02	59,00:00:02	


58,00:00:03	60,00:00:02	

61,00:00:03	65,00:00:03	63,00:00:03	64,00:00:03	



62,00:00:03	
68,00:00:03	66,00:00:03	69,00:00:03	


70,00:00:03	67,00:00:04	

71,00:00:02	73,00:00:02	72,00:00:03	

74,00:00:03	

75,00:00:02	
76,00:00:01	77,00:00:0
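
The extraction loop can be exercised on a small in-memory example. The sketch below is illustrative only: the playlist and song names are made up, and `extract_records` is a hypothetical helper that mirrors the loop in process_file using only the fields the code above reads (playlists, name, tracks, track_name).

```python
import json

# Hypothetical miniature file with the same shape as the fields read above
toy_file = json.loads("""
{"playlists": [
  {"name": "Road Trip", "tracks": [{"track_name": "Song A"}, {"track_name": "Song B"}]},
  {"name": "Workout",   "tracks": [{"track_name": "Song A"}]},
  {"name": "Road Trip", "tracks": [{"track_name": "Song C"}]}
]}
""")

def extract_records(file_contents):
    """Collect playlist names, distinct song names, and (song, playlist) records."""
    playlists, songs, records = [], set(), []
    for playlist in file_contents["playlists"]:
        for track in playlist["tracks"]:
            songs.add(track["track_name"])
            records.append((track["track_name"], playlist["name"]))
        playlists.append(playlist["name"])
    return playlists, songs, records

playlists, songs, records = extract_records(toy_file)
print(len(playlists), "playlists,", len(set(playlists)), "distinct names,",
      len(songs), "songs,", len(records), "records")
```

Because each record stores the playlist name rather than a unique identifier, playlists that share a name cannot be told apart in the records. In this toy data, the two playlists named "Road Trip" would contribute to the same row if rows are later looked up by name.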

After extracting all of the data from the files, we create a matrix with $m$ rows and $n$ columns, where $m$ is the number of playlists and $n$ is the number of songs. Entry $(i, j)$ is 1 if playlist $i$ contains song $j$ and 0 otherwise. The matrix is stored in a sparse format because very few of its entries are nonzero.

In [None]:
# Fix an ordering for the songs so that each one maps to a column index
master_songs_set_list = list(master_songs_set)
m = len(master_playlists)
n = len(master_songs_set_list)
# The DOK format supports efficient element-by-element assignment
matrix = scipy.sparse.dok_matrix((m,n), dtype=int)

start_time_total = time.time()
counter = 0
progress = pb.ProgressBar(widgets=[pb.Percentage(),"\t", pb.Bar(),"\t", pb.Timer(), "\tTotal Completed: ", pb.Counter()], maxval=len(master_songs_records)).start()
for record in master_songs_records:
    # record is (song_name, playlist_name); mark the song as present in that playlist
    matrix[master_playlists.index(record[1]),master_songs_set_list.index(record[0])] = 1
    counter  = counter + 1
    progress.update(counter)
print("all playlists processed in " + time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time_total)))

del master_songs_set
del master_songs_set_list
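
Looking up each name with list.index scans the list on every record, which becomes slow as the number of playlists and songs grows. A possible alternative, sketched below on made-up records, assigns indices with dictionaries and builds the matrix in one step from coordinate lists. The helper name `build_matrix` and the toy data are assumptions for illustration. Unlike the cell above, this sketch creates one row per distinct playlist name.

```python
import numpy as np
import scipy.sparse

# Hypothetical (song_name, playlist_name) records
toy_records = [
    ("Song A", "Road Trip"),
    ("Song B", "Road Trip"),
    ("Song A", "Workout"),
    ("Song C", "Workout"),
    ("Song A", "Workout"),  # repeated pair, should still yield a single 1
]

def build_matrix(records):
    """Build a binary playlist-by-song CSR matrix from (song, playlist) records."""
    playlist_index, song_index = {}, {}
    rows, cols = [], []
    for song, playlist in records:
        # setdefault assigns the next free index the first time a name is seen
        rows.append(playlist_index.setdefault(playlist, len(playlist_index)))
        cols.append(song_index.setdefault(song, len(song_index)))
    data = np.ones(len(rows), dtype=np.int32)
    shape = (len(playlist_index), len(song_index))
    matrix = scipy.sparse.coo_matrix((data, (rows, cols)), shape=shape).tocsr()
    # Converting COO to CSR sums repeated coordinates; reset every stored value to 1
    matrix.data[:] = 1
    return matrix, playlist_index, song_index

toy_matrix, toy_playlists, toy_songs = build_matrix(toy_records)
print(toy_matrix.toarray())
```

Each dictionary lookup takes roughly constant time, so the cost of construction grows with the number of records rather than with the product of records and vocabulary size.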

After creating the sparse matrix, we convert it from a DOK (dictionary of keys) matrix to a CSR (compressed sparse row) matrix in order to create an npz file from the data.

In [None]:
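# Convert to CSR and cast to int32; the third argument of save_npz is
# `compressed` (a boolean), not a dtype, so the cast is done explicitly here
sparse_matrix = scipy.sparse.csr_matrix(matrix).astype(np.int32)
scipy.sparse.save_npz("data_train.npz", sparse_matrix)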

The npz file is then reloaded into the program for further use.

In [None]:
sparse_matrix = scipy.sparse.load_npz("data_train.npz")
print(sparse_matrix.shape)
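
The convert, save, and reload steps can be checked on a toy matrix. The sketch below is a minimal round trip under assumed names (`toy.npz` in a temporary directory, a 3 by 4 matrix with two nonzero entries). To our knowledge, save_npz accepts only the CSC, CSR, BSR, DIA, and COO formats, which is why the DOK matrix is converted first.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Small DOK matrix filled element by element, as in the construction step
toy_dok = scipy.sparse.dok_matrix((3, 4), dtype=np.int32)
toy_dok[0, 1] = 1
toy_dok[2, 3] = 1

# Convert to CSR before saving
toy_csr = toy_dok.tocsr()

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "toy.npz")  # hypothetical file name
    scipy.sparse.save_npz(target, toy_csr, compressed=True)
    toy_loaded = scipy.sparse.load_npz(target)

# The reloaded matrix should have the same shape, format, and entries
print(toy_loaded.shape, toy_loaded.format, (toy_loaded != toy_csr).nnz)
```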

#### 2.1.2 Importing Test Data

We use the same procedure and code to import the test data.

In [None]:
path = "data_big\\"
files = os.listdir(path)
master_playlists = list()
master_songs_set = set()
master_songs_records = list()

def process_file(file):
    """Read one JSON file and record its playlists and songs in the master collections."""
    start_time = time.time()
    file_row = files.index(file)
    # Close the file handle as soon as the JSON has been parsed
    with open(path + file) as file_contents_json:
        file_contents = json.load(file_contents_json)
    playlists_in_file = file_contents['playlists']

    for playlist in playlists_in_file:
        playlist_name = playlist['name']

        for song in playlist['tracks']:
            song_name = song['track_name']
            master_songs_set.add(song_name)
            # Each record pairs a song with the playlist it appears in
            master_songs_records.append((song_name, playlist_name))

        master_playlists.append(playlist_name)

    del playlists_in_file
    print(str(file_row) + "," + time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)) + "\t")

def process_all_files(file_list):
    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Consume the iterator so an exception raised in a worker is not silently dropped
        list(executor.map(process_file, file_list))
    print("all files processed in " + time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)) + "\t(" + str(len(master_playlists)) + " playlists, " + str(len(master_songs_set)) + " songs)")

process_all_files(files[100:103])

del files

master_songs_set_list = list(master_songs_set)
master_songs_vector = np.asarray(master_songs_set_list)
m = len(master_playlists)
n = len(master_songs_set_list)
matrix = scipy.sparse.dok_matrix((m,n), dtype=int)

start_time_total = time.time()
counter = 0
progress = pb.ProgressBar(widgets=[pb.Percentage(),"\t", pb.Bar(),"\t", pb.Timer(), "\tTotal Completed: ", pb.Counter()], maxval=len(master_songs_records)).start()
for record in master_songs_records:
    start_time = time.time()
    matrix[master_playlists.index(record[1]),master_songs_set_list.index(record[0])] = 1
    counter  = counter + 1
    progress.update(counter)
print("all playlists processed in " + time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time_total)))

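# Convert to CSR and cast to int32; the third argument of save_npz is
# `compressed` (a boolean), not a dtype, so the cast is done explicitly here
sparse_matrix_test = scipy.sparse.csr_matrix(matrix).astype(np.int32)
scipy.sparse.save_npz("data_test.npz", sparse_matrix_test)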

sparse_matrix_test = scipy.sparse.load_npz("data_test.npz")
print(sparse_matrix_test.shape)

### 2.2 Penalizing for Popularity

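In this project, we explored the introduction of a tunable hyperparameter that penalizes songs for greater popularity. We defined a function that takes the parameters $X$ (the data matrix) and $\zeta$ (zeta, the hyperparameter) and scales the elements of the data matrix to penalize for popularity. We used the following formula for the scaling exponent:
$$z = \frac{1-\zeta}{\zeta}$$
Each column $j$ of $X$ is multiplied by $(f_j / f_{\max})^z$, where $f_j$ is the number of playlists containing song $j$ and $f_{\max}$ is the largest such count. For $\zeta > 1$, $z$ is negative, so the most popular song keeps a weight of 1 while less popular songs receive weights greater than 1. Because $z$ is undefined at $\zeta = 0$, the function does not scale the matrix in that case and returns the original matrix $X$.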

In [None]:
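def popularityScale(X, zeta):
    if zeta != 0:
        # Column sums: the number of playlists that contain each song
        freqs = np.sum(X, axis = 0)
        # Exponent derived from the hyperparameter zeta
        z = (1-zeta)/zeta
        # Multiply each column by (freq / max freq) ** z
        return X * (np.reciprocal(np.power(np.amax(freqs), z)) * np.power(freqs, z))
    else:
        # zeta == 0 leaves z undefined, so return the matrix unscaled
        return X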
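
As a small worked example, the sketch below applies the function to a made-up matrix of four playlists and three songs with $\zeta = 2$, so $z = -1/2$. The function is restated so the block runs on its own; the toy matrix and the names `X_toy` and `scaled_toy` are assumptions for illustration.

```python
import numpy as np

def popularityScale(X, zeta):
    # Restated from the cell above so that this block is self-contained
    if zeta != 0:
        freqs = np.sum(X, axis=0)
        z = (1 - zeta) / zeta
        return X * (np.reciprocal(np.power(np.amax(freqs), z)) * np.power(freqs, z))
    else:
        return X

# Hypothetical data: song 0 is in 4 playlists, song 1 in 2, song 2 in 1
X_toy = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
])

scaled_toy = popularityScale(X_toy, 2)
print(np.round(scaled_toy, 3))
```

The column weights are $(4/4)^{-1/2} = 1$, $(2/4)^{-1/2} = \sqrt{2} \approx 1.414$, and $(1/4)^{-1/2} = 2$, so the least popular song receives the largest weight.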

We converted the sparse matrix into a dense matrix in order to run the popularityScale function on the imported data. The result was then converted back to a sparse matrix to conserve memory.

In [None]:
print("size of sparse_matrix:", np.shape(sparse_matrix))
dense_matrix = sparse_matrix.toarray()  # call toarray on the loaded matrix itself
scaled_matrix = scipy.sparse.dok_matrix(popularityScale(dense_matrix, 2))
print("size of scaled_matrix:", np.shape(scaled_matrix))
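
Converting to a dense array can require a large amount of memory when there are many playlists and songs. One possible alternative, sketched below, applies the same column scaling directly to a sparse matrix by multiplying with a diagonal matrix. The function name `popularity_scale_sparse` is hypothetical, and leaving columns with zero frequency unscaled is an assumption of this sketch rather than behavior of the function above.

```python
import numpy as np
import scipy.sparse

def popularity_scale_sparse(X, zeta):
    """Scale column j of a sparse matrix by (freq_j / max freq) ** z without densifying."""
    if zeta == 0:
        return X
    X = scipy.sparse.csr_matrix(X, dtype=float)
    # Column sums as a flat array: number of playlists per song for 0/1 data
    freqs = np.asarray(X.sum(axis=0)).ravel()
    z = (1 - zeta) / zeta
    # Assumption: columns with zero frequency keep a weight of 1
    scale = np.ones_like(freqs)
    nonzero = freqs > 0
    scale[nonzero] = (freqs[nonzero] / freqs.max()) ** z
    # Right-multiplying by a diagonal matrix scales each column
    return X @ scipy.sparse.diags(scale)

# Hypothetical data: four playlists, three songs
X_sparse_toy = scipy.sparse.csr_matrix(np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
]))

scaled_sparse_toy = popularity_scale_sparse(X_sparse_toy, 2)
print(np.round(scaled_sparse_toy.toarray(), 3))
```

The result stays sparse throughout, so memory use grows with the number of nonzero entries rather than with the full size of the matrix.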

### 2.3 Running Alternating Least Squares

We used the implicit library to run the Alternating Least Squares learning algorithm on the scaled data.

In [None]:
model = implicit.als.AlternatingLeastSquares(factors=200, use_gpu=False)
model.fit(scaled_matrix.T)
print(scaled_matrix.T)
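
To make the idea behind alternating least squares concrete, the sketch below factorizes a small dense matrix with plain NumPy. It illustrates the general technique only and is not the implementation inside the implicit library, which, as far as we are aware, uses a confidence-weighted variant designed for implicit feedback. The function name `als_sketch`, the toy matrix, and all parameter values are assumptions made for this example.

```python
import numpy as np

def als_sketch(R, factors=2, regularization=0.1, iterations=10, seed=0):
    """Approximate R by U @ V.T, alternating ridge-regression solves for U and V."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    U = rng.normal(scale=0.1, size=(n_rows, factors))
    V = rng.normal(scale=0.1, size=(n_cols, factors))
    reg = regularization * np.eye(factors)
    losses = []
    for _ in range(iterations):
        # Hold V fixed and solve for U: U = R V (V^T V + lambda I)^-1
        U = np.linalg.solve(V.T @ V + reg, V.T @ R.T).T
        # Hold U fixed and solve for V: V = R^T U (U^T U + lambda I)^-1
        V = np.linalg.solve(U.T @ U + reg, U.T @ R).T
        # Regularized squared error after both updates
        error = np.sum((R - U @ V.T) ** 2)
        penalty = regularization * (np.sum(U ** 2) + np.sum(V ** 2))
        losses.append(float(error + penalty))
    return U, V, losses

# Hypothetical scaled playlist-by-song matrix
R_toy = np.array([
    [1.0, 1.414, 0.0],
    [1.0, 0.0, 2.0],
    [1.0, 1.414, 0.0],
    [1.0, 0.0, 0.0],
])

U_toy, V_toy, losses_toy = als_sketch(R_toy)
print(np.round(U_toy @ V_toy.T, 2))
```

Each update exactly minimizes the regularized objective with respect to one factor matrix while the other is held fixed, so the recorded loss should not increase from one iteration to the next.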