## Collaborative Filtering Recommender

In [1]:
from comet_ml import Experiment
experiment = Experiment(api_key="ummagUWZ5eIZzmhPtFkA8oopu")

[codecarbon INFO @ 10:11:45] [setup] RAM Tracking...
[codecarbon INFO @ 10:11:45] [setup] GPU Tracking...
[codecarbon INFO @ 10:11:45] No GPU found.
[codecarbon INFO @ 10:11:45] [setup] CPU Tracking...
[codecarbon INFO @ 10:11:46] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
[codecarbon INFO @ 10:11:46] >>> Tracker's metadata:
[codecarbon INFO @ 10:11:46]   Platform system: Linux-5.4.0-147-generic-x86_64-with-glibc2.29
[codecarbon INFO @ 10:11:46]   Python version: 3.8.10
[codecarbon INFO @ 10:11:46]   Available RAM : 31.360 GB
[codecarbon INFO @ 10:11:46]   CPU count: 8
[codecarbon INFO @ 10:11:46]   CPU model: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
[codecarbon INFO @ 10:11:46]   GPU count: None
[codecarbon INFO @ 10:11:46]   GPU model: None
COMET INFO: Couldn't find a Git repository in '/home/asadcor' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
COMET INFO: Experimen

In [2]:
from tqdm import tqdm

import csv
import numpy as np
import pandas as pd
import scipy.sparse as sp

from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

## 0. Preprocessing, data split, and binary sparse matrix

In [3]:
playlists_dataset_with_mood = pd.read_csv('working/playlists_dataset_with_mood.csv')

In [4]:
playlists_dataset_with_mood = playlists_dataset_with_mood.drop('description', axis=1)
playlists_dataset_with_mood = playlists_dataset_with_mood.dropna()


In [5]:
pdwm_100k = playlists_dataset_with_mood.sample(n=102773, random_state=12)

mask = pdwm_100k['pid'].duplicated(keep=False)
pdwm_100k_filtered = pdwm_100k[mask]

In [6]:
pdwm_100k_filtered.shape

(100000, 29)
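
The `duplicated(keep=False)` mask in cell 5 flags every row whose `pid` occurs more than once in the sample, so playlists represented by a single row are dropped. This is presumably done so that the later split can stratify on `pid`, which requires at least two rows per playlist. A minimal plain-Python sketch of the same filter on made-up rows follows; the `rows` data and the `keep_repeated` helper are illustrative and not part of the notebook.

```python
from collections import Counter

def keep_repeated(rows, key):
    """Keep rows whose `key` value appears at least twice (like duplicated(keep=False))."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] > 1]

# Hypothetical sample: playlist 2 has a single row and is filtered out.
rows = [
    {"pid": 1, "track_uri": "a"},
    {"pid": 1, "track_uri": "b"},
    {"pid": 2, "track_uri": "c"},
    {"pid": 3, "track_uri": "d"},
    {"pid": 3, "track_uri": "e"},
]
filtered = keep_repeated(rows, "pid")
print([row["pid"] for row in filtered])
```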

In [7]:
# First split: 80% for training and 20% for the combined validation-test set
train, val_test = train_test_split(pdwm_100k_filtered, test_size=0.2, shuffle=True, random_state=12, stratify=pdwm_100k_filtered['pid'])

# Second split of the validation-test set: 10% of the data for validation and 10% for test
val, test = train_test_split(val_test, test_size=0.5, shuffle=True, random_state=12)
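
The two calls above give an 80/10/10 split because the second call halves the 20% holdout. The sketch below reproduces only that arithmetic in plain Python. It does not stratify, and `two_stage_split` and the `*_demo` names are hypothetical helpers for illustration. Note that only the first split in the notebook is stratified on `pid`, so a given playlist may end up with no rows in `val`.

```python
import random

def two_stage_split(items, holdout_frac=0.2, seed=12):
    """Shuffle, hold out `holdout_frac` of the items, then halve the holdout."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_frac)
    holdout, train_part = shuffled[:n_holdout], shuffled[n_holdout:]
    half = len(holdout) // 2
    return train_part, holdout[:half], holdout[half:]

train_demo, val_demo, test_demo = two_stage_split(range(1000))
print(len(train_demo), len(val_demo), len(test_demo))
```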

In [8]:
# Create Binary Sparse Matrix
s_matrix = pd.crosstab(train.pid, train.track_uri)
s_matrix = s_matrix.clip(upper=1)

assert s_matrix.to_numpy().max() == 1  # after clipping, the largest entry must be 1

sparse_matrix = csr_matrix(s_matrix)
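
To make the construction above concrete, here is a small plain-Python sketch of the same idea: turn (playlist, track) pairs into a 0/1 matrix and store only the non-zero positions in CSR form (`data`, `indices`, `indptr`). The `binary_csr` function and the toy `pairs` are assumptions made for illustration; the notebook itself relies on `pd.crosstab` and `scipy.sparse.csr_matrix`.

```python
def binary_csr(pairs):
    """Build a 0/1 playlist-by-track matrix in CSR form from (pid, track) pairs."""
    row_labels = sorted({pid for pid, _ in pairs})
    col_labels = sorted({track for _, track in pairs})
    col_pos = {track: j for j, track in enumerate(col_labels)}
    # A set per playlist collapses repeated tracks, mirroring clip(upper=1).
    members = {pid: set() for pid in row_labels}
    for pid, track in pairs:
        members[pid].add(col_pos[track])
    indices, indptr = [], [0]
    for pid in row_labels:
        indices.extend(sorted(members[pid]))  # column positions of the 1s in this row
        indptr.append(len(indices))           # where the next row starts
    data = [1] * len(indices)
    return row_labels, col_labels, data, indices, indptr

# Hypothetical pairs: track "t2" appears twice in playlist 10 but is stored once.
pairs = [(10, "t1"), (10, "t2"), (10, "t2"), (20, "t2"), (30, "t3")]
row_labels, col_labels, data, indices, indptr = binary_csr(pairs)
print(indices, indptr)
```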

In [9]:
print(np.isnan(sparse_matrix.data).any())
print(np.isinf(sparse_matrix.data).any())

False
False


In [10]:
sparse_matrix

<15414x40583 sparse matrix of type '<class 'numpy.int64'>'
	with 79939 stored elements in Compressed Sparse Row format>
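
The repr above shows how sparse the interaction matrix is. A quick back-of-the-envelope check, using only the numbers printed in the repr and assuming 8 bytes per `int64` cell, estimates the density and the size the same matrix would have if stored densely:

```python
n_rows, n_cols, nnz = 15414, 40583, 79939  # taken from the repr above

cells = n_rows * n_cols
density = nnz / cells            # fraction of cells that hold a 1
dense_gb = cells * 8 / 1e9       # assumption: 8 bytes per int64 cell

print(f"cells = {cells}")
print(f"density = {density:.4%}")
print(f"dense int64 size = {dense_gb:.1f} GB")
```

Roughly one cell in eight thousand is non-zero, which is why the CSR representation is the one handed to the model.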

In [11]:
# Train kNN model
model_kNN = NearestNeighbors(metric='cosine', algorithm='brute')

model_kNN.fit(sparse_matrix)

NearestNeighbors(algorithm='brute', metric='cosine')

Cosine similarity is a common choice in collaborative filtering models. It compares two vectors by their orientation and ignores their magnitude, which lets us compare playlists or songs even if they have different numbers of tracks or different levels of popularity. With `metric='cosine'`, scikit-learn works with the cosine distance (1 minus the cosine similarity), so the nearest neighbours are the playlists with the highest similarity.
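
As a small illustration of the magnitude point, the sketch below computes the cosine distance by hand for a few toy binary playlist vectors. The `cosine_distance` helper and the vectors are made up for this example.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

short = [1, 1, 0, 0, 0, 0]    # playlist with 2 tracks
longer = [1, 1, 1, 1, 0, 0]   # 4 tracks, including both tracks of `short`
other = [0, 0, 0, 0, 1, 1]    # shares no track with `short`
scaled = [3 * x for x in short]  # same direction as `short`, larger magnitude

print(cosine_distance(short, longer))  # partial overlap: between 0 and 1
print(cosine_distance(short, other))   # no overlap: maximal distance for 0/1 vectors
print(cosine_distance(short, scaled))  # same direction: approximately 0
```

For 0/1 vectors the similarity reduces to the number of shared tracks divided by the geometric mean of the two playlist lengths, so a long playlist is not favoured merely for being long.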

The brute-force algorithm computes the distance from the query to every point in the dataset. This gives exact neighbours with a very simple procedure, but the cost grows with the size of the dataset, so a larger dataset may force us to reduce the matrix and the data given to the model. For a dataset of this size it is a reasonable choice.
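
A minimal sketch of what a brute-force neighbour search does, assuming cosine distance and small dense rows: score every row against the query, sort, and keep the closest ones. `brute_force_neighbors` and the toy `playlists` are hypothetical and only illustrate the idea; they are not scikit-learn's implementation.

```python
import math

def brute_force_neighbors(query, rows, n_neighbors):
    """Rank every row by cosine distance to `query` and return the closest indices."""
    def distance(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return 1.0 - dot / norms if norms else 1.0
    # One distance per row: the cost is proportional to the number of rows.
    scored = [(distance(query, row), i) for i, row in enumerate(rows)]
    scored.sort()
    return [i for _, i in scored[:n_neighbors]]

playlists = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
nearest = brute_force_neighbors(playlists[0], playlists, n_neighbors=2)
print(nearest)
```

When the query is itself one of the rows, it comes back as its own nearest neighbour, which is worth keeping in mind when choosing how many neighbours to request.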

### Making Predictions

In [12]:
def nholdout(playlist_id, df):
    """
    Returns the number of songs held out in the validation/test set for a given playlist ID.

    Parameters:
        playlist_id (int): The ID of the playlist.
        df (pandas.DataFrame): The DataFrame containing the playlist data.

    Returns:
        int: The number of songs held out in the validation/test set.

    """
    return len(df[df.pid == playlist_id].track_uri)

def kpredict(knnmodel, playlist_id, df):
    """
    Generates a list of 15*k predictions for a given playlist ID, where k is the number of holdouts.

    Parameters:
        knnmodel: The k-Nearest Neighbors model used for prediction.
        playlist_id (int): The ID of the playlist for which to generate predictions.
        df (pandas.DataFrame): The DataFrame containing the playlist data.

    Returns:
        list: A list of 15*k predictions for the specified playlist ID.

    """
    
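    k = nholdout(playlist_id, df) * 15  # prediction budget: 15 candidates per held-out track
    if k == 0:  # nothing is held out for this playlist, so there is nothing to predict
        return []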
    ref_songs = s_matrix.columns.values[s_matrix.loc[playlist_id] == 1] # songs already in playlist
    dist, ind = knnmodel.kneighbors(np.array(s_matrix.loc[playlist_id]).reshape(1, -1), n_neighbors = 99)
    rec_ind = s_matrix.index[ind[0]] # recommended playlists
    
    n_pred = 0
    pred = []
    for i in rec_ind:
        new_songs = s_matrix.columns.values[s_matrix.loc[i] == 1] # potential recommendations
        for song in new_songs:
            if song not in ref_songs: # only getting songs not already in target playlist
                pred.append(song)
                n_pred += 1
                if n_pred == k:
                    break
        if n_pred == k:
            break
    
    return pred
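
The candidate-collection loop in `kpredict` can be pictured on toy data: walk the neighbouring playlists from nearest to farthest and collect tracks the target playlist does not already contain, until the budget is used up. The sketch below is an assumption-laden simplification: `collect_candidates` is a hypothetical helper, it works on plain lists, and unlike `kpredict` it also skips tracks that were already picked from an earlier neighbour.

```python
def collect_candidates(target_tracks, neighbor_playlists, budget):
    """Collect up to `budget` unseen tracks, visiting neighbours nearest first."""
    seen = set(target_tracks)
    picked = []
    for tracks in neighbor_playlists:
        for track in tracks:
            if len(picked) == budget:
                return picked
            if track not in seen:
                picked.append(track)
                seen.add(track)  # assumption: do not recommend the same track twice
    return picked

target = ["a", "b"]
neighbors = [["a", "b", "c"], ["b", "c", "d"], ["e"]]  # ordered nearest first
print(collect_candidates(target, neighbors, budget=2))
print(collect_candidates(target, neighbors, budget=3))
```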

### Metrics

In [13]:
def r_precision(predictions, val_set):
    """
    Computes the R-Precision score for a given playlist prediction set.

    Parameters:
        predictions (list or numpy.ndarray): A list or 1-D numpy array containing the predicted track URIs.
        val_set (pandas.Series): A pandas Series representing the ground truth track URIs for the validation set.

    Returns:
        float: The R-Precision score.

    """
    if val_set.shape[0] > 0:
        score = np.sum(val_set.isin(predictions))/val_set.shape[0]
    else:
        score = 0.0
    return score
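
A worked example of the score computed above, written in plain Python with made-up track IDs. The function counts how many held-out tracks appear anywhere in the prediction list and divides by the number of held-out tracks. Because `kpredict` returns up to 15 candidates per held-out track, this is measured over a longer list than the classic R-precision definition, which looks only at the top R predictions. `held_out_hit_rate` is a hypothetical name used for the illustration.

```python
def held_out_hit_rate(predictions, held_out):
    """Fraction of held-out tracks that appear anywhere in `predictions`."""
    if not held_out:
        return 0.0
    predicted = set(predictions)
    return sum(track in predicted for track in held_out) / len(held_out)

predictions = ["t1", "t7", "t3", "t9"]
held_out = ["t3", "t4"]  # only "t3" was predicted
print(held_out_hit_rate(predictions, held_out))
```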

In [14]:
def dcg_at_k(r, k, method=0):
    """
    Computes the Discounted Cumulative Gain (DCG) at a specified rank `k` given a list of relevance scores.

    Parameters:
        r (list or numpy.ndarray): A list or 1-D numpy array containing the relevance scores.
        k (int): The rank at which to compute the DCG.
        method (int, optional): The formula to use for computing the DCG. 0 for the default formula, 1 for the alternative formula. Default is 0.

    Returns:
        float: The DCG at rank `k`.

    Raises:
        ValueError: If `method` is not 0 or 1.

    """
    r = np.asarray(r, dtype=float)[:k]  # np.asfarray was removed in NumPy 2.0
    if r.size:
        if method == 0:
            return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))
        elif method == 1:
            return np.sum(r / np.log2(np.arange(2, r.size + 2)))
        else:
            raise ValueError('method must be 0 or 1.')
    return 0.


def ndcg_at_k(r, k, method=0):
    """
    Computes the Normalized Discounted Cumulative Gain (NDCG) at a specified rank `k` given a list of relevance scores.

    Parameters:
        r (list or numpy.ndarray): A list or 1-D numpy array containing the relevance scores.
        k (int): The rank at which to compute the NDCG.
        method (int, optional): The formula to use for computing the NDCG. 0 for the default formula, 1 for the alternative formula. Default is 0.

    Returns:
        float: The NDCG at rank `k`.

    """
    dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k, method) / dcg_max
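
To see what the default formula (`method=0`) does, here is a hand computation for a three-item ranking with hits at ranks 1 and 3. The first item is not discounted and the item at rank i (for i ≥ 2) is divided by log2(i). `dcg_method0` is a hypothetical plain-Python restatement of that formula, written only for this example.

```python
import math

def dcg_method0(relevances):
    """DCG with rank 1 undiscounted and rank i >= 2 divided by log2(i)."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

ranked = [1, 0, 1]                    # hits at ranks 1 and 3
ideal = sorted(ranked, reverse=True)  # [1, 1, 0]: the same hits in the best order
dcg = dcg_method0(ranked)             # 1 + 0/log2(2) + 1/log2(3)
idcg = dcg_method0(ideal)             # 1 + 1/log2(2) + 0/log2(3)
ndcg = dcg / idcg
print(dcg, idcg, ndcg)
```

As in `ndcg_at_k`, the ideal ordering is a re-sort of the same relevance list, so the score reflects how well the hits are ordered, not how many held-out tracks were missed entirely.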

## Baseline Model Performance

In [15]:
rps = []
ndcgs = []

for pid in tqdm(s_matrix.index):
    ps = kpredict(model_kNN, pid, val)
    vs = val[val.pid == pid].track_uri
    rps.append(r_precision(ps, vs))

    r = np.zeros(len(ps))
    for i, p in enumerate(ps):
        if np.any(vs.isin([p])):
            r[i] = 1
    ndcgs.append(ndcg_at_k(r, len(r)))


In [16]:
avg_rp = np.mean(rps)
avg_ndcg = np.mean(ndcgs)

print('Avg. R-Precision: ', avg_rp)
print('Avg. NDCG: ', avg_ndcg)
print('Mean of both: ', np.mean([avg_rp, avg_ndcg]))

Avg. R-Precision:  0.006411919899658319
Avg. NDCG:  0.003375014498975917
Mean of both:  0.004893467199317118


In [17]:
experiment.end()

[codecarbon INFO @ 10:18:52] Energy consumed for RAM : 0.001379 kWh. RAM Power : 11.759872913360596 W
[codecarbon INFO @ 10:18:52] Energy consumed for all CPUs : 0.004984 kWh. All CPUs Power : 42.5 W
[codecarbon INFO @ 10:18:52] 0.006363 kWh of electricity used since the begining.
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.com/sadcor/general/bf65e21ab30141ec93b0fb71f9be1ea8
COMET INFO:   Parameters:
COMET INFO:     algorithm     : brute
COMET INFO:     leaf_size     : 30
COMET INFO:     metric        : cosine