# Modeling

In this section, we propose several models to predict whether a user has listened to a given song. We begin by implementing several baselines and then move on to a more advanced model.

In [2]:
!pip install nimfa

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nimfa
  Downloading nimfa-1.4.0-py2.py3-none-any.whl (4.7 MB)
Installing collected packages: nimfa
Successfully installed nimfa-1.4.0


In [1]:
import pandas as pd
import numpy as np
import json
import re
import os
import nimfa as nf
from collections import defaultdict
import copy

In [2]:
# Reading in training and testing data
with open('data_train.json', 'r') as train_reader:
  data_train_val = json.load(train_reader)

with open('data_test.json', 'r') as test_reader:
  data_test = json.load(test_reader)

In [4]:
data_train, data_val = data_train_val[:600_000], data_train_val[600_000:]

In [None]:
# Adding negative records to validation data (the test data already contains them)
val_neg_ex = []

# Sampling negatives
for ex in data_val:
    user = ex['user']
    random_song = ex
    
    # Sampling random records until one from a different user is found
    while random_song['user'] == user:
        random_song = data_val[np.random.randint(0, len(data_val))]
    
    # Negative example: pair the sampled song with the current user
    # (without the reassignment, the copy would keep the other user's id)
    neg_ex = copy.deepcopy(random_song)
    neg_ex['user'] = user
    neg_ex['listened'] = False
    
    # Appending
    val_neg_ex.append(neg_ex)

data_val += val_neg_ex
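
The sampling loop above only checks that the sampled record belongs to a different user, so the sampled track may still be one the current user has listened to elsewhere in the data. A minimal sketch of a stricter variant follows. It assumes the same record fields (`user`, `track_name`, `listened`) and that every user has at least one unheard track in the pool; `sample_negatives` and the toy records are hypothetical names used only for illustration.

```python
import copy
import random
from collections import defaultdict


def sample_negatives(records, seed=0):
    """Return one negative record per positive record (illustrative sketch)."""
    rng = random.Random(seed)

    # Tracks each user is known to have listened to
    seen = defaultdict(set)
    for rec in records:
        seen[rec['user']].add(rec['track_name'])

    negatives = []
    for rec in records:
        user = rec['user']
        candidate = rng.choice(records)
        # Resample until the track is new to this user
        # (assumes such a track exists, otherwise this loops forever)
        while candidate['track_name'] in seen[user]:
            candidate = rng.choice(records)

        neg = copy.deepcopy(candidate)
        neg['user'] = user        # pair the sampled track with the current user
        neg['listened'] = False
        negatives.append(neg)
    return negatives


# Hypothetical toy records
toy = [
    {'user': 'u1', 'track_name': 'A', 'listened': True},
    {'user': 'u1', 'track_name': 'B', 'listened': True},
    {'user': 'u2', 'track_name': 'C', 'listened': True},
    {'user': 'u3', 'track_name': 'D', 'listened': True},
]
toy_negatives = sample_negatives(toy)
```

Building the `seen` sets once keeps each rejection test to a constant-time lookup, which matters when sampling hundreds of thousands of negatives.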

In [84]:
# Storing data as dataframe for easier use
train_df = pd.DataFrame.from_records(data_train)
val_df = pd.DataFrame.from_records(data_val)
test_df = pd.DataFrame.from_records(data_test)

In [6]:
print(f"Validation Length: {len(data_val)}, Test Length: {len(data_test)}")

Validation Length: 400000, Test Length: 400000


##Baseline Models

We implement the following baseline models and attempt to exceed the performance of all three:

1. Naive Baseline: Prediction by Popularity
2. Medium-Tier Baseline: Collaborative Filtering
3. Advanced Baseline: Matrix Factorization

In [7]:
def build_relevant_ds(songs: list):
    """
    Builds the lookup structures used by the baseline models.

    @param songs - a list of song records (dicts with 'user', 'track_name', 'artist_name')
    @returns tracks per user, users per track, users per artist, artists per user
    """
    print("Preprocessing started...")
    tracks_per_user, users_per_track = defaultdict(list), defaultdict(list)
    users_per_artist, artists_per_user = defaultdict(list), defaultdict(list)

    # Traversing the records and filling each lookup
    for song in songs:
        user, track, artist = song['user'], song['track_name'], song['artist_name']

        tracks_per_user[user].append(track)
        users_per_track[track].append(user)
        users_per_artist[artist].append(user)
        artists_per_user[user].append(artist)

    return tracks_per_user, users_per_track, users_per_artist, artists_per_user
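
To make the shape of these lookups concrete, here is a small sketch that builds two of them by hand on toy records. The records and the `_toy` names are hypothetical; only the field names (`user`, `track_name`, `artist_name`) come from the data used above.

```python
from collections import defaultdict

# Hypothetical toy records with the same fields as the real data
toy_songs = [
    {'user': 'u1', 'track_name': 'Song A', 'artist_name': 'Artist X'},
    {'user': 'u1', 'track_name': 'Song B', 'artist_name': 'Artist Y'},
    {'user': 'u2', 'track_name': 'Song A', 'artist_name': 'Artist X'},
]

tracks_per_user_toy = defaultdict(list)
users_per_track_toy = defaultdict(list)
for song in toy_songs:
    tracks_per_user_toy[song['user']].append(song['track_name'])
    users_per_track_toy[song['track_name']].append(song['user'])

print(dict(tracks_per_user_toy))  # {'u1': ['Song A', 'Song B'], 'u2': ['Song A']}
print(dict(users_per_track_toy))  # {'Song A': ['u1', 'u2'], 'Song B': ['u1']}
```

The length of each list in the users-per-track lookup is the track's listen count, which is what the popularity baseline ranks by.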
            


In [8]:
# Building the lookup structures from the training data
tracks_per_user, users_per_track, users_per_artist, artists_per_user = build_relevant_ds(data_train)

Preprocessing started...


In [116]:
def predict_by_pop(tracks, most_popular: set) -> list:
    """Predicts that a song was listened to if it is among the most popular songs"""
    return [track in most_popular for track in tracks]


def construct_most_popular(users_per_track: dict) -> set:
    """Naive Baseline: collects the tracks that together account for the top half of all listens"""
    # Most popular list init
    most_popular = []

    # Sorting tracks by popularity (listen count), most popular first
    track_popularities = [(len(users), track) for track, users in users_per_track.items()]
    track_popularities.sort(reverse=True)

    # Computing half of total listens
    half_tot_popularity = sum(pop for pop, _ in track_popularities) // 2

    # init cumulative popularity
    cum_pop, counter = 0, 0

    # While we haven't accounted for half of total listens
    while cum_pop < half_tot_popularity:
        # Appending song and adjusting iterators
        most_popular.append(track_popularities[counter][1])
        cum_pop += track_popularities[counter][0]
        counter += 1

    return set(most_popular)


def acc(labels, predictions) -> float:
    """Accuracy: fraction of predictions that match the labels"""
    # Converting to arrays so lists, Series and arrays all compare elementwise
    return np.mean(np.asarray(predictions) == np.asarray(labels))
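
As a worked example of the cumulative threshold, the sketch below repeats the same steps on four hypothetical tracks with 5, 3, 1 and 1 listens. Half of the 10 total listens is 5, so the single most popular track is enough to reach the threshold. All data and names here are illustrative.

```python
# Hypothetical listen lists per track
toy_users_per_track = {
    'A': ['u1', 'u2', 'u3', 'u4', 'u5'],  # 5 listens
    'B': ['u1', 'u2', 'u3'],              # 3 listens
    'C': ['u4'],                          # 1 listen
    'D': ['u5'],                          # 1 listen
}

# Sort by listen count, most popular first
counts = sorted(((len(users), track) for track, users in toy_users_per_track.items()),
                reverse=True)
half = sum(n for n, _ in counts) // 2     # 10 // 2 = 5

# Add tracks until the running total reaches half of all listens
popular, running = set(), 0
for n, track in counts:
    if running >= half:
        break
    popular.add(track)
    running += n

print(popular)  # {'A'}

# Predict "listened" for popular tracks and score against toy labels
tracks = ['A', 'B', 'C', 'A']
labels = [True, True, False, False]
preds = [t in popular for t in tracks]    # [True, False, False, True]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(accuracy)  # 0.5
```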


In [117]:
most_popular = construct_most_popular(users_per_track)

In [118]:
val_predictions = predict_by_pop(val_df['track_name'], most_popular)
print(f"Val Accuracy: {acc(val_df['listened'], val_predictions)}")

Val Accuracy: 0.4996475
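
One way to see why this lands so close to 0.5: the validation negatives reuse tracks taken from randomly chosen validation records, so positives and negatives share the same track distribution. A rule that looks only at the track then fires equally often on both classes. The simulation below is a sketch on synthetic data (a hypothetical 1,000-track catalogue with skewed popularity), not a rerun of the experiment.

```python
import random

rng = random.Random(0)

# Hypothetical skewed catalogue: track i is drawn with weight 1 / (i + 1)
catalogue = list(range(1000))
weights = [1 / (i + 1) for i in catalogue]

# Tracks of the positive records
positives = rng.choices(catalogue, weights=weights, k=50_000)
# Negatives reuse the tracks of randomly chosen positive records
negatives = [rng.choice(positives) for _ in positives]

# Any track-only rule; here the 20 most likely tracks count as "popular"
popular_sim = set(range(20))

correct = (sum(t in popular_sim for t in positives)
           + sum(t not in popular_sim for t in negatives))
sim_accuracy = correct / (len(positives) + len(negatives))
print(round(sim_accuracy, 3))
```

If a fraction p of positive tracks is popular, roughly the same fraction p of negative tracks is too, so the expected accuracy is 0.5 * p + 0.5 * (1 - p) = 0.5 regardless of p.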


In [119]:
test_predictions = predict_by_pop(test_df['track_name'], most_popular)
print(f"Test Accuracy: {acc(test_df['listened'], test_predictions)}")

Test Accuracy: 0.4973675


### Conclusions from the Naive Baseline

Due to the construction of our validation and test set, prediction by popularity is an extremely ineffective way to approach our predictive task. By construction, our dataset