# Modeling

In this section, we propose several models to predict whether or not a song has been listened to. We begin with the implementation of several baselines, and proceed with a more advanced model.

In [7]:
!pip install spotipy
!pip install nimfa

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nimfa
  Downloading nimfa-1.4.0-py2.py3-none-any.whl (4.7 MB)
Installing collected packages: nimfa
Successfully installed nimfa-1.4.0


In [8]:
import pandas as pd
import numpy as np
import json
import re
import os
from collections import defaultdict
import copy
import spotipy
import nimfa

In [5]:
def build_relevant_ds(songs: list):
    """
    Builds lookup structures from a list of listening records.

    @param songs - a list of song dictionaries, each with 'user', 'track_name', 'artist_name', and 'album_name' keys
    @returns four dictionaries of lists: tracks per user, users per track, users per artist, artists per user
    """
    
    def process_uri(uri:str):
        """URI Processing method"""
        return uri.split(":")[2]
        
    print("Preprocessing started...")
    tracks_per_user, users_per_track, users_per_artist, artists_per_user = defaultdict(list), defaultdict(list), defaultdict(list), defaultdict(list)
    
    # Traversing through data and preprocessing
    for song in songs:

      # Obtaining user
      user = song['user']

      # obtaining necessary data
      track, artist, album = song['track_name'], song['artist_name'], song['album_name']
      
      # Appending data to data structures
      tracks_per_user[user].append(track)
      users_per_track[track].append(user)
      users_per_artist[artist].append(user)
      artists_per_user[user].append(artist)
            
    return tracks_per_user, users_per_track, users_per_artist, artists_per_user
            


In [29]:
# Reading in training and testing data
with open('data_train.json', 'r') as train_reader:
  data_train_val = json.load(train_reader)

with open('data_test.json', 'r') as test_reader:
  data_test= json.load(test_reader) 

In [30]:
data_train, data_val = data_train_val[:600_000], data_train_val[600_000:]

In [31]:
val_tracks_per_user, val_users_per_track, val_users_per_artist, val_artists_per_user = build_relevant_ds(data_val)

Preprocessing started...


#### Validation, Test Set Construction

The validation and test sets are balanced: half of the user-item pairs are positive (the user listened to the song) and half are negative (the user did not listen to the song).
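
The construction described above can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's exact cell: the helper `sample_negatives` and the toy records are hypothetical, and the sketch assumes every user has at least one track in the pool that they have not listened to (otherwise the rejection loop would not terminate).

```python
import copy
import random


def sample_negatives(records, seed=0):
    """Pair each positive record with one random track the same user has not listened to."""
    rng = random.Random(seed)

    # Set of tracks per user, so membership checks are O(1)
    tracks_by_user = {}
    for rec in records:
        tracks_by_user.setdefault(rec["user"], set()).add(rec["track_name"])

    negatives = []
    for rec in records:
        user = rec["user"]
        candidate = rec
        # Rejection sampling: redraw until the track is outside the user's history
        while candidate["track_name"] in tracks_by_user[user]:
            candidate = rng.choice(records)
        neg = copy.deepcopy(candidate)
        neg["user"] = user
        neg["listened"] = False
        negatives.append(neg)
    return negatives


# Toy positives (hypothetical data)
positives = [
    {"user": "u1", "track_name": "A", "listened": True},
    {"user": "u1", "track_name": "B", "listened": True},
    {"user": "u2", "track_name": "C", "listened": True},
    {"user": "u3", "track_name": "D", "listened": True},
]
negatives = sample_negatives(positives)
balanced = positives + negatives
```

Because exactly one negative is drawn per positive, the combined list is half positive and half negative by construction.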

In [32]:
# Adding negative records to validation data (one sampled negative per positive)
val_neg_ex = []

# Sampling negatives
for ex in data_val:
    user = ex['user']
    random_song = ex
    
    # Resampling until we draw a track outside this user's validation tracks
    while random_song['track_name'] in val_tracks_per_user[user]:
        random_song = data_val[np.random.randint(0, len(data_val))]
    
    # Negative example modification
    neg_ex = copy.deepcopy(random_song)
    neg_ex['listened'] = False

    # Group user with negative example
    neg_ex['user'] = user
    
    # Appending
    val_neg_ex.append(neg_ex)

data_val += val_neg_ex

In [33]:
test_tracks_per_user, test_users_per_track, test_users_per_artist, test_artists_per_user = build_relevant_ds(data_test)

Preprocessing started...


In [34]:
# Adding negative records to test data (one sampled negative per positive)
test_neg_ex = []

# Sampling negatives
for ex in data_test:
    user = ex['user']
    random_song = ex
    
    # Resampling until we draw a track outside this user's test tracks
    while random_song['track_name'] in test_tracks_per_user[user]:
        random_song = data_test[np.random.randint(0, len(data_test))]
    
    # Negative example modification
    neg_ex = copy.deepcopy(random_song)
    neg_ex['listened'] = False

    # Group user with negative example
    neg_ex['user'] = user
    
    # Appending
    test_neg_ex.append(neg_ex)

data_test += test_neg_ex

In [35]:
# Storing data as dataframe for easier use
train_df = pd.DataFrame.from_records(data_train)
val_df = pd.DataFrame.from_records(data_val)
test_df = pd.DataFrame.from_records(data_test)

In [36]:
print(f"Validation Length: {len(data_val)}, Test Length: {len(data_test)}")

Validation Length: 400000, Test Length: 400000


## Baseline Models

We implement the following baseline models, and attempt to exceed the performance of all 3:

1. Totally Naive Baseline: Prediction by Popularity
2. Medium Tier Baseline: Collaborative Filtering
3. Advanced Baseline: Regression Modeling

In [37]:
# Necessary data structures
tracks_per_user, users_per_track, users_per_artist, artists_per_user = build_relevant_ds(data_train)

Preprocessing started...


### Motivating Naive Baseline: Prediction by Popularity (Unpersonalized Recommendation)

In [42]:
def predict_by_pop(tracks: np.array, most_popular: set):
  """Predicts that a song was listened to if it is among the most popular songs"""
  return [track in most_popular for track in tracks]

def construct_most_popular(users_per_track: dict) -> set:
    """Naive Baseline: collects the most popular tracks that together account for half of all listens"""
    # Most popular list init
    most_popular = []

    # Sorting tracks by popularity
    track_popularities = [(len(users_per_track[track]), track) for track in users_per_track]
    track_popularities.sort(reverse = True)

    # Computing half of total listens
    half_tot_popularity = sum([len(users_per_track[track]) for track in users_per_track]) // 2
    
    # init cumulative popularity
    cum_pop, counter = 0,0

    # While haven't accounted for half of total listens
    while(cum_pop < half_tot_popularity):
      # Appending song and adjusting iterators
      most_popular.append(track_popularities[counter][1])
      cum_pop += track_popularities[counter][0]
      counter += 1

    return set(most_popular)


def acc(labels: np.array, predictions: np.array):
  """Accuracy computation: fraction of predictions that match the labels"""
  return np.mean(np.asarray(predictions) == np.asarray(labels))


In [43]:
most_popular = construct_most_popular(users_per_track)

In [44]:
val_predictions = predict_by_pop(val_df['track_name'], most_popular)
print(f"Val Accuracy: {acc(val_df['listened'], val_predictions)}")

Val Accuracy: 0.50145


In [45]:
test_predictions = predict_by_pop(test_df['track_name'], most_popular)
print(f"Test Accuracy: {acc(test_df['listened'], test_predictions)}")

Test Accuracy: 0.5007125


### Conclusions from extremely naive Baseline

Due to the construction of our validation and test sets, prediction by popularity is an entirely ineffective way to build a decision boundary: it scores about 50% on both sets, no better than chance.

Although the top 55% of popular songs (55% was the best-performing cutoff on the validation set) account for more of the positive instances, they account for a correspondingly larger share of the randomly sampled negative instances.

Because we were just as likely to draw a popular song as a negative instance as to observe one as a positive instance, prediction by popularity provides no new information. This suggests that any improvement over random chance can **only** come from personalization.
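
The argument above can be checked with a small simulation. This is a sketch under stated assumptions: listens are drawn from a hypothetical long-tailed popularity distribution with no user preferences at all, and negatives are drawn from the same listen pool, so popular tracks are sampled in proportion to their popularity. The catalogue size and weights are made up for illustration.

```python
import random

rng = random.Random(0)

# Hypothetical catalogue: 1,000 tracks with long-tailed popularity (weight ~ 1/rank)
tracks = list(range(1000))
weights = [1.0 / (rank + 1) for rank in tracks]

# Positives: observed listens. Negatives: tracks drawn from the same listen pool.
n = 20_000
positives = rng.choices(tracks, weights=weights, k=n)
negatives = rng.choices(tracks, weights=weights, k=n)

# "Most popular" set: the top tracks that together account for half of all listens
counts = {}
for t in positives:
    counts[t] = counts.get(t, 0) + 1
most_popular, covered = set(), 0
for track, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    if covered >= n // 2:
        break
    most_popular.add(track)
    covered += count

# Predict "listened" iff the track is popular, then score on the balanced set
correct = sum(t in most_popular for t in positives)
correct += sum(t not in most_popular for t in negatives)
accuracy = correct / (2 * n)
print(f"Accuracy: {accuracy:.3f}")
```

Roughly half of the positives are popular and roughly half of the negatives are popular as well, so the correct and incorrect predictions cancel and the accuracy stays close to chance.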

### Basic Personalization Baseline: Prediction by Collaborative Filtering (Personalized Recommendation)

In [55]:
def jaccard(set1: set, set2: set):
  """Jaccard similarity metric"""
  inter = len(set1.intersection(set2))
  union = len(set1.union(set2))
  return inter / union if union !=0 else 0

In [73]:
def collab_filter_predict(users: list, tracks: list, tracks_per_user: dict, users_per_track:dict):
  """Semi-Personalized Baseline: predicts a listen if the track is similar enough to any other track the user has listened to"""
  predictions = []
  for track_user, track in zip(users, tracks):

    # Similarity vector
    similarities = []

    # Obtaining set of user other tracks
    user_tracks = tracks_per_user[track_user]
    track_users = set(users_per_track[track])

    # Seeing if other tracks user listens to are similar
    for external_track in user_tracks:

      # Ensuring don't use track itself
      if external_track == track:
        continue

      # Obtaining other track users
      other_track_users = set(users_per_track[external_track])

      # Computing cross track similarity
      similarities.append(jaccard(other_track_users, track_users))

    # Predict a listen when the best match exceeds the similarity threshold
    predictions.append(len(similarities) != 0 and max(similarities) > 0.005)
  
  return predictions


In [74]:
val_predictions = collab_filter_predict(val_df['user'], val_df['track_name'], tracks_per_user, users_per_track)
print(f"Val Accuracy: {acc(val_df['listened'], val_predictions)}")

Val Accuracy: 0.663725


In [75]:
test_predictions = collab_filter_predict(test_df['user'], test_df['track_name'], tracks_per_user, users_per_track)
print(f"Test Accuracy: {acc(test_df['listened'], test_predictions)}")

Test Accuracy: 0.6981475


This thresholded collaborative filtering already does reasonably well, reaching 66.4% validation and 69.8% test accuracy. Because half of the examples in the validation and test sets are positive, we can instead label only the 200,000 highest-scoring pairs as positive. Let's check whether this ranking approach does better.
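
One way to realize the ranking idea is to sort the examples by score and label exactly the top half as positive. The sketch below uses a hypothetical helper `predict_top_half` on made-up scores; breaking ties by original position is an assumption made for illustration and is not necessarily how the notebook resolves them.

```python
def predict_top_half(scores):
    """Label the highest-scoring half of the examples as positive."""
    k = len(scores) // 2
    # Indices sorted by descending score; sorted() is stable, so ties keep input order
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(scores))]


# Made-up maximum-similarity scores for six user-track pairs
scores = [0.20, 0.0, 0.03, 0.0, 0.0, 0.01]
predictions = predict_top_half(scores)
```

Unlike a fixed cutoff such as 0.005, this always yields as many positive predictions as there are expected positives, at the cost of arbitrary decisions among tied scores.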

In [115]:
def collab_filter_predict_structurally(users: list, tracks: list, tracks_per_user: dict, users_per_track:dict):
  """Semi-Personalized Baseline: ranks pairs by maximum track similarity and predicts a listen for those above the median score"""
  max_similarities = []
  for track_user, track in zip(users, tracks):

    # Similarity vector
    sims = []

    # Obtaining set of user other tracks
    user_tracks = tracks_per_user[track_user]
    track_users = set(users_per_track[track])

    # Seeing if other tracks user listens to are similar
    for external_track in user_tracks:

      # Ensuring don't use track itself
      if external_track == track:
        continue

      # Obtaining other track users
      other_track_users = set(users_per_track[external_track])

      # Computing cross track similarity
      sims.append(jaccard(other_track_users, track_users))

    # Record the maximum similarity, or 0 if the user has no other tracks
    max_similarities.append(max(sims) if len(sims) > 0 else 0)
  
  # Use the median score as the threshold so that at most half of the pairs are predicted positive
  sorted_max_sims = sorted(max_similarities, reverse=True)
  confidence_thresh = sorted_max_sims[len(max_similarities)//2]

  predictions = [sim > confidence_thresh for sim in max_similarities]
  
  return predictions

In [116]:
val_predictions = collab_filter_predict_structurally(val_df['user'], val_df['track_name'], tracks_per_user, users_per_track)
print(f"Val Accuracy: {acc(val_df['listened'], val_predictions)}")


Val Accuracy: 0.6425425


In [117]:
test_predictions = collab_filter_predict_structurally(test_df['user'], test_df['track_name'], tracks_per_user, users_per_track)
print(f"Test Accuracy: {acc(test_df['listened'], test_predictions)}")

Test Accuracy: 0.690045


Surprisingly, the ranking approach did not improve the model: validation accuracy fell from 0.664 to 0.643 and test accuracy from 0.698 to 0.690. Several factors could explain this. For one, we suffer from a massive sparsity problem: many of our tracks have very few listeners, and hence zero overlap with other tracks. This is part of the cold start problem: tracks with very few listeners are very unlikely to overlap with other tracks. We must find a better way to handle such tracks, as they are an integral part of our recommender system.
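
The sparsity effect can be made concrete with a toy example. The listener sets below are hypothetical, and the `jaccard` helper is restated so that the snippet is self-contained. A track with a single listener has zero overlap with every track that listener has not also played, so its score carries no ranking information.

```python
def jaccard(set1, set2):
    """Jaccard similarity: |intersection| / |union|, defined as 0 for two empty sets."""
    union = len(set1 | set2)
    return len(set1 & set2) / union if union else 0.0


# Hypothetical listener sets: two well-covered tracks and two rarely played ones
users_per_track = {
    "hit_1": {"u1", "u2", "u3", "u4"},
    "hit_2": {"u2", "u3", "u4", "u5"},
    "rare_1": {"u6"},
    "rare_2": {"u7"},
}

# Well-covered tracks overlap substantially; rare tracks overlap with nothing
sim_hits = jaccard(users_per_track["hit_1"], users_per_track["hit_2"])
sim_rare = jaccard(users_per_track["rare_1"], users_per_track["hit_1"])

# Fraction of track pairs with zero similarity, a rough sparsity indicator
names = sorted(users_per_track)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
zero_fraction = sum(
    jaccard(users_per_track[a], users_per_track[b]) == 0 for a, b in pairs
) / len(pairs)
print(sim_hits, sim_rare, zero_fraction)
```

When most scores are tied at zero, a median-based threshold cannot separate them, which is one plausible reason the ranking variant offers no gain over a fixed cutoff.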

### Advanced Baseline: Regression Modeling
Regression modeling can use the same similarity feature as collaborative filtering while also accounting for the rest of the data, yielding a model that captures both user/item similarity and other relevant features.
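
As a rough sketch of what such a model could look like, the snippet below fits a logistic regression by gradient descent on two features per user-track pair: the maximum Jaccard similarity and a normalized track popularity. The feature choice, the toy data, and the helper names `fit_logistic` and `predict_proba` are all assumptions made for illustration; the notebook's actual regression model is not shown in this section.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for features, label in zip(X, y):
            row = [1.0] + list(features)
            error = sigmoid(sum(wi * xi for wi, xi in zip(w, row))) - label
            for j, xj in enumerate(row):
                grad[j] += error * xj
        w = [wi - lr * gj / len(X) for wi, gj in zip(w, grad)]
    return w


def predict_proba(w, features):
    """Predicted probability that the user listened to the track."""
    row = [1.0] + list(features)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)))


# Toy rows (hypothetical): (max Jaccard similarity, normalized popularity)
X = [(0.8, 0.9), (0.6, 0.2), (0.7, 0.5), (0.1, 0.9), (0.0, 0.2), (0.2, 0.5)]
y = [1, 1, 1, 0, 0, 0]  # 1 = listened, 0 = not listened

w = fit_logistic(X, y)
predictions = [predict_proba(w, row) > 0.5 for row in X]
```

In this toy data popularity is identically distributed across the two classes, so the fitted model has to rely on the similarity feature, mirroring the earlier observation that popularity alone carries no signal on a balanced set.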
