<a href="https://colab.research.google.com/github/Kitwradr/Project-Automatic-playlist-continuation/blob/main/datapreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing

## Imports

In [1]:
import pandas as pd
import numpy as np

from collections import defaultdict
from tqdm import tqdm
import random

from sklearn.model_selection import train_test_split

## Reading the necessary data files

**If you are running the following cells, please make sure the paths to the data files are correct.**

Interaction data

In [4]:
interactions=pd.read_csv('/content/drive/MyDrive/dsc210/project/interactions.csv')
# our work is done on 2% of the data (about 1.3 million interactions)
interactions=interactions.loc[:0.02*interactions.shape[0]].drop(['position'],axis=1)

Tracks Metadata

In [10]:
tracks=pd.read_csv('/content/drive/MyDrive/dsc210/project/tracks.csv')
tracks_dict=tracks.to_dict()

In [6]:
print(f'Number of interactions: {interactions.shape[0]}')
print(f'Number of playlists: {len(np.unique(interactions.pid))}')
print(f'Number of tracks: {len(np.unique(interactions.tid))}')

Number of interactions: 1326929
Number of playlists: 19904
Number of tracks: 262338


## Data Split

In [7]:
# train, validation and test split in 8:1:1 ratio
train, test = train_test_split(interactions,test_size=0.2)
valid, test = train_test_split(test,test_size=0.5)

## Necessary Data Structures

In [11]:
# set of all playlists present in the train set
train_playlists=set(np.unique(train.pid))
# set of all tracks present in the train set
train_tracks=set(np.unique(train.tid))
# set of all interactions in train set
unique_interaction_set=set([(p,t) for p,t in train.values])
# dictionary (playlist_id: number of tracks in playlist)
tracksInPlaylist=defaultdict(int,train.pid.value_counts().to_dict())

In [12]:
# dictionary (playlist: set of tracks in playlist)
tracksPerPlaylist=defaultdict(set)
# dictionary (track: set of playlist with track)
playlistsPerTrack=defaultdict(set)
# dictionary (playlist: dictionary (album: number of tracks from album in playlist))
artistCountInPlaylist=defaultdict(lambda: defaultdict(int))
# dictionary (playlist: dictionary (artist: number of tracks by artist in playlist))
albumCountInPlaylist=defaultdict(lambda: defaultdict(int))
# set of all tracks in train set
allTracks=set()
for p,t in tqdm(train[['pid','tid']].values):
  tracksPerPlaylist[p].add(t)
  playlistsPerTrack[t].add(p)
  allTracks.add(t)
  artistCountInPlaylist[p][tracks_dict['artist_uri'][t]]+=1
  albumCountInPlaylist[p][tracks_dict['album_uri'][t]]+=1

100%|██████████| 1061543/1061543 [00:11<00:00, 89952.74it/s]


## Generating negative samples for train set

For each sample in train set we generate one negative sample (a playlist-track pair where the track is not in the playlist)

A label of 1 is given for a positive sample and 0 for a negative sample.

It takes about 70 minutes to generate all the samples.
The generated data is saved for later.

In [13]:
negatives_per_sample=1
# playlist ids, track ids, meta data - [album_share, artist_share], labels
p_input,t_input,m_input,labels=[],[],[],[]
for p,t in tqdm(train.values):
  p_input.append(p)
  t_input.append(t)
  m_input.append([albumCountInPlaylist[p][tracks_dict['album_uri'][t]]/tracksInPlaylist[p],
                  artistCountInPlaylist[p][tracks_dict['artist_uri'][t]]/tracksInPlaylist[p]])
  labels.append(1)
  for i in range(negatives_per_sample):
    rand_t=random.sample(train_tracks,1)[0]
    while (p,rand_t) in unique_interaction_set:
      rand_t=random.sample(train_tracks,1)[0]
    p_input.append(p)
    t_input.append(rand_t)
    m_input.append([albumCountInPlaylist[p][tracks_dict['album_uri'][rand_t]]/tracksInPlaylist[p],
                    artistCountInPlaylist[p][tracks_dict['artist_uri'][rand_t]]/tracksInPlaylist[p]])
    labels.append(0)

100%|██████████| 1061543/1061543 [1:11:22<00:00, 247.89it/s]


In [14]:
# saving the data
pd.DataFrame({
    'pid':p_input,
    'tid':t_input,
    'label':labels,
    'album-artist share': m_input
}).to_csv('/content/drive/MyDrive/dsc210/project/train_with_negative.csv',index=False)

## Preparing the validation and test sets for evaluation

We will be evaluating one interaction for each playlist, i.e. one track per playlist to reduce the evaluation time.

In [15]:
valid=valid.drop_duplicates(subset=['pid']).copy()
test=test.drop_duplicates(subset=['pid']).copy()

For each playlist-track pair in the validation set we sample 10 tracks that we know are not present in the playlist. 
The models trained will be evaluated based on where it ranks the positive track among the 11 tracks for a given playlist.

In [16]:
neg_valid=[]
for p,t in tqdm(valid.values):
  negs=[]
  for i in range(10):
    rand_t=random.sample(train_tracks,1)[0]
    while (p,rand_t) in unique_interaction_set and rand_t not in negs:
      rand_t=random.sample(train_tracks,1)[0]
    negs.append(rand_t)
  neg_valid.append(negs)

100%|██████████| 18714/18714 [11:18<00:00, 27.58it/s]


In [22]:
# saving the validation set for later use
valid['negatives']=neg_valid
valid.to_csv('/content/drive/MyDrive/dsc210/project/valid_negatives.csv',index=False)

We do the same process on test set.

In [18]:
neg_test=[]
for p,t in tqdm(test.values):
  negs=[]
  for i in range(10):
    rand_t=random.sample(train_tracks,1)[0]
    while (p,rand_t) in unique_interaction_set and rand_t not in negs:
      rand_t=random.sample(train_tracks,1)[0]
    negs.append(rand_t)
  neg_test.append(negs)

100%|██████████| 18704/18704 [10:37<00:00, 29.34it/s]


In [23]:
# saving the test set for later use
test['negatives']=neg_test
test.to_csv('/content/drive/MyDrive/dsc210/project/test_negatives.csv',index=False)