## Prerequisite

These are the datasets I used to prepare the input for scraping:

1. train_triplets --> user listening history (the 3GB one)
2. unique_tracks --> map song id to track id
3. mxm_779k_matches --> map track id to mxm id

Note on mxm_779k_matches: the coverage is better than expected. It matches about 90% of the songs in the taste profile.

## Results 

This notebook produces two output files:

1. seeds --> the unique (song id, track id, mxm id) combinations
2. listening_history --> the complete listening history, keeping only user id, track id and frequency (song id, title, artist etc. are dropped to save memory)

## 1. Load Listening History (Train Triplets) 

In [1]:
import pandas as pd

file_name = "../data/input/train_triplets.txt"
chunksize = 10**8  # rows per chunk; read in chunks to limit parser memory

# DataFrame.append was removed in pandas 2.0, so collect the chunks and concatenate once
chunks = pd.read_csv(file_name, chunksize=chunksize, sep='\t',
                     names=['user_id', 'song_id', 'frequency'])
df = pd.concat(chunks, ignore_index=True)


In [2]:
df.head()

Unnamed: 0,user_id,song_id,frequency
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAPDEY12A81C210A9,1
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBFNSP12AF72A0E22,1
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBFOVM12A58A7D494,1


In [3]:
# count the distinct songs
df.song_id.nunique()

350552

There are 384546 unique songs in the original taste profile.
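If the full file does not fit in memory, the distinct-song count can also be computed chunk by chunk. The sketch below is only an illustration: it uses a tiny in-memory sample with made-up ids in place of `train_triplets.txt`, and the chunk size of 2 is chosen just to force several iterations.

```python
import io
import pandas as pd

# Tiny stand-in for train_triplets.txt (tab-separated, no header); values are made up.
sample = io.StringIO(
    "u1\tSOAAA\t1\n"
    "u1\tSOBBB\t2\n"
    "u2\tSOAAA\t5\n"
    "u3\tSOCCC\t1\n"
)

unique_songs = set()
reader = pd.read_csv(sample, sep="\t", chunksize=2,
                     names=["user_id", "song_id", "frequency"])
for chunk in reader:
    # Only the ids seen so far are kept between iterations, not the rows.
    unique_songs.update(chunk["song_id"].unique())

n_unique = len(unique_songs)
print(n_unique)
```

This trades the convenience of a single DataFrame for a smaller memory footprint, so it only helps when the count is all that is needed.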

## 2. Map to musiXmatch

The mapping takes two steps:

- First, map the song_id to track_id
- Then, map track_id to mxm_id (musiXmatch id)
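The two steps can be sketched on toy data before running them on the real files. All ids below are invented for illustration; the frame names `history`, `song_to_track` and `track_to_mxm` are placeholders, not the variables used later in this notebook.

```python
import pandas as pd

# Toy listening history (made-up ids).
history = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "song_id": ["SO1", "SO1", "SO2"],
    "frequency": [1, 3, 2],
})
# Toy song_id -> track_id lookup.
song_to_track = pd.DataFrame({
    "track_id": ["TR1", "TR2"],
    "song_id": ["SO1", "SO2"],
})
# Toy track_id -> mxm_id lookup; TR2 has no match on purpose.
track_to_mxm = pd.DataFrame({
    "track_id": ["TR1"],
    "mxm_id": [111],
})

# Step 1: attach track_id; an inner join drops songs without a track.
step1 = history.merge(song_to_track, how="inner", on="song_id")
# Step 2: attach mxm_id; a left join keeps unmatched tracks with a missing mxm_id.
step2 = step1.merge(track_to_mxm, how="left", on="track_id")
print(step2)
```

Note that if a song_id appears more than once in the lookup, the merge repeats the matching listening rows once per track, so row counts can grow after step 1.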

In [4]:
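# step 1: song_id -> track_id
# a multi-character separator needs the python engine; passing it explicitly avoids the ParserWarning
df_track = pd.read_csv("../data/input/unique_tracks.txt", header=None, sep='<SEP>', engine='python',
                       names=['track_id', 'song_id', 'artist_name', 'title'])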


In [5]:
len(df_track)

1000000

It includes all 1 million song tracks.
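As a quick check of how the `<SEP>` delimiter is parsed, here is a minimal sketch on two made-up lines in the same four-field layout. The sample rows are invented; only the separator and column names follow the cell above.

```python
import io
import pandas as pd

# Two invented lines in the track_id<SEP>song_id<SEP>artist<SEP>title layout.
sample = io.StringIO(
    "TR0001<SEP>SO0001<SEP>Artist A<SEP>First Title\n"
    "TR0002<SEP>SO0002<SEP>Artist B<SEP>Second Title\n"
)

# Separators longer than one character are handled by the python engine,
# which treats them as a regular expression.
tracks = pd.read_csv(sample, sep="<SEP>", header=None, engine="python",
                     names=["track_id", "song_id", "artist_name", "title"])
print(tracks)
```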

In [6]:
df = df.merge(df_track, how='inner', on='song_id')
df.head()

Unnamed: 0,user_id,song_id,frequency,track_id,artist_name,title
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,TRIQAUQ128F42435AD,Jack Johnson,The Cove
1,7c86176941718984fed11b7c0674ff04c029b480,SOAKIMP12A8C130995,1,TRIQAUQ128F42435AD,Jack Johnson,The Cove
2,76235885b32c4e8c82760c340dc54f9b608d7d7e,SOAKIMP12A8C130995,3,TRIQAUQ128F42435AD,Jack Johnson,The Cove
3,250c0fa2a77bc6695046e7c47882ecd85c42d748,SOAKIMP12A8C130995,1,TRIQAUQ128F42435AD,Jack Johnson,The Cove
4,3f73f44560e822344b0fb7c6b463869743eb9860,SOAKIMP12A8C130995,6,TRIQAUQ128F42435AD,Jack Johnson,The Cove


In [7]:
# step 2: track_id -> mxm_id (lines starting with '#' are comments in this file)
df_m = pd.read_csv("../data/input/mxm_779k_matches.txt", header=None, comment='#', engine='python',
                   sep='<SEP>', names=['track_id','artist_name','title','mxm_id', 'mxm_artist_name', 'mxm_title'])

df_m = df_m[['track_id', 'mxm_id', 'mxm_artist_name', 'mxm_title']]
df_m.head()


Unnamed: 0,track_id,mxm_id,mxm_artist_name,mxm_title
0,TRMMMKD128F425225D,4418550.0,Karkkiautomaatti,Tanssi vaan
1,TRMMMRX128F93187D9,8898149.0,Hudson Mohawke,No One Could Ever
2,TRMMMCH128F425532C,9239868.0,Yerba Brava,Si vos queres
3,TRMMMXN128F42936A5,5346741.0,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":..."
4,TRMMMBB12903CB7D21,2511405.0,Kris Kross,2 Da Beat Ch'yall


In [8]:
# merge on track_id
df_all = df.merge(df_m, how='left', on='track_id')
df_all.song_id.nunique()

350552

In [9]:
# It covers 90% of the songs in the original listening history
347092/384546

0.9026020294061049
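The coverage can also be derived from the merged frame instead of typing the counts by hand. This is a sketch on toy data, assuming that an unmatched track shows up as a missing `mxm_id` after the left merge; the frame name `merged` and its values are made up.

```python
import pandas as pd

# Toy result of the left merge: SO2 and SO4 have no musiXmatch id.
merged = pd.DataFrame({
    "song_id": ["SO1", "SO1", "SO2", "SO3", "SO4"],
    "mxm_id": [111.0, 111.0, None, 333.0, None],
})

total_songs = merged["song_id"].nunique()
# Count the songs that have at least one row with a non-missing mxm_id.
matched_songs = merged.loc[merged["mxm_id"].notna(), "song_id"].nunique()
coverage = matched_songs / total_songs
print(coverage)
```

Which denominator to use (songs in the loaded history or songs in the full taste profile) is a choice; the toy example uses the songs present in the frame.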

In [10]:
# save unique id pairs as seeds

seeds = df_all[['song_id', 'track_id', 'mxm_id']].drop_duplicates()
seeds.reset_index(drop=True, inplace=True)
seeds.to_csv("../data/tmp/seeds.csv", index=False)
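Because unmatched tracks have a missing `mxm_id`, pandas stores the column as float, so ids are written like `4418550.0`. One possible way to keep them as integers in the CSV is the nullable `Int64` dtype. The sketch below uses toy data and is an optional variation, not what the cell above does.

```python
import pandas as pd

# Toy merged frame: TR2 has no musiXmatch id.
merged = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "song_id": ["SO1", "SO1", "SO2"],
    "track_id": ["TR1", "TR1", "TR2"],
    "mxm_id": [111.0, 111.0, float("nan")],
})

seeds = (merged[["song_id", "track_id", "mxm_id"]]
         .drop_duplicates()
         .reset_index(drop=True))
# Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>.
seeds["mxm_id"] = seeds["mxm_id"].astype("Int64")

csv_text = seeds.to_csv(index=False)
print(csv_text)
```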

In [11]:
# save the listening history

df_all = df_all[['user_id', 'track_id', 'frequency']]

In [12]:
df_all.to_csv("../data/tmp/listening_history.csv", index=False)