# Kaggle MSD Challenge 

This notebook presents a sample algorithm for a song recommendation system for the Million Song Dataset Challenge on Kaggle.

In [13]:
DATA_PATH = 'data/'
EVAL_TRIPLETS_TXT = DATA_PATH + 'kaggle_visible_evaluation_triplets.txt'
USERS_TXT = DATA_PATH + 'kaggle_users.txt'
SONGS_TXT = DATA_PATH + 'kaggle_songs.txt'

In the following lines of code, we open the file, create a mapping from a song ID to the number
of times this song appears, and close the file.

In [14]:
song_to_count = dict()
# Each line is a tab-separated (user, song, play count) triplet
with open(EVAL_TRIPLETS_TXT, 'r') as f:
    for line in f:
        _, song, _ = line.strip().split('\t')
        # Count how many triplets mention each song
        song_to_count[song] = song_to_count.get(song, 0) + 1

In [15]:
# Sort song IDs from most to least frequent
songs_ordered = sorted(song_to_count.keys(), key=lambda s: song_to_count[s], reverse=True)
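
As a cross-check on the counting and sorting steps above, the same ranking can be sketched with `collections.Counter` from the standard library. The three sample lines below are made-up triplets, not rows from the dataset, and the names `sample_lines`, `counts`, and `ranked` are illustrative only.

```python
from collections import Counter

# Hypothetical tab-separated (user, song, play count) triplets for illustration
sample_lines = [
    "u1\tSONG_A\t3",
    "u2\tSONG_A\t1",
    "u2\tSONG_B\t7",
]

# Count how many triplets mention each song (the play count column is ignored)
counts = Counter(line.strip().split("\t")[1] for line in sample_lines)

# most_common() returns (song, count) pairs sorted by descending count
ranked = [song for song, _ in counts.most_common()]
print(ranked)
```

As in the cell above, popularity here means the number of triplets in which a song appears, not its total number of plays.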

We will recommend the most popular songs to every user, but we must filter out songs already
in the user’s library. Reopening the triplets file, we will create a map from user to songs they
have listened to.

In [17]:
user_to_songs = dict()
with open(EVAL_TRIPLETS_TXT, 'r') as f:
    for line in f:
        user, song, _ = line.strip().split('\t')
        # Collect the set of songs each user has listened to
        user_to_songs.setdefault(user, set()).add(song)

user_to_songs

{'04cd8d64e32be6c37a609d4cd548d6947c613829': {'SOCVNLL12A8C13B4EA',
  'SODCLQR12A67AE110D',
  'SONIKQT12A8AE475DF',
  'SOOAAJH12A58A7CEB2',
  'SOPMURQ12AF729C6E5',
  'SOZCWQA12A6701C798',
  'SOZEXZO12A8C13D591'},
 'eaff153ffbc6f86a9dcfaab089ca03fb350a0c34': {'SOADLEV12A8C1373DB',
  'SODEGIQ12A6D4FC6E3',
  'SOEKBXL12A6D4FC6FC',
  'SOJWPSD12A6D4F8FEE',
  'SONJSLQ12A6701C52C',
  'SONXULN12A58A7C23C',
  'SOQDFWD12AC468B24F',
  'SOSSCSY12A6D4FB262',
  'SOTKGXW12AB0185635',
  'SOWWHAB12AB018C2A9'},
 '5ef7e36fdbab2ba99a0134a61dff56d810b265a9': {'SOAGSNP12A8C1449E6',
  'SOAXNLP12A8AE47BDE',
  'SOBWDFY12A58A7B175',
  'SODQDMV12A670202A1',
  'SOFCYRW12A6310E05F',
  'SOFDVJJ12A58A7FE8A',
  'SOIDOAR12A8C144ED9',
  'SOLRARY12A8C142636',
  'SOPNSDD12A58A7DB74',
  'SOVLXSI12A6D4FDCA1',
  'SOZVBBA12A8C144EFB'},
 'd6f0c0574ec02ecb64aa352ba33fc24adde4e97f': {'SOASIRW12AF72A804F',
  'SOATZQV12AB01858AB',
  'SOBDASI12A8C13E814',
  'SOCJBOJ12A8C142D05',
  'SODNDDS12AB017F3D1',
  'SODSHAZ12AF729F2CD',
  ...
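
The user-to-songs grouping can also be sketched with `collections.defaultdict`, which removes the explicit membership test. This is a minimal illustration on made-up triplets; `sample_lines` and `user_to_songs_sample` are hypothetical names, not part of the notebook.

```python
from collections import defaultdict

# Hypothetical tab-separated (user, song, play count) triplets for illustration
sample_lines = [
    "u1\tSONG_A\t3",
    "u2\tSONG_A\t1",
    "u2\tSONG_B\t7",
]

# A missing user automatically starts with an empty set
user_to_songs_sample = defaultdict(set)
for line in sample_lines:
    user, song, _ = line.strip().split("\t")
    user_to_songs_sample[user].add(song)

print(dict(user_to_songs_sample))
```

One caveat with this design: indexing a `defaultdict` with an unknown key inserts that key, so a lookup for a user with no history is better written with `.get(user, set())` or an `in` check.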

We now have the songs ordered by popularity and the listening history for each user. To
produce our submission file, we need to load the canonical ordering of users:

In [18]:
with open(USERS_TXT, 'r') as f:
    # Build a real list: in Python 3, map() returns an iterator that cannot be sliced
    canonical_users = [line.strip() for line in f]

In [19]:
canonical_users[:2] 

['fd50c4007b68a3737fe052d5a4f78ce8aa117f3d',
 'd7083f5e1d50c264277d624340edaaf3dc16095b']

We are almost there, but one thing is still missing. To reduce the size of submission files,
we do not submit song IDs such as SOSOUKN12A8C13AB79, but rather their indices in
the canonical list of songs.

In [23]:
with open(SONGS_TXT, 'r') as f:
    # Each line holds a song ID and its index, separated by a space
    song_to_index = dict(line.strip().split(' ') for line in f)

song_to_index

{'SOZJAWB12A6D4FC287': '377526',
 'SOKJWVU12A8C13F329': '164366',
 'SOWAVQV12A58A7BBEF': '332385',
 'SOVTDOB12A6D4F668D': '328349',
 'SOELJVR12A67AD8119': '71368',
 'SOSXIJH12AB017C573': '288602',
 'SOTPCGN12A6D4FE3A6': '298226',
 'SOBZOKZ12A67ADD534': '32026',
 'SONPVXR12A8C144B3F': '212541',
 'SOQFRQU12AB018332F': '250530',
 'SOAMYKP12A8C13ABF9': '7990',
 'SOZJCHJ12AB018775D': '377543',
 'SOJBIWZ12A6D4F643D': '143871',
 'SOGGLWK12A8C13EB76': '100171',
 'SOKXFXD12A8AE46783': '172088',
 'SOKMCXP12AB018615D': '165656',
 'SOMKCUQ12A8C136861': '194519',
 'SOQJAAQ12AB018FD48': '252368',
 'SOAGAKZ12A8C13314B': '3679',
 'SOFZMNF12A8C13E800': '95898',
 'SOOFKFU12A6D4FC27B': '221436',
 'SOAIAOV12AC9072FC3': '4963',
 'SOSQNXN12A6D4F8CD4': '284845',
 'SOYPHRY12AC468A20D': '367335',
 'SOBDDWN12A58A7D540': '18079',
 'SOMWPWE12AC9097C1F': '201680',
 'SOZQKDL12AB01846F6': '381300',
 'SODLBFF12AB0186D51': '55126',
 'SOSYQHG12AC3DF8518': '289288',
 'SOYHJBP12A8C134E33': '363197',
 ...
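
The ID-to-index translation can be sketched on made-up data as follows. It assumes each line of the songs file holds a song ID and its index separated by a single space, which is the format the cell above relies on; the song IDs, index values, and variable names here are hypothetical.

```python
# Hypothetical lines in the same "SONG_ID index" layout as the songs file
sample_song_lines = ["SONG_A 1\n", "SONG_B 2\n", "SONG_C 3\n"]

# Indices stay as strings so they can be joined directly into an output line
sample_song_to_index = dict(line.strip().split(" ") for line in sample_song_lines)

# Translate a list of recommended song IDs into one space-separated line
recommended = ["SONG_C", "SONG_A"]
submission_line = " ".join(sample_song_to_index[s] for s in recommended)
print(submission_line)
```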

Finally, we are ready to create the submission file. For each user in the canonical list,
recommend the songs in order of popularity, except those already in the user’s profile.

In [24]:
with open('submission.txt', 'w') as f:
    for user in canonical_users:
        # Songs this user already has; empty if the user has no visible history
        seen = user_to_songs.get(user, set())
        songs_to_recommend = []
        for song in songs_ordered:
            # Stop once we have 500 recommendations for this user
            if len(songs_to_recommend) >= 500:
                break
            if song not in seen:
                songs_to_recommend.append(song)
        # Transform song IDs to song indexes
        indices = [song_to_index[s] for s in songs_to_recommend]
        # Write line for that user
        f.write(' '.join(indices) + '\n')
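
The filtering step in the loop above can be isolated into a small helper and tried on toy data. This is a sketch, not part of the original notebook: `recommend_popular`, `ranked_songs`, and `history` are hypothetical names, and the limit of 2 stands in for the 500 used above.

```python
from itertools import islice

def recommend_popular(ranked_songs, seen, limit):
    """Return up to `limit` songs from `ranked_songs` that are not in `seen`."""
    unseen = (song for song in ranked_songs if song not in seen)
    # islice stops consuming the generator as soon as `limit` songs are found
    return list(islice(unseen, limit))

# Hypothetical popularity ranking and listening history
ranked_songs = ["SONG_A", "SONG_B", "SONG_C", "SONG_D"]
history = {"u1": {"SONG_A", "SONG_C"}}

# A user with history skips the songs they already have
print(recommend_popular(ranked_songs, history.get("u1", set()), 2))
# A user without history simply gets the top of the ranking
print(recommend_popular(ranked_songs, history.get("u2", set()), 2))
```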