# Goal of the Notebook

This notebook covers everything related to the heuristic-based recommender part of the project.

The Heuristic Recommender has three parts:

- Recommending Songs Users have in Common
- Recommending Songs from Artists Users have in Common
- Recommending Songs from Popular Artists in Genres Users have in Common

To determine what users have in common, we don't simply compute an intersection: this wouldn't scale. For a large number of users, the intersection would almost always be empty and we would have essentially no recall.

Instead, we use a voting system: each user casts one vote for every item they listen to, and we keep the items that collect enough votes.
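
The difference between a strict intersection and voting can be sketched on toy data. This is only an illustration, not the code used in the rest of the notebook: the function name `vote_for_items`, the `min_votes` threshold and the listening histories are assumptions made up for the example.

```python
from collections import Counter

def vote_for_items(user_items, min_votes):
    """Return the items owned by at least `min_votes` users, most voted first."""
    votes = Counter()
    for items in user_items.values():
        # each user votes at most once per item
        votes.update(set(items))
    # sort by descending vote count, then alphabetically to break ties
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, n in ranked if n >= min_votes]

# made-up listening histories (hypothetical users)
user_items = {
    "u1": {"Kids", "Electric Feel", "Time To Pretend"},
    "u2": {"Kids", "Electric Feel"},
    "u3": {"Time To Pretend", "Weekend Wars"},
}

strict = set.intersection(*user_items.values())   # no song is shared by all three users
voted = vote_for_items(user_items, min_votes=2)   # songs shared by at least two users
print(strict)
print(voted)
```

With only three users the intersection is already empty, while the vote with a threshold of two still returns three songs.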

### Loading Data

We load the datasets and draw a sample of users (relatively large, to check that our method scales).

In [1]:
import pandas as pd
from functools import reduce

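# track-level features (including genres) and the listening history
famous_tracks = pd.read_csv('data/features.csv')
df_1kfamous   = pd.read_csv('data/df_1kfamous.csv',index_col=0)
# sample 10 rows and keep their user ids
# (rows are sampled, so a user with several sampled rows appears more than once)
test_users = df_1kfamous['user-id'].sample(n=10).values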

## Common Songs and artists

### Common artists

In this part, we compute the artists which are listened to by the most users, along with the mean number of plays per user. 

We then take the Top-k artists from this table and sample a single song from each of them, which gives a playlist of k songs.


In [2]:
# restrict the listening history to the sampled users
df_play_per_user = df_1kfamous[df_1kfamous['user-id'].isin(test_users)]
# distinct artists listened to by each user
df_artist_per_user = pd.DataFrame(df_play_per_user.groupby('user-id')['artist-name'].apply(set).apply(list))

# total plays per (user, artist), then mean over the users who listen to the artist
df_plays_by_artist = (df_play_per_user[['artist-name','plays','user-id']]
 .groupby(['user-id','artist-name'])
 .sum()
 .reset_index()[['artist-name','plays']]
 .groupby('artist-name').mean()
 .sort_values('plays',ascending=False))

# number of users that listen to each artist
df_nb_users = (df_artist_per_user
               .reset_index()
               .explode('artist-name')
               .groupby('artist-name')
               .count()
               .sort_values('user-id',ascending=False))

# merge both tables: 'user-id' holds the number of users, 'plays' the mean plays per user
df_nb_users.merge(right=df_plays_by_artist,on='artist-name').sort_values(by=['user-id','plays'],ascending = False).head(5)

| artist-name | user-id | plays |
|---|---|---|
| The Killers | 9 | 8.777778 |
| Mgmt | 8 | 33.75 |
| Radiohead | 7 | 41.0 |
| Röyksopp | 7 | 34.571429 |
| Portishead | 7 | 29.857143 |
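
Turning the top artists into a playlist could look like the sketch below. It is a minimal example under assumptions: the helper `playlist_from_top_artists`, the `seed` parameter and the toy `plays` table (including its track labels) are made up for illustration, and `DataFrameGroupBy.sample` needs pandas 1.1 or later.

```python
import pandas as pd

# made-up plays table reusing the column names of df_1kfamous
plays = pd.DataFrame({
    "user-id": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "artist-name": ["Mgmt", "Radiohead", "Mgmt", "Radiohead", "Mgmt", "Portishead"],
    "track-name": ["song-a", "song-c", "song-b", "song-d", "song-a", "song-e"],
    "plays": [5, 2, 3, 7, 1, 4],
})

def playlist_from_top_artists(plays, top_artists, seed=0):
    """Sample one distinct track for each artist in `top_artists`, keeping their order."""
    candidates = (plays[plays["artist-name"].isin(top_artists)]
                  .drop_duplicates(["artist-name", "track-name"]))
    # one random row per artist
    sampled = candidates.groupby("artist-name").sample(n=1, random_state=seed)
    # restore the ranking order of the artists
    return sampled.set_index("artist-name").loc[top_artists, "track-name"].tolist()

top_artists = ["Mgmt", "Radiohead"]  # stands in for the first k rows of the table above
playlist = playlist_from_top_artists(plays, top_artists)
print(playlist)
```

Dropping duplicate (artist, track) pairs before sampling gives every track of an artist the same chance, regardless of how many users played it.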


### Common songs 

In this part, we compute the songs which are listened to by the most users, along with the mean number of plays per user. 

We then take the Top-k rows of this table to get a playlist of k songs.


In [3]:
# get all songs listened to by user
df_song_per_user = pd.DataFrame(df_play_per_user.groupby('user-id')['track-name'].apply(set).apply(list))

# get mean number of plays per user for each track in the group
df_plays_by_track = (df_play_per_user[['track-name','plays','user-id']]
 .groupby(['user-id','track-name'])
 .sum()
 .reset_index()[['track-name','plays']]
 .groupby('track-name').mean()
 .sort_values('plays',ascending=False))

# get number of users that listen to track
df_nb_users_play = (df_song_per_user
                    .reset_index()
                    .explode('track-name')
                    .groupby('track-name')
                    .count()
                    .sort_values('user-id',ascending=False))

# merge both previous
df_nb_users_play.merge(right=df_plays_by_track,on='track-name').sort_values(by=['user-id','plays'],ascending = False).head(5)

| track-name | user-id | plays |
|---|---|---|
| Electric Feel | 8 | 5.125 |
| Time To Pretend | 7 | 8.714286 |
| Kids | 7 | 6.142857 |
| Weekend Wars | 7 | 2.857143 |
| 4Th Dimensional Transition | 7 | 2.0 |
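
The artist and song cells above follow the same pattern: count the users per item, average the per-user plays, and sort on both. A compact way to express this, as a sketch only, is a single helper using named aggregation. The name `rank_by_votes`, the output columns `n_users` and `mean_plays`, and the toy data are assumptions introduced here.

```python
import pandas as pd

def rank_by_votes(plays, item_col, k=5):
    """Rank items by number of distinct users, then by mean plays per user."""
    # total plays of each item for each user
    per_user = plays.groupby(["user-id", item_col], as_index=False)["plays"].sum()
    table = (per_user.groupby(item_col)
             .agg(n_users=("user-id", "nunique"), mean_plays=("plays", "mean"))
             .sort_values(["n_users", "mean_plays"], ascending=False))
    return table.head(k)

# made-up plays table reusing the column names of df_1kfamous
plays = pd.DataFrame({
    "user-id": ["u1", "u1", "u2", "u3", "u3"],
    "track-name": ["Kids", "Kids", "Kids", "Weekend Wars", "Kids"],
    "plays": [2, 3, 1, 9, 4],
})

top_tracks = rank_by_votes(plays, "track-name", k=2)
print(top_tracks)
```

Sorting on the user count first means a track played heavily by a single user ("Weekend Wars" here) still ranks below a track shared by several users.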


## Popular songs

In this part, we compute the genres which are listened to by the most users, along with the median number of plays per user.

We then take the Top-k genres from this table and sample a single artist from each genre. Then, from each sampled artist, we sample a single song to get a playlist of k songs.


In [4]:
import ast

# get all songs listened to by users
df_users = df_1kfamous[df_1kfamous['user-id'].isin(test_users)]

# per track genres
df_music_genre = famous_tracks[['musicbrainz-track-id','genres']]

# merge genres by users
df_merge = df_users.merge(right=df_music_genre,left_on='track-id',right_on='musicbrainz-track-id')
# parse the stringified genres without eval
# (assumes each cell holds a plain Python literal such as a list or a set)
df_merge.genres = df_merge.genres.apply(lambda x : list(ast.literal_eval(x)))

# one entry per genre
df_merge_exploded = df_merge.explode('genres')

# get median number of plays per user over the group of users
df_plays_by_genre = (df_merge_exploded[['genres','plays','user-id']]
 .groupby(['user-id','genres'])
 .sum()
 .reset_index()[['genres','plays']]
 .groupby('genres').median()
 .sort_values('plays',ascending=False))

# get number of users that listen to the genre
df_nb_users_genre = (df_merge_exploded.groupby('user-id')['genres'].apply(set).apply(list)
                     .reset_index()
                     .explode('genres')
                    .groupby('genres')
                    .count()
                    .sort_values('user-id',ascending=False))

# merge two previous df 
df_top_genres = df_nb_users_genre.merge(right=df_plays_by_genre,on='genres').sort_values(by=['user-id','plays'],ascending = False).head(20)
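
The two sampling steps that follow the genre ranking (one artist per genre, then one song per artist) can be sketched with plain dictionaries. Everything here is illustrative: the function `playlist_from_genres`, the `seed` parameter, the genre labels and the track labels are made up, and artists are identified by name for readability rather than by MusicBrainz id.

```python
import random

def playlist_from_genres(top_genres, artists_by_genre, tracks_by_artist, seed=0):
    """Pick one artist per genre, then one track of that artist."""
    rng = random.Random(seed)
    playlist = []
    for genre in top_genres:
        # keep artists with at least one known track; sorting makes set sampling reproducible
        artists = sorted(a for a in artists_by_genre.get(genre, ())
                         if tracks_by_artist.get(a))
        if not artists:
            continue  # no usable artist for this genre
        artist = rng.choice(artists)
        track = rng.choice(sorted(tracks_by_artist[artist]))
        playlist.append((genre, artist, track))
    return playlist

# made-up lookup tables
artists_by_genre = {"rock": {"Radiohead", "The Killers"}, "electronic": {"Röyksopp"}}
tracks_by_artist = {
    "Radiohead": ["song-a", "song-b"],
    "The Killers": ["song-c"],
    "Röyksopp": ["song-d"],
}

playlist = playlist_from_genres(["rock", "electronic", "jazz"],
                                artists_by_genre, tracks_by_artist)
print(playlist)
```

A genre without any known artist ("jazz" in this toy input) is skipped, so the playlist can end up shorter than k.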

### Dictionary of artists per genre

We create and save a dictionary that maps each genre to the set of artists in that genre.

In [5]:
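import ast

tracks = famous_tracks.copy()
# parse the stringified genres without eval
# (assumes each cell holds a plain Python literal such as a list or a set)
tracks.genres = tracks.genres.apply(lambda x : list(ast.literal_eval(x)))
# one row per (track, genre), then the set of artist ids for each genre
genre_dict = (tracks.explode('genres')
              .groupby('genres')['musicbrainz-artist-id']
              .apply(set)
              .to_dict())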

In [None]:
#import pickle

#with open('data/genre_artists.pkl', 'wb') as f:
#    pickle.dump(genre_dict, f)
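
For reference, a round trip through `pickle` could look like the sketch below. It writes to a temporary directory instead of `data/`, and the dictionary content is made-up placeholder data, not the real `genre_dict`.

```python
import os
import pickle
import tempfile

# placeholder dictionary with the same shape as genre_dict (genre -> set of artist ids)
genre_dict = {"rock": {"artist-id-1", "artist-id-2"}, "electronic": {"artist-id-3"}}

# write to a temporary location so the example does not touch the project data
path = os.path.join(tempfile.mkdtemp(), "genre_artists.pkl")
with open(path, "wb") as f:
    pickle.dump(genre_dict, f)

# read it back, as a consumer of the file would
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == genre_dict)
```

Sets survive pickling unchanged, so the restored dictionary can be used for membership tests and sampling without conversion.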

# Note : Integration in the Final Product

The Python file **heuristic_recommender.py** contains a class that uses the results of this notebook in the final prototype (06 - Merged Recommender).

# Note : Evaluating the Final Playlist

We thought carefully about how to evaluate the playlist returned by the heuristic recommender. However, just like the Content Based Recommender, it is not really meant to work with the previously defined relevance. We nevertheless show some metrics: