# Music Recommendation System

This is a project for the Aplica- course.

Year 2017, first period.

Students:
- Diego Vargas
- Andre Pando
- Ronie Arauco

## Getting familiar with the dataset

In [1]:
import numpy as np
import pandas as pd
import codecs
# import matplotlib.pyplot as plt
# %matplotlib inline

artists = pd.read_table("./lastfm-data/artists.dat", encoding = 'latin1')
tags = pd.read_table("./lastfm-data/tags.dat", encoding = 'latin1')
user_artists = pd.read_table("./lastfm-data/user_artists.dat", encoding = 'latin1')
user_taggedartists = pd.read_table("./lastfm-data/user_taggedartists.dat",encoding = 'latin1')
user_friends = pd.read_table("./lastfm-data/user_friends.dat",encoding = 'latin1')


# Information taken from
#    Last.fm website, http://www.lastfm.com
#
#    @inproceedings{Cantador:RecSys2011,
#       author = {Cantador, Iv\'{a}n and Brusilovsky, Peter and Kuflik, Tsvi},
#       title = {2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)},
#       booktitle = {Proceedings of the 5th ACM conference on Recommender systems},
#       series = {RecSys 2011},
#       year = {2011},
#       location = {Chicago, IL, USA},
#       publisher = {ACM},
#       address = {New York, NY, USA},
#       keywords = {information heterogeneity, information integration, recommender systems},
#    } 

In [2]:
# Contains information about the artists that have been listened to
# and tagged by the users
# id \t name \t url \t pictureURL
# artists.shape
artists.sample(3)

Unnamed: 0,id,name,url,pictureURL
2623,2643,Thursday,http://www.last.fm/music/Thursday,http://userserve-ak.last.fm/serve/252/63014.jpg
3529,3579,Danny Noriega,http://www.last.fm/music/Danny+Noriega,http://userserve-ak.last.fm/serve/252/30668633...
2549,2569,Styx,http://www.last.fm/music/Styx,http://userserve-ak.last.fm/serve/252/412304.jpg


In [3]:
# The tags available in the dataset
# tagID \t tagValue
# tags.shape
tags.sample(3)

Unnamed: 0,tagID,tagValue
3671,3743,camper van beethoven
2312,2358,melodic trance
10438,11019,children's


In [4]:
# Contains the artists each user has listened to, together with
# the listening count for each [user, artist] pair
# userID \t artistID \t weight
# user_artists.shape
user_artists.sample(3)

Unnamed: 0,userID,artistID,weight
19823,431,217,2287
44266,980,11528,94
91953,2079,704,255


In [5]:
# Tag assignments made by each user to artists,
# together with the date on which each tag was assigned
# userID \t artistID \t tagID \t day \t month \t year
# user_taggedartists.shape
user_taggedartists.sample(3)

Unnamed: 0,userID,artistID,tagID,day,month,year
50098,529,14880,1016,1,3,2008
147863,1679,959,72,1,1,2010
164795,1865,1300,575,1,12,2008


In [9]:
# Contains the friend relations between users in the database
# userID \t friendID
# user_friends.shape
user_friends.sample(3)

Unnamed: 0,userID,friendID
14524,1133,584
983,78,552
12364,941,198


Object | Shape
--- | ---
artists | (17632, 4)
tags | (11946, 2)
user_artists | (92834, 3)
user_taggedartists | (186479, 6)
user_friends | (25434, 2)

In [12]:
# Which artists have the most and the fewest listeners?

# Number of users that listened to each artist
listeners_agg = user_artists[['artistID','userID']].groupby('artistID', sort=False).agg(['count'])
print("artists with least followers")
print(listeners_agg['userID'].sort_values('count').head(3)) # fewest listeners: artist 9201
print("--------------------")
print("artists with most followers")
print(listeners_agg['userID'].sort_values('count').tail(3)) # most listeners: artist 89

# And how many plays do those artists have?
listens_agg = user_artists[['artistID', 'weight']].groupby(['artistID']).agg(['sum'])
print("--------------------")
print("Amount of plays for the artist with least followers")
print(listens_agg.filter(regex='^9201$',axis=0)) # fewest listeners: 139 plays
print("--------------------")
print("Amount of plays for the artist with most followers")
print(listens_agg.filter(regex='^89$',axis=0)) # most listeners: 1291387 plays
# Which tags did those users assign?



# Which artists have the highest and the lowest play counts?
# (the lowest can't be 0, according to the description of the artist dataset)
print("artist with least plays")
print(listens_agg['weight'].sort_values('sum').head(3)) # fewest plays: artist 14371
print("--------------------")
print("artist with Most plays")
print(listens_agg['weight'].sort_values('sum').tail(3)) # most plays: artist 289 (2393140 plays)

# And how many users account for those play counts?
print("--------------------")
print("Amount of users for the artist with least plays")
print(listeners_agg.filter(regex='^14371$',axis=0)) # artist with the fewest plays
print("--------------------")
print("Amount of users for the artist with most plays")
print(listeners_agg.filter(regex='^289$',axis=0)) # artist with the most plays

# Which tags are used the most and the least?
# Which artists are tagged the most and the least?
# Which users tag the most and the least?

artists with least followers
          count
artistID       
9201          1
12363         1
12366         1
--------------------
artists with most followers
          count
artistID       
288         484
289         522
89          611
--------------------
Amount of plays for the artist with least followers
         weight
            sum
artistID       
9201        139
--------------------
Amount of plays for the artist with most followers
           weight
              sum
artistID         
89        1291387
artist with least plays
          sum
artistID     
14371       1
11746       1
9493        1
--------------------
artist with Most plays
              sum
artistID         
89        1291387
72        1301308
289       2393140
--------------------
Amount of users for the artist with least plays
         userID
          count
artistID       
14371         1
--------------------
Amount of users for the artist with most plays
         userID
          count
artistID       
289         522
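
The tag questions left open at the end of the cell above could be approached in a similar way. The following is only a sketch on made-up data: `toy_tagged` is a hypothetical stand-in for `user_taggedartists`, and its values are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for user_taggedartists (values are made up for illustration)
toy_tagged = pd.DataFrame({
    "userID":   [1, 2, 2, 3, 3, 3, 3],
    "artistID": [10, 11, 10, 12, 10, 10, 11],
    "tagID":    [5, 5, 5, 7, 6, 5, 7],
})

# Number of tag assignments per tag, per artist, and per user
tag_counts = toy_tagged["tagID"].value_counts()
artist_counts = toy_tagged["artistID"].value_counts()
user_counts = toy_tagged["userID"].value_counts()

# idxmax/idxmin return the index label (the ID) with the highest/lowest count
print("most / least used tag:", tag_counts.idxmax(), tag_counts.idxmin())
print("most / least tagged artist:", artist_counts.idxmax(), artist_counts.idxmin())
print("most / least active tagger:", user_counts.idxmax(), user_counts.idxmin())
```

On the real data there will likely be many ties for the lowest count, so inspecting `tag_counts.tail()` may be more informative than a single `idxmin()`.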

### The problem
The dataset doesn't contain a rating column. Instead, each (user, artist) pair has a _weight_ that works as a _listen_ counter. As a result, some artists will have a high number of plays but few users, and vice versa.

So, for this solution, the number of plays has to be expressed relative to the number of users.
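
One possible reading of "relative to the number of users" is the average number of plays per listener for each artist. The sketch below assumes that reading. `toy_user_artists` is a hypothetical stand-in for `user_artists` with made-up values, and the column names `plays`, `listeners` and `plays_per_listener` are introduced here only for illustration.

```python
import pandas as pd

# Toy stand-in for user_artists (values are made up for illustration)
toy_user_artists = pd.DataFrame({
    "userID":   [1, 2, 3, 1, 2],
    "artistID": [89, 89, 89, 289, 9201],
    "weight":   [100, 50, 30, 900, 139],
})

# Total plays and number of distinct listeners per artist
per_artist = toy_user_artists.groupby("artistID").agg(
    plays=("weight", "sum"),
    listeners=("userID", "nunique"),
)

# Plays relative to the number of listeners (assumed normalization)
per_artist["plays_per_listener"] = per_artist["plays"] / per_artist["listeners"]
print(per_artist)
```

With this normalization, an artist with many plays from a single user stands out from an artist with the same plays spread over many users.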

The following graph shows what the data looks like.

![Graph](graph.png)

We use **content-based filtering**, since the dataset we currently have consists of a set of users and a set of categories (keywords or tags). The similarity between the two is based on the keywords extracted from the artists' tags. Each user should have a degree of interest in certain tags, which can be derived from the most frequent tags of the artists the user listens to most (see Table 1). That said, we can only recommend artists to the users already in the dataset.


| Tag  | $U_1$ | $U_2$ | $U_3$ | $U_x$ |
|------|-----|-----|-----|-----|
| $Tag_1$ |  3  |  2  |     |     |
| $Tag_2$ |  5  |  3  |  3  |     |
| $Tag_3$ |     |  3  |  5  |  4  |
| $Tag_4$ |  1  |     |  5  |  4  |


#### How to retrieve the Interest (Ideas)
The interest can be retrieved from the following table (which belongs to a single user):

<table>
    <thead>
        <tr>
            <th>Plays $P_i$</th>
            <th>Artist $A_i$</th>
            <th>Tag $T_j$</th>
            <th>Weight $W_{ij}$</th>
        </tr>
    </thead>
    <tbody>
        <tr>
             <td rowspan="2">$P_1$</td>
             <td rowspan="2">$A_1$</td>
             <td>$T_1$</td>
             <td>$W_{11}$</td>
        </tr>
        <tr>
             <td>$T_2$</td>
             <td>$W_{12}$</td>
        </tr>
        <tr>
             <td rowspan="2">$P_2$</td>
             <td rowspan="2">$A_2$</td>
             <td>$T_2$</td>
             <td>$W_{22}$</td>
        </tr>
        <tr>
             <td>$T_3$</td>
             <td>$W_{23}$</td>
        </tr>
        <tr>
             <td rowspan="4">$P_3$</td>
             <td rowspan="4">$A_3$</td>
             <td>$T_2$</td>
             <td>$W_{32}$</td>
        </tr>
        <tr>
             <td>$T_3$</td>
             <td>$W_{33}$</td>
        </tr>
        <tr>
             <td>$T_4$</td>
             <td>$W_{34}$</td>
        </tr>
        <tr>
             <td>$T_5$</td>
             <td>$W_{35}$</td>
        </tr>
        
    </tbody>
</table>

Here $P_i$ is the number of times the user has played artist $A_i$ (found as _weight_), and $W_{ij}$ is the number of users that have tagged artist $A_i$ with tag $T_j$.

From the table we know the share of the user's plays that each artist $A_i$ received, where $N$ is the number of artists the user has listened to:

$$listenShare_i = \frac{P_i}{\sum_{k = 1}^{N}P_k}$$

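And for each tag $T_j$, where $M$ is the number of tags and $W_{ij} = 0$ when artist $A_i$ has not been tagged with $T_j$:

$$tagShare_j = \sum_{i = 1}^{N}\frac{W_{ij} \cdot listenShare_i}{\sum_{z=1}^{M}W_{iz}}$$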

We can then retrieve the interest on a scale from 0 to 5 by scaling $tagShare_j$ so that the maximum value $max(tagShare_j)$ is assigned 5:

$$interest_j = 5 \cdot \frac{tagShare_j}{\max_{k} tagShare_k}$$

As an example, say we have the following data for the user $U_x$:

<table>
    <thead>
        <tr>
            <th>Plays $P_i$</th>
            <th>Artist $A_i$</th>
            <th>Tag $T_j$</th>
            <th>Weight $W_{ij}$</th>
        </tr>
    </thead>
    <tbody>
        <tr>
             <td rowspan="2">$150$</td>
             <td rowspan="2">$A_1$</td>
             <td>$T_1$</td>
             <td>$30$</td>
        </tr>
        <tr>
             <td>$T_2$</td>
             <td>$15$</td>
        </tr>
        <tr>
             <td rowspan="2">$45$</td>
             <td rowspan="2">$A_2$</td>
             <td>$T_2$</td>
             <td>$13$</td>
        </tr>
        <tr>
             <td>$T_3$</td>
             <td>$7$</td>
        </tr>
        <tr>
             <td rowspan="4">$15$</td>
             <td rowspan="4">$A_3$</td>
             <td>$T_2$</td>
             <td>$45$</td>
        </tr>
        <tr>
             <td>$T_3$</td>
             <td>$15$</td>
        </tr>
        <tr>
             <td>$T_4$</td>
             <td>$16$</td>
        </tr>
        <tr>
             <td>$T_5$</td>
             <td>$6$</td>
        </tr>
        
    </tbody>
</table>


Using the $tagShare$ formula, we get the interest of user $U_x$ in each tag:

| Tag | $tagShare$ | Interest |
| -- | -- | -- |
| $T_1$ | $0.476$ | $5.000$ |
| $T_2$ | $0.417$ | $4.374$ |
| $T_3$ | $0.088$ | $0.925$ |
| $T_4$ | $0.014$ | $0.146$ |
| $T_5$ | $0.005$ | $0.055$ |
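
The values in the table above can be reproduced with a few lines of plain Python. This is a minimal sketch, not part of the original notebook: the function names `tag_shares` and `interests` are introduced here for illustration, and the interest is obtained by linear scaling so that the largest share maps to 5, which matches the Interest column.

```python
# Plays per artist and tag weights per artist for user U_x (from the example table)
plays = {"A1": 150, "A2": 45, "A3": 15}
tag_weights = {
    "A1": {"T1": 30, "T2": 15},
    "A2": {"T2": 13, "T3": 7},
    "A3": {"T2": 45, "T3": 15, "T4": 16, "T5": 6},
}

def tag_shares(plays, tag_weights):
    """Return tagShare_j for every tag, following the listenShare and tagShare formulas."""
    total_plays = sum(plays.values())
    shares = {}
    for artist, weights in tag_weights.items():
        listen_share = plays[artist] / total_plays   # listenShare_i
        weight_sum = sum(weights.values())           # sum of W_iz over the artist's tags
        for tag, weight in weights.items():
            shares[tag] = shares.get(tag, 0.0) + weight / weight_sum * listen_share
    return shares

def interests(shares, top=5.0):
    """Scale the shares linearly so that the largest one maps to `top`."""
    largest = max(shares.values())
    return {tag: top * share / largest for tag, share in shares.items()}

shares = tag_shares(plays, tag_weights)
scores = interests(shares)
for tag in sorted(shares):
    print(f"{tag}: tagShare={shares[tag]:.3f} interest={scores[tag]:.3f}")
```

Because each artist's tag weights are normalized before being multiplied by its listen share, the $tagShare$ values of a user always add up to 1.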

### Bibliography

1. Robillard, M., Maalej, W., Walker, R. J., & Zimmermann, T. (Eds.). (2014). Recommendation Systems in Software Engineering. Springer Berlin Heidelberg. Ch. 2, pp. 20-21. https://doi.org/10.1007/978-3-642-45135-5
2. Jannach, D., Zanker, M., Felfernig, A., & Friedrich, G. (2011). Recommender systems: an introduction. Cambridge University Press (Vol. 40). https://doi.org/10.1017/CBO9780511763113