# Music Recommendation System

This is a project for the Aplica- course.

Year 2017, first period.

Students:
- Diego Vargas
- Andre Pando
- Ronie Arauco

## Getting familiar with the dataset

In [3]:
import numpy as np
import pandas as pd
import codecs
# import matplotlib.pyplot as plt
# %matplotlib inline

artists = pd.read_table("./lastfm-data/artists.dat", encoding = 'latin1')
tags = pd.read_table("./lastfm-data/tags.dat", encoding = 'latin1')
user_artists = pd.read_table("./lastfm-data/user_artists.dat", encoding = 'latin1')
user_taggedartists = pd.read_table("./lastfm-data/user_taggedartists.dat",encoding = 'latin1')
user_friends = pd.read_table("./lastfm-data/user_friends.dat",encoding = 'latin1')


# Information taken from
#    Last.fm website, http://www.lastfm.com
#
#    @inproceedings{Cantador:RecSys2011,
#       author = {Cantador, Iv\'{a}n and Brusilovsky, Peter and Kuflik, Tsvi},
#       title = {2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)},
#       booktitle = {Proceedings of the 5th ACM conference on Recommender systems},
#       series = {RecSys 2011},
#       year = {2011},
#       location = {Chicago, IL, USA},
#       publisher = {ACM},
#       address = {New York, NY, USA},
#       keywords = {information heterogeneity, information integration, recommender systems},
#    } 

In [4]:
# Contains information about the artists that have been listened to
# and tagged by the users
# id \t name \t url \t pictureURL
# artists.shape
artists.sample(3)

index | id | name | url | pictureURL
--- | --- | --- | --- | ---
2251 | 2266 | Kind of Like Spitting | http://www.last.fm/music/Kind+of+Like+Spitting | http://userserve-ak.last.fm/serve/252/18552.jpg
2245 | 2260 | K-Dee | http://www.last.fm/music/K-Dee | http://userserve-ak.last.fm/serve/252/34474585...
1871 | 1880 | Fashion | http://www.last.fm/music/Fashion | http://userserve-ak.last.fm/serve/252/48583585...


In [5]:
# The tags available in the dataset
# tagID \t tagValue
# tags.shape
tags.sample(3)

index | tagID | tagValue
--- | --- | ---
8278 | 8688 | steve reich
8248 | 8658 | benga
10144 | 10703 | out of ether and friends


In [6]:
# Contains the artists listened to by each user, together with
# the listening count for each [user, artist] pair
# userID \t artistID \t weight
# user_artists.shape
user_artists.sample(3)

index | userID | artistID | weight
--- | --- | --- | ---
26908 | 584 | 735 | 2440
55186 | 1230 | 13316 | 8
25952 | 565 | 187 | 439


In [7]:
# Tag assignments of artists made by each particular user,
# along with the date on which the user assigned the tag
# userID \t artistID \t tagID \t day \t month \t year
# user_taggedartists.shape
user_taggedartists.sample(3)

index | userID | artistID | tagID | day | month | year
--- | --- | --- | --- | --- | --- | ---
106071 | 1202 | 15544 | 216 | 1 | 9 | 2008
53822 | 545 | 1593 | 167 | 1 | 10 | 2007
83280 | 921 | 11037 | 6164 | 1 | 8 | 2010


In [8]:
# Contains the friend relations between users in the database
# userID \t friendID
# user_friends.shape
user_friends.sample(3)

index | userID | friendID
--- | --- | ---
2601 | 196 | 1130
23662 | 1918 | 215
17826 | 1403 | 1153


Object | shape
--- | ---
artists | (17632, 4)
tags | (11946, 2)
user_artists | (92834, 3)
user_taggedartists | (186479, 6)
user_friends | (25434, 2)

In [33]:
# Which artists have the most and the fewest listeners?

# Number of listeners (rows in user_artists) per artist
listeners_agg = user_artists[['artistID','userID']].groupby('artistID', sort=False).agg(['count'])
print("artists with least followers")
# DataFrame.sort() was removed from pandas; sort_values() is the replacement
print(listeners_agg['userID'].sort_values('count').head(3)) # -- least: artist 9201
print("--------------------")
print("artists with most followers")
print(listeners_agg['userID'].sort_values('count').tail(3)) # -- most: artist 89

# And how many plays do they have?
listens_agg = user_artists[['artistID', 'weight']].groupby(['artistID']).agg(['sum'])
print("--------------------")
print("Amount of plays for the artist with least followers")
print(listens_agg.filter(regex='^9201$',axis=0)) # -- least: 139 plays
print("--------------------")
print("Amount of plays for the artist with most followers")
print(listens_agg.filter(regex='^89$',axis=0)) # -- most: 1291387 plays
# Which tags were assigned by those users?



# Which artists have the highest and the lowest listen counts?
# (the lowest can't be 0, according to the description of the artist dataset)
print("artist with least plays")
print(listens_agg['weight'].sort_values('sum').head(3)) # -- least: artist 14371
print("--------------------")
print("artist with Most plays")
print(listens_agg['weight'].sort_values('sum').tail(3)) # -- most: artist 289, 2393140 plays

# And how many users account for those listen counts?
print("--------------------")
print("Amount of users for the artist with least plays")
print(listeners_agg.filter(regex='^14371$',axis=0)) # -- artist with the fewest plays
print("--------------------")
print("Amount of users for the artist with most plays")
print(listeners_agg.filter(regex='^289$',axis=0)) # -- artist with the most plays

# Which tags are used the most and the least?
# Which artists are tagged the most and the least?
# Which users tag the most and the least?

artists with least followers
          count
artistID       
9201          1
12363         1
12366         1
--------------------
artists with most followers
          count
artistID       
288         484
289         522
89          611
--------------------
Amount of plays for the artist with least followers
         weight
            sum
artistID       
9201        139
--------------------
Amount of plays for the artist with most followers
           weight
              sum
artistID         
89        1291387
artist with least plays
          sum
artistID     
14371       1
11746       1
9493        1
--------------------
artist with Most plays
              sum
artistID         
89        1291387
72        1301308
289       2393140
--------------------
Amount of users for the artist with least plays
         userID
          count
artistID       
14371         1
--------------------
Amount of users for the artist with most plays
         userID
          count
artistID       
289         522
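
The open questions about tags (most and least used tag, most and least tagged artist, most and least active tagger) all reduce to counting rows per value of one column of `user_taggedartists`. The following is a minimal sketch, run on a made-up toy frame rather than the real file. `toy_tagged` and `count_extremes` are hypothetical names that are not part of the project, and ties are resolved arbitrarily.

```python
import pandas as pd

# Toy stand-in for user_taggedartists (same column names, invented rows).
toy_tagged = pd.DataFrame({
    "userID":   [1, 1, 1, 2, 2, 3],
    "artistID": [10, 10, 20, 10, 20, 30],
    "tagID":    [5, 7, 5, 5, 7, 9],
})

def count_extremes(frame, column):
    """Return ((value, count), (value, count)) for the most and the
    least frequent value found in `column`."""
    counts = frame[column].value_counts()
    most = (counts.idxmax(), counts.max())
    least = (counts.idxmin(), counts.min())
    return most, least

# One call per open question: tags, artists, users.
for column in ["tagID", "artistID", "userID"]:
    most, least = count_extremes(toy_tagged, column)
    print(column, "most:", most, "least:", least)
```

On the real data, the same function could be called with `user_taggedartists` in place of the toy frame, and the resulting `tagID` values looked up in `tags` to get readable names.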



### The problem
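The dataset doesn't contain a rating column. Instead, each [user, artist] pair has a _weight_ that works as a _listen_ counter. As a result, some artists have a high number of plays but few listeners, and vice versa. For example, artist 289 has the most plays (2393140) from 522 listeners, while artist 89 has the most listeners (611) but only 1291387 plays.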

So, for this solution, the number of plays has to be expressed relative to the number of users.
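
One possible reading of this normalization is the average number of plays per listener for each artist. The sketch below illustrates that reading on a made-up toy frame. The project does not specify the exact formula, so `plays_per_listener` and `toy_user_artists` are assumptions introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for user_artists (same column names, invented rows).
toy_user_artists = pd.DataFrame({
    "userID":   [1, 1, 2, 2, 3],
    "artistID": [10, 20, 10, 30, 10],
    "weight":   [100, 50, 300, 8, 200],
})

def plays_per_listener(user_artists):
    """Per artist: number of listeners, total plays, and their ratio."""
    grouped = user_artists.groupby("artistID")
    stats = pd.DataFrame({
        "listeners": grouped["userID"].nunique(),
        "plays": grouped["weight"].sum(),
    })
    # Assumed normalization: total plays divided by number of listeners.
    stats["plays_per_listener"] = stats["plays"] / stats["listeners"]
    return stats

artist_stats = plays_per_listener(toy_user_artists)
print(artist_stats)
```

On the toy data, artist 10 has 3 listeners and 600 plays, which gives 200 plays per listener. A per-user normalization (each play count divided by that user's total plays) would be an equally plausible alternative.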

The following graph shows how the data is represented.

![Graph](graph.png)

A function is needed that assigns a _weight_ or _cost_ to the relationship between two artists, so that we know which artist to recommend.
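
The project does not say which function is meant. As an illustration only, the sketch below uses cosine similarity between artists, computed over log-scaled play counts per user. This is a common choice for implicit-feedback data, not necessarily the one the authors intended. The names `artist_similarity`, `most_similar`, and `toy_user_artists` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for user_artists (same column names, invented rows).
toy_user_artists = pd.DataFrame({
    "userID":   [1, 1, 2, 2, 3],
    "artistID": [10, 20, 10, 20, 30],
    "weight":   [100, 50, 300, 120, 8],
})

def artist_similarity(user_artists):
    """Cosine similarity between artists over log-scaled play counts."""
    # Rows are artists, columns are users; a missing pair means 0 plays.
    matrix = user_artists.pivot_table(
        index="artistID", columns="userID", values="weight",
        aggfunc="sum", fill_value=0,
    )
    values = np.log1p(matrix.values.astype(float))
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against division by zero
    unit = values / norms
    return pd.DataFrame(unit @ unit.T, index=matrix.index, columns=matrix.index)

def most_similar(similarity, artist_id, n=3):
    """Return the n artists with the highest similarity to artist_id."""
    scores = similarity.loc[artist_id].drop(artist_id)
    return scores.sort_values(ascending=False).head(n)

similarity = artist_similarity(toy_user_artists)
print(most_similar(similarity, 10, n=2))
```

In the toy data, artists 10 and 20 share both of their listeners, so they score close to 1, while artist 30 shares no listeners with either and scores 0. The log scaling is a design choice that keeps a few very heavy listeners from dominating the score.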

