# Recommender System

[Yahoo! Music User Ratings of Musical Artists, version 1.0 (423 MB)](http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=1)
    This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists. The dataset contains over ten million ratings of musical artists given by Yahoo! Music users over the course of a one month period sometime prior to March 2004. Users are represented as meaningless anonymous numbers so that no identifying information is revealed. The dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms. The dataset may serve as a testbed for matrix and graph algorithms including PCA and clustering algorithms. The size of this dataset is 423 MB.

From the readme.txt:
```
This dataset consists of two files:
1. ydata-ymusic-user-artist-ratings-v1_0.txt
2. ydata-ymusic-artist-names-v1_0.txt

The content of the two files are as follows:

=====================================================================

(1) "ydata-ymusic-user-artist-ratings-v1_0.txt" contains user ratings
    of music artists. It contains 11,557,943 ratings of 98,211 artists
    by 1,948,882 anonymous users. The format of each line of the file
    is: anonymous_user_id (TAB) artist_id (TAB) rating. The ratings
    are integers ranging from 0 to 100, except 255 (a special case
    that means "never play again").

Snippet:
1       1000125 90
1       1006373 100
1       1006978 90
1       1007035 100
1       1007098 100

====================================================================

(2) "ydata-ymusic-artist-names-v1_0.txt" contains the artist_id and
    name of each musical artist.

Snippet:
-100    Not Applicable
-99     Unknown Artist
1000001 Bobby "O"
1000002 Jimmy "Z"
1000003 '68 Comeback
```
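
The ratings file format described in the readme (three tab-separated fields, with 255 reserved for "never play again") can be parsed along the following lines. This is a minimal sketch rather than the notebook's own loading code: the column names, the `load_ratings` helper, and the `never_play_again` flag are illustrative assumptions, and the last sample row is hypothetical, added only to show the special value.

```python
import io

import pandas as pd

# First three rows are copied from the readme snippet; the last row is a
# hypothetical example added only to illustrate the special value 255.
SAMPLE = (
    "1\t1000125\t90\n"
    "1\t1006373\t100\n"
    "1\t1006978\t90\n"
    "3\t1000003\t255\n"
)

NEVER_PLAY_AGAIN = 255  # special rating value described in the readme


def load_ratings(source):
    """Read tab-separated (user_id, artist_id, rating) rows into a DataFrame."""
    ratings = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["user_id", "artist_id", "rating"],  # assumed column names
    )
    # Flag the special value so it is not treated as a score on the 0-100 scale.
    ratings["never_play_again"] = ratings["rating"] == NEVER_PLAY_AGAIN
    return ratings


ratings = load_ratings(io.StringIO(SAMPLE))
print(ratings)
```

Passing a file path instead of the `StringIO` object should work the same way, since `pd.read_csv` accepts both. Keeping 255 as a separate flag avoids averaging it in as if it were a very high rating.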



In [1]:
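import re
import timeit
from collections import defaultdict

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Render plots inline in the notebook and use the ggplot style
%matplotlib inline
plt.style.use('ggplot')
# Show up to 25 columns when displaying DataFrames
pd.options.display.max_columns = 25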



In [7]:
import psycopg2
conn = psycopg2.connect("dbname=ymusic_data user=btq")
# Open a cursor to perform database operations
cur = conn.cursor()
# Query the database and obtain data as Python objects
cur.execute("SELECT * FROM ym_ratings WHERE uid < 3;")
cur.fetchall()

[(1, 1000125, 90),
 (1, 1006373, 100),
 (1, 1006978, 90),
 (1, 1007035, 100),
 (1, 1007098, 100),
 (1, 1007723, 100),
 (1, 1008659, 100),
 (1, 1008916, 100),
 (1, 1012809, 70),
 (1, 1014635, 100),
 (1, 1016419, 100),
 (1, 1016470, 100),
 (1, 1016522, 100),
 (1, 1016885, 100),
 (1, 1017874, 100),
 (1, 1017881, 100),
 (1, 1019512, 100),
 (1, 1019522, 100),
 (1, 1020524, 100),
 (1, 1020560, 100),
 (1, 1020778, 100),
 (1, 1021623, 100),
 (1, 1024006, 100),
 (1, 1024015, 100),
 (1, 1024496, 100),
 (1, 1024635, 100),
 (1, 1024759, 100),
 (1, 1029612, 100),
 (1, 1033451, 100),
 (1, 1034801, 100),
 (1, 1036157, 90),
 (1, 1037847, 100),
 (1, 1041557, 90),
 (1, 1042768, 100),
 (1, 1043712, 100),
 (1, 1045024, 100),
 (1, 1045525, 100),
 (1, 1047584, 100),
 (1, 1053507, 90),
 (1, 1098798, 90),
 (2, 1004623, 0),
 (2, 1018143, 0),
 (2, 1040071, 90),
 (2, 1053438, 90),
 (2, 1098087, 90),
 (2, 1098636, 90)]
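
The tuples returned by `fetchall()` can be loaded into a DataFrame for quick per-user summaries. The sketch below uses five rows copied from the output above in place of a live database connection. Only the `uid` column name appears in the query; `artist_id` and `rating` are assumed names for the other two columns.

```python
import pandas as pd

# A few rows copied from the query output above, standing in for cur.fetchall().
rows = [
    (1, 1000125, 90),
    (1, 1006373, 100),
    (1, 1012809, 70),
    (2, 1004623, 0),
    (2, 1040071, 90),
]

# "uid" comes from the query; the other two column names are assumptions.
df = pd.DataFrame(rows, columns=["uid", "artist_id", "rating"])

# Number of ratings and mean rating per user.
summary = df.groupby("uid")["rating"].agg(["count", "mean"])
print(summary)
```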

In [2]:
# Build a nested mapping: user_id -> artist_id -> rating
d = defaultdict(lambda: defaultdict(int))
tic = timeit.default_timer()
with open('data/ydata-ymusic-user-artist-ratings-v1_0_2002.txt') as infile:
    for line in infile:
        # Each line is: user_id <TAB> artist_id <TAB> rating
        vals = line.split('\t')
        d[vals[0]][vals[1]] = int(vals[2])
toc = timeit.default_timer()
print(toc - tic)  # elapsed time in seconds
print(d['1'])

0.241866111755
defaultdict(<type 'int'>, {'1008916': 100, '1042768': 100, '1016885': 100, '1024496': 100, '1036157': 90, '1033451': 100, '1007035': 100, '1041557': 90, '1045525': 100, '1098798': 90, '1017874': 100, '1020778': 100, '1007723': 100, '1007098': 100, '1020524': 100, '1000125': 90, '1047584': 100, '1006978': 90, '1020560': 100, '1016419': 100, '1019522': 100, '1024006': 100, '1016470': 100, '1006373': 100, '1024759': 100, '1017881': 100, '1043712': 100, '1008659': 100, '1037847': 100, '1029612': 100, '1024635': 100, '1034801': 100, '1045024': 100, '1021623': 100, '1012809': 70, '1014635': 100, '1016522': 100, '1053507': 90, '1024015': 100, '1019512': 100})
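
One property of the nested `defaultdict(int)` used above is worth keeping in mind: indexing an artist the user never rated returns 0 and silently inserts that key, which then looks the same as a genuine rating of 0. The sketch below illustrates this with two ratings taken from user 1 and two hypothetical artist ids. It is an illustration of the data structure's behavior, not part of the original notebook.

```python
from collections import defaultdict

# Same structure as above: user_id -> artist_id -> rating
ratings = defaultdict(lambda: defaultdict(int))
ratings['1']['1000125'] = 90
ratings['1']['1012809'] = 70

# Indexing a missing artist returns the int default (0) AND adds the key.
missing = ratings['1']['9999999']  # hypothetical artist id
size_after_index = len(ratings['1'])
print(missing)           # 0
print(size_after_index)  # 3

# .get() and "in" read without inserting anything.
unseen = ratings['1'].get('8888888')  # hypothetical artist id
size_after_get = len(ratings['1'])
print(unseen)                  # None
print(size_after_get)          # 3
print('8888888' in ratings['1'])  # False
```

Because a rating of 0 is valid in this dataset, membership tests or `.get()` are the safer way to check whether a user has rated an artist.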
