# An ordinal regression study

### Example: Music ratings 

For our computations we use a restricted version of the Yahoo Music User Ratings (R1) <https://webscope.sandbox.yahoo.com/catalog.php?datatype=r>. Because of the licensing terms, we unfortunately cannot rehost the data here. The original dataset contains over 10 million ratings, so we restrict it to ratings of selected artists only. To keep the selection simple, we choose the 100 all-time favourites as listed by the Billboard charts <http://www.billboard.com/charts/greatest-billboard-200-artists>.

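The Yahoo Music User Ratings contain no entries for 'Carrie Underwood', 'Tim Mcgraw', 'Lady Gaga', 'Justin Bieber', 'Adele', 'Miley Cyrus', 'P!nk', 'Jay Z' and 'Taylor Swift'. In most cases this is because the artist became known only after 2004; since names are matched exactly, an artist listed under a different spelling may also be missed. These artists have therefore been excluded.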
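
The exclusion step amounts to a set difference between the Billboard names and the names present in the Yahoo artist file. A minimal sketch, assuming exact string comparison; the function name `missing_artists` and the toy inputs are hypothetical and not part of the original notebook:

```python
def missing_artists(billboard_names, yahoo_names):
    """Return the Billboard artists that have no exact match among the Yahoo names."""
    yahoo = set(yahoo_names)
    # keep the Billboard order so the result is easy to compare with the chart
    return [name for name in billboard_names if name not in yahoo]


# toy data for illustration only, not taken from the real files
billboard = ['The Beatles', 'Adele', 'Elvis Presley']
yahoo = ['Elvis Presley', 'The Beatles', 'Some Other Artist']
print(missing_artists(billboard, yahoo))
```

Because the comparison is exact, a difference in capitalisation or punctuation also counts as a miss.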


In [1]:
#import numpy as np
import matplotlib.pyplot as plt
#import scipy as sc
import autograd.scipy as sc  # Thinly-wrapped scipy
import autograd.numpy as np  # Thinly-wrapped numpy
from autograd import grad
from sklearn.linear_model import LogisticRegression
import os
import pandas as pd
import itertools
from parse import *
import shutil

#### Processing the Yahoo data

In [2]:
# import billboard top 100
dfbb = pd.read_csv('../data/billboard100.csv', header=None)
artists = dfbb.values.flatten()

In [3]:
# load yahoo data
artist_names_file = '../cache/ydata-ymusic-artist-names-v1_0.txt'

d_list = []
with open(artist_names_file, 'r') as f:
    for line in f:
        # each line holds the artist id and the artist name, separated by a tab
        parsed_line = [el.strip() for el in line.split('\t')]

        artist = parsed_line[1]
        artist_id = int(parsed_line[0])

        # keep only artists that appear in the Billboard list
        if artist in artists:
            d_list.append({'artist': artist, 'aid': artist_id})

In [4]:
dfartists = pd.DataFrame(d_list)
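
The lookup table built above can be tried on a few in-memory lines before touching the real file. The sketch below assumes the tab-separated layout (id first, name second) that the parsing loop relies on; `parse_artist_lines` and the sample lines are made up for illustration:

```python
def parse_artist_lines(lines, wanted):
    """Collect {'artist', 'aid'} records for the artists named in `wanted`."""
    wanted = set(wanted)  # set membership is a constant-time lookup
    records = []
    for line in lines:
        fields = [el.strip() for el in line.split('\t')]
        if len(fields) < 2:
            continue  # skip malformed or empty lines
        artist_id, artist = int(fields[0]), fields[1]
        if artist in wanted:
            records.append({'artist': artist, 'aid': artist_id})
    return records


# hypothetical sample lines, not taken from the Yahoo file
sample = ['1\tThe Beatles\n', '2\tUnknown Band\n', '\n', '3\tMadonna\n']
records = parse_artist_lines(sample, ['Madonna', 'The Beatles'])
print(records)
```

Passing `records` to `pd.DataFrame` gives a table with the columns `aid` and `artist`, like `dfartists`.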

In [5]:
# now join on the big dataset with artist table
#df = pd.read_csv('../cache/ydata-ymusic-user-artist-ratings-v1_0.txt', sep=' ')

In [6]:
#df.head()

In [7]:
import findspark
findspark.init()

from pyspark import SparkContext, SparkConf

# set up Spark; creating a second SparkContext raises a ValueError
try:
    conf = SparkConf().setAppName('Yahoo Music Ratings')
    spark_context = SparkContext(conf=conf)
except ValueError:
    print('context already running')

In [10]:
rdd = spark_context.textFile('../cache/ydata-ymusic-user-artist-ratings-v1_0.txt')

# uncomment for testing purposes (works on the first 200 lines only)
#rdd = spark_context.parallelize(rdd.take(200))

In [11]:
%time
aids = dfartists.aid.values

# filter data according to artists (this might take a while, ~12 min on my MacBook)
# each record is a tab-separated line: user, artist, rating
frdd = rdd.map(lambda x: tuple(int(el) for el in x.split('\t'))).filter(lambda x: x[1] in aids)

# now transform ratings
# 255 means 'never play again' --> transform it to 1
# the remaining 101 possible scores range from 0...100
# shift those to 2...102 so that a score of 0 does not collide with 1

srdd = frdd.map(lambda x: (x[0], x[1], 1 if x[2] == 255 else 2 + x[2]))

# save an RDD of tuples to a single CSV file
# (chunk_size is the copy buffer size used when merging, 16 MB by default)
def saveRDD2CSV(rdd, filename, chunk_size=1024 * 1024 * 16):
    tmppath = filename + '.tmp.dir'
    # by default Spark writes a folder with one part file per partition
    # to force a single part file use .coalesce(1, shuffle=True)
    # here we let Spark do the distributed work and merge the files afterwards
    rdd.map(lambda x: ','.join(str(el) for el in x)).saveAsTextFile(tmppath)

    # only use the part files
    sparkChunkFilter = lambda x: x.startswith('part')
    # now merge the part files in name order, copying chunk_size bytes at a time
    with open(filename, 'wb') as wfd:
        for f in sorted(el for el in os.listdir(tmppath) if sparkChunkFilter(el)):
            print('merging ' + f + '...')
            with open(os.path.join(tmppath, f), 'rb') as fd:
                shutil.copyfileobj(fd, wfd, chunk_size)

    # remove tmp dir
    shutil.rmtree(tmppath)
    
# save adjusted rdd to file
saveRDD2CSV(srdd, '../cache/musicdata.csv')

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 10 µs
merging part-00000...
merging part-00001...
merging part-00002...
merging part-00003...
merging part-00004...
merging part-00005...
merging part-00006...
merging part-00007...
merging part-00008...
merging part-00009...
merging part-00010...
merging part-00011...
merging part-00012...
merging part-00013...
merging part-00014...
merging part-00015...
merging part-00016...
merging part-00017...
merging part-00018...
merging part-00019...
merging part-00020...
merging part-00021...
merging part-00022...
merging part-00023...
merging part-00024...
merging part-00025...
merging part-00026...
merging part-00027...
merging part-00028...
merging part-00029...
merging part-00030...
merging part-00031...
merging part-00032...
merging part-00033...
merging part-00034...
merging part-00035...
merging part-00036...
merging part-00037...
merging part-00038...
merging part-00039...
merging part-00040...
merging part-00041...
merging part-0
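
The rating transformation described in the code comments of the filtering cell is small enough to check in isolation. A minimal sketch, assuming the raw scores are integers in 0...100 plus the special value 255; `recode_rating` is a hypothetical helper, not part of the original notebook:

```python
NEVER_PLAY_AGAIN = 255  # special raw value meaning 'never play again'


def recode_rating(raw):
    """Map a raw rating to an ordinal level in 1...102."""
    if raw == NEVER_PLAY_AGAIN:
        return 1  # lowest level
    if not 0 <= raw <= 100:
        raise ValueError('unexpected raw rating: %r' % (raw,))
    return raw + 2  # 0...100 -> 2...102


levels = [recode_rating(r) for r in (255, 0, 50, 100)]
print(levels)
```

Keeping 255 on its own level preserves the ordering an ordinal model relies on: 'never play again' sits strictly below a score of 0.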

#### Building an ordinal regression model