# Music Recommender System using Apache Spark and Python


## Necessary Package Imports

In [1]:
import csv 
import random
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

## Loading data

In [2]:
data = list()
included_cols = [12, 13, 11]
with open('../Sample Data/merged_BR3.csv') as csvfile:
    reader = csv.reader(csvfile)
    next(reader, None)
    for row in reader:
        if row[3] == 'Huntersville':
            content = (int(float(row[12])), int(float(row[13])), float(row[11]))
            data.append(tuple(content))
dataParallelized = sc.parallelize(data)
#dataParallelized.collect()

####  Splitting Data for Testing

In [3]:
#splitting the RDD into training and test datasets [.6, .4]
training_set, testing_set = dataParallelized.randomSplit([.6,.4], 13)
training_set.cache()
testing_set.cache()

print training_set.take(3)
print testing_set.take(3)


[(22, 3, 1.85), (23, 3, 2.0), (24, 3, 0.644444444444)]
[(26, 3, 5.0), (27, 3, 4.0), (1551, 68, 5.0)]


## The Recommender Model

For this project, we will train the model with implicit feedback. You can read more information about this from the collaborative filtering page: [http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html). The [function you will be using](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS.trainImplicit) has a few tunable parameters that will affect how the model is built. 

### Model Evaluation

Although there may be several ways to evaluate a model, we will use a simple method here. Suppose we have a model and some dataset of *true* artist plays for a set of users. This model can be used to predict the top X artist recommendations for a user and these recommendations can be compared the artists that the user actually listened to (here, X will be the number of artists in the dataset of *true* artist plays). Then, the fraction of overlap between the top X predictions of the model and the X artists that the user actually listened to can be calculated. This process can be repeated for all users and an average value returned.

For example, suppose a model predicted [1,2,4,8] as the top X=4 artists for a user. Suppose, that user actually listened to the artists [1,3,7,8]. Then, for this user, the model would have a score of 2/4=0.5. To get the overall score, this would be performed for all users, with the average returned.

In [12]:
import math

def score(predict, actual):
    MSE = []
    count = 0.0
    for a in actual:
        for p in predict:
            if a[0] == p[0]:
                #print str(p[1]) + " " + str(a[1])
                count += 1
                SE = (a[1] - p[1])**2
                MSE.append(SE)
    if count == 0.0:
        return -1
    else:
        return sum(MSE)/count

def modelEval(mod, trainData, testData):
    test_userIDs = testData.map(lambda p: p[0]).distinct().collect()
    #print test_userIDs
    test_companyIDs = dataParallelized.map(lambda p: p[1]).distinct().collect()
    #print test_companyIDs
    trainSet = trainData.map(lambda x: (x[0], x[1])).filter(lambda x: x[0] in test_userIDs)
    trainSet = trainSet.groupByKey().map(lambda x: (x[0], list(x[1])))
    #print trainSet.take(3)
    #if bid not in [y[0] for y in x[1]]
    validationSet = trainSet.flatMap(lambda x: [(x[0],bid) for bid in test_companyIDs])
    #print validationSet.take(3)
    actualD = testData.map(lambda x: (x[0], (x[1], x[2]))).groupByKey()
    actualD = actualD.map(lambda x: (x[0], list(x[1]))).collectAsMap()
    predictD = mod.predictAll(validationSet).map(lambda x: (x[0], (x[1], x[2])))
    predictD = predictD.groupByKey().map(lambda x: (x[0], sorted(list(x[1]), key=lambda score: score[1], reverse=True)))
    #print predictD.take(1)
    scores = []
    for entry in predictD.collect():
        score_pe = score(entry[1], actualD[entry[0]])
        #print score_pe
        if score_pe != -1:
            scores.append(score_pe)
    MSE_score = sum(scores)/float(len(scores))
    RMSE_score = math.sqrt(MSE_score)
    return RMSE_score

    
    


### Model Construct
Now we can build the best model possibly using the validation set of data and the `modelEval` function. Loop through the values [2, 10, 20] and figure out which one produces the highest scored based on your model evaluation function.

In [14]:
ranks = [2]
for r in ranks:
    model = ALS.trainImplicit(training_set, rank=r, seed=345)
    scorer = modelEval(model, training_set, testing_set)
    print "The model score for rank %d is %f" % (r, scorer)


5.93703842585
14.0457127999
24.7921753053
15.9779460795
19.0438809293
17.7055089997
13.3719263493
13.7400073351
18.4897684198
11.485811075
13.3000447342
11.7493225924
15.9773157122
20.7910249101
19.5884832447
2.88513295125
14.1412934057
8.98933435761
18.4380953574
12.5860922051
14.4989149879
8.86671998315
3.99976132305
15.9730946425
25.92757733
27.9678280356
9.8434377551
22.5452785624
24.6168995973
16.993272807
0.999992500445
23.5154194772
26.1780536209
22.042190697
14.2681155314
15.3452701742
0.992131937351
26.5217874259
18.0624999966
-1
27.4544100048
2.54848363835
17.4857060648
8.9997268019
13.3399274331
24.9902132331
29.6996718933
13.4385575943
2.95673362743
24.8765145163
19.0390494566
17.0448165059
20.3113334154
3.98399364077
19.335785543
8.89763206191
26.510161115
9.01858817378
18.0135648405
15.7815464962
11.0149644277
26.5224968476
17.9820977054
29.6942197425
9.33845863241
21.6355547365
24.9547124307
14.4313314871
18.9128805293
13.4118710994
0.997797503662
16.0900725547
12.379561

Now, using the bestModel, we will check the results over the test data. 

In [7]:
bestModel = ALS.trainImplicit(trainData, rank=10, seed=345)
modelEval(bestModel, testData)

0.05918491594853054

## Trying Some Businesses  Recommendations
Using the best model above, predict the top 5 artists for user `1059637` using the [recommendProducts](http://spark.apache.org/docs/1.5.2/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.MatrixFactorizationModel.recommendProducts) function. Map the results (integer IDs) into the real artist name using `artistAlias`. Print the results. The output should look as follows:

In [16]:
recommend_artists = bestModel.recommendProducts(1059637, 5)
for i in range(5):
    print "Artist %s : %s" % (i, artist_data_map.get(recommend_artists[i].product))

Artist 0 : The Used
Artist 1 : blink-182
Artist 2 : Taking Back Sunday
Artist 3 : Brand New
Artist 4 : Jimmy Eat World
