# Music Recommender System using Apache Spark and Python


## Necessary Package Imports

In [2]:
import csv 
import random
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

## Loading data

In [4]:
data = []
with open('../Sample Data/merged_BR3.csv') as csvfile:
    reader = csv.reader(csvfile)
    next(reader, None)  # skip the first (header) row
    for row in reader:
        # (user ID, company ID, rating), read from columns 12, 13 and 11
        data.append((int(float(row[12])), int(float(row[13])), float(row[11])))
# `sc` is the SparkContext provided by the notebook session
dataParallelized = sc.parallelize(data)

#### Splitting Data for Testing

In [13]:
# Split the RDD into training (60%) and test (40%) sets; 13 is the random seed
training_set, testing_set = dataParallelized.randomSplit([.6, .4], 13)
training_set.cache()
testing_set.cache()

print(training_set.take(3))
print(testing_set.take(3))


[(1, 1, 5.0), (2, 1, 4.0), (3, 1, 2.0)]
[(5, 2, 5.0), (6, 2, 5.3), (10, 2, 1.0)]
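
To make the behaviour of a weighted random split concrete, here is a minimal plain-Python sketch of the idea: the weights are normalised and every record is assigned to a split independently at random, so the 60/40 proportions hold only approximately. `weighted_split` is a hypothetical helper written for illustration; it is not how Spark implements `randomSplit` internally.

```python
import random

def weighted_split(items, weights, seed):
    """Assign each item to one of len(weights) lists, with probability
    proportional to the corresponding weight (illustrative helper only)."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds of each split on the [0, 1) interval
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for item in items:
        u = rng.random()
        for i, bound in enumerate(bounds):
            if u < bound:
                parts[i].append(item)
                break
        else:
            # Guard against floating-point rounding in the last bound
            parts[-1].append(item)
    return parts

train, test = weighted_split(range(1000), [.6, .4], seed=13)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is why the cell above passes `13` as the second argument.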


## The Recommender Model

For this project, we will train the model with implicit feedback. You can read more about this approach on the collaborative filtering page: [http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html). The [function you will be using](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS.trainImplicit) has several tunable parameters that affect how the model is built, including `rank`, `iterations`, `lambda_`, and `alpha`.
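
As a rough illustration of what "implicit feedback" means, the sketch below shows the usual interpretation of each raw value: it is not treated as a rating to reproduce, but is converted into a binary preference plus a confidence weight. The formula `confidence = 1 + alpha * |r|` follows the common formulation of implicit-feedback ALS and is an assumption here; Spark's internals may differ in detail. `to_preference_and_confidence` is a hypothetical helper, not part of the Spark API.

```python
def to_preference_and_confidence(rating, alpha=0.01):
    """Map one raw interaction value to (preference, confidence).

    Assumed formulation: preference is 1 when any positive interaction was
    observed, and confidence grows linearly with the size of the value.
    """
    preference = 1.0 if rating > 0 else 0.0
    confidence = 1.0 + alpha * abs(rating)
    return preference, confidence

# The first training triples shown earlier in this notebook
sample = [(1, 1, 5.0), (2, 1, 4.0), (3, 1, 2.0)]
for user, item, rating in sample:
    preference, confidence = to_preference_and_confidence(rating)
    print(user, item, preference, confidence)
```

A larger `alpha` makes observed interactions count more heavily relative to unobserved ones, which is one reason it is worth tuning alongside `rank`.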

### Model Evaluation

Although there are several ways to evaluate a model, we will use a simple method here. Suppose we have a model and a dataset of *true* artist plays for a set of users. The model can be used to predict the top X artist recommendations for a user, and these recommendations can be compared to the artists that the user actually listened to (here, X is the number of artists in the dataset of *true* artist plays). The fraction of overlap between the top X predictions of the model and the X artists that the user actually listened to can then be calculated. This process is repeated for all users and the average value is returned.

For example, suppose a model predicted [1,2,4,8] as the top X=4 artists for a user, and that user actually listened to the artists [1,3,7,8]. Two of the four artists match, so the model's score for this user is 2/4=0.5. To get the overall score, this is computed for every user and the results are averaged.
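
The worked example above can be reproduced with a few lines of plain Python. This is a minimal sketch of the overlap metric as described in the text; the function names `overlap_fraction` and `mean_overlap` are made up for illustration.

```python
def overlap_fraction(predicted, actual):
    """Fraction of the actual items that also appear in the predictions."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    return len(actual_set & set(predicted)) / float(len(actual_set))

def mean_overlap(predicted_by_user, actual_by_user):
    """Average the per-user overlap over every user with actual data."""
    scores = [overlap_fraction(predicted_by_user.get(user, []), items)
              for user, items in actual_by_user.items()]
    return sum(scores) / float(len(scores)) if scores else 0.0

print(overlap_fraction([1, 2, 4, 8], [1, 3, 7, 8]))  # 2 of 4 match: 0.5
```

Note that the `modelEval` function defined in this notebook reports a root-mean-square error over predicted ratings rather than this overlap fraction.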

In [39]:
import math

def score(predict, actual):
    # Mean squared error over the items that appear in both lists.
    # `predict` and `actual` are lists of (company ID, rating) pairs.
    predicted = dict(predict)
    errors = [(rating - predicted[item])**2
              for item, rating in actual if item in predicted]
    if not errors:
        return None  # no item in common, so no error can be computed
    return sum(errors)/float(len(errors))

def modelEval(mod, trainData, testData):
    test_userIDs = set(testData.map(lambda p: p[0]).distinct().collect())
    #print test_userIDs
    # Candidate items: every company ID in the full dataset (global `dataParallelized`)
    test_companyIDs = dataParallelized.map(lambda p: p[1]).distinct().collect()
    #print test_companyIDs
    trainSet = trainData.map(lambda x: (x[0], x[1])).filter(lambda x: x[0] in test_userIDs)
    trainSet = trainSet.groupByKey().map(lambda x: (x[0], list(x[1])))
    print(trainSet.take(3))
    #if bid not in [y[0] for y in x[1]]
    validationSet = trainSet.flatMap(lambda x: [(x[0], bid) for bid in test_companyIDs])
    print(validationSet.take(3))
    actualD = testData.map(lambda x: (x[0], (x[1], x[2]))).groupByKey()
    actualD = actualD.map(lambda x: (x[0], list(x[1]))).collectAsMap()
    # Group the predictions per user as lists of (company ID, predicted rating)
    predictD = mod.predictAll(validationSet).map(lambda x: (x[0], (x[1], x[2])))
    predictD = predictD.groupByKey().map(lambda x: (x[0], list(x[1])))
    print(predictD.take(3))
    scores = []
    for entry in predictD.collect():
        score_pe = score(entry[1], actualD[entry[0]])
        if score_pe is not None:
            scores.append(score_pe)

    # Average the per-user MSE values, then take the square root
    MSE_score = sum(scores)/float(len(scores))
    RMSE_score = math.sqrt(MSE_score)
    return RMSE_score
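
The error computation in `score` and the final averaging in `modelEval` can be checked without Spark. The sketch below repeats the same arithmetic on a tiny hand-made example; the ratings are invented purely to make the numbers easy to verify, and `per_user_mse` and `overall_rmse` are illustrative names rather than functions from this notebook.

```python
import math

def per_user_mse(predicted, actual):
    """MSE over items present in both lists of (item ID, rating) pairs."""
    predicted_by_item = dict(predicted)
    errors = [(rating - predicted_by_item[item]) ** 2
              for item, rating in actual if item in predicted_by_item]
    return sum(errors) / float(len(errors)) if errors else None

def overall_rmse(predicted_by_user, actual_by_user):
    """Square root of the mean of the per-user MSE values."""
    per_user = []
    for user, predicted in predicted_by_user.items():
        mse = per_user_mse(predicted, actual_by_user.get(user, []))
        if mse is not None:
            per_user.append(mse)
    if not per_user:
        return None
    return math.sqrt(sum(per_user) / float(len(per_user)))

# Invented toy data: user -> [(item ID, rating), ...]
predicted_by_user = {1: [(10, 4.0), (11, 2.0), (12, 3.0)], 2: [(10, 1.0)]}
actual_by_user = {1: [(10, 5.0), (12, 1.0)], 2: [(10, 1.5)]}

print(per_user_mse(predicted_by_user[1], actual_by_user[1]))  # (1.0 + 4.0) / 2 = 2.5
print(overall_rmse(predicted_by_user, actual_by_user))
```

Averaging per-user errors weights every user equally, regardless of how many ratings each user has in the test set.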

    
    


### Model Construction
Now we can build the best possible model using the held-out data and the `modelEval` function. Loop through the rank values [2, 10, 20] and find which one produces the best score according to the model evaluation function. Because `modelEval` returns an RMSE, a lower score is better.

In [None]:
ranks = [2, 10, 20]
for r in ranks:
    model = ALS.trainImplicit(training_set, rank=r, seed=345)
    # The score is a float, so format it with %f rather than %d
    print("The model score for rank %d is %f" % (r, modelEval(model, training_set, testing_set)))


[(1, [1, 32, 866, 1187, 1896, 2198, 2251, 2462, 2467, 2752, 2815, 3176, 3735, 3777, 3905, 4476, 4532, 4650, 5112, 5178, 5401, 5458, 5497, 5613, 6239, 6413, 6800, 7074, 7261, 7323, 7595, 7868, 8604, 8982, 9522, 10135]), (2, [1, 2244, 2303, 2326, 2549, 3293, 7667]), (3, [1, 48, 124, 153, 186, 301, 327, 394, 411, 414, 474, 512, 519, 719, 721, 774, 815, 851, 870, 930, 950, 954, 1001, 1004, 1088, 1118, 1146, 1218, 1258, 1283, 1302, 1339, 1343, 1349, 1358, 1432, 1469, 1474, 1530, 1620, 1704, 1750, 1766, 1809, 1971, 1993, 2243, 2333, 2393, 2462, 2527, 2548, 2579, 2635, 2667, 2696, 2734, 2869, 2999, 3021, 3139, 3216, 3245, 3367, 3381, 3392, 3461, 3482, 3503, 3536, 3565, 3584, 3586, 3803, 3818, 3908, 3917, 4070, 4281, 4333, 4392, 4408, 4438, 4576, 4606, 4611, 4645, 4694, 4760, 4822, 4846, 4885, 4901, 4947, 4986, 5183, 5190, 5193, 5268, 5283, 5316, 5334, 5335, 5382, 5420, 5460, 5601, 5618, 5687, 5776, 5860, 5866, 5912, 6072, 6165, 6245, 6291, 6565, 6591, 6609, 6732, 6745, 6962, 6991, 7049, 7068,
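
Rather than reading the best rank off the printed scores, the comparison can be done in code. The following is a small sketch of that selection step, assuming a lower score is better (as with RMSE). `select_best` is a hypothetical helper, and the stand-in scoring function is a placeholder that is not a real metric; the Spark call that would replace it is shown only as a comment.

```python
def select_best(candidates, evaluate):
    """Return (best_candidate, best_score), where a lower score is better."""
    best, best_score = None, float("inf")
    for candidate in candidates:
        current = evaluate(candidate)
        if current < best_score:
            best, best_score = candidate, current
    return best, best_score

# With Spark available, `evaluate` could train and score a model per rank:
# evaluate = lambda r: modelEval(
#     ALS.trainImplicit(training_set, rank=r, seed=345), training_set, testing_set)

# Placeholder scoring function so the sketch runs without Spark
stand_in_score = lambda rank: abs(rank - 10)
print(select_best([2, 10, 20], stand_in_score))
```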

Now, using the bestModel, we will check the results over the test data. 

In [7]:
# Retrain on the training split, then evaluate against the test split
bestModel = ALS.trainImplicit(training_set, rank=10, seed=345)
modelEval(bestModel, training_set, testing_set)

0.05918491594853054

## Trying Some Business Recommendations
Using the best model above, predict the top 5 artists for user `1059637` using the [recommendProducts](http://spark.apache.org/docs/1.5.2/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.MatrixFactorizationModel.recommendProducts) function. Map the results (integer IDs) to the real artist names using `artist_data_map`, then print them. The output should look as follows:

In [16]:
# `artist_data_map` is assumed to be a dict from product ID to name, defined elsewhere
recommend_artists = bestModel.recommendProducts(1059637, 5)
for i, rec in enumerate(recommend_artists):
    print("Artist %s : %s" % (i, artist_data_map.get(rec.product)))

Artist 0 : The Used
Artist 1 : blink-182
Artist 2 : Taking Back Sunday
Artist 3 : Brand New
Artist 4 : Jimmy Eat World
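
Conceptually, a matrix factorization model scores a (user, product) pair with the dot product of the user's and the product's latent factor vectors, and a top-N recommendation keeps the N highest-scoring products. The sketch below illustrates that idea in plain Python; the factor values are invented and `recommend_top_n` is a hypothetical helper, not the Spark implementation of `recommendProducts`.

```python
import heapq

def recommend_top_n(user_factors, product_factors, n):
    """Return the n (product ID, score) pairs with the highest dot-product score."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scored = ((dot(user_factors, factors), product_id)
              for product_id, factors in product_factors.items())
    return [(product_id, s) for s, product_id in heapq.nlargest(n, scored)]

# Invented rank-2 factors for one user and three products
user = [1.0, 0.0]
products = {1: [0.5, 9.0], 2: [2.0, 0.0], 3: [1.0, 1.0]}
print(recommend_top_n(user, products, 2))  # [(2, 2.0), (3, 1.0)]
```

The returned IDs would then be translated to names with a lookup such as the one used in the cell above.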
