# Music Recommender System using Apache Spark and Python


## Necessary Package Imports

In [1]:
import csv 
import random
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

## Loading data

In [2]:
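data = []
# Column indices used below: 12 = user ID, 13 = business ID, 11 = rating.
with open('../Sample Data/merged_BR3.csv') as csvfile:
    reader = csv.reader(csvfile)
    next(reader, None)  # skip the header row
    for row in reader:
        # keep only the rows whose column 3 is 'Huntersville'
        if row[3] == 'Huntersville':
            # int(float(...)) also handles IDs written like '22.0'
            data.append((int(float(row[12])), int(float(row[13])), float(row[11])))
# `sc` is the SparkContext provided by the PySpark notebook session
dataParallelized = sc.parallelize(data)
#dataParallelized.collect()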

### Splitting Data for Testing

In [3]:
# split the RDD into training (60%) and test (40%) datasets, with a fixed seed
training_set, testing_set = dataParallelized.randomSplit([.6, .4], seed=13)
training_set.cache()
testing_set.cache()

print(training_set.take(3))
print(testing_set.take(3))


[(22, 3, 1.85), (23, 3, 2.0), (24, 3, 0.644444444444)]
[(26, 3, 5.0), (27, 3, 4.0), (1551, 68, 5.0)]


## The Recommender Model

For this project, we will train the model with implicit feedback. The collaborative filtering page explains this approach in more detail: [http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html). The [function you will be using](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS.trainImplicit) has a few tunable parameters (such as `rank`, `iterations`, `lambda_`, and `alpha`) that affect how the model is built.
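
To make the idea of implicit feedback more concrete, here is a minimal sketch in plain Python. It assumes the formulation from the paper that the Spark collaborative filtering page cites (Hu, Koren and Volinsky), where each raw value is turned into a binary preference plus a confidence weight of `1 + alpha * value`. The helper name `to_preference_confidence` and the sample values are hypothetical, and this is an illustration rather than Spark's internal code.

```python
def to_preference_confidence(value, alpha=0.01):
    """Return (preference, confidence) for one raw implicit-feedback value.

    Assumption: preference is 1 when any interaction was observed, and
    confidence grows linearly with the strength of the observation.
    """
    preference = 1.0 if value > 0 else 0.0
    confidence = 1.0 + alpha * value
    return preference, confidence


# hypothetical raw values for three (user, business) pairs
for value in [0.0, 1.85, 5.0]:
    print("%s -> %s" % (value, to_preference_confidence(value)))
```

A larger `alpha` makes the model trust strong observations more heavily relative to unobserved pairs, which is one reason it is worth tuning alongside `rank`.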

### Model Evaluation

Although there are several ways to evaluate a model, we will use a simple one here. Suppose we have a model and a dataset of *true* artist plays for a set of users. The model can predict the top X artist recommendations for a user, and these recommendations can be compared to the artists that the user actually listened to (here, X is the number of artists in the dataset of *true* artist plays). The score for that user is the fraction of overlap between the top X predictions of the model and the X artists that the user actually listened to. Repeating this for all users and averaging gives the overall score.

For example, suppose a model predicted [1,2,4,8] as the top X=4 artists for a user, and that user actually listened to the artists [1,3,7,8]. Two of the four artists match, so the model scores 2/4=0.5 for this user. The overall score is the average of these per-user scores.

Note that the `modelEval` function defined below does not compute this overlap. It compares predicted and actual values for matching (user, item) pairs and returns a root-mean-square error (RMSE), so lower values are better.
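
For reference, the overlap score from the worked example can be sketched in plain Python as follows. The function names `overlap_score` and `average_overlap` are hypothetical and are not part of the notebook; the sketch assumes predictions and actual plays are given as plain lists of IDs per user.

```python
def overlap_score(predicted, actual):
    """Fraction of the actual items that also appear in the predictions."""
    if not actual:
        return 0.0
    hits = len(set(predicted) & set(actual))
    return hits / float(len(actual))


def average_overlap(predicted_by_user, actual_by_user):
    """Average the per-user overlap score over every user with actual data."""
    scores = [overlap_score(predicted_by_user.get(user, []), items)
              for user, items in actual_by_user.items()]
    return sum(scores) / float(len(scores)) if scores else 0.0


# the example from the text: two of the four artists match
print(overlap_score([1, 2, 4, 8], [1, 3, 7, 8]))  # 0.5
```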

In [12]:
import math

def score(predict, actual):
    # Mean squared error over the items that appear in both lists.
    # `predict` and `actual` are lists of (item ID, value) pairs for one user.
    squared_errors = []
    for a in actual:
        for p in predict:
            if a[0] == p[0]:
                squared_errors.append((a[1] - p[1])**2)
    if not squared_errors:
        return 0.0  # no overlap between predicted and actual items
    return sum(squared_errors)/float(len(squared_errors))

def modelEval(mod, trainData, testData):
    # users that appear in the test data (a set keeps the membership test below fast)
    test_userIDs = set(testData.map(lambda p: p[0]).distinct().collect())
    # candidate business IDs, taken from the full dataset (global dataParallelized)
    test_companyIDs = dataParallelized.map(lambda p: p[1]).distinct().collect()
    # training (user, business) pairs, restricted to users that also appear in the test data
    trainSet = trainData.map(lambda x: (x[0], x[1])).filter(lambda x: x[0] in test_userIDs)
    trainSet = trainSet.groupByKey().map(lambda x: (x[0], list(x[1])))
    print(trainSet.take(3))
    # every (user, business) pair to predict; businesses seen in training are not excluded
    validationSet = trainSet.flatMap(lambda x: [(x[0], bid) for bid in test_companyIDs])
    print(validationSet.take(3))
    # actual values per user: {user: [(business, rating), ...]}
    actualD = testData.map(lambda x: (x[0], (x[1], x[2]))).groupByKey()
    actualD = actualD.map(lambda x: (x[0], list(x[1]))).collectAsMap()
    # predicted values per user, sorted by predicted rating (highest first)
    predictD = mod.predictAll(validationSet).map(lambda x: (x[0], (x[1], x[2])))
    predictD = predictD.groupByKey().map(lambda x: (x[0], sorted(list(x[1]), key=lambda pair: pair[1], reverse=True)))
    print(predictD.take(1))
    # per-user MSE, averaged across users, then the square root gives the RMSE
    scores = []
    for entry in predictD.collect():
        scores.append(score(entry[1], actualD[entry[0]]))
    MSE_score = sum(scores)/float(len(scores))
    return math.sqrt(MSE_score)
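
To see what `modelEval` computes without a Spark session, the sketch below reproduces the same arithmetic on small in-memory dictionaries: a per-user mean squared error over matching items, averaged over users, followed by a square root. The names `per_user_mse` and `rmse_over_users` and the sample numbers are hypothetical. One difference from the notebook code: the dictionary lookup keeps only one predicted value per item, which assumes item IDs are unique within a user's predictions.

```python
import math


def per_user_mse(predicted, actual):
    """MSE over items present in both lists of (item, value) pairs."""
    predicted_by_item = dict(predicted)
    errors = [(value - predicted_by_item[item]) ** 2
              for item, value in actual if item in predicted_by_item]
    return sum(errors) / float(len(errors)) if errors else 0.0


def rmse_over_users(predicted_by_user, actual_by_user):
    """Square root of the average per-user MSE (0.0 when there are no users)."""
    per_user = [per_user_mse(pairs, actual_by_user.get(user, []))
                for user, pairs in predicted_by_user.items()]
    if not per_user:
        return 0.0
    return math.sqrt(sum(per_user) / float(len(per_user)))


# hypothetical data: only item 10 appears in both lists for user 1
predicted = {1: [(10, 1.0), (11, 2.0)]}
actual = {1: [(10, 3.0), (12, 5.0)]}
print(rmse_over_users(predicted, actual))  # 2.0
```

Users with no matching items contribute an error of 0.0, which pulls the average down, so a low value should be read together with how many pairs actually matched.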

    
    


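### Model Construction
Now we can build the best possible model using the validation set of data and the `modelEval` function. Loop through the rank values [2, 10, 20] and figure out which one produces the best score according to your model evaluation function. Because `modelEval` returns an RMSE, the best score is the lowest one. The cell below tries only rank 2; extend `ranks` to cover the other values.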
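
The selection step can be sketched independently of Spark. The helper below takes any evaluation function and returns the rank with the lowest score, assuming lower is better as with the RMSE returned by `modelEval`. The name `pick_best_rank` and the scores in `fake_scores` are hypothetical and are not results from this dataset.

```python
def pick_best_rank(ranks, evaluate):
    """Return (best_rank, best_score), where a lower score is better."""
    results = [(evaluate(rank), rank) for rank in ranks]
    best_score, best_rank = min(results)
    return best_rank, best_score


# hypothetical scores standing in for real model evaluations
fake_scores = {2: 0.9, 10: 0.6, 20: 0.7}
best_rank, best_score = pick_best_rank([2, 10, 20], fake_scores.get)
print("best rank: %d (score %.2f)" % (best_rank, best_score))
```

In the notebook, `evaluate` could be a function that trains a model with `ALS.trainImplicit(training_set, rank=r, seed=345)` and passes it to `modelEval` together with `training_set` and `testing_set`.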

In [13]:
ranks = [2]
for r in ranks:
    model = ALS.trainImplicit(training_set, rank=r, seed=345)
    # %f keeps the fractional part of the RMSE (%d would truncate it to an integer)
    print("The model score for rank %d is %f" % (r, modelEval(model, training_set, testing_set)))


[(2, [2244, 6523]), (3, [308, 512, 1004, 1283, 1358, 1460, 1461, 2237, 2557, 3697, 3818, 3876, 3917, 3918, 3929, 4508, 4606, 5054, 5776, 5950, 7625, 7657, 7806, 8133, 9859]), (18437, [1372])]
[(2, 3), (2, 6149), (2, 6150)]
[(2, [(9877, 0.04101413012691457), (5506, 0.03161270607241612), (3917, 0.019960105048478637), (3924, 0.015093155285537008), (308, 0.013942062310764752), (4686, 0.013768617757681456), (3930, 0.013516469062727632), (8811, 0.010032798921046165), (3332, 0.008125190528966868), (4157, 0.008106629916954565), (3818, 0.007917484011004516), (4672, 0.007657296783966858), (3809, 0.007460382677147791), (4149, 0.0074599785426893855), (1372, 0.006860677578577223), (5213, 0.006758608841845357), (1805, 0.006261376766171018), (7116, 0.00612377816188589), (3424, 0.00610655773819687), (7918, 0.006049971227993615), (122, 0.005680739022906802), (6134, 0.005649449989391202), (1343, 0.005418265924229771), (3328, 0.005386891145690742), (7774, 0.005251689139248852), (4606, 0.00524508480119983

Now, using the bestModel, we will check the results over the test data. 

In [7]:
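# use the RDDs created in the split above; modelEval takes both the training and test sets
bestModel = ALS.trainImplicit(training_set, rank=10, seed=345)
modelEval(bestModel, training_set, testing_set)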

0.05918491594853054

## Trying Some Business Recommendations
Using the best model above, predict the top 5 artists for user `1059637` using the [recommendProducts](http://spark.apache.org/docs/1.5.2/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.MatrixFactorizationModel.recommendProducts) function. Map the results (integer IDs) to the real artist names using `artistAlias`, then print them. The cell below is followed by the output it should produce:

In [16]:
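# artist_data_map is assumed to be a dict from artist ID to name, defined earlier in the session
recommend_artists = bestModel.recommendProducts(1059637, 5)
# enumerate avoids an IndexError if fewer than 5 recommendations come back
for i, rec in enumerate(recommend_artists):
    print("Artist %s : %s" % (i, artist_data_map.get(rec.product)))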

Artist 0 : The Used
Artist 1 : blink-182
Artist 2 : Taking Back Sunday
Artist 3 : Brand New
Artist 4 : Jimmy Eat World
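
The lookup step above can be illustrated without Spark. This minimal sketch ranks (item ID, score) pairs and maps the top IDs to names through a dictionary, which mirrors what `recommendProducts` followed by a `.get` lookup does in the cell above. The function name `top_n_names`, the IDs, the scores, and the names are all hypothetical.

```python
def top_n_names(scored_items, id_to_name, n=5):
    """Return the names of the n highest-scored items; unknown IDs map to None."""
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    return [id_to_name.get(item_id) for item_id, _ in ranked[:n]]


# hypothetical scores and name mapping
scored = [(1, 0.2), (2, 0.9), (3, 0.5)]
names = {1: "Item A", 2: "Item B"}
for position, name in enumerate(top_n_names(scored, names, n=2)):
    print("Item %s : %s" % (position, name))
```

Using `.get` means an ID that is missing from the mapping prints as `None` instead of raising a `KeyError`, which is useful when the name table does not cover every ID the model knows about.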
