# Music Recommender System using Apache Spark and Python
**Estimated time: 8hrs**

## Description

For this project, you are to create a recommender system that will recommend new musical artists to a user based on their listening history. Suggesting different songs or musical artists to a user is important to many music streaming services, such as Pandora and Spotify. In addition, this type of recommender system could also be used as a means of suggesting TV shows or movies to a user (e.g., Netflix). 

To create this system you will be using Spark and the collaborative filtering technique. The instructions for completing this project will be laid out entirely in this file. You will have to implement any missing code as well as answer any questions.

## Datasets

You will be using publicly available song data from Audioscrobbler, which can be found [here](http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html). However, we modified the original data files so that the code will run in a reasonable time on a single machine. The reduced data files are suffixed with `_small.txt` and contain only the information relevant to the top 50 most prolific users (those with the highest artist play counts).

The original data file `user_artist_data.txt` contained about 141,000 unique users and 1.6 million unique artists. It recorded about 24.2 million user-artist pairs, each with its play count.

Note that when plays are scrobbled, the client application submits the name of the artist being played. This name could be misspelled or nonstandard, and this may only be detected later. For example, "The Smiths", "Smiths, The", and "the smiths" may appear as distinct artist IDs in the data set, even though they clearly refer to the same artist. For this reason, the data set includes `artist_alias.txt`, which maps artist IDs that are known misspellings or variants to the canonical ID of that artist.

The `artist_data.txt` file then provides a map from the canonical artist ID to the name of the artist.
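
The alias mapping described above can be sketched in plain Python before moving it to Spark. This is a minimal illustration on made-up toy data: the function name `canonicalize_plays` is hypothetical, and summing the play counts of rows that collapse onto the same canonical artist is an assumption about how duplicates should be handled, not something the dataset description requires.

```python
from collections import defaultdict


def canonicalize_plays(plays, alias):
    """Rewrite artist IDs to canonical IDs and merge duplicate (user, artist) rows.

    plays: iterable of (user_id, artist_id, play_count) tuples
    alias: dict mapping a variant artist ID to its canonical artist ID
    """
    merged = defaultdict(int)
    for user, artist, count in plays:
        # IDs that are missing from the alias map are assumed to be canonical already
        canonical = alias.get(artist, artist)
        merged[(user, canonical)] += count
    return dict(merged)


# Hypothetical toy data: artists 2 and 3 are variants of canonical artist 1
alias = {2: 1, 3: 1}
plays = [(100, 1, 5), (100, 2, 3), (100, 3, 1), (200, 7, 4)]

canonical_plays = canonicalize_plays(plays, alias)
print(canonical_plays)
```

In the toy data, the three rows for user 100 collapse into a single row for artist 1, while the row for user 200 is left unchanged.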

## The Recommender Model

For this project, we will train the model with implicit feedback. You can read more information about this from the collaborative filtering page: [http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html). The [function you will be using](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS.trainImplicit) has a few tunable parameters that will affect how the model is built. To get the best model, we will do a small parameter sweep and choose the model that performs best on the validation set.
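
As a rough illustration of what "implicit feedback" means: in the formulation referenced by the Spark collaborative filtering page (Hu, Koren, and Volinsky), a play count is not treated as a rating. It is split into a binary preference (did the user play the artist at all?) and a confidence weight that grows with the count. The sketch below assumes the common form `confidence = 1 + alpha * count`. The function name and the `alpha` value are illustrative only and do not describe Spark's internals or defaults.

```python
def implicit_signal(play_count, alpha=1.0):
    """Split a raw play count into (preference, confidence).

    preference: 1 if the user played the artist at all, else 0
    confidence: weight on that preference, assumed here to be 1 + alpha * count
    """
    preference = 1 if play_count > 0 else 0
    confidence = 1.0 + alpha * play_count
    return preference, confidence


# alpha=0.5 is an arbitrary value chosen for the illustration
signals = {count: implicit_signal(count, alpha=0.5) for count in (0, 1, 22)}
for count, (preference, confidence) in signals.items():
    print(count, preference, confidence)
```

Under this view the model predicts a preference rather than a play count, so predicted values are not on the same scale as the raw counts.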

Therefore, we must first devise a way to evaluate models. Once we have a method for evaluation, we can run a parameter sweep, evaluate each combination of parameters on the validation data, and choose the optimal set of parameters. The parameters then can be used to make predictions on the test data.
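
The evaluate, sweep, and select workflow described above can be sketched independently of Spark. In this minimal sketch, `sweep`, `fake_train`, and `fake_evaluate` are hypothetical names, and the scores are made-up placeholders rather than results from this project. In the real notebook the training step would be the ALS call and the evaluation step would be the scoring function applied to the validation data.

```python
def sweep(candidates, train, evaluate):
    """Train one model per candidate value and return the best-scoring one.

    Returns (best_param, best_score, scores), where scores maps each
    candidate value to its validation score.
    """
    scores = {}
    for param in candidates:
        model = train(param)
        scores[param] = evaluate(model)
    best_param = max(scores, key=scores.get)
    return best_param, scores[best_param], scores


# Stand-ins for the real training and evaluation steps (assumptions for the demo)
def fake_train(rank):
    return {"rank": rank}


made_up_scores = {2: 0.31, 10: 0.42, 20: 0.37}  # placeholder values, not real results


def fake_evaluate(model):
    return made_up_scores[model["rank"]]


best_rank, best_score, all_scores = sweep([2, 10, 20], fake_train, fake_evaluate)
print(best_rank, best_score)
```

Only the validation score is used to pick the parameter, which keeps the test data untouched until the final evaluation.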

### Model Evaluation

Although there may be several ways to evaluate a model, we will use a simple method here. Suppose we have a model and some dataset of *true* artist plays for a set of users. This model can be used to predict the top X artist recommendations for a user, and these recommendations can be compared with the artists that the user actually listened to (here, X will be the number of artists in the dataset of *true* artist plays). Then, the fraction of overlap between the top X predictions of the model and the X artists that the user actually listened to can be calculated. This process can be repeated for all users and an average value returned.

For example, suppose a model predicted [1,2,4,8] as the top X=4 artists for a user. Suppose that user actually listened to the artists [1,3,7,8]. Then, for this user, the model would have a score of 2/4=0.5. To get the overall score, this would be performed for all users, with the average returned.
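
The worked example above can be checked with a few lines of plain Python. This is a sketch of the scoring rule only, with hypothetical function names. It works on ordinary lists and dictionaries and leaves out the Spark-specific step of producing the predictions.

```python
def overlap_score(predicted, actual):
    """Fraction of the actual artists that also appear in the predictions."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    return len(set(predicted) & actual_set) / len(actual_set)


def mean_overlap(predicted_by_user, actual_by_user):
    """Average the per-user overlap score over every user with actual plays."""
    scores = [
        overlap_score(predicted_by_user.get(user, []), actual)
        for user, actual in actual_by_user.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# The example from the text: 2 of the 4 actual artists were predicted
example = overlap_score([1, 2, 4, 8], [1, 3, 7, 8])
print(example)
```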

**NOTE: when using the model to predict the top-X artists for a user, do not include the artists listed with that user in the training data.**
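
One way to honor the note above is to filter the candidate artists before ranking them. The sketch below assumes the model's scores for one user are already available as a dictionary. The name `top_x_unseen` and the toy scores are hypothetical and only illustrate the filtering step.

```python
def top_x_unseen(scores, seen_in_training, x):
    """Return the x highest-scoring artists the user did not play in training.

    scores: dict mapping artist ID to the predicted score for one user
    seen_in_training: set of artist IDs listed with that user in the training data
    """
    candidates = {a: s for a, s in scores.items() if a not in seen_in_training}
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:x]


# Made-up scores for a single user; artists 1 and 3 were in the training data
toy_scores = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.1}
top = top_x_unseen(toy_scores, seen_in_training={1, 3}, x=2)
print(top)
```

Without the filter, the highest-scoring artists would tend to be the ones the user already played in training, and those cannot appear in the held-out plays.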

Name your function `modelEval` and have it take a model (the output of ALS.trainImplicit) and a dataset as input. For parameter tuning, the dataset parameter should be set to the validation data (`validationData`). After parameter tuning, the model can be evaluated on the test data (`testData`).

## Necessary Package Imports

In [27]:
# Import libraries

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
from pyspark.sql.functions import lit  # explicit import; a wildcard import would shadow builtins such as sum and max

import pandas as pd


## Loading data

Load the three datasets into RDDs and name them `artistData`, `artistAlias`, and `userArtistData`. View the README, or the files themselves, to see how this data is formatted. Some of the files have tab delimiters while others have space delimiters. Make sure that your `userArtistData` RDD contains only the canonical artist IDs.
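
A minimal parsing sketch for the three files follows. It assumes that the play data is space-delimited (`user artist count`) and that the artist and alias files are tab-delimited. Check this layout against the README before relying on it. The function names and the sample lines are hypothetical.

```python
def parse_user_artist(line):
    """Parse an assumed 'user artist count' line separated by whitespace."""
    user, artist, count = line.split()
    return int(user), int(artist), int(count)


def parse_artist(line):
    """Parse an assumed 'artist_id<TAB>name' line; the name may contain spaces."""
    artist_id, name = line.rstrip("\n").split("\t", 1)
    return int(artist_id), name.strip()


def parse_alias(line):
    """Parse an assumed 'variant_id<TAB>canonical_id' line."""
    variant, canonical = line.rstrip("\n").split("\t", 1)
    return int(variant), int(canonical)


# Made-up sample lines in the assumed layout
play_row = parse_user_artist("100 1 5")
artist_row = parse_artist("1\tThe Smiths\n")
alias_row = parse_alias("2\t1\n")
print(play_row, artist_row, alias_row)
```

With Spark, functions like these could be applied line by line, for example with `sc.textFile(path).map(parse_user_artist)`. Real files may also contain malformed lines that need to be filtered out first.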

In [8]:
artists = spark.read.format("csv").option('header','true').option('delimiter', '\t').\
  option('inferSchema', 'true').load("/data/audioscrobble/artist_data.txt.gz")

dataset = spark.read.csv("/data/audioscrobble/user_artist_data.csv.gz", header=True, inferSchema=True)

(training, test) = dataset.randomSplit([0.8, 0.2])


In [9]:
# Classic Rock Fan Data

pd_df = pd.DataFrame({'User' : [99999, 99999, 99999, 99999, 99999, 99999, 99999],
                      'Artist' : [10215385, 9915421, 3292, 5687, 1014221, 1000055,  1004241],
                      'Count' : [12, 7, 13, 8, 15, 5, 2]
             })

my_playlist = spark.createDataFrame(pd_df)

# union() matches columns by position, so 'Count' lines up with 'PlayCount'
training = training.union(my_playlist)


In [43]:
# TODO: Your data

# Use User #99998

# Create a pandas dataframe with your data.  Look up your data from the artists dataframe


# add it to training.

In [10]:
dataset.show()

+-------+-------+---------+
|   User| Artist|PlayCount|
+-------+-------+---------+
|2398725|    598|        1|
|2430132|1001582|        1|
|2063323|1342138|        1|
|2429835|1055519|        7|
|2073565|1030685|        1|
|2165208|6865656|        1|
|2125388|1007006|        2|
|2178179|1015794|        4|
|2337258|1002328|        1|
|2355419|1003741|        1|
|1005051|   2249|        1|
|2156825|2152544|        4|
|2050231|6755017|        3|
|1047853|    121|       22|
|2178597|   1390|       85|
|2004429|1100522|        1|
|2174785|1233793|       12|
|2075647|1001412|        4|
|2159048|1103014|       21|
|2271867|1006102|        2|
+-------+-------+---------+
only showing top 20 rows



In [7]:
artists.show()

+--------+--------------------+
|ArtistID|          ArtistName|
+--------+--------------------+
| 1134999|        06Crazy Life|
| 6821360|        Pang Nakarin|
|10113088|Terfel, Bartoli- ...|
|10151459| The Flaming Sidebur|
| 6826647|   Bodenstandig 3000|
|10186265|Jota Quest e Ivet...|
| 6828986|       Toto_XX (1977|
|10236364|         U.S Bombs -|
| 1135000|artist formaly kn...|
|10299728|Kassierer - Musik...|
|10299744|         Rahzel, RZA|
| 6864258|      Jon Richardson|
| 6878791|Young Fresh Fello...|
|10299751|          Ki-ya-Kiss|
| 6909716|Underminded - The...|
|10435121|             Kox-Box|
| 6918061|  alexisonfire [wo!]|
| 1135001|         dj salinger|
| 6940391|The B52's - Chann...|
|10475396|             44 Hoes|
+--------+--------------------+
only showing top 20 rows



In [11]:

# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(maxIter=5, regParam=0.01, userCol="User", itemCol="Artist", ratingCol="PlayCount",
          coldStartStrategy="drop", implicitPrefs=True)
model = als.fit(training)

In [30]:
# Score the test data and derive binary 'liked' labels and predictions for evaluation
predictions = model.transform(test)
predictions_with_liked = predictions.withColumn('liked', (predictions.PlayCount > lit(2)).cast('integer'))
predictions_with_predicted_like = predictions_with_liked.withColumn('predicted_liked', (predictions.prediction > lit(.01)).cast('integer'))
predictions_with_predicted_like = predictions_with_predicted_like.withColumn('raw_prediction', predictions.prediction.cast('double'))
predictions_with_predicted_like.show()

+-------+------+---------+------------+-----+---------------+--------------------+
|   User|Artist|PlayCount|  prediction|liked|predicted_liked|      raw_prediction|
+-------+------+---------+------------+-----+---------------+--------------------+
|1055046|   463|        4|   0.3475361|    1|              1|  0.3475360870361328|
|2073908|   463|       11| -0.03272286|    1|              0|-0.03272286057472229|
|2019909|   463|      104|   0.8063177|    1|              1|  0.8063176870346069|
|1046731|   463|        1|2.1941715E-4|    0|              0|2.194171538576483...|
|2023514|   463|        7| -0.21073908|    1|              0|-0.21073907613754272|
|2289818|   463|        2|-0.058894232|    0|              0|-0.05889423191547394|
|1071763|   463|        7| 0.035160925|    1|              1| 0.03516092523932457|
|2282933|   463|        3|  0.14566791|    1|              1|  0.1456679105758667|
|2271462|   463|       45|  0.20566681|    1|              1| 0.20566681027412415|
|200

In [32]:
# See the recommendations for the Classic Rock Fan, User 99999

# Generate the top 10 artist recommendations for each user
userRecs = model.recommendForAllUsers(10)
userRecs.show()

+-------+--------------------+
|   User|     recommendations|
+-------+--------------------+
|   7340|[[1233196,0.03311...|
|   8389|[[5,1.9840317E-4]...|
|1000190|[[1000130,1.76781...|
|1001043|[[1006016,0.06945...|
|1001129|[[1275996,0.06328...|
|1001139|[[930,0.51843673]...|
|1002431|[[979,0.5386638],...|
|1002605|[[979,0.02897332]...|
|1004666|[[13,0.82681453],...|
|1005158|[[1000113,0.06653...|
|1005439|[[13,0.8639908], ...|
|1005853|[[1838,0.12224305...|
|1007007|[[4468,0.00248035...|
|1008081|[[4061,0.08051114...|
|1008804|[[1270639,0.23872...|
|1012261|[[15,0.16461673],...|
|1016416|[[1193,0.09318961...|
|1017914|[[1270639,0.84431...|
|1024037|[[15,0.29532593],...|
|1024947|[[12,1.2198695], ...|
+-------+--------------------+
only showing top 20 rows



In [34]:
userRecs.filter(userRecs.User == 99999).show()

+----+---------------+
|User|recommendations|
+----+---------------+
+----+---------------+



In [None]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="PlayCount",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))


In [31]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="raw_prediction", labelCol="liked")
evaluator.evaluate(predictions_with_predicted_like)  #AUC

0.5793276036849171

In [5]:
def modelEval(model, dataset):
    # NOTE: relies on the notebook globals `sc`, `userArtistData`, `trainData`, and `rank`

    # Set of all distinct artists in the 'userArtistData' dataset
    allArtists = set(userArtistData.map(lambda x: x[1]).distinct().collect())

    # Artists each user actually listened to in the current (validation/test) dataset
    userArtistsDict = dict(dataset.map(lambda x: (x[0], x[1])).groupByKey().mapValues(set).collect())

    # Artists each user listened to in the training dataset
    userArtistTrain = dict(trainData.map(lambda x: (x[0], x[1])).groupByKey().mapValues(set).collect())

    # For each user, compute the fraction of overlap between predicted and actual artists
    total = 0.0
    for user, origArtists in userArtistsDict.items():
        # Candidate artists: everything the user did not already play in the training data
        nonTrainArtists = allArtists - userArtistTrain.get(user, set())
        # Number of artists the user actually listened to (X)
        origArtistsCnt = len(origArtists)
        # Pair the user with every candidate artist and create an RDD
        userArtistTest = sc.parallelize([(user, artist) for artist in nonTrainArtists])
        # Take the top-X artists by predicted score
        predArtists = (model.predictAll(userArtistTest)
                       .sortBy(lambda x: x[2], ascending=False)
                       .map(lambda x: x[1])
                       .take(origArtistsCnt))
        # Add this user's score to the running total
        total += len(set(predArtists) & origArtists) / origArtistsCnt

    # Print the average score over all users; `rank` is the notebook-level
    # variable last set by the sweep loop, not a property read from the model
    print("The model score for rank %d is %f" % (rank, total / len(userArtistsDict)))

### Model Construction

Now we can build the best model possible using the validation set of data and the `modelEval` function. Although there are a few parameters we could optimize, for the sake of time, we will just try a few different values for the [rank parameter](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#collaborative-filtering) (leave everything else at its default value, **except make `seed`=345**). Loop through the values [2, 10, 20] and figure out which one produces the highest score based on your model evaluation function.

Note: this procedure may take several minutes to run.

For each rank value, print out the output of the `modelEval` function for that model. Your output should look as follows:
```
The model score for rank 2 is 0.090431
The model score for rank 10 is 0.095294
The model score for rank 20 is 0.090248
```

In [6]:
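# RDD-based API that provides trainImplicit (shadows the pyspark.ml ALS imported earlier)
from pyspark.mllib.recommendation import ALS

rankList = [2, 10, 20]
for rank in rankList:
    model = ALS.trainImplicit(trainData, rank, seed=345)
    modelEval(model, validationData)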

The model score for rank 2 is 0.093462
The model score for rank 10 is 0.097899
The model score for rank 20 is 0.084259


Now, using the bestModel, we will check the results over the test data. Your result should be ~`0.0507`.

In [7]:
bestModel = ALS.trainImplicit(trainData, rank=10, seed=345)
modelEval(bestModel, testData)

The model score for rank 20 is 0.061246


## Trying Some Artist Recommendations
Using the best model above, predict the top 5 artists for user `1059637` using the [recommendProducts](http://spark.apache.org/docs/1.5.2/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.MatrixFactorizationModel.recommendProducts) function. Map the results (integer IDs) to the real artist names using `artistData`. Print the results. The output should look as follows:
```
Artist 0: Brand New
Artist 1: Taking Back Sunday
Artist 2: Evanescence
Artist 3: Elliott Smith
Artist 4: blink-182
```

In [8]:
# Find the top 5 artists for a particular user and list their names
topRating = bestModel.recommendProducts(1059637, 5)
# Look-up table from canonical artist ID to artist name
artistDataDict = dict(artistData.collect())
for i, rating in enumerate(topRating):
    print("Artist %d: %s" % (i, artistDataDict[rating.product]))

Artist 0: blink-182
Artist 1: Elliott Smith
Artist 2: Taking Back Sunday
Artist 3: Incubus
Artist 4: Death Cab for Cutie
