# Recommendations Lab Using ALS: Audioscrobbler

## Description

In this lab we use the Audioscrobbler dataset and Spark's ALS (alternating least squares)
recommender, which is based on collaborative filtering.


## Datasets

You will be using publicly available song data from Audioscrobbler, which can be found [here](http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html). However, we modified the original data files so that the code runs in a reasonable time on a single machine. The reduced data files are suffixed with `_small.txt` and contain only the information relevant to the top 50 most prolific users (highest artist play counts).

The original data file `user_artist_data.txt` contained about 141,000 unique users and 1.6 million unique artists. It records about 24.2 million user-artist pairs, each with its play count.

Note that when plays are scrobbled, the client application submits the name of the artist being played. This name could be misspelled or nonstandard, and this may only be detected later. For example, "The Smiths", "Smiths, The", and "the smiths" may appear as distinct artist IDs in the data set, even though they clearly refer to the same artist. For this reason, the data set includes `artist_alias.txt`, which maps artist IDs that are known misspellings or variants to the canonical ID of that artist.

The `artist_data.txt` file then provides a map from the canonical artist ID to the name of the artist.
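
The two lookups described above (variant ID to canonical ID, then canonical ID to name) can be sketched in plain Python. This is only an illustration: the IDs, names, and play rows below are made up, and the helper `canonical_id` is a hypothetical name, not part of the lab code.

```python
# Hypothetical sample rows; the real IDs and names come from the data files.
artist_alias = {1001: 1000, 1002: 1000}   # variant ID -> canonical ID
artist_names = {1000: "The Smiths"}       # canonical ID -> artist name


def canonical_id(artist_id, aliases):
    """Return the canonical ID, or the ID itself if it has no known alias."""
    return aliases.get(artist_id, artist_id)


# (user, artist, play count) rows that use a mix of variant and canonical IDs
plays = [(1, 1001, 3), (1, 1002, 2), (2, 1000, 7)]

# Merge play counts that refer to the same canonical artist.
merged = {}
for user, artist, count in plays:
    key = (user, canonical_id(artist, artist_alias))
    merged[key] = merged.get(key, 0) + count

for (user, artist), count in sorted(merged.items()):
    print(user, artist_names[artist], count)
```

Without this step, the same artist would be split across several item IDs and the model would see weaker signals for each of them.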

## The Recommender Model

For this project, we will train the model with implicit feedback. You can read more about this on the collaborative filtering page: [http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html). The RDD-based API exposes implicit training as [`ALS.trainImplicit`](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS.trainImplicit); the code in this lab uses the DataFrame-based `pyspark.ml.recommendation.ALS` with `implicitPrefs=True` instead. Either way, a few tunable parameters affect how the model is built.


## Necessary Package Imports

In [None]:
# Import libraries

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
from pyspark.sql.functions import *

import pandas as pd

print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

## Step 1: Loading data

Load the data into two DataFrames: `artists` and `dataset`.

In [None]:
artists = spark.read.format("csv").option('header','true').option('delimiter', '\t').\
  option('inferSchema', 'true').load("/data/audioscrobble/artist_data.txt.gz")

dataset = spark.read.csv("/data/audioscrobble/user_artist_data.csv.gz", header=True, inferSchema=True)

(training, test) = dataset.randomSplit([0.8, 0.2])


### Enter some sample data.

Just for fun, we are going to make our own imaginary user playlist, and see what is 
recommended to that user.


In [None]:
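# Classic Rock Fan Data

# union() matches columns by position, so the column order here must match
# the training DataFrame (assumed to be User, Artist, PlayCount).
pd_df = pd.DataFrame({'User' : [99999, 99999, 99999, 99999, 99999, 99999, 99999],
                      'Artist' : [10215385, 9915421, 3292, 5687, 1014221, 1000055,  1004241],
                      'PlayCount' : [12, 7, 13, 8, 15, 5, 2]
             })

my_playlist = spark.createDataFrame(pd_df)

# union() replaces unionAll(), which has been deprecated since Spark 2.0
training = training.union(my_playlist)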


### Enter your own favorites

Open your music player or phone and see what some of your favorites are in your playlists. Then, find the artist IDs in the artist file and create your own playlist.

**=> TODO: Create your own dataframe with your own favorites**

You can grep for your favorite artists in `/data/audioscrobble/`.

In [None]:
# TODO: Your data

# Use User #99998

# Create a pandas dataframe with your data.  Look up your data from the artists dataframe


# add it to training.

In [None]:
dataset.show()

In [None]:
artists.show()

## Step 2: Train ALS model using implicit ratings

We are going to train the ALS model on implicit ratings. That means the play count
serves as a rough guide to how much the user likes the music, rather than an explicit
1-5 star rating.
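
A minimal sketch of how play counts are commonly interpreted in implicit-feedback ALS, assuming the formulation from the collaborative filtering page linked earlier: each count becomes a binary preference plus a confidence weight of `1 + alpha * count`. The function name `preference_and_confidence` is hypothetical and only illustrates the idea; Spark does this internally, and its `ALS` exposes the weight as the `alpha` parameter.

```python
def preference_and_confidence(play_count, alpha=1.0):
    """Map a raw play count to (preference, confidence).

    preference: 1.0 if the user played the artist at all, else 0.0
    confidence: grows linearly with the play count, scaled by alpha
    """
    preference = 1.0 if play_count > 0 else 0.0
    confidence = 1.0 + alpha * play_count
    return preference, confidence


# A few illustrative play counts
for count in [0, 1, 5, 15]:
    print(count, preference_and_confidence(count))
```

Under this view, 15 plays and 5 plays both mean "the user likes this artist"; the higher count only makes the model more confident about it.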

In [None]:

# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(maxIter=5, regParam=0.01, userCol="User", itemCol="Artist", ratingCol="PlayCount",
          coldStartStrategy="drop", implicitPrefs=True)
model = als.fit(training)

## Evaluate the model

The problem with implicit ratings is that we don't have an objective measure for going back and evaluating our model. What we're going to do here is arbitrarily create a column called "liked",
which we define as a play count greater than 2. Then we'll see whether the results are positive, meaning that our model predicted the user would like that artist.

In [None]:
# Score the test data, then derive a 'liked' label and a thresholded 'predicted_liked' column
predictions = model.transform(test)
predictions_with_liked = predictions.withColumn('liked', (predictions.PlayCount > lit(2)).cast('integer'))
predictions_with_predicted_like = predictions_with_liked.withColumn('predicted_liked', (predictions.prediction > lit(.01)).cast('integer'))
predictions_with_predicted_like = predictions_with_predicted_like.withColumn('raw_prediction', predictions.prediction.cast('double'))
predictions_with_predicted_like.show()

### Run Recommendations for all users

Note that this will take a while. Be patient.

Once you're done, you can check out the classic rock fan, user 99999, and your own user, 99998.

In [None]:
# Generate the top 10 artist recommendations for each user
# (the next cell filters these down to the Classic Rock Fan, user 99999)
userRecs = model.recommendForAllUsers(10)
userRecs.show()

In [None]:
userRecs.filter(userRecs.User == 99999).show()

**=> TODO: Show your own recommendations**

Print out the recommendations for your user, 99998.

In [None]:
# TODO: show your own recommendations

In [None]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="PlayCount",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))
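
A caveat worth keeping in mind when reading this number: with `implicitPrefs=True`, the `prediction` column estimates a preference score rather than a play count, so an RMSE against raw play counts will tend to be dominated by heavy listeners. The sketch below uses made-up values and a hypothetical `rmse` helper to show the effect; it is not the evaluator's implementation.

```python
import math


def rmse(labels, predictions):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: raw play counts versus preference-scale predictions
play_counts = [1, 3, 12, 40]
preference_scores = [0.2, 0.6, 0.9, 1.0]

print(rmse(play_counts, preference_scores))
```

Even though every prediction here ranks the artists in the right order, the error is large because the two columns live on different scales.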


## Print out AUC for the predictions

In [None]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="raw_prediction", labelCol="liked")
evaluator.evaluate(predictions_with_predicted_like)  #AUC
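
To make the AUC value easier to interpret, here is a small pure-Python sketch of one standard way to compute it: the fraction of (liked, not liked) pairs in which the liked item receives the higher score, with ties counted as half. The `auc` function and the sample values are illustrative assumptions, not what `BinaryClassificationEvaluator` runs internally.

```python
def auc(labels, scores):
    """Pairwise AUC: probability that a positive outscores a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical 'liked' labels and raw prediction scores
liked = [1, 0, 1, 0, 1]
raw_prediction = [0.9, 0.1, 0.4, 0.5, 0.8]

print(auc(liked, raw_prediction))
```

An AUC of 0.5 means the scores rank liked and not-liked items no better than chance, while 1.0 means every liked item is ranked above every not-liked one.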

## Perform your own evaluation

Can you think of some of your own evaluation metrics that you can run?

**=> TODO: Look at some other evaluation metrics**
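
As one possible starting point, ranking metrics such as precision at k are often used for recommenders: of the top k artists recommended to a user, what fraction did the user actually like? The sketch below is a plain-Python illustration with made-up artist IDs and a hypothetical `precision_at_k` helper; wiring it up to `userRecs` and the test set is left to you.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical artist IDs: a ranked recommendation list and the artists
# the user actually 'liked' in the held-out test data
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 99}

print(precision_at_k(recommended, relevant, 5))
print(precision_at_k(recommended, relevant, 2))
```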

