## Collaborative Filtering
Collaborative filtering is a machine learning technique that predicts the ratings users would give to items, based on the ratings that users have already given.

### Import the ALS class
In this exercise, you will use the Alternating Least Squares (ALS) collaborative filtering algorithm to create a recommender.

In [1]:
from pyspark.ml.recommendation import ALS

SparkSession available as 'spark'.
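To give a feel for what Alternating Least Squares does, here is a minimal pure-Python sketch with a single latent factor per user and per item. It is not Spark's implementation, which uses more than one factor by default and runs distributed. The toy ratings, the function names (`als_rank1`, `predict`, `rmse`), and the parameter values are all invented for illustration.

```python
import math

# Toy ratings keyed by (user, item). These values are invented for illustration.
toy_ratings = {
    ("u1", "m1"): 5.0, ("u1", "m2"): 3.0,
    ("u2", "m1"): 4.0, ("u2", "m3"): 1.0,
    ("u3", "m2"): 2.0, ("u3", "m3"): 1.0,
}

def als_rank1(ratings, reg=0.01, iterations=20):
    """Fit one latent factor per user and per item by alternating least squares."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(iterations):
        # Hold the item factors fixed: each user factor then has a
        # closed-form regularized least-squares solution.
        for u in users:
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item_f[i] ** 2 for (uu, i) in ratings if uu == u)
            user_f[u] = num / den
        # Hold the user factors fixed and solve for each item factor.
        for i in items:
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user_f[u] ** 2 for (u, ii) in ratings if ii == i)
            item_f[i] = num / den
    return user_f, item_f

def predict(user_f, item_f, user, item):
    """A predicted rating is the product of the user and item factors."""
    return user_f[user] * item_f[item]

def rmse(ratings, user_f, item_f):
    """Root mean squared error over the known ratings."""
    sq = [(predict(user_f, item_f, u, i) - r) ** 2 for (u, i), r in ratings.items()]
    return math.sqrt(sum(sq) / len(sq))

user_f, item_f = als_rank1(toy_ratings)
fit_error = rmse(toy_ratings, user_f, item_f)
# Predict a rating for a (user, item) pair that is absent from the toy data.
unseen = predict(user_f, item_f, "u1", "m3")
print("training RMSE:", fit_error, " predicted u1/m3:", unseen)
```

The key idea is the alternation: with one side fixed, the other side reduces to an ordinary regularized least-squares problem, so each step can be solved directly rather than by gradient descent.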


### Load Source Data
The source data for the recommender is in two files: one contains numeric IDs for movies and users, along with the users' ratings, and the other contains details of the movies.

In [2]:
ratings = spark.read.csv('wasb:///data/ratings.csv', inferSchema=True, header=True)
movies = spark.read.csv('wasb:///data/movies.csv', inferSchema=True, header=True)


In [3]:
ratings.show()
movies.show()

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|     31|   2.5|1260759144|
|     1|   1029|   3.0|1260759179|
|     1|   1061|   3.0|1260759182|
|     1|   1129|   2.0|1260759185|
|     1|   1172|   4.0|1260759205|
|     1|   1263|   2.0|1260759151|
|     1|   1287|   2.0|1260759187|
|     1|   1293|   2.0|1260759148|
|     1|   1339|   3.5|1260759125|
|     1|   1343|   2.0|1260759131|
|     1|   1371|   2.5|1260759135|
|     1|   1405|   1.0|1260759203|
|     1|   1953|   4.0|1260759191|
|     1|   2105|   4.0|1260759139|
|     1|   2150|   3.0|1260759194|
|     1|   2193|   2.0|1260759198|
|     1|   2294|   2.0|1260759108|
|     1|   2455|   2.5|1260759113|
|     1|   2968|   1.0|1260759200|
|     1|   3671|   3.0|1260759117|
+------+-------+------+----------+
only showing top 20 rows

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+

In [4]:
ratings.join(movies, "movieId").show()

+-------+------+------+----------+--------------------+--------------------+
|movieId|userId|rating| timestamp|               title|              genres|
+-------+------+------+----------+--------------------+--------------------+
|     31|     1|   2.5|1260759144|Dangerous Minds (...|               Drama|
|   1029|     1|   3.0|1260759179|        Dumbo (1941)|Animation|Childre...|
|   1061|     1|   3.0|1260759182|     Sleepers (1996)|            Thriller|
|   1129|     1|   2.0|1260759185|Escape from New Y...|Action|Adventure|...|
|   1172|     1|   4.0|1260759205|Cinema Paradiso (...|               Drama|
|   1263|     1|   2.0|1260759151|Deer Hunter, The ...|           Drama|War|
|   1287|     1|   2.0|1260759187|      Ben-Hur (1959)|Action|Adventure|...|
|   1293|     1|   2.0|1260759148|       Gandhi (1982)|               Drama|
|   1339|     1|   3.5|1260759125|Dracula (Bram Sto...|Fantasy|Horror|Ro...|
|   1343|     1|   2.0|1260759131|    Cape Fear (1991)|            Thriller|

### Prepare the Data
To prepare the data, split it into a training set and a test set.

In [5]:
# Keep only the columns that ALS needs
data = ratings.select("userId", "movieId", "rating")
# Randomly split the rows: roughly 70% for training and 30% for testing
splits = data.randomSplit([0.7, 0.3])
train = splits[0].withColumnRenamed("rating", "label")
test = splits[1].withColumnRenamed("rating", "trueLabel")
train_rows = train.count()
test_rows = test.count()
print("Training Rows:", train_rows, " Testing Rows:", test_rows)

Training Rows: 69742  Testing Rows: 30262
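The counts above are close to, but not exactly, 70% and 30% of the 100004 rows. That is what you would expect if each row is assigned to a split independently at random rather than by cutting the data at an exact fraction. The following pure-Python sketch illustrates that idea; it is a simplified stand-in and not Spark's code, and the function name `random_split` and the seed value are assumptions made for the example.

```python
import random

def random_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train_part, test_part = [], []
    for row in rows:
        if rng.random() < train_weight:
            train_part.append(row)
        else:
            test_part.append(row)
    return train_part, test_part

toy_rows = list(range(1000))
toy_train, toy_test = random_split(toy_rows)
# The sizes are near 700 and 300, but usually not exactly those values.
print(len(toy_train), len(toy_test))
```

In the notebook, `randomSplit` also accepts an optional `seed` argument, for example `data.randomSplit([0.7, 0.3], seed=42)`, which is worth setting if you want the same split each time you rerun the cell.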

### Build the Recommender
The ALS class is an estimator, so you can use its **fit** method to train a model, or you can include it in a pipeline. Rather than specifying a feature vector and a label, the ALS algorithm requires a numeric user ID, item ID, and rating.

In [6]:
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="label")
model = als.fit(train)

### Test the Recommender
Now that you've trained the recommender, you can see how accurately it predicts known ratings in the test set.

In [7]:
prediction = model.transform(test)
prediction.join(movies, "movieId").select("userId", "title", "prediction", "trueLabel").show(100, truncate=False)

+------+--------------------------------+----------+---------+
|userId|title                           |prediction|trueLabel|
+------+--------------------------------+----------+---------+
|452   |Guilty as Sin (1993)            |3.1304564 |2.0      |
|126   |Hudsucker Proxy, The (1994)     |3.8070188 |5.0      |
|285   |Hudsucker Proxy, The (1994)     |3.4840355 |5.0      |
|452   |Hudsucker Proxy, The (1994)     |3.0028875 |3.0      |
|299   |Hudsucker Proxy, The (1994)     |4.2841263 |4.5      |
|607   |Hudsucker Proxy, The (1994)     |3.396921  |4.0      |
|659   |Hudsucker Proxy, The (1994)     |3.1882033 |4.0      |
|380   |Hudsucker Proxy, The (1994)     |3.2525508 |4.0      |
|624   |Hudsucker Proxy, The (1994)     |3.9686584 |4.0      |
|509   |Hudsucker Proxy, The (1994)     |3.4507537 |4.0      |
|102   |Hudsucker Proxy, The (1994)     |3.9069932 |5.0      |
|487   |Hudsucker Proxy, The (1994)     |2.994244  |4.0      |
|105   |Hudsucker Proxy, The (1994)     |3.959364  |4.0      |
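Comparing predictions by eye only goes so far; a single error figure such as the root mean squared error (RMSE) summarizes how far the predictions are from the true ratings on average. The sketch below computes it in plain Python for just the 13 rows displayed above, so the result describes this small sample and not the whole test set. The variable and function names are chosen for illustration.

```python
import math

# (prediction, trueLabel) pairs copied from the rows displayed above.
shown_rows = [
    (3.1304564, 2.0), (3.8070188, 5.0), (3.4840355, 5.0), (3.0028875, 3.0),
    (4.2841263, 4.5), (3.396921, 4.0), (3.1882033, 4.0), (3.2525508, 4.0),
    (3.9686584, 4.0), (3.4507537, 4.0), (3.9069932, 5.0), (2.994244, 4.0),
    (3.959364, 4.0),
]

def root_mean_squared_error(pairs):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(pred - actual) ** 2 for pred, actual in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

sample_rmse = root_mean_squared_error(shown_rows)
print("RMSE over the displayed rows:", round(sample_rmse, 3))
```

To score the full `prediction` DataFrame in Spark, one option is `RegressionEvaluator` from `pyspark.ml.evaluation`, with `labelCol="trueLabel"`, `predictionCol="prediction"`, and `metricName="rmse"`. Be aware that users or movies that appear only in the test set can receive `NaN` predictions, which would make the overall RMSE `NaN` unless those rows are dropped first.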

The data used in this exercise describes 5-star rating activity from [MovieLens](http://movielens.org), a movie recommendation service. It was created by GroupLens, a research group in the Department of Computer Science and Engineering at the University of Minnesota, and is used here with permission.

This dataset and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.

For more information, see F. Maxwell Harper and Joseph A. Konstan. 2015. [The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015)](http://dx.doi.org/10.1145/2827872)