# Movie Recommendations based on Ratings

With collaborative filtering, we make predictions (filtering) about a user's interests by collecting preference or taste information from many users (collaborating). The underlying assumption is that if user A has the same opinion as user B on one issue, A is more likely to share B's opinion on a different issue x than the opinion of a randomly chosen user.

The image below (from Wikipedia) shows an example of collaborative filtering. First, people rate different items (such as videos, images, and games). Then the system predicts a user's rating for an item that user has not rated yet. The predictions are built from the existing ratings of other users whose ratings are similar to those of the active user. In the image, the system predicts that the user will not like the video.

<img src=https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif />
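
The idea in the image can be sketched in plain Scala. This is a deliberately simplistic neighborhood-based predictor (cosine similarity plus a weighted average), not the ALS method used in the rest of this notebook. The user names, items, and ratings are hypothetical and are not taken from the dataset.

```scala
// Hypothetical ratings: user -> (item -> rating). Not taken from the dataset used below.
val ratings: Map[String, Map[String, Double]] = Map(
  "alice" -> Map("video" -> 1.0, "image" -> 5.0, "game" -> 4.0),
  "bob"   -> Map("video" -> 2.0, "image" -> 5.0, "game" -> 5.0),
  "carol" -> Map("video" -> 5.0, "image" -> 1.0, "game" -> 2.0),
  "dave"  -> Map("image" -> 5.0, "game" -> 4.0) // has not rated "video"
)

// Cosine similarity computed over the items both users have rated.
def similarity(a: Map[String, Double], b: Map[String, Double]): Double = {
  val common = a.keySet.intersect(b.keySet).toSeq
  if (common.isEmpty) 0.0
  else {
    val dot   = common.map(i => a(i) * b(i)).sum
    val normA = math.sqrt(common.map(i => a(i) * a(i)).sum)
    val normB = math.sqrt(common.map(i => b(i) * b(i)).sum)
    dot / (normA * normB)
  }
}

// Similarity-weighted average of the ratings that other users gave to `item`.
// Returns None when nobody else has rated the item.
def predict(user: String, item: String): Option[Double] = {
  val neighbours = ratings.toSeq.collect {
    case (other, theirs) if other != user && theirs.contains(item) =>
      (similarity(ratings(user), theirs), theirs(item))
  }
  val weightSum = neighbours.map(_._1).sum
  if (weightSum == 0.0) None
  else Some(neighbours.map { case (w, r) => w * r }.sum / weightSum)
}

val predicted = predict("dave", "video")
println(predicted)
```

Because "dave" rates images and games much like "alice" and "bob" do, their low video ratings weigh slightly more than "carol"'s high one, and the prediction falls below the middle of the 1 to 5 scale.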

Spark's MLlib machine learning library provides a collaborative filtering implementation based on __Alternating Least Squares (ALS)__. The implementation in MLlib has these parameters:

* `numBlocks` is the number of blocks used to parallelize computation (set to -1 to auto-configure in the RDD-based API; the DataFrame-based `ALS` class used below exposes `numUserBlocks` and `numItemBlocks`).
* `rank` is the number of latent factors in the model.
* `iterations` is the number of iterations to run (`maxIter` in the DataFrame-based API).
* `lambda` specifies the regularization parameter in ALS (`regParam` in the DataFrame-based API).
* `implicitPrefs` specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
* `alpha` is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
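
As a rough sketch, the parameters above could be set on the DataFrame-based `org.apache.spark.ml.recommendation.ALS` class like this. The values are illustrative assumptions, not tuned settings and not necessarily the library defaults. The name `tunedAls` is hypothetical.

```scala
import org.apache.spark.ml.recommendation.ALS

// Illustrative values only; none of them were tuned on this dataset.
val tunedAls = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setRank(10)              // number of latent factors
  .setMaxIter(10)           // `iterations` in the list above
  .setRegParam(0.1)         // `lambda` in the list above
  .setNumBlocks(10)         // sets both the user and the item block counts
  .setImplicitPrefs(false)  // explicit feedback: ratings are used as given
  .setAlpha(1.0)            // only relevant when implicitPrefs is true
  .setSeed(42L)             // fixes the random initialization of the factors
```

Creating the estimator does not need a running Spark session. Only `fit` does.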


### Creating a Spark Session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("movie").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577810925749)
SparkSession available as 'spark'


2019-12-31 22:18:42 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-12-31 22:19:01 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@41c96724


### Initializing Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Using Spark to read the MovieLens dataset

In [3]:
val data = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("movielens_ratings.csv")

data: org.apache.spark.sql.DataFrame = [movieId: int, rating: double ... 1 more field]


### Printing the first few rows of the DataFrame

In [4]:
data.show(5)

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      2|   3.0|     0|
|      3|   1.0|     0|
|      5|   2.0|     0|
|      9|   4.0|     0|
|     11|   1.0|     0|
+-------+------+------+
only showing top 5 rows



### Describe

In [5]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



### Count

In [6]:
data.count()

res3: Long = 1501


### Schema

In [7]:
data.printSchema

root
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- userId: integer (nullable = true)



We can hold out part of the data to evaluate how well our model performs, but it is hard to know conclusively how well a recommender system truly works for some topics, especially when subjectivity is involved. For example, not everyone who loves Star Wars is going to love Star Trek, even though a recommendation system may suggest otherwise.

### Since the dataset is small, we use an 80%/20% train-test split

In [8]:
val Array(train_data,test_data) = data.randomSplit(Array(0.8,0.2))

train_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [movieId: int, rating: double ... 1 more field]
test_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [movieId: int, rating: double ... 1 more field]
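
The split above is unseeded, so the row counts and the RMSE reported later will differ from run to run. A minimal sketch of a reproducible split follows, assuming a seed of `42` (any fixed value works). The `sample` DataFrame is a hypothetical synthetic stand-in for `data`, so that the snippet runs on its own.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses the notebook's session if one is already running.
val spark = SparkSession.builder().master("local[*]").appName("movie").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for `data`: 100 synthetic (movieId, rating, userId) rows.
val sample = (0 until 100)
  .map(i => (i % 10, 1.0 + (i % 5), i % 4))
  .toDF("movieId", "rating", "userId")

// Passing the same seed twice yields the same partition of the rows.
val Array(trainA, testA) = sample.randomSplit(Array(0.8, 0.2), seed = 42L)
val Array(trainB, testB) = sample.randomSplit(Array(0.8, 0.2), seed = 42L)

println(s"train: ${trainA.count()} rows, test: ${testA.count()} rows")
```

The weights are only approximate targets, which is why the original run produced 1202 and 299 rows rather than an exact 80/20 division of 1501.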


In [9]:
train_data.describe().show()

+-------+------------------+------------------+----------------+
|summary|           movieId|            rating|          userId|
+-------+------------------+------------------+----------------+
|  count|              1202|              1202|            1202|
|   mean|  48.8801996672213|1.7678868552412645|            14.5|
| stddev|29.050485051463358|1.1785766096537431|8.59499011754712|
|    min|                 0|               1.0|               0|
|    max|                99|               5.0|              29|
+-------+------------------+------------------+----------------+



In [10]:
test_data.describe().show()

+-------+------------------+------------------+-----------------+
|summary|           movieId|            rating|           userId|
+-------+------------------+------------------+-----------------+
|  count|               299|               299|              299|
|   mean| 51.51839464882943|1.7993311036789297|13.91638795986622|
| stddev|28.426323514572527|1.2233190990867386|8.573587836250049|
|    min|                 0|               1.0|                0|
|    max|                99|               5.0|               29|
+-------+------------------+------------------+-----------------+



### Setting up Recommender system

In [11]:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS


### Building the recommendation model using ALS on the training data

In [12]:
val als = new ALS().setUserCol("userId").setItemCol("movieId").setRatingCol("rating")

als: org.apache.spark.ml.recommendation.ALS = als_8ba9d7d70b99


In [13]:
val model = als.fit(train_data)

2019-12-31 22:25:34 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-12-31 22:25:34 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
2019-12-31 22:25:36 WARN  LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
2019-12-31 22:25:36 WARN  LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK


model: org.apache.spark.ml.recommendation.ALSModel = als_8ba9d7d70b99
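
A fitted `ALSModel` stores one latent factor vector per user and per movie (available as the `model.userFactors` and `model.itemFactors` DataFrames), and a predicted rating is the dot product of the two vectors. The plain Scala sketch below illustrates that arithmetic with hypothetical rank-3 vectors. The numbers are made up and are not the factors learned above.

```scala
// Hypothetical rank-3 factor vectors, keyed by userId and movieId.
val toyUserFactors: Map[Int, Array[Float]] = Map(
  11 -> Array(1.2f, 0.4f, -0.3f)
)
val toyItemFactors: Map[Int, Array[Float]] = Map(
  32 -> Array(2.0f, 1.5f, 0.5f),
  97 -> Array(0.5f, 0.2f, 0.1f)
)

// Predicted rating = dot product of the user's and the movie's factor vectors.
// Returns None when either id has no factor vector (the cold-start case).
def predictRating(userId: Int, movieId: Int): Option[Float] =
  for {
    u <- toyUserFactors.get(userId)
    v <- toyItemFactors.get(movieId)
  } yield u.zip(v).map { case (a, b) => a * b }.sum

println(predictRating(11, 32))
println(predictRating(11, 97))
println(predictRating(99, 32))
```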


### Now let's see how the model performed!

In [14]:
val predictions = model.transform(test_data)

predictions: org.apache.spark.sql.DataFrame = [movieId: int, rating: double ... 2 more fields]
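
With a random split, a user or movie can end up only in the test set. By default the model predicts `NaN` for such rows, which would make the RMSE `NaN` as well. That evidently did not happen in this run, since the RMSE reported further down is a number. If it does occur, one option is the `coldStartStrategy` parameter, sketched here. The name `alsDropUnknown` is hypothetical.

```scala
import org.apache.spark.ml.recommendation.ALS

// "drop" removes rows whose user or movie was not seen during training
// from the output of transform(); the default strategy is "nan".
val alsDropUnknown = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop")
```

Dropping rows keeps the metric finite, but it also means the evaluation silently ignores users and movies the model has never seen.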


In [16]:
predictions.show(5,false)

+-------+------+------+----------+
|movieId|rating|userId|prediction|
+-------+------+------+----------+
|31     |1.0   |27    |1.2951454 |
|31     |1.0   |5     |0.48876566|
|31     |1.0   |19    |0.7833741 |
|31     |3.0   |14    |1.9824624 |
|85     |2.0   |20    |2.1791806 |
+-------+------+------+----------+
only showing top 5 rows
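
Two of the predictions shown are below 1.0, although the lowest rating in the data is 1.0: ALS does not constrain its output to the rating scale. An optional post-processing step, which is not part of the original notebook, is to clamp predictions to the observed range. A minimal plain Scala sketch, assuming the 1.0 to 5.0 range reported by `describe()`:

```scala
// Rating scale taken from the min and max of the rating column in describe().
val minRating = 1.0
val maxRating = 5.0

// Pull a prediction back into the valid rating range.
def clampToScale(p: Double): Double = math.max(minRating, math.min(maxRating, p))

// The five predictions printed by predictions.show(5, false).
val rawPredictions = Seq(1.2951454, 0.48876566, 0.7833741, 1.9824624, 2.1791806)
val clamped = rawPredictions.map(clampToScale)

println(clamped)
```

Since every true rating lies inside the range, clamping can never move a prediction further away from its true rating.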



### Evaluating the model by computing the RMSE on the test data

In [17]:
// Setting evaluator to evaluate Root Mean Squared Error
val evaluator = new RegressionEvaluator().setLabelCol("rating").setPredictionCol("prediction").setMetricName("rmse")

evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_79dc675047ec


In [18]:
val rmse = evaluator.evaluate(predictions)

rmse: Double = 1.0364932540530363


In [21]:
println(f"Root Mean Squared Error: ${rmse}%1.2f")

Root Mean Squared Error: 1.04


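**The RMSE of `1.04` is expressed in the units of the rating column: in the root-mean-square sense, predictions miss the true rating by about one star.** For context, the test ratings have a standard deviation of about 1.22, so the model does only modestly better than always predicting the mean rating.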
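
To make the metric concrete, the sketch below recomputes the RMSE by hand in plain Scala for the five prediction rows printed earlier. The names `shown` and `rmseOf` are hypothetical helpers. The result (about 0.54) differs from 1.04 because it covers only those five rows, not all 299 test rows.

```scala
// (rating, prediction) pairs copied from the five rows printed by predictions.show(5, false).
val shown: Seq[(Double, Double)] = Seq(
  (1.0, 1.2951454),
  (1.0, 0.48876566),
  (1.0, 0.7833741),
  (3.0, 1.9824624),
  (2.0, 2.1791806)
)

// RMSE = sqrt(mean((rating - prediction)^2))
def rmseOf(pairs: Seq[(Double, Double)]): Double = {
  val squaredErrors = pairs.map { case (rating, prediction) =>
    math.pow(rating - prediction, 2)
  }
  math.sqrt(squaredErrors.sum / pairs.size)
}

val rmseOfShownRows = rmseOf(shown)
println(f"RMSE over the five shown rows: $rmseOfShownRows%1.4f")
```

Squaring the errors means the single large miss (3.0 predicted as about 1.98) contributes most of the total.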

### Recommendations for a particular user

In [22]:
val single_user = test_data.select("userId","movieId").filter("userId == 11")

single_user: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userId: int, movieId: int]


In [23]:
single_user.show()

+------+-------+
|userId|movieId|
+------+-------+
|    11|     10|
|    11|     11|
|    11|     21|
|    11|     32|
|    11|     41|
|    11|     43|
|    11|     47|
|    11|     48|
|    11|     59|
|    11|     64|
|    11|     69|
|    11|     77|
|    11|     79|
|    11|     80|
|    11|     97|
+------+-------+



In [24]:
val recommendations = model.transform(single_user)

recommendations: org.apache.spark.sql.DataFrame = [userId: int, movieId: int ... 1 more field]


In [27]:
recommendations.orderBy(recommendations("prediction").desc).show()

+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|    11|     32|  4.413526|
|    11|     79| 2.7643094|
|    11|     80| 2.6780334|
|    11|     48| 2.6065779|
|    11|     10| 2.5574136|
|    11|     64| 2.2372367|
|    11|     69| 1.9722278|
|    11|     43| 1.8546999|
|    11|     59| 1.5770007|
|    11|     47| 1.4538007|
|    11|     11| 1.3336184|
|    11|     41| 1.3124973|
|    11|     21| 1.3037841|
|    11|     77| 1.2457387|
|    11|     97| 0.8805586|
+------+-------+----------+
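
The ranking above scores movies that user 11 had already rated, namely the rows held out in the test set. To ask the model directly for its top-N movies per user, `ALSModel` provides `recommendForAllUsers` and `recommendForUserSubset`. The sketch below fits a small model on hypothetical toy ratings so that it runs on its own. In the notebook, the same call would be made on `model`.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses the notebook's session if one is already running.
val spark = SparkSession.builder().master("local[*]").appName("movie").getOrCreate()
import spark.implicits._

// Hypothetical toy ratings (userId, movieId, rating); not the MovieLens file.
val toyRatings = Seq(
  (0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
  (1, 0, 4.0), (1, 1, 5.0), (1, 3, 2.0),
  (2, 2, 5.0), (2, 3, 4.0), (2, 0, 1.0)
).toDF("userId", "movieId", "rating")

val toyModel = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setRank(2)
  .setMaxIter(5)
  .setSeed(42L)
  .fit(toyRatings)

// Top 2 movies for user 0 only; one output row per requested user.
val users = Seq(0).toDF("userId")
val top2 = toyModel.recommendForUserSubset(users, 2)
top2.show(false)
```

The result has a `recommendations` column holding an array of (movieId, predicted rating) structs. As far as I know, these methods do not exclude movies the user has already rated, so those may need to be filtered out separately.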



### Closing the Spark session

In [28]:
spark.stop()

## Thank You!