# Recommendation models # 

There are two types:

* **Content based filtering**: This uses the attributes of an item to find items similar to a given item. These attributes can be text based (titles, names, tags, other metadata) or derived from audio and video metadata. User recommendations can likewise be generated from the attributes of user profiles.

* **Collaborative filtering**: This uses the wisdom-of-the-crowd approach: if two users have exhibited similar preferences, we assume they have similar taste. This is also called the nearest-neighbor model, with user-based and item-based variants.

## Explicit Matrix Factorization (Collaborative filtering) ##

MF aims to directly model the user-item matrix by representing it as a product of two smaller matrices of lower dimension.

We call this 'Explicit' as we are using information given by the users themselves voluntarily like movie ratings, genre preferences etc.

| User/Item | Superman | Spiderman | Batman |
|-----------|----------|-----------|--------|
| Bill      | 4        | 2         | x      |
| Xhang     | x        | 3         | 4      |
| Roy       | 2        | x         | 4      |



If we have a sparse User-Item matrix of size `U x I` (the `x` cells above are missing ratings), our aim is to create smaller, dense matrices of size `U x k` and `I x k`. These lower-dimension matrices are called **Factor Matrices**.

If we were to multiply the factor matrices we could reconstruct an approximate version of the original ratings matrix. 

To compute a predicted rating for a user and the corresponding item:

* Compute the vector dot product between the **relevant row of the user-factor matrix (the user's feature vector)** and the **relevant row of the item-factor matrix (the item's feature vector)**.
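
As a minimal sketch of that dot product, the plain Scala below predicts a rating from two small factor matrices. The factor values, `k = 2`, and the names `exampleUserFactors`, `exampleItemFactors` and `predictRating` are made up for illustration; a trained model would learn the factors from the data.

```scala
// Hypothetical factor matrices with k = 2 latent features (values are made up).
val exampleUserFactors: Map[String, Array[Double]] = Map(
  "Bill"  -> Array(1.2, 0.8),
  "Xhang" -> Array(0.3, 1.9),
  "Roy"   -> Array(0.5, 1.8)
)
val exampleItemFactors: Map[String, Array[Double]] = Map(
  "Superman"  -> Array(2.5, 0.6),
  "Spiderman" -> Array(1.0, 1.3),
  "Batman"    -> Array(0.4, 2.0)
)

// Dot product of two equal-length vectors.
def dot(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  a.zip(b).map { case (x, y) => x * y }.sum
}

// Predicted rating = user's feature vector . item's feature vector.
def predictRating(user: String, item: String): Double =
  dot(exampleUserFactors(user), exampleItemFactors(item))

// Bill has not rated Batman, but the factors still yield a score for that cell.
val billBatman = predictRating("Bill", "Batman")
```

Because every user and every item has a dense factor vector, a score can be computed for any user-item pair, including the ones that are missing from the original matrix.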

**Advantages of the Matrix Factorization:**

* Ease of computing recommendations once the model is created.
* Good performance

**Disadvantages:**

* More complex when compared to nearest-neighbor filtering.
* Computationally intensive during the model's training phase.

## Implicit Matrix Factorization ##

This form of MF is based on implicit feedback that is not given to us but can be inferred from the interactions the user might have with an item. 

For example:
* Binary data like if a user watched a movie, bought a book.
* Count data like how many movies watched or books bought.

MLlib implements Implicit MF in the following way:
* **P (binary Preference matrix)**: This informs us if the movie was viewed by the user.
* **C (matrix of Confidence weights)**: This tells us how many times the movie was viewed by the user.

The implicit model still creates user- and item-factor matrices. However, the aim is not to predict the rating that a user will give to a movie; the score it predicts is an estimate of the user's preference for a given item, and it will generally fall in the range 0 to 1.

**For example:**

The Preference matrix can look like:

| User/Item | Superman | Spiderman | Batman |
|-----------|----------|-----------|--------|
| Bill      | 1        | 0         | 0      |
| Xhang     | 0        | 0         | 1      |
| Roy       | 1        | 1         | 1      |
 
Here 1 indicates that the user watched the movie.

The Confidence matrix can look like:

| User/Item | Superman | Spiderman | Batman |
|-----------|----------|-----------|--------|
| Bill      | 4        | 0         | 0      |
| Xhang     | 0        | 0         | 2      |
| Roy       | 2        | 3         | 6      |
 
Here the numbers indicate how many times the user watched that movie. We can infer that the more often a user watched a movie, the more that person liked it.
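
The two matrices can be derived from raw view counts, as in the plain Scala sketch below. The weighting `1 + alpha * count` is an assumption taken from the common implicit-feedback formulation, not something stated above, and `alpha = 1.0` is an arbitrary choice; it gives unobserved pairs a small non-zero confidence instead of zero.

```scala
// View counts from the example: rows are Bill, Xhang, Roy; columns are Superman, Spiderman, Batman.
val viewCounts: Array[Array[Double]] = Array(
  Array(4.0, 0.0, 0.0),
  Array(0.0, 0.0, 2.0),
  Array(2.0, 3.0, 6.0)
)

// Binary preference: 1 if the user watched the movie at all, otherwise 0.
val preference: Array[Array[Int]] =
  viewCounts.map(_.map(c => if (c > 0) 1 else 0))

// Assumed confidence weighting c = 1 + alpha * count (alpha chosen arbitrarily here).
val alpha = 1.0
val confidence: Array[Array[Double]] =
  viewCounts.map(_.map(c => 1.0 + alpha * c))
```

With this weighting, Roy's six viewings of Batman carry far more weight during training than Bill's zero viewings of Spiderman, while the preference matrix only records whether any viewing happened.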

## Alternating Least Squares ##

ALS is an optimization technique to solve MF problems. 

It works by iteratively solving a series of least squares regression problems. At each iteration:

* One of the user- or item-factor matrices is treated as fixed while the other is updated using the fixed factor and rating data. 
* Then the factor matrix that was just solved for is, in turn, treated as fixed while the other matrix is updated.
* This process continues until the model has converged. 
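
The alternation can be sketched in plain Scala for the special case of rank 1, where each least squares sub-problem reduces to a closed-form scalar update. This is an illustration only: the ratings come from the explicit example earlier, the regularization value `reg` and all names are assumptions, and MLlib solves the general rank-k problem in a distributed way.

```scala
// Observed ratings as (userIndex, itemIndex, rating); missing cells are simply omitted.
val observed: Seq[(Int, Int, Double)] = Seq(
  (0, 0, 4.0), (0, 1, 2.0),
  (1, 1, 3.0), (1, 2, 4.0),
  (2, 0, 2.0), (2, 2, 4.0)
)
val numUsers = 3
val numItems = 3
val reg = 0.1 // regularization strength, arbitrary for this sketch

// Sum of squared errors over the observed cells only.
def squaredError(u: Array[Double], v: Array[Double]): Double =
  observed.map { case (i, j, r) => math.pow(r - u(i) * v(j), 2) }.sum

// Item factors fixed: solve a regularized least squares problem for each user.
def solveUsers(v: Array[Double]): Array[Double] =
  Array.tabulate(numUsers) { i =>
    val rated = observed.filter(_._1 == i)
    rated.map { case (_, j, r) => r * v(j) }.sum /
      (reg + rated.map { case (_, j, _) => v(j) * v(j) }.sum)
  }

// User factors fixed: solve a regularized least squares problem for each item.
def solveItems(u: Array[Double]): Array[Double] =
  Array.tabulate(numItems) { j =>
    val rated = observed.filter(_._2 == j)
    rated.map { case (i, _, r) => r * u(i) }.sum /
      (reg + rated.map { case (i, _, _) => u(i) * u(i) }.sum)
  }

// Start from constant factors, then alternate between the two solves.
var users = Array.fill(numUsers)(1.0)
var items = Array.fill(numItems)(1.0)
val initialError = squaredError(users, items)
for (_ <- 1 to 10) {
  users = solveUsers(items)
  items = solveItems(users)
}
val finalError = squaredError(users, items)
```

Each solve can only lower the regularized objective, which is why the alternation converges; here a fixed number of iterations stands in for a convergence check.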

# Step 1: Extracting the right features from the data #

We will use the explicit Matrix Factorization approach using the explicit features: userID, movieID, ratings for each user-movie pair.

## Start spark shell

`$Adarshs-MacBook-Pro:spark-2.0.1-bin-hadoop2.7 adarshnair$ ./bin/spark-shell --driver-memory 4g`

## Inspect the ratings dataset ##

**u.data**
* The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The time stamps are unix seconds since 1/1/1970 UTC. The data is randomly ordered. This is a tab separated list of `user id | item id | rating | timestamp. `

`scala> val rawData = sc.textFile("/Users/adarshnair/spark-2.0.1-bin-hadoop2.7/spark_projects/MovieLens/ml-100k/u.data")`

`scala> rawData.first()
res2: String = 196	242	3	881250949`

### Remove the timestamp data and split each record on '\t' ###

`scala> val rawRatings = rawData.map(_.split("\t").take(3))`

Data now looks like:

`scala> rawRatings.first()
res3: Array[String] = Array(196, 242, 3)`

# Step 2: Spark MLlib Recommendation library #

We will use the MLlib library.

### Import the ALS recommendation library ###

Import Spark's ALS recommendation model and inspect the train method

`scala> import org.apache.spark.mllib.recommendation.ALS`


### Import Rating library ###

https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/mllib/recommendation/Rating.html

Import the Rating class

`scala> import org.apache.spark.mllib.recommendation.Rating`

`scala> Rating()
error: not enough arguments for method apply: (user: Int, product: Int, rating: Double)org.apache.spark.mllib.recommendation.Rating in object Rating.`

We need to provide the Rating() function with three arguments:
* user
* product
* rating

**Construct the RDD of Rating objects around user id (user), movie id (product) and rating (rating).**

`scala> val ratings = rawRatings.map { case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }`

We use toInt, toDouble to convert string objects to numeric.

View the new ratings RDD:

`scala> ratings.first()
res5: org.apache.spark.mllib.recommendation.Rating = Rating(196,242,3.0)`


We now have the data we need in the format we need - userid, movieid, rating.
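
The parsing step can also be tried outside the Spark shell. The sketch below is plain Scala: `ExampleRating` is a hypothetical stand-in for MLlib's `Rating` so that no Spark dependency is needed, and `parseRating` is an illustrative helper that additionally reports malformed lines instead of failing with a `MatchError`.

```scala
// Stand-in for MLlib's Rating so the sketch runs without Spark on the classpath.
case class ExampleRating(user: Int, product: Int, rating: Double)

// Split on tabs, keep the first three fields, and convert them to numeric types.
def parseRating(line: String): ExampleRating =
  line.split("\t").take(3) match {
    case Array(user, movie, rating) =>
      ExampleRating(user.toInt, movie.toInt, rating.toDouble)
    case other =>
      throw new IllegalArgumentException(
        s"expected at least 3 tab-separated fields, got ${other.length}")
  }

// The first record of u.data shown earlier.
val parsed = parseRating("196\t242\t3\t881250949")
```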


## Training the recommendation model using explicit data ##

Train the ALS model using `ALS.train`. In addition to the ratings RDD, this function takes 3 parameters:

* **rank**: This is the number of hidden features in our low rank approximation matrices. A higher rank lets the model capture more structure, but it directly increases memory usage and computation. A typical range is roughly 10-200.

* **iterations**: The number of iterations, usually kept at around 10.

* **lambda**: Controls the regularization of the model and thereby mitigates overfitting. The higher the value of lambda, the more is the regularization applied. No 'ideal' values here, best to tune it using out of sample test data and cross validation.
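
The role of lambda can be made concrete with the usual regularized squared-error objective for explicit matrix factorization. The notation is an assumption introduced here: $r_{ui}$ is an observed rating, $x_u$ and $y_i$ are the user and item factor vectors (each of length rank), and $\mathcal{K}$ is the set of observed user-item pairs. MLlib's exact scaling of the penalty term may differ from this generic form.

```latex
\min_{X, Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)
```

A larger $\lambda$ penalizes large factor values more heavily, which pulls the model away from fitting individual ratings exactly.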

**We will use rank = 50, iterations = 10, lambda = 0.01.**

`scala> val model = ALS.train(ratings, 50, 10, 0.01)`

This returns a `MatrixFactorizationModel` object which contains the user and item factors in the form of RDDs of (id, factor) pairs. They can be accessed using `userFeatures` and `productFeatures`.

`scala> model.userFeatures
res6: org.apache.spark.rdd.RDD[(Int, Array[Double])] = users MapPartitionsRDD[211] at mapValues at ALS.scala:268`

Count the number of user features:

`scala> model.userFeatures.count
res7: Long = 943`

Count the product features:

`scala> model.productFeatures.count
res8: Long = 1682`



## Training the recommendation model using implicit data ##

This can be done using the `ALS.trainImplicit` method, which has the same parameters as `ALS.train` and one more:

* **alpha**: The baseline level of confidence weighting applied. A higher value of alpha tends to make the model more confident that missing data means no preference for the relevant user-item pair.

# Step 3: Using the trained model to make predictions #

There are 2 main approaches:

* **user-based**: Ratings of similar users on items are used to compute recommendations for a user.
* **item-based**: Similarity of items the user has rated is used to compute recommendations for a user.

We are using the MF approach: dot product between user-factor vector and the item-factor vector.

MLlib's recommendation model is based on Matrix Factorization.

## Step 3.1: User-based approach using `MatrixFactorizationModel` ##

Using the `predict()` method of `MatrixFactorizationModel` we will compute a predicted score for a given user-item combination.

**Make a prediction for a single user-item pair**

`scala> val predictedRating = model.predict(789, 123)
predictedRating: Double = 3.262303054887396`

*For user 789 and movie 123, the predicted movie rating is 3.26.*

## Generate the top-k recommended movies for a user ##

We will do this using the `recommendProducts()` method.

The top 10 recommended movies for user 841 will be computed as follows:

`scala> val userId = 841
userId: Int = 841`

`scala> val K = 10
K: Int = 10`

`scala> val topKRecs = model.recommendProducts(userId, K)`

`scala> println(topKRecs.mkString("\n"))
Rating(841,690,5.76429175593364)
Rating(841,408,5.6560791568740365)
Rating(841,606,5.583861023045172)
Rating(841,813,5.52045983893597)
Rating(841,699,5.512711368309603)
Rating(841,969,5.491543598823136)
Rating(841,89,5.105260461263954)
Rating(841,919,5.052739029383338)
Rating(841,313,5.022547228198433)
Rating(841,311,5.0135961696036135)`

## Get the recommended movie names ##

Load the movie data which is of the form:

**u.item**
* Information about the items (movies); this is a pipe-separated (`|`) list of the following fields. The last 19 fields are the genres: a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once. The movie ids are the ones used in the u.data data set.
          `movie id | movie title | release date | video release date |
          IMDb URL | unknown | Action | Adventure | Animation |
          Children's | Comedy | Crime | Documentary | Drama | Fantasy |
          Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
          Thriller | War | Western |`
          

**Save the u.item data in `movies`.**

`scala> val movies = sc.textFile("/Users/adarshnair/spark-2.0.1-bin-hadoop2.7/spark_projects/MovieLens/ml-100k/u.item")
movies: org.apache.spark.rdd.RDD[String] = /Users/adarshnair/spark-2.0.1-bin-hadoop2.7/spark_projects/MovieLens/ml-100k/u.item MapPartitionsRDD[218] at textFile at <console>:26`

**Get the `title` from `movies` and return it in the form `137 -> Big Night (1996)`:**

`scala> val titles = movies.map(line => line.split("\\|").take(2))
                           .map(array => (array(0).toInt,array(1)))
                           .collectAsMap()`

**Check the title of a random movie:**


`scala> titles(345)
res11: String = Deconstructing Harry (1997)`

**Check movies rated by user 841**

We will do this using `keyBy` to generate an RDD of key-value pairs from ratings, where the key is the user ID. We then use `lookup()` to find the movies that user 841 has rated.

`scala> val moviesForUser = ratings.keyBy(_.user).lookup(841)`
`moviesForUser: Seq[org.apache.spark.mllib.recommendation.Rating] = WrappedArray(Rating(841,892,3.0), Rating(841,331,5.0), Rating(841,748,4.0), Rating(841,288,3.0), Rating(841,270,4.0), Rating(841,353,1.0), Rating(841,286,5.0), Rating(841,272,4.0), Rating(841,754,4.0), Rating(841,323,3.0), Rating(841,307,5.0), Rating(841,300,4.0), Rating(841,358,1.0), Rating(841,271,4.0), Rating(841,344,3.0), Rating(841,873,4.0), Rating(841,678,4.0), Rating(841,333,4.0), Rating(841,313,5.0), Rating(841,306,4.0), Rating(841,689,5.0), Rating(841,326,4.0), Rating(841,315,4.0), Rating(841,1294,5.0), Rating(841,325,3.0), Rating(841,258,5.0), Rating(841,751,3.0), Rating(841,302,5.0), Rating(841,316,4.0), Rating(841,322,4.0), Rating(841,888,5.0))`

**Count the number of movies rated by user 841**

`scala> println(moviesForUser.size)
31`

**Get the names of the top 10 movies that user 841 has rated the highest**

We will use the `ratings` object which we defined as:

`scala> val ratings = rawRatings.map { case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }`

earlier, and the `moviesForUser` object which has all the movies rated by the user, sort them, take the top 10 values, map them to the format (movie, rating) and print them.

`scala> moviesForUser.sortBy(-_.rating).take(10).map(rating => (titles(rating.product), rating.rating)).foreach(println)`

`(Edge, The (1997),5.0)
(English Patient, The (1996),5.0)
(Devil's Advocate, The (1997),5.0)
(Titanic (1997),5.0)
(Jackal, The (1997),5.0)
(Ayn Rand: A Sense of Life (1997),5.0)
(Contact (1997),5.0)
(L.A. Confidential (1997),5.0)
(One Night Stand (1997),5.0)
(Saint, The (1997),4.0)`

**Get the names of the top 10 movies that were recommended by our algorithm to the user**

We will do this using the `topKRecs` object we defined earlier, which holds our movie recommendations. We map them to the format `(movie, rating)` and print them.


`scala> topKRecs.map(rating => (titles(rating.product), rating.rating)).foreach(println)`

`(Seven Years in Tibet (1997),5.76429175593364)
(Close Shave, A (1995),5.6560791568740365)
(All About Eve (1950),5.583861023045172)
(Celluloid Closet, The (1995),5.52045983893597)
(Little Women (1994),5.512711368309603)
(Winnie the Pooh and the Blustery Day (1968),5.491543598823136)
(Blade Runner (1982),5.105260461263954)
(City of Lost Children, The (1995),5.052739029383338)
(Titanic (1997),5.022547228198433)
(Wings of the Dove, The (1997),5.0135961696036135)`


Using these two lists of movies, we can make our own judgement as to whether user 841 will like our recommended list.

## Step 3.2: Item based approach - finding the movies most similar to a certain movie. ##

Item based similarity is computed by comparing the vector representation of two items using a similarity measure.

The similarity measures are as follows:

* **Pearson Correlation, Cosine similarity**: for real valued vectors
* **Jaccard similarity**: for binary vectors

We have to compare the feature vector of a chosen item with other items using a similarity metric. We have to create a feature vector object out of the feature vectors which are in the Array[Double] form. 

Here I had to do the following to run the jblas library:

* brew install maven
* git clone https://github.com/mikiobraun/jblas.git
* cd jblas
* mvn install

Then, I had to relaunch the Spark shell with the --packages parameter and the coordinates of the jblas package as follows:

`Adarshs-MacBook-Pro:spark-2.0.1-bin-hadoop2.7 adarshnair$ ./bin/spark-shell --driver-memory 4g --packages org.jblas:jblas:1.2.4-SNAPSHOT`

**Import the jblas library:**

`scala> import org.jblas.DoubleMatrix`

**Define a jblas.DoubleMatrix object**

`scala> val aMatrix = new DoubleMatrix(Array(1.0, 2.0, 3.0))
aMatrix: org.jblas.DoubleMatrix = [1.000000; 2.000000; 3.000000]`

**Define the cosine similarity function**

The cosine similarity is the cosine of the angle between two vectors in an n-dimensional space. It is computed in two steps:

* Calculate the dot product between the vectors
* Divide the result by the product of the lengths (L2 norms) of the two vectors

In other words, cosine similarity is the normalized dot product.

`scala> def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = { vec1.dot(vec2) / (vec1.norm2() * vec2.norm2()) }
cosineSimilarity: (vec1: org.jblas.DoubleMatrix, vec2: org.jblas.DoubleMatrix)Double`

The cosine similarity ranges from -1 to 1: 1 means the vectors point in the same direction (most similar), -1 means they point in opposite directions (most dissimilar), and 0 means they are orthogonal (unrelated).
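
The three reference points of that range can be checked with a dependency-free version of the same formula. This is a sketch on plain arrays rather than jblas matrices; the names `norm2` and `cosine` and the sample vectors are illustrative.

```scala
// Euclidean (L2) norm of a vector.
def norm2(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

// Cosine similarity: dot product divided by the product of the two norms.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  val dotProduct = a.zip(b).map { case (x, y) => x * y }.sum
  dotProduct / (norm2(a) * norm2(b))
}

val sameDirection = cosine(Array(1.0, 2.0), Array(2.0, 4.0))   // parallel vectors, close to 1
val orthogonal    = cosine(Array(1.0, 0.0), Array(0.0, 3.0))   // perpendicular vectors, 0
val opposite      = cosine(Array(1.0, 2.0), Array(-1.0, -2.0)) // opposite direction, close to -1
```

Note that the result depends only on direction, not on vector length, which is why scaling a factor vector does not change its similarity to other items.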

**Apply cosine similarity to a movie**

`scala> val itemId = 404
itemId: Int = 404`

**Collect itemFactor of movie using lookup()**

`scala> val itemFactor = model.productFeatures.lookup(itemId).head
itemFactor: Array[Double] = Array(1.456715703010559, 0.09094125032424927, -1.323241114616394, -0.44890183210372925, 1.232578992843628, -0.15153035521507263, 0.28330186009407043, 0.14474798738956451, -0.1086246594786644, 1.0265341997146606, -0.8233844041824341, -0.28477057814598083, 0.20904113352298737, 0.5610338449478149, -0.17029500007629395, 0.5896046161651611, 0.45187196135520935, -0.36535635590553284, -1.1037601232528687, -1.2145742177963257, 1.314957618713379, -0.4275266230106354, -0.7305580377578735, -0.49946409463882446, 0.46623849868774414, 1.051638126373291, -0.4805344343185425, 0.4812560975551605, 0.47500789165496826, -0.14856795966625214, -0.6726163029670715, -0.2977805435657501, -1.1333670616149902, -0.9176173806190491, -0.11220163106918335, 0.17123645544052124, -0.365033537...
`

**Collect itemVector of same movie**

`scala> val itemVector = new DoubleMatrix(itemFactor)
itemVector: org.jblas.DoubleMatrix = [1.456716; 0.090941; -1.323241; -0.448902; 1.232579; -0.151530; 0.283302; 0.144748; -0.108625; 1.026534; -0.823384; -0.284771; 0.209041; 0.561034; -0.170295; 0.589605; 0.451872; -0.365356; -1.103760; -1.214574; 1.314958; -0.427527; -0.730558; -0.499464; 0.466238; 1.051638; -0.480534; 0.481256; 0.475008; -0.148568; -0.672616; -0.297781; -1.133367; -0.917617; -0.112202; 0.171236; -0.365034; 0.033380; -1.045698; -1.550930; -1.280466; 0.309942; 0.901337; 0.098294; 0.651365; 0.500232; -0.596415; -0.1`

**Find cosineSimilarity of the movie with itself using itemVector**

`scala> cosineSimilarity(itemVector, itemVector)
res4: Double = 1.0`

As expected we get a value of 1.


**Apply the similarity metric to all movies to find their scores when compared to the movie with id 404**

Store the (movie id, score) pairs in the `sims` variable.

`scala> val sims = model.productFeatures.map{ case (id, factor) => 
     | val factorVector = new DoubleMatrix(factor)
     | val sim = cosineSimilarity(factorVector, itemVector)
     | (id, sim)
     | }
sims: org.apache.spark.rdd.RDD[(Int, Double)] = MapPartitionsRDD[222] at map at <console>:43`

**Find the top 10 most similar movies compared to the movie with id 404.**

We use `top` to compute the top-k results instead of `collect` which returns all the data to the driver.
We use `Ordering` to sort the data (item id, similarity score) pairs in the sims RDD.

`scala> val sortedSims = sims.top(K)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity })
sortedSims: Array[(Int, Double)] = Array((404,1.0), (99,0.9044948670945991), (501,0.896898212252714), (482,0.8843526451683614), (378,0.8750775812837056), (71,0.8683390197894713), (480,0.8651580922076123), (498,0.8644220348817375), (655,0.8609436553440514), (97,0.8601814581441996))`

**Print the top 10 most similar movies when compared to our chosen movie with id 404.**

`scala> println(sortedSims.mkString("\n"))
(404,1.0)
(99,0.9044948670945991)
(501,0.896898212252714)
(482,0.8843526451683614)
(378,0.8750775812837056)
(71,0.8683390197894713)
(480,0.8651580922076123)
(498,0.8644220348817375)
(655,0.8609436553440514)
(97,0.8601814581441996)`

**Check the name of our movie**

`scala> println(titles(itemId))
Pinocchio (1940)`

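**Get the 10 movies most similar to Pinocchio in a sorted list**

The most similar movie is always Pinocchio itself (similarity 1.0), so we request `K + 1` results and drop the first one in the next step.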

`scala> val sortedSims2 = sims.top(K + 1)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity })
sortedSims2: Array[(Int, Double)] = Array((404,1.0), (99,0.9044948670945991), (501,0.896898212252714), (482,0.8843526451683614), (378,0.8750775812837056), (71,0.8683390197894713), (480,0.8651580922076123), (498,0.8644220348817375), (655,0.8609436553440514), (97,0.8601814581441996), (484,0.8601798273270549))`

**Get the names of the 10 movies most similar to Pinocchio**

`scala> sortedSims2.slice(1, 11).map{ case (id, sim) => (titles(id), sim) }.mkString("\n")
res7: String =
(Snow White and the Seven Dwarfs (1937),0.9044948670945991)
(Dumbo (1941),0.896898212252714)
(Some Like It Hot (1959),0.8843526451683614)
(Miracle on 34th Street (1994),0.8750775812837056)
(Lion King, The (1994),0.8683390197894713)
(North by Northwest (1959),0.8651580922076123)
(African Queen, The (1951),0.8644220348817375)
(Stand by Me (1986),0.8609436553440514)
(Dances with Wolves (1990),0.8601814581441996)
(Maltese Falcon, The (1941),0.8601798273270549)`
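
An alternative to relying on the query movie ranking first is to filter it out by ID before taking the top results. The plain Scala sketch below shows the idea on a local collection; the similarity values and the helper name `mostSimilar` are hypothetical, and on an RDD the same effect could be had by filtering before calling `top`.

```scala
// Hypothetical (itemId, similarity) pairs; the query item itself scores 1.0.
val exampleSims: Seq[(Int, Double)] = Seq(
  (404, 1.0), (99, 0.90), (7, 0.12), (501, 0.89), (42, -0.30)
)

// Return the k items most similar to queryId, leaving the query item out.
def mostSimilar(sims: Seq[(Int, Double)], queryId: Int, k: Int): Seq[(Int, Double)] =
  sims
    .filter { case (id, _) => id != queryId }
    .sortBy { case (_, sim) => -sim }
    .take(k)

val top2 = mostSimilar(exampleSims, 404, 2)
```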


# Step 4: Evaluate performance of recommendation engine #

We use 2 metrics:

* **MSE, RMSE**
* **Mean Average Precision at K**: The APK score will be higher if the result documents are both relevant and the relevant documents are presented higher in the results. This metric is better suited for recommender systems than MSE and RMSE. In APK, each user is the equivalent of a query, the set of top-k items is the document result set. The relevant documents are the set of items that the user interacted with. Hence APK measures how good our model is at predicting items that a user will find relevant and will choose to interact with.

## Step 4.1 MSE, RMSE ##

### Check actual rating given by user 841 for a movie ###

**Load all the movies rated by user 841 into `moviesForUser`**

`scala> val moviesForUser = ratings.keyBy(_.user).lookup(841)
moviesForUser: Seq[org.apache.spark.mllib.recommendation.Rating] = WrappedArray(Rating(841,892,3.0), Rating(841,331,5.0), Rating(841,748,4.0), Rating(841,288,3.0), Rating(841,270,4.0), Rating(841,353,1.0), Rating(841,286,5.0), Rating(841,272,4.0), Rating(841,754,4.0), Rating(841,323,3.0), Rating(841,307,5.0), Rating(841,300,4.0), Rating(841,358,1.0), Rating(841,271,4.0), Rating(841,344,3.0), Rating(841,873,4.0), Rating(841,678,4.0), Rating(841,333,4.0), Rating(841,313,5.0), Rating(841,306,4.0), Rating(841,689,5.0), Rating(841,326,4.0), Rating(841,315,4.0), Rating(841,1294,5.0), Rating(841,325,3.0), Rating(841,258,5.0), Rating(841,751,3.0), Rating(841,302,5.0), Rating(841,316,4.0), Rating(841,322,4.0), Rating(841,888,5.0))`

**Check the rating by the user for the first movie the user rated**

`scala> val actualRating = moviesForUser.take(1)(0)
actualRating: org.apache.spark.mllib.recommendation.Rating = Rating(841,892,3.0)`

The user gave a rating of 3.0

**Check the predicted rating for that movie by the same user**

`scala> val predictedRating = model.predict(841, actualRating.product)
predictedRating: Double = 2.9593273900130113`

### Computing MSE, RMSE across entire dataset ###

We need the squared error of each (user, movie, actualRating, predictedRating) entry, sum them up and then divide by total number of ratings. 

**Extract user and product IDs from the `ratings` RDD**

`scala> val usersProducts = ratings.map{ case Rating(user, product, rating)  => (user, product)}
usersProducts: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[228] at map at <console>:34`

**Make predictions for each user-item pair using `model.predict()`:**

`scala> val predictions = model.predict(usersProducts).map{ case Rating(user, product, rating) => ((user, product), rating) }
predictions: org.apache.spark.rdd.RDD[((Int, Int), Double)] = MapPartitionsRDD[237] at map at <console>:38`

**Extract actualRating and map the `ratings` RDD so that the user-item pair is the key and the actualRating is the value. We will get two RDDs with the same form of key, so we join them together to create a new RDD with the actual and predicted ratings for each user-item combination**

`scala> val ratingsAndPredictions = ratings.map{case Rating(user, product, rating) => ((user, product), rating)}.join(predictions)
ratingsAndPredictions: org.apache.spark.rdd.RDD[((Int, Int), (Double, Double))] = MapPartitionsRDD[241] at join at <console>:40`


**Formatted code for clarity**


`val usersProducts = ratings.map{ case Rating(user, product, rating)  => (user, product)}`

*usersProducts gives (user, product). This is fed into model.predict() so predictions can be made on all movies. The user-item is the key and rating is the value.*


`val predictions = model.predict(usersProducts)
                        .map{ case Rating(user, product, rating) => ((user, product), rating)}`

*predictions is joined with the actual ratings because both have user-item as the key.*

`val ratingsAndPredictions = ratings.map{ case Rating(user, product, rating) => ((user, product), rating)}
                                   .join(predictions)`

**Compute MSE**

`scala> import org.apache.spark.mllib.evaluation.RegressionMetrics`

`scala> val predictedAndTrue = ratingsAndPredictions.map { case ((user, product), (actual, predicted)) => (predicted, actual) }
predictedAndTrue: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[242] at map at <console>:42`

`scala> val regressionMetrics = new RegressionMetrics(predictedAndTrue)
regressionMetrics: org.apache.spark.mllib.evaluation.RegressionMetrics = org.apache.spark.mllib.evaluation.RegressionMetrics@6a45febd`

`scala> println("Mean Squared Error = " + regressionMetrics.meanSquaredError)
Mean Squared Error = 0.08473321421633566`

**Compute RMSE**

RMSE is the square root of the MSE. It is comparable to the standard deviation of the differences between the predicted and actual ratings (the two coincide when the mean difference is zero).

`scala> println("Root Mean Squared Error = " + regressionMetrics.rootMeanSquaredError)
Root Mean Squared Error = 0.2910897013230383`
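
To see what these two numbers measure, both metrics can be computed by hand on a few pairs. The sketch below is plain Scala with made-up (actual, predicted) values; it illustrates the definitions and is not how `RegressionMetrics` is implemented.

```scala
// Hypothetical (actual, predicted) rating pairs.
val pairs: Seq[(Double, Double)] = Seq((3.0, 2.5), (5.0, 4.0), (1.0, 2.0), (4.0, 4.5))

// MSE: mean of the squared differences between actual and predicted ratings.
val mse = pairs.map { case (actual, predicted) => math.pow(actual - predicted, 2) }.sum / pairs.size

// RMSE: square root of the MSE, expressed in the same units as the ratings.
val rmse = math.sqrt(mse)
```

Here the squared differences are 0.25, 1.0, 1.0 and 0.25, so the MSE is 0.625 and the RMSE is roughly 0.79 rating points.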

## Step 4.2 Computing Mean Average Precision at K ##

**Load the library**

`scala> import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.mllib.evaluation.RankingMetrics`

**Get all the movie IDs rated by user 841**

`scala> val actualMovies = moviesForUser.map(_.product)
actualMovies: Seq[Int] = ArrayBuffer(892, 331, 748, 288, 270, 353, 286, 272, 754, 323, 307, 300, 358, 271, 344, 873, 678, 333, 313, 306, 689, 326, 315, 1294, 325, 258, 751, 302, 316, 322, 888)`

**Use the top 10 recommended movies to compute the APK score**

`scala> val predictedMovies = topKRecs.map(_.product)
predictedMovies: Array[Int] = Array(185, 23, 187, 180, 175, 56, 127, 276, 443, 89)`

We need to compute the APK for each user and average them out to find the MAPK.
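
A minimal sketch of the per-user score and its average, in plain Scala. The function names are illustrative, and returning 0.0 for a user with no relevant items is a convention chosen here, not something prescribed by the text.

```scala
// Average precision at k for one user: precision is accumulated at every rank
// where a relevant item appears, then normalized by the best achievable hit count.
def avgPrecisionK(actual: Seq[Int], predicted: Seq[Int], k: Int): Double = {
  val relevant = actual.toSet
  var hits = 0
  var score = 0.0
  for ((item, idx) <- predicted.take(k).zipWithIndex) {
    if (relevant.contains(item)) {
      hits += 1
      score += hits.toDouble / (idx + 1) // precision at this cut-off
    }
  }
  if (relevant.isEmpty) 0.0 else score / math.min(relevant.size, k)
}

// MAPK: mean of the per-user APK scores over (actual, predicted) pairs.
def meanAvgPrecisionK(perUser: Seq[(Seq[Int], Seq[Int])], k: Int): Double =
  perUser.map { case (actual, predicted) => avgPrecisionK(actual, predicted, k) }.sum / perUser.size

// Relevant items 1, 2, 3; the recommendations hit at ranks 1 and 3.
val apk = avgPrecisionK(Seq(1, 2, 3), Seq(1, 4, 2), 3)
```

In the example, the hits contribute precisions of 1/1 and 2/3, and dividing their sum by 3 gives an APK of 5/9. Relevant items that appear earlier in the list contribute larger precisions, which is how the metric rewards good ordering.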

### Compute recommendations for all users ###

**Get the itemFactors and itemMatrix**

`scala> val itemFactors = model.productFeatures.map { case (id, factor) => factor }.collect()
itemFactors: Array[Array[Double]] = Array(Array(0.9928126931190491, 0.5980018377304077, -0.9922769069671631, -0.5807792544364929, 0.6617379784584045, -0.6692140102386475, 0.29881414771080017, -0.6581135392189026, -0.51359623670578, 0.4164090156555176, -0.4938781261444092, 0.21485476195812225, -0.30804479122161865, 1.064658284187317, 0.2200748324394226, 1.1584419012069702, 0.21837905049324036, 0.04997337982058525, -1.7830859422683716, -0.7733256816864014, 1.1847765445709229, -0.21797437965869904, -0.6154626607894897, -0.6413697600364685, -0.5309175848960876, 0.7262206077575684, 0.03381478786468506, 1.2232928276062012, -0.16844823956489563, -0.14984336495399475, -0.36415931582450867, -0.2926683723926544, -1.5741336345672607, -1.30470609664917, 0.11398006975650787, -0.5297638773918152, -0....`

The itemMatrix is the DoubleMatrix of itemFactors.

`scala> val itemMatrix = new DoubleMatrix(itemFactors)
itemMatrix: org.jblas.DoubleMatrix = [0.992813, 0.598002, -0.992277, -0.580779, 0.661738, -0.669214, 0.298814, -0.658114, -0.513596, 0.416409, -0.493878, 0.214855, -0.308045, 1.064658, 0.220075, 1.158442, 0.218379, 0.049973, -1.783086, -0.773326, 1.184777, -0.217974, -0.615463, -0.641370, -0.530918, 0.726221, 0.033815, 1.223293, -0.168448, -0.149843, -0.364159, -0.292668, -1.574134, -1.304706, 0.113980, -0.529764, -0.930787, -0.020343, -0.519978, -0.950019, -1.123361, 0.610111, 1.118598, 0.011164, 0.602885, 0.326406, -0.408680, -0.165954, -0.687462, 0.268771; 1.297704, -0.220837, -0.881212, -1.140893, 0.670369, -0.623236, 0.553994, 0.824082, 0.162887, 0.677236, -0.329635, 0.008709, 0.466816, 1.550000, -0.548380, 1.163319, -0.977304, 0.286353, -0.828565, -0.754250, 1.354773, 0.029863, -0...`


We get an itemMatrix with the following dimensions.

`scala> println(itemMatrix.rows, itemMatrix.columns)
(1682,50)`

There are 1682 movies (rows) and a factor dimension of 50 (columns).

**Broadcast the item matrix to all worker nodes**

`scala> val imBroadcast = sc.broadcast(itemMatrix)
imBroadcast: org.apache.spark.broadcast.Broadcast[org.jblas.DoubleMatrix] = Broadcast(54)`

**Compute recommendations for each user**

* Apply map() to each user factor, within which we perform a matrix multiplication between the movie-factor matrix and the user-factor vector.
* The result is a vector of length 1682 (the total number of movies) with the predicted rating for each movie.

`scala> val allRecs = model.userFeatures.map{ case (userId, array) => 
     | val userVector = new DoubleMatrix(array)
     | val scores = imBroadcast.value.mmul(userVector)
     | val sortedWithId = scores.data.zipWithIndex.sortBy(-_._1)
     | val recommendedIds = sortedWithId.map(_._2 + 1).toSeq
     | (userId, recommendedIds)
     | }
allRecs: org.apache.spark.rdd.RDD[(Int, Seq[Int])] = MapPartitionsRDD[245] at map at <console>:45`

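We now have an RDD that contains a list of movie IDs for each user ID, sorted in order of predicted rating. Note that `_._2 + 1` turns a row index into a movie ID, which is only correct if row i of `itemMatrix` holds the factors of movie i + 1; `collect()` on `productFeatures` does not by itself guarantee that ordering.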
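
The scoring and ranking step can be sketched on a local collection in plain Scala. In this variant each factor vector stays paired with its item ID, so the ranking does not depend on row position; the IDs, factor values, and the name `rankItems` are all hypothetical.

```scala
// Hypothetical item factors keyed by item ID (k = 2), and one user's factor vector.
val factorsById: Seq[(Int, Array[Double])] = Seq(
  (10, Array(0.5, 1.0)),
  (20, Array(2.0, 0.1)),
  (30, Array(1.0, 1.0))
)
val oneUser = Array(1.0, 2.0)

// Score every item with a dot product, then return the IDs by descending score.
def rankItems(user: Array[Double], items: Seq[(Int, Array[Double])]): Seq[Int] =
  items
    .map { case (id, f) => (id, f.zip(user).map { case (a, b) => a * b }.sum) }
    .sortBy { case (_, score) => -score }
    .map { case (id, _) => id }

val ranked = rankItems(oneUser, factorsById)
```

The scores here are 2.5, 2.2 and 3.0 for items 10, 20 and 30, so the ranking is 30, 10, 20. The matrix multiplication in the Spark version computes the same dot products for all items at once.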

**Get all the movie ids per user, grouped by user id**

`scala> val userMovies = ratings.map{ case Rating(user, product, rating) => (user, product) }.groupBy(_._1)
userMovies: org.apache.spark.rdd.RDD[(Int, Iterable[(Int, Int)])] = ShuffledRDD[248] at groupBy at <console>:35`

**Join the two RDDs on the user ID key. For each user we have the list of actual and predicted movie IDs that we pass into our MAPK error metric**

`scala> val predictedAndTrueForRanking = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) => 
     | val actual = actualWithIds.map(_._2)
     | (predicted.toArray, actual.toArray)
     | }
predictedAndTrueForRanking: org.apache.spark.rdd.RDD[(Array[Int], Array[Int])] = MapPartitionsRDD[252] at map at <console>:49`

Pass `predictedAndTrueForRanking` as the argument to `RankingMetrics()`.

`scala> val rankingMetrics = new RankingMetrics(predictedAndTrueForRanking)
rankingMetrics: org.apache.spark.mllib.evaluation.RankingMetrics[Int] = org.apache.spark.mllib.evaluation.RankingMetrics@80e095`

**Compute MAP**

`scala> println("Mean Average Precision = " + rankingMetrics.meanAveragePrecision)
Mean Average Precision = 0.07284042040123272`

# Conclusion #

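Our recommendation engine achieves the following scores. All three were computed on the same ratings the model was trained on, so they are likely optimistic compared to performance on held-out data: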

* MSE: 0.08
* RMSE: 0.29
* MAP: 0.0728