The two most common types of recommender systems are Content-Based and Collaborative Filtering (CF).

Collaborative filtering produces recommendations based on knowledge of users’ attitudes toward items; that is, it uses the "wisdom of the crowd" to recommend items.

Content-based recommender systems focus on the attributes of the items and give you recommendations based on the similarity between them.
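
As a rough illustration of the content-based idea, the sketch below ranks items by the cosine similarity of their attribute vectors. The item names and genre flags are hypothetical and are not taken from the dataset used later in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item with no attributes is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical items described by genre flags: (action, comedy, romance)
items = {
    "movie_x": (1, 0, 0),
    "movie_y": (1, 1, 0),
    "movie_z": (0, 0, 1),
}

def most_similar(target, catalog):
    """Return the other items sorted by similarity to `target`, best first."""
    scores = [
        (name, cosine_similarity(catalog[target], vec))
        for name, vec in catalog.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = most_similar("movie_x", items)
print(ranked)
```

Someone who liked `movie_x` would be shown `movie_y` first, because the two share an attribute, while `movie_z` shares none.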

In general, collaborative filtering (CF) is more commonly used than content-based systems because it usually gives better results and is relatively easy to understand (from an overall implementation perspective).

A CF algorithm can also do feature learning on its own, which means it can learn for itself which features to use.

Collaborative filtering techniques aim to fill in the missing entries of a user-item association matrix.

spark.ml currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries.

spark.ml uses the alternating least squares (ALS) algorithm to learn these latent factors.

Your data needs to be in a specific format to work with Spark’s ALS recommendation algorithm: one row per rating, with a numeric user ID column, a numeric item ID column, and a rating column.

ALS is basically a matrix factorization approach to implementing a recommendation algorithm: you decompose your large user/item matrix into lower-dimensional user factors and item factors.
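
To make the alternating idea concrete, here is a minimal sketch of ALS with a single latent factor per user and per item, written in plain Python. It is not Spark's implementation: the function name `als_rank1`, the toy ratings, and the plain ridge penalty `reg` are assumptions made for illustration only.

```python
import random

def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20, seed=0):
    """Fit one latent factor per user and per item by alternating least squares.

    ratings: dict mapping (user, item) -> observed rating.
    Returns (user_factors, item_factors) as lists of floats.
    """
    rng = random.Random(seed)
    u = [rng.uniform(0.5, 1.5) for _ in range(n_users)]
    v = [rng.uniform(0.5, 1.5) for _ in range(n_items)]

    for _ in range(iterations):
        # Hold item factors fixed; each user's factor has a closed-form
        # regularized least-squares solution over that user's observed ratings.
        for i in range(n_users):
            num = sum(r * v[item] for (user, item), r in ratings.items() if user == i)
            den = reg + sum(v[item] ** 2 for (user, item) in ratings if user == i)
            u[i] = num / den
        # Hold user factors fixed; solve for each item's factor the same way.
        for j in range(n_items):
            num = sum(r * u[user] for (user, item), r in ratings.items() if item == j)
            den = reg + sum(u[user] ** 2 for (user, item) in ratings if item == j)
            v[j] = num / den
    return u, v

# Toy ratings generated from user factors (2, 1, 3) and item factors (1, 2, 1.5).
ratings = {
    (0, 0): 2.0, (0, 1): 4.0, (0, 2): 3.0,
    (1, 0): 1.0, (1, 1): 2.0, (1, 2): 1.5,
    (2, 0): 3.0, (2, 1): 6.0,  # entry (2, 2) is deliberately left unobserved
}

user_f, item_f = als_rank1(ratings, n_users=3, n_items=3, reg=0.01)
predicted = user_f[2] * item_f[2]  # fill in the missing entry
print(round(predicted, 2))
```

The product of the learned factors for the unobserved entry should come out close to 4.5, the value the generating factors imply. Real ALS implementations use several latent factors per user and item, so each update becomes a small linear system rather than a single division.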

To fully understand this model, you need a strong background in linear algebra.

Check out the various resource links for more detail on ALS and how it works.

The intuition behind a recommender system is the following:
Imagine we have three customers: 1, 2, and 3.
We also have some movies: A, B, and C.
Customers 1 and 2 really enjoy movies A and B and rate them five out of five stars.

Customers 1 and 2 dislike movie C and give it a one-star rating.
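
The scenario above can be sketched as a tiny user-based collaborative filter. Customer 3's rating is not given in the text, so the five-star rating for movie A below is an assumption made purely for illustration, as are the `similarity` and `predict` helpers.

```python
ratings = {
    1: {"A": 5, "B": 5, "C": 1},
    2: {"A": 5, "B": 5, "C": 1},
    3: {"A": 5},  # assumed for illustration; not stated in the text
}

def similarity(a, b):
    """Agreement on co-rated movies, in [0, 1]; 1 means identical ratings."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_gap = sum(abs(a[m] - b[m]) for m in shared) / len(shared)
    return 1.0 - mean_gap / 4.0  # 4 is the largest possible gap on a 1-5 scale

def predict(user, movie):
    """Similarity-weighted average of the other customers' ratings for `movie`."""
    weighted, total = 0.0, 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        weight = similarity(ratings[user], their_ratings)
        weighted += weight * their_ratings[movie]
        total += weight
    return weighted / total if total else None

preds = {movie: predict(3, movie) for movie in ("B", "C")}
print(preds)
```

Under that assumption, customer 3 agrees with customers 1 and 2 on movie A, so their ratings carry over: movie B gets a high predicted rating and movie C a low one, which makes B the movie to recommend.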




In [None]:
!pip install pyspark

Successfully installed py4j-0.10.9 pyspark-3.0.1


In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Start (or reuse) a Spark session for this notebook
spark = SparkSession.builder.appName('rec').getOrCreate()

# Load the ratings; the CSV has a header row, and column types are inferred
data = spark.read.csv('movielens_ratings.csv', inferSchema=True, header=True)

In [None]:
data.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      2|   3.0|     0|
|      3|   1.0|     0|
|      5|   2.0|     0|
|      9|   4.0|     0|
|     11|   1.0|     0|
|     12|   2.0|     0|
|     15|   1.0|     0|
|     17|   1.0|     0|
|     19|   1.0|     0|
|     21|   1.0|     0|
|     23|   1.0|     0|
|     26|   3.0|     0|
|     27|   1.0|     0|
|     28|   1.0|     0|
|     29|   1.0|     0|
|     30|   1.0|     0|
|     31|   1.0|     0|
|     34|   1.0|     0|
|     37|   1.0|     0|
|     41|   2.0|     0|
+-------+------+------+
only showing top 20 rows



In [None]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



In [None]:
# Hold out roughly 20% of the ratings for testing
training, test = data.randomSplit([0.8, 0.2])

In [None]:
# Alternating Least Squares (ALS) matrix factorization
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating')

In [None]:
model = als.fit(training)

In [None]:
predictions = model.transform(test)

In [None]:
predictions.show()

+-------+------+------+-----------+
|movieId|rating|userId| prediction|
+-------+------+------+-----------+
|     31|   3.0|     8| -1.2003973|
|     31|   2.0|    25|  2.8985767|
|     31|   1.0|    18|  0.2463209|
|     85|   1.0|    26|  3.1671884|
|     85|   1.0|    12| -1.2873697|
|     85|   3.0|     1|  3.6274376|
|     85|   1.0|    13| 0.52892125|
|     85|   3.0|     6| -4.1948276|
|     85|   5.0|    16|-0.16278215|
|     85|   1.0|    23|  -0.806761|
|     85|   1.0|    25|-0.60658103|
|     85|   3.0|    21|  0.0882335|
|     65|   2.0|     5| 0.64910364|
|     65|   1.0|    19|  2.8407307|
|     65|   2.0|    15|  3.3676324|
|     65|   1.0|     4| 0.65699005|
|     65|   1.0|    24|  1.9148284|
|     65|   1.0|     2|-0.71101534|
|     53|   3.0|    20|  2.5694869|
|     53|   2.0|    19|  2.7433527|
+-------+------+------+-----------+
only showing top 20 rows
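
Several of the predictions above fall outside the 1.0 to 5.0 range that `data.describe()` reports for `rating`. One simple post-processing option, sketched here in plain Python as a suggestion rather than as part of the original workflow, is to clip each prediction to the rating scale. The helper name `clip_to_scale` is hypothetical; the sample values are copied from the table above.

```python
def clip_to_scale(prediction, low=1.0, high=5.0):
    """Pull a raw prediction back into the valid rating range."""
    return max(low, min(high, prediction))

# A few raw predictions copied from the output above
raw = [-1.2003973, 2.8985767, 0.2463209, 3.1671884, -4.1948276]
clipped = [clip_to_scale(p) for p in raw]
print(clipped)
```

Because every true rating lies inside the scale, clipping never moves a prediction further from its label, so it cannot increase the squared error.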



In [None]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')

In [None]:
rmse = evaluator.evaluate(predictions)

In [None]:
print('RMSE')
print(rmse)

RMSE
2.104467630624879
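
For reference, the metric that `RegressionEvaluator` reports with `metricName='rmse'` is the root mean squared error. The plain-Python sketch below shows the calculation on the first three rows of the predictions table; the function is a stand-in for illustration, not Spark's code.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between true ratings and predicted ratings."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# First three (rating, prediction) pairs from the predictions output
labels = [3.0, 2.0, 1.0]
preds = [-1.2003973, 2.8985767, 0.2463209]

sample_rmse = rmse(labels, preds)
print(sample_rmse)
```

RMSE is in the same units as the ratings. Since the `rating` column has a standard deviation of about 1.19, an RMSE near 2.1 suggests this model is not yet doing better than always predicting the average rating, which is not surprising for a small dataset and an untuned model.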


In [None]:
single_user = test.filter(test['userId']==11).select(['movieId', 'userId'])

In [None]:
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|      0|    11|
|     10|    11|
|     16|    11|
|     23|    11|
|     32|    11|
|     38|    11|
|     61|    11|
|     62|    11|
|     72|    11|
|     76|    11|
|     77|    11|
|     78|    11|
|     82|    11|
|     97|    11|
|     99|    11|
+-------+------+



In [None]:
recommendations = model.transform(single_user)

In [None]:
recommendations.orderBy('prediction', ascending=False).show()

+-------+------+-----------+
|movieId|userId| prediction|
+-------+------+-----------+
|     32|    11|   2.487289|
|     61|    11|  1.7235298|
|     23|    11|  1.6220003|
|     10|    11|  1.4214804|
|     38|    11|  1.1953259|
|     99|    11|  1.1833932|
|     97|    11|    1.02893|
|      0|    11|  0.9721139|
|     77|    11|  0.6490396|
|     72|    11| 0.34924883|
|     16|    11|   0.325314|
|     78|    11| 0.30926734|
|     82|    11|-0.20031986|
|     62|    11| -1.8682793|
|     76|    11|  -2.234673|
+-------+------+-----------+

