# Simple recommender system on the MovieLens dataset

This notebook implements a simple recommender system on the MovieLens dataset, using Spark MLlib to predict movie ratings. The algorithm is alternating least squares (ALS), and the model is evaluated by its root-mean-square error (RMSE).

In [2]:
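import findspark
# Point findspark at the local Spark installation (path is specific to this machine)
findspark.init('/home/ubuntu/spark-2.2.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
# Create (or reuse) the Spark session for this notebook
spark = SparkSession.builder.appName('Basics').getOrCreate()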

In [3]:
# Load the ratings file; the first row holds column names and column types are inferred
df = spark.read.csv('ratings.csv', inferSchema=True, header=True)

In [4]:
df.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      2|   3.5|
|     1|     29|   3.5|
|     1|     32|   3.5|
|     1|     47|   3.5|
|     1|     50|   3.5|
|     1|    112|   3.5|
|     1|    151|   4.0|
|     1|    223|   4.0|
|     1|    253|   4.0|
|     1|    260|   4.0|
|     1|    293|   4.0|
|     1|    296|   4.0|
|     1|    318|   4.0|
|     1|    337|   3.5|
|     1|    367|   3.5|
|     1|    541|   4.0|
|     1|    589|   3.5|
|     1|    593|   3.5|
|     1|    653|   3.0|
|     1|    919|   3.5|
+------+-------+------+
only showing top 20 rows



In [5]:
df.describe().show()

+-------+------------------+------------------+------------------+
|summary|            userId|           movieId|            rating|
+-------+------------------+------------------+------------------+
|  count|           1048575|           1048575|           1048575|
|   mean|  3527.08612259495| 8648.988281238824|3.5292716305462175|
| stddev|2018.4244255314572|19100.143880088344|1.0519187535878674|
|    min|                 1|                 1|               0.5|
|    max|              7120|            130642|               5.0|
+-------+------------------+------------------+------------------+



Implementing collaborative filtering with the alternating least squares (ALS) method
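
For reference, the textbook form of the objective that ALS minimizes is sketched below. The symbols are assumptions introduced here, not taken from the notebook: $r_{ui}$ is the rating user $u$ gave movie $i$, $\Omega$ is the set of observed (user, movie) pairs, $\mathbf{x}_u$ and $\mathbf{y}_i$ are the latent factor vectors, and $\lambda$ is the regularization strength. Spark's implementation may scale the regularization term differently.

$$
\min_{\mathbf{x},\,\mathbf{y}} \sum_{(u,i)\in\Omega} \left(r_{ui} - \mathbf{x}_u^{\top}\mathbf{y}_i\right)^2 + \lambda\left(\sum_u \lVert\mathbf{x}_u\rVert^2 + \sum_i \lVert\mathbf{y}_i\rVert^2\right)
$$

With the movie factors held fixed, this is an ordinary regularized least squares problem in each user's factors, and vice versa, which is why the method alternates between the two.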

In [7]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [8]:
# 80/20 random split; no seed is set, so the split differs between runs
train, test = df.randomSplit([0.8, 0.2])

In [9]:
# Configure ALS with the column names of the ratings DataFrame and fit on the training split
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", maxIter=20)
model = als.fit(train)
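
To illustrate what "alternating" means in the fit step above, here is a toy sketch in plain Python with a single latent factor per user and per movie. It is not Spark's implementation: the function names `als_rank1` and `squared_error`, the tiny ratings list, and the regularization value are all made up for illustration.

```python
# Toy alternating least squares with one latent factor (rank 1).
# A teaching sketch under simplifying assumptions, not Spark's algorithm.

def als_rank1(ratings, n_users, n_items, reg=0.1, iters=20):
    """ratings: list of (user, item, rating) tuples with 0-based integer ids."""
    user_f = [1.0] * n_users  # one latent factor per user
    item_f = [1.0] * n_items  # one latent factor per item
    for _ in range(iters):
        # Hold item factors fixed; each user's factor has a closed-form
        # regularized least squares solution.
        for u in range(n_users):
            rows = [(i, r) for (usr, i, r) in ratings if usr == u]
            num = sum(r * item_f[i] for i, r in rows)
            den = reg + sum(item_f[i] ** 2 for i, _ in rows)
            user_f[u] = num / den
        # Hold user factors fixed; solve for each item's factor the same way.
        for i in range(n_items):
            rows = [(u, r) for (u, itm, r) in ratings if itm == i]
            num = sum(r * user_f[u] for u, r in rows)
            den = reg + sum(user_f[u] ** 2 for u, _ in rows)
            item_f[i] = num / den
    return user_f, item_f

def squared_error(ratings, user_f, item_f):
    """Sum of squared differences between ratings and factor products."""
    return sum((r - user_f[u] * item_f[i]) ** 2 for (u, i, r) in ratings)

# Hypothetical ratings: (user, item, rating)
ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0),
           (1, 2, 3.0), (2, 1, 2.5), (2, 2, 3.5)]

initial_error = squared_error(ratings, [1.0] * 3, [1.0] * 3)
user_f, item_f = als_rank1(ratings, n_users=3, n_items=3)
final_error = squared_error(ratings, user_f, item_f)
print("squared error before:", initial_error, "after:", final_error)
```

Each half-step solves one side exactly while the other is frozen, so the regularized objective never increases from one sweep to the next. Spark's `ALS` applies the same idea with vector-valued factors, distributed across the cluster.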

In [10]:
predictions=model.transform(test)

In [12]:
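# Drop rows with NaN predictions, which ALS produces by default for users or movies absent from the training split
predictions = predictions.na.drop()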

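Calculating the prediction error with RegressionEvaluator. Note that metricName="mse" returns the mean squared error; the root-mean-square error is its square root (about 0.821 for the value below), and metricName="rmse" returns it directly.

In [13]:
evaluator = RegressionEvaluator(metricName="mse", labelCol="rating", predictionCol="prediction")
mse = evaluator.evaluate(predictions)
print("Mean squared error = " + str(mse))

Mean squared error = 0.6746652231309856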
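
Since the evaluator's metric name decides whether the mean squared error or its square root is reported, a small plain-Python sketch of the two quantities may help. The functions `mse` and `rmse` and the sample pairs are hypothetical and exist only for this illustration.

```python
import math

def mse(pairs):
    """Mean of squared differences over (actual, predicted) rating pairs."""
    return sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs)

def rmse(pairs):
    """Square root of the MSE, expressed in the same units as the ratings."""
    return math.sqrt(mse(pairs))

# Made-up (actual, predicted) pairs
pairs = [(3.5, 3.0), (4.0, 4.5), (2.0, 3.0)]
print("MSE  =", mse(pairs))
print("RMSE =", rmse(pairs))
```

RMSE is usually the easier number to interpret, because it is on the same scale as the ratings themselves.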


In [14]:
predictions.describe().show()

+-------+------------------+-----------------+------------------+------------------+
|summary|            userId|          movieId|            rating|        prediction|
+-------+------------------+-----------------+------------------+------------------+
|  count|            209236|           209236|            209236|            209236|
|   mean| 3529.866557380183|8505.675333116673|  3.52739490336271|3.4156059127123255|
| stddev|2018.6541865994247|18839.87187180783|1.0528225678392187|0.6515001520082517|
|    min|                 1|                1|               0.5|        -1.4508139|
|    max|              7120|           129428|               5.0|          5.938751|
+-------+------------------+-----------------+------------------+------------------+
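
The summary above shows predictions outside the 0.5 to 5.0 range of the ratings (minimum about -1.45, maximum about 5.94), because ALS does not constrain its output. One common post-processing step, which is not part of this notebook, is to clip predictions to the valid range. A minimal sketch in plain Python, where `clip_rating` is a hypothetical helper and the middle value is made up:

```python
def clip_rating(prediction, low=0.5, high=5.0):
    """Clamp a predicted rating into the valid rating range [low, high]."""
    return max(low, min(high, prediction))

# The extreme values from the summary table, plus one in-range example
raw = [-1.4508139, 3.2, 5.938751]
clipped = [clip_rating(p) for p in raw]
print(clipped)
```

When every true rating lies inside the range, clipping cannot increase the squared error of any individual prediction. In PySpark the same idea could be expressed with column functions such as `pyspark.sql.functions.greatest` and `least`.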

