# Collaborative Filtering
Written by: Ryan Garnet Andrianto (Student ID: 05111940000063)

Learning source: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html

## Download dataset
First, we download the dataset with `wget`, since we have a direct download link for it.

In [1]:
!wget https://raw.githubusercontent.com/apache/spark/master/data/mllib/als/sample_movielens_ratings.txt

--2023-04-12 15:13:32--  https://raw.githubusercontent.com/apache/spark/master/data/mllib/als/sample_movielens_ratings.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32363 (32K) [text/plain]
Saving to: ‘sample_movielens_ratings.txt’


2023-04-12 15:13:32 (64.9 MB/s) - ‘sample_movielens_ratings.txt’ saved [32363/32363]



## Install PySpark
We need PySpark to run collaborative filtering, so we install the `pyspark` Python module with `pip`.

In [3]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 281.4/281.4 MB 4.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.7/199.7 kB 22.5 MB/s eta 0:00:00
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824028 sha256=a21251a0407b4b0b30502e18c11d3718024d28778cf1decd726aaf4b885c2912
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built pyspark
Installing collected packages: py4j, pyspa

## Create spark session
After installing the PySpark module, we create a Spark session named `collaborativeFiltering`.

In [6]:
from pyspark.sql import SparkSession

# Create spark session
spark = SparkSession \
    .builder \
    .appName('collaborativeFiltering') \
    .getOrCreate()

spark   

## Import required library / module
For this collaborative filtering task, we need three classes: `ALS` (Alternating Least Squares) to build the recommendation model, `RegressionEvaluator` to measure the RMSE (root-mean-square error) of its predictions, and `Row` to structure the parsed records. We import them here.

In [7]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row

## Read data and convert it to RDD (Resilient Distributed Datasets)
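We read the dataset as an RDD, parse each line into a `Row`, convert the result to a DataFrame, and randomly split it into training (about 80%) and test (about 20%) sets.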

In [8]:
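# Read the raw text file; each line is one rating record
lines = spark.read.text("sample_movielens_ratings.txt").rdd
# Split each line on the "::" field separator
parts = lines.map(lambda row: row.value.split("::"))
# Convert each record into a typed Row
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))
# Build a DataFrame and split it randomly into training and test sets
ratings = spark.createDataFrame(ratingsRDD)
(training, test) = ratings.randomSplit([0.8, 0.2])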
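
To make the parsing step easier to follow, here is a minimal plain-Python sketch of what happens to a single line, without Spark. The `Rating` type, the `parse_rating` helper, and the sample line are hypothetical and exist only for illustration; the field order and types are taken from the `Row(...)` call above.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One parsed rating record (hypothetical helper type)."""
    userId: int
    movieId: int
    rating: float
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Split a `userId::movieId::rating::timestamp` line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


# Made-up input line, not taken from the dataset
example = parse_rating("7::42::3.5::1400000000")
print(example)
```

The Spark version applies the same split-and-cast logic to every line in parallel through `map`.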

## Create the test cases
Our goal is to find the pair of `maxIter` and `regParam` values that gives the lowest RMSE. We try maxIter = [5, 10, 20] and regParam = [0.01, 0.1, 0.5, 1.0]. The test cases are the cross product of these two lists, which gives 3 × 4 = 12 combinations.

In [15]:
# Candidate values for maxIter and regParam
maxIter_test = [5, 10, 20]
regParam_test = [0.01, 0.1, 0.5, 1.0]

testCase = []
# Cross product between maxIter_test and regParam_test
for t1 in maxIter_test:
  for t2 in regParam_test:
    testCase.append({
        "maxIter": t1,
        "regParam": t2
    })

# Show test cases
print(testCase)

[{'maxIter': 5, 'regParam': 0.01}, {'maxIter': 5, 'regParam': 0.1}, {'maxIter': 5, 'regParam': 0.5}, {'maxIter': 5, 'regParam': 1.0}, {'maxIter': 10, 'regParam': 0.01}, {'maxIter': 10, 'regParam': 0.1}, {'maxIter': 10, 'regParam': 0.5}, {'maxIter': 10, 'regParam': 1.0}, {'maxIter': 20, 'regParam': 0.01}, {'maxIter': 20, 'regParam': 0.1}, {'maxIter': 20, 'regParam': 0.5}, {'maxIter': 20, 'regParam': 1.0}]
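
As an alternative to the nested loops, the same cross product can be sketched with `itertools.product` from the Python standard library. This is only a more compact way to write the same grid, not a change to the method; the variable names match the cell above.

```python
from itertools import product

maxIter_test = [5, 10, 20]
regParam_test = [0.01, 0.1, 0.5, 1.0]

# product() yields every (maxIter, regParam) pair, varying regParam fastest,
# which is the same order the nested loops produce
testCase = [
    {"maxIter": max_iter, "regParam": reg_param}
    for max_iter, reg_param in product(maxIter_test, regParam_test)
]

print(len(testCase))  # 12
```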


## Create the models
We train one ALS model for each test case.


In [16]:
# Train one model for each test case
models = []

for tc in testCase:
  # Build the recommendation model using ALS on the training data
  # Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
  als = ALS(maxIter=tc["maxIter"], regParam=tc["regParam"], userCol="userId", itemCol="movieId", ratingCol="rating",
            coldStartStrategy="drop")
  model = als.fit(training)

  models.append(model)

## Evaluate the models
After training the models, we evaluate each one by computing the RMSE of its predictions on the test data.



In [19]:
# Create regression evaluator
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
# Evaluate each model in the test cases
for tc, model in zip(testCase, models):
  # Evaluate the model by computing the RMSE on the test data
  predictions = model.transform(test)
  rmse = evaluator.evaluate(predictions)
  print(f"For maxIter = {tc['maxIter']}, regParam = {tc['regParam']}, the Root-mean-square error = {rmse}")

For maxIter = 5, regParam = 0.01, the Root-mean-square error = 1.7062862282987898
For maxIter = 5, regParam = 0.1, the Root-mean-square error = 0.9236789089860488
For maxIter = 5, regParam = 0.5, the Root-mean-square error = 1.1725188816296388
For maxIter = 5, regParam = 1.0, the Root-mean-square error = 1.4590015642689835
For maxIter = 10, regParam = 0.01, the Root-mean-square error = 1.5783056204812427
For maxIter = 10, regParam = 0.1, the Root-mean-square error = 0.914439664825006
For maxIter = 10, regParam = 0.5, the Root-mean-square error = 1.1693271714900164
For maxIter = 10, regParam = 1.0, the Root-mean-square error = 1.4590026970984464
For maxIter = 20, regParam = 0.01, the Root-mean-square error = 1.4714557513537374
For maxIter = 20, regParam = 0.1, the Root-mean-square error = 0.9232271611157146
For maxIter = 20, regParam = 0.5, the Root-mean-square error = 1.1695704941189702
For maxIter = 20, regParam = 1.0, the Root-mean-square error = 1.4590027018137415
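
For readers unfamiliar with the metric, the following plain-Python sketch shows the standard definition of RMSE: the square root of the mean squared difference between actual and predicted ratings. It illustrates the formula only and is not Spark's implementation; the `rmse` function and the sample ratings are made up for this example.

```python
import math


def rmse(labels, predictions):
    """Root-mean-square error between two equal-length lists of numbers."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings: the errors are 1, 0 and 2, so the mean squared error is 5/3
value = rmse([4.0, 2.0, 5.0], [3.0, 2.0, 3.0])
print(value)
```

A lower RMSE means the predicted ratings are, on average, closer to the actual ratings, which is why the grid search looks for the minimum.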


## Conclusion
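With maxIter = 10 and regParam = 0.1, the RMSE is about 0.914, the lowest among the 12 combinations tested. For every maxIter value, regParam = 0.1 gives the lowest RMSE, while changing maxIter at that setting has only a small effect (RMSE between 0.914 and 0.924). Because `randomSplit` was called without a seed, the exact values will differ between runs.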
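
Rather than reading the minimum off the printed output, the best combination can also be picked programmatically. The sketch below assumes the RMSE values have been collected into a dictionary keyed by `(maxIter, regParam)`; the `results` name is hypothetical, and the numbers are copied from the evaluation output above.

```python
# RMSE values copied from the evaluation output, keyed by (maxIter, regParam)
results = {
    (5, 0.01): 1.7062862282987898,
    (5, 0.1): 0.9236789089860488,
    (5, 0.5): 1.1725188816296388,
    (5, 1.0): 1.4590015642689835,
    (10, 0.01): 1.5783056204812427,
    (10, 0.1): 0.914439664825006,
    (10, 0.5): 1.1693271714900164,
    (10, 1.0): 1.4590026970984464,
    (20, 0.01): 1.4714557513537374,
    (20, 0.1): 0.9232271611157146,
    (20, 0.5): 1.1695704941189702,
    (20, 1.0): 1.4590027018137415,
}

# Pick the entry with the smallest RMSE
best_params, best_rmse = min(results.items(), key=lambda item: item[1])
print(best_params, round(best_rmse, 3))  # (10, 0.1) 0.914
```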