# Recommendation System using PySpark

A recommendation system is a software application that suggests items or content to users based on
their preferences, behaviors, and historical interactions. It leverages algorithms to analyze user data and
identify patterns, aiming to provide personalized and relevant recommendations. There are two primary
types of recommendation systems: collaborative filtering, which recommends items based on the
preferences of users with similar tastes, and content-based filtering, which suggests items similar to
those the user has previously liked. Hybrid approaches combine these methods for more robust and
accurate suggestions. Recommendation systems are widely used in e-commerce platforms, streaming
services, social media, and other online applications to enhance user experience, engagement, and
satisfaction by delivering tailored content or product suggestions.
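
To make the collaborative filtering idea concrete, here is a minimal pure-Python sketch of user-based filtering. The toy `ratings` dictionary, the user and item names, and the helper functions `cosine_similarity` and `predict` are all hypothetical and are not part of the lab dataset. It scores an unseen item for a user as a similarity-weighted average of other users' scores.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: score}
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 3.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' scores for `item`."""
    numerator = denominator = 0.0
    for other, scores in ratings.items():
        if other == user or item not in scores:
            continue
        sim = cosine_similarity(ratings[user], scores)
        numerator += sim * scores[item]
        denominator += abs(sim)
    # None signals that no other user has rated the item
    return numerator / denominator if denominator else None


print(predict("alice", "D"))
```

In this toy data, alice's scores resemble bob's more than carol's, so bob's score for item D carries more weight in the estimate.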

Lab Exercises:

1) Demonstrate how to load a dataset suitable for recommendation systems into a PySpark
DataFrame.

2) Implement a PySpark script that splits the data and trains a recommendation model.

3) Implement a PySpark script using the ALS algorithm for collaborative filtering.

4) Implement code to evaluate the performance of the recommendation model using
appropriate metrics.

In [8]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
spark = SparkSession.builder.getOrCreate()

# Keep 10,000 rows; limit() stays distributed instead of collecting rows to the driver and rebuilding a DataFrame.
ratings = (
    spark.read.json("movies 1.json")
    .select("user_id", "product_id", "score")
    .limit(10000)
    .cache()
)

ratings.show()

+--------------+----------+-----+
|       user_id|product_id|score|
+--------------+----------+-----+
|A141HP4LYPWMSR|B003AI2VGA|  3.0|
|A328S9RN3U5M68|B003AI2VGA|  3.0|
|A1I7QGUDP043DG|B003AI2VGA|  5.0|
|A1M5405JH9THP9|B003AI2VGA|  3.0|
| ATXL536YX71TR|B003AI2VGA|  3.0|
|A3QYDL5CDNYN66|B003AI2VGA|  2.0|
| AQJVNDW6YZFQS|B003AI2VGA|  1.0|
| AD4CDZK7D31XP|B00006HAXW|  5.0|
|A3Q4S5DFVPB70D|B00006HAXW|  5.0|
|A2P7UB02HAVEPB|B00006HAXW|  5.0|
|A2TX99AZKDK0V7|B00006HAXW|  4.0|
| AFC8IKR407HSK|B00006HAXW|  5.0|
|A1FRPGQYQTAOR1|B00006HAXW|  5.0|
|A1RSDE90N6RSZF|B00006HAXW|  5.0|
|A1OUBOGB5970AO|B00006HAXW|  4.0|
|A3NPHQVIY59Y0Y|B00006HAXW|  5.0|
| AFKMBAY28XO8A|B00006HAXW|  5.0|
| A66KMXH9V7OGU|B00006HAXW|  5.0|
| AFJ27ZV9183B8|B00006HAXW|  5.0|
| AXMKAXC0TR9AW|B00006HAXW|  5.0|
+--------------+----------+-----+
only showing top 20 rows



In [9]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

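# ALS needs numeric IDs, so map each string ID column to a numeric index column.
indexers = [
    StringIndexer(inputCol=column, outputCol=column + "_index")
    for column in ["user_id", "product_id"]
]
# Pipeline.fit() fits each indexer, so the stages do not need to be fitted beforehand.
pipeline = Pipeline(stages=indexers)
ratings_indexed = pipeline.fit(ratings).transform(ratings)

# 80/20 train/validation split (weights are normalized, so [8.0, 2.0] is equivalent).
training_data, validation_data = ratings_indexed.randomSplit([0.8, 0.2])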

als = ALS(
    userCol="user_id_index",
    itemCol="product_id_index",
    ratingCol="score",
    rank=10,
    maxIter=5,
    regParam=0.01,
    coldStartStrategy="drop",
)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="score", predictionCol="prediction")

model = als.fit(training_data)
predictions = model.transform(validation_data)
predictions.show(10, False)


+--------------+----------+-----+-------------+----------------+----------+
|user_id       |product_id|score|user_id_index|product_id_index|prediction|
+--------------+----------+-----+-------------+----------------+----------+
|A18758S1PUYIDT|B000063W1R|4.0  |27.0         |7.0             |1.6933117 |
|AJYGQV81FSFE2 |B000NDFLWG|4.0  |599.0        |91.0            |6.0264754 |
|A87RT63V7SMD3 |B000063W1R|4.0  |565.0        |7.0             |-1.0291104|
|AQ01Q3070LT29 |B000063W1R|1.0  |38.0         |7.0             |0.77850187|
|A1N8K1X0OLLADY|B000063W1R|5.0  |303.0        |7.0             |0.08824549|
|A2582KMXLK2P06|B00004CQT3|3.0  |66.0         |75.0            |-13.813685|
|A1GGOC9PVDXW7Z|B008FPU7AA|4.0  |0.0          |112.0           |4.055148  |
|A1GHUN5HXMHZ89|B000063W1R|4.0  |18.0         |7.0             |1.2238578 |
|A15Q7ABIU9O9YZ|B0001G6PZC|5.0  |243.0        |1.0             |2.174233  |
|AUEHG0DB54B7K |B0001G6PZC|5.0  |623.0        |1.0             |1.8417726 |
+--------------+----------+-----+-------------+----------------+----------+
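
The `user_id_index` and `product_id_index` columns above come from `StringIndexer`. As a rough pure-Python sketch of its default behaviour (indices assigned by descending frequency, with ties assumed to be broken alphabetically), the hypothetical helper `string_index` below builds the same kind of mapping. The product IDs are taken from the output shown earlier, but their counts here are invented for illustration.

```python
from collections import Counter


def string_index(values):
    """Map each label to 0.0, 1.0, ... ordered by descending frequency.

    Ties are broken alphabetically (an assumption about Spark's default ordering).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Illustrative sample: the counts are made up, the IDs appear in the output above
product_ids = ["B003AI2VGA", "B00006HAXW", "B00006HAXW", "B003AI2VGA", "B00006HAXW"]
mapping = string_index(product_ids)

# Reverse lookup, useful for turning an index back into the original ID
reverse = {index: label for label, index in mapping.items()}
print(mapping)
```

The most frequent label receives index 0.0, which is why heavily active users and popular products tend to have small index values.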

In [6]:
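# Validation rows for the user whose index is 1.0
user1 = validation_data.filter(validation_data["user_id_index"] == 1.0).select(
    "product_id", "user_id", "user_id_index", "product_id_index"
)
user1.show()
# Score only the items this user rated in the validation set, highest prediction first
recommendations = model.transform(user1)
recommendations.orderBy("prediction", ascending=False).show()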

+----------+-------------+-------------+----------------+
|product_id|      user_id|user_id_index|product_id_index|
+----------+-------------+-------------+----------------+
|B004EPYZQM|ANCOMAI0I7LVG|          1.0|             9.0|
|B00004RXMK|ANCOMAI0I7LVG|          1.0|            62.0|
|B001AQT0VI|ANCOMAI0I7LVG|          1.0|            15.0|
+----------+-------------+-------------+----------------+

+----------+-------------+-------------+----------------+----------+
|product_id|      user_id|user_id_index|product_id_index|prediction|
+----------+-------------+-------------+----------------+----------+
|B004EPYZQM|ANCOMAI0I7LVG|          1.0|             9.0| 5.7559237|
|B001AQT0VI|ANCOMAI0I7LVG|          1.0|            15.0| -4.647329|
|B00004RXMK|ANCOMAI0I7LVG|          1.0|            62.0| -6.248013|
+----------+-------------+-------------+----------------+----------+
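
Some predictions above fall outside the 1 to 5 rating scale (for example -6.248013). An ALS model scores a user/item pair as the dot product of their latent factor vectors, and nothing bounds that product. The sketch below illustrates this with made-up factor vectors. The helper names `predict_score` and `clip` are hypothetical, and clipping is an optional post-processing step assumed here, not something the notebook does.

```python
def predict_score(user_factors, item_factors):
    """Dot product of a user's and an item's latent factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


def clip(score, low=1.0, high=5.0):
    """Clamp a raw prediction to the valid rating range (an assumed 1-5 scale)."""
    return max(low, min(high, score))


# Made-up factors with rank 3 (the notebook uses rank=10)
user_vec = [1.5, -2.0, 0.5]
item_vec = [2.0, 3.0, 1.0]

raw = predict_score(user_vec, item_vec)
print(raw, clip(raw))
```

Many strongly out-of-range predictions can also be a sign of overfitting on sparse data. Trying a larger `regParam` is one common remedy, though whether it helps on this dataset would need to be checked.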



In [7]:
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE) = {rmse}")

# Additional Evaluation Metric: Mean Absolute Error (MAE)
evaluator_mae = RegressionEvaluator(
    metricName="mae",
    labelCol="score",
    predictionCol="prediction"
)

mae = evaluator_mae.evaluate(predictions)
print(f"Mean Absolute Error (MAE) = {mae}")

Root Mean Squared Error (RMSE) = 4.818847318876777
Mean Absolute Error (MAE) = 3.736825553559404
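
For reference, both metrics can be computed by hand. The following is a small pure-Python sketch on made-up values. The function names `root_mean_squared_error` and `mean_absolute_error` are illustrative and are not part of PySpark.

```python
from math import sqrt


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference; penalizes large errors more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences; every error counts linearly."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up scores, not taken from the dataset
actual = [3.0, 5.0, 4.0]
predicted = [2.0, 5.0, 6.0]

print(root_mean_squared_error(actual, predicted))
print(mean_absolute_error(actual, predicted))
```

RMSE is never smaller than MAE on the same data, and a wide gap between them, as in the output above, suggests that a few predictions are very far off.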


In [None]:
# Assumes cm1 has one row per actual class ("above") and one column per predicted class ("true"/"false").
# TP=cm1.filter("above==true").select("true").collect()[0].true
# FN=cm1.filter("above==true").select("false").collect()[0].false
# FP=cm1.filter("above==false").select("true").collect()[0].true
# TN=cm1.filter("above==false").select("false").collect()[0].false

# precision = TP/(TP + FP)
# recall = TP/(TP + FN)
# f1score = 2*precision*recall/(precision+recall)

# print(f"Precision->{precision}\nRecall->{recall}\nF1-Score->{f1score}")
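
The commented-out cell above depends on a `cm1` confusion matrix that is not defined in this notebook. As a self-contained alternative, the sketch below derives precision, recall, and F1 by treating a score at or above a threshold as "liked". The threshold of 3.5, the function name `precision_recall_f1`, and the sample values are all assumptions made for illustration.

```python
def precision_recall_f1(actual, predicted, threshold=3.5):
    """Binarize scores at `threshold`, then compute precision, recall and F1."""
    tp = fp = fn = 0
    for a, p in zip(actual, predicted):
        liked = a >= threshold        # the user actually liked the item
        recommended = p >= threshold  # the model would recommend the item
        if liked and recommended:
            tp += 1
        elif recommended:
            fp += 1
        elif liked:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up scores, not taken from the dataset
actual = [5.0, 4.0, 2.0, 1.0, 5.0]
predicted = [4.2, 2.9, 3.8, 1.5, 4.9]

precision, recall, f1score = precision_recall_f1(actual, predicted)
print(f"Precision->{precision}\nRecall->{recall}\nF1-Score->{f1score}")
```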