# Movie Recommendation System using ALS (Alternating Least Squares)
### Collaborative Filtering: Alternating Least Squares
#### Overview
This notebook implements a collaborative filtering recommendation system using the Alternating Least Squares (ALS) algorithm from PySpark's MLlib. The system recommends movies to users based on their past ratings and the ratings of similar users.

#### Key Features
- Dataset: Uses the MovieLens dataset (https://grouplens.org/datasets/movielens/)
- Algorithm: ALS (Alternating Least Squares) for matrix factorization
- Evaluation: RMSE (Root Mean Square Error) metric
- Output: Provides top 10 movie recommendations for each user
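
For reference, the matrix-factorization objective behind ALS is usually written in the form below. The symbols are notation introduced here, not names from the notebook: $r_{ui}$ is an observed rating, $x_u$ and $y_i$ are the rank-$k$ factor vectors for user $u$ and movie $i$, $\lambda$ plays the role of `regParam`, and $\mathcal{K}$ is the set of observed (user, movie) pairs. Spark's implementation additionally scales the penalty by the number of ratings per user or movie, so treat this as the textbook form rather than the exact loss Spark optimizes.

$$
\min_{X,\,Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
$$

With $Y$ held fixed, the problem reduces to an ordinary regularized least-squares problem in each $x_u$, and likewise for each $y_i$ with $X$ fixed, which is why the algorithm alternates between the two sets of factors.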

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, format_string, when
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
import os

## Create Spark Session

In [None]:
# Create a Spark session with 4 GB of driver memory
spark = SparkSession.builder.appName("MovieRecommendation").config("spark.driver.memory", "4g").getOrCreate()

In [3]:
# Check the Spark version
spark.version

'3.5.1'

## Load the Dataset

In [None]:
df_path = r"C:\Users\Pradeep\Downloads\ml-32m"

movies_df = spark.read.csv(os.path.join(df_path, "movies.csv"), inferSchema=True, header=True)
ratings_df = spark.read.csv(os.path.join(df_path, "ratings.csv"), inferSchema=True, header=True)
links_df = spark.read.csv(os.path.join(df_path, "links.csv"), inferSchema=True, header=True)

## Explore the Data

In [5]:
print("Movies Data Preview")
movies_df.show(5)
print("_"*50)
print("")
print("Ratings Data Preview")
ratings_df.show(5)

Movies Data Preview
+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows

__________________________________________________

Ratings Data Preview
+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|     17|   4.0|944249077|
|     1|     25|   1.0|944250228|
|     1|     29|   2.0|943230976|
|     1|     30|   5.0|944249077|
|     1|     32|   5.0|943228858|
+------+-------+------+---------+
only showing top 5 rows



In [6]:
print("Links Data Preview")
links_df.show(5)

Links Data Preview
+-------+------+------+
|movieId|imdbId|tmdbId|
+-------+------+------+
|      1|114709|   862|
|      2|113497|  8844|
|      3|113228| 15602|
|      4|114885| 31357|
|      5|113041| 11862|
+-------+------+------+
only showing top 5 rows



In [7]:
print("Total Ratings:", ratings_df.count())
print("Distinct Users:", ratings_df.select("userId").distinct().count())
print("Distinct Movies:", ratings_df.select("movieId").distinct().count())

Total Ratings: 693327
Distinct Users: 4516
Distinct Movies: 23744


## Data Preparation

In [8]:
# Keep only the columns ALS needs (the timestamp is not used)
ratings_df = ratings_df.select("userId", "movieId", "rating")

In [None]:
# Split the data into train/test sets (80/20 split); the seed makes the split reproducible
train, test = ratings_df.randomSplit([0.8, 0.2], seed=42)

## ALS Model Training

In [None]:
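# Configure the ALS model: 20 latent factors, 10 iterations, regularization 0.05.
# coldStartStrategy="drop" removes rows whose prediction would be NaN (users or movies unseen in training).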
als = ALS(maxIter=10, rank=20, regParam=0.05, userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop", seed=42)

In [None]:
# Train the model
model = als.fit(train)
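
To make the roles of `rank`, `regParam`, and `maxIter` more concrete, here is a minimal sketch of the alternating updates in plain Python. It is a toy with rank 1 and unweighted regularization, not Spark's implementation; the data and every name in it (`ratings`, `user_f`, `item_f`, `loss`) are hypothetical.

```python
# Toy rank-1 ALS in plain Python (illustrative only; not Spark's implementation).
# All data and names below are hypothetical.
ratings = {  # (user, item) -> rating
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 5.0, (2, 2): 2.0,
}
n_users, n_items = 3, 3
reg = 0.05      # plays the role of regParam
max_iter = 10   # plays the role of maxIter

user_f = [1.0] * n_users  # one latent factor per user (rank = 1)
item_f = [1.0] * n_items  # one latent factor per item


def loss():
    """Regularized squared error over the observed ratings."""
    squared_error = sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())
    penalty = reg * (sum(x * x for x in user_f) + sum(y * y for y in item_f))
    return squared_error + penalty


history = [loss()]
for _ in range(max_iter):
    # Step 1: hold item factors fixed and solve each user's 1-D ridge regression.
    for u in range(n_users):
        rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
        user_f[u] = sum(r * item_f[i] for i, r in rated) / (reg + sum(item_f[i] ** 2 for i, _ in rated))
    # Step 2: hold user factors fixed and solve each item's 1-D ridge regression.
    for i in range(n_items):
        rated = [(u, r) for (u, ii), r in ratings.items() if ii == i]
        item_f[i] = sum(r * user_f[u] for u, r in rated) / (reg + sum(user_f[u] ** 2 for u, _ in rated))
    history.append(loss())

print(f"loss before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

Each half-step solves its sub-problem exactly, so the loss in `history` never increases from one iteration to the next. A higher rank replaces the scalar division with a small linear solve per user or item.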

## Model Evaluation

In [None]:
# Make predictions on test set
pred = model.transform(test)

In [21]:
pred.show(10)

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   148|    260|   4.5| 3.2413452|
|   148|    293|   3.5| 2.6870203|
|   148|    333|   0.5|  1.403361|
|   148|    541|   3.0| 2.8585865|
|   148|    551|   0.5| 2.0270708|
|   148|    588|   2.0| 1.0710027|
|   148|    593|   3.5|  2.671454|
|   148|    596|   1.0|0.80169785|
|   148|    953|   0.5| 1.4076023|
|   148|   1036|   4.0|  3.322714|
+------+-------+------+----------+
only showing top 10 rows
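
The `coldStartStrategy="drop"` setting matters for the evaluation step: without it, a user or movie that appears only in the test split gets a NaN prediction, and a single NaN turns the whole error metric into NaN. The sketch below illustrates that effect in plain Python; the `scored` pairs are made-up values, not rows from the dataset.

```python
import math

# Hypothetical (rating, prediction) pairs; the NaN stands for a user or movie
# that the model never saw during training.
scored = [(4.0, 3.6), (2.5, 3.1), (5.0, float("nan"))]

# Mean squared error with the NaN row included: the NaN propagates.
mse_with_nan = sum((r - p) ** 2 for r, p in scored) / len(scored)

# Mean squared error after dropping NaN predictions, mimicking "drop".
kept = [(r, p) for r, p in scored if not math.isnan(p)]
mse_dropped = sum((r - p) ** 2 for r, p in kept) / len(kept)

print(mse_with_nan, mse_dropped)
```

The trade-off is that dropped rows are silently excluded, so the reported error covers only users and movies that were present in the training split.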



In [None]:
# Set up the RMSE (Root Mean Square Error) evaluator
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

In [23]:
# RMSE expresses the typical prediction error in the same units as the star ratings
rmse = evaluator.evaluate(pred)
print(f"Root-Mean-Square Error (RMSE): {rmse:.4f}")

Root-Mean-Square Error (RMSE): 0.8470
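
As a sanity check on what the evaluator computes, the same formula can be applied by hand to the ten rows printed by `pred.show(10)`. This is only a sketch: those rows all belong to user 148, so the result is a per-sample figure and is not expected to match the full test-set RMSE.

```python
import math

# (rating, prediction) pairs copied from the pred.show(10) output above.
rows = [
    (4.5, 3.2413452), (3.5, 2.6870203), (0.5, 1.403361), (3.0, 2.8585865),
    (0.5, 2.0270708), (2.0, 1.0710027), (3.5, 2.671454), (1.0, 0.80169785),
    (0.5, 1.4076023), (4.0, 3.322714),
]

# RMSE = square root of the mean of the squared errors.
squared_errors = [(rating - prediction) ** 2 for rating, prediction in rows]
sample_rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(f"RMSE over the 10 displayed rows: {sample_rmse:.4f}")
```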


## Generate Recommendations for a Specific User

In [24]:
user_1 = test.filter(test["userId"] == 1).select(["movieId", "userId"])

In [25]:
user_1.show(5)

+-------+------+
|movieId|userId|
+-------+------+
|     29|     1|
|     36|     1|
|    110|     1|
|    223|     1|
|    322|     1|
+-------+------+
only showing top 5 rows



In [26]:
# Score user 1's held-out test movies and rank them by predicted rating
rec = model.transform(user_1)
rec.orderBy("prediction", ascending=False).show(10)

+-------+------+----------+
|movieId|userId|prediction|
+-------+------+----------+
|    916|     1| 4.8636007|
|    541|     1|  4.317715|
|   1213|     1| 4.2170897|
|   1178|     1| 3.9555745|
|   1236|     1| 3.8863523|
|   1120|     1| 3.6549542|
|   1090|     1| 3.6440308|
|   1094|     1|  3.570591|
|   1199|     1| 3.5297937|
|    322|     1|   3.48788|
+-------+------+----------+
only showing top 10 rows



## Generate Top 10 Recommendations for All Users

In [27]:
# Generate the top 10 movie recommendations for every user
user_recs = model.recommendForAllUsers(10)

# Show recommendations for one user
user_recs.show(1, truncate=False)

+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                                                                                                           |
+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1     |[{47465, 7.664309}, {142115, 7.148727}, {4819, 6.5788894}, {6269, 6.247139}, {2936, 6.2443223}, {2575, 6.156912}, {309, 6.0103717}, {86320, 6.0054}, {36276, 5.977084}, {7215, 5.8681264}]|
+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 1 row
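
Note that the scores above go well past 5.0. ALS fits an unconstrained regression, so nothing ties its predictions to the rating scale, and sparsely rated movies can plausibly end up with extreme scores. One option, sketched below, is to keep the raw score for ranking and clip only the value shown to users. The 0.5 to 5.0 bounds are an assumption about the MovieLens scale, and `MIN_RATING` and `MAX_RATING` are hypothetical names.

```python
# Predicted scores for user 1, copied from the recommendForAllUsers output above.
scores = [7.664309, 7.148727, 6.5788894, 6.247139, 6.2443223]

MIN_RATING, MAX_RATING = 0.5, 5.0  # assumed MovieLens rating scale

# Clip each score into the valid range for display purposes only.
clipped = [min(max(score, MIN_RATING), MAX_RATING) for score in scores]
print(clipped)
```

Clipping collapses all five scores to the same value, which is why the ordering should still come from the raw predictions.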

## Display Recommendations with Movie Titles

In [28]:
# Flatten the recommendations: one row per (userId, movieId, predicted rating)
from pyspark.sql.functions import explode

recs_flat = user_recs.select("userId", explode("recommendations").alias("rec"))
recs_flat = recs_flat.select("userId", col("rec.movieId"), col("rec.rating"))

# Attach TMDB links; tmdb_url stays null when a movie has no tmdbId
recs_with_links = recs_flat.join(links_df, on="movieId", how="left")
recs_with_urls = recs_with_links.withColumn(
    "tmdb_url",
    when(col("tmdbId").isNotNull(), format_string("https://www.themoviedb.org/movie/%d", col("tmdbId").cast("int"))),
)

# Attach titles; the "rating" column here is the model's predicted score
recs_with_titles = recs_with_urls.join(movies_df, on="movieId", how="left")
recs_with_titles.select("userId", "title", "rating", "tmdb_url").show(10, truncate=False)

+------+--------------------------------------------------------------+---------+---------------------------------------+
|userId|title                                                         |rating   |tmdb_url                               |
+------+--------------------------------------------------------------+---------+---------------------------------------+
|1     |Tideland (2005)                                               |7.664309 |https://www.themoviedb.org/movie/11559 |
|1     |The Blue Planet (2001)                                        |7.148727 |https://www.themoviedb.org/movie/200813|
|1     |Go Figure (Va savoir) (2001)                                  |6.5788894|https://www.themoviedb.org/movie/55372 |
|1     |Stevie (2002)                                                 |6.247139 |https://www.themoviedb.org/movie/51927 |
|1     |Sullivan's Travels (1941)                                     |6.2443223|https://www.themoviedb.org/movie/16305 |
|1     |Dreamlife of Ang

## Model Persistence

In [None]:
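# Save the model for future use; overwrite() lets this cell be re-run without a "path already exists" error
model.write().overwrite().save("/content/als_model")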

## Load Model and Test

In [None]:
# Load model for use
from pyspark.ml.recommendation import ALSModel

alsmodel = ALSModel.load("/content/als_model")

In [35]:
alsmodel.recommendForAllUsers(5).show(5, truncate=False)

+------+----------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                     |
+------+----------------------------------------------------------------------------------------------------+
|1     |[{47465, 7.664309}, {142115, 7.148727}, {4819, 6.5788894}, {6269, 6.247139}, {2936, 6.2443223}]     |
|2     |[{4802, 6.5115914}, {1150, 5.9384995}, {192803, 5.9136086}, {2739, 5.885508}, {986, 5.836393}]      |
|3     |[{26083, 5.279448}, {61406, 5.151223}, {80839, 5.1355705}, {71282, 5.09892}, {4184, 5.0110965}]     |
|4     |[{8019, 5.2029705}, {7311, 5.097209}, {209173, 4.997402}, {89305, 4.9882855}, {142082, 4.9497085}]  |
|5     |[{163072, 5.084265}, {77364, 4.8319287}, {101942, 4.735526}, {107953, 4.6873665}, {89554, 4.620346}]|
+------+----------------------------------------------------------------------------------------------------+
only showing top 5 rows

In [37]:
spark.stop()