### **Business Understanding**

#### *Project Overview:*
<p>The Movie Recommendation System is designed to enhance user engagement and satisfaction on our platform by providing personalized movie recommendations to users. The primary goal of this project is to leverage machine learning techniques to deliver relevant and engaging movie suggestions, thereby increasing user retention and the consumption of content.</p>

#### *Objectives:*

1. Enhance User Experience: The central objective of this project is to improve the user experience by offering movie recommendations tailored to individual preferences. Personalization can lead to increased user satisfaction and longer engagement with our platform.

2. Increase Content Consumption: By presenting users with movies they are likely to enjoy, we aim to boost content consumption on our platform. Satisfied users are more likely to explore and watch a wider range of movies.

#### *Key Stakeholders:*

1. Users: The primary stakeholders are the platform users who expect relevant and enjoyable movie recommendations.
2. Content Providers: Content providers benefit from increased viewership of their movies, driven by the recommendations.
3. Platform Operators: The platform operators benefit from higher user engagement and potential revenue growth.


#### *Challenges:*

1. Cold Start Problem: Addressing the challenge of providing recommendations for new users with limited viewing history.
2. Scalability: Ensuring that the recommendation system can handle a growing user base and a vast catalog of movies.
3. Balancing Relevance and Diversity: Striking the right balance between recommending popular movies and introducing diversity to avoid creating content "bubbles."
4. Data Privacy: Ensuring that user data is handled with care and in compliance with privacy regulations.
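
One common way to soften the cold start problem listed above is to fall back to popularity when a user has no personalized results. The following is a minimal sketch of that idea, not the approach used later in this notebook; the function name `recommend_with_fallback` and the toy data are hypothetical.

```python
from collections import Counter

def recommend_with_fallback(user_id, personalized, ratings, k=3):
    """Return personalized picks when available, else the k most-rated movies."""
    picks = personalized.get(user_id)
    if picks:
        return picks[:k]
    # Fallback for unknown users: rank movies by how many ratings they received.
    popularity = Counter(movie for _, movie in ratings)
    return [movie for movie, _ in popularity.most_common(k)]

# Toy (user, movie) interactions; illustrative data only.
ratings = [(1, "A"), (2, "A"), (3, "A"), (1, "B"), (2, "B"), (3, "C")]
personalized = {1: ["C", "D"]}

known_user = recommend_with_fallback(1, personalized, ratings)
new_user = recommend_with_fallback(99, personalized, ratings, k=2)
print(known_user, new_user)
```

A popularity fallback trades personalization for coverage, so it also interacts with the relevance versus diversity challenge: it tends to reinforce already popular titles.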


#### *Success Metrics:*

1. User Engagement: We will measure the increase in user engagement metrics such as the number of movies watched, session duration, and user retention.
2. Content Consumption: Tracking the consumption of recommended movies compared to non-recommended movies.
3. User Satisfaction: Gathering user feedback and conducting surveys to assess user satisfaction with the recommendations.
4. Revenue Growth: Monitoring any potential increase in revenue from users subscribing to premium services or purchasing content.

#### *Project Phases:*

1. Data Collection and Preparation: Collecting and preprocessing user behavior data, movie metadata, and ratings.
2. Model Development: Building and training machine learning models to predict user preferences and generate movie recommendations.
3. Evaluation and Fine-Tuning: Assessing model performance, optimizing recommendation quality, and fine-tuning algorithms.
4. Deployment: Integrating the recommendation system into the platform, ensuring scalability and real-time recommendations.
5. Monitoring and Iteration: Continuously monitoring system performance, collecting user feedback, and making improvements.

#### *Ethical Considerations:*

1. Protecting user privacy and ensuring data security.
2. Avoiding biased recommendations, such as discriminatory results or filter bubbles.
3. Providing transparency in how recommendations are generated and allowing users to control their recommendations.

#### *Conclusion:*

<p>The Movie Recommendation System aims to create a win-win situation for both users and the platform by delivering personalized, engaging, and diverse movie recommendations. The success of this project will be measured through increased user engagement, content consumption, and user satisfaction, ultimately contributing to the growth and success of our platform.</p>

### **Data Understanding**

In [1]:
# !pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285387 sha256=cc4b7c3e54d77f405fb9fbcb207c76c4ea5e43a30560a569955b45196692b023
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


In [34]:
# import necessary modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, rand
from pyspark.ml.feature import StringIndexer,IndexToString
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator


In [3]:
# create the SparkSession
spark = SparkSession.builder.appName('movie_recommender').getOrCreate()

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# read dataset
df_1 = spark.read.csv('/content/drive/MyDrive/Data/ratings.csv',inferSchema=True,header=True)

In [6]:
# size of dataset
print((df_1.count(),len(df_1.columns)))

(25000095, 4)


In [7]:
# validate data types
df_1.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [8]:
df_2 = spark.read.csv('/content/drive/MyDrive/Data/movies.csv',inferSchema=True,header=True)

In [9]:
print((df_2.count(),len(df_2.columns)))

(62423, 3)


In [10]:
df_2.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [11]:
df = df_2.join(df_1, on=['movieId'], how='inner')

In [12]:
# validate data types
df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)
 |-- userId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [13]:
# size of new dataset
print((df.count(),len(df.columns)))

(25000095, 6)


In [14]:
# view 10 random rows
df.orderBy(rand()).show(10,False)

+-------+--------------------------------------------------+-------------------------+------+------+----------+
|movieId|title                                             |genres                   |userId|rating|timestamp |
+-------+--------------------------------------------------+-------------------------+------+------+----------+
|4308   |Moulin Rouge (2001)                               |Drama|Musical|Romance    |133359|3.0   |1137511927|
|441    |Dazed and Confused (1993)                         |Comedy                   |77616 |4.0   |975230602 |
|2858   |American Beauty (1999)                            |Drama|Romance            |36553 |5.0   |997996079 |
|173291 |Valerian and the City of a Thousand Planets (2017)|Action|Adventure|Sci-Fi  |101415|3.5   |1548073509|
|6874   |Kill Bill: Vol. 1 (2003)                          |Action|Crime|Thriller    |76929 |3.5   |1532603210|
|196    |Species (1995)                                    |Horror|Sci-Fi            |110853|1.0   |9405

In [15]:
# count ratings per user, most active users first
df.groupBy('userId').count().orderBy('count', ascending=False).show(10, False)

+------+-----+
|userId|count|
+------+-----+
|72315 |32202|
|80974 |9178 |
|137293|8913 |
|33844 |7919 |
|20055 |7488 |
|109731|6647 |
|92046 |6564 |
|49403 |6553 |
|30879 |5693 |
|115102|5649 |
+------+-----+
only showing top 10 rows



In [16]:
# count ratings per user, least active users first
df.groupBy('userId').count().orderBy('count', ascending=True).show(10, False)

+------+-----+
|userId|count|
+------+-----+
|19200 |20   |
|24950 |20   |
|12210 |20   |
|20240 |20   |
|2572  |20   |
|30289 |20   |
|8389  |20   |
|2983  |20   |
|11025 |20   |
|5363  |20   |
+------+-----+
only showing top 10 rows



In [17]:
# show movies with the highest rating count
df.groupBy('title').count().orderBy('count', ascending=False).show(10, False)

+-----------------------------------------+-----+
|title                                    |count|
+-----------------------------------------+-----+
|Forrest Gump (1994)                      |81491|
|Shawshank Redemption, The (1994)         |81482|
|Pulp Fiction (1994)                      |79672|
|Silence of the Lambs, The (1991)         |74127|
|Matrix, The (1999)                       |72674|
|Star Wars: Episode IV - A New Hope (1977)|68717|
|Jurassic Park (1993)                     |64144|
|Schindler's List (1993)                  |60411|
|Braveheart (1995)                        |59184|
|Fight Club (1999)                        |58773|
+-----------------------------------------+-----+
only showing top 10 rows
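
The `groupBy(...).count().orderBy('count', ...)` pattern used in the cells above is a frequency count followed by a sort. As a rough plain-Python analogue on a handful of made-up rating events (the data is illustrative, not taken from the dataset), the same idea can be sketched with `collections.Counter`:

```python
from collections import Counter

# Toy (userId, title) rating events; illustrative data only.
events = [
    (1, "Forrest Gump (1994)"),
    (2, "Forrest Gump (1994)"),
    (1, "Pulp Fiction (1994)"),
    (3, "Forrest Gump (1994)"),
    (3, "Pulp Fiction (1994)"),
    (3, "Braveheart (1995)"),
]

# Count ratings per user and per title.
ratings_per_user = Counter(user for user, _ in events)
ratings_per_title = Counter(title for _, title in events)

# most_common(n) returns the n largest counts in descending order.
most_active = ratings_per_user.most_common(1)
most_rated = ratings_per_title.most_common(2)
print(most_active, most_rated)
```

In the real data, the ascending view shows a minimum of 20 ratings per user among the rows displayed, while the most active user has 32,202, so the activity distribution is heavily skewed.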



### **Feature Engineering**

In [18]:
# drop genres column
df = df.drop('genres')

In [19]:
# map each movie title to a numeric index
stringIndexer = StringIndexer(inputCol="title", outputCol="title_new")

In [20]:
# fit model
model = stringIndexer.fit(df)

In [21]:
# create new df with numerical columns
indexed = model.transform(df)

In [22]:
# view first 10 rows
indexed.show(10)

+-------+--------------------+------+------+----------+---------+
|movieId|               title|userId|rating| timestamp|title_new|
+-------+--------------------+------+------+----------+---------+
|    296| Pulp Fiction (1994)|     1|   5.0|1147880044|      2.0|
|    306|Three Colors: Red...|     1|   3.5|1147868817|    879.0|
|    307|Three Colors: Blu...|     1|   5.0|1147868828|    942.0|
|    665|  Underground (1995)|     1|   5.0|1147878820|   3265.0|
|    899|Singin' in the Ra...|     1|   3.5|1147868510|    524.0|
|   1088|Dirty Dancing (1987)|     1|   4.0|1147868495|    453.0|
|   1175| Delicatessen (1991)|     1|   3.5|1147868826|    844.0|
|   1217|          Ran (1985)|     1|   3.5|1147878326|   1175.0|
|   1237|Seventh Seal, The...|     1|   5.0|1147868839|   1209.0|
|   1250|Bridge on the Riv...|     1|   4.0|1147868414|    420.0|
+-------+--------------------+------+------+----------+---------+
only showing top 10 rows



In [23]:
indexed.groupBy('title_new').count().orderBy('count', ascending=False).show(10, False)

+---------+-----+
|title_new|count|
+---------+-----+
|0.0      |81491|
|1.0      |81482|
|2.0      |79672|
|3.0      |74127|
|4.0      |72674|
|5.0      |68717|
|6.0      |64144|
|7.0      |60411|
|8.0      |59184|
|9.0      |58773|
+---------+-----+
only showing top 10 rows
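
The counts for `title_new` 0.0 through 9.0 match the ten most-rated titles seen earlier (81,491 for index 0.0 is the Forrest Gump count). This is consistent with `StringIndexer`'s default ordering, which assigns the lowest index to the most frequent label. A minimal plain-Python sketch of that ordering follows; the helper `frequency_index` is hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def frequency_index(labels):
    """Assign 0.0 to the most frequent label, 1.0 to the next, and so on."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

titles = ["B", "A", "B", "C", "B", "A"]
index = frequency_index(titles)
print(index)
```

Because the index is built from `title` rather than `movieId`, two different movies that share an identical title string would receive the same index.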



In [24]:
# split dataset
train,test = indexed.randomSplit([0.75,0.25])

In [25]:
# check the train count
train.count()

18750663

In [26]:
# check the test count
test.count()

6249432
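
The two counts are close to, but not exactly, 75% and 25% of 25,000,095 because each row is assigned to a split at random rather than by position. `randomSplit` also accepts an optional `seed` argument, which makes the assignment repeatable. The sketch below illustrates the idea in plain Python; `random_split` and the seed value 42 are hypothetical choices for illustration.

```python
import random

def random_split(rows, train_fraction=0.75, seed=42):
    """Assign each row independently to train or test using a seeded RNG."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = random_split(rows)
train_b, test_b = random_split(rows)

# Same seed, same split; sizes are only approximately 75/25.
print(len(train_a), len(test_a), train_a == train_b)
```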

### **Modelling**

In [29]:
# ALS collaborative filtering; drop test rows whose user or item was unseen in training
rec = ALS(maxIter=10, regParam=0.01, userCol='userId',
          itemCol='title_new', ratingCol='rating', nonnegative=True,
          coldStartStrategy="drop")

In [30]:
rec_model=rec.fit(train)
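
To give some intuition for what the fit above does, here is a deliberately simplified sketch of alternating least squares with a single latent factor per user and per item. It is not Spark's implementation, which uses a higher rank, distributed solves, and the non-negativity constraint requested above; the function `als_rank1` and the toy ratings are hypothetical.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=10):
    """ratings maps (user, item) -> rating for the observed entries only."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed and solve each user's regularized least squares.
        for u in range(n_users):
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(item_f[i] ** 2 for (uu, i), _ in ratings.items() if uu == u) + reg
            user_f[u] = num / den
        # Hold user factors fixed and solve for each item.
        for i in range(n_items):
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(user_f[u] ** 2 for (u, ii), _ in ratings.items() if ii == i) + reg
            item_f[i] = num / den
    return user_f, item_f

# Toy ratings with rank-1 structure; the pair (user 0, item 2) is held out.
observed = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (1, 2): 3.0}
user_f, item_f = als_rank1(observed, n_users=2, n_items=3)

# Predicted rating for the unseen pair is the product of the two factors.
held_out_prediction = user_f[0] * item_f[2]
print(round(held_out_prediction, 2))
```

The held-out pair is predicted from the factors alone, which is the same mechanism `rec_model.transform` relies on when scoring user and movie pairs it has not seen together.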

### **Prediction and Evaluation**

In [31]:
# transform: predict ratings for the test set
predicted_ratings=rec_model.transform(test)

In [32]:
# print schema
predicted_ratings.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- userId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)
 |-- title_new: double (nullable = false)
 |-- prediction: float (nullable = false)



In [33]:
predicted_ratings.orderBy(rand()).show(10)

+-------+--------------------+------+------+----------+---------+----------+
|movieId|               title|userId|rating| timestamp|title_new|prediction|
+-------+--------------------+------+------+----------+---------+----------+
|   3104| Midnight Run (1988)| 28746|   4.0| 997899769|   1212.0|  3.983959|
|  40629|Pride & Prejudice...| 62334|   4.0|1158689007|   1025.0|  3.609287|
|    157|Canadian Bacon (1...|118622|   3.0|1035426446|   2150.0|   2.48535|
|   7160|      Monster (2003)|128431|   4.5|1112314365|   1389.0| 3.6843584|
|   2616|   Dick Tracy (1990)| 68206|   3.0|1153727745|   1089.0| 2.6426737|
|   1339|Dracula (Bram Sto...|121234|   5.0|1210748190|    602.0| 4.4354486|
|    858|Godfather, The (1...| 66617|   5.0| 944917059|     18.0|  4.676374|
|   5445|Minority Report (...| 41060|   4.0|1301659581|     89.0| 2.2280412|
|  37729| Corpse Bride (2005)| 70547|   4.0|1457224155|    807.0| 3.6856215|
|   3130|Bonfire of the Va...| 60836|   1.0| 943560967|   3671.0| 2.3837423|

In [35]:
evaluator = RegressionEvaluator(metricName='rmse',
                                predictionCol='prediction', labelCol='rating')

In [37]:
rmse=evaluator.evaluate(predicted_ratings)

In [38]:
print(rmse)

0.804182482079586
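
An RMSE of about 0.80 means the predicted rating is typically off by roughly 0.8 rating points on the test rows that were scored; rows dropped by `coldStartStrategy="drop"` do not contribute. For reference, the metric itself can be sketched in a few lines of plain Python. The numbers below are made up for illustration and are not taken from the test set.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values; illustrative data only.
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]
score = rmse(actual, predicted)
print(score)
```

Because errors are squared before averaging, a few large misses weigh more heavily than many small ones.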


In [39]:
unique_movies=indexed.select('title_new').distinct()

In [40]:
unique_movies.count()

58958

In [41]:
a = unique_movies.alias('a')

In [58]:
user_id=20

In [59]:
# apply filter
watched_movies = (indexed.filter(indexed['userId'] == user_id)
                  .select('title_new').distinct())

In [60]:
watched_movies.count()

78

In [61]:
b=watched_movies.alias('b')

In [62]:
total_movies = a.join(b, a.title_new == b.title_new,how='left')

In [63]:
total_movies.show(10,False)

+---------+---------+
|title_new|title_new|
+---------+---------+
|305.0    |null     |
|769.0    |null     |
|934.0    |null     |
|2734.0   |null     |
|299.0    |null     |
|1051.0   |null     |
|496.0    |null     |
|596.0    |null     |
|558.0    |null     |
|692.0    |null     |
+---------+---------+
only showing top 10 rows



In [64]:
remaining_movies=total_movies.where(col("b.title_new").isNull()).select(a.title_new).distinct()

In [65]:
remaining_movies.count()

58880
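
The left join followed by the `isNull` filter keeps only the titles with no match in the watched set, which is an anti-join, and the counts agree: 58,958 - 78 = 58,880. PySpark's `join` also supports `how='left_anti'`, which may express the same step more directly. The plain-Python sketch below shows the equivalent set difference on toy indices; the values are illustrative.

```python
# Toy item indices; illustrative data only.
all_items = {0.0, 1.0, 2.0, 3.0, 4.0}
watched = {1.0, 3.0}

# Anti-join as a set difference: everything the user has not rated yet.
remaining = all_items - watched

# Attach the user id to every candidate so each pair can be scored by the model.
user_id = 20
candidates = [(item, user_id) for item in sorted(remaining)]
print(candidates)
```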

In [66]:
remaining_movies=remaining_movies.withColumn("userId",lit(int(user_id)))

In [67]:
remaining_movies.show(10,False)

+---------+------+
|title_new|userId|
+---------+------+
|305.0    |20    |
|769.0    |20    |
|934.0    |20    |
|2734.0   |20    |
|299.0    |20    |
|1051.0   |20    |
|496.0    |20    |
|596.0    |20    |
|558.0    |20    |
|692.0    |20    |
+---------+------+
only showing top 10 rows



In [68]:
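# highest predicted rating first; the outputs below were captured in ascending order because a misspelled keyword (`ascendig`) is silently ignored by orderBy
recommendations=rec_model.transform(remaining_movies).orderBy('prediction',ascending=False)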

In [70]:
recommendations.show(5,False)

+---------+------+----------+
|title_new|userId|prediction|
+---------+------+----------+
|46111.0  |20    |0.20617905|
|37171.0  |20    |0.21556269|
|51187.0  |20    |0.23353639|
|46263.0  |20    |0.23403239|
|55352.0  |20    |0.2580851 |
+---------+------+----------+
only showing top 5 rows



In [71]:
movie_title = IndexToString(inputCol="title_new",outputCol="title",labels=model.labels)

In [72]:
final_recommendations=movie_title.transform(recommendations)

In [73]:
final_recommendations.show(10,False)

+---------+------+----------+-----------------------------+
|title_new|userId|prediction|title                        |
+---------+------+----------+-----------------------------+
|46111.0  |20    |0.20617905|Secrets (1971)               |
|37171.0  |20    |0.21556269|Desert Victory (1943)        |
|51187.0  |20    |0.23353639|El Greco (1966)              |
|46263.0  |20    |0.23403239|Silva (1981)                 |
|55352.0  |20    |0.2580851 |Sant'Antonio di Padova (2002)|
|54839.0  |20    |0.2580851 |Preferisco il paradiso (2010)|
|53682.0  |20    |0.2580851 |Maria Goretti (2003)         |
|55351.0  |20    |0.2580851 |Sant'Agostino (2010)         |
|46192.0  |20    |0.28431967|Shark Hunter (2001)          |
|46190.0  |20    |0.28431967|Shark Alarm (2004)           |
+---------+------+----------+-----------------------------+
only showing top 10 rows
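
A top-N recommendation list is normally ranked by predicted rating from highest to lowest, and titles with very few ratings are often filtered out because their predictions tend to be unreliable. The sketch below shows one way to combine both steps in plain Python; the `top_n` helper, the `min_ratings` threshold, and all values are hypothetical.

```python
import heapq

def top_n(scored, n=2, min_ratings=5):
    """scored: list of (title, predicted_rating, rating_count) tuples."""
    # Keep only titles with enough ratings to trust the prediction (assumed threshold).
    eligible = [row for row in scored if row[2] >= min_ratings]
    # nlargest returns the n highest predictions in descending order.
    return heapq.nlargest(n, eligible, key=lambda row: row[1])

# Toy predictions; illustrative data only.
scored = [
    ("A", 3.9, 120),
    ("B", 4.6, 300),
    ("C", 4.9, 1),    # high prediction but only one rating
    ("D", 4.1, 45),
]
best = top_n(scored)
print(best)
```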

