## Step 1: Dataset Description and Objective

**Dataset Description:**

The dataset for this project comprises two files, "customer_ratings.txt" and "movie_titles.csv", each providing essential information for building a recommendation engine.

`customer_ratings.txt:`

* ID: Unique identifiers for customers and movies.
* Rating: User ratings for movies.
* Year: The year when the movie was released.

`movie_titles.csv:`

* Movie ID: Unique keys to identify movies.
* Year: The year when the movie was released.
* Movie Name: Titles of movies corresponding to their IDs.

**Objective of the Project:**

The aim of this project is to develop a recommendation engine for an Over-The-Top (OTT) platform or streaming service. The specific objectives are:

* Create Personalized Recommendation Model: Develop a model suggesting best-suited movies for users based on their preferences and past ratings.

## Step 2: Import Necessary Libraries

In [None]:
# Install scikit-surprise first (the surprise imports below need it)
!pip install scikit-surprise

# Importing necessary libraries
import math
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing Surprise for recommendation system
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

## Step 3: Data Loading

In [None]:
data = pd.read_csv(r"E:\DS & ML Syllabus\DS and ML projects intellipat\netflix by harsh\customer_ratings.txt", 
                   header = None, names = ['Cust_Id', 'Rating'], usecols = [0,1])

## Step 4: Exploratory Data Analysis (EDA)

In [None]:
# Display first 5 rows
data.head()

* Here, the data is stored in a nested layout rather than as one complete record per row.
* For example, if you look at row number 0:
    * "1:" is a movie ID row. It has no rating, so `Rating` shows NaN by default.
    * The rows below it hold the customer IDs and their ratings for movie number "1".
* Similarly, data is stored for movie IDs "2:", "3:", and so on.
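
The layout described above can be reproduced on a tiny in-memory sample. This is a minimal sketch: the customer IDs, ratings and the ignored third field are made-up values, and only the shape of the file (a movie ID row followed by rating rows) is taken from the notebook.

```python
import io
import pandas as pd

# Hypothetical excerpt in the same layout as the raw ratings file:
# a "movie_id:" line, followed by "customer_id,rating,<third field>" lines.
raw = (
    "1:\n"
    "1488844,3,2005\n"
    "822109,5,2005\n"
    "2:\n"
    "2059652,4,2005\n"
)

# Same read_csv arguments as the loading cell: no header, keep the first two columns.
sample = pd.read_csv(
    io.StringIO(raw), header=None, names=["Cust_Id", "Rating"], usecols=[0, 1]
)
print(sample)

# Movie ID rows are exactly the rows where Rating is NaN.
header_rows = sample[sample["Rating"].isna()]
print(header_rows["Cust_Id"].tolist())
```

Because the movie ID rows carry no rating, `Rating.isna()` is enough to tell them apart from real rating rows, which is the property the rest of the EDA relies on.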

In [None]:
# Display last 5 rows
data.tail()

In [None]:
# checking shape of data
data.shape

* In this dataset, we have more than 24 million entries, and the data is stored in its raw form.

In [None]:
# It helps to understand the data type and information about data
data.info()

In [None]:
# To get the total number of movies, 
# we can simply count the number of null values in the rating column. 
# Null values in the rating column correspond to movie IDs stored in the customer_id column. 
# By counting these null values, we can determine the total number of movies in the dataframe.
movie_count = data.isna().sum()
movie_count = movie_count['Rating']
movie_count

* So, the total number of movies in this dataset is 4499.

In [None]:
# Getting the count of unique customers we have in this dataset.
customer_count = data['Cust_Id'].nunique()
customer_count

* Here, 475257 is the number of unique values in `Cust_Id`, a column that holds both customer IDs and movie ID rows.
* To count only customers, we need to subtract the number of movie IDs from this value.

In [None]:
# So we are removing movie_id from this count
customer_count = customer_count - movie_count
customer_count

* So, the total number of unique customers is 470758.

In [None]:
# Calculating the total number of ratings provided by customers.
rating_count = data['Cust_Id'].count()-movie_count
rating_count

* We already know that we have 4,499 movies. A customer can watch any movie multiple times, but each customer's rating for a movie is stored only once and updated if they change it. In total, 470,758 customers have given 24,053,764 ratings.
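
As a cross-check on the counting logic, the three totals can also be computed by masking out the movie ID rows directly instead of subtracting. The sketch below uses a small made-up frame (the IDs and ratings are illustrative only) and confirms that both approaches agree.

```python
import numpy as np
import pandas as pd

# Toy frame in the raw layout: movie ID rows have NaN ratings.
toy = pd.DataFrame({
    "Cust_Id": ["1:", "10", "20", "2:", "10", "30"],
    "Rating": [np.nan, 4.0, 3.0, np.nan, 5.0, 1.0],
})

is_header = toy["Rating"].isna()

movie_count = int(is_header.sum())                          # movie ID rows
customer_count = toy.loc[~is_header, "Cust_Id"].nunique()   # unique real customers
rating_count = int((~is_header).sum())                      # rows that carry a rating

# The subtraction used in the notebook gives the same customer count,
# because strings such as "1:" can never collide with a customer ID.
customers_by_subtraction = toy["Cust_Id"].nunique() - movie_count

print(movie_count, customer_count, rating_count, customers_by_subtraction)
```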

In [None]:
# Count how many ratings fall into each star level (1 to 5)
stars = data.groupby('Rating')['Rating'].agg(['count'])

In [None]:
stars

* 1,118,186 ratings are 1-star.
* 2,439,073 ratings are 2-star.


In [None]:
# Bar plot showing the distribution of ratings
ax = stars.plot(kind='barh', legend=False, figsize=(10,5))
plt.title(f'Total pool: {movie_count} Movies, {customer_count} Customers, {rating_count} ratings given', fontsize=14, loc='left')
plt.grid(True)
plt.xlabel("Frequency")
plt.show()

In [None]:
data.head(4)

### 4.1 Feature Engineering:
* Creating relevant features from existing data.

In [None]:
# Adding another column in the dataset named 'Movie_Id'.
# First, we flag the rows whose rating is null. These are the movie ID rows,
# so their number equals the total number of movies in the dataset.
df_nan = pd.DataFrame(data['Rating'].isnull())
df_nan

In [None]:
# Filtering data, for getting rows where Movie_Id is present
df_nan = df_nan[df_nan['Rating']==True]
df_nan

* From the above, the block for movie 1 spans index 0 to 547: index 0 is the movie ID row and indices 1 to 547 hold its ratings.
* How do we get the number of ratings per movie?
* The number of customers who rated movie 1 is 548 − 0 − 1 = 547. The next movie ID row is at index 548, and the −1 excludes movie 1's own ID row at index 0.
* The number of customers who rated movie 2 is 694 − 548 − 1 = 145, because the next movie ID row is at index 694.
* The same applies to the other movies.
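
The index arithmetic above can be checked with `np.diff`. In this sketch the movie ID row positions 0, 548 and 694 come from the text, while `total_rows` is a made-up value standing in for the length of the data.

```python
import numpy as np

# Positions of the movie ID rows (0, 548, 694 are taken from the text above).
header_idx = np.array([0, 548, 694])

# Hypothetical total number of rows, used only to close the last block.
total_rows = 2707

# Distance between consecutive movie ID rows, minus 1 for the ID row itself.
boundaries = np.append(header_idx, total_rows)
ratings_per_movie = np.diff(boundaries) - 1

print(ratings_per_movie)
```

The first two entries are 547 and 145, matching the counts worked out by hand for movie 1 and movie 2.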

In [None]:
# now we will reset the index as the column
df_nan = df_nan.reset_index()

In [None]:
df_nan

In [None]:
# Pair each movie ID row index with the previous one to see the block boundaries
x=zip(df_nan['index'][1:], df_nan['index'][:-1])
tuple(x)[:10]

In [None]:
# Build a NumPy array holding the movie ID for every rating row in the dataset

movie_np = []
movie_id = 1

for i,j in zip(df_nan['index'][1:],df_nan['index'][:-1]):
    # i is the next movie ID row and j the current one, so the block has i-j-1 rating rows.
    # Example: movie 1 gets 548-0-1 = 547 entries filled with the value 1.
    temp = np.full((1,i-j-1), movie_id)
    movie_np = np.append(movie_np, temp)
    movie_id += 1
    
# Account for the last movie, whose block runs to the end of the data
last_record = np.full((1,len(data) - df_nan.iloc[-1, 0] - 1),movie_id)
movie_np = np.append(movie_np, last_record)

print(f'Movie numpy: {movie_np}')
print(f'Length: {len(movie_np)}')
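
The loop above calls `np.append` once per movie, which copies the growing array each time. A possible alternative, sketched here on made-up data, is a cumulative sum over the "is this a movie ID row" flag. Like the loop, it assumes movie IDs are consecutive and start at 1.

```python
import numpy as np
import pandas as pd

# Toy frame in the raw layout (values are illustrative only).
toy = pd.DataFrame({
    "Cust_Id": ["1:", "10", "20", "2:", "10", "3:", "30", "40", "50"],
    "Rating": [np.nan, 4.0, 3.0, np.nan, 5.0, np.nan, 1.0, 2.0, 5.0],
})

# Each movie ID row bumps the running counter by one, so every rating row
# inherits the ID of the most recent movie ID row above it.
is_header = toy["Rating"].isna()
movie_id = is_header.cumsum()

# Keep only rating rows and attach the derived movie ID.
ratings = toy.loc[~is_header].copy()
ratings["Movie_Id"] = movie_id[~is_header].astype(int)
ratings["Cust_Id"] = ratings["Cust_Id"].astype(int)

print(ratings)
```

This produces the same `Movie_Id` values as the loop in a single vectorized pass.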

In [None]:
data.head(5)

In [None]:
data.info()

In [None]:
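# Keep only the rating rows, then add the Movie_Id column and cast the IDs to int.
# .copy() avoids pandas' SettingWithCopyWarning when adding columns to a filtered frame.
data = data[pd.notnull(data['Rating'])].copy()
data['Movie_Id'] = movie_np.astype(int)
data['Cust_Id'] = data['Cust_Id'].astype(int)
print("Now the dataset will look like: ")
data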

In [None]:
data.info()

### 4.2 Data cleaning and Adding BenchMark:

In this step, our focus is on refining the dataset to meet specific criteria for analysis efficiency and reliability.

Establishing a benchmark involves setting minimum standards for dataset inclusion. This is vital to optimize our resources and ensure our analysis is grounded in meaningful data.

The benchmark criteria include:

* `Filtering Users`: We exclude users who have rated only a small number of movies. Such users give the model very little signal, and this step may also reduce the influence of potentially inauthentic ratings.

* `Filtering Movies`: Similarly, we exclude movies with only a few ratings, even if they have high average ratings. For example, a movie like movieID = 65, despite receiving a high average rating (such as 5 stars), may not provide sufficient data for robust analysis if it's rated by only a handful of individuals.

Implementing these benchmarks aims to ensure our recommendation engine relies on a more substantial and reliable dataset. This approach enhances recommendation accuracy by leveraging diverse and representative data, ultimately improving the overall user experience.
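
A minimal sketch of the benchmark idea on made-up data: count the ratings per movie, take the 70th percentile of those counts as the cutoff, and drop everything below it. The toy frame and its values are assumptions for illustration; the quantile level 0.7 mirrors the cells that follow.

```python
import pandas as pd

# Toy ratings: movies 10 and 40 have three ratings each, movies 20 and 30 have two.
toy = pd.DataFrame({
    "Cust_Id":  [1, 2, 3, 1, 2, 1, 4, 1, 2, 3],
    "Movie_Id": [10, 10, 10, 20, 20, 30, 30, 40, 40, 40],
    "Rating":   [5, 4, 3, 2, 5, 1, 4, 3, 5, 4],
})

# Number of ratings per movie, and the 70th percentile of those counts.
movie_counts = toy.groupby("Movie_Id")["Rating"].count()
movie_benchmark = round(movie_counts.quantile(0.7), 0)

# Movies strictly below the benchmark are dropped.
drop_movies = movie_counts[movie_counts < movie_benchmark].index
kept = toy[~toy["Movie_Id"].isin(drop_movies)]

print(movie_benchmark)
print(sorted(kept["Movie_Id"].unique()))
```

The same pattern, grouped by `Cust_Id` instead of `Movie_Id`, gives the user benchmark.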

In [None]:
# creating a list where we store aggregation var name
f = ['count','mean']

In [None]:
# Grouping data a/c to movie_id
dataset_movie_summary = data.groupby('Movie_Id').agg(f)

In [None]:
# Viewing grouped data
dataset_movie_summary

* In the output above, the count and mean of `Cust_Id` are not meaningful, so we aggregate only the `Rating` column.

In [None]:
# So now we aggregate only the 'Rating' column, grouped by 'Movie_Id'
dataset_movie_summary = data.groupby('Movie_Id')['Rating'].agg(f)

In [None]:
# Now viewing it
dataset_movie_summary

In [None]:
# 70th percentile of the per-movie rating counts:
# movies with fewer ratings than this value will be dropped
dataset_movie_summary["count"].quantile(0.7)

* We keep only movies with about 1,800 ratings or more. This is the 70th percentile of rating counts, so roughly the most-rated 30% of movies remain.

In [None]:
# Now we will create the movie benchmark:
# the 70th percentile of rating counts, rounded (about 1,800 ratings)
movie_benchmark=round(dataset_movie_summary['count'].quantile(0.7),0)
movie_benchmark

In [None]:
dataset_movie_summary['count']

* Next, we remove the movies that have fewer than 1,799 ratings.
* For example, movie IDs 1, 2, 4, 5, 4495, 4497 and 4498 are removed.

In [None]:
# Movies with fewer ratings than the benchmark (1,799)
drop_movie_list = dataset_movie_summary[dataset_movie_summary['count']<movie_benchmark].index
drop_movie_list

In [None]:
# Next, we'll filter out inactive users, i.e. those who have rated only a few movies
dataset_cust_summary=data.groupby('Cust_Id')['Rating'].agg(f)
dataset_cust_summary

In [None]:
cust_benchmark=round(dataset_cust_summary['count'].quantile(0.7),0)
cust_benchmark

* We keep only users who have rated at least 52 movies.
* Below, we drop the users who have rated fewer than 52 movies.

In [None]:
# Filtering Customers Who Have Rated Fewer Than 52 Movies.
drop_cust_list=dataset_cust_summary[dataset_cust_summary['count']<cust_benchmark].index
drop_cust_list

In [None]:
# Just checking original dataframe size
# So further we will observe changes because we will remove all the customers and movies that are below the benchmark
print('The original dataframe has: ', data.shape, 'shape')

In [None]:
# Now we are removing customers and movies which are below the benchmark.
data = data[~data['Movie_Id'].isin(drop_movie_list)]
data = data[~data['Cust_Id'].isin(drop_cust_list)]
print('After the trimming, the shape is: {}'.format(data.shape))

In [None]:
data.head()

### 4.3 Integrating 'movie_title' Column from Another DataFrame into the Current DataFrame:

* In this step, we merge data from another dataframe to enrich our movie information.

Since directly recommending movies based on their IDs may not provide meaningful insights to users, we opt to enhance our dataset by incorporating a file containing the title names of movies. This allows for a more user-friendly presentation of movie recommendations.

Our movie title data was originally a plain-text file, which we converted to CSV so that it can be loaded alongside the ratings DataFrame. The titles may contain characters that are not valid UTF-8, so reading the file with the default encoding can raise a Unicode decode error. Passing an explicit encoding when loading the file avoids this.
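
The encoding issue can be reproduced with a couple of bytes. This is a minimal sketch: the title row is invented, and it assumes the file was written in ISO-8859-1 (Latin-1), which is the encoding used when the titles are loaded.

```python
import io
import pandas as pd

# Hypothetical title row containing an accented character, encoded as ISO-8859-1.
raw_bytes = "1,2001,Café Example\n".encode("ISO-8859-1")

# Decoding the same bytes as UTF-8 fails, because 0xE9 ("é" in Latin-1)
# is not a valid UTF-8 sequence on its own.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Telling pandas the actual encoding reads the row correctly.
titles = pd.read_csv(
    io.BytesIO(raw_bytes),
    encoding="ISO-8859-1",
    header=None,
    names=["Movie_Id", "Year", "Name"],
)

print(utf8_ok)
print(titles)
```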

In [None]:
# Loading dataset
df_title = pd.read_csv(r"E:\DS & ML Syllabus\DS and ML projects intellipat\netflix by harsh\movie_titles.csv",  encoding='ISO-8859-1', header=None, usecols=[0,1,2], names=['Movie_Id','Year','Name' ])
df_title.set_index('Movie_Id', inplace=True)

In [None]:
# Viewing new dataset
df_title.head()

## Step 5: Data Visualization

In [None]:
# No. of Movies Released per Year

plt.figure(figsize=(10, 5))
plt.hist(df_title['Year'], bins=20)
plt.xlabel('Year')
plt.ylabel('Frequency')
plt.title('Distribution of Movies Released per Year');

* The plot shows that the number of movies in the dataset increases with release year.

In [None]:
# Number of Movies Released per decade
plt.figure(figsize=(10, 5))
(df_title['Year'] // 10 * 10).value_counts().sort_index().plot(kind='bar')
plt.xlabel('Decade')
plt.ylabel('Number of Movies Released')
plt.title('Number of Movies Released per decade');

* The plot shows a significant increase in the number of movies released after 1980. One possible explanation is the spread of cheaper televisions and improved electricity supply, although the dataset alone cannot confirm this.

In [None]:
# Average Movie Rating Over Time

merged_data = data.merge(df_title, on="Movie_Id")

# Calculate average rating per year
average_ratings = merged_data.groupby("Year")["Rating"].mean()

# Create line plot
plt.figure(figsize=(10, 5))  # Adjust figure size as desired
plt.plot(average_ratings.index, average_ratings.values)
plt.xlabel("Release Year")
plt.ylabel("Average Rating")
plt.title("Average Movie Rating Over Time")
plt.grid(True)
plt.show()

* In this plot, movies released between 1940 and 1970 receive consistently high average ratings. One possible reason is that many high-quality movies were released in this period, while movies were still a relatively new form of entertainment. The data does not confirm this explanation.

## Step 6: Implementing a Netflix Recommendation Engine

In [None]:
# The scikit-surprise package is a Python library for building and analyzing recommender systems.
# Install it with "conda install -c conda-forge scikit-surprise" or with pip:
!pip install scikit-surprise

In [None]:
# Model building
import math
import seaborn as sns
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

In [None]:
# Reader tells Surprise how to parse the ratings; its default rating scale is 1 to 5
reader=Reader()

In [None]:
# Keep only the three columns Surprise expects, in the order: user, item, rating
df = pd.DataFrame(data, columns=['Cust_Id', 'Movie_Id', 'Rating'])

# We only work with the first 100K rows for quicker runtime, as even 12GB of RAM in Google Colab may not be sufficient.
# Note: the rows are ordered by Movie_Id, so this subset covers only the lowest-numbered movies.
df_subset = df[['Cust_Id', 'Movie_Id', 'Rating']][:100000]

# Load the data into a Surprise dataset
# Load the DataFrame subset into a Surprise dataset using the previously defined 'reader'
dataset = Dataset.load_from_df(df_subset, reader)
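
Because the ratings are ordered by `Movie_Id`, slicing the first rows keeps only a few movies. One possible alternative is a random sample of the same size, sketched below on a made-up frame; the sizes, the `random_state` value and the frame itself are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy ratings ordered by Movie_Id: 10 movies with 100 ratings each.
toy = pd.DataFrame({
    "Cust_Id": list(range(1000)),
    "Movie_Id": [m for m in range(1, 11) for _ in range(100)],
    "Rating": [(i % 5) + 1 for i in range(1000)],
})

# Slicing the first rows only reaches the first couple of movies...
first_rows = toy[:200]

# ...while a random sample of the same size spreads across the catalog.
random_rows = toy.sample(n=200, random_state=42)

movies_in_slice = first_rows["Movie_Id"].nunique()
movies_in_sample = random_rows["Movie_Id"].nunique()
print(movies_in_slice, movies_in_sample)
```

With the real data, the equivalent call would be `df.sample(n=100000, random_state=...)` before `Dataset.load_from_df`, at the same memory cost as the slice.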

In [None]:
dataset

In [None]:
# Initialize the SVD algorithm (matrix-factorization collaborative filtering)
svd = SVD()

# Evaluate the algorithm with cross-validation, reporting RMSE and MAE.
# verbose=True prints the results for each fold.
cross_validate(svd, dataset, measures=['RMSE', 'MAE'], verbose=True)

# cross_validate only fits the model on each fold's training split,
# so refit on the full subset before calling svd.predict() later on
trainset = dataset.build_full_trainset()
svd.fit(trainset)
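
To make the model less of a black box, here is a rough NumPy sketch of the kind of biased matrix factorization that Surprise's `SVD` performs: each prediction is the global mean plus a user bias, an item bias and the dot product of user and item factor vectors, all learned by stochastic gradient descent. This is an illustration only, not Surprise's implementation; the function names, toy ratings and hyperparameters are assumptions.

```python
import numpy as np

def fit_biased_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
                  lr=0.01, reg=0.02, seed=0):
    """Learn mu, user/item biases and factor matrices with plain SGD."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)
    P = rng.normal(0, 0.1, (n_users, n_factors))   # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))   # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()                     # use the pre-update user vector
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q

def predict(model, u, i):
    """Estimated rating, clipped to the 1-5 scale."""
    mu, b_u, b_i, P, Q = model
    return float(np.clip(mu + b_u[u] + b_i[i] + P[u] @ Q[i], 1, 5))

# Toy (user, item, rating) triples with 0-based indices.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5),
           (2, 0, 1), (2, 2, 5), (3, 1, 2), (3, 2, 4)]
model = fit_biased_mf(ratings, n_users=4, n_items=3)

# Compare the training error with always predicting the global mean.
train_rmse = np.sqrt(np.mean([(r - predict(model, u, i)) ** 2 for u, i, r in ratings]))
baseline_rmse = np.sqrt(np.mean([(r - model[0]) ** 2 for _, _, r in ratings]))
print(train_rmse, baseline_rmse)
```

A user or movie that never appears in the training data has no learned bias or factors, so its estimate falls back toward the global mean. That is why the choice of training subset matters for the recommendations.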

In [None]:
data.head()

### 6.1 Predicting Movies for 'cust_id 712664':
* Apply machine learning techniques to predict movies for a specific customer.

In [None]:
# So first, we select user 712664 and attempt to recommend some movies based on their past data.
# Since the user has rated many movies with 5 stars, we'll use those ratings as a reference.
# This will help us understand the type of movies they like to watch.

# Filter the dataset to include only movies rated 5 stars by user 712664
dataset_712664 = data[(data['Cust_Id'] == 712664) & (data['Rating'] == 5)]
dataset_712664 = dataset_712664.set_index('Movie_Id')
dataset_712664 = dataset_712664.join(df_title)['Name']

# Display the filtered dataset
dataset_712664

* So from the filtered dataset above, we can see the movies that user 712664 rated with 5 stars.

In [None]:
df_title

In [None]:
# Now we will build the recommendation list.
# First, we copy the movie titles DataFrame so that we can add columns to the copy
# without changing df_title. (DataFrame.copy() makes a deep copy by default.)

# Create a copy of the movie titles DataFrame
user_712664 = df_title.copy()

# Display the copied dataset
user_712664

In [None]:
# just resetting index
user_712664 = user_712664.reset_index()
user_712664

In [None]:
# Filtering out movies for user_712664 based on benchmark criteria
user_712664 = user_712664[~user_712664['Movie_Id'].isin(drop_movie_list)]
user_712664

In [None]:
# Applying estimator function on SVD decomposition for user 712664
user_712664['Estimate_Score'] = user_712664['Movie_Id'].apply(lambda x: svd.predict(712664, x).est)

# Dropping 'Movie_Id' column from the dataset
user_712664 = user_712664.drop('Movie_Id', axis=1)

### 6.2 Recommended Movies for Customer "cust_id 712664":

In [None]:
# Sorting user_712664 dataframe based on the estimated scores in descending order
user_712664 = user_712664.sort_values('Estimate_Score', ascending=False)

# Printing the sorted dataframe
print(user_712664)

* The movies at the top of this list have the highest estimated ratings, so this user is likely to enjoy them.

## Step 7: Model Saving and Loading

In [None]:
import pickle

In [None]:
# Save the model to a file
with open('../models/recommendation_model.pkl', 'wb') as f:
    pickle.dump(svd, f)

In [None]:
# Load the model from the file
with open('../models/recommendation_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
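
The save and load cells assume that the `../models` directory already exists. The sketch below shows the same round trip with the directory created first. It uses a temporary folder and a plain dictionary as a stand-in for the fitted model, both of which are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted SVD object (any picklable Python object works the same way).
model_stub = {"algo": "SVD", "note": "placeholder for the fitted model"}

# Create the target directory first; open(..., 'wb') does not create missing folders.
model_dir = os.path.join(tempfile.mkdtemp(), "models")
os.makedirs(model_dir, exist_ok=True)
model_path = os.path.join(model_dir, "recommendation_model.pkl")

# Save, then load again.
with open(model_path, "wb") as f:
    pickle.dump(model_stub, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)
```

Pickle files can execute code when loaded, so only unpickle model files that come from a trusted source.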

In [None]:
def get_user_recommendations(user_id, data, df_title, drop_movie_list, loaded_model):
    """Return the benchmark-passing movies sorted by estimated rating for user_id.

    `data` is not needed for scoring; it is kept so that existing calls keep working.
    """
    # Start from every movie title and drop the movies below the movie benchmark
    user_movies = df_title.reset_index()
    user_movies = user_movies[~user_movies['Movie_Id'].isin(drop_movie_list)].copy()

    # Estimate the user's rating for each remaining movie with the loaded SVD model
    user_movies['Estimate_Score'] = user_movies['Movie_Id'].apply(lambda x: loaded_model.predict(user_id, x).est)

    # Drop the 'Movie_Id' column and sort by estimated score, highest first
    user_movies = user_movies.drop('Movie_Id', axis=1)
    user_movies = user_movies.sort_values('Estimate_Score', ascending=False)
    return user_movies

# Example of how to use the function for recommendations for user 712664
user_id = 712664
user_recommendations = get_user_recommendations(user_id, data, df_title, drop_movie_list, loaded_model)
print(user_recommendations)
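
The function above scores every remaining movie, including ones the user has already rated. A possible refinement, sketched here with made-up data, is to skip already-rated movies and return only the top few. The helper name `top_n_unseen`, the `score_fn` parameter and the stand-in scorer are hypothetical; with the real model, `score_fn` could be `lambda u, m: loaded_model.predict(u, m).est`.

```python
import pandas as pd

def top_n_unseen(user_id, ratings, titles, score_fn, n=3):
    """Return the n highest-scoring movies the user has not rated yet."""
    seen = set(ratings.loc[ratings["Cust_Id"] == user_id, "Movie_Id"])
    candidates = titles[~titles["Movie_Id"].isin(seen)].copy()
    candidates["Estimate_Score"] = candidates["Movie_Id"].apply(
        lambda movie_id: score_fn(user_id, movie_id)
    )
    return candidates.sort_values("Estimate_Score", ascending=False).head(n)

# Toy data: user 7 has already rated movies 1 and 2.
ratings = pd.DataFrame({
    "Cust_Id": [7, 7, 8],
    "Movie_Id": [1, 2, 3],
    "Rating": [5.0, 4.0, 3.0],
})
titles = pd.DataFrame({"Movie_Id": [1, 2, 3, 4], "Name": ["A", "B", "C", "D"]})

def fake_score(user_id, movie_id):
    """Stand-in scorer for illustration; a real call would use the fitted model."""
    return 1.0 + movie_id

recs = top_n_unseen(7, ratings, titles, fake_score, n=2)
print(recs)
```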


* The function reproduces the ranking from Section 6.2 using the reloaded model. The titles at the top are the ones this user is most likely to enjoy.

## Step 8: Conclusion

This project aimed to develop a personalized recommendation engine for an OTT platform using user ratings and movie metadata. The dataset comprised user ratings and movie titles, and the primary goal was to recommend movies tailored to individual user preferences.

### 8.1 Key Steps and Findings:


**Data Exploration and Preprocessing:** We began with two datasets: customer_ratings.txt and movie_titles.csv, containing user ratings and movie information, respectively.
The initial exploration revealed the structure and format of the data, including the presence of NaN values representing movie IDs.
Through feature engineering, we created a cohesive dataset by integrating movie IDs into the customer ratings data.

**Data Cleaning and Benchmarking:** We established benchmarks to filter out movies and users with insufficient ratings. Specifically, we retained only movies rated by at least 1,799 users and users who rated at least 52 movies. This ensured our analysis was based on substantial and reliable data.

**Data Integration:** We merged the movie titles dataset with the ratings data, enriching our dataset with movie names and release years. This integration facilitated more meaningful recommendations by presenting movie titles instead of mere IDs.

**Data Visualization:** Visualizations provided insights into the distribution of movies released per year and decade, as well as average movie ratings over time. These visualizations highlighted trends and patterns, such as the increase in movie releases over the decades and high average ratings from the 1940s to the 1970s.

**Recommendation Engine Development:** Using the scikit-surprise library, we applied the Singular Value Decomposition (SVD) algorithm, a popular collaborative filtering technique for recommendation systems.
A 100,000-row subset of the filtered ratings was loaded into a Surprise dataset, and cross-validation with RMSE and MAE was used to evaluate the model.

### 8.2 Final Thoughts:

The recommendation engine built in this project leverages collaborative filtering to provide personalized movie recommendations to users based on their past ratings and preferences. By cleaning the data and setting benchmarks, we ensured that the recommendations are based on reliable and significant user interactions. This engine is a crucial step toward enhancing user experience on an OTT platform, offering tailored suggestions that align with individual user tastes.

### 8.3 Future Improvements
Future improvements could include incorporating additional features such as movie genres, user demographics, and more advanced recommendation algorithms to further refine and enhance the recommendation capabilities of the system.