## A Hybrid Movie Recommendation System Project

### Business Problem and Stakeholder

A movie streaming platform is the primary stakeholder for this project. The platform offers a large catalog of movies and aims to improve user engagement, satisfaction, and retention by helping users quickly discover content that matches their preferences.

However, with thousands of available movies, users may struggle to identify titles they are likely to enjoy. Generic or poorly targeted recommendations can lead to user frustration, reduced platform usage, and increased churn. This challenge is further compounded by the cold-start problem, where new users or users with limited rating history receive less accurate recommendations.

The business problem addressed in this project is therefore how to suggest relevant movies to each user based on their historical interactions, while remaining effective when little rating data is available for that user.

To address this challenge, the project implements a hybrid recommendation system that combines collaborative filtering based on user ratings with content-based filtering using user-generated movie tags. The system is designed to generate top-5 personalized movie recommendations that enhance the user experience and support the platform’s business objectives.

### Project Objectives
The objective of this project is to design and evaluate a hybrid movie recommendation system that provides personalized movie suggestions to users based on their historical preferences and content similarity.
Specifically, this project aims to:
- Build a collaborative filtering recommendation model using user–movie rating data to predict unseen ratings.
- Evaluate the collaborative filtering model using appropriate regression metrics, such as **RMSE** and **MAE**, to assess performance on unseen data.
- Develop a content-based recommendation component using user-generated movie tags to identify similar movies.
- Integrate collaborative filtering and content-based approaches into a hybrid recommendation system to address the cold-start problem.
- Generate and present **top-5 personalized movie recommendations** for users in a clear and interpretable manner.
- Demonstrate a clear modeling workflow, including data preparation, validation strategy, and result interpretation, suitable for a data science audience.
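
One way to picture how the two components might be integrated is a weighted blend whose weight depends on how much rating history a user has. The sketch below is illustrative only: the function name `hybrid_score`, the `min_ratings` threshold, and the linear weighting are all assumptions made for this example, not the blending rule used in this project.

```python
def hybrid_score(cf_score, content_score, n_user_ratings, min_ratings=20):
    """Blend a collaborative-filtering score with a content-based score.

    Hypothetical rule: the collaborative weight grows linearly with the
    number of ratings the user has given and is capped at 1.0.
    Both scores are assumed to be on the same scale (e.g. 0.5 to 5.0).
    """
    if min_ratings <= 0:
        raise ValueError("min_ratings must be positive")
    cf_weight = min(n_user_ratings / min_ratings, 1.0)
    return cf_weight * cf_score + (1.0 - cf_weight) * content_score


# A brand-new user relies entirely on the content-based score,
# while a user with a long history relies on collaborative filtering.
cold_start = hybrid_score(cf_score=4.5, content_score=3.0, n_user_ratings=0)
established = hybrid_score(cf_score=4.5, content_score=3.0, n_user_ratings=50)
halfway = hybrid_score(cf_score=4.0, content_score=3.0, n_user_ratings=10)
```

With this kind of rule, a user with no ratings still receives tag-driven suggestions, which is the cold-start behavior the objectives call for.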


### Dataset Overview
This project uses the MovieLens small dataset (`ml-latest-small`), which contains roughly 100,000 user ratings along with movie metadata and user-generated tags. The dataset is commonly used for building and evaluating recommendation systems.
The following datasets are used in this project:
- `ratings.csv`: Contains user–movie interactions, including user IDs, movie IDs, ratings, and timestamps. This dataset forms the core input for training the collaborative filtering model.
- `movies.csv`: Contains movie titles and genre information associated with each movie ID. This dataset is used to display meaningful movie information in the final recommendations.
- `tags.csv`: Contains user-generated tags applied to movies. These tags are used to build a content-based recommendation component and to support a hybrid recommendation approach.
- `links.csv`: Contains external IMDb and TMDb identifiers and is not used in modeling.

In [1]:
# Loading the Datasets
import pandas as pd

ratings = pd.read_csv("ratings.csv")
movies = pd.read_csv("movies.csv")
tags = pd.read_csv("tags.csv")

### Preview the datasets

In [None]:
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [6]:
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [7]:
tags.head()

| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |


### Dataset Dimensions and Structure

In [9]:
print("Ratings shape:", ratings.shape)
ratings.info()

Ratings shape: (100836, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [10]:
print("Movies shape:", movies.shape)
movies.info()

Movies shape: (9742, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [11]:
print("Tags shape:", tags.shape)
tags.info()

Tags shape: (3683, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


### Validation Strategy

To evaluate the performance of the recommendation system and ensure that the model generalizes well to unseen data, a train–test split validation strategy is employed.

For the collaborative filtering component, the user–movie ratings data is divided into a training set and a test set. The model is trained exclusively on the training data, and its performance is evaluated on held-out test ratings that it did not see during training. Evaluating on held-out data exposes overfitting that training error alone would hide and provides a more realistic estimate of how the model would perform in a real-world setting.
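
As a minimal sketch of such a split, the snippet below holds out a random share of rating rows using pandas alone. The 80/20 ratio, the seed, and the `ratings_demo` frame are assumptions chosen for illustration; the project's actual split may use a different ratio or a library helper.

```python
import pandas as pd

# Tiny stand-in for ratings.csv (values are illustrative only).
ratings_demo = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "movieId": [1, 3, 6, 1, 47, 3, 6, 50, 1, 50],
    "rating":  [4.0, 4.0, 4.0, 3.5, 5.0, 2.0, 4.5, 5.0, 3.0, 4.0],
})

# Hold out 20% of the rows at random; the fixed seed makes the split repeatable.
test_demo = ratings_demo.sample(frac=0.2, random_state=42)

# Everything that was not sampled into the test set becomes training data.
train_demo = ratings_demo.drop(test_demo.index)
```

Because rows are sampled at random, a user with very few ratings can end up entirely in the test set, which is one reason cold-start users are harder to evaluate.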

Since the recommendation task involves predicting numerical ratings, regression-based evaluation metrics are used. Specifically, Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are used to measure the difference between predicted and actual user ratings on the test set.
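
To make the two metrics concrete, the sketch below computes them by hand: RMSE is the square root of the mean squared error, and MAE is the mean of the absolute errors. The helper names and the sample values are illustrative assumptions, not results from this project.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error: penalizes large errors more heavily."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean Absolute Error: average size of the error in rating points."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Illustrative values only, not results from this project.
actual_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

demo_rmse = rmse(actual_ratings, predicted_ratings)
demo_mae = mae(actual_ratings, predicted_ratings)
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two suggests that a few predictions are far off.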

For the content-based and hybrid recommendation components, quantitative evaluation is less straightforward due to the lack of explicit ground-truth labels for similarity-based recommendations. Therefore, these components are evaluated qualitatively by inspecting the relevance and coherence of the generated movie recommendations, particularly in cold-start scenarios where user rating history is limited.

This validation strategy ensures that the collaborative filtering model is evaluated rigorously, while the hybrid approach is assessed in a manner consistent with its intended real-world application.

### Data Preprocessing and Cleaning
Before building the recommendation models, the datasets are cleaned and prepared to ensure consistency and reliability. Since the MovieLens dataset is well-structured, preprocessing focuses primarily on validating data quality and preparing inputs for collaborative filtering and content-based modeling.

#### 1. Preprocessing the Ratings Data

The ratings dataset is inspected to confirm that all ratings are valid, user and movie identifiers are consistent, and no missing values are present.

In [12]:
# Check for missing values in ratings
ratings.isna().sum()


userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [13]:
# Confirm rating scale
ratings['rating'].unique()


array([4. , 5. , 3. , 2. , 1. , 4.5, 3.5, 2.5, 0.5, 1.5])
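
The checks above could be extended with explicit assertions about the rating scale and about duplicate user–movie pairs. The following is a small sketch on a stand-in frame; `ratings_demo` and the variable names are assumptions for illustration, and in the notebook the same expressions would be applied to `ratings`.

```python
import pandas as pd

# Small illustrative frame; in the notebook this would be the `ratings` DataFrame.
ratings_demo = pd.DataFrame({
    "userId":  [1, 1, 2, 2],
    "movieId": [1, 3, 1, 47],
    "rating":  [4.0, 4.5, 0.5, 5.0],
})

# Ratings should lie in [0.5, 5.0] and be multiples of 0.5.
in_range = ratings_demo["rating"].between(0.5, 5.0)
half_steps = (ratings_demo["rating"] * 2 % 1 == 0)
all_valid = bool((in_range & half_steps).all())

# Each user should rate a given movie at most once.
n_duplicates = int(ratings_demo.duplicated(subset=["userId", "movieId"]).sum())
```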

In [14]:
# Drop timestamp (not required for modeling)
ratings = ratings.drop(columns=['timestamp'])


#### 2. Preprocessing the Movies Data

The movies dataset is used for mapping movie IDs to titles and genres.

In [15]:
# Check for missing values
movies.isna().sum()


movieId    0
title      0
genres     0
dtype: int64

In [16]:
# Ensure movieId uniqueness
movies['movieId'].is_unique


True

#### 3. Preprocessing the Tags Data

Tags are used to build the content-based recommendation component.

In [17]:
# Check for missing values
tags.isna().sum()


userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

In [18]:
# Convert tags to lowercase for consistency
tags['tag'] = tags['tag'].str.lower()


In [19]:
# Remove timestamp column
tags = tags.drop(columns=['timestamp'])


#### 4. Aligning Movie IDs Across Datasets

To ensure consistency across datasets, only movies that appear in the ratings data are retained in the movies and tags datasets.

In [20]:
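# Movie IDs that have at least one rating
valid_movie_ids = set(ratings['movieId'])

# Keep only rated movies; .copy() avoids SettingWithCopyWarning on later edits
movies = movies[movies['movieId'].isin(valid_movie_ids)].copy()
tags = tags[tags['movieId'].isin(valid_movie_ids)].copy()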
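
After alignment, it can be useful to check how many rated movies have no tags at all, since those titles cannot be described by the content-based component. The sketch below shows one way to measure this on stand-in data; `movies_demo`, `tags_demo`, and `coverage` are hypothetical names, and the numbers are not taken from the real dataset.

```python
import pandas as pd

# Illustrative stand-ins for the aligned `movies` and `tags` DataFrames.
movies_demo = pd.DataFrame({"movieId": [1, 2, 3, 4]})
tags_demo = pd.DataFrame({
    "movieId": [1, 1, 2],
    "tag": ["pixar", "fun", "fantasy"],
})

# Movies that never received a tag have no content-based representation.
tagged_ids = set(tags_demo["movieId"])
untagged = movies_demo[~movies_demo["movieId"].isin(tagged_ids)]

# Share of movies that have at least one tag.
coverage = 1 - len(untagged) / len(movies_demo)
```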


#### 5. Preparing Tags for Content-Based Modeling

Tags are aggregated at the movie level to create a single text representation per movie.

In [21]:
# Combine all tags for a movie into one space-separated string
movie_tags = (
    tags.groupby('movieId')['tag']
    .agg(' '.join)
    .reset_index()
)

movie_tags.head()


| | movieId | tag |
|---|---|---|
| 0 | 1 | pixar pixar fun |
| 1 | 2 | fantasy magic board game robin williams game |
| 2 | 3 | moldy old |
| 3 | 5 | pregnancy remake |
| 4 | 7 | remake |
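
To illustrate how these per-movie tag strings can support similarity-based recommendations, the sketch below compares them with a plain bag-of-words cosine similarity. This is an assumption made for illustration: the project may use a different vectorizer (for example TF-IDF), and `cosine_similarity` here is a hand-written helper, not a library call. Note that splitting on whitespace breaks multi-word tags such as "board game" into separate tokens.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two whitespace-tokenized tag strings."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a movie with no tags is similar to nothing
    return dot / (norm_a * norm_b)


# Tag strings taken from the movie_tags preview.
tag_text = {
    1: "pixar pixar fun",
    5: "pregnancy remake",
    7: "remake",
}

sim_5_7 = cosine_similarity(tag_text[5], tag_text[7])  # share the tag "remake"
sim_1_7 = cosine_similarity(tag_text[1], tag_text[7])  # no tags in common
```

Movies 5 and 7 share a tag and therefore score above zero, while movies 1 and 7 have no overlap and score exactly zero.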
