# Hybrid Movie Recommendation System

This project builds a hybrid recommendation engine using:

- Collaborative Filtering (SVD)
- Content-Based Filtering (TF-IDF + Cosine Similarity)
- A hybrid method combining both

I have used the **MovieLens Latest Small Dataset** (100k ratings) from [GroupLens](https://grouplens.org/datasets/movielens/).


In [2]:
from google.colab import files
uploaded = files.upload()  # upload all dataset files manually

Saving movies_metadata.csv to movies_metadata.csv
Saving links.csv to links.csv
Saving movies.csv to movies.csv
Saving ratings.csv to ratings.csv
Saving tags.csv to tags.csv


## Load MovieLens Dataset

Load the following files from the MovieLens Latest Small dataset:
- `ratings.csv`: user-movie ratings (used for collaborative filtering)
- `movies.csv`: movie titles and genres (used for content-based filtering)
- `tags.csv`: user-added tags (used to enrich content features)
- `links.csv`: TMDB/IMDb IDs (used to fetch posters later in the Streamlit app)


In [1]:
import pandas as pd

# Load files
ratings = pd.read_csv("ratings.csv")     # userId, movieId, rating, timestamp
movies = pd.read_csv("movies.csv")       # movieId, title, genres
tags = pd.read_csv("tags.csv")           # userId, movieId, tag, timestamp
links = pd.read_csv("links.csv")         # movieId, imdbId, tmdbId

## Preview and Understand Each File

Let's inspect the first few rows of each file to understand its structure and confirm successful loading.


In [2]:
print("Ratings.csv")
print(ratings.head(),"\n")
print(ratings.shape,'\n')

print("Movies.csv")
print(movies.head(),"\n")
print(movies.shape,'\n')

print("Tags.csv")
print(tags.head(),"\n")
print(tags.shape,'\n')

print("Links.csv")
print(links.head())
print(links.shape,'\n')


Ratings.csv
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931 

(100836, 4) 

Movies.csv
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy   

(9742, 3) 

Tags.csv
   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756 

### Purpose of Each File

- `ratings.csv` → Used to build the collaborative filtering (SVD) model.
- `movies.csv` → Used to extract metadata (title, genres) for content-based filtering.
- `tags.csv` → Used to enrich movie profiles with user-generated tags.
- `links.csv` → Used to look up `tmdbId` and `imdbId` later, for posters in the app.




In [3]:
# Genres are originally pipe-separated like "Action|Adventure"
# convert to space-separated:"Action Adventure" for text processing

movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)

In [4]:
# Check for missing values
print("Missing in ratings:\n", ratings.isnull().sum())
print("Missing in movies:\n", movies.isnull().sum())
print("Missing in tags:\n", tags.isnull().sum())
print("Missing in links:\n", links.isnull().sum())

Missing in ratings:
 userId       0
movieId      0
rating       0
timestamp    0
dtype: int64
Missing in movies:
 movieId    0
title      0
genres     0
dtype: int64
Missing in tags:
 userId       0
movieId      0
tag          0
timestamp    0
dtype: int64
Missing in links:
 movieId    0
imdbId     0
tmdbId     8
dtype: int64
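
The 8 missing `tmdbId` values matter later, when `links.csv` is used to look up posters. Below is a minimal sketch of one way to handle them, assuming a nullable integer column is acceptable. The small `links_demo` frame and the `has_tmdb` column are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for links.csv (illustrative rows); tmdbId loads as float because of the NaN
links_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, None, 15602.0],
})

# Nullable integer dtype keeps missing ids as <NA> instead of forcing floats
links_demo["tmdbId"] = links_demo["tmdbId"].astype("Int64")

# Hypothetical helper column: flag rows that can be used for a poster lookup
links_demo["has_tmdb"] = links_demo["tmdbId"].notna()

print(links_demo)
```

Rows where `has_tmdb` is False could then be skipped or shown with a placeholder image in the app.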


## Building Content-Based Filtering DataFrame

Created a final clean dataframe `cb_movies_df` that contains:
- `movieId`
- `title`
- `cb_text`: concatenation of `title`, `genres`, and **all user tags** combined per movie

This ensures each movie has a rich content profile for TF-IDF modeling.


In [5]:
# Group all tags for each movie into a single string
tags_grouped = tags.groupby('movieId')['tag'].apply(lambda x: ' '.join(x)).reset_index()

# Merge movies with grouped tags using LEFT JOIN
cb_movies_df = pd.merge(movies, tags_grouped, on='movieId', how='left')

# Fill missing tag column with empty string (for movies with no tags)
cb_movies_df['tag'] = cb_movies_df['tag'].fillna('')

# Create combined content text (cb_text)
cb_movies_df['cb_text'] = (
    cb_movies_df['title'].fillna('') + ' ' +
    cb_movies_df['genres'].fillna('') + ' ' +
    cb_movies_df['tag'].fillna('')
)

# Keep only needed columns for CBF
cb_movies_df = cb_movies_df[['movieId', 'title', 'cb_text']]

# Preview final dataframe
cb_movies_df.head(5)


Unnamed: 0,movieId,title,cb_text
0,1,Toy Story (1995),Toy Story (1995) Adventure Animation Children ...
1,2,Jumanji (1995),Jumanji (1995) Adventure Children Fantasy fant...
2,3,Grumpier Old Men (1995),Grumpier Old Men (1995) Comedy Romance moldy old
3,4,Waiting to Exhale (1995),Waiting to Exhale (1995) Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Father of the Bride Part II (1995) Comedy preg...


## TF-IDF Vectorization + Cosine Similarity

I used TF-IDF to convert movie content (`cb_text`) into numeric vectors, and cosine similarity to compare them. This makes it possible to recommend similar movies based purely on metadata and tags.


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(stop_words='english')

# Fit and transform the cb_text
tfidf_matrix = tfidf.fit_transform(cb_movies_df['cb_text'])

# Compute pairwise cosine similarity between all movies
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Check shape
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Cosine similarity matrix shape: {cosine_sim_matrix.shape}")

TF-IDF matrix shape: (9742, 9949)
Cosine similarity matrix shape: (9742, 9742)
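
To make the similarity matrix more concrete, here is a small sketch on three made-up content strings written in the same spirit as `cb_text`. The strings and the `toy_*` names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up content strings (title + genres + tags)
toy_texts = [
    "Toy Story Adventure Animation Children pixar",
    "Toy Story 2 Adventure Animation Children Comedy sequel",
    "Heat Action Crime Thriller",
]

# Same vectorizer settings as the notebook cell above
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_texts)

# Entry [i, j] is the cosine similarity between text i and text j
toy_sim = cosine_similarity(toy_tfidf)

print(toy_sim.round(2))
```

The two "Toy Story" strings share most of their tokens and score well above zero, while the third shares none with the first and scores 0. Note that the default tokenizer drops single-character tokens, so the "2" in a sequel title does not contribute.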


## Content-Based Recommendation Function

Created a function that:
1. Takes a movie title as input
2. Finds its index in the dataframe
3. Looks up cosine similarity scores for that movie
4. Returns top N most similar movies (excluding the input itself)


In [8]:
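# Create a reverse map from movie title to row index.
# Titles are not guaranteed to be unique, so keep the first occurrence;
# this way title_to_index[title] always returns a single integer.
title_to_index = pd.Series(cb_movies_df.index, index=cb_movies_df['title'])
title_to_index = title_to_index[~title_to_index.index.duplicated(keep='first')]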

# Recommendation function
def get_similar_movies(title, top_n=10):
    # Check if title exists
    if title not in title_to_index:
        return f"Movie '{title}' not found in dataset."

    # Get the index of the movie
    idx = title_to_index[title]

    # Get similarity scores for this movie with all others
    sim_scores = list(enumerate(cosine_sim_matrix[idx]))

    # Sort movies by similarity score (highest first)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry, which is normally the input movie itself (similarity 1.0)
    sim_scores = sim_scores[1:top_n+1]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the movie titles
    return cb_movies_df[['title']].iloc[movie_indices].reset_index(drop=True)


In [9]:
print(title_to_index[0:11])
cb_movies_df.head(10)

title
Toy Story (1995)                       0
Jumanji (1995)                         1
Grumpier Old Men (1995)                2
Waiting to Exhale (1995)               3
Father of the Bride Part II (1995)     4
Heat (1995)                            5
Sabrina (1995)                         6
Tom and Huck (1995)                    7
Sudden Death (1995)                    8
GoldenEye (1995)                       9
American President, The (1995)        10
dtype: int64


Unnamed: 0,movieId,title,cb_text
0,1,Toy Story (1995),Toy Story (1995) Adventure Animation Children ...
1,2,Jumanji (1995),Jumanji (1995) Adventure Children Fantasy fant...
2,3,Grumpier Old Men (1995),Grumpier Old Men (1995) Comedy Romance moldy old
3,4,Waiting to Exhale (1995),Waiting to Exhale (1995) Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Father of the Bride Part II (1995) Comedy preg...
5,6,Heat (1995),Heat (1995) Action Crime Thriller
6,7,Sabrina (1995),Sabrina (1995) Comedy Romance remake
7,8,Tom and Huck (1995),Tom and Huck (1995) Adventure Children
8,9,Sudden Death (1995),Sudden Death (1995) Action
9,10,GoldenEye (1995),GoldenEye (1995) Action Adventure Thriller


In [10]:
get_similar_movies("Toy Story (1995)", top_n=15)

Unnamed: 0,title
0,Toy Story 2 (1999)
1,"Bug's Life, A (1998)"
2,Toy Story 3 (2010)
3,"Toy, The (1982)"
4,Up (2009)
5,Fun (1994)
6,We're Back! A Dinosaur's Story (1993)
7,Now and Then (1995)
8,Toy Soldiers (1991)
9,"NeverEnding Story, The (1984)"


## Multi-Movie Content-Based Recommender

This version:
- Takes a list of movies the user likes
- Retrieves each movie's content similarity vector
- Averages them into a single "user interest profile"
- Returns top N recommended movies based on this combined profile


In [37]:
def get_cbf_recommendations(liked_titles, top_n=10):

    # Filter out movies not found
    valid_titles = [title for title in liked_titles if title in title_to_index]

    if not valid_titles:
        return "None of the selected movies were found in the dataset."

    # Get indices of liked movies
    liked_indices = [title_to_index[title] for title in valid_titles]

    # Average the cosine similarity vectors of liked movies
    user_profile = cosine_sim_matrix[liked_indices].mean(axis=0)

    # Get list of scores with their index
    sim_scores = list(enumerate(user_profile))

    # Remove the movies the user already liked
    liked_set = set(liked_indices)
    sim_scores = [score for score in sim_scores if score[0] not in liked_set]

    # Sort by similarity score
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get top N recommended movie indices
    top_indices = [idx for idx, _ in sim_scores[:top_n]]

    # Get movieId and title from cb_movies_df using indices
    recommended_movies = cb_movies_df.loc[top_indices, ['movieId', 'title']].reset_index(drop=True)

    return recommended_movies

In [40]:
get_cbf_recommendations(["Inception (2010)","Now and Then (1995)"], top_n=10)

Unnamed: 0,movieId,title
0,618,Two Much (1995)
1,54,"Big Green, The (1995)"
2,241,Fluke (1995)
3,158,Casper (1995)
4,6557,Born to Be Wild (1995)
5,48,Pocahontas (1995)
6,243,Gordy (1995)
7,218,Boys on the Side (1995)
8,13,Balto (1995)
9,714,Dead Man (1995)


## Collaborative Filtering (SVD)

Built a recommendation engine based on **user rating patterns**, using **Singular Value Decomposition (SVD)** via the Surprise library.

Instead of relying on content, this approach learns:
- What types of movies a user prefers (latent user features)
- What hidden characteristics a movie has (latent item features)

The key idea is:
> "If User A and B both liked Movie X, and User A also liked Y, then maybe B will like Y too."

### Data Used:
- `ratings.csv`: userId, movieId, rating
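
The latent user and item features described above combine into a single predicted rating. The sketch below shows the biased matrix-factorization form that Surprise's `SVD` uses (global mean plus user bias plus item bias plus the dot product of the two factor vectors). All numbers here are made up, and the variable names are illustrative rather than taken from the library.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 4  # kept tiny for illustration; Surprise's SVD defaults to 100

mu = 3.5    # global mean rating (made-up value)
b_u = 0.2   # user bias: this user rates slightly above average
b_i = -0.1  # item bias: this movie is rated slightly below average
p_u = rng.normal(0, 0.1, n_factors)  # latent user features
q_i = rng.normal(0, 0.1, n_factors)  # latent item features

# Predicted rating before clipping
raw_estimate = mu + b_u + b_i + q_i @ p_u

# Keep the estimate inside the 0.5 to 5.0 rating scale
estimate = float(np.clip(raw_estimate, 0.5, 5.0))
print(round(estimate, 3))
```

During training, the biases and factor vectors are the quantities the model learns from the observed ratings.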




In [13]:
ratings.info()
ratings.head()

# Unique stats
print(f"Total ratings: {len(ratings)}")
print(f"Unique users: {ratings['userId'].nunique()}")
print(f"Unique movies: {ratings['movieId'].nunique()}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
Total ratings: 100836
Unique users: 610
Unique movies: 9724


## User-Item Matrix

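Reshaped the ratings data into a matrix format:

- Rows: Unique users
- Columns: Unique movies
- Values: Rating (0.5 to 5.0 stars)

Missing values are filled with 0 temporarily, for inspection only. The SVD model later in the notebook is trained on the raw `ratings` rows, not on this filled matrix.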


In [14]:
# Pivot ratings to create user-item matrix
user_item_matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating')

# Fill missing values with 0 (0 marks an unrated movie, not a rating of zero)
user_item_matrix_filled = user_item_matrix.fillna(0)

# Shape of matrix
print("User-Item Matrix shape:", user_item_matrix_filled.shape)

# Preview
user_item_matrix_filled.head()

User-Item Matrix shape: (610, 9724)


userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
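
The preview is almost entirely zeros because the matrix is very sparse: 100836 ratings spread over 610 × 9724 cells is roughly 1.7% filled. A minimal sketch of how that density could be measured, shown on a made-up `toy_ratings` frame so it runs on its own:

```python
import pandas as pd

# Made-up ratings: 3 users, 3 movies, 4 ratings
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.5, 2.0],
})

# Same pivot as the notebook cell, before any fillna
toy_matrix = toy_ratings.pivot_table(index="userId", columns="movieId", values="rating")

# Fraction of user-movie cells that actually hold a rating
density = toy_matrix.notna().to_numpy().mean()
print(f"Density: {density:.2%}")
```

Applying the same two lines to `user_item_matrix` (the unfilled version) should report the density of the real data.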





### Model Training Steps

1. **Prepare the data**  
   Convert the `ratings.csv` file into the Surprise-compatible format.

2. **Train the SVD model**

3. **Evaluate performance**  
   Use **Root Mean Squared Error (RMSE)** on a test set.

---




In [None]:
# Install Surprise 
!pip install numpy==1.24.4
!pip install scikit-surprise --no-binary scikit-surprise

Collecting scikit-surprise
  Using cached scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp311-cp311-linux_x86_64.whl size=2469546 sha256=c48532ccd9f79e349faad891b6e59e2f954fdec54e1e97590f43c04d4cbb1f48
  Stored in directory: /root/.cache/pip/wheels/2a/8f/6e/7e2899163e2d85d8266daab4aa1cdabec7a6c56f83c015b5af
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.4


In [16]:
import pandas as pd
from surprise import Dataset, Reader

# Define the reader with rating scale 0.5 to 5.0
reader = Reader(rating_scale=(0.5, 5.0))

# Load dataset into Surprise format
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)


In [19]:
from surprise.model_selection import train_test_split
from surprise import SVD, accuracy

# Split into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=2)

# Create and train SVD model
svd_model = SVD()
svd_model.fit(trainset)

# Predict on test set
predictions = svd_model.test(testset)

# Evaluate performance (verbose=False stops accuracy.rmse from printing the value a second time)
print("RMSE:", accuracy.rmse(predictions, verbose=False))


RMSE: 0.8784869093181379
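
An RMSE of about 0.88 means that, in the root-mean-square sense, predictions land a little under one star away from the true rating. As a sketch of what the metric computes, here is the same calculation done by hand on five made-up rating pairs (the arrays are illustrative, not taken from the test set):

```python
import numpy as np

# Made-up true ratings and model estimates for five test pairs
true_ratings = np.array([4.0, 3.0, 5.0, 2.5, 4.5])
estimates = np.array([3.5, 3.0, 4.0, 3.0, 4.5])

# Square the errors, average them, then take the square root
rmse = float(np.sqrt(np.mean((true_ratings - estimates) ** 2)))
print(f"RMSE: {rmse:.4f}")
```

Because errors are squared before averaging, a single large miss (the 1.0-star error above) weighs more than several small ones.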


In [29]:
def get_cf_recommendations(user_id, top_n=10):

    if user_id not in ratings['userId'].unique():
        return f"User ID {user_id} not found in the dataset."

    # Get all movie IDs
    all_movie_ids = ratings['movieId'].unique()

    # Get movies already rated by the user (a set gives fast membership tests)
    rated_movie_ids = set(ratings.loc[ratings['userId'] == user_id, 'movieId'])

    # Predict ratings for unrated movies
    predictions = [
        svd_model.predict(user_id, movie_id)
        for movie_id in all_movie_ids if movie_id not in rated_movie_ids
    ]

    # Sort by predicted rating (descending)
    top_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)[:top_n]

    # Prepare DataFrame
    results = []
    for pred in top_predictions:
        movie_id = int(pred.iid)
        title = movies[movies['movieId'] == movie_id]['title'].values[0]
        results.append({
            'movieId': movie_id,
            'title': title,
            'predicted_rating': round(pred.est, 2)
        })

    return pd.DataFrame(results)



In [30]:
# Test example
get_cf_recommendations(user_id=5668, top_n=50)

'User ID 5668 not found in the dataset.'

In [33]:
get_cf_recommendations(user_id=12, top_n=15)

Unnamed: 0,movieId,title,predicted_rating
0,50,"Usual Suspects, The (1995)",5.0
1,318,"Shawshank Redemption, The (1994)",5.0
2,48516,"Departed, The (2006)",5.0
3,58559,"Dark Knight, The (2008)",5.0
4,1203,12 Angry Men (1957),5.0
5,7361,Eternal Sunshine of the Spotless Mind (2004),5.0
6,2951,"Fistful of Dollars, A (Per un pugno di dollari...",5.0
7,1394,Raising Arizona (1987),5.0
8,296,Pulp Fiction (1994),4.99
9,1235,Harold and Maude (1971),4.99


## Hybrid Recommendation System

In real-world applications, both **Content-Based Filtering (CBF)** and **Collaborative Filtering (CF)** have their own strengths and limitations:

- **Content-Based Filtering (CBF)** recommends items based on a user's previously liked movies, using metadata like genres and tags.  
   Limitation: It cannot capture patterns beyond the item’s content — no learning from other users.

- **Collaborative Filtering (CF)** (using SVD) recommends items by learning from the interaction matrix of users and movies (ratings).  
   Limitation: It struggles with new users or new items — known as the *cold start problem*.

---

### How It Works

- Use **CBF** to calculate similarity scores between liked movies and all others using TF-IDF + cosine similarity.
- Use **SVD-based CF** to predict ratings of movies the user hasn’t rated yet.
- Combine both scores using a **weighted average** controlled by parameter `alpha`.

---

### Final Score Formula:

\[
\text{hybrid\_score} = \alpha \cdot \text{cb\_score} + (1 - \alpha) \cdot \text{cf\_score}
\]

- `alpha ∈ [0, 1]`  
  - `alpha = 0.7` → 70% content-based, 30% collaborative  
  - `alpha = 0.5` → equal weight
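
A small worked example of the formula, assuming three candidate movies with made-up scores. Cosine similarities and predicted ratings live on different scales, so both columns are first min-max scaled to [0, 1], which is what the notebook's hybrid function does with `MinMaxScaler`. The `toy` frame and `min_max` helper are illustrative names.

```python
import pandas as pd

# Made-up candidate scores
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "cb_score": [0.10, 0.40, 0.25],  # cosine similarities
    "cf_score": [4.8, 3.0, 3.9],     # predicted ratings
})

def min_max(col):
    """Scale a column to [0, 1] (assumes the column is not constant)."""
    return (col - col.min()) / (col.max() - col.min())

toy["cb_score"] = min_max(toy["cb_score"])
toy["cf_score"] = min_max(toy["cf_score"])

alpha = 0.7  # 70% content-based, 30% collaborative
toy["hybrid_score"] = alpha * toy["cb_score"] + (1 - alpha) * toy["cf_score"]

print(toy.sort_values("hybrid_score", ascending=False))
```

With `alpha = 0.7`, movie 2 ranks first on the strength of its content similarity even though its predicted rating is the lowest; lowering `alpha` toward 0 would reverse the order.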

---

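### Why use a hybrid model?

- Balances **personal taste** with **community preferences**
- Mitigates the **cold start** problem: content similarity can still rank items when rating history is sparse (note that `hybrid_recommendations` below still requires a known `user_id`)
- Tends to give more **robust and personalized** recommendations
- Often outperforms either individual model in practice, although this notebook does not benchmark the hybrid against them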
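
One possible way to act on the cold-start point is to shift all weight to the content-based score when a user has no rating history. The helper below is a hypothetical sketch, not part of the notebook: `effective_alpha` and `known_users` are made-up names.

```python
def effective_alpha(user_id, known_user_ids, alpha=0.5):
    """Return the content weight to use for this user (hypothetical helper)."""
    if user_id not in known_user_ids:
        # No rating history: the CF score carries no personal signal,
        # so rely entirely on content-based similarity.
        return 1.0
    return alpha

known_users = {1, 2, 3}  # made-up set of users seen during training
print(effective_alpha(2, known_users, alpha=0.5))    # known user keeps the requested weight
print(effective_alpha(999, known_users, alpha=0.5))  # unknown user falls back to content only
```

A function like this could be called before computing the weighted score, instead of returning a "not found" message for unknown users.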

---


In [34]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

def hybrid_recommendations(user_id, liked_titles, top_n=10, alpha=0.5):

    if user_id not in ratings['userId'].unique():
        return f"User ID {user_id} not found in the dataset."

    # Filter liked movies that exist in the model
    valid_titles = [title for title in liked_titles if title in title_to_index]
    if not valid_titles:
        return pd.DataFrame(columns=["movieId", "title", "cb_score", "cf_score", "hybrid_score"])

    # Compute average content-based profile
    liked_indices = [title_to_index[title] for title in valid_titles]
    user_profile = cosine_sim_matrix[liked_indices].mean(axis=0)

    # Create DataFrame of content-based scores (excluding liked movies)
    liked_set = set(liked_indices)
    sim_scores = [(idx, score) for idx, score in enumerate(user_profile) if idx not in liked_set]

    cb_df = pd.DataFrame(sim_scores, columns=['index', 'cb_score'])
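    # Map row positions back to movieId via cb_movies_df, the frame the similarity matrix was built from
    cb_df['movieId'] = cb_movies_df.loc[cb_df['index'], 'movieId'].values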
    cb_df = cb_df[['movieId', 'cb_score']]

    # Collaborative Filtering: Predict scores for unrated movies
    all_movie_ids = ratings['movieId'].unique()
    rated_movie_ids = ratings[ratings['userId'] == user_id]['movieId'].values
    unrated_ids = set(all_movie_ids) - set(rated_movie_ids)

    cf_scores = []
    for movie_id in unrated_ids:
        pred = svd_model.predict(user_id, movie_id)
        cf_scores.append((movie_id, pred.est))

    cf_df = pd.DataFrame(cf_scores, columns=['movieId', 'cf_score'])

    # Merge both scores
    hybrid_df = pd.merge(cb_df, cf_df, on='movieId')
    if hybrid_df.empty:
        return pd.DataFrame(columns=["movieId", "title", "cb_score", "cf_score", "hybrid_score"])

    # Normalize both score columns to [0, 1]
    scaler = MinMaxScaler()
    hybrid_df[['cb_score', 'cf_score']] = scaler.fit_transform(hybrid_df[['cb_score', 'cf_score']])

    # Compute weighted hybrid score
    hybrid_df['hybrid_score'] = alpha * hybrid_df['cb_score'] + (1 - alpha) * hybrid_df['cf_score']

    # Add movie titles
    hybrid_df = pd.merge(hybrid_df, movies[['movieId', 'title']], on='movieId')
    hybrid_df = hybrid_df.sort_values('hybrid_score', ascending=False).head(top_n)
    hybrid_df.reset_index(drop=True, inplace=True)

    return hybrid_df[['movieId', 'title', 'cb_score', 'cf_score', 'hybrid_score']]


In [35]:
# Example usage:
result=hybrid_recommendations(
    user_id=500221,
    liked_titles=["Toy Story (1995)", "Inception (2010)"],
    top_n=10,
    alpha=0
)
print(result)


User ID 500221 not found in the dataset.


In [36]:
result=hybrid_recommendations(
    user_id=51,
    liked_titles=["Toy Story (1995)", "Inception (2010)"],
    top_n=10,
    alpha=0.5
)
print(result)

   movieId                                         title  cb_score  cf_score  \
0     3114                            Toy Story 2 (1999)  1.000000  0.772275   
1    78499                            Toy Story 3 (2010)  0.772220  0.800219   
2   164179                                Arrival (2016)  0.536734  0.784632   
3     2355                          Bug's Life, A (1998)  0.759807  0.559438   
4     7361  Eternal Sunshine of the Spotless Mind (2004)  0.452517  0.849296   
5   174053          Black Mirror: White Christmas (2014)  0.352551  0.908988   
6     1203                           12 Angry Men (1957)  0.279720  0.969620   
7    89745                          Avengers, The (2012)  0.439723  0.795079   
8      628                            Primal Fear (1996)  0.424077  0.807834   
9     1274                                  Akira (1988)  0.326786  0.890791   

   hybrid_score  
0      0.886138  
1      0.786220  
2      0.660683  
3      0.659623  
4      0.650907  
5      0.63