# Problem Statement

In the age of digital content overload, users struggle to find movies tailored to their tastes. The objective is to build a Movie Recommendation System that understands user preferences and delivers personalized suggestions, improving user engagement and satisfaction.

# Goal

To develop a hybrid movie recommendation engine that:

- Suggests movies based on user ratings.
- Incorporates genre similarities and user behavior.
- Is scalable and easy to interpret.
- Can be integrated into real-world applications like streaming platforms.

# Datasets Used

1. **movies.csv:** Contains movie metadata (movieId, title, genres).
2. **ratings.csv:** User ratings for movies on a scale from 0.5 to 5.0.
3. **tags.csv:** Tags added by users (can be used for content-based filtering).
4. **links.csv:** Mapping to IMDb and TMDb for future extension (like fetching posters, summaries).

Together they form the famous MovieLens dataset.

## Methodology

In [2]:
# 1. Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#### Explanation
- Loading the essential libraries for data processing and similarity computation.
- These tools are foundational for content-based filtering using genre data.

In [3]:
# 2. Load all the datasets
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")
tags = pd.read_csv("tags.csv")
links = pd.read_csv("links.csv")

#### Explanation

Importing the MovieLens dataset into Pandas DataFrames.

In [4]:
# 3. Merge ratings with movies to create a unified base
movie_ratings = pd.merge(ratings, movies, on="movieId", how="left")


#### Explanation
- This code combines ratings and movie metadata into a single DataFrame.
- The left join preserves every rating, even when the matching movie metadata is missing.
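The effect of the left join can be checked on a small example. This is a minimal sketch with made-up rows (`ratings_demo`, `movies_demo` and movieId 99 are hypothetical, not taken from MovieLens); it uses the `indicator` option of `pd.merge` to flag ratings that found no metadata.

```python
import pandas as pd

# Toy data (hypothetical values): movieId 99 has a rating but no metadata row.
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 99, 1],
    "rating": [4.0, 3.5, 5.0],
})
movies_demo = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Adventure|Children|Fantasy"],
})

# how="left" keeps every row of ratings_demo; indicator=True adds a "_merge"
# column that marks rows without a match in movies_demo as "left_only".
merged_demo = pd.merge(ratings_demo, movies_demo, on="movieId", how="left", indicator=True)
unmatched = merged_demo[merged_demo["_merge"] == "left_only"]

print(len(merged_demo))                 # same row count as ratings_demo
print(unmatched[["movieId", "title"]])  # the rating whose title is NaN
```

Running the same check on the real `movie_ratings` frame would show whether any ratings lack metadata before titles are used as group keys.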

In [5]:
# 4. Calculate movie popularity metrics: mean rating and number of ratings
movie_stats = movie_ratings.groupby("title").agg({
    "rating": ["mean", "count"]
}).reset_index()
movie_stats.columns = ["title", "average_rating", "num_ratings"]

#### Explanation

- This code computes average rating and number of ratings per movie.
- It provides a simple popularity baseline; movies with high ratings and many votes are likely crowd-pleasers.

In [6]:
# 5. Filter for movies with at least 50 ratings to avoid obscure/biased ratings
popular_movies = movie_stats[movie_stats["num_ratings"] >= 50]


#### Explanation

- It keeps only movies with at least 50 ratings, dropping obscure or unreliably rated entries.
- This threshold balances quality (reliable ratings) and quantity (sufficient data), reducing noise from sparsely rated movies.

In [7]:
# 6. Sort by highest average rating
top_popular_movies = popular_movies.sort_values(by="average_rating", ascending=False).head(10)


#### Explanation

- It identifies the top 10 movies by average rating.
- It offers a basic recommendation list (e.g., "Shawshank Redemption" often tops such lists), but lacks personalization.
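Steps 4 to 6 can be traced end to end on toy data. The sketch below assumes made-up titles and ratings, and a cutoff of 2 instead of 50 so the toy data survives the filter; `MIN_RATINGS` is a hypothetical name. It uses named aggregation, which avoids renaming the columns afterwards.

```python
import pandas as pd

# Toy ratings (hypothetical): "B" has a perfect mean from a single vote.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4.0, 4.5, 5.0, 5.0, 3.0, 2.0],
})

# Mean rating and vote count per title, with readable column names.
stats = (
    toy.groupby("title")
       .agg(average_rating=("rating", "mean"), num_ratings=("rating", "count"))
       .reset_index()
)

MIN_RATINGS = 2  # hypothetical cutoff for the toy data; the notebook uses 50
ranked = (
    stats[stats["num_ratings"] >= MIN_RATINGS]
    .sort_values("average_rating", ascending=False)
)
print(ranked)
```

Without the cutoff, "B" would top the list on the strength of one vote, which is the kind of noise the threshold is meant to remove.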

In [8]:
# ----- Content-Based Filtering using Genres -----


In [9]:
# 7. Fill any missing genre values with empty string
movies["genres"] = movies["genres"].fillna("")

#### Explanation

- It replaces missing genre values with empty strings.
- It ensures TF-IDF vectorization works seamlessly, avoiding NaN-related errors.

In [10]:
# 8. Use TF-IDF Vectorizer to encode genres (like NLP encoding for movie flavor)
tfidf = TfidfVectorizer(token_pattern=r"[^|]+")  # Use pipe separator
tfidf_matrix = tfidf.fit_transform(movies["genres"])

#### Explanation

- Converts genres into a TF-IDF matrix, treating genres as "words" separated by "|".
- TF-IDF weights genres by importance (e.g., rare genres like "Film-Noir" get higher weights), enabling similarity computation.
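To see how the pipe-based `token_pattern` and the IDF weighting behave, here is a small sketch on four made-up genre strings (`genres_demo` is hypothetical). It reads the learned weights from the vectorizer's `vocabulary_` and `idf_` attributes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy genre strings (hypothetical): "Film-Noir" appears once, "Drama" three times.
genres_demo = ["Drama", "Crime|Drama", "Drama|Film-Noir", "Action|Crime"]

# Every run of non-pipe characters is one token, so "Film-Noir" stays whole.
tfidf_demo = TfidfVectorizer(token_pattern=r"[^|]+")
matrix_demo = tfidf_demo.fit_transform(genres_demo)

# Tokens are lowercased by default; vocabulary_ maps each token to its column,
# and idf_ holds one weight per column (larger for rarer genres).
idf_by_genre = {term: tfidf_demo.idf_[col] for term, col in tfidf_demo.vocabulary_.items()}
for genre, idf in sorted(idf_by_genre.items(), key=lambda kv: kv[1]):
    print(f"{genre:10s} idf={idf:.3f}")
```

The genre that occurs in the fewest rows ends up with the largest IDF, which is why rare genres carry more weight in the similarity step.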

In [11]:
# 9. Compute cosine similarity between all movies based on genre TF-IDF
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

#### Explanation

- It calculates pairwise cosine similarity between movies based on genres.
- It creates a similarity matrix where higher values indicate closer genre overlap (e.g., "Action|Adventure" vs. "Action|Sci-Fi").
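The following sketch illustrates what the similarity values look like for the genre pairs mentioned above. The four genre strings are made up for illustration; identical genre lists should score 1, disjoint lists 0, and partial overlap somewhere in between.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy genre strings (hypothetical): rows 0 and 3 are identical, row 2 shares nothing.
genres_demo = ["Action|Adventure", "Action|Sci-Fi", "Drama|Romance", "Action|Adventure"]

vectors = TfidfVectorizer(token_pattern=r"[^|]+").fit_transform(genres_demo)
sim = cosine_similarity(vectors)  # with one argument, rows are compared with each other

print(sim.round(2))
```

Because TF-IDF rows are L2-normalized by default, the cosine similarity here is just the dot product of the two genre vectors.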

In [12]:
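# 10. Create a reverse index of movie titles to fetch by title
movie_indices = pd.Series(movies.index, index=movies['title'])
# Keep the first row per title; drop_duplicates() would compare the row numbers, not the titles
movie_indices = movie_indices[~movie_indices.index.duplicated(keep="first")]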

#### Explanation

- It maps movie titles to their DataFrame indices for quick lookup.
- It simplifies retrieval of movie data by title, crucial for recommendation functions.
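A title lookup only returns a single row position if each title appears once in the index. The sketch below uses a made-up three-row frame with a repeated title (the duplicate is hypothetical here) and deduplicates on the index labels, so that `.get` yields one integer or `None`.

```python
import pandas as pd

movies_demo = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Emma (1996)", "Heat (1995)", "Emma (1996)"],  # hypothetical duplicate title
})

title_to_pos = pd.Series(movies_demo.index, index=movies_demo["title"])
# Keep the first row for each title so a lookup returns one integer.
title_to_pos = title_to_pos[~title_to_pos.index.duplicated(keep="first")]

print(title_to_pos.get("Emma (1996)"))     # position of the first "Emma (1996)" row
print(title_to_pos.get("Missing (2000)"))  # None when the title is absent
```

Keeping the first occurrence is an arbitrary choice; a production system would more likely key lookups on `movieId`.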

In [13]:
# 11. Function to get similar movies based on genres
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = movie_indices.get(title)
    if idx is None:
        return f"Movie '{title}' not found in the dataset."
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Drop the query movie by position: with tied scores it is not always ranked first
    sim_scores = [s for s in sim_scores if s[0] != idx][:10]
    movie_indices_list = [i[0] for i in sim_scores]
    return movies.iloc[movie_indices_list][["title", "genres"]]

#### Explanation

- It defines a function to recommend movies similar to a given title based on genre similarity.
- It returns the 10 most similar movies, excluding the input movie itself, and returns a "not found" message when the title is missing from the dataset.
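Many movies share exactly the same genre string, so several rows can tie at the top similarity score. In that case the input movie is not guaranteed to be first in the sorted list, and excluding it by position is safer than skipping the first entry. A minimal sketch, assuming a hypothetical helper `top_k_similar` and a made-up similarity matrix:

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=10):
    """Return the positions of the k rows most similar to row idx, excluding idx."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                       # never return the query itself
    order = np.argsort(-scores, kind="stable")  # descending; ties keep row order
    return order[:k].tolist()

# Toy similarity matrix (hypothetical): rows 0, 1 and 2 are identical to each other.
sim_demo = np.array([
    [1.0, 1.0, 1.0, 0.2],
    [1.0, 1.0, 1.0, 0.2],
    [1.0, 1.0, 1.0, 0.2],
    [0.2, 0.2, 0.2, 1.0],
])
print(top_k_similar(sim_demo, idx=1, k=2))  # [0, 2]
```

Skipping the first sorted entry for row 1 would instead drop row 0 and return the query row itself, since all three tied rows score 1.0.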


In [14]:
# Output everything built so far:
{
    "Top_10_Popular_Movies": top_popular_movies,
    "Sample_Content_Based_Recs_for_Fight_Club": get_recommendations("Fight Club (1999)")
}

{'Top_10_Popular_Movies':                                                   title  average_rating  \
 7593                   Shawshank Redemption, The (1994)        4.429022   
 3499                              Godfather, The (1972)        4.289062   
 3011                                  Fight Club (1999)        4.272936   
 1961                              Cool Hand Luke (1967)        4.271930   
 2531  Dr. Strangelove or: How I Learned to Stop Worr...        4.268041   
 6999                                 Rear Window (1954)        4.261905   
 3500                     Godfather: Part II, The (1974)        4.259690   
 2334                               Departed, The (2006)        4.252336   
 3564                                  Goodfellas (1990)        4.250000   
 1593                                  Casablanca (1942)        4.240000   
 
       num_ratings  
 7593          317  
 3499          192  
 3011          218  
 1961           57  
 2531           97  
 6999      

#### Explanation and Insights

- **Purpose:** This displays top popular movies and sample recommendations for "Fight Club (1999)".

- **Insight:**
  - Top 10: Includes classics like "Shawshank Redemption" (4.43 avg, 317 ratings), showing popularity-based success.
  - Fight Club Recs: All recommendations share "Action|Crime|Drama|Thriller", validating genre-based similarity.

In [18]:
# Example Usage
print("\n🔍 Movies similar to 'Nixon (1995)':")
print(get_recommendations("Nixon (1995)"))


🔍 Movies similar to 'Nixon (1995)':
                                title genres
25                     Othello (1995)  Drama
30             Dangerous Minds (1995)  Drama
36    Cry, the Beloved Country (1995)  Drama
39                 Restoration (1995)  Drama
50                     Georgia (1995)  Drama
51       Home for the Holidays (1995)  Drama
55          Mr. Holland's Opus (1995)  Drama
105   Boys of St. Vincent, The (1992)  Drama
120    Basketball Diaries, The (1995)  Drama
121  Awfully Big Adventure, An (1995)  Drama


#### Explanation and Insights

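- **Purpose:** Demonstrates the content-based recommender with "Nixon (1995)".
- **Insight:** All ten results are tagged only "Drama", confirming that the system matches on genre alone. The clustering around 1995 (e.g., "Othello (1995)") is a side effect of dataset order among tied scores, since release year is not a feature.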

In [19]:
# 12. Importing Additional Libraries for Hybrid Model

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

#### Explanation

- Adding libraries for collaborative filtering (KNN) and matrix factorization (SVD).
- Expanding the system's capabilities beyond content-based filtering.

In [20]:
# 13. Content-Based Recommender

# Preprocess genres
movies["genres"] = movies["genres"].fillna("")

# TF-IDF Vectorizer
tfidf = TfidfVectorizer(token_pattern=r"[^|]+")
tfidf_matrix = tfidf.fit_transform(movies["genres"])

# Compute cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Reverse mapping of movie titles
movie_indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()

def content_based_recommend(title, cosine_sim=cosine_sim):
    idx = movie_indices.get(title)
    if idx is None:
        return f"Movie '{title}' not found."
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]
    movie_ids = [i[0] for i in sim_scores]
    return movies.iloc[movie_ids][["title", "genres"]]


#### Explanation

- Reimplements content-based filtering (similar to step #11, but renamed for clarity).
- Consistent with earlier implementation, reinforcing genre-based similarity as a core feature.

In [21]:

# 14. Collaborative Filtering with KNN


# Create pivot table
user_movie_matrix = ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)

# KNN Recommender using Nearest Neighbors
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(user_movie_matrix)

def collaborative_recommend(user_id):
    if user_id not in user_movie_matrix.index:
        return "User not found"
    distances, indices = knn.kneighbors([user_movie_matrix.loc[user_id]], n_neighbors=6)
    similar_users = indices.flatten()[1:]
    
    # Aggregate ratings from similar users
    similar_users_ratings = user_movie_matrix.iloc[similar_users]
    mean_ratings = similar_users_ratings.mean(axis=0)
    
    # Drop movies the target user has already rated
    watched = user_movie_matrix.loc[user_id]
    unseen_movies = mean_ratings[watched == 0]
    
    recommended_movies = unseen_movies.sort_values(ascending=False).head(10)
    # Reindex by the ranked ids so rows come back best-first (isin() alone keeps catalog order)
    return movies.set_index("movieId").reindex(recommended_movies.index)[["title", "genres"]]


#### Explanation 

- This implements user-based collaborative filtering using KNN.

- **Pivot Table:** A user-by-movie ratings table that is mostly empty, stored densely with 0 standing in for unrated.
- **KNN:** Finds 5 similar users (excluding the target) based on rating patterns.
- **Recommendations:** Suggests movies highly rated by similar users, excluding already-watched ones.
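The neighbour lookup and aggregation can be followed on a four-user toy matrix. This is a sketch with made-up ratings, not the notebook's function: it uses 2 neighbours instead of 5, and it differs in one labeled way by treating 0 as missing before averaging, so unrated movies do not pull the neighbour mean down.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy user-by-movie matrix (hypothetical); 0 stands for "not rated".
ratings_matrix = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [5.0, 5.0, 4.0, 0.0],
     [0.0, 0.0, 5.0, 4.0],
     [4.0, 5.0, 5.0, 0.0]],
    index=pd.Index([1, 2, 3, 4], name="userId"),
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
)

knn_demo = NearestNeighbors(metric="cosine", algorithm="brute").fit(ratings_matrix.to_numpy())

target = 1
query = ratings_matrix.loc[[target]].to_numpy()
_, idx = knn_demo.kneighbors(query, n_neighbors=3)  # the target itself plus 2 neighbours
neighbor_ids = [u for u in ratings_matrix.index[idx.ravel()] if u != target][:2]

# Assumption of this sketch: average only ratings the neighbours actually gave.
neighbor_ratings = ratings_matrix.loc[neighbor_ids].replace(0.0, float("nan"))
scores = neighbor_ratings.mean(axis=0)
unseen = scores[ratings_matrix.loc[target] == 0].dropna().sort_values(ascending=False)
print(neighbor_ids, unseen.to_dict())
```

User 3 shares no rated movies with the target, so it is never chosen as a neighbour, and movie 40 is dropped because neither neighbour rated it.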

In [22]:

# 15. Matrix Factorization with SVD


# Prepare data for surprise
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Train SVD model
svd = SVD()
svd.fit(trainset)

# Evaluate performance (optional)
predictions = svd.test(testset)
print("RMSE:", accuracy.rmse(predictions))

# Function to get SVD predictions
def svd_recommend(user_id, n_recs=10):
    movie_ids = ratings["movieId"].unique()
    # A set makes the membership test below fast
    rated_movies = set(ratings.loc[ratings["userId"] == user_id, "movieId"])
    unrated_movies = [m for m in movie_ids if m not in rated_movies]

    preds = [svd.predict(user_id, movie_id) for movie_id in unrated_movies]
    preds.sort(key=lambda x: x.est, reverse=True)
    top_movie_ids = [int(pred.iid) for pred in preds[:n_recs]]

    # Reindex by the ranked ids so rows come back best-first (isin() alone keeps catalog order)
    return movies.set_index("movieId").reindex(top_movie_ids)[["title", "genres"]]


RMSE: 0.8796
RMSE: 0.8795950377006411


#### Explanation & Insights

- It implements matrix factorization using SVD from the Surprise library.

- Insight:
1. **Data Prep:** Formats ratings for Surprise (80% train, 20% test).
2. **SVD:** Learns latent user and movie factors to predict ratings (test RMSE ~0.88, a typical error of just under one star on the 0.5 to 5.0 scale).
3. **Recommendations:** Predicts ratings for unrated movies, returning the top 10.
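The idea behind this kind of matrix factorization can be sketched in plain NumPy. This is an illustration of the general latent-factor technique on made-up ratings, not Surprise's implementation (which also learns per-user and per-item bias terms); the learning rate, regularization and factor count are hypothetical choices.

```python
import numpy as np

# Toy (user, item, rating) triples with 0-based ids (hypothetical data).
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 5.0), (3, 2, 2.0)]
n_users, n_items, n_factors = 4, 3, 2

rng = np.random.default_rng(0)
P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors
mu = np.mean([r for _, _, r in triples])      # global mean rating

def rmse(data):
    """Root-mean-square error of the current predictions on the given triples."""
    errors = [r - (mu + P[u] @ Q[i]) for u, i, r in data]
    return float(np.sqrt(np.mean(np.square(errors))))

rmse_before = rmse(triples)
lr, reg = 0.05, 0.02                          # hypothetical hyperparameters
for _ in range(200):                          # stochastic gradient descent epochs
    for u, i, r in triples:
        err = r - (mu + P[u] @ Q[i])
        pu = P[u].copy()                      # keep the old user vector for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])
rmse_after = rmse(triples)
print(f"training RMSE: {rmse_before:.3f} -> {rmse_after:.3f}")
```

The RMSE printed here is measured on the training triples only; the notebook's ~0.88 is measured on the held-out 20% test split, which is the figure that says something about generalization.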

In [23]:

# 16. Hybrid Recommender


def hybrid_recommend(user_id, liked_movie_title, top_n=10):
    # Content scores
    if liked_movie_title not in movie_indices:
        return "Liked movie not found."
    idx = movie_indices[liked_movie_title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Take the 50 most similar movies, excluding the liked movie itself
    content_ids = [i[0] for i in sim_scores if i[0] != idx][:50]

    movie_candidates = movies.iloc[content_ids].copy()
    movie_candidates["movieId"] = movie_candidates["movieId"].astype(int)

    # Add SVD predictions
    movie_candidates["predicted_rating"] = movie_candidates["movieId"].apply(lambda x: svd.predict(user_id, x).est)

    top_movies = movie_candidates.sort_values("predicted_rating", ascending=False).head(top_n)
    return top_movies[["title", "genres", "predicted_rating"]]


#### Explanation 

- It combines content-based (genres) and collaborative (SVD) filtering.
- Starts with 50 genre-similar movies.
- Ranks them by predicted user ratings.
- Balances movie similarity with user preference.
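The two-stage pattern (select candidates by similarity, then re-rank by predicted rating) can be shown without a trained model. In this sketch the catalog, the similarity row and the predicted ratings are all made up, and `rerank_candidates` is a hypothetical helper that accepts any callable in place of `svd.predict`.

```python
import numpy as np
import pandas as pd

def rerank_candidates(sim_row, query_pos, predict_rating, catalog, n_candidates=3, top_n=2):
    """Stage 1: pick the most similar items. Stage 2: sort them by predicted rating."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[query_pos] = -np.inf  # exclude the liked movie itself
    candidate_pos = np.argsort(-scores, kind="stable")[:n_candidates]
    candidates = catalog.iloc[candidate_pos].copy()
    candidates["predicted_rating"] = candidates["movieId"].apply(predict_rating)
    return candidates.sort_values("predicted_rating", ascending=False).head(top_n)

# Toy catalog, similarity row and predictions (all hypothetical).
catalog = pd.DataFrame({"movieId": [1, 2, 3, 4, 5],
                        "title": ["Q", "A", "B", "C", "D"]})
sim_to_q = [1.0, 0.9, 0.8, 0.7, 0.1]
fake_predictions = {1: 5.0, 2: 3.0, 3: 4.5, 4: 4.0, 5: 4.9}

result = rerank_candidates(sim_to_q, 0, fake_predictions.get, catalog)
print(result[["title", "predicted_rating"]])
```

Movie "D" has the highest predicted rating among the others but never reaches stage 2, because its genre similarity is too low. That is the trade-off of this design: the content filter bounds what the collaborative model is allowed to rank.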

In [24]:
# Example Outputs

# Content-based
print(content_based_recommend("Toy Story (1995)"))

# Collaborative
print(collaborative_recommend(user_id=5))

# Matrix Factorization
print(svd_recommend(user_id=5))

# Hybrid
print(hybrid_recommend(user_id=5, liked_movie_title="Inception (2010)"))


                                                  title  \
1706                                        Antz (1998)   
2355                                 Toy Story 2 (1999)   
2809     Adventures of Rocky and Bullwinkle, The (2000)   
3000                   Emperor's New Groove, The (2000)   
3568                              Monsters, Inc. (2001)   
6194                                   Wild, The (2006)   
6486                             Shrek the Third (2007)   
6948                     Tale of Despereaux, The (2008)   
7760  Asterix and the Vikings (Astérix et les Viking...   
8219                                       Turbo (2013)   

                                           genres  
1706  Adventure|Animation|Children|Comedy|Fantasy  
2355  Adventure|Animation|Children|Comedy|Fantasy  
2809  Adventure|Animation|Children|Comedy|Fantasy  
3000  Adventure|Animation|Children|Comedy|Fantasy  
3568  Adventure|Animation|Children|Comedy|Fantasy  
6194  Adventure|Animation|Children|Co

#### Insights

- It demonstrates all recommenders for user 5 and "Inception (2010)".
- **Content-Based:** Recommending animated family movies like "Toy Story 2" for "Toy Story".
- **Collaborative:** Suggesting diverse genres (e.g., "Jurassic Park"), reflecting similar users' tastes.
- **SVD:** Offering high-quality classics (e.g., "Princess Bride"), showing latent factor influence.
- **Hybrid:** Combining "Inception"-like action/sci-fi with user-specific predictions (e.g., "Dark Knight", 4.06).

## Key Insights

- **Popularity-Based:** Simple but effective for new users; lacks personalization.
- **Content-Based:** Strong for genre lovers; limited by genre-only features.
- **Collaborative (KNN):** Captures user behavior; struggles with sparse data.
- **SVD:** Balances accuracy and scalability; RMSE ~0.88 is solid for predictions.
- **Hybrid:** Best of both worlds: genre similarity plus user preference.

## Conclusion
This project successfully builds a scalable, interpretable movie recommendation system using a hybrid approach. It's ready for integration into streaming platforms and extensible for future enhancements like tag-based filtering or real-time updates.