# Movie Recommender System

This project implements a **content-based movie recommendation system** using the **MovieLens 20M dataset**.  
The system recommends movies based on **genre similarity** and **popularity**.

**Author:** Fatemeh Mohammadi  
**Date:** 2025-10-17

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import re


## Load MovieLens Dataset

We use `movies.csv` and `ratings.csv` from the MovieLens 20M dataset.


In [3]:
# Load movies and ratings datasets
movies = pd.read_csv(r"D:/me/projects/movielens-20m/movies.csv")
ratings = pd.read_csv(r"D:/me/projects/movielens-20m/ratings.csv")

# Quick look at the first few rows
movies.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


## Feature Engineering

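We transform the `genres` column into **one-hot encoded columns**, one per genre.  
We also compute each movie's **average rating** and a **popularity score** (a log-scaled, min-max-normalized rating count), and extract the **release year** from the title.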
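
As a minimal sketch of these two transformations, the snippet below runs them on a three-row toy frame. The rows, the rating counts, and the names `toy`, `toy_genres`, `toy_counts`, and `toy_pop` are illustrative assumptions, not values from MovieLens.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movies.csv (rows are illustrative only).
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Children", "Comedy|Romance", "Comedy"],
})

# One 0/1 indicator column per genre found in the pipe-separated strings.
toy_genres = toy["genres"].str.get_dummies(sep="|")

# Hypothetical rating counts. log1p compresses the long tail of very
# popular movies before min-max scaling maps the result into [0, 1].
toy_counts = pd.Series([0, 99, 9999])
toy_log = np.log1p(toy_counts)
toy_pop = (toy_log - toy_log.min()) / (toy_log.max() - toy_log.min())

print(toy_genres)
print(toy_pop.round(2).tolist())  # [0.0, 0.5, 1.0]
```

Without the log step, the middle movie would score about 0.01 instead of 0.5, so almost every title would look equally unpopular next to the most-rated one.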


In [6]:
# One-Hot Encoding of genres
genres_encoded = movies['genres'].str.get_dummies(sep='|')
movies_encoded = pd.concat([movies, genres_encoded], axis=1)

# Rating stats: avg_rating and rating_count
rating_stats = ratings.groupby('movieId')['rating'].agg(['mean','count']).reset_index()
rating_stats.columns = ['movieId', 'avg_rating', 'rating_count']

movies_encoded = movies_encoded.merge(rating_stats, on='movieId', how='left')
movies_encoded['avg_rating'] = movies_encoded['avg_rating'].fillna(movies_encoded['avg_rating'].mean())
movies_encoded['rating_count'] = movies_encoded['rating_count'].fillna(0)

# Normalize popularity (log + min-max)
movies_encoded['log_count'] = np.log1p(movies_encoded['rating_count'])
minc = movies_encoded['log_count'].min()
maxc = movies_encoded['log_count'].max()
movies_encoded['pop_norm'] = (movies_encoded['log_count'] - minc) / (maxc - minc)

# Extract release year from title
def extract_year(title):
    m = re.search(r'\((\d{4})\)', title)
    return int(m.group(1)) if m else np.nan

movies_encoded['year'] = movies_encoded['title'].apply(extract_year)
movies_encoded['year'] = movies_encoded['year'].fillna(0)

## Compute Genre Similarity

We compute the **cosine similarity** between movies based on their one-hot encoded genre features.  
This gives a score between 0 (no shared genres) and 1 (identical genre sets) for each movie pair. Floating-point rounding can make an exact match print as 0.9999999999999999.
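
To make the scale concrete, here is a small worked example on hand-written genre vectors. The six-column vocabulary and the helper `cosine` are assumptions for illustration. For 0/1 vectors the score reduces to (number of shared genres) / sqrt(genres in A × genres in B).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical six-genre vocabulary, one column per genre.
five_genres = [1, 1, 1, 1, 1, 0]  # movie tagged with five genres
six_genres = [1, 1, 1, 1, 1, 1]   # the same five genres plus one more
disjoint = [0, 0, 0, 0, 0, 1]     # shares nothing with the first movie

print(round(cosine(five_genres, six_genres), 4))  # 0.9129, i.e. 5 / sqrt(5 * 6)
print(cosine(five_genres, disjoint))              # 0.0
```

One extra genre on an otherwise identical movie therefore costs about 0.09 of similarity.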

In [7]:
# Select the genre columns by name (the one-hot columns created above),
# so later changes to the column order cannot shift the selection
genre_features = movies_encoded[genres_encoded.columns]

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(genre_features, genre_features)

print("✅ Cosine similarity matrix shape:", cosine_sim.shape)

✅ Cosine similarity matrix shape: (27278, 27278)
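
A 27278 × 27278 matrix holds about 744 million entries, which is roughly 6 GB if stored as 64-bit floats. Where that is too much memory, one alternative is to compute only the row for the query movie when it is needed. The sketch below does this in plain NumPy instead of scikit-learn's `cosine_similarity`; `similarity_row` and `demo` are hypothetical names, and this is not the approach the notebook itself takes.

```python
import numpy as np

def similarity_row(features, idx):
    """Cosine similarity of row `idx` against every row of `features`."""
    X = np.asarray(features, dtype=np.float32)
    norms = np.linalg.norm(X, axis=1)
    norms[norms == 0] = 1.0  # avoid 0/0 for rows with no genre set
    return (X @ X[idx]) / (norms * norms[idx])

# Tiny illustrative feature matrix: rows 0 and 1 are identical, row 2 is disjoint.
demo = np.array([[1, 1, 0],
                 [1, 1, 0],
                 [0, 0, 1]])
row = similarity_row(demo, 0)
print(row.shape)  # (3,)
```

The result has one entry per movie and could take the place of `cosine_sim[idx]` in a lookup, trading one matrix-vector product per query for the memory of the full matrix.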


## Content-Based Recommendation Function

The function below returns the top-N most similar movies for a given title.  
It ranks candidates by **genre similarity** scaled by a **popularity weight**: `similarity * (1 + alpha * pop_norm)`, where `pop_norm` is the normalized rating count.
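
The effect of `alpha` is easiest to see on a few made-up candidates. The sketch below applies the same scoring expression as the function to three hypothetical movies; the numbers and the helper `rank` are assumptions for illustration only.

```python
import numpy as np

def rank(sims, pop, alpha):
    """Candidate indices ordered from best to worst combined score."""
    combined = sims * (1.0 + alpha * pop)
    return np.argsort(combined)[::-1].tolist()

# Three hypothetical candidates: the most similar one is the least popular.
sims = np.array([1.0, 0.95, 0.9])  # genre similarity to the query movie
pop = np.array([0.2, 0.8, 0.95])   # normalized popularity in [0, 1]

print(rank(sims, pop, alpha=0.0))  # [0, 1, 2]  (pure genre similarity)
print(rank(sims, pop, alpha=1.0))  # [2, 1, 0]  (popularity reverses the order)
```

Because both inputs lie in [0, 1], the combined score lies in [0, 1 + alpha], and a movie with zero genre similarity always scores 0 however popular it is.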

In [9]:
# Prepare mapping from title to DataFrame index
titles = movies_encoded['title'].reset_index(drop=True)
title_to_idx = pd.Series(titles.index, index=titles.str.lower()).to_dict()

# Recommendation function
def get_similar_movies_weighted(title, n=10, alpha=1.0, min_year=None):
    """
    Return top-n movies similar to the given title, weighted by popularity.
    
    Parameters:
    title (str): Movie title (case-insensitive)
    n (int): Number of recommendations
    alpha (float): Popularity weight factor
    min_year (int, optional): Keep only movies released in or after this year
    
    Returns:
    List of tuples: (movie title, combined_score, genre_similarity, popularity_score)
    """
    key = title.lower()
    if key not in title_to_idx:
        # attempt partial match
        matches = [t for t in titles.str.lower() if key in t]
        if len(matches) == 0:
            raise KeyError(f"Title '{title}' not found (exact or partial).")
        key = matches[0]
    idx = title_to_idx[key]
    
    sims = cosine_sim[idx]  # genre similarity array
    pop = movies_encoded['pop_norm'].values  # normalized popularity
    
    # Combined score
    combined = sims * (1.0 + alpha * pop)
    
    # Optional year filter
    indices = np.arange(len(titles))
    if min_year is not None:
        mask = movies_encoded['year'].fillna(0).values >= min_year
        indices = indices[mask]
        combined = combined[mask]
    
    # Get top-n recommendations excluding the movie itself
    sorted_idx = np.argsort(combined)[::-1]
    results = []
    for i in sorted_idx:
        orig_idx = indices[i] if min_year is not None else i
        if orig_idx == idx:
            continue
        results.append((
            titles.iloc[orig_idx], 
            float(combined[i]), 
            float(sims[orig_idx]), 
            float(movies_encoded.iloc[orig_idx]['pop_norm'])
        ))
        if len(results) >= n:
            break
    return results


## Example: Get Top-5 Recommendations

In [11]:
recommendations = get_similar_movies_weighted('Toy Story (1995)', n=5, alpha=1.0)
for rec in recommendations:
    print(rec)

('Monsters, Inc. (2001)', 1.9059445065408471, 0.9999999999999999, 0.9059445065408472)
('Toy Story 2 (1999)', 1.9025071352897738, 0.9999999999999999, 0.9025071352897741)
('Antz (1998)', 1.8299488909208368, 0.9999999999999999, 0.8299488909208371)
("Emperor's New Groove, The (2000)", 1.7747909399818165, 0.9999999999999999, 0.7747909399818168)
('Shrek (2001)', 1.7646133713151393, 0.9128709291752769, 0.9330370974890827)

