# Simple Content-Based Movie Recommendation in Jupyter

This notebook walks through the steps to build a basic content-based recommender system:

1. **Load the Dataset**  
2. **Preprocess Movie Overviews**  
3. **Vectorize (TF-IDF)**  
4. **Compute Similarities**  
5. **Recommend Top Movies**  

We'll use a small dataset (`movies.csv` with 500 rows) containing columns:  
- **title**  
- **overview**  
- **original_language**  
- **vote_count**  
- **vote_average**  

**Goal**: Given a user query describing their movie tastes, return the most similar movies.


In [32]:
# ---
# STEP 0: Import Libraries
# ---

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Suppress all warnings to keep the notebook output clean
import warnings
warnings.filterwarnings('ignore')


In [23]:
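# Downloads every NLTK resource, which is large. This notebook only needs the
# tokenizer models and the WordNet data (typically 'punkt' and 'wordnet';
# exact resource names can vary by NLTK version).
nltk.download('all')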

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_r

True

## STEP 1: Load Dataset

We'll read in the `movies.csv` file (500 rows).

In [20]:
# ---
# Load the dataset
# ---

df = pd.read_csv("movies.csv")

print("Data Shape:", df.shape)

# Check for null values in each column
print(df.isnull().sum())

Data Shape: (500, 5)
title                0
overview             3
original_language    0
vote_count           0
vote_average         0
dtype: int64


Three rows have no `overview`. We'll **drop** those rows to keep things simple.



In [21]:
df.dropna(subset=['overview'], inplace=True)
df.reset_index(drop=True, inplace=True)

print("Data Shape after dropping NaNs:", df.shape)
df.head()

Data Shape after dropping NaNs: (497, 5)


| | title | overview | original_language | vote_count | vote_average |
|---|---|---|---|---|---|
| 0 | 7 Days in Entebbe | In 1976, four hijackers take over an Air Franc... | en | 234 | 5.8 |
| 1 | The Scorpion King: Quest for Power | When he is betrayed by a trusted friend, Matha... | en | 109 | 4.7 |
| 2 | Disobedience | A woman learns about the death of her Orthodox... | en | 530 | 6.9 |
| 3 | Wolf | Publisher Will Randall becomes a werewolf and ... | en | 509 | 6.1 |
| 4 | Flypaper | A man caught in the middle of two simultaneous... | en | 446 | 6.3 |

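To see what the two calls above do, here is a minimal sketch on an invented three-row frame (the titles and overviews are made up for illustration). `dropna(subset=...)` removes only the rows whose `overview` is missing, and `reset_index(drop=True)` renumbers the remaining rows from 0 so that row labels match row positions.

```python
import pandas as pd

# Invented toy data: the second row has no overview.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "overview": ["A crew explores space.", None, "Two friends fall in love."],
})

# Drop rows with a missing overview; the original labels keep a gap.
cleaned = toy.dropna(subset=["overview"])
print(cleaned.index.tolist())

# Renumber rows so labels are contiguous again.
cleaned = cleaned.reset_index(drop=True)
print(cleaned.index.tolist())
print(cleaned["title"].tolist())
```

Contiguous labels are convenient later on, when row positions in the TF-IDF matrix are mapped back to rows of `df`.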

## STEP 2: Preprocess the 'overview' Text

We'll apply four preprocessing steps:
- Convert text to lowercase.  
- Replace punctuation with spaces.  
- Tokenize the text.  
- Lemmatize each token, keeping only alphabetic tokens.  

This helps our TF-IDF vectorizer handle words consistently.


In [25]:
def preprocess_text(text):
    """
    Convert text to lowercase, remove punctuation, tokenize, and lemmatize.
    Example of a more in-depth preprocessing approach.

    Args:
        text (str): The raw text (e.g., movie overview).

    Returns:
        str: The processed text.
    """
    # 1) Lowercase
    text = text.lower()

    # 2) Remove punctuation (replace anything not a word/whitespace with a space)
    text = re.sub(r'[^\w\s]', ' ', text)

    # 3) Tokenize
    tokens = word_tokenize(text)

    # 4) Lemmatize each token; non-alphabetic tokens (e.g. years) are dropped
    lemmatizer = WordNetLemmatizer()
    lem_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]

    # 5) Rejoin tokens back into a single string
    processed_text = " ".join(lem_tokens)

    return processed_text

df['overview'] = df['overview'].apply(preprocess_text)
df.head()

| | title | overview | original_language | vote_count | vote_average |
|---|---|---|---|---|---|
| 0 | 7 Days in Entebbe | in four hijacker take over an air france airpl... | en | 234 | 5.8 |
| 1 | The Scorpion King: Quest for Power | when he is betrayed by a trusted friend mathay... | en | 109 | 4.7 |
| 2 | Disobedience | a woman learns about the death of her orthodox... | en | 530 | 6.9 |
| 3 | Wolf | publisher will randall becomes a werewolf and ... | en | 509 | 6.1 |
| 4 | Flypaper | a man caught in the middle of two simultaneous... | en | 446 | 6.3 |

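The first overview now starts with "in four hijacker": the year and the comma are gone. The sketch below illustrates why, using only the standard library. It is a simplification, not the notebook's function: plain `str.split` stands in for `word_tokenize`, no lemmatization is applied, and the sample sentence is invented.

```python
import re

def basic_clean(text):
    """Lowercase, replace punctuation with spaces, keep alphabetic tokens only."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # same pattern as in preprocess_text
    return [tok for tok in text.split() if tok.isalpha()]

tokens = basic_clean("In 1976, four hijackers seize a plane.")
print(tokens)
```

Tokens that contain digits fail `isalpha()`, so years and other numbers never reach the vectorizer.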

## STEP 3: Build the TF-IDF Matrix

We will:
1. Create a TfidfVectorizer.
2. Fit it on the movie overviews.
3. Transform the overviews into a TF-IDF matrix.

This matrix will have one row per movie and one column per feature (word), capturing how "important" each word is in a movie's overview.


In [28]:
# Create and fit TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = vectorizer.fit_transform(df['overview'])

# check the shape of the TF-IDF matrix
print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (497, 5000)

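To make "importance" concrete, the sketch below computes TF-IDF weights by hand for three invented overviews. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row) and leaves out stop-word removal and the `max_features` cap, so it illustrates the weighting idea rather than reproducing the vectorizer.

```python
import math
from collections import Counter

# Invented, already-preprocessed toy overviews.
docs = ["space crew alien", "space station crew", "love story"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many overviews each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: rarer terms get larger values.
idf = {term: math.log((1 + n_docs) / (1 + freq)) + 1 for term, freq in doc_freq.items()}

def tfidf_row(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized overview."""
    counts = Counter(tokens)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

In the first toy overview, "alien" appears in only one document and therefore gets a larger weight than "space" or "crew", which each appear in two.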

## STEP 4: Define a Recommendation Function

Given a user query string:
1. Transform the query using the *same* vectorizer.
2. Compute cosine similarity with each movie in our dataset.
3. Sort by similarity and return the top results.


In [29]:
def recommend_movies(user_query, df, tfidf_matrix, vectorizer, top_n=5):
    """
    Recommends top_n movies based on the user's text input (user_query).

    Args:
    - user_query (str): A description of what the user wants to watch.
    - df (pd.DataFrame): DataFrame containing movies (with 'title', 'overview').
    - tfidf_matrix (sparse matrix): TF-IDF features for all movies.
    - vectorizer (TfidfVectorizer): Fitted TF-IDF vectorizer.
    - top_n (int): How many recommendations to return.

    Returns:
    - pd.DataFrame: A subset of df, containing only the top_n movies + similarity scores.
    """
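    # 1) Vectorize the user query.
    #    Note: the query is only lowercased, not passed through preprocess_text,
    #    so inflected query words (e.g. plurals) may not match the lemmatized overviews.
    query_vec = vectorizer.transform([user_query.lower()])

    # 2) Compute cosine similarity between query and all movie overviews.
    #    linear_kernel is a plain dot product; it equals cosine similarity here
    #    because TfidfVectorizer L2-normalizes its vectors by default.
    cosine_sim = linear_kernel(query_vec, tfidf_matrix).flatten()  # shape: (num_movies,)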

    # 3) Sort movies by similarity score, descending
    top_indices = cosine_sim.argsort()[::-1][:top_n]
    top_scores = cosine_sim[top_indices]

    # 4) Build a results DataFrame
    results = df.iloc[top_indices].copy()
    results['similarity'] = top_scores
    results = results[['title', 'similarity', 'overview']]  # Keep columns we care about

    # 5) Sort by similarity in descending order
    results.sort_values('similarity', ascending=False, inplace=True)

    return results

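The function relies on two ideas that are easy to check in isolation: for unit-length vectors the dot product equals cosine similarity, and `argsort()[::-1][:top_n]` picks the highest-scoring rows. The following is a minimal NumPy sketch with an invented 3-movie, 4-term weight matrix; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np

# Invented weights; rows are scaled to unit length, as TfidfVectorizer
# does with its default norm="l2".
movies = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
])
movies = movies / np.linalg.norm(movies, axis=1, keepdims=True)

query = np.array([1.0, 1.0, 0.0, 0.0])
query = query / np.linalg.norm(query)

# Plain dot products (what a linear kernel computes).
dot_scores = movies @ query

# Full cosine formula, dividing by both vector lengths.
cosine_scores = (movies @ query) / (np.linalg.norm(movies, axis=1) * np.linalg.norm(query))

# Indices of the top_n highest scores, best first.
top_n = 2
top_indices = dot_scores.argsort()[::-1][:top_n]
print(top_indices, dot_scores[top_indices])
```

Because both sides are already normalized, the division in the cosine formula is by 1, which is why the cheaper dot product is enough.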

## STEP 5: Test the System

We'll try a sample query and see which movies are suggested.

### Example Query:
"I love thrilling action movies set in space, with a comedic twist."

In [33]:
# Define your query
user_query = "I love thrilling action movies set in space, with a comedic twist."

# Get the top recommendations
recommended_df = recommend_movies(user_query, df, tfidf_matrix, vectorizer, top_n=5)

# Display them
print("User Query:", user_query)
print("="*70)
recommended_df

User Query: I love thrilling action movies set in space, with a comedic twist.


| | title | similarity | overview |
|---|---|---|---|
| 337 | Killer Klowns from Outer Space | 0.145559 | alien who look like clown come from outer spac... |
| 261 | 13 Assassins | 0.131812 | a bravado period action film set at the end of... |
| 300 | Ender's Game | 0.112121 | based on the classic novel by orson scott card... |
| 128 | Life | 0.10379 | the six member crew of the international space... |
| 163 | Doctor Zhivago | 0.09381 | two protagonist love each other but because of... |

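One thing to keep in mind when reading these scores: `recommend_movies` only lowercases the query, while the overviews were lemmatized before the vectorizer was fitted. A query word in an inflected form (such as "movies") is therefore unlikely to be in the vocabulary and contributes nothing. The sketch below shows the effect on two invented, already-lemmatized overviews; the corpus and queries are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Invented overviews, written as they would look after lemmatization.
overviews = ["alien invade a small town", "two friend fall in love"]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(overviews)

# Plural form: not in the vocabulary, so the query vector is all zeros.
raw_scores = linear_kernel(vectorizer.transform(["aliens"]), matrix).flatten()

# Lemma form: matches the vocabulary term "alien".
lemma_scores = linear_kernel(vectorizer.transform(["alien"]), matrix).flatten()

print("raw query:  ", raw_scores)
print("lemma query:", lemma_scores)
```

Passing the query through the same `preprocess_text` function before `vectorizer.transform` would be one way to close this gap; doing so would change the scores shown above.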

## Observing Results

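The table above shows the top 5 recommended movies with their similarity scores.  
The scores are low (the best is about 0.15), and TF-IDF rewards only word overlap: judging by the visible overview snippets, *Killer Klowns from Outer Space* and *Life* match on "space", *13 Assassins* on "action" and "set", and *Doctor Zhivago* on "love". Check whether each overview actually fits the "action + space + comedy" intent of the query.

Feel free to try different queries, for example:
- `"romantic comedy with a quirky lead protagonist"`
- `"dark horror film set in a haunted house"`
- `"action-packed superhero movie with lots of humor"`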

---

## Salary Expectation (Mandatory)

**Monthly Salary Expectation**: \$1600-\$2400 per month

---

**End of Notebook**  
Thank you for reading.
