## Movie Recommendation System

In [1]:
import pandas as pd

movies = pd.read_csv("movies.csv")

In [2]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


# Title Cleaning Function

This code preprocesses movie titles by removing special characters, keeping only alphanumeric characters and spaces.

## How It Works

1. **Define Cleaning Function**: Uses regex to remove all non-alphanumeric characters (except spaces)
2. **Apply to Dataset**: Creates a new column `clean_title` by applying the function to all movie titles

## Purpose

- Standardizes titles for better matching (e.g., "Spider-Man" becomes "SpiderMan")
- Removes punctuation, symbols, and special characters
- Reduces noise before TF-IDF vectorization, so that punctuation differences do not split otherwise identical titles into different terms

**Example**: "The Matrix: Reloaded" → "The Matrix Reloaded"
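As a quick check of the behaviour described above, the sketch below applies the same character class to a few titles. The helper name `strip_special` and the sample titles are illustrative assumptions, not taken from the dataset. One side effect worth noting: the pattern also drops non-ASCII letters, which may not be desirable for international titles.

```python
import re

def strip_special(title):
    # Keep ASCII letters, digits, and spaces; remove every other character.
    return re.sub(r"[^a-zA-Z0-9 ]", "", title)

# Hypothetical sample titles chosen to show different cases.
samples = ["Spider-Man", "The Matrix: Reloaded", "Toy Story (1995)", "Amélie"]
cleaned = [strip_special(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The accented "é" in the last sample is removed rather than transliterated, so a search for that title would have to be cleaned the same way to match.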

In [3]:
import re

def clean_title(title):
    # Keep only ASCII letters, digits, and spaces; drop everything else.
    return re.sub(r"[^a-zA-Z0-9 ]", "", title)

In [5]:
movies["clean_title"] = movies["title"].apply(clean_title)

In [6]:
movies

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
62418,209157,We (2018),Drama,We 2018
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001


# TF-IDF Vectorization Explained

This code uses scikit-learn's `TfidfVectorizer` to convert the cleaned movie titles into a sparse matrix of numerical features. Representing titles as numeric vectors lets them be compared mathematically, which is what the search function later in this notebook does with cosine similarity.

| Component | Purpose | Description |
|-----------|---------|-------------|
| `TfidfVectorizer` | The Tool | Calculates the Term Frequency-Inverse Document Frequency for words, weighting important (rare) words higher. |
| `ngram_range=(1,2)` | Feature Scope | Configures the vectorizer to consider both single words (unigrams) and sequential pairs of words (bigrams) as features. |
| `fit_transform()` | Execution | First, it builds the vocabulary from all titles (fit), then calculates and returns the TF-IDF matrix (transform). |
| `tfidf` | The Output | A sparse matrix where each row represents a movie and each column represents a word/bigram feature, populated with TF-IDF scores. |
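To make `ngram_range=(1,2)` concrete, here is a minimal pure-Python sketch of the features a single title would contribute. It is an illustration rather than scikit-learn's implementation: the function name `title_features` is hypothetical, and the tokenization imitates the library's documented defaults (lowercasing, and tokens of two or more word characters).

```python
import re

def title_features(title):
    # Lowercase, then keep tokens of at least two word characters
    # (this imitates TfidfVectorizer's default token pattern).
    tokens = re.findall(r"\b\w\w+\b", title.lower())
    # Bigrams are pairs of adjacent tokens joined by a space.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = title_features("Toy Story 1995")
print(features)
```

Under these assumptions a one-letter word such as "A" in "A Girl Thing 2001" produces no feature at all, so it cannot influence the similarity score.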

## Example: How TF-IDF Works

The TF-IDF score for a word in a document is the product of two measures: **Term Frequency (TF)** and **Inverse Document Frequency (IDF)**.

$$\text{TFIDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

### 1. Term Frequency (TF)

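This measures how frequently a term $t$ appears in a document $d$. The worked example below divides the term's count by the number of words in the title. By default, `TfidfVectorizer` instead uses the raw count and then L2-normalizes each row of the resulting matrix.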

### 2. Inverse Document Frequency (IDF)

This measures how distinctive a term is by weighing down terms that appear in many documents (like "The" or "A") and boosting rarer terms (like "Jedi" or "Inception"). The worked example below uses the base-10 logarithm.

$$\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$$
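For comparison, scikit-learn's `TfidfVectorizer` with its default `smooth_idf=True` setting uses a smoothed variant based on the natural logarithm, where $n$ is the total number of documents and $\text{df}(t)$ is the number of documents containing $t$. Scores produced by the library will therefore differ from a hand calculation with the textbook formula:

$$\text{IDF}_{\text{smooth}}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$$

The added constants keep the value finite for unseen terms and ensure that a term appearing in every document still receives a non-zero weight.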

## Calculation Example

Imagine we have two clean titles (terms are matched case-insensitively, so "The" and "the" count as the same term):

| ID | Title (d) |
|----|-----------|
| 1 | The Cat in the Hat |
| 2 | Cat on a Hot Tin Roof |

| Term (t) | TF in Title 1 | Documents with Term | Total Documents | IDF Score |
|----------|---------------|---------------------|-----------------|-----------|
| The | $2 / 5 = 0.4$ | 1 | 2 | $\log(2/1) \approx 0.30$ |
| Cat | $1 / 5 = 0.2$ | 2 | 2 | $\log(2/2) = 0.00$ |
| Hat | $1 / 5 = 0.2$ | 1 | 2 | $\log(2/1) \approx 0.30$ |

**The TF-IDF Score for "Hat" in Title 1 is:**

$$\text{TFIDF}(\text{Hat, Title 1}) = \text{TF} \times \text{IDF} = 0.2 \times 0.30 \approx 0.06$$

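**The TF-IDF Score for "The" in Title 1 is:**

$$\text{TFIDF}(\text{The, Title 1}) = \text{TF} \times \text{IDF} = 0.4 \times 0.30 \approx 0.12$$

In this two-title corpus "The" occurs in only one title, so IDF does not penalize it, and its higher term frequency gives it the larger score. "Cat" occurs in both titles, so its IDF and its TF-IDF score are 0. Across the full dataset, "The" appears in many titles and would receive a much lower IDF than a rare word like "Hat".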
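The hand calculation above can be reproduced with a few lines of standard-library Python. This is a sketch of the textbook formula used in this example (length-normalized TF, base-10 IDF), not of scikit-learn's smoothed and normalized variant. The function names are illustrative.

```python
import math

titles = {1: "The Cat in the Hat", 2: "Cat on a Hot Tin Roof"}
# Lowercase and split on whitespace so "The" and "the" count as one term.
docs = {doc_id: text.lower().split() for doc_id, text in titles.items()}

def tf(term, doc_id):
    # Term count divided by the number of words in the title.
    words = docs[doc_id]
    return words.count(term) / len(words)

def idf(term):
    # Assumes the term occurs in at least one title (otherwise division by zero).
    containing = sum(1 for words in docs.values() if term in words)
    return math.log10(len(docs) / containing)

def tfidf_score(term, doc_id):
    return tf(term, doc_id) * idf(term)

for term in ("the", "cat", "hat"):
    print(term, round(tfidf_score(term, 1), 3))
```

Rounded to two decimals, the scores for "hat" and "the" in Title 1 match the 0.06 and 0.12 computed by hand, and "cat" scores 0 because it appears in both titles.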

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

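# Use single words (unigrams) and adjacent word pairs (bigrams) as features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# Learn the vocabulary and build the sparse TF-IDF matrix (one row per movie).
tfidf = vectorizer.fit_transform(movies["clean_title"])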

# Movie Search Function Using Cosine Similarity

## How It Works

1. **Clean Input**: Preprocesses the search query
2. **Vectorize Query**: Converts the cleaned query into a TF-IDF vector using the already-fitted vectorizer
3. **Calculate Similarity**: Computes cosine similarity between query and all movies
4. **Find Top Matches**: Gets indices of 5 highest similarity scores
5. **Retrieve Results**: Returns the top 5 matching movies in descending order
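Steps 4 and 5 amount to a top-k selection over the similarity scores. The pure-Python sketch below shows the idea on a made-up list of scores. The notebook itself works on a NumPy array, and the helper name `top_k` is an assumption for illustration.

```python
def top_k(scores, k=5):
    # Sort all indices by score, highest first, and keep the first k.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical similarity scores for seven movies.
scores = [0.10, 0.80, 0.00, 0.45, 0.30, 0.95, 0.20]
best = top_k(scores)
print(best)
```

Sorting every score is the simplest approach. For a large catalogue, a partial selection followed by a sort of only the k winners does less work, at the cost of one extra step.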

## Cosine Similarity

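Measures the cosine of the angle between two vectors. In general it ranges from -1 to 1, but TF-IDF vectors have no negative entries, so here it ranges from 0 (no shared terms) to 1 (same direction).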

$$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$$

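**Example**: Searching "The Dark Knight" ranks titles by how many words and word pairs they share with the query, so films with "Dark Knight" in the title score highest. The match is on title text only, not on plot or franchise.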
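The formula above can be written out directly for small dense vectors. This is a minimal sketch for intuition only. The notebook uses scikit-learn's `cosine_similarity` on sparse matrices, and the vectors below are made up.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

query = [1.0, 1.0, 0.0]
same_direction = [2.0, 2.0, 0.0]   # same terms, different magnitude
no_overlap = [0.0, 0.0, 3.0]       # shares no terms with the query

print(cosine(query, same_direction))
print(cosine(query, no_overlap))
```

Because the measure depends only on direction, a longer title that uses the same terms in the same proportions is treated as just as similar as a short one.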

In [8]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

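def search(title):
    # Clean the query the same way the titles were cleaned.
    title = clean_title(title)
    # Project the query into the fitted TF-IDF vocabulary.
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # argpartition finds the top 5 but leaves them unordered, so sort those 5 by score.
    top = np.argpartition(similarity, -5)[-5:]
    indices = top[np.argsort(similarity[top])[::-1]]
    return movies.iloc[indices]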

# Interactive Movie Search Widget

This code creates an interactive search interface using Jupyter widgets that displays matching movies as you type.

## Components

| Component | Purpose |
|-----------|---------|
| `widgets.Text()` | Creates an input box for typing movie titles |
| `widgets.Output()` | Creates a display area for search results |
| `on_type()` | Callback function triggered when input changes |
| `observe()` | Monitors the input widget for value changes |

## How It Works

1. User types in the text input box
2. The `on_type()` function is triggered on every keystroke
3. If the query is longer than 5 characters, it calls `search()`
4. Results are displayed in the output widget below the input box
5. Previous results are cleared before showing new ones

**Note**: The widget provides real-time search as you type, updating results dynamically.
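The callback logic in steps 2 and 3 can be exercised without a running widget. The sketch below assumes only that the callback receives a dict whose `"new"` key holds the updated text. `handle_change` and `fake_search` are hypothetical stand-ins, not part of ipywidgets or of this notebook.

```python
def handle_change(change, search_fn, min_chars=5):
    # `change` mimics the dict that observe() passes to a callback;
    # only its "new" key (the updated text) is used here.
    title = change["new"]
    if len(title) > min_chars:
        return search_fn(title)
    return None  # query too short: nothing to show

def fake_search(title):
    # Stand-in for the real search function.
    return f"results for {title}"

short = handle_change({"new": "Toy S"}, fake_search)
long_enough = handle_change({"new": "Toy St"}, fake_search)
print(short, long_enough)
```

The threshold is strict: a five-character query such as "Toy S" does not trigger a search, while the six-character "Toy St" does.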

In [None]:
import ipywidgets as widgets
from IPython.display import display

movies_input=widgets.Text(
    value="Toy Story",
    description='Movie Title:',
    disabled=False,
)

movies_list=widgets.Output()

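def on_type(data):
    # `data` is the change dict passed by observe(); "new" holds the updated text.
    with movies_list:
        movies_list.clear_output()
        title=data["new"]
        # Only search once the query is longer than 5 characters.
        if len(title)>5:
            display(search(title))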

movies_input.observe(on_type, names="value")

display(movies_input,movies_list)
