# Exercise: Build a Simplified Search Engine

In this notebook, you will build a simplified search engine using the [Movies Metadata](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv) dataset.

### Objectives:

1. Learn to preprocess text data from a real dataset.
2. Implement a search function that retrieves the top-N results for a given query.
3. Explore and evaluate different text representations and similarity measures (e.g., TF-IDF vectors compared with cosine similarity).

### Dataset:

You will use the `movies_metadata.csv` file. This dataset contains metadata about movies, including titles, overviews, genres, and other information.

---
### Setup:
Ensure you have the required libraries installed. You may need the following:

```bash
pip install pandas scikit-learn sentence-transformers
```


In [2]:
!pip install pandas scikit-learn sentence-transformers





In [7]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from sentence_transformers import SentenceTransformer, util




### Step 1: Load and Explore the Dataset

Load the `movies_metadata.csv` file and inspect the first few rows. Focus on the `title` and `overview` columns, which will be key for building your search engine.

In [4]:
# Load the dataset
file_path = 'movies_metadata.csv'  # Update with your file path
df = pd.read_csv(file_path)

# Display the first few rows
df.head()



| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | | Toy Story | False | 7.7 | 5415.0 |
| 1 | False | | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
| 2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |
| 3 | False | | 16000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | 1995-12-22 | 81452156.0 | 127.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 |
| 4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [{'id': 35, 'name': 'Comedy'}] | | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 1995-02-10 | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 |


### Step 2: Preprocess the Data

1. Select relevant columns: `title` and `overview`.
2. Handle missing values: drop rows without a title and fill missing overviews with an empty string.
3. Combine or transform the data as needed for search.

In [5]:
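# Preprocess the data: keep the two text columns and drop rows without a title.
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below.
df = df[['title', 'overview']].dropna(subset=['title']).copy()
# Replace missing overviews with an empty string so vectorization does not fail
df['overview'] = df['overview'].fillna('')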

# Inspect the processed data
df.head()

| | title | overview |
|---|---|---|
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... |
| 2 | Grumpier Old Men | A family wedding reignites the ancient feud be... |
| 3 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... |
| 4 | Father of the Bride Part II | Just when George Banks has recovered from his ... |


### Step 3: Implement the Search Function

Write a function `search(query, n, engine='tfidf')` that retrieves the top-N results for a given query using the specified search engine:

1. **TF-IDF**: Use `TfidfVectorizer` to compute similarity scores.
2. **Sentence Transformers** (optional): Use pre-trained embeddings for semantic search.

#### Hint:
- Compute the similarity between the query and all movie overviews.
- Sort the results by similarity and return the top-N titles.

In [14]:
def search(query, n=5, engine='tfidf'):
    overviews = df['overview'].tolist()
    titles = df['title'].tolist()

    if engine == 'tfidf':
        # TF-IDF vectorization
        vectorizer = TfidfVectorizer(stop_words='english')
        tfidf_matrix = vectorizer.fit_transform(overviews)
        query_vector = vectorizer.transform([query])
        # Compute cosine similarity
        similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    else:
        raise ValueError("Unsupported engine. Only 'tfidf' is available.")

    # Get top-N indices sorted by similarity
    top_indices = similarities.argsort()[::-1][:n]
    # Retrieve corresponding titles
    top_titles = [titles[i] for i in top_indices]

    return top_titles
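
The function above implements only the TF-IDF engine. For the optional Sentence Transformers engine, one possible sketch follows. The helper names `load_default_encoder` and `embed_search` are hypothetical, and the `all-MiniLM-L6-v2` checkpoint is an assumption (the notebook does not name a model). The encoder is passed in as a plain function so that document embeddings can be computed once and reused across queries.

```python
import numpy as np


def load_default_encoder(model_name="all-MiniLM-L6-v2"):
    """Return an encode function backed by a pre-trained model (assumed checkpoint)."""
    # Imported lazily so the ranking logic below works without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts, convert_to_numpy=True)


def embed_search(query, titles, doc_embeddings, encode, n=5):
    """Rank documents by cosine similarity between query and document embeddings."""
    query_vec = np.asarray(encode([query]), dtype=float)[0]
    docs = np.asarray(doc_embeddings, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query_vec)
    similarities = docs @ query_vec / np.where(norms == 0, 1.0, norms)
    top_indices = np.argsort(similarities)[::-1][:n]
    return [titles[i] for i in top_indices]


# Possible usage in this notebook (downloads the model on first run):
# encode = load_default_encoder()
# doc_embeddings = encode(df['overview'].tolist())  # compute once, reuse
# embed_search('space adventure', df['title'].tolist(), doc_embeddings, encode)
```

Encoding every overview can take a while on CPU, so it may be worth caching `doc_embeddings` rather than recomputing them inside `search`.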


### Step 4: Test Your Function

Use different queries to test your search function. Verify the quality of the results and experiment with varying `n` values.

In [15]:
# Example usage of the search function
query = 'space adventure'
top_n = 5
results = search(query, n=top_n, engine='tfidf')
print(f"Top {top_n} results for query '{query}':", results)

Top 5 results for query 'space adventure': ['Manhunt in Space', 'Hail Columbia!', 'Space Station 3D', 'Space Pirate Captain Harlock', 'The Visit: An Alien Encounter']
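
When trying many queries, note that `search` refits the `TfidfVectorizer` on every call. A possible refactoring, sketched here with the hypothetical class name `TfidfIndex`, fits the vectorizer once and reuses the matrix. It also returns the similarity scores, which can help when judging result quality. The toy corpus is made up for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class TfidfIndex:
    """Hypothetical helper: fit TF-IDF once, then answer many queries."""

    def __init__(self, titles, overviews):
        self.titles = list(titles)
        self.vectorizer = TfidfVectorizer(stop_words="english")
        # Fitting is the expensive step, so it happens only here.
        self.matrix = self.vectorizer.fit_transform(overviews)

    def search(self, query, n=5):
        query_vector = self.vectorizer.transform([query])
        scores = cosine_similarity(query_vector, self.matrix).flatten()
        top = scores.argsort()[::-1][:n]
        # Return (title, score) pairs so result quality can be inspected.
        return [(self.titles[i], float(scores[i])) for i in top]


# Toy corpus for illustration; in the notebook, pass df['title'] and df['overview'].
index = TfidfIndex(
    ["Moon Trip", "Love Story", "Deep Sea"],
    [
        "astronauts travel through space to the moon",
        "two strangers fall in love in paris",
        "divers explore the deep ocean",
    ],
)
print(index.search("space moon", n=2))
```

A score of 0.0 means the query shares no non-stop-word terms with that overview, which is a quick way to spot queries the TF-IDF engine cannot answer.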


### Step 5: Plot query and top documents

Plot the query together with the top retrieved documents (or all documents) to visualize the retrieval.

In [None]:
# TODO
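
One possible way to approach this step, offered as a sketch rather than the intended solution: project the TF-IDF vectors to two dimensions with `TruncatedSVD` (which accepts sparse input) and draw a scatter plot. The function names `project_2d` and `plot_retrieval` are hypothetical, and `matplotlib` is assumed to be installed even though it is not part of the setup cell.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def project_2d(query, docs):
    """Return 2-D coordinates for the query (row 0) and each document."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(docs)
    query_vector = vectorizer.transform([query])
    # Fit the projection on the documents, then map the query into the same space.
    svd = TruncatedSVD(n_components=2, random_state=0)
    doc_xy = svd.fit_transform(doc_matrix)
    query_xy = svd.transform(query_vector)
    return np.vstack([query_xy, doc_xy])


def plot_retrieval(query, titles, docs):
    """Scatter the query and documents in the projected space."""
    import matplotlib.pyplot as plt  # assumed installed; not in the setup cell

    coords = project_2d(query, docs)
    fig, ax = plt.subplots()
    ax.scatter(coords[1:, 0], coords[1:, 1], label="documents")
    ax.scatter(coords[0, 0], coords[0, 1], marker="*", s=200, label="query")
    for (x, y), title in zip(coords[1:], titles):
        ax.annotate(title, (x, y))
    ax.legend()
    return fig
```

Two SVD components keep only a small part of the variance in a sparse TF-IDF matrix, so distances in the plot are a rough guide at best. Plotting only the top-N results (plus, perhaps, a random sample of other movies) tends to be more readable than plotting the whole dataset.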

### Step 6: Compare engines

Compare the results of the same query using different engines.

Note: There are multiple ways to do this. The simplest is to compare the intersection of the top-K results (K = 5, 10, or 30) from each engine.

In [None]:
#TODO
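
The top-K intersection mentioned in the note can be sketched as follows. The function name `top_k_overlap` is hypothetical, the Jaccard ratio is an addition not mentioned in the note, and the two rankings are placeholders rather than real engine output.

```python
def top_k_overlap(results_a, results_b, k):
    """Compare two ranked lists by the intersection of their top-k items."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    shared = top_a & top_b
    union = top_a | top_b
    # Jaccard index: size of the intersection over size of the union.
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Placeholder rankings; replace with the titles returned by each engine.
engine_a = ["a", "b", "c", "d", "e"]
engine_b = ["b", "f", "d", "a", "g"]
for k in (1, 3, 5):
    shared, jaccard = top_k_overlap(engine_a, engine_b, k)
    print(k, sorted(shared), round(jaccard, 2))
```

Because this measure ignores rank order inside the top K, two engines that return the same titles in a different order still count as identical.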

### Questions to Think About:

1. How does the choice of similarity measure affect the results?
2. What happens when you use different queries?
3. Can you improve the results by incorporating additional metadata, such as genres?
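
For question 3, a starting point might look like the sketch below. In the sample output above, the `genres` column appears to hold a stringified list of dictionaries, so it needs parsing before use. The helper names `genre_names` and `build_search_text` are hypothetical, and the example string is assembled from values visible in that output. Note that Step 2 drops the `genres` column, so the preprocessing would have to keep it.

```python
import ast


def genre_names(raw):
    """Extract genre names from a stringified list of dicts; '' if unparseable."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Covers malformed strings and non-string values such as NaN.
        return ""
    if not isinstance(parsed, list):
        return ""
    return " ".join(g["name"] for g in parsed if isinstance(g, dict) and "name" in g)


def build_search_text(overview, raw_genres):
    """Append genre names to the overview so TF-IDF can match on them."""
    return f"{overview} {genre_names(raw_genres)}".strip()


# Illustrative value in the format shown by df.head().
example = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(build_search_text("Andy's toys live happily.", example))
```

The combined text can then be passed to the vectorizer in place of the overviews alone. Whether this improves results is something to check empirically, for example by comparing top-K lists before and after.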