# Exercise: Build a Simplified Search Engine

In this notebook, you will build a simplified search engine using the [Movies Metadata](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv) dataset.

### Objectives:

1. Learn to preprocess text data from a real dataset.
2. Implement a search function that retrieves the top-N results for a given query.
3. Explore and evaluate different text representations (e.g., TF-IDF vectors and sentence embeddings) compared using cosine similarity.

### Dataset:

You will use the `movies_metadata.csv` file. This dataset contains metadata about movies, including titles, overviews, genres, and other information.

---
### Setup:
Ensure you have the required libraries installed. You may need the following:

```bash
pip install pandas scikit-learn sentence-transformers
```


In [None]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Optionally, use Sentence Transformers for semantic similarity
# from sentence_transformers import SentenceTransformer, util


### Step 1: Load and Explore the Dataset

Load the `movies_metadata.csv` file and inspect the first few rows. Focus on the `title` and `overview` columns, which will be key for building your search engine.
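
Before preprocessing, it can help to count how many titles and overviews are missing. The sketch below does this on a small toy frame that stands in for the real file; the helper name `summarize_text_columns` and the toy rows are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for movies_metadata.csv; the real file has many more columns.
toy = pd.DataFrame({
    "title": ["Toy Story", None, "Heat"],
    "overview": ["Toys come to life.", "A row without a title.", None],
})

def summarize_text_columns(frame, columns=("title", "overview")):
    """Return the row count and the number of missing values per column."""
    cols = list(columns)
    return {
        "rows": len(frame),
        "missing": frame[cols].isna().sum().to_dict(),
    }

summary = summarize_text_columns(toy)
print(summary)
```

Running the same helper on the loaded `df` should show how many rows Step 2 will need to drop or fill.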

In [None]:
# Load the dataset
file_path = 'movies_metadata.csv'  # Update with your file path
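# low_memory=False reads the file in one pass, which avoids pandas' mixed-dtype warning
df = pd.read_csv(file_path, low_memory=False)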

# Display the first few rows
df.head()

### Step 2: Preprocess the Data

1. Select relevant columns: `title` and `overview`.
2. Handle missing values: drop rows that have no title, and fill missing overviews with an empty string.
3. Combine or transform the data as needed for search.
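
One possible way to combine the columns is to concatenate the title and overview into a single searchable field. The sketch below assumes a new column named `text` and a helper named `build_search_text`; both names are hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd

# Toy rows that mimic the missing-value patterns in the real file.
toy = pd.DataFrame({
    "title": ["Toy Story", None, "Heat"],
    "overview": ["Toys come to life.", "A row without a title.", None],
})

def build_search_text(frame):
    """Keep titled rows, fill missing overviews, and add a combined `text` column."""
    out = frame[["title", "overview"]].dropna(subset=["title"]).copy()
    out["overview"] = out["overview"].fillna("")
    # `text` is a hypothetical column name: title and overview joined into one field.
    out["text"] = (out["title"] + ". " + out["overview"]).str.strip()
    # Reset the index so row positions match index labels after dropping rows.
    return out.reset_index(drop=True)

prepared = build_search_text(toy)
print(prepared)
```

Including the title in the searchable text lets a query match a movie by name as well as by plot, at the cost of giving title words extra weight.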

In [None]:
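# Preprocess the data
# Keep the two text columns and drop rows without a title
df = df[['title', 'overview']].dropna(subset=['title']).copy()
df['overview'] = df['overview'].fillna('')
# Reset the index so positional and label-based lookups agree
df = df.reset_index(drop=True)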

# Inspect the processed data
df.head()

### Step 3: Implement the Search Function

Write a function `search(query, n, engine='tfidf')` that retrieves the top-N results for a given query using the specified search engine:

1. **TF-IDF**: Use `TfidfVectorizer` to compute similarity scores.
2. **Sentence Transformers** (optional): Use pre-trained embeddings for semantic search.

#### Hint:
- Compute the similarity between the query and all movie overviews.
- Sort the results by similarity and return the top-N titles.
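
The hints above can be sketched on a toy corpus as follows. This is a minimal illustration of the TF-IDF route only, not a reference solution: the function name `tfidf_search`, the three made-up movies, and the choice of English stop-word removal are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up movies used only to illustrate the mechanics.
movies = pd.DataFrame({
    "title": ["Star Trek", "Cooking Mama", "Moon Quest"],
    "overview": [
        "A starship crew explores deep space.",
        "A chef opens a small restaurant in Paris.",
        "Astronauts begin a space adventure on the moon.",
    ],
})

# Fit the vectorizer once on all overviews, outside the search function.
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(movies["overview"])

def tfidf_search(query, n=5):
    """Return the titles of the n overviews most similar to the query."""
    query_vec = vectorizer.transform([query])                 # same vocabulary as the documents
    scores = cosine_similarity(query_vec, doc_matrix).ravel()  # one score per movie
    top = np.argsort(scores)[::-1][:n]                         # positions, highest score first
    return movies["title"].iloc[top].tolist()

results = tfidf_search("space adventure", n=2)
print(results)
```

Fitting the vectorizer once and reusing `doc_matrix` keeps each query cheap; refitting inside the function would repeat the most expensive step on every call.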

In [None]:
def search(query, n=5, engine='tfidf'):
    """Search for the top-N movies based on the query.

    Args:
        query (str): The search query.
        n (int): Number of top results to return.
        engine (str): Search engine to use ('tfidf' or 'transformer').

    Returns:
        List[str]: Top-N movie titles matching the query.
    """
    pass


### Step 4: Test Your Function

Use different queries to test your search function. Verify the quality of the results and experiment with varying `n` values.
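
When judging result quality, it may help to look at the scores and not only the titles. With TF-IDF, a query whose words never appear in the corpus gets a score of zero for every document, so the returned "top" titles are arbitrary. The sketch below shows this on a toy corpus; `search_with_scores` and the sample data are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up titles and overviews for illustration.
titles = ["Star Trek", "Cooking Mama", "Moon Quest"]
overviews = [
    "A starship crew explores deep space.",
    "A chef opens a small restaurant in Paris.",
    "Astronauts begin a space adventure on the moon.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(overviews)

def search_with_scores(query, n=5):
    """Return (title, score) pairs for the n best matches, best first."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix).ravel()
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    return [(title, round(float(score), 3)) for title, score in ranked[:n]]

# Try an in-vocabulary query and an out-of-vocabulary one, with varying n.
for query in ["space adventure", "underwater heist"]:
    for n in (1, 3):
        print(query, n, search_with_scores(query, n))
```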

In [None]:
# Example usage of the search function
query = 'space adventure'
top_n = 5
results = search(query, n=top_n, engine='tfidf')
print(f"Top {top_n} results for query '{query}':", results)

### Step 5: Plot the Query and Top Documents

Plot the query together with the top retrieved documents (or all documents) to visualize the retrieval, for example by projecting their vectors onto two dimensions.
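
One way to approach the plot is to reduce the TF-IDF vectors to two dimensions and draw a scatter plot. The sketch below uses `TruncatedSVD`, which is only one of several possible reduction methods, and assumes `matplotlib` is installed (it is not in the setup command above). The helper names `project_2d` and `plot_retrieval` and the toy overviews are illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up overviews for illustration.
overviews = [
    "A starship crew explores deep space.",
    "A chef opens a small restaurant in Paris.",
    "Astronauts begin a space adventure on the moon.",
    "Two detectives chase a thief across the city.",
]
query = "space adventure"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(overviews)
query_vec = vectorizer.transform([query])

def project_2d(doc_matrix, query_vec, seed=0):
    """Fit a 2-component SVD on the documents and project the query with it."""
    svd = TruncatedSVD(n_components=2, random_state=seed)
    doc_xy = svd.fit_transform(doc_matrix)
    query_xy = svd.transform(query_vec)[0]
    return doc_xy, query_xy

def plot_retrieval(doc_xy, query_xy, top_idx, labels):
    """Scatter all documents, highlight the top results, and mark the query."""
    import matplotlib.pyplot as plt  # assumed to be installed separately
    fig, ax = plt.subplots()
    ax.scatter(doc_xy[:, 0], doc_xy[:, 1], c="lightgray", label="all documents")
    ax.scatter(doc_xy[top_idx, 0], doc_xy[top_idx, 1], c="tab:blue", label="top results")
    ax.scatter(query_xy[0], query_xy[1], c="tab:red", marker="*", s=200, label="query")
    for i in top_idx:
        ax.annotate(labels[i], doc_xy[i])
    ax.legend()
    return ax

# TF-IDF rows are L2-normalized by default, so the dot product equals cosine similarity.
scores = (doc_matrix @ query_vec.T).toarray().ravel()
top_idx = np.argsort(scores)[::-1][:2].tolist()
doc_xy, query_xy = project_2d(doc_matrix, query_vec)
```

In a notebook cell, calling `plot_retrieval(doc_xy, query_xy, top_idx, overviews)` should then draw the figure. A 2-D projection discards most of the original dimensions, so points that look close in the plot are not guaranteed to be close in the full vector space.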

In [None]:
# TODO

### Step 6: Compare Engines

Compare the results of the same query using different engines.

Note: There are multiple ways of doing this. The simplest is to compare the intersection of the top-K results (K = 5, 10, or 30) from each engine.
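
The top-K intersection idea can be sketched with plain Python sets. The helper name `topk_overlap` is hypothetical, and the two result lists below are made up for illustration; they are not real outputs of either engine. The Jaccard ratio is included as one optional way to turn the overlap into a single number.

```python
def topk_overlap(results_a, results_b, k):
    """Compare the first k titles of two ranked result lists."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    shared = top_a & top_b
    union = top_a | top_b
    return {
        "k": k,
        "shared": sorted(shared),
        # Jaccard similarity: shared titles divided by all distinct titles.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

# Made-up rankings standing in for the output of two engines.
tfidf_results = ["Moon Quest", "Star Trek", "Apollo 13", "Gravity", "Alien"]
transformer_results = ["Star Trek", "Interstellar", "Moon Quest", "Alien", "Solaris"]

for k in (3, 5):
    print(topk_overlap(tfidf_results, transformer_results, k))
```

Set overlap ignores the order within the top K, so two engines that return the same titles in reverse order would still score a perfect match.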

In [None]:
# TODO

### Questions to Think About:

1. How does the choice of similarity measure affect the results?
2. How do the results change across different kinds of queries, such as short keyword queries versus longer descriptive ones?
3. Can you improve the results by incorporating additional metadata, such as genres?
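
As a starting point for the third question, the sketch below appends genre names to the overview text. It assumes the `genres` column stores a stringified list of dictionaries with a `name` key, which is worth confirming with `df['genres'].head()` before relying on it. The helper name `genre_names`, the `text` column, and the toy rows are illustrative.

```python
import ast
import pandas as pd

# Toy rows; the genres format shown here is an assumption about the real file.
toy = pd.DataFrame({
    "title": ["Toy Story", "Heat"],
    "overview": ["Toys come to life.", "A detective hunts a thief."],
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[]",
    ],
})

def genre_names(raw):
    """Parse a stringified list of genre dicts and return the names as one string."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return ""  # missing or malformed entry
    if not isinstance(parsed, list):
        return ""
    return " ".join(
        item["name"] for item in parsed if isinstance(item, dict) and "name" in item
    )

# Hypothetical combined field: overview followed by genre names.
toy["text"] = (toy["overview"] + " " + toy["genres"].apply(genre_names)).str.strip()
print(toy["text"].tolist())
```

Whether this improves results depends on the queries: genre words help a query like "animated comedy" but add nothing for plot-specific queries.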