## Building a Semantic Search Engine for Movie Search and Recommendation

### Project Overview
This notebook builds a semantic search engine for IMDB movie recommendations. The steps are preprocessing, embedding computation, similarity calculation, ranking, query expansion, knowledge graph integration, and optimization.

### Steps to Implement

1. **Preprocess the Dataset**: Combine the movie name, genres, and overview into a single text field.
2. **Compute Embeddings**: Use SBERT to generate embeddings.
3. **Calculate Cosine Similarity**: Compare the embeddings to find relevant movies.
4. **Rank Movies**: Sort results by similarity score.
5. **Query Expansion**: Expand the search terms with WordNet synonyms.
6. **Knowledge Graph Integration**: Simulate a knowledge graph for smaller datasets or prototyping.
7. **Optimization**: Cache embeddings to improve speed.


### Why NetworkX for Knowledge Graph?
NetworkX provides a simple way to simulate a knowledge graph for smaller datasets or prototyping. It lets us represent relationships between movies based on attributes such as genre, which can enrich our search capabilities.
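
As a minimal sketch of this idea, the snippet below builds a tiny graph from two hand-written rows (not taken from the dataset) and links each movie to one node per genre. The helper name `build_genre_graph` and the sample rows are illustrative assumptions. It also assumes the `Genres` column holds list-like strings such as `"['Drama', 'Horror']"`.

```python
import ast
import networkx as nx

def build_genre_graph(rows):
    """Build a directed graph with movie -> genre edges (illustrative helper)."""
    G = nx.DiGraph()
    for name, genres_str in rows:
        G.add_node(name, type='Movie')
        # Parse the list-like string into a real Python list
        for genre in ast.literal_eval(genres_str):
            G.add_node(genre, type='Genre')
            G.add_edge(name, genre, type='has_genre')
    return G

# Hypothetical sample rows, for illustration only
rows = [
    ('Movie A', "['Drama', 'Horror']"),
    ('Movie B', "['Horror']"),
]
G = build_genre_graph(rows)

# Movies that share the 'Horror' genre are the predecessors of that node
horror_movies = sorted(G.predecessors('Horror'))
print(horror_movies)
```

With one node per genre, "movies in the same genre" becomes a graph lookup rather than a string comparison.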


### Step 1: Preprocess the Dataset

In [3]:
import pandas as pd

# Load the cleaned dataset
data = pd.read_csv('semantic_search_ready_imdb.csv')

# Combine Movie_Name, Genres, and Overview into one text field for embedding.
# Fill missing values first: concatenating a string with NaN makes the whole result NaN.
data['combined_text'] = (
    data['Movie_Name'].fillna('') + ' ' + data['Genres'].fillna('') + ' ' + data['Overview'].fillna('')
)

In [4]:
data.head()

Unnamed: 0,Home_Page,Movie_Name,Genres,Overview,Cast,Original_Language,Storyline,Production_Company,Release_Date,Tagline,Vote_Average,Vote_Count,Budget_USD,Revenue_$,Run_Time_Minutes,Release_Country,parsed_genres,parsed_original_language,combined_text
0,https://www.imdb.com/title/tt5040012/,Nosferatu,"['Fantasy', 'Horror', 'Mystery']",A gothic tale of obsession between a haunted y...,"['Lily-Rose Depp', 'Nicholas Hoult', 'Bill Ska...","['English', 'German', 'Romanian', 'Russian']",A gothic tale of obsession between a haunted y...,"['Focus Features', 'Maiden Voyage Pictures', '...",2024-12-25,Succumb to the darkness.,7.3,151K,$50.0M,$176.0M,132,United States,"[""['Fantasy'"", "" 'Horror'"", "" 'Mystery']""]","[""['English'"", "" 'German'"", "" 'Romanian'"", "" '...","Nosferatu ['Fantasy', 'Horror', 'Mystery'] A g..."
1,https://www.imdb.com/title/tt20221436/,Emilia Pérez,"['Comedy', 'Crime', 'Drama', 'Musical', 'Thril...",Emilia Pérez follows three remarkable women in...,"['Zoe Saldaña', 'Karla Sofía Gascón', 'Selena ...","['Spanish', 'English']","In modern-day Mexico, overqualified and exploi...","['Why Not Productions', 'Page 114', 'Pathé']",2024-11-13,Not_Found,5.5,71K,$27.3M,$15.0M,132,United States,"[""['Comedy'"", "" 'Crime'"", "" 'Drama'"", "" 'Music...","[""['Spanish'"", "" 'English']""]","Emilia Pérez ['Comedy', 'Crime', 'Drama', 'Mus..."
2,https://www.imdb.com/title/tt8999762/,The Brutalist,['Drama'],When a visionary architect and his wife flee p...,"['Adrien Brody', 'Felicity Jones', 'Guy Pearce...","['English', 'Hungarian', 'Italian', 'Hebrew', ...",When visionary architect László Toth and his w...,"['Brookstreet Pictures', 'Kaplan Morrison', 'I...",2025-01-24,Welcome to America.,7.9,28K,$10.0M,$31.1M,214,Canada,"[""['Drama']""]","[""['English'"", "" 'Hungarian'"", "" 'Italian'"", ""...",The Brutalist ['Drama'] When a visionary archi...
3,https://www.imdb.com/title/tt27657135/,Saturday Night,"['Biography', 'Comedy', 'Drama', 'History']","At 11:30pm on October 11th, 1975, a ferocious ...","['Gabriel LaBelle', 'Rachel Sennott', 'Cory Mi...",['English'],"At 11:30pm on October 11, 1975, a ferocious tr...","['Columbia Pictures', 'TSG Entertainment']",2024-10-11,"October 11, 1975, The First Saturday Night Liv...",7.0,26K,$25.0M,$10.0M,109,United States,"[""['Biography'"", "" 'Comedy'"", "" 'Drama'"", "" 'H...","[""['English']""]","Saturday Night ['Biography', 'Comedy', 'Drama'..."
4,https://www.imdb.com/title/tt17526714/,The Substance,"['Drama', 'Horror', 'Sci-Fi']",A fading celebrity takes a black-market drug: ...,"['Demi Moore', 'Margaret Qualley', 'Dennis Qua...",['English'],Have you ever dreamed of a better version of y...,"['Working Title Films', 'A Good Story', 'Black...",2024-09-20,"If you follow the instructions, what could go ...",7.3,249K,$17.5M,$77.0M,141,United States,"[""['Drama'"", "" 'Horror'"", "" 'Sci-Fi']""]","[""['English']""]","The Substance ['Drama', 'Horror', 'Sci-Fi'] A ..."
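
The `combined_text` preview above still carries the raw list syntax of the `Genres` column (for example `['Fantasy', 'Horror', 'Mystery']`). One option, sketched here on the assumption that `Genres` normally holds a Python-style list literal, is to flatten it to plain words before embedding. `flatten_genres` is a hypothetical helper, not part of the notebook, and whether it improves retrieval quality is untested.

```python
import ast

def flatten_genres(genres_str):
    """Turn "['Drama', 'Horror']" into "Drama Horror" (hypothetical helper)."""
    try:
        parsed = ast.literal_eval(genres_str)
    except (ValueError, SyntaxError, TypeError):
        # Fall back to the raw text if it is not a valid list literal
        return str(genres_str)
    if not isinstance(parsed, (list, tuple)):
        return str(genres_str)
    return ' '.join(str(g) for g in parsed)

print(flatten_genres("['Fantasy', 'Horror', 'Mystery']"))
print(flatten_genres("Not_Found"))
```

In the notebook this could be applied with `data['Genres'].apply(flatten_genres)` before building `combined_text`.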


### Step 2: Compute Embeddings

In [8]:
from sentence_transformers import SentenceTransformer

# Initialize SBERT model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

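# Compute embeddings for the combined text (name + genres + overview)
data['embedding'] = data['combined_text'].apply(lambda x: model.encode(x))

# Save the dataset together with the embeddings.
# Note: CSV stores each embedding array as text, so it will not load back as a numeric array.
data.to_csv('semantic_search_ready_imdb_with_embeddings.csv', index=False)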
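
Because a CSV file stores each embedding array as text, one alternative is to keep the embeddings in a separate NumPy `.npy` file. The sketch below uses random vectors as a stand-in for SBERT output; the file name `embeddings_demo.npy` and the 3 x 4 shape are arbitrary assumptions chosen for illustration.

```python
import os
import tempfile
import numpy as np

# Random vectors stand in for model.encode(...) output (illustration only)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 4)).astype(np.float32)

# Save the matrix in NumPy's binary format and load it back
path = os.path.join(tempfile.gettempdir(), 'embeddings_demo.npy')  # hypothetical file name
np.save(path, embeddings)
restored = np.load(path)

# The round trip preserves shape, dtype, and values
print(restored.shape, restored.dtype)
print(np.array_equal(embeddings, restored))
```

In the notebook, `np.vstack(data['embedding'].tolist())` would produce the matrix to save.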

### Step 3: Calculate Cosine Similarity


In [9]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(query, data):
    # Compute embedding for the query
    query_embedding = model.encode(query)
    
    # Compute cosine similarity between query embedding and all movie embeddings
    similarities = cosine_similarity([query_embedding], list(data['embedding']))
    
    # Get the indices of the 5 most similar movies, highest score first
    top_indices = np.argsort(similarities[0])[-5:][::-1]
    
    return top_indices, similarities[0][top_indices]
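
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms. The sketch below computes it directly with NumPy on toy 2-D vectors and then picks the top results the same way `search` does. `top_k_cosine` is a hypothetical name, and the code assumes no vector has zero length.

```python
import numpy as np

def top_k_cosine(query_vec, matrix, k=5):
    """Return (indices, scores) of the k rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # cos(q, m) = (q . m) / (||q|| * ||m||), computed for every row m at once
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    scores = (matrix @ query_vec) / norms
    # argsort is ascending, so take the last k entries and reverse them
    k = min(k, len(scores))
    top = np.argsort(scores)[-k:][::-1]
    return top, scores[top]

# Toy vectors (illustration only)
matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = top_k_cosine([1.0, 0.0], matrix, k=2)
print(indices, scores)
```

Here the first row points in the same direction as the query (score 1.0) and the third row sits at 45 degrees to it (score about 0.707), so they rank first and second.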

### Step 4: Rank Movies

In [10]:
def rank_and_display_results(query, data):
    indices, scores = search(query, data)
    
    print(f"Top 5 results for query '{query}':")
    for idx, score in zip(indices, scores):
        movie = data.iloc[idx]
        print(f"- {movie['Movie_Name']} - {movie['Genres']} (Score: {score:.4f})")
        print(f"  Overview: {movie['Overview'][:100]}...")
        print()

# Example query
query = "action movie"
rank_and_display_results(query, data)

Top 5 results for query 'action movie':
- The Franchise - ['Comedy'] (Score: 0.4085)
  Overview: A team trapped inside the dysfunctional hell of creating franchise superhero movies. At the end of t...

- The Agency - ['Drama', 'Thriller'] (Score: 0.4072)
  Overview: Follows a world of even deeper intrigue and subterfuge, with characters fighting their own battles a...

- Watchmen: Chapter I - ['Animation', 'Action', 'Drama', 'Mystery', 'Sci-Fi'] (Score: 0.4036)
  Overview: In 1985, the murder of a government sponsored superhero draws his outlawed colleagues out of retirem...

- The Killer's Game - ['Action', 'Comedy', 'Thriller'] (Score: 0.4023)
  Overview: When a hitman is diagnosed with a terminal illness, he decides to take a hit out on himself. But whe...

- The Killer - ['Action', 'Thriller'] (Score: 0.3889)
  Overview: An assassin tries to make amends in an effort to restore the sight of a beautiful young singer....



### Step 5: Query Expansion

In [11]:
import nltk
# Download wordnet
nltk.download('wordnet')
from nltk.corpus import wordnet

def expand_query(query):
    """
    Expands the query by adding synonyms from WordNet.
    
    Args:
    query (str): The original search query.
    
    Returns:
    list: A list of terms including the original query and its expansions.
    """
    terms = query.split()
    expanded_terms = []
    for term in terms:
        synsets = wordnet.synsets(term)
        if synsets:
            # Getting synonyms (lemmas) from the first synset
            for lemma in synsets[0].lemmas():
                expanded_terms.append(lemma.name())
    # Combine original terms with expanded terms and remove duplicates
    return list(set(expanded_terms + terms))

# Example usage
query = "action movies with a bit of sadness"
expanded_query = expand_query(query)
print(f"Original query: {query}")
print(f"Expanded query: {' '.join(expanded_query)}")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Abeni\AppData\Roaming\nltk_data...


Original query: action movies with a bit of sadness
Expanded query: moving-picture_show of angstrom with flick angstrom_unit motion-picture_show movies moving_picture bit film picture action movie pic A spot a picture_show sadness motion_picture unhappiness
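
The expanded query above picks up unrelated terms such as `angstrom`, which appear to come from looking up the word "a", and the `set` call scrambles the word order. A possible refinement, sketched below, skips synonym lookup for stopwords and keeps the original order. To stay runnable without WordNet, the lookup is passed in as a function; `expand_query_filtered`, the toy synonym table, and the small stopword set are all illustrative assumptions.

```python
def expand_query_filtered(query, synonyms_for, stopwords):
    """Expand non-stopword terms, keeping word order and removing duplicates."""
    expanded = []
    for term in query.split():
        expanded.append(term)
        if term.lower() in stopwords:
            continue  # do not look up synonyms for function words
        expanded.extend(synonyms_for(term))
    # dict.fromkeys removes duplicates while preserving first-seen order
    return list(dict.fromkeys(expanded))

# Hypothetical stand-ins for WordNet and a stopword list (illustration only)
toy_synonyms = {'movies': ['film', 'picture'], 'sadness': ['unhappiness']}
toy_stopwords = {'a', 'of', 'with'}

result = expand_query_filtered(
    'movies with a bit of sadness',
    lambda term: toy_synonyms.get(term, []),
    toy_stopwords,
)
print(' '.join(result))
```

With WordNet, `synonyms_for` could wrap the same first-synset lemma lookup used in `expand_query`.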


### Step 6: Knowledge Graph Representation

In [12]:
import networkx as nx
import pandas as pd

# Create a directed graph
G = nx.DiGraph()

# Add nodes and edges based on movie attributes
for _, row in data.iterrows():
    # Add the movie node with its overview and genres as attributes
    G.add_node(row['Movie_Name'], type='Movie', Overview=row['Overview'], Genres=row['Genres'])
    
    # Link the movie to its genres (the raw Genres string is used as a single node here)
    G.add_edge(row['Movie_Name'], row['Genres'], type='has_genre')

def query_knowledge_graph(query, G):
    """
    Queries the knowledge graph simulation for movies related to the query.
    
    Args:
    query (str): The search query.
    G (nx.DiGraph): The knowledge graph.
    
    Returns:
    list: A list of tuples with movie details matching the query.
    """
    results = []
    # Convert query to lowercase for case-insensitive search
    query_lower = query.lower()
    for node, attributes in G.nodes(data=True):
        if attributes.get('type') == 'Movie':
            # Check whether the query appears in the overview or the genres
            if query_lower in attributes['Overview'].lower() or query_lower in attributes['Genres'].lower():
                results.append((node, attributes['Overview'], attributes['Genres']))
    
    # Return at most the first 5 matches (an empty list if none were found)
    return results[:5]

# Example usage
query = "Sad"
results = query_knowledge_graph(query, G)
print("\nKnowledge Graph Results:")
if results:
    for name, overview, genres in results:
        print(f"- Name: {name}, genres: {genres}")
        print(f"  Overview: {overview[:100]}...")
        print()
else:
    print("No results found for the query.")


Knowledge Graph Results:
- Name: Indiana Jones and the Great Circle, genres: ['Not_Found']
  Overview: Uncover one of history's greatest mysteries in Indiana Jones and the Great Circle, a first-person, s...

- Name: Batman: Caped Crusader, genres: ['Animation', 'Action', 'Adventure', 'Crime', 'Family', 'Fantasy', 'Sci-Fi']
  Overview: After a family tragedy, affluent philanthropist Bruce Wayne transforms into The Batman, where his cr...

- Name: Stream, genres: ['Horror']
  Overview: The Keenan family's simple weekend getaway truly turns into a vacation to die for, as four deranged ...
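
The check `query_lower in ...` is a substring test, so a query like "sad" also matches inside longer words such as "crusade". If whole-word matching is preferred, a regular expression with word boundaries is one way to do it. The sketch below uses a hypothetical helper, `contains_word`, that is not part of the notebook.

```python
import re

def contains_word(query, text):
    """Return True if query appears in text as a whole word (case-insensitive)."""
    pattern = r'\b' + re.escape(query) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(contains_word('Sad', 'A sad story about loss'))     # True
print(contains_word('Sad', 'His crusade against crime'))  # False
```

Swapping this in for the two `in` checks would trade recall for precision, so partial matches such as "sadness" would no longer be returned either.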



### Cosine Similarity with Query Expansion

In [14]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search_with_expansion(query, data):
    """
    Performs a search using an expanded query and returns the top 5 results.
    
    Args:
    query (str): The original search query.
    data (pd.DataFrame): The dataset with movie information and embeddings.
    
    Returns:
    tuple: Indices of top results and their similarity scores.
    """
    expanded_query = expand_query(query)
    query_embedding = model.encode(' '.join(expanded_query))
    similarities = cosine_similarity([query_embedding], list(data['embedding']))
    top_indices = np.argsort(similarities[0])[-5:][::-1]
    return top_indices, similarities[0][top_indices]

# Redefine the display helper so that it uses the expanded-query search
def rank_and_display_results(query, data):
    indices, scores = search_with_expansion(query, data)
    
    print(f"\nTop 5 results for query '{query}':")
    for idx, score in zip(indices, scores):
        movie = data.iloc[idx]
        print(f"- {movie['Movie_Name']} - {movie['Genres']} (Score: {score:.4f})")
        print(f"  Overview: {movie['Overview'][:100]}...")
        print()

# Example query
query = "crime movies with dark plot"
rank_and_display_results(query, data)


Top 5 results for query 'crime movies with dark plot':
- Critical Incident - ['Crime', 'Drama'] (Score: 0.3678)
  Overview: When a police officer pursues a teenage suspect on to a busy train platform, a bystander is critical...

- Secret Level - ['Animation', 'Action', 'Adventure', 'Fantasy', 'Sci-Fi'] (Score: 0.3540)
  Overview: Original stories set within the worlds of iconic video games are featured in this animated anthology...

- The Instigators - ['Action', 'Comedy', 'Crime', 'Drama', 'Mystery', 'Thriller'] (Score: 0.3536)
  Overview: Follows two robbers who must go on the run with the help of one of their therapists after a theft do...

- Protection - ['Crime', 'Drama', 'Thriller'] (Score: 0.3357)
  Overview: Follows a witness protection officer who finds herself at the center of a breach; compromised by an ...

- Criminal Record - ['Crime', 'Drama', 'Thriller'] (Score: 0.3350)
  Overview: Follows two brilliant detectives in a tug of war over a historic murder conviction and th

### Optimizing using cache

In [None]:
import joblib

# Save embeddings to disk for caching
joblib.dump(data['embedding'].tolist(), 'embeddings_cache.joblib')

# Load cached embeddings for faster access
cached_embeddings = joblib.load('embeddings_cache.joblib')

def search_with_cache(query):
    """
    Performs a search using cached embeddings for speed optimization.
    
    Args:
    query (str): The search query.
    
    Returns:
    tuple: Indices of top results and their similarity scores.
    """
    query_embedding = model.encode(query)
    similarities = cosine_similarity([query_embedding], cached_embeddings)
    top_indices = np.argsort(similarities[0])[-5:][::-1]
    return top_indices, similarities[0][top_indices]

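# Example using cached embeddings: display the results returned by the cached search
indices, scores = search_with_cache(query)
for idx, score in zip(indices, scores):
    print(f"- {data.iloc[idx]['Movie_Name']} (Score: {score:.4f})")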

FileNotFoundError: [Errno 2] No such file or directory: '../data/embeddings_cache.joblib'
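
A `FileNotFoundError` like the one above is raised when the cache file does not exist at the path being loaded. A common guard is to load the cache only if the file is present and otherwise compute and save it. The sketch below uses the standard-library `pickle` module in place of `joblib` so that it runs without extra dependencies; `load_or_compute`, `fake_embeddings`, and the demo file name are hypothetical.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Load a cached object if the file exists; otherwise compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result

# Hypothetical demo: count how often the expensive step actually runs
calls = []
def fake_embeddings():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]  # stands in for real embeddings

cache_path = os.path.join(tempfile.mkdtemp(), 'embeddings_cache_demo.pkl')
first = load_or_compute(cache_path, fake_embeddings)   # computes and writes the file
second = load_or_compute(cache_path, fake_embeddings)  # reads the file instead
print(first == second, len(calls))
```

The same structure should carry over to `joblib.dump` and `joblib.load`, with `compute_fn` wrapping the embedding step.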