# Gen AI Intensive Course Capstone 2025Q1: Advanced Emotional Semantic Search Engine for Anime/Manga (RAG + ChromaDB + Re-ranking)

*   **The Challenge:** Discovering new anime or manga often goes beyond simple genre tags or keyword searches. Users frequently look for titles based on a specific *feeling*, *atmosphere*, *vibe*, or inspiration derived from past experiences (e.g., "something melancholic and beautiful like *Your Name*", "an epic adventure that gets you hyped like *Gurren Lagann*"). Traditional search methods struggle to capture these nuanced, emotional queries.

*   **Our Enhanced Solution: A Sophisticated RAG Pipeline:** This project tackles the challenge by implementing an advanced **Retrieval-Augmented Generation (RAG)** system specifically designed for vibe-based anime/manga discovery. The pipeline involves several key stages:
    1.  **Data Loading & Filtering:** We start with datasets containing anime information (synopses, genres, scores, popularity, etc.) and user reviews. Crucially, we apply **genre filtering** upfront to exclude categories that might skew recommendations, ensuring a more relevant starting pool.
    2.  **Preprocessing:** Text data (synopses, reviews) is cleaned and prepared for processing.
    3.  **Semantic Embedding:** We leverage Google's powerful `text-embedding-004` model via the `google-genai` library to generate rich **semantic embeddings** for the anime synopses, capturing their underlying meaning.
    4.  **Vector Storage & Indexing:** Instead of in-memory arrays, we utilize **ChromaDB**, a persistent **vector database**, to efficiently store and index the anime embeddings along with relevant metadata (title, genre, score, popularity). This allows for persistence across sessions and faster retrieval.
    5.  **Hybrid Retrieval & Re-ranking:** When a user query is received:
        *   A **vector search** is performed in ChromaDB to retrieve an initial set of `k_initial` candidates based on semantic similarity between the query and anime synopses.
        *   A **re-ranking** step is then applied to these candidates. We calculate a combined score from three weighted factors: **semantic similarity**, **user score** (quality), and **popularity rank** (inverted, because a lower rank number means a more popular title). This lets high-quality or popular relevant items surface even if they are not the closest semantic match.
    6.  **Context Augmentation:** The top `k` re-ranked anime are then augmented with additional context: their detailed **synopses** and relevant **user review snippets**, providing qualitative insights into the perceived vibe and emotional impact.
    7.  **Motivated Generation:** Finally, Google's **Gemini LLM** is prompted with the original user query and the rich, augmented context (re-ranked results, synopses, metadata, reviews). The LLM synthesizes this information to generate a personalized, **motivated recommendation**, explaining *why* the suggested titles might match the user's desired feeling.


*   **Notebook Objective:** This notebook provides a step-by-step implementation of this RAG pipeline. It covers data loading, filtering, preprocessing, embedding generation, ChromaDB integration, the hybrid retrieval/re-ranking mechanism, review augmentation, and LLM-based generation, and it includes a section on evaluating the system's performance.

*   **Key Technologies & Gen AI Capabilities Demonstrated:**
    *   **Retrieval-Augmented Generation (RAG):** The core architecture.
    *   **Semantic Embeddings:** Using `google-genai`'s `text-embedding-004`.
    *   **Vector Database / Vector Search:** Implementation with `chromadb`.
    *   **LLM Integration:** Using `google-genai`'s `gemini-2.0-flash` for response generation.
    *   **Re-ranking Algorithms:** Combining semantic similarity with metadata features.
    *   **Data Preprocessing & Filtering:** Essential steps for quality results.
    *   **Context Augmentation:** Using reviews and metadata to enrich LLM input.
    *   **Libraries:** `pandas`, `numpy`, `google-genai`, `chromadb`, `scikit-learn`.
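
The re-ranking described in stage 5 can be sketched as a weighted sum of normalized features. This is a minimal illustration under stated assumptions, not the notebook's exact implementation: the helper names `min_max` and `rerank`, the weights `w_sim`, `w_score`, `w_pop`, and the sample candidates are all hypothetical. It assumes the vector search returns cosine distances and that popularity is a rank where lower means more popular.

```python
import numpy as np

def min_max(values):
    """Scale values to [0, 1]; a constant array maps to all zeros."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span

def rerank(candidates, w_sim=0.6, w_score=0.25, w_pop=0.15, top_k=2):
    """Return the top_k (title, combined_score) pairs, best first.

    Each candidate is a dict with 'title', 'distance' (cosine distance,
    lower is closer), 'score' (user rating) and 'popularity' (rank, lower
    is more popular). The weights are illustrative assumptions.
    """
    sim = min_max([1.0 - c["distance"] for c in candidates])
    quality = min_max([c["score"] for c in candidates])
    # Invert the rank so that more popular titles get a higher value.
    pop = 1.0 - min_max([c["popularity"] for c in candidates])
    combined = w_sim * sim + w_score * quality + w_pop * pop
    order = np.argsort(-combined)
    return [(candidates[i]["title"], float(combined[i])) for i in order[:top_k]]

# Made-up candidates, as if returned by the initial vector search.
candidates = [
    {"title": "A", "distance": 0.20, "score": 6.5, "popularity": 3000},
    {"title": "B", "distance": 0.25, "score": 9.0, "popularity": 50},
    {"title": "C", "distance": 0.40, "score": 7.0, "popularity": 800},
]
ranked = rerank(candidates)
print(ranked)
```

In this toy data, `B` overtakes the semantically closest `A` because its user score and popularity are much better, which is the effect the re-ranking step is meant to produce.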


In [1]:
!rm -rf /kaggle/working/*

In [2]:
!pip uninstall -qqy jupyterlab kfp  # Remove unused conflicting packages
!pip install -qU "google-genai==1.7.0" "chromadb==0.6.3"


In [3]:
from google import genai
from google.genai import types

from IPython.display import Markdown

genai.__version__

'1.7.0'

In [4]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/myanimelist-dataset-animes-profiles-reviews/animes.csv
/kaggle/input/myanimelist-dataset-animes-profiles-reviews/profiles.csv
/kaggle/input/myanimelist-dataset-animes-profiles-reviews/reviews.csv


In [5]:
# ==============================================================================
# 2. Environment Setup
# ==============================================================================
print("--- 2. Environment Setup ---")
from google.api_core import retry # For handling temporary API errors
import os
import time
from sklearn.metrics.pairwise import cosine_similarity
import textwrap # For formatting LLM output
import re
import warnings
import chromadb
from chromadb.utils import embedding_functions
import ast
from sklearn.preprocessing import MinMaxScaler # For normalization

warnings.filterwarnings("ignore") # Hide non-critical warnings

# Configure API Key from Kaggle Secrets
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    GOOGLE_API_KEY = user_secrets.get_secret("GOOGLE_API_KEY")
    client = genai.Client(api_key=GOOGLE_API_KEY)
    print("Google AI SDK Configured successfully.")
except Exception as e:
    print(f"Error configuring Google AI SDK: {e}")
    print("Make sure you have added your API Key as 'GOOGLE_API_KEY' in Kaggle Secrets.")
    GOOGLE_API_KEY = None # Set to None for later checks

# Global Parameters
EMBEDDING_MODEL_NAME = "text-embedding-004"
GENERATIVE_MODEL_NAME = "gemini-2.0-flash" #"gemini-1.5-flash-latest"
TOP_K_RESULTS = 10 # Number of documents to retrieve for the LLM
REVIEWS_PER_ANIME = 3 # Number of review snippets to use for augmentation
MIN_SYNOPSIS_LENGTH = 250 # Minimum synopsis length in characters
MIN_REVIEW_LENGTH = 30 # Minimum review length in characters

ANIME_DATA_PATH = "/kaggle/input/myanimelist-dataset-animes-profiles-reviews/animes.csv"
REVIEWS_DATA_PATH = "/kaggle/input/myanimelist-dataset-animes-profiles-reviews/reviews.csv"

PATH_PROCESSED_ANIME = "/kaggle/working/processed_anime.csv"
PATH_PROCESSED_REVIEWS = "/kaggle/working/processed_reviews.csv"

--- 2. Environment Setup ---
Google AI SDK Configured successfully.


In [6]:
# --- Extract and Display Unique Genres ---
try:
    df_anime = pd.read_csv(ANIME_DATA_PATH)
    print(f"Anime Dataset loaded: {df_anime.shape[0]} rows, {df_anime.shape[1]} columns.")
    print("Anime Columns:", df_anime.columns.tolist())

    if not df_anime.empty and 'genre' in df_anime.columns:
        print("\nExtracting unique genres from the dataset...")
        all_genres = set()
    
        # Skip missing (NaN) genres while iterating; df_anime itself is left unchanged
    
        # Iterate through the genre column to parse and collect unique genres
        for genre_input in df_anime['genre'].dropna():
            genre_list = []
            if isinstance(genre_input, str):
                try:
                    # Attempt 1: Assume string representation of list like "['Action', 'Sci-Fi']"
                    parsed_list = ast.literal_eval(genre_input)
                    if isinstance(parsed_list, list):
                        genre_list = parsed_list
                    else:
                        genre_list = [str(parsed_list)] # Treat as single item list
                except (ValueError, SyntaxError):
                    # Attempt 2: Assume comma-separated string like "Action, Sci-Fi"
                    genre_list = [g.strip() for g in genre_input.split(',') if g.strip()]
            elif isinstance(genre_input, list): # If data is already a list
                 genre_list = genre_input
            # else: Ignore other types for genre extraction
    
            # Add valid genres to the set
            for genre in genre_list:
                if isinstance(genre, str) and genre.strip(): # Ensure it's a non-empty string
                    all_genres.add(genre.strip())
    
        # Sort the unique genres alphabetically for readability
        unique_genres_sorted = sorted(all_genres)
    
        print(f"\nFound {len(unique_genres_sorted)} unique genres:")
        # Print genres in multiple columns for better readability if the list is long
        genres_per_line = 5
        for i in range(0, len(unique_genres_sorted), genres_per_line):
            print(", ".join(unique_genres_sorted[i:i+genres_per_line]))
    
        print("\nUse the list above to define the 'EXCLUDED_GENRES' set in the next step.")
    
    elif 'genre' not in df_anime.columns:
         print("Warning: 'genre' column not found in the anime dataset. Cannot extract unique genres.")
    else:
         print("Anime dataset is empty. Cannot extract unique genres.")

except FileNotFoundError:
    print(f"ERROR: Anime dataset file not found at {ANIME_DATA_PATH}")

Anime Dataset loaded: 19311 rows, 12 columns.
Anime Columns: ['uid', 'title', 'synopsis', 'genre', 'aired', 'episodes', 'members', 'popularity', 'ranked', 'score', 'img_url', 'link']

Extracting unique genres from the dataset...

Found 43 unique genres:
Action, Adventure, Cars, Comedy, Dementia
Demons, Drama, Ecchi, Fantasy, Game
Harem, Hentai, Historical, Horror, Josei
Kids, Magic, Martial Arts, Mecha, Military
Music, Mystery, Parody, Police, Psychological
Romance, Samurai, School, Sci-Fi, Seinen
Shoujo, Shoujo Ai, Shounen, Shounen Ai, Slice of Life
Space, Sports, Super Power, Supernatural, Thriller
Vampire, Yaoi, Yuri

Use the list above to define the 'EXCLUDED_GENRES' set in the next step.
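
The genre-parsing logic in the cell above is repeated in the filtering step that follows, so it could be factored into one helper. The sketch below assumes the same two input formats (a list-like string or a comma-separated string). The names `parse_genres` and `has_excluded_genre` are hypothetical and do not appear in the notebook.

```python
import ast

def parse_genres(value):
    """Return a list of genre names from a list-like string, a
    comma-separated string, or an actual list; anything else gives []."""
    if isinstance(value, list):
        items = value
    elif isinstance(value, str):
        try:
            # Format 1: string representation of a list, e.g. "['Action', 'Sci-Fi']"
            parsed = ast.literal_eval(value)
            items = parsed if isinstance(parsed, list) else [parsed]
        except (ValueError, SyntaxError):
            # Format 2: comma-separated names, e.g. "Action, Sci-Fi"
            items = value.split(",")
    else:
        # NaN, None and other types carry no genre information.
        return []
    return [str(g).strip() for g in items if str(g).strip()]

def has_excluded_genre(value, excluded):
    """True when any parsed genre is in the excluded set (case-sensitive)."""
    return not excluded.isdisjoint(parse_genres(value))

excluded = {"Hentai", "Yaoi", "Yuri"}
print(parse_genres("['Action', 'Sci-Fi']"))
print(has_excluded_genre("['Comedy', 'Yaoi']", excluded))
```

With a helper like this, the exclusion mask could be built with `df_anime['genre'].apply(lambda g: has_excluded_genre(g, EXCLUDED_GENRES))`.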


In [7]:
# Define the set of genres to exclude (CASE-SENSITIVE! Match exact names in your data)
# Adjust this set based on the genres you want to remove
EXCLUDED_GENRES = {'Hentai', 'Yaoi', 'Yuri'}

In [8]:
# ==============================================================================
# 3. Data Loading and Preparation (Anime and Reviews)
# ==============================================================================
print("\n--- 3. Data Loading and Preparation ---")

if os.path.exists(PATH_PROCESSED_ANIME) and os.path.exists(PATH_PROCESSED_REVIEWS):
    print("Processed files found. Skipping preprocessing step...")
    df_processed = pd.read_csv(PATH_PROCESSED_ANIME)
    print(f"Processed Anime Dataset: {df_processed.shape[0]} anime ready.")
    print(df_processed[['uid', 'title', 'cleaned_synopsis']].head())
    
    df_reviews_processed = pd.read_csv(PATH_PROCESSED_REVIEWS)
    print(f"Processed Reviews Dataset: {df_reviews_processed.shape[0]} reviews ready.")
    print(df_reviews_processed[['anime_uid', 'cleaned_review']].head())
else:
    # --- 3.1 Load Anime Dataset ---
    try:
        df_anime = pd.read_csv(ANIME_DATA_PATH)
        print(f"Anime Dataset loaded: {df_anime.shape[0]} rows, {df_anime.shape[1]} columns.")
        print("Anime Columns:", df_anime.columns.tolist())
    except FileNotFoundError:
        print(f"ERROR: Anime dataset file not found at {ANIME_DATA_PATH}")
        df_anime = pd.DataFrame() # Create empty dataframe to avoid subsequent errors

    # --- 3.1.1 Genre Filtering ---
    if not df_anime.empty:
        print("\nFiltering Anime based on excluded genres...")
        
        print(f"Excluding genres: {EXCLUDED_GENRES}")
    
        # Handle potential NaN values in the genre column before processing
        # Option 1: Drop rows with NaN genres
        df_anime.dropna(subset=['genre'], inplace=True)
        # Option 2: Fill NaN with an empty list representation (if you prefer to keep them)
        # df_anime['genre'].fillna('[]', inplace=True) # Use '[]' if using literal_eval, '' if using split
    
        original_rows_before_genre_filter = len(df_anime)
    
        # Function to parse the genre string and check for excluded genres
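        def check_and_filter_genres(genre_input, excluded_set):
            """Return True if the anime should be excluded, False if it should be kept."""
            # Lists are handled below; pd.isna on a list returns an array, which cannot be used in a boolean test
            if not isinstance(genre_input, list) and (pd.isna(genre_input) or not genre_input):
                return False # Keep if genre is missing or NaN (normally already removed by dropna above)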
    
            genre_list = []
            if isinstance(genre_input, str):
                try:
                    # Attempt 1: Assume string representation of list like "['Action', 'Sci-Fi']"
                    parsed_list = ast.literal_eval(genre_input)
                    if isinstance(parsed_list, list):
                        genre_list = parsed_list
                    else:
                        # If literal_eval results in something else (e.g., just a string), treat as single item list
                        genre_list = [str(parsed_list)]
                except (ValueError, SyntaxError):
                    # Attempt 2: Assume comma-separated string like "Action, Sci-Fi"
                    # This handles cases like "Action" or "Action, Comedy"
                    genre_list = [g.strip() for g in genre_input.split(',') if g.strip()]
            elif isinstance(genre_input, list): # If data is already a list
                 genre_list = genre_input
            else:
                 # Handle other potential types if necessary
                 # print(f"Warning: Unexpected genre type: {type(genre_input)}, value: {genre_input}")
                 return False # Keep if type is unexpected
    
            # Ensure we have a list of strings now
            if not isinstance(genre_list, list):
                 # print(f"Warning: Could not parse genre input into a list: {genre_input}")
                 return False # Keep if parsing failed
    
            # Check for overlap with the excluded genres using sets for efficiency
            current_genres_set = set(str(g) for g in genre_list) # Ensure all items are strings
            # isdisjoint is False when there is overlap, so True here means "exclude this anime"
            return not current_genres_set.isdisjoint(excluded_set)
    
        # Apply the function to create a boolean mask for rows TO EXCLUDE
        # It's safer to iterate using apply for robust parsing logic per row
        print("Applying genre filter function...")
        exclude_mask = df_anime['genre'].apply(lambda x: check_and_filter_genres(x, EXCLUDED_GENRES))
    
        # Keep rows where the mask is False (i.e., do NOT exclude)
        df_anime_filtered = df_anime[~exclude_mask].copy() # Use .copy() to avoid SettingWithCopyWarning
    
        print(f"Removed {original_rows_before_genre_filter - len(df_anime_filtered)} anime containing excluded genres.")
        print(f"Anime remaining after genre filtering: {len(df_anime_filtered)}")
    
        # Use the filtered DataFrame for subsequent steps
        df_anime_to_process = df_anime_filtered
        
    else:
        print("Anime Dataset is empty, skipping genre filtering.")
        df_anime_to_process = pd.DataFrame() # Ensure it's an empty DF
    
    # --- 3.2 Load Reviews Dataset ---
    try:
        df_reviews = pd.read_csv(REVIEWS_DATA_PATH)
        print(f"Reviews Dataset loaded: {df_reviews.shape[0]} rows, {df_reviews.shape[1]} columns.")
        print("Reviews Columns:", df_reviews.columns.tolist())
    except FileNotFoundError:
        print(f"ERROR: Reviews dataset file not found at {REVIEWS_DATA_PATH}")
        df_reviews = pd.DataFrame() # Create empty dataframe
    
    # --- 3.3 Text Cleaning Function ---
    def clean_text(text):
        """Cleans text: lowercase, removes specific tags, extra spaces."""
        if isinstance(text, str):
            text = text.lower()
            text = re.sub(r'\[written by mal rewrite\]', '', text, flags=re.IGNORECASE) # Remove common boilerplate
            text = re.sub(r'<.*?>', '', text) # Remove HTML tags (if any)
            text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE) # Remove URLs
            text = re.sub(r'\s+', ' ', text).strip() # Normalize whitespace
            # Add other rules if needed (e.g., removing specific special characters)
        return text
    
    # --- 3.4 Preprocessing Filtered Anime Dataset ---
    df_processed = pd.DataFrame() # Initialize empty processed dataframe
    # Use the filtered dataframe 'df_anime_to_process' from now on
    if not df_anime_to_process.empty:
        print("\nPreprocessing Filtered Anime Dataset...")
        # Select relevant columns from the filtered dataframe
        cols_to_keep = ['uid', 'title', 'synopsis', 'genre', 'score', 'popularity'] # Add others if useful
        df_processed = df_anime_to_process[cols_to_keep].copy()
    
        # Handle missing values in synopsis (on the filtered data)
        original_rows = len(df_processed)
        df_processed.dropna(subset=['synopsis'], inplace=True)
        print(f"Anime: Removed {original_rows - len(df_processed)} rows with missing synopsis (post-genre filter).")
    
        # Clean synopsis
        df_processed['cleaned_synopsis'] = df_processed['synopsis'].apply(clean_text)
    
        # Filter out very short synopses
        original_rows = len(df_processed)
        df_processed = df_processed[df_processed['cleaned_synopsis'].str.len() >= MIN_SYNOPSIS_LENGTH]
        print(f"Anime: Removed {original_rows - len(df_processed)} rows with synopsis shorter than {MIN_SYNOPSIS_LENGTH} characters (post-genre filter).")
    
        # Handle 'uid' type
        try:
            df_processed['uid'] = pd.to_numeric(df_processed['uid'], errors='coerce')
            df_processed.dropna(subset=['uid'], inplace=True)
            df_processed['uid'] = df_processed['uid'].astype(int)
        except Exception as e:
             print(f"Warning: Problem converting 'uid' in df_processed to numeric: {e}")
    
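        # Reset index, then keep one row per uid. Sorting by popularity (descending) decides
        # which duplicate survives; the index is not reset again after the sort.
        df_processed.reset_index(drop=True, inplace=True)
        df_processed = df_processed.sort_values("popularity", ascending=False).drop_duplicates(subset=["uid"], keep="first")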
        df_processed.to_csv(PATH_PROCESSED_ANIME, index=False)
        print(f"Processed Anime Dataset: {df_processed.shape[0]} anime ready.")
        print(df_processed[['uid', 'title', 'cleaned_synopsis']].head())
        if not df_processed.empty:
            print(df_processed[['uid', 'title', 'genre']].head()) # Show genres to verify filtering
    else:
        print("No anime data available after genre filtering to proceed with preprocessing.")
    
    # --- 3.5 Preprocessing Reviews Dataset ---
    df_reviews_processed = pd.DataFrame() # Initialize empty processed dataframe
    if not df_reviews.empty and not df_processed.empty: # Proceed only if both dfs are loaded
        print("\nPreprocessing Reviews Dataset...")
        # Select relevant columns (use the provided names)
        cols_to_keep_reviews = ['anime_uid', 'text', 'score'] # 'profile', 'scores', 'link' might be useful for future analysis
        df_reviews_processed = df_reviews[cols_to_keep_reviews].copy()
    
        # Handle missing values in review text
        original_rows = len(df_reviews_processed)
        df_reviews_processed.dropna(subset=['text'], inplace=True)
        print(f"Reviews: Removed {original_rows - len(df_reviews_processed)} rows with missing text.")
    
        # Clean review text
        df_reviews_processed['cleaned_review'] = df_reviews_processed['text'].apply(clean_text)
    
        # Filter out very short reviews
        original_rows = len(df_reviews_processed)
        df_reviews_processed = df_reviews_processed[df_reviews_processed['cleaned_review'].str.len() >= MIN_REVIEW_LENGTH]
        print(f"Reviews: Removed {original_rows - len(df_reviews_processed)} rows with review shorter than {MIN_REVIEW_LENGTH} characters.")
    
        # Handle 'anime_uid' type (IMPORTANT for joining)
        # Ensure it's the same type as 'uid' in df_processed (int)
        try:
            df_reviews_processed['anime_uid'] = pd.to_numeric(df_reviews_processed['anime_uid'], errors='coerce')
            df_reviews_processed.dropna(subset=['anime_uid'], inplace=True) # Remove if not convertible
            df_reviews_processed['anime_uid'] = df_reviews_processed['anime_uid'].astype(int)
        except Exception as e:
             print(f"Warning: Problem converting 'anime_uid' in df_reviews to numeric: {e}")
    
        # Filter reviews to only include those for anime present in the filtered df_processed (optimization)
        valid_anime_uids = df_processed['uid'].unique()
        original_rows = len(df_reviews_processed)
        df_reviews_processed = df_reviews_processed[df_reviews_processed['anime_uid'].isin(valid_anime_uids)]
        print(f"Reviews: Kept {len(df_reviews_processed)} reviews related to anime in the processed dataset (removed {original_rows - len(df_reviews_processed)}).")
    
        # Drop duplicate rows first, then reset the index so it stays sequential
        df_reviews_processed.drop_duplicates(inplace=True)
        df_reviews_processed.reset_index(drop=True, inplace=True)
        df_reviews_processed.to_csv(PATH_PROCESSED_REVIEWS, index=False)
        print(f"Processed Reviews Dataset: {df_reviews_processed.shape[0]} reviews ready.")
        print(df_reviews_processed[['anime_uid', 'cleaned_review']].head())
    
    elif df_reviews.empty:
        print("Reviews Dataset is empty or not loaded. Recommendations will be based on synopses only.")
    else: # df_processed is empty
         print("Anime Dataset is empty, cannot process reviews meaningfully.")


--- 3. Data Loading and Preparation ---
Anime Dataset loaded: 19311 rows, 12 columns.
Anime Columns: ['uid', 'title', 'synopsis', 'genre', 'aired', 'episodes', 'members', 'popularity', 'ranked', 'score', 'img_url', 'link']

Filtering Anime based on excluded genres...
Excluding genres: {'Yuri', 'Yaoi', 'Hentai'}
Applying genre filter function...
Removed 2640 anime containing excluded genres.
Anime remaining after genre filtering: 16671
Reviews Dataset loaded: 192112 rows, 7 columns.
Reviews Columns: ['uid', 'profile', 'anime_uid', 'text', 'score', 'scores', 'link']

Preprocessing Filtered Anime Dataset...
Anime: Removed 729 rows with missing synopsis (post-genre filter).
Anime: Removed 7010 rows with synopsis shorter than 250 characters (post-genre filter).
Processed Anime Dataset: 7686 anime ready.
        uid                                        title  \
1127  40911                          Yuukoku no Moriarty   
1164  40908                                 Kemono Jihen   
6667  408
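
The cleaning rules from section 3.3 can be checked in isolation. The sketch below is a standalone copy of the same regular expressions applied to a made-up string; the sample text is hypothetical and does not come from the dataset.

```python
import re

def clean_text(text):
    """Lowercase, strip the MAL boilerplate tag, HTML tags and URLs,
    then collapse whitespace. Non-string inputs are returned unchanged."""
    if isinstance(text, str):
        text = text.lower()
        text = re.sub(r'\[written by mal rewrite\]', '', text)  # boilerplate tag
        text = re.sub(r'<.*?>', '', text)                       # HTML tags
        text = re.sub(r'http\S+|www\S+', '', text)              # URLs
        text = re.sub(r'\s+', ' ', text).strip()                # whitespace
    return text

sample = "Visit <b>MAL</b> at https://myanimelist.net  for more.\n[Written by MAL Rewrite]"
cleaned = clean_text(sample)
print(cleaned)  # visit mal at for more.
```

Lowercasing happens first, so the boilerplate pattern only needs to match the lowercase form. Non-string values such as NaN pass through untouched, which is why the notebook drops missing synopses and reviews before cleaning.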

In [9]:
# Check that df_processed contains no duplicate uids
df_processed['uid'].is_unique

True

In [10]:
import shutil

chroma_db_path = "/kaggle/working/anime_chroma_db" # Make sure this matches the path used in PersistentClient

if os.path.exists(chroma_db_path):
    print(f"Removing existing ChromaDB directory: {chroma_db_path}")
    shutil.rmtree(chroma_db_path)
    print("Directory removed.")
else:
    print(f"ChromaDB directory not found at {chroma_db_path}, no need to remove.")

ChromaDB directory not found at /kaggle/working/anime_chroma_db, no need to remove.
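
The embedding step that follows sends texts to the API in batches of 100 and has to keep one output per input, even when some inputs are empty. The sketch below shows one way to keep that alignment by tracking positions rather than keying on the text itself. It is an illustration under stated assumptions, not the notebook's method: `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` stands in for the real API call. Retries and rate-limit delays are left out.

```python
from typing import Callable, Optional, Sequence

Vector = list[float]

def embed_in_batches(
    texts: Sequence[object],
    embed_fn: Callable[[list[str]], list[Vector]],
    batch_size: int = 100,
) -> list[Optional[Vector]]:
    """Embed texts batch by batch, keeping one output slot per input.

    Blank or non-string inputs are skipped and get None, so positions in
    the result always line up with positions in `texts`.
    """
    results: list[Optional[Vector]] = [None] * len(texts)
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Remember where each valid text came from.
        valid = [(start + i, t) for i, t in enumerate(batch)
                 if isinstance(t, str) and t.strip()]
        if not valid:
            continue
        vectors = embed_fn([t for _, t in valid])
        for (pos, _), vec in zip(valid, vectors):
            results[pos] = vec
    return results

# Stand-in for the real API call: a 2-dimensional "embedding" per text.
def fake_embed(batch: list[str]) -> list[Vector]:
    return [[float(len(t)), 1.0] for t in batch]

out = embed_in_batches(["abc", "", "hello", None], fake_embed, batch_size=2)
print(out)  # [[3.0, 1.0], None, [5.0, 1.0], None]
```

Any `None` left in the result would still need a policy (skip the document, or substitute a zero vector) before the list is handed to a vector store.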


In [11]:
# ==============================================================================
# 4. Semantic Embedding Generation & Storage (Using ChromaDB)
# ==============================================================================
print("\n--- 4. Semantic Embedding Generation & Storage (ChromaDB) ---")

# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

# --- 4.1 Define the Custom Gemini Embedding Function for ChromaDB ---
class GeminiEmbeddingFunction(embedding_functions.EmbeddingFunction):
    """
    Custom embedding function for ChromaDB using Google Gemini API.
    Handles batching and potential retries.
    """
    def __init__(self, model_name: str = f"models/{EMBEDDING_MODEL_NAME}", task_type: str = "RETRIEVAL_DOCUMENT", api_key: str = None):
        if not api_key:
            raise ValueError("API Key must be provided for GeminiEmbeddingFunction")
        
        self._model_name = model_name
        self._task_type = task_type
        print(f"GeminiEmbeddingFunction initialized with model: {self._model_name}, task_type: {self._task_type}")

    # Add the retry decorator for robustness
    @retry.Retry(predicate=is_retriable)
    def _embed_batch(self, batch_texts: list[str]) -> list:
        """Embeds a single batch of texts, handling API calls and errors."""
        try:
            # Filter out empty strings, as they can cause API errors
            valid_texts = [text for text in batch_texts if isinstance(text, str) and text.strip()]
            if not valid_texts:
                # Return list of Nones matching original batch size if all were invalid
                return [None] * len(batch_texts)

            # Call the Gemini API
            response = client.models.embed_content(
                model=self._model_name,
                contents=valid_texts,
                config=types.EmbedContentConfig(
                    task_type=self._task_type,
                ),
            )
            
            # Create a map from valid text to its embedding
            embedding_map = {text: emb for text, emb in zip(valid_texts, [e.values for e in response.embeddings])}
            
            # Return embeddings in the original order, inserting None for invalid texts
            final_embeddings = [embedding_map.get(text) for text in batch_texts]
            return final_embeddings

        except Exception as e:
            print(f"Error embedding batch with {len(batch_texts)} texts: {e}. Retrying if applicable...")
            # Re-raise with the original traceback so the @retry decorator can act on it
            raise

    def __call__(self, input: list[str]) -> list[list[float]]:
        """
        Embeds a list of documents using batching and rate limiting.
        Expected input: list of strings.
        Expected output: list of embeddings (list of floats).
        """
        all_embeddings = []
        batch_size = 100 # Gemini API batch limit (adjust if needed)
        total_texts = len(input)

        for i in range(0, total_texts, batch_size):
            batch_texts = input[i:i + batch_size]
            
            # Add delay BETWEEN batches to respect per-minute limits
            if i > 0:
                time.sleep(1.0) # Wait 1 second between batch API calls

            print(f"Embedding batch {i//batch_size + 1}/{(total_texts + batch_size - 1)//batch_size}...")
            batch_embeddings = self._embed_batch(batch_texts)
            
            # _embed_batch returns None for texts it skipped (empty or non-string inputs).
            # The Nones are kept as placeholders so the output stays aligned with the input;
            # whether ChromaDB accepts None embeddings is not verified here.
            if any(emb is None for emb in batch_embeddings):
                print(f"Warning: Some embeddings in batch {i//batch_size + 1} failed and are None.")

            all_embeddings.extend(batch_embeddings)

        # Check if all embeddings are None (total failure)
        if all(e is None for e in all_embeddings):
             raise RuntimeError("Failed to generate any embeddings.")

        # ChromaDB expects a list of lists of floats. If it cannot handle None,
        # the Nones must be replaced with zero vectors here, for example:
        # first_valid_emb = next((e for e in all_embeddings if e is not None), None)
        # if first_valid_emb:
        #    dim = len(first_valid_emb)
        #    all_embeddings = [e if e is not None else [0.0] * dim for e in all_embeddings]
        # else:
        #    raise RuntimeError("No valid embeddings generated to determine dimension.")

        print(f"Generated {len(all_embeddings)} embeddings.")
        return all_embeddings


# --- 4.2 Initialize ChromaDB Client and Collection ---
print(f"Initializing ChromaDB PersistentClient at: {chroma_db_path}")
chroma_client = chromadb.PersistentClient(path=chroma_db_path)

# Define the embedding function instance (requires API Key)
if GOOGLE_API_KEY:
    gemini_ef = GeminiEmbeddingFunction(api_key=GOOGLE_API_KEY, task_type="RETRIEVAL_DOCUMENT") # Use DOCUMENT type for indexing

    collection_name = "anime_synopses"
    print(f"Getting or creating ChromaDB collection: {collection_name}")
    # Pass the embedding function instance to the collection
    collection = chroma_client.get_or_create_collection(
        name=collection_name,
        embedding_function=gemini_ef,
        metadata={"hnsw:space": "cosine"} # Specify cosine distance (good practice)
    )
    print(f"Collection '{collection_name}' ready. Current count: {collection.count()}")
else:
    print("ERROR: GOOGLE_API_KEY not set. Cannot initialize GeminiEmbeddingFunction for ChromaDB.")
    collection = None # Set collection to None if setup failed

# --- 4.3 Add Documents to ChromaDB (if collection is ready and data exists) ---
# This replaces the NumPy matrix generation loop

if collection is not None and not df_processed.empty:
    print("\nAdding documents to ChromaDB collection...")
    start_time = time.time()

    # Prepare data for ChromaDB
    # We need lists of documents, ids, and optional metadatas
    documents_to_add = df_processed['cleaned_synopsis'].tolist()
    # Use the anime 'uid' as the ChromaDB ID (must be string)
    ids_to_add = df_processed['uid'].astype(str).tolist()

    # Create metadata (store title, genre, score, popularity for filtering/re-ranking)
    # --- ENSURE 'score' and 'popularity' ARE INCLUDED HERE ---
    metadata_columns = ['title', 'genre', 'score', 'popularity'] # Add 'popularity' or other relevant columns
    # Check if columns exist in df_processed
    valid_metadata_columns = [col for col in metadata_columns if col in df_processed.columns]
    print(f"Including metadata fields: {valid_metadata_columns}")
    if 'score' not in valid_metadata_columns or 'popularity' not in valid_metadata_columns:
         print("WARNING: 'score' or 'popularity' column not found in df_processed. Re-ranking might not work as expected.")

    metadatas_to_add = df_processed[valid_metadata_columns].to_dict('records')

    # --- IMPORTANT ---
    # Handle potential NaN or non-numeric types in score/popularity before adding to metadata
    # ChromaDB metadata values should ideally be simple types (string, int, float).
    for meta_dict in metadatas_to_add:
        if 'score' in meta_dict:
            # Replace NaN score with a default (e.g., 0 or average) or handle as needed
            meta_dict['score'] = float(meta_dict['score']) if pd.notna(meta_dict['score']) else 0.0
        if 'popularity' in meta_dict:
             # Ensure popularity is a number, replace NaN if necessary
             meta_dict['popularity'] = int(meta_dict['popularity']) if pd.notna(meta_dict['popularity']) else 0

    # Check if documents already exist (based on IDs) to avoid duplicates if re-running
    existing_ids = set(collection.get(ids=ids_to_add)['ids'])
    print(f"Found {len(existing_ids)} existing IDs in the collection.")

    # Filter out data that already exists
    new_documents = []
    new_ids = []
    new_metadatas = []
    for doc, id_val, meta in zip(documents_to_add, ids_to_add, metadatas_to_add):
        if id_val not in existing_ids:
            new_documents.append(doc)
            new_ids.append(id_val)
            new_metadatas.append(meta)

    if new_documents:
        print(f"Adding {len(new_documents)} new documents to the collection (this will trigger embedding generation)...")
        
        # Add data in batches (ChromaDB handles calling the embedding function)
        # The embedding function itself has internal batching for the API calls.
        # ChromaDB's add also benefits from batching the add operation itself.
        chroma_batch_size = 500 # How many items to add to ChromaDB at once
        added_count = 0
        failed_ids = []

        for i in range(0, len(new_documents), chroma_batch_size):
            print(f"Adding ChromaDB batch {i//chroma_batch_size + 1}...")
            batch_docs = new_documents[i:i + chroma_batch_size]
            batch_ids = new_ids[i:i + chroma_batch_size]
            batch_metas = new_metadatas[i:i + chroma_batch_size]
            
            try:
                 # This call will trigger the GeminiEmbeddingFunction internally
                collection.add(
                    documents=batch_docs,
                    ids=batch_ids,
                    metadatas=batch_metas
                )
                added_count += len(batch_docs)
            except Exception as e:
                print(f"ERROR adding ChromaDB batch starting at index {i}: {e}")
                # Record the failed IDs for a later retry or investigation.
                # The loop then proceeds to the next batch; add `break` here to stop instead.
                failed_ids.extend(batch_ids)

        print(f"Finished adding documents. Added: {added_count}, Failed: {len(failed_ids)}")
        print(f"Total documents in collection: {collection.count()}")
        print(f"Time taken for adding: {time.time() - start_time:.2f} seconds.")
    else:
        print("No new documents to add. Collection is up-to-date.")

    # Clean up memory if df_processed is large and embeddings are now in ChromaDB
    # del df_processed['embedding'] # Remove the old embedding column if it exists
    # Note: We might still need df_processed for metadata if not fully stored in Chroma

elif collection is None:
    print("ChromaDB collection not initialized. Skipping document adding.")
else: # df_processed is empty
    print("Processed Anime Dataset is empty. Skipping document adding to ChromaDB.")

# We no longer need the NumPy embeddings_matrix
embeddings_matrix = None # Set to None to indicate it's not used


--- 4. Semantic Embedding Generation & Storage (ChromaDB) ---
Initializing ChromaDB PersistentClient at: /kaggle/working/anime_chroma_db
GeminiEmbeddingFunction initialized with model: models/text-embedding-004, task_type: RETRIEVAL_DOCUMENT
Getting or creating ChromaDB collection: anime_synopses
Collection 'anime_synopses' ready. Current count: 0

Adding documents to ChromaDB collection...
Including metadata fields: ['title', 'genre', 'score', 'popularity']
Found 0 existing IDs in the collection.
Adding 7686 new documents to the collection (this will trigger embedding generation)...
Adding ChromaDB batch 1...
Embedding batch 1/5...
Embedding batch 2/5...
Embedding batch 3/5...
Embedding batch 4/5...
Embedding batch 5/5...
Generated 500 embeddings.
Adding ChromaDB batch 2...
Embedding batch 1/5...
Embedding batch 2/5...
Embedding batch 3/5...
Embedding batch 4/5...
Embedding batch 5/5...
Generated 500 embeddings.
Adding ChromaDB batch 3...
Embedding batch 1/5...
Embedding batch 2/5...
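
The embedding function above can leave `None` entries for items whose embedding failed, and its comments leave the handling strategy open. A minimal sketch of one option follows: drop failed items (keeping ids, documents, and vectors aligned) before anything is sent to the vector store. `drop_failed_embeddings` is a hypothetical helper, not part of the notebook or of ChromaDB, and it assumes embeddings were computed ahead of the `add` call.

```python
from typing import List, Optional, Sequence, Tuple


def drop_failed_embeddings(
    ids: Sequence[str],
    documents: Sequence[str],
    embeddings: Sequence[Optional[List[float]]],
) -> Tuple[List[str], List[str], List[List[float]], List[str]]:
    """Split inputs into successfully embedded items and the IDs that failed.

    Hypothetical helper: keeps ids, documents, and embeddings aligned
    so that only complete triples are passed on.
    """
    if not (len(ids) == len(documents) == len(embeddings)):
        raise ValueError("ids, documents, and embeddings must have the same length")

    kept_ids: List[str] = []
    kept_docs: List[str] = []
    kept_embs: List[List[float]] = []
    failed_ids: List[str] = []

    for id_val, doc, emb in zip(ids, documents, embeddings):
        if emb is None:
            # Remember the failure so it can be retried later
            failed_ids.append(id_val)
        else:
            kept_ids.append(id_val)
            kept_docs.append(doc)
            kept_embs.append(emb)

    return kept_ids, kept_docs, kept_embs, failed_ids
```

The kept lists could then be passed to `collection.add(ids=..., documents=..., embeddings=...)`, while `failed_ids` is logged or retried. Compared with zero-vector replacement, this avoids storing placeholder vectors that could surface as meaningless search hits.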

In [12]:
# ==============================================================================
# 5. Semantic Retrieval Implementation (Using ChromaDB with Re-ranking)
# ==============================================================================
print("\n--- 5. Semantic Retrieval Implementation (ChromaDB + Re-ranking) ---")

# Define weights for re-ranking (these should sum to 1)
# Adjust these based on experimentation and desired outcome
W_SIMILARITY = 0.3
W_SCORE = 0.1
W_POPULARITY = 0.6 # Currently the dominant term; lower it to favor semantic similarity

# Number of initial candidates to fetch for re-ranking
INITIAL_K = 100 # Fetch many more candidates than TOP_K_RESULTS so re-ranking has room to reorder

def find_similar_anime_chromadb(query, collection, top_k=TOP_K_RESULTS):
    """Finds anime most similar to a query using ChromaDB."""
    if collection is None:
         print("Error: ChromaDB collection is not available.")
         return pd.DataFrame() # Return empty DataFrame

    print(f"\nQuerying ChromaDB for: '{query}'...")
    try:
        # ChromaDB uses the collection's embedding function to embed the query
        # By default, it uses the same task_type ("RETRIEVAL_DOCUMENT")
        # For potentially better results, embed query separately with "RETRIEVAL_QUERY"
        # query_embedding = get_embedding(query, task_type="RETRIEVAL_QUERY") # Use the single embedding func
        # if query_embedding is None: raise ValueError("Failed to embed query")
        # results = collection.query(query_embeddings=[query_embedding], n_results=top_k, include=['metadatas', 'distances'])

        # Simpler approach: Let ChromaDB handle query embedding with its default EF
        results = collection.query(
            query_texts=[query],
            n_results=top_k,
            include=['metadatas', 'distances'] # Request metadata and distances
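            # You could add a metadata 'where' clause here for filtering if needed, e.g.
            # where={"score": {"$gte": 7.5}}
            # ChromaDB's documented metadata operators are $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin;
            # a SQL-style "$like" substring match on metadata is not among them.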
        )

    except Exception as e:
        print(f"Error querying ChromaDB: {e}")
        return pd.DataFrame()

    # Process ChromaDB results into a DataFrame similar to the previous one
    if not results or not results.get('ids') or not results['ids'][0]:
        print("No results found in ChromaDB for this query.")
        return pd.DataFrame()

    ids = results['ids'][0]
    distances = results['distances'][0]
    metadatas = results['metadatas'][0]

    # Convert distance to similarity (Cosine distance d = 1 - s => similarity s = 1 - d)
    similarities = [1 - d for d in distances]

    # Create DataFrame
    results_df = pd.DataFrame({
        'uid': ids, # ChromaDB IDs are strings, match original type if needed later
        'similarity': similarities,
        # Extract metadata fields
        'title': [m.get('title', 'N/A') for m in metadatas],
        'genre': [m.get('genre', 'N/A') for m in metadatas],
        'score': [m.get('score', np.nan) for m in metadatas]
        # Add other metadata fields if stored
    })
    
    # Convert uid back to int if needed for joins later (assuming original uid was int)
    try:
        results_df['uid'] = results_df['uid'].astype(int)
    except ValueError:
        print("Warning: Could not convert ChromaDB string IDs back to integer.")


    print(f"Found {len(results_df)} results from ChromaDB (Top {top_k}).")
    return results_df

def find_similar_anime_chromadb_reranked(query, collection, top_k=TOP_K_RESULTS, k_initial=INITIAL_K):
    """
    Finds anime most similar to a query using ChromaDB, then re-ranks
    the initial results based on similarity, score, and popularity.
    """
    if collection is None:
         print("Error: ChromaDB collection is not available.")
         return pd.DataFrame()

    print(f"\nQuerying ChromaDB for initial {k_initial} candidates for: '{query}'...")
    try:
        # Query ChromaDB for more initial candidates
        results = collection.query(
            query_texts=[query],
            n_results=k_initial, # Fetch more results initially
            include=['metadatas', 'distances'] # Ensure metadata (score, pop) is included
        )
    except Exception as e:
        print(f"Error querying ChromaDB: {e}")
        return pd.DataFrame()

    # Process ChromaDB results
    if not results or not results.get('ids') or not results['ids'][0]:
        print(f"No results found in ChromaDB for this query (initial fetch).")
        return pd.DataFrame()

    ids = results['ids'][0]
    distances = results['distances'][0]
    metadatas = results['metadatas'][0]

    # Convert distance to similarity
    similarities = [1 - d for d in distances]

    # Create DataFrame with initial results
    initial_results_df = pd.DataFrame({
        'uid': ids,
        'similarity': similarities,
        'title': [m.get('title', 'N/A') for m in metadatas],
        'genre': [m.get('genre', 'N/A') for m in metadatas],
        # --- Extract score and popularity ---
        # Use .get with a default (e.g., 0 or NaN) if metadata might be missing
        'score': [m.get('score', 0.0) for m in metadatas],
        'popularity': [m.get('popularity', 0) for m in metadatas]
    })

    # --- Re-ranking Phase ---
    print(f"Re-ranking the initial {len(initial_results_df)} candidates...")

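    # Handle potential missing data before normalization (though defaults in .get should handle it).
    # Assign the result back instead of calling fillna(inplace=True) on a column selection,
    # which newer pandas versions warn about and may not apply to the DataFrame.
    initial_results_df['score'] = initial_results_df['score'].fillna(0.0)
    initial_results_df['popularity'] = initial_results_df['popularity'].fillna(0)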

    # Normalize features (Similarity, Score, Popularity) to 0-1 range
    scaler = MinMaxScaler()
    # Prepare data for scaling (handle cases with only 1 result where min=max)
    features_to_scale = []
    if initial_results_df['similarity'].nunique() > 1:
        features_to_scale.append('similarity')
    else: # Handle constant value case
         initial_results_df['norm_similarity'] = 0.5 # Assign a neutral value or 1.0

    if initial_results_df['score'].nunique() > 1:
         features_to_scale.append('score')
    else:
         initial_results_df['norm_score'] = 0.5

    # --- Popularity (Lower is better) ---
    # No log transform needed if it's already a rank
    if initial_results_df['popularity'].nunique() > 1:
        features_to_scale.append('popularity')
    else:
         initial_results_df['norm_popularity'] = 0.5


    if features_to_scale:
        scaled_features = scaler.fit_transform(initial_results_df[features_to_scale])
        # Create normalized columns
        for i, feature in enumerate(features_to_scale):
             initial_results_df[f'norm_{feature}'] = scaled_features[:, i] # Assign back to df

    # Calculate combined score using weights
    # Ensure normalized columns exist before calculating
    norm_sim = initial_results_df.get('norm_similarity', 0.5) # Default if column doesn't exist
    norm_score = initial_results_df.get('norm_score', 0.5)
    norm_pop = initial_results_df.get('norm_popularity', 0.5)

    initial_results_df['combined_score'] = (W_SIMILARITY * norm_sim +
                                            W_SCORE * norm_score +
                                            W_POPULARITY * (1 - norm_pop)) # Invert popularity contribution

    # Sort by the combined score in descending order
    reranked_df = initial_results_df.sort_values(by='combined_score', ascending=False)

    # Select the final top_k results
    final_results_df = reranked_df.head(top_k).copy()

    # Convert uid back to int if needed for joins later
    try:
        final_results_df['uid'] = final_results_df['uid'].astype(int)
    except ValueError:
        print("Warning: Could not convert ChromaDB string IDs back to integer in final results.")

    print(f"Re-ranking complete. Returning top {len(final_results_df)} results.")
    # Optionally display the combined score for debugging/analysis
    # print(final_results_df[['uid', 'title', 'similarity', 'score', 'popularity', 'combined_score']].round(4).to_string(index=False))

    return final_results_df


--- 5. Semantic Retrieval Implementation (ChromaDB + Re-ranking) ---
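
The re-ranking step above min-max normalizes similarity, score, and popularity rank over the candidate set, then combines them with `W_SIMILARITY`, `W_SCORE`, and `W_POPULARITY`, inverting popularity because a lower rank is better. A minimal, dependency-free sketch of that arithmetic is shown below. `min_max` and `combined_scores` are hypothetical names used only for illustration, and the candidate values are made up.

```python
from typing import List, Sequence

# Weights copied from the cell above
W_SIMILARITY = 0.3
W_SCORE = 0.1
W_POPULARITY = 0.6


def min_max(values: Sequence[float], constant_value: float = 0.5) -> List[float]:
    """Scale values to [0, 1]; use a neutral value when all entries are equal."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # Mirrors the notebook's handling of a constant column
        return [constant_value] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combined_scores(
    similarities: Sequence[float],
    scores: Sequence[float],
    popularity_ranks: Sequence[float],
) -> List[float]:
    """Weighted sum of normalized features; popularity is inverted (lower rank is better)."""
    norm_sim = min_max(similarities)
    norm_score = min_max(scores)
    norm_pop = min_max(popularity_ranks)
    return [
        W_SIMILARITY * s + W_SCORE * r + W_POPULARITY * (1 - p)
        for s, r, p in zip(norm_sim, norm_score, norm_pop)
    ]


# Two made-up candidates: the first is more similar and more popular (lower rank),
# the second has the higher user score.
example = combined_scores([0.80, 0.70], [7.0, 8.0], [100, 2000])
print(example)  # roughly [0.9, 0.1]
```

Because normalization is relative to the retrieved candidates, the same title can receive a different combined score for different queries. With the weights as set, popularity rank contributes up to 0.6 of the total, twice the contribution of semantic similarity.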


In [13]:
# ==============================================================================
# 6. Response Generation with LLM (RAG: Synopsis + Reviews)
# ==============================================================================
print("\n--- 6. Response Generation with LLM (RAG) ---")

def generate_recommendations(query, context_df, reviews_df, model_name=GENERATIVE_MODEL_NAME, reviews_per_anime=REVIEWS_PER_ANIME):
    """Generates motivated recommendations using LLM, retrieved context (synopsis), and reviews."""

    if not GOOGLE_API_KEY:
        return "API Key not available. Cannot generate recommendation."

    # context_df here is the DataFrame returned by the retrieval step
    # (find_similar_anime_chromadb or find_similar_anime_chromadb_reranked)
    # *after* merging it with df_processed to get the 'cleaned_synopsis'
    if context_df.empty:
        return "No relevant context provided to generate recommendations."

    context_text_for_prompt = ""
    print("\nPreparing context for LLM (including reviews)...")
    for index, row in context_df.iterrows():
        # Ensure column names match the DataFrame passed to this function
        try:
            anime_id = row['uid'] # Should be present after merge
            title = row['title'] # Should be present from Chroma metadata or merge
            synopsis = row['cleaned_synopsis'] # CRUCIAL: Must be present after merge
            similarity = row['similarity'] # Present from Chroma results
        except KeyError as e:
            print(f"ERROR: Missing expected column in context_df for LLM: {e}")
            continue # Skip this entry if essential data is missing

        review_snippets = "No relevant reviews found in the processed dataset."
        if not reviews_df.empty:
            # Retrieve reviews for this anime using 'anime_uid' from the reviews dataframe
            # Ensure 'anime_id' (which is the 'uid' from context_df) is the correct type for matching
            try:
                # Ensure reviews_df['anime_uid'] is the same type as anime_id (e.g., int)
                anime_reviews = reviews_df[reviews_df['anime_uid'] == int(anime_id)] # Explicit cast if needed
                if not anime_reviews.empty:
                    selected_reviews = anime_reviews['cleaned_review'].dropna().head(reviews_per_anime).tolist()
                    if selected_reviews:
                        review_snippets = "\n".join([f"- \"{rev[:350]}...\"" for rev in selected_reviews])
            except Exception as e:
                 print(f"Warning: Error retrieving reviews for anime_id {anime_id}: {e}")


        # Add the information to the context for the prompt
        context_text_for_prompt += f"--- Retrieved Anime ---\n"
        context_text_for_prompt += f"ID: {anime_id}\n"
        context_text_for_prompt += f"Title: {title}\n"
        context_text_for_prompt += f"Synopsis Similarity to Query: {similarity:.4f}\n"
        context_text_for_prompt += f"Synopsis: {synopsis}\n"
        context_text_for_prompt += f"User Review Excerpts:\n{review_snippets}\n\n"

    if not context_text_for_prompt:
         return "Could not construct any valid context for the LLM."

    # Build the updated prompt
    prompt = f"""You are an expert and empathetic anime and manga advisor. A user expresses a desire based on a feeling, inspiration, or atmosphere:
"{query}"

Based EXCLUSIVELY on the following retrieved information (titles, synopses, synopsis similarity to query, and excerpts of user reviews), suggest 2 or 3 anime/manga from the provided list.
For each suggestion:
1.  State the Title.
2.  Briefly explain why you think it might match the user's requested feeling/inspiration. Refer to BOTH the synopsis (for plot/theme) AND the user impressions in the reviews (for atmosphere/perceived emotional impact).
3.  Be concise, engaging, and direct. Do not invent information not present in the provided context.

Here is the retrieved information:
{context_text_for_prompt}
---

Your 2-3 motivated recommendations:
"""

    # Call the LLM
    print(f"Sending request to {model_name}...")
    answer = None  # Defined up front so the error handler can inspect a partial response
    try:
        answer = client.models.generate_content(
            model=model_name,  # Use the function argument instead of a hard-coded model
            contents=prompt
        )
        return textwrap.fill(answer.text, width=90)
    except Exception as e:
        print(f"Error during LLM response generation: {e}")
        try:
            if answer is not None and getattr(answer, 'prompt_feedback', None):
                 print(f"Prompt Feedback: {answer.prompt_feedback}")
            elif answer is not None and getattr(answer, 'candidates', None):
                 print(f"Candidate Finish Reason: {answer.candidates[0].finish_reason}")
                 print(f"Safety Ratings: {answer.candidates[0].safety_ratings}")
        except Exception as feedback_e:
            print(f"Could not retrieve feedback details: {feedback_e}")
        return "Sorry, I encountered a technical issue while generating the recommendation."


--- 6. Response Generation with LLM (RAG) ---
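
`generate_recommendations` wraps the model's reply with `textwrap.fill`, which by default replaces every whitespace character, including newlines, with a space. A multi-line Markdown answer therefore collapses into a single reflowed paragraph. A minimal sketch of an alternative that wraps each line on its own, so list items and blank lines survive, is given below. `wrap_preserving_lines` is a hypothetical helper, not something the notebook defines.

```python
import textwrap


def wrap_preserving_lines(text: str, width: int = 90) -> str:
    """Wrap each input line separately so existing line breaks are kept."""
    wrapped_lines = []
    for line in text.splitlines():
        if line.strip():
            # Wrap only this line; leading indentation of the line is kept
            wrapped_lines.append(textwrap.fill(line, width=width))
        else:
            # Preserve blank lines as paragraph separators
            wrapped_lines.append("")
    return "\n".join(wrapped_lines)


sample = "1. **Title:** A\n    * Why: because\n\n2. **Title:** B"
print(wrap_preserving_lines(sample))
```

If the result is passed to `Markdown(...)` for display, wrapping may be unnecessary altogether, since the renderer reflows text itself.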


In [14]:
# ==============================================================================
# 7. Putting It All Together: The Main Search Function (Using ChromaDB + Re-ranking)
# ==============================================================================
print("\n--- 7. Main Semantic Search Function (Using ChromaDB + Re-ranking) ---")

# Make sure 'collection' is globally accessible or passed correctly
# If running sections independently, re-get the collection:
# try:
#     chroma_client = chromadb.PersistentClient(path=chroma_db_path)
#     collection = chroma_client.get_collection(name=collection_name, embedding_function=gemini_ef) # Need EF again if client is recreated
# except Exception as e:
#     print(f"Error re-getting collection: {e}")
#     collection = None

def semantic_anime_search_engine(query, chroma_collection=collection, df_rev=df_reviews_processed):
    """Performs the complete semantic search using ChromaDB + Re-ranking + RAG."""
    print("=" * 60)
    print(f"NEW SEMANTIC SEARCH (ChromaDB + Re-ranking) FOR: '{query}'")
    print("=" * 60)

    # Preliminary checks
    if not GOOGLE_API_KEY:
         print("ERROR: Google API Key not configured.")
         return "API configuration missing."
    if chroma_collection is None:
         print("ERROR: ChromaDB collection is not available.")
         return "Cannot perform search. ChromaDB collection missing."
    # We still need reviews data
    if df_rev.empty:
         print("INFO: Reviews dataset is not available or empty. Recommendations will be based on synopses only.")

    # 1. Retrieve AND RE-RANK similar documents using ChromaDB
    print("\n--- Phase 1: Retrieval & Re-ranking (ChromaDB) ---")
    # --- CALL THE NEW RE-RANKING FUNCTION ---
    #retrieved_results = find_similar_anime_chromadb(query, chroma_collection)
    retrieved_results = find_similar_anime_chromadb_reranked(query, chroma_collection, top_k=TOP_K_RESULTS, k_initial=INITIAL_K)

    if retrieved_results.empty:
        print("\n--- Final Result ---")
        print("I couldn't find any similar anime/manga in the ChromaDB collection for your request after re-ranking.")
        print("=" * 60)
        return "No relevant results found in the retrieval/re-ranking phase."

    print(f"\nDocuments retrieved and re-ranked (Top {TOP_K_RESULTS}):")
    # Display relevant info from the re-ranked DataFrame
    print(retrieved_results[['uid', 'title', 'similarity', 'score', 'popularity', 'combined_score']].round(4).to_string(index=False))

    # 2. Fetch synopses: merge the re-ranked results with df_processed on 'uid'
    print("\nFetching synopses for re-ranked results...")
    # Make sure df_processed is available
    if 'df_processed' not in globals() or df_processed.empty:
         print("ERROR: df_processed not available to fetch synopses for retrieved results.")
         return "Cannot proceed without original data to fetch synopses."
    try:
         df_processed['uid'] = df_processed['uid'].astype(int) # Ensure type match
         retrieved_results_with_synopsis = pd.merge(
             retrieved_results,
             df_processed[['uid', 'cleaned_synopsis']],
             on='uid',
             how='left'
         )
         retrieved_results_with_synopsis.dropna(subset=['cleaned_synopsis'], inplace=True) # Drop if merge failed for some
         if retrieved_results_with_synopsis.empty: raise ValueError("No results after merge")
    except Exception as e:
         print(f"ERROR merging results with df_processed to get synopsis: {e}")
         return "Failed to fetch synopses for re-ranked results."


    # 3. Generate the response using the LLM (remains the same logic)
    print("\n--- Phase 2: Generation (Creating recommendation with LLM) ---")
    # Pass the re-ranked DataFrame (now including 'cleaned_synopsis')
    final_recommendation = generate_recommendations(query, retrieved_results_with_synopsis, df_rev)

    #print("\n--- Final Generated Recommendation ---")
    #print(final_recommendation)
    #print("=" * 60)
    return Markdown(final_recommendation)


--- 7. Main Semantic Search Function (Using ChromaDB + Re-ranking) ---
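
One detail of the signature above: Python evaluates default argument values once, when the `def` statement runs. `chroma_collection=collection` therefore captures whatever `collection` referred to at that moment, and re-creating the collection later (for example after re-running section 4) does not change the default. A minimal sketch of the common `None`-default pattern follows. The function names and the string stand-ins for the collection are hypothetical.

```python
# Stand-in for the ChromaDB collection object created earlier in the notebook
collection = "old-collection"


def search_bound_at_definition(query, chroma_collection=collection):
    """Default is frozen to the value `collection` had when this def ran."""
    return chroma_collection


def search_resolved_at_call(query, chroma_collection=None):
    """Default is looked up each time the function is called."""
    if chroma_collection is None:
        chroma_collection = collection  # Reads the current global value
    return chroma_collection


# Simulate re-creating the collection after the functions were defined
collection = "new-collection"

print(search_bound_at_definition("q"))  # old-collection
print(search_resolved_at_call("q"))     # new-collection
```

The same consideration applies to `df_rev=df_reviews_processed`. Passing both objects explicitly at the call site avoids the issue without changing the function.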


In [15]:
# ==============================================================================
# 8. Usage Examples (Updated to use ChromaDB search function)
# ==============================================================================
print("\n--- 8. Usage Examples (ChromaDB) ---")

# Example 1: Epic and Inspiring
#semantic_anime_search_engine("I want something truly epic and inspiring, that gives me a huge emotional and visual charge")

# Example 2: Melancholic and Reflective
#semantic_anime_search_engine("I'm looking for a melancholic, quiet, and reflective anime, maybe with a slightly mysterious or supernatural atmosphere")

# Example 3: Light Romantic Comedy
#semantic_anime_search_engine("I need to unwind with a light, funny romantic comedy that makes me feel good")

# Example 4: Dark and Psychological
#semantic_anime_search_engine("Recommend something dark, psychological, a bit unsettling, that makes you think about human nature")

# Example 5: Carefree Fantasy Adventure
#semantic_anime_search_engine("A carefree adventure in a colorful fantasy world, to dream a little")

semantic_anime_search_engine("a tear-jerking romantic story between a boy and a girl from different social classes")


--- 8. Usage Examples (ChromaDB) ---
NEW SEMANTIC SEARCH (ChromaDB + Re-ranking) FOR: 'a tear-jerking romantic story between a boy and a girl from different social classes'

--- Phase 1: Retrieval & Re-ranking (ChromaDB) ---

Querying ChromaDB for initial 100 candidates for: 'a tear-jerking romantic story between a boy and a girl from different social classes'...
Embedding batch 1/1...
Generated 1 embeddings.
Re-ranking the initial 100 candidates...
Re-ranking complete. Returning top 10 results.

Documents retrieved and re-ranked (Top 10):
  uid                      title  similarity  score  popularity  combined_score
 5150          Hatsukoi Limited.      0.7972   7.37        1449          0.8365
22839                 Cross Road      0.7800   7.46        2040          0.7572
  345 Eikoku Koi Monogatari Emma      0.7802   7.72        2281          0.7522
13833           Nagareboshi Lens      0.7901   6.71        3448          0.7303
 1689      Byousoku 5 Centimeter      0.7460   7.86  

Okay, based on your request for a tear-jerking romantic story between a boy and a girl
from different social classes, here are a couple of anime suggestions:  1.  **Title:**
Eikoku Koi Monogatari Emma     *   **Why:** This anime directly addresses the social class
divide in 19th-century London. The synopsis describes a maid, Emma, falling for William, a
member of the gentry. The reviews highlight the classic, potentially heartbreaking nature
of the love story, with one reviewer directly comparing it to "Romeo and Juliet." This
suggests a strong potential for the tear-jerking element you're looking for. 2.
**Title:** Mudai     *   **Why:** While not explicitly about social class, the synopsis
speaks of a "modest" life for the artist and his girlfriend, followed by unprecedented
success that seems to lead to the world crumbling around him. This has the potential to be
a tear-jerking story about love tested by changing circumstances and the pressures of
success, with emotional highs and lows.