# Using the RAG Pipeline with the IMDb Dataset

This notebook demonstrates how to use the existing RAG (Retrieval Augmented Generation) pipeline to load, process, and query data from the IMDb dataset. We will leverage the helper scripts available in the `app` directory for interacting with Azure OpenAI and Azure Cognitive Search.

**Prerequisites:**

Before running this notebook, ensure you have:
1.  Configured your Azure OpenAI and Azure Cognitive Search credentials in a `.env` file in the root of this repository. The required environment variables are detailed in:
    *   `app/openai_client.py` (for Azure OpenAI details like `AZURE_OPENAI_KEY`, `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_EMBEDDING_MODEL`, `AZURE_OPENAI_COMPLETION_MODEL`)
    *   `app/search_client.py` (for Azure Cognitive Search details like `AZURE_SEARCH_ENDPOINT`, `AZURE_SEARCH_KEY`, `AZURE_SEARCH_INDEX`)
2.  Installed all necessary Python packages. You can usually install them by running `pip install -r requirements.txt` in your terminal. This notebook might also require additional libraries like `datasets` (for Hugging Face datasets) and `pandas`, which we'll install below if needed.
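
As a quick sanity check before running the rest of the notebook, the sketch below reports which of the environment variables listed above are unset. It assumes the `.env` values have already been loaded into the process environment (for example by the `app` modules or a tool such as `python-dotenv`); `find_missing_env_vars` is a hypothetical helper written for this notebook, not part of the `app` package.

```python
import os

# Variable names taken from the prerequisites list above.
REQUIRED_ENV_VARS = [
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_MODEL",
    "AZURE_OPENAI_COMPLETION_MODEL",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_KEY",
    "AZURE_SEARCH_INDEX",
]


def find_missing_env_vars(names, environ=None):
    """Return the names that are unset or blank in the given environment mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name, "").strip()]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All listed environment variables are set.")
```

If anything is reported as missing, check the `.env` file and the two client modules named above before continuing.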

In [None]:
# Install specific libraries needed for this notebook if not already present
# It's generally recommended to manage dependencies via requirements.txt
# but this cell ensures these are available for notebook execution.
try:
    import datasets
    import pandas
    import fitz # PyMuPDF, used by app.ingest
    import openai # Used by app.openai_client
    from azure.search.documents import SearchClient # Used by app.search_client
    from azure.core.credentials import AzureKeyCredential # Used by app.search_client
    print("Required libraries are likely already installed.")
except ImportError:
    print("Installing missing libraries: datasets, pandas, PyMuPDF, openai, azure-search-documents, azure-core...")
    # Note: The notebook environment might require a kernel restart after installation.
    %pip install datasets pandas PyMuPDF openai azure-search-documents azure-core
    print("Installation complete. You might need to restart the kernel for the changes to take effect.")

# Basic imports to check if the app modules can be reached (assuming notebook is in repo root)
# Actual usage will be in later cells.
print("\nChecking if app modules are accessible...")
import app.openai_client
import app.search_client
import app.ingest
import app.rag_pipeline
print("App modules seem accessible.")

## 2. Load IMDb Dataset

We'll use the Hugging Face `datasets` library to easily download and load the IMDb dataset. This dataset contains movie reviews, which we'll use as the source text for our RAG system. We'll load a small subset for demonstration purposes.
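
One thing to watch when taking a leading slice of a dataset: if the rows are ordered by label (which appears to be the case for the IMDb train split), the first rows may all share the same sentiment. The sketch below uses a toy label list to show how to check the label balance of a sample and how a seeded shuffle changes it. `label_counts` and `shuffled_sample` are hypothetical helpers, and the toy data is an assumption for illustration only.

```python
import random
from collections import Counter


def label_counts(labels):
    """Count how many examples carry each label."""
    return dict(Counter(labels))


def shuffled_sample(items, sample_size, seed=42):
    """Return a reproducible random sample instead of a leading slice."""
    rng = random.Random(seed)
    items = list(items)
    return rng.sample(items, min(sample_size, len(items)))


# Toy stand-in for a label column sorted by class (0 = negative, 1 = positive).
toy_labels = [0] * 10 + [1] * 10

print("Leading slice:", label_counts(toy_labels[:5]))
print("Shuffled sample:", label_counts(shuffled_sample(toy_labels, 5)))
```

With the `datasets` library, a comparable approach would be to load the full split and call `.shuffle(seed=42).select(range(1000))` on it before converting to a DataFrame. Running `imdb_df['label'].value_counts()` after loading is a quick way to confirm what the sample actually contains.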

In [None]:
from datasets import load_dataset
import pandas as pd

# Load the IMDb dataset from Hugging Face datasets
print("Loading IMDb dataset...")
try:
    # Load the 'train' split, and take only the first 1000 examples for quicker processing
    # You can adjust the number of samples or use the full dataset if preferred.
    imdb_dataset = load_dataset("imdb", split="train[:1000]") 
    print("IMDb dataset loaded successfully.")
except Exception as e:
    print(f"Error loading IMDb dataset: {e}")
    print("Please ensure you have an active internet connection and the 'datasets' library is correctly installed.")
    imdb_dataset = None

if imdb_dataset is not None:
    # Convert to pandas DataFrame for easier manipulation and viewing
    imdb_df = pd.DataFrame(imdb_dataset)
    
    print("\nFirst 5 rows of the IMDb dataset:")
    print(imdb_df.head())
    
    print("\nDataset information:")
    imdb_df.info()
    
    # We will primarily use the 'text' (review) and 'label' (sentiment) fields.
    # For RAG, the 'text' field will be our document content.
else:
    print("\nSkipping DataFrame creation as dataset loading failed.")
    imdb_df = None # Ensure imdb_df is None if dataset loading failed

## 3. Prepare Data for Ingestion

Now that we have the IMDb data, we need to process it into a format suitable for our RAG pipeline. This involves:
1.  **Chunking**: Breaking down long reviews into smaller, manageable pieces.
2.  **Embedding**: Converting each text chunk into a numerical vector representation using an Azure OpenAI embedding model.
3.  **Formatting**: Structuring the data (chunk, embedding, metadata) into documents for Azure Cognitive Search.
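
To make the chunking step concrete, here is a minimal sketch of character-based chunking with overlap. It is an illustration under stated assumptions, not the implementation in `app.ingest`: the real `chunk_text` may split on tokens, sentences, or something else entirely. `chunk_text_sketch` is a hypothetical name.

```python
def chunk_text_sketch(text, chunk_size=1000, overlap=200):
    """Split text into character windows of `chunk_size` that overlap by `overlap` characters.

    Hypothetical illustration only; app.ingest.chunk_text may behave differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghij" * 3  # 30 characters
for piece in chunk_text_sketch(sample, chunk_size=12, overlap=4):
    print(len(piece), piece)
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk. The cost is more chunks, and therefore more embedding calls.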

In [None]:
import sys
import os
import json

# Add the current working directory (expected to be the repo root) to the Python path
# so that the 'app' package can be imported.
# This assumes the notebook is run from the repo root, with 'app' as a subdirectory.
notebook_dir = os.getcwd() # Should be the repo root
if notebook_dir not in sys.path:
    sys.path.append(notebook_dir)

try:
    from app.openai_client import get_embedding
    from app.ingest import chunk_text, EMBEDDING_DIMENSIONS # EMBEDDING_DIMENSIONS is defined in ingest.py
    from app.search_client import AZURE_SEARCH_INDEX_NAME # Using the default from search_client
    print("Successfully imported functions from 'app' module.")
except ImportError as e:
    print(f"Error importing from 'app' module: {e}")
    print("Please ensure the notebook is in the root directory of the repository and the 'app' module is structured correctly.")
    print("Continuing with placeholder fallbacks; chunking and embedding will not work until the imports succeed.")
    # Execution is not halted here. The fallbacks below only prevent a NameError in later cells;
    # raise an exception instead if you prefer to stop as soon as the imports fail.
    def get_embedding(text): print("Error: get_embedding not loaded"); return []
    def chunk_text(text, **kwargs): print("Error: chunk_text not loaded"); return [text]
    EMBEDDING_DIMENSIONS = 1536 # Default, but might be wrong if not loaded
    AZURE_SEARCH_INDEX_NAME = "rag-vector-index" # Default, but might be wrong

# Ensure imdb_df is available from the previous step
if 'imdb_df' not in globals() or imdb_df is None:
    print("Error: imdb_df is not available. Please ensure the previous cells for loading data have run successfully.")
    # Stop or handle error appropriately
    documents_to_upload = []
else:
    print(f"Preparing documents for ingestion. Using embedding dimension: {EMBEDDING_DIMENSIONS}")
    print(f"Target Azure Search Index Name: {AZURE_SEARCH_INDEX_NAME}")

    documents_to_upload = []
    # Let's process a smaller subset for this example to speed up embedding generation.
    # You can increase this number or process the whole imdb_df.
    sample_size_for_ingestion = 50 
    
    # Check if imdb_df has enough rows
    if len(imdb_df) < sample_size_for_ingestion:
        print(f"Warning: Requested sample size {sample_size_for_ingestion} is larger than the DataFrame size {len(imdb_df)}. Processing all available rows.")
        sample_df = imdb_df
    else:
        sample_df = imdb_df.head(sample_size_for_ingestion)

    print(f"Processing {len(sample_df)} reviews for ingestion...")

    for index, row in sample_df.iterrows():
        review_text = row['text']
        review_label = row['label'] # 0 for negative, 1 for positive

        # Chunk the review text
        text_chunks = chunk_text(review_text, chunk_size=1000, overlap=200)

        for i, chunk in enumerate(text_chunks):
            print(f"Processing review {index}, chunk {i+1}/{len(text_chunks)}...")
            embedding = get_embedding(chunk)

            if not embedding:
                print(f"Warning: Failed to generate embedding for review {index}, chunk {i+1}. Skipping this chunk.")
                continue

            if len(embedding) != EMBEDDING_DIMENSIONS:
                print(f"Warning: Embedding for review {index}, chunk {i+1} has dimension {len(embedding)}, expected {EMBEDDING_DIMENSIONS}. Skipping chunk.")
                continue

            # Create a unique ID for each chunk
            # Using DataFrame index and chunk index to ensure uniqueness
            doc_id = f"imdb_{index}_chunk_{i}"
            
            document = {
                "id": doc_id,
                "content": chunk,
                "content_vector": embedding,
                "source_document_id": f"imdb_review_{index}", # Original review identifier
                "metadata": json.dumps({ # search_client expects metadata as a JSON string
                    "original_review_index": index,
                    "sentiment_label": "positive" if review_label == 1 else "negative",
                    "chunk_index": i,
                    "text_length": len(chunk)
                })
            }
            documents_to_upload.append(document)
            # print(f"Prepared document ID: {doc_id} for upload.")

    print(f"\nTotal documents prepared for upload: {len(documents_to_upload)}")
    if documents_to_upload:
        print("First prepared document (sample):")
        print(documents_to_upload[0]['id'])
        print(documents_to_upload[0]['content'][:100] + "...") # Print first 100 chars of content
        # print(documents_to_upload[0]['metadata']) # metadata can be long

**Note on API Calls and Costs:**

Generating embeddings involves making API calls to Azure OpenAI, which may incur costs depending on your subscription and the number of chunks processed. For this demonstration, we are processing a small subset of the data. If you run this on a very large dataset, be mindful of the potential cost and time involved.
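
To get a rough idea of the number of embedding requests before running the preparation cell, the sketch below estimates the chunk count from text lengths. It assumes character-based sliding-window chunking and one API call per chunk; both are assumptions about how `chunk_text` and `get_embedding` behave, and the function names are hypothetical.

```python
import math


def estimate_chunk_count(text_length, chunk_size=1000, overlap=200):
    """Estimate how many overlapping character windows a text of this length produces."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - overlap
    # One full first window, then one more window per `step` characters that remain.
    return 1 + math.ceil((text_length - chunk_size) / step)


def estimate_embedding_calls(text_lengths, chunk_size=1000, overlap=200):
    """Total embedding requests, assuming each chunk is embedded with one API call."""
    return sum(estimate_chunk_count(n, chunk_size, overlap) for n in text_lengths)


review_lengths = [350, 1000, 1001, 2600]  # toy character counts
print(estimate_embedding_calls(review_lengths))  # prints 7
```

In the notebook, the lengths could come from `sample_df['text'].str.len()`. Multiplying the estimated call count by your model's pricing gives an order-of-magnitude cost figure, not an exact one.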

## 4. Ingest Data into Azure Cognitive Search

With the data prepared, the next step is to upload it to our Azure Cognitive Search index. This will make the data searchable and retrievable by our RAG pipeline.

We will:
1.  Ensure the search index exists (and create it if it doesn't) with the correct vector configuration.
2.  Upload the prepared documents (text chunks and their embeddings) to the index.
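
For larger ingests, it can help to upload in batches rather than in one request, since Azure's indexing API limits how many documents a single request may contain (1,000 per batch is the commonly documented figure; check the current service limits). The sketch below shows a generic batching helper. Whether `upload_documents` already batches internally is not known here, so treat this as an optional pattern; `batched` is a hypothetical helper.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy documents standing in for the prepared `documents_to_upload` list.
docs = [{"id": f"imdb_{n}_chunk_0"} for n in range(7)]

for batch_number, batch in enumerate(batched(docs, batch_size=3), start=1):
    # In the notebook, this is where a call such as
    # upload_documents(documents=batch, index_name=AZURE_SEARCH_INDEX_NAME) would go.
    print(f"Batch {batch_number}: {len(batch)} documents")
```

Batching also makes failures easier to localize: if one request fails, only that batch needs to be retried.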

In [None]:
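# Ensure necessary functions and variables are available
# AZURE_SEARCH_INDEX_NAME and EMBEDDING_DIMENSIONS were already imported in the data preparation cell (Section 3);
# they are re-imported here, together with the search client functions, so this step is self-contained.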
try:
    from app.search_client import create_vector_index_if_not_exists, upload_documents, AZURE_SEARCH_INDEX_NAME
    from app.ingest import EMBEDDING_DIMENSIONS # Or define/load EMBEDDING_DIMENSIONS if not already
    print("Search client functions and variables are loaded.")
except ImportError as e:
    print(f"Error importing for search client operations: {e}")
    print("Continuing with placeholder fallbacks; ingestion will not work until the imports succeed.")
    # Fallback definitions to prevent NameError, though functionality will be broken
    def create_vector_index_if_not_exists(**kwargs): print("Error: create_vector_index_if_not_exists not loaded")
    def upload_documents(docs, **kwargs): print("Error: upload_documents not loaded")
    AZURE_SEARCH_INDEX_NAME = "rag-vector-index" # Default
    EMBEDDING_DIMENSIONS = 1536 # Default

if 'documents_to_upload' not in globals() or not documents_to_upload:
    print("Error: `documents_to_upload` is empty or not defined. Please ensure the previous data preparation step ran successfully and produced documents.")
    print("Skipping ingestion.")
else:
    print(f"Starting data ingestion into Azure Cognitive Search index: '{AZURE_SEARCH_INDEX_NAME}'")
    
    # 1. Create vector index if it doesn't exist
    # This uses the EMBEDDING_DIMENSIONS defined in app.ingest (or imported) 
    # and AZURE_SEARCH_INDEX_NAME from app.search_client.
    print(f"Ensuring index '{AZURE_SEARCH_INDEX_NAME}' exists with vector dimensions {EMBEDDING_DIMENSIONS}...")
    try:
        create_vector_index_if_not_exists(index_name=AZURE_SEARCH_INDEX_NAME, vector_dimensions=EMBEDDING_DIMENSIONS)
        print(f"Index '{AZURE_SEARCH_INDEX_NAME}' is ready.")
    except Exception as e:
        print(f"An error occurred while trying to create or verify the index: {e}")
        print("Please check your Azure Search service and configurations.")
        # Depending on the error, you might want to stop here.
        # For now, we'll attempt to upload anyway, but it might fail.

    # 2. Upload the documents
    print(f"Uploading {len(documents_to_upload)} documents to the index...")
    try:
        upload_documents(documents=documents_to_upload, index_name=AZURE_SEARCH_INDEX_NAME)
        # The upload_documents function in search_client.py should print success/failure counts.
        print("Document upload process initiated.")
        print("Check the output from 'upload_documents' above for details on success/failures.")
    except Exception as e:
        print(f"An error occurred during document upload: {e}")
        print("Please ensure your Azure Search service is running and configured correctly, and that the documents are in the correct format.")

    print("\nData ingestion step complete.")
    print(f"The documents should now be indexed in '{AZURE_SEARCH_INDEX_NAME}'.")
    print("It might take a few moments for the indexing process to complete on Azure's side before the data is fully searchable.")

## 5. Demonstrate RAG Pipeline

Now that the IMDb data is ingested into Azure Cognitive Search, we can use the RAG pipeline to ask questions and retrieve information. The `run_rag_pipeline` function from `app.rag_pipeline` will:
1.  Take your query.
2.  Generate an embedding for the query.
3.  Search the Azure Cognitive Search index for relevant document chunks.
4.  Construct a prompt using these retrieved chunks and your original query.
5.  Send the prompt to an Azure OpenAI model to generate an answer.
6.  Cache the response so that a repeated query can be served from the cache.
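
The steps above can be sketched end to end with toy components. This is a simplified illustration, not the code in `app.rag_pipeline`: a bag-of-words vector stands in for the embedding model, an in-memory list stands in for the search index, and a placeholder function stands in for the Azure OpenAI completion call. All names here (`toy_embed`, `retrieve`, `build_prompt`, `toy_rag`, `fake_llm`) are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Rank chunks by similarity to the query and keep the best `top_k`."""
    query_vector = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, context_chunks):
    """Combine the retrieved chunks and the user query into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


_cache = {}  # maps query text to a previously generated answer


def toy_rag(query, chunks, generate):
    """Run retrieve -> prompt -> generate, reusing cached answers for repeated queries."""
    if query in _cache:
        return "[cached] " + _cache[query]
    answer = generate(build_prompt(query, retrieve(query, chunks)))
    _cache[query] = answer
    return "[generated] " + answer


def fake_llm(prompt):
    """Placeholder for the completion model call."""
    return f"(model answer based on a {len(prompt)}-character prompt)"


chunks = [
    "the action sequences were thrilling",
    "a boring and slow plot",
    "great acting and a moving story",
]
print(toy_rag("boring plot", chunks, fake_llm))
print(toy_rag("boring plot", chunks, fake_llm))  # second call is served from the cache
```

The real pipeline differs in every component, but the control flow is the same: a cache lookup first, then embedding, retrieval, prompt construction, and generation.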

Let's try some example queries!

In [None]:
# Ensure necessary function is available
try:
    from app.rag_pipeline import run_rag_pipeline
    print("RAG pipeline function loaded.")
except ImportError as e:
    print(f"Error importing RAG pipeline function: {e}")
    print("Continuing with a placeholder; queries will not be answered until 'app.rag_pipeline' is accessible.")
    def run_rag_pipeline(query): 
        print(f"Error: run_rag_pipeline not loaded. Query was: {query}")
        return "RAG pipeline is not available."

# Example Queries
# Note: The quality of answers depends heavily on the ingested data (we only did a small sample),
# the underlying LLM, and the effectiveness of the retrieval.
# With only 50 reviews processed, the context might be limited.

queries = [
    "What are people saying about movies with positive reviews?",
    "Are there any mentions of 'action sequences' in the reviews?",
    "Tell me about a film that someone found 'boring'.",
    # Add a query that might hit cache if run twice
    "What is a common positive sentiment expressed in reviews?" 
]

if 'run_rag_pipeline' in locals() and callable(run_rag_pipeline):
    for query in queries:
        print(f"\n--- Querying RAG pipeline ---")
        print(f"User Query: {query}")
        
        response = run_rag_pipeline(query) # This function handles printing from the app logic too.
                                           # The function already prefixes with [CAG - cached] or [RAG - generated]
        print(f"Response:\n{response}")

    # Example of running a query again to demonstrate caching
    if len(queries) > 0:
        cached_query_example = queries[-1] # Use the last query from the list
        print(f"\n--- Querying RAG pipeline AGAIN (testing cache) ---")
        print(f"User Query: {cached_query_example}")
        response = run_rag_pipeline(cached_query_example)
        print(f"Response:\n{response}")
        print("Note: If the response above starts with '[CAG - cached]', it means the result was successfully retrieved from the cache.")

else:
    print("Cannot run RAG pipeline demonstrations as the function is not loaded.")

print("\nRAG pipeline demonstration complete.")

### Try Your Own Queries!

Feel free to modify the `queries` list in the cell above or add new cells to experiment with your own questions about the IMDb reviews you've ingested. Remember that the scope of retrievable information is limited to the subset of data processed and ingested in the earlier steps.

## 6. Conclusion and Next Steps

Congratulations! You've successfully walked through the process of using the existing RAG pipeline with a new dataset (IMDb reviews).

**In this notebook, we have:**
1.  Set up the environment and understood the prerequisites.
2.  Loaded a sample of the IMDb dataset.
3.  Prepared the data by chunking text and generating embeddings using Azure OpenAI.
4.  Ingested the processed data into an Azure Cognitive Search index.
5.  Demonstrated how to query the data using the RAG pipeline and observed its components, including caching.

**Next Steps:**

This notebook provides a foundation. You can further explore and expand upon this by:
*   **Experimenting with More Queries**: Try different and more complex questions to test the limits of the RAG system with the current data.
*   **Ingesting More Data**:
    *   Increase the `sample_size_for_ingestion` in the data preparation cell (Section 3) to include more IMDb reviews.
    *   Adapt the ingestion process for other text datasets you might have.
*   **Exploring `app` Components**: Dive deeper into the Python scripts in the `app/` directory:
    *   `app/openai_client.py`: See how Azure OpenAI services are called for embeddings and completions.
    *   `app/search_client.py`: Understand the interaction with Azure Cognitive Search (index creation, data upload, vector search).
    *   `app/ingest.py`: Review the data preparation and ingestion logic, which you can adapt for other data sources (e.g., different file types, databases).
    *   `app/rag_pipeline.py`: Analyze the core RAG orchestration logic.
*   **Modifying Parameters**:
    *   Adjust `chunk_size` and `overlap` in `app.ingest.chunk_text` (or in the data preparation cell in Section 3 when calling it) to see how they affect retrieval.
    *   If you have access to different Azure OpenAI models, try changing `AZURE_OPENAI_EMBEDDING_MODEL` or `AZURE_OPENAI_COMPLETION_MODEL` in your `.env` file (and update `EMBEDDING_DIMENSIONS` in `app/ingest.py` and the notebook if your embedding model's output dimension changes).
*   **Evaluating Performance**: For more advanced use cases, you might look into RAG evaluation frameworks (like RAGAs, for which logging is already partially implemented in `app.rag_pipeline`) to systematically assess the quality of the retrieved context and generated answers.
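
On the embedding-dimension point in the list above: a small lookup can catch a mismatch between the configured model and `EMBEDDING_DIMENSIONS` before any documents are embedded. The sketch below lists the default output dimensions of a few OpenAI embedding models. It assumes `AZURE_OPENAI_EMBEDDING_MODEL` holds a model name; if it holds an Azure deployment name instead, the lookup will not recognize it. `dimensions_match` is a hypothetical helper.

```python
# Default output dimensions for a few OpenAI embedding models.
KNOWN_EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def dimensions_match(model_name, configured_dimensions):
    """Compare the configured dimension with the known default for the model.

    Returns True or False for known models, and None when the model name
    is not in the lookup table (nothing to compare against).
    """
    expected = KNOWN_EMBEDDING_DIMENSIONS.get(model_name)
    if expected is None:
        return None
    return expected == configured_dimensions


print(dimensions_match("text-embedding-ada-002", 1536))
print(dimensions_match("text-embedding-3-large", 1536))
print(dimensions_match("my-custom-deployment", 1536))
```

A `False` result means the index's vector field and the value in `app/ingest.py` would need to be updated, and an existing index would likely need to be recreated with the new dimension.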

Happy exploring!