# Chatbot via Data Retrieval with LangChain

## RAG Workflow Overview
```
Documents → Loading → Splitting → Embeddings → Vector Store → Retrieval → LLM → Response
```

## Structure:
1. **Document Loading** - Ingest from PDFs, web, YouTube, Notion
2. **Document Splitting** - Chunk documents with overlap and metadata preservation
3. **Vector Stores and Embeddings** - Convert text to vectors, store in Chroma
4. **Advanced Retrieval** - MMR, metadata filtering, compression, self-query
5. **Question Answering** - RetrievalQA with custom prompts and chain types
6. **Conversational Chat** - Memory-enabled chatbot with GUI

Each section includes multiple techniques and addresses common failure modes.

In [5]:
# Install all required packages
!pip install langchain langchain-community langchain-openai langchain-chroma \
             langchain-huggingface langchain-aws langchain-text-splitters \
             beautifulsoup4 chromadb sentence-transformers pypdf yt-dlp pydub \
             panel param docarray tiktoken lark-parser



## Setup and Configuration

Configure Azure OpenAI credentials for the chat model, and AWS and Hugging Face credentials for the embedding providers.

In [6]:
import os
import datetime
from google.colab import userdata
from langchain_openai import AzureChatOpenAI
import numpy as np

# Set credentials for Azure OpenAI, AWS Bedrock, and Hugging Face (read from Colab secrets)
os.environ["AZURE_OPENAI_API_KEY"] = userdata.get('eduhkkey')
os.environ['AWS_ACCESS_KEY_ID'] = userdata.get('awsid')
os.environ['AWS_SECRET_ACCESS_KEY'] = userdata.get('awssecret')
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'
os.environ['HF_TOKEN'] = userdata.get('hugging')

# Configure LLM
llm = AzureChatOpenAI(
    azure_endpoint="https://aai02.eduhk.hk/openai/deployments/gpt-4o-mini/chat/completions?Hello=",
    api_version="2024-02-15-preview",
    deployment_name="gpt-4o-mini",
    temperature=0,
    streaming=False,
)

print(f"LLM configured: {llm.deployment_name}")
print(f"Base URL: {llm.client._client._base_url}")

LLM configured: gpt-4o-mini
Base URL: https://aai02.eduhk.hk/openai/deployments/gpt-4o-mini/chat/completions?Hello=/openai/deployments/gpt-4o-mini/


## 1. Document Loading

### Comprehensive Loading from Multiple Sources

Document loading is the first step in RAG, involving:
- Converting raw data to Document objects
- Preserving metadata (source, page numbers, etc.)
- Handling multiple formats and sources

In [10]:
from langchain_community.document_loaders import (
    WebBaseLoader
)
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
import bs4

# 1.1 Web Loading with HTML filtering
# Define the SoupStrainer to target <p> tags
bs4_strainer = bs4.SoupStrainer("p")

# Initialize the WebBaseLoader with the URL and SoupStrainer
web_loader = WebBaseLoader(
    web_paths=("https://enoch-sit.github.io/Blog/docs/12CCNAStory/storyline",),
    bs_kwargs={"parse_only": bs4_strainer},
)
web_docs = web_loader.load()

print(f"Loaded {len(web_docs)} web documents")
print(f"First 200 chars: {web_docs[0].page_content[:200]}")
print(f"Metadata: {web_docs[0].metadata}")

Loaded 1 web documents
First 200 chars: 網道劍影錄序章：風雨前夕
（原文照錄：交代故事背景，引出互聯山莊、黑風堡、網道真解以及最初的江湖形勢。）卷一：初窺門徑 – 基礎心法第一章：萬物有靈 – 識器 (Day 1: Networking Devices - 網絡設備)
葉辰，互聯山莊一名尋常弟子，得隱世長老風明暗中指點，開始了他的修行之路。風明首先向他介紹了網絡中的“靈物”：「Router導引訣」、「Switch分流指」、「Firewa
Metadata: {'source': 'https://enoch-sit.github.io/Blog/docs/12CCNAStory/storyline'}


# Flowise counterpart: Cheerio

![Cheerio](https://raw.githubusercontent.com/enoch-sit/publicimages/refs/heads/main/f02_Cheerio_Web_Scraper.png)

# Other document loaders
LangChain provides loaders for many other sources; see https://python.langchain.com/docs/integrations/document_loaders/

## 2. Document Splitting

### Advanced Text Splitting Techniques

Effective text splitting is essential for:
- Fitting text into large language model (LLM) context windows
- Retaining semantic meaning
- Preserving document structure and metadata

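This guide explores `RecursiveCharacterTextSplitter`, `MarkdownTextSplitter`, and `MarkdownHeaderTextSplitter` from the `langchain_text_splitters` module, plus `HtmlToMarkdownTextSplitter`, which appears to be a Flowise text-splitter node rather than a class in that Python module. It explains the functionality and use cases of each.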

---

### Understanding `RecursiveCharacterTextSplitter`

The `RecursiveCharacterTextSplitter` breaks text into smaller, manageable chunks, ideal for processing long documents in applications like language models. It prioritizes maintaining context by splitting at natural boundaries.

- **How it Operates**: It uses a recursive method to divide text based on a set `chunk_size` (e.g., 100 characters or tokens). It starts with high-level separators like double newlines (`\n\n` for paragraphs) and falls back to lower-level ones (e.g., single newlines `\n`, spaces, or characters) if the chunk exceeds the limit.
- **What "Recursive" Means**: This term describes the hierarchical splitting process. Instead of cutting at a fixed point (which might split words), it tries higher-level breaks first, ensuring coherence. For instance, with a 100-character limit on a 500-character paragraph, it seeks `\n\n` first, then `\n` or spaces if needed.
- **Configurable Options**:
  - `chunk_size`: Maximum chunk length.
  - `chunk_overlap`: Overlap between chunks for context (e.g., 5 characters).
  - `separators`: Custom list (default: `["\n\n", "\n", " ", ""]`).
  - `length_function`: Measures size (e.g., `len` for characters).
- **Example**:
  ```python
  from langchain_text_splitters import RecursiveCharacterTextSplitter

  text = """This is a sample text.\n\nIt has multiple paragraphs.\nEach paragraph has sentences.\n\nThis is another paragraph."""
  splitter = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=5, separators=["\n\n", "\n", " ", ""])
  chunks = splitter.split_text(text)
  for i, chunk in enumerate(chunks):
      print(f"Chunk {i+1}: {chunk}")
  ```
  **Output** (approximate):
  ```
  Chunk 1: This is a sample text.
  Chunk 2: It has multiple paragraphs.
  Chunk 3: Each paragraph has sentences.
  Chunk 4: This is another paragraph.
  ```
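  Here, splits occur at `\n\n` and `\n`, and every chunk stays within the 30-character limit. No overlap is visible in this output because overlap is built from whole pieces, and each piece here is longer than the 5-character `chunk_overlap`.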
- **Benefits**: Preserves context, flexible for various formats, and customizable.
- **Best Use**: Ideal for long documents needing semantic splits, outperforming rigid splitters like `CharacterTextSplitter`.
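
The recursive fallback described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap is omitted, and separators are dropped at cut points.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting (illustrative only; no chunk overlap)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    # Last resort: cut at fixed character positions.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = ("This is a sample text.\n\nIt has multiple paragraphs.\n"
        "Each paragraph has sentences.\n\nThis is another paragraph.")
chunks = recursive_split(text, chunk_size=30)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: {chunk}")
```

The middle paragraph is longer than 30 characters, so it is the only one that falls back from `\n\n` to `\n`.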

---

### Exploring `HtmlToMarkdownTextSplitter`

The `HtmlToMarkdownTextSplitter` transforms HTML text into Markdown and splits it into chunks, perfect for web content processing.

- **How it Works**: It converts HTML tags (e.g., `<p>` to plain text, `<h1>` to `#`) using a library like `html2text`, then applies splitting based on `chunk_size` and separators.
- **Purpose**: Suited for scraping web pages, preserving structure (e.g., headers, lists) in Markdown.
- **Key Feature**: Maintains semantic elements in the converted text, with splitting handled similarly to `RecursiveCharacterTextSplitter`.
- **Headers and Metadata**: Headers (e.g., `#` from `<h1>`) remain in the content, with no automatic metadata extraction for headers based on available documentation.

---

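### Understanding `MarkdownTextSplitter` and `MarkdownHeaderTextSplitter`

LangChain has two Markdown-oriented splitters that are easy to confuse: `MarkdownTextSplitter` splits by size at Markdown boundaries, while `MarkdownHeaderTextSplitter` splits by header and records the headers as metadata.

- **How it Works**: `MarkdownTextSplitter` behaves like a `RecursiveCharacterTextSplitter` whose separators are Markdown boundaries (for example, header lines and code fences), with configurable `chunk_size` and `chunk_overlap`. `MarkdownHeaderTextSplitter` instead cuts at the headers listed in `headers_to_split_on` and groups the content under each header into its own chunk.
- **Purpose**: Great for processing Markdown files (e.g., docs, notes) while retaining hierarchy.
- **Key Feature**: `MarkdownHeaderTextSplitter` extracts headers into metadata for each chunk, enhancing downstream use. `MarkdownTextSplitter` leaves headers in the chunk text and adds no header metadata.
- **Evidence**: In the `MarkdownHeaderTextSplitter` sample run later in this notebook, the output shows:
  - Chunk 0: `Artificial Intelligence (AI) is transforming...`, Metadata: `{'Header 1': 'Introduction to AI'}`
  - This confirms headers are stored in `doc.metadata`.
- **Headers and Metadata**: Header metadata comes from `MarkdownHeaderTextSplitter`, not from `MarkdownTextSplitter`; the feature is designed to preserve document structure.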

---

### What is Markdown? How is HTML Converted to Markdown?

- **What is Markdown?**: A lightweight markup language using syntax like `#` for headers, `*` for bullets, and `[link](url)` for hyperlinks, convertible to HTML or other formats.
- **HTML to Markdown Conversion**: Tools like `HtmlToMarkdownTextSplitter` parse HTML (e.g., `<h1>Header</h1>` to `# Header`, `<a href="url">Link</a>` to `[Link](url)`) using libraries such as `html2text`.
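
To make the tag-to-syntax mapping concrete, here is a minimal sketch built on Python's standard `html.parser`. It is not how `html2text` or any splitter is implemented; the class `TinyHtmlToMarkdown` and the function `html_to_markdown` are hypothetical names, and only `<h1>` to `<h3>`, `<p>`, and `<a>` are handled.

```python
from html.parser import HTMLParser


class TinyHtmlToMarkdown(HTMLParser):
    """Illustrative converter for a handful of tags (not a full HTML-to-Markdown tool)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # <h1> -> "# ", <h2> -> "## ", <h3> -> "### "
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            self.parts.append("\n\n")    # block elements end with a blank line
        elif tag == "a":
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html):
    parser = TinyHtmlToMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


markdown = html_to_markdown('<h1>Header</h1><p>See <a href="url">Link</a> here.</p>')
print(markdown)
```

A real converter also has to handle lists, emphasis, nested tags, and escaping, which is why dedicated libraries are normally used.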

---

### Do Splitters Put Headers in Metadata?

- **General Rule**: Most splitters focus on content division, not metadata.
- **Specifics**:
  - **`HtmlToMarkdownTextSplitter`**: Headers stay in content (e.g., `# Header`), no metadata extraction for headers.
  - **`MarkdownTextSplitter`**: Headers stay in content; it splits at Markdown boundaries but adds no header metadata.
  - **`MarkdownHeaderTextSplitter`**: Extracts headers into metadata (e.g., `{'Header 1': '...'}`), as the sample output in section 2.4 shows.
  - **`RecursiveCharacterTextSplitter`**: No header metadata, only content chunks; existing document metadata is carried over, and `add_start_index=True` adds a `start_index` field.
- **Clarification**: Metadata handling depends on the splitter's design; of the splitters covered here, only `MarkdownHeaderTextSplitter` extracts header metadata.
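
The idea of carrying headers as metadata can be sketched without any library. The function `split_by_headers` below is a hypothetical, simplified illustration (it ignores code fences and other edge cases) and is not LangChain's implementation.

```python
def split_by_headers(markdown, headers=(("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3"))):
    """Split Markdown at header lines and attach the active headers as metadata."""
    names = dict(headers)
    active = {}               # header name -> current title
    body, sections = [], []

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"page_content": content, "metadata": dict(active)})
        body.clear()

    for line in markdown.splitlines():
        marker, _, title = line.partition(" ")
        if marker in names and title.strip():
            flush()
            # A new header replaces any header at the same or a deeper level.
            for m, name in headers:
                if len(m) >= len(marker):
                    active.pop(name, None)
            active[names[marker]] = title.strip()
        else:
            body.append(line)
    flush()
    return sections


sample = """# Introduction to AI

Artificial Intelligence (AI) is transforming our world.

## Machine Learning

Machine learning enables computers to learn without explicit programming.

### Supervised Learning

Supervised learning uses labeled training data to make predictions.
"""
sections = split_by_headers(sample)
for section in sections:
    print(section["metadata"], "->", section["page_content"][:40])
```

Each chunk keeps the full header path, so a retriever can later filter or display results by section.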

---

### When to Use Each Splitter

- **`RecursiveCharacterTextSplitter`**: For general text needing coherent splits.
- **`HtmlToMarkdownTextSplitter`**: For HTML-to-Markdown conversion and splitting.
- **`MarkdownHeaderTextSplitter`**: For Markdown files requiring header-based splits and metadata; use `MarkdownTextSplitter` when only size-based chunks at Markdown boundaries are needed.


---

In [11]:
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    MarkdownHeaderTextSplitter
)

# 2.1 Recursive Character Text Splitter (Recommended)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,  # Adds start index metadata
    separators=["\n\n", "\n", ". ", " ", ""]  # Hierarchical separators
)

# 2.2 Token-based splitter for precise token control
token_splitter = TokenTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# 2.3 Structure-aware markdown splitter
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
)

# Split documents using recursive splitter
splits = recursive_splitter.split_documents(web_docs)

print(f"Original documents: {len(web_docs)}")
print(f"Split chunks: {len(splits)}")
print(f"Sample chunk metadata: {splits[0].metadata}")
print(f"Sample chunk content: {splits[0].page_content[:200]}...")

Original documents: 1
Split chunks: 13
Sample chunk metadata: {'source': 'https://enoch-sit.github.io/Blog/docs/12CCNAStory/storyline', 'start_index': 0}
Sample chunk content: 網道劍影錄序章：風雨前夕
（原文照錄：交代故事背景，引出互聯山莊、黑風堡、網道真解以及最初的江湖形勢。）卷一：初窺門徑 – 基礎心法第一章：萬物有靈 – 識器 (Day 1: Networking Devices - 網絡設備)
葉辰，互聯山莊一名尋常弟子，得隱世長老風明暗中指點，開始了他的修行之路。風明首先向他介紹了網絡中的“靈物”：「Router導引訣」、「Switch分流指」、「Firewa...


In [12]:
# 2.4 Demonstrate different splitting strategies
sample_text = """# Introduction to AI

Artificial Intelligence (AI) is transforming our world. It encompasses machine learning, deep learning, and natural language processing.

## Machine Learning

Machine learning enables computers to learn without explicit programming. Key algorithms include linear regression, decision trees, and neural networks.

### Supervised Learning

Supervised learning uses labeled training data to make predictions.
"""

# Compare splitting methods
char_splits = CharacterTextSplitter(chunk_size=100, chunk_overlap=20).split_text(sample_text)
recursive_splits = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20).split_text(sample_text)
markdown_splits = markdown_splitter.split_text(sample_text)

print("Character Splitter Results:")
for i, chunk in enumerate(char_splits):
    print(f"Chunk {i}: {chunk[:50]}...")

print("\nMarkdown Header Splitter Results:")
for i, doc in enumerate(markdown_splits):
    print(f"Chunk {i}: {doc.page_content[:50]}...")
    print(f"Metadata: {doc.metadata}")



Character Splitter Results:
Chunk 0: # Introduction to AI...
Chunk 1: Artificial Intelligence (AI) is transforming our w...
Chunk 2: ## Machine Learning...
Chunk 3: Machine learning enables computers to learn withou...
Chunk 4: ### Supervised Learning

Supervised learning uses ...

Markdown Header Splitter Results:
Chunk 0: Artificial Intelligence (AI) is transforming our w...
Metadata: {'Header 1': 'Introduction to AI'}
Chunk 1: Machine learning enables computers to learn withou...
Metadata: {'Header 1': 'Introduction to AI', 'Header 2': 'Machine Learning'}
Chunk 2: Supervised learning uses labeled training data to ...
Metadata: {'Header 1': 'Introduction to AI', 'Header 2': 'Machine Learning', 'Header 3': 'Supervised Learning'}


## 3. Vector Stores and Embeddings

### Multiple Embedding Providers and Vector Storage

Embeddings convert text to vectors that capture semantic meaning. Different providers offer various capabilities and pricing models.

In [13]:
from langchain_aws import BedrockEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document

# 3.1 Multiple Embedding Options

# AWS Bedrock Embeddings (requires AWS setup)
try:
    bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
    print("AWS Bedrock embeddings configured")
except Exception as e:
    print(f"AWS Bedrock not available: {e}")
    bedrock_embeddings = None

# Hugging Face Embeddings (free, local)
hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

print("Hugging Face embeddings loaded")

# 3.2 Demonstrate embedding similarity
sentences = [
    "I love machine learning and AI",
    "Artificial intelligence and ML are fascinating",
    "The weather is beautiful today"
]

embeddings_list = [hf_embeddings.embed_query(sent) for sent in sentences]

# Dot product equals cosine similarity here because the embeddings are normalized
similarity_1_2 = np.dot(embeddings_list[0], embeddings_list[1])
similarity_1_3 = np.dot(embeddings_list[0], embeddings_list[2])

print(f"\nSimilarity between sentences 1 and 2 (related): {similarity_1_2:.4f}")
print(f"Similarity between sentences 1 and 3 (unrelated): {similarity_1_3:.4f}")

AWS Bedrock embeddings configured


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Hugging Face embeddings loaded

Similarity between sentences 1 and 2 (related): 0.7240
Similarity between sentences 1 and 3 (unrelated): 0.0977
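
The cell above scores similarity with a plain dot product. That matches cosine similarity only because `normalize_embeddings=True` produces unit-length vectors. A small sketch with hand-made 2-D vectors (hypothetical values, not real embeddings) shows the relationship:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]          # toy vectors, chosen for easy arithmetic
unit_a, unit_b = normalize(a), normalize(b)
dot_of_units = sum(x * y for x, y in zip(unit_a, unit_b))

print(f"cosine(a, b)        = {cosine_similarity(a, b):.2f}")
print(f"dot(unit_a, unit_b) = {dot_of_units:.2f}")
```

If the embeddings were not normalized, the raw dot product would also depend on vector length, so an explicit cosine computation would be the safer choice.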


In [14]:
# 3.3 Create Vector Store with Chroma
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=bedrock_embeddings,
    persist_directory="./chroma_db"
)

print(f"Vector store created with {vectorstore._collection.count()} documents")

# 3.4 Add documents with rich metadata
enhanced_docs = [
    Document(
        page_content="Machine learning is a subset of AI focused on algorithms that learn from data.",
        metadata={"topic": "machine_learning", "difficulty": "beginner", "year": 2024}
    ),
    Document(
        page_content="Deep learning uses neural networks with multiple layers to model complex patterns.",
        metadata={"topic": "deep_learning", "difficulty": "advanced", "year": 2024}
    ),
    Document(
        page_content="Natural language processing enables computers to understand human language.",
        metadata={"topic": "nlp", "difficulty": "intermediate", "year": 2024}
    )
]

vectorstore.add_documents(enhanced_docs)
print(f"Added {len(enhanced_docs)} documents with enhanced metadata")

# 3.5 Basic similarity search
query = "What is artificial intelligence?"
basic_results = vectorstore.similarity_search(query, k=3)

print(f"\nBasic similarity search for: '{query}'")
for i, doc in enumerate(basic_results):
    print(f"Result {i+1}: {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")
    print()

Vector store created with 13 documents
Added 3 documents with enhanced metadata

Basic similarity search for: 'What is artificial intelligence?'
Result 1: Machine learning is a subset of AI focused on algorithms that learn from data....
Metadata: {'year': 2024, 'difficulty': 'beginner', 'topic': 'machine_learning'}

Result 2: Natural language processing enables computers to understand human language....
Metadata: {'year': 2024, 'difficulty': 'intermediate', 'topic': 'nlp'}

Result 3: Deep learning uses neural networks with multiple layers to model complex patterns....
Metadata: {'topic': 'deep_learning', 'year': 2024, 'difficulty': 'advanced'}



## 4. Advanced Retrieval Techniques

### Addressing Common Retrieval Problems

Basic similarity search has limitations:
- **Diversity**: Results may be too similar
- **Specificity**: Metadata filtering needed
- **Relevance**: Context compression required
- **Query Understanding**: Self-query for complex requests

The following simple, concrete example shows how MMR (Maximal Marginal Relevance) works in an embedding search, step by step.

### Scenario

Imagine you have a small database of four documents, and you've already converted them and your query into vector embeddings. The goal is to retrieve the top 2 documents that are both relevant to the query and diverse from each other.

**Documents (with hypothetical topics):**
* **$D_1$**: "A guide to growing and caring for roses."
* **$D_2$**: "The history of the rose flower."
* **$D_3$**: "Gardening tips for beginners."
* **$D_4$**: "How to grow vegetables in a small garden."

**Search Query ($Q$):** "gardening tips"

**Pre-computation Step:**
Before we start, we need to calculate the vector similarity scores for all pairs. We'll use cosine similarity, where a score of 1 means the vectors point in the same direction and 0 means they are unrelated.

* **Relevance Scores (Query vs. Documents):** $Sim_1(D_i, Q)$
    * $Sim_1(D_1, Q)$: 0.8 (High relevance: talks about gardening, specifically roses)
    * $Sim_1(D_2, Q)$: 0.1 (Low relevance: history is not "tips")
    * $Sim_1(D_3, Q)$: 0.9 (Very high relevance: exactly "gardening tips")
    * $Sim_1(D_4, Q)$: 0.85 (High relevance: also about gardening tips)

* **Redundancy Scores (Document vs. Document):** $Sim_2(D_i, D_j)$
    * $Sim_2(D_1, D_2)$: 0.7 (High similarity: both about roses)
    * $Sim_2(D_1, D_3)$: 0.6 (Some similarity: both about gardening)
    * $Sim_2(D_1, D_4)$: 0.3 (Low similarity: one is about roses, the other is vegetables)
    * $Sim_2(D_2, D_3)$: 0.2 (Low similarity: rose history versus general tips)
    * $Sim_2(D_3, D_4)$: 0.75 (High similarity: both are general gardening tips)
    * ... (and so on for all pairs)

We'll set our diversity parameter $\lambda$ to **0.5**, giving equal weight to relevance and diversity.

***

### The MMR Search Process (Step-by-Step)

#### **Step 1: The Initial Selection**

The first document selected by MMR is always the one with the highest pure relevance score. The result set $S$ is still empty, so the diversity term is zero and only relevance matters:

* $Sim_1(D_1, Q) = 0.8$
* $Sim_1(D_2, Q) = 0.1$
* $Sim_1(D_3, Q) = 0.9$
* $Sim_1(D_4, Q) = 0.85$

The highest relevance score is for **$D_3$ (0.9)**.

* **Result Set ($S$)**: {$D_3$}
* **Candidate Set ($U$)**: {$D_1$, $D_2$, $D_4$}

#### **Step 2: The Second Selection**

Now we apply the full MMR formula to the remaining candidates to find the next document. We need to calculate the MMR score for $D_1$, $D_2$, and $D_4$. The formula is:

$$MMR(D_i) = \lambda * Sim_1(D_i, Q) - (1-\lambda) * \max_{D_j \in S} Sim_2(D_i, D_j)$$

**Calculation for $D_1$:**
* Relevance Term: $\lambda * Sim_1(D_1, Q) = 0.5 * 0.8 = 0.4$
* Diversity Term: $(1-\lambda) * \max_{D_j \in S} Sim_2(D_1, D_j) = (1-0.5) * Sim_2(D_1, D_3) = 0.5 * 0.6 = 0.3$
* $MMR(D_1) = 0.4 - 0.3 = \mathbf{0.1}$

**Calculation for $D_2$:**
* Relevance Term: $\lambda * Sim_1(D_2, Q) = 0.5 * 0.1 = 0.05$
* Diversity Term: $(1-\lambda) * \max_{D_j \in S} Sim_2(D_2, D_j) = (1-0.5) * Sim_2(D_2, D_3) = 0.5 * 0.2 = 0.1$
* $MMR(D_2) = 0.05 - 0.1 = \mathbf{-0.05}$

**Calculation for $D_4$:**
* Relevance Term: $\lambda * Sim_1(D_4, Q) = 0.5 * 0.85 = 0.425$
* Diversity Term: $(1-\lambda) * \max_{D_j \in S} Sim_2(D_4, D_j) = (1-0.5) * Sim_2(D_4, D_3) = 0.5 * 0.75 = 0.375$
* $MMR(D_4) = 0.425 - 0.375 = \mathbf{0.05}$

**Comparison of MMR Scores:**
* $MMR(D_1) = 0.1$
* $MMR(D_2) = -0.05$
* $MMR(D_4) = 0.05$

The highest MMR score is for **$D_1$ (0.1)**.

* **Final Result Set ($S$)**: {$D_3$, $D_1$}

***

### Analysis of the Results

* **Without MMR (pure relevance)**, the top 2 results would have been **$D_3$ (0.9)** and **$D_4$ (0.85)**. Both are very similar and talk about general gardening tips. The result set is relevant, but redundant.
* **With MMR**, the top 2 results are **$D_3$** and **$D_1$**.
    * $D_3$ is the most relevant.
    * $D_1$ is also highly relevant but is less similar to $D_3$ than $D_4$ is. $D_1$ introduces a new specific topic (roses) that is still relevant to the general query.

This simple example shows how MMR successfully avoids the redundancy problem. It takes a slightly less-relevant document ($D_1$ at 0.8 relevance) over a highly-relevant but redundant one ($D_4$ at 0.85 relevance) to provide a more diverse and informative result set.
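
The walkthrough above can be reproduced with a few lines of Python. This is a minimal sketch of greedy MMR selection using the example's scores; the function name `mmr_select` and the dictionary layout are illustrative choices, not LangChain's internals.

```python
def mmr_select(relevance, pair_similarity, k, lam=0.5):
    """Greedy MMR: repeatedly pick the candidate with the best relevance/redundancy trade-off."""
    selected = []
    candidates = list(relevance)

    def score(doc):
        # Redundancy is the highest similarity to anything already selected (0 if none yet).
        redundancy = max(
            (pair_similarity[frozenset((doc, s))] for s in selected),
            default=0.0,
        )
        return lam * relevance[doc] - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Scores taken from the worked example. The D2-D4 pair is not given there,
# so this table only supports k <= 3.
relevance = {"D1": 0.8, "D2": 0.1, "D3": 0.9, "D4": 0.85}
pair_similarity = {
    frozenset(("D1", "D2")): 0.7,
    frozenset(("D1", "D3")): 0.6,
    frozenset(("D1", "D4")): 0.3,
    frozenset(("D2", "D3")): 0.2,
    frozenset(("D3", "D4")): 0.75,
}

top_by_relevance = sorted(relevance, key=relevance.get, reverse=True)[:2]
top_by_mmr = mmr_select(relevance, pair_similarity, k=2, lam=0.5)
print("Pure relevance:", top_by_relevance)
print("MMR:           ", top_by_mmr)
```

Raising `lam` toward 1 makes the selection behave like pure relevance ranking, while lowering it favors diversity.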

In [17]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# 4.1 Maximal Marginal Relevance (MMR) - Balances relevance and diversity
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 10, "lambda_mult": 0.5}
)

print("MMR Retrieval Results:")
mmr_results = mmr_retriever.invoke("machine learning algorithms")  # invoke() replaces the deprecated get_relevant_documents()
for i, doc in enumerate(mmr_results):
    print(f"MMR Result {i+1}: {doc.page_content[:80]}...")

# 4.2 Metadata Filtering
filtered_results = vectorstore.similarity_search(
    "learning algorithms",
    k=3,
    filter={"topic": "machine_learning"}
)

print("\nFiltered Results (topic=machine_learning):")
for i, doc in enumerate(filtered_results):
    print(f"Filtered Result {i+1}: {doc.page_content[:80]}...")
    print(f"Metadata: {doc.metadata}")


MMR Retrieval Results:
MMR Result 1: Machine learning is a subset of AI focused on algorithms that learn from data....
MMR Result 2: Deep learning uses neural networks with multiple layers to model complex pattern...
MMR Result 3: Natural language processing enables computers to understand human language....
MMR Result 4: 風明解釋說，對於龐大且不斷變化的網絡，手動定義每條路徑（靜態路由）是不切實際的。他引入了動態路由協議的概念——能夠自動適應的“活地圖”。第二十五章：古道遺蹤 –...
MMR Result 5: 標準ACL顯得過於粗略。葉辰學會創建擴展ACL，通過指定源、目標，甚至敵人使用的“武功類型”（協議和端口號）來實現更精細的控制。第三十六章：知己知彼 – 探查 ...

Filtered Results (topic=machine_learning):
Filtered Result 1: Machine learning is a subset of AI focused on algorithms that learn from data....
Metadata: {'year': 2024, 'topic': 'machine_learning', 'difficulty': 'beginner'}


## 5. Question Answering with RetrievalQA

### Multiple Chain Types and Custom Prompts

RetrievalQA combines document retrieval with LLM generation using different strategies:
- **Stuff**: Concatenate all retrieved documents into a single prompt (default); the LCEL chain in the next cell follows this approach
- **Map-Reduce**: Process documents separately, then combine
- **Refine**: Iteratively refine answer with each document
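
The difference between the three strategies is mostly in how many LLM calls are made and what each call sees. The sketch below uses a stand-in `CountingLLM` class (hypothetical, returns canned replies) instead of a real model, and the prompt wording is an assumption made for illustration.

```python
class CountingLLM:
    """Stand-in for a chat model: records each prompt and returns a canned reply."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return f"answer {len(self.prompts)}"


def stuff(question, docs, llm):
    # One call: all documents are concatenated into a single prompt.
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def map_reduce(question, docs, llm):
    # Map: one call per document. Reduce: one more call to combine the partial answers.
    partials = [llm(f"Context:\n{doc}\n\nQuestion: {question}") for doc in docs]
    combined = "\n".join(partials)
    return llm(f"Partial answers:\n{combined}\n\nCombine them to answer: {question}")


def refine(question, docs, llm):
    # Sequential: answer from the first document, then revise with each later one.
    # Assumes at least one document.
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context:\n{doc}\n\nRefine the answer to: {question}"
        )
    return answer


docs = ["doc about routers", "doc about switches", "doc about firewalls"]
for strategy in (stuff, map_reduce, refine):
    llm = CountingLLM()
    strategy("What is a router?", docs, llm)
    print(f"{strategy.__name__}: {len(llm.prompts)} LLM call(s)")
```

Stuff is cheapest but limited by the context window; map-reduce and refine trade extra calls for the ability to handle more documents.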

In [25]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import RetrievalQA
from langchain import hub

# 5.1 Basic RAG Chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Custom prompt template
custom_prompt = ChatPromptTemplate.from_template(
    """You are a helpful AI assistant specializing in machine learning and AI.

    Use the following context to answer the question. If you don't know the answer based on the context,
    say "I don't have enough information in the provided context to answer that question."

    Always cite which part of the context you used for your answer.

    Context: {context}

    Question: {question}

    Answer:"""
)

# Create RAG chain
rag_chain = (
    {"context": mmr_retriever | format_docs, "question": RunnablePassthrough()}
    | custom_prompt
    | llm
    | StrOutputParser()
)

# Test the chain
questions = [
    "What is machine learning?",
    "What is Deep learning?",
    "What is OSPF?"
]

print("RAG Chain Responses:")
for question in questions:
    try:
        response = rag_chain.invoke(question)
        print(f"\nQ: {question}")
        print(f"A: {response}")
        print("-" * 80)
    except Exception as e:
        print(f"Error processing question '{question}': {e}")

RAG Chain Responses:

Q: What is machine learning?
A: Machine learning is a subset of AI focused on algorithms that learn from data. This definition is taken from the context provided.
--------------------------------------------------------------------------------

Q: What is Deep learning?
A: Deep learning uses neural networks with multiple layers to model complex patterns. This information is derived from the context provided.
--------------------------------------------------------------------------------

Q: What is OSPF?
A: OSPF stands for Open Shortest Path First, which is a powerful and widely recognized routing technique. In the context, it is introduced as "OSPFv2群龍聚首訣" and is described in the sections related to Leaf Chen learning about OSPF's fundamental concepts, including areas, router IDs, and the process of establishing neighbor adjacencies (neighbor relationships) (Chapter 26). As he progresses, he also studies OSPF network types and the concept of DR/BDR (Designated R

## Summary and Best Practices

### Key Takeaways for Production RAG Systems

**Document Processing:**
- Use appropriate loaders for different formats
- Preserve metadata for filtering and traceability
- Choose chunk sizes based on your domain and use case

**Retrieval Strategy:**
- Start with basic similarity search, add MMR for diversity
- Use metadata filtering for domain-specific queries
- Consider compression for long documents
- Implement fallback retrievers for robustness

**Generation Quality:**
- Custom prompts improve response quality
- Test different chain types (stuff, map-reduce, refine)
- Add conversation memory for interactive applications
- Implement proper error handling

**Evaluation and Monitoring:**
- Track retrieval precision and recall
- Monitor response quality and user satisfaction
- Benchmark different approaches
- Log queries and responses for analysis
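
As a starting point for tracking retrieval quality, precision and recall at a cutoff `k` can be computed from a list of retrieved document IDs and a hand-labeled set of relevant IDs. The function name and the sample IDs below are hypothetical.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k and recall@k for a single query."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


retrieved = ["d3", "d1", "d7", "d2"]      # ranked retriever output (made-up IDs)
relevant = {"d1", "d2", "d5"}             # ground-truth labels for this query (made-up)

for k in (1, 3, 4):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

Averaging these values over a set of labeled queries gives a simple benchmark for comparing retrievers such as plain similarity search and MMR.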

**Scalability Considerations:**
- Use persistent vector stores for large datasets
- Consider distributed retrieval for high throughput
- Cache frequent queries
- Monitor latency and costs

This notebook covers the core stages of a RAG pipeline: loading, splitting, embedding, retrieval, and question answering. Experiment with different combinations of techniques based on your specific use case and requirements.