# Lesson 3.7: Practical Application - Building a Q&A System

---

In Module 3, we've moved from loading raw data to preparing it for LLM consumption, covering **Document Loaders**, **Text Splitters**, **Embeddings**, **Vector Stores**, and **Retrievers**. Now it's time to bring these pieces together to build a practical application: a **Question Answering (Q&A) system based on Retrieval-Augmented Generation (RAG)**.

This lesson will guide you step-by-step through building a complete RAG pipeline, from preparing your data to generating intelligent answers based on your own knowledge.

## 1. Objectives of this Practical Exercise

The main objectives of this practical exercise are to:
* Apply all the knowledge learned from Module 3 to a real-world project.
* Understand the data flow and how LangChain components interact within a RAG system.
* Build a Q&A system capable of answering questions based on a specific set of documents you provide.
* Test and evaluate the performance of the Q&A system.




---

## 2. Environment and Data Preparation

To begin, make sure you have installed the necessary libraries and set the API key for your LLM and embedding models. We will use OpenAI for both the LLM and the embeddings, with Chroma (ChromaDB) as the vector store.

**Required Libraries:**
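```bash
pip install langchain langchain-community langchain-openai langchain-text-splitters openai chromadb pypdf reportlab
```
* `langchain`: The core LangChain package.
* `langchain-community`: Community integrations, including `PyPDFLoader` and the `Chroma` vector store wrapper used below.
* `langchain-openai`: LangChain integration with OpenAI.
* `langchain-text-splitters`: Text splitters such as `RecursiveCharacterTextSplitter`.
* `openai`: Official OpenAI Python library.
* `chromadb`: Open-source vector database.
* `pypdf`: For reading PDF files.
* `reportlab`: (Optional) For creating sample PDF files if you don't have them readily available.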

**API Key Setup:**
```python
import os
# Set your API key in an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
```
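
A missing key usually only surfaces later as an authentication error from the API. As a minimal sketch, a small helper can fail fast instead. The name `require_api_key` is hypothetical and not part of LangChain or the OpenAI library:

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of environment variable `name`, or raise if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Export it in your shell or assign os.environ[...] before running the lesson."
        )
    return value
```

Calling `require_api_key()` at the top of the notebook, before creating the LLM and embeddings objects, would stop the run early with a readable message.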

**Prepare Sample Documents:**
We will create two simple sample PDF files for illustration. In a real-world scenario, you would replace these with your own documents (e.g., company internal documents, books, reports).

```python
import os
import shutil
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, SystemMessage

# Set the environment variable for the OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Initialize the LLM and the embeddings model
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Directory where ChromaDB persists its data
persist_directory = "./chroma_qa_system_db"
# Remove the old directory if it exists so we start from a clean state
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)
    print(f"Old Chroma directory removed: {persist_directory}")

# Create two sample PDF files
pdf_file_path_1 = "document_part_1.pdf"
pdf_file_path_2 = "document_part_2.pdf"

try:
    from reportlab.pdfgen import canvas
    # Content for PDF 1
    c1 = canvas.Canvas(pdf_file_path_1)
    c1.drawString(100, 750, "This document covers the basic principles of Artificial Intelligence (AI).")
    c1.drawString(100, 730, "AI is the field of computer science focused on creating intelligent machines.")
    c1.drawString(100, 710, "The main branches of AI include Machine Learning, Deep Learning, and Natural Language Processing (NLP).")
    c1.drawString(100, 690, "Machine Learning lets systems learn from data without being explicitly programmed.")
    c1.save()
    print(f"Sample PDF file created: {pdf_file_path_1}")

    # Content for PDF 2
    c2 = canvas.Canvas(pdf_file_path_2)
    c2.drawString(100, 750, "This part focuses on applications of AI in healthcare and education.")
    c2.drawString(100, 730, "In healthcare, AI helps diagnose diseases early and develop new drugs.")
    c2.drawString(100, 710, "In education, AI personalizes the learning experience and assesses progress.")
    c2.drawString(100, 690, "AI is also used in self-driving cars and intelligent virtual assistants.")
    c2.save()
    print(f"Sample PDF file created: {pdf_file_path_2}")

except ImportError as exc:
    # A plain-text placeholder with a .pdf extension cannot be parsed by PyPDFLoader,
    # so stop here with a clear message instead of failing later during indexing.
    raise ImportError(
        "reportlab is not installed. Run `pip install reportlab`, or replace this "
        "cell with paths to your own PDF files."
    ) from exc
```


---

## 3. Building the Complete RAG Pipeline

We will build the RAG pipeline following the steps we've learned: Load, Split, Embed, Store, Retrieve, and Generate.
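
Before wiring up the real components, it can help to see the six steps in miniature. The sketch below is an offline toy, not the LangChain implementation: word counts stand in for an embedding model, a Python list stands in for the vector store, and the "generate" step stops at building the prompt an LLM would receive. All function names here are illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Load + Split: here each "chunk" is simply one sentence.
chunks = [
    "Machine learning lets systems learn from data.",
    "In healthcare, AI helps diagnose diseases early.",
    "In education, AI personalizes the learning experience.",
]

# Embed + Store: an in-memory list of (vector, chunk) pairs.
store = [(embed(chunk), chunk) for chunk in chunks]


def retrieve(question: str, k: int = 1) -> list[str]:
    """Retrieve: return the k chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question: str) -> str:
    """Generate (input side): the prompt that would be sent to an LLM."""
    context = "\n".join(retrieve(question))
    return f"Context: {context}\n\nQuestion: {question}"


print(build_prompt("How does AI help in healthcare?"))
```

The real pipeline follows the same shape; each toy function is replaced by a LangChain component that does the job properly.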

### 3.1. Indexing Phase (Load, Split, Embed, Store)

```python
# --- Indexing Phase ---

# 1. Load documents using Document Loader
print("Loading documents from PDF files...")
loader1 = PyPDFLoader(pdf_file_path_1)
loader2 = PyPDFLoader(pdf_file_path_2)

documents = []
documents.extend(loader1.load())
documents.extend(loader2.load())
print(f"Loaded a total of {len(documents)} pages from PDF files.")

# 2. Split documents using Text Splitter
print("Splitting text into chunks...")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Maximum chunk size, in characters
    chunk_overlap=50,      # Characters shared between consecutive chunks
    length_function=len,   # Measure length in characters
    add_start_index=True   # Record each chunk's offset in its metadata
)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")

# 3. Create embeddings and store in Vector Store (Chroma)
print("Creating embeddings and storing in Chroma Vector Store...")
vector_store = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory=persist_directory
)
print("Indexing phase completed. Data is ready for querying.")
```
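
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch that slides a fixed-size character window over the text. This is not how `RecursiveCharacterTextSplitter` works internally (it first splits on separators such as paragraph and line breaks and then merges the pieces), and the function name is hypothetical. It only illustrates how the two parameters interact and what a start offset, as recorded by `add_start_index=True`, refers to.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Split text into fixed-size character windows; return (start_index, chunk) pairs."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return pieces


for start, piece in sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(start, piece)
# 0 abcd
# 3 defg
# 6 ghij
```

Each window repeats the last character of the previous one, which is the overlap. With `chunk_size=500` and `chunk_overlap=50`, the same idea keeps up to 50 characters of shared context between neighbouring chunks.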


### 3.2. Runtime Phase (Retrieval and Generation)

```python
# --- Runtime Phase ---
from operator import itemgetter

# 4. Set up Retriever
# The retriever fetches relevant text segments from the vector store.
# k=3 means it returns the 3 most similar text segments.
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
print("Retriever set up; it will retrieve k=3 documents for each query.")

# 5. Connect with LLM to generate answers (Build RAG Chain)
# This prompt instructs the LLM on how to use the provided context.
# The messages are (role, template) tuples: SystemMessage/HumanMessage instances are
# treated as fixed text, so {context} and {question} would never be filled in.
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful Q&A assistant. Answer the user's question based on the provided context. If you cannot find the answer in the context, state that you don't know."),
    ("human", "Context: {context}\n\nQuestion: {question}"),
])


def format_docs(docs):
    """Join the retrieved Document objects into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Build the RAG chain using LCEL
# Data flow:
# - Input: {"question": "User's question"}
# - Step 1 (RunnableParallel, created from the dict below):
#   - "context": itemgetter extracts the question string, the retriever fetches
#     relevant documents, and format_docs joins them into plain text.
#   - "question": itemgetter passes the original question string through.
# - Step 2 (rag_prompt): Receives {"context": "...", "question": "..."} and formats the prompt.
# - Step 3 (llm): Receives the prompt and generates a response.
# - Step 4 (StrOutputParser): Extracts the string from the LLM's response.
rag_chain = (
    {
        "context": itemgetter("question") | retriever | format_docs,
        "question": itemgetter("question"),
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)
print("Complete RAG chain built.")
```
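
The retriever returns document objects, while the `{context}` slot in the prompt holds text, so retrieved chunks are typically flattened into a string before prompting. One possible variant, sketched below, also labels each chunk with its origin so that answers can be traced back. The `Doc` class is a hypothetical stand-in for LangChain's `Document`, used only so the example runs offline; the `source` and `page` metadata keys are the ones `PyPDFLoader` attaches to each page.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_sources(docs: list[Doc]) -> str:
    """Join chunks into one context string, tagging each with its source and page."""
    parts = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        parts.append(f"[{i}] ({source}, page {page})\n{doc.page_content}")
    return "\n\n".join(parts)


retrieved = [
    Doc("AI helps diagnose diseases early.", {"source": "document_part_2.pdf", "page": 0}),
    Doc("Machine Learning lets systems learn from data."),
]
print(format_docs_with_sources(retrieved))
```

Numbered, labelled chunks make it easier to ask the model to cite which passage supports its answer, at the cost of a slightly longer prompt.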



---

## 4. Testing and Evaluating the Q&A System's Performance

Now, let's try asking a few questions to the system and observe how it responds.

```python
# --- Test and Evaluate ---

print("\n--- Starting RAG Q&A System Test ---")

# Question 1: Related to PDF content 1
question_1 = "What are the main branches of Artificial Intelligence?"
print(f"\nQuestion: {question_1}")
answer_1 = rag_chain.invoke({"question": question_1})
print(f"Answer: {answer_1}")

# Question 2: Related to PDF content 2
question_2 = "How is AI applied in the medical field?"
print(f"\nQuestion: {question_2}")
answer_2 = rag_chain.invoke({"question": question_2})
print(f"Answer: {answer_2}")

# Question 3: A related topic that neither PDF covers
question_3 = "What is LangChain and how is it related to RAG?"
print(f"\nQuestion: {question_3}")
answer_3 = rag_chain.invoke({"question": question_3})
print(f"Answer: {answer_3}")

# Question 4: Entirely outside the provided context
question_4 = "Who invented the light bulb?"
print(f"\nQuestion: {question_4}")
answer_4 = rag_chain.invoke({"question": question_4})
print(f"Answer: {answer_4}")
# The LLM might answer from its general knowledge, or say it doesn't know if the
# prompt is well-crafted and the question is truly outside the provided context.

print("\n--- RAG Q&A System Test Ended ---")

# Clean up sample PDF files and Chroma directory
for path in (pdf_file_path_1, pdf_file_path_2):
    if os.path.exists(path):
        os.remove(path)
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)
print(f"Removed sample PDF files and Chroma directory '{persist_directory}'.")
```


**Performance Evaluation:**

* **Accuracy:** Is the answer correct based on the information in your documents?
* **Relevance:** Does the answer directly address the question?
* **Completeness:** Does the answer provide sufficient necessary information?
* **Out-of-context question handling:** Does the system respond with "I don't know" rather than hallucinating when the answer is not in the documents? (This depends heavily on your prompt engineering.)
* **Speed:** How fast does the system respond?

For a more scientific evaluation, you can create a dataset of (question, correct answer) pairs and relevant documents, then use automated or manual evaluation tools.
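
As a starting point for such an evaluation, the sketch below scores each answer by whether it contains a set of expected keywords and records the mean response time. Keyword matching is a crude proxy for accuracy and is only an assumption made for illustration. The `evaluate` and `stub_answer` names are hypothetical, and the stub exists only so the example runs without an API key.

```python
import time
from typing import Callable


def evaluate(answer_fn: Callable[[str], str], dataset: list[dict]) -> dict:
    """Score answers by required-keyword coverage and measure mean latency."""
    if not dataset:
        raise ValueError("dataset must contain at least one item")
    hits, latencies = 0, []
    for item in dataset:
        start = time.perf_counter()
        answer = answer_fn(item["question"])
        latencies.append(time.perf_counter() - start)
        # Count a hit only if every expected keyword appears in the answer.
        if all(kw.lower() in answer.lower() for kw in item["keywords"]):
            hits += 1
    return {
        "accuracy": hits / len(dataset),
        "mean_latency_s": sum(latencies) / len(latencies),
    }


def stub_answer(question: str) -> str:
    """Hypothetical stand-in for the RAG chain so the sketch runs offline."""
    if "branches" in question:
        return "Machine Learning, Deep Learning and NLP."
    return "I don't know."


dataset = [
    {"question": "What are the main branches of AI?", "keywords": ["machine learning", "deep learning"]},
    {"question": "Who invented the light bulb?", "keywords": ["don't know"]},
]
report = evaluate(stub_answer, dataset)
print(report)
```

To evaluate the real system, you could pass `lambda q: rag_chain.invoke({"question": q})` in place of `stub_answer`. For anything beyond a smoke test, manual review or a dedicated evaluation tool will give more trustworthy results than keyword checks.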


---

## Lesson Summary

In this lesson, you practiced building a **complete Question Answering (Q&A) system based on Retrieval-Augmented Generation (RAG)**, applying all the knowledge learned from Module 3. You went through each step of the RAG pipeline:
* **Loading documents** using `PyPDFLoader`.
* **Splitting documents** using `RecursiveCharacterTextSplitter`.
* **Creating embeddings** and **storing** them in a **Chroma Vector Store**.
* **Setting up a Retriever** to fetch relevant text segments.
* **Connecting** the Retriever with an LLM using LCEL to **generate answers** based on the retrieved context.

This practical exercise not only reinforced theoretical knowledge but also provided you with hands-on experience in building a crucial LLM application, capable of being scaled and customized with your own data.