# Lesson 6.4: Evaluation and Performance Optimization

---

After building Large Language Model (LLM) applications with LangChain, the next step is to **evaluate their performance** and find ways to **optimize them**. This lesson covers key evaluation metrics, prompt optimization techniques, Chain optimization, and ways to reduce token costs.

## 1. LLM Application Evaluation Metrics

To understand how well your LLM application is performing, you need to define evaluation metrics.

* **Accuracy:**
    * **Concept:** The degree to which the LLM's response matches the ground truth or expected information.
    * **Measurement:** Often evaluated by comparing the LLM's response to reference answers (ground truth) or through human evaluation. For RAG, it can involve checking if the answer is supported by the retrieved context.
    * **Example:** In a Q&A system, if the user asks "What is the capital of France?" and the LLM answers "Paris," that is accurate.

* **Relevance:**
    * **Concept:** The degree to which the LLM's response directly addresses the user's question or request, without going off-topic or including superfluous information.
    * **Measurement:** Scored by human reviewers, or automatically by an evaluation model that rates how well the response addresses the question.
    * **Example:** If the user asks about the weather, the response should only talk about the weather, not about the city's history.

* **Fluency:**
    * **Concept:** The degree to which the LLM's response is natural, coherent, grammatically correct, and easy to understand.
    * **Measurement:** Subjective human evaluation.
    * **Example:** The response should not have spelling errors, grammatical mistakes, or awkward sentence structures.

* **Safety:**
    * **Concept:** The degree to which the LLM's response avoids generating harmful, inappropriate, biased, or unethical content.
    * **Measurement:** Checked using content filters, human review, or toxicity classification models.
    * **Example:** The LLM should not generate hate speech, discriminatory language, or instructions for illegal activities.
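
As a rough illustration of the accuracy measurement described above, the sketch below compares model answers to reference answers with two simple automated scores: exact match and word-level F1. The helper names (`normalize`, `exact_match`, `token_f1`) and the evaluation pairs are hypothetical, and the code uses only the standard library. Relevance, fluency, and safety usually need human review or a dedicated classifier and are not covered here.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both texts contain the same normalized word sequence."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical evaluation set: (model answer, reference answer) pairs.
eval_pairs = [
    ("Paris", "Paris"),
    ("The capital of France is Paris.", "Paris"),
    ("Lyon", "Paris"),
]

accuracy = sum(exact_match(p, r) for p, r in eval_pairs) / len(eval_pairs)
mean_f1 = sum(token_f1(p, r) for p, r in eval_pairs) / len(eval_pairs)
print(f"exact-match accuracy: {accuracy:.2f}, mean token F1: {mean_f1:.2f}")
```

Exact match is strict: the second pair is correct but verbose, so it only earns partial credit through F1. This is one reason automated scores are usually combined with human evaluation.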




---

## 2. Prompt Optimization Techniques

**Prompt Engineering** is the art and science of designing effective prompts that help LLMs perform better.

* **Iterative Prompting:**
    * **Concept:** This is an iterative process of refining your prompt. You start with a basic prompt, test the LLM's response, then adjust the prompt based on the results to improve performance.
    * **Steps:**
        1.  **Write Initial Prompt:** Start with a clear instruction.
        2.  **Run and Evaluate:** Run the prompt with the LLM and evaluate the response quality based on metrics.
        3.  **Analyze and Improve:** Identify why the response was not as expected (e.g., missing information, misunderstanding instructions, hallucination).
        4.  **Adjust Prompt:** Add details, constraints, examples, or change phrasing.
        5.  **Iterate:** Go back to step 2 until the desired performance is achieved.
    * **Example:** If the LLM's response is too long, you can add "in about 50 words" to the prompt.

* **Chain-of-Thought Prompting:**
    * **Concept:** This technique encourages the LLM to display its intermediate reasoning steps before providing the final answer. This helps the LLM solve more complex problems and often leads to more accurate results.
    * **Usage:** Add phrases like "Think step by step," "Explain how you arrived at this answer," or provide examples where the thought process is explicitly shown.
    * **Benefits:** Improves reasoning and makes debugging easier, because you can see where the LLM "thought" incorrectly.
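
The two techniques above can be sketched in plain Python. This is a minimal illustration, not a LangChain API: `stub_llm` is a hypothetical stand-in for a real model call, and `refine_prompt`, `with_chain_of_thought`, and `extract_final_answer` are names invented for this example. The word limit is set to 10 instead of 50 only to keep the stub short.

```python
from typing import Callable


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it obeys a word limit if asked."""
    if "in about 10 words" in prompt:
        return "LCEL composes LangChain components into chains with pipes."
    return (
        "LCEL is a declarative way to compose LangChain components into chains "
        "using the pipe operator, with support for streaming and parallel execution."
    )


def refine_prompt(
    llm: Callable[[str], str],
    base_prompt: str,
    adjustments: list[str],
    is_good_enough: Callable[[str], bool],
) -> tuple[str, str, int]:
    """Run the prompt, evaluate the response, and add one adjustment per failed round."""
    prompt = base_prompt
    response = llm(prompt)
    rounds = 1
    for adjustment in adjustments:
        if is_good_enough(response):
            break
        prompt = f"{prompt} {adjustment}"  # adjust the prompt
        response = llm(prompt)             # run it again
        rounds += 1
    return prompt, response, rounds


def with_chain_of_thought(question: str) -> str:
    """Append a step-by-step instruction and ask for a clearly marked final answer."""
    return (
        f"{question}\nThink step by step, then give the result on a final line "
        "starting with 'Answer:'."
    )


def extract_final_answer(response: str) -> str:
    """Return the text after the last 'Answer:' line, or the whole response if absent."""
    for line in reversed(response.splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return response.strip()


final_prompt, final_response, rounds = refine_prompt(
    stub_llm,
    "Explain what LCEL is.",
    adjustments=["Answer in about 10 words."],
    is_good_enough=lambda text: len(text.split()) <= 12,
)
print(f"rounds: {rounds}")
print(f"final prompt: {final_prompt}")
print(f"final response: {final_response}")
```

Separating the reasoning from a marked final line keeps the debugging benefit of Chain-of-Thought while letting downstream code consume only the answer.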


---

## 3. Chain Optimization

Chains are the backbone of many LangChain applications. Optimizing them can significantly improve overall performance.

* **Refining Steps:**
    * **Simplify Logic:** Each step in the Chain should perform a clear and efficient task. Avoid unnecessary or overly complex steps.
    * **Use the Right Chain/Runnable Type:** LCEL provides various `Runnable` types (`RunnableSequence`, `RunnableParallel`, `RunnableLambda`). Choose the most appropriate type for the task to optimize data flow and parallel execution capabilities.
    * **Validate Inputs/Outputs:** Ensure data passed between steps is valid and correctly formatted to avoid parsing errors.

* **Reducing the Number of LLM Calls:**
    * **Use Caching:** As learned in Lesson 6.2, caching avoids repeated LLM calls for identical inputs, which saves both cost and latency.
    * **Combine Tasks:** If possible, design prompts so that a single LLM call can address multiple small tasks instead of many individual calls. However, balance this to avoid making the prompt too complex.
    * **Local Processing:** Perform simple tasks (like string formatting, basic data filtering) using traditional Python code instead of relying on the LLM.
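
To make the effect of caching and local processing concrete, here is a small standard-library sketch. It does not use the LangChain cache from Lesson 6.2: `expensive_llm_call` is a hypothetical stand-in for a paid API call, and `functools.lru_cache` plays the role of an exact-match cache. Lowercasing questions before lookup is an assumption that suits this toy example and may not suit case-sensitive prompts.

```python
from functools import lru_cache

call_count = 0  # Counts how often the (stubbed) model is actually invoked.


def expensive_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a paid API call."""
    global call_count
    call_count += 1
    return f"answer to: {prompt}"


@lru_cache(maxsize=256)
def cached_llm_call(prompt: str) -> str:
    """Return a stored response when the exact same prompt was seen before."""
    return expensive_llm_call(prompt)


def normalize_question(question: str) -> str:
    """Local processing: cheap string cleanup in plain Python, no LLM needed."""
    return " ".join(question.split()).lower()


questions = ["What is LCEL?", "what is  LCEL?", "What is RAG?", "What is LCEL?"]
answers = [cached_llm_call(normalize_question(q)) for q in questions]
print(f"{len(questions)} questions answered with {call_count} model calls")
```

Normalizing the input locally raises the cache hit rate, because questions that differ only in spacing or case map to the same cache key.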


---

## 4. Token Cost Reduction

Token cost is a significant factor in LLM applications. Reducing the number of tokens used can significantly reduce API costs.

* **Optimizing Chunk Size (for RAG):**
    * **Concept:** In RAG, documents are split into smaller "chunks" before being embedded and stored. Chunk size determines how many chunks a document produces and, for a fixed number of retrieved chunks, how many tokens are passed to the LLM.
    * **Optimization:**
        * **Smaller Chunk Size:** Can help retrieve more precise context segments, but might require more chunks to cover information, or lose broader context.
        * **Larger Chunk Size:** Can provide broader context in a single chunk, but might contain more irrelevant information, increasing unnecessary tokens.
        * **Overlap:** Ensure chunks have a certain overlap to avoid losing context at boundaries.
    * **Goal:** Find the optimal chunk size to balance retrieval accuracy and token count/cost.

* **Using Smaller Models When Appropriate:**
    * **Concept:** LLM providers often have multiple model sizes (e.g., `gpt-3.5-turbo` vs. `gpt-4`). Smaller models are generally faster and cheaper.
    * **Optimization:**
        * Use smaller models for simpler tasks (e.g., short summarization, simple classification, clearly structured information extraction).
        * Only use larger models for complex tasks requiring deep reasoning or creativity.
    * **Benefits:** Significantly reduces cost and latency.
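
The trade-offs above can be explored without calling any API. The sketch below is a simplified stand-in, not the LangChain splitter: `chunk_text` cuts fixed-size character windows with overlap, `estimate_tokens` uses an assumed ratio of about 4 characters per token (a real tokenizer would give exact counts), and `pick_model` is a hypothetical routing rule using the two model names mentioned above. Taking the first `k` chunks stands in for the retriever's top-`k` results.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the previous window already reaches the end of the text.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    return max(1, round(len(text) / chars_per_token))


SIMPLE_TASKS = {"classification", "extraction", "short_summary"}  # hypothetical labels


def pick_model(task_type: str) -> str:
    """Route simple tasks to the smaller model and everything else to the larger one."""
    return "gpt-3.5-turbo" if task_type in SIMPLE_TASKS else "gpt-4"


document = "LCEL lets you compose Runnables with the pipe operator. " * 20
k = 2  # number of chunks passed to the LLM as context

results = {}
for chunk_size, chunk_overlap in [(50, 10), (100, 20), (200, 40)]:
    chunks = chunk_text(document, chunk_size, chunk_overlap)
    context_tokens = sum(estimate_tokens(chunk) for chunk in chunks[:k])
    results[chunk_size] = (len(chunks), context_tokens)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, ~{context_tokens} context tokens")

print(pick_model("classification"), pick_model("multi_step_reasoning"))
```

With `k` fixed, the estimated context tokens grow with chunk size while the number of chunks shrinks, which is the balance the goal above refers to.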


---

## 5. Practical Example: Optimizing Performance for a Q&A Application

We will practice optimizing a simple RAG Q&A chain by adjusting chunk size and observing the impact.

**Preparation:**
* Ensure you have the necessary libraries installed: `langchain`, `langchain-community`, `langchain-openai`, `chromadb`, `pypdf`, and optionally `reportlab` (used to generate the sample PDF).
* Set the `OPENAI_API_KEY` environment variable.
* Reuse the sample PDF file from Lesson 6.1 (`lcel_rag_document.pdf`).

```python
# Install libraries if not already installed
# pip install langchain langchain-community langchain-openai chromadb pypdf reportlab

import os
import shutil
import time

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

# Set the environment variable for the OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Initialize the LLM and the embeddings model
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Create a sample PDF file (reused from Lesson 6.1)
pdf_file_path = "lcel_rag_document.pdf"
created_sample_pdf = False
sample_lines = [
    "LangChain Expression Language (LCEL) is a powerful way to build chains.",
    "LCEL allows combining Runnables using the pipe operator (|).",
    "Runnable types include RunnableSequence, RunnableParallel, and RunnablePassthrough.",
    "You can customize Runnables using .bind(), .with_config(), etc.",
    "RAG (Retrieval-Augmented Generation) combines information retrieval with LLM.",
    "Chroma is a popular Vector Store for storing embeddings.",
    "Optimizing chunk size is important to balance context and cost.",
    "Smaller LLM models are generally cheaper and faster.",
]
try:
    from reportlab.pdfgen import canvas
    c = canvas.Canvas(pdf_file_path)
    # Draw one sentence per line, starting at y=750 and moving down 20 points each time
    for i, line in enumerate(sample_lines):
        c.drawString(100, 750 - 20 * i, line)
    c.save()
    created_sample_pdf = True
    print(f"Sample PDF file created: {pdf_file_path}")
except ImportError:
    # Without reportlab we cannot generate a valid PDF, and PyPDFLoader cannot
    # parse a plain-text placeholder, so require an existing real PDF instead.
    if not os.path.exists(pdf_file_path):
        raise FileNotFoundError(
            "reportlab is not installed and 'lcel_rag_document.pdf' was not found. "
            "Install reportlab or provide a real PDF file."
        )
    print("reportlab is not installed. Using the existing 'lcel_rag_document.pdf'.")


# Build and run the RAG chain with a given chunk size
def run_rag_with_chunk_size(chunk_size: int, chunk_overlap: int, question: str):
    print(f"\n--- Running RAG with chunk size: {chunk_size}, chunk overlap: {chunk_overlap} ---")

    # ChromaDB storage directory (start from a clean state on every run)
    persist_directory = f"./chroma_rag_chunk_size_{chunk_size}_db"
    if os.path.exists(persist_directory):
        shutil.rmtree(persist_directory)

    # Indexing phase: load the PDF, split it into chunks, embed and store them
    loader = PyPDFLoader(pdf_file_path)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = text_splitter.split_documents(documents)
    print(f"Total chunks created: {len(chunks)}")

    vector_store = Chroma.from_documents(chunks, embeddings, persist_directory=persist_directory)
    retriever = vector_store.as_retriever(search_kwargs={"k": 2})  # Retrieve the 2 most relevant chunks

    # Define the RAG prompt
    rag_prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer the question based on the provided context. If the answer is not in the context, say you don't know."),
        ("human", "Context: {context}\n\nQuestion: {question}"),
    ])

    # Build the RAG chain with LCEL.
    # The chain input is the question string: it goes to the retriever and,
    # unchanged via RunnablePassthrough, into the prompt's {question} slot.
    rag_chain = (
        {"context": retriever | RunnableLambda(lambda docs: "\n\n".join(doc.page_content for doc in docs)),
         "question": RunnablePassthrough()}
        | rag_prompt
        | llm
        | StrOutputParser()
    )

    start_time = time.perf_counter()
    # Pass the plain string; passing a dict here would send the dict to the retriever
    response = rag_chain.invoke(question)
    end_time = time.perf_counter()

    print(f"Response time: {end_time - start_time:.2f} seconds")
    print(f"Response: {response}")

    # Clean up the Chroma directory
    if os.path.exists(persist_directory):
        shutil.rmtree(persist_directory)
    print(f"Chroma directory '{persist_directory}' deleted.")
    print("=" * 50)


# --- Practice ---
question_to_ask = "What is LCEL and what does it allow you to combine?"

# Experiment with different chunk sizes
run_rag_with_chunk_size(chunk_size=50, chunk_overlap=10, question=question_to_ask)
run_rag_with_chunk_size(chunk_size=100, chunk_overlap=20, question=question_to_ask)
run_rag_with_chunk_size(chunk_size=200, chunk_overlap=40, question=question_to_ask)

# Clean up the sample PDF, but only if this script created it
if created_sample_pdf:
    os.remove(pdf_file_path)
    print(f"\nSample PDF file deleted: {pdf_file_path}.")

print("\n--- End of performance optimization ---")
```


**Explanation of practical section:**

* We define a function `run_rag_with_chunk_size` to easily test different `chunk_size` and `chunk_overlap` configurations.
* On each run, a new Vector Store is created and populated with data using the given chunk size.
* You will observe:
    * **Number of chunks:** Smaller chunk sizes will generate more chunks from the same document.
    * **Response time:** Because the retriever always returns `k=2` chunks, larger chunks pass more tokens to the LLM, which can increase response time. A single timing is noisy, so treat small differences with caution.
    * **Response quality:** For the given question, you might find that a chunk size that is too small could lose necessary context to answer fully, while an appropriate size will provide enough information.
* **Goal:** In a real application, you would need to experiment with your data and actual questions to find the optimal chunk size for both accuracy and performance/cost.
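
Since one timing per configuration can be dominated by network jitter, a small harness that repeats each run and reports the median gives a steadier comparison. The sketch below is an assumption-laden illustration: `benchmark` expects a callable that returns the response text (the example function prints instead, so it would need a small adaptation), and `fake_rag_run` is a hypothetical stand-in that sleeps briefly instead of calling an API.

```python
import statistics
import time
from typing import Callable


def benchmark(
    run: Callable[[int, int], str],
    configs: list[tuple[int, int]],
    repeats: int = 3,
) -> list[dict]:
    """Time run(chunk_size, chunk_overlap) several times per configuration."""
    results = []
    for chunk_size, chunk_overlap in configs:
        timings = []
        response = ""
        for _ in range(repeats):
            start = time.perf_counter()
            response = run(chunk_size, chunk_overlap)
            timings.append(time.perf_counter() - start)
        results.append({
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
            "median_seconds": statistics.median(timings),
            "response_words": len(response.split()),
        })
    return results


def fake_rag_run(chunk_size: int, chunk_overlap: int) -> str:
    """Hypothetical stand-in for a real RAG call; sleeps in proportion to chunk size."""
    time.sleep(chunk_size / 100_000)
    return f"answer built from chunks of {chunk_size} characters"


rows = benchmark(fake_rag_run, [(50, 10), (100, 20), (200, 40)])
for row in rows:
    print(
        f"chunk_size={row['chunk_size']:>3} "
        f"median={row['median_seconds']:.4f}s "
        f"words={row['response_words']}"
    )
```

Recording response length alongside latency makes it easier to notice when a faster configuration is also returning a thinner answer.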

**Other optimization points to try (not included in this example code):**

* **Caching:** Enable caching as learned in Lesson 6.2 to avoid calling the LLM multiple times for the same question.
* **Lower LLM `temperature`:** For Q&A tasks requiring high accuracy, lowering `temperature` closer to 0 can help the LLM be less "creative" and stick more closely to the context.
* **Prompt Refinement:** Experiment with different phrasings in the `rag_prompt` to instruct the LLM to answer more accurately and relevantly.
* **Use Smaller LLM Models:** If `gpt-3.5-turbo` is too expensive or slow, try other cheaper and faster models if they are good enough for your task.


---

## Lesson Summary

This lesson equipped you with the knowledge and skills to **evaluate and optimize the performance** of LLM applications. You learned about important **evaluation metrics** such as accuracy, relevance, fluency, and safety. We explored **prompt optimization** techniques (Iterative Prompting, Chain-of-Thought Prompting), **Chain optimization** (refining steps, reducing LLM calls), and **token cost reduction** strategies (optimizing chunk size, using smaller models). Finally, you practiced **optimizing performance for a RAG Q&A application** by adjusting chunk size, illustrating how small changes can impact the overall efficiency of the system.