# Lesson 7.2: Automatic Document Summarization System

---

In real-world applications, we often deal with very large documents that exceed the context window limits of Large Language Models (LLMs). Effectively summarizing these documents is a significant challenge. This lesson will introduce strategies and techniques in LangChain to build an **automatic document summarization system**, with a focus on handling large documents.

## 1. Strategies for Summarizing Large Documents

The main issue when summarizing large documents is the LLM's token limit. You cannot simply feed the entire document into the prompt. Strategies for summarizing large documents typically revolve around splitting the document into smaller parts and processing each part.
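
To make the token limit concrete, here is a minimal sketch that estimates whether a document fits a prompt budget and how many pieces it would need. The 4-characters-per-token ratio, the 16,000-token context window, and the helper names are all illustrative assumptions; a real system would count tokens with a tokenizer such as `tiktoken`.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)


def chunks_needed(text: str, context_window: int, reserved_tokens: int) -> int:
    """How many pieces the text needs so each piece fits the prompt budget."""
    budget = context_window - reserved_tokens  # tokens left for document text
    if budget <= 0:
        raise ValueError("reserved_tokens must be smaller than context_window")
    return max(1, math.ceil(estimate_tokens(text) / budget))


# A simulated document of 200,000 characters.
document = "word " * 40_000

print(estimate_tokens(document))  # about 50,000 tokens under the assumed ratio
print(chunks_needed(document, context_window=16_000, reserved_tokens=1_000))
```

Under these assumptions the document needs at least four pieces, which is why both strategies below start by splitting.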

* **Summarize then Combine (Map-Reduce):**
    * **Concept:** Split the large document into multiple small segments (chunks). Summarize each segment independently. Then, combine these small summaries into a final overall summary.
    * **Pros:** Can handle arbitrarily long documents, and the segment summarization tasks can run in parallel.
    * **Cons:** Can lose coherence and overall context if segments are summarized too independently.
    * **LangChain Application:** Implemented using the **Map-Reduce Chain**.

* **Iterative Refinement (Refine Chain):**
    * **Concept:** Start by summarizing the first segment of the document. Then, take that summary and the next segment of the document, asking the LLM to refine the existing summary based on the new information. This process repeats until all segments have been processed.
    * **Pros:** Maintains better coherence and overall context because each summary is built upon the previous one.
    * **Cons:** The process is sequential (each step needs the previous summary), so it can be slower than Map-Reduce for very large documents.
    * **LangChain Application:** Implemented using the **Refine Chain**.




---

## 2. Using Map-Reduce Chain

**Map-Reduce** is a common design pattern in distributed computing, and it's also very effective for summarizing large documents.

* **How it Works:**
    1.  **Map (Summarize Each Chunk):** LangChain splits the document into chunks. Each chunk is sent to the LLM with a prompt to generate a small summary (map step).
    2.  **Reduce (Combine):** All small summaries from the "Map" step are collected. They are sent to the LLM again with a prompt to generate a final overall summary (reduce step).

* **Pros:**
    * **Handles Long Documents:** Capable of summarizing documents of any length.
    * **Parallelizable:** Segment summarization tasks can be performed in parallel (though LangChain doesn't do this by default and requires additional configuration).
    * **Conceptually Simple:** Easy to understand the processing flow.

* **Cons:**
    * **Context Loss:** Segment summaries are generated independently, which can miss important connections or overall context spanning across multiple segments.
    * **Fluency:** The final summary might be less fluent as it's aggregated from several disjoint summaries.

* **Structure in LangChain:**
    ```python
    from langchain.chains.summarize import load_summarize_chain
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
    chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=True)
    # chain.invoke({"input_documents": docs})
    ```
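
The map and reduce steps can be sketched in plain Python without calling an LLM. In this sketch `fake_summarize` and `fake_combine` are hypothetical stand-ins for the two LLM prompts, and the thread pool only illustrates that the map calls are independent of each other; it is not how LangChain implements the chain internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_summarize(chunk: str) -> str:
    """Hypothetical stand-in for the map-step LLM call: keep the first sentence."""
    return chunk.split(".")[0].strip() + "."


def fake_combine(summaries: List[str]) -> str:
    """Hypothetical stand-in for the reduce-step LLM call: join the summaries."""
    return " ".join(summaries)


def map_reduce_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],
    combine: Callable[[List[str]], str],
) -> str:
    # Map: every chunk is summarized on its own, so the calls can run in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_summaries = list(pool.map(summarize, chunks))  # order is preserved
    # Reduce: one final call sees only the partial summaries, not the original text.
    return combine(partial_summaries)


chunks = [
    "LangChain is a framework. It has many parts.",
    "Chains combine steps. They can be nested.",
    "Agents pick tools. Tools call APIs.",
]
print(map_reduce_summarize(chunks, fake_summarize, fake_combine))
```

Because the reduce step never sees the original chunks, anything the map step drops is gone for good, which is the context-loss drawback listed above.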


---

## 3. Using Refine Chain

**Refine Chain** offers a different approach, focusing on maintaining coherence and overall context.

* **How it Works:**
    1.  **Initial Summary:** The first document segment is sent to the LLM to generate an initial summary.
    2.  **Iterative Refinement:** For each subsequent document segment, the existing summary and the new document segment are sent to the LLM. The LLM is asked to "refine" the current summary based on the information from the new segment.
    3.  This process repeats until all segments have been processed.

* **Pros:**
    * **Maintains Coherence:** The final summary is often more coherent and fluent because it's built sequentially and contextually.
    * **Better Context Capture:** Capable of capturing relationships spanning across multiple segments better than Map-Reduce.

* **Cons:**
    * **Slower:** The process is sequential, so it can be significantly slower for very large documents.
    * **Order Dependent:** The quality of the summary can be affected by the order of the document segments.

* **Structure in LangChain:**
    ```python
    from langchain.chains.summarize import load_summarize_chain
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
    chain = load_summarize_chain(llm, chain_type="refine", verbose=True)
    # chain.invoke({"input_documents": docs})
    ```
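
The refine loop can also be sketched in plain Python. Here `fake_initial` and `fake_refine` are hypothetical stand-ins for the initial-summary prompt and the refine prompt; the point is the control flow, where each call depends on the result of the previous one.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."


def fake_initial(chunk: str) -> str:
    """Hypothetical stand-in for the initial-summary LLM call."""
    return first_sentence(chunk)


def fake_refine(existing_summary: str, chunk: str) -> str:
    """Hypothetical stand-in for the refine LLM call: extend the running summary."""
    return existing_summary + " " + first_sentence(chunk)


def refine_summarize(
    chunks: List[str],
    initial: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> str:
    if not chunks:
        raise ValueError("at least one chunk is required")
    summary = initial(chunks[0])
    # Each step needs the previous summary, so the calls cannot run in parallel.
    for chunk in chunks[1:]:
        summary = refine(summary, chunk)
    return summary


chunks = [
    "LangChain is a framework. It has many parts.",
    "Chains combine steps. They can be nested.",
    "Agents pick tools. Tools call APIs.",
]
print(refine_summarize(chunks, fake_initial, fake_refine))
print(refine_summarize(list(reversed(chunks)), fake_initial, fake_refine))
```

Running the loop on the reversed chunk list yields a different summary, a small illustration of the order dependence noted in the cons.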


---

## 4. Handling Complex Document Types

When working with document summarization systems, you'll encounter various document types.

* **Multiple Files:**
    * **Loading:** Use appropriate `DocumentLoader`s (e.g., `PyPDFLoader`, `TextLoader`, `UnstructuredFileLoader`) to load each file.
    * **Combining:** Collect all `Document` objects from the files into a single list. Then, split and summarize this list as a large document.

* **Complex Structures (tables, images, headings):**
    * **Text Splitting:** This is the most crucial step. Use `RecursiveCharacterTextSplitter` or more specialized `TextSplitter`s to divide the document into meaningful chunks, trying not to cut across tables, headings, or important information blocks.
    * **Image/Table Processing:** For images, you might need OCR (Optical Character Recognition) techniques to extract text. For tables, ensure they are converted into a format the LLM can easily understand (e.g., a Markdown table). LangChain also provides loaders built on the `unstructured` library (e.g., `UnstructuredHTMLLoader` for HTML or `UnstructuredPDFLoader` for PDF) that can help preserve document structure during extraction.
    * **Metadata:** Retain metadata (e.g., page numbers, section titles) during splitting so they can be used in the prompt or for source referencing.
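
The points above (combining several files, splitting along structural boundaries, and retaining metadata) can be sketched together in plain Python. The `Doc` class is a minimal stand-in for LangChain's `Document`, and `split_on_headings`, the file names, and the sample contents are hypothetical; the sketch assumes Markdown-style `#` headings and does not handle `#` characters inside code fences.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_on_headings(doc: Doc) -> List[Doc]:
    """Split at Markdown headings so no chunk straddles two sections."""
    chunks: List[Doc] = []
    title = "(no heading)"
    buffer: List[str] = []
    # The trailing sentinel heading flushes the final section.
    for line in doc.page_content.splitlines() + ["#"]:
        if line.startswith("#"):
            body = "\n".join(buffer).strip()
            if body:
                # Copy the file-level metadata and add the section title.
                chunks.append(Doc(body, {**doc.metadata, "section": title}))
            title, buffer = line.lstrip("#").strip(), []
        else:
            buffer.append(line)
    return chunks


# Two simulated files, each already loaded into a Doc with its source recorded.
loaded = [
    Doc("# Intro\nLangChain overview.\n# Chains\nChains combine steps.",
        {"source": "guide.md"}),
    Doc("# Pricing\n| Plan | Cost |\n|---|---|\n| Basic | 10 |",
        {"source": "pricing.md"}),
]

# Collect the chunks from every file into a single list for summarization.
all_chunks = [chunk for doc in loaded for chunk in split_on_headings(doc)]
for chunk in all_chunks:
    print(chunk.metadata, "->", chunk.page_content.splitlines()[0])
```

Each chunk keeps its `source` and gains a `section` entry, so a prompt or a citation can point back to where the text came from, and the table stays inside a single chunk.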


---

## 5. Practical Example: Building a System to Summarize a Collection of Large Documents

We will build a document summarization system using both **Map-Reduce** and **Refine Chain** to summarize a simulated large document.

**Preparation:**
* Ensure you have the necessary libraries installed: `langchain`, `langchain-openai`, `tiktoken`, and `pypdf` (the last is only needed if you load PDF files).
* Set the `OPENAI_API_KEY` environment variable.

```python
import os
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
import time

# Set the environment variable for the OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# 1. Initialize the LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# 2. Create a simulated large document
# We repeat a passage of text several times to simulate a long document
long_text = """
LangChain is an open-source framework that helps developers build applications powered by large language models (LLMs). It provides tools and components for creating complex processing chains, allowing LLMs to interact with external data sources, take actions, and maintain conversation state. The main components of LangChain include:

1.  **Models:** Interfaces for different language models (LLMs, Chat Models, Embeddings).
2.  **Prompts:** Manage and optimize the prompts sent to the LLM.
3.  **Chains:** Combine LLMs and other components into a sequence of steps.
4.  **Agents:** Allow the LLM to decide which tools to use to take actions.
5.  **Tools:** Functions that an Agent can call to interact with the outside world (web search, calculation, APIs).
6.  **Retrieval:** Integrate external data sources to augment the LLM's capabilities (e.g., RAG).
7.  **Memory:** Helps the LLM maintain conversation context.

LangChain simplifies the development of complex LLM applications such as chatbots, Q&A systems, and automation tasks. It is designed to be modular, flexible, and easily extensible.
""" * 10  # Repeat 10 times to create a large document

# 3. Split the document into chunks
# Use RecursiveCharacterTextSplitter to split the text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Maximum size of each chunk, in characters
    chunk_overlap=100,  # Overlap between consecutive chunks, in characters
    length_function=len,
    add_start_index=True,  # Store each chunk's start offset in its metadata
)
docs = [Document(page_content=long_text)]  # Wrap the long text in a Document object
chunks = text_splitter.split_documents(docs)

print(f"Total chunks created: {len(chunks)}")
print(f"Size of the first chunk: {len(chunks[0].page_content)} characters")
print(f"Size of the last chunk: {len(chunks[-1].page_content)} characters")

# 4. Summarize with the Map-Reduce Chain
print("\n--- Starting summarization with Map-Reduce Chain ---")
map_reduce_chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=True)

start_time_mr = time.time()
map_reduce_summary = map_reduce_chain.invoke({"input_documents": chunks})
end_time_mr = time.time()

print(f"\nMap-Reduce summary:\n{map_reduce_summary['output_text']}")
print(f"Map-Reduce execution time: {end_time_mr - start_time_mr:.2f} seconds")
print("-" * 50)

# 5. Summarize with the Refine Chain
print("\n--- Starting summarization with Refine Chain ---")
refine_chain = load_summarize_chain(llm, chain_type="refine", verbose=True)

start_time_refine = time.time()
refine_summary = refine_chain.invoke({"input_documents": chunks})
end_time_refine = time.time()

print(f"\nRefine summary:\n{refine_summary['output_text']}")
print(f"Refine execution time: {end_time_refine - start_time_refine:.2f} seconds")
print("-" * 50)

print("\n--- Comparing results ---")
print("Map-Reduce might be faster but can be less coherent.")
print("Refine might be slower but usually produces a more coherent summary.")

print("\n--- End of the automatic document summarization system ---")
```


**Explanation:**
* We create a very long text string to simulate a large document.
* `RecursiveCharacterTextSplitter` is used to split the document into smaller chunks that the LLM can process.
* Both `map_reduce_chain` and `refine_chain` are created with the same LLM and invoked on the same document chunks, so their results can be compared directly.
* `verbose=True` is used so you can see the individual LLM calls and the step-by-step summarization process of each chain type.
* You will observe differences in execution time and the quality of the summary between the two methods. `Refine` typically takes longer due to its sequential nature, but the summary might be more fluent.
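
To build intuition for the `chunk_size` and `chunk_overlap` settings used above, here is a simplified fixed-window splitter. It is only an illustration under stated assumptions: `sliding_window_chunks` is a hypothetical helper, and it is not the algorithm `RecursiveCharacterTextSplitter` uses, which instead tries to break on separators such as paragraph and line boundaries.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, which gives the LLM some shared context across chunk boundaries at the cost of processing those characters twice.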


---

## Lesson Summary

This lesson equipped you with strategies and techniques to build an **automatic document summarization system**, with a focus on handling large documents. You learned about two main strategies: **summarize then combine (Map-Reduce)**, and **iterative refinement (Refine)**, along with the pros and cons of each. We also discussed how to **handle complex document types** by text splitting and managing metadata. Finally, you **practiced building a summarization system** using both chain types on a simulated large document, helping you visualize how they work and make appropriate choices for your use cases.