# Document Summarization


This notebook demonstrates an application of long document summarization techniques to a work of literature using Granite.

### Python version

Ensure you are running Python 3.10, 3.11, or 3.12 in a freshly created virtual environment.

In [1]:
import sys

assert sys.version_info >= (3, 10) and sys.version_info < (
    3,
    13,
), "Use Python 3.10, 3.11, or 3.12 to run this notebook."

In [2]:
# To format the cells
import jupyter_black

jupyter_black.load()

## Serving the Granite AI model


This notebook requires an AI model runtime to serve the IBM Granite models so that they can be invoked. You can use either a locally accessible [Ollama](https://github.com/ollama/ollama) server or the [Replicate](https://replicate.com) cloud service.

During the pre-work, you may have either started a local Ollama server on your computer, or set up Replicate access and obtained an [API token](https://replicate.com/account/api-tokens).

## Select your model

Select a Granite model to use. Here we use a LangChain client to connect to the model. If a locally accessible Ollama server is available, we use an Ollama client to access the model. Otherwise, we use a Replicate client.

When using Replicate, if neither the `REPLICATE_API_TOKEN` environment variable nor a `REPLICATE_API_TOKEN` Colab secret is set, the notebook asks for your [Replicate API token](https://replicate.com/account/api-tokens) in a dialog box.
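The lookup order described above can be sketched roughly as follows. This is not the implementation of `get_env_var` from `ibm_granite_community`. The name `lookup_secret` and its `prompt` parameter are hypothetical, and the Colab secret step is omitted to keep the sketch self-contained.

```python
import os
from getpass import getpass
from typing import Callable


def lookup_secret(name: str, prompt: Callable[[str], str] = getpass) -> str:
    """Return the secret from the environment, asking the user only if it is unset."""
    value = os.environ.get(name)
    if not value:
        # Hypothetical fallback: ask interactively and cache for later cells.
        value = prompt(f"Please enter your {name}: ")
        os.environ[name] = value
    return value
```

Passing the prompt as a parameter keeps the function easy to exercise without a terminal, for example `lookup_secret("REPLICATE_API_TOKEN", prompt=lambda message: "dummy-token")`.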

In [3]:
import os
import requests
from langchain_ollama.llms import OllamaLLM
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var

try:  # Look for a locally accessible Ollama server for the model
    response = requests.get(os.getenv("OLLAMA_HOST", "http://127.0.0.1:11434"))
    model = OllamaLLM(
        model="granite3.2:2b",
        num_ctx=65536,  # 64K context window
    )
    model = model.bind(raw=True)  # Client side controls prompt
except Exception:  # Use Replicate for the model
    model = Replicate(
        model="ibm-granite/granite-3.2-8b-instruct",
        replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
        model_kwargs={
            "max_tokens": 2000,  # Set the maximum number of tokens to generate as output.
            "min_tokens": 200,  # Set the minimum number of tokens to generate as output.
            "temperature": 0.75,
            "presence_penalty": 0,
            "frequency_penalty": 0,
        },
    )

## Download a book

Here we fetch H.D. Thoreau's "Walden" from [Project Gutenberg](https://www.gutenberg.org/) for summarization.

We have to chunk the book text so that each chunk fits within the AI model's context window.

### Count the tokens

Before sending our book chunks to the AI model, it's crucial to understand how much of the model's capacity we're using. Language models typically have a limit on the number of tokens they can process in a single request.

Key points:
- We're using the [`granite-3.2`](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct) model, which has a context window of 128K tokens.
- Tokenization can vary between models, so we use the specific tokenizer for our chosen model.

Understanding token count helps us optimize our prompts and ensure we're using the model efficiently.

In [4]:
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
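As a minimal sketch of how a token count can be checked against a context window, the helper below compares a counted length plus a reserved output budget to the window size. The names `fits_in_context`, `reserved_for_output`, and `whitespace_count` are hypothetical. The whitespace counter is only a stand-in so the example runs without downloading a tokenizer, and real tokenizers usually produce more tokens than words.

```python
from typing import Callable


def fits_in_context(
    text: str,
    count_tokens: Callable[[str], int],
    context_window: int,
    reserved_for_output: int = 0,
) -> bool:
    """Check whether the text still leaves room for the reserved output tokens."""
    return count_tokens(text) + reserved_for_output <= context_window


def whitespace_count(text: str) -> int:
    """Stand-in token counter that counts whitespace-separated words."""
    return len(text.split())


# 4 "tokens" of input plus 5 reserved for output fit in a window of 10.
print(fits_in_context("a b c d", whitespace_count, context_window=10, reserved_for_output=5))
```

In this notebook, the counter could instead be `lambda text: len(tokenizer.tokenize(text))`, using the tokenizer loaded above.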

### Summary of Summaries

Here we use a hierarchical abstractive summarization technique to adapt to the context length of the model. Our approach uses [Docling](https://docling-project.github.io/docling/) to understand the document's structure, chunk the document into text passages, and group the passages by chapter so that each chapter can be summarized separately.

In [5]:
import itertools
from typing import Iterator, Callable
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker
from docling_core.transforms.chunker.base import BaseChunk


def chunk_document(
    source: str,
    *,
    dropwhile: Callable[[BaseChunk], bool] = lambda c: False,
    takewhile: Callable[[BaseChunk], bool] = lambda c: True,
) -> Iterator[BaseChunk]:
    """Read the document and perform a hierarchical chunking"""
    converter = DocumentConverter()
    chunks = HierarchicalChunker().chunk(converter.convert(source=source).document)
    return itertools.takewhile(takewhile, itertools.dropwhile(dropwhile, chunks))


def merge_chunks(
    chunks: Iterator[BaseChunk],
    *,
    headings: Callable[[BaseChunk], list[str]] = lambda c: c.meta.headings,
) -> Iterator[dict[str, str]]:
    """Merge chunks having the same headings"""
    prior_headings: list[str] | None = None
    document: dict[str, str] = {}
    for chunk in chunks:
        text = chunk.text.replace("\r\n", "\n")
        current_headings = headings(chunk)
        if prior_headings != current_headings:
            if document:
                yield document
            prior_headings = current_headings
            document = {"title": " - ".join(current_headings), "text": text}
        else:
            document["text"] += f"\n\n{text}"
    if document:
        yield document


def chunk_dropwhile(chunk: BaseChunk) -> bool:
    """Ignore front matter prior to the book start"""
    return "WALDEN" not in chunk.meta.headings


def chunk_takewhile(chunk: BaseChunk) -> bool:
    """Ignore remaining chunks once we see this heading"""
    return "ON THE DUTY OF CIVIL DISOBEDIENCE" not in chunk.meta.headings


def chunk_headings(chunk: BaseChunk) -> list[str]:
    """Use the h1 and h2 (chapter) headings"""
    return chunk.meta.headings[:2]


documents: list[dict[str, str]] = list(
    merge_chunks(
        chunk_document(
            "https://www.gutenberg.org/cache/epub/205/pg205-images.html",
            dropwhile=chunk_dropwhile,
            takewhile=chunk_takewhile,
        ),
        headings=chunk_headings,
    )
)

print(f"{len(documents)} documents created")
print(
    f"Max document size: {max(len(tokenizer.tokenize(document['text'])) for document in documents)} tokens"
)

18 documents created
Max document size: 38275 tokens
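To make the grouping step easier to follow, here is a small sketch of the same idea on toy data: consecutive passages that share the same leading headings are merged into one document. It is an alternative formulation using `itertools.groupby`, not the notebook's `merge_chunks`. `ToyChunk` and `merge_by_headings` are hypothetical names, and the passages are invented for illustration.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class ToyChunk:
    """Hypothetical stand-in for a Docling chunk: headings plus passage text."""

    headings: list[str]
    text: str


def merge_by_headings(chunks: list[ToyChunk], depth: int = 2) -> list[dict[str, str]]:
    """Merge consecutive chunks whose first `depth` headings match."""
    merged = []
    for key, group in groupby(chunks, key=lambda c: tuple(c.headings[:depth])):
        merged.append(
            {"title": " - ".join(key), "text": "\n\n".join(c.text for c in group)}
        )
    return merged


chunks = [
    ToyChunk(["WALDEN", "Economy"], "First passage."),
    ToyChunk(["WALDEN", "Economy"], "Second passage."),
    ToyChunk(["WALDEN", "Reading"], "Third passage."),
]
for doc in merge_by_headings(chunks):
    print(doc["title"], "->", len(doc["text"]), "characters")
```

Like the function above, `groupby` only merges adjacent chunks, so a heading that reappears later in the stream would start a new document.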


## Summarize the chunks

Here we define a function that generates a response from a list of documents and a user prompt about those documents.

We create the prompt according to the [Granite Prompting Guide](https://www.ibm.com/granite/docs/models/granite/#chat-template) and provide the documents using the `documents` parameter.

In [6]:
def generate(user_prompt: str, documents: list[dict[str, str]]):
    """Use the chat template to format the prompt"""
    prompt = tokenizer.apply_chat_template(
        conversation=[
            {
                "role": "user",
                "content": user_prompt,
            }
        ],
        documents=documents,  # This uses the documents support in the Granite chat template
        add_generation_prompt=True,
        tokenize=False,
    )

    print(f"Input size: {len(tokenizer.tokenize(prompt))} tokens")
    output = model.invoke(prompt)
    print(f"Output size: {len(tokenizer.tokenize(output))} tokens")

    return output

For each chapter, we create a separate summary. This can take a few minutes.

In [7]:
if get_env_var("GRANITE_TESTING", "false").lower() == "true":
    documents = documents[:5]  # shorten testing work

user_prompt = """\
Using only the book chapter document, compose a summary of the book chapter.
Your response should only include the summary. Do not provide any further explanation."""

summaries: list[dict[str, str]] = []

for document in documents:
    print(
        f"============================= {document['title']} ============================="
    )
    output = generate(user_prompt, [document])
    summaries.append({"title": document["title"], "text": output})

print("Summary count: " + str(len(summaries)))

Input size: 38418 tokens
Output size: 348 tokens
Input size: 9133 tokens
Output size: 348 tokens
Input size: 5719 tokens
Output size: 170 tokens
Input size: 9102 tokens
Output size: 237 tokens
Input size: 5214 tokens
Output size: 370 tokens
Input size: 7148 tokens
Output size: 191 tokens
Input size: 6332 tokens
Output size: 312 tokens
Input size: 3133 tokens
Output size: 255 tokens
Input size: 13841 tokens
Output size: 195 tokens
Input size: 4156 tokens
Output size: 192 tokens
Input size: 6463 tokens
Output size: 398 tokens
Input size: 7278 tokens
Output size: 155 tokens
Input size: 8665 tokens
Output size: 254 tokens
Input size: 7546 tokens
Output size: 265 tokens
Input size: 5751 tokens
Output size: 243 tokens
Input size: 7783 tokens
Output size: 182 tokens
Input size: 10407 tokens
Output size: 458 tokens
Input size: 7047 tokens
Output size: 293 tokens
Summary count: 18


## Create the Final Summary

Now we need to summarize the chapter summaries. We prompt the model to create a unified summary of the chapter summaries we previously generated.

In [8]:
user_prompt = """\
Using only the book chapter summary documents, compose a single, unified summary of the book.
Your response should only include the unified summary. Do not provide any further explanation."""

output = generate(user_prompt, summaries)
print(output)

Input size: 5122 tokens
Output size: 733 tokens
"Walden" by Henry David Thoreau is a reflection on simple living and self-reliance, set near Walden Pond in Concord, Massachusetts. Thoreau recounts his experiences in a small, self-built cabin, advocating for a life free from materialism and social conventions. He praises the beauty and tranquility of nature, contrasting it with the hurried, materialistic lifestyle of his contemporaries. Thoreau critiques the excesses of civilized life and the pursuit of wealth, arguing that most people lead lives of "quiet desperation." Instead, he encourages readers to embrace voluntary simplicity, self-sufficiency, and a deeper understanding of the world through personal experience and direct observation of nature.

Thoreau details his modest lifestyle, including a simple diet, minimal furniture, and inexpensive clothing, and shares his experiences in various trades, enabling him to work only about six weeks a year. He questions the value of conventio

We have now summarized a document larger than the AI model's context window by breaking it into smaller pieces, summarizing each piece, and then summarizing those summaries.
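The overall pattern can be condensed into a short sketch: summarize each document on its own, then summarize the resulting summaries. The function name `summarize_hierarchically` and the `fake_summarize` stand-in are hypothetical. In this notebook, the `generate` function would play the role of the `summarize` callable.

```python
from typing import Callable

Document = dict[str, str]


def summarize_hierarchically(
    documents: list[Document],
    summarize: Callable[[str, list[Document]], str],
    chapter_prompt: str,
    book_prompt: str,
) -> str:
    """Summarize each document on its own, then summarize the summaries."""
    summaries = [
        {"title": doc["title"], "text": summarize(chapter_prompt, [doc])}
        for doc in documents
    ]
    return summarize(book_prompt, summaries)


def fake_summarize(prompt: str, docs: list[Document]) -> str:
    """Stand-in for a model call: keeps the first three words of each document."""
    return " / ".join(" ".join(d["text"].split()[:3]) for d in docs)


toy = [
    {"title": "A", "text": "one two three four five"},
    {"title": "B", "text": "six seven eight nine ten"},
]
print(summarize_hierarchically(toy, fake_summarize, "chapter", "book"))
```

This sketch assumes the combined summaries fit in a single request, as they did here. A longer work might need another round of grouping before the final pass.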