# LangChain 1.0 Middleware – Summarization for Long Doc Q&A Chats

In this lesson you’ll watch a normal doc Q&A agent slowly dig its own grave with a huge conversation history, and then see how the **built-in Summarization middleware** bails it out.

The focus here isn’t *"what is summarization?"* — you already know that. The focus is: **what you get for free when you plug in the prebuilt middleware** instead of hand-rolling your own trimming logic.


## 1. Setup

We’ll use the same LangChain 1.0-style `create_agent` API from the first middleware lesson, but this time connect it to the *LangChain docs* and let it answer questions about them.

First, imports and environment:


In [None]:
from dotenv import load_dotenv

load_dotenv()  # Load OPENAI_API_KEY and LangSmith settings if present

import uuid
from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain.agents import create_agent
from langchain.agents.middleware import SummarizationMiddleware
from langgraph.checkpoint.memory import InMemorySaver
from langchain.messages import HumanMessage
from helpers import draw_mermaid_png
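
Before running the cells below, it can help to confirm that the key actually loaded. This is a minimal sketch using only the standard library. `missing_env` is a hypothetical helper, not part of LangChain, and only `OPENAI_API_KEY` is checked because the LangSmith variable names depend on your setup:

```python
import os


def missing_env(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Stand-in dictionaries so the sketch runs anywhere; these are not real keys.
print(missing_env(["OPENAI_API_KEY"], env={"OPENAI_API_KEY": "sk-demo"}))
print(missing_env(["OPENAI_API_KEY"], env={}))
```

In the notebook you would call `missing_env(["OPENAI_API_KEY"])` right after `load_dotenv()` and stop if the returned list is not empty.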


## 2. LangChain docs via MCP

Instead of hard-coding a docs string, we'll talk to the real LangChain docs via an MCP server.

That server exposes tools the agent can call to look things up in the docs, and those tool calls will show up in the message history that our summarization middleware compresses.


In [None]:
MCP_SERVERS = {
    "langchain-docs": {
        "transport": "streamable_http",
        "url": "https://docs.langchain.com/mcp/",
    }
}

mcp_client = MultiServerMCPClient(MCP_SERVERS)
LANGCHAIN_DOCS_TOOLS = await mcp_client.get_tools()

SYSTEM_PROMPT = """
You are a helpful assistant answering questions about the LangChain and LangGraph ecosystem.
You have access to tools that query the official LangChain docs via MCP.
Always call the docs tools when you need information, and base your answers only on what they return.
If something is not covered, say you don't know.

Keep answers short, concrete, and oriented toward developers.
"""


In [None]:
# Peek at the first MCP tool to see its name, description, and argument schema
LANGCHAIN_DOCS_TOOLS[0]

## 3. Baseline doc Q&A agent (no summarization)

First we build a normal agent with **no middleware**.

We’ll feed it a series of questions about these docs and watch the conversation history grow.


In [None]:
baseline_checkpointer = InMemorySaver()

baseline_agent = create_agent(
    model="gpt-4o-mini",
    tools=LANGCHAIN_DOCS_TOOLS,  # MCP tools that query the live LangChain docs
    system_prompt=SYSTEM_PROMPT,
    checkpointer=baseline_checkpointer,
)

draw_mermaid_png(baseline_agent)

In [None]:
user_questions = [
    "What is LangChain 1.0 in one sentence?",
    "How does LangChain 1.0 relate to LangGraph?",
    "What role do tools play in an agentic application?",
    "What is middleware used for in LangChain 1.0?",
    "Where in the flow does summarization middleware act?",
    "Why might I need summarization for long-running chats?",
    "How is summarization middleware different from just trimming messages?",
]

### 3.1 Generate a long-ish conversation without summarization

Here we pretend to be a curious user asking several follow-up questions.

The key part: we keep passing the **full message history** back into the agent. No trimming, no summarization.


In [None]:
baseline_thread_id = str(uuid.uuid4())
baseline_config = {"configurable": {"thread_id": baseline_thread_id}}

baseline_messages = []

for question in user_questions:
    baseline_messages.append(HumanMessage(question))
    response = await baseline_agent.ainvoke({"messages": baseline_messages}, config=baseline_config)
    baseline_messages = response["messages"]

len(baseline_messages)
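
To see why this pattern gets expensive, here is a rough back-of-the-envelope sketch of prompt size when every turn resends everything before it. The turn sizes are hypothetical. The model also ignores that a single turn can involve several model calls when tools are used, so treat it as a lower bound rather than a measurement:

```python
def prompt_sizes(turn_sizes):
    """Prompt size per turn when the whole history is resent every time."""
    sizes = []
    history = 0
    for size in turn_sizes:
        history += size        # this turn's messages join the history
        sizes.append(history)  # and the full history becomes the next prompt
    return sizes


# Hypothetical conversation: 7 turns, about 500 tokens added per turn.
sizes = prompt_sizes([500] * 7)
print(sizes)
print(sum(sizes))
```

With these numbers the last prompt is 3,500 tokens and the conversation sends 14,000 prompt tokens in total, compared with 3,500 if each turn's content were sent only once.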


### 3.2 Inspect the raw history

Let’s peek at the full, uncompressed history so you can see exactly what the agent is carrying around without any summarization.

In [None]:
for m in baseline_messages:
    m.pretty_print()
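
The pretty-printed history gets long quickly, so a compact tally can be easier to compare between runs. This sketch assumes each message exposes `.type` and `.content`. `history_stats` is a hypothetical helper, and the demo uses stand-in objects so it runs without an agent:

```python
from collections import Counter
from types import SimpleNamespace


def history_stats(messages):
    """Count messages per type and total characters of plain-string content."""
    counts = Counter(getattr(m, "type", type(m).__name__) for m in messages)
    chars = sum(
        len(m.content)
        for m in messages
        if isinstance(getattr(m, "content", None), str)
    )
    return counts, chars


# Stand-in messages; in the notebook you would pass `baseline_messages`.
demo = [
    SimpleNamespace(type="human", content="What is middleware?"),
    SimpleNamespace(type="ai", content=""),  # e.g. a turn that only calls a tool
    SimpleNamespace(type="tool", content="...docs search result..."),
    SimpleNamespace(type="ai", content="Middleware hooks into the agent loop."),
]
counts, chars = history_stats(demo)
print(dict(counts), chars)
```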

## 4. Plug in the prebuilt Summarization middleware

Now we let the **middleware** do the heavy lifting.

Two key ideas in this implementation:

- It **triggers** once the history crosses the `trigger` threshold, which can be expressed in tokens or in messages (here: `("tokens", 2000)`).
- It **compresses** by keeping a small tail of recent messages verbatim (`keep=("messages", 4)`) and replacing everything before that with a single summary message that lives in the history.

Both `trigger` and `keep` accept `("tokens", N)`, `("messages", N)`, or `("fraction", f)`; `trigger` can also take a list such as `[("fraction", 0.8), ("messages", 100)]`.
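
The trigger-then-compress idea can be sketched in plain Python. This is a simplified model, not the middleware's actual implementation. `Msg`, `approx_tokens`, `threshold_hit`, and `compress` are hypothetical names, and the four-characters-per-token estimate is a crude stand-in for real token counting. A list of triggers is assumed to mean "any of these". `("fraction", f)` is left out because it needs the model's context-window size. Details such as keeping tool calls together with their results are ignored:

```python
from dataclasses import dataclass


@dataclass
class Msg:
    role: str
    text: str


def approx_tokens(messages):
    """Crude size estimate: roughly four characters per token."""
    return sum(len(m.text) // 4 for m in messages)


def threshold_hit(messages, trigger):
    """Check one (kind, value) pair, or a list of pairs (assumed any-of)."""
    triggers = trigger if isinstance(trigger, list) else [trigger]
    for kind, value in triggers:
        if kind == "tokens":
            size = approx_tokens(messages)
        elif kind == "messages":
            size = len(messages)
        else:
            raise ValueError(f"unsupported trigger kind in this sketch: {kind}")
        if size >= value:
            return True
    return False


def compress(messages, trigger, keep_last, summarize):
    """Replace everything before the last `keep_last` messages with one summary."""
    cut = len(messages) - keep_last
    if cut <= 0 or not threshold_hit(messages, trigger):
        return messages
    head, tail = messages[:cut], messages[cut:]
    summary = Msg("human", "Summary so far: " + summarize(head))
    return [summary] + tail


# Ten messages of about 100 tokens each; the summarizer is a stand-in function.
history = [Msg("human" if i % 2 == 0 else "ai", "x" * 400) for i in range(10)]


def fake_summarize(msgs):
    return f"{len(msgs)} earlier messages"


compressed = compress(history, ("tokens", 800), keep_last=4, summarize=fake_summarize)
print(len(history), "->", len(compressed))
print(compressed[0].text)
```

In the real middleware the stand-in summarizer is a model call, which is why `SummarizationMiddleware` takes its own `model` argument.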

We keep the agent code exactly the same and bolt this behavior on via the `middleware` argument.

In [None]:
summary_checkpointer = InMemorySaver()

summary_agent = create_agent(
    model="gpt-4o-mini",
    tools=LANGCHAIN_DOCS_TOOLS,
    system_prompt=SYSTEM_PROMPT,
    checkpointer=summary_checkpointer,
    middleware=[
        SummarizationMiddleware(
            model="gpt-4o-mini",
            trigger=("tokens", 2000),  # set small for the demo so it triggers quickly
            keep=("messages", 4),  # always keep the most recent 4 messages verbatim
            trim_tokens_to_summarize=None,  # None: do not trim the messages handed to the summarizer
        ),
    ],
)

draw_mermaid_png(summary_agent)

### 4.1 Re-run the same conversation with summarization enabled

We’ll ask the **same questions** in the same order, but now the middleware is watching the message history and is allowed to rewrite it when the token threshold is crossed.

The important part to notice: the **summary does not show up as the user-visible answer**. It shows up as a new message inside `state["messages"]` *before* the next model call.


In [None]:
summary_thread_id = str(uuid.uuid4())
summary_config = {"configurable": {"thread_id": summary_thread_id}}

summary_messages = []

for question in user_questions:
    summary_messages.append(HumanMessage(question))
    response = await summary_agent.ainvoke({"messages": summary_messages}, config=summary_config)
    summary_messages = response["messages"]

len(summary_messages)

In [None]:
for m in summary_messages:
    m.pretty_print()

The takeaway: by adding a single middleware entry, we turned an ever-growing transcript into a **summary + small tail**, without touching the agent logic itself.
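
To confirm that the history really is a summary plus a tail, you can look for the summary message by its text. This is a sketch with a hypothetical helper, `find_summary_index`. The marker string is an assumption about how the summary is worded, so adjust it to whatever you see in the printed history:

```python
from types import SimpleNamespace


def find_summary_index(messages, marker):
    """Return the index of the first message whose text contains `marker`, else None."""
    needle = marker.lower()
    for i, m in enumerate(messages):
        content = getattr(m, "content", "")
        if isinstance(content, str) and needle in content.lower():
            return i
    return None


# Stand-in history; in the notebook you would pass `summary_messages`.
demo = [
    SimpleNamespace(type="human", content="Here is a summary of the conversation so far: ..."),
    SimpleNamespace(type="human", content="Why might I need summarization?"),
    SimpleNamespace(type="ai", content="Long chats eventually exceed the context window."),
]
idx = find_summary_index(demo, "summary of the conversation")
tail = demo[idx + 1:] if idx is not None else demo
print(idx, len(tail))
```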


## 5. Customizing what the summary focuses on

The prebuilt middleware also lets you steer **how** it summarizes, via the `summary_prompt` parameter.

In this section we keep the same trigger/keep configuration, but change the prompt to do something more opinionated:

- Focus only on what the **human** has been saying.
- Infer their underlying intent and goals.
- Suggest where the conversation is likely heading next.

This turns the summary from a neutral recap into a kind of "intent + guidance" overlay for the chat.

In [None]:
CUSTOM_SUMMARY_PROMPT = """
You are summarizing a conversation between a developer and an assistant about LangChain 1.0.

Your job is not to recap every fact. Instead, you are trying to read the developer's mind a bit.

Rules:
- Focus only on the Human messages (what the user typed). Treat assistant and tool messages only as weak clues about the user's intent.
- Infer what the human is really trying to do, ask, or fix over the course of the chat.
- Capture any frustration, constraints, or strong preferences the human expressed.
- Keep a light, slightly funny tone, but stay respectful and concise.
- Never say the conversation is too long to summarize or that you cannot summarize it.
- Even if the messages look long, repetitive, or truncated, still summarize what you can see.

Produce three short sections:
1. User storyline – a narrative summary of what the human has been doing/asking so far.
2. Intent & goals – what the human seems to ultimately want to achieve in this session.
3. Likely next steps – 2–4 concrete suggestions for where this conversation is heading or what the assistant should help with next.

Conversation messages to summarize:
{messages}

Return only the summary text in those three sections, using short bullet points where helpful.
"""

custom_summary_checkpointer = InMemorySaver()

custom_summary_agent = create_agent(
    model="gpt-4o-mini",
    tools=LANGCHAIN_DOCS_TOOLS,
    system_prompt=SYSTEM_PROMPT,
    checkpointer=custom_summary_checkpointer,
    middleware=[
        SummarizationMiddleware(
            model="gpt-4o-mini",
            trigger=("tokens", 2000),
            keep=("messages", 4),
            summary_prompt=CUSTOM_SUMMARY_PROMPT,
            trim_tokens_to_summarize=None,
        ),
    ],
)

draw_mermaid_png(custom_summary_agent)

In [None]:
custom_thread_id = str(uuid.uuid4())
custom_config = {"configurable": {"thread_id": custom_thread_id}}

# Separate names so the results from section 4 stay available for comparison
custom_messages = []

for question in user_questions:
    custom_messages.append(HumanMessage(question))
    response = await custom_summary_agent.ainvoke({"messages": custom_messages}, config=custom_config)
    custom_messages = response["messages"]

len(custom_messages)

In [None]:
for m in custom_messages:
    m.pretty_print()

You can keep iterating on `summary_prompt` until the summary matches the level of detail your application needs. The important point: **all of this customization lives in middleware**, not in the agent loop itself.
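
While iterating, it is easy to break the prompt by dropping the `{messages}` placeholder or by adding stray braces. The sketch below checks a template with the standard library only. It assumes the placeholder is filled in `str.format` style, which is an assumption about the middleware rather than something this lesson confirms. `placeholder_names` is a hypothetical helper:

```python
from string import Formatter


def placeholder_names(template):
    """Return the set of `{name}` fields that str.format would try to fill."""
    return {field for _, field, _, _ in Formatter().parse(template) if field is not None}


# Shortened stand-in for a custom summary prompt.
template = (
    "Summarize the developer's intent in three short sections.\n\n"
    "Conversation messages to summarize:\n"
    "{messages}\n"
)

names = placeholder_names(template)
rendered = template.format(messages="Human: What is middleware used for?")
print(names)
print(rendered)
```

If the set contains anything other than `messages`, the prompt probably has literal braces (for example a JSON snippet) that would need to be doubled as `{{` and `}}` under this assumption.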

That’s the power of the new LangChain 1.0 middleware system: one agent, many cross-cutting behaviors you can plug in or swap out without rewriting your core logic.


## Reference Links

**1. SummarizationMiddleware – Python docs**

https://docs.langchain.com/oss/python/langchain/middleware/built-in

→ Overview of the built-in Summarization middleware, its configuration options (`model`, `trigger`, `keep`, `summary_prompt`, `trim_tokens_to_summarize`), and example usage.

**2. Middleware in LangChain 1.0**

https://docs.langchain.com/oss/python/langchain/middleware

→ High-level guide to the middleware system, how hooks work, and how to combine multiple middlewares in a single agent.

**3. Context engineering & summarization example**

https://docs.langchain.com/oss/javascript/langchain/context-engineering

→ Conceptual walk-through of summarization as a context-engineering pattern, including how summaries replace old messages and keep a short tail of recent turns.

**4. Model Context Protocol (MCP) in LangChain**

https://docs.langchain.com/oss/python/langchain/mcp

→ How to connect to MCP servers with `langchain-mcp-adapters`, load tools with `MultiServerMCPClient`, and plug MCP-backed tools into LangChain/LangGraph agents.