# Agentic Workflow for Word Document Ingestion

This notebook demonstrates a multi-agent system for intelligently ingesting Word documents and using them as context for LLM interactions.

## Architecture Overview

The system consists of multiple specialized agents:
- **Document Parser Agent**: Extracts and structures content from Word documents
- **Content Analyzer Agent**: Analyzes document structure, identifies key sections, and creates metadata
- **Chunking Strategy Agent**: Determines optimal chunking strategy based on document type
- **Context Builder Agent**: Assembles relevant context for LLM queries
- **Supervisor Agent**: Orchestrates the workflow and handles user queries

## Installation

In [None]:
!pip install python-docx langgraph langchain langchain-anthropic langchain-community tiktoken

## Imports and Setup

In [None]:
import os
from typing import TypedDict, Annotated, List, Dict, Any, Optional
from docx import Document
from docx.table import Table
from docx.text.paragraph import Paragraph
import json
from operator import add

from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
import tiktoken

# Set your API key (setdefault keeps a key that is already present in the environment)
os.environ.setdefault("ANTHROPIC_API_KEY", "your-api-key-here")

## Define State Schema

The state is shared across all agents in the workflow.

In [None]:
class AgentState(TypedDict):
    """State passed between agents in the workflow."""
    
    # Input
    document_path: str
    user_query: Optional[str]
    
    # Document parsing results
    raw_text: str
    paragraphs: List[Dict[str, Any]]  # each entry has "text", "style", "is_heading"
    tables: List[Dict[str, Any]]
    
    # Analysis results
    document_type: str
    key_sections: List[Dict[str, str]]
    metadata: Dict[str, Any]
    
    # Chunking results
    chunks: List[Dict[str, Any]]
    chunking_strategy: str
    
    # Context building
    relevant_context: str
    context_metadata: Dict[str, Any]
    
    # Final output
    response: str
    
    # Agent messages
    messages: Annotated[List[str], add]
    
    # Control flow
    next_agent: str
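
The `Annotated[List[str], add]` annotation on `messages` asks LangGraph to combine each node's update with the existing list using `operator.add` instead of overwriting it. The sketch below mimics that merge in plain Python. `REDUCERS` and `apply_update` are hypothetical names used only for illustration; this is not LangGraph's actual implementation.

```python
from operator import add
from typing import Any, Callable, Dict

# Hypothetical stand-in for the reducer attached to the `messages` key.
REDUCERS: Dict[str, Callable[[Any, Any], Any]] = {"messages": add}

def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new state: reducer keys are combined, all other keys are overwritten."""
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in merged:
            merged[key] = REDUCERS[key](merged[key], value)
        else:
            merged[key] = value
    return merged

state = {"messages": ["Parsed document"], "next_agent": "analyzer"}
state = apply_update(state, {"messages": ["Analyzed document"], "next_agent": "chunker"})
print(state["messages"])
print(state["next_agent"])
```

This is why each agent assigns a fresh one-element list to `state["messages"]`: returning the accumulated list unchanged would append it to itself.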

## Utility Functions

In [None]:
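def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Approximate the token count of text using a tiktoken encoding.

    cl100k_base is an OpenAI encoding, so counts are only an estimate for Claude models.
    """
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))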

def extract_table_data(table: Table) -> Dict[str, Any]:
    """Extract data from a Word table."""
    data = []
    for row in table.rows:
        row_data = [cell.text.strip() for cell in row.cells]
        data.append(row_data)
    
    # Try to identify headers
    headers = data[0] if data else []
    rows = data[1:] if len(data) > 1 else []
    
    return {
        "headers": headers,
        "rows": rows,
        "raw_data": data
    }

## Agent 1: Document Parser

Extracts content from Word documents including text, tables, and structure.

In [None]:
def document_parser_agent(state: AgentState) -> AgentState:
    """Parse Word document and extract structured content."""
    
    print("📄 Document Parser Agent: Parsing document...")
    
    try:
        doc = Document(state["document_path"])
        
        # Extract paragraphs
        paragraphs = []
        raw_text = []
        
        for para in doc.paragraphs:
            text = para.text.strip()
            if text:
                paragraphs.append({
                    "text": text,
                    "style": para.style.name,
                    "is_heading": para.style.name.startswith('Heading')
                })
                raw_text.append(text)
        
        # Extract tables
        tables = []
        for table in doc.tables:
            table_data = extract_table_data(table)
            tables.append(table_data)
        
        state["raw_text"] = "\n".join(raw_text)
        state["paragraphs"] = paragraphs
        state["tables"] = tables
        state["messages"] = [f"Parsed document: {len(paragraphs)} paragraphs, {len(tables)} tables"]
        state["next_agent"] = "analyzer"
        
        print(f"  ✓ Extracted {len(paragraphs)} paragraphs and {len(tables)} tables")
        
    except Exception as e:
        state["messages"] = [f"Error parsing document: {str(e)}"]
        state["next_agent"] = "end"
        print(f"  ✗ Error: {str(e)}")
    
    return state

## Agent 2: Content Analyzer

Analyzes document structure and identifies key sections using an LLM.

In [None]:
def content_analyzer_agent(state: AgentState) -> AgentState:
    """Analyze document content and identify structure."""
    
    print("🔍 Content Analyzer Agent: Analyzing document structure...")
    
    llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
    
    # Create a preview of the document
    preview_paragraphs = state["paragraphs"][:20]  # First 20 paragraphs
    preview_text = "\n\n".join([p["text"] for p in preview_paragraphs])
    
    # Identify headings
    headings = [p for p in state["paragraphs"] if p["is_heading"]]
    
    analysis_prompt = f"""
    Analyze this document and provide:
    1. Document type (e.g., technical report, business proposal, research paper, manual, etc.)
    2. Key sections and their purposes
    3. Main topics covered
    4. Suggested chunking approach (semantic, fixed-size, section-based)
    
    Headings found:
    {json.dumps([h['text'] for h in headings[:10]], indent=2)}
    
    Document preview:
    {preview_text[:2000]}
    
    Respond in JSON format:
    {{
      "document_type": "...",
      "key_sections": [
        {{"title": "...", "purpose": "..."}}
      ],
      "main_topics": ["..."],
      "chunking_strategy": "semantic|fixed|section",
      "reasoning": "..."
    }}
    """
    
    try:
        response = llm.invoke([
            SystemMessage(content="You are a document analysis expert. Provide concise, structured analysis."),
            HumanMessage(content=analysis_prompt)
        ])
        
        # Parse JSON response
        analysis = json.loads(response.content)
        
        state["document_type"] = analysis.get("document_type", "unknown")
        state["key_sections"] = analysis.get("key_sections", [])
        state["chunking_strategy"] = analysis.get("chunking_strategy", "semantic")
        state["metadata"] = {
            "main_topics": analysis.get("main_topics", []),
            "analysis_reasoning": analysis.get("reasoning", ""),
            "num_paragraphs": len(state["paragraphs"]),
            "num_tables": len(state["tables"]),
            "token_count": count_tokens(state["raw_text"])
        }
        
        state["messages"] = [f"Analyzed document: {state['document_type']}"]
        state["next_agent"] = "chunker"
        
        print(f"  ✓ Document type: {state['document_type']}")
        print(f"  ✓ Chunking strategy: {state['chunking_strategy']}")
        
    except Exception as e:
        print(f"  ✗ Error in analysis: {str(e)}")
        # Fallback to basic analysis
        state["document_type"] = "document"
        state["chunking_strategy"] = "semantic"
        state["key_sections"] = []
        state["metadata"] = {"token_count": count_tokens(state["raw_text"])}
        state["messages"] = ["Used fallback analysis"]
        state["next_agent"] = "chunker"
    
    return state
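
The analyzer passes `response.content` straight to `json.loads`, which fails whenever the model wraps its JSON in a Markdown code fence or adds a sentence around it, and the agent then silently drops to the fallback analysis. A minimal sketch of a more tolerant parser follows. `parse_llm_json` is a hypothetical helper, not part of LangChain, and it assumes the reply is a plain string containing a single JSON object.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # built programmatically so this cell contains no literal fence

def parse_llm_json(text: str) -> Dict[str, Any]:
    """Parse one JSON object from an LLM reply, tolerating code fences and extra prose."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Keep only the outermost {...} span so surrounding prose is ignored.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(candidate[start:end + 1])

reply = "Here is the analysis:\n" + FENCE + 'json\n{"document_type": "technical report"}\n' + FENCE
print(parse_llm_json(reply)["document_type"])
```

Under these assumptions, `analysis = parse_llm_json(response.content)` could stand in for the direct `json.loads` call, with the existing `except` branch still acting as the fallback.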

## Agent 3: Chunking Strategy Agent

Implements intelligent chunking based on document analysis.

In [None]:
def chunking_agent(state: AgentState) -> AgentState:
    """Chunk document based on optimal strategy."""
    
    print(f"✂️  Chunking Agent: Using {state['chunking_strategy']} strategy...")
    
    chunks = []
    strategy = state["chunking_strategy"]
    
    if strategy == "section":
        # Chunk by sections (headings)
        current_chunk = []
        current_heading = "Introduction"
        
        for para in state["paragraphs"]:
            if para["is_heading"]:
                if current_chunk:
                    chunks.append({
                        "text": "\n".join(current_chunk),
                        "section": current_heading,
                        "type": "section",
                        "token_count": count_tokens("\n".join(current_chunk))
                    })
                current_chunk = []
                current_heading = para["text"]
            current_chunk.append(para["text"])
        
        # Add final chunk
        if current_chunk:
            chunks.append({
                "text": "\n".join(current_chunk),
                "section": current_heading,
                "type": "section",
                "token_count": count_tokens("\n".join(current_chunk))
            })
    
    elif strategy == "fixed":
        # Fixed-size chunking with overlap
        chunk_size = 500  # tokens
        overlap = 50
        
        all_text = state["raw_text"]
        words = all_text.split()
        
        # Simple word-based chunking (approximation: about 1.3 tokens per word)
        words_per_chunk = int(chunk_size / 1.3)
        overlap_words = int(overlap / 1.3)
        
        for i in range(0, len(words), words_per_chunk - overlap_words):
            chunk_text = " ".join(words[i:i + words_per_chunk])
            if chunk_text:
                chunks.append({
                    "text": chunk_text,
                    "section": f"Chunk {len(chunks)+1}",
                    "type": "fixed",
                    "token_count": count_tokens(chunk_text)
                })
    
    else:  # semantic (default)
        # Semantic chunking - group by topic/meaning
        # For simplicity, we'll use paragraph-level with small groups
        current_chunk = []
        current_tokens = 0
        max_chunk_tokens = 800
        
        for para in state["paragraphs"]:
            para_tokens = count_tokens(para["text"])
            
            if current_tokens + para_tokens > max_chunk_tokens and current_chunk:
                chunks.append({
                    "text": "\n".join([p["text"] for p in current_chunk]),
                    "section": current_chunk[0]["text"][:50] + "...",
                    "type": "semantic",
                    "token_count": current_tokens
                })
                current_chunk = []
                current_tokens = 0
            
            current_chunk.append(para)
            current_tokens += para_tokens
        
        # Add final chunk
        if current_chunk:
            chunks.append({
                "text": "\n".join([p["text"] for p in current_chunk]),
                "section": current_chunk[0]["text"][:50] + "...",
                "type": "semantic",
                "token_count": current_tokens
            })
    
    # Add tables as separate chunks
    for i, table in enumerate(state["tables"]):
        table_text = "\n".join(["|" + "|".join(row) + "|" for row in table["raw_data"]])
        chunks.append({
            "text": table_text,
            "section": f"Table {i+1}",
            "type": "table",
            "token_count": count_tokens(table_text)
        })
    
    state["chunks"] = chunks
    state["messages"] = [f"Created {len(chunks)} chunks"]
    state["next_agent"] = "context_builder"
    
    print(f"  ✓ Created {len(chunks)} chunks")
    
    return state
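
The fixed-size branch above slides a window over words and converts token sizes to word counts with a rough factor. The same windowing can be written once for any sequence, so it works on words or on token ids (for example the list returned by `encoding.encode(text)`). This is a sketch under that assumption; `sliding_windows` is a hypothetical helper name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def sliding_windows(tokens: Sequence[T], size: int, overlap: int) -> List[Sequence[T]]:
    """Split tokens into windows of `size`, each sharing `overlap` items with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the sequence
    return windows

words = "a b c d e f g h i j".split()
for window in sliding_windows(words, size=4, overlap=1):
    print(" ".join(window))
```

The early `break` is a design choice: it avoids emitting a final window that contains nothing but the overlap of the previous one.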

## Agent 4: Context Builder

Selects and assembles relevant context for user queries.

In [None]:
def context_builder_agent(state: AgentState) -> AgentState:
    """Build relevant context based on user query."""
    
    print("🏗️  Context Builder Agent: Assembling context...")
    
    if not state.get("user_query"):
        # No query - return full document summary
        context = f"""Document Type: {state['document_type']}
        
Key Sections:
{json.dumps(state['key_sections'], indent=2)}

Full Content:
{state['raw_text'][:5000]}...
"""
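        state["relevant_context"] = context
        state["context_metadata"] = {"strategy": "full_document"}
        # Set a fresh list: returning the accumulated messages would make the `add` reducer duplicate them
        state["messages"] = ["Context assembled (full document)"]
        state["next_agent"] = "supervisor"
        return state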
    
    # Use LLM to identify relevant chunks
    llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
    
    # Create chunk summaries for selection
    chunk_summaries = []
    for i, chunk in enumerate(state["chunks"]):
        chunk_summaries.append({
            "id": i,
            "section": chunk["section"],
            "preview": chunk["text"][:200],
            "tokens": chunk["token_count"]
        })
    
    selection_prompt = f"""
    Given the user query: "{state['user_query']}"
    
    Select the most relevant chunks from this document:
    {json.dumps(chunk_summaries, indent=2)}
    
    Respond with JSON:
    {{
      "selected_chunk_ids": [0, 3, 5],
      "reasoning": "why these chunks are relevant"
    }}
    """
    
    try:
        response = llm.invoke([
            SystemMessage(content="You are an expert at identifying relevant document sections."),
            HumanMessage(content=selection_prompt)
        ])
        
        selection = json.loads(response.content)
        selected_ids = selection.get("selected_chunk_ids", [])
        
        # Assemble context from selected chunks, ignoring ids that are not valid indices
        relevant_chunks = [
            state["chunks"][i] for i in selected_ids
            if isinstance(i, int) and 0 <= i < len(state["chunks"])
        ]
        
        context_parts = [f"Document Type: {state['document_type']}\n"]
        for chunk in relevant_chunks:
            context_parts.append(f"\n--- {chunk['section']} ---\n{chunk['text']}\n")
        
        state["relevant_context"] = "\n".join(context_parts)
        state["context_metadata"] = {
            "strategy": "query_based",
            "selected_chunks": len(relevant_chunks),
            "reasoning": selection.get("reasoning", "")
        }
        
        print(f"  ✓ Selected {len(relevant_chunks)} relevant chunks")
        
    except Exception as e:
        print(f"  ✗ Error in selection: {str(e)}")
        # Fallback - use first few chunks
        context_parts = [f"Document Type: {state['document_type']}\n"]
        for chunk in state["chunks"][:3]:
            context_parts.append(f"\n--- {chunk['section']} ---\n{chunk['text']}\n")
        state["relevant_context"] = "\n".join(context_parts)
        state["context_metadata"] = {"strategy": "fallback"}
    
    state["messages"] = ["Context assembled"]
    state["next_agent"] = "supervisor"
    
    return state
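
The context builder trusts the chunk ids returned by the model and puts every selected chunk into the prompt, however large. A sketch of a stricter selection step is shown below: it drops invalid or repeated ids and skips chunks that would exceed a token budget. `select_chunks` and its `max_tokens` parameter are assumptions introduced for illustration; the notebook itself defines no budget.

```python
from typing import Any, Dict, List

def select_chunks(chunks: List[Dict[str, Any]], ids: List[Any], max_tokens: int) -> List[Dict[str, Any]]:
    """Keep valid, unique chunk ids in the given order, skipping chunks that exceed the budget."""
    selected: List[Dict[str, Any]] = []
    seen = set()
    used = 0
    for i in ids:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(i, bool) or not isinstance(i, int):
            continue
        if not 0 <= i < len(chunks) or i in seen:
            continue
        cost = chunks[i]["token_count"]
        if used + cost > max_tokens:
            continue  # this chunk does not fit; a later, smaller one still might
        seen.add(i)
        used += cost
        selected.append(chunks[i])
    return selected

chunks = [
    {"section": "Intro", "token_count": 300},
    {"section": "Results", "token_count": 500},
    {"section": "Table 1", "token_count": 400},
]
picked = select_chunks(chunks, [1, 7, -1, 1, "2", 2, 0], max_tokens=900)
print([c["section"] for c in picked])
```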

## Agent 5: Supervisor

Orchestrates the workflow and generates final responses.

In [None]:
def supervisor_agent(state: AgentState) -> AgentState:
    """Supervise workflow and generate final response."""
    
    print("👔 Supervisor Agent: Generating response...")
    
    llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0.7)
    
    if not state.get("user_query"):
        # No query - provide document summary
        summary_prompt = f"""
        Provide a comprehensive summary of this document:
        
        {state['relevant_context']}
        
        Include:
        - Document type and purpose
        - Key topics and sections
        - Main takeaways
        """
        
        response = llm.invoke([
            SystemMessage(content="You are a document summarization expert."),
            HumanMessage(content=summary_prompt)
        ])
        
        state["response"] = response.content
    else:
        # Answer user query with context
        query_prompt = f"""
        Based on the following document context, answer the user's question.
        
        Context:
        {state['relevant_context']}
        
        User Question: {state['user_query']}
        
        Provide a detailed, accurate answer based on the document. If the information
        isn't in the document, say so.
        """
        
        response = llm.invoke([
            SystemMessage(content="You are a helpful assistant with access to document context."),
            HumanMessage(content=query_prompt)
        ])
        
        state["response"] = response.content
    
    state["messages"] = ["Response generated"]
    state["next_agent"] = "end"
    
    print("  ✓ Response generated")
    
    return state

## Build the LangGraph Workflow

In [None]:
def create_workflow():
    """Create the agentic workflow graph."""
    
    workflow = StateGraph(AgentState)
    
    # Add nodes
    workflow.add_node("parser", document_parser_agent)
    workflow.add_node("analyzer", content_analyzer_agent)
    workflow.add_node("chunker", chunking_agent)
    workflow.add_node("context_builder", context_builder_agent)
    workflow.add_node("supervisor", supervisor_agent)
    
    # Add edges
    workflow.set_entry_point("parser")
    
    # Conditional routing: return the key that each edge mapping below translates to a node or END
    def route(state: AgentState) -> str:
        return state.get("next_agent", "end")
    
    workflow.add_conditional_edges(
        "parser",
        route,
        {"analyzer": "analyzer", "end": END}
    )
    
    workflow.add_conditional_edges(
        "analyzer",
        route,
        {"chunker": "chunker", "end": END}
    )
    
    workflow.add_conditional_edges(
        "chunker",
        route,
        {"context_builder": "context_builder", "end": END}
    )
    
    workflow.add_conditional_edges(
        "context_builder",
        route,
        {"supervisor": "supervisor", "end": END}
    )
    
    workflow.add_conditional_edges(
        "supervisor",
        route,
        {"end": END}
    )
    
    return workflow.compile()

# Create the workflow
app = create_workflow()
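
Each agent decides where control goes next by writing `next_agent`, and the conditional edges follow that value until it reads `"end"`. The plain-Python loop below is a sketch of that control flow only. `run_pipeline` and `make_node` are hypothetical helpers, and nothing here reflects how LangGraph schedules nodes internally.

```python
from typing import Callable, Dict

State = Dict[str, object]

def run_pipeline(nodes: Dict[str, Callable[[State], State]], entry: str, state: State) -> State:
    """Run nodes one after another, following each node's `next_agent` until it says "end"."""
    current = entry
    while current != "end":
        state = nodes[current](state)
        current = str(state.get("next_agent", "end"))
    return state

def make_node(name: str, next_agent: str) -> Callable[[State], State]:
    """Build a toy node that records its name and hands off to `next_agent`."""
    def node(state: State) -> State:
        return {**state, "trace": [*state["trace"], name], "next_agent": next_agent}
    return node

nodes = {
    "parser": make_node("parser", "analyzer"),
    "analyzer": make_node("analyzer", "chunker"),
    "chunker": make_node("chunker", "end"),
}
happy = run_pipeline(nodes, "parser", {"trace": []})
print(happy["trace"])

# A parser that fails sets next_agent to "end", so the remaining nodes never run.
failing = {**nodes, "parser": make_node("parser", "end")}
failed = run_pipeline(failing, "parser", {"trace": []})
print(failed["trace"])
```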

## Example Usage

### 1. Process Document Without Query (Summary)

In [None]:
# Create a sample Word document for testing
from docx import Document

doc = Document()
doc.add_heading('Sample Technical Report', 0)
doc.add_heading('Introduction', 1)
doc.add_paragraph('This is a sample document demonstrating the agentic workflow.')
doc.add_paragraph('It contains multiple sections with different types of content.')

doc.add_heading('Methodology', 1)
doc.add_paragraph('Our approach consists of three main phases:')
doc.add_paragraph('1. Data collection and preprocessing')
doc.add_paragraph('2. Model training and validation')
doc.add_paragraph('3. Deployment and monitoring')

doc.add_heading('Results', 1)
doc.add_paragraph('The system achieved 95% accuracy on the test set.')

# Add a table
table = doc.add_table(rows=3, cols=3)
table.cell(0, 0).text = 'Metric'
table.cell(0, 1).text = 'Training'
table.cell(0, 2).text = 'Testing'
table.cell(1, 0).text = 'Accuracy'
table.cell(1, 1).text = '98%'
table.cell(1, 2).text = '95%'
table.cell(2, 0).text = 'F1 Score'
table.cell(2, 1).text = '0.97'
table.cell(2, 2).text = '0.94'

doc.add_heading('Conclusion', 1)
doc.add_paragraph('The results demonstrate the effectiveness of our approach.')

doc.save('sample_document.docx')
print("✓ Sample document created: sample_document.docx")

In [None]:
# Run workflow for document summary
initial_state = {
    "document_path": "sample_document.docx",
    "user_query": None,
    "messages": []
}

print("\n" + "="*60)
print("WORKFLOW EXECUTION: DOCUMENT SUMMARY")
print("="*60 + "\n")

result = app.invoke(initial_state)

print("\n" + "="*60)
print("FINAL RESULT")
print("="*60)
print(f"\nResponse:\n{result['response']}")
print(f"\nMetadata:")
print(f"  Document Type: {result.get('document_type', 'N/A')}")
print(f"  Chunks Created: {len(result.get('chunks', []))}")
print(f"  Chunking Strategy: {result.get('chunking_strategy', 'N/A')}")
print(f"  Token Count: {result.get('metadata', {}).get('token_count', 'N/A')}")

### 2. Process Document With User Query

In [None]:
# Run workflow with a specific query
query_state = {
    "document_path": "sample_document.docx",
    "user_query": "What were the accuracy results?",
    "messages": []
}

print("\n" + "="*60)
print("WORKFLOW EXECUTION: QUERY-BASED")
print("="*60 + "\n")

result = app.invoke(query_state)

print("\n" + "="*60)
print("FINAL RESULT")
print("="*60)
print(f"\nQuery: {query_state['user_query']}")
print(f"\nResponse:\n{result['response']}")
print(f"\nContext Metadata:")
print(json.dumps(result.get('context_metadata', {}), indent=2))

## Advanced: Interactive Query Loop

In [None]:
def interactive_document_qa(document_path: str):
    """Interactive Q&A session with a document."""
    
    # First, process the document
    print("Processing document...\n")
    initial_state = {
        "document_path": document_path,
        "user_query": None,
        "messages": []
    }
    
    # Get initial processing done
    base_result = app.invoke(initial_state)
    
    print("\n" + "="*60)
    print("Document processed! You can now ask questions.")
    print("Type 'quit' to exit.")
    print("="*60 + "\n")
    
    # Build the query-only workflow once; every question reuses the chunks from the initial run
    query_workflow = StateGraph(AgentState)
    query_workflow.add_node("context_builder", context_builder_agent)
    query_workflow.add_node("supervisor", supervisor_agent)
    query_workflow.set_entry_point("context_builder")
    query_workflow.add_edge("context_builder", "supervisor")
    query_workflow.add_edge("supervisor", END)
    query_app = query_workflow.compile()
    
    while True:
        query = input("\nYour question: ")
        
        if query.lower() in ['quit', 'exit', 'q']:
            print("Goodbye!")
            break
        
        # Process query with existing chunks
        query_state = {
            **base_result,
            "user_query": query,
            "messages": []
        }
        
        result = query_app.invoke(query_state)
        
        print(f"\nAnswer:\n{result['response']}\n")
        print("-" * 60)

# Uncomment to run interactively
# interactive_document_qa("sample_document.docx")

## Visualize the Workflow Graph

In [None]:
from IPython.display import Image, display

try:
    # draw_mermaid_png() renders through the mermaid.ink web API by default, so it needs network access
    display(Image(app.get_graph().draw_mermaid_png()))
except Exception as e:
    print(f"Could not display graph: {e}")
    print("\nWorkflow structure:")
    print("parser -> analyzer -> chunker -> context_builder -> supervisor -> END")

## Production Enhancements

For production use, consider adding:

1. **Vector Storage**: Store chunks in a vector database (Pinecone, Weaviate, Chroma) for semantic search
2. **Caching**: Cache parsed documents and embeddings
3. **Error Handling**: More robust error handling and retry logic
4. **Monitoring**: Add logging and metrics tracking
5. **Multi-document**: Support for processing multiple documents
6. **Streaming**: Stream responses for better UX
7. **Memory**: Add conversation memory for follow-up questions
8. **Security**: Validate and sanitize document inputs
9. **Optimization**: Parallel processing where possible
10. **Advanced RAG**: Implement hybrid search, re-ranking, etc.
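
As one illustration of item 3, LLM calls can be wrapped in a retry helper with exponential backoff. This is a minimal sketch, not a production policy: `with_retries` and its parameters are hypothetical, and a real deployment would likely retry only on specific transient errors.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Call fn, retrying on the given exceptions with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")

calls = {"n": 0}

def flaky() -> str:
    """Toy function that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

outcome = with_retries(flaky, attempts=3, base_delay=0.0)
print(outcome, calls["n"])
```

In the agents above, a call such as `with_retries(lambda: llm.invoke(messages))` would be one way to apply it, where `messages` stands for the list already passed to `llm.invoke`.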

## Extension: Add Vector Search

Here's how you'd add vector search for better retrieval:

In [None]:
# Example with ChromaDB (uncomment to use)
"""
!pip install chromadb sentence-transformers

import chromadb
from chromadb.utils import embedding_functions

# Initialize vector store
client = chromadb.Client()
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.create_collection(
    name="document_chunks",
    embedding_function=embedding_function
)

def add_chunks_to_vectordb(chunks, document_id):
    """Add chunks to vector database."""
    for i, chunk in enumerate(chunks):
        collection.add(
            documents=[chunk['text']],
            metadatas=[{
                'document_id': document_id,
                'chunk_id': i,
                'section': chunk['section']
            }],
            ids=[f"{document_id}_chunk_{i}"]
        )

def semantic_search(query, top_k=3):
    """Perform semantic search on chunks."""
    results = collection.query(
        query_texts=[query],
        n_results=top_k
    )
    return results
"""
print("Vector search extension example (commented out)")
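
Conceptually, semantic search embeds the query and each chunk as vectors and returns the chunks whose vectors are closest to the query. The toy sketch below ranks by cosine similarity, which is one common choice and not necessarily the metric a given vector store uses by default. The two-dimensional vectors are made up for illustration; real embeddings come from a model.

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: Sequence[float], chunk_vecs: Dict[str, Sequence[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

toy_vectors = {"intro": [1.0, 0.0], "results": [0.0, 1.0], "table": [0.6, 0.8]}
ranked = top_k([0.0, 1.0], toy_vectors, k=2)
print([chunk_id for chunk_id, _ in ranked])
```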