# Building Intelligent Document Processing Pipelines with Nemotron RAG

## The Challenge: Unlocking Trapped Enterprise Data

In the enterprise, valuable intelligence is often trapped in complex document formats. Standard text extraction tools frequently fail due to **Linearization Loss**—the loss of structural context when converting spatial documents into flat text strings.

1.  **Complex Layouts**: Financial reports and technical manuals use multi-column layouts and nested tables that simple text parsers destroy.
2.  **Visual Data**: Critical trends often live exclusively in charts (e.g., "Figure 7: Employment Growth"), which text-only models ignore.
3.  **Dense Tabular Data**: Appendices often contain massive compliance or statistical tables where row/column alignment is essential for accuracy.
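
To make Linearization Loss concrete, here is a minimal sketch on a toy table. The helper names (`linearize`, `to_markdown`) and the values are hypothetical illustrations, not part of `nv-ingest`. The flattened string keeps every cell but drops the row and column boundaries; the Markdown form keeps them.

```python
def linearize(table):
    """Flatten a table the way a naive text extractor might: cells joined by spaces."""
    return " ".join(cell for row in table for cell in row)


def to_markdown(table):
    """Render the same table as Markdown, keeping row and column boundaries."""
    header, *body = table
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Toy data (hypothetical values, for illustration only)
table = [
    ["Region", "Q1", "Q2"],
    ["North", "10", "12"],
    ["South", "7", "9"],
]

flat = linearize(table)
md = to_markdown(table)
print(flat)  # Region Q1 Q2 North 10 12 South 7 9
print(md)
```

In the flat string nothing indicates whether `10` belongs to "North/Q1" or to some other pairing, which is the ambiguity a structure-aware extractor is meant to avoid.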

## The Solution: A Multimodal RAG Pipeline

This notebook demonstrates how to build a production-grade **Intelligent Document Processing (IDP)** pipeline using **NVIDIA NeMo Retriever** and **Nemotron RAG**. We will transform raw PDFs into a queryable knowledge base that understands text, charts, and tables with high precision.

### Core Components
We utilize NVIDIA's latest open-source models optimized for document understanding:

* **Extract**: `NeMo Retriever Extraction library (nv-ingest)` — A library that uses vision-based microservices (YOLOX, Transformers) to detect, crop, and parse charts and tables alongside text.
* **Embed**: `llama-nemotron-embed-vl-1b-v2` — A multimodal embedding model that encodes text and visual structures into a shared semantic space.
* **Rerank**: `llama-nemotron-rerank-vl-1b-v2` — A cross-encoder that "looks" at the retrieved images and text to score relevance with high accuracy.
* **Generate**: `llama-3.3-nemotron-super-49b` — A reasoning model that provides grounded, citation-backed answers.

### Representative Document: World Bank Enterprise Survey (Peru 2017)
To stress-test this pipeline, we will use the **World Bank Peru 2017 Country Profile**. This document is a perfect example of complex enterprise data, containing a mix of dense narrative text, statistical charts (Figures 7 & 8), and detailed data appendices.

In [None]:
# Step 1: Install Dependencies
# Note: We install dependencies for local VLM inference and the nv-ingest client.

!pip install -q ninja packaging wheel --break-system-packages
!pip install -q markitdown nv-ingest==26.1.1 nv-ingest-api==26.1.1 nv-ingest-client==26.1.1 milvus-lite==2.4.12 "openai>=1.51.0" requests python-dotenv ipywidgets transformers accelerate pillow torch torchvision timm --break-system-packages
!pip install -q pymilvus milvus-lite
# Attempt flash-attn (Recommended for VLM performance)
# NOTE: If this fails with a red error (e.g., because nvcc is missing), you can ignore it. The pipeline includes a runtime patch that falls back to eager attention.
!pip install -q flash-attn --no-build-isolation --break-system-packages || echo "[WARNING] Flash Attention failed to compile. Using software fallback."

import sys
import os
import logging
import getpass
import torch

# Setup Logging
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("nv_ingest").setLevel(logging.INFO)

print(f"[INFO] Python Version: {sys.version.split()[0]}")

In [None]:
# Setup Compute Device
# We target the local GPU for the Embedding and Reranking VLMs.
device = "cuda" if torch.cuda.is_available() else "cpu"

if torch.cuda.is_available():
    try:
        # Explicitly access device 0 to avoid index errors on Colab T4
        props = torch.cuda.get_device_properties(0)
        print(f"[INFO] Local Inference Device: {device} ({props.name})")
        print(f"[INFO] VRAM Available: {props.total_memory / 1024**3:.2f} GB")
    except Exception as e:
        print(f"[ERROR] Could not access GPU: {e}")
else:
    print("[WARNING] CUDA not available. Defaulting to CPU (Slow!).")

[INFO] Local Inference Device: cuda (NVIDIA H100 80GB HBM3)
[INFO] VRAM Available: 79.18 GB


In [3]:
# API Key Setup
# We need an NVIDIA API Key for the Remote NIMs (Extraction & Generation).
if not os.environ.get("NVIDIA_API_KEY"):
    print("[ACTION REQUIRED] Please enter your NVIDIA API Key (starts with 'nvapi-').")
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter NVIDIA API Key: ")
else:
    print("[INFO] NVIDIA API Key detected in environment.")

[ACTION REQUIRED] Please enter your NVIDIA API Key (starts with 'nvapi-').


## Part 1: Intelligent Extraction Architecture

We use the **`nv-ingest` Library Mode** (NeMo Retriever Extraction library) to run the ingestion process. Unlike a monolithic script, this pipeline dispatches distinct workers (managed by **Ray**), each handling a specific task such as Optical Character Recognition (OCR), Table Structure Recognition (TSR), or Layout Analysis.

### High-Level Pipeline Flow

![](./IDP_flowchart.png)

### Library Mode vs Container Mode
For this demo, we run in Library Mode, but for enterprise deployment, Container Mode is recommended.

| Mode | Communication | Best For | Setup |
|------|--------------|----------|-------|
| **Library** (this notebook) | SimpleBroker (in-memory) | Development, <100 docs, Debugging | No Docker required, runs in local Python env |
| **Container** | Redis/Kafka | Production, horizontal scaling, Fault Tolerance | Docker/Kubernetes, separate control plane |


In [None]:
import requests
import time
from pathlib import Path
import base64
import io
from PIL import Image

# nv-ingest imports
from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
from nv_ingest_client.client import Ingestor, NvIngestClient
from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient

# 1. Download Sample PDF (World Bank Peru 2017)
PDF_URL = "https://documents1.worldbank.org/curated/en/484531533637705178/pdf/129273-WP-PUBLIC-Peru-2017.pdf"
PDF_PATH = "129273-WP-PUBLIC-Peru-2017.pdf"

if not Path(PDF_PATH).exists():
    print(f"[INFO] Downloading {PDF_PATH}...")
    response = requests.get(PDF_URL, timeout=60)
    response.raise_for_status()  # Fail early instead of saving an HTML error page as a PDF
    with open(PDF_PATH, "wb") as f:
        f.write(response.content)

# 2. Start nv-ingest in Library Mode
# This spins up the pipeline locally but uses remote NIMs by default for inference.
print("[INFO] Starting Ingestion Pipeline (Library Mode)...")
run_pipeline(
    block=False,
    disable_dynamic_scaling=True,
    run_in_subprocess=True,
    quiet=True
)
time.sleep(15) # Warmup

# 3. Configure Client
client = NvIngestClient(
    message_client_allocator=SimpleClient,
    message_client_port=7671, # Default LibMode port
    message_client_hostname="localhost"
)

print("[INFO] Submitting job...")

# 4. Extract Content
ingestor = (
    Ingestor(client=client)
    .files([PDF_PATH])
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,  # Triggers remote YOLOX for chart crops
        extract_images=False, # Focus on charts/tables for VLM
        extract_method="pdfium", # CPU-based text extraction
        table_output_format="markdown"
    )
)

job_results = ingestor.ingest()
extracted_data = job_results[0]
print(f"[SUCCESS] Extracted {len(extracted_data)} items from document.")

[INFO] Starting Ingestion Pipeline (Library Mode)...
DEBUG: Direct print to stdout - should appear in parent process


DEBUG: Direct write to stderr - should appear in parent process
SimpleMessageBroker port already in use at 0.0.0.0:7671; continuing to spawn a broker process (tests expect a Process to be returned)
Process Process-1:1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/chiachihc/Projects/venv/lib/python3.12/site-packages/nv_ingest/framework/orchestration/process/dependent_services.py", line 25, in _broker_server_target
    server = SimpleMessageBroker(host, port, max_queue_size)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chiachihc/Projects/venv/lib/python3.12/site-packages/nv_ingest_api/util/message_brokers/simple_message_broker/broker.py", line 445, in __init__
    super().__init__((host, port), SimpleMessageBrokerHandler)
  File "/usr/lib/pyt

[INFO] Submitting job...


[2026-01-20 19:49:36,446 E 2885417 2885472] (raylet) file_system_monitor.cc:116: /tmp/ray_spill_testing_0 is over 90% full, available space: 4.11696 GB; capacity: 44.0714 GB. Object creation will fail if spilling is required.


[SUCCESS] Extracted 23 items from document.



### Understanding the Extraction Pipeline Models

Let's examine the output. The **Enterprise Survey** is data-rich, and standard tools might turn its detailed Appendix tables into a soup of numbers. `nv-ingest` combines [NeMo Retriever OCR](https://build.nvidia.com/nvidia/nemoretriever-ocr-v1) with specialized detection and structure models, each with its own architecture, to maintain structural integrity:

| Model | Architecture | Technical Benefit |
|-------|--------------|---------------------|
| **page-elements** | **YOLOX** (Object Detection) | Distinguishes between prose, sidebars, and data tables. Prevents mixing unrelated text blocks. |
| **table-structure** | **Transformer** (TATR/DETR variant) | Understands grid topology. Preserves row/column alignment for financial figures in appendices. |
| **graph-elements** | **YOLOX** | Extracts data from performance bar charts and line graphs. |
| **OCR** | **CRNN** (Conv. Recurrent Neural Net) | Handles scanned text and chart labels where PDF text layers are missing. |

**Why This Matters**: `nv-ingest` preserves the structural integrity of these tables in Markdown format. Furthermore, for charts like "Figure 8: Reliability of electricity supply", the pipeline extracts the image itself. This allows our multimodal models to "see" the bar chart and answer questions about power outages that are not explicitly written in the text.
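
As a rough illustration of why Markdown output is useful downstream, the sketch below parses a pipe-delimited Markdown table back into row dictionaries. `parse_markdown_table` is a hypothetical helper written for this example, not an `nv-ingest` function, and it assumes a simple table with unique column headers.

```python
def parse_markdown_table(md: str):
    """Parse a simple pipe-delimited Markdown table into a list of dict rows."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # Skip captions or stray text around the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        # The separator row consists only of dashes, colons, and spaces
        if all(c and set(c) <= set("-: ") for c in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, r)) for r in body]


# Toy table (hypothetical values, for illustration only)
sample_md = """
| Indicator | Value |
| --- | --- |
| Row A | 1.5 |
| Row B | 2.0 |
"""

records = parse_markdown_table(sample_md)
print(records)
```

Real extracted tables can repeat header text across columns, in which case a dictionary keyed by header would collapse them; keeping rows as plain lists is the safer choice there.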

In [5]:
table_items = [item for item in extracted_data if item["metadata"]["content_metadata"].get("subtype") == "table"]

if table_items:
    # Let's inspect a table from the Appendix (usually near the end of the doc)
    sample = table_items[-1]["metadata"]
    print("="*60)
    print("EXTRACTED TABLE ITEM")
    print("="*60)
    print(f"Type: {sample['content_metadata']['subtype']}")
    print(f"Page: {sample['content_metadata']['page_number']}")

    image_meta = sample.get("image_metadata") or {}
    has_image = bool(image_meta.get("content"))

    print(f"Has Image Content? {has_image}")

    print("\nMarkdown Content (Snippet):")
    table_meta = sample.get("table_metadata") or {}
    content = table_meta.get("table_content", "")
    print(content[:300] + "...")

EXTRACTED TABLE ITEM
Type: table
Page: 3
Has Image Content? False

Markdown Content (Snippet):
Motor vehicles sales and repair
| business sector, firm size, and location. | business sector, firm size, and location. |
| --- | --- |
| Enterprise Surveys Figure 1: Sectors of the economy covered by the |  |
| Excluded | Included |
| SECTORS | SECTORS |
| Agriculture | Manufacturing (all subsector...


## Part 2: Local Multimodal Embedding

We now load **`nvidia/llama-nemotron-embed-vl-1b-v2`** locally. This embedding model was developed by NVIDIA for multimodal retrieval.

### Model Spotlight: llama-nemotron-embed-vl-1b-v2

* **Architecture**: Combined Language and Vision model (Transformer Encoder).
* **Input**: Text, Image (Text, Tables, Charts, Infographics), or Combined.
* **Output**: 2048-dimensional dense vector.
* **Use Case**: Ideal for vectorizing mixed-modality documents where visual cues (layout, color coding) carry semantic meaning.
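
The indexing code in this part L2-normalizes each embedding before storing it. A small pure-Python sketch of why that matters: once vectors have unit length, a plain dot product equals cosine similarity. The 2-dimensional toy vectors stand in for the model's 2048-dimensional output, and the helper names are illustrative.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

sim_from_normalized = dot(l2_normalize(a), l2_normalize(b))
sim_direct = cosine(a, b)
print(round(sim_from_normalized, 6), round(sim_direct, 6))  # 0.96 0.96
```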

**Configuration Note**: The code below includes a patch to handle hardware differences. It enables `flash_attention_2` on Ampere or newer GPUs (such as A100/H100) when the `flash_attn` package can be imported, and falls back to `eager` mode on T4 GPUs, or when the package is missing, to prevent crashes.
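
The selection rule can be summarized as a small pure function. This is a simplified sketch of the decision only: `choose_attention` is a hypothetical name, and it returns dtype names as strings so it runs without `torch` installed.

```python
def choose_attention(compute_major: int, flash_attn_installed: bool):
    """Return (dtype_name, attn_implementation) for a GPU compute capability.

    Mirrors the rule used in this notebook: bfloat16 and Flash Attention 2
    need compute capability 8.0+; Flash Attention 2 also needs the library.
    """
    dtype = "bfloat16" if compute_major >= 8 else "float16"
    if compute_major >= 8 and flash_attn_installed:
        return dtype, "flash_attention_2"
    return dtype, "eager"


print(choose_attention(9, True))   # Hopper-class GPU with flash-attn installed
print(choose_attention(9, False))  # Hopper-class GPU, flash-attn missing
print(choose_attention(7, True))   # Turing-class GPU (e.g., T4)
```

Note that a high-end GPU still ends up in `eager` mode when the `flash_attn` import fails, which matches the H100 run captured in this notebook's output.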

In [7]:
from transformers import AutoModel, AutoProcessor, AutoConfig
from pymilvus import MilvusClient
import numpy as np
import torch
import io
from PIL import Image
import base64

# Configuration
HF_EMBED_MODEL_ID = "nvidia/llama-nemotron-embed-vl-1b-v2"
COLLECTION_NAME = "worldbank_peru_2017"
MILVUS_URI = "milvus_wb_demo.db"

print(f"[INFO] Loading Embedding Model: {HF_EMBED_MODEL_ID}...")

# --- 1. HARDWARE-AWARE CONFIGURATION PATCH ---
# Check GPU capabilities to ensure T4 (Turing) stability and H100 (Hopper) performance.
use_flash_attn = False
model_dtype = torch.float32 # Default fallback

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"[INFO] GPU Compute Capability: {major}.{minor}")

    # Precision: Use bfloat16 for Ampere+ (8.0+), float16 for Turing (7.5) or older
    if major >= 8:
        model_dtype = torch.bfloat16
        print("[INFO] Using bfloat16 precision (Ampere+ detected).")
    else:
        model_dtype = torch.float16
        print("[INFO] Using float16 precision (Turing/Older detected).")

    # Attention: Flash Attention 2 requires Ampere+ (8.0+)
    if major >= 8:
        try:
            import flash_attn
            use_flash_attn = True
        except ImportError:
            pass

    if use_flash_attn:
        attn_impl = "flash_attention_2"
        print("[INFO] Flash Attention 2 enabled.")
        config = None
    else:
        attn_impl = "eager"
        reason = "Hardware incompatible (needs Ampere+)" if major < 8 else "Library not installed"
        print(f"[WARNING] Flash Attention 2 disabled: {reason}. Forcing 'eager' mode.")

        # --- CONFIG PATCH ---
        config = AutoConfig.from_pretrained(HF_EMBED_MODEL_ID, trust_remote_code=True)
        if hasattr(config, "llm_config"):
            original_config_class = config.llm_config.__class__
            class SafeConfig(original_config_class):
                def __setattr__(self, name, value):
                    if name == "_attn_implementation" and value == "flash_attention_2":
                        return
                    super().__setattr__(name, value)
            config.llm_config.__class__ = SafeConfig
            config.llm_config._attn_implementation = "eager"
else:
    attn_impl = "eager"
    config = None
    print("[WARNING] Running on CPU.")

# --- 2. LOAD MODEL ---
embed_model = AutoModel.from_pretrained(
    HF_EMBED_MODEL_ID,
    config=config,
    torch_dtype=model_dtype,
    trust_remote_code=True,
    attn_implementation=attn_impl,
    device_map="cuda:0" if torch.cuda.is_available() else "cpu"
).eval()

embed_processor = AutoProcessor.from_pretrained(
    HF_EMBED_MODEL_ID,
    trust_remote_code=True
)

# Configure processor
embed_model.processor.max_input_tiles = 6
embed_model.processor.use_thumbnail = True

def base64_to_pil(b64_str):
    if not b64_str:
        return None
    try:
        return Image.open(io.BytesIO(base64.b64decode(b64_str))).convert("RGB")
    except Exception:
        return None

# --- 3. VECTOR DB SETUP ---
milvus_client = MilvusClient(MILVUS_URI)
if milvus_client.has_collection(COLLECTION_NAME):
    milvus_client.drop_collection(COLLECTION_NAME)

milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    dimension=2048,
    auto_id=True
)

# --- 4. INFERENCE LOOP ---
vectors_to_insert = []
print("[INFO] Generating Multimodal Embeddings...")

# Define safe limits based on hardware to prevent OOM on T4
if use_flash_attn:
    LIMIT_TEXT = 8192
    LIMIT_MM = 10240
else:
    print("[INFO] T4/Eager mode active: Reducing embedding context window to 4096.")
    LIMIT_TEXT = 4096
    LIMIT_MM = 4096

with torch.inference_mode():
    for item in extracted_data:
        metadata = item.get("metadata", {})
        content_meta = metadata.get("content_metadata") or {}

        doc_type = content_meta.get("type", "text")
        subtype = content_meta.get("subtype", "")
        content_text = metadata.get("content", "")

        image_obj = None
        modality = "text"

        # Check for Chart/Table images
        b64_data = None
        if doc_type == "structured" or subtype in ["chart", "table"]:
            # Safe access to nested metadata
            image_meta = metadata.get("image_metadata") or {}
            if image_meta.get("content"):
                 b64_data = image_meta["content"]
            elif len(content_text) > 100 and " " not in content_text[:50]:
                 b64_data = content_text

            if b64_data:
                image_obj = base64_to_pil(b64_data)

            if image_obj:
                modality = "image"
                table_meta = metadata.get("table_metadata") or {}
                table_content = table_meta.get("table_content")
                if table_content:
                    content_text = table_content
                    modality = "image_text"

        if not content_text and not image_obj:
            continue

        # Encode
        try:
            if modality == "image":
                embed_model.processor.p_max_length = 2048
                embeddings = embed_model.encode_documents(images=[image_obj])
            elif modality == "image_text":
                embed_model.processor.p_max_length = LIMIT_MM
                embeddings = embed_model.encode_documents(images=[image_obj], texts=[content_text])
            else:
                embed_model.processor.p_max_length = LIMIT_TEXT
                embeddings = embed_model.encode_documents(texts=[content_text])

            # L2 Normalize
            def _l2_normalize(x: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
                return x / (x.norm(p=2, dim=-1, keepdim=True) + eps)

            vec = _l2_normalize(embeddings[0]).float().cpu().numpy().tolist()

            vectors_to_insert.append({
                "vector": vec,
                "text": content_text[:60000],
                "source": metadata.get("source_metadata", {}).get("source_name", "Unknown"),
                "page": content_meta.get("page_number", -1),
                "type": subtype or doc_type,
                "has_image": True if image_obj else False,
                "image_b64": b64_data if image_obj else ""
            })

            # Periodic cleanup for T4 fragmentation
            if len(vectors_to_insert) % 10 == 0:
                torch.cuda.empty_cache()

        except RuntimeError as e:
            if "out of memory" in str(e):
                print(f"[WARN] Skipped embedding item due to OOM (Length: {len(content_text)}). Clearing cache.")
                torch.cuda.empty_cache()
            else:
                raise  # Re-raise unrelated errors with the original traceback

milvus_client.insert(collection_name=COLLECTION_NAME, data=vectors_to_insert)
print(f"[SUCCESS] Indexed {len(vectors_to_insert)} items.")

[INFO] Loading Embedding Model: nvidia/llama-nemotron-embed-vl-1b-v2...
[INFO] GPU Compute Capability: 9.0
[INFO] Using bfloat16 precision (Ampere+ detected).
[INFO] Generating Multimodal Embeddings...
[INFO] T4/Eager mode active: Reducing embedding context window to 4096.
[SUCCESS] Indexed 23 items.


## Part 3: Local Multimodal Reranking

To reduce hallucinations caused by irrelevant context, we implement a **Retrieve-then-Rerank** strategy. We load **`nvidia/llama-nemotron-rerank-vl-1b-v2`** locally.

### Why Rerank with VLMs?

| Approach | How It Works | Accuracy | Speed |
|----------|--------------|----------|-------|
| **Dense Retrieval** | Query & docs encoded separately, similarity search | Good | Fast |
| **Reranking** | Query & each doc encoded together (cross-encoder) | Excellent | Slower |

Standard text rerankers lose the visual context. Our VLM Reranker is a **Cross-Encoder** based on the Eagle VLM architecture. It reads the Query and the Document (Text + Image) *together* to output a relevance score.

**Enterprise Value**: It can "look" at the image of the "Electricity Reliability" chart (Figure 8) and understand that the visual trend matches the user's question, even if the text description is sparse.
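
The two-stage idea can be sketched without any models. In the toy example below, a cheap word-overlap score plays the role of dense retrieval and a stricter phrase-match score plays the role of the cross-encoder. Both scoring functions and the sample documents are stand-ins invented for illustration; they are not how the Nemotron models score relevance.

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score, retrieve_k=3, final_k=2):
    """Two-stage search: a cheap scorer shortlists, a costlier scorer reorders the shortlist."""
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:retrieve_k]
    reranked = sorted(shortlist, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:final_k]


def word_overlap(query, doc):
    """Stage-1 stand-in: number of lowercase words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def phrase_match(query, doc):
    """Stage-2 stand-in: reward documents that contain the full query phrase."""
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return bonus + 0.1 * word_overlap(query, doc)


docs = [
    "power supply for electricity meters",
    "electricity supply reliability chart",
    "reliability of electricity supply by region",
    "survey methodology overview",
]

top = retrieve_then_rerank("electricity supply", docs, word_overlap, phrase_match)
print(top)
```

The first document ties with the others in stage 1 (it shares both query words) but drops out after reranking because it never contains the phrase itself. This is the kind of correction the second stage is there to make.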

In [None]:
import torch
import gc
from transformers import AutoModelForSequenceClassification, AutoProcessor, AutoConfig, AutoModel

# Configuration
HF_EMBED_MODEL_ID = "nvidia/llama-nemotron-embed-vl-1b-v2"
HF_RERANK_MODEL_ID = "nvidia/llama-nemotron-rerank-vl-1b-v2"

# --- SMART HARDWARE DETECTION ---
def get_hardware_strategy():
    """
    Determines precision, attention, tiling, and memory strategy.
    Checks BOTH hardware capability AND library availability.
    """
    strategy = {
        "dtype": torch.float32,
        "attn": "eager",
        "tiles": 1,
        "swap_models": True # Default to safe mode (swap)
    }
    
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        major, minor = props.major, props.minor
        vram_gb = props.total_memory / (1024**3)
        
        print(f"[HARDWARE] Detected: {props.name} | VRAM: {vram_gb:.1f}GB | Compute: {major}.{minor}")
        
        # Check if Flash Attention library is actually installed
        has_flash_lib = False
        try:
            import flash_attn
            has_flash_lib = True
        except ImportError:
            pass

        # 1. Architecture Strategy
        if major >= 8: # Ampere/Hopper (A100, H100)
            strategy["dtype"] = torch.bfloat16
            
            if has_flash_lib:
                strategy["attn"] = "flash_attention_2"
                strategy["tiles"] = 6 # Max resolution
                print("[CONFIG] Mode: High Performance (Flash Attention 2)")
            else:
                strategy["attn"] = "eager"
                strategy["tiles"] = 6 # H100 has 80GB, can handle high-res eager
                print("[CONFIG] Mode: Fallback (Eager Attention on High-End GPU)")
                
        else: # Turing (T4)
            strategy["dtype"] = torch.float16
            strategy["attn"] = "eager"
            strategy["tiles"] = 1 # Low resolution for memory safety
            print("[CONFIG] Mode: Low VRAM (T4 Optimization)")
            
        # 2. Memory Strategy (Swap vs Keep)
        # If we have > 24GB VRAM, we can keep both models loaded safely.
        if vram_gb > 24.0:
            strategy["swap_models"] = False
            print("[STRATEGY] High VRAM detected. Keeping models resident in memory.")
        else:
            strategy["swap_models"] = True
            print("[STRATEGY] Low VRAM detected. Enabling Aggressive Model Swapping.")
            
    return strategy

config_strat = get_hardware_strategy()

# Global placeholders
embed_model = None
rerank_model = None
rerank_processor = None

def apply_safe_config(config_obj):
    """
    Prevents the model from auto-enabling Flash Attention 
    when the library is installed but broken (missing nvcc) OR when on T4.
    """
    # If we decided on Eager, force the config to match
    if config_strat["attn"] == "eager":
        if hasattr(config_obj, "llm_config"):
            original_class = config_obj.llm_config.__class__
            
            # Create a "Safe" config class that ignores attempts to set FA2
            class SafeConfig(original_class):
                def __setattr__(self, name, value):
                    # Block attempts to enable FA2
                    if name == "_attn_implementation" and value == "flash_attention_2":
                        return 
                    super().__setattr__(name, value)
            
            # Apply the safe class to the config object
            config_obj.llm_config.__class__ = SafeConfig
            config_obj.llm_config._attn_implementation = "eager"
            
    return config_obj

def load_embed_model():
    global embed_model
    if embed_model is not None: return
    
    print("[LOAD] Loading Embedding Model...")
    cfg = AutoConfig.from_pretrained(HF_EMBED_MODEL_ID, trust_remote_code=True)
    cfg = apply_safe_config(cfg) 
    
    embed_model = AutoModel.from_pretrained(
        HF_EMBED_MODEL_ID,
        config=cfg,
        torch_dtype=config_strat["dtype"],
        trust_remote_code=True,
        attn_implementation=config_strat["attn"],
        device_map="cuda:0"
    ).eval()

def load_rerank_model():
    global rerank_model, rerank_processor
    if rerank_model is not None: return

    print(f"[LOAD] Loading Reranker (Tiles={config_strat['tiles']})...")
    cfg = AutoConfig.from_pretrained(HF_RERANK_MODEL_ID, trust_remote_code=True)
    cfg = apply_safe_config(cfg)
    
    rerank_model = AutoModelForSequenceClassification.from_pretrained(
        HF_RERANK_MODEL_ID,
        config=cfg,
        torch_dtype=config_strat["dtype"],
        trust_remote_code=True,
        attn_implementation=config_strat["attn"],
        device_map="cuda:0"
    ).eval()
    
    rerank_processor = AutoProcessor.from_pretrained(
        HF_RERANK_MODEL_ID, 
        trust_remote_code=True,
        max_input_tiles=config_strat["tiles"],
        use_thumbnail=True
    )

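def unload_model(model_ref):
    """Drop a model reference and release cached VRAM.

    The caller must rebind its own variable to the returned None
    (e.g. `embed_model = unload_model(embed_model)`); until that happens
    the weights are still referenced and stay allocated.
    """
    if model_ref is None:
        return None
    del model_ref
    gc.collect()              # Collect unreachable objects first...
    torch.cuda.empty_cache()  # ...then return freed blocks to the driver
    return None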

def retrieve_and_rerank(query, retrieve_k=20, final_k=5):
    global embed_model, rerank_model, rerank_processor
    
    # ---------------------------------------------------------
    # STAGE 1: DENSE RETRIEVAL
    # ---------------------------------------------------------
    load_embed_model()
    
    with torch.no_grad():
        q_emb = embed_model.encode_queries([query])[0].float().cpu().numpy().tolist()
    
    # If low VRAM, unload embedding model immediately
    if config_strat["swap_models"]:
        embed_model = unload_model(embed_model)
        
    hits = milvus_client.search(
        collection_name=COLLECTION_NAME,
        data=[q_emb],
        limit=retrieve_k,
        output_fields=["text", "page", "source", "type", "has_image", "image_b64"]
    )[0]
    
    if not hits:
        return [], []

    # ---------------------------------------------------------
    # STAGE 2: RERANKING
    # ---------------------------------------------------------
    load_rerank_model()
    
    rerank_inputs = []
    valid_hits = []
    
    for hit in hits:
        ent = hit['entity']
        img_obj = base64_to_pil(ent.get("image_b64")) if ent.get("has_image") else None
        # Truncate text if needed
        txt_limit = 1000 if config_strat["swap_models"] else 8000
        text_content = ent['text'][:txt_limit] if ent['text'] else ""
        
        rerank_inputs.append({
            "question": query,
            "doc_text": text_content,
            "doc_image": img_obj if img_obj else ""
        })
        valid_hits.append(hit)

    # Process
    all_scores = []
    # Batch size: 1 for T4 safety, 4 for H100 speed
    batch_size = 4 if not config_strat["swap_models"] else 1
    
    for i in range(0, len(rerank_inputs), batch_size):
        batch = rerank_inputs[i:i+batch_size]
        inputs = rerank_processor.process_queries_documents_crossencoder(batch)
        inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = rerank_model(**inputs)
            logits = outputs.logits.squeeze(-1).float().cpu().numpy()
            if logits.ndim == 0: all_scores.append(float(logits))
            else: all_scores.extend(logits.tolist())
            
        del inputs, outputs
        if config_strat["swap_models"] and i % 5 == 0: 
            torch.cuda.empty_cache()

    # If low VRAM, unload reranker
    if config_strat["swap_models"]:
        rerank_model = unload_model(rerank_model)
        rerank_processor = unload_model(rerank_processor)

    # ---------------------------------------------------------
    # SORT & RETURN
    # ---------------------------------------------------------
    for i, score in enumerate(all_scores):
        valid_hits[i]['score'] = score
    
    valid_hits.sort(key=lambda x: x['score'], reverse=True)
    return hits, valid_hits[:final_k]

print(f"[INFO] Universal Retrieval Pipeline Ready (Strategy: {config_strat['attn']}).")

[HARDWARE] Detected: NVIDIA H100 80GB HBM3 | VRAM: 79.2GB | Compute: 9.0
[CONFIG] Mode: Fallback (Eager Attention on High-End GPU)
[STRATEGY] High VRAM detected. Keeping models resident in memory.
[INFO] Universal Retrieval Pipeline Ready (Strategy: eager).


## Part 4: Citation-Backed Enterprise Intelligence (Cloud NIM)

We employ the [Llama-3.3-Nemotron-Super-49B](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1_5) model for response generation.

### Model Selection Rationale

This model is fine-tuned for high instruction adherence. We inject a **System Prompt** that enforces:

1.  **Verification**: Use a `<think>` block to internally verify facts against context
2.  **Strict Citation**: Every factual claim must reference a specific source ID
3.  **Knowledge Boundaries**: Admit ignorance rather than hallucinate
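
A prompt can request citations but cannot guarantee them, so a lightweight post-check may be useful. The sketch below extracts cited source IDs from an answer and flags any that fall outside the range of context chunks supplied. `check_citations` and the sample answer string are hypothetical; the regex assumes citations begin with `[Source ID: N`, as in the system prompt used in this part.

```python
import re

# Matches the start of a citation such as "[Source ID: 3, Page: 5]"
CITATION_RE = re.compile(r"\[Source ID:\s*(\d+)")


def check_citations(answer: str, num_sources: int):
    """Return (cited_ids, invalid_ids) given how many context chunks were provided."""
    cited = sorted({int(m) for m in CITATION_RE.findall(answer)})
    invalid = [i for i in cited if not 1 <= i <= num_sources]
    return cited, invalid


# Hypothetical model output, for illustration only
answer = "Answer: Claim one [Source ID: 2, Page: 5]. Claim two [Source ID: 7, Page: 9]."

cited, invalid = check_citations(answer, num_sources=5)
print(cited, invalid)  # [2, 7] [7]
```

An empty `cited` list or a non-empty `invalid` list would be a reasonable trigger for retrying the generation or flagging the answer for review.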

In [12]:
from openai import OpenAI
import re
from IPython.display import display, Markdown

def generate_enterprise_answer(query, hits):
    context_str = ""
    for i, hit in enumerate(hits):
        ent = hit['entity']
        type_label = ent['type'].upper()
        context_str += f"[Source ID: {i+1} | Type: {type_label} | Page: {ent['page']}]\n{ent['text']}\n\n"

    system_prompt = (
        "You are an expert Document Intelligence Analyst. Your goal is to extract precise insights from complex multimodal documents (text, charts, and tables).\n"
        "Answer based ONLY on the provided context chunks.\n\n"
        "GUIDELINES:\n"
        "1. **Start with a <think> block**: Before answering, explicitly reason through the following checks:\n"
        "   - **Visual Legend Check**: If analyzing a chart, identify the legend or color coding first. Ensure you are extracting data for the specific entity requested (e.g., the specific country/company vs. the benchmark/competitor).\n"
        "   - **Table Logic**: If extracting data from a table, look for 'Total', 'All', or 'Average' rows/columns before defaulting to granular sub-categories (like 'Small/Medium/Large').\n"
        "   - **Numerical Precision**: Be vigilant for decimal points and units (e.g., '0.5' vs '5', 'millions' vs 'billions'). If a number seems like an OCR error, verify it against surrounding text.\n"
        "   - **Temporal Filtering**: Ensure the data corresponds to the specific time period requested in the user's query.\n"
        "2. **Citation**: Every factual statement must cite its source exactly as provided (e.g., '[Source ID: 1, Page: 5]').\n"
        "3. **Honesty**: If the exact data point is not in the retrieval, state 'The provided context does not contain this specific data' rather than hallucinating.\n"
        "4. **Format**: After your <think> analysis, provide the final response starting with 'Answer:'."
    )

    client_llm = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key=os.environ["NVIDIA_API_KEY"]
    )

    response = client_llm.chat.completions.create(
        model="nvidia/llama-3.3-nemotron-super-49b-v1.5",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context_str}\n\nQuestion: {query}"}
        ],
        temperature=0.1
    )
    return response.choices[0].message.content

def display_formatted_response(question, raw_output):
    """Display response with collapsible reasoning."""
    safe_text = raw_output.replace('$', '&dollar;')
    thought_match = re.search(r'<think>(.*?)</think>', safe_text, re.DOTALL)

    if thought_match:
        reasoning = thought_match.group(1).strip()
        answer = safe_text.split('</think>')[-1].strip()
    else:
        reasoning = "No internal reasoning trace provided."
        answer = safe_text

    output_md = f"""
### Question
{question}

<details>
<summary><strong>View Reasoning Process</strong></summary>
<br>
<div style="padding-left: 15px; border-left: 3px solid #ddd; color: #555;">

{reasoning}

</div>
</details>

### Answer
{answer}
<hr style="margin-top: 20px; border: 0; border-top: 1px solid #eee;">
"""
    display(Markdown(output_md))
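
The **Strict Citation** rule is only a request to the model, so it can help to check the citations after generation. The sketch below is one possible check, not part of the original pipeline: `audit_citations` and `CITATION_RE` are hypothetical names, and the regex assumes the model cites in roughly the format shown in the system prompt (`[Source ID: 1, Page: 5]`) or in the context labels. It verifies that each cited Source ID exists in `hits` and that any cited page matches the retrieved chunk.

```python
import re

# Hypothetical helper: matches "Source ID: 3", "Source ID 3", and an optional
# ", Page: 11" or "| Type: TABLE | Page: 11" suffix.
CITATION_RE = re.compile(
    r"Source ID:?\s*(\d+)(?:\s*[,|]\s*(?:Type:\s*\w+\s*[,|]\s*)?Page:?\s*(\d+))?",
    re.IGNORECASE,
)


def audit_citations(answer, hits):
    """Return a list of citation problems in the final answer (empty if none)."""
    # Drop the reasoning trace so only the text after </think> is audited.
    final_text = answer.split("</think>")[-1]
    matches = list(CITATION_RE.finditer(final_text))
    if not matches:
        return ["no citations found"]

    problems = []
    for m in matches:
        source_id = int(m.group(1))
        # Source IDs are 1-based positions in the reranked hit list.
        if not 1 <= source_id <= len(hits):
            problems.append(f"Source ID {source_id} is out of range (1-{len(hits)})")
            continue
        cited_page = m.group(2)
        if cited_page is not None:
            expected_page = hits[source_id - 1]["entity"]["page"]
            # Compare as strings because the stored page may be an int or a str.
            if str(expected_page) != cited_page:
                problems.append(
                    f"Source ID {source_id} cites page {cited_page}, "
                    f"but the retrieved chunk is on page {expected_page}"
                )
    return problems


# Usage with made-up hits in the same shape generate_enterprise_answer expects.
sample_hits = [
    {"entity": {"type": "chart", "page": 5, "text": "Figure 7 ..."}},
    {"entity": {"type": "table", "page": 11, "text": "Performance ..."}},
]
print(audit_citations("Answer: 3.0% [Source ID: 1, Page: 5]", sample_hits))
```

An empty list means every citation resolved to a retrieved chunk. It does not prove that the cited chunk actually supports the claim, so this complements human review instead of replacing it.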

## Part 5: Live Demonstration

We will now run a Mixed-Source Q&A session. This demonstrates the pipeline's ability to seamlessly switch between:
1.  **General Text**: Summarizing survey methodology.
2.  **Chart Extraction**: Reading "Figure 7" and "Figure 8" directly.
3.  **Tabular Lookups**: Extracting precise financial and regulatory data from the dense Appendix.

**Note**: In this pipeline, if retrieval returns a **Chart**, the VLM Reranker scores the **Image** of that chart, not just its extracted text, when ranking it against the query.

In [13]:
questions = [
    # 1. Reading bar charts for growth stats
    "What are Peru’s annual employment growth and real annual sales growth shown in Figure 7?",

    # 2. Reading charts regarding infrastructure reliability
    "According to Figure 8, how many power outages do firms experience in a typical month in Peru, and what are the losses due to outages (as % of sales)?",

    # 3. Table Query: Column/Row lookup
    "In the Appendix finance table, what percent of firms in Peru 2017 have (a) a checking/savings account and (b) a bank loan/line of credit?",

    # 4. Table Query: Comparative analysis
    "In the Appendix 'Regulations and Taxes' table, how many days does it take (on average) to obtain an import license in Peru 2017, and how does that compare to Latin America & Caribbean?"
]

print("="*60)
print("Enterprise Intelligent Document Processing: LIVE DEMO")
print("="*60)

for q in questions:
    # 1. Retrieve & Rerank
    # We only want the final reranked results for the generation step
    _, hits = retrieve_and_rerank(q, retrieve_k=50, final_k=20)

    # 2. Generate
    if hits:
        # Generate raw text from the LLM
        raw_answer = generate_enterprise_answer(q, hits)

        # Format and display cleanly
        display_formatted_response(q, raw_answer)

    else:
        print(f"[No relevant data found for query: {q}]")

Enterprise Intelligent Document Processing: LIVE DEMO
[LOAD] Loading Embedding Model...
[LOAD] Loading Reranker (Tiles=6)...



### Question
What are Peru’s annual employment growth and real annual sales growth shown in Figure 7?

<details>
<summary><strong>View Reasoning Process</strong></summary>
<br>
<div style="padding-left: 15px; border-left: 3px solid #ddd; color: #555;">

Okay, let's tackle this question. The user is asking for Peru's annual employment growth and real annual sales growth as shown in Figure 7.

First, I need to locate Figure 7 in the provided context. Looking through the sources, Source ID: 1 is a chart on Page 5 titled "Figure 7: Annual employment and sales growth". That's the one we need.

Now, checking the data in Source ID: 1. The description mentions "Peru2017 - Latin America & Caribbean - Upper Middle Income" followed by a series of numbers: 3.0 - 4.2 - 4.2 - -2.1 - 2.3 - 0.1. 

The chart's title indicates it's about annual employment growth and real annual sales growth. The numbers listed here are likely the data points for these metrics. Since the user is asking for Peru's specific values, I need to map these numbers correctly.

Looking at the structure, the first part of the numbers might correspond to different years or categories. However, the way the data is presented here is a bit unclear. But given the context, the numbers after Peru2017 are probably the values for different years. However, the user is asking for the values shown in Figure 7, which might be the most recent or the specific ones highlighted.

Wait, in the Source ID: 1, the data line is: "3.0 - 4.2 - 4.2 - -2.1 - 2.3 - 0.1". Let me parse this. The chart's title mentions both employment growth and sales growth. The numbers might be alternating between employment and sales growth for different years. But without a clear legend, it's a bit tricky. However, looking at the numbers, the first number (3.0) could be employment growth, followed by sales growth (4.2), then another year's employment (4.2), sales (-2.1), etc. But that might not make sense. Alternatively, the first three numbers could be employment growth for three years, and the next three sales growth. But the user is asking for the values shown in Figure 7, which might be the latest year available.

Alternatively, checking the Table in Source ID: 15, under the "Performance" section, there's "Real annual sales growth (%)" and "Annual employment growth (%)". For Peru2017, the sales growth is listed as "-2.1 -2.2" and employment growth as "3.0". Wait, that might be conflicting. But the user specifically refers to Figure 7, so we should prioritize the chart data over the table unless there's a discrepancy.

But in Source ID: 1, the numbers are listed as 3.0, 4.2, 4.2, -2.1, 2.3, 0.1. Let me think. The chart's title is "Annual employment and sales growth", so perhaps each pair is employment and sales for a particular year. For example, 3.0 (employment) and 4.2 (sales) for one year, then 4.2 and -2.1 for another, etc. But the user is asking for the values shown in Figure 7, which might be the most recent or the ones highlighted. However, without a clear legend or axis labels, it's challenging. 

Alternatively, looking at the Table (Source ID: 15), under Peru2017, the Annual employment growth is 3.0, and Real annual sales growth is -2.1. These numbers match the first and fourth values in the Source ID: 1 data (3.0 and -2.1). This suggests that in Figure 7, the employment growth is 3.0% and sales growth is -2.1%. The other numbers might be for other years or regions. Since the user is asking for Peru's values as shown in Figure 7, and the Table confirms these numbers, it's safe to go with 3.0% for employment growth and -2.1% for sales growth.

Therefore, the answer should be annual employment growth of 3.0% and real annual sales growth of -2.1% for Peru in 2017.

</div>
</details>

### Answer
Answer: Peru’s annual employment growth is 3.0% and real annual sales growth is -2.1%, as shown in Figure 7 [Source ID: 1, Page: 5]. These values align with the data provided in the table under "Performance" for Peru2017 [Source ID: 15, Page: 11].
<hr style="margin-top: 20px; border: 0; border-top: 1px solid #eee;">



### Question
According to Figure 8, how many power outages do firms experience in a typical month in Peru, and what are the losses due to outages (as % of sales)?

<details>
<summary><strong>View Reasoning Process</strong></summary>
<br>
<div style="padding-left: 15px; border-left: 3px solid #ddd; color: #555;">

Okay, let's tackle this question. The user is asking about Figure 8, which is mentioned in the context. The question is about the number of power outages firms experience in a typical month in Peru and the losses due to those outages as a percentage of sales.

First, I need to find the correct source ID that corresponds to Figure 8. Looking through the provided context, Source ID: 1 is a chart on Page 6 titled "Figure 8: Reliability of electricity supply and related losses". That's the one we need.

Now, looking at the data provided in Source ID: 1. The chart mentions "No. of Power Outages" and "% Losses due to power outages". The data points listed are: Peru2017 - Latin America & Caribbean - Upper Middle Income with values 0.8 - 0.7 - 1.0 - 0.5 - 2.0 - 4.0 5. Wait, that's a bit confusing. Let me parse this.

The chart's description says "No. of Power Outages % Losses due to power outages - No. of power outages". It seems like there are two series here: one for the number of outages and another for the percentage losses. The values listed are a sequence of numbers. Let me check the structure again.

The line for Peru2017 has the numbers: 0.8 - 0.7 - 1.0 - 0.5 - 2.0 - 4.0 5. Hmm, maybe the first set of numbers corresponds to the number of outages and the second set to the percentage losses. But the way it's written is a bit unclear. Alternatively, perhaps the numbers are listed in order for each category. Let me think.

Looking at the chart's title, it's about reliability of electricity supply and related losses. The axes are "No. of Power Outages" and "% Losses due to power outages". The data points for Peru2017 are listed as 0.8, 0.7, 1.0, 0.5, 2.0, 4.0, and 5. Wait, that's seven numbers. But the chart might have two separate data series. Maybe the first three numbers are for the number of outages and the next three for the percentage losses, but there's a 5 at the end. Alternatively, perhaps the numbers are paired. Let me check other sources for consistency.

Looking at Source ID: 12, which is a table, under the Infrastructure section, there's a row for "Number of electrical outages in typical month" with Peru2017 value 0.5. Similarly, "Losses due to electrical outages (% of annual sales)" is 0.8 for Peru2017. This matches the first and fourth numbers in the list from Source ID: 1 (0.8, 0.7, 1.0, 0.5, 2.0, 4.0, 5). Wait, but in the table, the number of outages is 0.5 and losses are 0.8. However, in Source ID: 1, the numbers listed are 0.8, 0.7, 1.0, 0.5, 2.0, 4.0, 5. Maybe the first number (0.8) is the percentage loss, and the fourth number (0.5) is the number of outages. But that would conflict with the table. Alternatively, perhaps the chart's data is presented differently.

Wait, the chart's description says "No. of Power Outages % Losses due to power outages - No. of power outages Figure 8: Reliability of electricity supply and related losses % of Sales 0.0 - 0.2 - 0.4 - 0.6 - 0.8 - 1.0 - 1.2 Peru2017 - Latin America & Caribbean - Upper Middle Income 0.8 - 0.7 - 1.0 - 0.5 - 2.0 - 4.0 5". This is a bit jumbled. Let me parse it again.

The Y-axis is "% of Sales" with values from 0.0 to 1.2. The data points for Peru2017 are listed as 0.8, 0.7, 1.0, 0.5, 2.0, 4.0, and 5. But that doesn't align with the Y-axis scale. Alternatively, maybe the numbers correspond to different categories. Wait, perhaps the first three numbers (0.8, 0.7, 1.0) are for the number of outages, and the next three (0.5, 2.0, 4.0) are for the percentage losses, with 5 being something else. But this is unclear.

Alternatively, looking at the table in Source ID: 12, which is more detailed, the "Number of electrical outages in typical month" for Peru2017 is 0.5, and "Losses due to electrical outages (% of annual sales)" is 0.8. This would mean that the number of outages is 0.5 per month, and losses are 0.8% of sales. However, in Source ID: 1, the numbers listed are 0.8, 0.7, 1.0, 0.5, 2.0, 4.0, 5. If the chart is showing both the number of outages and the percentage losses, perhaps the first value (0.8) is the percentage loss, and the fourth value (0.5) is the number of outages. But that would mean the chart's data is ordered as [losses, ..., outages, ...]. Alternatively, maybe the numbers are in pairs. For example, 0.8 (losses) and 0.5 (outages), but the other numbers are for other regions or categories.

Alternatively, the numbers might be listed in the order of the chart's data series. Since the chart has two series: "No. of Power Outages" and "% Losses due to power outages", the data points for Peru2017 might be 0.5 for outages and 0.8 for losses. But in the provided context for Source ID: 1, the numbers are listed as 0.8 - 0.7 - 1.0 - 0.5 - 2.0 - 4.0 5. This is confusing. Let me check if there's a mention of the exact values in the chart's description.

Wait, the chart's title is "Reliability of electricity supply and related losses" and the data points are listed as "0.8 - 0.7 - 1.0 - 0.5 - 2.0 - 4.0 5". The Y-axis is "% of Sales" from 0.0 to 1.2. The numbers 0.8, 0.7, 1.0, 0.5 are within the Y-axis range, but 2.0, 4.0, and 5 are way above. That suggests that maybe the first four numbers are for the percentage losses (since they are within 0-1.2), and the higher numbers (2.0, 4.0, 5) might be for the number of outages, which would be on a different scale. However, the chart's description mentions both "No. of Power Outages" and "% Losses due to power outages", so perhaps the chart has two Y-axes: one for the number of outages (left) and one for percentage losses (right). 

In that case, the data points might be paired. For example, the first value (0.8) could be the percentage loss, and the next value (0.7) could be the number of outages. But that doesn't make sense. Alternatively, the numbers might be listed in the order of the categories. For example, the first three numbers (0.8, 0.7, 1.0) could be for different regions, and the next three (0.5, 2.0, 4.0) for another set. But this is speculative.

Alternatively, looking at the table in Source ID: 12, which is more reliable as it's a table with clear labels, the number of electrical outages in a typical month for Peru2017 is 0.5, and the losses due to outages are 0.8% of annual sales. This aligns with the user's question. However, the user specifically asked about Figure 8, so we need to check if the chart provides the same data.

But in the context for Source ID: 1, the data points are listed as "0.8 - 0.7 - 1.0 - 0.5 - 2.0 - 4.0 5". If we consider that the first number (0.8) is the percentage loss and the fourth number (0.5) is the number of outages, that would match the table. However, the other numbers (0.7, 1.0, 2.0, 4.0, 5) might be for other countries or categories like Latin America & Caribbean and Upper Middle Income. For example, the structure might be:

- Peru2017: 0.8 (losses), 0.5 (outages)
- Latin America & Caribbean: 0.7 (losses), 2.0 (outages)
- Upper Middle Income: 1.0 (losses), 4.0 (outages)
- And the 5 might be an outlier or another category.

But this is an assumption. Since the user is asking specifically about Peru2017, we can focus on the values associated with Peru2017. If the first value (0.8) is the percentage loss and the fourth value (0.5) is the number of outages, that would align with the table data. However, the chart's description might have the data points in a different order.

Alternatively, the numbers might be listed in the order of the chart's data series. For example, the first three numbers (0.8, 0.7, 1.0) could be the percentage losses for Peru, Latin America, and Upper Middle Income, and the next three (0.5, 2.0, 4.0) the number of outages for the same regions. Then the 5 might be an error or another value. If that's the case, then for Peru2017, the number of outages would be 0.5 and losses 0.8%, which matches the table.

Given that the table (Source ID: 12) provides clear values and is more structured, and the chart's data seems to align with those values when parsed correctly, the answer should be 0.5 outages per month and 0.8% losses. However, the user specifically asked about Figure 8, so we need to ensure that the chart indeed provides these numbers. Since the chart's context lists 0.8 and 0.5 among the values for Peru2017, and considering the table confirms this, it's safe to conclude that the answer is 0.5 power outages and 0.8% losses.

</div>
</details>

### Answer
Answer: According to Figure 8 (Source ID: 1, Page: 6), firms in Peru experience **0.5 power outages** in a typical month, and the **losses due to outages amount to 0.8% of annual sales**. These values align with the detailed data provided in the table (Source ID: 12, Page: 11) under the "Infrastructure" section for Peru2017. 

[Source ID: 1, Page: 6] lists the data points for Peru2017 as "0.8 - 0.7 - 1.0 - 0.5 - 2.0 - 4.0 5", where the fourth value (0.5) corresponds to the number of outages and the first value (0.8) corresponds to the percentage losses, consistent with the table. 

[Source ID: 12, Page: 11] explicitly states:  
- **Number of electrical outages in typical month**: 0.5  
- **Losses due to electrical outages (% of annual sales)**: 0.8  

Thus, the answer is derived from both the chart and the corroborating table data.
<hr style="margin-top: 20px; border: 0; border-top: 1px solid #eee;">



### Question
In the Appendix finance table, what percent of firms in Peru 2017 have (a) a checking/savings account and (b) a bank loan/line of credit?

<details>
<summary><strong>View Reasoning Process</strong></summary>
<br>
<div style="padding-left: 15px; border-left: 3px solid #ddd; color: #555;">

Okay, let's tackle this question. The user is asking about the percentage of firms in Peru 2017 that have (a) a checking/savings account and (b) a bank loan/line of credit, specifically from the Appendix finance table.

First, I need to locate the correct table. The context mentions an Appendix with tables containing indicator variables. Looking through the provided sources, Source ID: 8 is a table on Page 11 titled "ENTERPRISE SURVEYS" with columns for Peru2017, Small firms, Medium firms, Large firms, Latin America & Caribbean, and Upper Middle Income. The rows under "Finance" should have the required data.

In Source ID: 8, under the "Finance" section, there are two rows: "Percent of firms with checking or savings account" and "Percent of firms with bank loan/line of credit". The values for Peru2017 are listed as 96.2 and 77.8 respectively. 

I should verify that these are the correct entries. The question specifies the Appendix finance table, which matches Source ID: 8. The columns for Peru2017 are clearly labeled, so the numbers 96.2 and 77.8 are the percentages for parts (a) and (b). 

No other sources seem to have this exact data in a table format. The other sources are charts, which might have similar information but the user specifically asked for the Appendix table. Therefore, the answer should be based on Source ID: 8.

</div>
</details>

### Answer
Answer:  
(a) The percentage of firms in Peru 2017 with a checking/savings account is **96.2%** [Source ID: 8, Page: 11].  
(b) The percentage of firms in Peru 2017 with a bank loan/line of credit is **77.8%** [Source ID: 8, Page: 11].
<hr style="margin-top: 20px; border: 0; border-top: 1px solid #eee;">



### Question
In the Appendix 'Regulations and Taxes' table, how many days does it take (on average) to obtain an import license in Peru 2017, and how does that compare to Latin America & Caribbean?

<details>
<summary><strong>View Reasoning Process</strong></summary>
<br>
<div style="padding-left: 15px; border-left: 3px solid #ddd; color: #555;">

Okay, let's tackle this question. The user is asking about the average number of days to obtain an import license in Peru 2017 and how that compares to Latin America & Caribbean. They mentioned the 'Regulations and Taxes' table in the Appendix.

First, I need to find the correct table. Looking at the context chunks, Source ID 15 is a table on page 12 with the title "ENTERPRISE SURVEYS" and includes sections like "Informality", "Regulations and Taxes", "Corruption", etc. The "Regulations and Taxes" section should have the relevant data.

In the "Regulations and Taxes" part of the table, there's a row labeled "Days to obtain an import license". The columns are Peru2017, Small firms, Medium firms, Large firms, Latin America & Caribbean, and Upper Middle Income. The values listed under these columns are: 22.1, 32.9, 10.0, 26.8, 18.2, 16.8. 

Wait, the user is asking for the average days for Peru 2017. The first value under Peru2017 is 22.1. Then, the Latin America & Caribbean column has 18.2. So, Peru's average is 22.1 days, and Latin America & Caribbean is 18.2 days. That means Peru takes longer on average compared to the regional average.

I should check if there's any other mention of import licenses elsewhere, but the user specifically mentioned the Appendix table, so Source ID 15 is the right one. Also, the numbers in the chart (Source ID 3) mention days for import license as 22, but the table gives more precise numbers. Since the question refers to the table, the answer should come from there.

So the answer is Peru 2017: 22.1 days, Latin America & Caribbean: 18.2 days. Therefore, Peru takes about 4 days longer on average.

</div>
</details>

### Answer
Answer: According to the "Regulations and Taxes" table in the Appendix (Source ID: 15, Page: 12), the average number of days to obtain an import license in Peru 2017 is **22.1 days**. This compares to **18.2 days** for Latin America & Caribbean. Peru's average is thus approximately 4 days longer than the regional average. [Source ID: 15, Page: 12]
<hr style="margin-top: 20px; border: 0; border-top: 1px solid #eee;">


## Conclusion: The Business Value of IDP

This notebook demonstrated an **Intelligent Document Processing (IDP)** pipeline that addresses the "Last Mile" problem in enterprise data extraction: answering questions from charts and tables as well as body text.

### Key Capabilities Demonstrated

| Capability | Technology | Business Value |
|------------|------------|----------------|
| **Visual Intelligence** | VLM Embedding (Multimodal) | Unlocks data trapped in charts (Figures 7 & 8) |
| **Table Precision** | Markdown Parsing + VLM Reranking | Enables accurate financial lookups in appendices |
| **Auditability** | Chain-of-thought + Citations | Provides trusted, verifiable answers for compliance |

### Production Checklist

- [ ] **Deploy nv-ingest**: Switch from Library mode to Container mode using Docker/Kubernetes for horizontal scaling.
- [ ] **Scale Milvus**: Deploy a GPU-accelerated Milvus instance for million-scale vector collections.
- [ ] **Metadata Filtering**: Implement filters for multi-tenant access (e.g., filter by `department_id` or `security_level`).
- [ ] **Monitoring**: Add Prometheus/OpenTelemetry to track retrieval latency and token usage.
- [ ] **Document Versioning**: Implement a strategy to handle updates to regulatory documents.
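
As an illustration of the **Metadata Filtering** item, here is a minimal sketch of building a tenant-scoped filter string. It assumes each chunk was inserted with scalar `department_id` (string) and `security_level` (integer) fields, which this notebook does not create; `build_access_filter` is a hypothetical helper, and the quoting rules should be checked against the Milvus version you deploy.

```python
def build_access_filter(department_id, max_security_level):
    """Build a Milvus-style boolean filter expression for tenant-scoped retrieval."""
    # Reject bools explicitly: isinstance(True, int) is True in Python.
    if isinstance(max_security_level, bool) or not isinstance(max_security_level, int):
        raise TypeError("max_security_level must be an int")
    # Escape backslashes and double quotes so the value stays inside its string literal.
    safe_dept = department_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'department_id == "{safe_dept}" and security_level <= {max_security_level}'


# Example: restrict a search to the finance department, clearance level 2 or lower.
expr = build_access_filter("finance", 2)
print(expr)
```

The resulting string would be passed as the filter expression of the vector search (for example, the `filter` argument of a pymilvus `MilvusClient.search` call) so that restricted chunks never reach the reranker or the LLM.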

### Next Steps

- **Documentation**: [NeMo Retriever Docs](https://docs.nvidia.com/nemo/retriever/latest/)
- **NIMs Catalog**: [build.nvidia.com](https://build.nvidia.com)
- **GitHub**: [nv-ingest](https://github.com/NVIDIA/nv-ingest)
- **Milvus**: [milvus.io/docs](https://milvus.io/docs)

**You've learned how to build a production-ready Document Intelligence pipeline with Nemotron RAG!** 🚀