# Global Logistics Intelligence Hub — Demo

This notebook demonstrates the core capabilities of the RAG pipeline, in the order the sections below follow:
1. Sample documents standing in for ingested and parsed files
2. PII masking
3. Semantic chunking with parent-child indexing
4. Hybrid search (BM25 + vector)
5. End-to-end query answering with source attribution

In [None]:
import sys
sys.path.insert(0, "..")

from langchain_core.documents import Document
from src.processing.chunking import SemanticChunker
from src.processing.pii_masking import PIIMasker

## 1. Sample Supply Chain Documents

We'll create sample documents that mimic real logistics data.

In [None]:
sample_docs = [
    Document(
        page_content="""
Global Logistics Quarterly Report — Q3 2024

Ocean Freight Summary:
Ocean freight rates from Shanghai to Rotterdam have increased by 15% compared to Q2,
driven by Red Sea diversions and increased demand from European importers. Average
transit time is now 35 days via the Cape of Good Hope route, compared to 25 days
through the Suez Canal. Container availability remains tight with equipment imbalances
persisting across major trade lanes.

The Shanghai Containerized Freight Index (SCFI) reached 2,847 points, up from 2,150
in Q2. Spot rates for 40ft containers on the Asia-Europe route averaged $4,200 per FEU.

Air Cargo Summary:
Air cargo volumes grew 8% year-over-year with strong demand from e-commerce and
pharmaceutical sectors. Rate per kg from Hong Kong to Los Angeles averaged $4.85,
up from $4.20 in Q2. Capacity constraints at major hubs including Dubai (DXB) and
Singapore (SIN) contributed to rate pressures.

| Route | Mode | Transit Days | Rate Change Q/Q | Volume Change Y/Y |
| --- | --- | --- | --- | --- |
| Shanghai-Rotterdam | Ocean | 35 | +15% | +5% |
| Shenzhen-Los Angeles | Ocean | 18 | +12% | +8% |
| Hong Kong-LAX | Air | 2 | +15.5% | +8% |
| Dubai-London | Air | 1 | +10% | +6% |
| Singapore-Sydney | Ocean | 12 | +8% | +3% |
""",
        metadata={"source": "logistics_report_q3_2024.pdf", "page": 1}
    ),
    Document(
        page_content="""
Compliance & Customs Update — October 2024

EU Carbon Border Adjustment Mechanism (CBAM):
Starting January 2026, importers must purchase certificates for embedded carbon
in goods including steel, cement, aluminum, fertilizers, and electricity. During
the transitional phase (Oct 2023 - Dec 2025), importers must report embedded
emissions without purchasing certificates.

Key action items:
- Map all in-scope product categories against HS codes
- Collect verified emissions data from suppliers
- Register in the CBAM transitional registry
- Prepare for certificate purchasing starting 2026

US Trade Compliance:
New Section 301 tariff increases effective September 2024:
- Electric vehicles from China: 100% (up from 25%)
- Solar cells: 50% (up from 25%)
- Steel and aluminum: 25% (up from 0-7.5%)
- Lithium-ion EV batteries: 25% (up from 7.5%)
""",
        metadata={"source": "compliance_update_oct2024.pdf", "page": 1}
    ),
    Document(
        page_content="""
Warehouse Operations Report — Southeast Asia Hub

Contact: John Smith (john.smith@globallogistics.com), +65-9123-4567
Container Reference: MSCU7654321
Customs Declaration: MRN98765432109876

Inventory Levels:
The Singapore distribution center currently holds 45,000 SKUs across 3 warehouse
zones. Average pick-and-pack time is 4.2 minutes per order, with a 99.7% accuracy
rate. Cold chain storage utilization is at 87%, approaching capacity limits.

Key Metrics (September 2024):
- Inbound receipts: 12,450 containers
- Outbound shipments: 11,890 containers
- Average dwell time: 3.8 days
- Damage rate: 0.12%
- On-time dispatch: 96.5%
""",
        metadata={"source": "warehouse_ops_sea_sep2024.pdf", "page": 1}
    )
]

print(f"Loaded {len(sample_docs)} sample documents")
for doc in sample_docs:
    print(f"  - {doc.metadata['source']}: {len(doc.page_content)} chars")

## 2. PII Masking

Before processing, we mask personally identifiable information and logistics-specific identifiers.
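
To make the idea concrete, here is a minimal sketch of placeholder-based masking with an audit mapping. It is not how `PIIMasker` is implemented (its internals are not shown in this notebook). The function `mask_text`, the three regular expressions, and the `<LABEL_n>` placeholder format are all assumptions chosen for illustration.

```python
import re

# Illustrative patterns only (assumptions); a real detector covers far more cases.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+\d{1,3}(?:-\d{2,4}){2,3}"),
    "CONTAINER_ID": re.compile(r"\b[A-Z]{4}\d{7}\b"),
}


def mask_text(text: str) -> tuple[str, dict[str, str]]:
    """Replace each match with a numbered placeholder and record the mapping."""
    mapping: dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        counter = 0

        def substitute(match: re.Match) -> str:
            nonlocal counter
            counter += 1
            placeholder = f"<{label}_{counter}>"
            mapping[placeholder] = match.group(0)  # kept for audit
            return placeholder

        text = pattern.sub(substitute, text)
    return text, mapping


masked, mapping = mask_text(
    "Contact: john.smith@globallogistics.com, +65-9123-4567, container MSCU7654321"
)
print(masked)
for placeholder, original in mapping.items():
    print(f"{placeholder} -> {original}")
```

The mapping holds the original values, so in practice it would need the same access controls as the unmasked documents.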

In [None]:
masker = PIIMasker()

# Demonstrate on the warehouse document (contains PII)
warehouse_doc = sample_docs[2]
result = masker.mask(warehouse_doc.page_content)

print("=== Entities Detected ===")
for entity in result.entities_found:
    print(f"  {entity['entity_type']}: score={entity['score']}")

print("\n=== Entity Mapping (for audit) ===")
for placeholder, original in result.entity_mapping.items():
    print(f"  {placeholder} → {original}")

print("\n=== Masked Text (first 500 chars) ===")
print(result.masked_text[:500])

## 3. Semantic Chunking

Documents are split into parent chunks (for context) and child chunks (for retrieval), with tables preserved intact.
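
The parent-child relationship can be sketched as follows. This is a simplified illustration that splits on fixed character windows rather than semantic boundaries and ignores tables, so it is not what `SemanticChunker` does internally. The `Chunk` dataclass and the helpers `split_fixed` and `parent_child_chunks` are hypothetical names; only the `chunk_id` and `parent_id` metadata keys mirror the ones used later in this notebook.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_fixed(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def parent_child_chunks(text, parent_size=1000, child_size=400, child_overlap=50):
    """Build large parent chunks, then small child chunks that point back to them."""
    parents, children = [], []
    for parent_text in split_fixed(text, parent_size):
        parent_id = str(uuid.uuid4())
        parents.append(Chunk(parent_text, {"chunk_id": parent_id}))
        for child_text in split_fixed(parent_text, child_size, child_overlap):
            children.append(Chunk(child_text, {"parent_id": parent_id}))
    return parents, children


parents, children = parent_child_chunks("x" * 2500)
print(f"{len(parents)} parents, {len(children)} children")
```

Small children keep the embeddings focused for retrieval, while the `parent_id` link lets a later stage hand the larger parent text to the LLM.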

In [None]:
chunker = SemanticChunker(
    parent_chunk_size=1000,
    child_chunk_size=400,
    child_chunk_overlap=50,
)

# Apply PII masking first
masked_docs = []
for doc in sample_docs:
    mask_result = masker.mask(doc.page_content)
    masked_docs.append(Document(
        page_content=mask_result.masked_text,
        metadata=dict(doc.metadata),  # copy so the masked doc does not share the original's dict
    ))

chunk_result = chunker.chunk_documents(masked_docs)

print(f"Parent chunks: {len(chunk_result.parent_chunks)}")
print(f"Child chunks:  {len(chunk_result.child_chunks)}")
print("\n=== Sample Parent Chunk ===")
print(f"Content ({len(chunk_result.parent_chunks[0].page_content)} chars):")
print(chunk_result.parent_chunks[0].page_content[:300])
print(f"\nMetadata: {chunk_result.parent_chunks[0].metadata}")

In [None]:
# Show table chunks
table_chunks = [c for c in chunk_result.parent_chunks if c.metadata.get("content_format") == "table"]
print(f"Table chunks found: {len(table_chunks)}")
if table_chunks:
    print("\n=== Table Chunk ===")
    print(table_chunks[0].page_content)

In [None]:
# Verify parent-child linkage
parent_ids = {p.metadata["chunk_id"] for p in chunk_result.parent_chunks}
orphaned = [c for c in chunk_result.child_chunks if c.metadata["parent_id"] not in parent_ids]
print(f"Total children: {len(chunk_result.child_chunks)}")
print(f"Orphaned children (should be 0): {len(orphaned)}")

## 4. Hybrid Search

Index the chunks and run hybrid queries combining BM25 keyword matching with dense vector similarity.
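
How `HybridSearchEngine` combines the two rankings is not shown in this notebook. One common approach, consistent with the `bm25_weight` and `semantic_weight` parameters and the RRF fusion named in the summary table, is weighted reciprocal rank fusion. The sketch below is an assumption about that idea, not the engine's code. The function `weighted_rrf`, the chunk ids, and the constant `k=60` are illustrative.

```python
def weighted_rrf(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list adds weight / (k + rank) to an id's score."""
    scores: dict[str, float] = {}
    for name, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[name] / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from a keyword retriever and a vector retriever.
rankings = {
    "bm25": ["c1", "c2", "c3"],
    "semantic": ["c2", "c3", "c1"],
}
fused = weighted_rrf(rankings, {"bm25": 0.3, "semantic": 0.7})
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because the fusion uses ranks rather than raw scores, BM25 scores and cosine similarities never need to be put on a common scale.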

In [None]:
from src.processing.embeddings import EmbeddingService
from src.vectorstore.hybrid_search import HybridSearchEngine

# Initialize embedding service (uses sentence-transformers locally)
embedding_service = EmbeddingService()
print(f"Embedding backend: {embedding_service.backend}")

# Create hybrid search engine
search_engine = HybridSearchEngine(
    embedding_service=embedding_service,
    bm25_weight=0.3,
    semantic_weight=0.7,
)

# Index all child chunks (used for retrieval)
all_chunks = chunk_result.child_chunks
search_engine.add_documents(all_chunks)
print(f"Indexed {len(all_chunks)} chunks")

In [None]:
# Run sample queries
queries = [
    "What are the current ocean freight rates from Shanghai to Rotterdam?",
    "What is the CBAM regulation and when does it take effect?",
    "What is the average dwell time at the Singapore warehouse?",
]

for query in queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print(f"{'='*60}")
    results = search_engine.search(query, top_k=3)
    for i, (doc, score) in enumerate(results, 1):
        print(f"\n  [{i}] Score: {score:.4f} | Source: {doc.metadata.get('source', 'unknown')}")
        print(f"      {doc.page_content[:150]}...")

## 5. End-to-End RAG Query

Full pipeline: query → hybrid retrieval → context expansion → LLM generation → source attribution.
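
The context-expansion and source-attribution steps can be sketched with plain dictionaries. This is a hedged illustration of the idea behind `expand_to_parent=True`, not the `HybridRetriever` or `RAGChain` implementation. The helpers `expand_to_parents` and `collect_sources` and the sample texts are hypothetical.

```python
def expand_to_parents(hits: list[dict], parents_by_id: dict[str, dict]) -> list[dict]:
    """Swap each child hit for its parent, keeping each parent only once."""
    expanded, seen = [], set()
    for hit in hits:
        parent_id = hit["metadata"].get("parent_id")
        parent = parents_by_id.get(parent_id)
        if parent is None:
            expanded.append(hit)  # no known parent: keep the hit itself
        elif parent_id not in seen:
            seen.add(parent_id)
            expanded.append(parent)
    return expanded


def collect_sources(docs: list[dict]) -> list[dict]:
    """Return unique (source, page) pairs in first-seen order."""
    sources, seen = [], set()
    for doc in docs:
        key = (doc["metadata"].get("source"), doc["metadata"].get("page"))
        if key not in seen:
            seen.add(key)
            sources.append({"source": key[0], "page": key[1]})
    return sources


parents_by_id = {
    "p1": {"text": "Ocean freight section", "metadata": {"chunk_id": "p1", "source": "logistics_report_q3_2024.pdf", "page": 1}},
    "p2": {"text": "Compliance section", "metadata": {"chunk_id": "p2", "source": "compliance_update_oct2024.pdf", "page": 1}},
}
hits = [
    {"text": "rates increased", "metadata": {"parent_id": "p1"}},
    {"text": "transit time", "metadata": {"parent_id": "p1"}},
    {"text": "CBAM certificates", "metadata": {"parent_id": "p2"}},
]
context = expand_to_parents(hits, parents_by_id)
sources = collect_sources(context)
print(len(context), "context chunks from", [s["source"] for s in sources])
```

Deduplicating parents matters because several child hits often come from the same parent, and repeating that text would waste the LLM's context window.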

> **Note:** The generation step requires an OpenAI API key configured in `.env`. Without one, the cell catches the error and prints the retrieved context instead.

In [None]:
from src.rag.retriever import HybridRetriever
from src.rag.chain import RAGChain

# Build a second engine that indexes parent and child chunks together for context expansion.
# No weights are passed here, so the engine's defaults apply.
search_engine_full = HybridSearchEngine(embedding_service=embedding_service)
all_indexed = chunk_result.parent_chunks + chunk_result.child_chunks
search_engine_full.add_documents(all_indexed)

retriever = HybridRetriever(
    search_engine=search_engine_full,
    top_k=5,
    expand_to_parent=True,
)

# Define the query up front so the fallback path can reuse it
query = "How have ocean freight rates changed and what routes are most affected?"

try:
    rag_chain = RAGChain(retriever=retriever)
    response = rag_chain.invoke(query)

    print(f"Query: {response.query}")
    print(f"\n{'='*60}")
    print(f"Answer:\n{response.answer}")
    print(f"\n{'='*60}")
    print("Sources:")
    for src in response.sources:
        print(f"  - {src['source']} (page {src.get('page', 'N/A')})")
except Exception as e:
    # Any failure lands here; a missing API key is one likely cause
    print(f"LLM generation failed (is the API key configured?). Error: {e}")
    print("\nRetrieval still works. Showing retrieved context:")
    docs = retriever.invoke(query)
    for i, doc in enumerate(docs, 1):
        print(f"\n[{i}] {doc.metadata.get('source', 'unknown')}")
        print(f"    {doc.page_content[:200]}...")

## Summary

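This demo walked through PII masking, chunking, hybrid search, and RAG querying on in-memory sample documents. The table lists every pipeline stage, including the ingestion loaders and the API, which this notebook does not exercise: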

| Stage | Component | What it does |
|-------|-----------|------|
| Ingestion | `PDFLoader`, `ExcelLoader` | Parse documents with table awareness |
| PII Masking | `PIIMasker` | Redact sensitive data with audit trail |
| Chunking | `SemanticChunker` | Parent-child hierarchy with table preservation |
| Search | `HybridSearchEngine` | BM25 + dense vector with RRF fusion |
| RAG | `RAGChain` | LCEL chain with source attribution |
| API | `FastAPI` | REST endpoints for production deployment |