# 📥 Week 07-08 · Notebook 05 · Document Loaders & Compliance Filters

Ingest manufacturing knowledge sources safely using LangChain loaders and custom sanitizers.

## 🎯 Learning Objectives
- Load SOP manuals (PDF), maintenance logs (CSV), and safety bulletins (HTML).
- Enrich documents with plant metadata for downstream retrieval.
- Apply PII redaction and policy filters during ingestion.
- Validate the ingestion pipeline and capture evidence for governance review.

## 🧩 Scenario
Corporate legal mandates that all maintenance tickets older than 365 days be excluded from RAG context. Build loaders that enforce the rule and tag documents by plant.

In [None]:
from datetime import datetime, timedelta, timezone
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, CSVLoader, UnstructuredHTMLLoader
import re

# --- Configuration ---
# Retention cutoff required by legal: tickets older than 365 days are excluded.
# Kept as a naive UTC datetime; datetime.utcnow() is deprecated since Python 3.12.
CUTOFF_DATE = datetime.now(timezone.utc).replace(tzinfo=None) - timedelta(days=365)
DATA_DIR = Path('data')

# --- Helper Functions ---

def load_and_filter_csv(path: Path, metadata: dict):
    """
    Loads a CSV file and filters records based on a timestamp.
    Assumes a 'timestamp_utc' column in ISO format.
    """
    loader = CSVLoader(
        file_path=str(path),
        csv_args={'delimiter': ','},
        source_column='ticket_id' # Use ticket_id as the source identifier
    )
    docs = loader.load()
    
    filtered_docs = []
    for doc in docs:
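        # CSVLoader renders each row into page_content as "column: value" lines,
        # so the timestamp has to be parsed back out of the text.
        # This is a simplification; a real implementation might use a more robust parser.
        try:
            # The hyphen is placed last in the character class so it is treated as a literal.
            match = re.search(r"timestamp_utc: ([\dT:.Z+-]+)", doc.page_content)
            if match is None:
                raise ValueError("timestamp_utc not found")
            timestamp = datetime.fromisoformat(match.group(1).replace('Z', '+00:00'))
            if timestamp.tzinfo is not None:
                # Shift to UTC, then drop tzinfo so it compares with the naive UTC cutoff.
                timestamp = (timestamp - timestamp.utcoffset()).replace(tzinfo=None)
            if timestamp >= CUTOFF_DATE: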
                    doc.metadata.update(metadata)
                    filtered_docs.append(doc)
        except (ValueError, TypeError):
            # Handle cases where the timestamp is missing or malformed
            print(f"Skipping record due to missing/malformed timestamp in {doc.metadata.get('source')}")
            continue
            
    return filtered_docs

def load_generic(loader_cls, path: Path, metadata: dict):
    """Generic loader for file types that don't need special filtering."""
    loader = loader_cls(str(path))
    docs = loader.load()
    for doc in docs:
        doc.metadata.update(metadata)
    return docs

def redact_pii(text: str) -> str:
    """A simple PII redactor. In production, use a more advanced tool like Presidio."""
    # This regex looks for names like "Technician John Doe"
    return re.sub(r"Technician\s+[A-Z][a-z]+\s+[A-Z][a-z]+", "Technician [REDACTED]", text)

# --- Data Ingestion Pipeline ---

# Define file paths
pdf_path = DATA_DIR / 'sop_manuals' / 'press_safety.pdf'
csv_path = DATA_DIR / 'logs' / 'maintenance_logs.csv'
html_path = DATA_DIR / 'bulletins' / 'ehs_update.html'

# Load documents from different sources with appropriate metadata
pdf_docs = load_generic(PyPDFLoader, pdf_path, {'plant': 'PNQ', 'doc_type': 'SOP'})
log_docs = load_and_filter_csv(csv_path, {'plant': 'PNQ', 'doc_type': 'ticket'})
bulletin_docs = load_generic(UnstructuredHTMLLoader, html_path, {'plant': 'PNQ', 'doc_type': 'ehs'})

print(f'Loaded {len(pdf_docs)} SOP pages.')
print(f'Loaded {len(log_docs)} recent maintenance tickets (older than 1 year excluded).')
print(f'Loaded {len(bulletin_docs)} HTML notices.')
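
The inline comment in `load_and_filter_csv` notes that the regex lookup is a simplification. One possible alternative, sketched here, splits the `column: value` lines that `CSVLoader` writes into `page_content` and applies the 365-day rule in a separate function. The helper names `parse_page_content` and `is_within_retention` are hypothetical, the sample row is made up, and the sketch assumes one field per line and that naive timestamps are UTC.

```python
from datetime import datetime, timedelta, timezone


def parse_page_content(text: str) -> dict:
    """Split 'column: value' lines into a dict (assumes one field per line)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if sep:  # ignore lines that do not have a "key: value" shape
            fields[key.strip()] = value.strip()
    return fields


def is_within_retention(timestamp_str: str, now: datetime, max_age_days: int = 365) -> bool:
    """Return True if the ISO timestamp is at most max_age_days older than `now`.

    `now` must be timezone-aware.
    """
    ts = datetime.fromisoformat(timestamp_str.replace('Z', '+00:00'))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return ts >= now - timedelta(days=max_age_days)


# Usage with a made-up row in the shape CSVLoader produces.
sample = "ticket_id: T-100\ntimestamp_utc: 2024-05-01T08:30:00Z\nnotes: Replaced belt"
row = parse_page_content(sample)
now = datetime(2024, 12, 1, tzinfo=timezone.utc)
print(row['timestamp_utc'], is_within_retention(row['timestamp_utc'], now))
```

Passing `now` explicitly keeps the retention check deterministic, which makes the rule easier to unit test than a module-level cutoff.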

In [None]:
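from datetime import timezone

def sanitize_documents(docs):
    """Redact PII in place and stamp every document with the same UTC ingestion time."""
    # Compute the timestamp once so the whole batch shares one ingested_at value.
    ingested_at = datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z')
    for doc in docs:
        doc.page_content = redact_pii(doc.page_content)
        doc.metadata['ingested_at'] = ingested_at
    return docs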

sanitized_docs = sanitize_documents(pdf_docs + log_docs + bulletin_docs)
sanitized_docs[:2]
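
Because `redact_pii` only matches names shaped like `Technician First Last`, it can help to audit the output with a looser pattern than the one used for redaction. The sketch below is one way to do that: `find_leaks` and `AUDIT_PATTERN` are hypothetical names, the sample sentences are made up, and the audit rule (any word after "Technician" other than `[REDACTED]`) is an assumption about what counts as a leak.

```python
import re

# Same pattern as redact_pii in the notebook.
NAME_PATTERN = re.compile(r"Technician\s+[A-Z][a-z]+\s+[A-Z][a-z]+")
# Looser audit rule: "Technician" followed by anything except the redaction marker.
AUDIT_PATTERN = re.compile(r"Technician\s+(?!\[REDACTED\])\S+")


def redact_pii(text: str) -> str:
    return NAME_PATTERN.sub("Technician [REDACTED]", text)


def find_leaks(texts):
    """Return texts where 'Technician' is still followed by a non-redacted token."""
    return [t for t in texts if AUDIT_PATTERN.search(t)]


# Made-up samples; the last two use name shapes the redaction pattern does not cover.
samples = [
    "Technician John Doe replaced the hydraulic seal.",
    "Inspected by Technician Mary O'Neil.",
    "Technician JOHN DOE signed off.",
]
redacted = [redact_pii(s) for s in samples]
leaks = find_leaks(redacted)
print(f"{len(leaks)} of {len(redacted)} samples still expose a name")
```

Records flagged this way could be routed to manual review or to a stronger tool such as Presidio, as the `redact_pii` docstring suggests.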

## 🔐 Compliance Checklist
- Tickets older than 365 days removed (`timestamp_utc` filter).
- PII sanitized using the `redact_pii` function (covers `Technician First Last` names only).
- Metadata tags (`plant`, `doc_type`, `ingested_at`) applied.
- Evidence exported to governance storage.
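
The checklist items above can be summarised in a small evidence record before anything is sent to governance storage. This is a minimal sketch, not a prescribed format: `build_evidence`, `REQUIRED_TAGS`, and the output field names are hypothetical, and the metadata rows are made up to stand in for `[doc.metadata for doc in sanitized_docs]`.

```python
import json
from collections import Counter
from datetime import datetime, timezone

REQUIRED_TAGS = ('plant', 'doc_type', 'ingested_at')


def build_evidence(metadata_rows, cutoff_date):
    """Summarise a run from a list of metadata dicts (one per document)."""
    # Indexes of documents that lack at least one required tag.
    missing = [
        i for i, meta in enumerate(metadata_rows)
        if any(tag not in meta for tag in REQUIRED_TAGS)
    ]
    return {
        'generated_at': datetime.now(timezone.utc).isoformat(),
        'cutoff_date': cutoff_date.isoformat(),
        'document_count': len(metadata_rows),
        'by_doc_type': dict(Counter(m.get('doc_type', 'unknown') for m in metadata_rows)),
        'missing_required_tags': missing,
    }


# Made-up metadata rows for illustration.
rows = [
    {'plant': 'PNQ', 'doc_type': 'SOP', 'ingested_at': '2024-12-01T00:00:00Z'},
    {'plant': 'PNQ', 'doc_type': 'ticket', 'ingested_at': '2024-12-01T00:00:00Z'},
    {'plant': 'PNQ', 'doc_type': 'ticket'},  # deliberately missing ingested_at
]
evidence = build_evidence(rows, datetime(2023, 12, 2))
print(json.dumps(evidence, indent=2))
```

A non-empty `missing_required_tags` list is a natural condition for failing the run before export.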

## 🧪 Lab Assignment
1. Extend the loader to ingest CAD manuals (DOCX) via Unstructured.
2. Plug in the Great Expectations suite from Week 05 to validate metadata fields.
3. Export sanitized documents to `data/processed/{plant}` with a checksum manifest.
4. Document the ingestion run (dataset, operator, duration) and attach it to the change ticket.
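
As a starting point for the export step, the sketch below writes each document under a per-plant directory and records a SHA-256 digest per file. The function name `export_with_manifest`, the record shape (`plant`, `name`, `text`), and the `manifest.json` file name are all assumptions for illustration; in the notebook the records would be built from `doc.metadata['plant']` and `doc.page_content`.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def export_with_manifest(records, out_root):
    """Write records to out_root/<plant>/<name> and one manifest.json per plant."""
    manifests = {}
    for rec in records:
        plant_dir = Path(out_root) / rec['plant']
        plant_dir.mkdir(parents=True, exist_ok=True)
        data = rec['text'].encode('utf-8')
        (plant_dir / rec['name']).write_bytes(data)
        # Hash exactly the bytes written so the manifest can be re-verified later.
        manifests.setdefault(rec['plant'], {})[rec['name']] = hashlib.sha256(data).hexdigest()
    for plant, entries in manifests.items():
        manifest_path = Path(out_root) / plant / 'manifest.json'
        manifest_path.write_text(json.dumps(entries, indent=2, sort_keys=True), encoding='utf-8')
    return manifests


# Usage with made-up records and a temporary directory in place of data/processed.
records = [
    {'plant': 'PNQ', 'name': 'ticket_0001.txt', 'text': 'Technician [REDACTED] replaced belt.'},
    {'plant': 'PNQ', 'name': 'sop_page_001.txt', 'text': 'Lock out the press before servicing.'},
]
with tempfile.TemporaryDirectory() as tmp:
    result = export_with_manifest(records, tmp)
    written = sorted(p.name for p in (Path(tmp) / 'PNQ').iterdir())
print(written)
```

Verification is the reverse operation: re-hash each file and compare against the manifest entry.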

## ✅ Checklist
- [ ] Datasets ingested with metadata
- [ ] PII redaction validated
- [ ] Cutoff policy enforced
- [ ] Evidence archived

## 📚 References
- LangChain Document Loaders Documentation
- Corporate Data Retention Policy 2024
- Week 05 Data Intake Notebook