# 📥 Week 07-08 · Notebook 05: Document Loaders & The Ingestion Pipeline

**Objective:** Build a robust and compliant data ingestion pipeline for a RAG application by loading, transforming, and enriching documents from various sources.

In this notebook, we will simulate a real-world scenario where a manufacturing company needs to build a knowledge base for a maintenance chatbot. The process of preparing data for a RAG system is known as the **ingestion pipeline**. It involves several critical steps:

1.  **Loading:** Reading raw data from different file formats (PDFs, CSVs, web pages).
2.  **Transforming:** Cleaning, filtering, and modifying the content (e.g., redacting sensitive information).
3.  **Enriching:** Adding valuable metadata to each document for better retrieval and context.

We will use LangChain's powerful `DocumentLoader` ecosystem to handle this process efficiently and apply custom logic to meet strict business and legal requirements.
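
As a rough mental model, the three steps above can be sketched as small functions chained in order. This is only an illustration: `Doc` is a hypothetical stand-in for LangChain's `Document`, and `load`, `transform`, `enrich`, and `run_pipeline` are names invented here, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Each stage takes a list of documents and returns a (possibly shorter) list.
Stage = Callable[[list[Doc]], list[Doc]]


def load() -> list[Doc]:
    # Loading: in the notebook this is done by file-specific loaders.
    return [Doc("Pump P-1 inspected.  "), Doc("   ")]


def transform(docs: list[Doc]) -> list[Doc]:
    # Transforming: strip whitespace and drop documents that end up empty.
    cleaned = [Doc(d.page_content.strip(), dict(d.metadata)) for d in docs]
    return [d for d in cleaned if d.page_content]


def enrich(docs: list[Doc]) -> list[Doc]:
    # Enriching: attach metadata used later for filtering and traceability.
    for d in docs:
        d.metadata["plant"] = "PNQ"
    return docs


def run_pipeline(docs: list[Doc], stages: Iterable[Stage]) -> list[Doc]:
    # Apply each stage in order, feeding its output to the next one.
    for stage in stages:
        docs = stage(docs)
    return docs


result = run_pipeline(load(), [transform, enrich])
print(result)
```

Keeping every stage to the same "list in, list out" shape makes it easy to add, remove, or reorder steps as requirements change.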

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:

1.  **Load Documents from Multiple Sources:** Use LangChain's `DocumentLoader` classes to ingest data from PDF, CSV, and HTML files.
2.  **Apply Custom Transformations:** Implement functions to filter documents based on business rules (e.g., date cutoffs).
3.  **Redact Sensitive Information:** Create a sanitization step to remove Personally Identifiable Information (PII) from document content.
4.  **Enrich Documents with Metadata:** Systematically add metadata (like source, plant, and ingestion timestamp) to each document for improved traceability and retrieval.
5.  **Construct a Reusable Ingestion Pipeline:** Combine loading, transforming, and enriching into a coherent workflow.

## 🧩 Scenario: Building a Compliant Knowledge Base for a Maintenance Chatbot

You are an AI engineer at a large manufacturing company. Your task is to build the data foundation for a new RAG-based chatbot that will help maintenance technicians diagnose and repair machinery.

The knowledge base must be built from three sources:
1.  **SOP Manuals (PDF):** Official Standard Operating Procedures for all equipment.
2.  **Maintenance Logs (CSV):** A running log of all maintenance tickets, including technician notes.
3.  **Safety Bulletins (HTML):** Internal web pages with the latest safety alerts.

However, there are strict compliance requirements:
-   **Data Retention Policy:** The company's legal department mandates that any maintenance ticket older than 365 days must be excluded from the RAG system's knowledge base to avoid providing outdated advice.
-   **PII Redaction:** To protect employee privacy, all technician names mentioned in the maintenance logs must be redacted.
-   **Traceability:** Every document ingested into the system must be tagged with metadata indicating its source, the plant it belongs to, and when it was ingested.

Your goal is to build an ingestion pipeline that loads data from these sources while enforcing all three compliance rules.

## 1. Setup and Data Creation

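First, let's install the required libraries: `langchain` for the core framework, `langchain-community` for the document loaders, `unstructured` for parsing HTML and other file types, `pypdf` as the PDF parser behind `PyPDFLoader`, and `reportlab` to generate the sample PDF. `beautifulsoup4` is installed alongside them as an HTML parsing dependency.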

> ⚠️ **Kernel Restart**: After running the installation cell below, you may need to restart the kernel for the changes to take effect. You can do this from the "Kernel" menu in your Jupyter environment.

In [None]:
%pip install -qU langchain langchain-community unstructured pypdf beautifulsoup4 reportlab

### Creating Dummy Data Files

To make this notebook self-contained, we'll programmatically create the dummy data files. This cell will generate the PDF, CSV, and HTML files needed for our scenario.

In [None]:
import os
import csv
from datetime import datetime, timedelta
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

# --- Create Directories ---
DATA_DIR = "data"
os.makedirs(os.path.join(DATA_DIR, "sop_manuals"), exist_ok=True)
os.makedirs(os.path.join(DATA_DIR, "logs"), exist_ok=True)
os.makedirs(os.path.join(DATA_DIR, "bulletins"), exist_ok=True)

# --- 1. Create a Dummy PDF SOP Manual ---
pdf_path = os.path.join(DATA_DIR, "sop_manuals", "press_safety.pdf")
c = canvas.Canvas(pdf_path, pagesize=letter)
# Letter pages are 612 x 792 points, so text must start below y=792 to land on the page
c.drawString(72, 720, "Standard Operating Procedure: Hydraulic Press H-45")
c.drawString(72, 700, "1. Ensure all safety guards are in place before operation.")
c.drawString(72, 680, "2. Perform daily maintenance checks as per the log.")
c.save()
print(f"Created dummy PDF: {pdf_path}")

# --- 2. Create a Dummy CSV Maintenance Log ---
csv_path = os.path.join(DATA_DIR, "logs", "maintenance_logs.csv")
now = datetime.utcnow()
two_years_ago = now - timedelta(days=730)
six_months_ago = now - timedelta(days=180)

log_data = [
    ["ticket_id", "timestamp_utc", "issue_description", "technician_notes"],
    ["TICKET-001", two_years_ago.isoformat() + "Z", "Hydraulic fluid leak", "Replaced seal. Work completed by Technician Jane Doe."],
    ["TICKET-002", six_months_ago.isoformat() + "Z", "Pressure sensor fault", "Recalibrated sensor. Work completed by Technician John Smith."],
    ["TICKET-003", now.isoformat() + "Z", "Emergency stop button stuck", "Replaced the button assembly. Work completed by Technician Emily Jones."]
]

with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(log_data)
print(f"Created dummy CSV: {csv_path}")

# --- 3. Create a Dummy HTML Safety Bulletin ---
html_path = os.path.join(DATA_DIR, "bulletins", "ehs_update.html")
html_content = """
<html>
<head><title>EHS Update</title></head>
<body>
    <h1>Safety Bulletin: Q3 2024</h1>
    <p>All personnel must complete the mandatory annual safety training by October 31st.</p>
    <p>A new lockout/tagout procedure is now in effect for the assembly line.</p>
</body>
</html>
"""
with open(html_path, "w") as f:
    f.write(html_content)
print(f"Created dummy HTML: {html_path}")

## 2. Loading Documents with LangChain Loaders

LangChain provides a rich ecosystem of document loader classes, each designed for a specific file type or data source. Every loader's `load()` method returns a list of `Document` objects; a `Document` is a simple object with `page_content` (the text) and `metadata` (a dictionary of attributes).

We will use three different loaders for our sources:
-   `PyPDFLoader`: For loading and parsing PDF files.
-   `UnstructuredHTMLLoader`: For parsing HTML files. It is a `langchain-community` loader that wraps the `unstructured` library, which can handle many complex formats.
-   `CSVLoader`: For loading data from CSV files.

### Generic Loader Function
For simple cases like PDF and HTML, we can create a generic function to load the document and add our desired metadata. We'll tag each document with the `plant` and `doc_type`.

In [None]:
from langchain_community.document_loaders import PyPDFLoader, UnstructuredHTMLLoader
from pathlib import Path

def load_and_enrich(loader_cls, file_path: str, metadata: dict):
    """
    Generic function to load a document and enrich it with metadata.
    """
    # Instantiate the loader with the file path
    loader = loader_cls(file_path)
    # Load the documents
    docs = loader.load()
    # Add the provided metadata to each loaded document
    for doc in docs:
        doc.metadata.update(metadata)
    return docs

# Define file paths and metadata
pdf_path = os.path.join(DATA_DIR, "sop_manuals", "press_safety.pdf")
html_path = os.path.join(DATA_DIR, "bulletins", "ehs_update.html")

# Load the PDF and HTML documents
pdf_docs = load_and_enrich(PyPDFLoader, pdf_path, {"plant": "PNQ", "doc_type": "SOP"})
html_docs = load_and_enrich(UnstructuredHTMLLoader, html_path, {"plant": "PNQ", "doc_type": "EHS_Bulletin"})

print(f"Loaded {len(pdf_docs)} PDF document(s).")
print(f"Loaded {len(html_docs)} HTML document(s).")
print("\n--- Sample PDF Metadata ---")
print(pdf_docs[0].metadata)
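
One detail worth knowing: `doc.metadata.update(metadata)` in `load_and_enrich` overwrites any key the loader has already set, such as `source`. If loader-provided values should win instead, the merge order can be flipped. The sketch below uses plain dictionaries; `merge_metadata` is a hypothetical helper, and the sample values only mimic what a PDF loader typically records.

```python
def merge_metadata(loader_metadata: dict, extra: dict) -> dict:
    """Return a new dict in which keys already set by the loader take precedence."""
    merged = dict(extra)
    merged.update(loader_metadata)
    return merged


# Example values only: `source` and `page` mimic what a PDF loader typically sets.
loader_metadata = {"source": "data/sop_manuals/press_safety.pdf", "page": 0}
extra = {"plant": "PNQ", "doc_type": "SOP", "source": "manual-upload"}

merged = merge_metadata(loader_metadata, extra)
print(merged["source"])  # the loader's path is kept, not "manual-upload"
```

Which side should win is a policy decision; the point is to choose it deliberately rather than by accident.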

### Custom Loader for CSV with Date Filtering

The maintenance logs (CSV) require special handling to meet the legal requirement of excluding records older than 365 days. We can't use a generic loader because we need to inspect the content of each row and apply a filter.

We will create a custom function that:
1.  Defines a `CUTOFF_DATE` (365 days ago).
2.  Uses `CSVLoader` to load all records from the file. `CSVLoader` creates one `Document` per row.
3.  Iterates through each `Document` and parses the `timestamp_utc` from the `page_content`.
4.  Compares the record's timestamp to the `CUTOFF_DATE`.
5.  Only keeps the records that are more recent than the cutoff date.
6.  Enriches the valid records with the appropriate metadata.

This demonstrates how to inject custom business logic directly into your ingestion pipeline.
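
Before looking at the loader-based cell, here is a standalone sketch of steps 3 to 5 using only the standard library. It assumes `page_content` holds one `column: value` pair per line (the layout the regex in the next cell relies on) and compares timezone-aware datetimes. The helper names and sample values are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_page_content(page_content: str) -> dict:
    """Split 'column: value' lines into a dict (the first ': ' separates key and value)."""
    fields = {}
    for line in page_content.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 string ending in 'Z' into a timezone-aware datetime."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def is_within_retention(timestamp: str, now: datetime, days: int = 365) -> bool:
    """True if the record is no older than `days` days relative to `now`."""
    return parse_utc(timestamp) >= now - timedelta(days=days)


# Sample row text in the assumed layout; the values are made up for illustration.
sample = (
    "ticket_id: TICKET-002\n"
    "timestamp_utc: 2024-01-15T08:30:00.123456Z\n"
    "issue_description: Pressure sensor fault\n"
    "technician_notes: Recalibrated sensor."
)
fields = parse_page_content(sample)
now = datetime(2024, 6, 1, tzinfo=timezone.utc)  # fixed "now" so the example is repeatable
print(fields["timestamp_utc"], is_within_retention(fields["timestamp_utc"], now))
```

Passing `now` in as a parameter keeps the retention rule easy to test at its boundary. Note that this simple parser would not cope with values that span several lines.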

In [None]:
from langchain_community.document_loaders import CSVLoader
import re

# Define the cutoff for old data as per legal requirements
CUTOFF_DATE = datetime.utcnow() - timedelta(days=365)

def load_and_filter_csv(file_path: str, metadata: dict):
    """
    Loads a CSV file, filters records based on a timestamp, and enriches metadata.
    """
    # Use CSVLoader, specifying which column to use as the 'source' in metadata
    loader = CSVLoader(
        file_path=file_path,
        source_column='ticket_id'
    )
    docs = loader.load()
    
    filtered_docs = []
    print(f"Loaded {len(docs)} total records from CSV. Applying date filter...")
    
    for doc in docs:
        # CSVLoader renders each row into page_content as "column: value" lines,
        # so the timestamp has to be parsed back out of that text with a regex.
        timestamp_match = re.search(r"timestamp_utc: ([\d\-T:Z.]+)", doc.page_content)
        
        if timestamp_match:
            timestamp_str = timestamp_match.group(1).replace('Z', '')
            record_timestamp = datetime.fromisoformat(timestamp_str)
            
            # Compare the record's date with our cutoff date
            if record_timestamp >= CUTOFF_DATE:
                # If the record is recent enough, add it to our list
                doc.metadata.update(metadata)
                filtered_docs.append(doc)
            else:
                print(f"  - Excluding old record: {doc.metadata['source']} (date: {record_timestamp.date()})")
        else:
            print(f"  - Skipping record with missing timestamp: {doc.metadata['source']}")
            
    return filtered_docs

# Define the path to the CSV file
csv_path = os.path.join(DATA_DIR, "logs", "maintenance_logs.csv")

# Load and filter the CSV documents
csv_docs = load_and_filter_csv(csv_path, {"plant": "PNQ", "doc_type": "MaintenanceLog"})

print(f"\nLoaded {len(csv_docs)} recent maintenance log(s) after filtering.")
print("\n--- Sample Filtered CSV Record ---")
if csv_docs:
    print(csv_docs[0])

## 3. Transforming and Sanitizing Documents

Loading is just the first step. A robust ingestion pipeline must also clean and transform the data to meet compliance and quality standards. We will perform two transformations:

1.  **PII Redaction:** We'll create a function to find and replace technician names in the document content, replacing them with `[REDACTED]`. In a production environment, you would use a more sophisticated tool like Amazon Comprehend, Google Cloud DLP, or Microsoft Presidio, but a simple regex is sufficient for this example.
2.  **Final Enrichment:** We'll add a final piece of metadata—an `ingested_at` timestamp—to every document for audit and traceability purposes.
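
To see why a simple regex is only a starting point for PII redaction, the sketch below compares the two-word pattern used in this notebook with a slightly broader one. `BROADER` is a hypothetical pattern written for this comparison, not a vetted PII detector, and it can over-redact when a capitalized word directly follows the name.

```python
import re

# Pattern used in this notebook: exactly two simple capitalized words.
SIMPLE = re.compile(r"Technician\s+[A-Z][a-z]+\s+[A-Z][a-z]+")

# Hypothetical broader pattern: one to three capitalized tokens that may
# contain hyphens, apostrophes, or inner capitals (e.g. "McDonald").
BROADER = re.compile(r"Technician\s+[A-Z][\w'\-]*(?:\s+[A-Z][\w'\-]*){0,2}")


def redact(pattern: re.Pattern, text: str) -> str:
    """Replace every match of `pattern` with a fixed placeholder."""
    return pattern.sub("Technician [REDACTED]", text)


samples = [
    "Work completed by Technician Jane Doe.",
    "Work completed by Technician Mary-Anne O'Neil.",
    "Work completed by Technician McDonald.",
]
for text in samples:
    print(redact(SIMPLE, text), "|", redact(BROADER, text))
```

The second and third samples pass through `SIMPLE` unchanged, which is exactly the kind of leak a dedicated PII tool is meant to catch.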

We'll wrap these steps in a `sanitize_and_enrich` function that processes a list of documents.

In [None]:
def redact_pii(text: str) -> str:
    """
    A simple PII redactor that looks for names following 'Technician'.
    In production, use a more advanced tool like Microsoft Presidio or Amazon Comprehend.
    """
    # This regex finds "Technician" followed by a capitalized first and last name.
    # Names with hyphens, apostrophes, or more than two parts are not matched.
    return re.sub(r"Technician\s+([A-Z][a-z]+\s+[A-Z][a-z]+)", "Technician [REDACTED]", text)

def sanitize_and_enrich(docs: list) -> list:
    """
    Applies PII redaction and adds a final ingestion timestamp to each document.
    """
    sanitized_docs = []
    for doc in docs:
        # Apply the PII redaction to the document's content
        doc.page_content = redact_pii(doc.page_content)
        
        # Add the final enrichment metadata
        doc.metadata['ingested_at'] = datetime.utcnow().isoformat() + 'Z'
        
        sanitized_docs.append(doc)
    return sanitized_docs

# Combine all our loaded documents into a single list
all_docs = pdf_docs + html_docs + csv_docs

# Run the final sanitization and enrichment step
final_docs = sanitize_and_enrich(all_docs)

print(f"Total documents in the final knowledge base: {len(final_docs)}")
print("\n--- Sample Sanitized Document (from CSV) ---")
# Find a sanitized CSV doc to display
for doc in final_docs:
    if doc.metadata['doc_type'] == 'MaintenanceLog':
        print("The original technician_notes value contained a technician name.")
        print("Sanitized Content:")
        print(doc.page_content)
        print("\nFinal Metadata:")
        print(doc.metadata)
        break
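
The cell above calls `datetime.utcnow()` once per document, so `ingested_at` can differ slightly within a single run, and `utcnow()` is deprecated as of Python 3.12. A possible alternative, sketched here with plain dictionaries standing in for document metadata, captures one timezone-aware timestamp per run. `utc_stamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def utc_stamp(moment: datetime) -> str:
    """Format an aware datetime as ISO 8601 in UTC with a trailing 'Z'."""
    as_utc = moment.astimezone(timezone.utc)
    return as_utc.isoformat(timespec="seconds").replace("+00:00", "Z")


# Capture the time once so every document in this run carries the same value.
run_started = datetime.now(timezone.utc)
stamp = utc_stamp(run_started)

# Plain dicts stand in for each document's metadata here.
batch_metadata = [{"doc_type": "SOP"}, {"doc_type": "MaintenanceLog"}]
for metadata in batch_metadata:
    metadata["ingested_at"] = stamp

print(batch_metadata)
```

A shared timestamp also doubles as a simple run identifier when auditing which documents were ingested together.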

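## 4. Putting It All Together

The two cells below restate the steps above as one self-contained pipeline: configuration, helper functions, loading, and sanitization. They reload the source files and use shorter `doc_type` labels (`ticket`, `ehs`) than the earlier cells.

In [None]: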
from datetime import datetime, timedelta
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, CSVLoader, UnstructuredHTMLLoader
import re

# --- Configuration ---
# Define the cutoff for old data as per legal requirements
CUTOFF_DATE = datetime.utcnow() - timedelta(days=365)
DATA_DIR = Path('data')

# --- Helper Functions ---

def load_and_filter_csv(path: Path, metadata: dict):
    """
    Loads a CSV file and filters records based on a timestamp.
    Assumes a 'timestamp_utc' column in ISO format.
    """
    loader = CSVLoader(
        file_path=str(path),
        csv_args={'delimiter': ','},
        source_column='ticket_id' # Use ticket_id as the source identifier
    )
    docs = loader.load()
    
    filtered_docs = []
    for doc in docs:
        # CSVLoader writes each row into page_content as "column: value" lines,
        # so the timestamp has to be parsed back out of that text.
        # The hyphen is escaped and '.' is allowed so fractional seconds are matched.
        match = re.search(r"timestamp_utc: ([\d\-T:Z.]+)", doc.page_content)
        if not match:
            print(f"Skipping record with missing timestamp in {doc.metadata.get('source')}")
            continue
        try:
            # Python versions before 3.11 cannot parse a trailing 'Z', so use an explicit offset
            timestamp = datetime.fromisoformat(match.group(1).replace('Z', '+00:00'))
        except ValueError:
            print(f"Skipping record with malformed timestamp in {doc.metadata.get('source')}")
            continue
        # CUTOFF_DATE is a naive UTC datetime, so drop tzinfo before comparing
        if timestamp.replace(tzinfo=None) >= CUTOFF_DATE:
            doc.metadata.update(metadata)
            filtered_docs.append(doc)
            
    return filtered_docs

def load_generic(loader_cls, path: Path, metadata: dict):
    """Generic loader for file types that don't need special filtering."""
    loader = loader_cls(str(path))
    docs = loader.load()
    for doc in docs:
        doc.metadata.update(metadata)
    return docs

def redact_pii(text: str) -> str:
    """A simple PII redactor. In production, use a more advanced tool like Presidio."""
    # This regex looks for names like "Technician John Doe"
    return re.sub(r"Technician\s+[A-Z][a-z]+\s+[A-Z][a-z]+", "Technician [REDACTED]", text)

# --- Data Ingestion Pipeline ---

# Define file paths
pdf_path = DATA_DIR / 'sop_manuals' / 'press_safety.pdf'
csv_path = DATA_DIR / 'logs' / 'maintenance_logs.csv'
html_path = DATA_DIR / 'bulletins' / 'ehs_update.html'

# Load documents from different sources with appropriate metadata
pdf_docs = load_generic(PyPDFLoader, pdf_path, {'plant': 'PNQ', 'doc_type': 'SOP'})
log_docs = load_and_filter_csv(csv_path, {'plant': 'PNQ', 'doc_type': 'ticket'})
bulletin_docs = load_generic(UnstructuredHTMLLoader, html_path, {'plant': 'PNQ', 'doc_type': 'ehs'})

print(f'Loaded {len(pdf_docs)} SOP pages.')
print(f'Loaded {len(log_docs)} recent maintenance tickets (older than 1 year excluded).')
print(f'Loaded {len(bulletin_docs)} HTML notices.')

In [None]:
def sanitize_documents(docs):
    for doc in docs:
        doc.page_content = redact_pii(doc.page_content)
        doc.metadata['ingested_at'] = datetime.utcnow().isoformat() + 'Z'
    return docs

sanitized_docs = sanitize_documents(pdf_docs + log_docs + bulletin_docs)
sanitized_docs[:2]

## 🔐 Compliance Checklist
- Tickets older than 365 days removed (`timestamp_utc` filter).
- PII sanitized using redaction function.
- Metadata tags (`plant`, `doc_type`, `ingested_at`) applied.
- Evidence exported to governance storage (not performed in this notebook; see Lab Assignment steps 3 and 4).
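
Some of these checks can be automated rather than ticked by hand. The sketch below verifies two of them, the metadata tags and the absence of an unredacted technician name, on plain strings and dictionaries. `compliance_issues` is a hypothetical helper, and its name check reuses the notebook's simple regex, so it inherits the same blind spots.

```python
import re

REQUIRED_KEYS = {"plant", "doc_type", "ingested_at"}
NAME_AFTER_TITLE = re.compile(r"Technician\s+[A-Z][a-z]+\s+[A-Z][a-z]+")


def compliance_issues(page_content: str, metadata: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    issues = []
    missing = REQUIRED_KEYS - set(metadata)
    if missing:
        issues.append(f"missing metadata: {sorted(missing)}")
    if NAME_AFTER_TITLE.search(page_content):
        issues.append("unredacted technician name")
    return issues


clean = compliance_issues(
    "Replaced seal. Work completed by Technician [REDACTED].",
    {"plant": "PNQ", "doc_type": "MaintenanceLog", "ingested_at": "2024-06-01T12:00:00Z"},
)
flagged = compliance_issues("Work completed by Technician Jane Doe.", {"plant": "PNQ"})
print(clean)
print(flagged)
```

Running a check like this over the final document list before indexing turns the checklist into something that can fail loudly.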

## 🧪 Lab Assignment
1. Extend the loader to ingest CAD manuals (DOCX) via Unstructured.
2. Plug in the Great Expectations suite from Week 05 to validate metadata fields.
3. Export sanitized documents to `data/processed/{plant}` with checksum manifest.
4. Document ingestion run (dataset, operator, duration) and attach to change ticket.
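
As a possible starting point for step 3, the sketch below builds a SHA-256 checksum manifest for the files in one output directory using only the standard library. The function names, the `manifest.json` file name, and the temporary directory are assumptions made for illustration; adapt them to your own export layout.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large exports do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(directory: Path, manifest_name: str = "manifest.json") -> dict:
    """Hash every file in `directory` (non-recursive) and write a JSON manifest beside them."""
    checksums = {
        p.name: sha256_of(p)
        for p in sorted(directory.iterdir())
        if p.is_file() and p.name != manifest_name
    }
    (directory / manifest_name).write_text(json.dumps(checksums, indent=2), encoding="utf-8")
    return checksums


# Demonstration in a throwaway directory; the file name and content are made up.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "PNQ"
    out_dir.mkdir()
    (out_dir / "TICKET-002.txt").write_text("Recalibrated sensor.", encoding="utf-8")
    print(write_manifest(out_dir))
```

The manifest file itself is excluded from the hash list so that regenerating it does not change the result.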

## ✅ Checklist
- [ ] Datasets ingested with metadata
- [ ] PII redaction validated
- [ ] Cutoff policy enforced
- [ ] Evidence archived

## 📚 References
- LangChain Document Loaders Documentation
- Corporate Data Retention Policy 2024
- Week 05 Data Intake Notebook