# LangChain 1.0 Built-in PII Middleware for HR Systems

**Module:** Using LangChain's Built-in PIIMiddleware

**What you'll learn:**
- 🔒 How to use LangChain's built-in `PIIMiddleware`
- 🚫 **REDACT** strategy - Complete PII removal
- 🎭 **MASK** strategy - Partial obscuring
- #️⃣ **HASH** strategy - Deterministic hashing
- ⛔ **BLOCK** strategy - Zero-tolerance blocking
- 🎯 Configuration options and best practices

**Built-in PII Types:**
- `email` - Email addresses
- `credit_card` - Credit card numbers (Luhn validated)
- `ip` - IP addresses
- `mac_address` - MAC addresses
- `url` - URLs
- Custom regex patterns

**Time:** 1-2 hours

---

## 📚 PII Strategy Comparison

| Strategy | Description | Example | Best For |
|----------|-------------|---------|----------|
| `redact` | Replace with `[REDACTED_TYPE]` | `[REDACTED_EMAIL]` | Audit logs, complete anonymization |
| `mask` | Partially obscure | `****-****-****-1234` | User-facing UI, verification |
| `hash` | Deterministic hash | `<email_hash:6e30e32c>` | Analytics without identity |
| `block` | Raise exception | Operation blocked | Zero-tolerance compliance |

---

## Setup: Install Dependencies

In [16]:
!pip install --pre -U langchain langchain-openai langgraph
!pip install langgraph-checkpoint-sqlite



## Setup: Imports and Configuration

In [17]:
from google.colab import userdata
import os

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

# Core imports
from langchain.agents import create_agent
from langchain.agents.middleware import PIIMiddleware  # ⭐ Built-in!
from langchain_core.tools import tool
from typing import Annotated

print("✅ Setup complete!")
print("✅ Using LangChain's built-in PIIMiddleware")

✅ Setup complete!
✅ Using LangChain's built-in PIIMiddleware


## Setup: HR Database and Tools

In [18]:
# HR Employee Database with PII
EMPLOYEES = {
    "101": {
        "name": "Priya Sharma",
        "email": "priya.sharma@company.com",
        "phone": "+91-9876543210",
        "ssn": "123-45-6789",
        "credit_card": "4532-1234-5678-9012",
        "ip_address": "192.168.1.100",
        "department": "Engineering",
        "role": "Senior Developer",
        "salary": 120000
    },
    "102": {
        "name": "Rahul Verma",
        "email": "rahul.verma@company.com",
        "phone": "+91-9876543211",
        "ssn": "987-65-4321",
        "credit_card": "5412-3456-7890-1234",
        "ip_address": "192.168.1.101",
        "department": "Engineering",
        "role": "Engineering Manager",
        "salary": 180000
    },
    "103": {
        "name": "Anjali Patel",
        "email": "anjali.patel@company.com",
        "phone": "+91-9876543212",
        "ssn": "456-78-9012",
        "credit_card": "3782-822463-10005",
        "ip_address": "192.168.1.102",
        "department": "HR",
        "role": "HR Director",
        "salary": 200000
    }
}

# HR Tools
@tool
def get_employee_info(employee_id: Annotated[str, "Employee ID"]) -> str:
    """Get employee contact information."""
    if employee_id in EMPLOYEES:
        emp = EMPLOYEES[employee_id]
        return f"""Employee: {emp['name']}
Email: {emp['email']}
Phone: {emp['phone']}
Department: {emp['department']}
Role: {emp['role']}
IP Address: {emp['ip_address']}"""
    return f"Employee {employee_id} not found"

@tool
def get_financial_info(employee_id: Annotated[str, "Employee ID"]) -> str:
    """Get financial information. SENSITIVE."""
    if employee_id in EMPLOYEES:
        emp = EMPLOYEES[employee_id]
        return f"""Financial Info - {emp['name']}:
Annual Salary: ₹{emp['salary']:,}
Payment Card: {emp['credit_card']}
SSN: {emp['ssn']}"""
    return f"Employee {employee_id} not found"

print(f"✅ Loaded {len(EMPLOYEES)} employees with PII data")

✅ Loaded 3 employees with PII data


---
# Understanding PIIMiddleware Configuration

## Basic Syntax

```python
PIIMiddleware(
    pii_type="email",              # Type of PII to detect
    strategy="redact",             # How to handle: redact/mask/hash/block
    apply_to_input=True,           # Check user messages
    apply_to_output=False,         # Check AI messages
    apply_to_tool_results=False,   # Check tool outputs
)
```

## Parameters

| Parameter | Type | Description | Default |
|-----------|------|-------------|----------|
| `pii_type` | str | Built-in type (`email`, `credit_card`, `ip`, `mac_address`, `url`) or a custom label paired with `detector` | Required |
| `strategy` | str | `redact`, `mask`, `hash`, or `block` | `"redact"` |
| `detector` | str | Custom regex pattern (overrides built-in) | None |
| `apply_to_input` | bool | Process user messages before model | `True` |
| `apply_to_output` | bool | Process AI responses after model | `False` |
| `apply_to_tool_results` | bool | Process tool outputs | `False` |
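
As a rough mental model (not the library's implementation), the three `apply_to_*` flags can be read as selecting which message roles get scanned. The sketch below uses plain dicts and a hypothetical `roles_to_scan` helper; the role names `user`, `assistant`, and `tool` are assumptions made for illustration.

```python
def roles_to_scan(apply_to_input=True, apply_to_output=False,
                  apply_to_tool_results=False):
    """Return the message roles a toy PII filter would inspect (hypothetical helper)."""
    roles = set()
    if apply_to_input:
        roles.add("user")        # user messages, before the model runs
    if apply_to_output:
        roles.add("assistant")   # AI responses, after the model runs
    if apply_to_tool_results:
        roles.add("tool")        # tool outputs
    return roles


conversation = [
    {"role": "user", "content": "Get contact information for employee 101"},
    {"role": "tool", "content": "Email: priya.sharma@company.com"},
    {"role": "assistant", "content": "Her email is priya.sharma@company.com"},
]

# Defaults from the table above: only user input is checked.
default_roles = roles_to_scan()
# The setting used in most labs of this notebook: outputs and tool results only.
lab_roles = roles_to_scan(apply_to_input=False, apply_to_output=True,
                          apply_to_tool_results=True)

scanned_by_default = [m["role"] for m in conversation if m["role"] in default_roles]
scanned_in_labs = [m["role"] for m in conversation if m["role"] in lab_roles]
print(scanned_by_default)
print(scanned_in_labs)
```

With the defaults, PII that only appears in tool results or model replies is left untouched, which is why the labs switch the output flags on explicitly.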

## Built-in PII Types

- `email` - Email addresses
- `credit_card` - Credit card numbers (Luhn algorithm validated)
- `ip` - IP addresses (IPv4/IPv6)
- `mac_address` - MAC addresses
- `url` - URLs
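
Because the `credit_card` type is described as Luhn validated, it is worth checking that test data actually passes that checksum. The sketch below is a plain-Python Luhn check applied to this notebook's sample card numbers; `luhn_valid` is a hypothetical helper written for illustration, not a LangChain function.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# Sample card numbers from the EMPLOYEES table in this notebook.
sample_cards = {
    "101": "4532-1234-5678-9012",
    "102": "5412-3456-7890-1234",
    "103": "3782-822463-10005",
}
results = {emp_id: luhn_valid(card) for emp_id, card in sample_cards.items()}
print(results)  # {'101': False, '102': False, '103': True}
```

Only employee 103's number passes, so a detector that applies this check would ignore the other two. Using Luhn-valid test numbers is the safer way to exercise `credit_card` rules.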

---

# Example 1: REDACT Strategy 🚫

## What is Redaction?

Complete removal with type markers:
```
Input:  "Email: john@company.com"
Output: "Email: [REDACTED_EMAIL]"
```

**Use Cases:** Audit logs, public data, training datasets, GDPR compliance
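
To make the idea concrete, here is a minimal sketch of redaction using only `re`. The two regular expressions and the `redact` helper are simplified assumptions for illustration and are not the detectors that `PIIMiddleware` ships with.

```python
import re

# Simplified, illustrative patterns (assumptions, not LangChain's detectors).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text: str) -> str:
    """Replace every match with a type marker, dropping the value entirely."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    text = IPV4_RE.sub("[REDACTED_IP]", text)
    return text


tool_output = "Email: priya.sharma@company.com\nIP Address: 192.168.1.100"
redacted = redact(tool_output)
print(redacted)
```

Nothing of the original value survives, so redacted text cannot be used to correlate records. That is the trade-off against the mask and hash strategies.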

---

## Lab 1.1: Redact Email and IP Addresses

In [20]:
# Create agent with REDACT strategy
redact_agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[get_employee_info],
    middleware=[
        # Redact emails in tool outputs
        PIIMiddleware(
            pii_type="email",
            strategy="redact",
            apply_to_input=False,
            apply_to_output=True,          # Check AI responses
            apply_to_tool_results=True,    # Check tool outputs
        ),
        # Redact IP addresses
        PIIMiddleware(
            pii_type="ip",
            strategy="redact",
            apply_to_input=False,
            apply_to_output=True,
            apply_to_tool_results=True,
        ),
    ],
    system_prompt="""You are an HR assistant. Provide employee information.
    Note: Emails and IPs are automatically redacted for compliance."""
)

print("✅ Agent with REDACT strategy created!")

✅ Agent with REDACT strategy created!


## Lab 1.2: Test REDACT Strategy

In [21]:
print("=" * 70)
print("EXAMPLE 1: REDACT STRATEGY")
print("=" * 70)

result = redact_agent.invoke({
    "messages": [{"role": "user", "content": "Get contact information for employee 101"}]
})

print("\n🤖 Agent Response (Emails and IPs redacted):")
print(result['messages'][-1].content)

print("\n✅ Notice: Email and IP addresses are replaced with [REDACTED_TYPE]")

EXAMPLE 1: REDACT STRATEGY

🤖 Agent Response (Emails and IPs redacted):
Here is the contact information for employee 101:

- **Name:** Priya Sharma
- **Email:** [REDACTED_EMAIL]
- **Phone:** +91-9876543210
- **Department:** Engineering
- **Role:** Senior Developer

✅ Notice: Email and IP addresses are replaced with [REDACTED_TYPE]


---
# Example 2: MASK Strategy 🎭

## What is Masking?

Partial obscuring with last digits visible:
```
Input:  "Card: 4532-1234-5678-9012"
Output: "Card: ****-****-****-9012"
```

**Use Cases:** Customer service UI, payment confirmations, account verification
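
A minimal sketch of the masking idea, assuming a loose card-shaped regex and a fixed `****-****-****-NNNN` output format. Both `CARD_RE` and `mask_cards` are hypothetical; unlike the built-in detector described earlier, this sketch performs no Luhn validation.

```python
import re

# 13 to 19 digits, optionally separated by spaces or hyphens (illustrative only).
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def _mask(match: re.Match) -> str:
    """Keep the last four digits and obscure the rest."""
    digits = [ch for ch in match.group() if ch.isdigit()]
    return "****-****-****-" + "".join(digits[-4:])


def mask_cards(text: str) -> str:
    return CARD_RE.sub(_mask, text)


masked = mask_cards("Card: 4532-1234-5678-9012")
print(masked)  # Card: ****-****-****-9012
```

Keeping the last four digits lets a user confirm which card is meant without exposing the full number.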

---

## Lab 2.1: Mask Credit Card Numbers

In [24]:
# Create agent with MASK strategy
mask_agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[get_financial_info],
    middleware=[
        # Mask credit card numbers (show last 4 digits)
        PIIMiddleware(
            pii_type="credit_card",
            strategy="mask",
            apply_to_input=False,
            apply_to_output=True,
            apply_to_tool_results=True,
        ),
    ],
    system_prompt="""You are an HR assistant handling financial information.
    Note: Credit card numbers are partially masked for security."""
)

print("✅ Agent with MASK strategy created!")

✅ Agent with MASK strategy created!


## Lab 2.2: Test MASK Strategy

In [25]:
print("=" * 70)
print("EXAMPLE 2: MASK STRATEGY")
print("=" * 70)

result = mask_agent.invoke({
    "messages": [{"role": "user", "content": "Get financial info for employee 102"}]
})

print("\n🤖 Agent Response (Credit cards masked):")
print(result['messages'][-1].content)

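# Note: the sample number 5412-3456-7890-1234 fails the Luhn checksum, so a
# Luhn-validated detector may not match it. If the reply shows a different
# mask shape, the masking may come from the model following the system prompt.
print("\n✅ Notice: Credit card shows only last 4 digits: ****-****-****-1234")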

EXAMPLE 2: MASK STRATEGY

🤖 Agent Response (Credit cards masked):
Here is the financial information for employee 102, Rahul Verma:

- **Annual Salary:** ₹180,000
- **Payment Card:** 5412-****-****-1234
- **SSN:** 987-65-4321

If you need any further information, feel free to ask!

✅ Notice: Credit card shows only last 4 digits: ****-****-****-1234


---
# Example 3: HASH Strategy #️⃣

## What is Hashing?

Deterministic one-way transformation:
```
Input:  "Email: john@company.com"
Output: "Email: <email_hash:xxxxxxxx>"
```

**Properties:**
- Same input → Same hash (deterministic)
- Not directly reversible, although guessable values (such as known email addresses) can be matched by hashing candidates
- Good for analytics and tracking

**Use Cases:** Analytics without identity, user tracking, de-duplication
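
The determinism property can be sketched with `hashlib`. The choice of SHA-256, the 8-character truncation, and the `hash_emails` helper are assumptions for illustration; the `<email_hash:...>` shape simply mirrors the token format seen in this notebook's recorded output, and the digest the library produces may differ.

```python
import hashlib
import re

# Simplified email pattern (an assumption, not LangChain's detector).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def _hash(match: re.Match) -> str:
    """Swap the email for a short, stable token derived from its SHA-256 digest."""
    digest = hashlib.sha256(match.group().lower().encode("utf-8")).hexdigest()
    return f"<email_hash:{digest[:8]}>"


def hash_emails(text: str) -> str:
    return EMAIL_RE.sub(_hash, text)


first = hash_emails("Email: anjali.patel@company.com")
second = hash_emails("Contact anjali.patel@company.com today")
print(first)
print(second)
```

Both strings carry the same token for the same address, so records can be joined or de-duplicated without storing the address itself.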

---

## Lab 3.1: Hash Email Addresses

In [26]:
# Create agent with HASH strategy
hash_agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[get_employee_info],
    middleware=[
        # Hash email addresses for analytics
        PIIMiddleware(
            pii_type="email",
            strategy="hash",
            apply_to_input=False,
            apply_to_output=True,
            apply_to_tool_results=True,
        ),
    ],
    system_prompt="""You are an HR analytics assistant.
    Note: Emails are hashed for analytics while maintaining referential integrity."""
)

print("✅ Agent with HASH strategy created!")

✅ Agent with HASH strategy created!


## Lab 3.2: Test HASH Strategy

In [27]:
print("=" * 70)
print("EXAMPLE 3: HASH STRATEGY")
print("=" * 70)

# First query
print("\nQuery 1: Employee 103")
result1 = hash_agent.invoke({
    "messages": [{"role": "user", "content": "Get contact info for employee 103"}]
})
print(f"Response: {result1['messages'][-1].content}")

# Second query - same employee
print("\n" + "="*70)
print("Query 2: Same employee (103) again")
result2 = hash_agent.invoke({
    "messages": [{"role": "user", "content": "Show me employee 103 info again"}]
})
print(f"Response: {result2['messages'][-1].content}")

print("\n✅ Notice: Same email produces same hash (deterministic)")

EXAMPLE 3: HASH STRATEGY

Query 1: Employee 103
Response: Here is the contact information for employee 103:

- **Name:** Anjali Patel
- **Email:** (hashed)
- **Phone:** +91-9876543212
- **Department:** HR
- **Role:** HR Director
- **IP Address:** 192.168.1.102

Query 2: Same employee (103) again
Response: Here is the information for employee 103:

- **Name:** Anjali Patel
- **Email:** (hashed) `<email_hash:6e30e32c>`
- **Phone:** +91-9876543212
- **Department:** HR
- **Role:** HR Director
- **IP Address:** 192.168.1.102

✅ Notice: Same email produces same hash (deterministic)


---
# Example 4: BLOCK Strategy ⛔

## What is Blocking?

Immediately stop execution when PII is detected:
```
Input: "My SSN is 123-45-6789"
Result: ⛔ Error - PII detected, operation blocked!
```

**Use Cases:** Zero-tolerance compliance, HIPAA, PCI-DSS, systems that log everything
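
In plain Python, blocking amounts to raising an exception before the text goes any further. The sketch below uses the SSN pattern from this notebook; `PIIBlockedError` and `check_input` are hypothetical names, distinct from the exception the middleware raises.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class PIIBlockedError(Exception):
    """Hypothetical error type used only in this sketch."""


def check_input(text: str) -> str:
    """Raise if an SSN-shaped value is present, otherwise pass the text through."""
    matches = SSN_RE.findall(text)
    if matches:
        raise PIIBlockedError(f"Detected {len(matches)} instance(s) of ssn")
    return text


outcomes = []
for message in ["Tell me about employee 101's department",
                "My SSN is 123-45-6789, can you help?"]:
    try:
        check_input(message)
        outcomes.append("allowed")
    except PIIBlockedError as exc:
        outcomes.append(f"blocked: {exc}")
print(outcomes)
```

Callers need a `try`/`except` around the invocation, since a blocked request produces no response at all.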

---

## Lab 4.1: Block SSN in User Input

In [29]:
# Create agent with BLOCK strategy
block_agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[get_employee_info],
    middleware=[
        # Block SSN in user input (custom regex for SSN)
        PIIMiddleware(
            pii_type="ssn",
            detector=r"\b\d{3}-\d{2}-\d{4}\b",  # Custom SSN pattern
            strategy="block",
            apply_to_input=True,   # Check user input!
            apply_to_output=False,
        ),
        # Block credit cards in user input
        PIIMiddleware(
            pii_type="credit_card",
            strategy="block",
            apply_to_input=True,
            apply_to_output=False,
        ),
    ],
    system_prompt="""You are an HR assistant with strict PII protection.
    Note: SSN and Credit Cards in user input are BLOCKED (zero-tolerance)."""
)

print("✅ Agent with BLOCK strategy created!")

✅ Agent with BLOCK strategy created!


## Lab 4.2: Test BLOCK Strategy

In [30]:
print("=" * 70)
print("EXAMPLE 4: BLOCK STRATEGY")
print("=" * 70)

# Test 1: Normal query (no PII) - Should work
print("\nTest 1: Normal query without PII (ALLOWED)")
print("="*70)

try:
    result = block_agent.invoke({
        "messages": [{"role": "user", "content": "Tell me about employee 101's department"}]
    })
    print("✅ Query allowed - No PII detected")
    print(f"Response: {result['messages'][-1].content[:150]}...")
except Exception as e:
    print(f"❌ Blocked: {e}")

# Test 2: Query with SSN - Should be BLOCKED
print("\n" + "="*70)
print("Test 2: Query with SSN (BLOCKED)")
print("="*70)

try:
    result = block_agent.invoke({
        "messages": [{"role": "user", "content": "My SSN is 123-45-6789, can you help?"}]
    })
    print("Response:", result['messages'][-1].content)
except Exception as e:
    print(f"⛔ BLOCKED: {type(e).__name__}")
    print(f"Error: {str(e)}")

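# Test 3: Query with Credit Card - intended to be BLOCKED
# Note: 4532-1234-5678-9012 fails the Luhn checksum, so a Luhn-validated
# credit_card detector will not flag it, which would explain a run where this
# request reaches the model. Use a Luhn-valid test number to see the block.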
print("\n" + "="*70)
print("Test 3: Query with Credit Card (BLOCKED)")
print("="*70)

try:
    result = block_agent.invoke({
        "messages": [{"role": "user", "content": "Update card to 4532-1234-5678-9012"}]
    })
    print("Response:", result['messages'][-1].content)
except Exception as e:
    print(f"⛔ BLOCKED: {type(e).__name__}")
    print(f"Error: {str(e)}")

print("\n✅ BLOCK strategy prevents processing when critical PII detected")

EXAMPLE 4: BLOCK STRATEGY

Test 1: Normal query without PII (ALLOWED)
✅ Query allowed - No PII detected
Response: Employee 101, Priya Sharma, is part of the Engineering department and holds the role of Senior Developer. If you need further details, feel free to as...

Test 2: Query with SSN (BLOCKED)
⛔ BLOCKED: PIIDetectionError
Error: Detected 1 instance(s) of ssn in text content

Test 3: Query with Credit Card (BLOCKED)
Response: I'm sorry, but I can't process credit card information. If you need assistance with a different request, feel free to let me know!

✅ BLOCK strategy prevents processing when critical PII detected


---
# Production Pattern: Multi-Layer Protection

## Best Practice: Combine Multiple Strategies

In production, use **layered protection**:
- **BLOCK** critical PII in input (SSN, Credit Cards)
- **MASK** or **REDACT** moderate PII in output (Emails, IPs)
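
The layering can be sketched without any framework: a guard on the way in, a filter on the way out. Everything here (`guard_input`, `filter_output`, `handle`, `fake_model`, and the regexes) is a hypothetical stand-in meant to show the ordering, not how `create_agent` wires middleware internally.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}(\d{1,3})\b")


class PIIBlockedError(Exception):
    """Hypothetical error type used only in this sketch."""


def guard_input(text: str) -> str:
    # Layer 1: refuse input that contains critical PII.
    if SSN_RE.search(text):
        raise PIIBlockedError("ssn detected in input")
    return text


def filter_output(text: str) -> str:
    # Layer 2: redact emails. Layer 3: mask IPs, keeping the last octet.
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return IPV4_RE.sub(lambda m: f"*.*.*.{m.group(1)}", text)


def handle(user_text: str, model) -> str:
    guard_input(user_text)
    return filter_output(model(user_text))


def fake_model(_prompt: str) -> str:
    # Stand-in for the agent, returning raw PII as a tool might.
    return "Email: priya.sharma@company.com, IP Address: 192.168.1.100"


safe_reply = handle("Get contact information for employee 101", fake_model)
print(safe_reply)

try:
    handle("My SSN is 123-45-6789", fake_model)
    blocked = False
except PIIBlockedError:
    blocked = True
print(blocked)
```

The input guard runs before the model is called, so blocked requests cost nothing downstream, while the output filter catches PII that originates from tools or the model.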

---

## Lab 5: Production-Ready Multi-Layer Protection

In [32]:
# Create production agent with layered PII protection
production_agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[get_employee_info, get_financial_info],
    middleware=[
        # Layer 1: BLOCK critical PII in user input (zero-tolerance)
        PIIMiddleware(
            pii_type="ssn",
            detector=r"\b\d{3}-\d{2}-\d{4}\b",
            strategy="block",
            apply_to_input=True,
        ),
        PIIMiddleware(
            pii_type="credit_card",
            strategy="block",
            apply_to_input=True,
        ),

        # Layer 2: REDACT emails in output (audit-friendly)
        PIIMiddleware(
            pii_type="email",
            strategy="redact",
            apply_to_output=True,
            apply_to_tool_results=True,
        ),

        # Layer 3: MASK IPs in output (partial visibility)
        PIIMiddleware(
            pii_type="ip",
            strategy="mask",
            apply_to_output=True,
            apply_to_tool_results=True,
        ),
    ],
    system_prompt="""You are a production HR assistant with multi-layer PII protection:

    Layer 1 (Input): SSN/Credit Cards → BLOCKED (zero-tolerance)
    Layer 2 (Output): Emails → REDACTED (compliance)
    Layer 3 (Output): IPs → MASKED (partial visibility)

    Provide helpful information while maintaining security."""
)

print("=" * 70)
print("PRODUCTION: Multi-Layer PII Protection")
print("=" * 70)

# Test: Normal query should work with redaction/masking
print("\nTest: Get employee info (emails redacted, IPs masked)")
result = production_agent.invoke({
    "messages": [{"role": "user", "content": "Get contact information for employee 101"}]
})

print("\n🤖 Response:")
print(result['messages'][-1].content)

print("\n✅ Production agent with defense-in-depth PII protection!")

PRODUCTION: Multi-Layer PII Protection

Test: Get employee info (emails redacted, IPs masked)

🤖 Response:
Here is the contact information for employee 101:

- **Name:** Priya Sharma
- **Email:** [REDACTED]
- **Phone:** +91-9876543210
- **Department:** Engineering
- **Role:** Senior Developer
- **IP Address:** *.*.*.100 (masked for privacy)

✅ Production agent with defense-in-depth PII protection!


---
# Advanced: Custom PII Detector

## Using Custom Regex Patterns

You can define custom PII types with regex:
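
Before handing a pattern to the middleware, it can help to check it directly with `re` against strings that should and should not match. The sketch below reuses the three custom patterns from this notebook; the `detect` helper and the sample strings are illustrative assumptions.

```python
import re

# Custom patterns used in this notebook's labs.
CUSTOM_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "api_key": r"sk-[a-zA-Z0-9]{48}",
    "aws_key": r"AKIA[0-9A-Z]{16}",
}


def detect(text: str) -> list[str]:
    """Return the names of the custom patterns that match `text`."""
    return [name for name, pattern in CUSTOM_PATTERNS.items()
            if re.search(pattern, text)]


samples = [
    "My SSN is 123-45-6789",
    "My key is sk-" + "a" * 48,
    "AKIA" + "A" * 16,
    "Phone: +91-9876543210",     # should not match any pattern
    "Short key sk-" + "a" * 47,  # one character too short for api_key
]
for sample in samples:
    print(detect(sample), "<-", sample[:30])
```

Including near misses, such as the 47-character key, is a cheap way to catch patterns that are too loose or too strict.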

---

## Lab 6: Custom API Key Detection

In [34]:
# Agent that blocks API keys
api_key_agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[get_employee_info],
    middleware=[
        # Block OpenAI-style API keys
        PIIMiddleware(
            pii_type="api_key",
            detector=r"sk-[a-zA-Z0-9]{48}",  # Custom pattern
            strategy="block",
            apply_to_input=True,
        ),
        # Block AWS-style access keys
        PIIMiddleware(
            pii_type="aws_key",
            detector=r"AKIA[0-9A-Z]{16}",  # AWS pattern
            strategy="block",
            apply_to_input=True,
        ),
    ],
    system_prompt="You are an assistant that blocks API keys for security."
)

print("=" * 70)
print("ADVANCED: Custom PII Detection")
print("=" * 70)

# Test with mock API key
print("\nTest: Query with fake API key (BLOCKED)")
try:
    result = api_key_agent.invoke({
        "messages": [{"role": "user", "content": "My key is sk-" + "a"*48}]
    })
    print("Response:", result['messages'][-1].content)
except Exception as e:
    print(f"⛔ BLOCKED: {type(e).__name__}")
    print("Reason: API key detected in user input")

print("\n✅ Custom regex patterns allow organization-specific PII detection")

ADVANCED: Custom PII Detection

Test: Query with fake API key (BLOCKED)
⛔ BLOCKED: PIIDetectionError
Reason: API key detected in user input

✅ Custom regex patterns allow organization-specific PII detection


---
# Summary

## Built-in PIIMiddleware Benefits

✅ **No custom code needed** - Just configure and use  
✅ **Multiple strategies** - Redact, Mask, Hash, Block  
✅ **Flexible application** - Input, output, or tool results  
✅ **Custom patterns** - Use regex for organization-specific PII  
✅ **Composable** - Stack multiple middleware for layered protection  

## Strategy Selection Guide

| Scenario | Strategy | Configuration |
|----------|----------|---------------|
| **Audit logs** | REDACT | `apply_to_output=True` |
| **Customer UI** | MASK | `apply_to_output=True` |
| **Analytics** | HASH | `apply_to_output=True` |
| **Compliance** | BLOCK | `apply_to_input=True` |
| **Production** | BLOCK + REDACT/MASK | Both layers |

## Configuration Quick Reference

```python
PIIMiddleware(
    pii_type="email",           # Built-in: email, credit_card, ip, etc.
    strategy="redact",          # redact, mask, hash, block
    detector=r"custom pattern", # Optional: override built-in
    apply_to_input=True,        # Check user messages
    apply_to_output=False,      # Check AI responses
    apply_to_tool_results=False # Check tool outputs
)
```

## Production Checklist

- [ ] Identify all PII types in your system
- [ ] Map compliance requirements to strategies
- [ ] Layer BLOCK for input + REDACT/MASK for output
- [ ] Test with sample PII data
- [ ] Add custom detectors for organization-specific PII
- [ ] Document configuration for compliance audits
- [ ] Regular review and updates

---

**Congratulations!** 🎉 You now know how to use LangChain's built-in PIIMiddleware for comprehensive PII protection!