# 🛡️ Sensitive Information Guardrail - Quickstart Guide

This notebook demonstrates how to call Fiddler's Sensitive Information Guardrail over its REST API to detect PII, PHI, and custom entities in text.

## ✨ Key Features:
- **PII Detection**: Personally Identifiable Information (35 entity types)
- **PHI Detection**: Protected Health Information (7 entity types)
- **Custom Entities**: User-defined entity types
- **Flexible Configuration**: Mix and match entity categories
- **High Sensitivity**: 0.1 confidence threshold; returns up to the top 1024 entities by confidence

## 📋 Entity Categories Available:

### PII Entities (35 types):
- Personal: person, address, date of birth, phone numbers
- Financial: credit card, bank account, tax ID, SSN
- Digital: email, website, IP address
- Documents: passport, driver's license, certificates

### PHI Entities (7 types):
- Medical: medication, medical condition
- Health Insurance: health insurance numbers
- Certificates: birth certificate number
- Device: serial number

### Custom Entities:
- User-defined entity types for specific use cases


## 🚀 Setup and Configuration


In [None]:
import requests
import json
import time
import pandas as pd
from typing import Dict, List, Any, Tuple, Optional

# Fiddler API Configuration
FIDDLER_TOKEN = ""
FIDDLER_BASE_URL = "https://my_company.fiddler.ai"  # Full base URL, including https:// and no trailing slash

SENSITIVE_INFORMATION_URL = f"{FIDDLER_BASE_URL}/v3/guardrails/sensitive-information"

FIDDLER_HEADERS = {
    "Authorization": f"Bearer {FIDDLER_TOKEN}",
    "Content-Type": "application/json",
}

assert FIDDLER_TOKEN != "", "Please set your Fiddler API token"
assert FIDDLER_BASE_URL != "https://my_company.fiddler.ai", (
    "Please set your Fiddler API URL"
)

print("✅ Configuration loaded successfully!")
print(f"📡 Sensitive Information URL: {SENSITIVE_INFORMATION_URL}")


✅ Configuration loaded successfully!
📡 Sensitive Information URL: https://mainbuild.dev.fiddler.ai/v3/guardrails/sensitive-information

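Hardcoding the token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of loading the same settings from the environment instead: the helper `load_fiddler_config` and the environment variable names `FIDDLER_TOKEN` and `FIDDLER_BASE_URL` are assumptions for illustration, not part of Fiddler's tooling. The helper also strips a trailing slash so the endpoint path is not built with a double slash.

```python
import os
from typing import Dict, Mapping, Tuple


def load_fiddler_config(env: Mapping[str, str] = os.environ) -> Tuple[str, Dict[str, str]]:
    """Build the guardrail URL and request headers from environment variables.

    The variable names FIDDLER_TOKEN and FIDDLER_BASE_URL are an assumed
    convention for this sketch.
    """
    token = env.get("FIDDLER_TOKEN", "")
    base_url = env.get("FIDDLER_BASE_URL", "").rstrip("/")  # avoid '//' in the path
    if not token or not base_url:
        raise ValueError("Set FIDDLER_TOKEN and FIDDLER_BASE_URL before running")
    url = f"{base_url}/v3/guardrails/sensitive-information"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return url, headers
```

The returned pair could then be assigned to `SENSITIVE_INFORMATION_URL` and `FIDDLER_HEADERS` in place of the hardcoded values.
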

## 🔧 Core API Functions


In [None]:
def get_sensitive_information_response(
    text: str, 
    entity_categories: str | List[str] = 'PII',
    custom_entities: Optional[List[str]] = None,
) -> Tuple[Dict[str, Any], float]:
    """
    Invokes the Sensitive Information Guardrail with configurable entity detection.
    
    Args:
        text (str): Input text to analyze
        entity_categories (str | List[str]): Detection categories ('PII', 'PHI', 'Custom Entities')
        custom_entities (List[str], optional): Custom entity list when using 'Custom Entities'
    
    Returns:
        Tuple[Dict[str, Any], float]: Full API response and latency in seconds
    """
    # Prepare request payload
    data = {'input': text}
    
    # Add entity configuration if different from default
    if entity_categories != 'PII' or custom_entities:
        data['entity_categories'] = entity_categories
        if custom_entities:
            data['custom_entities'] = custom_entities
    
    start_time = time.monotonic()
    
    try:
        response = requests.post(
            SENSITIVE_INFORMATION_URL,
            headers=FIDDLER_HEADERS,
            json={'data': data},
            timeout=30,  # seconds; avoids hanging forever on a stalled connection
        )
        response.raise_for_status()
        return response.json(), (time.monotonic() - start_time)
    
    except (requests.exceptions.RequestException, json.JSONDecodeError) as e:
        print(f'❌ API call failed: {e}')
        return {}, (time.monotonic() - start_time)


def extract_entities(api_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """
    Extract entities from the API response.
    
    Args:
        api_response: Full API response dictionary
        
    Returns:
        List of detected entities with start, end, text, label, score
    """
    return api_response.get('fdl_sensitive_information_scores', [])


def print_detection_results(
    entities: List[Dict[str, Any]], latency: float, text: Optional[str] = None
) -> None:
    """
    Pretty print detection results.
    
    Args:
        entities: List of detected entities
        latency: API call latency in seconds
        text: Original input text (optional; accepted for future highlighting, currently unused)
    """
    print(f"\n🔍 **Detection Results** (⏱️ {latency:.3f}s)")
    print(f"📊 **Total Entities Found:** {len(entities)}\n")
    
    if not entities:
        print("✅ No sensitive information detected.")
        return
    
    # Group by entity type
    by_type = {}
    for entity in entities:
        label = entity.get('label', 'unknown')
        if label not in by_type:
            by_type[label] = []
        by_type[label].append(entity)
    
    # Print grouped results
    for label, group in sorted(by_type.items()):
        print(f"🏷️  **{label.upper()}** ({len(group)} found):")
        for entity in group:
            score = entity.get('score', 0)
            text_span = entity.get('text', '')
            start = entity.get('start', 0)
            end = entity.get('end', 0)
            print(f"   • '{text_span}' (confidence: {score:.3f}, position: {start}-{end})")
        print()

print("✅ Core functions defined successfully!")


✅ Core functions defined successfully!

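`get_sensitive_information_response` reports failures by returning an empty dict rather than raising, so a caller that wants retries has to check for that. A minimal sketch of a retry wrapper with exponential backoff: `call_with_retries` is a hypothetical helper, and it assumes a successful call always returns a non-empty dict. It retries every failure, including ones a retry cannot fix (such as an invalid token), so a production version would likely inspect the error first.

```python
import time
from typing import Any, Callable, Dict, Tuple

GuardrailCall = Callable[[], Tuple[Dict[str, Any], float]]


def call_with_retries(
    call: GuardrailCall,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, Any], float]:
    """Retry `call` with exponential backoff while it returns an empty response.

    Returns the first non-empty response (or the last empty one) together
    with the latency summed over all attempts, excluding backoff sleeps.
    """
    total_latency = 0.0
    response: Dict[str, Any] = {}
    for attempt in range(max_attempts):
        response, latency = call()
        total_latency += latency
        if response:  # non-empty dict: treat as success (assumption)
            break
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return response, total_latency
```

A call would look like `call_with_retries(lambda: get_sensitive_information_response(some_text))`, where `some_text` is any input string.
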

## 📝 Example 1: PII Detection (Default Configuration)


In [71]:
# Sample text with various PII types
sample_text = """
I'm John Doe and I live at 1234 Maple Street, Springfield, IL 62704. 
You can reach me at john.doe@email.com or call me at (217) 555-1234.
My social security number is 123-45-6789, and I was born on January 15, 1987.
My credit card number is 4111 1111 1111 1111 with CVV 123.
For official documents, my passport number is X1234567.
"""

print("🧪 **Testing PII Detection (Default Configuration)**")
print("📄 Input Text:")
print(sample_text)

# Call the API with default PII configuration
response, latency = get_sensitive_information_response(sample_text)
entities = extract_entities(response)

# Display results
print_detection_results(entities, latency, sample_text)


🧪 **Testing PII Detection (Default Configuration)**
📄 Input Text:

I'm John Doe and I live at 1234 Maple Street, Springfield, IL 62704. 
You can reach me at john.doe@email.com or call me at (217) 555-1234.
My social security number is 123-45-6789, and I was born on January 15, 1987.
My credit card number is 4111 1111 1111 1111 with CVV 123.
For official documents, my passport number is X1234567.


🔍 **Detection Results** (⏱️ 0.961s)
📊 **Total Entities Found:** 9

🏷️  **ADDRESS** (1 found):
   • '1234 Maple Street, Springfield, IL 62704' (confidence: 0.974, position: 28-68)

🏷️  **CREDIT CARD NUMBER** (1 found):
   • '4111 1111 1111 1111' (confidence: 0.999, position: 243-262)

🏷️  **CVV** (1 found):
   • '123' (confidence: 0.907, position: 272-275)

🏷️  **DATE OF BIRTH** (1 found):
   • 'January 15, 1987' (confidence: 0.952, position: 200-216)

🏷️  **EMAIL** (1 found):
   • 'john.doe@email.com' (confidence: 0.987, position: 91-109)

🏷️  **PASSPORT NUMBER** (1 found):
   • 'X1234567' (c

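The positions reported above appear to be character offsets into the input string with an exclusive end, so `text[start:end]` should reproduce the detected span. A small sketch that checks this, using only the first line of the input and the address entity from the output; the helper name `mismatched_spans` and the lowercase label are illustrative assumptions.

```python
from typing import Any, Dict, List


def mismatched_spans(text: str, entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return entities whose [start, end) slice of `text` differs from their 'text' field."""
    return [
        entity
        for entity in entities
        if text[entity["start"]:entity["end"]] != entity["text"]
    ]


# First line of the Example 1 input, with the address entity reported above.
text = "\nI'm John Doe and I live at 1234 Maple Street, Springfield, IL 62704."
entities = [
    {"start": 28, "end": 68, "label": "address",
     "text": "1234 Maple Street, Springfield, IL 62704", "score": 0.974},
]
print(mismatched_spans(text, entities))  # → []
```

An empty list means every offset pair lines up with its reported text, which is the property any offset-based highlighting or redaction relies on.
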
## 🏥 Example 2: PHI Detection for Healthcare Data


In [72]:
# Sample text with PHI information
healthcare_text = """
Patient report: John Smith was prescribed metformin for his diabetes condition.
His health insurance number is HI-987654321, and medical record shows 
serial number MED-2024-001 for his glucose monitor device.
Birth certificate number is BC-IL-1987-001234.
Current medication includes aspirin and lisinopril for blood pressure management.
"""

print("🏥 **Testing PHI Detection for Healthcare Data**")
print("📄 Input Text:")
print(healthcare_text)

# Call the API with PHI configuration
response, latency = get_sensitive_information_response(
    healthcare_text, 
    entity_categories='PHI'
)
entities = extract_entities(response)

# Display results
print_detection_results(entities, latency, healthcare_text)


🏥 **Testing PHI Detection for Healthcare Data**
📄 Input Text:

Patient report: John Smith was prescribed metformin for his diabetes condition.
His health insurance number is HI-987654321, and medical record shows 
serial number MED-2024-001 for his glucose monitor device.
Birth certificate number is BC-IL-1987-001234.
Current medication includes aspirin and lisinopril for blood pressure management.


🔍 **Detection Results** (⏱️ 0.648s)
📊 **Total Entities Found:** 8

🏷️  **BIRTH CERTIFICATE NUMBER** (1 found):
   • 'BC-IL-1987-001234' (confidence: 0.997, position: 239-256)

🏷️  **HEALTH INSURANCE NUMBER** (1 found):
   • 'HI-987654321' (confidence: 0.986, position: 112-124)

🏷️  **MEDICAL CONDITION** (2 found):
   • 'diabetes' (confidence: 0.886, position: 61-69)
   • 'blood pressure' (confidence: 0.132, position: 313-327)

🏷️  **MEDICATION** (3 found):
   • 'metformin' (confidence: 0.997, position: 43-52)
   • 'lisinopril' (confidence: 0.995, position: 298-308)
   • 'aspirin' (confiden

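The output above includes 'blood pressure' at 0.132 confidence, only just above the 0.1 threshold. If such low-confidence hits are too noisy for a given use case, they can be filtered on the client side. A minimal sketch, where `filter_by_score` is a hypothetical helper and the 0.5 cutoff is an arbitrary illustration rather than a recommended value:

```python
from typing import Any, Dict, List


def filter_by_score(entities: List[Dict[str, Any]], min_score: float = 0.5) -> List[Dict[str, Any]]:
    """Keep entities whose 'score' is at least `min_score`; entities with no score are dropped."""
    return [e for e in entities if e.get("score", 0.0) >= min_score]


# The two medical-condition entities reported in the output above.
detected = [
    {"label": "medical condition", "text": "diabetes", "score": 0.886},
    {"label": "medical condition", "text": "blood pressure", "score": 0.132},
]
print([e["text"] for e in filter_by_score(detected)])  # → ['diabetes']
```

Raising the cutoff trades recall for precision, so it is worth checking against labeled samples before settling on a value.
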
## 🔄 Example 3: Combined PII + PHI Detection


In [73]:
# Sample text with both PII and PHI
combined_text = """
Patient: Sarah Johnson, DOB: 03/15/1985, SSN: 456-78-9012
Address: 5678 Oak Avenue, Chicago, IL 60611
Contact: sarah.j@email.com, (312) 555-9876
Insurance: Health Plan ID HI-555-2024, Policy BC-CH-1985-5678
Current medications: insulin, metformin
Medical condition: Type 2 diabetes, hypertension
Device serial number: GLU-2024-789
"""

print("🔄 **Testing Combined PII + PHI Detection**")
print("📄 Input Text:")
print(combined_text)

# Call the API with combined configuration
response, latency = get_sensitive_information_response(
    combined_text, 
    entity_categories=['PII', 'PHI']
)
entities = extract_entities(response)

# Display results
print_detection_results(entities, latency, combined_text)


🔄 **Testing Combined PII + PHI Detection**
📄 Input Text:

Patient: Sarah Johnson, DOB: 03/15/1985, SSN: 456-78-9012
Address: 5678 Oak Avenue, Chicago, IL 60611
Contact: sarah.j@email.com, (312) 555-9876
Insurance: Health Plan ID HI-555-2024, Policy BC-CH-1985-5678
Current medications: insulin, metformin
Medical condition: Type 2 diabetes, hypertension
Device serial number: GLU-2024-789


🔍 **Detection Results** (⏱️ 1.001s)
📊 **Total Entities Found:** 12

🏷️  **ADDRESS** (1 found):
   • '5678 Oak Avenue, Chicago, IL 60611' (confidence: 0.982, position: 68-102)

🏷️  **DATE OF BIRTH** (1 found):
   • '03/15/1985' (confidence: 0.989, position: 30-40)

🏷️  **EMAIL** (1 found):
   • 'sarah.j@email.com' (confidence: 0.985, position: 112-129)

🏷️  **HEALTH INSURANCE ID NUMBER** (2 found):
   • 'HI-555-2024' (confidence: 0.688, position: 172-183)
   • 'BC-CH-1985-5678' (confidence: 0.411, position: 192-207)

🏷️  **MEDICAL CONDITION** (1 found):
   • 'Type 2 diabetes, hypertension' (confidence: 

## 🎯 Example 4: Custom Entity Detection


In [74]:
# Sample text with custom entities
custom_text = """
Employee ID: EMP-2024-001, Badge Number: BD-789456
Project code: PROJ-AI-2024, Server hostname: srv-prod-01
API key: sk-abc123xyz789, Database connection string: mysql://user:pass@host
Internal ticket: TICK-2024-5678, Customer reference: CUST-VIP-001
"""

# Define custom entities for this organization
custom_entities = [
    'employee id',
    'badge number', 
    'project code',
    'api key',
    'server hostname',
    'database connection',
    'ticket number',
    'customer reference'
]

print("🎯 **Testing Custom Entity Detection**")
print(f"🏷️ Custom entities: {custom_entities}")
print("📄 Input Text:")
print(custom_text)

# Call the API with custom entity configuration
response, latency = get_sensitive_information_response(
    custom_text, 
    entity_categories='Custom Entities',
    custom_entities=custom_entities
)
entities = extract_entities(response)

# Display results
print_detection_results(entities, latency, custom_text)


🎯 **Testing Custom Entity Detection**
🏷️ Custom entities: ['employee id', 'badge number', 'project code', 'api key', 'server hostname', 'database connection', 'ticket number', 'customer reference']
📄 Input Text:

Employee ID: EMP-2024-001, Badge Number: BD-789456
Project code: PROJ-AI-2024, Server hostname: srv-prod-01
API key: sk-abc123xyz789, Database connection string: mysql://user:pass@host
Internal ticket: TICK-2024-5678, Customer reference: CUST-VIP-001


🔍 **Detection Results** (⏱️ 0.737s)
📊 **Total Entities Found:** 8

🏷️  **API KEY** (1 found):
   • 'sk-abc123xyz789' (confidence: 0.957, position: 118-133)

🏷️  **BADGE NUMBER** (1 found):
   • 'BD-789456' (confidence: 0.900, position: 42-51)

🏷️  **CUSTOMER REFERENCE** (1 found):
   • 'CUST-VIP-001' (confidence: 0.979, position: 239-251)

🏷️  **DATABASE CONNECTION** (1 found):
   • 'mysql://user:pass@host' (confidence: 0.908, position: 163-185)

🏷️  **EMPLOYEE ID** (1 found):
   • 'EMP-2024-001' (confidence: 0.940, position: 14

## 📊 Batch Processing for Datasets

For processing multiple texts efficiently, here's a batch processing function:


In [75]:
def process_dataset_batch(
    texts: List[str], 
    entity_categories: str | List[str] = 'PII',
    custom_entities: Optional[List[str]] = None,
    max_records: int = 10
) -> pd.DataFrame:
    """
    Process a batch of texts through the sensitive information guardrail.
    
    Args:
        texts: List of input texts to process
        entity_categories: Entity categories to detect
        custom_entities: Custom entity list if applicable
        max_records: Maximum number of records to process
        
    Returns:
        DataFrame with results including entities, counts, and latency
    """
    results = []
    total_latency = 0.0
    
    print(f"🔄 Processing {min(len(texts), max_records)} texts...")
    
    for i, text in enumerate(texts[:max_records]):
        if not text or not isinstance(text, str):
            continue
            
        print(f"📝 Processing text {i+1}/{min(len(texts), max_records)}...", end=" ")
        
        # Get API response
        response, latency = get_sensitive_information_response(
            text, entity_categories, custom_entities
        )
        entities = extract_entities(response)
        
        # Aggregate by entity type
        entity_counts = {}
        for entity in entities:
            label = entity.get('label', 'unknown')
            entity_counts[label] = entity_counts.get(label, 0) + 1
        
        results.append({
            'text_id': i,
            'text_length': len(text),
            'total_entities': len(entities),
            'entity_counts': entity_counts,
            'entities': entities,
            'latency_seconds': latency,
            'text_preview': text[:100] + '...' if len(text) > 100 else text
        })
        
        total_latency += latency
        print(f"✅ {len(entities)} entities found ({latency:.3f}s)")
    
    df = pd.DataFrame(results)
    
    if len(df) > 0:
        print("\n📊 **Batch Processing Summary:**")
        print(f"   • Processed: {len(df)} texts")
        print(f"   • Total entities: {df['total_entities'].sum()}")
        print(f"   • Average entities per text: {df['total_entities'].mean():.1f}")
        print(f"   • Total latency: {total_latency:.3f}s")
        print(f"   • Average latency: {df['latency_seconds'].mean():.3f}s per text")
    
    return df


# Sample dataset for batch processing
sample_dataset = [
    "Contact John Smith at john.smith@company.com or call (555) 123-4567.",
    "Patient ID: 12345, prescribed metformin for diabetes management.",
    "SSN: 123-45-6789, Credit Card: 4111 1111 1111 1111, CVV: 123",
    "Address: 123 Main St, Anytown, ST 12345. DOB: 01/15/1990",
    "Health insurance: HI-789456, Medical condition: hypertension",
    "API Key: sk-abc123xyz, Employee ID: EMP-001, Badge: BD-789",
    "Email: user@domain.com, Phone: +1-800-555-0199, Website: https://example.com",
    "Passport: X1234567, Driver's License: D987654321, IBAN: GB82WEST12345698765432"
]

print("📊 **Testing Batch Processing with Sample Dataset**")
print(f"📄 Dataset size: {len(sample_dataset)} texts\n")

# Process with PII detection
results_df = process_dataset_batch(
    sample_dataset,
    entity_categories='PII',
    max_records=10,
)


📊 **Testing Batch Processing with Sample Dataset**
📄 Dataset size: 8 texts

🔄 Processing 8 texts...
📝 Processing text 1/8... ✅ 3 entities found (0.634s)
📝 Processing text 2/8... ✅ 0 entities found (0.680s)
📝 Processing text 3/8... ✅ 3 entities found (0.714s)
📝 Processing text 4/8... ✅ 2 entities found (0.715s)
📝 Processing text 5/8... ✅ 0 entities found (0.609s)
📝 Processing text 6/8... ✅ 4 entities found (0.641s)
📝 Processing text 7/8... ✅ 3 entities found (0.696s)
📝 Processing text 8/8... ✅ 3 entities found (0.715s)

📊 **Batch Processing Summary:**
   • Processed: 8 texts
   • Total entities: 18
   • Average entities per text: 2.2
   • Total latency: 5.403s
   • Average latency: 0.675s per text

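`process_dataset_batch` sends one request at a time, so the total latency above is roughly the sum of the per-text latencies. A sketch of issuing the calls from a thread pool instead: `detect_concurrently` is a hypothetical helper, `detect` is any callable shaped like `get_sensitive_information_response` with only the text argument, and it is an assumption that the endpoint tolerates a few concurrent requests (check your rate limits first).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple

DetectFn = Callable[[str], Tuple[Dict[str, Any], float]]


def detect_concurrently(
    texts: List[str],
    detect: DetectFn,
    max_workers: int = 4,
) -> List[Tuple[Dict[str, Any], float]]:
    """Run `detect` over `texts` in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `texts`,
        # regardless of which request finishes first.
        return list(pool.map(detect, texts))
```

For example, `detect_concurrently(sample_dataset, get_sensitive_information_response)` would return one `(response, latency)` pair per text, in the original order.
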

## 🎛️ Configuration Examples

Here are various configuration examples for different use cases:


In [76]:
# Configuration examples for different scenarios
configurations = {
    "financial_screening": {
        "entity_categories": "PII",
        "description": "Detect financial and personal information",
        "use_case": "Banking, insurance, financial services"
    },
    
    "healthcare_compliance": {
        "entity_categories": ["PII", "PHI"],
        "description": "Comprehensive detection for healthcare data",
        "use_case": "HIPAA compliance, medical records, patient data"
    },
    
    "custom_enterprise": {
        "entity_categories": ["PII", "Custom Entities"],
        "custom_entities": ["employee id", "project code", "api key", "server hostname"],
        "description": "Enterprise-specific sensitive data",
        "use_case": "Internal communications, code repositories, documentation"
    },
    
    "comprehensive_audit": {
        "entity_categories": ["PII", "PHI", "Custom Entities"],
        "custom_entities": ["internal id", "system access", "security token"],
        "description": "Full spectrum sensitive information detection",
        "use_case": "Security audits, compliance reviews, data discovery"
    },
}

print("🎛️ **Configuration Examples for Different Use Cases**\n")

for config_name, config in configurations.items():
    print(f"📋 **{config_name.replace('_', ' ').title()}**")
    print(f"   • Entity Categories: {config['entity_categories']}")
    if 'custom_entities' in config:
        print(f"   • Custom Entities: {config['custom_entities']}")
    print(f"   • Description: {config['description']}")
    print(f"   • Use Case: {config['use_case']}")
    
    # Example API call (repr keeps strings quoted and prints lists as valid literals)
    print("   • API Call Example:")
    if 'custom_entities' in config:
        print(f"     get_sensitive_information_response(text, {config['entity_categories']!r}, {config['custom_entities']!r})")
    else:
        print(f"     get_sensitive_information_response(text, {config['entity_categories']!r})")
    print()

print("\n✨ **Quick Reference:**")
print("• PII: 35 entity types (personal, financial, digital identifiers)")
print("• PHI: 7 entity types (medical information, health insurance)")
print("• Custom: User-defined entity types")
print("• Confidence threshold: 0.1 (high sensitivity)")
print("• Max entities returned: 1024 (sorted by confidence)")


🎛️ **Configuration Examples for Different Use Cases**

📋 **Financial Screening**
   • Entity Categories: PII
   • Description: Detect financial and personal information
   • Use Case: Banking, insurance, financial services
   • API Call Example:
     get_sensitive_information_response(text, 'PII')

📋 **Healthcare Compliance**
   • Entity Categories: ['PII', 'PHI']
   • Description: Comprehensive detection for healthcare data
   • Use Case: HIPAA compliance, medical records, patient data
   • API Call Example:
     get_sensitive_information_response(text, ['PII', 'PHI'])

📋 **Custom Enterprise**
   • Entity Categories: ['PII', 'Custom Entities']
   • Custom Entities: ['employee id', 'project code', 'api key', 'server hostname']
   • Description: Enterprise-specific sensitive data
   • Use Case: Internal communications, code repositories, documentation
   • API Call Example:
     get_sensitive_information_response(text, ['PII', 'Custom Entities'], ['employee id', 'project code', 'api k

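To make the configurations above concrete, here is a sketch of turning one entry into the request body sent to the endpoint. The `{'data': {...}}` shape follows the payload built in `get_sensitive_information_response`; `build_payload` itself is a hypothetical helper. Unlike that function, it always sends `entity_categories` explicitly, on the assumption that an explicit `'PII'` behaves the same as omitting the field.

```python
from typing import Any, Dict


def build_payload(text: str, config: Dict[str, Any]) -> Dict[str, Any]:
    """Build the request body for one configuration entry.

    Descriptive keys such as 'description' and 'use_case' are ignored.
    """
    data: Dict[str, Any] = {
        "input": text,
        "entity_categories": config.get("entity_categories", "PII"),
    }
    if config.get("custom_entities"):
        data["custom_entities"] = list(config["custom_entities"])
    return {"data": data}


custom_enterprise = {
    "entity_categories": ["PII", "Custom Entities"],
    "custom_entities": ["employee id", "project code", "api key", "server hostname"],
    "description": "Enterprise-specific sensitive data",
}
print(build_payload("Badge BD-789456", custom_enterprise))
```

Keeping the descriptive fields out of the payload means the same dictionary can serve as both documentation and request configuration.
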
## 📚 Summary and Best Practices

### 🎯 Key Takeaways:

1. **Flexible Configuration**: Choose from PII, PHI, or custom entities based on your use case
2. **High Sensitivity**: A low 0.1 confidence threshold, with results capped at the top 1024 entities by confidence
3. **Comprehensive Coverage**: 35 PII + 7 PHI entity types supported
4. **Easy Integration**: Simple API calls with configurable parameters

### 🛡️ Best Practices:

- **Start with defaults**: Use PII detection first, then add PHI/custom as needed
- **Batch processing**: Process multiple texts efficiently for large datasets
- **Error handling**: Implement retry logic for production systems
- **Monitor performance**: Track latency and detection rates
- **Custom entities**: Define organization-specific sensitive data types

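As one way to track latency and detection rates, the per-call numbers from a batch run can be reduced to a few summary statistics. A minimal sketch using only the standard library: `summarize_run` is a hypothetical helper, the 95th percentile uses the nearest-rank method, and the sample values are the latencies and entity counts printed in the batch output earlier in this notebook.

```python
import math
import statistics
from typing import Any, Dict, List


def summarize_run(latencies: List[float], entity_counts: List[int]) -> Dict[str, Any]:
    """Summarize latency and detection rate for one batch of guardrail calls."""
    if not latencies or len(latencies) != len(entity_counts):
        raise ValueError("latencies and entity_counts must be non-empty and equal in length")
    ordered = sorted(latencies)
    # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean_latency": statistics.mean(ordered),
        "p95_latency": ordered[p95_index],
        "max_latency": ordered[-1],
        # Share of texts in which at least one entity was detected.
        "detection_rate": sum(1 for c in entity_counts if c > 0) / len(entity_counts),
    }


latencies = [0.634, 0.680, 0.714, 0.715, 0.609, 0.641, 0.696, 0.715]
entity_counts = [3, 0, 3, 2, 0, 4, 3, 3]
summary = summarize_run(latencies, entity_counts)
print(summary["detection_rate"])  # → 0.75
```

Logging such a summary per run makes it easier to spot a latency regression or a sudden drop in detections.
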
### 🔗 Next Steps:

- Integrate into your data pipelines
- Set up monitoring and alerting
- Customize entity lists for your domain
- Implement automated redaction workflows
- Establish compliance reporting

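As a starting point for a redaction workflow, the `start` and `end` offsets returned for each entity can be used to replace the detected spans in the original text. A minimal sketch, assuming the offsets are character positions with an exclusive end (as the example outputs suggest); the `redact` helper, the `[LABEL]` placeholder format, and the lowercase labels in the sample are illustrative choices, not Fiddler behavior.

```python
from typing import Any, Dict, List


def redact(text: str, entities: List[Dict[str, Any]]) -> str:
    """Replace each detected span with a bracketed label, e.g. '[EMAIL]'.

    Spans are applied from the end of the string backwards so earlier
    offsets stay valid; a span overlapping one already redacted is skipped.
    """
    redacted = text
    next_start = len(text)  # start of the most recently redacted span
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        start, end = entity["start"], entity["end"]
        if end > next_start:  # overlaps a span handled earlier
            continue
        placeholder = f"[{entity.get('label', 'redacted').upper()}]"
        redacted = redacted[:start] + placeholder + redacted[end:]
        next_start = start
    return redacted


text = "Email: user@domain.com, SSN: 123-45-6789"
entities = [
    {"start": 7, "end": 22, "label": "email"},
    {"start": 29, "end": 40, "label": "ssn"},
]
print(redact(text, entities))  # → Email: [EMAIL], SSN: [SSN]
```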
---

**🚀 Happy detecting! You are now ready to find sensitive information in your text with Fiddler's guardrails.**
