# Using an LLM as a PII Data Judge for Secure Logging

This notebook demonstrates an approach to detecting PII (Personally Identifiable Information) in application logs. Instead of relying only on rigid regular expressions (regex), we use a Large Language Model (LLM) as a "judge" that classifies each log message as containing sensitive data or not.

### Core Concepts:
1.  **LLM as Judge**: We prompt the LLM with a specific set of instructions to classify a log message as either containing PII or not.
2.  **Context-Awareness**: The LLM can understand context that regex cannot (e.g., distinguishing a name like "John Smith" from a product name).
3.  **`logging.Filter` Integration**: We integrate this LLM judge into Python's standard `logging` module for a clean, practical implementation.
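
To make the contrast with regex concrete, here is a minimal sketch of a pattern-based detector. The two patterns and the `regex_flags_pii` helper are illustrative assumptions and are not part of this notebook's pipeline. They only show that fixed patterns catch well-formatted identifiers but have nothing to match against a free-form name.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def regex_flags_pii(message: str) -> bool:
    """Return True if any of the fixed patterns matches the message."""
    return bool(EMAIL_RE.search(message) or PHONE_RE.search(message))

# Well-formatted identifiers are matched by the patterns.
print(regex_flags_pii("User login failed for jane.doe@example.com."))
print(regex_flags_pii("Customer support called user at 555-867-5309."))
# A personal name has no fixed shape, so neither pattern matches.
print(regex_flags_pii("Processing request for customer Johnathan Smith."))
```

The first two calls return `True` and the third returns `False`, which is the gap the LLM judge is meant to cover.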

### ⚠️ **Important Disclaimer: Production Readiness** ⚠️
This is a proof-of-concept. Using an LLM for **synchronous, real-time log filtering** is generally not feasible due to:
- **Latency**: API calls can take seconds, which would cripple application performance.
- **Cost**: Each log line would become an API call, leading to high costs.
- **Reliability**: Network or API failures would break your logging.

**Realistic Use Cases**: This approach is better suited for **asynchronous log post-processing**, security auditing, or training a smaller, local model.
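
One way to move the slow check off the application's hot path is the standard library's `QueueHandler` / `QueueListener` pair. The sketch below is an assumption about how that could look, not something this notebook implements: `stub_judge`, `JudgeFilter`, and `ListHandler` are hypothetical names, and the stub stands in for the real LLM call.

```python
import logging
import logging.handlers
import queue

def stub_judge(message: str) -> bool:
    """Hypothetical stand-in for the LLM call; flags anything containing '@'."""
    return "@" in message

class JudgeFilter(logging.Filter):
    """Drops records the judge flags. Runs on the listener thread, not the caller's."""
    def filter(self, record: logging.LogRecord) -> bool:
        return not stub_judge(record.getMessage())

class ListHandler(logging.Handler):
    """Collects messages in memory so the result is easy to inspect."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())

log_queue = queue.Queue()
sink = ListHandler()
sink.addFilter(JudgeFilter())

# The listener drains the queue on a background thread and applies the slow check there.
listener = logging.handlers.QueueListener(log_queue, sink)
listener.start()

app_logger = logging.getLogger("async_sketch")
app_logger.setLevel(logging.INFO)
app_logger.propagate = False
# The application thread only pays the cost of putting a record on the queue.
app_logger.addHandler(logging.handlers.QueueHandler(log_queue))

app_logger.info("Service health check passed.")
app_logger.info("Login failed for jane.doe@example.com.")

listener.stop()  # Processes the records still queued, then joins the background thread.
print(sink.messages)
```

Only the health-check message reaches `sink.messages`. The latency and cost concerns still apply to the listener thread, but the caller of `app_logger.info` no longer waits on them.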

### Cell 1: Setup and API Key Configuration

First, we install the `google-generativeai` package and configure our API key.

**Action Required**: You must add your Gemini API key to Colab's secrets manager with the name `GEMINI_API_KEY`.

In [None]:
!pip install -q google-generativeai

import logging
import google.generativeai as genai
from google.colab import userdata
from enum import Enum, auto

# --- Configure the Gemini API ---
try:
    # Use the userdata API to securely access the secret.
    GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
    genai.configure(api_key=GEMINI_API_KEY)
    print("✅ Gemini API configured successfully.")
except userdata.SecretNotFoundError:
    print("🛑 Error: Secret 'GEMINI_API_KEY' not found.")
    print("Please add your API key to the Colab Secrets manager (key icon on the left) with that name.")
except Exception as e:
    print(f"An error occurred: {e}")

✅ Gemini API configured successfully.


### Cell 2: The LLM PII "Judge" Logic

Here we define the core function that communicates with the LLM. The prompt is written to elicit a simple, machine-parsable response (`PII_DETECTED` or `NO_PII`), although the model is not guaranteed to comply.

We also add a simple cache to avoid re-analyzing identical log messages, which saves time and money.

In [None]:
# A simple in-memory cache to avoid repeated API calls for the same message
llm_pii_cache = {}

def is_pii_detected_by_llm(log_message: str) -> bool:
    """Asks the LLM to judge if a log message contains PII."""
    if log_message in llm_pii_cache:
        return llm_pii_cache[log_message]

    model = genai.GenerativeModel('gemini-2.0-flash')

    # The prompt is critical. It defines the role, task, and expected output format.
    prompt = f"""
    You are a highly-trained security and privacy analysis model. Your task is to determine if the following log message contains Personally Identifiable Information (PII).
    PII includes, but is not limited to: full names, email addresses, phone numbers, Social Security Numbers (SSNs), credit card numbers, physical addresses, driver's license numbers, or specific medical information.

    Analyze the following log message:
    '{log_message}'

    Respond with ONLY ONE of the following two words: PII_DETECTED or NO_PII.
    Do not provide any explanation, punctuation, or other text.
    """

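    try:
        response = model.generate_content(prompt)
        result_text = response.text.strip()

        # Fail closed: anything other than an exact NO_PII verdict is treated as PII,
        # so a malformed or unexpected reply cannot let sensitive data through.
        is_pii = (result_text != 'NO_PII')
        llm_pii_cache[log_message] = is_pii # Cache the result
        return is_pii
    except Exception as e:
        # Fail-safe: If the API call fails, assume the message might contain PII and report the error.
        # The failure is not cached, so a transient outage does not block this message forever.
        print(f"LLM API call failed: {e}. Defaulting to PII detected.")
        return True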

print("LLM PII judge function defined.")

LLM PII judge function defined.
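
The dictionary cache above grows without limit in a long-running process. As a hedged alternative, the sketch below bounds the cache with `functools.lru_cache`. The `stub_llm_verdict` function is a hypothetical stand-in for the API call, and the `maxsize` value is an arbitrary illustrative choice.

```python
from functools import lru_cache

call_count = 0  # Counts how often the (stubbed) judge is actually invoked.

def stub_llm_verdict(message: str) -> bool:
    """Hypothetical stand-in for the API call; flags anything containing '@'."""
    global call_count
    call_count += 1
    return "@" in message

@lru_cache(maxsize=1024)  # Arbitrary bound; least recently used entries are evicted first.
def cached_verdict(message: str) -> bool:
    return stub_llm_verdict(message)

cached_verdict("Login failed for jane.doe@example.com.")
cached_verdict("Login failed for jane.doe@example.com.")  # Served from the cache.
cached_verdict("Health check OK.")

print(call_count)
print(cached_verdict.cache_info())
```

Three lookups lead to two calls of the underlying function. `lru_cache` stores return values only, so a call that raises an exception is not cached and will be retried on the next lookup.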


### Cell 3: The Custom `logging.Filter`

This class integrates our LLM judge into the logging pipeline. When the LLM detects PII, the filter acts according to its policy: `BLOCK` returns `False` so the record is dropped, while `WARN_AND_LOG` prefixes the message with a warning and still emits it, PII included.

In [None]:
class Policy(Enum):
    """Defines the action to take when PII is detected by the LLM."""
    WARN_AND_LOG = auto()
    BLOCK = auto()

class PiiFilterLLM(logging.Filter):
    """A logging filter that uses an LLM to detect PII."""
    def __init__(self, policy: Policy = Policy.BLOCK):
        super().__init__()
        self.policy = policy

    def filter(self, record: logging.LogRecord) -> bool:
        original_message = record.getMessage()

        if is_pii_detected_by_llm(original_message):
            if self.policy == Policy.BLOCK:
                # By returning False, we stop the log from being processed further.
                return False
            elif self.policy == Policy.WARN_AND_LOG:
                # Modify the message in-place to add a security warning.
                # original_message already has its arguments interpolated, so clear args.
                record.msg = f"[PII DETECTED BY LLM] {original_message}"
                record.args = ()

        # Return True to allow the (potentially modified) record to be logged.
        return True

print("PiiFilterLLM class defined.")

PiiFilterLLM class defined.
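
The two policies can be compared without any API calls by swapping in a stub judge. This is a sketch under stated assumptions: `stub_judge`, `StubPiiFilter`, `ListHandler`, and `run` are hypothetical names introduced here, and the filter mirrors the control flow of `PiiFilterLLM` rather than reusing it.

```python
import logging
from enum import Enum, auto

class Policy(Enum):
    WARN_AND_LOG = auto()
    BLOCK = auto()

def stub_judge(message: str) -> bool:
    """Hypothetical stand-in for is_pii_detected_by_llm; flags anything containing '@'."""
    return "@" in message

class StubPiiFilter(logging.Filter):
    """Same control flow as PiiFilterLLM, but with the judge passed in."""
    def __init__(self, judge, policy: Policy = Policy.BLOCK):
        super().__init__()
        self.judge = judge
        self.policy = policy

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        if self.judge(message):
            if self.policy == Policy.BLOCK:
                return False  # Drop the record entirely.
            # WARN_AND_LOG: keep the record but mark it.
            record.msg = f"[PII DETECTED BY LLM] {message}"
            record.args = ()
        return True

class ListHandler(logging.Handler):
    """Collects emitted messages so both policies can be compared."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())

def run(policy: Policy) -> list:
    handler = ListHandler()
    handler.addFilter(StubPiiFilter(stub_judge, policy))
    demo_logger = logging.getLogger(f"policy_sketch.{policy.name}")
    demo_logger.setLevel(logging.INFO)
    demo_logger.propagate = False
    demo_logger.handlers.clear()
    demo_logger.addHandler(handler)
    demo_logger.info("Health check OK.")
    demo_logger.info("Login failed for %s.", "jane.doe@example.com")
    return handler.messages

print(run(Policy.BLOCK))
print(run(Policy.WARN_AND_LOG))
```

Under `BLOCK` only the health-check line is emitted. Under `WARN_AND_LOG` both lines are emitted and the second carries the `[PII DETECTED BY LLM]` prefix, so the sensitive value is still written to the log.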


### Cell 4: Demonstration with 10 Test Cases

Now we'll configure a logger with our `PiiFilterLLM` and test it against a variety of log messages. The first run is slow because every distinct message requires an API call. Repeated messages within the same session are answered from the cache and return almost immediately.

The policy is set to `BLOCK`, so any message the LLM identifies as containing PII will be completely suppressed.

In [None]:
# --- Logger Setup ---
logger = logging.getLogger("LLM_PII_SecureApp")
logger.setLevel(logging.INFO)
logger.propagate = False # Crucial to prevent duplicate output in Colab

# Clear any previous handlers to ensure a clean run
if logger.hasHandlers():
    logger.handlers.clear()

handler = logging.StreamHandler()
formatter = logging.Formatter('%(levelname)s - %(message)s')
handler.setFormatter(formatter)

# --- Add the LLM Filter ---
llm_filter = PiiFilterLLM(policy=Policy.BLOCK)
handler.addFilter(llm_filter)
logger.addHandler(handler)

# --- 10 Test Examples ---
test_cases = [
    # 1. Obvious PII (Regex would also catch)
    "User login failed for jane.doe@example.com.",
    # 2. Obvious PII (Regex would also catch)
    "Customer support called user at 555-867-5309.",
    # 3. Subtle PII (Harder for Regex)
    "Processing request for customer Johnathan Smith.",
    # 4. Subtle PII (Very hard for Regex)
    "Shipment #54321 failed for delivery to 123 Oak Avenue, Springfield, IL.",
    # 5. Benign - No PII
    "Service health check passed with status 200 OK.",
    # 6. Benign - Technical ID
    "Failed to retrieve record with correlation_id=f47ac10b-58cc-4372-a567-0e02b2c3d479.",
    # 7. Benign - Technical Error
    "SQL error: new row for relation \"users\" violates check constraint \"users_email_check\"",
    # 8. Tricky PII (Name in context)
    "Note from support ticket T5678: Spoke with Bob Johnson regarding the outage.",
    # 9. Tricky Negative (Looks like PII but lacks specifics)
    "User mentioned they lived on a street named 'Elm' but did not give a full address.",
    # 10. Financial PII
    "Transaction declined for Visa card ending in 4242."
]

print("--- Running logs through LLM Filter (Policy: BLOCK) ---")
print("NOTE: Any message judged as PII will be silently dropped.\n")

for i, message in enumerate(test_cases):
    print(f"--- Test Case #{i+1} ---")
    logger.info(message)

print("\n--- Caching Demonstration ---")
print("Re-running a sensitive log. This should be much faster and still be blocked.")
logger.info("Processing request for customer Johnathan Smith.")

print("\n--- Log Analysis Complete ---")

--- Running logs through LLM Filter (Policy: BLOCK) ---
NOTE: Any message judged as PII will be silently dropped.

--- Test Case #1 ---
--- Test Case #2 ---
--- Test Case #3 ---
--- Test Case #4 ---
--- Test Case #5 ---
INFO - Service health check passed with status 200 OK.
--- Test Case #6 ---
INFO - Failed to retrieve record with correlation_id=f47ac10b-58cc-4372-a567-0e02b2c3d479.
--- Test Case #7 ---
INFO - SQL error: new row for relation "users" violates check constraint "users_email_check"
--- Test Case #8 ---
--- Test Case #9 ---
INFO - User mentioned they lived on a street named 'Elm' but did not give a full address.
--- Test Case #10 ---
INFO - Transaction declined for Visa card ending in 4242.

--- Caching Demonstration ---
Re-running a sensitive log. This should be much faster and still be blocked.

--- Log Analysis Complete ---
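
The run above can be scored against the intent stated in the `test_cases` comments. The sketch below hard-codes both sides: `expected_pii` reflects those comments, and `blocked` lists the test cases that produced no `INFO` line in the output above. Both names are introduced here for illustration, and a different run of the model may give different verdicts.

```python
# Hand labels taken from the comments in test_cases (True = intended to contain PII).
expected_pii = {1: True, 2: True, 3: True, 4: True, 5: False,
                6: False, 7: False, 8: True, 9: False, 10: True}

# Test cases that produced no INFO line in the output above, i.e. were blocked.
blocked = {1, 2, 3, 4, 8}

# A mismatch is a case where the judge's verdict differs from the hand label.
mismatches = [n for n, label in expected_pii.items() if label != (n in blocked)]
accuracy = 1 - len(mismatches) / len(expected_pii)

print(f"accuracy: {accuracy:.0%}")
print(f"mismatched test cases: {mismatches}")
```

For this run the judge agrees with the labels on 9 of 10 cases. The one disagreement is test case 10, which is labeled "Financial PII" but was logged. Whether the last four digits of a card number count as PII is a policy decision that the prompt would need to spell out.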
