# Molt-Shield LLM Demo

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/batmanvane/molt-shield/blob/main/notebooks/molt_shield_llm_demo.ipynb)

**Real-world example** - This notebook demonstrates a complete workflow where:
1. Proprietary simulation data is sanitized
2. An LLM analyzes the sanitized data
3. LLM suggestions are rehydrated back to original values

This notebook uses a simulated LLM response (real API keys require personal accounts), but the workflow is the same one you'd use with Claude, GPT, or Gemini.
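
To make the three steps concrete before touching Molt-Shield itself, here is a minimal sketch of the same round trip using only the standard library. The `sanitize` and `rehydrate` helpers, the plain `dict` vault, and the `VAL_0001` token format are assumptions for illustration, not Molt-Shield's API.

```python
import re
from itertools import count


def sanitize(text: str, vault: dict[str, str]) -> str:
    """Replace every decimal number in `text` with an opaque token."""
    counter = count(1)

    def _mask(match: re.Match) -> str:
        token = f"VAL_{next(counter):04d}"
        vault[token] = match.group(0)  # remember the original value locally
        return token

    return re.sub(r"\d+(?:\.\d+)?", _mask, text)


def rehydrate(text: str, vault: dict[str, str]) -> str:
    """Swap known tokens back for the values stored in `vault`."""
    return re.sub(r"VAL_\d{4}", lambda m: vault.get(m.group(0), m.group(0)), text)


vault: dict[str, str] = {}
safe = sanitize("stress=920.3 temperature=680.5", vault)
reply = "Consider lowering VAL_0001."  # stand-in for an LLM reply

print(safe)                      # stress=VAL_0001 temperature=VAL_0002
print(rehydrate(reply, vault))   # Consider lowering 920.3.
```

The vault never leaves the local process; only `safe` would be sent to a model.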

## 1. Setup

First, get the Molt-Shield source into the Colab runtime (clone it from GitHub or upload the files), then install the dependencies.

### Option A: Clone from GitHub (recommended)
```python
!git clone https://github.com/batmanvane/molt-shield.git /content/molt-shield
```

### Option B: Upload files manually
Upload these files to Colab:
- `src/` folder
- `config/` folder

### Option C: Install dependencies only
```python
!pip install -q lxml pydantic pyyaml requests
```

In [None]:
# ============================================================
# SETUP: Install dependencies (run this FIRST)
# ============================================================
# Install explicitly so the notebook does not depend on what the Colab image ships with

!pip install -q lxml pydantic pyyaml requests

print("✓ Dependencies installed: lxml, pydantic, pyyaml, requests")

# Clone repository
!git clone https://github.com/batmanvane/molt-shield.git /content/molt-shield

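import sys
# Make both src/ and the repo root importable; the `from src.… import …` lines below need the repo root
sys.path.insert(0, '/content/molt-shield/src')
sys.path.insert(0, '/content/molt-shield')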

import os
os.chdir('/content/molt-shield')

# Import Molt-Shield modules
from src.config import load_config, MaskingConfig, ShufflingConfig
from src.gatekeeper import apply_gatekeeper, mask_values, _apply_tag_shadowing, DEFAULT_TAG_MAP
from src.policy_engine import Policy, Rule
from src.vault import Vault
from pathlib import Path
from lxml import etree
import json

print("✓ Environment ready!")

## 2. Your Proprietary Data

Imagine this is a proprietary turbine blade simulation - highly valuable IP that you don't want to leak.

In [None]:
# Proprietary turbine blade simulation data
PROPRIETARY_XML = """<?xml version="1.0" encoding="UTF-8"?>
<turbine_simulation>
    <metadata>
        <project>CFD-2024-TURBINE-ALPHA</project>
        <engineer>Dr. Smith</engineer>
    </metadata>
    <blade id="blade_001">
        <material>Inconel718</material>
        <stress_mpa>850.5</stress_mpa>
        <temperature_celsius>650.0</temperature_celsius>
        <efficiency>0.92</efficiency>
        <lifespan_hours>25000</lifespan_hours>
    </blade>
    <blade id="blade_002">
        <material>Inconel718</material>
        <stress_mpa>920.3</stress_mpa>
        <temperature_celsius>680.5</temperature_celsius>
        <efficiency>0.89</efficiency>
        <lifespan_hours>18000</lifespan_hours>
    </blade>
    <blade id="blade_003">
        <material>Ti6Al4V</material>
        <stress_mpa>780.0</stress_mpa>
        <temperature_celsius>520.0</temperature_celsius>
        <efficiency>0.94</efficiency>
        <lifespan_hours>35000</lifespan_hours>
    </blade>
</turbine_simulation>"""

# Save original
with open('turbine_data.xml', 'w') as f:
    f.write(PROPRIETARY_XML)

print("=== YOUR PROPRIETARY DATA (NEVER SHARE THIS!) ===")
print(PROPRIETARY_XML)

## 3. Sanitize with Molt-Shield

Now let's sanitize this data so it can be safely shared with an LLM.
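
As a rough picture of what the two sanitization steps do, the sketch below masks numeric element text and renames tags using only `xml.etree.ElementTree`. The `mask_and_shadow` function, its token format, and its `dict` vault are hypothetical stand-ins; Molt-Shield's own `mask_values` and `_apply_tag_shadowing` may behave differently.

```python
import xml.etree.ElementTree as ET


def _is_number(text: str) -> bool:
    """True if `text` parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def mask_and_shadow(xml_text: str, tag_map: dict[str, str]):
    """Return (sanitized_xml, vault) for a small XML string."""
    root = ET.fromstring(xml_text)
    vault: dict[str, str] = {}
    for elem in root.iter():
        text = (elem.text or "").strip()
        if text and _is_number(text):
            token = f"VAL_{len(vault) + 1:04d}"
            vault[token] = text        # original value stays local
            elem.text = token
        # Rename the tag if a neutral alias is configured for it
        elem.tag = tag_map.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode"), vault


sample = "<blade><stress_mpa>920.3</stress_mpa><material>Inconel718</material></blade>"
sanitized, vault = mask_and_shadow(sample, {"stress_mpa": "stress_metric"})
print(sanitized)
# <blade><stress_metric>VAL_0001</stress_metric><material>Inconel718</material></blade>
```

Note that the non-numeric `material` text passes through unchanged in this sketch.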

In [None]:
# Configure the policy
masking_config = MaskingConfig()
shuffling_config = ShufflingConfig(seed=42)

# Parse and mask
xml_tree = etree.fromstring(PROPRIETARY_XML.encode())
vault = Vault('turbine_vault.json')

# Step 1: Mask numeric values
masked_tree = mask_values(xml_tree, masking_config, vault)

# Step 2: Shadow tag names
# Extend the default map for turbine-specific tags
TURBINE_TAG_MAP = {**DEFAULT_TAG_MAP, 
    "stress_mpa": "stress_metric",
    "temperature_celsius": "thermal_reading",
    "efficiency": "performance_factor",
    "lifespan_hours": "operational_duration",
    "material": "alloy_type",
}

_apply_tag_shadowing(masked_tree, TURBINE_TAG_MAP)
shadowed_xml = etree.tostring(masked_tree, encoding='unicode')

# Step 3: Shuffle siblings (optional - for demo we skip this)
sanitized_xml = shadowed_xml

# Save vault for later rehydration
vault.save()

print("=== SANITIZED DATA (SAFE TO SHARE WITH LLM) ===")
print(sanitized_xml)
print(f"\nVault saved with {len(vault)} entries")

## 4. Simulated LLM Response

This is what an LLM like Claude, GPT, or Gemini would return when given the sanitized data.

The LLM sees ONLY the sanitized data and makes suggestions using the masked values.
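
Because the model works purely with tokens, it can also produce tokens that were never issued. A small check like the following sketch can flag those before rehydration. The `unknown_tokens` helper and the sample token names are assumptions for illustration. The pattern follows the `VAL_[a-z0-9]+` regex this notebook uses in its rehydration cell, widened to accept uppercase letters.

```python
import re

TOKEN_PATTERN = re.compile(r"VAL_[A-Za-z0-9]+")


def unknown_tokens(reply: str, known_tokens: set[str]) -> list[str]:
    """Return tokens found in `reply` that are absent from `known_tokens`, sorted."""
    return sorted(set(TOKEN_PATTERN.findall(reply)) - known_tokens)


known = {"VAL_a1b2", "VAL_c3d4"}  # e.g. the keys of your vault
reply = "Reduce VAL_a1b2 and compare it with VAL_zzzz."
print(unknown_tokens(reply, known))  # ['VAL_zzzz']
```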

In [None]:
# Simulated LLM response (in real usage, you'd call an API)
LLM_RESPONSE = """
Based on the sanitized turbine blade data, here are my optimization suggestions:

**Analysis:**
- blade_001 has moderate stress (VAL_xxx1) and good efficiency (VAL_xxx2)
- blade_002 shows high stress (VAL_xxx3) which is concerning
- blade_003 has the best efficiency but moderate thermal readings

**Recommendations:**
1. For blade_002: Reduce stress_metric below VAL_xxx1 by 15%
2. Consider material change for blade_002 to match blade_003 alloy_type
3. Target operational_duration > 20000 hours - blade_002 needs improvement

**Specific changes suggested:**
- stress_metric: VAL_xxx3 → 750.0 (reduce from current high)
- thermal_reading: VAL_xxx4 → 600.0 (lower for longevity)
- performance_factor: VAL_xxx5 → 0.91 (optimize)
"""

# In JSON format (as you'd get from an LLM API)
LLM_SUGGESTIONS = {
    "blade_002": {
        "stress_metric": "VAL_xxx1",  # LLM uses masked value reference
        "thermal_reading": "VAL_xxx4",
        "performance_factor": "VAL_xxx5",
        "alloy_type": "Ti6Al4V"  # This wasn't masked, LLM can see it
    },
    "recommendations": [
        "reduce stress by 15%",
        "increase operational_duration above 20000"
    ]
}

print("=== LLM RESPONSE (USING MASKED VALUES) ===")
print(LLM_RESPONSE)
print("\n=== LLM SUGGESTIONS (JSON) ===")
print(json.dumps(LLM_SUGGESTIONS, indent=2))

## 5. Rehydration

Now we map the LLM's suggestions back to original values using the vault.
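
Conceptually, rehydration is a token lookup applied to every string in the LLM's output. The sketch below shows one way to do that for nested JSON-like data. `rehydrate_nested` and the plain `dict` mapping are hypothetical and are not the implementation behind `vault.rehydrate_dict`.

```python
import re

TOKEN = re.compile(r"VAL_[a-z0-9]+")


def rehydrate_nested(obj, mapping: dict[str, str]):
    """Recursively replace known tokens inside strings, lists and dicts."""
    if isinstance(obj, str):
        # Unknown tokens are left untouched rather than dropped
        return TOKEN.sub(lambda m: mapping.get(m.group(0), m.group(0)), obj)
    if isinstance(obj, list):
        return [rehydrate_nested(item, mapping) for item in obj]
    if isinstance(obj, dict):
        return {key: rehydrate_nested(value, mapping) for key, value in obj.items()}
    return obj  # numbers, booleans, None pass through


mapping = {"VAL_ab12": "920.3"}
suggestions = {
    "blade_002": {
        "stress_metric": "VAL_ab12",
        "notes": ["keep below VAL_ab12", "VAL_zz99"],
    }
}
print(rehydrate_nested(suggestions, mapping))
# {'blade_002': {'stress_metric': '920.3', 'notes': ['keep below 920.3', 'VAL_zz99']}}
```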

In [None]:
# Load the vault
vault = Vault('turbine_vault.json')
vault.load()

# Rehydrate the suggestions
print("=== REHYDRATION ===\n")

# Find a masked value from our vault
sample_masked = list(vault.entries.keys())[0]
sample_original = vault.restore(sample_masked)

print("Example mapping:")
print(f"  LLM sees: {sample_masked}")
print(f"  Actually means: {sample_original}")

# Rehydrate the JSON suggestions
rehydrated_suggestions = vault.rehydrate_dict(LLM_SUGGESTIONS)

print("\n=== REHYDRATED SUGGESTIONS (BACK TO ORIGINAL VALUES) ===")
print(json.dumps(rehydrated_suggestions, indent=2))

# Rehydrate text response (for demonstration)
rehydrated_text = vault.rehydrate_xml(LLM_RESPONSE)
print("\n=== SAMPLE OF REHYDRATED TEXT ===")
# Show how a few tokens from the LLM's text map back to original values.
# Search the original response: after rehydration, known tokens are already replaced.
import re
matches = re.findall(r'VAL_[a-z0-9]+', LLM_RESPONSE)
for m in sorted(set(matches))[:3]:
    original = vault.restore(m)
    print(f"  {m} → {original}")

## 6. Complete Workflow

Here's the full pipeline in practice:

In [None]:
# Full pipeline using apply_gatekeeper
policy = Policy(
    version="1.0",
    global_masking=True,
    rules=[
        Rule(tag_pattern="stress_mpa", action="mask_value"),
        Rule(tag_pattern="temperature_celsius", action="mask_value"),
        Rule(tag_pattern="efficiency", action="mask_value"),
        Rule(tag_pattern="lifespan_hours", action="mask_value"),
    ]
)

config = load_config('config/default.yaml')
sanitized_path, vault_path = apply_gatekeeper(Path('turbine_data.xml'), policy, config)

print("=== COMPLETE PIPELINE ===\n")
print("Input: turbine_data.xml")
print(f"Sanitized output: {sanitized_path}")
print(f"Vault: {vault_path}")

print("\n--- SANITIZED CONTENT ---")
print(sanitized_path.read_text())

## Using with Real LLM APIs

To use with a real LLM, you would:

### Option 1: Claude Desktop (MCP)
Add the server to the Claude Desktop configuration file (`claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "molt-shield": {
      "command": "python",
      "args": ["-m", "src.server"]
    }
  }
}
```

### Option 2: OpenAI API
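```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": f"Analyze this sanitized data: {sanitized_xml}"}
    ]
)
reply = response.choices[0].message.content
# Then rehydrate the reply using the vault, e.g. vault.rehydrate_xml(reply)
```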

### Option 3: Gemini API
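```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel('gemini-pro')  # substitute a model available to your account
response = model.generate_content(f"Analyze: {sanitized_xml}")
reply = response.text
# Then rehydrate the reply using the vault
```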
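
Whichever provider you pick, the surrounding steps stay the same: build a prompt from the sanitized XML, call the model, rehydrate the reply. The sketch below captures that shape with plain callables so the provider can be swapped. `analyze`, `fake_llm`, and `fake_rehydrate` are hypothetical names for illustration. In practice you would pass a function that wraps your API client and one that wraps the vault.

```python
from typing import Callable


def analyze(
    sanitized_xml: str,
    call_llm: Callable[[str], str],
    rehydrate: Callable[[str], str],
) -> str:
    """Send sanitized data to any LLM callable and rehydrate its reply."""
    prompt = (
        "Analyze this sanitized data. Refer to values only by their "
        f"VAL_ tokens:\n{sanitized_xml}"
    )
    return rehydrate(call_llm(prompt))


# Stand-ins so the sketch runs without network access or API keys
def fake_llm(prompt: str) -> str:
    return "Reduce VAL_ab12 by 15%."


def fake_rehydrate(text: str) -> str:
    return text.replace("VAL_ab12", "920.3")


result = analyze("<stress_metric>VAL_ab12</stress_metric>", fake_llm, fake_rehydrate)
print(result)  # Reduce 920.3 by 15%.
```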

## Summary

The Molt-Shield workflow:

```
1. YOUR DATA → 2. SANITIZE → 3. LLM → 4. REHYDRATE → 5. ORIGINAL
   (private)      (safe to share)   (analyzes)    (maps back)
```

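Masked values reach the LLM only as placeholder tokens, and the vault that maps them back stays under your control. Fields that are not masked, such as the material names in this demo, are sent as-is, so review the sanitized output before sharing it.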
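
One way to back this up is a quick check, before sending anything, that no sensitive original value survives verbatim in the sanitized text. The sketch below is a hypothetical helper, not part of Molt-Shield. It uses a plain substring test, so very short values may produce false positives.

```python
def find_leaks(original_values: list[str], sanitized_text: str) -> list[str]:
    """Return the original values that still appear verbatim in the sanitized text."""
    return [value for value in original_values if value in sanitized_text]


originals = ["920.3", "680.5", "Inconel718"]
sanitized = "<stress_metric>VAL_ab12</stress_metric><alloy_type>Inconel718</alloy_type>"
print(find_leaks(originals, sanitized))  # ['Inconel718']
```

An empty list means none of the listed values were found. It does not prove the output is free of other identifying details.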