# Intro
In this notebook, we showcase how to fine-tune the Qwen3-1.7B model on AWS Trainium using the Hugging Face Optimum Neuron library.
The goal of this task is Text-to-SQL generation — training the model to translate natural language questions into executable SQL queries.

We will fine-tune the model using `optimum.neuron`, save the trained checkpoint, compile it for Neuron devices, and then run inference through vLLM (installed via the `optimum-neuron[vllm]` extra), enabling high-performance, low-latency Text-to-SQL generation.

By the end of this notebook, you’ll have a fine-tuned, Trainium-optimized Qwen3 model ready for deployment and real-time inference. This workflow demonstrates how to leverage the Optimum Neuron toolchain to efficiently train and serve large language models on AWS Neuron devices.

For this module, you will use the [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset, which consists of thousands of examples. Each example pairs a SQL schema, a question about that schema, and a SQL query intended to answer the question.

*Dataset example 1:*
* *SQL schema/context:* `CREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)`
* *Question:* `How many departments are led by heads who are not mentioned?`
* *SQL query/answer:* `SELECT COUNT(*) FROM department WHERE NOT department_id IN (SELECT department_id FROM management)`

*Dataset example 2:*
* *SQL schema/context:* `CREATE TABLE courses (course_name VARCHAR, course_id VARCHAR); CREATE TABLE student_course_registrations (student_id VARCHAR, course_id VARCHAR)`
* *Question:* `What are the ids of all students for courses and what are the names of those courses?`
* *SQL query/answer:* `SELECT T1.student_id, T2.course_name FROM student_course_registrations AS T1 JOIN courses AS T2 ON T1.course_id = T2.course_id`

After fine-tuning on several thousand of these text-to-SQL examples, the model learns to generate an appropriate SQL query when presented with a SQL context and a free-form question.
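
To make the training format concrete, here is a minimal sketch that turns one dataset row into chat messages. It assumes each row exposes `context`, `question`, and `answer` fields (the three parts shown in the examples above) and reuses the system prompt from the inference cell later in this notebook. `to_messages` is a hypothetical helper; the formatting actually applied inside `finetune_model.py` may differ.

```python
# System prompt copied from the inference cell; {schema} is filled per example.
SYSTEM_PROMPT = (
    "You are a text to SQL query translator. Users will ask you questions in English "
    "and you will generate a SQL query based on the provided SCHEMA.\n"
    "SCHEMA:\n{schema}"
)

def to_messages(sample):
    """Convert one dataset row into system/user/assistant chat messages."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT.format(schema=sample["context"])},
        {"role": "user", "content": sample["question"]},
        {"role": "assistant", "content": sample["answer"]},
    ]

# Dataset example 1 from above, written as a plain dict (assumed field names).
sample = {
    "context": "CREATE TABLE management (department_id VARCHAR); "
               "CREATE TABLE department (department_id VARCHAR)",
    "question": "How many departments are led by heads who are not mentioned?",
    "answer": "SELECT COUNT(*) FROM department WHERE NOT department_id IN "
              "(SELECT department_id FROM management)",
}

messages = to_messages(sample)
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The assistant turn carries the target SQL, so a supervised fine-tuning run teaches the model to produce the query given the schema and question.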

This text-to-SQL use case was selected so you can fine-tune your model in a reasonably short time (~25 minutes), which suits this workshop. Although the use case is relatively simple, the same techniques and components can be applied to fine-tune LLMs for more advanced tasks such as writing code, summarizing documents, or creating blog posts.

# Install requirements
This notebook uses [Hugging Face Optimum Neuron](https://github.com/huggingface/optimum-neuron), which acts as an interface between the Hugging Face Transformers library and AWS accelerators, including AWS Trainium and AWS Inferentia. We will also install supporting libraries such as `peft`, `trl`, `huggingface_hub`, and `datasets`.


In [1]:
%cd /home/ubuntu/environment/FineTuning/HuggingFaceExample/01_finetuning/assets
%pip install -r requirements.txt

/home/ubuntu/environment/FineTuning/HuggingFaceExample/01_finetuning/assets




Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting optimum-neuron==0.3.0 (from -r requirements.txt (line 1))
  Downloading optimum_neuron-0.3.0-py3-none-any.whl.metadata (16 kB)
Collecting peft==0.16.0 (from -r requirements.txt (line 2))
  Downloading peft-0.16.0-py3-none-any.whl.metadata (14 kB)
Collecting trl==0.11.4 (from -r requirements.txt (line 3))
  Downloading trl-0.11.4-py3-none-any.whl.metadata (12 kB)
Collecting huggingface_hub==0.33.4 (from -r requirements.txt (line 4))
  Downloading huggingface_hub-0.33.4-py3-none-any.whl.metadata (14 kB)
Collecting datasets==3.6.0 (from -r requirements.txt (line 5))
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting accelerate==1.8.1 (from optimum-neuron==0.3.0->-r requirements.txt (line 1))
  Downloading accelerate-1.8.1-py3-none-any.whl.metadata (19 kB)
Collecting optimum~=1.24.0 (from optimum-neuron==0.3.0->-r requirements.txt (line 1))
  Downloading optimum-1.24.0-py3-

# Fine-tuning

In this section, we fine-tune the Qwen3-1.7B model on the Text-to-SQL task using Hugging Face Optimum Neuron. The training command takes the parameters listed below. The first two are `torchrun` options; the rest are passed to `finetune_model.py`.

1. `--nnodes`: Number of nodes (1 = single node).
2. `--nproc_per_node`: Processes per node (usually equals the number of devices).
3. `--model_id, --tokenizer_id`: Model and tokenizer identifiers (Hugging Face Hub ID or local path).
4. `--output_dir`: Directory for saving checkpoints and logs.
5. `--bf16`: Enables bfloat16 precision for faster, memory-efficient training.
6. `--gradient_checkpointing`: Saves memory by recomputing activations during backpropagation.
7. `--gradient_accumulation_steps`: Number of steps to accumulate gradients before each optimizer update.
8. `--learning_rate`: Initial learning rate.
9. `--max_steps`: Total number of training steps.
10. `--per_device_train_batch_size`: Batch size per device.
11. `--tensor_parallel_size`: Number of devices used for tensor parallelism.
12. `--lora_r, --lora_alpha, --lora_dropout`: LoRA hyperparameters: rank, scaling, and dropout rate.
13. `--dataloader_drop_last`: Drops the last incomplete batch.
14. `--disable_tqdm`: Disables the progress bar.
15. `--logging_steps`: Logging interval (in steps).
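
As a rough sanity check on how these parameters interact, the sketch below computes how many samples one optimizer step consumes for the values used in the command in this section. It assumes the common convention that the data-parallel degree is the number of processes divided by the tensor-parallel size, and that `per_device_train_batch_size` applies per data-parallel replica; `training_budget` is a hypothetical helper, not part of Optimum Neuron. The LoRA scale uses the standard `alpha / r` definition.

```python
def training_budget(nproc_per_node, tensor_parallel_size, per_device_train_batch_size,
                    gradient_accumulation_steps, max_steps, nnodes=1):
    """Estimate samples per optimizer step and over the whole run (assumed convention)."""
    world_size = nnodes * nproc_per_node
    if world_size % tensor_parallel_size != 0:
        raise ValueError("world size must be divisible by tensor_parallel_size")
    # Processes in one tensor-parallel group share a model replica.
    data_parallel_size = world_size // tensor_parallel_size
    samples_per_step = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * data_parallel_size)
    return {
        "data_parallel_size": data_parallel_size,
        "samples_per_step": samples_per_step,
        "total_samples": samples_per_step * max_steps,
    }

# Values taken from the torchrun command in this section.
budget = training_budget(nproc_per_node=2, tensor_parallel_size=2,
                         per_device_train_batch_size=2,
                         gradient_accumulation_steps=1, max_steps=1000)
print(budget)

# Standard LoRA scaling factor applied to the low-rank update.
lora_scale = 32 / 16  # lora_alpha / lora_r
print("LoRA scale:", lora_scale)
```

Under these assumptions the two processes form a single tensor-parallel replica, so raising `--gradient_accumulation_steps` is the simplest way to increase the effective batch size without using more memory.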

In [2]:
!torchrun \
  --nnodes 1 \
  --nproc_per_node 2 \
  finetune_model.py \
  --model_id Qwen/Qwen3-1.7B \
  --tokenizer_id Qwen/Qwen3-1.7B \
  --output_dir ~/environment/ml/qwen \
  --bf16 True \
  --gradient_checkpointing True \
  --gradient_accumulation_steps 1 \
  --learning_rate 5e-5 \
  --max_steps 1000 \
  --per_device_train_batch_size 2 \
  --tensor_parallel_size 2 \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --dataloader_drop_last True \
  --disable_tqdm True \
  --logging_steps 10

W1108 18:28:50.941000 41695 torch/distributed/run.py:766] 
W1108 18:28:50.941000 41695 torch/distributed/run.py:766] *****************************************
W1108 18:28:50.941000 41695 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1108 18:28:50.941000 41695 torch/distributed/run.py:766] *****************************************
  from .mappings import (
  component, error = import_nki(config)

# Compilation

After fine-tuning, the next step is to compile the trained model for inference on AWS Neuron devices using the Hugging Face Optimum Neuron toolchain. The export command below reads the `merged_model` directory under the training output directory.
Neuron compilation optimizes the model graph and converts it into the Neuron Executable File Format (NEFF), enabling efficient execution on NeuronCores.

In [3]:
!optimum-cli export neuron \
  --model /home/ubuntu/environment/ml/qwen/merged_model \
  --task text-generation \
  --sequence_length 512 \
  --batch_size 1 \
  /home/ubuntu/environment/ml/qwen/compiled_model

  from pkg_resources import get_distribution
  from .mappings import (
  component, error = import_nki(config)
  from ..attention.gqa import (
  from ..backend.modules.attention.attention_base import NeuronAttentionBase
INFO:Neuron:Generating HLOs for the following models: ['context_encoding_model', 'token_generation_model']
[2025-11-08 18:49:20.189: I neuronx_distributed/parallel_l

# Inference

First, install the vLLM extra of Optimum Neuron (`optimum-neuron[vllm]`). Then run inference with the compiled model.

In [4]:
%pip install optimum-neuron[vllm]


Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting vllm==0.9.2 (from optimum-neuron[vllm])
  Downloading vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl.metadata (15 kB)
Collecting cachetools (from vllm==0.9.2->optimum-neuron[vllm])
  Downloading cachetools-6.2.1-py3-none-any.whl.metadata (5.5 kB)
Collecting blake3 (from vllm==0.9.2->optimum-neuron[vllm])
  Downloading blake3-1.0.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Collecting py-cpuinfo (from vllm==0.9.2->optimum-neuron[vllm])
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting openai<=1.90.0,>=1.52.0 (from vllm==0.9.2->optimum-neuron[vllm])
  Downloading openai-1.90.0-py3-none-any.whl.metadata (26 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm==0.9.2->optimum-neuron[vllm])
  Downloading prometheus_fastapi_instrumentator-7.1.0-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from vllm==0.9.2->op

In [5]:
import os
from vllm import LLM, SamplingParams
llm = LLM(
    model="/home/ubuntu/environment/ml/qwen/compiled_model", #local compiled model
    max_num_seqs=1,
    max_model_len=2048,
    device="neuron",
    tensor_parallel_size=2,
    override_neuron_config={})
example1="""
<|im_start|>system
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)<|im_end|>
<|im_start|>user
How many departments are led by heads who are not mentioned?<|im_end|>
<|im_start|>assistant
"""
example2="""
<|im_start|>system
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE courses (course_name VARCHAR, course_id VARCHAR); CREATE TABLE student_course_registrations (student_id VARCHAR, course_id VARCHAR)<|im_end|>
<|im_start|>user
What are the ids of all students for courses and what are the names of those courses?<|im_end|>
<|im_start|>assistant
"""
example3="""
<|im_start|>system
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE table_name_9 (wins INTEGER, year VARCHAR, team VARCHAR, points VARCHAR)<|im_end|>
<|im_start|>user
Which highest wins number had Kawasaki as a team, 95 points, and a year prior to 1981?<|im_end|>
<|im_start|>assistant
"""

prompts = [
    example1,
    example2,
    example3
]

sampling_params = SamplingParams(max_tokens=2048, temperature=0.8)
outputs = llm.generate(prompts, sampling_params)

print("#########################################################")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, \n\n Generated text: {generated_text!r} \n")

INFO 11-08 18:51:20 [__init__.py:39] Available plugins for group vllm.platform_plugins:
INFO 11-08 18:51:20 [__init__.py:41] - optimum_neuron -> optimum.neuron.vllm.plugin:register
INFO 11-08 18:51:20 [__init__.py:44] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.


  from .mappings import (
  component, error = import_nki(config)
INFO:Neuron:Optimum Neuron platform plugin registered for vLLM.


INFO 11-08 18:51:21 [__init__.py:235] Platform plugin optimum_neuron is activated
INFO 11-08 18:51:29 [config.py:841] This model supports multiple tasks: {'generate', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 11-08 18:51:29 [config.py:1472] Using max model len 2048
INFO 11-08 18:51:30 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='/home/ubuntu/environment/ml/qwen/compiled_model', speculative_config=None, tokenizer='/home/ubuntu/environment/ml/qwen/compiled_model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False

  from ..attention.gqa import (
  from ..backend.modules.attention.attention_base import NeuronAttentionBase
INFO:Neuron:Loading sharded checkpoint from /home/ubuntu/environment/ml/qwen/compiled_model/checkpoint/weights


INFO 11-08 18:51:38 [executor_base.py:113] # neuron blocks: 2, # CPU blocks: 0
INFO 11-08 18:51:38 [executor_base.py:118] Maximum concurrency for 2048 tokens per request: 2.00x
INFO 11-08 18:51:38 [llm_engine.py:428] init engine (profile, create kv cache, warmup model) took 0.00 seconds



2025-Nov-08 18:51:38.0715 41557:46377 [0] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):213 CCOM WARN NET/OFI Failed to initialize sendrecv protocol
2025-Nov-08 18:51:38.0725 41557:46377 [0] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):354 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Nov-08 18:51:38.0735 41557:46377 [0] ncclResult_t nccl_net_ofi_init_no_atexit_fini_v6(ncclDebugLogger_t):183 CCOM WARN NET/OFI Initializing plugin failed
2025-Nov-08 18:51:38.0745 41557:46377 [0] net_plugin.cc:97 CCOM WARN OFI plugin initNet() failed is EFA enabled?
#########################################################
Prompt: '\n<|im_start|>system\nYou are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.\nSCHEMA:\nCREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)<|im_end|>\n<|im_start|>user\nHow many departments are led by heads who are
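
The three prompts in the cell above differ only in schema and question, so they can be generated from one template. The sketch below reproduces the same ChatML layout character for character; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced here for illustration.

```python
# Same layout as the hand-written prompts, including the leading newline.
PROMPT_TEMPLATE = (
    "\n<|im_start|>system\n"
    "You are a text to SQL query translator. Users will ask you questions in English "
    "and you will generate a SQL query based on the provided SCHEMA.\n"
    "SCHEMA:\n"
    "{schema}<|im_end|>\n"
    "<|im_start|>user\n"
    "{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def build_prompt(schema, question):
    """Fill the ChatML template with a schema and a natural language question."""
    return PROMPT_TEMPLATE.format(schema=schema, question=question)

prompt3 = build_prompt(
    "CREATE TABLE table_name_9 (wins INTEGER, year VARCHAR, team VARCHAR, points VARCHAR)",
    "Which highest wins number had Kawasaki as a team, 95 points, and a year prior to 1981?",
)
print(repr(prompt3))
```

The resulting strings can be passed to `llm.generate` in place of `example1`, `example2`, and `example3`.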

In [6]:
from datasets import load_dataset

# Load the full dataset
ds = load_dataset("nvidia/Nemotron-PII")

# Filter by domain
finance_ds = ds.filter(lambda x: x["domain"] == "Finance")

# Inspect a sample
print(finance_ds)
print(finance_ds["train"][0])



DatasetDict({
    train: Dataset({
        features: ['uid', 'domain', 'document_type', 'document_description', 'document_format', 'locale', 'text', 'spans', 'text_tagged'],
        num_rows: 3990
    })
    test: Dataset({
        features: ['uid', 'domain', 'document_type', 'document_description', 'document_format', 'locale', 'text', 'spans', 'text_tagged'],
        num_rows: 0
    })
})
{'uid': '9d53d13f0e3d4abfaff21347796f73e4', 'domain': 'Finance', 'document_type': 'Equity Allocation Report', 'document_description': 'An Equity Allocation Report in the finance domain is an unstructured document that typically includes sections detailing investment strategies, portfolio performance, risk assessments, and specific equity allocations, with common fields such as asset classes, investment ratios, and financial projections presented in a narrative or tabular format.', 'document_format': 'unstructured', 'locale': 'us', 'text': 'The Equity Allocation Report for account number C37529641 det

In [7]:
for example in finance_ds["train"]:
    spans = example["spans"]
    print(type(spans))
    break

<class 'str'>


In [8]:
import ast

train_data = []

for example in finance_ds["train"]:
    text = example["text"]
    spans_raw = example["spans"]

    # Convert string to Python list safely
    try:
        spans = ast.literal_eval(spans_raw)
    except (ValueError, SyntaxError):
        continue  # skip malformed entries

    # Extract entities
    entities = [(s["start"], s["end"], s["label"].upper()) for s in spans if isinstance(s, dict)]
    if entities:
        train_data.append((text, {"entities": entities}))


In [12]:
train_data[0]

('The Equity Allocation Report for account number C37529641 details the current investment strategies and portfolio performance. The report can be accessed at https://financialreports.com. Our investment strategy focuses on diversifying across various asset classes to mitigate risk. The swift bic GHTBUS45KLX is used for international transactions, ensuring secure and efficient transfers. The portfolio includes a mix of equities, bonds, and other financial instruments, with a specific focus on high-growth sectors. The financial projections indicate a positive outlook, with expected returns aligning with our investment goals.',
 {'entities': [(48, 57, 'ACCOUNT_NUMBER'),
   (157, 185, 'URL'),
   (296, 307, 'SWIFT_BIC')]})
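
Because the entities are stored as character offsets, it is worth confirming that each offset pair slices out the expected substring. The sketch below does this for the first two entities of the sample above, using only the first two sentences of its text; `spans_to_text` is a hypothetical helper.

```python
def spans_to_text(text, entities):
    """Return (label, substring) pairs, raising if an offset falls outside the text."""
    pairs = []
    for start, end, label in entities:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is out of range for {label}")
        pairs.append((label, text[start:end]))
    return pairs

# First two sentences of the sample shown above.
text = (
    "The Equity Allocation Report for account number C37529641 details the current "
    "investment strategies and portfolio performance. The report can be accessed at "
    "https://financialreports.com."
)
entities = [(48, 57, "ACCOUNT_NUMBER"), (157, 185, "URL")]

for label, value in spans_to_text(text, entities):
    print(f"{label}: {value}")
```

The end offsets are exclusive, which matches Python slicing, so no adjustment is needed before slicing.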

In [13]:
!pip install presidio_analyzer
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

# Run Presidio's built-in recognizers over each finance document and
# print the raw text followed by the entities it detected.
for sample in finance_ds["train"]:
    text = sample["text"]
    print(text)
    results = analyzer.analyze(text=text, language="en")
    print(results)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com





The Equity Allocation Report for account number C37529641 details the current investment strategies and portfolio performance. The report can be accessed at https://financialreports.com. Our investment strategy focuses on diversifying across various asset classes to mitigate risk. The swift bic GHTBUS45KLX is used for international transactions, ensuring secure and efficient transfers. The portfolio includes a mix of equities, bonds, and other financial instruments, with a specific focus on high-growth sectors. The financial projections indicate a positive outlook, with expected returns aligning with our investment goals.
[type: URL, start: 157, end: 185, score: 0.6, type: US_DRIVER_LICENSE, start: 48, end: 57, score: 0.3, type: US_PASSPORT, start: 48, end: 57, score: 0.1]
This Derivative Contract is effective as of the date 2023-08-15 between Vanguard Capital and the counterparty. The contract pertains to a notional amount of $5,000,000, with the underlying asset being a basket of equ
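
The later cells work with spans as dictionaries holding `start`, `end`, `type`, and `score`, whereas Presidio returns result objects with `entity_type`, `start`, `end`, and `score` attributes. A minimal bridging sketch follows. `presidio_to_spans` and its `min_score` cutoff are assumptions for illustration, and `FakeResult` is a stand-in so the example runs without Presidio installed.

```python
from collections import namedtuple

# Stand-in with the same attribute names as a Presidio result object.
FakeResult = namedtuple("FakeResult", ["entity_type", "start", "end", "score"])

def presidio_to_spans(results, min_score=0.0):
    """Convert analyzer results into span dicts, dropping low-confidence hits."""
    return [
        {"start": r.start, "end": r.end, "type": r.entity_type, "score": r.score}
        for r in results
        if r.score >= min_score
    ]

# The three detections printed above for the first finance document.
results = [
    FakeResult("URL", 157, 185, 0.6),
    FakeResult("US_DRIVER_LICENSE", 48, 57, 0.3),
    FakeResult("US_PASSPORT", 48, 57, 0.1),
]

spans = presidio_to_spans(results, min_score=0.3)
print(spans)
```

Note that Presidio labels the account number as a driver license or passport with low scores, which is why the notebook later combines several detectors instead of relying on one.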

In [15]:
!pip install spacy==3.7.4




Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting spacy==3.7.4
  Downloading spacy-3.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting thinc<8.3.0,>=8.2.2 (from spacy==3.7.4)
  Downloading thinc-8.2.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting weasel<0.4.0,>=0.1.0 (from spacy==3.7.4)
  Downloading weasel-0.3.4-py3-none-any.whl.metadata (4.7 kB)
Collecting typer<0.10.0,>=0.3.0 (from spacy==3.7.4)
  Downloading typer-0.9.4-py3-none-any.whl.metadata (14 kB)
Collecting smart-open<7.0.0,>=5.2.1 (from spacy==3.7.4)
  Downloading smart_open-6.4.0-py3-none-any.whl.metadata (21 kB)
Collecting langcodes<4.0.0,>=3.2.0 (from spacy==3.7.4)
  Downloading langcodes-3.5.0-py3-none-any.whl.metadata (29 kB)
Collecting language-data>=1.2 (from langcodes<4.0.0,>=3.2.0->spacy==3.7.4)
  Downloading language_data-1.3.0-py3-none-any.whl.metadata (4.3 kB)
Collecting blis<0.8.0,>=0.7

In [18]:
!python -m spacy download en_core_web_sm



Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [19]:
# ner.py
import spacy
nlp = spacy.load("en_core_web_sm")

NER_MAP = {
    "PERSON":"PERSON","ORG":"ORG","GPE":"GPE","DATE":"DATE","LOC":"GPE","NORP":"ORG"
}

def ner_spans(text):
    doc = nlp(text)
    out = []
    for ent in doc.ents:
        et = NER_MAP.get(ent.label_)
        if et:
            out.append({"start": ent.start_char, "end": ent.end_char, "type": et, "score": 0.55})
    return out



In [20]:
# slm_gate.py
import subprocess, json, textwrap

PROMPT_TMPL = """You are a privacy classifier. Output compact JSON only.
Q: Is SPAN personal data in CONTEXT? If label is generic/code/placeholder, return false.
CONTEXT:
{context}

SPAN:
{span}

Return: {{"is_sensitive": true/false, "category": "EMAIL/PHONE/ID/HEALTH/FINANCIAL/UNKNOWN", "reason": "<=20 words"}}"""

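def slm_is_sensitive(context, span_text, model="phi3:mini"):
    # Ask a local model through the `ollama` CLI whether the span is personal data.
    prompt = textwrap.dedent(PROMPT_TMPL.format(context=context, span=span_text))
    res = subprocess.run(["ollama", "run", model, prompt], capture_output=True, text=True)
    try:
        # The prompt asks for compact JSON, so parse the last line of stdout.
        # Reading the line inside the try block also covers empty output.
        line = res.stdout.strip().splitlines()[-1]
        return json.loads(line)
    except Exception:
        # Empty or non-JSON output: fail closed and treat the span as sensitive.
        return {"is_sensitive": True, "category":"UNKNOWN", "reason":"fallback"}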

In [21]:
# fuse.py
def fuse(rule_score=0.7, ner_score=0.55, slm_json=None):
    alpha, beta, gamma = 0.55, 0.25, 0.20
    slm_bit = 1.0 if (slm_json and slm_json.get("is_sensitive")) else 0.0
    return alpha*rule_score + beta*ner_score + gamma*slm_bit

def decide(score):
    if score >= 0.70: return "REDACT"
    if 0.50 <= score < 0.70: return "REVIEW"
    return "KEEP"
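
To see how the weights and thresholds interact, the sketch below repeats the two functions from the cell above (so it runs on its own) and evaluates three cases. The input scores are illustrative values, not measurements.

```python
def fuse(rule_score=0.7, ner_score=0.55, slm_json=None):
    # Weighted sum of the rule score, NER score, and the SLM's binary verdict.
    alpha, beta, gamma = 0.55, 0.25, 0.20
    slm_bit = 1.0 if (slm_json and slm_json.get("is_sensitive")) else 0.0
    return alpha*rule_score + beta*ner_score + gamma*slm_bit

def decide(score):
    if score >= 0.70: return "REDACT"
    if 0.50 <= score < 0.70: return "REVIEW"
    return "KEEP"

# Default scores with the SLM agreeing: 0.385 + 0.1375 + 0.20
with_slm = fuse(slm_json={"is_sensitive": True})
# Default scores with no SLM verdict: 0.385 + 0.1375
without_slm = fuse()
# Weak rule hit, no NER support, SLM disagrees: 0.165
weak_hit = fuse(rule_score=0.3, ner_score=0.0, slm_json={"is_sensitive": False})

for name, score in [("with_slm", with_slm), ("without_slm", without_slm), ("weak_hit", weak_hit)]:
    print(f"{name}: {score:.4f} -> {decide(score)}")
```

With these weights, the rule and NER scores alone can reach at most 0.80, and at their default values the SLM verdict is what moves a span from REVIEW to REDACT.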

In [22]:
def build_candidates(text, presidio_spans, spacy_spans):
    # Merge and coalesce overlapping spans, preferring the higher-scoring type.
    spans = presidio_spans + spacy_spans
    spans.sort(key=lambda x:(x["start"], x["end"]))
    merged = []
    for s in spans:
        if not merged or s["start"] > merged[-1]["end"]:
            # Copy so the caller's span dictionaries are not mutated.
            merged.append(dict(s))
        else:
            # Overlap: widen the span and keep the type of the higher-scoring detection.
            m = merged[-1]
            m["end"] = max(m["end"], s["end"])
            # Compare scores before overwriting m["score"]; otherwise the type never changes.
            m_score, s_score = m.get("score",0.5), s.get("score",0.5)
            if s_score > m_score:
                m["type"] = s["type"]
                m_score = s_score
            m["score"] = m_score
    return merged
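
The overlap handling is easiest to follow on a concrete case. The sketch below uses a compact stand-alone merge (`merge_spans`, a hypothetical name that follows the same widen-and-keep-best-type idea) on the Presidio detections printed earlier, plus an assumed rule-based `ACCOUNT_NUMBER` hit with score 0.9 on the same characters.

```python
def merge_spans(spans):
    """Coalesce overlapping spans, keeping the type of the highest-scoring one."""
    merged = []
    for s in sorted(spans, key=lambda x: (x["start"], x["end"])):
        if not merged or s["start"] > merged[-1]["end"]:
            merged.append(dict(s))  # copy so inputs stay untouched
            continue
        m = merged[-1]
        m["end"] = max(m["end"], s["end"])
        if s["score"] > m["score"]:
            m["type"], m["score"] = s["type"], s["score"]
    return merged

candidates = [
    {"start": 48, "end": 57, "type": "US_DRIVER_LICENSE", "score": 0.3},
    {"start": 48, "end": 57, "type": "US_PASSPORT", "score": 0.1},
    {"start": 48, "end": 57, "type": "ACCOUNT_NUMBER", "score": 0.9},  # assumed rule hit
    {"start": 157, "end": 185, "type": "URL", "score": 0.6},
]

merged = merge_spans(candidates)
for span in merged:
    print(span)
```

Three competing labels for characters 48 to 57 collapse into a single `ACCOUNT_NUMBER` span, while the non-overlapping URL span passes through unchanged.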

In [23]:
# policy.py
POLICY = {
  "PCI":  {"CREDIT_CARD":"DROP","IBAN_CODE":"MASK","BANK_ACCOUNT":"MASK","SWIFT_CODE":"MASK"},
  "GDPR": {"PERSON":"MASK","EMAIL_ADDRESS":"MASK","PHONE_NUMBER":"MASK","IP_ADDRESS":"MASK",
           "LOCATION":"MASK","DATE_TIME":"MASK","SALARY":"MASK","CASE_ID":"MASK"},
  "HIPAA":{"PERSON":"MASK","DATE_TIME":"MASK","LOCATION":"MASK","MEDICAL_RECORD_NUMBER":"DROP",
           "ACCOUNT_NUMBER":"MASK","CERTIFICATE_LICENSE":"MASK"}
}

In [24]:
# redact.py
import hashlib, time

def sha256_peppered(s, pepper): 
    return hashlib.sha256((s+pepper).encode()).hexdigest()[:16]

def apply_redactions(text, spans, policy="GDPR", pepper="local-secret"):
    out, last = [], 0
    audit = []
    for i, s in enumerate(sorted(spans, key=lambda x:x["start"])):
        action = POLICY[policy].get(s["type"])
        if not action:
            continue
        out.append(text[last:s["start"]])
        raw = text[s["start"]:s["end"]]
        if action=="DROP":
            repl = ""
        elif action=="HASH":
            repl = f"[[{s['type']}_HASH_{sha256_peppered(raw,pepper)}]]"
        else: # MASK or FPR, keep simple MASK for demo
            repl = f"[[{s['type']}_{i+1}]]"
        out.append(repl)
        audit.append({
            "i": i+1, "type": s["type"], "action": action,
            "start": s["start"], "end": s["end"],
            "timestamp": int(time.time())
        })
        last = s["end"]
    out.append(text[last:])
    return "".join(out), audit
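
The `HASH` branch above is not used by any entry in `POLICY` as written, but it is the option to reach for when redacted values must stay linkable. The sketch below repeats `sha256_peppered` so it runs on its own and shows the two properties that matter: the same value and pepper give the same token, and a different pepper gives an unrelated one. The pepper strings are placeholders.

```python
import hashlib

def sha256_peppered(s, pepper):
    # Truncated SHA-256 of value + secret pepper, used as a stable pseudonym.
    return hashlib.sha256((s + pepper).encode()).hexdigest()[:16]

token_a = sha256_peppered("C37529641", "local-secret")
token_b = sha256_peppered("C37529641", "local-secret")
token_c = sha256_peppered("C37529641", "another-secret")

print("same pepper, same token:", token_a == token_b)
print("different pepper, different token:", token_a != token_c)
print("token length:", len(token_a))
```

Because the token is deterministic, two documents mentioning the same account number can still be joined after redaction, as long as the pepper is kept secret and stable.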

In [25]:
# audit.py
import json

def write_audit_jsonl(audit, path="audit.jsonl"):
    with open(path, "w") as f:
        for row in audit:
            f.write(json.dumps(row)+"\n")

In [None]:
# bedrock_client.py
import os, json, boto3
region = os.getenv("AWS_REGION","us-east-1")
bedrock = boto3.client("bedrock-runtime", region_name=region)

def bedrock_summarize(sanitized_text, model_id="mistral.mistral-7b-instruct-v0", guardrails=None):
    body = {
        "messages":[{"role":"user","content":[{"type":"text","text":
            "Summarize this sanitized content. Treat [[TOKENS]] as anonymized:\n\n"+sanitized_text}]}],
        "max_tokens": 512, "temperature": 0.2
    }
    if guardrails: body["guardrails"] = guardrails
    resp = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    payload = json.loads(resp["body"].read())
    try:
        return payload["output"][0]["content"][0]["text"]
    except Exception:
        return json.dumps(payload, indent=2)

In [85]:
%%writefile streamlit_app.py
# streamlit_app.py — single-file, no external local imports
import re, json, datetime, hashlib, io, textwrap
import streamlit as st

# =========================
# Helpers
# =========================
def asterisk_mask(raw: str) -> str:
    # Mask every non-whitespace char with "*", keep whitespace/punct as-is
    return re.sub(r"\S", "*", raw)

def merge_overlaps(spans):
    # spans: [{start,end,type,score}]
    spans = sorted(spans, key=lambda s: (s["start"], s["end"]))
    out = []
    for s in spans:
        if not out or s["start"] > out[-1]["end"]:
            out.append(s.copy())
        else:
            # overlap -> merge
            if s["end"] > out[-1]["end"]:
                out[-1]["end"] = s["end"]
            if s["score"] > out[-1]["score"]:
                out[-1]["type"] = s["type"]
                out[-1]["score"] = s["score"]
    return out

def luhn_ok(s):
    digits = [int(c) for c in re.sub(r"\D", "", s)]
    if not (13 <= len(digits) <= 19): 
        return False
    parity = len(digits) % 2
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == parity:
            d *= 2
            if d > 9: d -= 9
        total += d
    return total % 10 == 0

# =========================
# Deterministic detectors (regex)
# =========================
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
PHONE_RE = re.compile(r"(?:\+?\d[\s\-\.()]*){7,}\d")
IPV4_RE  = re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\b")
SSN_RE   = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
IBAN_RE  = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b")
SWIFT_RE = re.compile(r"\b[A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?\b")
SALARY_RE= re.compile(r"\b(?:USD|\$)?\s?\d{2,3}(?:,\d{3})*(?:\.\d{1,2})?\s*(per\s*)?(year|yr|month|mo|hr|hour)\b", re.I)
CASE_RE  = re.compile(r"\b(?:Case|Docket|File)\s*#?\s*[A-Z0-9\-]{4,}\b", re.I)
CARD_RE  = re.compile(r"\b(?:\d[ -]*?){13,19}\b")

# Account Number: specifically redact the value following the label
ACCOUNT_LABEL_RE = re.compile(
    r"(?i)\baccount\s*number\b\s*[:#]?\s*(?P<num>(?:[*A-Za-z0-9][A-Za-z0-9\-* ]{3,}))"
)

def rule_spans(text):
    spans = []
    for m in EMAIL_RE.finditer(text):
        spans.append({"start": m.start(), "end": m.end(), "type": "EMAIL_ADDRESS", "score": 0.9})
    for m in PHONE_RE.finditer(text):
        spans.append({"start": m.start(), "end": m.end(), "type": "PHONE_NUMBER", "score": 0.7})
    for m in IPV4_RE.finditer(text):
        spans.append({"start": m.start(), "end": m.end(), "type": "IP_ADDRESS", "score": 0.85})
    for m in SSN_RE.finditer(text):
        spans.append({"start": m.start(), "end": m.end(), "type": "US_SSN", "score": 0.95})
    for m in IBAN_RE.finditer(text):
        spans.append({"start": m.start(), "end": m.end(), "type": "IBAN_CODE", "score": 0.85})
    for m in SWIFT_RE.finditer(text):
        spans.append({"start": m.start(), "end": m.end(), "type": "SWIFT_CODE", "score": 0.7})
    for m in SALARY_RE.finditer(text):
        spans.append({"start": m.start(), "end": m.end(), "type": "SALARY", "score": 0.75})
    for m in CASE_RE.finditer(text):
        spans.append({"start": m.start(), "end": m.end(), "type": "CASE_ID", "score": 0.7})
    for m in CARD_RE.finditer(text):
        s = m.group(0)
        if luhn_ok(s):
            spans.append({"start": m.start(), "end": m.end(), "type": "CREDIT_CARD", "score": 0.98})
    # Account Number value (only the value part)
    for m in ACCOUNT_LABEL_RE.finditer(text):
        g = m.group("num")
        if g:
            start = m.start("num")
            end = m.end("num")
            spans.append({"start": start, "end": end, "type": "ACCOUNT_NUMBER", "score": 0.9})
    return merge_overlaps(spans)

# =========================
# “SLM” context gate (simple heuristic so app runs anywhere)
# =========================
def slm_is_sensitive(context, span_text):
    looks_like_id = bool(re.search(r"\b(id|mrn|account|patient|employee|ssn|card)\b", context, re.I))
    return {"is_sensitive": looks_like_id or len(re.sub(r"\W","",span_text)) >= 6}

def fuse(rule_score, slm):
    alpha, gamma = 0.8, 0.2
    slm_bonus = 1.0 if slm.get("is_sensitive") else 0.0
    return alpha*rule_score + gamma*slm_bonus

def decide(score):
    if score >= 0.70: return "REDACT"
    if score >= 0.50: return "REVIEW"
    return "KEEP"

# =========================
# Single policy: mask with asterisks
# =========================
POLICY = {
    "ENTERPRISE_DEFAULT": {
        # every type -> asterisk masking
        "EMAIL_ADDRESS":"MASK_ASTERISKS","PHONE_NUMBER":"MASK_ASTERISKS",
        "IP_ADDRESS":"MASK_ASTERISKS","US_SSN":"MASK_ASTERISKS","SALARY":"MASK_ASTERISKS",
        "CASE_ID":"MASK_ASTERISKS","IBAN_CODE":"MASK_ASTERISKS","SWIFT_CODE":"MASK_ASTERISKS",
        "CREDIT_CARD":"MASK_ASTERISKS","ACCOUNT_NUMBER":"MASK_ASTERISKS"
    }
}

def apply_redactions(text, final_spans, policy_name="ENTERPRISE_DEFAULT"):
    policy = POLICY[policy_name]
    out, last = [], 0
    audit = []
    for i, s in enumerate(sorted(final_spans, key=lambda x: x["start"])):
        action = policy.get(s["type"])
        if not action:
            continue
        out.append(text[last:s["start"]])
        raw = text[s["start"]:s["end"]]
        # Single-policy app: every configured action masks with asterisks.
        placeholder = asterisk_mask(raw)
        out.append(placeholder)
        audit.append({
            "index": i+1, "type": s["type"], "action": action,
            "start": s["start"], "end": s["end"],
            "timestamp": datetime.datetime.utcnow().isoformat()+"Z"
        })
        last = s["end"]
    out.append(text[last:])
    return "".join(out), audit

# =========================
# PDF utilities
# =========================
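def extract_pdf_text(file_bytes: bytes) -> str:
    """Return the plain text of every page, joined by newlines."""
    try:
        import fitz  # PyMuPDF
    except ImportError:
        # NOTE: the caller treats this message as the document text.
        return "PyMuPDF (fitz) not installed; cannot read PDF."
    # The context manager closes the document even if a page fails to parse.
    with fitz.open(stream=file_bytes, filetype="pdf") as doc:
        pages = [p.get_text("text") for p in doc]
    return "\n".join(pages)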

def make_pdf_from_text(text: str) -> bytes:
    # Generate a simple text PDF (not layout-preserving)
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import LETTER
    from reportlab.lib.units import inch
    packet = io.BytesIO()
    c = canvas.Canvas(packet, pagesize=LETTER)
    width, height = LETTER
    left = 0.75*inch
    top = height - 0.75*inch
    lineh = 14
    y = top
    for line in text.splitlines():
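        # textwrap.wrap("") returns [], so fall back to one empty chunk
        # to keep blank lines in the generated PDF.
        for chunk in textwrap.wrap(line, width=90) or [""]: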
            if y < 0.75*inch:
                c.showPage()
                y = top
            c.drawString(left, y, chunk)
            y -= lineh
    c.save()
    return packet.getvalue()

# =========================
# UI
# =========================
st.set_page_config(page_title="AI Data Governance & PII Redactor", layout="wide")
st.markdown(
    "<h1 style='margin-bottom:0'>PII Sentinel — Ensuring Sensitive Data Never Reaches AI Models</h1>"
    "<p style='color:#9aa1a9;margin-top:6px'>Asterisk-mask redaction for PDF or plain text.</p>",
    unsafe_allow_html=True
)

uploaded_pdf = st.file_uploader("Upload PDF (optional)", type=["pdf"])
st.caption("If no PDF is uploaded, paste text below.")

# Input text box
if "input_text" not in st.session_state:
    st.session_state.input_text = ""

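# A non-empty label avoids Streamlit's empty-label warning; it is hidden visually.
text = st.text_area("Input text",
                    key="text_area",
                    value=st.session_state.input_text,
                    height=220,
                    placeholder="Paste PDF/CSV/email text here...",
                    label_visibility="collapsed")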

# Buttons (top actions)
run = st.button("Detect & Redact", type="primary")

# Results placeholders in session
for k in ("redacted_text","final","review","audit","sanitized_pdf"):
    st.session_state.setdefault(k, None)

# Execute
if run:
    # Prefer PDF if provided
    if uploaded_pdf is not None:
        file_bytes = uploaded_pdf.read()
        text = extract_pdf_text(file_bytes)
        st.session_state.input_text = text  # show extracted to user
    else:
        st.session_state.input_text = text

    # Detect
    detected = rule_spans(text)
    # Fuse decisions
    final, review = [], []
    for s in detected:
        ctx = text[max(0, s["start"]-120): s["end"]+120]
        verdict = slm_is_sensitive(ctx, text[s["start"]:s["end"]])
        fused = fuse(s.get("score",0.6), verdict)
        decision = decide(fused)
        item = {**s, "fused_score": round(fused, 2), "decision": decision}
        if decision == "REDACT":
            final.append(item)
        elif decision == "REVIEW":
            review.append(item)

    redacted, audit = apply_redactions(text, final, policy_name="ENTERPRISE_DEFAULT")

    st.session_state.redacted_text = redacted
    st.session_state.final = final
    st.session_state.review = review
    st.session_state.audit = audit

    # Build sanitized PDF (even when original was text)
    try:
        st.session_state.sanitized_pdf = make_pdf_from_text(redacted)
    except Exception:
        st.session_state.sanitized_pdf = None

# Display if we have results
if st.session_state.redacted_text is not None:
    st.subheader("Redacted Output")
    st.code(st.session_state.redacted_text, language="text")

    # Two-column JSON (left: redacted spans, right: review)
    c1, c2 = st.columns(2)
    with c1:
        st.markdown("#### Redacted spans  &nbsp;&nbsp; <span style='font-size:12px;color:#9aa1a9'>(asterisk-masked)</span>", unsafe_allow_html=True)
        st.json([{k:v for k,v in s.items() if k in ("type","start","end","fused_score","decision")} for s in (st.session_state.final or [])] or {})
    with c2:
        st.markdown("#### Under review")
        st.json([{k:v for k,v in s.items() if k in ("type","start","end","fused_score","decision")} for s in (st.session_state.review or [])] or {})

    st.markdown("#### Audit JSONL")
    audit_lines = "\n".join(json.dumps(a) for a in (st.session_state.audit or []))
    st.code(audit_lines or "{}", language="json")

    # Download/view sanitized PDF
    if st.session_state.sanitized_pdf:
        st.download_button("Download sanitized.pdf", st.session_state.sanitized_pdf, "sanitized.pdf", mime="application/pdf")

# Clear button at the bottom
st.markdown("---")
def _clear_all():
    st.session_state.input_text = ""
    st.session_state.text_area = ""  # clear widget text
    for k in ("redacted_text","final","review","audit","sanitized_pdf"):
        st.session_state[k] = None
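    # No st.rerun() here: it is a no-op inside a callback, and Streamlit
    # reruns the script automatically once the callback returns.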

st.button("Clear", on_click=_clear_all)

Overwriting streamlit_app.py
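To make the fusion arithmetic concrete, the sketch below restates `fuse` and `decide` from the app (same weights and thresholds) and evaluates them for three of the rule scores used in `rule_spans`. The `examples` list and the `with_bonus` flag are illustrative names, not part of the app.

```python
# Restated from streamlit_app.py so the snippet runs on its own.
def fuse(rule_score, slm):
    alpha, gamma = 0.8, 0.2
    slm_bonus = 1.0 if slm.get("is_sensitive") else 0.0
    return alpha * rule_score + gamma * slm_bonus

def decide(score):
    if score >= 0.70:
        return "REDACT"
    if score >= 0.50:
        return "REVIEW"
    return "KEEP"

# (type, rule score) pairs taken from rule_spans; the list itself is illustrative.
examples = [("CREDIT_CARD", 0.98), ("IP_ADDRESS", 0.85), ("SWIFT_CODE", 0.7)]

for span_type, rule_score in examples:
    for with_bonus in (True, False):
        fused = fuse(rule_score, {"is_sensitive": with_bonus})
        print(f"{span_type:12s} gate={with_bonus!s:5s} fused={fused:.2f} -> {decide(fused)}")
```

With these weights, a span whose context gate fires is always redacted, because even the lowest rule score shown (0.7) fuses to 0.76. Without the gate, the same span fuses to 0.56 and lands in REVIEW, so none of the rule scores shown can produce KEEP.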

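The account-number detector records only the offsets of the named group `num`, so the label stays readable and only the value is masked. A minimal sketch of that idea follows. `ACCOUNT_LABEL_RE_SKETCH`, `mask_alnum`, and `mask_spans` are hypothetical stand-ins: the app's actual `ACCOUNT_LABEL_RE` pattern and `asterisk_mask` helper are defined outside this excerpt and may behave differently.

```python
import re

# Hypothetical pattern: a label followed by a numeric value captured as "num".
ACCOUNT_LABEL_RE_SKETCH = re.compile(
    r"\bAccount\s*(?:Number|No\.?|#)?\s*[:\-]?\s*(?P<num>\d{6,})", re.I
)

def mask_alnum(raw):
    """Hypothetical mask: replace letters and digits, keep punctuation and spaces."""
    return re.sub(r"[A-Za-z0-9]", "*", raw)

def mask_spans(text, spans):
    """Rebuild the text left to right, masking each (start, end) span."""
    out, last = [], 0
    for start, end in sorted(spans):
        out.append(text[last:start])             # untouched text before the span
        out.append(mask_alnum(text[start:end]))  # masked value
        last = end
    out.append(text[last:])                      # tail after the final span
    return "".join(out)

text = "Account Number: 12345678 belongs to the payroll team."
# Use the group offsets, not the whole match, so the label is preserved.
spans = [(m.start("num"), m.end("num")) for m in ACCOUNT_LABEL_RE_SKETCH.finditer(text)]
print(mask_spans(text, spans))
```

Tracking `last` mirrors what `apply_redactions` does: each span is replaced in a single left-to-right pass, so earlier replacements never shift the offsets of later ones.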

In [None]:
# %%writefile streamlit_app.py
# # streamlit_app.py — single-file, no local imports required
# import re, json, datetime, hashlib
# import streamlit as st

# # ---------------------------
# # 0) Helpers
# # ---------------------------
# def luhn_ok(s):
#     digits = [int(c) for c in re.sub(r"\D", "", s)]
#     if len(digits) < 13 or len(digits) > 19: 
#         return False
#     parity = len(digits) % 2
#     total = 0
#     for i, d in enumerate(digits):
#         if i % 2 == parity:
#             d = d * 2
#             if d > 9: d -= 9
#         total += d
#     return total % 10 == 0

# def sha256_pepper(s, pepper="local-pepper-123"):
#     return hashlib.sha256((pepper + s).encode()).hexdigest()

# def merge_overlaps(spans):
#     spans = sorted(spans, key=lambda s: (s["start"], s["end"]))
#     out = []
#     for s in spans:
#         if not out or s["start"] > out[-1]["end"]:
#             out.append(s.copy())
#         else:
#             # merge
#             out[-1]["end"] = max(out[-1]["end"], s["end"])
#             out[-1]["score"] = max(out[-1]["score"], s["score"])
#             if s["score"] >= out[-1]["score"]:
#                 out[-1]["type"] = s["type"]
#     return out

# # ---------------------------
# # 1) Deterministic detectors (regex first)
# #    (simple but effective baseline; you can extend anytime)
# # ---------------------------
# EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# PHONE_RE = re.compile(r"(?:\+?\d[\s\-\.()]*){7,}\d")
# IPV4_RE  = re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\b")
# SSN_RE   = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# IBAN_RE  = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b")
# SWIFT_RE = re.compile(r"\b[A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?\b")
# SALARY_RE= re.compile(r"\b(?:USD|\$)?\s?\d{2,3}(?:,\d{3})*(?:\.\d{1,2})?\s*(per\s*)?(year|yr|month|mo|hr|hour)\b", re.I)
# CASE_RE  = re.compile(r"\b(?:Case|Docket|File)\s*#?\s*[A-Z0-9\-]{4,}\b", re.I)
# CARD_RE  = re.compile(r"\b(?:\d[ -]*?){13,19}\b")

# def rule_spans(text):
#     spans = []
#     for m in EMAIL_RE.finditer(text):
#         spans.append({"start": m.start(), "end": m.end(), "type": "EMAIL_ADDRESS", "score": 0.9})
#     for m in PHONE_RE.finditer(text):
#         spans.append({"start": m.start(), "end": m.end(), "type": "PHONE_NUMBER", "score": 0.7})
#     for m in IPV4_RE.finditer(text):
#         spans.append({"start": m.start(), "end": m.end(), "type": "IP_ADDRESS", "score": 0.85})
#     for m in SSN_RE.finditer(text):
#         spans.append({"start": m.start(), "end": m.end(), "type": "US_SSN", "score": 0.95})
#     for m in IBAN_RE.finditer(text):
#         spans.append({"start": m.start(), "end": m.end(), "type": "IBAN_CODE", "score": 0.85})
#     for m in SWIFT_RE.finditer(text):
#         spans.append({"start": m.start(), "end": m.end(), "type": "SWIFT_CODE", "score": 0.7})
#     for m in SALARY_RE.finditer(text):
#         spans.append({"start": m.start(), "end": m.end(), "type": "SALARY", "score": 0.75})
#     for m in CASE_RE.finditer(text):
#         spans.append({"start": m.start(), "end": m.end(), "type": "CASE_ID", "score": 0.7})
#     for m in CARD_RE.finditer(text):
#         s = m.group(0)
#         if luhn_ok(s):
#             spans.append({"start": m.start(), "end": m.end(), "type": "CREDIT_CARD", "score": 0.98})
#     return merge_overlaps(spans)

# # ---------------------------
# # 2) (Optional) NER layer stub
# #    If you want spaCy, replace this with real NER; otherwise return empty.
# # ---------------------------
# def ner_spans(text):
#     # Return [] to keep this single-file and dependency-free.
#     # You can integrate spaCy later and map PERSON/ORG/GPE/DATE here.
#     return []

# # ---------------------------
# # 3) SLM context gate stub (you can wire to your fine-tuned Qwen endpoint)
# #    For hackathon stability we keep it deterministic: mark likely sensitive by heuristics.
# #    Replace body with real call to your local Qwen if desired.
# # ---------------------------
# def slm_is_sensitive(context, span_text):
#     # Heuristic “safe default”: assume sensitive for known types; else conservative false.
#     # If you have a local SLM HTTP endpoint, call it here and return {"is_sensitive": bool, "reason": "..."}
#     looks_like_id = bool(re.search(r"\b(id|mrn|account|patient|employee)\b", context, re.I))
#     return {
#         "is_sensitive": looks_like_id or len(re.sub(r"\W", "", span_text)) >= 6,
#         "category": "UNKNOWN",
#         "reason": "Heuristic gate (replace with Qwen call if available)."
#     }

# # ---------------------------
# # 4) Fusion & decision
# # ---------------------------
# def fuse(rule_score, ner_score, slm):
#     alpha, beta, gamma = 0.55, 0.25, 0.20
#     slm_bonus = 1.0 if slm.get("is_sensitive") else 0.0
#     return alpha*rule_score + beta*ner_score + gamma*slm_bonus

# def decide(score):
#     if score >= 0.70: return "REDACT"
#     if score >= 0.50: return "REVIEW"
#     return "KEEP"

# # ---------------------------
# # 5) Policy-based redaction
# # ---------------------------
# POLICY = {
#     "ENTERPRISE_DEFAULT": {
#         "EMAIL_ADDRESS": "MASK",
#         "PHONE_NUMBER":   "MASK",
#         "IP_ADDRESS":     "MASK",
#         "US_SSN":         "DROP",
#         "SALARY":         "MASK",
#         "CASE_ID":        "MASK",
#         "IBAN_CODE":      "MASK",
#         "SWIFT_CODE":     "MASK",
#         "CREDIT_CARD":    "DROP"
#     }
# }

# def apply_redactions(text, final_spans, policy_name="ENTERPRISE_DEFAULT"):
#     policy = POLICY.get(policy_name, {})
#     out, last = [], 0
#     audit = []
#     for i, s in enumerate(sorted(final_spans, key=lambda x: x["start"])):
#         action = policy.get(s["type"])
#         if not action:
#             continue
#         out.append(text[last:s["start"]])
#         raw = text[s["start"]:s["end"]]
#         placeholder = ""
#         if action == "MASK":
#             placeholder = f"[[{s['type']}_{i+1}]]"
#         elif action == "HASH":
#             placeholder = f"[[{s['type']}_HASH_{sha256_pepper(raw)[:10]}]]"
#         elif action == "DROP":
#             placeholder = ""
#         elif action == "FORMAT-PRESERVE":
#             # Simple preserve-last-4 example (for cards); extend as needed.
#             if s["type"] == "CREDIT_CARD":
#                 digits = re.sub(r"\D","",raw)
#                 placeholder = f"[[CARD_LAST4_{digits[-4:]}]]"
#             else:
#                 placeholder = f"[[{s['type']}_{i+1}]]"
#         out.append(placeholder)
#         audit.append({
#             "index": i+1,
#             "type": s["type"],
#             "action": action,
#             "start": s["start"],
#             "end": s["end"],
#             "timestamp": datetime.datetime.utcnow().isoformat()+"Z"
#         })
#         last = s["end"]
#     out.append(text[last:])
#     return "".join(out), audit

# # ---------------------------
# # 6) Optional: Bedrock (sanitized only)
# # ---------------------------
# def bedrock_summarize(sanitized_text, model_id="mistral.mistral-7b-instruct-v0"):
#     try:
#         import boto3
#         client = boto3.client("bedrock-runtime")
#         body = {
#             "messages": [
#                 {"role": "user", "content": [{"type":"text","text": (
#                     "Summarize the following already-sanitized content. "
#                     "Treat placeholders like [[EMAIL_1]] as anonymized tokens.\n\n"
#                     + sanitized_text
#                 )}]}
#             ],
#             "max_tokens": 300,
#             "temperature": 0.2,
#         }
#         resp = client.invoke_model(modelId=model_id, body=json.dumps(body))
#         payload = json.loads(resp["body"].read())
#         return payload.get("output",[{}])[0].get("content",[{}])[0].get("text","(No text)")
#     except Exception as e:
#         return f"(Bedrock call skipped or not configured) {e}"

# # ---------------------------
# # 7) Streamlit UI
# # ---------------------------

# st.set_page_config(page_title="AI PII Redactor — Local Only", layout="wide")
# st.title("AI Data Governance & PII Redactor (Local-Only Version)")

# st.markdown("**Policy Applied:** `ENTERPRISE_DEFAULT` (no cloud usage, all processing stays local)")

# text = st.text_area("Paste extracted text", height=280, placeholder="Paste PDF/CSV/email/chat text here...")

# if st.button("Detect & Redact"):
#     # STEP 1: Detect
#     spans = merge_overlaps(rule_spans(text) + ner_spans(text))

#     # STEP 2: Context-aware decision (SLM gate)
#     final = []
#     review = []
#     for s in spans:
#         ctx = text[max(0, s["start"]-120): s["end"]+120]
#         verdict = slm_is_sensitive(ctx, text[s["start"]:s["end"]])
#         fused = fuse(s.get("score",0.6), 0.0, verdict)
#         decision = decide(fused)

#         s2 = {**s, "fused_score": fused, "decision": decision, "slm": verdict}
#         if decision == "REDACT": final.append(s2)
#         elif decision == "REVIEW": review.append(s2)

#     # STEP 3: Apply final redactions using the single policy
#     sanitized, audit = apply_redactions(text, final, policy_name="ENTERPRISE_DEFAULT")

#     # UI output
#     st.subheader("Redacted Output")
#     st.code(sanitized, language="text")

#     st.write("### Redacted Spans")
#     st.json([{k:v for k,v in s.items() if k in ("type","start","end","fused_score","decision")} for s in final] or {})

#     st.write("### Under Review (Human Check Suggested)")
#     st.json([{k:v for k,v in s.items() if k in ("type","start","end","fused_score","decision")} for s in review] or {})

#     st.download_button("Download sanitized.txt", sanitized, "sanitized.txt")
#     st.download_button("Download audit.jsonl",
#                        "\n".join(json.dumps(a) for a in audit),
#                        "audit.jsonl",
#                        mime="application/json")
    

Overwriting streamlit_app.py
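The `luhn_ok` helper in the commented-out cell above filters card-number candidates with the Luhn checksum. The snippet below restates it with the same logic and checks it against `4111 1111 1111 1111`, a commonly used test card number, and against the same number with its last digit altered.

```python
import re

def luhn_ok(s):
    digits = [int(c) for c in re.sub(r"\D", "", s)]
    if len(digits) < 13 or len(digits) > 19:
        return False
    parity = len(digits) % 2
    total = 0
    for i, d in enumerate(digits):
        # Double every second digit, starting with the one left of the check digit.
        if i % 2 == parity:
            d = d * 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_ok("4111 1111 1111 1111"))  # True
print(luhn_ok("4111 1111 1111 1112"))  # False
print(luhn_ok("1234"))                 # False: fewer than 13 digits
```

Because `CARD_RE` accepts any run of 13 to 19 digits, this checksum is what keeps order numbers and similar digit strings from being labelled `CREDIT_CARD`.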

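Several detectors can fire on the same characters; a 16-digit card number, for instance, can also satisfy the phone pattern. `merge_overlaps` collapses such hits into one span and keeps the type of the higher-scoring detector. The sketch below restates the function from the commented-out cell and runs it on hand-written spans (the offsets are illustrative, not taken from a real document).

```python
def merge_overlaps(spans):
    spans = sorted(spans, key=lambda s: (s["start"], s["end"]))
    out = []
    for s in spans:
        if not out or s["start"] > out[-1]["end"]:
            out.append(s.copy())
        else:
            # Overlapping or touching spans: extend the previous one.
            out[-1]["end"] = max(out[-1]["end"], s["end"])
            out[-1]["score"] = max(out[-1]["score"], s["score"])
            # True only when s supplied the (new) maximum score.
            if s["score"] >= out[-1]["score"]:
                out[-1]["type"] = s["type"]
    return out

spans = [
    {"start": 10, "end": 29, "type": "PHONE_NUMBER", "score": 0.7},
    {"start": 10, "end": 29, "type": "CREDIT_CARD", "score": 0.98},
    {"start": 40, "end": 57, "type": "EMAIL_ADDRESS", "score": 0.9},
]
merged = merge_overlaps(spans)
for s in merged:
    print(s)
```

The first two spans collapse into a single `CREDIT_CARD` span with score 0.98, and the email span is left untouched.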

In [None]:
!streamlit run streamlit_app.py --server.headless true --server.port 8501

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://192.168.0.115:8501
  External URL: http://44.242.137.173:8501

2025-11-09 00:35:34.187 `label` got an empty value. This is discouraged for accessibility reasons and may be disallowed in the future by raising an exception. Please provide a non-empty label and hide it with label_visibility if needed.
Stack (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/streamlit/runti
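The `huggingface/tokenizers` warning in the output above appears because `!` shell commands fork the notebook kernel after the tokenizer has already used parallelism. As the message itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly silences it. A minimal sketch, meant to run in a cell before the `!streamlit` or `!pip` commands; choosing `false` rather than `true` is an assumption here, not something the notebook specifies.

```python
import os

# Set the variable explicitly so forked child processes inherit a fixed value
# and tokenizers no longer has to disable parallelism with a warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```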

In [72]:
!pip install -U streamlit pymupdf reportlab


Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com

[notice] A new release of pip is available: 25.2 -> 25.3
[notice] To update, run: pip install --upgrade pip


Collecting reportlab
  Downloading reportlab-4.4.4-py3-none-any.whl.metadata (1.7 kB)
Downloading reportlab-4.4.4-py3-none-any.whl (2.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 58.1 MB/s 0:00:00
Installing collected packages: reportlab
Successfully installed reportlab-4.4.4
