# Synthetic text PII scanner pipeline

The objective of this example is to showcase how **LeakPro** can be used to assess whether synthetic documents produced from an original dataset contain PIIs (Personally Identifiable Information).

The synthetic text PII scanner requires three objects to run:
1. *An original dataset*. In this example we fetch the **Text Anonymization Benchmark (TAB)** from GitHub. More info on the TAB dataset [here](https://github.com/NorskRegnesentral/text-anonymization-benchmark).
1. *A synthetic dataset*. In this example we fetch a synthetic version of TAB from the **LeakPro Hugging Face repo**, using the `datasets` package (install with `pip install datasets`).
1. *A PII classifier model*. The model should classify whether a token belongs to a PII and should be masked out (e.g. output `1`) or not (output `0`). In this example we fetch a PII classifier model from the **LeakPro Hugging Face repo**. The model was trained on `echr_train.json` from the TAB dataset.

The example showcases how public PIIs (defined as PIIs present in many documents) can be excluded so that the analysis focuses on the more sensitive non-public PIIs.
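
To make the notion of a public PII concrete, here is a minimal sketch that counts in how many distinct documents each PII text occurs. It is a simplified exact-match stand-in and not LeakPro's `utils.detect_non_public_pii`, which also takes a `similarity_threshold`. The function name `split_public_piis`, the `max_docs` parameter and the toy data are hypothetical.

```python
from collections import defaultdict


def split_public_piis(piis, max_docs=1):
    """Split (doc_nr, text) pairs into public and non-public PII texts.

    Assumption: a text is "public" when it occurs in more than
    `max_docs` distinct documents (exact string match only).
    """
    docs_per_text = defaultdict(set)
    for doc_nr, text in piis:
        docs_per_text[text].add(doc_nr)
    public = {t for t, docs in docs_per_text.items() if len(docs) > max_docs}
    non_public = set(docs_per_text) - public
    return public, non_public


# Toy data, made up for illustration: (document number, PII text)
toy_piis = [
    (0, "Strasbourg"), (1, "Strasbourg"), (2, "Strasbourg"),
    (0, "Mr A. Example"), (0, "Mr A. Example"),
    (2, "12345/00"),
]
public, non_public = split_public_piis(toy_piis, max_docs=1)
print(sorted(public))      # ['Strasbourg']
print(sorted(non_public))  # ['12345/00', 'Mr A. Example']
```

Repetitions inside a single document do not make a PII public in this sketch, because only distinct document numbers are counted.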

After public PIIs are filtered out, the example shows how the user can assess which PIIs appear in both the original dataset and the synthetic dataset.

In [1]:
import os
import sys

from datasets import load_dataset

os.environ["TOKENIZERS_PARALLELISM"] = "true"
sys.path.append("../..")

import examples.synthetic_data.utils.tab_data_aux as aux
from leakpro.synthetic_data_attacks.syn_text_pii_scanner import data_handling as dh
from leakpro.synthetic_data_attacks.syn_text_pii_scanner import utils
from leakpro.synthetic_data_attacks.syn_text_pii_scanner.pii_token_classif_models import ner_longformer_model as model_module

# Objects needed for pipeline
tab_url = "https://raw.githubusercontent.com/NorskRegnesentral/text-anonymization-benchmark/refs/heads/master/echr_train.json"
syn_tab_path = "LeakPro/synthetic_tab"
path_model = "LeakPro/pii-classifier-tab-dataset"

### Pipeline

Initialize the tokenizer, `label_set` and `data` object, then load the data and the model.

In [2]:
# Get tokenizer
tokenizer = model_module.get_tokenizer()
# Initialize label set and num_labels
label_set = dh.LabelSet(labels=["MASK"], IOB2_FORMAT=False)
num_labels = len(label_set.f_labels)
# Init data object with input and load data
data = utils.Data(
    ori = utils.SubData(
        # TAB data treatment function on original data
        path_or_data = aux.tab_data_treatment(
            data = aux.get_json_from_url(url=tab_url),
            only_first_annotator_flag = True,
            include_tasks_flag = False
        ),
        label_set = label_set,
        label_key = "more_ex_label",
        batch_size = 8,
        shuffle = False,
        num_workers = 2
    ),
    syn = utils.SubData(
        path_or_data = load_dataset(syn_tab_path, split="train"),
        batch_size = 8,
        shuffle = False,
        num_workers = 2
    )
)
utils.load_data(data=data, tokenizer=tokenizer)
# Load model
model = model_module.NERLongformerModel.from_pretrained(path_model).to(utils.device)

### Forward pass

Two forward passes are executed, one on the original data and one on the synthetic data. If labels for the original data are provided, the forward pass on the original data is not required.

The predictions are appended to the `data` object.

In [3]:
# Forward pass
utils.forward_pass(
    data = data,
    num_labels = num_labels,
    model = model,
    verbose = True
)

2025-02-11 17:39:55,551 INFO     Starting forward_pass with ori.
  0%|          | 0/127 [00:00<?, ?it/s]Input ids are automatically padded to be a multiple of `config.attention_window`: 512
100%|██████████| 127/127 [02:44<00:00,  1.30s/it]
2025-02-11 17:42:41,651 INFO     Ending forward_pass with ori.
2025-02-11 17:42:41,652 INFO     Starting forward_pass with syn.
100%|██████████| 127/127 [03:00<00:00,  1.42s/it]
2025-02-11 17:45:43,398 INFO     Ending forward_pass with syn.


Get the PIIs and show the first 20 (originally labeled PIIs vs. PIIs predicted by the PII classifier model).
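
As a rough illustration of how per-token `0`/`1` labels can be turned into PII strings, the sketch below merges consecutive tokens labeled `1` into character spans. It is not the implementation of `utils.get_PIIs_01`. The function `spans_from_binary_labels`, the regex tokenizer and the toy sentence are assumptions made for this example.

```python
import re


def spans_from_binary_labels(text, offsets, labels):
    """Merge consecutive tokens labeled 1 into (start, end, text) spans."""
    spans = []
    start = end = None
    for (tok_start, tok_end), label in zip(offsets, labels):
        if label == 1:
            if start is None:
                start = tok_start  # open a new span
            end = tok_end          # extend the current span
        elif start is not None:
            spans.append((start, end, text[start:end]))
            start = end = None
    if start is not None:          # close a span that runs to the last token
        spans.append((start, end, text[start:end]))
    return spans


# Toy sentence (made up); a simple regex stands in for the real tokenizer
text = "Mr John Doe lodged application 12345/00."
offsets = [(m.start(), m.end()) for m in re.finditer(r"\w[\w/]*|[^\w\s]", text)]
labels = [1, 1, 1, 0, 0, 1, 0]  # hypothetical per-token labels

for _, _, pii_text in spans_from_binary_labels(text, offsets, labels):
    print(pii_text)
```

With these toy labels the two printed spans are `Mr John Doe` and `12345/00`.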

In [4]:
inputs_ori = data.ori.model_input_output.inputs
labels_ori = data.ori.model_input_output.labels
predictions_ori = data.ori.model_input_output.predictions
inputs_syn = data.syn.model_input_output.inputs
predictions_syn = data.syn.model_input_output.predictions

# Get PIIs token offsets
ori_piis = utils.get_PIIs_01(
    labels = labels_ori,
    inputs = inputs_ori,
    tokenizer = tokenizer
)
ori_predicted_piis = utils.get_PIIs_01(
    labels = predictions_ori,
    inputs = inputs_ori,
    tokenizer = tokenizer
)
syn_piis = utils.get_PIIs_01(
    labels = predictions_syn,
    inputs = inputs_syn,
    tokenizer = tokenizer
)

# Show texts of ori_piis and ori_predicted_piis
print("\nComparison first 20 records ori_piis and ori_predicted_piis")
print("ori_piis\tori_predicted_piis")
print("--------\t------------------")
for i in range(20):
    print(ori_piis[i].text, ori_predicted_piis[i].text, sep="\t")



Comparison first 20 records ori_piis and ori_predicted_piis
ori_piis	ori_predicted_piis
--------	------------------
36244/06	36244/06
Mr Henrik Hasslund	Mr Henrik Hasslund
Mr Tyge Trier	Mr Tyge Trier
Ms Nina Holst-Christensen	Ms Nina Holst-Christensen
29366/03	29366/03
Mr D. Stępniak	Mr D. Stępniak
Mr J. Wołąsiewicz	Mr J. Wołąsiewicz
5138/04	5138/04
Mr Nusret Amutgan	Mr Nusret Amutgan
Ms B Özpolat	Ms B Özpolat
42596/98	42596/98
42603/98	42603/98
Mr Mustafa Sarı	Mr Mustafa Sarı
Ms Sibel Çolak	Ms Sibel Çolak
Mrs E. Çıtak	Mrs E. Çıtak
Mr Sarı	Mr Sarı
Ms Çolak	Ms Çolak
Ms Çolak	Ms Çolak
Mr Sarı	Mr Sarı
Mr Sarı	Mr Sarı


Detect non-public PIIs and compare the PII lists.

The function `utils.compare_piis_lists` logs the distribution of cosine similarities between the original non-public PIIs (`filtered_non_public_ori_piis`) and the PIIs present in the synthetic dataset (`syn_piis`), to guide the user in setting a `similarity_threshold`. If the similarity of a pair is above the threshold, the corresponding original PII and synthetic PII are stored in `sorted_sim_items` and analyzed further.
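
The thresholding idea can be sketched with a toy embedding. The example below represents each PII text as a bag of character trigrams and keeps the pairs whose cosine similarity is above a threshold. This is only a stand-in: how LeakPro embeds PIIs is not shown in this example, and count vectors can never give negative similarities. The helpers `char_ngrams`, `cosine` and `similar_pairs` and the toy strings are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Toy embedding: counts of character n-grams of the padded, lowercased text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_pairs(ori_texts, syn_texts, threshold):
    """Return all pairwise similarities and the index pairs above `threshold`."""
    ori_vecs = [char_ngrams(t) for t in ori_texts]
    syn_vecs = [char_ngrams(t) for t in syn_texts]
    sims = [
        (cosine(o, s), i, j)
        for i, o in enumerate(ori_vecs)
        for j, s in enumerate(syn_vecs)
    ]
    matches = [(i, j) for sim, i, j in sims if sim > threshold]
    return sims, matches


# Toy PII strings, made up for illustration
ori = ["Mr A. Example", "12345/00"]
syn = ["Mr A. Example", "Ms B. Sample", "54321/99"]
sims, matches = similar_pairs(ori, syn, threshold=0.95)
print(len(sims))  # 6 pairs compared (2 original x 3 synthetic)
print(matches)    # [(0, 0)]
```

Every original PII is compared with every synthetic PII, so the number of similarities grows with the product of the two list lengths.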

In [5]:
similarity_threshold = 0.95
min_nr_repetitions = 1

# Get non-public ori piis
non_public_ori_piis = utils.detect_non_public_pii(
    piis = ori_piis,
    similarity_threshold = similarity_threshold,
    min_nr_repetitions = min_nr_repetitions
)
# Get non-public ori predicted piis
non_public_ori_predicted_piis = utils.detect_non_public_pii(
    piis = ori_predicted_piis,
    similarity_threshold = similarity_threshold,
    min_nr_repetitions = min_nr_repetitions
)
# Set ignore_list from the predefined ignore list; add extra PII strings to the trailing list to ignore them too
ignore_list = aux.tab_predefined_ignore_piis_list + []

# Remove PIIs in ignore_list from non_public_ori_piis
filtered_non_public_ori_piis = [i for i in non_public_ori_piis if i.text not in ignore_list]

sit, tot, sorted_sim_items, distr = utils.compare_piis_lists(
    ori_piis = filtered_non_public_ori_piis,
    syn_piis = syn_piis,
    similarity_threshold = similarity_threshold
)

print_ori_syn_cases_fun = aux.print_ori_syn_cases_fact(
    syn_piis = syn_piis,
    sorted_sim_items = sorted_sim_items,
    data = data
)

2025-02-11 17:47:17,790 INFO     Start compare_piis_lists
2025-02-11 17:47:21,472 INFO     Start log_distribution
2025-02-11 17:47:21,472 INFO     Mean: 0.162158
2025-02-11 17:47:21,473 INFO     0th Percentile: -0.328879
2025-02-11 17:47:21,473 INFO     10th Percentile: 0.010996
2025-02-11 17:47:21,474 INFO     100th Percentile: 1.000001
2025-02-11 17:47:21,474 INFO     25th Percentile: 0.070178
2025-02-11 17:47:21,475 INFO     50th Percentile: 0.144687
2025-02-11 17:47:21,475 INFO     75th Percentile: 0.237819
2025-02-11 17:47:21,475 INFO     90th Percentile: 0.339598
2025-02-11 17:47:21,475 INFO     99th Percentile: 0.536478
2025-02-11 17:47:21,476 INFO     End log_distribution
2025-02-11 17:47:21,476 INFO     Nr. Similar Items: 135
2025-02-11 17:47:21,476 INFO     Total Items: 87,481,260
2025-02-11 17:47:21,477 INFO     Percentage: 0.000154%
2025-02-11 17:47:21,478 INFO     End compare_piis_lists


Further information on the comparison:

In [6]:
print(f"Original (labels) PIIs len: {len(ori_piis)}")
print(f"Original (predicted) PIIs len: {len(ori_predicted_piis)}")
print(f"Non-public original (labels) PIIs len: {len(non_public_ori_piis)}")
print(f"Non-public original (predicted) PIIs len: {len(non_public_ori_predicted_piis)}")
print(f"Filtered (with ignore list) non-public original PIIs len: {len(filtered_non_public_ori_piis)}")
print(f"Len Ignore List: {len(ignore_list)}")
print(f"Nr synthetic PIIs: {len(syn_piis)}")



Original (labels) PIIs len: 15350
Original (predicted) PIIs len: 15541
Non-public original (labels) PIIs len: 8945
Non-public original (predicted) PIIs len: 8990
Filtered (with ignore list) non-public original PIIs len: 8564
Len Ignore List: 123
Nr synthetic PIIs: 10215
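
The counts above are consistent with the `compare_piis_lists` log: `Total Items` equals the number of filtered non-public original PIIs times the number of synthetic PIIs, which suggests that every original PII is compared with every synthetic PII. The short check below only reproduces that arithmetic from the printed values; the variable names are chosen for this example.

```python
n_ori = 8564     # filtered (with ignore list) non-public original PIIs
n_syn = 10215    # synthetic PIIs
n_similar = 135  # "Nr. Similar Items" from the log

total_pairs = n_ori * n_syn
share = 100 * n_similar / total_pairs

print(f"{total_pairs:,}")  # 87,481,260
print(f"{share:.6f}%")     # 0.000154%
```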


### Further analysis of `sorted_sim_items`

In [7]:
print("len sorted_sim_items in original PIIs:", len(sorted_sim_items))
print("len of sorted_sim_items in synthetic PIIs:", sum([v["len_syn_items"] for v in sorted_sim_items]))

len sorted_sim_items in original PIIs: 110
len of sorted_sim_items in synthetic PIIs: 135


Print the first 20 similar items (similar PIIs present in both the original and the synthetic dataset).

Structure:

| Original item id | Nr. of synthetic items | Original document nr. | Original PII | (empty) | Synthetic doc. nrs. | Nr. of synthetic docs |
| -- | -- | -- | -- | -- | -- | -- |

In [8]:
for item in sorted_sim_items[:20]:
    print(
        item["ori_item"], item["len_syn_items"], item["ori_doc_nr"],
        item["ori_text"], "", item["syn_docs"], item["len_syn_docs"],
        sep="\t|",
    )

7570	|3	|944	|Mr. Koç	|	|[175, 934, 999]	|3
7575	|3	|944	|Mr. Koç	|	|[175, 934, 999]	|3
7591	|3	|944	|Mr. Koç	|	|[175, 934, 999]	|3
7593	|3	|944	|Mr. Koç	|	|[175, 934, 999]	|3
1803	|2	|325	|Mr P. Havers	|	|[339, 993]	|2
2065	|2	|378	|Mr Mehmet Aydın	|	|[493, 826]	|2
5571	|2	|782	|Mr N. Özdemir	|	|[340, 909]	|2
5743	|2	|796	|Mr Mehmet Güneş	|	|[301, 627]	|2
21	|2	|5	|Mr A. Erdoğan	|	|[869]	|1
42	|1	|8	|Süleyman Çetinkaya	|	|[935]	|1
230	|1	|46	|Mr M.S. Talay	|	|[968]	|1
465	|1	|95	|Mr Artur Nowak	|	|[25]	|1
542	|1	|107	|Mr S Karadayı	|	|[435]	|1
560	|1	|112	|Mr Dariusz Nowak	|	|[180]	|1
593	|1	|120	|Ms N. Vajić	|	|[103]	|1
618	|4	|126	|Singh	|	|[605]	|1
657	|1	|135	|87/2	|	|[975]	|1
663	|1	|137	|Mr S. Sikora	|	|[405]	|1
724	|1	|146	|Mehmet Duman	|	|[376]	|1
748	|1	|152	|Mr C. Bîrsan	|	|[360]	|1
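
The same PII can appear several times in one original document (for example `Mr. Koç` in document 944), so it can help to group the rows by original document and PII text. The sketch below does this for four rows copied from the output above. It assumes that the entries of `sorted_sim_items` behave like dictionaries with the keys used in the printing cell and that `syn_docs` is a list of integers.

```python
from collections import defaultdict

# Four rows copied from the printed output above
rows = [
    {"ori_item": 7570, "len_syn_items": 3, "ori_doc_nr": 944,
     "ori_text": "Mr. Koç", "syn_docs": [175, 934, 999], "len_syn_docs": 3},
    {"ori_item": 7575, "len_syn_items": 3, "ori_doc_nr": 944,
     "ori_text": "Mr. Koç", "syn_docs": [175, 934, 999], "len_syn_docs": 3},
    {"ori_item": 1803, "len_syn_items": 2, "ori_doc_nr": 325,
     "ori_text": "Mr P. Havers", "syn_docs": [339, 993], "len_syn_docs": 2},
    {"ori_item": 21, "len_syn_items": 2, "ori_doc_nr": 5,
     "ori_text": "Mr A. Erdoğan", "syn_docs": [869], "len_syn_docs": 1},
]

# Group by (original document, PII text) and collect the synthetic documents
grouped = defaultdict(lambda: {"ori_items": [], "syn_docs": set()})
for row in rows:
    key = (row["ori_doc_nr"], row["ori_text"])
    grouped[key]["ori_items"].append(row["ori_item"])
    grouped[key]["syn_docs"].update(row["syn_docs"])

for (doc_nr, pii_text), info in grouped.items():
    print(doc_nr, pii_text, len(info["ori_items"]), sorted(info["syn_docs"]), sep="\t|")
```

For these rows the grouping yields three distinct (document, PII) pairs, with the two `Mr. Koç` rows merged into one.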


Examine the first PII match in the original and synthetic cases (with the predefined function `print_ori_syn_cases_fun`).

In [9]:
print_ori_syn_cases_fun(0)


### PII Mr. Koç 

Original court case:
PROCEDURE

The case of Luedicke, Belkacem and Koç was referred to the Court by the Government of the Federal Republic of Germany (“the Government“) and the European Commission of Human Rights (“the Commission“). The case originated in three applications against the Federal Republic of Germany lodged with the Commission by Mr. Gerhard W. Luedicke, Mr. Mohammed Belkacem and Mr. Arif Koç in 1973, 1974 and 1975 respectively. The Commission ordered the joinder of these three applications on 4 October 1976.

Both the application of the Government, which referred to Article 48 (art. 48) of the Convention, and the request of the Commission, which relied on Articles 44 and 48, sub-paragraph (a) (art. 44, art. 48-a), and to which was attached the report provided for under Article 31 (art. 31), were lodged with the registry of the Court within the period of three months laid down in Articles 32 para. 1 and 47 ((art. 32-1, art. 47). The application was lodge