# **spaCy NER model for PI redaction**

This notebook is a first trial of spaCy's named entity recognition (NER) model, used to evaluate whether it detects and redacts personal information (PI) well enough for the project.

Step 1: imports

In [1]:
!pip install Trafilatura
import os
import re

import spacy
# import only the names used below instead of a wildcard import
from trafilatura import extract, fetch_url

Collecting Trafilatura
  Downloading trafilatura-1.12.2-py3-none-any.whl.metadata (14 kB)
Collecting courlan>=1.2.0 (from Trafilatura)
  Downloading courlan-1.3.2-py3-none-any.whl.metadata (17 kB)
Collecting htmldate>=1.8.1 (from Trafilatura)
  Downloading htmldate-1.9.1-py3-none-any.whl.metadata (10 kB)
Collecting justext>=3.0.1 (from Trafilatura)
  Downloading jusText-3.0.1-py2.py3-none-any.whl.metadata (6.9 kB)
Collecting tld>=0.13 (from courlan>=1.2.0->Trafilatura)
  Downloading tld-0.13-py2.py3-none-any.whl.metadata (9.4 kB)
Collecting dateparser>=1.1.2 (from htmldate>=1.8.1->Trafilatura)
  Downloading dateparser-1.2.0-py2.py3-none-any.whl.metadata (28 kB)
Collecting lxml-html-clean (from lxml[html_clean]>=4.4.2->justext>=3.0.1->Trafilatura)
  Downloading lxml_html_clean-0.3.1-py3-none-any.whl.metadata (2.4 kB)
Downloading trafilatura-1.12.2-py3-none-any.whl (132 kB)

Step 2: load model

In [2]:
# Load spacy NER model
nlp = spacy.load('en_core_web_sm')

Step 3: define function for PII detection and redaction using the model

In [3]:
# Function to detect and redact personal information using Named Entity Recognition
def remove_personal_info(text):
    doc = nlp(text)
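    # en_core_web_sm's NER has no EMAIL label, so in practice only PERSON entities match
    pi_entities = ['PERSON', 'EMAIL']
    redacted_text = text

    instances = 0
    for ent in doc.ents:
        if ent.label_ in pi_entities:
            # str.replace swaps every occurrence of the entity text, not just this span
            redacted_text = redacted_text.replace(ent.text, '[REDACTED]')
            instances += 1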

    return redacted_text, instances
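
Because `remove_personal_info` redacts with `str.replace`, every occurrence of an entity's text is replaced, including occurrences inside longer words. A minimal sketch of an offset-based alternative follows. It assumes the spans are `(start, end)` character offsets, which spaCy exposes on each entity as `ent.start_char` and `ent.end_char`; the helper name `redact_spans` and the sample sentence are hypothetical and not part of the notebook.

```python
def redact_spans(text, spans, placeholder="[REDACTED]"):
    """Replace each (start, end) character span in `text` with `placeholder`."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already redacted
        pieces.append(text[cursor:start])  # keep the text before the span
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # keep the text after the last span
    return "".join(pieces)


# Hypothetical input: the two spans stand in for (ent.start_char, ent.end_char)
sample = "Ann met Annabel. Ann left."
spans = [(0, 3), (17, 20)]

redacted = redact_spans(sample, spans)
naive = sample.replace("Ann", "[REDACTED]")

print(redacted)  # only the two marked spans are replaced
print(naive)     # the prefix of "Annabel" is replaced as well
```

With this approach the instance count is simply the number of spans, and text that merely shares characters with an entity is left alone.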

Step 4: perform a Trafilatura extraction and PII removal on the ECE404 webpage as an example

In [None]:
# test a specific instance: the ECE 404 homepage
ece404html_content = fetch_url("https://engineering.purdue.edu/ece404/")
text = extract(ece404html_content, favor_recall=True) # favor_precision=True will cut out noise, favor_recall=True will keep more in
redacted_text, count = remove_personal_info(str(text))
print("Total instances of PI detected in ECE404 homepage: " + str(count))
print(str(redacted_text))

Total instances of PI detected in ECE404 homepage: 7
Instructor: [REDACTED]Professor, ECE
E-mail: [REDACTED] (You must place the string 'ece404' in the subject line to get past your instructor's notorious spam filter)
Graduate TAs:
[REDACTED]
-
E-mail: kashyap9 (at purdue dot edu)
[REDACTED]E-mail: wang3450 (at purdue dot edu)
[REDACTED]
-
E-mail: [REDACTED] (at purdue dot edu)
Lecture Location and Time
-
TuTh: 6:00 PM - 7:15 PM, PHYS 112
Course Description
-
Beyond question, computer and network security has emerged as one of
the most important subjects of study in modern times. Even the minutest
details of our lives now depend on our computers and networks working
with our trust that the information that is private to us will not fall
in the hands of those with ill intent. The two major components of
computer and network security are cryptography and what is known as
systems-oriented security. For a good education in computer and network
security, you have no choice but to learn them

Step 5: mount a subset of the GovDocs1 dataset from Google Drive for testing (this cell is only reproducible if you download the GovDocs1 data yourself from https://corp.digitalcorpora.org/corpora/files/govdocs1/threads/ and save it to your Drive)

In [None]:
from google.colab import drive
drive.mount('/content/drive')
govdocs_dir = '/content/drive/My Drive/ECE570/govdocs_testingdata'

Mounted at /content/drive


Step 6: create a list of the HTML files in the dataset (again not reproducible without a local copy of the GovDocs1 data, since the files cannot be accessed any other way)


In [None]:
# Function to identify and read HTML files from the dataset
def find_html_files(directory):
    html_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            # Check if the file has an HTML extension
            if file.endswith(".html") or file.endswith(".htm"):
                html_files.append(os.path.join(root, file))
    return html_files

# Find HTML files in the dataset
html_files = find_html_files(govdocs_dir)
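
The same file search can also be written with `pathlib`. The sketch below is one possible variant, not the notebook's code: it assumes some files might carry upper-case extensions such as `.HTM`, which the `endswith` checks above would skip, and it returns the paths sorted so that indexing into the list is stable across runs. The helper name `find_html_paths` and the temporary demo files are hypothetical.

```python
import tempfile
from pathlib import Path

HTML_SUFFIXES = {".html", ".htm"}


def find_html_paths(directory):
    """Recursively collect HTML files, matching extensions case-insensitively."""
    return sorted(
        str(path)
        for path in Path(directory).rglob("*")
        if path.is_file() and path.suffix.lower() in HTML_SUFFIXES
    )


# Small self-contained demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.html").write_text("<p>a</p>")
    (root / "sub" / "B.HTM").write_text("<p>b</p>")
    (root / "notes.txt").write_text("skip me")
    found_names = [Path(p).name for p in find_html_paths(root)]

print(found_names)
```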

Step 7: for each HTML file from the GovDocs1 dataset, extract the content with Trafilatura and use the model to remove PII (again not reproducible unless you have downloaded a GovDocs1 thread and saved it to your Drive)

In [None]:
# Process the HTML content with Trafilatura's extract() function
# detect and remove PII with the spacy model
def process_html(html_content):
    text = extract(html_content, favor_recall=True)
    if text:
        # Apply PI redaction
        redacted_text, instances = remove_personal_info(text)
        return redacted_text, instances
    else:
        # return a pair so the caller can always unpack (text, count)
        return None, 0

# Function to read the content of an HTML file
def read_html_file(file_path):
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        return f.read()

# for each HTML file, extract content and detect and remove PII
if html_files:
    i = 0
    total_instances = 0
    ex_for_print = None  # stays None if the subset has fewer than 14 files
    for file in html_files:
        html_content = read_html_file(file)
        redacted_content, instances = process_html(html_content)
        total_instances += instances
        if i == 13: # this was selected because it is an example of reasonably small enough length to display
            ex_for_print = redacted_content
        # if redacted_content:
        #     try:
        #         with open(f'./spacy_PI_filtered_content/test{i}.txt', 'w', encoding='utf-8') as fp:
        #             fp.write(redacted_content)
        #     except Exception as e:
        #         print(f"Error writing to file: {e}")
        # else:
        #     print(f'Failed to extract text from {file}.')
        i += 1
    print("Total instances of PI detected in GovDocs subset: " + str(total_instances))
    print("Example: " + str(ex_for_print))
else:
    print("No HTML files found in the directory.")


Total instances of PI detected in GovDocs subset: 19758
Example: Statement of [REDACTED] lance [REDACTED], nominated by president [REDACTED] to be a member of the board of trustees of the morris k. udall scholarship and excellence in national environmental policy foundation
To the united states senate committee on environment and public works
Submitted march 25, 2003
Mr. Chairman and Members of the Committee, thank you for the opportunity to provide this statement in support of my nomination to be a member of the Board of Trustees of the Morris K. Udall Foundation. I am honored and grateful that President [REDACTED] saw fit to nominate me to this position and, if confirmed, look forward to continuing my public service by helping to advance the mission of the Udall Foundation.
My professional career and personal background has provided me with valuable experience and perspective to bring to the Udall Foundation. I grew up in northern rural Michigan close to the shores of Lake Michigan. 

Step 8: download labeled dataset for accuracy testing

The dataset can be found here: https://huggingface.co/datasets/ai4privacy/pii-masking-400k

In [4]:
!pip install datasets
from datasets import load_dataset
ds = load_dataset("ai4privacy/pii-masking-400k")

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/325517 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/81379 [00:00<?, ? examples/s]

Step 9: Split the data, run through model, and compute accuracy based on location of redactions in the strings

In [5]:
# select 1,000 rows from the validation set as no test set is available
data = ds['validation']
data = data[0:1000]
y_true = data['masked_text']
X = data['source_text']
y_pred = []
for x in X:
  y, _ = remove_personal_info(x)
  y_pred.append(y)

def normalize_text(text):
    text = text.replace(" ", "")
    text = text.replace("\n", "")
    return text.lower()

# finds each redaction and adds the redaction and the characters to its left and right into a tuple
# returns list of the tuples
def extract_redactions_with_context(text):
    redactions_with_context = []

    # Use regular expression to find all redactions and their positions
    for match in re.finditer(r'\[.*?\]', text):
        start = match.start()
        end = match.end()
        left_char = text[start - 1] if start > 0 else ''  # Get the left character
        right_char = text[end] if end < len(text) else ''  # Get the right character
        redactions_with_context.append((match.group(0), left_char, right_char))  # Store redaction and its context

    return redactions_with_context

# custom accuracy score: rates each predicted redaction by whether the characters to its
# left and right match those around the corresponding true redaction
def compute_accuracy(y_true, y_pred):
    # normalize both true and predicted texts by removing all spaces, newlines, and converting to lowercase
    y_true_normalized = [normalize_text(str(t)) for t in y_true]
    y_pred_normalized = [normalize_text(str(p)) for p in y_pred]
    total_redactions = 0
    correct_matches = 0

    for true_text, pred_text in zip(y_true_normalized, y_pred_normalized):
      # find each redaction and the characters to its left and right, returns list of tuples
      true_redactions = extract_redactions_with_context(true_text)
      pred_redactions = extract_redactions_with_context(pred_text)

      # count total redactions in the predicted text (the denominator of the score)
      total_redactions += len(pred_redactions)

      # check surrounding characters of predicted redactions against true redactions
      i = 0
      for pred_redaction, pred_left, pred_right in pred_redactions:
        if i < len(true_redactions) and true_redactions[i][1] == pred_left: # correct character on the left
          # The start of the redaction is considered more important than the end
          correct_matches += 1
        elif pred_left == ']': # the case when there are two back to back redactions, so need to check the previous
            if i-1 >= 0 and i < len(true_redactions) and pred_redactions[i-1][1] == true_redactions[i][1]:
              correct_matches += .5
        elif i-1 >= 0 and i-1 < len(true_redactions) and pred_left == true_redactions[i-1][1]: # the case where two back to back redactions in true but pred only made one
          correct_matches += .5

        if i < len(true_redactions) and true_redactions[i][2] == pred_right: # correct character on the right
          correct_matches += .5
        elif pred_right == '[': # the case when there are two back to back redactions in pred, so need to check the next one
            if i+1 < len(pred_redactions) and i < len(true_redactions) and pred_redactions[i+1][2] == true_redactions[i][2]:
              correct_matches += .5
        elif i-1 >= 0 and i-1 < len(true_redactions) and pred_right == true_redactions[i-1][2]: # the case where two back to back redactions in true but pred only made one
          correct_matches += .5

        i += 1

    # compute accuracy as the correct-match score over the total predicted redactions
    # note: a redaction matching on both sides scores 1 + 0.5, so this ratio is not capped at 1
    accuracy = correct_matches / total_redactions if total_redactions > 0 else 0
    return accuracy

# Calculate accuracy
accuracy = compute_accuracy(y_true, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

Accuracy: 65.03%
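
The score above divides by the number of predicted redactions, so it says little about true redactions the model missed. A rough way to see both sides is to report precision and recall separately. The sketch below is an alternative illustration, not the metric used in this notebook: it matches redactions only by the character to their left, using the same normalization as `normalize_text`, and the function names and toy strings are hypothetical.

```python
import re
from collections import Counter


def left_contexts(text):
    """Multiset of the characters immediately left of each [...] tag."""
    # Same normalization as the notebook: drop spaces and newlines, lowercase
    normalized = text.replace(" ", "").replace("\n", "").lower()
    return Counter(
        normalized[m.start() - 1] if m.start() > 0 else ""
        for m in re.finditer(r"\[.*?\]", normalized)
    )


def precision_recall(true_text, pred_text):
    """Precision and recall of predicted redactions, matched by left context."""
    true_ctx = left_contexts(true_text)
    pred_ctx = left_contexts(pred_text)
    matched = sum((true_ctx & pred_ctx).values())  # multiset intersection
    n_pred = sum(pred_ctx.values())
    n_true = sum(true_ctx.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_true if n_true else 0.0
    return precision, recall


# Toy pair: the prediction misses the phone number
true_masked = "Call [NAME] at [PHONE] or [EMAIL]."
pred_masked = "Call [REDACTED] at 555-0100 or [REDACTED]."
precision, recall = precision_recall(true_masked, pred_masked)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the toy pair every predicted redaction lines up with a true one, so precision is high, while the missed phone number lowers recall.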
