# **DistilBERT for PI redaction**

This notebook is a first trial of the DistilBERT NER model, used to judge whether it redacts personal information (PI) well enough for the project.

Step 1: imports

In [None]:
!pip install Trafilatura
import os

from transformers import pipeline
from transformers import DistilBertTokenizer
from transformers import DistilBertForTokenClassification, Trainer, TrainingArguments
# Import only the names that are used instead of a wildcard import
from trafilatura import extract, fetch_url


Step 2: load model

In [None]:
# Load DistilBERT NER model
nlp = pipeline('ner', model='Davlan/distilbert-base-multilingual-cased-ner-hrl')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





Step 3: mount subset of GovDocs dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')
govdocs_dir = '/content/drive/My Drive/ECE570/govdocs_testingdata'

Mounted at /content/drive


Step 4: create a list of the html files from the dataset

In [None]:
# Function to identify and read HTML files from the dataset
def find_html_files(directory):
    html_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            # Check if the file has an HTML extension
            if file.endswith(".html") or file.endswith(".htm"):
                html_files.append(os.path.join(root, file))
    return html_files

# Find HTML files in the dataset
html_files = find_html_files(govdocs_dir)
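
The walker above matches only lowercase `.html` and `.htm` extensions. As a minimal sketch, assuming the dataset might also contain uppercase extensions such as `.HTM` (an assumption, not something the notebook states), a case-insensitive variant can be checked against a throwaway directory tree. The names `find_html_files_ci` and `HTML_SUFFIXES` are hypothetical and do not appear in the notebook.

```python
import tempfile
from pathlib import Path

# Hypothetical constant: the extensions treated as HTML, compared in lowercase.
HTML_SUFFIXES = {".html", ".htm"}

def find_html_files_ci(directory):
    """Return sorted paths of files whose extension is .html/.htm, ignoring case."""
    return sorted(
        str(p)
        for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
    )

# Build a small temporary tree so the function can be tried without Drive.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["a.html", "b.HTM", "sub/c.htm", "notes.txt"]:
        (root / name).write_text("<html></html>", encoding="utf-8")
    found_names = sorted(Path(p).name for p in find_html_files_ci(root))

print(found_names)
```

Sorting the result also makes the file order reproducible between runs, which `os.walk` does not guarantee.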

Step 5: for each html file, extract content with Trafilatura and use the model to remove PII

In [None]:
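# Detect and redact personal information using Named Entity Recognition.
# Returns the redacted text and the number of replacements made.
def remove_personal_info(text):
    entities = nlp(text)
    # Labels treated as personal information. Only labels the model actually
    # emits can match; in the ECE404 test in this notebook the e-mail address
    # is left untouched, so 'EMAIL' does not appear to be one of them.
    pi_entities = ['B-PER', 'I-PER', 'EMAIL']
    redacted_text = text

    # Walk the entities from the end of the text to the start, so that each
    # replacement leaves the offsets of the entities still to be processed valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)

    # Replace each matching entity with '[REDACTED]'. Without an aggregation
    # strategy the pipeline returns one entry per token, so a single name can
    # be replaced (and counted) more than once.
    instances = 0
    for ent in sorted_entities:
        if ent['entity'] in pi_entities:
            redacted_text = redacted_text[:ent['start']] + '[REDACTED]' + redacted_text[ent['end']:]
            instances += 1

    return redacted_text, instances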

# Extract the main text with Trafilatura's extract() function, then
# detect and remove the PII with the DistilBERT NER model
def process_html(html_content):
    text = extract(html_content, favor_recall=True)
    if text:
        # Apply PI redaction
        return remove_personal_info(text)
    # Extraction failed: return a pair so callers can still unpack the result
    return None, 0

# Function to read the content of an HTML file
def read_html_file(file_path):
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        return f.read()

# for each HTML file, extract content and detect and remove PII
if html_files:
    i = 0
    total_instances = 0
    for file in html_files:
        html_content = read_html_file(file)
        redacted_content, instances = process_html(html_content)
        total_instances += instances
        # if redacted_content:
        #     try:
        #         with open(f'./distilBERT_filtered_content/test{i}.txt', 'w', encoding='utf-8') as fp:
        #             fp.write(redacted_content)
        #         i += 1
        #     except Exception as e:
        #         print(f"Error writing to file: {e}")
        # else:
        #     print(f'Failed to extract text from {file}.')
    print("Total instances of PI detected in GovDocs subset: " + str(total_instances))

# test a specific instance: the ECE 404 homepage
ece404 = '/content/drive/My Drive/ECE570/ECE404.html'
with open(ece404, 'r', encoding='utf-8') as fp:
    ece404html_content = fp.read()
# favor_precision=True will cut out noise, favor_recall=True will keep more in
text = extract(ece404html_content, favor_recall=True)
redacted, count = remove_personal_info(str(text))
print("Total instances of PI detected in ECE404 homepage: " + str(count))


Total instances of PI detected in GovDocs subset: 622
Total instances of PI detected in ECE404 homepage: 9
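
These counts come from token-level predictions: the code reads `ent['entity']`, the key the pipeline returns when no aggregation strategy is set, so a name split into several word pieces is counted several times and may be only partly replaced. The ECE404 test in this notebook also leaves the plain e-mail address intact. The sketch below is one possible post-processing step, not the notebook's method. It widens each person token to the whole word, merges pieces of the same name into one span, and adds e-mail spans found with a simple regular expression. The helper names, the regex, and the hand-written token offsets in the example are all assumptions made for illustration. The offsets are not real model output.

```python
import re

PERSON_LABELS = {"B-PER", "I-PER"}
# Deliberately simple pattern (an assumption); it will not match obfuscated
# forms such as "name (at purdue dot edu)".
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def expand_to_word(text, start, end):
    """Widen a character span until it covers the whole alphanumeric word."""
    while start > 0 and text[start - 1].isalnum():
        start -= 1
    while end < len(text) and text[end].isalnum():
        end += 1
    return start, end

def merge_person_spans(entities, text):
    """Turn token-level PER predictions into one (start, end) span per name."""
    spans = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["entity"] not in PERSON_LABELS:
            continue
        start, end = expand_to_word(text, ent["start"], ent["end"])
        if spans:
            prev_start, prev_end = spans[-1]
            gap = text[prev_end:start]  # empty when the spans touch or overlap
            # Same word, or an I-PER continuation after a single space
            if gap == "" or (gap == " " and ent["entity"] == "I-PER"):
                spans[-1] = (prev_start, max(prev_end, end))
                continue
        spans.append((start, end))
    return spans

def redact(text, spans, placeholder="[REDACTED]"):
    """Replace the spans with a placeholder; return the text and span count."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:  # overlapping spans become one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Replace from right to left so earlier offsets stay valid
    for start, end in reversed(merged):
        text = text[:start] + placeholder + text[end:]
    return text, len(merged)

# Hand-written example with a fictional name; the token offsets imitate
# word-piece predictions that miss parts of each word.
sample = "Instructor: Ada Lovelace\nE-mail: ada@example.edu"
fake_tokens = [
    {"entity": "B-PER", "start": 12, "end": 14},  # "Ad"
    {"entity": "I-PER", "start": 16, "end": 19},  # "Lov"
    {"entity": "I-PER", "start": 19, "end": 22},  # "ela"
]
all_spans = merge_person_spans(fake_tokens, sample)
all_spans += [m.span() for m in EMAIL_RE.finditer(sample)]
redacted_sample, n_spans = redact(sample, all_spans)
print(redacted_sample)
print(n_spans)
```

With spans merged this way, the reported total counts names and addresses rather than tokens, so it would not be directly comparable to the 622 and 9 reported above.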


In [None]:
print(str(redacted))

Instructor: [REDACTED][REDACTED] [REDACTED]k
-
Professor, ECE
E-mail: kak@purdue.edu (You must place the string 'ece404' in the subject line to get past your instructor's notorious spam filter)
Graduate TAs:
Ami[REDACTED] [REDACTED]shyap
-
E-mail: kashyap9 (at purdue dot edu)
[REDACTED] [REDACTED]
-
E-mail: wang3450 (at purdue dot edu)
[REDACTED] [REDACTED]
-
E-mail: dubois6 (at purdue dot edu)
Lecture Location and Time
-
TuTh: 6:00 PM - 7:15 PM, PHYS 112
Course Description
-
Beyond question, computer and network security has emerged as one of
the most important subjects of study in modern times. Even the minutest
details of our lives now depend on our computers and networks working
with our trust that the information that is private to us will not fall
in the hands of those with ill intent. The two major components of
computer and network security are cryptography and what is known as
systems-oriented security. For a good education in computer and network
security, you have no choice 