# Trafilatura with PI protection
*ECE 570 Course Project*

Trafilatura is an open-source generic extraction tool for extracting meaningful content from html inputs. It was developed as a tool for building web corpora to train machine learning models. https://github.com/adbar/trafilatura

Trafilatura does not have any functionality to remove PII from its extractions. It is crucial to keep specific real PII out of corpora, so this project attempts to achieve that. The piiranha model for PII detection (https://huggingface.co/iiiorg/piiranha-v1-detect-personal-information) is used to redact PII from trafilatura's extractions before returning the extracted content.

Import libraries:

In [4]:
!pip install trafilatura
import trafilatura
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification



Load model:

In [5]:
tokenizer = AutoTokenizer.from_pretrained("iiiorg/piiranha-v1-detect-personal-information")
model = AutoModelForTokenClassification.from_pretrained("iiiorg/piiranha-v1-detect-personal-information")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

DebertaV2ForTokenClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(251000, 768, padding_idx=0)
      (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-11): 12 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=768, out_features=768, bias=True)
              (key_proj): Linear(in_features=768, out_features=768, bias=True)
              (value_proj): Linear(in_features=768, out_features=768, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=Tr

Obtain a url to perform an extraction from:

In [6]:
print("\nTrafilatura w/ PII Protection!\n")
url = input("Please enter a URL to extract content from: ")


Trafilatura w/ PII Protection!

Please enter a URL to extract content from: https://engineering.purdue.edu/ece404/


Perform the extraction:

In [7]:
# Use PIIranha to detect and redact PII
# if the parameter aggregate_redaction is true, PII is replaced with "[redacted]"
# if aggregate_redaction is false, PII is replaced with more detailed tags such as "[I-EMAIL]", "[I-SURNAME]", etc.
def mask_pii(text, aggregate_redaction=True):
    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get the model predictions
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted labels
    predictions = torch.argmax(outputs.logits, dim=-1)

    # Convert token predictions to word predictions
    encoded_inputs = tokenizer.encode_plus(text, return_offsets_mapping=True, add_special_tokens=True)
    offset_mapping = encoded_inputs['offset_mapping']

    masked_text = list(text)
    is_redacting = False
    redaction_start = 0
    current_pii_type = ''
    count = 0

    for i, (start, end) in enumerate(offset_mapping):
        if start == end:  # Special token
            continue

        label = predictions[0][i].item()
        if label != model.config.label2id['O']:  # Non-O label
            pii_type = model.config.id2label[label]
            if not is_redacting:
                is_redacting = True
                redaction_start = start
                current_pii_type = pii_type
            elif not aggregate_redaction and pii_type != current_pii_type:
                # End current redaction and start a new one
                apply_redaction(masked_text, redaction_start, start, current_pii_type, aggregate_redaction)
                redaction_start = start
                current_pii_type = pii_type
                count += 1
        else:
            if is_redacting:
                apply_redaction(masked_text, redaction_start, end, current_pii_type, aggregate_redaction)
                is_redacting = False
                count += 1

    # Handle case where PII is at the end of the text
    if is_redacting:
        apply_redaction(masked_text, redaction_start, len(masked_text), current_pii_type, aggregate_redaction)
        count += 1

    return ''.join(masked_text), count

# helper function for replacing the detected PII with tags
def apply_redaction(masked_text, start, end, pii_type, aggregate_redaction):
    for j in range(start, end):
        masked_text[j] = ''
    if aggregate_redaction:
        masked_text[start] = '[redacted]'
    else:
        masked_text[start] = f'[{pii_type}]'

downloaded_html = trafilatura.fetch_url(url)
if downloaded_html:
    # Extract main content from the downloaded HTML
    extracted_content = trafilatura.extract(downloaded_html, favor_recall=True)

    if extracted_content:
        filtered_content, pii_instance_count = mask_pii(extracted_content, aggregate_redaction=False)
        print(f"\nThere were a total of {pii_instance_count} redactions of PII made:\n")
        print(filtered_content)
    else:
        print("Content extraction failed.")
else:
    print("Failed to fetch content from the URL.")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



There were a total of 13 redactions of PII made.

Instructor:[I-GIVENNAME][I-SURNAME]
Professor, ECE
E-mail:[I-EMAIL]You must place the string 'ece404' in the subject line to get past your instructor's notorious spam filter)
Graduate TAs:[I-GIVENNAME][I-SURNAME]-
E-mail:[I-USERNAME][I-EMAIL][I-USERNAME]
Joseph Wang
-
E-mail:[I-USERNAME][I-EMAIL]
Adrien Dubois
-
E-mail:[I-USERNAME][I-EMAIL][I-USERNAME]
Lecture Location and Time
-
TuTh: 6:00 PM - 7:15 PM, PHYS 112
Course Description
-
Beyond question, computer and network security has emerged as one of
the most important subjects of study in modern times. Even the minutest
details of our lives now depend on our computers and networks working
with our trust that the information that is private to us will not fall
in the hands of those with ill intent. The two major components of
computer and network security are cryptography and what is known as
systems-oriented security. For a good education in computer and network
security, you have no