# Removing PII as part of Unstructured Data ETL

When building a RAG application, removing Personally Identifiable Information (PII) during data preprocessing can be crucial for privacy, compliance, and security reasons. In this notebook, we will walk through data preprocessing steps that process unstructured documents, chunk them, remove PII, generate embeddings and upload the results.

We'll use Unstructured's Serverless API via the `unstructured-ingest` library to handle document ingestion and partitioning, and the GLiNER model to identify and redact PII entities. Finally, we'll embed the cleaned chunks and prepare them for downstream RAG applications.

Unstructured focuses on ETL rather than PII removal, but we can still incorporate this step into data preprocessing.


## Step 1: Install required libraries

- [unstructured-ingest](https://github.com/Unstructured-IO/unstructured-ingest): Python package for data processing using Unstructured.
- [gliner](https://github.com/urchade/GLiNER): Python package for running zero-shot NER inference with the GLiNER model.

[GLiNER](https://huggingface.co/urchade/gliner_large-v2.1) is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which, despite their flexibility, are costly and too large for resource-constrained scenarios.
GLiNER is available on the Hugging Face Hub under an Apache 2.0 license.

In [None]:
!pip install -qU "unstructured-ingest[pdf, embed-huggingface]" gliner

## Step 2: Setup prerequisites

- **Set the Unstructured API key and URL**: Steps to obtain the API key and URL are [here](https://unstructured.io/api-key-hosted)

The cell below also sets the name of the embedding model. Feel free to use a different embedding model; this will not change the preprocessing steps that we want to illustrate here.



In [None]:
# Your Unstructured API key and URL
import os

os.environ["UNSTRUCTURED_API_KEY"] = ""
os.environ["UNSTRUCTURED_URL"] = ""
os.environ["EMBEDDING_MODEL_NAME"] = "BAAI/bge-base-en-v1.5"
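
Empty values for these variables typically only surface later as a failed API call, so it can help to check them up front. The following is a minimal sketch, not part of the Unstructured library: `missing_settings` and `REQUIRED` are hypothetical names introduced here for illustration.

```python
import os

# Hypothetical helper: names of the settings this notebook reads later on
REQUIRED = ["UNSTRUCTURED_API_KEY", "UNSTRUCTURED_URL", "EMBEDDING_MODEL_NAME"]

def missing_settings(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]

# Demonstration with an explicit mapping instead of the real environment
example_env = {
    "UNSTRUCTURED_API_KEY": "",
    "UNSTRUCTURED_URL": "https://example.invalid",
    "EMBEDDING_MODEL_NAME": "BAAI/bge-base-en-v1.5",
}
print(missing_settings(REQUIRED, example_env))
```

Calling `missing_settings(REQUIRED)` with no second argument checks the real environment; an empty list means all three values are set.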


In [None]:
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig

from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
)

from unstructured_ingest.v2.processes.connectors.local import LocalUploaderConfig

## Step 3: Locate some data with PII

For illustration purposes, we used an LLM to generate a few documents with fake personal information. You can find them here:

* [Fake CV](https://docs.google.com/document/d/1NrbiXLpsqTxmNX8ZTmFA6jHRPahq2sV5AeS6Yn19s5Y/edit?usp=drive_link)
* [Fake hotel reservation confirmation](https://docs.google.com/document/d/1YQQXRrdBFXJ0c7q-qqA5ti5ajtlVL3vJfvBYQxpccoo/edit?usp=drive_link)
* [Fake mortgage agreement](https://docs.google.com/document/d/1vUAQ74wCbZfwgrg0uMVraBlj2-yE_QpiAdobtfmHa_0/edit?usp=drive_link)


If you'd like to run the example with the exact same documents, download them and place them in the `/content/data` folder. Otherwise, feel free to use any other documents of the file types that Unstructured supports: `.eml`, `.html`, `.md`, `.msg`, `.rst`, `.rtf`, `.txt`, `.xml`, `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.heic`, `.csv`, `.doc`, `.docx`, `.epub`, `.odt`, `.pdf`, `.ppt`, `.pptx`, `.tsv`, `.xlsx`.
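
If you bring your own documents, a quick check of the input folder against the extension list above can catch unsupported files before the pipeline runs. This is a small sketch using only the standard library; `split_by_support` is a hypothetical helper, and the extension set is copied from the list in this notebook rather than queried from Unstructured.

```python
from pathlib import Path

# Extensions copied from the list above (not fetched from the library)
SUPPORTED_EXTENSIONS = {
    ".eml", ".html", ".md", ".msg", ".rst", ".rtf", ".txt", ".xml",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic", ".csv", ".doc",
    ".docx", ".epub", ".odt", ".pdf", ".ppt", ".pptx", ".tsv", ".xlsx",
}

def split_by_support(folder):
    """Return (supported, unsupported) file names found directly in `folder`."""
    supported, unsupported = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # subfolders are skipped; the indexer here is non-recursive
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path.name)
        else:
            unsupported.append(path.name)
    return supported, unsupported
```

For example, `split_by_support("/content/data")` returns two lists of file names, so anything in the second list can be moved or converted first.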

Set `WORK_DIR`: this is where the Unstructured ingest library will cache the results of each preprocessing step.

In [None]:
DOCUMENTS = "/content/data"
WORK_DIR = "/content/temp"
interm_results = "/content/intermediate_results"

## Step 4: Run initial data preprocessing

First, let's convert unstructured documents (`*.docx` in this case) into structured JSON.

We'll define the pipeline to partition and chunk the documents. Here, we're using the `fast` partitioning strategy, which is best suited for non-image documents that can be parsed with rule-based parsers.

**Note**: For the chunking step, in addition to the standard considerations of the embedding model context size and retrieval precision, you want to take into account that GLiNER's input is limited to 384 tokens, so if you want to have larger chunks, you may need to do additional splitting for GLiNER and merging back. In this example, the chunks are small enough to fit GLiNER's input restrictions, and to fit the embedding model limits.
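
If your chunks do exceed GLiNER's input limit, the "split and merge back" idea from the note can be sketched roughly as follows. This is an illustration under stated assumptions, not GLiNER's own mechanism: whitespace-separated words are used as a rough stand-in for model tokens, `max_words=300` is an arbitrary conservative value, and `detect` is a hypothetical callable that returns dicts with `start`, `end`, and `label` keys relative to the piece it receives.

```python
import re

def word_windows(text, max_words=300, overlap=30):
    """Yield (char_offset, piece) windows of at most `max_words` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = list(re.finditer(r"\S+", text))
    step = max_words - overlap
    for i in range(0, len(words), step):
        window = words[i:i + max_words]
        start, end = window[0].start(), window[-1].end()
        yield start, text[start:end]
        if i + max_words >= len(words):
            break  # the last window already reaches the end of the text

def detect_in_windows(text, detect, max_words=300, overlap=30):
    """Run `detect` per window and map entity offsets back to `text`."""
    found = set()
    for offset, piece in word_windows(text, max_words, overlap):
        for ent in detect(piece):
            # Shift window-relative offsets to positions in the full text;
            # the set removes duplicates found in overlapping windows
            found.add((ent["start"] + offset, ent["end"] + offset, ent["label"]))
    return sorted(found)
```

The overlap reduces the chance that an entity is cut in half at a window boundary, at the cost of scanning some words twice.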

In [None]:
Pipeline.from_configs(
    context=ProcessorConfig(
        verbose=True, tqdm=True, num_processes=5, work_dir=WORK_DIR
    ),
    indexer_config=LocalIndexerConfig(input_path=DOCUMENTS),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_URL"),
        strategy="fast",
    ),
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        chunk_max_characters=1000,
        chunk_overlap=150,
    ),
    uploader_config=LocalUploaderConfig(output_dir=interm_results),
).run()

2024-09-24 18:20:12,821 MainProcess INFO     created index with configs: {"input_path": "/content/data", "recursive": false}, connection configs: {"access_config": "**********"}
2024-09-24 18:20:12,831 MainProcess INFO     Created download with configs: {"download_dir": null}, connection configs: {"access_config": "**********"}
2024-09-24 18:20:12,840 MainProcess INFO     created partition with configs: {"strategy": "fast", "ocr_languages": null, "encoding": null, "additional_partition_args": null, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": "https://api.unstructuredapp.io/general/v0/general", "partition_by_api": true, "api_key": "*******", "hi_res_model_name": null}
2024-09-24 18:20:12,847 MainProcess INFO     created chunk with configs: {"chunking_strategy": "by_title", "chunking_endpoint": "https://api.unstructured.io/genera

Let's take a look at a single document chunk after partitioning and chunking:

In [None]:
import json

fake_mortgage_path = "/content/intermediate_results/Fake_Mortgage Agreement.docx.json"
with open(fake_mortgage_path, 'r') as file:
    fake_mortgage_elements = json.load(file)

In [None]:
print(fake_mortgage_elements[0]['text'])

Mortgage Agreement

This Mortgage Agreement ("Agreement") is entered into on this 24th day of September, 2024, by and between:

Borrower: Jane Doe
Address: 123 Elm Street, Springfield, IL 62704
SSN: 123-45-6789

Lender: First National Bank
Address: 456 Main Street, Springfield, IL 62701

Property: 123 Elm Street, Springfield, IL 62704

WHEREAS, Borrower has applied for a loan from Lender, and Lender has agreed to make a loan to Borrower upon the terms and conditions set forth in this Agreement;

NOW, THEREFORE, in consideration of the mutual covenants herein, the parties agree as follows:


As you can see, it contains a lot of PII: a person's name, a social security number, and addresses. Let's see how we can identify and redact this information with GLiNER.

## About GLiNER

Named Entity Recognition (NER) is a method used to locate named entities mentioned in text and classify them into predefined categories (person names, organizations, locations, time expressions, monetary values, etc.).
Traditional NER models are limited to the predefined set of entity types they were trained on. This means that if you want to extend the list of categories to identify, you have to retrain or fine-tune a model yourself on your own labeled data. For example, a traditional NER model trained to identify people, locations, and organizations wouldn't be able to identify product names or dates.


[GLiNER](https://arxiv.org/abs/2311.08526), on the other hand, is designed to identify any type of entity, regardless of whether it was explicitly included in the training data. This "open-type" capability stems from its use of natural language instructions and its ability to match entity type embeddings to text spans in a latent space.
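
To make the idea of matching entity type embeddings to text spans concrete, here is a toy sketch. It is not GLiNER's implementation: the two-dimensional vectors are made up by hand, whereas in the real model both the span and the label representations come from the transformer encoder. The sketch only illustrates the scoring step, assumed here to be a sigmoid over the dot product of a span vector and a label vector, followed by a threshold.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def match_spans(span_vectors, label_vectors, threshold=0.5):
    """Return (span, label, score) for every pair scoring at or above `threshold`."""
    matches = []
    for span, s_vec in span_vectors.items():
        for label, l_vec in label_vectors.items():
            score = sigmoid(sum(a * b for a, b in zip(s_vec, l_vec)))
            if score >= threshold:
                matches.append((span, label, score))
    return matches

# Hand-made toy vectors (assumptions for illustration only)
label_vectors = {"person's name": [2.0, -1.0], "address": [-1.0, 2.0]}
span_vectors = {
    "Jane Doe": [1.5, -0.5],
    "123 Elm Street": [-0.5, 1.5],
    "agreement": [-0.5, -0.5],
}

for span, label, score in match_spans(span_vectors, label_vectors):
    print(f"{span!r} -> {label} ({score:.2f})")
```

Because labels are just another input that gets embedded, adding a new category only means adding a new string to the label list, with no retraining.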

This means that you have to define your own PII categories that you want the model to identify in your text. Let's load the model and define some categories.


In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")

pii_categories = ["account number", "address", "city", "credit card number", "date of birth", "corporate email", "price",
                  "driver license number", "email", "person's name", "id card number", "password", "social security number",
                  "street", "telephone number", "username", "zipcode", "reference number", "booking number"]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



To remove the PII from a chunk of text we can define a simple function that will:

1. Pass the text chunk to the GLiNER model, which will return a list of identified PII entities with their text and start/end character positions.
2. Use those start/end positions to substitute the text with asterisk characters, effectively removing the PII.
3. For a prettier result, replace `******`-like substrings with `[REDACTED]`, and return cleaned up text.

In [None]:
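import re

def remove_pii_from_chunk(chunk_text):
    # Each entity is a dict with 'start'/'end' character offsets, 'text', 'label', and 'score'
    entities = model.predict_entities(chunk_text, pii_categories, flat_ner=False, threshold=0.3)

    # Mask each span with the same number of asterisks so the remaining offsets stay valid
    for entity in entities:
        mask = '*' * (entity['end'] - entity['start'])
        chunk_text = chunk_text[:entity['start']] + mask + chunk_text[entity['end']:]

    # Collapse each run of two or more asterisks into a single placeholder
    return re.sub(r'\*{2,}', '[REDACTED]', chunk_text)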

Let's see what the result will look like for the chunk that we saw earlier:

In [None]:
cleaned_mortgage_chunk = remove_pii_from_chunk(fake_mortgage_elements[0]['text'])
print(cleaned_mortgage_chunk)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Mortgage Agreement

This Mortgage Agreement ("Agreement") is entered into on this 24th day of September, 2024, by and between:

Borrower: [REDACTED]
Address: [REDACTED], [REDACTED]: [REDACTED]

Lender: First National Bank
Address: [REDACTED], [REDACTED], IL [REDACTED]

Property: [REDACTED], [REDACTED]

WHEREAS, Borrower has applied for a loan from Lender, and Lender has agreed to make a loan to Borrower upon the terms and conditions set forth in this Agreement;

NOW, THEREFORE, in consideration of the mutual covenants herein, the parties agree as follows:
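
In the output above, every redaction looks the same, so a reader cannot tell a removed name from a removed address. One possible variation, sketched here as an alternative rather than the notebook's method, works directly on entity spans and inserts a placeholder that carries the category. `redact_spans` is a hypothetical helper; it assumes entities are dicts with `start`, `end`, and `label` keys, which is the shape `predict_entities` returns.

```python
def redact_spans(text, entities, placeholder="[{label}]"):
    """Replace each entity span with a labeled placeholder."""
    # Merge overlapping or nested spans so no character is replaced twice
    merged = []
    for ent in sorted(entities, key=lambda e: (e["start"], -e["end"])):
        if merged and ent["start"] < merged[-1]["end"]:
            # Overlap: extend the previous span and keep its label
            merged[-1]["end"] = max(merged[-1]["end"], ent["end"])
        else:
            merged.append({"start": ent["start"], "end": ent["end"], "label": ent["label"]})

    # Replace from the end of the text so earlier offsets stay valid
    for ent in reversed(merged):
        tag = placeholder.format(label=ent["label"].upper())
        text = text[:ent["start"]] + tag + text[ent["end"]:]
    return text

# Demonstration with hand-written entities instead of model output
sample = "Borrower: Jane Doe\nSSN: 123-45-6789"
name_start = sample.index("Jane Doe")
ssn_start = sample.index("123-45-6789")
sample_entities = [
    {"start": name_start, "end": name_start + 8, "label": "person's name"},
    {"start": name_start, "end": name_start + 4, "label": "person's name"},  # nested span
    {"start": ssn_start, "end": ssn_start + 11, "label": "social security number"},
]
print(redact_spans(sample, sample_entities))
```

Keeping the category in the placeholder can help downstream retrieval, since a chunk still signals that it contained, say, an address, without exposing the value.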


## Step 5: Apply PII removal to all chunks

Now, let's apply the PII removal function to all chunks cached in the working directory. We'll simply read all the `*.json` outputs, clean each text chunk, and write the results back.

In [None]:
directory = f"{WORK_DIR}/chunk"

In [None]:
for filename in os.listdir(directory):
    if filename.endswith(".json"):
        file_path = os.path.join(directory, filename)
        print(f"Processing file {filename}")
        try:
            with open(file_path, "r") as file:
                data = json.load(file)

            for entry in data:
                entry["text"] = remove_pii_from_chunk(entry["text"])

            with open(file_path, "w") as file:
                json.dump(data, file, indent=2)

            print(f"Successfully updated {file_path} with custom metadata fields.")
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON in {file_path}: {e}")
        except IOError as e:
            print(f"Error reading from or writing to {file_path}: {e}")

Processing file 87d9bba0d931.json
Successfully updated /content/temp/chunk/87d9bba0d931.json with custom metadata fields.
Processing file fe83d53f4446.json
Successfully updated /content/temp/chunk/fe83d53f4446.json with custom metadata fields.
Processing file 6ab673436d44.json
Successfully updated /content/temp/chunk/6ab673436d44.json with custom metadata fields.
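
The loop above overwrites each cached file in place, so an interruption in the middle of a write could leave a truncated JSON file in the cache. A slightly more defensive variant, sketched below with the standard library only, writes to a temporary file in the same folder and then swaps it in with `os.replace`. `redact_chunk_file` is a hypothetical helper, and `redact` stands for any function that maps text to cleaned text, such as `remove_pii_from_chunk`.

```python
import json
import os
import tempfile

def redact_chunk_file(file_path, redact):
    """Apply `redact` to every entry's text and rewrite the file atomically.

    Returns the number of entries whose text changed.
    """
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    changed = 0
    for entry in data:
        original = entry.get("text", "")
        cleaned = redact(original)
        if cleaned != original:
            entry["text"] = cleaned
            changed += 1

    # Write next to the target so the final rename stays on one filesystem
    folder = os.path.dirname(os.path.abspath(file_path))
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(data, tmp, indent=2)
    os.replace(tmp_path, file_path)
    return changed
```

Returning the number of changed entries also gives a quick signal per file: a count of zero on a document you expect to contain PII is worth a second look.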


## Step 6: Proceed with the embedding step of the pipeline


At this point, the cached results for the chunking step have been updated.

If we use the same pipeline, but add an embedding step to it, the pipeline will pick up the (modified) cached results of the chunking step, and use them to proceed with the embedding step.

At this stage, you may also want to change the destination for the processing results to a vector database, such as AstraDB, MongoDB, or any other of the [supported destinations](https://docs.unstructured.io/api-reference/ingest/destination-connector/overview). Here we save the results into a new local directory.

In [None]:
final_results = "/content/final_results"
Pipeline.from_configs(
    context=ProcessorConfig(
        verbose=True, tqdm=True, num_processes=5, work_dir=WORK_DIR
    ),
    indexer_config=LocalIndexerConfig(input_path=DOCUMENTS),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_URL"),
        strategy="fast",
    ),
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        chunk_max_characters=1000,
        chunk_overlap=150,
    ),
    embedder_config=EmbedderConfig(
        embedding_provider="huggingface", # "langchain-huggingface" for ingest v<0.23
        embedding_model_name=os.getenv("EMBEDDING_MODEL_NAME"),
    ),

    uploader_config=LocalUploaderConfig(output_dir=final_results),

).run()


Overriding of current TracerProvider is not allowed
2024-09-24 18:22:52,549 MainProcess INFO     created index with configs: {"input_path": "/content/data", "recursive": false}, connection configs: {"access_config": "**********"}
2024-09-24 18:22:52,555 MainProcess INFO     Created download with configs: {"download_dir": null}, connection configs: {"access_config": "**********"}
2024-09-24 18:22:52,561 MainProcess INFO     created partition with configs: {"strategy": "fast", "ocr_languages": null, "encoding": null, "additional_partition_args": null, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": "https://api.unstructuredapp.io/general/v0/general", "partition_by_api": true, "api_key": "*******", "hi_res_model_name": null}
2024-09-24 18:22:52,564 MainProcess INFO     created chunk with configs: {"chunking_strategy": "by_title", "chu
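
After the run, it can be worth confirming that the embedding step attached a vector to every chunk. The sketch below assumes each element in the output JSON is a dict that stores its vector under an `embeddings` key, as suggested by the `fields_include` list in the log above; `embedding_summary` is a hypothetical helper.

```python
def embedding_summary(elements):
    """Summarize how many elements carry embeddings and of which sizes."""
    dimensions = {len(el["embeddings"]) for el in elements if el.get("embeddings")}
    missing = sum(1 for el in elements if not el.get("embeddings"))
    return {
        "elements": len(elements),
        "missing": missing,
        "dimensions": sorted(dimensions),
    }

# Demonstration with synthetic elements instead of real pipeline output
demo_elements = [
    {"text": "Dear [REDACTED],", "embeddings": [0.1, 0.2, 0.3]},
    {"text": "Room Type: Deluxe", "embeddings": [0.0, 0.5, 0.9]},
    {"text": "no vector here"},
]
print(embedding_summary(demo_elements))
```

On real output you would pass the list loaded with `json.load`; a non-zero `missing` count or more than one entry in `dimensions` suggests the embedding step did not run consistently.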

Once we've processed all of the documents, you'll find them in the `/content/final_results` directory, or in your vector database if you configured a different destination connector. Let's take a look at another processed document; this time we'll print out all of it:

In [None]:
hotel = "/content/final_results/Fake_Hotel_Reservation.docx.json"
with open(hotel, 'r') as file:
    cleaned_hotel_elements = json.load(file)

for element in cleaned_hotel_elements:
    print(element['text'])

Subject: Your Reservation Confirmation - Oceanview Grand Hotel

Dear [REDACTED],

Thank you for choosing Oceanview Grand Hotel! We're excited to welcome you. Below are the details of your upcoming stay:

Reservation Number: [REDACTED]
[REDACTED]: [REDACTED]
Check-in Date: Friday, October 5, 2024 after 3:00 PM
Check-out Date: Monday, October 8, 2024 by 11:00 AM
Number of Guests: 2 Adults, 1 Child

Room Type: Deluxe Oceanfront Suite
[REDACTED]: [REDACTED]
Total Amount: [REDACTED] (inclusive of taxes and fees)

Your Stay Includes:

Complimentary Wi-Fi

Daily breakfast for two

Access to the rooftop pool and lounge

Complimentary beach towels and chairs

Special Requests:

King-sized bed

Crib for infant
Important Information:

Check-in Instructions: Please bring a valid ID and the credit card used for booking.

Cancellation Policy: Free cancellation until 48 hours before check-in. A [REDACTED] applies if canceled after this period.

COVID-19 Safety Measures: Masks are optional for fully v

For an off-the-shelf model that requires zero fine-tuning and offers you vast flexibility in how you define PII categories, this is a great result!

By combining the Unstructured API with the GLiNER model, we can build a complete ETL pipeline that handles PII removal and prepares unstructured data for use in downstream applications. This approach redacts sensitive information while keeping the data valuable for your RAG application.
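
Since a zero-shot model can miss entities, a cheap rule-based pass over the cleaned text can act as a safety net for highly structured identifiers. The sketch below is an optional addition, not part of the pipeline above: the two regular expressions are deliberately simple assumptions (US-style SSNs and email-like strings) and `find_residual_pii` is a hypothetical helper.

```python
import re

# Simple illustrative patterns; extend or tighten them for your own data
RESIDUAL_PATTERNS = {
    "ssn-like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email-like": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def find_residual_pii(texts):
    """Return (index, pattern_name) for every text that still matches a pattern."""
    hits = []
    for index, text in enumerate(texts):
        for name, pattern in RESIDUAL_PATTERNS.items():
            if pattern.search(text):
                hits.append((index, name))
    return hits

# Demonstration on hand-written strings
demo_texts = [
    "Borrower: [REDACTED]",
    "SSN: 123-45-6789",
    "Contact: jane.doe@example.com",
]
print(find_residual_pii(demo_texts))
```

Running it over `[element["text"] for element in cleaned_hotel_elements]` would flag chunks that need a lower threshold, an extra category, or manual review.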