# Personally Identifiable Information (PII) Data Anonymization

Script by Aussie Frost.

This script removes Personally Identifiable Information (PII) from the free-text column of a given CSV file.

A project for [CAHOOTS](https://whitebirdclinic.org/cahoots/)

Started 4/15/2024

## Data

Source A: Self-generated test dataset (see 'data_generation.ipynb')\
Source B: A pre-cleaned source given by the org (yet to receive)

## Preliminary imports

These are all of the libraries the script uses. Each one must be installed before the script will run.

In [1]:
# import standard libraries
import numpy as np
import pandas as pd
import random
import regex as re
import csv

# import natural language processing libraries
import spacy

# load the spacy nlp model
# note: the model must be installed first, e.g. python -m spacy download en_core_web_sm
# (download en_core_web_trf instead if loading the transformer model)
nlp = spacy.load("en_core_web_sm") # en_core_web_sm or en_core_web_trf

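### Import predefined term list files
These are comma-separated term lists: an aggregated set of first names registered at least five times in the SSA database from years 1880-2022, a list of Lane County street names, and a list of states and their shorthand forms.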

In [2]:
# import ssafirstnames_list.txt file as ssafirstnames_list and sort asc
with open('data/resources/ssafirstnames_list/ssafirstnames_list.txt', 'r') as file:
    ssafirstnames_list = file.read().split(',')
ssafirstnames_list = np.sort(ssafirstnames_list)

# import lanestreets_list.txt file as lanestreets_list and sort asc
with open('data/resources/lanestreets_list/lanestreets_list.txt', 'r') as file:
    lanestreets_list = file.read().split(',')
lanestreets_list = np.sort(lanestreets_list)

# import states_list.txt file as states_list and sort asc
with open('data/resources/states_list/states_list.txt', 'r') as file:
    states_list = file.read().split(',')
states_list = np.sort(states_list)
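
Splitting on commas alone keeps any whitespace or trailing newline attached to a term, and a trailing comma produces an empty string. An empty term is risky later on, because an empty alternative in the Method 3 regex matches at every word boundary. The helper below is a minimal sketch of a more defensive loader; `load_term_list` and the demo file name are hypothetical and not part of the original script.

```python
import tempfile
from pathlib import Path

def load_term_list(path):
    """Read a comma-separated term file and return clean, sorted, unique terms."""
    raw = Path(path).read_text(encoding="utf-8")
    # strip whitespace/newlines around each term and drop empty entries
    terms = {term.strip() for term in raw.split(",")}
    terms.discard("")
    return sorted(terms)

# demo on a throwaway file (contents invented for illustration)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "states_demo.txt"
    demo_path.write_text("Oregon, OR,Washington,\n", encoding="utf-8")
    terms = load_term_list(demo_path)

print(terms)
```

The sketch returns a plain sorted list rather than a NumPy array; either works as input to `'|'.join(...)`.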

## Defining case narrative anonymizer script

This section contains a script for anonymizing a case narrative dataset.

### Method 1: RegEx String Replacement
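This method defines regular expression patterns for structured identifiers (phone numbers, addresses, websites, IP addresses, ZIP codes, dates, and other numbers), then replaces every match with a placeholder naming the type of information removed.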

In [3]:
# define regex patterns
phone_pattern = r"\(?\b(\d{3})\)?[-.\s]*(\d{3})[-.\s]*(\d{4})\b"
address_pattern = r"\b\d+\s(?:[A-Za-z]+\s)*(?:St|Street|Rd|Road|Ave|Avenue|Blvd|Boulevard|Pl|Place|Lane|Ln|Drive|Dr|Court|Ct|Terrace|Ter|Way)[,.\s]"
web_pattern = r'(https?:\/\/)?(?:www\.)?[a-zA-Z0-9\.-]+\.[a-zA-Z]{2,}(?:\/\S*)?'
ip_pattern = r"\b((?:\d{1,3}\.){3}\d{1,3}|([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|:([0-9a-fA-F]{1,4}:){1,7}|::(?:[0-9a-fA-F]{1,4}:){0,5}[0-9a-fA-F]{1,4})\b"
zip_pattern = r"\b\d{5,}\b"
date_pattern = r"\b(?:\d{1,2}(st|nd|rd|th)?\s?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|\d{1,2}/\d{1,2}/?\d{2,4}|\d{4}-\d{2}-\d{2})\b"
month_pattern = r"\b(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b"
num_pattern = r"\b\w*[\d]+\w*\b"
prefix_pattern = r"\b(?:Dr|Mr|Mrs|Ms|Miss|Sir|Madam)\b\.?"

def regex_remover(text, address_rem, date_rem, web_rem, ip_rem, zip_rem, phone_rem, num_rem):
    
    # apply RegEx pattern for each target feature
    text, address_rem = re.subn(address_pattern, "ADDRESS", text, flags=re.IGNORECASE)
    text, zip_rem = re.subn(zip_pattern, "ZIP", text, flags=re.IGNORECASE)
    text, date_rem = re.subn(date_pattern, "DATE", text, flags=re.IGNORECASE)
    # month names also count as dates: add to date_rem rather than overwrite it
    text, month_rem = re.subn(month_pattern, "DATE", text, flags=re.IGNORECASE)
    date_rem += month_rem
    text, web_rem = re.subn(web_pattern, "WEBSITE", text, flags=re.IGNORECASE)
    text, ip_rem = re.subn(ip_pattern, "IP", text, flags=re.IGNORECASE)
    text, phone_rem = re.subn(phone_pattern, "PHONE", text, flags=re.IGNORECASE)
    text, num_rem = re.subn(num_pattern, "NUMBER", text, flags=re.IGNORECASE)
    text, prefix_rem = re.subn(prefix_pattern, "PREFIX", text, flags=re.IGNORECASE)
    
    return text, address_rem, date_rem, web_rem, ip_rem, zip_rem, phone_rem, num_rem
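
As a small illustration of how each pattern is applied, the sketch below runs the phone pattern from this cell over an invented sample sentence. It uses the standard library `re` module (an assumption made to keep the example dependency-free; the script itself imports `regex as re`, which offers the same `subn` call). It also shows why the patterns are written as raw strings: without the `r` prefix, `\b` is a backspace character instead of a word boundary, so the pattern silently matches nothing.

```python
import re

# phone pattern copied from the cell above
phone_pattern = r"\(?\b(\d{3})\)?[-.\s]*(\d{3})[-.\s]*(\d{4})\b"

# invented sample text
sample = "Caller can be reached at 541-555-0199 or (541) 555 0123."

# subn returns the new text and the number of replacements made
cleaned, n_replaced = re.subn(phone_pattern, "PHONE", sample)
print(cleaned, n_replaced)

# raw string: \b is a word boundary, so "May" is found
_, raw_hits = re.subn(r"\bMay\b", "DATE", "Seen in May")
# plain string: \b is a backspace character, so nothing matches
_, plain_hits = re.subn("\bMay\b", "DATE", "Seen in May")
print(raw_hits, plain_hits)
```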

### Method 2: Natural Language Processing and Named Entity Recognition with spaCy
For NLP, I am using spaCy and the 'en_core_web_sm' pretrained model (see more [here](https://spacy.io/models/en#en_core_web_sm)).

In [4]:
def nlp_anonymize_text(text, name_rem, address_rem, date_rem):
    """ nlp_anonymize_text(text)
    
    - this function runs a spaCy NLP model over the text and replaces
    any target features (addresses, names, dates) that are found
    """
    
    # process the text with the NLP model
    doc = nlp(text)

    # replace each recognized entity with a placeholder for its type
    for ent in doc.ents:

        # first check for addresses
        if ent.label_ in ["GPE", "LOC", "FAC"]:
            text = text.replace(ent.text, "ADDRESS")
            address_rem += 1
        # then check for names
        if ent.label_ == "PERSON":
            text = text.replace(ent.text, "NAME")
            name_rem += 1
        # then check for dates
        if ent.label_ == "DATE":
            text = text.replace(ent.text, "DATE")
            date_rem += 1
            
    return text, name_rem, address_rem, date_rem
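
One thing to be aware of in the function above: `text.replace(ent.text, ...)` replaces every occurrence of the entity string, so a short entity such as a first name can also rewrite part of a longer one, while the counter only goes up by one. A possible alternative is to replace by character offsets instead. The sketch below is not the author's method; `replace_spans` is a hypothetical helper, and the spans are written by hand here, assuming that in practice they would be built from spaCy's `ent.start_char`, `ent.end_char`, and `ent.label_`.

```python
def replace_spans(text, spans, label_map):
    """Replace (start, end, label) character spans with placeholders.

    Spans are processed right to left so earlier offsets stay valid.
    Overlapping spans are not handled in this sketch.
    """
    counts = {}
    for start, end, label in sorted(spans, reverse=True):
        placeholder = label_map.get(label)
        if placeholder is None:
            continue  # label is not a target feature
        text = text[:start] + placeholder + text[end:]
        counts[placeholder] = counts.get(placeholder, 0) + 1
    return text, counts

label_map = {"PERSON": "NAME", "GPE": "ADDRESS", "LOC": "ADDRESS",
             "FAC": "ADDRESS", "DATE": "DATE"}

# invented sample text with hand-written spans standing in for doc.ents
sample = "Jordan met Jordan Smith in Eugene."
person_start = sample.index("Jordan Smith")
place_start = sample.index("Eugene")
spans = [
    (person_start, person_start + len("Jordan Smith"), "PERSON"),
    (place_start, place_start + len("Eugene"), "GPE"),
]

anonymized, counts = replace_spans(sample, spans, label_map)
print(anonymized, counts)
```

Only the spans passed in are replaced, so the first "Jordan" is left alone here; a name the model misses would still be caught later by Method 3.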

### Method 3: Predefined Term Replacement

The first part of this method uses an aggregated set of first names registered at least five times in the SSA database from years 1880 through 2022. We call this method last because it is the most destructive to the original text: any word that also appears in the name list is replaced, whether or not it is used as a name.

Then, this method uses a set of states and their shorthand forms such that states can be caught and replaced.

#### BETTER METHOD
A decision-tree approach to named entity recognition, such as [this one](https://github.com/vigviswa/Named-Entity-Recognition-Using-Decision-Trees), might speed up the process by avoiding a comparison against every single name in the list. This has not been tried yet.

In [5]:
def ssa_name_remover(text, name_rem):
    """
    Replaces names in the given text with 'NAME', handling names case-insensitively.
    Names are taken from the module-level ssafirstnames_list.
    
    Args:
    text (str): The input text that may contain names.
    name_rem (int): The running count of names replaced so far.

    Returns:
    str: The anonymized text.
    int: The updated count of names replaced.
    """
    
    # Prepare regex pattern for case-insensitive matching
    pattern = r'\b(' + '|'.join(map(re.escape, ssafirstnames_list)) + r')\b'
    regex = re.compile(pattern, re.IGNORECASE)
    
    # Function to replace and count each match
    def replace_func(match):
        nonlocal name_rem
        name_rem += 1
        return "NAME"
    
    # Replace occurrences of any names in the text using a regex substitution
    text_anonymized = regex.sub(replace_func, text)
    
    return text_anonymized, name_rem
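
The function above rebuilds and recompiles one very large alternation pattern on every call. A simpler speed-up than a decision tree may be to split the text into word tokens and test each token against a set, which is a constant-time lookup per token. The sketch below shows that idea with the standard library `re` module; `set_name_remover` and the tiny name set are hypothetical, and the matching is not identical to the `\b` version (for example, tokens here contain letters only).

```python
import re

# a word token is a run of ASCII letters in this sketch
WORD_RE = re.compile(r"[A-Za-z]+")

def set_name_remover(text, names_lower):
    """Replace tokens found in names_lower (a set of lower-case names) with 'NAME'."""
    count = 0

    def replace_func(match):
        nonlocal count
        token = match.group(0)
        if token.lower() in names_lower:
            count += 1
            return "NAME"
        return token  # leave non-name tokens unchanged

    return WORD_RE.sub(replace_func, text), count

# invented name set; in the script this would be built once from ssafirstnames_list
demo_names = frozenset({"alice", "bob"})
demo_text, demo_count = set_name_remover("Alice called about BOB and Bobby.", demo_names)
print(demo_text, demo_count)
```

If the regex approach is kept, compiling the pattern once at module level instead of inside the function would avoid most of the repeated cost.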

In [6]:
def lanecounty_streets_remover(text, address_rem):
    """
    Replaces street names in the given text with 'ADDRESS', handling them case-insensitively.
    Street names are taken from the module-level lanestreets_list.
    
    Args:
    text (str): The input text that may contain street names.
    address_rem (int): The running count of addresses replaced so far.

    Returns:
    str: The anonymized text.
    int: The updated count of addresses replaced.
    """
    
    # Prepare regex pattern for case-insensitive matching
    pattern = r'\b(' + '|'.join(map(re.escape, lanestreets_list)) + r')\b'
    regex = re.compile(pattern, re.IGNORECASE)
    
    # Function to replace and count each match
    def replace_func(match):
        nonlocal address_rem
        address_rem += 1
        return "ADDRESS"
    
    # Replace occurrences of any names in the text using a regex substitution
    text_anonymized = regex.sub(replace_func, text)
    
    return text_anonymized, address_rem

In [7]:
def state_remover(text, address_rem):
    
    # Prepare regex pattern for case-insensitive matching
    pattern = r'\b(' + '|'.join(map(re.escape, states_list)) + r')\b'
    regex = re.compile(pattern, re.IGNORECASE)
    
    # Replace occurrences of any state names or abbreviations in the text
    text_anonymized, count = regex.subn("ADDRESS", text)
    
    # add the number of replacements to the running address counter
    return text_anonymized, address_rem + count

In [8]:
def term_removal(case, name_rem, address_rem):
    
    # call predefined term removal functions
    case, name_rem = ssa_name_remover(case, name_rem)
    case, address_rem = lanecounty_streets_remover(case, address_rem)
    case, address_rem = state_remover(case, address_rem) 
    
    return case, name_rem, address_rem

### Define single case_anonymizer

In [9]:
# define data_anonymizer
def data_anonymizer(case):
    """ data_anonymizer(case)
    
    this function takes in a single case narrative and returns a
    dictionary holding the anonymized narrative, where any identifying
    information is replaced with a FEATURENAME representative of the
    type of information that was removed, alongside a count of
    removals for each feature type.
    """
    
    # initialize removal counters (suffix = method number that produced the count)
    name_rem2 = name_rem3 = phone_rem = web_rem = address_rem1 = address_rem2 = address_rem3 = ip_rem = zip_rem = date_rem1 = date_rem2 = num_rem = 0
    
    # METHOD 1-- RegEx replacement:
    case, address_rem1, date_rem1, web_rem, ip_rem, zip_rem, phone_rem, num_rem = regex_remover(
        case, address_rem1, date_rem1, web_rem, ip_rem, zip_rem, phone_rem, num_rem)
    
    # METHOD 2-- natural language processing:
    # use NLP to anonymize target features; these counters start at zero so
    # the RegEx counts are not added twice when the results are combined
    case, name_rem2, address_rem2, date_rem2 = nlp_anonymize_text(case, name_rem2, address_rem2, date_rem2)
    
    # METHOD 3-- predefined term replacement:
    # TODO: optionally run only below a name-count threshold (see ssa_names.ipynb)
    case, name_rem3, address_rem3 = term_removal(case, name_rem3, address_rem3)
    
    # combine results
    name_rem = name_rem2 + name_rem3
    address_rem = address_rem1 + address_rem2 + address_rem3
    date_rem = date_rem1 + date_rem2
        
    return {
        "call_transcription": case,
        "name_removed": name_rem,
        "phone_removed": phone_rem,
        "address_removed": address_rem,
        "web_removed": web_rem,
        "ip_removed": ip_rem,
        "zip_removed": zip_rem,
        "date_removed": date_rem,
        "num_removed": num_rem
        }

## Function to run anonymizer on complete dataset

In [10]:
def anonymize_narratives(in_datapath, out_datapath, target_colname, seperator=','):
    """ 
    anonymize_narratives(
        in_datapath: a path to your raw data,
        out_datapath: a path where the anonymized csv will be written,
        target_colname: name of column that contains text to be anonymized,
        seperator (optional): the delimiter that separates the in_data
    ):

    This function reads the raw data, runs our data anonymizer on the
    case narrative column, and writes the cleaned case narratives to an
    anonymized csv alongside counts of how many features of each type
    were removed.
    """
    
    # read main dataset into pd dataframe
    data = pd.read_csv(in_datapath, sep=seperator)
    
    # apply data_anonymizer to each row and collect the resulting dicts as a list
    print("anonymizer script running!")
    anonymized = data[target_colname].apply(data_anonymizer).to_list()
    print("anonymizer script finished!")

    # create our anonymized dataframe
    anonymized_cols = pd.DataFrame(anonymized)
    
    # drop old target column (transcript) from original data
    original_df = data.drop(columns=target_colname)
    
    # merge the two dataframes together
    anonymized_data = original_df.join(anonymized_cols)

    # finally, write data to csv
    anonymized_data.to_csv(out_datapath, index=False)
    print("anonymized csv created!")
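
The function above loads the whole CSV into memory with pandas. For very large files, the same flow (read a row, anonymize the target column, write the row plus its counts) could also be done one row at a time with the standard library `csv` module. The sketch below is an alternative under that assumption, not the script's method; `anonymize_rows` and `stub_anonymizer` are hypothetical, with the stub standing in for `data_anonymizer`.

```python
import csv
import io

def anonymize_rows(in_file, out_file, target_colname, anonymizer):
    """Stream rows from in_file to out_file, replacing the target column
    with the keys returned by anonymizer (a dict per narrative)."""
    reader = csv.DictReader(in_file)
    writer = None
    for row in reader:
        result = anonymizer(row.pop(target_colname))
        row.update(result)
        if writer is None:
            # header is taken from the first processed row
            writer = csv.DictWriter(out_file, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)

def stub_anonymizer(text):
    # hypothetical stand-in for data_anonymizer, for demonstration only
    return {
        "call_transcription": text.replace("Alice", "NAME"),
        "name_removed": text.count("Alice"),
    }

# in-memory files with invented data
source = io.StringIO("id,call_transcription\n1,Alice called twice\n2,No names here\n")
target = io.StringIO()
anonymize_rows(source, target, "call_transcription", stub_anonymizer)
output_lines = target.getvalue().splitlines()
print(output_lines)
```

With real files, both would be opened with `newline=''` as the `csv` module documentation recommends.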

## Deploying the case anonymizer

Deploying the case anonymizer is as easy as one line of code.

In [11]:
# anonymize the narratives
anonymize_narratives("../data/call_data.csv", "../output/call_data_anonymized.csv", "call_transcription")

anonymizer script running!
anonymizer script finished!
anonymized csv created!
