# Text Anonymization - Ulises Bértolo

Configuration for the NLP engine: spaCy with the Spanish `es_core_news_md` model.

In [1]:
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "es", "model_name": "es_core_news_md"}],
}

### Analyzer (NER)

Create the Presidio analyzer.

In [2]:
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# the languages are needed to load country-specific recognizers 
# for finding phones, passport numbers, etc.
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["es"])

In [3]:
example_text = "Hola. Mi nombre es Ulises, mi número de teléfono es +34654321098. Nací en Santiago de Compostela, Galicia el 23/11/2000. El día de ayer comí patatas. Puedes preguntarme en ulises@inetum.com. IBAN de ejemplo ES6830042979106774153973."

results = analyzer.analyze(text=example_text,
                           language='es')
for res in results:
    print(res)


type: EMAIL_ADDRESS, start: 172, end: 189, score: 1.0
type: IBAN_CODE, start: 207, end: 231, score: 1.0
type: PERSON, start: 19, end: 25, score: 0.85
type: LOCATION, start: 74, end: 96, score: 0.85
type: LOCATION, start: 98, end: 105, score: 0.85
type: DATE_TIME, start: 109, end: 119, score: 0.6
type: URL, start: 179, end: 189, score: 0.5
type: PHONE_NUMBER, start: 52, end: 64, score: 0.4


In [4]:
import random
from termcolor import colored

def rand_col(text):
    """Wrap text in a randomly chosen terminal color."""
    return colored(text, color=random.choice(['red', 'green', 'yellow', 'blue', 'magenta']))

# Sort a copy of the results by start offset so the text can be printed left to right
results_sorted = sorted(results, key=lambda x: x.start)

current_idx = 0
for current_entity in results_sorted:
    entity_text = example_text[current_entity.start: current_entity.end]
    if current_idx <= current_entity.start:
        # Print the plain text before the entity, then the highlighted entity
        print(example_text[current_idx: current_entity.start], end='')
        print(rand_col(f'{entity_text} [{current_entity.entity_type}]'), end='')
    else:
        # The entity starts inside a span that was already printed
        print(rand_col(f'{entity_text} [{current_entity.entity_type} collision]'), end='')

    # Never move backwards when an overlapping entity ends earlier
    current_idx = max(current_idx, current_entity.end)

# Print the remaining text after the last entity
print(example_text[current_idx:])

Hola. Mi nombre es Ulises [PERSON], mi número de teléfono es +34654321098 [PHONE_NUMBER]. Nací en Santiago de Compostela [LOCATION], Galicia [LOCATION] el 23/11/2000 [DATE_TIME]. El día de ayer comí patatas. Puedes preguntarme en ulises@inetum.com [EMAIL_ADDRESS]inetum.com [URL collision]. IBAN de ejemplo ES6830042979106774153973 [IBAN_CODE].
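
The collision above comes from two results covering the same characters: the URL `inetum.com` (179 to 189) lies inside the e-mail address (172 to 189). The sketch below shows one simple way such overlaps could be resolved before further processing, keeping the higher-scoring span. `Span` and `drop_overlaps` are hypothetical names used only for illustration; this is not Presidio's own conflict-resolution logic.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for an analyzer result."""
    entity_type: str
    start: int
    end: int
    score: float


def drop_overlaps(spans: list[Span]) -> list[Span]:
    """Keep the best-scoring spans and drop any span overlapping one already kept."""
    kept: list[Span] = []
    # Visit the strongest candidates first: higher score, then longer span.
    for span in sorted(spans, key=lambda s: (-s.score, -(s.end - s.start))):
        overlaps = any(span.start < k.end and k.start < span.end for k in kept)
        if not overlaps:
            kept.append(span)
    # Return the survivors in reading order.
    return sorted(kept, key=lambda s: s.start)


spans = [
    Span("EMAIL_ADDRESS", 172, 189, 1.0),
    Span("URL", 179, 189, 0.5),
    Span("PERSON", 19, 25, 0.85),
]
resolved = drop_overlaps(spans)
print([s.entity_type for s in resolved])  # ['PERSON', 'EMAIL_ADDRESS']
```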


### Anonymization

Now we import the anonymizer and anonymize the text.

In [5]:
from presidio_anonymizer import AnonymizerEngine

anonymizer = AnonymizerEngine()

In [6]:
anonymized_text = anonymizer.anonymize(text=example_text, analyzer_results=results).text

print(anonymized_text)

Hola. Mi nombre es <PERSON>, mi número de teléfono es <PHONE_NUMBER>. Nací en <LOCATION>, <LOCATION> el <DATE_TIME>. El día de ayer comí patatas. Puedes preguntarme en <EMAIL_ADDRESS>. IBAN de ejemplo <IBAN_CODE>.
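
To see why character offsets from the analyzer are enough to produce this output, here is a minimal sketch of span replacement. It assumes non-overlapping spans and uses a hypothetical helper, `replace_spans`; it illustrates the idea only and is not how `AnonymizerEngine` is implemented.

```python
def replace_spans(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, entity_type) span with an <ENTITY_TYPE> placeholder."""
    # Go from the last span to the first so that replacing one span
    # does not shift the offsets of the spans still to be processed.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


sample = "Mi nombre es Ulises y vivo en Galicia."
sample_spans = [(13, 19, "PERSON"), (30, 37, "LOCATION")]
print(replace_spans(sample, sample_spans))
# Mi nombre es <PERSON> y vivo en <LOCATION>.
```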


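Operators control how each entity type is transformed. Here names are replaced with a fixed value, locations are partially masked, and every other entity is redacted.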

In [7]:
from presidio_anonymizer.entities.engine import OperatorConfig

operators={"PERSON": OperatorConfig(operator_name="replace", 
                                    params={"new_value": "REPLACED_NAME"}),
           "LOCATION": OperatorConfig(operator_name="mask", 
                                      params={'chars_to_mask': 10, 
                                              'masking_char': '*',
                                              'from_end': True}),
           "DEFAULT": OperatorConfig(operator_name="redact")}

anonymized_text = anonymizer.anonymize(text=example_text, 
                                       analyzer_results=results,
                                       operators=operators).text

print(anonymized_text)

Hola. Mi nombre es REPLACED_NAME, mi número de teléfono es . Nací en Santiago de **********, ******* el . El día de ayer comí patatas. Puedes preguntarme en . IBAN de ejemplo .
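
The masking behaviour seen in this output can be reproduced with a few lines of plain Python. The function below is a sketch that mirrors the `chars_to_mask`, `masking_char` and `from_end` parameters as they behave here; `mask` is a hypothetical helper, not the Presidio operator itself.

```python
def mask(text: str, chars_to_mask: int, masking_char: str = "*", from_end: bool = False) -> str:
    """Mask up to chars_to_mask characters at the start or the end of text."""
    # Never mask more characters than the text contains.
    n = max(0, min(chars_to_mask, len(text)))
    if from_end:
        return text[:len(text) - n] + masking_char * n
    return masking_char * n + text[n:]


print(mask("Santiago de Compostela", 10, "*", from_end=True))  # Santiago de **********
print(mask("Galicia", 10, "*", from_end=True))                 # *******
print(mask("ES6830042979106774153973", 20))                    # ********************3973
```

"Galicia" is shorter than `chars_to_mask`, so the whole value is masked.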


In [8]:
operators={"PERSON": OperatorConfig(operator_name="custom", 
                                    params={"lambda": lambda x: random.choice(['Juan', 'Rodrigo'])}),
           "DEFAULT": OperatorConfig(operator_name="custom", params={"lambda": lambda x: 'X'})}

anonymized_text = anonymizer.anonymize(text=example_text, 
                                       analyzer_results=results,
                                       operators=operators).text

print(anonymized_text)

Hola. Mi nombre es Rodrigo, mi número de teléfono es X. Nací en X, X el X. El día de ayer comí patatas. Puedes preguntarme en X. IBAN de ejemplo X.


### Faker

Create fake data to avoid losing context.

In [9]:
from faker import Faker
fake = Faker(locale=['es_ES'])

print('random name:', fake.name())
print('random address:', fake.address())
print('random phone number:', fake.phone_number())

random name: Nando Juanito Guerra Dalmau
random address: Rambla de Ciriaco Ribera 4 Puerta 5 
Málaga, 32907
random phone number: +34 710 25 18 63


Operators that generate fake data:

In [10]:
fake_operators = {
    "PERSON": OperatorConfig("custom", {"lambda": lambda x: fake.name()}),
    "PHONE_NUMBER": OperatorConfig("custom", {"lambda": lambda x: fake.phone_number()}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.email()}),
    "LOCATION": OperatorConfig("replace", {"new_value": "España"}),
    "DATE_TIME": OperatorConfig("custom", {"lambda": lambda x: fake.date()}),
    "DEFAULT": OperatorConfig(operator_name="mask", 
                              params={'chars_to_mask': 20, 
                                      'masking_char': '*',
                                      'from_end': False}),
}

In [11]:
anonymized_text = anonymizer.anonymize(text=example_text,
                                       analyzer_results=results,
                                       operators=fake_operators
                                       ).text
print(anonymized_text)

Hola. Mi nombre es Elodia Hernando-Company, mi número de teléfono es +34601792713. Nací en España, España el 1984-11-14. El día de ayer comí patatas. Puedes preguntarme en pavoneloy@example.org. IBAN de ejemplo ********************3973.
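
Because each lambda calls Faker afresh, the same original value can receive a different replacement every time it appears. If keeping references consistent matters, one option is to cache the replacement per original value. The sketch below uses a hypothetical wrapper, `consistent`, and a counter-based stand-in generator so that it runs without Faker; in this notebook one could presumably pass `fake.name` instead.

```python
from itertools import count
from typing import Callable


def consistent(generator: Callable[[], str]) -> Callable[[str], str]:
    """Wrap a generator so that each original value always maps to the same fake value."""
    cache: dict[str, str] = {}

    def operator(original: str) -> str:
        # Generate a replacement only the first time a value is seen.
        if original not in cache:
            cache[original] = generator()
        return cache[original]

    return operator


# Stand-in generator (an assumption for this sketch): "Person 1", "Person 2", ...
counter = count(1)
fake_person = consistent(lambda: f"Person {next(counter)}")

print(fake_person("Ulises"))  # Person 1
print(fake_person("Huma"))    # Person 2
print(fake_person("Ulises"))  # Person 1
```

The returned function takes the original text as its only argument, which is the same shape as the lambdas passed to the `custom` operator above.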


## Dataset example

In [12]:
import pandas as pd

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_md"}],
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

In [13]:
fake = Faker(locale=['en_US', 'en_GB', 'en_CA', 'fr_FR'])
fake_operators = {
    "PERSON": OperatorConfig("custom", {"lambda": lambda x: fake.name()}),
    "PHONE_NUMBER": OperatorConfig("custom", {"lambda": lambda x: fake.phone_number()}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.email()}),
    "LOCATION": OperatorConfig("replace", {"new_value": "USA"}),
    "DEFAULT": OperatorConfig(operator_name="replace"),
}

In [14]:
# Set up the engines
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["en"])
anonymizer = AnonymizerEngine()


def anonymize_text(text: str) -> str:
    # Call analyzer to get results
    results = analyzer.analyze(text=text,
                               language='en')

    # Analyzer results are passed to the AnonymizerEngine for anonymization
    anonymized_text = anonymizer.anonymize(text=text,
                                           analyzer_results=results,
                                           operators=fake_operators)
    return anonymized_text.text


def anonymize_series(s: pd.Series) -> pd.Series:
    return s.apply(anonymize_text)

In [15]:
df = pd.read_csv('data/Emails.csv')
df.head()

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\nU.S. Department of State\nCase N...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",UNCLASSIFIED\nU.S. Department of State\nCase N...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\nU.S. Department of State\nCase N...
3,4,C05739550,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550...,F-2015-04841,...,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,C05739550,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\nU.S. Department of State\nCase N...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\nFriday, March 11,...",B6\nUNCLASSIFIED\nU.S. Department of State\nCa...


In [16]:
df.shape

(7945, 22)

In [17]:
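# Drop the last 5000 rows and keep only emails that have body text
df.drop(df.tail(5000).index, inplace=True)
df = df[df['ExtractedBodyText'].notna()].copy()  # copy to avoid SettingWithCopyWarning

df['anonymized_text'] = anonymize_series(df['ExtractedBodyText'])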

In [18]:
pd.set_option('max_colwidth', 300)
df[['ExtractedBodyText', 'anonymized_text']]

Unnamed: 0,ExtractedBodyText,anonymized_text
1,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest How Syria is aiding Qaddafi and more... Sid\nhrc memo syria aiding libya 030311.docx; hrc memo syria aiding libya 030311.docx\nMarch 3, 2011\nFor: Hillary",<US_DRIVER_LICENSE>\n<DATE_TIME> <DATE_TIME>\nH: Latest How USA is aiding USA and more... Sid\nBrian Howard memo USA aiding USA <URL>cx; Robert Williamson memo USA aiding USA <DATE_TIME>For: Dr Gemma Webb
2,Thx,Thx
4,"H <hrod17@clintonemail.com>\nFriday, March 11, 2011 1:36 PM\nHuma Abedin\nFw: H: Latest: How Syria is aiding Qaddafi and more... Sid\nhrc memo syria aiding libya 030311.docx\nPis print.",H <rmcdonald@example.net>\n<DATE_TIME> <DATE_TIME>\nGregory Walsh\nFw: H: Latest: How USA is aiding USA and more... Sid\nMichael Butler memo USA aiding USA <URL>cx\nPis print.
5,"Pis print.\n-•-...-^\nH < hrod17@clintonernailcom>\nWednesday, September 12, 2012 2:11 PM\n°Russorv@state.gov'\nFw: Meet The Right-Wing Extremist Behind Anti-fvluslim Film That Sparked Deadly Riots\nFrom [meat)\nSent: Wednesday, September 12, 2012 01:00 PM\nTo: 11\nSubject: Meet The Right Wing E...",Debra Vance print.\n-•-...-^\nH < hrod17@clintonernailcom>\n<DATE_TIME> <DATE_TIME>\n°jessicamorris@example.org'\nFw: Meet The Right-Wing Extremist Behind Anti-fvluslim Film That Sparked Deadly Riots\nFrom [meat)\nSent: <DATE_TIME> <DATE_TIME>\nTo: 11\nSubject: Meet The Right Wing Extremist Behi...
7,"H <hrod17@clintonemail.corn>\nFriday, March 11, 2011 1:36 PM\nHuma Abedin\nFw: H: Latest: How Syria is aiding Qaddafi and more... Sid\nhrc memo Syria aiding libya 030311.docx\nPis print.",H <hrod17@<URL>rn>\n<DATE_TIME> <DATE_TIME>\nCamille Morvan\nFw: H: Latest: How USA is aiding USA and more... Sid\nBenjamin Lebon memo USA aiding USA <URL>cx\nPis print.
...,...,...
2939,FYI,FYI
2940,"I will be out of the office until Tuesday, October 13 with limited access to e-mail. If you need immediate assistance,\nplease contact Pat Grimes at 202-647-9022.\nAndrew J. Shapiro\nAssistant Secretary for Political-Military Affairs Department of State\n2201 C Street, NW\nWashington, DC 20520","I will be out of the office until <DATE_TIME> with limited access to e-mail. If you need immediate assistance,\nplease contact Joseph Cole at +44151 4960223.\nParker Richard\nAssistant Secretary for Political-Military Affairs Department of State\n2201 C Street, NW\nUSA, USA 20520"
2942,"sorry - just seeing b/c on my but have logged a call to him.\nOn Sat, Oct 9, 2010 at 7:18 AM, H <HDR22@clintonemail.com> wrote:\nCould you try to contact Mark Penn to confirm he no longer represents Thaksin before I call Thai PM at 9:30?\nThx.","sorry - just seeing b/c on my but have logged a call to him.\nOn Sat, <DATE_TIME> at <DATE_TIME>, H <christinedias@example.org> wrote:\nCould you try to contact William Howe to confirm he no longer represents USA before I call Thai PM at <DATE_TIME>?\nThx."
2943,fyi,fyi


## References

https://medium.com/@olegolego1997/text-anonymization-with-presidio-and-faker-be251f36d5bf

https://spacy.io/usage/models