# Illustration of masking functions

[Andrew Wheeler, PhD](mailto:apwheele@gmail.com)

This notebook uses pre-trained models to do *Named Entity Recognition* (NER). In particular, it uses a model that was built to mask [private medical record data](https://github.com/MIDRC/Stanford_Penn_MIDRC_Deidentifier). So it has many similarities to the task here, including masking names, geographic data, and personally identifying information (like social security numbers).

It has a second layer that does fuzzy name linking to entities across the entire data set. So if row 1 has 'Andy Wheeler' and row 3 has 'Andrew Wheeler', they will be masked to the same final replacement token.
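
To make the fuzzy linking idea concrete, here is a minimal sketch using only the standard library. It is not the code in `src.masking`: the helper name `link_names`, the 0.8 similarity threshold, and the greedy first-match strategy are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def link_names(names, threshold=0.8):
    """Hypothetical helper: give similar-looking names the same token."""
    representatives = []  # one normalized name per token, in order of first appearance
    mapping = {}
    for name in names:
        norm = " ".join(name.lower().split())  # lowercase and collapse whitespace
        for i, rep in enumerate(representatives):
            # Greedy: reuse the first existing token whose name is similar enough
            if SequenceMatcher(None, norm, rep).ratio() >= threshold:
                mapping[name] = f"PersonName{i + 1}"
                break
        else:
            # No close match found, so this name starts a new token
            representatives.append(norm)
            mapping[name] = f"PersonName{len(representatives)}"
    return mapping

tokens = link_names(["Andy Wheeler", "Andrew Wheeler", "Joe Schmo", "andy wheeler"])
print(tokens)
```

In this sketch the three spellings of Wheeler end up with one shared token, while 'Joe Schmo' gets its own. A real linker would likely need a more careful similarity measure and threshold, since a greedy character-level match can also merge two different people with similar names.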

I have written the code so you can pipe in different pre-trained NER models, and it will still do the masking. You may also consider *training your own* model in the future; I can help accommodate that as well. (There are other paid models [you can use as well](https://nlp.johnsnowlabs.com/2022/08/31/finpipe_deid_en.html), but I believe it would be better to train your own model at that point than to pay a continual fee.)

In [1]:
import pandas as pd
from src.masking import mask_dataframe # local functions; the first run downloads the model, so it may take a while

# Illustrating with a simple dataframe of text
t1 = "Andy Wheeler is a birder 190682540 where I live 100 Main St Kansas with Joe Schmo and andy wheeler"
t2 = "Scott Jacques is an interesting fellow, his check number 18887623597 is a good one."
t3 = "lol what a noob, Atlanta GA is on fire, email me qwerty@gmail.com your stats"
t4 = "so what, andrew wheeler @ 100 main st kansas is not so bad"
t5 = "pics or it didnt happen 999-887-6666"
text_li = [t1,t2,t3,t4,t5]
ids = [1,2,3,4,5] # named ids to avoid shadowing the built-in id()

test_df = pd.DataFrame(zip(ids,text_li),columns=['ID','Text'], index=['a','b','c','d','e'])
test_df

| | ID | Text |
|---|---|---|
| a | 1 | Andy Wheeler is a birder 190682540 where I liv... |
| b | 2 | Scott Jacques is an interesting fellow, his ch... |
| c | 3 | lol what a noob, Atlanta GA is on fire, email ... |
| d | 4 | so what, andrew wheeler @ 100 main st kansas i... |
| e | 5 | pics or it didnt happen 999-887-6666 |


The function produces an output that identifies tokens in the input belonging to the following entity types:

 - PersonName, e.g. a textual name for a person
 - IdentNumber, e.g. an SSN or another potentially sensitive number
 - Contact, e.g. a phone number or email
 - Date, a date field
 - Geo, a location, address, or building
 - Web, a website location
 
By default I *only* mask PersonName, IdentNumber, Contact, and Geo. You can mask all six entity types if you wish, though. Also note that if you are interested in training your own model, you can identify *more* entity types (not just for masking).
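
As a rough illustration of how masking only selected entity types can work, the sketch below replaces character spans in a string. It is an assumption-laden toy, not the implementation behind `mask_dataframe`: the helper `mask_spans`, its `mask_types` parameter, the `start`/`end` keys, and the example sentence with an added date are all made up for this example, and it inserts the plain type label rather than a numbered token such as `Contact1`.

```python
DEFAULT_TYPES = ("PersonName", "IdentNumber", "Contact", "Geo")

def mask_spans(text, entities, mask_types=DEFAULT_TYPES):
    """Hypothetical helper: replace spans of the selected entity types with their label."""
    selected = [e for e in entities if e["entity_group"] in mask_types]
    # Replace from right to left so earlier character offsets stay valid
    for ent in sorted(selected, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + ent["entity_group"] + text[ent["end"]:]
    return text

# Made-up input; offsets are computed here rather than predicted by a model
text = "pics or it didnt happen 999-887-6666 on 2020-01-01"
phone, date = "999-887-6666", "2020-01-01"
entities = [
    {"entity_group": "Contact", "start": text.index(phone), "end": text.index(phone) + len(phone)},
    {"entity_group": "Date", "start": text.index(date), "end": text.index(date) + len(date)},
]

masked_default = mask_spans(text, entities)  # Date is left alone
masked_all = mask_spans(text, entities, mask_types=DEFAULT_TYPES + ("Date", "Web"))
print(masked_default)
print(masked_all)
```

With the default types the date survives in the output, and adding `Date` to the selection masks it as well.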

In [2]:
res = mask_dataframe(test_df,'Text') # pass in dataframe, and the field that has the text
res

| | Text | Contact | Geo | IdentNumber | PersonName |
|---|---|---|---|---|---|
| a | PersonName2 is a birder IdentNumber2 where I l... | [] | [{'entity_group': 'Geo1', 'score': 0.972325563... | [{'entity_group': 'IdentNumber2', 'score': 0.9... | [{'entity_group': 'PersonName2', 'score': 0.99... |
| b | PersonName5 is an interesting fellow, his chec... | [] | [] | [{'entity_group': 'IdentNumber1', 'score': 0.9... | [{'entity_group': 'PersonName5', 'score': 0.99... |
| c | lol what PersonName1, Geo2 is on fire, email m... | [] | [{'entity_group': 'Geo2', 'score': 0.979433655... | [] | [{'entity_group': 'PersonName1', 'score': 0.91... |
| d | so what, PersonName2 @ Geo1 is not so bad | [] | [{'entity_group': 'Geo1', 'score': 0.995676457... | [] | [{'entity_group': 'PersonName2', 'score': 0.99... |
| e | pics or it didnt happen Contact1 | [{'entity_group': 'Contact1', 'score': 0.99377... | [] | [] | [] |


In [3]:
for text in res['Text']:
    print(text)

PersonName2 is a birder IdentNumber2 where I live Geo1 with PersonName3 and PersonName2
PersonName5 is an interesting fellow, his check number IdentNumber1 is a good one.
lol what PersonName1, Geo2 is on fire, email me PersonName4@gmail.com your stats
so what, PersonName2 @ Geo1 is not so bad
pics or it didnt happen Contact1
