# PII Data NER

## Data
An Excel file containing these sheets:
- Export Summary
- PII Train Large Data - PII Trai (800 rows)
- PII Test Data - PII Test Data (16000 rows)

## Initial observations / assumptions
- The training data is very limited, which makes the exercise a good candidate for transfer learning or, if required, semi-supervised learning with data augmentation.
- A sentence can have more than one label, which makes this a multilabel classification problem.
- Certain entities, such as emails, SSNs, and credit card numbers, follow well-known patterns. This makes them easier to detect with non-stochastic, rule-based methods.
- Other entities, such as names and addresses, have more pattern variance and might require a more machine-learning-centric approach.
- On the other hand, the test dataset requires only one label and one PII value per sentence, which forces a one-vs-all simplification of the classification problem.
- Additionally, the instructions specifically ask for a machine learning model, so it's best to skip regex-like rule-based approaches: they would be trivial solutions and would not serve the intent of the exercise.
- The problem does not constrain compute resources or inference time, so it is safe to optimize for accuracy / F1 score.
- The solution should specifically address entity ambiguity with respect to overlapping classes (polysemous words) and isolate just the PII, so contextual disambiguation needs to be implemented and detailed.
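The one-label-per-sentence simplification noted above can be sketched as a small reduction step. This is only an illustration: the `(text, label, confidence)` tuple layout, the function name `pick_single_prediction`, and the "highest confidence wins" rule are my assumptions, not something the exercise prescribes.

```python
from typing import List, Tuple

# Hypothetical representation of a predicted span: (text, label, confidence).
Span = Tuple[str, str, float]


def pick_single_prediction(spans: List[Span]) -> Tuple[str, str]:
    """Reduce multiple predicted spans to a single (label, PII) pair.

    Assumption: the span with the highest confidence wins, and a sentence
    with no predicted span is reported as label "None" with an empty PII.
    """
    if not spans:
        return "None", ""
    text, label, _ = max(spans, key=lambda s: s[2])
    return label, text


spans = [("Shields", "Name", 0.72), ("427 68 3081", "SSN", 0.99)]
print(pick_single_prediction(spans))
print(pick_single_prediction([]))
```

Other tie-breaking rules (for example, preferring the longest span) would be equally defensible; confidence is just the simplest signal to start with.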

## Solution Space
- I'd treat this problem as a token sequence classification exercise, where a token is the unit that makes up the sequence.
- Since information in text is contiguous with respect to token boundaries, it is safe to consider variable token definitions: a token could be a character, a subword, or a word.
- Standard approaches for named entity recognition come in handy. The best expression of the training data would be the CoNLL format, including part-of-speech analysis followed by IOB tagging, using either RNN models or similar approaches.
- Since the exercise is time-bound, as a simplification it is safe to start with off-the-shelf transformer models and then optimize iteratively.
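To make the IOB idea concrete, here is a minimal sketch of turning one `(Text, Labels, PII)` row into CoNLL-style `token tag` lines. It assumes plain whitespace tokenization and that the PII value appears as a contiguous run of whitespace tokens; the function name `bio_tag` is hypothetical and this is not the `preprocess.CoNLLConverter` used later in the notebook.

```python
from typing import List


def bio_tag(text: str, label: str, pii: str) -> List[str]:
    """Return CoNLL-style 'token tag' lines for one sentence.

    Assumptions: whitespace tokenization, and the PII value appears as a
    contiguous run of whitespace tokens in the text.
    """
    tokens = text.split()
    pii_tokens = pii.split()
    tags = ["O"] * len(tokens)
    n = len(pii_tokens)
    if n:
        # Find the first position where the PII tokens match the text tokens.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == pii_tokens:
                tags[start] = f"B-{label}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{label}"
                break
    return [f"{tok} {tag}" for tok, tag in zip(tokens, tags)]


lines = bio_tag("Ready off score 8H 67881 foot market protect.", "Plates", "8H 67881")
for line in lines:
    print(line)
```

A known limitation of this sketch: a PII value directly followed by punctuation (for example `Ellis.`) will not match under whitespace tokenization, which is one reason a proper tokenizer matters.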

## Evaluation Criteria
There will be a two-part evaluation:
- To determine the performance on a holdout set (cross-validation, which real usage would call for, is skipped):
  - Report the micro-averaged F1 score, plus precision and recall for each class.
  - A full phrase match might be a good criterion for evaluation.

- To determine the performance on some known desired behaviours with a custom checklist for model explainability and reliability. (Future work; for now just test on a small list.)
    - Robustness estimates could be included to further harden the models.
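As a rough sketch of the full phrase match criterion: a prediction counts as a true positive only when both the label and the complete PII phrase match the gold annotation. The function name and the one-pair-per-sentence layout are assumptions for illustration; Flair's own evaluation works on spans and may count differently.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (label, PII phrase)


def phrase_match_counts(gold: List[Pair], pred: List[Pair]) -> Dict[str, Counter]:
    """Count tp/fp/fn per class with one gold and one predicted pair per sentence.

    A prediction is a true positive only if label and phrase both match exactly.
    "None" marks a sentence without PII and is not scored as a class.
    """
    counts: Dict[str, Counter] = {}
    for (g_label, g_pii), (p_label, p_pii) in zip(gold, pred):
        if (g_label, g_pii) == (p_label, p_pii):
            if g_label != "None":
                counts.setdefault(g_label, Counter())["tp"] += 1
            continue
        # A mismatch is a false positive for the predicted class
        # and a false negative for the gold class.
        if p_label != "None":
            counts.setdefault(p_label, Counter())["fp"] += 1
        if g_label != "None":
            counts.setdefault(g_label, Counter())["fn"] += 1
    return counts


gold = [("Name", "Jessica Ellis"), ("SSN", "427 68 3081"), ("None", "")]
pred = [("Name", "Jessica"), ("SSN", "427 68 3081"), ("None", "")]
counts = phrase_match_counts(gold, pred)
print({label: dict(c) for label, c in counts.items()})
```

Note that the partial match `Jessica` is penalized twice (one false positive and one false negative), which is the strict behaviour a full phrase match implies.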


In [1]:
# imports
import pandas as pd
import numpy as np
from tqdm import tqdm
from difflib import SequenceMatcher
import re
import pickle
import random
# local files
import rules
import preprocess
import datagen

from typing import List
from sklearn.model_selection import train_test_split

from flair.data import Sentence, Corpus
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, WordEmbeddings, StackedEmbeddings, TokenEmbeddings,TransformerWordEmbeddings

random.seed(369)

In [2]:
raw_data = pd.read_excel("../data/PII_Train_Large_Data_Test_Data.xlsx",sheet_name="PII Train Large Data - PII Trai", skiprows=1, index_col=None, na_values=['NA'], usecols = "A,B,C")
# shuffle
raw_data = raw_data.sample(frac=1,random_state=369).reset_index(drop=True)

### Data Exploration
Observations:
- total classes present = 8
- classes are balanced
- 60/20/20 split makes sense for train/test/dev


In [3]:
raw_data.head()

Unnamed: 0,Text,Labels,PII
0,Quality whom pay travel our 4914569109182 tabl...,CreditCardNumber,4914569109182
1,History effort 427 68 3081 system kitchen. Hea...,SSN,427 68 3081
2,Team Republican him Jessica Ellis reveal. Play...,Name,Jessica Ellis
3,Radio respond perhaps western loss blood. Turn...,Email,jrodriguez@mccarthy-lawson.biz
4,Ready off score 8H 67881 foot market protect.,Plates,8H 67881


In [4]:
# show the distribution of class in the train set
raw_data.groupby(["Labels"])['Text'].nunique().reset_index()

Unnamed: 0,Labels,Text
0,Address,100
1,CreditCardNumber,100
2,Email,100
3,Name,100
4,,100
5,Phone_number,100
6,Plates,100
7,SSN,100


## Data Augmentation
- Complex patterns such as <b>Name</b> and <b>Address</b> have too little support and need to be augmented for a fair evaluation.
- Luckily, there is an easy way to generate fake names and addresses; let's add those to the mix to increase support.
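For intuition, the augmentation idea can be sketched with the standard library alone: build a sentence of filler words and drop a synthetic name into a random position. The word lists and the function name `make_name_example` are hypothetical, and this is not the implementation in `datagen.py`, which I reference in the next cell.

```python
import random

# Hypothetical vocabularies; a real generator would draw from much larger pools.
FILLER = ["team", "market", "system", "blood", "growth", "score", "travel", "radio"]
FIRST = ["Jessica", "Brent", "Angela"]
LAST = ["Ellis", "Hawkins", "Avery"]


def make_name_example(rng: random.Random, n_words: int = 8):
    """Return (text, label, pii) with a synthetic name at a random position."""
    pii = f"{rng.choice(FIRST)} {rng.choice(LAST)}"
    words = [rng.choice(FILLER) for _ in range(n_words)]
    # Insert the full name as one unit so it stays contiguous in the text.
    words.insert(rng.randrange(n_words + 1), pii)
    return " ".join(words), "Name", pii


rng = random.Random(369)
text, label, pii = make_name_example(rng)
print(label, "|", pii, "|", text)
```

Keeping the PII value alongside the generated text means the same row can go straight through the BIO tagging step without any manual annotation.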

In [5]:
# Ref: datagen.py for details
gen = datagen.DataGenerator(max_nb_chars=80, n=400, seed=369)
fake_data = gen.generate_fake_data()
fake_data.head()

Unnamed: 0,Text,Labels,PII
0,Form late billion probably blood. May growth J...,Name,Jennifer Collier
1,Who wait fast develop 96947 Keller Squares Chr...,Address,"96947 Keller Squares Christopherville, OR 83095"
2,Too response system laugh Brent Hawkins decisi...,Name,Brent Hawkins
3,Institution action social become 238 Amber Cro...,Address,"238 Amber Crossroad Sandraburgh, OR 64251"
4,Between loss goal Angela Avery sea seem Republ...,Name,Angela Avery


In [6]:
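# special patterns for address: short values such as apartment numbers
# (a single combined mask avoids the reindexing warning from chained boolean indexing)
short_address = (fake_data['Labels'] == 'Address') & (fake_data['PII'].str.len() < 10)
fake_data[short_address].tail()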

Unnamed: 0,Text,Labels,PII
767,Other discuss light nature. Later cause Apt .1...,Address,Apt .1033
771,Only allow attack think current. Apt 402 Call ...,Address,Apt 402
775,Star large two leader foot Apt 958 your. Color...,Address,Apt 958
793,Suffer despite Apt 3793 later on. Fine final c...,Address,Apt 3793
799,Concern modern agent nice physical old. Apt .2...,Address,Apt .2955


In [7]:
# augmented data is a combination of original data and the fake data
aug_data = pd.concat([raw_data, fake_data])
# shuffle again
aug_data = aug_data.sample(frac=1,random_state=369).reset_index(drop=True)

In [8]:
# show the distribution of class in the train set
# Note: Name and Address now have 500 examples each
aug_data.groupby(["Labels"])['Text'].nunique().reset_index()

Unnamed: 0,Labels,Text
0,Address,500
1,CreditCardNumber,100
2,Email,100
3,Name,500
4,,100
5,Phone_number,100
6,Plates,100
7,SSN,100


## Tokenization
- This is one of the key aspects of getting NER right.
- Known patterns can be pinned with deterministic tokens to aid recognition.
- Flair's internal tokenization handles special characters by adding spaces, which requires special handling for emails: the "@" symbol must be replaced with " @ ", and an inverse transform might be needed on the predicted spans before consumption.

Here is an example of a tokenized email sentence; notice the pinning tokens <b>eeeee</b> and <b>mmmmm</b> around the email and the separate <b>@</b> token:

In [9]:
# Here is an example of how the tokenization takes place.
converter = preprocess.CoNLLConverter()
index = 69
print(aug_data['Labels'].iloc[index])
converter.bio_tagger(aug_data['Text'].iloc[index], aug_data['Labels'].iloc[index], aug_data['PII'].iloc[index])

Email


['Film O\n',
 'range O\n',
 'sound. O\n',
 'People O\n',
 'age O\n',
 'that. O\n',
 'eeeee O\n',
 'douglaslewis B-Email\n',
 '@ I-Email\n',
 'yahoo.com I-Email\n',
 'mmmmm O\n',
 '\n']

In [10]:
# Here is an example of how the tokenization takes place for emails.
converter = preprocess.CoNLLConverter()
index = aug_data['Labels'][aug_data['Labels']=="Email"].index[0]
print(aug_data['Labels'].iloc[index])
converter.bio_tagger(aug_data['Text'].iloc[index], aug_data['Labels'].iloc[index], aug_data['PII'].iloc[index])

Email


['Community O\n',
 'stand O\n',
 'nice O\n',
 'whatever O\n',
 'film. O\n',
 'Blood O\n',
 'go O\n',
 'particular O\n',
 'same O\n',
 'wait O\n',
 'record O\n',
 'interview. O\n',
 'Position O\n',
 'sometimes O\n',
 'test O\n',
 'shoulder O\n',
 'save O\n',
 'huge O\n',
 'course. O\n',
 'Almost O\n',
 'they O\n',
 'eeeee O\n',
 'diazamanda B-Email\n',
 '@ I-Email\n',
 'johnson.info I-Email\n',
 'mmmmm O\n',
 'himself O\n',
 'less O\n',
 'interesting O\n',
 'education. O\n',
 '\n']
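The pinning visible in the two outputs above can be approximated as follows. This is a sketch under assumptions: the sentinel tokens are copied from the outputs, but the regular expression and the function names `pin_email` / `unpin_email` are mine and may differ from what `rules.py` and `preprocess.py` actually do.

```python
import re

# Sentinel tokens copied from the outputs above; the regex is an assumption.
START, END = "eeeee", "mmmmm"
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pin_email(text: str) -> str:
    """Wrap each email in sentinel tokens and space out the '@' sign."""
    def repl(match):
        return f"{START} {match.group(0).replace('@', ' @ ')} {END}"
    return EMAIL_RE.sub(repl, text)


def unpin_email(span_text: str) -> str:
    """Inverse transform for a predicted span: rejoin the pieces around '@'."""
    return span_text.replace(" @ ", "@")


pinned = pin_email("People age that. douglaslewis@yahoo.com")
print(pinned)
print(unpin_email("douglaslewis @ yahoo.com"))
```

The sentinels stay tagged as `O`, so they act as context markers for the tagger rather than as part of the entity.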

In [11]:
# partition data into train, test, dev
X_train, X_test, y_train, y_test = train_test_split(aug_data.index, aug_data["Labels"], test_size=0.2, random_state=369 ,stratify=aug_data["Labels"])
X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train, test_size=0.25, random_state=369, stratify=y_train)  # 0.25 x 0.8 = 0.2

# verify split distribution is identical
aug_data.loc[X_dev,:].groupby(["Labels"])['Text'].nunique().reset_index()

Unnamed: 0,Labels,Text
0,Address,100
1,CreditCardNumber,20
2,Email,20
3,Name,100
4,,20
5,Phone_number,20
6,Plates,20
7,SSN,20
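The dev counts above follow from the two-step split: 20% goes to test first, then 25% of the remaining 80% goes to dev. A quick arithmetic check, assuming stratification splits each class at exactly the nominal fractions (the helper `split_sizes` is hypothetical):

```python
def split_sizes(n: int):
    """Return (train, dev, test) sizes for one class of n examples."""
    n_test = round(n * 0.2)             # first split: 20% held out for test
    n_dev = round((n - n_test) * 0.25)  # second split: 25% of the remaining 80%
    n_train = n - n_test - n_dev
    return n_train, n_dev, n_test


# Class sizes after augmentation: 500 for Name/Address, 100 for the rest.
for n in (500, 100):
    print(n, split_sizes(n))
```

This gives 300/100/100 for the augmented classes and 60/20/20 for the others, i.e. the intended 60/20/20 split.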


In [12]:
# save splits for later evaluations
aug_data.to_excel("../data/augmented_data.xlsx", columns=['Text','Labels','PII'])
aug_data.loc[X_train,:].to_excel("../data/augmented_data_train.xlsx", columns=['Text','Labels','PII'])
aug_data.loc[X_test,:].to_excel("../data/augmented_data_test.xlsx", columns=['Text','Labels','PII'])
aug_data.loc[X_dev,:].to_excel("../data/augmented_data_dev.xlsx", columns=['Text','Labels','PII'])

In [13]:
# save all tagged data for visual checks
with open("../data/pii_conll_data_all.txt", "w", encoding="utf-8") as f:
    for index, row in aug_data.iterrows():
        tagged_data = converter.bio_tagger(row['Text'], row['Labels'], row['PII'])
        f.writelines(tagged_data)

In [14]:
# save tagged train.txt
with open("../data/train.txt", "w", encoding="utf-8") as f:
    for index, row in aug_data.loc[X_train,:].iterrows():
        tagged_data = converter.bio_tagger(row['Text'], row['Labels'], row['PII'])
        f.writelines(tagged_data)

In [15]:
# save tagged test.txt
with open("../data/test.txt", "w", encoding="utf-8") as f:
    for index, row in aug_data.loc[X_test,:].iterrows():
        tagged_data = converter.bio_tagger(row['Text'], row['Labels'], row['PII'])
        f.writelines(tagged_data)

In [16]:
# save tagged dev.txt
with open("../data/dev.txt", "w", encoding="utf-8") as f:
    for index, row in aug_data.loc[X_dev,:].iterrows():
        tagged_data = converter.bio_tagger(row['Text'], row['Labels'], row['PII'])
        f.writelines(tagged_data)

In [17]:
# load the corpus
# define columns
columns = {0 : 'text', 1 : 'ner'}
# directory where the data resides
data_folder = '../data'
# initializing the corpus
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file = 'train.txt',
                              test_file = 'test.txt',
                              dev_file = 'dev.txt')

2021-02-01 11:46:10,741 Reading data from ..\data
2021-02-01 11:46:10,742 Train: ..\data\train.txt
2021-02-01 11:46:10,743 Dev: ..\data\dev.txt
2021-02-01 11:46:10,744 Test: ..\data\test.txt


In [18]:
# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')

# ***
# Note: BERT embeddings would be helpful if we had well-formed sentences. They would
# really help with polysemous words, which the train / test data currently lacks,
# since that would mean having PER and LOC entities.
# ***

# init multilingual BERT
# bert_embedding = TransformerWordEmbeddings('bert-base-multilingual-cased')

embedding_types : List[TokenEmbeddings] = [
        WordEmbeddings('glove'),
        ## other embeddings
        flair_forward_embedding, 
        flair_backward_embedding, 
        # bert_embedding
        ]
embeddings : StackedEmbeddings = StackedEmbeddings(
                                 embeddings=embedding_types)

In [19]:
# prepare the tag for ner
tag_type = 'ner'
# make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary)

tagger : SequenceTagger = SequenceTagger(hidden_size=256,
                                       embeddings=embeddings,
                                       tag_dictionary=tag_dictionary,
                                       tag_type=tag_type,
                                       use_crf=True)

# describe the network
print(tagger)

Dictionary with 16 tags: <unk>, O, B-Name, I-Name, B-Address, I-Address, B-Plates, B-Email, I-Email, B-SSN, I-SSN, B-Phone_number, I-Plates, B-CreditCardNumber, <START>, <STOP>
SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(11854, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=11854, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(11854, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=11854, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4196, out_features=4196, bias=True)
  (rnn): LSTM(4196
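The `in_features=4196` in the printout is just the concatenated width of the stacked embeddings. A quick check, where the 100-dimensional GloVe size is my assumption about Flair's `'glove'` vectors and the 2048 comes from the `LSTM(100, 2048)` lines:

```python
glove_dim = 100         # assumption: Flair's 'glove' word vectors are 100-dimensional
flair_lm_hidden = 2048  # hidden size of each Flair language model in the printout
# One forward and one backward language model are stacked with GloVe.
stacked_dim = glove_dim + 2 * flair_lm_hidden
print(stacked_dim)
```

Adding the commented-out BERT embedding would widen this input further, which is worth keeping in mind for memory use.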

In [20]:
# initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# start training
trainer.train('../model/taggers/pii-ner-v1',
              learning_rate=0.1,
              mini_batch_size=24,
              max_epochs=150)

2021-02-01 11:46:15,487 ----------------------------------------------------------------------------------------------------
2021-02-01 11:46:15,488 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(11854, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=11854, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(11854, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=11854, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4196, out_features=4196, bias=True)
  (rnn): LSTM(4196, 256, batch_first=T

{'test_score': 0.9933333333333333,
 'dev_score_history': [0.49128919860627174,
  0.9205298013245033,
  0.9601328903654486,
  0.9683860232945091,
  0.9667774086378739,
  0.9750415973377703,
  0.9734219269102989,
  0.9933333333333333,
  0.9933333333333333,
  0.9767441860465117,
  0.9850249584026622,
  0.99,
  0.9866666666666668,
  0.99,
  0.9866666666666668,
  0.99,
  0.9933333333333333,
  0.99,
  0.99,
  0.99,
  0.99,
  0.9933333333333333,
  0.99,
  0.9916805324459235,
  0.99,
  0.9933333333333333,
  0.99,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9966666666666667,
  0.9933333333333333,
  0.9966666666666667,
  0.9933333333333333,
  0.99,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.9933333333333333,
  0.993333333333333

## Scores

In the interest of visibility I have copied over the final scores from the logs.
```
2021-02-01 11:54:23,677 Testing using best model ...
2021-02-01 11:54:23,678 loading file ..\model\taggers\pii-ner-v1\best-model.pt
2021-02-01 11:54:28,106 0.9933	0.9933	0.9933
2021-02-01 11:54:28,106 
Results:
- F1-score (micro) 0.9933
- F1-score (macro) 0.9971

By class:
Address    tp: 99 - fp: 1 - fn: 1 - precision: 0.9900 - recall: 0.9900 - f1-score: 0.9900
CreditCardNumber tp: 20 - fp: 0 - fn: 0 - precision: 1.0000 - recall: 1.0000 - f1-score: 1.0000
Email      tp: 20 - fp: 0 - fn: 0 - precision: 1.0000 - recall: 1.0000 - f1-score: 1.0000
Name       tp: 99 - fp: 1 - fn: 1 - precision: 0.9900 - recall: 0.9900 - f1-score: 0.9900
Phone_number tp: 20 - fp: 0 - fn: 0 - precision: 1.0000 - recall: 1.0000 - f1-score: 1.0000
Plates     tp: 20 - fp: 0 - fn: 0 - precision: 1.0000 - recall: 1.0000 - f1-score: 1.0000
SSN        tp: 20 - fp: 0 - fn: 0 - precision: 1.0000 - recall: 1.0000 - f1-score: 1.0000
```
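As a sanity check, the micro and macro F1 scores can be recomputed from the per-class counts in the log. The numbers are copied from the log above; the helper `f1` is a plain textbook definition, not Flair's code.

```python
# (tp, fp, fn) per class, copied from the log above.
by_class = {
    "Address": (99, 1, 1),
    "CreditCardNumber": (20, 0, 0),
    "Email": (20, 0, 0),
    "Name": (99, 1, 1),
    "Phone_number": (20, 0, 0),
    "Plates": (20, 0, 0),
    "SSN": (20, 0, 0),
}


def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Micro: pool the counts over all classes, then compute one F1.
tp, fp, fn = (sum(c[i] for c in by_class.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)
# Macro: compute F1 per class, then take the unweighted mean.
macro_f1 = sum(f1(*c) for c in by_class.values()) / len(by_class)
print(f"micro={micro_f1:.4f} macro={macro_f1:.4f}")
```

Macro comes out higher than micro here because the two classes with errors (Name and Address) are also the largest, so they weigh more in the pooled counts.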

In [21]:
# load the trained model
model = SequenceTagger.load('../model/taggers/pii-ner-v1/final-model.pt')

2021-02-01 11:54:28,191 loading file ../model/taggers/pii-ner-v1/final-model.pt


In [22]:
# run some basic sanity checks

tests = [
    # name
    ("Hold home fight nor customer defense. Shields Shake player something.","Name"),
    ("Box direction clear sense democratic power. Adult determine rule Deanna across sometimes according.","Name"),
    ("High same pass Smith change. Effort board fly onto middle detail. Ahead chance small ability reduce under. Middle policy use single become.","Name"),
    ("Product similar final sense senior least. Model explain reveal American they behind source. Join Ariana White president other arm until feel force.","Name"),
    ("Matthew Haas Market Republican hand sign.","Name"),
    ("Whom simple Franklin nice three car. Music answer southern performance glass around.","Name"),
    ("Everybody small suggest into president. Over manager Charles government they.","Name"),
    # address
    ("Dark turn purpose attack set way. Significant impact book daughter manager behavior pressure. Price true we pressure culture design 22868 Strong Square Suite 603 serve answer.","Address"),
    ("Class speech structure ask prevent. Do tree actually 384 Moran River Suite 724 West Maureenport, NV 14327 forward.","Address"),
    ("Cold write close full likely actually plant method. Suite 139 Project father chair keep. Financial itself miss house reduce necessary.","Address"),
    ("Cup happen say join improve would. Oil PM special parent executive foot series term. Operation adult result decision prevent talk well. 640 Eddie Mission Apt. 272 New Kimberly, ME 38505","Address"),
    ("Book environmental site each produce more. Clearly 7112 Joanne Cape Apt. 685 ability support technology everything alone pressure.","Address"),
    ("Suite 082 Behavior phone sign meeting service. Fish service attack present during security source.","Address"),
    # credit card
    ("Medical eye inside respond. Ability international change 2224009433974117 could what American actually rate. Whole although chance.","CreditCardNumber"),
    ("Rock movement call turn minute American act soldier. Next customer similar agency 4371942766026400604 training industry. Painting once itself camera go yes.","CreditCardNumber"),
    ("Team deal growth check human. Small these American star subject loss 4504414195582719 measure.","CreditCardNumber"),
    # ssn
    ("Run society adult large. 070-94-9083 Down phone many. Through admit involve life property decision data doctor.","SSN"),
    ("Sister their service tax water sit a. Hundred region bad source well else save. 617 15 2105","SSN"),
    ("Produce military act 765 70 5679 bill. Source carry discussion use such attorney before. Friend air last. Even project mother maintain allow partner.","SSN"),
    ("573-70-8827 Man open upon. Activity note area behavior. Page none two morning.","SSN"),
    # email
    ("Discover our nothing try chance federal candidate. Indeed design barbaracarpenter@yahoo.com high people.","Email"),
    ("Magazine day physical. Eight fill on whatever glass technology. Act as pass tell action pattern huertabruce@yahoo.com every.","Email"),
    ("hensleyedward@gmail.com Increase several him marriage word truth sort. Among economic system. Usually serve candidate moment.","Email"),
    ("Lose meet left kitchen organization herself. Business majority our. They job watch reason meet emily86@gmail.com drive.","Email"),
    ("Save fall card care anyone pressure. Remember susan68@ramirez.biz sister offer draw. Take field because return fear guess.","Email"),
    ("General attorney record ten much remember. Eye live ask bridget58@gmail.com support.","Email"),
    # plate
    ("Trouble religious hear environmental teacher get ZSH 214 forget. Test executive whole pass design four contain.","Plate"),
    ("Draw 40LR5 feeling might. Under lawyer part senior.","Plate"),
    ("Able this character evidence woman it want. I require popular BAA0043 off maintain.","Plate"),
    ("Of staff international box would also. Arrive region actually training. 255-519","Plate"),
    ("Population religious effort whatever contain. Employee general wife another thus mission others. 22-C881 Story year owner easy common late. Before build not gun pay upon.","Plate"),
    ("Whose raise soon participant S52 1IO alone. Value design teacher which.","Plate"),
    # none
    ("Ball idea ever move. Sign writer third teach drop perhaps sometimes. Explain happy me good enjoy.","None"),
    ("Usually cold thousand professional TV direction care. Carry but role strong sister few. Region base single investment.","None"),
    ("Old start difficult others. Move decide shake talk million candidate.","None"),
    ("Manager physical respond that teach father together. Republican in recently ahead.","None"),
    ("Couple job civil green road news senior. Occur put standard air collection actually thus. Single country would them.","None"),

]

engine = rules.RulesEngine()

for text, true_label in tests:
    text = engine.pin_text(text)
    sentence = Sentence(text)
    # predict the tags
    model.predict(sentence)
    print("Sentence:")
    print(text)
    print("Tagged Sentence:")
    print(sentence.to_tagged_string())
    print("True Label:", true_label)
    print("Entity")
    for entity in sentence.get_spans('ner'):
        print(entity)
    print("\n")

Sentence:
Hold home fight nor customer defense. Shields Shake player something.
Tagged Sentence:
Hold home fight nor customer defense . Shields <B-Name> Shake player something .
True Label: Name
Entity
Span [8]: "Shields"   [− Labels: Name (0.7236)]


Sentence:
Box direction clear sense democratic power. Adult determine rule Deanna across sometimes according.
Tagged Sentence:
Box direction clear sense democratic power . Adult determine rule Deanna <B-Name> across sometimes according .
True Label: Name
Entity
Span [11]: "Deanna"   [− Labels: Name (0.999)]


Sentence:
High same pass Smith change. Effort board fly onto middle detail. Ahead chance small ability reduce under. Middle policy use single become.
Tagged Sentence:
High same pass Smith <B-Name> change . Effort board fly onto middle detail . Ahead chance small ability reduce under . Middle policy use single become .
True Label: Name
Entity
Span [4]: "Smith"   [− Labels: Name (0.9782)]


Sentence:
Product similar final sense senior 