## Inference - PII NER

The provided evaluation data needs to be tagged by the model and exported to an Excel file.
Notes:
- `Apt. xxx` patterns seem to be confused with plates; this should be easy to fix by adding such examples to the training data.
- Some Phone_number entities are tagged as Address. We might have imbalanced the classes by adding too much fake data, so in the next iteration we could reduce the volume of addresses or add more phone number examples.
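
One way to check the suspected class imbalance before the next iteration is to look at each label's share of the training annotations. The sketch below is only an illustration: `label_distribution` is a hypothetical helper, and the sample labels (including the `Plates` name) are made up rather than taken from the real training set.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the total, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical labels, for illustration only (not the real training data)
sample_labels = ["Address", "Address", "Address", "Phone_number", "Plates"]
distribution = label_distribution(sample_labels)
print(distribution)
```

A heavily skewed share for Address relative to Phone_number would support the imbalance explanation in the note above.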

In [1]:
# imports
import pandas as pd
import numpy as np

# local files
import rules
import preprocess
import datagen

from flair.data import Sentence
from flair.models import SequenceTagger

In [2]:
# load the trained model
model = SequenceTagger.load('../model/taggers/pii-ner-v0/final-model.pt')
# load the rules engine
engine = rules.RulesEngine()

2021-01-31 01:14:09,821 loading file ../model/taggers/pii-ner-v0/final-model.pt


In [3]:
# Load the eval data
eval_data = pd.read_excel(
    "../data/PII_Train_Large_Data_Test_Data.xlsx",
    sheet_name="PII Test Data - PII Test Data",
    skiprows=1,
    index_col=None,
    na_values=['NA'],
    usecols="A,B,C",
)
eval_data.head()

| | Text | Label | PII |
|---|---|---|---|
| 0 | Term although process suddenly parent. Poor go... | | |
| 1 | 356 Collins Highway New Kathleen, NM 10160 Rem... | | |
| 2 | Appear job opportunity job. Piece 405 Callahan... | | |
| 3 | During half leave simple west lose piece 859 D... | | |
| 4 | Peace when Apt. 910 enter left speak agree. Le... | | |


In [4]:
# Helper that returns the first entity the model finds in a text.
# Future improvement: sort the entities by confidence instead of taking the first one.
def get_predictions(text, rules_engine, model):
    # apply the rules engine's pin_text step before tagging
    text = rules_engine.pin_text(text)
    sentence = Sentence(text)
    model.predict(sentence)
    entities = sentence.to_dict(tag_type="ner").get('entities')
    if entities:
        first = entities[0]
        return {"Label": first['labels'][0].value, "PII": first['text']}
    # no entity found in this text
    return {"Label": "None", "PII": ""}

get_predictions(
    "Cup happen say join improve would. Oil PM special parent executive foot series term. Operation adult result decision prevent talk well. 640 Eddie Mission Apt. 272 New Kimberly, ME 38505",
    engine, model
)

{'Label': 'Address',
 'PII': '640 Eddie Mission Apt. 272 New Kimberly, ME 38505'}
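
The helper's comment mentions choosing entities by confidence as a future improvement. A minimal sketch of that idea follows, assuming each entity in the `to_dict` result carries a `labels` list whose items expose `value` and `score` attributes, as the printed output in this notebook suggests for the flair version used here. `FakeLabel` and `pick_most_confident` are hypothetical names, introduced so the example runs without the trained model.

```python
from dataclasses import dataclass


@dataclass
class FakeLabel:
    """Hypothetical stand-in for a flair label: a tag name plus a confidence score."""
    value: str
    score: float


def pick_most_confident(entities):
    """Return the Label/PII pair of the entity whose first label scores highest."""
    if not entities:
        return {"Label": "None", "PII": ""}
    best = max(entities, key=lambda entity: entity["labels"][0].score)
    return {"Label": best["labels"][0].value, "PII": best["text"]}


# Entities shaped like flair's to_dict output; the values mirror the
# phone-number example in the last cell of this notebook.
entities = [
    {"text": "+", "start_pos": 100, "end_pos": 101,
     "labels": [FakeLabel("Phone_number", 0.8826)]},
    {"text": "1-321-677-1018x127", "start_pos": 101, "end_pos": 119,
     "labels": [FakeLabel("Phone_number", 0.9957)]},
]
best_prediction = pick_most_confident(entities)
print(best_prediction)
```

Inside `get_predictions`, the same `max(...)` call could replace the `[0]` lookup on the entity list, which would return the higher-scoring span instead of whichever entity comes first.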

In [5]:
# run inference on every row and collect the predictions
predictions = []
for index, row in eval_data.iterrows():
    pred = get_predictions(row['Text'], engine, model)
    pred['Text'] = row['Text']
    predictions.append(pred)

In [9]:
# predictions is a plain list, so wrap it in a DataFrame before exporting to Excel
pd.DataFrame(predictions).to_excel("../data/PII_Predictions.xlsx", columns=['Text','Label','PII'])

In [8]:
# A simple inference would look something like this
text = "Traditional while report few southern world. Measure school significant since face think total. +1-321-677-1018x127 Water radio reflect against admit."
sentence = Sentence(engine.pin_text(text))
model.predict(sentence)
result = sentence.to_dict(tag_type="ner")
print(result)

{'text': 'Traditional while report few southern world. Measure school significant since face think total. ppp +1-321-677-1018x127 hhh Water radio reflect against admit.', 'labels': [], 'entities': [{'text': '+', 'start_pos': 100, 'end_pos': 101, 'labels': [Phone_number (0.8826)]}, {'text': '1-321-677-1018x127', 'start_pos': 101, 'end_pos': 119, 'labels': [Phone_number (0.9957)]}]}
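
In the output above, the phone number comes back as two touching entities (`+` ends at position 101, where the rest of the number starts). One possible post-processing step, which is not part of the original notebook, is to merge consecutive entities that share a label and whose spans touch. The sketch below assumes the label has already been reduced to a plain string under a hypothetical `label` key; `merge_adjacent_entities` is likewise a hypothetical helper.

```python
def merge_adjacent_entities(entities):
    """Merge consecutive entities with the same label whose spans touch."""
    merged = []
    for entity in sorted(entities, key=lambda e: e["start_pos"]):
        previous = merged[-1] if merged else None
        if (previous is not None
                and previous["label"] == entity["label"]
                and previous["end_pos"] == entity["start_pos"]):
            # extend the previous span instead of starting a new one
            previous["text"] += entity["text"]
            previous["end_pos"] = entity["end_pos"]
        else:
            merged.append(dict(entity))  # copy so the input is not mutated
    return merged


# Simplified entities based on the output above ("label" is a plain string here)
entities = [
    {"text": "+", "start_pos": 100, "end_pos": 101, "label": "Phone_number"},
    {"text": "1-321-677-1018x127", "start_pos": 101, "end_pos": 119, "label": "Phone_number"},
]
merged = merge_adjacent_entities(entities)
print(merged)
```

Only spans that touch exactly are merged; entities separated by whitespace are left alone, since joining them would need the original text to rebuild the gap.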
