# NER - extracting person and company names
- This notebook runs inference with a trained BERT model to extract person and company names from documents (W-2s, resumes, etc.).

- The model was trained separately using train_NER.ipynb or similar scripts (which use the Hugging Face library)

- Modeled as a multi-class classification problem (3 classes: person, org, none)

### Documents:

- Complete Results: https://docs.google.com/spreadsheets/d/1rzEQrqRDqQpZ95_G95Fl1MKQppWJ1lAPBuOU6LVNiXY/edit?usp=sharing

- Slides: https://docs.google.com/presentation/d/1EdubLHYdHDPJKT1GYHjkP86ZJMqjEhM_PFhvmWXCHSg/edit?usp=sharing

- Final Intern presentation: https://docs.google.com/presentation/d/10mXA7K5sa_nAkqx2onsIfrH3TPj2Ni4LfCOxDhN5XBI/edit?usp=sharing

In [14]:
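import os
import importlib
import logging
# Reload logging to reset its state in an already-running kernel
importlib.reload(logging)
# Reload the project modules so local edits are picked up without restarting the kernel
import framework
importlib.reload(framework)
import bert_ner
importlib.reload(bert_ner)
import infer_bert_classifier
importlib.reload(infer_bert_classifier)
import bert_utils
importlib.reload(bert_utils)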
import pandas as pd
import webbrowser
from framework import DataCuration, FeatureEngineering
from bert_ner import TaskNER, FeatureEngineeringNER, BERTNER

# Define some constants and configurations
logging.getLogger().setLevel(logging.INFO)
ACCESS_TOKEN = 'WUpGevbWC9lsnTW8quNUtmWRdAEM89'

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ahsaasbajaj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Specify Task
- Specify the task configuration and create a task object

In [15]:
DATASET = 'w2' # supports w2 and resume
TASK_CONFIG = {
    'task': 'ner',
    'num_labels': 3,
    'labels_dict': {'person' : 0, 'org' : 1, 'none': 2}
}

task = TaskNER(TASK_CONFIG)

## Curate dataset
- Specify paths for the dataset and goldens (if available). Paths can be local or on Instabase drives (set *is_local* accordingly).
- Also specify configurations such as file extensions and the column name to use as the index.
- Currently supports csv format for goldens and ibocr/ibdoc for the dataset.
- Use *convert2txt* to extract and store raw texts.
- This block creates a DataCuration() object, which maps goldens to the dataset, removes any mismatches, and generates a 1:1 mapping.
- This object can be queried using *data.golden*, *data.dataset*, or *data.dataset.texts*, as required
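
The 1:1 mapping between goldens and dataset files can be pictured with a small stdlib-only sketch. This is not the `DataCuration` implementation (which lives in `framework` and is not shown in this notebook): `match_goldens_to_files`, the record shape, and the sample paths are hypothetical. The only piece taken from the notebook is the identifier rule, which mirrors the `identifier` lambda in `DATASET_CONFIG`.

```python
import os

def file_identifier(path):
    # Same rule as the 'identifier' lambda in DATASET_CONFIG:
    # "dir/a.PDF.ibocr" -> "a.PDF"
    return os.path.basename(path).split('.ibocr')[0]

def match_goldens_to_files(golden_rows, dataset_paths):
    """Keep only golden rows whose index value has a matching dataset file.

    golden_rows: dict mapping filename -> golden record (hypothetical shape).
    Returns (matched, goldens_without_file, files_without_golden).
    """
    files_by_id = {file_identifier(p): p for p in dataset_paths}
    matched = {
        name: (files_by_id[name], record)
        for name, record in golden_rows.items()
        if name in files_by_id
    }
    goldens_without_file = sorted(set(golden_rows) - set(files_by_id))
    files_without_golden = sorted(set(files_by_id) - set(golden_rows))
    return matched, goldens_without_file, files_without_golden

# Made-up example data
goldens = {
    'a.PDF': {'employee_name': 'JANE Q DOE'},
    'b.PDF': {'employee_name': 'JOHN ROE'},
}
paths = ['/data/s2_map_records/a.PDF.ibocr', '/data/s2_map_records/c.PDF.ibocr']
matched, goldens_without_file, files_without_golden = match_goldens_to_files(goldens, paths)
print(sorted(matched), goldens_without_file, files_without_golden)
```

Dropping the unmatched entries on both sides is what leaves a 1:1 mapping; in the logged run of this notebook, 154 golden rows were reduced to the 142 that had a matching file.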

### Files:
- Run the flow on the raw files at https://dogfood.instabase.com/dlluncor/lending-club/fs/Instabase%20Drive/workspace-us-markets/w2/data-500/input/ and download the resulting documents (ibocr or ibdoc files). Use the *s2_map_records* output.
- Sample flow outputs and processed goldens are available at https://drive.google.com/drive/folders/1h1eHP1Jy8FmRoCehfKQ9dblIwJ8OEmwC?usp=sharing. Download them, specify the local directory path in the code below, and set *is_local* to True.
- Alternatively, specify Instabase drive paths and set *is_local* to False

In [16]:
W2_DATA = [
   '/Users/ahsaasbajaj/Documents/Data/w2-instabase/flow/s2_map_records'
]
W2_GOLDEN = [
   '/Users/ahsaasbajaj/Documents/Data/w2-instabase/golden/goldens.csv'
]

GOLDEN_CONFIG = {
    'path': W2_GOLDEN,
    'is_local': True,
    'index_field_name':'filename',
    'file_type': 'csv',
    'identifier': 'file'
}
DATASET_CONFIG = {
    'path': W2_DATA,
    'is_local': True, 
    'file_type': 'ibocr',
    'identifier': lambda path: os.path.basename(path).split('.ibocr')[0],
    'convert2txt': True
}

data = DataCuration(ACCESS_TOKEN, DATASET_CONFIG, GOLDEN_CONFIG)

INFO:root:Loading dataset from /Users/ahsaasbajaj/Documents/Data/w2-instabase/flow/s2_map_records
INFO:root:142 files loaded
INFO:root:Converting IBOCR/IBDOC to raw texts
INFO:root:Loading goldens from /Users/ahsaasbajaj/Documents/Data/w2-instabase/golden/goldens.csv
INFO:root:Total files Goldens: (154, 25)
INFO:root:Total files found in the source with unique index: (142, 25)


### Print Goldens

In [17]:
data.golden.head()

| filename | employee_ssn | box5_medicare_wages | box3_ss_wage | box6_medicare_withholding | box4_ss_withholding | box2_fed_withhold | box17_state_income_tax | box1_wage | box8_allocated_tips | box14_other | ... | box12c_amount | box12d_code | box12d_amount | employer_federal_ein | document_type | template_name | employer_name | employee_name | w2_year | gross_pay |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| last_year_w2_1493334985571.PDF | 561-87-0728 | 36505.83 | 36505.83 | 529.33 | 2263.36 | 4093.92 | 519.22 | 36505.83 |  | ["328.55"] | ... |  |  |  | 01-0726495 | W2 | general_w2 | BROKER SOLUTIONS | PATRICIA HEREDIA | 2016.0 | 39105.41 |
| last_year_w2_1493334989467.PDF | 408-31-3195 | 51350.25 | 51350.25 | 744.58 | 3183.72 | 6940.69 |  | 47242.23 |  |  | ... |  |  |  | 06-1102358 | W2 | general_w2 | FORMAN INDUSTRIES INC | THOMAS V. MOORE | 2016.0 | 51350.25 |
| last_year_w2_1493334998968.PDF | 261-77-1595 | 105916.49 | 105916.49 | 1535.82 | 6566.82 | 24471.02 |  | 105916.49 |  |  | ... |  |  |  | 36-4248787 | W2 | general_w2 | YASH-LUJAN CONSULTING INC Y & L CONSULTING, INC | STACY L STUMETZ | 2016.0 | 110240.0 |
| last_year_w2_1493335006405.PDF | 452-93-6475 | 35987.53 | 35987.53 | 521.82 | 2231.23 | 2814.31 |  | 35987.53 |  |  | ... |  |  |  | 74-2482708 | W2 | general_w2 | TECO-WESTINGHOUSE MOTOR COMPANY | HENRY COTTLE | 2016.0 | 43827.05 |
| last_year_w2_1493752474038.PDF | 365-04-7683 | 85245.86 | 85245.86 | 1236.06 | 5285.24 | 13629.89 | 3129.87 | 77722.96 |  | ["2069.50", "9.00"] | ... | 10815.96 |  |  | 75-2778918 | W2 | general_w2 | FLOWSERVE US INC | JASON ALLEN JERZ | 2016.0 | 88420.2 |


### Split input documents to generate *candidate phrases*
- Candidate phrases are runs of consecutive tokens clustered using whitespace information
- These phrases are the input strings used to train and test the sequence classifiers
- A golden person or org name must be one of these phrases in order to be extracted by this solution
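
One way to picture candidate-phrase generation is the sketch below. It is an illustration under stated assumptions, not the code behind `data.generate_candidates_phrases()`: it assumes each OCR word carries a line index and left/right x-coordinates, and that `X_DIST_THRESHOLD` is the largest horizontal gap allowed between two words of the same phrase. The `Word` class, `group_into_phrases`, and the coordinates are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    line: int      # index of the text line the word sits on (assumed)
    x_start: int   # left edge of the word (assumed OCR coordinate)
    x_end: int     # right edge of the word (assumed OCR coordinate)

def group_into_phrases(words, x_dist_threshold):
    """Merge consecutive words on the same line whose gap is <= threshold."""
    phrases = []
    current = []
    for word in sorted(words, key=lambda w: (w.line, w.x_start)):
        if current:
            prev = current[-1]
            same_line = word.line == prev.line
            gap = word.x_start - prev.x_end
            if not (same_line and gap <= x_dist_threshold):
                # Gap too wide or new line: close the current phrase
                phrases.append(' '.join(w.text for w in current))
                current = []
        current.append(word)
    if current:
        phrases.append(' '.join(w.text for w in current))
    return phrases

# Made-up words and coordinates
words = [
    Word('JANE', 0, 100, 300),
    Word('Q', 0, 350, 400),
    Word('DOE', 0, 450, 600),
    Word('ACME', 0, 1400, 1800),
    Word('INC', 1, 100, 250),
]
print(group_into_phrases(words, x_dist_threshold=200))
```

Under these assumptions, a name that wraps onto a second line (here `ACME` and `INC`) never appears as a single candidate, which is the kind of case where a golden name cannot be extracted.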

In [18]:
PROCESSING_CONFIG = {
    'X_DIST_THRESHOLD': 200
}

data.generate_candidates_phrases(PROCESSING_CONFIG)

INFO:root:Generating candidates for 142 files


## Feature Engineering (Generate Labeled Data)
- Specify DATA_ARGS, which includes the task and data objects created earlier
- List the fields of interest (for extraction and classification) under *candidates_fields* in DATA_ARGS
- Generate test data from the goldens (the actual person and company names)
- Alternatively, generate test data from the *candidate phrases* produced by *data.generate_candidates_phrases()*
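
The coverage check that `data.compare_candidates_and_goldens()` reports can be approximated as follows. This is a sketch, not the framework's code: it assumes a golden counts as "found" when it equals one of the file's candidate phrases after whitespace and case normalization, and the function names and sample data are hypothetical. Only the `candidates_fields` mapping is taken from the notebook.

```python
def normalize(text):
    # Assumed normalization: collapse whitespace and ignore case.
    return ' '.join(str(text).split()).upper()

def count_goldens_in_candidates(goldens, candidates, fields):
    """Count, per label, the files whose golden value is among the candidates.

    goldens: {filename: {column: value}}
    candidates: {filename: [phrase, ...]}
    fields: {label: golden column}, e.g. DATA_ARGS['candidates_fields']
    """
    found = {label: 0 for label in fields}
    for filename, record in goldens.items():
        phrases = {normalize(p) for p in candidates.get(filename, [])}
        for label, column in fields.items():
            value = record.get(column)
            if value and normalize(value) in phrases:
                found[label] += 1
    return found

fields = {'person': 'employee_name', 'org': 'employer_name'}
# Made-up goldens and candidate phrases
goldens = {
    'a.PDF': {'employee_name': 'JANE Q DOE', 'employer_name': 'ACME INC'},
    'b.PDF': {'employee_name': 'JOHN ROE', 'employer_name': 'EXAMPLE CORP'},
}
candidates = {
    'a.PDF': ['Jane Q  Doe', 'ACME', 'INC'],
    'b.PDF': ['JOHN ROE', 'EXAMPLE CORP', 'Wages, tips'],
}
print(count_goldens_in_candidates(goldens, candidates, fields))
```

Counts of this kind put a ceiling on recall, since a name that is not among the candidates cannot be predicted. In the logged run of this notebook, 130 person names and 69 org names were found across 142 files.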

In [19]:
DATA_ARGS = {
    'task': task,
    'dataset': data,
    'candidates_fields': {
        'person':'employee_name',
        'org':'employer_name'
    }
}

data.compare_candidates_and_goldens(DATA_ARGS['candidates_fields'])
fe = FeatureEngineeringNER(DATA_ARGS)
test_data_from_goldens = fe.generate_test_samples_from_goldens() # single dataframe
test_data_from_candidates = fe.generate_test_samples_from_candidates() # dict{'filename' : dataframe}

INFO:root:For X_DIST_THRESHOLD configuraion: 200
INFO:root:total files: 142
person names found in candidates: 130
org names found in candidates: 69



## Modeling (BERT Inference for sequence classification)
- Specify the model and load the fine-tuned weights for inference
- The model used in this solution was trained using the train_NER.ipynb script (or equivalent)
- The model starts from a pretrained BERT classifier that was then fine-tuned on publicly available datasets (Kaggle W2 data or public lists of names)

### Specify TRAINING_ARGS
- Specify the model type (*model_type*), which is passed to the back-end Hugging Face libraries
- Specify the number of labels (for multi-class classification)
- Set *gpu* to True to run the deep learning libraries on a GPU

In [20]:
MODEL_PATHS = {
    'w2' : '/Users/ahsaasbajaj/Documents/Code/ner-hf/sequence-classification/w2/no-address/5/model.pt', # trained on public w2 from Kaggle
    'public': '/Users/ahsaasbajaj/Documents/Code/ner-hf/sequence-classification/public/no-address/200/model.pt' # trained on public names repo
}

TRAINING_ARGS = {
    'model_file_or_path' : MODEL_PATHS['w2'],
    'model_type': 'bert-large-cased',
    'num_labels': TASK_CONFIG['num_labels'],
    'gpu': False,
}

model = BERTNER(DATA_ARGS, TRAINING_ARGS)

## Predictions
- Set up the model evaluator and run it on either test set generated in Feature Engineering (from goldens or from candidate phrases)
- This runs BERT inference (in a classification setting) and extracts the predicted person and company names
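
The step from classifier scores to extracted names can be sketched without running BERT at all. The example below assumes the classifier returns one logit per class for each candidate phrase; the logits are made up, and `extract_names` is a hypothetical helper, not part of `bert_ner`. The label mapping is the one in `TASK_CONFIG['labels_dict']`.

```python
import math

LABELS = {'person': 0, 'org': 1, 'none': 2}  # same mapping as TASK_CONFIG['labels_dict']
ID_TO_LABEL = {idx: name for name, idx in LABELS.items()}

def softmax(logits):
    # Subtract the max for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def extract_names(phrases, logits_per_phrase):
    """Assign each phrase its highest-scoring class and keep the non-'none' ones."""
    extracted = {'person': set(), 'org': set()}
    for phrase, logits in zip(phrases, logits_per_phrase):
        probs = softmax(logits)
        label = ID_TO_LABEL[probs.index(max(probs))]
        if label != 'none':
            extracted[label].add(phrase)
    return extracted

phrases = ['JANE Q DOE', 'ACME INC', 'Wages, tips, other compensation']
logits = [[4.1, 0.3, -1.2], [0.2, 3.5, 0.1], [-2.0, -0.5, 5.0]]  # made-up scores
print(extract_names(phrases, logits))
```

Because every candidate phrase is classified independently, one file can yield several phrases for the same field, which is why the results are collected in sets here.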

## Evaluation
- Use *model.analyze_result()* to compare predictions with goldens.
- It also calculates metrics such as recall, precision, and F1 score
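
For reference, the textbook set-based definitions of these metrics are sketched below. How `model.analyze_result()` matches and aggregates predictions is not shown in this notebook, so this is only the standard formulation; `precision_recall_f1` and the sample data are hypothetical.

```python
def precision_recall_f1(predicted, golden):
    """Set-based metrics for one field.

    predicted, golden: sets of (filename, name) pairs.
    """
    true_positives = len(predicted & golden)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(golden) if golden else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Made-up example: every golden name is predicted, plus two false positives
golden = {('a.PDF', 'JANE Q DOE'), ('b.PDF', 'JOHN ROE')}
predicted = {
    ('a.PDF', 'JANE Q DOE'), ('a.PDF', 'ACME INC'),
    ('b.PDF', 'JOHN ROE'), ('b.PDF', 'EXAMPLE CORP'),
}
p, r, f1 = precision_recall_f1(predicted, golden)
print(f'recall: {r:.4f}, precision: {p:.4f}, F1: {f1:.4f}')
```

F1 is the harmonic mean of precision and recall, which agrees with the logged person line in this notebook (recall 1.0000, precision 0.5067, F1 0.6726). The logged org line (recall 0.0000 with precision 0.4000) could not come from a single pooled computation like this one, so `analyze_result()` presumably matches or aggregates differently.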

In [21]:
# Predictions
# output_golden = model.predict(test_data_from_goldens) # single dataframe 

# print('Sample outputs: ', output_golden.head())
# model.analyze_golden_result(output_golden)


# Do only for debugging and getting quick results
test_data = FeatureEngineering.get_subset_for_debugging(test_data_from_candidates, sample_size=5)

output = model.predict(test_data) # output is a dictionary
print('Number of files: ', len(output.keys()))
results = model.analyze_result(output)

INFO:root:inferring BERT classifier for file last_year_w2_1493919644111.PDF
INFO:root:inferring BERT classifier for file last_year_w2_1494271162294.PDF
INFO:root:inferring BERT classifier for file last_year_w2_1495120461121.PNG
INFO:root:inferring BERT classifier for file last_year_w2_1494972980996.PDF
INFO:root:inferring BERT classifier for file last_year_w2_1494609579761.PDF
INFO:root:For field person, recall: 1.0000, precision: 0.5067, F1: 0.6726 
INFO:root:For field org, recall: 0.0000, precision: 0.4000, F1: 0.0000 
Number of files:  5


## DEMO
- Specify the local path of the PDFs to run a quick demo
- Pick DEMO_FILE from the files sampled in the block above and paste its name into the block below
- This prints the person and company names extracted by the BERT model

In [23]:
DIR_PATH = '/Users/ahsaasbajaj/Documents/Data/w2-instabase/pdf'

# Choose one file from the list printed above (Samples)
DEMO_FILE = 'last_year_w2_1493919644111.PDF'

# Open the source document in the default browser for comparison
FILE_PATH = os.path.join(DIR_PATH, DEMO_FILE)
webbrowser.open_new('file://' + FILE_PATH)

model.demo(results, DEMO_FILE)

INFO:root:Field type: person
INFO:root:filename: last_year_w2_1493919644111.PDF
INFO:root:{'JPMORGAN CHASE BANK', 'JUDITH VILLARREAL'}
INFO:root:Field type: org
INFO:root:filename: last_year_w2_1493919644111.PDF
INFO:root:{'Dept. of the Treasury - IRS', "oyee's name, a s, and ZIP code"}
