# Evaluate Refiner Outputs
- This notebook takes refiner outputs and computes evaluation metrics (recall, precision, F1) for the extraction of person and company names.

Final Intern presentation: https://docs.google.com/presentation/d/10mXA7K5sa_nAkqx2onsIfrH3TPj2Ni4LfCOxDhN5XBI/edit?usp=sharing


In [28]:
import os
import importlib
import logging
importlib.reload(logging)
import framework
importlib.reload(framework)
import refiner
importlib.reload(refiner)
import infer_bert_classifier
importlib.reload(infer_bert_classifier)
import bert_utils
importlib.reload(bert_utils)
import pandas as pd
import webbrowser
from framework import DataCuration
from refiner import Refiner, TaskNER

# Define some constants and configurations
logging.getLogger().setLevel(logging.INFO)
ACCESS_TOKEN = 'WUpGevbWC9lsnTW8quNUtmWRdAEM89'

## Specify Task
- Specify the task configuration and create a task object

In [29]:
DATASET = 'w2' # supports w2 and resume
TASK_CONFIG = {
    'task': 'ner',
    'num_labels': 3,
    'labels_dict': {'person' : 0, 'org' : 1, 'none': 2}
}

task = TaskNER(TASK_CONFIG)
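
The `labels_dict` above maps each entity label to an integer id. As a minimal sketch, the mapping can be checked against `num_labels` and inverted so that predicted ids can be decoded back to label names. The helper `invert_labels` is hypothetical and is not part of the `refiner` module.

```python
# Hypothetical helper for illustration; not part of the refiner package.
TASK_CONFIG = {
    'task': 'ner',
    'num_labels': 3,
    'labels_dict': {'person': 0, 'org': 1, 'none': 2},
}

def invert_labels(config):
    """Return an id -> label mapping after checking the config is consistent."""
    labels = config['labels_dict']
    # Every label needs a distinct id in the range [0, num_labels).
    if sorted(labels.values()) != list(range(config['num_labels'])):
        raise ValueError('label ids must cover 0..num_labels-1 exactly once')
    return {idx: name for name, idx in labels.items()}

ID_TO_LABEL = invert_labels(TASK_CONFIG)
print(ID_TO_LABEL)  # {0: 'person', 1: 'org', 2: 'none'}
```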

## Curate dataset
- Specify paths for the dataset and the goldens (if available). Paths can be local or on Instabase drives (set *is_local* accordingly).
- Also specify configurations such as file extensions and the column name to use as the index.
- Goldens are currently supported in csv format, and the dataset in ibocr/ibdoc format.
- Set *convert2txt* to extract and store raw texts.
- This block creates a DataCuration() object, which maps the goldens to the dataset, removes any mismatches, and produces a 1:1 mapping.
- The object can be queried with *data.golden*, *data.dataset*, or *data.dataset.texts* as needed.

### Files:
- Run the flow on the raw files, then download the resulting documents (ibocr or ibdoc files) from https://dogfood.instabase.com/dlluncor/lending-club/fs/Instabase%20Drive/workspace-us-markets/w2/data-500/input/ (use the *s2_map_records* output).
- Sample flow outputs and processed goldens are available at https://drive.google.com/drive/folders/1h1eHP1Jy8FmRoCehfKQ9dblIwJ8OEmwC?usp=sharing. Download them, specify the local directory path in the code below, and set *is_local* to True.
- Alternatively, specify Instabase drive paths and set *is_local* to False.

In [30]:
W2_DATA = [
   '/Users/ahsaasbajaj/Documents/Data/w2-instabase/flow/s2_map_records'
]
W2_GOLDEN = [
   '/Users/ahsaasbajaj/Documents/Data/w2-instabase/golden/goldens.csv'
]

GOLDEN_CONFIG = {
    'path': W2_GOLDEN,
    'is_local': True,
    'index_field_name':'filename',
    'file_type': 'csv',
    'identifier': 'file'
}
DATASET_CONFIG = {
    'path': W2_DATA,
    'is_local': True, 
    'file_type': 'ibocr',
    'identifier': lambda path: os.path.basename(path).split('.ibocr')[0],
    'convert2txt': True
}

data = DataCuration(ACCESS_TOKEN, DATASET_CONFIG, GOLDEN_CONFIG)

INFO:root:Loading dataset from /Users/ahsaasbajaj/Documents/Data/w2-instabase/flow/s2_map_records
INFO:root:142 files loaded
INFO:root:Converting IBOCR/IBDOC to raw texts
INFO:root:Loading goldens from /Users/ahsaasbajaj/Documents/Data/w2-instabase/golden/goldens.csv
INFO:root:Total files Goldens: (154, 25)
INFO:root:Total files found in the source with unique index: (142, 25)
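
The log above shows 154 golden rows reduced to 142 once they are matched against the loaded files. The internals of `DataCuration` are not shown in this notebook, so the following is only a sketch of how such a 1:1 alignment could work, assuming each dataset file is named `<golden filename>.ibocr`. The `align` helper and the toy file names are hypothetical.

```python
import os

def identifier(path):
    # Same rule as DATASET_CONFIG['identifier']: drop the directory and the .ibocr suffix.
    return os.path.basename(path).split('.ibocr')[0]

def align(golden_rows, paths):
    """Keep only documents present in both the goldens and the dataset."""
    dataset = {identifier(p): p for p in paths}
    shared = sorted(set(golden_rows) & set(dataset))
    # Names that appear on only one side are reported rather than silently lost.
    dropped = sorted(set(golden_rows) ^ set(dataset))
    golden = {name: golden_rows[name] for name in shared}
    data = {name: dataset[name] for name in shared}
    return golden, data, dropped

# Toy inputs (hypothetical file names and values).
golden_rows = {
    'a.PDF': {'employee_name': 'A'},
    'b.PDF': {'employee_name': 'B'},
    'c.PDF': {'employee_name': 'C'},
}
paths = ['/data/a.PDF.ibocr', '/data/c.PDF.ibocr', '/data/d.PDF.ibocr']

golden, data, dropped = align(golden_rows, paths)
print(sorted(golden), dropped)  # ['a.PDF', 'c.PDF'] ['b.PDF', 'd.PDF']
```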


### Print Goldens

In [31]:
data.golden.head()

| filename | employee_ssn | box5_medicare_wages | box3_ss_wage | box6_medicare_withholding | box4_ss_withholding | box2_fed_withhold | box17_state_income_tax | box1_wage | box8_allocated_tips | box14_other | ... | box12c_amount | box12d_code | box12d_amount | employer_federal_ein | document_type | template_name | employer_name | employee_name | w2_year | gross_pay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| last_year_w2_1493334985571.PDF | 561-87-0728 | 36505.83 | 36505.83 | 529.33 | 2263.36 | 4093.92 | 519.22 | 36505.83 | | ["328.55"] | ... | | | | 01-0726495 | W2 | general_w2 | BROKER SOLUTIONS | PATRICIA HEREDIA | 2016.0 | 39105.41 |
| last_year_w2_1493334989467.PDF | 408-31-3195 | 51350.25 | 51350.25 | 744.58 | 3183.72 | 6940.69 | | 47242.23 | | | ... | | | | 06-1102358 | W2 | general_w2 | FORMAN INDUSTRIES INC | THOMAS V. MOORE | 2016.0 | 51350.25 |
| last_year_w2_1493334998968.PDF | 261-77-1595 | 105916.49 | 105916.49 | 1535.82 | 6566.82 | 24471.02 | | 105916.49 | | | ... | | | | 36-4248787 | W2 | general_w2 | YASH-LUJAN CONSULTING INC Y & L CONSULTING, INC | STACY L STUMETZ | 2016.0 | 110240.0 |
| last_year_w2_1493335006405.PDF | 452-93-6475 | 35987.53 | 35987.53 | 521.82 | 2231.23 | 2814.31 | | 35987.53 | | | ... | | | | 74-2482708 | W2 | general_w2 | TECO-WESTINGHOUSE MOTOR COMPANY | HENRY COTTLE | 2016.0 | 43827.05 |
| last_year_w2_1493752474038.PDF | 365-04-7683 | 85245.86 | 85245.86 | 1236.06 | 5285.24 | 13629.89 | 3129.87 | 77722.96 | | ["2069.50", "9.00"] | ... | 10815.96 | | | 75-2778918 | W2 | general_w2 | FLOWSERVE US INC | JASON ALLEN JERZ | 2016.0 | 88420.2 |



## Modeling (Pre-Generated Refiner Output)
- Specify DATA_ARGS, which includes the task and data objects created above
- List the fields of interest (for extraction or classification) in DATA_ARGS under *candidates_fields*

### Output files
- Load outputs from the refiner flow (after step 4, which produces a single out.ibocr)
- Sample outputs are available here: https://drive.google.com/drive/folders/1zzq8cM2i2Ek_9T8fTmlgbBF45ZZxjYYX?usp=sharing
- Download the above files and set *model_file_or_path* in TRAINING_ARGS to their path

### Specify TRAINING_ARGS
- Specify the path to the refiner results above
- Specify the number of labels (taken from TASK_CONFIG)

### Specify field to evaluate in *MODELS_TO_EVAL*


In [32]:
W2_REFINER_RESULT_PATH = '/Users/ahsaasbajaj/Documents/Data/refiner_results/w2.ibocr'
RESUME_REFINER_RESULT_PATH = '/Users/ahsaasbajaj/Documents/Data/refiner_results/resume.ibocr'

DATA_ARGS = {
    'task': task,
    'dataset': data,
    'candidates_fields': {
        'person':'employee_name',
        'org':'employer_name'
    }
}
TRAINING_ARGS = {
    'model_file_or_path' : W2_REFINER_RESULT_PATH,
    'num_labels': TASK_CONFIG['num_labels'],
}

MODELS_TO_EVAL = {
    'models': ['names_vontell', 'names_token_matcher'],
    'spacy_models': ['names_spacy', 'org_spacy'],

    'person_name_models': ['names_vontell', 'names_token_matcher', 'names_spacy'],
    'org_name_models': ['org_spacy'],
}

model = Refiner(DATA_ARGS, TRAINING_ARGS, MODELS_TO_EVAL)
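
The `candidates_fields` entry ties each entity type to the golden column that holds its ground truth. A minimal sketch of that lookup for a single document follows; the helper `golden_by_entity` is hypothetical, and the row values are copied from the goldens printed earlier in this notebook.

```python
CANDIDATES_FIELDS = {'person': 'employee_name', 'org': 'employer_name'}

# One golden row, restricted to the two columns of interest.
golden_row = {
    'employer_name': 'BROKER SOLUTIONS',
    'employee_name': 'PATRICIA HEREDIA',
}

def golden_by_entity(row, candidates_fields):
    """Map each entity type to its golden value for a single document."""
    return {entity: row[column] for entity, column in candidates_fields.items()}

print(golden_by_entity(golden_row, CANDIDATES_FIELDS))
# {'person': 'PATRICIA HEREDIA', 'org': 'BROKER SOLUTIONS'}
```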

## Evaluation
- Use *model.analyze_results()* to compare predictions with goldens.
- It also calculates recall, precision, and F1 score for each model.

In [33]:
results = model.analyze_results()
print(results.keys())
print(results['person'].keys())
# print(results['person']['names_vontell'].keys())

INFO:root:
Person Name Scores
INFO:root:For model names_vontell, recall: 0.7465, precision: 0.4180, F1: 0.5359 
INFO:root:For model names_token_matcher, recall: 0.6549, precision: 0.4602, F1: 0.5405 
INFO:root:For model names_spacy, recall: 0.0915, precision: 0.0034, F1: 0.0066 
INFO:root:
Org Name Scores
INFO:root:For model org_spacy, recall: 0.0775, precision: 0.0012, F1: 0.0023 
dict_keys(['person', 'org'])
dict_keys(['names_vontell', 'names_token_matcher', 'names_spacy'])
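
The F1 values in the log are consistent with the harmonic mean of the reported precision and recall. The sketch below shows one way such scores could be computed from per-document candidate sets. The definitions of recall and precision used here are assumptions, not confirmed against the `refiner` code, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def score(predictions, goldens):
    """Score candidate sets against one golden string per document.

    Assumed definitions (not taken from the refiner code):
      recall    = documents whose golden value is among the candidates / documents
      precision = candidates equal to the golden value / all candidates
    """
    hits = sum(1 for doc, gold in goldens.items()
               if gold in predictions.get(doc, set()))
    total_candidates = sum(len(cands) for cands in predictions.values())
    recall = hits / len(goldens) if goldens else 0.0
    precision = hits / total_candidates if total_candidates else 0.0
    return recall, precision, f1_score(precision, recall)

# Reproduce the logged F1 for names_vontell from its logged precision and recall.
print(round(f1_score(0.4180, 0.7465), 4))  # 0.5359

# A single-document example with one correct and one wrong candidate.
goldens = {'doc.PDF': 'CHRISTINA A MEWIS'}
predictions = {'doc.PDF': {'CHRISTINA A MEWIS', 'EVELYN BAIRD GENTRY'}}
recall, precision, f1 = score(predictions, goldens)
```

In the single-document example, the golden name is found (recall 1.0) but only one of the two candidates is correct (precision 0.5).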


In [34]:
# data.dataset.keys()

## DEMO
- Specify the local path of the PDFs to run a quick demo
- Pick DEMO_FILE from the names listed by data.dataset.keys() (and paste it in the block below)
- This prints the extracted person and company names according to the refiner outputs

In [35]:
DIR_PATH = '/Users/ahsaasbajaj/Documents/Data/w2-instabase/pdf'
DEMO_FILE = 'last_year_w2_1494607092402.PDF'

# Build the absolute path and open the PDF in the default browser
FILE_PATH = os.path.join(DIR_PATH, DEMO_FILE)
webbrowser.open_new('file://' + FILE_PATH)

True

In [36]:
model.demo(results, DEMO_FILE)

INFO:root:golden person: CHRISTINA A MEWIS
INFO:root:golden company: EVELYN BAIRD GENTRY CORP DBA CAPP ELECTR
INFO:root:Field type: person
INFO:root:model type: names_vontell
INFO:root:{'CHRISTINA A MEWIS', 'EVELYN BAIRD GENTRY'}
INFO:root:

INFO:root:model type: names_token_matcher
INFO:root:{'BAIRD GENTRY', 'BAIRD GENTRY CORP', 'EVELYN BAIRD', 'EVELYN BAIRD GENTRY'}
INFO:root:

INFO:root:model type: names_spacy
INFO:root:{'Suff', '61071.57', '017128671', 'W2 B', 'DD', "Employer's", 'Filed', 'CHRISTINA', 'Dependent', 'Wage'}
INFO:root:

INFO:root:

INFO:root:Field type: org
INFO:root:model type: org_spacy
INFO:root:{'Safe', 'TX', 'Medicare', 'EVELYN', '7 Social', '2', 'BAIRD GENTRY CORP', 'Tax 2016', 'the Treasury- Internal Revenue Service Form Statement Copy B-', 'Department', "Employee's", 'Employee', 'HOUSTON', 'EIN', 'the Internal Revenue Service'}
INFO:root:

INFO:root:

