# Experiment 4: Testing on consolidated dataset

## In the consolidated dataset, all similar forms share a single class label, e.g. 'aicf_pg1', 'aicf_v1', 'aicf_v2' and 'aicf_v3' are all labelled 'aicf_pg1'.
## This is effectively how CBP works at present. Because the template matrix is still fine-grained, we would expect a lower F1 score than in experiments 1 and 2.

## Approach: 
* Universal Sentence Encoder Large 
* Template matrix built from text encodings
* No text preprocessing
* Similarity measured between template and document.
* Similarity below 0.7 ruled 'other'.
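
The matching step in the list above can be sketched as follows. This is a minimal illustration, not the experiment's code: it assumes cosine similarity (the notes only say "similarity"), uses small made-up vectors in place of Universal Sentence Encoder embeddings, and the names `classify`, `template_matrix` and `template_names` are hypothetical.

```python
import numpy as np

def classify(doc_vec, template_matrix, template_names, threshold=0.7):
    """Return (label, score) for the most similar template, or 'other' below threshold."""
    # Normalise so that the dot product equals cosine similarity (an assumption).
    t = template_matrix / np.linalg.norm(template_matrix, axis=1, keepdims=True)
    d = doc_vec / np.linalg.norm(doc_vec)
    sims = t @ d
    best = int(np.argmax(sims))
    score = float(sims[best])
    label = template_names[best] if score >= threshold else "other"
    return label, score

# Toy 3-dimensional "encodings"; real sentence encodings are much longer vectors.
template_names = ["aicf_pg1", "aicf_pg2"]
template_matrix = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0]])

close_doc = np.array([0.9, 0.1, 0.0])  # close to the first template
far_doc = np.array([0.5, 0.5, 0.7])    # not close to either template

label_close, score_close = classify(close_doc, template_matrix, template_names)
label_far, score_far = classify(far_doc, template_matrix, template_names)
```

Here `close_doc` is assigned to `aicf_pg1`, while `far_doc` scores below the 0.7 cut-off against both templates and falls back to `other`.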

## Dataset:
* test_data_consolidated_types.xlsx.
* 476 documents tested (some not included due to textract problem).
* 19 templates used - [acdbcf, ahwcf, ahwcf_v3, ahwcf_v4, aicf_pg1, aicf_pg2, aicf_v1, aicf_v2, aicf_v3, canscr, clmapp, hicf_pg1, hicf_pg2, init_pg1_v2, init_pg3, phystmt, ptscf, pvbcf]
* Available [here](s3://aaca-ani-experiments-data/aaca-docdig-test/offline_cf_classification/templates/template_images/) - includes forms not currently considered classes in CBP.

# Result: 92% F1 Score - Even here, where some predictions are certain to be wrong, we get high accuracy.
#         When we force results into the consolidated form, we return to a 97% F1 score.

In [None]:
from sklearn.metrics import classification_report
from utils import fix_others
import pandas as pd

In [5]:
results_df = pd.read_excel("data/results/experiment_1.xlsx", index_col=0, engine='openpyxl')

In [6]:
results_df.head(2)

Unnamed: 0,json,template,textarct_key_value_dictionary,text,png_path,template_from_interim_logic,Unnamed: 7,revised_template,results,score
0,doc-digitization-pipeline/AD/P0X024W1_02751999...,ahwcf,"{'ZIP:': '3 2 5 0 4', 'Primary Pollcyholder': ...",02-26-21:11:174M; ;11 # 2/ 2 20 ACCIDENT WELLN...,doc-digitization-pipeline/AD/P0X024W1_02751999...,ahwcf,,ahwcf_v3,ahwcf_v3,0.856593
1,doc-digitization-pipeline/AD/PX372696_02641791...,aicf_pg2,"{'*Date of Birth (mm/dd/yy)': '/ /', 'DATE': '...",PX372696 Policyholder Information: *Last Name ...,doc-digitization-pipeline/AD/PX372696_02641791...,aicf_pg2,,aicf_pg2,aicf_pg2,0.91746


In [7]:
labels_df = pd.read_excel("data/test_data_consolidated_types.xlsx", index_col=0, engine='openpyxl')

In [8]:
labels_df.head(2)

Unnamed: 0,json,template,textarct_key_value_dictionary,text,png_path,template_from_interim_logic,Unnamed: 7,revised_template
0,doc-digitization-pipeline/AD/P0X024W1_02751999...,ahwcf,"{'ZIP:': '3 2 5 0 4', 'Primary Pollcyholder': ...",02-26-21:11:174M; ;11 # 2/ 2 20 ACCIDENT WELLN...,doc-digitization-pipeline/AD/P0X024W1_02751999...,ahwcf,,ahwcf
1,doc-digitization-pipeline/AD/PX372696_02641791...,aicf_pg2,"{'*Date of Birth (mm/dd/yy)': '/ /', 'DATE': '...",PX372696 Policyholder Information: *Last Name ...,doc-digitization-pipeline/AD/PX372696_02641791...,aicf_pg2,,aicf_pg2


In [20]:
labels_df['revised_template'].value_counts()

aicf_pg1    235
aicf_pg2    192
ptscf        21
pvbcf        12
ahwcf        11
other        11
hicf_pg2      7
init_pg3      1
init_pg1      1
Name: revised_template, dtype: int64

In [9]:
results_df['labels'] = labels_df['revised_template'].values
results_df.head(2)

Unnamed: 0,json,template,textarct_key_value_dictionary,text,png_path,template_from_interim_logic,Unnamed: 7,revised_template,results,score,labels
0,doc-digitization-pipeline/AD/P0X024W1_02751999...,ahwcf,"{'ZIP:': '3 2 5 0 4', 'Primary Pollcyholder': ...",02-26-21:11:174M; ;11 # 2/ 2 20 ACCIDENT WELLN...,doc-digitization-pipeline/AD/P0X024W1_02751999...,ahwcf,,ahwcf_v3,ahwcf_v3,0.856593,ahwcf
1,doc-digitization-pipeline/AD/PX372696_02641791...,aicf_pg2,"{'*Date of Birth (mm/dd/yy)': '/ /', 'DATE': '...",PX372696 Policyholder Information: *Last Name ...,doc-digitization-pipeline/AD/PX372696_02641791...,aicf_pg2,,aicf_pg2,aicf_pg2,0.91746,aicf_pg2
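
Assigning with `.values` copies the labels by position, so it relies on both spreadsheets listing the documents in the same order. A hedged sketch of a guard, assuming the `json` column identifies a document in both frames (the toy frames and file names below are invented stand-ins for the real spreadsheets):

```python
import pandas as pd

# Toy stand-ins for results_df and labels_df.
results_toy = pd.DataFrame({"json": ["a.json", "b.json"],
                            "results": ["ahwcf_v3", "aicf_pg2"]})
labels_toy = pd.DataFrame({"json": ["a.json", "b.json"],
                           "revised_template": ["ahwcf", "aicf_pg2"]})

# Fail early if the two frames do not describe the same documents in the same order.
same_order = results_toy["json"].reset_index(drop=True).equals(
    labels_toy["json"].reset_index(drop=True)
)
if not same_order:
    raise ValueError("results and labels are not aligned row by row")

# Only now is a positional assignment safe.
results_toy["labels"] = labels_toy["revised_template"].values
```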


In [10]:
# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning.
filtered_df = results_df[results_df['results'] != -100].copy()

In [11]:
filtered_df['results'] = filtered_df.apply(fix_others, axis=1)


# The zero-support rows below are all associated with 'classes' that are not in the test set. We know this is not really the ground truth, but it still indicates good performance.
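
To see how zero-support rows affect the averages, the sketch below computes per-class F1 by hand on a toy example (the labels and predictions are invented for illustration, not taken from the experiment). A class that is predicted but never appears in the labels gets an F1 of 0 with support 0: it lowers the macro average but carries no weight in the weighted average, although the misrouted documents still reduce recall for their true class.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {class: (f1, support)} for every class seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    out = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Undefined ratios are set to 0.0 here.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        out[c] = (f1, support[c])
    return out

# Consolidated labels, fine-grained predictions: one 'aicf_pg1' document is predicted as 'aicf_v1'.
y_true = ["aicf_pg1"] * 4 + ["aicf_pg2"] * 2
y_pred = ["aicf_pg1", "aicf_pg1", "aicf_pg1", "aicf_v1", "aicf_pg2", "aicf_pg2"]

scores = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / len(y_true)
```

In this toy case `aicf_v1` has support 0 and F1 0, so `macro_f1` is pulled well below `weighted_f1`.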

In [13]:
print(classification_report(list(filtered_df['labels']),list(filtered_df['results'])))

              precision    recall  f1-score   support

      acdbcf       0.00      0.00      0.00         0
       ahwcf       1.00      0.44      0.62         9
    ahwcf_v3       0.00      0.00      0.00         0
    ahwcf_v4       0.00      0.00      0.00         0
    aicf_pg1       1.00      0.86      0.93       230
    aicf_pg2       1.00      0.95      0.98       187
     aicf_v1       0.00      0.00      0.00         0
     aicf_v2       0.00      0.00      0.00         0
     aicf_v3       0.00      0.00      0.00         0
      canscr       0.00      0.00      0.00         0
      clmapp       0.00      0.00      0.00         0
    hicf_pg1       0.00      0.00      0.00         0
    hicf_pg2       0.88      1.00      0.93         7
    init_pg1       0.00      0.00      0.00         1
 init_pg1_v2       0.00      0.00      0.00         0
    init_pg3       1.00      1.00      1.00         1
       other       1.00      0.22      0.36         9
     phystmt       0.00    



# Same results, but now force similar forms to be of the same class (this is effectively what currently happens in CBP)

# Once again, we reach an F1 score of 0.97. It seems likely that this approach is an improvement on CBP.

In [14]:
supported_labels = ['ahwcf', 'aicf_pg1', 'aicf_pg2', 'hicf_pg1', 'hicf_pg2', 'init_pg1', 'init_pg3', 'ptscf', 'pvbcf', 'other']

In [27]:
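def for_supported_labels(row, supported_labels):
    """Map a fine-grained prediction to its consolidated class label.

    Note: `supported_labels` is currently unused; the mapping is hard-coded below.
    """
    # Versioned aicf forms collapse into 'aicf_pg1'.
    if row['results'] in ['aicf_v1', 'aicf_v2', 'aicf_v3']:
        return 'aicf_pg1'
    # Versioned ahwcf forms collapse into 'ahwcf'.
    elif row['results'] in ['ahwcf_v3', 'ahwcf_v4']:
        return 'ahwcf'
    # Templates that are not consolidated classes are treated as 'other'.
    elif row['results'] in ['acdbcf', 'canscr', 'clmapp', 'phystmt']:
        return 'other'
    elif row['results'] in ['init_pg1_v2']:
        return 'init_pg1'
    else:
        return row['results']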
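
The same consolidation can also be written as a lookup table, which keeps the mapping in one place and makes it easy to check against the supported labels. This is an alternative sketch, not the code used for the results below; `CONSOLIDATION_MAP` and `consolidate` are hypothetical names, while the mapping itself mirrors the function above.

```python
# Fine-grained template name -> consolidated class label.
CONSOLIDATION_MAP = {
    "aicf_v1": "aicf_pg1", "aicf_v2": "aicf_pg1", "aicf_v3": "aicf_pg1",
    "ahwcf_v3": "ahwcf", "ahwcf_v4": "ahwcf",
    "acdbcf": "other", "canscr": "other", "clmapp": "other", "phystmt": "other",
    "init_pg1_v2": "init_pg1",
}

supported_labels = ["ahwcf", "aicf_pg1", "aicf_pg2", "hicf_pg1", "hicf_pg2",
                    "init_pg1", "init_pg3", "ptscf", "pvbcf", "other"]

def consolidate(label):
    """Return the consolidated label, leaving already-consolidated labels unchanged."""
    return CONSOLIDATION_MAP.get(label, label)

# Sanity check: every mapping target must be a supported label.
unsupported_targets = set(CONSOLIDATION_MAP.values()) - set(supported_labels)
if unsupported_targets:
    raise ValueError(f"mapping targets not supported: {unsupported_targets}")

examples = [consolidate(x) for x in ["aicf_v2", "ahwcf_v4", "clmapp", "ptscf"]]
```

With pandas, the function could be applied to a whole column with `Series.map(consolidate)` instead of a row-wise `apply`.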

In [28]:
filtered_df['supported_results'] = filtered_df.apply(lambda row: for_supported_labels(row, supported_labels), axis=1)



In [29]:
print(classification_report(list(filtered_df['labels']),list(filtered_df['supported_results'])))

              precision    recall  f1-score   support

       ahwcf       0.90      1.00      0.95         9
    aicf_pg1       1.00      0.99      0.99       230
    aicf_pg2       1.00      0.95      0.98       187
    hicf_pg1       0.00      0.00      0.00         0
    hicf_pg2       0.88      1.00      0.93         7
    init_pg1       1.00      1.00      1.00         1
    init_pg3       1.00      1.00      1.00         1
       other       0.78      0.78      0.78         9
       ptscf       0.64      0.90      0.75        20
       pvbcf       1.00      1.00      1.00        12

    accuracy                           0.97       476
   macro avg       0.82      0.86      0.84       476
weighted avg       0.98      0.97      0.97       476



