# Doc identification based on taxonomy markers
Tax documents form a densely populated yet very tight semantic cluster: identification by textual content alone may be inefficient or insufficient because of significant semantic overlap and the noise introduced by user inputs. Identification by form title would not help much either: some forms share all or part of their title, and we would have to locate the title first, which can be a challenge with uploaded scans where both the page orientation and the page order may be mixed up.

The official forms usually have `taxonomy markers` (type identifiers) displayed on the first page (sometimes on every page), most often in the header and/or footer areas. If we could read those markers reliably and had a lookup table, document identification would be straightforward no matter how populated the domain is. For multi-page documents we only need to classify one page with confidence. Often we don't even need to scan the whole page: the header and footer areas are enough.
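To make the lookup idea concrete, here is a minimal sketch. The table entries and document ids are hypothetical, and the matching is a plain upper-case substring test (separator handling is ignored here); it only illustrates that one page with a readable marker is enough to identify a multi-page document.

```python
# hypothetical lookup table: marker text as printed on the form -> document id
LOOKUP = {
    "FORM 1040": "irs-f1040",
    "T2125": "cnd-t2125",
}

def identify(page_texts):
    """Return (document id, page number) for the first page showing a known marker."""
    for number, text in enumerate(page_texts, start=1):
        upper = text.upper()
        for marker, doc in LOOKUP.items():
            if marker in upper:
                return doc, number
    return None  # no marker found on any page

pages_text = [
    "Attach additional sheets if needed",           # page without a marker
    "Form 1040 U.S. Individual Income Tax Return",  # page with a marker
]
print(identify(pages_text))
```

The page order does not matter for this check, which is why shuffled scans are less of a problem for marker-based identification than for title-based identification.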

Our doc-identification model is a combination of taxonomy pattern matching and semantic search. (Classification models would be neither efficient nor stable with this data scenario.)

With our [Indexing-Pipeline](./Indexing-Pipeline.ipynb) we leverage the PDF form blanks to build a semantic index from the doc titles and a pattern lookup table from the taxonomy markers extracted from the header/footer areas. This notebook focuses on the OCR pipeline, which should be scalable and resistant to OCR misses. First, we have to make sure that our strategy works.

* [Strategy](#desc)
    * [Taxonomy match](#tax)
    * [Semantic match](#sem)
* [Expectation](#base)
* [Pipeline:](#pipe)
    * [Skew detection](#skew)
    * [Orientation detection](#orient)
    * [Content classification](#form)
    * [Pattern match](#match)


In [None]:
import os
import re
import cv2
import torch
import pandas as pd
import numpy as np
import pytesseract as pts
import matplotlib as mpl

from PIL import Image, ImageOps
from matplotlib import pyplot as plt
from matplotlib import patches
from pathlib import Path
from IPython.display import display, clear_output
from fuzzysearch import find_near_matches
from fitz import fitz
from time import time

In [None]:
# local lib
from scripts import prep, parse
from scripts import simulate as sim
from scripts.baselines import *

    #torch._dynamo.config.verbose = True
    torch.cuda.empty_cache()
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    device

<a name="desc"></a>

## Strategy
We've got our taxonomy lookup table parsed from the form-blank PDF files. Our guess is that the file name reflects the taxonomy pattern we will be searching for to identify the document. Let's check if that's the case.

In [None]:
# doc-level lookup table
docs = pd.read_csv('./data/forms.csv.gz')
docs = docs.loc[docs['lang'].isin(['en','fr','sp'])].fillna('')
docs['taxonomy'] = docs.apply(lambda r:f"{r['type']}{r['sub']}".strip().upper(), axis=1)
docs

In [None]:
print(f'Documents: {len(docs.dropna())}')

In [None]:
# page-level reference (multipage docs)
pages = pd.read_csv('./data/page-summary.csv.gz')
pages['file'] = pages['source'].apply(lambda x:'-'.join(x.split('-')[:-1]))
pages = pages.merge(docs[['file','taxonomy','ext','lang']], on='file')
pages.columns

In [None]:
print(f'Pages: {len(pages)}')

<a name="tax"></a>

### Taxonomy match
Usually the taxonomy marker is a short alphanumeric pattern with hyphens and dots; sometimes it is a single letter or a single number (in which case our search may produce a lot of false positives).

The challenge is that either a hyphen or a whitespace may be used as the separator, seemingly at random.
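A small illustration of both points, using made-up marker strings: stripping dots, hyphens and spaces collapses the separator variants into one form, while a very short marker can turn up inside a longer, unrelated one. The helper name `squash` is introduced only for this sketch.

```python
import re

def squash(text: str) -> str:
    """Drop dots, hyphens and spaces, then upper-case."""
    return re.sub(r"[.\- ]", "", text).upper()

# separator variants of the same (made-up) marker collapse to one form
variants = ["TP-1.D", "TP 1.D", "tp1-d", "TP1D"]
print({squash(v) for v in variants})

# a short marker is a substring of a longer one: a false positive
print(squash("T1") in squash("T-100 B"))
```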

In [None]:
def stem(x):
    return re.sub(r'[\.\- ]', '', x).upper()

In [None]:
# the leading `f` in IRS filenames stands for `form` followed by a number (a source of false positives)
def search_pattern(d):
    if d['file'].startswith('irs-'):
        if d['lang'] == 'sp':
            return stem(f"FORMULARIO {d['taxonomy']}")
        return stem(f"FORM {d['taxonomy']}")
    return stem(f"{d['taxonomy']}{d['ext'].replace('-','')}")

pages['pattern'] = pages.apply(search_pattern, axis=1)
pages[['file','taxonomy','pattern']].sample(10)

#### Sample doc
Let's check our hypothesis with some random document.

In [None]:
doc = 'cnd-t100b.fr' #np.random.choice(list(set(pages['doc'])))
# get all pages data
files = pages[pages['file']==doc]['source'].to_list()
desc = docs.loc[docs['file']==doc,'desc'].iloc[0]
pattern = pages.loc[pages['file']==doc,'pattern'].iloc[0]
print(f'Doc: {doc}\n{desc}\n#pages: {len(files)}\nTaxonomy marker: {pattern}')

In [None]:
order = [int(source.split('-').pop()) for source in files]

def find_marker(pattern):
    def has_marker(x):
        return stem(str(x)).find(pattern) != -1
    return has_marker
    
# show pages with layout-blocks marked by type and highlight if found the taxonomy marker
for i,o in enumerate(np.argsort(order)):
    source = files[o]
    
    image = Image.open(f'data/images/{source}.png')
    content = pd.read_csv(f'data/info/{source}.csv.gz')
    try: # check if the form-inputs data is available
        inputs = pd.read_csv(f'data/inputs/{source}.csv.gz')
        inputs = inputs.loc[inputs['field_type_string']!='Button']
    except: # pdf has no widgets
        inputs = None
        
    # check if marker present in the text
    content['marker'] = content['text'].apply(find_marker(pattern))

    fig, ax = plt.subplots(figsize=(8,8))
    
    # make a point for the legend
    ax.scatter([-100], [-100], color='C0', marker='s', s=100, alpha=0.4, label='word')
    ax.plot([-100], [-100], color='C1', label='image')
    ax.plot([-100], [-100], color='C2', label='block')
    ax.plot([-100], [-100], color='C3', label='target block: contains marker')
    ax.scatter([-100], [-100], color='yellow', marker='s', s=100, alpha=0.6, label='form-input')
    
    # use page-image as a background
    ax.imshow(image)
    
    # highlight layout blocks by type
    for box in content[['left','top','right','bottom','scale','orig','block-type','marker']].values:
        x1, y1, x2, y2, s, o, t, m = box
        w, h = (x2 - x1) * s, (y2 - y1) * s
        x, y = x1 * s, y1 * s
        color = 'C3' if m == 1 else ('C1' if t == 1 else ('C0' if o == 'word' else 'C2'))
        if color == 'C0':
            patch = patches.Rectangle((x, y), w, h,
                                      linewidth=0, edgecolor='none', facecolor='C0', alpha=0.25)
        else:
            patch = patches.Rectangle((x, y), w, h,
                                      linewidth=1, edgecolor=color, facecolor='none')
        ax.add_patch(patch)
        
    # whenever the form-inputs data is available show the inputs
    if inputs is not None:    
        for box in inputs[['left','top','right','bottom','field_type_string']].values:
            x1, y1, x2, y2, t = box
            try:
                w, h = (x2 - x1) * s, (y2 - y1) * s
                x, y = x1 * s, y1 * s
                ax.add_patch(patches.Rectangle((x, y), w, h,
                                               linewidth=0, edgecolor='none', facecolor='yellow', alpha=0.6))
            except:
                pass # parsing error
            
    ax.set_title(f'Page {i + 1}')
    ax.legend(bbox_to_anchor=(1, 1), frameon=False)
    plt.show()


<a name="base"></a>

##  Expectation
Let's estimate the expectation that finding a taxonomy marker leads to a successful document identification. For this experiment we use the text extracted directly from the PDF docs, which gives us a hard ceiling: OCR cannot do any better than that.

In [None]:
search = []
# go through all data
for doc in set(pages['file']):    
    files = pages[pages['file']==doc]['source'].to_list()
    data = [pd.read_csv(f'./data/info/{source}.csv.gz') for source in files]
    data = pd.concat(data)
    data = data.loc[~data['text'].isna()]
    pattern = pages.loc[pages['file']==doc,'pattern'].values[0]
    # find if the pattern from the lookup table is present on the page somewhere
    data['marker'] = data['text'].apply(find_marker(pattern))    
    # if found save location reference
    success = data.loc[data['marker']==1,['left','top','right','bottom','page','text']]
    success[['left','right']] /= data['right'].max()
    success[['top','bottom']] /= data['bottom'].max()
    success['pattern'] = pattern
    success['doc'] = doc
    search.append(success)
    
search = pd.concat(search)
search.to_csv('./data/search-test.csv', index=False)

In [None]:
success = set(search.groupby('doc').size().index)
failure = set(pages['file']).difference(success)
# doc-level identification stats
print('Docs. success: {} ({:.2%})  failure: {}'.format(len(success),
                                                       len(success)/(len(success) + len(failure)),
                                                       len(failure)))
# page-level search success
print(f"Pages success: {len(pages.loc[pages['file'].isin(success)])/len(pages):.2%}")

In [None]:
# plot left-top coordinates of the bounding-box containing taxonomy-marker
fig, ax = plt.subplots(figsize=(8, 8))
for i,prefix in enumerate(['irs','cnd','que']):
    D = search[search['doc'].str.startswith(prefix)]
    ax.scatter(D['left'], 1 - D['top'], color=f'C{i}', s=3, alpha=0.2, label=prefix)
ax.axhline(y=0.8, linestyle=':', color='C3')
ax.text(1, 0.81, 'header', color='C3', ha='right')
ax.axhline(y=0.2, linestyle=':', color='C3')
ax.text(1, 0.175, 'footer', color='C3', ha='right')
lg = plt.legend(bbox_to_anchor=(1, 0.6), frameon=False)
for lh in lg.legend_handles: lh.set_alpha(1)
plt.title('Heatmap: taxonomy-marker position on the page')
plt.show()

    # see some failed examples
    pages.loc[pages['file'].isin(failure)]
    for doc in [np.random.choice(list(failure))]:
        files = pages[pages['file']==doc]['source'].to_list()
        data = [pd.read_csv(f'./data/info/{source}.csv.gz') for source in files]
        data = pd.concat(data)
        data = data.loc[~data['text'].isna()]
        pattern = pages.loc[pages['file']==doc,'pattern'].values[0]
        print(doc, pattern)
        print('\n'.join(data[data['orig']!='word']['text'].to_list()))
        print('')

Observation: our strategy should work.

<a name="pipe"></a>

## Doc-identification pipeline
1. Detect and fix skew and orientation to improve OCR outcome and to locate the header/footer areas
2. Check `hot-spots` for the presence of taxonomy markers; if failed to find or match -- run extended search
3. Determine if the page contains form-inputs (needs full-size OCR extraction)

We've seen that variance-based skew correction works quite well with a low-resolution view. However, to correct orientation (90º, 180º, 270º) we need a better model. Our hypothesis here: we can detect orientation based on low-resolution view as well.

We also need to make sure that our processing works in the presence of noise and distortion: for the baselines we are going to use [simulated noisy data](Synthetic-Data.ipynb) to minimize the necessity for the real data (which we might not have to begin with, and would be high risk when available).

<a name="skew"></a>

### Skew detection baseline
Skew correction should fix rotations of up to 45º; in real data we would expect small angles (less than 10º).
Let's evaluate the performance of our [variance-based skew correction](./OCR-Prep.ipynb#skew); we might go with it as is.
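For intuition, here is one common way a variance-based skew detector can work: rotate the dark pixels back by each candidate angle and keep the angle whose row histogram (projection profile) has the highest variance. This is a sketch under that assumption, written in pure Python on synthetic points; it is not the implementation of `prep.detect_skew`, and the function names below are illustrative.

```python
import math
from collections import Counter

def profile_variance(points, angle):
    """Variance of the row histogram after rotating points back by `angle` degrees."""
    t = math.radians(angle)
    rows = Counter(round(y * math.cos(t) - x * math.sin(t)) for x, y in points)
    n_bins = max(rows) - min(rows) + 1          # empty rows count as zero bins
    mean = sum(rows.values()) / n_bins
    return sum(c * c for c in rows.values()) / n_bins - mean * mean

def detect_skew(points, max_angle=10, step=1):
    """Pick the candidate angle that gives the sharpest row profile."""
    candidates = range(-max_angle, max_angle + 1, step)
    return max(candidates, key=lambda a: profile_variance(points, a))

def rotate(points, angle):
    """Rotate points by `angle` degrees around the origin."""
    t = math.radians(angle)
    return [(x * math.cos(t) - y * math.sin(t),
             x * math.sin(t) + y * math.cos(t)) for x, y in points]

# synthetic page: ten horizontal "text lines", 100 dark pixels each
page = [(x, 10 * line) for line in range(10) for x in range(100)]
skewed = rotate(page, 5)
print(detect_skew(skewed))
```

When the text lines are horizontal, the dark pixels pile up in a few rows and the histogram is spiky; any residual angle smears them across neighbouring rows and lowers the variance.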

In [None]:
# page-images extracted from pdf
images = [f'data/images/{x}.png' for x in pages['source']]

In [None]:
# gather skew-detection stats
n = 1000
# size to test (224 is ViT resolution)
test = [64, 128, 224, 500]
result = pd.DataFrame(columns=test)
error = pd.DataFrame(columns=test)
for k,size in enumerate(test):
    stats = []
    for i in range(n):
        source = np.random.choice(images)
        # run augmentation
        orig, info = prep.random_transform(prep.img_load(source), max_skew=45, noise=0.5, perspective=True)
        # test skew detection with downscale to dpi
        angle = prep.detect_skew(orig, max_angle=45, base_size=size)
        info['corrected'] = angle
        stats.append(info.copy())
        print(f'{(n * k + i)/len(test)/n:.2%}', end='\r')

    stats = pd.DataFrame.from_dict(stats)
    error[size] = np.abs(stats['skew'] - stats['corrected'])
    result[size] = stats['orient'] + error[size]

# visualize result
fig, ax = plt.subplots(1, 2, figsize=(9, 4))
for i,size in enumerate(result.columns):
    ax[0].scatter(range(n), result[size], s=5, marker=f'{i+1}', alpha=0.5, label=size)
ax[0].set_xticks([])
ax[0].set_yticks([0, 90, 180, 270])
ax[0].set_xlabel('Sample order')
for i,size in enumerate(result.columns):
    ax[1].scatter(error[size], result[size], s=5, marker=f'{i+1}', alpha=0.5, label=size)
ax[1].set_yticks([0, 90, 180, 270])
ax[1].set_xlabel('Sample error')
ax[1].set_ylabel('Sample rotation outcome')
lg = ax[1].legend(bbox_to_anchor=(1, 1), frameon=False)
for lh in lg.legend_handles: lh.set_alpha(1)
plt.tight_layout()

In [None]:
# fraction of the correct detection
np.sum((error < 2))/len(error)

Results across the downscale levels show consistently the best performance at size 224; however, there is no significant difference within the tested range: over 80% of outcomes have an error of less than 2º. Good to go.

In [None]:
# pick view size
VIEW_SIZE

In [None]:
# take random page view
source = np.random.choice(images).split('/').pop()[:-4]
print(source)
# generate a noisy view of a filled-in form along with data for the labels
orig, info, inputs = sim.generate_sample(source, dpi=200, light=0.3, noise=0.3)
# correct skew and distortion
output = orig.copy()
angle = prep.detect_skew(output, max_angle=45)
output = prep.img_rotate(output, angle, fill=prep.get_bg_value(output))
output = prep.fit_straight(output)

fig, ax = plt.subplots(1, 2, figsize=(10, 10))
ax[0].imshow(orig, 'gray')
ax[0].set_title('Original (augmented image)')
ax[1].imshow(output, 'gray')
ax[1].set_title(f'Processed (detected skew {angle}º)')
plt.show()


<a name="orient"></a>

### [Orientation detection baseline](Classification-Baseline.ipynb#orient)
At this point we assume that the small rotation angle (skew) and the distortion have been corrected. Orientation correction should fix 90º, 180º, and 270º rotations. The hypothesis here: a low-resolution view, together with the outlines if present, carries enough visual cues for the human eye to tell the page orientation. We cannot read the text from such an image, but we can tell the page orientation and whether text, a table, or inputs are present. Let's estimate how well a neural network can do that.

Bird's-eye view estimation: for this experiment we generate a noisy dataset of images, which we first straighten up; this means the outcome may not be perfectly aligned (20% will have 1º+ residual skew). Then, using this data, we train the [orientation detector model](Classification-Baseline.ipynb#orient). The model takes a low-resolution view as input. To keep the model from picking up on logos and local style we use `center-crop`: the model only sees the middle section of the page and the side edges.
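Once an orientation class is predicted, undoing it is a matter of rotating by the complementary multiple of 90º. The sketch below shows this on a tiny nested-list "page"; it assumes the detected value is the clockwise rotation that was applied to the page, which is a convention chosen for this example and may differ from the one used by the detector.

```python
def rotate_cw(grid):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def rotate(grid, angle):
    """Rotate clockwise by a multiple of 90 degrees."""
    for _ in range((angle // 90) % 4):
        grid = rotate_cw(grid)
    return grid

def fix_orientation(grid, detected):
    """Undo a clockwise rotation of `detected` degrees (assumed convention)."""
    return rotate(grid, (360 - detected) % 360)

page = [[1, 2, 3],
        [4, 5, 6]]
for angle in (0, 90, 180, 270):
    restored = fix_orientation(rotate(page, angle), angle)
    print(angle, restored == page)
```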

In [None]:
lines = parse.extract_lines(output, units=20)

# scale down to 224 on max size
size = tuple((np.array(orig.shape) * 224/max(orig.shape)).astype(int))[::-1]
layout = cv2.resize(output, size, interpolation=cv2.INTER_AREA)

fig, ax = plt.subplots(1, 2, figsize=(10,10))
ax[0].imshow(layout, 'gray')
ax[0].set_title('Low resolution layout features')
ax[1].imshow(lines, 'gray')
ax[1].set_title('Extracted lines')
plt.show()

In [None]:
results = []
for _ in range(1000):
    # take random page view
    source = np.random.choice(images).split('/').pop()[:-4]
    # generate a noisy view of a filled-in form along with data for the labels
    orig, info, _ = sim.generate_sample(source, dpi=200, light=0.3, noise=0.3)
    output, orient = test_correction_pipeline(orig)
    results.append([orient, info['orient']])

# show resulting confusion matrix
results = pd.DataFrame(np.array(results).reshape((1000,2)), columns=['detected','actual'])
heatmap = results.groupby(['detected','actual']).size()
heatmap /= heatmap.sum()
if len(heatmap) < 16:
    heatmap = heatmap.reindex([(a,b) for a in [0,90,180,270] for b in [0,90,180,270]], fill_value=0)
fig, ax = plt.subplots(figsize=(3, 3))
ax.imshow(heatmap.values.reshape((4, 4)))
plt.axis('off')
plt.title(f"Accuracy: {np.sum(results['detected']==results['actual'])/len(results):.2%}")
plt.show()

<a name="form"></a>

### [Form detection baseline](Classification-Baseline.ipynb#form)
In our dataset we have forms with inputs arranged in a tabular layout (`IRS` forms) and forms with inline inputs stylized in peculiar ways, which makes training a generic classifier without over-fitting to the dataset quite a challenge. For this experiment we use the same data we generated for the orientation detector. (We have more positive samples in the data.)

In [None]:
results = []
for _ in range(1000):
    # take random page view with widgets info available
    source = np.random.choice([x for x in images if not x.startswith('data/images/que-')]).split('/').pop()[:-4]
    # generate a noisy view of a filled-in form along with data for the labels
    orig, _, inputs = sim.generate_sample(source, dpi=200, light=0.3, noise=0.3)
    cls = test_classification_pipeline(orig)
    results.append([cls, len(inputs) > 0])

# show resulting confusion matrix
results = pd.DataFrame(np.array(results).reshape((1000,2)), columns=['detected','actual'])
heatmap = results.groupby(['detected','actual']).size()
heatmap /= heatmap.sum()
if len(heatmap) < 4:
    heatmap = heatmap.reindex([(a,b) for a in [0,1] for b in [0,1]], fill_value=0)
fig, ax = plt.subplots(figsize=(2, 2))
ax.imshow(heatmap.values.reshape((2, 2)))
plt.axis('off')
plt.title(f"Accuracy: {np.sum(results['detected']==results['actual'])/len(results):.2%}")
plt.show()

<a name="match"></a>

### Check hot-spots for the presence of taxonomy markers
At this point we assume the document is straightened up and in the right orientation. Our strategy is to look at small fragments of the image in order of priority, which should help with gradient noise and execution time; however, how the actual OCR is performed still matters. First, we run a simple brute-force proof of concept: go through all the data known to have taxonomy markers and read the known "hot spots" with the `pytesseract.image_to_string` method, followed by a pattern search. [The reading strategy we address in a separate notebook.]()
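The priority order can be sketched as a list of row ranges: header first, footer second, then the body bands from top to bottom. The function name and the 20% band height are assumptions for this illustration. With a NumPy image, each band would be sliced as `image[top:bottom, :]`, keeping in mind that `image.shape` is `(rows, columns)`, that is, height first.

```python
def priority_bands(height, fraction=0.2):
    """Row ranges (name, top, bottom) to OCR, most promising first."""
    step = max(1, int(height * fraction))      # guard against a zero-height band
    bands = [("header", 0, step), ("footer", height - step, height)]
    top = step
    while top < height - step:                 # body bands between header and footer
        bands.append(("body", top, min(top + step, height - step)))
        top += step
    return bands

for name, top, bottom in priority_bands(1000):
    print(name, top, bottom)
```

Stopping at the first band that yields an exact match means most pages would need only one or two OCR calls instead of a full-page read.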

In [None]:
def check_header_footer(image):
    """
    visualize flow
    """
    h, w = image.shape  # numpy shape is (rows, columns)
    d = int(h * 0.2)
    print('------------------------ reading header ---------------------------')
    header = normalize(image[:d,:])            
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(header, 'gray')
    plt.axis('off')
    plt.title('Original')
    plt.show()
    header = pts.image_to_string(header, lang='eng+fre')
    print(header)
    print('------------------------ reading footer ---------------------------')    
    footer = normalize(image[-d:,:])
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(footer, 'gray')
    plt.axis('off')
    plt.show()
    footer = pts.image_to_string(footer, lang='eng+fre')
    print(footer)
    return header, footer

# grab a page from the random doc
doc = np.random.choice(list(success))
# get search target
pattern = get_marker(doc)
for source in pages[pages['file']==doc]['source'].to_list():
    # generate a noisy view of a filled-in form along with labels
    orig, info, inputs = sim.generate_sample(source, dpi=300, light=0.3, noise=0.3)
    # run correction
    output, orient = test_correction_pipeline(orig)
    # show original and straighten-up
    fig, ax = plt.subplots(1, 2, figsize=(10,10))
    ax[0].imshow(orig, 'gray')
    ax[0].set_title('Original')
    ax[1].imshow(output, 'gray')
    ax[1].set_title('Prepared')
    plt.show()
    if info['orient'] != orient:
        print('orientation detection failure...')
        continue
    check_header_footer(output)
    break


If we allow a small `Levenshtein` distance, the outcome improves.
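To show why a distance of 1 helps with OCR misses, here is a pure-Python sketch of approximate substring matching (the classic dynamic-programming approach where a match may start anywhere in the text). It is an illustration only: unlike the `find_near_matches` call used in this notebook with `max_deletions=0`, it allows all three edit operations, and the sample strings are made up.

```python
def min_substring_distance(pattern, text):
    """Smallest edit distance between `pattern` and any substring of `text`."""
    # prev[j]: best distance for the pattern prefix so far, with the match ending at text[:j]
    prev = [0] * (len(text) + 1)               # empty pattern matches anywhere at cost 0
    for i, pc in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # pattern char missing from text
                          curr[j - 1] + 1)     # extra char in text
        prev = curr
    return min(prev)

print(min_substring_distance("T100B", "CNDT100BFR"))     # clean read
print(min_substring_distance("T100B", "FORMT1O0B2023"))  # OCR read `0` as `O`
```

An exact search would miss the second string entirely, while a tolerance of 1 recovers it. The price is more false positives for short patterns, so the tolerance should stay small.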

In [None]:
def find_match(pattern, max_dist=1):
    def best_match(x):
        # smallest Levenshtein distance among the matches, or None if nothing matched
        if not isinstance(x, str):
            return None
        matches = find_near_matches(pattern, stem(x), max_l_dist=max_dist, max_deletions=0)
        if len(matches) > 0:
            return min(m.dist for m in matches)
        return None
    return best_match


def search_doc(image, find):
    """
    search priority order: header first, then footer
    returns (area, distance): area is 1 (header) or 2 (footer) on exact match, 0 otherwise
    """
    h, w = image.shape  # numpy shape is (rows, columns)
    d = int(h * 0.2)
    header = pts.image_to_string(normalize(image[:d,:]), lang='eng+fre')
    b = find(header)
    if b == 0: # if exact return, if not continue
        return 1, 0
    best = b
    footer = pts.image_to_string(normalize(image[-d:,:]), lang='eng+fre')
    b = find(footer)
    if b == 0:
        return 2, 0
    if b is not None:
        best = b if best is None else min(best, b)
    return 0, best


In [None]:
search_doc(output, find_match(pattern))

In [None]:
def find_match(pattern, max_dist=1):
    def best_match(x):
        # smallest Levenshtein distance among the matches, or None if nothing matched
        if not isinstance(x, str):
            return None
        matches = find_near_matches(pattern, stem(x), max_l_dist=max_dist, max_deletions=0)
        if len(matches) > 0:
            return min(m.dist for m in matches)
        return None
    return best_match


def search_doc(image, find):
    """
    search priority order: header and footer first
    followed by the middle part top-down
    returns (area, distance): area is 1..5 on exact match, 0 otherwise
    """
    h, w = image.shape  # numpy shape is (rows, columns)
    step = int(h * 0.2)
    # row ranges in priority order: header, footer, then three middle bands
    bands = [(0, step), (h - step, h)] + [(i * step, (i + 1) * step) for i in range(1, 4)]
    best = None
    for area, (top, bottom) in enumerate(bands, start=1):
        content = pts.image_to_string(normalize(image[top:bottom,:]), lang='eng+fre')
        dist = find(content)
        if dist == 0: # exact match: stop early
            return area, 0
        if dist is not None: # keep the closest fuzzy match seen so far
            best = dist if best is None else min(best, dist)
    return 0, best


# reset and add `dist` for fuzzy match
test = search.groupby(['doc','pattern']).size().reset_index().set_index('doc')
test['success'] = 0
test['dist'] = None
test['exception'] = 0
test['proc'] = None
# brute-force search
for n, doc in enumerate(test.index):
    # pattern to look for
    pattern = test.loc[doc,'pattern']
    result, dist = 0, 10
    files = pages[pages['file']==doc]['source'].to_list()
    for source in files:
        # generate a noisy view of a filled-in form along with data for the labels
        orig, info, inputs = sim.generate_sample(source, dpi=300, light=0.3, noise=0.3)
        # run correction
        output, orient = test_correction_pipeline(orig)
        test.loc[doc,'proc'] = 1
        if info['orient'] != orient:
            test.loc[doc,'exception'] += 1
            continue
        try:
            i, d = search_doc(output, find_match(pattern, max_dist=1))
        except:
            test.loc[doc,'exception'] += 1
            continue
        else:
            if d is not None and dist >= d:
                result += 1
                dist = min(dist, d)
    test.loc[doc,'success'] = result
    test.loc[doc,'dist'] = dist
    test.to_csv('./data/test-match.csv')
    print(f'done: {n/len(test):.2%}', end='\r')

result = test.loc[test['proc'] == 1]
print(f"Success: {len(result[result['success'] > 0])/len(result):.2%}")

In [None]:
result = result.groupby(['success','exception']).size()
(result / result.sum() * 100).reset_index().style.background_gradient('Reds')

The method above might be OK as a proof of concept, but to make it useful in practice we need some serious optimization (this is slow).