# Data extraction from PDF forms
For this project we need to extract:
* `layout`: hierarchy of blocks defined by bounding box coordinates and `type`
* `text`: words and `phrases` (linked sequences)
* `inputs`: fields where data should be entered, each defined by a text `label` and a data-type spec
* `images`: some are `logos` we want to recognize; some contain text we want to be aware of for our vision model
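
The requirements above could be captured in a small record type. The sketch below is only an illustration: the class names `Box` and `Block`, their fields, and the example values are hypothetical and are not taken from `scripts/parse.py`. The only thing carried over from the notebook is the `top`, `left`, `bottom`, `right` convention for normalized coordinates.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Box:
    """Bounding box in normalized page coordinates (hypothetical helper)."""
    top: float
    left: float
    bottom: float
    right: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top

    def contains(self, other: "Box") -> bool:
        """True when `other` lies entirely inside this box."""
        return (self.top <= other.top and self.left <= other.left
                and self.bottom >= other.bottom and self.right >= other.right)


@dataclass
class Block:
    """One node of the layout hierarchy: a block, line, word, input or image."""
    box: Box
    type: str                      # e.g. 'block', 'line', 'word', 'input', 'image'
    text: Optional[str] = None     # words and phrases; the label for inputs
    children: List["Block"] = field(default_factory=list)


# illustrative values only
line = Block(Box(0.10, 0.05, 0.12, 0.60), 'line', 'Full name')
word = Block(Box(0.10, 0.05, 0.12, 0.20), 'word', 'Full')
line.children.append(word)
```

Nesting `children` inside each `Block` is one way to express the hierarchy; a flat table with a parent reference, closer to the CSV files used below, would work as well.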

There are several options; we use the `PyMuPDF` package. `scripts/parse.py` performs the initial bulk extraction for exploration.

For our doc-indexing pipeline we need a refined version based on the representation model that our exploration produces. We also need to choose embedding models (text and image) for similarity queries.

* For the text embeddings a pretrained model should be enough, possibly with minimal fine-tuning.
* For the image embeddings we are going to train our own model based on either the `ResNet` or the `ViT` architecture, adapted to grayscale.

The single-source-batch data loaders we use could make learning very sensitive to data quality, so we need a way to classify each source by its fitness as a learning sample.

In [None]:
import re
import os
import json
import numpy as np
import pandas as pd

from time import time
from pathlib import Path
from PIL import Image, ImageOps
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt
from matplotlib import patches

In [None]:
# run initial parsing for exploration
#!python scripts/parse.py

In [None]:
# doc-level lookup table
docs = pd.read_csv('./data/forms.csv.gz')
docs = docs.loc[docs['lang'].isin(['en','fr','sp'])].fillna('')
docs['taxonomy'] = docs.apply(lambda r:f"{r['type']}{r['sub']}".strip().upper(), axis=1)

# page-level reference (multipage docs)
pages = pd.read_csv('./data/page-summary.csv.gz')
pages['file'] = pages['source'].apply(lambda x:'-'.join(x.split('-')[:-1]))

In [None]:
BOX = ['top','left','bottom','right']

### Explore data quality evaluation strategies

In [None]:
failed = []
for i,source in enumerate(pages[(pages['num-pages']==1)&(pages['text-input'].isna())]['source'].to_list()):
    try:
        D = pd.read_csv(f'data/info/{source}.csv.gz')
    except FileNotFoundError:
        print(source)
        continue
    D = D.loc[D['block-type']=='word'].copy()
    D['text'] = D['text'].fillna('').astype(str).str.strip()
    D = D.loc[D['text']!='']
    text = ' '.join(D['text'].to_list())
    # pages showing only the Adobe Reader placeholder message carry no form content
    if 'The document you are trying to load requires Adobe Reader' in text \
       or 'Please wait...' in text:
        failed.append(source[:-2])  # drop the page suffix
len(failed)

In [None]:
index, stats = [],[]
# gather page word stats
for i,source in enumerate(pages['source'].to_list()):
    try:
        D = pd.read_csv(f'data/info/{source}.csv.gz')
    except FileNotFoundError:
        continue
    # filter out duplicate blocks (lines)
    boxes = D.sort_values('block-type', ascending=False)
    boxes = boxes.loc[boxes['block-type'].isin(['block','line']), BOX]
    boxes.loc[:,BOX] = np.round(boxes.loc[:,BOX] * 1000).astype(int)
    n = len(boxes)
    boxes = boxes.drop_duplicates(keep='first')
    drop = D[(D['block-type']=='block')&(~D.index.isin(boxes.index))].index
    #if len(drop) < n - len(boxes):
    #print(f'duplicate blocks: dropped {len(drop)} of {n - len(boxes)} ...')
    D = D.loc[~D.index.isin(drop)]    
    D.to_csv(f'data/info/{source}.csv.gz', index=False, compression='gzip')
    
    if len(D.loc[(D['sin'] > 0)&(D['sin'] < 1)]) > 0:
        print(source)
    
    # work on a copy so that the column assignments below do not touch a view of D
    words = D.loc[D['block-type']=='word'].copy()
    words['text'] = words['text'].fillna('').astype(str).str.strip()
    #words.loc[:,BOX] = words.loc[:,BOX].astype(float)
    words = words.loc[words['text']!=''].copy()
    if len(words) > 0:
        words['height'] = words['bottom'] - words['top']
        words['width'] = words['right'] - words['left']
        # estimate space between the words
        space = words[['top','left']].merge(words[['top','right']], on='top')
        space = space.loc[space['left'] > space['right']]
        space = (space['left'] - space['right']).min()        
        words = words.median(numeric_only=True).to_dict()
        words['space'] = space
        stats.append(words)
        index.append(source)
    print(f'done: {(i + 1)/len(pages):.2%}', end='\r')

stats = pd.DataFrame.from_dict(stats)
mean = stats.mean(numeric_only=True)
pages = pages.set_index('source')
pages.loc[index,stats.columns[:-2]] = stats.values[:,:-2]
pages['word-width'] = None
pages['word-height'] = None
pages.loc[index,['word-width','word-height','space']] = stats[['width','height','space']].values

# detect and mark outlier-pages
div = np.log(((stats - mean) ** 2).sum(axis=1))
div /= div.max()
pages['div'] = 1
pages.loc[index,'div'] = div.values
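
The score computed above is the log of the squared Euclidean distance between a page's median word statistics and the corpus mean, divided by the largest such value. A minimal sketch on synthetic numbers shows how it behaves; the array `stats` and its values are made up for illustration and do not come from the data set.

```python
import numpy as np

# synthetic per-page medians (e.g. word width and height); values are made up
stats = np.array([
    [10.0, 5.0],
    [11.0, 5.0],
    [10.0, 6.0],
    [40.0, 20.0],   # a page with unusually large words
])

mean = stats.mean(axis=0)
# squared Euclidean distance of every page from the corpus mean
dist2 = ((stats - mean) ** 2).sum(axis=1)
# the log compresses the long tail; dividing by the maximum maps the top score to 1
div = np.log(dist2)
div = div / div.max()
print(np.round(div, 3))
```

Note that the log is negative for distances below 1 and undefined for a page that sits exactly on the mean; this sketch does not handle either case.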

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
div.plot(kind='density', ax=ax[0])
ax[0].axvline(x=0.32, linestyle=':')
ax[0].axvline(x=0.55, linestyle=':')
ax[1].plot(sorted(div.to_list())[::-1])
ax[1].axhline(y=0.32, linestyle=':')
ax[1].axhline(y=0.55, linestyle=':')
plt.show()

In [None]:
#outliers = pages[pages['div'] > 0.8].index.to_list()
#Image.open(f'./data/images/{np.random.choice(outliers)}.png')

In [None]:
outliers = set(pages[(pages['div'] < 0.5)|(pages['div'] >= 0.8)].index)
len(outliers)

In [None]:
pages[(pages['div'] >= 0.5)&(pages['div'] < 0.8)].to_csv('./data/pages.csv.gz', compression='gzip')

In [None]:
# file names without the '.png' and '.csv.gz' extensions
images = [x.name[:-4] for x in Path('./data/images').glob('*.png')]
info = [x.name[:-7] for x in Path('./data/info').glob('*.csv.gz')]
len(set(images).intersection(set(outliers))), len(set(info).intersection(set(outliers)))

In [None]:
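for source in outliers:
    # pages without an info file also end up in `outliers`, so tolerate missing files
    Path(f'data/images/{source}.png').unlink(missing_ok=True)
    Path(f'data/info/{source}.csv.gz').unlink(missing_ok=True)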


In [None]:
# sanity check: no outlier files should remain
images = [x.name[:-4] for x in Path('./data/images').glob('*.png')]
info = [x.name[:-7] for x in Path('./data/info').glob('*.csv.gz')]
len(set(images).intersection(set(outliers))), len(set(info).intersection(set(outliers)))

### Explore images on the pages
Some logos contain text, which can interfere with learning, but they can also help identify where a page comes from.

In [None]:
pages = pd.read_csv('./data/pages.csv.gz')
len(pages)

In [None]:
import pytesseract as ts  # assumption: `ts` is pytesseract; it is not imported anywhere above

def image_text(data):
    """Replace the text of image blocks with OCR output prefixed by 'IMAGE: '."""
    images = data.loc[data['block-type']=='image']
    if len(images) == 0:
        return data
    source, page = data.iloc[0][['source','page']]
    image = np.array(ImageOps.grayscale(Image.open(f'./data/images/{source}-{page}.png')))
    scale = min(image.shape)
    text = []
    for t, l, b, r in images[['top','left','bottom','right']].values:
        if b - t > 0.5 or r - l > 0.5:
            # large (cover-like) image: skip OCR
            text.append('IMAGE: ')
            continue
        t, l, b, r = int(t * scale), int(l * scale), int(b * scale), int(r * scale)
        try:
            clip = image[max(t - 5, 0):min(b + 5, image.shape[0]), max(l - 5, 0):min(r + 5, image.shape[1])]
            found = ts.image_to_string(clip).strip()
            # keep word characters only
            found = ' '.join(re.split(r'\W+', found)).strip()
            text.append(f'IMAGE: {found}')
        except Exception:
            text.append('IMAGE: ')
    data.loc[data['block-type']=='image','text'] = text
    return data


source = np.random.choice(info)
data = pd.read_csv(f'./data/info/{source}.csv.gz')
data = image_text(data)
data[data['block-type']=='image']
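
The padded clipping used in this section is easy to get wrong at the page border: the lower bound has to be clamped with `max(..., 0)` and the upper bound with `min(..., size)`. Below is a minimal sketch of that step in isolation; the helper name `clip_region`, its signature and the synthetic `page` array are hypothetical.

```python
import numpy as np


def clip_region(image, box, scale, pad=5):
    """Cut a padded region out of a 2-D image array.

    `box` is (top, left, bottom, right) in normalized units and `scale`
    converts it to pixels. The name and signature are hypothetical.
    """
    t, l, b, r = (int(v * scale) for v in box)
    # max() keeps the lower bound at 0: a negative start index would wrap
    # around to the other side of the array and usually yield an empty clip
    t, l = max(t - pad, 0), max(l - pad, 0)
    # min() keeps the upper bound inside the image
    b, r = min(b + pad, image.shape[0]), min(r + pad, image.shape[1])
    return image[t:b, l:r]


page = np.arange(100 * 80).reshape(100, 80)   # stand-in for a grayscale page
clip = clip_region(page, (0.0, 0.0, 0.25, 0.5), scale=80)
print(clip.shape)
```

With `min(t - pad, 0)` in place of `max(t - pad, 0)` the start index is never positive, so the clip would always start at the page border or wrap around.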

    image_to_text = []
    for i, source in enumerate(info):
        data = pd.read_csv(f'./data/info/{source}.csv.gz')
        images = data.loc[data['block-type']=='image']
        if len(images) == 0:
            continue
        image = np.array(ImageOps.grayscale(Image.open(f'./data/images/{source}.png')))
        scale = min(image.shape)
        for d in images.to_dict('records'):
            t, b = int(d['top'] * scale), int(d['bottom'] * scale)
            l, r = int(d['left'] * scale), int(d['right'] * scale)
            try:
                # clamp the lower bounds with max(): min(t - 5, 0) is never positive
                clip = image[max(t - 5, 0):min(b + 5, image.shape[0]),max(l - 5, 0):min(r + 5, image.shape[1])]
                d['text'] = ts.image_to_string(clip).strip()
            except Exception:
                print('error...')
                d['text'] = 'ERROR'
            image_to_text.append(d)
        print(f'done: {(i + 1)/len(info):.2%}', end='\r')

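    # save the collected image records, not the last page's info table
    pd.DataFrame.from_dict(image_to_text).to_csv('./data/image-text.csv.gz', index=False, compression='gzip')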

In [None]:
data = pd.read_csv('./data/image-text.csv.gz')
print(f"errors: {len(data[data['text']=='ERROR'])/len(data):.2%}")
print(f"text: {len(data[~data['text'].isna()])/len(data):.2%}")
print(f"?: {len(data[data['text'].isna()])/len(data):.2%}")

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(11, 5))
data['aspect-ratio'].plot(kind='density', ax=ax[0])
for x in [0.42, 0.60, 0.77, 1.29, 1.65, 2.35]:
    ax[0].axvline(x=x, linestyle=':')
ax[0].set_yticks([])
ax[0].set_title('Pages aspect ratio dist.')
data['scale'].plot(kind='density', ax=ax[1])
for x in [700, 1700]:
    ax[1].axvline(x=x, linestyle=':')
ax[1].set_yticks([])
ax[1].set_title('Pages scale dist.')
plt.show()

In [None]:
#source, page = data[data['scale'] < 1000].sample().iloc[0][['source','page']]
#Image.open(f'data/images/{source}-{page}.png')
len(set(data[data['scale'] < 1000]['source']))

In [None]:
data = data.loc[data['scale'] > 1000]

In [None]:
data['width'] = data['right'] - data['left']
data['height'] = data['bottom'] - data['top']
data['area'] = data['width'] * data['height']
# share of images that span most of the page in either direction
print(f"cover: {len(data[(data['width'] > 0.9)|(data['height'] > 0.6)])/len(data):.2%}")

In [None]:
data.loc[:,BOX + ['width','height']] = np.round(data.loc[:,BOX + ['width','height']] * 100)
data['area'] = data['width'] * data['height']

In [None]:
plt.plot(data['area'].sort_values(ascending=False).values)
plt.title('Rank by covered area')
plt.axvline(x=700, linestyle=':')
plt.show()

In [None]:
data = data.sort_values('area', ascending=False)
test = data.iloc[:700]
data = data.iloc[700:]

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(11, 11))
# the canvas must reach the far edges of the boxes
w, h = int(data['right'].max()), int(data['bottom'].max())
matrix = np.zeros((h, w))
for t, l, b, r in data[~data['text'].isna()][BOX].values.astype(int):
    matrix[t:b,l:r] += 1
ax[0].imshow(matrix/np.max(matrix), 'Reds')
ax[0].set_title('Small images')
    
w, h = int(test['right'].max()), int(test['bottom'].max())
matrix = np.zeros((h, w))
for t, l, b, r in test[~test['text'].isna()][BOX].values.astype(int):
    matrix[t:b,l:r] += 1
ax[1].imshow(matrix/np.max(matrix), 'Blues')
ax[1].set_title('Big images')
plt.show()

In [None]:
len(data[data['text'].isna()])/len(data)

In [None]:
len(test[test['text'].isna()])/len(test)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(11, 5))
data['area'].plot(kind='density', ax=ax[0])
ax[0].set_yticks([])
ax[0].set_title('Image covered area')
ax[1].scatter(data['width'], data['height'], s=3, alpha=0.3)
ax[1].set_title('Image shape [width, height]')
plt.show()

In [None]:
#data

### Interactive correction

    from IPython.display import display, clear_output

    image_to_text = []
    for source in [x.name[:-7] for x in Path('./data/info').glob('*.csv.gz')]:
        data = pd.read_csv(f'./data/info/{source}.csv.gz')
        images = data.loc[data['block-type']=='image']
        if len(images) == 0:
            continue
        if data.iloc[0]['scale'] < 1000:
            # not a form
            continue

        image = np.array(ImageOps.grayscale(Image.open(f'./data/images/{source}.png')))
        scale = min(image.shape)
        for d in images.to_dict('records'):
            t, b = int(d['top'] * scale), int(d['bottom'] * scale)
            l, r = int(d['left'] * scale), int(d['right'] * scale)
            # clamp the lower bounds: a negative start index would produce an empty clip
            clip = image[max(t - 10, 0):b + 10,max(l - 10, 0):r + 10]
            img = plt.imshow(clip, 'gray')
            plt.title(f"{source}   [{d['top']:.4f}, {d['bottom']:.4f}, {d['left']:.4f}, {d['right']:.4f}]")
            #img.set_data(clip)
            display(plt.gcf())
            clear_output(wait=True)        

            text = ts.image_to_string(clip).strip()
            correction = input(' '.join(text.split())+'\n')
            d['text'] = correction.strip()
            image_to_text.append(d)

    pd.DataFrame.from_dict(image_to_text).to_csv('./data/image-text.csv.gz', index=False, compression='gzip')

In [None]:
            # block-num is unreliable: got to scan through instead of simple merge on block-num + word
            #df = lines.loc[(lines['block-type']!='image')&(~lines['text'].isna()),['text'] + INFO]
            #df['text'] = df['text'].apply(str.split)
            #df = df.explode('text')
            #D, W = df['text'].values, words['text'].values
            #i, j = 0, 0
            #while i < len(W) and j < len(D):
            #    while i < len(W) and len(np.where(D == W[i])[0]) == 0:
            #        i += 1
            #    if i >= len(W):
            #        break
            #    j = np.where(D == W[i])[0][0]
            #    while i < len(W) and j < len(D) and W[i] == D[j]:
            #        words.iloc[i,-5:] = df.iloc[j,-5:]
            #        i += 1; j += 1

### Visual encoder: embeddings model
Let's start with [ResNet](https://pytorch.org/vision/0.8/_modules/torchvision/models/resnet.html) and see whether we can get away with an embedding size of 512.

In [None]:
import torch
from torch import nn
from torchvision import models

#torch._dynamo.config.verbose = True
torch.cuda.empty_cache()
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('GPU' if device == 'cuda' else 'no GPU')

In [None]:
VIEW_SIZE = 224

In [None]:
MODEL='RN18'
LATENT_DIM = models.resnet.ResNet(models.resnet.BasicBlock, [2, 2, 2, 2]).fc.in_features
print(LATENT_DIM) # embedding size

class GrayResNetEncoder(models.resnet.ResNet):
    def __init__(self, block, layers):
        self.inplanes = 64
        super(GrayResNetEncoder, self).__init__(block, layers)
        # the first layer grayscale adaptation
        self.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.fc = nn.Identity()

# ResNet18        
encoder = GrayResNetEncoder(models.resnet.BasicBlock, [2, 2, 2, 2])
# load state dicts trained in baseline-exploration
encoder.load_state_dict(torch.load('./models/visual-encoder-RN18.pt', map_location='cpu'))
# inference only: in eval mode BatchNorm uses its running statistics instead of per-batch ones
encoder.eval()

In [None]:
#!mkdir data/clips

In [None]:
#!rm -rf data/clips/*.png

index, embeddings = [],[]
for i, source in enumerate(info):
    data = pd.read_csv(f'./data/info/{source}.csv.gz')
    images = data.loc[data['block-type']=='image']
    if len(images) == 0:
        continue
    image = np.array(ImageOps.grayscale(Image.open(f'./data/images/{source}.png')))
    scale = min(image.shape)
    k = -1
    for t, l, b, r in (images[BOX]  * scale).astype(int).values:
        k += 1
        clip = image[max(t - 5, 0):min(b + 5, image.shape[0]),max(l - 5, 0):min(r + 5, image.shape[1])]
        if min(clip.shape) > scale//2:
            # cover image
            continue
        if min(clip.shape) == 0:
            text = 'ZERO'
            continue
        else:
            #try:
            #    text = ts.image_to_string(clip).strip()
            #except:
            #    text = 'ERROR'
            
            size = tuple((np.array(clip.shape) * VIEW_SIZE/max(clip.shape)).astype(int))[::-1]
            try:
                clip = Image.fromarray(clip).resize(size)
                path = f'./data/clips/{source}-C{k}.png'
            except:
                continue
            else:
                clip.save(path)                
                clip = 255. - np.array(clip).astype(float)
                mn, mx = np.min(clip), np.max(clip)
                if mn == mx:
                    text = 'EMPTY'
                    continue
                else:                
                    img = np.zeros((VIEW_SIZE, VIEW_SIZE))
                    h, w = (VIEW_SIZE - clip.shape[0])//2, (VIEW_SIZE - clip.shape[1])//2
                    img[h:h + clip.shape[0],w:w + clip.shape[1]] = (clip - mn)/(mx - mn)
                    with torch.no_grad():
                        vec = encoder(torch.Tensor(img.reshape((1, 1, VIEW_SIZE, VIEW_SIZE))))
                
        embeddings.append(vec.numpy().squeeze())
        index.append({'source':source, 'path':path })
        
    print(f'done: {(i + 1)/len(info):.2%}', end='\r')

print(f'processed: {len(index)} images')
pd.DataFrame.from_dict(index).to_csv('./data/clips.csv.gz', index=False, compression='gzip')
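
The preprocessing inside the loop above (invert, min-max normalize, center on a square canvas) can be read more easily in isolation. The sketch below repeats only that step on a synthetic clip; the helper name `to_view` and the `demo` array are hypothetical, and the resize to at most `VIEW_SIZE` pixels per side is assumed to have happened already.

```python
import numpy as np

VIEW_SIZE = 224


def to_view(clip, view_size=VIEW_SIZE):
    """Center an already resized grayscale clip on a square canvas.

    Returns None for blank clips. Hypothetical helper; assumes both sides
    of `clip` are at most `view_size` pixels.
    """
    clip = 255.0 - clip.astype(float)          # invert: ink becomes bright
    mn, mx = clip.min(), clip.max()
    if mn == mx:                               # uniform clip carries no signal
        return None
    view = np.zeros((view_size, view_size))
    h = (view_size - clip.shape[0]) // 2       # vertical offset
    w = (view_size - clip.shape[1]) // 2       # horizontal offset
    view[h:h + clip.shape[0], w:w + clip.shape[1]] = (clip - mn) / (mx - mn)
    return view


demo = np.full((100, 224), 255, dtype=np.uint8)   # white 100x224 clip
demo[40:60, 50:150] = 0                            # black rectangle as a stand-in logo
view = to_view(demo)
print(view.shape, view.min(), view.max())
```

Padding with zeros after the inversion means the canvas around the clip has the same value as blank paper, so the encoder sees no artificial border.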

In [None]:
np.array(embeddings).shape

In [None]:
embeddings = np.array(embeddings).squeeze()

pca = PCA(n_components=3)
norm = StandardScaler().fit(embeddings)
pca.fit(norm.transform(embeddings))
pca.explained_variance_ratio_
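
For reference, the explained variance ratio reported above can be reproduced from the singular values of the centered data. The sketch below uses plain NumPy on synthetic data; the array `X` and its scales are made up, and it only centers the data, whereas the notebook also scales each feature with `StandardScaler`.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for the embeddings: 200 samples, 8 features,
# with most of the variance injected into the first two columns
X = rng.normal(size=(200, 8))
X[:, 0] *= 10.0
X[:, 1] *= 5.0

# center only, so that the injected scales stay visible
Xc = X - X.mean(axis=0)
# squared singular values are proportional to the variance along each component
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
ratio = S ** 2 / (S ** 2).sum()
print(np.round(ratio[:3], 3))

Y_demo = Xc @ Vt[:3].T   # projection on the top three components
```

If the first three ratios add up to only a small share of the total, the scatter plots of the projection show a correspondingly small part of the structure in the embeddings.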

In [None]:
Y = pca.transform(norm.transform(embeddings))

fig, ax = plt.subplots(1, 3, figsize=(10, 3))
# pairwise scatter plots of the top three principal components
for i,j in [[0,1],[1,2],[2,0]]:
    ax[i].scatter(Y[:,i], Y[:,j], s=3, alpha=0.1)
plt.show()

In [None]:
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, n_init=100)
# partial_fit performs only a single mini-batch update; fit_predict runs the full fit
C3 = model.fit_predict(Y)

fig, ax = plt.subplots(1, 3, figsize=(10, 3))
# the same projections colored by k-means cluster
for i,j in [[0,1],[1,2],[2,0]]:
    ax[i].scatter(Y[:,i], Y[:,j], s=3, c=C3, cmap='rainbow', alpha=0.1)
plt.show()

#### Exploring high-dimensional data with [t-SNE](https://distill.pub/2016/misread-tsne/)

In [None]:
T = pd.DataFrame(TSNE(n_components=2, perplexity=90).fit_transform(np.array(embeddings)), columns=['x','y'])

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(T['x'], T['y'], c=C3, cmap='rainbow', alpha=0.5, s=5)
ax.set_title('Image-clip embeddings (t-SNE) colored by k-means cluster')
plt.show()

In [None]:
from sklearn.cluster import SpectralClustering

for c in range(3):
    t = T.loc[C3==c]
    S = SpectralClustering(n_clusters=7, assign_labels='discretize').fit_predict(t.values)
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(t['x'], t['y'], c=S, cmap='rainbow', alpha=0.5, s=5)
    ax.set_title(f'Cluster C{c} spectral subclusters')
    plt.show()

In [None]:
from sklearn.cluster import DBSCAN

for c in range(3):
    t = T.loc[C3==c]
    S = DBSCAN(eps=2., min_samples=5).fit_predict(t.values)
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(t['x'], t['y'], c=S, cmap='rainbow', alpha=0.5, s=5)
    ax.set_title(f'Cluster C{c} subclusters')
    plt.show()
