# OCR Agent from scratch
Let's build a custom OCR agent with specific requirements:

* high privacy (sensitive data processing);
* ability to find and read random alphanumeric codes (identification) with high precision;
* ability to learn from human feedback.

We won't be using any OCR libraries or services.

We extract layout features using `OpenCV` and train a local, nearsighted (char-level) agent to identify each patch and to traverse the patches in the right order, so that text and table data are aggregated in a meaningful way.
A modern laptop should be able to handle this project.
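
To make the aggregation step more concrete, here is a minimal sketch of how char-level patches could be grouped into text lines and read left to right. It is an illustration only: the `Patch` class, the function names, and the thresholds (`min_overlap`, `space_ratio`) are assumptions made for this example, not the interfaces used in the notebooks.

```python
from dataclasses import dataclass


@dataclass
class Patch:
    """Hypothetical char-level patch: bounding box plus the predicted character."""
    x: int
    y: int
    w: int
    h: int
    char: str


def group_into_lines(patches, min_overlap=0.5):
    """Group patches into text lines by vertical overlap, top to bottom."""
    lines = []  # each line is a list of patches
    for p in sorted(patches, key=lambda q: q.y + q.h / 2):
        for line in lines:
            last = line[-1]
            # Height of the vertical intersection between the two boxes.
            overlap = min(p.y + p.h, last.y + last.h) - max(p.y, last.y)
            if overlap >= min_overlap * min(p.h, last.h):
                line.append(p)
                break
        else:
            lines.append([p])  # no existing line fits: start a new one
    # Within each line, order the patches left to right.
    return [sorted(line, key=lambda q: q.x) for line in lines]


def read_line(line, space_ratio=0.6):
    """Join characters left to right; a wide horizontal gap becomes a space."""
    text = line[0].char
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.w)
        if gap > space_ratio * max(prev.h, cur.h):
            text += " "
        text += cur.char
    return text


def read_page(patches):
    """Return the page text as a list of lines in reading order."""
    return [read_line(line) for line in group_into_lines(patches)]
```

Calling `read_page` on an unordered list of patches returns one string per detected line. A real traversal also has to deal with tables, multi-column layouts, and skewed scans, which this sketch ignores.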

Then we explore `LLM` integration strategies.

As the example workload we use tax forms. The problem is stated as visual layout understanding that enables high-precision data extraction from multi-page scanned documents (sets of page images), with a large number of different classes to recognize and sensitive user data that must be properly protected.

![Reader-walk record with matplotlib](assets/reader-walk.gif)


### Content

* [Data](Data-Sources.ipynb)
* [Utilities](Data-Processing.ipynb)
* Extract layout features and visual tokens
    * [Cells and grid-lines (tables)](Data-Extraction-1.ipynb)
    * [Text-lines, word-level objects, char-level tokens](Data-Extraction-2.ipynb)
* Generate training data
    * [Labeling](Data-Extraction-3.ipynb)
    * [Pipeline](Data-Pipeline.ipynb)
    * [Datasets](Datasets.ipynb)
* Model architecture
    * [Visual encoder, generative and discriminative heads](Model-Backbone.ipynb)
    * [Unsupervised and semi-supervised pretraining](Model-Pretraining.ipynb)
    * [Supervised multi-task training](Model-Training.ipynb)
* Traversal strategies
    * [Layout traversal](Traversal-Layout.ipynb)
    * [Text aggregation](Traversal-Text.ipynb)
    * [Form extraction and validation](Traversal-Form.ipynb)
* Reader Agent
    * [Wire language model in](Agent-LM.ipynb)
    * [Set RAG utilities](Agent-RAG.ipynb)
    * [Define FSM](Agent-FSM.ipynb)
    * [Reinforcement learning setup](Agent-RL.ipynb)
* [Leverage synthetic training data](Data-Gen.ipynb)
* [Optimization for production](Optimization.ipynb)


### Environment

    root/
    ├── Dockerfile
    ├── requirements.txt
    ├── init.cnf               -- example of env-configuration file
    ├── ...
    │
    ├── notebooks/             -- jupyter notebooks server root
    │   ├── ...
    │   ├── data/
    │   │   ├── ...
    │   │   ├── forms/         -- original PDF files (multi-page)
    │   │   ├── images/        -- images of pages
    │   │   ├── content/       -- extracted textual content and layout data
    │   │   ├── inputs/        -- extracted form inputs data
    │   │   ├── training/      -- extracted labeled samples
    │   │   ├── feedback/      -- human labeled samples
    │   │   └── ...
    │   ├── ...
    │   ├── models/            -- trained models
    │   ├── output/            -- training outcome: history and plots
    │   ├── scripts/           -- local python library
    │   ├── runs/              -- tensorboard logs
    │   └── ...
    │   
    └── app/                   -- frontend for human interaction
        └── ...    
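
If you want to recreate the working folders before running the notebooks, a small helper along these lines may be enough. This is a sketch under assumptions: the `scaffold` function is hypothetical and not part of the repository, and it creates only the folders named explicitly in the tree above.

```python
from pathlib import Path

# Folder names are taken from the tree above; the helper itself is hypothetical.
DATA_DIRS = ("forms", "images", "content", "inputs", "training", "feedback")
WORK_DIRS = ("models", "output", "scripts", "runs")


def scaffold(root):
    """Create the notebooks/ working folders under `root` if they are missing."""
    notebooks = Path(root) / "notebooks"
    paths = [notebooks / "data" / name for name in DATA_DIRS]
    paths += [notebooks / name for name in WORK_DIRS]
    for path in paths:
        # exist_ok keeps the call safe to repeat on an existing checkout.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

For example, `scaffold(".")` run from the repository root creates any missing folders and leaves existing ones untouched.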


In [1]:

    !python --version

    Python 3.10.13


In [2]:

    !nvidia-smi

    Thu Apr  4 21:42:10 2024
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |   0  NVIDIA GeForce GTX 1660 Ti     Off | 00000000:08:00.0  On |                  N/A |
    | 38%   30C    P8              17W / 120W |    331MiB /  6144MiB |     27%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+


In [3]:

    import cv2; print(cv2.__version__)

    4.9.0


In [4]:

    import torch; print('GPU' if torch.cuda.is_available() else 'CPU'); print(torch.__version__)

    GPU
    2.1.2+cu121


In [5]:

    #!jupyter nbconvert --to markdown README.ipynb --output ../README.md