# Fine-tuning LayoutLMv3 on DocVQA

We try to reproduce the experiments for fine-tuning LayoutLMv3 on DocVQA, using both the extractive and the abstractive approach.

We try to document every detail in this repository. Note that this is **not** the official LayoutLMv3 codebase.

**Work in progress.**

## Install Requirements

```bash
pip3 install -r requirements.txt
```

## Dataset Preprocessing

Some of the code in this repository is adapted from the docvqa repo, which targets "LayoutLMv1 for DocVQA".

Note that the test set from the docvqa repo does not come with ground-truth answers.

1. Download the dataset from the DocVQA website and put the `docvqa` folder under the `data` folder.

2. Run the following command to create the Hugging Face dataset:

   ```bash
   python3 -m preprocess.extract_spans
   ```

   You will then get a processed dataset called `docvqa_cached_extractive_all_lowercase_True_msr_True`. More details about the post-preprocessing statistics can be found here. The final statistics on the number of spans found are:

   | Split      | Total  | #found spans | #not found |
   |------------|-------:|-------------:|-----------:|
   | Train      | 39,643 | 36,759       | 2,704      |
   | Validation | 5,349  | 4,950        | 399        |
   | Test       | 5,188  | –            | –          |

   NOTE: The Microsoft READ API for OCR is not available. Please contact me if you want to use this dataset. (Thanks to @redthing1 for giving me access.)
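To give an intuition for where the "#found spans" and "#not found" counts come from, here is a minimal, hypothetical sketch of extractive span matching: locating the gold answer as a contiguous run of OCR tokens so that start/end indices can supervise an extractive QA head. The function name and the exact-match strategy are illustrative; the real `preprocess.extract_spans` script is more involved (subword tokenization, fuzzy matching, etc.).

```python
def find_answer_span(ocr_tokens, answer):
    """Return (start, end) token indices of the first exact match, or None.

    Matching is done on lowercased tokens, mirroring the lowercase
    preprocessing variant used in this repo.
    """
    answer_tokens = answer.lower().split()
    tokens = [t.lower() for t in ocr_tokens]
    n = len(answer_tokens)
    if n == 0:
        return None
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == answer_tokens:
            return i, i + n - 1
    return None  # such questions end up in the "#not found" column above


span = find_answer_span(["Total", "Amount", "Due:", "$12.00"], "amount due:")
print(span)  # → (1, 2)
```

Questions whose answers cannot be matched this way (paraphrased answers, OCR errors) have no extractive supervision signal, which is one motivation for also trying the abstractive/generation approach.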

## Usage

1. Run `accelerate config` to configure your distributed training environment, then launch the experiments with:

   ```bash
   accelerate launch docvqa_main.py --use_generation=0
   ```

   Set `use_generation` to `1` to use the generation model.

   My distributed training environment: 6 GPUs.

## Current Performance (Improving 🚀)

| Model | Preprocessing | OCR Engine | Validation ANLS | Test ANLS |
|---|---|---|---|---|
| LayoutLMv3-base | lowercase inputs | built-in | 68.5% | - |
| LayoutLMv3-base | lowercase inputs | Microsoft READ API | 73.3% | 74.24% |
| LayoutLMv3-base | original cased | Microsoft READ API | 72.7% | - |
| LayoutLMv3-base + Bart decoder | lowercase | Microsoft READ API | 72.5% | - |
| LayoutLMv3-base + RoBERTa-base | lowercase | Microsoft READ API | 73.0% | - |

The performance is still far behind what is reported in the paper.
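For reference, the ANLS (Average Normalized Levenshtein Similarity) numbers above can be computed roughly as follows. This is a hypothetical sketch, not the official DocVQA evaluation script; function names are illustrative, but the 0.5 truncation threshold matches the standard metric definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def anls(predictions, gold_answers, threshold=0.5):
    """Average over questions of the best per-answer similarity.

    Each question may have several acceptable gold answers; similarities
    whose normalized edit distance exceeds the threshold count as 0.
    """
    total = 0.0
    for pred, answers in zip(predictions, gold_answers):
        best = 0.0
        for gold in answers:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / max(len(predictions), 1)


print(anls(["rue des lombards"], [["Rue des Lombards"]]))  # → 1.0
```

Because ANLS lowercases and tolerates small edit distances, minor OCR noise is partially forgiven, which is why the choice of OCR engine in the table above still matters so much.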

Note: Adding a sliding window currently gives performance of around 64%; it appears to be harmful.
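For context, the sliding-window idea is to split a long token sequence into overlapping chunks so answers beyond the model's 512-token limit remain reachable. A minimal sketch, with illustrative (not the repo's actual) window and stride sizes:

```python
def sliding_windows(token_ids, window=512, stride=128):
    """Yield overlapping chunks of `token_ids`.

    Consecutive windows overlap by `stride` tokens so that an answer
    span falling on a chunk boundary is fully contained in at least
    one window. The last chunk may be shorter than `window`.
    """
    if len(token_ids) <= window:
        yield token_ids
        return
    start = 0
    while start < len(token_ids):
        yield token_ids[start:start + window]
        if start + window >= len(token_ids):
            break
        start += window - stride


chunks = list(sliding_windows(list(range(1000))))
print(len(chunks))  # → 3
```

One plausible reason this hurts in practice: windows that do not contain the answer contribute negative or ambiguous supervision, and predictions from multiple windows must be aggregated at inference time.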

## TODO

- Code for tokenization and collating (:white_check_mark:)
- Code for training (:white_check_mark:)
- Further tune the performance via hyperparameters / casing (:white_check_mark:)
- Add a decoder for generation (:white_check_mark:)
- Sliding window to handle answers matched beyond the 512-token limit (:white_check_mark:)
