# Fine-tuning LayoutLMv3 on DocVQA

We try to reproduce the experiments for fine-tuning LayoutLMv3 on DocVQA, using both the extractive and the abstractive approach.

We try to document every detail in this repository. Note that this is **not** the official LayoutLMv3 codebase.

**Work in progress.**

## Install Requirements

```bash
pip3 install -r requirements.txt
```

## Dataset Preprocessing

Some of the code in this repository is adapted from the docvqa repo, which targets "LayoutLMv1 for DocVQA".

Note that the test set from the docvqa repo does not come with ground-truth answers.

1. Download the dataset from the DocVQA website and put the `docvqa` folder under the `data` folder.

2. Run the following command to create the Hugging Face dataset:

   ```bash
   python3 -m preprocess.extract_spans
   ```

   You will then get a processed dataset called `docvqa_cached_extractive_all_lowercase_True_msr_True`. More details about the post-preprocessing statistics can be found here. The final statistics on the number of spans found are:

   | Split      | Total  | #found spans | #not found |
   |------------|-------:|-------------:|-----------:|
   | Train      | 39,643 | 36,759       | 2,704      |
   | Validation | 5,349  | 4,950        | 399        |
   | Test       | 5,188  | –            | –          |

   NOTE: The Microsoft READ API for OCR is not available. Please contact me if you want to use this dataset. (Thanks to @redthing1 for giving me access.)
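To give an intuition for where the "#found spans" and "#not found" counts come from, here is a minimal, hypothetical sketch of extractive span matching: locating the gold answer as a contiguous run of OCR tokens so that start/end indices can supervise an extractive QA head. The function name and the exact-match strategy are illustrative; the real `preprocess.extract_spans` script is more involved (subword tokenization, fuzzy matching, etc.).

```python
def find_answer_span(ocr_tokens, answer):
    """Return (start, end) token indices of the first exact match, or None.

    Matching is done on lowercased tokens, mirroring the lowercase
    preprocessing variant used in this repo.
    """
    answer_tokens = answer.lower().split()
    tokens = [t.lower() for t in ocr_tokens]
    n = len(answer_tokens)
    if n == 0:
        return None
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == answer_tokens:
            return i, i + n - 1
    return None  # such questions end up in the "#not found" column above


span = find_answer_span(["Total", "Amount", "Due:", "$12.00"], "amount due:")
print(span)  # → (1, 2)
```

Questions whose answers cannot be matched this way (paraphrased answers, OCR errors) have no extractive supervision signal, which is one motivation for also trying the abstractive/generation approach.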

## Usage

1. Run `accelerate config` to configure your distributed training environment, then launch the experiments with:

   ```bash
   accelerate launch docvqa_main.py --use_generation=0
   ```

   Set `use_generation` to `1` to use the generation model.

   My distributed training environment: 6 GPUs.

## Current Performance (Improving 🚀)

| Model | Preprocessing | OCR Engine | Validation ANLS | Test ANLS |
|---|---|---|---|---|
| LayoutLMv3-base | lowercase inputs | built-in | 68.5% | - |
| LayoutLMv3-base | lowercase inputs | Microsoft READ API | 73.3% | 74.24% |
| LayoutLMv3-base | original cased | Microsoft READ API | 72.7% | - |
| LayoutLMv3-base + Bart decoder | lowercase | Microsoft READ API | 72.5% | - |
| LayoutLMv3-base + RoBERTa-base | lowercase | Microsoft READ API | 73.0% | - |

The performance is still far behind what is reported in the paper.
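For reference, the ANLS (Average Normalized Levenshtein Similarity) numbers above can be computed roughly as follows. This is a hypothetical sketch, not the official DocVQA evaluation script; function names are illustrative, but the 0.5 truncation threshold matches the standard metric definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def anls(predictions, gold_answers, threshold=0.5):
    """Average over questions of the best per-answer similarity.

    Each question may have several acceptable gold answers; similarities
    whose normalized edit distance exceeds the threshold count as 0.
    """
    total = 0.0
    for pred, answers in zip(predictions, gold_answers):
        best = 0.0
        for gold in answers:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / max(len(predictions), 1)


print(anls(["rue des lombards"], [["Rue des Lombards"]]))  # → 1.0
```

Because ANLS lowercases and tolerates small edit distances, minor OCR noise is partially forgiven, which is why the choice of OCR engine in the table above still matters so much.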

Note: Adding a sliding window currently gives performance of around 64%; it appears to be harmful.
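For context, the sliding-window idea is to split a long token sequence into overlapping chunks so answers beyond the model's 512-token limit remain reachable. A minimal sketch, with illustrative (not the repo's actual) window and stride sizes:

```python
def sliding_windows(token_ids, window=512, stride=128):
    """Yield overlapping chunks of `token_ids`.

    Consecutive windows overlap by `stride` tokens so that an answer
    span falling on a chunk boundary is fully contained in at least
    one window. The last chunk may be shorter than `window`.
    """
    if len(token_ids) <= window:
        yield token_ids
        return
    start = 0
    while start < len(token_ids):
        yield token_ids[start:start + window]
        if start + window >= len(token_ids):
            break
        start += window - stride


chunks = list(sliding_windows(list(range(1000))))
print(len(chunks))  # → 3
```

One plausible reason this hurts in practice: windows that do not contain the answer contribute negative or ambiguous supervision, and predictions from multiple windows must be aggregated at inference time.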

## TODO

- Code for tokenization and collating (:white_check_mark:)
- Code for training (:white_check_mark:)
- Further tune the performance via hyperparameters / casing (:white_check_mark:)
- Add a decoder for generation (:white_check_mark:)
- Sliding window to handle answers matched beyond the 512-token limit (:white_check_mark:)
