RE2

This is the official PyTorch implementation of RE2: Region-Aware Relation Extraction from Visually Rich Documents. [arXiv]

[Figure: RE2 framework overview]

Installation

For CUDA 11.0

conda create -n re2 python=3.7
conda activate re2
git clone ###to insert git repo url#####
cd <cloned repository>
pip install -r requirements.txt
pip install -e .

For other CUDA 11.x versions

conda create -n re2 python=3.7
conda activate re2
git clone ###to insert git repo url#####
cd <cloned repository>
pip install -r requirements_2.txt
pip install -e .

Alternatively, check your installed Detectron2/PyTorch versions and modify the requirements.txt file accordingly.
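
If you are unsure which versions are installed, the short Python sketch below (not part of the repository) prints the PyTorch, CUDA, and Detectron2 versions so you can adjust requirements.txt to match.

# Minimal version check; Detectron2 may not be installed yet.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA build version:", torch.version.cuda)

try:
    import detectron2
    print("Detectron2:", detectron2.__version__)
except ImportError:
    print("Detectron2 is not installed")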

Datasets

In this repository, we provide the fine-tuning code for FUNSD, XFUND, and our proposed dataset DiverseForm.

For FUNSD, create a directory FUNSD with two subdirectories, funsd_test and funsd_train. Each subdirectory contains an annotations folder and an images folder.

├── FUNSD
│   ├── funsd_test
│   │   ├── annotations
│   │   ├── images
│   ├── funsd_train
│   │   ├── annotations
│   │   ├── images
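
As a convenience, the following Python sketch (an assumed helper, not part of the repository) creates that layout; copy the FUNSD annotation and image files into the corresponding folders afterwards.

# Create the expected FUNSD directory layout shown above.
import os

for split in ("funsd_train", "funsd_test"):
    for sub in ("annotations", "images"):
        os.makedirs(os.path.join("FUNSD", split, sub), exist_ok=True)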

Available Paragraph/Tabular level region files

| Dataset | Download |
| --- | --- |
| FUNSD | Download link |
| XFUND | Download link |
| DiverseForm | Download link |

Put the unzipped files in the RE2 folder.
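
For example, a downloaded archive can be extracted into the RE2 folder with a short Python sketch like the one below (the archive name is a placeholder, not the actual file name).

# Extract a downloaded region-file archive into the RE2 folder.
import zipfile

with zipfile.ZipFile("regions.zip") as zf:  # placeholder archive name
    zf.extractall("RE2")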

Creating custom Paragraph/Tabular region files

To create paragraph-level region files

python region_extraction/easyOCR.py \
--para true \
--ip_dir /path/to/folder/with/images \
--op_dir /path/to/folder/for/output

To create line-level region files

python region_extraction/easyOCR.py \
--para false \
--ip_dir /path/to/folder/with/images \
--op_dir /path/to/folder/for/output

To create tabular-level region files

python region_extraction/table.py \
--ip_dir /path/to/folder/with/images \
--op_dir /path/to/folder/for/output

To combine the region files

python region_extraction/combine.py \
--lang {en,pt...} \
--ip_path /path/to/folder/with/images \
--table /path/to/table.json \
--easy_para /path/to/easy_para.json \
--easy_line /path/to/easy_line.json \
--op_path /path/to/output/folder

Use either the paragraph-only or the combined region file, as appropriate. Rename the file to xfun_custom.json and place it in the re2 folder.
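
A minimal Python sketch for that last step is shown below; the source file name is a placeholder for whichever region file (paragraph-only or combined) you chose.

# Copy the chosen region file into the re2 folder under the expected name.
import shutil

shutil.copy("/path/to/output/folder/regions.json",  # placeholder file name
            "re2/xfun_custom.json")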

Available Checkpoints

| Dataset (Language) | Download |
| --- | --- |
| FUNSD (EN) | Download link |
| XFUND (ZH) | Download link |
| XFUND (JA) | Download link |
| XFUND (ES) | Download link |
| XFUND (FR) | Download link |
| XFUND (IT) | Download link |
| XFUND (DE) | Download link |
| XFUND (PT) | Download link |
| DiverseForm | Download link |

Language-specific fine-tuning on XFUND (for example, ZH)

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_re_inf.py \
--model_name_or_path microsoft/layoutxlm-base \
--output_dir path/to/output/directory \
--do_train \
--do_eval \
--lang zh \
--max_steps 5000 \
--per_device_train_batch_size 4 \
--warmup_ratio 0.1 \
--fp16

Language-specific fine-tuning on FUNSD

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_re_inf.py \
--model_name_or_path microsoft/layoutxlm-base \
--output_dir path/to/output/directory \
--do_train \
--do_eval \
--lang en \
--path_to_dataset /path/to/dataset \
--max_steps 5000 \
--per_device_train_batch_size 4 \
--warmup_ratio 0.1 \
--fp16

Language-specific fine-tuning on DiverseForm

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_re_inf.py \
--model_name_or_path microsoft/layoutxlm-base \
--output_dir path/to/output/directory \
--do_train \
--do_eval \
--lang en \
--custom True \
--path_to_dataset /path/to/dataset \
--max_steps 5000 \
--per_device_train_batch_size 4 \
--warmup_ratio 0.1 \
--fp16

Multitask fine-tuning on XFUND (for example, ZH)

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_re_inf.py \
--model_name_or_path microsoft/layoutxlm-base \
--output_dir path/to/output/directory \
--do_train \
--do_eval \
--lang zh \
--additional_langs en+de+es+pt+it+fr+ja \
--max_steps 5000 \
--per_device_train_batch_size 4 \
--warmup_ratio 0.1 \
--fp16

Run inference from a model checkpoint

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_re_inf.py \
--model_name_or_path /path/to/model_checkpoint \
--path_to_dataset /path/to/dataset \
--output_dir path/to/output/directory \
--do_predict

The directory structure of the data used for inference is as follows:

├── DATA
│   ├── annotations
│   ├── images

Refer to the sample_inference folder for a sample of the required data.
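
Before running --do_predict, it can help to confirm that every image has a matching annotation file. The Python sketch below assumes one JSON annotation per image with the same base name, which may differ from your data.

# Check that each image in DATA/images has an annotation in DATA/annotations.
import os

data_dir = "DATA"
annotation_names = {os.path.splitext(f)[0] for f in os.listdir(os.path.join(data_dir, "annotations"))}

for image_file in os.listdir(os.path.join(data_dir, "images")):
    if os.path.splitext(image_file)[0] not in annotation_names:
        print("Missing annotation for", image_file)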
