Cross-Document Semantic Enhancement for NER

This is the code for paper ''Read Extensively, Focus Smartly: A Cross-document Semantic Enhancement Method for Visual Documents NER''.

Introduction

This work proposes a cross-document semantic enhancement method, which consists of two modules:

To prevent distractions from irrelevant regions in the current document, we design a learnable attention mask mechanism, which is used to adaptively filter redundant information in the current document.
To further enrich the entity-related context, we propose a cross-document information awareness technique, which enables the model to collect more evidence across documents to assist in prediction.

Environment

conda create -n layoutlmft python=3.7
conda activate layoutlmft
pip install -r requirements.txt
pip install -e .

Prepare Dataset

Prepare train.json train.zip val.json val.zip test.json test.zip in ./data/.

Chinese XFUND Dataset can be downloaded from

链接: https://pan.baidu.com/s/1tKPZaWBPKDTtlyn1926Swg 提取码: suwv

*.zip is the zip file of images.

*.json is OCR result of images. Sample in *.json is shown as follows.

{
    "id": "zh_val_0", 
    "uid": "0ac15750a098682aa02b51555f7c49ff43adc0436c325548ba8dba560cde4e7e", 
    "document": [
        {
            "box": [1958, 144, 2184, 198], 
            "text": "Maribyrnong", 
            "label": "other", 
            "words": [{"box": [1959, 144, 2182, 199], "text": "Maribyrnong"}], 
            "linking": [], 
            "id": 1
        }, 
        .....
    ],
    "img": {
        "fname": "zh_val_0.jpg", 
        "width": 2480, 
        "height": 3508
    }
}

Training

Modify the _URL of dataset in ./layoutlmft/data/datasets/xfun.py.
Modify the CUDA_VISIBLE_DEVICES in run.sh.
Modify the output_dir in run.sh.
Simply run sh run.sh to start training.

The fine-tuned model (f1 score: 0.904) can be downloaded from

链接: https://pan.baidu.com/s/1Q98LlHJZ5ADwbqzKzU9jfQ 提取码: nhbf

Offline Inference and Evaluation

We provide a demo of forward inference and evaluation.

Modify the _URL of dataset in ./layoutlmft/data/datasets/xfun.py.
Modify the CUDA_VISIBLE_DEVICES in test.sh.
Modify the model_name_or_path (the output_dir of training) in test.sh.
Modify the output_dir in test.sh.
Simply run sh test.sh to start inference and evaluation.

The inference result test_predictions.txt and evaluation result test_results.json are both in output_dir.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
examples		examples
layoutlmft.egg-info		layoutlmft.egg-info
layoutlmft		layoutlmft
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
run.sh		run.sh
setup.cfg		setup.cfg
setup.py		setup.py
test.sh		test.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cross-Document Semantic Enhancement for NER

Introduction

Environment

Prepare Dataset

Training

Offline Inference and Evaluation

License

About

Releases

Packages

Languages

XinZhao0211/cross-document-ner

Folders and files

Latest commit

History

Repository files navigation

Cross-Document Semantic Enhancement for NER

Introduction

Environment

Prepare Dataset

Training

Offline Inference and Evaluation

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages