CRFIE

This repo contains the code used for the ACL 2023 paper Modeling Instance Interactions for Joint Information Extraction with Neural High-Order Conditional Random Field.

Requirements

Python 3.7
Python packages
- PyTorch 1.0+ (Install the CPU version if you use this tool on a machine without GPUs)
- transformers 3.0.2 (Using transformers 3.1+ may cause some model loading issue)
- tqdm
- lxml
- nltk

Getting Started

Pre-processing

Our preprocessing mainly adapts from OneIE.

Preprocess DyGIE++

The prepreocessing/process_dygiepp.py script converts datasets in DyGIE+ format to the input format.
Example:

python preprocessing/process_dygiepp.py -i train.json -o train.oneie.json

Arguments:

-i, --input: Path to the input file.
-o, --output: Path to the output file.

Preprocess ACE2005

The prepreocessing/process_ace.py script converts raw ACE2005 datasets to the input format. Example:

python preprocessing/process_ace.py -i <INPUT_DIR>/LDC2006T06/data -o <OUTPUT_DIR> \
      -s resource/splits/ACE05-E -b bert-large-cased -c <BERT_CACHE_DIR> -l english

Arguments:

-i, --input: Path to the input directory (data folder in your LDC2006T06 package).
-o, --output: Path to the output directory.
-b, --bert: Bert model name.
-c, --bert_cache_dir: Path to the BERT cache directory.
-s, --split: Path to the split directory. We provide document id lists for all datasets used in our paper in resource/splits.
-l, --lang: Language (options: english, chinese).

Preprocess ERE

The prepreocessing/process_ere.py script converts raw ERE datasets (LDC2015E29, LDC2015E68, LDC2015E78, LDC2015E107) to the input format.

python preprocessing/process_ere.py -i <INPUT_DIR>/data -o <OUTPUT_DIR> -b bert-large-cased -c <BERT_CACHE_DIR> -l english -d normal

Arguments:

-i, --input: Path to the input directory (data folder in your ERE package).
-o, --output: Path to the output directory.
-b, --bert: Bert model name.
-c, --bert_cache_dir: Path to the BERT cache directory.
-d, --dataset: Dataset type: normal, r2v2, parallel, or spanish.
-l, --lang: Language (options: english, spanish).

This script currently supports:

LDC2015E29_DEFT_Rich_ERE_English_Training_Annotation_V1
LDC2015E29_DEFT_Rich_ERE_English_Training_Annotation_V2
LDC2015E68_DEFT_Rich_ERE_English_Training_Annotation_R2_V2
LDC2015E78_DEFT_Rich_ERE_Chinese_and_English_Parallel_Annotation_V2
LDC2015E107_DEFT_Rich_ERE_Spanish_Annotation_V2

Training

cd to the root directory of this package
Set the environment variable PYTHONPATH to the current directory. For example, if you unpack this package to ~/High-order-IE, run: export PYTHONPATH=~/High-order-IE

Because our framework is a pipeline schema, you should first train the Node Identification model and save the checkpoint in a directory.

Run the command line to train an identification model: python train_ident.py -c <CONFIG_FILE_PATH>.

Then train the high-order classification model.

python train.py -c <CONFIG_FILE_PATH>.
One example configuration file is in config/baseline.json. Fill in the following paths in the configuration file:
- BERT_CACHE_DIR: Pre-trained BERT models, configs, and tokenizers will be downloaded to this directory.
- TRAIN_FILE_PATH, DEV_FILE_PATH, TEST_FILE_PATH: Path to the training/dev/test/files.
- OUTPUT_DIR: The model will be saved to subfolders in this directory.
- VALID_PATTERN_DIR: Valid patterns created based on the annotation guidelines or training set. Example files are provided in resource/valid_patterns.
- Set NER_SCORE and SPLIT_TRAIN to be true: Our base pipeline model with different scoring functions of OneIE.
- IDENT_MODEL_PATH: Path to a checkpoint of the saved node identification model. The following hyper-parameters control the high-order part:
- USE_*: Whether to use the corresponding high-order factor.
- DECOMP_SIZE and MFVI_ITER: Hyperparameters of mean field variational inference (refer to paper).

Evaluation

Example command line to test a file input: python predict.py -m <best.role.mdl> -i <input_dir> -o <output_dir> --format json

Arguments:
- -m, --model_path: Path to the trained model.
- -i, --input_dir: Path to the input directory. LTF format sample files can be found in the input directory.
- -o, --output_dir: Path to the output directory (json format). Output files are in the JSON format. Sample files can be found in the output directory.
- --gpu: (optional) Use GPU
- -d, --device: (optional) GPU device index (for multi-GPU machines).
- -b, --batch_size: (optional) Batch size. For a 16GB GPU, a batch size of 10~15 is a reasonable value.
- --max_len: (optional) Max sentence length. Sentences longer than this value will be ignored. You may need to decrease batch_size if you set max_len to a larger number.
- --lang: (optional) Model language.
- --format: Input file format (txt, ltf, or json).

Data Format

Processed input example:

{
    "doc_id": "AFP_ENG_20030401.0476",
    "sent_id": "AFP_ENG_20030401.0476-5",
    "entity_mentions": [
        {
            "id": "AFP_ENG_20030401.0476-5-E0",
            "start": 0,
            "end": 1,
            "entity_type": "GPE",
            "mention_type": "UNK",
            "text": "British"
        },
        ...
    ],
    "relation_mentions": [
        {
            "relation_type": "ORG-AFF",
            "id": "AFP_ENG_20030401.0476-5-R0",
            "arguments": [
                {
                    "entity_id": "AFP_ENG_20030401.0476-5-E1",
                    "text": "Chancellor",
                    "role": "Arg-1"
                },
                ...
            ]
        },
        ...
    ],
    "event_mentions": [
        {
            "event_type": "Personnel:Nominate",
            "id": "AFP_ENG_20030401.0476-5-EV0",
            "trigger": {
                "start": 9,
                "end": 10,
                "text": "named"
            },
            "arguments": [
                {
                    "entity_id": "AFP_ENG_20030401.0476-5-E4",
                    "text": "head",
                    "role": "Person"
                }
            ]
        }
    ],
    "tokens": [
        ...
    ],
    "pieces": [
        ...
    ],
    "token_lens": [
        ...
    ],
    "sentence": ...
}

The "start" and "end" of entities and triggers are token indices. The "arguments" of a relation refer to its head entity and tail entity.

Output example:

{
    "doc_id": "HC0003PYD",
    "sent_id": "HC0003PYD-16",
    "token_ids": [
        ...
    ],
    "tokens": [
        ...
    ],
    "graph": {
        "entities": [
            [
                3,
                5,
                "GPE",
                "NAM",
                1.0
            ],
            ...
        ],
        "triggers": [
            ...
        ],
        "relations": [
            ...
        ],
        "roles": [
            ...
        ]
    }
}

Citation

@inproceedings{jia-etal-2023-modeling,
    title = "Modeling Instance Interactions for Joint Information Extraction with Neural High-Order Conditional Random Field",
    author = "Jia, Zixia and Yan, Zhaohui and Han, Wenjuan and Zheng, Zilong and Tu, Kewei",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.766",
    doi = "10.18653/v1/2023.acl-long.766",
    pages = "13695--13710"
}

Acknowledgments

The codebase of this repo is extended from OneIE v0.4.8

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
config		config
input		input
output		output
preprocessing		preprocessing
resource		resource
scripts		scripts
LICENSE		LICENSE
README.md		README.md
config.py		config.py
convert.py		convert.py
data.py		data.py
debug.py		debug.py
global_feature.py		global_feature.py
graph.py		graph.py
model.py		model.py
predict.py		predict.py
scorer.py		scorer.py
sparsemax.py		sparsemax.py
train.py		train.py
train_ident.py		train_ident.py
util.py		util.py

License

bigai-nlco/CRFIE

Folders and files

Latest commit

History

Repository files navigation

CRFIE

Requirements

Getting Started

Pre-processing

Preprocess DyGIE++

Preprocess ACE2005

Preprocess ERE

Training

Evaluation

Data Format

Citation

Acknowledgments

About

Topics

Resources

License

Stars

Watchers

Forks

Languages