An Information Extraction Study: Take In Mind the Tokenization!

Official code repository of the paper "An Information Extraction Study: Take In Mind the Tokenization!" (PyTorch implementation). For a detailed description of the study, see our paper: https://arxiv.org/abs/2303.15100 (to be published at EUSFLAT 2023).

Setup

Requirements

We suggest building two separate environments. For the models that use CharacterBERT as the language model, the requirements are as follows (environment 1):

  • Python 3.8+
  • PyTorch (tested with version 1.10.0)
  • cudatoolkit (tested with version 10.2)
  • sentencepiece (tested with version 0.1.95)
  • transformers (tested with version 3.3.1)
  • tqdm (tested with version 4.62.3)
  • numpy (tested with version 1.21.2)

You can also find instructions on building the environment in the official repository of CharacterBERT.

For the rest of the models, the requirements are as follows (environment 2):

  • Python 3.8+
  • PyTorch (tested with version 1.10.0)
  • cudatoolkit (tested with version 11.3.1)
  • sentencepiece (tested with version 0.1.95)
  • transformers (tested with version 4.11.3)
  • tqdm (tested with version 4.62.3)
  • numpy (tested with version 1.21.2)

We use two different environments because of version incompatibilities in the transformers library: CharacterBERT requires an older version, while CANINE requires a newer one.
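A minimal setup sketch with conda and pip is shown below; the environment names ib_ie_env1 and ib_ie_env2 are placeholders, and the exact cudatoolkit build should match your GPU driver:

    # Environment 1: models that use CharacterBERT
    conda create -n ib_ie_env1 python=3.8
    conda activate ib_ie_env1
    conda install pytorch=1.10.0 cudatoolkit=10.2 -c pytorch
    pip install sentencepiece==0.1.95 transformers==3.3.1 tqdm==4.62.3 numpy==1.21.2

    # Environment 2: the remaining models (e.g. CANINE)
    conda create -n ib_ie_env2 python=3.8
    conda activate ib_ie_env2
    conda install pytorch=1.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
    pip install sentencepiece==0.1.95 transformers==4.11.3 tqdm==4.62.3 numpy==1.21.2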

Execution steps

  • Download the CharacterBERT model [1] using the official repository. We are going to use the general and medical versions of CharacterBERT. Place the character-bert folder inside the repository folder.
  • Extract the pre-trained embeddings of the different Language Models by running the scripts under the /run_scripts/offline_embedding_extraction/ folder.
  • Train the models in the baseline setup using the scripts under the /run_scripts/pretrained_frozen_emb/ folder.
  • Train the models in the advanced setup (end-to-end training) using the scripts under the /run_scripts/e2e_training/ and /run_scripts/e2e_training_word_level/ folders. The *_word_level suffix indicates the case where word-level aggregation is used.
  • Conduct the similarity analysis using the scripts under the /run_scripts/similarity_analysis/ folder. Update the full paths of the trained models in the run scripts accordingly.
  • Use the explore_results.py script under the /similarity_analysis/ folder to print and explore the aggregated results per model. (e.g. python explore_results.py --filename 'ne_similarities.json')
  • For the hyphenation analysis, execute the script under the corresponding folder (e.g. python hyphenation_analysis.py --data 'ADE' --subword_length 4 --first_n 25 --threshold 20)
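The steps above map to commands along the following lines; the run_*.sh names are placeholders (the actual script names differ per model and dataset), so check each run_scripts subfolder for the concrete files:

    # Extract the frozen LM embeddings
    bash run_scripts/offline_embedding_extraction/run_<model>.sh
    # Baseline setup: train on the pre-extracted, frozen embeddings
    bash run_scripts/pretrained_frozen_emb/run_<model>.sh
    # Advanced setup: end-to-end training, without and with word-level aggregation
    bash run_scripts/e2e_training/run_<model>.sh
    bash run_scripts/e2e_training_word_level/run_<model>.sh
    # Similarity analysis (set the full paths of the trained models in the scripts first)
    bash run_scripts/similarity_analysis/run_<model>.sh
    # Explore the aggregated similarity results
    cd similarity_analysis
    python explore_results.py --filename 'ne_similarities.json'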

Comments

  • The trained models are stored under the /save/ folder.
  • Change the paths in the run scripts appropriately.
  • Due to the transformers version incompatibility, comment out the imports of the CANINE model/tokenizer in the pfn.py script when the CharacterBERT environment (environment 1) is used, and uncomment them when environment 2 is used (see the sketch after this list for an alternative).
  • For this study we also use code from the repositories of the referenced works, in particular CharacterBERT [1] and the Partition Filter Network (PFN) [2].
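As an alternative to manually toggling the imports, a guarded import in pfn.py would let the script load in both environments; this is only a sketch and not part of the repository:

    # Sketch: make the CANINE imports optional so pfn.py also loads under the
    # older transformers version of environment 1, where these classes do not exist.
    try:
        from transformers import CanineModel, CanineTokenizer
    except ImportError:
        CanineModel = None
        CanineTokenizer = None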

Notes

Please cite our work when using this software.

Theodoropoulos, C., & Moens, M. F. (2023). An Information Extraction Study: Take In Mind the Tokenization! arXiv preprint arXiv:2303.15100.

BibTeX:

@article{theodoropoulos2023information,
         title={An Information Extraction Study: Take In Mind the Tokenization!},
         author={Theodoropoulos, Christos and Moens, Marie-Francine},
         journal={arXiv preprint arXiv:2303.15100},
         year={2023}
}
@software{inductive_bias_IE,
          author = {Theodoropoulos, Christos},
          month = {6},
          title = {An Information Extraction Study: Take In Mind the Tokenization!},
          url = {https://github.com/christos42/inductive_bias_IE},
          version = {main},
          year = {2022}
}

References

[1] Hicham El Boukkouri, et al. 2020. CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain (Online), 6903–6915. https://doi.org/10.18653/v1/2020.coling-main.609
[2] Zhiheng Yan, et al. 2021. A Partition Filter Network for Joint Entity and Relation Extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 185–197. https://doi.org/10.18653/v1/2021.emnlp-main.17
[3] Christos Theodoropoulos, et al. 2021. Imposing Relation Structure in Language-Model Embeddings Using Contrastive Learning. In Proceedings of the 25th Conference on Computational Natural Language Learning. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 337–348. https://doi.org/10.18653/v1/2021.conll-1.27