Skip to content

Code source of the paper ```Evaluating Tokenizaer Impact on OOVs Representation with Transformers Models``` to be presented at LREC 2022.

License

Notifications You must be signed in to change notification settings

alexandrabenamar/evaluating_tokenizers_oov

Repository files navigation

Evaluating Tokenizers Impact on OOVs Representation with Transformers Models

1. Description

Code source of the paper Evaluating Tokenizaer Impact on OOVs Representation with Transformers Models to be presented at LREC 2022.

Abstract

Transformer models have achieved significant improvements in multiple downstream tasks in recent years. One of the main contributions of Transformers is their ability to create new representations for out-of-vocabulary (OOV) words. In this paper, we have evaluated three categories of OOVs: (A) new domain-specific terms (e.g., "eucaryote" in microbiology), (B) misspelled words containing typos, and (C) cross-domain homographs (e.g., "arm" has different meanings in a clinical trial and in anatomy). We use three French domain-specific datasets on the legal, medical, and energetical domains to robustly analyze these categories. Our experiments have led to exciting findings that showed: (1) It is easier to improve the representation of new words (A and B) than it is for words that already exist in the vocabulary of the Transformer models (C), (2) To ameliorate the representation of OOVs, the most effective method relies on adding external morpho-syntactic context rather than improving the semantic understanding of the words directly (fine-tuning) and (3) We cannot foresee the impact of minor misspellings in words because similar misspellings have different impacts on their representation. We believe that tackling the challenges of processing OOVs regarding their specificities will significantly help the domain adaptation aspect of BERT.

2. Installation

Install the package with pip:

pip install .

Install the package with conda:

conda install -f environment.yml

Make sure all the dependencies have been installed using:

pip install -r requirements.txt

3. How to ?

Gallica Extractor

To extract journals from gallica, use the script src/data/gallica.py. In the original paper, we used "Journal of Microbiology" that you can change in the script (line 45):

PRESSE_MEDICALE = [("journal_microbiologie", "http://gallica.bnf.fr/ark:/12148/cb34348753q/date", 1887, 1900)]

You can add other newspapers using the format ("name of the paper", "gallica link", "start date", "end date").

Preprocess Data

Finetune Language Model

Cosine Similarity

Evaluation Metrics

Two evaluation metrics are implemented in the file src/evaluating_tokenizers_oov/eval/eval_metrics.py : Dice coefficient and Dice-SU coefficient, presented in the paper.

About

Code source of the paper ```Evaluating Tokenizaer Impact on OOVs Representation with Transformers Models``` to be presented at LREC 2022.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages