
CamemBERTa: A French language model based on DeBERTa V3

This repository contains the code for training CamemBERTa, a French language model based on DeBERTa V3, which is DeBERTa V2 with ELECTRA-style pretraining using the Replaced Token Detection (RTD) objective. RTD uses a generator model, trained with the MLM objective, to replace masked tokens with plausible candidates, and a discriminator model trained to detect which tokens were replaced by the generator. Usually the generator and discriminator share the same embedding matrix, but the authors of DeBERTa V3 propose a new technique, gradient-disentangled embedding sharing (GDES), to disentangle the gradients of the shared embedding between the generator and discriminator. This is the first publicly available implementation of DeBERTa V3, and the first publicly released DeBERTa V3 model outside of the original Microsoft release.
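
The sketch below illustrates the RTD objective at a high level. It is not the repository's training code: the model callables, shapes, and greedy token choice are simplifying assumptions (ELECTRA samples replacements from the generator distribution).

import tensorflow as tf

# Illustrative sketch of ELECTRA-style RTD (not the repo's actual training step).
# Assumptions: `generator` returns MLM logits [batch, seq, vocab], `discriminator`
# returns per-token replacement logits [batch, seq], and input_ids is int32.
def rtd_loss(generator, discriminator, input_ids, masked_input_ids, mask_positions):
    gen_logits = generator(masked_input_ids)
    # Greedy choice for simplicity; ELECTRA samples from the generator distribution.
    sampled = tf.argmax(gen_logits, axis=-1, output_type=tf.int32)
    # Corrupted sequence: generator predictions at masked positions, originals elsewhere.
    corrupted = tf.where(mask_positions, sampled, input_ids)
    # Per-token binary labels: 1 if the token differs from the original, else 0.
    labels = tf.cast(tf.not_equal(corrupted, input_ids), tf.float32)
    disc_logits = discriminator(corrupted)
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=disc_logits)
    )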

Preprint Paper: https://inria.hal.science/hal-03963729/

Models: almanach/camemberta-base (discriminator) and almanach/camemberta-base-generator (generator), both on the HuggingFace model hub.

Gradient-Disentangled Embedding Sharing (GDES)

To disentangle the gradients of the shared embedding between the generator and discriminator, the authors of DeBERTa V3 add another embedding layer that is not shared between the generator and discriminator. This layer is initialized to zero and added to a copy of the generator embedding matrix with disabled gradients; it should encode the difference between the generator embedding and the discriminator embedding, in order to stop the tug-of-war between the two models in the ELECTRA objective. When training ends, the final embedding matrix of the discriminator is the sum of the generator embedding matrix and the disentangled embedding matrix.

The code for GDES is added in the TFDebertaV3Embeddings class, which is where the stop-gradient operation lives. The embedding sharing is set up in the PretrainingModel class initialization.
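
As a rough illustration, here is a minimal GDES-style embedding layer. It is a simplified sketch, not the repo's TFDebertaV3Embeddings: the class name, weight names, and initializers are assumptions.

import tensorflow as tf

class GDESEmbeddingSketch(tf.keras.layers.Layer):
    # Simplified GDES sketch (hypothetical names): the discriminator embedding is a
    # frozen copy of the generator embedding plus a zero-initialized delta matrix.
    def __init__(self, vocab_size, hidden_size, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

    def build(self, input_shape):
        # Generator embedding: updated only by the generator's MLM loss.
        self.generator_embeddings = self.add_weight(
            name="generator_embeddings",
            shape=(self.vocab_size, self.hidden_size),
            initializer="truncated_normal",
        )
        # Delta embedding: initialized to zero, updated only by the discriminator's RTD loss.
        self.delta_embeddings = self.add_weight(
            name="delta_embeddings",
            shape=(self.vocab_size, self.hidden_size),
            initializer="zeros",
        )
        super().build(input_shape)

    def call(self, input_ids):
        # stop_gradient keeps the RTD gradients from flowing back into the generator
        # embedding; at the end of training the discriminator embedding is the sum.
        discriminator_embeddings = (
            tf.stop_gradient(self.generator_embeddings) + self.delta_embeddings
        )
        return tf.gather(discriminator_embeddings, input_ids)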

Pretraining Setup

The model was trained on the French subset of the CCNet corpus (the same subset used in CamemBERT and PaGNOL) and is available on the HuggingFace model hub: CamemBERTa and CamemBERTa Generator.

To speed up the pre-training experiments, the pre-training was split into two phases. In phase 1, the model is trained with a maximum sequence length of 128 tokens for 10,000 steps with 2,000 warm-up steps and a very large batch size of 67,584. In phase 2, the maximum sequence length is increased to the full model capacity of 512 tokens for 3,300 steps with 200 warm-up steps and a batch size of 27,648.

With this setup the model sees about 133B tokens, compared to 419B tokens for CamemBERT-CCNet, which was trained for 100K steps; this represents roughly 30% of CamemBERT's full training. For a fair comparison, we also trained a RoBERTa model, CamemBERT30%, using the exact same pretraining setup but with the MLM objective.
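
The 133B figure follows directly from the two phases (batch size × steps × sequence length):

phase1_tokens = 67_584 * 10_000 * 128  # ≈ 86.5B tokens
phase2_tokens = 27_648 * 3_300 * 512   # ≈ 46.7B tokens
print(f"{(phase1_tokens + phase2_tokens) / 1e9:.0f}B")  # ≈ 133B tokens seen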

Pretraining Loss Curves

Check the HuggingFace model repo for the TensorBoard logs and plots.

Fine-tuning results

Datasets: POS tagging and Dependency Parsing (GSD, Rhapsodie, Sequoia, FSMB), NER (FTB), the FLUE benchmark (XNLI, CLS, PAWS-X), and the French Question Answering Dataset (FQuAD)

| Model | UPOS | LAS | NER | CLS | PAWS-X | XNLI | F1 (FQuAD) | EM (FQuAD) |
|---|---|---|---|---|---|---|---|---|
| CamemBERT (CCNet) | 97.59 | 88.69 | 89.97 | 94.62 | 91.36 | 81.95 | 80.98 | 62.51 |
| CamemBERT (30%) | 97.53 | 87.98 | 91.04 | 93.28 | 88.94 | 79.89 | 75.14 | 56.19 |
| CamemBERTa | 97.57 | 88.55 | 90.33 | 94.92 | 91.67 | 82.00 | 81.15 | 62.01 |

The following table compares CamemBERTa's performance on XNLI against other models under different training setups, demonstrating CamemBERTa's data efficiency.

| Model | XNLI (Acc.) | Training Steps | Tokens seen in pre-training | Dataset Size in Tokens |
|---|---|---|---|---|
| mDeBERTa | 84.4 | 500k | 2T | 2.5T |
| CamemBERTa | 82.0 | 33k | 0.139T | 0.319T |
| XLM-R | 81.4 | 1.5M | 6T | 2.5T |
| CamemBERT - CCNet | 81.95 | 100k | 0.419T | 0.319T |

Note: The CamemBERTa training step count was adjusted to be equivalent to a batch size of 8192.

How to use CamemBERTa

Our pretrained weights are available on the HuggingFace model hub; you can load them with the following code:

from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

camemberta = AutoModel.from_pretrained("almanach/camemberta-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/camemberta-base")

camemberta_gen = AutoModelForMaskedLM.from_pretrained("almanach/camemberta-base-generator")
tokenizer_gen = AutoTokenizer.from_pretrained("almanach/camemberta-base-generator")
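
For example, the generator can be used for masked-token prediction with the fill-mask pipeline (a usage sketch; the example sentence is arbitrary):

from transformers import pipeline

# Fill-mask with the generator checkpoint; the mask token is read from the tokenizer.
fill_mask = pipeline("fill-mask", model="almanach/camemberta-base-generator")
print(fill_mask(f"Le camembert est {fill_mask.tokenizer.mask_token} !"))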

We also provide the TF2 weights, including the discriminator's RTD head and the generator's MLM head.

CamemBERTa is compatible with most finetuning scripts from the transformers library.

Features:

  • XLA support
  • FP16 support
  • Horovod support
  • Tensorflow Strategy support
  • Customizable Generator depth and width
  • Export to PyTorch
  • Relatively easy extension to other models

Data Preparation

The data prep code is a verbatim copy from NVIDIA's ELECTRA TF2 implementation.

Following NVIDIA, we recommend using Docker/Singularity for setting up the environment and training.

Pretraining

We used Singularity for training on an HPC cluster running OAR (not SLURM), but local pretraining should look something like the command shown below. Check the configuration_deberta_v2.py file and the configs folder for the configuration options.

SLURM users need to add tf.distribute.cluster_resolver.SlurmClusterResolver to the official_utils.misc.distribution_utils.get_distribution_strategy() function, as sketched below. You can also check the NVIDIA BERT TF2 repository for more advanced ways to run the pretraining.
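
A minimal sketch of what that could look like, assuming a recent TF2 and that get_distribution_strategy() is patched to return this strategy (the helper name here is illustrative):

import tensorflow as tf

def slurm_multi_worker_strategy():
    # Resolve the cluster spec from SLURM environment variables, then build a
    # multi-worker strategy around it.
    resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
    return tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)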

python run_pretraining.py --config_file=configs/p1_local_1gpu.json

Finetuning

We suggest post-processing the model using the postprocess_pretrained_ckpt.py and convert_to_pt.py scripts, and then using PyTorch and HuggingFace's transformers library to finetune the model.

Note: After conversion, check that the model's config.json and the tokenizer configs are correct.
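
Once converted, the checkpoint can be finetuned like any other transformers model. A minimal loading sketch, where "path/to/converted_model" is a placeholder for the conversion output and the label count is arbitrary:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the converted checkpoint for a classification task (e.g. XNLI has 3 labels).
tokenizer = AutoTokenizer.from_pretrained("path/to/converted_model")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/converted_model", num_labels=3
)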

Notes:

Treat this as research code: you might find some small bugs, so be patient.

  • When pretraining from scratch, verify that the IDs of the control tokens are correct in the config classes.

  • This code base is mostly based on the NVIDIA ELECTRA TF2 implementation for the ELECTRA objective, the NVIDIA BERT TF2 implementation for the TF2 training loop (which supports Horovod as well as TF2 strategy training), and the DeBERTa V2 implementation from HuggingFace transformers.

  • We changed the pretraining code to accept only config files, instead of the more cumbersome argument handling of the original implementation :).

  • We verified that Horovod works in multi-node mode, but for the TF2 strategy we only verified multi-GPU training (the logging code, at least, might break in the strategy implementation).

  • Training on TPU runs but is very slow; check huggingface/transformers#18239 for more info. (You will also have to manually disable some logging lines, since they cause issues with the TPU.)

  • The repo also supports training with the MLM or ELECTRA objectives, with DeBERTa V2 and RoBERTa. The pretraining code could be improved with some abstractions to make it easier to add support for other models.

  • Training (finetuning) DeBERTa V2 is ~30% slower than RoBERTa or BERT models even with XLA and FP16.

License

This code is licensed under the Apache License 2.0. The public model weights are licensed under MIT License.

Citation

The paper was accepted to Findings of ACL 2023.

You can use the preprint citation for now:

@article{antoun2023camemberta,
  TITLE = {{Data-Efficient French Language Modeling with CamemBERTa}},
  AUTHOR = {Antoun, Wissam and Sagot, Beno{\^i}t and Seddah, Djam{\'e}},
  URL = {https://inria.hal.science/hal-03963729},
  NOTE = {working paper or preprint},
  YEAR = {2023},
  MONTH = Jan,
  PDF = {https://inria.hal.science/hal-03963729/file/French_DeBERTa___ACL_2023%20to%20be%20uploaded.pdf},
  HAL_ID = {hal-03963729},
  HAL_VERSION = {v1},
}

Contact

Wissam Antoun: wissam (dot) antoun (at) inria (dot) fr

Benoit Sagot: benoit (dot) sagot (at) inria (dot) fr

Djame Seddah: djame (dot) seddah (at) inria (dot) fr

About

Code for training CamemBERTa, a French language model based on DeBERTa V3. Development is done at https://gitlab.inria.fr/almanach/CamemBERTa
