ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Model Overview

License: MIT

Code and model for paper: "ConLID: Supervised Contrastive Learning for Low-Resource Language Identification" arXiv - 2025

TL;DR: We introduce ConLID, a model trained on the GlotLID-C dataset using Supervised Contrastive Learning. It supports 2,099 languages and is especially effective for low-resource languages.

🛠️ Setup

git clone https://github.com/epfl-nlp/ConLID.git
cd ConLID
# set the environment variables as in `.env_example`
source setup.sh

Download the models

from huggingface_hub import snapshot_download, hf_hub_download

# Download the GlotLID and ConLID models
snapshot_download(
  repo_id="epfl-nlp/ConLID",
  local_dir="checkpoints/conlid"
)
hf_hub_download(
  repo_id="cis-lmu/glotlid",
  filename="model.bin",
  local_dir="checkpoints/glotlid",
  local_dir_use_symlinks=False
)
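After downloading, a quick sanity check (not part of the repo) can confirm the files landed where the loading code expects them; the layout below is assumed from the `local_dir` arguments used above:

```python
from pathlib import Path

def expected_checkpoints(root="checkpoints"):
    """Return the checkpoint paths the download snippets above write to."""
    root = Path(root)
    return [root / "conlid", root / "glotlid" / "model.bin"]

# Report anything missing instead of failing later at model load time
missing = [p for p in expected_checkpoints() if not p.exists()]
if missing:
    print("Missing checkpoints:", missing)
```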

🤖 Usage

Use the ConLID model as follows:

from model import ConLID
model = ConLID.from_pretrained(dir='checkpoints/conlid')

# print the supported labels
print(model.get_labels())
# ['aai_Latn', 'aak_Latn', 'aau_Latn', 'aaz_Latn', 'aba_Latn', ...]

# prediction
model.predict("The cat climbed onto the roof to enjoy the warm sunlight peacefully!")
# (['eng_Latn'], [0.970989465713501])

model.predict("The cat climbed onto the roof to enjoy the warm sunlight peacefully!", k=3)
# (['eng_Latn', 'sco_Latn', 'jam_Latn'], [0.970989465713501, 0.006496887654066086, 0.00487488554790616])
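Since the top-k output pairs labels with probabilities, downstream filtering is straightforward. A minimal sketch (the `filter_predictions` helper is hypothetical, not part of the repo) that keeps only predictions above a confidence threshold, using the example top-3 output above:

```python
def filter_predictions(labels, probs, threshold=0.01):
    """Keep (label, prob) pairs whose probability meets the threshold."""
    return [(l, p) for l, p in zip(labels, probs) if p >= threshold]

# Example top-3 output, as returned by model.predict(..., k=3) above
labels = ['eng_Latn', 'sco_Latn', 'jam_Latn']
probs = [0.970989465713501, 0.006496887654066086, 0.00487488554790616]

print(filter_predictions(labels, probs))
# [('eng_Latn', 0.970989465713501)]
```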

📊 Replicating UDHR results

Run the following command to replicate the results on the UDHR dataset. The results will be stored under the `results` directory.

python evaluate_udhr.py

💪🏻 Training

Download the training dataset into `data/glotlid/`:

huggingface-cli download cis-lmu/glotlid-corpus --repo-type dataset --local-dir data/glotlid

Run the data preprocessing pipeline:

bash scripts/preprocess_dataset.sh

Run the training scripts:

bash scripts/train_lid_ce.sh    # Trains the LID-CE model
bash scripts/train_lid_scl.sh   # Trains the LID-SCL model
bash scripts/train_conlid_s.sh  # Trains the ConLID-S model

🎯 TODO

  • Release the inference code
  • Release the training code
  • Release the evaluation code
  • Optimize the inference using parallel tokenization

⭐️ Citation

If you find this project useful, please cite us:

@article{foroutan2025conlid,
  title={ConLID: Supervised Contrastive Learning for Low-Resource Language Identification},
  author={Negar Foroutan and Jakhongir Saydaliev and Ye Eun Kim and Antoine Bosselut},
  journal={arXiv preprint arXiv:2506.15304},
  year={2025}
}
