LlamBERT implements a hybrid approach to text classification that leverages LLMs to annotate a small subset of a large, unlabeled database and uses the results to fine-tune transformer encoders such as BERT and RoBERTa. The strategy is evaluated on two diverse datasets: the IMDb review dataset and the UMLS Metathesaurus, where it is used to efficiently extract subontologies from the UMLS graph using natural language queries. This repository implements the method described in the research paper LlamBERT: Large-scale low-cost data annotation in NLP.
Given a large corpus of unlabeled natural language data, LlamBERT follows these steps (a code sketch of the pipeline appears after the list):
- Annotate a reasonably sized, randomly selected subset of the corpus utilizing an LLM and a prompt reflecting the labeling criteria;
- Parse the Llama 2 responses into the desired categories;
- Discard any data that fails to classify into any of the specified categories;
- Employ the resulting labels to perform supervised fine-tuning on a BERT classifier;
- Apply the fine-tuned BERT classifier to annotate the original unlabeled corpus.
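The middle steps (parsing LLM responses, discarding failed parses, and fine-tuning) can be illustrated with Hugging Face `transformers` and `datasets`. This is a minimal sketch, not the repository's exact code: the label names, example reviews, model checkpoint, and training hyperparameters below are illustrative assumptions.

```python
# Minimal sketch of the LlamBERT pipeline: parse raw LLM answers into labels,
# drop unparsable rows, and fine-tune a BERT classifier on the result.
# Label set, example data, and hyperparameters are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = {"negative": 0, "positive": 1}

def parse_response(response: str):
    """Map a raw Llama 2 answer to a label id; None if no category matches."""
    response = response.strip().lower()
    for name, idx in LABELS.items():
        if name in response:
            return idx
    return None  # discarded below

# Texts with the raw LLM answers for the annotated subset (dummy data).
texts = ["A moving, beautifully shot film.", "Two hours I will never get back."]
answers = ["Positive.", "negative"]

pairs = [(t, parse_response(a)) for t, a in zip(texts, answers)]
pairs = [(t, y) for t, y in pairs if y is not None]  # drop failed parses

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

ds = Dataset.from_dict({"text": [t for t, _ in pairs],
                        "label": [y for _, y in pairs]})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                padding="max_length", max_length=128),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llambert_bert", num_train_epochs=1),
    train_dataset=ds,
)
trainer.train()
# The fine-tuned classifier is then applied to the full unlabeled corpus,
# which is far cheaper than querying the LLM on every example.
```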
Test accuracies (%) on the IMDb dataset:

| BERT model | Baseline train | LlamBERT train | LlamBERT train & extra | Combined extra + train |
|---|---|---|---|---|
| distilbert-base | 91.23 | 90.77 | 92.12 | 92.53 |
| bert-base | 92.35 | 91.58 | 92.76 | 93.47 |
| bert-large | 94.29 | 93.31 | 94.07 | 95.03 |
| roberta-base | 94.74 | 93.53 | 94.28 | 95.23 |
| roberta-large | 96.54 | 94.83 | 94.98 | 96.68 |
Test accuracies (%) on the UMLS data, with 95% confidence intervals measured over 5 different random seeds:

| Model | Baseline | LlamBERT | Combined |
|---|---|---|---|
| bert-large | 94.84 (±0.25) | 95.70 (±0.21) | 96.14 (±0.42) |
| roberta-large | 95.00 (±0.18) | 96.02 (±0.12) | 96.64 (±0.14) |
| BiomedBERT-large | 96.72 (±0.17) | 96.66 (±0.13) | 96.92 (±0.10) |
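For reference, intervals of this form can be computed from per-seed accuracies with a Student's t interval. A minimal sketch follows; the accuracy values in it are dummy inputs, not results from the paper:

```python
import numpy as np
from scipy import stats

def confidence_interval(accs, confidence=0.95):
    """Mean and half-width of a t-based confidence interval over seed accuracies."""
    accs = np.asarray(accs)
    sem = stats.sem(accs)  # standard error of the mean (ddof=1)
    h = sem * stats.t.ppf((1 + confidence) / 2, df=len(accs) - 1)
    return accs.mean(), h

# Hypothetical accuracies from 5 random seeds.
mean, h = confidence_interval([96.0, 96.2, 95.9, 96.1, 96.0])
print(f"{mean:.2f} (±{h:.2f})")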
- Llama-2-7b-chat: Requires a single A100 40GB GPU.
- Llama-2-70b-chat: Requires four A100 80GB GPUs.
- gpt-4-0613: Requires OpenAI API access.
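As an illustration of the hardware requirements above, a Llama 2 chat model can be loaded for annotation as sketched below. This assumes the gated `meta-llama` checkpoints on the Hugging Face Hub and is not necessarily how this repository loads models; the prompt is a hypothetical stand-in for the paper's actual prompt. With `device_map="auto"`, the 70B variant is sharded across all visible GPUs.

```python
# Sketch: load a Llama 2 chat model for annotation (assumes access to the
# gated meta-llama checkpoints; not necessarily this repo's loading code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # Llama-2-70b-chat-hf needs 4x A100 80GB
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision fits the 7B model on one A100 40GB
    device_map="auto",          # shards larger models across all visible GPUs
)

# Hypothetical annotation prompt reflecting the labeling criteria.
prompt = ("Is the following movie review positive or negative?\n"
          "Review: 'A quiet masterpiece.'\nAnswer:")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```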
Create the conda environment:

```bash
conda env create --file=environment.yml
```

Run the scripts with:

```bash
CUDA_VISIBLE_DEVICES=0 python <script-name>.py <options>
```

Pass `-h` for help on the available options; press `q` to exit the help view.
If you use this code in your research, please cite the corresponding paper:
InsertCitationHere
- Bálint Csanády (csbalint@protonmail.ch)
- Lajos Muzsai (muzsailajos@protonmail.com)
- Péter Vedres (vedrespeter0000@gmail.com)
- Zoltán Nádasdy (zoltan@utexas.edu)
- András Lukács (andras.lukacs@ttk.elte.hu)
This project is licensed under the MIT License - see the LICENSE file for details.