
🦖 Turkish LM Tuner



Overview

Turkish LM Tuner is a library for fine-tuning Turkish language models on various NLP tasks. Built on top of the Hugging Face Transformers library, it supports fine-tuning for both conditional generation and sequence classification tasks. The library is designed to be modular and extensible, making it easy to add new tasks and models, and it provides data loaders for a range of Turkish NLP datasets.

Installation

You can install turkish-lm-tuner via PyPI:

pip install turkish-lm-tuner

Alternatively, you can install the latest version directly from the GitHub repository:

pip install git+https://github.com/boun-tabi-LMG/turkish-lm-tuner.git

Model Support

Any encoder or conditional generation model compatible with the Hugging Face Transformers library can be used with Turkish LM Tuner. Tested models include TURNA (boun-tabi-LMG/TURNA), which is used in the examples below.
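
Because supported checkpoints are standard Transformers models, they can also be loaded directly with the Transformers API. A minimal sketch, assuming a seq2seq checkpoint such as TURNA:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load a supported checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("boun-tabi-LMG/TURNA")
model = AutoModelForSeq2SeqLM.from_pretrained("boun-tabi-LMG/TURNA")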

Task and Dataset Support

| Task                        | Datasets                                    |
|-----------------------------|---------------------------------------------|
| Text Classification         | Product Reviews, TTC4900, Tweet Sentiment   |
| Natural Language Inference  | NLI_TR, SNLI_TR, MultiNLI_TR                |
| Semantic Textual Similarity | STSb_TR                                     |
| Named Entity Recognition    | WikiANN, Milliyet NER                       |
| Part-of-Speech Tagging      | BOUN, IMST                                  |
| Text Summarization          | TR News, MLSUM, Combined TR News and MLSUM  |
| Title Generation            | TR News, MLSUM, Combined TR News and MLSUM  |
| Paraphrase Generation       | OpenSubtitles, Tatoeba, TED Talks           |
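
The DatasetProcessor interface shown in the examples below covers these tasks as well. A sketch for a classification dataset; note that the identifiers "ttc4900" and "classification" are assumptions based on the naming convention of the examples and may differ in the actual library:

from turkish_lm_tuner import DatasetProcessor

# NOTE: dataset/task identifiers here are assumed; check the library's
# documentation for the exact names.
processor = DatasetProcessor(
    dataset_name="ttc4900", task="classification", task_format="classification",
    task_mode='', tokenizer_name="boun-tabi-LMG/TURNA",
    max_input_length=512, max_target_length=8
)
train_dataset = processor.load_and_preprocess_data(split='train')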

Usage

The tutorials in the documentation can help you get started with turkish-lm-tuner.

Examples

Fine-tune and evaluate a conditional generation model

from turkish_lm_tuner import DatasetProcessor, TrainerForConditionalGeneration

dataset_name = "tr_news"
task = "summarization"
task_format = "conditional_generation"
model_name = "boun-tabi-LMG/TURNA"
max_input_length = 764
max_target_length = 128
dataset_processor = DatasetProcessor(
    dataset_name=dataset_name, task=task, task_format=task_format, task_mode='',
    tokenizer_name=model_name, max_input_length=max_input_length, max_target_length=max_target_length
)

train_dataset = dataset_processor.load_and_preprocess_data(split='train')
eval_dataset = dataset_processor.load_and_preprocess_data(split='validation')
test_dataset = dataset_processor.load_and_preprocess_data(split="test")

training_params = {
    'num_train_epochs': 10,
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 4,
    'output_dir': './', 
    'evaluation_strategy': 'epoch',
    'save_strategy': 'epoch',
    'predict_with_generate': True    
}
optimizer_params = {
    'optimizer_type': 'adafactor',
    'scheduler': False,
}

model_save_path = "turna_summarization_tr_news"
model_trainer = TrainerForConditionalGeneration(
    model_name=model_name, task=task,
    optimizer_params=optimizer_params,
    training_params=training_params,
    model_save_path=model_save_path,
    max_input_length=max_input_length,
    max_target_length=max_target_length,
    postprocess_fn=dataset_processor.dataset.postprocess_data
)

trainer, model = model_trainer.train_and_evaluate(train_dataset, eval_dataset, test_dataset)

model.save_pretrained(model_save_path)
dataset_processor.tokenizer.save_pretrained(model_save_path)
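
After training, the saved checkpoint can be used for inference with the standard Transformers API. A minimal sketch; the exact input formatting depends on how the task was preprocessed:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint saved above
tokenizer = AutoTokenizer.from_pretrained("turna_summarization_tr_news")
model = AutoModelForSeq2SeqLM.from_pretrained("turna_summarization_tr_news")

inputs = tokenizer("Örnek haber metni ...", return_tensors="pt",
                   truncation=True, max_length=764)
summary_ids = model.generate(**inputs, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))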

Evaluate a conditional generation model with custom generation config

from turkish_lm_tuner import DatasetProcessor, EvaluatorForConditionalGeneration

dataset_name = "tr_news"
task = "summarization"
task_format = "conditional_generation"
model_name = "boun-tabi-LMG/TURNA"
task_mode = ''
max_input_length = 764
max_target_length = 128
dataset_processor = DatasetProcessor(
    dataset_name, task, task_format, task_mode,
    model_name, max_input_length, max_target_length
)

test_dataset = dataset_processor.load_and_preprocess_data(split="test")

test_params = {
    'per_device_eval_batch_size': 4
}

model_path = "turna_tr_news_summarization"
generation_params = {
    'num_beams': 4,
    'length_penalty': 2.0,
    'no_repeat_ngram_size': 3,
    'early_stopping': True,
    'max_length': 128,
    'min_length': 30,
}
evaluator = EvaluatorForConditionalGeneration(
    model_path, model_name, task, max_input_length, max_target_length, test_params,
    generation_params, dataset_processor.dataset.postprocess_data
)
results = evaluator.evaluate_model(test_dataset)
print(results)
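
The keys in generation_params (num_beams, length_penalty, no_repeat_ngram_size, early_stopping, max_length, min_length) correspond to arguments of the Transformers generate method, so other generation options offered by Transformers should be usable here in the same way.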

Reference

If you use this repository, please cite the following related paper:

@misc{uludogan2024turna,
      title={TURNA: A Turkish Encoder-Decoder Language Model for Enhanced Understanding and Generation}, 
      author={Gökçe Uludoğan and Zeynep Yirmibeşoğlu Balal and Furkan Akkurt and Melikşah Türker and Onur Güngör and Susan Üsküdarlı},
      year={2024},
      eprint={2401.14373},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

Note that all datasets belong to their respective owners. If you use the datasets provided by this library, please cite the original source.

This code base is licensed under the MIT license. See LICENSE for details.