The paper is publiclly available at arXiv.
The best performing model (BERTić finetuned on Latin version of the dataset) is publicly available at HuggingFace.
The synthetic dataset is available at Kaggle.
In this paper, we focus on generating a synthetic question answering (QA) dataset using an adapted Translate-Align-Retrieve method. Using this method, we created the largest Serbian QA dataset of more than 87K samples, which we name SQuAD-sr. To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset. We investigate the dataset quality and use it to fine-tune several pre-trained QA models. Best results were obtained by fine-tuning the BERTić model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score on the benchmark XQuAD dataset, which we translated into Serbian for the purpose of evaluation. The results show that our model exceeds zero-shot baselines, but fails to go beyond human performance. We note the advantage of using a monolingual pre-trained model over multilingual, as well as the performance increase gained by using Latin over Cyrillic. By performing additional analysis, we show that questions about numeric values or dates are more likely to be answered correctly than other types of questions. Finally, we conclude that SQuAD-sr is of sufficient quality for fine-tuning a Serbian QA model, in the absence of a manually crafted and annotated dataset.
Pipeline utilizes publicly available libraries. Python libraries are listed in requirements.txt
. Notable projects and libraries are:
- HuggingFace: Ecosystem utilized for original dataset translation, as well as model finetuning and evaluation
- Eflomal: Statistical Word Alignment tool
- FastAlign: Statistical Word Alignment tool, utilized for symmetrization
To install Eflomal and FastAlign, please refer to instructions in their repositories.
Source code is divided into 3 parts:
- Synthesis - contains a pipeline for synthetic translation of SQuAD Dataset
- Finetuning - contains Model finetuning notebook and loading script
- Analysis - contains notebooks in which dataset and model analysis, as well as model evaluation and error extraction is performed
Before running any part of the code, configuration parameters must be set. Configuration example is given in config.json
file. Please use this file as a template to your own desired configuration parameters.
Synthesis creation pipeline is divided into 4 scripts (excluding utils
):
translate.py
- for original SQuAD Dataset translationtransliterate.py
- for transliteration from cyrillic to latin script (optional)align.py
- for calculating Word Alignmentsretrieve.py
- for final SQuAD-sr Dataset creation
To run full pipeline, please use shell script run_all.sh
using the following command:
bash src/synthesis/run_all.sh <config_path> <fast_align_dir>
Also, scripts can be run one at a time, but don't forget to perform symmetrization between running align.py
and retrieve.py
. To find instructions on how to perform symmetrization, refer to FastAlign repository or to run_all.sh
script.
Model is finetuned in finetune.ipynb
, which uses HuggingFace ecosystem to perform finetuning. To load synthetically crafted dataset, loading_script.py
is used, which is adapted from HuggingFace examples.
In this section three notebooks are provided:
dataset_analysis.ipynb
- for synthetic dataset analysismodel_evaluation.ipynb
- for finetuned model evaluation and extraction of incorrect samplesmodel_error_analysis.ipynb
- which provides analysis of extracted samples
To use any of these notebooks, please change file paths in order to sucessfully load model and required data.