# Instructions to train the model

#### **Model attributes**
- uses DBpedia knowledge graph
- uses (pre-trained) mT5-XL as base model
- pre-trained on LC-QuAD 1.0 (until early stopping)
- fine-tuned on QALD-9-Plus (custom) on all languages (until early stopping)
- utilizes linguistic context, entity knowledge and padding in pre-training and fine-tuning

#### **Dataset Generation**

For pre-training:

```bash
python3 code/generate_train_csv.py \
-i datasets/lcquad1/train-data.json \
-o datasets/lcquad1/train-data \
-t lcquad1 \
-l all \
--linguistic_context \
--entity_knowledge \
--question_padding_length 128 \
--entity_padding_length 64 \
--train_split_percent 90
```

For fine-tuning:

```bash
python3 code/generate_train_csv.py \
-i datasets/qald9plus/dbpedia/qald_9_plus_train_dbpedia.json \
-o datasets/qald9plus/dbpedia/qald_9_plus_train_dbpedia-lc-ent \
-t qald \
-kg DBpedia \
-l all \
--linguistic_context \
--entity_knowledge \
--question_padding_length 128 \
--entity_padding_length 64 \
--train_split_percent 90
```
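Before training, it can help to confirm that both split files were written for each dataset. The sketch below assumes the generator appends `_train_90pct.csv` and `_dev_10pct.csv` to the `-o` prefix, which is inferred from the file names used in the training commands in this README; `check_splits` is a hypothetical helper, not part of the repository.

```bash
#!/usr/bin/env bash
# Sketch: report which expected split files are missing or empty.
# ASSUMPTION: the suffixes are inferred from the training commands in this
# README; check_splits is a hypothetical helper, not a repository script.
check_splits() {
  local prefix suffix missing=0
  for prefix in "$@"; do
    for suffix in _train_90pct.csv _dev_10pct.csv; do
      # -s is true only if the file exists and is non-empty
      if [ ! -s "${prefix}${suffix}" ]; then
        echo "missing or empty: ${prefix}${suffix}" >&2
        missing=1
      fi
    done
  done
  return "$missing"
}

check_splits \
  datasets/lcquad1/train-data \
  datasets/qald9plus/dbpedia/qald_9_plus_train_dbpedia-lc-ent \
  || echo "Re-run code/generate_train_csv.py before training." >&2
```

The function returns a non-zero status if any file is missing, so it can also gate a training command with `check_splits ... && bash train.sh ...`.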

#### **Pre-train on LC-QuAD 1.0**

Pre-training and fine-tuning are based on the same script, `train_ds.sh`. To save time and avoid errors, the commands are provided through a preconfigured bash script. Run the following command in your terminal:

```bash
bash train.sh 60020 "google/mt5-xl" datasets/lcquad1/train-data_train_90pct.csv datasets/lcquad1/train-data_dev_10pct.csv fine-tuned_models/lcquad1-finetune_mt5-base_lc-ent lcquad1-finetune_mt5-base_lc-ent 32
```
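The positional arguments of `train.sh` are not labeled in this README. The sketch below spells out the same command with named variables; the names are guesses at what each position means (port, base model, training CSV, dev CSV, output directory, run name, and a final numeric value that may be a batch size), so check `train.sh` before relying on them.

```bash
#!/usr/bin/env bash
# ASSUMPTION: the variable names are guesses at the meaning of each
# positional argument of train.sh; the values are copied from the command above.
PORT=60020
MODEL="google/mt5-xl"
TRAIN_CSV=datasets/lcquad1/train-data_train_90pct.csv
DEV_CSV=datasets/lcquad1/train-data_dev_10pct.csv
OUTPUT_DIR=fine-tuned_models/lcquad1-finetune_mt5-base_lc-ent
RUN_NAME=lcquad1-finetune_mt5-base_lc-ent
LAST_ARG=32   # possibly a batch size; not confirmed by this README

# Build the command as an array so each argument stays a single word.
cmd=(bash train.sh "$PORT" "$MODEL" "$TRAIN_CSV" "$DEV_CSV" "$OUTPUT_DIR" "$RUN_NAME" "$LAST_ARG")

# Print the command for review; run "${cmd[@]}" to launch it.
printf '%s\n' "${cmd[*]}"
```

The script only prints the assembled command, which makes it easy to compare against the one-line form before starting a long training run.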

#### **Fine-tune on QALD-9-Plus**

```bash
bash train.sh 60030 fine-tuned_models/lcquad1-finetune_mt5-base_lc-ent datasets/qald9plus/dbpedia/qald_9_plus_train_dbpedia-lc-ent_train_90pct.csv datasets/qald9plus/dbpedia/qald_9_plus_train_dbpedia-lc-ent_dev_10pct.csv fine-tuned_models/qald9plus-finetune_lcquad1-ft-base_lc-ent qald9plus-finetune_lcquad1-ft-base_lc-ent 32
```

##### **Continue Training**


To continue training, comment out the logic associated with `early_stopping_callback` and increase the number of epochs.

#### **GERBIL Evaluation**

Use `eval.sh` to generate prediction files in QALD format and evaluate them with GERBIL.
`eval.sh` is preconfigured, so you can run it directly:

```bash
./eval.sh
```

Prediction files are stored in `pred_files/qald9plus-finetune`.
The script uploads them to GERBIL along with the reference test file
and waits 5 minutes for the results.
If the GERBIL experiment finishes within that time, the results are stored in `pred_files/qald9plus-finetune/result.csv`; otherwise, the experiment ID is stored in that file instead. In that case, use the following command to generate the results CSV file:

```bash
python3 code/gerbil_eval.py --experiment_id [experiment_id] --pred_path pred_files/qald9plus-finetune
```
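To decide whether the follow-up command is needed, one option is to check whether `result.csv` still holds only an experiment ID. The sketch below assumes a pending file contains nothing but a numeric ID; that format is an assumption, and `pending_experiment_id` is a hypothetical helper, not part of the repository.

```bash
#!/usr/bin/env bash
# ASSUMPTION: while the experiment is pending, result.csv contains only a
# numeric experiment ID. pending_experiment_id is a hypothetical helper.
PRED_PATH=pred_files/qald9plus-finetune
RESULT_FILE="$PRED_PATH/result.csv"

pending_experiment_id() {
  # Print the ID and return 0 if the file holds digits only; return 1 otherwise.
  local content
  [ -f "$1" ] || return 1
  content=$(tr -d '[:space:]' < "$1")
  case "$content" in
    ''|*[!0-9]*) return 1 ;;
    *) printf '%s\n' "$content" ;;
  esac
}

# Fetch the results only if the file still looks like a bare experiment ID.
if id=$(pending_experiment_id "$RESULT_FILE"); then
  python3 code/gerbil_eval.py --experiment_id "$id" --pred_path "$PRED_PATH"
fi
```

If the file already contains results (anything other than digits), the script does nothing.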
