# Instructions to train the model

#### **Model attributes**
- uses Wikidata knowledge graph
- uses (pre-trained) mT5-XL as base model
- pre-trained on LC-QuAD 2.0 for 15 epochs
- fine-tuned on QALD-9-Plus (custom) on all languages for 15 epochs
- utilizes linguistic context, entity knowledge and padding in pre-training and fine-tuning

#### **Dataset Generation**

For pre-training:

```bash
python3 code/generate_train_csv.py \
-i datasets/lcquad2/train.json \
-o datasets/lcquad2/train.csv \
-t lcquad2 \
--linguistic_context \
--entity_knowledge \
--question_padding_length 32 \
--entity_padding_length 5
```

For fine-tuning:

```bash
python3 code/generate_train_csv.py \
-i datasets/qald9plus/wikidata/qald_9_plus_train_wikidata.json \
-o datasets/qald9plus/wikidata/qald_9_plus_train_wikidata.csv \
-t qald \
-kg Wikidata \
-l all \
--linguistic_context \
--entity_knowledge \
--question_padding_length 32 \
--entity_padding_length 5
```
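Both training stages read the CSV files generated above, so it may be worth checking that the files exist and are non-empty before starting a long run. The following is a minimal sketch; `check_csv` is a hypothetical helper and is not part of this repository.

```bash
# Hypothetical helper (not part of the repository): fail early if a
# generated CSV is missing or empty, otherwise report its line count.
check_csv() {
    local path="$1"
    if [ ! -s "$path" ]; then
        echo "missing or empty: $path" >&2
        return 1
    fi
    # The line count only approximates the row count, because quoted
    # CSV fields may contain newlines.
    echo "$path: $(wc -l < "$path") lines"
}
```

For example, call `check_csv datasets/lcquad2/train.csv` after the pre-training step and `check_csv datasets/qald9plus/wikidata/qald_9_plus_train_wikidata.csv` after the fine-tuning step.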

#### **Pre-train on LC-QuAD 2.0**

The pre-training and fine-tuning commands are both based on the same script, `train_ds.sh`. To save time and avoid configuration errors, the fully configured command is provided here. Run it in your terminal:

```bash
deepspeed --include=localhost:0 --master_port 60000 code/train_new.py \
    --deepspeed deepspeed/ds_config_zero3.json \
    --model_name_or_path google/mt5-xl \
    --do_train \
    --train_file datasets/lcquad2/train.csv \
    --output_dir fine-tuned_models/lcquad2-pretrain \
    --num_train_epochs 15 \
    --per_device_train_batch_size=16 \
    --overwrite_output_dir \
    --save_steps 6000 \
    --save_total_limit 2 \
    --report_to wandb \
    --run_name lcquad2-pretrain \
    --logging_steps 10 \
    --tf32 1 \
    --fp16 0 \
    --gradient_checkpointing 1 \
    --gradient_accumulation_steps 4
```
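The effective batch size per optimizer step follows from the flags in the command above. It is worth recomputing if you change the number of GPUs or the accumulation steps. The variable names below are illustrative only, and the calculation assumes a single GPU, as selected by `--include=localhost:0`.

```bash
# Values copied from the pre-training command above.
per_device_batch=16   # --per_device_train_batch_size
grad_accum_steps=4    # --gradient_accumulation_steps
num_gpus=1            # --include=localhost:0 selects one GPU

# Number of samples that contribute to each optimizer step.
effective_batch=$((per_device_batch * grad_accum_steps * num_gpus))
echo "effective batch size: ${effective_batch}"
```

With these settings, each optimizer step sees 16 × 4 × 1 = 64 samples.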

#### **Fine-tune on QALD-9-Plus**

```bash
deepspeed --include=localhost:0 --master_port 60000 code/train_new.py \
    --deepspeed deepspeed/ds_config_zero3.json \
    --model_name_or_path fine-tuned_models/lcquad2-pretrain \
    --do_train \
    --train_file datasets/qald9plus/wikidata/qald_9_plus_train_wikidata.csv \
    --output_dir fine-tuned_models/qald9plus-finetune-new \
    --num_train_epochs 32 \
    --per_device_train_batch_size=16 \
    --overwrite_output_dir \
    --save_steps 3000 \
    --save_total_limit 2 \
    --report_to wandb \
    --run_name qald9plus-finetune-new \
    --logging_steps 10 \
    --tf32 1 \
    --fp16 0 \
    --gradient_checkpointing 1 \
    --gradient_accumulation_steps 4
```
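The fine-tuning command loads its weights from `fine-tuned_models/lcquad2-pretrain`, the output directory of the pre-training step. The sketch below checks that this directory looks like a saved model before launching the run. `require_model_dir` is a hypothetical helper, and the check assumes that `code/train_new.py` saves models in the Hugging Face format, which includes a `config.json` file.

```bash
# Hypothetical guard (not part of the repository): check that a directory
# looks like a saved Hugging Face model before training on top of it.
# Assumption: code/train_new.py writes config.json when saving the model.
require_model_dir() {
    local dir="$1"
    if [ ! -f "$dir/config.json" ]; then
        echo "no config.json in $dir; run the previous stage first" >&2
        return 1
    fi
    echo "found model in $dir"
}
```

For example, run `require_model_dir fine-tuned_models/lcquad2-pretrain` before fine-tuning.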

##### **Continue training**

```bash
deepspeed --include=localhost:0 --master_port 60000 code/train_new.py \
    --deepspeed deepspeed/ds_config_zero3.json \
    --model_name_or_path fine-tuned_models/qald9plus-finetune-new \
    --do_train \
    --train_file datasets/qald9plus/wikidata/qald_9_plus_train_wikidata.csv \
    --output_dir fine-tuned_models/qald9plus-finetune-new-2 \
    --num_train_epochs 70 \
    --per_device_train_batch_size=16 \
    --overwrite_output_dir \
    --save_steps 500 \
    --save_total_limit 2 \
    --report_to wandb \
    --run_name qald9plus-finetune-new-2 \
    --logging_steps 10 \
    --tf32 1 \
    --fp16 0 \
    --gradient_checkpointing 1 \
    --gradient_accumulation_steps 4
```

#### **GERBIL Evaluation**

Use `eval.sh` to generate prediction files in QALD format and evaluate them with GERBIL.
The script is already configured, so you can run it directly:

```bash
./eval.sh
```

Prediction files are stored in `pred_files/qald9plus-finetune`.
The script uploads them to GERBIL along with the reference test file
and waits 5 minutes for the results.
If the GERBIL experiment finishes within that time, the results are stored in `pred_files/qald9plus-finetune/result.csv`; otherwise, the experiment ID is stored in this file instead. In that case, use the following command to generate the results CSV file:

```bash
python3 code/gerbil_eval.py --experiment_id [experiment_id] --pred_path pred_files/qald9plus-finetune
```
