[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/avacaondata/nlpboost/blob/main/notebooks/extractive_qa/train_sqac.ipynb)

# Extractive Question Answering in Spanish: SQAC

In this tutorial we train several Spanish language models on SQAC, a Spanish extractive QA dataset.

We first import the needed modules. If you are running this notebook in Google Colab, install `nlpboost` before running the imports.

In [1]:
from nlpboost import AutoTrainer, ModelConfig, DatasetConfig, ResultsPlotter
from transformers import EarlyStoppingCallback
from nlpboost.default_param_spaces import hp_space_base



## Configure the dataset

The next step is to define the fixed training arguments, which become the `transformers.TrainingArguments` passed to `transformers.Trainer` inside `nlpboost.AutoTrainer`. For a full list of arguments, check the [TrainingArguments documentation](https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/trainer#transformers.TrainingArguments). `DatasetConfig` expects these arguments as a dictionary.

To save time, we set `max_steps` to 1, which overrides `num_train_epochs` and stops training after a single optimization step. A real run would need different values, but choosing them is out of scope for this tutorial. To learn how to work with Transformers and how to configure the training arguments, see the Hugging Face NLP Course.

In [2]:
fixed_train_args = {
        "evaluation_strategy": "epoch",
        "num_train_epochs": 10,
        "do_train": True,
        "do_eval": True,
        "logging_strategy": "epoch",
        "save_strategy": "epoch",
        "save_total_limit": 2,
        "seed": 69,
        "fp16": True,
        "dataloader_num_workers": 8,
        "load_best_model_at_end": True,
        "per_device_eval_batch_size": 16,
        "adam_epsilon": 1e-6,
        "adam_beta1": 0.9,
        "adam_beta2": 0.999,
        "max_steps": 1
}
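As a minimal sketch of how the smoke-test setting could be undone for a real run: removing `max_steps` lets `num_train_epochs` control the training length again. The dictionary below is a shortened, hypothetical copy of the arguments above, and `full_run_args` is a name introduced only for this example.

```python
# Shortened, hypothetical copy of the tutorial arguments.
tutorial_args = {
    "evaluation_strategy": "epoch",
    "num_train_epochs": 10,
    "max_steps": 1,  # smoke-test setting: a positive value overrides num_train_epochs
}

# Drop max_steps so that num_train_epochs decides how long training lasts.
full_run_args = {k: v for k, v in tutorial_args.items() if k != "max_steps"}
print(full_run_args)
```

The remaining values (batch size, epochs, precision) would still need tuning for the hardware and dataset at hand.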

Then we define some common arguments for the dataset. Here we minimize the evaluation loss, because for QA no `compute_metrics` function is used during training. The loss is used to choose the best model, and metrics are then computed over the test set. Computing QA metrics is not a straightforward process, which is why it is not done during training.

In [3]:
default_args_dataset = {
        "seed": 44,
        "direction_optimize": "minimize",
        "metric_optimize": "eval_loss",
        "retrain_at_end": False,
        "callbacks": [EarlyStoppingCallback(early_stopping_patience=1, early_stopping_threshold=0.00001)],
        "fixed_training_args": fixed_train_args
}
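To make the early-stopping settings above more concrete, here is a simplified pure-Python sketch of the rule for a metric that is minimized, such as `eval_loss`. It is not the `transformers` implementation; `should_stop` and the loss values are hypothetical and only illustrate what a patience of 1 and a threshold of 0.00001 imply.

```python
def should_stop(eval_losses, patience=1, threshold=1e-5):
    """Return the index of the evaluation that triggers a stop, or None."""
    best = None
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        # An evaluation counts as an improvement only if the loss drops
        # by more than the threshold relative to the best value so far.
        improved = best is None or (loss < best and abs(loss - best) > threshold)
        if improved:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        if evals_without_improvement >= patience:
            return i
    return None


# The third evaluation improves by less than the threshold, so training stops there.
print(should_stop([1.20, 0.95, 0.949995, 0.90]))  # → 2
```

With a patience of 1, a single evaluation without a sufficient improvement is enough to stop training, so a larger patience may be preferable when the loss is noisy.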

We now define the arguments specific to SQAC. For QA, the text field and the label column are not used, so we just set them to two string columns of the dataset. In QA tasks, `nlpboost` assumes the dataset is in SQuAD format.

In [4]:
sqac_config = default_args_dataset.copy()
sqac_config.update(
    {
        "dataset_name": "sqac",
        "alias": "sqac",
        "task": "qa",
        "text_field": "context",
        "hf_load_kwargs": {"path": "PlanTL-GOB-ES/SQAC"},
        "label_col": "question",
    }
)

In [5]:
sqac_config = DatasetConfig(**sqac_config)
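For readers unfamiliar with the SQuAD format mentioned above, the sketch below shows one record in the layout used by the Hugging Face `squad` dataset, together with a small consistency check. The record is invented for illustration and is not taken from SQAC; `answers_are_aligned` is a hypothetical helper, and whether SQAC matches this layout field by field should be verified on the loaded dataset.

```python
# Hypothetical record in SQuAD-style layout (invented, not taken from SQAC).
example = {
    "id": "example-0001",
    "title": "Madrid",
    "context": "Madrid es la capital de España y su ciudad más poblada.",
    "question": "¿Cuál es la capital de España?",
    "answers": {"text": ["Madrid"], "answer_start": [0]},
}


def answers_are_aligned(record):
    """Check that every answer text appears in the context at its stated character offset."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(answers_are_aligned(example))  # → True
```

Extractive QA models predict a span of the context, so `answer_start` offsets that do not match the context are a common source of silent preprocessing errors.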

## Configure Models

We will configure three Spanish models. For each one we only need to define `name`, the path to the model (either on the HF Hub or local); `save_name`, an arbitrary name for the model; the hyperparameter space; and the number of trials. There are more parameters, which you can check in the documentation.

In [6]:
bertin_config = ModelConfig(
        name="bertin-project/bertin-roberta-base-spanish",
        save_name="bertin",
        hp_space=hp_space_base,
        n_trials=1,
)
beto_config = ModelConfig(
        name="dccuchile/bert-base-spanish-wwm-cased",
        save_name="beto",
        hp_space=hp_space_base,
        n_trials=1,
)
albert_config = ModelConfig(
        name="CenIA/albert-tiny-spanish",
        save_name="albert",
        hp_space=hp_space_base,
        n_trials=1
)
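The configurations above all reuse `hp_space_base`. As a sketch of what a custom search space might look like, the function below follows the convention of `transformers.Trainer.hyperparameter_search` with the Optuna backend, where the space is a function that receives a trial and returns a dictionary of training arguments. That `ModelConfig` accepts such a function for `hp_space` is an assumption here, and `custom_hp_space` with its ranges is hypothetical and untuned.

```python
def custom_hp_space(trial):
    """Hypothetical Optuna-style search space; the ranges are illustrative only."""
    return {
        # Sample the learning rate on a log scale.
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        # Pick the training batch size from a fixed set of choices.
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
    }
```

Under that assumption, the function would be passed as `hp_space=custom_hp_space`, and `n_trials` would be raised above 1 so that more than one configuration is actually sampled.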