<a href="https://colab.research.google.com/github/alexk2206/tds_capstone/blob/Domi-DEV/Overview_Domi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 5: Generate model output
(Dominik Schuster)
> Referred notebook: model_choice_and_fine_tuning.ipynb

What looks like an easy task is not that simple in practice.
Why is that?
There are several difficulties, but the main one is that the dataset contains different question types.
Can one model answer all of them?
Or is more than one necessary?
Multiple-choice and single-choice questions are especially tough to handle, because the model has to choose an answer from predefined options.

We thought about several ways to treat them:


*   Zero-shot classification: the model classifies the input, and in our case the classes would be the options.
*   Question-answering models (QA models): the extracted answer is compared with the options by similarity, and the matching option(s) are chosen where applicable.
*   Multiple-choice models (mc models): the model directly picks the best option(s).

The direct approach seemed the most promising to us, as we would not need to post-process the model output.
Zero-shot classification might have offered the same advantage, but after weighing it up we concluded that it probably would not.
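
To illustrate the post-processing that the QA-model approach in the list above would have needed, here is a minimal sketch of mapping a free-text answer onto the predefined options. How similarity would actually have been measured is not stated in this chapter, so the use of `difflib`, the function name `map_answer_to_options` and the cut-off `min_similarity` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def map_answer_to_options(answer, options, min_similarity=0.6):
    """Return the options whose text is similar enough to a free-text answer.

    `min_similarity` is a hypothetical cut-off, not a value from the project.
    """
    matches = []
    for option in options:
        # ratio() is 1.0 for identical strings and 0.0 if nothing matches
        similarity = SequenceMatcher(None, answer.lower(), option.lower()).ratio()
        if similarity >= min_similarity:
            matches.append(option)
    return matches


# Made-up example data
options = ["Email", "Phone call", "No follow-up"]
print(map_answer_to_options("phone call", options))
```

An mc model avoids this extra step, because it scores the options directly.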

Questions of type 'DATE' and 'NUMBER' (open-ended (oe) questions) are different: the answer only has to be extracted from the text, without mapping it to options, so we decided to use a separate model for them.
We chose a QA model, which is considerably easier to handle than an mc model.
In addition, we used a text-summarization pipeline to summarize the context of 'TEXT' questions, as extracting a single answer does not seem reasonable for this question type.

Before looking into the code, we want to point out that we initially concentrated on handling one question at a time.
This made the task much easier.
Our goal was to generalize the solution to the setting where several different questions have to be answered at once from one big context.
Unfortunately, we did not get that far.


But now, let's jump into the code.
Our main function was model_output(), which generates output for multiple questions (each of which comes with an intended answer) one after another, collects all the answers and returns them at the end.
It can also calculate metrics for the mc and the oe questions on the fly by comparing the expected output with the actual output.


In [None]:
def model_output(mc_model, mc_tokenizer, oe_model, oe_tokenizer, questions, sum_pipeline=None, mc_metric=None, oe_metric=None):
    '''
    model_output -> creates output for every question in the dataset and saves it in a list of dicts
    parameters:
    - mc_model: one hugging face model for mc questions
    - mc_tokenizer: hugging face tokenizer for mc questions
    - oe_model: one hugging face model for oe questions
    - oe_tokenizer: hugging face tokenizer for oe questions
    - questions: QA-dataset as pd.DataFrame
    - sum_pipeline: huggingface text-summarization pipeline to handle 'TEXT' questions
    - mc_metric: metric for evaluating mc questions
    - oe_metric: metric for evaluating oe questions
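    output:
    - mc_answer_comparison: list of dicts with keys 'model', 'intended_answer_binary', 'predicted_answer_binary', 'intended_answer', 'predicted_answer', 'type', 'difficulty'
    - answer_comparison: list of dicts with keys 'model', 'intended_answer', 'predicted_answer', 'type', 'difficulty'
    - mc_metric_result: computed mc metric, or None if no metric was passed or it could not be computed
    - oe_metric_result: computed oe metric, or None if no metric was passed or it could not be computed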
    '''
    answer_comparison = []
    mc_answer_comparison = []
    mc_model_name = mc_model.config._name_or_path
    oe_model_name = oe_model.config._name_or_path

    for index, question in questions.iterrows():
        context = question['context']
        question_text = question['question']
        options = question['options']
        question_type = question['type']
        difficulty = question['difficulty']

        mc_question_type = question_type in ["MULTI_SELECT", "SINGLE_SELECT"]

        if question_type == "MULTI_SELECT":
          intended_answer, intended_answer_binary, predicted_answer_binary, predicted_answer = multi_select_model_output(mc_model, mc_tokenizer, question, mc_metric)
        elif question_type == "SINGLE_SELECT":
          intended_answer, intended_answer_binary, predicted_answer_binary, predicted_answer = single_select_model_output(mc_model, mc_tokenizer, question, mc_metric)
        elif question_type == "TEXT":
          # 'TEXT' questions are only summarized; the result is neither compared nor collected
          intended_answer, predicted_answer = text_model_output(question, sum_pipeline)
          continue
        elif question_type == "NUMBER":
          intended_answer, predicted_answer = number_model_output(oe_model, oe_tokenizer, question, oe_metric)
        elif question_type == "DATE":
          intended_answer, predicted_answer = date_model_output(oe_model, oe_tokenizer, question, oe_metric)
        else:
          continue
        if predicted_answer != intended_answer:
          print('======= Wrong answer =======')
          print(f"Question: {question_text}")
          print(f"Context: {context}")
          print(f"The intended answer was: {intended_answer}")
          print(f"The predicted answer was: {predicted_answer}")
          if mc_question_type:
            print(f"The intended answer in BINARY was: {intended_answer_binary}")
            print(f"The predicted answer in BINARY was: {predicted_answer_binary}\n")
          else:
            print("")
        if mc_question_type:
          mc_answer_comparison.append({'model': mc_model_name, 'intended_answer_binary': intended_answer_binary, 'predicted_answer_binary': predicted_answer_binary, 'intended_answer': intended_answer, 'predicted_answer': predicted_answer, 'type': question_type, 'difficulty': difficulty})
        else:
          answer_comparison.append({'model': oe_model_name, 'intended_answer': intended_answer, 'predicted_answer': predicted_answer, 'type': question_type, 'difficulty': difficulty})

    # Compute metrics, if they were passed as arguments
    if mc_metric is not None:
      try:
        mc_metric_result = mc_metric.compute()
      except Exception:  # e.g. nothing was added to the metric
        mc_metric_result = None
    else:
      mc_metric_result = None
    if oe_metric is not None:
      try:
        oe_metric_result = oe_metric.compute()
      except Exception:  # e.g. nothing was added to the metric
        oe_metric_result = None
    else:
      oe_metric_result = None
    return mc_answer_comparison, answer_comparison, mc_metric_result, oe_metric_result

The model output itself is created in the <question_type>_model_output() functions.
As an example, we show the one for 'MULTI_SELECT' questions here.
The input is tokenized in tokenize_function(), which we therefore also show below.
In that function, we pass the context once per option, as a list. We also pass the question as a list, in which each entry is the question (from the QA-dataset) concatenated with one option. This lets the model return one score per option, which can be read as something like the probability of that option being chosen.
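
The pairing described above can be sketched without a tokenizer. The helper names `build_choice_pairs` and `group_by_question` and the example data are made up for illustration; the sketch only shows how the context is repeated, how each option is appended to the question, and how a flat result is regrouped per question.

```python
def build_choice_pairs(context, question, options):
    """Pair the same context with 'question + option', once per option."""
    first_sentences = [context] * len(options)
    second_sentences = [f"{question} {option}" for option in options]
    return first_sentences, second_sentences


def group_by_question(flat_values, number_of_options):
    """Regroup a flat list into chunks of `number_of_options` (one chunk per question)."""
    return [
        flat_values[i:i + number_of_options]
        for i in range(0, len(flat_values), number_of_options)
    ]


# Made-up example data
context = "The customer asked for a demo next week."
question = "What is the next step?"
options = ["Send offer", "Schedule demo", "No action"]

first, second = build_choice_pairs(context, question, options)
for pair in zip(first, second):
    print(pair)

# One question with three options yields a single group of three entries
print(group_by_question([10, 11, 12], number_of_options=3))
```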

For 'MULTI_SELECT' questions we could not simply pick the single best answer.
To solve this, we took the mean and the standard deviation of the "probabilities" (logits) into account.
We predicted every option whose "probability" was at least the mean plus 40% of the standard deviation.
The aim was to capture confident predictions, account for the spread of the scores, and limit both over- and under-selection.
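
As a small worked example of this threshold rule, the following sketch applies it to made-up scores with the standard library instead of torch. The function name `select_options` and the numbers are illustrative assumptions.

```python
from statistics import mean, stdev


def select_options(scores, options, std_factor=0.4):
    """Keep every option whose score reaches mean + std_factor * standard deviation."""
    # statistics.stdev is the sample standard deviation, which matches the
    # default behaviour of torch.Tensor.std()
    threshold = mean(scores) + std_factor * stdev(scores)
    return [option for option, score in zip(options, scores) if score >= threshold]


scores = [2.0, 1.5, -1.0, -0.5]          # made-up logits for four options
options = ["A", "B", "C", "D"]
print(select_options(scores, options))   # ['A', 'B']
```

Here the mean is 0.5 and the standard deviation is about 1.47, so the threshold is about 1.09 and only the first two options are kept.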

In [None]:
def tokenize_function(example, tokenizer):
    '''
    Converts the question with its context and the given options for multi-/single-select questions into IDs the model can later make sense of. Distinguishes between multi-/single-select and the other question types
    parameters:
    - example: question of the QA-dataset with all its entries (question, context, options and type are required)
    - tokenizer: tokenizer of the model
    output:
    - tokenized: tokenized input example
    '''
    if example["type"] == "SINGLE_SELECT" or example["type"] == "MULTI_SELECT":
      number_of_options = len(example["options"])
      first_sentence = [[example["context"]] * number_of_options]  # Repeat context for each option
      second_sentence = [[example["question"] + " " + option] for option in example["options"]]  # Pair with each option
      tokenized = tokenizer(
          sum(first_sentence, []),
          sum(second_sentence, []),
          padding="longest",
          truncation=True
      )
      # Un-flatten
      return {k: [v[i:i+number_of_options] for i in range(0, len(v), number_of_options)] for k, v in tokenized.items()}

    elif example['type'] == 'NUMBER':
      tokenized = tokenizer(
          example['context'],
          example['question'],
          truncation="only_second",
          max_length=384,
          padding="max_length",
          return_tensors="pt"
      )
    else:
      tokenized = tokenizer(
          example['question'],
          example['context'],
          truncation="only_second",
          max_length=384,
          padding="max_length",
          return_tensors="pt"
      )

    return tokenized

In [None]:
import torch


def multi_select_model_output(model, tokenizer, question, metric=None):
    '''
    Handles a question, its context and its options for a multi-select question
    parameters:
    - model: one MC hugging face model
    - tokenizer: MC hugging face tokenizer
    - question: one question of the QA-dataset as a dictionary
    - metric: metric for evaluating the model output (optional)
    output:
    - intended_answer: the correct/intended answers as a list of strings
    - intended_answer_binary: the correct/intended answers as a list of binary variables, where each entry is one, if option is chosen, 0 else
    - predicted_answer_binary: the predicted answers as a list of binary variables, where each entry is one, if option is chosen, 0 else
    - high_score_answers: the predicted answers as a list of strings
    '''
    intended_answer = question['intended_answer']
    options = question['options']

    # creating input ids by tokenizing the question
    input_ids = tokenize_function(question, tokenizer)
    input_ids = {key: torch.tensor(array) for key, array in input_ids.items()}

    # generating the output
    outputs = model(**input_ids)
    logits = outputs.logits  # Shape: [batch_size, num_choices]
    print(logits)

    ### Use a threshold from deviation and take all options that are higher than the mean + 40% of standard deviation
    mean_score = logits.mean().item()
    std_dev = logits.std().item()
    threshold = mean_score + (0.4 * std_dev)
    high_score_options = (logits >= threshold).nonzero(as_tuple=True)[1]  # Get the indices of valid options

    # List the corresponding options
    high_score_answers = [options[idx] for idx in high_score_options.tolist()]
    intended_answer_binary = [1 if option in intended_answer else 0 for option in options]

    predicted_answer_binary = [1 if option in high_score_answers else 0 for option in options]

    # Add the results to the metric
    if metric is not None:
        metric.add_batch(predictions=predicted_answer_binary, references=intended_answer_binary)

    return intended_answer, intended_answer_binary, predicted_answer_binary, high_score_answers

For the other output functions, please see model_choice_and_fine_tuning.ipynb.

With that, we were able to produce output and compare different models.
That brought us to the task of choosing the model we wanted to fine-tune.

# Chapter 6: Model Choice
(Dominik Schuster)

To take as much as possible into account and to get a better feeling for the model output, we used all four metrics "accuracy", "f1", "precision" and "recall" for the mc questions. For the oe questions we used only "exact match", since returning a wrong phone number and/or date would be very problematic. The last task ('TEXT' questions) was not evaluated, since there is no clear basis for deciding whether an answer is right or wrong. The notes were only summarized.
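
Because the mc predictions are compared as per-option binary lists (see multi_select_model_output() above), the four metrics can be sketched in plain Python as follows. This is a minimal illustration under that assumption, not the metric implementation used in the notebook; the function name `binary_metrics` and the example labels are made up.

```python
def binary_metrics(predictions, references):
    """Accuracy, precision, recall and F1 for flat lists of 0/1 labels."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # chosen and intended
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # chosen but not intended
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # intended but missed
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)  # correctly left out

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Two made-up questions with three options each, flattened into one list per side
predicted = [1, 0, 0, 1, 1, 0]
intended = [1, 0, 0, 0, 1, 1]
print(binary_metrics(predicted, intended))
```

Note that accuracy counts every correctly unselected option as a hit, which is one reason to look at precision, recall and f1 as well.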

Since the models had not been fine-tuned at this stage and therefore had never seen the intended answers, we simply used the training part of the QA-dataset for this comparison. Evaluating on it could not reward overfitting, and the dataset was big enough to get a good impression of how well the models perform.

### Model for open-ended questions

We used "distilbert-base-uncased-distilled-squad" for oe questions, a QA model fine-tuned on the SQuAD dataset.
The metric results were so high that we saw no advantage in trying out more models.


```
The exact_match metric for all open-ended questions in the train dataset: 0.9651162790697675
```
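
For reference, the idea behind an exact-match score can be sketched in a few lines: a prediction only counts if it equals the intended answer character for character. The function name and the example values are made up, and the metric implementation used in the notebook may normalize strings differently.

```python
def exact_match(predictions, references):
    """Share of predictions that equal their reference string exactly."""
    if not predictions:
        return 0.0
    hits = sum(1 for p, r in zip(predictions, references) if p == r)
    return hits / len(predictions)


# Made-up answers: the third prediction misses one digit and therefore counts as wrong
predicted = ["2024-05-01", "12", "0151 234567"]
intended = ["2024-05-01", "12", "0151 2345678"]
print(exact_match(predicted, intended))
```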

### Model for multiple and single choice questions

We decided to test one BERT, one ALBERT, one XLNet and one RoBERTa model. All of them can handle the same type of input and assign a weight to each option, which we then use to predict the right option(s).

The outcome:


```
bert-base-cased: {'accuracy': 0.8473297213622291, 'f1': 0.7606915377616015, 'precision': 0.7660354306658522, 'recall': 0.755421686746988}
xlnet/xlnet-base-cased: {'accuracy': 0.7078977932636469, 'f1': 0.5054080629301868, 'precision': 0.5538793103448276, 'recall': 0.46473779385171793}
FacebookAI/roberta-base: {'accuracy': 0.575687185443283, 'f1': 0.27989487516425754, 'precision': 0.3075812274368231, 'recall': 0.25678119349005424}
albert/albert-base-v2: {'accuracy': 0.6314363143631436, 'f1': 0.3789954337899543, 'precision': 0.4129353233830846, 'recall': 0.350210970464135}

```



# Chapter 7: Fine-tuning
(Dominik Schuster)

The model "bert-base-cased" did quite well on the QA-dataset we created.
"albert/albert-base-v2" also achieved reasonable results considering the relatively small size of the model.
It is also known for learning faster.
This is why we wanted to fine-tune both models.
Note that we did not fine-tune the oe model, as it already performed well out of the box.

Including all of the code here would be unwieldy.
To see it, look into model_choice_and_fine_tuning.ipynb.
To give a glimpse, we show the fine_tune_model() function here.
We used Hugging Face's Trainer API, which simplifies training and evaluation. We also followed common practice for fine-tuning, including logging, evaluation, and model checkpointing.
In addition, we defined our own data collator, DataCollatorForMultipleChoice, to be able to handle the multiple-choice questions. Afterwards, we save the fine-tuned model so it can be reused without retraining.


In [None]:
import os

from google.colab import drive
from transformers import Trainer, TrainingArguments


def fine_tune_model(dataset, tokenizer, model, epochs, output_dir):
    # Preprocess the dataset
    tokenized_dataset = dataset.map(preprocess_function, batched=True)
    # Define training arguments
    training_args = TrainingArguments(output_dir=output_dir,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        logging_dir="./logs",
        learning_rate=4e-5,
        num_train_epochs=epochs,
        weight_decay=0.01,
        logging_steps=10,
        load_best_model_at_end=True,
        report_to="none"
    )
    data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)

    # Define Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=tokenized_dataset['test'],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )
    # Train the model
    trainer.train()
    save_path = f"/content/drive/MyDrive/mc_models/{output_dir}"
    drive.mount('/content/drive')
    # Create the directory if it does not exist
    if not os.path.exists(save_path):
        os.makedirs(save_path)
        print(f"Directory created: {save_path}")
    else:
        print("Directory already exists!")
    trainer.save_model(save_path)
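
The project's own DataCollatorForMultipleChoice is not reproduced in this chapter. The sketch below shows, in plain Python, the flatten, pad and regroup pattern that such a collator commonly follows. It assumes that every example in a batch has the same number of options and a single integer label; the function name, the `label` key and `pad_token_id=0` are illustrative assumptions, not the notebook's code.

```python
def collate_multiple_choice(features, pad_token_id=0):
    """Pad a batch of multiple-choice examples to a common length.

    Each feature is a dict with 'input_ids' (one token list per option) and
    'label' (index of the correct option). Assumes every example has the
    same number of options.
    """
    num_choices = len(features[0]["input_ids"])
    # 1) Flatten: one row per (example, option) pair
    flat = [ids for feature in features for ids in feature["input_ids"]]
    # 2) Pad every row to the longest row in the batch
    max_len = max(len(ids) for ids in flat)
    padded = [ids + [pad_token_id] * (max_len - len(ids)) for ids in flat]
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in flat]

    # 3) Regroup to the nested shape [batch, num_choices, seq_len]
    def regroup(rows):
        return [rows[i:i + num_choices] for i in range(0, len(rows), num_choices)]

    return {
        "input_ids": regroup(padded),
        "attention_mask": regroup(mask),
        "labels": [feature["label"] for feature in features],
    }


# One made-up example with two options of different length
batch = collate_multiple_choice([{"input_ids": [[101, 7, 102], [101, 7, 8, 9, 102]], "label": 1}])
print(batch["input_ids"])
print(batch["attention_mask"])
```

A real collator would additionally convert the nested lists to tensors before handing them to the model.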

# Chapter X: Conclusions
Created by: Alexander Keßler & Dominik Schuster

blablablablabalbalbalbalalbalbal

## Limitations
blablabalbalbabla

## Further Improvements

Despite the limitations mentioned above, we could still improve our code and thereby our results.
In addition, the task the model currently solves lacks practical realism: we see no added value in conducting a survey and asking the model question by question.
Doing so is no easier than filling out the survey by hand, as one would have done before.

So yes, there are improvements to make.
In the following list, we share our thoughts on them:


1.   
2.   In the context the

