
# **Model Choice, Fine-tuning and Evaluation**
Created by: 95% Dominik Schuster, 5% Alexander Keßler

With the QA dataset set up, we can now evaluate different models on the task that the dataset implicitly represents.
To do so, we need functions that translate our dataset entries into model input and, vice versa, the model output into human-readable text.
Thereafter (or, as we will do it, at the same time) the generated output has to be evaluated.

On the basis of this data, we will decide which model we will fine-tune afterwards, to improve the model's performance even more.

Finally, we will test the newly fine-tuned model against the other models evaluated before.

But first things first, let's start with installing and importing all the necessary packages.

In [1]:
!pip install evaluate
!pip install --upgrade sympy

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.17-py311-none-any.whl.metadata (7.2 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

In [2]:
import json
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn.functional as F
import numpy as np
import urllib
from itertools import chain, combinations
from transformers import AutoTokenizer, AutoModelForMultipleChoice, AutoModelForQuestionAnswering, TrainingArguments, pipeline, Trainer, DataCollatorWithPadding, XLNetForMultipleChoice
import requests
import evaluate
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from datasets import Dataset
from typing import Optional, Union
from dateutil import parser
from datetime import datetime
import os
import re
from google.colab import drive


## Model Choice for fine-tuning

Which models work best on the task? We want to get a glimpse of that in order to decide which model to fine-tune. But before that, we have to prepare the dataset itself and some functions for generating model output.

### Prepare dataset

Here we split the previously created QA dataset into a train and a validation dataset. For that, we download it from the public GitHub repository it was uploaded to before.
Additionally, we prepare the dataset for response generation and for fine-tuning a model.

In [3]:
# Load dataset
url = "https://raw.githubusercontent.com/alexk2206/tds_capstone/refs/heads/main/datasets/combined_qa_dataset.json"
data = pd.read_json(url)
# Convert to DataFrame for easy handling
df = pd.DataFrame(data)

# Map the intended answer to the index of the option
df['label'] = df.apply(lambda x: np.array([1 if option in x['intended_answer'] else 0 for option in x['options']]) if x['type'] in ['SINGLE_SELECT', 'MULTI_SELECT'] else np.array([0]), axis=1)
df['stratify_key'] = df['difficulty'] + '_' + df['type']

# Convert to Hugging Face Dataset
qa_dataset = Dataset.from_pandas(df)

In [4]:
# split dataset into train and validation (here called test) dataset with stratifying with question type and difficulty of context
qa_dataset = qa_dataset.class_encode_column(
    "stratify_key"
).train_test_split(test_size=0.2, stratify_by_column="stratify_key", seed=42).remove_columns("stratify_key")

Casting to class labels:   0%|          | 0/1381 [00:00<?, ? examples/s]

In [5]:
qa_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'type', 'options', 'intended_answer', 'context', 'difficulty', 'label'],
        num_rows: 1104
    })
    test: Dataset({
        features: ['question', 'type', 'options', 'intended_answer', 'context', 'difficulty', 'label'],
        num_rows: 277
    })
})

In [6]:
# Label column: the intended_answer as list of binary variables for every option if question type is mc questions, list of entry 0 else
qa_dataset["train"]["label"][:5]

[[0, 1, 0, 0, 0, 0, 0, 0], [0, 1], [0, 1, 0, 1, 0], [0], [1, 0]]
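To make the label construction more concrete, here is a minimal sketch of the same mapping in plain Python. The helper name `build_label` and the sample options are made up for illustration; only the logic mirrors the `df.apply` call above.

```python
def build_label(question_type, options, intended_answer):
    """Return one 0/1 flag per option for select questions, [0] otherwise."""
    if question_type in ("SINGLE_SELECT", "MULTI_SELECT"):
        return [1 if option in intended_answer else 0 for option in options]
    return [0]


# Hypothetical entries, not taken from the QA dataset
multi_label = build_label("MULTI_SELECT", ["Email", "Phone", "Visit"], ["Phone", "Visit"])
date_label = build_label("DATE", [], ["2025-01-15"])

print(multi_label)  # one flag per option
print(date_label)   # placeholder for open-ended questions
```

So the length of the label always equals the number of options for select questions, which explains the differently sized lists in the output above.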

### Generate model output

After creating and preparing the QA dataset, it's time to generate model output with different Hugging Face models.
Since the models cannot be fed human-readable text directly, we have to preprocess the inputs, i.e. the questions of the QA dataset.

**But which approach do we take?** A very good question, because there are several ways to extract an answer and map it to the given options: zero-shot classification; QA where the best answers are compared to the options by similarity; and, last but not least, our approach, which determines the best option(s) directly with a QA (or rather Multiple Choice) model.

**Attention:** The models are called Multiple Choice models because several options exist for each input question. During inference, every option is assigned a weight.
By contrast, we define multiple choice questions as questions where one can select one or more options. Single choice questions also offer several options, but only ONE of them can be chosen.
Thus, in the dataset
* MULTI_SELECT = multiple choice question
* SINGLE_SELECT = single choice question
* questions for the multiple choice models = mc questions


As we found out, QA models (or rather their tokenizers) expect a question and an associated context, from which the question can be answered, as input.
This is easy for open-ended (oe) questions like 'NUMBER' or 'DATE', as we can pose them to the model directly without any changes.
It gets more cumbersome for the mc questions, where one has to choose from different options.
There, we have to pass the context once per option, stored as a list.
We also pass the question as a list, in which each entry consists of the question (of the QA dataset) combined with one option.
That way the model can choose one of the options as the most likely one, in contrast to the model for the oe questions, which returns the most likely start and end position of the answer in the context.
In fact, the required output format forces us to use the more specialized Multiple Choice models instead of QA models for this task.

Besides, we'll use a text-summarization pipeline to summarize the context of 'TEXT' questions, as it doesn't seem reasonable to extract an answer from the context of these questions.

To handle all of the above, the following functions come into play:


*   tokenize_function():
Converts string input into tokens "readable" for the model.
For that, it distinguishes between the question types as mentioned above.
*   model_output(): the "main" operator for generating model output.
Needs the models that should create the output, their tokenizers and the questions for which output should be created.
Additionally, one can pass the metrics for evaluating the model output for the mc and the oe questions on the fly.
To handle 'TEXT' questions, one also has to pass a text-summarization pipeline. It hands the task over to the following functions, which create the final output from the model's logits (or, for 'TEXT', from the summary):
  * single_select_model_output(): one logit for each option.
  It chooses the option with the highest logit
  * multi_select_model_output(): one logit for each option.
  It computes the mean and the standard deviation of the logits and chooses every option whose logit is at least the mean plus 40% of the standard deviation
  * text_model_output(): just summarizes the context
  * number_model_output(): calculates the most likely start and end token and outputs everything in between
  * date_model_output(): as in number_model_output(), but the auxiliary functions "find_date_and_convert()" and "convert_date_format()" convert the output to the format "yyyy-MM-dd", if possible
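
The pairing of context and options for mc questions can be sketched without any tokenizer. `build_mc_pairs` is a hypothetical helper and the context is a shortened, made-up sentence; the sketch only shows which two strings end up in each pair before tokenization.

```python
def build_mc_pairs(context, question, options):
    """Pair the repeated context with 'question + option' for every option."""
    first_sentences = [context] * len(options)
    second_sentences = [f"{question} {option}" for option in options]
    return list(zip(first_sentences, second_sentences))


# Illustrative values only
pairs = build_mc_pairs(
    context="I can call you or we can schedule a visit.",
    question="What kind of follow up is planned",
    options=["Email", "Phone", "Schedule a Visit", "No action"],
)

for first, second in pairs:
    print(f"{first!r} | {second!r}")
```

Each pair is later scored by the Multiple Choice model, which yields exactly one logit per option.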
  



Tokenize function

In [7]:
def tokenize_function(example, tokenizer):
    '''
    Converts the question with its context and, for multi-/single-select questions, the given options into IDs the model can process. Distinguishes between multi-/single-select and the other question types
    parameters:
    - example: question of the QA-dataset with all its entries (question, context, options and type are required)
    - tokenizer: tokenizer of the model
    output:
    - tokenized: tokenized input example
    '''
    if example["type"] == "SINGLE_SELECT" or example["type"] == "MULTI_SELECT":
      number_of_options = len(example["options"])
      first_sentence = [[example["context"]] * number_of_options]  # Repeat context for each option
      second_sentence = [[example["question"] + " " + option] for option in example["options"]]  # Pair with each option
      tokenized = tokenizer(
          sum(first_sentence, []),
          sum(second_sentence, []),
          padding="longest",
          truncation=True
      )
      # Un-flatten
      return {k: [v[i:i+number_of_options] for i in range(0, len(v), number_of_options)] for k, v in tokenized.items()}

    elif example['type'] == 'NUMBER':
      tokenized = tokenizer(
          example['context'],
          example['question'],
          truncation="only_second",
          max_length=384,
          padding="max_length",
          return_tensors="pt"
      )
    else:
      tokenized = tokenizer(
          example['question'],
          example['context'],
          truncation="only_second",
          max_length=384,
          padding="max_length",
          return_tensors="pt"
      )

    return tokenized

Model output

In [8]:
def model_output(mc_model, mc_tokenizer, oe_model, oe_tokenizer, questions, sum_pipeline=None, mc_metric=None, oe_metric=None):
    '''
    model_output -> creates output for every question in the dataset and saves it in a list of dicts
    parameters:
    - mc_model: one hugging face model for mc questions
    - mc_tokenizer: hugging face tokenizer for mc questions
    - oe_model: one hugging face model for oe questions
    - oe_tokenizer: hugging face tokenizer for oe questions
    - questions: QA-dataset as pd.DataFrame
    - sum_pipeline: huggingface text-summarization pipeline to handle 'TEXT' questions
    - mc_metric: metric for evaluating mc questions
    - oe_metric: metric for evaluating oe questions
    output:
    - mc_answer_comparison: list of dicts with keys 'model', 'intended_answer_binary', 'predicted_answer_binary', 'intended_answer', 'predicted_answer', 'type', 'difficulty'
    - answer_comparison: list of dicts with keys 'model', 'intended_answer', 'predicted_answer', 'type', 'difficulty'
    - mc_metric_result: computed mc metric(s), None if no metric was passed or computing failed
    - oe_metric_result: computed oe metric(s), None if no metric was passed or computing failed
    '''
    answer_comparison = []
    mc_answer_comparison = []
    mc_model_name = mc_model.config._name_or_path
    oe_model_name = oe_model.config._name_or_path

    for index, question in questions.iterrows():
        context = question['context']
        question_text = question['question']
        options = question['options']
        question_type = question['type']
        difficulty = question['difficulty']

        mc_question_type = question_type in ["MULTI_SELECT", "SINGLE_SELECT"]

        if question_type == "MULTI_SELECT":
          intended_answer, intended_answer_binary, predicted_answer_binary, predicted_answer = multi_select_model_output(mc_model, mc_tokenizer, question, mc_metric)
        elif question_type == "SINGLE_SELECT":
          intended_answer, intended_answer_binary, predicted_answer_binary, predicted_answer = single_select_model_output(mc_model, mc_tokenizer, question, mc_metric)
        elif question_type == "TEXT":
          intended_answer, predicted_answer = text_model_output(question, sum_pipeline)
          continue
        elif question_type == "NUMBER":
          intended_answer, predicted_answer = number_model_output(oe_model, oe_tokenizer, question, oe_metric)
        elif question_type == "DATE":
          intended_answer, predicted_answer = date_model_output(oe_model, oe_tokenizer, question, oe_metric)
        else:
          continue
        if predicted_answer != intended_answer:
          print('======= Wrong answer =======')
          print(f"Question: {question_text}")
          print(f"Context: {context}")
          print(f"The intended answer was: {intended_answer}")
          print(f"The predicted answer was: {predicted_answer}")
          if mc_question_type:
            print(f"The intended answer in BINARY was: {intended_answer_binary}")
            print(f"The predicted answer in BINARY was: {predicted_answer_binary}\n")
          else:
            print("")
        if mc_question_type:
          mc_answer_comparison.append({'model': mc_model_name, 'intended_answer_binary': intended_answer_binary, 'predicted_answer_binary': predicted_answer_binary, 'intended_answer': intended_answer, 'predicted_answer': predicted_answer, 'type': question_type, 'difficulty': difficulty})
        else:
          answer_comparison.append({'model': oe_model_name, 'intended_answer': intended_answer, 'predicted_answer': predicted_answer, 'type': question_type, 'difficulty': difficulty})

    # Compute metrics, if they were passed as arguments
    mc_metric_result = None
    if mc_metric is not None:
      try:
        mc_metric_result = mc_metric.compute()
      except Exception:
        # e.g. no mc question was added to the metric
        mc_metric_result = None
    oe_metric_result = None
    if oe_metric is not None:
      try:
        oe_metric_result = oe_metric.compute()
      except Exception:
        # e.g. no oe question was added to the metric
        oe_metric_result = None
    return mc_answer_comparison, answer_comparison, mc_metric_result, oe_metric_result


In addition to the returned output, the function prints some values for debugging whenever an answer was predicted wrongly: the question text, its context, the intended answer and the predicted answer. Together with the logits tensor, which is printed for every select question, this looks as follows:

```
tensor([[ 0.2761,  0.2655,  0.2123,  0.2723,  0.1614,  0.2863,  0.2624,  0.0581,
          0.1172,  0.2708,  0.0551, -0.2568,  0.2749]],
       grad_fn=<ViewBackward0>)
======= Wrong answer =======
Question: Who to copy in follow up
Context: Oh hmm, I guess I'd follow up with Stephan Maier, Oliver Eibel, Marisa Peng, Johannes Wagner, Jens Roschmann and also Tim Persson.
The intended answer was: ['Stephan Maier', 'Oliver Eibel', 'Marisa Peng', 'Johannes Wagner', 'Jens Roschmann', 'Tim Persson']
The predicted answer was: ['Stephan Maier', 'Joachim Wagner', 'Oliver Eibel', 'Marisa Peng', 'Johannes Wagner', 'Jens Roschmann', 'Tim Persson']
The intended answer in BINARY was: [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
The predicted answer in BINARY was: [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
```



##### Single-select output

In [9]:
def single_select_model_output(model, tokenizer, question, metric=None):
    '''
    Handles a question, its context and its options for a single-select question and generates output
    parameters:
    - model: one MC hugging face model
    - tokenizer: MC hugging face tokenizer
    - question: one question of the QA-dataset as row of pd.DataFrame
    - metric: metric for evaluating the model output (optional)
    output:
    - intended_answer: the correct/intended answer as a string
    - intended_answer_binary: the correct/intended answer as a list of binary variables, where each entry is 1 if the option is the intended one, 0 otherwise
    - predicted_answer_binary: the predicted answer as a list of binary variables, where each entry is 1 if the option is chosen, 0 otherwise
    - options[predicted_option]: the predicted answer as a string
    '''
    intended_answer = question['intended_answer'][0]
    options = question['options']

    # creating input ids by tokenizing the question
    input_ids = tokenize_function(question, tokenizer)
    input_ids = {key: torch.tensor(array) for key, array in input_ids.items()}

    # generating the output
    outputs = model(**input_ids)
    logits = outputs.logits  # Shape: [batch_size, num_choices]
    print(logits)

    # Predict the option with the highest score
    predicted_option = torch.argmax(logits, dim=1).item()
    predicted_answer_binary = [0] * len(options)
    predicted_answer_binary[predicted_option] = 1

    intended_answer_binary = [1 if option == intended_answer else 0 for option in options]

    # Add the results to the metric
    if metric is not None:
      metric.add_batch(predictions=predicted_answer_binary, references=intended_answer_binary)

    return intended_answer, intended_answer_binary, predicted_answer_binary, options[predicted_option]


##### Multi-select output

For multi-select questions, we choose all options whose logit is at least the mean logit plus 40% of the standard deviation of the logits. We consider this a reasonable approach because:

+++++++++++++ **Capturing Confident Predictions 🎯**

The mean logit represents the average score the model assigns across all options.
Adding 40% of the standard deviation creates a threshold that only selects options scoring clearly above the average, i.e. the options the model is more confident about.

+++++++++++++ **Accounting for Variability 📊**

The standard deviation measures how much the logits vary.
Because the threshold scales with the standard deviation, it adapts to the spread of the logits of each question: we select the highest-scoring options without being overly restrictive.

+++++++++++++ **Preventing Over-Selection & Under-Selection ⚖**

If the threshold were too high, the model might miss correct answers.
If it were too low, the model might select too many options, including incorrect ones.
A factor of 40% of the standard deviation is our compromise between the two.
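
As a worked example, we can apply this threshold to the logits from the debugging output shown earlier ("Who to copy in follow up"). This is only a sketch in plain Python: `select_above_threshold` is a hypothetical helper, and we use `statistics.stdev`, which computes the sample standard deviation (dividing by n - 1), as `torch.Tensor.std()` does by default.

```python
import statistics

# Logits copied from the debugging output above
logits = [0.2761, 0.2655, 0.2123, 0.2723, 0.1614, 0.2863, 0.2624,
          0.0581, 0.1172, 0.2708, 0.0551, -0.2568, 0.2749]


def select_above_threshold(values, factor=0.4):
    """Flag every value that is at least mean + factor * standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    threshold = mean + factor * std
    return threshold, [1 if value >= threshold else 0 for value in values]


threshold, predicted_binary = select_above_threshold(logits)
print(f"threshold: {threshold:.4f}")
print(predicted_binary)
```

The resulting binary list reproduces the "predicted answer in BINARY" of that example, including the one wrongly selected option at index 1, whose logit of 0.2655 lies above the threshold.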

In [10]:
def multi_select_model_output(model, tokenizer, question, metric=None):
    '''
    Handles a question, its context and its options for a multi-select question
    parameters:
    - model: one MC hugging face model
    - tokenizer: MC hugging face tokenizer
    - question: one question of the QA-dataset as a dictionary
    - metric: metric for evaluating the model output (optional)
    output:
    - intended_answer: the correct/intended answers as a list of strings
    - intended_answer_binary: the correct/intended answers as a list of binary variables, where each entry is one, if option is chosen, 0 else
    - predicted_answer_binary: the predicted answers as a list of binary variables, where each entry is one, if option is chosen, 0 else
    - high_score_answers: the predicted answers as a list of strings
    '''
    intended_answer = question['intended_answer']
    options = question['options']

    # creating input ids by tokenizing the question
    input_ids = tokenize_function(question, tokenizer)
    input_ids = {key: torch.tensor(array) for key, array in input_ids.items()}

    # generating the output
    outputs = model(**input_ids)
    logits = outputs.logits  # Shape: [batch_size, num_choices]
    print(logits)

    ### Use a threshold from deviation and take all options that are higher than the mean + 40% of standard deviation
    mean_score = logits.mean().item()
    std_dev = logits.std().item()
    threshold = mean_score + (0.4 * std_dev)
    high_score_options = (logits >= threshold).nonzero(as_tuple=True)[1]  # Get the indices of valid options

    # List the corresponding options
    high_score_answers = [options[idx] for idx in high_score_options.tolist()]
    intended_answer_binary = [1 if option in intended_answer else 0 for option in options]

    predicted_answer_binary = [1 if option in high_score_answers else 0 for option in options]

    # Add the results to the metric
    if metric is not None:
        metric.add_batch(predictions=predicted_answer_binary, references=intended_answer_binary)

    return intended_answer, intended_answer_binary, predicted_answer_binary, high_score_answers



##### Text output

In [11]:
def text_model_output(question, pipeline):
    '''
    Handles an open text question and summarizes it
    parameter:
    - question: one question of the QA-dataset as a dictionary
    - pipeline: huggingface text-summarization pipeline
    output:
    - intended_answer: the full context of the question as a string
    - summary[0][summary_text]: the generated summary as a string
    '''
    intended_answer = question['context']
    summary = pipeline(intended_answer, max_length=len(intended_answer), do_sample=False)
    return intended_answer, summary[0]['summary_text']



##### Phone Number output

In [12]:
def number_model_output(model, tokenizer, question, metric=None):
    '''
    Handles a question where the context should contain a phone number and generates an answer to that question
    parameters:
    - model: one QA hugging face model
    - tokenizer: QA hugging face tokenizer
    - question: one question of the QA-dataset as a dictionary
    - metric: metric for evaluating the model output (optional)
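    output:
    - intended_answer: the correct/intended answer as a string
    - predicted_number: the answer span extracted from the context as a string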
    '''
    intended_answer = question['intended_answer'][0]

    input_ids = tokenize_function(question, tokenizer)
    output = model(**input_ids)
    start_logits, end_logits = output.start_logits, output.end_logits

    # Get most probable start and end index
    start_idx = torch.argmax(start_logits, dim=1).item()
    end_idx = torch.argmax(end_logits, dim=1).item() + 1  # Include last token

    # Convert token IDs to text
    predicted_tokens = input_ids["input_ids"][0][start_idx:end_idx]
    predicted_number = tokenizer.decode(predicted_tokens, skip_special_tokens=True)

    # Add results to the metric
    if metric is not None:
        metric.add(predictions=predicted_number, references=intended_answer)

    return intended_answer, predicted_number
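
The span selection used here can be illustrated with toy values. The following sketch replaces tensors with plain lists; `extract_span`, the tokens and the logits are invented for illustration and are not model output.

```python
def extract_span(tokens, start_logits, end_logits):
    """Return the tokens between the most likely start and end position."""
    start_idx = max(range(len(start_logits)), key=lambda i: start_logits[i])
    end_idx = max(range(len(end_logits)), key=lambda i: end_logits[i]) + 1  # include last token
    return tokens[start_idx:end_idx]


tokens = ["[CLS]", "when", "?", "[SEP]", "on", "january", "15th", "[SEP]"]

# Start peaks at "january", end peaks at "15th"
span = extract_span(
    tokens,
    start_logits=[0.1, 0.0, 0.0, 0.0, 0.2, 3.1, 0.5, 0.0],
    end_logits=[0.0, 0.0, 0.0, 0.0, 0.1, 0.4, 2.9, 0.0],
)
print(span)

# If the most likely end lies before the most likely start, the slice is empty
empty_span = extract_span(
    tokens,
    start_logits=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0],
    end_logits=[0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
)
print(empty_span)
```

Since start and end are chosen independently, nothing guarantees that the end comes after the start; in that case the slice, and therefore the predicted answer, is empty.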

##### Date output

Here we have auxiliary functions that convert the model output, which is mostly a date in some arbitrary format, into a date in the format YYYY-mm-dd.

In [13]:
def convert_date_format(date_str):
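  '''
  Parses a date string in any format dateutil understands and returns it as 'YYYY-MM-DD'.
  Returns the input string unchanged if parsing fails.
  '''
  try:
    parsed_date = parser.parse(date_str)
    return parsed_date.strftime('%Y-%m-%d')
  except Exception:
    return date_str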

def find_date_and_convert(input_string):
  date_regex = r'\b(?:\d{1,2}(?:st|nd|rd|th)?\s+[A-Za-z]+\s+\d{4}|\d{1,2}[./-]\d{1,2}[./-]\d{2,4}|\b[A-Za-z]+\s+\d{1,2}(?:st|nd|rd|th)?,?\s+\d{4})\b'
  match = re.search(date_regex, input_string)
  if match:
    extracted_date = match.group(0)
    formatted_date = convert_date_format(extracted_date)
    return formatted_date
  else:
    return input_string


In [14]:
def date_model_output(model, tokenizer, question, metric=None):
    '''
    Handles a question where the context should contain a date and generates an answer to that question
    '''
    intended_answer = question['intended_answer'][0]

    input_ids = tokenize_function(question, tokenizer)
    output = model(**input_ids)
    start_logits, end_logits = output.start_logits, output.end_logits
    # Get most probable start and end index
    start_idx = torch.argmax(start_logits, dim=1).item()
    end_idx = torch.argmax(end_logits, dim=1).item() + 1  # Include last token

    # Convert token IDs to text
    predicted_tokens = input_ids["input_ids"][0][start_idx:end_idx]
    predicted_answer = tokenizer.decode(predicted_tokens, skip_special_tokens=True)
    formatted_predicted_answer = find_date_and_convert(predicted_answer)

    if metric is not None:
        metric.add(predictions=formatted_predicted_answer, references=intended_answer)

    return intended_answer, formatted_predicted_answer



## Model Selection

**Which model works best to predict the intended answer?**
This is what we'll find out here.

First, we load the metrics we want the models to be evaluated on.
As we want to take as much as possible into account and also get a better feeling for the model output, we choose all of the metrics "accuracy", "f1", "precision" and "recall" for the mc model questions. For the oe model questions we only use "exact match", as receiving a wrong phone number and/or date would be very problematic. The last task ('TEXT' questions) won't be evaluated, since there is no real basis for deciding whether a summary is right or wrong. We'll only summarize the notes.
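
To get a feeling for what these four numbers mean on our binary option vectors, here is a small sketch in plain Python. `binary_metrics` is a hypothetical helper, not the implementation of the `evaluate` library; the two vectors are taken from the debugging example shown earlier. As far as our output functions go, `add_batch` receives one 0/1 entry per option, so every option counts as one binary prediction.

```python
def binary_metrics(predictions, references):
    """Compute accuracy, precision, recall and F1 for two 0/1 lists."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


intended = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1]

result = binary_metrics(predicted, intended)
print(result)
```

With one wrongly selected option out of 13, all intended options are found (recall of 1), while precision drops to 6/7 and accuracy to 12/13.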

And which dataset do we evaluate on?
Since the models have not been trained on our data at this stage, they cannot have memorized the intended answers, so we simply use the training part of the QA dataset.
No training happens here, so there is no risk of overfitting, and the training part is big enough to really get a glimpse of how well the models perform.



**Remark:** We only consider relatively small models, as we heard that the other groups ran into memory and other problems related to technical resources.


In [None]:
# load metrics
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
exact_match = evaluate.load("exact_match")

In [None]:
# load text-summarization pipeline
summarization_pipeline = pipeline("summarization", model="t5-small")

Device set to use cuda:0


In [None]:
# filter train dataset on mc questions for mc model evaluation
mc_train_qa_dataset = pd.DataFrame(qa_dataset['train'].filter(lambda example: example['type'] in ['MULTI_SELECT', 'SINGLE_SELECT']))
mc_train_qa_dataset.shape

model_results = []

Filter:   0%|          | 0/1104 [00:00<?, ? examples/s]

### Model for open-ended questions

In order to be able to evaluate, we have to define both an oe and an mc model.
That's why we already instantiate the "bert-base-cased" model used for mc.
Afterwards we initialize "distilbert-base-uncased-distilled-squad" for the oe questions, a QA model fine-tuned on the SQuAD dataset.
In the last step, we filter the train QA dataset, so that only questions of type 'DATE' or 'NUMBER' are left.

In [None]:
model_name = "bert-base-cased"
model = AutoModelForMultipleChoice.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
oe_model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased-distilled-squad")
oe_tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-distilled-squad")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/451 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [None]:
oe_qa_train_dataset = pd.DataFrame(qa_dataset['train'].filter(lambda example: example['type'] in ['DATE', 'NUMBER']))
oe_qa_train_dataset.shape

Filter:   0%|          | 0/1104 [00:00<?, ? examples/s]

(86, 7)

So let's try to get an output for the oe model.

In [None]:
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(model, tokenizer, oe_model, oe_tokenizer, oe_qa_train_dataset, mc_metric=clf_metrics, oe_metric=exact_match)
print(f"The exact_match metric for all open-ended questions in the train dataset: {oe_metric_result['exact_match']}")

Question: When do you wish to receive a follow-up?
Context: How about we touch base again on January 15th?  That works for me.
The intended answer was: 2025-01-15
The predicted answer was: january 15th

Question: When do you wish to receive a follow-up?
Context: How about we connect again on January 17th?  That works for me.
The intended answer was: 2025-01-17
The predicted answer was: january 17th

Question: When do you wish to receive a follow-up?
Context: How about we follow up around January 22nd of 2025? That should work nicely.
The intended answer was: 2025-01-22
The predicted answer was: 22nd of 2025

The exact_match metric for all open-ended questions in the train dataset: 0.9651162790697675
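
The three mismatches above can be partly explained by looking at what the regular expression in `find_date_and_convert()` picks up. The sketch below only uses `re` with the same pattern; `matched_date` is a hypothetical helper, and the last two inputs are made-up examples rather than model output.

```python
import re

# Same pattern as in find_date_and_convert()
DATE_REGEX = r'\b(?:\d{1,2}(?:st|nd|rd|th)?\s+[A-Za-z]+\s+\d{4}|\d{1,2}[./-]\d{1,2}[./-]\d{2,4}|\b[A-Za-z]+\s+\d{1,2}(?:st|nd|rd|th)?,?\s+\d{4})\b'


def matched_date(text):
    """Return the part of the text the date pattern matches, or None."""
    match = re.search(DATE_REGEX, text)
    return match.group(0) if match else None


predictions = ["january 15th", "22nd of 2025", "January 22nd, 2025", "15.01.2025"]
matches = {prediction: matched_date(prediction) for prediction in predictions}

for prediction, match in matches.items():
    print(f"{prediction!r} -> {match!r}")
```

"january 15th" contains no year, so the pattern finds nothing and the string is returned unchanged. "22nd of 2025" does match (with "of" in the place of the month), but the output above shows it unchanged, which suggests that the subsequent parsing step could not interpret it.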


Actually, this looks quite good: only 3 of the 86 questions are answered wrongly, so there is no need to search for another model here.
So we can concentrate on the mc models.

### Models for multiple and single choice questions

We decided to test one BERT, ALBERT, XLNet and RoBERTa model each.
All of them can handle the same type of input and are able to weight the options.
We afterwards use these weights to predict the right option(s).

So let's start!!! 🥳

#### BERT base model (cased)

In [None]:
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(model, tokenizer, oe_model, oe_tokenizer, mc_train_qa_dataset, mc_metric=clf_metrics, oe_metric=exact_match)

The last 5000 lines of the streaming output were truncated.
Question: What is the type of contact?
Context: Okay, so, the contact type could be an 'Existing customer', a 'New customer / Prospect', maybe someone from 'Press / media', or even a 'Competitor'.
The intended answer was: ['Existing customer', 'New customer / Prospect', 'Press / media', 'Competitor']
The predicted answer was: ['Existing customer', 'Competitor']
The intended answer in BINARY was: [1, 0, 1, 1, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 1]

tensor([[ 0.1124, -0.1531,  0.2460, -0.4431, -0.3133]],
       grad_fn=<ViewBackward0>)
Question: What is the type of contact?
Context: Oh, well it could be an existing customer, a supplier, or a new customer or prospect. I think maybe it's a new customer then.
The intended answer was: ['Existing customer', 'Supplier', 'New customer / Prospect']
The predicted answer was: ['Existing customer', 'New customer / Prospect']
The intended answer in BINAR

In [None]:
print(f"The metrics for all mc questions in the train dataset:\n{model_name}: {mc_metric_result}")
mc_metric_result['model_name'] = model_name
model_results.append(mc_metric_result)

The metrics for all mc questions in the train dataset:
bert-base-cased: {'accuracy': 0.8473297213622291, 'f1': 0.7606915377616015, 'precision': 0.7660354306658522, 'recall': 0.755421686746988}


That looks quite impressive, doesn't it?

```
{'accuracy': 0.8473297213622291, 'f1': 0.7606915377616015, 'precision': 0.7660354306658522, 'recall': 0.755421686746988}
```


#### XLNet base model (cased)




In [None]:
model_name = "xlnet/xlnet-base-cased"
mc_model = XLNetForMultipleChoice.from_pretrained(model_name)
mc_tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of XLNetForMultipleChoice were not initialized from the model checkpoint at xlnet/xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(mc_model, mc_tokenizer, oe_model, oe_tokenizer, mc_train_qa_dataset, mc_metric=clf_metrics, oe_metric=exact_match)
print(f"The metrics for all mc questions in the train dataset:\n{mc_metric_result}")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


The last 5000 lines of the streaming output were truncated.
The predicted answer was: ['MY-SYSTEM', 'JS EcoLine']
The intended answer in BINARY was: [1, 0, 1, 1, 0, 0]
The predicted answer in BINARY was: [1, 0, 0, 1, 0, 0]

tensor([[0.1831, 0.1568]], grad_fn=<ViewBackward0>)
tensor([[ 0.2677,  0.1753,  0.1103, -0.0263]], grad_fn=<ViewBackward0>)
Question: What kind of follow up is planned
Context: Okay, for follow up, I can either call you, *phone*, or we can *schedule a visit*. If neither is needed, we'll take *no action*.
The intended answer was: ['Phone', 'Schedule a Visit', 'No action']
The predicted answer was: ['Email']
The intended answer in BINARY was: [0, 1, 1, 1]
The predicted answer in BINARY was: [1, 0, 0, 0]

tensor([[-0.0766, -0.0440,  0.0239,  0.0230,  0.1300,  0.0417,  0.1068]],
       grad_fn=<ViewBackward0>)
tensor([[ 0.2129, -0.2678,  0.2089,  0.0660,  0.1731,  0.2041]],
       grad_fn=<ViewBackward0>)
Question: Products interested in
Context: I 

In [None]:
print(f"The metrics for all mc questions in the train dataset:\n{model_name}: {mc_metric_result}")
mc_metric_result['model_name'] = model_name
model_results.append(mc_metric_result)

The metrics for all mc questions in the train dataset:
xlnet/xlnet-base-cased: {'accuracy': 0.7078977932636469, 'f1': 0.5054080629301868, 'precision': 0.5538793103448276, 'recall': 0.46473779385171793}


Also reasonably good, given that no fine-tuning has been done:


```
{'accuracy': 0.7078977932636469, 'f1': 0.5054080629301868, 'precision': 0.5538793103448276, 'recall': 0.46473779385171793}
```

#### RoBERTa base model of FacebookAI

In [None]:
model_name = "FacebookAI/roberta-base"
mc_model = AutoModelForMultipleChoice.from_pretrained(model_name)
mc_tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of RobertaForMultipleChoice were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(mc_model, mc_tokenizer, oe_model, oe_tokenizer, mc_train_qa_dataset, mc_metric=clf_metrics, oe_metric=exact_match)

The last 5000 lines of the streaming output were truncated.
tensor([[0.1364, 0.1373, 0.1391, 0.1375, 0.1388, 0.1386, 0.1323]],
       grad_fn=<ViewBackward0>)
Question: Size of the trade fair team (on average)
Context: Hmm, on average, I think the trade fair team would be more than 40 people, so something around that number sounds right.
The intended answer was: more than 40
The predicted answer was: 11-15
The intended answer in BINARY was: [0, 0, 0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 1, 0, 0, 0, 0]

tensor([[0.1318, 0.1324, 0.1339, 0.1312, 0.1301, 0.1254, 0.1323, 0.1301, 0.1338,
         0.1336, 0.1323, 0.1329, 0.1320]], grad_fn=<ViewBackward0>)
Question: Who to copy in follow up
Context: I'd copy Stephan Maier, Joachim Wagner, Oliver Eibel, Johannes Wagner, Jessica Hanke, and Tim Persson;  they all need to be in the loop on this follow up.
The intended answer was: ['Stephan Maier', 'Joachim Wagner', 'Oliver Eibel', 'Johannes Wagner', 'Jessica 

In [None]:
print(f"The metrics for all mc questions in the train dataset:\n{model_name}: {mc_metric_result}")
mc_metric_result['model_name'] = model_name
model_results.append(mc_metric_result)

The metrics for all mc questions in the train dataset:
FacebookAI/roberta-base: {'accuracy': 0.575687185443283, 'f1': 0.27989487516425754, 'precision': 0.3075812274368231, 'recall': 0.25678119349005424}


This looks considerably worse:


```
{'accuracy': 0.575687185443283, 'f1': 0.27989487516425754, 'precision': 0.3075812274368231, 'recall': 0.25678119349005424}
```


#### ALBERT base version 2

In [None]:
model_name = "albert/albert-base-v2"
mc_model = AutoModelForMultipleChoice.from_pretrained(model_name)
mc_tokenizer = AutoTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

Some weights of AlbertForMultipleChoice were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

In [None]:
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(mc_model, mc_tokenizer, oe_model, oe_tokenizer, mc_train_qa_dataset, mc_metric=clf_metrics, oe_metric=exact_match)

The last 5000 lines of the streaming output were truncated.
Question: Which language is wanted for communication? 
Context: I'd prefer Japanese, I guess.  I don't know what other languages were offered, but Japanese is what comes to mind.
The intended answer was: Japanese 
The predicted answer was: Italian
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[0.1016, 0.2867, 0.1963, 0.1779, 0.2825]], grad_fn=<ViewBackward0>)
Question: What is the type of contact?
Context: Oh, um, I guess it could be a Supplier, or maybe a New customer or Prospect. It might even be Press or media. Could it also be a Competitor. I don't know, maybe it's any of
The intended answer was: ['Supplier', 'New customer / Prospect', 'Press / media', 'Competitor']
The predicted answer was: ['Supplier', 'Competitor']
The intended answer in BINARY was: [0, 1, 1, 1, 1]
The predicted answer in BINARY was: [0, 1, 0, 0, 1]

tensor([[-0.0455,

In [None]:
print(f"The metrics for all mc questions in the train dataset:\n{model_name}: {mc_metric_result}")
mc_metric_result['model_name'] = model_name
model_results.append(mc_metric_result)

The metrics for all mc questions in the train dataset:
albert/albert-base-v2: {'accuracy': 0.6314363143631436, 'f1': 0.3789954337899543, 'precision': 0.4129353233830846, 'recall': 0.350210970464135}


This is better than the RoBERTa model, but not as good as the BERT model:


```
{'accuracy': 0.6314363143631436, 'f1': 0.3789954337899543, 'precision': 0.4129353233830846, 'recall': 0.350210970464135}
```

All in all, the **"bert-base-cased"** model works best for our task.

## Fine-tuning a model

As mentioned above, the **"bert-base-cased"** model does quite well on the QA-dataset we created.
But **"albert/albert-base-v2"** also achieved reasonable results given the comparatively small size of the model.
It is also known for faster training.
This is why we want to **fine-tune both models.**

To do so, we have to preprocess the data.
In fact, we can reuse almost all of tokenize_function() for that.

In [None]:
def preprocess_function(example):
    '''
    Converts a question with its context and, for single-/multi-select questions, its options into token IDs the model can process.
    Select questions are tokenized as one (context, question + option) pair per option; all other questions as a single pair.
    Note: relies on the global `tokenizer` defined before this function is called.
    parameters:
    - example: entry of the QA-dataset (the fields question, context, options and type are required)
    output:
    - tokenized: tokenized input example
    '''
    if example["type"] == "SINGLE_SELECT" or example["type"] == "MULTI_SELECT":
      number_of_options = len(example["options"])
      first_sentence = [[example["context"]] * number_of_options]  # Repeat context for each option
      second_sentence = [[example["question"] + " " + option] for option in example["options"]]  # Pair with each option
      tokenized = tokenizer(
          sum(first_sentence, []),
          sum(second_sentence, []),
          padding="longest",
          truncation=True
      )
      # Un-flatten
      return {k: [v[i:i+number_of_options] for i in range(0, len(v), number_of_options)] for k, v in tokenized.items()}

    elif example['type'] == 'NUMBER':
      tokenized = tokenizer(
          example['context'],
          example['question'],
          truncation="only_second",
          max_length=384,
          padding="max_length",
          return_tensors="pt"
      )
    else:
      tokenized = tokenizer(
          example['question'],
          example['context'],
          truncation="only_second",
          max_length=384,
          padding="max_length",
          return_tensors="pt"
      )

    return tokenized
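
To make the pairing and un-flattening step of the select branch concrete, here is a minimal sketch on toy data. The `toy_tokenizer` below is a hypothetical stand-in that turns each word into one number (its length); it is not the Hugging Face tokenizer and only mimics the dict-of-lists shape of its output.

```python
def toy_tokenizer(first, second):
    """Hypothetical stand-in for a real tokenizer: one 'ID' per word (its length)."""
    input_ids = [[len(w) for w in (a + " " + b).split()] for a, b in zip(first, second)]
    longest = max(len(ids) for ids in input_ids)
    # Pad every sequence to the longest one, mirroring padding="longest"
    return {
        "input_ids": [ids + [0] * (longest - len(ids)) for ids in input_ids],
        "attention_mask": [[1] * len(ids) + [0] * (longest - len(ids)) for ids in input_ids],
    }

example = {
    "question": "What is the type of contact?",
    "context": "I think it is a new customer.",
    "options": ["Existing customer", "Supplier", "New customer / Prospect"],
}

n = len(example["options"])
first = [example["context"]] * n                                       # context repeated once per option
second = [example["question"] + " " + o for o in example["options"]]   # question paired with each option

flat = toy_tokenizer(first, second)
# Group the flat list back into chunks of n sequences (one chunk per question)
grouped = {k: [v[i:i + n] for i in range(0, len(v), n)] for k, v in flat.items()}

print(len(flat["input_ids"]))        # number of (context, question + option) pairs
print(len(grouped["input_ids"]))     # number of questions
print(len(grouped["input_ids"][0]))  # number of options for that question
```

The flat list holds one sequence per option, while the grouped version nests them as questions × options × tokens.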

We can't use the default data collator here, so we have to define our own `DataCollatorForMultipleChoice`.

The `DataCollatorForMultipleChoice` performs **dynamic padding** and **batch preparation** for a multiple-choice task.

🔹 **Key Responsibilities**:

1️⃣ **Pads Missing Choices:**  
   - Ensures that all questions have the same number of answer choices by padding shorter ones with empty token sequences.

2️⃣ **Reshapes Data for Model Input:**  
   - Converts batch tensors into shape `(batch_size, num_choices, sequence_length)` so the model can process them correctly.

3️⃣ **Handles Label Padding (if needed):**  
   - Ensures labels are also padded if they contain multiple selections (for multi-select questions).

🔹 **Why Is This Important?**

✅ Handles **variable answer choices** per question (different number of options per question).  
✅ Ensures **uniform input shape** so the model can process all questions in a batch.  
✅ Supports **both single-select and multi-select** multiple-choice tasks.  
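
The three responsibilities listed above can be traced on toy data. The sketch below uses plain Python lists instead of tensors; `pad_choices` and `pad_sequences` are hypothetical helper names chosen for illustration, and the token IDs are made up.

```python
def pad_choices(features, pad_id=0):
    """Pad every question to the same number of choices with all-zero rows."""
    max_choices = max(len(choices) for choices in features)
    padded = []
    for choices in features:
        seq_len = len(choices[0])
        padded.append(choices + [[pad_id] * seq_len] * (max_choices - len(choices)))
    return padded, max_choices

def pad_sequences(features, pad_id=0):
    """Pad every token sequence in the batch to the longest sequence."""
    longest = max(len(seq) for choices in features for seq in choices)
    return [[seq + [pad_id] * (longest - len(seq)) for seq in choices] for choices in features]

# Toy batch: question 0 has 3 choices of 4 tokens, question 1 has 2 choices of 6 tokens
batch = [
    [[101, 7, 8, 102], [101, 7, 9, 102], [101, 7, 5, 102]],
    [[101, 3, 4, 5, 6, 102], [101, 3, 4, 5, 2, 102]],
]
labels = [[1, 0, 1], [0, 1]]

padded, max_choices = pad_choices(batch)
padded = pad_sequences(padded)
# Labels get a 0 for every padded choice
labels = [lab + [0] * (max_choices - len(lab)) for lab in labels]

shape = (len(padded), len(padded[0]), len(padded[0][0]))
print(shape)   # (batch_size, num_choices, sequence_length) -> (2, 3, 6)
print(labels)  # [[1, 0, 1], [0, 1, 0]]
```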


In [None]:
@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0] else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)

        # Determine the maximum number of choices in the batch
        max_choices = max(len(label) for label in labels)
        # Pad questions with fewer choices using all-zero sequences
        for feature in features:
            while len(feature["input_ids"]) < max_choices:
                for key in feature.keys():
                    feature[key].append([0] * len(feature[key][0]))  # Pad with zeros

        # Flatten for tokenization
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(max_choices)]
            for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Reshape tensors to match (batch_size, num_choices, sequence_length)
        batch = {k: v.view(batch_size, max_choices, -1) for k, v in batch.items()}

        # Handle label padding if necessary
        for i, label in enumerate(labels):
            if isinstance(label, list):  # Handle MULTI_SELECT cases
                labels[i] += [0] * (max_choices - len(label))  # Pad labels with 0s
            else:
                labels[i] = label  # Keep as-is for SINGLE_SELECT
        batch["labels"] = torch.tensor(labels, dtype=torch.float).view(batch_size, -1)

        return batch


Then we can load the metric, which is key for choosing the model of a particular epoch.
This metric is computed with the compute_metrics() function.
There, we choose the answers as we did before for multiple-choice questions: every option whose logit is at least the mean of the logits plus 40% of their standard deviation is selected.

Note that we DO NOT and CANNOT distinguish between single-select and multi-select questions here.
So the model may implicitly be trained to select more than one option for single-select questions as well.
A better approach would be to train on the single-select questions first and on the multi-select questions afterwards.
Since we only became aware of this problem after fine-tuning and evaluation were finished, we keep it like this for now.
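
As a worked example of the selection rule and of the caveat above, the sketch below applies the threshold to two score vectors copied from the evaluation printouts in this notebook. It uses the population standard deviation (`statistics.pstdev`), which matches NumPy's default `std()`; which variant applies at a given place in the notebook is an assumption. `select_options` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def select_options(scores, factor=0.4):
    """Return a 0/1 list: 1 where score >= mean + factor * standard deviation."""
    threshold = mean(scores) + factor * pstdev(scores)
    return [int(s >= threshold) for s in scores]

# Scores printed for a multi-select question ("What is the type of contact?")
multi = [0.1016, 0.2867, 0.1963, 0.1779, 0.2825]
# Scores printed for a question with one intended answer ("How satisfied are you ...?")
single = [0.6079, 0.5647, 0.5406, 0.6192, 0.6080]

print(select_options(multi))   # [0, 1, 0, 0, 1]
print(select_options(single))  # [1, 0, 0, 1, 1]
```

For the second vector the rule marks three options, although the question expects exactly one answer. This is the effect described in the note above.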

In [None]:
metric = evaluate.load("f1")

Downloading builder script:   0%|          | 0.00/6.79k [00:00<?, ?B/s]

In [None]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    for i in range(len(logits)):
      mean_score = logits[i].mean().item()
      std_dev = logits[i].std().item()

      # Select every option whose logit reaches mean + 0.4 * standard deviation
      threshold = mean_score + (0.4 * std_dev)
      prediction = (logits[i] >= threshold).astype(int)
      metric.add_batch(predictions=prediction, references=labels[i].astype(int))
    # Macro-averaged F1 over the two classes (option selected / not selected)
    return metric.compute(average="macro")
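
To make explicit what a macro-averaged F1 over 0/1 decisions means, here is a hand-rolled sketch in plain Python. It is not the implementation behind `evaluate.load("f1")`; it assumes both classes (0 and 1) occur, and the helper names are hypothetical. The two vectors are one intended/predicted pair from the printouts above.

```python
def f1_for_class(predictions, references, cls):
    """F1 score when `cls` is treated as the positive class."""
    tp = sum(p == cls and r == cls for p, r in zip(predictions, references))
    fp = sum(p == cls and r != cls for p, r in zip(predictions, references))
    fn = sum(p != cls and r == cls for p, r in zip(predictions, references))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(predictions, references, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(predictions, references, c) for c in classes) / len(classes)

references = [1, 0, 1, 1, 1]   # intended answer in binary form
predictions = [1, 0, 0, 0, 1]  # predicted answer in binary form

print(round(f1_for_class(predictions, references, 1), 4))  # class "selected": 0.6667
print(round(f1_for_class(predictions, references, 0), 4))  # class "not selected": 0.5
print(round(macro_f1(predictions, references), 4))         # mean of both: 0.5833
```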

For training, we use Hugging Face's Trainer API, which simplifies training and evaluation.
We also apply common practices for fine-tuning, including logging, per-epoch evaluation, and model checkpointing.
Afterwards, we save the fine-tuned model so it can be reused without retraining.

In [None]:
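def fine_tune_model(dataset, tokenizer, model, epochs, output_dir):
    '''
    Fine-tunes a multiple-choice model with the Trainer API and saves the result to Google Drive.
    parameters:
    - dataset: DatasetDict with a 'train' and a 'test' split
    - tokenizer: tokenizer of the model
    - model: model to fine-tune
    - epochs: number of training epochs
    - output_dir: directory for the checkpoints, also used as folder name on Google Drive
    '''
    # Preprocess the dataset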
    tokenized_dataset = dataset.map(preprocess_function, batched=True)
    # Define training arguments
    training_args = TrainingArguments(output_dir=output_dir,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        logging_dir="./logs",
        learning_rate=4e-5,
        num_train_epochs=epochs,
        weight_decay=0.01,
        logging_steps=10,
        load_best_model_at_end=True,
        report_to="none"
    )
    data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)

    # Define Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=tokenized_dataset['test'],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )
    # Train the model
    trainer.train()
    save_path = f"/content/drive/MyDrive/mc_models/{output_dir}"
    drive.mount('/content/drive')
    # Create the directory if it does not exist
    if not os.path.exists(save_path):
        os.makedirs(save_path)
        print(f"Directory created: {save_path}")
    else:
        print("Directory already exists!")
    trainer.save_model(save_path)
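
With `load_best_model_at_end=True`, the Trainer reloads the best checkpoint after training. As far as the Transformers documentation states, "best" is judged by the evaluation loss unless `metric_for_best_model` names another metric; treat this as an assumption for the installed version. The sketch below compares both criteria on the (epoch, validation loss, F1) rows of the BERT fine-tuning log shown further below; `best_epoch` is a hypothetical helper.

```python
# (epoch, validation loss, F1) rows copied from the BERT fine-tuning log of this notebook
bert_log = [
    (1, 3.931890, 0.267477),
    (2, 4.043892, 0.258280),
    (3, 4.012613, 0.257569),
    (4, 4.000141, 0.261581),
    (5, 3.990000, 0.260770),
    (6, 3.991022, 0.261980),
    (7, 3.996799, 0.264252),
]

def best_epoch(log, column, higher_is_better):
    """Return the epoch whose value in `column` is best."""
    pick = max if higher_is_better else min
    return pick(log, key=lambda row: row[column])[0]

print(best_epoch(bert_log, column=1, higher_is_better=False))  # criterion: validation loss
print(best_epoch(bert_log, column=2, higher_is_better=True))   # criterion: F1
```

For this log both criteria point to the first epoch.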

##### Fine-tuning BERT
So let's start with the "bert-base-cased" model

In [None]:
# load the model again
model_name = "bert-base-cased"
bert_model = AutoModelForMultipleChoice.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# filter on mc questions
mc_qa_dataset = qa_dataset.filter(lambda example: example['type'] in ['SINGLE_SELECT', 'MULTI_SELECT'])
mc_qa_dataset.shape

Filter:   0%|          | 0/1104 [00:00<?, ? examples/s]

Filter:   0%|          | 0/277 [00:00<?, ? examples/s]

{'train': (937, 7), 'test': (235, 7)}

In [None]:
# apply fine-tuning methods
fine_tune_model(mc_qa_dataset, tokenizer, bert_model, 7, "bert_fine_tuned")

Map:   0%|          | 0/937 [00:00<?, ? examples/s]

Map:   0%|          | 0/235 [00:00<?, ? examples/s]

  trainer = Trainer(
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1
1,3.9113,3.93189,0.267477
2,3.8855,4.043892,0.25828
3,3.9559,4.012613,0.257569
4,4.0031,4.000141,0.261581
5,3.906,3.99,0.26077
6,3.8987,3.991022,0.26198
7,3.9052,3.996799,0.264252


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Directory created: /content/drive/MyDrive/mc_models/bert_fine_tuned


We don't want to read too much into the fine-tuning results for now.
We'll see the differences between the models in the evaluation section (Overview notebook).


##### Fine-tuning ALBERT
After having fine-tuned the BERT model, we want to do that for the "albert/albert-base-v2" model, too

In [None]:
# Load model again
model_name = "albert/albert-base-v2"
albert_model = AutoModelForMultipleChoice.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of AlbertForMultipleChoice were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Apply fine-tuning methods
fine_tune_model(mc_qa_dataset, tokenizer, albert_model, 7, "albert/albert-base-v2")

Map:   0%|          | 0/14 [00:00<?, ? examples/s]

Map:   0%|          | 0/14 [00:00<?, ? examples/s]

  trainer = Trainer(
You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1
1,2.9695,2.752164,0.285463
2,2.6967,2.852798,0.237407
3,2.6981,2.861069,0.214931
4,2.6838,2.859945,0.225911
5,2.6575,2.845096,0.241281
6,2.5989,2.842575,0.24565
7,2.6007,2.840811,0.24565


Mounted at /content/drive
cp: cannot create directory '/content/drive/MyDrive/albert/albert-base-v2': No such file or directory


## Testing / Evaluation

So what is the conclusion?
Which model works best?
This will be answered in the Overview notebook; here we only prepare the data for the evaluation.
This requires few new ideas, since the function model_output() defined at the beginning already returns the model results and metrics.

So let's apply this function to the different models on the test dataset, which was created in the other notebook.

In [15]:
# load dataset

url = "https://raw.githubusercontent.com/alexk2206/tds_capstone/refs/heads/main/datasets/test_qa_dataset_with_answers.json"
data = pd.read_json(url)
# Convert to DataFrame for easy handling
test_df = pd.DataFrame(data)

# Encode the intended answer as a 0/1 vector over the options (select questions only)
test_df['label'] = test_df.apply(lambda x: np.array([1 if option in x['intended_answer'] else 0 for option in x['options']]) if x['type'] in ['SINGLE_SELECT', 'MULTI_SELECT'] else np.array([0]), axis=1)
test_df["intended_answer"] = test_df["intended_answer"].apply(lambda x: x if isinstance(x, list) else [x])

test_dataset = Dataset.from_pandas(test_df)
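
The label mapping turns each intended answer into a 0/1 vector over the options. Below is a minimal sketch with a hypothetical option list; `to_multi_hot` is an illustrative helper, not a function of this notebook. Unlike the cell above, it wraps a single answer in a list before the membership test, because Python's `in` checks substring containment on a plain string and membership on a list.

```python
def to_multi_hot(intended_answer, options):
    """Return a 0/1 list marking which options are part of the intended answer."""
    # Wrap a single string in a list so that `in` tests list membership,
    # not substring containment.
    answers = intended_answer if isinstance(intended_answer, list) else [intended_answer]
    return [1 if option in answers else 0 for option in options]

options = ["Satisfied", "Very satisfied", "Unsatisfied"]  # hypothetical option list

print(to_multi_hot(["Satisfied", "Unsatisfied"], options))  # multi-select answer
print(to_multi_hot("Very satisfied", options))              # single-select answer

# `in` on a plain string matches substrings, on a list only whole entries
print("satisfied" in "Very satisfied")    # True
print("satisfied" in ["Very satisfied"])  # False
```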

In [16]:
df_test_dataset = pd.DataFrame(test_dataset)
df_test_dataset.shape

(200, 7)

In [17]:
# load text-summarization pipeline once again
summarization_pipeline = pipeline("summarization", model="t5-small")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Device set to use cpu


In [18]:
oe_model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased-distilled-squad")
oe_tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-distilled-squad")

config.json:   0%|          | 0.00/451 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [19]:
# load metrics once again
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
exact_match = evaluate.load("exact_match")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.79k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.56k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.38k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/5.67k [00:00<?, ?B/s]

In [None]:
# mount Google Drive again
drive.mount('/content/drive')

Mounted at /content/drive


##### Fine-tuned ALBERT model

In [None]:
# load model from Google Drive
model_name = "/content/drive/MyDrive/mc_models/albert/albert-base-v2/albert-base-v2/checkpoint-14"
model = AutoModelForMultipleChoice.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Compute results
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(model, tokenizer, oe_model, oe_tokenizer, df_test_dataset, sum_pipeline=summarization_pipeline, mc_metric=clf_metrics, oe_metric=exact_match)

tensor([[0.6188, 0.3287, 0.2924, 0.2797, 0.3464]], grad_fn=<ViewBackward0>)
Question: What department are you representing?
Context: Oh, I think I represent the Operations department. That must be it, right?
The intended answer was: Operations
The predicted answer was: R&D
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.5285, 0.4390, 0.4790, 0.5805, 0.3717]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Hmm, not sure to be honest, I've not really thought about it yet.
The intended answer was: Not sure
The predicted answer was: Over 6 months
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.6079, 0.5647, 0.5406, 0.6192, 0.6080]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well I would say I am very satisfied with the solutions currently.
The intended ans

Your max_length is set to 120, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.5163, 0.2570, 0.2256, 0.4984, 0.5471]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, well, I'm definitely unsatisfied with the current solutions in my field. It's a bit rough right now I think.
The intended answer was: Unsatisfied
The predicted answer was: Very unsatisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 113, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.6154, 0.4601, 0.4562, 0.6436, 0.4218]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Well, I think it was other, honestly. I'm not sure what else it could be.
The intended answer was: Other
The predicted answer was: Word of mouth
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.4383, 0.4129, 0.4651, 0.4917, 0.3218]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on social media. Yeah, that's how I found out about it.
The intended answer was: Social media
The predicted answer was: Word of mouth
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.5222, 0.6201, 0.5601, 0.6421, 0.4821]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Well, I'm not really sure. I suppose my main thing is something like 'other

Your max_length is set to 92, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


tensor([[0.6533, 0.5413, 0.5423, 0.6885, 0.6535]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, um, well, I'd say I'm very satisfied.
The intended answer was: Very satisfied
The predicted answer was: Unsatisfied
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.6577, 0.5745, 0.5749, 0.6794, 0.5871]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: I would say very unsatisfied, to be honest. I think things can get much better in my field.
The intended answer was: Very unsatisfied
The predicted answer was: Unsatisfied
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.7121, 0.6512, 0.6506, 0.6782, 0.6432]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh, wow, I'm really not sure, I think we hav

Your max_length is set to 193, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)


tensor([[0.4523, 0.7637, 0.3159, 0.8684, 0.3728]], grad_fn=<ViewBackward0>)


Your max_length is set to 144, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your max_length is set to 117, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 94, but your input_length is only 27. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)


tensor([[0.4189, 0.4061, 0.4112, 0.3898, 0.3771, 0.3737]],
       grad_fn=<ViewBackward0>)
Question: What language do you prefer for communication?
Context: Oh, that's tricky. I'm not really sure which languages there are, so I'd have to go with other I guess.
The intended answer was: Other
The predicted answer was: English
The intended answer in BINARY was: [0, 0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0, 0]

tensor([[0.5821, 0.3552, 0.5144, 0.5235, 0.5924]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, um, I guess an in-person visit would probably be my preference.
The intended answer was: In-person visit
The predicted answer was: No follow-up
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 112, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.4478, 0.3734, 0.4625, 0.4326, 0.5048]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, well I guess an in-person visit would be my preferred method of follow-up then. I think it's the best way to connect.
The intended answer was: In-person visit
The predicted answer was: No follow-up
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.5409, 0.5298, 0.5369, 0.5425, 0.5870]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I think I might need some training, also documentation, maybe even some technical support. If none is needed that's ok too I suppose.
The intended answer was: ['Training', 'Documentation', 'Technical support', 'None']
The predicted answer was: ['None']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.5317, 0.4833, 0.5428, 0.5746, 0

Your max_length is set to 185, but your input_length is only 51. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)


tensor([[0.6703, 0.5076, 0.4885, 0.6125, 0.5341]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Honestly, I am quite unsatisfied with them at the moment, I'd say.
The intended answer was: Unsatisfied
The predicted answer was: Very satisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]



Your max_length is set to 79, but your input_length is only 21. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=10)


tensor([[0.4586, 0.5222, 0.5137, 0.5077, 0.6163]], grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: Oh, I guess I'm in exploration then.
The intended answer was: Exploration
The predicted answer was: Not buying
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.6041, 0.5401, 0.5393, 0.4336, 0.4318]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I'd say ease of use is crucial and cost efficiency really matters. Scalability is definitely something important. Oh, and good support is key for any solution.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Scalability', 'Support']
The predicted answer was: ['Ease of use', 'Cost efficiency']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [1, 1, 0, 0, 0]

tensor([[0.3897, 0.3384, 0.4383, 0.4771, 0.3113]], grad_fn=<ViewBackward0>)
Question: W

Your max_length is set to 159, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)


tensor([[0.6321, 0.4407, 0.3365, 0.3825, 0.2355]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think it's probably either Procurement or maybe some other team does that.
The intended answer was: ['Procurement', 'Other']
The predicted answer was: ['Team leader']
The intended answer in BINARY was: [0, 0, 1, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.4907, 0.4394, 0.6712, 0.5712, 0.4454]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Oh gosh, I guess I'd need training, good documentation, and also someone to help onsite, yeah that's it.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance']
The predicted answer was: ['Technical support', 'Onsite assistance']
The intended answer in BINARY was: [1, 1, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 1, 0]

tensor([[0.4172, 0.3998, 0.4821, 0.5236, 0.4365]], grad_fn=<ViewBackward0>)
Questio

Your max_length is set to 107, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.6070, 0.5061, 0.5118, 0.6407, 0.6165]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well, I guess I'm satisfied with the current solutions, I don't really know the alternatives anyway.
The intended answer was: Satisfied
The predicted answer was: Unsatisfied
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

Question: What is your estimated budget for this project?
Context: Okay, for this project, I'm currently estimating a budget of about $11,700.
The intended answer was: $11700
The predicted answer was: $ 11, 700

tensor([[0.5687, 0.5237, 0.4555, 0.4955, 0.3858]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Well, I guess it could be the IT department or maybe Procurement, and if not them I'd say Other people.
The intended answer was: ['IT department', 'Procurement', 'Other']
The predicted answer was: ['Team leader'

Your max_length is set to 138, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.4231, 0.4819, 0.4367, 0.3415, 0.3173]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I think cost efficiency is important because we need to save money, security is definitely important to protect things, and I'd also say good support is needed for help.
The intended answer was: ['Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Cost efficiency', 'Scalability']
The intended answer in BINARY was: [0, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 1, 1, 0, 0]

tensor([[0.5020, 0.3020, 0.4927, 0.5053, 0.4613]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, hmm, if I had to pick a method of follow-up I suppose a phone call would work best for me.
The intended answer was: Phone call
The predicted answer was: In-person visit
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.7655, 0.7527, 0.7360

Your max_length is set to 132, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


tensor([[0.2843, 0.4779, 0.2978, 0.5501, 0.5702]], grad_fn=<ViewBackward0>)
tensor([[0.5348, 0.6851, 0.7978, 0.5844, 0.5250]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Honestly, I'm not entirely sure, probably something else I guess.
The intended answer was: Other
The predicted answer was: Learning about products
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.2467, 0.3095, 0.8308, 0.9809, 0.2898]], grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: I think I would prefer a partner relationship. That sounds good to me.
The intended answer was: Partner
The predicted answer was: End-user
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.6568, 0.6775, 0.6491, 0.6845, 0.6489]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Hmm, I

Your max_length is set to 125, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.4523, 0.5635, 0.5467, 0.5745, 0.3685]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Well I suppose my primary goal here is finding suppliers, yeah that sounds right to me.
The intended answer was: Finding suppliers
The predicted answer was: Market research
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]



Your max_length is set to 96, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.5391, 0.4987, 0.4659, 0.2364, 0.2206]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: I guess ease of use is important, and cost efficiency definitely matters too. Security seems key, and good support is a must have I think.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Ease of use', 'Cost efficiency', 'Scalability']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [1, 1, 1, 0, 0]

tensor([[0.5757, 0.5581, 0.5433, 0.6117, 0.4788]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Oh, hmm, I guess I heard about it some other way then, you know? Not sure exactly which, but not from a known source.
The intended answer was: Other
The predicted answer was: Word of mouth
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.6582, 0.52

Your max_length is set to 153, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)


tensor([[0.4395, 0.5204, 0.5042, 0.4450, 0.4582]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Oh wow, I guess it was just through word of mouth.
The intended answer was: Word of mouth
The predicted answer was: Email invitation
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[0.4077, 0.4186, 0.7080, 0.6419, 0.8618]], grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: I think I've already decided. I'm pretty sure I'm set on that choice.
The intended answer was: Already decided
The predicted answer was: Not buying
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.4989, 0.5587, 0.2570, 0.3655, 0.2329]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Hmm I think our IT department looks at the tech stuff. Procurement probably handles the money 

Your max_length is set to 106, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.5535, 0.5332, 0.5435, 0.5642, 0.5384]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Oh, I would like it immediately I suppose. That seems like a good time for me.
The intended answer was: Immediately
The predicted answer was: Over 6 months
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.4769, 0.4072, 0.5904, 0.5288, 0.3967]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I suppose I need technical support, that would really help.
The intended answer was: ['Technical support']
The predicted answer was: ['Technical support', 'Onsite assistance']
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 1, 0]

tensor([[0.3769, 0.5880, 0.5653, 0.5867, 0.3305]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: I guess my primary goal her

Your max_length is set to 160, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[0.5461, 0.5281]], grad_fn=<ViewBackward0>)
tensor([[0.4011, 0.4803, 0.3868, 0.4767, 0.5528]], grad_fn=<ViewBackward0>)
Question: How would you prefer to receive product updates?
Context: I guess I would prefer to receive product updates through email, that sounds easiest.
The intended answer was: Email
The predicted answer was: In-person meeting
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.5635, 0.4412, 0.4051, 0.4031, 0.2614]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think maybe the team leader, or perhaps the IT department. Procurement could also be involved, and there might be other people too.
The intended answer was: ['Team leader', 'IT department', 'Procurement', 'Other']
The predicted answer was: ['Team leader']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.3587, 0.3501, 0.4753, 0.52

Your max_length is set to 108, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.5218, 0.4511, 0.6089, 0.5279, 0.5038]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: I think I need training and documentation, and maybe some onsite assistance too, or perhaps none of them really.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance', 'None']
The predicted answer was: ['Technical support']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.4673, 0.4018]], grad_fn=<ViewBackward0>)
Question: Do you plan to implement a solution within the next 6 months?
Context: No, I don't think so. I haven't even looked at all the different options.
The intended answer was: No
The predicted answer was: Yes
The intended answer in BINARY was: [0, 1]
The predicted answer in BINARY was: [1, 0]

tensor([[0.6126, 0.4770, 0.4633, 0.4903, 0.5111]], grad_fn=<ViewBackward0>)
tensor([[0.5374, 0.5828, 0.6844, 1.0910, 0.4980]], grad_fn=<ViewBackward0>

Your max_length is set to 145, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[0.4943, 0.5102, 0.3602, 0.5512, 0.3076]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on the trade fair website.
The intended answer was: Trade fair website
The predicted answer was: Word of mouth
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.6611, 0.6321, 0.6277, 0.6467, 0.6059]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh geez, I'm not totally sure about the exact number. I think we've got somewhere around 450 employees, give or take a few.
The intended answer was: 201-1000
The predicted answer was: 1-10
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.3214, 0.3426, 0.4039, 0.7238, 0.2722]], grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: Well I'm looking for a partner relationship I thin

In [None]:
model_result = {'model_name': model_name, 'mc_metric_result': mc_metric_result, 'oe_metric_result': oe_metric_result}
print(model_result)

{'model_name': '/content/drive/MyDrive/mc_models/albert/albert-base-v2/albert-base-v2/checkpoint-14', 'mc_metric_result': {'accuracy': 0.6226415094339622, 'f1': 0.3212669683257919, 'precision': 0.355, 'recall': 0.29338842975206614}, 'oe_metric_result': {'exact_match': 0.4444444444444444}}
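
To make the multiple-choice numbers above easier to read, here is a minimal sketch of how accuracy, precision, recall and F1 can be derived from one pair of binary vectors like the ones printed in the log. It is an illustration only: the helper `binary_metrics` is hypothetical, and whether `clf_metrics` aggregates over flattened vectors or per question is not visible in this part of the notebook.

```python
def binary_metrics(intended, predicted):
    """Compute basic classification metrics for two equally long 0/1 lists.

    Hypothetical helper for illustration; not the notebook's `clf_metrics`.
    """
    pairs = list(zip(intended, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly selected options
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly skipped options
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # wrongly selected options
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed options

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "f1": f1,
            "precision": precision, "recall": recall}


# Vectors taken from the "Which features are most important in a solution?" entry
intended = [1, 1, 1, 0, 1]
predicted = [1, 1, 0, 0, 0]
result = binary_metrics(intended, predicted)
print(result)
```

In this example the model selects only correct options (precision 1.0) but misses half of the intended ones (recall 0.5), which is the pattern that pulls recall and F1 down on multi-select questions.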


In [None]:
# save model outputs to make further evaluations afterwards
with open('albert_fine_tuned_mc_results.json', 'w') as fp:
    json.dump(mc_results, fp)
with open('albert_fine_tuned_oe_results.json', 'w') as fp:
    json.dump(oe_results, fp)
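
Since the results are stored for later evaluation, the following sketch shows a save and reload round trip with `json`. The record layout and the file name `example_mc_results.json` are assumptions made for the example; the actual structure of `mc_results` is not shown in this part of the notebook. Note that `json.dump` only handles plain Python types, so tensors or NumPy arrays would have to be converted to lists first.

```python
import json
import tempfile
from pathlib import Path


def save_results(results, path):
    """Write a JSON-serializable results object to `path`."""
    with open(path, "w") as fp:
        json.dump(results, fp)


def load_results(path):
    """Read a results object previously written by `save_results`."""
    with open(path) as fp:
        return json.load(fp)


# Hypothetical record layout, filled with one entry from the log above
example = [{
    "question": "Do you plan to implement a solution within the next 6 months?",
    "intended": [0, 1],
    "predicted": [1, 0],
}]

# Use a temporary directory so the sketch does not touch the real result files
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_mc_results.json"
    save_results(example, path)
    restored = load_results(path)

print(restored == example)
```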

##### Fine-tuned BERT model

In [None]:
# load model from Google Drive
model_name = "/content/drive/MyDrive/mc_models/bert_fine_tuned/bert_fine_tuned/checkpoint-708"
model = AutoModelForMultipleChoice.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
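
In the logged output, each printed tensor holds one score per answer option, and for single-choice questions the predicted binary vector appears to mark the option with the highest score. The sketch below reproduces that mapping in plain Python. The helper `one_hot_from_scores` is hypothetical and only covers the single-choice case; the selection rule for multi-select questions is not visible in this section, so it is deliberately left out.

```python
def one_hot_from_scores(scores):
    """Return a 0/1 list with a single 1 at the position of the highest score.

    Hypothetical helper for illustration; ties resolve to the first maximum.
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    return [1 if i == best else 0 for i in range(len(scores))]


# Scores from the first "What department are you representing?" entry in the log
scores = [0.1854, 0.1959, 0.2323, 0.1938, 0.2394]
prediction = one_hot_from_scores(scores)
print(prediction)  # [0, 0, 0, 0, 1]
```

The scores of this checkpoint often lie very close together (here 0.2323 versus 0.2394), so small differences decide the predicted option.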

In [None]:
# Compute results
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(
    model,
    tokenizer,
    oe_model,
    oe_tokenizer,
    df_test_dataset,
    sum_pipeline=summarization_pipeline,
    mc_metric=clf_metrics,
    oe_metric=exact_match,
)

tensor([[0.1854, 0.1959, 0.2323, 0.1938, 0.2394]], grad_fn=<ViewBackward0>)
Question: What department are you representing?
Context: Oh, I think I represent the Operations department. That must be it, right?
The intended answer was: Operations
The predicted answer was: Other
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3029, 0.2846, 0.2836, 0.2814, 0.3145]], grad_fn=<ViewBackward0>)
tensor([[0.4758, 0.4590, 0.4414, 0.4352, 0.4415]], grad_fn=<ViewBackward0>)
tensor([[0.2776, 0.2578, 0.2657, 0.2648, 0.2728]], grad_fn=<ViewBackward0>)
Question: What is your estimated budget for this project?
Context: My estimated budget for this project is $13,500. I believe that will cover all anticipated costs effectively.
The intended answer was: $13500
The predicted answer was: 

tensor([[0.2788, 0.1920, 0.2727, 0.2513, 0.2334, 0.2321]],
       grad_fn=<ViewBackward0>)
Question: What language do you prefer for communication?
Context

Your max_length is set to 120, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.3761, 0.3601, 0.3474, 0.4177, 0.4422]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, well, I'm definitely unsatisfied with the current solutions in my field. It's a bit rough right now I think.
The intended answer was: Unsatisfied
The predicted answer was: Very unsatisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 113, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.2778, 0.3289, 0.2961, 0.2971, 0.2571]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Well, I think it was other, honestly. I'm not sure what else it could be.
The intended answer was: Other
The predicted answer was: Email invitation
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[0.2632, 0.2103, 0.2197, 0.2278, 0.2087]], grad_fn=<ViewBackward0>)
tensor([[0.2863, 0.2815, 0.2853, 0.2778, 0.2828]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Well, I'm not really sure. I suppose my main thing is something like 'other'. That sounds about right.
The intended answer was: Other
The predicted answer was: Networking
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.2100, 0.2220, 0.2254, 0.2351, 0.2262]], grad_fn=<ViewBackward0>)
Question: What support resources do you 

Your max_length is set to 92, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


tensor([[0.2643, 0.2699, 0.2731, 0.3100, 0.2860]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, um, well, I'd say I'm very satisfied.
The intended answer was: Very satisfied
The predicted answer was: Unsatisfied
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.3436, 0.3131, 0.3220, 0.4124, 0.4235]], grad_fn=<ViewBackward0>)
tensor([[0.2222, 0.2406, 0.2189, 0.2169, 0.2330]], grad_fn=<ViewBackward0>)
Question: When do you expect to finalize your decision?
Context: How about we circle back around January 23rd, 2025? I should have a final decision by then.
The intended answer was: 2025-01-23
The predicted answer was: by then

tensor([[0.2980, 0.2931, 0.2752, 0.2785, 0.3048]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh gosh I'm not really sure we have like, maybe 5 employees I think, it's pretty small.
The intende

Your max_length is set to 193, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)


tensor([[0.4477, 0.4399, 0.4635, 0.4785, 0.4180]], grad_fn=<ViewBackward0>)


Your max_length is set to 144, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your max_length is set to 117, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 94, but your input_length is only 27. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)


tensor([[0.3173, 0.3149, 0.3187, 0.3206, 0.3274, 0.3413]],
       grad_fn=<ViewBackward0>)
tensor([[0.2781, 0.2804, 0.2816, 0.2762, 0.2985]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, um, I guess an in-person visit would probably be my preference.
The intended answer was: In-person visit
The predicted answer was: No follow-up
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 112, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.2693, 0.3104, 0.2947, 0.4113, 0.2786]], grad_fn=<ViewBackward0>)
tensor([[0.2712, 0.2853, 0.3103, 0.2271, 0.2036]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I think I might need some training, also documentation, maybe even some technical support. If none is needed that's ok too I suppose.
The intended answer was: ['Training', 'Documentation', 'Technical support', 'None']
The predicted answer was: ['Documentation', 'Technical support']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 1, 1, 0, 0]

tensor([[0.2796, 0.2673, 0.2562, 0.2433, 0.2810]], grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: Oh, I'm not really sure, maybe I'm at the evaluation stage. That seems right for me now.
The intended answer was: Evaluation
The predicted answer was: Not buying
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY 

Your max_length is set to 185, but your input_length is only 51. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)


tensor([[0.3049, 0.3076, 0.3198, 0.3635, 0.3797]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Honestly, I am quite unsatisfied with them at the moment, I'd say.
The intended answer was: Unsatisfied
The predicted answer was: Very unsatisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 79, but your input_length is only 21. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=10)


tensor([[0.2490, 0.2467, 0.2311, 0.2176, 0.2269]], grad_fn=<ViewBackward0>)
tensor([[0.3315, 0.3594, 0.3777, 0.3285, 0.3190]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I'd say ease of use is crucial and cost efficiency really matters. Scalability is definitely something important. Oh, and good support is key for any solution.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Scalability', 'Support']
The predicted answer was: ['Cost efficiency', 'Scalability']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 1, 1, 0, 0]

tensor([[0.2536, 0.2356, 0.2561, 0.2932, 0.2410]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Oh wow, for implementation I'd definitely need training. Maybe also some technical support. And probably onsite assistance would help a lot.
The intended answer was: ['Training', 'Technical support', 'Onsite assis

Your max_length is set to 159, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)


tensor([[0.2715, 0.2946, 0.2766, 0.3148, 0.3163]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think it's probably either Procurement or maybe some other team does that.
The intended answer was: ['Procurement', 'Other']
The predicted answer was: ['CEO', 'Other']
The intended answer in BINARY was: [0, 0, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 1]

tensor([[0.1956, 0.2054, 0.1976, 0.2026, 0.2018]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Oh gosh, I guess I'd need training, good documentation, and also someone to help onsite, yeah that's it.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance']
The predicted answer was: ['Documentation', 'Onsite assistance']
The intended answer in BINARY was: [1, 1, 0, 1, 0]
The predicted answer in BINARY was: [0, 1, 0, 1, 0]

tensor([[0.0874, 0.1398, 0.1554, 0.1578, 0.1104]], grad_fn=<ViewBackward0>)
Question: 

Your max_length is set to 107, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.3634, 0.3649, 0.3714, 0.3860, 0.3768]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well, I guess I'm satisfied with the current solutions, I don't really know the alternatives anyway.
The intended answer was: Satisfied
The predicted answer was: Unsatisfied
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

Question: What is your estimated budget for this project?
Context: Okay, for this project, I'm currently estimating a budget of about $11,700.
The intended answer was: $11700
The predicted answer was: $ 11, 700

tensor([[0.2861, 0.2565, 0.2846, 0.2955, 0.2738]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Well, I guess it could be the IT department or maybe Procurement, and if not them I'd say Other people.
The intended answer was: ['IT department', 'Procurement', 'Other']
The predicted answer was: ['Team leader'

Your max_length is set to 138, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.2994, 0.3084, 0.3165, 0.3151, 0.2977]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I think cost efficiency is important because we need to save money, security is definitely important to protect things, and I'd also say good support is needed for help.
The intended answer was: ['Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Scalability', 'Security']
The intended answer in BINARY was: [0, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 0, 1, 1, 0]

tensor([[0.2863, 0.2424, 0.2098, 0.2191, 0.2354]], grad_fn=<ViewBackward0>)
tensor([[0.3369, 0.3138, 0.3381, 0.3033, 0.4098]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh, I'm not entirely sure of the exact number but I'd guess we've probably got somewhere between 300 to 700 employees maybe.
The intended answer was: 201-1000
The predicted answer was: 1000+
The intended answer in BINARY was: [0, 0, 0, 1, 0

Your max_length is set to 132, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


tensor([[0.3459, 0.3187, 0.3839, 0.4240, 0.4482]], grad_fn=<ViewBackward0>)
tensor([[0.2995, 0.2915, 0.3064, 0.3090, 0.3182]], grad_fn=<ViewBackward0>)
tensor([[0.2700, 0.2730, 0.2345, 0.2568, 0.2603]], grad_fn=<ViewBackward0>)
tensor([[0.2208, 0.3464, 0.3252, 0.3594, 0.3207]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Hmm, I'd say probably somewhere around 2 months, give or take.
The intended answer was: 1-3 months
The predicted answer was: Over 6 months
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.2423, 0.2386, 0.2634, 0.2482, 0.3018]], grad_fn=<ViewBackward0>)
Question: What department are you representing?
Context: Oh, um, I guess I represent Operations. I don't really know all the options, to be honest.
The intended answer was: Operations
The predicted answer was: Other
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1

Your max_length is set to 125, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.2926, 0.4033, 0.2940, 0.2482, 0.2609]], grad_fn=<ViewBackward0>)


Your max_length is set to 96, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.2871, 0.2927, 0.3072, 0.3072, 0.2914]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: I guess ease of use is important, and cost efficiency definitely matters too. Security seems key, and good support is a must have I think.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Scalability', 'Security']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 0, 1, 1, 0]

tensor([[0.2599, 0.2561, 0.2665, 0.2659, 0.2728]], grad_fn=<ViewBackward0>)
tensor([[0.3207, 0.3052, 0.3247, 0.3440, 0.3445]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well, I guess I'm satisfied, I'm not really sure what the other options are though.
The intended answer was: Satisfied
The predicted answer was: Very unsatisfied
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINA

Your max_length is set to 153, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)


tensor([[0.2030, 0.1971, 0.2170, 0.2390, 0.1834]], grad_fn=<ViewBackward0>)
tensor([[0.2759, 0.2586, 0.2351, 0.2542, 0.2791]], grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: I think I've already decided. I'm pretty sure I'm set on that choice.
The intended answer was: Already decided
The predicted answer was: Not buying
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.2541, 0.2366, 0.2462, 0.2430, 0.2659]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Hmm I think our IT department looks at the tech stuff. Procurement probably handles the money part and maybe the CEO gets the final say sometimes I'm not really sure.
The intended answer was: ['IT department', 'Procurement', 'CEO']
The predicted answer was: ['Team leader', 'Other']
The intended answer in BINARY was: [0, 1, 1, 1, 0]
The predicted answer in BINARY was: [1, 0, 0, 0, 1]

tensor([

Your max_length is set to 106, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.3019, 0.2851, 0.2643, 0.2826, 0.3070]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Oh, I would like it immediately I suppose. That seems like a good time for me.
The intended answer was: Immediately
The predicted answer was: Not sure
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.2565, 0.2630, 0.2713, 0.2202, 0.2134]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I suppose I need technical support, that would really help.
The intended answer was: ['Technical support']
The predicted answer was: ['Training', 'Documentation', 'Technical support']
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [1, 1, 1, 0, 0]

tensor([[0.3351, 0.3336, 0.3649, 0.3750, 0.3232]], grad_fn=<ViewBackward0>)
tensor([[0.4647, 0.4556, 0.4690, 0.4762, 0.4391]], grad_fn=<ViewBackward0>)
Question: How

Your max_length is set to 160, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[0.2285, 0.2273]], grad_fn=<ViewBackward0>)
tensor([[0.4274, 0.4151, 0.4266, 0.4331, 0.3795]], grad_fn=<ViewBackward0>)
Question: How would you prefer to receive product updates?
Context: I guess I would prefer to receive product updates through email, that sounds easiest.
The intended answer was: Email
The predicted answer was: Social media
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.2264, 0.2214, 0.2457, 0.2749, 0.2782]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think maybe the team leader, or perhaps the IT department. Procurement could also be involved, and there might be other people too.
The intended answer was: ['Team leader', 'IT department', 'Procurement', 'Other']
The predicted answer was: ['CEO', 'Other']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 1]

tensor([[0.2618, 0.2709, 0.2526, 0.2621, 

Your max_length is set to 108, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.1740, 0.1960, 0.1855, 0.1910, 0.2009]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: I think I need training and documentation, and maybe some onsite assistance too, or perhaps none of them really.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance', 'None']
The predicted answer was: ['Documentation', 'None']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 1, 0, 0, 1]

tensor([[0.3355, 0.3036]], grad_fn=<ViewBackward0>)
Question: Do you plan to implement a solution within the next 6 months?
Context: No, I don't think so. I haven't even looked at all the different options.
The intended answer was: No
The predicted answer was: Yes
The intended answer in BINARY was: [0, 1]
The predicted answer in BINARY was: [1, 0]

tensor([[0.3031, 0.2690, 0.3017, 0.2892, 0.3253]], grad_fn=<ViewBackward0>)
Question: What department are you representing?
Context: Oh gee, I'm n

Your max_length is set to 145, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[0.1403, 0.2077, 0.2153, 0.2247, 0.1812]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on the trade fair website.
The intended answer was: Trade fair website
The predicted answer was: Word of mouth
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.3277, 0.3055, 0.3529, 0.2976, 0.4324]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh geez, I'm not totally sure about the exact number. I think we've got somewhere around 450 employees, give or take a few.
The intended answer was: 201-1000
The predicted answer was: 1000+
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.1968, 0.1584, 0.1497, 0.2092, 0.1859]], grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: Well I'm looking for a partner relationship I thi

In [None]:
# Collect the metric results of this run together with the model name
model_result = {'model_name': model_name, 'mc_metric_result': mc_metric_result, 'oe_metric_result': oe_metric_result}
print(model_result)

{'model_name': '/content/drive/MyDrive/mc_models/bert_fine_tuned/bert_fine_tuned/checkpoint-708', 'mc_metric_result': {'accuracy': 0.6842767295597484, 'f1': 0.425629290617849, 'precision': 0.47692307692307695, 'recall': 0.384297520661157}, 'oe_metric_result': {'exact_match': 0.4444444444444444}}
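
As a quick sanity check on the printed metrics, F1 should be the harmonic mean of precision and recall. The sketch below recomputes it from the two values reported for the fine-tuned BERT checkpoint. The helper name `f1_from_precision_recall` is hypothetical and not part of this notebook.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the printed result of the fine-tuned BERT checkpoint.
reported = {
    "precision": 0.47692307692307695,
    "recall": 0.384297520661157,
    "f1": 0.425629290617849,
}

recomputed_f1 = f1_from_precision_recall(reported["precision"], reported["recall"])
print(f"reported f1={reported['f1']:.6f}, recomputed f1={recomputed_f1:.6f}")
```

Accuracy sits well above F1 here. That is what one would expect if accuracy is computed over every position of the binary answer vectors, since correctly predicted zeros count towards accuracy but not towards F1.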


In [None]:
# Save the model outputs so that further evaluations can be run on them later
with open('bert_fine_tuned_mc_results.json', 'w') as fp:
    json.dump(mc_results, fp)
with open('bert_fine_tuned_oe_results.json', 'w') as fp:
    json.dump(oe_results, fp)
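
For the later evaluations, the saved files have to be read back in. The following is a minimal sketch of that round trip. The structure of `example_results` is a hypothetical stand-in, since the real layout of `mc_results` is not shown in this section, and a temporary file is used so that no real result file is touched.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the notebook's mc_results (real structure not shown here).
example_results = [
    {
        "question": "How soon are you looking for a solution?",
        "intended": [1, 0, 0, 0, 0],
        "predicted": [0, 0, 0, 0, 1],
    }
]

# Write to a temporary directory so the sketch does not overwrite real result files.
path = os.path.join(tempfile.mkdtemp(), "example_mc_results.json")
with open(path, "w") as fp:
    json.dump(example_results, fp)

# Later, e.g. in a fresh session: read the saved outputs back for further evaluation.
with open(path) as fp:
    reloaded_results = json.load(fp)

print(reloaded_results == example_results)
```

Note that `json.dump` only accepts JSON-serializable values, so tensors or NumPy arrays would have to be converted to plain lists before saving.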

##### ALBERT model

In [21]:
# Load the base ALBERT model and its tokenizer again; force_download=True bypasses the local cache
model_name = "albert/albert-base-v2"
albert_model = AutoModelForMultipleChoice.from_pretrained(model_name, force_download=True)
albert_tokenizer = AutoTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

Some weights of AlbertForMultipleChoice were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
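
The warning above means that the multiple-choice classifier head starts from random weights, so the scores printed in the run that follows are not yet meaningful. For the single-select questions, the printed tensors and binary vectors are consistent with simply picking the highest-scoring option. The sketch below illustrates that reading. It is an assumption about the decoding step, not the notebook's actual `model_output` implementation, and `one_hot_argmax` is a hypothetical helper.

```python
def one_hot_argmax(scores: list[float]) -> list[int]:
    """Mark the highest-scoring option with 1 and all other options with 0."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return [1 if i == best else 0 for i in range(len(scores))]


# Scores copied from one printed single-select example of the ALBERT run
# ("How soon are you looking for a solution?", predicted "Immediately").
scores = [0.4874, 0.0600, 0.0918, 0.2131, 0.0905]
print(one_hot_argmax(scores))
```

Multi-select questions clearly use a different rule, because several options can be predicted at once. That rule is not reconstructed here.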


In [22]:
# Compute multiple-choice and open-ended results on the test set
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(
    albert_model, albert_tokenizer, oe_model, oe_tokenizer, df_test_dataset,
    sum_pipeline=summarization_pipeline, mc_metric=clf_metrics, oe_metric=exact_match)

tensor([[0.0887, 0.1092, 0.0570, 0.0169, 0.1703]], grad_fn=<ViewBackward0>)
Question: What department are you representing?
Context: Oh, I think I represent the Operations department. That must be it, right?
The intended answer was: Operations
The predicted answer was: Other
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.4874, 0.0600, 0.0918, 0.2131, 0.0905]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Hmm, not sure to be honest, I've not really thought about it yet.
The intended answer was: Not sure
The predicted answer was: Immediately
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.4231, 0.3807, 0.3991, 0.3891, 0.4249]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well I would say I am very satisfied with the solutions currently.
The intended ans

Your max_length is set to 120, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[ 0.2270, -0.0486, -0.0037,  0.3037,  0.3831]],
       grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, well, I'm definitely unsatisfied with the current solutions in my field. It's a bit rough right now I think.
The intended answer was: Unsatisfied
The predicted answer was: Very unsatisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 113, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.1734, 0.0578, 0.1425, 0.2016, 0.2026]], grad_fn=<ViewBackward0>)
tensor([[0.1154, 0.0738, 0.2484, 0.0746, 0.1543]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on social media. Yeah, that's how I found out about it.
The intended answer was: Social media
The predicted answer was: Trade fair website
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[ 0.3958,  0.0643, -0.0328,  0.4191,  0.3954]],
       grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Well, I'm not really sure. I suppose my main thing is something like 'other'. That sounds about right.
The intended answer was: Other
The predicted answer was: Market research
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.4185, 0.3160, 0.2789, 0.2316, 0.3904]], grad_fn=<ViewBackward0>)
Question: What 

Your max_length is set to 92, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


tensor([[0.3891, 0.3973, 0.4257, 0.4535, 0.4619]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, um, well, I'd say I'm very satisfied.
The intended answer was: Very satisfied
The predicted answer was: Very unsatisfied
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.4559, 0.3714, 0.3846, 0.3926, 0.4535]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: I would say very unsatisfied, to be honest. I think things can get much better in my field.
The intended answer was: Very unsatisfied
The predicted answer was: Very satisfied
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.3941, 0.3819, 0.4124, 0.4082, 0.3554]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh, wow, I'm really not sure, I thin

Your max_length is set to 193, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)


tensor([[ 0.1311,  0.3939,  0.0556,  0.5217, -0.0150]],
       grad_fn=<ViewBackward0>)


Your max_length is set to 144, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your max_length is set to 117, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 94, but your input_length is only 27. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)


tensor([[0.3326, 0.3205, 0.3163, 0.2855, 0.2757, 0.3022]],
       grad_fn=<ViewBackward0>)
Question: What language do you prefer for communication?
Context: Oh, that's tricky. I'm not really sure which languages there are, so I'd have to go with other I guess.
The intended answer was: Other
The predicted answer was: English
The intended answer in BINARY was: [0, 0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0, 0]

tensor([[0.4251, 0.2065, 0.4197, 0.4436, 0.2871]], grad_fn=<ViewBackward0>)


Your max_length is set to 112, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[ 0.0516, -0.1578,  0.1972,  0.1543,  0.1527]],
       grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, well I guess an in-person visit would be my preferred method of follow-up then. I think it's the best way to connect.
The intended answer was: In-person visit
The predicted answer was: Video meeting
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.4107, 0.3914, 0.3765, 0.4464, 0.4295]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I think I might need some training, also documentation, maybe even some technical support. If none is needed that's ok too I suppose.
The intended answer was: ['Training', 'Documentation', 'Technical support', 'None']
The predicted answer was: ['Onsite assistance', 'None']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 1]

tensor([

Your max_length is set to 185, but your input_length is only 51. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)


tensor([[0.4820, 0.2355, 0.2496, 0.5180, 0.5409]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Honestly, I am quite unsatisfied with them at the moment, I'd say.
The intended answer was: Unsatisfied
The predicted answer was: Very unsatisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 79, but your input_length is only 21. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=10)


tensor([[ 0.0463,  0.0331, -0.0344,  0.1089,  0.1141]],
       grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: Oh, I guess I'm in exploration then.
The intended answer was: Exploration
The predicted answer was: Not buying
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3729, 0.4418, 0.4564, 0.3286, 0.3447]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I'd say ease of use is crucial and cost efficiency really matters. Scalability is definitely something important. Oh, and good support is key for any solution.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Scalability', 'Support']
The predicted answer was: ['Cost efficiency', 'Scalability']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 1, 1, 0, 0]

tensor([[0.1703, 0.1600, 0.0438, 0.0488, 0.1545]], grad_fn=<ViewBackward0>)

Your max_length is set to 159, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)


tensor([[0.4695, 0.3169, 0.1971, 0.2577, 0.0584]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think it's probably either Procurement or maybe some other team does that.
The intended answer was: ['Procurement', 'Other']
The predicted answer was: ['Team leader']
The intended answer in BINARY was: [0, 0, 1, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.1718, 0.1243, 0.4166, 0.4910, 0.1734]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Oh gosh, I guess I'd need training, good documentation, and also someone to help onsite, yeah that's it.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance']
The predicted answer was: ['Technical support', 'Onsite assistance']
The intended answer in BINARY was: [1, 1, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 1, 0]

tensor([[0.1026, 0.1014, 0.1883, 0.1811, 0.1899]], grad_fn=<ViewBackward0>)
Questio

Your max_length is set to 107, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.4366, 0.4007, 0.4039, 0.4042, 0.4301]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well, I guess I'm satisfied with the current solutions, I don't really know the alternatives anyway.
The intended answer was: Satisfied
The predicted answer was: Very satisfied
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

Question: What is your estimated budget for this project?
Context: Okay, for this project, I'm currently estimating a budget of about $11,700.
The intended answer was: $11700
The predicted answer was: $ 11, 700

tensor([[0.4319, 0.5200, 0.4956, 0.4959, 0.4684]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Well, I guess it could be the IT department or maybe Procurement, and if not them I'd say Other people.
The intended answer was: ['IT department', 'Procurement', 'Other']
The predicted answer was: ['IT depart

Your max_length is set to 138, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.2680, 0.5511, 0.3759, 0.2454, 0.2207]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I think cost efficiency is important because we need to save money, security is definitely important to protect things, and I'd also say good support is needed for help.
The intended answer was: ['Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Cost efficiency']
The intended answer in BINARY was: [0, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[0.2478, 0.2111, 0.3317, 0.4195, 0.1032]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, hmm, if I had to pick a method of follow-up I suppose a phone call would work best for me.
The intended answer was: Phone call
The predicted answer was: In-person visit
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.3656, 0.3586, 0.3595, 0.3665, 0.419

Your max_length is set to 132, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


tensor([[-0.0060,  0.0794,  0.1136,  0.3090,  0.4434]],
       grad_fn=<ViewBackward0>)
tensor([[0.3556, 0.2487, 0.1514, 0.1969, 0.3494]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Honestly, I'm not entirely sure, probably something else I guess.
The intended answer was: Other
The predicted answer was: Networking
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[-0.0472, -0.2007,  0.4707,  0.9367, -0.2318]],
       grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: I think I would prefer a partner relationship. That sounds good to me.
The intended answer was: Partner
The predicted answer was: End-user
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[ 0.4176,  0.1854,  0.1844, -0.0045,  0.4235]],
       grad_fn=<ViewBackward0>)
Question: How soon are you looking for a so

Your max_length is set to 125, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.3328, 0.2447, 0.2555, 0.5161, 0.2508]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Well I suppose my primary goal here is finding suppliers, yeah that sounds right to me.
The intended answer was: Finding suppliers
The predicted answer was: Market research
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]



Your max_length is set to 96, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[ 0.4154,  0.3669,  0.2636, -0.0328, -0.0137]],
       grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: I guess ease of use is important, and cost efficiency definitely matters too. Security seems key, and good support is a must have I think.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Ease of use', 'Cost efficiency']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [1, 1, 0, 0, 0]

tensor([[0.2125, 0.3446, 0.3534, 0.1987, 0.3653]], grad_fn=<ViewBackward0>)
tensor([[0.4622, 0.4052, 0.4311, 0.4599, 0.4948]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well, I guess I'm satisfied, I'm not really sure what the other options are though.
The intended answer was: Satisfied
The predicted answer was: Very unsatisfied
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predi

Your max_length is set to 153, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)


tensor([[0.1674, 0.4286, 0.2983, 0.1120, 0.3501]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Oh wow, I guess it was just through word of mouth.
The intended answer was: Word of mouth
The predicted answer was: Email invitation
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[-0.1757, -0.1192,  0.0323, -0.0423,  0.1210]],
       grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: I think I've already decided. I'm pretty sure I'm set on that choice.
The intended answer was: Already decided
The predicted answer was: Not buying
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.1602, 0.3942, 0.1494, 0.2668, 0.1382]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Hmm I think our IT department looks at the tech stuff. Procurement probably handle

Your max_length is set to 106, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[ 0.3479,  0.0369,  0.0769, -0.1112,  0.2798]],
       grad_fn=<ViewBackward0>)
tensor([[0.3219, 0.2259, 0.4061, 0.4781, 0.2797]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I suppose I need technical support, that would really help.
The intended answer was: ['Technical support']
The predicted answer was: ['Technical support', 'Onsite assistance']
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 1, 0]

tensor([[0.0833, 0.1530, 0.1640, 0.2969, 0.0388]], grad_fn=<ViewBackward0>)
tensor([[ 0.1565, -0.0567,  0.1759,  0.0469,  0.3226]],
       grad_fn=<ViewBackward0>)
Question: How would you prefer to receive product updates?
Context: Oh, um, I guess I would prefer email for product updates. That seems like the most convenient way for me to get them.
The intended answer was: Email
The predicted answer was: In-person meeting
The intended answer in BINARY was: [1, 0, 0, 0, 0]


Your max_length is set to 160, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[0.2817, 0.2901]], grad_fn=<ViewBackward0>)
Question: Do you plan to implement a solution within the next 6 months?
Context: Oh, you're asking about my plans. Well, between yes and no, I'd have to say yes.
The intended answer was: Yes
The predicted answer was: No
The intended answer in BINARY was: [1, 0]
The predicted answer in BINARY was: [0, 1]

tensor([[0.1852, 0.0930, 0.3502, 0.1596, 0.4850]], grad_fn=<ViewBackward0>)
Question: How would you prefer to receive product updates?
Context: I guess I would prefer to receive product updates through email, that sounds easiest.
The intended answer was: Email
The predicted answer was: In-person meeting
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3751, 0.3136, 0.3484, 0.3445, 0.2305]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think maybe the team leader, or perhaps the IT department. Procurement could also be involv

Your max_length is set to 108, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.3057, 0.2360, 0.4881, 0.3440, 0.3075]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: I think I need training and documentation, and maybe some onsite assistance too, or perhaps none of them really.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance', 'None']
The predicted answer was: ['Technical support']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.3049, 0.0848]], grad_fn=<ViewBackward0>)
Question: Do you plan to implement a solution within the next 6 months?
Context: No, I don't think so. I haven't even looked at all the different options.
The intended answer was: No
The predicted answer was: Yes
The intended answer in BINARY was: [0, 1]
The predicted answer in BINARY was: [1, 0]

tensor([[0.4456, 0.3257, 0.3093, 0.2671, 0.3314]], grad_fn=<ViewBackward0>)
tensor([[0.2153, 0.2149, 0.3339, 0.6133, 0.1326]], grad_fn=<ViewBackward0>

Your max_length is set to 145, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[ 0.1461, -0.0011, -0.0963,  0.1118, -0.0640]],
       grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on the trade fair website.
The intended answer was: Trade fair website
The predicted answer was: Social media
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.2942, 0.2974, 0.3178, 0.3137, 0.3533]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh geez, I'm not totally sure about the exact number. I think we've got somewhere around 450 employees, give or take a few.
The intended answer was: 201-1000
The predicted answer was: 1000+
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.1739, 0.0889, 0.4822, 0.8223, 0.0897]], grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: Well I'm looking for a partner relatio

In [23]:
# Collect the metric results of this run together with the model name
model_result = {'model_name': model_name, 'mc_metric_result': mc_metric_result, 'oe_metric_result': oe_metric_result}
print(model_result)

{'model_name': 'albert/albert-base-v2', 'mc_metric_result': {'accuracy': 0.6125786163522012, 'f1': 0.3031674208144796, 'precision': 0.335, 'recall': 0.2768595041322314}, 'oe_metric_result': {'exact_match': 0.4444444444444444}}
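
To make the two printed result dictionaries easier to compare, the sketch below puts the multiple-choice metrics of the fine-tuned BERT checkpoint and of the base ALBERT model side by side. The numbers are copied from the outputs above; the short model labels and the `deltas` dictionary are introduced here for illustration only.

```python
# Metric values copied from the two printed result dictionaries.
results = {
    "bert_fine_tuned": {
        "accuracy": 0.6842767295597484,
        "f1": 0.425629290617849,
        "precision": 0.47692307692307695,
        "recall": 0.384297520661157,
    },
    "albert-base-v2": {
        "accuracy": 0.6125786163522012,
        "f1": 0.3031674208144796,
        "precision": 0.335,
        "recall": 0.2768595041322314,
    },
}

metrics = ["accuracy", "f1", "precision", "recall"]

# Print a small comparison table.
print(f"{'model':<18}" + "".join(f"{m:>11}" for m in metrics))
for name, values in results.items():
    print(f"{name:<18}" + "".join(f"{values[m]:>11.3f}" for m in metrics))

# Difference per metric (fine-tuned BERT minus base ALBERT).
deltas = {m: results["bert_fine_tuned"][m] - results["albert-base-v2"][m] for m in metrics}
print(deltas)
```

The `exact_match` score is identical in both results (0.444). That would be expected if both runs use the same `oe_model` for the open-ended questions, as the ALBERT call does.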


In [24]:
# Save the model outputs so that further evaluations can be run on them later
with open('albert_mc_results.json', 'w') as fp:
    json.dump(mc_results, fp)
with open('albert_oe_results.json', 'w') as fp:
    json.dump(oe_results, fp)
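
One open-ended example in the ALBERT run shows why the exact-match score can be pessimistic: the intended answer is `$11700` and the predicted answer is `$ 11, 700`, which is the same amount but a different string. The sketch below shows one possible normalization step before comparing. `normalize_amount` is a hypothetical helper and not something this notebook applies.

```python
import re


def normalize_amount(text: str) -> str:
    """Remove whitespace and commas so that '$ 11, 700' becomes '$11700'."""
    return re.sub(r"[\s,]", "", text)


# Strings copied from the printed budget example.
intended = "$11700"
predicted = "$ 11, 700"

print(predicted == intended)
print(normalize_amount(predicted) == normalize_amount(intended))
```

Stripping commas is only safe for amounts like this one. Free-text answers would need a more careful rule.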

##### BERT model

In [25]:
# Load the base BERT model (bert-base-cased) and its tokenizer again
model_name = "bert-base-cased"
bert_model = AutoModelForMultipleChoice.from_pretrained(model_name)
bert_tokenizer = AutoTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [26]:
# Compute multiple-choice and open-ended results on the test set
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(
    bert_model, bert_tokenizer, oe_model, oe_tokenizer, df_test_dataset,
    sum_pipeline=summarization_pipeline, mc_metric=clf_metrics, oe_metric=exact_match)

tensor([[0.1235, 0.1190, 0.1131, 0.1290, 0.1189]], grad_fn=<ViewBackward0>)
tensor([[0.1878, 0.1890, 0.1787, 0.1795, 0.1466]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Hmm, not sure to be honest, I've not really thought about it yet.
The intended answer was: Not sure
The predicted answer was: 1-3 months
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[0.1367, 0.1370, 0.1316, 0.1338, 0.1357]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well I would say I am very satisfied with the solutions currently.
The intended answer was: Very satisfied
The predicted answer was: Satisfied
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[0.0899, 0.1701, 0.1332, 0.1045, 0.1265]], grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: Oh g

Your max_length is set to 120, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.1331, 0.1338, 0.1304, 0.1352, 0.1374]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, well, I'm definitely unsatisfied with the current solutions in my field. It's a bit rough right now I think.
The intended answer was: Unsatisfied
The predicted answer was: Very unsatisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 113, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.4441, 0.4128, 0.4468, 0.3862, 0.3098]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Well, I think it was other, honestly. I'm not sure what else it could be.
The intended answer was: Other
The predicted answer was: Trade fair website
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.1497, 0.2462, 0.2316, 0.2393, 0.2348]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on social media. Yeah, that's how I found out about it.
The intended answer was: Social media
The predicted answer was: Email invitation
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[0.2787, 0.3113, 0.2617, 0.2520, 0.2822]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Well, I'm not really sure. I suppose my main thing is something lik

Your max_length is set to 92, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


tensor([[0.1238, 0.1069, 0.1779, 0.1611, 0.1044]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, um, well, I'd say I'm very satisfied.
The intended answer was: Very satisfied
The predicted answer was: Neutral
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.1114, 0.1083, 0.1018, 0.1414, 0.1450]], grad_fn=<ViewBackward0>)
tensor([[0.1663, 0.1302, 0.1330, 0.1761, 0.1191]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh, wow, I'm really not sure, I think we have maybe 28 people working here right now, it's somewhere between 11 and 50, so yeah, 28 seems right.
The intended answer was: 11-50
The predicted answer was: 201-1000
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

Question: When do you expect to finalize your decision?
Context: How about we circle back ar

Your max_length is set to 193, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)


tensor([[0.1337, 0.1358, 0.1350, 0.1340, 0.1363]], grad_fn=<ViewBackward0>)
Question: How would you prefer to receive product updates?
Context: Oh, hmm, product updates? I guess I'd prefer to get them through social media.
The intended answer was: Social media
The predicted answer was: In-person meeting
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 144, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your max_length is set to 117, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 94, but your input_length is only 27. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)


tensor([[0.1238, 0.1355, 0.1596, 0.1620, 0.1624, 0.1297]],
       grad_fn=<ViewBackward0>)
Question: What language do you prefer for communication?
Context: Oh, that's tricky. I'm not really sure which languages there are, so I'd have to go with other I guess.
The intended answer was: Other
The predicted answer was: Italian
The intended answer in BINARY was: [0, 0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 0, 1, 0]

tensor([[0.2255, 0.2169, 0.2122, 0.1259, 0.2281]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, um, I guess an in-person visit would probably be my preference.
The intended answer was: In-person visit
The predicted answer was: No follow-up
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 112, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.1275, 0.1198, 0.1298, 0.1405, 0.1262]], grad_fn=<ViewBackward0>)
tensor([[0.0927, 0.0995, 0.0964, 0.1279, 0.1381]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I think I might need some training, also documentation, maybe even some technical support. If none is needed that's ok too I suppose.
The intended answer was: ['Training', 'Documentation', 'Technical support', 'None']
The predicted answer was: ['Onsite assistance', 'None']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 1]

tensor([[0.1884, 0.0867, 0.1062, 0.1221, 0.1418]], grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: Oh, I'm not really sure, maybe I'm at the evaluation stage. That seems right for me now.
The intended answer was: Evaluation
The predicted answer was: Exploration
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [1,

Your max_length is set to 185, but your input_length is only 51. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)


tensor([[0.1844, 0.2171, 0.2134, 0.1197, 0.1186]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Honestly, I am quite unsatisfied with them at the moment, I'd say.
The intended answer was: Unsatisfied
The predicted answer was: Satisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]



Your max_length is set to 79, but your input_length is only 21. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=10)


tensor([[0.1636, 0.2025, 0.2078, 0.1780, 0.2026]], grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: Oh, I guess I'm in exploration then.
The intended answer was: Exploration
The predicted answer was: Decision-making
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.1014, 0.0964, 0.1291, 0.1106, 0.0864]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I'd say ease of use is crucial and cost efficiency really matters. Scalability is definitely something important. Oh, and good support is key for any solution.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Scalability', 'Support']
The predicted answer was: ['Scalability']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.1213, 0.0842, 0.1261, 0.1398, 0.1133]], grad_fn=<ViewBackward0>)
Question: What support re

Your max_length is set to 159, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)


tensor([[0.1466, 0.1994, 0.1150, 0.2472, 0.2343]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think it's probably either Procurement or maybe some other team does that.
The intended answer was: ['Procurement', 'Other']
The predicted answer was: ['CEO', 'Other']
The intended answer in BINARY was: [0, 0, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 1]

tensor([[0.1368, 0.2003, 0.2152, 0.1097, 0.2560]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Oh gosh, I guess I'd need training, good documentation, and also someone to help onsite, yeah that's it.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance']
The predicted answer was: ['Technical support', 'None']
The intended answer in BINARY was: [1, 1, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 1]

tensor([[0.2441, 0.1242, 0.2343, 0.2322, 0.2316]], grad_fn=<ViewBackward0>)
Question: How did y

Your max_length is set to 107, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.1307, 0.1323, 0.1267, 0.1313, 0.1322]], grad_fn=<ViewBackward0>)
Question: What is your estimated budget for this project?
Context: Okay, for this project, I'm currently estimating a budget of about $11,700.
The intended answer was: $11700
The predicted answer was: $ 11, 700

tensor([[0.2366, 0.0950, 0.1036, 0.2429, 0.2000]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Well, I guess it could be the IT department or maybe Procurement, and if not them I'd say Other people.
The intended answer was: ['IT department', 'Procurement', 'Other']
The predicted answer was: ['Team leader', 'CEO']
The intended answer in BINARY was: [0, 1, 1, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 1, 0]

tensor([[0.1200, 0.1173, 0.1200, 0.1115, 0.1026]], grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: Oh, um, I guess I'm in the evaluation stage, yeah, that seems right.
The intended answer was: Evaluation
The pr

Your max_length is set to 138, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.1592, 0.0763, 0.1880, 0.0819, 0.1371]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I think cost efficiency is important because we need to save money, security is definitely important to protect things, and I'd also say good support is needed for help.
The intended answer was: ['Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Ease of use', 'Scalability']
The intended answer in BINARY was: [0, 1, 0, 1, 1]
The predicted answer in BINARY was: [1, 0, 1, 0, 0]

tensor([[0.1316, 0.1192, 0.1217, 0.1228, 0.1286]], grad_fn=<ViewBackward0>)
tensor([[0.1284, 0.1383, 0.1041, 0.1036, 0.1184]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh, I'm not entirely sure of the exact number but I'd guess we've probably got somewhere between 300 to 700 employees maybe.
The intended answer was: 201-1000
The predicted answer was: 11-50
The intended answer in BINARY was: [0, 0, 0, 1

Your max_length is set to 132, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


tensor([[0.1361, 0.1327, 0.1324, 0.1358, 0.1424]], grad_fn=<ViewBackward0>)
tensor([[0.4003, 0.3323, 0.3095, 0.3559, 0.3784]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Honestly, I'm not entirely sure, probably something else I guess.
The intended answer was: Other
The predicted answer was: Networking
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.0923, 0.0919, 0.1492, 0.1130, 0.1094]], grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: I think I would prefer a partner relationship. That sounds good to me.
The intended answer was: Partner
The predicted answer was: Reseller
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.2269, 0.1078, 0.1081, 0.1128, 0.2299]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Hmm, I'd say probab

Your max_length is set to 125, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.1014, 0.1349, 0.1107, 0.1007, 0.1053]], grad_fn=<ViewBackward0>)


Your max_length is set to 96, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.1155, 0.1282, 0.1710, 0.1015, 0.1220]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: I guess ease of use is important, and cost efficiency definitely matters too. Security seems key, and good support is a must have I think.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Scalability']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.2653, 0.2766, 0.2562, 0.2790, 0.2708]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Oh, hmm, I guess I heard about it some other way then, you know? Not sure exactly which, but not from a known source.
The intended answer was: Other
The predicted answer was: Word of mouth
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.1033, 0.1125, 0.1720, 0.1528, 0.1426]], grad

Your max_length is set to 153, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)


tensor([[0.2544, 0.2954, 0.2545, 0.1620, 0.3110]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Oh wow, I guess it was just through word of mouth.
The intended answer was: Word of mouth
The predicted answer was: Other
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.2296, 0.2223, 0.2084, 0.1684, 0.2443]], grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: I think I've already decided. I'm pretty sure I'm set on that choice.
The intended answer was: Already decided
The predicted answer was: Not buying
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.1256, 0.1128, 0.1207, 0.0904, 0.1372]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Hmm I think our IT department looks at the tech stuff. Procurement probably handles the money part and ma

Your max_length is set to 106, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.2066, 0.2179, 0.2070, 0.1993, 0.1852]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Oh, I would like it immediately I suppose. That seems like a good time for me.
The intended answer was: Immediately
The predicted answer was: 1-3 months
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[0.0932, 0.0937, 0.0934, 0.0762, 0.1443]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I suppose I need technical support, that would really help.
The intended answer was: ['Technical support']
The predicted answer was: ['None']
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.1125, 0.1184, 0.1218, 0.1237, 0.1121]], grad_fn=<ViewBackward0>)
tensor([[0.1386, 0.1395, 0.1382, 0.1380, 0.1401]], grad_fn=<ViewBackward0>)
Question: How would you prefer to receive product upd

Your max_length is set to 160, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[0.1662, 0.1772]], grad_fn=<ViewBackward0>)
Question: Do you plan to implement a solution within the next 6 months?
Context: Oh, you're asking about my plans. Well, between yes and no, I'd have to say yes.
The intended answer was: Yes
The predicted answer was: No
The intended answer in BINARY was: [1, 0]
The predicted answer in BINARY was: [0, 1]

tensor([[0.1339, 0.1331, 0.1324, 0.1349, 0.1355]], grad_fn=<ViewBackward0>)
Question: How would you prefer to receive product updates?
Context: I guess I would prefer to receive product updates through email, that sounds easiest.
The intended answer was: Email
The predicted answer was: In-person meeting
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.1075, 0.0807, 0.0892, 0.2220, 0.2078]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think maybe the team leader, or perhaps the IT department. Procurement could also be involv

Your max_length is set to 108, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.0742, 0.0961, 0.1043, 0.1161, 0.1740]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: I think I need training and documentation, and maybe some onsite assistance too, or perhaps none of them really.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance', 'None']
The predicted answer was: ['None']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.1894, 0.1884]], grad_fn=<ViewBackward0>)
Question: Do you plan to implement a solution within the next 6 months?
Context: No, I don't think so. I haven't even looked at all the different options.
The intended answer was: No
The predicted answer was: Yes
The intended answer in BINARY was: [0, 1]
The predicted answer in BINARY was: [1, 0]

tensor([[0.1277, 0.1089, 0.1064, 0.1150, 0.1102]], grad_fn=<ViewBackward0>)
tensor([[0.3085, 0.3783, 0.3205, 0.3851, 0.2998]], grad_fn=<ViewBackward0>)
Question: W

Your max_length is set to 145, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[0.1349, 0.1601, 0.1358, 0.1429, 0.2266]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on the trade fair website.
The intended answer was: Trade fair website
The predicted answer was: Other
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.1295, 0.1236, 0.1203, 0.1186, 0.1312]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh geez, I'm not totally sure about the exact number. I think we've got somewhere around 450 employees, give or take a few.
The intended answer was: 201-1000
The predicted answer was: 1000+
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.1138, 0.1180, 0.1092, 0.1022, 0.1066]], grad_fn=<ViewBackward0>)
tensor([[0.1543, 0.2177, 0.2225, 0.2047, 0.2496]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up

In [27]:
# Bundle the model name with its multiple-choice and open-ended metrics
model_result = {'model_name': model_name,
                'mc_metric_result': mc_metric_result,
                'oe_metric_result': oe_metric_result}
print(model_result)

{'model_name': 'bert-base-cased', 'mc_metric_result': {'accuracy': 0.6427672955974842, 'f1': 0.3515981735159817, 'precision': 0.39285714285714285, 'recall': 0.3181818181818182}, 'oe_metric_result': {'exact_match': 0.4444444444444444}}
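
The multiple-choice accuracy (about 0.64) is much higher than the F1 score (about 0.35). One plausible reading, and it is only an assumption because the metric wiring is defined elsewhere in the notebook, is that the binary answer vectors are flattened element-wise before scoring. In that case every matching zero counts towards accuracy, while precision, recall and F1 only reward correctly predicted ones. The sketch below uses a hypothetical helper, `binary_classification_metrics`, written in plain Python to illustrate this effect on two intended/predicted pairs taken from the log above.

```python
def binary_classification_metrics(references, predictions):
    """Accuracy, precision, recall and F1 over flattened 0/1 labels."""
    # Flatten the per-question vectors into two long label lists.
    refs = [label for vector in references for label in vector]
    preds = [label for vector in predictions for label in vector]
    if len(refs) != len(preds):
        raise ValueError("references and predictions differ in length")

    # Count the four confusion-matrix cells for the positive class (label 1).
    tp = sum(1 for r, p in zip(refs, preds) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(refs, preds) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(refs, preds) if r == 1 and p == 0)
    tn = sum(1 for r, p in zip(refs, preds) if r == 0 and p == 0)

    accuracy = (tp + tn) / len(refs) if refs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Two intended/predicted pairs copied from the BERT log above.
intended = [[0, 0, 0, 1, 0], [1, 1, 1, 0, 1]]
predicted = [[0, 0, 0, 0, 1], [0, 0, 0, 1, 1]]

metrics = binary_classification_metrics(intended, predicted)
print(metrics)
```

For these two pairs the accuracy is 0.4 while the F1 score is only 0.25, because three of the four agreeing positions are zeros. If the notebook's metric is computed per question instead, the numbers would have to be interpreted differently.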


In [28]:
# Save the BERT outputs so they can be evaluated further later on
with open('bert_mc_results.json', 'w') as fp:
    json.dump(mc_results, fp)
with open('bert_oe_results.json', 'w') as fp:
    json.dump(oe_results, fp)
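
To pick the saved outputs up again in a later session, they can be read back with `json.load`. The sketch below is a minimal round trip under assumptions: the helper names `save_results` and `load_results`, the file name and the record layout are hypothetical, since the real structure of `mc_results` is determined by `model_output`. It only works if the results hold JSON-serializable values such as plain lists, strings and numbers, not tensors.

```python
import json
import tempfile
from pathlib import Path


def save_results(results, path):
    """Write a JSON-serializable results object to disk."""
    with open(path, "w") as fp:
        json.dump(results, fp)


def load_results(path):
    """Read a results object back from a JSON file."""
    with open(path) as fp:
        return json.load(fp)


# Hypothetical record layout, filled with one example from the log above.
example_results = [
    {
        "question": "How did you hear about our exhibition stand?",
        "intended": [0, 0, 0, 0, 1],
        "predicted": [0, 0, 1, 0, 0],
    }
]

# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example_mc_results.json"
    save_results(example_results, path)
    reloaded = load_results(path)

print(reloaded == example_results)
```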

##### RoBERTa model

In [29]:
# Load RoBERTa with a multiple-choice head (the head is randomly initialized until fine-tuned)
model_name = "FacebookAI/roberta-base"
roberta_model = AutoModelForMultipleChoice.from_pretrained(model_name)
roberta_tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of RobertaForMultipleChoice were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
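
The warning above means that the multiple-choice head of `roberta-base` starts from random weights, so its scores carry little information before fine-tuning. As a hedged illustration, the sketch below assumes that a single-select answer is chosen by taking the option with the highest score. This is consistent with the logged examples, but the actual selection rule lives in `model_output` and is not shown here. The helper names `one_hot_argmax` and `score_spread` are hypothetical, and the scores are the first tensor printed in the RoBERTa evaluation run.

```python
def one_hot_argmax(scores):
    """Return a 0/1 vector that marks the highest-scoring option."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]


def score_spread(scores):
    """Difference between the largest and the smallest score."""
    return max(scores) - min(scores)


# First score tensor logged for RoBERTa (question about the department).
roberta_scores = [0.3599, 0.3607, 0.3644, 0.3600, 0.3615]

prediction = one_hot_argmax(roberta_scores)
print(prediction, round(score_spread(roberta_scores), 4))
```

The options differ by less than 0.005, so the selected answer is decided by a tiny margin. That fits the warning's advice to train the model on a downstream task before using it for predictions.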



In [30]:
# Compute RoBERTa predictions and metrics on the test set
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(
    roberta_model, roberta_tokenizer, oe_model, oe_tokenizer, df_test_dataset,
    sum_pipeline=summarization_pipeline, mc_metric=clf_metrics, oe_metric=exact_match,
)

tensor([[0.3599, 0.3607, 0.3644, 0.3600, 0.3615]], grad_fn=<ViewBackward0>)
Question: What department are you representing?
Context: Oh, I think I represent the Operations department. That must be it, right?
The intended answer was: Operations
The predicted answer was: Marketing
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.3644, 0.3576, 0.3590, 0.3628, 0.3635]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Hmm, not sure to be honest, I've not really thought about it yet.
The intended answer was: Not sure
The predicted answer was: Immediately
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.3519, 0.3549, 0.3559, 0.3535, 0.3509]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well I would say I am very satisfied with the solutions currently.
The intended

Your max_length is set to 120, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.3544, 0.3562, 0.3559, 0.3558, 0.3545]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, well, I'm definitely unsatisfied with the current solutions in my field. It's a bit rough right now I think.
The intended answer was: Unsatisfied
The predicted answer was: Satisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]



Your max_length is set to 113, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.3616, 0.3577, 0.3592, 0.3615, 0.3621]], grad_fn=<ViewBackward0>)
tensor([[0.3627, 0.3606, 0.3623, 0.3610, 0.3665]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on social media. Yeah, that's how I found out about it.
The intended answer was: Social media
The predicted answer was: Other
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3603, 0.3568, 0.3597, 0.3608, 0.3604]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Well, I'm not really sure. I suppose my main thing is something like 'other'. That sounds about right.
The intended answer was: Other
The predicted answer was: Market research
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.3594, 0.3542, 0.3534, 0.3562, 0.3535]], grad_fn=<ViewBackward0>)
Question: What support resources do you 

Your max_length is set to 92, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


tensor([[0.3580, 0.3599, 0.3627, 0.3592, 0.3577]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, um, well, I'd say I'm very satisfied.
The intended answer was: Very satisfied
The predicted answer was: Neutral
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.3509, 0.3526, 0.3534, 0.3530, 0.3508]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: I would say very unsatisfied, to be honest. I think things can get much better in my field.
The intended answer was: Very unsatisfied
The predicted answer was: Neutral
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.3555, 0.3542, 0.3572, 0.3593, 0.3569]], grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh, wow, I'm really not sure, I think we have maybe 

Your max_length is set to 193, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)


tensor([[0.3611, 0.3660, 0.3624, 0.3610, 0.3634]], grad_fn=<ViewBackward0>)
Question: How would you prefer to receive product updates?
Context: Oh, hmm, product updates? I guess I'd prefer to get them through social media.
The intended answer was: Social media
The predicted answer was: Webinar
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]



Your max_length is set to 144, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your max_length is set to 117, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 94, but your input_length is only 27. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)


tensor([[0.3552, 0.3563, 0.3560, 0.3569, 0.3588, 0.3577]],
       grad_fn=<ViewBackward0>)
Question: What language do you prefer for communication?
Context: Oh, that's tricky. I'm not really sure which languages there are, so I'd have to go with other I guess.
The intended answer was: Other
The predicted answer was: Italian
The intended answer in BINARY was: [0, 0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 0, 1, 0]

tensor([[0.3657, 0.3705, 0.3715, 0.3651, 0.3676]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, um, I guess an in-person visit would probably be my preference.
The intended answer was: In-person visit
The predicted answer was: Video meeting
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]



Your max_length is set to 112, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.3637, 0.3673, 0.3694, 0.3648, 0.3641]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, well I guess an in-person visit would be my preferred method of follow-up then. I think it's the best way to connect.
The intended answer was: In-person visit
The predicted answer was: Video meeting
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.3468, 0.3460, 0.3455, 0.3500, 0.3445]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I think I might need some training, also documentation, maybe even some technical support. If none is needed that's ok too I suppose.
The intended answer was: ['Training', 'Documentation', 'Technical support', 'None']
The predicted answer was: ['Onsite assistance']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.3619, 0.3626, 0.3

Your max_length is set to 185, but your input_length is only 51. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)


tensor([[0.3585, 0.3600, 0.3622, 0.3595, 0.3590]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Honestly, I am quite unsatisfied with them at the moment, I'd say.
The intended answer was: Unsatisfied
The predicted answer was: Neutral
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]



Your max_length is set to 79, but your input_length is only 21. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=10)


tensor([[0.3659, 0.3650, 0.3600, 0.3599, 0.3578]], grad_fn=<ViewBackward0>)
tensor([[0.3463, 0.3457, 0.3461, 0.3462, 0.3474]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I'd say ease of use is crucial and cost efficiency really matters. Scalability is definitely something important. Oh, and good support is key for any solution.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Scalability', 'Support']
The predicted answer was: ['Support']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3500, 0.3506, 0.3490, 0.3513, 0.3488]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Oh wow, for implementation I'd definitely need training. Maybe also some technical support. And probably onsite assistance would help a lot.
The intended answer was: ['Training', 'Technical support', 'Onsite assistance']
The predicted a

Your max_length is set to 159, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)


tensor([[0.3559, 0.3582, 0.3566, 0.3608, 0.3586]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think it's probably either Procurement or maybe some other team does that.
The intended answer was: ['Procurement', 'Other']
The predicted answer was: ['CEO']
The intended answer in BINARY was: [0, 0, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[0.3598, 0.3569, 0.3565, 0.3594, 0.3564]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Oh gosh, I guess I'd need training, good documentation, and also someone to help onsite, yeah that's it.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance']
The predicted answer was: ['Training', 'Onsite assistance']
The intended answer in BINARY was: [1, 1, 0, 1, 0]
The predicted answer in BINARY was: [1, 0, 0, 1, 0]

tensor([[0.3699, 0.3680, 0.3679, 0.3702, 0.3730]], grad_fn=<ViewBackward0>)
Question: How did you he

Your max_length is set to 107, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.3515, 0.3538, 0.3551, 0.3523, 0.3517]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well, I guess I'm satisfied with the current solutions, I don't really know the alternatives anyway.
The intended answer was: Satisfied
The predicted answer was: Neutral
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

Question: What is your estimated budget for this project?
Context: Okay, for this project, I'm currently estimating a budget of about $11,700.
The intended answer was: $11700
The predicted answer was: $ 11, 700

tensor([[0.3507, 0.3515, 0.3504, 0.3535, 0.3509]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Well, I guess it could be the IT department or maybe Procurement, and if not them I'd say Other people.
The intended answer was: ['IT department', 'Procurement', 'Other']
The predicted answer was: ['CEO']
The intend

Your max_length is set to 138, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.3493, 0.3478, 0.3471, 0.3470, 0.3501]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I think cost efficiency is important because we need to save money, security is definitely important to protect things, and I'd also say good support is needed for help.
The intended answer was: ['Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Ease of use', 'Support']
The intended answer in BINARY was: [0, 1, 0, 1, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 1]

tensor([[0.3675, 0.3725, 0.3765, 0.3712, 0.3673]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, hmm, if I had to pick a method of follow-up I suppose a phone call would work best for me.
The intended answer was: Phone call
The predicted answer was: Video meeting
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.3554, 0.3574, 0.3570, 0.3550, 

Your max_length is set to 132, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


tensor([[0.3617, 0.3654, 0.3641, 0.3630, 0.3584]], grad_fn=<ViewBackward0>)
Question: How would you prefer to receive product updates?
Context: I guess an in person meeting would be my preferred way to get product updates.
The intended answer was: In-person meeting
The predicted answer was: Webinar
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[0.3661, 0.3629, 0.3656, 0.3667, 0.3706]], grad_fn=<ViewBackward0>)
tensor([[0.3595, 0.3624, 0.3624, 0.3579, 0.3655]], grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: I think I would prefer a partner relationship. That sounds good to me.
The intended answer was: Partner
The predicted answer was: Other
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3588, 0.3503, 0.3508, 0.3554, 0.3581]], grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Contex

Your max_length is set to 125, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[0.3655, 0.3610, 0.3642, 0.3645, 0.3698]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Well I suppose my primary goal here is finding suppliers, yeah that sounds right to me.
The intended answer was: Finding suppliers
The predicted answer was: Other
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]



Your max_length is set to 96, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.3441, 0.3445, 0.3440, 0.3441, 0.3462]], grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: I guess ease of use is important, and cost efficiency definitely matters too. Security seems key, and good support is a must have I think.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Support']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3609, 0.3578, 0.3599, 0.3607, 0.3646]], grad_fn=<ViewBackward0>)
tensor([[0.3572, 0.3583, 0.3612, 0.3573, 0.3574]], grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well, I guess I'm satisfied, I'm not really sure what the other options are though.
The intended answer was: Satisfied
The predicted answer was: Neutral
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]



Your max_length is set to 153, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)


tensor([[0.3706, 0.3677, 0.3683, 0.3695, 0.3730]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Oh wow, I guess it was just through word of mouth.
The intended answer was: Word of mouth
The predicted answer was: Other
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3633, 0.3611, 0.3574, 0.3570, 0.3556]], grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: I think I've already decided. I'm pretty sure I'm set on that choice.
The intended answer was: Already decided
The predicted answer was: Exploration
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[0.3542, 0.3558, 0.3541, 0.3573, 0.3589]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Hmm I think our IT department looks at the tech stuff. Procurement probably handles the money part and m

Your max_length is set to 106, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.3608, 0.3535, 0.3544, 0.3578, 0.3603]], grad_fn=<ViewBackward0>)
tensor([[0.3576, 0.3556, 0.3516, 0.3583, 0.3513]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I suppose I need technical support, that would really help.
The intended answer was: ['Technical support']
The predicted answer was: ['Training', 'Onsite assistance']
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [1, 0, 0, 1, 0]

tensor([[0.3587, 0.3557, 0.3583, 0.3564, 0.3614]], grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: I guess my primary goal here would be market research. I am trying to figure out what's happening out there.
The intended answer was: Market research
The predicted answer was: Other
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3476, 0.3551, 0.3519, 0.3524, 0.3533]], grad_fn=<ViewBackwa

Your max_length is set to 160, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[0.3537, 0.3562]], grad_fn=<ViewBackward0>)
Question: Do you plan to implement a solution within the next 6 months?
Context: Oh, you're asking about my plans. Well, between yes and no, I'd have to say yes.
The intended answer was: Yes
The predicted answer was: No
The intended answer in BINARY was: [1, 0]
The predicted answer in BINARY was: [0, 1]

tensor([[0.3545, 0.3635, 0.3592, 0.3614, 0.3633]], grad_fn=<ViewBackward0>)
Question: How would you prefer to receive product updates?
Context: I guess I would prefer to receive product updates through email, that sounds easiest.
The intended answer was: Email
The predicted answer was: Webinar
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[0.3516, 0.3554, 0.3510, 0.3546, 0.3544]], grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think maybe the team leader, or perhaps the IT department. Procurement could also be involved, and th

Your max_length is set to 108, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[0.3571, 0.3562, 0.3555, 0.3572, 0.3549]], grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: I think I need training and documentation, and maybe some onsite assistance too, or perhaps none of them really.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance', 'None']
The predicted answer was: ['Training', 'Onsite assistance']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [1, 0, 0, 1, 0]

tensor([[0.3510, 0.3528]], grad_fn=<ViewBackward0>)
tensor([[0.3515, 0.3538, 0.3564, 0.3530, 0.3558]], grad_fn=<ViewBackward0>)
Question: What department are you representing?
Context: Oh gee, I'm not entirely sure. I guess I would be representing the R&D department then.
The intended answer was: R&D
The predicted answer was: Marketing
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[0.3621, 0.3710, 0.3673, 0.3609, 0.3693

Your max_length is set to 145, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[0.3679, 0.3659, 0.3632, 0.3662, 0.3717]], grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on the trade fair website.
The intended answer was: Trade fair website
The predicted answer was: Other
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3524, 0.3539, 0.3540, 0.3552, 0.3539]], grad_fn=<ViewBackward0>)
tensor([[0.3629, 0.3692, 0.3674, 0.3627, 0.3705]], grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: Well I'm looking for a partner relationship I think.
The intended answer was: Partner
The predicted answer was: Other
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[0.3646, 0.3689, 0.3749, 0.3660, 0.3663]], grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Well, I guess I'd have to say a phone call is what I

In [31]:
model_result = {'model_name': model_name, 'mc_metric_result': mc_metric_result, 'oe_metric_result': oe_metric_result}
print(model_result)

{'model_name': 'FacebookAI/roberta-base', 'mc_metric_result': {'accuracy': 0.5748427672955975, 'f1': 0.2102803738317757, 'precision': 0.24193548387096775, 'recall': 0.1859504132231405}, 'oe_metric_result': {'exact_match': 0.4444444444444444}}
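
The multiple-choice scores above are easier to interpret with a small worked example. The sketch below assumes that the metrics are computed element-wise over the flattened binary vectors; this is an assumption about `clf_metrics`, whose definition is not shown in this section. The helper name `binary_classification_metrics` is hypothetical, and the two example pairs are copied from the log above.

```python
def binary_classification_metrics(references, predictions):
    """Accuracy, precision, recall and F1 over two flat lists of 0/1 labels."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # option correctly selected
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # option wrongly selected
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # option wrongly left out
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)  # option correctly left out
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Two intended/predicted pairs taken from the log above
intended = [[1, 1, 0, 1, 0], [0, 1, 0, 0, 0]]
predicted = [[1, 0, 0, 1, 0], [0, 0, 1, 0, 0]]

# Flatten the per-question vectors into one long label list (assumed behaviour)
flat_refs = [value for row in intended for value in row]
flat_preds = [value for row in predicted for value in row]

metrics = binary_classification_metrics(flat_refs, flat_preds)
print(metrics)
```

Under this assumption, accuracy also counts every option that is correctly left unselected, which may explain why the accuracy reported above is much higher than the F1 score.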


In [32]:
# Save the model outputs so that they can be evaluated further later on
with open('roberta_mc_results.json', 'w') as fp:
    json.dump(mc_results, fp)
with open('roberta_oe_results.json', 'w') as fp:
    json.dump(oe_results, fp)
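
For the later evaluations, the saved files can be read back with `json.load`. The following is a minimal sketch: the helper name `load_results` is hypothetical, and the sample data is only a placeholder, since the real structure of `mc_results` and `oe_results` is not shown in this section.

```python
import json
import os
import tempfile


def load_results(path):
    """Read previously saved model outputs back from a JSON file."""
    with open(path) as fp:
        return json.load(fp)


# Round trip with placeholder data (not the real result structure)
sample = [{"intended": [0, 1, 0, 0, 0], "predicted": [0, 0, 1, 0, 0]}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_results.json")
    with open(path, "w") as fp:
        json.dump(sample, fp)
    restored = load_results(path)

print(restored == sample)
```

In the notebook itself, a call such as `load_results('roberta_mc_results.json')` would restore the multiple-choice outputs written by the cell above.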

##### XLNet model

In [33]:
# Load the XLNet model and its tokenizer
model_name = "xlnet/xlnet-base-cased"
xlnet_model = XLNetForMultipleChoice.from_pretrained(model_name)
xlnet_tokenizer = AutoTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/760 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/467M [00:00<?, ?B/s]

Some weights of XLNetForMultipleChoice were not initialized from the model checkpoint at xlnet/xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


spiece.model:   0%|          | 0.00/798k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.38M [00:00<?, ?B/s]
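
The warning above states that the multiple-choice head (`logits_proj`, `sequence_summary`) is newly initialized, so the logits printed in the outputs of this section come from an untrained head and are expected to lie close together. The sketch below shows how such logits could be turned into the binary prediction vectors seen in the logs. Both rules are assumptions, because the body of `model_output` is not shown in this section: taking the highest logit for single-choice questions and selecting every option above the mean logit for multi-select questions happen to reproduce the RoBERTa examples used here. The function names are hypothetical.

```python
def single_choice_prediction(logits):
    """One-hot vector marking the option with the highest logit (assumed rule)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [1 if i == best else 0 for i in range(len(logits))]


def multi_select_prediction(logits):
    """Mark every option whose logit lies above the mean logit (assumed rule)."""
    mean = sum(logits) / len(logits)
    return [1 if value > mean else 0 for value in logits]


# Logits copied from the RoBERTa log earlier in this section
satisfaction_logits = [0.3515, 0.3538, 0.3551, 0.3523, 0.3517]
support_logits = [0.3598, 0.3569, 0.3565, 0.3594, 0.3564]

single = single_choice_prediction(satisfaction_logits)
multi = multi_select_prediction(support_logits)
print(single)
print(multi)
```

Because the logits differ only in the third decimal place, small random differences in the untrained head decide the prediction, which is consistent with the low scores before fine-tuning.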

In [34]:
# Compute results
mc_results, oe_results, mc_metric_result, oe_metric_result = model_output(xlnet_model, xlnet_tokenizer, oe_model, oe_tokenizer, df_test_dataset, sum_pipeline=summarization_pipeline, mc_metric=clf_metrics, oe_metric=exact_match)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


tensor([[-0.3060, -0.1699, -0.0191, -0.6529, -0.3635]],
       grad_fn=<ViewBackward0>)
Question: What department are you representing?
Context: Oh, I think I represent the Operations department. That must be it, right?
The intended answer was: Operations
The predicted answer was: Marketing
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[-0.6120, -0.2691, -0.3249, -0.3142, -0.3381]],
       grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Hmm, not sure to be honest, I've not really thought about it yet.
The intended answer was: Not sure
The predicted answer was: 1-3 months
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[-0.2294, -0.3196, -0.6731, -0.3929, -0.3166]],
       grad_fn=<ViewBackward0>)
tensor([[-0.8225, -0.9195, -0.9858, -0.8141, -0.7152]],
       grad_fn=<ViewBackward0>)
Question: What stage are you in the buy

Your max_length is set to 120, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[-0.5136, -0.4252, -0.5209, -0.4472, -0.5305]],
       grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, well, I'm definitely unsatisfied with the current solutions in my field. It's a bit rough right now I think.
The intended answer was: Unsatisfied
The predicted answer was: Satisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]



Your max_length is set to 113, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[-0.2177, -0.5736, -0.5831, -0.5396, -0.2597]],
       grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Well, I think it was other, honestly. I'm not sure what else it could be.
The intended answer was: Other
The predicted answer was: Social media
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[-0.6458, -0.8589, -0.6316, -0.6553, -0.8104]],
       grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on social media. Yeah, that's how I found out about it.
The intended answer was: Social media
The predicted answer was: Trade fair website
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[-0.4797, -0.4615, -0.6605, -0.4782, -0.5748]],
       grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Well, I'm not really sure. I suppos

Your max_length is set to 92, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


tensor([[-0.2832, -0.1738, -0.5698, -0.1563, -0.3152]],
       grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Oh, um, well, I'd say I'm very satisfied.
The intended answer was: Very satisfied
The predicted answer was: Unsatisfied
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[-0.4853, -0.6128, -0.5308, -0.6297, -0.6140]],
       grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: I would say very unsatisfied, to be honest. I think things can get much better in my field.
The intended answer was: Very unsatisfied
The predicted answer was: Very satisfied
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[-0.3250, -0.3740, -0.2058, -0.2026, -0.3335]],
       grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh, w

Your max_length is set to 193, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)


tensor([[-0.5737, -0.6653, -0.5661, -0.5053, -0.5207]],
       grad_fn=<ViewBackward0>)


Your max_length is set to 144, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your max_length is set to 117, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 94, but your input_length is only 27. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)


tensor([[-0.5298, -0.4444, -0.3595, -0.4511, -0.3976, -0.6188]],
       grad_fn=<ViewBackward0>)
Question: What language do you prefer for communication?
Context: Oh, that's tricky. I'm not really sure which languages there are, so I'd have to go with other I guess.
The intended answer was: Other
The predicted answer was: French
The intended answer in BINARY was: [0, 0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 1, 0, 0, 0]

tensor([[-0.6332, -0.5552, -0.4678, -0.4204, -0.4889]],
       grad_fn=<ViewBackward0>)


Your max_length is set to 112, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[-0.6405, -0.6667, -0.7298, -0.5827, -0.4644]],
       grad_fn=<ViewBackward0>)
Question: What is your preferred method of follow-up?
Context: Oh, well I guess an in-person visit would be my preferred method of follow-up then. I think it's the best way to connect.
The intended answer was: In-person visit
The predicted answer was: No follow-up
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[-0.2851, -0.2411, -0.2085, -0.2829, -0.3213]],
       grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Well, I think I might need some training, also documentation, maybe even some technical support. If none is needed that's ok too I suppose.
The intended answer was: ['Training', 'Documentation', 'Technical support', 'None']
The predicted answer was: ['Documentation', 'Technical support']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 1

Your max_length is set to 185, but your input_length is only 51. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)


tensor([[-0.1628, -0.3839, -0.2915, -0.2919, -0.3010]],
       grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Honestly, I am quite unsatisfied with them at the moment, I'd say.
The intended answer was: Unsatisfied
The predicted answer was: Very satisfied
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]



Your max_length is set to 79, but your input_length is only 21. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=10)


tensor([[-0.4447, -0.7227, -0.8096, -0.8091, -0.7201]],
       grad_fn=<ViewBackward0>)
tensor([[-0.2477, -0.2826, -0.2646, -0.2877, -0.3991]],
       grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I'd say ease of use is crucial and cost efficiency really matters. Scalability is definitely something important. Oh, and good support is key for any solution.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Scalability', 'Support']
The predicted answer was: ['Ease of use', 'Scalability']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [1, 0, 1, 0, 0]

tensor([[-0.6871, -0.3257, -0.4399, -0.5525, -0.5729]],
       grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Oh wow, for implementation I'd definitely need training. Maybe also some technical support. And probably onsite assistance would help a lot.
The intended answer was: ['Training', 'T

Your max_length is set to 159, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)


tensor([[-0.6079, -0.5479, -0.6037, -0.4836, -0.3929]],
       grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think it's probably either Procurement or maybe some other team does that.
The intended answer was: ['Procurement', 'Other']
The predicted answer was: ['CEO', 'Other']
The intended answer in BINARY was: [0, 0, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 1]

tensor([[-0.6571, -0.4553, -0.1824, -0.3478, -0.5535]],
       grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: Oh gosh, I guess I'd need training, good documentation, and also someone to help onsite, yeah that's it.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance']
The predicted answer was: ['Technical support', 'Onsite assistance']
The intended answer in BINARY was: [1, 1, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 1, 0]

tensor([[-0.5277, -0.7449, -1.0051, -0.6468, -0.9768]],
  

Your max_length is set to 107, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[-0.3033, -0.3266, -0.2474, -0.3353, -0.4976]],
       grad_fn=<ViewBackward0>)
Question: How satisfied are you with the current solutions in your field?
Context: Well, I guess I'm satisfied with the current solutions, I don't really know the alternatives anyway.
The intended answer was: Satisfied
The predicted answer was: Neutral
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

Question: What is your estimated budget for this project?
Context: Okay, for this project, I'm currently estimating a budget of about $11,700.
The intended answer was: $11700
The predicted answer was: $ 11, 700

tensor([[-0.1162, -0.1655, -0.4525, -0.6421, -0.5540]],
       grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Well, I guess it could be the IT department or maybe Procurement, and if not them I'd say Other people.
The intended answer was: ['IT department', 'Procurement', 'Other']
The predicted answer

Your max_length is set to 138, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[-0.3805, -0.7258, -0.6651, -0.7858, -0.6595]],
       grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: Well, I think cost efficiency is important because we need to save money, security is definitely important to protect things, and I'd also say good support is needed for help.
The intended answer was: ['Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Ease of use']
The intended answer in BINARY was: [0, 1, 0, 1, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[-0.4533, -0.5782, -0.5579, -0.5205, -0.4720]],
       grad_fn=<ViewBackward0>)
tensor([[-0.3955, -0.3214, -0.3075, -0.3343, -0.4950]],
       grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh, I'm not entirely sure of the exact number but I'd guess we've probably got somewhere between 300 to 700 employees maybe.
The intended answer was: 201-1000
The predicted answer was: 51-200
The intended answer in B

Your max_length is set to 132, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


tensor([[-0.5806, -0.7387, -0.7611, -0.6724, -0.5660]],
       grad_fn=<ViewBackward0>)
tensor([[-0.5792, -0.7671, -0.7733, -0.5588, -0.6160]],
       grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Honestly, I'm not entirely sure, probably something else I guess.
The intended answer was: Other
The predicted answer was: Market research
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [0, 0, 0, 1, 0]

tensor([[-0.2924, -0.7954, -0.7684, -0.6077, -0.7521]],
       grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: I think I would prefer a partner relationship. That sounds good to me.
The intended answer was: Partner
The predicted answer was: Supplier
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[-0.5734, -0.4789, -0.4562, -0.4585, -0.3274]],
       grad_fn=<ViewBackward0>)
Question: How soon are you

Your max_length is set to 125, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


tensor([[-0.1951, -0.1843, -0.1799, -0.2575, -0.4643]],
       grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: Well I suppose my primary goal here is finding suppliers, yeah that sounds right to me.
The intended answer was: Finding suppliers
The predicted answer was: Learning about products
The intended answer in BINARY was: [0, 1, 0, 0, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]



Your max_length is set to 96, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[-0.4983, -0.7365, -0.5616, -0.6842, -0.6618]],
       grad_fn=<ViewBackward0>)
Question: Which features are most important in a solution?
Context: I guess ease of use is important, and cost efficiency definitely matters too. Security seems key, and good support is a must have I think.
The intended answer was: ['Ease of use', 'Cost efficiency', 'Security', 'Support']
The predicted answer was: ['Ease of use', 'Scalability']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [1, 0, 1, 0, 0]

tensor([[-0.2250, -0.5531, -0.3816, -0.2997, -0.3336]],
       grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: Oh, hmm, I guess I heard about it some other way then, you know? Not sure exactly which, but not from a known source.
The intended answer was: Other
The predicted answer was: Social media
The intended answer in BINARY was: [0, 0, 0, 0, 1]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[-0.1549,

Your max_length is set to 153, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)


tensor([[-0.6744, -0.6130, -0.5985, -0.5721, -0.5840]],
       grad_fn=<ViewBackward0>)
tensor([[-0.2300, -0.4975, -0.6795, -0.5075, -0.2706]],
       grad_fn=<ViewBackward0>)
Question: What stage are you in the buying process?
Context: I think I've already decided. I'm pretty sure I'm set on that choice.
The intended answer was: Already decided
The predicted answer was: Exploration
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [1, 0, 0, 0, 0]

tensor([[ 0.2812, -0.1349, -0.2034, -0.1331,  0.0598]],
       grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: Hmm I think our IT department looks at the tech stuff. Procurement probably handles the money part and maybe the CEO gets the final say sometimes I'm not really sure.
The intended answer was: ['IT department', 'Procurement', 'CEO']
The predicted answer was: ['Team leader', 'Other']
The intended answer in BINARY was: [0, 1, 1, 1, 0]
The predicted answer in 

Your max_length is set to 106, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[-0.8310, -0.4693, -0.4900, -0.4721, -0.5688]],
       grad_fn=<ViewBackward0>)
Question: How soon are you looking for a solution?
Context: Oh, I would like it immediately I suppose. That seems like a good time for me.
The intended answer was: Immediately
The predicted answer was: 1-3 months
The intended answer in BINARY was: [1, 0, 0, 0, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[-0.4738, -0.4049, -0.0420, -0.2732, -0.2435]],
       grad_fn=<ViewBackward0>)
tensor([[-0.4858, -0.2558, -0.5570, -0.4496, -0.4080]],
       grad_fn=<ViewBackward0>)
Question: What is your primary goal at this trade fair?
Context: I guess my primary goal here would be market research. I am trying to figure out what's happening out there.
The intended answer was: Market research
The predicted answer was: Finding suppliers
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 1, 0, 0, 0]

tensor([[-0.5917, -0.5854, -0.6059, -0.6539, -0.6912]],


Your max_length is set to 160, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[-0.6157, -0.5072]], grad_fn=<ViewBackward0>)
Question: Do you plan to implement a solution within the next 6 months?
Context: Oh, you're asking about my plans. Well, between yes and no, I'd have to say yes.
The intended answer was: Yes
The predicted answer was: No
The intended answer in BINARY was: [1, 0]
The predicted answer in BINARY was: [0, 1]

tensor([[-0.5307, -0.8765, -0.5889, -0.6612, -0.7038]],
       grad_fn=<ViewBackward0>)
tensor([[-0.4372, -0.3584, -0.3442, -0.3185, -0.3614]],
       grad_fn=<ViewBackward0>)
Question: Who in your company evaluates new solutions?
Context: I think maybe the team leader, or perhaps the IT department. Procurement could also be involved, and there might be other people too.
The intended answer was: ['Team leader', 'IT department', 'Procurement', 'Other']
The predicted answer was: ['Procurement', 'CEO']
The intended answer in BINARY was: [1, 1, 1, 0, 1]
The predicted answer in BINARY was: [0, 0, 1, 1, 0]

tensor([[-0.4144, -0.5925, -0.6

Your max_length is set to 108, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


tensor([[-0.7344, -0.5185, -0.3113, -0.4056, -0.5913]],
       grad_fn=<ViewBackward0>)
Question: What support resources do you need for implementation?
Context: I think I need training and documentation, and maybe some onsite assistance too, or perhaps none of them really.
The intended answer was: ['Training', 'Documentation', 'Onsite assistance', 'None']
The predicted answer was: ['Technical support', 'Onsite assistance']
The intended answer in BINARY was: [1, 1, 0, 1, 1]
The predicted answer in BINARY was: [0, 0, 1, 1, 0]

tensor([[-0.3995, -0.3708]], grad_fn=<ViewBackward0>)
tensor([[-0.3340, -0.3492, -0.4754, -0.3734, -0.4327]],
       grad_fn=<ViewBackward0>)
tensor([[-0.4420, -0.5953, -0.6094, -0.5707, -0.4929]],
       grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: I'm not really sure, maybe something other I guess.
The intended answer was: Other
The predicted answer was: Supplier
The intended answer in BINARY was: [0, 0, 0, 0, 1

Your max_length is set to 145, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


tensor([[-0.9793, -0.8455, -1.0237, -0.9141, -0.7768]],
       grad_fn=<ViewBackward0>)
Question: How did you hear about our exhibition stand?
Context: I think I saw it on the trade fair website.
The intended answer was: Trade fair website
The predicted answer was: Other
The intended answer in BINARY was: [0, 0, 1, 0, 0]
The predicted answer in BINARY was: [0, 0, 0, 0, 1]

tensor([[-0.3068, -0.3026, -0.2254, -0.2639, -0.4694]],
       grad_fn=<ViewBackward0>)
Question: How many employees does your company have?
Context: Oh geez, I'm not totally sure about the exact number. I think we've got somewhere around 450 employees, give or take a few.
The intended answer was: 201-1000
The predicted answer was: 51-200
The intended answer in BINARY was: [0, 0, 0, 1, 0]
The predicted answer in BINARY was: [0, 0, 1, 0, 0]

tensor([[-0.4619, -0.6748, -0.4625, -0.5399, -0.7607]],
       grad_fn=<ViewBackward0>)
Question: What type of customer relationship are you seeking?
Context: Well I'm looking for

In [35]:
model_result = {'model_name': model_name, 'mc_metric_result': mc_metric_result, 'oe_metric_result': oe_metric_result}
print(model_result)

{'model_name': 'xlnet/xlnet-base-cased', 'mc_metric_result': {'accuracy': 0.6062893081761006, 'f1': 0.27713625866050806, 'precision': 0.31413612565445026, 'recall': 0.24793388429752067}, 'oe_metric_result': {'exact_match': 0.4444444444444444}}
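
The `mc_metric_result` values above can be related to the binary vectors printed in the log. The metric code itself is not shown in this part of the notebook, so the following is only a minimal sketch under the assumption that every answer option is scored as its own 0/1 label (element-wise over the flattened vectors). The function name `binary_metrics` is hypothetical and not part of the notebook.

```python
from typing import Dict, List


def binary_metrics(intended: List[List[int]], predicted: List[List[int]]) -> Dict[str, float]:
    """Element-wise accuracy, precision, recall and F1 over flattened 0/1 vectors.

    Assumption: each answer option counts as one binary label.
    """
    # Flatten the per-question vectors into one long list of option-level labels.
    y_true = [v for vec in intended for v in vec]
    y_pred = [v for vec in predicted for v in vec]
    if len(y_true) != len(y_pred):
        raise ValueError("intended and predicted vectors must have matching lengths")

    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # Guard against division by zero when a class never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Two entries taken from the log above (one single-select, one multi-select).
example = binary_metrics(
    intended=[[1, 0, 0, 0, 0], [1, 1, 1, 0, 1]],
    predicted=[[0, 1, 0, 0, 0], [0, 0, 1, 1, 0]],
)
print(example)
```

Under this assumption, correctly predicted zeros raise the accuracy without contributing to precision or recall, which would be consistent with an accuracy that is noticeably higher than the F1 score.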


In [36]:
# save model outputs to make further evaluations afterwards
with open('xlnet_mc_results.json', 'w') as fp:
    json.dump(mc_results, fp)
with open('xlnet_oe_results.json', 'w') as fp:
    json.dump(oe_results, fp)
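
To pick the saved outputs up again in a later notebook, they can be read back with `json.load`. The helper below is a small sketch; `load_results` is a hypothetical name, and the structure of the loaded object is whatever `mc_results` and `oe_results` contained when they were written.

```python
import json
from pathlib import Path
from typing import Any


def load_results(path: str) -> Any:
    """Load a results file that was previously written with json.dump."""
    with Path(path).open("r", encoding="utf-8") as fp:
        return json.load(fp)
```

For example, `load_results('xlnet_mc_results.json')` would return the multiple-choice results saved above, provided the file is in the current working directory.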

## Notebook Conclusion:
We created a fine-tuning pipeline that prepares and trains multiple-choice models using Hugging Face's Trainer API.
It tokenizes the inputs, applies dynamic padding, fine-tunes the models with the settings we identified as optimal, and saves them to Google Drive.
The setup covers training, evaluation, and reproducibility, which makes it well-suited for both single- and multi-select multiple-choice tasks 🚀
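
To illustrate what dynamic padding means for multiple-choice inputs, here is a library-free sketch. It is not the collator used in the pipeline: the function name `pad_multiple_choice_batch`, the right-side padding, and the default `pad_token_id=0` are assumptions made for illustration only. Each example holds one token-id list per answer option, and all lists are padded to the longest sequence in the current batch rather than to a fixed global length.

```python
from typing import Dict, List

# One example = one token-id list per answer option.
Example = List[List[int]]


def pad_multiple_choice_batch(
    batch: List[Example], pad_token_id: int = 0
) -> Dict[str, List[List[List[int]]]]:
    """Pad every option sequence to the longest sequence in this batch."""
    # The target length depends on the batch, not on a global maximum.
    max_len = max(len(ids) for example in batch for ids in example)

    input_ids, attention_mask = [], []
    for example in batch:
        ids_rows, mask_rows = [], []
        for ids in example:
            pad = max_len - len(ids)
            # Assumption: padding is appended on the right.
            ids_rows.append(ids + [pad_token_id] * pad)
            # 1 marks real tokens, 0 marks padding.
            mask_rows.append([1] * len(ids) + [0] * pad)
        input_ids.append(ids_rows)
        attention_mask.append(mask_rows)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two examples with two answer options each; the longest sequence has 3 tokens.
padded = pad_multiple_choice_batch([[[5, 6, 7], [5, 6]], [[8], [9, 10]]])
print(padded["input_ids"])
```

Padding per batch keeps short batches short, which saves computation compared with padding every input to one fixed maximum length.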

How well the fine-tuning pipeline performs can be seen in the Overview notebook, where we compute the final evaluations ✅