<a href="https://colab.research.google.com/github/alexk2206/tds_capstone/blob/Domi-DEV/Productive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Productive Notebook

In [3]:
!pip install evaluate
!pip install --upgrade sympy

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.17-py311-none-any.whl.metadata (7.2 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [

In [21]:
import json
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn.functional as F
import numpy as np
import urllib
from itertools import chain, combinations
from transformers import AutoTokenizer, AutoModelForMultipleChoice, AutoModelForQuestionAnswering, TrainingArguments, pipeline, Trainer, DataCollatorWithPadding
import torch
import requests
import evaluate
import numpy as np
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
from dateutil import parser
from datetime import datetime
import re



### Preprocess dataset

Here we split the QA-dataset into train and validation dataset.
Additionnaly, we prepare the dataset to later be useful for response-generation and fine-tuning of a model

❎ Please insert code: load dataset into variable q ❎

In [5]:
# Example dataset

url = "https://raw.githubusercontent.com/alexk2206/tds_capstone/refs/heads/main/datasets/combined_qa_dataset.json"
data = pd.read_json(url)
# Convert to DataFrame for easy handling
df = pd.DataFrame(data)

# Map the intended answer to the index of the option
df['label'] = df.apply(lambda x: np.array([1 if option in x['intended_answer'] else 0 for option in x['options']]) if x['type'] in ['SINGLE_SELECT', 'MULTI_SELECT'] else np.array([0]), axis=1)
df['stratify_key'] = df['difficulty'] + '_' + df['type']

# Stratified Train-Validation Split
train_df, val_df = train_test_split(
    df,
    train_size=0.8,
    stratify=df['stratify_key'],
    random_state=42
)

### Generate model output

After the creation of the QA-dataset, it's time for generating model output for different Huggingface models.

In [31]:
def model_output(mc_model, mc_tokenizer, oe_model, oe_tokenizer, questions, mc_metric=None, oe_metric=None):
    '''
    model_output -> creates output for every question in the dataset and safes it in a list of dicts. One dic has keys 'answer', 'predicted_answer', 'type'
    parameters:
    - model: one hugging face model
    - tokenizer: hugging face tokenizer
    - questions: QA-dataset in json format
    '''
    answer_comparison = []
    mc_answer_comparison = []
    summarization_pipeline = pipeline("summarization")

    for index, question in questions.iterrows():
        context = question['context']
        question_text = question['question']
        options = question['options']
        question_type = question['type']
        difficulty = question['difficulty']

        if question_type == "MULTI_SELECT":
          intended_answer, intended_answer_binary, predicted_answer_binary, predicted_answer = multi_select_model_output(mc_model, mc_tokenizer, question, mc_metric)
          print('======= Question =======')
          print(f"Question: {question_text}")
          print(f"Context: {context}")
          print(f"The intended answer was: BINARY: {intended_answer_binary}, WORDS: {intended_answer}")
          print(f"The predicted answer was: BINARY: {predicted_answer_binary}, WORDS: {predicted_answer}\n")

        elif question_type == "SINGLE_SELECT":
          intended_answer, intended_answer_binary, predicted_answer_binary, predicted_answer = single_select_model_output(mc_model, mc_tokenizer, question, mc_metric)
          print('======= Question =======')
          print(f"Question: {question_text}")
          print(f"Context: {context}")
          print(f"The intended answer was: BINARY: {intended_answer_binary}, WORDS: {intended_answer}")
          print(f"The predicted answer was: BINARY: {predicted_answer_binary}, WORDS: {predicted_answer}\n")

        elif question_type == "TEXT":
          intended_answer, predicted_answer = text_model_output(question, summarization_pipeline)
        elif question_type == "NUMBER":
          intended_answer, predicted_answer = number_model_output(oe_model, oe_tokenizer, question, oe_metric)
        elif question_type == "DATE":
          intended_answer, predicted_answer = date_model_output(oe_model, oe_tokenizer, question, oe_metric)
          print('======= Question =======')
          print(f"Question: {question_text}")
          print(f"Context: {context}")
          print(f"The intended answer was: {intended_answer}")
          print(f"The predicted answer was: {predicted_answer}\n")
        else:
          continue
        '''if predicted_answer != intended_answer:
          print('======= Wrong answer =======')
          print(f"Question: {question_text}")
          print(f"Context: {context}")
          print(f"The intended answer was: {intended_answer}")
          print(f"The predicted answer was: {predicted_answer}\n")'''
        if question_type in ["MULTI_SELECT", "SINGLE_SELECT"]:
          mc_answer_comparison.append({'intended_answer_binary': intended_answer_binary, 'predicted_answer_binary': predicted_answer_binary, 'intended_answer': intended_answer, 'predicted_answer': predicted_answer, 'type': question_type, 'difficulty': difficulty})
        else:
          answer_comparison.append({'intended_answer': intended_answer, 'predicted_answer': predicted_answer, 'type': question_type, 'difficulty': difficulty})
    if mc_metric is not None:
      try:
        mc_metric_result = mc_metric.compute()
      except:
        mc_metric_result = None
    else:
      mc_metric_result = None
    if oe_metric is not None:
      try:
        oe_metric_result = oe_metric.compute()
      except:
        oe_metric_result = None
    else:
      oe_metric_result = None
    return mc_answer_comparison, answer_comparison, mc_metric_result, oe_metric_result


Single-select output

In [18]:
def single_select_model_output(model, tokenizer, question, metric=None):
    '''
    Handles a question, its context and its options for a single-select question and generates output
    parameters:
    - model: one hugging face model
    - tokenizer: hugging face tokenizer
    - question: one question of the QA-dataset as a dictionary
    output:
    - answer: the correct/intended answer as a list of a string
    - predicted_answer: the predicted answer as a list of a string
    '''
    intended_answer = question['intended_answer'][0]
    options = question['options']

    # creating input ids by tokenizing the question
    input_ids = tokenize_function(question, tokenizer)
    input_ids = {key: torch.tensor(array) for key, array in input_ids.items()}

    # generating the output
    outputs = model(**input_ids)
    logits = outputs.logits  # Shape: [batch_size, num_choices]
    print(logits)
    # Predict the option with the highest score
    predicted_option = torch.argmin(logits, dim=1).item()

    predicted_answer_binary = [0] * len(options)
    predicted_answer_binary[predicted_option] = 1

    intended_answer_binary = [1 if option == intended_answer else 0 for option in options]

    if metric is not None:
      metric.add_batch(predictions=predicted_answer_binary, references=intended_answer_binary)

    return intended_answer, intended_answer_binary, predicted_answer_binary, options[predicted_option]


Multi-select output

In [8]:
def multi_select_model_output(model, tokenizer, question, metric=None):
    '''
    Handles a question, its context and its options for a multi-select question and generates output as a list of indices of the predicted answers. Ticks every option whose probability is at least 90% of the best option (softmax)
    parameters:
    - model: one hugging face model
    - tokenizer: hugging face tokenizer
    - question: one question of the QA-dataset as a dictionary
    output:
    - answer: the correct/intended answers as a list of strings
    - predicted_answer: the predicted answers as a list of strings
    '''
    intended_answer = question['intended_answer']
    options = question['options']

    # creating input ids by tokenizing the question
    input_ids = tokenize_function(question, tokenizer)
    input_ids = {key: torch.tensor(array) for key, array in input_ids.items()}

    # generating the output
    outputs = model(**input_ids)
    logits = outputs.logits  # Shape: [batch_size, num_choices]
    print(logits)
    # Find all indices to have at least 80% of the max score
    # probabilities = F.softmax(logits, dim=1)  # Convert logits to probabilities
    # print(probabilities)
    # max_score = probabilities.max().item()  # Use max probability
    #### better approach: using min_score
    #min_score = logits.min().item()
    #threshold = 2 * min_score  # Compute threshold based on probabilities

    ### next approach: using a threshold from deviation
    mean_score = logits.mean().item()
    std_dev = logits.std().item()

    # Define a threshold based on deviation from the mean
    threshold = mean_score - (0.4 * std_dev)  # Adjust the 0.5 multiplier as needed
    high_score_options = (logits <= threshold).nonzero(as_tuple=True)[1]  # Get the indices of valid options
    # high_score_options = (probabilities >= threshold).nonzero(as_tuple=True)[1]  # Get the indices of valid options

    # List the corresponding options
    high_score_answers = [options[idx] for idx in high_score_options.tolist()]
    intended_answer_binary = [1 if option in intended_answer else 0 for option in options]

    predicted_answer_binary = [1 if option in high_score_answers else 0 for option in options]

    if metric is not None:
        metric.add_batch(predictions=predicted_answer_binary, references=intended_answer_binary)

    return intended_answer, intended_answer_binary, predicted_answer_binary, high_score_answers



Text output

In [9]:
def text_model_output(question, pipeline):
    '''
    Handles an open text question and summarizes it
    parameter:
    - question: one question of the QA-dataset as a dictionary
    output:
    - answer: the full context of the question as a string
    - summary: the generated summary as a string
    '''
    intended_answer = question['context']
    summary = pipeline(intended_answer, max_length=100, min_length=30, do_sample=False)
    return intended_answer, summary[0]['summary_text']



Phone Number output

In [10]:
def number_model_output(model, tokenizer, question, metric=None):
    '''
    Handles a question where the context should contain a phone number and generates an answer to that question
    '''
    intended_answer = question['intended_answer'][0]

    input_ids = tokenize_function(question, tokenizer)
    output = model(**input_ids)
    start_logits, end_logits = output.start_logits, output.end_logits

    # Get most probable start and end index
    start_idx = torch.argmax(start_logits, dim=1).item()
    end_idx = torch.argmax(end_logits, dim=1).item() + 1  # Include last token

    # Convert token IDs to text
    predicted_tokens = input_ids["input_ids"][0][start_idx:end_idx]
    predicted_number = tokenizer.decode(predicted_tokens, skip_special_tokens=True)

    if metric is not None:
        metric.add(predictions=predicted_number, references=intended_answer)

    return intended_answer, predicted_number



Date output

In [22]:
def convert_date_format(date_str):
  try:
    parsed_date = parser.parse(date_str)
    return parsed_date.strftime('%Y-%m-%d')
  except Exception as e:
    return date_str

def find_date_and_convert(input_string):
  date_regex = r'\b(?:\d{1,2}(?:st|nd|rd|th)?\s+[A-Za-z]+\s+\d{4}|\d{1,2}[./-]\d{1,2}[./-]\d{2,4}|\b[A-Za-z]+\s+\d{1,2}(?:st|nd|rd|th)?,?\s+\d{4})\b'
  match = re.search(date_regex, input_string)
  if match:
    extracted_date = match.group(0)
    formatted_date = convert_date_format(extracted_date)
    return formatted_date
  else:
    return input_string




In [23]:
def date_model_output(model, tokenizer, question, metric=None):
    '''
    Handles a question where the context should contain a date and generates an answer to that question
    '''
    intended_answer = question['intended_answer'][0]

    input_ids = tokenize_function(question, tokenizer)
    output = model(**input_ids)
    start_logits, end_logits = output.start_logits, output.end_logits

    # Get most probable start and end index
    start_idx = torch.argmax(start_logits, dim=1).item()
    end_idx = torch.argmax(end_logits, dim=1).item() + 1  # Include last token

    # Convert token IDs to text
    predicted_tokens = input_ids["input_ids"][0][start_idx:end_idx]
    predicted_date = tokenizer.decode(predicted_tokens, skip_special_tokens=True)
    formated_predicted_date = find_date_and_convert(predicted_date)

    if metric is not None:
        metric.add(predictions=predicted_date, references=intended_answer)

    return intended_answer, formated_predicted_date



Accuracy

In [12]:
def accuracy(answer_comparison):
    '''
    Computes the total accuracy and accuracy for each question type for the passed list of dicts. One dict in the list is one question with keys 'answer', 'predicted_answer', 'type'
    parameters:
    - list of dicts with entries 1) predicted answer 2) answer 3) type of question
    '''
    correct_multi_select = 0
    correct_single_select = 0
    correct_text = 0
    correct_number = 0
    correct_date = 0
    correct_total = 0
    total = 0

    for entry in answer_comparison:
        question_type = entry['type']
        if entry['intended_answer'] == entry['predicted_answer']:
            if question_type == 'MULTI_SELECT':
                correct_multi_select += 1
                total_multi_select += 1
            elif question_type == 'SINGLE_SELECT':
                correct_single_select += 1
                total_single_select += 1
            elif question_type == 'TEXT':
                correct_text += 1
                total_text += 1
            elif question_type == 'NUMBER':
                correct_number += 1
                total_number += 1
            elif question_type == 'DATE':
                correct_date += 1
                total_date += 1
            else:
              continue
            correct_total += 1
        total += 1
    accuracy_total = correct_total / total
    accuracy_multi_select = correct_multi_select / total_multi_select
    accuracy_single_select = correct_single_select / total_single_select
    accuracy_text = correct_text / total_text
    accuracy_number = correct_number / total_number
    accuracy_date = correct_date / total_date
    return accuracy_total, accuracy_multi_select, accuracy_single_select, accuracy_text, accuracy_number, accuracy_date
'''
print_out_model_quality: takes the computations of function accuracy() and prints them out
parameters:
- accuracy_total
- accuracy_multi_select
- accuracy_single_select
- accuracy_text
- accuracy_number
- accuracy_date
'''
def print_out_model_quality(accuracy_total, accuracy_multi_select, accuracy_single_select, accuracy_text, accuracy_number, accuracy_date):
    # accuracy_total, accuracy_multi_select, accuracy_single_select, accuracy_text, accuracy_number, accuracy_date = accuracy(model, tokenizer, questions)
    print(f"""Accuracy values of model: {model.name_or_path}\n
    Total: {accuracy_total}\n
    Multi-select: {accuracy_multi_select}\n
    Single-select: {accuracy_single_select}\n
    Text: {accuracy_text}\n
    Number: {accuracy_number}\n
    Date: {accuracy_date}\n""")
    return accuracy_total, accuracy_multi_select, accuracy_single_select, accuracy_text, accuracy_number, accuracy_date




### Fine-tuning a model


In [13]:

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

In [29]:
def fine_tune_model(train_dataset, val_dataset, tokenizer, model):
    # Define training arguments
    training_args = TrainingArguments("trainer",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_dir="./logs",
        learning_rate=2e-5,
        num_train_epochs=3,
        weight_decay=0.01,
        logging_steps=10,
        load_best_model_at_end=True
    )
    data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)

    tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
    tokenized_val_dataset = val_dataset.map(tokenize_function, batched=True)

    # Define Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    # Train the model
    trainer.train()

def compute_metrics(eval_preds, pretrained_dataset_name):
    metric = evaluate.load("glue", pretrained_dataset_name)
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def tokenize_function(example, tokenizer):
    '''
    Converts the string input, which is a question with its context and the given options for multi-/single-select questions, into IDs the model later can make sense of. Distinguishes between multi-/single-select and the other questions
    parameters:
    - expample: question of the QA-dataset with all its entries (question, context, options, type are urgently necessary)
    - tokenizer: tokenizer of the model
    output:
    - tokenized: tokenized input example
    '''
    if example["type"] == "SINGLE_SELECT" or example["type"] == "MULTI_SELECT":
      number_of_options = len(example["options"])
      first_sentence = [[example["context"]] * number_of_options]  # Repeat context for each option
      second_sentence = [[example["question"] + " " + option] for option in example["options"]]  # Pair with each option
      tokenized = tokenizer(
          sum(first_sentence, []),
          sum(second_sentence, []),
          padding="longest",
          truncation=True
      )
      # Un-flatten
      return {k: [v[i:i+number_of_options] for i in range(0, len(v), number_of_options)] for k, v in tokenized.items()}

    else:
      tokenized = tokenizer(
          example['context'],
          example['question'],
          truncation="only_second",
          max_length=384,
          padding="max_length",
          return_tensors="pt"
      )

    return tokenized

In [15]:
model = AutoModelForMultipleChoice.from_pretrained("bert-base-cased", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [78]:
tryout = train_df[train_df['type'] == 'SINGLE_SELECT']

In [70]:
tryout

Unnamed: 0,question,type,options,intended_answer,context,difficulty,label,stratify_key
83,Size of the trade fair team (on average),SINGLE_SELECT,"[1-5, 6-10, 11-15, 16-20, 21-30, 31-40, more t...",[6-10],"On average, our trade fair team is usually bet...",easy,"[0, 1, 0, 0, 0, 0, 0]",easy_SINGLE_SELECT
40,CRM-System,SINGLE_SELECT,"[Salesforce, Pipedrive, Close.io, Microsoft Dy...",[Pipedrive],I've heard of Pipedrive; it's a CRM system I t...,easy,"[0, 1, 0, 0, 0, 0, 0, 0]",easy_SINGLE_SELECT
27,Customer group,SINGLE_SELECT,"[End User, Wholesaler, Distributor, Consultant...",[Architect],"Well, the customer group is architects; I'm w...",easy,"[0, 0, 0, 0, 0, 1, 0]",easy_SINGLE_SELECT
15,Customer group,SINGLE_SELECT,"[End User, Wholesaler, Distributor, Consultant...",[Wholesaler],"I think they're a wholesaler, meaning they buy...",easy,"[0, 1, 0, 0, 0, 0, 0]",easy_SINGLE_SELECT
16,Customer satisfaction,SINGLE_SELECT,"[Very satisfied, Satisfied, Unsatisfied, Very ...",[Very unsatisfied],"I'm very unsatisfied, actually. That's the on...",easy,"[0, 0, 0, 1]",easy_SINGLE_SELECT
9,CRM-System,SINGLE_SELECT,"[Salesforce, Pipedrive, Close.io, Microsoft Dy...",[SAP Sales Cloud],I've heard of SAP Sales Cloud; it's a CRM syst...,easy,"[0, 0, 0, 0, 0, 0, 1, 0]",easy_SINGLE_SELECT
84,CRM-System,SINGLE_SELECT,"[Salesforce, Pipedrive, Close.io, Microsoft Dy...",[CAS],I'm not familiar with CRM systems beyond what ...,easy,"[0, 0, 0, 0, 0, 1, 0, 0]",easy_SINGLE_SELECT
74,Which language is wanted for communication?,SINGLE_SELECT,"[German, Italian, Japanese , English, Spanish]",[English],"I'd prefer to communicate in English, since th...",easy,"[0, 0, 0, 1, 0]",easy_SINGLE_SELECT
45,Data processing consent,SINGLE_SELECT,"[Yes, No]",[Yes],"Yes, I consent to the data processing.",easy,"[1, 0]",easy_SINGLE_SELECT
7,Which language is wanted for communication?,SINGLE_SELECT,"[German, Italian, Japanese , English, Spanish]",[Spanish],"I'd prefer to communicate in Spanish, since t...",easy,"[0, 0, 0, 0, 1]",easy_SINGLE_SELECT


In [16]:
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
exact_match = evaluate.load("exact_match")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.79k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.56k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.38k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/5.67k [00:00<?, ?B/s]

In [32]:
oe_model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased-distilled-squad")
oe_tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-distilled-squad")
tryout = train_df[train_df['type'] == 'DATE'].tail(30)
mc_results, results, mc_metric_result, oe_metric_result = model_output(model, tokenizer, oe_model, oe_tokenizer, tryout, mc_metric=clf_metrics, oe_metric=exact_match)
print(mc_metric_result)
print(oe_metric_result)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Question: When do you wish to receive a follow-up?
Context: How about we plan a follow-up for, say, January 23rd, 2025? Does that work for you?
The intended answer was: 2025-01-23
The predicted answer was: when

Question: When do you wish to receive a follow-up?
Context: How about we catch up again on January 20th, 2025? Does that work for you?
The intended answer was: 2025-01-20
The predicted answer was: when do you wish to receive a follow - up

Question: When do you wish to receive a follow-up?
Context: How about we schedule the follow-up for January 17th, 2025? Does that work for you?
The intended answer was: 2025-01-17
The predicted answer was: when do you wish to receive a follow - up

Question: When do you wish to receive a follow-up?
Context: How about we plan on a follow-up around January 24th, 2025? Does that work for you?
The intended answer was: 2025-01-24
The predicted answer was: when do you wish to receive a follow - up

Question: When do you wish to receive a follow-up?

In [None]:
{'accuracy': 0.23497267759562843, 'f1': 0.3069306930693069, 'precision': 0.25833333333333336, 'recall': 0.3780487804878049}
{'accuracy': 0.7923497267759563, 'f1': 0.7790697674418605, 'precision': 0.7444444444444445, 'recall': 0.8170731707317073}