<a href="https://colab.research.google.com/github/croco22/CapstoneProjectTDS/blob/annelie/notebooks/Huggingface_QA_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 2: Evaluate Dataset

## Import Data


In [2]:
!pip install word2number

import json
import time
import google.generativeai as genai
from google.colab import userdata
import requests
from word2number import w2n
import re
import pandas as pd

# Gemini API Setup
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-1.5-flash')

# Read dataset file
url = 'https://raw.githubusercontent.com/croco22/CapstoneProjectTDS/refs/heads/main/qa_dataset.json'
data = pd.read_json(url)

data.head()

Collecting word2number
  Downloading word2number-1.1.zip (9.7 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: word2number
  Building wheel for word2number (setup.py) ... [?25l[?25hdone
  Created wheel for word2number: filename=word2number-1.1-py3-none-any.whl size=5568 sha256=062e1ceaa3bb785a8021657555b795250b6297c0273430dd98997aeefd020fed
  Stored in directory: /root/.cache/pip/wheels/cd/ef/ae/073b491b14d25e2efafcffca9e16b2ee6d114ec5c643ba4f06
Successfully built word2number
Installing collected packages: word2number
Successfully installed word2number-1.1


Unnamed: 0,type,question,options,intended_answer,context,timestamp
0,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Yes, absolutely, I'm completely fine with that.",2024-12-31 22:15:06.880
1,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Sure, I give my consent, no problem at all.",2024-12-31 22:15:06.880
2,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Yep, consider my agreement given; I have no ob...",2024-12-31 22:15:06.880
3,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Indeed, you have my permission to proceed with...",2024-12-31 22:15:06.880
4,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Okay, yes, I definitely agree to those data pr...",2024-12-31 22:15:06.880


## Auxilary Functions


*   Function to convert number words into actual numbers, e.g. "fifty-one" ➡ 51
*   Function to check if options appear explicitly in text



In [3]:
def convert_numbers_in_text(text):
    # Regular expression to find number words contained in questionnaires
    pattern = r'(two thousand|two hundred one|two hundred|fifty-one|thirty-one|twenty-one|sixteen|fifteen|eleven|thirty|twenty|fifty|forty|sixty|ten|five|six|one)'
    # Interesting finding: Regex only works if longer words are in order before shorter that contain similar parts, e.g. fifty-one has to be in front of fifty to work as intended

    def convert(match):
        word = match.group(0)
        try:
            # Convert the word to number
            return str(w2n.word_to_num(word))
        except ValueError:
            return word

    # Replace all number words in the text with their integer equivalents
    converted_text = re.sub(pattern, convert, text, flags=re.IGNORECASE)

    # Now convert ranges like 'twenty to thirty' into '20-30'
    converted_text = re.sub(r'(\d+)\s*(to|and)\s*(\d+)', r'\1-\3', converted_text)

    # Replace text
    # Todo: Dafür noch ne bessere Lösung finden, das ist eig nur n Beispiel und geht auch bei ähnlichen Sätzen nicht
    converted_text = converted_text.replace('more than 2000', 'larger than 2000')
    converted_text = converted_text.replace('More than 2000', 'Larger than 2000')

    return converted_text


def is_exact_or_phrase_match(option, text):
    # Escape the option to handle special characters
    escaped_option = re.escape(option.strip())

    # Pattern to match the option as a full word or part of a phrase
    pattern = rf'\b(?:\w+\s+)*{escaped_option}(?:\s+\w+)*\b'

    # Search for the pattern in the text (case-insensitive)
    return re.search(pattern, text, re.IGNORECASE) is not None

## Generate QA Pipelines


*   deepset/roberta-base-squad2
*   distilbert-base-cased-distilled-squad
*   google-bert/bert-large-uncased-whole-word-masking-finetuned-squad






In [24]:
from transformers import pipeline

qa_pipeline1 = pipeline("question-answering", model="deepset/roberta-base-squad2", topk = 5)

Device set to use cuda:0


In [None]:
qa_pipeline2 = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Device set to use cuda:0


In [None]:
qa_pipeline3 = pipeline("question-answering", model='google-bert/bert-large-uncased-whole-word-masking-finetuned-squad')

config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of the model checkpoint at google-bert/bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Device set to use cuda:0


## Predict Function
Function to extract answers to given questions from a given context for different question types


*   Single-Select:
    * Extract answer from context with QA model
    * Check answer options for similarity with the extracted answer
    * If a matching answer option is found, append it to list of predictions
    * Choose answer with highest confidence value from list
*   Multi-Select:
    * Idea was to handle it the same way as Single-Select
    * Problem: QA models can only extract a single answer from a text
    * Performance with Regex (is_exact_or_phrase_match) is much better
    * Limitation: Regex can only be used, because we have given answer options
*   Date:
    * Time reference is extracted from context with QA model
    * Time reference is parsed to an exact date
    * Intended date is calculated from timestamp and inteded answer
    * Check if predicted date and intended date are the same
*   Number:
    * Extract phone number from context with QA model and compare it to intended answer
*   Text:
    * Skip text questions (nothing to predict)







In [17]:
from pickle import NONE
!pip install dateparser
import dateparser
from datetime import datetime
from datetime import timedelta
!pip install fuzzywuzzy
from fuzzywuzzy import fuzz

def predict_answers(df, pipeline):
    """
    Predict the answer for each row in the DataFrame.
    Prints only incorrectly predicted answers.
    """
    print("[INFO] Printing only incorrectly predicted answers.")
    correct_count = 0
    total_count = 0
    qa_pipeline = pipeline

    for _, row in df.iterrows():

        predictions = []

        # Regex-check only for single- and multi-select questions
        # if row['options']:  # Evaluates to False if options is None or empty
            # converted_context = convert_numbers_in_text(row['context'])

            # for option in row['options']:
            #     # Check for exact match or part of a phrase
            #     exact_match = is_exact_or_phrase_match(option, converted_context)
            #     if exact_match:
            #         predictions.append((option, 0.95))  # 95% confidence for exact match
            #     else:
            #         result = qa_pipeline(question=row['question'], context=f"{converted_context} {option}")
            #         predictions.append((option, result['score']))

        # Nur für Single- und Multi-Select-Fragen
        if row['type'] == "SINGLE_SELECT":
            print(f"\nProcessing question: {row['question']} with type: {row['type']}")  # Welche Frage wird bearbeitet?
            print(f"\nProcessing context: {row['context']}")
            results = qa_pipeline(question=row['question'], context=row['context'])
            print(f"Pipeline results: {results}")

            if isinstance(results, dict):
                # In eine Liste umwandeln, falls ein einzelnes Ergebnis vorliegt
                results = [results]
            elif not isinstance(results, list):
                print(f"Warning: Unexpected output format from qa_pipeline for question: {row['question']}")
                continue

            # Verarbeite die Ergebnisse
            for result in results:
                extracted_answer = result.get('answer', '')
                print(f"Extracted answer: {extracted_answer}")  # Was wurde extrahiert?
                for option in row['options']:
                    similarity_score = fuzz.ratio(extracted_answer.lower(), option.lower())
                    print(f"Checking similarity: '{extracted_answer}' vs '{option}' → Score: {similarity_score}")
                    if similarity_score >= 60:  # Schwellenwert für Ähnlichkeit
                        predictions.append((option, result.get('score', 0)))
            #print(f"Predictions: {predictions}")

        is_correct = False

        # Handle different question types
        if row['type'] == "SINGLE_SELECT":
            if predictions:
                print(f"Before max(): Predictions available: {predictions}")
                predicted_option, confidence = max(predictions, key=lambda x: x[1])
                print(f"Predicted SINGLE_SELECT: {predicted_option} (Confidence: {confidence})")
                print(f"Predicted: {predicted_option}, Correct: {row['intended_answer']}, Match: {predicted_option == row['intended_answer']}")
                is_correct = predicted_option == row['intended_answer']
            else:
                print(f"No predictions found for SINGLE_SELECT: {row['question']}")
                predicted_option = None

        # if row['type'] == "MULTI_SELECT":
        #     print(f"Before filtering: Predictions available: {predictions}")
        #     predicted_option = [option for option, score in predictions if score >= 0.25]
        #     print(f"Filtered predicted options: {predicted_option}")
        #     is_correct = set(predicted_option) == set(row['intended_answer'])

        # if row['type'] == "DATE":
        #     try:
        #         # Basis-Timestamp aus der Dataframe-Spalte (Unix-Timestamp)
        #         base_timestamp = pd.Timestamp(row['timestamp'], unit='ms')

        #         # Extrahiere Zeitangabe aus dem Kontext
        #         extracted_time = qa_pipeline(question=row['question'], context=row['context'])['answer']

        #         # Konvertiere extrahierte Zeitangabe in Sekunden
        #         parsed_date = dateparser.parse(
        #             extracted_time,
        #             settings={'RELATIVE_BASE': base_timestamp.to_pydatetime(), 'PREFER_DATES_FROM': 'future'}
        #         )
        #         if not parsed_date:
        #             raise ValueError(f"Unable to parse date from extracted time: {extracted_time}")

        #         predicted_option = parsed_date
        #         intended_seconds = int(row['intended_answer'])
        #         intended_date = base_timestamp + timedelta(seconds=intended_seconds)
        #         intended_date = intended_date.replace(hour=0, minute=0, second=0, microsecond=0)  # Normalisiere auf den Beginn des Tages

        #         # Vergleich der vorhergesagten und intendierten Daten
        #         #is_correct = predicted_option.date() == intended_date.date()
        #         is_correct = abs((predicted_option - intended_date).days) <= 1
        #         print(f"Extracted time: {extracted_time}, predicted date: {predicted_option.date()}, intended date: {intended_date.date()}")

        #     except Exception as e:
        #         print(f"[ERROR] DATE question processing failed: {e}")

        # if row['type'] == "NUMBER":
        #     try:
        #         predicted_option = qa_pipeline(question=row['question'], context=row['context'])['answer']
        #         is_correct = predicted_option == row['intended_answer']
        #     except Exception as e:
        #         print(f"[ERROR] NUMBER question failed: {e}")

        # Text-Fragen ignorieren
        else:
          continue;

        # Output incorrect predictions
        if not is_correct:
            print(f"Context: {row['context']}")
            print(f"Correct: {row['intended_answer']}, Predicted: {predicted_option}")
            print()

        if is_correct:
            correct_count += 1
        total_count += 1

    # Calculate accuracy
    accuracy = correct_count / total_count if total_count > 0 else 0
    return accuracy




## Evaluate Dataset with Pre-trained Models

In [25]:
accuracy = predict_answers(data, qa_pipeline1)
print(f"Accuracy: {accuracy * 100:.2f} %")

[1;30;43mDie letzten 5000 Zeilen der Streamingausgabe wurden abgeschnitten.[0m
Checking similarity: 'Data Analytics, plain and simple' vs 'Data Analytics' → Score: 61
Checking similarity: 'Data Analytics, plain and simple' vs 'Other' → Score: 11
Extracted answer: where we find the value
Checking similarity: 'where we find the value' vs 'Project Management' → Score: 24
Checking similarity: 'where we find the value' vs 'Customer Relationship Management (CRM)' → Score: 20
Checking similarity: 'where we find the value' vs 'Internal Communications' → Score: 26
Checking similarity: 'where we find the value' vs 'Data Analytics' → Score: 22
Checking similarity: 'where we find the value' vs 'Other' → Score: 21
Extracted answer: Data Analytics,
Checking similarity: 'Data Analytics,' vs 'Project Management' → Score: 36
Checking similarity: 'Data Analytics,' vs 'Customer Relationship Management (CRM)' → Score: 30
Checking similarity: 'Data Analytics,' vs 'Internal Communications' → Score: 37
Che

In [None]:
accuracy = predict_answers(data, qa_pipeline2)
print(f"Accuracy: {accuracy * 100:.2f} %")

[INFO] Printing only incorrectly predicted answers.


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Context: Well, the customer group we need to consider in this case, is made up of both the wholesaler and the distributor types, you see.
Correct: Wholesaler, Distributor, Predicted: Consultant, Planner, Architect

Context: If we're looking at the different customer groups, the ones we're targeting are the wholesaler and the distributor, plain and simple.
Correct: Wholesaler, Distributor, Predicted: End User

Context: For this particular situation, the customer group includes, specifically, those who act as a wholesaler and those who are distributors, yeah that's it.
Correct: Wholesaler, Distributor, Predicted: End User

Context: To be clear, the intended customer group comprises both the wholesaler category and the distributor category, those are the ones we're thinking about.
Correct: Wholesaler, Distributor, Predicted: Consultant, Planner, Architect

Context: Well, when we talk about the customer groups, we're mainly looking at Consultants, Planners, and Architects, those are the fo

In [None]:
accuracy = predict_answers(data, qa_pipeline3)
print(f"Accuracy: {accuracy * 100:.2f} %")

[INFO] Printing only incorrectly predicted answers.
Context: Absolutely not, I do not give my consent for data processing
Correct: No, Predicted: Yes

Context: Okay, so when you're talking about the customer group, we're definitely thinking about folks who are either a wholesaler or a distributor, that’s who we're focused on here.
Correct: Wholesaler, Distributor, Predicted: Consultant, Planner, Architect

Context: To be clear, the intended customer group comprises both the wholesaler category and the distributor category, those are the ones we're thinking about.
Correct: Wholesaler, Distributor, Predicted: R&D

Context: Regarding who we're thinking of as our customers, we can definitely say that Consultants are one group, then there are also Planners, and of course the Architects, those three are key.
Correct: Consultant, Planner, Architect, Predicted: End User

Context: "Well, let me tell you, for follow-up, we've got a few options on the table: we could send out an Email, or maybe w

## Fine-tune QA Model

*   We fine-tuned our best performing model (deepset/roberta-base-squad2) on our own data with the Huggingface Trainer API
*   Problem: Intended answer serves as label
    *   QA model can only be trained with data, where the intended answer appears in exactly the same way in the context
    *   Multi-Select, Date and Text questions were not suitable for training
    *   Lack of suitable data leads to a bad model



In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
!pip install datasets
from datasets import Dataset

def prepare_squad_data(df):
    squad_data = {"data": []}

    for idx, row in df.iterrows():
        question = row["question"]
        context = row["context"]
        intended_answer = row["intended_answer"]
        if (row['type'] == 'DATE') or (row['type'] == 'TEXT'):
          continue

        start_positions = []
        end_positions = []

        if isinstance(intended_answer, list):
            for answer in intended_answer:
                answer = str(answer)  # Konvertiere jeden Eintrag in einen String
                start = context.find(answer)
                if start != -1:
                    start_positions.append(start)
                    end_positions.append(start + len(answer))
                else:
                    print(f"Warnung: Antwort '{answer}' nicht im Kontext enthalten. Überspringe Datensatz.")
                    continue
        # Single-Select (nur ein String)
        elif isinstance(intended_answer, str):
            answer = intended_answer
            start = context.find(answer)
            if start != -1:
                start_positions.append(start)
                end_positions.append(start + len(answer))
            else:
                print(f"Warnung: Antwort '{answer}' nicht im Kontext enthalten. Überspringe Datensatz.")
                continue
        else:
            # Wenn der Typ von intended_answer weder Liste noch String ist
            print(f"Warnung: 'intended_answer' hat ein ungültiges Format in Zeile {idx}. Überspringe Datensatz.")
            continue

        squad_data["data"].append({
    "paragraphs": [
        {
            "context": context,
            "qas": [
                {
                    "question": question,
                    "id": f"q_{idx}",
                    "answers": [
                        {"text": answer, "answer_start": start} for answer in intended_answer
                    ],
                    "is_impossible": False
                        }
                    ]
                }
            ]
        })

    return squad_data

def tokenize_squad_data(squad_data, tokenizer, max_length=512):
    """
    Tokenisiert das vorbereitete SQuAD-Dataset.
    """
    # Telefonnummern-Regex (für verschiedene internationale Formate)
    phone_regex = re.compile(r"\+?\d{1,3}[-\s]?\(?\d{1,4}\)?[-\s]?\d{1,4}[-\s]?\d{1,9}")

    tokenized_examples = []

    for data in squad_data["data"]:
        for paragraph in data["paragraphs"]:
            # Kontext vorbereiten: Telefonnummern markieren
            context = paragraph["context"]
            context_matches = phone_regex.findall(context)
            for match in context_matches:
                context = context.replace(match, f"[PHONE_TOKEN_{match}]")

            for qa in paragraph["qas"]:
                question = qa["question"]
                answers = qa["answers"]

                # Antworten vorbereiten: Telefonnummern markieren
                processed_answers = []
                for answer in answers:
                    answer_text = answer["text"]
                    if phone_regex.match(answer_text):
                        answer_text = f"[PHONE_TOKEN_{answer_text}]"
                    processed_answers.append({
                        "text": answer_text,
                        "answer_start": context.find(answer_text)  # Aktualisierte Startposition
                    })

                # Extrahiere die Positionen der Antworten
                start_positions = [answer["answer_start"] for answer in processed_answers]
                answer_texts = [answer["text"] for answer in processed_answers]

                # Tokenize Frage und Kontext
                tokenized_example = tokenizer(
                    question,
                    context,
                    max_length=max_length,
                    truncation="only_second",  # Kontext wird bei Überlänge abgeschnitten
                    padding="max_length",
                    return_offsets_mapping=True
                )

                # Berechne Token-Startpositionen der Antworten (case-insensitive)
                token_start_positions = []
                token_end_positions = []
                context_lower = context.lower()  # Kontext in Kleinbuchstaben
                answer_texts_lower = [answer_text.lower() for answer_text in answer_texts]  # Antworten in Kleinbuchstaben

                for answer_text, answer_text_lower in zip(answer_texts, answer_texts_lower):
                    start_position = context_lower.find(answer_text_lower)  # Suche case-insensitive
                    if start_position != -1:
                        # Füge die tatsächliche Startposition und Endposition basierend auf dem Originaltext hinzu
                        token_start_positions.append(start_position)
                        token_end_positions.append(start_position + len(answer_text))
                    else:
                        # Antwort wurde nicht gefunden
                        print(f"Warning: Answer not found in context for question: {question}")
                        print(f"Intended answer: {answer_text}")
                        print(f"Context: {context}")

                # Tokenisierte Daten speichern
                tokenized_example["start_positions"] = token_start_positions[0] if token_start_positions else 0  # Handle empty list
                tokenized_example["end_positions"] = token_end_positions[0] if token_end_positions else 0  # Handle empty list
                tokenized_examples.append(tokenized_example)


    return tokenized_examples



In [None]:
# Aufteilen in Trainings- und Testdaten, Testdaten werden nicht umgewandelt
train_data, test_data = train_test_split(data, test_size=0.3, stratify=data["type"], random_state=42)

# Trainings- und Validierungsdaten vorbereiten
squad_data = prepare_squad_data(train_data)

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
tokenized_data = tokenize_squad_data(squad_data, tokenizer)

train_data, val_data = train_test_split(tokenized_data, test_size=0.3, random_state=42)

# Dataset Klasse für Huggingface
class SquadDataset:
    def __init__(self, data):
        self.data = Dataset.from_dict({k: [v] for k, v in data[0].items()})
        for item in data[1:]:
            self.data = self.data.add_item(item)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        return {
            "input_ids": item["input_ids"],
            "attention_mask": item["attention_mask"],
            "start_positions": item["start_positions"],
            "end_positions": item["end_positions"]
        }

# Huggingface Datasets erstellen
train_dataset = SquadDataset(train_data)
val_dataset = SquadDataset(val_data)
print(len(train_dataset))

Warnung: Antwort 'Very unsatisfied' nicht im Kontext enthalten. Überspringe Datensatz.
Warnung: Antwort 'Wholesaler, Distributor' nicht im Kontext enthalten. Überspringe Datensatz.
Warnung: Antwort 'Consultant, Planner, Architect' nicht im Kontext enthalten. Überspringe Datensatz.
Warnung: Antwort 'Consultant, Planner, Architect' nicht im Kontext enthalten. Überspringe Datensatz.
Warnung: Antwort 'To check project progress' nicht im Kontext enthalten. Überspringe Datensatz.
Warnung: Antwort 'To communicate with team members' nicht im Kontext enthalten. Überspringe Datensatz.
Warnung: Antwort 'To create new tasks' nicht im Kontext enthalten. Überspringe Datensatz.
Warnung: Antwort 'To update task status' nicht im Kontext enthalten. Überspringe Datensatz.
Warnung: Antwort 'To communicate with team members' nicht im Kontext enthalten. Überspringe Datensatz.
Warnung: Antwort 'To create new tasks' nicht im Kontext enthalten. Überspringe Datensatz.
Warnung: Antwort 'Satisfied' nicht im Konte

In [None]:
# Modell laden
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

# TrainingArguments definieren
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss"
)

# Trainer erstellen
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)

  trainer = Trainer(


In [None]:
# Training starten
trainer.train()

# Save model
from google.colab import drive
drive.mount('/content/drive')
!cp -r ./results /content/drive/MyDrive

Epoch,Training Loss,Validation Loss
1,6.0328,5.860127
2,5.8806,5.78452
3,5.6716,5.734552


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Model laden und Pipeline erstellen
import os

results_path = '/content/drive/MyDrive/results'
folders = [os.path.join(results_path, folder) for folder in os.listdir(results_path) if os.path.isdir(os.path.join(results_path, folder))]
latest_folder = max(folders, key=os.path.getctime)

model_path = latest_folder
print(f"Der zuletzt erstellte Ordner ist: {model_path}")

model = AutoModelForQuestionAnswering.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

qa_pipeline_squad = pipeline("question-answering", model=model, tokenizer=tokenizer)

Der zuletzt erstellte Ordner ist: /content/drive/MyDrive/results/checkpoint-171


Device set to use cuda:0


In [None]:
accuracy = predict_answers(test_data, qa_pipeline_squad)
print(f"Accuracy: {accuracy * 100:.2f} %")

[INFO] Printing only incorrectly predicted answers.
Context: Honestly, I’m mostly just hanging out on this app to go over some product information; that's really the core of what I’m doing today – gotta get the specs and all that!
Correct: To review product information, Predicted: Other

Context: Of course, making sure we can stay in touch is important! The number you're looking for is +33-814-501-1126, and I’m available most of the time.
Correct: +33-814-501-1126, Predicted: I’

Context: Alright, so to answer your question directly, you can use the phone number +49-778-336-7278 if you need to get in touch with me, that's the best way.
Correct: +49-778-336-7278, Predicted: that's the best way.

Context: So, concerning the average size of the trade fair team, we're talking a number that goes beyond 40, yeah, that's about right.
Correct: more than 40, Predicted: 11-15

Context: "Absolutely, I'm happy to give you my phone number, it's +1-395-371-5592. Please feel free to call if you need 