<a href="https://colab.research.google.com/github/croco22/CapstoneProjectTDS/blob/dev/notebooks/Huggingface_QA_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 2: Evaluate Dataset

In [65]:
!pip install word2number

import json
import time
import google.generativeai as genai
from google.colab import userdata
import requests
from word2number import w2n
import re

# API setup
key = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=key)
ai_model = genai.GenerativeModel('gemini-1.5-flash')

# Read dataset file
url = 'https://raw.githubusercontent.com/croco22/CapstoneProjectTDS/refs/heads/main/qa_dataset.json'
response = requests.get(url)

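if response.status_code == 200:
    data = response.json()
    print("Retrieved file: qa_dataset.json")
else:
    # Stop here: later cells need `data`, which would otherwise be undefined
    raise RuntimeError(f"Error while downloading qa_dataset.json: HTTP {response.status_code}")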

Retrieved file: qa_dataset.json


In [66]:
def convert_numbers_in_text(text):
    # Regular expression to find the number words that occur in the questionnaires.
    # The \b anchors keep short words such as 'one' from matching inside other words (e.g. 'money').
    pattern = r'\b(two thousand|two hundred one|two hundred|fifty-one|thirty-one|twenty-one|sixteen|fifteen|eleven|thirty|twenty|fifty|forty|sixty|ten|five|six|one)\b'
    # Interesting finding: alternatives are tried left to right and the first match wins, so longer
    # number words must come before shorter ones they contain, e.g. 'fifty-one' before 'fifty'

    def convert(match):
        word = match.group(0)
        try:
            # Convert the word to number
            return str(w2n.word_to_num(word))
        except ValueError:
            return word

    # Replace all number words in the text with their integer equivalents
    converted_text = re.sub(pattern, convert, text, flags=re.IGNORECASE)

    # Number words are digits by now, so collapse ranges like '20 to 30' or '20 and 30' into '20-30'
    converted_text = re.sub(r'(\d+)\s*(to|and)\s*(\d+)', r'\1-\3', converted_text)

    # Replace text
    # Todo: Find a better solution for this; it is really just an example and does not work for similar sentences
    converted_text = converted_text.replace('more than 2000', 'larger than 2000')
    converted_text = converted_text.replace('More than 2000', 'Larger than 2000')

    return converted_text


def is_exact_or_phrase_match(option, text):
    # Escape the option to handle special characters
    escaped_option = re.escape(option.strip())

    # Pattern to match the option as a full word or part of a phrase
    pattern = rf'\b(?:\w+\s+)*{escaped_option}(?:\s+\w+)*\b'

    # Search for the pattern in the text (case-insensitive)
    return re.search(pattern, text, re.IGNORECASE) is not None
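
The ordering note in `convert_numbers_in_text` follows from how Python's `re` module handles alternation: alternatives are tried left to right and the first one that matches wins, even if a later one would match more text. A minimal sketch of that behaviour follows; the `number_words` list and the length-based sort are illustrative assumptions, not part of the notebook's pattern.

```python
import re

text = "We have fifty-one employees."

# Shorter alternative first: the engine accepts "fifty" and never tries "fifty-one".
short_first = re.search(r"fifty|fifty-one", text).group(0)

# Longer alternative first: the whole number word is matched.
long_first = re.search(r"fifty-one|fifty", text).group(0)

# Hypothetical word list; sorting by length (longest first) gives a safe order
# without arranging the alternatives by hand.
number_words = ["one", "fifty", "fifty-one", "two hundred", "two hundred one"]
ordered = sorted(number_words, key=len, reverse=True)
pattern = "|".join(re.escape(word) for word in ordered)
sorted_match = re.search(pattern, text).group(0)

print(short_first, "|", long_first, "|", sorted_match)
```

Sorting works because any word that contains another as a prefix is necessarily longer, so it always ends up earlier in the alternation.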

## Evaluate different models

In [67]:
from transformers import pipeline

qa_pipeline1 = pipeline("question-answering", model="deepset/roberta-base-squad2")

Device set to use cpu


In [68]:
qa_pipeline2 = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

Device set to use cpu


In [69]:
qa_pipeline3 = pipeline("question-answering", model='google-bert/bert-large-uncased-whole-word-masking-finetuned-squad')

Some weights of the model checkpoint at google-bert/bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [70]:
def predict_answers(data):
    """
    Predict the answer for each option in the JSON data.
    Printing only incorrectly predicted answers.
    """
    print("[INFO] Printing only incorrectly predicted answers.")
    correct_count = 0
    total_count = 0

    for item in data:
        predictions = list()

        # Convert numbers contained in the text to actual integer values
        converted_text = convert_numbers_in_text(item['answer_text'])

        for option in item['possible_answers']:
            # Check for exact match or part of a phrase
            # Todo: Problem: the loop reaches 'unsatisfied' first and assigns it 95 %;
            # 'very unsatisfied' is then assigned 95 % as well, and max() keeps the
            # first of the tied options --> wrong assignment
            exact_match = is_exact_or_phrase_match(option, converted_text)
            if exact_match:
                predictions.append((option, 0.95))  # 95 % sure it's the correct answer
            else:
                # Enter the name of the pipeline you want to test here:
                result = qa_pipeline1(question=item['question'], context=f"{converted_text} {option}")
                predictions.append((option, result['score']))

        predicted_option, confidence = max(predictions, key=lambda x: x[1])

        if predicted_option == item['intended_answer']:
            correct_count += 1
        else:
            print(f"Text: {item['answer_text']}")
            print(f"Correct: {item['intended_answer']}, Predicted: {predicted_option}, Confidence: {round(confidence, 4)} \n")
        total_count += 1

    accuracy = correct_count / total_count if total_count > 0 else 0
    return accuracy
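
The Todo inside `predict_answers` describes a tie: when one option is contained in another (e.g. 'unsatisfied' inside 'very unsatisfied'), both match the text, both receive 0.95, and `max` returns the first of the tied entries. One possible way around this, sketched under the assumption that the longest matching option is the most specific one, is to collect all matching options first and pick the longest. `pick_exact_match` and `option_matches` are hypothetical names, and the matcher is a simplified whole-word search rather than the notebook's `is_exact_or_phrase_match`.

```python
import re


def option_matches(option, text):
    """Simplified whole-word, case-insensitive match (stand-in for the notebook's helper)."""
    pattern = rf"(?<!\w){re.escape(option.strip())}(?!\w)"
    return re.search(pattern, text, re.IGNORECASE) is not None


def pick_exact_match(options, text):
    """Return the longest option found in the text, or None if no option matches."""
    matching = [option for option in options if option_matches(option, text)]
    if not matching:
        return None
    # Assumption: the longest match is the most specific answer.
    return max(matching, key=len)


options = ["Unsatisfied", "Very unsatisfied", "Satisfied"]
answer = pick_exact_match(options, "Honestly, I was very unsatisfied with the service.")
print(answer)
```

In `predict_answers`, the QA pipeline would then only need to be consulted when this function returns `None`.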

In [71]:
accuracy = predict_answers(data)
print(f"Accuracy: {accuracy * 100:.2f} %")

[INFO] Printing only incorrectly predicted answers.
Text: Nope, I'd rather not give consent for data processing, thanks.
Correct: No, Predicted: Yes, Confidence: 0.0 

Text: Oh yeah, I definitely work with wholesalers and distributors, they're a big part of my business.
Correct: Wholesaler, Distributor, Predicted: Consultant, Planner, Architect, Confidence: 0.0 

Text: Absolutely, my customer group consists of wholesalers and distributors – that's the heart of my operations.
Correct: Wholesaler, Distributor, Predicted: End User, Confidence: 0.117 

Text: Oh, the customer group?  That'd be consultants, planners, and architects, mostly.
Correct: Consultant, Planner, Architect, Predicted: Wholesaler, Distributor, Confidence: 0.2673 

Text: In terms of client base, we're focused on architects, planners, and consultants for this particular initiative.
Correct: Consultant, Planner, Architect, Predicted: Wholesaler, Distributor, Confidence: 0.0 

Text: Nope, I'd rather not be bombarded with m

## Interesting Findings

*   Names are predicted very poorly because they carry no deeper meaning for the QA model --> fixed by checking for exact matches
  * Maybe implement a name interpreter later?
*   Numerical values (company size) are predicted very poorly

* QA Pipelines
  * Pipelines 2 and 3 only reach an accuracy of about 60 %
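
To compare pipelines without editing the function body each time, the pipeline can be passed in as an argument. This is a minimal sketch, assuming the same item fields as in `predict_answers` (`question`, `answer_text`, `possible_answers`, `intended_answer`). `accuracy_for`, `toy_pipeline`, and the sample item are hypothetical stand-ins so that the example runs without downloading a model; number conversion and exact matching are left out for brevity.

```python
def accuracy_for(qa_pipeline, data):
    """Share of items whose highest-scoring option equals the intended answer."""
    correct = 0
    for item in data:
        scores = []
        for option in item["possible_answers"]:
            # Same call shape as in the notebook: the option is appended to the context.
            result = qa_pipeline(
                question=item["question"],
                context=f"{item['answer_text']} {option}",
            )
            scores.append((option, result["score"]))
        predicted, _ = max(scores, key=lambda pair: pair[1])
        correct += predicted == item["intended_answer"]
    return correct / len(data) if data else 0.0


def toy_pipeline(question, context):
    """Hypothetical stand-in for a transformers QA pipeline; not a real model."""
    words = context.lower().replace(",", "").replace(".", "").split()
    # Score 1.0 if the appended option (last word) also occurs earlier in the context.
    return {"score": 1.0 if words[-1] in words[:-1] else 0.0}


# Made-up sample item, not taken from qa_dataset.json.
sample = [{
    "question": "Do you consent to data processing?",
    "answer_text": "Yes, that is fine with me.",
    "possible_answers": ["Yes", "No"],
    "intended_answer": "Yes",
}]

toy_accuracy = accuracy_for(toy_pipeline, sample)
print(f"toy_pipeline: {toy_accuracy * 100:.2f} %")
```

With the real models, the same function could be called in a loop over `qa_pipeline1`, `qa_pipeline2`, and `qa_pipeline3` to obtain one accuracy value per pipeline.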

