<a href="https://colab.research.google.com/github/croco22/CapstoneProjectTDS/blob/philipp/notebooks/Huggingface_QA_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 2: Evaluate Dataset

In [1]:
!pip install word2number

import json
import time
import google.generativeai as genai
from google.colab import userdata
import requests
from word2number import w2n
import re

# API setup
key = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=key)
ai_model = genai.GenerativeModel('gemini-1.5-flash')

# Read dataset file
url = 'https://raw.githubusercontent.com/croco22/CapstoneProjectTDS/refs/heads/main/qa_dataset.json'
response = requests.get(url)

if response.status_code == 200:
    data = response.json()
    print("Retrieved file: qa_dataset.json")
else:
    # Fail early: all later cells depend on `data` being defined
    raise RuntimeError(f"Error while downloading qa_dataset.json: HTTP {response.status_code}")

Retrieved file: qa_dataset.json


In [2]:
def convert_numbers_in_text(text):
    # Regular expression matching the number words that occur in the questionnaires
    pattern = r'(two thousand|two hundred one|two hundred|fifty-one|thirty-one|twenty-one|sixteen|fifteen|eleven|thirty|twenty|fifty|forty|sixty|ten|five|six|one)'
    # Interesting finding: regex alternation tries the alternatives from left to right and takes the first one that matches,
    # so longer phrases must be listed before shorter ones they contain, e.g. 'fifty-one' has to come before 'fifty'

    def convert(match):
        word = match.group(0)
        try:
            # Convert the word to number
            return str(w2n.word_to_num(word))
        except ValueError:
            return word

    # Replace all number words in the text with their integer equivalents
    converted_text = re.sub(pattern, convert, text, flags=re.IGNORECASE)

    # Now convert ranges like 'twenty to thirty' into '20-30'
    converted_text = re.sub(r'(\d+)\s*(to|and)\s*(\d+)', r'\1-\3', converted_text)

    return converted_text


def is_exact_or_phrase_match(option, text):
    # Escape the option to handle special characters
    escaped_option = re.escape(option.strip())

    # Pattern to match the option as a full word or part of a phrase
    pattern = rf'\b(?:\w+\s+)*{escaped_option}(?:\s+\w+)*\b'

    # Search for the pattern in the text (case-insensitive)
    return re.search(pattern, text, re.IGNORECASE) is not None
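
The ordering effect noted in `convert_numbers_in_text` can be reproduced in isolation. The sketch below is only an illustration under assumptions: `NUMBER_WORDS`, `build_pattern` and `replace_numbers` are hypothetical names, and a small dictionary stands in for `word2number`. It also shows that a pattern without `\b` word boundaries matches number words inside other words, which may or may not matter for this dataset.

```python
import re

# Hypothetical mini-vocabulary; the notebook uses word2number instead of a dict.
NUMBER_WORDS = {"fifty-one": 51, "fifty": 50, "ten": 10, "one": 1}


def build_pattern(words, use_boundaries):
    # Sort longest first so 'fifty-one' is tried before 'fifty' and 'one'.
    ordered = sorted(words, key=len, reverse=True)
    body = "|".join(re.escape(w) for w in ordered)
    return rf"\b(?:{body})\b" if use_boundaries else rf"(?:{body})"


def replace_numbers(text, pattern):
    # Replace each matched number word with its digits.
    return re.sub(
        pattern,
        lambda m: str(NUMBER_WORDS[m.group(0).lower()]),
        text,
        flags=re.IGNORECASE,
    )


loose = build_pattern(NUMBER_WORDS, use_boundaries=False)
strict = build_pattern(NUMBER_WORDS, use_boundaries=True)
sample = "Often fifty-one people phone us"

# Wrong order: 'fifty' matches first and leaves '-one' behind.
print(re.sub(r"fifty|fifty-one", "X", "fifty-one"))  # X-one
# No word boundaries: 'ten' in 'Often' and 'one' in 'phone' are replaced too.
print(replace_numbers(sample, loose))   # Of10 51 people ph1 us
# With word boundaries only the standalone number word is replaced.
print(replace_numbers(sample, strict))  # Often 51 people phone us
```

Sorting the alternatives by length removes the need to maintain the order by hand whenever a new number word is added.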

## Evaluate different models

In [3]:
from transformers import pipeline

qa_pipeline1 = pipeline("question-answering", model="deepset/roberta-base-squad2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu


In [4]:
qa_pipeline2 = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

Device set to use cpu


In [5]:
qa_pipeline3 = pipeline("question-answering", model='google-bert/bert-large-uncased-whole-word-masking-finetuned-squad')

Some weights of the model checkpoint at google-bert/bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [9]:
def predict_answers(data):
    """
    Predict the intended answer for each item in the JSON data and return the accuracy.
    Only incorrectly predicted answers are printed.
    """
    print("[INFO] Printing only incorrectly predicted answers.")
    correct_count = 0
    total_count = 0

    for item in data:
        predictions = list()

        # Convert numbers contained in the text to actual integer values
        converted_text = convert_numbers_in_text(item['answer_text'])

        for option in item['possible_answers']:
            # Check for exact match or part of a phrase
            exact_match = is_exact_or_phrase_match(option, converted_text)
            if exact_match:
                predictions.append((option, 0.95))  # 95 % sure it's the correct answer
            else:
                # Enter the name of the pipeline you want to test here:
                result = qa_pipeline1(question=item['question'], context=f"{converted_text} {option}")
                predictions.append((option, result['score']))

        # Select the option with the highest confidence; if tied, choose the longer string.
        # This avoids confusing 'Very unsatisfied' with 'unsatisfied' when both options match the text.
        predicted_option, confidence = max(
            predictions,
            key=lambda x: (x[1], len(x[0]))
        )

        if predicted_option == item['intended_answer']:
            correct_count += 1
        else:
            print(f"Text: {item['answer_text']}")
            print(f"Correct: {item['intended_answer']}, Predicted: {predicted_option}, Confidence: {round(confidence, 4)} \n")
        total_count += 1

    accuracy = correct_count / total_count if total_count > 0 else 0
    return accuracy
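
The tie-breaking rule in `predict_answers` relies on Python comparing tuples element by element. A minimal sketch with made-up `(option, confidence)` pairs (the values are hypothetical, not taken from the dataset):

```python
# Hypothetical predictions: both 'unsatisfied' options matched the text exactly.
predictions = [
    ("unsatisfied", 0.95),
    ("Very unsatisfied", 0.95),
    ("Satisfied", 0.12),
]

# Without a tie-breaker, max() returns the first element with the highest score.
first_only = max(predictions, key=lambda p: p[1])[0]
print(first_only)  # unsatisfied

# Tuple keys compare confidence first and string length second.
best_option, best_score = max(predictions, key=lambda p: (p[1], len(p[0])))
print(best_option, best_score)  # Very unsatisfied 0.95
```

Preferring the longer string works here because the shorter option is contained in the longer one, so an exact match of the longer option is the more specific result.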

In [10]:
accuracy = predict_answers(data)
print(f"Accuracy: {accuracy * 100:.2f} %")

[INFO] Printing only incorrectly predicted answers.
Text: Nope, I'd rather not give consent for data processing, thanks.
Correct: No, Predicted: Yes, Confidence: 0.0 

Text: Oh yeah, I definitely work with wholesalers and distributors, they're a big part of my business.
Correct: Wholesaler, Distributor, Predicted: Consultant, Planner, Architect, Confidence: 0.0 

Text: Absolutely, my customer group consists of wholesalers and distributors – that's the heart of my operations.
Correct: Wholesaler, Distributor, Predicted: End User, Confidence: 0.117 

Text: Oh, the customer group?  That'd be consultants, planners, and architects, mostly.
Correct: Consultant, Planner, Architect, Predicted: Wholesaler, Distributor, Confidence: 0.2673 

Text: In terms of client base, we're focused on architects, planners, and consultants for this particular initiative.
Correct: Consultant, Planner, Architect, Predicted: Wholesaler, Distributor, Confidence: 0.0 

Text: Nope, I'd rather not be bombarded with m

## Interesting Findings

*   Names are predicted very poorly because they carry no deeper meaning for the model; this was fixed by checking for exact matches first
  * Maybe implement a name interpreter later?
*   Numerical values (company size) are predicted very poorly

* QA Pipelines
  * Pipelines 2 and 3 only reach an accuracy of roughly 60 %
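
One possible direction for the numerical values is to compare extracted numbers against range-style options directly instead of asking the QA model. The following is only a sketch under assumptions: `match_numeric_option` is a hypothetical helper, and the option labels such as `"11-50"` are invented for illustration, since the actual answer options are not shown here.

```python
import re


def match_numeric_option(text, options):
    """Return the first 'low-high' option containing a number found in text, else None."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    for option in options:
        bounds = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", option)
        if not bounds:
            continue  # skip options that are not written as 'low-high'
        low, high = int(bounds.group(1)), int(bounds.group(2))
        if any(low <= n <= high for n in numbers):
            return option
    return None


# Hypothetical option labels for a company-size question.
options = ["1-10", "11-50", "51-200"]
print(match_numeric_option("We have about 35 employees", options))  # 11-50
print(match_numeric_option("I would rather not say", options))      # None
```

Such a check would run on the output of `convert_numbers_in_text`, so that number words have already been turned into digits.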



# New approach with fine-tuning (so far entirely generated by Chat)

In [17]:
!pip install transformers datasets evaluate

from evaluate import load
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer
import numpy as np



In [18]:
# # Fine-tuning a pre-trained model for question answering

# def prepare_data(data):
#     """
#     Prepare the dataset by extracting relevant fields from the provided structure.
#     """
#     contexts = []
#     questions = []
#     answers = []

#     for item in data:
#         context = f"{item['answer_text']} {' '.join(item['possible_answers'])}"
#         contexts.append(context)
#         questions.append(item['question'])

#         # Assume the intended answer is part of possible_answers
#         answer_start = context.find(item['intended_answer'])
#         answers.append({
#             "text": item['intended_answer'],
#             "answer_start": answer_start if answer_start != -1 else 0
#         })

#     return Dataset.from_dict({
#         "context": contexts,
#         "question": questions,
#         "answers": answers
#     })

# # Tokenizer preparation
# def preprocess_data(examples, tokenizer, max_length=384, doc_stride=128):
#     """
#     Preprocess the dataset: tokenize and align answers.
#     """
#     tokenized_examples = tokenizer(
#         examples["question"],
#         examples["context"],
#         truncation=True,
#         padding="max_length",
#         max_length=max_length,
#         stride=doc_stride,
#         return_overflowing_tokens=True,
#         return_offsets_mapping=True,
#     )

#     offset_mapping = tokenized_examples.pop("offset_mapping")
#     sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

#     start_positions = []
#     end_positions = []

#     for i, offsets in enumerate(offset_mapping):
#         input_ids = tokenized_examples["input_ids"][i]
#         cls_index = input_ids.index(tokenizer.cls_token_id)

#         sample_index = sample_mapping[i]
#         answers = examples["answers"][sample_index]

#         if len(answers["text"]) == 0:
#             start_positions.append(cls_index)
#             end_positions.append(cls_index)
#         else:
#             start_char = answers["answer_start"]
#             end_char = start_char + len(answers["text"])

#             token_start_index = 0
#             token_end_index = len(input_ids) - 1

#             while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
#                 token_start_index += 1
#             while offsets[token_end_index][1] >= end_char:
#                 token_end_index -= 1

#             start_positions.append(token_start_index - 1)
#             end_positions.append(token_end_index + 1)

#     tokenized_examples["start_positions"] = start_positions
#     tokenized_examples["end_positions"] = end_positions

#     return tokenized_examples

# # Training setup
# def train_qa_model(train_data, val_data, model_name="bert-base-uncased", output_dir="./results"):
#     """
#     Train the model using the provided training and validation datasets.
#     """
#     tokenizer = AutoTokenizer.from_pretrained(model_name)
#     model = AutoModelForQuestionAnswering.from_pretrained(model_name)

#     # Preprocess the datasets
#     tokenized_train_data = train_data.map(
#         lambda x: preprocess_data(x, tokenizer), batched=True
#     )
#     tokenized_val_data = val_data.map(
#         lambda x: preprocess_data(x, tokenizer), batched=True
#     )

#     training_args = TrainingArguments(
#         output_dir=output_dir,
#         evaluation_strategy="epoch",
#         learning_rate=3e-5,
#         per_device_train_batch_size=16,
#         num_train_epochs=3,
#         weight_decay=0.01,
#         logging_dir=f"{output_dir}/logs",
#     )

#     trainer = Trainer(
#         model=model,
#         args=training_args,
#         train_dataset=tokenized_train_data,
#         eval_dataset=tokenized_val_data,
#         tokenizer=tokenizer,
#     )

#     trainer.train()
#     model.save_pretrained(output_dir)
#     tokenizer.save_pretrained(output_dir)

#     return model, tokenizer

# # Model evaluation
# def evaluate_model(test_data, model, tokenizer):
#     """
#     Evaluate the trained model on the test dataset.
#     """
#     metric = load("squad")  # Load the SQuAD metric

#     def postprocess_predictions(examples, features, raw_predictions):
#         """
#         Convert raw predictions into the desired format.
#         """
#         all_start_logits, all_end_logits = raw_predictions
#         predictions = []

#         for i, example in enumerate(examples):
#             start_logits = all_start_logits[i]
#             end_logits = all_end_logits[i]

#             max_start = np.argmax(start_logits)
#             max_end = np.argmax(end_logits)

#             predictions.append({
#                 "id": example["id"],
#                 "prediction_text": tokenizer.decode(
#                     tokenizer.convert_ids_to_tokens(features[max_start:max_end + 1]),
#                     skip_special_tokens=True
#                 )
#             })

#         return predictions

#     return metric.compute(
#         predictions=postprocess_predictions(test_data),
#         references=test_data["answers"]
#     )


# # Prepare and split the dataset
# dataset = prepare_data(data)
# train_size = int(0.8 * len(dataset))
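# # Slicing a Dataset (dataset[:n]) returns a plain dict of columns, which has no .map();
# # select() keeps the result a Dataset.
# train_dataset = dataset.select(range(train_size))
# val_dataset = dataset.select(range(train_size, len(dataset)))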

# # Train the model
# model, tokenizer = train_qa_model(train_dataset, val_dataset)

# # Evaluate the model (the SQuAD metric returns a dict with exact_match and f1)
# test_metrics = evaluate_model(val_dataset, model, tokenizer)
# print(f"Test metrics: {test_metrics}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


AttributeError: 'dict' object has no attribute 'map'