<a href="https://colab.research.google.com/github/croco22/CapstoneProjectTDS/blob/main/notebooks/03_Dataset_Continuous.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extra Task 1: Continuous Dataset Evaluation
In this step, the model must answer several questions from a single input, which makes the task harder than the single-question setup. The contexts of multiple questions are combined into one request to test how well the model handles more complex inputs, simulating a real-world scenario in which a user addresses several queries in one interaction.
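As a rough illustration of the idea, the sketch below merges the contexts of three questions into one shared text while each question keeps its own intended answer. The records are hypothetical toy data whose field names mirror the dataset columns, and `combine_contexts` is an illustrative helper, not part of the notebook.

```python
import random

# Hypothetical toy records; field names mirror the dataset columns.
records = [
    {"question": "Data processing consent", "intended_answer": "Yes",
     "context": "Sure, I give my consent, no problem at all."},
    {"question": "What type of company is it?", "intended_answer": "Craft enterprises",
     "context": "We are a small craft enterprise."},
    {"question": "Toy question A", "intended_answer": "Option 1",
     "context": "I would go with option 1."},
    {"question": "Toy question B", "intended_answer": "Option 2",
     "context": "Option 2 suits us best."},
]


def combine_contexts(records, n=3, seed=0):
    """Pick n distinct records and give all of them one merged context."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    chosen = rng.sample(records, n)
    combined = " ".join(r["context"] for r in chosen)
    # Each question keeps its own intended answer but shares the merged text.
    return [{**r, "context": combined} for r in chosen]


batch = combine_contexts(records)
for row in batch:
    print(row["question"], "->", row["intended_answer"])
```

Every row in `batch` now carries the same long context, so a model has to locate the part of the text that belongs to its question.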

In [1]:
!pip install dateparser
!pip install fuzzywuzzy

import random
import re
import time
from datetime import timedelta
import dateparser
import google.generativeai as genai
import pandas as pd
from fuzzywuzzy import fuzz
from google.colab import userdata
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline


# API setup
genai.configure(api_key=userdata.get('GOOGLE_API_KEY'))
model = genai.GenerativeModel('gemini-1.5-flash')
userdata.get('HF_TOKEN')


def generate_text(prompt):
    """
    Generates text based on the provided prompt using the genai model. The function sends the prompt
    to the model, with a generation configuration that includes a temperature of 2.0 for creative output.
    It then waits for 5 seconds to avoid exceeding API limits before returning the generated text.
    """
    try:
        response = model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(
                temperature=2.0, # creative output
            )
        )
        time.sleep(5) # avoid exceeding API limits
        return response.text.strip()
    except Exception as e:
        # exit() accepts a single argument, so format the message into one string
        raise SystemExit(f"Error during API call: {e}")


# Read dataset file
url = 'https://raw.githubusercontent.com/croco22/CapstoneProjectTDS/refs/heads/main/qa_dataset.json'
data = pd.read_json(url)

Collecting dateparser
  Downloading dateparser-1.2.0-py2.py3-none-any.whl.metadata (28 kB)
Downloading dateparser-1.2.0-py2.py3-none-any.whl (294 kB)
Installing collected packages: dateparser
Successfully installed dateparser-1.2.0
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0




In [2]:
# TODO: Only distinct questions, at most 1 date and at most 1 number, no text
# AI extension
# Write up that the summarization model produced nonsense

data = data[data['type'].isin(["SINGLE_SELECT", "MULTI_SELECT"])]

data.head()

Unnamed: 0,type,question,options,intended_answer,context,timestamp
0,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Yes, absolutely, I'm completely fine with that.",2024-12-31 22:15:06.880
1,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Sure, I give my consent, no problem at all.",2024-12-31 22:15:06.880
2,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Yep, consider my agreement given; I have no ob...",2024-12-31 22:15:06.880
3,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Indeed, you have my permission to proceed with...",2024-12-31 22:15:06.880
4,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Okay, yes, I definitely agree to those data pr...",2024-12-31 22:15:06.880


In [3]:
def test(data):
    """Build 100 groups of three random rows that share one combined context."""
    # Collect the sampled groups here
    new_rows = []

    for _ in range(100):
        random_rows = data.sample(n=3).copy()  # pick 3 random rows; copy so the assignment below is safe
        combined_text = " ".join(random_rows['context'])  # join the three contexts into one text
        random_rows['context'] = combined_text  # every row in the group gets the combined context
        new_rows.append(random_rows)  # add the group to the list

    # Concatenate all groups into a single DataFrame
    new_df = pd.concat(new_rows, ignore_index=True)
    return new_df

In [4]:
df = test(data)

df.head()

Unnamed: 0,type,question,options,intended_answer,context,timestamp
0,SINGLE_SELECT,What is the primary purpose of this software f...,"[Project Management, Customer Relationship Man...",Data Analytics,"""If I had to boil it down to one thing, one pr...",2024-12-26 22:55:06.880
1,MULTI_SELECT,What is your primary goal for using this app t...,"[Monitor progress on existing projects, Commun...","[Plan new tasks or initiatives, Access documen...","""If I had to boil it down to one thing, one pr...",2025-01-09 21:28:30.880
2,MULTI_SELECT,What is the primary purpose for using our proj...,"[Task management, Collaboration with team memb...","[Task management, Collaboration with team memb...","""If I had to boil it down to one thing, one pr...",2025-01-08 16:22:56.881
3,MULTI_SELECT,What is your primary reason for using our proj...,"[To track project progress, To collaborate wit...",[To track project progress],"Well, you see, mostly today I'm logging in bec...",2025-01-13 04:50:21.881
4,SINGLE_SELECT,What type of company is it?,"[Construction company, Craft enterprises, Scaf...",Craft enterprises,"Well, you see, mostly today I'm logging in bec...",2025-01-19 23:42:36.880


## Clustering (no longer in use)
This section explores clustering as a way to split each combined context back into its individual answers. The sentences of a context are embedded and grouped by similarity, with the aim of obtaining one cluster per question. Grouping related sentences in this way can reveal patterns and structure in the dataset and help keep question-answer pairs diverse yet coherent.
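To make the grouping step concrete, here is a minimal sketch of average-linkage agglomerative clustering with cosine distance on toy 2D vectors. It is a hand-rolled stand-in written for illustration and is not how scikit-learn implements `AgglomerativeClustering`; the function names and the toy vectors are assumptions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage_clusters(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs
                dists = [cosine_distance(vectors[i], vectors[j])
                         for i in clusters[a] for j in clusters[b]]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy "embeddings": two pairs pointing in similar directions plus one outlier
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, 0.0)]
groups = average_linkage_clusters(toy_vectors, n_clusters=3)
print(groups)
```

With real sentence embeddings the vectors have several hundred dimensions, but the merge logic is the same: sentences whose embeddings point in similar directions end up in one cluster.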


In [5]:
# Use a separate name so the Gemini `model` used by generate_text() is not overwritten
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

for text in df['context']:
    # Split the combined context into sentences and drop empty fragments
    sentences = [sentence.strip(' "') for sentence in re.split(r'[.!?]', text)]
    sentences = [sentence for sentence in sentences if sentence]
    sentence_embeddings = embedding_model.encode(sentences)
    # Three contexts were joined per row, so ask for three clusters
    clustering = AgglomerativeClustering(n_clusters=3, metric='cosine', linkage='average')
    try:
        labels = clustering.fit_predict(sentence_embeddings)
    except ValueError as e:
        # Raised, for example, when a context has fewer sentences than clusters
        print(f"Error during clustering: {e}")
        continue

    # Group the sentences by their cluster label
    clusters = {}
    for sentence, label in zip(sentences, labels):
        clusters.setdefault(label, []).append(sentence)

    for i, (_, grouped_sentences) in enumerate(clusters.items()):
        print(f"Cluster {i}:")
        for sentence in grouped_sentences:
            print(f"* {sentence}")
    print()

Cluster 0:
* If I had to boil it down to one thing, one primary reason why we use it, I’d say without hesitation it’s for Data Analytics
* It’s the core functionality for us, the main thing
* Honestly, I'd say at its heart, for me personally the most important feature that this program provides would absolutely have to be centered around robust **Task management**, I mean seriously, keeping an overview on everything that’s happening is invaluable and, like equally importantly is the platform’s wonderful ability for facilitating **Collaboration with team members** because everyone on my teams loves to be able to chat and share everything
Cluster 1:
* Oh, for me, it's a two-parter today
Cluster 2:
* Firstly, I need to Plan new tasks or initiatives, there's quite a backlog brewing, and then secondly I absolutely need to Access documentation and resources to make sure everything aligns with our current guidelines and that I'm following protocol, it's quite important you see

Cluster 0:
* I

# Evaluation

In [6]:
def predict_answers(df, qa_pipeline):
    """
    Predict the answer for each row in the DataFrame using the given
    question-answering pipeline.
    Prints only incorrectly predicted answers.
    Also calculates F1-score, Precision, and Recall.
    """
    print("[INFO] Printing only incorrectly predicted answers.")

    correct_count = 0
    total_count = 0

    y_true = []  # ground-truth labels
    y_pred = []  # predicted labels

    for _, row in df.iterrows():
        predictions = []
        is_correct = False
        predicted_option = None

        if (row['type'] == "SINGLE_SELECT") or (row['type'] == "MULTI_SELECT"):
            results = qa_pipeline(question=row['question'], context=row['context'])

            if isinstance(results, dict):
                results = [results]  # wrap a single result in a list
            elif not isinstance(results, list):
                print(f"Warning: Unexpected output format from qa_pipeline for question: {row['question']}")
                continue

            for result in results:
                extracted_answer = result.get('answer', '')
                for option in row['options']:
                    similarity_score = fuzz.ratio(extracted_answer.lower(), option.lower())
                    if similarity_score >= 50:
                        predictions.append((option, result.get('score', 0)))

        if row['type'] == "SINGLE_SELECT":
            if predictions:
                predicted_option, confidence = max(predictions, key=lambda x: x[1])
                is_correct = predicted_option == row['intended_answer']
            else:
                predicted_option = None

        if row['type'] == "MULTI_SELECT":
            if predictions:
                predicted_option = list(set(option for option, _ in predictions))
                is_correct = set(predicted_option) == set(row['intended_answer'])
            else:
                predicted_option = None

        if row['type'] == "NUMBER":
            try:
                predicted_option = qa_pipeline(question=row['question'], context=row['context'])['answer']
                is_correct = predicted_option == row['intended_answer']
            except Exception as e:
                print(f"[ERROR] NUMBER question failed: {e}")

        # Ignore TEXT questions
        if row['type'] == "TEXT":
            continue

        # Convert predictions and ground truth into binary form for the metrics.
        # Note: y_true is 1 for every row with a non-empty intended answer, so on
        # this dataset recall matches accuracy and precision is 1.0 as soon as
        # any prediction is correct.
        if row['type'] in ["SINGLE_SELECT", "MULTI_SELECT", "NUMBER"]:
            y_true.append(1 if row['intended_answer'] else 0)  # 1 = a correct answer exists
            y_pred.append(1 if is_correct else 0)  # 1 = predicted correctly

        # Print incorrect predictions
        if not is_correct:
            print(f"Context: {row['context']}")
            print(f"Correct: {row['intended_answer']}, Predicted: {predicted_option}")
            print()

        if is_correct:
            correct_count += 1
        total_count += 1

    # Compute the metrics
    accuracy = correct_count / total_count if total_count > 0 else 0
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)

    print(f"\n[INFO] Accuracy: {accuracy:.4f}")
    print(f"[INFO] Precision: {precision:.4f}")
    print(f"[INFO] Recall: {recall:.4f}")
    print(f"[INFO] F1 Score: {f1:.4f}")

    return accuracy, precision, recall, f1
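The option-matching step in `predict_answers` can be sketched in isolation. The example below uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so scores may differ slightly from fuzzywuzzy; `similarity` and `match_options` are hypothetical helper names, and the threshold of 50 is taken from the function above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_options(extracted_answer, options, threshold=50):
    """Return every option whose similarity to the extracted span reaches the threshold."""
    return [option for option in options
            if similarity(extracted_answer, option) >= threshold]


# The extracted span rarely equals an option verbatim, hence the fuzzy comparison
print(match_options("yes", ["Yes", "No"]))
print(match_options("for Data Analytics", ["Project Management", "Data Analytics"]))
```

A low threshold keeps recall high for paraphrased spans, but it can also let several options pass at once, which is why the single-select branch then keeps only the candidate with the highest pipeline score.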

In [7]:
qa_pipeline1 = pipeline("question-answering", model="deepset/roberta-base-squad2")
qa_pipeline_ms = pipeline("question-answering", model="deepset/roberta-base-squad2", top_k=10)  # `topk` is the deprecated spelling

Device set to use cpu
Device set to use cpu


In [8]:
a, p, r, f = predict_answers(df, qa_pipeline1)

[INFO] Printing only incorrectly predicted answers.
Context: "If I had to boil it down to one thing, one primary reason why we use it, I’d say without hesitation it’s for Data Analytics. It’s the core functionality for us, the main thing." Oh, for me, it's a two-parter today! Firstly, I need to Plan new tasks or initiatives, there's quite a backlog brewing, and then secondly I absolutely need to Access documentation and resources to make sure everything aligns with our current guidelines and that I'm following protocol, it's quite important you see! Honestly, I'd say at its heart, for me personally the most important feature that this program provides would absolutely have to be centered around robust **Task management**, I mean seriously, keeping an overview on everything that’s happening is invaluable and, like equally importantly is the platform’s wonderful ability for facilitating **Collaboration with team members** because everyone on my teams loves to be able to chat and share ev