<a href="https://colab.research.google.com/github/croco22/CapstoneProjectTDS/blob/main/notebooks/03_Dataset_Continuous.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extra Task 1: Continuous Dataset Evaluation
In this step, the model has to answer several questions at once, which makes the task more complex. Combining multiple questions in a single request tests how well the model handles such inputs and simulates a real-world scenario in which users address several queries in one interaction.

The secrets `GOOGLE_API_KEY` and `HF_TOKEN` must be configured in your Colab environment for proper execution.

## Imports and Setup

In [None]:
%%capture
!pip install dateparser
!pip install fuzzywuzzy

import random
import re
import time
from datetime import timedelta

import dateparser
import google.generativeai as genai
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from fuzzywuzzy import fuzz
from google.colab import userdata
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import accuracy_score, f1_score, jaccard_score, precision_score, recall_score
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

In [None]:
# API setup
genai.configure(api_key=userdata.get('GOOGLE_API_KEY'))
model = genai.GenerativeModel('gemini-2.0-flash-exp')
userdata.get('HF_TOKEN')


def generate_text(prompt):
    """
    Generates text based on the provided prompt using the genai model. The function sends the prompt
    to the model, with a generation configuration that includes a temperature of 2.0 for creative output.
    It then waits for 10 seconds to avoid exceeding API limits before returning the generated text.
    """
    try:
        response = model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(
                temperature=2.0, # creative output
            )
        )
        time.sleep(10) # avoid exceeding API limits
        return response.text.strip()
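    except Exception as e:
        # Abort with a readable message; the original exception stays chained for debugging
        raise RuntimeError(f"Error during API call: {e}") from e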

In [None]:
url = 'https://raw.githubusercontent.com/croco22/CapstoneProjectTDS/refs/heads/main/qa_dataset.json'
data = pd.read_json(url)

data.head()

Unnamed: 0,type,question,options,intended_answer,context,timestamp
0,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Absolutely, you've got my consent for that dat...",2025-01-29 13:37:12.404
1,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Yes, you can absolutely process my data, that'...",2025-01-29 13:37:12.404
2,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Without any hesitation, I can confirm yes, tha...",2025-01-29 13:37:12.404
3,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"Okay then, for that question regarding data pr...",2025-01-29 13:37:12.404
4,SINGLE_SELECT,Data processing consent,"[Yes, No]",Yes,"That sounds perfectly fine to me, I wholeheart...",2025-01-29 13:37:12.404


## Continuous Text Generation
To generate a coherent text, the answer contexts of three questions are combined. Because a text that answers several phone-number or date questions at once would be unrealistic, a random pre-selection is made first: it contains only one NUMBER and one DATE entry, plus three SINGLE_SELECT and three MULTI_SELECT questions. From this set of eight questions, three are drawn at random and their contexts are joined into a single text.

Previously, we tested a **text summarization model from Hugging Face** for this task. However, the performance was quite poor. When the output length was too long, it did not summarize the content but rather just appended the text. On the other hand, when the output length was too short, important elements of the response were omitted.
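The pre-selection described in the first paragraph of this section can be sketched in plain Python on a toy question pool. The pool entries and the helper name `preselect_and_pick` are illustrative assumptions and not part of the dataset; the notebook's own implementation works on the pandas DataFrame instead.

```python
import random

# Toy question pool (hypothetical entries, not taken from qa_dataset.json)
POOL = (
    [{"type": "SINGLE_SELECT", "context": f"single answer {i}"} for i in range(5)]
    + [{"type": "MULTI_SELECT", "context": f"multi answer {i}"} for i in range(5)]
    + [{"type": "NUMBER", "context": f"phone answer {i}"} for i in range(5)]
    + [{"type": "DATE", "context": f"date answer {i}"} for i in range(5)]
)


def preselect_and_pick(pool, rng):
    """Build the eight-question pre-selection, then draw three questions from it."""

    def draw(q_type, k):
        # Sample k distinct questions of the given type
        candidates = [q for q in pool if q["type"] == q_type]
        return rng.sample(candidates, k)

    # 3 single-select + 3 multi-select + 1 number + 1 date = 8 questions
    preselect = (
        draw("SINGLE_SELECT", 3)
        + draw("MULTI_SELECT", 3)
        + draw("NUMBER", 1)
        + draw("DATE", 1)
    )
    chosen = rng.sample(preselect, 3)
    combined_text = " ".join(q["context"] for q in chosen)
    return preselect, chosen, combined_text


preselect, chosen, combined_text = preselect_and_pick(POOL, random.Random(0))
print(len(preselect), [q["type"] for q in chosen])
print(combined_text)
```

Because the pre-selection holds a single NUMBER and a single DATE entry, any combined text contains at most one phone number and at most one date.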

In [None]:
prompt = """
    Summarize the following text that answers different questions, which may
    not necessarily be related to the same context. The summary should provide
    a concise version of the text, ensuring that key details and answer
    components are preserved without simply appending or omitting important
    information. Return only the summarized answer text, without quotation marks.
    Text:
"""

In [None]:
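def generate_sample_df(data, n=10, advanced=False):
    """
    Builds n combined texts. For each text, three questions are drawn from a pre-selection of
    eight, and their contexts are joined (advanced=False) or summarized by the model (advanced=True).
    The three rows of a group share the same combined context.
    """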
    new_rows = []

    for _ in range(n):
        single_rows = data[data['type'] == 'SINGLE_SELECT'].sample(n=3)
        multi_rows = data[data['type'] == 'MULTI_SELECT'].sample(n=3)
        num_row = data[data['type'] == 'NUMBER'].sample(n=1)
        date_row = data[data['type'] == 'DATE'].sample(n=1)
        preselect = pd.concat([single_rows, multi_rows, num_row, date_row], ignore_index=True)

        random_rows = preselect.sample(n=3)
        combined_text = " ".join(random_rows['context'])
        if advanced:
            combined_text = generate_text(prompt + combined_text)
        random_rows['context'] = combined_text
        new_rows.append(random_rows)

    new_df = pd.concat(new_rows, ignore_index=True)
    return new_df

In [None]:
df_simple = generate_sample_df(data, n=50)
df = generate_sample_df(data, advanced=True)

df.head()

Unnamed: 0,type,question,options,intended_answer,context,timestamp
0,MULTI_SELECT,Productinterests,"[BusinessCards, DataEnrichment, VisitReport, D...","[DataEnrichment, VisitReport, DataQuality]",The speaker prioritizes DataQuality as a found...,2025-01-31 13:17:55.404
1,SINGLE_SELECT,"In the past month, which of these best describ...","[Project Management, Customer Relationship Man...",Customer Relationship Management,The speaker prioritizes DataQuality as a found...,2025-01-06 16:28:28.404
2,DATE,What is the anticipated project completion date?,,tomorrow,The speaker prioritizes DataQuality as a found...,2025-01-30 12:15:08.404
3,NUMBER,For verification purposes and to allow direct ...,,+49-220-262-2230,The business phone number is +49-220-262-2230....,2025-01-21 15:27:28.404
4,SINGLE_SELECT,What type of company is it?,"[Construction company, Craft enterprises, Scaf...",Construction company,The business phone number is +49-220-262-2230....,2025-01-20 04:06:12.404


In [None]:
print("Simple texts:")
for text in df_simple['context'][::3][:5]:
    print(text)

print("\nAdvanced texts:")
for text in df['context'][::3][:5]:
    print(text)

Simple texts:
At this point, describing our posture on reaching projected ROI, I'm inclined to think Somewhat confident reflects my outlook while recognizing potential risks still persist; I would temper this slight concern with saying we are otherwise Moderately confident overall because those main concerns seem like they can be turned into a manageable risk with a small amount of further attention. So, you're wondering about when the project will wrap up and the deliverables will be available? Hmmm, my gut says we’re probably looking at having all of that in our hands in around a fortnight or thereabouts. Well, you know, the field that I'm involved in is definitely medical, it's fascinating actually
I mean it’s definitely important to know is it someone from a Supplier company? Are they maybe instead a New customer / Prospect showing interest? It could even be a call from Press / media reaching out for comment – we get them all so keeping them straight is important, yes Alright so wh

## Clustering (no longer in use)
This section explores clustering techniques to group similar contexts or responses based on their characteristics. By applying clustering methods, patterns and structures within the dataset can be identified, improving organization and analysis.

The idea was to first cluster the texts during evaluation to make complex tasks, such as identifying notes that are not clearly assignable, more manageable. However, we ultimately decided to evaluate answer texts that address multiple questions separately for each question they contain. This approach led to significantly better results, so we abandoned the clustering method.

In [None]:
transformer_model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
for text in df['context']:
    # Split the text into sentences, strip surrounding spaces and quotes, and drop empty fragments
    sentences = [s.strip(' "') for s in re.split(r'[.!?]', text) if s.strip(' "')]

    # Ensure there are at least two sentences for clustering
    if len(sentences) < 2:
        print("Skipping text: Not enough sentences for clustering.")
        continue

    # Generate sentence embeddings using the transformer model
    sentence_embeddings = transformer_model.encode(sentences)
    similarity_matrix = cosine_similarity(sentence_embeddings)

    # Use at most 3 clusters (one per combined question), fewer if the text has fewer sentences
    n_clusters = min(len(sentences), 3)

    clustering = AgglomerativeClustering(n_clusters=n_clusters, metric='cosine', linkage='average')

    try:
        # Perform clustering
        labels = clustering.fit_predict(sentence_embeddings)
    except ValueError as e:
        print(f"Error during clustering: {e}")
        continue

    # Organize sentences into clusters
    clusters = {}
    for sentence, label in zip(sentences, labels):
        clusters.setdefault(label, []).append(sentence)

    # Print the clusters
    for i, (_, grouped_sentences) in enumerate(clusters.items()):
        print(f"Cluster {i}:")
        for sentence in grouped_sentences:
            print(f"* {sentence}")
    print()

Cluster 0:
* The speaker prioritizes DataQuality as a foundation, then explores DataEnrichment
Cluster 1:
* They heavily value VisitReports for concise summaries, and have primarily used the tools for Customer Relationship Management in the past 30 days
Cluster 2:
* The project is anticipated to conclude tomorrow if current progress is maintained

Cluster 0:
* The speaker prioritizes DataQuality as a foundation, then explores DataEnrichment
Cluster 1:
* They heavily value VisitReports for concise summaries, and have primarily used the tools for Customer Relationship Management in the past 30 days
Cluster 2:
* The project is anticipated to conclude tomorrow if current progress is maintained

Cluster 0:
* The speaker prioritizes DataQuality as a foundation, then explores DataEnrichment
Cluster 1:
* They heavily value VisitReports for concise summaries, and have primarily used the tools for Customer Relationship Management in the past 30 days
Cluster 2:
* The project is anticipated to con

## Evaluation
The evaluation follows the approach used for Task 2 in the notebook `02_Dataset_Evaluation.ipynb`.

## Auxiliary Functions

In [None]:
def calculate_metrics(y_true, y_pred):
    if not y_true:  # ensure y_true is not empty to avoid errors
        return {"Accuracy": 0, "Precision": 0, "Recall": 0, "F1 Score": 0, "Jaccard Score": 0}
    else:
        return {
            "Accuracy": accuracy_score(y_true, y_pred),
            "Precision": precision_score(y_true, y_pred, zero_division=0),
            "Recall": recall_score(y_true, y_pred, zero_division=0),
            "F1 Score": f1_score(y_true, y_pred, zero_division=0),
            "Jaccard Score": jaccard_score(y_true, y_pred, zero_division=0),
        }


def plot_metrics(metrics_per_type, overall_metrics):
    metric_names = ["Accuracy", "Precision", "Recall", "F1 Score", "Jaccard Score"]
    plt.figure(figsize=(8, 5))
    df_heatmap = pd.DataFrame(metrics_per_type).T
    sns.heatmap(df_heatmap, annot=True, cmap="coolwarm", linewidths=0.5, vmin=0, vmax=1)
    plt.title("Metrics per question type")
    plt.show()
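
To make the five scores concrete, the sketch below recomputes them by hand for binary labels. It is a plain-Python stand-in for the scikit-learn calls, not a replacement for them; the function name `binary_metrics` and the toy labels are assumptions. It follows the labelling used in `predict_answers`, where `y_true` is 1 whenever an intended answer exists and `y_pred` is 1 when the prediction was judged correct.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and Jaccard for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    def safe_div(numerator, denominator):
        # Mirror zero_division=0: return 0 when the denominator is empty
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "Accuracy": safe_div(tp + tn, len(pairs)),
        "Precision": precision,
        "Recall": recall,
        "F1 Score": safe_div(2 * precision * recall, precision + recall),
        "Jaccard Score": safe_div(tp, tp + fp + fn),
    }


# Four questions with an intended answer each; the second one was predicted wrongly
scores = binary_metrics([1, 1, 1, 1], [1, 0, 1, 1])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

When every row has an intended answer, `y_true` consists only of ones, so there are no false positives: precision is 1.0 as soon as one answer is correct, and recall and the Jaccard score both equal accuracy. Under that assumption, accuracy is the most informative column of the heatmap.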

In [None]:
def predict_answers(df, qa_pipeline):
    metrics_per_type = dict()

    type_metrics = {t: {"y_true": [], "y_pred": []} for t in ["SINGLE_SELECT", "MULTI_SELECT", "DATE", "NUMBER"]}

    for _, row in df.iterrows():
        predictions = list()
        is_correct = False

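        if row['type'] in ("SINGLE_SELECT", "MULTI_SELECT"):
            if not row['options']:
                print(f"Warning: No options given for question: {row['question']}")
                continue
            # convert_numbers_in_text is not defined in this notebook; it is assumed to be
            # the helper from 02_Dataset_Evaluation.ipynb
            converted_context = convert_numbers_in_text(row['context'])
            # Extract answers from context (top_k > 1)
            results = qa_pipeline_ms(question=row['question'], context=converted_context)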

            if isinstance(results, dict):
                # Convert to list if a single object is found
                results = [results]
            elif not isinstance(results, list):
                print(f"Warning: Unexpected output format from qa_pipeline for question: {row['question']}")  # For SINGLE_SELECT, could the output also be a plain string?
                continue

            # Check answer for similarity with given answer options
            for result in results:
                extracted_answer = result.get('answer', '')
                for option in row['options']:
                    similarity_score = fuzz.ratio(extracted_answer.lower(), option.lower())
                    if similarity_score >= 60:  # Threshold for similarity
                        predictions.append((option, result.get('score', 0)))

        if row['type'] == "SINGLE_SELECT":
            # Select prediction with highest confidence
            if predictions:
                predicted_option, confidence = max(predictions, key=lambda x: x[1])
                is_correct = predicted_option == row['intended_answer']
            else:
                print(f"No predictions found for SINGLE_SELECT: {row['question']}")
                predicted_option = None

        if row['type'] == "MULTI_SELECT":
            if predictions:
                # Select all answers that matched an option
                predicted_option = []
                for option, score in predictions:
                    if option not in predicted_option:
                        predicted_option.append(option)
                is_correct = set(predicted_option) == set(row['intended_answer'])
            else:
                print(f"No predictions found for MULTI_SELECT: {row['question']}")
                predicted_option = None

        if row['type'] == "DATE":
            try:
                # Base timestamp from dataframe column (Unix-Timestamp)
                base_timestamp = pd.Timestamp(row['timestamp'], unit='ms')

                # Extract time expression and convert it to an exact date
                extracted_time = qa_pipeline(question=row['question'], context=row['context'])['answer']
                parsed_date = dateparser.parse(
                    extracted_time,
                    settings={'RELATIVE_BASE': base_timestamp.to_pydatetime(), 'PREFER_DATES_FROM': 'future'}
                )
                if not parsed_date:
                    raise ValueError(f"Unable to parse date from extracted time: {extracted_time}")

                predicted_option = parsed_date

                # Calculate intended date from intended answer
                intended_time = row['intended_answer']
                intended_date = dateparser.parse(
                    intended_time,
                    settings={'RELATIVE_BASE': base_timestamp.to_pydatetime(), 'PREFER_DATES_FROM': 'future'}
                )

                # Compare predicted and intended date, one day buffer for more robust results
                is_correct = abs((predicted_option - intended_date).days) <= 1
                print(f"Extracted time: {extracted_time}, predicted date: {predicted_option.date()}, intended date: {intended_date.date()}")

            except Exception as e:
                print(f"[ERROR] DATE question processing failed: {e}")

        if row['type'] == "NUMBER":
            try:
                # Extract phone number from context with QA pipeline (top_k = 1)
                predicted_option = qa_pipeline(question=row['question'], context=row['context'])['answer']
                is_correct = predicted_option == row['intended_answer']
            except Exception as e:
                print(f"[ERROR] NUMBER question failed: {e}")

        # Ignore TEXT questions
        if row['type'] == "TEXT":
            continue

        # # Convert predictions and correct answers to binary form to calculate metrics
        # if row['type'] in ["SINGLE_SELECT", "MULTI_SELECT", "NUMBER"]:
        #     y_true.append(1 if row['intended_answer'] else 0)  # 1 = correct answer exists
        #     y_pred.append(1 if is_correct else 0)  # 1 = predicted correctly
        if row['type'] in type_metrics:
            type_metrics[row['type']]['y_true'].append(1 if row['intended_answer'] else 0)
            type_metrics[row['type']]['y_pred'].append(1 if is_correct else 0)

    # Calc metric for each type
    for q_type, data in type_metrics.items():
        metrics_per_type[q_type] = calculate_metrics(data["y_true"], data["y_pred"])

    # Calc overall metrics
    y_true_total = sum([data["y_true"] for data in type_metrics.values()], [])
    y_pred_total = sum([data["y_pred"] for data in type_metrics.values()], [])
    overall_metrics = calculate_metrics(y_true_total, y_pred_total)

    return metrics_per_type, overall_metrics
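
The option-matching step inside `predict_answers` can be tried in isolation. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so individual scores may differ from fuzzywuzzy's; `similarity` and `match_options` are hypothetical helper names, and the default threshold of 60 mirrors the one used in the function above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score for two strings, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_options(extracted_answer, options, threshold=60):
    """Return every option whose similarity to the extracted answer reaches the threshold."""
    return [opt for opt in options if similarity(extracted_answer, opt) >= threshold]


options = ["BusinessCards", "DataEnrichment", "VisitReport", "DataQuality"]

# The QA model returned a plural form that is not literally among the options
for opt in options:
    print(opt, similarity("VisitReports", opt))
print(match_options("VisitReports", options))
```

A threshold below 100 lets near-matches such as plural forms or minor spelling differences count as a hit. Setting it too low risks mapping an extracted span to an unrelated option, so the value is worth checking against a few real extractions.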

In [None]:
qa_pipeline1 = pipeline("question-answering", model="deepset/roberta-base-squad2")
qa_pipeline_ms = pipeline("question-answering", model="deepset/roberta-base-squad2", top_k=10)



### Predict answers for simple DataFrame

In [None]:
metrics, overall = predict_answers(df_simple, qa_pipeline1)

print("Overall metrics:")
for key, value in overall.items():
    print(f"{key}: {value:.4f}")
print()

plot_metrics(metrics, overall)

[INFO] Printing only incorrectly predicted answers.
Context: At this point, describing our posture on reaching projected ROI, I'm inclined to think Somewhat confident reflects my outlook while recognizing potential risks still persist; I would temper this slight concern with saying we are otherwise Moderately confident overall because those main concerns seem like they can be turned into a manageable risk with a small amount of further attention. So, you're wondering about when the project will wrap up and the deliverables will be available? Hmmm, my gut says we’re probably looking at having all of that in our hands in around a fortnight or thereabouts. Well, you know, the field that I'm involved in is definitely medical, it's fascinating actually
Correct: in a fortnight, Predicted: None

Context: I mean it’s definitely important to know is it someone from a Supplier company? Are they maybe instead a New customer / Prospect showing interest? It could even be a call from Press / media r

### Predict answers for summarized DataFrame

In [None]:
metrics, overall = predict_answers(df, qa_pipeline1)

print("Overall metrics:")
for key, value in overall.items():
    print(f"{key}: {value:.4f}")
print()

plot_metrics(metrics, overall)

[INFO] Printing only incorrectly predicted answers.
Context: The speaker prioritizes DataQuality as a foundation, then explores DataEnrichment. They heavily value VisitReports for concise summaries, and have primarily used the tools for Customer Relationship Management in the past 30 days. The project is anticipated to conclude tomorrow if current progress is maintained.
Correct: ['DataEnrichment', 'VisitReport', 'DataQuality'], Predicted: None

Context: The speaker prioritizes DataQuality as a foundation, then explores DataEnrichment. They heavily value VisitReports for concise summaries, and have primarily used the tools for Customer Relationship Management in the past 30 days. The project is anticipated to conclude tomorrow if current progress is maintained.
Correct: Customer Relationship Management, Predicted: Data Analysis

Context: The speaker prioritizes DataQuality as a foundation, then explores DataEnrichment. They heavily value VisitReports for concise summaries, and have pri