<a href="https://colab.research.google.com/github/croco22/CapstoneProjectTDS/blob/philipp/notebooks/01_Dataset_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1: Dataset Generation
This chapter focuses on creating and expanding the dataset for the task. It covers data collection, preprocessing, augmentation, and formatting to ensure compatibility with the model. The goal is to generate high-quality input data that improves model performance and generalization.

**Note**: The secret `GOOGLE_API_KEY` must be configured in your Colab environment for proper execution.

## Imports and Setup

In [None]:
import ast
import random
import time
from datetime import datetime, timedelta
from itertools import combinations

import pandas as pd
import google.generativeai as genai
from google.colab import userdata


# API Setup
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-2.0-flash-exp')


def generate_text(prompt):
    """
    Generates text based on the provided prompt using the genai model. The function sends the prompt
    to the model, with a generation configuration that includes a temperature of 2.0 for creative output.
    It then waits for 5 seconds to avoid exceeding API limits before returning the generated text.
    """
    try:
        response = model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(
                temperature=2.0, # creative output
            )
        )
        time.sleep(5) # avoid exceeding API limits
        return response.text.strip()
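    except Exception as e:
        # exit() accepts a single argument, so raise with a formatted message instead
        raise RuntimeError(f"Error during API call: {e}") from e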
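
The fixed five-second pause covers the common case, but a single transient failure still aborts the whole run. Below is a minimal sketch of a retry wrapper with exponential backoff. The helper `call_with_retry` is hypothetical and not part of the notebook; it wraps any zero-argument callable that raises on failure, for instance a lambda around `model.generate_content`. The `sleep` parameter is an assumption added so the demo runs without actually waiting.

```python
import time


def call_with_retry(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt and try again.

    Hypothetical helper for illustration, not part of the notebook.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# Record the requested delays instead of sleeping.
result = call_with_retry(flaky, sleep=delays.append)
print(result, delays)
```

Doubling the delay after each failure gives a rate-limited API progressively more time to recover, while the attempt cap keeps a permanent error from looping forever.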

## Load the data from the provided questionnaires

In [None]:
dfs = list()

for q in range(1, 6):
    url = f'https://raw.githubusercontent.com/croco22/CapstoneProjectTDS/refs/heads/main/questionnaires/questionnaire{q}.json'
    temp_df = pd.read_json(url)

    # Unpack options into an array
    temp_df['options'] = temp_df['options'].apply(lambda x: [option['option'] for option in x])

    # Remove options for free-form question types (TEXT, NUMBER, DATE),
    # where they are irrelevant and add nothing to the dataset
    temp_df.loc[temp_df['type'].isin(['TEXT', 'NUMBER', 'DATE']), 'options'] = None

    dfs.append(temp_df)

df = pd.concat(dfs, ignore_index=True)

df.head()

Unnamed: 0,type,question,options
0,SINGLE_SELECT,Data processing consent,"[Yes, No]"
1,SINGLE_SELECT,Customer group,"[End User, Wholesaler, Distributor, Consultant..."
2,MULTI_SELECT,Products interested in,"[MY-SYSTEM, Notion, JTS, JS EcoLine, AKW100, A..."
3,MULTI_SELECT,What kind of follow up is planned,"[Email, Phone, Schedule a Visit, No action]"
4,MULTI_SELECT,Who to copy in follow up,"[Stephan Maier, Joachim Wagner, Erik Schneider..."
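
The unpacking step assumes that each questionnaire record stores its options as a list of objects with an `option` key. The sketch below applies the same transformation to plain Python data. The sample records are made up for illustration and only mirror the shape implied by the loading code; `FREE_FORM_TYPES` is a name introduced here.

```python
# Made-up records that mirror the shape implied by the loading code.
records = [
    {"type": "SINGLE_SELECT", "question": "Data processing consent",
     "options": [{"option": "Yes"}, {"option": "No"}]},
    {"type": "TEXT", "question": "Notes", "options": []},
]

FREE_FORM_TYPES = {"TEXT", "NUMBER", "DATE"}

for record in records:
    if record["type"] in FREE_FORM_TYPES:
        # Free-form questions carry no meaningful options.
        record["options"] = None
    else:
        # Keep only the label of each option object.
        record["options"] = [item["option"] for item in record["options"]]

print(records)
```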


## Generate Additional Questions
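In this section, a fixed number of new questions (`n_questions_per_type`) is generated for each question type, using one prompt per answer format (select, text, date, and number). The questions are produced by the experimental Gemini 2.0 Flash model (`gemini-2.0-flash-exp`) configured above, which was chosen because it provided the best results and the fastest responses. Generating the same number of questions for every type keeps the expanded dataset balanced across types.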

In [None]:
# Prompts

select_question = """
    Generate a question that could be asked to an app user in a business
    context, designed as either a single-choice or multiple-choice question.
    Provide the question and an array of answer options in the format:
    [question, [option1, option2, ..., optionN]]
    Respond strictly in this format without additional explanations, comments,
    or text.
"""

text_question = """
    Generate a question that could be asked to an app user in a business
    context, designed as an open text entry question. Return the generated
    question without additional explanations, comments, or text.
"""

date_question = """
    Generate a question asking an app user in a business context to provide a
    date in the future. Return the generated question without additional
    explanations, comments, or text.
"""

number_question = """
    Generate a question asking an app user in a business context to provide a
    phone number. Return the generated question without additional
    explanations, comments, or text.
"""

In [None]:
add_data = list()
n_questions_per_type = 10

for t in df['type'].unique():
    for _ in range(n_questions_per_type):
        if t in ("SINGLE_SELECT", "MULTI_SELECT"):
            # Both select types share one prompt; the model is expected to
            # return a literal of the form [question, [option1, ...]]
            question, options = ast.literal_eval(generate_text(select_question))
        elif t == "TEXT":
            question, options = generate_text(text_question), None
        elif t == "DATE":
            question, options = generate_text(date_question), None
        elif t == "NUMBER":
            question, options = generate_text(number_question), None
        else:
            raise ValueError(f"Unhandled question type: {t}")

        add_data.append({"type": t, "question": question, "options": options})

    time.sleep(30)
    print(f"Generated {n_questions_per_type} questions of type: {t}")

add_df = pd.DataFrame(add_data)

add_df.head()

Generated 10 questions of type: SINGLE_SELECT
Generated 10 questions of type: MULTI_SELECT
Generated 10 questions of type: TEXT
Generated 10 questions of type: DATE
Generated 10 questions of type: NUMBER


Unnamed: 0,type,question,options
0,SINGLE_SELECT,What is your primary goal for using this platf...,"[Analyze data trends, Generate a report, Manag..."
1,SINGLE_SELECT,Which of the following best describes your pri...,"[Increase sales, Improve customer engagement, ..."
2,SINGLE_SELECT,Which department are you primarily working wit...,"[Sales, Marketing, Engineering, Customer Suppo..."
3,SINGLE_SELECT,Which of the following best describes your pri...,"[Tracking project progress, Managing tasks, Co..."
4,SINGLE_SELECT,Which of the following best describes your pri...,"[Task management, Collaboration, Time tracking..."
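
`ast.literal_eval` raises if the model returns anything other than a valid Python literal, for example unquoted text or a Markdown code fence around the answer. Below is a minimal sketch of a stricter parser. The helper `parse_select_question` is hypothetical and not part of the notebook; it strips an optional code fence and validates the shape before the result is used.

```python
import ast

FENCE = "`" * 3  # Markdown code fence marker


def parse_select_question(raw):
    """Parse '[question, [option1, ...]]' and validate its shape.

    Hypothetical helper for illustration; returns (question, options)
    or raises ValueError.
    """
    text = raw.strip()
    # Drop an optional Markdown code fence around the literal.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]
        text = "\n".join(lines).strip()
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError) as err:
        raise ValueError(f"Not a valid literal: {raw!r}") from err
    if (
        not isinstance(parsed, (list, tuple))
        or len(parsed) != 2
        or not isinstance(parsed[0], str)
        or not isinstance(parsed[1], (list, tuple))
        or not all(isinstance(opt, str) for opt in parsed[1])
    ):
        raise ValueError(f"Unexpected structure: {raw!r}")
    return parsed[0], list(parsed[1])


question, options = parse_select_question(
    '["Which department are you in?", ["Sales", "Marketing"]]'
)
print(question, options)
```

Raising a single exception type for every malformed response makes it straightforward to skip or regenerate a bad question instead of stopping the loop.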


## Process Questions
This section generates spoken-style answers for each question. The question type (single-select, multi-select, text, number, or date) determines which dedicated handler produces the answers. Before dispatching to the handler, `process_question` pauses for 10 seconds to space out API requests.

In [None]:
def process_question(data):
    """
    Generate spoken-style answers for the passed question.
    A distinction is made between the different types of questions.
    """
    type_handlers = {
        "SINGLE_SELECT": handle_single_select,
        "MULTI_SELECT": handle_multi_select,
        "TEXT": handle_text,
        "NUMBER": handle_number,
        "DATE": handle_date,
    }

    data_type = data.get('type')
    handler = type_handlers.get(data_type)

    time.sleep(10)

    if handler is None:
        raise ValueError(f"Unhandled data type: {data_type}")
    return handler(data)

### Single-Select

In [None]:
def generate_single_answers(question, option):
    prompt = f"""
        You are an app user responding to the following question in a
        conversational, spoken style. You enjoy talking, so you respond with
        full sentences rather than a simple 'yes' or 'no'.
        Question: '{question}'
        Your response must explicitly convey the provided content: '{option}'.
        Generate 5 unique and varied responses, formatted as:
        'answer1§answer2§...§answer5'.
        Return only the generated responses in the specified format, without
        any additional explanation or comments.
        Do not use quotation marks in the response.
    """
    return generate_text(prompt)


def handle_single_select(data):
    """
    Example output:
    intended_answer: ['Yes', 'Yes', ..., 'No', 'No', ...]
    context: ['Yeah, sure thing, ...', 'Nope, I'd rather ...', ...]
    """
    intended_answer = list()
    context = list()

    for option in data['options']:
        response_text = generate_single_answers(data['question'], option)
        texts_array = [answer.strip() for answer in response_text.split("§")]

        # One label per returned answer keeps both lists the same length
        intended_answer.extend([option] * len(texts_array))
        context.extend(texts_array)

    print(f"Generated context for question: '{data['question']}'")
    return intended_answer, context
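
The handlers rely on the model returning exactly five answers separated by `§`. If it returns a different number, the answer and context lists can end up with different lengths, and `DataFrame.explode` on several columns requires matching element counts. Below is a minimal sketch of a splitter that checks the count up front. The helper `split_answers` is hypothetical and not part of the notebook.

```python
def split_answers(response_text, expected=5, sep="§"):
    """Split a delimited model response and check the number of parts.

    Hypothetical helper for illustration, not part of the notebook.
    """
    # Strip whitespace and stray surrounding quotes from every fragment.
    parts = [part.strip().strip("'\"").strip() for part in response_text.split(sep)]
    # Drop empty fragments, e.g. from a trailing separator.
    parts = [part for part in parts if part]
    if len(parts) != expected:
        raise ValueError(f"Expected {expected} answers, got {len(parts)}")
    return parts


answers = split_answers(
    "'Yes, sure.§Of course.§ Yep, go ahead. §Absolutely.§Fine by me.'"
)
print(answers)
```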

### Multi-Select

In [None]:
def generate_multi_answers(question, options):
    options_text = ", ".join(options)
    prompt = f"""
        You are an app user responding to the following question in a
        conversational, spoken style. You enjoy talking, so you respond with
        full sentences rather than a simple 'yes' or 'no'.
        Question: '{question}'
        Your response must contain all of the following text elements
        explicitly to be valid: '{options_text}'.
        Generate 5 unique and varied responses, formatted as:
        'answer1§answer2§...§answer5'.
        Return only the generated responses in the specified format, without
        any additional explanation or comments.
        Do not use quotation marks in the response.
    """
    return generate_text(prompt)


def handle_multi_select(data):
    """
    Example output:
    intended_answer: [["MY-SYSTEM", "Notion"], ["Notion"], ...]
    context: ['Yeah, that would be MY-SYSTEM and Notion, ...',
        'Hmm, I think I'm mainly interested in Notion ...', ...]
    """
    intended_answer = list()
    context = list()

    # Generate all possible combinations of options (subsets)
    all_combinations = []
    for r in range(1, len(data['options']) + 1):
        all_combinations.extend(list(combinations(data['options'], r)))

    # Only generate answers for a random sample of at most 5 combinations;
    # random.sample already draws in random order, so no prior shuffle is needed
    selected_combinations = random.sample(all_combinations, min(5, len(all_combinations)))

    for combo in selected_combinations:
        response_text = generate_multi_answers(data['question'], combo)
        texts_array = [answer.strip() for answer in response_text.split("§")]

        # One label per returned answer keeps both lists the same length
        intended_answer.extend([combo] * len(texts_array))
        context.extend(texts_array)

    print(f"Generated context for question: '{data['question']}'")
    return intended_answer, context
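
Enumerating every combination grows exponentially with the number of options (a question with 20 options has more than a million non-empty subsets), although only five of them are used. Below is a minimal sketch that draws the subsets directly. The helper `sample_subsets` is hypothetical and not part of the notebook. Every non-empty subset corresponds to exactly one bitmask between 1 and 2**n - 1, so sampling distinct bitmasks is uniform over subsets without building the full list.

```python
import random


def sample_subsets(options, k=5, rng=random):
    """Draw up to k distinct non-empty subsets of options, uniformly at random.

    Hypothetical helper for illustration, not part of the notebook.
    """
    n = len(options)
    total = 2 ** n - 1  # number of non-empty subsets
    # Sampling from a range does not materialise its elements.
    masks = rng.sample(range(1, total + 1), min(k, total))
    return [
        # Bit i of the mask decides whether options[i] is in the subset.
        tuple(opt for i, opt in enumerate(options) if (mask >> i) & 1)
        for mask in masks
    ]


choices = ["Email", "Phone", "Schedule a Visit", "No action"]
subsets = sample_subsets(choices)
print(subsets)
```

Passing a seeded `random.Random` instance as `rng` would make the selection reproducible between runs.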

### Text

In [None]:
def generate_text_answers(question):
    prompt = f"""
        You are an app user responding to the following question in a
        conversational, spoken style. You enjoy talking, so you respond with
        full sentences rather than a simple 'yes' or 'no'.
        Question: '{question}'
        Generate 5 unique and varied responses, formatted as:
        'answer1§answer2§...§answer5'.
        Return only the generated responses in the specified format, without
        any additional explanation or comments.
        Do not use quotation marks in the response.
    """
    return generate_text(prompt)


def handle_text(data):
    """
    Example output:
    intended_answer: [None, None, ...]
    context: ['You can only reach me on ...', 'I have no notes to add.', ...]
    """
    intended_answer = list()
    context = list()

    response_text = generate_text_answers(data['question'])
    texts_array = [answer.strip() for answer in response_text.split("§")]

    # One placeholder per returned answer keeps both lists the same length
    intended_answer.extend([None] * len(texts_array))
    context.extend(texts_array)

    print(f"Generated context for question: '{data['question']}'")
    return intended_answer, context

### Number

In [None]:
def generate_phone_number():
    """
    Generates a random phone number in an international format.
    """
    country_code = random.choice(["+1", "+44", "+49", "+33", "+91"])
    area_code = random.randint(100, 999)
    local_number = f"{random.randint(100, 999)}-{random.randint(1000, 9999)}"
    return f"{country_code}-{area_code}-{local_number}"


def generate_number_answers(question, option):
    prompt = f"""
        You are an app user responding to the following question in a
        conversational, spoken style. You enjoy talking, so you respond with
        full sentences rather than a simple 'yes' or 'no'.
        Question: '{question}'
        Your response must contain the following phone number explicitly
        to be valid: '{option}'.
        Generate 5 unique and varied responses, formatted as:
        'answer1§answer2§...§answer5'.
        Return only the generated responses in the specified format, without
        any additional explanation or comments.
        Do not use quotation marks in the response.
    """
    return generate_text(prompt)


def handle_number(data):
    """
    Example output:
    intended_answer: ['+44-770-555-1234', ...]
    context: ['My number is +44-770-555-1234.', ...]
    """
    intended_answer = list()
    context = list()

    phone_numbers = [generate_phone_number() for _ in range(5)]

    for option in phone_numbers:
        response_text = generate_number_answers(data['question'], option)
        texts_array = [answer.strip() for answer in response_text.split("§")]

        # One label per returned answer keeps both lists the same length
        intended_answer.extend([option] * len(texts_array))
        context.extend(texts_array)

    print(f"Generated context for question: '{data['question']}'")
    return intended_answer, context
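
The prompt asks the model to include the phone number explicitly, but nothing verifies that it did. Below is a minimal sketch of such a check. The helper `answers_missing_value` is hypothetical and not part of the notebook; it only detects verbatim matches, so a number that the model spells out in words would be reported as missing.

```python
def answers_missing_value(answers, value):
    """Return the answers that do not contain value verbatim.

    Hypothetical helper for illustration, not part of the notebook.
    """
    return [answer for answer in answers if value not in answer]


phone = "+49-123-456-7890"
answers = [
    "Sure, you can reach me at +49-123-456-7890.",
    "Just call my office.",
]
missing = answers_missing_value(answers, phone)
print(missing)
```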

### Date

In [None]:
def generate_date_answers(question):
    prompt = f"""
        You are an app user responding to the following question in a
        conversational, spoken style. You enjoy talking, so you respond with
        full sentences rather than a simple 'yes' or 'no'.
        Question: '{question}'
        Your answer must contain a time reference in the future, such as
        'tomorrow', 'in three weeks', etc.
        Additionally you have to give a calculation reference as an Integer
        value for this in seconds without naming a fixed date,
        e.g. 'tomorrow'=86400; 'in three weeks'=1814400. Generate 5 unique
        and varied responses, formatted as:
        'answer1§answer2§...§answer5§calculation1§calculation2§...§calculation5'.
        Return only the generated responses in the specified format, without
        any additional explanation or comments.
        Do not use quotation marks in the response.
    """
    return generate_text(prompt)


def handle_date(data):
    """
    Example output:
    intended_answer: ['86400', '1814400', ...]  (offsets in seconds, kept as strings)
    context: ['Tomorrow would be good ...', 'How about in three weeks ...', ...]
    """
    response_text = generate_date_answers(data['question'])
    response_text = response_text.strip('"').strip("'")
    texts_array = [answer.strip() for answer in response_text.split("§")]

    intended_answer = texts_array[5:]
    context = texts_array[:5]

    print(f"Generated context for question: '{data['question']}'")
    return intended_answer, context
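
`handle_date` keeps the offsets as the strings returned by the model and does not check how many fields came back. Below is a minimal sketch that validates the field count, converts the offsets to integers, and resolves them against a reference time. The helper `parse_date_response` and the fixed `reference` timestamp are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def parse_date_response(response_text, n=5, sep="§"):
    """Split 'a1§...§an§s1§...§sn' into n answers and n integer offsets.

    Hypothetical helper for illustration, not part of the notebook.
    """
    parts = [part.strip() for part in response_text.strip("'\"").split(sep)]
    if len(parts) != 2 * n:
        raise ValueError(f"Expected {2 * n} fields, got {len(parts)}")
    answers = parts[:n]
    # int() raises ValueError if the model returned a non-integer offset.
    offsets = [int(part) for part in parts[n:]]
    return answers, offsets


answers, offsets = parse_date_response(
    "Tomorrow works.§In three weeks, please.§86400§1814400", n=2
)

# Resolve each offset in seconds against a fixed reference time.
reference = datetime(2025, 1, 1, 12, 0)
target_dates = [reference + timedelta(seconds=s) for s in offsets]
print(answers, offsets, target_dates)
```

Storing the offset rather than an absolute date keeps the label valid regardless of when the answer was recorded; the absolute date is obtained by adding the offset to the row's timestamp.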

## Create User Inputs (Context)
This section generates the user inputs (the context) for every question. The original and the generated questions are merged into one DataFrame, `process_question` is applied to each row to produce the intended answers together with the matching spoken-style responses, and each question receives a random timestamp from the last 30 days. Finally, `explode` expands the two list columns so that every row holds exactly one pair of intended answer and context.

In [None]:
def generate_timestamp():
    """
    Generate a random timestamp within the last 30 days.
    """
    start_date = datetime.now() - timedelta(days=30)
    random_seconds = random.randint(0, 30 * 24 * 60 * 60)
    random_date = start_date + timedelta(seconds=random_seconds)
    return random_date


df = pd.concat([df, add_df], ignore_index=True)

new_cols = ['intended_answer', 'context']

df[new_cols] = df.apply(
    lambda row: pd.Series(process_question(row)), axis=1
)

df['timestamp'] = [generate_timestamp() for _ in range(len(df))]

df = df.explode(new_cols, ignore_index=True)

Generated context for question: 'Data processing consent'
Generated context for question: 'Customer group'
Generated context for question: 'Products interested in'
Generated context for question: 'What kind of follow up is planned'
Generated context for question: 'Who to copy in follow up'
Generated context for question: 'Would you like to receive marketing information from via e-mail?'
Generated context for question: 'What industry are you operating in?'
Generated context for question: 'What products are you interested in?'
Generated context for question: 'Notes'
Generated context for question: 'What type of company is it?'
Generated context for question: 'What is the size of your company?'
Generated context for question: 'When do you wish to receive a follow-up?'
Generated context for question: 'Any additional notes?'
Generated context for question: 'Which language is wanted for communication? '
Generated context for question: 'What is the type of contact?'
Generated context for ques
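
The explode step turns one row per question into one row per answer. The sketch below reproduces that expansion in plain Python, without pandas, to show the resulting shape. The sample row is made up for illustration.

```python
# One row per question, with parallel lists in the two generated columns.
rows = [
    {
        "question": "Data processing consent",
        "intended_answer": ["Yes", "No"],
        "context": ["Sure, I give my consent.", "No, I would rather not."],
    },
]

exploded = []
for row in rows:
    if len(row["intended_answer"]) != len(row["context"]):
        raise ValueError(f"Length mismatch for question: {row['question']!r}")
    for answer, context in zip(row["intended_answer"], row["context"]):
        # Scalar columns are repeated; list columns contribute one element each.
        exploded.append({**row, "intended_answer": answer, "context": context})

print(exploded)
```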

In [None]:
print("Number of unique questions:", df['question'].nunique())
print("Number of Q&A pairs (= number of rows in dataset):", len(df))

Number of unique questions: 75
Number of Q&A pairs (= number of rows in dataset): 1450
            type                 question    options intended_answer  \
0  SINGLE_SELECT  Data processing consent  [Yes, No]             Yes   
1  SINGLE_SELECT  Data processing consent  [Yes, No]             Yes   
2  SINGLE_SELECT  Data processing consent  [Yes, No]             Yes   
3  SINGLE_SELECT  Data processing consent  [Yes, No]             Yes   
4  SINGLE_SELECT  Data processing consent  [Yes, No]             Yes   

                                             context  \
0    Yes, absolutely, I'm completely fine with that.   
1        Sure, I give my consent, no problem at all.   
2  Yep, consider my agreement given; I have no ob...   
3  Indeed, you have my permission to proceed with...   
4  Okay, yes, I definitely agree to those data pr...   

                   timestamp  
0 2024-12-31 22:15:06.880838  
1 2024-12-31 22:15:06.880838  
2 2024-12-31 22:15:06.880838  
3 2024-12-31 22:15:06.

In [None]:
# Save dataset to a new JSON file
df.to_json('qa_dataset.json', orient='records', indent=4)