<a href="https://colab.research.google.com/github/croco22/CapstoneProjectTDS/blob/main/notebooks/01_Dataset_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1: Dataset Generation
This chapter focuses on creating and expanding the dataset. It covers data collection, preprocessing and formatting to ensure compatibility with the model. The goal is to generate high-quality input data that improves model performance.

The secret `GOOGLE_API_KEY` must be configured in your Google Colab environment for proper execution.

**Remark**: It should be noted that the 503 error encountered when calling the Gemini API is an error on Google's side, typically due to temporary issues such as server overload or maintenance. This can happen when the API service is unable to handle the incoming requests at that moment. For example, a similar issue was reported in the Google AI forum: https://discuss.ai.google.dev/t/error-the-model-is-overloaded/48410

## Imports and Setup

In [1]:
import ast
import random
import time
from datetime import datetime, timedelta
from itertools import combinations

import pandas as pd
import google.generativeai as genai
from google.colab import userdata


# API Setup
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-2.0-flash-exp')


def generate_text(prompt):
    """
    Generates text based on the provided prompt using the Gemini API. The function sends the prompt
    to the model, with a generation configuration that includes a temperature of 2.0 for creative output.
    It then waits for 5 seconds to avoid exceeding API limits before returning the generated text.
    """
    try:
        response = model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(
                temperature=2.0, # creative output
            )
        )
        time.sleep(5) # avoid exceeding API limits
        return response.text.strip()
    except Exception as e:
        exit("Error during API call: ", e)

## Load the data from the provided questionnaires

In [2]:
dfs = list()

for q in range(1, 6):
    url = f'https://raw.githubusercontent.com/croco22/CapstoneProjectTDS/refs/heads/main/questionnaires/questionnaire{q}.json'
    temp_df = pd.read_json(url)

    # Unpack options into an array
    temp_df['options'] = temp_df['options'].apply(lambda x: [option['option'] for option in x])

    # Remove options for specific question types
    temp_df.loc[temp_df['type'].isin(['TEXT', 'NUMBER', 'DATE']), 'options'] = None

    dfs.append(temp_df)

df = pd.concat(dfs, ignore_index=True)

df.head()

Unnamed: 0,id,type,question,options
0,aa2d8cdd-0758-4035-b0b6-ca18e2f380d8,SINGLE_SELECT,Data processing consent,"[Yes, No]"
1,12e1ed1d-edaa-4e93-8645-de3850e998f9,SINGLE_SELECT,Customer group,"[End User, Wholesaler, Distributor, Consultant..."
2,625012ae-9192-4cf6-a73d-55e1813d6014,MULTI_SELECT,Products interested in,"[MY-SYSTEM, Notion, JTS, JS EcoLine, AKW100, A..."
3,0699fc5a-34a4-4160-bda1-fb135a3615da,MULTI_SELECT,What kind of follow up is planned,"[Email, Phone, Schedule a Visit, No action]"
4,815dab84-bc5e-4764-9777-0c0126e3173e,MULTI_SELECT,Who to copy in follow up,"[Stephan Maier, Joachim Wagner, Erik Schneider..."


## Generate Additional Questions
In this chapter, a set number of new questions is generated for each question type using various prompts. The process leverages the Gemini API 2.0 (experimental), as it provides the best results and fastest performance. This approach ensures a diverse and well-balanced dataset expansion.

In [3]:
# Prompts
select_question = f"""
    Generate a question that could be asked to an app user in a business
    context, designed as either a single-choice or multiple-choice question.
    Provide the question and an array of answer options in the format:
    [question, [option1, option2, ..., optionN]]
    Respond strictly in this format without additional explanations, comments,
    or text.
"""

text_question = f"""
    Generate a question that could be asked to an app user in a business
    context, designed as an open text entry question. Return the generated
    question without additional explanations, comments, or text.
"""

date_question = f"""
    Generate a question asking an app user in a business context to provide a
    date in the future. Return the generated question without additional
    explanations, comments, or text.
"""

number_question = f"""
    Generate a question asking an app user in a business context to provide a
    phone number. Return the generated question without additional
    explanations, comments, or text.
"""

In [4]:
add_data = list()
n_questions_per_type = 10

for t in df['type'].unique():
    for _ in range(n_questions_per_type):
        if t == "SINGLE_SELECT":
            question, options = ast.literal_eval(generate_text(select_question))
        elif t == "MULTI_SELECT":
            question, options = ast.literal_eval(generate_text(select_question))
        elif t == "TEXT":
            question, options = generate_text(text_question), None
        elif t == "DATE":
            question, options = generate_text(date_question), None
        elif t == "NUMBER":
            question, options = generate_text(number_question), None
        else:
            exit(f"Unhandled question type: {t}")

        add_data.append({"type": t, "question": question, "options": options})

    time.sleep(30)
    print(f"Generated {n_questions_per_type} questions of type: {t}")

add_df = pd.DataFrame(add_data)

add_df.head()

Generated 10 questions of type: SINGLE_SELECT


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 556.82ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 557.70ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.65ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 506.37ms


Generated 10 questions of type: MULTI_SELECT
Generated 10 questions of type: TEXT


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 532.27ms


Generated 10 questions of type: DATE
Generated 10 questions of type: NUMBER


Unnamed: 0,type,question,options
0,SINGLE_SELECT,Which department do you primarily work with?,"[Sales, Marketing, Engineering, Customer Suppo..."
1,SINGLE_SELECT,What is your primary goal for using this app t...,"[Submit an expense report, Review project prog..."
2,SINGLE_SELECT,What is your primary reason for accessing the ...,"[Review project progress, Update task status, ..."
3,SINGLE_SELECT,Which of the following areas of our business a...,"[Marketing & Sales, Product Development, Custo..."
4,SINGLE_SELECT,What is your primary objective for using this ...,"[Generate sales leads, Manage customer interac..."


## Process Questions
This chapter focuses on generating spoken-style answers based on the question type. A distinction is made between different question types, including single-select, multi-select, text, number, and date. Each type is processed using a dedicated handler to ensure appropriate answer generation. The implementation introduces a delay to manage request timing before executing the corresponding function.

In [5]:
def process_question(data):
    """
    Generate spoken-style answers for the passed question.
    A distinction is made between the different types of questions.
    """
    type_handlers = {
        "SINGLE_SELECT": handle_single_select,
        "MULTI_SELECT": handle_multi_select,
        "TEXT": handle_text,
        "NUMBER": handle_number,
        "DATE": handle_date,
    }

    data_type = data.get('type')
    handler = type_handlers.get(data_type)

    time.sleep(10)

    if handler:
        return handler(data)
    else:
        exit(f"Unhandled data type: {data_type}")

### Single-Select

In [6]:
def generate_single_answers(question, option):
    prompt = f"""
        You are an app user responding to the following question in a
        conversational, spoken style. You enjoy talking, so you respond with
        full sentences rather than a simple 'yes' or 'no'.
        Question: '{question}'
        Your response must explicitly convey the provided content: '{option}'.
        Generate 5 unique and varied responses, formatted as:
        'answer1§answer2§...§answer5'.
        Return only the generated responses in the specified format, without
        any additional explanation or comments.
        Do not use quotation marks in the response.
    """
    return generate_text(prompt)


def handle_single_select(data):
    """
    Example output:
    intended_answer: ['Yes', 'Yes', ..., 'No', 'No', ...]
    context: ['Yeah, sure thing, ...', 'Nope, I'd rather ...', ...]
    """
    intended_answer = list()
    context = list()

    for option in data['options']:
        response_text = generate_single_answers(data['question'], option)
        texts_array = [answer.strip() for answer in response_text.split("§")]

        intended_answer.extend([option] * 5)
        context.extend(texts_array)

    print(f"Generated context for question: '{data['question']}'")
    return intended_answer, context

### Multi-Select

In [7]:
def generate_multi_answers(question, options):
    options_text = ", ".join(options)
    prompt = f"""
        You are an app user responding to the following question in a
        conversational, spoken style. You enjoy talking, so you respond with
        full sentences rather than a simple 'yes' or 'no'.
        Question: '{question}'
        Your response must contain all of the following text elements
        explicitly to be valid: '{options_text}'.
        Generate 5 unique and varied responses, formatted as:
        'answer1§answer2§...§answer5'.
        Return only the generated responses in the specified format, without
        any additional explanation or comments.
        Do not use quotation marks in the response.
    """
    return generate_text(prompt)


def handle_multi_select(data):
    """
    Example output:
    intended_answer: [["MY-SYSTEM", "Notion"], ["Notion"], ...]
    context: ['Yeah, that would be MY-SYSTEM and Notion, ...',
        'Hmm, I think I'm mainly interested in Notion ...', ...]
    """
    intended_answer = list()
    context = list()

    # Generate all possible combinations of options (subsets)
    all_combinations = list()
    for r in range(1, len(data['options']) + 1):
        all_combinations.extend(list(combinations(data['options'], r)))

    # Shuffle combinations for randomness
    random.shuffle(all_combinations)

    # Only generate answers for a random sample of combinations
    selected_combinations = random.sample(all_combinations, min(5, len(all_combinations)))

    for combo in selected_combinations:
        response_text = generate_multi_answers(data['question'], combo)
        texts_array = [answer.strip() for answer in response_text.split("§")]

        intended_answer.extend([combo] * 5)
        context.extend(texts_array)

    print(f"Generated context for question: '{data['question']}'")
    return intended_answer, context

### Text

In [8]:
def generate_text_answers(question):
    prompt = f"""
        You are an app user responding to the following question in a
        conversational, spoken style. You enjoy talking, so you respond with
        full sentences rather than a simple 'yes' or 'no'.
        Question: '{question}'
        Generate 5 unique and varied responses, formatted as:
        'answer1§answer2§...§answer5'.
        Return only the generated responses in the specified format, without
        any additional explanation or comments.
        Do not use quotation marks in the response.
    """
    return generate_text(prompt)


def handle_text(data):
    """
    Example output:
    intended_answer: [None, None, ...]
    context: ['You can only reach me on ...', 'I have no notes to add.', ...]
    """
    intended_answer = list()
    context = list()

    response_text = generate_text_answers(data['question'])
    texts_array = [answer.strip() for answer in response_text.split("§")]

    intended_answer.extend([None] * 5)
    context.extend(texts_array)

    print(f"Generated context for question: '{data['question']}'")
    return intended_answer, context

### Number

In [9]:
def generate_phone_number():
    """
    Generates a random phone number in an international format.
    """
    country_code = random.choice(["+1", "+44", "+49", "+33"])
    area_code = random.randint(100, 999)
    local_number = f"{random.randint(100, 999)}-{random.randint(1000, 9999)}"
    return f"{country_code}-{area_code}-{local_number}"


def generate_number_answers(question, option):
    prompt = f"""
        You are an app user responding to the following question in a
        conversational, spoken style. You enjoy talking, so you respond with
        full sentences rather than a simple 'yes' or 'no'.
        Question: '{question}'
        Your response must contain the following phone number explicitly
        to be valid: '{option}'.
        Generate 5 unique and varied responses, formatted as:
        'answer1§answer2§...§answer5'.
        Return only the generated responses in the specified format, without
        any additional explanation or comments.
        Do not use quotation marks in the response.
    """
    return generate_text(prompt)


def handle_number(data):
    """
    Example output:
    intended_answer: ['+44-7700-900123', ...]
    context: ['My number is +44-7700-900123.', ...]
    """
    intended_answer = list()
    context = list()

    phone_numbers = [generate_phone_number() for _ in range(5)]

    for option in phone_numbers:
        response_text = generate_number_answers(data['question'], option)
        texts_array = [answer.strip() for answer in response_text.split("§")]

        intended_answer.extend([option] * 5)
        context.extend(texts_array)

    print(f"Generated context for question: '{data['question']}'")
    return intended_answer, context

### Date

In [10]:
def generate_date_answers(question):
    prompt = f"""
        You are an app user responding to the following question in a
        conversational, spoken style. You enjoy talking, so you respond with
        full sentences rather than a simple 'yes' or 'no'.
        Question: '{question}'
        Your answer must contain a time reference in the future, such as
        'tomorrow', 'in three weeks', etc.
        Build a natural-sounding spoken response around this time reference.
        Generate 5 unique and varied responses, each incorporating a different
        future time reference, formatted as:
        'answer1§answer2§...§answer5§reference1§reference2§...§reference5'.
        Return only the generated responses in the specified format, without
        any additional explanation or comments.
        Do not use quotation marks in the response.
    """
    return generate_text(prompt)


def handle_date(data):
    """
    Example output:
    intended_answer: ['tomorrow', 'in three weeks', ...]
    context: ['Tomorrow would be good ...', 'How about in three weeks ...', ...]
    """
    response_text = generate_date_answers(data['question'])
    response_text = response_text.strip('"').strip("'")
    texts_array = [answer.strip() for answer in response_text.split("§")]

    intended_answer = texts_array[5:]
    context = texts_array[:5]

    print(f"Generated context for question: '{data['question']}'")
    return intended_answer, context

## Create User Inputs (Context)
This section focuses on generating relevant user inputs for each question to provide meaningful context. By incorporating realistic responses, the model can better understand different question types and generate more natural, conversational answers. The context ensures that each question is framed appropriately, improving the quality and coherence of the generated responses.

In [11]:
def generate_timestamp():
    """
    Generate random timestamp within the last 30 days.
    """
    start_date = datetime.now() - timedelta(days=30)
    random_seconds = random.randint(0, 30 * 24 * 60 * 60)
    random_date = start_date + timedelta(seconds=random_seconds)
    return random_date


df = pd.concat([df, add_df], ignore_index=True)

new_cols = ['intended_answer', 'context']

df[new_cols] = df.apply(
    lambda row: pd.Series(process_question(row)), axis=1
)

df['timestamp'] = [generate_timestamp() for _ in range(len(df))]

df = df.explode(new_cols, ignore_index=True)

Generated context for question: 'Data processing consent'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 507.35ms


Generated context for question: 'Customer group'
Generated context for question: 'Products interested in'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 506.53ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 533.16ms


Generated context for question: 'What kind of follow up is planned'
Generated context for question: 'Who to copy in follow up'
Generated context for question: 'Would you like to receive marketing information from via e-mail?'
Generated context for question: 'What industry are you operating in?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 557.69ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.82ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 507.87ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 532.22ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 557.66ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.08ms


Generated context for question: 'What products are you interested in?'
Generated context for question: 'Notes'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.60ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 859.53ms


Generated context for question: 'What type of company is it?'
Generated context for question: 'What is the size of your company?'
Generated context for question: 'When do you wish to receive a follow-up?'
Generated context for question: 'Any additional notes?'
Generated context for question: 'Which language is wanted for communication? '


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 557.44ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 532.78ms


Generated context for question: 'What is the type of contact?'
Generated context for question: 'What is the contact person interested in?'
Generated context for question: 'What phone number can we use for contact?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 532.02ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 557.23ms


Generated context for question: 'When does the contact person wish to receive a follow up?'
Generated context for question: 'Customer type'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.84ms


Generated context for question: 'Customer satisfaction'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 506.08ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 658.29ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 506.74ms


Generated context for question: 'Size of the trade fair team (on average)'
Generated context for question: 'CRM-System'
Generated context for question: 'Productinterests'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 506.72ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 759.81ms


Generated context for question: 'Searches a solution for'
Generated context for question: 'Next steps'
Generated context for question: 'Which department do you primarily work with?'
Generated context for question: 'What is your primary goal for using this app today?'
Generated context for question: 'What is your primary reason for accessing the project dashboard today?'
Generated context for question: 'Which of the following areas of our business are you most interested in learning about?'
Generated context for question: 'What is your primary objective for using this platform today?'
Generated context for question: 'Which of the following best describes your primary goal for using this app today?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 633.39ms


Generated context for question: 'What is the primary goal of your meeting today?'
Generated context for question: 'Which of these tasks would improve your efficiency today?'
Generated context for question: 'Which of the following areas would you like to see improved in our project management process?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 507.73ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 505.97ms


Generated context for question: 'Which of the following best describes your primary work responsibility?'
Generated context for question: 'What is your primary reason for using our project management tool today?'
Generated context for question: 'What is your primary role in your current project?'
Generated context for question: 'What is your primary goal for using this reporting feature today?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.56ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 532.03ms


Generated context for question: 'What is your primary goal for using this application today?'
Generated context for question: 'Which department do you primarily work within?'
Generated context for question: 'What is your primary reason for accessing the CRM today?'
Generated context for question: 'Which of the following best describes your primary reason for using our expense tracking app this week?'
Generated context for question: 'Which department are you primarily working with this week?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.68ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.87ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 556.89ms


Generated context for question: 'What is your primary reason for using this expense reporting app?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 508.19ms


Generated context for question: 'Which of these areas would benefit most from additional investment?'
Generated context for question: 'How could we improve your experience using this app for your business?'
Generated context for question: 'What are some areas where you believe our product could better meet your business needs?'
Generated context for question: 'How can our services be improved to better meet your needs?'
Generated context for question: 'What are the primary challenges you currently face in managing your team's workload?'
Generated context for question: 'What are your biggest challenges in managing your current projects?'
Generated context for question: 'What are the biggest challenges you're currently facing in managing your projects?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 533.07ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 532.70ms


Generated context for question: 'How can we improve your experience using this app for your business needs?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 632.88ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.70ms


Generated context for question: 'What challenges are you currently facing in your daily workflow?'
Generated context for question: 'Describe how our service has impacted your business operations.'
Generated context for question: 'Describe the most significant challenge your team is currently facing.'
Generated context for question: 'What is the desired completion date for this project?'
Generated context for question: 'What is the projected completion date?'
Generated context for question: 'When would you like this task to be completed by?'
Generated context for question: 'What date in the future would you like to schedule this?'
Generated context for question: 'What is the expected completion date?'
Generated context for question: 'What is the anticipated completion date?'
Generated context for question: 'What date would you like to schedule this for?'
Generated context for question: 'What is the anticipated completion date?'
Generated context for question: 'When would you like this t

ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 556.53ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.47ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 556.89ms


Generated context for question: 'Could you please provide your phone number for verification and to facilitate direct communication regarding your account?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 1164.97ms


Generated context for question: 'To ensure seamless communication, what's the best phone number to reach you?'


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash-exp:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 531.02ms


Generated context for question: 'What is the best phone number to reach you at for business inquiries?'
Generated context for question: 'Could you please provide your phone number so we can contact you if needed?'
Generated context for question: 'What is the best phone number to reach you at for business purposes?'
Generated context for question: 'Could you please provide your phone number so we can reach you regarding your business inquiry?'
Generated context for question: 'Could you please provide your phone number for verification?'
Generated context for question: 'What phone number can we use to contact you regarding this business inquiry?'
Generated context for question: 'To better assist you, what is your business phone number?'
Generated context for question: 'To best reach you, could you please provide your phone number?'


In [12]:
print("Number of unique questions:", df['question'].nunique())
print("Number Q&A-pairs (= number of rows in dataset):", len(df))

Number of unique questions: 73
Number Q&A-pairs (= number of rows in dataset): 1430


In [13]:
# Save dataset to a new JSON file
df.to_json('qa_dataset.json', orient='records', indent=4)