
# Generate Q and A test dataset
In order to test a trained and evaluated model, a test dataset is needed. It is generated from scratch with new questions.

In [2]:
import pandas as pd
import random
import json

import google.generativeai as genai
from IPython.display import display, Markdown
from google.colab import userdata
import time

In [3]:
dfs = []

for i in range(1, 6):
    url = f'https://raw.githubusercontent.com/alexk2206/tds_capstone/refs/heads/main/questionnaires/questionnaire{i}.json'
    df = pd.read_json(url)
    df['options'] = df['options'].apply(lambda x: ', '.join([opt['option'] for opt in x]))
    dfs.append(df)

all_questions = pd.concat(dfs, ignore_index=True)
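
The `options` column arrives as a list of dictionaries per question, and the lambda above joins their `option` values into one comma-separated string. A minimal sketch of that step on a hand-written record follows; the sample record and the helper name `flatten_options` are made up for illustration, only the `option` key is taken from the code above.

```python
# Hypothetical record shaped like the 'options' field of one question;
# only the 'option' key is taken from the notebook code.
raw_options = [{"option": "Yes"}, {"option": "No"}]


def flatten_options(options):
    """Join the 'option' value of every entry into one comma-separated string."""
    return ", ".join(opt["option"] for opt in options)


flat = flatten_options(raw_options)
print(flat)
```

A TEXT question with a single entry passes through the same helper unchanged, since joining one element adds no separator.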

In [4]:
print(f"all_questions shape: {all_questions.shape}")
all_questions

all_questions shape: (25, 4)


| | id | type | question | options |
|---|---|---|---|---|
| 0 | aa2d8cdd-0758-4035-b0b6-ca18e2f380d8 | SINGLE_SELECT | Data processing consent | Yes, No |
| 1 | 12e1ed1d-edaa-4e93-8645-de3850e998f9 | SINGLE_SELECT | Customer group | End User, Wholesaler, Distributor, Consultant,... |
| 2 | 625012ae-9192-4cf6-a73d-55e1813d6014 | MULTI_SELECT | Products interested in | MY-SYSTEM, Notion, JTS, JS EcoLine, AKW100, AX100 |
| 3 | 0699fc5a-34a4-4160-bda1-fb135a3615da | MULTI_SELECT | What kind of follow up is planned | Email, Phone, Schedule a Visit, No action |
| 4 | 815dab84-bc5e-4764-9777-0c0126e3173e | MULTI_SELECT | Who to copy in follow up | Stephan Maier, Joachim Wagner, Erik Schneider,... |
| 5 | 3f34e5b3-1cb0-48ea-93d2-3f21b3371b5d | SINGLE_SELECT | Would you like to receive marketing informatio... | Yes, No |
| 6 | ba042f33-697e-4c6f-924c-b4de2c30f443 | SINGLE_SELECT | What industry are you operating in? | Aerospace, Computers & Networks, Government, M... |
| 7 | 7a776cc0-ffe8-4891-b8a9-dd5ff984de13 | MULTI_SELECT | What products are you interested in? | Automotive radar target simulation, Noise figu... |
| 8 | a0148bc7-15b3-41d5-b97c-6420b8bd927c | TEXT | Notes | Please provide any additional information that... |
| 9 | 5aefc81d-c5d2-41fc-bc7b-6117d1c7671e | SINGLE_SELECT | What type of company is it? | Construction company, Craft enterprises, Scaff... |


# Setting up a prompt for ChatGPT
To create new questions for the test dataset, we asked ChatGPT to generate them. The prompt we used was:

You are a salesman at a trade fair and want to ask customers who visit your exhibition stand questions. You will be given a list of questions, the question types, and possible options to answer each question. I want you to think of completely new questions, their types, and possible answer options. Your aim is to create 20 questions with each type at least once for a new questionnaire for the next trade fair. Keep it short, as you are allowed to use only up to 32 tokens per question and up to 32 tokens for their options. It is important to come up with completely new questions!

Sample questions divided by // : {questions}

Question type divided by // : {type}

Possible answer options per question divided by // : {options}

New questions with type and answer options formatted as a json:

## For {questions} we used this string:

Data processing consent // Customer group // Products interested in // What kind of follow up is planned // Who to copy in follow up // Would you like to receive marketing information from via e-mail? // What industry are you operating in? // What products are you interested in? // Notes // What type of company is it? // What is the size of your company? // When do you wish to receive a follow-up? // Any additional notes? // Which language is wanted for communication? // What is the type of contact? // What is the contact person interested in? // What phone number can we use for contact? // When does the contact person wish to receive a follow up? // Customer type // Customer satisfaction // Size of the trade fair team (on average) // CRM-System // Productinterests // Searches a solution for // Next steps

## For {type} we used this string:

SINGLE_SELECT // SINGLE_SELECT // MULTI_SELECT // MULTI_SELECT // MULTI_SELECT // SINGLE_SELECT // SINGLE_SELECT // MULTI_SELECT // TEXT // SINGLE_SELECT // SINGLE_SELECT // DATE // TEXT // SINGLE_SELECT // MULTI_SELECT // MULTI_SELECT // NUMBER // MULTI_SELECT // SINGLE_SELECT // SINGLE_SELECT // SINGLE_SELECT // SINGLE_SELECT // MULTI_SELECT // MULTI_SELECT // SINGLE_SELECT

## For {options} we used this string:

Yes, No // End User, Wholesaler, Distributor, Consultant, Planner, Architect, R&D // MY-SYSTEM, Notion, JTS, JS EcoLine, AKW100, AX100 // Email, Phone, Schedule a Visit, No action // Stephan Maier, Joachim Wagner, Erik Schneider, Oliver Eibel, Angelina Haug, Marisa Peng, Johannes Wagner, Jessica Hanke, Sandro Kalter, Jens Roschmann, Domiki Stein, Sean Kennin, Tim Persson // Yes, No // Aerospace, Computers & Networks, Government, Medical, Automotive, Defense, Industrial, Network Operators & Infrastructure, Public Safety / Law Enforcement, Physical Security // Automotive radar target simulation, Noise figure measurements, Double-Pulse Testing, Display port debugging and compliance, High-speed interconnect testing // Please provide any additional information that you would like to share. // Construction company, Craft enterprises, Scaffolding company, Trading company, Production company, Education sector // 1-10, 11-50, 51-200, 201-2000, larger than 2000 // Date // What additional information would you like to share? // German, Italian, Japanese, English, Spanish // Existing customer, Supplier, New customer / Prospect, Press / media, Competitor // 100 Additive Manufacturing, 200 Automation, 300 Advanced Manufacturing, 234 Assembly Systems, 256 Joining Systems for large components, Others // phone number // 1 week, 2 weeks, 3 weeks // New customer, Existing customer, Partner, Applicant // Very satisfied, Satisfied, Unsatisfied, Very unsatisfied // 1-5, 6-10, 11-15, 16-20, 21-30, 31-40, more than 40 // Salesforce, Pipedrive, Close.io, Microsoft Dynamics, HubSpot, CAS, SAP Sales Cloud, Adito // BusinessCards, DataEnrichment, VisitReport, Data Cleansing, DataQuality // Scan business cards, Clean up CRM, Extract data from emails, Improve CRM data quality, Capture trade fair contacts // Offer, Meeting, Call
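
The three strings above share one layout: one entry per question, in the same order, joined by ` // `. A minimal sketch of how such strings could be built and substituted into the prompt is shown here. The helper `build_placeholders`, the two sample rows, and the shortened `template` are assumptions for illustration, not the code that produced the strings above.

```python
# Separator used for all three placeholder strings in the prompt.
SEPARATOR = " // "

# Two rows copied from the questionnaire table; a real run would use all 25.
sample_rows = [
    {"question": "Data processing consent", "type": "SINGLE_SELECT",
     "options": "Yes, No"},
    {"question": "What kind of follow up is planned", "type": "MULTI_SELECT",
     "options": "Email, Phone, Schedule a Visit, No action"},
]


def build_placeholders(rows):
    """Return the three '//'-separated strings, keeping row order aligned."""
    return {
        "questions": SEPARATOR.join(r["question"] for r in rows),
        "type": SEPARATOR.join(r["type"] for r in rows),
        "options": SEPARATOR.join(r["options"] for r in rows),
    }


# Shortened, hypothetical template; the full prompt text is given above.
template = (
    "Sample questions divided by // : {questions}\n"
    "Question type divided by // : {type}\n"
    "Possible answer options per question divided by // : {options}\n"
)

placeholders = build_placeholders(sample_rows)
prompt = template.format(**placeholders)
print(prompt)
```

Because the n-th entry of each string describes the same question, the three joins must iterate over the rows in the same order. Options that contain a single slash, such as "Public Safety / Law Enforcement", do not collide with the double-slash separator.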

In [5]:
test_questions_url = 'https://raw.githubusercontent.com/alexk2206/tds_capstone/refs/heads/main/datasets/test_dataset_questions.json'
test_questions = pd.read_json(test_questions_url)
test_questions

| | question | type | options |
|---|---|---|---|
| 0 | How did you hear about our exhibition stand? | SINGLE_SELECT | Social media, Email invitation, Trade fair web... |
| 1 | What is your primary goal at this trade fair? | SINGLE_SELECT | Networking, Finding suppliers, Learning about ... |
| 2 | Which features are most important in a solution? | MULTI_SELECT | Ease of use, Cost efficiency, Scalability, Sec... |
| 3 | How would you prefer to receive product updates? | SINGLE_SELECT | Email, Webinar, Newsletter, Social media, In-p... |
| 4 | Who in your company evaluates new solutions? | MULTI_SELECT | Team leader, IT department, Procurement, CEO, ... |
| 5 | Do you plan to implement a solution within the... | SINGLE_SELECT | Yes, No |
| 6 | What is your preferred method of follow-up? | SINGLE_SELECT | Phone call, Email, Video meeting, In-person vi... |
| 7 | What stage are you in the buying process? | SINGLE_SELECT | Exploration, Evaluation, Decision-making, Alre... |
| 8 | What challenges are you currently facing in yo... | TEXT | Please share specific challenges or issues. |
| 9 | What department are you representing? | SINGLE_SELECT | R&D, Procurement, Marketing, Operations, Other |
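
The prompt asked for 20 questions with every type used at least once, so it can be worth checking the generated JSON against those two requirements before using it. The sketch below is one possible check, not part of the original notebook. It assumes the JSON is a list of records with the keys `question`, `type` and `options`, and that the five types seen in the sample questionnaires are the full set. The function name `check_questions` is hypothetical.

```python
import json

# Types that occur in the sample questionnaires above (assumed to be complete).
EXPECTED_TYPES = {"SINGLE_SELECT", "MULTI_SELECT", "TEXT", "DATE", "NUMBER"}
REQUIRED_KEYS = {"question", "type", "options"}


def check_questions(raw_json, expected_count=20):
    """Return a list of readable problems; an empty list means all checks passed."""
    records = json.loads(raw_json)
    problems = []
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} questions, got {len(records)}")
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            problems.append(f"record {i} is missing {sorted(missing)}")
    unused = EXPECTED_TYPES - {rec.get("type") for rec in records}
    if unused:
        problems.append(f"types never used: {sorted(unused)}")
    return problems


# Tiny hand-written sample, deliberately too short to pass.
sample = json.dumps([
    {"question": "What department are you representing?",
     "type": "SINGLE_SELECT",
     "options": "R&D, Procurement, Marketing, Operations, Other"},
    {"question": "What challenges are you currently facing?",
     "type": "TEXT",
     "options": "Please share specific challenges or issues."},
])

problems = check_questions(sample)
for problem in problems:
    print(problem)
```

Returning a list of problems instead of raising on the first one makes it easier to see everything that needs to be regenerated in a single pass.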


In [6]:
key = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=key)
model = genai.GenerativeModel("gemini-2.0-flash-exp")

In [None]:
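# Upper bound on generated tokens per answer; also quoted inside each prompt.
max_output_tokens = 48

def generate_selection_answer_easy(question, intended_answer):
  """Generate an easy, first-person answer to a selection question with Gemini."""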
  prompt = f"""
  You are asked a question, and you need to provide a natural, conversational answer in the first person. Do not use special characters other than ',' and '.'.
  Act like you really do not know which options there are and the intended answer is your answer.
  When given a range, use a number between the two values.
  Be concise but clear, and avoid unnecessary elaboration. Use up to {max_output_tokens} tokens.
  Question: {question}\n
  Intended answer: {intended_answer}\n
  Answer as a sentence, mentioning and explaining all the provided options:
  """

  response = model.generate_content(
      contents = prompt,
      generation_config = genai.GenerationConfig(
          max_output_tokens=max_output_tokens,
          temperature=2)
    )

  answer = response.text.strip()

  time.sleep(6)

  return {"answer": answer, "difficulty": "easy"}

def generate_number_answer_easy(question, intended_answer):
  """Generate an easy, first-person answer that states the given phone number."""
  prompt = f"""
  You are being asked for contact information, and your response should be clear and concise, as if you're giving someone your phone number and how you can be reached in a conversation.
  Mention the provided phone number and ensure your response sounds natural and professional.
  Your answer should be in the first person, present tense, and only include the relevant details. Use up to {max_output_tokens} tokens.
  Question: {question}\n
  Intended answer: {intended_answer}\n
  Answer as a sentence, providing the phone number and any relevant details:
  """

  response = model.generate_content(
      contents = prompt,
      generation_config = genai.GenerationConfig(
          max_output_tokens=max_output_tokens,
          temperature=2)
    )

  answer = response.text.strip()
  time.sleep(6)

  return {"answer": answer, "difficulty": "easy"}


def generate_freetext_answer_easy(question, intended_answer):
  """Generate an easy, first-person free-text answer, or a polite 'nothing to add'."""
  prompt = f"""
  You are being asked if you have any additional notes or information to share.
  Your response should sound natural, in the first person, and can be either brief or more detailed, depending on the situation.
  You can provide additional information but you don't have to and mention it clearly and politely.
  If there isn't anything else to add, express that in a conversational manner. Use up to {max_output_tokens} tokens.
  Question: {question}\n
  Intended answer: {intended_answer}
  Answer as a sentence, providing any additional information or politely stating that there's nothing else to add:
  """

  response = model.generate_content(
      contents = prompt,
      generation_config = genai.GenerationConfig(
          max_output_tokens=max_output_tokens,
          temperature=2)
    )

  answer = response.text.strip()
  time.sleep(6)

  return {"answer": answer, "difficulty": "easy"}


def generate_date_answer_easy(question, intended_answer):
  """Generate an easy, first-person answer that mentions the given date."""
  prompt = f"""
  You are asked a question about a specific date, and you need to provide a natural, conversational answer in the first person.
  Include the date from the intended answer in your response, phrasing it naturally as if you're suggesting a meeting.
  Be concise but clear, and use up to {max_output_tokens} tokens.
  Question: {question}\n
  Intended Answer: {intended_answer}\n
  Context: Provide a conversational response mentioning the date in a natural way:
  """

  response = model.generate_content(
      contents = prompt,
      generation_config = genai.GenerationConfig(
          max_output_tokens=max_output_tokens,
          temperature=2)
    )

  answer = response.text.strip()
  time.sleep(6)

  return {"answer": answer, "difficulty": "easy"}
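
The four functions above differ only in their prompt text; the model call, the pause, and the returned dictionary are identical. A minimal sketch of how that shared part could be factored out is shown below. Everything in it is an assumption for illustration: the names `INSTRUCTION_BY_TYPE`, `build_prompt`, `ask_model` and `fake_generate` are hypothetical, the one-line instructions are shortened stand-ins for the full prompts, and the mapping of question types to instructions is a guess based on the function names. The model is replaced by a stub so the example runs without an API key.

```python
import time

# Assumed mapping from question type to a shortened instruction; the full
# instructions are the prompt texts of the four functions above.
INSTRUCTION_BY_TYPE = {
    "SINGLE_SELECT": "Answer naturally in the first person.",
    "MULTI_SELECT": "Answer naturally in the first person.",
    "NUMBER": "State the phone number clearly and professionally.",
    "TEXT": "Share additional notes, or politely say there are none.",
    "DATE": "Mention the date naturally, as if suggesting a meeting.",
}


def build_prompt(question_type, question, intended_answer, max_tokens=48):
    """Assemble the prompt for one question from its type-specific instruction."""
    instruction = INSTRUCTION_BY_TYPE[question_type]
    return (
        f"{instruction} Use up to {max_tokens} tokens.\n"
        f"Question: {question}\n"
        f"Intended answer: {intended_answer}\n"
    )


def ask_model(generate, prompt, pause_seconds=0.0):
    """Send one prompt through `generate` and package the reply.

    `generate` is any callable that maps a prompt string to reply text.
    """
    answer = generate(prompt).strip()
    time.sleep(pause_seconds)  # the notebook waits 6 seconds between calls
    return {"answer": answer, "difficulty": "easy"}


def fake_generate(prompt):
    """Stub standing in for the Gemini call."""
    return "  I would prefer a follow up by email.  "


prompt = build_prompt(
    "SINGLE_SELECT", "What is your preferred method of follow-up?", "Email"
)
result = ask_model(fake_generate, prompt)
print(result)
```

In the notebook, `generate` would wrap `model.generate_content(...)` and return `response.text`. To my knowledge, `response.text` can raise a `ValueError` when the response contains no text part, for example after a safety block, so that wrapper would be a natural place to catch the error and retry or skip the question.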