<a href="https://colab.research.google.com/github/alexk2206/tds_capstone/blob/Alex-DEV/Capstone_Project_Topics_in_Data_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [34]:
import pandas as pd
import json
import random
from itertools import chain, combinations
from datetime import datetime, timedelta
random_state = 1

# Capstone Project - Topics in Data Science

This notebook gives an overview of the project and how we approached it. It summarizes our reasoning and presents the solution we chose. Throughout the notebook we share our thoughts along with small parts of the code we implemented. For further explanation, see the referred notebook for each section.

# Chapter 1: Getting familiar with provided data

> Referred notebook: Question_type_identification.ipynb

First, we familiarized ourselves with the provided data. We collected the questionnaires and looked at the questions, their types, and the distribution of the question types. Based on that overview, we worked out a strategy for approaching the capstone tasks.

In [35]:
all_questions_url = 'https://raw.githubusercontent.com/alexk2206/tds_capstone/refs/heads/main/datasets/all_questions.json'
all_questions = pd.read_json(all_questions_url)
all_questions_count = all_questions['type'].value_counts()

print(f'count of question types: \n{all_questions_count}')
all_questions

count of question types: 
type
SINGLE_SELECT    12
MULTI_SELECT      9
TEXT              2
DATE              1
NUMBER            1
Name: count, dtype: int64


Unnamed: 0,id,type,question,options
0,aa2d8cdd-0758-4035-b0b6-ca18e2f380d8,SINGLE_SELECT,Data processing consent,"Yes, No"
1,12e1ed1d-edaa-4e93-8645-de3850e998f9,SINGLE_SELECT,Customer group,"End User, Wholesaler, Distributor, Consultant,..."
2,625012ae-9192-4cf6-a73d-55e1813d6014,MULTI_SELECT,Products interested in,"MY-SYSTEM, Notion, JTS, JS EcoLine, AKW100, AX100"
3,0699fc5a-34a4-4160-bda1-fb135a3615da,MULTI_SELECT,What kind of follow up is planned,"Email, Phone, Schedule a Visit, No action"
4,815dab84-bc5e-4764-9777-0c0126e3173e,MULTI_SELECT,Who to copy in follow up,"Stephan Maier, Joachim Wagner, Erik Schneider,..."
5,3f34e5b3-1cb0-48ea-93d2-3f21b3371b5d,SINGLE_SELECT,Would you like to receive marketing informatio...,"Yes, No"
6,ba042f33-697e-4c6f-924c-b4de2c30f443,SINGLE_SELECT,What industry are you operating in?,"Aerospace, Computers & Networks, Government, M..."
7,7a776cc0-ffe8-4891-b8a9-dd5ff984de13,MULTI_SELECT,What products are you interested in?,"Automotive radar target simulation, Noise figu..."
8,a0148bc7-15b3-41d5-b97c-6420b8bd927c,TEXT,Notes,Please provide any additional information that...
9,5aefc81d-c5d2-41fc-bc7b-6117d1c7671e,SINGLE_SELECT,What type of company is it?,"Construction company, Craft enterprises, Scaff..."


# Chapter 2: Set up for the Q&A Dataset - Answer combinations

> Referred notebook: answer_combinations.ipynb

After reviewing the questions we were given, we wanted to create a large Q&A dataset. To do that, we scaled up the number of questions and created an 'intended_answer' for each one. This gave us a question and a predefined answer for which we could generate a context. Afterwards we could use a model to extract answers from the context, evaluate them, and fine-tune the model. We approached the question types differently:

1. For the single choice questions we picked one option at a time from the options section and made it the 'intended_answer'.
2. For multiple choice questions we generated all possible combinations of the available options, up to a maximum combination size, and made each of them an 'intended_answer'.
3. For the number question, which here only asks for a phone number, we used a Python function to generate a phone number as our 'intended_answer'.
4. For the date questions we used a Python function as well, which created a date within the last two weeks and used that as our 'intended_answer'.
5. For the text questions we just used a string ('Add additional information here') as the 'intended_answer'.

We discussed the size of the Q&A dataset at length and settled on roughly a thousand questions and answers. To reach that size we also used duplicates of some questions, but varied the contexts so that a model would still train well.

Key functions and the results of using them will be displayed in this chapter.

## To generate intended answers, these functions were used:
- generate_combinations: generates all combinations of the available options up to a given maximum size; process_selections applies it to the MULTI_SELECT questions, while SINGLE_SELECT questions get one row per option
- generate_phone_number: generates a random phone number and is applied to the NUMBER questions
- generate_date: generates a date within the last two weeks and is applied to the DATE questions
- generate_notes: just sets 'Add additional information here' as the intended answer for the TEXT questions

In [37]:
def generate_combinations(options_list, max_size):
    return list(chain.from_iterable(combinations(options_list, r) for r in range(0, min(len(options_list), max_size) + 1)))


def generate_phone_number():
    phone_prefix = '01' + str(random.randint(100, 999)) + (str(random.randint(0, 9)) if random.random() < 0.5 else '')
    main_number = ''.join([str(random.randint(0, 9)) for _ in range(random.randint(6, 8))])
    phone_number = phone_prefix + main_number
    return [phone_number]


def generate_date(today=None):
    if today is None:
        today = datetime.today()

    random_days = random.randint(0, 13)
    random_date = today - timedelta(days=random_days)

    date = random_date.strftime('%Y-%m-%d')

    return [date]


def generate_notes():
    return ['Add additional information here']
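As a quick illustration of what generate_combinations returns, the sketch below repeats the same logic on a shortened, illustrative option list. Note that the range starts at 0, so the empty selection is part of the result as well.

```python
from itertools import chain, combinations


def generate_combinations(options_list, max_size):
    # Same logic as the notebook function: every subset of size 0..max_size.
    upper = min(len(options_list), max_size)
    return list(chain.from_iterable(combinations(options_list, r) for r in range(0, upper + 1)))


# Illustrative subset of the follow-up options, not the full list.
options = ["Email", "Phone", "No action"]
combos = generate_combinations(options, max_size=2)

for combo in combos:
    print(list(combo))
```

With three options and max_size=2 this yields one empty selection, three single options and three pairs.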

## To process the different types of questions, these functions were used:

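- process_selections: expands a selection question into one row per intended answer, using generate_combinations for MULTI_SELECT questions and the individual options for SINGLE_SELECT questions
- process_freetext: applies the matching generator function to the TEXT, NUMBER and DATE questions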

In [38]:
def process_selections(row, max_size):
    question = row['question']
    options_list = row['options']
    question_type = row['type']
    expanded = []

    if question_type == 'MULTI_SELECT':
        options_combinations = generate_combinations(options_list, max_size=max_size)
        for combo in options_combinations:
            expanded.append({'question': question, 'type': question_type, 'options': options_list, 'intended_answer': list(combo)})

    elif question_type == 'SINGLE_SELECT':
        for option in options_list:
            expanded.append({'question': question, 'type': question_type, 'options': options_list, 'intended_answer': [option]})

    return expanded


def process_freetext(row):
    question = row['question']
    options_list = row['options']
    question_type = row['type']
    expanded = []

    if question_type == 'TEXT':
        expanded.append({'question': question, 'type': question_type, 'options': options_list, 'intended_answer': generate_notes()})

    elif question_type == 'NUMBER':
        expanded.append({'question': question, 'type': question_type, 'options': options_list, 'intended_answer': generate_phone_number()})

    elif question_type == 'DATE':
        expanded.append({'question': question, 'type': question_type, 'options': options_list, 'intended_answer': generate_date()})

    return expanded

## To adjust the number of questions, this function was used:
- adjust_question_amount: resamples every unique question to a random number of rows between 48 and 64, sampling with replacement where a question has fewer rows than that, to create a bigger dataset

In [39]:
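def adjust_question_amount(df, column, random_state):
    # Seed Python's RNG so the per-question target sizes are reproducible.
    random.seed(random_state)
    def adjust_group(group):
        # Target number of rows for this unique question.
        max_amount = random.randint(48, 64)

        if len(group) < max_amount:
            # Too few rows: upsample with replacement (creates duplicates).
            return group.sample(n=max_amount, replace=True, random_state=random_state)
        else:
            # Enough rows: downsample without replacement.
            return group.sample(n=max_amount, random_state=random_state)

    return df.groupby(column, group_keys=False).apply(adjust_group).reset_index(drop=True)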
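The resampling idea can also be sketched without `groupby(...).apply`, which recent pandas releases warn about when the function touches the grouping column. The variant below is only an illustration: `resample_per_question`, its parameters and the toy data are assumptions and not part of the referred notebook, and it uses a local random generator instead of the global seed.

```python
import random

import pandas as pd


def resample_per_question(df, column, low=48, high=64, seed=1):
    """Resample each group to a random size in [low, high] (hypothetical helper)."""
    rng = random.Random(seed)
    parts = []
    for _, group in df.groupby(column, sort=False):
        target = rng.randint(low, high)
        # Sample with replacement only when the group is too small.
        parts.append(group.sample(n=target, replace=len(group) < target, random_state=seed))
    return pd.concat(parts, ignore_index=True)


# Toy data: one question with a single row, one with many rows.
toy = pd.DataFrame({
    "question": ["Notes"] + ["Customer type"] * 100,
    "intended_answer": [["Add additional information here"]] + [["New customer"]] * 100,
})

resampled = resample_per_question(toy, "question")
counts = resampled["question"].value_counts()
print(counts)
```

Both questions end up with between 48 and 64 rows: the single 'Notes' row is duplicated, while 'Customer type' is reduced.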

## Answer combinations dataset creation
To get the right number of questions and varied intended answers, the functions have to be executed in the following order:
1. Split dataset into selection questions and freetext questions
2. Create intended answers for selection type questions
3. Scale up free text questions
4. Create intended answers for free text questions
5. Append dataset
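The five steps above can be sketched end to end on toy data. This is a simplified illustration, not the code from the referred notebook: plain dicts stand in for DataFrame rows, the toy questions, the fixed copy count of 3 and the fixed reference date are assumptions, and only the DATE generator is shown for the free-text branch.

```python
import random
from datetime import datetime, timedelta
from itertools import chain, combinations

random.seed(1)

# Toy questions standing in for all_questions (hypothetical, shortened).
questions = [
    {"question": "Follow up", "type": "MULTI_SELECT", "options": ["Email", "Phone"]},
    {"question": "Consent", "type": "SINGLE_SELECT", "options": ["Yes", "No"]},
    {"question": "Visit date", "type": "DATE", "options": []},
]
SELECTION_TYPES = ("SINGLE_SELECT", "MULTI_SELECT")

# 1. Split into selection questions and free-text questions.
selection = [q for q in questions if q["type"] in SELECTION_TYPES]
freetext = [q for q in questions if q["type"] not in SELECTION_TYPES]


# 2. Create intended answers for the selection questions.
def expand_selection(q, max_size):
    if q["type"] == "SINGLE_SELECT":
        answers = [[option] for option in q["options"]]
    else:
        upper = min(len(q["options"]), max_size)
        subsets = chain.from_iterable(combinations(q["options"], r) for r in range(upper + 1))
        answers = [list(s) for s in subsets]
    return [{**q, "intended_answer": a} for a in answers]


selection_rows = [row for q in selection for row in expand_selection(q, max_size=2)]

# 3. Scale up the free-text questions (3 copies is an arbitrary choice here).
scaled_freetext = [q for q in freetext for _ in range(3)]

# 4. Create an intended answer for every free-text copy.
today = datetime(2025, 1, 31)  # fixed reference date, for illustration only


def random_recent_date():
    return [(today - timedelta(days=random.randint(0, 13))).strftime("%Y-%m-%d")]


freetext_rows = [{**q, "intended_answer": random_recent_date()} for q in scaled_freetext]

# 5. Append both parts into one dataset.
dataset = selection_rows + freetext_rows
print(len(selection_rows), len(freetext_rows), len(dataset))
```

Scaling up the free-text questions before generating their answers means every copy can receive its own randomly generated date or phone number.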

After executing these steps, the output looks as follows:

In [40]:
answer_combinations_url ="https://raw.githubusercontent.com/alexk2206/tds_capstone/refs/heads/main/datasets/answer_combinations.json"
answer_combinations = pd.read_json(answer_combinations_url)

print(f"answer_combinations shape: {answer_combinations.shape}")
answer_combinations.sample(25, random_state = random_state)

answer_combinations shape: (1381, 4)


Unnamed: 0,question,type,options,intended_answer
989,Notes,TEXT,[Please provide any additional information tha...,[Add additional information here]
991,What is the size of your company?,SINGLE_SELECT,"[1-10, 11-50, 51-200, 201-2000, larger than 2000]",[11-50]
1057,Searches a solution for,MULTI_SELECT,"[Scan business cards, Clean up CRM, Extract da...",[Improve CRM data quality]
1071,What type of company is it?,SINGLE_SELECT,"[Construction company, Craft enterprises, Scaf...",[Craft enterprises]
60,Customer type,SINGLE_SELECT,"[New customer, Existing customer, Partner, App...",[New customer]
750,Which language is wanted for communication?,SINGLE_SELECT,"[German, Italian, Japanese , English, Spanish]",[German]
1276,Any additional notes?,TEXT,[What additional information would you like to...,[Add additional information here]
1347,Size of the trade fair team (on average),SINGLE_SELECT,"[1-5, 6-10, 11-15, 16-20, 21-30, 31-40, more t...",[21-30]
311,Any additional notes?,TEXT,[What additional information would you like to...,[Add additional information here]
670,What is the size of your company?,SINGLE_SELECT,"[1-10, 11-50, 51-200, 201-2000, larger than 2000]",[1-10]


# Chapter 3: Creation of the Q&A dataset


> Referred notebook: QA_Dataset.ipynb

With a dataset containing questions, possible answers and an intended answer we could generate contexts to create a complete Q&A dataset.

For context generation, we used a dedicated prompt for each question type and processed the prompts with 'gemini-1.5-flash'. Along the way we ran into various challenges and resolved them one by one. More on that in our limitations and ideas outline.

**For context generation we used the following prompts, one per question type, as indicated by each function's name:**

generate_selection_answer_easy:

You are asked a question, and you need to provide a natural, conversational answer in the first person. Do not use special characters other than ',' and '.'. Act like you really do not know which options there are and the intended answer is your answer. When given a range, use a number between the two values.
Be concise but clear, and avoid unnecessary elaboration. Use up to {max_output_tokens} tokens. Question: {question}\n Intended answer: {intended_answer}\n Answer as a sentence, mentioning and explaining all the provided options:

generate_number_answer_easy:

You are asked for contact information, and your response should be clear and concise, as if you're giving someone your phone number and how you can be reached in a conversation. Mention the provided phone number and ensure your response sounds natural and professional. Your answer should be in the first person, present tense, and only include the relevant details. Use up to {max_output_tokens} tokens. Question: {question}\n Intended Answer: {intended_answer}\n Answer as a sentence, providing the phone number and any relevant details:

generate_freetext_answer_easy:

You are asked if you have any additional notes or information to share. Your response should sound natural, in the first person, and can be either brief or more detailed, depending on the situation. You can provide additional information but you don't have to and should mention it clearly and politely. If there isn't anything else to add, express that in a conversational manner. Use up to {max_output_tokens} tokens. Question: {question}\n Intended Answer: {intended_answer}\n Answer as a sentence, providing any additional information or politely stating that there's nothing else to add:

generate_date_answer_easy:

You are asked a question about a specific date, and you need to provide a natural, conversational answer in the first person. Include the date from the intended answer in your response, phrasing it naturally as if you're suggesting a meeting. Be concise but clear, and use up to {max_output_tokens} tokens. Question: {question}\n Intended Answer: {intended_answer}\n Context: Provide a conversational response mentioning the date in a natural way:

We tracked the difficulty of the generated contexts as 'easy', since we could imagine generating more difficult contexts in the future.

Apart from the prompts, every function that generates a context has the same structure:

In [41]:
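import time

import google.generativeai as genai

# `model` (the Gemini model object) and `max_output_tokens` are set up in the referred notebook.
def generate_context(question, intended_answer):
    # Placeholder: each generate_*_answer_easy function uses its own prompt from above.
    prompt = f"""
    prompt
    """

    response = model.generate_content(
        contents=prompt,
        generation_config=genai.GenerationConfig(
            max_output_tokens=max_output_tokens,
            temperature=2)
    )

    answer = response.text.strip()

    # Pause between calls to stay within the API rate limit.
    time.sleep(6)

    return {"answer": answer, "difficulty": "easy"}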
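To make the placeholder mechanics concrete, the sketch below fills a prompt template without calling the API. It is only an illustration: `build_prompt` is a hypothetical helper, the template is a shortened stand-in for the selection prompt quoted above, the token limit of 100 is an arbitrary value, and rendering list answers as a comma-separated string is an assumption.

```python
# Shortened stand-in for the selection prompt; the full wording is quoted above.
SELECTION_TEMPLATE = (
    "You are asked a question, and you need to provide a natural, conversational "
    "answer in the first person. Use up to {max_output_tokens} tokens. "
    "Question: {question}\n Intended answer: {intended_answer}\n "
    "Answer as a sentence, mentioning and explaining all the provided options:"
)


def build_prompt(template, question, intended_answer, max_output_tokens):
    """Fill a prompt template (hypothetical helper, not part of the notebook)."""
    # Assumption: list answers are rendered as a comma-separated string.
    answer_text = ", ".join(intended_answer)
    return template.format(
        question=question,
        intended_answer=answer_text,
        max_output_tokens=max_output_tokens,
    )


prompt = build_prompt(SELECTION_TEMPLATE, "Customer type", ["New customer"], max_output_tokens=100)
print(prompt)
```

Building the prompt separately from the API call makes it easy to inspect a few prompts before spending any quota.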

After setting up a function for each question type, we looped through every row and used the Gemini API to generate a context from the question and the intended answer:

In [42]:
cycle_count = 0
def generate_answer_for_row(row):
    global cycle_count
    cycle_count += 1
    print(f"Cycle: {cycle_count}")

    question = row['question']
    intended_answer = row['intended_answer']
    question_type = row['type']

    if question_type in ['SINGLE_SELECT', 'MULTI_SELECT']:
        return generate_selection_answer_easy(question, intended_answer)
    elif question_type == 'NUMBER':
        return generate_number_answer_easy(question, intended_answer)
    elif question_type == 'TEXT':
        return generate_freetext_answer_easy(question, intended_answer)
    elif question_type == 'DATE':
        return generate_date_answer_easy(question, intended_answer)
    else:
        return {"answer": "Unknown question type", "difficulty": "unknown"}

After sampling and examining contexts for the different types, we split the answer combinations dataset into five equally sized parts, created contexts for each of them, and appended them afterwards. Before any models were applied, our Q&A dataset looked like this:

In [43]:
combined_qa_dataset_url = 'https://raw.githubusercontent.com/alexk2206/tds_capstone/refs/heads/main/datasets/combined_qa_dataset.json'
combined_qa_dataset = pd.read_json(combined_qa_dataset_url)

print(f"combined_qa_dataset shape: {combined_qa_dataset.shape}")
combined_qa_dataset.sample(25, random_state = random_state)

combined_qa_dataset shape: (1381, 6)


Unnamed: 0,question,type,options,intended_answer,context,difficulty
989,Notes,TEXT,[Please provide any additional information tha...,[Add additional information here],"I think I've covered everything, so no additio...",easy
991,What is the size of your company?,SINGLE_SELECT,"[1-10, 11-50, 51-200, 201-2000, larger than 2000]",[11-50],"Oh gosh I really don't know all the details, I...",easy
1057,Searches a solution for,MULTI_SELECT,"[Scan business cards, Clean up CRM, Extract da...",[Improve CRM data quality],"Oh, I'm not sure what that means. Is it like, ...",easy
1071,What type of company is it?,SINGLE_SELECT,"[Construction company, Craft enterprises, Scaf...",[Craft enterprises],"Hmm, I'm not really sure, maybe it's a craft e...",easy
60,Customer type,SINGLE_SELECT,"[New customer, Existing customer, Partner, App...",[New customer],"Well, I suppose I'm a new customer. Is that wh...",easy
750,Which language is wanted for communication?,SINGLE_SELECT,"[German, Italian, Japanese , English, Spanish]",[German],"Oh, um, I think I'd probably choose German. I ...",easy
1276,Any additional notes?,TEXT,[What additional information would you like to...,[Add additional information here],"No, I don't believe so; I think I've covered e...",easy
1347,Size of the trade fair team (on average),SINGLE_SELECT,"[1-5, 6-10, 11-15, 16-20, 21-30, 31-40, more t...",[21-30],"I'd say around 25 people, give or take a few. ...",easy
311,Any additional notes?,TEXT,[What additional information would you like to...,[Add additional information here],"Hmm, I don't think I have anything else to add...",easy
670,What is the size of your company?,SINGLE_SELECT,"[1-10, 11-50, 51-200, 201-2000, larger than 2000]",[1-10],Oh gosh I'm not sure about the exact size. I'd...,easy


Generating the Q&A dataset took surprisingly long, and afterwards we came up with some ideas on how to make it work better. More on that in our limitations and ideas outline.

# Chapter 4: Creation of a Q&A dataset for testing

> Referred notebook: test_qa_dataset.ipynb

Apart from creating a Q&A dataset for training, validation and fine-tuning, we wanted a completely new dataset for testing our models afterwards. Our idea was to create new questions with new options and corresponding intended answers. We would then generate contexts and obtain a new dataset for testing our models and calculating metrics.

To do that, we took all available questions, together with their question types and options, and created 20 new questions using OpenAI's ChatGPT. You can look up the prompts and data we used in the referred notebook.

After generating the new questions, we followed the same steps used for creating the Q&A dataset (splitting, creation of intended answers, upscaling, etc.). Unlike for the original Q&A dataset, we used 'gemini-2.0-flash-exp' for context generation, as we quickly exceeded our limit with the 'gemini-1.5-flash' API. The results, though, were just as good as before. We ended up with a test Q&A dataset that looks like this:

In [44]:
test_qa_dataset_url = 'https://raw.githubusercontent.com/alexk2206/tds_capstone/refs/heads/main/datasets/test_qa_dataset_with_answers.json'
test_qa_dataset = pd.read_json(test_qa_dataset_url)

print(f"test_qa_dataset shape: {test_qa_dataset.shape}")
test_qa_dataset.sample(25, random_state = random_state)

test_qa_dataset shape: (200, 6)


Unnamed: 0,question,type,options,intended_answer,context,difficulty
58,When do you expect to finalize your decision?,DATE,[Select an approximate date.],[2025-01-22],How about we aim to have the decision finalize...,easy
40,What support resources do you need for impleme...,MULTI_SELECT,"[Training, Documentation, Technical support, O...","[Training, Documentation, Technical support, N...","Well, I think I might need some training, also...",easy
34,What challenges are you currently facing in yo...,TEXT,[Please share specific challenges or issues.],[Add additional information here],"Actually, there isn't anything else to add at ...",easy
102,What language do you prefer for communication?,SINGLE_SELECT,"[English, German, French, Spanish, Italian, Ot...",[Other],"I don't really have a preference, I suppose it...",easy
184,Which features are most important in a solution?,MULTI_SELECT,"[Ease of use, Cost efficiency, Scalability, Se...","[Ease of use, Scalability, Security]","Well, I'd say ease of use is important, plus s...",easy
198,Which features are most important in a solution?,MULTI_SELECT,"[Ease of use, Cost efficiency, Scalability, Se...","[Cost efficiency, Scalability, Security]","I'd say cost efficiency is important, as is sc...",easy
95,What stage are you in the buying process?,SINGLE_SELECT,"[Exploration, Evaluation, Decision-making, Alr...",[Evaluation],"Oh, um, I guess I'm in the evaluation stage, y...",easy
4,What is your estimated budget for this project?,NUMBER,[Please provide an approximate value.],$13500,"My estimated budget for this project is $13,50...",easy
29,How many employees does your company have?,SINGLE_SELECT,"[1-10, 11-50, 51-200, 201-1000, 1000+]",[1-10],"Oh gosh I'm not really sure we have like, mayb...",easy
168,Do you plan to implement a solution within the...,SINGLE_SELECT,"[Yes, No]",[No],"No, I haven't thought about it. There weren't ...",easy


Looking back at this section, we could have used this technique to create a larger and more diverse Q&A dataset from scratch. More on that in the 'Limitation and ideas' section.

# Chapter X____X: Evaluation of the models

> Referred notebook: model_evaluations.ipynb

After running our models on the test Q&A dataset, we can evaluate their performance. We chose a suitable metric for each question type and came up with the following approach:

- **MULTI_SELECT:** F1 score, as it balances precision and recall, making it a reliable metric for evaluating multi-label classification models. As the harmonic mean of precision and recall, it takes both false positives and false negatives into account when assessing model accuracy.
- **SINGLE_SELECT:** Accuracy, because it measures the proportion of correctly classified instances out of all instances, making it a straightforward and effective metric for evaluating single-label classification models. It works well when the classes are balanced and provides a clear measure of overall correctness.
- **DATE and NUMBER:** Accuracy as well, because it directly measures how often the model predicts the exact date, phone number, or budget value. Since these predictions require precise matching, accuracy is an appropriate metric: it gives no partial credit for close approximations.
- **TEXT:** No evaluation, because there are no right or wrong answers; these questions ask, for example, for additional notes. Evaluation here would depend on human judgment, considering factors like relevance, clarity, and completeness of the response.

To calculate the metrics we looped through each row of the test Q&A dataset and compared the intended and the predicted answer. We wanted to measure over the dataset as a whole and came up with this code:

In [None]:
#INSERT CODE HERE
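Until the original code is inserted, the metrics described above can be illustrated with a minimal sketch. It is not the evaluation code from the referred notebook: the `predicted_answer` field, the helper names and the toy rows are assumptions, and averaging a per-row F1 score is only one of several possible ways to aggregate F1 over the dataset.

```python
def f1_for_row(intended, predicted):
    """Set-based F1 between intended and predicted options for one row."""
    intended, predicted = set(intended), set(predicted)
    if not intended and not predicted:
        return 1.0  # both empty: treated as a perfect match
    true_positives = len(intended & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(intended)
    return 2 * precision * recall / (precision + recall)


def evaluate(rows):
    """Mean score per question type: F1 for MULTI_SELECT, exact match otherwise."""
    scores = {}
    for row in rows:
        if row["type"] == "TEXT":
            continue  # TEXT answers are not scored automatically
        if row["type"] == "MULTI_SELECT":
            score = f1_for_row(row["intended_answer"], row["predicted_answer"])
        else:
            score = float(row["intended_answer"] == row["predicted_answer"])
        scores.setdefault(row["type"], []).append(score)
    return {qtype: sum(vals) / len(vals) for qtype, vals in scores.items()}


# Hypothetical rows; `predicted_answer` stands in for a model's output.
rows = [
    {"type": "MULTI_SELECT", "intended_answer": ["Training", "Documentation"], "predicted_answer": ["Training"]},
    {"type": "MULTI_SELECT", "intended_answer": ["Training"], "predicted_answer": ["Training"]},
    {"type": "SINGLE_SELECT", "intended_answer": ["Yes"], "predicted_answer": ["Yes"]},
    {"type": "SINGLE_SELECT", "intended_answer": ["No"], "predicted_answer": ["Yes"]},
    {"type": "DATE", "intended_answer": ["2025-01-22"], "predicted_answer": ["2025-01-22"]},
    {"type": "TEXT", "intended_answer": ["Add additional information here"], "predicted_answer": ["Nothing to add"]},
]

results = evaluate(rows)
print(results)
```

Comparing sets rather than lists for MULTI_SELECT makes the score independent of the order in which a model lists the options.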

To get a better comparison of the different models used and their scores we visualized the results as follows:

In [None]:
#INSERT CODE HERE