# Generating Text-Based Questions

The first part of this notebook is a walkthrough on how we used Chat GPT generated math division questions to create a dataframe that can be used in Literacy Now's question bank data frame.

 Please go to the very end to see a dataframe of all of our generated text-based math questions 

### Import the necessary libraries 

In [24]:
import pandas as pd
import regex as re
import random
import numpy as np

### Prompt Chat GPT

Paste the following prompt into Chat GPT: 

- You are generating questions for an educational app aimed toward educating 8 year olds.  Generate simple division questions similar to "Rida has Rf 730. She wants to give them equally to her two brothers for buying books. How much amount does each brother get?" Make sure the solution is a whole number. Make sure all numbers used are less than 100

### Load in the Chat GPT output 

Note that you will upload multiple CSVs because each CSV corresponds to a different type of text based question. For example, you would upload addition based questions and subtraction based questions separately 

In [25]:
generated_divison_questions = pd.read_csv("/work/CHATGPT/divison-qs.csv")
generated_divison_questions

Unnamed: 0,"Prompt: You are generating questions for an educational app aimed toward educating 8 year olds. Generate simple division questions similar to ""Rida has Rf 730. She wants to give them equally to her two brothers for buying books. How much amount does each brother get?"" Make sure the solution is a whole number. Make sure all numbers used are less than 100\n\n"
0,"""Sara has 56 stickers. She wants to divide the..."
1,"""Ahmed has 72 marbles. He wants to divide them..."
2,"""Mia has 48 crayons. She wants to share them w..."
3,"""Tariq has 80 candies. He wants to divide them..."
4,"""Fatima has 64 grapes. She wants to share them..."
5,"""Omar has 36 toy cars. He wants to divide them..."
6,"""Sana has 90 stickers. She wants to divide the..."
7,"""Hassan has 81 blocks. He wants to divide them..."
8,"""Dania has 72 crayons. She wants to share them..."
9,"""Rashid has 63 candies. He wants to divide the..."


### Clean up the loaded dataframe  

In [26]:
#Renamed the column
generated_divison_questions = generated_divison_questions.rename(columns = {generated_divison_questions.columns[0]: 'Question'})

#Add the question cateogry 
generated_divison_questions["Category"] = ["Divison"] * len(generated_divison_questions)
generated_divison_questions


Unnamed: 0,Question,Category
0,"""Sara has 56 stickers. She wants to divide the...",Divison
1,"""Ahmed has 72 marbles. He wants to divide them...",Divison
2,"""Mia has 48 crayons. She wants to share them w...",Divison
3,"""Tariq has 80 candies. He wants to divide them...",Divison
4,"""Fatima has 64 grapes. She wants to share them...",Divison
5,"""Omar has 36 toy cars. He wants to divide them...",Divison
6,"""Sana has 90 stickers. She wants to divide the...",Divison
7,"""Hassan has 81 blocks. He wants to divide them...",Divison
8,"""Dania has 72 crayons. She wants to share them...",Divison
9,"""Rashid has 63 candies. He wants to divide the...",Divison


### Generate a math expression for each question 

In [27]:
#Generate a math expression that evaluates to the solution for each question 

#Define a function that returns a division expression from two numbers
def div_expr(x):
    return x[0] + '/' + x[1]


generated_divison_questions["Expression"] = generated_divison_questions["Question"].str.findall('\d+') #finds all numbers in the question 

generated_divison_questions["Expression"] = generated_divison_questions["Expression"].apply(div_expr) #applies the divison expression function

generated_divison_questions["Expression"] 

0     56/7
1     72/9
2     48/6
3     80/8
4     64/4
5     36/6
6     90/9
7     81/9
8     72/9
9     63/9
10    45/9
11    36/6
12    50/5
13    72/8
14    45/9
15    36/6
16    60/6
17    54/9
18    40/4
19    48/8
Name: Expression, dtype: object

### Evaluate the math expression to get the solution

In [28]:
# Evaluate the math expression to get the solution 

generated_divison_questions["Solution"] = generated_divison_questions["Expression"].apply(eval)
generated_divison_questions["Solution"]

0      8.0
1      8.0
2      8.0
3     10.0
4     16.0
5      6.0
6     10.0
7      9.0
8      8.0
9      7.0
10     5.0
11     6.0
12    10.0
13     9.0
14     5.0
15     6.0
16    10.0
17     6.0
18    10.0
19     6.0
Name: Solution, dtype: float64

### Create the other answer choices

In [29]:
# create other answer choices 

answer_choices_list = [] #list of possible answer choices 
for index, row in generated_divison_questions.iterrows():
    solution = int(row["Solution"])
    possible_answer_choices = np.linspace(solution - 10, solution + 10, 21) #create an array of numbers centered around the solution
    possible_answer_choices = possible_answer_choices.astype(int)
    answer_choices = random.sample(sorted(possible_answer_choices[possible_answer_choices != solution]), 3) #randomly choose 3 of those numbers from the possible answer choices
    answer_choices.append(solution) #append the solution to the possible answer choices
    random.shuffle(answer_choices) #shuffle the answer choices 
    answer_choices_list.append(answer_choices) #append this answer choices to the overall list 

generated_divison_questions["Answer Choices"] = answer_choices_list
answer_choices_list


[[12, -2, 11, 8],
 [5, 8, 13, -2],
 [17, 8, 12, -2],
 [19, 10, 14, 3],
 [16, 22, 19, 18],
 [11, 8, -2, 6],
 [10, 14, 0, 19],
 [15, 4, 9, 18],
 [18, 8, 7, 6],
 [15, 14, 7, 1],
 [10, -2, 1, 5],
 [8, 6, 13, 12],
 [13, 10, 4, 19],
 [0, 16, 15, 9],
 [14, 1, 5, -5],
 [6, 9, 5, 11],
 [16, 1, 10, 9],
 [14, 5, 12, 6],
 [20, 16, 10, 11],
 [6, 7, 0, -2]]

In [30]:
generated_divison_questions

Unnamed: 0,Question,Category,Expression,Solution,Answer Choices
0,"""Sara has 56 stickers. She wants to divide the...",Divison,56/7,8.0,"[12, -2, 11, 8]"
1,"""Ahmed has 72 marbles. He wants to divide them...",Divison,72/9,8.0,"[5, 8, 13, -2]"
2,"""Mia has 48 crayons. She wants to share them w...",Divison,48/6,8.0,"[17, 8, 12, -2]"
3,"""Tariq has 80 candies. He wants to divide them...",Divison,80/8,10.0,"[19, 10, 14, 3]"
4,"""Fatima has 64 grapes. She wants to share them...",Divison,64/4,16.0,"[16, 22, 19, 18]"
5,"""Omar has 36 toy cars. He wants to divide them...",Divison,36/6,6.0,"[11, 8, -2, 6]"
6,"""Sana has 90 stickers. She wants to divide the...",Divison,90/9,10.0,"[10, 14, 0, 19]"
7,"""Hassan has 81 blocks. He wants to divide them...",Divison,81/9,9.0,"[15, 4, 9, 18]"
8,"""Dania has 72 crayons. She wants to share them...",Divison,72/9,8.0,"[18, 8, 7, 6]"
9,"""Rashid has 63 candies. He wants to divide the...",Divison,63/9,7.0,"[15, 14, 7, 1]"


### Joining multiple dataframes together

Once you have completed generating questions for each question type, you can merge all of the questions together in one dataframe using the following code. Since there is only one dataframe in the notebook right now, please scroll to the bottom of the notebook to see this code in action 

In [31]:
#dataframes is a list of all the cleaned dataframes that contain the question, expression, solution, and answer choices 
#all_generated_questions = pd.concat([df for df in dataframes] + [generated_divison_questions], axis=0)
#all_generated_questions

### That was a walkthrough for one type of text based questions. We generated all of the other math questions following a similar format, which we have also included in this notebook. 

Please use the walkthrough above as an overall reference for using question generation. The following code is not as clear to follow as the example in the walkthrough. 

In [32]:
generated_area_questions = pd.read_csv("/work/CHATGPT/area-qs.csv")
generated_area_questions = generated_area_questions.rename(columns = {generated_area_questions.columns[0]: 'Question'})
generated_area_questions["Category"] = ["Area"] * len(generated_area_questions)

generated_perimeter_questions = pd.read_csv("/work/CHATGPT/perimeter-qs.csv")
generated_perimeter_questions = generated_perimeter_questions.rename(columns = {generated_perimeter_questions.columns[0]: 'Question'})
generated_perimeter_questions["Category"] = ["Perimeter"] * len(generated_perimeter_questions)

generated_addition_questions = pd.read_csv("/work/CHATGPT/addition.csv")
# note that we manually calculated the values for this in google spreadsheet before we realized it was easier to do in deepnote
generated_addition_questions["Category"] = ["Addition"] * len(generated_addition_questions)

generated_subtraction_questions = pd.read_csv("/work/CHATGPT/subtraction.csv")
generated_subtraction_questions["Category"] = ["Subtraction"] * len(generated_subtraction_questions)

generated_multiplication_questions = pd.read_csv("/work/CHATGPT/multiplication.csv")
generated_multiplication_questions = generated_multiplication_questions.rename(columns = {generated_multiplication_questions.columns[0]: 'Question'})
generated_multiplication_questions["Category"] = ["Multiplication"] * len(generated_multiplication_questions)



Here's how the tables look like at this step of table generation. A word problem and category have been assigned to each question:

In [35]:
generated_multiplication_questions.head(5)

Unnamed: 0,Question,Category
0,If a box contains 10 pencils and there are 6 b...,Multiplication
1,A pizza place makes 8 pizzas and each pizza ha...,Multiplication
2,There are 7 days in a week and each day has 24...,Multiplication
3,A bag has 12 marbles and there are 4 bags. How...,Multiplication
4,If a bookshelf has 5 shelves and there are 10 ...,Multiplication


In [34]:
generated_perimeter_questions.head(5)

Unnamed: 0,Question,Category
0,Find the perimeter of a square with a side len...,Perimeter
1,Find the perimeter of a rectangle with a lengt...,Perimeter
2,Find the perimeter of a square with a side len...,Perimeter
3,Find the perimeter of a rectangle with a lengt...,Perimeter
4,Find the perimeter of a square with a side len...,Perimeter


We'll now be expressing the word problems as mathematical expressions using these functions (as each question type has a similar "format" — i.e. division questions have their numerator first, then their denominator second):

In [36]:
#function expressions

def perimeter_expr(row):
    if row["Shape"] == 'square':
        return row["Numbers"][0] + "* 4"
    elif row["Shape"] == 'rectangle':
        return "2*" + row["Numbers"][0] + "+ 2 *" + row["Numbers"][1]
    else:
        return 0 

def area_expr(x):
    return x[0] + "**2"

def multi_expr(x):
    return x[0] + "*" + x[1]

def sub_expr(x):
    return x[0] + "-" + x[1]

def add_expr(x):
    return x[0] + "+" + x[1]

def div_expr(x):
    return x[0] + "\\" + x[1]

In [37]:
generated_area_questions["Expression"] = generated_area_questions["Question"].str.findall('\d+') #finds all numbers in the question 
generated_area_questions["Expression"] = generated_area_questions["Expression"].apply(area_expr) #applies the divison expression function

generated_perimeter_questions["Numbers"] = generated_perimeter_questions["Question"].str.findall('\d+')
generated_perimeter_questions["Shape"] = generated_perimeter_questions["Question"].str.findall('square|rectangle').str[0]
generated_perimeter_questions["Expression"] = generated_perimeter_questions.apply(perimeter_expr, axis = 1)
generated_perimeter_questions.drop(["Numbers", "Shape"], axis = 1, inplace = True)
generated_perimeter_questions

generated_addition_questions["Expression"] = generated_addition_questions["Question"].str.findall('\d+') #finds all numbers in the question 
generated_addition_questions["Expression"] = generated_addition_questions["Expression"].apply(add_expr) 

generated_subtraction_questions["Expression"] = generated_subtraction_questions["Question"].str.findall('\d+') #finds all numbers in the question 
generated_subtraction_questions["Expression"] = generated_subtraction_questions["Expression"].apply(sub_expr)

generated_multiplication_questions["Expression"] = generated_multiplication_questions["Question"].str.findall('\d+')
generated_multiplication_questions["Expression"] = generated_multiplication_questions["Expression"].apply(multi_expr)


After creating the mathematical expression, we evaluate it to obtain a solution. The table below illustrates the current state of our tables:

In [38]:
dataframes = [generated_multiplication_questions, generated_subtraction_questions, generated_perimeter_questions, generated_area_questions]

for df in dataframes:
    df["Solution"] = df["Expression"].apply(eval).astype(int)


In [39]:
generated_multiplication_questions

Unnamed: 0,Question,Category,Expression,Solution
0,If a box contains 10 pencils and there are 6 b...,Multiplication,10*6,60
1,A pizza place makes 8 pizzas and each pizza ha...,Multiplication,8*6,48
2,There are 7 days in a week and each day has 24...,Multiplication,7*24,168
3,A bag has 12 marbles and there are 4 bags. How...,Multiplication,12*4,48
4,If a bookshelf has 5 shelves and there are 10 ...,Multiplication,5*10,50
5,A classroom has 30 desks and each desk has 2 c...,Multiplication,30*2,60
6,There are 6 balls in a pack and there are 8 pa...,Multiplication,6*8,48
7,A box contains 4 toy cars and there are 15 box...,Multiplication,4*15,60
8,If a store sells 5 boxes of candy and each box...,Multiplication,5*10,50
9,A garden has 8 rows and each row has 7 flowers...,Multiplication,8*7,56


In [40]:
for df in dataframes:
    answer_choices_list = [] #list of possible answer choices 
    for index, row in df.iterrows():
        solution = row["Solution"]
        possible_answer_choices = np.linspace(solution - 10, solution + 10, 21) #create an array of numbers centered around the solution
        possible_answer_choices = possible_answer_choices.astype(int) 
        answer_choices = random.sample(sorted(possible_answer_choices[possible_answer_choices != solution]), 3) #randomly choose 3 of those numbers from the possible answer choices
        answer_choices.append(solution) #append the solution to the possible answer choices
        random.shuffle(answer_choices) #shuffle the answer choices 
        answer_choices_list.append(answer_choices) #append this answer choices to the overall list 

    df["Answer Choices"] = answer_choices_list



In [41]:
generated_multiplication_questions


Unnamed: 0,Question,Category,Expression,Solution,Answer Choices
0,If a box contains 10 pencils and there are 6 b...,Multiplication,10*6,60,"[53, 69, 63, 60]"
1,A pizza place makes 8 pizzas and each pizza ha...,Multiplication,8*6,48,"[48, 43, 52, 51]"
2,There are 7 days in a week and each day has 24...,Multiplication,7*24,168,"[168, 176, 162, 170]"
3,A bag has 12 marbles and there are 4 bags. How...,Multiplication,12*4,48,"[51, 44, 39, 48]"
4,If a bookshelf has 5 shelves and there are 10 ...,Multiplication,5*10,50,"[48, 47, 41, 50]"
5,A classroom has 30 desks and each desk has 2 c...,Multiplication,30*2,60,"[60, 69, 54, 55]"
6,There are 6 balls in a pack and there are 8 pa...,Multiplication,6*8,48,"[50, 48, 58, 44]"
7,A box contains 4 toy cars and there are 15 box...,Multiplication,4*15,60,"[62, 60, 59, 56]"
8,If a store sells 5 boxes of candy and each box...,Multiplication,5*10,50,"[44, 53, 50, 43]"
9,A garden has 8 rows and each row has 7 flowers...,Multiplication,8*7,56,"[47, 66, 56, 58]"


## Functions to Complete Everything Above

We have included functions that automatically create the 5 question types we talked about previously, inputting them into the table format above. These are to help get you started, and if you plan to create new question types, you can follow a similar format.

As a note, the "expression" argument is a function of the format we'll describe in the cell below this one.

In [56]:
def create_table(question_type, filepath, expression): 
    """
    Create a table of questions from ChatGPT word problems with the specified question type.

    Args:
    question_type (String): the arithmetic operation that the question uses (addition, subtraction, divison, etc.) -- This isn't case-sensitive
    filepath (String): the filepath to the csv file that contains the Chat GPT generated output (example: "/work/CHATGPT/area-qs.csv")
    expression (function): the function that creates a mathematical expression from the word problem

    Returns (dataframe): 
        the final question table with the columns: "Question," "Category," "Expression," "Solution," and "Answer Choices"  
    """

    questions_table = pd.read_csv(filepath) #loads in the data 
    questions_table = questions_table.rename(columns = {questions_table.columns[0]: 'Question'})

    question_type = question_type.lower()
    
    questions_table["Expression"] = questions_table["Question"].str.findall('\d+') #finds all numbers in the question 
    questions_table["Expression"] = questions_table["Expression"].apply(expression) 

    questions_table["Solution"] = questions_table["Expression"].apply(eval).astype(int)
    questions_table["Category"] = [question_type] * len(questions_table)

    answer_choices_list = [] #list of possible answer choices 
    for index, row in questions_table.iterrows():
        solution = row["Solution"]
        possible_answer_choices = np.linspace(solution - 10, solution + 10, 21) #create an array of numbers centered around the solution
        possible_answer_choices = possible_answer_choices.astype(int)
        answer_choices = random.sample(sorted(possible_answer_choices[possible_answer_choices != solution]), 3) #randomly choose 3 of those numbers from the possible answer choices
        answer_choices.append(solution) #append the solution to the possible answer choices
        random.shuffle(answer_choices) #shuffle the answer choices 
        answer_choices_list.append(answer_choices) #append this answer choices to the overall list 

    questions_table["Answer Choices"] = answer_choices_list

    return questions_table


Here, sub_expr is an example "expression" function to plug into our create_table function.

In [57]:
#example of the create_table function working 
def sub_expr(x):
    return x[0] + "-" + x[1]
    
create_table("subtraction", "/work/CHATGPT/subtraction.csv", sub_expr)

Unnamed: 0,Question,Expression,Solution,Category,Answer Choices
0,Tom had 70 stickers and he gave away 25. How m...,70-25,45,subtraction,"[39, 51, 48, 45]"
1,Sarah has 36 pencils and she loses 15. How man...,36-15,21,subtraction,"[27, 16, 21, 11]"
2,There were 50 apples in a basket and 20 were t...,50-20,30,subtraction,"[40, 30, 26, 25]"
3,John had 45 toy cars and he gave 12 to his fri...,45-12,33,subtraction,"[33, 40, 35, 39]"
4,A pizza had 12 slices and 4 slices were eaten....,12-4,8,subtraction,"[8, 4, 6, 10]"
5,There were 60 balloons in a bag and 18 were po...,60-18,42,subtraction,"[42, 33, 38, 39]"
6,Emily had 90 crayons and she gave away 40. How...,90-40,50,subtraction,"[52, 53, 50, 49]"
7,There were 35 ducks in a pond and 8 flew away....,35-8,27,subtraction,"[37, 22, 17, 27]"
8,A toy store had 80 stuffed animals and 25 were...,80-25,55,subtraction,"[61, 55, 59, 53]"
9,Jason has 30 books and he loses 5. How many bo...,30-5,25,subtraction,"[28, 18, 25, 27]"


In [58]:
create_table("addition", "/work/CHATGPT/addition.csv", add_expr)

Unnamed: 0,Question,Expression,Solution,Category,Answer Choices
0,Katie has 20 marbles. She then finds 15 more o...,20+15,35,addition,"[35, 28, 27, 31]"
1,John has 30 toy cars. He buys 25 more at the s...,30+25,55,addition,"[57, 63, 55, 62]"
2,If Emily has 35 stickers and her friend gives ...,35+12,47,addition,"[49, 42, 47, 41]"
3,David has 50 pencils. He then receives 18 more...,50+18,68,addition,"[58, 68, 72, 69]"
4,Alex has 40 baseball cards. He trades with his...,40+16,56,addition,"[59, 56, 57, 62]"
5,Sarah has 25 books on her shelf. She gets 10 m...,25+10,35,addition,"[35, 41, 25, 42]"
6,If Lily has 55 pieces of candy and her brother...,55+8,63,addition,"[60, 66, 63, 69]"
7,Henry has 15 action figures. He buys 22 more a...,15+22,37,addition,"[44, 37, 39, 27]"
8,Tim has 7 toy cars. His friend gives him 9 mor...,7+9,16,addition,"[18, 8, 16, 19]"
9,Sally has 12 stickers. She gets 8 more from he...,12+8,20,addition,"[23, 21, 20, 28]"


### Explanation of the "expression" argument

To generalize this result to other question types, we'll detail the process of creating a function to pass in as the "expression" argument in our create_table function. 

What the expression function takes in: A list of numbers. The order of the numbers matters. 

For example, our division questions were generated such that the numerator was mentioned in the word problem 1st, then the denominator. Our code in the "create_table" function grabs all the numbers that appear in the question, and puts that into a list.

What the expression function returns: A mathematical expression that represents what the word problem is testing (i.e. 5+7). 

This uses the ordering of the numbers in the list to generate the question. Going back to our division example, we know the 0th index has the numerator, and the 1st index has the denominator, so our div_expr function returns x[0] + "\\" + x[1], where x is the list of numbers.

### Final dataset of all our Chat GPT generated questions 

In [60]:
def join_all_dfs(df_list):
    """
    Args:
    df_list (type: list): list of all the dataframes of generated questions. assumes each dataframe has a question, category, expression, solution, and answer choice column

    Returns:
    Joined dataframe of all the dataframes in df_list
    """

    return pd.concat([df for df in dataframes], axis = 0) 

In [61]:
df_list = [generated_addition_questions, generated_area_questions, generated_divison_questions, generated_multiplication_questions, generated_perimeter_questions, generated_subtraction_questions]

join_all_dfs(df_list)

Unnamed: 0,Question,Category,Expression,Solution,Answer Choices
0,If a box contains 10 pencils and there are 6 b...,Multiplication,10*6,60,"[53, 69, 63, 60]"
1,A pizza place makes 8 pizzas and each pizza ha...,Multiplication,8*6,48,"[48, 43, 52, 51]"
2,There are 7 days in a week and each day has 24...,Multiplication,7*24,168,"[168, 176, 162, 170]"
3,A bag has 12 marbles and there are 4 bags. How...,Multiplication,12*4,48,"[51, 44, 39, 48]"
4,If a bookshelf has 5 shelves and there are 10 ...,Multiplication,5*10,50,"[48, 47, 41, 50]"
...,...,...,...,...,...
15,A square-shaped tile has a side length of 20 c...,Area,20**2,400,"[400, 409, 401, 395]"
16,A square-shaped book has a side length of 5 cm...,Area,5**2,25,"[25, 16, 35, 15]"
17,A square-shaped poster has a side length of 30...,Area,30**2,900,"[898, 905, 903, 900]"
18,A square-shaped cushion has a side length of 1...,Area,18**2,324,"[324, 319, 334, 317]"


In [None]:
add_then_mul_expr(x): #x is an array of all the numbers from the question in order
    return (x[0] + x[1]) / x[2] 

create_table("Addition and Multiplication", filepath, add_then_mul_expr)


<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=08b91cc6-deda-4616-92c9-e073a9a8b5c8' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>