Contact Ansong for relevant files. These early blocks are simply breaking up the jsonl into python arrays.

In [1]:
import json

# outputs lines from codex model run on gsmath to arrays depending on if they're successful
codex_file = "results/gsmath-codex_davinci-pass_at_100-dev-output_gen_prob-gen_len_256-gsm_shots.jsonl"

codex_data = {}
codex_failed = {}
codex_successful = {}

with open(codex_file, "r") as f:
    for line in f:
        obj = json.loads(line)
        codex_data[obj['metadata']['question']] = obj
        if obj["generated_program"]['exec_match'] == 1:    # exec_match is used to determine if a program is successful
            codex_successful[obj['metadata']['question']] = obj['metadata']
        else:
            codex_failed[obj['metadata']['question']] = obj['metadata']


In [2]:
# outputs lines from incoder model run on gsmath to arrays depending on if they're successful
incoder_file = "results/gsmath-incoder_6b-pass_at_100-dev-output_gen_prob.jsonl"

incoder_data = {}
incoder_failed = {}
incoder_successful = {}

with open(incoder_file, "r") as f:
    for line in f:
        obj = json.loads(line)
        incoder_data[obj['metadata']['question']] = obj
        if obj["generated_program"]['exec_match'] == 1:    # exec_match is used to determine if a program is successful
            incoder_successful[obj['metadata']['question']] = obj['metadata']
        else:
            incoder_failed[obj['metadata']['question']] = obj['metadata']

# 1st Sanity Check

The following code block is used to make sure that the above steps were performed correctly on the right data. The reported success rates should be as follows.

Codex:

Incoder:

In [3]:
print("Total Codex: ", len(codex_data))
print("Total Codex failed: ", len(codex_failed))
print("Total Codex successful: ", len(codex_successful))
print("Total Incoder: ", len(incoder_data))
print("Total Incoder failed: ", len(incoder_failed))
print("Total Incoder successful: ", len(incoder_successful))

Total Codex:  1488
Total Codex failed:  475
Total Codex successful:  1013
Total Incoder:  1495
Total Incoder failed:  1449
Total Incoder successful:  46


Uncomment these to see how data is stored

In [4]:
# print(codex_data[0])

In [5]:
# print(codex_data[0]['metadata'])

# Seperation

The code blocks below are used to see which problems which models succeed and fail at.

In [6]:
codex_failed_questions = {}
codex_successful_questions = {}
incoder_failed_questions = {}
incoder_successful_questions = {}
    
both_succeed = {}
both_failed = {}
codex_succeed = {}
incoder_succeed = {}
only_codex = {}
only_incoder = {}

for key in incoder_failed:
    cf = codex_failed.get(key)
    cs = codex_successful.get(key)
    if cf:
        both_failed[key] = cf
    elif cs:
        codex_succeed[key] = cs
    else:
        only_incoder[key] = incoder_failed.get(key)

for key in incoder_successful:
    cf = codex_failed.get(key)
    cs = codex_successful.get(key)
    if cf:
        incoder_succeed[key] = cf
    elif cs:
        both_succeed[key] = cs
    else:
        only_incoder[key] = incoder_successful.get(key)

In [7]:
print("both failed: ", len(both_failed))
print("both succeed: ", len(both_succeed))
print("only codex succeeded: ", len(codex_succeed))
print("only incoder succeeded: ", len(incoder_succeed))
print("only in incoder: ", len(only_incoder))
print("only in codex: ", len(only_codex))
print("total number of questions: ", len(both_failed) + len(both_succeed) + len(codex_succeed) + len(only_incoder) +len(incoder_succeed) + len(only_codex))

both failed:  465
both succeed:  36
only codex succeeded:  977
only incoder succeeded:  10
only in incoder:  7
only in codex:  0
total number of questions:  1495


Since there are more questions in the dataset provided for incoder, the following codeblock shouldn't change anything. It's included in case things change, or something is messed up.

In [8]:
for key in codex_failed:
    inf = incoder_failed.get(key)
    ins = incoder_successful.get(key)
    if inf:
        pass
    elif ins:
        pass
    else:
        only_codex[key] = incoder_failed.get(key)

for key in codex_successful:
    inf = codex_failed.get(key)
    ins = codex_successful.get(key)
    if inf:
        pass
    elif ins:
        pass
    else:
        only_codex[key] = incoder_successful.get(key)

In [9]:
print("both failed: ", len(both_failed))
print("both succeed: ", len(both_succeed))
print("only codex succeeded: ", len(codex_succeed))
print("only incoder succeeded: ", len(incoder_succeed))
print("only in incoder: ", len(only_incoder))
print("only in codex: ", len(only_codex))
print("total number of questions: ", len(both_failed) + len(both_succeed) + len(codex_succeed) + len(only_incoder) +len(incoder_succeed) + len(only_codex))

both failed:  465
both succeed:  36
only codex succeeded:  977
only incoder succeeded:  10
only in incoder:  7
only in codex:  0
total number of questions:  1495


# Sandbox

Used to look at things

In [10]:
key = "A club is going to get additional members so that they will have 5 more than twice their current number of their members. If the club has 10 members now, how many additional members do they need?"
# print(codex_data.get(key))
# print(incoder_data.get(key))

### The following code block looks at answers to the questions that both models got right

In [11]:
for key in both_succeed:
    print("Question: ", key,"\n")
    codex_code = codex_data.get(key)
    print("Codex Code:\n", codex_code["generated_program"]["code"])
    print("answer = ", codex_code['generated_program']['exec_result']['answer'], "\n")
    incoder_code = incoder_data.get(key)
    print("Incoder Code:\n", incoder_code["generated_program"]["code"])
#     print("answer = ", incoder_code['generated_program']['exec_result']['answer'], "\n")

Question:  Wade is the star player of the basketball team. His average points per game is 20, and his teammates' average points per game is 40. How many points will their team have in total after 5 games? 

Codex Code:
 n_games = 5
points_wade = 20
points_teammates = 40
points_total = n_games * (points_wade + points_teammates)
answer = points_total
answer =  300 

Incoder Code:
 average_points = 20
average_points_teammates = 40
n_games = 5
n_points = average_points * n_games
n_points_teammates = average_points_teammates * n_games
n_points_total = n_points + n_points_teammates
answer = n_points_total
Question:  Steve finds 100 gold bars while visiting Oregon. He wants to distribute his gold bars evenly to his 4 friends. If 20 gold bars were lost on the way back to San Diego, how many gold bars will each of his 4 friends get when he returns? 

Codex Code:
 n_gold_bars_found = 100
n_friends = 4
n_gold_bars_lost = 20
n_gold_bars_left = n_gold_bars_found - n_gold_bars_lost
n_gold_bars_per_f

### The following code block looks at answers to the questions that both models failed

In [12]:
for key in both_failed:
    print("Question: ", key,"\n")
    codex_code = codex_data.get(key)
    print("Codex Code:\n", codex_code["generated_program"]["code"])
    try:
        print("answer = ", codex_code['generated_program']['exec_result']['answer'], "\n")
    except:
        pass
    incoder_code = incoder_data.get(key)
    print("Incoder Code:\n", incoder_code["generated_program"]["code"])
    try:
        print("answer = ", incoder_code['generated_program']['exec_result']['answer'], "\n")
    except:
        pass

Question:  Adam’s wardrobe is too crowded so he decides to donate some of his clothes to a charity shop. He takes out 4 pairs of pants, 4 jumpers, 4 pajama sets (top and bottom), and 20 t-shirts, then asks his friends if they have anything they want to donate. 3 of his friends donate the same amount of clothing as Adam each. Then he takes another look over his clothing and decides that he actually wants to keep half of his clothes. How many articles of clothing are being donated in total? 

Codex Code:
 n_pants_adam = 4
n_jumpers_adam = 4
n_pajama_sets_adam = 4
n_t_shirts_adam = 20
n_friends = 3
n_articles_of_clothing_donated = n_pants_adam + n_jumpers_adam + n_pajama_sets_adam + n_t_shirts_adam
n_articles_of_clothing_donated_total = n_articles_of_clothing_donated * (n_friends + 1)
percent_clothes_kept = 0.5
n_articles_of_clothing_donated_total_after_keeping = n_articles_of_clothing_donated_total * (1 - percent_clothes_kept)
answer = n_articles_of_clothing_donated_total_after_keeping
a

### The following code block looks at answers to the questions that only Codex succeeded on

In [13]:
for key in codex_succeed:
    print("Question: ", key,"\n")
    codex_code = codex_data.get(key)
    print("Gold Code:\n", codex_code["metadata"]['original_answer'], "\n")
    print("Codex Code:\n", codex_code["generated_program"]["code"], "\n")
    incoder_code = incoder_data.get(key)
    print("Incoder Code:\n", incoder_code["generated_program"]["code"], "\n")    

Question:  Carly recently graduated and is looking for work in a field she studied for. She sent 200 job applications to companies in her state, and twice that number to companies in other states. Calculate the total number of job applications she has sent so far. 

Gold Code:
 If she sent 200 job applications to her state, she sent 200*2 = <<200*2=400>>400 job applications to other states.
The total number of job applications she has sent is 400+200 = <<400+200=600>>600
#### 600 

Codex Code:
 n_job_applications_in_state = 200
n_job_applications_out_of_state = n_job_applications_in_state * 2
n_total_job_applications = n_job_applications_in_state + n_job_applications_out_of_state
answer = n_total_job_applications 

Incoder Code:
 n_jobs = 200
n_states = 5
n_jobs_sent = n_jobs * n_states
answer = n_jobs_sent 

Question:  A radio show plays for 3 hours a day. They split their show into talking segments, ad breaks and songs. Talking segments last 10 minutes each, ad breaks last 5 minutes 

### The following code block looks at answers to the questions that only Incoder succeeded on

In [14]:
for key in incoder_succeed:
    print("Question: ", key,"\n")
    codex_code = codex_data.get(key)
    print("Gold Code:\n", codex_code["metadata"]['original_answer'], "\n")
    print("Codex Code:\n", codex_code["generated_program"]["code"], "\n")
    incoder_code = incoder_data.get(key)
    print("Incoder Code:\n", incoder_code["generated_program"]["code"], "\n")  

Question:  A grocery store has 4 kinds of jelly. They sell grape jelly twice as much as strawberry jelly, and raspberry jelly twice as much as plum jelly. The raspberry jelly sells a third as much as the grape jelly. If they sold 6 jars of plum jelly today, how many jars of strawberry jelly did they sell? 

Gold Code:
 They sell twice as much raspberry jelly as plum jelly, so they sold 2 * 6 = <<2*6=12>>12 jars of raspberry jelly today.
The raspberry jelly sells a third as much as the grape jelly, so they sold 12 * 3 = <<12*3=36>>36 jars of grape jelly today.
The grape jelly sells twice as much as the strawberry jelly, so they sold 36 / 2 = <<36/2=18>>18 jars of strawberry jelly today.
#### 18 

Codex Code:
 n_jelly_grape = 6
n_jelly_raspberry = n_jelly_grape / 3.0
n_jelly_plum = n_jelly_raspberry / 2.0
n_jelly_strawberry = n_jelly_plum * 2
answer = n_jelly_strawberry 

Incoder Code:
 n_jelly = 4
n_jelly_twice = 6
n_jelly_thrice = 3
n_jelly_now = n_jelly_twice * n_jelly_thrice
answer =

# Error messages

some questions have error messages returned by the models. Below is an example of what the error message looks like

In [15]:
key = "A Whatsapp group has members sending messages every day sharing about how each one's day was. Last week, 300 messages were sent by the members on Monday, 200 messages on Tuesday, 300 more messages on Wednesday than the previous day, and two times as many messages on Thursday as there were on Wednesday. Calculate the number of messages sent in the Whatsapp group after the four days."
print(incoder_data.get(key)['generated_program']['exec_result'])

ERROR: no answer variable


In [16]:
both_lack_answer = {}
incoder_lacks_answer = {}
codex_lacks_answer = {}

for key in codex_failed:
    c = codex_data.get(key)
    i = incoder_data.get(key)
    if c['generated_program']['exec_result'] == "ERROR: no answer variable":
        if i['generated_program']['exec_result'] == "ERROR: no answer variable":
            both_lack_answer[key] = c
        else:
            codex_lacks_answer[key] = c

for key in incoder_failed:
    if both_lack_answer.get(key) or codex_lacks_answer.get(key):
        pass
    else:
        c = codex_data.get(key)
        i = incoder_data.get(key)
        if i['generated_program']['exec_result'] == "ERROR: no answer variable":
            if c['generated_program']['exec_result'] == "ERROR: no answer variable":
                both_lack_answer[key] = c
            else:
                incoder_lacks_answer[key] = c

In [17]:
print("both lack an answer: ", len(both_lack_answer))
print("only incoder lacks an answer: ", len(incoder_lacks_answer))
print("only codex lacks an answer: ", len(codex_lacks_answer))

both lack an answer:  0
only incoder lacks an answer:  13
only codex lacks an answer:  5


### The following code block looks at answers to the questions that only incoder is missing an answer to

In [18]:
for key in incoder_lacks_answer:
    print("Question: ", key, "\n")
    incoder_code = incoder_data.get(key)
    print("Incoder Code:\n", incoder_code["generated_program"]["code"], "\n")

Question:  A Whatsapp group has members sending messages every day sharing about how each one's day was. Last week, 300 messages were sent by the members on Monday, 200 messages on Tuesday, 300 more messages on Wednesday than the previous day, and two times as many messages on Thursday as there were on Wednesday. Calculate the number of messages sent in the Whatsapp group after the four days. 

Incoder Code:
 n_messages = 300
n_days_after_monday = 300
n_days_after_tuesday = 300
n_days_after_wednesday = 300
n_days_after_thursday = 300
n_days_after_wednesday = 300
n_days_after_wednesday = 300
n_days_after_wednesday = 300
n_days_after_wednesday = 300
n_days_after_wednesday = 300
n_days_after_wednesday = 300
n_days_after_wednesday = 300
n_days_after_wednesday = 300
n_days_after_wednesday = 300
n_days_after_wednesday = 300 

Question:  Bob started out the week with $80. On Monday alone, he spent half the money. On Tuesday, he spent one-fifth of the amount left from Monday. On Wednesday, he 

### The following code block looks at answers to the questions that only codex is missing an answer on

In [19]:
for key in codex_lacks_answer:
    print("Question: ", key, "\n")
    codex_code = codex_data.get(key)
    print("Codex Code:\n", codex_code["generated_program"]["code"], "\n")

Question:  Jack is on the phone with a scammer who says the IRS will arrest Jack if he doesn't send them the codes from 6 $500 Best Buy gift cards and 9 $200 Walmart gift cards. After sending the codes for 1 Best Buy gift card and 2 Walmart gift cards, Jack wises up and hangs up. How many dollars' worth of gift cards can he still return? 

Codex Code:
 n_best_buy_gift_cards = 6
n_walmart_gift_cards = 9
n_best_buy_gift_cards_sent = 1
n_walmart_gift_cards_sent = 2
n_best_buy_gift_cards_left = n_best_buy_gift_cards - n_best_buy_gift_cards_sent
n_walmart_gift_cards_left = n_walmart_gift_cards - n_walmart_gift_cards_sent
value_best_buy_gift_card = 500
value_walmart_gift_card = 200
value_best_buy_gift_cards_left = n_best_buy_gift_cards_left * value_best_buy_gift_card
value_walmart_gift_cards_left = n_walmart_gift_cards_left * value_walmart_gift_card
value_gift_cards_left = value_best_buy_gift_cards_left + value_walmart_gift_cards_left 

Question:  For the funfair, the school organizers order

# Python traces

For some questions, the models return code that varies from the characters required for this dataset (*, /, +, -, =). Examples are below

In [20]:
key = "A man is returning home from work and trying to decide which route to take.  His first route option includes 3 stoplights.  This route will take him 10 minutes if all three lights are green, but each light that is red will add 3 minutes to the trip.  The second route does not include any stoplights and takes 14 minutes.  If the man chooses the first route, how much longer will his trip be if all 3 stoplights are red?"
print(incoder_data.get(key)['generated_program']['code'])

n_stoplights = 3
n_stoplights_red = 3
n_stoplights_green = 3
n_stoplights_total = (n_stoplights_red + n_stoplights_green) * n_stoplights
n_minutes_trip = 10
if n_stoplights_total == n_minutes_trip:
    n_minutes_trip_red = n_minutes_trip - n_stoplights_red
    n_minutes_trip_green = n_minutes_trip - n_stoplights_green
    n_minutes_trip_total = n_minutes_trip_red + n_minutes_trip_green
    answer = n_minutes_trip_total


In [21]:
key = "The school has 14 boys and 10 girls. If 4 boys and 3 girls drop out, how many boys and girls are left?"
print(codex_data.get(key)['generated_program']['code'])

n_boys = 14
n_girls = 10
n_boys_left = n_boys - 4
n_girls_left = n_girls - 3
answer = n_boys_left, n_girls_left


## Regex stuff

Using regex to search for symbols that are in python but not in simple math

### Incoder

In [22]:
import re

incoder_python = {}
incoder_count = 0

for key in incoder_data:
    code = incoder_data.get(key)['generated_program']['code']
    if re.search(":|,|[|]", code):
        incoder_python[key] = code
        print("Question: ", key, "\n")
        print("Code: ", code, "\n")
        result = incoder_data.get(key)['generated_program']['exec_match']
        if result:
            incoder_count += 1
        print("Correct? ", result, "\n")

Question:  Tim decides to do a movie marathon.  The first movie is 2 hours long.  The next movie is 50% longer.  And the last movie is 1 hour shorter than the combined time of the previous 2 movies.  How long was his movie marathon? 

Code:  movie_marathon = [2, 0.5, 1]
total_time = movie_marathon[0] + movie_marathon[1] + movie_marathon[2]
answer = total_time 

Correct?  0.0 

Question:  It is raining outside and Bill puts his empty fish tank in his yard to let it fill with rainwater. It starts raining at 1 pm. 2 inches of rainfall in the first hour. For the next four hours, it rains at a rate of 1 inch per hour. It then rains at three inches per hour for the rest of the day. If the fish tank is 18 inches tall, at what time will it be filled with rainwater. 

Code:  rainwater_time = datetime.time(hour=12, minute=0, second=0)
rainwater_hour = rainwater_time.hour
rainwater_minute = rainwater_time.minute
rainwater_second = rainwater_time.second
rainwater_hour_rain = rainwater_hour * 6
rai

### Codex

In [39]:
codex_python = {}
codex_count = 0

for key in codex_data:
    code = codex_data.get(key)['generated_program']['code']
    if re.search(":|,|[|]", code):
#         incoder_code = incoder_data.get(key)['generated_program']['code']
        codex_python[key] = code
        print("Question: ", key, "\n")
        print("Codex Code: ", code, "\n")
        result = codex_data.get(key)['generated_program']['exec_match']
        print("Incoder Code: ", incoder_code['generated_program']['code'], "\n")
        result = incoder_data.get(key)['generated_program']['exec_match']
        if result:
            codex_count += 1
        print("Correct? ", result, "\n")

Question:  Lionel went to the grocery store and bought 14 boxes of Graham crackers and 15 packets of Oreos. To make an Oreo cheesecake, Lionel needs 2 boxes of Graham crackers and 3 packets of Oreos. After making the maximum number of Oreo cheesecakes he can with the ingredients he bought, how many boxes of Graham crackers would he have left over? 

Codex Code:  n_boxes_graham_crackers = 14
n_packets_oreos = 15
n_boxes_graham_crackers_per_cheesecake = 2
n_packets_oreos_per_cheesecake = 3
n_cheesecakes = min(n_boxes_graham_crackers / n_boxes_graham_crackers_per_cheesecake, n_packets_oreos / n_packets_oreos_per_cheesecake)
n_boxes_graham_crackers_left = n_boxes_graham_crackers - n_cheesecakes * n_boxes_graham_crackers_per_cheesecake
answer = n_boxes_graham_crackers_left 

Incoder Code:  n_pieces = 5
n_pieces_hamburger = 3
n_pieces_french_fries = 4
n_pieces_soda = 5
n_pieces_platter = 2
cost_hamburger = n_pieces_hamburger * 3
cost_french_fries = n_pieces_french_fries * 1.20
cost_soda = n_

### Analysis

In [24]:
both_python = {}

for key in incoder_python:
    answer = codex_python.get(key)
    if answer:
        both_python[key] = answer
        print("Question: ", key, "\n")
        print("Code: ", code, "\n")

In [25]:
print("Number of questions answered with python by incoder: ", len(incoder_python))
print("Number of questions answered with python by codex: ", len(codex_python))
print("Number of questions answered with python by both: ", len(both_python))

Number of questions answered with python by incoder:  15
Number of questions answered with python by codex:  11
Number of questions answered with python by both:  0


In [26]:
print("Number correct in incoder: ", incoder_count)
print("Number correct in codex: ", codex_count)

Number correct in incoder:  0
Number correct in codex:  2


# IDF on error messages

We intend to perform IDF on the questions that don't have an answer.

In [27]:
failed_questions = []

for question in incoder_lacks_answer:
    c = incoder_lacks_answer.get(question)
    failed_questions.append(c['metadata']['question'])
    
for question in codex_lacks_answer:
    c = codex_lacks_answer.get(question)
    failed_questions.append(c['metadata']['question'])

# print(failed_questions)

In [29]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

tfIdfTransformer = TfidfTransformer(use_idf=True)
countVectorizer = CountVectorizer()
wordCount = countVectorizer.fit_transform(failed_questions)
newTfIdf = tfIdfTransformer.fit_transform(wordCount)
df = pd.DataFrame(newTfIdf[0].T.todense(), index=countVectorizer.get_feature_names(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print (df.head(25))

             TF-IDF
messages   0.592975
on         0.265550
day        0.259513
whatsapp   0.197658
300        0.197658
members    0.197658
sent       0.197658
group      0.197658
the        0.187912
were       0.173009
wednesday  0.173009
as         0.155519
thursday   0.098829
about      0.098829
number     0.098829
sharing    0.098829
more       0.098829
previous   0.098829
monday     0.086504
times      0.086504
sending    0.086504
calculate  0.086504
there      0.086504
200        0.086504
four       0.086504




In [None]:
word_count_vector.shape

In [None]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True) 
tfidf_transformer.fit(word_count_vector)

# print idf values 
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(),columns=["idf_weights"]) 

# print(pd.options.display.max_rows)

# sort ascending 
df_idf.sort_values(by=['idf_weights'])
# most_common = df_idf.reset_index().head(25)
# print(most_common)


In [36]:
python_questions = []

for question in incoder_python:
    c = incoder_data.get(question)
    python_questions.append(c['metadata']['question'])
    
for question in codex_python:
    c = codex_data.get(question)
    python_questions.append(c['metadata']['question'])

In [37]:
tfIdfTransformer = TfidfTransformer(use_idf=True)
countVectorizer = CountVectorizer()
wordCount = countVectorizer.fit_transform(python_questions)
newTfIdf = tfIdfTransformer.fit_transform(wordCount)
df = pd.DataFrame(newTfIdf[0].T.todense(), index=countVectorizer.get_feature_names(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print (df.head(25))

            TF-IDF
movie     0.671499
marathon  0.302663
long      0.268599
the       0.217952
is        0.200085
do        0.151331
shorter   0.151331
tim       0.151331
movies    0.134300
decides   0.134300
was       0.134300
last      0.134300
combined  0.134300
longer    0.134300
hours     0.122216
time      0.122216
previous  0.112842
hour      0.112842
next      0.112842
than      0.105184
first     0.105184
50        0.098709
his       0.083727
how       0.066695
to        0.066695




# Looking for "hour" in the dataset

In [59]:
n_correct = 0
correct_time = []
n_total = 0

for key in incoder_data:
    code = incoder_data.get(key)['generated_program']['code']
    if re.search("hour", key):
        incoder_python[key] = code
        print("Question: ", key, "\n")
        print("Code: ", code, "\n")
        result = incoder_data.get(key)['generated_program']['exec_match']
        if result:
            n_correct += 1
            correct_time.append(key)
        print("Correct? ", result, "\n")
        n_total += 1

Question:  A radio show plays for 3 hours a day. They split their show into talking segments, ad breaks and songs. Talking segments last 10 minutes each, ad breaks last 5 minutes each and songs are played throughout the rest of the show. If the radio show includes 3 talking segments and 5 ad breaks in today’s show, how long, in minutes, does the show play songs? 

Code:  n_talking_segments = 3
n_ad_breaks = 5
n_songs = 3
n_minutes_played = n_talking_segments * 10 + n_ad_breaks * 5
n_minutes_played_today = n_minutes_played * n_songs
answer = n_minutes_played_today 

Correct?  0.0 

Question:  A driver travels 30 miles per hour for 3 hours and 25 miles per hour for 4 hours to deliver goods to a town every day from Monday to Saturday. How many miles does the driver travel in a week? 

Code:  miles_per_hour = 30
hours_per_week = 3
miles_per_week = miles_per_hour * hours_per_week
answer = miles_per_week 

Correct?  0.0 

Question:  Jeff was driving to the capital city to attend a conference

In [60]:
print("total: ", n_total, "correct: ", n_correct)
for key in correct_time:
    print(key)

total:  149 correct:  2
John assembles widgets at a factory.  He can make 20 widgets an hour and works for 8 hours a day 5 days a week.  How many widgets does he make a week?
Mark loves to see shows in theaters. He decided to visit the theater at least once a week. One performance lasts 3 hours. The price of the ticket depends on the time spent in the theater and stands at $5 for each hour. How much will Mark spend on visits to the theater in 6 weeks?


In [63]:
n_correct = 0
correct_time = []
n_total = 0

for key in codex_data:
    code = codex_data.get(key)['generated_program']['code']
    if re.search("hour", key):
        codex_python[key] = code
        print("Question: ", key, "\n")
        print("Code: ", code, "\n")
        result = codex_data.get(key)['generated_program']['exec_match']
        if result:
            n_correct += 1
            correct_time.append(key)
        print("Correct? ", result, "\n")
        n_total += 1

Question:  A radio show plays for 3 hours a day. They split their show into talking segments, ad breaks and songs. Talking segments last 10 minutes each, ad breaks last 5 minutes each and songs are played throughout the rest of the show. If the radio show includes 3 talking segments and 5 ad breaks in today’s show, how long, in minutes, does the show play songs? 

Code:  minutes_per_hour = 60
minutes_per_talking_segment = 10
minutes_per_ad_break = 5
minutes_per_show = minutes_per_hour * 3
minutes_per_talking_segments = minutes_per_talking_segment * 3
minutes_per_ad_breaks = minutes_per_ad_break * 5
minutes_per_songs = minutes_per_show - minutes_per_talking_segments - minutes_per_ad_breaks
answer = minutes_per_songs 

Correct?  1.0 

Question:  Lilly and Fiona are cleaning a room. Between them, it takes 8 hours to clean the room. A quarter of the time spent cleaning was by Lilly and Fiona was responsible for the rest of the cleaning. How long, in minutes, was Fiona cleaning? 

Code:  n_

In [64]:
print("total: ", n_total, "correct: ", n_correct)
for key in correct_time:
    print(key)

total:  148 correct:  105
A radio show plays for 3 hours a day. They split their show into talking segments, ad breaks and songs. Talking segments last 10 minutes each, ad breaks last 5 minutes each and songs are played throughout the rest of the show. If the radio show includes 3 talking segments and 5 ad breaks in today’s show, how long, in minutes, does the show play songs?
Lilly and Fiona are cleaning a room. Between them, it takes 8 hours to clean the room. A quarter of the time spent cleaning was by Lilly and Fiona was responsible for the rest of the cleaning. How long, in minutes, was Fiona cleaning?
When the machine is cold, as it is in the first hour of production, it takes 6 minutes to produce each molded flower pot. Thereafter, once it is warm, it takes only 5 minutes to produce each pot. How many additional pots are produced in the last hour of the day, compared to the first?
Evelyn’s family watched 10 hours of television last week. The week before, they watched 8 hours of 

# Expanding the search above to all questions with "year, month, week, day, hour, minute, and second"

In [65]:
n_correct = 0
correct_time = []
n_total = 0

for key in incoder_data:
    code = incoder_data.get(key)['generated_program']['code']
    if re.search("year|month|week|day|hour|minute|second", key):
        incoder_python[key] = code
        result = incoder_data.get(key)['generated_program']['exec_match']
        if result:
            n_correct += 1
            correct_time.append(key)
        n_total += 1

In [66]:
print("total: ", n_total, "correct: ", n_correct)

total:  605 correct:  14


In [67]:
n_correct = 0
correct_time = []
n_total = 0

for key in codex_data:
    code = codex_data.get(key)['generated_program']['code']
    if re.search("year|month|week|day|hour|minute|second", key):
        codex_python[key] = code
        result = codex_data.get(key)['generated_program']['exec_match']
        if result:
            n_correct += 1
            correct_time.append(key)
        n_total += 1

In [68]:
print("total: ", n_total, "correct: ", n_correct)

total:  602 correct:  406
