In [1]:
import os
import json
import random

import pandas as pd

In [60]:
AQUA_PATH = '../data/cot_data/aqua_train.tsv'
CREAK_PATH = '../data/cot_data/creak_train.tsv'
ECQA_PATH = '../data/cot_data/ecqa_train.tsv'
ESNLI_PATH = '../data/cot_data/esnli_train.tsv'
GSM8K_PATH = '../data/cot_data/gsm8k_train.tsv'
QASC_PATH = '../data/cot_data/qasc_train.tsv'
QED_PATH = '../data/cot_data/qed_train.tsv'
SENSEMAKING_PATH = '../data/cot_data/sensemaking_train.tsv'
STRATEGYQA_PATH = '../data/cot_data/strategyqa_train.tsv'

In [54]:
def get_data_info(data_path):
    """
    Load a tab-separated chain-of-thought dataset from `data_path` and print a summary:
    shape, memory usage, per-column null counts, any duplicated rows, and the first example.
    """
    print(f'Data path: {data_path}')
    data = pd.read_csv(data_path, sep='\t', header=None)
    data.columns = ['question', 'answer', 'rationale']

    # Get shape of dataframe
    shape = data.shape
    print(f"Shape: {shape}")

    # Get memory usage of dataframe
    memory_usage = data.memory_usage(deep=True).sum()
    print(f"Memory Usage: {memory_usage / 1024**2:.2f} MB")

    # Get the number of null values in each column
    null_counts = data.isnull().sum()
    print("Null values:")
    for col, count in null_counts.items():
        print(f"{col}: {count}")

    # Check for duplicates in the DataFrame
    duplicates = data[data.duplicated()]
    if len(duplicates) > 0:
        print("Duplicates:")
        print(duplicates)

    # Show the first example (positional access, so it does not depend on index labels)
    sample = data.iloc[0]
    print(f"Question: {sample.question}")
    print(f"Answer: {sample.answer}")
    print(f"Rationale: {sample.rationale}")

In [55]:
get_data_info(AQUA_PATH)

Data path: ../data/cot_data/aqua_train.tsv
Shape: (2728, 3)
Memory Usage: 1.35 MB
Null values:
question: 0
answer: 0
rationale: 0
Question: Rs. 5600 is divided into three parts A, B and C. How much A is more than C if their ratio is 1/7:1/7:1/14?\nOptions:\n(A) 300\n(B) 992\n(C) 1120\n(D) 552\n(E) 312
Answer: (C)
Rationale: 1/7:1/7:1/14 = 2:2:1\n1/5*5600 = 1120\n2240-1120 = 1120


In [56]:
get_data_info(CREAK_PATH)

Data path: ../data/cot_data/creak_train.tsv
Shape: (6915, 3)
Memory Usage: 2.56 MB
Null values:
question: 0
answer: 0
rationale: 0
Question: Claim: "Only people named Floyd wearing pink are allowed to attend Pink Floyd concerts."\nIs the claim above correct, and can it be verified by human common sense and without a web search?\nOptions:\n- yes\n- no
Answer: no
Rationale: The rock group would not be as popular is they had such requirements for their concerts.


In [57]:
get_data_info(ECQA_PATH)

Data path: ../data/cot_data/ecqa_train.tsv
Shape: (7112, 3)
Memory Usage: 3.03 MB
Null values:
question: 0
answer: 0
rationale: 0
Question: What might a person see at the scene of a brutal killing?\nOptions:\n- bloody mess\n- pleasure\n- being imprisoned\n- feeling of guilt\n- cake
Answer: bloody mess
Rationale: Bloody mess is covered or stained with blood. A person might see a bloody mess at the scene of a brutal killing.


In [58]:
get_data_info(ESNLI_PATH)

Data path: ../data/cot_data/esnli_train.tsv
Shape: (36174, 3)
Memory Usage: 15.98 MB
Null values:
question: 0
answer: 0
rationale: 0
Question: Premise: "Man scaling wall with fire in hand."\nHypothesis: "A man holding fire in his hand is trying to escape by scaling a wall."\nDo we know that the hypothesis entailed by the premise?
Answer: it is not possible to tell
Rationale: Just because a man is scaling a wall doesn't imply he is trying to escape.


In [59]:
get_data_info(GSM8K_PATH)

Data path: ../data/cot_data/gsm8k_train.tsv
Shape: (7473, 3)
Memory Usage: 4.50 MB
Null values:
question: 0
answer: 0
rationale: 0
Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: 72
Rationale: Natalia sold 48 / 2 = 24 clips in May. Natalia sold 48 + 24 = 72 clips altogether in April and May.


In [61]:
get_data_info(QASC_PATH)

Data path: ../data/cot_data/qasc_train.tsv
Shape: (1084, 3)
Memory Usage: 0.57 MB
Null values:
question: 0
answer: 0
rationale: 0
Question: What is the process by which living things give rise to offspring?\nOptions:\n- (A) DNA\n- (B) photosynthesis\n- (C) bird\n- (D) sex\n- (E) subtraction\n- (F) gametes\n- (G) eggs\n- (H) ovum
Answer: (D)
Rationale: Reproduction is the process by which living things give rise to offspring. Sex equals reproduction. Sex is the process by which living things give rise to offspring.


In [62]:
get_data_info(QED_PATH)

Data path: ../data/cot_data/qed_train.tsv
Shape: (5154, 3)
Memory Usage: 5.45 MB
Null values:
question: 0
answer: 1
rationale: 0
Question: Passage: Webbed toes is the common name for syndactyly affecting the feet. It is characterised by the fusion of two or more digits of the feet. This is normal in many birds, such as ducks; amphibians, such as frogs; and mammals, such as kangaroos. In humans it is considered unusual, occurring in approximately one in 2,000 to 2,500 live births.\n\nQuestion: Based on this passage, what is the medical term for webbed toes?
Answer: syndactyly affecting the feet
Rationale: The relevant information is: Webbed toes is the common name for syndactyly affecting the feet.
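
The output above reports one null in the `answer` column. One possible cause, which is an assumption and not something the notebook confirms, is that `pd.read_csv` parsed a literal answer such as `NA` or `null` as a missing value, since those strings are treated as NaN by default. The sketch below shows how to locate null rows and how `keep_default_na=False` changes the parsing; the TSV content is made up for illustration.

```python
import io

import pandas as pd

# Illustrative TSV (not the QED data): the second answer is the literal string "NA"
tsv = "q1\tyes\tr1\nq2\tNA\tr2\n"
cols = ['question', 'answer', 'rationale']

# Default parsing: "NA" becomes a missing value
default = pd.read_csv(io.StringIO(tsv), sep='\t', header=None, names=cols)
null_rows = default[default.isnull().any(axis=1)]
print(null_rows)

# With keep_default_na=False the string is kept as-is
literal = pd.read_csv(io.StringIO(tsv), sep='\t', header=None, names=cols,
                      keep_default_na=False)
print(literal.answer.tolist())
```

If the row is still null after re-reading the file this way, the answer is genuinely missing in the source file.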


In [63]:
get_data_info(SENSEMAKING_PATH)

Data path: ../data/cot_data/sensemaking_train.tsv
Shape: (6070, 3)
Memory Usage: 2.43 MB
Null values:
question: 0
answer: 0
rationale: 0
Question: Of the following two sentences, which one is against common sense?\nOptions:\n- Sentence A: "He poured orange juice on his cereal."\n- Sentence B: "He poured milk on his cereal."\n
Answer: Sentence A
Rationale: Orange juice does not taste good on cereal.


In [64]:
get_data_info(STRATEGYQA_PATH)

Data path: ../data/cot_data/strategyqa_train.tsv
Shape: (2061, 3)
Memory Usage: 0.81 MB
Null values:
question: 0
answer: 0
rationale: 0
Duplicates:
                                               question answer   
1469  Can you find Bob Marley's face in most smoke s...    yes  \

                                              rationale  
1469  Bob Marley's face is on the packaging of a pop...  
Question: Are more people today related to Genghis Khan than Julius Caesar?
Answer: yes
Rationale: Julius Caesar had three children. Genghis Khan had sixteen children. Modern geneticists have determined thatout of every 200 men today has DNA that can be traced to Genghis Khan.
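
The duplicate check above uses `data.duplicated()`, which flags only the later copy of a repeated row (here row 1469). A minimal sketch, on toy data, of how to view every copy and then drop the repeats; whether duplicates should be removed before training is a choice the notebook does not make.

```python
import pandas as pd

# Toy data for illustration only: rows 0 and 2 are identical
toy = pd.DataFrame({
    'question': ['q1', 'q2', 'q1'],
    'answer': ['yes', 'no', 'yes'],
    'rationale': ['r1', 'r2', 'r1'],
})

# keep=False marks all copies, including the first occurrence
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```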
