In [4]:
from datasets import load_from_disk
import pandas as pd

# Load the dataset
dataset = load_from_disk("./processed_data")

# Convert to pandas DataFrame for nice viewing
df = dataset.to_pandas()

# Display basic info
print(f"Total samples: {len(df)}")
print(f"Columns: {list(df.columns)}")
print("\nPerturbation type distribution:")
print(df['perturbation_type'].value_counts().sort_index())

# Show one example (the row at position 10) from each perturbation type
print("\n" + "="*100)
for ptype in ['operand_swap', 'number_substitution',
              'operator_replace', 'computation_plusminus']:
  print(f"\n{ptype.upper()}:")
  print("-"*100)
  sample = df[df['perturbation_type'] == ptype].iloc[10]
  print(f"Question: {sample['question']}")
  print(f"Original:  {sample['answer']}")
  print(f"Perturbed: {sample['perturbed_answer']}")

# Display the DataFrame (will show nicely in Jupyter)
df.head(10)



Total samples: 512
Columns: ['question', 'answer', 'perturbed_answer', 'perturbation_type']

Perturbation type distribution:
perturbation_type
computation_plusminus    128
number_substitution      128
operand_swap             128
operator_replace         128
Name: count, dtype: int64


OPERAND_SWAP:
----------------------------------------------------------------------------------------------------
Question: Susan has 21 cats and Bob has 3 cats. If Susan gives Robert 4 of her cats, how many more cats does Susan have than Bob?
Original:  After giving away four of her cats, Susan has 21 - 4 = 17.
Perturbed: After giving away four of her cats, Susan has 4 - 21 = -17

NUMBER_SUBSTITUTION:
----------------------------------------------------------------------------------------------------
Question: Jay & Gloria were hosting a 4th of July party at their house.  Jay invited 22 people and Gloria invited 36.  They wanted to buy small American flags for everyone.  The craft store was having a sa

Unnamed: 0,question,answer,perturbed_answer,perturbation_type
0,The Tampa Bay Bucs have 13 football players an...,There are 13 - 10 = 3 football players left.,There are 10 - 13 = -3 football players left.,operand_swap
1,There are 30 different nuts in a bowl. If 5/6 ...,30 x 5/6 = 25 nuts were eaten.,30 x 6 / 5 = 1.2 nuts were eaten.,operand_swap
2,Roosevelt High school plays a basketball tourn...,The points Roosevelt high school has for the s...,The points Roosevelt high school has for the s...,operand_swap
3,Four runners ran a combined total of 195 miles...,195 - 51 = 144 = miles the 3 runners ran,51 - 195 = -144 = miles the 3 runners ran,operand_swap
4,Tony exercises every morning by walking 3 mile...,Tony walks 3 miles at 3 miles per hour for 3/3...,Tony walks 3 miles at 3 miles per hour for 3 /...,operand_swap
5,Sally had 14 fries at lunch. Her friend Mark g...,Mark gave Sally 36/3=12 fries,Mark gave Sally 3 / 36 = 0.08333333333333333 f...,operand_swap
6,A group of bedbugs infested an old mattress. E...,"On the third day, there were one-third of the ...","On the third day, there were one-third of the ...",operand_swap
7,55% of Toby's friends are boys and the rest ar...,Toby has 60 friends because 33 / .55 = 60,Toby has 60 friends because 0.55 / 33 = 0.0166...,operand_swap
8,A pad of paper comes with 60 sheets. Evelyn us...,"Evelyn takes two days a week off from work, so...","Evelyn takes two days a week off from work, so...",operand_swap
9,Luther designs clothes for a high fashion comp...,Luther has 10 / 2 = 5 pieces made with cashmere.,Luther has 2 / 10 = 0.2 pieces made with cashm...,operand_swap
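
The truncated cells in the table above make it hard to see what a perturbation actually changed. A minimal sketch of one way to surface the difference, using only the standard library's `difflib`: the helper name `changed_tokens` is hypothetical (not part of the dataset code), and the two strings are copied from row 0 of the table. It splits both answers on whitespace and reports the token runs that differ.

```python
import difflib


def changed_tokens(original: str, perturbed: str) -> list[tuple[str, str]]:
    """Return (original_span, perturbed_span) pairs for each differing token run.

    Hypothetical helper for illustration; tokens are whitespace-separated.
    """
    a, b = original.split(), perturbed.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only 'replace', 'delete' and 'insert' opcodes
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Strings taken from row 0 of the table above
original = "There are 13 - 10 = 3 football players left."
perturbed = "There are 10 - 13 = -3 football players left."

for before, after in changed_tokens(original, perturbed):
    print(f"{before!r} -> {after!r}")
```

An empty span on either side means tokens were only inserted or only deleted at that position. The same helper could be applied row by row to `answer` and `perturbed_answer` in `df`, assuming both columns hold plain strings as they do in the samples shown.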


In [6]:
# Set pandas display options for better viewing
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 50)

# Filter by perturbation type
perturbation_type = 'operand_swap'  # Change this to any type you want
filtered_df = df[df['perturbation_type'] == perturbation_type]

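print(f"Showing {perturbation_type} samples:")
# Jupyter auto-renders only the last expression in a cell, so this mid-cell
# table is not displayed as written; wrap it in display(...) to show it.
filtered_df[['question', 'answer', 'perturbed_answer']].head(10)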

# Compare original vs perturbed for random samples
import random

random.seed(42)
sample_indices = random.sample(range(len(df)), 5)

for idx in sample_indices:
  row = df.iloc[idx]
  print(f"\n{'='*100}")
  print(f"Type: {row['perturbation_type']}")
  print(f"Question: {row['question'][:80]}...")
  print(f"\nOriginal:  {row['answer']}")
  print(f"Perturbed: {row['perturbed_answer']}")

Showing operand_swap samples:

Type: operand_swap
Question: A building has 20 floors. Each floor is 3 meters high, except for the last two f...

Original:  There are 20 - 2 = 18 floors that are each 3 meters high.
Perturbed: There are 2 - 20 = -18 floors that are each 3 meters high.

Type: operand_swap
Question: Hans reserved a table at a fine dining restaurant for twelve people. He has to p...

Original:  Hans’s party includes 12 - 2 = 10 adults and 2 children.
Perturbed: Hans’s party includes 2 - 12 = -10 adults and 2 children.

Type: operator_replace
Question: Savannah is wrapping presents for her friends and family for Christmas. She has ...

Original:  Savannah has 12 gifts to give her friends and family and has already wrapped 3 gifts + 5 gifts = 8 gifts already wrapped with the first two rolls of paper.
Perturbed: Savannah has 12 gifts to give her friends and family and has already wrapped 3 gifts * 5 gifts = 15 gifts already wrapped with the first two rolls of paper.

Type: num