# Data Preparation

We load the data, segment the chain of thought, and publish the result to a Hugging Face account.

## Setup

In [None]:
%load_ext autoreload

In [69]:
import logging
import os

from dataprep import segment
from datasets import load_dataset, Dataset
from huggingface_hub import notebook_login
from dotenv import load_dotenv

In [None]:
load_dotenv("/home/anuj/cs224n/cs224n-final-project/.env")
notebook_login()

logger = logging.getLogger(__name__)

True

In [None]:
os.environ["ANTHROPIC_API_KEY"] = ""

## Load the Datasets

### OpenMathReasoning
OpenMathReasoning is a dataset of mathematical problems that require multi-step reasoning to solve. Its schema is listed below. We load only the CoT subset, which is stored in the `data/cot-*` files.

Fields:
  - problem: Problem statement extracted from AoPS forums and refined with Qwen2.5-32B-Instruct
  - generated_solution: Synthetically generated solution using either DeepSeek-R1 or QwQ-32B
  - generation_model: DeepSeek-R1 or QwQ-32B
  - problem_type: Can be one of "has_answer_extracted", "no_answer_extracted" and "converted_proof", depending on whether we were able to extract the answer or whether this is a proof question converted to an answer question.
  - expected_answer: Extracted answer if "problem_type" is "has_answer_extracted". Otherwise this is the majority-voting answer across all generated solutions for this problem.
  - problem_source: States the corresponding AoPS forum (e.g. "aops_c6_high_school_olympiads") or "MATH_training_set" as we also include a small set of generations from MATH.
  - inference_mode: "cot", "tir" or "genselect"
  - pass_rate_72b_tir: Pass rate out of 32 generations for Qwen2.5-Math-72B-Instruct run in TIR mode. This attribute is only available when "problem_type" is "has_answer_extracted" and is set to "n/a" for other cases.
  - used_in_kaggle: Whether the instance was used in training the winning model for AIMO-2 Kaggle competition or not. We had used 2.2M CoT and 15K TIR solutions for training the OpenMath-Nemotron-14B-Kaggle model. Note that for training the OpenMath-Nemotron models, we used all the CoT, TIR, and GenSelect data, except for the TIR subset used in Kaggle.
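
The schema above suggests a simple per-record filter. The following is a minimal sketch, not part of the original notebook: the helper names `parse_pass_rate` and `is_usable_cot` and the `max_pass_rate` parameter are hypothetical, and the code assumes `pass_rate_72b_tir` arrives either as a numeric string or as a missing marker such as "n/a".

```python
from typing import Any, Mapping, Optional


def parse_pass_rate(value: Any) -> Optional[float]:
    """Return the pass rate as a float, or None when it is missing or "n/a"."""
    if value is None:
        return None
    try:
        rate = float(value)
    except (TypeError, ValueError):
        # Covers "n/a", empty strings, and other non-numeric markers.
        return None
    # float("nan") parses successfully, so treat NaN as missing too.
    return None if rate != rate else rate


def is_usable_cot(record: Mapping[str, Any], max_pass_rate: float = 1.0) -> bool:
    """Keep CoT records that have an extracted answer and a known pass rate."""
    if record.get("inference_mode") != "cot":
        return False
    if record.get("problem_type") != "has_answer_extracted":
        return False
    rate = parse_pass_rate(record.get("pass_rate_72b_tir"))
    return rate is not None and rate <= max_pass_rate
```

Because the helper only reads dictionary keys, it could plausibly be passed to the `filter` method of the streamed dataset before any records are materialized; whether such filtering is desirable for this project is a separate decision.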
  

In [9]:
openmath_ds = load_dataset("nvidia/OpenMathReasoning", data_files="data/cot-*", split="train", streaming=True)

Resolving data files:   0%|          | 0/144 [00:00<?, ?it/s]


## Preparation

Next, we use an LLM to rewrite each chain of thought in the desired format, which reflects hierarchical thinking. We prompt a reasoning LLM (Claude Opus 4.6 in our case) to generate the dataset.

In [None]:
examples = Dataset.from_list(list(openmath_ds.take(10)))

In [None]:
display(examples.to_pandas())

Unnamed: 0,expected_answer,problem_type,problem_source,generation_model,pass_rate_72b_tir,problem,generated_solution,inference_mode,used_in_kaggle
0,\(\frac{C_{n_1}^{a_1} \cdot C_{n_2}^{a_2} \cdo...,has_answer_extracted,aops_c6_high_school_olympiads,DeepSeek-R1,0.65625,Given a group of \( N \) balls consisting of \...,"<think>\nOkay, so I need to find the probabili...",cot,True
1,\frac{n(n-1)}{2},no_answer_extracted,aops_c4_high_school_math,DeepSeek-R1,,How many lines can be drawn that are equidista...,"<think>\nOkay, so the problem is asking how ma...",cot,True
2,\( f(x) = f(1)x \),has_answer_extracted,aops_c6_high_school_olympiads,DeepSeek-R1,0.5625,Find all functions \( f: \mathbb{R} \to \mathb...,"<think>\nOkay, let's try to solve this functio...",cot,True
3,2,has_answer_extracted,aops_c6_high_school_olympiads,DeepSeek-R1,0.625,Find the sum of the roots of the equation \((x...,"<think>\nOkay, let's see. I need to find the s...",cot,True
4,32,has_answer_extracted,aops_c6_high_school_olympiads,QwQ-32B,0.0,Determine how many 1000 digit numbers \( A \) ...,"<think>\nOkay, so I need to figure out how man...",cot,False
5,\(\frac{2}{\pi} + \frac{32}{9\pi^2}\),has_answer_extracted,aops_c7_college_math,DeepSeek-R1,0.4375,Calculate the integral\n\n\[\n\int^{\frac{3\pi...,"<think>\nOkay, let me try to solve this integr...",cot,True
6,1,converted_proof,aops_c6_high_school_olympiads,QwQ-32B,,"In $\triangle ABC$ with incenter $I$, points $...","<think>\nAlright, let me try to tackle this ge...",cot,False
7,\(\frac{(3-\sqrt{3})(2-\sqrt{3})^{2016}+(3+\sq...,has_answer_extracted,aops_c6_high_school_olympiads,DeepSeek-R1,0.0,"Let \( x_0 = 1 \), and \( x_{n+1} = 2x_n + \sq...","<think>\nOkay, let's see. The problem is about...",cot,True
8,$\sqrt[2012]{2013!} > \sqrt[2013]{2012!}$,has_answer_extracted,aops_c6_high_school_olympiads,DeepSeek-R1,0.0625,"Which is greater, $\sqrt[2012]{2013!}$ or $\sq...","<think>\nOkay, so I need to figure out whether...",cot,True
9,20,has_answer_extracted,aops_c6_high_school_olympiads,DeepSeek-R1,0.9375,"On average, how long will you have to flip a c...","<think>\nOkay, so I need to figure out the exp...",cot,True
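
The preview shows that every `generated_solution` opens with a `<think>` tag. The sketch below separates the reasoning trace from the text that follows it. It is an illustration only: the helper `split_reasoning` is hypothetical, and the closing `</think>` tag is an assumption, since the truncated preview does not show the end of any solution.

```python
from typing import Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"  # assumed closing tag; not visible in the truncated preview


def split_reasoning(solution: str) -> Tuple[str, str]:
    """Split a generated solution into (reasoning, final_text).

    If no complete <think>...</think> block is found, the reasoning is
    returned empty and the whole text is treated as the final part.
    """
    start = solution.find(THINK_OPEN)
    if start == -1:
        return "", solution.strip()
    end = solution.find(THINK_CLOSE, start + len(THINK_OPEN))
    if end == -1:
        return "", solution.strip()
    reasoning = solution[start + len(THINK_OPEN):end].strip()
    final_text = solution[end + len(THINK_CLOSE):].strip()
    return reasoning, final_text
```

For example, `split_reasoning(examples[0]["generated_solution"])` would give the reasoning and the final write-up as two strings, provided the assumption about the closing tag holds for that record.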


In [70]:
from dataprep import segment

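def generate_hierarchical_cot(x):
  """Attach a hierarchical chain of thought to one record; use "" on failure."""
  try:
    x["hierarchical_cot"] = segment.segment_chain_of_thought_with_claude(x["problem"], x["generated_solution"], x["expected_answer"])
  except Exception as e:
    # Log and fall back to an empty string so one failure does not abort the whole map.
    logger.error("segment for record %s failed with error: %s", x["problem"], e)
    x["hierarchical_cot"] = ""
  return x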
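
The function above records an empty string whenever the LLM call fails. Since API calls can fail transiently, one option is to retry before giving up. The following is a minimal sketch using only the standard library. `call_with_retries` and its parameters are hypothetical and are not part of `dataprep`, and it is an assumption that the segmentation call is safe to repeat.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff; re-raise the last error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Inside `generate_hierarchical_cot`, the segmentation call could then be wrapped in a zero-argument `lambda` and passed as `fn`, leaving the existing `except` branch as the final fallback once all attempts are exhausted.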

In [71]:
updated_examples = examples.map(generate_hierarchical_cot, num_proc=1)

Map (num_proc=1):   0%|          | 0/10 [00:00<?, ? examples/s]

In [72]:
updated_examples.save_to_disk("OpenMathReasoning-updated")

Saving the dataset (0/1 shards):   0%|          | 0/10 [00:00<?, ? examples/s]

In [73]:
updated_examples.push_to_hub("anujjamwal/openmathreasoning-hierarchical-cot")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ? shards/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Processing Files (0 / 0): |          |  0.00B /  0.00B            

New Data Upload: |          |  0.00B /  0.00B            

CommitInfo(commit_url='https://huggingface.co/datasets/anujjamwal/openmathreasoning-hierarchical-cot/commit/5a90928889b0648e3fa0876f48a590c3c9c63a99', commit_message='Upload dataset', commit_description='', oid='5a90928889b0648e3fa0876f48a590c3c9c63a99', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/anujjamwal/openmathreasoning-hierarchical-cot', endpoint='https://huggingface.co', repo_type='dataset', repo_id='anujjamwal/openmathreasoning-hierarchical-cot'), pr_revision=None, pr_num=None)