# Hierarchical Chain-of-Thought Training

Fine-tune Qwen2.5-Math-1.5B on the OpenMathReasoning Hierarchical CoT dataset using `HCotSFTTrainer`, a TRL `SFTTrainer` subclass that applies a hierarchical attention mask.

## Setup

Clone the repo (on Colab) or configure `sys.path` so that the `model` and `training` packages are importable.

In [37]:
import sys, os

# When running in Colab, clone the repo and add lib/ to the path
if "google.colab" in sys.modules:
    if not os.path.exists("cs224n-final-project"):
        !git clone https://github.com/anujjamwal/cs224n-final-project.git
    sys.path.insert(0, "cs224n-final-project/lib")
else:
    # Local: the notebook lives inside lib/, so the working directory is the package root.
    # (__file__ is not defined in a notebook, so use the current working directory.)
    sys.path.insert(0, os.getcwd())

In [2]:
%load_ext tensorboard
%tensorboard --logdir ./hcot-qwen2.5-math-1.5b


In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset

from model import generate, masks
from model.model import THOUGHT_TOKEN, SOLUTION_TOKEN, RETURN_TOKEN, SPECIAL_TOKENS
from training.trainer import HCotTrainer

## Load Model and Tokenizer

In [30]:
MODEL_NAME = "Qwen/Qwen2.5-Math-1.5B"
model_repo_id = "anujjamwal/Qwen2.5-Math-1.5B-hcot"

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
  MODEL_NAME,
  dtype=torch.bfloat16,
  device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.add_special_tokens(
    {"additional_special_tokens": SPECIAL_TOKENS}
)

# Set chat template with {% generation %} tag so TRL's assistant_only_loss works
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}"
    "<|im_start|>system\n{{ message['content'] }}<|im_end|>\n"
    "{% elif message['role'] == 'user' %}"
    "<|im_start|>user\n{{ message['content'] }}<|im_end|>\n"
    "{% elif message['role'] == 'assistant' %}"
    "<|im_start|>assistant\n{% generation %}{{ message['content'] }}{% endgeneration %}<|im_end|>\n"
    "{% endif %}"
    "{% endfor %}"
)

base_model.resize_token_embeddings(len(tokenizer))
model = base_model
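
To make the template above easier to inspect, here is a plain-Python mirror of it. This is an illustration only: `render_chatml` is a hypothetical helper, not part of the repository, and it tracks assistant spans by character offset, whereas the `{% generation %}` tag is used to derive a token-level mask.

```python
def render_chatml(messages):
    """Render messages the way the Jinja template does (illustrative sketch).

    Returns the rendered text and a list of (start, end) character spans
    covering the assistant content, i.e. the part wrapped in
    {% generation %} ... {% endgeneration %}.
    """
    text = ""
    assistant_spans = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role in ("system", "user"):
            text += f"<|im_start|>{role}\n{content}<|im_end|>\n"
        elif role == "assistant":
            text += "<|im_start|>assistant\n"
            start = len(text)
            text += content
            # The closing <|im_end|> sits outside the generation block.
            assistant_spans.append((start, len(text)))
            text += "<|im_end|>\n"
    return text, assistant_spans


demo_text, demo_spans = render_chatml([
    {"role": "user", "content": "What is 1 + 1?"},
    {"role": "assistant", "content": "\\boxed{2}"},
])
print(demo_text)
print([demo_text[a:b] for a, b in demo_spans])
```

Only the spans returned here would contribute to the loss when `assistant_only_loss=True`; system and user turns are rendered but masked out.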


## Load and Tokenize Dataset

In [5]:
DATASET_NAME = "anujjamwal/OpenMathReasoning-Sampled-Hierarchical-Cot"
MAX_SEQ_LEN = 2048

dataset = load_dataset(DATASET_NAME, split="train").filter(lambda ex: len(ex['hierarchical_cot']) > 50)
print(f"Dataset size: {len(dataset)}")
print(f"Columns: {dataset.column_names}")
print(dataset[0].keys())


Dataset size: 99
Columns: ['id', 'question', 'expected_answer', 'problem_source', 'generated_solution', 'pass_rate_72b_tir', 'used_in_kaggle', 'hierarchical_cot', 'hierarchical_cot_raw', 'hcot_model']
dict_keys(['id', 'question', 'expected_answer', 'problem_source', 'generated_solution', 'pass_rate_72b_tir', 'used_in_kaggle', 'hierarchical_cot', 'hierarchical_cot_raw', 'hcot_model'])


In [38]:
# Prepare the dataset in TRL's conversational prompt-completion format
def preprocess_function(example):
    prompt = "Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}."
    return {
        "prompt": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": example["question"]}
        ],
        "completion": [
            # "\\boxed" keeps a literal backslash; a bare "\b" would be parsed as a backspace character
            {"role": "assistant", "content": f"<think>{example['hierarchical_cot']}</think>\\boxed{{{example['expected_answer']}}}"}
        ],
    }

completion_dataset = dataset.map(preprocess_function, remove_columns=dataset.column_names)
print(completion_dataset[0].keys())



dict_keys(['prompt', 'completion'])
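
Backslashes in a completion string like the one above deserve a second look, because `\b` inside an ordinary Python string literal is the backspace escape. The short, standalone check below (the values are made up for illustration) shows the difference between `"\boxed"` and `"\\boxed"`, and how the doubled braces behave inside an f-string.

```python
plain = "</think>\boxed{42}"      # "\b" is parsed as a backspace character (\x08)
escaped = "</think>\\boxed{42}"   # "\\" yields one literal backslash

print(repr(plain))
print(repr(escaped))

# Inside an f-string, "{{" and "}}" produce literal braces around the value.
answer = "42"
formatted = f"</think>\\boxed{{{answer}}}"
print(formatted)
```

Printing `repr(...)` of a single mapped example is a quick way to confirm that the training targets contain `\boxed{...}` as intended.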


In [None]:
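import os
# Paste a Hugging Face access token with write permission here (needed for push_to_hub).
# Never commit a real token to version control.
os.environ['HF_TOKEN'] = ''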

In [15]:
class HCotMaskBuilder(masks.MaterialisedMaskMixin):
    def __init__(self, tokenizer):
        self.thought_token_id = tokenizer.convert_tokens_to_ids(THOUGHT_TOKEN)
        self.solution_token_id = tokenizer.convert_tokens_to_ids(SOLUTION_TOKEN)
        self.return_token_id = tokenizer.convert_tokens_to_ids(RETURN_TOKEN)

    def __call__(self, input_ids, padding_mask):
        return self._build_hierarchical_mask(input_ids=input_ids, padding_mask=padding_mask)
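
`_build_hierarchical_mask` comes from the repository's `masks` module and is not shown in this notebook. As rough orientation only, the sketch below builds a plain causal-plus-padding mask in pure Python. The assumption, not confirmed here, is that the hierarchical mask restricts attention further based on the positions of the thought, solution, and return tokens. `causal_padding_mask` is a hypothetical name.

```python
def causal_padding_mask(padding_mask):
    """Build a materialised (seq_len x seq_len) boolean attention mask.

    padding_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    Entry [i][j] is True when query position i may attend to key position j:
    both positions are real tokens and j does not lie in the future.
    """
    n = len(padding_mask)
    return [
        [bool(padding_mask[i]) and bool(padding_mask[j]) and j <= i for j in range(n)]
        for i in range(n)
    ]


# Three positions, the last one is padding.
for row in causal_padding_mask([1, 1, 0]):
    print(row)
```

A materialised mask of this kind has one row per query token, which is why the trainer replaces the 1-D padding mask in `inputs['attention_mask']` rather than passing both.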

In [None]:
torch.cuda.empty_cache()

## SFT Trainer

In [11]:
!pip install trl peft


In [None]:
from trl import SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(
    lora_alpha=128,
    lora_dropout=0.05,
    r=256,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    model_init_kwargs={"dtype": torch.bfloat16},
    hub_model_id=model_repo_id,
    packing=True,
    assistant_only_loss=True,
    num_train_epochs=30,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    bf16=True,
    logging_steps=50,
    save_strategy="epoch",
    save_total_limit=5,
    report_to="tensorboard",
    push_to_hub=True,
)

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
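
The deprecation warning suggests switching to `warmup_steps`, which requires an estimate of the total number of optimizer steps. The helper below is a back-of-the-envelope sketch with a hypothetical name (`estimate_warmup_steps`). It assumes a single device, and because `packing=True` merges examples into fewer sequences, the 99 used here is only an illustrative upper bound on the sequence count.

```python
import math


def estimate_warmup_steps(num_sequences, per_device_batch, grad_accum,
                          epochs, warmup_ratio, num_devices=1):
    """Approximate the effective batch size, total optimizer steps and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_sequences / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


# Values taken from the SFTConfig above; 99 is the unpacked dataset size.
effective_batch, total_steps, warmup_steps = estimate_warmup_steps(
    num_sequences=99, per_device_batch=2, grad_accum=4, epochs=30, warmup_ratio=0.1
)
print(effective_batch, total_steps, warmup_steps)
```

The exact step count reported by the trainer may differ slightly, since it depends on how the final partial accumulation step of each epoch is handled and on how many packed sequences are produced.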




In [13]:
from torch import nn
from typing import Any, Callable
from trl import SFTTrainer

class HCotSFTTrainer(SFTTrainer):
    def __init__(self, attention_mask_func: Callable, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._attention_mask_func = attention_mask_func
    
    def compute_loss(
        self,
        model: nn.Module,
        inputs: dict[str, torch.Tensor | Any],
        return_outputs: bool = False,
        num_items_in_batch: torch.Tensor | int | None = None,
    ) -> torch.Tensor | tuple[torch.Tensor, Any]:
        padding_mask = inputs.get('attention_mask', None)
        attention_mask = self._attention_mask_func(input_ids=inputs["input_ids"], padding_mask=padding_mask)
        inputs['attention_mask'] = attention_mask
        return super().compute_loss(model, inputs, return_outputs, num_items_in_batch)

In [None]:
trainer = HCotSFTTrainer(
  model=model,
  train_dataset=completion_dataset,
  attention_mask_func=HCotMaskBuilder(tokenizer),
  args=training_args,
  # peft_config=lora_config,
  processing_class=tokenizer,
)

trainer.train()

In [36]:
output_path = "./hcot-qwen2.5-Math-1.5b/final"
trainer.save_model(output_path)
tokenizer.save_pretrained(output_path)
trainer.push_to_hub()

No files have been modified since last commit. Skipping to prevent empty commit.



CommitInfo(commit_url='https://huggingface.co/anujjamwal/Qwen2.5-Math-1.5B-hcot/commit/5d5f951886d59f7aedee4be9f12b1bdc5077b2c9', commit_message='End of training', commit_description='', oid='5d5f951886d59f7aedee4be9f12b1bdc5077b2c9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/anujjamwal/Qwen2.5-Math-1.5B-hcot', endpoint='https://huggingface.co', repo_type='model', repo_id='anujjamwal/Qwen2.5-Math-1.5B-hcot'), pr_revision=None, pr_num=None)

In [39]:
example = completion_dataset[9]
messages = example["prompt"]  # list of message dicts, e.g. [{"role": "user", "content": "..."}]
model.eval()

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

thought_token_id = tokenizer.convert_tokens_to_ids(THOUGHT_TOKEN)
solution_token_id = tokenizer.convert_tokens_to_ids(SOLUTION_TOKEN)
return_token_id = tokenizer.convert_tokens_to_ids(RETURN_TOKEN)

gen_out = model.generate(
    **inputs,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=False,
    custom_generate=generate.generate,
)

print("\n".join(tokenizer.batch_decode(gen_out)).replace("\\n", "\n"))

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|>
<|im_start|>user
The radius of the base of a cylinder is increasing at a rate of 0.5 cm/sec while its height is decreasing at a rate of 1.4 cm/sec. At what rate is the volume of the cylinder changing when the radius is 50 cm and the height is 80 cm?<|im_end|>
 WithTitle
<think>  scrição Okay, let's see. I need to find the rate at which the volume of a cylinder is changing when the radius is 50 cm and the height is 80 cm. The problem says the radius is increasing at 0.5 cm per second and the height is decreasing at 1.4 cm per second. Hmm, so related rates problem. I remember that for these kinds of problems, you need to relate the rates of change using derivatives.

 AMAGE The volume V of a cylinder is given by the formula V = πr²h, right? Where r is the radius and h is the height. Since both r and h are changing with time, I need to find dV/dt, the rate of chang

In [40]:
gen_out2 = model.generate(
    **inputs,
)

print("\n".join(tokenizer.batch_decode(gen_out2)).replace("\\n", "\n"))

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|>
<|im_start|>user
The radius of the base of a cylinder is increasing at a rate of 0.5 cm/sec while its height is decreasing at a rate of 1.4 cm/sec. At what rate is the volume of the cylinder changing when the radius is 50 cm and the height is 80 cm?<|im_end|>
 WithTitle
<think>  scrição Okay, let's see. I need to find the rate at which the volume of a cylinder is changing when the radius is 50 cm and the height is 80 cm. The problem says the radius is increasing at 0.5 cm per second and the height is decreasing at 1.4 cm per second. Hmm, so related rates problem. I remember that for these kinds of problems, you need to relate the rates of change using derivatives.

 AMAGE First, let me recall the formula for the volume of a cylinder. The volume V is πr²h, right? Where r is the radius and h is the height. Since both r and h are changing with time, I need to find 
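
Since the system prompt asks for the final answer inside `\boxed{}`, one way to compare a generation with `expected_answer` is to pull out the last boxed expression. The helper below is a minimal sketch with a hypothetical name (`extract_boxed`), not something the repository is known to provide. It matches braces by counting so that nested LaTeX such as `\frac{1}{2}` survives.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    content = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(ch)
    # Braces never closed, e.g. the generation was cut off mid-answer.
    return None


print(extract_boxed("So the rate is \\boxed{\\frac{1}{2}} cm/sec."))
print(extract_boxed("Since both r and h are changing with time"))
```

Returning `None` for a missing or unclosed box makes truncated generations easy to count separately from wrong answers.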