# Hierarchical Chain-of-Thought Training

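Fine-tune `nvidia/OpenMath-Nemotron-1.5B` on the OpenMathReasoning Hierarchical CoT dataset using `HCotSFTTrainer`, an `SFTTrainer` subclass that applies a hierarchical attention mask during the forward pass.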

## Setup

Clone the repo (Colab) or configure `sys.path` so that `model` and `training` packages are importable.

In [1]:
import sys, os

# When running in Colab, clone the repo and add lib/ to the path
if "google.colab" in sys.modules:
    if not os.path.exists("cs224n-final-project"):
        !git clone https://github.com/anujjamwal/cs224n-final-project.git
    else:
        !cd cs224n-final-project && git pull
    sys.path.insert(0, "cs224n-final-project/lib")
else:
    # Local: the notebook lives inside lib/, so the working directory is the package root.
    # `__file__` is not defined inside a notebook, so use the working directory directly.
    sys.path.insert(0, os.getcwd())

Cloning into 'cs224n-final-project'...
remote: Enumerating objects: 105, done.
remote: Counting objects: 100% (105/105), done.
remote: Compressing objects: 100% (70/70), done.
remote: Total 105 (delta 51), reused 85 (delta 31), pack-reused 0 (from 0)
Receiving objects: 100% (105/105), 201.96 KiB | 28.85 MiB/s, done.
Resolving deltas: 100% (51/51), done.
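
Before importing, it can help to confirm that the path setup actually made the packages visible. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the repo.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `model` and `training` are the packages this notebook expects under lib/.
missing = missing_packages(["model", "training"])
if missing:
    print(f"Not importable yet: {missing} (check the sys.path entry above)")
else:
    print("All packages are importable.")
```

An empty list means the later `from model import ...` and `from training.trainer import ...` lines should at least resolve their packages.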


In [2]:
!pip install trl peft
# !pip install flash-attn --no-build-isolation 

Collecting trl
  Downloading trl-0.29.0-py3-none-any.whl.metadata (11 kB)
Downloading trl-0.29.0-py3-none-any.whl (528 kB)
Installing collected packages: trl
Successfully installed trl-0.29.0


In [3]:
%load_ext tensorboard
%tensorboard --logdir ./hcot-nemotron-Math-1.5b

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset

from model import generate, masks
from model.model import THOUGHT_TOKEN, SOLUTION_TOKEN, RETURN_TOKEN, SPECIAL_TOKENS
from training.trainer import HCotTrainer

## Load Model and Tokenizer

In [None]:
# MODEL_NAME = "Qwen/Qwen2.5-Math-1.5B"  # earlier base model; overridden by the assignment below
MODEL_NAME = "nvidia/OpenMath-Nemotron-1.5B"
model_repo_id = "anujjamwal/OpenMath-Nemotron-1.5B-hcot"

In [None]:
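import os
# Paste a Hugging Face access token with write permission here (needed for push_to_hub).
# Do not commit a real token to version control.
os.environ['HF_TOKEN'] = ''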

In [7]:
model = AutoModelForCausalLM.from_pretrained(
  MODEL_NAME,
  dtype=torch.bfloat16,
  device_map='auto',
  # TODO(FLASH ATT)
  # attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


In [None]:
# tokenizer.all_special_tokens


In [11]:
tokenizer.add_special_tokens(
    {"additional_special_tokens": tokenizer.all_special_tokens + SPECIAL_TOKENS + ['<think>','</think>']}
)

# # Set chat template with {% generation %} tag so TRL's assistant_only_loss works
# tokenizer.chat_template = (
#     "{% for message in messages %}"
#     "{% if message['role'] == 'system' %}"
#     "<|im_start|>system\n{{ message['content'] }}<|im_end|>\n"
#     "{% elif message['role'] == 'user' %}"
#     "<|im_start|>user\n{{ message['content'] }}<|im_end|>\n"
#     "{% elif message['role'] == 'assistant' %}"
#     "<|im_start|>assistant\n{% generation %}{{ message['content'] }}{% endgeneration %}<|im_end|>\n"
#     "{% endif %}"
#     "{% endfor %}"
# )
tokenizer.chat_template = """\
{%- if messages[0]['role'] == 'system' -%}
<|im_start|>system
{{ messages[0]['content'] | trim }}<|im_end|>
{%- else -%}
<|im_start|>system
<|im_end|>
{%- endif -%}
{%- for message in messages -%}
{%- if (message.role == 'user') or (message.role == 'system' and not loop.first) or (message.role == 'assistant') -%}
<|im_start|>{{ message.role }}
{%- if message['role'] == 'assistant' %}
{% generation %}{{ message['content'] | trim }}<|im_end|>{% endgeneration %}
{%- else %}
{{ message['content'] | trim }}<|im_end|>
{%- endif %}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt %}
<|im_start|>assistant
{%- endif -%}
"""

model.resize_token_embeddings(len(tokenizer))


Embedding(151670, 1536)
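
The `{% generation %}` / `{% endgeneration %}` markers in the template let the tokenizer report which tokens belong to assistant turns, so that `assistant_only_loss=True` can restrict the loss to them. The sketch below shows the idea on toy data only. It assumes the usual convention that a label of `-100` is ignored by the cross-entropy loss; the token IDs and the helper name `mask_labels` are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_labels(input_ids, assistant_mask):
    """Keep labels only where assistant_mask is truthy; ignore everything else."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


# Toy sequence: three prompt tokens followed by three assistant tokens.
toy_ids = [11, 12, 13, 21, 22, 23]
toy_assistant_mask = [0, 0, 0, 1, 1, 1]
print(mask_labels(toy_ids, toy_assistant_mask))
```

Only the last three positions keep their labels, which is why the system and user turns do not contribute to the training loss.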

In [12]:
# Initialize new special token embeddings from semantically similar seed words.
# After resize_token_embeddings(), new rows are random — this gives them a
# meaningful starting point so the model doesn't have to learn from scratch.

seed_map = {
    "[THOUGHT]":  "subproblem start",
    "[SOLUTION]": "summary of solution",
    "[RETURN]":   "subproblem return",
    "<think>":    "thinking start",
    "</think>":   "thinking complete",
}

with torch.no_grad():
    embed_layer = model.get_input_embeddings()
    for special_tok, seed_word in seed_map.items():
        # Get the ID of the newly added special token
        special_id = tokenizer.convert_tokens_to_ids(special_tok)
        # Tokenize the seed word — may produce multiple sub-tokens
        seed_ids = tokenizer.encode(seed_word, add_special_tokens=False)
        # Average the embeddings of the seed sub-tokens
        seed_embeds = embed_layer.weight[seed_ids]          # (n_subtokens, hidden_dim)
        avg_embed = seed_embeds.mean(dim=0)                 # (hidden_dim,)
        # Write into the embedding table
        embed_layer.weight[special_id] = avg_embed
        print(f"  {special_tok:12s} (id={special_id}) <- avg of '{seed_word}' tokens {seed_ids}")

    # Also copy input embeddings to the output (lm_head) if they are NOT tied,
    # so the model can predict these tokens from the start.
    lm_head = model.get_output_embeddings()
    if lm_head is not None and lm_head.weight.data_ptr() != embed_layer.weight.data_ptr():
        for special_tok, seed_word in seed_map.items():
            special_id = tokenizer.convert_tokens_to_ids(special_tok)
            lm_head.weight[special_id] = embed_layer.weight[special_id]
        print("  (also copied to untied lm_head)")

print("Special token embeddings initialized.")

  [THOUGHT]    (id=151665) <- avg of 'subproblem start' tokens [1966, 34586, 1191]
  [SOLUTION]   (id=151666) <- avg of 'summary of solution' tokens [1708, 315, 6291]
  [RETURN]     (id=151667) <- avg of 'subproblem return' tokens [1966, 34586, 470]
  <think>      (id=151668) <- avg of 'thinking start' tokens [82260, 1191]
  </think>     (id=151669) <- avg of 'thinking complete' tokens [82260, 4583]
Special token embeddings initialized.
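
The averaging step can be checked by hand on a tiny example. This is a pure-Python sketch with a made-up two-dimensional embedding table; only the token IDs are taken from the log above, and `mean_vector` is a hypothetical helper, not something from the repo.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized vectors."""
    if not rows:
        raise ValueError("need at least one vector")
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Hypothetical 2-d embeddings for the three sub-tokens of 'subproblem start'.
toy_table = {
    1966: [1.0, 0.0],
    34586: [0.0, 1.0],
    1191: [2.0, 2.0],
}
seed_ids = [1966, 34586, 1191]

# The new special token starts at the centroid of its seed sub-tokens.
print(mean_vector([toy_table[i] for i in seed_ids]))
```

In the real cell the same reduction is `embed_layer.weight[seed_ids].mean(dim=0)`, applied to 1536-dimensional rows.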


In [None]:
for token in SPECIAL_TOKENS:
  encoded_input = tokenizer(f"Hello world {token}", return_tensors="pt")
  decoded_output = tokenizer.decode(encoded_input['input_ids'][0], skip_special_tokens=True)
  assert decoded_output == "Hello world "

## Load and Tokenize Dataset

In [13]:
DATASET_NAME = "anujjamwal/OpenMathReasoning-Sampled-Hierarchical-Cot"
MAX_SEQ_LEN = 2048

dataset = load_dataset(DATASET_NAME, split="train").filter(lambda ex: len(ex['hierarchical_cot']) > 50)
print(f"Dataset size: {len(dataset)}")
print(f"Columns: {dataset.column_names}")
print(dataset[0].keys())

Dataset size: 99
Columns: ['id', 'question', 'expected_answer', 'problem_source', 'generated_solution', 'pass_rate_72b_tir', 'used_in_kaggle', 'hierarchical_cot', 'hierarchical_cot_raw', 'hcot_model']
dict_keys(['id', 'question', 'expected_answer', 'problem_source', 'generated_solution', 'pass_rate_72b_tir', 'used_in_kaggle', 'hierarchical_cot', 'hierarchical_cot_raw', 'hcot_model'])


In [14]:
# Prepare dataset in completion format
def preprocess(example):
    prompt = "Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}."
    assistant_content = f"<think>\n{example['hierarchical_cot']}\n</think>\n\\boxed{{{example['expected_answer']}}}"
    
    return {
        "prompt": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": example["question"]},
        ],
        "completion": [
            {"role": "assistant", "content": assistant_content},
        ],
    }

completion_dataset = dataset.map(preprocess, remove_columns=dataset.column_names)
print(completion_dataset[0].keys())


dict_keys(['prompt', 'completion'])
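
To see what one record looks like after preprocessing, the same construction can be run on a hand-written example. This is a sketch with invented question and reasoning text; `to_prompt_completion` is a hypothetical stand-alone version of `preprocess` that takes plain strings instead of a dataset row.

```python
SYSTEM_PROMPT = (
    "Solve the following math problem. "
    "Make sure to put the answer (and only answer) inside \\boxed{}."
)


def to_prompt_completion(question, reasoning, answer):
    """Build one prompt/completion record in conversational format."""
    # In an f-string, '{{' and '}}' are literal braces, so this renders \boxed{<answer>}.
    content = f"<think>\n{reasoning}\n</think>\n\\boxed{{{answer}}}"
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        "completion": [
            {"role": "assistant", "content": content},
        ],
    }


record = to_prompt_completion("What is 6 * 7?", "[THOUGHT] 6 * 7 = 42 [RETURN]", "42")
print(record["completion"][0]["content"])
```

The triple braces are easy to get wrong: two of them produce the literal brace and the third opens the substitution.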


In [None]:
# sample = completion_dataset[0]
# messages = sample["prompt"] + sample["completion"]
# print(tokenizer.apply_chat_template(messages, tokenize=False))

In [15]:
class HCotMaskBuilder(masks.MaterialisedMaskMixin):
    def __init__(self, tokenizer):
        self.thought_token_id = tokenizer.convert_tokens_to_ids(THOUGHT_TOKEN)
        self.solution_token_id = tokenizer.convert_tokens_to_ids(SOLUTION_TOKEN)
        self.return_token_id = tokenizer.convert_tokens_to_ids(RETURN_TOKEN)

    def __call__(self, input_ids, padding_mask):
        return self._build_hierarchical_mask(input_ids=input_ids, padding_mask=padding_mask)
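
The implementation of `_build_hierarchical_mask` lives in the repo and is not shown here. As a rough intuition only, the sketch below builds a causal mask under one assumed rule: once a `[THOUGHT] ... [SOLUTION] ... [RETURN]` block has closed, later tokens can no longer attend to the reasoning between `[THOUGHT]` and `[SOLUTION]`, while the summary after `[SOLUTION]` stays visible. That rule, and the helper names `hidden_spans` and `build_mask`, are assumptions for illustration and may differ from what the repo actually does.

```python
THOUGHT, SOLUTION, RETURN = "[THOUGHT]", "[SOLUTION]", "[RETURN]"


def hidden_spans(tokens):
    """Return (start, end, ret) triples: keys start..end-1 are hidden from queries after ret."""
    spans, stack = [], []
    for i, tok in enumerate(tokens):
        if tok == THOUGHT:
            stack.append([i, None])           # open a block, solution position unknown
        elif tok == SOLUTION and stack and stack[-1][1] is None:
            stack[-1][1] = i                  # remember where the summary begins
        elif tok == RETURN and stack:
            start, sol = stack.pop()          # close the innermost open block
            if sol is not None:
                spans.append((start + 1, sol, i))
    return spans


def build_mask(tokens):
    """Boolean [query][key] mask: causal, minus reasoning inside closed blocks (assumed rule)."""
    n = len(tokens)
    mask = [[key <= query for key in range(n)] for query in range(n)]
    for start, end, ret in hidden_spans(tokens):
        for query in range(ret + 1, n):
            for key in range(start, end):
                mask[query][key] = False
    return mask


tokens = ["Q", THOUGHT, "a", "b", SOLUTION, "s", RETURN, "x"]
for tok, row in zip(tokens, build_mask(tokens)):
    print(f"{tok:>10} ", "".join("1" if visible else "." for visible in row))
```

Under this assumed rule the final token `x` still sees the question, the markers, and the summary `s`, but not the intermediate reasoning `a` and `b`.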

In [16]:
from typing import Callable
from trl import SFTTrainer

class HCotSFTTrainer(SFTTrainer):
    def __init__(self, attention_mask_func: Callable, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._attention_mask_func = attention_mask_func
    
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        padding_mask = inputs.get('attention_mask', None)
        hierarchical_mask = self._attention_mask_func(input_ids=inputs["input_ids"], padding_mask=padding_mask)
        
        # Keep the original 2D padding mask for TRL's entropy calculation,
        # but pass the 4D hierarchical mask to the model forward pass
        outputs = model(
            input_ids=inputs["input_ids"],
            attention_mask=hierarchical_mask,
            labels=inputs.get("labels"),
        )
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss

In [None]:
from trl import SFTConfig
from peft import LoraConfig

output_path = "./hcot-nemotron-Math-1.5b/final"

# lora_config = LoraConfig(
#     lora_alpha=128,
#     lora_dropout=0.05,
#     r=256,
#     bias="none",
#     target_modules="all-linear",
#     task_type="CAUSAL_LM",
# )

training_args = SFTConfig(
    output_dir=output_path,
    hub_model_id=model_repo_id,
    # TODO(FLASH ATT)
    # packing=True,
    assistant_only_loss=True,
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    save_total_limit=5,
    report_to="tensorboard",
    push_to_hub=True,
    optim="adamw_torch_fused",
)

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


In [18]:
trainer = HCotSFTTrainer(
  model=model,
  train_dataset=completion_dataset,
  attention_mask_func=HCotMaskBuilder(tokenizer),
  args=training_args,
  # peft_config=lora_config,
  processing_class=tokenizer,
)

In [19]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


| Step | Training Loss |
|---|---|
| 10 | 2.424305 |
| 20 | 1.431483 |
| 30 | 0.569918 |
| 40 | 0.173207 |
| 50 | 0.05937 |
| 60 | 0.018884 |
| 70 | 0.008183 |
| 80 | 0.00288 |
| 90 | 0.002245 |
| 100 | 0.001798 |


Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]

TrainOutput(global_step=210, training_loss=0.22430119299374165, metrics={'train_runtime': 1058.8387, 'train_samples_per_second': 2.805, 'train_steps_per_second': 0.198, 'total_flos': 2.391067610578944e+16, 'train_loss': 0.22430119299374165})

In [20]:
trainer.save_model(output_path)
tokenizer.save_pretrained(output_path)

Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]

('./hcot-nemotron-Math-1.5b/final/tokenizer_config.json',
 './hcot-nemotron-Math-1.5b/final/chat_template.jinja',
 './hcot-nemotron-Math-1.5b/final/tokenizer.json')

In [21]:
tokenizer.push_to_hub(model_repo_id) 
trainer.push_to_hub()

No files have been modified since last commit. Skipping to prevent empty commit.



CommitInfo(commit_url='https://huggingface.co/anujjamwal/OpenMath-Nemotron-1.5B-hcot/commit/7b02f80db28df584ce90d76ab771db81f864d24c', commit_message='End of training', commit_description='', oid='7b02f80db28df584ce90d76ab771db81f864d24c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/anujjamwal/OpenMath-Nemotron-1.5B-hcot', endpoint='https://huggingface.co', repo_type='model', repo_id='anujjamwal/OpenMath-Nemotron-1.5B-hcot'), pr_revision=None, pr_num=None)

In [22]:
example = completion_dataset[9]
messages = example["prompt"]  # list of message dicts, e.g. [{"role": "user", "content": "..."}]
model.eval()

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

thought_token_id = tokenizer.convert_tokens_to_ids(THOUGHT_TOKEN)
solution_token_id = tokenizer.convert_tokens_to_ids(SOLUTION_TOKEN)
return_token_id = tokenizer.convert_tokens_to_ids(RETURN_TOKEN)

In [51]:
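# NOTE: `base_model` and `base_tokenizer` are loaded in the Verification section; run that cell first.
inputs2 = base_tokenizer.apply_chat_template(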
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(base_model.device)

In [52]:
gen_out_seen = base_model.generate(
    **inputs2,
    max_new_tokens=4096,
    use_cache=False,
    custom_generate=generate.generate_standard,
)


print("\n".join(base_tokenizer.batch_decode(gen_out_seen.sequences)).replace("\\n", "\n"))
print(f"\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out_seen.prompt_tokens}")
print(f"Generated tokens:       {gen_out_seen.generated_tokens}")
print(f"Total tokens processed: {gen_out_seen.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out_seen.output_tokens[0]}")
print(f"Prune events:           {gen_out_seen.prune_events[0]}")
print(f"Tokens pruned:          {gen_out_seen.tokens_pruned[0]}")
print(f"Wall time:              {gen_out_seen.wall_time_seconds:.2f}s")

Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Both `max_new_tokens` (=4096) and `max_length`(=4210) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|>
<|im_start|>user
How many ordered triples of integers $(x,y,z)$ are there such that $0 \leq x,y,z \leq 100$ and
\[
(x-y)^2+(y-z)^2+(z-x)^2 \geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?
\]<|im_end|>
<|im_start|>assistant
<think>
Okay, so I need to find the number of ordered triples (x, y, z) where each of x, y, z is an integer between 0 and 100 inclusive. The condition given is that the sum of the squares of (x - y), (y - z), and (z - x) is greater than or equal to the sum of the squares of (x + y - 2z), (y + z - 2x), and (z + x - 2y). Hmm, that seems a bit complicated, but maybe I can simplify it.

First, let me write down the inequality again to make sure I have it right:

\[
(x - y)^2 + (y - z)^2 + (z - x)^2 \geq (x + y - 2z)^2 + (y + z - 2x)^2 + (z + x - 2y)^2
\]

I need to compare these two expressions. Maybe expanding both sides will help. Let's start with the lef

In [23]:
gen_out = model.generate(
    **inputs,
    max_new_tokens=4096,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=False,
    custom_generate=generate.generate,
)

print("\n".join(tokenizer.batch_decode(gen_out.sequences, skip_special_tokens=False)).replace("\\n", "\n"))
print(f"\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|><|im_start|>userThe radius of the base of a cylinder is increasing at a rate of 0.5 cm/sec while its height is decreasing at a rate of 1.4 cm/sec. At what rate is the volume of the cylinder changing when the radius is 50 cm and the height is 80 cm?<|im_end|><|im_start|>assistant<think>
[THOUGHT] So, putting it all together, dV/dt = π (2r dr/dt h + r² dh/dt). [RETURN]
[THOUGHT] First, calculate 2r h dr/dt: 2 * 50 * 80 * (-0.5). Let's compute that. 50*80 is 4000, times 2 is 8000, times -0.5 is -4000. [RETURN]

[THOUGHT] So adding those two parts: -4000 + (-3500) = -7500. Multiply by π: dV/dt = -7500π cm³/s. [RETURN]

[THOUGHT] So that term is -4000. [RETURN]
[THOUGHT] So total is -4000 + (-3500) = -7500, times π is -7500π. [RETURN]

[THOUGHT] dV/dt = -7500π cm³/s, with the volume decreasing. [RETURN]

[THOUGHT] Formula verified: dV/dt = π(2r dr/dt h + r² dh/dt) = π
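
The stats printout is repeated after every generation call in this notebook. A small formatter could factor it out. The sketch below assumes only the attribute names already used in the cells above; `format_stats` is a hypothetical helper, and the numbers in the stand-in object are invented.

```python
from types import SimpleNamespace


def format_stats(gen_out, record=0):
    """Render the generation statistics for one record as a multi-line string."""
    lines = [
        f"--- Stats (record {record}) ---",
        f"Prompt tokens:          {gen_out.prompt_tokens}",
        f"Generated tokens:       {gen_out.generated_tokens}",
        f"Total tokens processed: {gen_out.total_tokens_processed[record]}",
        f"Output tokens:          {gen_out.output_tokens[record]}",
        f"Prune events:           {gen_out.prune_events[record]}",
        f"Tokens pruned:          {gen_out.tokens_pruned[record]}",
        f"Wall time:              {gen_out.wall_time_seconds:.2f}s",
    ]
    return "\n".join(lines)


# Stand-in object with made-up values, used only to exercise the formatter.
fake_out = SimpleNamespace(
    prompt_tokens=114,
    generated_tokens=4096,
    total_tokens_processed=[4210],
    output_tokens=[300],
    prune_events=[12],
    tokens_pruned=[3796],
    wall_time_seconds=12.3456,
)
print(format_stats(fake_out))
```

With a helper like this, each generation cell would end in a single `print(format_stats(gen_out))`.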

In [None]:
gen_out = model.generate(
    **inputs,
    max_new_tokens=4196,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
    custom_generate=generate.generate,
)

print("\n".join(tokenizer.batch_decode(gen_out.sequences, skip_special_tokens=False)).replace("\\n", "\n"))
print(f"\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

In [53]:
gen_out = model.generate(
    **inputs,
    max_new_tokens=4096,
    min_token_length=1024,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=False,
    custom_generate=generate.generate_with_mask,
)

print("\n".join(tokenizer.batch_decode(gen_out.sequences, skip_special_tokens=False)).replace("\\n", "\n"))
print(f"\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|><|im_start|>userThe radius of the base of a cylinder is increasing at a rate of 0.5 cm/sec while its height is decreasing at a rate of 1.4 cm/sec. At what rate is the volume of the cylinder changing when the radius is 50 cm and the height is 80 cm?<|im_end|><|im_start|>assistant<think>
[THOUGHT] Okay, let's see. I need to find the rate at which the volume of a cylinder is changing when the radius is 50 cm and the height is 80 cm. The problem says the radius is increasing at 0.5 cm per second and the height is decreasing at 1.4 cm per second. Hmm, so related rates problem. I remember that for these kinds of problems, you need to relate the rates of change using derivatives.

[THOUGHT] First, let me recall the formula for the volume of a cylinder. The volume V is πr²h, right? Where r is the radius and h is the height. Since both r and h are changing with time, I ne

In [None]:
gen_out = model.generate(
    **inputs,
    max_new_tokens=8192,
    min_token_length=2048,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
    custom_generate=generate.generate_with_mask,
)

print("\n".join(tokenizer.batch_decode(gen_out.sequences, skip_special_tokens=False)).replace("\\n", "\n"))
print(f"\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

In [None]:
gen_out = model.generate(
    **inputs,
    max_new_tokens=4196,
    custom_generate=generate.generate_standard,
)

print("\n".join(tokenizer.batch_decode(gen_out.sequences, skip_special_tokens=False)).replace("\\n", "\n"))
print(f"\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

## Verification

In [25]:
eval_ds = load_dataset("davidanugraha/OpenMathReasoning-Sampled", split="train")

In [26]:
# Prepare the evaluation dataset in completion format.
# Named separately so it does not shadow the training `preprocess` defined earlier.
def preprocess_eval(example):
    prompt = "Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}."
    
    assistant_content = f"<think>\n{example['generated_solution']}\n</think>\n\\boxed{{{example['expected_answer']}}}"
    
    return {
        "prompt": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": example["question"]},
        ],
        "completion": [
            {"role": "assistant", "content": assistant_content},
        ],
    }

eval_prep_ds = eval_ds.map(preprocess_eval, remove_columns=eval_ds.column_names)


### Unseen Example

In [27]:
eval_example = eval_prep_ds[1000]["prompt"]
eval_inputs = tokenizer.apply_chat_template(
    eval_example,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

In [28]:
print(eval_example)

print(eval_ds[1000]["question"])
print(eval_ds[1000]["expected_answer"])

[{'content': 'Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}.', 'role': 'system'}, {'content': 'How many ordered triples of integers $(x,y,z)$ are there such that $0 \\leq x,y,z \\leq 100$ and\n\\[\n(x-y)^2+(y-z)^2+(z-x)^2 \\geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?\n\\]', 'role': 'user'}]
How many ordered triples of integers $(x,y,z)$ are there such that $0 \leq x,y,z \leq 100$ and
\[
(x-y)^2+(y-z)^2+(z-x)^2 \geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?
\]
101
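
To compare a generation against `expected_answer` automatically rather than by eye, one option is to pull out the last `\boxed{...}` in the decoded text. The sketch below is one possible approach using exact string comparison; `extract_boxed` and `is_correct` are hypothetical helpers, and real answers may need normalization (for example, equivalent fractions) that is not handled here.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent or unclosed."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced, e.g. generation was cut off


def is_correct(generated, expected):
    """Exact-match check between the boxed answer and the expected answer."""
    answer = extract_boxed(generated)
    return answer is not None and answer == expected.strip()


sample = "<think>\n[THOUGHT] x = y = z [RETURN]\n</think>\n\\boxed{101}"
print(extract_boxed(sample), is_correct(sample, "101"))
```

Tracking brace depth keeps nested LaTeX such as `\boxed{\frac{1}{2}}` intact instead of stopping at the first closing brace.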


In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
  MODEL_NAME,
  dtype=torch.bfloat16,
  device_map='auto',
  # TODO(FLASH ATT)
  # attn_implementation="flash_attention_2",
)
base_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
eval_inputs2 = base_tokenizer.apply_chat_template(
    eval_example,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(base_model.device)

In [None]:
gen_out = base_model.generate(
    **eval_inputs2,
    max_new_tokens=4096,
    use_cache=False,
    custom_generate=generate.generate_standard,
)


print("\n".join(base_tokenizer.batch_decode(gen_out.sequences)).replace("\\n", "\n"))
print(f"\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|>
<|im_start|>user
How many ordered triples of integers $(x,y,z)$ are there such that $0 \leq x,y,z \leq 100$ and
\[
(x-y)^2+(y-z)^2+(z-x)^2 \geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?
\]<|im_end|>
<|im_start|>assistant
 Okay, let's try to tackle this problem. Hmm, we need to find the number of ordered triples of integers (x, y, z) where each is between 0 and 100 inclusive. The condition given is that the sum of the squares of (x-y), (y-z), and (z-x) is greater than or equal to the sum of the squares of (x+y-2z), (y+z-2x), and (z+x-2y). 

First, let me write down the inequality again to make sure I have it right:

(x - y)² + (y - z)² + (z - x)² ≥ (x + y - 2z)² + (y + z - 2x)² + (z + x - 2y)².

Hmm, that looks a bit intimidating at first, but maybe if I expand both sides, things will cancel out or simplify. Let me start by expanding the left-hand side (LHS) and the righ

In [None]:
gen_out = model.generate(
    **eval_inputs,
    max_new_tokens=8192,
    use_cache=False,
)

print("\n".join(tokenizer.batch_decode(gen_out)).replace("\\n", "\n"))

In [29]:
gen_out = model.generate(
    **eval_inputs,
    max_new_tokens=4096,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=False,
    custom_generate=generate.generate,
)

print("\n".join(tokenizer.batch_decode(gen_out.sequences)).replace("\\n", "\n"))
print(f"\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|><|im_start|>userHow many ordered triples of integers $(x,y,z)$ are there such that $0 \leq x,y,z \leq 100$ and
\[
(x-y)^2+(y-z)^2+(z-x)^2 \geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?
\]<|im_end|><|im_start|>assistant<think>
[THOUGHT] RHS = 9x² + 6y² + 6z² -6xy -6yz -6xz [RETURN]
[THOUGHT] The inequality holds only when x = y = z, giving 101 solutions. [RETURN]
[THOUGHT] When variables are in arithmetic progression, equality holds only when d=0 (i.e., x=y=z), same as before. [RETURN]
[THOUGHT] For other cases, the inequality is strict. But wait, the problem asks for the number of triples where the left side is greater than or equal to the right. So if equality only occurs when x=y=z, then the total number of solutions would be 101 (for x=y=z) plus the number of triples where the left side is strictly greater. But wait, maybe the inequality is only satisfied when x=y=z? 

In [None]:
gen_out = model.generate(
    **eval_inputs,
    max_new_tokens=4096,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    min_token_length=2048,
    use_cache=False,
    custom_generate=generate.generate_with_mask,
)
print("\n".join(base_tokenizer.batch_decode(gen_out.sequences)).replace("\\n", "\n"))
print("\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

In [38]:
eval_inputs = tokenizer.apply_chat_template(
    eval_example,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
gen_out = model.generate(
    **eval_inputs,
    max_new_tokens=4096,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    min_token_length=2048,
    use_cache=False,
    custom_generate=generate.generate_with_mask,
)
print("\n".join(base_tokenizer.batch_decode(gen_out.sequences)).replace("\\n", "\n"))
print("\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|><|im_start|>userHow many ordered triples of integers $(x,y,z)$ are there such that $0 \leq x,y,z \leq 100$ and
\[
(x-y)^2+(y-z)^2+(z-x)^2 \geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?
\]<|im_end|><|im_start|>assistant
 Okay, let's try to tackle this problem. Hmm, we need to find the number of ordered triples of integers (x, y, z) where each is between 0 and 100 inclusive. The condition is that the sum of the squares (x-y)² + (y-z)² + (z-x)² is greater than or equal to another sum of squares: (x+y-2z)² + (y+z-2x)² + (z+x-2y)². 

First, maybe I should expand both sides and see if we can simplify the inequality. Let me start by expanding the left-hand side (LHS) and the right-hand side (RHS) separately.

 LHS: (x - y)² + (y - z)² + (z - x)²

Let's expand each term:

(x - y)² = x² - 2xy + y²

(y - z)² = y² - 2yz + z²

(z - x)² = z² - 2zx + x²

Adding them together:

LHS = (

In [34]:
gen_out = model.generate(
    **eval_inputs,
    max_new_tokens=4096,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    min_token_length=1024,
    use_cache=True,  # same settings as the next cell, but with the KV cache enabled
    custom_generate=generate.generate_with_mask,
)

print("\n".join(tokenizer.batch_decode(gen_out.sequences)).replace("\\n", "\n"))
print("\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|><|im_start|>userHow many ordered triples of integers $(x,y,z)$ are there such that $0 \leq x,y,z \leq 100$ and
\[
(x-y)^2+(y-z)^2+(z-x)^2 \geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?
\]<|im_end|><|im_start|>assistant<think>
[THOUGHT] Okay, let's try to tackle this problem. Hmm, we need to find the number of ordered triples of integers (x, y, z) where each is between 0 and 100 inclusive. The condition is that the sum of the squares (x-y)² + (y-z)² + (z-x)² is greater than or equal to another sum of squares: (x+y-2z)² + (y+z-2x)² + (z+x-2y)². 

First, maybe I should expand both sides and see if we can simplify the inequality. Let me start by expanding the left-hand side (LHS) and the right-hand side (RHS) separately.

[THOUGHT] The inequality simplifies to 3(x² + y² + z²) - 4(xy + yz + zx) ≤ 0. [RETURN]

Wait, let me check that. Let me expand both sides step by step.

St

In [33]:
gen_out = model.generate(
    **eval_inputs,
    max_new_tokens=4096,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    min_token_length=1024,
    use_cache=False,
    custom_generate=generate.generate_with_mask,
)

print("\n".join(tokenizer.batch_decode(gen_out.sequences)).replace("\\n", "\n"))
print("\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|><|im_start|>userHow many ordered triples of integers $(x,y,z)$ are there such that $0 \leq x,y,z \leq 100$ and
\[
(x-y)^2+(y-z)^2+(z-x)^2 \geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?
\]<|im_end|><|im_start|>assistant<think>
[THOUGHT] Okay, let's try to tackle this problem. Hmm, we need to find the number of ordered triples of integers (x, y, z) where each is between 0 and 100 inclusive. The condition is that the sum of the squares (x-y)² + (y-z)² + (z-x)² is greater than or equal to another sum of squares: (x+y-2z)² + (y+z-2x)² + (z+x-2y)². 

First, maybe I should expand both sides and see if we can simplify the inequality. Let me start by expanding the left-hand side (LHS) and the right-hand side (RHS) separately.

[THOUGHT] The inequality simplifies to 3(x² + y² + z²) - 4(xy + yz + zx) ≤ 0. [RETURN]

Wait, let me check that. Let me expand both sides step by step.

St

In [35]:
gen_out = model.generate(
    **eval_inputs,
    max_new_tokens=4096,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    min_token_length=4097,  # one more than max_new_tokens; presumably never reached in this run
    use_cache=False,
    custom_generate=generate.generate_with_mask,
)

print("\n".join(tokenizer.batch_decode(gen_out.sequences)).replace("\\n", "\n"))
print("\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|><|im_start|>userHow many ordered triples of integers $(x,y,z)$ are there such that $0 \leq x,y,z \leq 100$ and
\[
(x-y)^2+(y-z)^2+(z-x)^2 \geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?
\]<|im_end|><|im_start|>assistant<think>
[THOUGHT] Okay, let's try to tackle this problem. Hmm, we need to find the number of ordered triples of integers (x, y, z) where each is between 0 and 100 inclusive. The condition is that the sum of the squares (x-y)² + (y-z)² + (z-x)² is greater than or equal to another sum of squares: (x+y-2z)² + (y+z-2x)² + (z+x-2y)². 

First, maybe I should expand both sides and see if we can simplify the inequality. Let me start by expanding the left-hand side (LHS) and the right-hand side (RHS) separately.

[THOUGHT] LHS: (x - y)² + (y - z)² + (z - x)²

Let's expand each term:

(x - y)² = x² - 2xy + y²

(y - z)² = y² - 2yz + z²

(z - x)² = z² - 2zx + x²

Addin

In [39]:
gen_out = model.generate(
    **eval_inputs,
    max_new_tokens=4096,
    thought_token_id=thought_token_id,
    solution_token_id=solution_token_id,
    return_token_id=return_token_id,
    eos_token_id=tokenizer.eos_token_id,
    min_token_length=1,
    use_cache=False,
    custom_generate=generate.generate_with_mask,
)

print("\n".join(tokenizer.batch_decode(gen_out.sequences)).replace("\\n", "\n"))
print("\n--- Stats (record 0) ---")
print(f"Prompt tokens:          {gen_out.prompt_tokens}")
print(f"Generated tokens:       {gen_out.generated_tokens}")
print(f"Total tokens processed: {gen_out.total_tokens_processed[0]}")
print(f"Output tokens:          {gen_out.output_tokens[0]}")
print(f"Prune events:           {gen_out.prune_events[0]}")
print(f"Tokens pruned:          {gen_out.tokens_pruned[0]}")
print(f"Wall time:              {gen_out.wall_time_seconds:.2f}s")

<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.<|im_end|><|im_start|>userHow many ordered triples of integers $(x,y,z)$ are there such that $0 \leq x,y,z \leq 100$ and
\[
(x-y)^2+(y-z)^2+(z-x)^2 \geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?
\]<|im_end|><|im_start|>assistant<think>
[THOUGHT] RHS = 9x² + 6y² + 6z² -6xy -6yz -6xz [RETURN]
[THOUGHT] The inequality holds only when x = y = z, giving 101 solutions. [RETURN]
[THOUGHT] When variables are in arithmetic progression, equality holds only when d=0 (i.e., x=y=z), same as before. [RETURN]
[THOUGHT] For other cases, the inequality is strict. But wait, the problem asks for the number of triples where the left side is greater than or equal to the right. So if equality only occurs when x=y=z, then the total number of solutions would be 101 (for x=y=z) plus the number of triples where the left side is strictly greater. But wait, maybe the inequality is only satisfied when x=y=z? 

In [47]:
# Measure how many tokens this reasoning trace occupies once tokenized
tokenizer("""<|im_start|>system
Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}.<|im_end|><|im_start|>userHow many ordered triples of integers $(x,y,z)$ are there such that $0 \\leq x,y,z \\leq 100$ and
\\[
(x-y)^2+(y-z)^2+(z-x)^2 \\geq (x+y-2z)^2+(y+z-2x)^2+(z+x-2y)^2?
\\]<|im_end|><|im_start|>assistant<think>
[THOUGHT] RHS = 9x² + 6y² + 6z² -6xy -6yz -6xz [RETURN]
[THOUGHT] The inequality holds only when x = y = z, giving 101 solutions. [RETURN]
[THOUGHT] When variables are in arithmetic progression, equality holds only when d=0 (i.e., x=y=z), same as before. [RETURN]
[THOUGHT] For other cases, the inequality is strict. But wait, the problem asks for the number of triples where the left side is greater than or equal to the right. So if equality only occurs when x=y=z, then the total number of solutions would be 101 (for x=y=z) plus the number of triples where the left side is strictly greater. But wait, maybe the inequality is only satisfied when x=y=z? Let me check.

[THOUGHT] First, let's expand both sides:

Left: (x-y)² + (y-z)² + (z-x)²
= 2x² + 2y² + 2z² - 2xy - 2yz - 2zx

Right: (x+y-2z)² + (y+z-2x)² + (z+x-2y)²
Let's expand each term:
First term: (x + y - 2z)² = x² + y² + 4z² + 2xy -4xz -4yz
Second term: (y + z - 2x)² = y² + z² + 4x² + 2yz -4xy -4xz
Third term: (z + x - 2y)² = z² + x² + 4y² + 2zx -4zy -4xy

Adding them together:
(4x² + 4y² + 4z²) + (2xy + 2yz + 2zx) -4xz -4yz -4xy -4xz -4zy -4xy
Wait, let me compute term by term:

Sum of x² terms: 4x² + y² + z² + ... Wait, no. Let me re-express each expansion:

First term: x² + y² + 4z² + 2xy -4xz -4yz
Second term: 4x² + y² + z² -4xy -4xz + 2yz
Third term: x² + 4y² + z² -4xy -4yz + 2zx

Adding all three:
x² + y² + 4z² + 2xy -4xz -4yz
+ 4x² + y² + z² -4xy -4xz + 2yz
+ x² + 4y² + z² -4xy -4yz + 2zx
----------------------------
= (x² +4x² +x²) + (y² + y² +4y²) + (4z² + z² + z²)
+ (2xy -4xy -4xy) + (-4xz -4xz +2zx) + (-4yz +2yz -4yz)
= 6x² +6y² +6z² + (-6xy) + (-6xz) + (-6yz)

So the right-hand side simplifies to 6x² +6y² +6z² -6xy -6yz -6xz.

[THOUGHT] The inequality reduces to ½[(x-y)² + (y-z)² + (z-x)²] ≤ 0, which holds only when x = y = z. Therefore, the only solutions are triples where x, y, z are all equal, giving 101 solutions. [RETURN]

[THOUGHT] The inequality reduces to -½[(x-y)² + (y-z)² + (z-x)²] ≥ 0, which holds only when x = y = z. Thus, the only solutions are triples where x, y, z are equal, giving 101 solutions. [RETURN]

[THOUGHT] Therefore, the answer is 101. [RETURN]""", return_tensors="pt")["input_ids"].shape

torch.Size([1, 987])
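As an independent check on the answer the trace above arrives at (101), the count can be brute-forced without the model. This sketch is not part of the original notebook, and `count_triples` is a hypothetical helper name. It evaluates the inequality from the prompt directly for every integer triple in the range.

```python
def count_triples(n: int = 100) -> int:
    """Count integer triples (x, y, z) with 0 <= x, y, z <= n satisfying the inequality."""
    count = 0
    for x in range(n + 1):
        for y in range(n + 1):
            for z in range(n + 1):
                lhs = (x - y) ** 2 + (y - z) ** 2 + (z - x) ** 2
                rhs = (
                    (x + y - 2 * z) ** 2
                    + (y + z - 2 * x) ** 2
                    + (z + x - 2 * y) ** 2
                )
                if lhs >= rhs:
                    count += 1
    return count


print(count_triples(100))  # 101
```

The result agrees with the algebra: writing a = x - z and b = y - z, the left side is 2a² + 2b² - 2ab and the right side is 6a² + 6b² - 6ab, exactly three times the left. The inequality therefore holds only when the left side is zero, that is, when x = y = z.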