### Experimenting budget forcing and test-time scaling from `s1: Simple test-time scaling`

**From the paper: https://arxiv.org/pdf/2501.19393 <br/>**

>Figure 3. Budget forcing with s1-32B. The model tries to stop
after “...is 2.”, but we suppress the end-of-thinking token delimiter
instead appending “Wait” leading s1-32B to self-correct its answer.

<img src="https://arxiv.org/html/2501.19393v2/x4.png" width=400 height=300>

**Interesting nuggets from s1 paper (methodology)**

1. They collected a dataset of 1k examples with reasoning traces from Google Gemini model and performed SFT (supervised fine tuning).
2. They fix response lengths by adding "wait" tokens in certain cases to get models to generate longer CoT's, verify, and correct itself, or halt token generation by introducing an EOT (end of thinking) token delimiter and the authors call this phenomenon as "budget forcing."

**Budget Forcing**
Not to be picky or pedantic but `budget forcing (BF)` is still not a parallel inference scaling technique (as seen in o-1 or by Gemini Thinking). As the authors point out, we can think of `BF` as a sequential inference scaling technique. Despite the `<wait>` and `<think` tokens at appropriate steps, the model is still generating one token at a time, the only difference being in total number of tokens.

These two charts highlight the limitations, but they are still salient contributions from the `s1` paper:
<br/>
<img src="https://arxiv.org/html/2501.19393v2/x7.png" width=500 height=500 />
<br/>
<img src="https://arxiv.org/html/2501.19393v2/x8.png" width=500 height=500 />
</br>
The important contribution here is `BF` method seems to be effective than other resource hungry inference-scaling techniques. A good test on how useful `BF` here could be is by testing a good instruct model (Claude or open weight models like Qwen 2.5) using CoT prompting and observe the quality of responses supplied by models trained with `BF` vs regular models with CoT prompts.

**Compute optimal parallel inference scaling**
Although `BF` is compared against majority voting, and they found it to work well there are better compute optimal strategies
to perform inference scaling. For more check paper: https://arxiv.org/pdf/2408.03314v1, especially section **5.2**. Key strategies that should be compared against `BF` are:
<br/>
<img src="https://arxiv.org/html/2408.03314v1/x3.png" width=500 height=500 />

### What is scaling at test time ?

Given a question `Q` to a model `LM` you get a response `R`. This process is pretty familiar and is called inference, but the key takeaway here is until `o1`, there was a semi-fixed resource consumption and also an assumption that the compute required to answer `Q`. After the advent of `o1` compute on inference or test-time compute (TTC) dynamically increase their reasoning time during inference leading to more time thinking about `Q`, improving accuracy of `R` but at the cost of higher compute usage.

### Isn't Deepseek-R1's scaling inference time ?

**Short answer**: Nope

**Long answer**:
DeepSeek stated that their main goal was to achieve strong reasoning capabilities by leveraging the principle of deep, step-by-step thinking. DeepSeek-R1 got nearly SoTA results on most reasoning datasets simply by leveraging pure RL (GRPO), good data quality mix, and an incredibly strong base model (DeepSeek-V3). So, they built a reasoning model that mimicks o-1, but the phenomenon of scaling at inference time is somewhat not disclosed in the paper. Also, page-1 from s1's paper: https://arxiv.org/pdf/2501.19393 <br/>
>However, despite the large number of o1 replication attempts, none have openly replicated
a clear test-time scaling behavior. Thus, we ask: what is
the simplest approach to achieve both test-time scaling and
strong reasoning performance?

### Experimenting `s1` phenomenon on Science of Science tasks

### 1. Import necessary libraries

In [1]:
# This Source Code Form is subject to the terms of the MIT
# License. If a copy of the same was not distributed with this
# file, You can obtain one at
# https://github.com/akhilpandey95/s1/blob/main/LICENSE.

import gc
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, set_seed

### 2. Load Deepseek-R1 distill models

#### 2.1 Model, tokenizer load

In [13]:
del model, tokenizer
gc.collect()

0

In [14]:
# device for acceleration
# device = "mps" if torch.mps.is_available() else "cpu"
device = "cpu"

# set the model-id
model_id = "/Users/akhilakella/code/models/DeepSeek-R1-Distill-Qwen-1.5B"

# log
print("----------------------------------")
print(f"Using {device} to load {model_id}")
print("----------------------------------")

# 4-bit quantization config
bnb_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16
)

# get model-tokenizer pair
start = time.time()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, \
                                             trust_remote_code=True, \
                                             low_cpu_mem_usage=True, \
                                             torch_dtype=torch.bfloat16, \
                                             device_map=device)

# # pad token if needed
# tokenizer.add_special_tokens({"pad_token": "<|finetune_right_pad_id|>"})
# model.resize_token_embeddings(len(tokenizer))

# load time
end = time.time()
print("Model-tokenizer Load Time:", end - start)
print("----------------------------------")

----------------------------------
Using cpu to load /Users/akhilakella/code/models/DeepSeek-R1-Distill-Qwen-1.5B
----------------------------------
Model-tokenizer Load Time: 0.6631889343261719
----------------------------------


#### 2.2 Seed and parameters for reproducibility

In [15]:
# seed for reproducibility
set_seed(2025)

# set top_p and temperature to none
model.generation_config.temperature=None
model.generation_config.top_p=None

### 3. Prompting a reasoning model to construct a policy brief

#### 3.1 Prompt for T, K, and E

In [16]:
# prompt to generate T, K, and E
input_text = """
SCIENTIFIC_PAPER_TITLE: Pharmaceutical Side Effects and Mental Health Paradoxes among Racial-Ethnic Minorities

SCIENTIFIC_PAPER_ABSTRACT: Sociologists have long struggled to explain the minority mental health paradox: 
that racial-ethnic minorities often report better mental health than non-Hispanic whites despite social 
environments that seem less conducive to well-being. Using data from the 2008–2013 Medical Expenditure 
Panel Survey (MEPS), this study provides a partial explanation for the paradox rooted in a very different disparity. 
Evidence from MEPS indicates that non-Hispanic whites consume more pharmaceuticals than racial-ethnic minorities 
for a wide variety of medical conditions. Moreover, non-Hispanic whites consume more pharmaceuticals that although 
effective in treating their focal indication, include depression or suicide as a side effect. In models that adjust
for the use of such medications, the minority advantage in significant distress is reduced, in some instances to 
statistical nonsignificance. Although a significant black and Hispanic advantage in a continuous measure of distress 
remains, the magnitude of the difference is reduced considerably. The relationship between the use of medications with 
suicide as a side effect and significant distress is especially large, exceeding, for instance, the relationship between 
poverty and significant distress. For some minority groups, the less frequent use of such medications is driven by better 
health (as in the case of Asians), whereas for others, it reflects a treatment disparity (as in the case of blacks), 
although the consequences for the mental health paradox are the same. The implications of the results are discussed, 
especially with respect to the neglect of psychological side effects in the treatment of physical disease as well as 
the problem of multiple morbidities.

The final policy brief must contain the following structure:

1. Title
2. Executive summary of the scientific work
3. Key policy implications

Write a policy brief for the above scientific work.
"""

# apply chat template
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful intelligent policy brief writing assistant. "
            "You have the ability to take scientific papers and write policy briefs "
            "for the respective scientific works."
        )
    },
    {
        "role": "user",
        "content": input_text
    }
]

In [17]:
# get attention mask and input ids
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, \
                                        return_dict=True, return_tensors="pt")
input_encoded = {k: v.to(model.device) for k, v in inputs.items()}
input_shape = len(input_encoded["input_ids"][0])

# set max new tokens
max_tokens = 1024

# shape check
print(f"Encoded inputs: {input_shape}")

Encoded inputs: 438


#### 3.3 Inference on the prompt using `R1-distill-Qwen-8b` without `BF`

In [18]:
# time check
start = time.time()

# routine model.generate()
outputs = model.generate(
    **input_encoded,
    max_new_tokens=max_tokens,
    do_sample=False,
    num_beams=1,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# inference time
end = time.time()
inference = end - start

# model.generate() output
print("----------------------------------")
output = tokenizer.decode(outputs[0][input_shape:])
print("Policy Brief:")
print(output)
print("----------------------------------")

# tps logic
new_tokens = outputs[0].shape[0] - input_shape
tokens_per_second = new_tokens / inference if inference > 0 else float('inf')
print(f"Inference time: {inference:.2f} seconds")
print(f"New tokens generated: {new_tokens}")
print(f"Tokens per second: {tokens_per_second:.2f}")

----------------------------------
Policy Brief:
Okay, so I need to write a policy brief based on the provided scientific paper. Let me start by understanding the structure and the content of the paper.

The paper is titled "Pharmaceutical Side Effects and Mental Health Paradoxes Among Racial-Ethnic Minorities" and it's about the mental health paradox where racial-ethnic minorities report better mental health than non-Hispanic whites despite having less favorable social environments. The study uses data from the 2008-2013 Medical Expenditure Panel Survey (MEPS) to explore this issue.

First, I need to create a title that reflects the study's focus. Maybe something like "Exploring the Mental Health Paradox Among Racial-Ethnic Minorities: The Role of Pharmaceutical Side Effects."

Next, the executive summary should summarize the key findings. The study found that racial-ethnic minorities consume more pharmaceuticals, including those that cause depression or suicide, which reduces their m

#### 3.4 Inference on the prompt with test time scaling via `BF`

In [22]:
del model, tokenizer
gc.collect()

287

##### 3.4.1 Induce **Wait** and `</think>` tokens

In [20]:
# checking the tokenizer's special tokens
print("----------------------------------")
print(f"Before adjusting tokens: {tokenizer.all_special_tokens}")
print("----------------------------------")

# define BF parameters
bf_tokens = 0
wait_token = "Wait"
end_think_token = "</think>"
end_think_triggered = False
NUM_IGNORE = 1

# resize model embedding
special_tokens_dict = {"additional_special_tokens": [wait_token, end_think_token]}
tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer), mean_resizing=False)
print(f"After adding special wait, think tokens: {tokenizer.all_special_tokens}")
print("----------------------------------")

# check
wait_token_id = tokenizer.convert_tokens_to_ids("Wait")
end_think_token_id = tokenizer.convert_tokens_to_ids("</think>")
print(f"wait_token: {wait_token}")
print(f"end_think_token_id: {end_think_token_id}")
print("----------------------------------")

----------------------------------
Before adjusting tokens: ['<｜begin▁of▁sentence｜>', '<｜end▁of▁sentence｜>']
----------------------------------
After adding special wait, think tokens: ['<｜begin▁of▁sentence｜>', '<｜end▁of▁sentence｜>', 'Wait', '</think>']
----------------------------------
wait_token: Wait
end_think_token_id: 151649
----------------------------------


##### 3.4.2 Generate tokens with a budget

In [21]:
# clone the input ids
generated = input_encoded["input_ids"].clone()

# sequential generation of tokens
start = time.time()
while bf_tokens < max_tokens:
    prompt_text = tokenizer.decode(generated[0])
    input_encoded = tokenizer(prompt_text, return_tensors="pt").to(model.device)

    # generate one token at a time
    outputs = model.generate(
        **input_encoded,
        max_new_tokens=1,
        do_sample=False,
        num_beams=1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

    # 
    next_token = outputs[0][-1].unsqueeze(0).unsqueeze(0)
    generated = torch.cat((generated, next_token), dim=1)
    token_str = tokenizer.decode(next_token[0])

    # display the text generated
    print(token_str, end="", flush=True)

    # increment token counter
    bf_tokens += 1

    # check for wait or think tokens
    if wait_token in token_str:
        # out of the loop
        break
    if end_think_token in token_str:
        # end of token triggered
        end_think_triggered = True

        # out of the loop
        break

end = time.time()
inference_time = end - start
total_generated = generated.shape[1] - len(input_encoded["input_ids"][0])

print("\n----------------------------------")
print("Synchronous BF Inference")
print(f"Inference time: {inference_time:.2f} seconds")
print(f"New tokens generated: {total_generated}")
print("----------------------------------")

# If a wait token was encountered, ignore the stop token NUM_IGNORE times
if wait_token in tokenizer.decode(generated[0]):
    for i in range(NUM_IGNORE):
        # Adjust remaining token budget (here simply using bf_tokens)
        remaining_tokens = max_tokens - bf_tokens
        # Append generated text and the ignore string to the prompt
        prompt = tokenizer.decode(generated[0]) + wait_token
        input_encoded = tokenizer(prompt, return_tensors="pt").to(model.device)
        # Generate additional tokens synchronously until a control token is seen
        while bf_tokens < max_tokens:
            outputs = model.generate(
                **input_encoded,
                max_new_tokens=1,
                do_sample=False,
                num_beams=1,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id
            )
            next_token = outputs[0][-1].unsqueeze(0).unsqueeze(0)
            generated = torch.cat((generated, next_token), dim=1)
            token_str = tokenizer.decode(next_token[0])
            print(token_str, end="", flush=True)
            bf_tokens += 1
            if wait_token in token_str or end_think_token in token_str:
                break

# force end_think_token
if not end_think_triggered:
    prompt = tokenizer.decode(generated[0]) + end_think_token + "\n"
    input_encoded = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
         **input_encoded,
         max_new_tokens=1024,
         do_sample=False,
         num_beams=1,
         pad_token_id=tokenizer.pad_token_id,
         eos_token_id=tokenizer.eos_token_id
    )
    final_text = tokenizer.decode(outputs[0][input_encoded["input_ids"].shape[1]:])
    print("\n----------------------------------")
    print("Final Answer:")
    print(final_text)
    print("----------------------------------")

Okay, so I need to write a policy brief based on the provided scientific paper. The title is already given, so I don't need to change that. The executive summary should summarize the study, key policy implications should discuss the findings, and the policy brief should outline the implications for policymakers.

First, I'll start with the title. It's about the paradox of pharmaceutical side effects and mental health among racial-ethnic minorities. That's clear, but maybe I can make it a bit more specific, like mentioning the specific medications or the context of the study.

Next, the executive summary. I need to explain the paradox: why minorities have better mental health despite lower access to medications. The study uses MEPS data, which is a big survey, so I should mention that. It found that whites consume more medications, especially those with mental health issues. Also, some medications have side effects like depression or suicide, which might explain the paradox. The study a


KeyboardInterrupt



### 4. Prompting a reasoning model to extract information from technology disclosure page