## RL Hacking 3

Hack in this direction as long as it makes sense: 

- [ ]  Set up a basic LLM (maybe Qwen3 0.6B)
- [ ]  A math dataset
- [ ]  Run various policy gradient methods to teach the model to reason (GRPO, Raschka’s simpler GRPO, Dr. GRPO, PPO, vanilla policy gradient, maybe DPO, maybe RLOO)
- [ ]  How do implementations perform and vary? Do the components make sense? Can we replicate some version of the “aha moment”? (Although that’s apparently not an emergent RL property?) Does the way the math is set up make sense? Does Karpathy’s simplified explanation make it easier? Would demoing one or more of these algorithms in a game setting make more sense?
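To make the comparison on this list a bit more concrete, here is a minimal sketch of how three of these methods turn a group of rollout rewards into advantages. The function names and the `eps` value are my own assumptions, not taken from any of the repos; the use of the population standard deviation is also an assumption (implementations differ on this). Dr. GRPO additionally drops the per-response length normalization in the loss, which is not shown here.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-4):
    # GRPO-style: center on the group mean, scale by the group std.
    # eps (assumed value) avoids division by zero when all rewards match.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    # Dr. GRPO-style: center on the group mean, no std scaling.
    mu = mean(rewards)
    return [r - mu for r in rewards]


def rloo_advantages(rewards):
    # RLOO-style: baseline for each rollout is the mean reward
    # of the *other* rollouts in the group (leave-one-out).
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical group of 8 rollouts for one prompt: 1.0 = correct answer.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

for name, fn in [
    ("GRPO", grpo_advantages),
    ("Dr. GRPO", dr_grpo_advantages),
    ("RLOO", rloo_advantages),
]:
    print(name, [round(a, 3) for a in fn(rewards)])
```

In all three, correct rollouts get positive advantages and incorrect ones get negative advantages; the methods differ only in scale. If every rollout in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.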

I'd prefer to use standard HF Qwen here, but let's roll with Raschka's tooling for a bit:
- https://github.com/rasbt/reasoning-from-scratch/tree/main/ch06

In [1]:
import torch

from reasoning_from_scratch.ch02 import get_device
from reasoning_from_scratch.ch03 import (
    load_model_and_tokenizer
)

device = get_device()
# device = torch.device("cpu")

model, tokenizer = load_model_and_tokenizer(
    which_model="base",
    device=device,
    use_compile=False
)

Using NVIDIA CUDA GPU
✓ qwen3/qwen3-0.6B-base.pth already up-to-date


In [2]:
from reasoning_from_scratch.ch03 import render_prompt
from reasoning_from_scratch.ch04 import (
    generate_text_stream_concat_flex,
    generate_text_top_p_stream_cache
)

raw_prompt = (
    "Half the value of $3x-9$ is $x+37$. "
    "What is the value of $x$?"
)
prompt = render_prompt(raw_prompt)

torch.manual_seed(0)
response = generate_text_stream_concat_flex(
    model, tokenizer, prompt, device,
    max_new_tokens=2048, verbose=True,
    generate_func=generate_text_top_p_stream_cache,
    temperature=0.9,
    top_p=0.9
)

 46

In [3]:
response

' 46'

In [4]:
import json
import requests
from pathlib import Path

def load_math_train(local_path="math_train.json", save_copy=True):
    local_path = Path(local_path)
    url = (
        "https://raw.githubusercontent.com/rasbt/"
        "math_full_minus_math500/refs/heads/main/"
        "math_full_minus_math500.json"
    )

    if local_path.exists():
        with local_path.open("r", encoding="utf-8") as f:
            data = json.load(f)
    else:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        data = r.json()

        if save_copy:  # Saves a local copy
            with local_path.open("w", encoding="utf-8") as f:
                json.dump(data, f, indent=2)

    return data

In [5]:
math_train = load_math_train()

print("Dataset size:", len(math_train))

Dataset size: 12000


In [6]:
from pprint import pprint

pprint(math_train[4])

{'answer': '6',
 'level': 'Level 3',
 'problem': 'Sam is hired for a 20-day period. On days that he works, he earns '
            '$\\$$60. For each day that he does not work, $\\$$30 is '
            'subtracted from his earnings. At the end of the 20-day period, he '
            'received $\\$$660. How many days did he not work?',
 'solution': 'Call $x$ the number of days Sam works and $y$ the number of days '
             'he does not. We can set up the following system of equations to '
             'represent the given information: \\begin{align*}\n'
             'x+y &= 20 \\\\\n'
             '60x - 30y &= 660 \\\\\n'
             '\\end{align*} The first equation represents the total number of '
             'days Sam works, and the second equation represents his total '
             'profit. Solving for $x$ in the first equation yields $x = 20 - '
             'y$. Substituting into the second equation gives $60(20-y) - 30y '
             '= 660$. Canceling a factor of $10$ an

In [7]:
from reasoning_from_scratch.qwen3 import KVCache
from reasoning_from_scratch.ch04 import top_p_filter


@torch.no_grad()
def sample_response(
    model,
    tokenizer,
    prompt,
    device,
    max_new_tokens=512,
    temperature=0.8,
    top_p=0.9,
):
    input_ids = torch.tensor(
        tokenizer.encode(prompt),
        device=device
    )

    cache = KVCache(n_layers=model.cfg["n_layers"])
    model.reset_kv_cache()
    logits = model(input_ids.unsqueeze(0), cache=cache)[:, -1]

    generated = []
    for _ in range(max_new_tokens):
        if temperature and temperature != 1.0:
            logits = logits / temperature

        probas = torch.softmax(logits, dim=-1)
        probas = top_p_filter(probas, top_p)
        next_token = torch.multinomial(
            probas.cpu(), num_samples=1
        ).to(device)

        if (
            tokenizer.eos_token_id is not None
            and next_token.item() == tokenizer.eos_token_id
        ):
            break
        generated.append(next_token.item())
        logits = model(next_token, cache=cache)[:, -1]

    # Prompt tokens followed by the sampled response tokens
    full_token_ids = torch.cat(
        [
            input_ids,
            torch.tensor(generated, device=device, dtype=input_ids.dtype),
        ]
    )
    return full_token_ids, input_ids.numel(), tokenizer.decode(generated)

In [8]:
torch.manual_seed(0)

raw_prompt = (
    "Half the value of $3x-9$ is $x+37$. "
    "What is the value of $x$?"
)
prompt = render_prompt(raw_prompt)

token_ids, prompt_len, answer_text = sample_response(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    device=device,
    max_new_tokens=512,
    temperature=0.9,
    top_p=0.9,
)

print(answer_text)

 46


In [9]:
torch.manual_seed(1)

token_ids, prompt_len, answer_text = sample_response(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    device=device,
    max_new_tokens=512,
    temperature=0.9,
    top_p=0.9,
)

print(answer_text)

 Let's solve the problem step by step.

1. **Translate the problem into an equation:**
   
   The problem states that half the value of \( 3x - 9 \) is \( x + 37 \). We can translate this into the following equation:
   \[
   \frac{1}{2}(3x - 9) = x + 37
   \]

2. **Simplify the equation:**
   
   Distribute the \( \frac{1}{2} \) on the left side:
   \[
   \frac{3x}{2} - \frac{9}{2} = x + 37
   \]

3. **Eliminate the fraction by multiplying every term by 2:**
   \[
   3x - 9 = 2x + 74
   \]

4. **Isolate the variable \( x \):**
   
   Subtract \( 2x \) from both sides:
   \[
   3x - 2x - 9 = 74
   \]
   \[
   x - 9 = 74
   \]

5. **Solve for \( x \):**
   
   Add 9 to both sides:
   \[
   x = 74 + 9
   \]
   \[
   x = 83
   \]

6. **Write the final answer:**
   
   \[
   \boxed{83}
   \]


In [10]:
torch.manual_seed(5)

token_ids, prompt_len, answer_text = sample_response(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    device=device,
    max_new_tokens=512,
    temperature=0.9,
    top_p=0.9,
)

print(answer_text)

 Let's solve the problem step by step.

**Given:**
\[
\frac{1}{2} \times (3x - 9) = x + 37
\]

**Step 1: Eliminate the fraction by multiplying both sides by 2.**
\[
2 \times \left( \frac{1}{2} \times (3x - 9) \right) = 2 \times (x + 37)
\]
\[
3x - 9 = 2x + 74
\]

**Step 2: Subtract \(2x\) from both sides to get the \(x\)-terms on one side.**
\[
3x - 2x - 9 = 74
\]
\[
x - 9 = 74
\]

**Step 3: Add 9 to both sides to solve for \(x\).**
\[
x = 74 + 9
\]
\[
x = 83
\]

**Final Answer:**
\[
\boxed{83}
\]


- It's cool that sometimes Qwen kinda reasons here and sometimes it doesn't: with the same prompt, seed 0 just emits " 46", while seeds 1 and 5 work step by step to \boxed{83}.
- Wonder how this varies across models.
- OK, so from here Raschka uses some simulated rollouts to walk through the GRPO math, that's cool.
- The general premise (how do we train our model to get the right answer more often?) is nice.
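"Getting the right answer more often" needs a way to score each rollout. Below is a minimal sketch of a verifiable reward, assuming the final answer is whatever sits inside the last `\boxed{...}` and that exact string match against the dataset's `answer` field is good enough. Both are assumptions, and a real grader would normalize equivalent forms. `extract_boxed` and `reward` are hypothetical helper names, not Raschka's. The three responses are abbreviated versions of the samples above (seeds 0, 1, and 5).

```python
def extract_boxed(text):
    # Return the contents of the last \boxed{...} in text, or None.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # track nested braces, e.g. \boxed{\frac{1}{2}}
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def reward(response, gold_answer):
    # 1.0 if the boxed answer matches the reference exactly, else 0.0.
    predicted = extract_boxed(response)
    return 1.0 if predicted == gold_answer.strip() else 0.0


# Abbreviated rollouts for "Half the value of $3x-9$ is $x+37$ ..."
responses = [
    " 46",
    r"Let's solve the problem step by step. ... \[ \boxed{83} \]",
    r"Let's solve the problem step by step. ... \[ \boxed{83} \]",
]
gold = "83"  # (3x - 9) / 2 = x + 37  =>  x = 83

rewards = [reward(r, gold) for r in responses]
print(rewards)  # [0.0, 1.0, 1.0]
print("success rate:", sum(rewards) / len(rewards))
```

Under this rule an unboxed answer always scores 0, even when it is numerically correct, which also nudges the model toward the boxed format.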
  

In [1]:
# import torch
# from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedModel

In [6]:
# # MODEL_NAME = "Qwen/Qwen2.5-3B"
# MODEL_NAME = "Qwen/qwen3-0.6B-base"

In [7]:
# policy_model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME,
#     attn_implementation="flash_attention_2",
#     torch_dtype=torch.bfloat16,
#     device_map=0,
# )
# # reference_model = AutoModelForCausalLM.from_pretrained(
# #     MODEL_NAME,
# #     attn_implementation="flash_attention_2",
# #     torch_dtype=torch.bfloat16,
# #     device_map=0,
# # )


- Hmm, yeah, this is a non-trivial amount of tooling with either repo.
- Raschka's seems friendlier, but more complex than I would want in some areas for sure.
- Can I achieve my hacking goals here using Raschka's code?
- Seems like it comes down to whether I can make sense of the policy gradient part of his code and modify it.
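For reference while reading that part, here is a minimal sketch of what the policy gradient core boils down to. It is my own stripped-down version, not Raschka's code: sum the log-probabilities of the response tokens, then weight them by an advantage. The inputs mirror what `sample_response` returns (`token_ids`, `prompt_len`). The helper names are hypothetical, and the toy logits stand in for a real model forward pass, whose exact output shape I am assuming to be `(seq_len, vocab_size)` after dropping the batch dimension.

```python
import math

import torch
import torch.nn.functional as F


def sequence_logprob(logits, token_ids, prompt_len):
    # logits: (seq_len, vocab_size) for the full prompt + response
    # token_ids: (seq_len,) the same sequence as integer IDs
    # Position t predicts token t + 1, so shift by one.
    logprobs = F.log_softmax(logits[:-1].float(), dim=-1)
    targets = token_ids[1:]
    token_logprobs = logprobs.gather(1, targets.unsqueeze(1)).squeeze(1)
    # Keep only response tokens; the first one (index prompt_len)
    # is predicted from position prompt_len - 1.
    return token_logprobs[prompt_len - 1:].sum()


def policy_gradient_loss(seq_logprobs, advantages):
    # REINFORCE-style objective: raise log-probs of rollouts with
    # positive advantage, lower those with negative advantage.
    # Advantages are treated as constants (no gradient through them).
    return -(advantages.detach() * seq_logprobs).mean()


# Toy stand-in for model output: uniform logits over a 5-token vocab.
toy_logits = torch.zeros(4, 5, requires_grad=True)
toy_token_ids = torch.tensor([0, 1, 2, 3])  # first 2 tokens = prompt
toy_prompt_len = 2

lp = sequence_logprob(toy_logits, toy_token_ids, toy_prompt_len)
loss = policy_gradient_loss(lp.unsqueeze(0), torch.tensor([1.0]))
loss.backward()

print("response log-prob:", lp.item())
print("expected (2 tokens, uniform over 5):", 2 * math.log(0.2))
print("grad on prompt-only position:", toy_logits.grad[0])
```

The gradient only touches the positions that predict response tokens, which is the masking that GRPO-style losses also need. The variants on my list mostly differ in how the advantage is computed and in extras like clipping or a KL term, not in this core.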