<a href="https://colab.research.google.com/github/byi8220/unsloth-puzzles/blob/main/Problem5/Unsloth_Problem_5_Llama_8B_Quantized_GRPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsloth Problem 5 - Memory Efficient Backprop on Llama 3.1 8B GRPO

#### Ran on a Colab L4 GPU instance

This is a reproduction of Unsloth's [Llama 3.1 (8B) GRPO](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb) notebook, using a modified model with `MemoryEfficientLinear` patched in and the relevant part of `compute_loss` inlined into the model's forward pass.

This is quite hacky, and `selective_log_softmax` is not bfloat16-friendly, so the logits are upcast to float32 before it is applied.
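To give a rough sense of why avoiding the full logits tensor matters, the sketch below estimates its float32 size. The vocabulary size (128,256) is Llama 3.1's; the batch size and sequence length are assumptions taken from the training config later in this notebook (2 sequences, up to 256 prompt plus 200 completion tokens). `logits_bytes` is a hypothetical helper, not part of Unsloth or TRL.

```python
# Back-of-the-envelope estimate of the float32 logits tensor size.
VOCAB_SIZE = 128_256      # Llama 3.1 vocabulary size
BYTES_PER_FLOAT32 = 4

def logits_bytes(bsz: int, qlen: int, vocab: int = VOCAB_SIZE) -> int:
    """Bytes needed to hold a (bsz, qlen, vocab) float32 tensor."""
    return bsz * qlen * vocab * BYTES_PER_FLOAT32

# Assumed shapes: batch of 2, 256 prompt + 200 completion tokens.
full = logits_bytes(bsz=2, qlen=256 + 200)
# With 2 chunks over the batch, only one sequence's logits exist at a time.
per_chunk = logits_bytes(bsz=1, qlen=256 + 200)

print(f"full logits: {full / 2**20:.1f} MiB")
print(f"per chunk:   {per_chunk / 2**20:.1f} MiB")
```

Chunking over the batch bounds the peak size of this one tensor; it does not shrink the hidden states or the weights that still have to be kept for the backward pass.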

In [None]:
%%capture
# Code to install Unsloth, Triton, Torch etc
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install --upgrade transformers
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install unsloth vllm
!pip install flash-attn --no-build-isolation
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git@nightly git+https://github.com/unslothai/unsloth-zoo.git

!pip install ipywidgets # Needed to export to github

### Patch MemoryEfficientLinear into `new_model`

In [None]:
#@title MemoryEfficientLinear Implementation
device = "cuda"
import torch
import torch.nn.functional as F
import math

class MemoryEfficientLinear(torch.autograd.Function):
    # IMO, the spec is a bit vague, and I interpreted the arguments
    # as MemoryEfficientLinear.forward(X, W, labels, fn) = fn(XW, labels)
    @staticmethod
    # (bsz, qlen, hd) @ (hd, vocab) -> (bsz, qlen, vocab)
    def forward(ctx, X, W, labels, forward_function, mel_num_chunks=1, ignore_index=-100):
        # NOTE: I wasn't sure what `allows_dynamic_chunk_sizes` means here.
        # I interpreted it to mean "let the user specify the number of chunks,
        # and the chunks will be sized accordingly."
        ctx.mel_num_chunks = mel_num_chunks # How to chunk `XW` over batches

        # Perform `forward_function` in chunks, and reduce them into `output`
        output = 0.0

        # Require uniform chunk size, for cleaner computations involving
        # `ForCausalLMLoss` and `num_items_in_batch`.
        assert X.shape[0] % ctx.mel_num_chunks == 0
        assert ctx.mel_num_chunks <= X.shape[0]
        b_per_chunk = X.shape[0] // ctx.mel_num_chunks

        N = 0
        for b in range(ctx.mel_num_chunks):
            b0, b1 = b *  b_per_chunk, (b+1) * b_per_chunk
            # Chunk over the batch only: logits shrink from (bsz, qlen, vocab)
            # to (b_per_chunk, qlen, vocab)
            with torch.no_grad():
                X_slice = X[b0:b1]
                l_slice = labels[b0:b1]
                XW_slice = (F.linear(X_slice, W.T)).float()
            output += torch.numel(l_slice) * forward_function(XW_slice, l_slice)
            N += torch.numel(l_slice)
        del XW_slice
        ctx.save_for_backward(X, W, labels)
        ctx.forward_function = forward_function
        ctx.N = N
        ctx.ignore_index = ignore_index
        return output / N

    # L(X,W,T,f) = f(XW, T)
    # dL/dX = dL/df * df/d(XW) * d(XW)/dX
    # dL/dW = dL/df * df/d(XW) * d(XW)/dW
    # We want to avoid materializing df/d(XW) to save on memory,
    # as XW is the large tensor we are trying to avoid materializing
    @staticmethod
    def backward(ctx, dY):

        # As written we need to retain at least all of X, W, labels
        # (This could possibly be optimized more)
        X, W, labels = ctx.saved_tensors

        # The absolute minimum memory usage this function can possibly incur is
        # that required for the returned gradients.
        dX = torch.zeros_like(X)
        dW = torch.zeros_like(W)
        assert X.shape[0] % ctx.mel_num_chunks == 0
        assert ctx.mel_num_chunks <= X.shape[0]
        b_per_chunk = X.shape[0] // ctx.mel_num_chunks
        for b in range(ctx.mel_num_chunks):
            b0, b1 = b * b_per_chunk, (b+1) * b_per_chunk
            X_slice = X[b0:b1].detach().requires_grad_()
            W_slice = W.detach().requires_grad_()
            l_slice = labels[b0:b1].detach()
            with torch.enable_grad():
                XW_slice = (F.linear(X_slice, W_slice.T)).float()
                out = ctx.forward_function(XW_slice, l_slice) * torch.numel(l_slice)
            # From my testing this appears to use more memory than hardcoded matmul (sometimes)
            dX_slice, dW_slice = torch.autograd.grad(out, (X_slice, W_slice), dY / ctx.N, retain_graph=False, create_graph=False)
            dX[b0:b1] = dX_slice.to(dX.dtype)
            dW += dW_slice.to(dW.dtype)

        return dX, dW, None, None, None, None
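
The chain rule in the comments above can be written out per chunk. This is a sketch in notation introduced here for illustration: $X_c$ is chunk $c$ of the hidden states flattened to a matrix of token rows, $Z_c = X_c W$ are its logits, $n_c$ is the number of labels in the chunk, $N = \sum_c n_c$, and $G_c$ is the gradient of the combined loss with respect to $Z_c$.

```latex
\begin{aligned}
L &= \frac{1}{N}\sum_{c} n_c \, f(Z_c, T_c), \qquad Z_c = X_c W,\\
G_c &= \frac{\partial L}{\partial Z_c} = \frac{n_c}{N}\,\frac{\partial f(Z_c, T_c)}{\partial Z_c},\\
\frac{\partial L}{\partial X_c} &= G_c W^{\top}, \qquad
\frac{\partial L}{\partial W} = \sum_{c} X_c^{\top} G_c .
\end{aligned}
```

Both results are then scaled by the upstream gradient `dY`. Only one $G_c$ needs to exist at a time, which is why `backward` recomputes each chunk's logits instead of saving them in `forward`.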


In [None]:
#@title MemoryEfficientLinearSls Implementation
import torch
from trl.trainer.utils import selective_log_softmax
from typing import Callable, List, Optional, Tuple, Union
from transformers.models.llama.modeling_llama import LlamaForCausalLM, KwargsForCausalLM
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.cache_utils import Cache
from transformers.processing_utils import Unpack
from transformers.loss.loss_utils import ForCausalLMLoss
from functools import partial
import torch.nn as nn
import torch.nn.functional as F
import gc

# Linear -> selective_log_softmax fusion
# This is specialized: it returns per-token values of shape (bsz, qlen)
# instead of a scalar loss.
class MemoryEfficientLinearSLS(torch.autograd.Function):
    @staticmethod
    def forward(ctx, X, W, index, forward_function, mel_num_chunks=1):
        if X.shape[0] > 1:
            ctx.mel_num_chunks = mel_num_chunks
            assert X.shape[0] % ctx.mel_num_chunks == 0
            assert ctx.mel_num_chunks <= X.shape[0]
            b_per_chunk = X.shape[0] // ctx.mel_num_chunks
        else:
            b_per_chunk = 1
            ctx.mel_num_chunks = 1
        # selective_log_softmax
        bsz, qlen = X.shape[0], X.shape[1]
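        # Allocate on the same device as the input rather than the global `device`
        output = torch.zeros(bsz, qlen, dtype=torch.float32, device=X.device)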
        for b in range(ctx.mel_num_chunks):
            b0, b1 = b *  b_per_chunk, (b+1) * b_per_chunk
            with torch.no_grad():
                X_slice = X[b0:b1]
                l_slice = index[b0:b1]
                XW_slice = (F.linear(X_slice, W.T)).float()
            output[b0:b1] = forward_function(XW_slice, l_slice)
        del XW_slice
        ctx.save_for_backward(X, W, index)
        ctx.forward_function = forward_function
        return output

    @staticmethod
    def backward(ctx, dY):

        X, W, index = ctx.saved_tensors
        dX = torch.zeros_like(X)
        dW = torch.zeros_like(W)
        assert X.shape[0] % ctx.mel_num_chunks == 0
        assert ctx.mel_num_chunks <= X.shape[0]
        b_per_chunk = X.shape[0] // ctx.mel_num_chunks
        for b in range(ctx.mel_num_chunks):
            b0, b1 = b * b_per_chunk, (b+1) * b_per_chunk
            X_slice = X[b0:b1].detach().requires_grad_()
            W_slice = W.detach().requires_grad_()
            l_slice = index[b0:b1].detach()
            with torch.enable_grad():
                XW_slice = (F.linear(X_slice, W_slice.T)).float()
                out = ctx.forward_function(XW_slice, l_slice)
            dX_slice, dW_slice = torch.autograd.grad(out, (X_slice, W_slice), dY[b0:b1], retain_graph=False, create_graph=False)
            dX[b0:b1] = dX_slice.to(dX.dtype)
            dW += dW_slice.to(dW.dtype)
        return dX, dW, None, None, None, None
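
For reference, the value that the fused layer gathers at each position is the log-softmax of the logits evaluated at one token id. The sketch below shows that computation for a single position in plain Python. `selective_log_softmax_row` is a hypothetical name used only here; TRL's `selective_log_softmax` is vectorized over batches and may differ in its details.

```python
import math

def selective_log_softmax_row(logits: list[float], index: int) -> float:
    """log_softmax(logits)[index] for one position, via a stable log-sum-exp."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_sum_exp

# Made-up logits over a 4-token vocabulary
logits = [2.0, 0.5, -1.0, 0.0]
logp = selective_log_softmax_row(logits, index=0)
print(logp)
```

Only one number per position is kept, so the output has shape `(bsz, qlen)` rather than `(bsz, qlen, vocab)`.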

In [None]:
import torch

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Presumably we're testing base Llama.

hf_repo = "unsloth/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(hf_repo)
peft_config = LoraConfig(
    r=16,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)
quantization_config = BitsAndBytesConfig(
        load_in_4bit              = True,
        bnb_4bit_use_double_quant = True,
        bnb_4bit_quant_type       = "nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    hf_repo,
    quantization_config=quantization_config,
    torch_dtype = torch.bfloat16
)


In [None]:
#@title MemoryEfficientLinear Patch
# Model based on https://github.com/huggingface/transformers/blob/51083d1bac7905aa8316b75f7897bdd4e5302044/src/transformers/models/llama/modeling_llama.py#L752
from typing import Callable, List, Optional, Tuple, Union
from transformers.models.llama.modeling_llama import LlamaForCausalLM, KwargsForCausalLM
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.utils import ModelOutput
from transformers.cache_utils import Cache
from transformers.loss.loss_utils import ForCausalLMLoss
from transformers.processing_utils import Unpack
from transformers.utils.deprecation import deprecate_kwarg
from functools import partial
import torch.nn as nn
from dataclasses import dataclass
from trl.trainer.utils import selective_log_softmax

# In new_model, we are fusing the loss function and the final linear projection
# into one layer.
NUM_MEL_CHUNKS = 2

@dataclass
class CausalLMOutputWithPastAndLogps(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    logps: Optional[torch.FloatTensor] = None
    logits: torch.FloatTensor = None
    past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None

@deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
def mem_eff_forward(
    self,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    cache_position: Optional[torch.LongTensor] = None,
    logits_to_keep: Union[int, torch.Tensor] = 0,
    return_logits = True,
    **kwargs: Unpack[KwargsForCausalLM],
) -> Union[Tuple, CausalLMOutputWithPastAndLogps]:
    r"""
        Near-identical to `LlamaForCausalLM.forward()`, except:
          1. When `labels` are provided, the loss is computed via `MemoryEfficientLinear`.
          2. Per-token log-probabilities (`logps`) are computed via `MemoryEfficientLinearSLS`.
          3. Logits are only materialized when `return_logits=True`.
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
        cache_position=cache_position,
        **kwargs,
    )

    hidden_states = outputs[0]
    # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
    slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep

    logits, loss, logps = None, None, None
    # [ForCausalLMLoss](https://github.com/huggingface/transformers/blob/51083d1bac7905aa8316b75f7897bdd4e5302044/tests/trainer/test_trainer.py#L189)
    # The default loss function, ForCausalLMLoss uses mean reduction.
    memEffLinear = MemoryEfficientLinear.apply
    if labels is not None:
        # ForCausalLMLoss utilizes the "num_items_in_batch" kwarg passed in
        # via the `Trainer` to normalize the computed cross entropy loss.
        # This completely throws off our calculation since we expect
        # `forward_function()` to return the mean of only the provided args.
        #
        # The solution below is hacky, but we can scale forward_function()
        # by its contribution to the accumulated loss
        loss =  memEffLinear(hidden_states[:, slice_indices, :], self.lm_head.weight.T,
                            labels,
                            partial(ForCausalLMLoss, vocab_size=self.config.vocab_size, **kwargs),
                            NUM_MEL_CHUNKS)
        if "num_items_in_batch" in kwargs:
            loss = loss * NUM_MEL_CHUNKS
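    # The hidden state at position t predicts token t + 1, so the gathered
    # token ids are shifted one position to the right of the hidden states.
    logp_slice_indices = slice(-logits_to_keep, -1)
    logp_hidden = hidden_states[:, logp_slice_indices, :]
    logp_targets = input_ids[:, 1:]
    logp_targets = logp_targets[:, logp_targets.shape[1] - logp_hidden.shape[1]:]
    memEffLinearSLS = MemoryEfficientLinearSLS.apply
    logps = memEffLinearSLS(logp_hidden, self.lm_head.weight.T,
                            logp_targets,
                            selective_log_softmax,
                            NUM_MEL_CHUNKS)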
    if return_logits:
        logits = self.lm_head(hidden_states[:, slice_indices, :])
    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output
    return CausalLMOutputWithPastAndLogps(
        loss=loss,
        logps = logps,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

# Model patching
import types
setattr(model, "forward", types.MethodType(mem_eff_forward, model))
model = get_peft_model(model, peft_config)

model.print_trainable_parameters()

trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
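
The comment in `mem_eff_forward` about scaling each chunk "by its contribution to the accumulated loss" refers to the `output / N` reduction in `MemoryEfficientLinear.forward`. The plain-Python sketch below illustrates the idea with made-up per-token losses. `combine_chunk_means` is a hypothetical helper that mirrors the reduction; it is not code from the notebook.

```python
# Recombining per-chunk means into a mean over the whole batch.
from statistics import fmean

def combine_chunk_means(chunks: list[list[float]]) -> float:
    """Size-weighted average of per-chunk means (mirrors `output / N`)."""
    total, n = 0.0, 0
    for chunk in chunks:
        # numel(l_slice) * forward_function(XW_slice, l_slice)
        total += len(chunk) * fmean(chunk)
        n += len(chunk)
    return total / n

per_token_losses = [0.5, 1.5, 2.0, 4.0]          # made-up values
chunks = [per_token_losses[:2], per_token_losses[2:]]

combined = combine_chunk_means(chunks)
overall = fmean(per_token_losses)
print(combined, overall)
```

The two values agree when every element of a chunk counts toward its mean. If chunks contain different numbers of ignored labels (`-100`), each chunk mean is taken over a different token count and the recombined value is only an approximation of the batch mean.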


In [None]:
#@title Run the model once to compare pre-post training outputs
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

input_text = tokenizer(text, return_tensors='pt').to(device)
output = model.generate(
    input_text.input_ids,
    max_new_tokens=1024,
    temperature = 0.8,
    top_p = 0.95,
)

print(tokenizer.decode(output[0], skip_special_tokens=True))

system

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

user

Calculate pi.assistant

Calculating pi is a complex task that has been studied extensively in mathematics. Here's a simple approach using the Bailey-Borwein-Plouffe formula (BBP formula), which is a spigot algorithm for computing the nth binary digit of pi. However, this is a simplified explanation and will give you an approximation of pi.

**Mathematical Formula:**

The BBP formula is given by:

π = Σ (1/(16^k) * ((4/(8k+1)) + (2/(8k+4)) - (1/(8k+5)) - (1/(8k+6)) - (1/(8k+7)) - (1/(8k+8)) + (1/(8k+9)) + (1/(8k+10)) + (1/(8k+11)) + (1/(8k+12)) - (1/(8k+13)) - (1/(8k+14)) - (1/(8k+15)) - (1/(8k+16))))

where k is an integer starting from 0.

**Code Implementation:**

```python
def calculate_pi():
    n = 100  # number of iterations
    pi = 0.0
    for k in range(n):
        pi += (1/(16**k)) * ((4/(8*k+1)) + (2/(8*k+4)) - (1/(8*k+5)) - (1/(8*k+6)) - (1/(8*k+7)) - (1/(8*k+8)) + (1/(8*k+9)) + (1/(8*k+10)) + (1/(

In [None]:
#@title Patch GRPOTrainer
from trl import GRPOConfig, GRPOTrainer
class MemEffGRPOTrainer(GRPOTrainer):
    def _get_per_token_logps(self, model, input_ids, attention_mask, logits_to_keep):
        return model(input_ids=input_ids, attention_mask=attention_mask, logits_to_keep=logits_to_keep + 1, return_logits=False).logps
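
The `logits_to_keep + 1` argument reflects next-token alignment: the hidden state at position `t` scores the token at position `t + 1`, so one extra position is requested and the final one is dropped. The toy sketch below shows the intended pairing for a single sequence. The token names are placeholders, and this describes the alignment as I understand TRL's default `_get_per_token_logps`, not a guarantee about any particular TRL version.

```python
# Which hidden-state positions score which tokens, for one toy sequence.
tokens = ["p1", "p2", "c1", "c2", "c3"]   # 2 prompt tokens + 3 completion tokens
completion_len = 3                         # `logits_to_keep` as passed by the trainer

k = completion_len + 1                     # value forwarded to the model
positions = list(range(len(tokens)))[-k:-1]   # hidden states that are kept
targets = tokens[-completion_len:]            # tokens whose log-probs are gathered

for pos, tgt in zip(positions, targets):
    print(f"hidden state at {tokens[pos]!r} -> log-prob of {tgt!r}")
```

Each kept hidden state is paired with the token immediately after it, so exactly the completion tokens are scored.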



### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!

In [None]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# uncomment middle messages for 1-shot prompting
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
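
As a quick sanity check on the two format rewards, the sketch below applies their patterns to made-up completions. The patterns are copied from `strict_format_reward_func` and `soft_format_reward_func`; the sample strings and variable names are assumptions for illustration only.

```python
import re

STRICT = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
SOFT = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

# Made-up completions: reasoning on one line vs. spread over two lines.
one_line = "<reasoning>\n6 * 7 = 42\n</reasoning>\n<answer>\n42\n</answer>\n"
two_lines = "<reasoning>\n6 * 7\n= 42\n</reasoning>\n<answer>\n42\n</answer>\n"

strict_one = bool(re.match(STRICT, one_line))
strict_two = bool(re.match(STRICT, two_lines))
soft_two = bool(re.match(SOFT, two_lines))
# Same soft pattern, but letting `.` match newlines as well.
soft_two_dotall = bool(re.match(SOFT, two_lines, flags=re.DOTALL))

print(strict_one, strict_two, soft_two, soft_two_dotall)
```

Without `re.DOTALL`, `.` does not match a newline, so neither pattern accepts reasoning that spans several lines. That may be one reason the format rewards stay at zero for many completions; passing `flags=re.DOTALL` is one possible adjustment, though it changes the reward and was not used for the run recorded here.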


<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [None]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = False,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = True,
    fp16 = False,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
    num_generations = 2, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 100,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)
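
To make the `warmup_ratio` and `lr_scheduler_type = "cosine"` settings concrete, the sketch below computes the learning rate at a given step. It assumes the usual linear-warmup then cosine-decay shape and rounds the warmup length up; `lr_at` is a hypothetical helper, and the exact rounding inside `transformers` may differ.

```python
import math

def lr_at(step: int, base_lr: float = 5e-6, max_steps: int = 250,
          warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` under linear warmup followed by cosine decay."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 10, 25, 100, 250):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at `learning_rate` after 25 steps and decays toward zero by step 250, so the earliest and latest steps contribute very small updates.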

And let's run the trainer! The training output includes a table of rewards like the example below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
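
The `reward` and `reward_std` columns matter because GRPO turns each group of completions for a prompt into relative advantages. The sketch below shows that normalization in plain Python. It assumes the sample standard deviation and a small epsilon in the denominator, which is my understanding of TRL's behavior; `group_advantages` and the reward values are hypothetical.

```python
from statistics import mean, stdev

def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0  # sample std (assumed)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two completions for one prompt (num_generations = 2), made-up rewards
print(group_advantages([1.0125, 0.0]))
# Identical rewards give zero advantages, i.e. no learning signal
print(group_advantages([0.5, 0.5]))
```

When every completion in a group receives the same reward, `reward_std` is 0 and all advantages vanish, which is one reason early steps can show little movement.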


In [None]:
trainer = MemEffGRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
    peft_config=peft_config,
)
from transformers.utils.notebook import NotebookProgressCallback, NotebookTrainingTracker
from transformers.trainer_utils import IntervalStrategy
class ExtendedNotebookProgressCallback(NotebookProgressCallback):
    def on_train_begin(self, args, state, control, **kwargs):
        self.first_column = "Epoch" if args.eval_strategy == IntervalStrategy.EPOCH else "Step"
        self.training_loss = 0
        self.last_log = 0
        column_names = [self.first_column] + ["Training Loss", "reward", "reward_std", "completion_length", "kl"]
        if args.eval_strategy != IntervalStrategy.NO:
            column_names.append("Validation Loss")
        self.training_tracker = NotebookTrainingTracker(state.max_steps, column_names)

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Only for when there is no evaluation
        if args.eval_strategy == IntervalStrategy.NO and "loss" in logs:
            values = {
                "Training Loss": logs["loss"],
                "reward": logs["reward"],
                "reward_std": logs["reward_std"],
                "completion_length": logs["completion_length"],
                "kl": logs["kl"]
            }
            # First column is necessarily Step since we're not in epoch eval strategy
            values["Step"] = state.global_step
            self.training_tracker.write_line(values)

trainer.pop_callback(NotebookProgressCallback)
trainer.add_callback(ExtendedNotebookProgressCallback)
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


-------------------- Question:
Ahmed and Emily are having a contest to see who can get the best grade in the class. There have been 9 assignments and Ahmed has a 91 in the class. Emily has a 92. The final assignment is worth the same amount as all the other assignments. Emily got a 90 on the final assignment. What is the minimum grade Ahmed needs to get to beat Emily if all grades are whole numbers? 
Answer:
100 
Response:
Let's break it down:

1. Ahmed has a total of 9 assignments with a grade of 91. 
Let's denote the total of his grades as 'A'.
The sum is A = 91 * 9

2. The grade Emily has is 92, over 9 assignments. 
Denote the sum of her grades as 'E'.
The sum is E = 92 * 9

3. To find the minimum grade Ahmed needs, let's first calculate the total sum of grades both Ahmed and Emily have.
Then we'll find their average.

For Ahmed's average to be higher than Emily's average, we need:

(A + x) / 10 > (E - 90 + 90) / 10 
Where 'x' is Ahmed's grade on the final assignment.

By expanding 

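| Step | Training Loss | reward   | reward_std | completion_length | kl       |
|------|---------------|----------|------------|-------------------|----------|
| 1    | 0.0           | 0.0      | 0.0        | 176.5             | 0.0      |
| 2    | 0.0           | 0.50625  | 0.715946   | 173.5             | 0.0      |
| 3    | 0.0003        | -0.08275 | 0.117026   | 189.5             | 0.007162 |
| 4    | 0.0003        | 0.535    | 0.756604   | 174.0             | 0.006622 |
| 5    | 0.0003        | 0.0335   | 0.041719   | 143.25            | 0.007222 |
| 6    | 0.0003        | 0.0      | 0.0        | 160.25            | 0.007957 |
| 7    | 0.0004        | 0.0      | 0.0        | 158.0             | 0.009026 |
| 8    | 0.0003        | 0.0      | 0.0        | 158.75            | 0.007247 |
| 9    | 0.0002        | 0.03125  | 0.044194   | 198.5             | 0.006098 |
| 10   | 0.0002        | 0.07525  | 0.10642    | 183.75            | 0.005454 |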


Streaming output truncated to the last 5000 lines.
Jason has to cover a total distance of 120 miles.
He has already driven for 30 minutes at 60 miles per hour, which means:

Distances = Speed × Time
In 30 minutes (0.5 hours), Jason would have driven:
Distance = 60 × 0.5 = 30 miles

Now, Jason still needs to cover a distance of 120 - 30 = 90 miles in the remaining 1 hour 0 minutes (0.167 hours - to make it 1.5 hours, but he already has 0.5 hours under his belt – the remaining time will be 1 hour).

So, he has 1 hour to cover 90 miles:

Speed = Distance / Time
Speed = 90 miles / 1 hour = 90 miles per hour.

He must maintain this speed of 90 miles per hour for the remaining 1 hour.
-------------------- Question:
Pat is having a picnic with her family. She has 42 cookies. She also has 63 pieces of candy and 21 brownies. There are 7 people in her family. If each person gets the same number of each dessert, how much will each person get? 
Answer:
18 
Response:
To find out how m

TrainOutput(global_step=250, training_loss=0.0005284424846759066, metrics={'train_runtime': 14419.1325, 'train_samples_per_second': 0.069, 'train_steps_per_second': 0.017, 'total_flos': 0.0, 'train_loss': 0.0005284424846759066})

<a name="Inference"></a>
### Inference
Now let's try the model we just trained!

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)


input_text = tokenizer(text, return_tensors='pt').to(device)
output = model.generate(
    input_text.input_ids,
    max_new_tokens=1024,
    temperature = 0.8,
    top_p = 0.95,
)

print(tokenizer.decode(output[0], skip_special_tokens=True))

system

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>user

Calculate pi.assistant

<mathematical reasoning>
Pi (π) is a mathematical constant representing the ratio of a circle's circumference to its diameter. It is an irrational number, meaning it cannot be expressed as a finite decimal or fraction.

One common method for calculating pi is through the use of the Monte Carlo method or the Gregory-Leibniz series. However, an even more efficient approach is the Bailey-Borwein-Plouffe (BBP) algorithm, which is a spigot algorithm that allows for the calculation of any individual hexadecimal or binary digit of pi without having to compute the preceding digits.

For a simplified approach, we can use the formula π ≈ (355/113), which provides a fairly accurate result. This method is based on a historical calculation by the Chinese mathematician Zu Chongzhi in the 5th century.

Another simple f

In [None]:
# Because colab, kaggle, and github notebook implementations are not uniform...
from ipywidgets import Widget
Widget.close_all()