## **Installation**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install or uv pip install
    !pip install unsloth vllm
else:
    pass # For Colab / Kaggle, we need extra instructions hidden below \/

In [3]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
!pip install --upgrade -qqq uv
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install!
    !pip install unsloth vllm
else:
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except: is_t4 = False
    get_vllm, get_triton = ("vllm==0.10.1", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    !uv pip install -qqq --upgrade \
        unsloth {get_vllm} {get_numpy} torchvision bitsandbytes xformers
    !uv pip install -qqq {get_triton}
!pip install transformers==4.55.4

## **Loading the Model**

### Unsloth

Load up `unsloth/mistral-7b-instruct-v0.3-bnb-4bit` and set the parameters:

In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 08-29 18:00:11 [__init__.py:241] Automatically detected platform cuda.
ERROR 08-29 18:00:13 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: Patching vLLM v1 graph capture
Unsloth: Patching vLLM v0 graph capture
==((====))==  Unsloth 2025.8.10: Fast Mistral patching. Transformers: 4.55.4. vLLM: 0.10.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/mistral-7b-instruct-v0.3-bnb-4bit with actual GPU utilization

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/446 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/157 [00:00<?, ?B/s]

INFO 08-29 18:00:57 [cuda.py:384] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-29 18:00:57 [cuda.py:433] Using XFormers backend.
INFO 08-29 18:00:58 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 08-29 18:00:58 [model_runner.py:1080] Starting to load model unsloth/mistral-7b-instruct-v0.3-bnb-4bit...
INFO 08-29 18:00:59 [bitsandbytes_loader.py:742] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 08-29 18:00:59 [weight_utils.py:296] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/4.14G [00:00<?, ?B/s]

INFO 08-29 18:01:51 [weight_utils.py:312] Time spent downloading weights for unsloth/mistral-7b-instruct-v0.3-bnb-4bit: 51.843110 seconds
INFO 08-29 18:01:52 [weight_utils.py:349] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]



INFO 08-29 18:02:01 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 08-29 18:02:02 [model_runner.py:1112] Model loading took 4.0423 GiB and 62.760536 seconds
INFO 08-29 18:02:14 [worker.py:295] Memory profiling takes 10.90 seconds
INFO 08-29 18:02:14 [worker.py:295] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.59) = 8.76GiB
INFO 08-29 18:02:14 [worker.py:295] model weights take 4.04GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.24GiB; the rest of the memory reserved for KV Cache is 4.45GiB.
INFO 08-29 18:02:14 [executor_base.py:114] # cuda blocks: 2278, # CPU blocks: 0
INFO 08-29 18:02:14 [executor_base.py:119] Maximum concurrency for 1024 tokens per request: 35.59x
INFO 08-29 18:02:14 [vllm_utils.py:676] Unsloth: Running patched vLLM v0 `capture_model`.
INFO 08-29 18:02:14 [model_runner.py:1383] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run th

Capturing CUDA graph shapes:   0%|          | 0/27 [00:00<?, ?it/s]

INFO 08-29 18:02:43 [model_runner.py:1535] Graph capturing finished in 29 secs, took 0.60 GiB
INFO 08-29 18:02:43 [vllm_utils.py:683] Unsloth: Patched vLLM v0 graph capture finished in 29 secs.
INFO 08-29 18:02:44 [llm_engine.py:417] init engine (profile, create kv cache, warmup model) took 41.78 seconds
INFO 08-29 18:02:44 [llm.py:298] Supported_tasks: ['generate']
Unsloth: Just some info: will skip parsing ['post_feedforward_layernorm', 'q_norm', 'k_norm', 'pre_feedforward_layernorm']


tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/446 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]
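
The memory figures in the vLLM log above fit together with simple arithmetic. The sketch below re-derives the KV cache size and the reported maximum concurrency from the logged values. The block size of 16 tokens is an assumption (a common vLLM default); the log itself does not state it.

```python
# Values copied from the vLLM log above (GiB).
budget_gib = 8.76        # total_gpu_memory x gpu_memory_utilization
weights_gib = 4.04       # model weights
non_torch_gib = 0.03     # non_torch_memory
activation_gib = 0.24    # PyTorch activation peak memory

# Whatever remains of the budget is reserved for the KV cache.
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_gib

# Assumption: each KV cache block holds 16 tokens (not stated in the log).
ASSUMED_BLOCK_SIZE = 16
num_gpu_blocks = 2278    # "# cuda blocks" in the log
max_seq_length = 1024

cache_tokens = num_gpu_blocks * ASSUMED_BLOCK_SIZE
max_concurrency = cache_tokens / max_seq_length

print(f"KV cache: {kv_cache_gib:.2f} GiB")
print(f"Max concurrency: {max_concurrency:.2f}x")
```

Under that assumption both numbers agree with the log, which suggests that lowering `gpu_memory_utilization` shrinks the KV cache first, since the weights are a fixed cost.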

## **Adding LoRA Adapters**

In [5]:

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

Unsloth 2025.8.10 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
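
To see what `r = lora_rank` costs in trainable parameters, here is a rough count. It assumes the standard Mistral-7B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these come from the published model config, not from this notebook.

```python
# Assumed Mistral-7B dimensions (from the published config, not this notebook).
hidden = 4096
intermediate = 14336
kv_dim = 8 * 128     # 8 key/value heads x head dimension 128
num_layers = 32
r = 32               # lora_rank used above

# (in_features, out_features) for each targeted projection.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters.
per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
total = per_layer * num_layers

print(f"{total:,}")
```

The result matches the trainable-parameter count that Unsloth prints when training starts. The count scales linearly with `r`, so halving the rank halves the adapter size.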


## **Data Preparations**

<a name="Data"></a>
### Data Prep

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!

In [6]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# uncomment middle messages for 1-shot prompting
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore



In [7]:
dataset = load_dataset('openai/gsm8k', 'main')['train']

README.md: 0.00B [00:00, ?B/s]

main/train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

main/test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

In [8]:
dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 7473
})

In [9]:
dataset[5]["question"]

'Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?'

In [10]:
dataset[5]["answer"]

"There are 80/100 * 10 = <<80/100*10=8>>8 more purple flowers than yellow flowers.\nSo in Mark's garden, there are 10 + 8 = <<10+8=18>>18 purple flowers.\nPurple and yellow flowers sum up to 10 + 18 = <<10+18=28>>28 flowers.\nThat means in Mark's garden there are 25/100 * 28 = <<25/100*28=7>>7 green flowers.\nSo in total Mark has 28 + 7 = <<28+7=35>>35 plants in his garden.\n#### 35"

In [11]:
#@title Getting and editing the data
dataset = get_gsm8k_questions()

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

In [12]:
dataset

Dataset({
    features: ['question', 'answer', 'prompt'],
    num_rows: 7473
})

In [13]:
dataset[5]["prompt"]

[{'content': '\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n',
  'role': 'system'},
 {'content': 'Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?',
  'role': 'user'}]

In [14]:
dataset[5]["question"]

'Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?'

In [15]:
dataset[5]["answer"]

'35'
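
The change from the long worked solution to `'35'` comes from `extract_hash_answer`, which keeps only the text after the `####` marker. A minimal standalone sketch, using a shortened copy of the raw answer shown earlier:

```python
from typing import Optional

def extract_hash_answer(text: str) -> Optional[str]:
    # GSM8K stores the final answer after a "####" marker.
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Shortened copy of the raw dataset answer shown earlier.
raw = "So in total Mark has 28 + 7 = <<28+7=35>>35 plants in his garden.\n#### 35"

final = extract_hash_answer(raw)
missing = extract_hash_answer("no marker here")
print(repr(final), repr(missing))
```

The target is kept as a string, so the correctness reward later compares strings exactly rather than numeric values.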

In [16]:
#@title Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that match the exact line-by-line XML layout."""
    # Note: `.` does not match newlines unless re.DOTALL is set, so only
    # single-line reasoning and answer bodies can match this pattern.
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that contain the tags in order, with looser whitespace."""
    # Note: re.match anchors at the start of the string, and `.` does not
    # match newlines here, so multi-line reasoning will not match.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
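
`count_xml` can return negative values, which is worth knowing before reading the training log. The sketch below copies the function and scores two made-up completions: one well formed, and one that opens `<answer>` but never closes it. In the second case `text.split("\n</answer>\n")[-1]` is the whole text, so the length penalty applies to every character.

```python
def count_xml(text: str) -> float:
    # Copied from the reward cell above.
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count

# Hypothetical completions, written only for this illustration.
well_formed = "<reasoning>\nTwo plus two is four.\n</reasoning>\n<answer>\n4\n</answer>\n"
unclosed = "<reasoning>\nshort\n</reasoning>\n<answer>\n" + "x" * 600

score_ok = count_xml(well_formed)
score_unclosed = count_xml(unclosed)
print(score_ok, round(score_unclosed, 3))
```

A completion that never emits `</answer>` can therefore score below zero even though three of the four tags are present.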

## **Setting up Weights and Biases for Logging**

In [17]:
import wandb

In [18]:
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: dannyai (dannyai-danny-the-analyst) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [32]:
import os
# set the wandb project where this run will be logged
os.environ["WANDB_PROJECT"]="Fine_Tune_GPRO_Mistral_v0.3_7B"

# save your trained model checkpoint to wandb
# os.environ["WANDB_LOG_MODEL"]="true" # throws an error, must use 'checkpoint' or 'end'
os.environ["WANDB_LOG_MODEL"]="checkpoint"

# turn off watch to log faster
os.environ["WANDB_WATCH"]="false"

## **Model Training**

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [33]:
max_prompt_length = 256

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_seq_length - max_prompt_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    # max_steps = 250, # reduce if runtime is disconnected
    max_steps = 5, # reduce if runtime is disconnected
    save_steps = 1,
    max_grad_norm = 0.1,
    report_to = ["wandb"], # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6
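
The numbers in this configuration interact, so it helps to spell out the derived values. This is a sketch of the arithmetic only. Reading each batch of 6 as one prompt sampled 6 times is an interpretation, although it is consistent with the training log printing one question per step.

```python
# Values from the cells above.
max_seq_length = 1024
max_prompt_length = 256
num_generations = 6
gradient_accumulation_steps = 1
max_steps = 5

# Tokens left for the completion after reserving room for the prompt.
max_completion_length = max_seq_length - max_prompt_length

# Unsloth raises the per-device batch size from 1 to num_generations.
per_device_train_batch_size = num_generations
total_batch_size = per_device_train_batch_size * gradient_accumulation_steps  # single GPU

# Interpretation: each batch holds num_generations completions per prompt.
prompts_per_step = total_batch_size // num_generations
prompts_seen = prompts_per_step * max_steps

print(max_completion_length, total_batch_size, prompts_per_step, prompts_seen)
```

With `max_steps = 5` the run touches only a handful of the 7,473 training questions, which is why it works as a smoke test rather than a full training run.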


Now let's run the trainer! The training output includes a table of rewards like the one below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action, and you'll probably get 0 reward for the first 100 steps. Please be patient! Note that this notebook sets `max_steps = 5`, so the run below is far too short to show that trend.

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [34]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)

In [22]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 5
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 83,886,080 of 7,331,909,632 (1.14% trained)


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
<reasoning>
Since Mr. Benson bought 12 tickets and the discount is applied starting from the 11th ticket, we need to calculate the price for the first 10 tickets without the discount and the remaining 2 tickets with the discount.

First, let's assume the discounted price for 1 ticket is x (regular price - 5% of the regular price). Then, the price for 10 non-discounted tickets would be 10 * $40 = $400.

The total price for the 10 non-discounted tickets would be $400. The total price for the remaining 2 discounted tickets would be 2 * x (since we don't have their actual price yet).

To find the price for 1 discounted ticket, we need to subtract 5% of the regular price from the original price:
x = $40 - (0.05 * $40) = $38

So, the total price for the 2 discounted tickets would

| Step | Training Loss | reward | reward_std | completions / mean_length | completions / min_length | completions / max_length | completions / clipped_ratio | completions / mean_terminated_length | completions / min_terminated_length | completions / max_terminated_length | kl | entropy | rewards / xmlcount_reward_func / mean | rewards / xmlcount_reward_func / std | rewards / soft_format_reward_func / mean | rewards / soft_format_reward_func / std | rewards / strict_format_reward_func / mean | rewards / strict_format_reward_func / std | rewards / int_reward_func / mean | rewards / int_reward_func / std | rewards / correctness_reward_func / mean | rewards / correctness_reward_func / std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.0 | -0.1155 | 0.297184 | 257.166687 | 189.0 | 317.0 | 0.0 | 257.166687 | 189.0 | 317.0 | 0.0 | 0 | -0.1155 | 0.297184 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | -0.802667 | 0.536963 | 451.0 | 333.0 | 661.0 | 0.0 | 451.0 | 333.0 | 661.0 | 0.0 | No Log | -0.802667 | 0.536962 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.0 | -0.477833 | 0.090711 | 279.0 | 241.0 | 316.0 | 0.0 | 279.0 | 241.0 | 316.0 | 1.8e-05 | No Log | -0.477833 | 0.090711 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | -0.533167 | 0.437344 | 400.166687 | 252.0 | 697.0 | 0.0 | 400.166687 | 252.0 | 697.0 | 7.7e-05 | No Log | -0.533167 | 0.437344 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | -0.018667 | 0.158745 | 145.333344 | 119.0 | 186.0 | 0.0 | 145.333344 | 119.0 | 186.0 | 0.000283 | No Log | -0.018667 | 0.158745 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-1)... Done. 10.5s


-------------------- Question:
Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer? 
Answer:
1500 
Response:
<reasoning>
1. First, we need to find out the monthly payments for each loan.
2. To do this, we can use the formula for calculating the monthly mortgage payment, which is: P = rPrt / (1 - (1 + r)^-n) where P is the principal, r is the annual interest rate, t is the number of years, and n is the number of payments.
3. Since we only have the cost of the houses and trailers but not the interest rates or term lengths, we're going to assume a reasonable interest rate for both houses and trailers (around 4% per year, or 0.04).
4. Calculate the monthly payments for the house: P_house = (480,000 * 0.04 * 12) / (1 - (1 + 0.04)^(-20*12))
5. Calculate the monthly payments for the trailer: P_tra

[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-2)... Done. 8.2s


-------------------- Question:
Janet pays $40/hour for 3 hours per week of clarinet lessons and $28/hour for 5 hours a week of piano lessons. How much more does she spend on piano lessons than clarinet lessons in a year? 
Answer:
1040 
Response:
<reasoning>
To find the solution, we first need to calculate the weekly cost for each lesson type, then multiply by the number of weeks in a year, and finally, subtract the weekly clarinet cost from the weekly piano cost.

First, let's calculate the weekly cost for each lesson type:
- Clarinet: $40/hour * 3 hours = $120/week
- Piano: $28/hour * 5 hours = $140/week

Next, let's calculate the yearly cost for each lesson type:
- Clarinet: $120/week * 52 weeks (approximately a year) = $6,240/year
- Piano: $140/week * 52 weeks = $7,360/year

Now, to find the difference in cost between piano and clarinet lessons per year, subtract the weekly cost for clarinet lessons from the weekly cost for piano lessons:

Cost difference = $7,360/year - $6,240/year

[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-3)... Done. 25.2s


-------------------- Question:
Sabrina is collecting herbs to make a poultice for her grandmother. She needs twice as many basil leaves as sage leaves and 5 fewer sage leaves than verbena leaves. If she needs 12 basil leaves, how many leaves total does she need? 
Answer:
29 
Response:
<reasoning>
First, let's establish the relationships between the three types of herbs. Sabrina needs twice as many basil leaves as sage leaves, and she needs 5 fewer sage leaves than verbena leaves.

Let's denote the number of verbena leaves as V, the number of sage leaves as S, and the number of basil leaves as B.

From the given information, we have:
1) B = 2S
2) S = V - 5

We know that Sabrina needs 12 basil leaves, so we can substitute B with the expression from the first equation to find S:

12 = 2S => S = 6

Next, we can substitute S into the second equation to find V:

6 = V - 5 => V = 11

</reasoning>

<answer>
Sabrina needs a total of 11 + 6 + 12 = 29 leaves. These are verbena, sage, and basil le

[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-4)... Done. 27.4s


-------------------- Question:
Over the past five years, on July 4th, the high temperature for Washington, DC has been: 90 degrees in 2020, 90 degrees in 2019, 90 degrees in 2018, 79 degrees in 2017 and 71 degrees in 2016. What is the average temperature for July 4th in Washington, DC over the past 5 years? 
Answer:
84 
Response:
<reasoning>
To find the average temperature for July 4th in Washington, DC over the past 5 years, we need to calculate the sum of the temperatures and then divide that sum by the number of years.
</reasoning>

<answer>
The average temperature for July 4th in Washington, DC over the past 5 years is (90+90+90+79+71)/5 = 86.4 degrees Fahrenheit. It's approximately 86.4 degrees Fahrenheit. 
Extracted:
The average temperature for July 4th in Washington, DC over the past 5 years is (90+90+90+79+71)/5 = 86.4 degrees Fahrenheit. It's approximately 86.4 degrees Fahrenheit.


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-5)... Done. 29.0s


TrainOutput(global_step=5, training_loss=3.379960276106431e-08, metrics={'train_runtime': 706.7747, 'train_samples_per_second': 0.042, 'train_steps_per_second': 0.007, 'total_flos': 0.0, 'train_loss': 3.379960276106431e-08})
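
In the metrics above, the `soft_format`, `strict_format`, `int`, and `correctness` rewards are 0.0 at every step. Two properties of the reward functions likely contribute: the format patterns use `.` without `re.DOTALL`, so they cannot span a newline, and the answer is compared as an exact string. The sketch below illustrates both on a hypothetical completion modeled on the Sabrina example (with a closing `</answer>` tag added). It is an illustration, not a diagnosis of this particular run.

```python
import re

# Pattern copied from soft_format_reward_func.
SOFT_PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

def extract_xml_answer(text: str) -> str:
    # Copied from the data prep cell.
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

# Hypothetical completion, written only for this illustration.
completion = (
    "<reasoning>\nAdd up the three kinds of leaves.\n</reasoning>\n\n"
    "<answer>\nSabrina needs a total of 11 + 6 + 12 = 29 leaves.\n</answer>\n"
)

# Without DOTALL, `.` stops at the newline right after <reasoning>.
default_match = re.match(SOFT_PATTERN, completion)
# With DOTALL, the same pattern matches the multi-line completion.
dotall_match = re.match(SOFT_PATTERN, completion, flags=re.DOTALL)

extracted = extract_xml_answer(completion)
print(default_match is None, dotall_match is not None)
print(extracted == "29", extracted.isdigit())
```

If multi-line reasoning should earn the format reward, passing `flags=re.DOTALL` to `re.match` is one option. The answer rewards only fire when the `<answer>` block contains the bare number and nothing else.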

## **Inference**

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without the GRPO-trained LoRA:

In [23]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

' To calculate an approximation of Pi (π) using a method like the Leibniz Formula, I can provide you a simple Python script. However, for practical purposes, this method is not efficient as it requires a large number of calculations to converge to a good approximation.\n\nFor more accurate results, you can use libraries like numpy or mpmath in Python, or use a calculator with a built-in Pi function. Here\'s the Leibniz Formula implementation in Python:\n\n```python\nimport math\n\ndef leibniz_formula_for_pi(n: int) -> float:\n    sum = 4\n    for i in range(n):\n        sum += math.pow(-1, i) / (2 * i + 1)\n    return sum * 4\n\nif __name__ == "__main__":\n    n = 1000000\n    approximation = leibniz_formula_for_pi(n)\n    print(f"Pi approximation (Leibniz Formula, n={n}): {approximation}")\n```\n\nIn this script, increase the value of `n` for a better approximation. However, it will take more time as the number of iterations increases.\n\nFor practical purposes, you can use numpy or m

## **Saving and Loading Finetuned Models**

In [24]:
from google.colab import userdata
from huggingface_hub import login
hf_token = userdata.get('HF_TOKEN')
if hf_token:
   login(hf_token)
   print("Successfully logged in to Hugging Face!")
else:
   print("Token is not set. Please save the token first.")

Successfully logged in to Hugging Face!


And now with the LoRA we just trained with GRPO. We save the LoRA first!

In [25]:
model.save_lora("grpo_saved_lora")

In [26]:
# model.save_pretrained("Fine_Tune_GPRO_Mistral_v0.3_7B") # Local saving
# tokenizer.save_pretrained("Fine_Tune_GPRO_Mistral_v0.3_7B") # Local saving
# First create the model repo on Hugging Face,
# then copy the repo name and paste it below
# before running this cell.
# Pushing to Hugging Face
# These are just the LoRA adapters
model.push_to_hub("DannyAI/Fine_Tune_GPRO_Mistral_v0.3_7B",token=hf_token)
tokenizer.push_to_hub("DannyAI/Fine_Tune_GPRO_Mistral_v0.3_7B",token=hf_token)

README.md:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...p8y2ovucb/adapter_model.safetensors:   0%|          |  559kB /  336MB            

Saved model to https://huggingface.co/DannyAI/Fine_Tune_GPRO_Mistral_v0.3_7B


Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  /tmp/tmpfh916mch/tokenizer.model      : 100%|##########|  587kB /  587kB            

No files have been modified since last commit. Skipping to prevent empty commit.


Now we load the LoRA and test:

In [27]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

In [28]:
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)


output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    # lora_request = model.load_lora("DannyAI/Fine_Tune_GPRO_Mistral_v0.3_7B"),
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

" <reasoning>\nTo calculate an approximate value for pi, we can use the Leibniz Formula for pi, which states that pi is equal to the infinite sum of the series (4/(n^2) - 1/(n^4)). However, since this is an infinite series, we can only calculate an approximation by summing up to a large but finite number of terms.\n\nFor the purpose of this example, let's sum up to 100,000 terms.\n</reasoning>\n\n<answer>\nThe approximate value for pi obtained by summing up to 100,000 terms in the Leibniz Formula is approximately 3.141592653589793.\n\n<reasoning>\nThis calculation can be performed using the following code in Python:\n\n```python\ndef leibniz_formula_for_pi(n_terms):\n    terms = 4\n    total = 0\n    for i in range(1, n_terms + 1):\n        if i % 2 == 0:\n            total += terms / (i * i)\n        else:\n            total -= terms / (i * i * i * i)\n        terms *= -1\n    return total * 4\n\nprint(leibniz_format_for_pi(100000))\n```\n</reasoning>\n\n<answer>\nRunning the provided

In [35]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "How do I build a rocket ship?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)


output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    # lora_request = model.load_lora("DannyAI/Fine_Tune_GPRO_Mistral_v0.3_7B"),
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

" <reasoning>\nBuilding a rocket ship involves a complex process that requires a deep understanding of physics, engineering, and materials science. It involves several key steps:\n\n1. Defining the mission: Determine the purpose of the rocket ship, such as space exploration, satellite launch, or human space travel.\n\n2. Designing the rocket: Design the rocket based on the mission requirements. This includes determining the size, shape, and weight of the rocket, the type of fuel, and the propulsion system.\n\n3. Building the rocket: Construct the rocket according to the design. This involves assembling the various components, including the fuel tanks, engines, guidance system, and payload.\n\n4. Testing the rocket: Test the rocket to ensure it works as intended. This includes testing the engines, guidance system, and structural integrity.\n\n5. Launching the rocket: Launch the rocket from a launch site. This requires careful planning and coordination to ensure the rocket is launched sa

In [36]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "List all the metals in Africa?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)


output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    # lora_request = model.load_lora("DannyAI/Fine_Tune_GPRO_Mistral_v0.3_7B"),
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

" <reasoning>\nAfrica, being the second-largest continent in the world, is rich in natural resources, including various types of metals. These metals are often mined in significant quantities due to the continent's abundant mineral reserves.\n</reasoning>\n\n<answer>\n1. Gold - Africa is the world's largest producer of gold, with countries like South Africa, Ghana, Sudan, and Mali having substantial gold reserves.\n2. Diamond - Africa is the source of more than half of the world's diamonds, with countries like Botswana, Angola, and the Democratic Republic of the Congo being key producers.\n3. Platinum - South Africa is the leading producer of platinum, followed by Russia and the United States, but South Africa's reserves are particularly significant.\n4. Iron ore - The African countries with the largest iron ore reserves are Libya, Algeria, Mauritania, and South Africa.\n5. Bauxite - Guinea, Ghana, and Jamaica are the world's top bauxite producers, with significant reserves in Africa.\

In [29]:
# Figure out how to load HF Reasoning model
# from vllm import SamplingParams
# sampling_params = SamplingParams(
#     temperature = 0.8,
#     top_p = 0.95,
#     max_tokens = 1024,
# )


# output = model.fast_generate(
#     text,
#     sampling_params = sampling_params,
#     lora_request = model.load_adapter("DannyAI/Fine_Tune_GPRO_Mistral_v0.3_7B"),
#     # lora_request = model.load_lora("grpo_saved_lora"),
# )[0].outputs[0].text

# output

Our reasoning model now follows the requested format, but it's not always correct, since this run trained for only 5 steps (about 12 minutes). It should get better if we extend the sequence length and train for longer!
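
As one concrete case of an incorrect answer, the LoRA response to "Calculate pi." above misstates the Leibniz formula. For reference, the series is π = 4 · Σ (−1)^k / (2k + 1) for k = 0, 1, 2, and so on. A minimal sketch for checking the model's claim (the function name is ours, not part of the notebook):

```python
import math

def leibniz_pi(n_terms: int) -> float:
    """Approximate pi with the first n_terms of the Leibniz series."""
    total = 0.0
    for k in range(n_terms):
        total += (-1) ** k / (2 * k + 1)
    return 4 * total

approx = leibniz_pi(100_000)
error = abs(approx - math.pi)
print(approx, error)
```

The series converges slowly, with an error of roughly 1/n after n terms, so 100,000 terms cannot reproduce all the digits quoted in the model's answer.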

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [30]:
# # Merge to 16bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# # Merge to 4bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# # Just LoRA adapters
# if False:
#     model.save_pretrained("model")
#     tokenizer.save_pretrained("model")
# if False:
#     model.push_to_hub("hf/model", token = "")
#     tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [31]:
# # Save to 8bit Q8_0
# if False: model.save_pretrained_gguf("model", tokenizer,)
# if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# # Save to 16bit GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# # Save to q4_k_m GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")