# Math Question Answer Verification Competition

## Starter Code

Borrowed from [official Unsloth implementation](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=MKX_XKs_BNZR)

In [1]:
# %%capture
# This cell will take time
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Collecting unsloth
  Downloading unsloth-2024.11.5-py3-none-any.whl.metadata (59 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/59.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.6/59.6 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting unsloth-zoo>=2024.11.1 (from unsloth)
  Downloading unsloth_zoo-2024.11.4-py3-none-any.whl.metadata (16 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.28.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting triton>=3.0.0 (from unsloth)
  Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.8.14-py3-none-any.whl.metadata (8.4 kB)
Collecting transformers>=4.46.1 (from unsloth)
  Down

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 8bit quantization to reduce memory usage. Can be False.


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.11.5: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.0. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

## Load model and wrap with LoRA adapters

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


## Competition dataset

In [5]:
# download and load competition dataset

from datasets import load_dataset
dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp")
# print and see dataset
dataset

README.md:   0%|          | 0.00/2.09k [00:00<?, ?B/s]

train-00000-of-00002.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

train-00001-of-00002.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/3.65M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 1000000
    })
    test: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 10000
    })
})

In [6]:
prompt = """You are a great mathematician and you are tasked with finding if an answer to a given maths question is correct or not. Yout response should be 'True' if correct, otherwise 'False'. Below is Question and Answer.



### Question:
{}

### Answer:
{}

### Explainaition

### Output:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    question = examples["question"]
    ans       = examples["answer"]
    output      = examples["is_correct"]
    texts = []
    for instruction, input, output in zip(question, ans, output):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }




In [7]:
# Process the training dataset and generate prompt for each datapoint

train_dataset = dataset['train'].map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/1000000 [00:00<?, ? examples/s]

In [8]:
num_examples = len(train_dataset)
print(f"Number of examples in the dataset: {num_examples}")

Number of examples in the dataset: 1000000


In [9]:
#print a sample training example
train_dataset['text'][1]

"You are a great mathematician and you are tasked with finding if an answer to a given maths question is correct or not. Yout response should be 'True' if correct, otherwise 'False'. Below is Question and Answer.\n\n\n\n### Question:\nIf $x + y = 16$ and $x-y = 2$, what is the value of $x^2 - y^2$?\n\n### Answer:\n32\n\n### Explainaition\n\n### Output:\nTrue<|end_of_text|>"

In [10]:
import wandb

In [11]:
wandb.login()

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [12]:
sweep_config = {
    'method': 'random'
    }

In [13]:
metric = {
    'name': 'loss',
    'goal': 'minimize'
    }

sweep_config['metric'] = metric

In [14]:
parameters_dict = {
        "per_device_train_batch_size": {
            "values": [2, 4, 8]
        },
        "gradient_accumulation_steps": {
            "values": [2, 4, 8]
        },
        "max_steps": {
            "values": [10, 20, 50]
        },
        "warmup_steps": {
            "values": [5, 10, 20]
        },
        "num_train_epochs": {
            "value": 1  # Fixed parameter
        },
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3
        },
        "weight_decay": {
            "distribution": "uniform",
            "min": 0.0,
            "max": 0.1
        },
        "lr_scheduler_type": {
            "values": ["linear", "cosine", "cosine_with_restarts"]
        },
        "optim": {
            "values": [
                "adamw_8bit",
                "adamw_bnb_8bit",
                "adamw_torch",
                "adafactor",
                "adamw_hf"
            ]
        }
    }



sweep_config['parameters'] = parameters_dict

In [15]:
import pprint
pprint.pprint(sweep_config)

{'method': 'random',
 'metric': {'goal': 'minimize', 'name': 'loss'},
 'parameters': {'gradient_accumulation_steps': {'values': [2, 4, 8]},
                'learning_rate': {'distribution': 'log_uniform_values',
                                  'max': 0.001,
                                  'min': 1e-05},
                'lr_scheduler_type': {'values': ['linear',
                                                 'cosine',
                                                 'cosine_with_restarts']},
                'max_steps': {'values': [10, 20, 50]},
                'num_train_epochs': {'value': 1},
                'optim': {'values': ['adamw_8bit',
                                     'adamw_bnb_8bit',
                                     'adamw_torch',
                                     'adafactor',
                                     'adamw_hf']},
                'per_device_train_batch_size': {'values': [2, 4, 8]},
                'warmup_steps': {'values': [5, 10, 20]},
       

In [16]:
sweep_id = wandb.sweep(sweep_config, project="kaggle sweeps")

Create sweep with ID: gxnz7ob2
Sweep URL: https://wandb.ai/garima440-new-york-university/kaggle%20sweeps/sweeps/gxnz7ob2


## SFT

In [17]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import os

def train(config=None):
    global model
    # Initialize a new wandb run
    with wandb.init(config=config) as run:
        # If called by wandb.agent, this config will be set by Sweep Controller
        config = wandb.config


        # Create a unique output directory for each run
        output_dir = os.path.join("outputs", run.name)
        os.makedirs(output_dir, exist_ok=True)

        # 1. Enable gradient checkpointing
        model.gradient_checkpointing_enable()

        # 5. Enable distributed training if multiple GPUs are available
        if torch.cuda.device_count() > 1:
            model = torch.nn.DataParallel(model)

        # 2. Configure mixed precision training
        dtype = torch.bfloat16 if is_bfloat16_supported() else torch.float16

        # 3. Optimize memory usage
        torch.cuda.empty_cache()

        training_args = TrainingArguments(
            per_device_train_batch_size=config.per_device_train_batch_size,
            gradient_accumulation_steps=config.gradient_accumulation_steps,
            warmup_steps=config.warmup_steps,
            num_train_epochs=1,
            max_steps=config.max_steps,
            learning_rate=config.learning_rate,
            fp16=not is_bfloat16_supported(),
            bf16=is_bfloat16_supported(),
            logging_steps=1,
            optim=config.optim,
            weight_decay=config.weight_decay,
            lr_scheduler_type=config.lr_scheduler_type,
            seed=3407,
            output_dir=output_dir,
            report_to="wandb",  # Enable WandB reporting
            run_name=run.name,  # Use WandB run name

            # Added optimization parameters
            gradient_checkpointing=True,
            #tf32=True,  # Enable TensorFloat-32 on Ampere GPUs
            dataloader_num_workers=4,  # Parallel data loading
            dataloader_pin_memory=True,  # Pin memory for faster data transfer
            torch_compile=True,  # Enable PyTorch 2.0 compilation

        )




        trainer = SFTTrainer(
            model=model,  # Make sure model is defined or passed as parameter
            tokenizer=tokenizer,  # Make sure tokenizer is defined or passed as parameter
            train_dataset=train_dataset,  # Make sure dataset is defined or passed as parameter
            dataset_text_field="text",
            max_seq_length=max_seq_length,  # Make sure this is defined or passed as parameter
            dataset_num_proc=8,
            packing=True,
            args=training_args,
            data_collator=None,  # Let DataLoader handle collation
        )



        # # Train and get stats
        # trainer_stats = trainer.train()

        # # Log metrics to WandB
        # metrics = {
        #     "loss": trainer_stats.metrics['train_loss']}
        # wandb.log(metrics)

        # return trainer_stats.metrics['train_loss']
        # Train and get stats
        trainer_stats = trainer.train()

        metrics = {
            "loss": trainer_stats.metrics['train_loss'],
        }
        wandb.log(metrics)
        print(f"final_loss: {trainer_stats.metrics['train_loss']}")





In [18]:
wandb.agent(sweep_id, train, count=30)

[34m[1mwandb[0m: Agent Starting Run: dj41h4cj with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 4
[34m[1mwandb[0m: 	learning_rate: 1.330534730680917e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 20
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_hf
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 5
[34m[1mwandb[0m: 	weight_decay: 0.02286321352217939
[34m[1mwandb[0m: Currently logged in as: [33mgarima440[0m ([33mgarima440-new-york-university[0m). Use [1m`wandb login --relogin`[0m to force relogin


Generating train split: 0 examples [00:00, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 4
\        /    Total batch size = 32 | Total steps = 20
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,1.2417
2,1.2108
3,1.2253
4,1.2356
5,1.1915
6,1.1176
7,1.1087
8,1.0393
9,1.06
10,1.0688


final_loss: 1.0756968349218368


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇███
train/global_step,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇████
train/grad_norm,█▇███▇▆▂▃▃▃▃▃▂▂▁▁▁▁▁
train/learning_rate,▂▄▅▇██▇▇▆▆▅▅▄▄▃▃▂▂▁▁
train/loss,█▇██▇▅▅▃▄▄▃▃▂▂▂▁▂▂▂▁

0,1
loss,1.0757
total_flos,5.968083617316864e+16
train/epoch,0.00993
train/global_step,20.0
train/grad_norm,0.72006
train/learning_rate,0.0
train/loss,0.9576
train_loss,1.0757
train_runtime,386.7636
train_samples_per_second,1.655


[34m[1mwandb[0m: Agent Starting Run: ielnn32t with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 0.00038399496045922026
[34m[1mwandb[0m: 	lr_scheduler_type: cosine_with_restarts
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adafactor
[34m[1mwandb[0m: 	per_device_train_batch_size: 2
[34m[1mwandb[0m: 	warmup_steps: 5
[34m[1mwandb[0m: 	weight_decay: 0.02024834413297466


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 2
\        /    Total batch size = 4 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.9941
2,1.0906
3,1.0096
4,1.0302
5,1.029
6,0.8911
7,0.853
8,1.013
9,0.8774
10,0.7378


final_loss: 0.9525839149951935


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▂▁▃▄█▄▂▄▁▂
train/learning_rate,▂▄▅▇█▇▆▃▂▁
train/loss,▆█▆▇▇▄▃▆▄▁

0,1
loss,0.95258
total_flos,3730052260823040.0
train/epoch,0.00062
train/global_step,10.0
train/grad_norm,0.99804
train/learning_rate,0.0
train/loss,0.7378
train_loss,0.95258
train_runtime,30.8811
train_samples_per_second,1.295


[34m[1mwandb[0m: Agent Starting Run: bdwa3pda with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 0.00043815116733797894
[34m[1mwandb[0m: 	lr_scheduler_type: cosine_with_restarts
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_bnb_8bit
[34m[1mwandb[0m: 	per_device_train_batch_size: 4
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.04086312442412868


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 8
\        /    Total batch size = 32 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.715
2,0.7639
3,0.7872
4,0.7912
5,0.7563
6,0.7175
7,0.7622
8,0.7289
9,0.7229
10,0.7468


final_loss: 0.7491904616355896


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▄▃▃▃▂▁█▆▂▁
train/learning_rate,▁▂▃▃▄▅▆▆▇█
train/loss,▁▅██▅▁▅▂▂▄

0,1
loss,0.74919
total_flos,2.984041808658432e+16
train/epoch,0.00497
train/global_step,10.0
train/grad_norm,0.19208
train/learning_rate,0.00044
train/loss,0.7468
train_loss,0.74919
train_runtime,199.32
train_samples_per_second,1.605


[34m[1mwandb[0m: Agent Starting Run: 9sko3e0s with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 4
[34m[1mwandb[0m: 	learning_rate: 0.0008376273691640912
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_torch
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 5
[34m[1mwandb[0m: 	weight_decay: 0.045738146602204446


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 4
\        /    Total batch size = 32 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.6382
2,0.6673
3,0.719
4,0.6957
5,0.6926
6,0.6798
7,0.7181
8,0.6757
9,0.6985
10,0.7242


final_loss: 0.6909057974815369


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▂▂▆▃▃▃▃▃█▁
train/learning_rate,▂▄▅▇█▇▆▃▂▁
train/loss,▁▃█▆▅▄█▄▆█

0,1
loss,0.69091
total_flos,2.984041808658432e+16
train/epoch,0.00497
train/global_step,10.0
train/grad_norm,0.23129
train/learning_rate,0.0
train/loss,0.7242
train_loss,0.69091
train_runtime,189.7634
train_samples_per_second,1.686


[34m[1mwandb[0m: Agent Starting Run: dc8atvef with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 0.00027381951914282824
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 20
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_hf
[34m[1mwandb[0m: 	per_device_train_batch_size: 2
[34m[1mwandb[0m: 	warmup_steps: 5
[34m[1mwandb[0m: 	weight_decay: 0.05976392387959204


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 20
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.5778
2,0.5197
3,0.5021
4,0.5417
5,0.5313
6,0.5061
7,0.5972
8,0.5382
9,0.5419
10,0.5416


final_loss: 0.5758084774017334


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇███
train/global_step,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇████
train/grad_norm,▂▃▂▄▆▂█▁▂▁▁▁▂▁▁▁▁▁▁▁
train/learning_rate,▂▄▅▇██▇▇▆▆▅▅▄▄▃▃▂▂▁▁
train/loss,▄▂▁▂▂▁▄▂▂▂▂▂▅▄▄▄▅▅▇█

0,1
loss,0.57581
total_flos,2.984041808658432e+16
train/epoch,0.00497
train/global_step,20.0
train/grad_norm,0.31317
train/learning_rate,0.0
train/loss,0.7087
train_loss,0.57581
train_runtime,213.7242
train_samples_per_second,1.497


[34m[1mwandb[0m: Agent Starting Run: hmdxln73 with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 1.8217131016844047e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_torch
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 5
[34m[1mwandb[0m: 	weight_decay: 0.06827197463541392


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.4226
2,0.4436
3,0.4521
4,0.539
5,0.6418
6,0.6898
7,0.6551
8,0.6723
9,0.6808
10,0.6654


final_loss: 0.5862295538187027


VBox(children=(Label(value='0.011 MB of 0.011 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▆█▅▂▁▂▃▃▃▃
train/learning_rate,▂▄▅▇█▇▅▄▂▁
train/loss,▁▂▂▄▇█▇██▇

0,1
loss,0.58623
total_flos,5.968083617316864e+16
train/epoch,0.00993
train/global_step,10.0
train/grad_norm,0.23189
train/learning_rate,0.0
train/loss,0.6654
train_loss,0.58623
train_runtime,376.0433
train_samples_per_second,1.702


[34m[1mwandb[0m: Agent Starting Run: 3tk5nuxy with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 1.0318526707943802e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_torch
[34m[1mwandb[0m: 	per_device_train_batch_size: 4
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.05127509879678813


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 2
\        /    Total batch size = 8 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.4542
2,0.4208
3,0.3631
4,0.389
5,0.3653
6,0.3649
7,0.3887
8,0.3698
9,0.3878
10,0.3442


final_loss: 0.3847788155078888


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▂▂▂▁▂▃▆▇▇█
train/learning_rate,▁▂▃▃▄▅▆▆▇█
train/loss,█▆▂▄▂▂▄▃▄▁

0,1
loss,0.38478
total_flos,7460104521646080.0
train/epoch,0.00124
train/global_step,10.0
train/grad_norm,0.5341
train/learning_rate,1e-05
train/loss,0.3442
train_loss,0.38478
train_runtime,52.4005
train_samples_per_second,1.527


[34m[1mwandb[0m: Agent Starting Run: xkitzfzc with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 1.2149841091559192e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine_with_restarts
[34m[1mwandb[0m: 	max_steps: 50
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_bnb_8bit
[34m[1mwandb[0m: 	per_device_train_batch_size: 2
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.04445435214863757


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 2
\        /    Total batch size = 4 | Total steps = 50
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.4537
2,0.4208
3,0.3707
4,0.4337
5,0.3533
6,0.3343
7,0.3621
8,0.3791
9,0.3617
10,0.3266


final_loss: 0.3922510206699371


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/grad_norm,▂▂▁▁▁▁▁▁▁▁▂▃▄▃▃▃▄▄▄▄▆▅▇▇█▆▆▆▆▆█▇▆▅▆▆▄▅▅▇
train/learning_rate,▂▂▃▄▅▆▇▇██████▇▇▇▇▇▆▆▆▅▅▅▄▄▄▃▃▃▂▂▂▂▂▁▁▁▁
train/loss,▆▆▃▃▃▃▂▃▃▂▂▃▄▂▁▃▄▁▄▅▅▆▅▄▅▅▅▄▆▄█▆▇▆▅█▂▆▅█

0,1
loss,0.39225
total_flos,1.86502613041152e+16
train/epoch,0.0031
train/global_step,50.0
train/grad_norm,0.90399
train/learning_rate,0.0
train/loss,0.4843
train_loss,0.39225
train_runtime,136.3571
train_samples_per_second,1.467


[34m[1mwandb[0m: Agent Starting Run: z1kbu98r with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 0.00011993016915374536
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_hf
[34m[1mwandb[0m: 	per_device_train_batch_size: 2
[34m[1mwandb[0m: 	warmup_steps: 20
[34m[1mwandb[0m: 	weight_decay: 0.09583758065477806


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.3826
2,0.3201
3,0.3047
4,0.3062
5,0.2931
6,0.3166
7,0.3834
8,0.3809
9,0.3996
10,0.4361


final_loss: 0.35231606662273407


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▂▁▁▃▃▄▄▆▇█
train/learning_rate,▁▂▃▃▄▅▆▆▇█
train/loss,▅▂▂▂▁▂▅▅▆█

0,1
loss,0.35232
total_flos,1.492020904329216e+16
train/epoch,0.00248
train/global_step,10.0
train/grad_norm,0.71585
train/learning_rate,6e-05
train/loss,0.4361
train_loss,0.35232
train_runtime,108.3746
train_samples_per_second,1.476


[34m[1mwandb[0m: Agent Starting Run: 4zl66fma with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 4
[34m[1mwandb[0m: 	learning_rate: 0.00036234914736673577
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 50
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_bnb_8bit
[34m[1mwandb[0m: 	per_device_train_batch_size: 4
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.0879899113310179


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 50
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.329
2,0.2743
3,0.2545
4,0.2633
5,0.26
6,0.2963
7,0.3558
8,0.3771
9,0.4155
10,0.4536


final_loss: 0.5456074786186218


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇█████
train/grad_norm,▂▂▂▅█▇▆▇█▅▃▃▃▃▂▁▁▂▁▁▂▁▂▁▂▂▂▁▁▂▂▂▂▂▂▁▂▂▂▁
train/learning_rate,▂▂▃▄▅▆▇▇██▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▁▁
train/loss,▂▁▁▁▁▃▃▄▄▄▆▆▆▆▇▇█▇▇█▇▇▆▇▇▇▇▇▆▇▆▆▆▆▆▆▇▇▆▇

0,1
loss,0.54561
total_flos,7.46010452164608e+16
train/epoch,0.01241
train/global_step,50.0
train/grad_norm,0.27991
train/learning_rate,0.0
train/loss,0.607
train_loss,0.54561
train_runtime,495.0793
train_samples_per_second,1.616


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 4m1zl935 with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 1.1484670115520544e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_8bit
[34m[1mwandb[0m: 	per_device_train_batch_size: 2
[34m[1mwandb[0m: 	warmup_steps: 20
[34m[1mwandb[0m: 	weight_decay: 0.09100678200480684


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.3107
2,0.2735
3,0.2686
4,0.2834
5,0.2822
6,0.2908
7,0.295
8,0.3055
9,0.3042
10,0.3149


final_loss: 0.2928774327039719


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▄▃▅▇█▇▄▂▁▁
train/learning_rate,▁▂▃▃▄▅▆▆▇█
train/loss,▇▂▁▃▃▄▅▇▆█

0,1
loss,0.29288
total_flos,1.492020904329216e+16
train/epoch,0.00248
train/global_step,10.0
train/grad_norm,0.50428
train/learning_rate,1e-05
train/loss,0.3149
train_loss,0.29288
train_runtime,107.8413
train_samples_per_second,1.484


[34m[1mwandb[0m: Agent Starting Run: grhxogsl with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 2.3258339236015757e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine_with_restarts
[34m[1mwandb[0m: 	max_steps: 50
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adafactor
[34m[1mwandb[0m: 	per_device_train_batch_size: 2
[34m[1mwandb[0m: 	warmup_steps: 5
[34m[1mwandb[0m: 	weight_decay: 0.0980170676082204


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 50
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.2904
2,0.2542
3,0.2433
4,0.2491
5,0.2392
6,0.2457
7,0.2516
8,0.2622
9,0.2692
10,0.2842


final_loss: 0.42259048104286195


0,1
loss,▁
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇██
train/global_step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/grad_norm,▄▃▆█▇▆▅▄▃▃▂▂▂▂▂▂▂▂▃▂▁▁▁▁▁▃▂▂▂▂▂▂▂▂▂▂▃▃▂▃
train/learning_rate,▂▄▅▇██████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁
train/loss,▂▁▁▁▁▁▂▂▂▂▃▃▃▃▄▅▅▆▆▅▄▆▅▆▅▆▆▆▆▆▅▆▆▆▆▆██▇█

0,1
loss,0.42259
total_flos,7.46010452164608e+16
train/epoch,0.01241
train/global_step,50.0
train/grad_norm,0.38384
train/learning_rate,0.0
train/loss,0.5927
train_loss,0.42259
train_runtime,535.0231
train_samples_per_second,1.495


[34m[1mwandb[0m: Agent Starting Run: u4qic4l8 with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 8.703223787326703e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine_with_restarts
[34m[1mwandb[0m: 	max_steps: 20
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_torch
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.0449686214742016


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 20
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.225
2,0.22
3,0.2474
4,0.3041
5,0.3776
6,0.4547
7,0.4415
8,0.4585
9,0.4808
10,0.4994


final_loss: 0.4619389049708843


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇███
train/global_step,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇████
train/grad_norm,▄▆▃▃▇█▅▂▂▂▁▁▁▁▁▁▁▁▁▁
train/learning_rate,▂▂▃▄▅▅▆▇▇██▇▇▆▅▃▂▂▁▁
train/loss,▁▁▂▃▄▆▅▆▆▆▇████▇▇▇█▇

0,1
loss,0.46194
total_flos,1.1936167234633728e+17
train/epoch,0.01986
train/global_step,20.0
train/grad_norm,0.17969
train/learning_rate,0.0
train/loss,0.5446
train_loss,0.46194
train_runtime,748.8149
train_samples_per_second,1.709


[34m[1mwandb[0m: Agent Starting Run: ygb83urk with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 4
[34m[1mwandb[0m: 	learning_rate: 0.0002033957556622916
[34m[1mwandb[0m: 	lr_scheduler_type: cosine_with_restarts
[34m[1mwandb[0m: 	max_steps: 50
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_hf
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 20
[34m[1mwandb[0m: 	weight_decay: 0.027343931020075665


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 4
\        /    Total batch size = 32 | Total steps = 50
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.2229
2,0.2032
3,0.1972
4,0.1983
5,0.1963
6,0.219
7,0.2429
8,0.2781
9,0.3028
10,0.358


final_loss: 0.4225705301761627


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█████
train/grad_norm,▅▆█▄▂█▅▄▃▃▁▂▁▁▁▂▁▂▂▂▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁
train/learning_rate,▁▂▂▂▃▃▄▄▅▅▆▆▆▇▇█████▇▇▇▇▆▆▅▅▅▄▃▃▃▂▂▂▁▁▁▁
train/loss,▂▁▁▁▁▂▃▃▄▅▅▅▅▆▆▇▆▆▇▇█▇▇██▇▇▇▇▇▇▇▇▇▆▆▇▇▇▆

0,1
loss,0.42257
total_flos,1.492020904329216e+17
train/epoch,0.02483
train/global_step,50.0
train/grad_norm,0.29966
train/learning_rate,0.0
train/loss,0.4439
train_loss,0.42257
train_runtime,937.2436
train_samples_per_second,1.707


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: frtpeq1l with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 1.723571053790328e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adafactor
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.017164825370359984


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.176
2,0.1699
3,0.1871
4,0.2218
5,0.2438
6,0.2777
7,0.2691
8,0.277
9,0.2761
10,0.2974


final_loss: 0.2395815908908844


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▁▂▅█▆▄▂▂▁▁
train/learning_rate,▁▂▃▃▄▅▆▆▇█
train/loss,▁▁▂▄▅▇▆▇▇█

0,1
loss,0.23958
total_flos,5.968083617316864e+16
train/epoch,0.00993
train/global_step,10.0
train/grad_norm,0.25579
train/learning_rate,2e-05
train/loss,0.2974
train_loss,0.23958
train_runtime,376.75
train_samples_per_second,1.699


[34m[1mwandb[0m: Agent Starting Run: lqjczz69 with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 0.00016332108236694935
[34m[1mwandb[0m: 	lr_scheduler_type: cosine_with_restarts
[34m[1mwandb[0m: 	max_steps: 20
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_bnb_8bit
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 20
[34m[1mwandb[0m: 	weight_decay: 0.03831286218833644


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 2
\        /    Total batch size = 16 | Total steps = 20
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.1821
2,0.1735
3,0.1637
4,0.1552
5,0.1436
6,0.1479
7,0.1499
8,0.1659
9,0.1581
10,0.1592


final_loss: 0.17277206629514694


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇███
train/global_step,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇████
train/grad_norm,▄▄▃▂▂▁▁▃▂▂▂▂▃▄▅▇█▆▆▅
train/learning_rate,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
train/loss,▅▄▃▂▁▁▂▃▂▂▂▃▃▄▄▅▅▆██

0,1
loss,0.17277
total_flos,2.984041808658432e+16
train/epoch,0.00497
train/global_step,20.0
train/grad_norm,0.57831
train/learning_rate,0.00016
train/loss,0.2182
train_loss,0.17277
train_runtime,190.0037
train_samples_per_second,1.684


[34m[1mwandb[0m: Agent Starting Run: pxcxkcl6 with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 4.8294627888071555e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_hf
[34m[1mwandb[0m: 	per_device_train_batch_size: 4
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.00644144634627506


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 2
\        /    Total batch size = 8 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.1525
2,0.145
3,0.136
4,0.1427
5,0.1353
6,0.1236
7,0.121
8,0.1077
9,0.0984
10,0.103


final_loss: 0.12651870176196098


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,█▇▇▆▅▄▄▂▁▂
train/learning_rate,▁▂▃▃▄▅▆▆▇█
train/loss,█▇▆▇▆▄▄▂▁▂

0,1
loss,0.12652
total_flos,7460104521646080.0
train/epoch,0.00124
train/global_step,10.0
train/grad_norm,0.44072
train/learning_rate,5e-05
train/loss,0.103
train_loss,0.12652
train_runtime,52.6987
train_samples_per_second,1.518


[34m[1mwandb[0m: Agent Starting Run: j7uj1pnk with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 4
[34m[1mwandb[0m: 	learning_rate: 9.532012610657285e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 50
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adafactor
[34m[1mwandb[0m: 	per_device_train_batch_size: 2
[34m[1mwandb[0m: 	warmup_steps: 20
[34m[1mwandb[0m: 	weight_decay: 0.0231821723159543


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 50
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.1061
2,0.0947
3,0.0897
4,0.097
5,0.0916
6,0.0873
7,0.0886
8,0.0849
9,0.0851
10,0.0968


final_loss: 0.15135971680283547


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/grad_norm,▃▂▁▁▁▁▁▁▂▄▃▃▃▃▆▄▄▃▅▃▅▅▅▄▆▅▅▆▆▅▅█▆▇▇▇▆▇▆▆
train/learning_rate,▁▂▂▃▃▄▄▅▅▅▆▆▇▇▇█████▇▇▇▇▆▆▅▅▅▄▃▃▃▂▂▁▁▁▁▁
train/loss,▂▁▁▁▁▁▁▁▁▂▂▂▂▃▃▃▃▂▂▃▃▃▃▃▃▃▃▄▃▄▅▅▇█▇█▇██▇

0,1
loss,0.15136
total_flos,3.73005226082304e+16
train/epoch,0.00621
train/global_step,50.0
train/grad_norm,0.69044
train/learning_rate,0.0
train/loss,0.2421
train_loss,0.15136
train_runtime,273.1345
train_samples_per_second,1.464


[34m[1mwandb[0m: Agent Starting Run: zo5kofwj with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 0.00042998693609508864
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 50
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_torch
[34m[1mwandb[0m: 	per_device_train_batch_size: 4
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.06313538897952126


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 2
\        /    Total batch size = 8 | Total steps = 50
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.1197
2,0.1079
3,0.0975
4,0.1051
5,0.1047
6,0.1106
7,0.1136
8,0.1212
9,0.1213
10,0.1432


final_loss: 0.25661410719156263


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/grad_norm,▃▂▁▁▆█▆▇▆▅▇█▇█▆▆▇▇▆▆██▇▇▇▇▇▇▇▇▇█▇▇▇▇▆▇▆▆
train/learning_rate,▂▂▃▄▅▆▇▇██████▇▇▇▇▇▆▆▆▅▅▅▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁
train/loss,▂▁▁▁▁▁▂▂▂▂▃▃▃▄▄▄▅▄▄▅▅▅▅▅▅▆▆▆▆▅▇▆▇▇█▇▇██▇

0,1
loss,0.25661
total_flos,3.73005226082304e+16
train/epoch,0.00621
train/global_step,50.0
train/grad_norm,0.68026
train/learning_rate,0.0
train/loss,0.3551
train_loss,0.25661
train_runtime,249.1847
train_samples_per_second,1.605


[34m[1mwandb[0m: Agent Starting Run: 9qwbswdv with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 0.0001445308488106805
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 20
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_torch
[34m[1mwandb[0m: 	per_device_train_batch_size: 2
[34m[1mwandb[0m: 	warmup_steps: 20
[34m[1mwandb[0m: 	weight_decay: 0.02690261905506394


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 2
\        /    Total batch size = 4 | Total steps = 20
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.1794
2,0.1856
3,0.1567
4,0.1895
5,0.1722
6,0.1436
7,0.1597
8,0.1684
9,0.1699
10,0.1398


final_loss: 0.16418003663420677


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇███
train/global_step,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇████
train/grad_norm,▄▃▂▅▄▁▃▄▅▄▅▆▃█▆▃▂▃▁▁
train/learning_rate,▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██▁
train/loss,▆▆▃▇▅▁▃▄▅▁▅▃▁█▆▄▃▃▁▂

0,1
loss,0.16418
total_flos,7460104521646080.0
train/epoch,0.00124
train/global_step,20.0
train/grad_norm,0.78351
train/learning_rate,0.0
train/loss,0.1456
train_loss,0.16418
train_runtime,56.1549
train_samples_per_second,1.425


[34m[1mwandb[0m: Agent Starting Run: 60oinmhk with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 0.0004006657724625577
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_bnb_8bit
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 20
[34m[1mwandb[0m: 	weight_decay: 0.0699642772576533


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.0908
2,0.1325
3,0.1266
4,0.1363
5,0.2115
6,0.3917
7,0.4023
8,0.3839
9,0.3878
10,0.4031


final_loss: 0.26664909198880193


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▁▂▂▁▃█▇▂▃▃
train/learning_rate,▁▂▃▃▄▅▆▆▇█
train/loss,▁▂▂▂▄█████

0,1
loss,0.26665
total_flos,5.968083617316864e+16
train/epoch,0.00993
train/global_step,10.0
train/grad_norm,0.49104
train/learning_rate,0.0002
train/loss,0.4031
train_loss,0.26665
train_runtime,375.931
train_samples_per_second,1.702


[34m[1mwandb[0m: Agent Starting Run: q5vgbx9a with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 4.526834212333521e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 50
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_hf
[34m[1mwandb[0m: 	per_device_train_batch_size: 2
[34m[1mwandb[0m: 	warmup_steps: 5
[34m[1mwandb[0m: 	weight_decay: 0.06326601148942791


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 50
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.1217
2,0.1266
3,0.1291
4,0.1237
5,0.113
6,0.1065
7,0.1023
8,0.0958
9,0.092
10,0.0856


final_loss: 0.25012500286102296


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇█████
train/grad_norm,▄▅▅▅▃▁▁▁▂▂▁▁▁▂▂▄▄▆▆▆▆▅▆█▇▆██▇▆▇▅▆▆▆▅▆▆▅▆
train/learning_rate,▂▄▅▇██████▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁
train/loss,▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▄▄▄▄▅▄▅▅▅▅▅▆▅▆▇▆▆▇▇▇███

0,1
loss,0.25013
total_flos,7.46010452164608e+16
train/epoch,0.01241
train/global_step,50.0
train/grad_norm,0.56179
train/learning_rate,0.0
train/loss,0.473
train_loss,0.25013
train_runtime,529.9302
train_samples_per_second,1.51


[34m[1mwandb[0m: Agent Starting Run: 7d5f3sza with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 0.0004099387233019968
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adafactor
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 20
[34m[1mwandb[0m: 	weight_decay: 0.0971836417499114


VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.011112664022220997, max=1.0…

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.071
2,0.0794
3,0.0773
4,0.0852
5,0.1069
6,0.1877
7,0.2752
8,0.2924
9,0.3548
10,0.3969


final_loss: 0.19265197217464447


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▁▁▁▁▁▂█▄▆▅
train/learning_rate,▁▂▃▃▄▅▆▆▇█
train/loss,▁▁▁▁▂▄▅▆▇█

0,1
loss,0.19265
total_flos,5.968083617316864e+16
train/epoch,0.00993
train/global_step,10.0
train/grad_norm,0.78627
train/learning_rate,0.0002
train/loss,0.3969
train_loss,0.19265
train_runtime,376.8085
train_samples_per_second,1.698


[34m[1mwandb[0m: Agent Starting Run: rnbhs6ta with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 4
[34m[1mwandb[0m: 	learning_rate: 0.0003904274503108684
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adafactor
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.01022948841306618


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 4
\        /    Total batch size = 32 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.1132
2,0.1206
3,0.0929
4,0.0845
5,0.0751
6,0.0863
7,0.1343
8,0.1564
9,0.2546
10,0.279


final_loss: 0.13970204070210457


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▃▄▂▁▁▂▃▃█▅
train/learning_rate,▁▂▃▃▄▅▆▆▇█
train/loss,▂▃▂▁▁▁▃▄▇█

0,1
loss,0.1397
total_flos,2.984041808658432e+16
train/epoch,0.00497
train/global_step,10.0
train/grad_norm,0.89332
train/learning_rate,0.00039
train/loss,0.279
train_loss,0.1397
train_runtime,190.4286
train_samples_per_second,1.68


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 0giyfzhc with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 7.022946526949357e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 50
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_bnb_8bit
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.09135204780662272


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 2
\        /    Total batch size = 16 | Total steps = 50
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.2671
2,0.2656
3,0.225
4,0.2069
5,0.176
6,0.1515
7,0.1244
8,0.1102
9,0.0941
10,0.0865


final_loss: 0.23494885638356208


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█████
train/grad_norm,██▇▆▅▂▂▁▁▁▂▂▂▂▂▁▁▃▃▃▃▃▃▄▄▃▃▃▃▃▂▂▃▃▃▂▃▂▂▂
train/learning_rate,▂▂▃▄▅▆▇▇██▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▁
train/loss,▄▄▄▃▃▂▁▁▁▁▁▁▁▁▂▁▁▃▃▃▄▃▄▄▄▄▅▄▄▅▄▅▆▇▆▇██▇█

0,1
loss,0.23495
total_flos,7.46010452164608e+16
train/epoch,0.01241
train/global_step,50.0
train/grad_norm,0.52362
train/learning_rate,0.0
train/loss,0.4606
train_loss,0.23495
train_runtime,471.3217
train_samples_per_second,1.697


[34m[1mwandb[0m: Agent Starting Run: 0kvlel1l with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 8
[34m[1mwandb[0m: 	learning_rate: 0.00023119394428836517
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_8bit
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.013700242534897922


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.0765
2,0.0754
3,0.0697
4,0.0733
5,0.0816
6,0.1489
7,0.1812
8,0.1978
9,0.2503
10,0.2865


final_loss: 0.14411612823605538


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,▁▁▁▁▂▅▅██▅
train/learning_rate,▁▂▃▃▄▅▆▆▇█
train/loss,▁▁▁▁▁▄▅▅▇█

0,1
loss,0.14412
total_flos,5.968083617316864e+16
train/epoch,0.00993
train/global_step,10.0
train/grad_norm,0.44636
train/learning_rate,0.00023
train/loss,0.2865
train_loss,0.14412
train_runtime,375.9864
train_samples_per_second,1.702


[34m[1mwandb[0m: Agent Starting Run: g209957n with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 1.4773743246839404e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 10
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adafactor
[34m[1mwandb[0m: 	per_device_train_batch_size: 4
[34m[1mwandb[0m: 	warmup_steps: 5
[34m[1mwandb[0m: 	weight_decay: 0.06760216791901776


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 2
\        /    Total batch size = 8 | Total steps = 10
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.0758
2,0.0735
3,0.0726
4,0.0779
5,0.0726
6,0.0729
7,0.0685
8,0.0676
9,0.0618
10,0.0696


final_loss: 0.07127370163798333


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇███
train/grad_norm,████▇▅▄▇▁▁
train/learning_rate,▂▄▅▇█▇▆▃▂▁
train/loss,▇▆▆█▆▆▄▄▁▄

0,1
loss,0.07127
total_flos,7460104521646080.0
train/epoch,0.00124
train/global_step,10.0
train/grad_norm,0.16504
train/learning_rate,0.0
train/loss,0.0696
train_loss,0.07127
train_runtime,53.1104
train_samples_per_second,1.506


[34m[1mwandb[0m: Agent Starting Run: 9q9soe22 with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 4
[34m[1mwandb[0m: 	learning_rate: 0.00064349148013034
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_steps: 50
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adafactor
[34m[1mwandb[0m: 	per_device_train_batch_size: 4
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.08541816225417657


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 50
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.069
2,0.0661
3,0.0628
4,0.0716
5,0.0807
6,0.0979
7,0.6096
8,0.4127
9,0.8227
10,0.2016


final_loss: 0.4226828666031361


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█████
train/grad_norm,▁▁▁▁▁█▄▂▂▂▂▄▂▂▂▂▁▂▁▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/learning_rate,▂▂▃▄▅▆▇▇██▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁
train/loss,▁▁▁▁▁▅▄▇▂▃▅▅██▅▅▅▆▄▅▅▆▃▃▃▄▃▄▃▃▃▃▃▄▃▄▄▄▄▄

0,1
loss,0.42268
total_flos,7.46010452164608e+16
train/epoch,0.01241
train/global_step,50.0
train/grad_norm,0.39138
train/learning_rate,0.0
train/loss,0.4415
train_loss,0.42268
train_runtime,500.7395
train_samples_per_second,1.598


[34m[1mwandb[0m: Agent Starting Run: zni4no3q with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 0.0002700005715645376
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_steps: 20
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_bnb_8bit
[34m[1mwandb[0m: 	per_device_train_batch_size: 2
[34m[1mwandb[0m: 	warmup_steps: 5
[34m[1mwandb[0m: 	weight_decay: 0.07938601062119263


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 2
\        /    Total batch size = 4 | Total steps = 20
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.1242
2,0.1196
3,0.1045
4,0.1183
5,0.1044
6,0.1059
7,0.1161
8,0.1226
9,0.1274
10,0.1076


final_loss: 0.11962933577597142


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇███
train/global_step,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇████
train/grad_norm,▁▂▁▄▄▅▆▇█▄█▆▅▆▇▆▆▅▅▆
train/learning_rate,▂▄▅▇███▇▇▆▆▅▄▃▃▂▂▁▁▁
train/loss,▅▄▁▄▁▁▄▅▆▂▅▇▄█▇▆▃▅▄▅

0,1
loss,0.11963
total_flos,7460104521646080.0
train/epoch,0.00124
train/global_step,20.0
train/grad_norm,0.88359
train/learning_rate,0.0
train/loss,0.1228
train_loss,0.11963
train_runtime,56.2416
train_samples_per_second,1.422


[34m[1mwandb[0m: Agent Starting Run: a0vta9ln with config:
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 0.00040949928653703663
[34m[1mwandb[0m: 	lr_scheduler_type: cosine_with_restarts
[34m[1mwandb[0m: 	max_steps: 50
[34m[1mwandb[0m: 	num_train_epochs: 1
[34m[1mwandb[0m: 	optim: adamw_torch
[34m[1mwandb[0m: 	per_device_train_batch_size: 8
[34m[1mwandb[0m: 	warmup_steps: 10
[34m[1mwandb[0m: 	weight_decay: 0.09358168979801576


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,445 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 2
\        /    Total batch size = 16 | Total steps = 50
 "-____-"     Number of trainable parameters = 83,886,080


Unsloth: Enabled auto compiling




Step,Training Loss
1,0.1267
2,0.1034
3,0.0704
4,0.0707
5,0.0989
6,0.1191
7,0.1258
8,0.1165
9,0.1296
10,0.1124


final_loss: 0.22409524410963058


VBox(children=(Label(value='0.024 MB of 0.024 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
loss,▁
train/epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█████
train/grad_norm,▆▄▂▁▃▅▄▄▄▄▆▆▇▆▅▇▇▇▆█▇█▆▇██▆▇▆▆▆▅▅▆▆▆▆▆▆▇
train/learning_rate,▂▂▃▄▅▇▇██████▇▇▇▇▇▆▆▆▅▅▅▅▄▄▃▃▃▂▂▂▂▂▁▁▁▁▁
train/loss,▂▂▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▃▄▄▅▄▄▄▄▄▄▄▄▄▄▄▅▆▅▆▇▇▆█

0,1
loss,0.2241
total_flos,7.46010452164608e+16
train/epoch,0.01241
train/global_step,50.0
train/grad_norm,0.58643
train/learning_rate,0.0
train/loss,0.4635
train_loss,0.2241
train_runtime,470.3479
train_samples_per_second,1.701


In [None]:
# trainer_stats = trainer.train()

## inference

In [None]:
# Sample inference data point
test_dataset = dataset['test']

sample_ques = test_dataset['question'][0]
sample_ans = test_dataset['answer'][0]


In [None]:
# Running inference on single test
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
input_prompt = prompt.format(
        sample_ques, # ques
        sample_ans, # given answer
        "", # output - leave this blank for generation! LLM willl generate is it is True or False
    )

print("Input Promt:\n", input_prompt)
inputs = tokenizer(
[
    input_prompt
], return_tensors = "pt").to("cuda")

input_shape = inputs['input_ids'].shape
input_token_len = input_shape[1] # 1 because of batch
outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
# you can get the whole generated text by uncommenting the below line
# text_generated = tokenizer.batch_decode([outputs, skip_special_tokens=True)

response = tokenizer.batch_decode([outputs[0][input_token_len:]], skip_special_tokens=True)
response

Input Promt:
 You are a great mathematician and you are tasked with finding if an answer to a given maths question is correct or not. Yout response should be 'True' if correct, otherwise 'False'. Below is Question and Answer.



### Question:
The Parker family needs to leave the house by 5 pm for a dinner party. Mrs. Parker was waiting to get into the bathroom at 2:30 pm. Her oldest daughter used the bathroom for 45 minutes and her youngest daughter used the bathroom for another 30 minutes. Then her husband used it for 20 minutes. How much time will Mrs. Parker have to use the bathroom to leave on time?

### Answer:
205

### Explainaition

### Output:



['True']

## saving model

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference


==((====))==  Unsloth 2024.11.5: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 7.5. CUDA Toolkit = 12.4.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
