<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/DeepSeek_UFTF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Environment Setup

In [None]:
# Install necessary modules
!pip install transformers accelerate trl bitsandbytes datasets peft --quiet
!pip install -U bitsandbytes -q


In [None]:
!nvidia-smi

Mon Feb 24 23:16:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P0             50W /  400W |    2447MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
import os

# Note: the `WANDB_DISABLED` environment variable is deprecated and will be removed in Transformers v5.
# Prefer report_to="none" in TrainingArguments (or --report_to none on the command line) to disable logging integrations.

os.environ["WANDB_MODE"] = "offline"

os.environ["WANDB_DISABLED"] = "true"


!pip install transformers accelerate --quiet

from transformers import TrainingArguments
import accelerate

# Initialize the Accelerator
accelerator = accelerate.Accelerator()

## DeepSeek-R1 - Unsloth

https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B

In [None]:
!pip install huggingface_hub --quiet
!pip install unsloth -q
!pip install colab-env --quiet


import warnings

warnings.filterwarnings("ignore", message="You seem to be using the pipelines sequentially on GPU")

import colab_env
import os

access_token_write = os.getenv("HUGGINGFACE_ACCESS_TOKEN_WRITE")

from huggingface_hub import login

login(
  token=access_token_write,
  add_to_git_credential=True
)

In [None]:
import warnings
warnings.filterwarnings("ignore")
model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B"

In [None]:
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = access_token_write,
)
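
As a rough sanity check on why `load_in_4bit = True` matters on the 40 GB A100 shown earlier, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of 8 billion is an approximation, and the helper name `weight_footprint_gib` is hypothetical. The estimate ignores quantization constants, activations, the KV cache, and optimizer state, so real usage will be higher.

```python
def weight_footprint_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / (1024 ** 3)


# Assumption: roughly 8 billion parameters for an "8B" model.
NUM_PARAMS = 8e9

for bits in (32, 16, 4):
    gib = weight_footprint_gib(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
```

Under these assumptions the 4-bit weights need about 3.7 GiB, compared with about 14.9 GiB at 16-bit, which leaves more headroom for activations and LoRA training state.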

In [None]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>{}"""

In [None]:
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"


FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
Okay, so I need to figure out what cystometry would show for this 61-year-old woman. Let me start by breaking down the information given.

First, the patient has a history of involuntary urine loss during activities like coughing or sneezing but no leakage at night. That makes me think of stress urinary incontinence. Stress incontinence usually happens when the urethral muscles aren't strong enough to prevent urine from leaking out when there's increased pressure, like from coughing or sneezing.

She undergoes a gynecological exam and a Q-tip test. I'm not entirely sure about the Q-tip test, but I think it's a common test used to assess urethral function. From what I remember, the Q-tip is a small catheter that's placed in the urethra, and the doctor measures the closure pressure. If the pressure is low, it might indicate that the urethral sphincter isn't functioning well, contributing to incontinence.

Now, considering the findings from these tests, what would cystometry show
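
The cell above prints everything after `### Response:`, so the reasoning and the final answer arrive as one string. A minimal sketch of separating the two follows. It assumes the model closes its reasoning with a `</think>` tag (which the captured output above never reaches), and the helper name `split_reasoning` is hypothetical.

```python
def split_reasoning(decoded: str, marker: str = "### Response:"):
    """Return (reasoning, answer) from a decoded generation.

    Assumes the reasoning is wrapped in <think>...</think>; if the closing
    tag is missing, the whole completion is treated as unfinished reasoning.
    """
    # Keep only the text after the response marker, if the marker is present.
    _, found, completion = decoded.partition(marker)
    if not found:
        completion = decoded

    # Drop the opening tag that the prompt template already supplies.
    completion = completion.replace("<think>", "", 1)

    reasoning, closed, answer = completion.partition("</think>")
    if not closed:
        # Generation stopped before the reasoning block ended.
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


sample = "### Question:\nQ\n\n### Response:\n<think>\nstep one\n</think>\nFinal answer."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With the notebook's variables, this would be called as `split_reasoning(response[0])`. An empty answer signals that `max_new_tokens` was exhausted before the model finished thinking.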

## UFTF DeepSeek-R1

In [None]:
# First, uninstall all the problematic libraries.
!pip uninstall -y torch torchvision torchaudio transformers accelerate datasets peft bitsandbytes trl unsloth

# Check the current CUDA version (if available).
#!nvidia-smi

# Now, install a specific PyTorch version with its compatible dependencies.
!pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

# Upgrade Libraries
!pip install -U transformers==4.36.0 accelerate datasets peft bitsandbytes trl -q
# Install Unsloth.
!pip install "unsloth[hf]" -q

# Check the installed versions
!pip show torch
!pip show transformers

# Check if the CUDA driver is correct:
#!nvidia-smi
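
Pinned versions can be replaced by a later install in the same cell. The training log in this notebook reports Torch 2.6.0+cu124 and Transformers 4.46.3 rather than the pinned 2.0.1 and 4.36.0, presumably because a subsequent install upgraded them. A small sketch for checking what is actually installed, using only the standard library, is shown below. The helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["torch", "transformers", "unsloth"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this after the install cell, and again after restarting the runtime, shows whether the pins survived.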

# UFTF WITH 1 DATASET

In [None]:
from IPython import get_ipython
from IPython.display import display

import os
import torch
import warnings
import gc
from transformers import (
    TrainingArguments,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorWithPadding,
    AutoModelForCausalLM,
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import Trainer
import copy

# Import from Unsloth
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.kernels import cross_entropy_loss

# Import SFTTrainer from TRL
from trl import SFTTrainer
import accelerate
from accelerate import Accelerator

# Set environment variables
os.environ["WANDB_MODE"] = "offline"
os.environ["WANDB_DISABLED"] = "true"

# Initialize the Accelerator
accelerator = Accelerator()

# Suppress warnings
warnings.filterwarnings("ignore")


def clear_memory():
    """Clears GPU memory and performs garbage collection."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


class FineTuningAgent:
    def __init__(self, model_id, dataset_name, config=None):
        """
        Initializes the FineTuningAgent.
        """
        self.model_id = model_id
        self.dataset_name = dataset_name
        if config is None:
            config = {}
        self.config = config
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = None
        self.model = None
        self.trainer = None
        self.training_args = None
        self.peft_config = None
        self.dataset = None
        self.counter = 0
        self.data_collator = None
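        self.max_seq_length = self.config.get("max_seq_length", 2048)
        # Set in _orient(); initialized here so _act() never hits an AttributeError.
        self.dataset_text_field = None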

    def _observe(self):
        """
        Loads the model, tokenizer, and dataset.
        """
        self.counter += 1
        print(f"Starting Observe ...")

        clear_memory()

        quantization_config = None
        # Determine if Unsloth is used.
        is_unsloth_model = self.config.get("use_unsloth", False)

        if self.config.get("quantization") and not is_unsloth_model:
            if "mistral" in self.model_id.lower():
                print("Mistral model detected. Using 4-bit quantization.")
                quantization_config = BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_use_double_quant=True,
                    bnb_4bit_quant_type="nf4",
                    bnb_4bit_compute_dtype=torch.bfloat16,
                )
            else:
                quantization_config = BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_use_double_quant=False,
                    bnb_4bit_quant_type="nf4",
                    bnb_4bit_compute_dtype=torch.float32,
                )
        model_downloaded = False
        max_retries = 3
        retry_count = 0
        while not model_downloaded and retry_count < max_retries:
            try:
                # Determine the correct model class based on architecture
                if "bert" in self.model_id.lower() and not is_unsloth_model:
                    self.model = AutoModelForSequenceClassification.from_pretrained(  # Use correct model type
                        self.model_id,
                        num_labels=2,  # For MRPC, which is binary classification
                        quantization_config=quantization_config,
                        trust_remote_code=True,
                    )
                elif "mistral" in self.model_id.lower() and not is_unsloth_model:
                    self.model = AutoModelForCausalLM.from_pretrained(
                        self.model_id,
                        quantization_config=quantization_config,
                        trust_remote_code=True,
                    )
                # Load Model with Unsloth
                elif is_unsloth_model:
                    print("Loading model using Unsloth...")
                    # Model ID that Unsloth loads; a small default is used when the config omits it.
                    unsloth_model_id = self.config.get(
                        "unsloth_model_id", "deepseek-ai/deepseek-coder-1.3b-base"
                    )
                    max_seq_length = self.config.get("max_seq_length", 2048)  # You can tune this.
                    dtype = self.config.get("dtype", None)  # You can tune this.
                    load_in_4bit = self.config.get("load_in_4bit", True)  # You can tune this.
                    access_token = self.config.get("access_token", None)
                    self.model, self.tokenizer = FastLanguageModel.from_pretrained(
                        model_name=unsloth_model_id,
                        max_seq_length=max_seq_length,
                        dtype=dtype,
                        load_in_4bit=load_in_4bit,
                        token=access_token,
                    )

                else:
                    print(f"Model {self.model_id} not supported.")
                    return

                model_downloaded = True
            except KeyboardInterrupt:
                print(
                    f"Model download interrupted. Retrying... (Attempt {retry_count + 1}/{max_retries})"
                )
                retry_count += 1
                # Clear GPU memory to avoid potential issues
                clear_memory()
                if retry_count == max_retries:
                    print("Max retry reached, skipping model download.")
                    return
            except Exception as e:
                print(f"An error occurred during model download: {e}")
                retry_count += 1
                # Clear GPU memory to avoid potential issues
                clear_memory()

                if retry_count == max_retries:
                    print("Max retry reached, skipping model download.")
                    return
        # Load Tokenizer with HF library if it is not an unsloth model.
        if not is_unsloth_model:
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_id, trust_remote_code=True
            )

        # Add padding token if it does not exist
        if self.tokenizer.pad_token is None:
            self.tokenizer.add_special_tokens({"pad_token": "[PAD]"})
            self.model.resize_token_embeddings(len(self.tokenizer))

        # Move model to device
        self.model.to(self.device)

        # Load Dataset (using dataset name from Hugging Face Hub)
        dataset = load_dataset(self.dataset_name, split="train")
        self.dataset = dataset.shuffle().select(
            range(self.config.get("dataset_size", 125))
        )  # Set a default dataset size of 125

        print("\n")
        print(f"Observe finished.")

    def _orient(self):
        """
        Orients the agent by formatting the dataset and preparing training arguments.
        """
        print("\n")
        self.counter += 1
        print(f"Starting Orient ...")
        if self.dataset_name == "SetFit/mrpc":
            print("Dataset: SetFit/mrpc")
            preprocessing_function = self._preprocess_function_mrpc
            dataset_text_field = None  # No need for dataset_text_field for mrpc
        elif self.dataset_name == "b-mc2/sql-create-context":
            print("Dataset: b-mc2/sql-create-context")
            preprocessing_function = self._preprocess_function_sql_create_context
            dataset_text_field = "text"  # We use the text field
        elif self.dataset_name == "anthropic/hh-rlhf":
            print("Dataset: anthropic/hh-rlhf")
            preprocessing_function = self._preprocess_function_anthropic_hh_rlhf
            dataset_text_field = "text"  # We use the text field
        else:
            print(f"Dataset: {self.dataset_name} not supported.")
            return

        # Set the train/test split.
        test_size_percentage = self.config.get("test_split_percentage", 0.2)  # Set a default test size to 20%
        self.dataset = self.dataset.train_test_split(
            test_size=test_size_percentage
        )

        self.dataset = self.dataset.map(
            preprocessing_function,
            batched=True,
            remove_columns=self.dataset["train"].column_names,
        )
        self.dataset_text_field = dataset_text_field

        print("\n")
        print(f"Orient Dataset: {self.dataset}")

        print("\n")
        print(f"Orient finished.")

    def _decide(self):
        """
        Decides on the fine-tuning strategy, including LoRA configuration.
        """
        self.counter += 1
        print("\n")
        print(f"Starting Decide ...")
        clear_memory()
        # PEFT Configuration (LoRA)
        if self.config.get("lora"):
            self.model = prepare_model_for_kbit_training(self.model)
            if "bert" in self.model_id.lower():
                peft_config = LoraConfig(
                    lora_alpha=16,  # You can tune this.
                    lora_dropout=0.1,  # You can tune this.
                    r=64,  # You can tune this.
                    bias="none",
                    target_modules=["query", "key", "value", "dense"],  # Correct target modules for BERT
                    task_type="SEQ_CLS",  # correct task type
                )
            elif "mistral" in self.model_id.lower():
                peft_config = LoraConfig(
                    lora_alpha=128,
                    lora_dropout=0.05,
                    r=256,
                    bias="none",
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
                    task_type="CAUSAL_LM",
                )
            # If we are using unsloth, we will use this config
            elif self.config.get("use_unsloth", False):
                peft_config = LoraConfig(
                    lora_alpha=16,
                    lora_dropout=0.05,
                    r=64,
                    bias="none",
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
                    task_type="CAUSAL_LM",
                )
                print("\n")
                print(f"LORA: {peft_config}")
            else:
                print(f"Model {self.model_id} not supported.")
                return

            self.peft_config = peft_config
            self.model = get_peft_model(self.model, peft_config)

            self.model.print_trainable_parameters()

        print('\n')
        print(f"Decide finished.")

    def _act(self):
        """
        Acts by preprocessing the dataset and initializing the training loop.
        """
        self.counter += 1
        print("\n")
        print(f"Starting Act ...")
        clear_memory()

        try:
            if "train" not in self.dataset or "test" not in self.dataset:
                print(f"Missing train or test split for {self.dataset_name}")
                return

            print("Dataset preprocessed successfully.")
            print("\n")

            # Create TrainingArguments with the desired parameters
            training_args_config = self.config.get("training_args", {})
            self.training_args = TrainingArguments(
                output_dir=training_args_config.get("output_dir", "./output"),
                per_device_train_batch_size=training_args_config.get(
                    "per_device_train_batch_size", 2
                ),
                gradient_accumulation_steps=training_args_config.get(
                    "gradient_accumulation_steps", 4
                ),
                warmup_steps=training_args_config.get("warmup_steps", 5),
                max_steps=training_args_config.get("max_steps", 60),
                learning_rate=training_args_config.get("learning_rate", 2e-4),
                fp16=training_args_config.get("fp16", not is_bfloat16_supported()),
                bf16=training_args_config.get("bf16", is_bfloat16_supported()),
                logging_steps=training_args_config.get("logging_steps", 10),
                optim=training_args_config.get("optim", "adamw_8bit"),
                weight_decay=training_args_config.get("weight_decay", 0.01),
                lr_scheduler_type=training_args_config.get("lr_scheduler_type", "linear"),
                seed=training_args_config.get("seed", 3407),
                evaluation_strategy=training_args_config.get("evaluation_strategy", "steps"),  # Now we use a strategy
                eval_steps=training_args_config.get("eval_steps", 20),  # Now we use eval steps
                # Disable W&B and other logging integrations unless the config asks for one.
                report_to=training_args_config.get("report_to", "none"),
            )

            # Initialize Trainer
            print("Initializing Trainer...")
            # Use TRL's SFTTrainer for supervised fine-tuning.
            self.trainer = SFTTrainer(
                model=self.model,
                tokenizer=self.tokenizer,
                train_dataset=self.dataset["train"],
                eval_dataset=self.dataset["test"],  # We add the eval dataset
                dataset_text_field=self.dataset_text_field,
                max_seq_length=self.max_seq_length,
                dataset_num_proc=self.config.get("dataset_num_proc", 2),
                args=self.training_args,
            )

        except Exception as e:
            print(f"An error occurred in _act(): {e}")
            raise

        print("\n")
        print(f"Act finished.")

    def run(self):
        """
        Executes the OODA loop and fine-tunes the language model.
        """
        self.counter += 1
        print("\n")
        print(f"Starting Run ...")
        clear_memory()
        self._observe()
        if self.model is None:
            print("Model loading failed, skipping _orient, _decide and _act")
            return
        self._orient()
        self._decide()
        self._act()

        print("\n")
        print(f"Run Dataset: {self.dataset}")
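        # Report the evaluation split size, guarding against a missing dataset.
        if self.dataset and "test" in self.dataset:
            print(f"Test Dataset Size: {len(self.dataset['test'])}")
        else:
            print("No test dataset found or dataset is None.")
        # The trainer is still None if _act() returned before creating it.
        if self.trainer is not None:
            print(f"Eval Batch Size: {self.trainer.args.per_device_eval_batch_size}")
        print("\n")

        if self.trainer is not None: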
            try:
                # Train the model
                self.trainer.train()
                print("\n")
                print("Evaluation:")
                eval_results = self.evaluate()
                print("\n")
                print(eval_results)
                print("\n")
            except Exception as e:
                print(f"An error occurred during training or evaluation: {e}")
                raise
        else:
            print("Trainer is None. Skipping training and evaluation.")

        print(f"Run  finished.")

    def evaluate(self):
        """
        Evaluates the fine-tuned language model.
        """
        try:
            eval_results = self.trainer.evaluate()
            return eval_results
        except Exception as e:
            print(f"An error occurred in evaluate(): {e}")
            raise

    def _preprocess_function_mrpc(self, examples):
        """
        Preprocesses the data for the SetFit/mrpc dataset.
        """
        print("Preprocess Dataset: SetFit/mrpc")
        inputs = self.tokenizer(
            examples["text1"],
            examples["text2"],
            max_length=128,  # Adjust as needed
            truncation=True,
            padding="max_length",
        )
        # Corrected labels for classification models
        inputs["labels"] = examples["label"]
        return inputs

    def _preprocess_function_sql_create_context(self, examples):
        """
        Preprocesses the data for the b-mc2/sql-create-context dataset.
        """
        print("Preprocess Dataset: b-mc2/sql-create-context")
        # Tokenize inputs and labels
        inputs = [f"### Question: {q} ### Context: {c}" for q, c in zip(examples["question"], examples["context"])]
        model_inputs = self.tokenizer(inputs, max_length=1024, truncation=True, padding="max_length")

        # Tokenize labels
        labels_tokenized = self.tokenizer(examples["answer"], max_length=1024, truncation=True, padding="max_length")

        # Assign labels to model_inputs
        model_inputs["labels"] = labels_tokenized["input_ids"]
        # Add 'text' field for the text in the model.
        model_inputs["text"] = inputs

        return model_inputs

    def _preprocess_function_anthropic_hh_rlhf(self, examples):
        """
        Preprocesses the data for the anthropic/hh-rlhf dataset.
        """
        print("Preprocess Dataset: anthropic/hh-rlhf")
        # Use the preferred ("chosen") conversation as the training text.
        inputs = examples["chosen"]

        model_inputs = self.tokenizer(inputs, max_length=1024, truncation=True, padding="max_length")
        # Tokenize labels
        labels_tokenized = self.tokenizer(examples["chosen"], max_length=1024, truncation=True, padding="max_length")
        model_inputs["labels"] = labels_tokenized["input_ids"]
        # Add 'text' field for the text in the model.
        model_inputs["text"] = inputs

        return model_inputs


# Configuration for experiments
RL_PAIRS = [
    # mrpc
    {
        "model_id": "unsloth/DeepSeek-R1-Distill-Llama-8B",
        "dataset_name": "SetFit/mrpc",
        "config": {
            # Model actually loaded when use_unsloth is True; it takes precedence over model_id.
            "unsloth_model_id": "deepseek-ai/deepseek-coder-1.3b-base",
            "dataset_size": 125,
            "test_split_percentage": 0.2,
            "quantization": True,
            "lora": True,
            "use_unsloth": True,
            "training_args": {
                "evaluation_strategy": "steps",  # Evaluate every eval_steps steps
                "eval_steps": 20,
                "num_train_epochs": 1,  # Not read by _act(); max_steps sets the run length
                "max_steps": 60,
            }
        },
    },
]

for pair in RL_PAIRS:
    model_id = pair["model_id"]
    dataset_name = pair["dataset_name"]
    config = pair["config"]

    agent = FineTuningAgent(model_id, dataset_name, config)
    agent.run()



Starting Run ...
Starting Observe ...
Loading model using Unsloth...
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.46.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
deepseek-ai/deepseek-coder-1.3b-base does not have a padding token! Will use pad_token = <pad>.


Repo card metadata block was not found. Setting CardData to empty.




Observe finished.


Starting Orient ...
Dataset: SetFit/mrpc


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Preprocess Dataset: SetFit/mrpc


Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Preprocess Dataset: SetFit/mrpc


Orient Dataset: DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 25
    })
})


Orient finished.


Starting Decide ...


LORA: LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, inference_mode=False, r=64, target_modules={'up_proj', 'o_proj', 'down_proj', 'gate_proj', 'q_proj', 'k_proj', 'v_proj'}, exclude_modules=None, lora_alpha=16, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, eva_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gp

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
max_steps is given, it will override any value given in num_train_epochs


trainable params: 59,965,440 || all params: 1,406,437,376 || trainable%: 4.2636


Decide finished.


Starting Act ...
Dataset preprocessed successfully.


Initializing Trainer...


Act finished.


Run Dataset: DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 25
    })
})
Test Dataset Size: 25
Eval Batch Size: 2




==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 59,965,440


Step  Training Loss  Validation Loss
20    5.9819         6.035682
40    3.3344         5.182404
60    2.8336         5.084407




Evaluation:




{'eval_loss': 5.083302974700928, 'eval_runtime': 1.4732, 'eval_samples_per_second': 16.97, 'eval_steps_per_second': 8.824, 'epoch': 4.8}


Run  finished.
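
Several numbers in the log above can be cross-checked with simple arithmetic. The sketch below recomputes the effective batch size, the number of epochs covered by 60 steps, the trainable-parameter percentage, and the LoRA parameter count. The batch, step, and parameter figures are copied from the log. The architecture values (hidden size 2048, intermediate size 5504, 24 layers, no grouped-query attention) are assumptions about `deepseek-ai/deepseek-coder-1.3b-base` that do not appear in this notebook, and `lora_params` is a hypothetical helper.

```python
# Values copied from the training log.
num_examples = 100
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 60
trainable, total = 59_965_440, 1_406_437_376

effective_batch = per_device_batch * grad_accum * num_gpus
epochs = max_steps * effective_batch / num_examples
trainable_pct = 100 * trainable / total

# Assumed architecture (not stated in the notebook).
hidden, intermediate, layers = 2048, 5504, 24
rank = 64  # r from the LoraConfig printed in the log


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds an (r x d_in) and a (d_out x r) matrix per adapted layer."""
    return r * (d_in + d_out)


per_layer = (
    4 * lora_params(hidden, hidden, rank)            # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(hidden, intermediate, rank)    # gate_proj, up_proj
    + lora_params(intermediate, hidden, rank)        # down_proj
)
lora_total = per_layer * layers

print(f"Effective batch size: {effective_batch}")
print(f"Epochs covered by {max_steps} steps: {epochs}")
print(f"Trainable share: {trainable_pct:.4f}%")
print(f"LoRA parameters under the assumed architecture: {lora_total:,}")
```

The epoch figure matches the `'epoch': 4.8` reported by the evaluation, and the banner's `Num Epochs = 5` is that value rounded up. Under the assumed architecture, the LoRA count also reproduces the 59,965,440 trainable parameters in the log, which is consistent with the 1.3B coder model having been trained rather than the 8B distill named in `model_id`.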


# UFTF WITH 3 DATASETS

In [None]:
!pip uninstall -y torch torchvision torchaudio transformers accelerate datasets peft bitsandbytes trl unsloth

# Check the current CUDA version (if available).
#!nvidia-smi

# Now, install a specific PyTorch version with its compatible dependencies.
!pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

# Upgrade Libraries
!pip install -U transformers==4.36.0 accelerate datasets peft bitsandbytes trl -q
# Install Unsloth.
!pip install "unsloth[hf]" -q

# Check the installed versions
!pip show torch
!pip show transformers

# Check if the CUDA driver is correct:
#!nvidia-smi


In [None]:
from IPython import get_ipython
from IPython.display import display

import os
import torch
import warnings
import gc
from transformers import (
    TrainingArguments,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorWithPadding,
    AutoModelForCausalLM,
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import Trainer
import copy

# Import from Unsloth
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.kernels import cross_entropy_loss

# Import SFTTrainer from TRL
from trl import SFTTrainer
import accelerate
from accelerate import Accelerator

# Set environment variables
os.environ["WANDB_MODE"] = "offline"
os.environ["WANDB_DISABLED"] = "true"

# Initialize the Accelerator
accelerator = Accelerator()

# Suppress warnings
warnings.filterwarnings("ignore")


def clear_memory():
    """Clears GPU memory and performs garbage collection."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


class FineTuningAgent:
    def __init__(self, model_id, dataset_name, config=None):
        """
        Initializes the FineTuningAgent.
        """
        self.model_id = model_id
        self.dataset_name = dataset_name
        if config is None:
            config = {}
        self.config = config
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = None
        self.model = None
        self.trainer = None
        self.training_args = None
        self.peft_config = None
        self.dataset = None
        self.counter = 0
        self.data_collator = None
        self.max_seq_length = self.config.get("max_seq_length", 2048)
        self.dataset_text_field = None # New variable

    def _observe(self):
        """
        Loads the model, tokenizer, and dataset.
        """
        self.counter += 1
        print(f"Starting Observe ...")

        clear_memory()

        quantization_config = None
        # Determine if Unsloth is used.
        is_unsloth_model = self.config.get("use_unsloth", False)

        if self.config.get("quantization") and not is_unsloth_model:
            if "mistral" in self.model_id.lower():
                print("Mistral model detected. Using 4-bit quantization.")
                quantization_config = BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_use_double_quant=True,
                    bnb_4bit_quant_type="nf4",
                    bnb_4bit_compute_dtype=torch.bfloat16,
                )
            else:
                quantization_config = BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_use_double_quant=False,
                    bnb_4bit_quant_type="nf4",
                    bnb_4bit_compute_dtype=torch.float32,
                )
        model_downloaded = False
        max_retries = 3
        retry_count = 0
        while not model_downloaded and retry_count < max_retries:
            try:
                # Determine the correct model class based on architecture
                if "bert" in self.model_id.lower() and not is_unsloth_model:
                    self.model = AutoModelForSequenceClassification.from_pretrained(  # Use correct model type
                        self.model_id,
                        num_labels=2,  # For MRPC, which is binary classification
                        quantization_config=quantization_config,
                        trust_remote_code=True,
                    )
                elif "mistral" in self.model_id.lower() and not is_unsloth_model:
                    self.model = AutoModelForCausalLM.from_pretrained(
                        self.model_id,
                        quantization_config=quantization_config,
                        trust_remote_code=True,
                    )
                # Load Model with Unsloth
                elif is_unsloth_model:
                    print("Loading model using Unsloth...")
                    # Unsloth loads the checkpoint named by config["unsloth_model_id"],
                    # not self.model_id; the default below is used when the key is absent.
                    unsloth_model_id = self.config.get(
                        "unsloth_model_id", "deepseek-ai/deepseek-coder-1.3b-base"
                    )
                    max_seq_length = self.config.get("max_seq_length", 2048)  # You can tune this.
                    dtype = self.config.get("dtype", None)  # You can tune this.
                    load_in_4bit = self.config.get("load_in_4bit", True)  # You can tune this.
                    access_token = self.config.get("access_token", None)
                    self.model, self.tokenizer = FastLanguageModel.from_pretrained(
                        model_name=unsloth_model_id,
                        max_seq_length=max_seq_length,
                        dtype=dtype,
                        load_in_4bit=load_in_4bit,
                        token=access_token,
                    )

                else:
                    print(f"Model {self.model_id} not supported.")
                    return

                model_downloaded = True
            except KeyboardInterrupt:
                print(
                    f"Model download interrupted. Retrying... (Attempt {retry_count + 1}/{max_retries})"
                )
                retry_count += 1
                # Clear GPU memory to avoid potential issues
                clear_memory()
                if retry_count == max_retries:
                    print("Max retry reached, skipping model download.")
                    return
            except Exception as e:
                print(f"An error occurred during model download: {e}")
                retry_count += 1
                # Clear GPU memory to avoid potential issues
                clear_memory()

                if retry_count == max_retries:
                    print("Max retry reached, skipping model download.")
                    return
        # Load Tokenizer with HF library if it is not an unsloth model.
        if not is_unsloth_model:
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_id, trust_remote_code=True
            )

        # Add padding token if it does not exist
        if self.tokenizer.pad_token is None:
            self.tokenizer.add_special_tokens({"pad_token": "[PAD]"})
            self.model.resize_token_embeddings(len(self.tokenizer))

        # Move model to device
        self.model.to(self.device)

        # Load Dataset (using dataset name from Hugging Face Hub)
        dataset = load_dataset(self.dataset_name, split="train")
        self.dataset = dataset.shuffle().select(
            range(self.config.get("dataset_size", 125))
        )  # Set a default dataset size of 125

        print("\n")
        print(f"Observe finished.")

    def _orient(self):
        """
        Orients the agent by splitting the dataset and applying the dataset-specific preprocessing.
        """
        print("\n")
        self.counter += 1
        print(f"Starting Orient ...")
        if self.dataset_name == "SetFit/mrpc":
            print("Dataset: SetFit/mrpc")
            preprocessing_function = self._preprocess_function_mrpc
            self.dataset_text_field = None  # No need for dataset_text_field for mrpc
        elif self.dataset_name == "b-mc2/sql-create-context":
            print("Dataset: b-mc2/sql-create-context")
            preprocessing_function = self._preprocess_function_sql_create_context
            self.dataset_text_field = "text"  # We use the text field
        elif self.dataset_name == "anthropic/hh-rlhf":
            print("Dataset: anthropic/hh-rlhf")
            preprocessing_function = self._preprocess_function_anthropic_hh_rlhf
            self.dataset_text_field = "text"  # We use the text field
        else:
            print(f"Dataset: {self.dataset_name} not supported.")
            return

        # Set the train/test split.
        test_size_percentage = self.config.get("test_split_percentage", 0.2)  # Set a default test size to 20%
        self.dataset = self.dataset.train_test_split(
            test_size=test_size_percentage
        )

        self.dataset = self.dataset.map(
            preprocessing_function,
            batched=True,
            remove_columns=self.dataset["train"].column_names,
        )

        print("\n")
        print(f"Orient Dataset: {self.dataset}")

        print("\n")
        print(f"Orient finished.")

    def _decide(self):
        """
        Decides on the fine-tuning strategy, including LoRA configuration.
        """
        self.counter += 1
        print("\n")
        print(f"Starting Decide ...")
        clear_memory()
        # PEFT Configuration (LoRA)
        if self.config.get("lora"):
            self.model = prepare_model_for_kbit_training(self.model)
            if "bert" in self.model_id.lower():
                peft_config = LoraConfig(
                    lora_alpha=16,  # You can tune this.
                    lora_dropout=0.1,  # You can tune this.
                    r=64,  # You can tune this.
                    bias="none",
                    target_modules=["query", "key", "value", "dense"],  # Correct target modules for BERT
                    task_type="SEQ_CLS",  # correct task type
                )
            elif "mistral" in self.model_id.lower():
                peft_config = LoraConfig(
                    lora_alpha=128,
                    lora_dropout=0.05,
                    r=256,
                    bias="none",
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
                    task_type="CAUSAL_LM",
                )
            # If we are using unsloth, we will use this config
            elif self.config.get("use_unsloth", False):
                peft_config = LoraConfig(
                    lora_alpha=16,
                    lora_dropout=0.05,
                    r=64,
                    bias="none",
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
                    task_type="CAUSAL_LM",
                )
                print("\n")
                print(f"LORA: {peft_config}")
            else:
                print(f"Model {self.model_id} not supported.")
                return

            self.peft_config = peft_config
            self.model = get_peft_model(self.model, peft_config)

            self.model.print_trainable_parameters()

        print('\n')
        print(f"Decide finished.")

    def _act(self):
        """
        Acts by building the training arguments and initializing the trainer.
        """
        self.counter += 1
        print("\n")
        print(f"Starting Act ...")
        clear_memory()

        try:
            if "train" not in self.dataset or "test" not in self.dataset:
                print(f"Missing train or test split for {self.dataset_name}")
                return

            print("Dataset preprocessed successfully.")
            print("\n")

            # Create TrainingArguments from config["training_args"]. Only the keys read
            # below are applied; any other key in that dict is ignored.
            training_args_config = self.config.get("training_args", {})
            self.training_args = TrainingArguments(
                output_dir=training_args_config.get("output_dir", "./output"),
                per_device_train_batch_size=training_args_config.get(
                    "per_device_train_batch_size", 2
                ),
                gradient_accumulation_steps=training_args_config.get(
                    "gradient_accumulation_steps", 4
                ),
                warmup_steps=training_args_config.get("warmup_steps", 5),
                max_steps=training_args_config.get("max_steps", 60),
                learning_rate=training_args_config.get("learning_rate", 2e-4),
                fp16=training_args_config.get("fp16", not is_bfloat16_supported()),
                bf16=training_args_config.get("bf16", is_bfloat16_supported()),
                logging_steps=training_args_config.get("logging_steps", 10),
                optim=training_args_config.get("optim", "adamw_8bit"),
                weight_decay=training_args_config.get("weight_decay", 0.01),
                lr_scheduler_type=training_args_config.get("lr_scheduler_type", "linear"),
                seed=training_args_config.get("seed", 3407),
                evaluation_strategy=training_args_config.get("evaluation_strategy", "steps"),  # Evaluate periodically during training
                eval_steps=training_args_config.get("eval_steps", 20),  # Evaluate every 20 steps by default
                # report_to is intentionally not forwarded from the config.
            )

            # Initialize Trainer
            print("Initializing Trainer...")
            # Use TRL's SFTTrainer for supervised fine-tuning
            self.trainer = SFTTrainer(
                model=self.model,
                tokenizer=self.tokenizer,
                train_dataset=self.dataset["train"],
                eval_dataset=self.dataset["test"],  # Required because evaluation runs during training
                dataset_text_field=self.dataset_text_field,
                max_seq_length=self.max_seq_length,
                dataset_num_proc=self.config.get("dataset_num_proc", 2),
                args=self.training_args,
            )

        except Exception as e:
            print(f"An error occurred in _act(): {e}")
            raise

        print("\n")
        print(f"Act finished.")

    def run(self):
        """
        Executes the OODA loop and fine-tunes the language model.
        """
        self.counter += 1
        print("\n")
        print(f"Starting Run ...")
        clear_memory()
        self._observe()
        if self.model is None:
            print("Model loading failed, skipping _orient, _decide and _act")
            return
        self._orient()
        self._decide()
        self._act()

        print("\n")
        print(f"Run Dataset: {self.dataset}")
        if self.dataset and "test" in self.dataset:
            print(f"Test Dataset Size: {len(self.dataset['test'])}")
        else:
            print("No test dataset found or dataset is None.")
        # Guard the trainer access: _act() may have returned early without creating it.
        if self.trainer is not None:
            print(f"Eval Batch Size: {self.trainer.args.per_device_eval_batch_size}")
        print("\n")

        if self.trainer is not None:
            try:
                # Train the model
                self.trainer.train()
                print("\n")
                print("Evaluation:")
                eval_results = self.evaluate()
                print("\n")
                print(eval_results)
                print("\n")
            except Exception as e:
                print(f"An error occurred during training or evaluation: {e}")
                raise
        else:
            print("Trainer is None. Skipping training and evaluation.")

        print(f"Run  finished.")

    def evaluate(self):
        """
        Evaluates the fine-tuned language model.
        """
        return self.trainer.evaluate()

    def _preprocess_function_mrpc(self, examples):
        """
        Preprocesses the data for the SetFit/mrpc dataset.
        """
        print("Preprocess Dataset: SetFit/mrpc")
        inputs = self.tokenizer(
            examples["text1"],
            examples["text2"],
            max_length=128,  # Adjust as needed
            truncation=True,
            padding="max_length",
        )
        # Corrected labels for classification models
        inputs["labels"] = examples["label"]
        return inputs

    def _preprocess_function_sql_create_context(self, examples):
        """
        Preprocesses the data for the b-mc2/sql-create-context dataset.
        """
        print("Preprocess Dataset: b-mc2/sql-create-context")
        # Tokenize inputs and labels
        inputs = [f"### Question: {q} ### Context: {c}" for q, c in zip(examples["question"], examples["context"])]
        model_inputs = self.tokenizer(inputs, max_length=1024, truncation=True, padding="max_length")

        # Tokenize labels
        labels_tokenized = self.tokenizer(examples["answer"], max_length=1024, truncation=True, padding="max_length")

        # Assign labels to model_inputs
        model_inputs["labels"] = labels_tokenized["input_ids"]
        # Add 'text' field for the text in the model.
        model_inputs["text"] = inputs

        return model_inputs

    def _preprocess_function_anthropic_hh_rlhf(self, examples):
        """
        Preprocesses the data for the anthropic/hh-rlhf dataset.
        """
        print("Preprocess Dataset: anthropic/hh-rlhf")
        # Use the 'chosen' conversation as the training text
        inputs = examples["chosen"]

        model_inputs = self.tokenizer(inputs, max_length=1024, truncation=True, padding="max_length")
        # Tokenize labels
        labels_tokenized = self.tokenizer(examples["chosen"], max_length=1024, truncation=True, padding="max_length")
        model_inputs["labels"] = labels_tokenized["input_ids"]
        # Add 'text' field for the text in the model.
        model_inputs["text"] = inputs

        return model_inputs


# Configuration for experiments
RL_PAIRS = [
    # mrpc
    {
        "model_id": "unsloth/DeepSeek-R1-Distill-Llama-8B",
        "dataset_name": "SetFit/mrpc",
        "config": {
            "unsloth_model_id": "deepseek-ai/deepseek-coder-1.3b-base",  # Checkpoint actually loaded by Unsloth (takes precedence over model_id)
            "dataset_size": 125,
            "test_split_percentage": 0.2,
            "quantization": True,
            "lora": True,
            "use_unsloth": True,
            "max_seq_length": 2048,
            "dtype": None,
            "load_in_4bit": True,
            "access_token": None,  # No token needed
            "dataset_num_proc": 2,
            "training_args": {
                "output_dir": "./unsloth_mrpc_output",
                "per_device_train_batch_size": 2,
                "gradient_accumulation_steps": 4,
                "report_to": None,
                "gradient_checkpointing": True,
                "optim": "adamw_8bit",
                "logging_steps": 10,
                "save_strategy": "epoch",
                "learning_rate": 2e-4,
                "bf16": True,
                "fp16": False,
                "max_grad_norm": 0.3,
                "warmup_steps": 5,
                "lr_scheduler_type": "linear",
                "num_train_epochs": 1,
                "weight_decay": 0.01,
                "max_steps": 60,
                "seed": 3407,
                "evaluation_strategy": "steps",  # Evaluate during training
                "eval_steps": 20,  # Run evaluation every 20 steps
            },
        },
    },
    # sql-create-context
    {
        "model_id": "unsloth/DeepSeek-R1-Distill-Llama-8B",
        "dataset_name": "b-mc2/sql-create-context",
        "config": {
            "unsloth_model_id": "deepseek-ai/deepseek-coder-1.3b-base",  # Checkpoint actually loaded by Unsloth (takes precedence over model_id)
            "dataset_size": 125,
            "test_split_percentage": 0.2,
            "quantization": True,
            "lora": True,
            "use_unsloth": True,
            "max_seq_length": 2048,
            "dtype": None,
            "load_in_4bit": True,
            "access_token": None,  # No token needed
            "dataset_num_proc": 2,
            "training_args": {
                "output_dir": "./unsloth_sql_create_context_output",
                "per_device_train_batch_size": 2,
                "gradient_accumulation_steps": 4,
                "report_to": None,
                "gradient_checkpointing": True,
                "optim": "adamw_8bit",
                "logging_steps": 10,
                "save_strategy": "epoch",
                "learning_rate": 2e-4,
                "bf16": True,
                "fp16": False,
                "max_grad_norm": 0.3,
                "warmup_steps": 5,
                "lr_scheduler_type": "linear",
                "num_train_epochs": 1,
                "weight_decay": 0.01,
                "max_steps": 60,
                "seed": 3407,
                "evaluation_strategy": "steps",  # Evaluate during training
                "eval_steps": 20,  # Run evaluation every 20 steps
            },
        },
    },
    # hh-rlhf
    {
        "model_id": "unsloth/DeepSeek-R1-Distill-Llama-8B",
        "dataset_name": "anthropic/hh-rlhf",
        "config": {
            "unsloth_model_id": "deepseek-ai/deepseek-coder-1.3b-base",  # Checkpoint actually loaded by Unsloth (takes precedence over model_id)
            "dataset_size": 125,
            "test_split_percentage": 0.2,
            "quantization": True,
            "lora": True,
            "use_unsloth": True,
            "max_seq_length": 2048,
            "dtype": None,
            "load_in_4bit": True,
            "access_token": None,  # No token needed
            "dataset_num_proc": 2,
            "training_args": {
                "output_dir": "./unsloth_hh_rlhf_output",
                "per_device_train_batch_size": 2,
                "gradient_accumulation_steps": 4,
                "report_to": None,
                "gradient_checkpointing": True,
                "optim": "adamw_8bit",
                "logging_steps": 10,
                "save_strategy": "epoch",
                "learning_rate": 2e-4,
                "bf16": True,
                "fp16": False,
                "max_grad_norm": 0.3,
                "warmup_steps": 5,
                "lr_scheduler_type": "linear",
                "num_train_epochs": 1,
                "weight_decay": 0.01,
                "max_steps": 60,
                "seed": 3407,
                "evaluation_strategy": "steps",  # Evaluate during training
                "eval_steps": 20,  # Run evaluation every 20 steps
            },
        },
    },
]

# Run the experiments
for rl_pair in RL_PAIRS:
    print("\n")
    print("*" * 50)
    print(
        f"Running experiment with model: {rl_pair['model_id']} and dataset: {rl_pair['dataset_name']}"
    )
    print("*" * 50)
    print("\n")

    agent = FineTuningAgent(
        model_id=rl_pair["model_id"],
        dataset_name=rl_pair["dataset_name"],
        config=rl_pair["config"],
    )
    # Initiate the OODA loop and fine-tuning process
    agent.run()
    print("\n")



**************************************************
Running experiment with model: unsloth/DeepSeek-R1-Distill-Llama-8B and dataset: SetFit/mrpc
**************************************************




Starting Run ...
Starting Observe ...
Loading model using Unsloth...
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.46.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
deepseek-ai/deepseek-coder-1.3b-base does not have a padding token! Will use pad_token = <pad>.


Repo card metadata block was not found. Setting CardData to empty.




Observe finished.


Starting Orient ...
Dataset: SetFit/mrpc


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Preprocess Dataset: SetFit/mrpc


Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Preprocess Dataset: SetFit/mrpc


Orient Dataset: DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 25
    })
})


Orient finished.


Starting Decide ...


LORA: LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, inference_mode=False, r=64, target_modules={'v_proj', 'k_proj', 'q_proj', 'down_proj', 'o_proj', 'gate_proj', 'up_proj'}, exclude_modules=None, lora_alpha=16, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, eva_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gp

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
max_steps is given, it will override any value given in num_train_epochs


trainable params: 59,965,440 || all params: 1,406,437,376 || trainable%: 4.2636


Decide finished.


Starting Act ...
Dataset preprocessed successfully.


Initializing Trainer...


Act finished.


Run Dataset: DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 25
    })
})
Test Dataset Size: 25
Eval Batch Size: 2




==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 59,965,440


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20 | 5.9819 | 6.035682 |
| 40 | 3.3344 | 5.182404 |
| 60 | 2.8336 | 5.084407 |




Evaluation:




{'eval_loss': 5.083302974700928, 'eval_runtime': 1.4623, 'eval_samples_per_second': 17.097, 'eval_steps_per_second': 8.89, 'epoch': 4.8}


Run  finished.




**************************************************
Running experiment with model: unsloth/DeepSeek-R1-Distill-Llama-8B and dataset: b-mc2/sql-create-context
**************************************************




Starting Run ...
Starting Observe ...
Loading model using Unsloth...
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.46.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
deepseek-ai/deepseek-coder-1.3b-base does not have a padding token! Will use pad_token = <pa

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Preprocess Dataset: b-mc2/sql-create-context


Orient Dataset: DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'text'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'text'],
        num_rows: 25
    })
})


Orient finished.


Starting Decide ...


LORA: LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, inference_mode=False, r=64, target_modules={'v_proj', 'k_proj', 'q_proj', 'down_proj', 'o_proj', 'gate_proj', 'up_proj'}, exclude_modules=None, lora_alpha=16, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, eva_config=None, use_dora=False, layer_replication=None, runtime_config=L

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
max_steps is given, it will override any value given in num_train_epochs


trainable params: 59,965,440 || all params: 1,406,437,376 || trainable%: 4.2636


Decide finished.


Starting Act ...
Dataset preprocessed successfully.


Initializing Trainer...


Act finished.


Run Dataset: DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'text'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'text'],
        num_rows: 25
    })
})
Test Dataset Size: 25
Eval Batch Size: 2




==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 59,965,440


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20 | 3.991 | 11.193081 |
| 40 | 2.3474 | 10.053211 |
| 60 | 1.9657 | 9.947631 |




Evaluation:




{'eval_loss': 9.9539155960083, 'eval_runtime': 1.4999, 'eval_samples_per_second': 16.668, 'eval_steps_per_second': 8.667, 'epoch': 4.8}


Run  finished.




**************************************************
Running experiment with model: unsloth/DeepSeek-R1-Distill-Llama-8B and dataset: anthropic/hh-rlhf
**************************************************




Starting Run ...
Starting Observe ...
Loading model using Unsloth...
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.46.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
deepseek-ai/deepseek-coder-1.3b-base does not have a padding token! Will use pad_token = <pad>.


Ob

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Preprocess Dataset: anthropic/hh-rlhf


Orient Dataset: DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'text'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'text'],
        num_rows: 25
    })
})


Orient finished.


Starting Decide ...


LORA: LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, inference_mode=False, r=64, target_modules={'v_proj', 'k_proj', 'q_proj', 'down_proj', 'o_proj', 'gate_proj', 'up_proj'}, exclude_modules=None, lora_alpha=16, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, eva_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRunt

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
max_steps is given, it will override any value given in num_train_epochs


trainable params: 59,965,440 || all params: 1,406,437,376 || trainable%: 4.2636


Decide finished.


Starting Act ...
Dataset preprocessed successfully.


Initializing Trainer...


Act finished.


Run Dataset: DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'text'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'text'],
        num_rows: 25
    })
})
Test Dataset Size: 25
Eval Batch Size: 2




==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 59,965,440


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20 | 5.1383 | 4.697424 |
| 40 | 2.9177 | 3.34326 |
| 60 | 2.6208 | 3.137609 |




Evaluation:




{'eval_loss': 3.1397392749786377, 'eval_runtime': 1.4632, 'eval_samples_per_second': 17.086, 'eval_steps_per_second': 8.885, 'epoch': 4.8}


Run  finished.
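
The run sizes reported in the logs can be cross-checked with a little arithmetic. The sketch below is not part of the original notebook; it only re-derives the split sizes, the total batch size, and the approximate number of passes over the training set from the configured values (`dataset_size`, `test_split_percentage`, `per_device_train_batch_size`, `gradient_accumulation_steps`, `max_steps`). It assumes a single GPU, as the Unsloth banner reports, and it treats an epoch simply as samples seen divided by training-set size, which only approximates how the trainer counts epochs internally.

```python
# Back-of-the-envelope check of the run sizes reported in the logs.
dataset_size = 125
test_split_percentage = 0.2
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1  # assumption, taken from the banner line "Num GPUs = 1"
max_steps = 60

# Split sizes: 20% of 125 rows go to the test split.
test_rows = round(dataset_size * test_split_percentage)
train_rows = dataset_size - test_rows

# Samples consumed per optimizer step.
total_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Approximate passes over the training set after max_steps optimizer steps.
samples_seen = max_steps * total_batch_size
approx_epochs = samples_seen / train_rows

print(f"train rows: {train_rows}, test rows: {test_rows}")
print(f"total batch size: {total_batch_size}")
print(f"samples seen: {samples_seen}, approx. epochs: {approx_epochs:.1f}")
```

Under these assumptions the numbers line up with the logs: 100 training rows and 25 test rows, a total batch size of 8, and roughly 4.8 passes over the training set, which the trainer banner rounds up to `Num Epochs = 5`.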


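
The `trainable%` figure printed by `print_trainable_parameters()` is the ratio of trainable (LoRA) parameters to all parameters. A minimal sketch that reproduces the logged value from the two counts in the output; the variable names are illustrative and not taken from the notebook:

```python
# Parameter counts copied from the "trainable params" line in the logs.
trainable_params = 59_965_440
all_params = 1_406_437_376

# Share of parameters that LoRA actually updates, as a percentage.
trainable_pct = 100 * trainable_params / all_params

print(f"trainable%: {trainable_pct:.4f}")
```

The total of about 1.4 billion parameters suggests that the checkpoint being fine-tuned is the 1.3B model named in `unsloth_model_id`, not the 8B model named in `model_id`.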
