### Installation

In [1]:
%%capture

!pip install pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
!pip install unsloth

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
HF_TOKEN = user_secrets.get_secret("HF_TOKEN")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-7B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    token = HF_TOKEN
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-04-26 09:00:28.086521: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745658028.302561      31 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1745658028.365677      31 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.51.1.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json:   0%|          | 0.00/106k [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/4.72k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/617 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.19 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
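A rough sketch of where the LoRA parameter count comes from. Each adapted linear layer gains two low-rank matrices, `A` (rank x in_features) and `B` (out_features x rank), so it adds `rank * (in_features + out_features)` trainable weights. The layer dimensions below are assumptions taken from the public Qwen2.5-7B config, not values printed in this notebook. Under those assumptions the total matches the "Trainable parameters" figure that the training banner reports later on.

```python
# Dimensions assumed from the public Qwen2.5-7B config (not printed in this notebook).
HIDDEN = 3584          # hidden_size
KV_DIM = 4 * 128       # num_key_value_heads * head_dim
INTERMEDIATE = 18944   # intermediate_size
NUM_LAYERS = 28        # matches "patched 28 layers" above
RANK = 16              # r passed to get_peft_model

# (in_features, out_features) of each projection listed in target_modules
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Trainable weights added by one LoRA adapter (bias = "none")."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```

Doubling `r` doubles this count, which is one way to reason about the memory cost of the suggested ranks 8 to 128.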


<a name="Data"></a>
### Data Prep

In [4]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Appended to every example so the model learns where to stop

def formatting_prompts_func(examples):
    """Render a batch of records into Alpaca-style training strings."""
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    # `input_text` avoids shadowing the built-in `input`
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

from datasets import load_dataset
dataset_path = "/kaggle/input/data-dataset/data_analyst_dataset.json"

dataset = load_dataset("json", data_files=dataset_path)
dataset = dataset["train"].train_test_split(test_size=0.06)
dataset = dataset.map(formatting_prompts_func, batched = True,)
dataset

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/4700 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 4700
    })
    test: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 300
    })
})
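To make the `text` column concrete, here is a small standalone sketch of the same formatting applied to one made-up record. The record contents are hypothetical, and `"<|endoftext|>"` is only a stand-in for `tokenizer.eos_token`; the real value depends on the loaded tokenizer.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS = "<|endoftext|>"  # assumption: stand-in for tokenizer.eos_token

# Hypothetical batch in the column-oriented layout that `dataset.map(..., batched=True)` passes in
batch = {
    "instruction": ["Count the rows in the table."],
    "input": ["table: sales"],
    "output": ["SELECT COUNT(*) FROM sales;"],
}


def format_batch(examples: dict) -> dict:
    """Fill the template for every record and append the end-of-sequence marker."""
    rows = zip(examples["instruction"], examples["input"], examples["output"])
    return {"text": [ALPACA_PROMPT.format(*row) + EOS for row in rows]}


texts = format_batch(batch)["text"]
print(texts[0])
```

The response and the end-of-sequence marker sit at the very end of each string, so the model sees the stop signal immediately after the answer.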

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for one full epoch (`num_train_epochs = 1`). For a quick smoke test, cap the run with `max_steps` instead (for example, uncomment `max_steps = 200`). Unsloth also supports TRL's `DPOTrainer`!

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset['train'],
    eval_dataset = dataset['test'],
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1, 
        #max_steps = 200,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
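        # Relevant mainly together with load_best_model_at_end=True and periodic evaluation
        metric_for_best_model="eval_loss",
        output_dir = "outputs",
        report_to = "none", # Disables logging integrations; set to "wandb" etc. to enable one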
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/4700 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/300 [00:00<?, ? examples/s]
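As a rough illustration of what `warmup_steps = 5` with `lr_scheduler_type = "linear"` does to the learning rate, the sketch below ramps linearly from 0 to the peak during warmup and then decays linearly to 0 at the last step. The function name is hypothetical, and the default `total_steps = 293` is taken from the training banner this notebook prints when training starts.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, total_steps: int = 293) -> float:
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warmup down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 149, 293):
    print(step, linear_lr(step))
```

With only 5 warmup steps out of 293, the schedule spends almost the entire run in the decay phase.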

In [6]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
7.266 GB of memory reserved.


In [7]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 4,700 | Num Epochs = 1 | Total steps = 293
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 40,370,176/7,000,000,000 (0.58% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 5 | 1.2137 |
| 10 | 0.8006 |
| 15 | 0.3464 |
| 20 | 0.1817 |
| 25 | 0.1164 |
| 30 | 0.0841 |
| 35 | 0.0685 |
| 40 | 0.0521 |
| 45 | 0.0495 |
| 50 | 0.043 |
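The "Total steps = 293" figure in the banner can be reproduced with a little arithmetic. This is a minimal sketch, assuming the trainer counts micro-batches by rounding up and then drops any incomplete accumulation group; other `transformers` versions may round differently. The banner reports a per-device batch of 4 although the cell sets `per_device_train_batch_size = 2`; the sketch simply takes the banner's numbers as given.

```python
import math


def optimizer_steps_per_epoch(num_examples: int, per_device_batch: int,
                              grad_accum: int, num_gpus: int = 1) -> int:
    """Optimizer updates per epoch under the rounding assumptions stated above."""
    # Forward/backward passes per epoch, counting a final partial batch
    micro_batches = math.ceil(num_examples / (per_device_batch * num_gpus))
    # One optimizer update per `grad_accum` micro-batches
    return max(micro_batches // grad_accum, 1)


steps = optimizer_steps_per_epoch(num_examples=4700, per_device_batch=4,
                                  grad_accum=4, num_gpus=1)
print(steps)  # 293, matching the banner
```

The loss table above only shows the first 50 of these steps, so the remaining logging rows are not captured in this export.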


In [8]:
import math
# Evaluate the model
eval_results = trainer.evaluate()
eval_loss = eval_results["eval_loss"]
perplexity = math.exp(eval_loss)

print(f"Eval loss: {eval_loss}")
print(f"Perplexity: {perplexity}")


Unsloth: Not an error, but Qwen2ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


Eval loss: 0.03129734843969345
Perplexity: 1.0317922600961498
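For reference, perplexity is the exponential of the mean per-token cross-entropy, so the two printed numbers carry the same information. The short sketch below recomputes it from the reported loss and also shows the inverse view: `exp(-loss)` is the geometric-mean probability the model assigns to the correct next token on the evaluation split. The helper name is hypothetical.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from a mean negative log-likelihood measured in nats."""
    return math.exp(mean_nll)


eval_loss = 0.03129734843969345  # value printed by trainer.evaluate() above
ppl = perplexity(eval_loss)

# Geometric-mean probability of the correct token; always equal to 1 / perplexity
avg_token_prob = math.exp(-eval_loss)

print(f"perplexity = {ppl:.6f}")
print(f"mean correct-token probability = {avg_token_prob:.4f}")
```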


In [9]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

4748.7982 seconds used for training.
79.15 minutes used for training.
Peak reserved memory = 9.926 GB.
Peak reserved memory for training = 2.66 GB.
Peak reserved memory % of max memory = 67.336 %.
Peak reserved memory for training % of max memory = 18.045 %.


In [25]:
model.push_to_hub("ncardian/Qwen2.5-7B-data-coder", token = HF_TOKEN) # Online saving
tokenizer.push_to_hub("ncardian/Qwen2.5-7B-data-coder", token = HF_TOKEN) # Online saving

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/80.8M [00:00<?, ?B/s]

Saved model to https://huggingface.co/ncardian/Qwen2.5-7B-data-coder


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

### Saving to float16 for vLLM

In [15]:
# Merge to 16bit (disabled in this run)
# model.push_to_hub_merged("ncardian/Qwen2.5-7B-data-coder", tokenizer, save_method="merged_16bit", token=HF_TOKEN)

# Merge to 4bit (disabled in this run)
if False:
    model.push_to_hub_merged("ncardian/Qwen2.5-7B-data-coder", tokenizer, save_method="merged_4bit", token=HF_TOKEN)

# Just LoRA adapters (the only upload that runs in this cell)
model.push_to_hub_merged("ncardian/Qwen2.5-7B-data-coder", tokenizer, save_method="lora", token=HF_TOKEN)


Unsloth: Saving LoRA adapters. Please wait...


README.md:   0%|          | 0.00/597 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.66k [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


Saved lora model to https://huggingface.co/ncardian/Qwen2.5-7B-data-coder


### GGUF / llama.cpp Conversion
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [23]:
# Save to 8bit Q8_0 (disabled)
# model.save_pretrained_gguf("model", tokenizer)
# model.push_to_hub_gguf("ncardian/Qwen2.5-7B-data-coder-8bit-GGUF", tokenizer, token = HF_TOKEN)

# Save to 16bit GGUF (disabled)
# model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
# model.push_to_hub_gguf("ncardian/Qwen2.5-7B-data-coder-16bit-GGUF", tokenizer, quantization_method = "f16", token = HF_TOKEN)

# Save to q4_k_m GGUF (the only call that runs in this cell)
# model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("ncardian/Qwen2.5-7B-data-coder-q4_k_m-GGUF", tokenizer, quantization_method = "q4_k_m", token = HF_TOKEN)

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "ncardian/Qwen2.5-7B-data-coder", # Replace with your own Hugging Face repo
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = HF_TOKEN,
    )

Cloning into 'llama.cpp'...
Submodule 'kompute' (https://github.com/nomic-ai/kompute.git) registered for path 'ggml/src/ggml-kompute/kompute'
Cloning into '/kaggle/working/llama.cpp/ggml/src/ggml-kompute/kompute'...
Submodule path 'ggml/src/ggml-kompute/kompute': checked out '4565194ed7c32d1d2efa32ceab4d3c6cae006306'
make: Entering directory '/kaggle/working/llama.cpp'
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAV

100%|██████████| 28/28 [00:20<00:00,  1.35it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving ncardian/Qwen2.5-7B-data-coder-q4_k_m-GGUF/pytorch_model-00001-of-00004.bin...
Unsloth: Saving ncardian/Qwen2.5-7B-data-coder-q4_k_m-GGUF/pytorch_model-00002-of-00004.bin...
Unsloth: Saving ncardian/Qwen2.5-7B-data-coder-q4_k_m-GGUF/pytorch_model-00003-of-00004.bin...
Unsloth: Saving ncardian/Qwen2.5-7B-data-coder-q4_k_m-GGUF/pytorch_model-00004-of-00004.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at ncardian/Qwen2.5-7B-data-coder-q4_k_m-GGUF into f16 GGUF format.
The output location will be /kaggle/working/ncardian/Qwen2.5-7B-data-coder-

RuntimeError: Unsloth: Quantization failed for /kaggle/working/ncardian/Qwen2.5-7B-data-coder-q4_k_m-GGUF/unsloth.Q4_K_M.gguf
You are in a Kaggle environment, which might be the reason this is failing.
Kaggle only provides 20GB of disk space in the working directory.
Merging to 16bit for 7b models use 16GB of space.
This means using `model.{save_pretrained/push_to_hub}_merged` works, but
`model.{save_pretrained/push_to_hub}_gguf will use too much disk space.
You can try saving it to the `/tmp` directory for larger disk space.
I suggest you to save the 16bit model first, then use manual llama.cpp conversion.
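Following the hint in the error message, one option is to check free disk space before merging and to prefer a directory with enough room, such as `/tmp`. The sketch below is only a pre-flight check. The helper names are hypothetical, and the 16 GB threshold is the figure the error message quotes for a merged 16-bit 7B model.

```python
import shutil


def free_gb(path: str) -> float:
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def pick_output_dir(candidates, required_gb: float):
    """Return the first existing directory with enough free space, or None."""
    for path in candidates:
        try:
            if free_gb(path) >= required_gb:
                return path
        except OSError:
            # Directory does not exist or is not accessible; try the next one
            continue
    return None


# Candidate locations: /tmp first, then the Kaggle working directory
out_dir = pick_output_dir(["/tmp", "/kaggle/working"], required_gb=16)
print("No directory has enough space" if out_dir is None else f"Using {out_dir}")
```

If a directory qualifies, a plausible next step (untested here) is `model.save_pretrained_merged(f"{out_dir}/merged_16bit", tokenizer, save_method="merged_16bit")`, followed by a manual llama.cpp conversion of that folder as the error message suggests.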