<a href="https://colab.research.google.com/github/a-agmon/agents-benchmarks/blob/main/trans_Gemma3_(12B).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")  # parentheses keep the "xformers==" prefix on both branches
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

In [None]:
import os
hf_key = os.getenv('HF_TOKEN')

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [None]:
from unsloth import FastModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",

    # Other popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    #model_name = "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    model_name = "google/gemma-3-12b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = False,  # 4 bit quantization to reduce memory
    load_in_8bit = True, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.1: Fast Gemma3 patching. Transformers: 4.56.0.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



We now add LoRA adapters so we only need to update a small number of parameters!

In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 64,           # Larger = higher accuracy, but might overfit (raised from the original 8)
    lora_alpha = 64,  # Recommended alpha == r at least
    lora_dropout = 0.0,
    bias = "none",
    random_state = 3407,
    target_modules = ["q_proj", "v_proj", "k_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj",
                  "embed_tokens", "lm_head"],  # Add embeddings
)



Unsloth: Making `base_model.model.model.vision_tower.vision_model` require gradients
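To make "a small number of parameters" concrete, here is a minimal sketch of how many trainable weights a LoRA adapter adds to a single linear layer. It assumes the standard LoRA formulation (two low-rank matrices of shapes `r x d_in` and `d_out x r`). The helper name `lora_param_count` and the 4096 x 4096 layer size are hypothetical and are not taken from Gemma-3's configuration.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # Matrix A has shape (r, d_in) and matrix B has shape (d_out, r).
    return r * d_in + d_out * r


# Hypothetical 4096 x 4096 projection with the rank used above (r = 64).
full = 4096 * 4096
added = lora_param_count(4096, 4096, r=64)
print(f"full layer: {full:,} weights, LoRA adds: {added:,} ({added / full:.2%})")
```

The adapter grows linearly with `r`, which is why a larger rank buys capacity at the cost of more trainable weights.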


In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [None]:
from datasets import load_dataset, DatasetDict
training_file = "/content/final_trainset.jsonl"

dataset = load_dataset('json', data_files=training_file, split='train')

# Split into train/validation (90/10 split)
train_test_split = dataset.train_test_split(test_size=0.1, seed=42)

# Create DatasetDict with proper naming for SFTTrainer
dataset_dict = DatasetDict({
    'train': train_test_split['train'],
    'validation': train_test_split['test']  # Rename 'test' to 'validation'
})

print(f"Train size: {len(dataset_dict['train'])}")
print(f"Validation size: {len(dataset_dict['validation'])}")


Generating train split: 0 examples [00:00, ? examples/s]

Train size: 16588
Validation size: 1844


Let's see what row 100 looks like!

In [None]:
dataset[100]

{'text': '<start_of_turn>user\nTranslate the following medical text from Hebrew to English accurately, preserving all medical terminology and context. IMPORTANT: Respond with the translated text only, without any additional commentary.\n\nמטופלת עם רגישות וסטיבולרית, נשלחה לבדיקת ספקולום. סד" תקין.<end_of_turn>\n<start_of_turn>model\nPatient with vestibular sensitivity, referred for speculum examination. CBC normal.<end_of_turn>'}

Each row already stores the conversation rendered with the `Gemma-3` chat template in the `text` field. The leading `<bos>` token has been removed (for example with `removeprefix('<bos>')`) since we're finetuning: the processor adds this token before training and the model expects only one.

Let's check the formatted text! Notice there is no `<bos>` token, as the processor's tokenizer will add one.

In [None]:
dataset[100]["text"]

'<start_of_turn>user\nTranslate the following medical text from Hebrew to English accurately, preserving all medical terminology and context. IMPORTANT: Respond with the translated text only, without any additional commentary.\n\nמטופלת עם רגישות וסטיבולרית, נשלחה לבדיקת ספקולום. סד" תקין.<end_of_turn>\n<start_of_turn>model\nPatient with vestibular sensitivity, referred for speculum examination. CBC normal.<end_of_turn>'
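As a rough illustration of how a row like this could be produced, the sketch below renders one user/model exchange in the turn layout shown above and strips the leading `<bos>`. The function `to_gemma3_text` is a hypothetical stand-in written for this example. It is not the tokenizer's `apply_chat_template`, and the real template may differ in details such as trailing newlines.

```python
def to_gemma3_text(user_prompt: str, model_reply: str) -> str:
    """Render one user/model exchange in the Gemma-3 turn layout."""
    rendered = (
        "<bos>"  # a chat template typically emits this first
        f"<start_of_turn>user\n{user_prompt}<end_of_turn>\n"
        f"<start_of_turn>model\n{model_reply}<end_of_turn>"
    )
    # Drop the leading <bos>; one is added again at tokenization time.
    return rendered.removeprefix("<bos>")


row = {"text": to_gemma3_text("Translate to English: שלום", "Hello")}
print(row["text"])
```

Keeping the formatting in one small function makes it easy to check that no row ends up with a duplicated `<bos>`.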

In [None]:
from trl import SFTConfig
import inspect

# See what parameters SFTConfig accepts
print(inspect.signature(SFTConfig.__init__))

# Or see all available attributes
config = SFTConfig(output_dir="./test")
print([attr for attr in dir(config) if not attr.startswith('_')])

['accelerator_config', 'activation_offloading', 'adafactor', 'adam_beta1', 'adam_beta2', 'adam_epsilon', 'assistant_only_loss', 'auto_find_batch_size', 'average_tokens_across_devices', 'batch_eval_metrics', 'bf16', 'bf16_full_eval', 'chat_template_path', 'completion_only_loss', 'data_seed', 'dataloader_drop_last', 'dataloader_num_workers', 'dataloader_persistent_workers', 'dataloader_pin_memory', 'dataloader_prefetch_factor', 'dataset_kwargs', 'dataset_num_proc', 'dataset_text_field', 'ddp_backend', 'ddp_broadcast_buffers', 'ddp_bucket_cap_mb', 'ddp_find_unused_parameters', 'ddp_timeout', 'ddp_timeout_delta', 'debug', 'deepspeed', 'deepspeed_plugin', 'default_optim', 'device', 'disable_tqdm', 'distributed_state', 'do_eval', 'do_predict', 'do_train', 'eos_token', 'eval_accumulation_steps', 'eval_batch_size', 'eval_delay', 'eval_do_concat_batches', 'eval_on_start', 'eval_packing', 'eval_steps', 'eval_strategy', 'eval_use_gather_object', 'fp16', 'fp16_backend', 'fp16_full_eval', 'fp16_opt

In [None]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset_dict['train'],
    eval_dataset=dataset_dict['validation'],
    args=SFTConfig(
        # Output directory
        output_dir="./gemma-finetune-results",

        # Training parameters
        num_train_epochs=3,  # Full epochs!
        max_steps=-1,  # -1 means use num_train_epochs
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,

        # Learning rate schedule
        learning_rate=5e-6,  # More conservative
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        weight_decay=0.01,
        optim="adamw_8bit",

        # EVALUATION PARAMETERS
        do_eval=True,  # Enable evaluation
        eval_strategy="steps",
        eval_steps=50,
        eval_accumulation_steps=2,

        # Saving parameters
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,

        # Logging
        logging_steps=10,
        logging_first_step=True,

        # Dataset configuration
        dataset_text_field="text",
        max_length=1024,

        # CORRECTED PRECISION SETTINGS
        fp16=False,  # Changed to False
        bf16=True,   # Changed to True - matches your model

        # Optional optimizations
        gradient_checkpointing=True,  # Save memory

        # Misc
        seed=3407,
        report_to="none",  # Or "wandb" if you want
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/16588 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/1844 [00:00<?, ? examples/s]
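The run length implied by this configuration can be estimated by hand. The sketch below assumes a single GPU and that each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` examples, with a final partial batch still counting as a step. The warmup figure is an estimate based on `warmup_ratio`, not a value read from the trainer.

```python
import math

num_examples = 16_588   # train split size printed earlier
per_device_batch = 1    # per_device_train_batch_size
grad_accum = 8          # gradient_accumulation_steps
num_epochs = 3          # num_train_epochs
warmup_ratio = 0.1      # warmup_ratio

effective_batch = per_device_batch * grad_accum  # single GPU assumed
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

This gives 6,222 optimizer steps in total, which is the figure the trainer reports when training starts.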

<a name="Train"></a>
### Train the model
Now let's train our model. The configuration above runs 3 full epochs (`num_train_epochs=3`) and leaves `max_steps=-1`, so the epoch count decides the run length. For a quick test run, set `max_steps` to a small number such as 60 instead.

In [None]:
# from trl import SFTTrainer, SFTConfig
# trainer = SFTTrainer(
#     model = model,
#     tokenizer = tokenizer,
#     train_dataset = dataset_dict['train'],
#     eval_dataset = dataset_dict['validation'], # Can set up evaluation!
#     args = SFTConfig(
#         dataset_text_field = "text",
#         per_device_train_batch_size = 2,
#         gradient_accumulation_steps = 4, # Use GA to mimic batch size!
#         warmup_steps = 5,
#         num_train_epochs = 1, # Set this for 1 full training run.
#         max_steps = None,
#         learning_rate = 2e-5, # Reduce to 2e-5 for long training runs
#         logging_steps = 1,
#         optim = "adamw_8bit",
#         weight_decay = 0.01,
#         lr_scheduler_type = "linear",
#         seed = 3407,
#         report_to = "none", # Use this for WandB etc
#     ),
# )

We also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=12):   0%|          | 0/16588 [00:00<?, ? examples/s]

Map (num_proc=12):   0%|          | 0/1844 [00:00<?, ? examples/s]
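The idea behind response-only training can be pictured with a small sketch: labels are a copy of the input ids, and every position up to and including the response marker is set to `-100`, the value the loss ignores. This is an illustrative, single-turn simplification with made-up token ids. It is not Unsloth's implementation, and `find_subsequence` and `mask_prompt` are hypothetical helper names.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def find_subsequence(seq, pattern):
    """Return the start index of the first occurrence of pattern in seq, or -1."""
    for i in range(len(seq) - len(pattern) + 1):
        if seq[i:i + len(pattern)] == pattern:
            return i
    return -1


def mask_prompt(input_ids, response_marker_ids):
    """Copy input_ids into labels, masking everything through the marker."""
    labels = list(input_ids)
    start = find_subsequence(labels, response_marker_ids)
    if start == -1:
        # No response marker found: nothing in this row contributes to the loss.
        return [IGNORE_INDEX] * len(labels)
    end = start + len(response_marker_ids)
    labels[:end] = [IGNORE_INDEX] * end
    return labels


# Toy ids (made up): 1 = <bos>, [7, 8] = response marker, 20..22 = answer.
input_ids = [1, 5, 6, 9, 7, 8, 20, 21, 22]
print(mask_prompt(input_ids, response_marker_ids=[7, 8]))
```

Only the answer tokens keep their ids, so only they contribute to the training loss.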

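Let's verify that the instruction part is masked! Let's print row 100 of the training split. This is a different sample from `dataset[100]` because `train_test_split` shuffles the data. Notice how the sample only has a single `<bos>` as expected!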

In [None]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

"<bos><start_of_turn>user\nTranslate the following medical text from Hebrew to English:\nTranslate the following medical text from Hebrew to English accurately, preserving all medical terminology and context. IMPORTANT: Respond with the translated text only, without any additional commentary.\n\nימין: שעה 12 ,שעה 3 נמושו פיברואדונומות ידועות ללא שינוי.\nשמאל שעה 3 ושעה 6 נמושו פיברואדנומות ידועות ללא שינוי\nללא שינויים בעור ופיטמה, ללא הגדלת בלוטות.<end_of_turn>\n<start_of_turn>model\nRight: At 12 o'clock and 3 o'clock, known fibroadenomas palpated without change.\nLeft: At 3 o'clock and 6 o'clock, known fibroadenomas palpated without change.\nNo changes to the skin and nipple, no enlarged nodes.<end_of_turn>"

Now let's print the masked out example - you should see only the answer is present:

In [None]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

"                                                                                                                                        Right: At 12 o'clock and 3 o'clock, known fibroadenomas palpated without change.\nLeft: At 3 o'clock and 6 o'clock, known fibroadenomas palpated without change.\nNo changes to the skin and nipple, no enlarged nodes.<end_of_turn>"

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
8.701 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)` instead.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 16,588 | Num Epochs = 3 | Total steps = 6,222
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 290,852,864 of 12,495,204,976 (2.33% trained)


Unsloth: Will smartly offload gradients to save VRAM!


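| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 0.8828        | 0.819504        |
| 100  | 0.7419        | 0.734267        |
| 150  | 0.6083        | 0.606831        |
| 200  | 0.5113        | 0.50551         |
| 250  | 0.4508        | 0.467342        |
| 300  | 0.4478        | 0.449904        |
| 350  | 0.4465        | 0.434563        |
| 400  | 0.3954        | 0.421493        |
| 450  | 0.3945        | 0.408803        |
| 500  | 0.4095        | 0.398661        |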


Unsloth: Not an error, but Gemma3ForConditionalGeneration does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient



KeyboardInterrupt: 

In [None]:
trainer_stats


In [None]:
# The log history is in trainer.state.log_history, not trainer_stats
import pandas as pd
import matplotlib.pyplot as plt

# Get the log history from the trainer object
df = pd.DataFrame(trainer.state.log_history)

# Check what columns we have
print("Available columns:", df.columns.tolist())
print("\nDataFrame shape:", df.shape)

# Check if validation loss decreased
if 'eval_loss' in df.columns:
    print("\nBest eval loss:", df['eval_loss'].min())
    print("Final eval loss:", df['eval_loss'].dropna().iloc[-1])

# Plot training vs validation loss
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Training loss
train_df = df[df['loss'].notna()]
if not train_df.empty:
    # Plot against the logged global step, not the DataFrame row index
    ax1.plot(train_df['step'], train_df['loss'])
    ax1.set_title('Training Loss')
    ax1.set_xlabel('Steps')
    ax1.set_ylabel('Loss')

# Validation loss
if 'eval_loss' in df.columns:
    eval_df = df[df['eval_loss'].notna()]
    if not eval_df.empty:
        ax2.plot(eval_df['step'], eval_df['eval_loss'], color='orange')
        ax2.set_title('Validation Loss')
        ax2.set_xlabel('Steps')
        ax2.set_ylabel('Loss')

plt.tight_layout()
plt.show()
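With `load_best_model_at_end=True`, `metric_for_best_model="eval_loss"` and `greater_is_better=False`, the trainer is asked to keep the checkpoint with the lowest validation loss. The sketch below picks that step by hand from the validation losses in the training log above. It assumes that only steps that are multiples of `save_steps=100` have a checkpoint to restore, and the `eval_log` list is a hand-copied stand-in for `trainer.state.log_history`.

```python
# Validation losses copied from the training log above.
eval_log = [
    {"step": 50, "eval_loss": 0.819504},
    {"step": 100, "eval_loss": 0.734267},
    {"step": 150, "eval_loss": 0.606831},
    {"step": 200, "eval_loss": 0.50551},
    {"step": 250, "eval_loss": 0.467342},
    {"step": 300, "eval_loss": 0.449904},
    {"step": 350, "eval_loss": 0.434563},
    {"step": 400, "eval_loss": 0.421493},
    {"step": 450, "eval_loss": 0.408803},
    {"step": 500, "eval_loss": 0.398661},
]

SAVE_STEPS = 100  # matches save_steps in the SFTConfig

# Only evaluations that coincide with a saved checkpoint can be restored.
saved = [row for row in eval_log if row["step"] % SAVE_STEPS == 0]
best = min(saved, key=lambda row: row["eval_loss"])
print(f"best eval_loss {best['eval_loss']:.4f} at step {best['step']}")
```

In this log the validation loss falls at every evaluation, so the latest saved checkpoint is also the best one.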

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "Translate the following medical text from Hebrew to English accurately, preserving all medical terminology and context. IMPORTANT: Respond with the translated text only, without any additional commentary\nמטופלת בהכרה מלאה, רגועה וערנית,  לד״ 134/80 דופק 91, חום בנורמה  עזרה קלה בביצוע ADL, מעברים בהשגחה, מתניידת לבד בכ״ג, אכלה ושתתה כרצונה , קיבלה טיפול תרופתי כרשום, שולטת על סוגריה, לא נצפה ביקור",
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 0.2, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

Using `cache_implementation='hybrid' is deprecated. Please only use one of ('static', 'offloaded_static'), and the layer structure will be inferred automatically.


['<bos><start_of_turn>user\nTranslate the following medical text from Hebrew to English accurately, preserving all medical terminology and context. IMPORTANT: Respond with the translated text only, without any additional commentary\nמטופלת בהכרה מלאה, רגועה וערנית,  לד״ 134/80 דופק 91, חום בנורמה  עזרה קלה בביצוע ADL, מעברים בהשגחה, מתניידת לבד בכ״ג, אכלה ושתתה כרצונה , קיבלה טיפול תרופתי כרשום, שולטת על סוגריה, לא נצפה ביקור<end_of_turn>\n<start_of_turn>model\nPatient is fully conscious, calm, and alert, BP 134/80, pulse 91, temperature within normal limits. Requires minimal assistance with ADL, transfers with supervision, ambulates independently with a walker, ate and drank as desired, received medication as prescribed, continent, no visitors observed.']
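The decoded output repeats the whole prompt before the translation. A minimal sketch for keeping only the model's turn is shown below. `extract_reply` is a hypothetical helper that relies on the Gemma-3 turn markers visible in the output above, and the sample string is invented for the example.

```python
def extract_reply(decoded: str) -> str:
    """Return only the model's turn from a decoded Gemma-3 conversation."""
    marker = "<start_of_turn>model\n"
    # Everything after the last model marker is the generated answer.
    reply = decoded.rsplit(marker, 1)[-1]
    # Cut at the end-of-turn tag if generation reached it.
    return reply.split("<end_of_turn>", 1)[0].strip()


sample = (
    "<bos><start_of_turn>user\nTranslate the text.<end_of_turn>\n"
    "<start_of_turn>model\nPatient is fully conscious."
)
print(extract_reply(sample))
```

In the notebook this could be applied to each string returned by `tokenizer.batch_decode(outputs)`.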

In [None]:
import torch
import gc

# Clear GPU cache without losing anything
torch.cuda.empty_cache()
gc.collect()

89545

In [None]:
# Rescue path: use this when GPU memory is tight
# Skip Unsloth's merge and use Hugging Face Transformers + PEFT directly
from transformers import AutoModelForCausalLM
import torch
import gc

# Clear GPU memory first
torch.cuda.empty_cache()
gc.collect()

# Temporarily move model to CPU for saving
model = model.cpu()
torch.cuda.empty_cache()
gc.collect()

# Save just the adapter first
model.save_pretrained("lora-adapter")
tokenizer.save_pretrained("lora-adapter")

# Clear the model from memory completely before loading base model
del model
torch.cuda.empty_cache()
gc.collect()

# Load and merge manually
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it",  # Make sure this matches your actual base model
    torch_dtype=torch.bfloat16,
    device_map="auto",
    low_cpu_mem_usage=True  # Added this to reduce memory usage
)

# Load the PEFT model
from peft import PeftModel
peft_model = PeftModel.from_pretrained(base_model, "lora-adapter")
merged = peft_model.merge_and_unload()

# Push to hub
merged.push_to_hub(
    "Aagmon/gemma-3-12b-ft-trans-imp-2",
    token=hf_key,
    max_shard_size="2GB"  # Reduced from 5GB to help with memory
)
tokenizer.push_to_hub(
    "Aagmon/gemma-3-12b-ft-trans-imp-2",
    token=hf_key
)




config.json:   0%|          | 0.00/916 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!



```python
from transformers import AutoModelForCausalLM

# Load original tied model
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", tie_word_embeddings=False)

# Set the randomly initialized lm_head to the previously tied embeddings
model.lm_head.weight.data = model.model.embed_tokens.weight.data.clone()

# Save the untied model
untied_model_dir = "dir/for/untied/model"
model.save_pretrained(untied_model_dir)
model.config.save_pretrained(untied_model_dir)

# Now use the original model but in untied format
model = AutoModelForCausalLM.from_pretrained(untied_model_dir)
```




In [None]:
# Use this path when there are no memory issues
# Skip Unsloth's merge and use Hugging Face Transformers + PEFT directly
from transformers import AutoModelForCausalLM



# Save just the adapter first
model.save_pretrained("lora-adapter")
tokenizer.save_pretrained("lora-adapter")

# Load and merge manually
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it",  # or whatever base you used
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load the PEFT model
from peft import PeftModel
peft_model = PeftModel.from_pretrained(base_model, "lora-adapter")
merged = peft_model.merge_and_unload()

# Push to hub
merged.push_to_hub(
    "Aagmon/gemma-3-12b-ft-trans-imp-1",
    token=hf_key,
    max_shard_size="5GB"
)
tokenizer.push_to_hub(
    "Aagmon/gemma-3-12b-ft-trans-imp-1",
    token=hf_key
)

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]



OutOfMemoryError: CUDA out of memory. Tried to allocate 1.91 GiB. GPU 0 has a total capacity of 39.56 GiB of which 908.88 MiB is free. Process 3393 has 38.65 GiB memory in use. Of the allocated memory 38.10 GiB is allocated by PyTorch, and 28.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
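A back-of-the-envelope estimate suggests why this path runs out of memory: the cell loads a second copy of the base model in bfloat16 while the fine-tuned model from training is still on the GPU. The sketch below uses the total parameter count reported in the training banner (which includes the adapter weights, so it slightly overestimates the base model) and the capacity quoted in the error message. It ignores activations, buffers and allocator overhead.

```python
GIB = 1024 ** 3

num_params = 12_495_204_976  # total parameters reported by the training banner
bytes_per_param_bf16 = 2     # bfloat16 stores each weight in 2 bytes
gpu_capacity_gib = 39.56     # total capacity quoted in the error message

weights_gib = num_params * bytes_per_param_bf16 / GIB
headroom_gib = gpu_capacity_gib - weights_gib

print(f"bf16 weights alone: about {weights_gib:.1f} GiB")
print(f"left for everything else: about {headroom_gib:.1f} GiB")
```

Under these assumptions the bfloat16 weights need roughly 23 GiB on their own, which is why the rescue cell moves the trained model to the CPU and deletes it before loading the base model again.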

In [None]:
!nvidia-smi

Sun Sep  7 10:14:24 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   60C    P0             29W /   70W |   14460MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                