<a href="https://colab.research.google.com/github/fatiimahoseini/n8n-llm/blob/main/nb/Gemma3N_(4B)-Conversational.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News


[Vision RL](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) is now supported! Train Qwen2.5-VL, Gemma 3 etc. with GSPO or GRPO.

Introducing Unsloth [Standby for RL](https://docs.unsloth.ai/basics/memory-efficient-rl): GRPO is now faster, uses 30% less memory, and supports 2x longer context.

gpt-oss fine-tuning now supports 8× longer context with no accuracy loss. [Read more](https://docs.unsloth.ai/basics/long-context-gpt-oss-training)

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.55.4
!pip install --no-deps trl==0.22.2
import torch; torch._dynamo.config.recompile_limit = 64;


In [None]:
%%capture
!pip install --no-deps --upgrade timm # Only for Gemma 3N

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [2]:
from unsloth import FastModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3n-E4B-it-unsloth-bnb-4bit",
    "unsloth/gemma-3n-E2B-it-unsloth-bnb-4bit",
    # Pretrained models
    "unsloth/gemma-3n-E4B-unsloth-bnb-4bit",
    "unsloth/gemma-3n-E2B-unsloth-bnb-4bit",

    # Other Gemma 3 quants
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    dtype = None, # None for auto detection
    max_seq_length = 1024, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.7: Fast Gemma3 patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.


# Let's finetune Gemma 3N!

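Using the flags below, you can choose to finetune the vision layers, the language layers, or both. The audio part can also be finetuned, and we're working on making it selectable as well! Note that this run loads `unsloth/gemma-3-1b-it-unsloth-bnb-4bit`, a text-only Gemma 3 model; to finetune Gemma 3N itself, load one of the `gemma-3n` checkpoints from the list above instead.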

We now add LoRA adapters, so we only need to update a small fraction of the parameters!

In [22]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model` require gradients
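
To see why the trainable fraction is small: LoRA keeps a linear layer's weight W (d_out × d_in) frozen and learns two small matrices, A (r × d_in) and B (d_out × r), so each adapted layer adds r · (d_in + d_out) parameters, and the update B · A is scaled by `lora_alpha / r`. The sketch below does this arithmetic for one square projection. The width of 1152 is an illustrative assumption, not a value read from the model, and the helper names are hypothetical:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

def lora_fraction(d_in: int, d_out: int, r: int) -> float:
    """LoRA parameters as a fraction of the frozen weight's d_out * d_in entries."""
    return lora_param_count(d_in, d_out, r) / (d_in * d_out)

d_model = 1152        # illustrative width (assumption), not read from the model
r, lora_alpha = 8, 8  # same values as the get_peft_model call

added = lora_param_count(d_model, d_model, r)
fraction = lora_fraction(d_model, d_model, r)
scaling = lora_alpha / r  # LoRA multiplies B @ A by this factor

print(added)              # 18432
print(f"{fraction:.2%}")  # 1.39%
print(scaling)            # 1.0
```

With `lora_alpha == r`, the scaling factor is exactly 1, so the adapter update is added to the frozen output unchanged.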


<a name="Data"></a>
### Data Prep
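We now use the `Gemma-3` format for conversation-style finetunes. The notebook this one is based on uses [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style; here we instead load n8n training data from `/content/n8n_data/n8n_training_data.json`. Each prompt is in Persian and reads roughly "Build an n8n workflow with the nodes '...'", and each completion is the workflow itself. Gemma-3 renders multi-turn conversations like below: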

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.
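
To make the layout concrete, here is a hypothetical helper, `render_gemma_turns`, that reproduces the text format shown above for plain-string messages. It is only an approximation for illustration. The real template (from `get_chat_template` or the tokenizer) also handles system prompts and list-style `content` like the one used in the inference cells. Note that the `assistant` role is written as `model`:

```python
def render_gemma_turns(messages, add_bos=True, add_generation_prompt=False):
    """Hypothetical approximation of the Gemma-3 text layout for string-only messages.
    Not the real chat template: no system prompts, no list-style content."""
    role_names = {"user": "user", "assistant": "model", "model": "model"}
    parts = ["<bos>"] if add_bos else []
    for message in messages:
        role = role_names[message["role"]]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues as the assistant.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)

text = render_gemma_turns([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
])
print(text)
```

Setting `add_generation_prompt=True` ends the text with an open `<start_of_turn>model\n`, which is what the inference cells rely on.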

In [10]:
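import pandas as pd
df = pd.read_json("/content/n8n_data/n8n_training_data.json")  # Load the data so this cell also runs on its own
print(df['completion'].dtype)
print(df['completion'].sample(5))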

object
219    {'id': 'slP122GjD9meGkS6', 'meta': {'instanceI...
66     {'nodes': [{'name': 'Simplify Result', 'type':...
50     {'name': 'Ask a human', 'nodes': [{'id': 'a60c...
49     {'nodes': [{'id': '2498bb93-176f-458c-acee-f54...
191    {'meta': {'instanceId': '408f9fb9940c3cb18ffde...
Name: completion, dtype: object


In [12]:
import pandas as pd
from datasets import Dataset

# Load the user's data using pandas
df = pd.read_json("/content/n8n_data/n8n_training_data.json")

# Restructure the data to create a 'conversations' column
# Assuming 'prompt' is the user turn and 'completion' is the model turn
df['conversations'] = df.apply(lambda row: [
    {"role": "user", "content": row["prompt"]},
    {"role": "assistant", "content": str(row["completion"])} # str() turns the workflow dict into a Python repr (single quotes, True), not JSON
], axis=1)

# Select only the 'conversations' column and convert to Hugging Face dataset
dataset = Dataset.from_pandas(df[['conversations']])

# Display the first few examples of the dataset
display(dataset[:5])

{'conversations': [[{'content': "یک ورک\u200cفلو n8n با نودهای 'Email Trigger (IMAP), Markdown, Send Email, Email Summarization Chain, Write email, OpenAI, Sticky Note2, Sticky Note5, Sticky Note1, Sticky Note7, Approve Email, OpenAI Chat Model, Set Email text, Sticky Note, Sticky Note11, Approved?' بساز.",
    'role': 'user'},
   {'content': '{\'id\': \'Nvn78tMRNnKji7Fg\', \'meta\': {\'instanceId\': \'a4bfc93e975ca233ac45ed7c9227d84cf5a2329310525917adaf3312e10d5462\', \'templateCredsSetupCompleted\': True}, \'name\': \'Very simple Human in the loop system email with AI e IMAP\', \'tags\': [], \'nodes\': [{\'id\': \'271bb16f-9b62-41d9-ab76-114cd7ba915a\', \'name\': \'Email Trigger (IMAP)\', \'type\': \'n8n-nodes-base.emailReadImap\', \'position\': [-1300, 1340], \'parameters\': {\'options\': {}}, \'credentials\': {\'imap\': {\'id\': \'k31W9oGddl9pMDy4\', \'name\': \'IMAP info@n3witalia.com\'}}, \'typeVersion\': 2}, {\'id\': \'42d150d8-d574-49f9-9c0e-71a2cdea3b79\', \'name\': \'Markdown
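
The completions above are stored as Python dict reprs (single quotes, `True`) because the loading cell calls `str()` on each workflow dict, while n8n imports workflows as JSON. A minimal sketch comparing the two serializations, using a made-up, heavily trimmed workflow dict:

```python
import json

# Hypothetical, heavily trimmed workflow dict (not taken from the dataset).
workflow = {"name": "Demo", "active": True, "nodes": [{"name": "Wait", "parameters": {}}]}

as_repr = str(workflow)                             # what the loading cell stores
as_json = json.dumps(workflow, ensure_ascii=False)  # JSON; keeps Persian text readable

print(as_repr)  # {'name': 'Demo', 'active': True, 'nodes': [{'name': 'Wait', 'parameters': {}}]}
print(as_json)  # {"name": "Demo", "active": true, "nodes": [{"name": "Wait", "parameters": {}}]}

# Only the JSON form parses back with a JSON parser.
assert json.loads(as_json) == workflow
try:
    json.loads(as_repr)
except json.JSONDecodeError as err:
    print("str() output is not valid JSON:", err.msg)
```

Swapping `str(row["completion"])` for `json.dumps(row["completion"], ensure_ascii=False)` in the loading cell would train on JSON instead. The outputs in this notebook were produced with `str()`.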

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
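
For intuition, here is a minimal sketch of the kind of conversion `standardize_data_formats` performs. It assumes ShareGPT-style turns keyed by `from`/`value` with roles such as `human` and `gpt` (as in FineTome-100k). The helper `sharegpt_to_messages` is hypothetical, and Unsloth's real function handles more layouts. In this sketch, rows that already use `role`/`content`, like the n8n data here, pass through unchanged:

```python
# Hypothetical sketch of ShareGPT -> role/content conversion (not Unsloth's code).
ROLE_MAP = {"human": "user", "user": "user", "gpt": "assistant",
            "assistant": "assistant", "system": "system"}

def sharegpt_to_messages(turns):
    """Convert [{'from': ..., 'value': ...}] turns to [{'role': ..., 'content': ...}]."""
    messages = []
    for turn in turns:
        if "role" in turn and "content" in turn:  # already standardized
            messages.append({"role": turn["role"], "content": turn["content"]})
        else:
            messages.append({"role": ROLE_MAP[turn["from"]], "content": turn["value"]})
    return messages

sharegpt = [{"from": "human", "value": "Hello!"}, {"from": "gpt", "value": "Hey there!"}]
print(sharegpt_to_messages(sharegpt))
# [{'role': 'user', 'content': 'Hello!'}, {'role': 'assistant', 'content': 'Hey there!'}]
```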

In [13]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

We now apply the `Gemma-3` chat template to the conversations and save the result to `text`. We strip the leading `<bos>` token with `removeprefix('<bos>')` because the tokenizer adds `<bos>` again when the trainer tokenizes `text`, and the model expects only one.
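
A small string-level sketch of why the prefix is stripped. It assumes, as the text says, that the tokenizer prepends one `<bos>` when the trainer tokenizes `text`; `add_bos` below is a hypothetical stand-in for that step, not a real tokenizer call. `str.removeprefix` needs Python 3.9+:

```python
def add_bos(text: str) -> str:
    """Hypothetical stand-in for the tokenizer prepending <bos> during tokenization."""
    return "<bos>" + text

# Shape of a templated conversation (see the Gemma-3 example in the Data Prep section).
templated = "<bos><start_of_turn>user\nHello!<end_of_turn>\n"

without_strip = add_bos(templated)
with_strip = add_bos(templated.removeprefix("<bos>"))

print(without_strip.count("<bos>"))  # 2
print(with_strip.count("<bos>"))     # 1
```

`removeprefix` is a no-op when the prefix is absent, so re-running the formatting cell on already-stripped text does not remove anything else.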

In [14]:
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

Let's see how the chat template did! Notice there is no `<bos>` token, because the tokenizer will add one during training.

In [15]:
dataset[0]["text"]

'<start_of_turn>user\nیک ورک\u200cفلو n8n با نودهای \'Email Trigger (IMAP), Markdown, Send Email, Email Summarization Chain, Write email, OpenAI, Sticky Note2, Sticky Note5, Sticky Note1, Sticky Note7, Approve Email, OpenAI Chat Model, Set Email text, Sticky Note, Sticky Note11, Approved?\' بساز.<end_of_turn>\n<start_of_turn>model\n{\'id\': \'Nvn78tMRNnKji7Fg\', \'meta\': {\'instanceId\': \'a4bfc93e975ca233ac45ed7c9227d84cf5a2329310525917adaf3312e10d5462\', \'templateCredsSetupCompleted\': True}, \'name\': \'Very simple Human in the loop system email with AI e IMAP\', \'tags\': [], \'nodes\': [{\'id\': \'271bb16f-9b62-41d9-ab76-114cd7ba915a\', \'name\': \'Email Trigger (IMAP)\', \'type\': \'n8n-nodes-base.emailReadImap\', \'position\': [-1300, 1340], \'parameters\': {\'options\': {}}, \'credentials\': {\'imap\': {\'id\': \'k31W9oGddl9pMDy4\', \'name\': \'IMAP info@n3witalia.com\'}}, \'typeVersion\': 2}, {\'id\': \'42d150d8-d574-49f9-9c0e-71a2cdea3b79\', \'name\': \'Markdown\', \'type\'

In [16]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

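The original notebook loads the first 3000 rows of FineTome-100k here. That cell is commented out because we train on the n8n data loaded above.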

In [None]:
# from datasets import load_dataset
# dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")

Let's see what row 100 looks like!

In [18]:
dataset[100]

{'conversations': [{'content': "یک ورک\u200cفلو n8n با نودهای 'When clicking ‘Test workflow’, Wait, Sticky Note36, Sticky Note28, Connect to your own data source, Get urls from own data source, Example fields from data source, Sticky Note33, Sticky Note34, Sticky Note35, Sticky Note37, 40 items at a time, 10 at a time, Markdown data and Links, Split out page URLs, Retrieve Page Markdown and Links, Sticky Note38' بساز.",
   'role': 'user'},
  {'content': '{\'meta\': {\'instanceId\': \'6b6a2db47bdf8371d21090c511052883cc9a3f6af5d0d9d567c702d74a18820e\'}, \'nodes\': [{\'id\': \'f4570aad-db25-4dcd-8589-b1c8335935de\', \'name\': \'When clicking ‘Test workflow’\', \'type\': \'n8n-nodes-base.manualTrigger\', \'position\': [-180, 3800], \'parameters\': {}, \'typeVersion\': 1}, {\'id\': \'bd481559-85f2-4865-8d85-e50e72369f26\', \'name\': \'Wait\', \'type\': \'n8n-nodes-base.wait\', \'position\': [940, 3620], \'webhookId\': \'f10708f0-38c6-4c75-b635-37222d5b183a\', \'parameters\': {\'amount\': 45

Now that the tokenizer uses the `gemma-3` chat template, we re-apply it to the conversations and overwrite `text`. As before, `removeprefix('<bos>')` strips the leading `<bos>`, since the tokenizer adds one during training and the model expects only one.

In [19]:
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

Let's check row 100 after re-templating. Again, there is no `<bos>` token, because the tokenizer will add one during training.

In [20]:
dataset[100]["text"]

'<start_of_turn>user\nیک ورک\u200cفلو n8n با نودهای \'When clicking ‘Test workflow’, Wait, Sticky Note36, Sticky Note28, Connect to your own data source, Get urls from own data source, Example fields from data source, Sticky Note33, Sticky Note34, Sticky Note35, Sticky Note37, 40 items at a time, 10 at a time, Markdown data and Links, Split out page URLs, Retrieve Page Markdown and Links, Sticky Note38\' بساز.<end_of_turn>\n<start_of_turn>model\n{\'meta\': {\'instanceId\': \'6b6a2db47bdf8371d21090c511052883cc9a3f6af5d0d9d567c702d74a18820e\'}, \'nodes\': [{\'id\': \'f4570aad-db25-4dcd-8589-b1c8335935de\', \'name\': \'When clicking ‘Test workflow’\', \'type\': \'n8n-nodes-base.manualTrigger\', \'position\': [-180, 3800], \'parameters\': {}, \'typeVersion\': 1}, {\'id\': \'bd481559-85f2-4865-8d85-e50e72369f26\', \'name\': \'Wait\', \'type\': \'n8n-nodes-base.wait\', \'position\': [940, 3620], \'webhookId\': \'f10708f0-38c6-4c75-b635-37222d5b183a\', \'parameters\': {\'amount\': 45}, \'type

<a name="Train"></a>
### Train the model
Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs = 1` and comment out `max_steps`.
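
A quick sketch of the step arithmetic, using the numbers from the `SFTConfig` and the training log (276 examples, batch size 1, gradient accumulation 4, one GPU). It assumes one optimizer step consumes `per_device_train_batch_size × gradient_accumulation_steps × num_gpus` examples, matching the "Total batch size (1 x 4 x 1) = 4" line in the log:

```python
import math

# Values from the SFTConfig in this notebook and the trainer log.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_gpus = 1
num_examples = 276
max_steps = 60

# Examples consumed by one optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Roughly how many steps one full epoch takes (what num_train_epochs = 1 would run).
steps_per_epoch = math.ceil(num_examples / effective_batch_size)

# How much of the data the 60-step demo run actually sees.
examples_seen = max_steps * effective_batch_size
fraction_of_epoch = examples_seen / num_examples

print(effective_batch_size)        # 4
print(steps_per_epoch)             # 69
print(examples_seen)               # 240
print(f"{fraction_of_epoch:.0%}")  # 87%
```

So the 60-step demo stops a little short of one full pass over this small dataset.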

In [23]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!
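
A minimal sketch of what response-only masking does at the token level, assuming the instruction and response markers have already been tokenized into id lists. Every label outside an assistant response, including the marker tokens themselves, is set to `-100`, the index PyTorch's cross-entropy loss ignores by default. `mask_non_responses` and the toy ids are hypothetical; Unsloth's `train_on_responses_only` works on the real tokenized marker strings and may differ in details:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Return labels that keep only tokens inside assistant responses.

    A response starts after `response_ids` and ends at the next `instruction_ids`.
    Marker tokens themselves are masked.
    """
    assert instruction_ids and response_ids, "markers must be non-empty"
    labels = [IGNORE_INDEX] * len(input_ids)
    i, in_response = 0, False
    while i < len(input_ids):
        if input_ids[i:i + len(response_ids)] == response_ids:
            in_response = True
            i += len(response_ids)
        elif input_ids[i:i + len(instruction_ids)] == instruction_ids:
            in_response = False
            i += len(instruction_ids)
        else:
            if in_response:
                labels[i] = input_ids[i]
            i += 1
    return labels

# Toy ids: 0 = <bos>, [1, 2] = "<start_of_turn>user\n", [3, 4] = "<start_of_turn>model\n".
input_ids = [0, 1, 2, 10, 11, 3, 4, 20, 21, 1, 2, 12, 3, 4, 22]
labels = mask_non_responses(input_ids, instruction_ids=[1, 2], response_ids=[3, 4])
print(labels)
# [-100, -100, -100, -100, -100, -100, -100, 20, 21, -100, -100, -100, -100, -100, 22]
```

Only the two answer spans (20, 21 and 22) keep their labels, which is the pattern the decoded labels further down show for row 100.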

In [24]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Let's verify that the instruction part is masked by printing row 100 again. Notice that the sample has only a single `<bos>`, as expected!

In [25]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<bos><start_of_turn>user\nیک ورک\u200cفلو n8n با نودهای \'When clicking ‘Test workflow’, Wait, Sticky Note36, Sticky Note28, Connect to your own data source, Get urls from own data source, Example fields from data source, Sticky Note33, Sticky Note34, Sticky Note35, Sticky Note37, 40 items at a time, 10 at a time, Markdown data and Links, Split out page URLs, Retrieve Page Markdown and Links, Sticky Note38\' بساز.<end_of_turn>\n<start_of_turn>model\n{\'meta\': {\'instanceId\': \'6b6a2db47bdf8371d21090c511052883cc9a3f6af5d0d9d567c702d74a18820e\'}, \'nodes\': [{\'id\': \'f4570aad-db25-4dcd-8589-b1c8335935de\', \'name\': \'When clicking ‘Test workflow’\', \'type\': \'n8n-nodes-base.manualTrigger\', \'position\': [-180, 3800], \'parameters\': {}, \'typeVersion\': 1}, {\'id\': \'bd481559-85f2-4865-8d85-e50e72369f26\', \'name\': \'Wait\', \'type\': \'n8n-nodes-base.wait\', \'position\': [940, 3620], \'webhookId\': \'f10708f0-38c6-4c75-b635-37222d5b183a\', \'parameters\': {\'amount\': 45}, \

Now let's print the labels with the masked (`-100`) positions shown as spaces. Only the answer should remain:

In [26]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                                                                                       {\'meta\': {\'instanceId\': \'6b6a2db47bdf8371d21090c511052883cc9a3f6af5d0d9d567c702d74a18820e\'}, \'nodes\': [{\'id\': \'f4570aad-db25-4dcd-8589-b1c8335935de\', \'name\': \'When clicking ‘Test workflow’\', \'type\': \'n8n-nodes-base.manualTrigger\', \'position\': [-180, 3800], \'parameters\': {}, \'typeVersion\': 1}, {\'id\': \'bd481559-85f2-4865-8d85-e50e72369f26\', \'name\': \'Wait\', \'type\': \'n8n-nodes-base.wait\', \'position\': [940, 3620], \'webhookId\': \'f10708f0-38c6-4c75-b635-37222d5b183a\', \'parameters\': {\'amount\': 45}, \'typeVersion\': 1.1}, {\'id\': \'cc9e9947-19e4-47c5-95b0-a631d688a8b6\', \'name\': \'Sticky Note36\', \'type\': \'n8n-nodes-base.stickyNote\', \'position\': [549.7858793743054, 3709.534654112671], \'parameters\': {\'color\': 7, \'width\': 327.8244990224782, \'height\': 268.48353140372035, \'content\': \'**40 at a time seems to be the

In [27]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
1.512 GB of memory reserved.


# Let's train the model!

To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [28]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 276 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 6,522,880 of 1,006,408,832 (0.65% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|---|---|
| 1 | 2.5191 |
| 2 | 2.3238 |
| 3 | 2.6307 |
| 4 | 2.2982 |
| 5 | 2.6374 |
| 6 | 2.6302 |
| 7 | 2.4382 |
| 8 | 2.1902 |
| 9 | 2.0686 |
| 10 | 2.3386 |


In [29]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

240.667 seconds used for training.
4.01 minutes used for training.
Peak reserved memory = 2.338 GB.
Peak reserved memory for training = 0.826 GB.
Peak reserved memory % of max memory = 15.861 %.
Peak reserved memory for training % of max memory = 5.603 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
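
The three settings act on the next-token distribution. Below is a simplified, pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering and then sampling. It illustrates the idea only: `sampling_distribution` and `sample` are hypothetical helpers, and the logits processors in transformers differ in details such as tie handling and a minimum number of kept tokens:

```python
import math
import random

def sampling_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Simplified temperature -> top-k -> top-p filtering over a list of logits.
    Returns {token_id: probability} for the tokens that survive."""
    scaled = [x / temperature for x in logits]
    # top-k: keep the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    m = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - m) for i in ranked]
    z = sum(exps)
    probs = [e / z for e in exps]
    # top-p: keep the smallest best-first prefix whose cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for i, p in zip(ranked, probs):
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    total = sum(p for _, p in kept)
    return {i: p / total for i, p in kept}

def sample(dist, rng):
    """Draw one token id from {token_id: probability}."""
    r = rng.random()
    acc = 0.0
    token = None
    for token, p in dist.items():
        acc += p
        if r < acc:
            return token
    return token  # guard against floating-point round-off

dist = sampling_distribution([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(dist)  # tokens 0 and 1 survive
print(sample(dist, random.Random(0)))
```

With `temperature = 1.0` the logits are left unchanged, so in the recommended Gemma-3 settings only `top_k` and `top_p` reshape the distribution.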

In [39]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "create a global error workflow",
    }]
}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens = 512, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipython-input-3789865791.py", line 20, in <cell line: 0>
    outputs = model.generate(
              ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/peft/peft_model.py", line 1973, in generate
    outputs = self.base_model.generate(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/unsloth/models/vision.py", line 246, in unsloth_base_fast_generate
    output = self._old_generate(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", li

TypeError: object of type 'NoneType' has no len()

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the whole output!
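
The idea behind streaming is simple. After each new token, decode the cached tokens, print only the text that is new, and hold back a partial word until a space or newline arrives. Below is a simplified pure-Python sketch of that idea with a toy vocabulary. `SketchStreamer` and `VOCAB` are hypothetical, and the real `TextStreamer` in transformers handles more cases (for example CJK characters). The `skip_prompt` flag mirrors `skip_prompt = True` in the cell: `generate()` hands the prompt ids to the streamer first, and those are dropped:

```python
class SketchStreamer:
    """Simplified sketch of the idea behind transformers' TextStreamer (not its actual code).
    `decode` maps a list of token ids to text."""

    def __init__(self, decode, skip_prompt=False):
        self.decode = decode
        self.skip_prompt = skip_prompt
        self.next_tokens_are_prompt = True
        self.token_cache = []
        self.print_len = 0
        self.chunks = []  # everything printed so far, in order

    def put(self, token_ids):
        # The first call carries the prompt ids; optionally drop them.
        if self.skip_prompt and self.next_tokens_are_prompt:
            self.next_tokens_are_prompt = False
            return
        self.token_cache.extend(token_ids)
        text = self.decode(self.token_cache)
        if text.endswith("\n"):
            # A finished line: print the rest and start a fresh cache.
            new_text = text[self.print_len:]
            self.token_cache, self.print_len = [], 0
        else:
            # Print only up to the last space so partial words are not shown.
            cut = text.rfind(" ") + 1
            new_text = text[self.print_len:cut]
            self.print_len += len(new_text)
        self._emit(new_text)

    def end(self):
        # Flush whatever is left when generation stops.
        text = self.decode(self.token_cache)
        self._emit(text[self.print_len:])
        self.token_cache, self.print_len = [], 0
        self.next_tokens_are_prompt = True

    def _emit(self, text):
        if text:
            self.chunks.append(text)
            print(text, end="", flush=True)

# Toy vocabulary standing in for a real tokenizer.
VOCAB = {0: "Hello", 1: " world", 2: "!", 3: "\n", 4: " Bye", 5: "Prompt: "}

def decode(ids):
    return "".join(VOCAB[i] for i in ids)

streamer = SketchStreamer(decode, skip_prompt=True)
streamer.put([5])  # prompt tokens, skipped
for token_id in [0, 1, 2, 3, 4]:
    streamer.put([token_id])
streamer.end()
```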

In [40]:
messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "create a workflow to send an email every day at 1pm with n8n",}]
}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")

from transformers import TextStreamer
_ = model.generate(
    **inputs,
    max_new_tokens = 1024, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

## Workflow using n8.n for Daily Email Distribution in n8.n

Here’s a comprehensive workflow leveraging n8.n to consistently send daily emails. This setup utilizes a ServiceAccount, triggers that check the date of the day, and then uses n8.n’s built-in email sending functionality.

**1. ServiceAccount Setup (n8.n Secret)**

* **ServiceAccount Name:**  `<your_service_account_name>` - Make sure you've configured the service account to work with your n8.n project's email.  You can find this by going to `$account` in the n8.n UI (the "Account's ServiceAccount").
* **ServiceAccount's Email Address:** `<your_service_account_email_address>` - Ensure your ServiceAccount's credentials are properly configured in the UI.

**2. n8.n Workflow - Define the Automation:**

* **Protocol:** `HTTP`
* **Trigger:** `Schedule'n8.nSchedule'n8.nSchedule'"n8.nSchedule'" - '00:00:00' - '00:00:00' - '00:00:00', 'n8.n' - "Schedule" - 'n8.nSchedule' - '0.008 - 0.000, 0' - 'n8.n' - "Schedule" - 'n8.nSchedule'" - '0

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [32]:
model.save_pretrained("gemma-3n")  # Local saving
tokenizer.save_pretrained("gemma-3n")
# model.push_to_hub("HF_ACCOUNT/gemma-3n", token = "...") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma-3n", token = "...") # Online saving

('gemma-3n/tokenizer_config.json',
 'gemma-3n/special_tokens_map.json',
 'gemma-3n/chat_template.jinja',
 'gemma-3n/tokenizer.model',
 'gemma-3n/added_tokens.json',
 'gemma-3n/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [33]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma-3n", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "What is Gemma-3N?",}]
}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")

from transformers import TextStreamer
_ = model.generate(
    **inputs,
    max_new_tokens = 128, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Gemma-3N is a powerful and efficient open-source language model developed by the Tian Qian Research Group (TQRG). It's a family of models built on top of the Llama 3 model architecture, specifically optimized for improved reasoning capabilities. Here’s a breakdown of what makes it notable:

**Key Characteristics & What Makes It Special:**

* **Foundation Model:** It's based on Llama 3 (7B, 13B, and 33B versions), a leading series of large language models.
* **RAG Focus:** It leverages Retrieval-Augmented Generation (


### Saving to float16 for VLLM

We also support saving to `float16` directly for deployment! We save it in the folder `gemma-3N-finetune`. Set `if False` to `if True` to let it run!
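
Conceptually, a merged save folds each LoRA update into its base weight, W_merged = W + (lora_alpha / r) · B · A, so the exported model needs no adapter code at inference time. The toy check below, in plain Python with made-up 2 × 3 matrices and r = 1, confirms that the merged layer gives the same output as the base layer plus the LoRA path. It shows the math only, not how `save_pretrained_merged` is implemented:

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(m, v):
    """Plain-Python matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

r, lora_alpha = 1, 2            # toy values; the notebook uses r = lora_alpha = 8
scale = lora_alpha / r

W = [[1.0, 0.0, 2.0],           # frozen base weight, d_out=2 x d_in=3
     [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]          # LoRA A, r x d_in
B = [[0.1], [0.2]]              # LoRA B, d_out x r

# Merge: W + scale * (B @ A)
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

x = [1.0, 2.0, 3.0]
adapter_out = [base + scale * delta
               for base, delta in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged_out = matvec(W_merged, x)
print(adapter_out, merged_out)
```

Because `lora_dropout = 0` and `bias = "none"` in this notebook, the merged weights reproduce the adapter's output up to floating-point rounding.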

In [34]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-3N-finetune", tokenizer)

If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-3N-finetune", tokenizer,
        token = "hf_..."
    )

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we now support it natively for all models! For now, you can easily convert to `Q8_0`, `F16`, or `BF16` precision. `Q4_K_M` (4-bit) will come later!
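
For intuition about the `Q8_0` option: llama.cpp's Q8_0 splits each weight row into blocks of 32 values and stores one scale per block plus 32 signed 8-bit codes, about 8.5 bits per weight when the scale is float16. Below is a simplified pure-Python sketch of that round trip. `quantize_q8_0` and `dequantize_q8_0` are hypothetical helpers, not llama.cpp or Unsloth APIs, and the scale is kept as a Python float rather than float16:

```python
import math

QK8_0 = 32  # values per block in llama.cpp's Q8_0 format

def round_half_away(x: float) -> int:
    # C's roundf rounds halves away from zero; Python's round() rounds halves to even.
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

def quantize_q8_0(values):
    """Simplified Q8_0: per block, one scale d = max|x| / 127 and int8 codes q = round(x / d)."""
    blocks = []
    for start in range(0, len(values), QK8_0):
        block = values[start:start + QK8_0]
        amax = max(abs(v) for v in block)
        d = amax / 127.0
        inv_d = 1.0 / d if d else 0.0  # an all-zero block gets all-zero codes
        blocks.append((d, [round_half_away(v * inv_d) for v in block]))
    return blocks

def dequantize_q8_0(blocks):
    """Rebuild approximate values as scale * code."""
    return [d * q for d, qs in blocks for q in qs]

weights = [i / 10 for i in range(-32, 32)]  # 64 toy weights -> 2 blocks
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(blocks), max_err)
```

The per-value error is at most half a block's scale, which is why Q8_0 is usually close to the 16-bit model in quality.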

In [35]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3N-finetune",
        quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )

Likewise, if you want to instead push to GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "gemma-3N-finetune",
        quantization_type = "Q8_0", # Only Q8_0, BF16, F16 supported
        repo_id = "HF_ACCOUNT/gemma-3N-finetune-gguf",
        token = "hf_...",
    )

Now use the resulting `gemma-3N-finetune.gguf` file in llama.cpp. A `Q4_K_M` file will only be available once 4-bit export is supported.

And we're done! If you have any questions about Unsloth, find bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our [Discord](https://discord.gg/unsloth) channel!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

