### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm
# Install latest Hugging Face for Gemma-3!
!pip install --no-deps git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt
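
The regular expression in the cell above drops every requirement line that starts with `transformers`, `numpy`, or `xformers`, so that installing the vLLM requirements does not reinstall those packages. A minimal sketch of that filter on a made-up requirements snippet (the package pins below are hypothetical, not the real contents of vLLM's `common.txt`):

```python
import re

# Hypothetical requirements text, used only to illustrate the filter.
sample = b"numpy<2.0.0\ntransformers>=4.0\nrequests\n"

# Same pattern as in the install cell: remove the whole line for each matched package.
filtered = re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", sample)
print(filtered)
```

Note that the pattern is not anchored to the start of a line, so it would also match one of these names appearing in the middle of a line.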

### Using the Unsloth library to download a 4-bit quantized model

Unsloth provides pre-quantized versions of open-source LLMs that are suitable for fine-tuning. This project uses Gemma-3 27B.

In [3]:
from unsloth import FastModel
import torch

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",
    max_seq_length = 8000, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-26 02:03:38 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.19: Fast Gemma3 patching. Transformers: 4.50.0.dev0. vLLM: 0.8.4.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json:   0%|          | 0.00/343k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/5 [00:00<?, ?it/s]

model-00001-of-00005.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00005.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00003-of-00005.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00004-of-00005.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00005-of-00005.safetensors:   0%|          | 0.00/2.04G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/210 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/70.0 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/1.61k [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/1.53k [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]
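
As a rough sanity check on the download log above, the five shard sizes can be summed and compared with a naive 4-bit estimate. This is a back-of-the-envelope sketch: the 27 billion parameter count is the nominal model size, and the 0.5 bytes per parameter figure assumes every weight is stored in plain 4-bit form, which is an assumption and not something the log confirms.

```python
# Shard sizes in GB, copied from the download progress bars above.
shard_sizes_gb = [4.98, 4.99, 5.00, 4.99, 2.04]
total_gb = round(sum(shard_sizes_gb), 2)

# Naive estimate: 27e9 parameters at 4 bits (0.5 bytes) each, in decimal GB.
naive_4bit_gb = 27e9 * 0.5 / 1e9

print(f"Downloaded checkpoint: {total_gb} GB")
print(f"Naive 4-bit estimate:  {naive_4bit_gb} GB")
```

The downloaded checkpoint is larger than the naive estimate. A plausible explanation, stated here as an assumption, is that some layers are kept in higher precision and that quantization metadata adds overhead.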

### Low Rank Adaptation (LoRA)

We now add LoRA adapters so that only a small fraction of the parameters needs to be updated. The language layers, attention modules, and MLP modules are fine-tuned; the vision layers stay frozen because they do not need updating here.

We choose rank 16 as a balance between accuracy and the compute required for fine-tuning.



In [4]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False,
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,

    r = 16,
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.language_model.model` require gradients
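
To give a feel for why LoRA trains so few parameters, here is a minimal sketch of the parameter count for a single adapted weight matrix. The 4096 x 4096 layer size is hypothetical and chosen only for illustration; it is not Gemma-3's real dimension. The last line reuses the trainable-parameter figure that the training banner reports later in this notebook.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scale(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r

# Hypothetical 4096 x 4096 projection, NOT Gemma-3's real layer size.
full = 4096 * 4096
extra = lora_extra_params(4096, 4096, r=16)
print(extra, f"{extra / full:.2%} of the full matrix")
print("scale:", lora_scale(lora_alpha=16, r=16))

# Trainable share reported in the training banner of this notebook.
trainable_pct = round(113_516_544 / 27_000_000_000 * 100, 2)
print(trainable_pct, "% of all parameters")
```

With `lora_alpha` equal to `r`, the scale factor is 1, so the low-rank update is added to the frozen weight without extra amplification.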


### Log in to Hugging Face to load the dataset

In [5]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `first_steps` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `first_

<a name="Data"></a>
### Data Prep
Using the `Gemma-3` format for conversation style finetunes.

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
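
The format above can be approximated by hand to see how the turns are assembled. The function below is an illustrative sketch only; the real template ships with the tokenizer and should be used for training. Folding a system message into the first user turn is an assumption that mirrors the rendered sample shown later in this notebook, and `render_gemma3` is a hypothetical helper name.

```python
def render_gemma3(messages):
    """Hand-rolled approximation of the Gemma-3 chat format (illustration only)."""
    system = ""
    if messages and messages[0]["role"] == "system":
        # Assumption: the system prompt is prepended to the first user turn.
        system = messages[0]["content"] + "\n\n"
        messages = messages[1:]
    out = "<bos>"
    for i, msg in enumerate(messages):
        # Gemma-3 names the assistant role "model".
        role = "model" if msg["role"] == "assistant" else msg["role"]
        prefix = system if i == 0 else ""
        out += f"<start_of_turn>{role}\n{prefix}{msg['content']}<end_of_turn>\n"
    return out

rendered = render_gemma3([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
])
print(rendered)
```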



In [6]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [7]:
from datasets import load_dataset
dataset = load_dataset("deebak14/rhinoscript_ft_data_02", split = "train")

README.md:   0%|          | 0.00/345 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/613k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/349 [00:00<?, ? examples/s]

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [8]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Let's see what row 100 looks like!

In [9]:
dataset[100]

{'messages': [{'content': 'You are RhinoCodex, an expert assistant that writes clean, parametric Python scripts for Rhino 3D using only rhinoscriptsyntax and math.\n\nFor every modeling task, first outline a clear, step-by-step algorithm explaining how all geometry and variables are derived through user input or geometric relationships.\nAfter the pseudocode, pause to think carefully about the required operations. Analyze the pseudocode and reason about which RhinoPython functions are needed to implement the task.\nExplicitly identify the relevant functions before writing any code. Then write complete, reusable code that computes all values procedurally â€” never hard-code geometry or use memorized examples.',
   'role': 'system'},
  {'content': 'Design an urban canopy network whose density and height vary based on real-time pedestrian density data collected over time.',
   'role': 'user'},
  {'content': '### Reasoning & Steps\nData-driven urban systems can respond to human behavior, c

We now have to apply the chat template for `Gemma-3` onto the conversations, and save it to `text`

In [10]:
def apply_chat_template(examples):
    # Render each conversation in the batch into a single Gemma-3 formatted string.
    texts = tokenizer.apply_chat_template(examples["messages"])
    return { "text" : texts }

dataset = dataset.map(apply_chat_template, batched = True)

Map:   0%|          | 0/349 [00:00<?, ? examples/s]

Let's see how the chat template did! Notice that the `Gemma-3` template adds a `<bos>` by default!

In [11]:
dataset[100]["text"]

'<bos><start_of_turn>user\nYou are RhinoCodex, an expert assistant that writes clean, parametric Python scripts for Rhino 3D using only rhinoscriptsyntax and math.\n\nFor every modeling task, first outline a clear, step-by-step algorithm explaining how all geometry and variables are derived through user input or geometric relationships.\nAfter the pseudocode, pause to think carefully about the required operations. Analyze the pseudocode and reason about which RhinoPython functions are needed to implement the task.\nExplicitly identify the relevant functions before writing any code. Then write complete, reusable code that computes all values procedurally â€” never hard-code geometry or use memorized examples.\n\nDesign an urban canopy network whose density and height vary based on real-time pedestrian density data collected over time.<end_of_turn>\n<start_of_turn>model\n### Reasoning & Steps\nData-driven urban systems can respond to human behavior, creating comfort-enhancing structures 

<a name="Train"></a>
### Train the model
We use Hugging Face TRL's `SFTTrainer`. Training runs for 2 epochs; the hyperparameters set explicitly are listed in the `SFTConfig` below, and everything else is left at its default.

In [12]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 2, # Number of full passes over the training set.
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=12):   0%|          | 0/349 [00:00<?, ? examples/s]
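
The batch settings above determine the effective batch size and the number of optimizer steps. The sketch below reproduces the 86 total steps that the training banner reports later in this notebook. The floor-based rule is an assumption about how the trainer counts steps; the exact rounding can differ between library versions, and `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps, assuming leftover micro-batches per epoch are floored away."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

# per_device_train_batch_size x gradient_accumulation_steps x number of GPUs
effective_batch = 2 * 4 * 1
total_steps = optimizer_steps(num_examples=349, per_device_batch=2, grad_accum=4, epochs=2)
print(effective_batch, total_steps)
```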

### Calculate loss only for the output

We use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!

In [13]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=12):   0%|          | 0/349 [00:00<?, ? examples/s]
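
Conceptually, response-only training sets the label of every prompt token to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain integer lists for a single-turn conversation. It is a simplified illustration, not Unsloth's implementation, and the token IDs and the `mask_before_response` helper are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_before_response(input_ids, response_marker):
    """Return labels where everything up to and including the first response marker is ignored."""
    n, m = len(input_ids), len(response_marker)
    for start in range(n - m + 1):
        if input_ids[start:start + m] == response_marker:
            cut = start + m
            return [IGNORE_INDEX] * cut + input_ids[cut:]
    # No marker found: ignore the whole sequence.
    return [IGNORE_INDEX] * n

# Hypothetical token IDs: [9, 8] stands in for the tokens of "<start_of_turn>model\n".
labels = mask_before_response([1, 2, 3, 9, 8, 4, 5], response_marker=[9, 8])
print(labels)
```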

To verify that the instruction part is masked, we first print the tokenized row 100 again.

In [14]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<bos><bos><start_of_turn>user\nYou are RhinoCodex, an expert assistant that writes clean, parametric Python scripts for Rhino 3D using only rhinoscriptsyntax and math.\n\nFor every modeling task, first outline a clear, step-by-step algorithm explaining how all geometry and variables are derived through user input or geometric relationships.\nAfter the pseudocode, pause to think carefully about the required operations. Analyze the pseudocode and reason about which RhinoPython functions are needed to implement the task.\nExplicitly identify the relevant functions before writing any code. Then write complete, reusable code that computes all values procedurally â€” never hard-code geometry or use memorized examples.\n\nDesign an urban canopy network whose density and height vary based on real-time pedestrian density data collected over time.<end_of_turn>\n<start_of_turn>model\n### Reasoning & Steps\nData-driven urban systems can respond to human behavior, creating comfort-enhancing struct

Now we print the labels for the same row, with the masked-out tokens replaced by spaces.

In [15]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                                                                                                                                     ### Reasoning & Steps\nData-driven urban systems can respond to human behavior, creating comfort-enhancing structures in dense city zones.\n\n1. Input site boundary and pedestrian heat map dataset.\n2. Generate base canopy grid across the site.\n3. For each canopy module:\n- Increase density and lower height where pedestrian density is high.\n- Thin out and increase clearance in low-use areas.\n4. Optimize for shading and circulation flow.\n5. Output adaptive canopy geometry.\n\n### Thinking\nAnalyzing the provided pseudocode and referencing the available RhinoPython functions...\nTo implement the above steps for designing an adaptive urban canopy network based on pedestrian density data, the following RhinoPython functions are needed:\n\n- rs.AddSrfPtGrid(count: (int, int), points: list[point], degree: (int, int) = (3, 3)

In [16]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
21.0 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

In [17]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 349 | Num Epochs = 2 | Total steps = 86
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 113,516,544/27,000,000,000 (0.42% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.1851
2,1.1258
3,1.0295
4,1.0782
5,1.0034
6,1.0423
7,0.8265
8,0.9659
9,0.8952
10,0.7772


In [18]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1843.8124 seconds used for training.
30.73 minutes used for training.
Peak reserved memory = 33.695 GB.
Peak reserved memory for training = 12.695 GB.
Peak reserved memory % of max memory = 85.181 %.
Peak reserved memory for training % of max memory = 32.093 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

Using `TextStreamer` for continuous inference - to see the generation token by token, instead of waiting the whole time!
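
To see what the three sampling settings do, here is a simplified sketch of temperature scaling followed by top-k and top-p filtering on a toy list of logits. It illustrates the general technique and is not the code path that `model.generate` uses; the helper name `filter_probs` and the toy logits are hypothetical.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep the k most likely tokens and renormalize.
    head = probs[:top_k]
    head_total = sum(p for p, _ in head)
    head = [(p / head_total, i) for p, i in head]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in head:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}

toy_logits = [2.0, 1.0, 0.0]
print(filter_probs(toy_logits, temperature=1.0, top_k=2, top_p=0.95))
print(filter_probs(toy_logits, temperature=0.1, top_k=2, top_p=0.95))
```

On these toy logits, a temperature of 1.0 leaves two candidate tokens, while a temperature of 0.1 sharpens the distribution so much that only the top token survives. The cell below passes `temperature = 0.1`, so its sampling behaves close to greedy decoding.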

In [22]:
messages = [{
        "role": "system",
        "content": [{"type": "text", "text": "You are RhinoCodex, an expert assistant that writes clean, parametric Python scripts for Rhino 3D using only rhinoscriptsyntax and math.For every modeling task, first outline a clear, step-by-step algorithm explaining how all geometry and variables are derived through user input or geometric relationships.After the pseudocode, pause to think carefully about the required operations. Analyze the pseudocode and reason about which RhinoPython functions are needed to implement the task. Explicitly identify the relevant functions before writing any code. Then write complete, reusable code that computes all values procedurally — never hard-code geometry or use memorized examples."}]
    },
    {
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "create a circle along a curve",
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 16000, # Increase for longer outputs!
    # temperature is lowered from the recommended 1.0 to 0.1 here;
    # top_p and top_k follow the Gemma-3 recommendation.
    temperature = 0.1, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

### Reasoning & Steps
Rolling a circle along a curve generates a pipe-like form useful in railings, pipes, or architectural profiles.

1. Define a curve path.
2. Define a circle profile.
3. Sweep the circle along the curve.
4. Output the resulting pipe surface.

### Thinking
Analyzing the provided pseudocode and referencing the available RhinoPython functions...
To implement the above steps of creating a circle along a curve and generating a pipe-like form, the key operation is to sweep the circle profile along the curve path. This is typically done using the AddSweep1 function, which allows sweeping a rail curve with one or more rail curves and a shape curve. In this case, the curve path will be the rail curve, and the circle profile will be the shape curve. Therefore, the relevant RhinoPython function is:\n\nrs.AddSweep1(rail: list[guid], shapes: list[guid], closed: bool=False) -> guid"

### Code
```python
import rhinoscriptsyntax as rs

# Step 1: Get the curve path from the user
cur