# Project Pipeline Demonstration (Live Code)

Welcome to the live demonstration notebook for the **Gemma 3n Fine-Tuning for Emergency Assistance** project.

### The Purpose of This Notebook & An Important Note on the Model

The full project, which uses the **multimodal `unsloth/gemma-3n-E2B-it` model**, is designed to run within a Docker container. This is crucial for handling the complex dependencies of its vision components.

Replicating this specific environment with `pip` in a live notebook can be unreliable. To keep the demonstration reliable from end to end, **this notebook is text-only**: the code below loads `unsloth/gemma-3n-E2B-it` (see `BASE_MODEL_NAME` in Step 2) and passes it text prompts only, with no image inputs.

This lets us demonstrate the **same code pipeline** (data processing, LoRA fine-tuning, model merging, and inference) without hitting environment-specific errors related to the vision tower. The logic shown here is the logic used in the main project.

## Step 0: Project Architecture Overview

Before we begin, here is the layout of the project whose pipeline this notebook reproduces.

```
gemma_local_trainer/
├── Dockerfile             # For the main training environment
├── Dockerfile.convert     # For GGUF conversion
├── README.md
├── requirements.txt
├── data/                  # Full dataset location
├── models/                # Full model artifact location
├── scripts/
│   ├── convert_to_gguf.py
│   └── inference_gguf.py
└── src/
    ├── __init__.py
    ├── config.py
    ├── inference.py
    ├── train_pipeline.py
    └── utils.py
```

## Step 1: Installing Dependencies

The first cell below configures PyTorch through environment variables; the second installs the Python libraries needed for this demonstration.

*(Note: `%%capture` is used to hide the lengthy installation output.)*

In [1]:
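import os

# Set these before importing torch so they are picked up at import time.
os.environ["TORCH_LOGS"] = "+dynamo"               # verbose TorchDynamo logging
os.environ["TORCHDYNAMO_VERBOSE"] = "1"            # more detailed Dynamo error output
os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"  # single-threaded Inductor compilation
os.environ["TORCH_COMPILE_DISABLE"] = "1"          # disable torch.compile for this demo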

In [2]:
%%capture
# We install the necessary packages. Note that the version constraints that worked
# in the Dockerfile might conflict with a live notebook's pre-installed packages.
# This simpler installation is more robust for a demo environment.
!pip install "unsloth[cu121-ampere-torch23]" "transformers" "datasets" "trl" "peft"
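
Because a live notebook may already ship its own versions of these packages, it can help to check what actually got installed before moving on. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the pip command above.
REQUIRED = ["unsloth", "transformers", "datasets", "trl", "peft"]

def installed_versions(names):
    """Return {package: version string, or None if the package is missing}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:<14}{ver or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry here usually means the `pip` cell failed silently, which `%%capture` would otherwise hide.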

## Step 2: Setup and Lightweight Configuration

Now, we'll import libraries, define a lightweight configuration for our demo, and create the temporary dataset. All files will be placed in `_demo` directories to avoid cluttering the main project.

In [3]:
from unsloth import FastModel, is_bfloat16_supported
import gc
import shutil
import torch
from pathlib import Path
from datasets import load_dataset
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
from trl import SFTTrainer, SFTConfig

# --- DEMO-SPECIFIC Configuration ---
# We create temporary directories for this run
PROJECT_ROOT = Path("./").resolve()
MODELS_DEMO_DIR = PROJECT_ROOT / "models_demo"
DATA_DEMO_DIR = PROJECT_ROOT / "data_demo"

# Demo paths
DATASET_PATH = DATA_DEMO_DIR / "emergency_dataset_demo.jsonl"
LORA_ADAPTERS_PATH = MODELS_DEMO_DIR / "gemma_3n_demo_adapters"
MERGED_MODEL_PATH = MODELS_DEMO_DIR / "gemma_3n_demo_merged"

# Lightweight Training Hyperparameters for a quick run
DEMO_TRAINING_ARGS = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "warmup_steps": 1,
    "max_steps": 2,  # Only 2 optimizer steps so the demo finishes quickly
    "learning_rate": 2e-4,
    "logging_steps": 1,
    "optim": "paged_adamw_8bit",
    "seed": 3407,
    "report_to": "none",
}

# Standard model and LoRA config from your project
BASE_MODEL_NAME = "unsloth/gemma-3n-E2B-it"
MAX_SEQ_LENGTH = 1024
DTYPE = torch.bfloat16 if is_bfloat16_supported() else torch.float16
LORA_CONFIG = {
    "r": 8, # Smaller rank for faster demo
    "lora_alpha": 16,
    "lora_dropout": 0,
    "bias": "none",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# --- Create Demo Files and Directories ---
DATA_DEMO_DIR.mkdir(exist_ok=True)
MODELS_DEMO_DIR.mkdir(exist_ok=True)

# Create the mock dataset file using your provided example
with open(DATASET_PATH, "w", encoding="utf-8") as f:  # records contain non-ASCII dashes
    f.write('{"instruction": "How do I treat a sprained ankle?", "output": "1. Rest the ankle and avoid putting weight on it.\\n2. Apply a cold pack wrapped in a cloth for 15–20 minutes at a time, several times a day.\\n3. Wrap the ankle with an elastic bandage for compression, but not so tight as to cut off circulation.\\n4. Elevate the ankle above heart level when possible to reduce swelling.\\n5. Seek medical attention if you cannot walk, the ankle looks deformed, or the pain is severe. Call 911 for emergency care if there is a bone protruding or you suspect a fracture."}\n')
    f.write('{"instruction": "If I or someone else has a bee sting, what steps do I follow?", "output": "1. Use the edge of a credit card or your fingernail to scrape the stinger out gently without squeezing it.\\n2. Wash the area with soap and water.\\n3. Apply a cold pack to reduce swelling and pain.\\n4. Consider taking an over‑the‑counter antihistamine for itching or a pain reliever for discomfort.\\n5. Watch for signs of an allergic reaction such as difficulty breathing, swelling of the face or throat; call 911 immediately if they occur."}\n')

print("Demo environment is set up and ready.")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Demo environment is set up and ready.
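
The hand-escaped strings above work, but they are easy to get wrong. As an alternative, here is a small sketch that builds the same kind of JSONL file with `json.dumps` and reads it back to confirm every row has the `instruction` and `output` fields the pipeline expects. The helper names `write_jsonl` and `read_jsonl` are hypothetical, and the records are shortened stand-ins for the two demo examples.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(records, path):
    """Write one JSON object per line; json.dumps handles all escaping."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read the file back and check each row has the fields the pipeline uses."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            row = json.loads(line)
            missing = {"instruction", "output"} - row.keys()
            if missing:
                raise ValueError(f"line {line_no} is missing {sorted(missing)}")
            rows.append(row)
    return rows

# Shortened stand-ins for the two demo records.
records = [
    {"instruction": "How do I treat a sprained ankle?",
     "output": "1. Rest the ankle.\n2. Apply a cold pack for 15–20 minutes."},
    {"instruction": "What do I do for a bee sting?",
     "output": "1. Scrape the stinger out.\n2. Wash the area."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "emergency_dataset_demo.jsonl"
    write_jsonl(records, path)
    round_tripped = read_jsonl(path)

print(round_tripped == records)
```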


## Step 3: Fine-Tuning (Live Run)

This cell executes the **real fine-tuning code** from `train_pipeline.py`. With the lightweight configuration it runs only two training steps, which is enough to exercise the data processing, training loop, and model-saving logic end to end.

In [4]:
# --- This is the core logic from your project, now with all optimizations ---

print("--- STAGE 1: Starting QLoRA Fine-tuning (Live Demo Run) ---")

# 1. Load dataset
dataset = load_dataset("json", data_files=str(DATASET_PATH), split="train")

# 2. Load 4-bit model
model, tokenizer = FastModel.from_pretrained(
    model_name=BASE_MODEL_NAME, max_seq_length=MAX_SEQ_LENGTH, dtype=DTYPE, load_in_4bit=True,
)

# 3. Apply LoRA (with Unsloth's gradient checkpointing)
model = FastModel.get_peft_model(
    model,
    **LORA_CONFIG,
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving gradient checkpointing
)

# 4. Format data
def format_chat_template(sample: dict) -> dict:
    messages = [
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": sample["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)}
# os.cpu_count() can return None or 1, so always request at least one worker
formatted_dataset = dataset.map(format_chat_template, num_proc=max(1, (os.cpu_count() or 2) // 2))

# 5. Train the model (with packing=True)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    packing=True,  # Pack short samples together into full-length sequences
    max_seq_length=MAX_SEQ_LENGTH,
    args=SFTConfig(
        output_dir=str(LORA_ADAPTERS_PATH),
        bf16=is_bfloat16_supported(),
        fp16=not is_bfloat16_supported(),
        **DEMO_TRAINING_ARGS
    ),
)
trainer.train()

print("\nSaving final adapter model...")
trainer.model.save_pretrained(str(LORA_ADAPTERS_PATH))
tokenizer.save_pretrained(str(LORA_ADAPTERS_PATH))  # Also save the tokenizer

print(f"\nLoRA adapters saved to: {LORA_ADAPTERS_PATH}")

--- STAGE 1: Starting QLoRA Fine-tuning (Live Demo Run) ---


Generating train split: 0 examples [00:00, ? examples/s]

==((====))==  Unsloth 2025.8.1: Fast Gemma3N patching. Transformers: 4.53.3.
   \\   /|    NVIDIA GeForce RTX 3080 Ti Laptop GPU. Num GPUs = 1. Max memory: 16.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3N does not support SDPA - switching to eager!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Unsloth: Making `model.base_model.model.model.language_model` require gradients


num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.


Map (num_proc=2):   0%|          | 0/2 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/2 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2 | Num Epochs = 1 | Total steps = 2
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1
 "-____-"     Trainable parameters = 4,079,616 of 5,443,517,888 (0.07% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
1,9.1731
2,8.8183


Unsloth: Will smartly offload gradients to save VRAM!

Saving final adapter model...

LoRA adapters saved to: /app/models_demo/gemma_3n_demo_adapters
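
The banner above reports that only a small fraction of the parameters is trainable. The sketch below shows where such numbers come from: LoRA adds two small matrices per adapted linear layer, so the extra parameter count is `r * (d_in + d_out)`. The layer size used in the first call is hypothetical and chosen only to illustrate the formula; the second calculation uses the figures printed in the log.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)

# Hypothetical layer size, for illustration only.
print(lora_param_count(d_in=2048, d_out=2048, r=8))

# Figures reported in the Unsloth banner above.
trainable, total = 4_079_616, 5_443_517_888
print(f"{100 * trainable / total:.2f}% trained")
```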


## Step 4: Merging the Model (Live Run)

Now, we execute the second part of our training pipeline: merging the trained LoRA adapters into the base model to create a final, standalone artifact.

In [5]:
# --- This is the merging logic from src/train_pipeline.py ---

print("--- STAGE 2: Starting Model Merge (Live Demo Run) ---")

# Clear memory first
del model, trainer
gc.collect()
torch.cuda.empty_cache()

# Load base model in 16-bit precision (no 4-bit quantization) so the adapters can be merged
model, tokenizer = FastModel.from_pretrained(
    model_name=BASE_MODEL_NAME, max_seq_length=MAX_SEQ_LENGTH, dtype=DTYPE, load_in_4bit=False,
)

# Attach adapters and merge
model = PeftModel.from_pretrained(model, str(LORA_ADAPTERS_PATH))
model = model.merge_and_unload()

# Save the final merged model
model.save_pretrained(str(MERGED_MODEL_PATH))
tokenizer.save_pretrained(str(MERGED_MODEL_PATH))

print(f"Final merged model saved to: {MERGED_MODEL_PATH}")

--- STAGE 2: Starting Model Merge (Live Demo Run) ---
==((====))==  Unsloth 2025.8.1: Fast Gemma3N patching. Transformers: 4.53.3.
   \\   /|    NVIDIA GeForce RTX 3080 Ti Laptop GPU. Num GPUs = 1. Max memory: 16.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3N does not support SDPA - switching to eager!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu.
Some parameters are on the meta device because they were offloaded to the cpu.


Saving checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Final merged model saved to: /app/models_demo/gemma_3n_demo_merged
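
Conceptually, merging folds each adapter into its base weight as `W' = W + (lora_alpha / r) * B @ A`, after which the adapter matrices are no longer needed. The toy sketch below illustrates that arithmetic with plain Python lists; it is not PEFT's implementation, and the matrices are made-up values.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, lora_alpha, r):
    """Fold a LoRA update into the base weight: W' = W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Toy 2x2 layer with a rank-1 adapter (values are made up for illustration).
W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, shape (d_out, d_in)
A = [[1.0, 2.0]]               # shape (r, d_in)
B = [[0.5], [0.25]]            # shape (d_out, r)
x = [[1.0], [1.0]]             # column-vector input

merged = merge_lora(W, A, B, lora_alpha=2, r=1)

# The merged layer should match the base output plus the scaled adapter output.
scale = 2 / 1
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
expected = [[b[0] + scale * a[0]] for b, a in zip(base_out, adapter_out)]
print(matmul(merged, x) == expected)
```

This is also why the base model is reloaded without 4-bit quantization before merging: the update is added directly to the weights.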


## Step 5: Inference with the Merged Model (Live Run)

The pipeline is complete! We now have a fine-tuned model saved in `models_demo/`. Let's run a live inference call.

**Note:** Since we only trained for 2 steps on 2 examples, the fine-tuning has almost no visible effect, so the response should read much like the base model's. **This is expected**: the goal of this cell is only to show that the merged model loads and generates text end to end.

In [6]:
# --- This is the inference logic from src/inference.py ---

print("--- INFERENCE (Live Demo Run) ---")

# 1. Load the merged model and tokenizer
model = AutoModelForCausalLM.from_pretrained(str(MERGED_MODEL_PATH), torch_dtype=DTYPE, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(str(MERGED_MODEL_PATH))

# 2. Prepare prompt
system_prompt = "You are a helpful assistant for emergency situations."
user_prompt = "How do I treat a sprained ankle?"
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
# Move inputs to wherever device_map="auto" placed the model instead of hard-coding "cuda"
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# 3. Generate response
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids, streamer=text_streamer, max_new_tokens=50)

print("\n\nInference call complete.")

--- INFERENCE (Live Demo Run) ---


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu.


<bos><start_of_turn>user
You are a helpful assistant for emergency situations.

How do I treat a sprained ankle?<end_of_turn>
<start_of_turn>model
Okay, I understand you're looking for information on how to treat a sprained ankle.  **I must preface this with a very important disclaimer:**  **I am an AI and cannot provide medical advice. This information is for general knowledge and


Inference call complete.
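
The echoed prompt above shows what the chat template did with our messages: the system prompt was folded into the first user turn, and the generation prompt opens a `model` turn. The sketch below approximates that formatting in plain Python so the structure is easier to see. It is inferred from the output above rather than taken from the tokenizer's actual template, and `render_gemma_prompt` is a hypothetical helper.

```python
def render_gemma_prompt(messages, add_generation_prompt=True):
    """Approximate the chat format seen in the output above.

    Assumption: a leading system message is prepended to the first user turn,
    and the assistant role is written as "model".
    """
    system_prefix = ""
    if messages and messages[0]["role"] == "system":
        system_prefix = messages[0]["content"].strip() + "\n\n"
        messages = messages[1:]

    text = "<bos>"
    for i, message in enumerate(messages):
        role = "model" if message["role"] == "assistant" else message["role"]
        prefix = system_prefix if i == 0 else ""
        text += f"<start_of_turn>{role}\n{prefix}{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        text += "<start_of_turn>model\n"
    return text

prompt = render_gemma_prompt([
    {"role": "system", "content": "You are a helpful assistant for emergency situations."},
    {"role": "user", "content": "How do I treat a sprained ankle?"},
])
print(prompt)
```

In real code, `tokenizer.apply_chat_template` remains the source of truth; this sketch is only a reading aid.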


## Step 6: Cleanup

To keep the project directory clean, this cell removes the temporary files and folders created so far.

In [7]:
# Clean up the directories created for this demo
try:
    shutil.rmtree(MODELS_DEMO_DIR)
    shutil.rmtree(DATA_DEMO_DIR)
    print("✅ Demo directories successfully cleaned up.")
except OSError as e:
    print(f"Error during cleanup: {e}")

✅ Demo directories successfully cleaned up.


## Step 7: Converting the Model to GGUF (Demonstration)

The final step in our pipeline is to convert the fine-tuned model into the highly efficient GGUF format. This makes the model portable and allows it to run on a wide range of hardware (including CPUs) using tools like `llama.cpp`.

To keep our main training environment clean, this process uses a separate, lightweight Docker image (`gguf-converter`) which contains all the necessary compilation tools.

Below, we simulate this conversion process. The actual commands are shown for reference.

In [9]:
# --- This cell simulates the GGUF conversion and testing process ---

# In a real run, you would execute these commands in your terminal:
# 1. Build the converter image:
#    docker build -t gguf-converter -f Dockerfile.convert .
#
# 2. Run the conversion script:
#    docker run -it --rm -v "$(pwd)/models_demo/gemma_3n_demo_merged:/app/model_input:ro" -v "$(pwd)/models_demo/gguf:/app/model_output" -v "$(pwd)/scripts/convert_to_gguf.py:/app/convert_to_gguf.py" gguf-converter python /app/convert_to_gguf.py

GGUF_MODELS_PATH = MODELS_DEMO_DIR / "gguf"
GGUF_MODELS_PATH.mkdir(parents=True, exist_ok=True)

print("--- SIMULATING GGUF CONVERSION ---")
print(f"Input model directory: {MERGED_MODEL_PATH}")
print(f"Output directory for GGUF files: {GGUF_MODELS_PATH}")

import time  # used only to simulate work in this cell

print("\nStep 1: Converting to F16 GGUF...")
time.sleep(1) # Simulate work
print("Step 2: Quantizing to Q4_K_M...")
time.sleep(1)

# Create a dummy GGUF file to show the result
dummy_gguf_file = GGUF_MODELS_PATH / "gemma-finetuned-Q4_K_M.gguf"
with open(dummy_gguf_file, "w") as f:
    f.write("This is a dummy GGUF file.")

print(f"\n✅ Simulation complete. Dummy GGUF file created at: {dummy_gguf_file}")

print("\n--- SIMULATING GGUF INFERENCE TEST ---")
# In a real run, you would execute:
#    docker run -it --rm -v "$(pwd)/models_demo/gguf:/app/models:ro" -v "$(pwd)/scripts/inference_gguf.py:/app/inference_gguf.py" gguf-converter python /app/inference_gguf.py

print(f"Loading dummy model from {dummy_gguf_file}...")
time.sleep(0.5)
print("> User Prompt: How do I treat a sprained ankle?")
print("\n< Model Response (simulated from GGUF):")
print("1. Rest the ankle.\n2. Apply a cold pack.")

--- SIMULATING GGUF CONVERSION ---
Input model directory: /app/models_demo/gemma_3n_demo_merged
Output directory for GGUF files: /app/models_demo/gguf

Step 1: Converting to F16 GGUF...
Step 2: Quantizing to Q4_K_M...

✅ Simulation complete. Dummy GGUF file created at: /app/models_demo/gguf/gemma-finetuned-Q4_K_M.gguf

--- SIMULATING GGUF INFERENCE TEST ---
Loading dummy model from /app/models_demo/gguf/gemma-finetuned-Q4_K_M.gguf...
> User Prompt: How do I treat a sprained ankle?

< Model Response (simulated from GGUF):
1. Rest the ankle.
2. Apply a cold pack.
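
Since the file produced above is only a text placeholder, a quick sanity check can tell it apart from a real conversion result: genuine GGUF files begin with the four magic bytes `GGUF`. The sketch below is a minimal check under that assumption; `looks_like_gguf` is a hypothetical helper, and the second test file contains only the magic plus placeholder bytes, not a valid model.

```python
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a real GGUF file

def looks_like_gguf(path) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC

with tempfile.TemporaryDirectory() as tmp:
    # Same content as the dummy file written by the simulation cell.
    dummy = Path(tmp) / "dummy.gguf"
    dummy.write_text("This is a dummy GGUF file.")
    # Magic followed by placeholder bytes; enough to pass this check only.
    header_only = Path(tmp) / "header_only.gguf"
    header_only.write_bytes(GGUF_MAGIC + b"\x00\x00\x00\x00")

    dummy_ok = looks_like_gguf(dummy)
    header_ok = looks_like_gguf(header_only)

print(dummy_ok, header_ok)
```

Passing this check does not prove a file is loadable; it only catches the obvious case of a placeholder or truncated download.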


In [10]:
# Clean up all directories created for this demo
try:
    if os.path.exists(MODELS_DEMO_DIR):
        shutil.rmtree(MODELS_DEMO_DIR)
    if os.path.exists(DATA_DEMO_DIR):
        shutil.rmtree(DATA_DEMO_DIR)
    print("✅ All demo directories successfully cleaned up.")
except OSError as e:
    print(f"Error during cleanup: {e}")

✅ All demo directories successfully cleaned up.
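
Cleanup cells like the one above are skipped if an earlier cell raises. One possible alternative is to tie the directories' lifetime to a context manager so they are removed even after an error. The sketch below assumes the same two directory names as the demo; `demo_dirs` is a hypothetical helper, and the example runs inside a throwaway temporary root.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def demo_dirs(root, names=("models_demo", "data_demo")):
    """Create the demo directories and remove them on exit, even after an error."""
    paths = [Path(root) / name for name in names]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    try:
        yield paths
    finally:
        for path in paths:
            shutil.rmtree(path, ignore_errors=True)

with tempfile.TemporaryDirectory() as root:
    with demo_dirs(root) as (models_dir, data_dir):
        (data_dir / "emergency_dataset_demo.jsonl").write_text("{}\n")
        existed_inside = models_dir.exists() and data_dir.exists()
    removed_after = not models_dir.exists() and not data_dir.exists()

print(existed_inside, removed_after)
```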
