# Local Fine-Tuning: Nutritionist Agent for Glucose Spike Analysis

This project focuses on fine-tuning a **Llama 3.2 3B** model to act as a specialized Nutrition Analyst. The model is trained to predict metabolic impacts directly from natural language meal descriptions, eliminating the need for external database lookups during inference.

## Key Objectives
* **Synthetic Data Generation:** Create high-quality instruction pairs for metabolic reasoning.
* **Efficient Fine-Tuning:** Utilize **QLoRA** and **Unsloth** for rapid local training.
* **Agentic Deployment:** Host the expert model locally via **Ollama** for seamless integration.

### 1. Synthetic Data Generation with Pydantic
To ensure high-quality training data, use Pydantic to enforce a structured schema for your synthetic nutritionist's thought process. You can use a larger model like Gemini 2.5 Flash as a "teacher" to generate thousands of examples of meal logs paired with analytical reasoning.

In [None]:
from pydantic import BaseModel, Field
from typing import List, Optional

class SyntheticNutritionRecord(BaseModel):
    meal_description: str = Field(description="Natural language description of the meal")
    glucose_impact: float = Field(description="Simulated glucose spike in mg/dL")
    analytical_reasoning: str = Field(description="Step-by-step logic linking food components to the spike")

# We wrap individual records in a collection for batch generation
class SyntheticBatch(BaseModel):
    records: List[SyntheticNutritionRecord]

In [3]:
import google.generativeai as genai

# Setup Gemini for structured output
model = genai.GenerativeModel('gemini-2.5-flash')
full_dataset = []

for i in range(50):  # Generate 50 batches of 10
    response = model.generate_content(
        "Generate 10 diverse synthetic nutrition logs. Themes: [Randomize Themes]",
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            response_schema=SyntheticBatch
        )
    )
    batch = SyntheticBatch.model_validate_json(response.text)
    full_dataset.extend(batch.records)

In [4]:
full_dataset[:2]

[SyntheticNutritionRecord(meal_description='Breakfast: Instant oatmeal with a banana.', glucose_impact=65.0, analytical_reasoning='The instant oatmeal contains processed carbohydrates that are quickly digested and absorbed, leading to a rapid rise in blood glucose. The banana adds simple sugars (fructose, glucose) which further contribute to the immediate spike. Though a small amount of fat and protein might be present, their buffering effect is minimal compared to the high glycemic load of the oatmeal and fruit.'),
 SyntheticNutritionRecord(meal_description='Lunch: Whole wheat pasta with chicken breast and tomato sauce.', glucose_impact=90.0, analytical_reasoning='This meal is high in complex carbohydrates from the whole wheat pasta, which will cause a significant glucose rise, albeit slower than simple sugars. The tomato sauce contains natural sugars, adding to the carbohydrate load. Chicken breast provides protein, and olive oil provides fat, both of which can slow glucose absorptio

In [5]:
import json

def export_to_gemini_jsonl(synthetic_records, output_file="gemini_finetune_data.jsonl"):
    """
    Converts a list of synthetic records into Gemini-compatible JSONL format.
    """
    with open(output_file, 'w', encoding='utf-8') as f:
        for record in synthetic_records:
            # Construct the Gemini-specific nested structure
            gemini_entry = {
                "contents": [
                    {
                        "role": "user", 
                        "parts": [{"text": record.meal_description}] # From your Pydantic model
                    },
                    {
                        "role": "model", 
                        "parts": [{"text": f"Reasoning: {record.analytical_reasoning}\nImpact: {record.glucose_impact} mg/dL"}]
                    }
                ]
            }
            # Write as a single line
            f.write(json.dumps(gemini_entry) + '\n')
    
    print(f"✅ Created {len(synthetic_records)} training examples in {output_file}")



### 3. Data Refinement & Filtering
Before fine-tuning,  filter synthetic data to ensure quality:

Heuristic Filtering: Remove exact duplicates or records with missing fields.

Self-Correction: Using second LLM pass (or the same teacher model) to "criticize" and correct the reasoning of the generated records.

Format for Fine-Tuning: Saved validated records in a JSONL format, where each line is a single JSON object containing an instruction-response pair.

In [6]:
export_to_gemini_jsonl(full_dataset)

✅ Created 500 training examples in gemini_finetune_data.jsonl


In [13]:
import json
import random
from sklearn.model_selection import train_test_split

def split_jsonl(full_dataset, val_size=0.1):
    """
    Splits a list of synthetic records into training and validation JSONL files.
    """
    # 1. Perform the split (90% Train, 10% Validation by default)
    # Using random_state ensures reproducibility for your research repo
    train_data, val_data = train_test_split(
        full_dataset, 
        test_size=val_size, 
        random_state=42, 
        shuffle=True
    )

    return train_data, val_data


In [14]:
train_data, val_data = split_jsonl(full_dataset)

In [12]:
export_to_gemini_jsonl(train_data, output_file="train.jsonl")
export_to_gemini_jsonl(val_data, output_file="val.jsonl")

✅ Created 450 training examples in train.jsonl
✅ Created 50 training examples in val.jsonl


## Fine-Tuning Methodology

To achieve high-performance results on a consumer-grade GPU, we employ the following optimization stack:

### 1. QLoRA (4-bit Quantization)
We use **4-bit quantization** combined with **Low-Rank Adaptation (LoRA)**. 
* **Benefit:** Reduces VRAM requirements significantly.
* **Impact:** A 3B parameter model can be trained with as little as **12GB–16GB of RAM**.

### 2. Unsloth Optimization
**Unsloth** is the industry standard for fast local fine-tuning.
* **Speed:** Training is typically **2x–3x faster** than standard methods.
* **Efficiency:** Reduces VRAM usage by up to **70%**.

### 3. Dataset Strategy
* **Size:** 100–500 high-quality synthetic examples.
* **Focus:** Teaches the model specific tool-calling patterns and metabolic reasoning logic.

In [17]:
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Configuration
max_seq_length = 2048 
dtype = None # Auto-detect (Float16 for older GPUs, Bfloat16 for RTX 30/40 series)
load_in_4bit = True # Essential for local consumer GPUs

# 2. Load Model & Tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# 3. Add LoRA Adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Rank: higher = more parameters but more memory
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0, # Optimized for 0
    bias = "none",    # Optimized for "none"
    use_gradient_checkpointing = "unsloth", # Reduces VRAM usage
    random_state = 3407,
)




Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


==((====))==  Unsloth 2025.12.9: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    NVIDIA GeForce RTX 5070 Laptop GPU. Num GPUs = 1. Max memory: 7.96 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.12.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [38]:
# 4. Data Preparation
# Load your local JSONL files
dataset = load_dataset("json", data_files={"train": "train.jsonl", "test": "val.jsonl"})

from unsloth.chat_templates import get_chat_template

# Ensure your tokenizer is using the correct Llama 3.2 template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.2", # Use llama-3.1 or llama-3.2 depending on your model
)

def formatting_prompts_func(examples):
    # 'contents' is the top-level list in your Gemini-style dataset
    all_contents = examples["contents"] 
    texts = []
    
    for conversation in all_contents:
        # 1. Map Gemini roles ('model') to Llama 3.2 roles ('assistant')
        formatted_messages = []
        for turn in conversation:
            role = "user" if turn["role"] == "user" else "assistant"
            # 2. Join all parts into a single string
            content = " ".join([part["text"] for part in turn["parts"]])
            formatted_messages.append({"role": role, "content": content})
        
        # 3. Apply the Llama 3.2 chat template
        # This adds the necessary <|begin_of_text|> and <|eot_id|> tokens
        rendered_text = tokenizer.apply_chat_template(
            formatted_messages, 
            tokenize = False, 
            add_generation_prompt = False
        )
        texts.append(rendered_text)
        
    return { "text" : texts }

# Apply the map 
dataset = dataset.map(formatting_prompts_func, batched = True)


Map: 100%|██████████████████████████████████████████████████████████████████| 450/450 [00:00<00:00, 13499.92 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 8165.21 examples/s]


In [39]:

# 5. Training Setup
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    eval_dataset = dataset["test"],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # Set to 60 for a quick test; use num_train_epochs=1 for full pass
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit", # Saves VRAM
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)


Map (num_proc=2): 100%|███████████████████████████████████████████████████████| 450/450 [00:00<00:00, 519.89 examples/s]
Map (num_proc=2): 100%|██████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 64.31 examples/s]


In [40]:

# 6. Execute Training
trainer_stats = trainer.train()

# 7. Local Saving
model.save_pretrained("lora_model") 
tokenizer.save_pretrained("lora_model")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 450 | Num Epochs = 2 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


Step,Training Loss
1,3.6647
2,3.8241
3,3.6734
4,3.5088
5,3.2151
6,2.9877
7,2.6012
8,2.3607
9,2.2277
10,2.0373


Unsloth: Will smartly offload gradients to save VRAM!


('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/tokenizer.json')

### Testing inference locally 

In [41]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # Load your locally saved model
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model) # Enable 2x faster inference



==((====))==  Unsloth 2025.12.9: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    NVIDIA GeForce RTX 5070 Laptop GPU. Num GPUs = 1. Max memory: 7.96 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
        (layers): ModuleList(
          (0-27): 28 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

In [42]:
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Analyze my meal: 2 slices of white bread and grape juice."}],
    tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64)
print(tokenizer.batch_decode(outputs)[0])

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Analyze my meal: 2 slices of white bread and grape juice.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Reasoning: White bread is a refined carbohydrate that is quickly broken down into glucose, leading to a rapid and significant increase. Grape juice, although natural, is also a sugar-rich beverage with minimal fiber content. This combination results in a very high and immediate glucose response.
Impact: 120.0 mg/dL<|eot_id|>


### Export Fine-Tuned Model to GGUF
Fine-tuning with Unsloth creates "LoRA adapters." To use them with a local server, you must merge these adapters with the base model and export them to the GGUF format.

In [45]:
# Merge and export to GGUF format for local hosting
model.save_pretrained_gguf("my_nutrition_model", tokenizer, quantization_method = "q4_k_m")

Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /home/amod/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model-00001-of-00002.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|███████████████████████████████████████| 2/2 [00:00<00:00, 27962.03it/s]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.23s/it]


Unsloth: Merge process complete. Saved to `/home/amod/synth_health/my_nutrition_model`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF bf16 might take 3 minutes.
\        /    [2] Converting GGUF bf16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: llama.cpp folder exists but binaries not found - will rebuild
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into bf16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion c

{'save_directory': 'my_nutrition_model',
 'gguf_files': ['Llama-3.2-3B-Instruct.Q4_K_M.gguf'],
 'modelfile_location': '/home/amod/synth_health/Modelfile',
 'want_full_precision': False,
 'is_vlm': False,
 'fix_bos_token': False}

#### Step 2: Create an Ollama Modelfile
Ollama uses a "Modelfile" to configure how a model should behave, including its system prompt and template.

In [46]:
%%writefile Modelfile
FROM ./Llama-3.2-3B-Instruct.Q4_K_M.gguf
SYSTEM "You are a professional Nutritionist Agent. When you receive food logs, use your fine-tuned logic to analyze glucose impact."
PARAMETER temperature 0.2

Overwriting Modelfile


### Phase 3: Deployment
The saved model is exported to GGUF format and hosted via **Ollama** to serve as the core "brain" for the nutrition agent.

### 1. Create the model in Ollama
`ollama create nutrition-agent -f Modelfile`

### 2. Start the local server
`ollama serve`

### Setup an agent to call our fine tuned model to analyze meal descriptions

In [49]:
import ollama

def nutrition_agent(user_meal):
    # The agent calls your specialized model directly
    # No external database lookup is needed because the model has 'learned' the spikes.
    response = ollama.chat(
        model='nutrition-agent',
        messages=[
            {'role': 'system', 'content': 'You are a Nutrition Analyst. Predict glucose spikes based on your fine-tuned knowledge.'},
            {'role': 'user', 'content': f"Analyze my meal: {user_meal}"}
        ]
    )
    
    # Return the direct prediction from the model's weights
    return response['message']['content']



In [52]:

meal_description = "granola with milk"
print(nutrition_agent(meal_description))

Reasoning: The high refined carbohydrate load (oats, white rice) and lactose sugar from milk. This is an extremely pure simple sugar absorption without any direct fat or fiber to buffer.
Impact: 95.0.0 mg/dL
