In [1]:
import os
import pathlib

project_root = pathlib.Path.cwd()

cache_dir = project_root / "models_cache"
cache_dir.mkdir(exist_ok=True)
os.environ['HF_HOME'] = str(cache_dir)
print(f"The HF_HOME environment variable has been set to: {os.environ['HF_HOME']}")


The HF_HOME environment variable has been set to: /Users/deimagjas/machinelearning/gemma3-finetunning/models_cache


In [2]:
import json
from typing import Dict, List, Tuple, Union

import mlx.optimizers as optim
from mlx.utils import tree_flatten
from mlx_lm import load, generate
from mlx_lm.tuner import TrainingArgs, linear_to_lora_layers, train

In [3]:
!uv pip show mlx_lm

Name: mlx-lm
Version: 0.28.2
Location: /Users/deimagjas/machinelearning/gemma3-finetunning/.venv/lib/python3.13/site-packages
Requires: jinja2, mlx, numpy, protobuf, pyyaml, transformers
Required-by:


In [4]:
from huggingface_hub import login

login()


## Loading the model from HF

The code works with google/gemma-3-270m-it but not with google/gemma-3-270m because of the difference between the
  two model types:

   1. `google/gemma-3-270m-it`: The "-it" suffix stands for "Instruction Tuned". This model has been
      trained specifically to understand and follow instructions in a chat or question-answer format. Its tokenizer
      includes a chat template that formats the input the way the model expects.

   2. `google/gemma-3-270m`: This is the base model. It is a pre-trained language model that is good at predicting the next
      token in a text, but it has not been tuned to follow instructions or hold a conversation. Its tokenizer has no default
      chat template.
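
To make the role of the chat template concrete, here is a minimal sketch that renders a message list into Gemma-style turn markup by hand. The function name `render_gemma_style_chat` and the exact placement of the `<start_of_turn>` and `<end_of_turn>` markers are assumptions for illustration (`<end_of_turn>` does show up in generated output later in this notebook). The real template ships with the tokenizer and may differ in detail, so `tokenizer.apply_chat_template` remains the call to use in practice.

```python
from typing import Dict, List


def render_gemma_style_chat(
    messages: List[Dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Hand-rolled approximation of a Gemma-style chat template (illustrative only)."""
    parts = []
    for message in messages:
        # Assumption: the assistant role is written as "model" in the markup.
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Leave a model turn open so generation continues as the assistant.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Hello"}]
rendered = render_gemma_style_chat(messages)
print(rendered)
```

A base model has never seen this markup during training, which is one plausible reason the same code path behaves differently with `google/gemma-3-270m`.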

In [5]:
from mlx_lm.sample_utils import make_sampler

sampler = make_sampler(temp=0.7, top_p=0.95, top_k=50)
model_path = "google/gemma-3-270m-it"
model, tokenizer = load(model_path)

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]
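
The three sampler settings above (`temp`, `top_p`, `top_k`) each narrow or reshape the next-token distribution. The sketch below shows one common way to combine them on a toy vocabulary, in plain Python. The function `filter_distribution`, the toy logits, and the order in which the filters are applied are assumptions for illustration; `make_sampler` may apply them differently internally.

```python
import math
from typing import Dict


def filter_distribution(
    logits: Dict[str, float], temp: float, top_k: int, top_p: float
) -> Dict[str, float]:
    """Return the renormalized distribution left after temperature, top-k and top-p."""
    # Temperature: divide logits before the softmax; temp < 1 sharpens the distribution.
    scaled = {tok: value / temp for tok, value in logits.items()}
    max_logit = max(scaled.values())
    exp = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    total = sum(exp.values())
    probs = {tok: value / total for tok, value in exp.items()}

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"the": 3.0, "a": 2.0, "an": 1.0, "zebra": -4.0}
candidates = filter_distribution(toy_logits, temp=0.7, top_k=50, top_p=0.95)
print(candidates)
```

With these toy numbers only "the" and "a" survive, because together they already cover more than 95% of the probability mass.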

In [6]:
prompt = "create a list of steps in order to help someone that has an anxiety attack"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    sampler=sampler,
)

Okay, here's a list of steps to help someone who has an anxiety attack. It's important to remember that this is a starting point and you may need to adjust the steps based on the individual's specific needs and the severity of their anxiety. **It's always best to consult with a mental health professional for personalized guidance.**
*   **Assess the Situation:**
    *   **Is it a mild anxiety attack?** If so, acknowledge it and let the person know you're there.
    *   **Is it a severe anxiety attack?** If so, it's important to be prepared for the possibility of a major reaction.
    *   **Is the anxiety attack severe?** If so, it's important to be observant and take steps to manage it.

*   **Communicate with the Person:**
    *   **Be calm and non-judgmental.**
    *   **Tell them you're there to support them.**
    *   **Explain what you're doing.**
    *   **Be specific about what you're doing to help.**
    *   **Avoid blaming or criticizing.**
    *   **Focus on their feelings an

In [8]:
from mlx_lm import convert

repo = model_path

# Convert the Hugging Face checkpoint to MLX format, quantizing the weights to 4 bits.
convert(
    repo,
    quantize=True,
    dtype="float16",
    q_bits=4,
    mlx_path="./gemma3-mlx",
)

[INFO] Loading


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

[INFO] Using dtype: float16
[INFO] Quantizing
[INFO] Quantized model with 4.502 bits per weight.
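
The log reports about 4.5 bits per weight even though `q_bits=4`. A likely explanation is the per-group overhead of group-wise affine quantization: each group of weights also stores a scale and a bias. The sketch below does that arithmetic, assuming a group size of 64 (the value shown in the printed model later in this notebook) and 16-bit scales and biases (an assumption consistent with `dtype="float16"`). The helper name is hypothetical.

```python
def effective_bits_per_weight(
    bits: int, group_size: int, scale_bits: int = 16, bias_bits: int = 16
) -> float:
    """Storage cost per weight for group-wise affine quantization."""
    # Every group of `group_size` weights shares one scale and one bias.
    overhead = (scale_bits + bias_bits) / group_size
    return bits + overhead


quantized_cost = effective_bits_per_weight(bits=4, group_size=64)
print(quantized_cost)  # 4.5
```

The small remainder up to the reported 4.502 presumably comes from tensors that stay unquantized, such as normalization weights; that part is a guess and is not checked here.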


In [9]:
model, tokenizer = load("./gemma3-mlx")
prompt = "create a list of steps in order to help someone that has an anxiety attack"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, verbose=True, sampler=sampler)

Here are the steps to help someone who has an anxiety attack:

1. **Gather your senses:**
   *   **Physical:**  Engage your body to check for any physical sensations, such as tightness in your chest, feeling of pressure in your stomach, or a feeling of being overwhelmed.  This could be a physical exam or a calming visualization.
   *   **Visual:**  Engage your eyes to check for any visual cues, such as a distorted reflection of a familiar object or a visual of a calming scene.  This could be a calming landscape or a calming visual.
   *   **Auditory:**  Engage your mind to check for any auditory cues, such as a sound of a siren or a sound of a calming sound.  This could be a calming sound or a calming sound.

2.  **Challenge your mind:**
   *   **Cognitive:**  Engage your cognitive abilities to process and analyze the information.  Is the problem unclear? Are there any conflicting thoughts?  Is there a need for validation or a need for acceptance?
   *   **Emotional:**  Engage your emo

# Creating the adapter

In [10]:
adapter_path = "adapters_gemma3"
os.makedirs(adapter_path, exist_ok=True)
adapter_config_path = os.path.join(adapter_path, "adapter_config.json")
adapter_file_path = os.path.join(adapter_path, "adapters.safetensors")

# LoRA config
The training hyperparameters are set here.

In [11]:
lora_config = {
    "num_layers": 4,
    "lora_parameters": {
        "rank": 4,
        "scale": 20.0,
        "dropout": 0.0,
    },
}
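
To show what `rank` and `scale` control, here is a toy LoRA forward pass in plain Python. It is a sketch under assumptions: the output is taken to be `x @ W + scale * (x @ A) @ B`, dropout is left out (it is 0.0 in the config above), the toy uses rank 1 instead of 4 to keep the matrices readable, and all names (`matmul`, `lora_forward`) are illustrative rather than part of `mlx_lm`.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(
    x: Matrix, w: Matrix, lora_a: Matrix, lora_b: Matrix, scale: float
) -> Matrix:
    """Base projection plus the scaled low-rank update."""
    base = matmul(x, w)
    update = matmul(matmul(x, lora_a), lora_b)
    return [
        [b + scale * u for b, u in zip(b_row, u_row)]
        for b_row, u_row in zip(base, update)
    ]


x = [[1.0, 2.0, 3.0]]                      # one input vector, 3 features
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # frozen base weight, 3 -> 2
lora_a = [[0.5], [0.0], [0.25]]            # 3 -> rank 1
zero_b = [[0.0, 0.0]]                      # B starts at zero in the usual LoRA setup
trained_b = [[1.0, -1.0]]                  # pretend values after training

before_training = lora_forward(x, w, lora_a, zero_b, scale=20.0)
after_training = lora_forward(x, w, lora_a, trained_b, scale=20.0)
print(before_training, after_training)
```

With `B` at zero the adapter leaves the base output untouched; once `B` is non-zero, a `scale` of 20.0 multiplies the update directly, so even small adapter weights can move the output a long way.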

In [12]:
with open(adapter_config_path, "w") as f:
    json.dump(lora_config, f, indent=4)

In [13]:
training_args = TrainingArgs(
    batch_size=1,
    adapter_file=adapter_file_path,
    iters=200,
    steps_per_eval=50,
    grad_checkpoint=True,
)
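
With `iters=200` and `steps_per_eval=50`, the training log in this notebook reports validation loss at iterations 1, 50, 100, 150 and 200, and saves adapters at 100 and 200. The sketch below reproduces that schedule. The rule itself (fire on the first iteration, on every multiple, and on the last iteration) is inferred from the log, and the helper `report_iterations` is hypothetical, not the trainer's actual code.

```python
from typing import List


def report_iterations(iters: int, every: int, include_first: bool = False) -> List[int]:
    """Iterations at which a periodic event fires, always including the last one."""
    fired = []
    for it in range(1, iters + 1):
        if (include_first and it == 1) or it % every == 0 or it == iters:
            fired.append(it)
    return fired


val_iters = report_iterations(iters=200, every=50, include_first=True)
save_iters = report_iterations(iters=200, every=100)
print(val_iters)   # [1, 50, 100, 150, 200]
print(save_iters)  # [100, 200]
```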

# Parameters and adapter
You see 210,944 trainable parameters instead of the model's full 270 million because you are not
  retraining the entire model. You are using a parameter-efficient fine-tuning (PEFT) technique called LoRA
  (Low-Rank Adaptation).

  This is how it works in your notebook:

   1. Freeze the base model: In the cell with id a3b86f5c, the first line is model.freeze(). This freezes all 270 million
      parameters of the Gemma model, so none of them are trainable.

   2. Inject LoRA adapters: The next line, linear_to_lora_layers(...), adds small low-rank adapter layers to
      certain parts of the model (in your case, to 4 transformer blocks, as set by num_layers in lora_config).

   3. Train only the adapters: Only the parameters of these small new adapters are trained. The figure 210,944 is the sum
      of all the parameters in the LoRA layers that were added.

  In short:

   * 270 million: The total size of the base model, which remains unchanged.
   * 210,944: The number of new, additional parameters you are training. These are the parameters that learn the new
     task (in this case, answering prompts in the style of the nvidia/HelpSteer responses) and adapt the original model's knowledge.

  This is the big advantage of LoRA: it lets you specialize a large model for a specific task quickly and with far
  fewer compute resources, since you only need to train a tiny fraction (<0.1%) of the total parameters.
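
The trainable-parameter count can be reproduced by hand. Each adapted linear layer gains two matrices, `A` (input_dims x rank) and `B` (rank x output_dims), so it adds `rank * (input_dims + output_dims)` parameters. The sketch below uses the layer shapes from the model printed by the next cell. Two things are assumptions: the shape of `up_proj`, which is cut off in that printout and is taken to mirror `gate_proj`, and the idea that all seven linear layers in each of the 4 adapted blocks receive an adapter.

```python
# (input_dims, output_dims) of the linear layers in one transformer block.
BLOCK_LINEAR_SHAPES = {
    "q_proj": (640, 1024),
    "k_proj": (640, 256),
    "v_proj": (640, 256),
    "o_proj": (1024, 640),
    "gate_proj": (640, 2048),
    "up_proj": (640, 2048),  # assumed: truncated in the printed model
    "down_proj": (2048, 640),
}


def lora_param_count(shapes: dict, rank: int, num_layers: int) -> int:
    """Each adapted layer adds A (in x rank) and B (rank x out)."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_layers


total = lora_param_count(BLOCK_LINEAR_SHAPES, rank=4, num_layers=4)
print(total)  # 210944
```

Under these assumptions the result matches the number the notebook prints. Restricting the adapters to the four attention projections over 8 blocks would give 163,840 instead.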

In [14]:
model.freeze()
linear_to_lora_layers(model, lora_config["num_layers"], lora_config["lora_parameters"])
num_train_params = sum(v.size for _, v in tree_flatten(model.trainable_parameters()))
print(f"Number of trainable parameters: {num_train_params}")
model.train()

Number of trainable parameters: 210944


Model(
  (model): Gemma3Model(
    (embed_tokens): QuantizedEmbedding(262144, 640, group_size=64, bits=4, mode=affine)
    (layers.0): TransformerBlock(
      (self_attn): Attention(
        (q_proj): QuantizedLinear(input_dims=640, output_dims=1024, bias=False, group_size=64, bits=4, mode=affine)
        (k_proj): QuantizedLinear(input_dims=640, output_dims=256, bias=False, group_size=64, bits=4, mode=affine)
        (v_proj): QuantizedLinear(input_dims=640, output_dims=256, bias=False, group_size=64, bits=4, mode=affine)
        (o_proj): QuantizedLinear(input_dims=1024, output_dims=640, bias=False, group_size=64, bits=4, mode=affine)
        (q_norm): RMSNorm()
        (k_norm): RMSNorm()
        (rope): RoPE(256, traditional=False)
      )
      (mlp): MLP(
        (gate_proj): QuantizedLinear(input_dims=640, output_dims=2048, bias=False, group_size=64, bits=4, mode=affine)
        (down_proj): QuantizedLinear(input_dims=2048, output_dims=640, bias=False, group_size=64, bits=4, mod

In [15]:
class Metrics:
    def __init__(self) -> None:
        self.train_losses: List[Tuple[int, float]] = []
        self.val_losses: List[Tuple[int, float]] = []

    def on_train_loss_report(self, info: Dict[str, Union[float, int]]) -> None:
        self.train_losses.append((info["iteration"], info["train_loss"]))

    def on_val_loss_report(self, info: Dict[str, Union[float, int]]) -> None:
        self.val_losses.append((info["iteration"], info["val_loss"]))

In [16]:
metrics = Metrics()

# Load data

In [17]:
import types
from mlx_lm.tuner.datasets import load_custom_hf_dataset

args = types.SimpleNamespace(
    hf_dataset={
        "path": "nvidia/HelpSteer",
        "train_split": "train[:1%]",
        # Validation is also carved out of the train split (its last 1%).
        "valid_split": "train[-1%:]",
        "prompt_feature": "prompt",
        "completion_feature": "response",
        "config": {},
    },
    mask_prompt=False,  # compute the loss over prompt and completion tokens
    train=True,
    test=False,
)
train_set, val_set, test_set = load_custom_hf_dataset(
    args=args,
    tokenizer=tokenizer,
)

Loading Hugging Face dataset nvidia/HelpSteer.
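
The `mask_prompt` flag decides which tokens count toward the training loss. The sketch below illustrates the idea with a simple 0/1 mask. It is a simplified picture that ignores the one-token shift between inputs and targets, and the helper `loss_mask` is hypothetical rather than part of `mlx_lm`.

```python
from typing import List


def loss_mask(prompt_len: int, total_len: int, mask_prompt: bool) -> List[int]:
    """1 where a token contributes to the loss, 0 where it is ignored."""
    if mask_prompt:
        return [0] * prompt_len + [1] * (total_len - prompt_len)
    return [1] * total_len


full_mask = loss_mask(prompt_len=3, total_len=7, mask_prompt=False)
completion_only = loss_mask(prompt_len=3, total_len=7, mask_prompt=True)
print(full_mask)        # [1, 1, 1, 1, 1, 1, 1]
print(completion_only)  # [0, 0, 0, 1, 1, 1, 1]
```

With `mask_prompt=False`, as in this notebook, the model is also trained to reproduce the prompt text, which may dilute the signal from the responses.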


In [18]:
print(f"Test set size: {len(test_set)}")
print(f"Validation set size: {len(val_set)}")
print(f"Training set size: {len(train_set)}")
print(f"test set: {test_set[:2]}")

Test set size: 0
Validation set size: 353
Training set size: 353
test set: []


In [19]:
from mlx_lm.tuner.datasets import CacheDataset

train_dataset = CacheDataset(train_set)
val_dataset = CacheDataset(val_set)

train(
    model,
    optim.Adam(learning_rate=1e-5),
    train_dataset,
    val_dataset,
    args=training_args,
    training_callback=metrics
)


Starting training..., iters: 200


Calculating loss...: 100%|██████████| 25/25 [02:38<00:00,  6.33s/it]

Iter 1: Val loss 4.566, Val took 158.359s





Iter 10: Train loss 3.452, Learning Rate 1.000e-05, It/sec 0.070, Tokens/sec 48.329, Trained Tokens 6874, Peak mem 2.714 GB
Iter 20: Train loss 3.834, Learning Rate 1.000e-05, It/sec 0.297, Tokens/sec 167.589, Trained Tokens 12508, Peak mem 3.627 GB
Iter 30: Train loss 2.960, Learning Rate 1.000e-05, It/sec 0.584, Tokens/sec 289.822, Trained Tokens 17468, Peak mem 3.627 GB
Iter 40: Train loss 3.295, Learning Rate 1.000e-05, It/sec 0.259, Tokens/sec 133.125, Trained Tokens 22605, Peak mem 3.627 GB


Calculating loss...: 100%|██████████| 25/25 [01:53<00:00,  4.54s/it]

Iter 50: Val loss 3.740, Val took 113.854s





Iter 50: Train loss 3.424, Learning Rate 1.000e-05, It/sec 0.172, Tokens/sec 127.907, Trained Tokens 30041, Peak mem 3.627 GB
Iter 60: Train loss 3.315, Learning Rate 1.000e-05, It/sec 0.286, Tokens/sec 161.827, Trained Tokens 35701, Peak mem 3.627 GB
Iter 70: Train loss 3.663, Learning Rate 1.000e-05, It/sec 0.349, Tokens/sec 205.625, Trained Tokens 41585, Peak mem 3.627 GB
Iter 80: Train loss 3.544, Learning Rate 1.000e-05, It/sec 0.273, Tokens/sec 184.697, Trained Tokens 48362, Peak mem 3.627 GB
Iter 90: Train loss 3.341, Learning Rate 1.000e-05, It/sec 0.158, Tokens/sec 68.871, Trained Tokens 52708, Peak mem 3.627 GB


Calculating loss...: 100%|██████████| 25/25 [00:38<00:00,  1.55s/it]


Iter 100: Val loss 3.736, Val took 66.616s
Iter 100: Train loss 3.579, Learning Rate 1.000e-05, It/sec 0.516, Tokens/sec 237.238, Trained Tokens 57307, Peak mem 3.627 GB
Iter 100: Saved adapter weights to adapters_gemma3/adapters.safetensors and adapters_gemma3/0000100_adapters.safetensors.
Iter 110: Train loss 3.489, Learning Rate 1.000e-05, It/sec 0.298, Tokens/sec 173.821, Trained Tokens 63148, Peak mem 3.796 GB
Iter 120: Train loss 3.691, Learning Rate 1.000e-05, It/sec 0.444, Tokens/sec 335.493, Trained Tokens 70704, Peak mem 3.796 GB
Iter 130: Train loss 3.583, Learning Rate 1.000e-05, It/sec 0.282, Tokens/sec 219.660, Trained Tokens 78488, Peak mem 3.796 GB
Iter 140: Train loss 3.237, Learning Rate 1.000e-05, It/sec 0.417, Tokens/sec 281.243, Trained Tokens 85235, Peak mem 3.796 GB


Calculating loss...: 100%|██████████| 25/25 [00:52<00:00,  2.12s/it]

Iter 150: Val loss 3.464, Val took 53.144s





Iter 150: Train loss 3.684, Learning Rate 1.000e-05, It/sec 0.562, Tokens/sec 388.644, Trained Tokens 92146, Peak mem 3.796 GB
Iter 160: Train loss 4.138, Learning Rate 1.000e-05, It/sec 0.180, Tokens/sec 139.538, Trained Tokens 99914, Peak mem 3.796 GB
Iter 170: Train loss 3.907, Learning Rate 1.000e-05, It/sec 0.286, Tokens/sec 157.133, Trained Tokens 105408, Peak mem 3.796 GB
Iter 180: Train loss 3.305, Learning Rate 1.000e-05, It/sec 0.341, Tokens/sec 240.794, Trained Tokens 112470, Peak mem 3.796 GB
Iter 190: Train loss 3.285, Learning Rate 1.000e-05, It/sec 0.373, Tokens/sec 220.971, Trained Tokens 118388, Peak mem 3.796 GB


Calculating loss...: 100%|██████████| 25/25 [00:47<00:00,  1.90s/it]

Iter 200: Val loss 3.503, Val took 47.889s





Iter 200: Train loss 3.644, Learning Rate 1.000e-05, It/sec 0.249, Tokens/sec 207.426, Trained Tokens 126711, Peak mem 3.796 GB
Iter 200: Saved adapter weights to adapters_gemma3/adapters.safetensors and adapters_gemma3/0000200_adapters.safetensors.
Saved final weights to adapters_gemma3/adapters.safetensors.
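
To read the trend out of the log above without scanning it by eye, the validation lines can be parsed with a regular expression. This is a small sketch: the pattern is written against the log format shown in this notebook and may need adjusting for other versions, and the names `VAL_PATTERN` and `parse_val_losses` are illustrative. The same pairs should also be available in `metrics.val_losses` through the callback defined earlier.

```python
import re
from typing import List, Tuple

VAL_PATTERN = re.compile(r"Iter (\d+): Val loss ([\d.]+)")


def parse_val_losses(log_text: str) -> List[Tuple[int, float]]:
    """Extract (iteration, validation loss) pairs from trainer log text."""
    return [(int(it), float(loss)) for it, loss in VAL_PATTERN.findall(log_text)]


# Validation lines copied from the training log above.
log_text = """
Iter 1: Val loss 4.566, Val took 158.359s
Iter 50: Val loss 3.740, Val took 113.854s
Iter 100: Val loss 3.736, Val took 66.616s
Iter 150: Val loss 3.464, Val took 53.144s
Iter 200: Val loss 3.503, Val took 47.889s
"""
val_losses = parse_val_losses(log_text)
best_iter, best_loss = min(val_losses, key=lambda pair: pair[1])
print(best_iter, best_loss)  # 150 3.464
```

The lowest validation loss is at iteration 150, and it rises slightly by iteration 200.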


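## Fuse the base model with the adapter
To review: the model to fuse appears to be hard-coded to a specific snapshot path. Should it be? Note that this path is the original google/gemma-3-270m-it snapshot in the cache, not the quantized ./gemma3-mlx model that was loaded before the adapter was trained.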
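
On the hard-coded path: one way to avoid pasting a revision hash is to look the snapshot directory up in the cache. The sketch below walks the cache layout visible in the fuse command (`hub/models--<org>--<name>/snapshots/<revision>`). The helper `find_snapshots` and the placeholder revision name `abc123` are hypothetical; the demo runs against a throwaway directory rather than the real cache.

```python
import tempfile
from pathlib import Path
from typing import List


def find_snapshots(hf_home: Path, repo_id: str) -> List[Path]:
    """List snapshot directories for repo_id under a Hugging Face cache root."""
    repo_dir = hf_home / "hub" / ("models--" + repo_id.replace("/", "--"))
    snapshots_dir = repo_dir / "snapshots"
    if not snapshots_dir.is_dir():
        return []
    return sorted(p for p in snapshots_dir.iterdir() if p.is_dir())


# Demo against a temporary directory that mimics the cache layout.
with tempfile.TemporaryDirectory() as tmp:
    fake_home = Path(tmp)
    snapshot = fake_home / "hub" / "models--google--gemma-3-270m-it" / "snapshots" / "abc123"
    snapshot.mkdir(parents=True)
    found_names = [p.name for p in find_snapshots(fake_home, "google/gemma-3-270m-it")]
    missing = find_snapshots(fake_home, "some-org/not-cached")

print(found_names)  # ['abc123']
```

In the notebook this would be called with `cache_dir` from the first cell. `huggingface_hub.snapshot_download` also returns the local snapshot path and may be the simpler route.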

In [20]:
! python -m mlx_lm fuse  --model ./models_cache/hub/models--google--gemma-3-270m-it/snapshots/ac82b4e820549b854eebf28ce6dedaf9fdfa17b3 --adapter-path ./adapters_gemma3 --save-path ./new_gemma3 

python(66528) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading pretrained model
README.md: 100%|███████████████████████████| 28.3k/28.3k [00:00<00:00, 72.3MB/s]
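
Conceptually, fusing folds the low-rank update into the base weight so that no adapter is needed at inference time. The toy example below checks that identity in plain Python, assuming the convention `y = x @ W` and a fused weight `W' = W + scale * (A @ B)`. All names and values are illustrative and not taken from `mlx_lm`.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(w: Matrix, lora_a: Matrix, lora_b: Matrix, scale: float) -> Matrix:
    """Fold the adapter into the base weight: W' = W + scale * (A @ B)."""
    delta = matmul(lora_a, lora_b)
    return [
        [wv + scale * dv for wv, dv in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


x = [[1.0, 2.0, 3.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lora_a = [[0.5], [0.0], [0.25]]
lora_b = [[1.0, -1.0]]
scale = 20.0

fused_w = fuse(w, lora_a, lora_b, scale)
fused_output = matmul(x, fused_w)

# Same result as running the base layer and the adapter side by side.
base = matmul(x, w)
update = matmul(matmul(x, lora_a), lora_b)
adapter_output = [[b + scale * u for b, u in zip(base[0], update[0])]]
print(fused_output, adapter_output)
```

The identity only holds when the adapter is fused into the same base weights it was trained against. In this notebook the adapter was trained on the 4-bit model loaded from `./gemma3-mlx`, while the fuse command targets a different checkpoint, so that mismatch is worth checking.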


# Upload the model to HF
The model is uploaded to deimagjas/gemma-3-270m-it-sft using the HF API.

In [22]:
from huggingface_hub import upload_folder

repo_id = "deimagjas/gemma-3-270m-it-sft"

upload_folder(
    folder_path="./new_gemma3",
    repo_id=repo_id
)


Processing Files (0 / 0): |          |  0.00B /  0.00B            

New Data Upload: |          |  0.00B /  0.00B            

CommitInfo(commit_url='https://huggingface.co/deimagjas/gemma-3-270m-it-sft/commit/7731ae397a84fabb197f42888fa436ee6909f119', commit_message='Upload folder using huggingface_hub', commit_description='', oid='7731ae397a84fabb197f42888fa436ee6909f119', pr_url=None, repo_url=RepoUrl('https://huggingface.co/deimagjas/gemma-3-270m-it-sft', endpoint='https://huggingface.co', repo_type='model', repo_id='deimagjas/gemma-3-270m-it-sft'), pr_revision=None, pr_num=None)

## Test HF model

In [23]:
model_path = "deimagjas/gemma-3-270m-it-sft"
model_sft, tokenizer_sft = load(model_path)

Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

config.json:   0%|          | 0.00/1.49k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/17.2k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/536M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/1.53k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

In [26]:
prompt = "create a list of steps in order to help someone that has an anxiety attack"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer_sft.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model_sft, tokenizer_sft, prompt=prompt, verbose=True)

It's important to remember that anxiety attacks are temporary and can be managed. The most important thing is to take a moment to calm down and think about how you're feeling.<end_of_turn>
<unused97> model<end_of_turn>
<unused97> model
<unused97> model
<unused97> model<end_of_turn>
<unused97> model
<unused97> model<end_of_turn>
<unused97> model
<unused97> model<end_of_turn>
<unused97> model
<unused97> model<end_of_turn>
<unused97> model
<unused97> model
<unused97> model<end_of_turn>
ডেইলি<end_of_turn>
ডেইলি
ডেইলি
ডেইলি
[... the line "ডেইলি" keeps repeating for the rest of the captured output ...]

# Conclusion

The fine-tuning failed in this case: the model breaks down at inference, emitting repeated special tokens and a repeated unrelated word after a single sentence. Why?
It also consumes roughly 5x more memory.