In [1]:
import os
import pathlib

project_root = pathlib.Path.cwd()

cache_dir = project_root / "models_cache"
cache_dir.mkdir(exist_ok=True)
os.environ['HF_HOME'] = str(cache_dir)
print(f"The HF_HOME environment variable has been set to: {os.environ['HF_HOME']}")


The HF_HOME environment variable has been set to: /Users/deimagjas/machinelearning/gemma3-finetunning/models_cache


In [2]:
import json
from typing import Dict, List, Tuple, Union

import mlx.optimizers as optim
from mlx.utils import tree_flatten
from mlx_lm import load, generate
from mlx_lm.tuner import TrainingArgs, linear_to_lora_layers, train

In [3]:
!uv pip show mlx_lm

Name: mlx-lm
Version: 0.28.0
Location: /Users/deimagjas/machinelearning/gemma3-finetunning/.venv/lib/python3.13/site-packages
Requires: jinja2, mlx, numpy, protobuf, pyyaml, transformers
Required-by:


In [4]:
from huggingface_hub import login

login()


## Loading the model from HF

The code works with google/gemma-3-270m-it but not with google/gemma-3-270m because of the difference between the
  two types of model:

   1. `google/gemma-3-270m-it`: The "-it" suffix stands for "Instruction Tuned". This model has been
      trained specifically to understand and follow instructions in a chat or question-answer format. Its tokenizer
      includes a chat template that formats the input in the way the model expects.

   2. `google/gemma-3-270m`: This is the base model. It is a pre-trained language model that is good at predicting the next
      token in a text, but it has not been tuned to follow instructions or to hold a conversation. Its tokenizer has no
      default chat template.
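
To make the role of the chat template concrete, here is a minimal pure-Python sketch of the kind of string a Gemma-style template produces. The turn markers (`<bos>`, `<start_of_turn>`, `<end_of_turn>`) and the helper name `render_gemma_style_prompt` are assumptions for illustration, and system messages are ignored for brevity. The authoritative template is the `chat_template.jinja` file shipped with the tokenizer.

```python
from typing import Dict, List


def render_gemma_style_prompt(
    messages: List[Dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Approximate a Gemma-style chat template (illustrative, not the real Jinja file)."""
    parts = ["<bos>"]
    for message in messages:
        # ASSUMPTION: Gemma-style templates call the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that generation continues as the reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example = render_gemma_style_prompt(
    [{"role": "user", "content": "generate an SQL query to find all users"}]
)
print(example)
```

A base model never saw these turn markers during instruction tuning, which is one way to picture why the same call is only meaningful for the "-it" variant.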

In [6]:
model_path = "google/gemma-3-270m-it"
model, tokenizer = load(model_path)


In [7]:
prompt = "generate an SQL query to find all users who registered in the last 30 days"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, verbose=True)

```sql
SELECT *
FROM Users
WHERE registration_date >= DATE('now', '-30 days');
```

**Explanation:**

* **`SELECT *`**: This selects all columns from the `Users` table.
* **`FROM Users`**: This specifies the table from which to retrieve data.
* **`WHERE registration_date >= DATE('now', '-30 days')`**: This is the filtering condition.
    * `registration_date`:  This is the column name that contains the date of the user's registration.
    * `DATE('now', '-30 days')`: This calculates the date 30 days ago from the current date.
    * `>=`: This ensures that we only include users whose registration date is greater than or equal to 30 days ago.

**Important Considerations:**

* **Data Type of `registration_date`:**  The `registration_date` column should be a date or datetime data type.  If it's stored as a string, you might need to convert it to a date type using `STR_TO_DATE('now', '-30 days')` before using it in the `WHERE` clause.
* **Database
Prompt: 26 tokens, 55.325 tokens-per-sec
Ge

# Creating the adapter

In [8]:
adapter_path = "adapters_gemma3"
os.makedirs(adapter_path, exist_ok=True)
adapter_config_path = os.path.join(adapter_path, "adapter_config.json")
adapter_file_path = os.path.join(adapter_path, "adapters.safetensors")

# LoRA config
The hyperparameters for training are set here.

In [9]:
lora_config = {
    "num_layers": 8,
    "lora_parameters": {
        "rank": 8,
        "scale": 20.0,
        "dropout": 0.0,
    },
}

In [10]:
with open(adapter_config_path, "w") as f:
    json.dump(lora_config, f, indent=4)

In [11]:
training_args = TrainingArgs(
    adapter_file=adapter_file_path,
    iters=200,
    steps_per_eval=50,
    grad_checkpoint=True,
)
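
As a rough sanity check on these settings, the sketch below estimates how much of the training split 200 iterations cover and at which iterations validation runs. The batch size of 4 is an assumption (the notebook does not set it); it is consistent with the validation progress bars further down, which show 25 batches for 100 examples. The training split size of 1000 is taken from the dataset cell later in the notebook.

```python
ITERS = 200            # from TrainingArgs(iters=200)
STEPS_PER_EVAL = 50    # from TrainingArgs(steps_per_eval=50)
BATCH_SIZE = 4         # ASSUMPTION: not set in the notebook
TRAIN_EXAMPLES = 1000  # training split size reported later in the notebook

# Number of training examples consumed over the whole run.
examples_seen = ITERS * BATCH_SIZE
# Fraction of one pass over the training split.
epochs = examples_seen / TRAIN_EXAMPLES

# Validation runs once at the start and then every STEPS_PER_EVAL iterations.
eval_iters = [1] + list(range(STEPS_PER_EVAL, ITERS + 1, STEPS_PER_EVAL))

print(f"examples seen: {examples_seen}, epochs: {epochs:.1f}")
print(f"validation at iterations: {eval_iters}")
```

Under these assumptions the run covers less than one full pass over the training data.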

# Parameters and adapter
You see 163,840 trainable parameters instead of the full model's 270 million because you are not
  retraining the entire model. You are using a parameter-efficient fine-tuning (PEFT) technique called LoRA
  (Low-Rank Adaptation).

  This is how it works in your notebook:

   1. Freeze the base model: In the cell with id a3b86f5c, the first line is model.freeze(). This freezes all 270 million
      parameters of the Gemma model, making them non-trainable.

   2. Inject LoRA adapters: The next line, linear_to_lora_layers(...), adds small low-rank adapter layers to
      certain parts of the model (in your case, to 8 layers, as defined in lora_config).

   3. Train only the adapters: Only the parameters of these small new adapters are trained. The number 163,840 is the sum
      of all the parameters of the LoRA layers that were added.

  In summary:

   * 270 million: The total size of the base model, which remains unchanged.
   * 163,840: The number of new, additional parameters you are training. These parameters are the ones that "learn" the new
     task (here, generating SQL queries) and adapt the original model's knowledge.

  This is the great advantage of LoRA: it lets you specialize a large model for a specific task quickly and with far
  fewer computational resources, since you only need to train a tiny fraction (<0.1%) of the total parameters.
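
The figure of 163,840 can be reproduced by hand. The sketch below assumes that adapters are attached only to the `q_proj` and `v_proj` projections of 8 transformer blocks; that choice is an assumption, but it matches the reported count exactly. The layer shapes are taken from the model printout in this notebook, and `lora_param_count` is a hypothetical helper.

```python
RANK = 8         # lora_config["lora_parameters"]["rank"]
NUM_LAYERS = 8   # lora_config["num_layers"]

# (input_dims, output_dims) as shown in the model printout.
# ASSUMPTION: only these two projections receive adapters.
TARGET_SHAPES = {"q_proj": (640, 1024), "v_proj": (640, 256)}


def lora_param_count(input_dims: int, output_dims: int, rank: int) -> int:
    """Parameters of one adapter: A is (input_dims x rank), B is (rank x output_dims)."""
    return rank * (input_dims + output_dims)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
fraction = total / 270_000_000

print(f"{total:,} trainable parameters")   # 163,840 trainable parameters
print(f"{fraction:.2%} of the base model")  # 0.06% of the base model
```

Each adapter costs `rank * (input_dims + output_dims)` parameters, so doubling the rank would double the trainable parameter count.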

In [12]:
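# Freeze every base-model weight so that only the LoRA adapters receive gradients.
model.freeze()
# Wrap selected linear layers in the last `num_layers` transformer blocks with LoRA adapters.
linear_to_lora_layers(model, lora_config["num_layers"], lora_config["lora_parameters"])
num_train_params = sum(v.size for _, v in tree_flatten(model.trainable_parameters()))
print(f"Number of trainable parameters: {num_train_params}")
# Switch the model to training mode.
model.train()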

Number of trainable parameters: 163840


Model(
  (model): Gemma3Model(
    (embed_tokens): Embedding(262144, 640)
    (layers.0): TransformerBlock(
      (self_attn): Attention(
        (q_proj): Linear(input_dims=640, output_dims=1024, bias=False)
        (k_proj): Linear(input_dims=640, output_dims=256, bias=False)
        (v_proj): Linear(input_dims=640, output_dims=256, bias=False)
        (o_proj): Linear(input_dims=1024, output_dims=640, bias=False)
        (q_norm): RMSNorm()
        (k_norm): RMSNorm()
        (rope): RoPE(256, traditional=False)
      )
      (mlp): MLP(
        (gate_proj): Linear(input_dims=640, output_dims=2048, bias=False)
        (down_proj): Linear(input_dims=2048, output_dims=640, bias=False)
        (up_proj): Linear(input_dims=640, output_dims=2048, bias=False)
      )
      (input_layernorm): RMSNorm()
      (post_attention_layernorm): RMSNorm()
      (pre_feedforward_layernorm): RMSNorm()
      (post_feedforward_layernorm): RMSNorm()
    )
    (layers.1): TransformerBlock(
      (self_att

In [13]:
class Metrics:
    def __init__(self) -> None:
        self.train_losses: List[Tuple[int, float]] = []
        self.val_losses: List[Tuple[int, float]] = []

    def on_train_loss_report(self, info: Dict[str, Union[float, int]]) -> None:
        self.train_losses.append((info["iteration"], info["train_loss"]))

    def on_val_loss_report(self, info: Dict[str, Union[float, int]]) -> None:
        self.val_losses.append((info["iteration"], info["val_loss"]))

In [14]:
metrics = Metrics()

# Load data

In [16]:
from mlx_lm.tuner.datasets import load_hf_dataset
config = { }
train_set, val_set, test_set = load_hf_dataset(
    data_id="mlx-community/wikisql",
    tokenizer=tokenizer,
    config=config,
)


In [17]:
print(f"Test set size: {len(test_set)}")
print(f"Validation set size: {len(val_set)}")
print(f"Training set size: {len(train_set)}")
print(f"test set: {test_set[:2]}")

Test set size: 100
Validation set size: 100
Training set size: 1000
test set: {'text': ["table: 1-10015132-16\ncolumns: Player, No., Nationality, Position, Years in Toronto, School/Club Team\nQ: What is terrence ross' nationality\nA: SELECT Nationality FROM 1-10015132-16 WHERE Player = 'Terrence Ross'", "table: 1-10015132-16\ncolumns: Player, No., Nationality, Position, Years in Toronto, School/Club Team\nQ: What clu was in toronto 1995-96\nA: SELECT School/Club Team FROM 1-10015132-16 WHERE Years in Toronto = '1995-96'"]}


In [18]:
from mlx_lm.tuner.datasets import CacheDataset

train_dataset = CacheDataset(train_set)
val_dataset = CacheDataset(val_set)

train(
    model,
    optim.Adam(learning_rate=1e-5),
    train_dataset,
    val_dataset,
    args=training_args,
    training_callback=metrics
)


Starting training..., iters: 200


Calculating loss...: 100%|██████████| 25/25 [00:05<00:00,  4.78it/s]

Iter 1: Val loss 4.059, Val took 5.235s





Iter 10: Train loss 3.808, Learning Rate 1.000e-05, It/sec 1.891, Tokens/sec 720.052, Trained Tokens 3808, Peak mem 1.847 GB
Iter 20: Train loss 3.188, Learning Rate 1.000e-05, It/sec 2.803, Tokens/sec 981.376, Trained Tokens 7309, Peak mem 2.176 GB
Iter 30: Train loss 2.860, Learning Rate 1.000e-05, It/sec 3.060, Tokens/sec 1047.298, Trained Tokens 10732, Peak mem 2.176 GB
Iter 40: Train loss 2.616, Learning Rate 1.000e-05, It/sec 2.877, Tokens/sec 1073.506, Trained Tokens 14463, Peak mem 2.176 GB


Calculating loss...: 100%|██████████| 25/25 [00:03<00:00,  7.11it/s]

Iter 50: Val loss 2.467, Val took 3.534s





Iter 50: Train loss 2.504, Learning Rate 1.000e-05, It/sec 3.093, Tokens/sec 1078.381, Trained Tokens 17949, Peak mem 2.176 GB
Iter 60: Train loss 2.397, Learning Rate 1.000e-05, It/sec 2.865, Tokens/sec 1083.910, Trained Tokens 21732, Peak mem 2.176 GB
Iter 70: Train loss 2.312, Learning Rate 1.000e-05, It/sec 2.892, Tokens/sec 1058.885, Trained Tokens 25394, Peak mem 2.176 GB
Iter 80: Train loss 2.292, Learning Rate 1.000e-05, It/sec 2.591, Tokens/sec 950.634, Trained Tokens 29063, Peak mem 2.506 GB
Iter 90: Train loss 2.286, Learning Rate 1.000e-05, It/sec 2.772, Tokens/sec 1032.973, Trained Tokens 32789, Peak mem 2.506 GB


Calculating loss...: 100%|██████████| 25/25 [00:03<00:00,  7.33it/s]

Iter 100: Val loss 2.166, Val took 3.425s





Iter 100: Train loss 2.278, Learning Rate 1.000e-05, It/sec 2.597, Tokens/sec 922.803, Trained Tokens 36342, Peak mem 2.506 GB
Iter 100: Saved adapter weights to adapters_gemma3/adapters.safetensors and adapters_gemma3/0000100_adapters.safetensors.
Iter 110: Train loss 2.150, Learning Rate 1.000e-05, It/sec 2.878, Tokens/sec 974.053, Trained Tokens 39726, Peak mem 2.506 GB
Iter 120: Train loss 2.036, Learning Rate 1.000e-05, It/sec 2.695, Tokens/sec 983.549, Trained Tokens 43375, Peak mem 2.506 GB
Iter 130: Train loss 2.074, Learning Rate 1.000e-05, It/sec 2.759, Tokens/sec 971.446, Trained Tokens 46896, Peak mem 2.506 GB
Iter 140: Train loss 2.079, Learning Rate 1.000e-05, It/sec 2.660, Tokens/sec 988.435, Trained Tokens 50612, Peak mem 2.506 GB


Calculating loss...: 100%|██████████| 25/25 [00:03<00:00,  7.29it/s]

Iter 150: Val loss 2.031, Val took 3.457s





Iter 150: Train loss 2.088, Learning Rate 1.000e-05, It/sec 2.568, Tokens/sec 964.229, Trained Tokens 54367, Peak mem 2.506 GB
Iter 160: Train loss 2.027, Learning Rate 1.000e-05, It/sec 2.547, Tokens/sec 978.339, Trained Tokens 58208, Peak mem 2.506 GB
Iter 170: Train loss 1.952, Learning Rate 1.000e-05, It/sec 2.758, Tokens/sec 1085.310, Trained Tokens 62143, Peak mem 2.837 GB
Iter 180: Train loss 1.800, Learning Rate 1.000e-05, It/sec 3.167, Tokens/sec 1086.855, Trained Tokens 65575, Peak mem 2.837 GB
Iter 190: Train loss 1.988, Learning Rate 1.000e-05, It/sec 3.102, Tokens/sec 1151.570, Trained Tokens 69287, Peak mem 2.837 GB


Calculating loss...: 100%|██████████| 25/25 [00:03<00:00,  7.31it/s]

Iter 200: Val loss 1.934, Val took 3.447s





Iter 200: Train loss 1.831, Learning Rate 1.000e-05, It/sec 2.744, Tokens/sec 1019.964, Trained Tokens 73004, Peak mem 2.837 GB
Iter 200: Saved adapter weights to adapters_gemma3/adapters.safetensors and adapters_gemma3/0000200_adapters.safetensors.
Saved final weights to adapters_gemma3/adapters.safetensors.
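
One way to read the validation losses above is to convert them to perplexity. This is a small sketch, assuming the reported loss is the mean per-token cross-entropy in nats, in which case perplexity is simply `exp(loss)`. The loss values are copied from the log.

```python
import math

# (iteration, validation loss) pairs copied from the training log.
val_losses = [(1, 4.059), (50, 2.467), (100, 2.166), (150, 2.031), (200, 1.934)]

# ASSUMPTION: the loss is mean per-token cross-entropy in nats.
perplexities = [math.exp(loss) for _, loss in val_losses]

for (iteration, loss), ppl in zip(val_losses, perplexities):
    print(f"iter {iteration:>3}: val loss {loss:.3f} -> perplexity {ppl:.2f}")
```

The validation loss falls at every evaluation, so the log alone gives no sign of the problem seen later at inference.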


## Fusing the base model with the adapter
To review: the path of the model to fuse appears to be hard-coded. Should it be?
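
One possible way to avoid hard-coding the snapshot hash is to resolve it from the Hugging Face cache layout, where `refs/main` stores the commit hash of the downloaded revision. The sketch below is illustrative only: `resolve_snapshot` is a hypothetical helper, and the demo builds a throwaway cache with a made-up repository (`org/name`) and hash (`abc123`) instead of touching the real one.

```python
import tempfile
from pathlib import Path


def resolve_snapshot(hf_home: Path, repo_id: str, ref: str = "main") -> Path:
    """Return the local snapshot directory of `repo_id` inside an HF cache."""
    repo_dir = hf_home / "hub" / ("models--" + repo_id.replace("/", "--"))
    # refs/<ref> contains the commit hash that names the snapshot folder.
    commit_hash = (repo_dir / "refs" / ref).read_text().strip()
    snapshot_dir = repo_dir / "snapshots" / commit_hash
    if not snapshot_dir.is_dir():
        raise FileNotFoundError(f"No snapshot found at {snapshot_dir}")
    return snapshot_dir


# Demo on a fake cache layout (hypothetical repo id and hash).
with tempfile.TemporaryDirectory() as tmp:
    fake_home = Path(tmp)
    fake_repo = fake_home / "hub" / "models--org--name"
    (fake_repo / "refs").mkdir(parents=True)
    (fake_repo / "refs" / "main").write_text("abc123\n")
    (fake_repo / "snapshots" / "abc123").mkdir(parents=True)
    resolved_name = resolve_snapshot(fake_home, "org/name").name

print(resolved_name)  # abc123
```

In this notebook the call would look like `resolve_snapshot(cache_dir, "google/gemma-3-270m-it")`. It may also be possible to pass the repository id directly to `--model`, which would avoid the local path altogether; that is worth checking against the `mlx_lm fuse` help output.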

In [19]:
! python -m mlx_lm fuse  --model ./models_cache/hub/models--google--gemma-3-270m-it/snapshots/ac82b4e820549b854eebf28ce6dedaf9fdfa17b3 --adapter-path ./adapters_gemma3 --save-path ./new_gemma3 

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading pretrained model


# Uploading the model to HF
The fused model is uploaded to deimagjas/gemma-3-270m-it-sft using the HF API.

In [20]:
from huggingface_hub import upload_folder

repo_id = "deimagjas/gemma-3-270m-it-sft"

upload_folder(
    folder_path="./new_gemma3",
    repo_id=repo_id
)



CommitInfo(commit_url='https://huggingface.co/deimagjas/gemma-3-270m-it-sft/commit/323474bc271eb170772ecaddabdfa941c805b7cb', commit_message='Upload folder using huggingface_hub', commit_description='', oid='323474bc271eb170772ecaddabdfa941c805b7cb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/deimagjas/gemma-3-270m-it-sft', endpoint='https://huggingface.co', repo_type='model', repo_id='deimagjas/gemma-3-270m-it-sft'), pr_revision=None, pr_num=None)

## Test HF model

In [21]:
model_path = "deimagjas/gemma-3-270m-it-sft"
model_sft, tokenizer_sft = load(model_path)


In [22]:
prompt = "generate an SQL query to find all users who registered in the last 30 days"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer_sft.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model_sft, tokenizer_sft, prompt=prompt, verbose=True)

```sql
SELECT *ี่ยể้นوفيرเซ็นเซอร์เซ็นเซอร์ผู้เซ็นเซอร์ที่เซ็นเซอร์ผู้เซ็นเซอร์เคยเซ็นเซอร์ผู้เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์ผู้เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์เซ็นเซอร์
Prompt: 26 tokens, 1.710 tokens-per-sec
Generation: 256 tokens, 118.625 tokens-per-sec
Peak memory: 2.837 GB


# Conclusion

Fine-tuning failed in this case: the model produces degenerate output at inference. Why?
It also uses roughly 5x more memory.