<a href="https://colab.research.google.com/github/ferchomuri/archi/blob/main/train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 📌 Environment variables

In [1]:
from google.colab import userdata

### 📌 Install the required libraries

In [2]:
!pip install transformers datasets accelerate huggingface_hub


In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import login
from datasets import load_dataset

### 📌 Log in to Hugging Face (substitute your own token)

In [4]:
HUGGINGFACE_TOKEN = userdata.get('hugface')
login(HUGGINGFACE_TOKEN)

### 📌 Base model

In [5]:
MODEL_NAME = "bigcode/starcoder"

### 📌 Load the optimized model (uses the GPU if available)

In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,  # Half precision to reduce memory use
    device_map="auto"  # Place modules on the GPU automatically; may offload to CPU/disk if they do not fit
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json (1.05 kB)
model.safetensors.index.json (38.2 kB)
model-00001-of-00007.safetensors (9.90 GB)
model-00002-of-00007.safetensors (9.86 GB)
model-00003-of-00007.safetensors (9.85 GB)
model-00004-of-00007.safetensors (9.86 GB)
model-00005-of-00007.safetensors (9.85 GB)
model-00006-of-00007.safetensors (9.86 GB)
model-00007-of-00007.safetensors (4.08 GB)
generation_config.json (111 B)
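The seven shards in the download output give a rough idea of how much memory the model needs. The sketch below simply adds up the shard sizes and halves the total to approximate the float16 footprint. It assumes the checkpoint is stored in float32 (4 bytes per weight), which the output does not state; the function names are illustrative.

```python
# Shard sizes in GB, copied from the download output of the model cell.
SHARD_SIZES_GB = [9.90, 9.86, 9.85, 9.86, 9.85, 9.86, 4.08]


def checkpoint_size_gb(shard_sizes_gb):
    """Total size of the checkpoint on disk, in GB."""
    return sum(shard_sizes_gb)


def fp16_estimate_gb(checkpoint_gb, stored_bytes_per_param=4):
    """Approximate size of the weights once cast to float16 (2 bytes each).

    Assumption: the checkpoint stores `stored_bytes_per_param` bytes per weight.
    """
    return checkpoint_gb * 2 / stored_bytes_per_param


total = checkpoint_size_gb(SHARD_SIZES_GB)
print(f"Checkpoint on disk: {total:.2f} GB")
print(f"Approximate float16 weights: {fp16_estimate_gb(total):.2f} GB")
```

If the estimate is larger than the memory of the available GPU, `device_map="auto"` has to place part of the model somewhere else, and training needs additional memory on top of the weights.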



### 📌 Load the tokenizer

In [7]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


### 📌 Set a padding token if none exists

In [8]:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Model loaded successfully ✅")

Model loaded successfully ✅


### 📌 Upload JSONL files to the Colab environment (first time only)

In [9]:
from google.colab import files

uploaded = files.upload()  # Select your JSONL files when the dialog appears


Saving cleanarchitecture.jsonl to cleanarchitecture.jsonl
Saving domaindrivendesign.jsonl to domaindrivendesign.jsonl
Saving nodejs.jsonl to nodejs.jsonl
Saving reactjs.jsonl to reactjs.jsonl
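Before loading the files, it can help to confirm that each line is valid JSON and carries the two fields the tokenization step reads, `prompt` and `completion`. The following is a minimal sketch using only the standard library; the helper name `validate_jsonl` is hypothetical and not part of the notebook.

```python
import json

# Field names taken from the tokenization cell of this notebook.
REQUIRED_KEYS = ("prompt", "completion")


def validate_jsonl(path, required_keys=REQUIRED_KEYS):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # Skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((lineno, f"missing keys: {missing}"))
    return problems
```

In the notebook this could be called as `validate_jsonl(name)` for each `name` in `uploaded.keys()`.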


### 📌 Load the local dataset

In [10]:
dataset = load_dataset("json", data_files=list(uploaded.keys()))

print("Dataset loaded successfully ✅")

Dataset loaded successfully ✅


### 📌 Function to tokenize the dataset

In [11]:
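def tokenize_function(examples):
    # Tokenizes prompts into `input_ids` and completions into `labels`, each padded to 512 tokens.
    # Note: the two are tokenized separately, so labels are not aligned token-by-token with the
    # inputs, and padded label positions are not masked out of the loss.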
    return tokenizer(
        examples["prompt"],
        text_target=examples["completion"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
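For a causal language model, a common alternative is to concatenate prompt and completion into one sequence and use the same token IDs as labels, masking the positions that should not contribute to the loss. The sketch below illustrates that layout on plain lists of token IDs. It is not the notebook's method: the helper name is hypothetical, masking the prompt is a design choice, and `-100` is assumed as the ignore index (the default of PyTorch's cross-entropy loss).

```python
def build_causal_example(prompt_ids, completion_ids, pad_id, max_length, ignore_index=-100):
    """Build aligned input_ids / attention_mask / labels for one prompt-completion pair."""
    # One sequence: prompt followed by completion, truncated to max_length.
    input_ids = (list(prompt_ids) + list(completion_ids))[:max_length]
    n_prompt = min(len(prompt_ids), max_length)

    # Labels mirror input_ids, with prompt positions excluded from the loss.
    labels = [ignore_index] * n_prompt + input_ids[n_prompt:]

    # Right-pad everything to max_length; padding is also excluded from the loss.
    n_pad = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    input_ids = input_ids + [pad_id] * n_pad
    labels = labels + [ignore_index] * n_pad

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = build_causal_example([1, 2], [3, 4], pad_id=0, max_length=6)
print(example["input_ids"])  # [1, 2, 3, 4, 0, 0]
print(example["labels"])     # [-100, -100, 3, 4, -100, -100]
```

In practice the ID lists would come from the tokenizer, and `pad_id` would be `tokenizer.pad_token_id`.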

### 📌 Tokenize the dataset

In [12]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

print("Dataset tokenized successfully ✅")

Map: 14 examples

Dataset tokenized successfully ✅


### 📌 Split the dataset into train (80%) and eval (20%)

In [13]:
split_dataset = tokenized_datasets["train"].train_test_split(test_size=0.2)
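With only 14 tokenized examples, the 80/20 split cannot be exact. The arithmetic below is a sketch of the resulting sizes, assuming the test share is rounded up and the remainder goes to training, which is my understanding of how `train_test_split` handles a fractional `test_size`; the helper name is illustrative.

```python
import math


def split_sizes(n_examples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_examples)
    return n_examples - n_test, n_test


n_train, n_test = split_sizes(14, 0.2)
print(n_train, n_test)
```

An evaluation set this small gives a noisy loss estimate. Passing `seed=` to `train_test_split` makes the shuffle reproducible between runs.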


### 📌 Configure the training arguments

In [14]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    push_to_hub=False,  # Do not upload automatically to Hugging Face
    fp16=True  # Use 16-bit floats to reduce memory use
)
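To get a feel for how short this run is, the number of optimizer steps can be estimated from the batch size and epoch count above. This is a sketch under stated assumptions: a single device, no gradient accumulation, and 11 training examples (roughly 80% of the 14 in the dataset). The function name is illustrative.

```python
import math


def total_optimizer_steps(n_train, per_device_batch_size, num_epochs,
                          n_devices=1, grad_accum_steps=1):
    """Estimate optimizer steps for a run; the last partial batch counts as a step."""
    effective_batch = per_device_batch_size * n_devices * grad_accum_steps
    steps_per_epoch = math.ceil(n_train / effective_batch)
    return steps_per_epoch * num_epochs


# Values from the TrainingArguments cell; n_train=11 is an assumption.
print(total_optimizer_steps(n_train=11, per_device_batch_size=2, num_epochs=3))
```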

### 📌 Initialize the Trainer

In [15]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"]
)



RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
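This error is consistent with `device_map="auto"` having placed part of the model on CPU or disk because it does not fit on the GPU, after which `Trainer` tries to move the whole model to a single device. A minimal sketch for checking this follows. It assumes the loaded model exposes the `hf_device_map` dictionary that `transformers` sets when a device map is used; the helper name and the sample mapping are hypothetical.

```python
def offloaded_modules(device_map):
    """Return the module names that a device map places on CPU or disk."""
    return [name for name, device in device_map.items()
            if str(device) in ("cpu", "disk")]


# Hypothetical mapping for illustration; in the notebook, pass model.hf_device_map.
sample_map = {
    "transformer.wte": 0,
    "transformer.h.0": 0,
    "transformer.h.1": "cpu",
    "lm_head": "disk",
}
print(offloaded_modules(sample_map))  # ['transformer.h.1', 'lm_head']
```

If the list is not empty, the model does not fit entirely on the GPU as loaded. Options outside the scope of this notebook include a smaller base model or a parameter-efficient fine-tuning setup.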

### 📌 Start training

In [None]:
trainer.train()