<a href="https://colab.research.google.com/github/felafax/felafax/blob/main/notebooks/Llama3_1_on_Free_Colab_TPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this notebook, open the "*Runtime*" menu and select "*Run all*" on a **free** Google Colab TPU!
<div class="align-center">
  <a href="https://github.com/felafax/felafax"><img src="https://felafax.ai/felafax.svg" width="145"></a> ⭐ <i>Star us on <a href="https://github.com/felafax/felafax">GitHub</a></i> ⭐ and email us at founders@felafax.ai with any questions!
</div>

# Setup

In [1]:
!pip install git+https://github.com/felafax/felafax.git -q
!pip uninstall -y tensorflow && pip install tensorflow-cpu -q


In [2]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''

In [9]:
MODEL_NAME = "meta-llama/Llama-3.2-1B"
HF_TOKEN = input("Please enter your HuggingFace token: ")
TRAINER_DIR = "/"
TEST_MODE = False


CHECKPOINT_DIR = os.path.join(TRAINER_DIR, "checkpoints")
EXPORT_DIR = os.path.join(TRAINER_DIR, "finetuned_export")
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
os.makedirs(EXPORT_DIR, exist_ok=True)

In [12]:
from felafax.trainer_engine import setup
setup.setup_environment(base_dir=TRAINER_DIR)

import jax
from transformers import AutoTokenizer
from felafax.trainer_engine import checkpoint, trainer, utils
from felafax.trainer_engine.data import data

# Step 0: Configure the different parts of the training pipeline

In [13]:
dataset_config = data.DatasetConfig(
    data_source="yahma/alpaca-cleaned",
    max_seq_length=32,
    batch_size=8,
    num_workers=4,
    mask_prompt=False,
    train_test_split=0.15,

    ignore_index=-100,
    pad_id=0,
    seed=42,

    # Setting max_examples limits the number of examples in the dataset.
    # This is useful for testing the pipeline without running the entire dataset.
    max_examples=100 if TEST_MODE else None,
)
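
To make the roles of `pad_id`, `ignore_index`, and `mask_prompt` concrete, here is a minimal pure-Python sketch of how input and label sequences are commonly built for supervised fine-tuning. The helper `build_labels` and its arguments are hypothetical and are not part of felafax; the library's `SFTDataset` may differ in detail.

```python
from typing import List, Tuple


def build_labels(
    prompt_ids: List[int],
    response_ids: List[int],
    max_seq_length: int,
    pad_id: int = 0,
    ignore_index: int = -100,
    mask_prompt: bool = False,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: returns (input_ids, labels), each of length max_seq_length."""
    # Concatenate prompt and response, then truncate to the fixed length.
    input_ids = (prompt_ids + response_ids)[:max_seq_length]
    labels = list(input_ids)

    if mask_prompt:
        # Exclude prompt tokens from the loss so only the response is learned.
        n_prompt = min(len(prompt_ids), max_seq_length)
        labels[:n_prompt] = [ignore_index] * n_prompt

    # Pad to the fixed length; padded positions are ignored by the loss.
    n_pad = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_id] * n_pad
    labels = labels + [ignore_index] * n_pad
    return input_ids, labels


input_ids, labels = build_labels([5, 6, 7], [8, 9], max_seq_length=8, mask_prompt=True)
print(input_ids)  # [5, 6, 7, 8, 9, 0, 0, 0]
print(labels)     # [-100, -100, -100, 8, 9, -100, -100, -100]
```

With `mask_prompt=False`, as in the config above, the prompt tokens stay in the labels and only the padding is ignored.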


In [19]:
trainer_config = trainer.TrainerConfig(
    model_name=MODEL_NAME,
    param_dtype="bfloat16",
    compute_dtype="bfloat16",

    # Training configuration
    num_epochs=1,
    num_steps=50,
    use_lora=True,
    lora_rank=16,
    learning_rate=1e-3,
    log_interval=1,

    num_tpus=jax.device_count(),

    # Eval configuration
    eval_interval=50,
    eval_steps=5,

    # Additional info required by trainer
    base_dir=TRAINER_DIR,
    hf_token=HF_TOKEN,
)
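
As a rough illustration of what `use_lora=True` with `lora_rank=16` buys, the sketch below counts trainable parameters for a single weight matrix under the standard LoRA formulation, in which a frozen `d_out x d_in` weight is augmented by two low-rank factors of shapes `d_out x r` and `r x d_in`. The 2048 x 2048 shape is an illustrative assumption rather than a value read from the model config, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


d_in = d_out = 2048  # assumed shape, for illustration only
full = d_in * d_out  # parameters in the frozen dense weight
lora = lora_param_count(d_in, d_out, rank=16)
print(full, lora, f"{lora / full:.2%}")  # 4194304 65536 1.56%
```

Under these assumptions, the adapters train under 2% of the parameters of the matrix they wrap.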



In [15]:
checkpointer_config = checkpoint.CheckpointerConfig(
    checkpoint_dir=CHECKPOINT_DIR,
    max_to_keep=2,
    save_interval_steps=50,
    erase_existing_checkpoints=True,
)
checkpointer = checkpoint.Checkpointer(config=checkpointer_config)
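
To reason about how `save_interval_steps` and `max_to_keep` interact, the sketch below models one simple retention rule: save whenever the step is a multiple of the interval, and keep only the newest `max_to_keep` checkpoints. This rule is an assumption for illustration; the actual `Checkpointer` may apply different conditions, and `retained_checkpoints` is a hypothetical helper.

```python
from typing import List


def retained_checkpoints(num_steps: int, save_interval_steps: int, max_to_keep: int) -> List[int]:
    """Assumed rule: save when step % interval == 0, then keep the newest max_to_keep."""
    saved = [s for s in range(1, num_steps + 1) if s % save_interval_steps == 0]
    return saved[-max_to_keep:] if max_to_keep > 0 else []


print(retained_checkpoints(num_steps=50, save_interval_steps=50, max_to_keep=2))   # [50]
print(retained_checkpoints(num_steps=200, save_interval_steps=50, max_to_keep=2))  # [150, 200]
```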

# Step 1: Download the dataset

For this Colab, we use the cleaned **Alpaca dataset** curated by yahma (`yahma/alpaca-cleaned`), a filtered version of the original 52,000-entry Alpaca collection. Feel free to replace this section with your own data preparation code.

It's crucial to include the EOS_TOKEN (End of Sequence Token) in your tokenized output. Failing to do so may result in endless generation loops.
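
If you do write your own preparation code, one way to guarantee the EOS token survives truncation is sketched below. The helper `append_eos` is hypothetical and the id `2` is a placeholder; in practice you would read the id from `tokenizer.eos_token_id`.

```python
from typing import List


def append_eos(token_ids: List[int], eos_id: int, max_seq_length: int) -> List[int]:
    """Truncate to max_seq_length while guaranteeing the last token is EOS."""
    if max_seq_length < 1:
        raise ValueError("max_seq_length must be at least 1")
    # Reserve one slot for EOS so truncation can never cut it off.
    return token_ids[: max_seq_length - 1] + [eos_id]


EOS_ID = 2  # placeholder id; use tokenizer.eos_token_id with a real tokenizer
print(append_eos([11, 12, 13], EOS_ID, max_seq_length=8))          # [11, 12, 13, 2]
print(append_eos([11, 12, 13, 14, 15], EOS_ID, max_seq_length=4))  # [11, 12, 13, 2]
```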

In [16]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, token=HF_TOKEN)

# Download and load the data files
train_data, val_data = data.load_data(config=dataset_config)

# Create datasets for SFT (supervised fine-tuning)
train_dataset = data.SFTDataset(
    config=dataset_config,
    data=train_data,
    tokenizer=tokenizer,
)
val_dataset = data.SFTDataset(
    config=dataset_config,
    data=val_data,
    tokenizer=tokenizer,
)

# Create dataloaders
train_dataloader = data.create_dataloader(
    config=dataset_config,
    dataset=train_dataset,
    shuffle=True,
)
val_dataloader = data.create_dataloader(
    config=dataset_config,
    dataset=val_dataset,
    shuffle=False,
)


Downloading readme: 100%|██████████| 11.6k/11.6k [00:00<00:00, 21.3MB/s]
Downloading data: 100%|██████████| 44.3M/44.3M [00:00<00:00, 102MB/s] 
Generating train split: 51760 examples [00:00, 149637.77 examples/s]


# Step 2: Create Trainer and load the model

In [20]:
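# Note: this rebinds `trainer` from the imported module to a Trainer instance,
# so re-run the import cell before re-running the TrainerConfig cell.
trainer = trainer.Trainer(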
    trainer_config=trainer_config,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    checkpointer=checkpointer,
)

Creating TPU device mesh with shape (1, 4, 1)...
Loading model from HuggingFace...
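
For readers curious about what a device mesh of shape `(1, 4, 1)` looks like in plain JAX, here is a minimal sketch that places every visible device along the middle axis. The axis names are illustrative assumptions, not the names felafax uses internally, and the trainer may build its mesh differently.

```python
import numpy as np
import jax
from jax.sharding import Mesh

# Arrange all visible devices along the middle axis: (1, n_devices, 1).
devices = np.array(jax.devices()).reshape(1, -1, 1)

# Axis names are assumptions chosen for illustration only.
mesh = Mesh(devices, axis_names=("batch", "fsdp", "model"))
print(mesh.devices.shape)
```

On a host that exposes four devices this prints `(1, 4, 1)`; on a CPU-only machine with one device it prints `(1, 1, 1)`.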


In [21]:
trainer.train()

Started epoch 1 of 1...
Step 0 | Train Loss: 0.0000 | Val Loss: 0.0000 | Next Token Prediction Accuracy (train, val): 0.00%, 0.00%
Running eval for 5 steps...
Step 1 | Train Loss: 4.0994 | Val Loss: 3.6469 | Next Token Prediction Accuracy (train, val): 26.21%, 26.69%
Step 2 | Train Loss: 3.5532 | Val Loss: 3.6469 | Next Token Prediction Accuracy (train, val): 24.19%, 26.69%
Step 3 | Train Loss: 3.0454 | Val Loss: 3.6469 | Next Token Prediction Accuracy (train, val): 29.03%, 26.69%
Step 4 | Train Loss: 2.5019 | Val Loss: 3.6469 | Next Token Prediction Accuracy (train, val): 48.39%, 26.69%
Step 5 | Train Loss: 2.0899 | Val Loss: 3.6469 | Next Token Prediction Accuracy (train, val): 50.81%, 26.69%
Step 6 | Train Loss: 1.5054 | Val Loss: 3.6469 | Next Token Prediction Accuracy (train, val): 77.42%, 26.69%
Step 7 | Train Loss: 1.3784 | Val Loss: 3.6469 | Next Token Prediction Accuracy (train, val): 77.42%, 26.69%
Step 8 | Train Loss: 1.2931 | Val Loss: 3.6469 | Next Token Prediction Accurac

# Step 3: Export fine-tuned model

In [22]:
trainer.export(export_dir=EXPORT_DIR)

Model and tokenizer saved to /finetuned_export
Hugging Face model saved at: /finetuned_export


In [23]:
utils.upload_dir_to_hf(
    dir_path=EXPORT_DIR,
    repo_name="felarof01/test-llama3-alpaca-from-colab",
    token=HF_TOKEN,
)

tokenizer.json: 100%|██████████| 9.09M/9.09M [00:00<00:00, 13.0MB/s]


Model uploaded to Hugging Face Hub at https://huggingface.co/felarof01/test-llama3-alpaca-from-colab
