# Launch SFT jobs in the notebook
<a target="_blank" href="https://colab.research.google.com/github/ai-hero/llm-research-orchestration/blob/main/notebooks/fine_tuning_research.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
!pip uninstall aihero-research-finetuning -y
!pip uninstall aihero-research-config -y
!pip install -q git+https://github.com/ai-hero/llm-research-fine-tuning.git@main#egg=aihero-research-finetuning
!pip install numpy==1.25.2  # Workaround for a Colab bug: https://github.com/numpy/numpy/issues/25150

Found existing installation: aihero-research-finetuning 0.3.2
Uninstalling aihero-research-finetuning-0.3.2:
  Successfully uninstalled aihero-research-finetuning-0.3.2
Found existing installation: aihero-research-config 0.3.1
Uninstalling aihero-research-config-0.3.1:
  Successfully uninstalled aihero-research-config-0.3.1
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for aihero-research-finetuning (pyproject.toml) ... done
  Building wheel for aihero-research-config (pyproject.toml) ... done


In [None]:
# Set the environment variables the application needs.

## NOTE: As a best practice, don't set these values here; store them in your notebook secrets instead.

## wandb
%env WANDB_API_KEY=
%env WANDB_USERNAME=

## huggingface
%env HF_TOKEN=

# S3 endpoint to access data and save model
%env S3_ENDPOINT=s3.amazonaws.com
%env S3_ACCESS_KEY_ID=
%env S3_SECRET_ACCESS_KEY=
%env S3_REGION=us-east-2
%env S3_SECURE=true
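
Because the note above recommends keeping credentials out of the notebook, it can help to check which variables are still unset before launching a job. The following is a minimal sketch of such a check. The helper `missing_env_vars` is a hypothetical illustration and is not part of the aihero packages; the variable names are the ones listed in the cell above.

```python
import os

# Variable names taken from the cell above; the helper is a hypothetical illustration.
REQUIRED_ENV_VARS = [
    "WANDB_API_KEY",
    "WANDB_USERNAME",
    "HF_TOKEN",
    "S3_ENDPOINT",
    "S3_ACCESS_KEY_ID",
    "S3_SECRET_ACCESS_KEY",
    "S3_REGION",
    "S3_SECURE",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Unset or empty variables:", ", ".join(missing))
else:
    print("All required variables are set.")
```

Passing an explicit mapping as `env` makes the check easy to try without touching the real environment.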

## Preparing the dataset for fine-tuning
In this example, we'll prepare the Databricks Dolly 15k dataset for fine-tuning.

In [3]:
from datasets import DatasetDict, load_dataset

In [4]:
dolly_dataset = load_dataset("databricks/databricks-dolly-15k")

In [5]:
dolly_dataset["train"].to_pandas().head()

Unnamed: 0,instruction,context,response,category
0,When did Virgin Australia start operating?,"Virgin Australia, the trading name of Virgin A...",Virgin Australia commenced services on 31 Augu...,closed_qa
1,Which is a species of fish? Tope or Rope,,Tope,classification
2,Why can camels survive for long without water?,,Camels use the fat in their humps to keep them...,open_qa
3,"Alice's parents have three daughters: Amy, Jes...",,The name of the third daughter is Alice,open_qa
4,When was Tomoaki Komorida born?,Komorida was born in Kumamoto Prefecture on Ju...,"Tomoaki Komorida was born on July 10,1981.",closed_qa


First, let's build our prompt/completion dataset.

In [6]:
def build_training_example(row):
    """Build a training example from a row in the dataset."""
    prompt = f"## Instruction: {row.get('instruction')}\n"
    if row.get("context", ""):
        prompt = f"{prompt}## Context: {row['context']}\n"
    prompt = f"{prompt}## Response:"

    completion = row["response"]
    return {"prompt": prompt, "completion": completion}


extracted_dataset = dolly_dataset.map(build_training_example).remove_columns(
    ["instruction", "context", "response", "category"]
)
extracted_dataset["train"].to_pandas().head()

Unnamed: 0,prompt,completion
0,## Instruction: When did Virgin Australia star...,Virgin Australia commenced services on 31 Augu...
1,## Instruction: Which is a species of fish? To...,Tope
2,## Instruction: Why can camels survive for lon...,Camels use the fat in their humps to keep them...
3,## Instruction: Alice's parents have three dau...,The name of the third daughter is Alice
4,## Instruction: When was Tomoaki Komorida born...,"Tomoaki Komorida was born on July 10,1981."
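
To make the prompt layout explicit, the sketch below mirrors the string formatting of `build_training_example` as a standalone function, so the same layout could be reused when prompting the fine-tuned model. The name `format_prompt` and the placeholder context string are assumptions introduced for illustration; the first instruction is taken from the table above.

```python
def format_prompt(instruction, context=""):
    """Mirror the prompt layout used by build_training_example above."""
    prompt = f"## Instruction: {instruction}\n"
    if context:
        # The context section is only emitted when the row has a non-empty context.
        prompt += f"## Context: {context}\n"
    prompt += "## Response:"
    return prompt


# A row without context yields two lines; a row with context yields three.
no_context = format_prompt("Which is a species of fish? Tope or Rope")
with_context = format_prompt("Summarize the passage.", "<passage text>")  # placeholder values

print(no_context)
print("---")
print(with_context)
```

Every prompt ends with `## Response:`, so the completion is simply the text the model is expected to produce after that marker.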


Next, let's split the data into train, validation, and test sets.

In [7]:
def build_dataset(dataset, train_size=0.8, val_size=0.1, test_size=0.1):
    """Build the dataset dict by splitting the dataset into train, validation and test sets."""
    train_testvalid = dataset.train_test_split(train_size=train_size)
    test_valid = train_testvalid["test"].train_test_split(test_size=test_size / (test_size + val_size))
    return DatasetDict(
        {
            "train": train_testvalid["train"],
            "val": test_valid["train"],
            "test": test_valid["test"],
        }
    )


new_dataset = build_dataset(extracted_dataset["train"])
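
The second split uses `test_size / (test_size + val_size)` because it is applied only to the rows held out by the first split. With the defaults, that fraction is 0.5 of the remaining 20%. The sketch below works through the arithmetic in plain Python. It assumes the rounding follows the scikit-learn convention (round down for a fractional `train_size`, round up for a fractional `test_size`), and the helper `expected_split_sizes` is hypothetical. The row count of 15,011 is the sum of the three split sizes reported when the dataset is saved.

```python
import math


def expected_split_sizes(n_rows, train_size=0.8, val_size=0.1, test_size=0.1):
    """Estimate the split sizes produced by the two-stage split in build_dataset."""
    # First split: a fractional train_size is assumed to round down.
    n_train = math.floor(train_size * n_rows)
    n_holdout = n_rows - n_train
    # Second split: the test share of the held-out rows is assumed to round up.
    second_fraction = test_size / (test_size + val_size)
    n_test = math.ceil(second_fraction * n_holdout)
    n_val = n_holdout - n_test
    return {"train": n_train, "val": n_val, "test": n_test}


print(expected_split_sizes(15011))
# {'train': 12008, 'val': 1501, 'test': 1502}
```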

In [8]:
import os
import shutil
from pathlib import Path


def save_dataset(dataset_name, new_dataset):
    """Save the dataset to disk and return the path to the dataset."""
    current_directory = Path(".")
    shutil.rmtree(current_directory / dataset_name, ignore_errors=True)
    os.mkdir(current_directory / dataset_name)
    dataset_path = (current_directory / dataset_name).as_posix()
    new_dataset.save_to_disk(dataset_path)
    return dataset_path


dataset_name = "dolly-15k"
dataset_path = save_dataset(dataset_name, new_dataset)

Saving the dataset (0/1 shards):   0%|          | 0/12008 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1501 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1502 [00:00<?, ? examples/s]
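
Before launching a job, it can be useful to confirm that the saved directory contains one sub-directory per split. The sketch below uses only the standard library. It assumes that `save_to_disk` on a `DatasetDict` writes a sub-directory named after each split, and the helper `missing_split_dirs` is a hypothetical illustration.

```python
import tempfile
from pathlib import Path


def missing_split_dirs(dataset_path, expected_splits=("train", "val", "test")):
    """Return the expected split names that have no sub-directory under dataset_path."""
    root = Path(dataset_path)
    return [split for split in expected_splits if not (root / split).is_dir()]


# Demonstration on a temporary directory that only contains a "train" split.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train").mkdir()
    print(missing_split_dirs(tmp))
```

In the notebook, calling `missing_split_dirs(dataset_path)` after saving should return an empty list if all three splits were written.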

## Running the fine-tuning

In [9]:
def build_config(dataset_name, dataset_path):
    """Build the config file for the dataset."""
    current_directory = Path(".")
    config_path = (current_directory / f"{dataset_name}.yaml").as_posix()
    config_yaml = f"""
project:
  name: "{dataset_name}"

task: "completion"

dataset:
  name: "{dataset_name}"
  type: "local"
  task: "completion"
  path: "{dataset_path}"

base:
  name: "meta-llama/Llama-2-7b-hf"
  type: "huggingface"

output:
  name: "rparundekar/llama2-7b-mmlu"
  type: "huggingface"

trainer:
  packing: false
  max_seq_length: 512

sft:
  per_device_train_batch_size: 1
  per_device_eval_batch_size: 1
  learning_rate: 0.0002
  lr_scheduler_type: "cosine"
  optim: "paged_adamw_8bit"
  warmup_ratio: 0.1
  max_steps: 500
  gradient_accumulation_steps: 4
  gradient_checkpointing: True
  gradient_checkpointing_kwargs:
    use_reentrant: False
  logging_strategy: "steps"
  logging_steps: 5
  evaluation_strategy: "steps"
  eval_steps: 100
peft:
  r: 64  # rank of the LoRA matrices
  lora_alpha: 16  # LoRA scaling factor (updates are scaled by lora_alpha / r)
  lora_dropout: 0.1  # dropout applied to the LoRA layers
  bias: "none"  # which bias parameters to train ("none" trains no biases)
  task_type: "CAUSAL_LM"
  target_modules:  # names of the modules to which LoRA adapters are added
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "gate_proj"
    - "up_proj"
    - "down_proj"
    - "lm_head"
quantized: true
"""
    with open(config_path, "w", encoding="utf-8") as f:
        f.write(config_yaml)

    return config_path


config_path = build_config(dataset_name, dataset_path)
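
Several of the values in this config interact, so it may help to work out what they imply for the run. The sketch below derives the effective batch size, the number of training examples seen, the warmup length, and the LoRA scaling factor from the values in the YAML above. It assumes a single GPU (`num_devices = 1`), that warmup steps are computed as the ceiling of `warmup_ratio * max_steps`, and that the training split has 12,008 rows, as reported when the dataset was saved.

```python
import math

# Values copied from the YAML config above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 500
warmup_ratio = 0.1
lora_r = 64
lora_alpha = 16

num_devices = 1  # ASSUMPTION: a single GPU, as on a typical Colab runtime
n_train_rows = 12008  # size of the training split reported when saving the dataset

# Examples consumed per optimizer step across all devices.
examples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = examples_per_step * max_steps
fraction_of_epoch = examples_seen / n_train_rows

# ASSUMPTION: warmup length is the ceiling of warmup_ratio * max_steps.
warmup_steps = math.ceil(warmup_ratio * max_steps)

# Standard LoRA scales the low-rank update by lora_alpha / r.
lora_scaling = lora_alpha / lora_r

print(f"examples per optimizer step: {examples_per_step}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"fraction of one epoch: {fraction_of_epoch:.3f}")
print(f"warmup steps: {warmup_steps}")
print(f"LoRA scaling: {lora_scaling}")
```

Under these assumptions the run covers well under one pass over the training split, which is worth keeping in mind when reading the evaluation metrics.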

In [10]:
# Load the training job (this validates the config against the schema)
from aihero.research.config.schema import TrainingJob

training_config = TrainingJob.load(config_path)

In [None]:
from aihero.research.finetuning.train import TrainingJobRunner

TrainingJobRunner(training_config).run()

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Loading model


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading dataset
Loading dataset locally:  ['test', 'train', 'val', 'dataset_dict.json']
Starting training
trainable params: 162,218,048 || all params: 6,900,641,856 || trainable%: 2.350767528370625


Map:   0%|          | 0/12008 [00:00<?, ? examples/s]

Map:   0%|          | 0/1501 [00:00<?, ? examples/s]

[34m[1mwandb[0m: Currently logged in as: [33mrahulparundekar[0m. Use [1m`wandb login --relogin`[0m to force relogin


Testing custom code, on ground truth if provided




Updating records_table with predictions, test results, and errors
Skipping custom tests
Skipping custom metrics
Building table




Metrics: {'passed': 0.0}
Generating initial predictions for sample split


  0%|          | 0/100 [00:00<?, ?it/s]
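
The `trainable params` figure in the training log above can be reproduced from the `peft` settings. A rank-`r` LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below applies this to the listed target modules using the Llama-2-7B dimensions (hidden size 4096, MLP size 11008, 32 layers). The vocabulary size of 32,001 is an assumption: it corresponds to the base vocabulary of 32,000 plus one added token (for example a padding token), and it is the value that makes the total match the log.

```python
RANK = 64  # peft.r from the config
HIDDEN = 4096  # Llama-2-7B hidden size
INTERMEDIATE = 11008  # Llama-2-7B MLP size
NUM_LAYERS = 32  # Llama-2-7B decoder layers
VOCAB = 32001  # ASSUMPTION: 32,000 base tokens plus one added token


def lora_params(d_in, d_out, rank=RANK):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = (
    4 * lora_params(HIDDEN, HIDDEN)  # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(HIDDEN, INTERMEDIATE)  # gate_proj, up_proj
    + lora_params(INTERMEDIATE, HIDDEN)  # down_proj
)
trainable = NUM_LAYERS * per_layer + lora_params(HIDDEN, VOCAB)  # plus lm_head

all_params = 6_900_641_856  # "all params" value copied from the log
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Most of the adapter parameters sit in the decoder layers; the `lm_head` adapter contributes only a small share of the total.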