<!-- Banner Image -->
<img src="https://uohmivykqgnnbiouffke.supabase.co/storage/v1/object/public/landingpage/brevdevnotebooks.png" width="100%">

<!-- Links -->
<center>
  <a href="https://console.brev.dev" style="color: #06b6d4;">Console</a> •
  <a href="https://brev.dev" style="color: #06b6d4;">Docs</a> •
  <a href="/" style="color: #06b6d4;">Templates</a> •
  <a href="https://discord.gg/NVDyv7TUgJ" style="color: #06b6d4;">Discord</a>
</center>

# Finetune and deploy the new Llama3-8b model using SFT and vLLM 🤙

Welcome!

It's April 18th and Llama3 just dropped! So let's figure out how to finetune it and deploy it so you can start using it :). To get access, you'll need to request it via [HuggingFace](https://huggingface.co/meta-llama/Meta-Llama-3-8B). It took me about 20 minutes to get access, so go ahead and sign up to get your token enabled!

Meta released four models today:
1. Llama-3-8B
2. Llama-3-8B-Instruct
3. Llama-3-70B
4. Llama-3-70B-Instruct

Llama-3 has an 8k context length, which is pretty small compared to some of the newer models that have been released, and it was trained on 15 trillion tokens using a 24k GPU cluster. Luckily for finetuning, we only need a fraction of that compute power.

In this notebook, we're going to walk through the whole flow of finetuning the base model and then deploying it using vLLM. We'll also be releasing a guide that uses Direct Preference Optimization as a way of finetuning soon!

**Note that we will be using the base model in this notebook and not the instruct model. Additionally, we will be running through a full finetune with no quantization. This notebook requires 2 A100-80GB GPUs, which might cost a bit more. If you're looking for a lighter Llama3 finetune, check out our Llama3 finetune that uses Direct Preference Optimization [here](https://github.com/brevdev/notebooks/blob/main/llama3-dpo.ipynb).**
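
To get a feel for why a full finetune asks for two 80GB cards, here is a back-of-the-envelope sketch. Every number in it is an assumption for illustration: roughly 8 billion parameters, 4 bytes per value for weights and gradients, and two extra 4-byte AdamW states per parameter. Activations and framework overhead are ignored, so treat the result as a rough lower bound, not a measurement.

```python
# Rough lower bound on GPU memory for a full (no LoRA, no quantization) finetune.
# All constants are assumptions for illustration, not measured values.
NUM_PARAMS = 8e9        # assumed: about 8 billion parameters
BYTES_PER_VALUE = 4     # assumed: 32-bit storage for weights, gradients, optimizer states
ADAMW_STATES = 2        # AdamW keeps two running averages per parameter


def full_finetune_gb(num_params=NUM_PARAMS,
                     bytes_per_value=BYTES_PER_VALUE,
                     optimizer_states=ADAMW_STATES):
    """Return an itemized memory estimate in gigabytes (1 GB = 1e9 bytes)."""
    weights = num_params * bytes_per_value / 1e9
    gradients = num_params * bytes_per_value / 1e9
    optimizer = num_params * bytes_per_value * optimizer_states / 1e9
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = full_finetune_gb()
for name, gigabytes in estimate.items():
    print(f"{name:>10}: {gigabytes:6.1f} GB")
```

Under these assumptions the total lands above what a single 80GB card can hold but below the 160GB that two of them provide, which is consistent with the hardware requirement above.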

#### Help us make this tutorial better! Please provide feedback on the [Discord channel](https://discord.gg/T9bUNqMS8d) or on [X](https://x.com/brevdev).

A note about running Jupyter Notebooks: Press Shift + Enter to run a cell. A * in the left-hand cell box means the cell is running. A number means it has completed. If your Notebook is acting weird, you can stop a long-running process by interrupting the kernel (Kernel tab -> Interrupt Kernel) or even restarting the kernel (Kernel tab -> Restart Kernel). Note that restarting the kernel will require you to run everything from the beginning.

## Table of Contents
1. Install dependencies
2. Download model
3. Finetuning flow
4. Deploy as an OpenAI compatible endpoint

## 1. Install Dependencies

In [1]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


In [2]:
!pip install -U bitsandbytes
!pip install -U git+https://github.com/huggingface/transformers.git
!pip install -U git+https://github.com/huggingface/peft.git
!pip install -U git+https://github.com/huggingface/accelerate.git
!pip install trl

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl (69.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69.1/69.1 MB 10.1 MB/s eta 0:00:00
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.0
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-r1x4ibue
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-r1x4ibue
  Resolved https://github.com/huggingface/transformers.git to commit 8de7b1ba8d126a6fc9f9bcc3173a71b46f0c3601
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packag

In [3]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

In [4]:
# Input access token
from huggingface_hub import notebook_login

notebook_login()


## 2. Load in Llama 3 and our dataset

Because we are using the base model, there is no exact prompt template we have to follow. The dataset we are using follows Llama3's template format, so it should be fine for downstream tasks that use the Llama3 chat format. If you're bringing your own data, you can format it however you want as long as you use the same formatting downstream.

Here's the official [Llama3 chat template](https://huggingface.co/blog/llama3#how-to-prompt-llama-3)
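
If you're bringing your own data, a small helper along these lines can render conversations in that layout. This is a minimal sketch of the template described at the link above, not code from the dataset's author; the function name `format_llama3_chat` and its parameters are made up for illustration.

```python
def format_llama3_chat(messages, include_bos=True, add_generation_prompt=False):
    """Render [{"role": ..., "content": ...}, ...] in the Llama 3 chat layout.

    Hypothetical helper for illustration only.
    """
    parts = ["<|begin_of_text|>"] if include_bos else []
    for message in messages:
        # Each turn is a role header, a blank line, the text, then an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Leave an open assistant header so the model continues with its answer.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = format_llama3_chat(
    [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ],
    include_bos=False,
)
print(example)
```

The tokenizer in this notebook is created with `add_bos_token=True`, so `include_bos=False` is probably the safer choice here to avoid a doubled `<|begin_of_text|>` token.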

In [5]:
base_model_id = "meta-llama/Meta-Llama-3-8B"
dataset_name = "scooterman/guanaco-llama3-1k"
new_model = "brev-llama3-8b-SFT"

In [6]:
from datasets import load_dataset

dataset = load_dataset(dataset_name, split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/2.06k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/977k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [17]:
dataset[2]['text']

'<|start_header_id|>user<|end_header_id|>{{Compose a professional email with the following points:\n\nMe chinese cook 10 years\nGood good cook\nPeople love my food\nCan you hire me?}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>{{Thanks for the information. Unfortunately, your initial message does not contain enough details or structure to compose a compelling professional email to a potential employer. Here are some suggestions for how we can improve the message:\n\n1) Let\'s first clarify the type of cooking position you are interested in. Are you looking for a job as a line cook, sous chef, executive chef, or something else? This will help determine how to emphasize relevant experience and skills.\n\n2) What specific experience do you have? For example, "Over 10 years of experience preparing authentic Sichuan and Cantonese cuisine." Highlight any relevant skills, like knife skills, cooking techniques, menu development, etc. Mention the types of cuisine you specialize in.\n

In [18]:
dataset

Dataset({
    features: ['text'],
    num_rows: 1000
})

In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

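# Load the base model without quantization; device_map="auto" places it on the available GPUs
model = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_eos_token=True,
    add_bos_token=True,
)
# Reuse the end-of-sequence token as the padding token
tokenizer.pad_token = tokenizer.eos_token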

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/177 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

In [22]:
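# French prompt, in English: "What is the capital of France?"
question = "Quelle est la capitale de la France ?"
# Tokenize the prompt and move the tensors to the same device as the model
input_q = tokenizer(question, return_tensors="pt").to(model.device)
# max_length counts the prompt tokens plus the generated tokens
gen_answer = model.generate(**input_q, max_length=50, num_return_sequences=1)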

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [23]:
print(gen_answer[0])

tensor([128000,   2232,   6853,   1826,   1208,  61510,   1604,    409,   1208,
          9822,    949,  12366,    198,   8921,   9822,   1826,    653,  21935,
         10109,    978,    665,   4606,   3056,   1748,   1604,     13,   5034,
          9822,   1826,    653,  21935,  25945,    436,  12333,    665,  87810,
          1880,    665,   7829,     13,  46408,   1826,  28463,    326,  22827,
           951,  21935,   3625,   5636,   4034])


In [24]:
print("response:", tokenizer.decode(gen_answer[0]))

response: <|begin_of_text|>Quelle est la capitale de la France? Paris
La France est un pays situé en Europe occidentale. La France est un pays très riche en histoire et en culture. Elle est aussi l'un des pays les plus visit


In [26]:
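# Sample three continuations; do_sample=True is spelled out because top_k/top_p only apply when sampling
outputs = model.generate(**input_q, max_length=50, do_sample=True, num_return_sequences=3, top_k=50, top_p=0.9)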

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [28]:
for i, answer in enumerate(outputs):
    print(f"Answer {i+1} :", tokenizer.decode(answer))

Answer 1 : <|begin_of_text|>Quelle est la capitale de la France? The capital of France is Paris. The capital of France is Paris. Paris is the capital of France. It is also the largest city in France, with a population of over 2 million people
Answer 2 : <|begin_of_text|>Quelle est la capitale de la France? The French capital is Paris.
Quelle est la capitale de l'Espagne? The Spanish capital is Madrid.
Quelle est la capitale de l'Angleterre? The English
Answer 3 : <|begin_of_text|>Quelle est la capitale de la France? - Which is the capital of France?
Quelle est la capitale de la France?
This is the French version of the page which is Which is the capital of France?<|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|>
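
To make the `top_k` and `top_p` arguments used above a little more concrete, here is a plain-Python sketch of the idea on a made-up next-token distribution. It is an illustration of the technique, not the library's implementation (which works on logit tensors and differs in detail); the function name and the toy probabilities are hypothetical.

```python
import random


def top_k_top_p_filter(probs, top_k=50, top_p=0.9):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    cumulative probability reaches top_p, and renormalize what is left."""
    # Step 1: top-k. Sort by probability and cut the tail.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Step 2: top-p (nucleus). Add tokens until the running sum reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Step 3: renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up distribution over five candidate next tokens.
toy_probs = {"Paris": 0.60, "The": 0.25, "-": 0.10, "Lyon": 0.04, "Nice": 0.01}
filtered = top_k_top_p_filter(toy_probs, top_k=50, top_p=0.9)
print(filtered)

# Draw three tokens from the filtered distribution.
rng = random.Random(0)
sampled = rng.choices(list(filtered), weights=list(filtered.values()), k=3)
print(sampled)
```

Because each returned sequence is drawn independently from a filtered distribution like this one, the three answers above can differ from each other even though the prompt is the same.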


In [31]:
# Simple interactive loop: type "exit" to stop
while True:
    ma_question = input("Ask your question: ")
    if ma_question == "exit":
        break
    ma_question_tok = tokenizer(ma_question, return_tensors="pt").to(model.device)
    gen_reponse = model.generate(**ma_question_tok, max_length=100, num_return_sequences=1)
    print(f"Chatbot: {tokenizer.decode(gen_reponse[0])}")

Ask your question: exit


## 3. Set our Training Arguments

A lot of tutorials simply paste a list of arguments, leaving it up to the reader to figure out what each one does. Below, I've added comments that explain each argument!

In [10]:
# Output directory where the results and checkpoint are stored
output_dir = "./results"

# Number of training epochs - how many times does the model see the whole dataset
num_train_epochs = 1 #Increase this for a larger finetune

# Enable bf16 mixed-precision training. Since we are on an A100, which supports
# bf16 computation, we can set this to True
bf16 = True

# Batch size per GPU: the number of training examples processed in a single forward and backward pass.
per_device_train_batch_size = 4

# Gradients are accumulated over multiple mini-batches before updating the model weights.
# This allows for effectively training with a larger batch size on hardware with limited memory
gradient_accumulation_steps = 2

# Memory optimization technique that reduces GPU memory usage during training by storing only
# some intermediate activations and recomputing the rest during the backward pass, trading
# computational time for lower memory consumption.
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Number of training steps (a positive value overrides num_train_epochs; use -1 to train by epochs instead)
max_steps = 5

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 100

# Log every X update steps
logging_steps = 5
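
Here is a quick sanity check of what these numbers imply for the run. It is a sketch under two assumptions: the model is split across the GPUs as a single training process (so there is one data-parallel worker), and the warmup step count is the ratio times the total steps, rounded up.

```python
import math

dataset_size = 1000               # rows in the dataset loaded earlier
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
max_steps = 5
warmup_ratio = 0.03
data_parallel_workers = 1         # assumption: one training process, model split across GPUs

# Examples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * data_parallel_workers
)
# Optimizer updates needed to see the whole dataset once.
steps_per_epoch = math.ceil(dataset_size / effective_batch_size)
# Examples actually seen before max_steps stops the run.
examples_seen = max_steps * effective_batch_size
# Assumption: warmup steps are rounded up from ratio * total steps.
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(f"effective batch size: {effective_batch_size}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"examples seen:        {examples_seen} of {dataset_size}")
print(f"warmup steps:         {warmup_steps}")
```

With `max_steps = 5` the run only touches a small slice of the dataset, which is handy as a smoke test. For a real finetune you would likely raise it or let `num_train_epochs` decide.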

In [11]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
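    group_by_length=group_by_length,
    # Pass the flag defined above; otherwise gradient checkpointing is never enabled
    gradient_checkpointing=gradient_checkpointing,
    report_to="wandb",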
)
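
Since no scheduler type is passed, the learning rate should follow the default linear schedule: ramp up during warmup, then decay linearly toward zero. The sketch below reproduces that shape in plain Python for this run, assuming one warmup step (0.03 × 5, rounded up). The function name is made up, and this is an illustration of the schedule's shape, not the library's code.

```python
def linear_schedule_lr(step, base_lr=2e-4, total_steps=5, warmup_steps=1):
    """Learning rate at a given optimizer step for linear warmup plus linear decay."""
    if step < warmup_steps:
        # Warmup: climb from 0 toward base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


schedule = [linear_schedule_lr(step) for step in range(5)]
for step, lr in enumerate(schedule):
    print(f"step {step}: lr = {lr:.2e}")
```

On a run this short the rate spends almost no time at its peak, which is one more reason to treat a 5-step run as a quick check and not a real finetune.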

## Run our training job using WandB for logging

Weights and Biases is an industry-standard tool for monitoring and evaluating training jobs. I highly suggest setting up an account to monitor this run and to use for future ML jobs!

In [12]:
!pip install wandb



In [13]:
import wandb

wandb.login()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [15]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    # dataset_text_field="text" is left out on purpose: with the TRL version installed above it raises
    # TypeError: SFTTrainer.__init__() got an unexpected keyword argument 'dataset_text_field'
    tokenizer=tokenizer,
    args=training_arguments,
)

In [None]:
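# Run the finetune; metrics are reported to Weights and Biases
trainer.train()

# Save the trained model to the local directory named by new_model
trainer.model.save_pretrained(new_model)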

## Deploy for inference!

To deploy this model for extremely quick inference, we use vLLM and host an OpenAI-compatible endpoint. You might have to restart the kernel again and then run only the cells below.

In [None]:
!nvidia-smi

In [None]:
!pip install vllm

In [None]:
!python -O -u -m vllm.entrypoints.openai.api_server \
    --host=127.0.0.1 \
    --port=8000 \
    --model=brev-llama3-8b-SFT \
    --tokenizer=meta-llama/Meta-Llama-3-8B \
    --tensor-parallel-size=2



# Open up a terminal and run
#curl http://localhost:8000/v1/completions \
#    -H "Content-Type: application/json" \
#    -d '{
#        "model": "brev-llama3-8b-SFT",
#        "prompt": "What is San Francisco",
#        "max_tokens": 30,
#        "temperature": 0
#    }'
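
If you'd rather call the endpoint from Python than from curl, something like the sketch below should work once the server is running. It uses only the standard library and mirrors the curl request above. The helper names `build_completion_request` and `complete` are made up for illustration, and the response parsing assumes the usual OpenAI completions shape (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(prompt, model="brev-llama3-8b-SFT",
                             base_url="http://localhost:8000",
                             max_tokens=30, temperature=0):
    """Build the same POST request as the curl example (hypothetical helper)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text. Needs the server to be up."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as response:
        body = json.load(response)
    # Assumes the standard completions response layout.
    return body["choices"][0]["text"]


# Building the request does not contact the server, so this part is safe to run anywhere.
request = build_completion_request("What is San Francisco")
print(request.full_url)
print(request.data.decode("utf-8"))
```

With the server from the cell above running, `complete("What is San Francisco")` should return the generated continuation as a string.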