<a href="https://colab.research.google.com/github/Eddiebee/AI-Craft/blob/main/fine_tuning_LLM_with_Comet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning LLMs

This notebook fine-tunes an LLM (`google/flan-t5-base`) for emotion classification and tracks the run with Comet. You will need a GPU: in Colab, click "Runtime -> Change runtime type" and select a GPU.

Let's install and import the necessary libraries:

In [1]:
! pip install transformers[torch] comet-ml comet-llm datasets evaluate sentencepiece --quiet

In [2]:
import json
import os
import pickle

import comet_llm
import comet_ml
import evaluate
import numpy as np
import pandas as pd
import torch
import transformers
from datasets import Dataset, DatasetDict, Features, Value, load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    Trainer,
    TrainingArguments,
)

# fix the random seeds (Python, NumPy, PyTorch) for reproducibility
transformers.set_seed(35)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


comet_ml is installed but `COMET_API_KEY` is not set.


In [3]:
from google.colab import userdata
COMET_API_KEY = userdata.get('COMET_API_KEY')
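
The cell above reads the key from Colab's secret store, while the warning printed earlier refers to a `COMET_API_KEY` setting that was not found. The following is a minimal sketch, assuming you also want the notebook to work outside Colab and that exporting the key as an environment variable is acceptable in your setup. `resolve_comet_api_key` is a hypothetical helper written for illustration, not part of `comet_ml` or Colab.

```python
import os


def resolve_comet_api_key(env_var="COMET_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment.

    Hypothetical helper for illustration only.
    """
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        # Outside Colab, fall back to an environment variable (None if unset).
        return os.environ.get(env_var)
    return userdata.get(env_var)


api_key = resolve_comet_api_key()
if api_key:
    # Export the key so that code reading the environment can find it.
    os.environ["COMET_API_KEY"] = api_key
```

Keeping the lookup in one function means the rest of the notebook does not need to know whether it is running in Colab.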

### Dataset Preparation

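The code below loads the training and validation data from JSONL files and splits the validation file in half to create the validation and test sets. With `sample = True`, each split is reduced to 50 randomly chosen examples so the notebook runs quickly. You can choose different sample sizes to run different experiments; more samples typically lead to a better-performing model.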

In [4]:
# loads the data from the jsonl files
emotion_dataset_train = pd.read_json(path_or_buf="https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/merged_training_sample_prepared_train.jsonl", lines=True)
emotion_dataset_val_temp = pd.read_json(path_or_buf="https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/merged_training_sample_prepared_valid.jsonl", lines=True)

# use the first half of emotion_dataset_val_temp as the validation set
emotion_dataset_val = emotion_dataset_val_temp.iloc[:int(len(emotion_dataset_val_temp)/2)]

# use the second half of emotion_dataset_val_temp as the test set
emotion_dataset_test = emotion_dataset_val_temp.iloc[int(len(emotion_dataset_val_temp)/2):]

# set to False to use the full datasets instead of 50-example samples
sample = True

if sample:
    final_ds = DatasetDict({
        "train": Dataset.from_pandas(emotion_dataset_train.sample(50)),
        "validation": Dataset.from_pandas(emotion_dataset_val.sample(50)),
        "test": Dataset.from_pandas(emotion_dataset_test.sample(50))
    })
else:
    final_ds = DatasetDict({
        "train": Dataset.from_pandas(emotion_dataset_train),
        "validation": Dataset.from_pandas(emotion_dataset_val),
        "test": Dataset.from_pandas(emotion_dataset_test)
    })
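
The split-and-sample logic above can be illustrated without pandas. This is a simplified sketch on plain Python lists, not the notebook's actual code path: `split_in_half` and `sample_rows` are hypothetical helpers, and the 11 toy records stand in for the real validation file.

```python
import random


def split_in_half(rows):
    """Split a list into two halves; the second half gets the extra item if the length is odd."""
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


def sample_rows(rows, n, seed=35):
    """Draw up to n rows without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


# Toy stand-in for the validation file: 11 hypothetical records.
records = [{"prompt": f"text {i}", "completion": "joy"} for i in range(11)]

val_rows, test_rows = split_in_half(records)
val_sample = sample_rows(val_rows, 3)
```

Because the two halves never overlap, the test set stays unseen during validation, and passing a fixed seed keeps the sampled subset stable between runs.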

### Tokenize Dataset

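The code below loads the tokenizer for the model checkpoint and uses it to tokenize the datasets. Each example is first assembled into a single string (instruction prefix, input text, and emotion label) and then converted to token IDs, which is the format the model expects.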

In [5]:
# model checkpoint
model_checkpoint = "google/flan-t5-base"

# We'll create a tokenizer from model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

# We'll need padding to have same length sequences in a batch
tokenizer.pad_token = tokenizer.eos_token

# prefix
prefix_instruction = "Classify the provided piece of text into one of the following emotion labels.\n\nEmotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']"

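# Define a tokenization function that first concatenates the instruction, text, and target
def tokenize_function(example):
    # str.strip treats its argument as a set of characters, so this removes
    # leading/trailing newlines and "#" characters (the "\n\n###\n\n" separator)
    text = example["prompt"].strip("\n\n###\n\n")
    label = example["completion"].strip(" ").strip("\n")
    merged = prefix_instruction + "\n\n" + "Text: " + text + "\n\n" + "Emotion output:" + label
    batch = tokenizer(merged, padding='max_length', truncation=True)
    # the labels are a copy of the full tokenized sequence (instruction, text, and label)
    batch["labels"] = batch["input_ids"].copy()
    return batch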

# Apply it on our dataset, and remove the text columns
tokenized_datasets = final_ds.map(tokenize_function, remove_columns=["prompt", "completion"])



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
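
To see what string the tokenizer actually receives, the assembly step in `tokenize_function` can be reproduced without loading a model. This is a sketch under assumptions: the example record below is hypothetical and only mimics a `prompt` ending in the `"\n\n###\n\n"` separator and a `completion` with surrounding whitespace; `build_training_text` is an illustrative name, not part of the notebook.

```python
PREFIX = (
    "Classify the provided piece of text into one of the following emotion labels.\n\n"
    "Emotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']"
)


def build_training_text(prompt, completion):
    """Mirror the string assembly in tokenize_function, without the tokenizer."""
    # str.strip treats its argument as a set of characters, so this removes
    # any run of "\n" and "#" characters from both ends of the prompt.
    text = prompt.strip("\n\n###\n\n")
    label = completion.strip(" ").strip("\n")
    return PREFIX + "\n\n" + "Text: " + text + "\n\n" + "Emotion output:" + label


# Hypothetical record in the prompt/completion layout assumed above.
example = {"prompt": "i feel great today\n\n###\n\n", "completion": " joy\n"}
print(build_training_text(example["prompt"], example["completion"]))
```

Note that because `strip` works on a character set, a prompt that begins with `#` would lose that leading character as well.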


### Fine-tuning the Model

Once the datasets have been tokenized, it's time to fine-tune the model. We use the Hugging Face `Seq2SeqTrainer` to simplify the training code. The code below also initializes a Comet project so that the training results are tracked in Comet. Setting the `COMET_LOG_ASSETS` environment variable to `"True"` additionally stores the training artifacts in Comet.

In [6]:
# initialize comet_ml
comet_ml.init(project_name="emotion-classification")

# load the pretrained encoder-decoder (seq2seq) checkpoint to fine-tune
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)

# set this to log HF results and assets to Comet
os.environ["COMET_LOG_ASSETS"] = "True"

# HF Trainer
model_name = model_checkpoint.split("/")[-1]
training_args = Seq2SeqTrainingArguments(
    num_train_epochs=1,
    output_dir="./results",
    overwrite_output_dir=True,
    logging_steps=1,
    evaluation_strategy = "epoch",
    learning_rate=1e-4,
    weight_decay=0.01,
    save_total_limit=5,
    save_steps=7,
    auto_find_batch_size=True
)

# instantiate HF Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

# run trainer
trainer.train()

Please paste your Comet API key from https://www.comet.com/api/my/settings/
(api key may not show as you type)
Comet API key: ··········

COMET INFO: Valid Comet API Key saved in /root/.comet.config (set COMET_CONFIG to change where it is saved).

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.0271 | 0.017744 |


TrainOutput(global_step=13, training_loss=0.3605009632614943, metrics={'train_runtime': 39.8773, 'train_samples_per_second': 1.254, 'train_steps_per_second': 0.326, 'total_flos': 34237867622400.0, 'train_loss': 0.3605009632614943, 'epoch': 1.0})
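
The `global_step=13` in the output above, together with `save_steps=7`, explains why a `checkpoint-7` directory is referenced later. A rough sketch of the arithmetic follows, assuming a single device and no gradient accumulation. The batch size of 4 is an inference from the logged step count (50 samples in 13 steps), presumably the result of `auto_find_batch_size` reducing the default of 8; it is not stated in the logs.

```python
import math


def optimizer_steps(num_samples, batch_size, epochs=1):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(num_samples / batch_size) * epochs


def checkpoint_steps(total_steps, save_steps):
    """Steps at which a step-based save strategy would write a checkpoint."""
    return list(range(save_steps, total_steps + 1, save_steps))


total = optimizer_steps(num_samples=50, batch_size=4)
print(total)                       # 13
print(checkpoint_steps(total, 7))  # [7]
```

If you change the sample size or batch size, the checkpoint step numbers change too, so any hard-coded path such as `results/checkpoint-7` needs to be updated accordingly.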

The code below saves the fine-tuned model locally to `./results`:

In [7]:
# save the model
trainer.save_model("./results")

---

### Register Model

The code below logs a model checkpoint to a Comet experiment and registers it in the Comet model registry.

In [10]:
# set existing experiment
import os
from comet_ml import ExistingExperiment

# COMET_API_KEY = "COMET_API_KEY"

experiment = ExistingExperiment(api_key=COMET_API_KEY, previous_experiment="097ab78e6e154f24b8090a1a7dd6abb8")
experiment.log_model("Emotion-T5-Base", "results/checkpoint-7")
experiment.register_model("Emotion-T5-Base")

COMET ERROR: Run will not be logged 
COMET ERROR: Error logging the model 'Emotion-T5-Base', no such file or directory: '/results/checkpoint-7'


In [12]:
from comet_ml import Experiment


experiment = Experiment(
  api_key=COMET_API_KEY,
  project_name="llmops-course",
  workspace="eddiebee"
)

COMET INFO: Couldn't find a Git repository in '/content' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
COMET INFO: Experiment is live on comet.com https://www.comet.com/eddiebee/llmops-course/2c918ddf4d434ac2889691525662a454



In [13]:
experiment.log_model("Emotion-T5-Base", "results/checkpoint-7")

[('spiece.model',
  {'web': 'https://www.comet.com/api/asset/download?assetId=27beeca2bd374527bbf042e65267c613&experimentKey=2c918ddf4d434ac2889691525662a454',
   'api': 'https://www.comet.com/api/rest/v2/experiment/asset/get-asset?assetId=27beeca2bd374527bbf042e65267c613&experimentKey=2c918ddf4d434ac2889691525662a454',
   'assetId': '27beeca2bd374527bbf042e65267c613'}),
 ('config.json',
  {'web': 'https://www.comet.com/api/asset/download?assetId=2abb9c1bd0914f99b25ad2ea49c9adb0&experimentKey=2c918ddf4d434ac2889691525662a454',
   'api': 'https://www.comet.com/api/rest/v2/experiment/asset/get-asset?assetId=2abb9c1bd0914f99b25ad2ea49c9adb0&experimentKey=2c918ddf4d434ac2889691525662a454',
   'assetId': '2abb9c1bd0914f99b25ad2ea49c9adb0'}),
 ('generation_config.json',
  {'web': 'https://www.comet.com/api/asset/download?assetId=3840fa75e3e34372a3bc86c9b4550fd7&experimentKey=2c918ddf4d434ac2889691525662a454',
   'api': 'https://www.comet.com/api/rest/v2/experiment/asset/get-asset?assetId=384

In [14]:
# register the model
experiment.register_model("Emotion-T5-Base")
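
The first registration attempt above reported that the checkpoint path could not be found. One way to avoid hard-coding a path is to list the checkpoint directories that actually exist before logging the model. The following is a small sketch using only the standard library; `list_checkpoints` is a hypothetical helper, and it assumes checkpoints are stored as `checkpoint-<step>` directories under the trainer's output directory.

```python
from pathlib import Path


def list_checkpoints(output_dir):
    """Return checkpoint directories under output_dir, ordered by step number."""
    found = []
    for path in Path(output_dir).glob("checkpoint-*"):
        step = path.name.split("-")[-1]
        # Keep only directories whose suffix is a step number.
        if path.is_dir() and step.isdigit():
            found.append((int(step), path))
    return [path for _, path in sorted(found)]


checkpoints = list_checkpoints("./results")
if checkpoints:
    print("latest checkpoint:", checkpoints[-1])
else:
    print("no checkpoint directories found under ./results")
```

The path of the last entry could then be passed to `experiment.log_model` in place of the literal `"results/checkpoint-7"`.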

---

### Deploy Model

The code below downloads a specific version of the registered model to the environment you are deploying from.

In [16]:
from comet_ml import API

api = API(api_key=COMET_API_KEY)
COMET_WORKSPACE = "eddiebee"

# model name
model_name = "Emotion-T5-Base"

# get the Model object
model = api.get_model(workspace=COMET_WORKSPACE, model_name=model_name)

# download version 1.0.0 of the registered model into ./deploy
model.download("1.0.0", "./deploy", expand=True)

COMET INFO: Remote Model 'eddiebee/Emotion-T5-Base:1.0.0' download has been started asynchronously.
COMET INFO: Still downloading 12 file(s), remaining 2.77 GB/2.77 GB


KeyboardInterrupt: 