# (LoRA) Fine-tuning Stablelm2 LLM

### This notebook is meant to be run only on Google Colab.

The objective of this Colab notebook is to fine-tune an LLM that accurately responds to questions about UOB banking content as part of our TDP Capstone Project.

The outline of this notebook is as follows:

> 1. Identifying the LLM to finetune on (HF model)
> 2. Configuring and quantising the model with qLoRA
> 3. Loading and structuring the dataset
> 4. Finetuning the LLM based on parameter config
> 5. Exporting and deploying the model to Ollama

Information on model fine-tuning has been referenced from the following website:
- [Fine-Tuning Ollama Models with Unsloth](https://medium.com/@yuxiaojian/fine-tuning-ollama-models-with-unsloth-a504ff9e8002)

### Pre Fine-Tuning Checks

In [1]:
!nvidia-smi

Tue Oct  1 14:19:15 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Imports and Dependencies

In [2]:
!pip3 install auto-gptq
!pip3 install optimum
!pip3 install bitsandbytes
!pip3 install wandb
!pip3 install transformers
!pip3 install accelerate
!pip3 install peft
!pip3 install datasets

Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting datasets (from auto-gptq)
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting rouge (from auto-gptq)
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Collecting gekko (from auto-gptq)
  Downloading gekko-1.2.1-py3-none-any.whl.metadata (3.0 kB)
Collecting peft>=0.5.0 (from auto-gptq)
  Downloading peft-0.13.0-py3-none-any.whl.metadata (13 kB)
Collecting pyarrow>=15.0.0 (from datasets->auto-gptq)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->auto-gptq)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets->auto-gptq)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets->auto-gptq)
  Downloading multiprocess-0.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed, Trainer, AutoModel
from peft import LoraConfig, get_peft_model, PeftModel, prepare_model_for_kbit_training
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset, Dataset
from google.colab import files

import transformers
import pandas as pd
import json
import torch
import os
import gc
import wandb
import uuid
import shutil


from accelerate import Accelerator
from functools import partial
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')



Mounted at /content/drive


Before we load the model and tokenizer from the parent model on the Hugging Face (HF) Hub, we will establish the key variables: accelerator, set_seed() and run_id.

> Accelerator: This is the main class of Hugging Face's Accelerate library. It detects the available hardware (e.g., CPUs, GPUs, TPUs, or multiple GPUs) and handles device placement so that you do not have to manage it manually. In this notebook it is used to read the number of processes and to identify the main process.

> set_seed(): This ensures reproducibility by fixing the random seed. In machine learning, some aspects of model training, such as weight initialization or shuffling of data, introduce randomness. Fixing the seed makes these steps repeat identically across runs.

> run_id: This is a unique identifier for a specific training or fine-tuning session. It is typically used in logging frameworks like WandB (Weights and Biases) or TensorBoard to track individual runs. This variable helps in managing and comparing different experiments, making it easier to analyze metrics such as loss, accuracy, and other performance indicators across multiple runs.

In [4]:
accelerator = Accelerator()
set_seed(42)
run_id = str(uuid.uuid4())
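
To illustrate why a fixed seed and a fresh run_id matter, here is a minimal sketch that uses only Python's standard library. The `random` module stands in for the NumPy and PyTorch generators that `set_seed()` also seeds, and the helper name `sample_with_seed` is hypothetical.

```python
import random
import uuid


def sample_with_seed(seed, n=3):
    """Draw n pseudo-random numbers after fixing the seed (hypothetical helper)."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


# Same seed -> same sequence, which is what makes a training run repeatable
first = sample_with_seed(42)
second = sample_with_seed(42)
print(first == second)

# Each call to uuid4() yields a new identifier, so every run gets its own run_id
run_a, run_b = str(uuid.uuid4()), str(uuid.uuid4())
print(run_a != run_b)
```

The seed keeps the random parts of training repeatable, while the run_id deliberately changes on every run so that logs and output folders never collide.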

### Load Model and Tokenizer from HuggingFace

In [5]:
model = AutoModelForCausalLM.from_pretrained(
  "stabilityai/stablelm-2-1_6b",
  device_map="auto",
  torch_dtype="auto"
)

tokenizer = AutoTokenizer.from_pretrained(
  "stabilityai/stablelm-2-1_6b",
  trust_remote_code=True,
  use_fast=True
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Using Base Model to test Pre-Training Performance




In [None]:
# model.cuda()

# inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to(model.device)

# tokens = model.generate(
#   **inputs,
#   max_new_tokens=64,
#   temperature=0.70,
#   top_p=0.95,
#   do_sample=True,
# )

# print(tokenizer.decode(tokens[0], skip_special_tokens=True))

### Prepare Model for Training

> model.train(): This switches the model to training mode. Certain layers behave differently in training mode than in inference (evaluation) mode; for example, dropout is only active during training. Calling train() ensures that these components behave correctly for training.

> model.gradient_checkpointing_enable(): In large models, storing all the intermediate activations during the forward pass can use up a lot of GPU memory. Gradient checkpointing keeps only a subset of the activations and recomputes the rest during the backward pass, trading extra compute for a lower memory footprint.

> prepare_model_for_kbit_training(model): This PEFT helper prepares a model for k-bit quantized training. Quantization reduces the precision of the model weights from 32-bit floating-point numbers (commonly used in deep learning) to a smaller number of bits, such as 8-bit or 4-bit (k bits), which greatly reduces the memory footprint of the model. The helper freezes the base model weights and upcasts the remaining half-precision parameters to 32-bit for numerical stability. Note that the weights are only stored in k bits if the model is loaded with a quantization config, which the loading cell above does not pass.

In [6]:
model.train() # model in training mode (dropout modules are activated)

# enable gradient check pointing
model.gradient_checkpointing_enable()

# enable quantized training
model = prepare_model_for_kbit_training(model)
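
As a rough back-of-the-envelope check on the memory claims above, the sketch below estimates weight storage at different precisions. The parameter count is taken from the `print_trainable_parameters()` output later in this notebook (all parameters minus the LoRA parameters). The helper name `weight_memory_gib` is hypothetical, and the estimate ignores activations, gradients, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate storage for the weights alone, in GiB (hypothetical helper)."""
    return num_params * bits_per_param / 8 / 1024**3


# Base model parameters: total reported by PEFT minus the LoRA parameters
BASE_PARAMS = 1_645_301_760 - 786_432

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(BASE_PARAMS, bits):.2f} GiB")
```

Halving the bit width halves the weight storage, which is why 8-bit and 4-bit loading make larger models fit on a 15 GB T4.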

Low-Rank Adaptation (LoRA) is a technique used to reduce the number of trainable parameters in large models, which makes the overall fine-tuning more efficient.

> <b>r</b>: This is the rank of the low-rank adaptation. In LoRA, instead of updating the full model weights, a low-rank matrix is trained, and r determines the size of this matrix. A lower r means fewer parameters, making the training more efficient.

> <b>lora_alpha</b>: This is a scaling factor for LoRA. The low-rank update is multiplied by lora_alpha / r before it is added to the output of the frozen layer, which controls the magnitude of the update. With r=8 and lora_alpha=32, the update is scaled by 4.

> <b>target_modules</b>: This specifies which parts of the model should use LoRA. In this case, specifying (q_proj) module means that 'query projection' of the transformer will be adapted using LoRA. LoRA can be applied selectively to certain layers or components to reduce computational overhead while still effectively fine-tuning the model.

> <b>lora_dropout</b>: LoRA applies dropout to the input of the low-rank matrices to reduce overfitting. A value of 0.05 means that each input activation of the LoRA branch is zeroed with a probability of 5% during training.

> <b>bias</b>: This specifies how biases are treated. In this case, the bias terms in the model are not trainable (none), meaning that only the LoRA-adapted parts of the model are modified during training.

> <b>task_type</b>: This defines the type of task the model is being trained for. Here, the task is Causal Language Modeling (CAUSAL_LM), which is commonly used in autoregressive models like GPT, where the model predicts the next word in a sequence based on previous words.

Once we have defined the LoRA config, we will wrap the model using PEFT (Parameter-Efficient Fine-Tuning).

> <b>get_peft_model()</b>: PEFT wraps the original model with the LoRA-adapted layers, so that only the specified target_modules (in this case, the q_proj) are fine-tuned using the low-rank adaptation, while the rest of the model remains frozen (not updated). This approach significantly reduces the number of trainable parameters, making fine-tuning much more memory-efficient and computationally cheaper, especially for very large models.

> <b>print_trainable_parameters()</b>: This prints out the number of trainable parameters in the model. Since LoRA only fine-tunes a small portion of the model (in this case, the query projection), this number will be much smaller than the full parameter count of the original model. This is useful for understanding the efficiency gains provided by LoRA, as it highlights the reduction in the number of parameters that need to be updated during training.

In [7]:
# LoRA config
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj"], # You can target more modules here (e.g. ["q_proj", "k_proj", "v_proj", "o_proj"]), at the cost of more trainable parameters and longer training time
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# LoRA trainable version of model
model = get_peft_model(model, config)

# Trainable parameter count
model.print_trainable_parameters()

trainable params: 786,432 || all params: 1,645,301,760 || trainable%: 0.0478
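
The trainable parameter count printed above can be reproduced by hand. The sketch below assumes the shapes shown in the model printout further down: 24 decoder layers, each with a `q_proj` of 2048 input and 2048 output features, and LoRA matrices A (2048 x 8) and B (8 x 2048). The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(num_layers, in_features, out_features, r):
    """Parameters added by LoRA: one A (in x r) and one B (r x out) per adapted layer."""
    return num_layers * (in_features * r + r * out_features)


trainable = lora_param_count(num_layers=24, in_features=2048, out_features=2048, r=8)
total = 1_645_301_760  # "all params" reported by print_trainable_parameters()
pct = round(100 * trainable / total, 4)

# Standard LoRA scales the low-rank update by lora_alpha / r
scaling = 32 / 8

print(trainable, pct, scaling)
```

Targeting more modules multiplies the trainable count accordingly, since each adapted linear layer adds its own pair of A and B matrices.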


### Preparing Dataset






### Variable Instantiation

In [8]:
dataset_name="LLM Model Training"
modelpath="stabilityai/stablelm-2-1_6b"

In [9]:
!ls "/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/datasets/" # Check the contents of the folder
file_path = '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/datasets/train.csv' # Check if the file exists

if os.path.exists(file_path):
  print("File exists! Formatting dataset...")
  data = pd.read_csv(file_path)
  data = data[['user_input', 'response']]

  formatted = {}
  formatted['conversation'] = []
  for index, row in data.iterrows():
    formatted['conversation'].append({
      "content": row['user_input'],
      "role": "user"
    })
    formatted['conversation'].append({
      "content": row['response'],
      "role": "assistant"
    })

  with open("/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/datasets/training.json", "w") as json_file:
    json.dump(formatted, json_file, indent=2)
else:
  print("File not found.")

train.csv  training.json
File exists! Formatting dataset...


### Loading the dataset

In [10]:
with open("/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/datasets/training.json", 'r') as json_file:
  training = json.load(json_file)

training['conversation'][0:9]

[{'content': 'How can I sign up for the TMRW app?', 'role': 'user'},
 {'content': 'You can sign up for the TMRW app via our website or directly on the app from the App Store.',
  'role': 'assistant'},
 {'content': 'What documents do I need to apply for a credit card?',
  'role': 'user'},
 {'content': "You'll need your NRIC, latest income documents, and proof of residence to apply for a credit card.",
  'role': 'assistant'},
 {'content': 'How can I apply for a personal loan?', 'role': 'user'},
 {'content': 'You can apply for a personal loan through our online banking portal or by visiting a UOB branch.',
  'role': 'assistant'},
 {'content': "Quels services d'investissement proposez-vous ?",
  'role': 'user'},
 {'content': "UOB propose des options d'investissement telles que des fonds communs, des obligations et des dépôts structurés.",
  'role': 'assistant'},
 {'content': 'How do I check my TMRW app balance?', 'role': 'user'}]

Be sure to read up on the documentation of each LLM used for finetuning at the [official website](https://ollama.com/library)

In the case of StableLM 2, which is a significantly smaller LLM than models like Mistral and Llama 3, the prompt template used for its prompt-response interactions is shown below:

```
{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
```

What this means is that the dataset should contain the following information:

> System Message (Optional): If provided, a system message sets the context or defines how the assistant should behave (e.g., polite, informative, task-specific). It is optional and only included if needed.

> User Message (Prompt): The user's input (a question, command, or request) is always included when it exists. This forms the core of the conversation, and the model uses it to generate its response.

> Assistant Response: This is the generated output from the model, representing the assistant's reply to the user’s input.
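
To make the template concrete, here is a minimal sketch that renders one conversation turn into the same ChatML layout. The function name `format_chatml` is hypothetical, and whether a trailing newline follows the final `<|im_end|>` is an assumption.

```python
def format_chatml(prompt, response, system=None):
    """Render one exchange in ChatML form, mirroring the template above (hypothetical helper)."""
    parts = []
    if system:  # the system block is optional
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    if prompt:  # the user block is included whenever a prompt exists
        parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    parts.append(f"<|im_start|>assistant\n{response}<|im_end|>")
    return "".join(parts)


text = format_chatml(
    prompt="How can I apply for a personal loan?",
    response="You can apply through our online banking portal.",
)
print(text)
```

Passing `system="..."` prepends the optional system block, and omitting it produces only the user and assistant blocks.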


In [11]:
pairs = []
for i in range(0, len(training['conversation'])-1, 2):  # step by 2 to get pairs of user and assistant
  if training['conversation'][i]['role'] == 'user' and training['conversation'][i+1]['role'] == 'assistant':
    pairs.append([
      {'content': training['conversation'][i]['content'], 'role': training['conversation'][i]['role']},
      {'content': training['conversation'][i+1]['content'], 'role': training['conversation'][i+1]['role']}
    ])

custom_dataset = Dataset.from_dict({
  "conversations": pairs
})

dataset_split = custom_dataset.train_test_split(test_size=0.1)
dataset_split

DatasetDict({
    train: Dataset({
        features: ['conversations'],
        num_rows: 425
    })
    test: Dataset({
        features: ['conversations'],
        num_rows: 48
    })
})
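
The 425 / 48 split above follows from simple rounding. The sketch below assumes scikit-learn-style rounding, where a fractional test size is rounded up and the remainder goes to the training set. The helper name `split_sizes` is hypothetical, and the total of 473 pairs is inferred from the two row counts shown above.

```python
import math


def split_sizes(n_samples, test_size):
    """Row counts for a fractional test split, rounding the test set up (assumed convention)."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(n_samples=473, test_size=0.1)
print(n_train, n_test)
```

Because the split shuffles the rows, passing a fixed `seed` argument to `train_test_split` is one way to keep the same train and test rows across runs.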

### EOS and Model Configuration

The following code snippet deals with setting up the end-of-sequence (EOS) token for the tokenizer and model configuration, and it also defines templates for message formatting. In addition, it introduces a specific value for ignored tokens during loss calculation.


> tokenizer.encode("<|im_end|>")[0]: Encodes the token "<|im_end|>" using the tokenizer, turning it into a sequence of token IDs. The [0] is used to extract the first token ID from the encoded result, which corresponds to the end-of-sequence marker.

> tokenizer.eos_token_id: This sets the eos_token_id (end-of-sequence token) in the tokenizer to the token ID corresponding to "<|im_end|>".

> model.config.eos_token_id = tokenizer.eos_token_id: After setting the EOS token ID for the tokenizer, the same token ID is applied to the model configuration. This ensures that both the tokenizer and the model agree on which token marks the end of a sequence. The EOS token signals the end of a generated sequence, so the model knows when to stop predicting.

Once that is done, we will specify how messages are formatted when they are passed to the model. Each template wraps the message ({msg}) with the respective special tokens to distinguish between the user's and assistant's parts in a conversation.

In [12]:
tokenizer.eos_token_id = tokenizer.encode("<|im_end|>")[0]
model.config.eos_token_id = tokenizer.eos_token_id

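# Indexed by a boolean: templates[False] is the assistant template, templates[True] is the user template
templates = [
  "<|im_start|>assistant\n{msg}<|im_end|>",
  "<|im_start|>user\n{msg}<|im_end|>"
]
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default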

In [13]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): StableLmForCausalLM(
      (model): StableLmModel(
        (embed_tokens): Embedding(100352, 2048)
        (layers): ModuleList(
          (0-23): 24 x StableLmDecoderLayer(
            (self_attn): StableLmSdpaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2048, out_features=2048, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj)

In [14]:
tokenizer

GPT2TokenizerFast(name_or_path='stabilityai/stablelm-2-1_6b', vocab_size=100289, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|im_end|>', 'unk_token': '<|endoftext|>', 'additional_special_tokens': ['<|reg_extra|>', '<|endoftext|>', '<|fim_prefix|>', '<|fim_middle|>', '<|fim_suffix|>', '<|fim_pad|>', '<gh_stars>', '<filename>', '<issue_start>', '<issue_comment>', '<issue_closed>', '<jupyter_start>', '<jupyter_text>', '<jupyter_code>', '<jupyter_output>', '<empty_output>', '<commit_before>', '<commit_msg>', '<commit_after>', '<reponame>', '<|endofprompt|>', '<|im_start|>', '<|im_end|>', '<|pause|>', '<|reg0|>', '<|reg1|>', '<|reg2|>', '<|reg3|>', '<|reg4|>', '<|reg5|>', '<|reg6|>', '<|reg7|>', '<|extra0|>']}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	100256: AddedToken("<|reg_extra|>", rstrip=False, lstrip=False, single_word=False, normalized=F

### Tokenizing our Data

We will define the tokenize and collate functions respectively.


> The tokenize function prepares conversational data by converting messages into token IDs, attention masks, and labels. User inputs are ignored during training (label set to IGNORE_INDEX), while assistant responses are the target labels.

> The collate function ensures that batches of data are properly padded to the same length, which is necessary for efficient training.

> The dataset is then tokenized using the map() function with multiple worker processes, ensuring that it is ready for training in the required format.
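
The label masking and padding described above can be sketched without any model or tokenizer. The example below uses a hypothetical character-level `toy_tokenize` as a stand-in for the real tokenizer, omits the ChatML wrapping for brevity, and assumes a pad ID of 0. It is an illustration of the idea, not the notebook's actual functions.

```python
IGNORE_INDEX = -100


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one integer ID per character."""
    return [ord(ch) for ch in text]


def build_example(conversation):
    """Concatenate messages; user tokens get IGNORE_INDEX so they add nothing to the loss."""
    input_ids, labels = [], []
    for msg in conversation:
        ids = toy_tokenize(msg["content"])
        input_ids += ids
        labels += [IGNORE_INDEX] * len(ids) if msg["role"] == "user" else ids
    return {"input_ids": input_ids, "labels": labels}


def pad_batch(examples, pad_id=0):
    """Right-pad every example to the longest one in the batch."""
    max_len = max(len(e["input_ids"]) for e in examples)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for e in examples:
        pad = max_len - len(e["input_ids"])
        batch["input_ids"].append(e["input_ids"] + [pad_id] * pad)
        batch["labels"].append(e["labels"] + [IGNORE_INDEX] * pad)  # padding is never a target
        batch["attention_mask"].append([1] * len(e["input_ids"]) + [0] * pad)
    return batch


ex1 = build_example([{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Yo"}])
ex2 = build_example([{"role": "user", "content": "A"}, {"role": "assistant", "content": "B"}])
batch = pad_batch([ex1, ex2])
print(batch["labels"])
```

Only the assistant tokens keep their real IDs in `labels`, so the loss is computed on the responses alone, while the attention mask tells the model which positions are padding.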

In [15]:
# Defining our tokenize function
def tokenize(example, max_length):
  input_ids, attention_mask, labels = [], [], []

  for msg in example["conversations"]:
    is_human = msg["role"] == "user"
    # templates[True] is the user template, templates[False] is the assistant template
    msg_chatml = templates[is_human].format(msg=msg["content"])
    msg_tokenized = tokenizer(msg_chatml, truncation=False, add_special_tokens=False)

    input_ids += msg_tokenized["input_ids"]
    attention_mask += msg_tokenized["attention_mask"]
    # Mask user tokens so that only assistant tokens contribute to the loss
    labels += [IGNORE_INDEX] * len(msg_tokenized["input_ids"]) if is_human else msg_tokenized["input_ids"]

  # Truncate every field to the same maximum length
  return {
    "input_ids": input_ids[:max_length],
    "attention_mask": attention_mask[:max_length],
    "labels": labels[:max_length],
  }

# Defining our collate function - to transform list of dictionaries [ {input_ids: [123, ..]}, {.. ]
# to single batch dictionary { input_ids: [..], labels: [..], attention_mask: [..] }
def collate(elements):
  tokens = [e["input_ids"] for e in elements]
  tokens_maxlen = max([len(t) for t in tokens])

  for i, sample in enumerate(elements):
    input_ids = sample["input_ids"]
    labels = sample["labels"]
    attention_mask = sample["attention_mask"]

    pad_len = tokens_maxlen-len(input_ids)

    input_ids.extend( pad_len * [tokenizer.pad_token_id] )
    labels.extend( pad_len * [IGNORE_INDEX] )
    attention_mask.extend( pad_len * [0] )

  batch = {
    "input_ids": torch.tensor( [e["input_ids"] for e in elements] ),
    "labels": torch.tensor( [e["labels"] for e in elements] ),
    "attention_mask": torch.tensor( [e["attention_mask"] for e in elements] ),
  }

  return batch

# tokenize training and validation datasets
dataset_tokenized = dataset_split.map(
  partial(tokenize, max_length = 1600),
  batched = False,
  num_proc = os.cpu_count() // accelerator.num_processes,    # split the CPU cores across worker processes
  remove_columns = dataset_split["train"].column_names
)


> Setting the Padding Token: The eos_token is used as the padding token to ensure consistency in how the model handles padding.

> Collator Setup: DataCollatorForLanguageModeling batches and pads tokenized data for causal language modeling (mlm=False). Note that the Trainer below is given the custom collate function defined earlier rather than this collator.

In [16]:
# setting pad token
tokenizer.pad_token = tokenizer.eos_token
# data collator
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)


### Fine-tuning the LLM

#### Display pre-finetuning statistics

In [17]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
7.586 GB of memory reserved.


#### Fine-Tuning Process

In [18]:
# hyperparameters
lr = 2e-4
batch_size = 4
num_epochs = 5
grad_acc_steps = 4
dataset_name="LLM Model Training"
modelpath = "stabilityai/stablelm-2-1_6b"
max_length = 1024
output_dir = f'/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/wandb/out-{run_id}'

# define training arguments
training_args = transformers.TrainingArguments(
  output_dir= output_dir,
  learning_rate=lr,
  per_device_train_batch_size=batch_size,
  per_device_eval_batch_size=batch_size,
  num_train_epochs=num_epochs,
  weight_decay=0.01,
  logging_strategy="epoch",
  eval_strategy="epoch",
  save_strategy="epoch",
  load_best_model_at_end=True,
  gradient_accumulation_steps=grad_acc_steps,
  warmup_steps=2,
  fp16=True,
  optim="paged_adamw_8bit",
)

trainer = transformers.Trainer(
  model=model,
  args=training_args,
  data_collator=collate,
  train_dataset=dataset_tokenized["train"],
  eval_dataset=dataset_tokenized["test"],
)

# The Weights & Biases logs are automatically saved on wandb, but we can also view them under the path saved in Google Drive
if accelerator.is_main_process:
  run = wandb.init(
    project="OA2-finetune",
    dir=output_dir, # Configure accordingly to save the log in google drive. Modify accordingly if run locally
    name="stabilityai/stablelm-2-1_6b".split("/")[1]+"_"+dataset_name+f"_bs-{batch_size}_LR-{lr}_maxlen-{max_length}_{run_id}",
    config = {
      "model_name": "StableLM2_TDP",
      "run_id": run_id,
      "dataset": dataset_name,
      "output_dir": output_dir, # Configure accordingly to save the log in google drive. Modify accordingly if run locally
      "lr": lr,
      "max_length": max_length,
      "train_batch_size": batch_size,
      "validation_batch_size": batch_size,
      "ga_steps": grad_acc_steps,
      "training_args": training_args,
      "GPUs": accelerator.num_processes,
    }
  )

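model.config.use_cache = False  # the generation cache is not needed while training
trainer_stats = trainer.train()
# Save the adapter only after training, so that the saved weights are the fine-tuned ones
trainer.save_model(f'/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/llm-adapter/{run_id}')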

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0     | 2.4977        | 1.886894        |
| 1     | 1.7085        | 1.573246        |
| 2     | 1.4844        | 1.467443        |
| 4     | 1.3826        | 1.433948        |
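
To relate the hyperparameters to the amount of data seen per optimizer update, the sketch below computes the effective batch size and an approximate number of updates per epoch. It uses batch_size = 4, grad_acc_steps = 4, one GPU, and the 425 training rows from this notebook. The helper name is hypothetical, and the Trainer's exact step count may differ by one because of how it rounds partial accumulation windows.

```python
import math


def effective_batch_size(per_device_batch, grad_acc_steps, num_devices=1):
    """Samples that contribute to a single optimizer update (hypothetical helper)."""
    return per_device_batch * grad_acc_steps * num_devices


eff = effective_batch_size(per_device_batch=4, grad_acc_steps=4)
approx_updates_per_epoch = math.ceil(425 / eff)

print(eff, approx_updates_per_epoch)
```

Gradient accumulation therefore gives the behaviour of a larger batch while only a small per-device batch is held in GPU memory at any time.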




#### Show final memory and time stats

In [19]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

154.8072 seconds used for training.
2.58 minutes used for training.
Peak reserved memory = 9.01 GB.
Peak reserved memory for training = 1.424 GB.
Peak reserved memory % of max memory = 61.093 %.
Peak reserved memory for training % of max memory = 9.656 %.


### Push model to huggingface hub

In [None]:
# from huggingface_hub import notebook_login
# notebook_login()

### Clear the memory

> gc.collect(): Forces the Python garbage collector to run immediately and release memory that is no longer in use. Objects caught in circular references (e.g., objects referencing each other) are not freed by reference counting alone and are only reclaimed when the garbage collector runs. Calling it explicitly is especially useful in long-running scripts or training loops where memory usage might increase over time, leading to potential out-of-memory errors.

> empty_cache(): It releases all unused memory that PyTorch has cached on the GPU. This is helpful when you need to reduce GPU memory usage, especially after large tensor computations or when switching between different models during training.

In [20]:
gc.collect()
torch.cuda.empty_cache()
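
The effect of gc.collect() on circular references can be seen in a small standard-library sketch. The `Node` class is hypothetical, automatic collection is switched off only so the effect is visible, and the behaviour shown is that of CPython. The GPU side (`empty_cache()`) is not demonstrated here because it needs a CUDA device.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can point at a partner, forming a reference cycle."""

    def __init__(self):
        self.partner = None


gc.disable()  # keep automatic collection from interfering with the demonstration

a, b = Node(), Node()
a.partner, b.partner = b, a  # a and b now reference each other
probe = weakref.ref(a)       # lets us check whether the object still exists

del a, b
alive_before = probe() is not None  # the cycle keeps both objects alive

gc.collect()
alive_after = probe() is not None   # the collector has reclaimed the cycle

gc.enable()
print(alive_before, alive_after)
```

The objects survive `del` because they still reference each other, and they are only reclaimed once the collector runs.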

Rerun the necessary imports for the model export. We no longer need other libraries stated at the start.

In [9]:
import torch
import os
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from google.colab import files

### Save the Fine-Tuned Model

In [22]:
# Define the path where you want to create the folders
model_saved_path = '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/fine-tuned'
# os.makedirs(model_saved_path, exist_ok=True) # Create the folder. Comment or delete this statement after running once
print(f"Folder created at: {model_saved_path}")

Folder created at: /content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/fine-tuned


1. On Google Colab

In [23]:
model.save_pretrained(f'{model_saved_path}/tdp_stablelm2_ft')
tokenizer.save_pretrained(f'{model_saved_path}/tdp_stablelm2_ft')

('/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/fine-tuned/tdp_stablelm2_ft/tokenizer_config.json',
 '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/fine-tuned/tdp_stablelm2_ft/special_tokens_map.json',
 '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/fine-tuned/tdp_stablelm2_ft/vocab.json',
 '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/fine-tuned/tdp_stablelm2_ft/merges.txt',
 '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/fine-tuned/tdp_stablelm2_ft/added_tokens.json',
 '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/fine-tuned/tdp_stablelm2_ft/tokenizer.json')

2. On Local Machine

In [None]:
# model.save_pretrained('./tdp_stablelm2_ft')
# tokenizer.save_pretrained('./tdp_stablelm2_ft')

# # Zip the saved model directory
# shutil.make_archive('tdp_stablelm2_ft', 'zip', './tdp_stablelm2_ft')

# files.download('tdp_stablelm2_ft.zip')


# RUN THIS IF YOU RAN THE CODE ABOVE TO SAVE COPY ON GOOGLE DRIVE
# Zip the saved model directory
# shutil.make_archive('tdp_stablelm2_ft', 'zip', f'{model_saved_path}/tdp_stablelm2_ft')

# Download the zipped file (optional)
# files.download(f'{model_saved_path}/tdp_stablelm2_ft.zip')

### Loading the Fine-Tuned Model

In [None]:
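# Load the fine-tuned model and tokenizer
# Note: AutoModel loads the base transformer without the language-modeling head, so this cell
# only checks that a forward pass runs; it returns hidden states rather than generated text.
model = AutoModel.from_pretrained('./tdp_stablelm2_ft')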
tokenizer = AutoTokenizer.from_pretrained('./tdp_stablelm2_ft')

# Test the tokenizer and model (example sentence)
input_text = "This is a test input for my fine-tuned model."
inputs = tokenizer(input_text, return_tensors='pt')

# Perform forward pass through the model
outputs = model(**inputs)

# Print the model's output (you can process this further based on the model type)
print(outputs)

Loading adapter weights from ./tdp_stablelm2_ft led to unexpected keys not found in the model:  ['model.layers.0.self_attn.q_proj.lora_A.default.weight', 'model.layers.0.self_attn.q_proj.lora_B.default.weight', 'model.layers.1.self_attn.q_proj.lora_A.default.weight', 'model.layers.1.self_attn.q_proj.lora_B.default.weight', 'model.layers.10.self_attn.q_proj.lora_A.default.weight', 'model.layers.10.self_attn.q_proj.lora_B.default.weight', 'model.layers.11.self_attn.q_proj.lora_A.default.weight', 'model.layers.11.self_attn.q_proj.lora_B.default.weight', 'model.layers.12.self_attn.q_pr

BaseModelOutputWithPast(last_hidden_state=tensor([[[ -5.2231,   2.3152,  -4.0828,  ...,   8.7993,   3.0453,  -0.8264],
         [ -2.6569,   1.4298,  -3.7546,  ...,   8.7502,   3.0141,   2.1007],
         [ -3.7972,   0.2874, -10.5286,  ...,   9.0043,  -1.4867,   2.3723],
         ...,
         [ -0.8138,  -4.1902,  -6.1144,  ...,  -0.2880,  -0.9248,   0.6036],
         [ -3.7965,  -5.7221,  -8.9033,  ...,  -3.1065,   4.0865,  -1.0431],
         [ -2.8160,  -2.0190,   2.5456,  ...,   4.1146,   4.4637,   1.2360]]]), past_key_values=((tensor([[[[-2.2908e+00,  8.6010e+00,  2.1892e-01,  ...,  9.3315e-01,
           -9.6343e-01, -7.3244e-01],
          [ 6.2671e-01,  8.2429e+00,  1.6056e-01,  ...,  7.7811e-01,
           -1.0604e+00, -8.0578e-01],
          [ 2.9366e+00,  6.8564e+00,  1.5531e-01,  ...,  1.2075e+00,
           -1.0381e+00, -8.2465e-01],
          ...,
          [ 3.4593e+00, -8.6954e+00,  3.4498e-03,  ..., -1.9549e-01,
           -8.1097e-01, -6.1023e-01],
          [ 9.5302

### Testing the Fine-Tuned Model

In [None]:
# # Load the model for causal language modeling (if applicable)
# model = AutoModelForCausalLM.from_pretrained('./tdp_stablelm2_ft')
# tokenizer = AutoTokenizer.from_pretrained('./tdp_stablelm2_ft')

# # Tokenize your input for generation
# input_ids = tokenizer.encode("What products are offered at UOB?", return_tensors='pt')

# # Generate text
# generated_output = model.generate(input_ids, max_length=200)

# # Decode the generated text
# generated_text = tokenizer.decode(generated_output[0], skip_special_tokens=True)

# print(generated_text)

### Reload the tokenizer and model

Here, we reload the base model in half-precision floating point (fp16), not the quantized version defined above. Next, we merge the LoRA adapter that we trained and saved in the `tdp_stablelm2_ft` folder into this model to produce our final fine-tuned model.

In [24]:
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-2-1_6b")
fp16_model = AutoModelForCausalLM.from_pretrained(
  "stabilityai/stablelm-2-1_6b",
  low_cpu_mem_usage=True,
  return_dict=True,
  torch_dtype=torch.float16,
  device_map="auto",
)

model = PeftModel.from_pretrained(fp16_model, "/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/fine-tuned/tdp_stablelm2_ft")
model = model.merge_and_unload()
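
As a rough illustration of what the merge step does to a single weight matrix, the standard LoRA update adds the scaled low-rank product to the base weight: `W_merged = W + (alpha / r) * B @ A`. The sketch below uses toy pure-Python matrices; the helper names `matmul` and `merge_lora` and all values are made up for illustration, and PEFT's real `merge_and_unload()` additionally handles every adapted layer, dtypes and devices.

```python
# Minimal pure-Python sketch of merging a LoRA adapter into one weight matrix.
# All shapes and values are toy numbers chosen for illustration only.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy example: a 2x2 base weight with a rank-1 adapter.
weight = [[1.0, 0.0],
          [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (rank, in_features) = (1, 2)
lora_b = [[0.5], [0.0]]  # shape (out_features, rank) = (2, 1)

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and converted like an ordinary checkpoint.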

After merging the LoRA adapter, we save the final model and tokenizer in a new directory to prepare for GGUF conversion.

If runtime or computational resources are limited, skip this cell and go straight to the GGUF / llama.cpp conversion, since our model and tokenizer are already saved in the path below.

In [25]:
saved_path = '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged'

model.save_pretrained(saved_path)
tokenizer.save_pretrained(saved_path)

('/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/tokenizer_config.json',
 '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/special_tokens_map.json',
 '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/vocab.json',
 '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/merges.txt',
 '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/added_tokens.json',
 '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/tokenizer.json')

### GGUF / llama.cpp Conversion
We need to build llama.cpp in order to use its conversion tools. Allow time for this step: the build took about 48 minutes on average.

In [26]:
# The build below took about 48 minutes on average.
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUDA=1 make

Cloning into 'llama.cpp'...
remote: Enumerating objects: 34952, done.
remote: Counting objects: 100% (9117/9117), done.
remote: Compressing objects: 100% (595/595), done.
remote: Total 34952 (delta 8829), reused 8571 (delta 8520), pack-reused 25835 (from 1)
Receiving objects: 100% (34952/34952), 57.84 MiB | 13.45 MiB/s, done.
Resolving deltas: 100% (25427/25427), done.
Already up to date.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -

In [27]:
# Install the requirements
!pip3 install -r llama.cpp/requirements.txt

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu
Collecting transformers<5.0.0,>=4.45.1 (from -r llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 3))
  Downloading transformers-4.45.1-py3-none-any.whl.metadata (44 kB)
Collecting gguf>=0.1.0 (from -r llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 4))
  Downloading gguf-0.10.0-py3-none-any.whl.metadata (3.5 kB)
Collecting protobuf<5.0.0,>=4.21.0 (from -r llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 5))
  Downloading protobuf-4.25.5-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Collecting torch~=2.2.1 (from -r llama.cpp/./requirements/requirements-convert_hf_to_gguf.txt (line 3))
  Downloading ht

This cell runs the actual conversion of the merged Hugging Face model to GGUF.

In [1]:
!python llama.cpp/convert_hf_to_gguf.py "/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged"

INFO:hf-to-gguf:Loading model: tdp_stablelm2_ft_merged
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,             torch.float16 --> F16, shape = {2048, 100352}
INFO:hf-to-gguf:token_embd.weight,         torch.float16 --> F16, shape = {2048, 100352}
INFO:hf-to-gguf:blk.0.attn_norm.bias,      torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.attn_norm.weight,    torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.ffn_down.weight,     torch.float16 --> F16, shape = {5632, 2048}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,     torch.float16 --> F16, shape = {2048, 5632}
INFO:hf-to-gguf:blk.0.ffn_up.weight,       torch.float16 --> F16, shape = {2048, 5632}
INFO:hf-to-gguf:blk.0.ffn_norm.bias,       torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,     torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:b

In [8]:
os.listdir("/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged")

['config.json',
 'generation_config.json',
 'model.safetensors',
 'tokenizer_config.json',
 'special_tokens_map.json',
 'vocab.json',
 'merges.txt',
 'tokenizer.json',
 'stablelm-2-1.6B-F16.gguf',
 'ggml-model-Q4_K_M.gguf']
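
Rather than reading the listing above by eye, the expected files can be checked programmatically. The following is a minimal sketch; `missing_artifacts` is a hypothetical helper (not part of the original notebook), and the demo runs on a temporary directory so that it works without Google Drive mounted.

```python
import os
import tempfile

def missing_artifacts(directory, expected):
    """Return the expected file names that are not present in `directory`."""
    present = set(os.listdir(directory))
    return [name for name in expected if name not in present]

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("config.json", "model.safetensors"):
        open(os.path.join(demo_dir, name), "w").close()
    missing = missing_artifacts(
        demo_dir,
        ["config.json", "model.safetensors", "stablelm-2-1.6B-F16.gguf"],
    )
print(missing)
```

In the notebook, the same helper could be pointed at the `tdp_stablelm2_ft_merged` directory with the file names shown in the listing; an empty result would mean nothing is missing.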

#### Quantization

llama.cpp offers a wide range of quantization formats. We will use the Q4_K_M format.

In [7]:
# Running the quantization script
!cd llama.cpp && ./llama-quantize "/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/stablelm-2-1.6B-F16.gguf" "/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/ggml-model-Q4_K_M.gguf" Q4_K_M


# Download the quantized gguf onto our local machine
files.download("/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/ggml-model-Q4_K_M.gguf")

main: build = 3861 (f1b8c427)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/stablelm-2-1.6B-F16.gguf' to '/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/ggml-model-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 25 key-value pairs and 340 tensors from /content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/stablelm-2-1.6B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = stablelm
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Stablelm 2
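
Because the Drive paths contain spaces, the quantization command is sensitive to shell quoting. One possible alternative, sketched below, is to build the command as an argument list so that no quoting is needed. `quantize_command` is a hypothetical helper; the argument order (input, output, type) simply mirrors the `llama-quantize` call used in the cell above.

```python
import shlex

def quantize_command(binary, src, dst, quant_type="Q4_K_M"):
    """Build the argument list: <binary> <input.gguf> <output.gguf> <type>."""
    return [binary, src, dst, quant_type]

model_dir = "/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged"
cmd = quantize_command(
    "./llama-quantize",
    f"{model_dir}/stablelm-2-1.6B-F16.gguf",
    f"{model_dir}/ggml-model-Q4_K_M.gguf",
)

# shlex.join shows the shell-safe form of the command for logging.
print(shlex.join(cmd))

# To actually run it from the llama.cpp directory (not executed here):
# import subprocess
# subprocess.run(cmd, cwd="llama.cpp", check=True)
```

Passing a list to `subprocess.run` avoids the shell entirely, so spaces in the paths are handled without extra quotes.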

In [11]:
# files.download("/content/drive/My Drive/TDP Capstone Grp 7/LLM Model Training/qLoRA/tdp_stablelm2_ft_merged/ggml-model-Q4_K_M.gguf")


### Run and Deploy the Fine-tuned Model (run locally)

Building the Ollama Modelfile

In [16]:
tuned_model_path = "tdp_stablelm2_ft_merged/ggml-model-Q4_K_M.gguf"
sys_message = (
  "You are a helpful United Overseas Bank (Singapore) AI chatbot that is capable of handling customer queries. "
  "Every response must be detailed and informative. In addition, you should avoid answering questions that are not related to banking with UOB."
)

In [17]:
cmds = []

In [18]:
base_model = f"FROM {tuned_model_path}"
template = '''TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"'''

params = '''PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>'''

system = f'''SYSTEM """{sys_message}"""'''
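
To make the template above easier to reason about, here is a small Python approximation of the prompt text it should produce for a given system message and user prompt. This is only a sketch: `render_chatml` is a hypothetical helper, and the real rendering is done by Ollama's own template engine, which may differ in details.

```python
def render_chatml(prompt, system=""):
    """Approximate the prompt that the ChatML-style TEMPLATE expands to."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    if prompt:
        parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    # The model's reply is generated after this opening assistant tag.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

rendered = render_chatml(
    "What products are offered at UOB?",
    system="You are a helpful UOB chatbot.",
)
print(rendered)
```

The two `PARAMETER stop` entries tell Ollama to stop generating when the model emits `<|im_start|>` or `<|im_end|>`, which keeps the reply from running into a new turn.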

In [19]:
cmds.append(base_model)
cmds.append(template)
cmds.append(params)
cmds.append(system)

In [20]:
def generate_model(cmds):
  """Write each Modelfile instruction in `cmds` on its own line to stablelm2.modelfile."""
  modelfile = "".join(command + "\n" for command in cmds)
  with open("stablelm2.modelfile", "w") as f:
    f.write(modelfile)

In [21]:
generate_model(cmds)

If you do not see `stablelm2.modelfile` in your working directory, do not proceed any further.
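
The check can also be done in code. The sketch below assumes the file written by this notebook begins with its `FROM` instruction; `check_modelfile` is a hypothetical helper, and the demo writes a temporary file so that it runs on its own.

```python
import tempfile
from pathlib import Path

def check_modelfile(path):
    """Return the Modelfile text if it exists and starts with FROM, else raise."""
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"{path} not found; rerun generate_model(cmds)")
    text = file.read_text()
    if not text.lstrip().startswith("FROM "):
        raise ValueError(f"{path} does not start with a FROM instruction")
    return text

# Demo with a temporary Modelfile that mimics the first line written above.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = Path(demo_dir) / "stablelm2.modelfile"
    demo_path.write_text("FROM tdp_stablelm2_ft_merged/ggml-model-Q4_K_M.gguf\n")
    first_line = check_modelfile(demo_path).splitlines()[0]
print(first_line)
```

In the notebook, calling `check_modelfile("stablelm2.modelfile")` right after `generate_model(cmds)` would surface a missing or malformed file before `ollama create` is run.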

#### **Installing** Ollama and compiling with GGUF file

In [22]:
!ollama list

NAME                 	ID          	SIZE  	MODIFIED    
mistral:latest       	f974a74358d6	4.1 GB	7 days ago 	
myOwnStablelm2:latest	1df5ba03896a	982 MB	9 days ago 	
stablelm2:latest     	714a6116cffa	982 MB	4 weeks ago	
llama3:latest        	365c0bd3c000	4.7 GB	5 weeks ago	
llama3.1:latest      	91ab477bec9d	4.7 GB	5 weeks ago	


#### Creating a new custom Ollama model on your local machine using the .gguf file and Modelfile

We suggest downloading the .gguf file and running the code below to deploy the model to Ollama locally, so that we can call the model from our project.

Note: After the .gguf file has been downloaded, run the Ollama model creation in VS Code (VSC) so that the model is deployed locally.

In [23]:
!ollama create betaStableLm2Tdp --file stablelm2.modelfile

transferring model data

In [24]:
!ollama list

NAME                   	ID          	SIZE  	MODIFIED       
betaStableLm2Tdp:latest	f478475d58b8	1.0 GB	10 seconds ago	
mistral:latest         	f974a74358d6	4.1 GB	7 days ago    	
myOwnStablelm2:latest  	1df5ba03896a	982 MB	9 days ago    	
stablelm2:latest       	714a6116cffa	982 MB	4 weeks ago   	
llama3:latest          	365c0bd3c000	4.7 GB	5 weeks ago   	
llama3.1:latest        	91ab477bec9d	4.7 GB	5 weeks ago   	
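
The size reported for `betaStableLm2Tdp` gives a rough sanity check on the quantization. The back-of-the-envelope sketch below assumes about 1.6 billion parameters (taken from the model name `stablelm-2-1_6b`) and treats the rounded "1.0 GB" figure as 1.0e9 bytes; both are approximations, and the file also contains metadata and some tensors kept at higher precision.

```python
def bits_per_weight(file_size_bytes, n_params):
    """Average number of bits stored per parameter in a model file."""
    return file_size_bytes * 8 / n_params

approx_size_bytes = 1.0e9  # "1.0 GB" as reported by `ollama list` (rounded)
approx_params = 1.6e9      # assumed from the model name "stablelm-2-1_6b"

effective_bits = round(bits_per_weight(approx_size_bytes, approx_params), 1)
print(effective_bits)
```

Under these assumptions the result is around 5 bits per weight, compared with 16 bits per weight for the F16 file, which is in line with what one would expect from a 4-bit-class format.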


In [25]:
!ollama run betaStableLm2Tdp

>>> Send a message (/? for help)
Use Ctrl + d or /bye to exit.
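
To call the deployed model from project code rather than from the interactive prompt, one option is Ollama's local HTTP API. The sketch below assumes the Ollama server is running on its default address (`http://localhost:11434`) and that the model name matches the one created above; `build_payload` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

# Default local Ollama endpoint; assumed to be unchanged on your machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt, model="betaStableLm2Tdp"):
    """Build the JSON body for a single, non-streaming generation request."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(prompt, model="betaStableLm2Tdp", url=OLLAMA_URL, timeout=120):
    """Send the prompt to the local Ollama server and return the reply text."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]

payload = build_payload("What products are offered at UOB?")
print(json.dumps(payload))
```

With the server running, `ask("What products are offered at UOB?")` should return the model's answer as a string; the system message and template from the Modelfile are applied by Ollama on the server side.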