# Final Team Project: Advanced Generative Chatbot Design

TEAM 6: Bin Lu, Isaack Karanja, Victor Veselov


# Introduction

Welcome to this Jupyter notebook, which is a part of our comprehensive report on building a sophisticated chatbot capable of engaging in multi-turn conversations, adapting to context, and handling a wide array of topics.

In this notebook, we fine-tune an existing Large Language Model (LLM) from Hugging Face to enhance its dialogue summarization capabilities. We describe our review process, in which we scrutinized models such as OPT and Flan-T5, and outline the rationale behind our choice of LLaMa2-chat-instruct. This model stands out for its high-quality instruction tuning and its inherent ability to summarize text.

To further improve the quality of the model's inferences, we walk through a complete fine-tuning approach. We then assess the results using ROUGE metrics, a popular choice for evaluating automated summaries.

Furthermore, we will explore Parameter Efficient Fine-Tuning (PEFT), demonstrating how it enhances the performance. Although PEFT might slightly compromise on certain performance metrics, you will observe that its benefits convincingly outweigh these minor setbacks.

<img src="https://github.com/SweatyCrayfish/Ubuntu-Lllama-2/blob/main/Final%20Folder/Model%20Selection/Photos.jpeg?raw=true"
     alt="Markdown Monster icon"
     style="float: left; margin-right: 10px width:100;" />

# [Methodology](#methodology)
**[Dataset](#Dataset)**
- **[Dataset Analysis](#dataset-analysis)**
- **[Data Cleanup](#data-cleanup)**

**[Model Selection](#model-selection)**
- **[Base model vs Instructional Tuned Models](#base-model-vs-instructional-tuned-models)**

**[Training](#Training)**
- **[Full training vs PEFT](#Full-Fine-Tunning-vs-Parameter-Efficient-Fine-Tunning-(PEFT))**
- **[Time, Memory Implications](#time-memory-implications)**

**[Conclusion]()**
- **[Future work]()**
- **[References]()**

In [1]:
# Standard library
import json
import time

# Third-party libraries
import bitsandbytes as bnb
import evaluate
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from datasets import Dataset, DatasetDict, load_dataset
from evaluate import load
from peft import get_peft_config, PeftModel, PeftConfig, get_peft_model, LoraConfig, TaskType
from transformers import (
    AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer,
    BitsAndBytesConfig, GenerationConfig, LlamaForCausalLM,
    PegasusForConditionalGeneration, PegasusTokenizer,
    T5ForConditionalGeneration, Trainer, TrainingArguments,
)


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)


# [Dataset](#Methodology)

### Download the dataset 
We download the preprocessed dataset from Hugging Face. `load_dataset` returns a `DatasetDict` holding one `Dataset` per split.

In [2]:
from datasets import load_dataset
ubuntu_question_answer = load_dataset("mugithi/ubuntu_question_answer_jsonl")

In [3]:
# Write each split to its own JSON Lines file without shadowing the DatasetDict.
for split, split_dataset in ubuntu_question_answer.items():
    split_dataset.to_json(f"{split}.jsonl")

Creating json from Arrow format:   0%|          | 0/13 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

In [4]:
train_dataset = load_dataset('json', data_files='train.jsonl',split='train')  
test_dataset = load_dataset('json', data_files='test.jsonl', split='train')


### Preparing the data for LLaMa2 fine tuning

We will employ Supervised Fine-Tuning (SFT) to refine LLaMa2 for the Question & Answer (Q&A) task. Within the SFT framework, we adjust the model using instruction and response pairs derived from the `ubuntu_question_answer` dataset. The objective is to minimize the discrepancy between the generated answers and the actual responses, which serve as labels. This process also lets the model adapt to the response nuances present in our training data.

A formatting function, `formatting_func`, processes each example from the `ubuntu_question_answer` dataset and builds the structured prompt used to fine-tune LLaMa2.

Furthermore, we use Hugging Face's `SFTTrainer` class during the training phase. It accepts the formatting function together with the training and test datasets, which lets us focus on preparing the dataset and structuring the training examples correctly.



In [5]:
def formatting_func(example):
    # Build one training string from a single question/answer record.
    text = f"###question: {example['question']} , ###answer: {example['answer']}"
    return [text]

In [6]:
example = formatting_func(train_dataset[2])  # format only the third training record

In [7]:
example[0]

'###question: hey people! i have a text file of 250mb, and i need to insert some text at the beginning of the file and the end. how would i do it, not opening the whole file? ) sorry for the noobness ) , ###answer: you can do this echo test | cat - file > new file'
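
At inference time the same template can be reused with the answer left empty, so that the model continues the text after the `###answer:` marker. The sketch below illustrates that idea under stated assumptions: the helper names `build_training_text`, `build_prompt`, and `extract_answer` are hypothetical and are not part of the original notebook; only the template itself comes from `formatting_func`.

```python
# Hypothetical helpers illustrating the prompt template used by formatting_func.
QUESTION_TAG = "###question:"
ANSWER_TAG = "###answer:"


def build_training_text(question: str, answer: str) -> str:
    """Full training string: question and reference answer in one sequence."""
    return f"{QUESTION_TAG} {question} , {ANSWER_TAG} {answer}"


def build_prompt(question: str) -> str:
    """Inference prompt: same template, answer left for the model to generate."""
    return f"{QUESTION_TAG} {question} , {ANSWER_TAG}"


def extract_answer(generated_text: str) -> str:
    """Return whatever follows the last answer tag in the generated text."""
    return generated_text.rsplit(ANSWER_TAG, 1)[-1].strip()


sample = {"question": "how do i list files?", "answer": "use ls"}
text = build_training_text(sample["question"], sample["answer"])
print(text)
print(build_prompt(sample["question"]))
print(extract_answer(text))
```

Keeping the training and inference templates in one place makes it less likely that the two drift apart, which would otherwise show up as degraded answers rather than as an error.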

## Tokenization

The LLaMa2 tokenizer processes the `example[0]` text and carries out the following steps to ready it for the LLaMa2 model:

- **Raw Text -> Byte-Pair Encoding (BPE) Tokenization:** Initially, the text is segmented into subwords or tokens using the Byte-Pair Encoding (BPE) method, which iteratively merges the most common pair of characters or character sequences.

- **Add Special Tokens:** Special tokens like `<s>` (Beginning of Sequence) and `</s>` (End of Sequence) are added at the start and end of the token sequence to mark its boundaries.

- **Convert Tokens to IDs:** Each token is mapped to its corresponding ID based on the LLaMa2 vocabulary.

- **Padding or Truncation (if required):** Unlike BERT, LLaMa2 doesn’t define a padding token. We set `pad_token` to `<pad>` and record its ID in the model configuration as `model.config.pad_token_id`.

- **Create Attention Mask:** An attention mask can be created to distinguish actual tokens from any padding, ensuring the model focuses only on the real content during processing. This step may require custom handling due to the absence of a designated padding token in LLaMa2.
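
The padding and attention-mask steps can be illustrated without the real tokenizer. The following is a toy sketch only: `pad_batch` and the pad ID of `0` are assumptions made for illustration, and the token IDs are the first few IDs shown in this notebook's tokenizer output (`<s>`, `▁###`, `question`, `:`).

```python
# Toy illustration of left padding and attention masks (not the real tokenizer).
from typing import Dict, List

PAD_ID = 0  # assumed pad id, used for this sketch only


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID,
              padding_side: str = "left") -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, ones = [pad_id] * n_pad, [1] * len(seq)
        if padding_side == "left":
            input_ids.append(pads + seq)
            attention_mask.append([0] * n_pad + ones)
        else:
            input_ids.append(seq + pads)
            attention_mask.append(ones + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A 4-token sequence and a 2-token sequence padded to the same length.
batch = pad_batch([[1, 835, 12470, 29901], [1, 835]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Left padding keeps the real tokens at the end of each row, which is convenient for batched generation with a decoder-only model because new tokens are appended directly after the prompt.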


In [8]:
from transformers import LlamaTokenizer
model_name ="meta-llama/Llama-2-13b-chat-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_name, add_eos_token=True, spaces_between_special_tokens=True)
tokenizer.padding_side = "left"  # Allow batched inference
tokenizer.pad_token = "<pad>"

In [9]:
import numpy as np
from tabulate import tabulate
def sentence_encoding(example):
    # Tokenize one string and print its tokens, token IDs, and attention mask.
    encodings = tokenizer(example, padding=True, return_attention_mask=True)
    token_ids = encodings['input_ids']
    attention = encodings['attention_mask']

    # Map the IDs straight back to tokens so the three columns stay aligned.
    tokens = tokenizer.convert_ids_to_tokens(token_ids)

    table = list(zip(tokens, token_ids, attention))
    print(tabulate(table,
                   headers=['Tokens', 'Token IDs', 'Attention Mask'],
                   tablefmt='fancy_grid'))


sentence_encoding(example[0])

╒════════════╤═════════════╤══════════════════╕
│ Tokens     │   Token IDs │   Attention Mask │
╞════════════╪═════════════╪══════════════════╡
│ <s>        │           1 │                1 │
├────────────┼─────────────┼──────────────────┤
│ ▁###       │         835 │                1 │
├────────────┼─────────────┼──────────────────┤
│ question   │       12470 │                1 │
├────────────┼─────────────┼──────────────────┤
│ :          │       29901 │                1 │
├────────────┼─────────────┼──────────────────┤
│ ▁he        │         540 │                1 │
├────────────┼─────────────┼──────────────────┤
│ y          │       29891 │                1 │
├────────────┼─────────────┼──────────────────┤
│ ▁people    │        2305 │                1 │
├────────────┼─────────────┼──────────────────┤
│ !          │       29991 │                1 │
├────────────┼─────────────┼──────────────────┤
│ ▁i         │         474 │                1 │
├────────────┼─────────────┼────────────

**observation**
- On printing out the 'Tokens', 'Token IDs', and 'Attention Mask' columns, we observed that the Beginning of Sequence `<s>` and End of Sequence `</s>` tokens were added. From our research, LLaMa2 also supports the special markers `[INST]` and `[/INST]` to wrap instructions. Using these markers gave mixed results in our implementation, so we used the `###` marker, which the tokenizer maps to `token_id = 835`, to signal the beginning of a turn in the conversation.

# Model Selection

For the Question and Answer downstream task, we considered several transformer models for fine-tuning. Fine-tuning pre-trained language models is a common technique in natural language processing for adapting models to downstream tasks, since the resources required to train such models from scratch are out of reach for most organizations, as shown in the table below (Naveed et al., 2023).

| Model | Architecture | Parameters | Layers | Attention Heads | Processing Units | Processing Unit Type | Creator | Training Data |
|-------|--------------|------------|--------|-----------------|------------------|----------------------|---------|---------------|
| T5 | Encoder-decoder | 11 billion | 24 | 128 | 1024 | TPU v3 | Google | C4 dataset |
| OPT | Causal decoder-only | 175 billion | 96 | 96 | 992 | 40GB A100 GPU | Meta | Pile, PushShift Reddit |
| LLaMA | Causal decoder-only | 65 billion | 80 | 64 | 2048 | 80GB A100 GPU | Meta | CommonCrawl, C4, GitHub, Wikipedia, Books, arXiv, StackExchange |




LLaMa2, an open-source Large Language Model from Meta AI, has been optimized for dialogue use cases (Touvron et al., 2023) and outperforms open-source chat models on most benchmarks tested, indicating strong performance in a dialogue environment. The OPT model series, also from Meta, is comparable in performance to GPT-3, which is known for its strong performance in various NLP tasks including conversational AI. The T5 model from Google, particularly when fine-tuned, achieves state-of-the-art results on many NLP benchmarks and can also be fine-tuned for conversational tasks, indicating robust performance (Naveed et al., 2023).

LLaMa2 and Flan-T5 were designed with fine-tuning in mind, whereas OPT, although it performs well, comes with little guidance from its creators on fine-tuning.


**reference** 

- Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., & Mian, A. (2023). A Comprehensive Overview of Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2307.06435

- Touvron, H., Martin, L., Stone, K., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv. https://doi.org/10.48550/arXiv.2307.09288

- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv. https://doi.org/10.48550/arXiv.2302.13971



### Instructionally tuned

Instructionally tuned models are a subtype of fine-tuned language models specially trained to follow natural language instructions to solve a task. They are fine-tuned with input-output pairs that include instructions and attempts to follow those instructions, making them adept at understanding and executing tasks as instructed in natural language. This training approach is supervised and helps improve the model's zero-shot performance on unseen tasks (Wei et al., 2021). For both the LLaMa2 and T5 models, we evaluated the instruction-tuned variants.

| Instruct Fine-Tuned Variant | Model Type | Number of Parameters |
|-----------------------------|------------|----------------------|
| FLAN-T5-Small | FLAN-T5 | 80 Million |
| **FLAN-T5-Base** | **FLAN-T5** | **250 Million** |
| FLAN-T5-Large | FLAN-T5 | 780 Million |
| FLAN-T5-XL | FLAN-T5 | 3 Billion |
| **LLaMa2-Chat-7B** | **LLaMa2-Chat** | **7 Billion** |
| LLaMa2-Chat-13B | LLaMa2-Chat | 13 Billion |
| LLaMa2-Chat-70B | LLaMa2-Chat | 70 Billion |



**observation**
- We evaluated all three models (Flan-T5, OPT, and LLaMa2) and decided to focus our experiments on LLaMa2 because of its state-of-the-art performance and its larger context length of 4096 tokens, versus 512 tokens for T5. Context length affects a model's ability to generate coherent responses in conversation and in other tasks that require an understanding of broader context.

# Training

In this section, we investigate the use of LLaMa2, a pre-trained Language Model, for a Question & Answer task as a Subject Matter Expert (SME) on an Ubuntu Forum. Our aim is to develop a chatbot by fine-tuning LLaMa2 to suit this specific task. 

We will deliberate on various alternatives that we explored to accomplish this task, including the full fine-tuning of the model. Subsequently, our decision to employ the LoRA technique will be elucidated. Finally, we will evaluate and compare the performance of our model before and after fine-tuning.


### Full Fine-Tunning vs Parameter Efficient Fine-Tunning (PEFT)

Fine tuning involves taking a model that has been pre-trained on a large general domain corpus and further training it on data from the target task. This allows the model to retain the general language knowledge acquired during pre-training while adapting to the nuances and specifics of the new task.

In full fine tuning, all of the model's parameters are updated during the process. While this typically yields great performance, it also requires significant compute resources and introduces challenges for model scaling and storage. To address this, recent work has explored more parameter efficient fine tuning (PEFT) methods that only update a subset of the model's parameters.

PEFT methods fall into a few categories, summarized below alongside full fine-tuning for comparison:
- **LoRA:** LoRA stands for Low-Rank Adaptation. It freezes the pre-trained weights and adds small trainable low-rank matrices alongside selected weight matrices. Only the parameters of these added matrices are updated during adaptation.
- **Prefix Tuning:** This technique keeps the language model frozen and learns a fixed-length sequence of continuous prefix vectors that is prepended at each layer. The prefix can be considered a form of continuous prompt that is optimized during training.
- **P-tuning:** P-tuning learns continuous prompt embeddings, produced by a small trainable prompt encoder, instead of relying on fixed textual prompts. These learned prompts adapt to the task at hand.
- **Prompt Tuning:** This method learns a small set of continuous "soft prompt" embeddings that are prepended to the input embeddings while the model itself stays frozen. Unlike P-tuning, the prompt embeddings are trained directly rather than generated by a prompt encoder.
- **Full Fine-Tuning:** In full fine-tuning, the entire model, including all its parameters, is fine-tuned for a specific task.

| Method | Explanation | Pros | Cons | Comparison with Full Fine-Tuning |
|--------|-------------|------|------|----------------------------------|
| [LoRA](https://arxiv.org/abs/2106.09685) | Adds trainable low-rank matrices alongside frozen weights. Only the added matrices are updated. | 1. Computationally efficient<br>2. Effective fine-tuning | 1. May not capture complex task-specific requirements | Less parameter-intensive than full fine-tuning |
| [Prefix Tuning](https://arxiv.org/abs/2101.00190) | Optimizes a fixed-length continuous prefix prepended at each layer during fine-tuning. | 1. Efficient for sequence generation tasks<br>2. Can adapt to various tasks | 1. Limited to sequence tasks | More task-specific but less general than full fine-tuning |
| [P-tuning](https://arxiv.org/abs/2103.10385) | Learns continuous prompt embeddings through a trainable prompt encoder. | 1. Highly adaptable to tasks<br>2. Can improve prompt efficacy | 1. May require careful initialization | More adaptable but may require more tuning than full fine-tuning |
| [Prompt Tuning](https://arxiv.org/abs/2104.08691) | Learns soft prompt embeddings prepended to the input; the model stays frozen. | 1. Easy to implement<br>2. Useful for specific tasks | 1. Limited capacity, since only the prompt embeddings are trained | Less expressive and more constrained than full fine-tuning |
| Full Fine-Tuning | Fine-tunes all the parameters of the model for a specific task. | 1. Highly effective for task-specific requirements<br>2. Can capture complex features | 1. Computationally expensive<br>2. Risk of overfitting | Most comprehensive but resource-intensive |

In this section of the report, we apply LoRA in our Jupyter notebook with the aim of attaining substantial reductions in several areas: the volume of data required, the training time, the computational requirements, and the overall cost of these operations.


### PEFT with LoRA

LoRA stands for Low-Rank Adaptation, a technique to efficiently fine-tune large pre-trained language models. Below is how it works.

**Components of LoRA:**
- **Adapter Matrices:** LoRA introduces low-rank adapter matrices that sit alongside specific weight matrices in the pre-trained model. In Transformer models, these adapters are typically added in parallel to the Query, Key, and Value projection matrices in each attention layer.
- **Low-Rank Structure:** The adapter matrices are "low-rank," meaning they can be represented with far fewer parameters than the original weight matrices. For example, if the original matrix is 4096 × 4096, the adapter might have a rank of just 1 or 2.

**Training Process:**  
- **Initialization:** Each adapter is a pair of matrices: one is randomly initialized and the other starts at zero, so the adapter initially leaves the model's output unchanged. The original weight matrices remain fixed.
- **Fine-Tuning:** During the fine-tuning process, only these adapter matrices are updated. They capture essential new task-specific knowledge.
- **Task Specialization:** The original weight matrix provides broad, generalized knowledge. In contrast, the low-rank adapter focuses on specialized features needed for the new task.

**Deployment:**  
- **Inference:** When the model is deployed for making predictions, the adapter matrices can be merged into the original weight matrices, so inference time is not affected.
- **Adaptability:** The adapters can also be easily swapped for different tasks.

**Benefits:**  
- **Efficiency:** Requires much less computational power and memory than traditional fine-tuning methods.
- **Less Overfitting:** By focusing on a small set of parameters, LoRA reduces the risk of overfitting, especially when fine-tuning data is limited.
- **Strong Performance:** Experiments have shown that LoRA can achieve similar or better performance compared to traditional fine-tuning methods, despite using significantly fewer trainable parameters.
- **Data Efficient:** LoRA has shown strong performance in adapting large models like GPT-3, RoBERTa, and DeBERTa using only thousands to tens of thousands of examples.
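
The mechanics described above can be sketched with tiny matrices in plain Python. This is an illustrative toy under stated assumptions, not the PEFT implementation: the sizes, the `alpha / r` scaling convention, and the helper names `matmul`, `add`, and `lora_forward` are chosen here for the example.

```python
# Minimal LoRA sketch in pure Python (tiny matrices, no deep-learning library).
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 8      # illustrative sizes only
scaling = alpha / r                     # the low-rank update is scaled by alpha / r

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [[random.gauss(0, 1)] for _ in range(d_in)]                        # one input column vector


def lora_forward(W, A, B, x):
    """h = W x + scaling * B (A x): frozen path plus low-rank adapter path."""
    return add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)


# With B = 0 the adapter is a no-op, so training starts from the base behaviour.
assert lora_forward(W, A, B, x) == matmul(W, x)

# Pretend training has updated B, then merge the adapter into the weight.
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d_out)]
W_merged = add(W, matmul(B, A), scale=scaling)      # W + scaling * B A
h_adapter = lora_forward(W, A, B, x)
h_merged = matmul(W_merged, x)
max_diff = max(abs(p[0] - q[0]) for p, q in zip(h_adapter, h_merged))

print(f"max difference, adapter path vs merged weights: {max_diff:.2e}")
print("adapter params:", r * (d_in + d_out), "vs full matrix:", d_in * d_out)
```

The two paths agree up to floating-point rounding, which is why a merged adapter adds no inference latency. The parameter saving grows with matrix size: the adapter costs `r * (d_in + d_out)` values against `d_in * d_out` for the full matrix.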


## Quantized LoRA Memory Usage

### Memory

To train our model, we needed to factor in the model weights, activations, optimizer states, and additional memory for elements like mini-batch data, regularization, and algorithm overhead. The table below summarizes these for a typical state-of-the-art language model such as the 7B-parameter LLaMa2 stored in full precision.


| Item (Full Precision)                         | Memory Usage (bytes per parameter) |
|------------------------------|------------------------------------|
| Model Weights             | 4 (32bit)                                  |
| AdamW Optimizer (2 states)   | +8                                 |
| Gradients                    | +4                                 |
| Activations and Buffer       | +8 (based on parameter sequence length, hidden size, and batch size) |

For full training of the LLaMa 7B model, the memory requirement comes out to about 160 GB, which requires at least two NVIDIA A100 80 GB GPUs.


| Item (Optimized)                                 | Memory Usage (bytes per parameter) |
|--------------------------------------------------|------------------------------------|
| Model Weights (4-bit NF4 storage, bf16 compute)  | 0.5                                |
| AdamW Optimizer (2 states) with bitsandbytes     | +4                                 |
| Gradients                                        | +2                                 |
| Activations and Buffer                           | +1 (optimized with `gradient_checkpointing`) |

By adding multiple optimizations, the training memory requirement comes down to a reasonable amount that can fit on a single GPU.
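
The per-parameter figures in the two tables can be turned into totals with a few lines of arithmetic. The sketch below assumes a nominal 7 billion parameters and decimal gigabytes; the dictionary keys and the helper name `total_gb` are made up for this example.

```python
# Back-of-the-envelope training memory from the bytes-per-parameter tables.
N_PARAMS = 7e9  # nominal parameter count of a 7B model (assumption)

full_precision = {"weights": 4, "adamw_states": 8, "gradients": 4, "activations": 8}
optimized = {"weights_nf4": 0.5, "adamw_states": 4, "gradients": 2, "activations": 1}


def total_gb(bytes_per_param: dict, n_params: float = N_PARAMS) -> float:
    """Total memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return sum(bytes_per_param.values()) * n_params / 1e9


full_gb = total_gb(full_precision)
opt_gb = total_gb(optimized)
print(f"full precision: {full_gb:.1f} GB")
print(f"optimized:      {opt_gb:.1f} GB")
print(f"reduction:      {100 * (1 - opt_gb / full_gb):.0f}%")
```

With these inputs the full-precision total lands in the same range as the roughly 160 GB quoted in the text, and the optimized total fits within a single 80 GB GPU. Both are rough upper-level estimates, since activation memory in particular depends on sequence length and batch size.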

**QLoRA**

- We used Quantized LoRA (QLoRA), which adds the following innovations to LoRA: __4-bit quantization__ with the __4-bit NormalFloat (NF4)__ data type, __double quantization__, __paged optimizers__, and __backpropagation through quantized weights__ into the adapters. Together these reduce memory usage and make it possible to fine-tune large models on limited hardware.

- Imagine a tall baseball player who suddenly becomes short (quantization). This change demands adjustments like getting smaller shoes (4-bit NormalFloat) to fit better. Further, the player needs to alter his training regime (backpropagation through quantized weights to adapters added at every layer) to adapt to his new height, ensuring his performance remains top-notch despite the physical change. Similarly, QLoRA makes necessary adjustments to maintain performance while reducing the model's "size" (memory requirement).
As we will see, this results in a model with a much smaller memory footprint.
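
To make the idea of quantization concrete, here is a deliberately simplified sketch. It uses uniform absmax quantization to integer codes, which is an assumption made for illustration and is not the NF4 data type or the bitsandbytes implementation; `quantize_4bit` and `dequantize_4bit` are hypothetical names.

```python
# Toy 4-bit quantization: uniform absmax levels, NOT the NF4 data type used by QLoRA.
from typing import List, Tuple


def quantize_4bit(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integer codes in [-7, 7] using one absmax scale per block."""
    absmax = max(abs(w) for w in weights) or 1.0   # avoid a zero scale
    scale = absmax / 7
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_4bit(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the codes and the stored scale."""
    return [c * scale for c in codes]


block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.70, -0.04, 0.00]
codes, scale = quantize_4bit(block)
restored = dequantize_4bit(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)
print(f"scale: {scale:.3f}, max reconstruction error: {max_err:.3f}")
```

Each weight is stored as a small integer code plus one shared scale per block, so storage drops to about half a byte per weight at the cost of a bounded rounding error. NF4 follows the same store-codes-plus-scale pattern but places its levels to suit normally distributed weights.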

**reference**
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. https://doi.org/10.48550/arXiv.2106.09685

- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. https://doi.org/10.48550/arXiv.2305.14314
<img src="https://github.com/SweatyCrayfish/Ubuntu-Lllama-2/blob/main/Final%20Folder/Model%20Selection/QLora%20_1.jpeg?raw=true"
 alt="Markdown Monster icon"
     style="float: left; margin-right: 20px width:100;" />
     
     

For serving, you typically only need to account for the model weights. For the 7B model this is roughly 10 GB, or about 7 GB for an 8-bit quantized model together with its LoRA adapter.

In [10]:
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"Nvidia SMI reported GPU memory occupied: {info.used//1024**3} GB.")

In [11]:
def model_giga_bytes(model):
    mem_params = sum([param.nelement()*param.element_size() for param in model.parameters()])
    mem_bufs = sum([buf.nelement()*buf.element_size() for buf in model.buffers()])
    mem = mem_params + mem_bufs # in bytes
    gpu_Gbytes =  mem / 1024 / 1024 / 1024
    print(f"Mem Prams + Mem Buffer used Calculated Model Memory: {round(gpu_Gbytes,2)} GB")

In [12]:
print_gpu_utilization()

Nvidia SMI reported GPU memory occupied: 0 GB.


**Observation**
- Before loading the model into memory, the GPU memory reported as occupied by Nvidia SMI is ~zero.

### Loading the base model from Hugging Face

In [13]:
model_name = "meta-llama/Llama-2-7b-chat-hf"
# 4-bit NF4 quantization; matrix multiplications are computed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# device_map is an argument of from_pretrained, not of BitsAndBytesConfig.
original_model = LlamaForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
original_model.config.use_cache = False  # the KV cache is not needed during training
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, spaces_between_special_tokens=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [14]:
model_giga_bytes(original_model)
print_gpu_utilization()

Mem Prams + Mem Buffer used Calculated Model Memory: 3.57 GB
Nvidia SMI reported GPU memory occupied: 5 GB.


**observation**
- On loading the 4-bit quantized 7-billion-parameter model, the calculated model memory stood at 3.57 GB, inclusive of model parameters and buffers, while Nvidia SMI reported a noticeably higher GPU memory occupation of 5 GB.
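
A rough cross-check of the calculated figure is possible from the model's architecture alone. The sketch below uses the published Llama-2-7B dimensions; the split into 4-bit and 16-bit tensors is an assumption about how the quantized model is stored (linear projections in 4-bit, embeddings, output head, and norms in 16-bit), not something verified in this notebook.

```python
# Rough estimate of weight memory for a 4-bit Llama-2-7B.
# Architecture sizes are the published 7B configuration; dtype choices are assumptions.
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 4096, 11008, 32, 32000

attn_params = 4 * HIDDEN * HIDDEN            # q, k, v, o projections per layer
mlp_params = 3 * HIDDEN * INTERMEDIATE       # gate, up, down projections per layer
quantized_params = LAYERS * (attn_params + mlp_params)

embedding_params = 2 * VOCAB * HIDDEN        # input embeddings + output head
norm_params = (2 * LAYERS + 1) * HIDDEN      # RMSNorm weights
unquantized_params = embedding_params + norm_params

# Assumption: 0.5 byte per quantized parameter, 2 bytes for everything else.
weight_bytes = 0.5 * quantized_params + 2 * unquantized_params
weight_gib = weight_bytes / 1024**3

print(f"total parameters:        {quantized_params + unquantized_params:,}")
print(f"estimated weight memory: {weight_gib:.2f} GiB")
```

The estimate comes out slightly below the 3.57 GB calculated by `model_giga_bytes`. The remainder would plausibly be model buffers, although that attribution is a guess. The larger gap to the 5 GB reported by Nvidia SMI would then include memory outside the model tensors, such as the CUDA context and allocator overhead.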

# Model Preprocess

### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

`r=16`: The rank relates to how many dimensions' worth of new knowledge the adapter can represent. Low rank is like zooming out: you only see the big, rough picture. High rank is like zooming in: you see more fine-grained details, so the rank acts like the resolution of the adapter matrix. A higher rank requires more parameters. Ranks of `1-4` often work well in practice.

`lora_alpha=64`: Scales the adapter's contribution; the low-rank update is multiplied by `lora_alpha / r`. This affects how rapidly the adapters can adapt the model, so it acts much like a learning rate: higher values reduce stability.

`bias="all"`: Controls which bias terms are trained. In PEFT, `none` (the default) keeps all bias terms frozen, `lora_only` trains only the bias terms of the layers that receive LoRA adapters, and `all` trains every bias term in the model. We use `all` in this notebook. For reference, a linear layer computes:

                                            `output = input @ weight_matrix + bias`

**Differences between LoRA and QLoRA**

In the paper **LoRA: Low-Rank Adaptation of Large Language Models**, Hu et al. (2021) suggest a bias value of `none` for most of their examples and apply LoRA adapters only to the attention projections (query, key, and value). In QLoRA (Dettmers et al., 2023), the authors compare several targets (key + query, all attention layers, all FFN layers, attention + FFN output layers, and all layers) and apply LoRA adapters to all of the linear layers. In this notebook we additionally train the bias terms (`bias="all"`) instead of keeping them frozen. This can be understood intuitively through the baseball analogy: a player who suddenly finds himself shorter needs adjustments in many more areas in order to perform at his previous level.


### 3.2 - Train PEFT Adapter

Define training arguments and create the `SFTTrainer` instance.

#### Define the LoRA adaptor

In [15]:
import os
from peft import LoraConfig, get_peft_model,  TaskType
from typing import List

max_seq_length = 512
micro_batch_size = 4
gradient_accumulation_steps = 4
lora_r: int = 16
lora_alpha: int = 64
lora_dropout: float = 0.1
cutoff_len: int = 256
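lora_target_modules: List[str] = [
    "q_proj",
    "up_proj",
    "o_proj",
    "k_proj",
    "down_proj",
    "gate_proj",
    "v_proj",
]  # no trailing comma here, otherwise this becomes a tuple wrapping the list
use_wandb = True
wandb_run_name = "LLaMav2_Train_01"
max_steps = 1000


# Define LoRA Config
# Note: this rebinds the name LoraConfig from the class to the config instance;
# the following cells refer to the instance by this name.
LoraConfig = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    target_modules=lora_target_modules,
    lora_dropout=lora_dropout,
    bias="all",
    task_type="CAUSAL_LM",
)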

In [16]:
peft_model = get_peft_model(original_model, LoraConfig)
peft_model.print_trainable_parameters()

trainable params: 39,976,960 || all params: 6,778,392,576 || trainable%: 0.589770503135875


**observation**
- With QLoRA, we train only about 0.59% of the model's parameters: 39,976,960 trainable parameters out of 6,778,392,576 in total. Because gradients and optimizer states are kept only for these parameters, the training memory footprint is much smaller than for full fine-tuning.
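
The trainable-parameter count printed by `print_trainable_parameters()` can be reproduced by hand. The sketch below assumes the published Llama-2-7B layer shapes and that adapters are attached to all seven target modules in each of the 32 decoder layers; `lora_params` and `target_shapes` are names introduced for this example.

```python
# Count LoRA parameters for Llama-2-7B with r=16 on all seven projection modules.
HIDDEN, INTERMEDIATE, LAYERS, RANK = 4096, 11008, 32, 16

# (input features, output features) of each targeted linear layer in one decoder block.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int = RANK) -> int:
    """A is (r x d_in) and B is (d_out x r), so the pair adds r * (d_in + d_out)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in target_shapes.values())
trainable = LAYERS * per_layer
all_params = 6_778_392_576  # total reported by print_trainable_parameters()

print(f"trainable params: {trainable:,}")
print(f"trainable share:  {100 * trainable / all_params:.2f}%")
```

The result matches the 39,976,960 trainable parameters in the cell output, which suggests that no bias terms contribute to the count even with `bias="all"`.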

In [17]:
import transformers
from trl import SFTTrainer

output_dir = f'.log/peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    # auto_find_batch_size=True,
    output_dir=output_dir,
    per_device_train_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=2e-4, # Higher learning rate than full fine-tuning.
    max_steps=max_steps,
    optim="adamw_torch",
    logging_steps=10,
    do_eval=True,
    fp16=True,
    evaluation_strategy="steps",
    save_steps=50,                # Save a checkpoint every 50 steps
    save_strategy="steps",       # Checkpoint on a step schedule (see save_steps)
    report_to="wandb",
    eval_steps=20, 
    run_name=wandb_run_name if use_wandb else None,    
)
    
peft_trainer = SFTTrainer(
    model=original_model,
    peft_config=LoraConfig,
    max_seq_length=max_seq_length,
    formatting_func=formatting_func,
    args=peft_training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

Map:   0%|          | 0/12100 [00:00<?, ? examples/s]

Map:   0%|          | 0/5186 [00:00<?, ? examples/s]

In [18]:
# Train model
peft_trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mmugithi[0m ([33mnyumbani[0m). Use [1m`wandb login --relogin`[0m to force relogin


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


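| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20   | 0.7486        | 4.741363        |
| 40   | 0.0102        | 5.961457        |
| 60   | 0.0061        | 6.324196        |
| 80   | 0.0056        | 6.434465        |
| 100  | 0.0055        | 6.455405        |
| 120  | 0.0054        | 6.486933        |
| 140  | 0.0054        | 6.529018        |
| 160  | 0.0053        | 6.551917        |
| 180  | 0.0053        | 6.600418        |
| 200  | 0.0052        | 6.607201        |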


TrainOutput(global_step=1000, training_loss=0.037262808330357076, metrics={'train_runtime': 3026.3655, 'train_samples_per_second': 5.287, 'train_steps_per_second': 0.33, 'total_flos': 2.65467394523136e+17, 'train_loss': 0.037262808330357076, 'epoch': 1000.0})

**observation**
- The training report shows that the LoRA adapter was trained for 1,000 steps in about 50 minutes (`train_runtime` of 3026.37 seconds).
- In the logged steps, the training loss dropped sharply from 0.7486 at step 20 to 0.0055 at step 100, while the validation loss rose from 4.74 to 6.46.
- The final output shows an average training loss of 0.0373 and total floating point operations (FLOPs) of approximately 2.65e+17.
- This suggests an initial learning phase with subsequent overfitting, as evidenced by the diverging validation loss.
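
The numbers in `TrainOutput` can be checked against the training configuration with simple arithmetic. The sketch below copies its inputs from the cells and outputs in this notebook and assumes a single training process, so that the effective batch size is the micro batch size times the gradient accumulation steps.

```python
# Sanity-check the training log using values copied from this notebook.
import math

micro_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 1000
train_examples = 12_100        # from the "Map" progress output
train_runtime_s = 3026.3655    # from TrainOutput
reported_epochs = 1000.0       # from TrainOutput

effective_batch = micro_batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_steps
expected_steps_per_epoch = math.ceil(train_examples / effective_batch)
reported_steps_per_epoch = max_steps / reported_epochs

print(f"effective batch size:     {effective_batch}")
print(f"samples processed:        {samples_seen}")
print(f"samples per second:       {samples_seen / train_runtime_s:.3f}")
print(f"runtime in minutes:       {train_runtime_s / 60:.1f}")
print(f"expected steps per epoch: {expected_steps_per_epoch}")
print(f"reported steps per epoch: {reported_steps_per_epoch:.0f}")
```

The throughput agrees with the reported `train_samples_per_second`. The reported epoch count, however, implies one optimizer step per epoch, whereas 12,100 examples at this batch size would give several hundred steps per epoch. That would be consistent with far fewer sequences reaching the trainer than expected, which could also contribute to the rapid drop in training loss. This is a hypothesis worth verifying, not a confirmed diagnosis.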

# Merging with base model 

Next, we combine the LoRA adapter with the base model so that it can be used for inference. We first save the adapter to disk and delete the base model to free GPU memory. We then reload the base model and load the adapter on top of it before running inference. This keeps GPU memory usage low while preparing the model for inference.

#### Save the LoRA adapter to disk

In [19]:
lora_adapter_dir = "lora_adapter_dir"
peft_trainer.save_model(lora_adapter_dir)
tokenizer.save_pretrained(lora_adapter_dir)

('lora_adapter_dir/tokenizer_config.json',
 'lora_adapter_dir/special_tokens_map.json',
 'lora_adapter_dir/tokenizer.model',
 'lora_adapter_dir/added_tokens.json',
 'lora_adapter_dir/tokenizer.json')

#### Free memory for merging weights

In [20]:
del original_model
torch.cuda.empty_cache()

#### Load the adapter and the base model

Load the LoRA configuration from the adapter directory

In [21]:
lora_adapter_config = LoraConfig.from_pretrained(lora_adapter_dir)

Load the base model and tokenizer

In [22]:
base_model = AutoModelForCausalLM.from_pretrained(lora_adapter_config.base_model_name_or_path, device_map="auto", quantization_config=bnb_config)
base_model.config.use_cache = False
trained_tokenizer = AutoTokenizer.from_pretrained(lora_adapter_config.base_model_name_or_path, add_eos_token=True, spaces_between_special_tokens=True)
trained_tokenizer.padding_side = "left"
trained_tokenizer.pad_token = "<pad>"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Load the LoRA adapter on top of the base model

In [23]:
model_and_lora_adapter = PeftModel.from_pretrained(base_model, lora_adapter_dir,  device_map="auto")

In [24]:
del base_model
torch.cuda.empty_cache()

In [25]:
print(lora_adapter_config.base_model_name_or_path)

meta-llama/Llama-2-7b-chat-hf


In [26]:
orignal_model = AutoModelForCausalLM.from_pretrained(lora_adapter_config.base_model_name_or_path, device_map="auto", quantization_config=bnb_config)
orignal_model.config.use_cache = False

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Save the merged model and tokenizer to disk

**observation**

- At the time of this report, QLoRA does not support merging the adapter into the base model because the two are stored at different precisions. The idea behind QLoRA is to keep small adapters that are interchangeable at inference time, in effect creating an army, or mixture, of experts.
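
For readers unfamiliar with what "merging" would mean, here is a minimal sketch of the LoRA update from Hu et al. (2021): the merged weight is the base weight plus the scaled low-rank product, W + (alpha / r) · B · A. The matrices, the `alpha` and `r` values, and the helper names are toy assumptions chosen for illustration and are unrelated to the notebook's actual configuration. With a 4-bit quantized base weight this addition cannot be applied directly, which is consistent with the precision mismatch noted above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A for full-precision toy matrices."""
    scale = alpha / r
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: d_out = 2, d_in = 3, rank r = 1 (hypothetical values).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x d_in
B = [[0.5], [-1.0]]     # d_out x r
alpha, r = 2, 1

merged = merge_lora(W, A, B, alpha, r)

# The merged weight gives the same output as the base path plus the adapter path.
x = [[1.0], [1.0], [1.0]]
base_out = matmul(W, x)
adapter_out = [[(alpha / r) * v for v in row] for row in matmul(B, matmul(A, x))]
separate = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(base_out, adapter_out)]
print(matmul(merged, x) == separate)
```

Keeping the adapter separate, as this notebook does, leaves the base weights untouched, so several adapters can share one loaded base model.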

# Evaluation

We perform a human evaluation by comparing responses from the base model and the fine-tuned model. To do this, we load the original model and run inference against both the original model and the QLoRA model.

#### Sample questions for human evaluation

In [27]:
sample_questions=[
    "Does ubuntu come with a firewall by default?",
    "Can grub-install work with ext3?",
    "Is there a CLI command to roll back updates?",
    "How do I get rid of Google Chrome?"
]

#### Perform zero shot inference on sample questions

In [28]:
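def test_model_with_zero_shot(question, model_eval, tokenizer_eval):
    """Generate an answer to `question` with `model_eval` and return only the newly generated text."""
    QUESTION = """### question:"""
    ANSWER = """### answer:"""

    prompt = f"{QUESTION} {question.strip()} {ANSWER}"

    # Tokenize the prompt and move the input tensors to the GPU
    test_inputs = tokenizer_eval(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        test_outputs = model_eval.generate(
            **test_inputs,
            max_new_tokens=64,
            temperature=0.1,
        )

    # Drop the prompt tokens and decode with the tokenizer that was passed in
    prompt_length = test_inputs["input_ids"].shape[-1]
    return tokenizer_eval.decode(test_outputs[0][prompt_length:], skip_special_tokens=True)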
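
The function above relies on two small steps: wrapping the question in the `### question:` / `### answer:` markers, and slicing the prompt tokens off the generated sequence so that only the new text is decoded. The sketch below isolates both steps without a model. The helper names and the token ids are hypothetical placeholders, not real tokenizer output.

```python
def build_prompt(question):
    """Wrap a question in the same markers used for zero-shot inference."""
    return f"### question: {question.strip()} ### answer:"


def strip_prompt(output_ids, prompt_length):
    """Keep only the token ids generated after the prompt."""
    return output_ids[prompt_length:]


# Hypothetical token ids: generate() returns the prompt ids followed by new ids.
prompt_ids = [1, 835, 1139, 29901]
output_ids = prompt_ids + [3869, 29892, 2]

print(build_prompt("  Can grub-install work with ext3? "))
print(strip_prompt(output_ids, len(prompt_ids)))
```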

In [29]:
def collect_results(sample_questions, base_model, base_tokenizer, qlora_model, trained_tokenizer):
    questions_list = []
    base_model_results_list = []
    qlora_model_results_list = []
    
    for question in sample_questions:
        base_model_results = test_model_with_zero_shot(question, base_model, base_tokenizer)
        qlora_model_results = test_model_with_zero_shot(question, qlora_model, trained_tokenizer)
        
        questions_list.append(question)
        base_model_results_list.append(base_model_results)
        qlora_model_results_list.append(qlora_model_results)
    
    data = {
        'Question': questions_list,
        'Base_Model_Results': base_model_results_list,
        'QLoRA_Model_Results': qlora_model_results_list
    }
    
    df = pd.DataFrame(data)
    return df

# Run both models on the sample questions and collect the answers
results_df = collect_results(sample_questions, orignal_model, tokenizer, model_and_lora_adapter,trained_tokenizer)

In [30]:
from IPython.display import HTML, display
from tabulate import tabulate

# Render the comparison table as HTML in the notebook
pd.set_option('display.max_colwidth', 500)
html_output = tabulate(results_df, headers='keys', tablefmt='html')
display(HTML(html_output))

| # | Question | Base_Model_Results | QLoRA_Model_Results |
|---|----------|--------------------|---------------------|
| 0 | Does ubuntu come with a firewall by default? | . Yes, Ubuntu comes with a firewall by default. The firewall is called "Ubuntu Firewall" or "ufw" for short. It is enabled by default on most Ubuntu installations. ufw is a simple, easy-to-use firewall that provides a powerful and flexible way to control network traffic on | t is enabled by default in Ubuntu. The firewall is called iptables, and it controls incoming and outgoing network traffic. To enable or disable iptables, you can use the following commands: * To enable iptables, type: sudo iptables -A INPUT -j |
| 1 | Can grub-install work with ext3? | t is possible to use grub-install with ext3 file system, but it requires some additional steps. Here's how you can do it: 1. First, make sure that you have the ext3 driver installed on your system. You can do this by running the following command: ``` | t is possible to use ext3 as the install media for ubuntu. all you need to do is use the command: sudo grub-install /dev/disk/by-uuid /dev/disk/by-uuid ntfs-label = "my ubuntu install" where /dev/ |
| 2 | Is there a CLI command to roll back updates? | t There are several ways to roll back updates in Linux, depending on the update mechanism used. Here are some common methods: 1. Using `apt-rollback`: `apt-rollback` is a command-line tool that allows you to roll back updates in Ubuntu and other Debian- | t is possible to roll back updates in Ubuntu using the `restore' command. This command will restore the package levels to what they were before the most recent update. For example, if you updated a package called 'mysql-server' and want to return to a previous version of that package, you would |
| 3 | How do I get rid of Google Chrome? | t is possible to remove Google Chrome from your computer, but it's important to note that Chrome is a widely used and popular browser, and many people find it to be a convenient and reliable tool for browsing the internet. If you're looking to remove Chrome for any reason, here are the steps you can | s the world's most popular search engine, Google Chrome is the most widely used web browser, and for good reason. It's easy to use, fast, and has a wide range of features that make browsing the web a breeze. However, if you're looking to switch to a different |


**observation**
- Even with a limited run, you can observe in this notebook that the model is starting to adopt the tone of the questions in the Ubuntu chat forum. For example, in #3, where the user asks about uninstalling Google Chrome, the chatbot responds that the question is ambiguous with `sorry, i don't understand what you're trying to do`.

# Conclusion

The report has presented the results of evaluating three different models - Flan-T5, OPT, and LLaMa2. The decision to focus on LLaMa2 was based on its state-of-the-art performance and its ability to handle a higher context length of 4096 tokens compared to the T5 model's 512 tokens. This aspect significantly influences a model's capacity to generate coherent responses in tasks that require understanding broader context.

The memory consumption at different stages, from loading the model into memory to the post-training phase, was closely monitored. Notably, QLoRA allowed us to train only 29.5% of the model's parameters, which reduced our memory footprint by over 70%.
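
As a rough illustration of why adapters reduce the trainable parameter count, the sketch below compares a full weight matrix with its rank-`r` LoRA factors, which add `r * (d_in + d_out)` trainable parameters per adapted matrix. The dimensions and rank are illustrative assumptions, not the notebook's configuration. The overall fraction reported for a model depends on the rank and on which layers receive adapters, so this sketch does not attempt to reproduce the 29.5% figure.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_fraction(d_in, d_out, r):
    """Adapter parameters as a fraction of the full d_out x d_in weight matrix."""
    return lora_param_count(d_in, d_out, r) / (d_in * d_out)


# Hypothetical example: one 4096 x 4096 projection adapted with rank 16.
d_in, d_out, rank = 4096, 4096, 16
print(lora_param_count(d_in, d_out, rank))
print(f"{lora_fraction(d_in, d_out, rank):.4%}")
```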

However, while training loss fell from 0.7486 at step 20 to 0.0052 at step 200, validation loss rose from 4.74 to 6.61 over the same logged steps. This suggests an initial learning phase followed by overfitting, as shown by the diverging validation loss.

QLoRA's utility lies in the interchangeability of its adapters at inference time, which creates a flexible system akin to an army, or mixture, of experts. Even with limited runs, the model began adopting the tone of the Ubuntu chat forum questions.

In conclusion, despite issues such as overfitting during training, LLaMa2 with QLoRA shows promise in handling broader contexts and reducing the memory footprint during training. Further testing and adjustments are necessary for optimal performance.

# References

- Hugging Face. (n.d.). Supervised Fine-tuning Trainer. https://huggingface.co/docs/trl/main/en/sft_trainer
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv. https://doi.org/10.48550/arXiv.2305.14314
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. https://doi.org/10.48550/arXiv.2106.09685
- Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2021). Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
- Li, X., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. Retrieved from https://doi.org/10.48550/arXiv.2101.00190
- Lester, B., Al-Rfou, R., & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691. https://doi.org/10.48550/arXiv.2104.08691
- Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., & Tang, J. (2021). GPT understands, too. arXiv preprint arXiv:2103.10385. https://doi.org/10.48550/arXiv.2103.10385
