# Fine-tuning an LLM: the distributed edition

This notebook was submitted as part of the Deep Learning course in the MSDS curriculum at the University of San Francisco.  

**Group Members**
- Maneel Reddy Karri
- Gurusankar Gopalakrishnan
- Devendra Govil

***

## Introduction

**Large Language Models (LLMs)** are advanced artificial intelligence systems designed to process and generate human-like text. They are trained on vast amounts of text data and learn to understand and produce natural language in a variety of contexts. They belong to the family of auto-regressive, transformer-based models and have seen a huge boom thanks to their remarkable performance on a range of Natural Language Processing tasks. There is no better way to illustrate their significance in the Natural Language Processing domain than to look at ChatGPT, a large language model that has taken the world by storm.


ChatGPT is a large language model developed by OpenAI. It is based on the GPT-3.5 architecture and is further trained with RLHF (reinforcement learning from human feedback).

### Fine-tuning an LLM <a name='h3_1'></a>

General-purpose LLMs like ChatGPT or Bard are trained on an enormous corpus of text data and perform remarkably well on general-purpose NLP tasks. They provide very good base models that can help one get started with modelling NLP problems.  

However, they still have limitations. The data they are trained on may not include material that is relevant to our use case. For example, if I am working at a finance company with proprietary data, the base model has never seen that (or possibly even similar) data, which leads to significant performance degradation on many NLP tasks in niche or specialized domains. Similarly, a general-purpose LLM, which is built to support a variety of NLP tasks, is by design optimized for general performance rather than for a specific task.

Hence, fine-tuning is extremely important because it can deliver a significant performance boost on the tasks that matter. Sometimes it may in fact be the only way to get acceptable performance on those tasks. Since training an LLM from scratch is prohibitively expensive, time consuming, resource intensive, and requires a lot of expertise, fine-tuning a base LLM is a very attractive option for organizations and individuals who want to improve LLM performance on domain-specific tasks.

### Goals for this project

Our project entailed fine-tuning an LLM using the following bleeding-edge techniques:
- Different distributed training approaches such as data, model, and pipeline parallelism
- Parameter-efficient fine-tuning techniques such as LoRA (Low-Rank Adaptation)
- Model inference using a distributed orchestration platform such as Ray Serve

***

## Data Preprocessing

To fine-tune our model we need an instruction dataset that contains instruction, question, and answer triples. These examples let us optimize the model for question answering during fine-tuning.

We use the well-known [Alpaca dataset](https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned), which has the benefit of already being processed for the express purpose of fine-tuning an LLM. This lets us devote more of our time and resources to the challenging task of fine-tuning the LLM on a distributed architecture using the LoRA technique.

To start off, we load the model from the Hugging Face library, initialize the tokenizer, and add special tokens such as:
- End of Sequence (`</s>`)
- Beginning of Sequence (`<s>`)
- Unknown Token (`<unk>`)
   
Registering these special tokens lets the tokenizer convert the text into token IDs, which index into the model's embedding table.

In [None]:
from transformers import LlamaTokenizer, LlamaForCausalLM,LlamaConfig

from peft import LoraConfig, TaskType, get_peft_model
import torch
model_name_or_path = "openlm-research/open_llama_3b"
config = LlamaConfig.from_pretrained(model_name_or_path)
model = LlamaForCausalLM.from_pretrained(model_name_or_path,config = config,load_in_8bit = True,torch_dtype = torch.float16,) 
tokenizer = LlamaTokenizer.from_pretrained(
        model_name_or_path,trust_remote_code = True
        )
tokenizer.pad_token ="[PAD]"
special_tokens = {'bos_token': "<s>",'eos_token': "</s>",'unk_token': "<unk>"}
for k, val in special_tokens.items():
    tokenizer.add_special_tokens({k: val})

In [None]:
from datasets import Dataset, DatasetDict, load_dataset, load_from_disk
tokenizer_name = tokenizer.__class__.__name__

In [None]:
ds = load_dataset("teknium/GPT4-LLM-Cleaned",streaming=False)

Found cached dataset json (/home/mkarri/.cache/huggingface/datasets/teknium___json/teknium--GPT4-LLM-Cleaned-a71aa8ae1ac3982d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 310.80it/s]


**The code below has been adapted from the OpenAccess AI Collective's axolotl project, which can be found [here](https://github.com/OpenAccess-AI-Collective/axolotl).**

In [None]:
# The code below has been taken from the OpenAccess AI Collective's axolotl project,
# located at: https://github.com/OpenAccess-AI-Collective/axolotl
# Please note that this template code was modified for the purposes of this notebook

import abc
import functools
from enum import Enum
from typing import Generator, Optional, Tuple, Union

from datasets import IterableDataset
from transformers import PreTrainedTokenizer  # used as a type annotation below

class InvalidDataException(Exception):
    """
    Exception raised when the data is invalid
    """
class PromptStyle(Enum):
    """
    Enum for prompt styles
    """

    INSTRUCT = "instruct"
    CHAT = "chat"


class AlpacaPrompter:
    """
    Base class for alpaca prompters
    """

    system_prompt = "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
    system_no_input_prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
    prompt_style: Optional[PromptStyle] = None

    def __init__(self, prompt_style=PromptStyle.INSTRUCT.value):
        self.prompt_style = prompt_style if prompt_style else PromptStyle.INSTRUCT.value
        self.match_prompt_style()

    def match_prompt_style(self):
        if self.prompt_style == PromptStyle.INSTRUCT.value:
            self.prompt_input = (
                self.system_prompt
                + "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
            )
            self.prompt_no_input = (
                self.system_no_input_prompt
                + "### Instruction:\n{instruction}\n\n### Response:\n"
            )
            self.response_split = "### Response:"
        if self.prompt_style == PromptStyle.CHAT.value:
            self.prompt_input = (
                self.system_prompt + "USER: {instruction}\n{input}\nASSISTANT:"
            )
            self.prompt_no_input = (
                self.system_no_input_prompt + "USER: {instruction}\nASSISTANT:"
            )
            self.response_split = "ASSISTANT:"

    def build_prompt(
        self,
        instruction: str,
        input: Union[None, str] = None,  # pylint: disable=redefined-builtin
        output: Union[None, str] = None,
    ) -> Generator[str, None, None]:
        # returns the full prompt from instruction and optional input
        # if a label (=response, =output) is provided, it's also appended.
        if input:
            res = self.prompt_input.format(instruction=instruction, input=input)
        else:
            res = self.prompt_no_input.format(instruction=instruction)
        if output:
            res = f"{res}{output}"
        yield res

    def get_response(self, output: str) -> str:
        return output.split(self.response_split)[1].strip()



class PromptTokenizingStrategy(abc.ABC):
    """
    Abstract class for tokenizing strategies
    """

    def __init__(
        self,
        prompter,
        tokenizer,
        train_on_inputs: bool = False,
        sequence_len: int = 2048,
    ):
        self.prompter = prompter
        self.tokenizer: PreTrainedTokenizer = tokenizer
        self.train_on_inputs = train_on_inputs
        self.sequence_len = sequence_len

    @abc.abstractmethod
    def tokenize_prompt(self, prompt):
        pass

    @functools.lru_cache(maxsize=128)
    def _get_user_token(self):
        id_or_ids = self.tokenizer.convert_tokens_to_ids("<|USER|>")
        if isinstance(id_or_ids, (int,)):
            return id_or_ids
        return False

    @functools.lru_cache(maxsize=128)
    def _get_assistant_token(self):
        id_or_ids = self.tokenizer.convert_tokens_to_ids("<|ASSISTANT|>")
        if isinstance(id_or_ids, (int,)):
            return id_or_ids
        return False

    def _tokenize(self, prompt: str, add_eos_token=True, strip_bos_token=False):
        result = self.tokenizer(
            prompt,
            truncation=True,
            max_length=self.sequence_len,
            padding=False,
            return_tensors=None,
        )
        if (
            result["input_ids"][-1] != self.tokenizer.eos_token_id
            and len(result["input_ids"]) < self.sequence_len
            and add_eos_token
        ):
            result["input_ids"].append(self.tokenizer.eos_token_id)
            result["attention_mask"].append(1)

        if result["input_ids"][0] == self.tokenizer.bos_token_id and strip_bos_token:
            result["input_ids"] = result["input_ids"][1:]
            result["attention_mask"] = result["attention_mask"][1:]

        result["labels"] = result["input_ids"].copy()
        return result


class InstructionPromptTokenizingStrategy(PromptTokenizingStrategy):
    """
    Tokenizing strategy for instruction-based prompts.
    """

    def parse_instruction_fields(self, prompt) -> Tuple[str, str, str]:
        raise NotImplementedError

    def tokenize_prompt(self, prompt):
        (
            instruction,
            input,  # pylint: disable=redefined-builtin
            response,
        ) = self.parse_instruction_fields(prompt)
        
        full_prompt = self._build_full_prompt(instruction, input, response)
        tokenized_full_prompt = self._tokenize(full_prompt)
        if not self.train_on_inputs:
            user_prompt = next(
                iter(
                    self.prompter.build_prompt(
                        instruction,
                        input,
                    )
                )
            )
            tokenized_user_prompt = self._tokenize(user_prompt, add_eos_token=False)
            user_prompt_len = len(tokenized_user_prompt["input_ids"])
            # TODO this could be sped up using numpy array slicing
            tokenized_full_prompt["labels"] = [
                -100
            ] * user_prompt_len + tokenized_full_prompt["labels"][user_prompt_len:]

        return tokenized_full_prompt

    def _build_full_prompt(
        self, instruction, input, response  # pylint: disable=redefined-builtin
    ):
        return next(
            iter(
                self.prompter.build_prompt(
                    instruction,
                    input,
                    response,
                )
            )
        )

class TokenizedPromptDataset(IterableDataset):
    """
    Iterable dataset that returns tokenized prompts from a stream of text files.
        Args:
            prompt_tokenizer (PromptTokenizingStrategy): The prompt tokenizing method for processing the data.
            dataset (dataset.Dataset): Dataset with text files.
    """

    def __init__(  # pylint: disable=super-init-not-called
        self,
        prompt_tokenizer: PromptTokenizingStrategy,
        dataset: IterableDataset,
    ):
        self.prompt_tokenizer = prompt_tokenizer
        self.dataset = dataset

    def __iter__(self):
        iterator = iter(self.dataset)
        count = 0
        # Loop through the entire dataset
        for example in iterator:
            try:
                yield self.prompt_tokenizer.tokenize_prompt(example)
                count += 1
            except InvalidDataException:
                pass
        if count == 0:
            raise RuntimeError("Expected at least one datapoint in dataset.")

class AlpacaPromptTokenizingStrategy(InstructionPromptTokenizingStrategy):
    """
    Tokenizing strategy for Alpaca prompts.
    """

    def parse_instruction_fields(self, prompt) -> Tuple[str, str, str]:
        return (
            prompt["instruction"],
            prompt["input"] if "input" in prompt else "",
            prompt["output"],
        )

In [None]:
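# Build Alpaca "instruct" prompts, truncate to 256 tokens, and mask the prompt in the labels
ds_strategy = AlpacaPromptTokenizingStrategy(
    AlpacaPrompter('instruct'), tokenizer, train_on_inputs=False, sequence_len=256
)
ds_wrapper = TokenizedPromptDataset(ds_strategy, ds['train'])
# Materialize the tokenized examples into a list
samples = list(ds_wrapper)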


**So what is happening here?**  
  
**Step 1**: Create the appropriate prompt template for the dataset. Below is the prompt for the first example of the dataset. 

In [None]:
print(next(AlpacaPrompter('instruct').build_prompt(ds['train'][0]['instruction'],ds['train'][0]['input'],ds['train'][0]['output'])))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.


**Step 2**: Tokenize the prompts. The inputs to the model consist of both the instruction and the answer. The labels against which the model is fine-tuned contain some -100 values: every label position that belongs to the prompt is set to -100, which the loss function ignores, so only the response tokens contribute to the loss. In text form, the part of the sample whose labels are set to -100 is shown below:<br>

Below is an instruction that describes a task.  

Write a response that appropriately completes the request.  
***USER:*** *Give three tips for staying healthy.*  

This is done so that the entire sample can be processed in parallel in one pass.
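
The masking described in Step 2 can be illustrated with a minimal sketch. The token IDs are made up for illustration and `mask_prompt_labels` is a hypothetical helper, not part of the notebook's code; -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels and hide the first prompt_len positions from the loss."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * min(prompt_len, len(labels))
    return labels

# Made-up token IDs: the first four belong to the prompt, the last three to the response
input_ids = [1, 529, 3111, 29901, 450, 1234, 2]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 450, 1234, 2]
```

Because the inputs stay intact while only the labels are masked, the model still sees the full prompt as context but is only penalized for the response tokens.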


**Step 3**: Split into train and test datasets

In [None]:
dataset = Dataset.from_list(samples).shuffle(seed=42)
dataset = dataset.train_test_split(test_size=0.02, shuffle=False)
train_dataset = dataset['train']
eval_dataset = dataset['test']

***

## Model Implementation and Methods

### Loading the base Open Llama Model

In [None]:
from transformers import LlamaTokenizer, LlamaForCausalLM,LlamaConfig

from peft import LoraConfig, TaskType, get_peft_model
import torch
model_name_or_path = "openlm-research/open_llama_3b"
config = LlamaConfig.from_pretrained(model_name_or_path)
model = LlamaForCausalLM.from_pretrained(model_name_or_path,config = config,load_in_8bit = True,torch_dtype = torch.float16,) 
tokenizer = LlamaTokenizer.from_pretrained(
        model_name_or_path,trust_remote_code = True
        )
tokenizer.pad_token ="[PAD]"
special_tokens = {'bos_token': "<s>",'eos_token': "</s>",'unk_token': "<unk>"}
for k, val in special_tokens.items():
    tokenizer.add_special_tokens({k: val})
inputs = tokenizer("Tell me about Alpacas.", return_tensors="pt",truncation=False).to('cuda')
generation_output = model.generate(**inputs,max_new_tokens=32)

**Note: The below cell shows a way to call the model**

In [2]:
print(tokenizer.decode(generation_output[0]))

<s>Tell me about Alpacas.
Alpacas are a type of camelid, a group of animals that includes llamas, guanacos, vicunas,


Hurray! We just loaded an Open Llama model with 3 billion parameters and generated text with it. This is a huge step. To put this into perspective, inference requires us to initialize the model and load **> 3 GB** of parameter weights alone.
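
As a rough back-of-the-envelope check of that figure, the sketch below estimates the memory needed for the weights alone at different precisions. It uses the parameter counts that PEFT reports later in this notebook and assumes every parameter is stored at the same width, which is a simplification: 8-bit loading typically keeps some layers in higher precision, and activations and caches are not counted.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Base parameter count: total reported by PEFT minus the trainable LoRA parameters
num_params = 3_439_186_560 - 12_712_960

for name, width in [("int8", 1), ("fp16", 2), ("fp32", 4)]:
    print(f"{name}: {weight_memory_gb(num_params, width):.2f} GB")
```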

### Ruminations on model architectures: Open Llama vs ChatGPT

ChatGPT is a SOTA LLM based on GPT-3.5, which follows the transformer architecture. It uses a decoder-only architecture whose primary aim is to predict the next token. More details can be found [here](https://towardsdatascience.com/gpt-3-explained-19e5f2bd3288)  
  

**To Note:** The decoder architecture, which is the backbone of GPT-3.5, has the primary objective of predicting the next word.  
  
![Alt Text](./GPTarchitecture.png)

**Image Source**: The image above has been taken from the paper present [here](https://arxiv.org/pdf/2005.14165.pdf)

A traditional GPT model consists of stacked decoders. The $n_{\texttt{layers}}$ = 96 listed above (for GPT-3 175B) is the number of stacked decoder layers. According to the GPT-3 paper, these layers alternate between dense and locally banded sparse attention patterns, but more on that later. We can see a decoder in the image below.

The most important parameters are:
- $d_{\texttt{model}}$: The embedding dimension of the input tokens
- $n_{\texttt{layers}}$: The number of decoder layers present.

There are 96 of these decoders in the GPT-3 175B architecture.

<!-- ![Alt Text](decoder.png) -->
<img src="./decoder.png" alt="Decoder Architecture" height="300" width="300">

The model we have chosen is Open Llama 3B, an Apache 2.0 licensed reproduction of Llama. It replicates the architecture of Llama, a model released by Meta AI. The Llama paper is available [here](https://arxiv.org/abs/2302.13971).

**We can have a detailed look at the architecture of Open Llama below.**

In [3]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 3200, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear8bitLt(in_features=3200, out_features=3200, bias=False)
          (k_proj): Linear8bitLt(in_features=3200, out_features=3200, bias=False)
          (v_proj): Linear8bitLt(in_features=3200, out_features=3200, bias=False)
          (o_proj): Linear8bitLt(in_features=3200, out_features=3200, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear8bitLt(in_features=3200, out_features=8640, bias=False)
          (down_proj): Linear8bitLt(in_features=8640, out_features=3200, bias=False)
          (up_proj): Linear8bitLt(in_features=3200, out_features=8640, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm

In the output above, the most important parameters are:

- $d_{\texttt{model}}$: 3200
- $n_{\texttt{layers}}$: 26  
  
<br>The main differences between the traditional decoder and the Llama architecture are listed below (https://paperswithcode.com/method/llama):
1. RMSNorm normalizing function is used to improve the training stability, by normalizing the input of each transformer sub-layer, instead of normalizing the output.
2. The ReLU non-linearity is replaced by the SwiGLU activation function to improve performance.
3. Absolute positional embeddings are removed and instead rotary positional embeddings (RoPE) are added at each layer of the network.
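
The first two changes can be sketched in a few lines of plain Python. This is an illustrative sketch with toy sizes, not the Llama implementation: the helper names are hypothetical, the `eps` value is an assumption, and the feed-forward function mirrors the `gate_proj`, `up_proj`, `down_proj`, and SiLU modules visible in the model printout above.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x)), with no bias terms."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

# After RMSNorm with unit gain, the mean of the squared entries is close to 1
normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)

# Toy SwiGLU call using identity matrices for all three projections
identity = [[1.0, 0.0], [0.0, 1.0]]
print(swiglu_mlp([0.0, 1.0], identity, identity, identity))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, and the gated feed-forward block uses three projection matrices instead of two.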


The reference implementation released with the Llama paper (https://github.com/facebookresearch/llama/blob/main/llama/model.py) shows where these changes are made in the code.

![Alt Text](llama.png)

The feedforward layer has been changed as below:-

![Swish ReLU](./swishrelu.png)

Both rotary embeddings and SwiGLU are out of scope for this notebook. For more information, refer to (https://blog.eleuther.ai/rotary-embeddings/#how-is-this-different-from-the-sinusoidal-embeddings-used-in-attention-is-all-you-need) for rotary embeddings and (https://arxiv.org/pdf/2002.05202.pdf) for SwiGLU. The main point we are trying to make is that there have been only minor changes to the original transformer architecture.

### Model Inference and Finetuning

Model inference proceeds as in the example above: the output is produced sequentially, one token at a time, and each generated token is appended to the input for the next step:

- **Input**: Tell me about Alpacas.   
**Output**: [BOS] 

- **Input**: Tell me about Alpacas. [BOS]   
**Output**: Alpacas  

- **Input**: Tell me about Alpacas. [BOS] Alpacas   
**Output**: are  

- **Input**: Tell me about Alpacas. [BOS] Alpacas are   
**Output**: a  

- **Input**: Tell me about Alpacas. [BOS] Alpacas are a   
**Output**: type  

- **Input**: Tell me about Alpacas. [BOS] Alpacas are a type   
**Output**: of  

- **Input**: Tell me about Alpacas. [BOS] Alpacas are a type of    
**Output**: camel  

- **Input**: Tell me about Alpacas. [BOS] Alpacas are a type of camel   
**Output**: [EOS]  

- **Final Input**: Tell me about Alpacas. [BOS] Alpacas are a type of camel   
**Output**: [EOS]  
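
The sequential procedure above can be written as a short loop. This is a toy sketch: `toy_next_token` is a hypothetical lookup table standing in for the model, and words stand in for tokens. A real model would return a probability distribution over the vocabulary at each step, from which the next token is chosen.

```python
# Hypothetical lookup table standing in for the model's next-token prediction
NEXT_TOKEN = {
    "[BOS]": "Alpacas", "Alpacas": "are", "are": "a", "a": "type",
    "type": "of", "of": "camel", "camel": "[EOS]",
}

def toy_next_token(tokens):
    """Stand-in for a model: predicts the next token from the last token only."""
    return NEXT_TOKEN.get(tokens[-1], "[EOS]")

def generate(prompt_tokens, max_new_tokens=32):
    """Autoregressive decoding: append each predicted token and feed the sequence back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        if next_token == "[EOS]":
            break  # stop as soon as the end-of-sequence token is predicted
        tokens.append(next_token)
    return tokens[len(prompt_tokens):]

print(" ".join(generate(["Tell", "me", "about", "Alpacas.", "[BOS]"])))
```

The `max_new_tokens` limit plays the same role as in the `model.generate` call earlier: generation stops either at `[EOS]` or when the budget is exhausted.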


Model fine-tuning is a supervised learning process, in contrast to pretraining, which is self-supervised. 
The loss function is the cross-entropy loss. An example is shown below:
    


**Input**: Tell me about Alpacas.<br>
**Output**: Alpacas are a type of camel found in South America.<br>

$$
\begin{aligned}
CE_{loss} = -\bigg(&\log P(\text{Alpacas}) + \log P(\text{are} \mid \text{Alpacas}) + \log P(\text{a} \mid \text{Alpacas are}) \\
&+ \log P(\text{type} \mid \text{Alpacas are a}) + \log P(\text{camel} \mid \text{Alpacas are a type}) + \dots\bigg)
\end{aligned}
$$
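
As a small numeric illustration of this formula, the sketch below sums the negative log-probabilities of the target tokens. The probabilities are made-up values rather than model outputs, and `cross_entropy_loss` is a hypothetical helper. Note that training frameworks commonly report the mean over tokens rather than the sum.

```python
import math

def cross_entropy_loss(token_probs):
    """Negative sum of the log-probabilities assigned to the target tokens."""
    return -sum(math.log(p) for p in token_probs)

# Made-up values for P(Alpacas), P(are | Alpacas), P(a | Alpacas are), ...
token_probs = [0.5, 0.8, 0.9, 0.6, 0.7]
loss = cross_entropy_loss(token_probs)
print(f"summed loss: {loss:.4f}, mean per token: {loss / len(token_probs):.4f}")
```

The loss is zero only when the model assigns probability 1 to every target token, and it grows as any of the probabilities shrinks.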

### Resource Requirement

LLMs take an enormous amount of resources to train.  
For example, according to the Llama paper, training the 65B parameter model took 21 days on 2048 A100 GPUs. Hence the total cost of training a 65B model would be roughly:  
2048 (GPUs) × 24 (hours per day) × 21 (days of training) × 8.8 (USD per GPU-hour) ≈ 9.08 million USD
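
The estimate can be reproduced by multiplying the four factors quoted above. This is a plain arithmetic check; the hourly rate is the one assumed in the text, not a quoted cloud price.

```python
num_gpus = 2048
hours_per_day = 24
training_days = 21
usd_per_gpu_hour = 8.8  # hourly rate assumed in the text above

gpu_hours = num_gpus * hours_per_day * training_days
total_cost_usd = gpu_hours * usd_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours, about {total_cost_usd / 1e6:.2f} million USD")
```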

A100 GPUs are state-of-the-art data center GPUs, not consumer GPUs. While fine-tuning should cost only a fraction of this, it still requires significant compute time and power. For reference, fine-tuning Llama 7B for Stanford Alpaca (https://crfm.stanford.edu/2023/03/13/alpaca.html) cost ~$600. 

### LoRA for finetuning

Since we currently don't have access to an A100 GPU, even ordinary fine-tuning of a model with the configuration above would take a long time. Hence, we use LoRA (Low-Rank Adaptation), which injects a small number of trainable parameters into each layer of the model while the base weights stay frozen.  

**LoRA** is based on this [paper](https://arxiv.org/pdf/2106.09685.pdf), which hypothesizes that the weight updates (delta weights) learned during fine-tuning have a low "intrinsic rank". Hence, the model can still learn efficiently even if the updates are projected to a smaller subspace. When learning the delta weights, we therefore use the following equation:   
   
$$h = W_0x + \Delta Wx = W_0x + BAx$$


The main benefits of this approach are simplicity and no additional inference latency. $B$ and $A$ are known as the LoRA weight matrices. The main parameter of the LoRA adapter is the rank ($r$), and the idea goes as below: 
$$\Delta W \in \mathbb{R}^{M \times N}$$
$$\Delta W = B_{M \times r} \, A_{r \times N}$$
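
A minimal sketch of the forward pass $h = W_0x + BAx$ in plain Python, using toy sizes and hypothetical helper names. It leaves out the $\alpha / r$ scaling and dropout that the PEFT implementation applies, so it only illustrates the shapes and the effect of starting $B$ at zero.

```python
import random

def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def lora_forward(w0, lora_a, lora_b, x):
    """h = W0 x + B (A x): frozen base output plus the low-rank update."""
    base = matvec(w0, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + u for b, u in zip(base, update)]

random.seed(0)
M, N, r = 4, 6, 2  # toy sizes; the notebook's q_proj has M = N = 3200 and r = 8
w0 = [[random.gauss(0, 1) for _ in range(N)] for _ in range(M)]         # frozen, M x N
lora_a = [[random.gauss(0, 0.01) for _ in range(N)] for _ in range(r)]  # trainable, r x N
lora_b = [[0.0] * r for _ in range(M)]                                  # trainable, M x r, zeros
x = [1.0] * N

# With B initialised to zero, the adapted layer reproduces the base layer exactly
print(lora_forward(w0, lora_a, lora_b, x) == matvec(w0, x))

# Parameters to train: full update (M * N) versus the low-rank factors (r * (M + N))
print(M * N, r * (M + N))
```

The saving is modest at these toy sizes but large at real ones: for a 3200 × 3200 projection with $r = 8$, the factors hold 51,200 parameters instead of 10,240,000.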

### Preparing model for finetuning

We use the PEFT package from Hugging Face to assist us with the LoRA fine-tuning.

In [4]:
from peft import LoraConfig, PeftModel, get_peft_model
from peft import prepare_model_for_kbit_training

In [5]:
lora_target_modules = ['gate_proj','down_proj','up_proj','q_proj','v_proj','k_proj','o_proj']
lora_alpha = 16
lora_r = 8
lora_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=lora_target_modules,
        task_type="CAUSAL_LM",
    )

Here, $LoRA_r$ refers to the rank of the low-rank matrices. $LoRA_\alpha$ refers to the alpha parameter, which scales the adapter update: the product $BA$ is multiplied by $\alpha / r$ before being added to the output of the frozen layer. The LoRA target modules define the layers to which LoRA adapters are added.

In [6]:
model = prepare_model_for_kbit_training(
            model, use_gradient_checkpointing=True
        )
model = get_peft_model(model, lora_config)

In [7]:
model.print_trainable_parameters()

trainable params: 12,712,960 || all params: 3,439,186,560 || trainable%: 0.36965020007521776
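
The trainable parameter count can be cross-checked by hand from the layer shapes in the model printout and the LoRA configuration above. The sketch assumes one adapter (an `A` and a `B` matrix without bias) per target module per decoder layer; `lora_params` is a hypothetical helper.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)

r, d_model, d_ff, n_layers = 8, 3200, 8640, 26

# q_proj, k_proj, v_proj, o_proj all map d_model -> d_model
attention = 4 * lora_params(d_model, d_model, r)
# gate_proj and up_proj map d_model -> d_ff, down_proj maps d_ff -> d_model
mlp = 2 * lora_params(d_model, d_ff, r) + lora_params(d_ff, d_model, r)

total = n_layers * (attention + mlp)
print(f"{total:,}")  # 12,712,960
```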


In [8]:
model.base_model.model.model.layers[0].self_attn.q_proj

Linear8bitLt(
  in_features=3200, out_features=3200, bias=False
  (lora_dropout): ModuleDict(
    (default): Identity()
  )
  (lora_A): ModuleDict(
    (default): Linear(in_features=3200, out_features=8, bias=False)
  )
  (lora_B): ModuleDict(
    (default): Linear(in_features=8, out_features=3200, bias=False)
  )
  (lora_embedding_A): ParameterDict()
  (lora_embedding_B): ParameterDict()
)

In [9]:
print("LORA A weights")
print(model.base_model.model.model.layers[0].self_attn.q_proj.lora_A.default.state_dict())
print("LORA B weights")
print(model.base_model.model.model.layers[0].self_attn.q_proj.lora_B.default.state_dict())
model.print_trainable_parameters()

LORA A weights
OrderedDict([('weight', tensor([[ 0.0119, -0.0117, -0.0060,  ...,  0.0118, -0.0083,  0.0060],
        [-0.0040, -0.0096,  0.0049,  ..., -0.0099, -0.0024, -0.0158],
        [-0.0110, -0.0039,  0.0003,  ...,  0.0128,  0.0003, -0.0120],
        ...,
        [-0.0010,  0.0118,  0.0046,  ..., -0.0135,  0.0029,  0.0147],
        [-0.0059,  0.0155, -0.0033,  ..., -0.0075,  0.0021,  0.0076],
        [ 0.0140,  0.0060, -0.0108,  ..., -0.0110,  0.0146,  0.0043]],
       device='cuda:0'))])
LORA B weights
OrderedDict([('weight', tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0'))])
trainable params: 12,712,960 || all params: 3,439,186,560 || trainable%: 0.36965020007521776


The LoRA B parameters are initialized to 0 so that $\Delta W = BA = 0$ and the randomly initialized A weights do not affect the model's performance at the start of training.

## Experiments and Results

### Prepare the trainer for training


Here, we use the Hugging Face Trainer, which allows for config-driven management of the training run.

Now we set some learning parameters for our trainer.

In [16]:
import math

# Learning rate
learning_rate = 0.0002
# Weight decay applied by the optimizer
weight_decay = 0.0
# Effective batch size per optimizer step, and per-device batch size per forward pass
batch_size = 16
micro_batch_size = 4
# Gradients are accumulated over this many micro-batches before each optimizer step
gradient_accumulation_steps = batch_size // micro_batch_size
eval_steps = 50
save_steps = 1000
# Number of epochs
num_epochs = 3
# Optimizer (we use 8-bit AdamW as the model has been loaded in 8 bit)
optimizer = "adamw_bnb_8bit"
total_num_steps = int(
    math.ceil(len(train_dataset) * num_epochs / batch_size)
)
warmup_steps = 10
logging_steps = max(min(int(0.005 * total_num_steps), 10), 1)
# Gradient checkpointing trades extra compute for lower activation memory
gradient_checkpointing = True
training_arguments_kwargs = {
    # Train with FP16 mixed precision because the model has been loaded in 8 bit
    "fp16": True,
    "tf32": False,
    "warmup_steps": warmup_steps,
    "logging_steps": logging_steps,
    "gradient_checkpointing": gradient_checkpointing,
}

We use an 8-bit Adam optimizer, whose optimizer states are quantized to 8 bits. For more on quantization, please refer to this [blog](https://huggingface.co/blog/hf-bitsandbytes-integration).
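
The batch settings in the cell above determine how many micro-batches are accumulated per optimizer step and how many optimizer steps the run takes. The sketch below repeats that arithmetic for a single device; `training_schedule` is a hypothetical helper, and the dataset size of 50,000 is a made-up figure, not the size of the notebook's training split.

```python
import math

def training_schedule(num_examples, num_epochs, batch_size, micro_batch_size):
    """Derive accumulation steps, optimizer steps, and logging interval from the batch settings."""
    accumulation_steps = batch_size // micro_batch_size
    total_steps = int(math.ceil(num_examples * num_epochs / batch_size))
    logging_steps = max(min(int(0.005 * total_steps), 10), 1)
    return accumulation_steps, total_steps, logging_steps

# 50,000 is a hypothetical dataset size used only for illustration
print(training_schedule(num_examples=50_000, num_epochs=3, batch_size=16, micro_batch_size=4))
```

Accumulating 4 micro-batches of 4 examples gives the same effective batch size of 16 while only ever holding 4 examples' activations in GPU memory at once.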

In [17]:
import transformers
training_args = transformers.TrainingArguments(
        per_device_train_batch_size=micro_batch_size,
        per_device_eval_batch_size=micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        eval_accumulation_steps=gradient_accumulation_steps,
        num_train_epochs=num_epochs,
        learning_rate=learning_rate,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=eval_steps,
        save_steps=save_steps,
        output_dir='./lora-out',
        save_total_limit=3,
        group_by_length=False,
        report_to=None,
        run_name=None,
        optim=optimizer,
        lr_scheduler_type="cosine",
        weight_decay=weight_decay,
        **training_arguments_kwargs,
    )
trainer_kwargs = {}

In [18]:
from transformers.trainer_pt_utils import get_parameter_names
from torch import nn
import bitsandbytes as bnb
decay_parameters = get_parameter_names(model, [nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if "bias" not in name]
optimizer_grouped_parameters = [
    {
        "params": [
            p
            for n, p in model.named_parameters()
            if (n in decay_parameters and p.requires_grad)
        ],
        "weight_decay": training_args.weight_decay,
    },
    {
        "params": [
            p
            for n, p in model.named_parameters()
            if (n not in decay_parameters and p.requires_grad)
        ],
        "weight_decay": 0.0,
    },
]

optimizer = bnb.optim.Adam8bit(
    optimizer_grouped_parameters,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    lr=training_args.learning_rate,
)
lr_scheduler = transformers.get_cosine_schedule_with_warmup(
                optimizer,
                training_args.warmup_steps,
                total_num_steps,
            )
trainer_kwargs["optimizers"] = (optimizer, lr_scheduler)
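
To make the shape of the schedule concrete, the sketch below computes the learning-rate multiplier of a cosine schedule with linear warmup, which is our understanding of what `get_cosine_schedule_with_warmup` applies with its default settings. `cosine_with_warmup` is a hypothetical re-implementation for illustration, and the step count of 1000 is made up.

```python
import math

def cosine_with_warmup(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear warmup from 0 to 1, then cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

base_lr = 0.0002
total_steps = 1000  # hypothetical; the notebook derives this from the dataset size
for step in (0, 5, 10, 505, 1000):
    print(step, base_lr * cosine_with_warmup(step, warmup_steps=10, total_steps=total_steps))
```

The learning rate climbs to its peak over the first 10 steps, is at half its peak midway through the decay, and reaches zero at the final step.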

In [20]:
data_collator_kwargs = {
        "padding": True,
    }
data_collator_kwargs["pad_to_multiple_of"] = 8
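
The effect of these collator settings can be sketched in plain Python. `pad_batch` is a hypothetical stand-in for the padding step, not the library's code: it assumes right-padding, a pad token ID of 0, and -100 as the label padding value, whereas the real collator follows the tokenizer's own padding side and pad token.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100, pad_to_multiple_of=8):
    """Right-pad every example to the longest length, rounded up to a multiple."""
    longest = max(len(example["input_ids"]) for example in batch)
    # Ceiling division, then scale back up to get the next multiple
    target = -(-longest // pad_to_multiple_of) * pad_to_multiple_of
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        gap = target - len(example["input_ids"])
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * gap)
        padded["attention_mask"].append(example["attention_mask"] + [0] * gap)
        padded["labels"].append(example["labels"] + [label_pad_id] * gap)
    return padded

batch = [
    {"input_ids": [1, 5, 6], "attention_mask": [1, 1, 1], "labels": [-100, 5, 6]},
    {"input_ids": [1, 7, 8, 9, 2], "attention_mask": [1] * 5, "labels": [-100, -100, 8, 9, 2]},
]
padded = pad_batch(batch)
print([len(ids) for ids in padded["input_ids"]])  # [8, 8]
```

Padding the labels with -100 keeps the padded positions out of the loss, and padding to a multiple of 8 tends to suit GPU tensor cores when training in FP16.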

In [None]:
#model.config.use_cache = False
import torch
import transformers
model = torch.compile(model)


In [21]:
trainer = transformers.Trainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        args=training_args,
        data_collator=transformers.DataCollatorForSeq2Seq(
            tokenizer,
            return_tensors="pt",
            **data_collator_kwargs,
        ),
        **trainer_kwargs,
    )

In [22]:
trainer.train()

Step,Training Loss,Validation Loss


Once model training is complete (it takes 11 hours), the model adapter weights can be found at (https://huggingface.co/maneel/OpenLLama3B_8bitQuantized_alpaca_finetuned).

### Model Inference

Once we have the model trained, we need to expose it as an endpoint so that we can easily perform inference with it.  
The framework we have used is Ray Serve, and our code is heavily inspired by this [notebook](https://docs.ray.io/en/latest/ray-air/examples/gptj_serving.html).

In [23]:
import pandas as pd
from ray import serve
from starlette.requests import Request


@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str, revision: str = None):
        # NOTE: `revision` is accepted to match the bind() call below,
        # but it is not used when loading the model.
        from peft import PeftModel
        from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
        import torch

        # Load the base model with 8-bit quantized weights.
        model = LlamaForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            load_in_8bit=True,
            device_map="auto",
        )
        # Load the adapter delta weights on top of the base model.
        self.model = PeftModel.from_pretrained(
            model,
            "maneel/OpenLLama3B_8bitQuantized_alpaca_finetuned",
            adapter_name="maneel_openllama",
        )

        self.tokenizer = LlamaTokenizer.from_pretrained(model_id)

        # With the default do_sample=False, generation uses beam search
        # (num_beams=4); temperature, top_p, and top_k only take effect
        # if do_sample=True is set.
        self.generation_config = GenerationConfig(
            temperature=0.8,
            top_p=0.75,
            top_k=40,
            num_beams=4,
            no_repeat_ngram_size=3,
            max_new_tokens=256,
        )
        
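    def generate(self, prompts: list) -> pd.DataFrame:
        # Tokenize the prompts and move them to the model's device.
        # No padding is configured, so all prompts in one request must
        # tokenize to the same length (a single prompt always works).
        input_ids = self.tokenizer(prompts, return_tensors="pt").input_ids.to(
            self.model.device
        )

        gen_tokens = self.model.generate(
            input_ids=input_ids,
            generation_config=self.generation_config,
        )
        # Decode the generated token ids back to text, one row per prompt.
        return pd.DataFrame(
            self.tokenizer.batch_decode(gen_tokens), columns=["responses"]
        )

    async def __call__(self, http_request: Request) -> pd.DataFrame:
        # The request body is a JSON list of {"text": ...} objects,
        # where "text" is either one prompt or a list of prompts.
        json_request: list = await http_request.json()
        prompts = []
        for prompt in json_request:
            text = prompt["text"]
            if isinstance(text, list):
                prompts.extend(text)
            else:
                prompts.append(text)
        return self.generate(prompts)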


As the output of the cell below shows, a Ray server has been launched.

In [25]:
model_id = "openlm-research/open_llama_3b"  #base openllama model
revision = "float16"
deployment = PredictDeployment.bind(model_id=model_id, revision=revision)
serve.run(deployment)

2023-06-28 19:20:21,102 INFO worker.py:1627 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(ServeController pid=3936403) INFO 2023-06-28 19:20:24,806 controller 3936403 deployment_state.py:1298 - Deploying new version of deployment default_PredictDeployment.
(HTTPProxyActor pid=3936481) INFO:     Started server process [3936481]
(ServeController pid=3936403) INFO 2023-06-28 19:20:24,876 controller 3936403 deployment_state.py:1537 - Adding 1 replica to deployment default_PredictDeployment.


(ServeReplica:default_PredictDeployment pid=3936595) [2023-06-28 19:20:28,006] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(ServeReplica:default_PredictDeployment pid=3936595) Welcome to bitsandbytes. For bug reports, please run
(ServeReplica:default_PredictDeployment pid=3936595) python -m bitsandbytes
(ServeReplica:default_PredictDeployment pid=3936595)  and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
(ServeReplica:default_PredictDeployment pid=3936595) bin /home/mkarri/anaconda3/envs/axototl/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
(ServeReplica:default_PredictDeployment p

(ServeReplica:default_PredictDeployment pid=3936595)   warn(msg)
(ServeReplica:default_PredictDeployment pid=3936595) The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Downloading adapter_model.bin: 100%|██████████| 51.0M/51.0M [00:01<00:00, 35.1MB/s]


RayServeSyncHandle(deployment='default_PredictDeployment')

**Note:** The output of the above cell shows that Ray Serve is up and running and has detected GPUs for launching the inference endpoints.

#### Now, we test our inference endpoint by sending it a request.

In [27]:
import requests
prompt = (
    """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Explain in simple terms how the attention mechanism of a transformer model works
### Response:"""
)

sample_input = {"text": prompt}

output = requests.post("http://localhost:8000/", json=[sample_input]).json()
print(output)

[{'responses': '<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\nExplain in simple terms how the attention mechanism of a transformer model works\n### Response:\nA transformer network is a type of artificial neural network that is made up of multiple layers of interconnected nodes. At the topmost layer, called the input layer, the nodes are responsible for inputting the data into the network. The nodes in this layer will take in the input, process it, and pass it on to the next layer.\nThe next layer, also known as the output layer, takes in the processed data from the previous layer, processes it further, and passes it on. This process continues until the end of the network, where the nodes at the final layer will have processed the input data and passed it on as output.\nOnce the data has passed through all the layers, the final output will be the processed and pre-processed data. The attention mechanism 

[2m[36m(ServeReplica:default_PredictDeployment pid=3936595)[0m INFO 2023-06-28 19:29:31,585 default_PredictDeployment default_PredictDeployment#ajdfkL szbOguWTta / default replica.py:654 - __CALL__ OK 64223.7ms


***Voila! It works!***

As we can see, the model is at an inference end-point that we can easily query.
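The response above echoes the full prompt (and a leading `<s>` token) before the model's answer. A minimal sketch of two helpers that build the Alpaca-style prompt and strip the echo is shown here. The names `build_prompt` and `extract_response` are hypothetical and not part of the original notebook; the template text is copied from the request cell.

```python
# Alpaca-style template copied from the request cell above.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{instruction}\n### Response:"
)
RESPONSE_MARKER = "### Response:"


def build_prompt(instruction: str) -> str:
    """Fill the template with a single instruction (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def extract_response(generated: str) -> str:
    """Return only the text after the last response marker.

    rpartition returns the whole string as its last element when the
    marker is absent, so unmarked text passes through unchanged.
    """
    answer = generated.rpartition(RESPONSE_MARKER)[2]
    # Drop the end-of-sequence token if the tokenizer decoded one.
    return answer.replace("</s>", "").strip()
```

With these helpers, a client could send `{"text": build_prompt("...")}` to the endpoint and pass each returned `responses` string through `extract_response` before displaying it.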

We used Ray Serve for inference and serving our model because:
- It automatically does load balancing for us
- It is able to effectively utilize multiple GPUs for inference
- It automatically creates RESTful API endpoints
- It is easy and intuitive to use

### Results

As the plot below shows, the loss continues to decrease as a function of the training step. Together with the sample request above, this suggests that the model performs decently on our loss metric, the cross-entropy loss mentioned earlier. The loss was computed on our final evaluation dataset.

![Loss Results](./loss_results.png)
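To make the loss metric concrete, here is a small sketch of how cross-entropy is computed for next-token prediction: it is the mean negative log of the probability the model assigned to each correct next token. The probabilities below are made up for illustration and are not taken from our training run.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-likelihood of the correct next tokens.

    token_probs[i] is the probability the model assigned to the token
    that actually came next at position i.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(loss)


# Hypothetical probabilities, for illustration only.
confident = [0.9, 0.8, 0.95]
uncertain = [0.2, 0.1, 0.3]
print(f"confident model loss: {cross_entropy(confident):.3f}")
print(f"uncertain model loss: {cross_entropy(uncertain):.3f}")
```

A model that puts more probability on the correct tokens gets a lower loss, which is why a falling loss curve indicates that the finetuned model is fitting the data better.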

## Conclusion and Future Work

### Conclusion

**Achievements**

We were able to successfully do the following:

1. Fine-tune an LLM (Open Llama-3B)
2. Utilize multiple GPUs via distributed training with a data parallelism approach
3. Apply parameter-efficient finetuning using LoRA
4. Deploy the model using Ray Serve, which created an inference endpoint
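To illustrate why LoRA is parameter efficient, the sketch below counts trainable parameters for a single weight matrix. LoRA freezes the original matrix and learns two low-rank factors instead. The matrix size (3200 x 3200) and the rank (8) are assumed values chosen for illustration, not the configuration used in our training run.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Compare full finetuning with LoRA for one weight matrix.

    LoRA freezes W (d_out x d_in) and learns B (d_out x rank) and
    A (rank x d_in), so the effective weight is W + B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Assumed, illustrative values: a 3200 x 3200 projection and rank 8.
full, lora = lora_param_counts(3200, 3200, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the adapter trains 51,200 parameters for a matrix that holds 10,240,000, which is 0.5% of the original count.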

**Implications**

We were able to finetune an LLM on limited compute resources. This bodes well for a variety of use cases. Training a large language model from scratch is not feasible for most organizations and individuals. However, using an off-the-shelf large language model (particularly a closed source one) presents a number of challenges:
1. It might not serve our specific use case well
2. We may not be comfortable sharing proprietary or sensitive data with major corporations
3. The alternative, training a large language model from scratch, might be extremely resource intensive and computationally burdensome

Our approach is able to resolve all of these issues:
1. We were able to use extremely limited resources to get a significant boost in performance.
2. Finetuning an open source LLM bodes well for the many use cases where proprietary information can't be shared with the providers of closed source models.
3. This also gives a big boost to open source development, as open source projects with finetuning and proprietary information can beat large closed source models or come close to their performance.
4. This democratizes the usage of large language models, as using and training them is no longer the preserve of a few large corporations.

**Limitations**

Our approach, however, is not without limitations:
1. It works only for models that fit on a single GPU, which necessarily excludes models above a certain size.
2. It necessitates parameter-efficient finetuning or training approaches, which have their own limitations in performance and stability.
3. It requires a large amount of data, which might not be readily available in the format that we need.
4. It requires low-precision (8-bit) quantization, which is not without potential issues, as it might cause the model to lose performance or inference capabilities.
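The precision loss mentioned in the last limitation can be seen in a toy example. The sketch below uses simple absmax quantization with one shared scale. This is a deliberately simplified scheme for illustration and not the bitsandbytes implementation used when loading our model, and the weights are made-up values.

```python
def quantize_absmax(values):
    """Map floats to int8 codes in [-127, 127] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights; the outlier 4.0 stretches the scale, so the
# small values are stored with a large relative error.
weights = [0.01, -0.02, 0.5, 4.0]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print("codes:", codes)
print("absolute errors:", [round(e, 4) for e in errors])
```

Here the weight 0.01 is rounded to the code 0 and is lost entirely after dequantization, which illustrates how a single outlier can degrade the precision of the remaining values.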

### Future Work

To address some of these limitations and level up our work, the next steps are rather clear:
1. Find ways to train/finetune even larger models that can't fit on one GPU. This can be done using the following:
   1. Model parallelism or sharding using libraries like DeepSpeed (by Microsoft) or FSDP (Fully Sharded Data Parallel, from PyTorch/Meta AI)
   2. Pipeline parallelism, which DeepSpeed also supports
   3. Reduced-precision data types like bfloat16 and FP16, and quantization-based techniques like QLoRA
2. Use a real-world dataset that pertains to a specific domain like finance or medicine
3. Increase context length using techniques like Landmark Attention, SuperHOT, or ALiBi
4. Try other libraries for model serving and inference endpoints
5. Try deploying an inference endpoint using multiple GPUs
6. Deploy our solution and make it publicly available via a REST API. This may require us to create a platform for it
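As a rough guide to when a model stops fitting on one GPU, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter counts are nominal round numbers chosen for illustration, and the estimate ignores gradients, optimizer states, and activations, which add substantially more during training.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Nominal model sizes in billions of parameters (illustrative).
for billions in (3, 7, 13):
    row = ", ".join(
        f"{dtype}: {weight_memory_gib(billions * 1e9, dtype):.1f} GiB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{billions}B params -> {row}")
```

Halving the bytes per parameter halves the weight memory, which is why reduced-precision data types and quantization appear alongside parallelism as ways to handle larger models.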

***
