## Introduction
- What are Large Language Models (LLMs)?<br>
Large Language Models are advanced artificial intelligence systems designed to process and generate human-like text. They are trained on vast amounts of text data and learn to understand and produce natural language in a variety of contexts.
- Is ChatGPT a large language model?<br>
Yes. ChatGPT is a large language model developed by OpenAI. It is based on the GPT-3.5 architecture and fine-tuned using human feedback.
- What is the architecture of ChatGPT?<br>
ChatGPT's underlying model, GPT-3.5, is a finetuned version of the pretrained GPT-3. GPT-3 is a decoder-only transformer whose primary training objective is predicting the next word (token). <br>
![Alt Text](GPTarchitecture.png)

In the figure above, the most important parameters are:-
- $d_{\texttt{model}}$: The embedding dimension of each input token.
- $n_{\texttt{layers}}$: The number of decoder layers present.

## Why is finetuning important?
Finetuning a Large Language Model is important because GPT-style, decoder-only transformers are trained only to predict the next word. Left as is, such a model just tries to continue the sentence it is given. The continuation might not be a coherent answer, especially when the LLM is intended to be used as a chatbot.<br>
Let us check this with an example using Llama.

In [1]:
from transformers import LlamaTokenizer, LlamaForCausalLM,LlamaConfig

from peft import LoraConfig, TaskType, get_peft_model
import torch
model_name_or_path = "openlm-research/open_llama_3b"
config = LlamaConfig.from_pretrained(model_name_or_path)
model = LlamaForCausalLM.from_pretrained(model_name_or_path,config = config,load_in_8bit = True,torch_dtype = torch.float16,) 
tokenizer = LlamaTokenizer.from_pretrained(
        model_name_or_path,trust_remote_code = True
        )
tokenizer.pad_token ="[PAD]"
special_tokens = {'bos_token': "<s>",'eos_token': "</s>",'unk_token': "<unk>"}
for k, val in special_tokens.items():
    tokenizer.add_special_tokens({k: val})
inputs = tokenizer("Tell me about Alpacas.", return_tensors="pt",truncation=False).to('cuda')
generation_output = model.generate(**inputs,max_new_tokens=32)



In [2]:
print(tokenizer.decode(generation_output[0]))

<s>Tell me about Alpacas.
Alpacas are a type of camelid, a group of animals that includes llamas, guanacos, vicunas,


As we can see, the base model has simply tried to continue the text rather than answer the request the way a chatbot would.

## Model Architecture and overview of inference

A traditional GPT model consists of stacked decoder blocks. $n_{\texttt{layers}}$ is the number of stacked decoder blocks; for the largest GPT-3 model, $n_{\texttt{layers}}$ = 96. (The layers of GPT-3 alternate between dense and locally banded sparse attention patterns, but more on that later.)
![Alt Text](Decoderarchitecture.png)

There are 96 of the above decoder blocks in the GPT-3 architecture.

![Alt Text](decoder.png)

The model we have chosen is OpenLLaMA 3B, an Apache 2.0 licensed open reproduction of LLaMA, the model released by Meta AI (https://arxiv.org/abs/2302.13971). It replicates the LLaMA architecture. Printing the model shows the following:-

In [3]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 3200, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear8bitLt(in_features=3200, out_features=3200, bias=False)
          (k_proj): Linear8bitLt(in_features=3200, out_features=3200, bias=False)
          (v_proj): Linear8bitLt(in_features=3200, out_features=3200, bias=False)
          (o_proj): Linear8bitLt(in_features=3200, out_features=3200, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear8bitLt(in_features=3200, out_features=8640, bias=False)
          (down_proj): Linear8bitLt(in_features=8640, out_features=3200, bias=False)
          (up_proj): Linear8bitLt(in_features=3200, out_features=8640, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm

In the above the most important parameters are:-
- $d_{\texttt{model}}$: 3200
- $n_{\texttt{layers}}$: 26
<br>The main differences between the traditional decoder and the one in the LLaMA paper are listed below (https://paperswithcode.com/method/llama):-
1. RMSNorm normalizing function is used to improve the training stability, by normalizing the input of each transformer sub-layer, instead of normalizing the output.
2. The ReLU non-linearity is replaced by the SwiGLU activation function to improve performance.
3. Absolute positional embeddings are removed and instead rotary positional embeddings (RoPE) are added at each layer of the network.
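
To make the first item above more concrete, here is a minimal NumPy sketch of RMSNorm. It is an illustration written for this article, not the code from the LLaMA repository: the function name `rms_norm`, the toy dimension and the `eps` value are assumptions.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square along the last axis, then apply a learnable gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

d_model = 8  # toy size for illustration; OpenLLaMA 3B uses 3200
rng = np.random.default_rng(0)
x = rng.normal(size=(2, d_model))   # two token embeddings
weight = np.ones(d_model)           # gain vector, initialised to 1

y = rms_norm(x, weight)
# After normalization, every row has a root mean square close to 1
row_rms = np.sqrt(np.mean(y ** 2, axis=-1))
print(row_rms)
```

Unlike LayerNorm, this does not subtract the mean or add a bias. In the decoder block it is applied to the input of the attention and feed-forward sub-layers (`input_layernorm` and `post_attention_layernorm` in the printout above).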


Based on the reference code for the LLaMA paper (https://github.com/facebookresearch/llama/blob/main/llama/model.py), the main changes appear in the code as follows.

![Alt Text](llama.png)

The feedforward layer has been changed as below:-

![Alt Text](swishrelu.png)

Both rotary embeddings and SwiGLU are out of scope for this article. For more information, refer to (https://blog.eleuther.ai/rotary-embeddings/#how-is-this-different-from-the-sinusoidal-embeddings-used-in-attention-is-all-you-need) for rotary embeddings and (https://arxiv.org/pdf/2002.05202.pdf) for SwiGLU. The main point we are trying to make is that there have been only minor changes to the original transformer architecture.

## Model Inference and Finetuning

Model inference proceeds as in the example above. The output is generated sequentially, one token at a time, and each generated token is appended to the input for the next step:-

**Input**: <BOS> Tell me about Alpacas. **Output**: Alpacas<br>
**Input**: <BOS> Tell me about Alpacas. Alpacas **Output**: are<br>
**Input**: <BOS> Tell me about Alpacas. Alpacas are **Output**: a<br>
**Input**: <BOS> Tell me about Alpacas. Alpacas are a **Output**: type<br>
**Input**: <BOS> Tell me about Alpacas. Alpacas are a type **Output**: of<br>
**Input**: <BOS> Tell me about Alpacas. Alpacas are a type of **Output**: camel<br>
**Input**: <BOS> Tell me about Alpacas. Alpacas are a type of camel **Output**: <EOS><br>
**Final Output**: Alpacas are a type of camel<br>
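
The sequential procedure above can be sketched as a simple loop. In this toy example the "model" is a hypothetical lookup table (`TOY_NEXT_TOKEN`) that only looks at the last token; a real LLM conditions on the whole sequence and scores every token in its vocabulary.

```python
# Hypothetical stand-in for the model: maps the last token to the most likely next token.
TOY_NEXT_TOKEN = {
    "Alpacas.": "Alpacas",
    "Alpacas": "are",
    "are": "a",
    "a": "type",
    "type": "of",
    "of": "camel",
    "camel": "<EOS>",
}

def greedy_generate(prompt_tokens, max_new_tokens=32):
    """Append one token per step until <EOS> or the token budget is reached."""
    tokens = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        next_token = TOY_NEXT_TOKEN[tokens[-1]]
        if next_token == "<EOS>":
            break
        tokens.append(next_token)      # the new token becomes part of the next input
        generated.append(next_token)
    return generated

output = greedy_generate(["Tell", "me", "about", "Alpacas."])
print(" ".join(output))
```

The `max_new_tokens` argument plays the same role as in the `model.generate` call earlier: it caps the loop even if the end-of-sequence token never appears.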

Model finetuning is a supervised learning process, in contrast to pretraining, which is self-supervised. 
The loss function is the cross-entropy loss over the response tokens. An example is shown below:-
    


**Input**: Tell me about Alpacas.<br>
**Output**: Alpacas are a type of camel found in South America.<br>

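$\text{CE loss} = -\big(\log P(\text{Alpacas}) + \log P(\text{are} \mid \text{Alpacas}) + \log P(\text{a} \mid \text{Alpacas are}) + \log P(\text{type} \mid \text{Alpacas are a}) + \log P(\text{of} \mid \text{Alpacas are a type}) + \log P(\text{camel} \mid \text{Alpacas are a type of}) + \dots\big)$

Every probability is also conditioned on the input prompt.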
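
As a small numeric illustration of this loss, the sketch below sums the negative log-probabilities of the first few target tokens. The probability values are made up for the example; in practice they come from the softmax over the model's output logits.

```python
import math

# Hypothetical probabilities the model assigns to each correct next token
token_probs = {
    "Alpacas": 0.20,
    "are": 0.60,
    "a": 0.50,
    "type": 0.30,
    "of": 0.90,
    "camel": 0.25,
}

# Sum of the negative log-likelihoods over the target tokens
ce_loss_sum = -sum(math.log(p) for p in token_probs.values())
# Frameworks such as PyTorch report the mean over tokens by default
ce_loss_mean = ce_loss_sum / len(token_probs)
print(f"sum = {ce_loss_sum:.3f}, mean = {ce_loss_mean:.3f}")
```

A token predicted with high probability (here "of") adds almost nothing to the loss, while a poorly predicted token (here "Alpacas") dominates it.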

LLMs take an enormous amount of resources to train. For example, according to the Llama paper, the 65B parameter model took about 21 days to train on 2048 A100 GPUs. At an assumed price of 8.8 USD per GPU-hour, the total cost of training a 65B model would be roughly:-<br>
2048 (# of GPUs) * 24 (# hours per day) * 21 (# days of training) * 8.8 (cost in USD per GPU-hour) ≈ $9.08M
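
The estimate can be recomputed directly from its four factors. The hourly rate is the one assumed in the text; actual cloud prices vary by provider and over time.

```python
num_gpus = 2048
hours_per_day = 24
training_days = 21
cost_per_gpu_hour = 8.8  # USD, the hourly rate assumed in the text

gpu_hours = num_gpus * hours_per_day * training_days
total_cost = gpu_hours * cost_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours, about ${total_cost / 1e6:.2f}M")
```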

A100 GPUs are state-of-the-art data center GPUs, not consumer GPUs. While finetuning should cost only a fraction of this, it still requires significant compute time and power. For example, Stanford Alpaca (https://crfm.stanford.edu/2023/03/13/alpaca.html), which finetunes Llama 7B, reports a total cost of less than $600. 

## LORA for finetuning

In [None]:
!nvidia-smi

We do not currently have access to an A100 GPU, and with the model configuration above even ordinary finetuning takes a long time. Hence, in this case,
we use LoRA (Low-Rank Adaptation), which injects small trainable matrices into each layer of the model while the pretrained weights stay frozen. 
LoRA is based on the paper (https://arxiv.org/pdf/2106.09685.pdf), which hypothesizes that the weight updates learned during finetuning have a "low intrinsic rank". Hence, the model can still learn efficiently even when the update is projected to a smaller subspace. When learning the delta weights, we therefore use the following equation:-
$h = W_0x + \Delta Wx = W_0x + BAx$


The main benefits of this are simplicity and no additional inference latency, since $BA$ can be merged into $W_0$ after training. $B$ and $A$ are known as the LoRA weight matrices. The main parameter of the LoRA adapter is the rank ($r$), and the idea is as follows:-<br>
$\Delta W \in \mathbb{R}^{M \times N}$<br>
$\Delta W = B_{M \times r} A_{r \times N}$
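
A short NumPy sketch of the decomposition, assuming toy sizes for the rank check and a 3200 x 3200 projection (the shape of `q_proj` in the model printout) for the parameter count:

```python
import numpy as np

# Toy sizes so that the rank check runs instantly
M, N, r = 64, 48, 8
rng = np.random.default_rng(0)
B = rng.normal(size=(M, r))   # M x r
A = rng.normal(size=(r, N))   # r x N
delta_W = B @ A               # full M x N matrix, but its rank is at most r
rank = np.linalg.matrix_rank(delta_W)
print(delta_W.shape, rank)

# Parameter counts for a 3200 x 3200 projection with rank 8
full_params = 3200 * 3200
lora_params = 8 * (3200 + 3200)
print(f"full: {full_params:,}  LoRA: {lora_params:,}  ratio: {lora_params / full_params:.2%}")
```

Training only the two thin matrices means each adapted layer needs a small fraction of the parameters, gradients and optimizer state that a full update of the same layer would need.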

## Preparing model for finetuning

We use the PEFT package from Hugging Face to assist us with the LoRA finetuning.

In [4]:
from peft import LoraConfig, PeftModel, get_peft_model
from peft import prepare_model_for_kbit_training

In [5]:
lora_target_modules = ['gate_proj','down_proj','up_proj','q_proj','v_proj','k_proj','o_proj']
lora_alpha = 16
lora_r = 8
lora_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=lora_target_modules,
        task_type="CAUSAL_LM",
    )

Here, lora_r is the rank $r$ of the low-rank matrices. lora_alpha is the scaling parameter: the LoRA update is multiplied by lora_alpha / lora_r before it is added to the output of the frozen layer. 
lora_target_modules lists the layers to which LoRA adapters are added.

In [6]:
model = prepare_model_for_kbit_training(
            model, use_gradient_checkpointing=True
        )
model = get_peft_model(model, lora_config)

In [7]:
model.print_trainable_parameters()

trainable params: 12,712,960 || all params: 3,439,186,560 || trainable%: 0.36965020007521776
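
The trainable parameter count can be checked by hand. Each adapted Linear layer with `in_features` inputs and `out_features` outputs adds r * (in_features + out_features) parameters. The sketch below takes the layer shapes from the model printout earlier in this article.

```python
r = 8
d_model, d_ff, n_layers = 3200, 8640, 26

# (in_features, out_features) of every targeted Linear layer in one decoder block
target_shapes = {
    "q_proj": (d_model, d_model),
    "k_proj": (d_model, d_model),
    "v_proj": (d_model, d_model),
    "o_proj": (d_model, d_model),
    "gate_proj": (d_model, d_ff),
    "up_proj": (d_model, d_ff),
    "down_proj": (d_ff, d_model),
}

# lora_A has r * in_features weights, lora_B has out_features * r weights
per_layer = sum(r * (n_in + n_out) for n_in, n_out in target_shapes.values())
trainable = per_layer * n_layers
print(f"{trainable:,}")
```

The result agrees with the number reported by `print_trainable_parameters()`.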


In [8]:
model.base_model.model.model.layers[0].self_attn.q_proj

Linear8bitLt(
  in_features=3200, out_features=3200, bias=False
  (lora_dropout): ModuleDict(
    (default): Identity()
  )
  (lora_A): ModuleDict(
    (default): Linear(in_features=3200, out_features=8, bias=False)
  )
  (lora_B): ModuleDict(
    (default): Linear(in_features=8, out_features=3200, bias=False)
  )
  (lora_embedding_A): ParameterDict()
  (lora_embedding_B): ParameterDict()
)

In [9]:
print("LORA A weights")
print(model.base_model.model.model.layers[0].self_attn.q_proj.lora_A.default.state_dict())
print("LORA B weights")
print(model.base_model.model.model.layers[0].self_attn.q_proj.lora_B.default.state_dict())
model.print_trainable_parameters()

LORA A weights
OrderedDict([('weight', tensor([[ 0.0119, -0.0117, -0.0060,  ...,  0.0118, -0.0083,  0.0060],
        [-0.0040, -0.0096,  0.0049,  ..., -0.0099, -0.0024, -0.0158],
        [-0.0110, -0.0039,  0.0003,  ...,  0.0128,  0.0003, -0.0120],
        ...,
        [-0.0010,  0.0118,  0.0046,  ..., -0.0135,  0.0029,  0.0147],
        [-0.0059,  0.0155, -0.0033,  ..., -0.0075,  0.0021,  0.0076],
        [ 0.0140,  0.0060, -0.0108,  ..., -0.0110,  0.0146,  0.0043]],
       device='cuda:0'))])
LORA B weights
OrderedDict([('weight', tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0'))])
trainable params: 12,712,960 || all params: 3,439,186,560 || trainable%: 0.36965020007521776


The LoRA B weights are initialized to 0 so that $BA = 0$ and the randomly initialized A weights do not affect the model output at the start of training.
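
The effect of the zero initialization can be seen in a small NumPy sketch of the LoRA forward pass. The sizes, the random values and the function name `lora_forward` are assumptions made for illustration; the scaling factor alpha / r follows the LoRA paper.

```python
import numpy as np

d, r, alpha = 16, 4, 8  # toy sizes; r and alpha mirror lora_r and lora_alpha
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, d))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d))   # randomly initialized
B = np.zeros((d, r))                      # initialized to zero
x = rng.normal(size=d)

def lora_forward(x, W0, A, B, alpha, r):
    """h = W0 x + (alpha / r) * B A x"""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

# At the start of training B = 0, so the adapter contributes nothing
h_start = lora_forward(x, W0, A, B, alpha, r)
print(np.allclose(h_start, W0 @ x))

# Once B has moved away from zero, the output starts to differ from the base model
B_trained = rng.normal(scale=0.01, size=(d, r))
h_later = lora_forward(x, W0, A, B_trained, alpha, r)
print(np.allclose(h_later, W0 @ x))
```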

## Preparing the Datasets 

In [10]:
from datasets import Dataset, DatasetDict, load_dataset, load_from_disk
tokenizer_name = tokenizer.__class__.__name__

In [11]:
ds = load_dataset("teknium/GPT4-LLM-Cleaned",streaming=False)

Found cached dataset json (/home/mkarri/.cache/huggingface/datasets/teknium___json/teknium--GPT4-LLM-Cleaned-a71aa8ae1ac3982d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 310.80it/s]


The above dataset is a cleaned version of the Alpaca dataset, which is an instruction-tuning dataset. The code below has been adapted from axolotl by the OpenAccess AI Collective (https://github.com/OpenAccess-AI-Collective/axolotl).

In [12]:
import abc
import functools
from enum import Enum
from typing import Generator, Optional, Tuple, Union

from datasets import IterableDataset
# PreTrainedTokenizer is used as a type annotation in PromptTokenizingStrategy
from transformers import PreTrainedTokenizer

class InvalidDataException(Exception):
    """
    Exception raised when the data is invalid
    """
class PromptStyle(Enum):
    """
    Enum for prompt styles
    """

    INSTRUCT = "instruct"
    CHAT = "chat"


class AlpacaPrompter:
    """
    Base class for alpaca prompters
    """

    system_prompt = "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
    system_no_input_prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
    prompt_style: Optional[PromptStyle] = None

    def __init__(self, prompt_style=PromptStyle.INSTRUCT.value):
        self.prompt_style = prompt_style if prompt_style else PromptStyle.INSTRUCT.value
        self.match_prompt_style()

    def match_prompt_style(self):
        if self.prompt_style == PromptStyle.INSTRUCT.value:
            self.prompt_input = (
                self.system_prompt
                + "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
            )
            self.prompt_no_input = (
                self.system_no_input_prompt
                + "### Instruction:\n{instruction}\n\n### Response:\n"
            )
            self.response_split = "### Response:"
        if self.prompt_style == PromptStyle.CHAT.value:
            self.prompt_input = (
                self.system_prompt + "USER: {instruction}\n{input}\nASSISTANT:"
            )
            self.prompt_no_input = (
                self.system_no_input_prompt + "USER: {instruction}\nASSISTANT:"
            )
            self.response_split = "ASSISTANT:"

    def build_prompt(
        self,
        instruction: str,
        input: Union[None, str] = None,  # pylint: disable=redefined-builtin
        output: Union[None, str] = None,
    ) -> Generator[str, None, None]:
        # returns the full prompt from instruction and optional input
        # if a label (=response, =output) is provided, it's also appended.
        if input:
            res = self.prompt_input.format(instruction=instruction, input=input)
        else:
            res = self.prompt_no_input.format(instruction=instruction)
        if output:
            res = f"{res}{output}"
        yield res

    def get_response(self, output: str) -> str:
        return output.split(self.response_split)[1].strip()



class PromptTokenizingStrategy(abc.ABC):
    """
    Abstract class for tokenizing strategies
    """

    def __init__(
        self,
        prompter,
        tokenizer,
        train_on_inputs: bool = False,
        sequence_len: int = 2048,
    ):
        self.prompter = prompter
        self.tokenizer: PreTrainedTokenizer = tokenizer
        self.train_on_inputs = train_on_inputs
        self.sequence_len = sequence_len

    @abc.abstractmethod
    def tokenize_prompt(self, prompt):
        pass

    @functools.lru_cache(maxsize=128)
    def _get_user_token(self):
        id_or_ids = self.tokenizer.convert_tokens_to_ids("<|USER|>")
        if isinstance(id_or_ids, (int,)):
            return id_or_ids
        return False

    @functools.lru_cache(maxsize=128)
    def _get_assistant_token(self):
        id_or_ids = self.tokenizer.convert_tokens_to_ids("<|ASSISTANT|>")
        if isinstance(id_or_ids, (int,)):
            return id_or_ids
        return False

    def _tokenize(self, prompt: str, add_eos_token=True, strip_bos_token=False):
        result = self.tokenizer(
            prompt,
            truncation=True,
            max_length=self.sequence_len,
            padding=False,
            return_tensors=None,
        )
        if (
            result["input_ids"][-1] != self.tokenizer.eos_token_id
            and len(result["input_ids"]) < self.sequence_len
            and add_eos_token
        ):
            result["input_ids"].append(self.tokenizer.eos_token_id)
            result["attention_mask"].append(1)

        if result["input_ids"][0] == self.tokenizer.bos_token_id and strip_bos_token:
            result["input_ids"] = result["input_ids"][1:]
            result["attention_mask"] = result["attention_mask"][1:]

        result["labels"] = result["input_ids"].copy()
        return result


class InstructionPromptTokenizingStrategy(PromptTokenizingStrategy):
    """
    Tokenizing strategy for instruction-based prompts.
    """

    def parse_instruction_fields(self, prompt) -> Tuple[str, str, str]:
        raise NotImplementedError

    def tokenize_prompt(self, prompt):
        (
            instruction,
            input,  # pylint: disable=redefined-builtin
            response,
        ) = self.parse_instruction_fields(prompt)
        
        full_prompt = self._build_full_prompt(instruction, input, response)
        tokenized_full_prompt = self._tokenize(full_prompt)
        if not self.train_on_inputs:
            user_prompt = next(
                iter(
                    self.prompter.build_prompt(
                        instruction,
                        input,
                    )
                )
            )
            tokenized_user_prompt = self._tokenize(user_prompt, add_eos_token=False)
            user_prompt_len = len(tokenized_user_prompt["input_ids"])
            # TODO this could be sped up using numpy array slicing
            tokenized_full_prompt["labels"] = [
                -100
            ] * user_prompt_len + tokenized_full_prompt["labels"][user_prompt_len:]

        return tokenized_full_prompt

    def _build_full_prompt(
        self, instruction, input, response  # pylint: disable=redefined-builtin
    ):
        return next(
            iter(
                self.prompter.build_prompt(
                    instruction,
                    input,
                    response,
                )
            )
        )

class TokenizedPromptDataset(IterableDataset):
    """
    Iterable dataset that returns tokenized prompts from a stream of text files.
        Args:
            prompt_tokenizer (PromptTokenizingStrategy): The prompt tokenizing method for proccessing the data.
            dataset (dataset.Dataset): Dataset with text files.
    """

    def __init__(  # pylint: disable=super-init-not-called
        self,
        prompt_tokenizer: PromptTokenizingStrategy,
        dataset: IterableDataset,
    ):
        self.prompt_tokenizer = prompt_tokenizer
        self.dataset = dataset

    def __iter__(self):
        iterator = iter(self.dataset)
        count = 0
        # Loop through the entire dataset
        for example in iterator:
            try:
                yield self.prompt_tokenizer.tokenize_prompt(example)
                count += 1
            except InvalidDataException:
                pass
        if count == 0:
            raise RuntimeError("Expected at least one datapoint in dataset.")

class AlpacaPromptTokenizingStrategy(InstructionPromptTokenizingStrategy):
    """
    Tokenizing strategy for Alpaca prompts.
    """

    def parse_instruction_fields(self, prompt) -> Tuple[str, str, str]:
        return (
            prompt["instruction"],
            prompt["input"] if "input" in prompt else "",
            prompt["output"],
        )

In [13]:
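# Instruct-style prompts, labels masked on the prompt part, sequences truncated to 256 tokens
ds_strategy = AlpacaPromptTokenizingStrategy(
    AlpacaPrompter('instruct'), tokenizer, train_on_inputs=False, sequence_len=256
)
ds_wrapper = TokenizedPromptDataset(ds_strategy, ds['train'])
# Tokenize every example once and keep the results in memory
samples = list(ds_wrapper)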


So what is happening here?<br>
**Step 1**: Create the appropriate prompt template for the dataset. Below is the prompt for the first example of the dataset. 

In [14]:
print(next(AlpacaPrompter('instruct').build_prompt(ds['train'][0]['instruction'],ds['train'][0]['input'],ds['train'][0]['output'])))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.


**Step 2**: Tokenize each prompt. The inputs to the model consist of both the instruction 
    and the answer. In the labels, which are the targets the model is finetuned on, the tokens that belong to the prompt are replaced with -100, the index that the PyTorch cross-entropy loss ignores. Only the answer tokens therefore contribute to the loss. For the first example, the text covered by the -100 labels is:-<br>

<i>Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Give three tips for staying healthy. ### Response:</i><br>
Keeping the prompt in the input while masking it in the labels lets us process the entire sample in parallel in a single forward pass.
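
A minimal sketch of this label masking, using made-up token ids in place of real tokenizer output:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Hypothetical token ids, for illustration only
prompt_ids = [1, 450, 775, 2159]    # instruction part of the sample
response_ids = [910, 338, 263, 2]   # answer part, ending with an end-of-sequence id

input_ids = prompt_ids + response_ids
# Prompt positions are ignored by the loss; answer positions keep their token ids
labels = [IGNORE_INDEX] * len(prompt_ids) + input_ids[len(prompt_ids):]

print(input_ids)
print(labels)
```

This mirrors what `tokenize_prompt` does when `train_on_inputs` is False: the prompt is tokenized on its own to find its length, and that many leading labels are overwritten.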


**Step 3**: Split into train and test datasets

In [15]:
dataset = Dataset.from_list(samples).shuffle(seed=42)
dataset = dataset.train_test_split(test_size=0.02, shuffle=False)
train_dataset = dataset['train']
eval_dataset = dataset['test']

## Prepare the trainer for training
Here, we use the Hugging Face Trainer, which allows the training run to be managed through configuration.

Now we set some learning parameters for our trainer.


In [16]:
# Learning Rate 
learning_rate = 0.0002
# Weight decay associated with learning parameter
weight_decay = 0.0
batch_size = 16
micro_batch_size = 4
eval_steps = 50
save_steps = 1000
# Num of epochs
num_epochs = 3
# Optimizer(We use Adam 8 bit as the model has been loaded in 8 bit)
optimizer = 'adamw_bnb_8bit'
import math
total_num_steps = int(
    math.ceil(len(train_dataset) * num_epochs / batch_size)
)
warmup_steps = 10
logging_steps = max(min(int(0.005 * total_num_steps), 10), 1)
training_arguments_kwargs = {}
## Train model in FP16 or float because it has been loaded in 8 bit
training_arguments_kwargs["fp16"] = True
training_arguments_kwargs["tf32"] = False
## Gradient Checkpointing
gradient_checkpointing = True
training_arguments_kwargs["warmup_steps"] = warmup_steps
training_arguments_kwargs["logging_steps"] = logging_steps
training_arguments_kwargs["gradient_checkpointing"] = gradient_checkpointing
# Accumulate gradients over several micro-batches to reach the effective batch size
gradient_accumulation_steps = batch_size // micro_batch_size

We use an 8-bit Adam optimizer, whose optimizer states are quantized to 8 bits. Quantization is out of scope for this article; the following blog post 
might help (https://huggingface.co/blog/hf-bitsandbytes-integration).
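
Returning to the batch settings in the cell above, the sketch below works through the arithmetic for a single GPU. The dataset size `num_train_examples` is a hypothetical round number chosen for illustration; the other values are the ones set above.

```python
import math

batch_size = 16         # effective batch size per optimizer step
micro_batch_size = 4    # samples per forward/backward pass
num_epochs = 3
num_train_examples = 50_000  # hypothetical dataset size, for illustration only

# Four micro-batches of 4 samples are accumulated before each optimizer step
gradient_accumulation_steps = batch_size // micro_batch_size
effective_batch = micro_batch_size * gradient_accumulation_steps

total_num_steps = math.ceil(num_train_examples * num_epochs / batch_size)
logging_steps = max(min(int(0.005 * total_num_steps), 10), 1)
print(gradient_accumulation_steps, effective_batch, total_num_steps, logging_steps)
```

Gradient accumulation keeps the memory footprint of a batch of 4 while the optimizer still sees gradients averaged over 16 samples.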

In [17]:
import transformers
training_args = transformers.TrainingArguments(
        per_device_train_batch_size=micro_batch_size,
        per_device_eval_batch_size=micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        eval_accumulation_steps=gradient_accumulation_steps,
        num_train_epochs=num_epochs,
        learning_rate=learning_rate,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=eval_steps,
        save_steps=save_steps,
        output_dir='./lora-out',
        save_total_limit=3,
        group_by_length=False,
        report_to=None,
        run_name=None,
        optim=optimizer,
        lr_scheduler_type="cosine",
        weight_decay=weight_decay,
        **training_arguments_kwargs,
    )
trainer_kwargs = {}

In [18]:
from transformers.trainer_pt_utils import get_parameter_names
from torch import nn
import bitsandbytes as bnb
decay_parameters = get_parameter_names(model, [nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if "bias" not in name]
optimizer_grouped_parameters = [
    {
        "params": [
            p
            for n, p in model.named_parameters()
            if (n in decay_parameters and p.requires_grad)
        ],
        "weight_decay": training_args.weight_decay,
    },
    {
        "params": [
            p
            for n, p in model.named_parameters()
            if (n not in decay_parameters and p.requires_grad)
        ],
        "weight_decay": 0.0,
    },
]

optimizer = bnb.optim.Adam8bit(
    optimizer_grouped_parameters,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    lr=training_args.learning_rate,
)
lr_scheduler = transformers.get_cosine_schedule_with_warmup(
                optimizer,
                training_args.warmup_steps,
                total_num_steps,
            )
trainer_kwargs["optimizers"] = (optimizer, lr_scheduler)
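
The learning-rate schedule configured above warms up linearly and then decays along a half cosine. The plain-Python sketch below illustrates that shape; it is not the library implementation, and `total_steps=1000` is a hypothetical value.

```python
import math

def cosine_lr(step, base_lr=0.0002, warmup_steps=10, total_steps=1000):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 5, 10, 505, 1000):
    print(step, f"{cosine_lr(step):.6f}")
```

The short warmup avoids large updates while the optimizer statistics are still unreliable, and the cosine decay lowers the learning rate smoothly toward the end of training.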

In [20]:
data_collator_kwargs = {
        "padding": True,
    }
data_collator_kwargs["pad_to_multiple_of"] = 8

In [None]:
#model.config.use_cache = False
import torch
import transformers
model = torch.compile(model)


In [21]:
trainer = transformers.Trainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        args=training_args,
        data_collator=transformers.DataCollatorForSeq2Seq(
            tokenizer,
            return_tensors="pt",
            **data_collator_kwargs,
        ),
        **trainer_kwargs,
    )

In [22]:
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: mkarri (wreckit). Use `wandb login --relogin` to force relogin


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss,Validation Loss


KeyboardInterrupt: 

Once the model training is completed (it takes 11 hours), the adapter weights can be found at (https://huggingface.co/maneel/OpenLLama3B_8bitQuantized_alpaca_finetuned).

## Model Inference

Once we have the model trained, we need to expose it as an endpoint so that we can easily perform inference with it.
The framework we use is Ray Serve, and the code is heavily inspired by the following notebook (https://docs.ray.io/en/latest/ray-air/examples/gptj_serving.html).

In [23]:
import pandas as pd
from ray import serve
from starlette.requests import Request


@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str, revision: str = None):
        from peft import PeftModel
        from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
        import torch

        # Load the base model in 8-bit. Note: `revision` is accepted but not used here.
        model = LlamaForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            load_in_8bit=True,
            device_map="auto",
        )
        # Load the adapter delta weights on top of the base model.
        self.model = PeftModel.from_pretrained(
            model,
            "maneel/OpenLLama3B_8bitQuantized_alpaca_finetuned",
            adapter_name="maneel_openllama",
        )

        self.tokenizer = LlamaTokenizer.from_pretrained(model_id)

        # `do_sample` is left at its default (False), so generation uses beam search
        # and the sampling parameters (temperature, top_p, top_k) have no effect.
        self.generation_config = GenerationConfig(
            temperature=0.8,
            top_p=0.75,
            top_k=40,
            num_beams=4,
            no_repeat_ngram_size=3,
            max_new_tokens=256,
        )
        
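    def generate(self, prompts: list) -> pd.DataFrame:
        # `prompts` is a list of strings. No padding is applied, so a batch of
        # prompts with different token lengths cannot be stacked into one tensor.
        input_ids = self.tokenizer(prompts, return_tensors="pt").input_ids.to(
            self.model.device
        )

        gen_tokens = self.model.generate(
            input_ids=input_ids,
            generation_config=self.generation_config,
        )
        # One row per generated sequence; the decoded text includes the prompt.
        return pd.DataFrame(
            self.tokenizer.batch_decode(gen_tokens), columns=["responses"]
        )

    async def __call__(self, http_request: Request) -> pd.DataFrame:
        # The request body is a JSON list of objects, each with a "text" field
        # holding either a single prompt or a list of prompts.
        json_request: list = await http_request.json()
        prompts = []
        for prompt in json_request:
            text = prompt["text"]
            if isinstance(text, list):
                prompts.extend(text)
            else:
                prompts.append(text)
        return self.generate(prompts)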
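
The request handling in `__call__` can be checked without starting Ray or loading the model. The following is a minimal sketch of that flattening step only; the helper name `flatten_prompts` is hypothetical and is not part of the notebook or of Ray Serve.

```python
from typing import Any, Dict, List


def flatten_prompts(json_request: List[Dict[str, Any]]) -> List[str]:
    """Hypothetical helper mirroring the loop in PredictDeployment.__call__."""
    prompts: List[str] = []
    for item in json_request:
        text = item["text"]
        if isinstance(text, list):
            # A list of prompts is merged into the flat batch.
            prompts.extend(text)
        else:
            # A single prompt string is appended as-is.
            prompts.append(text)
    return prompts


# Example payload in the shape the endpoint expects: a JSON list of objects.
payload = [{"text": "first prompt"}, {"text": ["second prompt", "third prompt"]}]
print(flatten_prompts(payload))
```

Each `"text"` entry may be a string or a list of strings; either way the model receives one flat list of prompts.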


In [25]:
model_id = "openlm-research/open_llama_3b"  #base openllama model
revision = "float16"
deployment = PredictDeployment.bind(model_id=model_id, revision=revision)
serve.run(deployment)

2023-06-28 19:20:21,102	INFO worker.py:1627 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(ServeController pid=3936403) INFO 2023-06-28 19:20:24,806 controller 3936403 deployment_state.py:1298 - Deploying new version of deployment default_PredictDeployment.
(HTTPProxyActor pid=3936481) INFO:     Started server process [3936481]
(ServeController pid=3936403) INFO 2023-06-28 19:20:24,876 controller 3936403 deployment_state.py:1537 - Adding 1 replica to deployment default_PredictDeployment.


(ServeReplica:default_PredictDeployment pid=3936595) [2023-06-28 19:20:28,006] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(ServeReplica:default_PredictDeployment pid=3936595) Welcome to bitsandbytes. For bug reports, please run
(ServeReplica:default_PredictDeployment pid=3936595) python -m bitsandbytes
(ServeReplica:default_PredictDeployment pid=3936595)  and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
(ServeReplica:default_PredictDeployment pid=3936595) bin /home/mkarri/anaconda3/envs/axototl/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
(ServeReplica:default_PredictDeployment p

(ServeReplica:default_PredictDeployment pid=3936595)   warn(msg)
(ServeReplica:default_PredictDeployment pid=3936595) The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Downloading adapter_model.bin:   0%|          | 0.00/51.0M [00:00<?, ?B/s]
Downloading adapter_model.bin:  21%|██        | 10.5M/51.0M [00:00<00:01, 20.6MB/s]
Downloading adapter_model.bin:  41%|████      | 21.0M/51.0M [00:00<00:00, 33.2MB/s]
Downloading adapter_model.bin:  62%|██████▏   | 31.5M/51.0M [00:00<00:00, 42.6MB/s]
Downloading adapter_model.bin:  82%|████████▏ | 41.9M/51.0M [00:01<00:00, 43.2MB/s]
Downloading adapter_model.bin: 100%|██████████| 51.0M/51.0M [00:01<00:00, 35.1MB/s]


RayServeSyncHandle(deployment='default_PredictDeployment')

In [27]:
import requests
prompt = (
    """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Explain in simple terms how the attention mechanism of a transformer model works
### Response:"""
)

sample_input = {"text": prompt}

output = requests.post("http://localhost:8000/", json=[sample_input]).json()
print(output)

[{'responses': '<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\nExplain in simple terms how the attention mechanism of a transformer model works\n### Response:\nA transformer network is a type of artificial neural network that is made up of multiple layers of interconnected nodes. At the topmost layer, called the input layer, the nodes are responsible for inputting the data into the network. The nodes in this layer will take in the input, process it, and pass it on to the next layer.\nThe next layer, also known as the output layer, takes in the processed data from the previous layer, processes it further, and passes it on. This process continues until the end of the network, where the nodes at the final layer will have processed the input data and passed it on as output.\nOnce the data has passed through all the layers, the final output will be the processed and pre-processed data. The attention mechanism 

[2m[36m(ServeReplica:default_PredictDeployment pid=3936595)[0m INFO 2023-06-28 19:29:31,585 default_PredictDeployment default_PredictDeployment#ajdfkL szbOguWTta / default replica.py:654 - __CALL__ OK 64223.7ms


As we can see, the model is now served at an inference endpoint that we can easily query. Note that the returned text contains the full prompt followed by the model's answer.
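
Since the endpoint returns the prompt and the answer in one string, a client will usually want to separate them. The following is a minimal sketch of that post-processing, assuming the Alpaca-style prompt used in the request above; `build_prompt` and `extract_response` are hypothetical helper names, not part of the notebook.

```python
# Prompt layout copied from the request cell above.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{instruction}\n### Response:"
)
RESPONSE_MARKER = "### Response:"


def build_prompt(instruction: str) -> str:
    """Hypothetical helper: wrap an instruction in the Alpaca-style prompt."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def extract_response(generated: str) -> str:
    """Hypothetical helper: keep only the text after the last response marker."""
    # rpartition returns the whole string as the tail if the marker is absent.
    _, _, answer = generated.rpartition(RESPONSE_MARKER)
    # Drop the end-of-sequence token if the tokenizer decoded it into the text.
    return answer.replace("</s>", "").strip()


# Simulated endpoint output: BOS token, the prompt, then the model's answer.
generated = "<s> " + build_prompt("Say hi") + "\nHi there!</s>"
print(extract_response(generated))
```

With the real endpoint, the same function would be applied to each `output[i]["responses"]` string returned by `requests.post(...).json()`.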