# Training LLMs with Small Datasets


1.   Understanding Model Size
2.   Quantization
3.   Basic Fine Tuning
4.   Advanced Fine Tuning



# Understanding Model Size

Llama 2:
- 7b (no. of weights)
- 13b
- 70b

Each weight is represented by 32 bits:
- 8 bits per byte
- 70b model
=> 70b x 32 bits / 8 bits per byte = 280 GB approx. (size of the weights)


Llama 2 7b:
- 7b x 32 bits / 8 bits per byte = 28 GB

Nvidia A100: 40 GB or 80 GB of GPU memory, so the 70b weights at 32 bits do not fit on a single card.
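
The arithmetic above can be wrapped in a small helper. This is a minimal sketch that counts only the weights (no activations, gradients, or optimizer state) and takes 1 GB as 10^9 bytes; the function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = num_params * bits_per_weight
    total_bytes = total_bits / 8  # 8 bits per byte
    return total_bytes / 1e9


# Llama 2 sizes from the list above, stored at 32 bits per weight
for name, params in [("7b", 7e9), ("13b", 13e9), ("70b", 70e9)]:
    print(f"Llama 2 {name}: {weight_memory_gb(params, 32):.0f} GB")
```

Comparing the result with the 40 GB / 80 GB of an A100 shows which sizes could fit on one card before any quantization.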


## Quantization

Instead of using 32 bits to represent a weight, you scale it down to 4 bits.
You lose precision, but the model fits on small GPUs.

2^4 = 16 (possible values with 4 bits)
2^32 = ........

7b model:
- 7b x 4 bits / 8 bits per byte = 3.5 GB
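
To make the precision loss concrete, here is a minimal sketch of a simple uniform ("absmax") quantizer in plain Python. It is not the NF4 scheme used later in this notebook; the function names and the sample weights are made up for illustration.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax], where qmax = 2**(bits-1) - 1."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0  # avoid dividing by zero
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.91, -0.07, 0.33]  # hypothetical weights
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

for w, v, r in zip(weights, q, restored):
    print(f"{w:+.2f} -> {v:+d} -> {r:+.3f} (error {abs(w - r):.3f})")
```

Each weight is stored as a small integer plus one shared scale, and the rounding error per weight is at most half the scale.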

# Fine Tuning with QLoRA

QLoRA (Quantized LoRA) - means training LoRA adapters on top of quantized weights (4-bit in this case)

LoRA (Low-Rank Adaptation) - freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the transformer
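
The idea can be sketched in plain Python: the frozen weight matrix W stays untouched, and a trainable product B·A of two thin matrices is added on top, scaled by alpha / r. This is a toy illustration rather than how the `peft` library implements it; the helper names and the tiny dimensions are made up, and the 4096 x 4096 shape at the end assumes a projection matrix of that size.

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x) for a column vector x."""
    base = matmul(W, x)               # frozen pre-trained path
    update = matmul(B, matmul(A, x))  # trainable low-rank path
    scale = alpha / r
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]


def lora_param_counts(d, k, r):
    """Weights in the full d x k matrix vs. in the two LoRA factors."""
    return d * k, r * (d + k)


random.seed(0)
d, k, r, alpha = 6, 6, 2, 4
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]                                  # trainable, starts at zero
x = [[1.0] for _ in range(k)]

# With B = 0 the adapter adds nothing, so training starts from the base model's output.
unchanged = lora_forward(W, A, B, x, alpha, r) == matmul(W, x)
print("output unchanged at start:", unchanged)

full, lora = lora_param_counts(4096, 4096, 8)
print(f"full matrix: {full:,} weights | LoRA factors (r=8): {lora:,} weights")
```

Only A and B receive gradients, which is why the trainable parameter count reported later in the notebook is a small fraction of the total.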



In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets


# Load the model to use: Llama 2 7B Chat

In [None]:
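import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = 'meta-llama/Llama-2-7b-chat-hf'

# 4-bit quantization settings (bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4 bits
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for the matrix multiplications
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map={"":0} places the whole model on GPU 0
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})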

# Training Setup

In [None]:
from peft import prepare_model_for_kbit_training

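# Trade extra compute for lower activation memory during backpropagation
model.gradient_checkpointing_enable()
# Prepare the quantized model for training (freezes the base weights, among other things)
model = prepare_model_for_kbit_training(model)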


In [None]:
def print_trainable_parameters(model):
  """
  Prints the number of trainable params
  """
  trainable_params = 0
  all_param = 0
  for _, param in model.named_parameters():
    all_param += param.numel()
    if param.requires_grad:
      trainable_params += param.numel()

  print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}")



In [None]:
from peft import LoraConfig, get_peft_model

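config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=32,                        # scaling factor for the update (alpha / r)
    target_modules=["self_attn.q_proj"],  # only adapt the attention query projections
    lora_dropout=0.05,
    bias="none",                          # leave bias terms frozen
    task_type="CAUSAL_LM"
)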
model = get_peft_model(model, config)
print_trainable_parameters(model)

# Datasets

In [None]:
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples : tokenizer(samples["quote"]), batched=True)

# Training

In [None]:
import transformers

# needed for Llama tokenizer: it has no pad token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model = model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

model.config.use_cache = False
trainer.train()

# Inference

In [None]:
from transformers import TextStreamer

In [None]:
# Stream the model's answer to a user prompt, token by token
def stream(user_prompt):
  runtimeFlag = "cuda:0"
  system_prompt = 'You are a helpful assistant that blah blah'
  # Llama 2 chat prompt delimiters
  B_INST, E_INST = "[INST]", "[/INST]"
  B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

  prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"

  inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

  # Prints tokens to stdout as they are generated
  streamer = TextStreamer(tokenizer)

  _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)

# Advanced Fine Tuning
- Prompt masking
- End of sequence token

Attention:
- is the idea that the prediction of the next token depends on earlier tokens
- the attention mask marks which tokens to attend to (1) and which are padding to ignore (0)

[The][ quick][ brown][ fox][ jumped][...]
[1]     [1]     [1]    [1]    [1]

|<pad>| [The][ quick][ brown][ fox][ jumped][...]
[0]     [1]     [1]    [1]    [1]     [1]
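
The two masks above can be produced with a few lines of plain Python. This is a minimal sketch of left-padding a batch to a common length; the token ids and the `pad_left` helper are made up for illustration, and in practice the tokenizer builds the mask for you.

```python
def pad_left(batch, pad_id):
    """Left-pad lists of token ids to the same length and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))  # 0 = padding, 1 = real token
    return input_ids, attention_mask


batch = [[101, 102, 103, 104, 105], [201, 202, 203]]  # hypothetical token ids
ids, mask = pad_left(batch, pad_id=0)

for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

The shorter sequence gets zeros in its mask wherever padding was added, so those positions are ignored by attention.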

Loss Masking:
- selecting which token predictions to penalize (e.g. only the response, not the prompt)

Inputs:    [The][ quick][ brown][ fox][ jumped]
Predicted: [ boy][ brown][ fox][ jumped][ over]
Actual:    [ quick][ brown][ fox][ jumped][ over]
Losses:    [5]    [0.3]    [0.02]  [0.1] [0.3]
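
A minimal sketch of loss masking in plain Python, reusing the per-token losses from the example above. Label positions set to -100 are skipped, which mirrors the default ignore index of PyTorch's cross-entropy loss; the helper names, the token ids, and the choice of a 2-token prompt are assumptions for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_prompt_labels(target_ids, prompt_len):
    """Hide the first prompt_len target positions from the loss."""
    return [IGNORE_INDEX] * prompt_len + list(target_ids[prompt_len:])


def masked_mean_loss(token_losses, labels):
    """Average the per-token losses over the positions that are not masked."""
    kept = [loss for loss, label in zip(token_losses, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


target_ids = [11, 12, 13, 14, 15]          # hypothetical ids of the "Actual" tokens
token_losses = [5, 0.3, 0.02, 0.1, 0.3]    # per-token losses from the example above

unmasked = masked_mean_loss(token_losses, target_ids)
masked = masked_mean_loss(token_losses, mask_prompt_labels(target_ids, prompt_len=2))

print(f"mean loss over all tokens:    {unmasked:.3f}")
print(f"mean loss with prompt masked: {masked:.3f}")
```

With the first two positions masked, the large loss on the first prediction no longer contributes, so the model is only penalized on the tokens it is meant to learn.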

[Link: training Llama for function calls](https://www.youtube.com/watch?v=OQdp-OeG1as)