# PEFT SAMPLE

Create conda environment

In [1]:
# conda create -n trainLLM python=3.11
# conda activate trainLLM
# conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
# conda install -c conda-forge tensorboardx notebook jupyterlab -y
# conda install -c conda-forge opencv pandas matplotlib tqdm -y
# conda install -c conda-forge scikit-learn scikit-image -y
# conda install -c conda-forge numpy scipy -y
# conda install -c anaconda h5py -y
# conda install -c huggingface transformers -y
# conda install -c conda-forge peft accelerate -y

# pip install -q bitsandbytes datasets accelerate loralib scikit-learn joblib ipywidgets
# pip install -U git+https://github.com/huggingface/transformers.git
# pip install -U git+https://github.com/huggingface/peft.git -qqq


Make sure to switch the kernel to the newly created environment in Jupyter Notebook.


In [1]:
# check if GPU is available 
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

True
NVIDIA GeForce RTX 3050


Original notebook: https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing#scrollTo=cg3fiQOvmI3Q

# Model loading

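The original notebook loads the opt-6.7b model, whose half-precision (float16) weights are about 13GB on the Hub (around 7GB of memory in 8-bit). Here we load the smaller opt-1.3b model in 4-bit instead. Its float16 checkpoint is about 2.6GB, and 4-bit loading shrinks the quantized linear-layer weights to roughly a quarter of their float16 size.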
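As a rough sanity check on these sizes, weight memory is approximately the parameter count times the bytes per parameter. The sketch below uses nominal parameter counts taken from the model names (an assumption, not exact figures) and ignores activations, optimizer state, and layers that stay unquantized, so real usage will be higher. The helper name weight_memory_gb is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter counts implied by the model names (an assumption, not exact).
for name, n_params in [("opt-1.3b", 1.3e9), ("opt-6.7b", 6.7e9)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits:>2}-bit: ~{weight_memory_gb(n_params, bits):.2f} GB")
```

The estimate only covers stored weights. Embeddings and layer norms are not quantized, so the measured footprint of a 4-bit model sits somewhat above the 4-bit line.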

In [2]:
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model_id = "facebook/opt-1.3b"

model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    load_in_4bit=True, 
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)


# Post-processing on the model

Next, we need to apply some post-processing to the quantized model to enable training. We freeze all layers and cast the layer-norm parameters to float32 for stability. We also cast the output of the last layer to float32 for the same reason.

In [3]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)
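To see why the small parameters and the final output are moved to float32, it helps to look at how coarse half precision is. The sketch below uses only the standard library's struct module (format code e is IEEE 754 half precision). The helper names to_fp16 and fits_in_fp16 are made up for illustration and are not part of the notebook.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fits_in_fp16(x: float) -> bool:
    """Return False when x is too large to store as a finite half-precision value."""
    try:
        struct.pack("e", x)
        return True
    except OverflowError:
        return False


print(to_fp16(2049.0))        # neighbouring integers start to collapse above 2048
print(to_fp16(1.0 + 2e-4))    # a small offset from 1.0 is rounded away
print(fits_in_fp16(70000.0))  # beyond the largest finite half-precision value (65504)
```

Layer norms compute means and variances, and the logits feed a softmax, so both are sensitive to this kind of rounding and overflow. Keeping them in float32 costs very little memory because those tensors are small.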

# Apply LoRA

Here comes the magic with peft! We wrap the model in a PeftModel and specify low-rank adapters (LoRA) using the get_peft_model utility function from peft.

In [4]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [5]:
from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 3145728 || all params: 714924032 || trainable%: 0.4400087085056892
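The trainable parameter count can be reproduced by hand. For each targeted linear layer, LoRA adds two matrices, A of shape r x d_in and B of shape d_out x r. The sketch below assumes the facebook/opt-1.3b configuration has a hidden size of 2048 and 24 decoder layers; lora_param_count is a hypothetical helper name.

```python
def lora_param_count(d_in: int, d_out: int, r: int, n_modules: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to each targeted linear layer."""
    return n_modules * r * (d_in + d_out)


# Assumed from the facebook/opt-1.3b config: hidden size 2048, 24 decoder layers.
hidden_size, n_layers = 2048, 24
targets_per_layer = 2  # q_proj and v_proj

total = lora_param_count(
    hidden_size, hidden_size, r=16, n_modules=n_layers * targets_per_layer
)
print(total)
```

Under those assumptions the result equals the trainable params figure printed above. The all params figure is well below 1.3 billion most likely because the 4-bit weights are stored packed, two per element, so numel() undercounts the quantized layers.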


# Training

In [6]:
import transformers
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)

trainer = transformers.Trainer(
    model=model, 
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4, 
        gradient_accumulation_steps=4,
        warmup_steps=100, 
        max_steps=200, 
        learning_rate=2e-4, 
        fp16=True,
        logging_steps=20, # reduce for faster logging
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()


You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 20 | 2.694 |
| 40 | 2.6454 |
| 60 | 2.442 |
| 80 | 2.4317 |
| 100 | 2.4181 |
| 120 | 2.5164 |
| 140 | 2.4639 |
| 160 | 2.4342 |
| 180 | 2.4544 |
| 200 | 2.3981 |


TrainOutput(global_step=200, training_loss=2.489819107055664, metrics={'train_runtime': 447.1308, 'train_samples_per_second': 7.157, 'train_steps_per_second': 0.447, 'total_flos': 2186265458638848.0, 'train_loss': 2.489819107055664, 'epoch': 1.28})
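The epoch value in the training output follows from the batch settings. A minimal sketch of the arithmetic, assuming a single GPU and the 2508 examples in the train split of Abirate/english_quotes used in this run:

```python
per_device_batch = 4   # per_device_train_batch_size
grad_accum_steps = 4   # gradient_accumulation_steps
max_steps = 200        # optimizer steps
n_examples = 2508      # train split size of Abirate/english_quotes in this run

# Each optimizer step consumes this many examples (single GPU assumed).
effective_batch = per_device_batch * grad_accum_steps
examples_seen = effective_batch * max_steps
epochs = examples_seen / n_examples

print(effective_batch, examples_seen, round(epochs, 2))
```

Trainer does its own epoch accounting, which rounds the number of steps per epoch slightly differently, so the last digits may not always agree with this estimate.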

# Save the model

The original notebook suggests uploading to the Hugging Face Hub, but we will save it locally.

In [8]:
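import os
os.makedirs("outputs/model", exist_ok=True)  # the directory must exist before the README can be written
with open("outputs/model/README.md", "w") as f: # create an empty README.md file otherwise huggingface fails
    f.write('')
model.save_pretrained("outputs/model")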
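Calling save_pretrained on a PEFT model writes only the adapter weights and config, not the base model. Loading it later therefore means loading the base model first and attaching the adapter. A minimal sketch, assuming the same base checkpoint and 4-bit settings as above; load_finetuned is a hypothetical helper name, not something the notebook defines.

```python
def load_finetuned(base_id: str = "facebook/opt-1.3b",
                   adapter_dir: str = "outputs/model"):
    """Reload the quantized base model and attach the saved LoRA adapter."""
    # Imported inside the function so the helper can be defined without a GPU.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
    )
    # Loads only the adapter weights stored in adapter_dir on top of the base model.
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()  # disable dropout for inference
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    return model, tokenizer
```

In a fresh session this would be used as model, tokenizer = load_finetuned(), after which generation works the same way as with the in-memory model.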



# Inference

There is a separate notebook for inference. The cell below is only a quick check of the fine-tuned model.

In [7]:
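model.config.use_cache = True  # re-enable the KV cache that was disabled for training
model.eval()  # turn off dropout, including LoRA dropout

# move the inputs to the same device as the model
batch = tokenizer("Two things are infinite: ", return_tensors='pt').to(model.device)

with torch.no_grad(), torch.autocast("cuda"):
  output_tokens = model.generate(**batch, max_new_tokens=200)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))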





 Two things are infinite:  The universe and human stupidity.   And one thing is infinite:  The universe and human stupidity.
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity."
I think you mean "The universe and human stupidity
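The output above loops on one sentence, which is common with the default greedy decoding. The sketch below shows a small, hypothetical helper for measuring how repetitive a generation is, plus a set of generation settings that discourage repetition. The settings use real generate arguments (repetition_penalty, no_repeat_ngram_size), but the values are untuned guesses rather than something the notebook tested.

```python
def repeated_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


# Illustrative generation settings (values are untuned guesses).
generation_kwargs = dict(
    max_new_tokens=200,
    repetition_penalty=1.2,   # down-weights tokens that were already generated
    no_repeat_ngram_size=3,   # forbids repeating any 3-token sequence
)

sample = "A\nI think you mean B\nI think you mean B\nI think you mean B"
print(repeated_line_fraction(sample))
```

The settings would be passed as model.generate(**batch, **generation_kwargs), and the helper can then be applied to the decoded text to compare runs.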
