<a href="https://colab.research.google.com/github/eljandoubi/Copilot/blob/main/LightweightFineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lightweight Fine-Tuning Project

* PEFT technique: LoftQ initialization & QLoRA-style training
* Model: facebook/opt-125m
* Evaluation approach: Perplexity
* Fine-tuning dataset: codeparrot/github-code

If you are running this in Colab, restart the runtime after executing the next cell so that the newly installed packages are picked up.

In [None]:
!pip install -r requirements.txt



## Loading and Evaluating a Foundation Model

In the cells below, I will load the pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [None]:
from datasets import load_dataset

In [None]:
train_size=1_000

In [None]:
val_size=train_size//10

In [None]:
test_size=val_size

In [None]:
seed=42

I will load the dataset in streaming mode to avoid downloading the entire dataset, which is about 1 TB.

In [None]:
iter_ds=load_dataset("codeparrot/github-code", streaming=True, trust_remote_code=True,
                split="train").shuffle(seed=seed,
                                       buffer_size=train_size+val_size+test_size)

In [None]:
iter_train_ds=iter_ds.take(train_size)

In [None]:
iter_val_ds=iter_ds.skip(train_size).take(val_size)

In [None]:
iter_test_ds=iter_ds.skip(train_size+val_size).take(test_size)
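
The three splits above are carved out of one shuffled stream purely by position. A minimal sketch of that idea with plain Python iterators follows; `itertools.islice` stands in for `take` and `skip`, and the `split_stream` helper and the toy stream are hypothetical illustrations, not part of `datasets`.

```python
from itertools import islice

def split_stream(make_stream, train_size, val_size, test_size):
    """Carve three consecutive, non-overlapping slices out of a re-iterable stream."""
    # Each split re-reads the stream from the start and skips what earlier splits used.
    train = list(islice(make_stream(), 0, train_size))
    val = list(islice(make_stream(), train_size, train_size + val_size))
    test = list(islice(make_stream(), train_size + val_size,
                       train_size + val_size + test_size))
    return train, val, test

# Toy stream of 1200 integers standing in for the shuffled dataset (an assumption).
train, val, test = split_stream(lambda: iter(range(1200)), 1000, 100, 100)
print(len(train), len(val), len(test))
```

The splits only stay disjoint if the stream yields the same order every time it is read, which is why the shuffle uses a fixed seed.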

In [None]:
from transformers import AutoTokenizer

In [None]:
model_id = "facebook/opt-125m"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
if tokenizer.pad_token is None:
  print("It was None")
  tokenizer.pad_token = tokenizer.eos_token

In [None]:
from transformers import PreTrainedTokenizer

I will split each file into overlapping chunks of at most `max_length` tokens so that every chunk fits within the model's context length.

In [None]:
def chunk_and_encode(
        samples: dict[str, list[str]],
        tokenizer: PreTrainedTokenizer,
        max_len: int,
        stride: int,
        col_name: str) -> dict[str, list[list[int]]]:
    """
    Split text into overlapping chunks and encode them.
    Args:
        samples (dict[str, list[str]]): batch of rows from a Hugging Face dataset
        tokenizer (PreTrainedTokenizer): Hugging Face tokenizer
        max_len (int): maximum number of tokens per chunk
        stride (int): number of tokens shared by consecutive chunks
        col_name (str): name of the text column
    Returns:
        tokenized chunks (dict[str, list[list[int]]])
    """
    chunks = tokenizer(
        samples[col_name],
        truncation=True,
        padding=True,
        max_length=max_len,
        stride=stride,
        return_overflowing_tokens=True,
        )

    return {
        'input_ids': chunks['input_ids'],
        'attention_mask': chunks['attention_mask']
        }

In [None]:
max_length=2**11

In [None]:
stride=max_length//16

In [None]:
col_name="code"

In [None]:
from functools import partial

In [None]:
process_text = partial(chunk_and_encode,
                tokenizer=tokenizer,
                max_len=max_length,
                stride=stride,
                col_name=col_name)
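
To make the role of `stride` concrete, here is a minimal pure-Python sketch of overlapping windows. It assumes `stride` means the number of tokens shared by consecutive chunks, as in the docstring above, and it ignores the special tokens and padding that the real tokenizer adds; `chunk_with_overlap` is a hypothetical helper, not a `transformers` function.

```python
def chunk_with_overlap(token_ids, max_len, stride):
    """Split a token list into windows of at most max_len tokens that overlap by stride tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride  # how far the window start moves each time
    chunks, start = [], 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the current window already reaches the end
        start += step
    return chunks

chunks = chunk_with_overlap(list(range(10)), max_len=4, stride=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the values used in this notebook (`max_length=2048`, `stride=128`), each window would start 1920 tokens after the previous one under these simplifying assumptions.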

In [None]:
from datasets import Dataset,IterableDataset

In [None]:
from typing import Iterator

def gen_from_iterable_dataset(iterable_ds: IterableDataset) -> Iterator[dict]:
    """Yield the rows of an iterable dataset one by one"""
    yield from iterable_ds

In [None]:
def create_dataset(iterable_ds: IterableDataset)->Dataset:
    """Create a dataset from an iterable dataset"""
    iter_token=iterable_ds.map(process_text,
                              remove_columns=iter_ds.column_names,
                              batched=True)
    return Dataset.from_generator(partial(gen_from_iterable_dataset, iter_token))

In [None]:
train_ds=create_dataset(iter_train_ds).shuffle(seed=seed)

In [None]:
val_ds=create_dataset(iter_val_ds)

In [None]:
test_ds=create_dataset(iter_test_ds)

I will load the model in 4-bit NormalFloat (NF4) with double quantization, as described in the QLoRA paper. Computation will be performed in bfloat16 (Brain Floating Point) precision.

In [None]:
import torch

In [None]:
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
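
As a rough illustration of what double quantization saves, the arithmetic below estimates the storage cost per weight. It assumes the block sizes reported in the QLoRA paper (64 weights per block, 32-bit constants, and a second 8-bit quantization over blocks of 256 constants); the `constant_overhead_bits` helper is hypothetical and the actual `bitsandbytes` settings may differ.

```python
def constant_overhead_bits(block_size=64, double_quant=True, second_block_size=256):
    """Average extra bits per weight spent on quantization constants."""
    if not double_quant:
        # One 32-bit constant per block of weights
        return 32 / block_size
    # 8-bit constants per block, plus 32-bit constants for the second-level blocks
    return 8 / block_size + 32 / (block_size * second_block_size)

plain_bits = 4 + constant_overhead_bits(double_quant=False)
double_bits = 4 + constant_overhead_bits(double_quant=True)
print(f"NF4 only: {plain_bits:.3f} bits per weight")
print(f"NF4 + double quantization: {double_bits:.3f} bits per weight")
```

Under these assumptions the constants cost about 0.5 bits per weight without double quantization and about 0.127 bits with it.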

In [None]:
from transformers import AutoModelForCausalLM

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=config)

In [None]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (v_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (q_proj): Linear4bit(in_features=768, out_features=768, bias=True)
            (out_proj): Linear4bit(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear4bit(in_features=768, out_features=3072, bias=True)
          (fc2): Linear4bit(in_features=3072, out_features=768, bias=True)
          (final_layer_nor

Perplexity (PPL) is one of the most common metrics for evaluating language models.

It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
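
The definition can be checked on a toy example. This sketch assumes we already have the natural-log probability the model assigned to each token; `perplexity` is an illustrative helper, not the evaluation function used in this notebook.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) of the tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that always spreads its probability evenly over 4 tokens
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))  # approximately 4
```

A perplexity of about 4 here means the model is as uncertain as if it were choosing uniformly among 4 tokens at every step.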

In [None]:
from transformers import PreTrainedModel

In [None]:
from tqdm import tqdm

In [None]:
def evaluate(model: PreTrainedModel,
             eval_ds: Dataset,
             batch_size: int,
            )->dict[str,float]:

    """
    Compute the perplexity of a model over an evaluation dataset.
    The result is exp(mean of the per-batch losses); padding tokens are
    not masked, so they contribute to each batch loss.
    """
    model.eval()
    losses = []
    device = model.device

    for batch in tqdm(eval_ds.iter(batch_size)):
        input_ids=torch.LongTensor(batch["input_ids"]).to(device)
        # Using the inputs as labels gives the mean next-token cross-entropy of the batch
        with torch.no_grad():
            batch_loss = model(input_ids, labels=input_ids).loss.reshape(1,-1)

        losses.append(batch_loss)
    # Average the per-batch losses, then exponentiate to get the perplexity
    loss = torch.mean(torch.cat(losses))
    try:
        perplexity = torch.exp(loss).item()
    except OverflowError:
        perplexity = float("inf")
    return {"perplexity":perplexity}
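
`evaluate` takes the plain mean of the per-batch losses, so every batch counts equally regardless of how many tokens it holds. One possible alternative, sketched here as an assumption rather than what the notebook does, is to weight each batch loss by its number of scored tokens before exponentiating. `corpus_perplexity` and the toy numbers are hypothetical.

```python
import math

def corpus_perplexity(batch_mean_losses, batch_token_counts):
    """Token-weighted perplexity from per-batch mean losses and token counts."""
    total_tokens = sum(batch_token_counts)
    total_nll = sum(loss * n for loss, n in zip(batch_mean_losses, batch_token_counts))
    return math.exp(total_nll / total_tokens)

losses = [2.0, 3.0]   # toy per-batch mean losses
counts = [900, 100]   # toy number of scored tokens in each batch
unweighted = math.exp(sum(losses) / len(losses))  # mean of batch means
weighted = corpus_perplexity(losses, counts)      # every token counts once
print(unweighted, weighted)
```

The two values coincide when all batches contain the same number of scored tokens, and drift apart when the last batch is smaller or when padding is excluded from the loss.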

In [None]:
batch_size=16

In [None]:
base_score=evaluate(model,test_ds,batch_size)

164it [02:42,  1.01it/s]


In [None]:
base_score

{'perplexity': 24.5625}

Release the cached GPU memory before loading the next model.

In [None]:
torch.cuda.empty_cache()

## Performing Parameter-Efficient Fine-Tuning

In the cells below, I will create a PEFT model from the loaded model, run a training loop, and save the PEFT model weights.

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

In [None]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), ep

In [None]:
from peft import LoftQConfig, LoraConfig, get_peft_model

Set LoftQ to quantize to 4 bits and to run 10 iterations of its alternating optimization.

In [None]:
loftq_config = LoftQConfig(loftq_bits=4,loftq_iter=10)

In [None]:
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    r=64,
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.01,
    bias="none",
    task_type="CAUSAL_LM"
)

In [None]:
model = get_peft_model(model, lora_config)

In [None]:
model.print_trainable_parameters()

trainable params: 10,616,832 || all params: 135,856,128 || trainable%: 7.814761215629522
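
The trainable-parameter count can be reproduced by hand. Each adapted linear layer adds `r * (in_features + out_features)` weights (the `lora_A` and `lora_B` matrices). The sketch below assumes that `target_modules="all-linear"` adapts the six linear layers of each decoder block and leaves the output head alone, and it takes the total parameter count from the printout above; `lora_params` is a hypothetical helper.

```python
def lora_params(in_features, out_features, r):
    """Weights added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

r = 64
hidden, ffn, n_layers = 768, 3072, 12

per_layer = (
    4 * lora_params(hidden, hidden, r)  # q_proj, k_proj, v_proj, out_proj
    + lora_params(hidden, ffn, r)       # fc1
    + lora_params(ffn, hidden, r)       # fc2
)
trainable = n_layers * per_layer
total = 135_856_128  # "all params" as reported by print_trainable_parameters()
percent = 100 * trainable / total
print(f"trainable params: {trainable:,} ({percent:.2f}%)")
```

The result matches the 10,616,832 trainable parameters reported by `print_trainable_parameters()`.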


In [None]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 768, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (layers): ModuleList(
            (0-11): 12 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.01, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=768, out_features=64, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Linear(in_features=64, out_features=768, bias=False)
              

In [None]:
from transformers import DataCollatorForLanguageModeling

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [None]:
from transformers import TrainingArguments

In [None]:
import torch.multiprocessing as mp

In [None]:
model_name = model_id.split("/")[1]

I set the optimizer to Lion, which keeps only one momentum buffer per parameter. The optimizer state is quantized to 8 bits and stored in paged memory that can be moved to the CPU when the GPU runs out of memory.

In [None]:
training_args = TrainingArguments(
        f"{model_name}-finetuned-lora",
        optim="paged_lion_8bit",
        learning_rate=5e-6,
        weight_decay=0.01,
        auto_find_batch_size=True,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        dataloader_num_workers=mp.cpu_count(),
        fp16=True,
        logging_steps=100,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        push_to_hub=False,
        greater_is_better=False,
    )
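
For intuition about the optimizer chosen above, here is a plain-Python sketch of the Lion update rule as described in the Lion paper. It works on lists of floats and leaves out the 8-bit quantization and paging that `paged_lion_8bit` adds; `lion_step`, the beta values, and the toy numbers are assumptions for illustration.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion step: a sign-based update with a single momentum buffer."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # The update direction is the sign of an interpolation of momentum and gradient
        update = sign(beta1 * m + (1 - beta1) * g)
        # Decoupled weight decay is applied together with the signed update
        new_params.append(p - lr * (update + weight_decay * p))
        # The momentum buffer is the only state kept between steps
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum

params, momentum = lion_step([1.0, -1.0], [0.5, -2.0], [0.0, 0.0], lr=0.1)
print(params, momentum)
```

Because every coordinate moves by the same magnitude `lr`, Lion is typically run with a smaller learning rate than Adam-style optimizers.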

In [None]:
from transformers import Trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset= val_ds,
    data_collator=data_collator,
)

In [None]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
0,1.6004,1.527814
1,1.4681,1.490909
2,1.3907,1.482334


TrainOutput(global_step=25608, training_loss=1.5175122290765892, metrics={'train_runtime': 18624.7305, 'train_samples_per_second': 5.5, 'train_steps_per_second': 1.375, 'total_flos': 1.2042190267298611e+17, 'train_loss': 1.5175122290765892, 'epoch': 3.0})

In [None]:
model_saved = f"best-{model_name}-finetuned-lora"

In [None]:
model.save_pretrained(model_saved)

In [None]:
torch.cuda.empty_cache()

## Performing Inference with a PEFT Model

In the cells below, I will load the saved PEFT model weights and evaluate the performance of the trained PEFT model.

In [None]:
from peft import AutoPeftModelForCausalLM

In [None]:
model = AutoPeftModelForCausalLM.from_pretrained(model_saved, config=lora_config, device_map="auto")

In [None]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 768, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (layers): ModuleList(
            (0-11): 12 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.01, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=768, out_features=64, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Linear(in_features=64, out_features=768, bias=False)
              

I merge the LoRA adapter weights into the base weights so that inference no longer needs the separate adapter matrices.

In [None]:
model = model.merge_and_unload()
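
Conceptually, merging folds the low-rank update into each adapted weight matrix: `W_merged = W + (lora_alpha / r) * B @ A`. The sketch below shows this on tiny nested lists and checks that the merged layer gives the same output as the base layer plus the adapter. The helpers and the toy matrices are hypothetical, and this is not how `peft` implements it internally.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_A, lora_B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(lora_B, lora_A)  # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # toy base weight (out=2, in=2)
A = [[1.0, 2.0]]              # toy lora_A (r=1, in=2)
B = [[0.5], [1.0]]            # toy lora_B (out=2, r=1)
merged = merge_lora(W, A, B, lora_alpha=2, r=1)

x = [[1.0], [1.0]]  # column-vector input
merged_out = matmul(merged, x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate_out = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(merged_out, separate_out)
```

After merging, a forward pass costs one matrix multiplication per layer instead of three.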

In [None]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), ep

In [None]:
batch_size=8

In [None]:
tuned_score=evaluate(model,test_ds,batch_size)

327it [03:14,  1.68it/s]


In [None]:
tuned_score

{'perplexity': 9.781842231750488}

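Fine-tuning lowered the test perplexity from about 24.56 to about 9.78. Note that the baseline was measured on the NF4-quantized model, whereas the fine-tuned model was evaluated unquantized, so part of the gap may come from quantization rather than from fine-tuning.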

In [None]:
torch.cuda.empty_cache()

Try it yourself

In [None]:
from transformers import pipeline
import gradio as gr

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

demo = gr.Interface.from_pipeline(pipe)
demo.launch()



