<a href="https://colab.research.google.com/github/eljandoubi/Copilot/blob/main/LightweightFineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lightweight Fine-Tuning Project

* PEFT technique: LoftQ initialization & QLoRA-style training
* Model: GPT-2
* Evaluation approach: Perplexity
* Fine-tuning dataset: codeparrot/github-code

In [None]:
!pip install -r requirements.txt

## Loading and Evaluating a Foundation Model

In the cells below, I will load the pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [None]:
from datasets import load_dataset

In [None]:
train_size=1_000

In [None]:
val_size=train_size//10

In [None]:
test_size=val_size

In [None]:
seed=42

I will load the dataset in streaming mode to avoid downloading the entire 1 TB dataset.

In [None]:
iter_ds=load_dataset("codeparrot/github-code", streaming=True, trust_remote_code=True,
                split="train").shuffle(seed=seed,
                                       buffer_size=train_size+val_size+test_size)

In [None]:
iter_train_ds=iter_ds.take(train_size)

In [None]:
iter_val_ds=iter_ds.skip(train_size).take(val_size)

In [None]:
iter_test_ds=iter_ds.skip(train_size+val_size).take(test_size)
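To make the split logic concrete, here is a minimal pure-Python sketch of a buffered shuffle followed by `take`/`skip`-style slicing. It is an illustration under stated assumptions, not the `datasets` implementation: `buffered_shuffle`, `split_stream`, and the toy `range` stream are hypothetical stand-ins. Because the same seed reproduces the same order each time the stream is re-read, the three slices in this sketch never overlap.

```python
import random
from itertools import islice
from typing import Callable, Iterable, Iterator


def buffered_shuffle(stream: Iterable[int], buffer_size: int, seed: int) -> Iterator[int]:
    """Approximately shuffle a stream using a fixed-size buffer (simplified sketch)."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered item
        buffer[idx] = item           # and replace it with the incoming one
    rng.shuffle(buffer)
    yield from buffer                # drain whatever is left at the end


def split_stream(make_stream: Callable[[], Iterator[int]],
                 train_size: int, val_size: int, test_size: int):
    """Re-read the same deterministic stream three times and slice disjoint ranges."""
    train = list(islice(make_stream(), train_size))
    val = list(islice(make_stream(), train_size, train_size + val_size))
    test = list(islice(make_stream(), train_size + val_size,
                       train_size + val_size + test_size))
    return train, val, test


sizes = (100, 10, 10)  # toy sizes, hypothetical
make_stream = lambda: buffered_shuffle(range(10_000), buffer_size=sum(sizes), seed=42)
train, val, test = split_stream(make_stream, *sizes)
```

The key design point is determinism: the splits stay disjoint only as long as every pass over the stream uses the same seed and buffer size.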

In [None]:
from transformers import AutoTokenizer

In [None]:
model_id = "gpt2"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
if tokenizer.pad_token is None:
  print("it was None")
  tokenizer.pad_token = tokenizer.eos_token

In [None]:
from transformers import PreTrainedTokenizer

I will segment the text so that it can be processed by the model within the context length.

In [None]:
def chunk_and_encode(
        samples: dict[str, list[str]],
        tokenizer: PreTrainedTokenizer,
        max_len: int,
        stride: int,
        col_name: str) -> dict[str, list[list[int]]]:
    """
    Split text into chunks and encode them
    Args:
        samples (dict[str, list[str]]): batch of rows from a Hugging Face dataset
        tokenizer (PreTrainedTokenizer): Hugging Face tokenizer
        max_len (int): the length of each chunk, in tokens
        stride (int): the step, in tokens, between the starts of consecutive
            chunks; consecutive chunks overlap by max_len - stride tokens
        col_name (str): the name of the text column
    Return:
        tokenized chunks (dict[str, list[list[int]]])
    """

    chunks = []
    chunks_mask = []
    pad_id = tokenizer.pad_token_id

    for text in samples[col_name]:
        tokens = tokenizer(text, truncation=False,
                           return_attention_mask=False,
                           padding=False)['input_ids']

        start_idx = 0
        while start_idx < len(tokens):
            end_idx = min(start_idx + max_len, len(tokens))
            chunk = tokens[start_idx:end_idx]
            len_chunk = len(chunk)
            chunk += (max_len - len_chunk) * [pad_id]
            attention_mask = [1] * len_chunk + (max_len - len_chunk) * [0]

            chunks.append(chunk)
            chunks_mask.append(attention_mask)

            start_idx += stride
    return {
        'input_ids': chunks,
        'attention_mask': chunks_mask
    }
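The windowing logic is easier to follow on a toy input. The sketch below repeats the same loop on a plain list of integers, with no tokenizer involved. The helper name `chunk_ids` and the pad value `-1` are assumptions made for illustration only.

```python
def chunk_ids(tokens: list[int], max_len: int, stride: int, pad_id: int):
    """Slide a window of max_len tokens forward by stride, padding the tail."""
    chunks, masks = [], []
    start = 0
    while start < len(tokens):
        chunk = tokens[start:start + max_len]
        n_real = len(chunk)                      # tokens that are not padding
        chunks.append(chunk + [pad_id] * (max_len - n_real))
        masks.append([1] * n_real + [0] * (max_len - n_real))
        start += stride
    return chunks, masks


# 10 toy "tokens", windows of 4, moving 2 tokens at a time
chunks, masks = chunk_ids(list(range(10)), max_len=4, stride=2, pad_id=-1)
for chunk, mask in zip(chunks, masks):
    print(chunk, mask)
```

Only the last window runs past the end of the sequence, so it is the only one that receives padding and zeros in its attention mask.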

In [None]:
max_length=2**10

In [None]:
stride=max_length//16
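With these settings the windows overlap heavily, which is worth quantifying. The arithmetic below is a small sketch: `num_chunks` is a hypothetical helper, and the 2,000-token document is an assumed example, not a measurement from the dataset.

```python
import math


def num_chunks(n_tokens: int, stride: int) -> int:
    """Number of windows produced when a new window starts every `stride` tokens."""
    return math.ceil(n_tokens / stride)


max_length = 2**10
stride = max_length // 16

overlap = max_length - stride          # tokens shared by consecutive chunks
n = num_chunks(2_000, stride)          # chunks for an assumed 2,000-token file
padded_slots = n * max_length          # token slots after padding every chunk

print(overlap, n, padded_slots)
```

Since a new window starts every 64 tokens, a given token can appear in up to `max_length // stride = 16` chunks, so the tokenized dataset is much larger than the raw text and the same tokens are scored several times.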

In [None]:
col_name="code"

In [None]:
from functools import partial

In [None]:
process_text = partial(chunk_and_encode,
                tokenizer=tokenizer,
                max_len=max_length,
                stride=stride,
                col_name=col_name)

In [None]:
from datasets import Dataset,IterableDataset

In [None]:
from collections.abc import Iterator

def gen_from_iterable_dataset(iterable_ds: IterableDataset)->Iterator[dict]:
    """Yield the rows of an iterable dataset one by one"""
    yield from iterable_ds

In [None]:
def create_dataset(iterable_ds: IterableDataset)->Dataset:
    """Create a dataset from an iterable dataset"""
    iter_token=iterable_ds.map(process_text,
                              remove_columns=iterable_ds.column_names,
                              batched=True)
    return Dataset.from_generator(partial(gen_from_iterable_dataset, iter_token))

In [None]:
train_ds=create_dataset(iter_train_ds).shuffle(seed=seed)

In [None]:
val_ds=create_dataset(iter_val_ds)

In [None]:
test_ds=create_dataset(iter_test_ds)

I will load the model in 4-bit NormalFloat (NF4) with double quantization, as described in the QLoRA paper. Computation is performed in bfloat16 (Brain Floating Point, 16-bit) precision.

In [None]:
import torch

In [None]:
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
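The idea behind 4-bit block-wise quantization can be sketched in a few lines of pure Python. This is a simplified illustration, not what `bitsandbytes` does internally: it uses a uniform 16-level codebook instead of the NF4 codebook, and the helper names and the sample block are assumptions.

```python
def quantize_block(block: list[float], levels: list[float]):
    """Map each value to the index of the nearest codebook level after absmax scaling."""
    scale = max(abs(x) for x in block) or 1.0   # absmax scale; avoids dividing by zero
    codes = [
        min(range(len(levels)), key=lambda i: abs(x / scale - levels[i]))
        for x in block
    ]
    return codes, scale


def dequantize_block(codes: list[int], scale: float, levels: list[float]) -> list[float]:
    """Recover approximate values from 4-bit codes and the per-block scale."""
    return [levels[c] * scale for c in codes]


# Uniform codebook with 16 levels in [-1, 1] (NF4 uses non-uniform levels instead)
levels = [-1.0 + 2.0 * i / 15 for i in range(16)]

block = [0.02, -0.5, 0.31, 1.2, -0.07, 0.0, 0.9, -1.1]   # assumed toy weights
codes, scale = quantize_block(block, levels)
restored = dequantize_block(codes, scale, levels)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

Each weight is stored as a 4-bit code plus one shared scale per block. Double quantization, enabled above with `bnb_4bit_use_double_quant=True`, additionally quantizes those per-block scales.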

In [None]:
from transformers import AutoModelForCausalLM

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=config)

In [None]:
model

Perplexity (PPL) is one of the most common metrics for evaluating language models.

It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
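For a tokenized sequence $x_1, \dots, x_t$, this is $\mathrm{PPL} = \exp\left(-\frac{1}{t}\sum_{i=1}^{t}\log p(x_i \mid x_{<i})\right)$. A minimal sketch of the definition, using hand-made log-probabilities rather than model outputs (the `perplexity` helper and the sample values are assumptions for illustration):

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Exponentiated average negative log-likelihood (natural log)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
print(ppl)
```

Lower is better: a perplexity of `k` means the model is, on average, as uncertain as if it were choosing uniformly among `k` tokens.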

In [None]:
from transformers import PreTrainedModel

In [None]:
from tqdm import tqdm

In [3]:
def evaluate(model: PreTrainedModel,
             eval_ds: Dataset,
             batch_size: int,
            )->dict[str,float]:

    """
    Compute the perplexity of a model over an evaluation dataset
    """
    model.eval()
    losses = []
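    for batch in tqdm(eval_ds.iter(batch_size)):
        # Move the batch to the same device as the model
        input_ids=torch.LongTensor(batch["input_ids"]).to(model.device)
        attention_mask=torch.LongTensor(batch["attention_mask"]).to(model.device)
        # Exclude padding positions from the loss (-100 is ignored by the loss)
        labels=input_ids.masked_fill(attention_mask==0, -100)
        with torch.no_grad():
            batch_loss = model(input_ids, attention_mask=attention_mask,
                               labels=labels).loss.reshape(1,-1)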

        losses.append(batch_loss)
    loss = torch.mean(torch.cat(losses))
    try:
        perplexity = torch.exp(loss).item()
    except OverflowError:
        perplexity = float("inf")
    return {"perplexity":perplexity}
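One caveat about the function above: it averages the per-batch mean losses, which equals the per-token mean only when every batch contains the same number of scored tokens. The sketch below contrasts the two averages on made-up numbers. The helper names, losses, and token counts are all hypothetical.

```python
def batch_mean(losses: list[float]) -> float:
    """Plain mean of per-batch losses (what a simple average over batches computes)."""
    return sum(losses) / len(losses)


def token_weighted_mean(losses: list[float], token_counts: list[int]) -> float:
    """Mean loss per token, weighting each batch by its number of scored tokens."""
    total = sum(loss * n for loss, n in zip(losses, token_counts))
    return total / sum(token_counts)


batch_losses = [2.0, 4.0]     # assumed mean loss of each batch
token_counts = [300, 100]     # assumed non-padding tokens in each batch

print(batch_mean(batch_losses))                         # 3.0
print(token_weighted_mean(batch_losses, token_counts))  # 2.5
```

Here the gap matters mostly for the final, smaller batch and for batches with many padded chunks. When comparing the base and fine-tuned models with the same data and batch size, the approximation affects both scores in the same way.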


In [None]:
batch_size=64

In [None]:
base_score=evaluate(model,test_ds,batch_size)

In [None]:
base_score

In [None]:
torch.cuda.empty_cache()

## Performing Parameter-Efficient Fine-Tuning

In the cells below, I will create a PEFT model from the loaded model, run a training loop, and save the PEFT model weights.

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

In [None]:
model

In [None]:
from peft import LoftQConfig, LoraConfig, get_peft_model, prepare_model_for_kbit_training
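To see why LoRA-style adapters count as parameter-efficient, it helps to compare parameter counts for a single layer. The sketch below uses the shape of GPT-2's fused attention projection (`c_attn`, 768 by 2304). The rank `r = 8` is an assumed value for illustration, not a setting taken from this notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of a rank-r adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


d_in, d_out = 768, 2304   # GPT-2 (small) c_attn: hidden size -> 3 * hidden size
r = 8                     # assumed LoRA rank

full = d_in * d_out                       # weights in the frozen base matrix
lora = lora_param_count(d_in, d_out, r)   # weights that are actually trained
print(full, lora, lora / full)
```

With this assumed rank the adapter trains 24,576 parameters against 1,769,472 frozen ones, roughly 1.4% of the layer.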

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.