# Lightweight Fine-Tuning Project

This project uses the following setup:

* PEFT technique: QLoRA
* Model: mistralai/Mistral-7B-v0.1
* Evaluation approach: Perplexity
* Fine-tuning dataset: codeparrot/github-code

!pip install -r requirements.txt

## Loading and Evaluating a Foundation Model

In the cells below, I load the pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
from datasets import load_dataset

In [2]:
train_size=10_000

In [3]:
val_size=train_size//10

In [4]:
test_size=val_size

In [5]:
seed=42

In [6]:
iter_ds=load_dataset("codeparrot/github-code", streaming=True, trust_remote_code=True,
                split="train").shuffle(seed=seed,
                                       buffer_size=train_size+val_size+test_size)

In [7]:
iter_train_ds=iter_ds.take(train_size)

In [8]:
iter_val_ds=iter_ds.skip(train_size).take(val_size)

In [9]:
iter_test_ds=iter_ds.skip(train_size+val_size).take(test_size)
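
The `take`/`skip` calls above carve three non-overlapping windows out of one shuffled stream. The sketch below illustrates the same idea with `itertools.islice` on a toy stream; the helper names `take`, `skip`, and `make_stream` are hypothetical stand-ins, not the `datasets` API. The windows stay disjoint only if every pass over the stream yields the same order, which is what the fixed shuffle seed is relied on for.

```python
from itertools import islice

def take(stream, n):
    """Yield the first n items of a stream."""
    return islice(stream, n)

def skip(stream, n):
    """Yield everything after the first n items of a stream."""
    return islice(stream, n, None)

def make_stream():
    """Toy stand-in for a deterministically shuffled streaming dataset."""
    return iter(range(12))

# Each split re-reads the stream from the start, as separate iterations would.
train = list(take(make_stream(), 8))
val = list(take(skip(make_stream(), 8), 2))
test = list(take(skip(make_stream(), 10), 2))

# The three windows do not share any element.
assert not (set(train) & set(val) or set(val) & set(test) or set(train) & set(test))
```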

In [10]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [11]:
model_id = "mistralai/Mistral-7B-v0.1"

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [13]:
if tokenizer.pad_token is None:
  print("it was None")
  tokenizer.pad_token = tokenizer.eos_token

it was None


In [14]:
from transformers import PreTrainedTokenizer

In [15]:
import torch

In [16]:
def chunk_and_encode(
        samples: dict[str, list[str]],
        tokenizer: PreTrainedTokenizer,
        max_len: int,
        stride: int,
        col_name: str) -> dict[str, list[torch.Tensor]]:
    """
    Split each text into fixed-length token chunks and encode them
    Args:
        samples (dict[str, list[str]]): batch of data rows from a Hugging Face dataset
        tokenizer (PreTrainedTokenizer): Hugging Face tokenizer
        max_len (int): the length of each chunk, in tokens
        stride (int): the step between consecutive chunk starts, in tokens
            (consecutive chunks overlap by max_len - stride tokens)
        col_name (str): the name of the text column
    Return:
        padded chunks and their attention masks (dict[str, list[torch.Tensor]])
    """

    chunks = []
    chunks_mask = []
    pad_id = tokenizer.pad_token_id

    for text in samples[col_name]:
        tokens = tokenizer(text, truncation=False,
                           return_attention_mask=False,
                           padding=False)['input_ids']

        start_idx = 0
        while start_idx < len(tokens):
            end_idx = min(start_idx + max_len, len(tokens))
            chunk = tokens[start_idx:end_idx]
            len_chunk = len(chunk)
            chunk += (max_len - len_chunk) * [pad_id]
            attention_mask = [1] * len_chunk + (max_len - len_chunk) * [0]

            chunks.append(torch.LongTensor(chunk))
            chunks_mask.append(torch.LongTensor(attention_mask))
           
            start_idx += stride
    return {
        'input_ids': chunks,
        'attention_mask': chunks_mask
    }
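
To make the roles of `max_len` and `stride` concrete, here is a minimal sketch that mirrors the chunking loop above on a plain list of toy token ids, with no tokenizer or tensors involved. The function name `sliding_chunks` and the toy values are illustrative assumptions.

```python
def sliding_chunks(tokens, max_len, stride, pad_id):
    """Cut tokens into windows of max_len that start every `stride` tokens."""
    chunks, masks = [], []
    start = 0
    while start < len(tokens):
        chunk = tokens[start:start + max_len]
        n_real = len(chunk)
        n_pad = max_len - n_real
        # Pad the last, shorter windows and mark padding with 0 in the mask.
        chunks.append(chunk + [pad_id] * n_pad)
        masks.append([1] * n_real + [0] * n_pad)
        start += stride
    return chunks, masks

tokens = list(range(1, 11))  # ten toy token ids: 1..10
chunks, masks = sliding_chunks(tokens, max_len=4, stride=2, pad_id=0)

for chunk, mask in zip(chunks, masks):
    print(chunk, mask)
```

With `max_len=4` and `stride=2`, consecutive windows share `max_len - stride = 2` tokens, and the final window is half padding.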

In [17]:
max_length=2**11

In [18]:
stride=max_length//8

In [19]:
col_name="code"

In [20]:
from functools import partial

In [21]:
process_text = partial(chunk_and_encode,
                tokenizer=tokenizer,
                max_len=max_length,
                stride=stride,
                col_name=col_name)

In [22]:
from datasets import Dataset,IterableDataset

In [23]:
def gen_from_iterable_dataset(iterable_ds: IterableDataset)->dict:
    """Create a generator from an iterable dataset"""
    yield from iterable_ds

In [24]:
def create_dataset(iterable_ds: IterableDataset)->Dataset:
    """Create a dataset from an iterable dataset"""
    # Drop the columns of the dataset passed in, not those of the global iter_ds
    iter_token=iterable_ds.map(process_text,
                              remove_columns=iterable_ds.column_names,
                              batched=True)
    return Dataset.from_generator(partial(gen_from_iterable_dataset, iter_token))

In [None]:
train_ds=create_dataset(iter_train_ds).shuffle(seed=seed)


In [None]:
train_ds

In [None]:
val_ds=create_dataset(iter_val_ds)

In [None]:
test_ds=create_dataset(iter_test_ds)

In [None]:
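from transformers import BitsAndBytesConfig
# Recent transformers releases deprecate passing load_in_4bit directly to from_pretrained
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=bnb_config)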

In [None]:
model

Perplexity (PPL) is one of the most common metrics for evaluating language models.

It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
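
As a small worked example of this definition, the sketch below computes perplexity from per-token probabilities using only the standard library. The function name `perplexity` and the probabilities are hypothetical, chosen so the result is easy to check by hand.

```python
import math

def perplexity(token_log_probs):
    """Exponentiated mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical model that assigns probability 0.25 to each of 4 observed tokens.
probs = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity([math.log(p) for p in probs])
print(ppl)
```

A model that is always as uncertain as a uniform choice among four tokens has a perplexity of 4, which is why perplexity is often read as an effective branching factor.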

In [None]:
from transformers import PreTrainedModel

In [None]:
from tqdm import tqdm

In [None]:
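import math
from typing import Iterable

def evaluate(model: PreTrainedModel,
             eval_batches: Iterable[dict]) -> dict[str, float]:

    """
    Compute the perplexity of a model over batches of an evaluation dataset
    (for example, the output of `Dataset.iter(batch_size)`)
    """
    model.eval()
    total_nll = 0.0
    total_tokens = 0
    for batch in tqdm(eval_batches):
        # Dataset.iter yields plain Python lists unless a torch format is set
        input_ids = torch.as_tensor(batch["input_ids"]).to(model.device)
        attention_mask = torch.as_tensor(batch["attention_mask"]).to(model.device)
        # Padding reuses the EOS token, so exclude padded positions from the loss
        labels = input_ids.masked_fill(attention_mask == 0, -100)
        # The model shifts labels by one, so count the targets it actually scores
        n_tokens = (labels[:, 1:] != -100).sum().item()
        if n_tokens == 0:
            continue
        with torch.no_grad():
            batch_loss = model(input_ids, attention_mask=attention_mask,
                               labels=labels).loss
        # Weight each batch mean by its token count to get a per-token average
        total_nll += batch_loss.item() * n_tokens
        total_tokens += n_tokens
    try:
        perplexity = math.exp(total_nll / total_tokens)
    except OverflowError:
        perplexity = float("inf")
    return {"perplexity":perplexity}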
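
How per-batch mean losses are combined matters when batches hold different numbers of non-padding tokens: a plain mean of batch means weights every batch equally, while a token-weighted mean recovers the average over all tokens. A minimal sketch with hypothetical numbers (the function name `combine_losses` is an assumption for illustration):

```python
import math

def combine_losses(batch_means, token_counts):
    """Return (plain mean of batch means, token-weighted mean)."""
    plain = sum(batch_means) / len(batch_means)
    weighted = (sum(m * n for m, n in zip(batch_means, token_counts))
                / sum(token_counts))
    return plain, weighted

# Hypothetical: a short batch with high loss and a long batch with low loss.
batch_means = [2.0, 1.0]
token_counts = [100, 300]
plain, weighted = combine_losses(batch_means, token_counts)
# plain == 1.5, weighted == 1.25

print(math.exp(plain), math.exp(weighted))
```

Because perplexity exponentiates the average, the gap between the two averages is amplified in the final score.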

In [None]:
base_score=evaluate(model,test_ds.iter(8))

In [None]:
base_score

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.