<h1 align="center" style="color:green;font-size: 3em;">
Implementing Fine-tuning Techniques</h1>


This series implements fine-tuning methods described in several papers, specifically LoRA and IA3.

We will do this in three parts:

Part 1:

In this notebook, we will:

- Evaluate the perplexity of a causal language model.
- Fine-tune a sequence classification model using different learning rates and analyze its performance.

### Install Dependencies

In [1]:
%pip install datasets hf_xet -q

Note: you may need to restart the kernel to use updated packages.


### Import Libraries

In [2]:
# importing required libraries
import torch
import torch.nn as nn
import collections
import random
import numpy as np
import math
import matplotlib.pyplot as plt
import warnings

from torch.optim import AdamW
from typing import List
from torch.nn import functional as F
from tqdm import tqdm
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    T5ForSequenceClassification,
    T5Tokenizer,
    Trainer,
    TrainingArguments,
)
from torch.utils.data import DataLoader

warnings.simplefilter("ignore")
print(torch.__version__)

2.6.0+cu124


In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

### Import and Evaluate Models

We work with two main types of models here: causal models and sequence classification models. They differ primarily in what they are used for and in how they process their input.

#### Causal Models

Causal models, also known as autoregressive models, generate the next word in a sequence based on the preceding words. They are used for tasks such as text generation, language modeling, machine translation, and speech recognition. These models operate unidirectionally, predicting the next token using only previous tokens.


#### Sequence Classification Models

Sequence classification models assign a given input sequence to one of a set of predefined categories. They are useful for tasks like sentiment analysis, spam detection, topic classification, and document classification. (Named entity recognition, by contrast, labels individual tokens and is a token classification task.) These models often process the entire input sequence at once, using context from all tokens to make a single classification decision.
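To make the difference in context concrete, here is a minimal sketch of which tokens each position is allowed to use. The helper names `visibility_mask` and `show` are hypothetical and exist only for this illustration; the sketch assumes a left-to-right causal model on one side and an encoder-style classifier that sees the whole sequence on the other.

```python
def visibility_mask(num_tokens, causal):
    """Return mask[i][j] == True when position i may use token j as context."""
    return [
        [(j <= i) if causal else True for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join(" ".join("x" if seen else "." for seen in row) for row in mask)


# Causal: each position sees only itself and earlier positions (lower triangle).
causal_mask = visibility_mask(4, causal=True)
# Bidirectional: every position sees every token in the sequence.
full_mask = visibility_mask(4, causal=False)

print(show(causal_mask))
print()
print(show(full_mask))
```

The first grid is lower-triangular, which is what "predicting the next token using only previous tokens" means in practice; the second grid is fully filled.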

### Causal Model

First, we will initialize a causal model, specifically OPT-125m. When initializing a model, it is important to also initialize the corresponding tokenizer, as it handles the preprocessing of text data into a format that the model can understand.

Read more about the OPT-125m model and its capabilities [here](https://huggingface.co/facebook/opt-125m).
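As a rough illustration of why the tokenizer must match the model, the sketch below maps text to integer ids and back using a tiny made-up vocabulary. Everything here (`toy_vocab`, `encode`, `decode`) is hypothetical; the real OPT tokenizer uses a learned subword vocabulary rather than whitespace splitting, but the principle is the same: the model only ever sees ids, and those ids are meaningful only with respect to the vocabulary it was trained with.

```python
# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"<unk>": 0, "fine": 1, "tuning": 2, "is": 3, "fun": 4}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def encode(text):
    """Lowercase, split on whitespace, and map each word to its id (0 if unknown)."""
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in text.lower().split()]


def decode(ids):
    """Map ids back to their tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode("Fine tuning is fun")
print(ids)                            # [1, 2, 3, 4]
print(decode(ids))                    # fine tuning is fun
print(encode("fine tuning is hard"))  # [1, 2, 3, 0]
```

Pairing a model with a different tokenizer would feed it ids that point at the wrong rows of its embedding table, which is why both are loaded from the same checkpoint name.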


In [4]:
# Import the causal model
causal_model_name = "facebook/opt-125m"
causal_model = AutoModelForCausalLM.from_pretrained(causal_model_name).to(device)

# Import the tokenizer
causal_tokenizer = AutoTokenizer.from_pretrained(causal_model_name)

Our dataset for this task is Wikitext, a collection of articles from Wikipedia. It is widely used for language modeling and text generation tasks because it covers a diverse range of topics. Read more about the Wikitext dataset and its features [here](https://huggingface.co/datasets/Salesforce/wikitext).

In [5]:
# Import wikitext dataset
causal_test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
causal_test_encodings = causal_tokenizer("\n\n".join(causal_test["text"]), return_tensors="pt")

Next, we will evaluate our model's effectiveness on the Wikitext dataset, an industry-standard benchmark for language modeling tasks. We will use perplexity to assess how well our model generates text.

Perplexity measures how well a language model predicts the next word in a sequence. It is calculated as the exponentiated average negative log-likelihood of a sequence. A lower perplexity score indicates better performance, meaning the model is more accurate and confident in its predictions. Perplexity is a standard metric for comparing models and evaluating their ability to generate natural-sounding text. For more details on perplexity, read [here](https://huggingface.co/docs/transformers/en/perplexity).
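For a tokenized sequence $x_1, \dots, x_t$, the definition above can be written as $\mathrm{PPL} = \exp\left(-\frac{1}{t}\sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i})\right)$. The following is a minimal sketch of that formula on hand-picked per-token log-probabilities; the function name `perplexity` and the example values are assumptions made for illustration, not outputs of any model.

```python
import math


def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood of the given tokens."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that spreads probability evenly over 4 choices at every step.
uniform = [math.log(0.25)] * 8
# A model that assigns probability 0.9 to the correct token at every step.
confident = [math.log(0.9)] * 8

print(perplexity(uniform))
print(perplexity(confident))
```

A uniform guess among 4 options gives a perplexity of 4, so perplexity can be read as the effective number of equally likely choices the model is hesitating between at each step. The confident model lands close to 1, the lowest possible value.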

In [6]:
# Implement the method to calculate the perplexity
def calc_perplexity(model, encodings, stride):
  """
  Compute perplexity with a sliding window over one long tokenized sequence.

  Args:
    model: the pretrained causal language model to evaluate
    encodings: tokenizer output containing input_ids of shape (1, seq_len)
    stride: the step size, in tokens, between the starts of consecutive windows

  Returns:
    The perplexity of the model on the given sequence, as a scalar tensor.
  """

  # Define max_length and seq_len
  max_length = 1024
  seq_len = encodings.input_ids.size(1)

  nlls = []
  prev_end_loc = 0

  # Loop through the sequence with the given stride
  for begin_loc in tqdm(range(0, seq_len, stride)):
      end_loc = min(begin_loc + max_length, seq_len)
      trg_len = end_loc - prev_end_loc

      # Get the input_ids for the current chunk and move to the correct device
      input_ids = encodings.input_ids[:, begin_loc: end_loc].to(model.device)
      target_ids = input_ids.clone()

      # Mask out non-target positions
      target_ids[:, :-trg_len] = -100

      # Ensure no gradients are calculated
      with torch.no_grad():
          # Get the model outputs
          outputs = model(input_ids, labels=target_ids)
          neg_log_likelihood = outputs.loss

      nlls.append(neg_log_likelihood)

      prev_end_loc = end_loc
      if end_loc == seq_len:
          break

  # Exponentiate the mean of the per-window losses (each window weighted equally)
  return torch.exp(torch.stack(nlls).mean())
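The windowing logic in `calc_perplexity` can be examined on its own, without a model. The sketch below reproduces only the index arithmetic with small numbers so the windows are easy to read; `sliding_windows` and `token_weighted_mean` are hypothetical helpers written for this illustration. The second helper shows one possible alternative to the equal-weight mean used above, under the assumption that you want each window's loss weighted by how many label positions it left unmasked.

```python
def sliding_windows(seq_len, max_length, stride):
    """Return (begin, end, trg_len) for each window, mirroring the loop above."""
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        # Only positions not covered by the previous window keep their labels.
        trg_len = end - prev_end
        windows.append((begin, end, trg_len))
        prev_end = end
        if end == seq_len:
            break
    return windows


def token_weighted_mean(window_means, token_counts):
    """Average per-window mean losses, weighting each by its token count."""
    total = sum(mean * count for mean, count in zip(window_means, token_counts))
    return total / sum(token_counts)


windows = sliding_windows(seq_len=10, max_length=4, stride=2)
print(windows)  # [(0, 4, 4), (2, 6, 2), (4, 8, 2), (6, 10, 2)]

# Every position is left unmasked in exactly one window.
print(sum(trg_len for _, _, trg_len in windows))  # 10

print(token_weighted_mean([1.0, 3.0], [3, 1]))  # 1.5
```

Each window after the first re-reads up to `max_length - stride` earlier tokens purely as context, which is why a smaller stride is slower but gives each scored token more history. With a fixed stride, most windows have the same `trg_len`, so the equal-weight mean and the token-weighted mean differ mainly through the first and last windows.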

In [7]:
## Calculate the perplexity of our causal model
calc_perplexity(causal_model, causal_test_encodings, 256)

 24%|██▍       | 268/1124 [00:47<03:21,  4.24it/s]