# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

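* PEFT technique: LoRA (Low-Rank Adaptation)
* Model: GPT-2 (`gpt2`), loaded with `GPT2ForSequenceClassification` and a 3-class head
* Evaluation approach: classification accuracy on the test split, before and after fine-tuning
* Fine-tuning dataset: `climatebert/climate_sentiment` (1,000 training and 320 test examples; labels: risk, neutral, opportunity)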

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [2]:
# Install the required version of datasets if needed (uncomment to run)
# You may need to restart the kernel after running this cell
# ! pip install -q "datasets==2.15.0"

In [1]:
# Load the Climate Sentiment dataset from Hugging Face
# Link for more info: https://huggingface.co/datasets/climatebert/climate_sentiment?row=12
from datasets import load_dataset

# Load the train and test splits of the climate_sentiment dataset
splits = ["train", "test"]
ds = {split: ds for split, ds in zip(splits, load_dataset("climatebert/climate_sentiment", split=splits))}

# Show the dataset
ds



{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 1000
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 320
 })}

In [4]:
# Inspect the first element. For labels, 0 is risk, 1 is neutral, and 2 is opportunity
ds['train'][0]

{'text': '− Scope 3: Optional scope that includes indirect emissions associated with the goods and services supply chain produced outside the organization. Included are emissions from the transport of products from our logistics centres to stores (downstream) performed by external logistics operators (air, land and sea transport) as well as the emissions associated with electricity consumption in franchise stores.',
 'label': 1}

### Pre-process datasets
The text in each split must be tokenized into input IDs (plus an attention mask) before it can be passed to the model.

In [58]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 ships without a padding token, so reuse the end-of-text token for padding.
# No vocabulary rebuild is needed: the EOS token already has an ID (50256).
tokenizer.pad_token = tokenizer.eos_token

# Tokenize dataset
tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = ds[split].map(
        lambda x: tokenizer(x['text'], padding='max_length', truncation=True),
        batched=True,
    )

Map: 100%|██████████| 1000/1000 [00:00<00:00, 2218.74 examples/s]
Map: 100%|██████████| 320/320 [00:00<00:00, 2409.65 examples/s]
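
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal pure-Python sketch of what those two options do to a single sequence. `pad_or_truncate` is a hypothetical helper written for illustration, not part of `transformers`; the four token IDs are the first ones of the first training example, and 50256 is GPT-2's end-of-text ID, which this notebook reuses for padding. A `max_length` of 6 is an arbitrary small value chosen for readability.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad      # right-padding, as in the output shown here
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


ids = [14095, 41063, 513, 25]
input_ids, attention_mask = pad_or_truncate(ids, max_length=6, pad_id=50256)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model ignore padded positions, which is why it is passed alongside `input_ids` in the evaluation cells.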


In [59]:
tokenized_dataset

{'train': Dataset({
     features: ['text', 'label', 'input_ids', 'attention_mask'],
     num_rows: 1000
 }),
 'test': Dataset({
     features: ['text', 'label', 'input_ids', 'attention_mask'],
     num_rows: 320
 })}

In [60]:
# Display the first element. The tokenized elements are stored in 'input_ids'
tokenized_dataset['train'][0]

{'text': '− Scope 3: Optional scope that includes indirect emissions associated with the goods and services supply chain produced outside the organization. Included are emissions from the transport of products from our logistics centres to stores (downstream) performed by external logistics operators (air, land and sea transport) as well as the emissions associated with electricity consumption in franchise stores.',
 'label': 1,
 'input_ids': [14095,
  41063,
  513,
  25,
  32233,
  8354,
  326,
  3407,
  12913,
  8971,
  3917,
  351,
  262,
  7017,
  290,
  2594,
  5127,
  6333,
  4635,
  2354,
  262,
  4009,
  13,
  34774,
  389,
  8971,
  422,
  262,
  4839,
  286,
  3186,
  422,
  674,
  26355,
  19788,
  284,
  7000,
  357,
  2902,
  5532,
  8,
  6157,
  416,
  7097,
  26355,
  12879,
  357,
  958,
  11,
  1956,
  290,
  5417,
  4839,
  8,
  355,
  880,
  355,
  262,
  8971,
  3917,
  351,
  8744,
  7327,
  287,
  8663,
  7000,
  13,
  50256,
  50256,
  50256,
  50256,
  50256,
  

In [62]:
from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained(
    'gpt2',
    num_labels=3,
    id2label={0: 'risk', 1: 'neutral', 2: 'opportunity'},
    label2id={'risk': 0, 'neutral': 1, 'opportunity': 2}
)

# Freeze all the parameters of the base model using param.requires_grad = False
# more info here: https://huggingface.co/transformers/v4.2.2/training.html
for param in model.base_model.parameters():
    param.requires_grad = False

# Inspect the classification head. GPT-2 exposes it as model.score; other architectures may name it model.classifier
model.score

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=3, bias=False)
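
GPT-2 classifies a sequence from the hidden state at a single token position, so it matters which position is read when inputs are right-padded. According to the Transformers documentation, `GPT2ForSequenceClassification` uses the last non-padding token when `pad_token_id` is defined in the model config, and otherwise simply takes the last position in each row. The sketch below illustrates that difference in plain Python. `last_token_index` is a hypothetical helper, not the library's implementation, and the fallback for an all-padding row is an assumption made only to keep the function total.

```python
def last_token_index(input_ids, pad_token_id=None):
    """Index of the position a last-token classifier would read."""
    if pad_token_id is None:
        return len(input_ids) - 1      # no pad ID known: take the final position
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_token_id:
            return i                   # last real (non-padding) token
    return 0                           # assumption: all-padding row falls back to position 0


row = [14095, 41063, 513, 25, 50256, 50256]
print(last_token_index(row))                       # reads a padding position
print(last_token_index(row, pad_token_id=50256))   # reads the last real token
```

In this notebook that would presumably correspond to setting `model.config.pad_token_id = tokenizer.pad_token_id` after loading the model. Whether doing so changes the predictions recorded here has not been verified.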

In [63]:
# Print full model parameters
print(model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=3, bias=False)
)


### Evaluate base model on test set

In [65]:
# Switch to evaluation mode (disables dropout). Assigning the return value
# avoids printing the model architecture a second time.
_ = model.eval()

**Below is work in progress**

In [69]:
import torch

input_ids = tokenized_dataset["test"][0]["input_ids"]
attention_mask = tokenized_dataset["test"][0]["attention_mask"]

# Wrap each list in an outer list to add a batch dimension, then convert to tensors
input_ids = torch.tensor([input_ids])
attention_mask = torch.tensor([attention_mask])

with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    predicted_label_index = torch.argmax(logits, dim=-1)
    predicted_label = model.config.id2label[predicted_label_index.item()]


print('Predicted label: ', predicted_label)



Predicted label:  neutral


In [82]:
def predict(model, tokenized_dataset, split="test"):
  """
  Predict labels for the first few examples of a split and compute accuracy.

  Returns the predicted label names and the fraction of correct predictions.
  """
  total = 0
  correct = 0
  predictions = []
  for i, datapoint in enumerate(tokenized_dataset[split]):
    if i >= 5:  # work in progress: only the first 5 examples are evaluated
      break
    input_ids = torch.tensor([datapoint["input_ids"]])
    attention_mask = torch.tensor([datapoint["attention_mask"]])

    with torch.no_grad():
      outputs = model(input_ids, attention_mask=attention_mask)
      logits = outputs.logits
      predicted_label_index = torch.argmax(logits, dim=-1).item()
      predicted_label = model.config.id2label[predicted_label_index]

    predictions.append(predicted_label)
    true_label = datapoint['label']  # integer class ID (0, 1, or 2)

    total += 1
    # Compare integer IDs; comparing the label name to the integer ID never matches
    if predicted_label_index == true_label:
      correct += 1
  accuracy = correct / total
  return predictions, accuracy

# Make predictions on the test set and calculate accuracy
test_predictions, test_accuracy = predict(model, tokenized_dataset, split="test")
print("Test Set Predictions:", test_predictions)
print("Test Set Accuracy:", test_accuracy)

Test Set Predictions: ['neutral', 'neutral', 'neutral', 'neutral', 'neutral']
Test Set Accuracy: 0.4


In [83]:
tokenized_dataset['test']['label']

[0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 2,
 2,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 2,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 2,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 2,
 0,
 1,
 1,
 1,
 1,
 2,
 0,
 0,
 1,
 1,
 2,
 1,
 0,
 0,
 1,
 0,
 2,
 1,
 0,
 0,
 1,
 1,
 1,
 2,
 2,
 0,
 2,
 0,
 1,
 0,
 0,
 0,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 2,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 2,
 1,
 1,
 2,
 1,
 0,
 1,
 1,
 2,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 2,
 0,
 0,
 0,
 1,
 2,
 1,
 1,
 2,
 1,
 1,
 0,
 2,
 0,
 2,
 1,
 1,
 0,
 1,
 1,
 2,
 2,
 1,
 0,
 2,
 0,
 1,
 1,
 1,
 0,
 2,
 1,
 2,
 2,
 1,
 0,
 1,
 1,
 2,
 1,
 2,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 2,
 1,
 1,
 2,
 0,
 2,
 1,
 1,
 2,
 1,
 0,
 0,
 2,
 0,
 0,
 1,
 1,
 0,
 2,
 2,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 2,
 1,
 1,
 0,
 1,
 1,
 2,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 2,
 2,
 1,
 1,
 1,
 0,
 0,
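
A quick sanity check for any accuracy figure is the majority-class baseline: the accuracy obtained by always predicting the most frequent label. The sketch below computes it for the first 20 test labels listed above (a sample only, chosen to keep the example short, so the figure for the full split may differ). `majority_baseline` is a hypothetical helper introduced for illustration.

```python
from collections import Counter

id2label = {0: "risk", 1: "neutral", 2: "opportunity"}


def majority_baseline(labels):
    """Return (most common label ID, accuracy of always predicting it)."""
    label_id, count = Counter(labels).most_common(1)[0]
    return label_id, count / len(labels)


# First 20 labels of the test split, copied from the output above
sample = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 2, 2, 1, 0, 0, 0, 1, 1, 1]
label_id, accuracy = majority_baseline(sample)
print(id2label[label_id], accuracy)  # neutral 0.5
```

A model that predicts the same class for every input, as the untrained head does on the examples shown earlier, cannot beat this baseline, so it is a useful floor when comparing results before and after fine-tuning.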


In [75]:
tokenized_dataset['test'][0]

{'text': 'Sustainable strategy ‘red lines’ For our sustainable strategy range, we incorporate a series of proprietary ‘red lines’ in order to ensure the poorest- performing companies from an ESG perspective are not eligible for investment.',
 'label': 0,
 'input_ids': [50,
  24196,
  4811,
  564,
  246,
  445,
  3951,
  447,
  247,
  1114,
  674,
  13347,
  4811,
  2837,
  11,
  356,
  19330,
  257,
  2168,
  286,
  20622,
  564,
  246,
  445,
  3951,
  447,
  247,
  287,
  1502,
  284,
  4155,
  262,
  27335,
  12,
  9489,
  2706,
  422,
  281,
  13380,
  38,
  6650,
  389,
  407,
  8867,
  329,
  4896,
  13,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50256,
  50

In [50]:
inputs = tokenizer("I love generative AI", return_tensors='pt')

In [55]:
with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)  # normalize over the 3 classes
    predicted_class = torch.argmax(probabilities)  # single input, so this is a scalar class index

In [56]:
predicted_class

tensor(2)
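
The softmax step turns logits into probabilities but does not change which class wins, because softmax preserves the ordering of its inputs. The following stdlib-only sketch shows this; the logits are made-up values for (risk, neutral, opportunity), not outputs of the model, and `softmax` and `argmax` are small helpers written for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                            # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [-0.3, 0.1, 1.2]  # made-up logits for (risk, neutral, opportunity)
probs = softmax(logits)
print(argmax(logits), argmax(probs))  # same index either way
```

Softmax is therefore only needed when the probabilities themselves are of interest, for example to report confidence; for the predicted class alone, taking the argmax of the logits is enough.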

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.
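
As background for this step, the idea behind LoRA can be sketched without any library: the pretrained weight `W` stays frozen and a low-rank product `B A`, scaled by `alpha / rank`, is added to its output. The code below is a conceptual illustration only, not the `peft` library's implementation. The 768 to 2304 shape assumes GPT-2 small's fused query/key/value projection (`c_attn`), and rank 8 and alpha 16 are arbitrary illustrative values, not choices made in this project.

```python
def lora_param_counts(d_in, d_out, rank):
    """Parameters in a full d_out x d_in weight vs. its rank-r LoRA factors."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W is frozen, only A and B would train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


full, lora = lora_param_counts(d_in=768, d_out=2304, rank=8)
print(full, lora, f"{lora / full:.2%}")

# Tiny example: B starts at zero, so the adapted layer initially matches the base layer
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]      # shape 1 x 2 (rank 1)
B = [[0.0], [0.0]]     # shape 2 x 1, zero-initialized
x = [1.0, 1.0]
h = lora_forward(W, A, B, x, alpha=16, rank=1)
print(h)
```

The parameter count shows why the technique is lightweight: for this one matrix, the trainable factors are a small fraction of the full weight.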

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.