<a href="https://colab.research.google.com/github/bartheart/Tuning_LLM_for_sentiment_analysis/blob/main/LLM_tuningipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


In [37]:
from datasets import load_dataset, DatasetDict, Dataset

In [38]:
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer)

In [39]:
from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate
import torch
import numpy as np

# Base model selection

In [40]:
model_checkpoint = 'distilbert-base-uncased'

Define label maps

In [41]:
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0 , "Positive": 1}

Load the base model with a sequence-classification head

In [42]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2, id2label=id2label, label2id = label2id)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Load the dataset

In [43]:
dataset= load_dataset("shawhin/imdb-truncated")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
})

# Preprocessing

Convert the text into numbers (token IDs) that the neural network can process

In [44]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space = True)

In [45]:
def tokenize(examples):
  text = examples['text']

  # Truncate from the left so the end of a long review is kept
  tokenizer.truncation_side = "left"
  tokenized_inputs = tokenizer (
      text,
      return_tensors='np',
      truncation= True,
      max_length= 512
  )

  return tokenized_inputs
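
To make the effect of left-side truncation concrete, here is a toy sketch that works on a plain list of token IDs rather than a real tokenizer. The helper `truncate_ids` is hypothetical and ignores special tokens such as `[CLS]` and `[SEP]`, which a real tokenizer keeps in place. With left truncation the beginning of a long review is dropped and its ending survives.

```python
def truncate_ids(ids, max_length, side="right"):
    """Toy truncation on a list of token IDs (hypothetical helper, no special tokens)."""
    if len(ids) <= max_length:
        return list(ids)
    if side == "left":
        # Drop tokens from the start, keep the last max_length tokens
        return list(ids[-max_length:])
    # Default: drop tokens from the end, keep the first max_length tokens
    return list(ids[:max_length])


ids = list(range(10))  # stand-in for a tokenized review
left = truncate_ids(ids, 4, side="left")
right = truncate_ids(ids, 4, side="right")
print("left :", left)
print("right:", right)
```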

In [46]:
if tokenizer.pad_token is None:
  tokenizer.add_special_tokens({'pad_token': '[PAD]'})
  model.resize_token_embeddings(len(tokenizer))

In [47]:
tokenized_dataset = dataset.map(tokenize, batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [48]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
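
The collator pads each batch to the length of its longest sequence at batch-creation time, instead of padding the whole dataset to one fixed length. The sketch below imitates that idea in plain Python. The function `pad_batch`, the pad ID of 0, and the token IDs are illustrative assumptions, not the library's implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch (toy version)."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        input_ids.append(list(s) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(s) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```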

In [49]:
accuracy = evaluate.load("accuracy")


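def compute_metrics(p):
  predictions, labels = p
  # Pick the class with the highest logit for each example
  predictions = np.argmax(predictions, axis=1)

  # Note: accuracy.compute() already returns {"accuracy": value},
  # so the logged metric ends up as a nested dict
  return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}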

In [50]:
test_list = ["This one is a pass",
             "I love this icecream",
             "what a bad day",
             "this is absoloutely devestating",
             "I like cakes"]

In [51]:
for text in test_list:
  inputs = tokenizer.encode(text, return_tensors="pt")
  logits = model(inputs).logits
  predictions = torch.argmax(logits)

  print(text + " - " + id2label[predictions.tolist()])

This one is a pass - Negative
I love this icecream - Positive
what a bad day - Positive
this is absoloutely devestating - Negative
I like cakes - Positive
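
These labels come from a classification head that is still newly initialized (see the warning printed when the model was loaded), so they are close to guesses. The sketch below shows how a pair of logits maps to class probabilities and why taking the argmax of the logits gives the same label as taking the argmax of the probabilities. The logit values are made up for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for one sentence: nearly equal, as an untrained head might produce
logits = np.array([[0.03, -0.02]])
probs = softmax(logits)
pred = int(np.argmax(probs, axis=-1)[0])

print(probs)  # both probabilities are close to 0.5
print(pred)   # index into id2label
```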


Fine-tuning the model

In [52]:
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    r = 4,
    lora_alpha=32,
    lora_dropout=0.01,
    target_modules = ['q_lin']
)
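
As a rough picture of what this configuration does to each targeted `q_lin` layer, the sketch below writes out the standard LoRA formulation with NumPy: the frozen weight `W` is left alone and a low-rank product `B @ A`, scaled by `lora_alpha / r`, is added on top. This is a sketch of the general technique, not PEFT's actual code. The 768-dimensional layer size is an assumption based on DistilBERT base, the initialization values are illustrative, and dropout is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 768, 4, 32                  # layer width (assumed), rank, lora_alpha
W = rng.normal(size=(d, d))               # frozen pretrained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d))   # trainable, small random init
B = np.zeros((d, r))                      # trainable, zero init
scaling = alpha / r                       # 32 / 4 = 8


def lora_forward(x):
    """Frozen projection plus the scaled low-rank update."""
    return x @ W.T + scaling * (x @ A.T @ B.T)


x = rng.normal(size=(2, d))
out = lora_forward(x)

# B starts at zero, so the adapted layer initially matches the frozen layer
print(np.allclose(out, x @ W.T))
print(A.size + B.size, "trainable values vs", W.size, "frozen values")
```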

In [53]:
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9306847223789819
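
The trainable count can be reproduced by hand. The arithmetic below assumes DistilBERT base has a hidden size of 768 and 6 transformer layers, and that the classification head (`pre_classifier` and `classifier`) stays trainable alongside the LoRA matrices for a `SEQ_CLS` task. Under those assumptions the total matches the printed figure.

```python
hidden, n_layers, r, n_labels = 768, 6, 4, 2

# LoRA adds A (r x hidden) and B (hidden x r) to q_lin in every layer
lora_per_layer = r * hidden + hidden * r
lora_total = n_layers * lora_per_layer

# Classification head: weights plus biases
pre_classifier = hidden * hidden + hidden
classifier = hidden * n_labels + n_labels
head = pre_classifier + classifier

trainable = lora_total + head
all_params = 67_584_004  # taken from the printed output above
pct = 100 * trainable / all_params

print(lora_total, head, trainable)
print(pct)
```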


In [54]:
lr = 0.001
batch_size = 4
num_epochs = 10

In [19]:
!pip install transformers --upgrade



In [55]:
training_args = TrainingArguments(
    output_dir= model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
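
With evaluation and saving both set to `"epoch"`, a checkpoint is written once per pass over the training data. The short calculation below shows how many optimizer steps that is for this notebook's 1000 training rows and batch size of 4. It assumes a single device and no gradient accumulation.

```python
import math

train_rows, batch_size, num_epochs = 1000, 4, 10

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
# Step numbers at which an epoch ends, and so where checkpoints are saved
checkpoint_steps = [steps_per_epoch * e for e in range(1, num_epochs + 1)]

print(steps_per_epoch, total_steps)
print(checkpoint_steps)
```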

In [56]:
trainer = Trainer(
    model= model,
    args= training_args,
    train_dataset= tokenized_dataset["train"],
    eval_dataset= tokenized_dataset["validation"],
    tokenizer= tokenizer,
    data_collator= data_collator,
    compute_metrics= compute_metrics,
)

In [57]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.2825 | {'accuracy': 0.893} |
| 2 | 0.426800 | 0.378236 | {'accuracy': 0.902} |
| 3 | 0.426800 | 0.547771 | {'accuracy': 0.894} |
| 4 | 0.189900 | 0.87096 | {'accuracy': 0.866} |
| 5 | 0.189900 | 0.793918 | {'accuracy': 0.886} |
| 6 | 0.053400 | 0.858498 | {'accuracy': 0.89} |
| 7 | 0.053400 | 0.910178 | {'accuracy': 0.886} |
| 8 | 0.017700 | 0.967726 | {'accuracy': 0.881} |
| 9 | 0.017700 | 0.974122 | {'accuracy': 0.886} |
| 10 | 0.016600 | 0.943908 | {'accuracy': 0.887} |
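
Training loss keeps falling while validation loss is lowest after the first epoch and rises afterwards, which is the usual sign of overfitting. The sketch below picks the best epoch from the logged numbers in two ways. As far as I know, when `metric_for_best_model` is left unset, `load_best_model_at_end=True` compares checkpoints by evaluation loss, so the two criteria can select different checkpoints.

```python
# (epoch, validation loss, accuracy) copied from the training log above
results = [
    (1, 0.2825, 0.893),
    (2, 0.378236, 0.902),
    (3, 0.547771, 0.894),
    (4, 0.87096, 0.866),
    (5, 0.793918, 0.886),
    (6, 0.858498, 0.89),
    (7, 0.910178, 0.886),
    (8, 0.967726, 0.881),
    (9, 0.974122, 0.886),
    (10, 0.943908, 0.887),
]

best_by_loss = min(results, key=lambda row: row[1])
best_by_accuracy = max(results, key=lambda row: row[2])

print("lowest validation loss:", best_by_loss)
print("highest accuracy      :", best_by_accuracy)
```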


Trainer is attempting to log a value of "{'accuracy': 0.893}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Checkpoint destination directory distilbert-base-uncased-lora-text-classification/checkpoint-250 already exists and is non-empty.Saving will proceed but saved results may be invalid.

TrainOutput(global_step=2500, training_loss=0.1408770809173584, metrics={'train_runtime': 475.5075, 'train_samples_per_second': 21.03, 'train_steps_per_second': 5.258, 'total_flos': 1113026652407424.0, 'train_loss': 0.1408770809173584, 'epoch': 10.0})
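
The "dropped this attribute" warning in the log says that `eval/accuracy` was logged as a dict rather than a number. That is consistent with `compute_metrics` wrapping the result of `accuracy.compute()`, which is already a dict of the form `{"accuracy": value}`, inside a second dict. Below is a sketch of a version that returns a flat dict. To stay self-contained it computes accuracy directly with NumPy instead of the `evaluate` metric, and the name `compute_metrics_flat` and the sample logits are made up for illustration.

```python
import numpy as np


def compute_metrics_flat(eval_pred):
    """Return {"accuracy": float} so the value can be logged as a scalar."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == labels).mean())}


# Made-up logits for four examples and their true labels
logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.4, 0.9]])
labels = np.array([0, 1, 1, 1])

metrics = compute_metrics_flat((logits, labels))
print(metrics)
```

With the `evaluate` metric used in this notebook, returning `accuracy.compute(predictions=predictions, references=labels)` directly should have the same effect.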

In [26]:
!pip install torch --upgrade



In [34]:
import torch
print(torch.version.cuda)

12.1


In [66]:
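# Use whichever accelerator is available instead of hard-coding Apple's 'mps' backend
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device)
print("Tuned model predictions:")
for text in test_list:
  inputs = tokenizer.encode(text, return_tensors="pt").to(device)
  logits = model(inputs).logits
  # Index of the highest logit for the single input sentence
  predictions = torch.max(logits, 1).indices

  print(text + " - " + id2label[predictions.tolist()[0]])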