<a href="https://colab.research.google.com/github/dariodona/Sexism-in-Sentiment-Analysis-data/blob/main/notebook/distilbert_base_uncased_new_lora_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook was used to train a DistilBERT base model for sentiment analysis on the truncated IMDb dataset.

In [1]:
import pandas as pd

Importing the tokenizer, model, data collator, and trainer classes from the transformers library.

In [2]:
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer
)

Installing PEFT, which reduces the number of trainable parameters by fine-tuning through a LoRA (Low-Rank Adaptation) configuration.

In [None]:
!pip install peft


In [None]:
from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig

In [None]:
!pip install evaluate

In [None]:
import evaluate
import torch
import numpy as np
import datasets

Setting the model checkpoint, which in this case is distilbert-base-uncased.

In [None]:
model_checkpoint = 'distilbert-base-uncased'

In [None]:
# define label maps
id2label = {
    0: 'Negative',
    1: 'Positive'
}

label2id = {
    'Negative': 0,
    'Positive': 1
}
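The two label maps must stay exact inverses of each other. One option, shown here as a minimal sketch rather than the approach the notebook takes, is to define `id2label` once and derive `label2id` from it.

```python
# define the id -> label map once
id2label = {0: 'Negative', 1: 'Positive'}

# derive the label -> id map by inverting it, so the two cannot drift apart
label2id = {label: idx for idx, label in id2label.items()}

print(label2id)  # {'Negative': 0, 'Positive': 1}
```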

In [None]:
# generate classification model from model checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels = 2,
    id2label = id2label,
    label2id = label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading the truncated IMDb dataset.

In [None]:
from datasets import load_dataset
dataset = load_dataset('shawhin/imdb-truncated')


In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
})


In [None]:
# create tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space = True)

In [None]:
# define tokenization function

def tokenize_function(examples):
  # extract text
  text = examples['text']

  # tokenize and truncate text
  tokenizer.truncation_side = 'left'
  tokenized_inputs = tokenizer(
      text,
      return_tensors = 'np',
      truncation = True,
      max_length = 512

  )
  return tokenized_inputs
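Setting `truncation_side = 'left'` means that a review longer than `max_length` loses its beginning and keeps its end. The sketch below illustrates the idea on plain lists. The helper `truncate` and the token ids are hypothetical, and the real tokenizer additionally keeps its special tokens, which this sketch ignores.

```python
def truncate(token_ids, max_length, side='right'):
    """Keep at most max_length ids, dropping tokens from the given side."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == 'left':
        return list(token_ids[-max_length:])  # drop the beginning, keep the end
    return list(token_ids[:max_length])       # drop the end, keep the beginning

ids = list(range(10))  # hypothetical token ids 0..9
left = truncate(ids, 4, side='left')
right = truncate(ids, 4, side='right')
print(left)   # [6, 7, 8, 9]
print(right)  # [0, 1, 2, 3]
```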

In [None]:
# adding a pad token to the set, if it doesn't already exist

if tokenizer.pad_token is None:
  tokenizer.add_special_tokens({'pad_token': '[PAD]'})
  model.resize_token_embeddings(len(tokenizer))



In [None]:
# tokenize training and validation sets

tokenized_dataset = dataset.map(tokenize_function, batched = True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [None]:
# creating data collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
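The data collator pads each batch only up to the length of its longest example, instead of padding every example to a fixed length. A minimal pure-Python sketch of that idea follows. The helper `pad_batch`, the token ids, and the pad id of 0 are assumptions for illustration, and the real collator also returns tensors.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# two hypothetical tokenized examples of different lengths
batch = pad_batch([[101, 2009, 102], [101, 2009, 2001, 2204, 102]])
print(batch['input_ids'][0])       # [101, 2009, 102, 0, 0]
print(batch['attention_mask'][0])  # [1, 1, 1, 0, 0]
```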

In [None]:
# import accuracy evaluation metrics
accuracy = evaluate.load('accuracy')

# define an evaluation function to pass into the trainer later on
def compute_metrics(p):
  prediction, labels = p
  predictions = np.argmax(prediction, axis = 1)

  return {'accuracy': accuracy.compute(predictions = predictions, references = labels)}


Running inference with the model before fine-tuning, to gauge its capability on a few examples.

In [None]:
# Untrained Models performance check

# defining a list of examples
text_list = ['It was good',
             'Not a fan., Dont Recommend.',
             'Better than the first one.',
             'This is not even worth watching once.',
             'This one is a pass']

print('Untrained model predictions \n\n')
print('---------------------------')

for text in text_list:
  # tokenize text
  inputs = tokenizer.encode(text, return_tensors = 'pt')
  # compute logits
  logits = model(inputs).logits
  # convert logits to labels
  predictions = torch.argmax(logits)

  print(text + " - " + id2label[predictions.tolist()])



Untrained model predictions 


---------------------------
It was good - Negative
Not a fan., Dont Recommend. - Negative
Better than the first one. - Negative
This is not even worth watching once. - Negative
This one is a pass - Negative
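Every example is labeled Negative, which is consistent with the classification head being newly initialized, as the loading warning reports. Taking the argmax hides how close the two scores are, so it can help to turn the logits into probabilities. A minimal sketch with a hand-written softmax follows; the logit values are hypothetical and are not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: 'Negative', 1: 'Positive'}
logits = [0.05, -0.02]  # hypothetical logits for one example
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(id2label[pred], round(probs[pred], 3))
```

A probability only slightly above 0.5 indicates that the prediction is close to a coin flip, even though the printed label looks decisive.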


In [None]:
# before fine-tuning, the model labels every example Negative, so it gets three right and two wrong
# next, fine-tune the model with LoRA on the loaded corpus and check the results
# LoRA - Low-Rank Adaptation

In [None]:
peft_config = LoraConfig(task_type = 'SEQ_CLS', # sequence classification
                         r = 4, # rank of the low-rank update matrices
                         lora_alpha = 32, # scaling factor; the update is scaled by lora_alpha / r
                         lora_dropout = 0.01, # dropout probability applied to the LoRA layers
                         target_modules = ['q_lin'] # apply LoRA only to the query projections (q_lin)
                         )
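In standard LoRA, the frozen weight W is left untouched and a low-rank product B·A, scaled by `lora_alpha / r`, is added to it. The sketch below works this through on tiny matrices in pure Python. The hidden size of 3 and the values in `A` are hypothetical. The zero initialization of `B` follows the usual LoRA convention, which is assumed here rather than read from the library.

```python
r, lora_alpha = 4, 32
scaling = lora_alpha / r  # 8.0 for the configuration in this notebook

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

d = 3  # hypothetical tiny hidden size (DistilBERT uses 768)
A = [[0.1 * (i + j) for j in range(d)] for i in range(r)]  # r x d, arbitrary values
B = [[0.0] * r for _ in range(d)]                          # d x r, zero-initialized

# the update added to the frozen weight: scaling * (B @ A)
delta = [[scaling * v for v in row] for row in matmul(B, A)]
print(scaling)
print(delta)
```

Because `B` starts at zero, the update is zero before training, so the adapted layer initially behaves like the original one.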

In [None]:
model = get_peft_model(model,
                       peft_config)

model.print_trainable_parameters()

trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9306847223789819
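The trainable count can be reproduced by hand. The sketch below assumes that distilbert-base-uncased has 6 layers with hidden size 768, and that the `pre_classifier` and `classifier` layers are trained alongside the LoRA matrices. The arithmetic is consistent with the printed figure. The total parameter count is copied from the output above rather than derived.

```python
hidden_size = 768  # DistilBERT hidden dimension
num_layers = 6     # DistilBERT transformer layers
r = 4              # LoRA rank from the config
num_labels = 2

# each q_lin gets A (r x hidden) and B (hidden x r)
lora_params = num_layers * (r * hidden_size + hidden_size * r)

# classification head: weights plus biases
pre_classifier = hidden_size * hidden_size + hidden_size
classifier = hidden_size * num_labels + num_labels

trainable = lora_params + pre_classifier + classifier
all_params = 67_584_004  # taken from the printed output

print(lora_params)                               # 36864
print(trainable)                                 # 628994
print(round(100 * trainable / all_params, 4))    # 0.9307
```

Only 36,864 of the trainable parameters belong to LoRA itself; the rest is the classification head.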


In [None]:
# hyperparameters

lr = 1e-3 # size of optimization step
batch_size = 4 # number of examples processed per optimization step
num_epochs = 4 # number of times the model runs through the training data


In [None]:
import importlib
importlib.invalidate_caches()  # a function, not a module, so it is called from Python instead of `python -m`


In [None]:
!pip install transformers[torch]

In [None]:
!pip install accelerate -U

In [None]:
!pip install accelerate==0.21.0

Collecting accelerate==0.21.0
  Downloading accelerate-0.21.0-py3-none-any.whl.metadata (17 kB)
Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.25.0
    Uninstalling accelerate-0.25.0:
      Successfully uninstalled accelerate-0.25.0
Successfully installed accelerate-0.21.0

In [None]:
!pip install --upgrade transformers
!pip install --upgrade accelerate


Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl.metadata (18 kB)
Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.21.0
    Uninstalling accelerate-0.21.0:
      Successfully uninstalled accelerate-0.21.0
Successfully installed accelerate-0.25.0

In [None]:
!pip install --upgrade pip setuptools

In [None]:
# define training arguments

training_args = TrainingArguments(
    output_dir=model_checkpoint + "-new_lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

In [None]:
# create trainer object

trainer = Trainer(
    model = model, # our peft model
    args = training_args, # hyperparameters
    train_dataset = tokenized_dataset['train'], # training data
    eval_dataset = tokenized_dataset['validation'], # validation data
    tokenizer = tokenizer, # defining tokenizer
    data_collator = data_collator, # this will dynamically pad examples in each batch
    compute_metrics = compute_metrics, # evaluates the model using the compute metrics function
)

In [None]:
# train the model now

trainer.train()


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.373305,{'accuracy': 0.886}
2,0.436300,0.456347,{'accuracy': 0.869}
3,0.436300,0.523157,{'accuracy': 0.889}
4,0.160100,0.551162,{'accuracy': 0.89}
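In this log the validation loss rises every epoch while accuracy stays roughly flat and the training loss falls, which hints at overfitting after the first epoch. It also matters for `load_best_model_at_end=True`: when no `metric_for_best_model` is set, the Trainer is understood to compare checkpoints by evaluation loss, so the restored checkpoint may not be the most accurate one. A small sketch using the numbers copied from the table:

```python
# (epoch, validation loss, accuracy) copied from the training log
log = [
    (1, 0.373305, 0.886),
    (2, 0.456347, 0.869),
    (3, 0.523157, 0.889),
    (4, 0.551162, 0.890),
]

best_by_loss = min(log, key=lambda row: row[1])  # lowest validation loss
best_by_acc = max(log, key=lambda row: row[2])   # highest accuracy

print(best_by_loss[0])  # 1
print(best_by_acc[0])   # 4
```

If accuracy is the quantity of interest, setting `metric_for_best_model` in `TrainingArguments` is one way to make the selection explicit.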


Trainer is attempting to log a value of "{'accuracy': 0.886}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.869}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.889}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.89}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
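These warnings appear because `compute_metrics` returns `{'accuracy': accuracy.compute(...)}`, while `accuracy.compute` already returns a dict of the form `{'accuracy': value}`. The logged value is therefore a nested dict rather than a number. A minimal sketch of a flat-return variant follows. It uses a pure-Python accuracy stand-in, the hypothetical `compute_metrics_flat`, so that it runs without downloading the metric.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def compute_metrics_flat(eval_pred):
    """Return a flat {'accuracy': float} dict, which can be logged as a scalar."""
    logits, labels = eval_pred
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(labels)}

# hypothetical logits for three examples, with their true labels
metrics = compute_metrics_flat(([[0.1, 0.9], [2.0, -1.0], [0.3, 0.2]], [1, 0, 1]))
print(metrics)
```

With the `evaluate` metric used in this notebook, returning `accuracy.compute(predictions=predictions, references=labels)` directly should have the same effect.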


TrainOutput(global_step=1000, training_loss=0.2982009811401367, metrics={'train_runtime': 11406.9346, 'train_samples_per_second': 0.351, 'train_steps_per_second': 0.088, 'total_flos': 444421902910080.0, 'train_loss': 0.2982009811401367, 'epoch': 4.0})
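The figures in `TrainOutput` follow from the hyperparameters. The sketch below reproduces them, assuming a single device, no gradient accumulation, and the Trainer's default `logging_steps` of 500. That default would also explain why the first epoch shows "No log" and why epochs 2 and 3 report the same training loss.

```python
import math

num_train_examples = 1000
batch_size = 4
num_epochs = 4
train_runtime = 11406.9346  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

logging_steps = 500  # assumed Trainer default, not set in this notebook
# most recent logging step reached at the end of each epoch (0 means nothing logged yet)
last_logged = [(epoch * steps_per_epoch // logging_steps) * logging_steps
               for epoch in range(1, num_epochs + 1)]

print(total_steps)                    # 1000
print(round(samples_per_second, 3))   # 0.351
print(round(steps_per_second, 3))     # 0.088
print(round(train_runtime / 3600, 2)) # 3.17 (hours)
print(last_logged)                    # [0, 500, 500, 1000]
```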

In [None]:
import torch

# Check if a GPU is available and use it, otherwise, use CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Move your model to the selected device and switch to evaluation mode (disables dropout)
model.to(device)
model.eval()

print(' Trained Model Prediction ')
print(' ------------------------ ')

for text in text_list:
    inputs = tokenizer.encode(text, return_tensors='pt').to(device)
    logits = model(inputs).logits
    predictions = torch.argmax(logits, dim=1)
    print(text + " - " + id2label[predictions.item()])


 Trained Model Prediction 
 ------------------------ 
It was good - Positive
Not a fan., Dont Recommend. - Negative
Better than the first one. - Positive
This is not even worth watching once. - Negative
This one is a pass - Negative


In [None]:
import os

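# Specify the directory path (Windows-style; on Linux or Colab the backslashes are not separators,
# so this creates a single directory with this literal name in the working directory)
directory_path = 'C:\\PythonML\\Text_Sentiment_Model'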

# Create the directory if it doesn't exist
os.makedirs(directory_path, exist_ok=True)

# Save the model's state dictionary to a local file path
torch.save(model.state_dict(), os.path.join(directory_path, 'distilbert-base-uncased-new_lora-text-classification.pth'))
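The hard-coded Windows path is not portable: on a Linux runtime such as Colab it becomes one oddly named directory. A minimal sketch of a portable alternative with `pathlib` follows. The helper `checkpoint_path` is hypothetical, and the example writes into a temporary directory only so that it can be run safely.

```python
import tempfile
from pathlib import Path

def checkpoint_path(base_dir,
                    filename='distilbert-base-uncased-new_lora-text-classification.pth'):
    """Create <base_dir>/Text_Sentiment_Model if needed and return the file path inside it."""
    directory = Path(base_dir) / 'Text_Sentiment_Model'
    directory.mkdir(parents=True, exist_ok=True)
    return directory / filename

with tempfile.TemporaryDirectory() as tmp:
    target = checkpoint_path(tmp)
    directory_created = target.parent.is_dir()
    print(target.name)

# in the notebook, the result could be passed on as: torch.save(model.state_dict(), target)
```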



In [None]:
new_test_list = [
    "You're an idiot.",
    "You belong with me.",
    "I love you baby.",
    "I want you to kiss me,",
    "I feel good today.",
    "You're an asshole."
]

In [None]:
print(' Trained Model Prediction ')
print(' ------------------------ ')

for text in new_test_list:
    inputs = tokenizer.encode(text, return_tensors='pt').to(device)
    logits = model(inputs).logits
    predictions = torch.argmax(logits, dim=1)
    print(text + " - " + id2label[predictions.item()])

 Trained Model Prediction 
 ------------------------ 
You're an idiot. - Negative
You belong with me. - Positive
I love you baby. - Positive
I want you to kiss me, - Positive
I feel good today. - Positive
You're an asshole. - Negative


In [None]:
# Authenticate with Hugging Face (if you haven't already)
# !transformers-cli login

2023-12-29 10:58:38.163142: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-29 10:58:38.163216: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-29 10:58:38.166104: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
ERROR! `huggingface-cli login` uses an outdated login mechanism that is not compatible with the Hugging Face Hub backend anymore. Please use `huggingface-cli login` instead.


In [None]:
# !huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: Traceback (most recent call last):
  File "/usr/local/bin/huggingface-cli", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/commands/huggingface_cli.py", line 49, in main
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/commands/user.py", line

In [None]:
# Specify the model name and description for your Hugging Face Space
# model_name = 'distilbert-base-withnew-lora-text-sentiment-classification'
# model_description = "This model uses DistilBERT, a lightweight version of BERT, for text sentiment classification. It was fine-tuned with LoRA on a truncated IMDb review dataset and classifies text as Positive or Negative. It reached about 89% accuracy on the validation set. It can be applied to text such as social media posts or customer reviews, although it was trained only on movie reviews."

# Push the model to your Hugging Face Space
# !huggingface-cli repo create $model_name  # Create a new repository
# !huggingface-cli upload $model_name  # Upload your model to the repository

# print(f"Model {model_name} has been uploaded to your Hugging Face Space.")

Not logged in
Traceback (most recent call last):
  File "/usr/local/bin/huggingface-cli", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/commands/huggingface_cli.py", line 48, in main
    service = args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/commands/upload.py", line 164, in __init__
    raise ValueError(f"'{repo_name}' is not a local file or folder. Please set `local_path` explicitly.")
ValueError: 'distilbert-base-withnew-lora-text-sentiment-classification' is not a local file or folder. Please set `local_path` explicitly.
Model distilbert-base-withnew-lora-text-sentiment-classification has been uploaded to your Hugging Face Space.


In [None]:
# !zip -r project_files.zip .


  adding: .config/ (stored 0%)
  adding: .config/.last_opt_in_prompt.yaml (stored 0%)
  adding: .config/logs/ (stored 0%)
  adding: .config/logs/2023.12.19/ (stored 0%)
  adding: .config/logs/2023.12.19/14.20.35.996145.log (deflated 56%)
  adding: .config/logs/2023.12.19/14.20.16.265569.log (deflated 86%)
  adding: .config/logs/2023.12.19/14.20.35.154355.log (deflated 57%)
  adding: .config/logs/2023.12.19/14.19.39.750127.log (deflated 91%)
  adding: .config/logs/2023.12.19/14.20.06.327238.log (deflated 58%)
  adding: .config/logs/2023.12.19/14.20.25.266295.log (deflated 58%)
  adding: .config/.last_update_check.json (deflated 22%)
  adding: .config/default_configs.db (deflated 98%)
  adding: .config/config_sentinel (stored 0%)
  adding: .config/gce (stored 0%)
  adding: .config/configurations/ (stored 0%)
  adding: .config/configurations/config_default (deflated 15%)
  adding: .config/.last_survey_prompt.yaml (stored 0%)
  adding: .config/active_config (stored 0%)
  adding: C:\PythonM

In [None]:
# from google.colab import files

# files.download('/content/project_files.zip')


FileNotFoundError: ignored

In [None]:
a_test_list = [
    "You're an idiot.",
    "You belong with me.",
    "I love you baby.",
    "I want you to kiss me,",
    "I feel good today.",
    "You're an asshole.",
    "You know me now",
    "I can be your nightmare",
    "Who do you think you're?",
    "I'm going to kill you"
]

In [None]:
print(' Trained Model Prediction ')
print(' ------------------------ ')

for text in a_test_list:
    inputs = tokenizer.encode(text, return_tensors='pt').to(device)
    logits = model(inputs).logits
    predictions = torch.argmax(logits, dim=1)
    print(text + " - " + id2label[predictions.item()])

 Trained Model Prediction 
 ------------------------ 
You're an idiot. - Negative
You belong with me. - Positive
I love you baby. - Positive
I want you to kiss me, - Positive
I feel good today. - Positive
You're an asshole. - Negative
You know me now - Positive
I can be your nightmare - Negative
Who do you think you're? - Negative
I'm going to kill you - Negative


In [None]:
a_test_list = [
    "You're an idiot.",
    "You belong with me.",
    "I love you baby.",
    "I want you to kiss me,",
    "Fantastic! it is",
    "You've always wisely chosen the wrong path",
    "You know the not so wise version of me",
    "I can be your joyous nightmare",
    "Who do you think you're? hottest guy ever",
    "I'm going to kill you, my love"
]

In [None]:
print(' Trained Model Prediction ')
print(' ------------------------ ')

for text in a_test_list:
    inputs = tokenizer.encode(text, return_tensors='pt').to(device)
    logits = model(inputs).logits
    predictions = torch.argmax(logits, dim=1)
    print(text + " - " + id2label[predictions.item()])

 Trained Model Prediction 
 ------------------------ 
You're an idiot. - Negative
You belong with me. - Positive
I love you baby. - Positive
I want you to kiss me, - Positive
Fantastic! it is - Positive
You've always wisely chosen the wrong path - Positive
You know the not so wise version of me - Negative
I can be your joyous nightmare - Positive
Who do you think you're? hottest guy ever - Negative
I'm going to kill you, my love - Positive
