Sentiment Classification Project

Project Overview

This project demonstrates how to fine-tune a pre-trained model, DistilBERT, to classify text as having either positive or negative sentiment. It uses a small dataset, and the workflow is tuned to run on limited hardware, such as a laptop without a dedicated GPU.

Steps to Complete the Project

Step 1: Understanding the Goal

The aim is to take a sentence and predict whether it has a positive or negative sentiment.

For example, "I love this movie" is positive, while "I hate this product" is negative.

Step 2: Preparing the Dataset

A small dataset in CSV format is used with two columns:

text: The sentence to analyze.

label: The sentiment (1 for positive, 0 for negative).

Example:

text,label
I love this movie,1
This film was terrible,0
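
Before training, a file in this layout can be read and sanity-checked with the standard library alone. The following is a minimal sketch, not part of the original project: the `validate_rows` helper is hypothetical, and the sample is kept in memory instead of reading the real CSV file.

```python
import csv
import io

# In-memory stand-in for the CSV file; the two rows are the ones shown above.
SAMPLE_CSV = "text,label\nI love this movie,1\nThis film was terrible,0\n"

def validate_rows(csv_text):
    """Parse CSV text and return (text, label) pairs, rejecting malformed rows."""
    rows = []
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        text = (row.get("text") or "").strip()
        label = (row.get("label") or "").strip()
        if not text:
            raise ValueError(f"line {line_no}: empty text")
        if label not in {"0", "1"}:
            raise ValueError(f"line {line_no}: label must be 0 or 1, got {label!r}")
        rows.append((text, int(label)))
    return rows

rows = validate_rows(SAMPLE_CSV)
print(rows)
```

Sentences that contain commas need to be wrapped in double quotes in the CSV file so that they stay in a single column.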

Step 3: Tokenization

Sentences are converted into a format that the machine learning model can understand.

Tokenization breaks sentences into smaller pieces (whole words, or subword fragments for words that are not in the vocabulary) and assigns a numeric ID to each piece.
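
To make the idea concrete, here is a toy sketch of greedy longest-match subword splitting, the general approach used by WordPiece-style tokenizers. It is not the real DistilBERT tokenizer: the eight-entry `VOCAB`, the `wordpiece` function, and the `encode` function are all hypothetical and exist only for illustration (the real vocabulary has about 30,000 entries).

```python
# Toy vocabulary (hypothetical); real tokenizers ship a much larger one.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "this": 3, "movie": 4,
         "un": 5, "##believ": 6, "##able": 7}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def encode(sentence, vocab):
    """Lowercase, split on whitespace, and map every piece to its numeric ID."""
    pieces = [p for word in sentence.lower().split() for p in wordpiece(word, vocab)]
    return pieces, [vocab[p] for p in pieces]

pieces, ids = encode("I love this unbelievable movie", VOCAB)
print(pieces)
print(ids)
```

A word that is missing from the vocabulary, such as "unbelievable" here, is split into known fragments instead of being discarded, which is why the model can still handle unfamiliar words.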

Step 4: Fine-Tuning the Model

DistilBERT, a smaller version of BERT, is fine-tuned on the dataset.

Fine-tuning continues training the pre-trained weights, together with a newly added classification layer, so that the model learns to map each text to its label.

Step 5: Evaluating the Model

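After training, the model is evaluated to check its accuracy and performance. In the notebook below, the validation split points to the same CSV file as the training split, so the reported scores measure how well the model fits the training data, not how it performs on new data.

Accuracy, precision, recall, and F1 score are calculated with scikit-learn.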

Step 6: Saving and Using the Model

The trained model is saved locally and can be used to predict the sentiment of new sentences.

For example: "This is amazing!" → Predicted as Positive.

Errors Encountered and Solutions

Error 1: PyTorch Not Installed

Issue: The error "PyTorch needs to be installed to be able to return PyTorch tensors" occurred.

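Solution: Installed PyTorch and restarted the kernel. The PyPI package is named torch (pip install pytorch fails with an error saying so), so the command is: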

pip install torch
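
A quick way to confirm that the packages are importable before running the notebook is sketched below. This is an optional check, not part of the original project; the `is_installed` helper is hypothetical and relies only on the standard library.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None

# The import name is "torch"; there is no importable module called "pytorch".
for name in ("torch", "transformers", "datasets"):
    status = "found" if is_installed(name) else f"missing (try: pip install {name})"
    print(f"{name}: {status}")
```

If a package was installed while the notebook was already running, the kernel may need a restart before the check reports it as found.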

Error 2: Undefined tokenized_dataset

Issue: The tokenized_dataset variable was used before being defined, causing a NameError.

Solution: Ensured the dataset was properly tokenized and assigned to tokenized_dataset before its usage in the training script.

These issues were tackled step-by-step, ensuring the project ran smoothly on a system with limited hardware resources.

Tools Used

Python: The programming language for the project.

Hugging Face Transformers: For working with pre-trained models like DistilBERT.

PyTorch: For training and handling the model.

Scikit-learn: For evaluation metrics.

Datasets Library: For loading and processing the data.

Key Learnings

How to fine-tune a pre-trained model for text classification.

Working with limited hardware resources.

Using evaluation metrics to measure model performance.

This project is an excellent starting point for anyone interested in learning about natural language processing (NLP) and machine learning. By following these steps, you can build a simple yet effective sentiment analysis tool.

Step 1: Import Libraries

We start by importing the necessary Python libraries:

transformers: For the pretrained model and tokenizer.

datasets: To load and preprocess the dataset. 

torch: For handling tensors and model training on the CPU. 

numpy: For numerical computations. 

sklearn: For evaluation metrics. 

In [2]:
pip install datasets


Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip



In [None]:
pip install torch

In [5]:
# Import necessary libraries
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


Step 2: Load and Prepare the Dataset

We'll use a small custom dataset saved as a CSV file with two columns: text and label.

The datasets library makes it easy to load and preprocess datasets.

In [6]:
from datasets import load_dataset

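# Path to the dataset CSV file.
# Note: both splits point to the same file, so validation metrics are
# computed on the training data itself.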
data_files = {"train": "C:/Users/ARUN/Downloads/AI Engineering projects/PROJECT1 BERT/sentiment_data.csv", 
              "validation": "C:/Users/ARUN/Downloads/AI Engineering projects/PROJECT1 BERT/sentiment_data.csv"}

# Load the dataset from the CSV file
dataset = load_dataset('csv', data_files=data_files)

# Check the first few rows
print(dataset["train"][0])


Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

{'text': 'I love this movie', 'label': 1}
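
Because the train and validation entries above point to the same file, one option is to hold out part of the rows for validation. The following is a minimal standard-library sketch under stated assumptions: the `split_rows` helper and the five example rows are hypothetical, not the project's actual data.

```python
import random

def split_rows(rows, validation_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and return (train_rows, validation_rows) with no overlap."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical rows standing in for the contents of the CSV file.
rows = [
    ("I love this movie", 1),
    ("This film was terrible", 0),
    ("What a great experience", 1),
    ("I hate this product", 0),
    ("Absolutely wonderful", 1),
]
train_rows, val_rows = split_rows(rows)
print(len(train_rows), "training rows,", len(val_rows), "validation rows")
```

Each part can then be written to its own CSV file and used as the train and validation paths in data_files. The datasets library also provides a train_test_split method on a loaded dataset that serves the same purpose.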


Step 3: Tokenize the Data

Pretrained models like DistilBERT require tokenized input, that is, text converted into numeric IDs.

We'll use the DistilBertTokenizer for this purpose.
The map function applies the tokenizer to the dataset; with batched=True it processes many rows per call.

In [12]:
!pip install --upgrade pip


Collecting pip
  Downloading pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)
   ---------------------------------------- 0.0/1.8 MB ? eta -:--:--
   ---------------------------------------- 1.8/1.8 MB 16.6 MB/s eta 0:00:00



ERROR: To modify pip, please run the following command:
C:\Users\ARUN\AppData\Local\Programs\Python\Python311\python.exe -m pip install --upgrade pip


In [14]:
pip install pytorch


Collecting pytorch
  Using cached pytorch-1.0.2.tar.gz (689 bytes)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: pytorch
  Building wheel for pytorch (setup.py): started
  Building wheel for pytorch (setup.py): finished with status 'error'
  Running setup.py clean for pytorch
Failed to build pytorch
Note: you may need to restart the kernel to use updated packages.


  error: subprocess-exited-with-error
  
  python setup.py bdist_wheel did not run successfully.
  exit code: 1
  
  [6 lines of output]
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "C:\Users\ARUN\AppData\Local\Temp\pip-install-9te9hkmf\pytorch_a6e3a938fe1348ccb83db4beac5864d5\setup.py", line 15, in <module>
      raise Exception(message)
  Exception: You tried to install "pytorch". The package named for PyTorch is "torch"
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pytorch

ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (pytorch)


In [24]:
# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenization function
def tokenize_function(example):
    return tokenizer(example['text'], truncation=True, padding='max_length', max_length=64)

# Apply tokenization to the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Format the dataset for PyTorch
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])


ValueError: PyTorch needs to be installed to be able to return PyTorch tensors.

In [16]:
import torch
print(torch.__version__)


2.5.1+cpu


In [22]:
!pip uninstall -y torch




Found existing installation: torch 2.5.1
Uninstalling torch-2.5.1:
  Successfully uninstalled torch-2.5.1


You can safely remove it manually.


In [23]:
!pip install torch


Collecting torch
  Using cached torch-2.5.1-cp311-cp311-win_amd64.whl.metadata (28 kB)
Using cached torch-2.5.1-cp311-cp311-win_amd64.whl (203.1 MB)
Installing collected packages: torch
Successfully installed torch-2.5.1





In [2]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification


In [4]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from datasets import load_dataset

# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenization function
def tokenize_function(example):
    return tokenizer(example['text'], truncation=True, padding='max_length', max_length=64)

# Define the path to the CSV file
data_files = {"train": "C:/Users/ARUN/Downloads/AI Engineering projects/PROJECT1 BERT/sentiment_data.csv",
              "validation": "C:/Users/ARUN/Downloads/AI Engineering projects/PROJECT1 BERT/sentiment_data.csv"}

# Load dataset
dataset = load_dataset('csv', data_files=data_files)

# Apply tokenization to the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Format the dataset for PyTorch
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

# Print the first example to verify
print(tokenized_dataset["train"][0])


{'label': tensor(1), 'input_ids': tensor([ 101, 1045, 2293, 2023, 3185,  102,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}
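
The printed example can be read as follows: 101 and 102 are the [CLS] and [SEP] special tokens that wrap the sentence, the four IDs between them are the four words of "I love this movie", and every 0 is a [PAD] token added to reach max_length=64. The short sketch below rebuilds the attention mask from the IDs to show the relationship; the variable names are chosen for illustration only.

```python
# input_ids copied from the printed example above: 6 real tokens, then padding to 64.
input_ids = [101, 1045, 2293, 2023, 3185, 102] + [0] * 58

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in distilbert-base-uncased

# The attention mask is 1 for every real token and 0 for every padding position.
attention_mask = [0 if token_id == PAD_ID else 1 for token_id in input_ids]

real_tokens = sum(attention_mask)
print(len(input_ids), real_tokens, len(input_ids) - real_tokens)
```

The mask tells the model to ignore the padding positions, so padding every sentence to the same length does not change what the model attends to.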


Step 4: Load Pretrained Model

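We'll use DistilBertForSequenceClassification, which adds a classification head on top of DistilBERT, a smaller and faster distilled version of BERT that is practical to fine-tune on a CPU.
The num_labels parameter specifies the number of output classes (2 for binary classification: positive and negative).
The "newly initialized" warning printed below is expected: the classification head is not part of the pretrained checkpoint, so its weights start out random until the model is fine-tuned.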

In [2]:
from transformers import DistilBertForSequenceClassification

# Load the DistilBERT model for binary classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step 5: Define Training Arguments

TrainingArguments specifies how training should be done (e.g., number of epochs, batch size, logging).
Adjusted for your device:

Small batch size (per_device_train_batch_size=2) to fit in 8GB RAM.
Fewer epochs (num_train_epochs=2) for quicker runs.
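
To see how batch size and epoch count translate into work, the number of optimizer steps can be estimated with simple arithmetic. This is a rough sketch that assumes a single device and no gradient accumulation; the `total_training_steps` helper and the dataset sizes are hypothetical.

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run with no gradient accumulation on a single device."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# Hypothetical dataset sizes, using the batch size and epoch count chosen above.
for n in (5, 100, 1000):
    print(n, "examples ->", total_training_steps(n, batch_size=2, epochs=2), "steps")
```

A smaller batch lowers the memory needed per step but raises the number of steps, so the two settings trade memory against run time.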

In [6]:
pip install accelerate

Collecting accelerate
  Downloading accelerate-1.2.1-py3-none-any.whl.metadata (19 kB)
Downloading accelerate-1.2.1-py3-none-any.whl (336 kB)
Installing collected packages: accelerate
Successfully installed accelerate-1.2.1
Note: you may need to restart the kernel to use updated packages.





In [4]:
pip install tf-keras



In [3]:

from transformers import TrainingArguments

# Define the training arguments
training_args = TrainingArguments(
    output_dir='results',               # Directory to save model checkpoints
    num_train_epochs=2,                 # Number of epochs
    per_device_train_batch_size=2,      # Batch size per device
    per_device_eval_batch_size=2,       # Batch size for evaluation
    evaluation_strategy="epoch",        # Evaluate at the end of each epoch
    save_strategy="epoch",              # Save model at each epoch
    logging_dir='logs',                 # Directory for logs
    logging_steps=10,                   # Log every 10 steps
    load_best_model_at_end=True,        # Load the best model at the end of training
    seed=42                             # Seed for reproducibility
)







Step 6: Define Metrics for Evaluation

To evaluate the model, we define metrics like accuracy, precision, recall, and F1 score using scikit-learn.

In [4]:
# Define metrics for evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    acc = accuracy_score(labels, predictions)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
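
For readers who want to see what these four numbers mean, they can be computed by hand from counts of true positives, false positives, and false negatives. The sketch below uses only the standard library; the `binary_metrics` helper and the example labels are hypothetical and are not results from this project.

```python
def binary_metrics(labels, predictions):
    """Accuracy, precision, recall and F1 with label 1 treated as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # correctly predicted positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # negatives predicted as positive
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # positives predicted as negative
    correct = sum(1 for y, p in pairs if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels and predictions, not results from this project.
labels = [1, 0, 1, 1, 0, 0]
predictions = [1, 0, 0, 1, 1, 0]
result = binary_metrics(labels, predictions)
print(result)
```

Precision answers "of the sentences predicted positive, how many really were", while recall answers "of the truly positive sentences, how many were found"; F1 is their harmonic mean.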


Step 7: Train the Model

Use the Hugging Face Trainer API to simplify the training loop.
Pass the model, training arguments, datasets, and metrics to the Trainer.

In [7]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

# Load the DistilBERT model for binary classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# Define the training arguments
training_args = TrainingArguments(
    output_dir='results',               # Directory to save model checkpoints
    num_train_epochs=2,                 # Number of epochs
    per_device_train_batch_size=2,      # Batch size per device
    per_device_eval_batch_size=2,       # Batch size for evaluation
    evaluation_strategy="epoch",        # Evaluate at the end of each epoch
    save_strategy="epoch",              # Save model at each epoch
    logging_dir='logs',                 # Directory for logs
    logging_steps=10,                   # Log every 10 steps
    load_best_model_at_end=True,        # Load the best model at the end of training
    seed=42                             # Seed for reproducibility
)

# Initialize Trainer
trainer = Trainer(
    model=model,                            # The model you defined
    args=training_args,                     # The training arguments
    train_dataset=tokenized_dataset['train'],  # The tokenized training dataset
    eval_dataset=tokenized_dataset['validation'],  # The tokenized validation dataset
    compute_metrics=compute_metrics         # The metrics function
)

# Train the model
trainer.train()



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


NameError: name 'tokenized_dataset' is not defined

In [10]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Load the dataset
data_files = {"train": "C:/Users/ARUN/Downloads/AI Engineering projects/PROJECT1 BERT/sentiment_data.csv",
              "validation": "C:/Users/ARUN/Downloads/AI Engineering projects/PROJECT1 BERT/sentiment_data.csv"}
dataset = load_dataset('csv', data_files=data_files)

# Tokenization function
def tokenize_function(example):
    return tokenizer(example['text'], truncation=True, padding='max_length', max_length=64)

# Apply tokenization to the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Format the dataset for PyTorch
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

# Load the model for binary classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# Define the training arguments
training_args = TrainingArguments(
    output_dir='results',               # Directory to save model checkpoints
    num_train_epochs=2,                 # Number of epochs
    per_device_train_batch_size=2,      # Batch size per device
    per_device_eval_batch_size=2,       # Batch size for evaluation
    # Newer transformers releases name this argument eval_strategy.
    evaluation_strategy="epoch",        # Evaluate at the end of each epoch
    save_strategy="epoch",              # Save model at each epoch
    logging_dir='logs',                 # Directory for logs
    logging_steps=10,                   # Log every 10 steps
    load_best_model_at_end=True,        # Load the best model at the end of training
    seed=42                             # Seed for reproducibility
)

# Initialize Trainer
trainer = Trainer(
    model=model,                            # The model you defined
    args=training_args,                     # The training arguments
    train_dataset=tokenized_dataset['train'],  # The tokenized training dataset
    eval_dataset=tokenized_dataset['validation'],  # The tokenized validation dataset
    compute_metrics=compute_metrics         # The metrics function
)

# Train the model
trainer.train()


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.622505,1.0,1.0,1.0,1.0
2,No log,0.586624,1.0,1.0,1.0,1.0


TrainOutput(global_step=6, training_loss=0.6934785842895508, metrics={'train_runtime': 25.9396, 'train_samples_per_second': 0.386, 'train_steps_per_second': 0.231, 'total_flos': 165584248320.0, 'train_loss': 0.6934785842895508, 'epoch': 2.0})
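
The numbers in this output are worth a closer look. The sketch below is an interpretation, not part of the original notebook: it compares the reported training loss with the loss of a classifier that guesses 50/50, estimates the dataset size from the throughput figures, and checks why the Training Loss column most likely shows "No log". All values are copied from the output above; the variable names are chosen for illustration.

```python
import math

# Values copied from the TrainOutput and the training arguments above.
train_loss = 0.6934785842895508
global_step = 6
train_runtime = 25.9396
samples_per_second = 0.386
epochs = 2
logging_steps = 10

# Cross-entropy of a classifier that assigns probability 0.5 to each of two classes.
chance_loss = -math.log(0.5)
print(f"chance-level loss: {chance_loss:.4f}, reported training loss: {train_loss:.4f}")

# Samples processed over the whole run, divided by epochs, estimates the dataset size.
estimated_examples = round(samples_per_second * train_runtime / epochs)
print("estimated training examples:", estimated_examples)

# With fewer total steps than logging_steps, no training loss is logged during the run.
print("training loss logged during run:", global_step >= logging_steps)
```

Taken together, a training loss close to the chance level and a dataset of roughly five examples suggest that the perfect scores in the table say little about how the model would behave on unseen sentences, especially since the validation split reuses the training file.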

Step 8: Evaluate the Model

After training, evaluate the model on the validation dataset to check its performance.

In [11]:
# Evaluate the model
eval_result = trainer.evaluate()
print(f"Evaluation Results: {eval_result}")


Evaluation Results: {'eval_loss': 0.5866236090660095, 'eval_accuracy': 1.0, 'eval_f1': 1.0, 'eval_precision': 1.0, 'eval_recall': 1.0, 'eval_runtime': 0.3026, 'eval_samples_per_second': 16.524, 'eval_steps_per_second': 9.915, 'epoch': 2.0}


Step 9: Save the Fine-Tuned Model

Save the trained model locally for future use.

In [12]:
# Save the fine-tuned model
trainer.save_model("fine_tuned_distilbert")


Step 10: Perform Inference on New Text

Test the fine-tuned model with new sentences to predict their sentiment.

In [2]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

# Reload the fine-tuned weights that trainer.save_model() wrote in Step 9.
# (Loading 'distilbert-base-uncased' here would return a model with an
# untrained classification head.)
model = DistilBertForSequenceClassification.from_pretrained('fine_tuned_distilbert')
# The Trainer was not given a tokenizer, so none was saved; load the base tokenizer.
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


In [3]:
import torch  # used by the inference cell below

# trainer.save_model() already stored the model weights and config, so only
# the tokenizer still needs to be saved alongside them.
tokenizer.save_pretrained('fine_tuned_distilbert')


('fine_tuned_distilbert\\tokenizer_config.json',
 'fine_tuned_distilbert\\special_tokens_map.json',
 'fine_tuned_distilbert\\vocab.txt',
 'fine_tuned_distilbert\\added_tokens.json')

In [4]:
# Load the saved model for inference
model = DistilBertForSequenceClassification.from_pretrained("fine_tuned_distilbert")
tokenizer = DistilBertTokenizer.from_pretrained("fine_tuned_distilbert")

# Test with new text
text = "I absolutely love this product!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding='max_length', max_length=64)

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_label = torch.argmax(logits, dim=1).item()

# Print prediction
print("Predicted Label:", "Positive" if predicted_label == 1 else "Negative")


Predicted Label: Positive
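
The argmax above returns only the winning class. To see how confident the model is, the logits can be passed through a softmax to obtain probabilities. The sketch below uses the standard library and made-up logits, since the notebook does not print the real ones; the `softmax` helper is written here for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for [negative, positive]; not taken from the notebook.
logits = [-0.12, 0.31]
probs = softmax(logits)

# Index of the largest probability, which is the same class argmax picks on the logits.
predicted_label = max(range(len(probs)), key=probs.__getitem__)
label_name = "Positive" if predicted_label == 1 else "Negative"
print(label_name, f"(confidence {probs[predicted_label]:.2f})")
```

A probability close to 0.5 indicates that the model is barely separating the two classes, which is useful to know when the training set is very small.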
