# BERT Fine-Tuning for Sentiment Analysis

This code fine-tunes a BERT (Bidirectional Encoder Representations from Transformers) model for binary sentiment classification (Positive / Negative) using your custom dataset sentiment-analysis.csv.

It follows a complete data-to-model pipeline in ten structured steps:

1. Environment Setup
   - Installs all required libraries (torch, transformers, scikit-learn, etc.).
   - Disables W&B logging for clean output.
   - Sets random seeds for reproducible results.

2. Data Loading & Cleaning
   - Reads the CSV dataset (sentiment-analysis.csv), tolerating stray whitespace around commas and stripping stray quotes.
   - Cleans column names and text (trims surrounding whitespace, converts string values to lowercase).
   - Expects the dataset to have two main columns:
     - Text → the review/sentence
     - Sentiment → the label (positive or negative)

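3. Label Mapping
   - Converts sentiment strings into numeric labels using {'negative': 0, 'positive': 1} (the short forms 'neg' and 'pos' are accepted too).
   - Drops rows with missing or unmapped labels.
   - Checks class balance and counts duplicated (text, label) rows. Duplicates are reported, not removed.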

4. Train-Test Split
   - Splits data into training (80%) and testing (20%) sets, stratified by label so that both splits keep the same class proportions.
   - Checks for data leakage by counting texts that appear in both sets. In the run recorded below, 6 texts overlap because duplicates are still present when the data is split.

5. Tokenization
   - Uses the BERT tokenizer (bert-base-uncased) to:
     - Split sentences into WordPiece tokens.
     - Add the special [CLS] and [SEP] tokens.
     - Truncate sequences to max_length=128 and pad each split to its longest sequence.
   - Produces token IDs and attention masks for model input.

6. Dataset Preparation
   - Defines a custom SentimentDataset class that:
     - Wraps encoded inputs and labels into PyTorch tensors.
     - Allows the Hugging Face Trainer API to use them directly.

7. Model Initialization
   - Loads BertForSequenceClassification from Hugging Face with:
     - A pre-trained BERT encoder (bert-base-uncased).
     - A new, randomly initialized classification head (2 output neurons for positive/negative).

8. Training Configuration
   - Sets up TrainingArguments with:
     - 2 epochs.
     - Batch size = 8.
     - Evaluation after every epoch.
     - Logging and output directories.
     - Fixed random seed.

9. Model Training & Evaluation
   - Fine-tunes BERT on the labeled dataset using the Hugging Face Trainer.
   - Evaluates model performance on the test set using:
     - Accuracy
     - F1-score (macro-averaged)
     - Confusion matrix
     - Classification report
   - Lists a few misclassified examples for quick inspection.

10. Inference / Prediction
    - Reloads the saved model and tokenizer and runs inference on sample texts:
      - "I love this phone, it's amazing!" → Positive
      - "I hate this, worst experience ever." → Negative
    - Tokenizes each text, passes it to the trained model, and prints the predicted label with its softmax probability.

### End Result

By the end, you get:

- A fine-tuned BERT sentiment classifier.
- Model evaluation metrics.
- Ready-to-use inference logic for new, unseen text.

In [21]:
!pip install -q torch torchvision torchaudio transformers scikit-learn pandas numpy


### Loading Libraries

In [22]:
import os
os.environ["WANDB_DISABLED"] = "true"

import random
import numpy as np
import pandas as pd
import torch

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

In [23]:
# Reproducibility seeds
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

### Loading Dataset


In [24]:
df = pd.read_csv("sentiment-analysis.csv", sep=r'\s*,\s*', engine='python')
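
To see what the regex separator does, here is a small stdlib-only sketch that applies the same pattern with `re.split`. The sample lines and the helper name `split_fields` are invented for illustration and are not taken from sentiment-analysis.csv. One caveat it shows: a regex separator does not honor quoting, so a comma inside a quoted sentence also splits the field.

```python
import re

# Same pattern as the `sep` argument above: a comma with optional whitespace on either side.
SEP = re.compile(r"\s*,\s*")

def split_fields(line):
    """Split one raw CSV line the way a regex separator would (quotes are NOT honored)."""
    return SEP.split(line.strip())

# Hypothetical sample lines, for illustration only.
clean = split_fields('"I love this product!" , Positive , Twitter')
risky = split_fields('"Great food, terrible service" , Negative , Yelp')

print(clean)  # 3 fields; the whitespace around the commas is gone
print(risky)  # 4 fields; the comma inside the quotes split the text
```

If the texts themselves can contain commas, a plain `sep=','` (which honors quoted fields) may be a safer starting point, with whitespace stripped afterwards.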

### Preprocessing

In [25]:
df.columns = [c.strip().replace('"', '') for c in df.columns]


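# Lowercase and strip every string value in the DataFrame.
# Series.map is applied column by column because DataFrame.applymap is deprecated in recent pandas.
df = df.apply(lambda col: col.map(lambda x: x.lower().strip() if isinstance(x, str) else x))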

# Standardize the Text/Sentiment column names regardless of capitalization
# (e.g. 'sentiment' or 'TEXT' becomes 'Sentiment' or 'Text')
df.rename(columns={c: c.capitalize() for c in df.columns
                   if c.lower() in ('text', 'sentiment')}, inplace=True)

print(df.columns)
# Remove stray quotes in the Text column if any
if 'Text' in df.columns:
    df['Text'] = df['Text'].astype(str).str.replace('"', '', regex=False).str.strip()

# Print sample and counts
print("Columns:", df.columns.tolist())
if 'Text' in df.columns and 'Sentiment' in df.columns:
    print("Example rows:\n", df[['Text','Sentiment']].head())
    print("Label counts:\n", df['Sentiment'].value_counts())

# -------------------------
# Map labels robustly: accept common variants of each class name
label_map = {'negative': 0, 'positive': 1, 'neg': 0, 'pos': 1}
# Make sure sentiments are lowercase, stripped strings before mapping
df['Sentiment'] = df['Sentiment'].astype(str).str.strip().str.lower()
df['label'] = df['Sentiment'].map(label_map)

# Drop rows that failed mapping and any missing text.
# .copy() guards against pandas' SettingWithCopyWarning on the assignment below.
before = len(df)
df = df.dropna(subset=['Text', 'label']).copy()
df['label'] = df['label'].astype(int)
after = len(df)
print(f"Dropped {before-after} rows due to unmapped labels or missing text.")

# Quick balance check
print("Label distribution after mapping:\n", df['label'].value_counts())

# -------------------------
# Sanity check: count duplicated (text, label) rows, which can leak across the train/test split
dups = df.duplicated(subset=['Text','label']).sum()
print("Total duplicated (text,label) rows in DF:", dups)


Index(['Text', 'Sentiment', 'Source', 'Date/Time', 'User ID', 'Location',
       'Confidence Score'],
      dtype='object')
Columns: ['Text', 'Sentiment', 'Source', 'Date/Time', 'User ID', 'Location', 'Confidence Score']
Example rows:
                                                Text Sentiment
0                              i love this product!  positive
1                         the service was terrible.  negative
2                            this movie is amazing!  positive
3  i'm so disappointed with their customer support.  negative
4                just had the best meal of my life!  positive
Label counts:
 Sentiment
positive    53
negative    43
Name: count, dtype: int64
Dropped 2 rows due to unmapped labels or missing text.
Label distribution after mapping:
 label
1    53
0    43
Name: count, dtype: int64
Total duplicated (text,label) rows in DF: 21




### Train/Test Splitting


In [26]:
# We'll do a train/test split with stratify (if >1 class)
if df['label'].nunique() < 2:
    raise ValueError("Need at least 2 label classes to train. Found: " + str(df['label'].unique()))

train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['Text'].tolist(),
    df['label'].tolist(),
    test_size=0.2,
    random_state=SEED,
    stratify=df['label']
)

# Check for exact overlap between train and test (duplicated texts can leak across the split)
train_set = set(train_texts)
test_set = set(test_texts)
overlap = train_set.intersection(test_set)
print("Train/Test exact overlap count (should be 0):", len(overlap))
if len(overlap) > 0:
    print("Example overlapping texts:", list(overlap)[:3])

# Show label distribution in splits
import collections
print("Train label counts:", collections.Counter(train_labels))
print("Test label counts:", collections.Counter(test_labels))

Train/Test exact overlap count (should be 0): 6
Example overlapping texts: ["i can't stop listening to this song. it's my new favorite!", 'the website loading speed is frustratingly slow. needs improvement.', "just had the most amazing vacation! i can't wait to go back."]
Train label counts: Counter({1: 42, 0: 34})
Test label counts: Counter({1: 11, 0: 9})
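
The overlap count above is nonzero because duplicated texts are still in the DataFrame when it is split. Below is a minimal sketch of one way to address this, assuming exact-match duplicates are the only source of leakage: deduplicate the (text, label) pairs first, then split. The helper name `dedupe_pairs` and the sample data are hypothetical.

```python
def dedupe_pairs(texts, labels):
    """Keep the first occurrence of each text; report texts seen with conflicting labels."""
    first_label = {}   # text -> label of its first occurrence (dicts keep insertion order)
    conflicts = set()
    for text, label in zip(texts, labels):
        if text not in first_label:
            first_label[text] = label
        elif first_label[text] != label:
            conflicts.add(text)
    unique_texts = list(first_label)
    unique_labels = [first_label[t] for t in unique_texts]
    return unique_texts, unique_labels, conflicts

# Hypothetical sample data, for illustration only.
texts = ["great phone", "slow website", "great phone", "slow website", "okay service"]
labels = [1, 0, 1, 1, 0]

unique_texts, unique_labels, conflicts = dedupe_pairs(texts, labels)
print(unique_texts)    # each text appears once
print(unique_labels)
print(conflicts)       # texts whose duplicates disagree on the label
```

Passing `unique_texts` and `unique_labels` to `train_test_split` in place of the full columns brings the exact-overlap count to 0, since no text can land in both splits. With pandas, `df.drop_duplicates(subset=['Text'])` before the split has a similar effect. Expect the test set to shrink and the reported metrics to change, because the current test set includes texts the model saw during training.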


### Tokenization

In [27]:

# Tokenizer (return lists, not tensors) & encodings
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# produce dicts of lists (no return_tensors)
train_enc = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
test_enc  = tokenizer(test_texts, truncation=True, padding=True, max_length=128)


# Dataset wrapper that converts to tensors once (avoid double-conversion)
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx], dtype=torch.long) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item
    def __len__(self):
        return len(self.labels)

train_dataset = SentimentDataset(train_enc, train_labels)
test_dataset  = SentimentDataset(test_enc, test_labels)
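
To make the tokenizer settings concrete, the following toy sketch mimics what `truncation=True, padding=True, max_length=...` produce for ID lists that are already tokenized. It is not the tokenizer's implementation. The IDs 101, 102 and 0 are the [CLS], [SEP] and [PAD] IDs in the bert-base-uncased vocabulary; the other IDs are made up, and `pad_and_truncate` is a hypothetical helper.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_and_truncate(batch, max_length):
    """Truncate each sequence to max_length (keeping the final [SEP]), then pad to the longest."""
    truncated = []
    for ids in batch:
        if len(ids) > max_length:
            ids = ids[:max_length - 1] + [SEP_ID]
        truncated.append(ids)
    width = max(len(ids) for ids in truncated)   # padding=True pads to the longest sequence
    input_ids = [ids + [PAD_ID] * (width - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up word IDs wrapped in [CLS] ... [SEP].
batch = [
    [CLS_ID, 7, 8, SEP_ID],
    [CLS_ID, 7, 8, 9, 10, 11, SEP_ID],
]
enc = pad_and_truncate(batch, max_length=6)
print(enc["input_ids"])        # shorter sequence is padded with 0s, longer one is cut to 6 IDs
print(enc["attention_mask"])   # 1 for real tokens, 0 for padding
```

Because the train and test texts are encoded in separate calls, each split is padded to its own longest sequence, so the two splits may end up with different widths.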

### Model

In [28]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training

In [29]:
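training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=50,           # longer than this run's ~20 optimizer steps (76 examples, batch of 8, 2 epochs)
    weight_decay=0.01,
    logging_dir='./logs',
    logging_strategy='epoch',
    eval_strategy='epoch',     # evaluate on the test set after every epoch
    save_strategy='no',        # no intermediate checkpoints; the model is saved manually after training
    report_to='none',          # non-deprecated way to turn off W&B and other logging integrations
    seed=SEED
)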

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

# -------------------------
# Train, then save the fine-tuned model and tokenizer for later inference
trainer.train()

model.save_pretrained("sentiment_model")
tokenizer.save_pretrained("sentiment_model")

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss
1,0.6837,0.644455
2,0.5855,0.458603


('sentiment_model/tokenizer_config.json',
 'sentiment_model/special_tokens_map.json',
 'sentiment_model/vocab.txt',
 'sentiment_model/added_tokens.json')
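
A quick arithmetic check on the warmup setting, as a sketch under stated assumptions: a single device, no gradient accumulation, and the Trainer's default linear schedule with its default learning rate of 5e-5. With 76 training examples and a batch size of 8, the run has 20 optimizer steps, fewer than `warmup_steps=50`, so the learning rate is still ramping up when training ends. The helper `linear_warmup_lr` is hypothetical and only mirrors the shape of a linear warmup/decay schedule.

```python
import math

def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

n_train, batch_size, epochs = 76, 8, 2              # values from this run
steps_per_epoch = math.ceil(n_train / batch_size)   # 10
total_steps = steps_per_epoch * epochs               # 20

base_lr = 5e-5       # assumed: Trainer default learning rate
warmup_steps = 50    # from the TrainingArguments above

final_lr = linear_warmup_lr(total_steps, base_lr, warmup_steps, total_steps)
print(total_steps, final_lr)   # the ramp reaches only total_steps / warmup_steps of base_lr
```

Under these assumptions the learning rate peaks at roughly 40% of its nominal value. A smaller `warmup_steps` may suit a dataset of this size better, though changing it would change the results recorded here.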

### Evaluation

In [30]:
predictions = trainer.predict(test_dataset)
preds = np.argmax(predictions.predictions, axis=1)


print("\n Evaluation Results")
print("Accuracy:", accuracy_score(test_labels, preds))
print("F1-Score (macro):", f1_score(test_labels, preds, average='macro'))
print("\nConfusion Matrix:\n", confusion_matrix(test_labels, preds))
print("\nClassification Report:\n", classification_report(test_labels, preds, target_names=['Negative', 'Positive']))

# Quick mismatch examples (if any): indices where the prediction differs from the true label
mismatch_idx = [i for i, (y, p) in enumerate(zip(test_labels, preds)) if p != y]
print("Number of mismatches on test set:", len(mismatch_idx))
if len(mismatch_idx) > 0:
    for idx in mismatch_idx[:5]:
        print("TEXT:", test_texts[idx], "TRUE:", test_labels[idx], "PRED:", preds[idx])


 Evaluation Results
Accuracy: 0.9
F1-Score (macro): 0.8958333333333333

Confusion Matrix:
 [[ 7  2]
 [ 0 11]]

Classification Report:
               precision    recall  f1-score   support

    Negative       1.00      0.78      0.88         9
    Positive       0.85      1.00      0.92        11

    accuracy                           0.90        20
   macro avg       0.92      0.89      0.90        20
weighted avg       0.92      0.90      0.90        20

Number of mismatches on test set: 2
TEXT: the product i received was damaged. unacceptable. TRUE: 0 PRED: 1
TEXT: the website loading speed is frustratingly slow. needs improvement. TRUE: 0 PRED: 1
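
The reported scores can be reproduced by hand from the confusion matrix above, where rows are true labels and columns are predictions (scikit-learn's convention). A short sketch of that arithmetic in plain Python, with `per_class_scores` as a hypothetical helper:

```python
# Confusion matrix from the evaluation output: rows = true class, columns = predicted class.
cm = [[7, 2],    # true Negative: 7 predicted Negative, 2 predicted Positive
      [0, 11]]   # true Positive: 0 predicted Negative, 11 predicted Positive

def per_class_scores(cm, k):
    """Precision, recall and F1 for class index k."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp   # predicted k, but true class differs
    fn = sum(cm[k]) - tp                              # true k, but predicted otherwise
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total
scores = [per_class_scores(cm, k) for k in range(len(cm))]
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)

print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.4f}")
for name, (p, r, f1) in zip(["Negative", "Positive"], scores):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With only 20 test examples, each misclassification moves accuracy by 5 percentage points, so these figures are a rough indication rather than a precise estimate.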


### Example

In [32]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

model_path = "sentiment_model"
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)

texts = [
    "I love this phone, it's amazing!",
    "I hate this, worst experience ever."
]

for text in texts:
    # Use the same truncation limit as in training; a single text needs no padding
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax turns the two logits into class probabilities
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    print(text, "→", "Positive" if pred == 1 else "Negative", f"({probs[0][pred]:.2f})")


I love this phone, it's amazing! → Positive (0.74)
I hate this, worst experience ever. → Negative (0.52)
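
The number in parentheses is the softmax probability of the predicted class. Here is a small stdlib sketch of that step, with made-up logits chosen only to show how a small gap between the two logits yields a probability close to 0.5. The helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, names=("Negative", "Positive")):
    """Return the predicted class name and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return names[idx], probs[idx]

# Hypothetical logits, not taken from the trained model.
print(predict([0.10, 0.02]))   # nearly tied logits -> probability just above 0.5
print(predict([-1.0, 1.0]))    # wider gap -> more confident prediction
```

A probability such as 0.52 means the model barely prefers one class over the other, which is plausible after two epochs on fewer than 100 examples.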
