<a href="https://colab.research.google.com/github/YASHGUPTA1161/Sentiment-analysis/blob/main/bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Emotion Classification in short texts with BERT

Applying BERT to the problem of multiclass text classification. Our dataset consists of written dialogs, messages and short stories. Each dialog utterance/message is labeled with one of the five emotion categories: joy, anger, sadness, fear, neutral.

## Workflow:
1. Import Data
2. Data preprocessing and downloading BERT
3. Training and validation
4. Saving the model


In [3]:
!pip install transformers datasets accelerate huggingface_hub



Collecting datasets
  Using cached datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Using cached dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Using cached xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Using cached multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Using cached fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4

In [24]:
# ✅ STEP 1: Load CSVs
train_df = pd.read_csv('/content/data_test.csv', encoding='utf-8')
test_df  = pd.read_csv('/content/data_test.csv', encoding='utf-8')

# ✅ STEP 2: Map emotion labels to integers — THIS IS REQUIRED TO FIX THE ERROR
label2id = {'joy': 0, 'sadness': 1, 'fear': 2, 'anger': 3, 'neutral': 4}
id2label = {v: k for k, v in label2id.items()}

train_df['labels'] = train_df['Emotion'].map(label2id)  # 👈 converts to integer
test_df['labels']  = test_df['Emotion'].map(label2id)

# Optional: Drop original label column
train_df.drop(columns=['Emotion'], inplace=True)
test_df.drop(columns=['Emotion'], inplace=True)

# ✅ STEP 3: Convert to Hugging Face Datasets
train_ds = Dataset.from_pandas(train_df)
test_ds  = Dataset.from_pandas(test_df)

In [25]:
# ✅ STEP 4: Tokenize the text
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_fn(examples):
    return tokenizer(
        examples['Text'],
        padding='max_length',   # 👈 important
        truncation=True,        # 👈 important
        max_length=350
    )

train_ds = train_ds.map(tokenize_fn, batched=True)
test_ds  = test_ds.map(tokenize_fn, batched=True)

Map:   0%|          | 0/3393 [00:00<?, ? examples/s]

Map:   0%|          | 0/3393 [00:00<?, ? examples/s]

In [26]:
# ✅ STEP 5: Set correct format for PyTorch training — OTHERWISE `Trainer` WILL FAIL
train_ds.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_ds.set_format("torch", columns=["input_ids", "attention_mask", "labels"])


In [27]:
# ✅ STEP 6: Define model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
# ✅ STEP 7: Set training args
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='bert_emotion',
    per_device_train_batch_size=6,
    per_device_eval_batch_size=6,
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_dir='logs',
    eval_steps=len(train_ds)
)


In [31]:
# ✅ STEP 8: Define Trainer
from transformers import Trainer, DataCollatorWithPadding
from sklearn.metrics import accuracy_score, f1_score

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def compute_metrics_fn(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'f1': f1_score(labels, preds, average='weighted')
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics_fn
)

  trainer = Trainer(


In [32]:
# ✅ STEP 9: Train and evaluate
trainer.train()
trainer.evaluate()

Step,Training Loss
500,0.9858
1000,0.4484
1500,0.2745


{'eval_loss': 0.15535539388656616,
 'eval_accuracy': 0.9610963748894783,
 'eval_f1': 0.9611697222047844,
 'eval_runtime': 72.7538,
 'eval_samples_per_second': 46.637,
 'eval_steps_per_second': 7.78,
 'epoch': 3.0}

In [33]:
# ✅ STEP 10: Save
trainer.save_model('models/bert_emotion')
tokenizer.save_pretrained('models/bert_emotion')


('models/bert_emotion/tokenizer_config.json',
 'models/bert_emotion/special_tokens_map.json',
 'models/bert_emotion/vocab.txt',
 'models/bert_emotion/added_tokens.json',
 'models/bert_emotion/tokenizer.json')

In [41]:
# ✅ STEP 11: Inference
from transformers import pipeline

classifier = pipeline('text-classification', model='models/bert_emotion', tokenizer=tokenizer)
print(classifier("I want to suck your titties"))
# www.kaggle.com/datasets/yashgupta1161/emotion-dataset

Device set to use cuda:0


[{'label': 'fear', 'score': 0.6884295344352722}]
