<a href="https://colab.research.google.com/github/advait2811/emotion_and_sentiment_evolution/blob/main/prototype_meld.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:

!pip install transformers datasets torch pandas scikit-learn accelerate -q


import pandas as pd
import torch
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

In [2]:

train_url = 'https://raw.githubusercontent.com/declare-lab/MELD/master/data/MELD/train_sent_emo.csv'
dev_url = 'https://raw.githubusercontent.com/declare-lab/MELD/master/data/MELD/dev_sent_emo.csv'

df_train = pd.read_csv(train_url)
df_dev = pd.read_csv(dev_url)

print("Dataset columns:", df_train.columns)
print("\nSample from the training data:")
print(df_train.head())

emotion_labels = df_train['Emotion'].unique()
label2id = {label: i for i, label in enumerate(emotion_labels)}
id2label = {i: label for i, label in enumerate(emotion_labels)}

print("\nEmotion to ID Mapping:", label2id)

df_train['label'] = df_train['Emotion'].map(label2id)
df_dev['label'] = df_dev['Emotion'].map(label2id)


model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(emotion_labels),
    id2label=id2label,
    label2id=label2id
)

Dataset columns: Index(['Sr No.', 'Utterance', 'Speaker', 'Emotion', 'Sentiment', 'Dialogue_ID',
       'Utterance_ID', 'Season', 'Episode', 'StartTime', 'EndTime'],
      dtype='object')

Sample from the training data:
   Sr No.                                          Utterance          Speaker  \
0       1  also I was the point person on my company’s tr...         Chandler   
1       2                   You must’ve had your hands full.  The Interviewer   
2       3                            That I did. That I did.         Chandler   
3       4      So let’s talk a little bit about your duties.  The Interviewer   
4       5                             My duties?  All right.         Chandler   

    Emotion Sentiment  Dialogue_ID  Utterance_ID  Season  Episode  \
0   neutral   neutral            0             0       8       21   
1   neutral   neutral            0             1       8       21   
2   neutral   neutral            0             2       8       21   
3   neutral   neu

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
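
The label mapping built in the cell above follows the order in which emotions first appear in the training CSV, because `unique()` preserves first-appearance order. The following is a minimal, stdlib-only sketch of that behavior next to a sorted variant that does not depend on row order. The `toy_emotions` list and the `label2id_*` names are made up for illustration and are not MELD data.

```python
# Toy stand-in for df_train['Emotion']; these values are illustrative only.
toy_emotions = ["neutral", "neutral", "surprise", "fear", "neutral", "joy", "fear"]

# dict.fromkeys keeps first-appearance order, mirroring pandas' unique().
appearance_order = list(dict.fromkeys(toy_emotions))
label2id_appearance = {label: i for i, label in enumerate(appearance_order)}

# Sorting first makes the mapping independent of how the rows are ordered.
label2id_sorted = {label: i for i, label in enumerate(sorted(set(toy_emotions)))}

print(label2id_appearance)
print(label2id_sorted)
```

Either mapping works as long as the same one is used for training and inference. Passing `id2label` and `label2id` to `from_pretrained`, as the cell above does, stores the mapping in the model config.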


In [3]:

text_example = "Why do all you guys have to be so goofy?"
actual_emotion = "anger"

print(f"Example sentence: '{text_example}'")
print(f"Actual emotion: {actual_emotion}")
print("-" * 30)
print("PREDICTION BEFORE TRAINING:")

inputs = tokenizer(text_example, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = torch.argmax(logits, dim=1).item()
predicted_emotion = model.config.id2label[predicted_id]

print(f"Predicted emotion: {predicted_emotion}")
print("\nNOTE: The prediction is random because the model's classification head has not been trained yet.")

Example sentence: 'Why do all you guys have to be so goofy?'
Actual emotion: anger
------------------------------
PREDICTION BEFORE TRAINING:
Predicted emotion: disgust

NOTE: The prediction is random because the model's classification head has not been trained yet.
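
As a rough reference point for an untrained head: if a classifier spread its probability evenly over the seven emotion classes, each class would receive about 0.143 and the cross-entropy loss would be ln 7. The sketch below is plain arithmetic, not a measurement of this model. The pre-training evaluation later in this notebook reports a loss of about 1.95, which is close to this value.

```python
import math

num_classes = 7  # the probability vectors in this notebook have seven entries

uniform_prob = 1.0 / num_classes
chance_loss = -math.log(uniform_prob)  # cross-entropy of a uniform prediction

print(f"uniform probability per class: {uniform_prob:.4f}")
print(f"chance-level cross-entropy:    {chance_loss:.4f}")
```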


In [4]:
train_dataset = Dataset.from_pandas(df_train)
dev_dataset = Dataset.from_pandas(df_dev)

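def tokenize_function(examples):
    # Truncate only. The Trainer's default DataCollatorWithPadding pads each
    # batch to its longest sequence, which avoids padding every row to 512 tokens.
    return tokenizer(examples['Utterance'], truncation=True)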

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_dev_dataset = dev_dataset.map(tokenize_function, batched=True)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='weighted')
    }

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    eval_strategy="epoch",
    report_to="none",
    save_steps=500  # Save a checkpoint every 500 steps (checkpoint-500 is loaded later)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_dev_dataset,
    compute_metrics=compute_metrics,
    processing_class=tokenizer  # Replaces the deprecated `tokenizer=` argument (transformers >= 4.46)
)

before_train_results = trainer.predict(tokenized_dev_dataset)

print("\nMetrics for the model BEFORE training:")
print(before_train_results.metrics)

trainer.train()
print("Fine-tuning complete!")

Map:   0%|          | 0/9989 [00:00<?, ? examples/s]

Map:   0%|          | 0/1109 [00:00<?, ? examples/s]


Metrics for the model BEFORE training:
{'test_loss': 1.9525814056396484, 'test_model_preparation_time': 0.0019, 'test_accuracy': 0.056807935076645624, 'test_f1': 0.033428377618716816, 'test_runtime': 16.9036, 'test_samples_per_second': 65.607, 'test_steps_per_second': 4.141}


Epoch  Training Loss  Validation Loss  Model Preparation Time  Accuracy  F1
1      1.3412         1.226587         0.0019                  0.611362  0.570095


Fine-tuning complete!
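
The run above reports accuracy (0.611) and weighted F1 (0.570) side by side. To show how the two can diverge, here is a small stdlib-only sketch of support-weighted F1 on made-up labels. The `weighted_f1` helper and the toy arrays are illustrative assumptions. The notebook itself uses `sklearn.metrics.f1_score(..., average='weighted')`.

```python
from collections import Counter

def weighted_f1(labels, preds):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1  # weight each class by its number of true examples
    return total / len(labels)

# Toy data: class 0 dominates, and the predictions lean towards it.
toy_labels = [0, 0, 0, 0, 1, 1, 2]
toy_preds = [0, 0, 0, 0, 0, 1, 0]

toy_accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
toy_weighted_f1 = weighted_f1(toy_labels, toy_preds)

print(f"accuracy:    {toy_accuracy:.4f}")
print(f"weighted F1: {toy_weighted_f1:.4f}")
```

In this toy case the weighted F1 is lower than the accuracy because the minority classes are partly or wholly missed. A gap in the same direction in the real run may point to a similar lean towards frequent emotions, though a per-class report would be needed to confirm it.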


In [10]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

base_model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=len(emotion_labels),
    id2label=id2label,
    label2id=label2id
).to(device)

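# Load the fine-tuned weights from the checkpoint written at step 500.
# Note: one epoch is ceil(9989 / 16) = 625 steps on a single device, so this
# checkpoint comes from partway through the epoch, not the end of training.
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("./results/checkpoint-500").to(device)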


def predict_emotion(text, model, tokenizer):
    """Return the predicted emotion label and the class probabilities for one text."""
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
        pred_id = torch.argmax(probs, dim=1).item()
        pred_label = model.config.id2label[pred_id]
    return pred_label, probs.squeeze().cpu().numpy()

text = "I cannot believe you did this, I am so mad!"
base_label, base_probs = predict_emotion(text, base_model, tokenizer)
ft_label, ft_probs = predict_emotion(text, fine_tuned_model, tokenizer)

print("Input text:", text)
print("\nBase model prediction (before training):", base_label)
print("Probabilities:", base_probs)

print("\nFine-tuned model prediction (after training):", ft_label)
print("Probabilities:", ft_probs)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Input text: I cannot believe you did this, I am so mad!

Base model prediction (before training): neutral
Probabilities: [0.16723748 0.12003696 0.14285499 0.1530111  0.14109965 0.15068448
 0.12507531]

Fine-tuned model prediction (after training): anger
Probabilities: [0.06713898 0.08070379 0.03415503 0.12761983 0.22214466 0.08902798
 0.37920976]
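
The raw probability arrays above are easier to read when paired with their labels. The sketch below ranks the fine-tuned probabilities copied from that output. The `rank_emotions` helper and the label order in `assumed_id2label` are assumptions for illustration: only index 0 (`neutral`) and index 6 (`anger`) are confirmed by the printed predictions. In the notebook, `model.config.id2label` is the mapping to use.

```python
# ASSUMPTION: this label order is illustrative. Only indices 0 ('neutral')
# and 6 ('anger') are confirmed by the predictions printed above.
assumed_id2label = {
    0: "neutral", 1: "surprise", 2: "fear", 3: "sadness",
    4: "joy", 5: "disgust", 6: "anger",
}

# Fine-tuned probabilities copied from the cell output above.
ft_probs_example = [
    0.06713898, 0.08070379, 0.03415503, 0.12761983,
    0.22214466, 0.08902798, 0.37920976,
]

def rank_emotions(probs, id2label):
    """Pair each probability with its label and sort from most to least likely."""
    pairs = [(id2label[i], float(p)) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

ranked = rank_emotions(ft_probs_example, assumed_id2label)
for label, prob in ranked:
    print(f"{label:<10s}{prob:.3f}")
```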
