# Emotion Detection Using NLP (DistilRoBERTa)

Now that Exploratory Data Analysis (EDA) is complete, we have a better understanding of sentiment distribution, text characteristics, and keyword usage in the dataset. However, sentiment analysis alone only classifies statements as positive, neutral, or negative, which may not capture the full emotional depth of the text.

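To enhance our insights, we now use a DistilRoBERTa model fine-tuned for emotion classification to detect specific emotions such as joy, sadness, anger, and fear. This step comes after EDA because we now know how sentiment correlates with other features, and the emotion labels can serve as a new feature for later predictive modeling. A prediction is kept only when the model's top score is at least 0.95; anything below that is labeled "Uncertain". The score is a softmax probability rather than a calibrated confidence, so the threshold filters out low-scoring predictions but does not guarantee that the remaining ones are correct.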
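
The thresholding rule described above can be sketched on its own, independent of the model. The helper name `pick_label` and the example scores are hypothetical and only illustrate the rule; they are not output from the actual model.

```python
def pick_label(scores, threshold=0.95, fallback="Uncertain"):
    """Return the top-scoring label if its score meets the threshold.

    `scores` is a list of {"label": str, "score": float} dictionaries,
    one entry per emotion.
    """
    if not scores:
        return fallback
    best = max(scores, key=lambda item: item["score"])
    return best["label"] if best["score"] >= threshold else fallback


# Hypothetical scores for illustration only (not real model output).
confident = [{"label": "joy", "score": 0.97}, {"label": "neutral", "score": 0.03}]
ambiguous = [{"label": "sadness", "score": 0.60}, {"label": "fear", "score": 0.40}]

print(pick_label(confident))  # joy
print(pick_label(ambiguous))  # Uncertain
```

A high threshold trades coverage for precision: fewer statements receive an emotion label, but the ones that do are the cases where the model strongly favors a single class.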

In [14]:
!pip install pandas transformers torch
!pip install --upgrade requests
import pandas as pd
import re

Collecting requests
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Installing collected packages: requests
  Attempting uninstall: requests
    Found existing installation: requests 2.32.2
    Uninstalling requests-2.32.2:
      Successfully uninstalled requests-2.32.2
Successfully installed requests-2.32.3


In [18]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model fine-tuned for emotion classification (finer-grained than positive/neutral/negative sentiment)
model_name = "j-hartmann/emotion-english-distilroberta-base"

# Use the slow tokenizer and force PyTorch weights (avoid safetensors)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name, use_safetensors=False)

# Automatically extract the available labels from the model's configuration
id2label = model.config.id2label  # Dictionary mapping label ids to emotion names
emotions = list(id2label.values())

print("This model is able to detect", len(emotions), "emotions:")
for emotion in emotions:
    print(emotion)


This model is able to detect 7 emotions:
anger
disgust
fear
joy
neutral
sadness
surprise
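
For context on what the scores mean: for a single-label classifier like this one, the text-classification pipeline applies a softmax to the model's raw outputs (logits) by default, so the seven scores are non-negative and sum to 1. A minimal sketch of that step in plain Python follows. The logit values are made up for illustration, and the helper `softmax_scores` is hypothetical, not part of `transformers`.

```python
import math

# Label order as printed by the model's id2label mapping.
LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]


def softmax_scores(logits, labels=LABELS):
    """Convert raw logits into per-label probabilities that sum to 1."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [{"label": label, "score": e / total} for label, e in zip(labels, exps)]


# Made-up logits for illustration only (not real model output).
example_logits = [0.1, -1.2, 0.3, 4.0, 1.5, -0.5, 0.2]
scores = softmax_scores(example_logits)
best = max(scores, key=lambda item: item["score"])

print(best["label"], round(best["score"], 2))
```

In this made-up case "joy" is the clear winner, yet its score is only about 0.86, so a 0.95 threshold would still mark the statement as "Uncertain".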


In [20]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Model fine-tuned for emotion classification (finer-grained than positive/neutral/negative sentiment)
model_name = "j-hartmann/emotion-english-distilroberta-base"

# Use the slow tokenizer and force PyTorch weights (avoid safetensors)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name, use_safetensors=False)

# Create a pipeline for text classification with truncation enabled
emotion_classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,       # Return scores for all labels
    truncation=True,  # Enable truncation for texts longer than the model's max length
    max_length=512    # Set the maximum sequence length
)

# Load your CSV dataset (ensure the CSV file is in your working directory or provide the correct path)
csv_file = "feature_engineered_data.csv"
df = pd.read_csv(csv_file)

# Check if the 'statement' column exists; adjust the column name if needed.
if 'statement' not in df.columns:
    raise ValueError("The CSV file must contain a column named 'statement'.")

# Define a function to classify the emotion in a statement with a 95% score threshold
def classify_emotion(text):
    # Treat missing values (NaN/None) as "Uncertain" instead of classifying the string "nan".
    if pd.isna(text):
        return "Uncertain"
    # Coerce any remaining non-string values to text.
    text = str(text)
    if text.strip() == "":
        return "Uncertain"

    # Get the scores for each emotion for the input text.
    result = emotion_classifier(text)
    # Depending on the transformers version, a single input may yield either a flat list
    # of score dictionaries or a list wrapping that list; unwrap the nested form.
    scores = result[0] if result and isinstance(result[0], list) else result
    if not scores:
        return "Uncertain"

    # Identify the emotion with the highest score.
    best = max(scores, key=lambda x: x['score'])
    # Return the detected emotion if its score is at least 0.95, else "Uncertain"
    return best['label'] if best['score'] >= 0.95 else "Uncertain"

# Apply the classification function on the 'statement' column
df['predicted_emotion'] = df['statement'].apply(classify_emotion)

# Save the resulting dataframe to a new CSV file.
output_csv = "emotion_output.csv"
df.to_csv(output_csv, index=False)

print("Processing complete. The output has been saved to", output_csv)


Device set to use cpu


Processing complete. The output has been saved to emotion_output.csv
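
Calling the classifier once per row through `apply` can be slow on large datasets. Pipelines also accept a list of texts, so the statements could be classified in chunks instead. The sketch below assumes the classifier returns one list of score dictionaries per input text (the shape expected with `top_k=None`). The function name `classify_in_batches`, the chunk size, and the stand-in `fake_classifier` are hypothetical and exist only so the example runs without downloading the model.

```python
from collections import Counter


def classify_in_batches(texts, classifier, threshold=0.95, chunk_size=32):
    """Label each text with its top emotion, or "Uncertain" below the threshold."""
    labels = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        # One call per chunk; expects one list of score dicts per text.
        for scores in classifier(chunk):
            best = max(scores, key=lambda item: item["score"])
            labels.append(best["label"] if best["score"] >= threshold else "Uncertain")
    return labels


def fake_classifier(texts):
    """Stand-in for the real pipeline; returns made-up scores for illustration."""
    results = []
    for text in texts:
        joy = 0.98 if "happy" in text.lower() else 0.40
        results.append([
            {"label": "joy", "score": joy},
            {"label": "neutral", "score": 1.0 - joy},
        ])
    return results


texts = ["I am so happy today", "It is what it is", "Happy to help"]
labels = classify_in_batches(texts, fake_classifier, chunk_size=2)

print(labels)
print(Counter(labels))  # how many statements fell into each label, including "Uncertain"
```

With the real pipeline, `emotion_classifier` would be passed in place of `fake_classifier`. Empty or missing statements would still need to be filtered out first, as `classify_emotion` does. Counting the labels afterwards is a quick way to see how much of the dataset the 0.95 threshold leaves as "Uncertain".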


In [22]:
import os

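# Rename the file; os.replace overwrites an existing target, so re-running this cell
# does not fail on platforms where os.rename refuses to overwrite (e.g. Windows).
os.replace("emotion_output.csv", "emotion_data.csv")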

print("File renamed successfully!")

File renamed successfully!
