## Import Dataset

A dataset of English tweets, manually annotated with six basic emotions: anger, fear, joy, love, sadness, and surprise.

For further information, click [here](https://www.kaggle.com/datasets/parulpandey/emotion-dataset?select=training.csv).

In [1]:
!unzip archive.zip
!unzip archive-1.zip

Archive:  archive.zip
  inflating: test.csv                
  inflating: training.csv            
  inflating: validation.csv          
Archive:  archive-1.zip
  inflating: tweet_emotions.csv      


In [2]:
import pandas as pd
train = pd.read_csv("training.csv")
test = pd.read_csv("test.csv")
validation = pd.read_csv("validation.csv")

print(f"Train shape:\t\t {train.shape}")
print(f"Test shape:\t\t {test.shape}")
print(f"Validation shape:\t {validation.shape}")

Train shape:		 (16000, 2)
Test shape:		 (2000, 2)
Validation shape:	 (2000, 2)


In [3]:
train.head(9)

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |
| 5 | ive been feeling a little burdened lately wasn... | 0 |
| 6 | ive been taking or milligrams or times recomme... | 5 |
| 7 | i feel as confused about life as a teenager or... | 4 |
| 8 | i have been with petronas for years i feel tha... | 1 |


In [4]:
train["label"].value_counts()

1    5362
0    4666
3    2159
4    1937
2    1304
5     572
Name: label, dtype: int64

In [5]:
test["label"].value_counts()

1    695
0    581
3    275
4    224
2    159
5     66
Name: label, dtype: int64

In [6]:
validation["label"].value_counts()

1    704
0    550
3    275
4    212
2    178
5     81
Name: label, dtype: int64

In [7]:
data1 = pd.concat([train, test, validation], ignore_index=True)

print(f"Data1 shape: {data1.shape}\n")
print(data1["label"].value_counts())

Data1 shape: (20000, 2)

1    6761
0    5797
3    2709
4    2373
2    1641
5     719
Name: label, dtype: int64


The counts above show that the dataset is fairly small: 20,000 samples in total, and only 719 for the rarest label.

To get more data, we add a second dataset, Emotion Detection from Text ([here](https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text)), which has 13 emotion labels:


1.   neutral
2.   worry
3.   happiness
4.   sadness
5.   love
6.   surprise
7.   fun
8.   relief
9.   hate
10.   empty
11.   enthusiasm
12.   boredom
13.   anger
<br><br>

Since these labels differ from those of the first dataset, we have to remap them. For example, 'hate' and 'anger' are both treated as 'anger'.

The full mapping is:


*   anger : hate, anger
*   fear : worry
*   joy : happiness, fun, relief
*   love : love, enthusiasm
*   sadness : sadness, empty, boredom
*   surprise : surprise
*   neutral : neutral (added as a new, seventh emotion)
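
The mapping above can also be written as a dictionary lookup, which keeps the emotion names and their integer ids in one place. The sketch below is plain Python, and every name in it (`EMOTION_IDS`, `SENTIMENT_TO_EMOTION`, `FIRST_DATASET_ID_TO_NAME`, `remap_sentiment`, `remap_first_dataset_label`) is made up for illustration. The target ids are assumed to follow the order of the list above. It also assumes that the first dataset numbers its labels as sadness, joy, love, anger, fear, surprise, which is what the rows printed by `train.head(9)` suggest (for instance, "i am feeling grouchy" carries label 3). That ordering is worth confirming on the dataset page before merging, because it would not match the alphabetical ids used for the second dataset.

```python
# Target id space, following the order of the list above (assumption).
EMOTION_IDS = {
    "anger": 0, "fear": 1, "joy": 2, "love": 3,
    "sadness": 4, "surprise": 5, "neutral": 6,
}

# Second dataset: 13 sentiment names collapsed into the 7 target emotions.
SENTIMENT_TO_EMOTION = {
    "hate": "anger", "anger": "anger",
    "worry": "fear",
    "happiness": "joy", "fun": "joy", "relief": "joy",
    "love": "love", "enthusiasm": "love",
    "sadness": "sadness", "empty": "sadness", "boredom": "sadness",
    "surprise": "surprise",
    "neutral": "neutral",
}

# First dataset: ASSUMED id order; verify it before relying on it.
FIRST_DATASET_ID_TO_NAME = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def remap_sentiment(sentiment: str) -> int:
    """Map a sentiment name from the second dataset to a target id."""
    return EMOTION_IDS[SENTIMENT_TO_EMOTION[sentiment]]


def remap_first_dataset_label(label: int) -> int:
    """Map an integer label from the first dataset to a target id."""
    return EMOTION_IDS[FIRST_DATASET_ID_TO_NAME[label]]


print(remap_sentiment("boredom"))    # sadness id in the target space
print(remap_first_dataset_label(3))  # anger id, under the assumed ordering
```

With pandas, the same lookups could be applied column-wise through `Series.map`, for example `data2["sentiment"].map(remap_sentiment)`.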





In [8]:
data2 = pd.read_csv("tweet_emotions.csv").drop("tweet_id", axis=1)

print(f"Data2 shape:\t\t {data2.shape}")

Data2 shape:		 (40000, 2)


In [9]:
data2.head(9)

| | sentiment | content |
|---|---|---|
| 0 | empty | @tiffanylue i know i was listenin to bad habi... |
| 1 | sadness | Layin n bed with a headache ughhhh...waitin o... |
| 2 | sadness | Funeral ceremony...gloomy friday... |
| 3 | enthusiasm | wants to hang out with friends SOON! |
| 4 | neutral | @dannycastillo We want to trade with someone w... |
| 5 | worry | Re-pinging @ghostridah14: why didn't you go to... |
| 6 | sadness | I should be sleep, but im not! thinking about ... |
| 7 | worry | Hmmm. http://www.djhero.com/ is down |
| 8 | sadness | @charviray Charlene my love. I miss you |


`value_counts()` before remapping:

In [10]:
data2["sentiment"].value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64

`value_counts()` after remapping:

In [11]:
new_data2 = data2.copy()
new_data2["sentiment"] = data2["sentiment"]\
    .replace(["hate", "anger"], 0)\
    .replace(["worry"], 1)\
    .replace(["happiness", "fun", "relief"], 2)\
    .replace(["enthusiasm", "love"], 3)\
    .replace(["empty", "boredom", "sadness"], 4)\
    .replace(["surprise"], 5)\
    .replace(["neutral"], 6)

new_data2["sentiment"].value_counts()

6    8638
2    8511
1    8459
4    6171
3    4601
5    2187
0    1433
Name: sentiment, dtype: int64

Next, we concatenate the two datasets. The column names have to match first, so we rename `sentiment` to `label` and `content` to `text`.

In [12]:
print(f"Previous column names: {list(new_data2.columns)}")

new_data2 = new_data2.rename(columns={"sentiment": "label", "content": "text"})

print(f"Updated column names: {list(new_data2.columns)}")

Previous column names: ['sentiment', 'content']
Updated column names: ['label', 'text']


In [13]:
data = pd.concat([data1, new_data2], ignore_index=True)

print(f"Data shape:\t\t {data.shape}\n")
print(data["label"].value_counts())

Data shape:		 (60000, 2)

1    15220
2    10152
6     8638
4     8544
3     7310
0     7230
5     2906
Name: label, dtype: int64


In [14]:
data.head(9)

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |
| 5 | ive been feeling a little burdened lately wasn... | 0 |
| 6 | ive been taking or milligrams or times recomme... | 5 |
| 7 | i feel as confused about life as a teenager or... | 4 |
| 8 | i have been with petronas for years i feel tha... | 1 |


## Split into Train and Test

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data["text"], data["label"], test_size = 0.20, random_state = 0)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (48000,)
X_test shape: (12000,)
y_train shape: (48000,)
y_test shape: (12000,)


The `value_counts()` output above shows that the label counts differ widely (15,220 samples for label 1 versus 2,906 for label 5), so the data is imbalanced.

There are two common ways to deal with imbalanced data: undersampling (reducing the size of the majority classes) and oversampling (increasing the number of samples in the minority classes).

I use undersampling here, since it leaves the minority-class samples untouched: nothing is duplicated or synthesized.
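
As a rough illustration of the idea, random undersampling can be sketched in plain Python: group the samples by label, find the size of the smallest class, and keep that many randomly chosen samples from every class. This is only a conceptual sketch, not the imbalanced-learn implementation; the function name `random_undersample` and the toy data are made up for this example.

```python
import random
from collections import Counter, defaultdict


def random_undersample(texts, labels, seed=42):
    """Randomly drop samples so every class keeps as many as the smallest class."""
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    # Every class is cut down to the size of the rarest class.
    target = min(len(items) for items in by_class.values())
    rng = random.Random(seed)

    out_texts, out_labels = [], []
    for label in sorted(by_class):
        kept = rng.sample(by_class[label], target)  # sampling without replacement
        out_texts.extend(kept)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy data: class 0 has 5 samples, class 1 has 2.
texts = ["a", "b", "c", "d", "e", "f", "g"]
labels = [0, 0, 0, 0, 0, 1, 1]
new_texts, new_labels = random_undersample(texts, labels)
print(Counter(new_labels))  # Counter({0: 2, 1: 2})
```

The minority class comes through unchanged, while the majority class loses three of its five samples. That discarded data is the price of undersampling.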

In [16]:
from imblearn.under_sampling import RandomUnderSampler

# Create an undersampler
undersampler = RandomUnderSampler(random_state=42)

# Perform undersampling
X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train.to_numpy().reshape(-1, 1), y_train)
X_test_resampled, y_test_resampled = undersampler.fit_resample(X_test.to_numpy().reshape(-1, 1), y_test)

print("Train value_counts():")
print(y_train_resampled.value_counts(), "\n")

print("Test value_counts():")
print(y_test_resampled.value_counts())

Train value_counts():
0    2309
1    2309
2    2309
3    2309
4    2309
5    2309
6    2309
Name: label, dtype: int64 

Test value_counts():
0    597
1    597
2    597
3    597
4    597
5    597
6    597
Name: label, dtype: int64


In [17]:
X_train_resampled, y_train_resampled = list(X_train_resampled.flatten()), list(y_train_resampled)
X_test_resampled, y_test_resampled = list(X_test_resampled.flatten()), list(y_test_resampled)

## Install/Get Pretrained Model

The model used here is DistilBERT, which you can find [here](https://huggingface.co/distilbert-base-uncased).

In short, I chose DistilBERT because ...

In [18]:
!pip install transformers

import transformers
print(transformers.__version__)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2
4.29.2


In [19]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


## Finetune the Pretrained Model with Our Dataset

In [20]:
train_encodings = tokenizer(X_train_resampled, truncation=True, padding=True)
test_encodings = tokenizer(X_test_resampled, truncation=True, padding=True)

In [21]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train_resampled
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test_resampled
))

In [22]:
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=2,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    eval_steps=10,
)

In [23]:
import time

start_time = time.time()

with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=7)

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,           # evaluation dataset
)

end_time = time.time()
print(f"Elapsed time to instantiate: {end_time-start_time} seconds")

start_time = time.time()

trainer.train()

end_time = time.time()
print(f"Elapsed time to finetune: {end_time-start_time} seconds")

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'activation_13', 'vocab_transform', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Elapsed time to instantiate: 8.137639045715332 seconds
Elapsed time to finetune: 529.7035241127014 seconds


## Evaluate

Now that we have our final model, it is time to evaluate it and try it on some sentences.

The evaluation below gives an eval loss of about 1.19, well below what uniform random guessing over 7 classes would give, so the model has learned a useful signal.

In [35]:
start_time = time.time()

print(trainer.evaluate(test_dataset))

end_time = time.time()
print(f"Elapsed time to evaluate: {end_time-start_time} seconds")

{'eval_loss': 1.187558356132216}
Elapsed time to evaluate: 20.54791283607483 seconds


Running `trainer.predict` on the same `test_dataset` gives a similar eval_loss of about 1.19.

In [36]:
start_time = time.time()

print(trainer.predict(test_dataset))

end_time = time.time()
print(f"Elapsed time to predict: {end_time-start_time} seconds")

PredictionOutput(predictions=array([[ 6.0187836 , -0.37147436, -1.7852987 , ..., -0.15760252,
        -1.9130595 , -3.007426  ],
       [ 5.9017463 , -0.05276119, -1.2468693 , ..., -0.724301  ,
        -2.0450115 , -3.0926142 ],
       [ 6.1353617 , -0.6003872 , -1.7128254 , ..., -0.68330324,
        -2.3361878 , -3.1043491 ],
       ...,
       [-2.3351448 , -0.85917205,  1.313197  , ..., -0.13477433,
         0.50108933,  1.279402  ],
       [-1.3858484 , -0.76941574,  1.7462626 , ..., -0.8707756 ,
         0.70469   ,  0.22384073],
       [-2.2496705 ,  1.4597071 , -0.09453072, ...,  0.2934363 ,
         0.03188933,  1.4468437 ]], dtype=float32), label_ids=array([0, 0, 0, ..., 6, 6, 6], dtype=int32), metrics={'eval_loss': 1.192080898139313})
Elapsed time to predict: 13.79526662826538 seconds
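
A loss value is easier to interpret next to a baseline and an accuracy figure. The plain-Python sketch below computes the cross-entropy of a uniform guess over 7 classes and the accuracy obtained by taking the argmax of each logit row. The helper `accuracy_from_logits` and the toy logits are illustrative only. With the real output, one would presumably pass the `predictions` and `label_ids` fields of the `PredictionOutput` shown above.

```python
import math

NUM_CLASSES = 7

# Cross-entropy of a model that assigns probability 1/7 to every class.
uniform_loss = math.log(NUM_CLASSES)
print(f"Uniform-guess loss: {uniform_loss:.3f}")


def accuracy_from_logits(logits, label_ids):
    """Fraction of rows whose largest logit sits at the true label index."""
    correct = 0
    for row, label in zip(logits, label_ids):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(label_ids)


# Toy logits for three samples (illustrative values, not model output).
toy_logits = [
    [6.0, -0.4, -1.8, 0.1, -0.2, -1.9, -3.0],
    [-2.3, -0.9, 1.3, 0.2, -0.1, 0.5, 1.2],
    [-2.2, 1.5, -0.1, 0.4, 0.3, 0.0, 1.4],
]
toy_labels = [0, 6, 6]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(f"Accuracy: {toy_accuracy:.2f}")
```

In the toy data only the first row is classified correctly, so the accuracy is one in three.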


Next, we feed some new sentences to the model and check whether its predictions look right.

In [111]:
import random

new_data = [
    "love you",
    "Spent the evening exploring a new neighborhood, discovering hidden gems and unique shops.", 
    "i hate that man so bad", 
    "Visited a local farmers market and indulged in fresh produce and homemade goodies. ",
    "I don't know what to do right now", 
    "Attended an art exhibition today and was captivated by the creativity and talent on display.",
    "i am so bored right now!",
    "Caught up with an old friend over lunch today. It's always nice to reconnect and share stories",
    "Knowledge of God's Word is a great antidote to idolatry",
    "God preordained, for his own glory and the display of His attributes of mercy and justice, a part of the human race, without any merit of their own, to eternal salvation, and another part, in just punishment of their sin, to eternal damnation.",
    "I honor Jesus",
    "Our hearts are restless until they find their rest in God.", 
    "Just got the most amazing surprise from my friends! I'm still in shock and can't stop smiling. Feeling incredibly grateful right now.",
    "When your favorite band announces a surprise album drop and you're not mentally prepared for the eargasm you're about to experience. 🎵🔥",
    "Just witnessed the most amazing sunset while strolling on the beach. Mother Nature never fails to surprise and awe me. 🌅😍",
    "Just got a promotion at work! I wasn't expecting it at all, and now I'm on cloud nine. Hard work pays off, and surprises make it even sweeter. 🎉💼 #careeradvancement #grateful",
    "Can't believe my friends pulled off the ultimate surprise party for me! They totally got me good. Feeling loved and grateful. ❤️🥳 #bestfriends #surpriseparty",
    "Just watched a horror movie alone in the dark. Now every little sound in my house has me on edge. 😱",
    "Thought I saw a figure standing in the mirror behind me when I turned off the lights. Not sure I'll be sleeping well tonight. 😱 #mirrorfears #paranoia",
    "Taking a walk in the park and enjoying the beautiful weather. Nature has a way of bringing calm and tranquility. #outdoortherapy #naturewalk",
    "To know God's purpose for us, we must be acquainted with His Word.",
    "The human heart is a factory of idols."
]

# Placeholder labels: the dataset needs one label per sentence, but they do not affect the predicted logits.
# Each value must be a valid class id (0 to 6), so we simply use 0 everywhere.
y_train_dummy = [0] * len(new_data)
print(y_train_dummy, "\n")

new_data_encodings = tokenizer(new_data, truncation=True, padding=True)
new_data_tensor = tf.data.Dataset.from_tensor_slices((
    dict(new_data_encodings),
    y_train_dummy
))
predictions = trainer.predict(new_data_tensor)
print(predictions)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

PredictionOutput(predictions=array([[-1.3593085 , -0.34572324, -0.23599069,  3.3544414 ,  0.14567883,
        -0.03964495, -1.3139923 ],
       [-2.990524  , -0.49826095,  2.6572907 ,  0.65759295, -1.0258245 ,
         0.61485183,  0.45214778],
       [ 3.1735146 ,  0.23068732, -2.5453503 , -0.45927492,  1.2206827 ,
        -1.2267896 , -1.6585865 ],
       [-2.0790038 , -0.17262484,  0.9796052 ,  0.0552711 ,  0.53632915,
        -0.58838654,  1.259444  ],
       [-1.2653323 ,  2.475512  , -1.5856332 , -1.3685074 ,  1.100146  ,
        -0.41641656,  0.36778778],
       [-2.8989956 , -0.2959767 ,  0.41437384, -0.14811608, -0.71310055,
         3.7053812 , -0.44438475],
       [ 0.93066806,  0.51144147, -2.2258465 , -1.3400166 ,  2.7208338 ,
         0.04363485, -1.3046426 ],
       [-2.7211478 , -0.31621185,  2.0790792 ,  1.0823683 , -0.07675532,
         0.26302513, -0.16932417],
       [-1.3484821 ,  0.94806945,  0.8

In [112]:
print(predictions[0].shape, "\n")
print(predictions)

(22, 7) 

PredictionOutput(predictions=array([[-1.3593085 , -0.34572324, -0.23599069,  3.3544414 ,  0.14567883,
        -0.03964495, -1.3139923 ],
       [-2.990524  , -0.49826095,  2.6572907 ,  0.65759295, -1.0258245 ,
         0.61485183,  0.45214778],
       [ 3.1735146 ,  0.23068732, -2.5453503 , -0.45927492,  1.2206827 ,
        -1.2267896 , -1.6585865 ],
       [-2.0790038 , -0.17262484,  0.9796052 ,  0.0552711 ,  0.53632915,
        -0.58838654,  1.259444  ],
       [-1.2653323 ,  2.475512  , -1.5856332 , -1.3685074 ,  1.100146  ,
        -0.41641656,  0.36778778],
       [-2.8989956 , -0.2959767 ,  0.41437384, -0.14811608, -0.71310055,
         3.7053812 , -0.44438475],
       [ 0.93066806,  0.51144147, -2.2258465 , -1.3400166 ,  2.7208338 ,
         0.04363485, -1.3046426 ],
       [-2.7211478 , -0.31621185,  2.0790792 ,  1.0823683 , -0.07675532,
         0.26302513, -0.16932417],
       [-1.3484821 ,  0.94806945,  0.86371267, -0.48213154, -0.49019268,
         0.35541257, -0.

In [113]:
# Define the emotion labels
emotion_labels = ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise', "neutral"]

# Convert logits to probabilities and get the predicted label
probs = tf.nn.softmax(predictions[0], axis=-1)
predicted_label = tf.argmax(probs, axis=-1).numpy()
print(f"Predicted labels:\n{predicted_label}\n")

for i in range(len(probs)):
    print(f"[{emotion_labels[predicted_label[i]]}] >> '{new_data[i]}'")

Predicted labels:
[3 2 0 6 1 5 4 2 1 1 2 4 2 5 5 2 3 4 1 2 6 5]

[love] >> 'love you'
[joy] >> 'Spent the evening exploring a new neighborhood, discovering hidden gems and unique shops.'
[anger] >> 'i hate that man so bad'
[neutral] >> 'Visited a local farmers market and indulged in fresh produce and homemade goodies. '
[fear] >> 'I don't know what to do right now'
[surprise] >> 'Attended an art exhibition today and was captivated by the creativity and talent on display.'
[sadness] >> 'i am so bored right now!'
[joy] >> 'Caught up with an old friend over lunch today. It's always nice to reconnect and share stories'
[fear] >> 'Knowledge of God's Word is a great antidote to idolatry'
[fear] >> 'God preordained, for his own glory and the display of His attributes of mercy and justice, a part of the human race, without any merit of their own, to eternal salvation, and another part, in just punishment of their sin, to eternal damnation.'
[joy] >> 'I honor Jesus'
[sadness] >> 'Our hearts are

In [47]:
# Class order: 'anger', 'fear', 'joy', 'love', 'sadness', 'surprise', 'neutral'
probs[1]

<tf.Tensor: shape=(7,), dtype=float32, numpy=
array([0.80349857, 0.0423576 , 0.00263824, 0.02124636, 0.11399365,
       0.00986182, 0.00640369], dtype=float32)>
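
The softmax step can be checked by hand without TensorFlow. The sketch below is plain Python with a hypothetical `softmax` helper. It takes the logits printed earlier for "i hate that man so bad" and should give probabilities close to the array above, with most of the mass on 'anger'.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


emotion_labels = ["anger", "fear", "joy", "love", "sadness", "surprise", "neutral"]

# Logits copied from the prediction output for "i hate that man so bad".
logits = [3.1735146, 0.23068732, -2.5453503, -0.45927492,
          1.2206827, -1.2267896, -1.6585865]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(emotion_labels[best], f"{probs[best]:.2f}")
```

Softmax preserves the ordering of the logits, so `argmax` over the raw logits gives the same label. The probabilities are useful mainly for judging how confident a prediction is.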

## Save Our New Model

In [30]:
trainer.save_model('mclass_model_distilbert')

## Load Our Model

In [32]:
model = TFDistilBertForSequenceClassification.from_pretrained("mclass_model_distilbert")
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Some layers from the model checkpoint at mclass_model_distilbert were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at mclass_model_distilbert and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [33]:
# create a dummy label column
y_train_dummy = [3, 1, 4, 1]

new_data = ["love you", "i hate that man so bad", "I don't know what to do right now", "i am so bored right now!"]
new_data_encodings = tokenizer(new_data, truncation=True, padding=True)
new_data_tensor = tf.data.Dataset.from_tensor_slices((
    dict(new_data_encodings),
    y_train_dummy
))
predictions = model.predict(new_data_tensor)
predictions



TFSequenceClassifierOutput(loss=None, logits=array([[-1.3593097 , -0.3457232 , -0.2359908 ,  3.3544421 ,  0.14567849,
        -0.03964442, -1.3139918 ],
       [ 3.1735144 ,  0.2306873 , -2.54535   , -0.45927522,  1.2206829 ,
        -1.2267894 , -1.6585863 ],
       [-1.2653328 ,  2.4755108 , -1.5856324 , -1.3685071 ,  1.1001453 ,
        -0.4164165 ,  0.36778894],
       [ 0.9306681 ,  0.5114414 , -2.225846  , -1.3400164 ,  2.7208333 ,
         0.04363425, -1.3046423 ]], dtype=float32), hidden_states=None, attentions=None)

In [34]:
# Define the emotion labels (7 classes, in the same order as the training label ids)
emotion_labels = ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise', 'neutral']

# Convert logits to probabilities and get the predicted label for each sentence
probs = tf.nn.softmax(predictions.logits, axis=-1)
predicted_labels = tf.argmax(probs, axis=-1).numpy()

for sentence, label_id in zip(new_data, predicted_labels):
    print(f"The predicted label for the sentence '{sentence}' is {emotion_labels[label_id]}.")

The predicted label for the sentence 'love you' is love.
The predicted label for the sentence 'i hate that man so bad' is anger.
The predicted label for the sentence 'I don't know what to do right now' is fear.
The predicted label for the sentence 'i am so bored right now!' is sadness.
