<a href="https://colab.research.google.com/github/abhi-11nav/Text-Emotion-Detection/blob/main/Text_Emotion_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Importing the necessary libraries 

import pandas as pd
import numpy as np 

In [2]:
!git clone https://github.com/abhi-11nav/Text-Emotion-Detection.git

Cloning into 'Text-Emotion-Detection'...
remote: Enumerating objects: 79, done.[K
remote: Counting objects: 100% (79/79), done.[K
remote: Compressing objects: 100% (77/77), done.[K
remote: Total 79 (delta 51), reused 0 (delta 0), pack-reused 0[K
Unpacking objects: 100% (79/79), done.


In [3]:
# Importing data

data = pd.read_csv("/content/Text-Emotion-Detection/tweet_emotions.csv")

In [4]:
data.head()

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,wants to hang out with friends SOON!
4,1956968416,neutral,@dannycastillo We want to trade with someone w...



In [5]:
# Let us drop the tweet id

data.drop("tweet_id", axis=1, inplace=True)

In [6]:
data.head()

Unnamed: 0,sentiment,content
0,empty,@tiffanylue i know i was listenin to bad habi...
1,sadness,Layin n bed with a headache ughhhh...waitin o...
2,sadness,Funeral ceremony...gloomy friday...
3,enthusiasm,wants to hang out with friends SOON!
4,neutral,@dannycastillo We want to trade with someone w...


In [7]:
# Let us check whether any column has missing values

data.isna().any()

sentiment    False
content      False
dtype: bool

No missing values

In [8]:
# Let us check the number of categories in sentiment variable

data['sentiment'].value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64

Since the data is imbalanced, we'll deal with that next.

Data Imbalance

### Eliminating the last two categories of sentiment as they are least represented. 

In [9]:
# Dropping the two least-represented categories (boredom and anger)

# Appending indexes to remove
indexes_to_remove = []


for index in data[data['sentiment']=="boredom"].index:
  indexes_to_remove.append(index)

for index in data[data['sentiment']=="anger"].index:
  indexes_to_remove.append(index)

In [10]:
len(indexes_to_remove)

289

In [11]:
data.drop(indexes_to_remove, inplace=True, axis=0)

In [12]:
data["sentiment"].value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
Name: sentiment, dtype: int64
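The index-collecting loops above can also be written as a single boolean mask. The following is a minimal sketch on a toy frame (the `toy` data and the `rare_labels` name are made up for illustration), assuming the same `sentiment` and `content` columns as the real dataset:

```python
import pandas as pd

# Toy stand-in for the tweet data; the real frame has the same two columns.
toy = pd.DataFrame({
    "sentiment": ["neutral", "anger", "worry", "boredom", "neutral"],
    "content": ["a", "b", "c", "d", "e"],
})

# Hypothetical name for the categories to drop.
rare_labels = ["boredom", "anger"]

# Keep every row whose label is NOT in rare_labels.
filtered = toy[~toy["sentiment"].isin(rare_labels)]
remaining_labels = sorted(filtered["sentiment"].unique())
print(remaining_labels)
```

The mask avoids building an explicit list of row indexes and leaves the original frame untouched.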

In [13]:
labels = [label for label in data["sentiment"].unique()]

In [14]:
balanced_df = pd.DataFrame()

for label in labels: 
  balanced_df = pd.concat([data[data["sentiment"]==label].sample(759),balanced_df], axis=0)

In [15]:
balanced_df["sentiment"].value_counts()

relief        759
happiness     759
hate          759
fun           759
love          759
surprise      759
worry         759
neutral       759
enthusiasm    759
sadness       759
empty         759
Name: sentiment, dtype: int64

 Now we have a balanced dataset
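The per-label sampling loop is a form of random undersampling: every class is cut down to the size of the smallest one. A possible alternative, sketched here on toy data (the frame and the seed are assumptions), uses `groupby(...).sample(...)` with a fixed `random_state` so the draw is repeatable, and `reset_index(drop=True)` so no extra `index` column has to be dropped afterwards:

```python
import pandas as pd

# Toy data with three unevenly sized classes.
toy = pd.DataFrame({
    "sentiment": ["a"] * 5 + ["b"] * 3 + ["c"] * 2,
    "content": [f"tweet {i}" for i in range(10)],
})

# Size of the smallest class (759 for "enthusiasm" in the notebook).
n_min = toy["sentiment"].value_counts().min()

# Draw n_min rows per class; random_state makes the draw repeatable.
balanced = toy.groupby("sentiment").sample(n=n_min, random_state=0)

# Shuffle and renumber the rows; drop=True discards the old index.
balanced = balanced.sample(frac=1, random_state=0).reset_index(drop=True)
class_counts = balanced["sentiment"].value_counts().to_dict()
print(class_counts)
```

Undersampling discards most of the larger classes (here roughly 8,350 of the original 40,000 rows survive), which is worth keeping in mind when judging the model's accuracy later.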

In [16]:
# shuffling samples and resetting indexes

balanced_df = balanced_df.sample(len(balanced_df))

In [17]:
balanced_df.reset_index(inplace=True)

In [18]:
balanced_df.head()

Unnamed: 0,index,sentiment,content
0,15160,sadness,OMFG my favourite jerk chicken place closed
1,30878,happiness,@ExocetAU i always have those for my Champions...
2,25803,worry,"@icyjoey don't frown my lil aussie, I still lo..."
3,18739,hate,o damn i just accidentally listened to rick ross
4,13953,neutral,The @Jonasbrothers 3d movie was amazing but a ...


In [19]:
balanced_df.drop("index", inplace=True, axis=1)

In [20]:
# Changing the name of the data frame

data = balanced_df

In [21]:
# Let us look at the sentences

data['content'][0]

'OMFG my favourite jerk chicken place closed'

In [22]:
data['content'][1]

'@ExocetAU i always have those for my Champions League parties  Tis awesome'

Text Preprocessing

In [23]:
# Importing libraries

import re 

import nltk 
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [24]:
def text_preprocess(dataset,list_name):
  # Build the stop-word set once; set lookups are much faster than
  # calling stopwords.words('english') for every single word
  stop_words = set(stopwords.words('english'))

  # Replace everything except letters with a space (column 1 holds the tweet text)
  for i in range(dataset.shape[0]):
    list_name.append(re.sub('[^a-zA-Z]',' ',str(dataset.iloc[i,1])))

  print("Number and other symbols eliminated from the text")

  # Collapse repeated whitespace and convert to lower case
  for x in range(len(list_name)):
    list_name[x] = " ".join(str(list_name[x]).split()).lower()

  print("Text reorganized and converted to small letter")

  # Remove stop words, then lemmatize the remaining words
  for index in range(len(list_name)):
    kept_words = [word for word in list_name[index].split() if word not in stop_words]
    list_name[index] = " ".join(lemmatizer.lemmatize(word) for word in kept_words)

In [25]:
sentences = []

text_preprocess(data,sentences)

Number and other symbols eliminated from the text
Text reorganized and converted to small letter


In [26]:
p_data = pd.concat([pd.DataFrame(np.array(sentences), columns=["Content"]), data['sentiment']], axis=1)

In [27]:
p_data.head()

Unnamed: 0,Content,sentiment
0,omfg favourite jerk chicken place closed,sadness
1,exocetau always champion league party ti awesome,happiness
2,icyjoey frown lil aussie still love muah,worry
3,damn accidentally listened rick ross,hate
4,jonasbrothers movie amazing little short wanted,neutral


Text preprocessing done
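The cleaning steps (letters only, lower case, stop-word removal, lemmatization) can also be expressed as one function applied per sentence. This is a minimal sketch, not the notebook's function: `clean_text` and `demo_stop_words` are hypothetical names, the stop-word set is a tiny made-up one, and the default lemmatizer is the identity so the example runs without NLTK data.

```python
import re

def clean_text(text, stop_words, lemmatize=lambda word: word):
    """Keep letters only, lowercase, drop stop words, lemmatize the rest."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", str(text))
    tokens = letters_only.lower().split()
    return " ".join(lemmatize(t) for t in tokens if t not in stop_words)

# Tiny hypothetical stop-word set, for illustration only.
demo_stop_words = {"my", "i", "the", "a"}

cleaned = clean_text("OMFG my favourite jerk chicken place closed", demo_stop_words)
print(cleaned)
```

In the notebook's setting one would presumably pass `set(stopwords.words('english'))` and `lemmatizer.lemmatize` as the two arguments.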

Reference : https://phdstatsphys.wordpress.com/2018/12/27/word2vec-how-to-train-and-update-it/

We have vectors stored in **vectors**

## One hot encoding - padding

In [28]:
unique_words = set()

for sent in sentences:
  for word in sent.split():
    unique_words.add(word)

In [29]:
unique_words = list(unique_words)

In [30]:
# The length of unique words will be vocabulary size

vocabulary_size = len(unique_words)

In [31]:
# Importing libraries for one hot encoding 

from tensorflow.keras.preprocessing.text import one_hot

In [32]:
sent_tokens = []

for sent in sentences:
  temp_list = []
  for word in sent.split():
    temp_list.append(word)
  
  sent_tokens.append(temp_list)

In [33]:
[word for word in sent_tokens[0]]

['omfg', 'favourite', 'jerk', 'chicken', 'place', 'closed']

In [34]:
sentences = [str(sent) for sent in sentences]

In [35]:
one_hot_vectors = []

for sent in sent_tokens:
  one_hot_vec = []
  for words in sent:
    one_hot_vec.append(one_hot(words,vocabulary_size)[0])
  
  one_hot_vectors.append(one_hot_vec)
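Note that Keras `one_hot` does not build a lookup table: it hashes each word into the range below `vocabulary_size`, so two different words can end up with the same index. Where collisions matter, an explicit word-to-index dictionary is one alternative. The sketch below is plain Python with hypothetical helper names (`build_vocab`, `encode`); index 0 is left free for padding.

```python
def build_vocab(tokenized_sentences):
    """Assign each distinct word an index starting at 1 (0 is kept for padding)."""
    word_to_index = {}
    for tokens in tokenized_sentences:
        for word in tokens:
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index) + 1
    return word_to_index

def encode(tokens, word_to_index):
    """Map tokens to indices; words missing from the vocabulary are skipped."""
    return [word_to_index[w] for w in tokens if w in word_to_index]

demo_tokens = [["omfg", "favourite", "jerk"], ["jerk", "chicken"]]
vocab = build_vocab(demo_tokens)
encoded = [encode(tokens, vocab) for tokens in demo_tokens]
print(encoded)
```

With this scheme the embedding layer would need `len(vocab) + 1` input rows, since the indices run from 1 to `len(vocab)` and 0 is the padding value.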

In [36]:
# Importing libraries necessary for padding sequence

from tensorflow.keras.preprocessing.sequence import pad_sequences

In [37]:
# Finding the sentence length 

max_len = 0

for sent in sent_tokens:
  if len(sent)>max_len:
    max_len = len(sent)

In [38]:
# Padding sequences 

embedded_docs = pad_sequences(one_hot_vectors,maxlen=max_len, padding='post')

In [39]:
embedded_docs[0]

array([12591,  6200,  3148,   520,  4444,   106,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0],
      dtype=int32)
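What `padding='post'` does can be sketched with NumPy alone: each sequence is copied into the left part of a zero-filled row. `pad_post` is a hypothetical helper, not the Keras function. One difference to be aware of: this sketch cuts over-long sequences at the end, whereas `pad_sequences` by default truncates from the front. In the notebook no truncation happens, because `maxlen` is the length of the longest sentence.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Right-pad (or cut at the end) integer sequences with zeros to length maxlen."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        padded[row, :len(trimmed)] = trimmed
    return padded

demo = pad_post([[5, 3], [7, 1, 4, 2]], maxlen=3)
print(demo)
```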

In [45]:
X = embedded_docs

In [40]:
max_len

27

In [41]:
# Grabbing the labels

y = p_data['sentiment']

labels = []
corresponding_num = []

for ind,lab in enumerate(y.unique()):
  labels.append(lab)
  corresponding_num.append(ind)

encodings = [val for val in y]

for i,value in enumerate(encodings):
  for ind,unique in enumerate(labels):
    if value==unique:
      encodings[i] = ind
      
encodings = np.array(encodings)

y = encodings

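Encoding y into one-hot vectors of size 13. Note that only 11 classes remain after dropping boredom and anger, so two of the 13 columns are never used; `len(labels)` would be the tighter choice, and the output layer of the model would have to match it.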

In [64]:
from tensorflow.keras.utils import to_categorical

In [65]:
y = to_categorical(y,13)
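The manual label loop and the fixed-size `to_categorical` call can be replaced by a version that derives the number of columns from the labels themselves. This is a sketch on a toy series (the data is made up), assuming pandas and NumPy only; `pd.factorize` numbers the labels in order of first appearance, which matches what the loop over `y.unique()` does.

```python
import numpy as np
import pandas as pd

demo_labels = pd.Series(["sadness", "happiness", "worry", "sadness"])

# codes[i] is the integer id of demo_labels[i];
# classes lists the labels in order of first appearance.
codes, classes = pd.factorize(demo_labels)

# One column per class actually present in the data.
one_hot_labels = np.eye(len(classes), dtype=np.float32)[codes]
print(one_hot_labels)
```

Keeping `classes` around also gives a direct way to turn a predicted index back into a label name.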

Train-test-split

In [67]:
from sklearn.model_selection import train_test_split

In [68]:
train_X, test_X, train_y, test_y = train_test_split(X,y, test_size=0.12, random_state=22)

In [69]:
train_X.shape

(7347, 27)

In [70]:
train_y.shape

(7347, 13)

In [71]:
test_X.shape

(1002, 27)

In [72]:
test_y.shape

(1002, 13)

## Bidirectional LSTM RNN MODEL

Implementing a bidirectional long short-term memory (LSTM) recurrent neural network.

In [73]:
# Importing the necessary libraries

import tensorflow 
from tensorflow import keras

from keras.layers import Dense, Flatten, Input, LSTM, Bidirectional, Embedding, Dropout, CuDNNLSTM
from keras.models import Model, Sequential

The fluctuations are normal within certain limits and depend on the fact that you use a heuristic method but in your case they are excessive. Despite all the performance takes a definite direction and therefore the system works. From the graphs you have posted, the problem depends on your data so it's a difficult training. If you have already tried to change the learning rate try to change training algorithm.

You would agree to test your data: first compute the Bayes error rate using a KNN (use the trick regression in case you need), in this way you can check whether the input data contain all the information you need. Then try the LSTM without the validation or dropout to verify that it has the ability to achieve the result for you necessary. If the training algorithm is not suitable you should have the same problems even without the validation or dropout.

Just at the end adjust the training and the validation size to get the best result in the test set. Statistical learning theory is not a topic that can be talked about at one time, we must proceed step by step.

Source: https://stats.stackexchange.com/questions/345990/why-does-the-loss-accuracy-fluctuate-during-the-training-keras-lstm

In [74]:
## Creating model

embedding_vector_features=300
model=Sequential()
model.add(Embedding(vocabulary_size,embedding_vector_features,input_length=max_len))
model.add(Bidirectional(CuDNNLSTM(100, return_sequences=True)))
model.add(Dropout(0.2))
model.add(Bidirectional(CuDNNLSTM(100)))
model.add(Dense(13,activation='softmax'))

In [75]:
# Importing library for optimizer
from keras import optimizers

# adam optimizer with custom learning rate
from keras.optimizers import Adam

optimizer = Adam(learning_rate=1e-3)


# Compiling the model
model.compile(optimizer = optimizer, loss="categorical_crossentropy", metrics=['accuracy'])

Keras callbacks

In [76]:
from keras.callbacks import EarlyStopping, ModelCheckpoint

In [77]:
patience = EarlyStopping(patience=5)

save_best = ModelCheckpoint("lstm_model.h5", save_best_only=True)
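`EarlyStopping(patience=5)` monitors the validation loss by default and stops training once it has failed to improve for five consecutive epochs. The stopping rule can be sketched in plain Python as follows; `early_stop_epoch` is a hypothetical helper and the loss values are invented for illustration, not taken from the training run.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement this epoch.
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Invented losses that never improve after the first epoch.
demo_losses = [2.30, 2.35, 2.41, 2.50, 2.62, 2.80, 2.95]
stop_at = early_stop_epoch(demo_losses, patience=5)
print(stop_at)
```

Under this rule, a run that ends after six epochs with a patience of five would be one whose validation loss was best in the very first epoch, which is usually a sign of rapid overfitting.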

In [83]:
model.fit(train_X, train_y,validation_data=(test_X, test_y),epochs=300, callbacks=(patience, save_best))

Epoch 1/300
Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300


<keras.callbacks.History at 0x7fb0ca614a60>
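The `fit` call uses the test set as validation data, so early stopping and checkpoint selection are both tuned on the same samples that are later used to judge the model. A common remedy is a three-way split with a separate validation set. The sketch below is one way to produce the index arrays with NumPy; `three_way_split`, the fractions, and the seed are all assumptions for illustration.

```python
import numpy as np

def three_way_split(n_samples, val_fraction, test_fraction, seed=0):
    """Return disjoint index arrays for train, validation, and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    n_val = int(round(n_samples * val_fraction))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100, val_fraction=0.1, test_fraction=0.1)
print(len(train_idx), len(val_idx), len(test_idx))
```

The arrays would then index `X` and `y` directly, for example `X[val_idx]` and `y[val_idx]` as `validation_data`, with the test rows held back until the final evaluation.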

In [84]:
import matplotlib.pyplot as plt

In [85]:
test_X[3]

array([13746,  8197,  3877, 11394,  8878,  6805, 12253,  6289,  1171,
        7986, 10534,  6552,   175,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0],
      dtype=int32)

In [86]:
test_y[3]

array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.], dtype=float32)

In [103]:
prediction = model.predict(test_X[3].reshape(1,-1))



In [104]:
np.argmax(prediction)

7
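The bare index is easier to read once it is mapped back to a label name. The sketch below shows the idea on invented values: `demo_labels` stands in for the notebook's `labels` list (whose actual order depends on the shuffle) and `demo_prediction` for one row of softmax output.

```python
import numpy as np

# Hypothetical label order; in the notebook it comes from the `labels` list.
demo_labels = ["sadness", "happiness", "worry"]

# Hypothetical softmax output for one sentence, shape (1, n_classes).
demo_prediction = np.array([[0.1, 0.7, 0.2]])

# Index of the most probable class, its name, and its probability.
predicted_index = int(np.argmax(demo_prediction, axis=1)[0])
predicted_label = demo_labels[predicted_index]
predicted_confidence = float(demo_prediction[0, predicted_index])
print(predicted_label, predicted_confidence)
```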

In [96]:
test

(27, 1)