<a href="https://colab.research.google.com/github/abhi-11nav/Text-Emotion-Detection/blob/main/Text_Emotion_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [69]:
# Importing the necessary libraries 

import pandas as pd
import numpy as np 

In [47]:
# Cloning the github repository 

!git clone https://github.com/abhi-11nav/Text-Emotion-Detection.git

fatal: destination path 'Text-Emotion-Detection' already exists and is not an empty directory.


In [48]:
# Importing data

data = pd.read_csv("/content/Text-Emotion-Detection/tweet_emotions.csv")

In [49]:
data.head()

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,wants to hang out with friends SOON!
4,1956968416,neutral,@dannycastillo We want to trade with someone w...



In [50]:
# Let us drop the tweet id

data.drop("tweet_id", axis=1, inplace=True)

In [51]:
data.head()

Unnamed: 0,sentiment,content
0,empty,@tiffanylue i know i was listenin to bad habi...
1,sadness,Layin n bed with a headache ughhhh...waitin o...
2,sadness,Funeral ceremony...gloomy friday...
3,enthusiasm,wants to hang out with friends SOON!
4,neutral,@dannycastillo We want to trade with someone w...


In [52]:
# Let us check if the tweet has any missing values 

data.isna().any()

sentiment    False
content      False
dtype: bool

No missing values

In [53]:
# Let us check the number of categories in sentiment variable

data['sentiment'].value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64

Since the data is imbalanced, we will deal with that next.

Data Imbalance

### Eliminating the last two categories of sentiment as they are least represented. 

In [61]:
# Dropping the two least-represented categories

# Appending indexes to remove
indexes_to_remove = []


for index in data[data['sentiment']=="boredom"].index:
  indexes_to_remove.append(index)

for index in data[data['sentiment']=="anger"].index:
  indexes_to_remove.append(index)

In [63]:
len(indexes_to_remove)

289

In [71]:
data.drop(indexes_to_remove, inplace=True, axis=0)
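
The index-collecting loops above can also be expressed as a single boolean mask. A minimal sketch, assuming a toy frame with the same `sentiment` and `content` columns (the rows and the names `toy`, `rare_labels`, and `filtered` are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the tweet data; the rows are illustrative only.
toy = pd.DataFrame({
    "sentiment": ["neutral", "boredom", "worry", "anger", "neutral"],
    "content": ["a", "b", "c", "d", "e"],
})

rare_labels = ["boredom", "anger"]

# Keep every row whose label is NOT one of the rare categories.
filtered = toy[~toy["sentiment"].isin(rare_labels)]

print(filtered["sentiment"].value_counts())
```

The mask avoids building a list of indexes by hand and leaves the original frame untouched.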

In [111]:
data["sentiment"].value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
Name: sentiment, dtype: int64

In [112]:
labels = [label for label in data["sentiment"].unique()]

In [113]:
balanced_df = pd.DataFrame()

for label in labels: 
  balanced_df = pd.concat([data[data["sentiment"]==label].sample(759),balanced_df], axis=0)

In [114]:
balanced_df["sentiment"].value_counts()

relief        759
happiness     759
hate          759
fun           759
love          759
surprise      759
worry         759
neutral       759
enthusiasm    759
sadness       759
empty         759
Name: sentiment, dtype: int64

Now we have a balanced dataset.
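
The undersampling above hard-codes 759, the size of the smallest remaining class, and draws a different sample on every run. A minimal sketch of the same idea with the class size computed from the data and a fixed seed, assuming pandas 1.1 or later for `groupby(...).sample` (the toy frame and the names `n_min` and `balanced` are illustrative, not from the notebook):

```python
import pandas as pd

# Toy imbalanced data: 5 rows of "a", 3 of "b", 2 of "c".
toy = pd.DataFrame({
    "sentiment": ["a"] * 5 + ["b"] * 3 + ["c"] * 2,
    "content": [f"tweet {i}" for i in range(10)],
})

# Size of the smallest class; every class is cut down to this many rows.
n_min = toy["sentiment"].value_counts().min()

# Draw n_min rows per class; random_state makes the draw repeatable.
balanced = toy.groupby("sentiment").sample(n=n_min, random_state=0)

# Shuffle the rows and rebuild a clean 0..n-1 index in one step.
balanced = balanced.sample(frac=1, random_state=0).reset_index(drop=True)

print(balanced["sentiment"].value_counts())
```

Passing `drop=True` to `reset_index` avoids creating the extra `index` column that otherwise has to be dropped afterwards.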

In [117]:
# shuffling samples and resetting indexes

balanced_df = balanced_df.sample(len(balanced_df))

In [118]:
balanced_df.reset_index(inplace=True)

In [119]:
balanced_df.head()

Unnamed: 0,index,sentiment,content
0,29040,surprise,@dorkchops WOOHOOO very cool see I knew u woul...
1,29426,worry,"John, are you sure we aren't mtb?"
2,37330,enthusiasm,"i want to wake up early, and get a coffee tomo..."
3,5270,surprise,@NHL10 I was hoping for a better trailer
4,5965,relief,Managed to finally get through to someone who ...


In [122]:
balanced_df.drop("index", inplace=True, axis=1)

In [132]:
# Changing the name of the data frame

data = balanced_df

In [133]:
# Let us look at the sentences

data['content'][0]

'@dorkchops WOOHOOO very cool see I knew u would get to see her'

In [134]:
data['content'][1]

"John, are you sure we aren't mtb?"

Text Preprocessing

In [135]:
# Importing libraries

import re 

import nltk 
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [136]:
def text_preprocess(dataset,list_name):
  # Load the stop-word list once; a set gives fast membership tests
  stop_words = set(stopwords.words('english'))

  # Replace every non-letter character in the text column (position 1) with a space
  for i in range(dataset.shape[0]):
    list_name.append(re.sub('[^a-zA-Z]',' ',str(dataset.iloc[i,1])))

  print("Number and other symbols eliminated from the text")

  # Collapse repeated whitespace and convert to lowercase
  for x in range(len(list_name)):
    list_name[x] = " ".join(list_name[x].split()).lower()

  print("Text reorganized and converted to small letter")

  for index in range(len(list_name)):
    # Remove stop words, then lemmatize the remaining words
    kept_words = [word for word in list_name[index].split() if word not in stop_words]
    list_name[index] = " ".join(lemmatizer.lemmatize(word) for word in kept_words)

In [137]:
sentences = []

text_preprocess(data,sentences)

Number and other symbols eliminated from the text
Text reorganized and converted to small letter


In [138]:
p_data = pd.concat([pd.DataFrame(np.array(sentences), columns=["Content"]), data['sentiment']], axis=1)

In [139]:
p_data.head()

Unnamed: 0,Content,sentiment
0,dorkchops woohooo cool see knew u would get see,surprise
1,john sure mtb,worry
2,want wake early get coffee tomorrow today goin...,enthusiasm
3,nhl hoping better trailer,surprise
4,managed finally get someone left message earli...,relief


Text preprocessing done
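
To see what the cleaning steps do to a single tweet, here is a minimal sketch using only the standard library. The stop-word set is a tiny hand-picked assumption (the notebook uses NLTK's English list), `clean_text` is a hypothetical helper, and lemmatization is left out:

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"i", "a", "the", "to", "was", "for", "you", "are", "we"}

def clean_text(text, stop_words=STOP_WORDS):
    # Replace every non-letter character with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace, which also collapses repeated spaces.
    tokens = letters_only.lower().split()
    # Drop stop words; a set makes each membership test cheap.
    return " ".join(t for t in tokens if t not in stop_words)

print(clean_text("@NHL10 I was hoping for a better trailer"))
# nhl hoping better trailer
```

A tweet made up only of digits, symbols, or stop words comes out as an empty string, which is why empty token lists have to be handled later.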

Converting text to vectors 

Word2vec

In [141]:
# Importing necessary libraries

import gensim

from gensim.models import Word2Vec

from tqdm import tqdm

from nltk import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [142]:
# Words list

words_list = []

# looping through to append words
for index in range(len(sentences)):
  words_list.append(nltk.word_tokenize(sentences[index]))

print(len(words_list)," length of sentences")

8349  length of sentences


In [143]:
empty_lists = []

for i,wl in enumerate(words_list):
  if not wl:
    empty_lists.append(i)

print("The number of empty lists are: ", len(empty_lists))

The number of empty lists are:  4


Since there are 4 empty lists, we will combine the token lists with the labels and drop those 4 rows.

In [144]:
# Let us combine the dataset and get rid of any null values that may have occurred after preprocessing

preprocessed_data = pd.concat([pd.DataFrame(np.array(words_list)),pd.DataFrame(data['sentiment'])], axis=1)



In [145]:
# Checking for null values 

preprocessed_data.isna().any()

0            False
sentiment    False
dtype: bool

In [146]:
# We have empty lists that we have to get rid of, and their indexes are stored in empty_lists

# Verifying elements from the list

for indexes in empty_lists:
  print(preprocessed_data.iloc[indexes,0])

[]
[]
[]
[]


There we go, our empty lists. 

In [147]:
preprocessed_data.drop(empty_lists, axis=0, inplace=True)

In [148]:
word_lists = [lists for lists in preprocessed_data.iloc[:,0]]

In [149]:
model = gensim.models.Word2Vec(words_list, window=5, min_count = 2)

In [150]:
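# Empty list 
X = []

# Looping through the token lists and averaging the vectors of in-vocabulary words
# (`word in model.wv` is a fast vocabulary lookup and avoids scanning the index2word list)
for words in tqdm(word_lists):
  X.append(np.mean([model.wv[word] for word in words if word in model.wv], axis=0))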

100%|██████████| 8345/8345 [00:02<00:00, 4156.31it/s]


In [151]:
# Converting them to arrays

X = np.array(X)
y = preprocessed_data['sentiment']



In [152]:
labels = []
corresponding_num = []

for ind,lab in enumerate(y.unique()):
  labels.append(lab)
  corresponding_num.append(ind)

In [153]:
encodings = [val for val in y]

In [154]:
for i,value in enumerate(encodings):
  for ind,unique in enumerate(labels):
    if value==unique:
      encodings[i] = ind

In [155]:
encodings = np.array(encodings)

In [156]:
y = encodings
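
The nested loops above assign each label the position of its first appearance. A minimal sketch of the same mapping with a dictionary, using made-up labels (the names `labels_seen`, `label_to_id`, and `encoded` are illustrative):

```python
labels_seen = ["surprise", "worry", "enthusiasm", "surprise", "relief"]

# Assign each distinct label an integer in order of first appearance.
label_to_id = {}
for lab in labels_seen:
    if lab not in label_to_id:
        label_to_id[lab] = len(label_to_id)

# Look every label up in the mapping; one pass, no inner loop.
encoded = [label_to_id[lab] for lab in labels_seen]

print(encoded)  # [0, 1, 2, 0, 3]
```

Keeping `label_to_id` around also makes it easy to translate predicted integers back into emotion names later.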

Checking types

In [157]:
# Converting all the arrays to same data type

X = np.array([val.astype(np.float64) for val in X])



Checking for null values in the array

In [158]:
pd.DataFrame(X).isna().sum()

0    90
dtype: int64

Found 90 null values
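
These nulls most likely come from sentences in which no token is in the Word2Vec vocabulary (for example, every word occurs fewer than `min_count` times), so the mean is taken over an empty list. A minimal sketch of that behaviour, assuming hypothetical 3-dimensional embeddings in a plain dictionary instead of `model.wv`:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for model.wv.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}

def sentence_vector(tokens, vectors, dim=3):
    # Collect the vectors of tokens that are in the vocabulary.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No in-vocabulary token: return NaNs explicitly instead of
        # letting np.mean warn about an empty input.
        return np.full(dim, np.nan)
    # Element-wise average across the known token vectors.
    return np.mean(known, axis=0)

print(sentence_vector(["good", "day"], toy_vectors))
print(sentence_vector(["unseen"], toy_vectors))
```

Returning a fixed-length NaN vector keeps every row the same shape, so the result can be stacked into a regular 2-D array and filtered with `np.isnan`.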

In [159]:
# Let us combine the dataset and get rid of any null values that may have occurred after preprocessing

vector_data = pd.concat([pd.DataFrame(X),pd.DataFrame(y)], axis=1)

In [160]:
vector_data.head()

Unnamed: 0,0,0.1
0,"[0.15683357417583466, 0.04605188965797424, -0....",0
1,"[0.07509304583072662, 0.023692592978477478, -0...",1
2,"[0.10748273879289627, 0.029702097177505493, -0...",2
3,"[0.04636358842253685, 0.012991226278245449, -0...",0
4,"[0.09467421472072601, 0.02797847054898739, -0....",3


In [161]:
vector_data.isna().any()

0     True
0    False
dtype: bool

Dropping all the null values

In [162]:
vector_data.dropna(inplace=True)

In [163]:
vector_data.shape

(8255, 2)

In [164]:
X = np.array([feat for feat in vector_data.iloc[:,0]])
y = np.array([label for label in vector_data.iloc[:,1]])

In [165]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X,y, train_size = 0.93, random_state= 12)

In [166]:
from sklearn.naive_bayes import GaussianNB

In [167]:
gnb = GaussianNB()

gnb.fit(train_X, train_y)

GaussianNB()

In [168]:
predictions = gnb.predict(test_X)

In [169]:
from sklearn.metrics import accuracy_score

In [170]:
score = accuracy_score(test_y, predictions)

In [171]:
print("And the final score is ...... ..... ...", score)

And the final score is ...... ..... ... 0.11937716262975778
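
To put this score in context, it helps to compare it with random guessing. A rough sketch, assuming the 11 remaining classes are still approximately equal in size after the empty and null rows were dropped:

```python
# 11 classes remain after "boredom" and "anger" were removed.
n_classes = 11

# With roughly equal class sizes, uniform random guessing is right 1/n of the time.
chance_accuracy = 1 / n_classes

reported_accuracy = 0.11937716262975778  # score printed above

print(f"chance level: {chance_accuracy:.4f}")
print(f"lift over chance: {reported_accuracy - chance_accuracy:.4f}")
```

Under that assumption the Gaussian Naive Bayes model is only a few percentage points better than chance on averaged Word2Vec features.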


In [172]:
train_X.shape

(7677, 100)

In [173]:
train_y.shape

(7677,)

Converting to categories

In [181]:
from keras.utils import to_categorical

In [182]:
train_y = to_categorical(train_y,13)

In [183]:
# Converting test_y to one-hot vectors

test_y = to_categorical(test_y,13)
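
One-hot encoding turns each integer label into a row with a single 1. A minimal NumPy sketch of the idea on a toy label array (`y_int`, `num_classes`, and `one_hot` are illustrative names, and 4 classes is an assumption for the toy data):

```python
import numpy as np

y_int = np.array([0, 1, 2, 0, 3])
num_classes = 4  # assumption for this toy example

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix with the labels encodes them all at once.
one_hot = np.eye(num_classes, dtype=np.float32)[y_int]

print(one_hot.shape)  # (5, 4)
print(one_hot[4])     # [0. 0. 0. 1.]
```

Note that the notebook passes 13 as the number of classes although only 11 categories remain after `boredom` and `anger` were removed, so two of the 13 columns are always zero.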

TERM FREQUENCY - INVERSE DOCUMENT FREQUENCY

In [198]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [202]:
vectorizer = TfidfVectorizer()

In [206]:
words = vectorizer.fit_transform(sentences)

In [207]:
# fit_transform returns a SciPy sparse matrix; its dense conversion method is toarray()
words.toarray()
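
`TfidfVectorizer.fit_transform` returns a SciPy sparse matrix, and the method that converts it to a dense NumPy array is `toarray()`. A minimal sketch on three made-up documents (`docs`, `tfidf`, and `dense` are illustrative names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three short, already-cleaned example documents.
docs = ["nhl hoping better trailer", "john sure mtb", "hoping better day"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: documents x vocabulary

dense = tfidf.toarray()                  # dense NumPy array of the same shape

print(dense.shape)
print(sorted(vectorizer.vocabulary_))
```

For a large vocabulary the dense array can be very big, so it is usually worth keeping the sparse matrix when the downstream estimator accepts it.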

## LSTM RNN MODEL

Implementing a stacked Long Short-Term Memory (LSTM) recurrent neural network. `Bidirectional` is imported below, but the model defined here does not use it.

In [184]:
# Importing the necessary libraries

import tensorflow 
from tensorflow import keras

from keras.layers import Dense, Flatten, Input, LSTM, Bidirectional, Embedding, Dropout, CuDNNLSTM
from keras.models import Model, Sequential

In [185]:
train_X.shape[1:]

(100,)

The fluctuations are normal within certain limits and depend on the fact that you use a heuristic method but in your case they are excessive. Despite all the performance takes a definite direction and therefore the system works. From the graphs you have posted, the problem depends on your data so it's a difficult training. If you have already tried to change the learning rate try to change training algorithm. You would agree to test your data: first compute the Bayes error rate using a KNN (use the trick regression in case you need), in this way you can check whether the input data contain all the information you need. Then try the LSTM without the validation or dropout to verify that it has the ability to achieve the result for you necessary. If the training algorithm is not suitable you should have the same problems even without the validation or dropout. Just at the end adjust the training and the validation size to get the best result in the test set. Statistical learning theory is not a topic that can be talked about at one time, we must proceed step by step.


Source: https://stats.stackexchange.com/questions/345990/why-does-the-loss-accuracy-fluctuate-during-the-training-keras-lstm

In [186]:
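# The 100-dimensional sentence vector is fed as 100 time steps of 1 feature each
input = Input(shape=(100,1))
lstm = CuDNNLSTM(100, return_sequences=True)(input)
dropout = Dropout(0.2)(lstm)
lstm2 = CuDNNLSTM(100)(dropout)
dropout2 = Dropout(0.2)(lstm2)
# Fully connected layer (a Dense layer, not an LSTM)
dense = Dense(100, activation="relu")(dropout2)
dropout3 = Dropout(0.2)(dense)
# Softmax over the 13 one-hot columns produced by to_categorical
prediction = Dense(13, activation="softmax")(dropout3)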

In [187]:
# Model

model = Model(inputs = input, outputs = prediction)

In [188]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 100, 1)]          0         
                                                                 
 cu_dnnlstm (CuDNNLSTM)      (None, 100, 100)          41200     
                                                                 
 dropout (Dropout)           (None, 100, 100)          0         
                                                                 
 cu_dnnlstm_1 (CuDNNLSTM)    (None, 100)               80800     
                                                                 
 dropout_1 (Dropout)         (None, 100)               0         
                                                                 
 dense (Dense)               (None, 100)               10100     
                                                                 
 dropout_2 (Dropout)         (None, 100)               0     

In [189]:
# Setting the learning rate for the optimizer. 

adam_optimizer = keras.optimizers.Adam(learning_rate=1e-3, decay=1e-6)

# Compiling the model

model.compile(optimizer=adam_optimizer, loss="categorical_crossentropy", metrics="accuracy")

Keras callbacks

In [190]:
from keras.callbacks import EarlyStopping, ModelCheckpoint

In [191]:
save_best = ModelCheckpoint("lstm_model.h5", save_best_only=True)

In [192]:
train_X.shape

(7677, 100)

In [193]:
train_y.shape

(7677, 13)

In [194]:
test_X.shape

(578, 100)

In [195]:
test_y.shape

(578, 13)
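
The input layer is declared with shape `(100, 1)`, while `train_X` and `test_X` are two-dimensional. A minimal sketch of making the extra axis explicit before training, using a random stand-in array instead of the real features (`features` and `sequences` are illustrative names):

```python
import numpy as np

# Random stand-in with the same feature width as train_X (100 values per sample).
features = np.random.default_rng(0).normal(size=(8, 100))

# Add a trailing axis: (samples, 100) -> (samples, 100, 1),
# i.e. 100 time steps with one feature each.
sequences = features[..., np.newaxis]

print(sequences.shape)  # (8, 100, 1)
```

Reshaping explicitly documents that each of the 100 embedding dimensions is being treated as one time step, which is a modelling choice and not something the averaged Word2Vec vectors require.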

In [196]:
model.fit(train_X, train_y, validation_data=(test_X,test_y),epochs=1000,callbacks=[save_best])

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000
Epoch 16/1000
Epoch 17/1000
Epoch 18/1000
Epoch 19/1000
Epoch 20/1000
Epoch 21/1000
Epoch 22/1000
Epoch 23/1000
Epoch 24/1000
Epoch 25/1000
Epoch 26/1000
Epoch 27/1000
Epoch 28/1000
Epoch 29/1000
Epoch 30/1000
Epoch 31/1000
Epoch 32/1000
Epoch 33/1000
Epoch 34/1000
Epoch 35/1000
Epoch 36/1000
Epoch 37/1000
Epoch 38/1000
Epoch 39/1000
Epoch 40/1000
Epoch 41/1000
Epoch 42/1000
Epoch 43/1000
Epoch 44/1000
Epoch 45/1000
Epoch 46/1000
Epoch 47/1000
Epoch 48/1000
Epoch 49/1000
Epoch 50/1000
Epoch 51/1000
Epoch 52/1000
Epoch 53/1000
Epoch 54/1000
Epoch 55/1000
Epoch 56/1000
Epoch 57/1000
Epoch 58/1000
Epoch 59/1000

KeyboardInterrupt: ignored
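
The run above was interrupted by hand at epoch 59. `EarlyStopping` is imported earlier but never passed to `fit`. The patience rule behind that kind of callback can be sketched in plain Python. This is a simplified illustration and not the Keras implementation; `early_stop_epoch` is a hypothetical helper and the losses are made up:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            # Stop once the loss has failed to improve `patience` times in a row.
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses for illustration.
losses = [2.40, 2.31, 2.29, 2.30, 2.33, 2.35, 2.36]
print(early_stop_epoch(losses, patience=3))  # 6
```

In the notebook, the equivalent would be to add an `EarlyStopping(monitor="val_loss", patience=...)` instance to the `callbacks` list next to `save_best`; the patience value is left open here because the source does not specify one.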