## Comparing Single and Bi-Directional Long Short-Term Memory Units

In this notebook, we will be comparing two models - 

1. A Single LSTM unit 
2. A Bi-Directional LSTM unit

We compare the output of the two models by looking at the micro-average and macro-average values of the precision, recall and F1 scores.

HOW TO RUN - 
1. Select Cell from the list of Menu
2. Run All

Importing all essential libraries

In [76]:
# import your libraries here
import keras
import numpy as np
import pandas as pd
import scipy
import sklearn
import nltk
import warnings
warnings.filterwarnings('ignore')
from tensorflow.keras.utils import to_categorical
%matplotlib inline
import matplotlib.pyplot as plt 

In [77]:
#Reading the dataset
dataset = pd.read_csv('labeled_data.csv')
dataset.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


In [79]:
#Removing useless columns to only include the tweets and their labels
dt_transformed = dataset[['class', 'tweet']]
y = np.array(dt_transformed['class']) #output labels

Importing all NLTK libraries

In [65]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [81]:
#Data cleaning process. Cleaning the tweets in the dataframe and storing the cleaned text in a list.
corpus = []
for i in range(0, len(y)):
  text = re.sub('[^a-zA-Z]', ' ', dt_transformed['tweet'][i])
  text = text.lower() #coverting the text to lower case
  text = re.sub(r'\$\w*', '', text) #removing special characters
  text = re.sub(r'https?:\/\/.*[\r\n]*', '', text) #removing hyerlinks in the tweet
  text = re.sub(r'#', '', text) #removing hashtags in the tweet
  text = text.split()
  ps = PorterStemmer()
  all_stopwords = stopwords.words('english')
  all_stopwords.remove('not')
  text = [ps.stem(word) for word in text if not word in set(all_stopwords)] #removing stopwords and taking stem of the word
  text = ' '.join(text)
  corpus.append(text)

In [67]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras import regularizers

max_words = 5000 #setting a random maximum words count. 5000 found more optimal than 7000
#Max tweet length is 280 characters and as observed in DataVisualization, max tweets are less than 200 characters
max_len = 200 #setting a random

#Implementing keras' tokenizer and text_to_sentences to turn the 
#string data into 3D float vector embeddings to train the neural network on.
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
#using pad_sequences to pad words for sentences less than the max_words length
tweets = pad_sequences(sequences, maxlen=max_len)

In [82]:
#one-hot encoder to encode the three outputs - 0,1,2 into encoded vector outputs
labels = to_categorical(y)

In [84]:
from sklearn.model_selection import train_test_split
#creating training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tweets,labels, random_state=0)

In [85]:
from keras import layers
from keras.models import Sequential
#creating the model 
model_simple = Sequential()
model_simple.add(layers.Embedding(max_words, 20))
model_simple.add(layers.LSTM(15,dropout=0.5))
#since output layer has three classes
model_simple.add(layers.Dense(3,activation='softmax'))
#hyperparameters can be changed as per need. I am using adam optimizer for now as it is most common
model_simple.compile(optimizer='adam',loss='categorical_crossentropy', metrics=['accuracy'])
#number of epochs can be increased. I have kept it at 20 to reduce training time
model_simple.fit(X_train, y_train, epochs=20,validation_data=(X_test, y_test))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fcc4f2b9cd0>

In [86]:
#using evaluate function to get loss and validation accuracy obtained
simple_loss, simple_acc = model_simple.evaluate(X_test, y_test, verbose=2)
y_pred = model_simple.predict(X_test)

194/194 - 3s - loss: 0.4454 - accuracy: 0.8873 - 3s/epoch - 17ms/step


In [87]:
#converting the output predicted y, which is a vector of probabilities into a one-hot encoded output
for i in range(len(y_pred)):
  for j in range(3):
    if(y_pred[i][j]<0.5):
      y_pred[i][j]=0.0
    else:
      y_pred[i][j]=1.0

Output from a simple Single LSTM model ->

In [88]:
from sklearn.metrics import classification_report
classification_report(y_test, y_pred,zero_division=1)

'              precision    recall  f1-score   support\n\n           0       0.38      0.19      0.26       359\n           1       0.92      0.95      0.93      4800\n           2       0.83      0.83      0.83      1037\n\n   micro avg       0.89      0.89      0.89      6196\n   macro avg       0.71      0.66      0.67      6196\nweighted avg       0.87      0.89      0.88      6196\n samples avg       0.89      0.89      0.89      6196\n'

In [70]:
from keras import layers
from keras.models import Sequential
#creating the model 
model = Sequential()
model.add(layers.Embedding(max_words, 40, input_length=max_len))
#dropout can be varied as per efficiency. I have kept it at a standard 0.6
model.add(layers.Bidirectional(layers.LSTM(20,dropout=0.6)))
model.add(layers.Dense(3,activation='softmax'))
#I have used a different optimizer here, since I have already tried the adam optimizer
#Hyperparameters can be varied as per efficiency
model.compile(optimizer='rmsprop',loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=20,validation_data=(X_test, y_test))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [89]:
#using evaluate function to get loss and validation accuracy obtained
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=2)
y_pred = model.predict(X_test)

194/194 - 4s - loss: 0.2772 - accuracy: 0.9045 - 4s/epoch - 22ms/step


In [91]:
#converting the output predicted y, which is a vector of probabilities into a one-hot encoded output
for i in range(len(y_pred)):
  for j in range(3):
    if(y_pred[i][j]<0.5):
      y_pred[i][j]=0.0
    else:
      y_pred[i][j]=1.0

Output from a Bi-Directional LSTM Model ->

In [92]:
from sklearn.metrics import classification_report
classification_report(y_test, y_pred,zero_division=1)

'              precision    recall  f1-score   support\n\n           0       0.52      0.08      0.13       359\n           1       0.92      0.97      0.94      4800\n           2       0.87      0.87      0.87      1037\n\n   micro avg       0.91      0.90      0.90      6196\n   macro avg       0.77      0.64      0.65      6196\nweighted avg       0.89      0.90      0.88      6196\n samples avg       0.91      0.90      0.90      6196\n'