# TEXT CLASSIFICATION METHODS COMPARISON
Hi,
This is my second notebook on NLP, and it tackles the COVID-19 tweet text classification problem. My main goal is a comparative study of the different methods used for NLP text classification/sentiment analysis. Feel free to copy my notebook, make changes here and there, and let me know how it goes. I've included comments for each step of what I'm doing, and I'm open to suggestions/corrections, so do comment if you have any.

*This is going to be a long one, let's go!*

# Import Libraries and Reading the Dataset
Pretty straightforward: we'll start by importing the libraries, then use pandas to read the train and test datasets.

In [1]:

import numpy as np 
import pandas as pd 
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU,SimpleRNN
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D, Input
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
import random

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff



In [2]:
train = pd.read_csv('../input/covid-19-nlp-text-classification/Corona_NLP_train.csv', encoding = 'latin1') 
test = pd.read_csv('../input/covid-19-nlp-text-classification/Corona_NLP_test.csv', encoding = 'latin1')

# If you drop the encoding argument (as I did at first), pandas raises an error like: 'utf-8' codec can't decode byte <byte> in position <position>: unexpected end of data
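
To see why the `encoding` argument matters, here is a minimal sketch using a made-up byte string (not taken from the dataset). A byte that is valid Latin-1 but not valid UTF-8 triggers the same kind of decode error, while decoding it as Latin-1 succeeds.

```python
# Hypothetical sample: "café" encoded as Latin-1, where é is the single byte 0xE9.
raw = "café".encode("latin1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("utf-8 failed:", err.reason)

# Latin-1 maps every possible byte to a character, so this decode cannot fail.
decoded = raw.decode("latin1")
print(decoded)
```

Because Latin-1 never fails, it can also silently mis-decode text that was really written in another encoding, so it is worth eyeballing a few tweets after loading.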

In [3]:
train.head()

|  | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | 3802 | 48754 |  | 16-03-2020 | My food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 |  | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |


In [4]:
train.shape

(41157, 6)

Now that we've read our train and test data, let's get the maximum number of words in a tweet. We'll need this for padding (explained in a later section).

In [5]:
train['OriginalTweet'].apply(lambda x:len(str(x).split())).max()

64

Let's check out the unique output classes of the data. We'll store them in a variable to use for predictions.

In [6]:
label = train['Sentiment'].unique()
label

array(['Neutral', 'Positive', 'Extremely Negative', 'Negative',
       'Extremely Positive'], dtype=object)

# Convert categorical variable into dummy/indicator variables.

**Syntax:**

**pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)**

**Parameters:**

**data**: array-like, Series, or DataFrame. Data of which to get dummy indicators.

**prefix**: str, list of str, or dict of str, default None. String to append to DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

**prefix_sep**: str, default '_'. If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.

**dummy_na**: bool, default False. Add a column to indicate NaNs; if False, NaNs are ignored.

**columns**: list-like, default None. Column names in the DataFrame to be encoded. If columns is None, then all the columns with object or category dtype will be converted.

**sparse**: bool, default False. Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).

**drop_first**: bool, default False. Whether to get k-1 dummies out of k categorical levels by removing the first level.

**dtype**: dtype, default np.uint8 in the pandas version used here (newer pandas releases default to bool). Data type for new columns. Only a single dtype is allowed.

**Returns:** DataFrame. Dummy-coded data.

In [7]:
y = train['Sentiment'].values
y = pd.get_dummies(y)  # one-hot encode the five sentiment classes
print('Shape of label tensor:', y.shape)
y

Shape of label tensor: (41157, 5)
       Extremely Negative  Extremely Positive  Negative  Neutral  Positive
0                       0                   0         0        1         0
1                       0                   0         0        0         1
2                       0                   0         0        0         1
3                       0                   0         0        0         1
4                       1                   0         0        0         0
...                   ...                 ...       ...      ...       ...
41152                   0                   0         0        1         0
41153                   1                   0         0        0         0
41154                   0                   0         0        0         1
41155                   0                   0         0        1         0
41156                   0                   0         1        0         0

[41157 rows x 5 columns]
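
To make the one-hot idea concrete, here is a minimal plain-Python sketch of what `get_dummies` does conceptually. The `one_hot` helper and the four sample labels are hypothetical and only for illustration; the real pandas function handles many more cases.

```python
def one_hot(labels):
    """Return (columns, rows): one indicator column per distinct label."""
    # Columns are sorted alphabetically, like the column order in the output above.
    columns = sorted(set(labels))
    rows = [[1 if label == col else 0 for col in columns] for label in labels]
    return columns, rows

sample = ["Neutral", "Positive", "Positive", "Extremely Negative"]
columns, rows = one_hot(sample)
print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1, which is the form `categorical_crossentropy` expects for its targets.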


# Tokenization
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.

Here, we use the Tokenizer class from keras.preprocessing.text.

**Syntax:**

tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True, split=' ', char_level=False, oov_token=None,
    document_count=0, **kwargs
)

**num_words**: the maximum number of words to keep, based on word frequency.

**filters**: a string of characters to strip from the text (punctuation and special characters by default).

**lower**: whether to convert the text to lowercase.

**split**: the separator used to split the text into words (a space by default).

**char_level**: Boolean; if True, every character is treated as a token.

**oov_token**: if given, it is added to the word index and used to replace out-of-vocabulary words when converting text to sequences.

**document_count**: an integer count of the total number of documents that were used to fit the Tokenizer.


So, next we'll use this tokenizer to convert the text into sequences and, to ensure a uniform length, pad these sequences with 0s up to max_len.
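
Here is a rough plain-Python sketch of those two steps (building a frequency-ranked word index, then left-padding with zeros). It only mimics the default behaviour of the Keras `Tokenizer` and `pad_sequences` as I understand it; the helper names and the two sample sentences are made up.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    # Lowercase, replace filtered characters with spaces, then split on whitespace.
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def fit_word_index(texts):
    # The most frequent word gets index 1; index 0 is reserved for padding.
    counts = Counter(word for t in texts for word in split_words(t))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded(texts, word_index, max_len):
    padded = []
    for t in texts:
        seq = [word_index[w] for w in split_words(t) if w in word_index]
        seq = seq[-max_len:]                             # keep the last max_len tokens
        padded.append([0] * (max_len - len(seq)) + seq)  # pad on the left with zeros
    return padded

docs = ["Stay home, stay safe!", "Stay calm"]
toy_index = fit_word_index(docs)
toy_padded = to_padded(docs, toy_index, max_len=5)
print(toy_index)
print(toy_padded)
```

Padding on the left means the real tokens sit at the end of each sequence, closest to the final state a recurrent layer produces.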

# Encoding Data


We want to tokenize all of our data, so I create a new Series from the tweet values of both the train and test data, and fit the tokenizer on it.

Once the tokenizer is fitted, we use it to convert the text from the train dataset to token sequences, and pad them with 0s to ensure a uniform length.

In [8]:
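# Note: `+` joins the two Series element-wise by index; train rows with no matching
# test row become NaN (the string 'nan' after astype). pd.concat([...]) would stack
# the tweets instead, but would change the vocabulary size reported below.
tmp = train['OriginalTweet'] + test['OriginalTweet']
tmp = tmp.astype(str)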
tokenizer = text.Tokenizer(num_words=400000,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=" ")
max_len = 70
tokenizer.fit_on_texts(tmp)
word_index = tokenizer.word_index
len(word_index)

24677

In [9]:
X = train['OriginalTweet'].values
X = tokenizer.texts_to_sequences(X)
X = sequence.pad_sequences(X, maxlen=max_len)
print('Shape of data tensor:', X.shape)
 

Shape of data tensor: (41157, 70)


# Splitting the data
We'll split our train dataset into x_train, x_test, y_train and y_test. Let's go parameter by parameter.

**Syntax:**

**sklearn.model_selection.train_test_split(\*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)**

**\*arrays**: the data to split. Here we pass X (the padded token sequences the model is trained on) and y (the one-hot encoded sentiment labels).

**test_size**: the proportion of the data held out for testing (in our case 30%).

**train_size**: the proportion of the data used for training (left unset here, so it is the remaining 100% - 30% = 70%).

**random_state**: the seed for the internal random number generator, which decides how the data is split into train and test indices. Fixing it makes the split reproducible.

**shuffle**: whether to shuffle the data before splitting (True here).

**stratify**: if not None, the data is split in a stratified fashion, using this as the class labels. We leave it as None.


In [10]:
x_train, x_test, y_train, y_test = train_test_split(X, y, 
                                                  random_state=46, 
                                                  test_size=0.3, shuffle=True)

In [11]:
print('x_train.shape: ' + str(x_train.shape), ' y_train.shape: ' + str(y_train.shape))
print('x_test.shape: ' + str(x_test.shape), ' y_test.shape: ' + str(y_test.shape))


x_train.shape: (28809, 70)  y_train.shape: (28809, 5)
x_test.shape: (12348, 70)  y_test.shape: (12348, 5)
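
The split above does not pass `stratify`, so class proportions in the two parts can drift slightly. As a rough illustration of what stratifying means, here is a plain-Python sketch; the `stratified_split` helper and the toy labels are hypothetical, and scikit-learn's actual implementation differs in its details.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)  # fixed seed plays the role of random_state
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = ["Positive"] * 10 + ["Negative"] * 10 + ["Neutral"] * 10
train_idx, test_idx = stratified_split(toy_labels, test_size=0.3, seed=46)
print(len(train_idx), len(test_idx))
```

With 10 examples per class and `test_size=0.3`, every class contributes 3 examples to the test part and 7 to the train part.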


# The Embedding Layer

To move onto the next step, we need to be familiar with the concept of word embeddings.

> Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network, and hence the technique is often lumped into the field of deep learning.

In simple terms, each word is represented as a vector in a vector space, and the similarity between words is captured by how close their vectors are. Consider two similar words, like 'good' and 'great': the distance between their vectors should be small to reflect their similarity. There are a number of techniques to convert words to vectors, such as:

1. Frequency based Embedding

 1.1. Count Vectors
 
 1.2. TF-IDF
 
 1.3. Co-Occurrence Matrix
 
2. Prediction based Embedding

 2.1. CBOW
 
 2.2. Skip-Gram

3. Using pre-trained Word Vectors

  3.1. Word2Vec
  
  3.2. GloVe
  
I won't go into much detail regarding the embeddings, but if you want to know more, you should definitely check out this [link](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/)
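
To make the "similar words have nearby vectors" idea concrete, here is a tiny sketch using cosine similarity. The three 3-dimensional vectors are made up purely for illustration; real embeddings are learned and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any trained model.
toy_vectors = {
    "good":  [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "virus": [0.1, 0.2, 0.9],
}
sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["virus"])
print(round(sim_close, 3), round(sim_far, 3))
```

The 'good'/'great' pair scores close to 1, while 'good'/'virus' scores much lower, which is the behaviour we hope a trained embedding shows for related and unrelated words.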

Also, while choosing an embedding, there's no right or wrong; it depends on the problem statement. I'll be trying out the default [Keras Embedding Layer](https://keras.io/api/layers/core_layers/embedding/), whose weights are learned from scratch during training, and the pretrained [GloVe vectors](https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010) here. GloVe vectors are trained on word co-occurrence statistics from a very large corpus, so they can help when your own dataset is too small to learn good word representations on its own.


I've included the GloVe vector file in the input data. Let me load it and build the embedding matrix, which will serve as the weights of the Embedding layer in my neural network models.

In [12]:
embeddings_index = {}
# Each line of the GloVe file is: <word> followed by 300 space-separated floats
with open('../input/glove840b300dtxt/glove.840B.300d.txt', 'r', encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.split(' ')
        word = values[0]
        coefs = np.asarray([float(val) for val in values[1:]])
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

2196018it [05:15, 6969.17it/s]

Found 2196017 word vectors.





In [13]:
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

100%|██████████| 24677/24677 [00:00<00:00, 119637.65it/s]
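
Here is the same matrix-building logic on a toy scale, in plain Python, to show what happens to words that GloVe does not contain. The two "GloVe-style" lines and the three-word vocabulary are hypothetical.

```python
# Hypothetical 3-dimensional lines; the real file has 300 values per word.
toy_lines = [
    "covid 0.1 0.2 0.3\n",
    "store 0.4 0.5 0.6\n",
]
toy_word_index = {"covid": 1, "panicbuying": 2, "store": 3}  # made-up vocabulary

toy_embeddings = {}
for line in toy_lines:
    values = line.split(" ")
    toy_embeddings[values[0]] = [float(v) for v in values[1:]]

dim = 3
# Row 0 is left for the padding index, hence len(word_index) + 1 rows.
toy_matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = vector

# Fraction of the vocabulary that actually received a pretrained vector.
coverage = sum(w in toy_embeddings for w in toy_word_index) / len(toy_word_index)
print(toy_matrix)
print(coverage)
```

Words missing from the pretrained file keep an all-zero row, so checking the coverage on the real `word_index` is a quick way to see how much of the vocabulary benefits from GloVe.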


# Model Building

This is the part where it gets interesting. After tokenizing the text and preparing the word embeddings, we're now ready to build our models, feed the data into them, and check which one gives the best accuracy.

**1. Simple RNN (Recurrent Neural Networks) Model**

We'll start off with a very simple RNN model. If you're new to TensorFlow, have a [quick look](https://www.tensorflow.org/tutorials/quickstart/beginner) at the beginner tutorial.

Let's first address the question. 

What is an RNN?
> Recurrent neural networks (RNN) are a class of neural networks that are helpful in modeling sequence data. Derived from feedforward networks, RNNs exhibit similar behavior to how human brains function. Simply put: recurrent neural networks produce predictive results in sequential data that other algorithms can’t.

This is a very good [article](https://builtin.com/data-science/recurrent-neural-networks-and-lstm) to get started with the concepts of RNN and LSTM and to understand why they're so popular.

**Activation Function**
> In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input. This is similar to the behavior of the linear perceptron in neural networks.

[Continue reading..](https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/#:~:text=Activation%20functions%20are%20mathematical%20equations,relevant%20for%20the%20model's%20prediction.)

**Optimizer**
> They tie together the loss function and model parameters by updating the model in response to the output of the loss function. In simpler terms, optimizers shape and mold your model into its most accurate possible form by futzing with the weights. The loss function is the guide to the terrain, telling the optimizer when it’s moving in the right or wrong direction.

[Continue Reading..](https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6#:~:text=Many%20people%20may%20be%20using,help%20to%20get%20results%20faster)

The [learning rate scheduler](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1) adjusts the learning rate of the model per epoch according to a predefined schedule.

**The Sequential Model**

Let me give a quick walkthrough of what a sequential model is.
In simple words, a sequential model is a linear stack of layers, where each layer represents some kind of input, output or computation.

We'll be using the [sequential model from Keras](https://keras.io/guides/sequential_model/).

If you checked out the above link, you'll have seen the 'Dense' layer everywhere. So, what exactly is it?

> The dense layer is a neural network layer that is connected deeply, which means each neuron in the dense layer receives input from all neurons of its previous layer. The dense layer is found to be the most commonly used layer in the models. In the background, the dense layer performs a matrix-vector multiplication.

Syntax:

tf.keras.layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)

**units**: Positive integer, dimensionality of the output space.

**activation**: Activation function to use. If you don't specify anything, no activation is applied (ie. "linear" activation: a(x) = x).

**use_bias**: Boolean, whether the layer uses a bias vector.

**kernel_initializer**: Initializer for the kernel weights matrix.

**bias_initializer**: Initializer for the bias vector.

**kernel_regularizer**: Regularizer function applied to the kernel weights matrix.

**bias_regularizer**: Regularizer function applied to the bias vector.

**activity_regularizer**: Regularizer function applied to the output of the layer (its "activation").

**kernel_constraint**: Constraint function applied to the kernel weights matrix.

**bias_constraint**: Constraint function applied to the bias vector.
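
As a rough illustration of the matrix-vector multiplication a Dense layer performs, here is a plain-Python sketch with a softmax activation on top. The `dense` and `softmax` helpers and all the weights are made up for illustration; this is not how Keras implements the layer internally.

```python
import math

def dense(x, kernel, bias, activation=None):
    """output[j] = sum_i x[i] * kernel[i][j] + bias[j], then an optional activation."""
    out = [sum(x[i] * kernel[i][j] for i in range(len(x))) + bias[j]
           for j in range(len(bias))]
    return activation(out) if activation else out

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0]                   # 2 input features
kernel = [[0.5, -1.0, 0.0],      # shape (2, 3): 2 inputs -> 3 units
          [0.25, 0.5, 1.0]]
bias = [0.0, 0.0, 0.0]

linear_out = dense(x, kernel, bias)                   # no activation: a(x) = x
probs = dense(x, kernel, bias, activation=softmax)    # class probabilities
print(linear_out)
print(probs)
```

With `activation='softmax'` the outputs are non-negative and sum to 1, which is why the final layer of each model in this notebook can be read as a probability per sentiment class.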

In [14]:
SimpleRNNModel = Sequential()
SimpleRNNModel.add(Input(shape=(x_train.shape[1],)))  # one padded sequence of max_len token ids
SimpleRNNModel.add(Embedding(len(tokenizer.word_index)+1,32))
SimpleRNNModel.add(SimpleRNN(100))
SimpleRNNModel.add(Dense(5, activation='softmax'))
SimpleRNNModel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy','categorical_accuracy','AUC','Precision','Recall'])
SimpleRNNModel.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 70, 32)            789696    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 100)               13300     
_________________________________________________________________
dense (Dense)                (None, 5)                 505       
Total params: 803,501
Trainable params: 803,501
Non-trainable params: 0
_________________________________________________________________
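
The parameter counts in this summary can be checked by hand. A small sketch of the arithmetic, using the vocabulary size (24677) and layer sizes from the cells above:

```python
vocab_size = 24677 + 1   # len(word_index) + 1, as passed to Embedding
embed_dim = 32
units = 100
num_classes = 5

# Embedding: one embed_dim-sized vector per vocabulary row.
embedding_params = vocab_size * embed_dim
# SimpleRNN: input kernel (embed_dim x units) + recurrent kernel (units x units) + bias.
simple_rnn_params = units * (embed_dim + units + 1)
# Dense: kernel (units x num_classes) + bias.
dense_params = units * num_classes + num_classes

total_params = embedding_params + simple_rnn_params + dense_params
print(embedding_params, simple_rnn_params, dense_params, total_params)
```

Almost all of the parameters sit in the Embedding layer, which is typical for text models with a large vocabulary.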


# Fitting the data
We would now [fit the model](https://keras.io/api/models/model_training_apis/) on our data.

In [15]:

SimpleRNNModelResults = SimpleRNNModel.fit(x_train, y_train, epochs=5, batch_size=64,validation_split=0.1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


# Overfitting Model Alert!
So, the training accuracy is quite high (~95%), along with high precision, recall and AUC, but the val_accuracy is not that great. This suggests overfitting: the model performs well on the training data, but not as well on new data it wasn't trained on.
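
One common remedy is early stopping: stop training once the validation loss stops improving. The notebook imports the Keras `EarlyStopping` callback but does not use it, so here is a plain-Python sketch of the patience logic instead. The `stop_with_patience` helper and the validation losses are made up for illustration.

```python
def stop_with_patience(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based."""
    best, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:                      # no improvement for `patience` epochs
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving at first, then getting worse.
hypothetical_val_loss = [1.20, 0.95, 0.90, 0.97, 1.05, 1.10]
stop_epoch, best_epoch = stop_with_patience(hypothetical_val_loss, patience=2)
print(stop_epoch, best_epoch)
```

In this toy run the best validation loss is at epoch 2, and training would stop at epoch 4 after two epochs without improvement.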




**2. LSTM (Long Short Term Memory) Networks**

> Long short-term memory is an artificial recurrent neural network architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. It can not only process single data points, but also entire sequences of data.

Technically, LSTM was built to overcome the [vanishing gradient](https://medium.datadriveninvestor.com/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577) issue encountered in RNNs. In simple words, RNN training works something like this: the error is backpropagated through each time step to update the weights. Because the gradient is multiplied by a factor at every step, it can shrink exponentially over long sequences until it becomes insignificant, and the weights tied to the earlier steps are barely updated at all. We call this the *vanishing gradient* problem.
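
A quick back-of-the-envelope sketch of why this happens. I'm assuming, purely for illustration, that each time step scales the gradient by the same constant factor; real networks are messier than that.

```python
timesteps = 70           # max_len used for padding in this notebook

shrink_factor = 0.9      # assumed per-step factor slightly below 1
grow_factor = 1.1        # assumed per-step factor slightly above 1

vanishing = shrink_factor ** timesteps   # gradient reaching the first time step
exploding = grow_factor ** timesteps     # the opposite failure mode

print(f"{vanishing:.2e}")
print(f"{exploding:.2e}")
```

Even a per-step factor as mild as 0.9 leaves well under a thousandth of the gradient after 70 steps, which is why plain RNNs struggle to learn from the start of a long sequence.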

I'll first try a simple LSTM model, with an architecture very similar to the simple RNN model, and check how much accuracy that gives us.

In [16]:
SimpleLSTMModel = Sequential()
SimpleLSTMModel.add(Input(shape=(x_train.shape[1],)))
SimpleLSTMModel.add(Embedding(len(tokenizer.word_index)+1,32))
SimpleLSTMModel.add(LSTM(100))
SimpleLSTMModel.add(Dense(5, activation='softmax'))
SimpleLSTMModel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
SimpleLSTMModel.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 70, 32)            789696    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 505       
Total params: 843,401
Trainable params: 843,401
Non-trainable params: 0
_________________________________________________________________


In [17]:

SimpleLSTMModelResults = SimpleLSTMModel.fit(x_train, y_train, epochs=5, batch_size=64,validation_split=0.1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


So you see, the improvement is noticeable when changing the model from Simple RNN to LSTM. This is largely due to the gated structure of the LSTM, which helps it retain information over longer sequences. Let's try a GRU model and check how much accuracy that provides.

**3. GRU (Gated Recurrent Units)**

Gated recurrent units are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al. A GRU has two gates: an update gate and a reset gate.
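
Here is a minimal scalar sketch of a single GRU step with its two gates (usually called the update gate and the reset gate), following the convention where the update gate weights the previous state. All weights are made-up numbers; a real layer uses weight matrices and vectors.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One scalar GRU step; p holds hypothetical weights."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])   # reset gate
    # Candidate state: the reset gate decides how much of the old state to use.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate blends the old state with the candidate.
    return z * h_prev + (1.0 - z) * h_cand

params = {k: 0.5 for k in ("wz", "uz", "wr", "ur", "wh", "uh")}
params.update({"bz": 0.0, "br": 0.0, "bh": 0.0})

h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, params)
print(h)

# With a very large update-gate bias, z is ~1 and the old state is carried over.
kept = gru_step(1.0, 0.3, dict(params, bz=50.0))
print(kept)
```

Being able to carry the old state forward almost unchanged is what lets gated units keep information across many time steps.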

In [18]:
SimpleGRUModel = Sequential()
SimpleGRUModel.add(Input(shape=(x_train.shape[1],)))
SimpleGRUModel.add(Embedding(len(tokenizer.word_index)+1,32))
SimpleGRUModel.add(GRU(100))
SimpleGRUModel.add(Dense(5, activation='softmax'))
SimpleGRUModel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
SimpleGRUModel.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 70, 32)            789696    
_________________________________________________________________
gru (GRU)                    (None, 100)               40200     
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 505       
Total params: 830,401
Trainable params: 830,401
Non-trainable params: 0
_________________________________________________________________
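
The recurrent-layer parameter counts in the LSTM and GRU summaries can be reproduced with a little arithmetic. This sketch assumes the GRU keeps two bias vectors per block, which is what the number in the summary implies.

```python
embed_dim, units = 32, 100

# One block = input kernel (embed_dim x units) + recurrent kernel (units x units).
kernel_params = units * (embed_dim + units)

# LSTM: 4 blocks (input, forget, cell and output), each with one bias vector.
lstm_params = 4 * (kernel_params + units)
# GRU: 3 blocks (update, reset, candidate), assumed to have two bias vectors each.
gru_params = 3 * (kernel_params + 2 * units)
# A bidirectional LSTM is two independent LSTMs, one per direction.
bilstm_params = 2 * lstm_params

print(lstm_params, gru_params, bilstm_params)
```

The GRU needs roughly three quarters of the LSTM's recurrent parameters, since it has three blocks instead of four.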


In [19]:

SimpleGRUModelResults = SimpleGRUModel.fit(x_train, y_train, epochs=5, batch_size=64,validation_split=0.1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


**4. Bidirectional LSTM Model**

The Bidirectional LSTM, or BiLSTM, is a modified version of a simple LSTM in which we use 2 LSTMs: one processes the input in the forward direction and the other in the backward direction. It often achieves better accuracy than a traditional RNN, GRU or unidirectional LSTM, because every position sees context from both sides.
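
To show the data flow only, here is a toy sketch of a bidirectional pass followed by global max pooling. `toy_step` is a stand-in I made up (a decayed running sum), not a real LSTM cell.

```python
def run_direction(sequence, step):
    """Apply a recurrent step left to right, returning the state at every position."""
    state, states = 0.0, []
    for x in sequence:
        state = step(state, x)
        states.append(state)
    return states

def toy_step(state, x):
    # Stand-in for an LSTM cell: a decayed running sum (not a real LSTM).
    return 0.5 * state + x

sequence = [1.0, 2.0, 4.0]
forward = run_direction(sequence, toy_step)
# Run on the reversed input, then flip back so positions line up with the input.
backward = run_direction(sequence[::-1], toy_step)[::-1]

# Concatenate both directions per time step, like return_sequences=True does.
combined = [[f, b] for f, b in zip(forward, backward)]
# Global max pooling: take the max over the time axis for each feature.
pooled = [max(col) for col in zip(*combined)]
print(combined)
print(pooled)
```

Each position ends up with one value summarising everything before it and one summarising everything after it, and the pooling step reduces the whole sequence to a fixed-size vector for the Dense layer.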

If you're confused about model selection for your dataset, you can refer to [this discussion thread](https://datascience.stackexchange.com/questions/25650/what-is-lstm-bilstm-and-when-to-use-them#:~:text=BiLSTM%20means%20bidirectional%20LSTM%2C%20which,this%20architecture%20to%20other%20RNNs.).

In [20]:
BILSTMModel = Sequential()
BILSTMModel.add(Input(shape=(x_train.shape[1],)))
BILSTMModel.add(Embedding(len(tokenizer.word_index)+1,32))
BILSTMModel.add(Bidirectional(LSTM(100, return_sequences=True)))
BILSTMModel.add(GlobalMaxPooling1D())
BILSTMModel.add(Dense(5, activation='softmax'))
BILSTMModel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
BILSTMModel.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 70, 32)            789696    
_________________________________________________________________
bidirectional (Bidirectional (None, 70, 200)           106400    
_________________________________________________________________
global_max_pooling1d (Global (None, 200)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 1005      
Total params: 897,101
Trainable params: 897,101
Non-trainable params: 0
_________________________________________________________________


In [21]:
BILSTMModelResults = BILSTMModel.fit(x_train, y_train, epochs=5, batch_size=64,validation_split=0.1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


# Hyperparameter Tuning/ Fine Tuning

Let's try tweaking some parameters and adding more layers to see if we can improve the accuracy. I'll also initialize the Embedding layer with the GloVe weights from the embedding matrix built in an earlier step.

In [22]:
BILSTMModel_2 = Sequential()
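BILSTMModel_2.add(Input(shape=(x_train.shape[1],)))
# Vocabulary size is len(word_index) + 1 = 24678; the GloVe weights stay trainable here
BILSTMModel_2.add(Embedding(len(tokenizer.word_index)+1, 300, weights=[embedding_matrix]))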
BILSTMModel_2.add(Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1)))
BILSTMModel_2.add(GlobalMaxPooling1D())
BILSTMModel_2.add(Dense(50, activation='relu'))
BILSTMModel_2.add(Dropout(0.2))
BILSTMModel_2.add(Dense(5, activation='softmax'))
BILSTMModel_2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
BILSTMModel_2.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 70, 300)           7403400   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 70, 100)           140400    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 100)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 50)                5050      
_________________________________________________________________
dropout (Dropout)            (None, 50)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 5)                 255       
Total params: 7,549,105
Trainable params: 7,549,105
Non-trainable params: 0
____________________________________________
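
This model adds dropout for regularization. Here is a rough plain-Python sketch of the usual "inverted dropout" idea applied at training time; the `dropout` helper is made up for illustration and is not the Keras implementation.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(zero_fraction, mean_after)
```

Roughly 20% of the values are zeroed and the survivors are scaled up by 1 / 0.8, so the average activation stays about the same and nothing needs rescaling at prediction time.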

In [23]:
BILSTMModel_2Results = BILSTMModel_2.fit(x_train, y_train, epochs=5, batch_size=64,validation_split=0.1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
