## Text Classification using TensorFlow

#### Objective: Classify question documents into one of 9 categories.

#### Author: Juan Gordyn

### Importing libraries and checking classes

In [303]:
# importing libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping, LearningRateScheduler
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import f1_score

Loading data and checking class proportions:

In [69]:
questions_df = pd.read_csv('questions.csv', header = None, names = ['label', 'text', 'license'])
questions_df.label.value_counts()/len(questions_df)

astronomy           0.244269
ai                  0.202527
opendata            0.136028
sports              0.126375
quantumcomputing    0.114400
computergraphics    0.074197
martialarts         0.045530
coffee              0.030647
beer                0.026028
Name: label, dtype: float64

### Loading and pre-processing the data

Building class to pre-process the data, split into train and validation so that it is easier afterwards when building the model.

For pre-processing, I will keep only words (not numbers, not symbols, etc) and pre-hyphened words (such as state-of-the-art, etc), words containing ' (such as google's) or words unified by an underscore. Stop-words will be kept because I consider that sometimes meaning within the sequence can be altered when removing them. All the words will be lower-cased and lemmatized.

In [304]:
class DataManager:
    def __init__(self, maxlen= 50, random_state=6789):
        self.numeral_labels = list()
        self.maxlen = maxlen
        self.numeral_data = list()
        self.random_state = random_state
        self.random = np.random.RandomState(random_state)
    
    def preprocessing(self, text_corpus):
        # tokenizing by simple words or words combined by _ or -
        tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'_][a-zA-Z]+)*'?")
        tokenized_text = [tokenizer.tokenize(doc.lower()) for doc in text_corpus]
        for i in range(len(tokenized_text)):
            tokenized_text[i] = ' '.join([WordNetLemmatizer().lemmatize(token) for token in tokenized_text[i]])
        return tokenized_text
    
    def read_data(self):
        # loading data
        questions_df = pd.read_csv('questions.csv', header = None, names = ['label', 'text', 'license'])
        # pre-processing the data and storing in dataframe
        questions_df['text'] = self.preprocessing(questions_df.text)
        # converting labels and pre-processed text to lists
        self.str_labels = list(questions_df.label)
        self.str_questions = list(questions_df.text)
         
        # turns labels into numbers
        le= preprocessing.LabelEncoder()
        le.fit(self.str_labels)
        # array of labels as numbers
        self.numeral_labels = np.array(le.transform(self.str_labels))
        # classes to be able to print them whenever we want, as reference
        self.str_classes= le.classes_
        # number of classes, will be helpful for output layer of NN
        self.num_classes= len(self.str_classes)
    
    def manipulate_data(self):
        # tokenizing the pre-processed sequences
        tokenizer = tf.keras.preprocessing.text.Tokenizer()
        tokenizer.fit_on_texts(self.str_questions)
        # converting tokens into numerical representation
        self.numeral_data = tokenizer.texts_to_sequences(self.str_questions)
        # add padding to complete sequence up to max_sequence and truncating when max_sequence is exceeded
        self.numeral_data = tf.keras.preprocessing.sequence.pad_sequences(self.numeral_data, padding='post', truncating= 'post', maxlen= self.maxlen)
        # building word-index dictionaries
        self.word2idx = tokenizer.word_index
        self.word2idx = {k:v for k,v in self.word2idx.items()}
        self.idx2word = {v:k for k,v in self.word2idx.items()}
        # vocab_size = number of unique words in our corpus
        self.vocab_size = len(self.word2idx)
    
    def train_valid_split(self, train_ratio=0.8):
        # select indices in random order along it first index to do train-validation split
        idxs = np.random.permutation(np.arange(len(self.str_questions)))
        # size of training data
        train_size = int(train_ratio*len(idxs)) +1
        # x train and x val as string questions
        self.train_str_questions, self.valid_str_questions = self.str_questions[0:train_size], self.str_questions[train_size:]
        # x train and x val as numerical representations
        self.train_numeral_data, self.valid_numeral_data = self.numeral_data[0:train_size], self.numeral_data[train_size:]
        # y train and y val as numerical labels
        self.train_numeral_labels, self.valid_numeral_labels = self.numeral_labels[0:train_size], self.numeral_labels[train_size:]
        # generate train and validation sets as tensors to be ingested by Neural Network model
        self.tf_train_set = tf.data.Dataset.from_tensor_slices((self.train_numeral_data, self.train_numeral_labels))
        self.tf_valid_set = tf.data.Dataset.from_tensor_slices((self.valid_numeral_data, self.valid_numeral_labels))

In [302]:
# checking what percentage of questions exceed 100 tokens (to understand if selecting 100 as maxlen is OK)
np.sum(questions_df.text.apply(lambda x: len(x))>100)/len(questions_df)

0.04882936604917151

In [305]:
# only 5%, we can select maxlen=100
# implementing the above class
dm = DataManager(maxlen=100)
dm.read_data()
dm.manipulate_data()
dm.train_valid_split(train_ratio=0.8)

In [301]:
# showing how the data is actually encoded
for i in range(2):
    print('Tokenized question ',i+1,': ', dm.str_questions[i],  '\n\nNumerical representation: ', \
          dm.train_numeral_data[i],'\n\n Label number: ',dm.train_numeral_labels[i], '; Label name: ', \
          dm.str_labels[i], '\n','\n', sep='')

Tokenized question 1: what is the difference between a qudit system with d and a two-qubit system

Numerical representation: [   9    6    1   60   32    2 2902   55   13   85   10    2   70  117
   55    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]

 Label number: 7; Label name: quantumcomputing


Tokenized question 2: what doe the sun look like from the heliopause

Numerical representation: [   9   18    1   76  250  102   21    1 5096    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0  

### Building the model

I will build a Recurrent Neural Network model with the following particularities:

- run_mode gives 3 possibilities in terms of Word Embeddings: randomly initializing and further training word embeddings (scratch mode), using glove pre-trained word embeddings and freezing their training (init-only), using glove pre-trained word embeddings allowing to further tune their weights (init-fine-tune).
- cell_type: being able to use LSTM, GRU or simple RNN.
- network_type: uni-directional or bi-directional.
- state_sizes: to control the number of hidden layers (final length of the provided list) and their corresponding dimensions (each number of the list).

To address overfitting:

- dropout_rate: if dropout rate is specified, then dropout will be performed in all the layers. If not specified, then it will not be performed in any.
- l2_reg: if this parameter is specified, then l2 regularization will be performed in the last fully connected layer. If not specified, regularization will not be applied.

In [306]:
# Defining RNN model
class RNN:
    def __init__(self, run_mode = 'scratch', cell_type= 'gru', network_type = 'uni-directional', embed_model= 'glove-wiki-gigaword-100', 
                 embed_size= 128, state_sizes = [64, 64], dropout_rate = None, l2_reg = None, data_manager = None):
        self.run_mode = run_mode
        self.data_manager = data_manager
        self.cell_type = cell_type
        self.network_type = network_type
        self.state_sizes = state_sizes
        self.embed_model = embed_model
        self.embed_size = embed_size
        # if we are using glove, embed size should be the number indicated in the model's name
        if self.run_mode != 'scratch':
            self.embed_size = int(self.embed_model.split("-")[-1])
        self.data_manager = data_manager
        self.vocab_size = dm.vocab_size +1
        self.word2idx = dm.word2idx
        self.word2vect = None
        # initialize embedding matrix as all 0
        self.embed_matrix = np.zeros(shape= [self.vocab_size, self.embed_size])
        # for regularization
        self.dropout_rate = dropout_rate
        self.l2_reg = l2_reg
    
    def build_embedding_matrix(self):
        self.word2vect = api.load(self.embed_model) # load embedding model
        for word, idx in self.word2idx.items():
            try:
                self.embed_matrix[idx] = self.word2vect.word_vec(word) # assign weight for the corresponding word and index
            except KeyError: # word cannot be found
                pass
    
    @staticmethod
    # method to build hidden layers with all the combinations of network and cell types
    # activation function tanh usually used in RNN
    # return_sequences indicates if the concatenation of all hidden values for of all 
    # hidden cells in addition to output
    def get_layer(cell_type= 'gru', state_size= 128, network_type='uni-directional', return_sequences= False, activation = 'tanh'):
        if network_type=="bi-directional":
            if cell_type=='gru':
                return tf.keras.layers.Bidirectional(tf.keras.layers.GRU(state_size, return_sequences=return_sequences, activation=activation))
            elif cell_type== 'lstm':
                return tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(state_size, return_sequences=return_sequences, activation=activation))
            else:
                return tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(state_size, return_sequences=return_sequences, activation=activation))
        else:
            if cell_type=='gru':
                return tf.keras.layers.GRU(state_size, return_sequences=return_sequences, activation=activation)
            elif cell_type== 'lstm':
                return tf.keras.layers.LSTM(state_size, return_sequences=return_sequences, activation=activation)
            else:
                return tf.keras.layers.SimpleRNN(state_size, return_sequences=return_sequences, activation=activation)
    def build(self):
        x = tf.keras.layers.Input(shape=[None])
        if self.run_mode == "scratch":
            # if scratch, random initialization of embeddings
            # mask_zero defines if values padded at the end of the sequence can be ignored for training or not
            h = tf.keras.layers.Embedding(self.vocab_size, self.embed_size, mask_zero=True, trainable=True)(x)
        else:
            if self.run_mode=='init-only':
                trainable_param = False
            else:
                trainable_param = True
            self.build_embedding_matrix()
            h = tf.keras.layers.Embedding(self.vocab_size, self.embed_size, mask_zero=True, trainable=trainable_param,
                                                        weights=[self.embed_matrix])(x)
        # we repeat this proccess as many times as number of layers
        num_layers = len(self.state_sizes)
        for i in range(num_layers):
            h =  self.get_layer(self.cell_type, self.state_sizes[i], self.network_type, return_sequences=True)(h)
            if self.dropout_rate != None:
                # dropout to each hidden layer to control overfitting
                h = tf.keras.layers.Dropout(self.dropout_rate)(h)
        h = self.get_layer(self.cell_type, self.state_sizes[i], self.network_type, return_sequences=False)(h)
        if self.dropout_rate != None:
            # dropout to fully connected layer to control overfitting
            h = tf.keras.layers.Dropout(self.dropout_rate)(h)
        if self.l2_reg != None:
            # output layer with softmax with l2 regularization
            h = tf.keras.layers.Dense(dm.num_classes, activation='softmax', kernel_regularizer = tf.keras.regularizers.l2(self.l2_reg))(h)
        else:
            # output layer with softmax without l2 regularization
            h = tf.keras.layers.Dense(dm.num_classes, activation='softmax')(h)
        self.model = tf.keras.Model(inputs=x, outputs=h)
        
    def compile_model(self, *args, **kwargs):
        self.model.compile(*args, **kwargs)
    
    def fit(self, *args, **kwargs):
        return self.model.fit(*args, **kwargs)
    
    def evaluate(self, *args, **kwargs):
        self.model.evaluate(*args, **kwargs)
        
    def predict(self, *args, **kwargs):
        return self.model.predict(*args, **kwargs)

### Training the model

I will train the model using 6 different configurations looping over the different possibilities for cell_type and network_type parameters.

I am applying early stopping to avoid overfitting. If the validation loss does not decrease for 3 epochs in a row, the model stops training. I won't apply dropout nor l2_reg at this stage. If I still see signs of overfitting, I will then apply those techniques only to the best-performing model.

I have already, on my own, tried the different approaches for the Word Embeddings (scratch, init-only, init-fine-tune) and init-fine-tune was the best performing, so that parameter will be fixed here (to avoid excessive number of different settings)

I will be building the model with 2 hidden layers so that it doesn't take excessive training time.

In [49]:
# initializing model index
i = 0
# initializing Data Frame to store performances
results_df = pd.DataFrame()
# looping all the possible values for the parameters
for cell_type in ['simple-rnn', 'gru', 'lstm']:
    for network_type in ['uni-directional', 'bi-directional']:
        print('Model', i+1, '; Training with cell_type=', cell_type, '; network_type=',network_type, sep='')
        rnn = RNN(run_mode= 'init-fine-tune', data_manager = dm, cell_type=cell_type, network_type=network_type)
        rnn.build()
        opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
        callback = EarlyStopping(patience=3, monitor='val_accuracy', mode='max')
        rnn.compile_model(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
        history = rnn.fit(dm.tf_train_set.batch(64), epochs=10, validation_data = dm.tf_valid_set.batch(64), callbacks = [callback])
        # keeping the number of epochs that yields the highest val accuracy for each model
        valid_accuracies = history.history['val_accuracy']
        max_accuracy = np.max(valid_accuracies)
        max_epochs = np.argmax(valid_accuracies)
        # storing all results in DataFrame
        results_df.loc[i, 'cell_type'] = cell_type
        results_df.loc[i, 'netowrk_type'] = network_type
        results_df.loc[i, 'optimal_epoch'] = max_epochs + 1
        results_df.loc[i, 'optimal_accuracy'] = max_accuracy
        i+=1
        print('\n')

Model1; Training with cell_type=simple-rnn; network_type=uni-directional
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Model2; Training with cell_type=simple-rnn; network_type=bi-directional
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Model3; Training with cell_type=gru; network_type=uni-directional
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Model4; Training with cell_type=gru; network_type=bi-directional
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10


Model5; Training with cell_type=lstm; network_type=uni-directional
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10


Model6; Training with cell_type=lstm; network_type=bi-directional
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 

We can see that there are signs of overfitting in all the models: our training accuracy is always pretty close to 1 while our validation accuracy is somewhere between 0.89 and 0.92. I will then select the best performing model (the one with the optimal validation accuracy) and apply dropout and regularization.

### The results

In [50]:
results_df

Unnamed: 0,cell_type,netowrk_type,optimal_epoch,optimal_accuracy
0,simple-rnn,uni-directional,7.0,0.889649
1,simple-rnn,bi-directional,8.0,0.891727
2,gru,uni-directional,8.0,0.918978
3,gru,bi-directional,5.0,0.918123
4,lstm,uni-directional,6.0,0.920445
5,lstm,bi-directional,7.0,0.919589


We can observe that our best-performing model is lstm uni-directional. So let's re-train it, addressing the overfitting problem this time with dropout and regularization, save it as the optimal model and take a look at the predictions to see if it is working well in all the classes.

I will first define a scheduler for the learning rate, to decrease as the training progresses (to speed up the performance at the beginning and avoid overshooting once the training is advanced).

In [310]:
# defining scheduler for learning rate
def scheduler(epoch, learning_rate):
    if epoch < 5:
        return 0.01
    elif epoch < 8:
        return 0.001
    else:
        return 0.0001

In [312]:
# re-running model on optimal parameters, adding lr scheduler, dropout_rate and l2_reg
optimal_rnn = RNN(run_mode= 'init-fine-tune', data_manager = dm, cell_type='lstm', network_type='uni-directional',\
                 dropout_rate = 0.5, l2_reg = 0.0001)
optimal_rnn.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.01)
# learning rate scheduler
scheduler_lr = tf.keras.callbacks.LearningRateScheduler(scheduler)
# early stopping
early_stopping = EarlyStopping(patience=3, monitor='val_accuracy', mode='max')
optimal_rnn.compile_model(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = optimal_rnn.fit(dm.tf_train_set.batch(64), epochs=10, validation_data = dm.tf_valid_set.batch(64),\
                          callbacks = [scheduler_lr,  early_stopping])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


We can see in our results that we have obtained a higher validation accuracy (0.93 now vs 0.92 in the previous optimal model without l2_reg, dropout nor lr_scheduler) and we have managed to reduce overfitting: the difference between training and validation accuracy is now smaller.

In [313]:
# using predict method to be able to actually retrieve the predictions and not only the performance
predictions = optimal_rnn.predict(dm.valid_numeral_data)
predictions_list = []
for i in range(len(predictions)):
    predictions_list.append(np.argmax(predictions[i]))

In [314]:
# building confusion matrices for each class
multilabel_confusion_matrix(dm.valid_numeral_labels, predictions_list)

array([[[6449,  138],
        [ 142, 1454]],

       [[6106,   75],
        [  85, 1917]],

       [[7951,   11],
        [  23,  198]],

       [[7929,   13],
        [  17,  224]],

       [[7451,   92],
        [  73,  567]],

       [[7739,   36],
        [  38,  370]],

       [[7003,   82],
        [  89, 1009]],

       [[7163,   63],
        [  50,  907]],

       [[7101,   62],
        [  55,  965]]])

By taking a look at the Confusion Matrices we can get an idea that the model is performing quite good in all the classes. Still we can calculate each class fscore to confirm this:

In [315]:
f1_score(dm.valid_numeral_labels, predictions_list, average=None)

array([0.91217064, 0.95993991, 0.92093023, 0.93723849, 0.87297921,
       0.90909091, 0.92188214, 0.94135963, 0.94284319])

In [316]:
dm.str_classes

array(['ai', 'astronomy', 'beer', 'coffee', 'computergraphics',
       'martialarts', 'opendata', 'quantumcomputing', 'sports'],
      dtype='<U16')

We see that the model is actually doing pretty good in all the classes. computergraphics is not performing as well as the others, so we could dig deeper and see why this is happening (maybe we could change data pre-processing, this time including numbers as well and see if this boosts the performance in this class, because numbers should be meaningful in a domain such as computergraphics, etc...)