In this notebook I will be going over a text clasification with a BERT tokenizer model using much of the code from https://stackabuse.com/text-classification-with-bert-tokenizer-and-tf-2-0-in-python/. I had done text classification of the same dataset in an earlier notebook so I am curious to see if using BERT will improve the results.

In [1]:
!pip install bert-for-tf2
!pip install sentencepiece



I am running this in Google Colab which by default will not run the script of TensorFlow 2.0, meaning that we need to add some lines to make this happen.

In [2]:
try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

#library with a bunch of pretrained models developed in TensorFlow
import tensorflow_hub as hub

from tensorflow.keras import layers
import bert
import pandas as pd
import numpy as np

The dataset is a standard movie review and corresponding sentiment (positive or negative). The data set contains 50,000 reviews and sentiment pairs. 

In [3]:
from google.colab import drive
drive.mount('/content/drive')
movie_reviews = pd.read_csv("/content/drive/My Drive/Colab Notebooks/IMDB Dataset.csv")

movie_reviews.isnull().values.any()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


False

## Preprocessing
The following chunk will preprocess the data by removing punctuation and special characters.

In [4]:
def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

In [5]:
import re
#Return the cleaned text
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

In [6]:
reviews = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    reviews.append(preprocess_text(sen))

Now that the data is cleaned, I will just show the format that the data is in. There are two columns, one for the review and the other for the sentiment as shown by the following cell.

In [7]:
print(movie_reviews.columns.values)

['review' 'sentiment']


The following cell will show you the format of the sentiment as either "positive" or "negative".

In [8]:
movie_reviews.sentiment.unique()

array(['positive', 'negative'], dtype=object)

There is no reason to leave the sentiment in the string of "positive" or "negative" so we convert that into 1 for positive and 0 for negative.

In [9]:
y = movie_reviews['sentiment']

y = np.array(list(map(lambda x: 1 if x=="positive" else 0, y)))

Now we have two different arrays, reviews and y which contain the reviews and the sentiment respectively. To see the format of the data now the following cells will show the first data point for each array.

In [10]:
print(reviews[0])

One of the other reviewers has mentioned that after watching just Oz episode you ll be hooked They are right as this is exactly what happened with me The first thing that struck me about Oz was its brutality and unflinching scenes of violence which set in right from the word GO Trust me this is not show for the faint hearted or timid This show pulls no punches with regards to drugs sex or violence Its is hardcore in the classic use of the word It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary It focuses mainly on Emerald City an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda Em City is home to many Aryans Muslims gangstas Latinos Christians Italians Irish and more so scuffles death stares dodgy dealings and shady agreements are never far away would say the main appeal of the show is due to the fact that it goes where other shows wouldn dare Forget pretty pictures

In [11]:
print(y[0])

1


## Tokenization
We cannot use the BERT embeddings as an input if we do not tokenize the reviews first. The following cell will provide the BERT tokenization. The url inputted as a string contains a saved model in the form of a TensorFlow 2 model. This is a BERT model from the TensorFlow models repository on GitHub at https://github.com/tensorflow/models/tree/master/official/nlp/bert. The model has 12 hidden layers, a hidden size of 768, and 12 attention heads.

In [12]:
BertTokenizer = bert.bert_tokenization.FullTokenizer
#create a BERT embedding layer by importing the BERT model from hub.KerasLayer
#trainable parameter is false since we will not further train the model
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
#create a BERT vocabulary file
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
#set the text to lowercase
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
#pass vocabulary_file and to_lower_case to the BertTokenizer object
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

To show the form of the tonization of a sentence I created a random sentence to pass in.

In [13]:
tokenizer.tokenize("I don't think that you're able to visit Henry's house.")

['i',
 'don',
 "'",
 't',
 'think',
 'that',
 'you',
 "'",
 're',
 'able',
 'to',
 'visit',
 'henry',
 "'",
 's',
 'house',
 '.']

In [14]:
tokenizer.convert_tokens_to_ids(tokenizer.tokenize("I don't think that you're able to visit Henry's house."))

[1045,
 2123,
 1005,
 1056,
 2228,
 2008,
 2017,
 1005,
 2128,
 2583,
 2000,
 3942,
 2888,
 1005,
 1055,
 2160,
 1012]

The following cell contains a simple function that will return the id numbers of the tokenized words.

In [15]:
def tokenize_reviews(text_reviews):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_reviews))

Now to actually tokenize every review in the dataset.

In [16]:
tokenized_reviews = [tokenize_reviews(review) for review in reviews]

## Preparing to Train
Since each review can be a different length from the other ones, we must figure out how to fix this. A potentials solution would be to pad the sentences by 0s. However this can lead to the matrix being much less filled for the smaller sentences. To mitigate this, we will pad for each batch. So while there will still be some sparse matrices, we will only have to pad to the length of the largest sentence in each batch.

To help us do this, we will have to find the length of each sentence. The following cell will create an array containing each review, the sentiment, and the length. After that we will shuffle the data because the current form has positive and negative reviews separated. After that we will sort the data by the length of the sentence.

In [17]:
reviews_with_len = [[review, y[i], len(review)]
                 for i, review in enumerate(tokenized_reviews)]
import random
random.shuffle(reviews_with_len)
reviews_with_len.sort(key=lambda x: x[2])
# remove length since it is not needed anymore
sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]

The following cell will convert the data into a TensorFlow 2.0 compliant input dataset shape.

In [18]:
processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))

The following cell is where the padding will occur. Using a batch size of 32, we will pad the reviews locally by batches.

In [19]:
BATCH_SIZE = 32
batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))

The following cell will show how the padding is applied. Notice the 0s chained on at the end of each array in the beginning but the arrays at the end do not need padding.

In [20]:
next(iter(batched_dataset))

(<tf.Tensor: shape=(32, 21), dtype=int32, numpy=
 array([[ 3191,  1996,  2338,  5293,  1996,  3185,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2054,  5896,  2054,  2466,  2054,  6752,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 3078,  5436,  3078,  3257,  3532,  7613,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2062, 23873,  3993,  2062, 11259,  2172,  2172,  2062, 14888,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 1045,  2876,  9278,  2023,  2028,  2130,  2006,  7922, 12635,
          2305,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2023,  3185,  2003,  6659,  2021,  2009,  2038,  2070,  2204,
    

The following cell will separate the data into train and test datasets.

In [21]:
import math
TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
batched_dataset.shuffle(TOTAL_BATCHES)
test_data = batched_dataset.take(TEST_BATCHES)
train_data = batched_dataset.skip(TEST_BATCHES)

## Creating Model
Now we are all set to create our model. To do so, we will create a class that inherits from the tf.keras.Model class. Inside the class we will define our model layers which will consist of three convolutional neural network layers with a glabal max pooling. 

In [22]:
class TEXT_MODEL(tf.keras.Model):
    
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        #three convolutional neural network layers
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        #global max pooling is applied to the output of each of the convolutional neural network layer
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")
    
    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(l) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3) 
        #three convolutional neural network layers are concatenated together and their output is fed to the first densely connected neural network
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        
        return model_output

The following cell contains the values we will use for our hyperparameters.

In [32]:
VOCAB_LENGTH = len(tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2

DROPOUT_RATE = 0.2

NB_EPOCHS = 5

In the next cell we will create an object of the class and pass the hyper parameters in.

In [33]:
text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)














Before we can actually train the models we need to compile it. The following script compiles the model. Since there are only two output classes we can use the binary crossentropy loss function.

In [34]:

text_model.compile(loss="binary_crossentropy",
                       optimizer="adam",
                       metrics=["accuracy"])



Now we can actually train the model.

In [36]:
text_model.fit(train_data, epochs=NB_EPOCHS)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f1433918e90>

In [37]:
results = text_model.evaluate(test_data)
print(results)


[0.6821133494377136, 0.8826121687889099]


## Final Thoughts
In this notebook I showed how to use the BERT tokenizer to create word embeddings that can be used to perform text classification. For the class with the 3 CNNs, we got an accuracy of .8826. I did a similar classification of movie reviews with the same dataset in an earlier notebook and was not able to get an accuracy above 88%. So it seems like this may be a better method for this type of classification but it is not abundantly clear because the results were about the same.