This example shows how to use the BERT tokenizer for text classification. <br>
(The example can be extended to show the BERT tokenizer used together with BERT embeddings for text classification.)


In [None]:
!pip install bert-for-tf2
!pip install sentencepiece



In [None]:
try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf
import pandas as pd

import tensorflow_hub as hub

from tensorflow.keras import layers
import bert

In [None]:
!ls /content/drive/My\ Drive/DataFiles

IMDBDataset.csv  Telco-Customer-Churn-IBM.csv


In [None]:
movie_reviews = pd.read_csv("/content/drive/My Drive/DataFiles/IMDBDataset.csv")

movie_reviews.isnull().values.any()

movie_reviews.shape

(50000, 2)

In [None]:
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 200)

movie_reviews.head(10)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",positive
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...",positive
5,"Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 o...",positive
6,I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gun...,positive
7,"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny ...",negative
8,Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful i...,negative
9,"If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.<br /><br />Great Camp!!!",positive


**Pre-Processing**

In [None]:
## Pre-process the data
def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Replace punctuation and digits (every non-letter character) with a space
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Remove single letters surrounded by whitespace (\s+ matches one or more whitespace characters)
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

In [None]:
## Remove all html tags
import re
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
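
To see what the cleaning steps do to a single review, the two helpers can be run on small hand-made strings. The sketch below repeats the definitions from the cells above so that it runs on its own; the two sample strings are made up for illustration and are not taken from the dataset.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    sentence = remove_tags(sen)                          # strip HTML tags
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)        # non-letters become spaces
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)  # drop isolated single letters
    sentence = re.sub(r'\s+', ' ', sentence)             # collapse runs of whitespace
    return sentence

# Made-up sample strings (assumption: not from the IMDB data)
sample_a = preprocess_text("Great <br /><br />movie, 10/10! I loved it.")
sample_b = preprocess_text("I didn't like it.")

print(repr(sample_a))
print(repr(sample_b))
```

The second sample shows why review 10 later contains "didn find": the apostrophe becomes a space, and the isolated "t" is then removed as a single letter. A single-letter word such as "I" is also dropped unless it starts the string.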

In [None]:
## Create i/p variable 
reviews = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    reviews.append(preprocess_text(sen))

In [None]:
print(movie_reviews.columns.values)

['review' 'sentiment']


In [None]:
movie_reviews.sentiment.unique()

array(['positive', 'negative'], dtype=object)

In [None]:
## Create o/p variable
import numpy as np

y = movie_reviews['sentiment']

y = np.array(list(map(lambda x: 1 if x=="positive" else 0, y)))

In [None]:
print(reviews[10])
print(y[10])

Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines At first it was very odd and pretty funny but as the movie progressed didn find the jokes or oddness funny anymore Its low budget film thats never problem in itself there were some pretty interesting characters but eventually just lost interest imagine this film would appeal to stoner who is currently partaking For something similar but better try Brother from another planet 
0


**Creating a BERT Tokenizer**

*Special Tokens used in BERT*

[CLS] : The first token of every sequence. A classification token which is normally used in conjunction with a softmax layer for classification tasks. For anything else, it can be safely ignored.

[SEP] : A sequence delimiter token which was used at pre-training for sequence-pair tasks (e.g. next sentence prediction). Must be used when sequence-pair tasks are required. When a single sequence is used, it is just appended at the end.

[MASK] : Token used for masked words. Only used for pre-training.
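
As a rough illustration of where these tokens go, the sketch below wraps already-tokenized text with [CLS] and [SEP]. The helper name `add_special_tokens` is hypothetical and is not part of the `bert` package; the tokenizer calls in this notebook return plain word pieces without special tokens.

```python
def add_special_tokens(tokens_a, tokens_b=None):
    """Hypothetical helper: wrap one or two token lists the way BERT expects."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    if tokens_b is not None:
        # For sequence pairs, the second sequence gets its own closing [SEP]
        tokens += list(tokens_b) + ["[SEP]"]
    return tokens

single = add_special_tokens(["be", "so", "judgment", "##al"])
pair = add_special_tokens(["how", "are", "you"], ["fine", "thanks"])

print(single)
print(pair)
```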

**Transfer learning** is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task.

In [None]:
## Tokenization splits a sentence into tokens; BERT's tokenizer can further split a word into sub-word pieces

# First create a BertTokenizer object from the bert.bert_tokenization module.
# Load the BERT layer with hub.KerasLayer; TF Hub is a repository of pre-trained models (with trainable=False, the BERT weights are not trained further). The layer is used here only to obtain the vocabulary file and the lower-casing flag.
# Pass the BERT vocabulary file and the lower-case flag to the BertTokenizer object.

BertTokenizer = bert.bert_tokenization.FullTokenizer

bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",trainable=False)

vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

In [None]:
## Words that are not in the vocabulary (out-of-vocabulary, OOV) are split into sub-word pieces; every piece after the first is prefixed with ##
tokenizer.tokenize("don't be so judgmental")

['don', "'", 't', 'be', 'so', 'judgment', '##al']
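
The splitting of "judgmental" into "judgment" and "##al" follows a greedy longest-match-first rule. The sketch below imitates that rule on a single lower-cased word with a tiny made-up vocabulary; `wordpiece` and `TOY_VOCAB` are illustrative names, and the real tokenizer additionally lower-cases the text and splits on punctuation before this step.

```python
# Toy vocabulary (assumption: a tiny stand-in for the 30522-entry BERT vocabulary)
TOY_VOCAB = {"don", "t", "be", "so", "judgment", "##al", "'"}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("judgmental", TOY_VOCAB))
print(wordpiece("so", TOY_VOCAB))
print(wordpiece("xyz", TOY_VOCAB))
```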

In [None]:
## Token IDs are the vocabulary indices of the tokens; they are not embeddings yet (an embedding layer later maps each ID to a vector).
tokenizer.convert_tokens_to_ids(tokenizer.tokenize("dont be so judgmental"))

[2123, 2102, 2022, 2061, 8689, 2389]

In [None]:
def tokenize_reviews(text_reviews):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_reviews))

In [None]:
reviews[10]

'Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines At first it was very odd and pretty funny but as the movie progressed didn find the jokes or oddness funny anymore Its low budget film thats never problem in itself there were some pretty interesting characters but eventually just lost interest imagine this film would appeal to stoner who is currently partaking For something similar but better try Brother from another planet '

In [None]:
## Convert tokens to ids in i/p data
tokenized_reviews = [tokenize_reviews(review) for review in reviews]

In [None]:
len(tokenized_reviews)

50000

In [None]:
print(reviews[10])
print(tokenized_reviews[10])

## the differences between the lengths are due to OOV words being split into several word pieces
print(len(reviews[10].split()))
print(len(tokenizer.tokenize(reviews[10])))
print(len(tokenized_reviews[10]))

Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines At first it was very odd and pretty funny but as the movie progressed didn find the jokes or oddness funny anymore Its low budget film thats never problem in itself there were some pretty interesting characters but eventually just lost interest imagine this film would appeal to stoner who is currently partaking For something similar but better try Brother from another planet 
[6316, 1996, 7344, 2003, 2028, 1997, 2216, 21864, 15952, 3152, 2073, 1996, 17211, 2003, 2241, 2105, 1996, 5976, 2791, 1997, 2673, 2738, 2084, 5025, 8595, 12735, 2012, 2034, 2009, 2001, 2200, 5976, 1998, 3492, 6057, 2021, 2004, 1996, 3185, 12506, 2134, 2424, 1996, 13198, 2030, 5976, 2791, 6057, 4902, 2049, 2659, 5166, 2143, 2008, 2015, 2196, 3291, 1999, 2993, 2045, 2020, 2070, 3492, 5875, 3494, 2021, 2776, 2074, 2439, 3037, 5674, 2023, 2143, 2052, 5574, 2000, 2962, 2099, 2040, 2003, 2

**Preparing the training data for BERT**

The reviews in our dataset have varying lengths. Some reviews are very short while others are very long. To train the model, the input sentences in a batch must be of equal length. One way to achieve this is to pad every shorter sentence with 0s up to the length of the longest review in the dataset. However, this results in a sparse matrix that contains a large number of 0s. 
Alternatively, we can pad sentences within each batch. Since we train the model in batches, we can pad the sentences in each training batch locally, up to the length of the longest sentence in that batch. To do so, we first need to find the length of each sentence.
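
The saving from per-batch padding, and from sorting by length first, can be checked with a few lines of arithmetic. This is a minimal sketch on made-up review lengths; `padded_cells` is an illustrative helper, not something the notebook defines.

```python
def padded_cells(lengths, batch_size):
    """Count padding positions when each batch is padded to its own longest item."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        longest = max(chunk)
        total += sum(longest - n for n in chunk)
    return total

lengths = [2, 9, 3, 10]  # made-up token counts of four reviews

# One batch holding everything is the same as padding to the global maximum
global_padding = padded_cells(lengths, batch_size=len(lengths))
# Batches of 2 in the original order
per_batch_padding = padded_cells(lengths, batch_size=2)
# Batches of 2 after sorting by length
sorted_padding = padded_cells(sorted(lengths), batch_size=2)

print(global_padding, per_batch_padding, sorted_padding)
```

Sorting groups reviews of similar length into the same batch, which is why the dataset is sorted by length before batching.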

The following script creates a list of lists where each sublist contains the tokenized review, the label of the review, and the length of the review:

In [None]:
reviews_with_len = [[review, y[i], len(review)]
                 for i, review in enumerate(tokenized_reviews)]

In [None]:
reviews_with_len[10]

[[6316,
  1996,
  7344,
  2003,
  2028,
  1997,
  2216,
  21864,
  15952,
  3152,
  2073,
  1996,
  17211,
  2003,
  2241,
  2105,
  1996,
  5976,
  2791,
  1997,
  2673,
  2738,
  2084,
  5025,
  8595,
  12735,
  2012,
  2034,
  2009,
  2001,
  2200,
  5976,
  1998,
  3492,
  6057,
  2021,
  2004,
  1996,
  3185,
  12506,
  2134,
  2424,
  1996,
  13198,
  2030,
  5976,
  2791,
  6057,
  4902,
  2049,
  2659,
  5166,
  2143,
  2008,
  2015,
  2196,
  3291,
  1999,
  2993,
  2045,
  2020,
  2070,
  3492,
  5875,
  3494,
  2021,
  2776,
  2074,
  2439,
  3037,
  5674,
  2023,
  2143,
  2052,
  5574,
  2000,
  2962,
  2099,
  2040,
  2003,
  2747,
  2112,
  15495,
  2005,
  2242,
  2714,
  2021,
  2488,
  3046,
  2567,
  2013,
  2178,
  4774],
 0,
 93]

In [None]:
## Shuffle the dataset to uniformly distribute 1s and 0s in the data
import random
random.shuffle(reviews_with_len)

In [None]:
## Sort based on length of review; 3rd item in the sublist i.e. the length of the review
reviews_with_len.sort(key=lambda x: x[2])

In [None]:
## Remove length of the review attribute
sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]

In [None]:
sorted_reviews_labels[3]

([2062, 23873, 3993, 2062, 11259, 2172, 2172, 2062, 14888], 0)

In [None]:
## Convert the sorted dataset into a TensorFlow 2.0-compliant input dataset shape.
processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))

In [None]:
## Padding the dataset for each batch: 
# Let's use batch size = 32; meaning that after processing 32 reviews, the weights of the neural network will be updated.
# To pad the reviews locally with respect to batches, execute the following:

BATCH_SIZE = 32
batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))

In [None]:
# Print the first batch and see how padding has been applied to it:
next(iter(batched_dataset))
# every review in the batch is padded with 0s to the length of the longest review in the batch (the last one, because the data is sorted by length)

(<tf.Tensor: shape=(32, 21), dtype=int32, numpy=
 array([[ 2054,  5896,  2054,  2466,  2054,  6752,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 3078,  5436,  3078,  3257,  3532,  7613,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 3191,  1996,  2338,  5293,  1996,  3185,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2062, 23873,  3993,  2062, 11259,  2172,  2172,  2062, 14888,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2023,  3185,  2003,  6659,  2021,  2009,  2038,  2070,  2204,
          3896,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 1045,  2876,  9278,  2023,  2028,  2130,  2006,  7922, 12635,
    

In [None]:
import math
# total number of batches in the dataset
TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
# 10% of the batches = test batches
TEST_BATCHES = TOTAL_BATCHES // 10

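# shuffle() returns a new dataset, so the result has to be assigned.
# reshuffle_each_iteration=False is intended to keep the batch order fixed, so that the test and training splits do not mix across epochs.
batched_dataset = batched_dataset.shuffle(TOTAL_BATCHES, reshuffle_each_iteration=False)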

test_data = batched_dataset.take(TEST_BATCHES)
train_data = batched_dataset.skip(TEST_BATCHES)

In [None]:
TOTAL_BATCHES

1563
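
The split sizes follow directly from the numbers above. The short check below only redoes the arithmetic for 50000 reviews and a batch size of 32; the lower-case variable names are illustrative.

```python
import math

N_REVIEWS = 50000   # number of rows reported by movie_reviews.shape
BATCH_SIZE = 32

total_batches = math.ceil(N_REVIEWS / BATCH_SIZE)
test_batches = total_batches // 10
train_batches = total_batches - test_batches
# The final batch is smaller because 50000 is not a multiple of 32
last_batch_size = N_REVIEWS - (total_batches - 1) * BATCH_SIZE

print(total_batches, test_batches, train_batches, last_batch_size)
```

The split is made in units of batches, so the exact number of test reviews depends on whether the one partial batch ends up in the test part.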

**Creating the Model**


An **embedding** is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words.

**Dropout** is implemented by randomly selecting nodes to be dropped with a given probability (e.g. 10%) at each weight update. This is how dropout regularization is implemented in Keras. Dropout is only used during the training of a model and is not used when evaluating the model. With a dropout rate of 10%, on average one in 10 inputs is randomly excluded at each update. Dropout can be applied to incoming (visible) units as well as hidden units.
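
A minimal sketch of the idea in plain Python, assuming the common "inverted dropout" formulation in which the kept values are scaled by 1 / (1 - rate) during training (this is also how the Keras `Dropout` layer documents its behaviour). The function `dropout` below is illustrative and is not the Keras implementation.

```python
import random

def dropout(values, rate, training, rng):
    """Illustrative inverted dropout on a list of activations."""
    if not training:
        return list(values)          # evaluation: values pass through unchanged
    scale = 1.0 / (1.0 - rate)       # keeps the expected sum of activations the same
    return [0.0 if rng.random() < rate else v * scale for v in values]

rng = random.Random(0)               # fixed seed only to make the run repeatable
activations = [1.0] * 10

train_out = dropout(activations, rate=0.2, training=True, rng=rng)
eval_out = dropout(activations, rate=0.2, training=False, rng=rng)

print(train_out)
print(eval_out)
```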

**Softmax** calculates a probability for every possible class. It is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.
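
The computation itself is short. The sketch below implements softmax in plain Python and also shows the sigmoid that the model in this notebook uses for the two-class case; for two classes, a softmax over the logits [x, 0] gives the same probability as sigmoid(x). The function names are illustrative.

```python
import math

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

probs = softmax([2.0, 1.0, 0.1])
two_class = softmax([1.5, 0.0])[0]

print(probs, sum(probs))
print(two_class, sigmoid(1.5))
```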

1 **Epoch** = 1 forward pass + 1 backward pass over ALL training samples. <br>
**Batch size** = number of training samples in 1 forward/backward pass. <br>
**Number of iterations** = number of passes, where 1 pass = 1 forward pass + 1 backward pass over one batch. <br> Example: if we have 1000 training samples and the batch size is set to 500, it takes 2 iterations to complete 1 epoch.
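
The relation between these quantities can be written as a one-line function. This is a small sketch with an illustrative function name; it rounds up so that a final, smaller batch still counts as one iteration.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

even_split = iterations_per_epoch(1000, 500)    # the example from the text
uneven_split = iterations_per_epoch(1000, 300)  # last batch holds only 100 samples
updates_in_5_epochs = 5 * even_split

print(even_split, uneven_split, updates_in_5_epochs)
```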


In [None]:
## This class inherits from the tf.keras.Model class
# and uses 3 convolutional (Conv1D) layers with different kernel sizes
class TEXT_MODEL(tf.keras.Model):
    
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")
            
    # Global max pooling is applied to the output of each convolutional layer.
    # The three pooled outputs are concatenated and fed to the first dense layer.
    # The last dense layer predicts the sentiment; with only 2 classes it is a single sigmoid unit.

    def call(self, inputs, training=None):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(l) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3) 
        
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        
        return model_output
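
The convolution and pooling steps in `call` can be traced by hand on a toy input. The sketch below is a strong simplification and not the Keras implementation: each position is a single number instead of an embedding vector, there is one hand-picked filter per kernel size instead of `cnn_filters` learned ones, and there is no bias term.

```python
def conv1d_valid(sequence, kernel):
    """1-D convolution with 'valid' padding followed by ReLU (single filter)."""
    k = len(kernel)
    out = []
    for i in range(len(sequence) - k + 1):      # output length = len - k + 1
        window = sequence[i:i + k]
        s = sum(w * x for w, x in zip(kernel, window))
        out.append(max(0.0, s))                 # ReLU
    return out

def global_max_pool(values):
    """Keep only the strongest response of a filter over the whole sequence."""
    return max(values)

sequence = [0.5, -1.0, 2.0, 0.0, 1.0]           # made-up inputs
kernels = [
    [1.0, 1.0],                                  # kernel_size=2
    [1.0, 0.0, -1.0],                            # kernel_size=3
    [0.5, 0.5, 0.5, 0.5],                        # kernel_size=4
]

conv_outputs = [conv1d_valid(sequence, k) for k in kernels]
features = [global_max_pool(o) for o in conv_outputs]  # one value per filter

print([len(o) for o in conv_outputs])
print(features)
```

Because the pooling collapses the sequence axis, the concatenated feature vector has a fixed size regardless of how long the padded batch is, which is what lets the model accept batches of different lengths.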

In [None]:
VOCAB_LENGTH = len(tokenizer.vocab)
print(VOCAB_LENGTH)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2

DROPOUT_RATE = 0.2

NB_EPOCHS = 5

30522


In [None]:
text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)

In [None]:
## Before we can actually train the model we need to compile it
if OUTPUT_CLASSES == 2:
    text_model.compile(loss="binary_crossentropy",
                       optimizer="adam",
                       metrics=["accuracy"])
else:
    text_model.compile(loss="sparse_categorical_crossentropy",
                       optimizer="adam",
                       metrics=["sparse_categorical_accuracy"])
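
For the two-class case, the loss and metric chosen above can be reproduced by hand on a few made-up predictions. This is a sketch of the standard formulas, not the Keras code; the clipping constant `eps` and the 0.5 decision threshold are assumptions chosen to mirror the usual Keras defaults.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_pred))
    return hits / len(y_true)

labels = [1, 0, 1, 1]                     # made-up labels
predictions = [0.9, 0.2, 0.4, 0.8]        # made-up sigmoid outputs

uninformed_loss = binary_crossentropy([1, 0], [0.5, 0.5])
print(binary_crossentropy(labels, predictions))
print(accuracy(labels, predictions))
print(uninformed_loss)
```

A model that always outputs 0.5 gets a loss of ln 2 (about 0.693), which is a useful baseline when reading the training log.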

In [None]:
text_model.fit(train_data, epochs=NB_EPOCHS)
# the loss reported for each epoch is the value of the loss function (here binary cross-entropy) averaged over the training batches of that epoch

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f5399b77358>

In [None]:
?tf.keras.Model.evaluate

In [None]:
results = text_model.evaluate(test_data)
print(results)

[0.614834725856781, 0.8872195482254028]
