## BERT

Bidirectional Encoder Representations from Transformers (BERT) is a pre-training technique for NLP developed by Google researchers. It is a neural network architecture that has substantially advanced the state of the art on NLP tasks such as text classification, translation, summarization, and question answering.

Now that BERT's been added to TF Hub as a loadable module, it's easy(ish) to add into existing TensorFlow text pipelines. In an existing pipeline, BERT can replace text embedding layers like ELMo and GloVe. Alternatively, fine-tuning BERT can provide both an accuracy boost and faster training time in many cases.

Here, we'll train a model to predict whether a text message is spam using BERT in TensorFlow with TF Hub. Some code was adapted from the Google Research Colab notebook listed below, which applies the same approach to classifying IMDB movie reviews as positive or negative.

Source for Learning:
1. [Google-Research](https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb#scrollTo=xiYrZKaHwV81)
2. [Stack Abuse](https://stackabuse.com/text-classification-with-bert-tokenizer-and-tf-2-0-in-python/)

In [1]:
!pip3 install bert-for-tf2
!pip3 install sentencepiece
!python3 -c "import nltk; nltk.download('punkt'); nltk.download('wordnet')"

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk as nlp
import tensorflow as tf
import bert
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input,Dense
import tensorflow_hub as tfhub
import tensorflow_datasets as tfds
from datetime import datetime

In addition to the standard libraries imported above, this notebook depends on BERT's Python package (bert-for-tf2, imported as bert), which was installed in the first cell.

### Fetching BERT Model from TensorFlow Hub

In [3]:
bert_layer = tfhub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",trainable=True)

### Importing and Cleaning Spam Messages Dataset

In [4]:
df = pd.read_csv('SPAM Text Message Data.csv')
df["Category"] = [1 if each == "spam" else 0 for each in df["Category"]]

import string

def remove_punctuation(text):
    '''Remove every ASCII punctuation character from text.'''
    # Mapping each punctuation mark to None deletes it from the string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def lowerwords(text):
    '''Replace non-letters (including digits) with spaces, then lowercase each word.'''
    text = re.sub("[^a-zA-Z]", " ", text)
    words = [word.lower() for word in text.split()]
    # Join the words back together with single spaces
    return " ".join(words)

df['Message'] = df['Message'].apply(remove_punctuation)
df['Message'] = df['Message'].apply(lowerwords)

# Lemmatize each message. Note: description_list is not used by the later
# cells, which read the cleaned text from df['Message'] directly.
lemma = nlp.WordNetLemmatizer()
description_list = []
for description in df["Message"]:
    description = nlp.word_tokenize(description)
    #description = [ word for word in description if not word in set(stopwords.words("english"))]
    description = [lemma.lemmatize(word) for word in description]
    description = " ".join(description)
    description_list.append(description)
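
To see what the two cleaning steps do, here is a small self-contained sketch that applies the same logic to one made-up message. The sample text and the clean_message helper are illustrative assumptions, not part of the dataset or the notebook.

```python
import re
import string

def clean_message(text):
    """Apply the same two steps as remove_punctuation and lowerwords."""
    # Step 1: delete ASCII punctuation ("Don't" becomes "Dont")
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Step 2: turn remaining non-letters (digits here) into spaces, then lowercase
    text = re.sub("[^a-zA-Z]", " ", text)
    return " ".join(word.lower() for word in text.split())

sample = "Free entry in 2 a wkly comp!!! Don't miss"  # made-up example
cleaned = clean_message(sample)
print(cleaned)  # free entry in a wkly comp dont miss
```

Digits are dropped entirely by this cleaning, so numeric cues such as prices or phone numbers never reach the model.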

In [5]:
X_Messages = df['Message']
y_Labels = df['Category']

print (X_Messages.shape,y_Labels.shape)

(5572,) (5572,)


### Data Preparation for BERT Model

#### Creating a BERT Tokenizer

We first take a reference to the FullTokenizer class from the bert.bert_tokenization module. Next, we load the BERT model from TF Hub as a tfhub.KerasLayer. The trainable parameter is set to False, so BERT's weights stay frozen during training and only the layers added on top are updated. Note that this replaces the trainable layer loaded earlier, so the frozen layer is the one used in the rest of the notebook. We then read two values from the loaded module: the path to BERT's vocabulary file and the do_lower_case flag, which says whether the model expects lowercased input. Finally, we pass vocabulary_file and to_lower_case to the tokenizer's constructor.

In [6]:
BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = tfhub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)
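
To give a feel for what the tokenizer does with the vocabulary file, here is a toy sketch of greedy longest-match-first subword splitting, which is roughly how WordPiece tokenization breaks a word into pieces. The wordpiece_split function and the tiny vocabulary are invented for illustration; the real tokenizer also normalizes text and uses BERT's full vocabulary.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not BERT's real one
toy_vocab = {"[UNK]", "play", "##ing", "##ed", "win", "##ner"}

print(wordpiece_split("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_split("winner", toy_vocab))   # ['win', '##ner']
print(wordpiece_split("xyz", toy_vocab))      # ['[UNK]']
```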

#### Data Preparation

In [7]:
def get_masks(tokens, max_seq_length):
    """
    Return the attention mask: 1 for each real token, 0 for each padding
    position. Sequences longer than max_seq_length are truncated.
    """
    tokens = tokens[:max_seq_length]
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))


def get_segments(tokens, max_seq_length):
    """
    Return segment ids: 0 up to and including the first [SEP], 1 afterwards.
    The result is truncated or zero-padded to max_seq_length.
    """
    tokens = tokens[:max_seq_length]
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))


def get_ids(tokens, tokenizer, max_seq_length):
    """
    Convert tokens to vocabulary ids, truncated or zero-padded to
    max_seq_length.
    """
    tokens = tokens[:max_seq_length]
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    return token_ids + [0] * (max_seq_length - len(token_ids))
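
As a worked example of the three inputs these helpers produce, the sketch below builds ids, mask, and segments for a short token list. It is self-contained, so it uses a hypothetical dictionary in place of the real tokenizer; the token ids are made up and the build_inputs name is an assumption for illustration.

```python
def build_inputs(tokens, vocab, max_seq_length):
    """Return (ids, mask, segments), each truncated or padded to max_seq_length."""
    tokens = tokens[:max_seq_length]
    pad = max_seq_length - len(tokens)

    ids = [vocab[t] for t in tokens] + [0] * pad  # 0 is used as the padding id
    mask = [1] * len(tokens) + [0] * pad          # 1 = real token, 0 = padding

    segments = []
    segment_id = 0
    for token in tokens:
        segments.append(segment_id)
        if token == "[SEP]":
            segment_id = 1                        # tokens after [SEP] belong to segment 1
    segments += [0] * pad
    return ids, mask, segments

# Hypothetical ids, not taken from BERT's vocabulary file
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "free": 3, "entry": 4}

ids, mask, segments = build_inputs(["[CLS]", "free", "entry", "[SEP]"], toy_vocab, 6)
print(ids)       # [1, 3, 4, 2, 0, 0]
print(mask)      # [1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0]
```

For a single-sentence input like a text message, every segment id is 0; the segment ids only differ when two sentences are joined by a [SEP] in the middle.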

#### Creating Data for BERT Model

In [8]:
def CreatingData(X_Train,tokenizer,max_seq_length=128):
    
    X_IDs = []
    X_Masks = []
    X_Segments = []

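    # Iterate over the values directly so the loop does not depend on index labels
    for text in X_Train:
        x = tokenizer.tokenize(text)
        # BERT expects [CLS] at the start and [SEP] at the end of each sequence
        x = ["[CLS]"] + x + ["[SEP]"]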

        X_IDs.append(get_ids(x, tokenizer, max_seq_length))
        X_Masks.append(get_masks(x,max_seq_length))
        X_Segments.append(get_segments(x, max_seq_length))

    return np.array(X_IDs), np.array(X_Masks), np.array(X_Segments)


In [9]:
X_IDs, X_Masks, X_Segments = CreatingData(X_Messages,tokenizer)
print (X_IDs.shape)
print (X_Masks.shape)
print (X_Segments.shape)
print (y_Labels.shape)

(5572, 128)
(5572, 128)
(5572, 128)
(5572,)
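
One detail worth knowing: because the special tokens are added before truncation, a message longer than max_seq_length loses its trailing [SEP]. A possible variation, which is not what this notebook does, is to trim the word pieces first and add the special tokens afterwards. The add_special_tokens helper below is a hypothetical sketch of that idea and assumes max_seq_length is at least 2.

```python
def add_special_tokens(word_pieces, max_seq_length):
    """Trim word pieces so that [CLS] and [SEP] always fit within max_seq_length."""
    kept = word_pieces[: max_seq_length - 2]  # reserve two slots for the special tokens
    return ["[CLS]"] + kept + ["[SEP]"]

long_message = ["a", "b", "c", "d", "e", "f", "g", "h"]  # stand-in for word pieces
tokens = add_special_tokens(long_message, 6)
print(tokens)  # ['[CLS]', 'a', 'b', 'c', 'd', '[SEP]']
```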


### Creating Model

In [10]:
def Build_Model(bert_layer=bert_layer,Max_Seq_Length=128):
    IDs = Input(shape=(Max_Seq_Length,), dtype=tf.int32)
    Masks = Input(shape=(Max_Seq_Length,), dtype=tf.int32)
    Segments = Input(shape=(Max_Seq_Length,), dtype=tf.int32)

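    # The hub layer returns a pooled vector and the per-token hidden states
    Pooled_Output, Sequence_Output = bert_layer([IDs,Masks,Segments])

    # Use the hidden state of the [CLS] token (position 0) as the sentence representation
    x = Sequence_Output[:,0,:]
    # A single sigmoid unit outputs the probability that the message is spam
    Output = Dense(1,activation="sigmoid")(x)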

    return Model(inputs=[IDs,Masks,Segments],outputs=Output)

# Use a lowercase name so the instance does not shadow the imported Keras Model class
model = Build_Model()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy','AUC'])

### Training

In [11]:
model.fit([X_IDs, X_Masks, X_Segments],y_Labels,epochs=10,validation_split=0.1)

Epoch 1/10
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7ffa24fef240>
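
The classification head built in Build_Model is a single sigmoid unit trained with binary cross-entropy. The sketch below works through that arithmetic in plain Python for one example. All numbers are made up: in the notebook the input is BERT's 768-dimensional [CLS] vector and Keras learns the weights, whereas here a 3-dimensional vector and hand-picked weights stand in for them.

```python
import math

def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid(features, weights, bias):
    """One Dense(1, activation='sigmoid') unit: sigmoid(w . x + b)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(z)

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one example; y_pred is clipped to avoid log(0)."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# Made-up stand-ins for the [CLS] vector and the learned parameters
cls_vector = [0.5, -1.0, 2.0]
weights = [0.2, 0.4, 0.5]
bias = 0.1

p_spam = dense_sigmoid(cls_vector, weights, bias)  # predicted probability of spam
predicted_label = int(p_spam >= 0.5)               # threshold at 0.5

loss_if_spam = binary_crossentropy(1, p_spam)      # loss when the true label is spam
loss_if_ham = binary_crossentropy(0, p_spam)       # loss when the true label is not spam
print(predicted_label, loss_if_spam < loss_if_ham)
```

Because the predicted probability is above 0.5, the loss is lower when the true label is spam than when it is not, which is the signal the optimizer follows during training.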