## BERT for Prediction of Disaster Tweets

Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP pre-training developed by Google. It's a neural network architecture designed by Google researchers that has reshaped the state of the art for NLP tasks like text classification, translation, summarization, and question answering.

Now that BERT has been added to TF Hub as a loadable module, it's easy(ish) to add to existing TensorFlow text pipelines. In an existing pipeline, BERT can replace text embedding layers like ELMo and GloVe. Alternatively, fine-tuning BERT can provide both an accuracy boost and faster training time in many cases.

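Here, we'll train a model to predict whether a tweet describes a real disaster, using BERT in TensorFlow with TF Hub. Some code was adapted from the Google Research Colab notebook listed under the sources below, which applies the same approach to classifying IMDB movie reviews as positive or negative.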

Sources for Learning:
1. [Google-Research](https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb#scrollTo=xiYrZKaHwV81)
2. [Stack Abuse](https://stackabuse.com/text-classification-with-bert-tokenizer-and-tf-2-0-in-python/)

In [None]:
!pip3 install bert-for-tf2
!pip3 install sentencepiece
!python3 -c "import nltk; nltk.download('punkt'); nltk.download('wordnet'); nltk.download('stopwords')"

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk as nlp
from nltk.corpus import stopwords
import string
import tensorflow as tf
import bert

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input,Dense,LSTM,Dropout
import tensorflow_hub as tfhub
import tensorflow_datasets as tfds
from datetime import datetime

In addition to the standard libraries imported above, we need BERT's Python package (bert-for-tf2), which the first cell installs and which provides the bert module.

### Fetching BERT Model from TensorFlow Hub

In [None]:
bert_layer = tfhub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",trainable=True)

### Importing and Cleaning the Disaster Tweets Dataset

In [None]:
Train = pd.read_csv('train.csv')
Test = pd.read_csv('test.csv')

# Removing Tokens That Contain Non-Alphabet Characters
def remove_non_alphabet(x):
    return ' '.join([i for i in x.split() if i.isalpha()])

# Lowering Words
def lowerwords(text):
    text = re.sub("[^a-zA-Z]"," ",text) # Replacing Non-Letters (Including Numbers) with Spaces
    text = [word.lower() for word in text.split()]
    # joining the list of words with space separator
    return " ".join(text)


# Removing Punctuation
def remove_punctuation(text):
    '''a function for removing punctuation'''
    # replacing the punctuations with no space, 
    # which in effect deletes the punctuation marks 
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)


# Removing StopWords
def remove_stopwords(text):
    StopWords = set(stopwords.words('english'))
    output = ' '.join([i for i in text.split() if i not in StopWords])
    return output


# Removing URLs and HTML Tags
def remove_urls(text):
    # three alternatives: http(s) links, www links, and HTML tags
    text = re.sub(r'https?://\S+|www\.\S+|<.*?>', '', text, flags=re.MULTILINE)
    return text


# Lemmatizer
def Lemmatizing(description):
    description = nlp.word_tokenize(description)
    #description = [ word for word in description if not word in set(stopwords.words("english"))]
    lemma = nlp.WordNetLemmatizer()
    description = [lemma.lemmatize(word) for word in description]
    description = " ".join(description)
    
    return description

def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
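
As a quick sanity check of the URL and tag stripping step, here is a minimal standalone sketch. The pattern, the helper name strip_urls_and_tags, and the sample tweet are assumptions made for illustration; the extra whitespace collapsing is not part of the notebook's function.

```python
import re

# Assumed pattern: http(s) links, www links, and HTML tags as three alternatives
URL_OR_TAG = re.compile(r'https?://\S+|www\.\S+|<.*?>')

def strip_urls_and_tags(text):
    """Delete URL and HTML-tag matches, then collapse the leftover whitespace."""
    return ' '.join(URL_OR_TAG.sub('', text).split())

# Made-up sample tweet
sample = "Fire near <b>downtown</b> http://t.co/abc123 stay safe www.example.com"
cleaned = strip_urls_and_tags(sample)
print(cleaned)
```

Running a few hand-written strings through each cleaning function like this is a cheap way to catch a pattern that silently matches nothing.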

In [None]:
Train['text'] = Train['text'].apply(remove_urls)
Train['text'] = Train['text'].apply(remove_emoji)
Train['text'] = Train['text'].apply(remove_punctuation)
Train['text'] = Train['text'].apply(remove_non_alphabet)
Train['text'] = Train['text'].apply(lowerwords)
Train['text'] = Train['text'].apply(Lemmatizing)
Train['text'] = Train['text'].apply(remove_stopwords)

Test['text'] = Test['text'].apply(remove_urls)
Test['text'] = Test['text'].apply(remove_emoji)
Test['text'] = Test['text'].apply(remove_punctuation)
Test['text'] = Test['text'].apply(remove_non_alphabet)
Test['text'] = Test['text'].apply(lowerwords)
Test['text'] = Test['text'].apply(Lemmatizing)
Test['text'] = Test['text'].apply(remove_stopwords)


X_Train = Train['text']
y_Labels = Train['target']
X_Test = Test['text']
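
The two runs of apply calls above perform the same steps in the same order on both frames. One way to keep them in sync is to fold the steps into a single function, sketched below with simplified stand-ins for three of the cleaning functions. The names CLEANING_STEPS, clean, keep_alphabetic_tokens and lowercase are hypothetical and do not appear in the notebook.

```python
import string
from functools import reduce

def remove_punctuation(text):
    # Delete every character listed in string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))

def keep_alphabetic_tokens(text):
    # Drop whitespace-separated tokens that contain anything but letters
    return ' '.join(tok for tok in text.split() if tok.isalpha())

def lowercase(text):
    return text.lower()

# Order matters: each step receives the output of the previous one
CLEANING_STEPS = [remove_punctuation, keep_alphabetic_tokens, lowercase]

def clean(text, steps=CLEANING_STEPS):
    """Apply every cleaning step to the text, left to right."""
    return reduce(lambda acc, step: step(acc), steps, text)

example = clean("Forest FIRE near La Ronge, Sask. 13,000 evacuated!")
print(example)
```

With pandas, a composed function like this would be applied once per frame, for example with Train['text'].apply(clean), so the train and test preprocessing cannot drift apart.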

### Data Preparation for BERT Model

#### Creating a BERT Tokenizer

We first take a reference to the FullTokenizer class from the bert.bert_tokenization module. Next, we create a BERT embedding layer by loading the BERT model through tfhub.KerasLayer. This reloads the layer fetched earlier, now with the trainable parameter set to False, which means the BERT weights stay frozen during training. In the next line, we read the path of the BERT vocabulary file from the layer and convert it from a tensor to a NumPy value. We then read the layer's do_lower_case flag, which says whether the text should be lowercased, and finally we pass the vocabulary_file and to_lower_case variables to the BertTokenizer constructor to create the tokenizer.

In [None]:
BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = tfhub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)
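
To give a feel for what the tokenizer does to each word, here is a toy sketch of greedy longest-match-first WordPiece splitting. It is a simplified illustration, not the library's implementation: the function name wordpiece and the six-entry TOY_VOCAB are assumptions, whereas the real vocabulary file holds roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary
TOY_VOCAB = {"wild", "##fire", "##s", "fire", "flood", "##ing"}

tokens = [piece
          for word in "wildfires flooding earthquake".split()
          for piece in wordpiece(word, TOY_VOCAB)]
print(tokens)
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover unseen words; only a word with no matching pieces at all falls back to [UNK].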

#### Data Preparation

In [None]:
def get_masks(tokens, max_seq_length):
    """
    Builds the attention mask: 1 for each real token, 0 for each padding
    position. Sequences longer than max_seq_length are trimmed.
    """
    tokens = tokens[:max_seq_length]
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))


def get_segments(tokens, max_seq_length):
    """
    Builds the segment IDs: 0 up to and including the first [SEP],
    1 for tokens after it, and 0 for padding positions.
    """
    tokens = tokens[:max_seq_length]
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))


def get_ids(tokens, tokenizer, max_seq_length):
    """
    Converts tokens to vocabulary IDs, trimming to max_seq_length or
    padding with 0 (the [PAD] ID) as needed.
    """
    tokens = tokens[:max_seq_length]
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    return token_ids + [0] * (max_seq_length - len(token_ids))
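
The three helper functions produce three parallel lists per tweet. A minimal worked example, using a hypothetical five-entry vocabulary with made-up IDs and a maximum length of 6, shows how they line up. The encode function and TOY_VOCAB are illustrative names only.

```python
# Hypothetical vocabulary; the IDs in the real BERT vocabulary file differ
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "forest": 3, "fire": 4}

def encode(tokens, vocab, max_seq_length):
    """Return (ids, mask, segments) for a single-sentence input."""
    tokens = tokens[:max_seq_length]
    padding = [0] * (max_seq_length - len(tokens))
    ids = [vocab[t] for t in tokens] + padding   # vocabulary IDs, padded with 0
    mask = [1] * len(tokens) + padding           # 1 = real token, 0 = padding
    segments = [0] * max_seq_length              # one sentence, so all segment 0
    return ids, mask, segments

ids, mask, segments = encode(["[CLS]", "forest", "fire", "[SEP]"], TOY_VOCAB, 6)
print(ids)
print(mask)
print(segments)
```

Because each tweet is a single sentence ending in one [SEP], the segment list is all zeros here; it would only contain ones for a second sentence following the first [SEP].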

#### Creating Data for BERT Model

In [None]:
def CreatingData(X_Train,tokenizer,max_seq_length=150):
    
    X_IDs = []
    X_Masks = []
    X_Segments = []

    # Iterating over the values directly, so the result does not depend on the Series index
    for text in X_Train:
        x = tokenizer.tokenize(text)
        x = ["[CLS]"] + x + ["[SEP]"]

        X_IDs.append(get_ids(x, tokenizer, max_seq_length))
        X_Masks.append(get_masks(x,max_seq_length))
        X_Segments.append(get_segments(x, max_seq_length))

    return np.array(X_IDs), np.array(X_Masks), np.array(X_Segments)


In [None]:
X_Train_IDs, X_Train_Masks, X_Train_Segments = CreatingData(X_Train,tokenizer)
X_Test_IDs, X_Test_Masks, X_Test_Segments = CreatingData(X_Test,tokenizer)
print(X_Train_IDs.shape)
print(X_Test_IDs.shape)

(7613, 150)
(3263, 150)


### Creating Model

In [None]:
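def Build_Model(bert_layer=bert_layer,Max_Seq_Length=150):
    IDs = Input(shape=(Max_Seq_Length,), dtype=tf.int32)
    Masks = Input(shape=(Max_Seq_Length,), dtype=tf.int32)
    Segments = Input(shape=(Max_Seq_Length,), dtype=tf.int32)

    # The BERT layer returns a pooled output and a per-token sequence output
    Pooled_Output, Sequence_Output = bert_layer([IDs,Masks,Segments])

    # Taking the hidden state of the first ([CLS]) token as the tweet representation
    x = Sequence_Output[:,0,:]
    x = Dropout(0.2)(x)
    Outputs = Dense(1,activation="sigmoid")(x)

    # tf.keras.Model is referenced explicitly because the name Model is rebound below,
    # which would otherwise break this function when the cell is run a second time
    return tf.keras.Model(inputs=[IDs,Masks,Segments],outputs=Outputs)

Model = Build_Model()
Model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy','AUC'])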
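
At inference time, when dropout is inactive, the head on top of BERT amounts to logistic regression on the first token's vector. The plain-Python sketch below illustrates that computation under stated assumptions: the three-dimensional vectors, the weights, and the function name classification_head are made up for the example, whereas the real hidden size for this model is 768.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classification_head(sequence_output, weights, bias):
    """Score one example from the hidden state of its first ([CLS]) token."""
    cls_vector = sequence_output[0]  # counterpart of Sequence_Output[:, 0, :]
    z = sum(w * h for w, h in zip(weights, cls_vector)) + bias
    return sigmoid(z)

# Made-up hidden states for a two-token sequence with hidden size 3
sequence_output = [[0.5, -1.0, 2.0],
                   [0.1, 0.2, 0.3]]
weights = [1.0, 1.0, 0.25]  # hypothetical Dense(1) kernel
bias = 0.0                  # hypothetical Dense(1) bias

probability = classification_head(sequence_output, weights, bias)
print(probability)
```

Only the first row of the sequence output enters the score; the remaining token vectors influence the prediction only through BERT's attention layers.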

In [None]:
Model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, 150)]        0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            [(None, 150)]        0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            [(None, 150)]        0                                            
__________________________________________________________________________________________________
keras_layer_1 (KerasLayer)      [(None, 768), (None, 109482241   input_4[0][0]                    
                                                                 input_5[0][0]                

In [None]:
Model.fit([X_Train_IDs, X_Train_Masks, X_Train_Segments], y_Labels, epochs=25, batch_size=128, validation_split=0.1)

Epoch 1/25
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<tensorflow.python.keras.callbacks.History at 0x7f1f509ffa20>
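
With validation_split=0.1, Keras holds out the last 10% of the samples, taken before shuffling, for validation. The arithmetic for this run can be sketched as follows, assuming the split index is the floor of n * (1 - validation_split). The helper name split_sizes is hypothetical, and 7613 is the training set size printed earlier.

```python
import math

def split_sizes(num_samples, validation_split, batch_size):
    """Return (training samples, validation samples, batches per epoch)."""
    train_size = math.floor(num_samples * (1.0 - validation_split))
    val_size = num_samples - train_size
    steps_per_epoch = math.ceil(train_size / batch_size)  # last batch may be partial
    return train_size, val_size, steps_per_epoch

train_size, val_size, steps = split_sizes(7613, 0.1, 128)
print(train_size, val_size, steps)
```

Since the held-out rows are simply the tail of train.csv, the validation metrics are only representative if the file is not ordered by label or topic.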

### Predictions

In [None]:
Predictions = np.array(Model.predict([X_Test_IDs, X_Test_Masks, X_Test_Segments]))
Predictions = np.round(Predictions.flatten()).astype(int)

submission = pd.read_csv('sample_submission.csv')
submission['target'] = Predictions
submission.to_csv('./submission.csv', index=False, header=True)
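
np.round rounds halves to the nearest even value, so a predicted probability of exactly 0.5 becomes 0. An explicit threshold makes the cut-off visible and adjustable; the sketch below uses plain Python, and the function name to_labels and the sample probabilities are assumptions for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 (disaster) or 0 (not a disaster)."""
    return [int(p >= threshold) for p in probabilities]

# Made-up sigmoid outputs
probabilities = [0.08, 0.5, 0.93]

labels = to_labels(probabilities)
strict_labels = to_labels(probabilities, threshold=0.9)
print(labels)
print(strict_labels)
```

Raising the threshold trades recall for precision, which can be tuned on the validation split before writing the submission file.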