## Classifying text using BERT
### We use BERT as the tokenizer: we pull out its vocabulary file and use it to convert sentences to tokens and tokens to IDs

#### Dataset used: Stanford tweets
#### The data is stored on Google Drive; we use the google.colab library to read it from there
##### Let's get started!

## Import Dependencies
### BeautifulSoup is used to strip XML/HTML markup and return plain text

In [None]:
import numpy as np
import re
import tensorflow as tf
from tensorflow import keras
import math
import pandas as pd
import random #For shuffling the dataset
from bs4 import  BeautifulSoup #The input data would be in XML form

The 'bert-for-tf2' library makes it easy to work with BERT.
sentencepiece is a prerequisite for bert-for-tf2.


In [None]:
!pip install bert-for-tf2
!pip install sentencepiece

Collecting bert-for-tf2
  Downloading https://files.pythonhosted.org/packages/87/df/ab6d927d6162657f30eb0ae3c534c723c28c191a9caf6ee68ec935df3d0b/bert-for-tf2-0.14.5.tar.gz (40kB)
Collecting py-params>=0.9.6
  Downloading https://files.pythonhosted.org/packages/a4/bf/c1c70d5315a8677310ea10a41cfc41c5970d9b37c31f9c90d4ab98021fd1/py-params-0.9.7.tar.gz
Collecting params-flow>=0.8.0
  Downloading https://files.pythonhosted.org/packages/a9/95/ff49f5ebd501f142a6f0aaf42bcfd1c192dc54909d1d9eb84ab031d46056/params-flow-0.8.2.tar.gz
Building wheels for collected packages: bert-for-tf2, py-params, params-flow
  Building wheel for bert-for-tf2 (setup.py) ... done
  Created wheel for bert-for-tf2: filename=bert_for_tf2

All of BERT's variables can be retrieved from TensorFlow Hub.
TensorFlow Hub is a repository of pretrained models for text, images, and video.
Reusing a pretrained model in this way is a form of transfer learning.


In [None]:
#The bert-for-tf2 library is imported simply as bert
import bert
import tensorflow_hub as hub

## Data import
#### The data is imported from Google Drive.
#### This requires mounting the drive, which Colab provides a method for

In [None]:
from google.colab import drive
##Mount the drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Data: The file has no header row, so custom column names are supplied.
encoding='latin1' is passed because the file is stored in Latin-1 encoding.
Latin-1 maps every byte to a character, so reading never fails on bytes that are not valid UTF-8.
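
A small illustration of the encoding choice, as a sketch: every byte value is a valid Latin-1 character, whereas the same bytes may be rejected by a strict UTF-8 decoder. The byte string below is made up for the example and is not taken from the dataset.

```python
# Made-up raw bytes: "café" encoded as Latin-1 (0xE9 is "é").
raw = b"caf\xe9"

# Latin-1 decodes any byte sequence, one byte per character.
as_latin1 = raw.decode("latin1")

# The same bytes are not valid UTF-8, so strict decoding raises an error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1, utf8_ok)
```

The flip side is that Latin-1 never complains: if a file were actually UTF-8, it would still load, but its non-ASCII characters would come out garbled.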



In [None]:
column_names=["sentiment", "id", "date", "query", "user", "text"]
data= pd.read_csv('/content/drive/My Drive/NLP/Projects/BERT/Sentimental Data/train.csv',
                  header=None,
                  names=column_names,
                  engine='python',encoding='latin1')

In [None]:
##Sample Data
data.head(5)

Unnamed: 0,sentiment,id,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [None]:
## Drop the columns below, as they are not needed for classification
data.drop([ "id", "date", "query", "user"],axis=1, inplace=True)

In [None]:
data.tail(5)

Unnamed: 0,sentiment,text
1599995,4,Just woke up. Having no school is the best fee...
1599996,4,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,Happy 38th Birthday to my boo of alll time!!! ...
1599999,4,happy #charitytuesday @theNSPCC @SparksCharity...


The labels take the values 0 and 4, for negative and positive respectively.
Converting the values from 0, 4 to 0, 1 makes binary classification easier.
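
The relabelling can be sketched in plain Python as follows. The `raw_labels` list is a made-up sample, not the real column; on the DataFrame, the same dictionary could be passed to `Series.map`.

```python
from collections import Counter

# Made-up sample of raw labels, which take only the values 0 and 4.
raw_labels = [0, 4, 4, 0, 4]

# Map the two raw values onto the 0/1 targets used for binary classification.
label_map = {0: 0, 4: 1}
binary_labels = [label_map[label] for label in raw_labels]

print(binary_labels)
print(Counter(binary_labels))
```

Using an explicit dictionary has one advantage over a conditional: an unexpected label value raises a `KeyError` instead of passing through silently.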

In [None]:

data.sentiment.value_counts()

4    800000
0    800000
Name: sentiment, dtype: int64

In [None]:

data.sentiment= data.sentiment.apply(lambda label: 1 if label==4 else label)

## Data Preprocessing
#### Data preprocessing cleans the text: stripping markup with the lxml parser, removing @mentions and URLs, removing unnecessary symbols, etc.

In [None]:
## Clean the data with a helper function
def clean_data(text):
  ## Strip any markup using the lxml parser and keep only the text
  text= BeautifulSoup(text,'lxml').get_text()
  ## Remove @mentions from the text
  text= re.sub(r"@[A-Za-z0-9]+", ' ', text)

  ## Remove links
  text= re.sub(r"https?://[A-Za-z0-9./]+",' ', text)
  ## Keep only letters, digits and the punctuation marks . ? !
  text= re.sub(r"[^A-Za-z0-9.?!]+",' ', text)

  ## Collapse repeated spaces into a single space
  text= re.sub(r" +",' ', text)

  return text
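
To see what the regular expressions do, here is a sketch that applies the same four substitutions to a single tweet. It leaves out the BeautifulSoup step so that it runs with the standard library alone. The function name `clean_without_markup` and the sample tweet are made up for this illustration.

```python
import re

def clean_without_markup(text):
    # Same substitutions as clean_data, minus the BeautifulSoup markup step.
    text = re.sub(r"@[A-Za-z0-9]+", " ", text)            # drop @mentions
    text = re.sub(r"https?://[A-Za-z0-9./]+", " ", text)  # drop links
    text = re.sub(r"[^A-Za-z0-9.?!]+", " ", text)         # keep letters, digits, . ? !
    text = re.sub(r" +", " ", text)                       # collapse repeated spaces
    return text

# Made-up tweet with a mention, a link, a hashtag and extra spaces.
sample = "@user check http://example.com/a1 - sooo   good!!! #happy"
cleaned = clean_without_markup(sample)
print(repr(cleaned))
```

Note that a leading space survives where the mention used to be. The function never strips the ends of the string, so calling `.strip()` on the result would be a reasonable addition if that matters downstream.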

In [None]:
## Now let's apply the function to the text column
data.text= data.text.apply(clean_data)

In [None]:
data.tail(10)

Unnamed: 0,sentiment,text
1599990,1,WOOOOO! Xbox is back
1599991,1,Mmmm That sounds absolutely perfect... but my...
1599992,1,ReCoVeRiNg FrOm ThE lOnG wEeKeNd
1599993,1,GRITBOYS
1599994,1,Forster Yeah that does work better than just ...
1599995,1,Just woke up. Having no school is the best fee...
1599996,1,TheWDB.com Very cool to hear old Walt intervie...
1599997,1,Are you ready for your MoJo Makeover? Ask me f...
1599998,1,Happy 38th Birthday to my boo of alll time!!! ...
1599999,1,happy charitytuesday


## Tokenization
Tokenization converts each sentence to tokens and then to IDs, using the BERT tokenizer.
The BERT layer is loaded from TensorFlow Hub, and the vocabulary file and the lowercasing flag are read from it.
With bert-for-tf2 we then create the tokenizer and run the tokenization.

In [None]:
### Get the FullTokenizer class from the bert library
fulltokenizer= bert.bert_tokenization.FullTokenizer
## Load the BERT layer from TensorFlow Hub
bert_layer= hub.KerasLayer(handle= "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                           trainable=False)
## Path of the vocabulary file shipped with the layer
vocab_file= bert_layer.resolved_object.vocab_file.asset_path.numpy()
## Flag telling whether the model expects lowercased text
lower_case= bert_layer.resolved_object.do_lower_case.numpy()
## Instantiate the tokenizer with vocab_file and lower_case
token= fulltokenizer(vocab_file, lower_case)

In [None]:
type(lower_case),lower_case

(numpy.bool_, True)

In [None]:
## Encode a sentence: convert it to tokens, then convert the tokens to IDs
def encode_tokens(sent):
  ## Split the sentence into word-piece tokens (these are strings, not IDs yet)
  sent_token= token.tokenize(sent)
  #print(sent_token)
  ## Convert the tokens to IDs
  token_id= token.convert_tokens_to_ids(sent_token)
  return token_id

In [None]:
encode_tokens(u"I am colo")

[1045, 2572, 8902, 2080]
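
Three words produce four IDs here because the tokenizer splits words that are not in the vocabulary into smaller word pieces. The sketch below shows the greedy longest-match-first idea behind that splitting. It is a simplified illustration, not the library's code: the vocabulary `toy_vocab` and its IDs are invented, and lowercasing plus whitespace splitting stands in for the tokenizer's fuller text normalization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Invented toy vocabulary; the real one is read from the vocab file.
toy_vocab = {"[UNK]": 0, "i": 1, "am": 2, "col": 3, "##o": 4}

def encode(sentence):
    tokens = [p for w in sentence.lower().split() for p in wordpiece(w, toy_vocab)]
    return tokens, [toy_vocab[t] for t in tokens]

tokens, ids = encode("I am colo")
print(tokens, ids)
```

With this toy vocabulary, "colo" is not a known word, so it is covered by two pieces and the sentence again yields four IDs.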

In [None]:
data_inputs= data.text.apply(lambda text: encode_tokens(text))

Convert the pandas Series to Python lists

In [None]:
data_text= data_inputs.tolist()
data_label= data.sentiment.tolist()

## Dataset Creation
The dataset is created from the lists generated above.
The steps are:
1. Build a list that combines each sentence, its label, and the length of the sentence.
2. Sort the list by sentence length and keep only sentences longer than 7 tokens.
  2.1 The model uses n-gram convolutions (bi-, tri- and four-gram). Very short sentences do not give the four-gram filter enough tokens to work with, which would cost some accuracy.
  2.2 Because the list is sorted, sentences of similar length end up next to each other.

3. The dataset is created from a generator.
4. Padded batches are created, so we don't need to pad the token-ID sequences ourselves.
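
The steps above can be sketched in plain Python to show what sorting and per-batch padding achieve. This is a stand-in for the `tf.data` pipeline, not a replacement for it: `make_padded_batches` is a hypothetical helper, and the token IDs are made up.

```python
def make_padded_batches(examples, batch_size, min_len=8, pad_id=0):
    """Filter short sentences, sort by length, and pad each batch to its longest row."""
    kept = sorted((ex for ex in examples if len(ex[0]) >= min_len),
                  key=lambda ex: len(ex[0]))
    for i in range(0, len(kept), batch_size):
        chunk = kept[i:i + batch_size]
        width = max(len(ids) for ids, _ in chunk)
        inputs = [ids + [pad_id] * (width - len(ids)) for ids, _ in chunk]
        labels = [label for _, label in chunk]
        yield inputs, labels

# Made-up (token_ids, label) pairs with lengths 3, 10, 8 and 9.
examples = [([1] * 3, 0), ([2] * 10, 1), ([3] * 8, 0), ([4] * 9, 1)]
batches = list(make_padded_batches(examples, batch_size=2))
for inputs, labels in batches:
    print([len(row) for row in inputs], labels)
```

Because neighbouring sentences have similar lengths after sorting, each batch is padded only up to its own longest sentence rather than to the longest sentence in the whole dataset.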

In [None]:
data_with_len=[[sent, data_label[i], len(sent)] for i, sent in enumerate(data_text)]
## Shuffle the dataset
random.shuffle(data_with_len)
## Sort by sentence length
data_with_len.sort(key=lambda x: x[2])
## Keep only sentences with length > 7
sorted_all= [(sent[0], sent[1]) for sent in data_with_len if sent[2] >7]

In [None]:
## The data is a Python list, so we create the dataset with from_generator
dataset= tf.data.Dataset.from_generator(lambda: sorted_all, output_types=(tf.int32, tf.int32))

In [None]:
next(iter(dataset))

(<tf.Tensor: shape=(8,), dtype=int32, numpy=
 array([ 1045,  2074,  2179,  2041, 16371, 13871,  8454,  2439],
       dtype=int32)>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)

In [None]:
batch_size=32
dataset= dataset.padded_batch(batch_size=batch_size, padded_shapes=((None,), ()))

## Model Building
Before building the model, the dataset is shuffled and then split into training and test sets.

### Model Layers:
Below is the list of layers used in the model:
##### Embedding
##### Bigram
##### Trigram
##### Four-gram
##### Concatenation of n-grams
##### Dense
##### Dropout layer
##### Output layer
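
To make the n-gram layers concrete, here is a heavily simplified sketch of a convolution without padding followed by global max pooling. The assumptions are loud ones: each token is represented by a single number instead of an embedding vector, there is one filter per n-gram size, and the filter weights are all set to 1. The input values are made up.

```python
def conv1d_valid(seq, kernel):
    """Slide the kernel over seq without padding and apply ReLU to each window."""
    k = len(kernel)
    return [max(0.0, sum(seq[i + j] * kernel[j] for j in range(k)))
            for i in range(len(seq) - k + 1)]

def global_max_pool(features):
    """Keep only the strongest response along the sequence."""
    return max(features)

# Made-up one-dimensional "embeddings" for a five-token sentence.
seq = [1.0, -2.0, 3.0, 0.5, 2.0]

bigram = conv1d_valid(seq, [1.0, 1.0])              # 4 windows of 2 tokens
trigram = conv1d_valid(seq, [1.0, 1.0, 1.0])        # 3 windows of 3 tokens
fourgram = conv1d_valid(seq, [1.0, 1.0, 1.0, 1.0])  # 2 windows of 4 tokens

# One pooled value per filter, concatenated into the feature vector
# that would be passed on to the dense layer.
features = [global_max_pool(f) for f in (bigram, trigram, fourgram)]
print(len(bigram), len(trigram), len(fourgram), features)
```

A kernel of size k over a sequence of length n yields n - k + 1 windows, so the four-gram filter needs at least four tokens to produce any output at all.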

In [None]:
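num_batches= math.ceil(len(sorted_all) / batch_size)
num_batches_test= num_batches//10
## shuffle() returns a new dataset, so the result must be assigned back;
## reshuffle_each_iteration=False keeps the train/test split fixed across epochs
dataset= dataset.shuffle(num_batches, reshuffle_each_iteration=False)
test_dataset= dataset.take(num_batches_test)
train_dataset= dataset.skip(num_batches_test)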

In [None]:
num_batches_test, num_batches

(4118, 41188)

In [None]:
len(sorted_all)

1318004
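
The batch counts above follow directly from the number of examples and the batch size. The arithmetic can be checked with a few lines; `num_batches_train` and `last_batch_size` are names introduced here for illustration.

```python
import math

num_examples = 1318004   # len(sorted_all), as reported above
batch_size = 32

num_batches = math.ceil(num_examples / batch_size)     # 41188
num_batches_test = num_batches // 10                   # 4118
num_batches_train = num_batches - num_batches_test     # 37070

# The final batch is smaller because 1318004 is not a multiple of 32.
last_batch_size = num_examples - (num_batches - 1) * batch_size  # 20

print(num_batches, num_batches_test, num_batches_train, last_batch_size)
```

The split is therefore roughly 10% of the batches for testing and 90% for training.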

Model Building

In [None]:
class DNN(tf.keras.Model):

  def __init__(self, vocab_size, embed_dim=128, num_filters=50,
              num_units=512, num_classes=2, dropout_rate=0.1,
              trainable=False, name='dcnn'):
    super(DNN, self).__init__(name= name)
    


    ##Embed layer
    self.embed= keras.layers.Embedding(input_dim= vocab_size, output_dim= embed_dim)

    #Bigram layer
    self.bigram= keras.layers.Conv1D(filters= num_filters, kernel_size=2, activation=tf.nn.relu,  padding='VALID')

    ##Tri gram
    self.trigram= keras.layers.Conv1D(filters= num_filters, kernel_size=3, activation=tf.nn.relu, padding='VALID')

    ##fourgram
    self.fourgram= keras.layers.Conv1D(filters= num_filters, kernel_size=4, activation=tf.nn.relu, padding='VALID')

    ## Global max pooling over the sequence dimension
    self.globalpooling= keras.layers.GlobalMaxPool1D()

    ##Dense
    self.dense= keras.layers.Dense(units= num_units, activation=tf.nn.relu)

    ##Dropout
    self.dropout= keras.layers.Dropout(rate= dropout_rate)


    ##Output layer: a single sigmoid unit for binary classification
    self.output_layer= keras.layers.Dense(units= 1, activation='sigmoid')

  
  def call(self, inputs, training):
    x= self.embed(inputs)

    x_1= self.bigram(x)# (batch_size, seq_len-1, nb_filters)
    x_1= self.globalpooling(x_1)# (batch_size, nb_filters)
    x_2= self.trigram(x)# (batch_size, seq_len-2, nb_filters)
    x_2= self.globalpooling(x_2)# (batch_size, nb_filters)
    x_3= self.fourgram(x)# (batch_size, seq_len-3, nb_filters)
    x_3= self.globalpooling(x_3)# (batch_size, nb_filters)

    ##Concat the ngram layers to the last dimension
    concat= tf.concat([x_1, x_2, x_3], axis=-1)# (batch_size, 3* nb_filters)
    x= self.dense(concat)
    x= self.dropout(x, training)

    output= self.output_layer(x)

    return output
     


## Model Training
### We compile the model and train it for 5 epochs

In [None]:

vocab_size= len(token.vocab)
embed_dim= 200
num_filters= 100
num_classes= 2  # not used by the DNN class, whose output layer always has one sigmoid unit
num_units= 256
dropout_rate= 0.2
num_epochs= 5


In [None]:
print(vocab_size, embed_dim, num_filters, num_units, num_classes, dropout_rate)

30522 200 100 256 1 0.2


In [None]:

Dcnn= DNN(vocab_size, embed_dim, num_filters, num_units, num_classes, dropout_rate)

In [None]:
## Now let's compile the model
Dcnn.compile(optimizer= 'adam', loss= 'binary_crossentropy',
             metrics=['accuracy'])



### Checkpoint Manager
Set up the checkpoint path.
Create a checkpoint that tracks the model (under the name Dcnn).
Create a checkpoint manager, passing it the checkpoint and the checkpoint path.


In [None]:
checkpoint_path='/content/drive/My Drive/NLP/Projects/BERT/Sentimental Data/ckpt_bert_tok'
checkpoint= tf.train.Checkpoint(Dcnn= Dcnn)
##max_to_keep keeps only the latest n checkpoints
checkpoint_man= tf.train.CheckpointManager(checkpoint, 
                                           checkpoint_path, max_to_keep=1)

A callback is created so that a checkpoint is saved after every epoch

In [None]:
class MyCallBack(tf.keras.callbacks.Callback):

  def on_epoch_end(self, epoch, logs=None):
    checkpoint_man.save()
    print("Checkpoint saved at {}".format(checkpoint_path))


In [None]:
##Let's train the model
Dcnn.fit(train_dataset, epochs= num_epochs, callbacks=[MyCallBack()])

Epoch 1/5
  37070/Unknown - 2102s 57ms/step - loss: 0.4294 - accuracy: 0.8021Checkpoint saved at /content/drive/My Drive/NLP/Projects/BERT/Sentimental Data/ckpt_bert_tok.
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7ff95ee1ecc0>

In [None]:
Dcnn.summary()

Model: "dcnn"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_20 (Embedding)     multiple                  6104400   
_________________________________________________________________
conv1d_60 (Conv1D)           multiple                  40100     
_________________________________________________________________
conv1d_61 (Conv1D)           multiple                  60100     
_________________________________________________________________
conv1d_62 (Conv1D)           multiple                  80100     
_________________________________________________________________
global_max_pooling1d_19 (Glo multiple                  0         
_________________________________________________________________
dense_40 (Dense)             multiple                  77056     
_________________________________________________________________
dropout_20 (Dropout)         multiple                  0      

In [None]:
Dcnn.evaluate(test_dataset)



[0.4274745285511017, 0.8303029537200928]

## Prediction
Prediction takes a sentence, cleans it in the same way as the training data, converts it to tokens and then to IDs, and feeds the IDs to the model.
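
The model ends in a single sigmoid unit, so its output is a number between 0 and 1 that has to be turned into a label. A minimal sketch of that last step, assuming a cut-off of 0.5; the function name `label_from_probability` and the probabilities are made up for the example.

```python
def label_from_probability(prob, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return "positive" if prob >= threshold else "negative"

# Made-up model outputs, including the boundary value.
results = {p: label_from_probability(p) for p in (0.08, 0.5, 0.93)}
print(results)
```

An explicit comparison also handles an output of exactly 1.0, which rounding tricks can miss.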

In [None]:
def get_prediction(sentence):
    # Clean the sentence exactly as the training data was cleaned
    tokens = encode_tokens(clean_data(sentence))
    # Add a batch dimension: (seq_len,) -> (1, seq_len)
    inputs = tf.expand_dims(tokens, 0)

    output = Dcnn(inputs, training=False)

    # The sigmoid output is the probability of the positive class
    prob = float(output.numpy()[0][0])
    sentiment = "positive" if prob >= 0.5 else "negative"

    print("Output of the model: {}\nPredicted sentiment: {}.".format(
        output, sentiment))

In [None]:
get_prediction(u"The movie is pretty good")