# NLP with BERT for Sentiment Analysis

### Importing the libraries

In [1]:
import os.path
import numpy as np
import tensorflow as tf
import ktrain
from ktrain import text

In the cell above we import os.path, which helps us build the path to the dataset, along with NumPy, TensorFlow (version 2.0 here) and ktrain, the library we use to build the BERT model. From ktrain we specifically import the text module.

## Part 1: Data Preprocessing

### Loading the IMDB dataset

In [3]:
data = tf.keras.utils.get_file(fname = "aclImdb_v1.tar.gz",
                              origin = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
                              extract = True)

IMDB_DATADIR = os.path.join(os.path.dirname(data), 'aclImdb')

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [4]:
print(os.path.dirname(data))
print(IMDB_DATADIR)

C:\Users\User1\.keras\datasets
C:\Users\User1\.keras\datasets\aclImdb


We download the dataset directly from the Stanford AI website, which hosts many useful datasets. The get_file function from the utils module of tensorflow.keras fetches the archive and, because extract is set to True, unpacks it. We then build IMDB_DATADIR by joining the directory that holds the download (the Keras datasets folder) with the name of the extracted folder, aclImdb. The second cell prints both locations.
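The path handling described above can be tried on its own, without downloading anything. This is a minimal sketch: the helper name imdb_dir_from_archive and the example archive path are hypothetical, chosen only to show what os.path.dirname and os.path.join do.

```python
import os.path


def imdb_dir_from_archive(archive_path, folder="aclImdb"):
    """Return the extracted dataset folder that sits next to the archive.

    `archive_path` stands in for the value returned by get_file;
    `folder` is the name of the directory created by extraction.
    """
    # dirname drops the file name, join appends the extracted folder name
    return os.path.join(os.path.dirname(archive_path), folder)


# Hypothetical archive location, used for illustration only
example_archive = os.path.join("datasets", "aclImdb_v1.tar.gz")
print(imdb_dir_from_archive(example_archive))
```

If the folder name inside the archive ever differs, only the folder argument needs to change.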

### Creating the training and test sets

In [5]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(datadir=IMDB_DATADIR,
                                                                      classes=['pos','neg'],
                                                                      maxlen=500,
                                                                      train_test_names=['train','test'],
                                                                      preprocess_mode='bert')

detected encoding: utf-8
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


Here we use the texts_from_folder function from ktrain's text module. It returns three things: the training set, the test set, and a preprocessor object (preproc) that holds the text preprocessing applied for BERT.

The parameters are as follows. datadir is the directory that contains all the texts. classes lists the class labels, here pos and neg, which must match the subfolder names (check the folder names on disk). maxlen is the maximum length kept from each text, counted in tokens (roughly words) rather than characters; longer reviews are truncated. train_test_names gives the names of the folders that hold the training and test texts. preprocess_mode must be bert in this case.
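Since classes and train_test_names must match the folder names on disk, it can help to check the layout before preprocessing. The sketch below uses only the standard library; count_reviews is a hypothetical helper, not part of ktrain, and it assumes the reviews are stored as .txt files under split/class subfolders (for example train/pos).

```python
import os


def count_reviews(datadir, splits=("train", "test"), classes=("pos", "neg")):
    """Count the .txt files in every <split>/<class> folder under datadir.

    A missing folder is reported with a count of 0, which makes a
    misspelled split or class name easy to spot.
    """
    counts = {}
    for split in splits:
        for label in classes:
            folder = os.path.join(datadir, split, label)
            if os.path.isdir(folder):
                counts[(split, label)] = sum(
                    1 for name in os.listdir(folder) if name.endswith(".txt")
                )
            else:
                counts[(split, label)] = 0
    return counts
```

Calling count_reviews(IMDB_DATADIR) after the download should show a non-zero count for each of the four folders; a zero suggests that a name passed to texts_from_folder does not match the directory structure.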

## Part 2: Building the BERT model

In [6]:
bert_model = text.text_classifier(name='bert',
                                 train_data=(x_train, y_train),
                                 preproc=preproc)

Is Multi-Label? False
maxlen is 500
done.


In the cell above we create a BERT model using the text module from ktrain. ktrain is a wrapper around TensorFlow and Keras that makes complicated models such as BERT easy to build, here in a single call. The text_classifier function takes the name of the model to use (bert), the training data as a tuple of texts and labels, and the preprocessor returned in the previous step.

## Part 3: Training the BERT model

In [7]:
learner = ktrain.get_learner(model=bert_model,
                            train_data=(x_train, y_train),
                            val_data=(x_test, y_test),
                            batch_size=6)

Next we create a learner for the BERT model, which will be trained to classify each review as positive or negative. The get_learner function, available directly in the ktrain package, returns a learner instance. It takes the model, the training data, the validation data (here the test set) and the batch size. With a maximum text length of 500, we use a batch size of 6.
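A batch size of 6 means each epoch consists of many small update steps. The arithmetic below is a rough sketch that assumes all 25,000 labeled training reviews of the IMDB dataset are loaded; steps_per_epoch is a hypothetical helper, not a ktrain function.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once.

    The last batch may be smaller than batch_size, hence the ceiling.
    """
    return math.ceil(num_examples / batch_size)


train_size = 25000  # assumption: every labeled IMDB training review is used
print(steps_per_epoch(train_size, 6))
```

This is why even a single epoch already involves several thousand weight updates.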

In [None]:
learner.fit_onecycle(lr=2e-5, epochs=1)

We call the fit_onecycle method on the learner returned by get_learner. We train the BERT model for only 1 epoch. That may not sound like enough to reach high accuracy, but it is. We pass a learning rate of 2e-5 and set the number of epochs to 1.
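The name fit_onecycle refers to a one-cycle learning-rate policy, in which the rate rises to a peak and then falls again over the course of training. The function below is a simplified triangular illustration of that idea, not ktrain's actual implementation; one_cycle_lr and the start_div factor are assumptions made for the sketch.

```python
def one_cycle_lr(step, total_steps, max_lr=2e-5, start_div=10.0):
    """Simplified triangular one-cycle schedule (illustrative only).

    The rate climbs linearly from max_lr / start_div to max_lr during the
    first half of training and falls back linearly in the second half.
    """
    if total_steps < 2:
        return max_lr
    min_lr = max_lr / start_div
    peak = (total_steps - 1) / 2  # step index where the rate is highest
    if step <= peak:
        frac = step / peak
    else:
        frac = (total_steps - 1 - step) / peak
    return min_lr + (max_lr - min_lr) * frac


# Print the rate at each step of a tiny 5-step run
for s in range(5):
    print(s, one_cycle_lr(s, total_steps=5))
```

In this sketch the value passed as lr plays the role of the peak rate, so 2e-5 is only reached around the middle of the epoch.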