# Sentiment Analysis Using BERT

In this notebook, we build a binary text classifier to classify movie reviews as either positive or negative using [BERT](https://arxiv.org/abs/1810.04805), a pretrained NLP model that can be used for transfer learning on text data.  We will use the [*ktrain* library](https://github.com/amaiya/ktrain), a lightweight wrapper around Keras to help train (and deploy) neural networks.  For more information on *ktrain*, see [this Medium post](https://towardsdatascience.com/ktrain-a-lightweight-wrapper-for-keras-to-help-train-neural-networks-82851ba889c).

We will begin by installing *ktrain* and importing the required *ktrain* modules.

In [None]:
# install ktrain
!pip3 install ktrain

Collecting ktrain
  Downloading https://files.pythonhosted.org/packages/12/11/49bdde1b08a210365c04367e84d3fb1489db6f4b262358a10719962891a2/ktrain-0.16.3.tar.gz (25.2MB)
Collecting tensorflow==2.1.0
  Downloading https://files.pythonhosted.org/packages/85/d4/c0cd1057b331bc38b65478302114194bd8e1b9c2bbc06e300935c0e93d90/tensorflow-2.1.0-cp36-cp36m-manylinux2010_x86_64.whl (421.8MB)
Collecting scikit-learn==0.21.3
  Downloading https://files.pythonhosted.org/packages/a0/c5/d2238762d780dde84a20b8c761f563fe882b88c5a5fb03c056547c442a19/scikit_learn-0.21.3-cp36-cp36m-manylinux1_x86_64.whl (6.7MB)
Collecting keras_bert>=0.81.0
  Downloading https://files.pythonhosted.org/packages/ec/08/bffa03eb899b20bfb60553e4503f8bac00b83d415bc6ead08f6b447e8aaa/keras-bert-0.84.0.tar.gz
Collecting langdetect

In [None]:
# import ktrain
import ktrain
from ktrain import text

In [None]:
ktrain.__version__

'0.16.3'

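Next, we set up access to the IMDb movie review dataset. In this run the reviews are read from a folder on Google Drive, so we mount the drive instead of downloading and extracting the archive.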

In [None]:
# import TensorFlow (no dataset download happens in this cell)
import tensorflow as tf

In [None]:
# path to the dataset
import os.path
# default Keras download location, left disabled because the data is read from Google Drive:
#dataset = '/root/.keras/datasets/aclImdb'

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## STEP 1:  Load and Preprocess the Dataset

The `texts_from_folder` function will load the training and validation data from the specified folder and automatically preprocess it according to BERT's requirements.  In doing so, the BERT model and vocabulary will be automatically downloaded.

In [None]:
trn, val, preproc = text.texts_from_folder('/content/drive/My Drive/upload', 
                                          maxlen=280, 
                                          preprocess_mode='bert',
                                          train_test_names=['train', 
                                                            'test'],
                                          classes=['pos', 'neg'])

detected encoding: utf-8
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en
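
The call above points `texts_from_folder` at a directory whose subfolders match `train_test_names` and `classes`. As a minimal sketch of that layout, the snippet below builds a throwaway copy of it with one toy review per class. The helper name `make_review_folder`, the file name `0.txt`, and the review text are hypothetical and only illustrate the assumed structure; the real `upload` folder on Drive is not touched.

```python
import tempfile
from pathlib import Path


def make_review_folder(root, splits=("train", "test"), classes=("pos", "neg")):
    """Create root/<split>/<class>/0.txt for every split and class.

    Hypothetical helper: it mirrors the folder structure assumed by the
    texts_from_folder call (train_test_names=['train', 'test'],
    classes=['pos', 'neg']) using toy data.
    """
    root = Path(root)
    for split in splits:
        for label in classes:
            class_dir = root / split / label
            class_dir.mkdir(parents=True, exist_ok=True)
            # One plain-text file per review; the content is a placeholder.
            (class_dir / "0.txt").write_text(f"A toy {label} review.", encoding="utf-8")
    # Return the relative paths of all review files, sorted for stable output.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.txt"))


with tempfile.TemporaryDirectory() as tmp:
    layout = make_review_folder(tmp)
    for rel_path in layout:
        print(rel_path)
```

With this layout, the folder name under each split supplies the label, and the `test` split is what the notebook uses as validation data.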


## STEP 2:  Load a pretrained BERT model and wrap it in a `ktrain.Learner` object

This step can be condensed into a single line of code, but we execute it as two lines for clarity. (Any deprecation warnings printed at this point can be ignored.)

In [None]:
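# build a BERT classifier configured for the preprocessed training data
model = text.text_classifier('bert', trn, preproc=preproc)
# wrap the model and data in a Learner; a small batch size keeps memory use low
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)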

Is Multi-Label? False
maxlen is 280
done.


## STEP 3:  Train and Fine-Tune the Model

We use the `learner.fit_onecycle` method in *ktrain*, which applies a [1cycle learning rate schedule](https://arxiv.org/pdf/1803.09820.pdf).  We set the maximum learning rate to 2e-5, based on recommendations from [the original BERT paper](https://arxiv.org/abs/1810.04805), and train for a single epoch.

In [None]:
# train for 1 epoch with a maximum learning rate of 2e-5
learner.fit_onecycle(2e-5, 1)



begin training using onecycle policy with max lr of 2e-05...
Train on 1059 samples, validate on 111 samples


<tensorflow.python.keras.callbacks.History at 0x7feda6fa4a20>
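
To give a feel for what a 1cycle schedule does over this run, here is a simplified sketch in plain Python. It ramps the learning rate linearly up to the maximum and back down over one epoch. The function name `triangular_lr` and the `start_div` factor (how far below the maximum the schedule starts) are assumptions made for illustration, and *ktrain*'s own implementation may differ in its details.

```python
import math


def triangular_lr(step, total_steps, max_lr, start_div=10.0):
    """Learning rate at `step` for a simplified, triangular 1cycle schedule.

    Illustrative only: the rate rises linearly from max_lr / start_div to
    max_lr at the midpoint, then falls back symmetrically. `start_div` is an
    assumed parameter, not a value taken from ktrain.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    min_lr = max_lr / start_div
    midpoint = (total_steps - 1) / 2
    # 0.0 at the peak, 1.0 at the first and last step
    distance_from_peak = abs(step - midpoint) / midpoint
    return max_lr - (max_lr - min_lr) * distance_from_peak


# Numbers from the run above: 1059 training samples, batch size 6, 1 epoch.
steps_per_epoch = math.ceil(1059 / 6)
schedule = [triangular_lr(s, steps_per_epoch, max_lr=2e-5) for s in range(steps_per_epoch)]

print("steps per epoch:", steps_per_epoch)
print("first, peak, last lr:", schedule[0], max(schedule), schedule[-1])
```

Starting low and peaking mid-run lets the early updates stay small while the pretrained weights adapt, which is one common motivation for this kind of schedule when fine-tuning.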