#### We need to install the ktrain library. It's a lightweight wrapper for Keras that helps train neural networks. With only a few lines of code, it lets you build models, estimate an optimal learning rate, load and preprocess text and image data from various sources, and much more. More about our approach can be found in [this](https://towardsdatascience.com/bert-text-classification-in-3-lines-of-code-using-keras-264db7e7a358) article.

In [None]:
# To install only the requirements of this notebook, uncomment the lines below and run this cell

# ===========================

# !pip install numpy==1.19.5
# !pip install pandas==1.1.5
# !pip install ktrain==0.26.3

# ===========================

In [None]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try:
#     import google.colab
#     !curl  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/ch4-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError:
#     !pip install -r "ch4-requirements.txt"

# ===========================

In [None]:
# use tensorflow 2.4.0 for this notebook
# !pip install tensorflow==2.4.0

In [1]:
#Importing
import ktrain
from ktrain import text

In [2]:
## Obtain the dataset
import os
try:
    from google.colab import files  # succeeds only on Google Colab
    import tensorflow as tf
    dataset = tf.keras.utils.get_file(
        fname="aclImdb.tar.gz",
        origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
        extract=True,
    )
    IMDB_DATADIR = os.path.join(os.path.dirname(dataset), "aclImdb")
except ModuleNotFoundError:
    # Running locally: look for an existing copy under ./Data/aclImdb
    pwd = os.getcwd()
    file_path = os.path.join(pwd, 'Data', 'aclImdb')
    if not os.path.exists(file_path):
        import tensorflow as tf
        dataset = tf.keras.utils.get_file(
            fname="aclImdb.tar.gz",
            origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
            extract=True,
        )

        # set path to the extracted dataset (next to the downloaded archive)
        IMDB_DATADIR = os.path.join(os.path.dirname(dataset), "aclImdb")
    else:

        # set path to the local dataset
        IMDB_DATADIR = file_path

### STEP 1: Preprocessing
The texts_from_folder function will load the training and validation data from the specified folder and automatically preprocess it according to BERT's requirements. In doing so, the BERT model and vocabulary will be automatically downloaded.
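
Before calling the loader, it can help to confirm that the dataset directory has the layout the call relies on: one subfolder per split named in `train_test_names`, each containing one subfolder per class named in `classes`. The sketch below is a hypothetical helper (`missing_class_dirs` is not part of ktrain) that uses only the standard library and builds a small throwaway directory to demonstrate the check.

```python
import os
import tempfile

def missing_class_dirs(datadir, splits=("train", "test"), classes=("pos", "neg")):
    """Return the split/class subfolders that are absent under datadir."""
    missing = []
    for split in splits:
        for cls in classes:
            path = os.path.join(datadir, split, cls)
            if not os.path.isdir(path):
                missing.append(os.path.join(split, cls))
    return missing

# Demonstration on a throwaway directory that mimics the aclImdb layout
with tempfile.TemporaryDirectory() as demo_dir:
    for split in ("train", "test"):
        for cls in ("pos", "neg"):
            os.makedirs(os.path.join(demo_dir, split, cls))
    complete = missing_class_dirs(demo_dir)      # nothing missing yet
    os.rmdir(os.path.join(demo_dir, "test", "neg"))
    incomplete = missing_class_dirs(demo_dir)    # one folder removed

print(complete)
print(incomplete)
```

In this notebook, `missing_class_dirs(IMDB_DATADIR)` would be expected to return an empty list before `texts_from_folder` is called.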

In [3]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(
    IMDB_DATADIR,
    maxlen=500,
    preprocess_mode='bert',
    train_test_names=['train', 'test'],
    classes=['pos', 'neg'],
)

detected encoding: utf-8
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


### STEP 2: Loading a pretrained BERT model and wrapping it in a ktrain Learner object

In [4]:
model = text.text_classifier('bert', (x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test), batch_size=6)

Is Multi-Label? False
maxlen is 500
done.


### STEP 3: Training and Tuning the model's parameters

In [5]:
learner.fit_onecycle(2e-5, 4)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/4
  52/4167 [..............................] - ETA: 35:17:25 - loss: 0.6890 - accuracy: 0.5385
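
The progress bar shows 4167 steps per epoch, which follows from the 25,000 labeled training reviews in the IMDB dataset and the `batch_size=6` passed to `ktrain.get_learner`. The sketch below reproduces that arithmetic and illustrates the general shape of a one-cycle schedule: the learning rate ramps up linearly to the maximum over the first half of training and back down over the second half. The function `triangular_lr` and the starting rate of one tenth of the maximum are assumptions made for illustration; ktrain's internal schedule may differ in its details.

```python
import math

NUM_TRAIN = 25_000   # labeled training reviews in the IMDB dataset
BATCH_SIZE = 6       # as passed to ktrain.get_learner
EPOCHS = 4           # as passed to learner.fit_onecycle
MAX_LR = 2e-5        # as passed to learner.fit_onecycle

# The last batch of each epoch is partial, so round up
steps_per_epoch = math.ceil(NUM_TRAIN / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

def triangular_lr(step, total_steps, max_lr, base_lr=None):
    """Hypothetical one-cycle schedule: linear ramp up, then linear ramp down."""
    if base_lr is None:
        base_lr = max_lr / 10  # assumed starting rate, for illustration only
    half = total_steps / 2
    # 0 at the start and end of training, 1 at the midpoint
    frac = 1 - abs(step - half) / half
    return base_lr + (max_lr - base_lr) * frac

print(steps_per_epoch, total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, triangular_lr(step, total_steps, MAX_LR))
```

With more than 16,000 optimizer steps in total, the multi-hour ETA in the output above suggests the run was not using a GPU; a larger batch size or a smaller `maxlen` would reduce the step count or the cost per step.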