## **BERT**

BERT (Bidirectional Encoder Representations from Transformers) and its descendants have set state-of-the-art results on a wide range of NLP tasks.

Released by Google in 2018, BERT builds powerful context-aware representations of words that can be exploited to perform custom classification tasks.

For further details: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

**Fine-tuning BERT for text classification with ktrain**

This notebook fine-tunes BERT for a custom text classification task. In short, a classification layer is added on top of BERT and the whole model is then trained further, starting from the pre-trained BERT weights.
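To make the "classification layer on top of BERT" idea concrete, here is a minimal sketch in plain Python. It is not ktrain's or BERT's actual code: the random vector stands in for the sentence representation BERT would produce, `classification_head` is a hypothetical helper, and the sizes are illustrative (768 matches the `H-768` in the model name, 10 is the number of topics implied by 160 training samples at 16 per topic).

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_head(pooled, weights, bias):
    """Linear layer followed by softmax: one probability per class."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

random.seed(0)
HIDDEN_SIZE = 768   # illustrative, matches H-768 in the model name
NUM_CLASSES = 10    # assumption: 160 training samples / 16 per topic

# Stand-in for the vector BERT would produce for one title
pooled = [random.gauss(0, 1) for _ in range(HIDDEN_SIZE)]
# The new layer starts from random weights, while BERT starts from pre-trained ones
weights = [[random.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = classification_head(pooled, weights, bias)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
print(predicted_class, round(sum(probs), 6))
```

During fine-tuning, both this small layer and the BERT weights underneath it are updated, which is why a small learning rate is normally used.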

With ktrain, we can do all this with just a few lines of code. 

In [0]:
%tensorflow_version 2.x
!pip3 install ktrain

In [0]:
# Uncomment to rerun the notebook from scratch: this deletes the data folder
# created in STEP 1 (os.makedirs fails if the folder already exists).
# import shutil
# shutil.rmtree('/content/data')

**STEP 1**: load and preprocess the data

This is actually the hardest part of the job when you are using ktrain :)

Train and test data must be stored in folders with a specific structure before ktrain can load them: a `train` and a `test` folder, each containing one subfolder per class, with one text file per example.

In [0]:
import os
import pandas as pd

df = pd.read_csv("articles_topics.csv")[['title', 'topic']]

dfTrain = []
dfTest  = []
for topic in df.topic.unique():
    # boolean mask instead of df.query, so topics containing quotes do not break the filter
    data = df[df.topic == topic]
    # keep just 16 train + 4 test examples for each topic (to balance the dataset)
    dfTrain += data[:16].to_dict(orient = 'records')
    dfTest  += data[16:20].to_dict(orient = 'records')
dfTrain = pd.DataFrame(dfTrain)
dfTest  = pd.DataFrame(dfTest)

main_dir = os.getcwd()

labels = sorted(dfTrain.topic.unique())

def data_to_folders(df, labels, to_folder):
    """Write each title to <to_folder>/<label>/<row index>.txt."""
    for label in labels:
        current_dir = os.path.join(to_folder, label)
        os.mkdir(current_dir)
        for ix, row in df[df.topic == label].iterrows():
            # explicit encoding so accented characters are written consistently
            with open(os.path.join(current_dir, '{}.txt'.format(ix)), 'w', encoding = 'utf-8') as f:
                f.write(row['title'])

for dataset, dataset_id in [(dfTrain, 'train'), (dfTest, 'test')]:
    to_folder = os.path.join(main_dir, 'data', dataset_id)
    os.makedirs(to_folder)
    data_to_folders(dataset, labels, to_folder)
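Before moving on, it can help to check that the folders really have the expected shape. The sketch below is a small sanity check using only the standard library. `summarize_layout` is a hypothetical helper (not part of ktrain), and the demo runs on a throwaway tree with two made-up topics.

```python
import os
import tempfile

def summarize_layout(data_dir, splits=('train', 'test')):
    """Return {split: {label: number_of_txt_files}} for a folder tree."""
    summary = {}
    for split in splits:
        split_dir = os.path.join(data_dir, split)
        counts = {}
        for label in sorted(os.listdir(split_dir)):
            label_dir = os.path.join(split_dir, label)
            if os.path.isdir(label_dir):
                counts[label] = sum(name.endswith('.txt')
                                    for name in os.listdir(label_dir))
        summary[split] = counts
    return summary

# Demo on a temporary tree with two hypothetical topics
with tempfile.TemporaryDirectory() as tmp:
    for split, n_files in [('train', 3), ('test', 1)]:
        for label in ['sport', 'politics']:
            label_dir = os.path.join(tmp, split, label)
            os.makedirs(label_dir)
            for i in range(n_files):
                with open(os.path.join(label_dir, '{}.txt'.format(i)), 'w') as f:
                    f.write('example title {}'.format(i))
    demo_summary = summarize_layout(tmp)

print(demo_summary)
```

In this notebook, `summarize_layout(os.path.join(main_dir, 'data'))` should report 16 files per topic under `train` and 4 under `test`, provided every topic has at least 20 articles.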

**STEP 2**: do the fine-tuning

In [0]:
import ktrain
from ktrain import text

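# Load the texts from data/train and data/test and preprocess them for BERT
# (maxlen is the maximum sequence length kept for each title)
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(
    os.path.join(main_dir, 'data'),
    maxlen = 30,
    preprocess_mode = 'bert',
    classes = labels)

# Build the BERT classifier and wrap it in a learner for training
model = text.text_classifier('bert', (x_train, y_train), preproc = preproc)
learner = ktrain.get_learner(model, train_data = (x_train, y_train), val_data = (x_test, y_test), batch_size = 6)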

using Keras version: 2.2.4-tf
detected encoding: utf-8
downloading pretrained BERT model (multi_cased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: it


preprocessing test...
language: it


Is Multi-Label? False
maxlen is 30
done.


In [0]:
learner.fit_onecycle(lr = 2e-5, epochs = 10, checkpoint_folder = os.path.join(main_dir, 'checkpoints'))



begin training using onecycle policy with max lr of 2e-05...
Train on 160 samples, validate on 40 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f5ef407fcf8>

In [0]:
# Restore the checkpoint saved after epoch 8
learner.model.load_weights(os.path.join(main_dir, 'checkpoints', 'weights-08.hdf5'))
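The training log above mentions a "onecycle policy with max lr of 2e-05". As a rough illustration, here is a generic triangular one-cycle schedule: the learning rate ramps up linearly to the maximum during the first half of training and back down during the second half. This is a sketch under assumptions, not ktrain's implementation. `triangular_lr` is a hypothetical helper, and the starting rate of `max_lr / 10` is an assumed value.

```python
def triangular_lr(step, total_steps, max_lr, base_lr=None):
    """Learning rate at `step` for a symmetric triangular one-cycle schedule."""
    if base_lr is None:
        base_lr = max_lr / 10.0                    # assumed starting point, not taken from ktrain
    half = total_steps / 2.0
    if step <= half:
        fraction = step / half                     # ramp up during the first half
    else:
        fraction = (total_steps - step) / half     # ramp down during the second half
    return base_lr + fraction * (max_lr - base_lr)

# 160 training samples with batch size 6 -> 27 steps per epoch, for 10 epochs
steps_per_epoch = -(-160 // 6)                     # ceiling division
total_steps = steps_per_epoch * 10
schedule = [triangular_lr(s, total_steps, max_lr=2e-5)
            for s in range(total_steps + 1)]

print(steps_per_epoch, total_steps)
print(schedule[0], max(schedule), schedule[-1])
```

The actual schedule used by `fit_onecycle` may differ in its details, so treat this only as an intuition for why the learning rate passed to it is called a maximum.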

The model is now fine-tuned on our classification task. Below we visualize our predictions on the test set.

In [0]:
# Map each integer class index back to its topic name
# (this assumes the rows of dfTest are in the same order as y_test)
import numpy as np
dfTest.loc[:, 'encoding'] = np.argmax(y_test, axis = 1)
labels_dict = dfTest[['topic', 'encoding']].drop_duplicates()
labels_dict = {row['encoding'] : row['topic'] for _, row in labels_dict.iterrows()}

In [0]:
# Predict class probabilities, then turn the most likely class index into a topic name
predictions = learner.predict((x_test, y_test))
predictions = [labels_dict[i] for i in np.argmax(predictions, axis = 1)]

dfTest.loc[:, 'prediction'] = predictions

In [0]:
# Pad every label to the width of the longest topic name so the columns line up
col_width = max(map(len, labels_dict.values()))

def wrap(s, maxlen = col_width):
    return s.ljust(maxlen)

print("{}\t{}\t{}".format(wrap('GROUND TRUTH'), wrap('PREDICTION'), 'TITLE'))
for _, row in dfTest.iterrows():
    print("{}\t{}\t{}".format(wrap(row['topic']), wrap(row['prediction']), row['title']))
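Reading the table row by row works for 40 examples, but a summary number is handy too. The sketch below computes the overall accuracy and the per-topic hit counts from two lists of labels. `accuracy_report` is a hypothetical helper, and the labels in the demo are made up purely to show the output shape.

```python
from collections import Counter

def accuracy_report(truths, predictions):
    """Return overall accuracy and per-label (correct, total) counts."""
    totals = Counter(truths)
    correct = Counter(t for t, p in zip(truths, predictions) if t == p)
    overall = sum(correct.values()) / len(truths)
    per_label = {label: (correct[label], totals[label])
                 for label in sorted(totals)}
    return overall, per_label

# Hypothetical labels, only to illustrate the helper
truths      = ['sport', 'sport', 'politics', 'politics']
predictions = ['sport', 'politics', 'politics', 'politics']

overall, per_label = accuracy_report(truths, predictions)
print('accuracy: {:.2f}'.format(overall))
for label, (hits, total) in per_label.items():
    print('{}\t{}/{}'.format(label, hits, total))
```

In this notebook it could be called as `accuracy_report(dfTest['topic'].tolist(), dfTest['prediction'].tolist())`.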