# Emotion Classification in short texts with BERT

This notebook applies BERT to multiclass text classification. The dataset consists of written dialogs, messages, and short stories. Each dialog utterance or message is labeled with one of five emotion categories: joy, anger, sadness, fear, or neutral.

## Workflow: 
1. Import Data
2. Data preprocessing and downloading BERT
3. Training and validation
4. Saving the model

The classifier is built with BERT and [ktrain](https://github.com/amaiya/ktrain). Google Colab provides a free GPU for training.

👋  **Let's start** 

In [None]:
# install ktrain on Google Colab
!pip3 install ktrain

Collecting ktrain
  Downloading https://files.pythonhosted.org/packages/79/11/168a692027a2be036676d8ba24f0b7b9dfb18a9deb3daf10e6a0cfd2917d/ktrain-0.22.1.tar.gz (25.2MB)
Collecting keras_bert>=0.86.0
  Downloading https://files.pythonhosted.org/packages/e2/7f/95fabd29f4502924fa3f09ff6538c5a7d290dfef2c2fe076d3d1a16e08f0/keras-bert-0.86.0.tar.gz
Collecting langdetect
  Downloading https://files.pythonhosted.org/packages/56/a3/8407c1e62d5980188b4acc45ef3d94b933d14a2ebc9ef3505f22cf772570/langdetect-1.0.8.tar.gz (981kB)
Collecting cchardet
  Downloading https://files.pythonhosted.org/packages/1e/c5/7e1a0d7b4afd83d6f8de794fce82820ec4c5136c6d52e14000822681a842/cchardet-2.1.6-cp36-cp36m-manylinux2010_x86_64.whl (241kB)
Collecting seqeval
  Downloading https://files.pythonhosted.org/packages/93/e5/b77051

In [None]:
import pandas as pd
import numpy as np

import ktrain
from ktrain import text

## 1. Import Data

In [None]:
def shuffle(df):
    # Shuffle whole rows so each text stays paired with its label.
    # (Applying np.random.shuffle per column would reorder columns independently.)
    return df.sample(frac=1)

data = pd.read_csv('/content/data/dataset.csv', encoding='utf-8', sep=';')

# sort the dataframe by emotion label
data.sort_values(by='Emotion', axis=0, inplace=True)

# index the rows by emotion while keeping the Emotion column itself
data.set_index(keys=['Emotion'], drop=False,inplace=True)

# list the distinct emotion labels
emotions=data['Emotion'].unique().tolist()

# now we can perform a lookup on a 'view' of the dataframe
joys = shuffle(data.loc[data.Emotion=='joy'])
fears = shuffle(data.loc[data.Emotion=='fear'])
angers = shuffle(data.loc[data.Emotion=='anger'])
sadnesss = shuffle(data.loc[data.Emotion=='sadness'])
neutrals = shuffle(data.loc[data.Emotion=='neutral'])

joys_train = joys.iloc[0:int(joys.shape[0]*0.8)]
joys_test = joys.iloc[int(joys.shape[0]*0.8)+1:joys.shape[0]]

fears_train = fears.iloc[0:int(fears.shape[0]*0.8)]
fears_test = fears.iloc[int(fears.shape[0]*0.8)+1:fears.shape[0]]

angers_train = angers.iloc[0:int(angers.shape[0]*0.8)]
angers_test = angers.iloc[int(angers.shape[0]*0.8)+1:angers.shape[0]]

sadnesss_train = sadnesss.iloc[0:int(sadnesss.shape[0]*0.8)]
sadnesss_test = sadnesss.iloc[int(sadnesss.shape[0]*0.8)+1:sadnesss.shape[0]]

neutrals_train = neutrals.iloc[0:int(neutrals.shape[0]*0.8)]
neutrals_test = neutrals.iloc[int(neutrals.shape[0]*0.8)+1:neutrals.shape[0]]

data_train = pd.concat([joys_train, fears_train, angers_train, sadnesss_train, neutrals_train])
data_test = pd.concat([joys_test, fears_test, angers_test, sadnesss_test, neutrals_test])

print(data_train.shape)
print(data_test.shape)

X_train = data_train.Text.tolist()
X_test = data_test.Text.tolist()

y_train = data_train.Emotion.tolist()
y_test = data_test.Emotion.tolist()

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([data_train, data_test], ignore_index=True)

class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']

print('size of training set: %s' % (len(data_train['Text'])))
print('size of validation set: %s' % (len(data_test['Text'])))
print(data.Emotion.value_counts())

data.head(10)

(113418, 2)
(28353, 2)
size of training set: 113418
size of validation set: 28353
neutral    94166
joy        25907
sadness    13161
anger       4910
fear        3627
Name: Emotion, dtype: int64


|  | Emotion | Text |
|---|---|---|
| 0 | joy | i opened the first window whilst listening to ... |
| 1 | joy | @chachada1 Yeah im following you, Hun! Goodnight |
| 2 | joy | Yeah , I know . A friend in need is a friend ... |
| 3 | joy | We have tons of updates including pics of Rob ... |
| 4 | joy | OK . Thanks . |
| 5 | joy | Of course ! |
| 6 | joy | Um , that ’ s good . |
| 7 | joy | i feel like getting away from all the friendly... |
| 8 | joy | i always have been when im not feeling sociabl... |
| 9 | joy | happy mothers day to all the yummy mummies on ... |
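One detail of the per-class split is worth noting: each test slice starts at `int(n*0.8)+1`, so the row at position `int(n*0.8)` is left out of both sets, one row per class. The following is a minimal sketch of the same 80/20 per-class idea in plain Python, with the test part starting exactly at the cut. The helper name `stratified_split`, the seed, and the toy data are illustrative assumptions, not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, train_frac=0.8, seed=42):
    """Split (text, label) pairs into train and test sets, class by class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)                      # shuffle examples within the class
        cut = int(len(items) * train_frac)
        train += [(t, label) for t in items[:cut]]
        test += [(t, label) for t in items[cut:]]  # starts at cut, so no row is skipped
    return train, test

# Toy data standing in for the Text and Emotion columns
texts = ["text %d" % i for i in range(20)]
labels = ["joy"] * 10 + ["neutral"] * 10

train, test = stratified_split(texts, labels)
print(len(train), len(test))  # 16 4
```

Every example ends up in exactly one of the two sets, and each class keeps the same 80/20 proportion.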


In [None]:
encoding = {
    'joy': 0,
    'sadness': 1,
    'fear': 2,
    'anger': 3,
    'neutral': 4
}

# Integer values for each class
y_train = [encoding[x] for x in y_train]
y_test = [encoding[x] for x in y_test]
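The integer codes have to line up with the positions in `class_names`, because predictions are later decoded by indexing into that list. A small sketch of one way to keep the two in sync is to derive the mapping from the list itself; the `decoding` dictionary is an illustrative addition.

```python
class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']

# Build the label -> integer mapping from the list order
encoding = {name: idx for idx, name in enumerate(class_names)}
# Reverse mapping, useful for turning predicted indices back into labels
decoding = {idx: name for name, idx in encoding.items()}

y = [encoding[label] for label in ['fear', 'joy', 'neutral']]
print(y)                         # [2, 0, 4]
print([decoding[i] for i in y])  # ['fear', 'joy', 'neutral']
```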

## 2. Data preprocessing

* The text must be preprocessed in a specific way for use with BERT. This is done by setting `preprocess_mode` to `'bert'`. The BERT model and vocabulary are downloaded automatically.

* BERT handles sequences of at most 512 tokens; a shorter `maxlen` (350 here) reduces memory use and speeds up training.
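One rough way to choose a sequence length is to look at how long the texts actually are. The sketch below counts whitespace-separated words and returns the length that covers a given share of the texts. It is only an approximation: BERT's WordPiece tokenizer usually produces more tokens than there are words, and it adds the `[CLS]` and `[SEP]` markers, so some headroom is needed. The helper name and the sample texts are assumptions for illustration.

```python
import math

def length_at_percentile(texts, pct=0.99):
    """Return the word count that covers `pct` of the texts (nearest-rank method)."""
    lengths = sorted(len(t.split()) for t in texts)
    rank = max(1, math.ceil(pct * len(lengths)))
    return lengths[rank - 1]

# A few short utterances in the style of the dataset
sample_texts = ["OK . Thanks .", "Of course !", "Um , that ’ s good ."]

print(length_at_percentile(sample_texts, pct=1.0))  # 7
print(length_at_percentile(sample_texts, pct=0.5))  # 4
```

On the real data this would be called with `X_train`, and the result compared against the chosen `maxlen`.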

In [None]:
(x_train,  y_train), (x_test, y_test), preproc = text.texts_from_array(x_train=X_train, y_train=y_train,
                                                                       x_test=X_test, y_test=y_test,
                                                                       class_names=class_names,
                                                                       preprocess_mode='bert',
                                                                       maxlen=350, 
                                                                       max_features=135000)

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


task: text classification


## 3. Training and validation


Load the pretrained BERT model for text classification.

In [None]:
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)

Is Multi-Label? False
maxlen is 350
done.


Wrap it in a Learner object

In [None]:
learner = ktrain.get_learner(model, train_data=(x_train, y_train), 
                             val_data=(x_test, y_test),
                             batch_size=6)

In [None]:
import tensorflow as tf
from datetime import datetime
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau, CSVLogger, TensorBoard
import os

basedir = "/content/logs/"
logdir = os.path.join("/content/logs", datetime.now().strftime("%Y%m%d-%H%M%S"))
tf.debugging.experimental.enable_dump_debug_info(logdir)

callbacks = [
    # save a checkpoint whenever training accuracy improves ('val_accuracy' would track validation instead)
    ModelCheckpoint(filepath=basedir+'checkpoint1-{epoch:02d}.hdf5', verbose=2,
                    save_best_only=True, monitor='accuracy', mode='max'),
    CSVLogger(basedir+'model_1trainanalysis1.csv', separator=',', append=False),
    # stop as soon as validation loss fails to improve for one epoch
    EarlyStopping(monitor='val_loss', min_delta=1e-6, patience=1, verbose=2, mode='auto'),
    TensorBoard(log_dir=logdir, histogram_freq=1),
]

Train the model. More about tuning learning rates [here](https://github.com/amaiya/ktrain/blob/master/tutorial-02-tuning-learning-rates.ipynb)

In [None]:
learner.fit_onecycle(2e-5, 2, callbacks=callbacks)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/2
Epoch 00001: accuracy improved from -inf to 0.82920, saving model to /content/logs/checkpoint1-01.hdf5
Epoch 2/2
Epoch 00002: accuracy improved from 0.82920 to 0.87275, saving model to /content/logs/checkpoint1-02.hdf5


<tensorflow.python.keras.callbacks.History at 0x7fc34187d390>
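To give an intuition for what a one-cycle policy does with the maximum learning rate of `2e-5`, here is a simplified triangular schedule: the rate ramps up linearly to the maximum at the midpoint of training and then ramps back down. This is an illustrative sketch only, not ktrain's actual implementation; the starting rate of `max_lr / 10` and the function name are assumptions.

```python
def onecycle_lr(step, total_steps, max_lr=2e-5, div_factor=10.0):
    """Triangular schedule: rise to max_lr at the midpoint, then fall back."""
    base_lr = max_lr / div_factor            # assumed starting rate
    half = total_steps / 2
    if step <= half:
        frac = step / half                   # goes 0 -> 1 during the ramp-up
    else:
        frac = (total_steps - step) / half   # goes 1 -> 0 during the ramp-down
    return base_lr + frac * (max_lr - base_lr)

total = 10
schedule = [onecycle_lr(s, total) for s in range(total + 1)]
for step, lr in enumerate(schedule):
    print(step, '%.2e' % lr)
```

The peak falls in the middle of the run, so the largest updates happen after the model has had some warm-up and before the final fine adjustments.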

Validation

In [15]:
learner.validate(val_data=(x_test, y_test), class_names=class_names)

              precision    recall  f1-score   support

         joy       0.76      0.71      0.73      5181
     sadness       0.81      0.75      0.78      2632
        fear       0.92      0.85      0.88       725
       anger       0.84      0.77      0.80       982
     neutral       0.89      0.92      0.91     18833

    accuracy                           0.86     28353
   macro avg       0.84      0.80      0.82     28353
weighted avg       0.86      0.86      0.86     28353



array([[ 3664,    90,     6,    10,  1411],
       [  106,  1971,    13,    45,   497],
       [    9,    28,   615,    39,    34],
       [   22,    46,    16,   760,   138],
       [ 1043,   311,    15,    55, 17409]])
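The per-class figures in the report can be recomputed from the confusion matrix. The sketch below assumes rows are true labels and columns are predicted labels (the scikit-learn convention); the row sums match the support column of the report, which is consistent with that reading.

```python
class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']

# Confusion matrix copied from the validation output
cm = [
    [3664,   90,    6,   10,  1411],
    [ 106, 1971,   13,   45,   497],
    [   9,   28,  615,   39,    34],
    [  22,   46,   16,  760,   138],
    [1043,  311,   15,   55, 17409],
]

n = len(cm)
row_sums = [sum(row) for row in cm]                               # true count per class
col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]    # predicted count per class

recall = [cm[i][i] / row_sums[i] for i in range(n)]
precision = [cm[i][i] / col_sums[i] for i in range(n)]
accuracy = sum(cm[i][i] for i in range(n)) / sum(row_sums)

for name, p, r in zip(class_names, precision, recall):
    print('%-8s precision=%.2f recall=%.2f' % (name, p, r))
print('accuracy=%.4f' % accuracy)
```

The off-diagonal entries also show where the errors concentrate: 1411 joy texts were predicted as neutral and 1043 neutral texts as joy, far more than any other confusion.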

#### Testing with other inputs

In [17]:
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.get_classes()

['joy', 'sadness', 'fear', 'anger', 'neutral']

In [18]:
from sklearn.metrics import precision_recall_fscore_support

predictions = model.predict(x_test)
predictions = np.argmax(predictions, axis=1)
predictions = [class_names[pred] for pred in predictions]

print(precision_recall_fscore_support(data_test.Emotion, predictions, average='weighted'))

(0.8589680861077176, 0.8612492505202272, 0.8596000596826674, None)


In [19]:
!nvidia-smi

Fri Oct  9 18:57:24 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0    35W /  70W |   8515MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [25]:
import time 

message = 'delivery was hour late and my pizza was cold!'

start_time = time.time() 
prediction = predictor.predict(message)

# the number in parentheses is the elapsed time in seconds, not a confidence score
print('predicted: {} ({:.2f})'.format(prediction, (time.time() - start_time)))

predicted: sadness (0.20)
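A single timed call can be noisy, since the first prediction often includes one-off setup costs. The following sketch times a callable several times and keeps the best run. The helper `time_call` and the stand-in `fake_predict` are hypothetical; with a trained model, `predictor.predict` would be passed instead.

```python
import time

def time_call(fn, *args, repeats=5):
    """Call fn(*args) several times; return the last result and the best elapsed seconds."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

# Stand-in for predictor.predict so the sketch runs without a trained model
def fake_predict(message):
    return 'sadness'

label, seconds = time_call(fake_predict, 'delivery was hour late and my pizza was cold!')
print('predicted: {} ({:.4f} s)'.format(label, seconds))
```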


## 4. Saving the BERT model


In [21]:
# let's save the predictor for later use
predictor.save("models2/bert_model")

In [22]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [24]:
%cp -av /content/logs "/content/drive/My Drive/dd3"

'/content/logs' -> '/content/drive/My Drive/dd3/logs'
'/content/logs/20201009-092551' -> '/content/drive/My Drive/dd3/logs/20201009-092551'
'/content/logs/20201009-092551/tfdbg_events.1602235551.b5fb4ee70369.source_files' -> '/content/drive/My Drive/dd3/logs/20201009-092551/tfdbg_events.1602235551.b5fb4ee70369.source_files'
'/content/logs/20201009-092551/tfdbg_events.1602235551.b5fb4ee70369.stack_frames' -> '/content/drive/My Drive/dd3/logs/20201009-092551/tfdbg_events.1602235551.b5fb4ee70369.stack_frames'
'/content/logs/20201009-092551/tfdbg_events.1602235551.b5fb4ee70369.graphs' -> '/content/drive/My Drive/dd3/logs/20201009-092551/tfdbg_events.1602235551.b5fb4ee70369.graphs'
'/content/logs/20201009-092551/tfdbg_events.1602235551.b5fb4ee70369.metadata' -> '/content/drive/My Drive/dd3/logs/20201009-092551/tfdbg_events.1602235551.b5fb4ee70369.metadata'
'/content/logs/20201009-092551/tfdbg_events.1602235551.b5fb4ee70369.execution' -> '/content/drive/My Drive/dd3/logs/20201009-092551/tfdb

Done! To reload the predictor, use `ktrain.load_predictor`.
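A minimal sketch of reloading, assuming the predictor was saved to `models2/bert_model` as in the cell above. The wrapper function name is an illustrative choice; `ktrain` is imported inside it so the helper can be defined before the library is loaded.

```python
def load_emotion_predictor(path='models2/bert_model'):
    """Reload a predictor previously written with predictor.save(path)."""
    import ktrain  # imported lazily; requires ktrain to be installed
    return ktrain.load_predictor(path)

# Usage (needs ktrain installed and the saved folder present):
# predictor = load_emotion_predictor()
# predictor.predict('delivery was hour late and my pizza was cold!')
```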