<a href="https://colab.research.google.com/github/colabnlp/bert-sklearn/blob/master/demo_bert_sklearn_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# setup for colab

### enable GPU

You can enable the GPU in one of two ways: 

 * Select **`GPU`**  in the Edit->Notebook Settings for Hardware Accelerator   **OR**
 

 * Select **`GPU`**  in the Runtime->Change runtime type  for Hardware Accelerator 
 
 
You should now be able to see the `nvidia-smi` diagnostic:

In [1]:
!nvidia-smi

Sat Mar  7 03:06:42 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

###  install bert-sklearn

In [2]:
!git clone -b master https://github.com/charles9n/bert-sklearn
!cd bert-sklearn; pip install .
import os
os.chdir("bert-sklearn")
print(os.listdir())

Cloning into 'bert-sklearn'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 259 (delta 3), reused 3 (delta 0), pack-reused 247
Receiving objects: 100% (259/259), 519.36 KiB | 581.00 KiB/s, done.
Resolving deltas: 100% (125/125), done.
Processing /content/bert-sklearn
Building wheels for collected packages: bert-sklearn
  Building wheel for bert-sklearn (setup.py) ... done
  Created wheel for bert-sklearn: filename=bert_sklearn-0.3.1-cp36-none-any.whl size=54234 sha256=c406d2731629907137fc97c4be3bd9b410caf39774b3dd43424e2b7f90531718
  Stored in directory: /root/.cache/pip/wheels/61/95/c6/5790aae8fb377f5ff356dbe58205aab28858595d6bff8197d0
Successfully built bert-sklearn
Installing collected packages: bert-sklearn
Successfully installed bert-sklearn-0.3.1
['tests', 'LICENSE', 'other_examples', 'demo_tuning_hyperparams.ipynb', 'bert_sklearn', 'setup.py', 'Options.m

# demo 

* **text classification** - the goal is to classify a single sentence or short text.


### A note on GPU cards and memory

While it's possible, running the examples without a GPU would be very slow. In addition, the BERT models (especially the large model) are big, so it helps to have more GPU memory. All the examples in this demo were run on a laptop with an Nvidia GTX-1070 card that has 8 GB of memory.

The three parameters that reduce the GPU memory requirements the most are:

* **`bert_model`** - BERT models come in two sizes: `base` and `large`. As you would expect, the large model demands more GPU memory and takes longer to train. If you have a small GPU, start with any of the `base` models. The default is `'bert-base-uncased'`.

> `base(110M parameter models)` : `'bert-base-uncased'`, `'bert-base-cased'`, `'bert-base-multilingual-uncased'`, `'bert-base-multilingual-cased'`, `'bert-base-chinese'`, and all the **`BioBERT`** and **`SciBERT`** models.

> `large(340M parameter models)`: `'bert-large-uncased'` and `'bert-large-cased'`


* **`max_seq_length`** - the default is 128, with a maximum value of 512. Setting it to a smaller value like 96 or even 64 saves a lot of GPU memory and still gets good results on a lot of tasks.


* **`train_batch_size`** - the default is 32. Cutting it in half will save memory and should also still give good results.
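
As a rough illustration of why model size matters, the sketch below estimates the memory needed just to hold the weights, using the rounded parameter counts quoted above. It is a back-of-the-envelope calculation only: gradients, optimizer state, and activations come on top of this, and the activations are the part that grows with `max_seq_length` and `train_batch_size`. The helper name `weight_megabytes` is made up for this example.

```python
# Rounded parameter counts quoted in this section.
PARAM_COUNTS = {"base": 110_000_000, "large": 340_000_000}
# Bytes per parameter for 32-bit and 16-bit floats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def weight_megabytes(n_params, precision="fp32"):
    """Memory needed just to hold the weights, in MB (10**6 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e6

for size, n_params in PARAM_COUNTS.items():
    print(f"{size}: {weight_megabytes(n_params):.0f} MB of fp32 weights")
```

The `base` estimate of about 440 MB is in line with the size of the `bert-base-uncased` download shown later in this notebook.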


In addition to these three parameters, [huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT#Training-large-models-introduction,-tools-and-examples) has several options to reduce the GPU memory requirements, which are passed through in `bert-sklearn`:

* **`gradient_accumulation_steps`** - the number of steps over which gradients are accumulated before the optimizer performs an update. The default is 1. Setting it to a higher integer (e.g. 2, 4, up to the **`train_batch_size`**) trades compute time for lower GPU memory use. I use this a lot when I train BERT models on my laptop GPU.


* **`fp16`** - whether to use 16-bit float precision instead of 32-bit. The default is `False`. To enable half precision, you must install [Nvidia apex](https://github.com/NVIDIA/apex). Setting this option to `True` then cuts the model memory load in half. I use this when I train on my laptop GPU as well.
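
To see why gradient accumulation does not change the update itself, here is a minimal sketch of the general technique on a toy one-parameter model. It is not the `bert-sklearn` or huggingface implementation: the functions `mean_grad` and `accumulated_grad` are hypothetical, and the sketch assumes the batch divides evenly into micro-batches. Only one micro-batch has to be processed at a time, which is where the memory saving comes from.

```python
# Toy model: predict y = w * x with squared error, so the gradient is analytic.
def mean_grad(w, batch):
    """Gradient of mean((w*x - y)**2) over the batch with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, accumulation_steps):
    """Split the batch into equal micro-batches and sum their scaled gradients."""
    micro_size = len(batch) // accumulation_steps
    total = 0.0
    for i in range(accumulation_steps):
        micro = batch[i * micro_size:(i + 1) * micro_size]
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        total += mean_grad(w, micro) / accumulation_steps
    return total

batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]
full = mean_grad(0.5, batch)
accum = accumulated_grad(0.5, batch, accumulation_steps=4)
print(full, accum)  # both are -94.5
```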


Finally, two other system setups will help reduce the memory requirement:

* `multiple gpus` - on a single machine with multiple GPUs, following the huggingface port, the GPUs are detected and the load is split across the cards.


* `distributed training` - the huggingface port allows you to train across distributed GPUs. The parameter,  **`local_rank`**, is exposed in `bert-sklearn`. But this option has not been tested yet.


### setup
Now let's set up the imports and utility code we will need for the rest of the demo:


In [3]:
import torch
print('pytorch version:', torch.__version__)
print('GPU:',torch.cuda.get_device_name(0))

pytorch version: 1.4.0
GPU: Tesla P4


In [0]:
import os
import math
import random
import csv
import sys

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
import statistics as stats

from bert_sklearn import BertClassifier
from bert_sklearn import BertRegressor
from bert_sklearn import BertTokenClassifier
from bert_sklearn import load_model

def read_tsv(filename, quotechar=None):
    with open(filename, "r", encoding='utf-8') as f:
        return list(csv.reader(f, delimiter="\t", quotechar=quotechar))   

def flatten(l):
    return [item for sublist in l for item in sublist]

def read_CoNLL2003_format(filename, idx=3):
    """Read file in CoNLL-2003 shared task format"""
    
    # read file (the context manager closes the file handle)
    with open(filename) as f:
        lines = f.read().strip()
    
    # find sentence-like boundaries
    lines = lines.split("\n\n")  
    
    # split on newlines
    lines = [line.split("\n") for line in lines]
    
    # get tokens
    tokens = [[l.split()[0] for l in line] for line in lines]
    
    # get labels/tags
    labels = [[l.split()[idx] for l in line] for line in lines]
    
    #convert to df
    data= {'tokens': tokens, 'labels': labels}
    df=pd.DataFrame(data=data)
    
    return df
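
To make the input that `read_CoNLL2003_format` expects more concrete, here is a small sketch that applies the same splitting steps to an in-memory string instead of a file. The sample sentences are made up for illustration, and `parse_conll` is a hypothetical stand-in that returns plain lists rather than a DataFrame.

```python
# Made-up sample in the whitespace-separated layout the helper assumes:
# token, POS tag, chunk tag, NER tag; a blank line ends each sentence.
sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER"""

def parse_conll(text, idx=3):
    """Return (tokens, labels) as lists of lists, one inner list per sentence."""
    sentences = [block.split("\n") for block in text.strip().split("\n\n")]
    tokens = [[line.split()[0] for line in sent] for sent in sentences]
    labels = [[line.split()[idx] for line in sent] for sent in sentences]
    return tokens, labels

tokens, labels = parse_conll(sample)
print(tokens[1])  # ['Peter', 'Blackburn']
print(labels[0])  # ['B-ORG', 'O', 'B-MISC', 'O']
```

With `idx=3` the labels come from the fourth column; passing `idx=1` or `idx=2` would pick the POS or chunk tags instead.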


<a id='text_classification'></a>
# text classification 

For single text classification, we have the input data `X`, and target data `y` where:

* `X` is a list, pandas Series, or numpy array of text data.


* `y` is a list, pandas Series, or numpy array of text labels.

For this example, we will use the **`Stanford Sentiment Treebank (SST-2)`** data set from the [GLUE benchmarks](https://gluebenchmark.com/). The **`SST-2`** task consists of sentences drawn from movie reviews and annotated with a sentiment label. 

See [website](https://nlp.stanford.edu/sentiment/code.html) and [paper](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) for more info.

The input features are short sentences and the labels are the standard sentiment polarity of:

*    0 for negative 


*    1 for positive.

## get data

First download the data using the GLUE downloader:

In [5]:
%%bash
python3 ./glue_examples/download_glue_data.py --data_dir ./glue_examples/glue_data --tasks SST

Downloading and extracting SST...
	Completed!


In [6]:
"""
SST-2 train data size: 67349 
SST-2 dev data size: 872 
"""
DATADIR = './glue_examples/glue_data'

def get_sst_data(train_file=DATADIR + '/SST-2/train.tsv',
                 dev_file=DATADIR + '/SST-2/dev.tsv'):

    train = pd.read_csv(train_file, sep='\t', encoding='utf8', keep_default_na=False)
    train.columns=['text', 'label']
    print("SST-2 train data size: %d "%(len(train)))
    
    dev = pd.read_csv(dev_file, sep='\t', encoding='utf8', keep_default_na=False)
    dev.columns=['text', 'label']
    print("SST-2 dev data size: %d "%(len(dev)))
    label_list = np.unique(train['label'])

    return train, dev, label_list

train, dev, label_list = get_sst_data()
train.head()

SST-2 train data size: 67349 
SST-2 dev data size: 872 


| | text | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |
| 3 | remains utterly satisfied to remain the same t... | 0 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 |


## setup data

We will subsample the data for the demo. See the [SST-2.ipynb notebook](https://github.com/charles9n/bert-sklearn/blob/master/glue_examples/SST-2.ipynb) for a finetune demo on the full data set.

In [0]:
# subsample data 
n = 1000
train = train.sample(n, random_state=42)

X_train = train['text']
y_train = train['label']

# use the dev set for testing
test = dev
X_test = test['text']
y_test = test['label']

## define model

We will set up a classifier with the default settings, but let's reduce **`max_seq_length`** and **`train_batch_size`** so it can run on a smaller GPU. This config uses ~5 GB of GPU memory on my laptop's 8 GB GTX-1070:

In [11]:
model = BertClassifier(max_seq_length=64, train_batch_size=16)
model

Building sklearn text classifier...


BertClassifier(bert_config_json=None, bert_model='bert-base-uncased',
               bert_vocab=None, do_lower_case=None, epochs=3, eval_batch_size=8,
               fp16=False, from_tf=False, gradient_accumulation_steps=1,
               ignore_label=None, label_list=None, learning_rate=2e-05,
               local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
               max_seq_length=64, num_mlp_hiddens=500, num_mlp_layers=0,
               random_state=42, restore_file=None, train_batch_size=16,
               use_cuda=True, validation_fraction=0.1, warmup_proportion=0.1)

## finetune model

Finetuning means fitting the model on the train data.

The `model.fit()` routine:

* Loads the pretrained BERT model defined in `model.bert_model`. The first run is slower because it downloads the BERT model from the internet. Subsequent calls are faster because the model is saved in a local file cache.


* Uses `model.validation_fraction` (default=0.1) of the data for validation and finetunes BERT on the remainder for `model.epochs` (default=3) epochs.
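
The arithmetic behind the split and the number of batches per epoch can be sketched as follows. This assumes a simple fractional hold-out and that a final partial batch is kept; how `bert-sklearn` implements the split internally is not shown in this notebook, so both helper functions are hypothetical.

```python
import math

def split_sizes(n_samples, validation_fraction=0.1):
    """Return (train, validation) sizes for a simple fractional hold-out."""
    n_val = round(n_samples * validation_fraction)
    return n_samples - n_val, n_val

def steps_per_epoch(n_samples, batch_size):
    """Number of batches per pass, assuming a final partial batch is kept."""
    return math.ceil(n_samples / batch_size)

n_train, n_val = split_sizes(1000)
print(n_train, n_val)                  # 900 100
print(steps_per_epoch(n_train, 16))    # 57 training batches with train_batch_size=16
print(steps_per_epoch(n_val, 8))       # 13 validation batches with eval_batch_size=8
```

With the 1000-sample subsample used here, this gives the 900/100 split reported in the training log.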

In [12]:
%%time
model = model.fit(X_train, y_train)

100%|██████████| 231508/231508 [00:00<00:00, 349310.32B/s]


Loading bert-base-uncased model...


100%|██████████| 440473133/440473133 [00:44<00:00, 9976398.58B/s] 
100%|██████████| 361/361 [00:00<00:00, 120141.53B/s]


Defaulting to linear classifier/regressor
Loading Pytorch checkpoint

train data size: 900, validation data size: 100



Epoch 1, Train loss: 0.5282, Val loss: 0.4596, Val accy: 82.00%
Epoch 2, Train loss: 0.1622, Val loss: 0.4801, Val accy: 81.00%
Epoch 3, Train loss: 0.0333, Val loss: 0.6586, Val accy: 81.00%

CPU times: user 55.1 s, sys: 23.6 s, total: 1min 18s
Wall time: 2min 12s


## score and make predictions on test data

In [0]:
from tqdm import tqdm
# score model
accy = model.score(X_test, y_test)

# make class probability predictions
y_prob = model.predict_proba(X_test)
print("class prob estimates:\n", y_prob)

# make predictions
y_pred = model.predict(X_test)
print("Accuracy: %0.2f%%"%(metrics.accuracy_score(y_test, y_pred) * 100))

target_names = ['negative', 'positive']
print(classification_report(y_test, y_pred, target_names=target_names))

Loss: 0.4551, Accuracy: 87.04%


class prob estimates:
 [[0.00164577 0.9983543 ]
 [0.9979557  0.00204425]
 [0.00229784 0.9977022 ]
 ...
 [0.8799137  0.12008635]
 [0.44239706 0.55760294]
 [0.00624655 0.9937535 ]]




Accuracy: 87.04%
              precision    recall  f1-score   support

    negative       0.83      0.92      0.87       428
    positive       0.91      0.82      0.87       444

    accuracy                           0.87       872
   macro avg       0.87      0.87      0.87       872
weighted avg       0.87      0.87      0.87       872



In [15]:
X_test.head()


0      it 's a charming and often affecting journey . 
1                   unflinchingly bleak and desperate 
2    allows us to hope that nolan is poised to emba...
3    the acting , costumes , music , cinematography...
4                    it 's slow -- very , very slow . 
Name: text, dtype: object

In [16]:
y_test.head()

0    1
1    0
2    1
3    1
4    0
Name: label, dtype: int64
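
The link between the class probabilities and the predicted labels can be sketched as below, using the first three probability rows printed earlier. The sketch assumes that the columns are ordered as negative (0) then positive (1) and that the predicted label is simply the column with the highest probability; `to_labels` is a hypothetical helper, not part of `bert-sklearn`.

```python
# First three rows of the class probability output printed above.
y_prob_head = [
    [0.00164577, 0.9983543],
    [0.9979557, 0.00204425],
    [0.00229784, 0.9977022],
]
labels = [0, 1]  # assumed column order: 0 = negative, 1 = positive

def to_labels(prob_rows, labels):
    """Pick the label whose column has the highest probability in each row."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in prob_rows]

print(to_labels(y_prob_head, labels))  # [1, 0, 1]
```

Under these assumptions the result agrees with the first three true labels in `y_test.head()`.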

## save/load model from disk

In [0]:
#save model to disk
savefile = 'test.bin'
model.save(savefile)

In [18]:


# load model from disk
new_model = load_model(savefile)

# predict with new model
accy = new_model.score(X_test, y_test)

Loading model from test.bin...
Defaulting to linear classifier/regressor
Building sklearn text classifier...


Loss: 0.4766, Accuracy: 87.39%


### random seed
The finetuned model weights depend on the random seeds used for the pytorch and numpy RNGs. The variance in test accuracy is higher when the training data is small. If you want to check the variability across a few random seeds, the following cell takes ~3 min to run and uses ~6.5 GB on my laptop GPU.

In [19]:
%%time
scores = []
for seed in [4, 27, 33]:
    model.random_state = seed
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint

train data size: 900, validation data size: 100



Epoch 1, Train loss: 0.5294, Val loss: 0.4174, Val accy: 84.00%
Epoch 2, Train loss: 0.2127, Val loss: 0.4727, Val accy: 79.00%
Epoch 3, Train loss: 0.0567, Val loss: 0.6405, Val accy: 82.00%

Loss: 0.3655, Accuracy: 88.07%
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint

train data size: 900, validation data size: 100



Epoch 1, Train loss: 0.5397, Val loss: 0.2543, Val accy: 90.00%
Epoch 2, Train loss: 0.1867, Val loss: 0.2778, Val accy: 89.00%
Epoch 3, Train loss: 0.0452, Val loss: 0.3852, Val accy: 86.00%

Loss: 0.4923, Accuracy: 85.21%
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint

train data size: 900, validation data size: 100



Epoch 1, Train loss: 0.5088, Val loss: 0.3841, Val accy: 83.00%
Epoch 2, Train loss: 0.1785, Val loss: 0.2559, Val accy: 89.00%
Epoch 3, Train loss: 0.0375, Val loss: 0.2667, Val accy: 89.00%

Loss: 0.4218, Accuracy: 86.12%
CPU times: user 2min 34s, sys: 1min 8s, total: 3min 42s
Wall time: 3min 58s


In [20]:
# let's also add the accy from our earlier run, which uses the default seed=42
scores = np.array(scores + [accy])
print(scores)
print("%0.2f%% (+/-%0.03f)"% (stats.mean(scores), stats.stdev(scores) * 2))

[88.0733945  85.20642202 86.12385321 87.3853211 ]
86.70% (+/-2.561)
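
For reference, the summary line can be reproduced from the four printed scores with the standard library alone. The sketch below makes explicit that the `+/-` figure is two sample standard deviations (n-1 denominator) in percentage points. With only four runs it is a rough indication of spread rather than a formal confidence interval.

```python
import statistics as stats

# Test accuracies (in percent) printed above: three seeds plus the seed=42 run.
scores = [88.0733945, 85.20642202, 86.12385321, 87.3853211]

mean = stats.mean(scores)
spread = 2 * stats.stdev(scores)  # stdev uses the n-1 (sample) denominator
print("%0.2f%% (+/-%0.3f)" % (mean, spread))  # 86.70% (+/-2.561)
print("range: %0.2f%% to %0.2f%%" % (mean - spread, mean + spread))
```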
