# demo 

We will go through 5 examples: 

* **[text classification ](#text_classification)** - the goal is to classify a single sentence or short text.


* **[text pair classification ](#text_pair_classification)** - the goal is to classify a pair of sentences or short texts.


* **[text pair regression ](#text_pair_regression)** - the goal is to predict a numerical value for a pair of sentences or short texts.


* **[Named Entity Recognition (NER)](#ner_conll_eng)** - the goal is to tag each token in a list of tokens as a person, location, organization, etc.



* **[Biomedical NER](#ner_ncbi)** - another NER task but focused on the biomedical domain. We will finetune [**BERT**](#NCBI_BERT), [**SciBERT**](#NCBI_SciBERT), and [**BioBERT**](#NCBI_BioBERT) based models.


### A note on GPU cards and memory

While it's possible, it would be very slow to run the examples without a GPU card of some sort. In addition, the BERT models (especially the large model) are pretty big, so it helps to have more GPU memory. All the examples in this demo were run on a laptop with an Nvidia GTX-1070 card that has 8GB of memory.

The three biggest parameters you can change which will reduce the GPU memory requirements significantly are:

* **`bert_model`** - BERT models come in 2 sizes: `base` and `large`. As you would expect, the large model demands more GPU memory and takes longer to train. If you have a small GPU, start with any of the `base` models first. The default is `'bert-base-uncased'`.

> `base(110M parameter models)` : `'bert-base-uncased'`, `'bert-base-cased'`, `'bert-base-multilingual-uncased'`, `'bert-base-multilingual-cased'`, `'bert-base-chinese'`, `'bert-base-portuguese-cased'`, and all the **`BioBERT`** and **`SciBERT`** models.

> `large(340M parameter models)`: `'bert-large-uncased'`, `'bert-large-cased'` and `'bert-large-portuguese-cased'`


* **`max_seq_length`** - the default is 128 with a max value of 512. Setting it to a smaller value like 96 or even 64 saves a lot of GPU memory and still gets good results on a lot of tasks.


* **`train_batch_size`** - the default is 32. Cutting it in half will save memory and should also still give good results.


In addition to these three parameters, [huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT#Training-large-models-introduction,-tools-and-examples) has several options to reduce the GPU memory requirements, which are passed through in `bert-sklearn`:

* **`gradient_accumulation_steps`** - this is the number of steps over which gradients are accumulated before the optimizer performs an update. The default is 1. Setting it to a higher integer (e.g. 2, 4, up to the **`train_batch_size`**) lowers GPU memory use at the cost of more compute time. I use this a lot when I train BERT models on my laptop GPU.


* **`fp16`** - this is whether to use 16-bit float precision instead of 32-bit. The default is `False`. To enable half precision, you must install [Nvidia apex](https://github.com/NVIDIA/apex). Setting this option to `True` will then cut the model memory load roughly in half. I use this when I train on my laptop GPU as well.
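
The progress bars later in this demo are consistent with the batch that is actually pushed through the GPU being `train_batch_size` divided by `gradient_accumulation_steps`, as in the huggingface example scripts. The sketch below only illustrates that arithmetic; the helper name `batches_per_epoch` is hypothetical and is not part of `bert-sklearn`.

```python
import math

def batches_per_epoch(n_samples, train_batch_size, gradient_accumulation_steps=1):
    """Return (per-step batch size, forward/backward passes per epoch).

    Assumes the batch loaded onto the GPU at each step is
    train_batch_size // gradient_accumulation_steps, which matches the
    progress bars shown in this demo.
    """
    micro_batch = train_batch_size // gradient_accumulation_steps
    return micro_batch, math.ceil(n_samples / micro_batch)

# 900 training samples, batch size 16, no accumulation
print(batches_per_epoch(900, 16))       # (16, 57)
# same batch size, gradients accumulated over 2 steps
print(batches_per_epoch(900, 16, 2))    # (8, 113)
```

Halving the per-step batch roughly halves the activation memory needed per step, while the number of forward/backward passes per epoch doubles.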


Finally, two other system setups will help reduce the memory requirement:

* `multiple gpus` - on a single machine with multiple GPUs, the GPUs will be detected and the load split across the cards, following the huggingface port.


* `distributed training` - the huggingface port allows you to train across distributed GPUs. The parameter,  **`local_rank`**, is exposed in `bert-sklearn`. But this option has not been tested yet.

First, let's import some libraries and define a few functions we will use in the rest of the demo:


In [1]:
import os
import math
import random
import csv
import sys

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
import statistics as stats

from bert_sklearn import BertClassifier
from bert_sklearn import BertRegressor
from bert_sklearn import BertTokenClassifier
from bert_sklearn import load_model

def read_tsv(filename, quotechar=None):
    with open(filename, "r", encoding='utf-8') as f:
        return list(csv.reader(f, delimiter="\t", quotechar=quotechar))   

def flatten(l):
    return [item for sublist in l for item in sublist]

def read_CoNLL2003_format(filename, idx=3):
    """Read file in CoNLL-2003 shared task format"""
    
    # read the whole file, closing the handle when done
    with open(filename) as f:
        lines = f.read().strip()
    
    # find sentence-like boundaries
    lines = lines.split("\n\n")  
    
    # split on newlines
    lines = [line.split("\n") for line in lines]
    
    # get tokens
    tokens = [[l.split()[0] for l in line] for line in lines]
    
    # get labels/tags
    labels = [[l.split()[idx] for l in line] for line in lines]
    
    #convert to df
    data= {'tokens': tokens, 'labels': labels}
    df=pd.DataFrame(data=data)
    
    return df
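
For reference, each non-blank line of a CoNLL-2003 file holds four whitespace-separated columns (token, part-of-speech tag, chunk tag, NER tag), and blank lines separate sentences, which is why `idx=3` selects the NER tag. The sketch below applies the same splitting logic to a small in-memory sample instead of a file; the sample text and the name `parse_conll` are made up for illustration.

```python
SAMPLE = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def parse_conll(text, idx=3):
    """Split CoNLL-style text into per-sentence token and label lists."""
    sentences = text.strip().split("\n\n")          # blank line = sentence boundary
    rows = [[line.split() for line in s.split("\n")] for s in sentences]
    tokens = [[cols[0] for cols in sent] for sent in rows]
    labels = [[cols[idx] for cols in sent] for sent in rows]
    return tokens, labels

tokens, labels = parse_conll(SAMPLE)
print(tokens[1])   # ['Peter', 'Blackburn']
print(labels[0])   # ['B-ORG', 'O', 'B-MISC', 'O']
```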


<a id='text_classification'></a>
# text classification 

For single text classification, we have the input data `X`, and target data `y` where:

* `X` is a list, pandas Series, or numpy array of text data.


* `y` is a list, pandas Series, or numpy array of text labels.

For this example, we will use the **`Stanford Sentiment Treebank (SST-2)`** data set from the [GLUE benchmarks](https://gluebenchmark.com/). The **`SST-2`** task consists of sentences drawn from movie reviews and annotated with a sentiment label. 

See [website](https://nlp.stanford.edu/sentiment/code.html) and [paper](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) for more info.

The input features are short sentences and the labels are the standard sentiment polarity of:

*    0 for negative 


*    1 for positive.

## get data

First download the data using the GLUE downloader:

In [2]:
%%bash
python3 ./glue_examples/download_glue_data.py --data_dir ./glue_examples//glue_data --tasks SST 

Downloading and extracting SST...
	Completed!


In [3]:
"""
SST-2 train data size: 67349 
SST-2 dev data size: 872 
"""
DATADIR = './glue_examples/glue_data'

def get_sst_data(train_file=DATADIR + '/SST-2/train.tsv',
                 dev_file=DATADIR + '/SST-2/dev.tsv'):

    train = pd.read_csv(train_file, sep='\t', encoding='utf8', keep_default_na=False)
    train.columns=['text', 'label']
    print("SST-2 train data size: %d "%(len(train)))
    
    dev = pd.read_csv(dev_file, sep='\t', encoding='utf8', keep_default_na=False)
    dev.columns=['text', 'label']
    print("SST-2 dev data size: %d "%(len(dev)))
    label_list = np.unique(train['label'])

    return train, dev, label_list

train, dev, label_list = get_sst_data()
train.head()

SST-2 train data size: 67349 
SST-2 dev data size: 872 


|   | text | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |
| 3 | remains utterly satisfied to remain the same t... | 0 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 |


## setup data

We will subsample the data for the demo. See the [SST-2.ipynb notebook](https://github.com/charles9n/bert-sklearn/blob/master/glue_examples/SST-2.ipynb) for a finetune demo on the full data set.

In [4]:
# subsample data 
n = 1000
train = train.sample(n, random_state=42)

X_train = train['text']
y_train = train['label']

# use the dev set for testing
test = dev
X_test = test['text']
y_test = test['label']

## define model

We will set up a classifier with the default settings, but let's reduce **`max_seq_length`** and **`train_batch_size`** so it can run on a smaller GPU. This config uses ~5GB of GPU memory on my laptop's 8GB GTX-1070:

In [5]:
model = BertClassifier(max_seq_length=64, train_batch_size=16)
model

Building sklearn text classifier...


BertClassifier(bert_config_json=None, bert_model='bert-base-uncased',
        bert_vocab=None, do_lower_case=None, epochs=3, eval_batch_size=8,
        fp16=False, from_tf=False, gradient_accumulation_steps=1,
        ignore_label=None, label_list=None, learning_rate=2e-05,
        local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
        max_seq_length=64, num_mlp_hiddens=500, num_mlp_layers=0,
        random_state=42, restore_file=None, train_batch_size=16,
        use_cuda=True, validation_fraction=0.1, warmup_proportion=0.1)

## finetune model

Finetuning means fitting the model on the training data.

The `model.fit()` routine:

* Loads the pretrained BERT model defined in `model.bert_model`. The first time this runs will be slower as it downloads the BERT model from the internet. Subsequent calls will be faster as the model is saved in a file cache locally.


* Uses `model.validation_fraction` (default=0.1) of the data for validation and finetunes BERT on the remainder for `model.epochs` (default=3) epochs.
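
As a rough picture of the hold-out step, the sketch below shuffles indices and sets aside a `validation_fraction` share for validation. It only illustrates the arithmetic (1000 samples give 900 for training and 100 for validation, as in the log that follows); the function `holdout_split` is hypothetical, and the actual splitting code inside `bert-sklearn` may differ.

```python
import random

def holdout_split(n_samples, validation_fraction=0.1, seed=42):
    """Return (train_indices, validation_indices) for a shuffled hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # local RNG, global state untouched
    n_val = int(n_samples * validation_fraction)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = holdout_split(1000)
print(len(train_idx), len(val_idx))   # 900 100
```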

In [6]:
%%time
model = model.fit(X_train, y_train)

Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 900, validation data size: 100


Training  : 100%|██████████| 57/57 [00:15<00:00,  4.17it/s, loss=0.529]
Validating: 100%|██████████| 13/13 [00:00<00:00, 22.70it/s]

Epoch 1, Train loss: 0.5295, Val loss: 0.4408, Val accy: 81.00%



Training  : 100%|██████████| 57/57 [00:15<00:00,  4.16it/s, loss=0.167]
Validating: 100%|██████████| 13/13 [00:00<00:00, 22.89it/s]

Epoch 2, Train loss: 0.1668, Val loss: 0.4380, Val accy: 87.00%



Training  : 100%|██████████| 57/57 [00:15<00:00,  4.16it/s, loss=0.0434]
Validating: 100%|██████████| 13/13 [00:00<00:00, 22.74it/s]

Epoch 3, Train loss: 0.0434, Val loss: 0.5512, Val accy: 86.00%
CPU times: user 36.1 s, sys: 16.5 s, total: 52.6 s
Wall time: 54.1 s





## score and make predictions on test data

In [7]:
# score model
accy = model.score(X_test, y_test)

# make class probability predictions
y_prob = model.predict_proba(X_test)
print("class prob estimates:\n", y_prob)

# make predictions
y_pred = model.predict(X_test)
print("Accuracy: %0.2f%%"%(metrics.accuracy_score(y_test, y_pred) * 100))

target_names = ['negative', 'positive']
print(classification_report(y_test, y_pred, target_names=target_names))

Testing: 100%|██████████| 109/109 [00:03<00:00, 27.99it/s]



Loss: 0.3717, Accuracy: 88.07%


Predicting: 100%|██████████| 109/109 [00:03<00:00, 27.94it/s]

class prob estimates:
 [[0.00176739 0.9982326 ]
 [0.978774   0.02122599]
 [0.00462427 0.99537575]
 ...
 [0.96313787 0.03686218]
 [0.1856012  0.81439877]
 [0.00501524 0.99498475]]



Predicting: 100%|██████████| 109/109 [00:03<00:00, 27.85it/s]

Accuracy: 88.07%
              precision    recall  f1-score   support

    negative       0.88      0.87      0.88       428
    positive       0.88      0.89      0.88       444

   micro avg       0.88      0.88      0.88       872
   macro avg       0.88      0.88      0.88       872
weighted avg       0.88      0.88      0.88       872
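
Each row of the `predict_proba` output holds one probability per class. Assuming the columns follow the label order (0 = negative, 1 = positive) and that, as is conventional for scikit-learn style classifiers, the predicted label is the class with the highest probability, the first three rows printed above map to labels as sketched here. The helper `to_labels` is illustrative and not part of `bert-sklearn`.

```python
probs = [
    [0.00176739, 0.9982326],
    [0.978774, 0.02122599],
    [0.00462427, 0.99537575],
]
class_labels = [0, 1]   # assumed column order: 0 = negative, 1 = positive

def to_labels(rows, labels):
    """Pick, for every row, the label whose probability is largest."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in rows]

print(to_labels(probs, class_labels))   # [1, 0, 1]
```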






## save/load model from disk

In [8]:
#save model to disk
savefile = '/data/test.bin'
model.save(savefile)

# load model from disk
new_model = load_model(savefile)

# predict with new model
accy = new_model.score(X_test, y_test)

Loading model from /data/test.bin...
Defaulting to linear classifier/regressor
Building sklearn text classifier...


Testing: 100%|██████████| 109/109 [00:03<00:00, 27.95it/s]


Loss: 0.3717, Accuracy: 88.07%





### random seed
The finetuned model weights will change depending on the random seeds we use for the pytorch and numpy RNGs. The variance in test accuracy is higher when the training data is small. If you want to check the variability across a few random seeds, the following cell takes ~3min to run and uses ~6.5GB on my laptop GPU.

In [9]:
%%time
scores = []
for seed in [4, 27, 33]:
    model.random_state = seed
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 900, validation data size: 100


Training  : 100%|██████████| 57/57 [00:16<00:00,  4.04it/s, loss=0.502]
Validating: 100%|██████████| 13/13 [00:00<00:00, 22.24it/s]

Epoch 1, Train loss: 0.5023, Val loss: 0.5222, Val accy: 81.00%



Training  : 100%|██████████| 57/57 [00:16<00:00,  4.10it/s, loss=0.184]
Validating: 100%|██████████| 13/13 [00:00<00:00, 21.74it/s]

Epoch 2, Train loss: 0.1842, Val loss: 0.4111, Val accy: 87.00%



Training  : 100%|██████████| 57/57 [00:16<00:00,  3.99it/s, loss=0.0377]
Validating: 100%|██████████| 13/13 [00:00<00:00, 21.17it/s]

Epoch 3, Train loss: 0.0377, Val loss: 0.5228, Val accy: 88.00%



Testing: 100%|██████████| 109/109 [00:04<00:00, 28.08it/s]



Loss: 0.3578, Accuracy: 88.07%
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 900, validation data size: 100


Training  : 100%|██████████| 57/57 [00:16<00:00,  3.73it/s, loss=0.533]
Validating: 100%|██████████| 13/13 [00:00<00:00, 20.84it/s]

Epoch 1, Train loss: 0.5332, Val loss: 0.2840, Val accy: 86.00%



Training  : 100%|██████████| 57/57 [00:16<00:00,  3.94it/s, loss=0.204]
Validating: 100%|██████████| 13/13 [00:00<00:00, 21.30it/s]

Epoch 2, Train loss: 0.2042, Val loss: 0.2330, Val accy: 91.00%



Training  : 100%|██████████| 57/57 [00:16<00:00,  3.98it/s, loss=0.0656]
Validating: 100%|██████████| 13/13 [00:00<00:00, 21.87it/s]

Epoch 3, Train loss: 0.0656, Val loss: 0.2525, Val accy: 91.00%



Testing: 100%|██████████| 109/109 [00:04<00:00, 27.26it/s]



Loss: 0.4295, Accuracy: 85.89%
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 900, validation data size: 100


Training  : 100%|██████████| 57/57 [00:16<00:00,  4.10it/s, loss=0.585]
Validating: 100%|██████████| 13/13 [00:00<00:00, 21.12it/s]

Epoch 1, Train loss: 0.5846, Val loss: 0.4408, Val accy: 76.00%



Training  : 100%|██████████| 57/57 [00:16<00:00,  3.94it/s, loss=0.291]
Validating: 100%|██████████| 13/13 [00:00<00:00, 20.98it/s]

Epoch 2, Train loss: 0.2907, Val loss: 0.2925, Val accy: 90.00%



Training  : 100%|██████████| 57/57 [00:17<00:00,  3.86it/s, loss=0.0896]
Validating: 100%|██████████| 13/13 [00:00<00:00, 20.87it/s]

Epoch 3, Train loss: 0.0896, Val loss: 0.2609, Val accy: 91.00%



Testing: 100%|██████████| 109/109 [00:04<00:00, 22.76it/s]


Loss: 0.4628, Accuracy: 84.75%
CPU times: user 1min 58s, sys: 55.7 s, total: 2min 54s
Wall time: 2min 58s





In [10]:
# let's also add the accuracy from our earlier run, which used the default seed=42
scores = np.array(scores + [accy])
print(scores)
print("%0.2f%% (+/-%0.03f)"% (stats.mean(scores), stats.stdev(scores) * 2))

[88.0733945  85.89449541 84.74770642 88.0733945 ]
86.70% (+/-3.313)


<a id='text_pair_classification'></a>

# text pair classification

For text pair classification, we have input data `X`, and target data `y` where:

* `X` is a list, pandas dataframe, or numpy array of text pairs (`text_a`, `text_b`).


* `y` is a list, pandas Series, or numpy array of text labels

For this example, we will use the **`Quora Question Pairs (QQP)`** data set from the [GLUE benchmarks](https://gluebenchmark.com/). This data consists of sentence pairs from the Quora website labeled as duplicate or not. See the [original release post](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) for more info.

The input features are pairs of questions (`text_a`, `text_b`) along with the labels:
*    0 if `text_a` and `text_b` are not duplicates

*    1 if `text_a` and `text_b` are duplicates


## get data

In [11]:
%%bash
python3 ./glue_examples/download_glue_data.py --data_dir ./glue_examples//glue_data --tasks QQP 

Downloading and extracting QQP...
	Completed!


In [12]:
"""
QQP train data size: 363849 
QQP dev data size: 40430 
"""

DATADIR = './glue_examples/glue_data'

def get_quora_df(filename):
    rows = read_tsv(filename)
    df=pd.DataFrame(rows[1:], columns=rows[0])
    df=df[['question1', 'question2', 'is_duplicate']]
    df = df[pd.notnull(df['is_duplicate'])]
    df.columns=['text_a', 'text_b', 'label']
    return df

def get_quora_data(train_file=DATADIR+'/QQP/train.tsv', 
                   dev_file=DATADIR+'/QQP/dev.tsv'):
    train = get_quora_df(train_file)
    print("QQP train data size: %d "%(len(train)))
    dev = get_quora_df(dev_file)
    print("QQP dev data size: %d "%(len(dev)))

    label_list = np.unique(train['label'].values)
    return train, dev, label_list

train, dev, label_list = get_quora_data()
train.head()

QQP train data size: 363849 
QQP dev data size: 40430 


|   | text_a | text_b | label |
|---|---|---|---|
| 0 | How is the life of a math student? Could you d... | Which level of prepration is enough for the ex... | 0 |
| 1 | How do I control my horny emotions? | How do you control your horniness? | 1 |
| 2 | What causes stool color to change to yellow? | What can cause stool to come out as little balls? | 0 |
| 3 | What can one do after MBBS? | What do i do after my MBBS ? | 1 |
| 4 | Where can I find a power outlet for my laptop ... | Would a second airport in Sydney, Australia be... | 0 |


## setup data

We will subsample the data for the demo. See the [QQP.ipynb](https://github.com/charles9n/bert-sklearn/blob/master/glue_examples/QQP.ipynb) notebook for a finetune demo on the full data set.

In [13]:
# subsample data 
n = 1000
train = train.sample(n, random_state=42)
dev = dev.sample(n, random_state=42)

X_train = train[['text_a', 'text_b']]
y_train = train['label']

# use the dev set for testing...
test = dev
X_test = test[['text_a', 'text_b']]
y_test = test['label']

## finetune

In [14]:
%%time
# define model
model = BertClassifier(max_seq_length=64, train_batch_size=16)

# finetune model
model.fit(X_train, y_train)

# score model
model.score(X_test, y_test)

# make predictions
y_pred = model.predict(X_test)
print("Accuracy: %0.2f%%"%(metrics.accuracy_score(y_test, y_pred) * 100))

target_names = ['not duplicate', 'is duplicate']
print(classification_report(y_test, y_pred, target_names=target_names))

Building sklearn text classifier...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 900, validation data size: 100


Training  : 100%|██████████| 57/57 [00:15<00:00,  4.18it/s, loss=0.643]
Validating: 100%|██████████| 13/13 [00:00<00:00, 21.79it/s]

Epoch 1, Train loss: 0.6428, Val loss: 0.5923, Val accy: 65.00%



Training  : 100%|██████████| 57/57 [00:16<00:00,  4.10it/s, loss=0.424]
Validating: 100%|██████████| 13/13 [00:00<00:00, 21.50it/s]

Epoch 2, Train loss: 0.4240, Val loss: 0.6151, Val accy: 64.00%



Training  : 100%|██████████| 57/57 [00:17<00:00,  3.72it/s, loss=0.225]
Validating: 100%|██████████| 13/13 [00:00<00:00, 14.36it/s]

Epoch 3, Train loss: 0.2246, Val loss: 0.7069, Val accy: 64.00%



Testing: 100%|██████████| 125/125 [00:05<00:00, 22.48it/s]


Loss: 0.5384, Accuracy: 74.90%



Predicting: 100%|██████████| 125/125 [00:05<00:00, 21.51it/s]

Accuracy: 74.90%
               precision    recall  f1-score   support

not duplicate       0.85      0.72      0.78       617
 is duplicate       0.64      0.80      0.71       383

    micro avg       0.75      0.75      0.75      1000
    macro avg       0.74      0.76      0.74      1000
 weighted avg       0.77      0.75      0.75      1000

CPU times: user 44 s, sys: 21.1 s, total: 1min 5s
Wall time: 1min 6s





<a id='text_pair_regression'></a>

# text pair regression  

For text pair regression, we have input data `X`, and target data `y` where:

* `X` is a list, pandas dataframe, or numpy array of text pairs (`text_a`, `text_b`).


* `y` is a list, pandas Series, or numpy array of floats.


For this example, we will use the **`STS-B`** data set from [GLUE benchmarks](https://gluebenchmark.com/). The data consists of sentence pairs drawn from news headlines and image captions with annotated similarity scores ranging from 0 to 5.

See [website](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) and [paper](http://www.aclweb.org/anthology/S/S17/S17-2001.pdf) for more info.


### STS-B

In [15]:
%%bash
python3 ./glue_examples/download_glue_data.py --data_dir ./glue_examples//glue_data --tasks STS 

Downloading and extracting STS...
	Completed!


In [16]:
"""
STS-B train data size: 5749 
STS-B dev data size: 1500 
"""

DATADIR = './glue_examples/glue_data'

def get_sts_b_df(filename):
    rows = read_tsv(filename)
    df=pd.DataFrame(rows[1:], columns=rows[0])
    df=df[['sentence1', 'sentence2', 'score']]    
    df.columns=['text_a', 'text_b', 'label']
    df.label = pd.to_numeric(df.label)
    df = df[pd.notnull(df['label'])]                
    return df

def get_sts_b_data(train_file=DATADIR + '/STS-B/train.tsv',
                   dev_file=DATADIR + '/STS-B/dev.tsv'):
    train = get_sts_b_df(train_file)
    print("STS-B train data size: %d "%(len(train)))    
    dev   = get_sts_b_df(dev_file)
    print("STS-B dev data size: %d "%(len(dev)))  
    return train,dev

train, dev = get_sts_b_data()
train.head()

STS-B train data size: 5749 
STS-B dev data size: 1500 


|   | text_a | text_b | label |
|---|---|---|---|
| 0 | A plane is taking off. | An air plane is taking off. | 5.0 |
| 1 | A man is playing a large flute. | A man is playing a flute. | 3.8 |
| 2 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... | 3.8 |
| 3 | Three men are playing chess. | Two men are playing chess. | 2.6 |
| 4 | A man is playing the cello. | A man seated is playing the cello. | 4.25 |


## setup data

We will subsample the data for the demo. See the [STS-B.ipynb](https://github.com/charles9n/bert-sklearn/blob/master/glue_examples/STS-B.ipynb) notebook for a finetune demo on the full data set.

In [17]:
# subsample data
n = 1000
train = train.sample(n, random_state=42)
dev = dev.sample(n, random_state=42)

X_train = train[['text_a', 'text_b']]
y_train = train['label']

# use the dev set for testing...
test = dev
X_test = test[['text_a', 'text_b']]
y_test = test['label']

## finetune

* For regression, the validation accuracy (`Val accy`) reported during training is the Pearson correlation.

In [18]:
%%time
from scipy.stats import pearsonr

# define model
model = BertRegressor()
model.max_seq_length = 64

# finetune model
model.fit(X_train, y_train)

# score model
model.score(X_test, y_test)

# make predictions
y_pred = model.predict(X_test)
pearson_accy = pearsonr(y_pred, y_test)[0] * 100
print("Pearson : %0.2f"%(pearson_accy))

Building sklearn text regressor...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 900, validation data size: 100


Training  : 100%|██████████| 29/29 [00:12<00:00,  2.90it/s, loss=3.03]
Validating: 100%|██████████| 13/13 [00:00<00:00, 21.43it/s]

Epoch 1, Train loss: 3.0313, Val loss: 1.5151, Val accy: 71.77%



Training  : 100%|██████████| 29/29 [00:12<00:00,  2.87it/s, loss=0.713]
Validating: 100%|██████████| 13/13 [00:00<00:00, 21.33it/s]

Epoch 2, Train loss: 0.7135, Val loss: 1.0438, Val accy: 81.48%



Training  : 100%|██████████| 29/29 [00:12<00:00,  2.86it/s, loss=0.333]
Validating: 100%|██████████| 13/13 [00:00<00:00, 21.54it/s]

Epoch 3, Train loss: 0.3328, Val loss: 0.7230, Val accy: 82.61%



Testing: 100%|██████████| 125/125 [00:04<00:00, 27.48it/s]


Loss: 0.5683, Accuracy: 86.77%



Predicting: 100%|██████████| 125/125 [00:04<00:00, 28.98it/s]

Pearson : 86.77
CPU times: user 32.6 s, sys: 17.1 s, total: 49.7 s
Wall time: 51.2 s
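
The Pearson correlation reported here measures how linearly the predicted scores track the gold scores: it is 1 for a perfect increasing linear relationship and 0 when there is no linear relationship. A minimal pure-Python version of the formula is sketched below with made-up numbers, purely to show what the first value returned by `scipy.stats.pearsonr` represents.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

gold = [5.0, 3.8, 2.6, 4.25]        # illustrative similarity scores
pred = [4.6, 3.9, 2.1, 4.4]         # illustrative model outputs
print("Pearson : %0.2f" % (pearson(pred, gold) * 100))
```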





<a id='ner_conll_eng'></a>


# Named Entity Recognition (NER): CoNLL 2003 task


For this example, we will use the **`CoNLL 2003`** shared task, which consists of data from the Reuters 1996 news corpus with annotations for 4 types of `Named Entities` (persons, locations, organizations, and miscellaneous entities). The data is in an [IOB2](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) format. Each entity token has a `'B-'` or `'I-'` tag indicating whether it is the start of an entity or inside one. The `'O'` tag means the token is not a named entity.

* **`Person`**: `'B-PER'` and  `'I-PER'`


* **`Organization`**: `'B-ORG'` and `'I-ORG'`


* **`Location`**: `'B-LOC'`  and `'I-LOC'`


* **`Miscellaneous`**: `'B-MISC'` and `'I-MISC'`


* **`Other (non-named entity)`**: `'O'`

See [website](https://www.clips.uantwerpen.be/conll2003/ner/) and [paper](https://www.clips.uantwerpen.be/conll2003/pdf/14247tjo.pdf) for more info.

The data is already tokenized and tagged:

In [None]:
# tokens: EU     rejects  German  call  to  boycott  British  lamb . 
# tags  : B-ORG  O        B-MISC  O     O   O        B-MISC   O    O

So for the named entity recognition (NER) task, the data consists of features `X` and labels `y`:

* **`X`**: a list of lists of tokens


* **`y`**: a list of lists of NER tags
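
To make the IOB2 scheme concrete, the sketch below groups tagged tokens back into entity spans, using the example sentence shown above. The function `iob2_spans` is a hypothetical helper written for this illustration; it is not part of `bert-sklearn`.

```python
def iob2_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        continues = tag.startswith("I-") and tag[2:] == current_type
        if continues:
            current_tokens.append(token)
            continue
        if current_type is not None:              # close the open span
            spans.append((current_type, " ".join(current_tokens)))
        if tag == "O":
            current_type, current_tokens = None, []
        else:                                     # 'B-' tag, or a stray 'I-' tag
            current_type, current_tokens = tag[2:], [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(iob2_spans(tokens, tags))
# [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```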


## get data

In [20]:
%%bash
cd other_examples
DATADIR="ner_english"
if test ! -d "$DATADIR";then
    echo "Creating $DATADIR dir"
    mkdir "$DATADIR"
    cd "$DATADIR"
    wget https://raw.githubusercontent.com/mxhofer/Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL/master/data/train.txt
    wget https://raw.githubusercontent.com/mxhofer/Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL/master/data/test.txt
    wget https://raw.githubusercontent.com/mxhofer/Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL/master/data/dev.txt
fi

In [21]:
"""
Train data: 14987 sentences, 204567 tokens
Dev data: 3466 sentences, 51578 tokens
Test data: 3684 sentences, 46666 tokens
"""

DATADIR = "./other_examples/ner_english/"

def get_conll2003_data(trainfile=DATADIR + "train.txt",
                       devfile=DATADIR + "dev.txt",
                       testfile=DATADIR + "test.txt"):

    train = read_CoNLL2003_format(trainfile)
    print("Train data: %d sentences, %d tokens"%(len(train), len(flatten(train.tokens))))
    dev = read_CoNLL2003_format(devfile)
    print("Dev data: %d sentences, %d tokens"%(len(dev), len(flatten(dev.tokens))))
    test = read_CoNLL2003_format(testfile)
    print("Test data: %d sentences, %d tokens"%(len(test), len(flatten(test.tokens))))
    
    return train, dev, test


train, dev, test = get_conll2003_data()
train.head()

Train data: 14987 sentences, 204567 tokens
Dev data: 3466 sentences, 51578 tokens
Test data: 3684 sentences, 46666 tokens


|   | tokens | labels |
|---|---|---|
| 0 | [-DOCSTART-] | [O] |
| 1 | [EU, rejects, German, call, to, boycott, Briti... | [B-ORG, O, B-MISC, O, O, O, B-MISC, O, O] |
| 2 | [Peter, Blackburn] | [B-PER, I-PER] |
| 3 | [BRUSSELS, 1996-08-22] | [B-LOC, O] |
| 4 | [The, European, Commission, said, on, Thursday... | [O, B-ORG, I-ORG, O, O, O, O, O, O, B-MISC, O,... |


## setup data

We will subsample the data for the demo. See the [ner_english.ipynb](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_english.ipynb) notebook for a finetune demo on the full data set.

In [22]:
X_train, y_train = train.tokens, train.labels
X_dev, y_dev = dev.tokens, dev.labels
X_test, y_test = test.tokens, test.labels

label_list = np.unique(flatten(y_train))
label_list = list(label_list)
print("\nNER tags:",label_list)

# take a subset of the data for demo
n = 1000
X_train, y_train = X_train[:n], y_train[:n]
X_test, y_test = X_test[:n], y_test[:n]


NER tags: ['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']


## finetune 


Let's define our model using the **`BertTokenClassifier`** class

* We will include an **`ignore_label`** option to exclude the `'O'` label, i.e. the non-named-entity label, when calculating `f1`. Non-named entities make up the large majority of the labels, and `f1` is typically reported with them excluded.


* We will also use the cased model, `'bert-base-cased'`, as casing provides an important signal for NER. The first time you run this it will take a little longer to download the model into the cache.


* With the `BertTokenClassifier` we should also be mindful to set **`max_seq_length`** high enough to cover the lengths of the token lists. See the [ner_english.ipynb](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_english.ipynb) notebook for more detail.

This uses around 7GB of GPU memory on my laptop. If this gives you an out-of-memory (OOM) error, set **`max_seq_length`** to a lower number, e.g. 128 or 96. You can also increase **`gradient_accumulation_steps`** up to 8 to further reduce the GPU memory.
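
One way to pick a starting value for the sequence length is to look at the lengths of the token lists. The sketch below does this for a toy list of sentences. Note that BERT's wordpiece tokenizer can split a token into several pieces and that special tokens such as `[CLS]` and `[SEP]` are added, so the longest token list is only a lower bound on the sequence length needed. The helper `length_summary` is hypothetical.

```python
def length_summary(token_lists):
    """Return (longest, mean) number of tokens per sentence."""
    lengths = [len(tokens) for tokens in token_lists]
    return max(lengths), sum(lengths) / len(lengths)

X_sample = [
    ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."],
    ["Peter", "Blackburn"],
    ["BRUSSELS", "1996-08-22"],
]
longest, mean = length_summary(X_sample)
print(longest, round(mean, 2))   # 9 4.33
```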

In [23]:
%%time
# define model
model = BertTokenClassifier(bert_model='bert-base-cased',
                            epochs=3,
                            max_seq_length=173,
                            learning_rate=2e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            gradient_accumulation_steps=2,
                            ignore_label=['O'])


print(model)

# finetune model
model.fit(X_train, y_train)

# score model
f1_test = model.score(X_test, y_test)
print("Test f1: %0.02f"%(f1_test))

# make predictions
y_preds = model.predict(X_test)

# calculate the probability of each class
y_probs = model.predict_proba(X_test)

print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None, bert_model='bert-base-cased',
          bert_vocab=None, do_lower_case=None, epochs=3,
          eval_batch_size=16, fp16=False, from_tf=False,
          gradient_accumulation_steps=2, ignore_label=['O'],
          label_list=None, learning_rate=2e-05, local_rank=-1,
          logfile='bert_sklearn.log', loss_scale=0, max_seq_length=173,
          num_mlp_hiddens=500, num_mlp_layers=0, random_state=42,
          restore_file=None, train_batch_size=16, use_cuda=True,
          validation_fraction=0.1, warmup_proportion=0.1)


100%|██████████| 213450/213450 [00:00<00:00, 891015.14B/s]


Loading bert-base-cased model...


100%|██████████| 435779157/435779157 [02:58<00:00, 2437434.04B/s] 
100%|██████████| 313/313 [00:00<00:00, 62271.95B/s]


Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 900, validation data size: 100


Training  : 100%|██████████| 113/113 [00:33<00:00,  3.91it/s, loss=0.0501]
Validating: 100%|██████████| 7/7 [00:01<00:00,  5.63it/s]

Epoch 1, Train loss: 0.0501, Val loss: 0.0102, Val accy: 95.16%, f1: 83.12



Training  : 100%|██████████| 113/113 [00:33<00:00,  3.87it/s, loss=0.00848]
Validating: 100%|██████████| 7/7 [00:01<00:00,  5.60it/s]

Epoch 2, Train loss: 0.0085, Val loss: 0.0071, Val accy: 96.93%, f1: 90.42



Training  : 100%|██████████| 113/113 [00:34<00:00,  3.86it/s, loss=0.00417]
Validating: 100%|██████████| 7/7 [00:01<00:00,  5.56it/s]

Epoch 3, Train loss: 0.0042, Val loss: 0.0055, Val accy: 97.30%, f1: 91.74



Predicting: 100%|██████████| 63/63 [00:11<00:00,  5.42it/s]

Test f1: 87.65



Predicting: 100%|██████████| 63/63 [00:12<00:00,  5.38it/s]
Predicting: 100%|██████████| 63/63 [00:13<00:00,  5.25it/s]


              precision    recall  f1-score   support

       B-LOC       0.84      0.86      0.85       418
      B-MISC       0.67      0.69      0.68       189
       B-ORG       0.84      0.82      0.83       489
       B-PER       0.99      0.92      0.95       655
       I-LOC       0.86      0.55      0.67        66
      I-MISC       0.59      0.72      0.65        82
       I-ORG       0.87      0.88      0.87       205
       I-PER       0.99      1.00      0.99       483
           O       0.99      0.99      0.99      8479

   micro avg       0.96      0.96      0.96     11066
   macro avg       0.85      0.83      0.83     11066
weighted avg       0.96      0.96      0.96     11066

CPU times: user 1min 53s, sys: 1min 1s, total: 2min 55s
Wall time: 5min 27s


### span level stats on test data

If we want span level stats, we can run the [perl script](https://www.clips.uantwerpen.be/conll2003/ner/bin/conlleval) from the original `CoNLL-2000/2003 shared task` to evaluate the results:

In [24]:
# write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./other_examples/conlleval.pl < preds.txt
!rm preds.txt

processed 11066 tokens with 1751 phrases; found: 1765 phrases; correct: 1467.
accuracy:  96.34%; precision:  83.12%; recall:  83.78%; FB1:  83.45
              LOC: precision:  79.27%; recall:  83.25%; FB1:  81.21  439
             MISC: precision:  55.96%; recall:  64.55%; FB1:  59.95  218
              ORG: precision:  79.31%; recall:  79.96%; FB1:  79.63  493
              PER: precision:  98.54%; recall:  92.52%; FB1:  95.43  615
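
If Perl is not available, the span-level numbers can be approximated in plain Python. The following is a minimal sketch, not the `conlleval` implementation: `extract_spans` and `span_prf` are hypothetical helpers, and the span rule (a span starts at a `B-` tag, or at an `I-` tag whose type differs from the previous tag) is a simplifying assumption, so results may differ from the script on edge cases. The tags are copied from the "Reggie Blinker" test sentence examined in the next section, where the model predicts `B-LOC` instead of `B-ORG` for "Sheffield".

```python
def extract_spans(tags):
    """Return the set of (start, end, type) spans in one BIO tag sequence.

    Assumption: a span starts at a B- tag, or at an I- tag whose type
    differs from the previous tag. End indices are exclusive.
    """
    spans = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == kind
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if prefix in ("B", "I") and not continues:
            start, kind = i, label
    return spans


def span_prf(y_true, y_pred):
    """Span-level precision, recall and F1 over lists of tag sequences."""
    n_true = n_pred = n_correct = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_spans = extract_spans(true_tags)
        pred_spans = extract_spans(pred_tags)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
        n_correct += len(true_spans & pred_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1


# gold tags for the 27-token sentence; the prediction differs only at token 19
labels = (["B-MISC", "O", "B-PER", "I-PER"] + ["O"] * 6 + ["B-ORG"] + ["O"] * 8
          + ["B-ORG", "I-ORG", "O", "O", "B-ORG", "O", "O", "O"])
preds = list(labels)
preds[19] = "B-LOC"

precision, recall, f1 = span_prf([labels], [preds])
print("precision: %.2f, recall: %.2f, f1: %.2f" % (precision, recall, f1))
```

Under this rule the single wrong token splits "Sheffield Wednesday" into two predicted spans, so the sentence has 5 gold spans, 6 predicted spans, and 4 exact matches. This is why span-level scores are lower than the token-level scores in the classification report.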


### check results on test data

In [25]:
i = 152
tokens = X_test[i]
labels = y_test[i]
preds = y_preds[i]
data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token   label predict
0        Dutch  B-MISC  B-MISC
1      forward       O       O
2       Reggie   B-PER   B-PER
3      Blinker   I-PER   I-PER
4          had       O       O
5          his       O       O
6   indefinite       O       O
7   suspension       O       O
8       lifted       O       O
9           by       O       O
10        FIFA   B-ORG   B-ORG
11          on       O       O
12      Friday       O       O
13         and       O       O
14         was       O       O
15         set       O       O
16          to       O       O
17        make       O       O
18         his       O       O
19   Sheffield   B-ORG   B-LOC
20   Wednesday   I-ORG   I-ORG
21    comeback       O       O
22     against       O       O
23   Liverpool   B-ORG   B-ORG
24          on       O       O
25    Saturday       O       O
26           .       O       O


In [26]:
# pprint out probs for this observation
prob = y_probs[i]
tokens_prob = model.tokens_proba(tokens, prob)

         token  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER    O
0        Dutch   0.11    0.81   0.06   0.01   0.00    0.00   0.00   0.00 0.00
1      forward   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
2       Reggie   0.00    0.00   0.00   1.00   0.00    0.00   0.00   0.00 0.00
3      Blinker   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.99 0.00
4          had   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
5          his   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
6   indefinite   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
7   suspension   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
8       lifted   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
9           by   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
10        FIFA   0.30    0.07   0.60   0.01   0.01    0.00   0.01   0.00 0.00
11          on   0.00    0.00   0.00   0.00   0.00    0.00   0.0

Finally, let's predict the tags and tag probabilities on some new text:

In [27]:
text = "Jefferson wants to go to France."       

tag_predicts  = model.tag_text(text)       
prob_predicts = model.tag_text_proba(text)    

Predicting: 100%|██████████| 1/1 [00:00<00:00,  6.48it/s]


       token predicted tags
0  Jefferson          B-PER
1      wants              O
2         to              O
3         go              O
4         to              O
5     France          B-LOC
6          .              O


Predicting: 100%|██████████| 1/1 [00:00<00:00,  6.75it/s]

       token  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER    O
0  Jefferson   0.00    0.01   0.03   0.95   0.00    0.00   0.00   0.00 0.00
1      wants   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
2         to   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
3         go   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
4         to   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
5     France   0.99    0.00   0.00   0.00   0.00    0.00   0.00   0.00 0.00
6          .   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
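
The two tables above are related in a simple way. The sketch below assumes that the predicted tag is the label with the highest probability, which matches both tables; `most_likely_tags` is a hypothetical helper, not a library function. The probabilities are the rounded values copied from the output above, so the rows need not sum exactly to 1.

```python
label_list = ["B-LOC", "B-MISC", "B-ORG", "B-PER",
              "I-LOC", "I-MISC", "I-ORG", "I-PER", "O"]
tokens = ["Jefferson", "wants", "to", "go", "to", "France", "."]

# one row per token, one column per label (same order as label_list)
probs = [
    [0.00, 0.01, 0.03, 0.95, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.99, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
]


def most_likely_tags(probs, label_list):
    """Return (tag, probability) of the highest-probability label per token."""
    result = []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)
        result.append((label_list[best], row[best]))
    return result


tagged = most_likely_tags(probs, label_list)
for token, (tag, p) in zip(tokens, tagged):
    print("%10s %6s %.2f" % (token, tag, p))
```

Keeping the probability next to the tag makes it easy to spot uncertain tokens, such as "FIFA" in the earlier test sentence, where `B-ORG` wins with only 0.60.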





<a id='ner_ncbi'></a>


##  Biomedical NER : NCBI Disease Corpus


The NCBI disease corpus task is a Named Entity Recognition (NER) task in the biomedical domain. The data comes from a collection of 793 PubMed abstracts annotated for disease entities. Each entity token has a `'B-'` or `'I-'` tag indicating whether it starts an entity or lies inside one. The `'O'` tag means the token is not part of a named entity. See this [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655/) for more information.


NER tasks are token classification tasks where the data consists of features, `X`, and labels, `y`, where:

* **`X`** :  is a list of lists of tokens 


* **`y`** :  is a list of lists of NER tags
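
As a concrete illustration of this layout, here is a small sketch with two fragments taken from the NCBI sample sentence shown later in this demo. The helper `check_bio` is hypothetical (not part of the library); it assumes strict BIO tagging, where every `I-` tag must follow a `B-` or `I-` tag of the same type.

```python
# X: one list of tokens per sentence; y: one tag per token, in the same order
X = [
    ["found", "in", "tumour", "DNA"],
    ["(", "B", "-", "NHL", ")"],
]
y = [
    ["O", "O", "B-Disease", "O"],
    ["O", "B-Disease", "I-Disease", "I-Disease", "O"],
]


def check_bio(X, y):
    """Return a list of problems; an empty list means X and y look consistent."""
    problems = []
    if len(X) != len(y):
        problems.append("X and y have a different number of sentences")
    for n, (tokens, tags) in enumerate(zip(X, y)):
        if len(tokens) != len(tags):
            problems.append("sentence %d: %d tokens but %d tags"
                            % (n, len(tokens), len(tags)))
        previous = "O"
        for i, tag in enumerate(tags):
            # an I- tag must continue an entity of the same type
            if tag.startswith("I-") and previous[2:] != tag[2:]:
                problems.append("sentence %d, token %d: %s does not continue an entity"
                                % (n, i, tag))
            previous = tag
    return problems


problems = check_bio(X, y)
print(problems)
```

A check like this is cheap to run on gold data before finetuning. Model predictions are not guaranteed to follow the strict scheme, so it is less useful on `y_preds`.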


We will finetune models from:


* [**BERT**](#NCBI_BERT) - this is the standard `BERT` base cased model from Google, pretrained on the Books Corpus and English Wikipedia data.



* [**SciBERT**](#NCBI_SciBERT) - `SciBERT` is a model from [AllenAI](https://allenai.org/) based on `BERT` but pretrained on a large scientific text corpus.  For more information on `SciBERT`, see the [github repo](https://github.com/allenai/scibert) and [paper](https://arxiv.org/pdf/1903.10676.pdf).



* [**BioBERT**](#NCBI_BioBERT) -  `BioBERT` is a model also based on `BERT` but pretrained on a large biomedical text corpus.  For more information on `BioBERT`, see the [github repo](https://github.com/dmis-lab/biobert) and [paper](https://arxiv.org/pdf/1901.08746.pdf).


For this demo we will use a subsample of the data. See [ner_NCBI_disease_BioBERT_SciBERT](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_NCBI_disease_BioBERT_SciBERT.ipynb) for a finetune demo using all the training data.


### get data

In [29]:
%%bash
DATADIR="other_examples/NCBI_disease"
if test ! -d "$DATADIR";then
    echo "Creating $DATADIR dir"
    mkdir "$DATADIR"
    cd "$DATADIR"
    wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/NCBI-disease/dev.txt
    wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/NCBI-disease/test.txt
    wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/NCBI-disease/train.txt
fi

In [30]:
"""
Train and dev data: 6347 sentences, 159670 tokens
Test data: 940 sentences, 24497 tokens
"""
DATADIR = "other_examples/NCBI_disease/"

def get_data(trainfile=DATADIR + "train.txt",
             devfile=DATADIR + "dev.txt",
             testfile=DATADIR + "test.txt"):

    train = read_CoNLL2003_format(trainfile, idx=3)    
    dev = read_CoNLL2003_format(devfile, idx=3)
    
    # combine train and dev
    train = pd.concat([train, dev])
    print("Train and dev data: %d sentences, %d tokens"%(len(train), len(flatten(train.tokens))))

    test = read_CoNLL2003_format(testfile, idx=3)
    print("Test data: %d sentences, %d tokens"%(len(test), len(flatten(test.tokens))))
    
    return train, test

train, test = get_data()


X_train, y_train = train.tokens, train.labels

# subsample the data for the demo
n=600
X_train = X_train[:n]
y_train = y_train[:n]

X_test, y_test = test.tokens, test.labels

label_list = np.unique(flatten(y_train))
label_list = list(label_list)
print("\nNER tags:",label_list)

Train and dev data: 6347 sentences, 159670 tokens
Test data: 940 sentences, 24497 tokens

NER tags: ['B-Disease', 'I-Disease', 'O']
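
To make the file format more concrete, here is a rough sketch of a reader for CoNLL-style data. It is not the implementation of `read_CoNLL2003_format`: `read_conll_sketch` is a hypothetical stand-in that returns plain lists rather than a DataFrame, and it assumes whitespace-separated columns, blank lines between sentences, optional `-DOCSTART-` lines, the token in column 0, and the label in column `idx`. The sample text is illustrative, with `_` placeholders in the middle columns; the tokens and labels are taken from the `train.head()` rows in this demo.

```python
import io


def read_conll_sketch(lines, idx=3):
    """Parse CoNLL-style lines into a list of (tokens, labels) pairs."""
    sentences, tokens, labels = [], [], []
    for line in lines:
        fields = line.split()
        # a blank line or a document marker ends the current sentence
        if not fields or fields[0] == "-DOCSTART-":
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        tokens.append(fields[0])
        labels.append(fields[idx])
    if tokens:  # the file may not end with a blank line
        sentences.append((tokens, labels))
    return sentences


# illustrative sample: column 0 is the token, column 3 is the NER label
sample = """\
-DOCSTART- _ _ O

In _ _ O
colon _ _ B-Disease
carcinoma _ _ I-Disease
cells _ _ O
, _ _ O

Here _ _ O
, _ _ O
we _ _ O
report _ _ O
"""

sentences = read_conll_sketch(io.StringIO(sample), idx=3)
for tokens, labels in sentences:
    print(tokens, labels)
```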


In [31]:
train.head()

|   | tokens | labels |
|---|--------|--------|
| 0 | [Identification, of, APC2, ,, a, homologue, of... | [O, O, O, O, O, O, O, O, B-Disease, I-Disease,... |
| 1 | [The, adenomatous, polyposis, coli, (, APC, ),... | [O, B-Disease, I-Disease, I-Disease, I-Disease... |
| 2 | [Complex, formation, induces, the, rapid, degr... | [O, O, O, O, O, O, O, O, O] |
| 3 | [In, colon, carcinoma, cells, ,, loss, of, APC... | [O, B-Disease, I-Disease, O, O, O, O, O, O, O,... |
| 4 | [Here, ,, we, report, the, identification, and... | [O, O, O, O, O, O, O, O, O, O, O, O, O] |


In [32]:
i = 9
tokens = X_test[i]
labels = y_test[i]

data = {"token": tokens,"label": labels}
df=pd.DataFrame(data=data)
print(df)

         token      label
0   Occasional          O
1     missense          O
2    mutations          O
3           in          O
4          ATM          O
5         were          O
6         also          O
7        found          O
8           in          O
9       tumour  B-Disease
10         DNA          O
11        from          O
12    patients          O
13        with          O
14           B  B-Disease
15           -  I-Disease
16        cell  I-Disease
17         non  I-Disease
18           -  I-Disease
19    Hodgkins  I-Disease
20   lymphomas  I-Disease
21           (          O
22           B  B-Disease
23           -  I-Disease
24         NHL  I-Disease
25           )          O
26         and          O
27           a          O
28           B  B-Disease
29           -  I-Disease
30         NHL  I-Disease
31        cell          O
32        line          O
33           .          O


<a id='NCBI_BERT'></a>
### finetune  BERT

As in the CoNLL task, we define our model using the **`BertTokenClassifier`** class

* We will include an **`ignore_label`** option to exclude the `'O'` (non-named-entity) label when calculating `f1`. Non-named entities make up the vast majority of the labels, and `f1` is typically reported with them excluded.


* With the `BertTokenClassifier` we should also be mindful to set **`max_seq_length`** high enough to cover the lengths of the token lists. See the extended demo for more detail.

This uses around 7GB of GPU memory on my laptop. If you run out of memory (OOM), set **`max_seq_length`** to a lower number, e.g. 128 or 96. You can also increase **`gradient_accumulation_steps`** up to 8 to further reduce GPU memory use.
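
To see why gradient accumulation lowers memory use, it helps to work out how many samples are on the GPU at once. The sketch below assumes that the library splits `train_batch_size` into micro-batches of `train_batch_size // gradient_accumulation_steps` samples; this is an assumption, but it is consistent with the progress bars in this demo (113 iterations for 900 training samples, 75 iterations for 600). The function `iterations_per_epoch` is a hypothetical helper.

```python
import math


def iterations_per_epoch(n_samples, train_batch_size, gradient_accumulation_steps):
    """Return (micro-batch size, forward/backward passes per epoch)."""
    micro_batch = train_batch_size // gradient_accumulation_steps
    return micro_batch, math.ceil(n_samples / micro_batch)


for n_samples in (900, 600):
    micro_batch, n_iter = iterations_per_epoch(n_samples, 16, 2)
    print("%d samples: micro-batch of %d, %d iterations per epoch"
          % (n_samples, micro_batch, n_iter))
```

Under the same assumption, raising `gradient_accumulation_steps` to 8 with `train_batch_size=16` would put only 2 samples on the GPU per pass, at the cost of more iterations per epoch.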

In [33]:
%%time
model = BertTokenClassifier('bert-base-cased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=2,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.,                            
                            label_list=label_list,
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test,'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None, bert_model='bert-base-cased',
          bert_vocab=None, do_lower_case=None, epochs=3,
          eval_batch_size=16, fp16=False, from_tf=False,
          gradient_accumulation_steps=2, ignore_label=['O'],
          label_list=['B-Disease', 'I-Disease', 'O'], learning_rate=3e-05,
          local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
          max_seq_length=178, num_mlp_hiddens=500, num_mlp_layers=0,
          random_state=42, restore_file=None, train_batch_size=16,
          use_cuda=True, validation_fraction=0.0, warmup_proportion=0.1)
Loading bert-base-cased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 600, validation data size: 0


Training  : 100%|██████████| 75/75 [00:23<00:00,  3.20it/s, loss=0.0514]
Training  : 100%|██████████| 75/75 [00:25<00:00,  2.95it/s, loss=0.0105]
Training  : 100%|██████████| 75/75 [00:29<00:00,  2.57it/s, loss=0.00384]
Predicting: 100%|██████████| 59/59 [00:15<00:00,  3.75it/s]

Test f1: 76.72



Predicting: 100%|██████████| 59/59 [00:17<00:00,  3.49it/s]

              precision    recall  f1-score   support

   B-Disease       0.81      0.71      0.75       960
   I-Disease       0.90      0.69      0.78      1087
           O       0.98      0.99      0.98     22450

   micro avg       0.97      0.97      0.97     24497
   macro avg       0.89      0.80      0.84     24497
weighted avg       0.97      0.97      0.97     24497

CPU times: user 1min 10s, sys: 44.3 s, total: 1min 54s
Wall time: 1min 55s
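
The `Test f1` printed above is a macro average with the `'O'` label ignored. A minimal pure-Python sketch of that calculation follows, assuming the score is the unweighted mean of the per-label F1 over the labels that are not ignored. This assumption is consistent with the report above, where the mean of the rounded `B-Disease` and `I-Disease` scores, (0.75 + 0.78) / 2 ≈ 0.765, is close to the reported 76.72. The helper names `f1_for_label` and `macro_f1` and the toy tags are illustrative, not part of the library.

```python
def f1_for_label(y_true, y_pred, label):
    """Token-level F1 for a single label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, label_list, ignore_label=()):
    """Unweighted mean of per-label F1 over the labels that are not ignored."""
    kept = [label for label in label_list if label not in ignore_label]
    return sum(f1_for_label(y_true, y_pred, label) for label in kept) / len(kept)


# toy, flattened tags: the model misses the single I-Disease token
label_list = ["B-Disease", "I-Disease", "O"]
y_true = ["O"] * 8 + ["B-Disease", "I-Disease"]
y_pred = ["O"] * 8 + ["B-Disease", "O"]

f1_all = macro_f1(y_true, y_pred, label_list)
f1_entities = macro_f1(y_true, y_pred, label_list, ignore_label=["O"])
print("macro f1 with 'O': %.3f, without 'O': %.3f" % (f1_all, f1_entities))
```

In the toy data the dominant `'O'` class lifts the macro average from 0.5 to about 0.65, which is why entity-only scores are the more informative number for NER.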





For span level stats let's use the conlleval.pl script:

In [34]:
# write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./other_examples/conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 904 phrases; correct: 631.
accuracy:  96.72%; precision:  69.80%; recall:  65.73%; FB1:  67.70
          Disease: precision:  69.80%; recall:  65.73%; FB1:  67.70  904


Get the token predictions for our sample to compare with the SciBERT and BioBERT models later:

In [35]:
# get the token predicts 
i = 9
tokens = X_test[i]
labels = y_test[i]
bert_preds  = y_preds[i]

<a id='NCBI_SciBERT'></a>
### finetune SciBERT

There are 4 SciBERT models available: 


* `scibert-scivocab-uncased` 


* `scibert-scivocab-cased`


* `scibert-basevocab-uncased`


* `scibert-basevocab-cased`


See the [`SciBERT` github](https://github.com/allenai/scibert) and [paper](https://arxiv.org/pdf/1903.10676.pdf) for more info.

While `scibert-scivocab-uncased` is the recommended model on the website, we will use `'scibert-scivocab-cased'` for the demo so that the results are comparable with the BERT and BioBERT models. Note that, as before, the first run will take some extra time to download the model into the file cache:

In [36]:
%%time
model = BertTokenClassifier('scibert-scivocab-cased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=2,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.,                            
                            label_list=label_list,
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test,'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None,
          bert_model='scibert-scivocab-cased', bert_vocab=None,
          do_lower_case=None, epochs=3, eval_batch_size=16, fp16=False,
          from_tf=False, gradient_accumulation_steps=2, ignore_label=['O'],
          label_list=['B-Disease', 'I-Disease', 'O'], learning_rate=3e-05,
          local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
          max_seq_length=178, num_mlp_hiddens=500, num_mlp_layers=0,
          random_state=42, restore_file=None, train_batch_size=16,
          use_cuda=True, validation_fraction=0.0, warmup_proportion=0.1)


100%|██████████| 410521600/410521600 [00:21<00:00, 18680427.63B/s]


Loading scibert-scivocab-cased model...


100%|██████████| 410521600/410521600 [00:18<00:00, 21932301.40B/s]


Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 600, validation data size: 0


Training  : 100%|██████████| 75/75 [00:23<00:00,  3.36it/s, loss=0.0342]
Training  : 100%|██████████| 75/75 [00:23<00:00,  3.10it/s, loss=0.00651]
Training  : 100%|██████████| 75/75 [00:24<00:00,  3.31it/s, loss=0.0025] 
Predicting: 100%|██████████| 59/59 [00:10<00:00,  6.12it/s]


Test f1: 85.33


Predicting: 100%|██████████| 59/59 [00:11<00:00,  5.23it/s]

              precision    recall  f1-score   support

   B-Disease       0.85      0.85      0.85       960
   I-Disease       0.89      0.83      0.86      1087
           O       0.99      0.99      0.99     22450

   micro avg       0.98      0.98      0.98     24497
   macro avg       0.91      0.89      0.90     24497
weighted avg       0.98      0.98      0.98     24497

CPU times: user 1min 22s, sys: 41.4 s, total: 2min 4s
Wall time: 2min 25s





In [12]:
# write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./other_examples/conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1013 phrases; correct: 777.
accuracy:  97.84%; precision:  76.70%; recall:  80.94%; FB1:  78.76
          Disease: precision:  76.70%; recall:  80.94%; FB1:  78.76  1013


In [36]:
# get the token predicts 
i = 9
tokens = X_test[i]
labels = y_test[i]
scibert_preds  = y_preds[i]

<a id='NCBI_BioBERT'></a>
### finetune BioBERT

There are 4 **`BioBERT`** models available:

* `'biobert-v1.0-pmc-base-cased'`


* `'biobert-v1.0-pubmed-base-cased'`


* `'biobert-v1.0-pubmed-pmc-base-cased'` 


* `'biobert-v1.1-pubmed-base-cased'` 

See [BioBERT github](https://github.com/dmis-lab/biobert) and [paper](https://arxiv.org/pdf/1901.08746.pdf)  for more info.

The **`BioBERT`** models are archived as TensorFlow checkpoints. We will use the Hugging Face PyTorch BERT port to convert them to a PyTorch model; however, TensorFlow needs to be installed on the system for the initial conversion. If it is not installed, run:

`pip install tensorflow-gpu`
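
A quick way to check whether TensorFlow is importable before starting the conversion is sketched below, using only the standard library. The helper `is_installed` is hypothetical, and the check only confirms that a top-level package named `tensorflow` can be found; it does not verify the version or GPU support.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("tensorflow"):
    print("tensorflow found, the checkpoint conversion can run")
else:
    print("tensorflow not found, install it before loading a BioBERT model")
```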

We will use the most recent release, `biobert-v1.1-pubmed-base-cased`, for the demo:

In [37]:
%%time
model = BertTokenClassifier('biobert-v1.1-pubmed-base-cased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=2,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0,                            
                            label_list=label_list,
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test,'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None,
          bert_model='biobert-v1.1-pubmed-base-cased', bert_vocab=None,
          do_lower_case=None, epochs=3, eval_batch_size=16, fp16=False,
          from_tf=False, gradient_accumulation_steps=2, ignore_label=['O'],
          label_list=['B-Disease', 'I-Disease', 'O'], learning_rate=3e-05,
          local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
          max_seq_length=178, num_mlp_hiddens=500, num_mlp_layers=0,
          random_state=42, restore_file=None, train_batch_size=16,
          use_cuda=True, validation_fraction=0, warmup_proportion=0.1)


100%|██████████| 401403346/401403346 [00:27<00:00, 14423439.71B/s]


Loading biobert-v1.1-pubmed-base-cased model...


100%|██████████| 401403346/401403346 [00:23<00:00, 16765954.84B/s]


Defaulting to linear classifier/regressor
Loading Tensorflow checkpoint from  model.ckpt-1000000
train data size: 600, validation data size: 0


Training: 100%|██████████| 75/75 [00:25<00:00,  3.13it/s, loss=0.0471]
Training: 100%|██████████| 75/75 [00:24<00:00,  3.13it/s, loss=0.00758]
Training: 100%|██████████| 75/75 [00:25<00:00,  2.97it/s, loss=0.00332]
Predicting: 100%|██████████| 59/59 [00:12<00:00,  4.39it/s]

Test f1: 85.36



Predicting: 100%|██████████| 59/59 [00:14<00:00,  4.36it/s]

              precision    recall  f1-score   support

   B-Disease       0.84      0.85      0.84       960
   I-Disease       0.88      0.85      0.86      1087
           O       0.99      0.99      0.99     22450

   micro avg       0.98      0.98      0.98     24497
   macro avg       0.90      0.90      0.90     24497
weighted avg       0.98      0.98      0.98     24497

CPU times: user 1min 31s, sys: 44.1 s, total: 2min 15s
Wall time: 2min 50s





In [38]:
# get span level stats with conlleval.pl script
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./other_examples/conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1006 phrases; correct: 776.
accuracy:  97.84%; precision:  77.14%; recall:  80.83%; FB1:  78.94
          Disease: precision:  77.14%; recall:  80.83%; FB1:  78.94  1006


In [39]:
# get the token predicts
i = 9
tokens = X_test[i]
labels = y_test[i]
biobert_preds  = y_preds[i]

###  examine the token tags from the 3 models

Compare the token predictions between BERT, SciBERT, and BioBERT. Remember these results are based on just 600 training samples. See [ner_NCBI_disease_BioBERT_SciBERT](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_NCBI_disease_BioBERT_SciBERT.ipynb) for a finetune demo on all the data with the various models. On the full dataset both SciBERT and BioBERT are pretty close and noticeably better than standard BERT.
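
Besides reading the table row by row, one way to summarise such a comparison is to list the token positions where each model disagrees with the gold label. The sketch below is illustrative only: `disagreements` is a hypothetical helper, and the tags cover just tokens 9 to 14 of the sample sentence ("tumour DNA from patients with B"), copied from this demo's output.

```python
def disagreements(labels, predictions, offset=0):
    """Map each model name to the token positions where it differs from the gold labels."""
    return {
        name: [offset + i
               for i, (gold, pred) in enumerate(zip(labels, preds))
               if gold != pred]
        for name, preds in predictions.items()
    }


# tokens 9..14 of the sample sentence
labels = ["B-Disease", "O", "O", "O", "O", "B-Disease"]
predictions = {
    "bert":    ["B-Disease", "O", "O", "O", "O", "O"],
    "scibert": ["B-Disease", "O", "O", "O", "O", "B-Disease"],
    "biobert": ["O", "O", "O", "O", "O", "B-Disease"],
}

errors = disagreements(labels, predictions, offset=9)
for name, positions in errors.items():
    print("%8s: %d wrong token(s) at %s" % (name, len(positions), positions))
```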

In [40]:
data = {"token": tokens,"label": labels, "bert": bert_preds,"scibert": scibert_preds, "biobert": biobert_preds}
df=pd.DataFrame(data=data)
print(df)

         token      label       bert    scibert    biobert
0   Occasional          O          O          O          O
1     missense          O          O          O          O
2    mutations          O          O          O          O
3           in          O          O          O          O
4          ATM          O          O          O          O
5         were          O          O          O          O
6         also          O          O          O          O
7        found          O          O          O          O
8           in          O          O          O          O
9       tumour  B-Disease  B-Disease  B-Disease          O
10         DNA          O          O          O          O
11        from          O          O          O          O
12    patients          O          O          O          O
13        with          O          O          O          O
14           B  B-Disease          O  B-Disease  B-Disease
15           -  I-Disease          O  I-Disease  I-Disea