## NCBI Disease Corpus

The NCBI disease corpus task is a Named Entity Recognition (NER) task in the biomedical domain. The data is a collection of 793 PubMed abstracts annotated for disease entities. Each entity token has a `'B-'` or `'I-'` tag indicating whether it begins an entity or lies inside one. The `'O'` tag means the token is not part of a named entity. See this [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655/) for more information.
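To make the tagging scheme concrete, here is a minimal sketch that groups `'B-'`/`'I-'` tags back into entity spans. The helper name `bio_to_spans` is hypothetical and not part of `bert_sklearn`; the tokens are a shortened version of a test sentence shown later in this notebook.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, start, end_exclusive, text) tuples."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        # an 'I-' tag continues the open span only if the entity type matches
        continues = prefix == "I" and start is not None and etype == label
        if start is not None and not continues:
            spans.append((etype, start, i, " ".join(tokens[start:i])))
            start, etype = None, None
        # 'B-' always opens a span; a stray 'I-' with no open span opens one too
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label
    if start is not None:
        spans.append((etype, start, len(tags), " ".join(tokens[start:])))
    return spans


tokens = ["found", "in", "tumour", "DNA", "from", "patients", "with",
          "B", "-", "cell", "non", "-", "Hodgkins", "lymphomas"]
tags = ["O", "O", "B-Disease", "O", "O", "O", "O",
        "B-Disease", "I-Disease", "I-Disease", "I-Disease",
        "I-Disease", "I-Disease", "I-Disease"]

for span in bio_to_spans(tokens, tags):
    print(span)
```

Treating a stray `'I-'` tag as the start of a new span is a design choice made here for robustness; other decoders may simply drop such tokens.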


NER tasks are token classification tasks where the data consists of features, `X`, and labels, `y`, where:

* **`X`** : is a list of lists of tokens, one inner list per sentence


* **`y`** : is a list of lists of NER tags, aligned one-to-one with the tokens in `X`
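As a small illustration of this layout, the sketch below builds a two-sentence `X` and `y` by hand and checks that every token has exactly one tag. The sentences are shortened from the training data, and `count_aligned_tokens` is a hypothetical helper, not a `bert_sklearn` function.

```python
# two shortened sentences; each inner list is one sentence
X = [
    ["In", "colon", "carcinoma", "cells", ","],
    ["Here", ",", "we", "report"],
]
y = [
    ["O", "B-Disease", "I-Disease", "O", "O"],
    ["O", "O", "O", "O"],
]


def count_aligned_tokens(X, y):
    """Return the total token count, raising if tokens and tags are misaligned."""
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of sentences")
    total = 0
    for i, (tokens, tags) in enumerate(zip(X, y)):
        if len(tokens) != len(tags):
            raise ValueError("sentence %d: %d tokens but %d tags"
                             % (i, len(tokens), len(tags)))
        total += len(tokens)
    return total


print("%d sentences, %d tokens" % (len(X), count_aligned_tokens(X, y)))
```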


We will finetune models from:


* [**BERT**](#NCBI_BERT) - this is the standard `BERT` base cased model from Google, pretrained on the BooksCorpus and English Wikipedia data.



* [**SciBERT**](#NCBI_SciBERT) - `SciBERT` is a model from [AllenAI](https://allenai.org/) based on `BERT` but pretrained on scientific text.  For more information on `SciBERT`, see the [github repo](https://github.com/allenai/scibert) and [paper](https://arxiv.org/pdf/1903.10676.pdf).



* [**BioBERT**](#NCBI_BioBERT) - `BioBERT` is a model also based on `BERT` but pretrained on biomedical text.  For more information on `BioBERT`, see the [github repo](https://github.com/dmis-lab/biobert) and [paper](https://arxiv.org/pdf/1901.08746.pdf).


### get NCBI data
We can download the NCBI disease data from the AllenAI `SciBERT` github repo:

In [1]:
%%bash
DATADIR="NCBI_disease"
if test ! -d "$DATADIR";then
    echo "Creating $DATADIR dir"
    mkdir "$DATADIR"
    cd "$DATADIR"
    wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/NCBI-disease/dev.txt
    wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/NCBI-disease/test.txt
    wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/NCBI-disease/train.txt
fi

In [1]:
import os
import math
import random
import csv
import sys

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import classification_report
import statistics as stats

sys.path.append("../") 
from bert_sklearn import BertTokenClassifier
from bert_sklearn import load_model

def read_tsv(filename, quotechar=None):
    with open(filename, "r", encoding='utf-8') as f:
        return list(csv.reader(f, delimiter="\t", quotechar=quotechar))   

def flatten(l):
    return [item for sublist in l for item in sublist]

def read_CoNLL2003_format(filename, idx=3):
    """Read file in CoNLL-2003 shared task format"""
    
    # read file, closing the handle when done
    with open(filename, encoding="utf-8") as f:
        lines = f.read().strip()
    
    # find sentence-like boundaries
    lines = lines.split("\n\n")  
    
    # split on newlines
    lines = [line.split("\n") for line in lines]
    
    # get tokens
    tokens = [[l.split()[0] for l in line] for line in lines]
    
    # get labels/tags
    labels = [[l.split()[idx] for l in line] for line in lines]
    
    #convert to df
    data= {'tokens': tokens, 'labels': labels}
    df=pd.DataFrame(data=data)
    return df

DATADIR = "NCBI_disease/"

def get_data(trainfile=DATADIR + "train.txt",
             devfile=DATADIR + "dev.txt",
             testfile=DATADIR + "test.txt"):

    train = read_CoNLL2003_format(trainfile, idx=3)    
    dev = read_CoNLL2003_format(devfile, idx=3)
    
    # combine train and dev
    train = pd.concat([train, dev])
    print("Train and dev data: %d sentences, %d tokens"%(len(train),len(flatten(train.tokens))))

    test = read_CoNLL2003_format(testfile, idx=3)
    print("Test data: %d sentences, %d tokens"%(len(test),len(flatten(test.tokens))))
    
    return train, test

train, test = get_data()

X_train, y_train = train.tokens, train.labels
X_test, y_test = test.tokens, test.labels

print(len(train))

label_list = np.unique(flatten(y_train))
label_list = list(label_list)
print("\nNER tags:",label_list)

Train and dev data: 6347 sentences, 159670 tokens
Test data: 940 sentences, 24497 tokens
6347

NER tags: ['B-Disease', 'I-Disease', 'O']
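To show what the reader function does with the file contents, here is a self-contained sketch of the same parsing logic applied to an in-memory string. It assumes four whitespace-separated columns per line with the NER tag in the last column (hence `idx=3`) and blank lines between sentences; the middle two columns in this sample are made-up placeholders, and `parse_conll` is a hypothetical name.

```python
# made-up sample in a CoNLL-2003-like layout: token, two placeholder columns, tag
sample = (
    "colon NN O B-Disease\n"
    "carcinoma NN O I-Disease\n"
    "cells NN O O\n"
    "\n"
    "we NN O O\n"
    "report NN O O\n"
)


def parse_conll(text, idx=3):
    """Split text into sentences, then return parallel token and tag lists."""
    tokens, labels = [], []
    for block in text.strip().split("\n\n"):
        rows = [line.split() for line in block.splitlines() if line.strip()]
        tokens.append([row[0] for row in rows])
        labels.append([row[idx] for row in rows])
    return tokens, labels


sample_tokens, sample_labels = parse_conll(sample)
print(sample_tokens)
print(sample_labels)
```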


In [2]:
train.head()

| | tokens | labels |
|---|---|---|
| 0 | [Identification, of, APC2, ,, a, homologue, of... | [O, O, O, O, O, O, O, O, B-Disease, I-Disease,... |
| 1 | [The, adenomatous, polyposis, coli, (, APC, ),... | [O, B-Disease, I-Disease, I-Disease, I-Disease... |
| 2 | [Complex, formation, induces, the, rapid, degr... | [O, O, O, O, O, O, O, O, O] |
| 3 | [In, colon, carcinoma, cells, ,, loss, of, APC... | [O, B-Disease, I-Disease, O, O, O, O, O, O, O,... |
| 4 | [Here, ,, we, report, the, identification, and... | [O, O, O, O, O, O, O, O, O, O, O, O, O] |


Let's take a closer look at the (tokens, labels) pair for one example from the test data:

In [4]:
i = 9
tokens = X_test[i]
labels = y_test[i]

data = {"token": tokens,"label": labels}
df=pd.DataFrame(data=data)
print(df)

         token      label
0   Occasional          O
1     missense          O
2    mutations          O
3           in          O
4          ATM          O
5         were          O
6         also          O
7        found          O
8           in          O
9       tumour  B-Disease
10         DNA          O
11        from          O
12    patients          O
13        with          O
14           B  B-Disease
15           -  I-Disease
16        cell  I-Disease
17         non  I-Disease
18           -  I-Disease
19    Hodgkins  I-Disease
20   lymphomas  I-Disease
21           (          O
22           B  B-Disease
23           -  I-Disease
24         NHL  I-Disease
25           )          O
26         and          O
27           a          O
28           B  B-Disease
29           -  I-Disease
30         NHL  I-Disease
31        cell          O
32        line          O
33           .          O


<a id='NCBI_BERT'></a>
## BERT base cased model

We will finetune a BERT base cased model. As in the CoNLL 2003 NER task, we will check the token sequence lengths to set the **`max_seq_length`** parameter in the model:
    
* The **`max_seq_length`** parameter  will dictate how long a token sequence we can handle. All input token sequences longer than this will be truncated. The limit on this is 512, but we would like smaller sequences since they are much faster and consume less memory on the GPU. 
    
    
* Each token will be tokenized again by the BERT wordpiece tokenizer. This will result in longer token sequences than the input token lists. 
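To illustrate why wordpiece tokenization lengthens the sequences, here is a toy greedy longest-match-first splitter. The vocabulary below is invented for this example; the real `bert-base-cased` vocabulary is much larger and may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# hypothetical toy vocabulary, for illustration only
toy_vocab = {"B", "-", "cell", "non", "Hod", "##g", "##kins",
             "lymph", "##oma", "##s"}

words = ["B", "-", "cell", "non", "-", "Hodgkins", "lymphomas"]
pieces = [p for w in words for p in wordpiece(w, toy_vocab)]

print(pieces)
print("%d input tokens -> %d wordpieces" % (len(words), len(pieces)))
```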
    
Let's check our wordpiece sequence lengths by running the data through the BERT wordpiece tokenizer:

In [5]:
model = BertTokenClassifier('bert-base-cased')
print("Bert wordpiece tokenizer max token length in train: %d tokens"% model.get_max_token_len(X_train))
print("Bert wordpiece tokenizer max token length in test: %d tokens"% model.get_max_token_len(X_test))

Building sklearn token classifier...


100%|██████████| 213450/213450 [00:01<00:00, 201278.39B/s]


Bert wordpiece tokenizer max token length in train: 176 tokens
Bert wordpiece tokenizer max token length in test: 155 tokens


So as long as we set **`max_seq_length`** to at least 178 = 176 + 2 (for the `'[CLS]'` and `'[SEP]'` delimiter tokens that BERT uses), none of the data will be truncated.

If we set **`max_seq_length`** to less than 178, we can still finetune the model, but we will lose the training signal from truncated tokens in the training data. Also, at prediction time, any tokens that have been truncated will be assigned the majority label, `'O'`.
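The arithmetic above can be sketched in a few lines. The helper names are hypothetical, the list of lengths is made up apart from the two maxima reported above, and the padding function only mimics the described behaviour rather than reproducing the `bert_sklearn` internals.

```python
def required_max_seq_length(max_wordpiece_len, num_special_tokens=2):
    """Smallest max_seq_length that leaves room for '[CLS]' and '[SEP]'."""
    return max_wordpiece_len + num_special_tokens


def count_truncated(wordpiece_lengths, max_seq_length, num_special_tokens=2):
    """Count sequences whose wordpieces do not fit in the available budget."""
    budget = max_seq_length - num_special_tokens
    return sum(1 for n in wordpiece_lengths if n > budget)


def pad_predictions(tokens, preds, fill_label="O"):
    """Fill in the majority label for tokens that lost their prediction."""
    return list(preds) + [fill_label] * (len(tokens) - len(preds))


print(required_max_seq_length(176))

lengths = [12, 40, 176, 155]  # illustrative wordpiece lengths
print(count_truncated(lengths, 178))
print(count_truncated(lengths, 128))

print(pad_predictions(["a", "b", "c", "d"], ["O", "B-Disease"]))
```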



### finetune BERT

We will include an **`ignore_label`** option to exclude the `'O'` (non-named-entity) label when calculating `f1`. Non-entity tokens are the large majority of the labels, and `f1` is typically reported with this class excluded.
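To see the effect of excluding the majority class, here is a minimal pure-Python sketch of a token-level macro `f1` with an ignored label. It is an illustration under our own assumptions, not the `bert_sklearn` scoring code, and the tiny tag sequences are invented. A comparable result can be obtained with scikit-learn's `f1_score` by passing the kept classes via `labels=` with `average='macro'`.

```python
def f1_for_label(y_true, y_pred, label):
    """Token-level F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, ignore_label=("O",)):
    """Unweighted mean of per-class F1 over classes that are not ignored."""
    labels = sorted(set(y_true) - set(ignore_label))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


# invented flattened tags: one 'I-Disease' is missed, one is predicted wrongly
y_true_demo = ["O", "B-Disease", "I-Disease", "O", "O", "B-Disease", "O", "O"]
y_pred_demo = ["O", "B-Disease", "O", "O", "O", "B-Disease", "I-Disease", "O"]

print("macro f1 without 'O': %.2f" % macro_f1(y_true_demo, y_pred_demo))
print("macro f1 with 'O':    %.2f" % macro_f1(y_true_demo, y_pred_demo, ignore_label=()))
```

Including `'O'` raises the average even though the entity predictions are unchanged, which is why the class is usually left out.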


In [6]:
%%time
model = BertTokenClassifier('bert-base-cased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=4,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0,                            
                            label_list=label_list,
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test,'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None, bert_model='bert-base-cased',
          bert_vocab=None, do_lower_case=None, epochs=3,
          eval_batch_size=16, fp16=False, from_tf=False,
          gradient_accumulation_steps=4, ignore_label=['O'],
          label_list=['B-Disease', 'I-Disease', 'O'], learning_rate=3e-05,
          local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
          max_seq_length=178, num_mlp_hiddens=500, num_mlp_layers=0,
          random_state=42, restore_file=None, train_batch_size=16,
          use_cuda=True, validation_fraction=0, warmup_proportion=0.1)
Loading bert-base-cased model...


100%|██████████| 404400730/404400730 [02:27<00:00, 2746409.68B/s] 


Defaulting to linear classifier/regressor
Loading Pytorch checkpoint from  pytorch_model.bin
train data size: 6347, validation data size: 0


Training: 100%|██████████| 1587/1587 [04:33<00:00,  6.16it/s, loss=0.0161]
Training: 100%|██████████| 1587/1587 [05:40<00:00,  4.45it/s, loss=0.00329]
Training: 100%|██████████| 1587/1587 [06:25<00:00,  5.10it/s, loss=0.00133]
Predicting:   0%|          | 0/59 [00:00<?, ?it/s]         

Test f1: 88.87


                                                           

              precision    recall  f1-score   support

   B-Disease       0.87      0.89      0.88       960
   I-Disease       0.87      0.92      0.89      1087
           O       0.99      0.99      0.99     22450

   micro avg       0.98      0.98      0.98     24497
   macro avg       0.91      0.94      0.92     24497
weighted avg       0.98      0.98      0.98     24497

CPU times: user 11min 43s, sys: 6min 6s, total: 17min 50s
Wall time: 19min 52s




For span-level stats, run the original [perl script](https://www.clips.uantwerpen.be/conll2003/ner/bin/conlleval) that was used to evaluate results for the `CoNLL-2000/2003 shared task`:

In [7]:
# write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1012 phrases; correct: 839.
accuracy:  98.36%; precision:  82.91%; recall:  87.40%; FB1:  85.09
          Disease: precision:  82.91%; recall:  87.40%; FB1:  85.09  1012
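The span-level figures above follow directly from the three phrase counts in the first output line. The sketch below recomputes them; `span_prf` is a hypothetical helper, and the two hand-written span sets only illustrate exact-match counting, under the assumption that a predicted span counts as correct when its type and boundaries both match.

```python
def span_prf(correct, found, gold):
    """Precision, recall and F1 from span counts."""
    precision = correct / found if found else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# counts taken from the conlleval output above
p, r, f = span_prf(correct=839, found=1012, gold=960)
print("precision: %.2f%%; recall: %.2f%%; FB1: %.2f" % (100 * p, 100 * r, 100 * f))

# exact-match counting on invented (type, start, end) spans
gold_spans = {("Disease", 2, 3), ("Disease", 7, 14)}
pred_spans = {("Disease", 2, 3), ("Disease", 7, 12)}  # second span ends too early
n_correct = len(gold_spans & pred_spans)
print(span_prf(n_correct, len(pred_spans), len(gold_spans)))
```

A span with even one wrong boundary earns no credit here, which is why the span-level `FB1` is lower than the token-level `f1` reported earlier.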


In [8]:
i = 9
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token      label    predict
0   Occasional          O          O
1     missense          O          O
2    mutations          O          O
3           in          O          O
4          ATM          O          O
5         were          O          O
6         also          O          O
7        found          O          O
8           in          O          O
9       tumour  B-Disease  B-Disease
10         DNA          O          O
11        from          O          O
12    patients          O          O
13        with          O          O
14           B  B-Disease  B-Disease
15           -  I-Disease  I-Disease
16        cell  I-Disease  I-Disease
17         non  I-Disease  I-Disease
18           -  I-Disease  I-Disease
19    Hodgkins  I-Disease  I-Disease
20   lymphomas  I-Disease  I-Disease
21           (          O          O
22           B  B-Disease  B-Disease
23           -  I-Disease  I-Disease
24         NHL  I-Disease  I-Disease
25           )          O          O
2

In [9]:
# calculate the probability of each class
y_probs = model.predict_proba(X_test)

# pprint out probs for this observation
prob = y_probs[i]
tokens_prob = model.tokens_proba(tokens, prob)

                                                           

         token  B-Disease  I-Disease    O
0   Occasional       0.00       0.00 1.00
1     missense       0.00       0.00 1.00
2    mutations       0.00       0.00 1.00
3           in       0.00       0.00 1.00
4          ATM       0.00       0.00 1.00
5         were       0.00       0.00 1.00
6         also       0.00       0.00 1.00
7        found       0.00       0.00 1.00
8           in       0.00       0.00 1.00
9       tumour       1.00       0.00 0.00
10         DNA       0.00       0.00 1.00
11        from       0.00       0.00 1.00
12    patients       0.00       0.00 1.00
13        with       0.00       0.00 1.00
14           B       0.95       0.00 0.05
15           -       0.00       0.96 0.04
16        cell       0.00       0.95 0.05
17         non       0.01       0.99 0.00
18           -       0.00       1.00 0.00
19    Hodgkins       0.00       1.00 0.00
20   lymphomas       0.00       1.00 0.00
21           (       0.00       0.00 1.00
22           B       0.98       0.



<a id='NCBI_SciBERT'></a>
## SciBERT

There are 4 SciBERT models available: 


* `scibert-scivocab-cased`


* `scibert-scivocab-uncased` 


* `scibert-basevocab-cased`


* `scibert-basevocab-uncased`



See the [`SciBERT` github](https://github.com/allenai/scibert) and [paper](https://arxiv.org/pdf/1903.10676.pdf) for more info.


###  finetune `'scibert-basevocab-cased'`

In [10]:
%%time
model = BertTokenClassifier(bert_model='scibert-basevocab-cased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=4,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.,                            
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test, 'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None,
          bert_model='scibert-basevocab-cased', bert_vocab=None,
          do_lower_case=None, epochs=3, eval_batch_size=16, fp16=False,
          from_tf=False, gradient_accumulation_steps=4, ignore_label=['O'],
          label_list=None, learning_rate=3e-05, local_rank=-1,
          logfile='bert_sklearn.log', loss_scale=0, max_seq_length=178,
          num_mlp_hiddens=500, num_mlp_layers=0, random_state=42,
          restore_file=None, train_batch_size=16, use_cuda=True,
          validation_fraction=0.0, warmup_proportion=0.1)


100%|██████████| 403916800/403916800 [00:21<00:00, 18565376.32B/s]


Loading scibert-basevocab-cased model...


100%|██████████| 403916800/403916800 [00:48<00:00, 8373221.90B/s] 


Defaulting to linear classifier/regressor
Loading Pytorch checkpoint from  pytorch_model.bin
train data size: 6347, validation data size: 0


Training: 100%|██████████| 1587/1587 [04:39<00:00,  5.78it/s, loss=0.0148]
Training: 100%|██████████| 1587/1587 [05:52<00:00,  5.25it/s, loss=0.00285]
Training: 100%|██████████| 1587/1587 [06:32<00:00,  4.23it/s, loss=0.00116]
Predicting:   0%|          | 0/59 [00:00<?, ?it/s]         

Test f1: 90.75


                                                           

              precision    recall  f1-score   support

   B-Disease       0.89      0.92      0.90       960
   I-Disease       0.89      0.94      0.91      1087
           O       1.00      0.99      0.99     22450

   micro avg       0.99      0.99      0.99     24497
   macro avg       0.92      0.95      0.94     24497
weighted avg       0.99      0.99      0.99     24497

CPU times: user 11min 32s, sys: 6min 51s, total: 18min 23s
Wall time: 19min 4s




In [12]:
# For span level stats, write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1013 phrases; correct: 871.
accuracy:  98.65%; precision:  85.98%; recall:  90.73%; FB1:  88.29
          Disease: precision:  85.98%; recall:  90.73%; FB1:  88.29  1013


In [13]:
# look at predicted token labels for a test example
i = 9
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token      label    predict
0   Occasional          O          O
1     missense          O          O
2    mutations          O          O
3           in          O          O
4          ATM          O          O
5         were          O          O
6         also          O          O
7        found          O          O
8           in          O          O
9       tumour  B-Disease  B-Disease
10         DNA          O          O
11        from          O          O
12    patients          O          O
13        with          O          O
14           B  B-Disease  B-Disease
15           -  I-Disease  I-Disease
16        cell  I-Disease  I-Disease
17         non  I-Disease  I-Disease
18           -  I-Disease  I-Disease
19    Hodgkins  I-Disease  I-Disease
20   lymphomas  I-Disease  I-Disease
21           (          O          O
22           B  B-Disease  B-Disease
23           -  I-Disease  I-Disease
24         NHL  I-Disease  I-Disease
25           )          O          O
2

### finetune  `'scibert-basevocab-uncased'`

In [3]:
%%time

model = BertTokenClassifier(bert_model='scibert-basevocab-uncased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=4,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.,                            
                            label_list=label_list,
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test, 'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None,
          bert_model='scibert-basevocab-uncased', bert_vocab=None,
          do_lower_case=None, epochs=3, eval_batch_size=16, fp16=False,
          from_tf=False, gradient_accumulation_steps=4, ignore_label=['O'],
          label_list=['B-Disease', 'I-Disease', 'O'], learning_rate=3e-05,
          local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
          max_seq_length=178, num_mlp_hiddens=500, num_mlp_layers=0,
          random_state=42, restore_file=None, train_batch_size=16,
          use_cuda=True, validation_fraction=0.0, warmup_proportion=0.1)
Loading scibert-basevocab-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint from  pytorch_model.bin
train data size: 6347, validation data size: 0


Training: 100%|██████████| 1587/1587 [04:36<00:00,  5.99it/s, loss=0.0148]
Training: 100%|██████████| 1587/1587 [05:56<00:00,  4.80it/s, loss=0.00311]
Training: 100%|██████████| 1587/1587 [06:45<00:00,  3.42it/s, loss=0.00128]
Predicting:   0%|          | 0/59 [00:00<?, ?it/s]         

Test f1: 89.34


                                                           

              precision    recall  f1-score   support

   B-Disease       0.87      0.91      0.89       960
   I-Disease       0.87      0.92      0.90      1087
           O       1.00      0.99      0.99     22450

   micro avg       0.98      0.98      0.98     24497
   macro avg       0.91      0.94      0.93     24497
weighted avg       0.98      0.98      0.98     24497

CPU times: user 11min 44s, sys: 6min 27s, total: 18min 12s
Wall time: 18min 10s




In [4]:
# For span level stats, write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1028 phrases; correct: 853.
accuracy:  98.43%; precision:  82.98%; recall:  88.85%; FB1:  85.81
          Disease: precision:  82.98%; recall:  88.85%; FB1:  85.81  1028


In [5]:
# look at predicted token labels for a test example
i = 9
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token      label    predict
0   Occasional          O          O
1     missense          O          O
2    mutations          O          O
3           in          O          O
4          ATM          O          O
5         were          O          O
6         also          O          O
7        found          O          O
8           in          O          O
9       tumour  B-Disease  B-Disease
10         DNA          O          O
11        from          O          O
12    patients          O          O
13        with          O          O
14           B  B-Disease  B-Disease
15           -  I-Disease  I-Disease
16        cell  I-Disease  I-Disease
17         non  I-Disease  I-Disease
18           -  I-Disease  I-Disease
19    Hodgkins  I-Disease  I-Disease
20   lymphomas  I-Disease  I-Disease
21           (          O          O
22           B  B-Disease  B-Disease
23           -  I-Disease  I-Disease
24         NHL  I-Disease  I-Disease
25           )          O          O
2

### finetune  `'scibert-scivocab-cased'`

In [18]:
%%time
model = BertTokenClassifier(bert_model='scibert-scivocab-cased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=4,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.,                            
                            label_list=label_list,
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test,'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None,
          bert_model='scibert-scivocab-cased', bert_vocab=None,
          do_lower_case=None, epochs=3, eval_batch_size=16, fp16=False,
          from_tf=False, gradient_accumulation_steps=4, ignore_label=['O'],
          label_list=['B-Disease', 'I-Disease', 'O'], learning_rate=3e-05,
          local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
          max_seq_length=178, num_mlp_hiddens=500, num_mlp_layers=0,
          random_state=42, restore_file=None, train_batch_size=16,
          use_cuda=True, validation_fraction=0.0, warmup_proportion=0.1)


100%|██████████| 410521600/410521600 [00:37<00:00, 10961204.04B/s]


Loading scibert-scivocab-cased model...


100%|██████████| 410521600/410521600 [00:20<00:00, 20012169.06B/s]


Defaulting to linear classifier/regressor
Loading Pytorch checkpoint from  pytorch_model.bin
train data size: 6347, validation data size: 0


Training: 100%|██████████| 1587/1587 [04:41<00:00,  5.53it/s, loss=0.0123]
Training: 100%|██████████| 1587/1587 [05:59<00:00,  4.37it/s, loss=0.00281]
Training: 100%|██████████| 1587/1587 [06:34<00:00,  4.41it/s, loss=0.00111]
Predicting:   0%|          | 0/59 [00:00<?, ?it/s]         

Test f1: 90.51


                                                           

              precision    recall  f1-score   support

   B-Disease       0.89      0.92      0.90       960
   I-Disease       0.89      0.94      0.91      1087
           O       1.00      0.99      0.99     22450

   micro avg       0.99      0.99      0.99     24497
   macro avg       0.92      0.95      0.93     24497
weighted avg       0.99      0.99      0.99     24497

CPU times: user 11min 30s, sys: 6min 57s, total: 18min 28s
Wall time: 19min 1s




In [19]:
# For span level stats, write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1012 phrases; correct: 865.
accuracy:  98.62%; precision:  85.47%; recall:  90.10%; FB1:  87.73
          Disease: precision:  85.47%; recall:  90.10%; FB1:  87.73  1012


In [20]:
# look at predicted token labels for a test example
i = 9
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token      label    predict
0   Occasional          O          O
1     missense          O          O
2    mutations          O          O
3           in          O          O
4          ATM          O          O
5         were          O          O
6         also          O          O
7        found          O          O
8           in          O          O
9       tumour  B-Disease  B-Disease
10         DNA          O          O
11        from          O          O
12    patients          O          O
13        with          O          O
14           B  B-Disease  B-Disease
15           -  I-Disease  I-Disease
16        cell  I-Disease  I-Disease
17         non  I-Disease  I-Disease
18           -  I-Disease  I-Disease
19    Hodgkins  I-Disease  I-Disease
20   lymphomas  I-Disease  I-Disease
21           (          O          O
22           B  B-Disease  B-Disease
23           -  I-Disease  I-Disease
24         NHL  I-Disease  I-Disease
25           )          O          O
2

### finetune `'scibert-scivocab-uncased'`

In [21]:
%%time
model = BertTokenClassifier(bert_model='scibert-scivocab-uncased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=4,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.,                            
                            label_list=label_list,
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test,'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None,
          bert_model='scibert-scivocab-uncased', bert_vocab=None,
          do_lower_case=None, epochs=3, eval_batch_size=16, fp16=False,
          from_tf=False, gradient_accumulation_steps=4, ignore_label=['O'],
          label_list=['B-Disease', 'I-Disease', 'O'], learning_rate=3e-05,
          local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
          max_seq_length=178, num_mlp_hiddens=500, num_mlp_layers=0,
          random_state=42, restore_file=None, train_batch_size=16,
          use_cuda=True, validation_fraction=0.0, warmup_proportion=0.1)


100%|██████████| 410593280/410593280 [00:25<00:00, 16160852.87B/s]


Loading scibert-scivocab-uncased model...


100%|██████████| 410593280/410593280 [00:36<00:00, 11296127.79B/s]


Defaulting to linear classifier/regressor
Loading Pytorch checkpoint from  pytorch_model.bin
train data size: 6347, validation data size: 0


Training: 100%|██████████| 1587/1587 [05:19<00:00,  4.80it/s, loss=0.0149]
Training: 100%|██████████| 1587/1587 [06:35<00:00,  5.00it/s, loss=0.00282]
Training: 100%|██████████| 1587/1587 [06:56<00:00,  4.73it/s, loss=0.00117]
Predicting:   0%|          | 0/59 [00:00<?, ?it/s]         

Test f1: 90.27


                                                           

              precision    recall  f1-score   support

   B-Disease       0.88      0.92      0.90       960
   I-Disease       0.88      0.93      0.90      1087
           O       1.00      0.99      0.99     22450

   micro avg       0.99      0.99      0.99     24497
   macro avg       0.92      0.95      0.93     24497
weighted avg       0.99      0.99      0.99     24497

CPU times: user 12min 37s, sys: 7min 38s, total: 20min 15s
Wall time: 20min 45s




In [22]:
# For span level stats, write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1034 phrases; correct: 872.
accuracy:  98.59%; precision:  84.33%; recall:  90.83%; FB1:  87.46
          Disease: precision:  84.33%; recall:  90.83%; FB1:  87.46  1034


In [23]:
# look at predicted token labels for a test example
i = 9
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token      label    predict
0   Occasional          O          O
1     missense          O          O
2    mutations          O          O
3           in          O          O
4          ATM          O          O
5         were          O          O
6         also          O          O
7        found          O          O
8           in          O          O
9       tumour  B-Disease  B-Disease
10         DNA          O          O
11        from          O          O
12    patients          O          O
13        with          O          O
14           B  B-Disease  B-Disease
15           -  I-Disease  I-Disease
16        cell  I-Disease  I-Disease
17         non  I-Disease  I-Disease
18           -  I-Disease  I-Disease
19    Hodgkins  I-Disease  I-Disease
20   lymphomas  I-Disease  I-Disease
21           (          O          O
22           B  B-Disease  B-Disease
23           -  I-Disease  I-Disease
24         NHL  I-Disease  I-Disease
25           )          O          O
2

<a id='NCBI_BioBERT'></a>

## BioBERT

There are 4 **`BioBERT`** models available:

* `'biobert-v1.0-pmc-base-cased'`


* `'biobert-v1.0-pubmed-base-cased'`


* `'biobert-v1.0-pubmed-pmc-base-cased'` 


* `'biobert-v1.1-pubmed-base-cased'` 

See the [BioBERT github repo](https://github.com/dmis-lab/biobert) and [paper](https://arxiv.org/pdf/1901.08746.pdf) for more information.
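
Since the same finetuning recipe is applied to each of the four checkpoints, the runs can also be expressed as a loop. The following is a minimal sketch under that assumption; `compare_models` and the `make_model` factory are hypothetical helpers introduced here for illustration.

```python
# Model names listed above
BIOBERT_MODELS = [
    "biobert-v1.0-pmc-base-cased",
    "biobert-v1.0-pubmed-base-cased",
    "biobert-v1.0-pubmed-pmc-base-cased",
    "biobert-v1.1-pubmed-base-cased",
]

def compare_models(model_names, make_model, X_train, y_train, X_test, y_test):
    """Fit one model per name and return {name: test score}.

    `make_model` is a caller-supplied factory (hypothetical) that maps a
    model name to an unfitted estimator exposing fit() and score().
    """
    scores = {}
    for name in model_names:
        model = make_model(name)      # fresh estimator for each checkpoint
        model.fit(X_train, y_train)   # finetune on the training split
        scores[name] = model.score(X_test, y_test, "macro")
    return scores
```

With `bert_sklearn`, the factory could be something like `lambda name: BertTokenClassifier(name, max_seq_length=178, epochs=3, gradient_accumulation_steps=4, learning_rate=3e-5, train_batch_size=16, eval_batch_size=16, validation_fraction=0.0, ignore_label=['O'])`, mirroring the cells in this section. Each run took roughly 20 minutes in the logs shown here, so the full loop is not quick.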


### finetune `'biobert-v1.0-pmc-base-cased'`

In [24]:
%%time
model = BertTokenClassifier('biobert-v1.0-pmc-base-cased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=4,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.0,                            
                            label_list=label_list,
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)


# score model on test data
f1_test = model.score(X_test, y_test, 'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None,
          bert_model='biobert-v1.0-pmc-base-cased', bert_vocab=None,
          do_lower_case=None, epochs=3, eval_batch_size=16, fp16=False,
          from_tf=False, gradient_accumulation_steps=4, ignore_label=['O'],
          label_list=['B-Disease', 'I-Disease', 'O'], learning_rate=3e-05,
          local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
          max_seq_length=178, num_mlp_hiddens=500, num_mlp_layers=0,
          random_state=42, restore_file=None, train_batch_size=16,
          use_cuda=True, validation_fraction=0.0, warmup_proportion=0.1)


100%|██████████| 402110819/402110819 [00:24<00:00, 16241433.83B/s]


Loading biobert-v1.0-pmc-base-cased model...


100%|██████████| 402110819/402110819 [01:04<00:00, 6204009.53B/s] 


Defaulting to linear classifier/regressor
Loading Tensorflow checkpoint from  biobert_model.ckpt
train data size: 6347, validation data size: 0


Training: 100%|██████████| 1587/1587 [05:09<00:00,  5.60it/s, loss=0.0153]
Training: 100%|██████████| 1587/1587 [06:41<00:00,  4.00it/s, loss=0.00283]
Training: 100%|██████████| 1587/1587 [07:02<00:00,  3.82it/s, loss=0.00109]

Test f1: 90.27


                                                           

              precision    recall  f1-score   support

   B-Disease       0.88      0.91      0.90       960
   I-Disease       0.89      0.93      0.91      1087
           O       1.00      0.99      0.99     22450

   micro avg       0.99      0.99      0.99     24497
   macro avg       0.92      0.95      0.93     24497
weighted avg       0.99      0.99      0.99     24497

CPU times: user 12min 31s, sys: 7min 47s, total: 20min 19s
Wall time: 21min 15s
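
The `Test f1` of 90.27 is lower than the `macro avg` F1 of 0.93 in the classification report. That is what one would expect if `ignore_label=['O']` leaves the `'O'` class out of the average: the mean of the `B-Disease` and `I-Disease` F1 values (0.90 and 0.91) is about 0.905. The plain-Python sketch below illustrates that calculation on a toy example; it is our reading of the numbers, not code taken from `bert_sklearn`.

```python
def macro_f1(y_true, y_pred, ignore_labels=("O",)):
    """Macro-averaged F1 in percent over all labels except `ignore_labels`.

    `y_true` and `y_pred` are flat lists of tags of equal length.
    """
    labels = sorted((set(y_true) | set(y_pred)) - set(ignore_labels))
    f1s = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(f1s) / len(f1s) if f1s else 0.0

# Toy example: one spurious B-Disease prediction on the last token
toy_true = ["O", "B-Disease", "I-Disease", "O", "B-Disease", "O"]
toy_pred = ["O", "B-Disease", "I-Disease", "O", "B-Disease", "B-Disease"]
print("macro f1 (ignoring 'O'): %0.02f" % macro_f1(toy_true, toy_pred))
```

Because `'O'` tokens dominate the corpus and are almost always predicted correctly, including them would inflate the average, so excluding them gives a more informative score.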




In [26]:
# For span level stats, write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1023 phrases; correct: 860.
accuracy:  98.57%; precision:  84.07%; recall:  89.58%; FB1:  86.74
          Disease: precision:  84.07%; recall:  89.58%; FB1:  86.74  1023


In [27]:
# look at predicted token labels for a test example
i = 9
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token      label    predict
0   Occasional          O          O
1     missense          O          O
2    mutations          O          O
3           in          O          O
4          ATM          O          O
5         were          O          O
6         also          O          O
7        found          O          O
8           in          O          O
9       tumour  B-Disease  B-Disease
10         DNA          O          O
11        from          O          O
12    patients          O          O
13        with          O          O
14           B  B-Disease  B-Disease
15           -  I-Disease  I-Disease
16        cell  I-Disease  I-Disease
17         non  I-Disease  I-Disease
18           -  I-Disease  I-Disease
19    Hodgkins  I-Disease  I-Disease
20   lymphomas  I-Disease  I-Disease
21           (          O          O
22           B  B-Disease  B-Disease
23           -  I-Disease  I-Disease
24         NHL  I-Disease  I-Disease
25           )          O          O

### finetune `'biobert-v1.0-pubmed-base-cased'`

In [39]:
%%time
model = BertTokenClassifier('biobert-v1.0-pubmed-base-cased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=4,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.0,                            
                            #label_list=label_list,
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test, 'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None,
          bert_model='biobert-v1.0-pubmed-base-cased', bert_vocab=None,
          do_lower_case=None, epochs=3, eval_batch_size=16, fp16=False,
          from_tf=False, gradient_accumulation_steps=4, ignore_label=['O'],
          label_list=None, learning_rate=3e-05, local_rank=-1,
          logfile='bert_sklearn.log', loss_scale=0, max_seq_length=178,
          num_mlp_hiddens=500, num_mlp_layers=0, random_state=42,
          restore_file=None, train_batch_size=16, use_cuda=True,
          validation_fraction=0.0, warmup_proportion=0.1)
Loading biobert-v1.0-pubmed-base-cased model...
Defaulting to linear classifier/regressor
Loading Tensorflow checkpoint from  biobert_model.ckpt
train data size: 6347, validation data size: 0


Training: 100%|██████████| 1587/1587 [05:51<00:00,  4.33it/s, loss=0.0153]
Training: 100%|██████████| 1587/1587 [07:03<00:00,  4.24it/s, loss=0.00263]
Training: 100%|██████████| 1587/1587 [06:57<00:00,  4.35it/s, loss=0.00116]

Test f1: 90.83


                                                           

              precision    recall  f1-score   support

   B-Disease       0.89      0.92      0.90       960
   I-Disease       0.89      0.94      0.91      1087
           O       1.00      0.99      0.99     22450

   micro avg       0.99      0.99      0.99     24497
   macro avg       0.93      0.95      0.94     24497
weighted avg       0.99      0.99      0.99     24497

CPU times: user 12min 35s, sys: 8min 8s, total: 20min 43s
Wall time: 20min 41s
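
The 1587 iterations per epoch in the training log are consistent with each batch of 16 being split into `gradient_accumulation_steps` smaller forward/backward passes. The arithmetic below assumes the per-pass batch size is `train_batch_size // gradient_accumulation_steps`; this matches the log but is an assumption about the library's internals.

```python
import math

def iterations_per_epoch(n_examples, train_batch_size, gradient_accumulation_steps):
    """Number of forward/backward passes per epoch under the stated assumption."""
    micro_batch = train_batch_size // gradient_accumulation_steps
    return math.ceil(n_examples / micro_batch)

# Values from the cell above: 6347 training examples, batch size 16, 4 accumulation steps
n_iters = iterations_per_epoch(6347, 16, 4)
print(n_iters)
```

Under this assumption each pass processes 4 examples, and the gradients of 4 consecutive passes are accumulated before a weight update, which keeps the effective batch size at 16 while reducing peak GPU memory.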




In [40]:
# For span level stats, write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1009 phrases; correct: 865.
accuracy:  98.68%; precision:  85.73%; recall:  90.10%; FB1:  87.86
          Disease: precision:  85.73%; recall:  90.10%; FB1:  87.86  1009


In [41]:
# look at predicted token labels for a test example
i = 9
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token      label    predict
0   Occasional          O          O
1     missense          O          O
2    mutations          O          O
3           in          O          O
4          ATM          O          O
5         were          O          O
6         also          O          O
7        found          O          O
8           in          O          O
9       tumour  B-Disease  B-Disease
10         DNA          O          O
11        from          O          O
12    patients          O          O
13        with          O          O
14           B  B-Disease  B-Disease
15           -  I-Disease  I-Disease
16        cell  I-Disease  I-Disease
17         non  I-Disease  I-Disease
18           -  I-Disease  I-Disease
19    Hodgkins  I-Disease  I-Disease
20   lymphomas  I-Disease  I-Disease
21           (          O          O
22           B  B-Disease  B-Disease
23           -  I-Disease  I-Disease
24         NHL  I-Disease  I-Disease
25           )          O          O

### finetune `'biobert-v1.0-pubmed-pmc-base-cased'`

In [31]:
%%time

model = BertTokenClassifier('biobert-v1.0-pubmed-pmc-base-cased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=4,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.0,                            
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test, 'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None,
          bert_model='biobert-v1.0-pubmed-pmc-base-cased', bert_vocab=None,
          do_lower_case=None, epochs=3, eval_batch_size=16, fp16=False,
          from_tf=False, gradient_accumulation_steps=4, ignore_label=['O'],
          label_list=None, learning_rate=3e-05, local_rank=-1,
          logfile='bert_sklearn.log', loss_scale=0, max_seq_length=178,
          num_mlp_hiddens=500, num_mlp_layers=0, random_state=42,
          restore_file=None, train_batch_size=16, use_cuda=True,
          validation_fraction=0.0, warmup_proportion=0.1)


100%|██████████| 402016728/402016728 [01:01<00:00, 6488488.15B/s] 


Loading biobert-v1.0-pubmed-pmc-base-cased model...


100%|██████████| 402016728/402016728 [00:23<00:00, 17169864.90B/s]


Defaulting to linear classifier/regressor
Loading Tensorflow checkpoint from  biobert_model.ckpt
train data size: 6347, validation data size: 0


Training: 100%|██████████| 1587/1587 [05:26<00:00,  3.71it/s, loss=0.0156]
Training: 100%|██████████| 1587/1587 [06:50<00:00,  4.73it/s, loss=0.00273]
Training: 100%|██████████| 1587/1587 [07:06<00:00,  4.12it/s, loss=0.0011] 

Test f1: 90.88


                                                           

              precision    recall  f1-score   support

   B-Disease       0.89      0.92      0.90       960
   I-Disease       0.89      0.94      0.91      1087
           O       1.00      0.99      0.99     22450

   micro avg       0.99      0.99      0.99     24497
   macro avg       0.93      0.95      0.94     24497
weighted avg       0.99      0.99      0.99     24497

CPU times: user 12min 54s, sys: 7min 53s, total: 20min 48s
Wall time: 21min 41s




In [32]:
# For span level stats, write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1016 phrases; correct: 872.
accuracy:  98.68%; precision:  85.83%; recall:  90.83%; FB1:  88.26
          Disease: precision:  85.83%; recall:  90.83%; FB1:  88.26  1016


In [33]:
# look at predicted token labels for a test example
i = 9
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token      label    predict
0   Occasional          O          O
1     missense          O          O
2    mutations          O          O
3           in          O          O
4          ATM          O          O
5         were          O          O
6         also          O          O
7        found          O          O
8           in          O          O
9       tumour  B-Disease  B-Disease
10         DNA          O          O
11        from          O          O
12    patients          O          O
13        with          O          O
14           B  B-Disease  B-Disease
15           -  I-Disease  I-Disease
16        cell  I-Disease  I-Disease
17         non  I-Disease  I-Disease
18           -  I-Disease  I-Disease
19    Hodgkins  I-Disease  I-Disease
20   lymphomas  I-Disease  I-Disease
21           (          O          O
22           B  B-Disease  B-Disease
23           -  I-Disease  I-Disease
24         NHL  I-Disease  I-Disease
25           )          O          O

### finetune `'biobert-v1.1-pubmed-base-cased'` 

In [35]:
%%time
model = BertTokenClassifier('biobert-v1.1-pubmed-base-cased',
                            max_seq_length=178,
                            epochs=3,
                            gradient_accumulation_steps=4,
                            learning_rate=3e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.0,                            
                            ignore_label=['O'])

print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test, 'macro')
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_config_json=None,
          bert_model='biobert-v1.1-pubmed-base-cased', bert_vocab=None,
          do_lower_case=None, epochs=3, eval_batch_size=16, fp16=False,
          from_tf=False, gradient_accumulation_steps=4, ignore_label=['O'],
          label_list=None, learning_rate=3e-05, local_rank=-1,
          logfile='bert_sklearn.log', loss_scale=0, max_seq_length=178,
          num_mlp_hiddens=500, num_mlp_layers=0, random_state=42,
          restore_file=None, train_batch_size=16, use_cuda=True,
          validation_fraction=0.0, warmup_proportion=0.1)


100%|██████████| 401403346/401403346 [00:43<00:00, 9231261.98B/s] 


Loading biobert-v1.1-pubmed-base-cased model...


100%|██████████| 401403346/401403346 [01:09<00:00, 5774462.83B/s] 


Defaulting to linear classifier/regressor
Loading Tensorflow checkpoint from  model.ckpt-1000000
train data size: 6347, validation data size: 0


Training: 100%|██████████| 1587/1587 [05:01<00:00,  4.06it/s, loss=0.0149]
Training: 100%|██████████| 1587/1587 [06:39<00:00,  3.81it/s, loss=0.00272]
Training: 100%|██████████| 1587/1587 [06:59<00:00,  4.17it/s, loss=0.00107]

Test f1: 90.45


                                                           

              precision    recall  f1-score   support

   B-Disease       0.88      0.92      0.90       960
   I-Disease       0.88      0.94      0.91      1087
           O       1.00      0.99      0.99     22450

   micro avg       0.99      0.99      0.99     24497
   macro avg       0.92      0.95      0.93     24497
weighted avg       0.99      0.99      0.99     24497

CPU times: user 12min 38s, sys: 7min 32s, total: 20min 11s
Wall time: 21min 25s




In [36]:
# For span level stats, write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 24497 tokens with 960 phrases; found: 1018 phrases; correct: 863.
accuracy:  98.62%; precision:  84.77%; recall:  89.90%; FB1:  87.26
          Disease: precision:  84.77%; recall:  89.90%; FB1:  87.26  1018


In [37]:
# look at predicted token labels for a test example
i = 9
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token      label    predict
0   Occasional          O          O
1     missense          O          O
2    mutations          O          O
3           in          O          O
4          ATM          O          O
5         were          O          O
6         also          O          O
7        found          O          O
8           in          O          O
9       tumour  B-Disease  B-Disease
10         DNA          O          O
11        from          O          O
12    patients          O          O
13        with          O          O
14           B  B-Disease  B-Disease
15           -  I-Disease  I-Disease
16        cell  I-Disease  I-Disease
17         non  I-Disease  I-Disease
18           -  I-Disease  I-Disease
19    Hodgkins  I-Disease  I-Disease
20   lymphomas  I-Disease  I-Disease
21           (          O          O
22           B  B-Disease  B-Disease
23           -  I-Disease  I-Disease
24         NHL  I-Disease  I-Disease
25           )          O          O
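
To compare the four BioBERT runs side by side, the span-level scores from the conlleval outputs above can be collected and ranked. The snippet below is a small convenience sketch; the numbers are copied from this section and nothing is recomputed.

```python
# Span-level scores in percent: (precision, recall, FB1),
# copied from the conlleval output of each BioBERT run above
results = {
    "biobert-v1.0-pmc-base-cased": (84.07, 89.58, 86.74),
    "biobert-v1.0-pubmed-base-cased": (85.73, 90.10, 87.86),
    "biobert-v1.0-pubmed-pmc-base-cased": (85.83, 90.83, 88.26),
    "biobert-v1.1-pubmed-base-cased": (84.77, 89.90, 87.26),
}

# Rank by FB1, best first
ranked = sorted(results.items(), key=lambda kv: kv[1][2], reverse=True)

print("%-36s %9s %7s %6s" % ("model", "precision", "recall", "FB1"))
for name, (precision, recall, fb1) in ranked:
    print("%-36s %9.2f %7.2f %6.2f" % (name, precision, recall, fb1))
```

The spread between the best and worst FB1 is about 1.5 points, and each model was trained once with `random_state=42`, so small differences in this ranking should be read with some caution.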