<a href="https://colab.research.google.com/github/arunvajra/API/blob/master/Procedure_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BioBERT Fine-Tuning for Disease NER
In this notebook we fine-tune a BioBERT base model (`dmis-lab/biobert-base-cased-v1.1`) for disease named-entity recognition (NER) on biomedical text.

In [1]:
!pip install transformers==3.1.0
!pip install seqeval
!pip install tensorflow

Collecting transformers==3.1.0
  Downloading transformers-3.1.0-py3-none-any.whl (884 kB)
Collecting tokenizers==0.8.1.rc2
  Downloading tokenizers-0.8.1rc2-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Collecting sentencepiece!=0.1.92
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sentencepiece, sacremoses, transformers
Successfully installed sacremoses-0.0.49 sentencepiece-0.1.96 tokenizers-0.8.1rc2 transformers-3.1.0
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Building wheels for collected packages

In [2]:
!git clone https://github.com/dmis-lab/biobert-pytorch.git
%cd biobert-pytorch
!./download.sh

Cloning into 'biobert-pytorch'...
remote: Enumerating objects: 236, done.
remote: Counting objects: 100% (236/236), done.
remote: Compressing objects: 100% (205/205), done.
remote: Total 236 (delta 59), reused 174 (delta 18), pack-reused 0
Receiving objects: 100% (236/236), 1.92 MiB | 10.06 MiB/s, done.
Resolving deltas: 100% (59/59), done.
/content/biobert-pytorch
BIOBERT_DATA not set; downloading to default path ('data').
--2022-03-23 06:41:11--  https://docs.google.com/uc?export=download&confirm=t&id=1cGqvAm9IZ_86C4Mj7Zf-w9CFilYVDl8j
Resolving docs.google.com (docs.google.com)... 209.85.146.113, 209.85.146.138, 209.85.146.100, ...
Connecting to docs.google.com (docs.google.com)|209.85.146.113|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-10-20-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/216n7k1pq2jc2tvdodgv2t58t2v2jmf1/1648017600000/13799006341648886493/*/1cGqvAm9IZ_86C4Mj7Zf-w9CFilYVDl8j?e=do

In [3]:
#!cat /content/214370.conll /content/214371.conll /content/214372.conll /content/214373.conll /content/214374.conll /content/214375.conll /content/214377.conll /content/214379.conll /content/214380.conll /content/214381.conll /content/214382.conll /content/214383.conll /content/214385.conll /content/214387.conll > /content/combined.tsv 

In [4]:
!pwd

/content/biobert-pytorch


In [5]:
# Stage the custom train/dev/test splits in a new dataset directory.
!mkdir -p datasets/NER/NCB-BC5CDR-disease
!cp /content/dev.tsv datasets/NER/NCB-BC5CDR-disease/devel.tsv
!cp /content/train.tsv datasets/NER/NCB-BC5CDR-disease/train.tsv
!cp /content/test.tsv datasets/NER/NCB-BC5CDR-disease/test.tsv
# train_dev.tsv is the training split followed by the dev split.
!cat /content/train.tsv /content/dev.tsv > datasets/NER/NCB-BC5CDR-disease/train_dev.tsv

In [6]:
!rm -rf datasets/NER/NCBI-disease
!mv datasets/NER/NCB-BC5CDR-disease datasets/NER/NCBI-disease
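
Before preprocessing, it can help to confirm which tags the staged files actually contain. The sketch below assumes each non-blank line holds a token and its tag separated by whitespace, with blank lines between sentences; that layout is an assumption about the custom `.tsv` files, not something this notebook verifies. The helper name `count_labels` and the sample lines are made up for illustration.

```python
from collections import Counter

def count_labels(lines):
    """Count the tag in the last whitespace-separated column of each line."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            # Blank lines separate sentences; one-column lines carry no tag.
            continue
        counts[parts[-1]] += 1
    return counts

# Hypothetical sample in the assumed "token<TAB>tag" layout.
sample = [
    "Arterial\tB-bio",
    "stiffness\tI-bio",
    "is\tO",
    "",
    "echocardiography\tB-test-bio",
]
print(count_labels(sample))
```

To run it on a real split, pass an open file handle, for example `count_labels(open("datasets/NER/NCBI-disease/train.tsv"))` from the `biobert-pytorch` directory.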

In [7]:
%cd named-entity-recognition/
!./preprocess.sh

/content/biobert-pytorch/named-entity-recognition
*****  NCBI-disease  Preprocessing Start *****
Replacing Done
Downloading: 100% 433/433 [00:00<00:00, 267kB/s]
Downloading: 100% 213k/213k [00:00<00:00, 2.47MB/s]
*****  NCBI-disease  Preprocessing Done *****
*****  BC5CDR-disease  Preprocessing Start *****
Replacing Done
*****  BC5CDR-disease  Preprocessing Done *****
*****  BC5CDR-chem  Preprocessing Start *****
Replacing Done
*****  BC5CDR-chem  Preprocessing Done *****
*****  BC4CHEMD  Preprocessing Start *****
Replacing Done
*****  BC4CHEMD  Preprocessing Done *****
*****  JNLPBA  Preprocessing Start *****
Replacing Done
*****  JNLPBA  Preprocessing Done *****
*****  BC2GM  Preprocessing Start *****
Replacing Done
*****  BC2GM  Preprocessing Done *****
*****  linnaeus  Preprocessing Start *****
Replacing Done
*****  linnaeus  Preprocessing Done *****
*****  s800  Preprocessing Start *****
Replacing Done
*****  s800  Preprocessing Done *****


In [8]:
!python run_ner.py \
  --data_dir ../datasets/NER/NCBI-disease/ \
  --labels ../datasets/NER/NCBI-disease/labels.txt \
  --model_name_or_path dmis-lab/biobert-base-cased-v1.1 \
  --output_dir output/NCBI-disease \
  --max_seq_length 128 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 32 \
  --save_steps 1000 \
  --seed 1 \
  --do_train \
  --do_eval \
  --do_predict \
  --overwrite_output_dir

03/23/2022 06:52:05 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='output/NCBI-disease', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluate_during_training=False, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Mar23_06-52-04_ec0645df48fe', logging_first_step=False, logging_steps=500, save_steps=1000, save_total_limit=None, no_cuda=False, seed=1, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1, run_name=None, disable_tqdm=False, remove_unused_columns=True)
Downloading: 100% 313/313 [00:00<00:00,

In [9]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline 

tokenizer = AutoTokenizer.from_pretrained("output/NCBI-disease/")
model = AutoModelForTokenClassification.from_pretrained("output/NCBI-disease/")


In [10]:
!pwd

/content/biobert-pytorch/named-entity-recognition


In [77]:
input_text = "While SARS-CoV-2 primarily affects lung tissue, it may cause direct or indirect damage to the cardiovascular system and permanent damage may occur. Arterial stiffness is an early indicator of cardiovascular disease risk. The aim of our study was to establish the potential effects of SARS-CoV-2 on the vascular system evaluated by transthoracic echocardiographic (TTE) examination. SUBJECTS AND METHODS: This study compared arterial stiffness between the survivors of COVID-19 and those without a history of COVID-19 infection. The difference in aortic diameter was examined using echocardiography. RESULTS: The study included 50 patients who survived COVID-19 in the last 3 to 6 months and 50 age- and gender-matched healthy volunteers. In surviving COVID-19 patients, aortic diastolic diameter in cm ([3.1±0.2] vs. [2.9±0.1], p < 0.001), pulse pressure ([43.02±14.05] vs [35.74±9.86], p = 0.004), aortic distensibility ([5.61±3.57] vs [8.31±3.82], p < 0.001), aortic strain ([10.56±4.91] vs [13.88±5.86], p = 0.003), pulse pressure/stroke volume index ([1.25±0.47] vs [0.98±0.28], p = 0.001) and aortic stiffness index ([2.82±0.47] vs [2.46±0.45], p < 0.001) were statistically significant compared to the control group. CONCLUSION: SARS-CoV-2 may cause reduced or impaired aortic elasticity parameters linked to impaired arterial wall function in COVID-19 survivors compared with controls"
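# Build a token-classification pipeline from the fine-tuned model.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
ner_results = nlp(input_text)
# Re-tokenize with the special tokens included ([CLS] at position 0) so that
# token positions should line up with the 'index' field of each prediction.
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(input_text)))
print(ner_results)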

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'word': 'sa', 'score': 0.7829495668411255, 'entity': 'B-treatment-bio', 'index': 2}, {'word': '##rs', 'score': 0.47747647762298584, 'entity': 'I-treatment-bio', 'index': 3}, {'word': '-', 'score': 0.5097053647041321, 'entity': 'I-treatment-bio', 'index': 4}, {'word': 'co', 'score': 0.6564749479293823, 'entity': 'I-treatment-bio', 'index': 5}, {'word': '##v', 'score': 0.6388837695121765, 'entity': 'I-treatment-bio', 'index': 6}, {'word': 'direct', 'score': 0.6683764457702637, 'entity': 'B-bio', 'index': 17}, {'word': 'damage', 'score': 0.867350161075592, 'entity': 'I-bio', 'index': 20}, {'word': 'to', 'score': 0.6013107895851135, 'entity': 'I-bio', 'index': 21}, {'word': 'the', 'score': 0.6165382862091064, 'entity': 'I-bio', 'index': 22}, {'word': 'card', 'score': 0.7481844425201416, 'entity': 'I-bio', 'index': 23}, {'word': '##iovascular', 'score': 0.6552270650863647, 'entity': 'I-bio', 'index': 24}, {'word': 'system', 'score': 0.6695665717124939, 'entity': 'I-bio', 'index': 25}, {'w
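
Each prediction above is a dict with `word`, `score`, `entity` and `index` keys, one per word piece. As a rough way to see which tags dominate and how confident the model is for each, the sketch below counts predictions per tag and averages their scores. The helper name `summarize_by_label` is hypothetical; the sample entries are copied from the output above.

```python
from collections import defaultdict

def summarize_by_label(predictions):
    """Return {tag: (count, mean_score)} for a list of pipeline predictions."""
    scores = defaultdict(list)
    for pred in predictions:
        scores[pred["entity"]].append(pred["score"])
    return {tag: (len(vals), sum(vals) / len(vals)) for tag, vals in scores.items()}

# Three entries taken from the printed pipeline output.
sample = [
    {"word": "sa", "score": 0.7829495668411255, "entity": "B-treatment-bio", "index": 2},
    {"word": "##rs", "score": 0.47747647762298584, "entity": "I-treatment-bio", "index": 3},
    {"word": "direct", "score": 0.6683764457702637, "entity": "B-bio", "index": 17},
]
for tag, (count, mean_score) in summarize_by_label(sample).items():
    print(tag, count, round(mean_score, 3))
```

In the notebook, `summarize_by_label(ner_results)` would give the same summary for the full prediction list.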

In [12]:
!ls output/NCBI-disease

config.json	   special_tokens_map.json  tokenizer_config.json
eval_results.txt   test_predictions.txt     training_args.bin
pytorch_model.bin  test_results.txt	    vocab.txt
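
The `eval_results.txt` and `test_results.txt` files listed above hold the metrics written by `run_ner.py`. The sketch below assumes they use one `name = value` pair per line; that format is an assumption, so check it against the actual files first. The helper name `parse_results` is hypothetical, and the sample text uses placeholder metric names and values, not results from this run.

```python
def parse_results(text):
    """Parse 'name = value' lines into a dict, using floats where possible."""
    results = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        try:
            results[key] = float(value)
        except ValueError:
            # Keep non-numeric values as plain strings.
            results[key] = value
    return results

# Placeholder content only; these are not metrics from this notebook.
example = "metric_a = 0.5\nmetric_b = 0.25\nnote = placeholder\n"
print(parse_results(example))
```

For the real files, something like `parse_results(open("output/NCBI-disease/eval_results.txt").read())` should work if the format assumption holds.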


In [13]:
!cat output/NCBI-disease/config.json


{
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "B-bio",
    "1": "B-test-bio",
    "2": "B-treatment-bio",
    "3": "I-bio",
    "4": "I-test-bio",
    "5": "I-treatment-bio",
    "6": "O"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-bio": 0,
    "B-test-bio": 1,
    "B-treatment-bio": 2,
    "I-bio": 3,
    "I-test-bio": 4,
    "I-treatment-bio": 5,
    "O": 6
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 28996
}
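
The `id2label` map in this config shows that the model predicts three entity types in BIO notation, not only `bio`. A minimal sketch of recovering those types from the config follows; the helper name `entity_types` is hypothetical, and the JSON literal repeats only the `id2label` entries printed above.

```python
import json

def entity_types(config):
    """Return the sorted entity types implied by the BIO tags in id2label."""
    types = set()
    for label in config["id2label"].values():
        prefix, sep, name = label.partition("-")
        # "O" has no separator; only B-/I- tags name an entity type.
        if sep and prefix in ("B", "I"):
            types.add(name)
    return sorted(types)

config = json.loads("""
{"id2label": {"0": "B-bio", "1": "B-test-bio", "2": "B-treatment-bio",
              "3": "I-bio", "4": "I-test-bio", "5": "I-treatment-bio", "6": "O"}}
""")
print(entity_types(config))
```

In the notebook the same function could be applied to `json.load(open("output/NCBI-disease/config.json"))`.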


In [78]:
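entities = []

def prep_token(token):
  # WordPiece continuation pieces carry a '#' marker: strip it so the piece
  # is appended directly. Any other token is a new word, so prefix a space.
  if '#' not in token:
    token = " " + token
  else:
    token = token.replace('#', '')
  return token

# Collect 'bio' entities: a B-bio tag opens a new entity and every later
# I-bio tag is appended to the most recent one. I-bio pieces are appended
# even when they do not directly follow the open entity, so unrelated
# fragments can end up merged.
for token in ner_results:
  if token['entity'] == 'B-bio':
    entities.append(prep_token(token['word']).lstrip())
  elif token['entity'] == 'I-bio' and entities:
    # The `entities` check skips an I-bio that appears before any B-bio.
    entities[-1] += prep_token(token['word'])

print(entities)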

['direct damage to the cardiovascular system', 'permanent damageerial stiffness', 'cardiovascular diseaseerial stiffness', 'co', 'coic diameter', 'co', 'reduced', 'impaired aortic elasticity parameters', 'impaired arterial wall function', 'co']
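
Strings such as 'permanent damageerial stiffness' in the list above suggest that word pieces from different places in the text are being joined. One possible alternative, sketched below, uses the `index` field of each prediction and only extends a span when the next piece is tagged `I-` and directly follows the previous one; any other piece of the requested type starts a new span. This is an illustrative variant, not the notebook's method, and the name `group_entities` is hypothetical. The sample entries come from the pipeline output earlier in the notebook, with scores omitted because the function does not use them.

```python
def group_entities(predictions, entity_type="bio"):
    """Merge adjacent word pieces of one entity type into text spans."""
    spans = []
    last_index = None  # 'index' of the last piece added to the open span
    for pred in predictions:
        prefix, _, name = pred["entity"].partition("-")
        if name != entity_type:
            # A piece of another type closes the open span.
            last_index = None
            continue
        word = pred["word"]
        is_subword = word.startswith("##")
        text = word[2:] if is_subword else word
        continues = (
            prefix == "I"
            and last_index is not None
            and pred["index"] == last_index + 1
        )
        if continues:
            # Glue '##' pieces directly; separate whole words with a space.
            spans[-1] += text if is_subword else " " + text
        else:
            # B- tag, or an I- tag with nothing adjacent to extend.
            spans.append(text)
        last_index = pred["index"]
    return spans

sample = [
    {"word": "sa", "entity": "B-treatment-bio", "index": 2},
    {"word": "direct", "entity": "B-bio", "index": 17},
    {"word": "damage", "entity": "I-bio", "index": 20},
    {"word": "to", "entity": "I-bio", "index": 21},
    {"word": "the", "entity": "I-bio", "index": 22},
    {"word": "card", "entity": "I-bio", "index": 23},
    {"word": "##iovascular", "entity": "I-bio", "index": 24},
    {"word": "system", "entity": "I-bio", "index": 25},
]
print(group_entities(sample))
```

The trade-off is that a span interrupted by untagged tokens is split in two ('direct' and 'damage to the cardiovascular system' here) instead of being joined across the gap. Passing `entity_type="treatment-bio"` or `"test-bio"` would collect the other entity types in the same way.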
