<a href="https://colab.research.google.com/github/dilaratank/BioBert-Symptom-Tracking/blob/main/PIPELINE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Complete pipeline for symptom recognition (NER) using BERT
Training code is taken and adapted from https://github.com/eugenesiow/practical-ml

- Data
  - Getting the data
  - Data preprocessing
- Training
  - BC5CDR NER
  - Custom Data
- Inference

## Imports and Installations

In [21]:
# Installations (run before the imports)
!pip install -q transformers
!pip install -q simpletransformers

# Imports
import logging
import urllib.request
import zipfile
from pathlib import Path

import pandas as pd
from simpletransformers.ner import NERModel
from transformers import AutoTokenizer

Please restart the runtime after running this cell.

## Data

TODO: explain something about data here?

### Getting the data

In [5]:
# Raw medical dialogue data
with zipfile.ZipFile('/content/data/raw_data.zip') as zip_ref:
  zip_ref.extractall('/content/data/')

# NER BC5CDR data
def download_file(url, output_file):
  Path(output_file).parent.mkdir(parents=True, exist_ok=True)
  urllib.request.urlretrieve(url, output_file)

download_file('https://raw.githubusercontent.com/shreyashub/BioFLAIR/master/data/ner/bc5cdr/train.txt', '/content/data/BC5CDR/train.txt')
download_file('https://raw.githubusercontent.com/shreyashub/BioFLAIR/master/data/ner/bc5cdr/test.txt', '/content/data/BC5CDR/test.txt')
download_file('https://raw.githubusercontent.com/shreyashub/BioFLAIR/master/data/ner/bc5cdr/dev.txt', '/content/data/BC5CDR/dev.txt')

### Data preprocessing

In [23]:
# TODO: data preprocessing for raw medical dialogue data

# NER BC5CDR data
def read_conll(filename):
    df = pd.read_csv(filename,
                    sep = '\t', header = None, keep_default_na = False,
                    names = ['words', 'pos', 'chunk', 'labels'],
                    quoting = 3, skip_blank_lines = False)
    df = df[~df['words'].astype(str).str.startswith('-DOCSTART- ')] # Remove the -DOCSTART- header
    df['sentence_id'] = (df.words == '').cumsum()
    return df[df.words != '']

BC5CDR_train_df = read_conll('/content/data/BC5CDR/train.txt')
BC5CDR_test_df = read_conll('/content/data/BC5CDR/test.txt')
BC5CDR_dev_df = read_conll('/content/data/BC5CDR/dev.txt')
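
The blank-line convention that `read_conll` relies on can be illustrated without pandas. The following is a minimal sketch, not the notebook's implementation: `group_conll_sentences` and the sample rows are made up for illustration, and the sample only assumes the four tab-separated columns named in `read_conll` (`words`, `pos`, `chunk`, `labels`).

```python
def group_conll_sentences(text):
    """Group tab-separated CoNLL lines into sentences, splitting on blank lines."""
    sentences = []  # each sentence is a list of (word, label) pairs
    current = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        # First column is the word, last column is the NER label.
        current.append((columns[0], columns[-1]))
    if current:
        # Flush the final sentence if the text does not end with a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample in the assumed four-column layout (not real BC5CDR rows).
sample = (
    "Aspirin\tNN\tB-NP\tB-Chemical\n"
    "caused\tVBD\tB-VP\tO\n"
    "nausea\tNN\tB-NP\tB-Disease\n"
    "\n"
    "No\tDT\tB-NP\tO\n"
    "fever\tNN\tI-NP\tB-Disease\n"
)

sentences = group_conll_sentences(sample)
labels = sorted({label for sentence in sentences for _, label in sentence})
print(len(sentences), labels)
```

Unlike `read_conll`, which derives a `sentence_id` column from a running count of blank lines, this sketch identifies sentences by their position in the list. The set of distinct labels it collects corresponds to the label list that is later passed to the model.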

## Training

### BC5CDR NER

Set parameters and custom labels

In [24]:
train_args_BC5CDR = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'sliding_window': True,
    'max_seq_length': 64,
    'num_train_epochs': 10,
    'train_batch_size': 32,
    'fp16': True,
    'output_dir': '/outputs/BC5CDR/',
    'best_model_dir': '/outputs/BC5CDR/best_model/',
    'evaluate_during_training': True,
}

custom_labels_BC5CDR = list(BC5CDR_train_df['labels'].unique())

Train model

In [25]:
logging.basicConfig(level=logging.DEBUG)
transformers_logger = logging.getLogger('transformers')
transformers_logger.setLevel(logging.WARNING)

# We use the bio BERT pre-trained model.
model = NERModel('bert', 'dmis-lab/biobert-v1.1', labels=custom_labels_BC5CDR, args=train_args_BC5CDR)
model.train_model(BC5CDR_train_df, eval_data=BC5CDR_dev_df)

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /dmis-lab/biobert-v1.1/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /dmis-lab/biobert-v1.1/resolve/main/pytorch_model.bin HTTP/1.1" 302 0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at dmis-lab/biobert-v1.1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /dmis-lab/biobert-v1.1/resolve/main/vocab.txt HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface

  0%|          | 0/2 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 0 of 10:   0%|          | 0/124 [00:00<?, ?it/s]

INFO:simpletransformers.ner.ner_model: Converting to features started.


  0%|          | 0/3949 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/494 [00:00<?, ?it/s]

INFO:simpletransformers.ner.ner_model: Training of bert model complete. Saved to /outputs/BC5CDR/.


(1240,
 {'eval_loss': [0.08479625847744887,
   0.07737277924631786,
   0.08965948429585506,
   0.10285149812635914,
   0.10627035042063639,
   0.1253798841797195,
   0.13076278711708766,
   0.13749964082107688,
   0.1434740877798375,
   0.14529182969652368],
  'f1_score': [0.8782727521180651,
   0.8922797844528624,
   0.9015612161051766,
   0.9016920111372886,
   0.9049842545336084,
   0.9033881068350906,
   0.9053711201079622,
   0.9071589567457335,
   0.903629938965628,
   0.902751703052084],
  'global_step': [124, 248, 372, 496, 620, 744, 868, 992, 1116, 1240],
  'precision': [0.8852892561983471,
   0.8780846371941615,
   0.9107913669064748,
   0.8904399323181049,
   0.9060665362035225,
   0.89178907323259,
   0.9012358946802794,
   0.8978117697046951,
   0.8922605201945443,
   0.8930276981852913],
  'recall': [0.8713665943600868,
   0.906941431670282,
   0.8925162689804772,
   0.913232104121475,
   0.9039045553145336,
   0.91529284164859,
   0.9095444685466377,
   0.916702819956616
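
The per-epoch metrics printed above can be scanned to see which epoch did best on the dev set. This is a minimal sketch that uses only the `eval_loss`, `f1_score` and `global_step` values copied from the output; the variable names are illustrative, and it does not check which criterion the library itself uses when writing to `best_model_dir`.

```python
# Values copied verbatim from the training output above.
eval_loss = [0.08479625847744887, 0.07737277924631786, 0.08965948429585506,
             0.10285149812635914, 0.10627035042063639, 0.1253798841797195,
             0.13076278711708766, 0.13749964082107688, 0.1434740877798375,
             0.14529182969652368]
f1_score = [0.8782727521180651, 0.8922797844528624, 0.9015612161051766,
            0.9016920111372886, 0.9049842545336084, 0.9033881068350906,
            0.9053711201079622, 0.9071589567457335, 0.903629938965628,
            0.902751703052084]
global_step = [124, 248, 372, 496, 620, 744, 868, 992, 1116, 1240]

# Index of the epoch with the lowest dev loss and with the highest dev F1.
best_loss_idx = min(range(len(eval_loss)), key=eval_loss.__getitem__)
best_f1_idx = max(range(len(f1_score)), key=f1_score.__getitem__)

print(f"lowest eval_loss: epoch {best_loss_idx + 1} (step {global_step[best_loss_idx]})")
print(f"highest f1_score: epoch {best_f1_idx + 1} (step {global_step[best_f1_idx]})")
```

On these numbers the dev loss is lowest after the second epoch and rises afterwards, while the F1 score peaks at the eighth epoch. Fewer than 10 epochs may therefore be enough for this setup.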

Let's see how this model would perform on our custom data.

In [None]:
# TODO: show inference with first line or conversation in data. 

### Custom Data

In [None]:
# TODO: train on custom data

Let's see how this model would perform on our custom data.

In [None]:
# TODO: show inference with first line or conversation in data. 

## Evaluation

### BC5CDR NER

In [26]:
# On BC5CDR test set
result, model_outputs, preds_list = model.eval_model(BC5CDR_test_df)

# TODO: on custom test set

INFO:simpletransformers.ner.ner_model: Converting to features started.


  0%|          | 0/4139 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/518 [00:00<?, ?it/s]

INFO:simpletransformers.ner.ner_model:{'eval_loss': 0.15501047888215366, 'precision': 0.8795230263157895, 'recall': 0.9089556995644321, 'f1_score': 0.893997178830782}
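
As a quick sanity check on the test-set result, the reported F1 score can be recomputed from the reported precision and recall. This sketch assumes the score is the usual harmonic mean, F1 = 2PR / (P + R); the three numbers are copied from the log line above.

```python
# Values copied from the evaluation log above.
precision = 0.8795230263157895
recall = 0.9089556995644321
reported_f1 = 0.893997178830782

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recomputed F1: {f1:.6f}")
print(f"difference from reported F1: {abs(f1 - reported_f1):.2e}")
```

The recomputed value agrees with the logged `f1_score`. It is slightly below the best dev-set F1 seen during training (about 0.907).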


### Custom Model

In [None]:
# TODO: on custom test set

## THOUGHTS (will be deleted)
- Could I also use a pre-trained RoBERTa model?