# Domain Shift - Medical

## Setup

In [2]:
#!pip install -r requirements
#import nltk
#nltk.download('punkt')

In [3]:
import torch 

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

### Get datasets

In [21]:
#! python datasets/load_general_dataset.py
#! python datasets/load_multilingual_dataset.py
! python datasets/load_conll2003.py
! python datasets/load_wiki_resized.py
! python datasets/load_ner_multilingual_dataset.py
! python datasets/load_medical_dataset.py
! python datasets/load_medical_ner.py
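
Before launching the long training runs below, it can help to confirm that the loader scripts actually produced the inputs the later cells read. The following is a minimal sketch: the paths are copied from the commands in this notebook, while `EXPECTED_PATHS` and `missing_paths` are hypothetical names, and which script creates which file is an assumption.

```python
from pathlib import Path

# Paths copied from the training commands in this notebook.
# Assumption: the loader scripts above are what create them.
EXPECTED_PATHS = [
    "datasets/medical_domain/mlm/train_pubmed_full.txt",
    "datasets/medical_domain/mlm/val_pubmed_full.txt",
    "datasets/medical_domain/ner/labels.txt",
    "datasets/CoNLL2003",
]


def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    base = Path(root)
    return [p for p in paths if not (base / p).exists()]


# Report anything that is absent relative to the current directory.
for path in missing_paths(EXPECTED_PATHS):
    print("missing:", path)
```

An empty output means every listed input is present relative to the working directory.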

## Pretrain - RUN MLM 

BertCheckpoint: bert-base-cased

Dataset: PubMed

Output Dir: model/output/medical/mlm/base_bert_on_pubmed
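
The run below sets `--mlm_probability 0.10`, so roughly one token in ten is selected for prediction. As an illustration of what that flag controls, here is a sketch of the standard BERT masking recipe (80% `[MASK]`, 10% random token, 10% unchanged) in plain Python. Whether `CharBERT/run_lm_finetuning.py` follows exactly this recipe is an assumption; `mask_tokens` and its arguments are hypothetical names, and special-token handling is omitted.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mlm_probability=0.10, rng=None):
    """Return (corrupted_tokens, labels) using the BERT 80/10/10 recipe.

    labels[i] holds the original token at selected positions and None
    elsewhere, so only selected positions would contribute to the loss.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mlm_probability:
            # Not selected: keep the token and ignore it in the loss.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: random vocabulary token
        else:
            corrupted.append(token)               # 10%: leave unchanged
    return corrupted, labels


tokens = "the patient received a low dose of aspirin".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
print(corrupted)
print(labels)
```

A lower masking rate than BERT's original 15% means fewer prediction targets per block, so each step carries less training signal but the input stays closer to natural text.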


In [17]:
!python CharBERT/run_lm_finetuning.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --output_dir model/output/medical/mlm/base_bert_on_pubmed \
    --train_data_file datasets/medical_domain/mlm/train_pubmed_full.txt \
    --eval_data_file  datasets/medical_domain/mlm/val_pubmed_full.txt \
    --do_train \
    --do_eval \
    --term_vocab CharBERT/data/dict/term_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 3 \
    --char_vocab CharBERT/data/dict/bert_char_vocab \
    --mlm_probability 0.10 \
    --input_nraws 1000 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 10000 \
    --block_size 384 \
    --mlm \
    --overwrite_output_dir


2024-02-13 17:14:13.938095: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-13 17:14:13.940678: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-13 17:14:13.970460: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-13 17:14:13.970495: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-13 17:14:13.971360: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to

## Pretrain - RUN MLM 

BertCheckpoint: BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext

Dataset: PubMed

Output Dir: model/output/medical/mlm/biomed_bert_on_pubmed


In [None]:
!bash model/download_biomednlp_bert.bash

In [9]:
!python CharBERT/run_lm_finetuning.py \
    --model_type bert \
    --model_name_or_path model/input/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext \
    --output_dir model/output/medical/mlm/biomed_bert_on_pubmed \
    --train_data_file datasets/medical_domain/mlm/train_pubmed_full.txt  \
    --eval_data_file datasets/medical_domain/mlm/val_pubmed_full.txt  \
    --do_train \
    --do_eval \
    --term_vocab CharBERT/data/dict/term_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 3 \
    --char_vocab CharBERT/data/dict/bert_char_vocab \
    --mlm_probability 0.10 \
    --input_nraws 1000 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 10000 \
    --block_size 384 \
    --mlm \
    --overwrite_output_dir 

2024-02-13 15:41:55.595437: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-13 15:41:55.598000: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-13 15:41:55.627460: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-13 15:41:55.627493: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-13 15:41:55.628337: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to
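
Both MLM runs use `--per_gpu_train_batch_size 4`, `--num_train_epochs 3` and `--save_steps 10000`. The sketch below estimates how many optimizer steps and checkpoints such a run implies. It assumes the step accounting of the Hugging Face example script that CharBERT's script appears to derive from; CharBERT reads its data in chunks (`--input_nraws`), so the real count may differ. `training_schedule` and `num_examples` are hypothetical, and the number of 384-token blocks in the PubMed file is not stated in this notebook.

```python
import math


def training_schedule(num_examples, per_gpu_batch_size=4, n_gpu=1,
                      grad_accum_steps=1, num_epochs=3, save_steps=10000):
    """Estimate optimizer steps and checkpoint count for one run."""
    batch_size = per_gpu_batch_size * max(1, n_gpu)
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = batches_per_epoch // grad_accum_steps
    total_steps = steps_per_epoch * num_epochs
    return {
        "total_steps": total_steps,
        # One checkpoint every `save_steps` optimizer steps.
        "checkpoints": total_steps // save_steps,
        # Examples consumed between two consecutive checkpoints.
        "examples_per_checkpoint": save_steps * batch_size * grad_accum_steps,
    }


# Hypothetical corpus size, chosen only to illustrate the arithmetic.
print(training_schedule(num_examples=100_000))
```

Under these assumptions a single-GPU run sees 40,000 blocks between checkpoints, which is useful when choosing `--save_steps` against available disk space.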

## Run NER

Pretrained Model: model/output/general/mlm/wikil_eng_cased

Dataset: jnlpba (biomedical-ner)
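
The JNLPBA runs pass a `--labels` file because the tag set differs from CoNLL-2003. As a sketch of how such a file could be derived, the helper below collects the distinct tags from CoNLL-style lines. It assumes one token per line with the tag in the last column and blank lines between sentences; whether `datasets/load_medical_ner.py` writes `labels.txt` this way is an assumption, `collect_labels` is a hypothetical name, and the sample lines are made up for illustration.

```python
def collect_labels(lines):
    """Collect the distinct tags from CoNLL-style lines ("token ... tag")."""
    labels = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            continue  # sentence boundary or document marker
        labels.add(line.split()[-1])
    # Put "O" first, then the remaining tags in sorted order.
    return sorted(labels, key=lambda tag: (tag != "O", tag))


# Illustrative sample only; not taken from the actual dataset files.
sample = [
    "IL-2 B-DNA",
    "gene I-DNA",
    "expression O",
    "",
    "NF-kappa B-protein",
    "B I-protein",
]
print("\n".join(collect_labels(sample)))
```

Writing one tag per line matches the layout that the Hugging Face `run_ner.py` example expects for its `--labels` argument.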

In [None]:
! python CharBERT/run_ner.py --data_dir datasets/medical_domain/ner \
                            --labels datasets/medical_domain/ner/labels.txt \
                            --model_type bert \
                            --model_name_or_path model/output/general/mlm/wikil_eng_cased \
                            --output_dir model/output/medical/ner/base_bert_pretrained_on_wikil/jnlpba \
                            --num_train_epochs 3 \
                            --learning_rate 3e-5 \
                            --char_vocab CharBERT/data/dict/bert_char_vocab \
                            --per_gpu_train_batch_size 6 \
                            --do_train \
                            --do_predict \
                            --overwrite_output_dir \
                            --save_steps 1000

## Run NER

Pretrained Model: model/output/medical/mlm/base_bert_on_pubmed

Dataset: conll2003

In [23]:
! python CharBERT/run_ner.py --data_dir ./datasets/CoNLL2003/ \
                            --model_type bert \
                            --model_name_or_path model/output/medical/mlm/base_bert_on_pubmed \
                            --output_dir model/output/medical/ner/base_bert_pretrained_on_pubmed/conll2003 \
                            --num_train_epochs 3 \
                            --learning_rate 3e-5 \
                            --char_vocab CharBERT/data/dict/bert_char_vocab \
                            --per_gpu_train_batch_size 6 \
                            --do_train \
                            --do_predict \
                            --overwrite_output_dir \
                            --save_steps 1000

2024-02-13 17:40:26.247888: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-13 17:40:26.250548: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-13 17:40:26.280040: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-13 17:40:26.280072: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-13 17:40:26.280900: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to

## Run NER

Pretrained Model: model/output/medical/mlm/base_bert_on_pubmed

Dataset: jnlpba (biomedical-ner)

In [28]:
! python CharBERT/run_ner.py --data_dir datasets/medical_domain/ner \
                            --labels datasets/medical_domain/ner/labels.txt \
                            --model_type bert \
                            --model_name_or_path model/output/medical/mlm/base_bert_on_pubmed \
                            --output_dir model/output/medical/ner/base_bert_pretrained_on_pubmed/jnlpba \
                            --num_train_epochs 3 \
                            --learning_rate 3e-5 \
                            --char_vocab CharBERT/data/dict/bert_char_vocab \
                            --per_gpu_train_batch_size 6 \
                            --do_train \
                            --do_predict \
                            --overwrite_output_dir \
                            --save_steps 1000

2024-02-13 18:11:28.129488: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-13 18:11:28.132062: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-13 18:11:28.161967: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-13 18:11:28.162004: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-13 18:11:28.162853: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to

## Run NER

Pretrained Model: model/output/medical/mlm/biomed_bert_on_pubmed

Dataset: conll2003

In [None]:
! python CharBERT/run_ner.py --data_dir ./datasets/CoNLL2003/ \
                            --model_type bert \
                            --model_name_or_path model/output/medical/mlm/biomed_bert_on_pubmed \
                            --output_dir model/output/medical/ner/biomed_bert_pretrained_on_pubmed/conll2003 \
                            --num_train_epochs 3 \
                            --learning_rate 3e-5 \
                            --char_vocab CharBERT/data/dict/bert_char_vocab \
                            --per_gpu_train_batch_size 6 \
                            --do_train \
                            --do_predict \
                            --overwrite_output_dir \
                            --save_steps 1000

## Run NER

Pretrained Model: model/output/medical/mlm/biomed_bert_on_pubmed

Dataset: jnlpba (biomedical-ner)

In [None]:
! python CharBERT/run_ner.py --data_dir datasets/medical_domain/ner \
                            --labels datasets/medical_domain/ner/labels.txt \
                            --model_type bert \
                            --model_name_or_path model/output/medical/mlm/biomed_bert_on_pubmed \
                            --output_dir model/output/medical/ner/biomed_bert_pretrained_on_pubmed/jnlpba \
                            --num_train_epochs 3 \
                            --learning_rate 3e-5 \
                            --char_vocab CharBERT/data/dict/bert_char_vocab \
                            --per_gpu_train_batch_size 6 \
                            --do_train \
                            --do_predict \
                            --overwrite_output_dir \
                            --save_steps 1000
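
To compare the NER runs above, predictions are usually scored at the entity level rather than per token. The sketch below computes entity-level F1 for a single BIO-tagged sequence in plain Python. It is an illustration of the metric, not the evaluation code used by `CharBERT/run_ner.py` (that the script reports a seqeval-style F1 is an assumption); `extract_spans` and `entity_f1` are hypothetical names.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes last span
        inside = tag.startswith("I-") and kind == tag[2:]
        if not inside:
            if kind is not None:
                spans.add((kind, start, i))  # close the open span
                start, kind = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                start, kind = i, tag[2:]     # open a new span
    return spans


def entity_f1(gold_tags, pred_tags):
    """Entity-level F1: a span counts only if type and boundaries match."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-DNA", "I-DNA", "O", "B-protein"]
pred = ["B-DNA", "I-DNA", "O", "O"]
print(round(entity_f1(gold, pred), 3))
```

For a whole test set, spans would need to be extracted per sentence and pooled before computing precision and recall, so that entities never merge across sentence boundaries.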