### Run the `fill-mask` test with the original BERT, without fine-tuning it on our COVID articles

#### Prerequisites

In [1]:
%%capture 

!pip install transformers==4.17.0
!pip install pandas==1.1.5

#### Imports 

In [2]:
import logging

import pandas as pd
import transformers
from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast, pipeline

##### Setup logging

In [3]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

##### Log versions of dependencies

In [4]:
logger.info(f'[Using transformers: {transformers.__version__}]')
logger.info(f'[Using pandas: {pd.__version__}]')

[Using transformers: 4.17.0]
[Using pandas: 1.1.5]


#### Essentials

#### Load the pretrained BERT MLM

In [5]:
oob_model = BertForMaskedLM.from_pretrained('bert-base-uncased')
logger.info(f'Total number of parameters = {oob_model.num_parameters()}')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Total number of parameters = 109514298
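
The logged total can be reproduced by hand as a sanity check. The sketch below assumes the standard `bert-base-uncased` hyperparameters (hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 segment types), assumes the MLM model carries no pooler, and assumes the output projection shares its weights with the word embeddings. These assumptions are not stated in this notebook; the sum agreeing with the log is what supports them.

```python
# Assumed bert-base-uncased hyperparameters (not taken from this notebook).
VOCAB, HIDDEN, LAYERS, INTERMEDIATE, MAX_POS, TYPE_VOCAB = 30522, 768, 12, 3072, 512, 2


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # one scale vector and one shift vector

# Word, position and segment embeddings, followed by a LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm

encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward expansion
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward projection
    + layer_norm
)

# MLM head: dense transform + LayerNorm + one output bias per vocabulary entry.
# The decoder weight matrix is assumed tied to the word embeddings, so it is not counted again.
mlm_head = linear(HIDDEN, HIDDEN) + layer_norm + VOCAB

total = embeddings + LAYERS * encoder_layer + mlm_head
print(total)  # 109514298
```

Roughly 78 % of the parameters sit in the 12 encoder layers and about 22 % in the embeddings; the MLM head itself is comparatively tiny.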


#### Load the default BERT tokenizer

In [6]:
oob_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
oob_tokenizer

PreTrainedTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

##### Verify tokenizer

In [7]:
oob_tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}
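
The `mask_token` entry matters for the evaluation data: the masked sentences must contain the literal string `[MASK]`, and the evaluation loop in this notebook reads the pipeline result as one flat list of predictions, which presumes a single mask per sentence. A minimal pre-check could look like the sketch below; `has_single_mask` is a hypothetical helper, not part of `transformers`.

```python
MASK_TOKEN = '[MASK]'  # matches mask_token in the special tokens map


def has_single_mask(sentence, mask_token=MASK_TOKEN):
    """Return True if the sentence contains the mask token exactly once."""
    return sentence.count(mask_token) == 1


masked = ('A number of firms have been reassessing spending plans in light '
          'of the covid-19 [MASK] and reduced oil price.')
print(has_single_mask(masked))           # True
print(has_single_mask('No mask here.'))  # False
```

Applied to the evaluation file, the same check could be run over the `masked` column before any sentence is sent to the model.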

#### Create a Hugging Face pipeline for the `fill-mask` task

In [8]:
fill_mask = pipeline('fill-mask', model=oob_model, tokenizer=oob_tokenizer)
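
Conceptually, a fill-mask pipeline takes the model's logits at the masked position, turns them into probabilities with a softmax over the vocabulary, and returns the `top_k` most probable tokens; the reported `score` is that probability. The sketch below illustrates only the ranking step in plain Python. The logits and the four-word vocabulary are made up for illustration, and `top_k_from_logits` is a hypothetical helper, not the library's implementation.

```python
import math


def top_k_from_logits(logits, k):
    """Softmax over a {token: logit} mapping, then return the k most probable tokens."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical logits for the [MASK] position over a toy vocabulary.
toy_logits = {'crisis': 2.0, 'crash': 1.5, 'outbreak': 0.5, 'treaty': -1.0}
for rank, (token, prob) in enumerate(top_k_from_logits(toy_logits, k=2), start=1):
    print(f'Rank: {rank} | {prob * 100:.2f} % | {token}')
```

Because the softmax runs over the whole vocabulary (30522 entries for this tokenizer), even the best candidate can receive a small percentage when the model is unsure.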

#### Test the original BERT MLM on the `fill-mask` task

In [9]:
df = pd.read_csv('./data/eval_mlm.csv')

# Each row pairs a ground-truth sentence with a copy in which one word is replaced by [MASK].
for gt, masked_sentence in zip(df.ground_truth.tolist(), df.masked.tolist()):
    logger.info(f'Ground Truth    : {gt}')
    logger.info(f'Masked sentence : {masked_sentence}')
    # Ask the model for its 10 most probable fillers for the masked position.
    predictions = fill_mask(masked_sentence, top_k=10)
    for i, prediction in enumerate(predictions, start=1):
        logger.info(f'Rank: {i} | {(prediction["score"] * 100):.2f} % | {[prediction["token_str"]]}')
    print('-' * 100)

Ground Truth    : A number of firms have been reassessing spending plans in light of the covid-19 outbreak and reduced oil price.
Masked sentence : A number of firms have been reassessing spending plans in light of the covid-19 [MASK] and reduced oil price.
Rank: 1 | 11.37 % | ['crisis']
Rank: 2 | 8.95 % | ['crash']
Rank: 3 | 3.26 % | ['program']
Rank: 4 | 2.08 % | ['earthquake']
Rank: 5 | 1.98 % | ['agreement']
Rank: 6 | 1.91 % | ['disaster']
Rank: 7 | 1.75 % | ['accident']
Rank: 8 | 1.62 % | ['treaty']
Rank: 9 | 1.31 % | ['ban']
Rank: 10 | 1.23 % | ['issue']
------------------------------------------------------------------
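
In this first example the original BERT proposes generic fillers ("crisis", "crash", "program") and the true word does not appear in the top 10. One way to make that check mechanical is to recover the hidden word by aligning the two sentences and then test whether it is among the predicted tokens. The sketch below is a hypothetical helper under the assumption that the masked sentence differs from the ground truth only at the `[MASK]` position; the token list is copied from the output above.

```python
MASK_TOKEN = '[MASK]'


def recover_masked_word(ground_truth, masked, mask_token=MASK_TOKEN):
    """Return the text that the mask token replaced in the ground-truth sentence."""
    prefix, _, suffix = masked.partition(mask_token)
    if not (ground_truth.startswith(prefix) and ground_truth.endswith(suffix)):
        raise ValueError('masked sentence does not align with ground truth')
    return ground_truth[len(prefix):len(ground_truth) - len(suffix)]


ground_truth = ('A number of firms have been reassessing spending plans in light '
                'of the covid-19 outbreak and reduced oil price.')
masked = ('A number of firms have been reassessing spending plans in light '
          'of the covid-19 [MASK] and reduced oil price.')
# Top-10 tokens copied from the logged predictions.
top_10 = ['crisis', 'crash', 'program', 'earthquake', 'agreement',
          'disaster', 'accident', 'treaty', 'ban', 'issue']

target = recover_masked_word(ground_truth, masked)
print(target)                    # outbreak
print(target.lower() in top_10)  # False
```

Counting how often this membership test succeeds across all rows would give a simple top-10 hit rate for comparing the original model against a fine-tuned one.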