### Run the `fill-mask` test with the original BERT, without fine-tuning on our COVID articles

#### Prerequisites

In [2]:
%%capture 

!pip install transformers==4.17.0
!pip install pandas==1.1.5

#### Imports 

In [3]:
from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast, pipeline
import transformers
import pandas as pd
import logging

##### Set up logging

In [4]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

##### Log versions of dependencies

In [5]:
logger.info(f'[Using transformers: {transformers.__version__}]')
logger.info(f'[Using pandas: {pd.__version__}]')

[Using transformers: 4.17.0]
[Using pandas: 1.1.5]


#### Essentials

In [6]:
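# The default BertConfig describes the BERT-base architecture
# (12 layers, hidden size 768, vocabulary size 30522).
config = BertConfig()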

#### Re-create BERT MLM 

In [8]:
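# Instantiating from a config builds the architecture with randomly
# initialized weights; no pretrained checkpoint is loaded here.
default_model = BertForMaskedLM(config=config)
logger.info(f'Total number of parameters = {default_model.num_parameters()}')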

Total number of parameters = 109514298
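
The reported total can be reproduced by hand, which is a quick sanity check that the model has the BERT-base shape. The sketch below assumes the standard BERT-base hyperparameters (vocabulary 30522, hidden size 768, 12 layers, intermediate size 3072, 512 positions, 2 token types), no pooler layer, and an MLM decoder that shares its weight matrix with the word embeddings. The function name `bert_mlm_param_count` is illustrative and not part of `transformers`.

```python
def dense(n_in, n_out):
    """Parameters of a linear layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def bert_mlm_param_count(vocab=30522, hidden=768, layers=12,
                         intermediate=3072, max_pos=512, type_vocab=2):
    # Assumed defaults: BERT-base hyperparameters.
    layer_norm = 2 * hidden  # scale and shift vectors

    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value and output projections, plus a LayerNorm.
    attention = 4 * dense(hidden, hidden) + layer_norm

    # Feed-forward block: expand, project back, plus a LayerNorm.
    feed_forward = (dense(hidden, intermediate)
                    + dense(intermediate, hidden)
                    + layer_norm)

    encoder = layers * (attention + feed_forward)

    # MLM head: dense transform, LayerNorm, and a per-token output bias.
    # The decoder weight is assumed tied to the word embeddings, so it is
    # not counted a second time.
    mlm_head = dense(hidden, hidden) + layer_norm + vocab

    return embeddings + encoder + mlm_head


print(bert_mlm_param_count())  # 109514298
```

Under these assumptions the embeddings contribute about 23.8M parameters, the twelve encoder layers about 85.1M, and the MLM head about 0.6M.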


#### Re-create default BERT tokenizer 

In [9]:
default_tokenizer = BertTokenizerFast.from_pretrained('./vocab', config=config)
default_tokenizer.model_max_length = 512
default_tokenizer.init_kwargs['model_max_length'] = 512
default_tokenizer

PreTrainedTokenizerFast(name_or_path='./vocab', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

##### Verify tokenizer

In [10]:
MASK_TOKEN = default_tokenizer.mask_token
MASK_TOKEN

'[MASK]'

In [11]:
default_tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

#### Create a Hugging Face pipeline for the `fill-mask` task

In [12]:
fill_mask = pipeline('fill-mask', model=default_model, tokenizer=default_tokenizer)
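
Conceptually, a `fill-mask` pipeline takes the model's logits at the `[MASK]` position, turns them into probabilities with a softmax over the vocabulary, and returns the `top_k` most probable tokens with their scores. The pure-Python sketch below illustrates only that ranking step on a toy vocabulary. `softmax` and `top_k_tokens` are hypothetical helpers and the logits are made up; this is not the library's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_tokens(logits, vocab, k=3):
    """Return the k most probable tokens as dicts with a token and a score."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return [{"token_str": tok, "score": p} for tok, p in ranked[:k]]


# Toy vocabulary and made-up logits, for illustration only.
toy_vocab = ["outbreak", "pandemic", "lobster", "crisis"]
toy_logits = [2.0, 1.0, -1.0, 0.5]

for rank, pred in enumerate(top_k_tokens(toy_logits, toy_vocab), start=1):
    print(f"Rank: {rank} | {pred['score'] * 100:.2f} % | {pred['token_str']}")
```

When the logits carry almost no information, the softmax is close to uniform. Over a 30522-token vocabulary, a uniform distribution gives each token a score of about 0.003 %.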

#### Test the original BERT MLM on the `fill-mask` task
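
Reading top-3 lists by eye does not scale past a few sentences. Assuming each row of the evaluation file pairs a ground-truth sentence with a copy in which one span is replaced by `[MASK]` (as the `ground_truth` and `masked` columns suggest), the hidden word can be recovered by aligning the two strings, and each prediction list can then be scored automatically. `masked_target` and `hit_at_k` are hypothetical helper names; the prediction dictionaries mimic the `token_str` field that the loop in this section reads.

```python
def masked_target(ground_truth, masked, mask_token="[MASK]"):
    """Recover the text hidden behind the first mask token."""
    prefix, found, suffix = masked.partition(mask_token)
    if not found:
        raise ValueError("masked sentence contains no mask token")
    aligned = (
        len(prefix) + len(suffix) <= len(ground_truth)
        and ground_truth.startswith(prefix)
        and ground_truth.endswith(suffix)
    )
    if not aligned:
        raise ValueError("masked sentence does not align with ground truth")
    return ground_truth[len(prefix):len(ground_truth) - len(suffix)]


def hit_at_k(target, predictions):
    """True if any prediction matches the target.

    The comparison is case-insensitive, which assumes an uncased vocabulary.
    """
    wanted = target.strip().lower()
    return any(p["token_str"].strip().lower() == wanted for p in predictions)


gt = ("A number of firms have been reassessing spending plans in light of "
      "the covid-19 outbreak and reduced oil price.")
masked = ("A number of firms have been reassessing spending plans in light of "
          "the covid-19 [MASK] and reduced oil price.")

print(masked_target(gt, masked))  # outbreak
```

Averaging `hit_at_k` over all rows would give a top-k accuracy, which makes it easier to compare this baseline against a model trained on the COVID articles.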

In [13]:
df = pd.read_csv('./data/eval_mlm.csv')

for gt, masked_sentence in zip(df.ground_truth.tolist(), df.masked.tolist()):
    print(f'Ground Truth    : {gt}')
    print(f'Masked sentence : {masked_sentence}')
    predictions = fill_mask(masked_sentence, top_k=3)
    for i, prediction in enumerate(predictions):
        print(f'Rank: {i+1} | {(prediction["score"] * 100):.2f} % | {[prediction["token_str"]]}')
    print('-' * 100)

Ground Truth    : A number of firms have been reassessing spending plans in light of the covid-19 outbreak and reduced oil price.
Masked sentence : A number of firms have been reassessing spending plans in light of the covid-19 [MASK] and reduced oil price.
Rank: 1 | 0.03 % | ['##lvis']
Rank: 2 | 0.03 % | ['lobster']
Rank: 3 | 0.02 % | ['##von']
----------------------------------------------------------------------------------------------------
Ground Truth    : Globally, airlines are closing down and the covid-19 coronavirus has accelerated some of these closures.
Masked sentence : Globally, airlines are closing down and the c