# Using BERT to solve 'open cloze' exercises

@Data_sigh

This notebook loads BERT (Bidirectional Encoder Representations from Transformers) with pretrained weights from HuggingFace's library of state-of-the-art pretrained models and uses it to fill the gaps in 'open cloze' exam exercises.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
https://arxiv.org/abs/1810.04805
@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}
https://github.com/google-research/bert

https://github.com/huggingface/pytorch-transformers

https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/huggingface_pytorch-pretrained-bert_bert.ipynb

BertTokenizer performs end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization:

- Text normalization: convert all whitespace characters to spaces and (for the Uncased model) lowercase the input and strip out accent markers. E.g., `John Johanson's,` → `john johanson's,`
- Punctuation splitting: split all punctuation characters on both sides (i.e., add whitespace around all punctuation characters). Punctuation characters are defined as (a) anything with a P* Unicode class, (b) any non-letter/number/space ASCII character (e.g., characters like $ which are technically not punctuation). E.g., `john johanson's,` → `john johanson ' s ,`
- WordPiece tokenization: apply whitespace tokenization to the output of the above procedure, then apply WordPiece tokenization to each token separately. (The BERT reference implementation is based directly on the one from tensor2tensor.) E.g., `john johanson ' s ,` → `john johan ##son ' s ,`
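
The three steps above can be sketched in plain Python. This is a minimal illustration, not the `BertTokenizer` code: `TOY_VOCAB` and all function names are assumptions made up for the example, and the ASCII punctuation test is a simplification of the rule described above.

```python
import unicodedata

# Toy vocabulary, assumed for illustration (the real vocabulary is far larger).
TOY_VOCAB = {"john", "johan", "##son", "'", "s", ",", "[UNK]"}

def normalize(text):
    """Lowercase, strip accent marks and collapse whitespace (Uncased behaviour)."""
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return " ".join(text.split())

def is_punctuation(ch):
    """P* Unicode class, or any non-alphanumeric, non-space ASCII character."""
    if unicodedata.category(ch).startswith("P"):
        return True
    return ord(ch) < 128 and not ch.isalnum() and not ch.isspace()

def split_punctuation(text):
    """Put whitespace around every punctuation character, then split on whitespace."""
    return "".join(f" {ch} " if is_punctuation(ch) else ch for ch in text).split()

def wordpiece(token, vocab):
    """Greedy longest-match-first split of one token into word pieces."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        while end > start:
            piece = token[start:end] if start == 0 else "##" + token[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # nothing matched: treat the whole token as unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text, vocab=TOY_VOCAB):
    tokens = []
    for token in split_punctuation(normalize(text)):
        tokens.extend(wordpiece(token, vocab))
    return tokens

print(tokenize("John Johanson's,"))
# ['john', 'johan', '##son', "'", 's', ',']
```

Continuation pieces carry the `##` prefix, which is why `johanson` comes out as `johan ##son`.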

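BertForMaskedLM: the BERT Transformer with the pretrained masked language modelling head on top (fully pretrained). For every input position it outputs one score per vocabulary token, so the highest-scoring token at the [MASK] position is the model's suggestion for the gap.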
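
To make the selection step concrete, here is a small sketch of how candidates could be picked from the scores at the masked position. The scores in `toy_logits` are invented for illustration, and `softmax` and `top_candidates` are hypothetical helper names, not part of any library.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

def top_candidates(logits, n=1, skip=(",", ".")):
    """Return the n highest-scoring tokens, ignoring the tokens listed in `skip`."""
    ranked = sorted(logits, key=logits.get, reverse=True)
    return [tok for tok in ranked if tok not in skip][:n]

# Invented scores for the gap in "I work [MASK] a motorbike stunt rider".
toy_logits = {"as": 9.1, "like": 6.3, ",": 5.0, "with": 4.2, "the": -1.5}

probs = softmax(toy_logits)
print(top_candidates(toy_logits, n=2))
# ['as', 'like']
```

Ranking all scores once and filtering avoids overwriting individual scores between lookups, which matters because raw scores can be negative.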

In [1]:
import glob

import pandas as pd
import torch
from pytorch_transformers import BertTokenizer, BertForMaskedLM

In [2]:
bert_model_name="bert-base-uncased" # Pretrained weights shortcut
tokenizer = BertTokenizer.from_pretrained(bert_model_name, do_lower_case=True) # wordpiece tokenizer
maskedLM_model = BertForMaskedLM.from_pretrained(bert_model_name)
maskedLM_model.eval(); # evaluation mode (disables dropout); the trailing ';' hides the cell output

In [3]:
fnames = glob.glob("C:/Users/aliso/.fastai/data/cambridge_nlp/test/open_cloze/*.xlsx") # test data

In [4]:
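def exam_test_open_cloze(df_test, n=1):
    """Fill each gap with BERT's top prediction(s) and return the number of gaps answered correctly.

    df_test: DataFrame with a `text` column (the fragments between gaps) and an `answer` column.
    n: number of suggestions per gap (1 or 2).
    """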
    score = 0
    txt_before_gap = tokenizer.tokenize(df_test.text[0])
    txt_before_gap = ' '.join(txt_before_gap[-6:]) # no more than 6 tokens
    for i in range(len(df_test)-1):      
        txt_after_gap = df_test.text[i+1]
        txt = '[CLS] ' + txt_before_gap + ' [MASK] ' + txt_after_gap + ' [SEP]'
        # Tokenized input
        tokens_txt = tokenizer.tokenize(txt)
        idx_tokens = [tokenizer.convert_tokens_to_ids(tokens_txt)]
        masked_idx = tokens_txt.index('[MASK]')
        segments_ids = [0] * masked_idx + [1] * (len(tokens_txt)-masked_idx)
        # Convert inputs to PyTorch tensors, each with a leading batch dimension of 1
        segments_tensors = torch.tensor([segments_ids])
        tokens_tensor = torch.tensor(idx_tokens)
        # Predict the missing token (indicated with [MASK]) with `BertForMaskedLM`;
        # the keyword keeps the call independent of the positional argument order
        with torch.no_grad():
            preds = maskedLM_model(tokens_tensor, token_type_ids=segments_tensors)
        preds_idx = [torch.argmax(preds[0][0, masked_idx,:]).item()]
        pred_token = tokenizer.convert_ids_to_tokens(preds_idx)[0]
        
        if pred_token in [',','.']: # Take second highest prediction if the first is punctuation
            preds[0][0, masked_idx, preds_idx] = float('-inf') # exclude this token from the next argmax
            preds_idx = [torch.argmax(preds[0][0, masked_idx,:]).item()]
            pred_token = tokenizer.convert_ids_to_tokens(preds_idx)[0]
        
        if n==2:
            # Make two suggestions for each word
            preds[0][0, masked_idx, preds_idx] = float('-inf') # exclude the first suggestion
            preds_idx = [torch.argmax(preds[0][0, masked_idx,:]).item()]
            pred_token2 = tokenizer.convert_ids_to_tokens(preds_idx)[0]
            pred_token = [pred_token, pred_token2] # propose 2 words

        print (df_test.text[i], "[", pred_token, ":", df_test.answer[i], "]")
        
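        # `answer` is one word or a list-like string such as "['which','that']";
        # splitting on quotes also yields '[', ',' and ']', so drop those leftovers
        actual_answer = [a for a in df_test.answer[i].lower().split("'") if a.strip("[], ")]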
        if n==2:
            if (any(x == pred_token[0] for x in actual_answer) or any(x == pred_token[1] for x in actual_answer)):
                score +=1
        else:
            if pred_token in actual_answer:
                score +=1
                                
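        # The text after this gap becomes the left context of the next gap
        txt_before_gap = tokenizer.tokenize(txt_after_gap)
        txt_before_gap = ' '.join(txt_before_gap[-7:]) # no more than 7 tokens (the first gap uses 6)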
        
    print (df_test.text[len(df_test)-1])
    print ("SCORE", score, '/', len(df_test)-1 )
    return score
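
The way the function pairs up fragments can be shown separately. The sketch below only builds the masked input strings: `build_cloze_inputs` is a hypothetical helper, whitespace splitting stands in for the WordPiece tokenizer, and the third fragment is shortened from the test data.

```python
def build_cloze_inputs(fragments, context_tokens=6):
    """Yield one '[CLS] left [MASK] right [SEP]' string per gap.

    `fragments` are the text pieces between gaps, so N fragments give N - 1 gaps.
    Only the last `context_tokens` whitespace tokens of the left fragment are kept.
    """
    for left, right in zip(fragments, fragments[1:]):
        left_context = " ".join(left.split()[-context_tokens:])
        yield f"[CLS] {left_context} [MASK] {right} [SEP]"

fragments = [
    "I work",
    "a motorbike stunt rider - that is, I do tricks on my motorbike at shows. "
    "The Le Mans racetrack in France was",
    "I first saw some guys doing motorbike stunts.",  # shortened for the example
]

for line in build_cloze_inputs(fragments):
    print(line)
```

The right-hand fragment is passed in full, while the left-hand side is trimmed to a short window, so each gap is predicted mostly from what follows it.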

In [5]:
pd.set_option('display.max_colwidth', 150)
df_test = pd.read_excel(fnames[0])
df_test

| Unnamed: 0 | question | answer | text |
|---|---|---|---|
| 0 | 0.0 | as | I work |
| 1 | 9.0 | where | a motorbike stunt rider - that is, I do tricks on my motorbike at shows. The Le Mans racetrack in France was |
| 2 | 10.0 | so | I first saw some guys doing motorbike stunts. I'd never seen anyone riding a motorbike using just the back wheel before and I was |
| 3 | 11.0 | myself | impressed I went straight home and taught |
| 4 | 12.0 | in | to do the same. It wasn't very long before I began to earn my living at shows performing my own motorbike stunts. I have a degree |
| 5 | 13.0 | ['which','that'] | mechanical engineering; this helps me to look at the physics |
| 6 | 14.0 | ['out','on','at'] | lies behind each stunt. In addition to being responsible for design changes to the motorbike, I have to work |
| 7 | 15.0 | from | every stunt I do. People often think that my work is very dangerous, but, apart |
| 8 | 16.0 | any | some minor mechanical problems happening occasionally during a stunt, nothing ever goes wrong. I never feel in |
| 9 | | | kind of danger because I'm very experienced. |
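
The `answer` column mixes single words with strings that look like Python lists. One possible way to normalise both forms into a set of accepted answers is sketched below; `parse_answers` is a hypothetical helper, and it assumes that list-like cells are well-formed Python literals.

```python
import ast

def parse_answers(cell):
    """Return the set of accepted answers stored in one `answer` cell."""
    text = str(cell).strip().lower()
    if text.startswith("["):
        # List-like cells such as "['which','that']" hold several accepted answers.
        return {str(item).strip() for item in ast.literal_eval(text)}
    return {text}

print(parse_answers("as"))
print(sorted(parse_answers("['out','on','at']")))
```

With a set, checking a prediction is a simple membership test, and stray characters such as `[` or `,` can never count as a correct answer.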


In [6]:
exam_test_open_cloze(pd.read_excel(fnames[0]), n=1)

I work [ as : as ]
a motorbike stunt rider - that is, I do tricks on my motorbike at shows. The Le Mans racetrack in France was [ where : where ]
I first saw some guys doing motorbike stunts. I'd never seen anyone riding a motorbike using just the back wheel before and I was [ so : so ]
impressed I went straight home and taught [ myself : myself ]
to do the same. It wasn't very long before I began to earn my living at shows performing my own motorbike stunts. I have a degree [ in : in ]
mechanical engineering; this helps me to look at the physics [ that : ['which','that'] ]
lies behind each stunt. In addition to being responsible for design changes to the motorbike, I have to work [ on : ['out','on','at'] ]
every stunt I do. People often think that my work is very dangerous, but, apart [ from : from ]
some minor mechanical problems happening occasionally during a stunt, nothing ever goes wrong. I never feel in [ any : any ]
kind of danger because I'm very experienced.
SCORE 9 / 9


9

In [7]:
import os

total_score = 0
for fname in fnames:
    print()
    print(os.path.basename(fname)) # file name without the directory, on any OS
    df_test = pd.read_excel(fname)
    score = exam_test_open_cloze(df_test, n=1)
    total_score += score
# The percentage assumes 9 scored gaps per test file
print("TOTAL", total_score, ' in ', len(fnames), ' tests', 100*total_score/(9*len(fnames)), '%')


B2_sample_paper_1_part_2_first.xlsx
I work [ as : as ]
a motorbike stunt rider - that is, I do tricks on my motorbike at shows. The Le Mans racetrack in France was [ where : where ]
I first saw some guys doing motorbike stunts. I'd never seen anyone riding a motorbike using just the back wheel before and I was [ so : so ]
impressed I went straight home and taught [ myself : myself ]
to do the same. It wasn't very long before I began to earn my living at shows performing my own motorbike stunts. I have a degree [ in : in ]
mechanical engineering; this helps me to look at the physics [ that : ['which','that'] ]
lies behind each stunt. In addition to being responsible for design changes to the motorbike, I have to work [ on : ['out','on','at'] ]
every stunt I do. People often think that my work is very dangerous, but, apart [ from : from ]
some minor mechanical problems happening occasionally during a stunt, nothing ever goes wrong. I never feel in [ any : any ]
kind of danger because I'

In [8]:
import os

total_score = 0
for fname in fnames:
    print()
    print(os.path.basename(fname)) # file name without the directory, on any OS
    df_test = pd.read_excel(fname)
    score = exam_test_open_cloze(df_test, n=2) # two suggestions per gap
    total_score += score
# The percentage assumes 9 scored gaps per test file
print("TOTAL", total_score, ' in ', len(fnames), ' tests', 100*total_score/(9*len(fnames)), '%')


B2_sample_paper_1_part_2_first.xlsx
I work [ ['as', 'like'] : as ]
a motorbike stunt rider - that is, I do tricks on my motorbike at shows. The Le Mans racetrack in France was [ ['where', 'when'] : where ]
I first saw some guys doing motorbike stunts. I'd never seen anyone riding a motorbike using just the back wheel before and I was [ ['so', 'pretty'] : so ]
impressed I went straight home and taught [ ['myself', 'others'] : myself ]
to do the same. It wasn't very long before I began to earn my living at shows performing my own motorbike stunts. I have a degree [ ['in', 'of'] : in ]
mechanical engineering; this helps me to look at the physics [ ['that', 'which'] : ['which','that'] ]
lies behind each stunt. In addition to being responsible for design changes to the motorbike, I have to work [ ['on', 'with'] : ['out','on','at'] ]
every stunt I do. People often think that my work is very dangerous, but, apart [ ['from', 'of'] : from ]
some minor mechanical problems happening occasionally