<a href="https://colab.research.google.com/github/ckenlam/NLU-Prepositions-Challenge/blob/main/Challenge_Problem_Preposition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective


The objective of this notebook is to predict, as accurately as possible, the original preposition behind each masked token in the full text of "**The Hound of the Baskervilles**", using a fine-tuned RoBERTa model.

However, the model will not see any part of the original text of "**The Hound of the Baskervilles**" during training or fine-tuning; instead, I will use another work by Sir Arthur Conan Doyle, "**The Adventures of Sherlock Holmes**", to fine-tune the language model.

The full text of "**The Hound of the Baskervilles**", by Sir Arthur Conan Doyle, is available for download at https://www.gutenberg.org/ebooks/2852.txt.utf-8 .

The full text of "**The Adventures of Sherlock Holmes**", by Sir Arthur Conan Doyle, can be found in my GitHub repo at https://raw.githubusercontent.com/ckenlam/Language-Model/main/hound-train.txt .

# Huggingface Authentication

In [27]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [28]:
!git config --global credential.helper store

# Install the libraries

In [44]:
!pip install transformers



In [45]:
!pip install datasets



In [46]:
!pip install transformers[sentencepiece]



In [32]:
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash

Detected operating system as Ubuntu/bionic.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... done.
Running apt-get update... done.

The repository is setup! You can now install packages.


In [33]:
!sudo apt-get install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done

# Loading Pre-Trained Model

I will use the pre-trained RoBERTa model for this data challenge. Unlike BERT, RoBERTa is trained with a dynamic masking pattern rather than a static one; it is also pre-trained on roughly 10 times more data than BERT.
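To make the static/dynamic distinction concrete, here is a minimal sketch in plain Python. It is only an illustration and not RoBERTa's actual pre-processing code: the helper `random_mask`, the toy sentence, the word-level masking, and the exaggerated 30% rate are all assumptions chosen so the effect is easy to see.

```python
import random

MASK = "<mask>"

def random_mask(tokens, rate, rng):
    """Replace each token with MASK independently with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]

tokens = "the hound of the baskervilles was seen on the moor at night".split()

# Static masking: the pattern is drawn once and reused in every epoch.
static_pattern = random_mask(tokens, rate=0.3, rng=random.Random(0))
static_epochs = [static_pattern for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sentence is seen.
rng = random.Random(0)
dynamic_epochs = [random_mask(tokens, rate=0.3, rng=rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs)):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With static masking the model sees the same gaps every epoch, whereas dynamic masking exposes it to different gaps over the same sentence.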

In [47]:
import transformers

In [48]:
from transformers import TFAutoModelForMaskedLM

#model_checkpoint = "distilbert-base-uncased"
model_checkpoint = "roberta-base"
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


In [50]:
model(model.dummy_inputs)
model.summary()

Model: "tf_roberta_for_masked_lm_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 roberta (TFRobertaMainLayer  multiple                 124055040 
 )                                                               
                                                                 
 lm_head (TFRobertaLMHead)   multiple                  39642969  
                                                                 
Total params: 124,697,433
Trainable params: 124,697,433
Non-trainable params: 0
_________________________________________________________________


# Loading the Tokenizer

In [51]:
from transformers import AutoTokenizer
import numpy as np
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Loading the Data

I will use "**The Adventures of Sherlock Holmes**" as the training data to fine-tune the pre-trained RoBERTa model, and "**The Hound of the Baskervilles**" as the test data, as requested in the challenge instructions. Since both were written by Sir Arthur Conan Doyle, "**The Adventures of Sherlock Holmes**" should serve as adequate domain-specific data for fine-tuning the language model and boosting performance on the downstream task.

In [52]:
from datasets import load_dataset
dataset = load_dataset('text', data_files={'train': 'https://raw.githubusercontent.com/ckenlam/Language-Model/main/hound-train.txt', 'test':'https://www.gutenberg.org/cache/epub/2852/pg2852.txt'})
dataset

Using custom data configuration default-3ad063869b1cab2a
Reusing dataset text (/root/.cache/huggingface/datasets/text/default-3ad063869b1cab2a/0.0.0/08f6fb1dd2dab0a18ea441c359e1d63794ea8cb53e7863e6edf8fc5655e47ec4)


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 12310
    })
    test: Dataset({
        features: ['text'],
        num_rows: 7222
    })
})

Let's take a look at the first 20 lines of the training data:

In [61]:
dataset['train'][:20]

{'text': ['',
  "Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle",
  '',
  'This eBook is for the use of anyone anywhere at no cost and with',
  'almost no restrictions whatsoever.  You may copy it, give it away or',
  're-use it under the terms of the Project Gutenberg License included',
  'with this eBook or online at www.gutenberg.net',
  '',
  '',
  'Title: The Adventures of Sherlock Holmes',
  '',
  'Author: Arthur Conan Doyle',
  '',
  'Release Date: November 29, 2002 [EBook #1661]',
  'Last Updated: May 20, 2019',
  '',
  'Language: English',
  '',
  'Character set encoding: UTF-8',
  '']}

The following are the first 20 lines of the test data:

In [54]:
dataset['test'][:20]

{'text': ["\ufeffProject Gutenberg's The Hound of the Baskervilles, by A. Conan Doyle",
  '',
  'This eBook is for the use of anyone anywhere at no cost and with',
  'almost no restrictions whatsoever.  You may copy it, give it away or',
  're-use it under the terms of the Project Gutenberg License included',
  'with this eBook or online at www.gutenberg.org',
  '',
  '',
  'Title: The Hound of the Baskervilles',
  '',
  'Author: A. Conan Doyle',
  '',
  'Posting Date: December 8, 2008 [EBook #2852]',
  'Release Date: October, 2001',
  '',
  'Language: English',
  '',
  '',
  '*** START OF THIS PROJECT GUTENBERG EBOOK THE HOUND OF THE BASKERVILLES ***',
  '']}

# Data Preprocessing

I will first create a function that replaces the ten most common English prepositions (of, in, to, for, with, on, at, from, by, about) with the single token "**\<mask\>**". After tokenization, each dataset will contain a column for the original text (i.e. 'labels') and a column for the masked text (i.e. 'input_ids').

In [74]:
def masking_prepositions(example):
    # Lowercase the line and replace every whitespace-delimited preposition with the mask token
    prepositions = {'of', 'in', 'to', 'for', 'with', 'on', 'at', 'from', 'by', 'about'}
    example['mask'] = ' '.join(
        tokenizer.mask_token if word in prepositions else word
        for word in example['text'].lower().split()
    )
    return example


def tokenize_function(examples):
    # Tokenize the masked text (model inputs) and the original text (labels)
    result = tokenizer(examples["mask"])
    result_label = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    result['labels'] = result_label['input_ids']
    return result
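The masking step can also be tried on its own, without loading the tokenizer. The sketch below assumes the literal string `<mask>` (RoBERTa's mask token) in place of `tokenizer.mask_token`; the function name `mask_prepositions` and the second and third sample sentences are made up for illustration.

```python
MASK = "<mask>"  # stands in for tokenizer.mask_token
PREPOSITIONS = {"of", "in", "to", "for", "with", "on", "at", "from", "by", "about"}

def mask_prepositions(line):
    """Lowercase the line and replace whitespace-delimited prepositions with MASK."""
    return " ".join(MASK if word in PREPOSITIONS else word for word in line.lower().split())

header = mask_prepositions(
    "Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle"
)
plain = mask_prepositions("What do you make of it, Watson?")
punctuated = mask_prepositions("I thought of, and dismissed, the idea.")

print(header)
print(plain)
print(punctuated)
```

Because the comparison is made on whitespace-split words, a preposition with punctuation attached (such as "of," in the last sentence) is left unmasked.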



In [75]:
dataset_masked = dataset.map(masking_prepositions)

tokenized_datasets = dataset_masked.map(tokenize_function,batched=True, remove_columns=["mask","text"])


Below is an example of what the masked text and the original text look like:

In [82]:
tokenizer.decode(tokenized_datasets["train"][1]["input_ids"])

"<s>project gutenberg's the adventures<mask> sherlock holmes,<mask> arthur conan doyle</s>"

In [81]:
tokenizer.decode(tokenized_datasets["train"][1]["labels"])

"<s>Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle</s>"

I will group all the lines together and split the result into chunks that can fit the model’s maximum context size (i.e. 512 tokens).

In [17]:
tokenizer.model_max_length

512

Since I'm running training on Google Colab, I’ll pick something a bit smaller that can fit in memory:

In [24]:
chunk_size = 100

In [22]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    # result["labels"] = result["input_ids"].copy()
    return result

In [25]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1727
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 929
    })
})
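To show what the grouping step does, here is a small worked example on toy data. It follows the same logic as `group_texts`, but takes the chunk size as a parameter; the name `group_into_chunks` and the token ids are invented for the illustration.

```python
def group_into_chunks(batch, chunk_size):
    """Concatenate every column of a batch and cut it into fixed-size chunks."""
    concatenated = {key: sum(batch[key], []) for key in batch}
    total_length = len(next(iter(concatenated.values())))
    # Drop the tail that does not fill a whole chunk.
    total_length = (total_length // chunk_size) * chunk_size
    return {
        key: [values[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for key, values in concatenated.items()
    }

toy_batch = {
    "input_ids": [[0, 11, 12, 2], [0, 21, 2], [0, 31, 32, 33, 2]],
    "labels":    [[0, 11, 12, 2], [0, 21, 2], [0, 31, 32, 33, 2]],
}
chunks = group_into_chunks(toy_batch, chunk_size=5)
print(chunks["input_ids"])  # [[0, 11, 12, 2, 0], [21, 2, 0, 31, 32]]
```

The twelve toy tokens yield two chunks of five, and the last two tokens are dropped. Chunk boundaries ignore line boundaries, and the cut points are computed from the first column only, so the approach assumes every column holds the same number of tokens.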

# Loading Data Collator

The data collator inserts \<mask\> tokens at random positions in the inputs during fine-tuning. I will use two different data collators, one for the training set and one for the test set.

For the training set, in addition to the already-masked prepositions, I also set an MLM probability of 15% so that additional random tokens are masked, which provides broader domain adaptation.

As for the test set, since the goal is only to predict the prepositions, there is no need to mask additional tokens; the "mlm" parameter is thus set to False.


In [29]:
from transformers import DataCollatorForLanguageModeling

data_collator_train = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15 )
data_collator_test = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [30]:
tf_train_dataset = lm_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator_train,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = lm_datasets["test"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator_test,
    shuffle=False,
    batch_size=32,
)
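As a rough picture of what random masking at a 15% MLM probability involves, the sketch below selects tokens at that rate and then applies the usual 80/10/10 rule (mask, random token, unchanged). It is a simplified stand-in written in plain Python, not the collator's own code; the function name `mlm_mask` and the toy ids are assumptions, while the special-token ids and vocabulary size are those of `roberta-base`.

```python
import random

def mlm_mask(input_ids, mask_id=50264, vocab_size=50265, special_ids=(0, 1, 2),
             mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions ignored by the loss
    for pos, tok in enumerate(input_ids):
        # Never select <s>, <pad>, </s>; select other tokens with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[pos] = tok                      # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = mask_id              # 80%: replace with <mask>
        elif roll < 0.9:
            masked[pos] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels

toy_ids = [0, 100, 200, 300, 400, 500, 2]  # toy ids wrapped in <s> ... </s>
masked_ids, labels = mlm_mask(toy_ids)
print(masked_ids)
print(labels)
```

Only the selected positions carry a label; every other position is set to -100 so that it does not contribute to the loss.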

# Model Hyperparameters Setup

In [34]:
from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

num_train_steps = len(tf_train_dataset)
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=1000,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)

callback = PushToHubCallback(output_dir="nlu_sherlock_model", tokenizer=tokenizer )

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as keys in the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.
Cloning https://huggingface.co/ckenlam/nlu_sherlock into local empty directory.
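The schedule returned by `create_optimizer` ramps the learning rate up linearly over `num_warmup_steps` before decaying it. The sketch below covers only the warmup phase and is my own simplified formula, not the library's implementation; the function name `warmup_lr` is invented, and the step count of 54 assumes the 1,727 training chunks are batched 32 at a time.

```python
def warmup_lr(step, init_lr=2e-5, num_warmup_steps=1000):
    """Learning rate during a linear warmup from 0 up to init_lr."""
    return init_lr * min(1.0, step / num_warmup_steps)

for step in (0, 54, 500, 1000):
    print(f"step {step:4d}: lr = {warmup_lr(step):.3e}")
```

Under these assumptions, a single epoch of about 54 steps ends while the rate is still close to 1e-6, well below `init_lr`, so the relationship between `num_warmup_steps` and `num_train_steps` is worth keeping in mind when tuning.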


# Fine-Tuning RoBERTa

Let me start by checking the perplexity of the current model before the fine-tuning:

In [35]:
import math

eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 4.79
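For reference, the perplexity printed here is simply the exponential of the mean cross-entropy loss returned by `model.evaluate`. A minimal sketch of the relationship, with the loss value back-computed from the reported perplexity rather than taken from the run:

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is exp(mean cross-entropy), with the loss in nats."""
    return math.exp(mean_cross_entropy)

# Loss implied by the reported perplexity of 4.79 (back-computed, not logged).
implied_loss = math.log(4.79)
print(f"loss = {implied_loss:.4f} -> perplexity = {perplexity(implied_loss):.2f}")
```

A lower loss therefore always means a lower perplexity, and a perplexity of 1 would correspond to a loss of 0.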


I will now proceed to fine-tune this RoBERTa model with the training data:

In [36]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset)



<keras.callbacks.History at 0x7fafdd096910>

The perplexity of the fine-tuned model is slightly better than that of the initial pre-trained RoBERTa model:

In [37]:
eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 4.64


I will now save a copy of this fine-tuned model:

In [None]:
#model.push_to_hub("nlu_sherlock_model")

# Testing the Model

I will run each line of "**The Hound of the Baskervilles**" through the fine-tuned RoBERTa model and generate a prediction for each masked preposition-word. I will also compute the number of correct predictions for each line and organize the results in a dataframe. 

In [99]:
from transformers import pipeline

#load the saved model:
#model='ckenlam/nlu_sherlock_model'

mask_filler = pipeline("fill-mask", model=model, tokenizer=tokenizer, top_k=1)


In [109]:
import pandas as pd
import re
from tqdm import tqdm
from google.colab import files

all_predictions = []
# Hard-coded index range of the test lines evaluated in this run
for i in tqdm(range(29, 6856)):
  # Decode the masked and original versions of the line, dropping RoBERTa's <s> and </s> markers
  mask_text = tokenizer.decode(tokenized_datasets["test"][i]["input_ids"]).replace('<s>','').replace('</s>','').replace(tokenizer.mask_token,' '+tokenizer.mask_token)
  non_mask_text = tokenizer.decode(tokenized_datasets["test"][i]["labels"]).replace('<s>','').replace('</s>','')

  # Extract the list of all prepositions from the original line
  test_prepositions_list = re.findall(r'\bof\b|\bin\b|\bto\b|\bfor\b|\bwith\b|\bon\b|\bat\b|\bfrom\b|\bby\b|\babout\b', non_mask_text.lower())
  # Predicted prepositions, in order of appearance
  pred_prepositions_list = []
  # fill_missing gets its masks filled in one at a time with the predictions
  fill_missing = mask_text
  total = mask_text.count(tokenizer.mask_token)

  # Fill the first remaining <mask> with the top prediction, then predict the next <mask>
  for n in range(total):
    pred = mask_filler(fill_missing)
    # With several masks left the pipeline returns one list per mask; with a single mask it returns a flat list
    top = pred[0] if n == total - 1 else pred[0][0]
    fill_missing = fill_missing.replace(tokenizer.mask_token, top['token_str'], 1).replace('<s>','').replace('</s>','')
    pred_prepositions_list.append(top['token_str'].lstrip().lower())

  # Count the correct predictions for this line (0 when the line has no masks)
  match_cnt = sum(x == y for x, y in zip(test_prepositions_list, pred_prepositions_list))

  prediction = [non_mask_text, mask_text, fill_missing, match_cnt, total]
  all_predictions.append(prediction)

df = pd.DataFrame(all_predictions, columns=['original_text', 'masked_text', 'prediction', 'matching_count', 'total_count'])


100%|██████████| 6827/6827 [49:46<00:00,  2.29it/s]
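The loop fills masks one at a time, so each prediction is made with the earlier gaps already filled in. The sketch below isolates that idea in plain Python; `toy_predictor` is a hypothetical stand-in for the fill-mask pipeline (it just looks up the preceding word), and the sample sentence is made up.

```python
MASK = "<mask>"

def fill_masks_left_to_right(text, predict_first_mask):
    """Fill each MASK in turn, feeding earlier predictions back into the text."""
    predictions = []
    while MASK in text:
        word = predict_first_mask(text)
        text = text.replace(MASK, word, 1)  # fill only the leftmost mask
        predictions.append(word)
    return text, predictions

def toy_predictor(text):
    """Hypothetical predictor: guess the first mask from the word before it."""
    before = text.split(MASK, 1)[0].split()
    guesses = {"hound": "of", "baskervilles,": "by"}
    return guesses.get(before[-1] if before else "", "in")

filled, preds = fill_masks_left_to_right(
    "the hound <mask> the baskervilles, <mask> a. conan doyle", toy_predictor
)
print(filled)
print(preds)
```

Filling sequentially keeps each call to the predictor focused on a single gap, at the cost of one model call per mask.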


Below is an example of what the results look like:

In [110]:
df.head()

| | original_text | masked_text | prediction | matching_count | total_count |
|---|---|---|---|---|---|
| 0 | THE HOUND OF THE BASKERVILLES | the hound \<mask\> the baskervilles | the hound of the baskervilles | 1 | 1 |
| 1 | | | | 0 | 0 |
| 2 | By A. Conan Doyle | \<mask\> a. conan doyle | Albert a. conan doyle | 0 | 1 |
| 3 | | | | 0 | 0 |
| 4 | | | | 0 | 0 |


The accuracy of this fine-tuned model is as follows:

In [111]:
df.matching_count.sum()/df.total_count.sum()

0.6399931752260707
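This figure is a micro-average: correct predictions are summed over all lines and divided by the total number of masked prepositions, so lines without masks contribute nothing and lines with many prepositions weigh more. A small sketch on the five rows shown in the example output (the helper name `micro_accuracy` is mine):

```python
# (matching_count, total_count) pairs copied from the five example rows
rows = [(1, 1), (0, 0), (0, 1), (0, 0), (0, 0)]

def micro_accuracy(rows):
    """Total correct predictions divided by total masked prepositions."""
    matched = sum(m for m, _ in rows)
    total = sum(t for _, t in rows)
    return matched / total if total else float("nan")

print(micro_accuracy(rows))  # 0.5
```

On these five rows, one of the two masked prepositions is recovered, giving 0.5.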

I will now save the results to my Google Drive as nlu_challenge_results.csv:

In [112]:
from google.colab import drive
drive.mount('drive', force_remount=True)

Mounted at drive


In [113]:
df.to_csv('./drive/My Drive/nlu_challenge_results.csv')