**Install the transformers and datasets libraries for the language modelling tasks**

In [1]:
! pip install datasets transformers



**Sign in to the Hugging Face Hub for authentication**

In [2]:
!huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        
Username: VirenS13117
Password: 
Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-crendential store but this isn't the helper defined on your machine.
You will have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal to set it as the default

git config --global credential.helper store[0m


**Setting up Git LFS**

In [3]:
!apt install git-lfs
!git config --global user.email "virender13117@gmail.com"
!git config --global user.name "VirenS13117"

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 2s (886 kB/s)
Selecting previously unselected package git-lfs.
(Reading database ... 155013 files and directories currently installed.)
Preparing to unpack .../git-lfs_2.3.4-1_amd64.deb ...
Unpacking git-lfs (2.3.4-1) ...
Setting up git-lfs (2.3.4-1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


**Checking the Transformers version**

In [4]:
import transformers
print(transformers.__version__) 

4.10.2


**Now we fine-tune a language model on a text classification task**

**Task: We will fine-tune one of the Transformers models on a text classification task from the [GLUE Benchmark](https://gluebenchmark.com/)**\
It has the following nine classification tasks:\
a. [CoLA](https://nyu-mll.github.io/CoLA/) Corpus of Linguistic Acceptability: To determine if a sentence is grammatically correct or not. It is a dataset of sentences with two labels: grammatically correct (1) or grammatically incorrect (0).\
b. [MNLI](https://arxiv.org/abs/1704.05426) Multi-Genre Natural Language Inference: To determine if a sentence entails, contradicts or is unrelated to a given hypothesis. The data has **two** versions: one (matched) where the validation and test sets come from the same distribution, and another (mismatched) where the validation and test sets come from out-of-domain data.\
c. [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) Microsoft Research Paraphrase Corpus: To determine if two sentences are paraphrases of one another or not.\
d. [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) Question-answering Natural Language Inference: To determine if the answer to a question is in the second sentence or not.\
e. **QQP** Quora Question Pairs2: To determine if two questions are semantically equivalent or not.\
f. [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) Recognizing Textual Entailment: To determine if a sentence entails a given hypothesis or not.\
g. [SST-2](https://nlp.stanford.edu/sentiment/index.html) Stanford Sentiment Treebank: To determine if the sentence has positive or negative sentiment.\
h. [STS-B](http://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark) Semantic Textual Similarity Benchmark: To determine the similarity of two sentences with a score from 1 to 5.\
i. [WNLI](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) Winograd Natural Language Inference: To determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not.



In [5]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

Let's pick one of the GLUE_TASKS to run with the distilbert-base-uncased model from the model hub.

In [6]:
task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

**Load the dataset**

In [7]:
from datasets import load_dataset, load_metric

We can pass the task name directly to load_dataset and load_metric, except for MNLI-MM, which uses the MNLI name.

In [8]:
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)

Downloading:   0%|          | 0.00/7.78k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.47k [00:00<?, ?B/s]

Downloading and preparing dataset glue/cola (download: 368.14 KiB, generated: 596.73 KiB, post-processed: Unknown size, total: 964.86 KiB) to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading:   0%|          | 0.00/377k [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/1.86k [00:00<?, ?B/s]

Let's have a look at the data

In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

The dataset is a dictionary with one key each for the training, validation and test sets. The exception is MNLI, which has separate matched and mismatched keys for the validation and test sets.
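The split layout described above can be sketched as a small lookup. The helper name `validation_split_for` is hypothetical (it is not part of the datasets library), and `"mnli-mm"` is just this notebook's own label for the mismatched variant.

```python
def validation_split_for(task: str) -> str:
    """Return the name of the validation split to evaluate on for a GLUE task."""
    if task == "mnli":
        return "validation_matched"      # in-domain validation data
    if task == "mnli-mm":
        return "validation_mismatched"   # out-of-domain validation data
    return "validation"                  # every other GLUE task has a single split

for name in ("cola", "mnli", "mnli-mm"):
    print(name, "->", validation_split_for(name))
```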

In [10]:
dataset["train"][0] # shows the first entry of the train data

{'idx': 0,
 'label': 1,
 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

**Some random elements**

In [11]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
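The loop above draws distinct indices by retrying until it finds an unused one. As a sketch of an alternative, `random.sample` from the standard library draws without replacement in a single call. The helper name `pick_distinct_indices` is hypothetical and only illustrates the index-picking step, not the display logic.

```python
import random

def pick_distinct_indices(dataset_len: int, num_examples: int = 10) -> list:
    """Pick `num_examples` distinct row indices from a dataset of `dataset_len` rows."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no index can repeat.
    return random.sample(range(dataset_len), num_examples)

picks = pick_distinct_indices(8551, 10)  # 8551 is the size of the CoLA train split
print(picks)
```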

In [12]:
show_random_elements(dataset["train"])

| | sentence | label | idx |
|---|---|---|---|
| 0 | Some people like bagels, but others cream cheese. | acceptable | 7091 |
| 1 | The stone knocked against the pole into the road. | unacceptable | 603 |
| 2 | This girl in the red coat will and dress must put a picture of Bill on your desk. | unacceptable | 7169 |
| 3 | Martha carved the baby some wood into a toy. | unacceptable | 2144 |
| 4 | The departing passenger waved at the crowd. | acceptable | 2195 |
| 5 | Reports of which the government prescribes the height of the Lettering on the covers are invariably boring. | unacceptable | 1349 |
| 6 | John is easy to please. | acceptable | 5003 |
| 7 | John wants to come up with as good a solution as Christine did. | acceptable | 5521 |
| 8 | The fence hit. | unacceptable | 2803 |
| 9 | Each of our rabbit and the neighbor's cat likes the other. | acceptable | 7520 |


In [13]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

For different tasks, we have different metrics:\
a. CoLA : **Matthews Correlation Coefficient**\
b. MNLI(matched or mismatched) : **Accuracy**\
c. MRPC : **Accuracy** and **F1 score**\
d. QNLI : **Accuracy**\
e. QQP : **Accuracy** and **F1 score**\
f. RTE : **Accuracy**\
g. SST-2 : **Accuracy**\
h. STS-B : **Pearson Correlation Coefficient** and **Spearman Rank Correlation Coefficient**\
i. WNLI : **Accuracy**

For our CoLA task, the selected metric is the Matthews correlation. We can check it on some random demo predictions.

In [14]:
import numpy as np

demo_preds = np.random.randint(0, 2, size=(64,))
demo_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=demo_preds, references=demo_labels)

{'matthews_correlation': 0.09847634407689815}
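To see why random predictions score close to zero, here is a minimal sketch of the Matthews correlation coefficient for binary labels, computed from the confusion-matrix counts. The function name `matthews_corrcoef_binary` is hypothetical, and returning 0.0 when the denominator is zero is an assumed convention; the GLUE metric itself is not reimplemented here.

```python
import math

def matthews_corrcoef_binary(predictions, references):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) for 0/1 labels."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

refs = [1, 1, 0, 0]
print(matthews_corrcoef_binary([1, 1, 0, 0], refs))  # perfect agreement
print(matthews_corrcoef_binary([0, 0, 1, 1], refs))  # complete disagreement
print(matthews_corrcoef_binary([1, 0, 1, 0], refs))  # no better than chance
```

The score ranges from -1 to 1, with 0 meaning the predictions carry no information about the labels, which is roughly what the random demo above produces.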

**Preprocessing the Data**

We are going to use the [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) base model. DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts using the BERT base model.

Next we load the tokenizer that matches the checkpoint we just picked, so the text is tokenized the way the model expects.\
We pass *use_fast=True* in the call to get one of the fast tokenizers from the tokenizers library.

In [15]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

The tokenizer can be called on a single sentence or on a pair of sentences:

In [16]:
tokenizer("This is first demo sentence!", "And this is second sentence that goes with first one")

{'input_ids': [101, 2023, 2003, 2034, 9703, 6251, 999, 102, 1998, 2023, 2003, 2117, 6251, 2008, 3632, 2007, 2034, 2028, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
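To read the output above: in the BERT uncased vocabulary, 101 is the `[CLS]` token and 102 is the `[SEP]` token, so a sentence pair is encoded as `[CLS] first [SEP] second [SEP]`. DistilBERT does not use `token_type_ids`, which is why only `input_ids` and `attention_mask` are returned. The sketch below splits the IDs printed above back into their two segments; `split_segments` is a hypothetical helper, not a tokenizer method.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the BERT uncased vocabulary

# input_ids copied from the tokenizer output above
input_ids = [101, 2023, 2003, 2034, 9703, 6251, 999, 102,
             1998, 2023, 2003, 2117, 6251, 2008, 3632, 2007, 2034, 2028, 102]

def split_segments(ids):
    """Split an encoded input into its segments, dropping the special tokens."""
    assert ids[0] == CLS_ID and ids[-1] == SEP_ID
    segments, current = [], []
    for token_id in ids[1:]:          # skip the leading [CLS]
        if token_id == SEP_ID:        # each [SEP] closes one segment
            segments.append(current)
            current = []
        else:
            current.append(token_id)
    return segments

first, second = split_segments(input_ids)
print(len(first), len(second))
```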

To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). The following dictionary keeps track of the correspondence task to column names:

In [17]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

Now let's look up the sentence keys for our task and print the first entry of the train dataset.

In [18]:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: Our friends won't buy this analysis, let alone the next one we propose.


We can then write the function that will preprocess our samples. We just feed them to the tokenizer with the argument truncation=True. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model.

In [19]:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
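As a rough illustration of what truncation does for a single sentence, the sketch below cuts a list of token IDs down to a maximum length while keeping the closing `[SEP]`. It is a simplification under stated assumptions: `truncate_single` is a hypothetical helper, it only handles the single-sentence case, and the limit of 512 is the maximum input length of distilbert-base-uncased. The real tokenizer also handles sentence pairs and other truncation strategies.

```python
SEP_ID = 102  # [SEP] in the BERT uncased vocabulary

def truncate_single(ids, max_length=512):
    """Keep at most `max_length` IDs, always ending with the [SEP] token."""
    if len(ids) <= max_length:
        return list(ids)              # short inputs are left unchanged
    # Drop the overflow from the end, then put [SEP] back as the final token.
    return list(ids[:max_length - 1]) + [SEP_ID]

short_ids = [101, 2023, 2003, 102]
long_ids = [101] + [2023] * 600 + [102]   # 602 IDs, longer than the model accepts

print(len(truncate_single(short_ids)), len(truncate_single(long_ids)))
```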

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [20]:
preprocess_function(dataset['train'][:5])

{'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Now we apply the preprocessing function to every split of the dataset with map:

In [21]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

  0%|          | 0/9 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

The datasets library caches the results so the work is not repeated when the same code runs again. The library reports when it uses cached files; to ignore the cache and force the preprocessing to run again, pass load_from_cache_file=False in the call to map. We passed batched=True to encode the texts in batches, which takes full advantage of the fast tokenizer we loaded earlier: it uses multi-threading to process the texts in a batch concurrently.
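The progress bars above count batches (`ba`), not rows. With batched=True, map passes up to 1000 rows at a time to the function by default, so the totals 9, 2 and 2 follow from the split sizes. A quick sanity check, with `num_batches` as a hypothetical helper:

```python
import math

DEFAULT_BATCH_SIZE = 1000  # default batch_size of Dataset.map when batched=True

# Row counts taken from the DatasetDict printed earlier
split_sizes = {"train": 8551, "validation": 1043, "test": 1063}

def num_batches(num_rows, batch_size=DEFAULT_BATCH_SIZE):
    """Number of calls to the preprocessing function for one split."""
    return math.ceil(num_rows / batch_size)

for split, rows in split_sizes.items():
    print(split, num_batches(rows))
```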

**Fine Tuning Model**

Now that the data is preprocessed, we can download the pretrained model and fine-tune it. All our tasks are about sentence classification, so we use the AutoModelForSequenceClassification class. As with the tokenizer, the from_pretrained method downloads and caches the model for us. We just need to specify the number of labels for the problem (2 for most tasks, 3 for MNLI, and 1 for STS-B, which is a regression task).

In [22]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier

The model we are using, distilbert-base-uncased, was pre-trained on masked language modeling. We want to use it for classification, so the pretraining head is discarded and a new classification head is randomly initialized. The model therefore has to be fine-tuned before it can be used for inference, which is why we get the warning above.

We need to define the training arguments as well. 

In [23]:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
    push_to_hub_model_id=f"{model_name}-finetuned-{task}",
)

By setting evaluation_strategy = "epoch", evaluation runs at the end of each epoch. With load_best_model_at_end=True, the Trainer reloads at the end of training the saved checkpoint that scored best on metric_for_best_model, which may not be the last one.
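Conceptually, picking the best model comes down to comparing the per-epoch evaluation scores and keeping the checkpoint with the best one. The sketch below illustrates the idea only; `best_checkpoint` is a hypothetical helper, not the Trainer's internal code, and the scores are made-up numbers, not results from this notebook.

```python
def best_checkpoint(eval_history, metric_name, greater_is_better=True):
    """Return the evaluation entry with the best value of `metric_name`."""
    chooser = max if greater_is_better else min
    return chooser(eval_history, key=lambda entry: entry[metric_name])

# Hypothetical per-epoch scores, for illustration only.
history = [
    {"epoch": 1, "matthews_correlation": 0.40},
    {"epoch": 2, "matthews_correlation": 0.48},
    {"epoch": 3, "matthews_correlation": 0.52},
    {"epoch": 4, "matthews_correlation": 0.50},
    {"epoch": 5, "matthews_correlation": 0.51},
]

print(best_checkpoint(history, "matthews_correlation")["epoch"])
```

In this made-up history the third epoch would be reloaded, even though training continued for two more epochs.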

Now we need to define a function that computes the metrics from the predictions. It uses the metric we loaded earlier; the only preprocessing is taking the argmax of the predicted logits (for STS-B, a regression task, we take the single output column instead).

In [24]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
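To make the two branches concrete, here is a plain-Python sketch of the same conversion, written without NumPy so it runs anywhere. `logits_to_predictions` is a hypothetical helper and the logits are made-up values.

```python
def logits_to_predictions(logits, is_regression=False):
    """Turn model outputs into predictions, mirroring the two branches above."""
    if is_regression:
        # Regression (STS-B): the model emits one value per example.
        return [row[0] for row in logits]
    # Classification: index of the largest logit (first index wins on ties).
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

class_logits = [[2.1, -0.3], [-1.0, 0.7], [0.2, 0.2]]  # made-up 2-class logits
print(logits_to_predictions(class_logits))
print(logits_to_predictions([[3.4], [0.9]], is_regression=True))
```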

Now we pass the model, the training arguments, the datasets, the tokenizer and the metric function to the Trainer.

In [None]:
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Cloning https://huggingface.co/VirenS13117/distilbert-base-uncased-finetuned-cola into local empty directory.


Download file pytorch_model.bin:   0%|          | 595/255M [00:00<?, ?B/s]

Download file runs/Sep21_21-15-21_3102c17bc505/events.out.tfevents.1632260002.3102c17bc505.79.0:  62%|######2 …

Download file runs/Sep21_21-15-21_3102c17bc505/1632260002.7672696/events.out.tfevents.1632260002.3102c17bc505.…

Clean file runs/Sep21_21-15-21_3102c17bc505/events.out.tfevents.1632260002.3102c17bc505.79.0:  18%|#8        |…

Clean file runs/Sep21_21-15-21_3102c17bc505/1632260002.7672696/events.out.tfevents.1632260002.3102c17bc505.79.…

Download file training_args.bin:  70%|#######   | 1.84k/2.61k [00:00<?, ?B/s]

Download file runs/Sep21_21-15-21_3102c17bc505/events.out.tfevents.1632260388.3102c17bc505.79.2: 100%|########…

Clean file training_args.bin:  38%|###8      | 1.00k/2.61k [00:00<?, ?B/s]

Clean file runs/Sep21_21-15-21_3102c17bc505/events.out.tfevents.1632260388.3102c17bc505.79.2: 100%|##########|…

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
trainer.push_to_hub()

Uploaded the results to the Hub

Now we need to do a hyperparameter search to optimize our model

**Hyperparameter Search**

In [None]:
!pip install optuna

For the hyperparameter search, the Trainer runs several trainings, so it needs the model defined via a function that it can call to create a fresh model at the start of each run.

In [None]:

from transformers import AutoModelForSequenceClassification

# model = AutoModelForSequenceClassification.from_pretrained("sgugger/my-awesome-model")

In [None]:
def model_init():
  return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Now we create the Trainer as before, passing model_init instead of a model:

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
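The call above relies on the default search space. As a sketch of how a custom one could look with the Optuna backend, the function below receives an Optuna trial and returns one value per hyperparameter. The ranges are illustrative assumptions, not tuned recommendations, and the keys are assumed to match TrainingArguments field names.

```python
def hp_space(trial):
    """Hypothetical Optuna search space; the ranges are illustrative only."""
    return {
        # Sample the learning rate on a log scale.
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }
```

It would be passed to the search as `trainer.hyperparameter_search(hp_space=hp_space, n_trials=10, direction="maximize")`.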

In [None]:
best_run

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()