# About:
- This notebook follows the Hugging Face text classification tutorial: https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb

### The GLUE Benchmark 
- GLUE is a group of nine classification tasks on sentences or pairs of sentences
- most tasks are classification with 2 or 3 labels, except STS-B, which is regression
    - CoLA (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not. (This dataset contains sentences labeled as grammatically correct or not.)
    - MNLI (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.)
    - MRPC (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
    - QNLI (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
    - QQP (Quora Question Pairs2) Determine if two questions are semantically equivalent or not.
    - RTE (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
    - SST-2 (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
    - STS-B (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
    - WNLI (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)

In [1]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"] # these are the abbreviations used to load each dataset

In [2]:
task = "cola"     
model_checkpoint = "distilbert-base-uncased"                 # pre-trained model utilized
batch_size = 16                                         

## 1. load_dataset
- depending on the task, each example contains either a single sentence or a pair of sentences

In [3]:
from datasets import load_dataset, load_metric

In [4]:
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)

Reusing dataset glue (C:\Users\tanch\.cache\huggingface\datasets\glue\cola\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)


In [5]:
dataset['train'][0]                       # CoLA is a single-sentence dataset

{'idx': 0,
 'label': 1,
 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

In [6]:
load_dataset("glue", "mnli")['train'][0]  # MNLI is a sentence-pair dataset

Reusing dataset glue (C:\Users\tanch\.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)


{'hypothesis': 'Product and geography are what make cream skimming work. ',
 'idx': 0,
 'label': 1,
 'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.'}

In [7]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [8]:
show_random_elements(dataset["train"])

Unnamed: 0,idx,label,sentence
0,7984,acceptable,That photograph of Jane of Lucy's
1,6325,acceptable,Heidi gave a present to herself.
2,8510,unacceptable,There were killed three men by the assassin.
3,2594,acceptable,Jessica sprayed paint on the wall.
4,38,acceptable,They made him president.
5,702,acceptable,John spoke to Mary intimately.
6,5847,acceptable,I think he will eat asparagus.
7,7456,acceptable,Will John not go to school?
8,2504,acceptable,Amanda burned the stove black.
9,7086,acceptable,Either Dana or Lee are going to lead the parade.


## 2. load_metric
- each task has its own associated metric

    - for CoLA: Matthews Correlation Coefficient
    - for MNLI (matched or mismatched): Accuracy
    - for MRPC: Accuracy and F1 score
    - for QNLI: Accuracy
    - for QQP: Accuracy and F1 score
    - for RTE: Accuracy
    - for SST-2: Accuracy
    - for STS-B: Pearson Correlation Coefficient and Spearman's Rank Correlation Coefficient
    - for WNLI: Accuracy
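
To make the CoLA metric concrete, here is a minimal pure-Python sketch of the Matthews correlation coefficient for binary labels. It is not the code behind `load_metric`; the function name is hypothetical, and returning 0.0 when the denominator is zero is a convention assumed here for illustration.

```python
import math

def matthews_corrcoef_binary(references, predictions):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Degenerate case, e.g. every example is predicted as the same class.
        # Assumed convention for this sketch: report 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator

references = [1, 1, 0, 0, 1, 0]
perfect = matthews_corrcoef_binary(references, [1, 1, 0, 0, 1, 0])
one_class = matthews_corrcoef_binary(references, [1, 1, 1, 1, 1, 1])
inverted = matthews_corrcoef_binary(references, [0, 0, 1, 1, 0, 1])
print(perfect, one_class, inverted)
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), which makes it more informative than accuracy when the classes are imbalanced.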

In [9]:
metric = load_metric('glue', actual_task)

In [10]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
""", stored examples: 0)

## 3. Pre-processing: 
- pre-processing differs slightly depending on whether the task takes 1 or 2 sentences
- for 1 sentence:
    - a [CLS] token is added at the start of the sentence and a [SEP] token at the end
- for 2 sentences:
    - in addition to the steps above, the second sentence is appended, followed by another [SEP] token
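
The layout described above can be sketched in a few lines. This is only an illustration: the helper `build_inputs` is hypothetical, and whitespace splitting stands in for the real subword tokenizer, which also converts tokens to ids.

```python
def build_inputs(tokens_a, tokens_b=None):
    """Add BERT-style special tokens to one or two tokenized sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b is not None:
        # The second sentence is appended and closed with its own [SEP].
        tokens = tokens + tokens_b + ["[SEP]"]
    return tokens

# Whitespace splitting is an assumption made to keep the sketch self-contained.
single = build_inputs("they made him president".split())
pair = build_inputs("is it raining".split(), "it is raining today".split())
print(single)
print(pair)
```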

In [11]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, 
                                          use_fast=True    # uses fast tokenizer
                                          )

In [12]:
task_to_keys = {                                 # maps each task to its input column names; None marks a single-sentence task
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

In [13]:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: Our friends won't buy this analysis, let alone the next one we propose.


In [14]:
# this function tokenizes either a single sentence or a sentence pair, depending on the task
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

In [17]:
encoded_dataset = dataset.map(preprocess_function, 
                              batched=True      # this leverages the fast tokenizers to tokenize multiple samples concurrently                      
                              )



## 4. AutoModelForSequenceClassification:
- "AutoModel" infers the pre-trained model architecture from model_checkpoint and loads it
- "ForSequenceClassification" adds a sequence classification head on top of the pre-trained model
    - for sequence classification, we are essentially applying a classification model to the hidden state of the [CLS] token
    - the reason is that BERT-style models are designed to use the final hidden state of the [CLS] token as an aggregate representation of the whole input sequence
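
As a rough picture of what such a head does, the sketch below applies a single linear layer and a softmax to the first ([CLS]) hidden vector. Everything here is a simplification with made-up numbers: the function names are hypothetical, and the real head for this checkpoint contains more than one layer (the loading message below mentions `pre_classifier` weights).

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify_from_cls(hidden_states, weights, bias):
    """Apply a linear layer to the first token's hidden vector, then softmax."""
    cls_vector = hidden_states[0]  # position 0 holds the [CLS] token
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Toy values: 3 tokens with hidden size 3, and a 2-class head.
hidden_states = [[0.5, -1.0, 2.0], [0.1, 0.3, -0.2], [1.5, 0.0, 0.7]]
weights = [[1.0, 0.0, 0.5], [-1.0, 0.5, 0.0]]
bias = [0.0, 0.1]
probs = classify_from_cls(hidden_states, weights, bias)
print(probs)
```

Only the first row of `hidden_states` influences the result, which is why the quality of the [CLS] representation matters so much for this kind of task.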

In [18]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2               # regression task has 1 label
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
                                                           num_labels=num_labels       # num_labels must be specified so that the classification head has the correct number of output nodes 
                                                           )

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

## 5. TrainingArguments
- TrainingArguments sets the hyperparameters and other options that control how training is run

In [19]:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"

args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",                           
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,                               # the best model may not be the one at the end of training, so this reloads the best checkpoint (judged by metric_for_best_model) when training finishes
    metric_for_best_model=metric_name,
)

## 6. Trainer


In [22]:
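import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)   # classification: pick the highest-scoring class
    else:
        predictions = predictions[:, 0]                # regression: the single output is the predicted score
    return metric.compute(predictions=predictions, references=labels)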
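
For readers less familiar with NumPy, the two branches of `compute_metrics` can be mirrored in plain Python. This is an illustrative sketch with made-up logits; the helper name `logits_to_predictions` is hypothetical.

```python
def logits_to_predictions(logits, is_regression=False):
    """Convert raw model outputs into the values the metric expects."""
    if is_regression:
        # One output per example: use it directly as the predicted score.
        return [row[0] for row in logits]
    # Classification: index of the largest logit in each row.
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

class_preds = logits_to_predictions([[0.1, 0.9], [2.0, -1.0], [0.3, 0.2]])
score_preds = logits_to_predictions([[3.2], [0.5]], is_regression=True)
print(class_preds)
print(score_preds)
```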

In [23]:
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],                  # input should be tokenized
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,                                     # the tokenizer is passed again so that the samples in each batch are padded to the same length
    compute_metrics=compute_metrics
)
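
The padding mentioned in the comment above can be sketched as follows. This is a simplified illustration, not the collator the Trainer uses: `pad_batch` is a hypothetical helper, the token ids are made up, and a padding id of 0 is an assumption.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Illustrative token ids only.
batch = pad_batch([[101, 2256, 102], [101, 2256, 2814, 2180, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short batches short, which is why `preprocess_function` only truncates and leaves padding to training time.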

In [24]:
trainer.train()

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.521356,0.466718,0.47875
2,0.344652,0.475714,0.498651
3,0.229314,0.652028,0.522169
4,0.15914,0.806316,0.526802
5,0.120497,0.879817,0.53902


TrainOutput(global_step=2675, training_loss=0.26428672933132846)

In [25]:
trainer.evaluate()       # the per-epoch results above show which epoch produced the best model;
                         # the metrics returned here should match that epoch if the best model was loaded at the end of training

{'eval_loss': 0.8798167109489441,
 'eval_matthews_correlation': 0.539019545585709,
 'epoch': 5.0}
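
The choice of `metric_for_best_model` matters here. The sketch below reuses the per-epoch validation numbers reported by `trainer.train()` above and picks the best epoch under two different criteria. The helper `best_epoch` is hypothetical and only mimics the idea of the selection, not the Trainer's internals.

```python
# Validation results per epoch, copied from the training table above.
history = [
    {"epoch": 1, "eval_loss": 0.466718, "eval_matthews_correlation": 0.478750},
    {"epoch": 2, "eval_loss": 0.475714, "eval_matthews_correlation": 0.498651},
    {"epoch": 3, "eval_loss": 0.652028, "eval_matthews_correlation": 0.522169},
    {"epoch": 4, "eval_loss": 0.806316, "eval_matthews_correlation": 0.526802},
    {"epoch": 5, "eval_loss": 0.879817, "eval_matthews_correlation": 0.539020},
]

def best_epoch(history, metric_name, greater_is_better=True):
    """Return the epoch whose logged metric is best."""
    pick = max if greater_is_better else min
    return pick(history, key=lambda row: row[metric_name])["epoch"]

by_mcc = best_epoch(history, "eval_matthews_correlation")
by_loss = best_epoch(history, "eval_loss", greater_is_better=False)
print(by_mcc)   # 5
print(by_loss)  # 1
```

Selecting by Matthews correlation keeps epoch 5, which agrees with the `trainer.evaluate()` output above, whereas selecting by validation loss would have kept epoch 1.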

## 7. Hyperparameter search
- Trainer supports hyperparameter search using Optuna or Ray Tune.
- During hyperparameter search, the Trainer runs several trainings, so the model must be defined via a function (so that it can be re-initialized at the start of each run) instead of being passed in directly.
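
The reason for passing a function can be seen in a toy search loop. The sketch below uses plain random search, not the sampler Optuna uses, and every name in it (`sample_hyperparameters`, `random_search`, the toy model and objective) is hypothetical. The sampled ranges are assumptions chosen to resemble the values that appear in the trial logs further down.

```python
import math
import random

def sample_hyperparameters(rng):
    """Draw one candidate configuration (ranges are assumptions)."""
    log_lr = rng.uniform(math.log(1e-6), math.log(1e-4))  # log-uniform learning rate
    return {
        "learning_rate": math.exp(log_lr),
        "num_train_epochs": rng.randint(1, 5),
        "per_device_train_batch_size": rng.choice([4, 8, 16, 32, 64]),
    }

def random_search(model_init, train_and_score, n_trials, seed=0):
    """Run n_trials independent trainings and keep the best-scoring one."""
    rng = random.Random(seed)
    best = None
    for run_id in range(n_trials):
        params = sample_hyperparameters(rng)
        model = model_init()  # fresh, untrained model for every trial
        score = train_and_score(model, params)
        if best is None or score > best["objective"]:
            best = {"run_id": run_id, "objective": score, "hyperparameters": params}
    return best

created = []

def toy_model_init():
    model = {"trained_epochs": 0}  # stand-in for a real model
    created.append(model)
    return model

def toy_train_and_score(model, params):
    model["trained_epochs"] += params["num_train_epochs"]
    # Toy objective that peaks at a learning rate of 5e-5.
    return -abs(math.log10(params["learning_rate"]) - math.log10(5e-5))

best = random_search(toy_model_init, toy_train_and_score, n_trials=10)
print(best)
```

If a single model object were reused instead, each trial would start from the weights left behind by the previous one and the trials would not be comparable.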

In [28]:
# ! pip install optuna

In [29]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

In [35]:
train_dataset_shard = encoded_dataset["train"].shard(index=1, num_shards=10) # this splits the dataset into 10 shards and keeps only the shard at index 1 (indices start at 0) to speed up the search
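
As an illustration of what sharding means, the sketch below computes which row indices would land in a given shard under two common schemes, strided and contiguous. It is not the `datasets` implementation, `shard_indices` is a hypothetical helper, and which scheme the library applies by default may depend on the installed version. Either way, one shard out of 10 holds roughly a tenth of the rows.

```python
def shard_indices(n_rows, num_shards, index, contiguous=False):
    """Return the row indices that belong to one shard."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    if contiguous:
        # Consecutive blocks; the first (n_rows % num_shards) shards get one extra row.
        div, mod = divmod(n_rows, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Strided: every num_shards-th row, starting at `index`.
    return list(range(index, n_rows, num_shards))

strided = shard_indices(25, num_shards=10, index=1)
block = shard_indices(25, num_shards=10, index=1, contiguous=True)
print(strided)  # [1, 11, 21]
print(block)    # [3, 4, 5]
```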

In [36]:
trainer = Trainer(
    model_init=model_init,                           # for hyperparameter search pass the model function through model_init instead
    args=args,
    train_dataset= train_dataset_shard,           # encoded_dataset["train"] replaced with a shard to speed up search
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [40]:
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

[I 2021-06-30 14:13:27,862] A new study created in memory with name: no-name-9807f1be-0af2-45d0-92a8-68306c95c4d3

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.61914,0.0
2,No log,0.625893,0.250219
3,No log,1.024068,0.26401
4,No log,1.239591,0.265278


[I 2021-06-30 14:14:19,723] Trial 0 finished with value: 0.26527750281270746 and parameters: {'learning_rate': 8.183940600837202e-05, 'num_train_epochs': 4, 'seed': 33, 'per_device_train_batch_size': 8}. Best is trial 0 with value: 0.26527750281270746.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.57183,0.046356
2,No log,0.585767,0.304264
3,No log,0.633739,0.332177


[I 2021-06-30 14:14:48,557] Trial 1 finished with value: 0.33217663052054286 and parameters: {'learning_rate': 5.580218474344434e-05, 'num_train_epochs': 3, 'seed': 32, 'per_device_train_batch_size': 32}. Best is trial 1 with value: 0.33217663052054286.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.586148,0.046356


[I 2021-06-30 14:15:02,145] Trial 2 finished with value: 0.0463559874942472 and parameters: {'learning_rate': 8.358367574915814e-05, 'num_train_epochs': 1, 'seed': 15, 'per_device_train_batch_size': 32}. Best is trial 1 with value: 0.33217663052054286.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.624588,0.0
2,No log,0.61699,0.0


[I 2021-06-30 14:15:43,958] Trial 3 finished with value: 0.0 and parameters: {'learning_rate': 1.3912587555729834e-06, 'num_train_epochs': 2, 'seed': 24, 'per_device_train_batch_size': 4}. Best is trial 1 with value: 0.33217663052054286.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.696835,0.020502
2,No log,0.690412,0.000922


[I 2021-06-30 14:16:02,685] Trial 4 finished with value: 0.0009221805058227965 and parameters: {'learning_rate': 1.5249230488150498e-06, 'num_train_epochs': 2, 'seed': 14, 'per_device_train_batch_size': 64}. Best is trial 1 with value: 0.33217663052054286.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.569167,0.157906
2,No log,0.842552,0.320225
3,0.477074,1.038123,0.303876


[I 2021-06-30 14:17:01,622] Trial 5 finished with value: 0.30387589923520775 and parameters: {'learning_rate': 2.0180605906127178e-05, 'num_train_epochs': 3, 'seed': 14, 'per_device_train_batch_size': 4}. Best is trial 1 with value: 0.33217663052054286.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.576581,0.0


[I 2021-06-30 14:17:14,419] Trial 6 pruned.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.558593,0.183694


[I 2021-06-30 14:17:30,896] Trial 7 finished with value: 0.18369395013023057 and parameters: {'learning_rate': 6.9924833297187e-05, 'num_train_epochs': 1, 'seed': 13, 'per_device_train_batch_size': 8}. Best is trial 1 with value: 0.33217663052054286.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.595248,0.0


[I 2021-06-30 14:17:39,211] Trial 8 pruned.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.587466,0.0


[I 2021-06-30 14:17:47,545] Trial 9 pruned.


In [39]:
import mlflow
mlflow.end_run()

In [41]:
best_run

BestRun(run_id='1', objective=0.33217663052054286, hyperparameters={'learning_rate': 5.580218474344434e-05, 'num_train_epochs': 3, 'seed': 32, 'per_device_train_batch_size': 32})

In [42]:
for n, v in best_run.hyperparameters.items():   # train model with best params
    setattr(trainer.args, n, v)

trainer.train()

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.57183,0.046356
2,No log,0.585767,0.304264
3,No log,0.633739,0.332177


TrainOutput(global_step=81, training_loss=0.44485756202980326)