# NOTE! Make sure you're using a GPU runtime

In [None]:
!pip install datasets transformers[sentencepiece] evaluate

Note that this is just **one** way to train. You can also use one of the many example scripts (https://huggingface.co/docs/transformers/run_scripts), train on Amazon SageMaker (https://huggingface.co/docs/sagemaker/train), or skip the Trainer API entirely and write your own PyTorch/Keras/etc. training loop.

Set up from previous code-along...

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Only grabbing the train split since I'm just going to make a small dataset.
raw_datasets = load_dataset("glue", "mrpc", split="train")

# I just want a small subset of train and validation...
raw_datasets = raw_datasets.train_test_split(test_size=0.2, shuffle=True)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Downloading and preparing dataset glue/mrpc (download: 1.43 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 2.85 MiB) to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


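The `DataCollatorWithPadding` created in the setup cell pads each batch to the length of its longest sequence, rather than padding the whole dataset to one fixed length. The sketch below illustrates that idea in plain Python. The helper `pad_batch`, the token ids, and the pad id of 0 are assumptions made for illustration; this is not the library's implementation.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence in `batch` to the longest sequence in that batch.

    `pad_batch` is a hypothetical helper; pad_token_id=0 is an assumed value.
    """
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_token_id] * (max_len - len(ids)) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up tokenized examples of different lengths.
padded = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because padding is decided per batch, a batch of short sentences stays short, which saves computation compared with padding everything to the dataset-wide maximum.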

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 2934
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 734
    })
})
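The row counts above follow from the 3668-example MRPC train split and `test_size=0.2`. A small sketch of the arithmetic, assuming the test portion is rounded up and the remainder goes to train (an assumption about rounding that matches the output shown, not a statement about the library's internals):

```python
import math

n_total = 3668    # size of the GLUE MRPC train split, per the download log
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test portion is rounded up, the rest is used for training.
n_test = math.ceil(n_total * test_size)
n_train = n_total - n_test

print(n_train, n_test)
```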

Log in so that we can upload our model to the Hugging Face Hub

In [None]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store[0m


Define the TrainingArguments!

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test-trainer",
    save_strategy="epoch",  # save a checkpoint at the end of every epoch
    evaluation_strategy="epoch",  # run evaluation at the end of every epoch
    num_train_epochs=1,
    hub_model_id="Sphere-Fall2022/nima-test-bert-glue-live",
)  # Can pass all kinds of stuff to this!

Instantiate your model

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
)  # Note the warning!!

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

Create a compute_metrics function to evaluate the predictions.

In [None]:
import numpy as np
import evaluate

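def compute_metrics(eval_preds):
    # Load the GLUE MRPC metric, which reports accuracy and F1.
    metric = evaluate.load("glue", "mrpc")
    # eval_preds unpacks into the model's logits and the reference labels.
    logits, labels = eval_preds
    # The predicted class is the index of the largest logit for each example.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)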
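To make the `argmax` step concrete: the model returns one logit per class for each example, and the predicted class is the index of the largest logit. The sketch below redoes that step and the two scores MRPC reports (accuracy and F1) in plain Python on made-up logits. The helper name `accuracy_and_f1` and the data are illustrative only; in the notebook the real computation is delegated to `evaluate`.

```python
def accuracy_and_f1(logits, labels):
    """Hypothetical helper: argmax over logits, then accuracy and binary F1."""
    # Index of the largest logit in each row is the predicted class.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]

    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Made-up logits for four examples with two classes each.
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4], [1.5, 0.2]]
toy_labels = [1, 0, 0, 1]
scores = accuracy_and_f1(toy_logits, toy_labels)
print(scores)
```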

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],  # The held-out 20% from train_test_split serves as the validation set
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
!rm -rf test-trainer

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2934
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 367
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.42934 | 0.80109 | 0.850103 |


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 734
  Batch size = 8


Saving model checkpoint to test-trainer/checkpoint-367
Configuration saved in test-trainer/checkpoint-367/config.json
Model weights saved in test-trainer/checkpoint-367/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-367/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-367/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=367, training_loss=0.5084189017397182, metrics={'train_runtime': 33.1764, 'train_samples_per_second': 88.436, 'train_steps_per_second': 11.062, 'total_flos': 54583812557136.0, 'train_loss': 0.5084189017397182, 'epoch': 1.0})
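The numbers in the training log can be cross-checked by hand. The sketch below assumes a single device and no gradient accumulation, as the log reports, and that the final partial batch still counts as a step. It also suggests why the results table shows `No log` for the training loss: the default `logging_steps` in `TrainingArguments` is 500 (treated here as an assumption about the installed version), which is more than the number of steps in this run.

```python
import math

num_examples = 2934       # "Num examples" from the training log
batch_size = 8            # "Instantaneous batch size per device"
num_epochs = 1
train_runtime = 33.1764   # seconds, from the TrainOutput metrics

# One optimization step per batch; the last, smaller batch still counts.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = round(total_steps / train_runtime, 3)
samples_per_second = round(num_examples * num_epochs / train_runtime, 3)

# Assumed default logging interval; no training loss is logged before it is reached.
logging_steps = 500
training_loss_logged = total_steps >= logging_steps

print(total_steps, steps_per_second, samples_per_second, training_loss_logged)
```

Passing a smaller `logging_steps` to `TrainingArguments` should make the training loss appear in the table for a short run like this one.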

In [None]:
trainer.push_to_hub()

Saving model checkpoint to test-trainer
Configuration saved in test-trainer/config.json
Model weights saved in test-trainer/pytorch_model.bin
tokenizer config file saved in test-trainer/tokenizer_config.json
Special tokens file saved in test-trainer/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/Sphere-Fall2022/nima-test-bert-glue
   3a0988f..e4741a1  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Text Classification', 'type': 'text-classification'}, 'dataset': {'name': 'glue', 'type': 'glue', 'config': 'mrpc', 'split': 'train', 'args': 'mrpc'}}
To https://huggingface.co/Sphere-Fall2022/nima-test-bert-glue
   e4741a1..f8094d4  main -> main




'https://huggingface.co/Sphere-Fall2022/nima-test-bert-glue/commit/e4741a1f146b40efd3e8de507b441a65f552fdd4'