Hugging Face Transformers installation: https://huggingface.co/transformers/installation.html  
Sample code: https://huggingface.co/transformers/training.html  
Datasets installation: https://huggingface.co/docs/datasets/installation.html

In [1]:
import transformers
transformers.__version__

'4.10.0'

In [2]:
from datasets import load_dataset
raw_datasets = load_dataset("imdb")

Reusing dataset imdb (/Users/kai/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


In [3]:
type(raw_datasets)

datasets.dataset_dict.DatasetDict

In [4]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [5]:
sentences = """The steep sell-off came less than one day after JPMorgan analysts warned in a note to clients that recently rallying altcoins—or cryptocurrency alternatives to bitcoin and ether—reflected "froth and retail investor mania," as opposed to sustainable gains for the market. "The August rally in non-fungible tokens and the pickup in decentralized finance activity have helped not only ethereum but also alternative cryptocurrencies that facilitate or plan to facilitate smart contracts, such as Solana, Binance Coin and Cardano," JPMorgan Managing Director Nikolaos Panigirtzoglou said Monday. "The previous phase of retail investors' mania into cryptocurrency markets was between the beginning of January and mid-May... and retail investors are making cryptocurrency markets look frothy again." After the bouts of retail-investing mania in January and May, crypto markets crashed about 13% and 50%, respectively."""

In [6]:
inputs = tokenizer(sentences, padding="max_length", truncation=True)
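
As a rough illustration of what padding="max_length" and truncation=True do to a sequence of token ids, here is a dependency-free sketch. The helper pad_or_truncate, the pad id of 0, and the tiny max_length of 8 are assumptions made for this example only; the real tokenizer also inserts special tokens and works with a much larger maximum length.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for padding="max_length" plus truncation=True."""
    # Truncation: keep at most max_length ids.
    ids = list(token_ids[:max_length])
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(ids)
    # Padding: extend short sequences up to max_length.
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

short = pad_or_truncate([101, 7, 8, 102], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)
print(short["input_ids"])       # [101, 7, 8, 102, 0, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
print(len(long["input_ids"]))   # 8
```

Every sequence comes out with the same length, which is what lets examples be stacked into fixed-size batches later on.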

The tokenizer call above handles a single string. To apply the same preprocessing steps to all the splits of our dataset at once, we can use the map method:

In [7]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
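
With batched=True, the function passed to map receives a dictionary of lists (one list per column) rather than a single example, and the keys it returns become new columns. The sketch below mimics that contract in plain Python; toy_batched_map, the whitespace-based toy_tokenize, and the batch_size of 2 are hypothetical stand-ins, not the 🤗 Datasets implementation.

```python
def toy_batched_map(rows, function, batch_size=2):
    """Apply `function` to column-wise batches and merge the new columns."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        # Merge the returned columns back into each row of the chunk.
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            out.append(merged)
    return out

def toy_tokenize(examples):
    # Whitespace splitting stands in for the real tokenizer.
    return {"tokens": [text.split() for text in examples["text"]]}

rows = [
    {"text": "a great film", "label": 1},
    {"text": "dull", "label": 0},
    {"text": "not bad at all", "label": 1},
]
mapped = toy_batched_map(rows, toy_tokenize)
print(mapped[0])  # {'text': 'a great film', 'label': 1, 'tokens': ['a', 'great', 'film']}
```

The original text and label columns survive alongside the new one, which matches the log message further down about the unused text column being ignored during training.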

Next we will generate a small subset of the training and validation set, to enable faster training:

In [8]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
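
Conceptually, shuffle(seed=42).select(range(1000)) draws a reproducible random sample: the seed fixes the permutation, and select keeps the first 1000 rows of it. A minimal sketch with the standard library follows; toy_subset is a hypothetical helper, and the permutation it produces will not match the one 🤗 Datasets generates for the same seed.

```python
import random

def toy_subset(n_rows, n_keep, seed=42):
    """Return n_keep row indices from a seeded permutation of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same order
    return indices[:n_keep]

# 25000 is the size of the IMDB training split.
first = toy_subset(25000, 1000)
second = toy_subset(25000, 1000)
print(len(first), first == second)  # 1000 True
```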

In all the examples below, we will always use small_train_dataset and small_eval_dataset. Just replace them with their full equivalents to train or evaluate on the full dataset.

Since PyTorch does not provide a training loop, the 🤗 Transformers library provides a Trainer API that is optimized for 🤗 Transformers models, with a wide range of training options and with built-in features like logging, gradient accumulation, and mixed precision.

First, let's define our model:

In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [10]:
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer")

In [11]:
from transformers import Trainer

trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)

To fine-tune our model, we just need to call

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375


Step,Training Loss


which will start a training run that you can follow with a progress bar and which should take a couple of minutes to complete (as long as you have access to a GPU). However, it won't tell you anything useful about how well (or badly) your model is performing: by default there is no evaluation during training, and we didn't tell the Trainer to compute any metrics. Let's look at how to do that now!
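
The figure of 375 optimization steps in the log above follows from the other values it reports: 1000 examples at a batch size of 8 give 125 steps per epoch, and 3 epochs give 375. A small check, assuming a single device and the gradient accumulation of 1 shown in the log (total_optimization_steps is a hypothetical helper, not a Trainer function):

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum=1):
    # Batches per epoch; a final partial batch still counts as a batch.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update every `grad_accum` batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * num_epochs

print(total_optimization_steps(1000, 8, 3))  # 375
```

Switching to full_train_dataset or changing the batch size changes this number accordingly, which gives a quick way to estimate how long a run will take.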

To have the Trainer compute and report metrics, we need to give it a compute_metrics function that takes predictions and labels (grouped in a namedtuple called EvalPrediction) and returns a dictionary with string keys (the metric names) and float values (the metric values).

The 🤗 Datasets library provides an easy way to get the common metrics used in NLP with the load_metric function. Here we simply use accuracy. Then we define a compute_metrics function that converts logits to predictions (remember that all 🤗 Transformers models return the logits) and feeds them to the compute method of this metric.

In [None]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

The compute function needs to receive a tuple (with logits and labels) and has to return a dictionary with string keys (the name of the metric) and float values. It will be called at the end of each evaluation phase on the whole arrays of predictions/labels.
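
To make the computation concrete, here is a dependency-free sketch of what compute_metrics does for accuracy: take the argmax of each row of logits and compare it with the label. The logits and labels are made-up values, and toy_compute_metrics is a hypothetical stand-in for the NumPy and load_metric version in the cell above.

```python
def toy_compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Argmax over the class dimension: index of the largest logit per row.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.4], [0.9, 0.1]]
labels = [0, 1, 0, 0]
print(toy_compute_metrics((logits, labels)))  # {'accuracy': 0.75}
```

The third example is the only miss: its larger logit is at index 1 while its label is 0, so 3 of 4 predictions are correct.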

To check if this works in practice, let's create a new Trainer with our fine-tuned model:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.evaluate()

which showed an accuracy of 87.5% in our case.

If you want to fine-tune your model and regularly report the evaluation metrics (for instance at the end of each epoch), here is how you should define your training arguments:

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch")