https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb 


In [1]:
import transformers

print(transformers.__version__)

4.24.0


In [2]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

In [13]:
from datasets import load_dataset, load_metric, ClassLabel

In [4]:
task = "mnli"
dataset = load_dataset("glue", task)
metric = load_metric("glue", task)

Downloading builder script: 28.8kB [00:00, 20.0MB/s]
Downloading metadata: 28.7kB [00:00, 25.0MB/s]
Downloading readme: 27.9kB [00:00, 31.3MB/s]


Downloading and preparing dataset glue/mnli to /users/k21193529/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading data: 100%|██████████| 313M/313M [00:01<00:00, 159MB/s]

Dataset glue downloaded and prepared to /users/k21193529/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


100%|██████████| 5/5 [00:00<00:00, 666.57it/s]
Downloading builder script: 5.76kB [00:00, 5.86MB/s]


In [40]:
dataset

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9847
    })
})

In [41]:
dataset['train'].features

{'premise': Value(dtype='string', id=None),
 'hypothesis': Value(dtype='string', id=None),
 'label': ClassLabel(names=['entailment', 'neutral', 'contradiction'], id=None),
 'idx': Value(dtype='int32', id=None)}
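
The `ClassLabel` feature stores each label as an integer whose position follows the order of `names`. A minimal sketch of that mapping in plain Python; the `id2label` and `label2id` names are illustrative and do not appear in the notebook:

```python
# Label names in the order reported by dataset['train'].features['label'].
label_names = ["entailment", "neutral", "contradiction"]

# Integer id -> name, and name -> integer id (illustrative helper dicts).
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[2])          # contradiction
print(label2id["neutral"])  # 1
```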

In [14]:
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
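
The retry loop in `show_random_elements` exists only to avoid picking the same row twice. As a possible alternative, `random.sample` draws distinct indices in one call. The helper name `pick_unique_indices` and its `seed` parameter are assumptions made for this sketch:

```python
import random

def pick_unique_indices(n_rows, num_examples=10, seed=None):
    """Return num_examples distinct row indices in the range [0, n_rows)."""
    if num_examples > n_rows:
        raise ValueError("Can't pick more elements than there are in dataset.")
    rng = random.Random(seed)  # a local generator keeps the global random state untouched
    return rng.sample(range(n_rows), num_examples)

# 392702 is the number of rows in the MNLI train split shown earlier.
picks = pick_unique_indices(392702, num_examples=10, seed=0)
print(len(picks), len(set(picks)))  # 10 10
```

The resulting list could be passed to `dataset[picks]` in the same way as the hand-built `picks` list.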

In [15]:
show_random_elements(dataset['train'])

,premise,hypothesis,label,idx
0,Haynes Johnson ( NewsHour ) credits him with bringing the South and Sunbelt into the GOP.,Haynes Johnson thought he changed the political opinions of southerners.,neutral,195422
1,What time did you go out last evening?,What time did you go out last night? 8?,neutral,218361
2,"When you reach an impressive plaza with a 16th-century church and whitewashed town hall, you may feel that you've come to the top of the town but you haven't.",You haven't come to the top of the town when you reach an impressive plaza with a 16th-century church.,entailment,359426
3,do you think they should in in that profession,Do you believe they should in that job?,entailment,107527
4,"Even if it were, we couldn't afford it.","Even if it were, we'd easily be able to afford it.",contradiction,32591
5,Set aside money specifically for planning.,Save money for planning.,entailment,108719
6,"But there is one place where Will's journalism does seem to matter, where he does toss baseball.",Will can't do anything if it has nothing to do with baseball,neutral,294315
7,The Straits Settlements were formed after the Anglo-Dutch Treaty of London (1824).,A treaty between the English and Dutch enabled the creation of the Straits Settlements.,entailment,223306
8,i uh for the longest time i i'd gotten rid of my gas credit cards and then all of a sudden i started getting a flurry of these things so i i did i have picked up three gas credit cards in the last couple months,Gas credit cards have helped my situation immensely,neutral,199554
9,Life was about to get very difficult until Adrin and San'doro came back to him.,Adrin and San'doro made things harder for him.,contradiction,321656


In [16]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

In [19]:
from transformers import AutoTokenizer

task = "mnli"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Downloading (…)okenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 6.64kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 483/483 [00:00<00:00, 127kB/s]
Downloading (…)solve/main/vocab.txt: 232kB [00:00, 11.0MB/s]
Downloading (…)/main/tokenizer.json: 466kB [00:00, 8.08MB/s]


In [20]:
# You can call the tokenizer directly on one sentence or on a pair of sentences.
tokenizer("hello, this is one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [21]:
tokenizer.decode(tokenizer("hello, this is one sentence!", "And this sentence goes with it.")['input_ids'])

'[CLS] hello, this is one sentence! [SEP] and this sentence goes with it. [SEP]'
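
The decoded string shows the layout used for a sentence pair: `[CLS] A [SEP] B [SEP]`. The sketch below rebuilds the `input_ids` printed earlier from the token ids of the two sentences. `build_pair` is a hypothetical helper, not part of the Transformers API, and the ids 101 and 102 are read off the output rather than looked up in the vocabulary:

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] as they appear in the output above

def build_pair(ids_a, ids_b):
    """Lay out two tokenized sentences as [CLS] A [SEP] B [SEP]."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # no padding here, so every position is attended to
    return {"input_ids": input_ids, "attention_mask": attention_mask}

sentence_a = [7592, 1010, 2023, 2003, 2028, 6251, 999]    # "hello, this is one sentence!"
sentence_b = [1998, 2023, 6251, 3632, 2007, 2009, 1012]   # "and this sentence goes with it."
encoded = build_pair(sentence_a, sentence_b)
print(len(encoded["input_ids"]))  # 17
```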

In [22]:
# Column names that hold the two input sentences for MNLI
key1 = "premise"
key2 = "hypothesis"

def preprocess_function(examples):
    return tokenizer(examples[key1], examples[key2], truncation=True)
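
The two keys are hard-coded for MNLI. A sketch of how the same function could cover every entry in `GLUE_TASKS` with a lookup table follows. The column names for tasks other than MNLI are assumptions that should be checked against `dataset['train'].features`, and `fake_tokenizer` is a hypothetical stand-in so the example runs without downloading a model. With `batched=True`, `examples` is a dict mapping each column name to a list of values:

```python
# Task name -> (first text column, second text column or None for single-sentence tasks).
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

def make_preprocess_function(task, tokenizer):
    """Build a preprocessing function that reads the text columns for `task`."""
    key1, key2 = task_to_keys[task]

    def preprocess_function(examples):
        if key2 is None:
            return tokenizer(examples[key1], truncation=True)
        return tokenizer(examples[key1], examples[key2], truncation=True)

    return preprocess_function

def fake_tokenizer(first, second=None, truncation=False):
    # Hypothetical stand-in: counts whitespace-separated words instead of producing token ids.
    if second is None:
        return {"n_words": [len(a.split()) for a in first]}
    return {"n_words": [len(a.split()) + len(b.split()) for a, b in zip(first, second)]}

batch = {
    "premise": ["Set aside money specifically for planning."],
    "hypothesis": ["Save money for planning."],
}
result = make_preprocess_function("mnli", fake_tokenizer)(batch)
print(result)  # {'n_words': [10]}
```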

In [23]:
encoded_ds = dataset.map(preprocess_function, batched=True)

                                                                     

In [24]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

In [25]:
num_labels = 3 # mnli has 3 labels
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Downloading pytorch_model.bin: 100%|██████████| 268M/268M [00:02<00:00, 121MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at dist

In [30]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate = 2e-5, 
    per_device_train_batch_size = batch_size, 
    per_device_eval_batch_size = batch_size,
    num_train_epochs = 5, 
    weight_decay = 0.01, 
    load_best_model_at_end = True,
    metric_for_best_model = metric_name,
)

In [31]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions = predictions, references = labels)
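
For MNLI the GLUE metric reports accuracy, so `compute_metrics` amounts to an argmax over the logits followed by a comparison with the labels. A small self-contained sketch with made-up logits; `accuracy_from_logits` is an illustrative name and does not replace `metric.compute`:

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the reference labels."""
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == labels).mean())}

# Four made-up examples with three class scores each (entailment, neutral, contradiction).
logits = np.array([
    [2.0, 0.1, -1.0],   # predicted class 0
    [0.2, 0.3, 1.5],    # predicted class 2
    [0.0, 1.0, 0.5],    # predicted class 1
    [1.2, 0.4, 0.1],    # predicted class 0
])
labels = np.array([0, 2, 2, 0])
result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```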

In [33]:
validation_key = "validation_matched"
trainer = Trainer(
    model, 
    args,
    train_dataset = encoded_ds["train"],
    eval_dataset = encoded_ds[validation_key],
    tokenizer = tokenizer, 
    compute_metrics = compute_metrics
)
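
`preprocess_function` truncates but does not pad, so the examples in a batch have different lengths. When a tokenizer is passed to `Trainer` and no data collator is given, padding is expected to happen per batch at collation time. The sketch below shows the idea in plain Python. `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption about the `distilbert-base-uncased` vocabulary:

```python
def pad_batch(features, pad_token_id=0):
    """Pad every example to the length of the longest one in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get attention mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch

features = [
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 7592, 1010, 2023, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 7592, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding to the longest example in each batch, rather than to a fixed maximum length, keeps short batches small.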

In [35]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise. If idx, hypothesis, premise are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 392702
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 122720
  Number of trainable parameters = 66955779


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2202,0.561836,0.815894
2,0.2168,0.561836,0.815894
3,0.2204,0.561836,0.815894
4,0.2127,0.561836,0.815894
5,0.2098,0.561836,0.815894


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise. If idx, hypothesis, premise are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9815
  Batch size = 16
Saving model checkpoint to distilbert-base-uncased-finetuned-mnli/checkpoint-24544
Configuration saved in distilbert-base-uncased-finetuned-mnli/checkpoint-24544/config.json
Model weights saved in distilbert-base-uncased-finetuned-mnli/checkpoint-24544/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-mnli/checkpoint-24544/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-mnli/checkpoint-24544/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` a

TrainOutput(global_step=122720, training_loss=0.21243214613455683, metrics={'train_runtime': 4394.003, 'train_samples_per_second': 446.861, 'train_steps_per_second': 27.929, 'total_flos': 4.122893008518235e+16, 'train_loss': 0.21243214613455683, 'epoch': 5.0})
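
The step counts in the training log can be checked by hand from the dataset size, batch size, and number of epochs. The second half of the sketch estimates the size of the classification head. It assumes a hidden size of 768 and a head made of a 768x768 pre-classifier layer followed by a 768x3 classifier, which is how the DistilBERT sequence-classification head is understood to be built:

```python
import math

num_examples = 392702   # "Num examples" in the training log
batch_size = 16
num_epochs = 5

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch)  # 24544, matching the first checkpoint name
print(total_steps)      # 122720, matching "Total optimization steps"

# Assumed head layout: Linear(768, 768) then Linear(768, num_labels), each with a bias.
hidden_size = 768
num_labels = 3
head_params = (hidden_size * hidden_size + hidden_size) + (hidden_size * num_labels + num_labels)
encoder_params = 66955779 - head_params  # remainder of "Number of trainable parameters"
print(head_params)      # 592899
print(encoder_params)   # 66362880
```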

In [37]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise. If idx, hypothesis, premise are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9815
  Batch size = 16


{'eval_loss': 0.5618363618850708,
 'eval_accuracy': 0.8158940397350993,
 'eval_runtime': 6.3631,
 'eval_samples_per_second': 1542.497,
 'eval_steps_per_second': 96.494,
 'epoch': 5.0}
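
The evaluation numbers can be cross-checked the same way: the accuracy corresponds to a whole number of correct predictions, and the throughput figures follow from the example count, batch size, and runtime. This is a sanity check on the reported values, not a re-run of the evaluation:

```python
import math

eval_accuracy = 0.8158940397350993
num_examples = 9815
batch_size = 16
eval_runtime = 6.3631  # seconds

correct = round(eval_accuracy * num_examples)
print(correct)  # 8008 correct predictions out of 9815

eval_steps = math.ceil(num_examples / batch_size)
print(eval_steps)  # 614 batches

samples_per_second = num_examples / eval_runtime
steps_per_second = eval_steps / eval_runtime
print(round(samples_per_second, 1), round(steps_per_second, 1))
```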