<a href="https://colab.research.google.com/github/galtay/hacdc/blob/main/hacdc_2020_10_20.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification Example

Reference notebook: https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb

In [1]:
!nvidia-smi

Thu Oct 20 03:39:04 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8    13W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

# Install packages 

In [2]:
%%capture
!pip install rich
!pip install transformers
!pip install datasets
!pip install evaluate

# Import packages

In [30]:
import json
import random

import datasets
import evaluate
import numpy as np
import pandas as pd
from rich import print as rprint
import transformers

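In [4]:
from google.colab import data_table

# Render pandas DataFrames as interactive tables (Colab only).
data_table.enable_dataframe_formatter()

# To switch back to the default rendering:
# data_table.disable_dataframe_formatter()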

# Load Dataset

SciTail dataset page: https://allenai.org/data/scitail

Hugging Face dataset card: https://huggingface.co/datasets/bigscience-biomedical/scitail

Hugging Face Datasets processing guide: https://huggingface.co/docs/datasets/process



## Create a dataset dictionary with Hugging Face `load_dataset`

In [5]:
dsd_raw = datasets.load_dataset('bigscience-biomedical/scitail')

Downloading and preparing dataset scitail/scitail_source to /root/.cache/huggingface/datasets/bigscience-biomedical___scitail/scitail_source/1.1.0/f088c42a59b8388c2a26db50217601aa87fde3776316b6fa21a905f91ac11c48...

Dataset scitail downloaded and prepared to /root/.cache/huggingface/datasets/bigscience-biomedical___scitail/scitail_source/1.1.0/f088c42a59b8388c2a26db50217601aa87fde3776316b6fa21a905f91ac11c48. Subsequent calls will reuse this data.

In [6]:
dsd_raw

DatasetDict({
    train: Dataset({
        features: ['id', 'premise', 'hypothesis', 'label'],
        num_rows: 23596
    })
    test: Dataset({
        features: ['id', 'premise', 'hypothesis', 'label'],
        num_rows: 2126
    })
    validation: Dataset({
        features: ['id', 'premise', 'hypothesis', 'label'],
        num_rows: 1304
    })
})

In [7]:
dsd_raw['train'][0]

{'id': '0',
 'premise': 'Pluto rotates once on its axis every 6.39 Earth days;',
 'hypothesis': 'Earth rotates on its axis once times in one day.',
 'label': 'neutral'}

In [8]:
dsd_raw['train'][1]

{'id': '1',
 'hypothesis': 'Earth rotates on its axis once times in one day.',
 'label': 'entails'}

## Inspect some random samples



In [9]:
df_sample = dsd_raw['train'].shuffle(seed=42).select(range(10)).to_pandas()

In [10]:
df_sample

| | id | premise | hypothesis | label |
|---|---|---|---|---|
| 0 | 1092 | Harte and graduate student Rebecca Shaw report... | If an accident happens during a science experi... | neutral |
| 1 | 412 | As a result, the energy waves that come from t... | The surface of the sun is much hotter than alm... | neutral |
| 2 | 15928 | Vacuoles in cells and tissues. | The cell sap is the liquid inside the central ... | neutral |
| 3 | 3304 | As the population grows, competition for the s... | As the population grows, competition for food ... | entails |
| 4 | 19487 | The first lines of defense, often called the i... | The innate immune system serves as a first res... | entails |
| 5 | 8362 | The three types of radiation that are given of... | Of the three basic types of radioactive emissi... | neutral |
| 6 | 6819 | The waves created by a stringed instrument app... | In an electromagnetic wave, the crests and tro... | neutral |
| 7 | 600 | In a similar way heat enters a liquid to chang... | Evaporation and condensation are similar becau... | neutral |
| 8 | 21680 | But the vernal equinox-when the sun is directl... | The sun is directly over the equator during. | entails |
| 9 | 11310 | Most sugars found naturally in foods or added ... | Double sugars are called disaccharides. | entails |


In [11]:
dsd_raw['train'].to_pandas()['label'].value_counts()

neutral    14994
entails     8602
Name: label, dtype: int64
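
The counts above show that the training split is moderately imbalanced. The following is a minimal sketch of that arithmetic, using only the two counts printed above; the variable names are illustrative and do not appear in the notebook.

```python
# Label counts copied from the value_counts() output above.
label_counts = {"neutral": 14994, "entails": 8602}

total = sum(label_counts.values())
fractions = {label: count / total for label, count in label_counts.items()}

# A classifier that always predicts the most common label would reach this
# accuracy on the training split, so it is a useful floor to compare against.
majority_label = max(label_counts, key=label_counts.get)
majority_accuracy = fractions[majority_label]

print(total)  # 23596, matching num_rows for the train split
print(majority_label, round(majority_accuracy, 4))
```

Any trained model should beat this floor by a clear margin before its accuracy is considered meaningful.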

In [12]:
label_map = {"neutral": 0, "entails": 1}

In [13]:
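def make_int_label(sample):
    """Add an integer `label` derived from the string in `label_str`."""
    sample["label"] = label_map[sample["label_str"]]
    return sample

# Keep the original strings under `label_str`, then add the integer `label`
# column that the model trains on.
dsd = dsd_raw.rename_column("label", "label_str")
dsd = dsd.map(make_int_label)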
dsd

DatasetDict({
    train: Dataset({
        features: ['id', 'premise', 'hypothesis', 'label_str', 'label'],
        num_rows: 23596
    })
    test: Dataset({
        features: ['id', 'premise', 'hypothesis', 'label_str', 'label'],
        num_rows: 2126
    })
    validation: Dataset({
        features: ['id', 'premise', 'hypothesis', 'label_str', 'label'],
        num_rows: 1304
    })
})

In [14]:
dsd['train'][0]

{'id': '0',
 'premise': 'Pluto rotates once on its axis every 6.39 Earth days;',
 'hypothesis': 'Earth rotates on its axis once times in one day.',
 'label_str': 'neutral',
 'label': 0}
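
The model trains on the integer labels, but predictions are easier to read as strings. The sketch below inverts `label_map` for that purpose. `id2label` and `decode_prediction` are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
label_map = {"neutral": 0, "entails": 1}

# Invert the mapping so integer predictions can be turned back into strings.
id2label = {idx: name for name, idx in label_map.items()}


def decode_prediction(label_id):
    """Return the string label for an integer class id (hypothetical helper)."""
    return id2label[label_id]


sample = {"label_str": "neutral", "label": 0}

# Round trip: decoding the integer label should give back the original string.
print(decode_prediction(sample["label"]) == sample["label_str"])  # True
```

To my knowledge, dictionaries like these can also be passed to `from_pretrained` as `id2label` and `label2id` so that the saved model config carries the class names, but that step is not shown in this notebook.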

# Load Metrics

In [15]:
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

In [16]:
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])

{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}
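
To see where these four numbers come from, the same toy example can be worked by hand. This is a plain-Python sketch rather than the `evaluate` implementation; it assumes class 1 is treated as the positive class for precision, recall, and F1.

```python
predictions = [0, 1, 0]
references = [0, 1, 1]

pairs = list(zip(predictions, references))
tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # predicted 1, truly 1
fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # predicted 1, truly 0
fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # predicted 0, truly 1
tn = sum(1 for p, r in pairs if p == 0 and r == 0)  # predicted 0, truly 0

accuracy = (tp + tn) / len(pairs)                   # 2 of 3 correct
precision = tp / (tp + fp)                          # every predicted 1 is right
recall = tp / (tp + fn)                             # one of two true 1s is found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, f1, precision, recall)
```

The third example is the only mistake: a missed positive. It lowers recall to 0.5 while leaving precision at 1.0.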

# Load Tokenizer and Model

Model hub: https://huggingface.co/models

Task summary: https://huggingface.co/docs/transformers/task_summary

In [17]:
model_name = "distilbert-base-uncased"

## Tokenizer

In [18]:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
tokenizer

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [19]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
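
The encoded pair above has the layout `[CLS] first sentence [SEP] second sentence [SEP]`. The output contains no `token_type_ids`, so the separator token is the only marker between the two sentences. The sketch below splits the printed ids back into the two segments. It assumes that 101 and 102 are the `[CLS]` and `[SEP]` ids in this vocabulary, which is consistent with the output above; `split_pair` is a hypothetical helper.

```python
CLS_ID, SEP_ID = 101, 102  # assumed special-token ids for this vocabulary

# input_ids copied from the tokenizer output above.
input_ids = [101, 7592, 1010, 2023, 2028, 6251, 999, 102,
             1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]


def split_pair(ids):
    """Split an encoded sentence pair into its two token-id segments."""
    if ids[0] != CLS_ID or ids[-1] != SEP_ID:
        raise ValueError("expected [CLS] ... [SEP] ... [SEP]")
    first_sep = ids.index(SEP_ID)  # position of the first [SEP]
    first = ids[1:first_sep]       # tokens between [CLS] and the first [SEP]
    second = ids[first_sep + 1:-1]  # tokens between the two [SEP] markers
    return first, second


first, second = split_pair(input_ids)
print(len(first), len(second))  # 6 7
```

The two segments plus the three special tokens account for all 16 ids.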

In [20]:
def preprocess_function(samples):
    # Encode each (premise, hypothesis) pair as one sequence, truncated to the model's maximum length.
    return tokenizer(samples["premise"], samples["hypothesis"], truncation=True)

In [21]:
preprocess_function(dsd['train'][:5])

{'input_ids': [[101, 26930, 24357, 2015, 2320, 2006, 2049, 8123, 2296, 1020, 1012, 4464, 3011, 2420, 1025, 102, 3011, 24357, 2015, 2006, 2049, 8123, 2320, 2335, 1999, 2028, 2154, 1012, 102], [101, 1011, 1011, 1011, 9465, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 1027, 2320, 2566, 2154, 1010, 1996, 3011, 24357, 2015, 2055, 2049, 8123, 1012, 102, 3011, 24357, 2015, 2006, 2049, 8123, 2320, 2335, 1999, 2028, 2154, 1012, 102], [101, 16216, 23274, 2869, 1011, 15861, 12670, 2232, 1997, 2980, 2300, 2012, 1996, 3302, 1997, 1996, 3011, 1012, 102, 1996, 3302, 1997, 1996, 3103, 2003, 2172, 22302, 2084, 2471, 2505, 2006, 3011, 1012, 102], [101, 8866, 1024, 6381, 2300, 27126, 2064, 2022, 2904, 2046, 8841, 2300,

In [22]:
dsd_tokenized = dsd.map(preprocess_function, batched=True)

In [23]:
dsd['train'][0]

{'id': '0',
 'premise': 'Pluto rotates once on its axis every 6.39 Earth days;',
 'hypothesis': 'Earth rotates on its axis once times in one day.',
 'label_str': 'neutral',
 'label': 0}

In [24]:
dsd_tokenized['train'][0]

{'id': '0',
 'premise': 'Pluto rotates once on its axis every 6.39 Earth days;',
 'hypothesis': 'Earth rotates on its axis once times in one day.',
 'label_str': 'neutral',
 'label': 0,
 'input_ids': [101,
  26930,
  24357,
  2015,
  2320,
  2006,
  2049,
  8123,
  2296,
  1020,
  1012,
  4464,
  3011,
  2420,
  1025,
  102,
  3011,
  24357,
  2015,
  2006,
  2049,
  8123,
  2320,
  2335,
  1999,
  2028,
  2154,
  1012,
  102],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1]}

## Model

In [25]:
num_labels = 2
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

In [26]:
batch_size = 16
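args = transformers.TrainingArguments(
    output_dir=f"{model_name}-finetuned-scitail",
    # Evaluate and checkpoint once per epoch; the two strategies must match
    # for load_best_model_at_end. Recent transformers releases rename
    # evaluation_strategy to eval_strategy.
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    # Reload the checkpoint with the best validation accuracy after training.
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)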

In [27]:
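def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the integer labels.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=1)
    return clf_metrics.compute(predictions=predictions, references=labels)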
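
The argmax step in `compute_metrics` can be illustrated without NumPy. The sketch below uses made-up logits for three examples and two classes; the values and the `argmax` helper are hypothetical and only mirror what `np.argmax(..., axis=1)` does row by row.

```python
# Hypothetical logits: one row per example, one column per class
# (column 0 = neutral, column 1 = entails).
logits = [
    [2.1, -1.3],
    [-0.4, 0.9],
    [0.2, 0.1],
]


def argmax(row):
    """Return the index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


predicted_ids = [argmax(row) for row in logits]
print(predicted_ids)  # [0, 1, 0]
```

No softmax is needed here because softmax preserves the ordering of the logits, so the argmax is the same either way.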

In [28]:
trainer = transformers.Trainer(
    model=model,
    args=args,
    train_dataset=dsd_tokenized["train"],
    eval_dataset=dsd_tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
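
`preprocess_function` truncates but does not pad, so the tokenized examples have different lengths. Passing the tokenizer to `Trainer` lets it pad each batch to the longest sequence in that batch (to my knowledge it falls back to a padding data collator when a tokenizer is supplied). The plain-Python sketch below shows the idea only. `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption about this vocabulary.

```python
PAD_ID = 0  # assumed id of [PAD] in the distilbert-base-uncased vocabulary


def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7592, 102], [101, 7592, 1010, 2023, 102]])
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to the global maximum keeps short batches small, which saves computation on a dataset with mostly short sentences.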

In [31]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: id, label_str, premise, hypothesis. If id, label_str, premise, hypothesis are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 23596
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 7375


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.1658 | 0.349151 | 0.88727 | 0.896987 | 0.831169 | 0.974125 |
| 2 | 0.0931 | 0.377212 | 0.903374 | 0.904545 | 0.900452 | 0.908676 |
| 3 | 0.0609 | 0.44906 | 0.902607 | 0.907636 | 0.869081 | 0.949772 |
| 4 | 0.0372 | 0.469463 | 0.907209 | 0.911485 | 0.877465 | 0.94825 |
| 5 | 0.023 | 0.469463 | 0.907209 | 0.911485 | 0.877465 | 0.94825 |
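
With `load_best_model_at_end=True` and `metric_for_best_model="accuracy"`, the checkpoint that is kept is the one with the highest validation accuracy. The sketch below picks that epoch from the training table above. It is an illustration only: Python's `max` returns the first of tied entries, and whether `Trainer` breaks the tie between epochs 4 and 5 the same way is not shown in this notebook. The value of 1475 steps per epoch is taken from the first checkpoint name in the log.

```python
# Validation accuracy per epoch, copied from the training table above.
val_accuracy = {1: 0.88727, 2: 0.903374, 3: 0.902607, 4: 0.907209, 5: 0.907209}

steps_per_epoch = 1475  # from "checkpoint-1475", saved after the first epoch

# max() keeps the first maximum it sees, so a tie resolves to the earlier epoch.
best_epoch = max(val_accuracy, key=val_accuracy.get)
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"

print(best_epoch, best_checkpoint)  # 4 checkpoint-5900
```

Note that validation loss rises after the first epoch while accuracy keeps improving slightly, so selecting on loss instead would pick a different checkpoint.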


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: id, label_str, premise, hypothesis. If id, label_str, premise, hypothesis are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1304
  Batch size = 16


Saving model checkpoint to distilbert-base-uncased-finetuned-scitail/checkpoint-1475
Configuration saved in distilbert-base-uncased-finetuned-scitail/checkpoint-1475/config.json
Model weights saved in distilbert-base-uncased-finetuned-scitail/checkpoint-1475/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-scitail/checkpoint-1475/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-scitail/checkpoint-1475/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1304
  Batch size = 16
Saving model checkpoint to distilbert-base-uncased-finetuned-scitail/checkpoint-2950
Configuration s

TrainOutput(global_step=7375, training_loss=0.07347519942461433, metrics={'train_runtime': 892.9949, 'train_samples_per_second': 132.117, 'train_steps_per_second': 8.259, 'total_flos': 2190778955822592.0, 'train_loss': 0.07347519942461433, 'epoch': 5.0})
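
The step counts and throughput figures in the log can be checked with a few lines of arithmetic. This sketch assumes a single device and no gradient accumulation, which is what the log reports (total train batch size 16, gradient accumulation steps 1).

```python
import math

num_examples = 23596      # "Num examples" in the training log
batch_size = 16           # per-device batch size, one device
num_epochs = 5
train_runtime = 892.9949  # seconds, from TrainOutput

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 1475 7375

steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

The 1475 steps per epoch also explain the checkpoint names, which increase in multiples of 1475.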

In [32]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1304
  Batch size = 16


{'eval_loss': 0.46946296095848083,
 'eval_accuracy': 0.9072085889570553,
 'eval_f1': 0.9114850036576445,
 'eval_precision': 0.8774647887323944,
 'eval_recall': 0.9482496194824962,
 'eval_runtime': 3.3429,
 'eval_samples_per_second': 390.075,
 'eval_steps_per_second': 24.529,
 'epoch': 5.0}

In [34]:
trainer.evaluate(dsd_tokenized['test'])

***** Running Evaluation *****
  Num examples = 2126
  Batch size = 16


{'eval_loss': 0.5201898217201233,
 'eval_accuracy': 0.9049858889934148,
 'eval_f1': 0.8839080459770116,
 'eval_precision': 0.856347438752784,
 'eval_recall': 0.9133016627078385,
 'eval_runtime': 5.4794,
 'eval_samples_per_second': 388.0,
 'eval_steps_per_second': 24.273,
 'epoch': 5.0}
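
The reported accuracy, precision, and recall are enough to recover the test-set confusion matrix. The sketch below searches for integer counts that reproduce the three values. The resulting counts are inferred from the metrics above, not printed by the notebook, and class 1 ("entails") is assumed to be the positive class.

```python
n = 2126  # "Num examples" for the test split
accuracy = 0.9049858889934148
precision = 0.856347438752784
recall = 0.9133016627078385

solutions = []
for tp in range(1, n + 1):
    actual_pos = round(tp / recall)        # tp + fn
    predicted_pos = round(tp / precision)  # tp + fp
    fn, fp = actual_pos - tp, predicted_pos - tp
    tn = n - tp - fn - fp
    if min(fn, fp, tn) < 0:
        continue
    # Keep only counts that reproduce all three reported metrics.
    if (abs(tp / actual_pos - recall) < 1e-12
            and abs(tp / predicted_pos - precision) < 1e-12
            and abs((tp + tn) / n - accuracy) < 1e-12):
        solutions.append((tp, fp, fn, tn))

print(solutions)  # [(769, 129, 73, 1155)] as (tp, fp, fn, tn)
```

Under these assumptions the model makes more false-positive errors (129) than false-negative errors (73), which matches precision being lower than recall.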

In [35]:
!ls

distilbert-base-uncased-finetuned-scitail  sample_data


In [36]:
!ls distilbert-base-uncased-finetuned-scitail

checkpoint-1475  checkpoint-4425  checkpoint-7375
checkpoint-2950  checkpoint-5900  runs


In [39]:
# Sanity check: scoring the labels against themselves must give perfect metrics.
clf_metrics.compute(
    predictions=dsd_tokenized['test']['label'],
    references=dsd_tokenized['test']['label'],
)

{'accuracy': 1.0, 'f1': 1.0, 'precision': 1.0, 'recall': 1.0}

In [40]:
# Baseline: always predict class 0 ("neutral"), the majority class.
clf_metrics.compute(
    predictions=[0] * len(dsd_tokenized['test']['label']),
    references=dsd_tokenized['test']['label'],
)

  _warn_prf(average, modifier, msg_start, len(result))


{'accuracy': 0.6039510818438382, 'f1': 0.0, 'precision': 0.0, 'recall': 0.0}

In [47]:
np.random.randint(low=0, high=2, size=10)

array([1, 0, 0, 0, 1, 0, 0, 0, 1, 0])

In [49]:
# Baseline: uniformly random predictions over the two classes.
clf_metrics.compute(
    predictions=np.random.randint(low=0, high=2, size=len(dsd_tokenized['test']['label'])),
    references=dsd_tokenized['test']['label'],
)

{'accuracy': 0.4708372530573848,
 'f1': 0.40381558028616854,
 'precision': 0.3645933014354067,
 'recall': 0.4524940617577197}
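
The two baselines above can be compared with what simple arithmetic predicts. In this sketch the test-set class counts are inferred from the all-zeros accuracy (0.6039510818438382 × 2126 is 1284 to within rounding); they are not printed by the notebook. The expected values for random guessing assume each prediction is 1 with probability 0.5, independent of the true label, and the F1 figure is only a rough plug-in estimate.

```python
n_total = 2126
n_neutral = 1284  # inferred from the all-zeros accuracy, see the note above
n_entails = n_total - n_neutral

# Always predicting 0 is correct exactly on the neutral examples.
majority_accuracy = n_neutral / n_total

# Uniform random guessing, in expectation:
expected_accuracy = 0.5                   # half of all guesses match the label
expected_recall = 0.5                     # half of the true positives are guessed as 1
expected_precision = n_entails / n_total  # guesses of 1 are right at the base rate
approx_f1 = (2 * expected_precision * expected_recall
             / (expected_precision + expected_recall))

print(round(majority_accuracy, 4))
print(round(expected_precision, 4), round(approx_f1, 4))
```

A single random draw will scatter around these expectations, as the run above does. Both baselines sit far below the fine-tuned model's test accuracy of about 0.905.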