<a href="https://colab.research.google.com/github/christopherdiamana/nlp/blob/main/Catch-up_2/catch_up2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Natural Language Processing Catch-up 2

## Closing on the sentiment classifier 

In [1]:
import torch
torch.cuda.is_available()

True

### Library and dataset

In [2]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)

In [3]:
from datasets import get_dataset_split_names

In [4]:
get_dataset_split_names("imdb")

Downloading builder script:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

['train', 'test', 'unsupervised']

In [5]:
from datasets import load_dataset

In [6]:
dataset = load_dataset("imdb")

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

#### Split the training set into a training and validation set

In [8]:
dataset_clean = dataset["train"].train_test_split(train_size=0.8, stratify_by_column="label")

# Rename the default "test" split to "validation"
dataset_clean["validation"] = dataset_clean.pop("test")

# Add the "test" set to our `DatasetDict`
dataset_clean["test"] = dataset["test"]

In [9]:
dataset_clean

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})
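To make the idea behind `stratify_by_column="label"` concrete, here is a minimal pure-Python sketch of a stratified split. It is an illustration only and not how the `datasets` library implements it; `stratified_split`, its `seed` parameter, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_size=0.8, seed=0):
    """Split indices so that each class keeps the same train/validation ratio."""
    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = int(round(len(indices) * train_size))
        train_idx.extend(indices[:cut])
        val_idx.extend(indices[cut:])
    return train_idx, val_idx

# Toy balanced dataset (made up): 50 negative and 50 positive labels.
labels = [0] * 50 + [1] * 50
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 80 20
```

Because the cut is applied per class, both parts keep the 50/50 balance of the original labels.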

In [10]:
dataset_clean['train'].features

{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}

#### Let's check that the proportion of each class is the same in the training and validation sets

In [11]:
from collections import Counter

In [12]:
Counter(dataset_clean['train']['label'])

Counter({0: 10000, 1: 10000})

In [13]:
Counter(dataset_clean['validation']['label'])

Counter({0: 2500, 1: 2500})
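The counts above can be turned into proportions to make the comparison explicit. This is a small sketch: `class_proportions` is a hypothetical helper, and the label lists are rebuilt from the printed counts instead of being read from the dataset.

```python
from collections import Counter

def class_proportions(labels):
    """Return a dict mapping each label to its share of the list."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Label lists rebuilt from the counts printed above (assumption: only the
# counts matter here, not the order of the examples).
train_labels = [0] * 10000 + [1] * 10000
validation_labels = [0] * 2500 + [1] * 2500

train_props = class_proportions(train_labels)
validation_props = class_proportions(validation_labels)
print(train_props)       # {0: 0.5, 1: 0.5}
print(validation_props)  # {0: 0.5, 1: 0.5}
```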

### Fine-tuning a model

#### 1. Fine-tune the `distilbert-base-uncased` model on the training data

In [14]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.12.1 transformers-4.20.1


In [15]:
from transformers import AutoTokenizer

In [16]:
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

In [17]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

With the option `batched=True`, the tokenizer is applied to batches of examples at once, which makes preprocessing faster.

In [18]:
tokenized_datasets = dataset_clean.map(tokenize_function, batched=True)



  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/25 [00:00<?, ?ba/s]
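The progress bars above count batches (`ba`), not single examples: 20, 5, and 25. A quick check of that arithmetic follows, assuming `map` used its default `batch_size` of 1000 (the notebook does not set it explicitly).

```python
import math

batch_size = 1000  # assumption: default batch size of Dataset.map with batched=True
split_sizes = {"train": 20000, "validation": 5000, "test": 25000}

# Number of calls to tokenize_function per split.
num_batches = {name: math.ceil(n / batch_size) for name, n in split_sizes.items()}
print(num_batches)  # {'train': 20, 'validation': 5, 'test': 25}
```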

In [19]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
})

In [20]:
tokenized_datasets = tokenized_datasets.shuffle(seed=42)

##### Preprocessing

In [21]:
from transformers import DataCollatorWithPadding

In [22]:
data_collator = DataCollatorWithPadding(tokenizer)
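`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset to one fixed length. The sketch below illustrates that idea in plain Python; `pad_batch` is a hypothetical helper, the token ids are made up, and the real collator returns tensors instead of lists.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up tokenized reviews of different lengths.
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 2023, 3185, 2001, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch keeps short batches short, which saves computation compared with padding everything to the maximum model length.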

##### Training

First, I will create a `TrainingArguments` instance that holds all the hyperparameters the `Trainer` will use for training and evaluation.

In [23]:
from transformers import TrainingArguments

In [24]:
directory_name = "finetuning-distilbert"
 
training_args = TrainingArguments(
   output_dir=directory_name,
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=1,
   weight_decay=0.01
)

Second, I will define the model. As in the previous chapter, I will use the `AutoModelForSequenceClassification` class with two labels:

In [25]:
from transformers import AutoModelForSequenceClassification

In [26]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

In [27]:
from transformers import Trainer

In [28]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator
)

In [29]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 20000
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1250


Step,Training Loss
500,0.3168
1000,0.2395


Saving model checkpoint to finetuning-distilbert/checkpoint-500
Configuration saved in finetuning-distilbert/checkpoint-500/config.json
Model weights saved in finetuning-distilbert/checkpoint-500/pytorch_model.bin
tokenizer config file saved in finetuning-distilbert/checkpoint-500/tokenizer_config.json
Special tokens file saved in finetuning-distilbert/checkpoint-500/special_tokens_map.json
Saving model checkpoint to finetuning-distilbert/checkpoint-1000
Configuration saved in finetuning-distilbert/checkpoint-1000/config.json
Model weights saved in finetuning-distilbert/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in finetuning-distilbert/checkpoint-1000/tokenizer_config.json
Special tokens file saved in finetuning-distilbert/checkpoint-1000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1250, training_loss=0.26772591857910155, metrics={'train_runtime': 544.7393, 'train_samples_per_second': 36.715, 'train_steps_per_second': 2.295, 'total_flos': 2623939070215296.0, 'train_loss': 0.26772591857910155, 'epoch': 1.0})
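The step counts in the log can be checked by hand. This sketch assumes a single device and no gradient accumulation, as the log reports, and assumes the `Trainer` default of saving a checkpoint every 500 steps, since `save_steps` is not set in the `TrainingArguments` above.

```python
import math

num_examples = 20000
batch_size = 16   # per_device_train_batch_size
num_epochs = 1
save_steps = 500  # assumption: default checkpoint interval

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(total_steps)       # 1250
print(checkpoint_steps)  # [500, 1000]
```

This matches `Total optimization steps = 1250` and the two saved checkpoints, `checkpoint-500` and `checkpoint-1000`.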

#### For what follows, I will use a [fine-tuned](https://huggingface.co/mvonwyl/distilbert-base-uncased-imdb) version of `distilbert-base-uncased`, trained on the IMDb dataset with a 5,000-sample evaluation set created by splitting the training set.

In [30]:
tokenizer_mvonwyl = AutoTokenizer.from_pretrained("mvonwyl/distilbert-base-uncased-imdb")

model_mvonwyl = AutoModelForSequenceClassification.from_pretrained("mvonwyl/distilbert-base-uncased-imdb")

All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at mvonwyl/distilbert-base-uncased-imdb

#### 2. Evaluation

I will evaluate the model in terms of accuracy on the test data.

In [33]:
trainer_mvonwyl = Trainer(
    model=model_mvonwyl,
    tokenizer=tokenizer_mvonwyl
)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


First, I will get some predictions from the model.

In [34]:
predictions = trainer_mvonwyl.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 25000
  Batch size = 8


(25000, 2) (25000,)


In [41]:
predictions.metrics

{'test_loss': 0.3972010016441345,
 'test_runtime': 224.5122,
 'test_samples_per_second': 111.353,
 'test_steps_per_second': 13.919}

The output of the `predict()` method has three fields: `predictions`, `label_ids`, and `metrics`.

The `predictions` field is a two-dimensional array of shape `(25000, 2)`. It holds the logits for each element of the dataset passed to `predict()`.

To compare the predictions with the labels, I will take the index of the maximum value along the last axis, which turns the two-dimensional array of logits into a one-dimensional array of predicted classes:

In [35]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
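To see what `argmax` does on a small scale, here is a sketch with made-up logits for three reviews. The softmax step is an optional addition that is not used elsewhere in this notebook; it only shows that converting logits to probabilities does not change the predicted class.

```python
import numpy as np

# Made-up logits; columns correspond to the classes [neg, pos].
logits = np.array([[2.0, -1.0], [-0.5, 0.5], [0.1, 3.0]])

# Predicted class = index of the largest logit in each row.
toy_preds = np.argmax(logits, axis=-1)
print(toy_preds)  # [0 1 1]

# Softmax, subtracting the row maximum first for numerical stability.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
```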

I can now compare `preds` to the labels stored in `predictions.label_ids`.

In [36]:
from datasets import load_metric

metric = load_metric("accuracy")
metric.compute(predictions=preds, references=predictions.label_ids)

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

{'accuracy': 0.92948}

Here, I can see that the fine-tuned version of `distilbert-base-uncased` reaches an accuracy of 92.95% on the test set.
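Accuracy is simply the fraction of predictions that match the labels, so the metric can be reproduced by hand. The sketch below uses a hypothetical `accuracy` helper on made-up toy data, then derives how many of the 25,000 test reviews the reported score of 0.92948 corresponds to.

```python
def accuracy(preds, labels):
    """Fraction of predictions equal to the reference labels."""
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Made-up toy data: three of four predictions are right.
toy_preds = [1, 0, 1, 1]
toy_labels = [1, 0, 0, 1]
print(accuracy(toy_preds, toy_labels))  # 0.75

# Counts implied by the reported test accuracy.
n_test = 25000
n_correct = round(0.92948 * n_test)
n_wrong = n_test - n_correct
print(n_correct, n_wrong)  # 23237 1763
```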

#### 3. Wrongly classified samples

I will select 2 samples which have been wrongly classified in the test set.

In [61]:
result_imdb_test = preds != predictions.label_ids

In [63]:
wrong_imdb_test = np.where(result_imdb_test)
wrong_imdb_test

(array([   21,    30,    53, ..., 24976, 24995, 24998]),)

In [66]:
type(wrong_imdb_test[0])

numpy.ndarray

I will randomly select 2 of these samples, fixing the random seed so that the selection is reproducible.

In [75]:
np.random.seed(42)
two_wrongly_index = np.random.choice(wrong_imdb_test[0], 2)
two_wrongly_index

array([16121, 20899])
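One caveat: `np.random.choice` samples with replacement by default, so in principle it could return the same index twice. A possible alternative, sketched here with NumPy's `Generator` API and a short stand-in array, is to pass `replace=False`. This would not reproduce the indices shown above, because it uses a different random generator.

```python
import numpy as np

# Stand-in for wrong_imdb_test[0]; the real array holds many more indices.
wrong_indices = np.array([21, 30, 53, 24976, 24995, 24998])

# A seeded generator keeps the selection reproducible.
rng = np.random.default_rng(42)
picked = rng.choice(wrong_indices, size=2, replace=False)
print(picked)
```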

In [77]:
two_wrongly_index[0]

16121

In [83]:
index = int(two_wrongly_index[0])
tokenized_datasets["test"][index]['text']

'I have read the other user comments and I am happy someone has compared it to the original by Kamal called Perumarzhakalam released in 2004.<br /><br />The original had a tight story and no loopholes as described above about the Indian Govt not having proper records, or even bad shoots and bloopers.<br /><br />The story is great and a touchy one and well described by others. But sadly Nagesh taking credit for it as his own story is a sad thing and amounts to nothing other than plagiarism.<br /><br />I guess he has been affected by Bollywood\'s so called "inspired" syndrome.<br /><br />He must at least give credit where it is due.<br /><br />I liked some of his older movies, but now I suspect if any of them were originals after all.<br /><br />Here is a link in IMDb for the original masterpiece. http://www.imdb.com/title/tt0425350/#comment I recommend everyone to see the original, even with subtitles if needed, to know what class direction and class acting is all about.'

In [84]:
index = int(two_wrongly_index[1])
tokenized_datasets["test"][index]['text']

"A twist of fate puts a black man at the head of an old-school, white-bred advertising firm. And he intends to make a few changes...<br /><br />One very strange piece of cinema. You'll either love it or hate it. Either way, you've never seen anything like it."
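When reading misclassified reviews, it helps to see the true and predicted labels next to the text. The helper below is a hypothetical sketch: `describe_error` is not part of any library, the label names come from the `ClassLabel` feature shown earlier, and the example review and labels are made up.

```python
label_names = ["neg", "pos"]  # order taken from the ClassLabel feature

def describe_error(text, true_label, predicted_label, max_chars=80):
    """Return a one-line summary of a misclassified review."""
    # IMDb reviews contain HTML line breaks; replace them with spaces.
    clean = text.replace("<br />", " ")
    snippet = clean[:max_chars]
    return (
        f"true={label_names[true_label]} "
        f"predicted={label_names[predicted_label]} | {snippet}"
    )

# Made-up review and labels, for illustration only.
summary = describe_error(
    "Great cast.<br /><br />Weak script.", true_label=0, predicted_label=1
)
print(summary)
```

In the notebook, the same helper could be called with `tokenized_datasets["test"][index]["text"]`, `predictions.label_ids[index]`, and `preds[index]`.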