<a href="https://colab.research.google.com/github/christopherdiamana/nlp/blob/main/Catch-up_2/catch_up2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Natural Language Processing Catch-up 2

## Closing on the sentiment classifier 

In [None]:
import torch
torch.cuda.is_available()

True

### Library and dataset

In [None]:
!pip install datasets

In [None]:
from datasets import get_dataset_split_names

In [None]:
get_dataset_split_names("imdb")

['train', 'test', 'unsupervised']

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("imdb")

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)



In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

#### Split the training set into a training and validation set

In [None]:
dataset_clean = dataset["train"].train_test_split(train_size=0.8, stratify_by_column="label")
# Rename the default "test" split to "validation"
dataset_clean["validation"] = dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
dataset_clean["test"] = dataset["test"]

In [None]:
dataset_clean

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})
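The effect of `stratify_by_column="label"` can be sketched in plain Python: split each class separately, then merge the pieces. This is only an illustration of the idea, not the `datasets` implementation; `stratified_split` and the toy label list are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_size=0.8, seed=0):
    """Return (train_idx, val_idx) with the same class shares in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_size)  # per-class cut point
        train_idx.extend(indices[:cut])
        val_idx.extend(indices[cut:])
    return train_idx, val_idx


# Toy data: 50 negative and 50 positive labels (made up for the example).
labels = [0] * 50 + [1] * 50
train_idx, val_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print(train_counts, val_counts)
```

Because the cut is made inside each class, both parts keep the 50/50 balance of the toy data.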

In [None]:
dataset_clean['train'].features

{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}

#### Let's check that the proportion of each class is the same in the training and validation sets

In [None]:
from collections import Counter

In [None]:
Counter(dataset_clean['train']['label'])

Counter({0: 10000, 1: 10000})

In [None]:
Counter(dataset_clean['validation']['label'])

Counter({0: 2500, 1: 2500})
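The two counters can also be compared as proportions rather than raw counts. A small sketch, with the counts copied from the outputs above; `class_proportions` is a hypothetical helper.

```python
from collections import Counter


def class_proportions(counts):
    """Convert per-class counts into per-class fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_counts = Counter({0: 10000, 1: 10000})
val_counts = Counter({0: 2500, 1: 2500})

train_props = class_proportions(train_counts)
val_props = class_proportions(val_counts)
same_balance = train_props == val_props
print(train_props, val_props, same_balance)
```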

### Fine-tuning a model

In [None]:
!pip install transformers

In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
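def tokenize_function(examples):
    # `examples["text"]` is a list of reviews when `map` runs with batched=True.
    # truncation=True cuts reviews that exceed the model's maximum input length.
    return tokenizer(examples["text"], truncation=True)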

With the option `batched=True`, the tokenizer is applied to many examples at once, which makes preprocessing faster.

In [None]:
tokenized_datasets = dataset_clean.map(tokenize_function, batched=True)


In [None]:
tokenized_datasets

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 20000
})
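To make the batched mapping concrete: with `batched=True` the function receives a dict of lists (one list per column) instead of a single row, and the lists it returns become new columns such as `input_ids` and `attention_mask`. The sketch below mimics that contract in plain Python. It is not how `datasets` is implemented; `toy_tokenize` (a word counter standing in for the tokenizer) and `batched_map` are hypothetical names.

```python
def toy_tokenize(batch):
    """Batched function: gets a dict of lists, returns a dict of lists."""
    return {"n_words": [len(text.split()) for text in batch["text"]]}


def batched_map(rows, fn, batch_size=1000):
    """Apply fn to chunks of rows and merge the new columns back in."""
    mapped, n_calls = [], 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        n_calls += 1
        for i, row in enumerate(chunk):
            mapped.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return mapped, n_calls


# Made-up reviews, only for the example.
rows = [
    {"text": "a great movie", "label": 1},
    {"text": "boring", "label": 0},
    {"text": "not bad at all", "label": 1},
]
mapped, n_calls = batched_map(rows, toy_tokenize, batch_size=2)
print(n_calls, [row["n_words"] for row in mapped])
```

Three rows with a batch size of 2 need only two calls to the function instead of three, which is where the speed-up comes from when each call has a fixed overhead.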

In [None]:
tokenized_datasets = tokenized_datasets.shuffle(seed=42)

#### Preprocessing

In [None]:
from transformers import DataCollatorWithPadding

In [None]:
data_collator = DataCollatorWithPadding(tokenizer)
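The collator pads each batch to the length of its longest sequence rather than padding the whole dataset to one fixed length. A minimal pure-Python sketch of that idea follows; it assumes a pad token id of 0, `pad_batch` is a hypothetical helper, and the token ids are made up.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in the batch to the longest one in that batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 2023, 3185, 102]},
    {"input_ids": [101, 102]},
]
batch = pad_batch(features)
print(batch)
```

A batch of short reviews therefore stays short, which saves computation compared with padding everything to the maximum length.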

#### Training

First step, I will create a `TrainingArguments` object that contains all the hyperparameters the `Trainer` will use for training and evaluation.

In [None]:
from transformers import TrainingArguments

In [None]:
directory_name = "finetuning-distilbert"

training_args = TrainingArguments(
    output_dir=directory_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
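These hyperparameters determine how many optimizer updates the run will perform. A rough sketch of the arithmetic, assuming a single device, no gradient accumulation, and the 20,000-example training split created earlier; `total_optimization_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def total_optimization_steps(num_examples, per_device_batch_size, num_epochs,
                             num_devices=1, grad_accum_steps=1):
    """Number of optimizer updates for a full training run (simplified)."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


steps = total_optimization_steps(num_examples=20000,
                                 per_device_batch_size=16,
                                 num_epochs=1)
print(steps)  # 1250
```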


Second step, I will define the model. As in the previous chapter, I will use the `AutoModelForSequenceClassification` class with two labels, one per sentiment class:

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.20.1",
  "vocab_size": 30522
}

loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9c169103d7e5a
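With `num_labels=2`, the classification head outputs two scores (logits) per review, one for each class. The sketch below shows how such a pair of logits can be turned into a label and a probability with a softmax. The logit values are made up, and `predict_label` is a hypothetical helper; the label names follow the dataset's `ClassLabel` (`neg`, `pos`).

```python
import math

LABEL_NAMES = ["neg", "pos"]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most likely label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[best], probs[best]


probs = softmax([-1.2, 2.3])
label, prob = predict_label([-1.2, 2.3])
print(label, round(prob, 3))
```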

In [None]:
from transformers import Trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 20000
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1250


Step,Training Loss
500,0.2481
1000,0.2328


Saving model checkpoint to finetuning-sentiment-model-3000-samples/checkpoint-500
Configuration saved in finetuning-sentiment-model-3000-samples/checkpoint-500/config.json
Model weights saved in finetuning-sentiment-model-3000-samples/checkpoint-500/pytorch_model.bin
tokenizer config file saved in finetuning-sentiment-model-3000-samples/checkpoint-500/tokenizer_config.json
Special tokens file saved in finetuning-sentiment-model-3000-samples/checkpoint-500/special_tokens_map.json
Saving model checkpoint to finetuning-sentiment-model-3000-samples/checkpoint-1000
Configuration saved in finetuning-sentiment-model-3000-samples/checkpoint-1000/config.json
Model weights saved in finetuning-sentiment-model-3000-samples/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in finetuning-sentiment-model-3000-samples/checkpoint-1000/tokenizer_config.json
Special tokens file saved in finetuning-sentiment-model-3000-samples/checkpoint-1000/special_tokens_map.json


Training completed. Do no

TrainOutput(global_step=1250, training_loss=0.23486163024902343, metrics={'train_runtime': 535.0728, 'train_samples_per_second': 37.378, 'train_steps_per_second': 2.336, 'total_flos': 2630351320231488.0, 'train_loss': 0.23486163024902343, 'epoch': 1.0})
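The throughput figures in `TrainOutput` follow directly from the runtime, the number of training examples, and the number of steps. A quick check, with the values copied from the output above (20,000 examples, one epoch):

```python
train_runtime = 535.0728   # seconds, from TrainOutput
num_examples = 20000       # size of the training split
num_epochs = 1
global_step = 1250

samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
steps_per_second = round(global_step / train_runtime, 3)
minutes = train_runtime / 60

print(samples_per_second, steps_per_second, round(minutes, 1))
```

Both values match the reported `train_samples_per_second` and `train_steps_per_second`, and the single epoch took a little under nine minutes.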