# Text Classification

We will fine-tune the [distilled version of the BERT base model](https://huggingface.co/distilbert-base-uncased) on a [dataset of news articles](https://huggingface.co/datasets/ag_news) from Hugging Face.

The dataset consists of 120,000 training and 7,600 test samples, divided into 4 classes: `World` (0), `Sports` (1), `Business` (2), and `Sci/Tech` (3).

In [None]:
!pip install -qq transformers[torch] datasets

In [None]:
DATASET = 'ag_news'
NUM_LABELS = 4
MODEL = 'distilbert-base-uncased'

Load the dataset with news articles:

In [None]:
from datasets import load_dataset

dataset = load_dataset(DATASET)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

Check the format of one sample from our dataset:
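
The cell for this step is missing from the notebook, so here is a minimal sketch of what such a check might look like. `describe_sample` is a hypothetical helper, and the toy record is made up for illustration; in the notebook you would pass `dataset['train'][0]` instead. The label names are the ones listed at the top.

```python
# Label names as listed in the dataset description above.
ID2LABEL = {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}

def describe_sample(sample, max_chars=80):
    """Summarize one record that has a 'text' string and an integer 'label'."""
    text = sample['text']
    preview = text if len(text) <= max_chars else text[:max_chars] + '...'
    return {
        'label_id': sample['label'],
        'label_name': ID2LABEL[sample['label']],
        'num_words': len(text.split()),
        'preview': preview,
    }

# Made-up record with the same two fields as the real dataset ('text', 'label').
toy_sample = {'text': 'Example headline about a football match.', 'label': 1}
print(describe_sample(toy_sample))
```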

Check whether our dataset is balanced (get the number of samples from each class):

In [None]:
import numpy as np

def check_class_balance(class_labels):
  values, counts = np.unique(class_labels, return_counts=True)
  return values, counts

check_class_balance(dataset['train']['label'])

(array([0, 1, 2, 3]), array([30000, 30000, 30000, 30000]))

Load the tokenizer and have a look at its special tokens:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

*What do these tokens mean?*

`[UNK]`, `unk_token` — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

`[SEP]`, `sep_token` — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

`[PAD]`, `pad_token` — The token used for padding, for example when batching sequences of different lengths.

`[CLS]`, `cls_token` — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

`[MASK]`, `mask_token` — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

Check what exactly the tokenizer returns when applied to a single sample:

Compare it to what is returned when we use the `preprocess_function`:

In [None]:
def preprocess_function(examples):
  # https://huggingface.co/docs/transformers/pad_truncation
  # truncation=True cuts sequences that exceed the model's maximum length (512 for this tokenizer);
  # padding='max_length' pads shorter sequences with the [PAD] token up to that same length
  return tokenizer(examples['text'], truncation=True, padding='max_length', return_tensors='pt')

In [None]:
input_ids, attention_mask = preprocess_function(dataset['train'][0]).values()
input_ids_, attention_mask_ = tokenizer(dataset['train'][0]['text']).values()

In [None]:
print(input_ids_)
print(input_ids)

[101, 2813, 2358, 1012, 6468, 15020, 2067, 2046, 1996, 2304, 1006, 26665, 1007, 26665, 1011, 2460, 1011, 19041, 1010, 2813, 2395, 1005, 1055, 1040, 11101, 2989, 1032, 2316, 1997, 11087, 1011, 22330, 8713, 2015, 1010, 2024, 3773, 2665, 2153, 1012, 102]
tensor([[  101,  2813,  2358,  1012,  6468, 15020,  2067,  2046,  1996,  2304,
          1006, 26665,  1007, 26665,  1011,  2460,  1011, 19041,  1010,  2813,
          2395,  1005,  1055,  1040, 11101,  2989,  1032,  2316,  1997, 11087,
          1011, 22330,  8713,  2015,  1010,  2024,  3773,  2665,  2153,  1012,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0, 
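
To make the difference between the two printouts concrete, here is a small pure-Python sketch of what `padding='max_length'` does conceptually: the token IDs are kept, `[PAD]` IDs (0 for this tokenizer, per the printout above) are appended up to the target length, and the attention mask marks real tokens with 1 and padding with 0. `pad_to_max_length` is a hypothetical helper, not part of the `transformers` API, and the short ID list only reuses the first few IDs from the output above, closed with `[SEP]`.

```python
PAD_ID = 0    # [PAD], per the tokenizer printout above
CLS_ID = 101  # [CLS]
SEP_ID = 102  # [SEP]

def pad_to_max_length(input_ids, max_length, pad_id=PAD_ID):
    """Truncate or right-pad a list of token IDs and build the attention mask.

    Simplification: a real tokenizer keeps the final [SEP] when truncating;
    this sketch just slices the list.
    """
    ids = list(input_ids[:max_length])
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

short_ids = [CLS_ID, 2813, 2358, 1012, SEP_ID]
padded_ids, mask = pad_to_max_length(short_ids, max_length=8)
print(padded_ids)  # [101, 2813, 2358, 1012, 102, 0, 0, 0]
print(mask)        # [1, 1, 1, 1, 1, 0, 0, 0]
```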

Preprocess more samples from our dataset at once:

In [None]:
# training on the whole dataset would take more than 5 hours :(
# train_dataset = dataset['train'].map(preprocess_function, batched=True)
# test_dataset = dataset['test'].map(preprocess_function, batched=True)

train_dataset = dataset['train'].shuffle(seed=42).select(range(2500)).map(preprocess_function, batched=True)
test_dataset = dataset['test'].shuffle(seed=42).select(range(500)).map(preprocess_function, batched=True)

Load the model:

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import set_seed

set_seed(42)

def model_init():
  id2label = {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}
  label2id = {'World': 0, 'Sports': 1, 'Business': 2, 'Sci/Tech': 3}
  return AutoModelForSequenceClassification.from_pretrained(
      MODEL,
      num_labels=NUM_LABELS,
      id2label=id2label,
      label2id=label2id
      )
model_init()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

Define evaluation metrics and train our model:

In [None]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score
from transformers import set_seed

import numpy as np

def compute_metrics(p):
    logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(logits, axis=1)
    return {'accuracy': accuracy_score(p.label_ids, preds)}

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch',
    learning_rate=5e-5,
    weight_decay=0.0,
    logging_steps=16,
    seed=42
)

trainer = Trainer(
    model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    model_init=model_init
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3859,0.412676,0.866
2,0.2527,0.377189,0.886


TrainOutput(global_step=314, training_loss=0.3632557786953677, metrics={'train_runtime': 247.2173, 'train_samples_per_second': 20.225, 'train_steps_per_second': 1.27, 'total_flos': 662360616960000.0, 'train_loss': 0.3632557786953677, 'epoch': 2.0})
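
As a sanity check on the `global_step=314` reported above: assuming a single device and no gradient accumulation (the notebook does not state the hardware, so both are assumptions), the number of optimizer steps is the number of batches per epoch, rounded up, times the number of epochs. `total_steps` is a hypothetical helper.

```python
import math

def total_steps(num_samples, batch_size, num_epochs):
    """Optimizer steps for one device without gradient accumulation (assumed)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

print(total_steps(2500, 16, 2))  # 157 batches per epoch * 2 epochs = 314
```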

How can we improve the performance?

- increase dataset size

- **hyperparameter optimization**

## 1. The simplest strategy - increase the number of epochs


In [None]:
set_seed(42)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch',
    learning_rate=5e-5,
    weight_decay=0.0,
    logging_steps=16,
    seed=42
)

trainer = Trainer(
    model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    model_init=model_init
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3785,0.472366,0.86
2,0.2344,0.338372,0.896
3,0.1904,0.417836,0.886
4,0.0589,0.434215,0.896
5,0.0678,0.448407,0.888


TrainOutput(global_step=785, training_loss=0.1938984110622198, metrics={'train_runtime': 595.5287, 'train_samples_per_second': 20.99, 'train_steps_per_second': 1.318, 'total_flos': 1655901542400000.0, 'train_loss': 0.1938984110622198, 'epoch': 5.0})

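The model appears to be overfitting: the training loss drops from 0.3785 to below 0.07, while the validation loss is lowest at epoch 2 (0.338) and rises afterwards, and from epoch 2 on the accuracy only fluctuates between 0.886 and 0.896.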
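
One way to read the table above is to look for the epoch with the lowest validation loss and ask when a patience-based early-stopping rule would have halted training. The sketch below does this in plain Python on the validation losses copied from the 5-epoch run; `best_epoch` and `early_stop_epoch` are hypothetical helpers, and the patience of 2 is an arbitrary choice for illustration. In `transformers`, `EarlyStoppingCallback` offers this kind of rule for `Trainer`, but it is not used in this notebook.

```python
# Validation loss per epoch, copied from the 5-epoch run above.
val_losses = [0.472366, 0.338372, 0.417836, 0.434215, 0.448407]

def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=lambda i: losses[i]) + 1

def early_stop_epoch(losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

print(best_epoch(val_losses))                    # 2
print(early_stop_epoch(val_losses, patience=2))  # 4
```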

## 2. Hyperparameter optimization using [optuna](https://optuna.org/)

In [None]:
!pip install optuna



In [None]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score
from transformers import set_seed

import numpy as np

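set_seed(42)

def optuna_hp_space(trial):
    # Search space: Optuna samples one value per hyperparameter for each trial
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32, 64]),
        "weight_decay": trial.suggest_float("weight_decay", 0.0001, 0.1),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 3)
    }

# training_args is reused from the previous cell; the values sampled in each
# trial override the four hyperparameters listed in the search space
trainer = Trainer(
    model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    model_init=model_init,
)

# No compute_objective is given and compute_metrics returns only accuracy,
# so the value being maximized is the evaluation accuracy
best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    hp_space=optuna_hp_space,
    n_trials=5,
)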


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2023-10-29 11:58:12,152] A new study created in memory with name: no-name-a75446b8-ddf5-4e6c-b866-56ab9083a73a
Trying to set dropout in the hyperparameter search but there is no corresponding field in `TrainingArguments`.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.7474,0.744642,0.832
2,0.5799,0.588172,0.862


[I 2023-10-29 12:02:34,256] Trial 0 finished with value: 0.862 and parameters: {'learning_rate': 3.4540000292055426e-06, 'per_device_train_batch_size': 8, 'weight_decay': 0.03655434056778146, 'num_train_epochs': 2}. Best is trial 0 with value: 0.862.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3981,0.443444,0.862
2,0.2213,0.401591,0.888


[I 2023-10-29 12:06:41,464] Trial 1 finished with value: 0.888 and parameters: {'learning_rate': 9.55250258385311e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.05538179693389445, 'num_train_epochs': 2}. Best is trial 1 with value: 0.888.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.391,0.394063,0.874
2,0.3069,0.378267,0.888


[I 2023-10-29 12:10:48,862] Trial 2 finished with value: 0.888 and parameters: {'learning_rate': 2.563130579670119e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.0028755869135947665, 'num_train_epochs': 2}. Best is trial 1 with value: 0.888.


Epoch,Training Loss,Validation Loss,Accuracy
1,1.3446,1.308062,0.706
2,1.2062,1.174845,0.808
3,1.1247,1.121598,0.812


[I 2023-10-29 12:16:54,554] Trial 3 finished with value: 0.812 and parameters: {'learning_rate': 2.2299724840014933e-06, 'per_device_train_batch_size': 32, 'weight_decay': 0.08399728276673303, 'num_train_epochs': 3}. Best is trial 1 with value: 0.888.


Epoch,Training Loss,Validation Loss,Accuracy
1,1.0442,1.03864,0.81
2,0.8547,0.865736,0.82


[I 2023-10-29 12:21:11,687] Trial 4 finished with value: 0.82 and parameters: {'learning_rate': 2.0418376768758244e-06, 'per_device_train_batch_size': 8, 'weight_decay': 0.07553242500757919, 'num_train_epochs': 2}. Best is trial 1 with value: 0.888.


In [None]:
best_trial

BestRun(run_id='1', objective=0.888, hyperparameters={'learning_rate': 9.55250258385311e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.05538179693389445, 'num_train_epochs': 2}, run_summary=None)

In [None]:
best_trial.hyperparameters

{'learning_rate': 9.55250258385311e-05,
 'per_device_train_batch_size': 16,
 'weight_decay': 0.05538179693389445,
 'num_train_epochs': 2}
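
The search space above samples the learning rate with `log=True`, which means values are drawn uniformly on a logarithmic scale, so each order of magnitude between `1e-6` and `1e-4` is equally likely. The sketch below imitates that idea with the standard library only; `sample_log_uniform` is a hypothetical stand-in for illustration, not Optuna's own implementation.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)  # seeded only to make the sketch repeatable
samples = [sample_log_uniform(1e-6, 1e-4, rng) for _ in range(1000)]

# Roughly half of the draws should fall below the geometric midpoint 1e-5.
below_mid = sum(s < 1e-5 for s in samples) / len(samples)
print(f"share below 1e-5: {below_mid:.2f}")
```

With a plain uniform draw between `1e-6` and `1e-4`, only about 9% of the values would fall below `1e-5`, which is why a log scale is the usual choice for learning rates.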

In [None]:
set_seed(42)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=best_trial.hyperparameters["num_train_epochs"],
    per_device_train_batch_size=best_trial.hyperparameters["per_device_train_batch_size"],
    evaluation_strategy='epoch',
    learning_rate=best_trial.hyperparameters["learning_rate"],
    weight_decay=best_trial.hyperparameters["weight_decay"],
    logging_steps=16,
    seed=42
)

trainer = Trainer(
    model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    model_init=model_init
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3967,0.444413,0.862
2,0.2208,0.403314,0.886


TrainOutput(global_step=314, training_loss=0.33877817156967843, metrics={'train_runtime': 248.9674, 'train_samples_per_second': 20.083, 'train_steps_per_second': 1.261, 'total_flos': 662360616960000.0, 'train_loss': 0.33877817156967843, 'epoch': 2.0})

## Data augmentation
The idea is to mask one word in each training sample and let a masked language model propose a replacement, which yields a slightly altered copy of the sample with the same label.

In [None]:
from datasets import Dataset
import random

from tqdm.notebook import tqdm
from transformers import pipeline

# Initialize the fill-mask pipeline
unmasker = pipeline('fill-mask', model='bert-base-cased')

# Define a function to perform text augmentation
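def augment_dataset_with_masking(dataset):
    augmented_texts = []

    for text in tqdm(dataset['text']):
        orig_text_list = text.split()
        len_input = len(orig_text_list)
        if len_input < 2:
            # random.randint(1, 0) would raise; keep single-word texts unchanged
            augmented_texts.append(text)
            continue

        # Mask one random word (the first word, index 0, is never chosen)
        rand_idx = random.randint(1, len_input - 1)
        orig_word = orig_text_list[rand_idx]
        new_text_list = orig_text_list.copy()
        new_text_list[rand_idx] = '[MASK]'
        new_mask_sent = ' '.join(new_text_list)

        # The pipeline returns candidate fillers, best-scoring first
        augmented_text_list = unmasker(new_mask_sent)

        # Fall back to the original text so that a result from a previous
        # iteration is never reused when every candidate equals the masked word
        augmented_text = text
        for res in augmented_text_list:
            if res['token_str'] != orig_word:
                augmented_text = res['sequence']
                break

        augmented_texts.append(augmented_text)

    # Labels are unchanged: each augmented text keeps the label of its source
    original_labels = dataset['label']
    return Dataset.from_dict({'text': augmented_texts, 'label': original_labels})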
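
The two key steps in the function above, masking one word and choosing a replacement that differs from it, can be tried in isolation without downloading a model. In the sketch below, `mask_random_word` and `pick_replacement` are hypothetical helpers, and `fake_candidates` is a hand-written list that only mimics the shape of the fill-mask pipeline output (dicts with `token_str` and `sequence` keys); the sentences in it are made up.

```python
import random

def mask_random_word(text, rng, mask_token='[MASK]'):
    """Replace one whitespace-separated word (never the first) with the mask token."""
    words = text.split()
    if len(words) < 2:
        return text, None  # nothing to mask safely
    idx = rng.randint(1, len(words) - 1)
    original_word = words[idx]
    words[idx] = mask_token
    return ' '.join(words), original_word

def pick_replacement(candidates, original_word, fallback):
    """Return the first candidate sequence whose token differs from the original word."""
    for cand in candidates:
        if cand['token_str'] != original_word:
            return cand['sequence']
    return fallback  # every candidate reproduced the original word

rng = random.Random(0)  # seeded so the masked position is repeatable
masked, word = mask_random_word('Stocks rallied on Monday', rng)
print(masked, '| masked word:', word)

# Hand-written stand-in for the pipeline output; not real model predictions.
fake_candidates = [
    {'token_str': 'rallied', 'sequence': 'Stocks rallied on Monday'},
    {'token_str': 'fell', 'sequence': 'Stocks fell on Monday'},
]
print(pick_replacement(fake_candidates, 'rallied', fallback='Stocks rallied on Monday'))
```

Passing a seeded `random.Random` instance instead of using the module-level `random` functions is a design choice of this sketch; it makes the augmentation repeatable across runs.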

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



#### Perform augmentation on the training set

In [None]:
original_dataset = dataset['train'].shuffle(seed=42).select(range(2500))

augmented_dataset = augment_dataset_with_masking(original_dataset)

aug_dataset = Dataset.from_dict({'text': original_dataset['text']+augmented_dataset['text'],
                       'label': original_dataset['label']+augmented_dataset['label']})
print(aug_dataset)

Dataset({
    features: ['text', 'label'],
    num_rows: 5000
})


In [None]:
train_dataset_aug = aug_dataset.shuffle(seed=42).map(preprocess_function, batched=True)

In [None]:
set_seed(42)

id2label = {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}
label2id = {'World': 0, 'Sports': 1, 'Business': 2, 'Sci/Tech': 3}
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=NUM_LABELS,
    id2label=id2label,
    label2id=label2id
    )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_aug,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2125,0.420548,0.866
2,0.0632,0.474889,0.886


TrainOutput(global_step=626, training_loss=0.23064140808848907, metrics={'train_runtime': 484.361, 'train_samples_per_second': 20.646, 'train_steps_per_second': 1.292, 'total_flos': 1324721233920000.0, 'train_loss': 0.23064140808848907, 'epoch': 2.0})

In [None]:
set_seed(42)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=best_trial.hyperparameters["per_device_train_batch_size"],
    evaluation_strategy='epoch',
    learning_rate=1e-5,
    weight_decay=0.01,
    logging_steps=16,
    seed=42
)

id2label = {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}
label2id = {'World': 0, 'Sports': 1, 'Business': 2, 'Sci/Tech': 3}
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=NUM_LABELS,
    id2label=id2label,
    label2id=label2id
    )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_aug,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3197,0.401383,0.872
2,0.2219,0.390085,0.888
3,0.1848,0.386588,0.89


TrainOutput(global_step=939, training_loss=0.3327051868327002, metrics={'train_runtime': 719.5211, 'train_samples_per_second': 20.847, 'train_steps_per_second': 1.305, 'total_flos': 1987081850880000.0, 'train_loss': 0.3327051868327002, 'epoch': 3.0})

In [None]:
set_seed(42)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=best_trial.hyperparameters["per_device_train_batch_size"],
    evaluation_strategy='epoch',
    learning_rate=1e-5,
    weight_decay=0.01,
    logging_steps=16,
    seed=42
)

id2label = {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}
label2id = {'World': 0, 'Sports': 1, 'Business': 2, 'Sci/Tech': 3}
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=NUM_LABELS,
    id2label=id2label,
    label2id=label2id
    )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_aug,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3154,0.400036,0.872
2,0.2177,0.396595,0.886
3,0.1064,0.391166,0.89
4,0.1079,0.405557,0.894
5,0.0901,0.414489,0.892


TrainOutput(global_step=1565, training_loss=0.21980085409106537, metrics={'train_runtime': 1209.233, 'train_samples_per_second': 20.674, 'train_steps_per_second': 1.294, 'total_flos': 3311803084800000.0, 'train_loss': 0.21980085409106537, 'epoch': 5.0})
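
When comparing the runs above, it helps to keep the size of the evaluation subset in mind: with 500 test samples, one sample corresponds to 0.2 percentage points of accuracy. As a rough sketch (assuming the test samples are independent and using the normal approximation to the binomial, both simplifying assumptions), the standard error of an accuracy estimate can be computed as follows; `accuracy_standard_error` is a hypothetical helper.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

n_eval = 500  # size of the evaluation subset used above
for acc in (0.886, 0.896):
    se = accuracy_standard_error(acc, n_eval)
    print(f"accuracy={acc:.3f}  correct={round(acc * n_eval)}  standard error={se:.4f}")
```

Under these assumptions the standard error is about 0.014, so differences of one or two test samples between runs (for example 0.886 versus 0.888) are well within the noise, and a larger evaluation set would be needed to separate them.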