# DistilBERT Classifier using Hugging Face `transformers`

![](finetuning-ii.png)

| **Model Type**          | **Representation Used**            | **Typical Accuracy** | **Notes**                               |
| ----------------------- | ---------------------------------- | -------------------- | --------------------------------------- |
| **MLP (Dense Network)** | **TF-IDF (20k–50k features)**      | **87–89%**           | Strongest MLP setup; ignores word order |
| **MLP (Dense Network)** | Averaged Word2Vec/GloVe embeddings | 82–84%               | Fast but loses sequence information     |
| **MLP (Dense Network)** | Trainable embedding + flatten      | 80–82%               | Weak sequence modeling                  |
| **1D CNN**              | Trainable embedding                | 88–90%               | Good at local n-grams                   |
| **LSTM / GRU**          | Trainable embedding                | 88–91%               | Captures long-range dependencies        |
| **BiLSTM + Attention**  | Trainable embedding                | 90–92%               | Strong classical NLP baseline           |
| **FastText**            | Bag-of-words + subword n-grams     | 88–90%               | Very fast baseline                      |
| **DistilBERT**          | Transformer                        | 93–94%               | Compact pretrained model                |
| **BERT-base**           | Transformer                        | 94–95%               | Widely used strong baseline             |
| **RoBERTa-large**       | Transformer                        | 95–96%               | Top-tier accuracy                       |


In [1]:
# pip install datasets transformers evaluate

In [2]:
# Inspired by https://github.com/rasbt/deeplearning-models/blob/master/pytorch_ipynb/transformer/distilbert-hf-finetuning.ipynb

In [3]:
import torch
import pandas as pd
import numpy as np
import time
import datasets

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
else:
    print("No GPU available")

GPU name: NVIDIA GeForce RTX 3090


### 1 Loading the Dataset from `datasets`

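The IMDB movie review dataset consists of 50k labeled movie reviews, each with a sentiment label (0: negative, 1: positive), split evenly into training and test sets. The Hugging Face version also includes 50k unlabeled reviews in an `unsupervised` split.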

In [4]:
from huggingface_hub import list_datasets
from datasets import load_dataset

In [5]:
# List the first 5 datasets on the Hugging Face Hub
for d in list_datasets(limit=5):
    print(d.id)

Anthropic/AnthropicInterviewer
TuringEnterprises/Turing-Open-Reasoning
nvidia/ToolScale
nvidia/PhysicalAI-Autonomous-Vehicles
openai/gdpval


In [6]:
imdb_data = load_dataset("imdb")
print(imdb_data)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [7]:
imdb_data["train"][99]

{'text': "This film is terrible. You don't really need to read this review further. If you are planning on watching it, suffice to say - don't (unless you are studying how not to make a good movie).<br /><br />The acting is horrendous... serious amateur hour. Throughout the movie I thought that it was interesting that they found someone who speaks and looks like Michael Madsen, only to find out that it is actually him! A new low even for him!!<br /><br />The plot is terrible. People who claim that it is original or good have probably never seen a decent movie before. Even by the standard of Hollywood action flicks, this is a terrible movie.<br /><br />Don't watch it!!! Go for a jog instead - at least you won't feel like killing yourself.",
 'label': 0}

In [8]:
# Convert each split to pandas
df_train = imdb_data["train"].to_pandas()[["text", "label"]]
df_test = imdb_data["test"].to_pandas()[["text", "label"]]
# The unsupervised split has no sentiment labels (label is -1); these rows are filtered out later
df_unsup = imdb_data["unsupervised"].to_pandas()[["text", "label"]]

# Concatenate all splits
df_all = pd.concat([df_train, df_test, df_unsup], ignore_index=True)

df_all.head()

|     | text                                              | label |
| --- | ------------------------------------------------- | ----- |
| 0   | I rented I AM CURIOUS-YELLOW from my video sto... | 0     |
| 1   | "I Am Curious: Yellow" is a risible and preten... | 0     |
| 2   | If only to avoid making this type of film in t... | 0     |
| 3   | This film was probably inspired by Godard's Ma... | 0     |
| 4   | Oh, brother...after hearing about this ridicul... | 0     |


In [9]:

df = df_all.copy()
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

**Basic dataset analysis and sanity checks**

In [10]:
print("Class distribution:")
df = df[df["label"] >= 0]   # drop unlabeled rows (label -1); keep only 0 and 1
np.bincount(df["label"].values)

Class distribution:


array([25000, 25000])
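
The `label >= 0` filter is what removes the unsupervised split, whose rows carry a placeholder label of -1. It has to come before the count, because `np.bincount` rejects negative values. Below is a minimal pure-Python sketch of the same filter-then-count step. The toy label list is made up for illustration and is not the real data.

```python
from collections import Counter

# Toy labels (hypothetical): -1 marks unlabeled rows, 0/1 are sentiment classes.
toy_labels = [0, 1, -1, 1, -1, 0, 1, -1]

# Keep only labeled rows, mirroring df[df["label"] >= 0].
labeled = [y for y in toy_labels if y >= 0]

# Count rows per class, analogous to np.bincount on the filtered labels.
counts = Counter(labeled)
class_distribution = [counts[0], counts[1]]
print(class_distribution)
```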

In [11]:
# Review length in whitespace-separated words: (min, median, max)
text_len = df["text"].apply(lambda x: len(x.split()))
text_len.min(), text_len.median(), text_len.max()

(np.int64(4), np.float64(173.0), np.int64(2470))

**Split data into training, validation, and test sets**

In [12]:
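# Note: reset_index() keeps the old index as an extra "index" column, which then
# appears in the CSV files and dataset features; reset_index(drop=True) would avoid it
df_shuffled = df.sample(frac=1, random_state=1).reset_index()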

df_train = df_shuffled.iloc[:35_000]
df_val = df_shuffled.iloc[35_000:40_000]
df_test = df_shuffled.iloc[40_000:]

df_train.to_csv("train.csv", index=False, encoding="utf-8")
df_val.to_csv("validation.csv", index=False, encoding="utf-8")
df_test.to_csv("test.csv", index=False, encoding="utf-8")
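
The slice boundaries above give a 35,000 / 5,000 / 10,000 split (70% / 10% / 20%) of the 50,000 labeled reviews. A quick sanity check is that the three slices are disjoint and together cover every row. The sketch below illustrates this on shuffled integer IDs standing in for DataFrame rows. `split_by_boundaries` is a hypothetical helper written for this example, not part of pandas.

```python
import random

def split_by_boundaries(items, train_end, val_end):
    """Slice a shuffled sequence into train / validation / test parts."""
    return items[:train_end], items[train_end:val_end], items[val_end:]

# Integer IDs stand in for the 50,000 labeled reviews.
ids = list(range(50_000))
random.Random(1).shuffle(ids)  # fixed seed for a reproducible shuffle

train_ids, val_ids, test_ids = split_by_boundaries(ids, 35_000, 40_000)

# The splits should not share any rows.
assert set(train_ids).isdisjoint(val_ids)
assert set(train_ids).isdisjoint(test_ids)
assert set(val_ids).isdisjoint(test_ids)
print(len(train_ids), len(val_ids), len(test_ids))
```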

**Load the dataset via `load_dataset`**

In [13]:
imdb_dataset = load_dataset(
    "csv",
    data_files={
        "train": "train.csv",
        "validation": "validation.csv",
        "test": "test.csv",
    },
)

print(imdb_dataset)

DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 35000
    })
    validation: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 10000
    })
})


### 2 Tokenization and Numericalization

In [14]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)

Tokenizer input max length: 512
Tokenizer vocabulary size: 30522


In [15]:
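def tokenize_text(batch):
    # truncation=True cuts each review to the model's maximum length (512 tokens);
    # padding=True pads every sequence to the longest one in the batch
    return tokenizer(batch["text"], truncation=True, padding=True)

In [16]:
# batch_size=None passes each split to the function as a single batch
imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)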

In [17]:
del imdb_dataset
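
With `truncation=True` the tokenizer cuts each review to the model's maximum length (512 tokens here), and with `padding=True` it pads shorter sequences to the longest one in the batch. Because `batch_size=None` hands each split to `map` as one batch, every example in a split is padded to that split's longest sequence. The toy function below sketches the idea on plain lists of token IDs. It is an illustration only, not the tokenizer's actual implementation; `pad_and_truncate`, the token IDs, and `pad_id=0` are assumptions made for the example.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    # Real tokenizers keep special tokens such as [SEP] when truncating;
    # this toy version simply cuts the list.
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Three made-up "reviews" of different lengths, already converted to IDs.
toy_sequences = [[11, 7, 8, 12], [11, 9, 12], [11, 1, 2, 3, 4, 5, 12]]
batch = pad_and_truncate(toy_sequences, max_length=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, so padding changes memory use and speed but not the prediction for a given review.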

### 3 Finetuning DistilBERT

In [18]:
# Load pretrained DistilBERT with a sequence-classification head

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,  # adds a new, randomly initialized head for binary classification
)
model.to(device);

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
from transformers import Trainer, TrainingArguments
import evaluate

metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    # eval_pred holds (logits, labels); argmax over the class axis gives predicted labels
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    acc = metric.compute(predictions=predictions, references=labels)["accuracy"]
    return {"accuracy": acc}

# Note: batch_size is not passed to TrainingArguments below, so the Trainer
# uses its default per-device batch size
batch_size = 10
trainer_args = TrainingArguments(output_dir="distilbert-v1",
                                 num_train_epochs=5,
                                 eval_strategy="epoch",
                                 learning_rate=1e-5)

trainer = Trainer(
    model=model,
    args=trainer_args,
    compute_metrics=compute_metrics,
    train_dataset=imdb_tokenized["train"],
    eval_dataset=imdb_tokenized["validation"],
    tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of processing_class
)
start = time.time()
trainer.train()
end = time.time()

print(f"Elapsed training time: {end - start:.4f} seconds")

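| Epoch | Training Loss | Validation Loss | Accuracy |
| ----- | ------------- | --------------- | -------- |
| 1     | 0.2297        | 0.271545        | 0.9278   |
| 2     | 0.1818        | 0.317593        | 0.9312   |
| 3     | 0.1326        | 0.342382        | 0.9344   |
| 4     | 0.0809        | 0.360517        | 0.9344   |
| 5     | 0.0429        | 0.392872        | 0.9326   |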


Elapsed training time: 2277.2440 seconds
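
In the log above, training loss falls every epoch while validation loss rises from epoch 1 onward, a typical sign of overfitting. Validation accuracy peaks at 0.9344 in epochs 3 and 4. The small sketch below picks the best epoch under each criterion, using the values copied from the log.

```python
# (epoch, training loss, validation loss, validation accuracy) from the log above.
log = [
    (1, 0.2297, 0.271545, 0.9278),
    (2, 0.1818, 0.317593, 0.9312),
    (3, 0.1326, 0.342382, 0.9344),
    (4, 0.0809, 0.360517, 0.9344),
    (5, 0.0429, 0.392872, 0.9326),
]

# Epoch with the lowest validation loss.
best_by_loss = min(log, key=lambda row: row[2])[0]
# Epoch with the highest validation accuracy (the first one wins on ties).
best_by_acc = max(log, key=lambda row: row[3])[0]

print("Best epoch by validation loss:", best_by_loss)
print("Best epoch by validation accuracy:", best_by_acc)
```

The two criteria disagree here, so which checkpoint to keep depends on the metric that matters for the task. `TrainingArguments` offers options such as `load_best_model_at_end` and `metric_for_best_model` for automating this choice; they are not used in this notebook.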


In [20]:
# predict() prefixes metric names with "test_" by default, whichever split is passed
outputs = trainer.predict(imdb_tokenized["train"])
outputs.metrics

{'test_loss': 0.026442622765898705,
 'test_accuracy': 0.9955714285714286,
 'test_runtime': 136.3525,
 'test_samples_per_second': 256.688,
 'test_steps_per_second': 32.086}

In [21]:
outputs = trainer.predict(imdb_tokenized["validation"])
outputs.metrics

{'test_loss': 0.39287230372428894,
 'test_accuracy': 0.9326,
 'test_runtime': 19.5982,
 'test_samples_per_second': 255.125,
 'test_steps_per_second': 31.891}

In [22]:
outputs = trainer.predict(imdb_tokenized["test"])
outputs.metrics

{'test_loss': 0.3850667178630829,
 'test_accuracy': 0.9342,
 'test_runtime': 39.073,
 'test_samples_per_second': 255.931,
 'test_steps_per_second': 31.991}
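
The three `predict` calls above report about 99.56% accuracy on the training split against 93.26% on validation and 93.42% on test. The throughput figures also imply the evaluation batch size. The arithmetic is sketched below with the numbers copied from the outputs above.

```python
# Accuracies reported by trainer.predict on each split (copied from above).
accuracy = {"train": 0.9955714285714286, "validation": 0.9326, "test": 0.9342}

# Gap between training and test accuracy, in percentage points.
gap_points = (accuracy["train"] - accuracy["test"]) * 100
print(f"Train/test gap: {gap_points:.2f} percentage points")

# samples per second / steps per second = samples per step (the batch size).
implied_batch_size = round(255.931 / 31.991)
print("Implied evaluation batch size:", implied_batch_size)
```

A gap of roughly six points between training and test accuracy is consistent with the rising validation loss seen during training. The implied batch size of 8 matches the `Trainer` default per-device batch size, since none is set in `TrainingArguments`.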