# Text Classification using distilbert-base-uncased

## DistilBERT is a smaller version of the BERT-base model, trained by knowledge distillation with BERT-base as the teacher

## The IMDb dataset is used to fine-tune DistilBERT for sentiment classification
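To make the teacher/student idea concrete, here is a minimal, dependency-free sketch of a soft-target distillation loss: the student is penalized by the KL divergence between temperature-softened teacher and student distributions. The logits and the temperature are made-up values, and the function names are hypothetical. The actual DistilBERT recipe combines a term of this kind with other losses, so this is an illustration, not the procedure used to train the checkpoint.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))

# Hypothetical logits for a 3-way prediction (illustrative values only).
teacher = [2.0, 0.5, -1.0]
good_student = [1.8, 0.6, -0.9]   # close to the teacher
poor_student = [-1.0, 0.5, 2.0]   # ranks the classes in reverse

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
print(loss_good < loss_poor)
```

A student that mimics the teacher's distribution gets a smaller loss than one that disagrees with it, which is the signal distillation relies on.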

In [1]:
# dependencies
#%pip install torch --quiet
%pip install transformers datasets evaluate --quiet


Note: you may need to restart the kernel to use updated packages.


In [1]:
# import dependencies
from datasets import load_dataset
from transformers import AutoTokenizer
import evaluate
import numpy as np
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch




In [3]:
imdb = load_dataset("imdb")

Found cached dataset imdb (/home/azadeh/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)



In [4]:
# check the dataset
print(imdb)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
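Before training it can help to confirm that the labels are balanced. The sketch below uses a hypothetical helper, `label_distribution`, on a toy list standing in for `imdb["train"]["label"]`; the toy values are not taken from the dataset.

```python
from collections import Counter

def label_distribution(labels):
    """Return the fraction of examples per label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Toy stand-in for imdb["train"]["label"] (hypothetical values).
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1]
print(label_distribution(toy_labels))
```

With the real data the call would be `label_distribution(imdb["train"]["label"])`. The `unsupervised` split is not used for fine-tuning here; its `label` column is expected to hold placeholder values rather than real sentiment labels, which is worth verifying with the same helper.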


In [8]:
# fix the random seeds for reproducibility
random_seed = 1  # any fixed integer works
torch.manual_seed(random_seed)
torch.cuda.manual_seed(random_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# optionally seed NumPy as well (np is already imported above)
# np.random.seed(random_seed)
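The seeding calls can be gathered into one helper so that every random number generator is seeded in the same place. This is a sketch under the assumption that Python's `random`, NumPy and PyTorch are the only sources of randomness; `seed_everything` is a hypothetical name, and the optional imports are guarded so the snippet also runs where NumPy or PyTorch is missing.

```python
import random

def seed_everything(seed):
    """Seed Python's RNG and, when installed, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # ignored when CUDA is unavailable
    except ImportError:
        pass  # PyTorch not installed; nothing to seed

seed_everything(1)
first = [random.random() for _ in range(3)]
seed_everything(1)
second = [random.random() for _ in range(3)]
print(first == second)  # same seed, same sequence
```

The `transformers` library also ships a `set_seed` function that serves a similar purpose.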

In [9]:
# choose the model and related tokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [10]:
# define tokenizer function
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
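Many reviews are longer than the model's maximum input length (512 tokens for this checkpoint), which is why `truncation=True` is passed. The toy function below illustrates the idea with plain lists; it is a simplified stand-in, not the tokenizer's actual implementation, and `truncate_with_special_tokens` is a hypothetical name.

```python
def truncate_with_special_tokens(tokens, max_length=512):
    """Keep at most max_length tokens, reserving two slots for [CLS] and [SEP]."""
    body = tokens[: max_length - 2]
    return ["[CLS]"] + body + ["[SEP]"]

long_review = ["word"] * 1000        # stand-in for a long tokenized review
short_review = ["great", "movie"]    # short inputs are left intact

print(len(truncate_with_special_tokens(long_review)))
print(truncate_with_special_tokens(short_review))
```

Anything beyond the limit is dropped from the end of the review, so sentiment cues that appear only in the last sentences of a very long review are not seen by the model.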



In [11]:
# apply the tokenizer in batch mode
tokenized_imdb = imdb.map(preprocess_function, batched=True)


Loading cached processed dataset at /home/azadeh/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-c1563b55d0da1a9b.arrow
Loading cached processed dataset at /home/azadeh/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-b2ee7b7c301146c1.arrow


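With `batched=True`, the mapped function receives a dictionary of columns (lists) rather than one example at a time, and returns new columns of the same length. The sketch below mimics that contract in plain Python; `map_batched` and `toy_preprocess` are hypothetical helpers, and a whitespace split stands in for the real tokenizer.

```python
def map_batched(columns, fn, batch_size=1000):
    """Apply fn to column-wise batches and add the columns it returns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Slice every column to build one batch, shaped like a dict of lists.
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        for k, v in fn(batch).items():
            new_columns.setdefault(k, []).extend(v)
    return {**columns, **new_columns}

def toy_preprocess(examples):
    # Whitespace split as a stand-in for the real tokenizer.
    return {"tokens": [text.split() for text in examples["text"]]}

data = {"text": ["good film", "bad plot", "fine"], "label": [1, 0, 1]}
result = map_batched(data, toy_preprocess, batch_size=2)
print(result["tokens"])
```

Processing whole batches lets a fast tokenizer handle many texts per call, which is usually much quicker than mapping one example at a time.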

In [12]:
# check that the tokenizer works well on this corpus

a = tokenizer(imdb["train"]["text"][0], truncation=True)
print(imdb["train"]["text"][0])

print()  # blank line between the raw text and the decoded tokens
print(tokenizer.decode(a['input_ids']))



I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, eve

In [13]:
accuracy = evaluate.load("accuracy")

In [14]:
# define the accuracy function for evaluation

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
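To see what `compute_metrics` calculates, here is a dependency-free sketch of the same computation: take the index of the largest logit per row and compare it with the reference label. The logits and labels are invented for illustration, and `accuracy_from_logits` is a hypothetical name.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Hypothetical logits for four reviews: one column per class (0, 1).
toy_logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

Three of the four toy predictions match their labels, hence 0.75. The returned dictionary has the same shape as the one produced by `accuracy.compute`.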

In [15]:
# label mappings for the classification head
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
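The two dictionaries must be exact inverses of each other. One way to avoid typing mistakes, sketched here as an optional alternative, is to derive one mapping from the other and check the round trip.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Build the inverse mapping instead of typing it by hand.
label2id = {label: idx for idx, label in id2label.items()}

print(label2id)
# Round-trip check: every id maps to a label and back to the same id.
assert all(label2id[id2label[i]] == i for i in id2label)
```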

In [16]:


model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, id2label=id2label, label2id=label2id
)

# show how many parameters it has
total_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {total_params}")


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.

Number of parameters: 66955010
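The printed total can be broken down with a little arithmetic. The sketch assumes a hidden size of 768 and a head made of the two layers named in the warning above (`pre_classifier` and `classifier`); the memory estimate assumes float32 weights and ignores optimizer state and activations.

```python
total_params = 66_955_010          # value printed by the cell above
hidden_size = 768                  # assumed DistilBERT-base hidden size
num_labels = 2

# Classification head: a hidden-to-hidden layer plus a hidden-to-labels layer,
# each with a weight matrix and a bias vector.
pre_classifier = hidden_size * hidden_size + hidden_size
classifier = hidden_size * num_labels + num_labels
head_params = pre_classifier + classifier

size_mib = total_params * 4 / 1024**2   # 4 bytes per float32 weight

print(f"head parameters: {head_params}")
print(f"encoder parameters: {total_params - head_params}")
print(f"approx. float32 size: {size_mib:.1f} MiB")
```

Under these assumptions the newly initialized head accounts for well under 1% of the weights; almost everything comes from the pretrained encoder.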


In [23]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = model.to(device)

print(device)

training_args = TrainingArguments(
    output_dir="./checkpoints/",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,    
    compute_metrics=compute_metrics,
)

trainer.train()

cuda:0


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.1259 | 0.294654 | 0.91832 |
| 2 | 0.1033 | 0.299436 | 0.92964 |
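The two epochs rank differently depending on the metric: epoch 1 has the lower validation loss, epoch 2 the higher accuracy. The sketch below makes that explicit. It assumes, as I understand the `Trainer` defaults, that `load_best_model_at_end=True` without `metric_for_best_model` selects on evaluation loss; the step numbers come from the checkpoint name and `global_step` reported elsewhere in this notebook.

```python
# Per-epoch evaluation results copied from the training log.
eval_log = [
    {"epoch": 1, "step": 1563, "eval_loss": 0.294654, "eval_accuracy": 0.91832},
    {"epoch": 2, "step": 3126, "eval_loss": 0.299436, "eval_accuracy": 0.92964},
]

best_by_loss = min(eval_log, key=lambda r: r["eval_loss"])
best_by_accuracy = max(eval_log, key=lambda r: r["eval_accuracy"])

print(f"best by loss: checkpoint-{best_by_loss['step']}")
print(f"best by accuracy: checkpoint-{best_by_accuracy['step']}")
```

If accuracy is the metric that matters, passing `metric_for_best_model="accuracy"` to `TrainingArguments` should make the selection follow it instead.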


TrainOutput(global_step=3126, training_loss=0.11572517314478897, metrics={'train_runtime': 754.3504, 'train_samples_per_second': 66.282, 'train_steps_per_second': 4.144, 'total_flos': 6561288258498624.0, 'train_loss': 0.11572517314478897, 'epoch': 2.0})
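The numbers in `TrainOutput` can be cross-checked against the training configuration. This worked example assumes a single device and no gradient accumulation, so each optimizer step consumes one batch of 16 examples.

```python
import math

train_examples = 25_000      # rows in the train split
batch_size = 16              # per_device_train_batch_size
epochs = 2                   # num_train_epochs
train_runtime = 754.3504     # seconds, from TrainOutput

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

The results agree with the log: 1563 steps per epoch (the name of the first checkpoint), 3126 steps in total (`global_step`), and roughly 66.28 samples per second.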

In [24]:
# Test the fine-tuned model on a sample text

model_name = "./checkpoints/checkpoint-1563"
sample_text = "This is the most pretty movie ever"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(sample_text, return_tensors="pt")  # return PyTorch tensors
model = AutoModelForSequenceClassification.from_pretrained(model_name)

with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

'POSITIVE'
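The predicted label alone does not say how confident the model is. Applying a softmax to the logits turns them into class probabilities. The sketch below does this in plain Python on made-up logits, since the real values are not shown in the output above.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
toy_logits = [-1.3, 1.6]   # hypothetical model output for one review

probs = softmax(toy_logits)
best = max(range(len(probs)), key=lambda i: probs[i])
print(id2label[best], round(probs[best], 3))
```

With the real model, `torch.softmax(logits, dim=-1)` applied to the `logits` tensor from the cell above gives the same kind of result.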