This notebook fine-tunes a pre-trained language model (BERT, `bert-base-uncased`) for sentiment classification on a 100-example subset of the IMDb dataset.

1. **Dataset Preparation:** Loaded and tokenized a subset of the IMDb dataset.
2. **Model Setup:** Loaded a pre-trained BERT model for sequence classification.
3. **Fine-Tuning:** Trained the model using the Trainer class from the transformers library.
4. **Evaluation:** Evaluated the model on the evaluation set, which in this notebook is the same 100-example subset used for training.
5. **Inference:** Performed inference on a sample text to predict its sentiment.

In [None]:
!pip install torch transformers datasets



In [None]:
from transformers import BertTokenizer, BertModel

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load the base model (no classification head); it is replaced by
# BertForSequenceClassification before training
model = BertModel.from_pretrained('bert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
from datasets import load_dataset

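# Load a sample dataset: the first 100 training examples, for demonstration.
# Caution: this split appears to be ordered by label, so a leading slice may
# contain only one class; shuffle before slicing for a balanced sample.
dataset = load_dataset('imdb', split='train[:100]')

# Tokenize the dataset, truncating and padding every review to the model's maximum length
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)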

tokenized_datasets = dataset.map(tokenize_function, batched=True)
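
Because `train[:100]` takes the leading rows of the split, it is worth checking the label balance and holding out some examples before training. The following is a minimal sketch in plain Python, not the notebook's method: `stratified_split` is a hypothetical helper that shuffles example indices with a fixed seed and sets aside the same fraction of each class for validation. The toy labels stand in for `dataset['label']`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, valid_fraction=0.2, seed=42):
    """Return (train_idx, valid_idx) with each class split in the same ratio."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_valid = int(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Toy labels: 50 negative (0) and 50 positive (1) examples.
labels = [0] * 50 + [1] * 50
train_idx, valid_idx = stratified_split(labels)

# Class counts in each part of the split.
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in valid_idx))
```

With the `datasets` library, the resulting index lists could be passed to `Dataset.select` to build separate training and validation sets.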

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
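
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of a fixed length. A rough pure-Python illustration of that idea follows; `pad_batch` is a hypothetical helper written for this sketch, not the library's implementation, and the token ids are illustrative only.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2003, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Since the tokenization cell already pads every example with `padding='max_length'`, the collator has nothing left to pad in this notebook; dropping that argument would let it pad per batch instead.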

In [None]:
from transformers import BertForSequenceClassification, TrainingArguments, Trainer

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
!pip install accelerate transformers -U



In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',  # renamed to `eval_strategy` in newer transformers releases
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize the trainer
trainer = Trainer(
    model=model,
    args=training_args,
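    train_dataset=tokenized_datasets,
    # Evaluating on the training data itself: the reported validation loss
    # does not measure generalization to unseen reviews
    eval_dataset=tokenized_datasets,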
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()



Epoch,Training Loss,Validation Loss
1,No log,0.008326
2,No log,0.00183
3,No log,0.001321


TrainOutput(global_step=39, training_loss=0.0446239832120064, metrics={'train_runtime': 36.9101, 'train_samples_per_second': 8.128, 'train_steps_per_second': 1.057, 'total_flos': 78933316608000.0, 'train_loss': 0.0446239832120064, 'epoch': 3.0})
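
The figures in `TrainOutput` follow from the training settings: 100 examples at a batch size of 8 give 13 steps per epoch, so 3 epochs give 39 steps. The sketch below reproduces that arithmetic, assuming a single device and no gradient accumulation. The `Training Loss` column most likely reads `No log` because the default logging interval of `TrainingArguments` (500 steps) is longer than the whole run.

```python
import math

num_examples = 100        # size of the train[:100] subset
batch_size = 8            # per_device_train_batch_size
num_epochs = 3            # num_train_epochs
train_runtime = 36.9101   # seconds, copied from the TrainOutput above

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```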

In [None]:
# Evaluate the model
evaluation_results = trainer.evaluate()

# Print the evaluation results
print(evaluation_results)

{'eval_loss': 0.0013212577905505896, 'eval_runtime': 2.7006, 'eval_samples_per_second': 37.028, 'eval_steps_per_second': 4.814, 'epoch': 3.0}
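
`trainer.evaluate()` reports only the loss here because no metric function was passed to the `Trainer`. A minimal sketch of an accuracy metric follows, assuming it would be supplied through the `compute_metrics` argument of `Trainer`. The function body is an illustration written in plain Python so that it runs on nested lists as well as arrays, and the toy logits and labels are invented.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); each row of logits holds per-class scores."""
    logits, labels = eval_pred
    # Index of the highest score in each row is the predicted class.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy check with made-up values: three of the four predictions match.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.4, 0.9]]
toy_labels = [0, 1, 1, 1]
print(compute_metrics((toy_logits, toy_labels)))
```

With NumPy arrays, `np.argmax(logits, axis=-1)` would be the more usual way to obtain the predictions.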


In [None]:
# Save the model
model.save_pretrained('./trained_model')
tokenizer.save_pretrained('./trained_model')

('./trained_model/tokenizer_config.json',
 './trained_model/special_tokens_map.json',
 './trained_model/vocab.txt',
 './trained_model/added_tokens.json')

In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the trained model and tokenizer
model = BertForSequenceClassification.from_pretrained('./trained_model')
tokenizer = BertTokenizer.from_pretrained('./trained_model')

# Prepare a sample text for inference
sample_text = "This is a great movie with excellent performances."

# Tokenize the input text
inputs = tokenizer(sample_text, return_tensors='pt', padding=True, truncation=True)

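# Perform inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Print the predicted class index (in IMDb, 0 = negative and 1 = positive)
print(f"Predicted class: {predictions.item()}")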

Predicted class: 0
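
The printed index can be turned into a readable label and a confidence score. The sketch below applies a softmax to a pair of logits and maps the winning index to a name, assuming IMDb's convention of 0 for negative and 1 for positive. `ID2LABEL` and `describe` are hypothetical names, and the logit values are invented for illustration. Under that convention, class 0 for the clearly positive sample above is a negative prediction; one possible cause is a training subset dominated by a single class.

```python
import math

ID2LABEL = {0: "negative", 1: "positive"}  # assumed IMDb label convention


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


label, prob = describe([3.2, -2.9])  # made-up logits, not model output
print(f"{label} ({prob:.3f})")
```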
