# Grammatical Acceptability Classification by Fine-tuning ConvBERT #

In this exercise, we fine-tune a ConvBERT model on the CoLA (Corpus of Linguistic Acceptability) dataset to classify sentences by grammatical acceptability: each sentence is labelled either acceptable (grammatically correct) or unacceptable (incorrect). The resulting classifier is intended to support the Student Help Centre in providing timely assistance to students.



## Section 1: Install and Import Required Libraries

In [None]:
# Install required libraries:
# - torch: PyTorch framework for deep learning
# - transformers: Hugging Face library for pretrained transformer models
# - datasets: Hugging Face library for dataset loading and processing
# - scikit-learn: provides accuracy_score, used in the evaluation section
!pip install torch transformers datasets scikit-learn

# Import the PyTorch base library for tensor operations and neural network building
import torch

# Import DataLoader to efficiently batch and iterate over datasets
from torch.utils.data import DataLoader

# Import AdamW optimizer from PyTorch (Adam with decoupled weight decay)
from torch.optim import AdamW

# Import ConvBERT-specific tokenizer and model class for sequence classification
# - ConvBertTokenizer: handles converting text into token IDs
# - ConvBertForSequenceClassification: pretrained ConvBERT model adapted for classification tasks
from transformers import ConvBertTokenizer, ConvBertForSequenceClassification

# Import function to load datasets easily (supports many public datasets or custom local files)
from datasets import load_dataset

# Import NumPy for numerical operations (e.g., converting tensors to arrays, calculations)
import numpy as np

# Import accuracy_score from scikit-learn to calculate classification accuracy
from sklearn.metrics import accuracy_score

# Import pad_sequence to pad sequences of different lengths into a uniform tensor
# (useful for batching variable-length text sequences)
from torch.nn.utils.rnn import pad_sequence




## Section 2: Data Preparation

In [None]:
# Load the CoLA dataset (Corpus of Linguistic Acceptability) from the GLUE benchmark
# - "glue" is the GLUE benchmark collection
# - "cola" is one of the tasks within GLUE (binary classification: acceptable vs unacceptable sentence)
dataset = load_dataset("glue", "cola")

# Select the predefined splits (GLUE ships CoLA already divided into train/validation/test)
train_dataset = dataset['train']          # Training split
test_dataset = dataset['validation']      # Validation split (used for evaluation; the GLUE test split has no public labels)

# Define the ConvBERT model checkpoint name from Hugging Face model hub
model_name = "YituTech/conv-bert-base"  # Pretrained ConvBERT base model

# Load the ConvBERT tokenizer
# - Tokenizer converts raw text (sentences) into token IDs that the model can understand
tokenizer = ConvBertTokenizer.from_pretrained(model_name)

# Define a function to tokenize a batch of text
# - Takes a dictionary batch (with key 'sentence') and applies tokenization
# - padding=True: pads each sentence to the longest sentence in the batch passed to this call
# - truncation=True: requests truncation to the model's max length; this tokenizer defines
#   none, so nothing is truncated (hence the warning in the output below)
def tokenize(batch):
    return tokenizer(batch['sentence'], padding=True, truncation=True)

# Apply tokenization to the training dataset
# - batched=True passes many examples to tokenize() per call, which is faster than one at a time
train_dataset = train_dataset.map(tokenize, batched=True)

# Apply tokenization to the validation dataset
test_dataset = test_dataset.map(tokenize, batched=True)

# Format the datasets so they can be fed directly into a PyTorch model
# - type='torch': converts dataset into PyTorch tensors
# - columns: selects which columns are kept for training
#   - input_ids: token IDs
#   - attention_mask: distinguishes padding tokens from real tokens
#   - label: target label for classification
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])




Dataset splits: train 8,551 examples; validation 1,043 examples; test 1,063 examples.

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


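To make the roles of `input_ids` and `attention_mask` concrete, here is a minimal plain-Python sketch of right-padding a batch. It only mirrors the idea behind the tokenizer's `padding=True`; the token IDs are made up for illustration, and the helper `pad_batch` and the padding ID of 0 are assumptions, not part of the notebook.

```python
# Plain-Python sketch of right-padding a batch of token-ID lists.
# The IDs below are invented for illustration; they are not real ConvBERT vocabulary IDs.
PAD_ID = 0  # assumption: 0 is used as the padding ID


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


batch = [[101, 7, 8, 102], [101, 9, 102]]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The shorter sequence gains one padding ID, and its mask ends in 0 so that position is ignored by attention.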

## Section 3: Model Setup and Fine-tuning

In [None]:
# This function ensures that all sentences in a batch are padded to the same length
# before being fed into the model. It handles input_ids, attention_mask, and labels.
def collate_fn(batch):
    # Extract 'input_ids' from each example in the batch
    input_ids = [item['input_ids'] for item in batch]

    # Extract 'attention_mask' from each example in the batch
    attention_mask = [item['attention_mask'] for item in batch]

    # Extract 'label' from each example in the batch
    labels = [item['label'] for item in batch]

    # Pad input_ids so that all sequences in the batch are the same length
    input_ids_padded = pad_sequence(input_ids, batch_first=True)

    # Pad attention_mask in the same way
    attention_mask_padded = pad_sequence(attention_mask, batch_first=True)

    # Convert labels into a single tensor
    labels = torch.tensor(labels)

    # Return everything as a dictionary for the DataLoader
    return {
        'input_ids': input_ids_padded,
        'attention_mask': attention_mask_padded,
        'label': labels
    }


# Load ConvBERT sequence classification model from Hugging Face
# - num_labels=2 because CoLA is a binary classification task (acceptable/unacceptable sentence)
model = ConvBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Use AdamW optimizer (improved Adam with decoupled weight decay)
# - model.parameters(): passes all model weights to the optimizer
# - lr=5e-5: learning rate for fine-tuning
optimizer = AdamW(model.parameters(), lr=5e-5)


# DataLoader for training data
# - batch_size=8: process 8 samples per step
# - shuffle=True: shuffle data each epoch
# - collate_fn=collate_fn: apply custom padding function
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)

# DataLoader for test data (no shuffling needed)
test_dataloader = DataLoader(test_dataset, batch_size=8, collate_fn=collate_fn)


# Use GPU (CUDA) if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the selected device
model.to(device)


# Set model to training mode (enables dropout and other train-time layer behaviour)
model.train()

# Loop through multiple epochs (passes through entire dataset)
for epoch in range(3):  # Train for 3 epochs
    total_loss = 0  # Track total loss for the epoch

    # Iterate over batches of training data
    for batch in train_dataloader:
        # Reset gradients before each batch
        optimizer.zero_grad()

        # Move input tensors to the same device as the model (CPU/GPU)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        # Forward pass: run inputs through the model
        # - The model automatically returns loss when labels are provided
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

        # Extract loss value
        loss = outputs.loss
        total_loss += loss.item()  # Accumulate loss for reporting

        # Backward pass: compute gradients
        loss.backward()

        # Update model weights
        optimizer.step()

    # Compute average loss for the epoch
    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}, Loss: {avg_loss:.4f}")


Some weights of ConvBertForSequenceClassification were not initialized from the model checkpoint at YituTech/conv-bert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.4376
Epoch 2, Loss: 0.2509
Epoch 3, Loss: 0.1553
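The loss values reported per epoch are averages of the per-batch loss that the model returns when `labels` are passed. For a two-class, single-label setup like this one, that loss is cross-entropy over the logits. The following is a plain-Python sketch of that computation under this assumption; the logits are toy values, and `softmax` and `cross_entropy` are illustrative helpers rather than the library's implementation.

```python
import math


def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the correct class."""
    losses = [
        -math.log(softmax(logits)[label])
        for logits, label in zip(batch_logits, labels)
    ]
    return sum(losses) / len(losses)


# Toy batch: two sentences, two classes (0 = unacceptable, 1 = acceptable)
batch_logits = [[2.0, -1.0], [0.0, 0.0]]
labels = [0, 1]
loss = cross_entropy(batch_logits, labels)
print(f"{loss:.4f}")  # 0.3709
```

The first example is predicted confidently and correctly, so it contributes little loss; the second has equal logits, so it contributes `log(2)`, the loss of an uninformed guess between two classes.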



## Section 4: Evaluate the Model

In [None]:
# Set the model to evaluation mode
# - disables dropout; gradient tracking is turned off separately by torch.no_grad() below
model.eval()

# Lists to store predictions and true labels
predictions, true_labels = [], []

# Disable gradient calculation during evaluation (saves memory & speeds up inference)
with torch.no_grad():
    # Iterate over the validation/test dataset
    for batch in test_dataloader:
        # Move inputs to the same device as the model (CPU/GPU)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        # Forward pass (no labels here, we just want logits)
        outputs = model(input_ids, attention_mask=attention_mask)

        # Extract raw model predictions (logits = unnormalized scores)
        logits = outputs.logits

        # Convert logits to predicted class indices (0 or 1 for CoLA)
        preds = torch.argmax(logits, dim=-1)

        # Store predictions and true labels (move to CPU and convert to numpy)
        predictions.extend(preds.cpu().numpy())
        true_labels.extend(labels.cpu().numpy())

# Compute accuracy by comparing predicted vs true labels
accuracy = accuracy_score(true_labels, predictions)

# Print final accuracy (measured on the GLUE validation split loaded as test_dataset)
print(f"Test Accuracy: {accuracy:.4f}")


Test Accuracy: 0.8389
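Accuracy alone can look good on CoLA because the classes are imbalanced, which is why the GLUE benchmark reports the Matthews correlation coefficient (MCC) for this task. The sketch below, in plain Python with toy values, shows the argmax step and both metrics side by side. The helpers `argmax`, `accuracy_of`, and `mcc` are illustrative; in the notebook, `sklearn.metrics.matthews_corrcoef(true_labels, predictions)` would be the natural counterpart to `accuracy_score`.

```python
import math


def argmax(values):
    """Index of the largest value (the predicted class)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_of(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when the denominator is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy case: a model that always predicts class 1 on imbalanced labels
y_true = [1, 1, 1, 0]
toy_logits = [[0.1, 0.9]] * 4
y_pred = [argmax(row) for row in toy_logits]

acc = accuracy_of(y_true, y_pred)
score = mcc(y_true, y_pred)
print(acc)    # 0.75
print(score)  # 0.0
```

A classifier that ignores its input still reaches 75% accuracy here, while its MCC of 0.0 shows it carries no information about the labels.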
