# 5. Transformer Sanity Check with DistilBERT
**Objective:** Get a complete, end-to-end Transformer fine-tuning pipeline working. The goal is not a high score but to confirm that the new libraries and components are set up correctly and work together.

## 5.1. Imports and Configuration

In [1]:
import pandas as pd
import numpy as np
import re
import tensorflow as tf
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, DataCollatorWithPadding

MODEL_CHECKPOINT = "distilbert-base-uncased"
BATCH_SIZE = 16 # You can lower this if you run out of memory

## 5.2. Load and Prepare Data

In [2]:
# Load the full training data
full_train_df = pd.read_csv('../data/train.csv')

# Text cleaning function (clean3)
def clean3(text):
  text = text.lower() # Lowercase
  text = re.sub(r"#([a-z0-9_]+)", r"\1", text) # Keep the hashtag word, drop the '#'
  text = re.sub(r'http\S+', "", text) # Remove http/https URLs
  text = re.sub(r"www\.\S+", "", text) # Remove www. URLs
  text = re.sub(r'@\w+', "", text) # Remove @mentions
  text = re.sub(r"[^a-z0-9\s]", " ", text) # Replace anything other than a-z, 0-9 and whitespace with a space
  text = re.sub(r"\s+", " ", text).strip() # Collapse repeated whitespace into a single space
  return text

full_train_df['text'] = full_train_df['text'].apply(clean3)

# Rename 'target' to 'label', the column name the Hugging Face tooling conventionally uses
full_train_df = full_train_df.rename(columns={'target': 'label'})

# Create a stratified 90/10 train/validation split
train_df, val_df = train_test_split(
  full_train_df,
  test_size=0.1,
  stratify=full_train_df['label'],
  random_state=42
)

# Convert pandas dataframe to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

print("Training set shape:", train_df.shape)
print("Validation set shape:", val_df.shape)
print("\nColumns are now:", train_dataset.column_names)

Training set shape: (6851, 5)
Validation set shape: (762, 5)

Columns are now: ['id', 'keyword', 'location', 'text', 'label', '__index_level_0__']
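
To make the effect of `clean3` concrete, here is a small standalone sketch that repeats the same regex steps and applies them to a single tweet. The sample tweet is hypothetical, written for this illustration, and is not taken from `train.csv`.

```python
import re

def clean3(text):
  # Same steps as the notebook's clean3, repeated so the example runs on its own
  text = text.lower()
  text = re.sub(r"#([a-z0-9_]+)", r"\1", text)   # "#wildfire" -> "wildfire"
  text = re.sub(r"http\S+", "", text)            # drop http/https URLs
  text = re.sub(r"www\.\S+", "", text)           # drop www. URLs
  text = re.sub(r"@\w+", "", text)               # drop @mentions
  text = re.sub(r"[^a-z0-9\s]", " ", text)       # punctuation -> space
  text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
  return text

# Hypothetical tweet (an assumption for illustration only)
sample = "Forest fire near La Ronge Sask. Canada #wildfire http://t.co/abc @user!!"
cleaned = clean3(sample)
print(cleaned)  # forest fire near la ronge sask canada wildfire
```

Note that the hashtag word survives while the URL and the mention are removed entirely, so topical hashtags stay available to the tokenizer.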


## 5.3. Tokenization

In [3]:
# Loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

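# Tokenize a batch of examples. Padding is left to the data collator:
# padding here would pad every example to the longest text in each map batch
# and defeat dynamic padding.
def tokenize_function(examples):
  return tokenizer(examples["text"], truncation=True)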

# Apply the tokenization to both datasets
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_val_dataset = val_dataset.map(tokenize_function, batched=True)

# Dynamic padding: the collator pads each batch to its own longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

Map:   0%|          | 0/6851 [00:00<?, ? examples/s]

Map:   0%|          | 0/762 [00:00<?, ? examples/s]
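
Dynamic padding means each batch is padded only to its own longest sequence instead of to one global length. The following is a minimal pure-Python sketch of that idea, not the `DataCollatorWithPadding` implementation. The helper `pad_batch`, the made-up token ids, and `pad_id=0` are assumptions for illustration.

```python
def pad_batch(sequences, pad_id=0):
  """Pad token-id lists to the longest sequence in this batch only."""
  max_len = max(len(seq) for seq in sequences)
  input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
  # 1 marks a real token, 0 marks padding the model should ignore
  attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
  return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids of different lengths (not real tokenizer output)
examples = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102], [101, 5, 6, 102]]

batch_size = 2
batches = [pad_batch(examples[i:i + batch_size])
           for i in range(0, len(examples), batch_size)]

# Each batch ends up with its own width
widths = [len(b["input_ids"][0]) for b in batches]
print(widths)  # [5, 4]
```

With short tweets, padding per batch rather than globally keeps the tensors smaller, which should reduce wasted computation on padding tokens.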

## 5.4. Fine-Tuning the Model

In [5]:
# Loading the pre-trained model
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=2)

# Prepare the datasets for TensorFlow
tf_train_dataset = tokenized_train_dataset.to_tf_dataset(
  columns=["attention_mask", "input_ids", "label"],
  shuffle=True,
  batch_size=BATCH_SIZE,
  collate_fn=data_collator,
)

tf_val_dataset = tokenized_val_dataset.to_tf_dataset(
  columns=["attention_mask", "input_ids", "label"],
  shuffle=False,
  batch_size=BATCH_SIZE,
  collate_fn=data_collator,
)

# Compile: the model outputs raw logits, so the loss is told to expect them
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]

model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

# Fine-tune for 3 epochs
model.fit(tf_train_dataset, validation_data=tf_val_dataset, epochs=3)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Epoch 1/3


2025-06-12 12:28:22.302517: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
E0000 00:00:1749720503.066957  103513 meta_optimizer.cc:967] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Adam/AssignAddVariableOp_10.


Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x3521cf110>
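
The loss used above, sparse categorical cross-entropy with `from_logits=True`, takes integer class labels and raw logits. As a rough illustration of the arithmetic, the sketch below computes it per example as `-log(softmax(logits)[label])`. The helper name `sparse_ce_from_logits` and the logits are hypothetical, and this is not the Keras implementation.

```python
import math

def sparse_ce_from_logits(logits, label):
  """Cross-entropy for one example: -log(softmax(logits)[label])."""
  m = max(logits)  # subtract the max for numerical stability
  log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
  return log_sum_exp - logits[label]

# Hypothetical logits for two tweets, ordered as (class 0, class 1)
batch_logits = [[2.0, 0.0], [0.0, 0.0]]
batch_labels = [0, 1]  # integer labels, as in the 'label' column

losses = [sparse_ce_from_logits(z, y) for z, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)  # the batch loss is the mean over examples

print(round(losses[1], 4))  # 0.6931, i.e. ln(2) for an undecided 50/50 prediction
```

A freshly initialized two-class head tends to start near this `ln(2)` value, so a training loss that drops below it is a quick sign that the pipeline is learning something.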

## 5.5. Sanity Check Evaluation

In [7]:
from sklearn.metrics import f1_score

# Get predictions for the validation set
val_logits = model.predict(tf_val_dataset).logits
val_probs = tf.nn.softmax(val_logits, axis=1).numpy()
val_preds = np.argmax(val_probs, axis=1)

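# True labels. Row order matches the predictions because tf_val_dataset
# was built with shuffle=False.
y_true = val_df['label'].to_numpy()

# Binary F1 score for the positive class (label 1)
f1 = f1_score(y_true, val_preds)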

print(f"\nSanity Check Validation F1 Score: {f1:.5f}")


Sanity Check Validation F1 Score: 0.79228
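
For reference, the binary F1 score reported above is the harmonic mean of precision and recall for the positive class. The sketch below works through that calculation by hand on a few hypothetical logits. It also takes the argmax directly on the logits, which selects the same class as argmax after softmax because softmax preserves ordering. The helper `f1_binary` and all values are assumptions for illustration, not results from the model.

```python
def f1_binary(y_true, y_pred, positive=1):
  """F1 for the positive class from true/false positive and false negative counts."""
  pairs = list(zip(y_true, y_pred))
  tp = sum(1 for t, p in pairs if t == positive and p == positive)
  fp = sum(1 for t, p in pairs if t != positive and p == positive)
  fn = sum(1 for t, p in pairs if t == positive and p != positive)
  if tp == 0:
    return 0.0
  precision = tp / (tp + fp)
  recall = tp / (tp + fn)
  return 2 * precision * recall / (precision + recall)

# Hypothetical logits, ordered as (class 0, class 1), and true labels
logits = [[1.2, -0.3], [-0.5, 0.9], [0.1, 0.4], [2.0, 1.0], [-1.0, 0.2]]
y_true = [0, 1, 0, 1, 1]

# Predicted class = index of the largest logit in each row
y_pred = [max(range(len(row)), key=row.__getitem__) for row in logits]

print(y_pred)                               # [0, 1, 1, 0, 1]
print(round(f1_binary(y_true, y_pred), 4))  # 0.6667
```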
