# Comparative Study of Text Classification Using Transformer-Based Transfer Learning vs. GRU-Based Deep Learning from Scratch


## 📌 Project Overview

This project compares two deep learning methods for classifying IMDB movie reviews as either positive or negative.


1. **Transformer-Based Transfer Learning**:
   - Utilizes a pretrained DistilBERT model (a lighter variant of BERT) from the Hugging Face Transformers library.
   - Leverages transfer learning by fine-tuning the model on a 5,000-review subset of the IMDB dataset.

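2. **GRU-Based Model From Scratch**:
   - Builds a bidirectional GRU (Gated Recurrent Unit) model whose recurrent and dense layers are trained from random initialization.
   - Uses a frozen embedding matrix: words found in GloVe get their 100-dimensional GloVe vector, and all other words get a random vector.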

**Objective**:

Evaluate and compare the performance of both methods in terms of:

- Training and validation loss

- Accuracy

- Generalization ability

- Training time and resource usage

**Motivation**:

- GRU-based models are classical deep learning approaches for sequence modeling and still offer simpler alternatives when resources are limited.

- Transformer-based models have achieved state-of-the-art performance on many NLP tasks, at the cost of far more parameters and compute.




This project demonstrates the strengths and trade-offs of each approach in a practical text classification scenario.





---



**Transfer Learning Part**: DistilBertForSequenceClassification

Importing libraries

In [None]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
import numpy as np



Hugging Face login (optional, since the IMDB dataset is public)

In [None]:
!huggingface-cli login
# Enter your access token at the prompt; do not paste it into the notebook.



    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `google colab` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `googl

In [None]:
!pip install -U datasets huggingface_hub fsspec
# Run this cell only if the next cell (dataset loading) raises an error.
# fsspec is the filesystem interface library that datasets depends on.

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.33.2-py3-none-any.whl.metadata (14 kB)
Collecting fsspec
  Downloading fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 491.5/491.5 kB 15.6 MB/s eta 0:00:00
Downloading huggingface_hub-0.33.2-py3-none-any.whl (515 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 515.4/515.4 kB 39.4 MB/s eta 0:00:00
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.6/193.6 kB 17.3 MB/s eta 0:00:00
Installing collected packages: fsspec, huggingface_hub, datasets
  Attempting uninstall: fsspec
    Found existing installat

Loading dataset

In [None]:
# Load the IMDB dataset once (train / test / unsupervised splits)
dataset = load_dataset("stanfordnlp/imdb")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]


Dataset summary

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [None]:
# Subsample to keep training time manageable: 5,000 training and 1,000 test reviews
train_data = dataset['train'].shuffle(seed=42).select(range(5000))
test_data = dataset['test'].shuffle(seed=42).select(range(1000))

In [None]:

# Preprocess (for Transformer)

# Load the DistilBERT tokenizer (pretrained on lowercase English)
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Define a function to tokenize each text example:
# - convert text to input IDs and attention masks
# - pad or truncate to a fixed length of 256 tokens
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=256)

# Apply the tokenizer to the training and test subsets
train_tokenized = train_data.map(tokenize_function, batched=True)
test_tokenized = test_data.map(tokenize_function, batched=True)

# Convert tokenized datasets to PyTorch format, keeping only the required columns
train_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:

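# Transformer model: pretrained DistilBERT encoder plus a newly initialized 2-class classification head
model_transformer = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
# Optimizer
optimizer = torch.optim.AdamW(model_transformer.parameters(), lr=5e-5)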

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_transformer.to(device)


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [None]:

# Training function
# Fine-tunes all model weights, starting from the pretrained DistilBERT weights.
def train_transformer():
    model_transformer.train()
    for epoch in range(6):
        epoch_losses = []
        for batch in DataLoader(train_tokenized, batch_size=8, shuffle=True):
            batch = {k: v.to(device) for k, v in batch.items()}
            batch["labels"] = batch.pop("label")
            outputs = model_transformer(**batch)
            loss = outputs.loss

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            epoch_losses.append(loss.item())

        avg_loss = sum(epoch_losses) / len(epoch_losses)
        print(f"Epoch {epoch+1}, Average Loss: {avg_loss:.4f}")

train_transformer()

Epoch 1, Average Loss: 0.3914
Epoch 2, Average Loss: 0.1988
Epoch 3, Average Loss: 0.0964
Epoch 4, Average Loss: 0.0508
Epoch 5, Average Loss: 0.0426
Epoch 6, Average Loss: 0.0344


In [None]:
# Evaluate the fine-tuned model on the test subset

def evaluate_transformer():
    model_transformer.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for batch in DataLoader(test_tokenized, batch_size=8):#predicting on test tokenized data
            batch = {k: v.to(device) for k, v in batch.items()}
            batch["labels"] = batch.pop("label")  #Rename 'label' to 'labels'
            outputs = model_transformer(**batch)
            preds = torch.argmax(outputs.logits, dim=-1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(batch['labels'].cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average="weighted")

    print("Accuracy:", acc)
    print("F1 Score:", f1)

evaluate_transformer()


Accuracy: 0.863
F1 Score: 0.8625405147347781
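
The evaluation above reports a weighted F1, while the GRU section later reports a macro F1. The difference between the two averages can be sketched in plain Python. The helper names (`f1_per_class`, `macro_and_weighted_f1`) and the toy labels are made up for illustration; this is not the scikit-learn implementation, only the same arithmetic under the assumption that every class appears in `y_true`.

```python
def f1_per_class(y_true, y_pred, positive):
    """F1 for one class, treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro F1, weighted F1) over the classes present in y_true."""
    classes = sorted(set(y_true))
    scores = {c: f1_per_class(y_true, y_pred, c) for c in classes}
    support = {c: sum(1 for t in y_true if t == c) for c in classes}
    # Macro: plain mean of the per-class scores.
    macro = sum(scores.values()) / len(classes)
    # Weighted: per-class scores weighted by how often each class occurs.
    weighted = sum(scores[c] * support[c] for c in classes) / len(y_true)
    return macro, weighted


# Toy, deliberately imbalanced labels (illustrative only): 6 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

On imbalanced data the two averages diverge, as in this toy case; on a roughly balanced sample they should be close to each other.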


In [None]:
#Total parameters that are fine-tuned

total_params = sum(p.numel() for p in model_transformer.parameters() if p.requires_grad)
print(f"Total trainable parameters: {total_params}")


Total trainable parameters: 66955010
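
The total printed above can be reproduced by hand from the layer shapes. The sketch below is only arithmetic, not a call into the library. The feed-forward width of 3072 is an assumption taken from the standard `distilbert-base-uncased` configuration (the model printout above is cut off before it shows that layer), and the helper names `linear` and `layer_norm` are hypothetical.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Parameters of a LayerNorm: one scale and one shift per feature."""
    return 2 * dim


vocab, max_pos, dim, n_layers, n_labels = 30522, 512, 768, 6, 2
ffn_dim = 3072  # assumed: standard DistilBERT feed-forward width

# Word embeddings + position embeddings + LayerNorm
embeddings = vocab * dim + max_pos * dim + layer_norm(dim)

# One transformer block: q/k/v/out projections, LayerNorm, two FFN layers, LayerNorm
block = (
    4 * linear(dim, dim)
    + layer_norm(dim)
    + linear(dim, ffn_dim)
    + linear(ffn_dim, dim)
    + layer_norm(dim)
)

# Classification head: pre_classifier followed by classifier
head = linear(dim, dim) + linear(dim, n_labels)

total = embeddings + n_layers * block + head
print(total)
```

Under these assumptions the sum matches the 66,955,010 trainable parameters reported by the notebook, of which only the head (about 0.6 million) is newly initialized.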




---



**GRU Model from scratch**

A GRU (Gated Recurrent Unit) is a type of RNN designed to process sequential data such as text or time series.

Core idea:
Classic RNNs suffer from vanishing gradients, so they struggle to learn long-term dependencies (e.g., relationships between words far apart in a sentence).
GRUs mitigate this with an update gate and a reset gate that control how much of the previous state is kept or overwritten at each step.
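
A single GRU time step can be sketched in plain Python as follows. This is a textbook formulation with biases omitted for brevity; the weight names (`Wz`, `Uz`, `Wr`, `Ur`, `Wh`, `Uh`) and the helpers `matvec` and `gru_step` are illustrative assumptions, and the Keras `GRU` layer used later applies a slightly different variant by default (`reset_after=True`).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x, h_prev, p):
    """One GRU step. `p` maps hypothetical weight names to matrices."""
    # Update gate z: fraction of the previous state to keep.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    # Reset gate r: how much of the previous state feeds the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate state built from the input and the reset-scaled previous state.
    h_cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], reset_h))]
    # New state: blend of the previous state and the candidate.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, h_cand)]


# Usage with all-zero weights (input size 3, hidden size 2): both gates sit at 0.5
# and the candidate is 0, so the new state is half of the previous state.
zeros_in = [[0.0] * 3 for _ in range(2)]
zeros_hid = [[0.0] * 2 for _ in range(2)]
params = {"Wz": zeros_in, "Uz": zeros_hid, "Wr": zeros_in,
          "Ur": zeros_hid, "Wh": zeros_in, "Uh": zeros_hid}
h_new = gru_step([1.0, 2.0, 3.0], [1.0, -2.0], params)
print(h_new)
```

Because the new state is a gated blend rather than a full overwrite, information (and gradient) from earlier steps can pass through many time steps when the update gate stays close to 1.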

In [None]:
# Libraries
import os
import random
import zipfile
from collections import Counter
from io import BytesIO

import numpy as np
import requests
import tensorflow as tf
from sklearn.metrics import f1_score
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout, GRU
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [None]:
def set_seed(seed):
    tf.random.set_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

In [None]:
# Download the GloVe embeddings from Stanford, unzip them, and load them into a dictionary mapping words to vectors.
# After tokenization, these vectors represent each word as a fixed-size (100-dimensional) vector
# that is fed to the model.


# Download the GloVe archive (held in memory, not written to disk)
url = "http://nlp.stanford.edu/data/glove.6B.zip"

response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
print("Downloaded GloVe zip")

zip_file = zipfile.ZipFile(BytesIO(response.content))
glove_file = zip_file.open('glove.6B.100d.txt')

# Parse GloVe vectors into dict: word -> vector
glove = {}
for line in glove_file:
    parts = line.decode('utf-8').strip().split()
    word = parts[0]
    vec = np.array(parts[1:], dtype=np.float32)
    glove[word] = vec

print(f"Loaded {len(glove)} GloVe word vectors")


Downloaded GloVe zip
Loaded 400000 GloVe word vectors


In [None]:
# Collect tokens from the training reviews to build the vocabulary.
# Tokens are whitespace-separated words; no subword tokenization is used.

def tokenize(text):
    return text.lower().split()

# Gather all tokens from the reviews
all_tokens = []
for review in train_data["text"]:
    all_tokens.extend(tokenize(review))

token_counts = Counter(all_tokens)
vocab = list(token_counts.keys()) # all the unique tokens
print(f"Vocabulary size: {len(vocab)}")





Vocabulary size: 92230
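
Whitespace splitting keeps punctuation attached to words, so `movie.` and `movie` count as different tokens. This inflates the vocabulary and likely lowers the share of tokens that have a GloVe vector. One possible alternative, shown here as a sketch and not as what the notebook does, is a small regex tokenizer; `tokenize_regex` is a hypothetical helper and its pattern is an assumption.

```python
import re


def tokenize_whitespace(text):
    # Same behaviour as the notebook's tokenize()
    return text.lower().split()


def tokenize_regex(text):
    # Hypothetical alternative: keep runs of letters, digits and apostrophes,
    # dropping surrounding punctuation.
    return re.findall(r"[a-z0-9']+", text.lower())


sample = "Great movie. Loved it!"
ws_tokens = tokenize_whitespace(sample)
rx_tokens = tokenize_regex(sample)
print(ws_tokens)
print(rx_tokens)
```

Note that IMDB reviews contain HTML line breaks (`<br />`), which this pattern would turn into a stray `br` token unless they are stripped first.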


In [None]:
# Split the vocabulary into tokens that have a GloVe vector and tokens that do not (OOV)
in_glove_token=[tok for tok in vocab if tok in glove ]
in_oov_token=[tok for tok in vocab if tok not in glove]

# Build the word -> index map. Index 0 is reserved for padding and index 1 for <UNK>,
# which stands in for validation/test words that never appear in the training vocabulary.
word2idx = {"<PAD>": 0, "<UNK>": 1}
idx = 2

# Add in-glove tokens
for tok in in_glove_token:
    word2idx[tok] = idx
    idx += 1

# idx also counts the two special tokens, so the printed value is len(in_glove_token) + 2
print("Words found in Glove",idx)
#Add oov tokens
for tok in in_oov_token:
    word2idx[tok] = idx
    idx += 1
print("Overall",idx)

Words found in Glove 32162
Overall 92232
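
The index layout built above (padding, unknown, GloVe-covered words, then out-of-vocabulary words) can be illustrated on a toy vocabulary. The function `build_word2idx`, the toy word list, and the stand-in set of "known" words are all hypothetical; the real notebook uses the GloVe dictionary keys instead.

```python
def build_word2idx(vocab, known):
    """Assign indices: 0 = <PAD>, 1 = <UNK>, then known words, then the rest."""
    word2idx = {"<PAD>": 0, "<UNK>": 1}
    ordered = [t for t in vocab if t in known] + [t for t in vocab if t not in known]
    for tok in ordered:
        word2idx[tok] = len(word2idx)
    return word2idx


toy_vocab = ["great", "movie", "zzzgood", "bad"]
toy_known = {"great", "movie", "bad"}  # stand-in for the GloVe keys

w2i = build_word2idx(toy_vocab, toy_known)
print(w2i)

# Words never seen in training fall back to the <UNK> index.
encoded = [w2i.get(w, w2i["<UNK>"]) for w in "great unseen movie".split()]
print(encoded)
```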


In [None]:
# Build the embedding matrix
emb=100
embedding_matrix=np.zeros((len(word2idx),emb))
count1=0  # vocabulary entries initialized randomly
count=0   # vocabulary entries initialized from GloVe
# Fill one row per vocabulary entry:
# use the GloVe vector when the word has one, otherwise draw a random vector.
# Rows 0 (<PAD>) and 1 (<UNK>) stay all-zero.
for tok in vocab:
    if tok in glove:
        count+=1
        embedding_matrix[word2idx[tok]] = glove[tok]
    else:
        count1+=1
        embedding_matrix[word2idx[tok]] = np.random.normal(scale=0.6, size=(emb,))


# embedding_matrix now holds one vector for every entry in word2idx

In [None]:
# Create X_train / y_train and X_test / y_test
# and prepare the data for the model.
# Convert words to token IDs; unknown words map to <UNK>
def text_to_sequence(text, word2idx):
    return [word2idx.get(word, word2idx["<UNK>"]) for word in text.lower().split()]

# Convert all reviews to sequences of IDs
X_train = [text_to_sequence(review, word2idx) for review in train_data["text"]]
X_test = [text_to_sequence(review, word2idx) for review in test_data["text"]]



# Pad sequences to the same length
max_len = max(len(seq) for seq in X_train)  # or set a fixed length like 50

X_train_padded = pad_sequences(X_train, maxlen=max_len, padding='post', truncating='post')
# Convert labels to numpy array
y_train = np.array(train_data["label"])

# padding testing data
X_test_padded = pad_sequences(X_test, maxlen=max_len, padding='post', truncating='post')
# Convert labels to numpy array
y_test = np.array(test_data["label"])


# Final shapes
print("X_train shape:", X_train_padded.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test_padded.shape)
print("y_test shape:", y_test.shape)





X_train shape: (5000, 1601)
y_train shape: (5000,)
X_test shape: (1000, 1601)
y_test shape: (1000,)
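
The padded width of 1601 comes from the single longest training review, so most rows are mostly padding. What `padding='post'` and `truncating='post'` do to each sequence can be sketched in plain Python; `pad_post` is a hypothetical helper written for illustration, not the Keras function itself.

```python
def pad_post(seqs, maxlen, pad_id=0):
    """Cut each sequence to maxlen from the end, then right-pad with pad_id."""
    return [seq[:maxlen] + [pad_id] * (maxlen - len(seq)) for seq in seqs]


toy_sequences = [[5, 6, 7, 8], [9], [3, 4]]
padded = pad_post(toy_sequences, maxlen=3)
print(padded)

# Share of positions that are padding in this toy batch
n_pad = sum(row.count(0) for row in padded)
pad_share = n_pad / (len(padded) * 3)
print(pad_share)
```

A smaller fixed `max_len` would shorten training at the cost of truncating the longest reviews; which value works best here would have to be checked experimentally.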


In [None]:
# Define the embedding layer and freeze it.
# Roughly 35 percent of the vocabulary rows come from GloVe; the rest are randomly initialized.
vocab_size = len(word2idx)
embedding_dim = embedding_matrix.shape[1]

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    weights=[embedding_matrix],
    input_length=max_len,
    mask_zero=True,   # ignore <PAD> (index 0) positions
    trainable=False,  # keep the embedding vectors fixed during training
)




In [None]:
#defining model
from tensorflow.keras.layers import Input, Embedding, Bidirectional, GRU, Dropout, Dense, Add
from tensorflow.keras.models import Model

input_seq = Input(shape=(max_len,))

# Embedding layer
x = embedding_layer(input_seq)  #which is defined in previous cell

# First Bidirectional GRU + Dropout, Trainable
gru1 = Bidirectional(GRU(64, return_sequences=True))(x)
drop1 = Dropout(0.3)(gru1)

residual_1 = drop1  # We'll add residual on next step

# Second Bidirectional GRU + Dropout
gru2 = Bidirectional(GRU(64, return_sequences=True))(drop1)
drop2 = Dropout(0.3)(gru2)

# Add residual connection from previous GRU output to current GRU output
residual_2 = Add()([drop2, residual_1])  # element-wise add

# Third Bidirectional GRU + Dropout
gru3 = Bidirectional(GRU(64, return_sequences=True))(residual_2)
drop3 = Dropout(0.3)(gru3)

# Add residual connection from previous GRU output to current GRU output
residual_3 = Add()([drop3, residual_2])  # element-wise add

# Fourth Bidirectional GRU (returns only the final state, not the full sequence)
gru4 = Bidirectional(GRU(64))(residual_3)
drop4 = Dropout(0.3)(gru4)

# Dense layers
dense1 = Dense(64, activation='relu')(drop4)
drop5 = Dropout(0.3)(dense1)

output = Dense(1, activation='sigmoid')(drop5)

model = Model(inputs=input_seq, outputs=output)


# Compile and build the model
model.compile(optimizer=Adam(learning_rate=1e-4), loss='binary_crossentropy', metrics=['accuracy'])
model.build(input_shape=(None, max_len))  # max_len is the length of padded sequences

model.summary()

In [None]:

# Several hyperparameter combinations were tried; the final model was trained for 20 epochs with learning rate 1e-4.
SEEDS = [42]
results = []

for seed in SEEDS:
    print(f"\n\nRunning experiments with seed: {seed}")
    set_seed(seed)

    # Train Model 1

    model.fit(X_train_padded, y_train, epochs=20, batch_size=16, validation_split=0.2 )





Running experiments with seed: 42
Epoch 1/20
250/250 ━━━━━━━━━━━━━━━━━━━━ 48s 152ms/step - accuracy: 0.5148 - loss: 0.7058 - val_accuracy: 0.5390 - val_loss: 0.6851
Epoch 2/20
250/250 ━━━━━━━━━━━━━━━━━━━━ 75s 140ms/step - accuracy: 0.5304 - loss: 0.6907 - val_accuracy: 0.5850 - val_loss: 0.6743
Epoch 3/20
250/250 ━━━━━━━━━━━━━━━━━━━━ 36s 146ms/step - accuracy: 0.5744 - loss: 0.6767 - val_accuracy: 0.6090 - val_loss: 0.6579
Epoch 4/20
250/250 ━━━━━━━━━━━━━━━━━━━━ 41s 147ms/step - accuracy: 0.6075 - loss: 0.6608 - val_accuracy: 0.6310 - val_loss: 0.6354
Epoch 5/20
250/250 ━━━━━━━━━━━━━━━━━━━━ 40s 144ms/step - accuracy: 0.6309 - loss: 0.6384 - val_accuracy: 0.6540 - val_loss: 0.6131
Epoch 6/20
250/250 ━━━━━━━━━━━━━━━━━━━━ 42s 150ms/step - accuracy: 0.6703 - loss: 0.6118 - val_accuracy: 

In [None]:
# Compute the macro F1 score on the test subset

# predict() returns shape (n, 1); flatten it to match y_test
y_test_pred_model1 = (model.predict(X_test_padded) > 0.5).astype(int).ravel()
f1_model1 = f1_score(y_test, y_test_pred_model1, average='macro')
print(f"Model 1 Macro F1: {f1_model1:.4f}")

# Save results
results.append({
    "seed": seed,
    "model1_f1": f1_model1
})


32/32 ━━━━━━━━━━━━━━━━━━━━ 3s 91ms/step
Model 1 Macro F1: 0.7758


**Summary of Findings**:

I trained a GRU-based model for sentiment classification on a 5,000-review subset of the IMDB dataset using a frozen embedding layer. Roughly 35% of the vocabulary rows (32,160 of 92,230 tokens) are initialized from pre-trained GloVe vectors; the remaining ~65% are initialized randomly. The model has approximately 9.5 million parameters in total, but most of them (~9.2 million) are non-trainable because the embedding layer is frozen. Only around 295,000 parameters in the rest of the model are trainable.
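
These parameter counts can be checked by hand from the layer sizes in the model definition. The sketch below is plain arithmetic, not a Keras call. It assumes the Keras `GRU` default `reset_after=True`, which gives each of the three gates an input kernel, a recurrent kernel, and two bias vectors; the helper names `gru_params`, `bi_gru_params`, and `dense_params` are hypothetical.

```python
def gru_params(input_dim, units):
    """GRU parameters, assuming Keras's default reset_after=True."""
    return 3 * (input_dim * units + units * units + 2 * units)


def bi_gru_params(input_dim, units):
    """A Bidirectional wrapper holds one forward and one backward GRU."""
    return 2 * gru_params(input_dim, units)


def dense_params(n_in, n_out):
    return n_in * n_out + n_out


vocab_size, emb_dim, units = 92232, 100, 64

# Frozen embedding matrix (non-trainable)
embedding = vocab_size * emb_dim

# First BiGRU reads 100-d embeddings; the next three read 128-d BiGRU outputs
trainable = (
    bi_gru_params(emb_dim, units)
    + 3 * bi_gru_params(2 * units, units)
    + dense_params(2 * units, 64)
    + dense_params(64, 1)
)

print("non-trainable:", embedding)
print("trainable:", trainable)
print("total:", embedding + trainable)
```

Under these assumptions the result agrees with the rounded figures quoted above: about 9.2 million frozen embedding weights and about 295,000 trainable weights.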

The GRU model achieved a macro F1 score of about 0.77, which is decent given the smaller number of trainable parameters.

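In comparison, the Transformer-based model (DistilBERT) I used for transfer learning has about 67 million trainable parameters and achieved an accuracy of 0.863 and a weighted F1 score of about 0.86 on the same test subset. Note that the GRU score is macro-averaged while the DistilBERT score is weighted, so the two numbers are not strictly the same metric, although they should be close when the classes are roughly balanced.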

This suggests that while the GRU model is more lightweight and computationally efficient, the Transformer’s larger capacity and ability to fine-tune pre-trained weights provide better performance on this task.

Freezing the embeddings in the GRU model likely limited its ability to fully adapt to the dataset, so allowing embedding fine-tuning might improve results at the cost of increased training time.

**In conclusion**, for resource-constrained scenarios, GRU models with frozen embeddings are a reasonable choice, but if accuracy is the priority and resources allow, a fine-tuned Transformer is the better option: in this experiment DistilBERT reached an F1 of about 0.86 versus about 0.78 for the GRU.
