<a href="https://colab.research.google.com/github/arvishcdoshi/BERT-FineTuning/blob/main/BERT_FineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
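! pip install tensorflow
! pip install datasets
! pip install transformers  # Hugging Face library that provides the pretrained models and tokenizers
! pip install peft  # Parameter-efficient fine-tuning (LoRA), used later in this notebook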




*   We need a model.
*   Every model expects its input in a particular format.
*   A tokenizer converts raw text into that format before it is sent to
    the model.

*   So whenever we work with pre-trained models, we usually deal with two things: the model itself and its tokenizer, which produces the tokenized input that particular model expects.

* We can load either a model-specific tokenizer class (such as `BertTokenizer`) or a generic one (such as `AutoTokenizer`).




In [4]:
import tensorflow as tf
from transformers import TFAutoModel, AutoTokenizer, BertModel, BertTokenizer
from datasets import load_dataset

In [5]:
model = TFAutoModel.from_pretrained("bert-base-uncased", use_safetensors=False)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training

In [6]:
# Tokenization examples with sample inputs

inputs = tokenizer(["Hi There", "How are you doing?", "Did you know how she is doing?"], padding=True, truncation=True, return_tensors="tf")
print(inputs)

# If we give return_tensors="pt", we'll get it in PyTorch compatible format.

{'input_ids': <tf.Tensor: shape=(3, 10), dtype=int32, numpy=
array([[ 101, 7632, 2045,  102,    0,    0,    0,    0,    0,    0],
       [ 101, 2129, 2024, 2017, 2725, 1029,  102,    0,    0,    0],
       [ 101, 2106, 2017, 2113, 2129, 2016, 2003, 2725, 1029,  102]],
      dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(3, 10), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(3, 10), dtype=int32, numpy=
array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}


###### For every sentence, the tokenizer adds a `[CLS]` token at the start (ID 101 above), then the tokens of the sentence, and a `[SEP]` separator token at the end (ID 102). ID 0 is padding.

###### [CLS] --WORDS-OF-SENTENCE-- [SEP]
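The layout above can be reproduced by hand. The following is a minimal sketch, not the real WordPiece tokenizer: `VOCAB` is a toy dictionary holding only the token IDs visible in the printed `input_ids`, and `encode_batch` is a hypothetical helper that wraps each sentence in `[CLS]`/`[SEP]`, pads to the longest sentence, and builds the attention mask.

```python
# Toy vocabulary: only the token IDs that appear in the printed `input_ids`.
VOCAB = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "hi": 7632, "there": 2045, "how": 2129, "are": 2024, "you": 2017,
    "doing": 2725, "?": 1029, "did": 2106, "know": 2113, "she": 2016, "is": 2003,
}

def encode_batch(sentences):
    """Lowercase, split off '?', wrap in [CLS]/[SEP], pad to the longest sentence."""
    encoded = []
    for sentence in sentences:
        words = sentence.lower().replace("?", " ?").split()
        ids = [VOCAB["[CLS]"]] + [VOCAB[w] for w in words] + [VOCAB["[SEP]"]]
        encoded.append(ids)

    max_len = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        pad = max_len - len(ids)
        input_ids.append(ids + [VOCAB["[PAD]"]] * pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = encode_batch(["Hi There", "How are you doing?", "Did you know how she is doing?"])
for row, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(row, mask)
```

The rows match the `input_ids` and `attention_mask` arrays printed by the real tokenizer for these three sentences.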

BERT is already pre-trained, so we can directly pass inputs to it.

In [7]:
output = model(inputs)
output

TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=<tf.Tensor: shape=(3, 10, 768), dtype=float32, numpy=
array([[[-0.14789374,  0.3964893 , -0.24775226, ..., -0.12281266,
          0.14289272,  0.32143828],
        [ 0.2299842 , -0.16025874,  0.8877433 , ..., -0.36398733,
          0.17502756, -0.09828773],
        [-0.44709256,  0.20279723, -0.19469815, ...,  0.12965532,
          0.12807481, -0.48188624],
        ...,
        [-0.42713445, -0.07055303,  0.12530704, ...,  0.2775267 ,
          0.06314106, -0.01058996],
        [-0.48046604, -0.1329498 ,  0.18505162, ...,  0.42149228,
          0.05628638, -0.00228706],
        [-0.12210363,  0.22139286,  0.36014962, ..., -0.02125258,
          0.0744325 ,  0.3455704 ]],

       [[ 0.01838403,  0.17156395, -0.3715585 , ..., -0.580281  ,
          0.23574226,  0.1888364 ],
        [ 0.12354422, -0.5365178 , -0.27377445, ..., -0.23337778,
          0.47872016, -0.52466655],
        [ 0.56446874, -0.6456795 ,  0.383824  , ..

### Embeddings vs. Context in BERT's Output

It's common to get confused between "embedding" and "context" when discussing models like BERT. Here's a clarification:

*   **Embeddings**: Generally, an embedding is a numerical representation of a piece of data (like a word, phrase, or sentence) in a multi-dimensional space. The idea is that similar items are closer together in this space. Historically, word embeddings (e.g., Word2Vec, GloVe) were *static*; a word like "bank" would have the same embedding regardless of its usage.

*   **Context**: In natural language processing, 'context' refers to the surrounding words, phrases, and sentences that give meaning to a particular word or piece of text. For instance, the meaning of "bank" changes whether it's used in "river bank" or "deposit money in the bank."

#### What BERT Provides: Contextualized Embeddings

BERT doesn't give you 'embedding OR context'; it gives you **contextualized embeddings**. This means that the embeddings BERT generates *already incorporate the context* of the words within the input sequence.

When you execute `output = model(inputs)`:

*   **What it does:** This line passes your tokenized inputs (the numerical representations of your text) through the pre-trained BERT model. It's essentially performing a forward pass, where the input data travels through all the layers of the neural network.
*   **What it means:** It's the step where BERT processes your text and computes its internal representations, which are rich with semantic and contextual information.
*   **What it gives:** The `output` variable will contain several pieces of information, most notably:
    *   **`last_hidden_state`**: These are the token-level embeddings. Each word (or sub-word token) in your input receives its own embedding vector. The values of this vector are dynamically determined by the *context* of all other words in the input sentence. So, the embedding for "bank" in "river bank" will be distinct from the embedding for "bank" in "financial bank." These embeddings are rich with contextual information.

    *   **`pooler_output`**: This is a single embedding for the *entire input sequence*, derived from the final hidden state of the `[CLS]` token rather than from an average over all tokens. It serves as a summary representation of the whole sentence and is often used for tasks that need one vector per input, like text classification.

In essence, the embeddings produced by BERT are a powerful type of embedding that inherently captures and represents context. The model uses the context to *create* these rich numerical representations.
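To make the `pooler_output` description concrete: in the standard BERT implementation it is computed from the final hidden state of the `[CLS]` token (position 0 of `last_hidden_state`) by a dense layer followed by a tanh. The symbols below are names chosen here for illustration: $H$ is `last_hidden_state`, $p$ is `pooler_output`, and $W_p$, $b_p$ are the pooler's weight and bias.

$$
h_{\mathrm{CLS}} = H[:,\, 0,\, :], \qquad p = \tanh\left(W_p \, h_{\mathrm{CLS}} + b_p\right)
$$

For `bert-base-uncased`, $W_p$ is $768 \times 768$, so a batch of 3 sentences of 10 tokens goes from shape $(3, 10, 768)$ to $(3, 768)$.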

#### Example: Contextualized Embeddings with Ambiguous Words

Consider these two sentences:

1.  "He sat near a **river bank**."
2.  "She collected money from the **bank**."

When you process these sentences through BERT:

1.  **Embeddings for every word:** You will indeed get unique contextualized embeddings for *every* word (or sub-word token) in both sentences. This includes "He", "sat", "near", "a", "river", and "bank" from the first sentence, and "She", "collected", "money", "from", "the", and "bank" from the second sentence. These token-level embeddings are found in the `last_hidden_state` output.

2.  **Different embeddings for 'bank':** Crucially, the embedding (vector representation) for the word "bank" in the first sentence ("river bank") will be numerically distinct from the embedding for "bank" in the second sentence ("collected money from the bank"). This is because BERT understands the surrounding words and generates an embedding for "bank" that reflects its specific meaning and context within each sentence (one referring to a landform, the other to a financial institution).
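The effect described above can be illustrated without downloading BERT. This is a toy sketch under loud assumptions: the 2-dimensional vectors in `STATIC` are invented for illustration, and `contextualize` runs a single self-attention step with identity query/key/value projections, which is far simpler than BERT's 12 layers. It only shows the mechanism: the same static vector for "bank" turns into different vectors depending on its neighbours.

```python
import math

# Hypothetical 2-d static embeddings, invented for illustration (not BERT's weights).
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens):
    """One self-attention step with identity Q/K/V projections (a deliberate simplification)."""
    vectors = [STATIC[t] for t in tokens]
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot-product score of this token against every token in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The new vector is a weighted mix of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
    return dict(zip(tokens, outputs))

bank_in_river = contextualize(["river", "bank"])["bank"]
bank_in_money = contextualize(["money", "bank"])["bank"]
print("static bank:     ", STATIC["bank"])
print("bank | river ctx:", bank_in_river)
print("bank | money ctx:", bank_in_money)
```

With the real model, the analogous check would compare the rows of `last_hidden_state` at the position of "bank" in each sentence.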

What output did we get?
> The output is the contextualized representation: we passed the inputs through BERT and got these embeddings back.

> The shape of `last_hidden_state` is 3 × 10 × 768, whereas the input shape was 3 × 10.

> Every token (a word or sub-word piece, including `[CLS]`, `[SEP]`, and padding) is now represented by a 768-dimensional vector. This hidden size varies from model to model.

> We also get `pooler_output`, with shape 3 × 768: one embedding per sentence.

CHATGPT Thread :- https://chatgpt.com/c/69339c6c-2bfc-8323-aa22-2f1df2cff8cf


# What we've done so far
 - We loaded a pre-trained BERT model and passed our tokenized inputs through its layers to get contextualized embeddings.

 - Our goal now is to fine-tune this model on our own dataset.

 - The dataset we'll use: https://huggingface.co/datasets/SetFit/emotion

In [8]:
dataset = load_dataset("SetFit/emotion")

Repo card metadata block was not found. Setting CardData to empty.


In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
})

In [10]:
train_data = dataset["train"]
test_data = dataset["test"]

In [11]:
train_data.shape

(16000, 3)

In [12]:
train_data[1]

{'text': 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
 'label': 0,
 'label_text': 'sadness'}

In [13]:
test_data.shape

(2000, 3)

In [14]:
test_data[0]

{'text': 'im feeling rather rotten so im not very ambitious right now',
 'label': 0,
 'label_text': 'sadness'}

In [15]:
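# Tokenization function
def tokenize_function(batch):
    # padding="max_length" pads every example to the tokenizer's maximum length
    # (512 tokens for bert-base-uncased); truncation=True cuts anything longer.
    return tokenizer(batch["text"], padding="max_length", truncation=True)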

# Apply tokenization
train_dataset = train_data.map(tokenize_function, batched=True)
test_dataset = test_data.map(tokenize_function, batched=True)

# Convert to PyTorch format
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

In [16]:
from transformers import AutoModelForSequenceClassification

# Load BERT model (6 output classes)
base_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Understanding `TFAutoModel` vs. `AutoModelForSequenceClassification`

When working with the `transformers` library, it's important to understand the subtle but significant differences between loading models for general feature extraction versus specific downstream tasks like classification.

#### BERT's Architecture: An Encoder

First, it's crucial to remember that **BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model**. Its architecture is based solely on the encoder stack of the original Transformer model. The raw output of a standard Transformer encoder is a sequence of *hidden states* (contextual embeddings for each token), not classification probabilities.

#### `TFAutoModel.from_pretrained("bert-base-uncased")`

*   **Purpose:** This command loads the **base pre-trained BERT model** for **feature extraction**. It provides the core BERT architecture with all its pre-trained layers, but *without any task-specific head* on top.
*   **Output:** When you pass inputs to a model loaded this way, its primary outputs are:
    *   `last_hidden_state`: Contextualized embeddings for each token in the input sequence.
    *   `pooler_output`: A single, aggregated summary embedding for the entire sequence (often derived from the `[CLS]` token).
*   **Use Case:** You would use `TFAutoModel` when your goal is to:
    *   Simply extract rich, contextual embeddings from text for use in another system or as input to a custom model.
    *   Build your *own custom task-specific head* on top of BERT for a novel task that isn't directly covered by existing `AutoModelFor...` classes (e.g., a very specific type of classification layer, or a custom layer for question answering or generation).
    *   Think of it as getting the 'brain' of BERT, ready for you to attach any 'skill' you need.

#### `AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=6)`

*   **Purpose:** This command loads the **base pre-trained BERT model *plus* a classification head** already configured for a sequence classification task. The `num_labels` argument (e.g., `num_labels=6`) specifies how many output classes your classification task has.
*   **How it works:** When you call this, the `transformers` library:
    1.  Loads the pre-trained `bert-base-uncased` encoder model.
    2.  **Attaches a new, randomly initialized classification head** on top of the encoder. For BERT, this head is a dropout layer followed by a single linear (dense) layer. It takes BERT's pooled output (computed from the final hidden state of the `[CLS]` token) and maps it to `num_labels` (e.g., 6) output dimensions, the raw scores (logits) for each class.
*   **Output:** When you pass inputs to this model, it not only processes them through the BERT layers but also feeds the result into the newly attached classification layer. The final output is logits for each of your `num_labels` classes, which can then be converted to probabilities (e.g., using a softmax function).
*   **Use Case:** You would use this directly when you want to fine-tune BERT for a specific text classification task, such as:
    *   Sentiment analysis (e.g., positive, negative, neutral).
    *   Topic classification (e.g., sports, politics, tech).
    *   Your current task of emotion classification (where `num_labels=6` corresponds to 6 emotion categories).
    *   Think of it as getting the 'brain' of BERT *and* a ready-to-use 'skill' for classification already attached.

**In essence, the key difference is the presence of the task-specific head.** `TFAutoModel` provides the raw feature extractor, while `AutoModelForSequenceClassification` provides the feature extractor *plus* a classification layer, making it immediately suitable for fine-tuning on labeled datasets for classification tasks, without you having to manually construct that final layer.
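What the classification head adds can be sketched in plain Python. This is an illustrative stand-in, not the library's code: `pooled` is random data standing in for BERT's pooled output, and `linear_head` is a hypothetical helper for the single linear layer. The initialization (normal weights with standard deviation 0.02, zero bias) follows BERT's default `initializer_range`.

```python
import math
import random

HIDDEN_SIZE = 768   # bert-base-uncased hidden size
NUM_LABELS = 6      # emotion classes in this notebook

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_head(pooled, weight, bias):
    """logits[j] = sum_i weight[j][i] * pooled[i] + bias[j]"""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weight, bias)]

random.seed(0)
# Stand-in for BERT's pooled output; real values come from the encoder.
pooled = [random.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]
# Freshly initialized (untrained) head: small random weights, zero bias.
weight = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS

logits = linear_head(pooled, weight, bias)
probs = softmax(logits)
head_params = HIDDEN_SIZE * NUM_LABELS + NUM_LABELS

print("probabilities:", [round(p, 3) for p in probs])
print("head parameters:", head_params)
```

Because the head starts out untrained, its probabilities carry no information about the classes until it is fine-tuned. This is why the library warns that the model should be trained on a downstream task before use.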

In [17]:
base_model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [18]:
model.summary()

Model: "tf_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
Total params: 109482240 (417.64 MB)
Trainable params: 109482240 (417.64 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


https://huggingface.co/docs/peft/v0.8.0/en/package_reference/lora

In [19]:
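# LoRA Configuration

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Note: no task_type is set here, so only the LoRA matrices are trainable and the
# randomly initialized classifier head stays frozen. Passing
# task_type=TaskType.SEQ_CLS would also train and save that head.
lora_config = LoraConfig(
    r=8,  # Rank of the low-rank update matrices
    lora_alpha=32,  # Scaling factor (the update is scaled by lora_alpha / r)
    lora_dropout=0.05,  # Dropout applied on the LoRA branch
    target_modules=["query", "value"]  # Apply LoRA to the attention query and value projections only
)

# Freeze the base model's weights. This helper is designed for quantized (k-bit)
# models; base_model is not quantized here.
base_model = prepare_model_for_kbit_training(base_model)

# Wrap the model so that the LoRA adapters are injected into the target modules
peft_model = get_peft_model(base_model, lora_config)

# Print trainable parameters
peft_model.print_trainable_parameters()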

trainable params: 294,912 || all params: 109,781,766 || trainable%: 0.2686
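The reported count can be checked by hand. A short sketch, assuming the standard LoRA layout in which each adapted `Linear` layer gets two matrices, A (`r × in_features`) and B (`out_features × r`); the constant names are chosen here for illustration.

```python
HIDDEN = 768            # in/out features of the query and value Linear layers
NUM_LAYERS = 12         # encoder layers in bert-base-uncased
RANK = 8                # r in LoraConfig
TARGETS_PER_LAYER = 2   # "query" and "value"
NUM_LABELS = 6

# Each adapted Linear gets A (r x in_features) and B (out_features x r).
params_per_module = RANK * HIDDEN + HIDDEN * RANK
lora_params = NUM_LAYERS * TARGETS_PER_LAYER * params_per_module

all_params = 109_781_766  # "all params" from the output above
trainable_pct = f"{100 * lora_params / all_params:.4f}"

# The classification head (768 -> 6 plus bias) is not included in the trainable count.
head_params = HIDDEN * NUM_LABELS + NUM_LABELS

print(f"LoRA params: {lora_params:,}")
print(f"trainable%: {trainable_pct}")
print(f"classifier head params: {head_params:,}")
```

The trainable count equals the LoRA matrices alone, which shows that the 4,614 parameters of the classifier head are frozen in this setup.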


https://huggingface.co/docs/transformers/en/trainer

In [20]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments( # configuration class that defines how training should happen
    output_dir="./model_checkpoints",  # Where to save model
    num_train_epochs=1,  # Train for 1 epoch
    per_device_train_batch_size=16,  # 16 samples per GPU/CPU
    eval_strategy="epoch",  # Evaluate after every epoch
    save_strategy="epoch",  # Save model after each epoch
    logging_steps=10,  # Log training metrics every 10 steps
    load_best_model_at_end=True,  # Automatically load best checkpoint
    fp16=True,  # Use mixed precision for faster training (if GPU supports it)
    report_to="tensorboard"  # Log training metrics to TensorBoard
)

# Trainer is a high-level class that automates training, evaluation, and saving.
# It wraps the model and datasets and handles:
#   - the training loop
#   - evaluation during training
#   - model saving and checkpointing

trainer = Trainer(
    model=peft_model,  # LoRA-wrapped model
    args=training_args,  # Training settings
    train_dataset=train_dataset,  # Training data
    eval_dataset=test_dataset,  # Evaluation data (the test split is used here)
    tokenizer=tokenizer,  # Tokenizer, saved alongside the checkpoints
)

trainer.train()


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.4778        | 1.371447        |


TrainOutput(global_step=1000, training_loss=1.533692108154297, metrics={'train_runtime': 347.1394, 'train_samples_per_second': 46.091, 'train_steps_per_second': 2.881, 'total_flos': 4224423591936000.0, 'train_loss': 1.533692108154297, 'epoch': 1.0})
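The step count and throughput in `TrainOutput` follow from the training settings. A quick check, assuming a single device and no gradient accumulation (neither is stated in the notebook); the runtime is copied from the output above.

```python
import math

train_rows = 16_000        # rows in the train split
batch_size = 16            # per_device_train_batch_size, assuming one device
epochs = 1                 # num_train_epochs
train_runtime = 347.1394   # seconds, from the TrainOutput above

steps = math.ceil(train_rows / batch_size) * epochs
samples_per_second = round(train_rows * epochs / train_runtime, 3)
steps_per_second = round(steps / train_runtime, 3)

print("steps:", steps)
print("samples/s:", samples_per_second)
print("steps/s:", steps_per_second)
```

These values agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the recorded output.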

In [25]:
# inference
from transformers import pipeline, AutoModelForSequenceClassification
from peft import PeftModel

# Label ids in the SetFit/emotion dataset:
#         0: "sadness",
#         1: "joy",
#         2: "love",
#         3: "anger",
#         4: "fear",
#         5: "surprise"

# Load the fine-tuned model from the final checkpoint
finetuned_model = AutoModelForSequenceClassification.from_pretrained("./model_checkpoints/checkpoint-1000", num_labels=6)

emotion_classifier = pipeline("text-classification", model=finetuned_model, tokenizer="bert-base-uncased")

# Test on new sentences
print(emotion_classifier("I am so happy today!"))
print(emotion_classifier("I feel terrible and sad."))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


[{'label': 'LABEL_0', 'score': 0.21854345500469208}]
[{'label': 'LABEL_5', 'score': 0.19681425392627716}]
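The pipeline returns generic names such as `LABEL_0` because the model config carries no label names. The sketch below maps them back to emotions. The label order in `ID2LABEL` is an assumption based on the SetFit/emotion dataset; only `0 -> "sadness"` is confirmed by the samples printed earlier in this notebook. `readable` is a hypothetical helper.

```python
# Assumed label order for SetFit/emotion; only 0 -> "sadness" is confirmed
# by the dataset samples shown earlier in this notebook.
ID2LABEL = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

def readable(prediction):
    """Turn a pipeline result like {'label': 'LABEL_0', ...} into a named label."""
    class_id = int(prediction["label"].split("_")[-1])
    return {"label": ID2LABEL[class_id], "score": prediction["score"]}

# Results copied from the output above.
raw = [
    {"label": "LABEL_0", "score": 0.21854345500469208},
    {"label": "LABEL_5", "score": 0.19681425392627716},
]
named = [readable(item) for item in raw]
for item in named:
    print(item)

# A uniform guess over six classes scores 1/6, so scores near 0.2 are close to chance.
chance = round(1 / 6, 3)
print("chance level:", chance)
```

Under this mapping, "I am so happy today!" is labelled sadness and both scores sit near chance. That is consistent with the warning above that `classifier.weight` and `classifier.bias` were newly initialized when the checkpoint was loaded. As an alternative to post-processing, `id2label` and `label2id` dictionaries can be passed to `from_pretrained` so that the pipeline reports names directly.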


In [35]:
!pip install nbconvert nbformat



In [41]:
!jupyter nbconvert --to notebook --ClearMetadataPreprocessor.enabled=True --inplace BERT_FineTuning.ipynb

[NbConvertApp] Converting notebook BERT_FineTuning.ipynb to notebook
[NbConvertApp] Writing 51084 bytes to BERT_FineTuning.ipynb
