# What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer encoder model for natural language processing (NLP). It is pre-trained on a large unlabeled text corpus (the original release used English Wikipedia and BooksCorpus) and can then be fine-tuned on a specific task such as sentiment analysis or question answering.

In [22]:
from transformers import BertConfig, BertModel, BertTokenizer
import torch

In [23]:
# Step 1: Create the config (this defines the architecture of BERT)
config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072
)

In [24]:
# Step 2: Build a model from that config (random weights!)
model = BertModel(config)

In [25]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [26]:
# Step 3: Use a real sentence
sentence = "Transformers are powerful models for natural language processing."

In [27]:
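# Step 4: Use the tokenizer from a pretrained model (bert-base-cased)
# Only the tokenizer is pretrained here; the model above still has random weights.
# bert-base-cased has a 28,996-token vocabulary, smaller than the config's default
# vocab_size of 30522, so every token ID it produces is a valid embedding index.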
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [28]:
# Step 5: Tokenize the input sentence
inputs = tokenizer(sentence, return_tensors="pt")

In [29]:
inputs

{'input_ids': tensor([[  101, 25267,  1132,  3110,  3584,  1111,  2379,  1846,  6165,   119,
           102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [30]:
# Step 6: Feed the tokenized input into the randomly initialized BERT model
outputs = model(**inputs)

In [31]:
outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.2250, -0.5364,  0.9528,  ..., -0.1508, -0.3438,  0.3116],
         [-0.6192, -0.9513, -0.0240,  ...,  0.1583,  1.6032, -0.1074],
         [ 0.8189,  0.3935,  0.2806,  ..., -0.5453,  0.3253, -0.2948],
         ...,
         [-1.1611, -0.0860,  0.2818,  ..., -0.9372, -0.2194, -0.5715],
         [-0.2695,  0.4406,  0.3280,  ..., -0.7545,  0.3377,  0.2955],
         [-0.8147,  0.3906, -0.1974,  ..., -2.1204,  1.3449, -0.8238]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 0.1842,  0.7091,  0.2575, -0.2647, -0.1452, -0.4880, -0.4069, -0.0588,
         -0.3209, -0.5230,  0.0651, -0.0373,  0.0529,  0.4323,  0.3095, -0.3564,
         -0.0049, -0.3928,  0.0844, -0.6109,  0.0623,  0.3082, -0.1602, -0.6105,
         -0.2318,  0.4171, -0.1559, -0.7442, -0.2274, -0.3736,  0.1775, -0.2187,
          0.1207,  0.1747, -0.7252,  0.2858,  0.3580,  0.0713, -0.2416, -0.1052,
         -0.3966, -0.1335, -0.04

In [32]:
# Step 7: Get the last hidden state
last_hidden_state = outputs.last_hidden_state

In [33]:
last_hidden_state

tensor([[[ 0.2250, -0.5364,  0.9528,  ..., -0.1508, -0.3438,  0.3116],
         [-0.6192, -0.9513, -0.0240,  ...,  0.1583,  1.6032, -0.1074],
         [ 0.8189,  0.3935,  0.2806,  ..., -0.5453,  0.3253, -0.2948],
         ...,
         [-1.1611, -0.0860,  0.2818,  ..., -0.9372, -0.2194, -0.5715],
         [-0.2695,  0.4406,  0.3280,  ..., -0.7545,  0.3377,  0.2955],
         [-0.8147,  0.3906, -0.1974,  ..., -2.1204,  1.3449, -0.8238]]],
       grad_fn=<NativeLayerNormBackward0>)

In [35]:
# Step 8: Extract the [CLS] token's embedding
cls_embedding = last_hidden_state[:, 0, :]  # shape: [1, 768]

In [36]:
# Step 9: Print the result
print("CLS token embedding shape:", cls_embedding.shape)
print("First 5 values of CLS embedding:", cls_embedding[0][:5])

CLS token embedding shape: torch.Size([1, 768])
First 5 values of CLS embedding: tensor([ 0.2250, -0.5364,  0.9528, -0.9936,  1.2111], grad_fn=<SliceBackward0>)


# Understanding BERT Model Setup with `BertConfig`

This document explains how to build and use a BERT model from scratch using Hugging Face's `transformers` library. We'll go over the key components involved in setting up the model and processing a sentence.

---

## 🔧 Components Explained

### 1. `BertConfig`

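`BertConfig` defines the architecture of the BERT model. The example sets four of its fields:

- `hidden_size`: the size of each token's hidden vector (default: 768)
- `num_hidden_layers`: the number of Transformer encoder layers (default: 12)
- `num_attention_heads`: the number of attention heads per layer (default: 12); `hidden_size` must be divisible by this value
- `intermediate_size`: the size of the feed-forward layer inside each Transformer block (default: 3072)

Fields left unset keep their defaults, such as `vocab_size=30522` and `max_position_embeddings=512`, both visible in the printed model.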

This configuration sets up the "skeleton" of your BERT model.
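
As a small illustration of how these fields relate, the sketch below rebuilds the same config and derives the per-head dimension. The variable name `head_dim` is introduced here for illustration only; it is not a `BertConfig` field.

```python
from transformers import BertConfig

# Same architecture fields as in the notebook cell
config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)

# Each attention head works on an equal slice of the hidden vector,
# so hidden_size has to divide evenly by the number of heads.
assert config.hidden_size % config.num_attention_heads == 0
head_dim = config.hidden_size // config.num_attention_heads  # illustrative name

print("per-head dimension:", head_dim)
print("vocab_size (default):", config.vocab_size)
print("max_position_embeddings (default):", config.max_position_embeddings)
```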

---

### 2. `BertModel(config)`

By passing the `BertConfig` to `BertModel`, you're creating a BERT model **from scratch**.

- No pretrained weights are used.
- All weights are randomly initialized.
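- This is useful for custom training or experimentation, but the model must be pre-trained on a large corpus, with substantial compute, before its outputs are meaningful. To start from trained weights instead, load them with `BertModel.from_pretrained(...)`.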
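
One way to see the random initialization is to build two models from the same config and compare their weights. The sketch below uses a deliberately tiny config (`tiny_config`, with hypothetical sizes chosen only so it runs quickly); the same idea should apply to the full-size config.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical, deliberately small sizes so the sketch runs fast
tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
)

# Two models from the same config: identical architecture, independent random weights
model_a = BertModel(tiny_config)
model_b = BertModel(tiny_config)

num_params = sum(p.numel() for p in model_a.parameters())

w_a = model_a.embeddings.word_embeddings.weight
w_b = model_b.embeddings.word_embeddings.weight
same_shape = w_a.shape == w_b.shape
weights_differ = not torch.equal(w_a, w_b)

print("parameters:", num_params)
print("same shape:", same_shape, "| weights differ:", weights_differ)
```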

---

### 3. `BertTokenizer`

The tokenizer converts text into the numeric inputs the model expects. It:

- splits the text into subword tokens,
- adds the special tokens `[CLS]` (start) and `[SEP]` (end),
- converts each token to its vocabulary ID (`input_ids`), and
- returns `token_type_ids` and `attention_mask` alongside the IDs.

Tokenization is always the first step before passing input to BERT.
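
To make the subword step concrete, here is a simplified pure-Python sketch of greedy longest-match-first WordPiece splitting. It is not the `BertTokenizer` implementation: the vocabulary `toy_vocab`, the helper names `wordpiece` and `encode`, and the naive punctuation handling are all assumptions made for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary: token -> ID
toy_vocab = {
    tok: i
    for i, tok in enumerate(
        ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "transform", "##ers", "are", "powerful", "."]
    )
}


def encode(text, vocab):
    """Split on whitespace, apply WordPiece, add [CLS]/[SEP], map to IDs."""
    words = text.replace(".", " .").split()  # naive punctuation handling
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("transformers are powerful.", toy_vocab)
print(tokens)  # ['[CLS]', 'transform', '##ers', 'are', 'powerful', '.', '[SEP]']
print(ids)     # [2, 4, 5, 6, 7, 8, 3]
```

The real tokenizer also normalizes text, handles casing according to the checkpoint, and uses a vocabulary of tens of thousands of entries, so its splits will differ from this toy version.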

---

### 4. Model Output

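When you pass the tokenized inputs to the model, the output includes:

- `last_hidden_state`: a tensor of hidden representations for **each token** in the input, shaped `[batch_size, sequence_length, hidden_size]` (here `[1, 11, 768]`).
- `pooler_output`: the `[CLS]` hidden state passed through an extra linear layer and `tanh`, shaped `[batch_size, hidden_size]`.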
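
The shapes can be checked without downloading anything. The sketch below assumes a tiny hypothetical config and random token IDs in place of real tokenizer output; `model.eval()` and `torch.no_grad()` are added so that dropout is off and no gradients are tracked.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical, deliberately small sizes so the sketch runs fast
tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
)
model = BertModel(tiny_config)
model.eval()  # disable dropout for inference

# Random IDs standing in for tokenizer output
batch_size, seq_len = 2, 11
input_ids = torch.randint(0, tiny_config.vocab_size, (batch_size, seq_len))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

# [batch_size, sequence_length, hidden_size]
print(outputs.last_hidden_state.shape)
# [batch_size, hidden_size]
print(outputs.pooler_output.shape)
```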

---

### 5. `[CLS]` Token Output (`cls_embedding`)

The tokenizer adds the `[CLS]` token at the beginning of every input sequence. Once the model has been trained or fine-tuned, the output embedding at that position (`last_hidden_state[:, 0, :]`) is commonly used as a summary representation of the whole input for **sentence-level tasks** such as classification and sentiment analysis.
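
A minimal sketch of how a `[CLS]` embedding might feed a classifier is shown below. It uses a random tensor as a stand-in for `outputs.last_hidden_state`, and the linear head, `num_labels = 2`, and the batch size are hypothetical choices rather than anything defined in the notebook.

```python
import torch
from torch import nn

hidden_size = 768
num_labels = 2  # hypothetical, e.g. positive / negative

# Stand-in for outputs.last_hidden_state: [batch_size, sequence_length, hidden_size]
last_hidden_state = torch.randn(4, 11, hidden_size)

# Position 0 of every sequence is the [CLS] token
cls_embedding = last_hidden_state[:, 0, :]  # [batch_size, hidden_size]

# An untrained linear head mapping the [CLS] vector to class scores
classifier = nn.Linear(hidden_size, num_labels)
logits = classifier(cls_embedding)   # [batch_size, num_labels]
predicted = logits.argmax(dim=-1)    # [batch_size]

print(cls_embedding.shape, logits.shape, predicted.shape)
```

In a real setup the head and the encoder would be trained together on labeled data; with untrained weights the predictions carry no meaning.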

---

## ✅ Summary

| Component       | Purpose                                                              |
|----------------|----------------------------------------------------------------------|
| `BertConfig`    | Defines the architecture of the BERT model                           |
| `BertModel`     | Builds the model with random weights based on the config             |
| `BertTokenizer` | Converts real sentences into model-readable token IDs                |
| Output          | Returns hidden states for each token                                 |
| `[CLS]` Output  | Used as a sentence representation, helpful for classification tasks  |

---

> 📌 **Note:** Because this model is not pretrained, its embeddings reflect only the random weight initialization and carry no learned meaning until the model is trained.
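
The note can be illustrated by fixing the random seed before building the model: the same seed should reproduce the same `[CLS]` embedding, while a different seed gives a different one. The helper `cls_embedding_for_seed`, the tiny config, and the token IDs below are all hypothetical and exist only for this sketch.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical, deliberately small sizes so the sketch runs fast
tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
)

# Arbitrary token IDs standing in for a tokenized sentence
input_ids = torch.tensor([[2, 5, 6, 7, 3]])


def cls_embedding_for_seed(seed):
    """Build a randomly initialized model under a fixed seed and return its [CLS] vector."""
    torch.manual_seed(seed)
    model = BertModel(tiny_config)
    model.eval()  # turn off dropout so differences come only from the weights
    with torch.no_grad():
        return model(input_ids=input_ids).last_hidden_state[:, 0, :]


emb_seed0 = cls_embedding_for_seed(0)
emb_seed0_again = cls_embedding_for_seed(0)
emb_seed1 = cls_embedding_for_seed(1)

print("same seed, same embedding:", torch.allclose(emb_seed0, emb_seed0_again))
print("different seed, same embedding:", torch.allclose(emb_seed0, emb_seed1))
```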
