# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

**Authors**: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova  
**Published**: NAACL 2019  
**By**: Google AI Language

---

## Overview of BERT

**BERT** stands for **Bidirectional Encoder Representations from Transformers**.

It introduced a method for **pre-training deep bidirectional representations** from unlabeled text, in which every layer conditions on **both left and right context**.

### Key Innovations:
- Deep **bidirectional** context understanding.
- **Pre-training + Fine-tuning** approach.
- State-of-the-art results on **11 NLP tasks** at the time of publication, using the **same pre-trained architecture** with only a small task-specific output layer.

---

## Why Was BERT Revolutionary?

**Before BERT**:
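- Pre-trained language models were **unidirectional** (OpenAI GPT reads left to right) or only shallowly bidirectional (ELMo concatenates independently trained left-to-right and right-to-left LSTMs).
- A token's representation could not condition on both its left and right context in every layer.
- Feature-based approaches such as ELMo fed pre-trained representations into **task-specific architectures** for downstream tasks.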

**With BERT**:
- **Bidirectional Transformers** to capture context from both sides.
- Unified architecture for **multiple tasks**.
- Fine-tuning with minimal changes.

---

## Architecture Details

BERT is based on the **Transformer encoder** from "Attention Is All You Need" (Vaswani et al., 2017).

| Component | Details |
|----------|---------|
| Model | Transformer Encoder only |
| Layers | 12 (Base), 24 (Large) |
| Hidden Size | 768 (Base), 1024 (Large) |
| Self-Attention Heads | 12 (Base), 16 (Large) |
| Parameters | 110M (Base), 340M (Large) |
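
As a rough sanity check on the parameter counts in the table, the sketch below tallies the weights of a BERT-style encoder from its layer count and hidden size. It rests on several assumptions that are not stated in the table: the 30,522-entry vocabulary of the released uncased checkpoints, 512 learned positions, a feed-forward size of 4 × hidden, and a pooling layer over `[CLS]`. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512,
                     segment_types=2, include_pooler=True):
    """Count trainable parameters of a BERT-style encoder (illustrative sketch)."""
    ffn = 4 * hidden  # assumed feed-forward (intermediate) size

    # Embeddings: token + position + segment tables, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + segment_types) * hidden
    embeddings += 2 * hidden  # LayerNorm scale and bias

    # Self-attention: Q, K, V and output projections (weights + biases) + LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden

    # Feed-forward: hidden -> ffn -> hidden (weights + biases) + LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + 2 * hidden

    total = embeddings + layers * (attention + feed_forward)
    if include_pooler:
        total += hidden * hidden + hidden  # dense layer applied to [CLS]
    return total


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the count because the heads split the hidden dimension rather than adding to it. Under these assumptions the totals land close to the rounded 110M and 340M figures in the table.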

---

## Pre-training Tasks

### 1. Masked Language Modeling (MLM)
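- Randomly selects 15% of the input tokens for prediction.
- Of the selected tokens, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged, which reduces the mismatch with fine-tuning, where `[MASK]` never appears.
- The model predicts the original token at each selected position using the surrounding context on both sides.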

**Example**:  
Input: "I like to [MASK] ice cream."  
Target: Predict "[MASK]" as "eat"

**Goal**: Learn deep **bidirectional context**.
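
The masking step can be sketched in plain Python. This is a simplified illustration rather than the original preprocessing code: it selects each token independently with probability 15% and then applies the 80% / 10% / 10% replacement rule described in the BERT paper (`[MASK]`, random token, unchanged). The function name `mask_tokens`, the `SPECIAL_TOKENS` set, and the toy vocabulary are assumptions made for the example.

```python
import random

# Assumed set of special tokens that should never be masked.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] is the original token at positions chosen for prediction
    and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)

    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


tokens = ["[CLS]", "i", "like", "to", "eat", "ice", "cream", ".", "[SEP]"]
vocab = ["eat", "ice", "cream", "like", "run", "dog"]  # toy vocabulary
masked, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(masked)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, so the model cannot tell from the input alone which unmasked tokens it will be asked to predict.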

### 2. Next Sentence Prediction (NSP)
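- Predicts whether sentence B is the **actual next sentence** of sentence A.
- Training pairs are balanced: 50% of the time B really follows A (`IsNext`), and 50% of the time B is a random sentence from the corpus (`NotNext`).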

**Example**:  
Sentence A: "He went to the store."  
Sentence B: "He bought a gallon of milk." → IsNext  
Random sentence B: "Penguins are flightless birds." → NotNext

**Goal**: Understand **sentence relationships**.
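
A minimal sketch of how such training pairs could be generated, assuming the corpus is a list of documents and each document is a list of sentences. The function name `make_nsp_examples` and the toy corpus are hypothetical, and drawing the negative sentence from a *different* document is a choice of this sketch so that a negative can never be the true next sentence.

```python
import random


def make_nsp_examples(documents, rng=None):
    """Build (sentence_a, sentence_b, label) triples with a 50/50 label split."""
    if rng is None:
        rng = random.Random()
    usable = [doc for doc in documents if doc]
    if len(usable) < 2:
        raise ValueError("need at least two non-empty documents")

    examples = []
    for doc_index, doc in enumerate(usable):
        others = usable[:doc_index] + usable[doc_index + 1:]
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive pair: B is the sentence that actually follows A.
                examples.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative pair: B is a random sentence from another document.
                random_doc = rng.choice(others)
                examples.append((doc[i], rng.choice(random_doc), "NotNext"))
    return examples


documents = [  # toy corpus for illustration
    ["He went to the store.", "He bought a gallon of milk.", "Then he walked home."],
    ["Penguins are flightless birds.", "They live mostly in the Southern Hemisphere."],
]
for example in make_nsp_examples(documents, rng=random.Random(0)):
    print(example)
```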

---

## Input Representation

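To support both single-sentence and sentence-pair tasks, BERT packs the input into one token sequence. Each token's input representation is the sum of its token, segment, and position embeddings: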

| Component | Explanation |
|----------|-------------|
| [CLS] Token | First token of every input sequence. Its final hidden state serves as the aggregate sequence representation for classification. |
| [SEP] Token | Separates sentence A from sentence B and marks the end of the sequence. |
| Token Embeddings | Embedding of each WordPiece token |
| Segment Embeddings | Learned embedding marking whether a token belongs to sentence A or B |
| Position Embeddings | Learned embedding of each token's position in the sequence |
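
The packing described above can be sketched as follows. The function assumes the input is already split into WordPiece tokens and only assembles the token list together with the segment and position ids that index the corresponding embedding tables. The name `build_bert_input` is hypothetical, and the 512-token limit reflects the maximum sequence length used in pre-training.

```python
def build_bert_input(tokens_a, tokens_b=None, max_len=512):
    """Assemble tokens, segment ids and position ids for one BERT input."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # [CLS], sentence A and its [SEP] are segment 0

    if tokens_b is not None:
        tokens_b = list(tokens_b)
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]

    if len(tokens) > max_len:
        raise ValueError(f"sequence of {len(tokens)} tokens exceeds {max_len}")

    position_ids = list(range(len(tokens)))  # 0, 1, 2, ... one per token
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_input(
    ["my", "dog", "is", "cute"],
    ["he", "likes", "play", "##ing"],
)
for row in zip(position_ids, segment_ids, tokens):
    print(row)
```

In the model, each of the three id sequences is looked up in its own embedding table and the results are added element-wise to form the encoder input.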

---

## Fine-tuning on Downstream Tasks

After pre-training, BERT is fine-tuned by adding a task-specific output layer and updating all parameters end to end on labeled task data.
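
For sequence classification, the added output layer can be as small as one linear map followed by a softmax over the final `[CLS]` hidden state. The sketch below shows that computation on plain Python lists with toy numbers. The function name `classify_from_cls`, the bias term, and the example weights are assumptions for illustration; a real model would use a tensor library and learn the weights during fine-tuning.

```python
import math


def classify_from_cls(cls_vector, weights, bias):
    """Softmax over label logits computed from the final [CLS] hidden state.

    cls_vector: H floats; weights: K rows of H floats; bias: K floats.
    """
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with hidden size H = 4 and K = 2 labels.
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [[1.0, 0.0, 0.0, 0.0],   # label 0 ("negative", hypothetical)
           [0.0, 0.0, 0.0, 1.0]]   # label 1 ("positive", hypothetical)
bias = [0.0, 0.0]
probs = classify_from_cls(cls_vector, weights, bias)
print(probs)
```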

### Example Tasks:

| Task | Input Format | Output |
|------|--------------|--------|
| Sentiment Classification | [CLS] + sentence + [SEP] | Classification label from the [CLS] token |
| Question Answering | [CLS] + question + [SEP] + context + [SEP] | Start and end token positions of the answer span |
| Named Entity Recognition | [CLS] + tokens + [SEP] | Label per token |

Fine-tuning takes only a few epochs (the paper searches over 2 to 4) and requires minimal architectural changes.

---

## Results

BERT achieved **state-of-the-art** results on multiple NLP benchmarks at the time of publication:

- GLUE (General Language Understanding Evaluation): score of 80.5
- SQuAD v1.1 and v2.0 (Question Answering): Test F1 of 93.2 and 83.1, respectively
- SWAG (Common Sense Inference)

---

## Key Innovations Summary

| Innovation | Benefit |
|-----------|---------|
| Masked Language Model | Allows deep bidirectional understanding |
| Next Sentence Prediction | Captures inter-sentence relationships |
| Single architecture | Used across tasks without major changes |
| Transfer Learning for NLP | Pre-train once, fine-tune for many tasks |

---

## BERT Summary Table

| Aspect | BERT |
|--------|------|
| Full Name | Bidirectional Encoder Representations from Transformers |
| Architecture | Transformer Encoder |
| Directionality | Deeply Bidirectional |
| Pre-training Tasks | MLM and NSP |
| Fine-tuning | Task-specific head on top |
| Model Sizes | BERT-Base (110M), BERT-Large (340M) |
| Training Data | BooksCorpus (800M words) + English Wikipedia (2,500M words) |
| Tokenizer | WordPiece with a 30,000-token vocabulary |

---

## Limitations of BERT

- **Expensive to pre-train** (compute-heavy).
- The value of NSP was later questioned: RoBERTa removed it and reported matching or slightly better downstream performance.
- Not ideal for real-time or low-latency applications due to size.

---

## Follow-up Models

| Model | Improvement |
|-------|-------------|
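| RoBERTa | Removed NSP; trained longer, with larger batches, dynamic masking, and more data |
| DistilBERT | Knowledge-distilled BERT: about 40% smaller and 60% faster while retaining about 97% of its language-understanding performance |
| ALBERT | Cross-layer parameter sharing and factorized embeddings to reduce parameter count and memory |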
| BioBERT, ClinicalBERT | Domain-specific variants for biomedical/clinical texts |

---

## Final Thoughts

BERT changed the landscape of NLP by:
- Introducing a unified **pre-training + fine-tuning** framework.
- Achieving state-of-the-art results on a **variety of NLP tasks** at the time of publication.
- Inspiring many **improvements and new models** in the field.

It is a **foundation model** that paved the way for future large-scale pre-trained models in NLP.
