## **Inductive Bias**

**Inductive bias** in deep learning refers to the set of assumptions that a model makes to **generalize** from the training data to unseen data.

Since generalizing beyond the training data is impossible without *some* prior assumptions (a consequence of the **No Free Lunch Theorem**), every learning algorithm has an inductive bias, whether explicit or implicit.

---

### Formal Definition

> Inductive bias is the **preference or assumptions** a learning algorithm uses to predict outputs for **inputs it hasn't seen** before.

In deep learning, the model architecture, training setup, and optimization procedure all contribute to the inductive bias.

---

### Examples of Inductive Biases in Deep Learning

| Bias Type                  | Description                                             | Example                                                                       |
| -------------------------- | ------------------------------------------------------- | ----------------------------------------------------------------------------- |
| **Locality**               | Nearby pixels or values are more likely to be related.  | CNNs assume local features matter (e.g. edges).                               |
| **Translation Invariance** | A pattern is the same no matter where it appears.       | CNN weight sharing gives translation equivariance; pooling adds approximate invariance. |
| **Compositionality**       | Complex structures are made from simpler parts.         | Stacked Transformer layers compose token representations into larger units (words → sentences). |
| **Smoothness**             | Small changes in input lead to small changes in output. | Regularization (e.g. weight decay) and the implicit bias of SGD tend to favor smoother functions. |
| **Hierarchy**              | Data can be represented at multiple levels.             | Deep networks learn hierarchical representations (low-level edges → objects). |

---

### Why It Matters

* **Without inductive bias**, the model can fit the training data perfectly but fail to generalize.
* **With the right bias**, the model can learn meaningful patterns, even from limited data.
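
The contrast between the two bullets above can be sketched with a toy regression. This is a minimal illustration, assuming a target function `y = 2x + 1` and two hypothetical learners: a lookup table that makes no assumption about unseen inputs, and a least-squares line that assumes linearity. Both fit the training set exactly; only the one whose bias matches the data generalizes.

```python
# Training data sampled from the assumed target y = 2x + 1.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def memorizer(x, default=0.0):
    """Lookup table: exact on seen inputs, arbitrary default on unseen ones."""
    return dict(train).get(x, default)

def fit_line(points):
    """Least-squares line: encodes the assumption that y is linear in x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov / var
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train)

def linear_model(x):
    return slope * x + intercept

# Both learners have zero training error...
train_err_memorizer = sum(abs(memorizer(x) - y) for x, y in train)
train_err_linear = sum(abs(linear_model(x) - y) for x, y in train)

# ...but they disagree on an input neither has seen.
unseen_x = 10.0
print(train_err_memorizer, train_err_linear)
print(memorizer(unseen_x), linear_model(unseen_x))
```

The training data alone cannot tell these two hypotheses apart; the choice between them is the inductive bias.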

---

### Inductive Bias Sources in Deep Learning

| Source             | Examples                                                                |
| ------------------ | ----------------------------------------------------------------------- |
| **Architecture**   | CNNs (spatial), RNNs (temporal), Transformers (attention over sequence) |
| **Loss Function**  | Cross-entropy, MSE, contrastive loss                                    |
| **Optimization**   | SGD tends to find flat minima (implicit bias)                           |
| **Regularization** | Dropout, weight decay, data augmentation                                |
| **Initialization** | Influences which minima are found                                       |
| **Training Data**  | Augmentations and preprocessing encode assumptions                      |
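
As one concrete instance of the augmentation rows in the table, a horizontal flip encodes the assumption that the label does not change when the image is mirrored. The sketch below is illustrative only: the helper names `horizontal_flip` and `augment` and the tiny 2×3 "image" are assumptions, not part of any particular library.

```python
def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]

def augment(dataset):
    """Add a mirrored copy of every image, keeping its label unchanged.

    Keeping the label is the assumption: it tells the model that
    left/right orientation is irrelevant to the class.
    """
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((horizontal_flip(image), label))
    return augmented

toy_image = [[1, 2, 3],
             [4, 5, 6]]
dataset = [(toy_image, "bicycle")]
augmented = augment(dataset)
print(augmented[1])
```

The same augmentation would be a harmful bias for a task where orientation matters, such as reading text or telling left from right turn arrows.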

---





### MLPs: The “Minimal Bias” Baseline

Multilayer Perceptrons have virtually **no built-in vision-related inductive bias**. They flatten the input (e.g. raw images) into a vector, so every pixel connects to every hidden unit through its own weight. Spatial neighborhoods get no special treatment, and there is no weight sharing or locality: any spatial structure has to be learned from the data.
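
One way to see what the missing weight sharing costs is to count parameters. The sketch below compares a first dense layer on a flattened image with a first 3×3 convolutional layer. The sizes (a 32×32 RGB input, 64 hidden units or filters) are assumptions chosen for illustration.

```python
def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution; independent of image size."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

height, width, channels = 32, 32, 3   # assumed input size
hidden = 64                           # assumed number of units / filters

mlp_first_layer = dense_params(height * width * channels, hidden)
cnn_first_layer = conv2d_params(channels, hidden, kernel_size=3)

print(mlp_first_layer)   # 196672
print(cnn_first_layer)   # 1792
```

Under these assumptions the dense layer has over 100 times more parameters, and its count grows with the image size, while the convolution's count does not.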

---

### CNNs: Strong Vision-Centric Bias

CNNs embed strong built-in assumptions ideal for vision tasks:

* **Locality:** Filters operate on small regions (receptive fields) of the input.
* **Weight sharing:** The same filters are applied across spatial locations.
* **Translation equivariance:** Shifting the input image leads to a similarly shifted response in feature maps.
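
Translation equivariance can be checked directly on a toy example. The sketch below uses a 1D signal and circular (wrap-around) padding so the property holds exactly; real CNNs are 2D and usually zero-padded, so they satisfy it only approximately near borders. The signal and kernel values are arbitrary assumptions.

```python
def circular_conv1d(signal, kernel):
    """Slide one shared kernel over every position (cross-correlation, wrap-around)."""
    n = len(signal)
    return [sum(w * signal[(i + j) % n] for j, w in enumerate(kernel))
            for i in range(n)]

def shift(seq, steps):
    """Circularly shift a sequence to the right by `steps` positions."""
    n = len(seq)
    return [seq[(i - steps) % n] for i in range(n)]

signal = [0, 0, 1, 2, 1, 0, 0, 0]   # a small "bump" pattern
kernel = [1, 0, -1]                 # a simple edge-like filter

response = circular_conv1d(signal, kernel)
shifted_response = circular_conv1d(shift(signal, 3), kernel)

# Equivariance: shifting the input shifts the feature map by the same amount.
equivariant = shifted_response == shift(response, 3)

# Invariance appears only after pooling over positions (here: global max).
invariant_after_pooling = max(response) == max(shifted_response)

print(equivariant, invariant_after_pooling)
```

The feature maps themselves differ between the two inputs; only their pooled summary is identical, which is why equivariance and invariance are distinct properties.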


---

### Vision Transformers (ViTs): A “Patch-Level” Bias

ViTs offer a different kind of inductive bias:

* **Weakened vision bias:** ViTs lack built-in translation equivariance, but they do embed structure through **patching** and **parameter sharing** across patches.
* **Emergent robustness and shape bias:** Some empirical studies report that Vision Transformers are often **more robust to corruption** (e.g., image noise, blur) and **exhibit higher shape bias** than classic CNNs, even with **fewer parameters**.

---

### Comparative Summary

| Architecture | Inductive Bias Style                         | Vision Strengths                                   | Limitations                                    |
| ------------ | -------------------------------------------- | -------------------------------------------------- | ---------------------------------------------- |
| **MLP**      | Minimal, no structure                        | Can reach high performance at scale                | Data-hungry; little built-in bias for images   |
| **CNN**      | Strong built-in bias (locality, translation) | Sample-efficient; structured inductive assumptions | Texture bias; can be less robust to corruption |
| **ViT**      | Patch-based; flexible attention              | More robust to corruptions; shape-aware            | Requires more data; less inherently structured |

---

### Why These Biases Matter in Practice

* **CNNs** excel when training data is limited and the task has spatial structure, because the built-in bias helps generalization.
* **ViTs** shine when large datasets and compute are available and the task rewards robustness or flexibility.
* **MLPs**, while the simplest, can approximate complex behaviors, but only with enough scale and data augmentation.





## Example: Detecting a Bicycle (Wheel + Handlebar far apart)

Imagine a **32×32 image**, divided into a **4×4 grid of patches** (16 patches, each 8×8 pixels).
We want to detect that the **wheel** (bottom left) and the **handlebar** (top right) belong to the same object.
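
The patch layout can be made concrete with a small sketch. It assumes row-major patch numbering (patch 0 at the top left, patch 15 at the bottom right), which is a common convention but not stated in the setup above; `patchify` and `patch_index` are hypothetical helper names.

```python
IMAGE_SIZE = 32
PATCH_SIZE = 8
GRID = IMAGE_SIZE // PATCH_SIZE   # 4 patches per side

def patch_index(row, col):
    """Row-major index of the patch containing pixel (row, col)."""
    return (row // PATCH_SIZE) * GRID + (col // PATCH_SIZE)

def patchify(image):
    """Split a 32x32 image (list of rows) into 16 patches of 8x8 pixels."""
    patches = []
    for pr in range(GRID):
        for pc in range(GRID):
            rows = image[pr * PATCH_SIZE:(pr + 1) * PATCH_SIZE]
            patches.append([r[pc * PATCH_SIZE:(pc + 1) * PATCH_SIZE] for r in rows])
    return patches

# Dummy image whose pixel value encodes its own position.
image = [[r * IMAGE_SIZE + c for c in range(IMAGE_SIZE)] for r in range(IMAGE_SIZE)]
patches = patchify(image)

wheel_patch = patch_index(31, 0)       # bottom-left corner pixel
handlebar_patch = patch_index(0, 31)   # top-right corner pixel
print(len(patches), wheel_patch, handlebar_patch)   # 16 12 3
```

Under this numbering the wheel sits in patch 12 and the handlebar in patch 3, at opposite corners of the grid.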

---

### **CNN**

* A **3×3 convolution kernel** looks at a small local patch.
* To connect wheel (bottom-left corner) → handlebar (top-right corner):

  * The information must pass through **many layers** of convolutions + pooling.
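  * Each layer increases the receptive field slowly:

    * 1st layer: sees 3×3 pixels
    * 2nd layer: 5×5 pixels with stride 1 (larger if pooling or striding sits in between)
    * With stride-1 3×3 convolutions alone, it takes about 16 layers to span 32 pixels; with about four convolutions interleaved with 2×2 pooling, the receptive field already covers the whole image.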

👉 Problem: The CNN only *learns the relation between wheel & handlebar indirectly* through deep stacking. It’s biased toward local patterns (edges, corners, textures).
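
How fast the receptive field grows follows from a standard recurrence: each layer adds `(kernel - 1) * jump` pixels, where `jump` is the product of all earlier strides. The sketch below applies it to two hypothetical stacks, a plain one of stride-1 3×3 convolutions and one that alternates 3×3 convolutions with 2×2 stride-2 pooling. Both layer configurations are assumptions for illustration.

```python
def receptive_field(layers):
    """Receptive field size after each layer.

    layers: list of (kernel_size, stride) tuples, input side first.
    """
    rf, jump = 1, 1
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # new pixels seen by this layer
        jump *= stride              # spacing between adjacent outputs, in input pixels
        sizes.append(rf)
    return sizes

def layers_to_cover(layers, image_size):
    """Number of layers needed before the receptive field spans the image."""
    for depth, rf in enumerate(receptive_field(layers), start=1):
        if rf >= image_size:
            return depth
    return None

plain = [(3, 1)] * 20                     # 3x3 convs, stride 1
conv_pool = [(3, 1), (2, 2)] * 4          # conv, pool, conv, pool, ...

print(receptive_field(plain)[:3])         # [3, 5, 7]
print(receptive_field(conv_pool)[:7])     # [3, 4, 8, 10, 18, 22, 38]
print(layers_to_cover(plain, 32))         # 16
print(layers_to_cover(conv_pool, 32))     # 7 (4 convs + 3 pools)
```

Either way, the wheel and handlebar corners can only interact once the stack is deep enough for a single unit to see both.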

---

### **ViT**

* Split the image into **16 patches** (4×4).
* The first **self-attention layer** compares *every patch with every other patch*.
* That means **wheel patch** can directly attend to the **handlebar patch**, even though they are far apart.
* Similarity score example (illustrative numbers, not measured):

  * Similarity between wheel patch and handlebar patch = **0.82** (high)
  * Similarity between wheel patch and background patch = **0.05** (low)

👉 Result: ViT **can represent a global relationship in its very first layer**: "wheel + handlebar = likely bicycle". The attention weights still have to be learned from data, but there is no need to wait for many layers.
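
The all-pairs comparison can be sketched as single-head scaled dot-product attention for one query. Everything numeric here is a hand-picked assumption: the 2-dimensional query and key vectors are chosen so that the wheel patch matches the handlebar patch, and the patch indices assume row-major numbering of the 4×4 grid. A trained ViT would learn such vectors through its projection matrices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

NUM_PATCHES = 16
WHEEL, HANDLEBAR = 12, 3   # bottom-left and top-right patches (assumed row-major)

# Toy keys: every patch looks like "background" except the handlebar.
keys = [[0.0, 1.0] for _ in range(NUM_PATCHES)]
keys[HANDLEBAR] = [3.0, 0.0]
wheel_query = [2.0, 0.0]   # toy query emitted by the wheel patch

weights = attention_weights(wheel_query, keys)
print(round(weights[HANDLEBAR], 2), round(weights[0], 2))
```

The weights form a distribution over all 16 patches, so a large weight on the handlebar necessarily leaves little for each background patch, regardless of how far apart the patches are in the image.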

---

### Numerical Contrast

* CNN: a notional "relation strength" between wheel and handlebar (illustrative values, not measurements) grows only after multiple layers, e.g.

  * Layer 1 relation strength (wheel → handlebar): \~0.0
  * Layer 3: \~0.2
  * Layer 6: \~0.7
* ViT (same illustrative scale):

  * Layer 1 relation strength (wheel → handlebar): \~0.82 already.

---

✅ **Takeaway:**
CNNs start local → slowly become global.
ViTs are **global from the start**, which can make them better at modeling objects whose parts are **spatially distant**, provided there is enough data to learn those relationships.


[Example](https://x.com/zhaisf/status/1956689364408549522)