## Function Approximation

### Data Generation
```python
import torch

# Generate noisy samples of sin(x) on [-20, 20]
n_samples = 1000
x = torch.linspace(-20, 20, n_samples).reshape(-1, 1)
y = torch.sin(x) + 0.05 * torch.randn(n_samples, 1)  # Gaussian noise with std 0.05

# Convert to NumPy for plotting
x_np = x.numpy()
y_np = y.numpy()
```


### Normalization Step

Why do we normalize?

Neural networks learn more efficiently when the input features are in a
consistent, small range (e.g. `[-1, 1]` or `[0, 1]`). Large raw values can
cause unstable gradients and slow convergence: saturating activations such as
`tanh` and `sigmoid` flatten out for large inputs, and with `ReLU` large
inputs simply produce large activations.

In this case:
  - `x` originally ranges over `[-20, 20]`.
  - We scale it down by dividing by `20`, so the new range is `[-1, 1]`.
  - `y` is already bounded (`sin(x) ∈ [-1, 1]`) and only has small added noise,
    so we can safely leave `y` as-is.

This makes the optimization problem numerically stable and easier to solve.

```python
x_normalized = x / 20.0  # Scale x to [-1, 1]
y_normalized = y         # y is already in a good range

# Create the dataset from the normalized data
function_dataset = FunctionDataset(x_normalized, y_normalized)
```
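`FunctionDataset` is not defined in this section. A minimal sketch of what such a dataset could look like, assuming it simply pairs each input with its target. The class body, the batch size of 32, and the `DataLoader` settings are illustrative assumptions, not the original implementation:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class FunctionDataset(Dataset):
    """Hypothetical stand-in: stores (x, y) pairs as tensors."""

    def __init__(self, x, y):
        assert len(x) == len(y), "x and y must have the same number of samples"
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


# Same data recipe as in the section, with x scaled to [-1, 1]
x = torch.linspace(-20, 20, 1000).reshape(-1, 1)
y = torch.sin(x) + 0.05 * torch.randn(1000, 1)
function_dataset = FunctionDataset(x / 20.0, y)

# Batch size and shuffling are assumed values
loader = DataLoader(function_dataset, batch_size=32, shuffle=True)
x_batch, y_batch = next(iter(loader))
print(x_batch.shape, y_batch.shape)
```

Each batch should come out with shape `(32, 1)` for both inputs and targets, which is the layout an `nn.Linear(1, ...)` first layer expects.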

---

### Why Normalization Is Needed

* **Stability**: Large input values (e.g., -20 to 20) produce large pre-activations in linear layers (`wx + b`), which leads to large gradient updates and unstable training.
* **Gradient flow**: With normalized inputs, activations stay in their "useful" range (e.g., tanh not saturated at ±1).
* **Generalization**: Models trained on normalized data often generalize better in practice, largely because the optimizer doesn’t have to fight scale differences and reaches a better fit.
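The gradient-flow point can be checked by hand. A small sketch, assuming a single `tanh` unit with weight 1 and no bias (an illustrative setup, not the model from this section): the derivative of `tanh(x)` is `1 - tanh(x)^2`, so it can be evaluated at a raw input and at the same input after scaling.

```python
import math


def tanh_grad(x):
    """Derivative of tanh at x: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


raw_input = 20.0                 # largest raw x in the dataset
scaled_input = raw_input / 20.0  # the same point after dividing by 20

grad_raw = tanh_grad(raw_input)
grad_scaled = tanh_grad(scaled_input)

print(f"gradient at x = {raw_input}: {grad_raw:.3e}")
print(f"gradient at x = {scaled_input}: {grad_scaled:.3f}")
```

At the raw input the unit is saturated and passes back essentially no gradient. After scaling, the gradient is about 0.42.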

---

###  Comparison with Image Input

In **image tasks**, we almost always normalize inputs as well, but the method differs slightly:

* **Raw pixel range**: Images are typically in `[0, 255]`.

* **Normalization**:

  1. First scaled to `[0, 1]` by dividing by 255.
  2. Then standardized using mean and std (per channel):

     ```python
     transforms.Normalize(mean=[0.485, 0.456, 0.406],
                          std=[0.229, 0.224, 0.225])
     ```

     These values are the per-channel means and standard deviations of ImageNet images scaled to `[0, 1]`. Applying them centers each channel around 0 with roughly unit variance.

* **Comparison to our function input**:

  * The `x` normalization here is a simple **min-max scaling** to `[-1, 1]`. Because the range is symmetric, it reduces to dividing by the maximum absolute value, 20.
  * Image preprocessing is usually **standardization** (subtract mean, divide by std), because RGB channels have different distributions and need per-channel correction.

---

**Summary**:

* For **1D function regression** → simple scaling to `[-1, 1]` is sufficient.
* For **images** → standardization is more robust, because each channel has different brightness/contrast distributions.
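The two schemes side by side, as a rough sketch. The image batch is random data standing in for real images, and the per-channel statistics are computed from that batch rather than taken from ImageNet. Both are assumptions made to keep the example self-contained.

```python
import torch

torch.manual_seed(0)

# 1D function input: min-max scaling of [-20, 20] to [-1, 1]
x = torch.linspace(-20, 20, 1000).reshape(-1, 1)
x_min, x_max = x.min(), x.max()
x_scaled = 2 * (x - x_min) / (x_max - x_min) - 1  # equals x / 20 for this symmetric range

# Images: scale to [0, 1], then standardize each channel
images = torch.randint(0, 256, (8, 3, 16, 16)).float()  # fake batch, shape (N, C, H, W)
images = images / 255.0
mean = images.mean(dim=(0, 2, 3), keepdim=True)  # one mean per channel
std = images.std(dim=(0, 2, 3), keepdim=True)    # one std per channel
images_standardized = (images - mean) / std

channel_means = images_standardized.mean(dim=(0, 2, 3))
channel_stds = images_standardized.std(dim=(0, 2, 3))
print(x_scaled.min().item(), x_scaled.max().item())
print(channel_means)
print(channel_stds)
```

After standardization every channel has a mean close to 0 and a standard deviation close to 1, while the min-max scaled `x` only has its range fixed, not its mean or variance.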

---

### Xavier/Glorot initialization


```python
import torch.nn as nn

# Better initialization (optional, but recommended).
# Assumes `model` is an nn.Module that was created earlier.
for module in model.modules():
    if isinstance(module, nn.Linear):
        # Xavier/Glorot initialization:
        # - Fills the weight matrix with values drawn from a uniform distribution
        #   with bounds based on the number of input and output units.
        # - Keeps the variance of activations roughly the same across layers.
        # - Helps prevent vanishing/exploding activations and gradients.
        nn.init.xavier_uniform_(module.weight)

        # Bias is set to 0:
        # - The random weights already break symmetry, so the bias can start at zero.
        # - Keeping it zero lets the network learn the required offset naturally.
        nn.init.zeros_(module.bias)
```

---



* **Default initialization in PyTorch**
  By default, PyTorch already initializes `nn.Linear` weights from a uniform distribution whose bounds depend on the layer's `fan_in`. This works fine, but it’s *generic*.

* **Xavier (Glorot) initialization**
  Designed for **symmetric activations such as tanh and sigmoid**, it sets the weights so that:

  $$
  \text{Var}(z_{in}) \approx \text{Var}(z_{out})
  $$

  That means signals neither blow up (explode) nor shrink to almost zero (vanish) as they pass through layers.

  For a Linear layer with `fan_in` inputs and `fan_out` outputs, weights are sampled from:

  $$
  U\Big(-\sqrt{\frac{6}{fan_{in} + fan_{out}}}, \; \sqrt{\frac{6}{fan_{in} + fan_{out}}}\Big)
  $$

* **Bias = 0**
  Bias terms don’t suffer from the same scaling problem as weights, so initializing them to 0 is common and safe. The network quickly learns non-zero bias values if needed.
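The bound in the formula above can be compared against what `nn.init.xavier_uniform_` actually produces. A minimal sketch, with layer sizes 64 and 32 chosen only because they give a round bound of 0.25:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(64, 32)  # fan_in = 64, fan_out = 32 (illustrative sizes)
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

fan_in, fan_out = layer.in_features, layer.out_features
bound = math.sqrt(6.0 / (fan_in + fan_out))  # sqrt(6 / 96) = 0.25
expected_var = bound ** 2 / 3                # variance of U(-b, b) is b^2 / 3

max_abs_weight = layer.weight.abs().max().item()
empirical_var = layer.weight.var().item()

print(f"bound            = {bound}")
print(f"max |w|          = {max_abs_weight:.4f}")
print(f"empirical Var(w) = {empirical_var:.5f}")
print(f"expected Var(w)  = {expected_var:.5f}")
```

The expected variance `b^2 / 3` simplifies to `2 / (fan_in + fan_out)`, which is the quantity Xavier initialization is built around.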

---

###  Effect in our case

Since your network is fairly deep (4 linear layers with ReLU and Dropout), **better initialization**:

* Speeds up convergence.
* Reduces the risk of flat/unstable training curves.
* Makes it easier for the optimizer (Adam here) to find a good minimum.

---

**Summary**:
The Xavier initialization block improves training stability because **Xavier initialization helps keep the variance stable across layers**, reducing the risk of vanishing/exploding activations, while **zero bias** is a safe default.

---



###  Xavier (Glorot) vs. He (Kaiming) Initialization

#### 1. **Xavier (Glorot) Initialization**

* Formula (uniform):

  $$
  w \sim U\Big(-\sqrt{\tfrac{6}{fan_{in} + fan_{out}}}, \; \sqrt{\tfrac{6}{fan_{in} + fan_{out}}}\Big)
  $$
* Goal: Keep the **variance of activations** roughly the same across layers.
* Works best with **symmetric, saturating activations** like:

  * `tanh`
  * `sigmoid`
* Why? Because tanh/sigmoid squash values into a small range (`[-1, 1]` or `[0, 1]`), so the variance has to be balanced in both the forward pass (activations) and the backward pass (gradients). Using `fan_in + fan_out` in the bound is the compromise between the two.

---

#### 2. **He (Kaiming) Initialization**

* Formula (for uniform):

  $$
  w \sim U\Big(-\sqrt{\tfrac{6}{fan_{in}}}, \; \sqrt{\tfrac{6}{fan_{in}}}\Big)
  $$
* Goal: Compensate for the fact that **ReLU zeroes out half of its inputs** on average.
* Works best with **non-saturating, rectifier activations** like:

  * `ReLU`
  * `LeakyReLU`
  * `ELU`
* Why? ReLU discards negative values, so the variance would roughly halve at every layer unless the weights get a “boost”. He init therefore uses a weight variance of `2 / fan_in`, scaling by `fan_in` only (not by `fan_in + fan_out`).
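The variance argument can be observed directly by pushing random data through a stack of ReLU layers. A rough sketch under stated assumptions: 20 layers of width 256, zero biases, no dropout, and a batch of 1024 standard-normal inputs. None of these values come from the model in this section.

```python
import torch
import torch.nn as nn


def final_activation_std(init_fn, depth=20, width=256, seed=0):
    """Std of the activations after `depth` Linear+ReLU layers."""
    torch.manual_seed(seed)
    x = torch.randn(1024, width)
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width)
            init_fn(layer.weight)       # re-initialize the weights in place
            nn.init.zeros_(layer.bias)
            x = torch.relu(layer(x))
    return x.std().item()


xavier_std = final_activation_std(nn.init.xavier_uniform_)
he_std = final_activation_std(
    lambda w: nn.init.kaiming_uniform_(w, nonlinearity="relu")
)

print(f"Xavier: activation std after 20 ReLU layers = {xavier_std:.2e}")
print(f"He:     activation std after 20 ReLU layers = {he_std:.2e}")
```

With Xavier the activations shrink by orders of magnitude over 20 layers, while with He they stay on roughly the same scale as the input. For a shallow network the difference is much smaller.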

---

#### 3. **Rule of Thumb**

* **Use Xavier** → if your network uses **tanh** or **sigmoid** activations.
* **Use He/Kaiming** → if your network uses **ReLU-like** activations.
* **If unsure** → and your architecture is modern (ReLU/LeakyReLU etc.), He is usually the safer choice.

---

#### 4. **Our case**

In your `FunctionModel`, you’re using **ReLU**:

```python
x = torch.relu(self.fc1(x))
```

That means **He initialization is more appropriate** than Xavier, because ReLU cuts off negative activations and Xavier doesn’t account for that.

You could implement it like this:

```python
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight, nonlinearity='relu')
        nn.init.zeros_(module.bias)
```

---

✅ **Summary**

* Xavier = good for `tanh` / `sigmoid`.
* He = good for `ReLU` / `LeakyReLU`.
* In practice, **most modern feed-forward and convolutional networks use ReLU variants, so He init is the usual choice**.

---




#### He (Kaiming) Initialization
![](images/kaiming_uniform.svg)

#### Xavier (Glorot)
![](images/xavier.svg)

###  What happens if you **don’t** initialize manually?

When you create a layer in PyTorch, like:

```python
self.fc1 = nn.Linear(1, 64)
```

PyTorch automatically initializes the weights and biases for you.
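The defaults are based on **Kaiming uniform initialization**, but called with a non-default parameter (`a=√5`). This yields a narrower range than the He initialization recommended for ReLU, and the resulting `1/√fan_in` scaling resembles LeCun-style initialization.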

---

### PyTorch Default for `nn.Linear`

* **Weights (`W`)**:

  $$
  w \sim U\Big(-\sqrt{\tfrac{1}{fan_{in}}}, \; \sqrt{\tfrac{1}{fan_{in}}}\Big)
  $$

  where `fan_in` = number of input features.
  → This is **Kaiming uniform** with `a=√5`: the gain becomes `√(2 / (1 + a²)) = √(1/3)`, so the bound `gain · √(3 / fan_in)` simplifies to `1 / √fan_in`.

* **Biases (`b`)**:

  $$
  b \sim U\Big(-\tfrac{1}{\sqrt{fan_{in}}}, \; \tfrac{1}{\sqrt{fan_{in}}}\Big)
  $$

So by default, you’re already getting something *reasonable*, not random chaos.
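These default bounds can be checked on a freshly created layer. A minimal sketch, assuming a recent PyTorch release (the defaults are an implementation detail of `nn.Linear` and could change between versions). The layer sizes are illustrative.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(64, 32)  # left at PyTorch's default initialization
bound = 1.0 / math.sqrt(layer.in_features)  # 1 / sqrt(64) = 0.125

# Gain that Kaiming init uses for a leaky-ReLU slope of a = sqrt(5)
gain = nn.init.calculate_gain("leaky_relu", math.sqrt(5))
kaiming_bound = gain * math.sqrt(3.0 / layer.in_features)

max_abs_weight = layer.weight.abs().max().item()
max_abs_bias = layer.bias.abs().max().item()

print(f"1/sqrt(fan_in)           = {bound:.4f}")
print(f"Kaiming bound, a=sqrt(5) = {kaiming_bound:.4f}")
print(f"max |w| = {max_abs_weight:.4f}")
print(f"max |b| = {max_abs_bias:.4f}")
```

Both bounds agree, and the sampled weights and biases stay inside them. For comparison, He initialization for ReLU would use `sqrt(6 / 64) ≈ 0.306` on the same layer.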

---

###  Comparing with Explicit Xavier/He

* **Default (PyTorch)** → Kaiming uniform with a specific parameter choice. Works fairly well in most cases.
* **Explicit Xavier** → Balanced for tanh/sigmoid, may not be ideal for ReLU (could slightly slow convergence).
* **Explicit He/Kaiming** → Optimized for ReLU/LeakyReLU, often speeds up convergence and improves stability.

---

###  Why results differ

When you re-initialize with `nn.init.xavier_uniform_` or `nn.init.kaiming_uniform_`, you’re **changing the distribution of starting weights**.

Since neural network training is highly sensitive to initialization:

* Units in early layers may saturate (tanh/sigmoid) or die (ReLU) sooner, or stay active longer.
* The optimizer may explore a different trajectory.
* With poor initialization, training might get stuck in plateaus.

That’s why you see different results when you add/remove that initialization block.

---

**Summary**

* **If you don’t initialize** → PyTorch uses a built-in Kaiming-uniform-like scheme.
* **If you do initialize** → You can *tailor* it (Xavier vs. He) to your activation function.
* **Different results** happen because initialization controls how signals and gradients flow at the very start — changing the whole optimization trajectory.
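When comparing initialization schemes, it helps to fix the random seed so that the starting weights are the only thing that changes. A small sketch, using a hypothetical two-layer MLP rather than the `FunctionModel` from this section:

```python
import torch
import torch.nn as nn


def make_model(seed, init_fn=None):
    """Build a small MLP; optionally re-initialize its Linear layers."""
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
    if init_fn is not None:
        for module in model.modules():
            if isinstance(module, nn.Linear):
                init_fn(module.weight)
                nn.init.zeros_(module.bias)
    return model


default_a = make_model(seed=0)
default_b = make_model(seed=0)
he_model = make_model(
    seed=0,
    init_fn=lambda w: nn.init.kaiming_uniform_(w, nonlinearity="relu"),
)

# Same seed, same scheme: identical starting weights
same_start = torch.equal(default_a[0].weight, default_b[0].weight)
# Same seed, different scheme: different starting weights
changed_start = not torch.equal(default_a[0].weight, he_model[0].weight)
print(same_start, changed_start)
```

Both flags should come out `True`: the seed makes a given scheme reproducible, so any remaining difference between two training runs can be attributed to the initialization itself.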

---
