## Function Approximation

### Data Generation
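```python
import torch

# Assumed device selection; reuse your own `device` if it is already defined.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ===== CONFIGURATION: Change these values to test different ranges =====
x_min, x_max = -20, 20  # Try: (-10, 10), (0, 50), (-100, 100), etc.
n_samples = 1000
# ========================================================================

# Generate data: a noisy sine wave sampled on an evenly spaced grid
x = torch.linspace(x_min, x_max, n_samples).reshape(-1, 1).to(device)
y = torch.sin(x) + 0.05 * torch.randn(n_samples, 1, device=device)

# Convert to numpy for plotting (tensors must be on the CPU first)
x_np = x.cpu().numpy()
y_np = y.cpu().numpy()
```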


### Normalization Step

Neural networks learn more efficiently when the input features are in a 
consistent, small range (e.g. [-1, 1] or [0, 1]). Large raw values can 
cause unstable gradients and slow convergence, especially with activation 
functions like `ReLU`, `tanh`, or `sigmoid`.

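In this case:
  - `x` originally ranges over `[-20, 20]`.
  - We standardize it by subtracting the mean and dividing by the standard deviation, which gives
    zero mean and unit variance (roughly `[-1.73, 1.73]` for evenly spaced points).
  - `y` is already bounded (`sin(x) ∈ [-1, 1]`) and only has small added noise, so scaling it is
    less critical; we standardize it as well so that inputs and targets are treated consistently.

This makes the optimization problem better conditioned and easier to solve.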



    

```python
# Normalize both x and y using mean/std standardization (consistent approach)
x_mean = x.mean()
x_std = x.std()
x_normalized = (x - x_mean) / x_std  # Standardize x to zero mean, unit variance

y_mean = y.mean()
y_std = y.std()
y_normalized = (y - y_mean) / y_std  # Standardize y to zero mean, unit variance
```
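
Because the targets are standardized, the model's outputs are in normalized units and have to be mapped back before plotting or comparing against the raw `y`. The following is a minimal sketch of that round trip, assuming the training statistics are stored and reused at inference time. The helper names `normalize_x` and `denormalize_y` are hypothetical and not part of the original code.

```python
import torch

torch.manual_seed(0)

# Recreate data of the same shape as above (CPU only, for illustration).
x = torch.linspace(-20, 20, 1000).reshape(-1, 1)
y = torch.sin(x) + 0.05 * torch.randn(1000, 1)

# Statistics are computed once on the training data and then reused.
x_mean, x_std = x.mean(), x.std()
y_mean, y_std = y.mean(), y.std()


def normalize_x(x_raw):
    """Map raw inputs into the standardized space the model is trained on."""
    return (x_raw - x_mean) / x_std


def denormalize_y(y_norm):
    """Map standardized outputs back to the original y scale."""
    return y_norm * y_std + y_mean


x_normalized = normalize_x(x)
y_normalized = (y - y_mean) / y_std

# Round trip: undoing the standardization recovers the original targets.
y_restored = denormalize_y(y_normalized)
```

At prediction time this would typically look like `denormalize_y(model(normalize_x(x_new)))`, where `model` is the trained network.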

---

###  Why Normalization Is Needed

* **Stability**: Large input values (e.g., -20 to 20) can cause very large outputs in linear layers (`wx + b`), making training unstable.
* **Gradient flow**: With normalized inputs, activations stay in their "useful" range (e.g., tanh not saturated at ±1).
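* **Conditioning**: When inputs and targets share a similar scale, the loss surface is better conditioned, so one learning rate suits all parameters and the optimizer doesn't have to fight scale differences. In practice this usually gives faster convergence and a better final fit.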
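
The gradient-flow point can be checked numerically. The sketch below is a simplification that assumes a single `tanh` unit with weight 1 and bias 0, so only the input scale matters. It compares how many outputs are saturated, and how large the average local gradient is, for raw versus standardized inputs.

```python
import torch

x_raw = torch.linspace(-20, 20, 1000).reshape(-1, 1)
x_standardized = (x_raw - x_raw.mean()) / x_raw.std()


def saturated_fraction(inputs, threshold=0.99):
    """Fraction of tanh outputs whose magnitude exceeds `threshold`."""
    return (torch.tanh(inputs).abs() > threshold).float().mean().item()


def mean_tanh_gradient(inputs):
    """Average of d tanh(u)/du = 1 - tanh(u)^2 over the inputs."""
    return (1 - torch.tanh(inputs) ** 2).mean().item()


raw_saturated = saturated_fraction(x_raw)
std_saturated = saturated_fraction(x_standardized)
raw_grad = mean_tanh_gradient(x_raw)
std_grad = mean_tanh_gradient(x_standardized)

print(f"saturated: raw={raw_saturated:.2f}, standardized={std_saturated:.2f}")
print(f"mean gradient: raw={raw_grad:.3f}, standardized={std_grad:.3f}")
```

With raw inputs most points sit on the flat tails of `tanh`, where almost no gradient passes through. After standardization every point stays in the region where the slope is still substantial.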

---

###  Comparison with Image Input

In **image tasks**, we almost always normalize inputs as well, but the method differs slightly:

* **Raw pixel range**: Images are typically in `[0, 255]`.

* **Normalization**:

  1. First scaled to `[0, 1]` by dividing by 255.
  2. Then standardized using mean and std (per channel):

     ```python
     transforms.Normalize(mean=[0.485, 0.456, 0.406],
                          std=[0.229, 0.224, 0.225])
     ```

     These values are the per-channel mean and standard deviation of the ImageNet training set; applying them gives each channel approximately zero mean and unit variance on ImageNet-like images.
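
The two steps above can be written out in plain `torch` to show what the transform computes. This is an illustrative sketch rather than the torchvision implementation. The input is a hypothetical all-white `uint8` image in `(channels, height, width)` layout. In a torchvision pipeline, `transforms.ToTensor()` usually handles the scaling to `[0, 1]` before `transforms.Normalize` is applied.

```python
import torch

# Hypothetical input: one RGB image stored as uint8, shape (3, H, W).
image_uint8 = torch.full((3, 4, 4), 255, dtype=torch.uint8)

# ImageNet statistics, reshaped so they broadcast over height and width.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Step 1: scale raw pixel values from [0, 255] to [0, 1].
image_01 = image_uint8.float() / 255.0

# Step 2: standardize each channel with its own mean and std.
image_normalized = (image_01 - mean) / std
```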


---

### Xavier/Glorot initialization


```python
import torch.nn as nn

# Better initialization (optional, but recommended).
# `model` is assumed to be the nn.Module defined earlier.
for module in model.modules():
    if isinstance(module, nn.Linear):
        # Xavier/Glorot initialization:
        # - Fills the weight matrix with values drawn from a uniform distribution
        #   with bounds based on the number of input and output units.
        # - Keeps the variance of activations roughly the same across layers.
        # - Helps prevent vanishing/exploding activations and gradients.
        nn.init.xavier_uniform_(module.weight)

        # Bias is set to 0:
        # - Because bias can start from zero without symmetry-breaking issues.
        # - Keeping it zero lets the network learn the required offset naturally.
        if module.bias is not None:  # layers created with bias=False have no bias
            nn.init.zeros_(module.bias)
```
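
For reference, with the default gain of 1, `xavier_uniform_` samples from `U(-b, b)` with `b = sqrt(6 / (fan_in + fan_out))`, which corresponds to a standard deviation of `sqrt(2 / (fan_in + fan_out))`. The sketch below checks this on a single layer. The layer sizes are arbitrary and chosen only for illustration.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in, fan_out = 64, 32  # arbitrary sizes for the demonstration
layer = nn.Linear(fan_in, fan_out)
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

# Theoretical bound and standard deviation of the Xavier uniform distribution.
xavier_bound = math.sqrt(6.0 / (fan_in + fan_out))
expected_std = math.sqrt(2.0 / (fan_in + fan_out))

max_abs_weight = layer.weight.abs().max().item()
empirical_std = layer.weight.std().item()

print(f"bound={xavier_bound:.4f}, largest |w|={max_abs_weight:.4f}")
print(f"expected std={expected_std:.4f}, empirical std={empirical_std:.4f}")
```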

---





#### He (Kaiming) Initialization
![](images/kaiming_uniform.svg)

#### Xavier (Glorot)
![](images/xavier.svg)

###  What happens if you **don’t** initialize manually?

When you create a layer in PyTorch, like:

```python
self.fc1 = nn.Linear(1, 64)
```

PyTorch automatically initializes the weights and biases for you.
By default, `nn.Linear` uses **Kaiming uniform initialization** with `a=√5`, which works out to a uniform distribution bounded by `1/√fan_in`. Despite the name, this is narrower than the ReLU-tuned He bound of `√(6/fan_in)`.

---

### PyTorch Default for `nn.Linear`

* **Weights (`W`)**:

  $$
  w \sim U\Big(-\tfrac{1}{\sqrt{\mathrm{fan}_{\mathrm{in}}}}, \; \tfrac{1}{\sqrt{\mathrm{fan}_{\mathrm{in}}}}\Big)
  $$

  where `fan_in` = number of input features.
  → This is **Kaiming uniform** with `a=√5`. The gain is `√(2 / (1 + a²)) = √(1/3)`, so the bound `gain · √(3 / fan_in)` simplifies to `1/√fan_in`.

* **Biases (`b`)**:

  $$
  b \sim U\Big(-\tfrac{1}{\sqrt{\mathrm{fan}_{\mathrm{in}}}}, \; \tfrac{1}{\sqrt{\mathrm{fan}_{\mathrm{in}}}}\Big)
  $$

So by default, you’re already getting something *reasonable*, not random chaos.
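
These bounds can be confirmed on a freshly constructed layer. The sketch below assumes a recent PyTorch release; older versions may initialize differently, so treat it as a check to run against your own installation. It also derives the same bound from the Kaiming formula with `a = √5`.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in, fan_out = 64, 32  # arbitrary sizes for the demonstration
layer = nn.Linear(fan_in, fan_out)  # default initialization only

# Bound stated above: 1 / sqrt(fan_in).
default_bound = 1.0 / math.sqrt(fan_in)

# The same bound via the Kaiming formula: gain * sqrt(3 / fan_in),
# where the leaky-ReLU gain is sqrt(2 / (1 + a^2)) with a = sqrt(5).
gain = nn.init.calculate_gain("leaky_relu", math.sqrt(5))
kaiming_bound = gain * math.sqrt(3.0 / fan_in)

max_abs_weight = layer.weight.abs().max().item()
max_abs_bias = layer.bias.abs().max().item()

print(f"1/sqrt(fan_in)={default_bound:.4f}, Kaiming(a=sqrt(5))={kaiming_bound:.4f}")
print(f"largest |w|={max_abs_weight:.4f}, largest |b|={max_abs_bias:.4f}")
```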

---

###  Comparing with Explicit Xavier/He

* **Default (PyTorch)** → Kaiming uniform with a specific parameter choice. Works fairly well in most cases.
* **Explicit Xavier** → Balanced for tanh/sigmoid, may not be ideal for ReLU (could slightly slow convergence).
* **Explicit He/Kaiming** → Optimized for ReLU/LeakyReLU, often speeds up convergence and improves stability.
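
One way to apply this comparison in code is a small helper that picks the scheme from the activation. The function name `init_linear_layers` and the two-layer models are hypothetical, introduced only for illustration. Xavier is used with its default gain of 1.

```python
import math

import torch
import torch.nn as nn


def init_linear_layers(model, activation="relu"):
    """Re-initialize every nn.Linear in `model` to suit `activation`."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            if activation == "relu":
                # He/Kaiming: bound = sqrt(6 / fan_in) for ReLU.
                nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
            else:
                # Xavier/Glorot: bound = sqrt(6 / (fan_in + fan_out)).
                nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)


torch.manual_seed(0)
relu_model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
tanh_model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

init_linear_layers(relu_model, "relu")
init_linear_layers(tanh_model, "tanh")

# Theoretical bounds for the second layer (fan_in=64, fan_out=1).
he_bound = math.sqrt(6.0 / 64)
xavier_bound = math.sqrt(6.0 / (64 + 1))
```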

---

###  Why results differ

When you re-initialize with `nn.init.xavier_uniform_` or `nn.init.kaiming_uniform_`, you’re **changing the distribution of starting weights**.

Since neural network training is highly sensitive to initialization:

* Units may saturate (tanh/sigmoid) or go dead (ReLU) earlier or later, depending on the initial weight scale.
* The optimizer may explore a different trajectory.
* With poor initialization, training might get stuck in plateaus.

That’s why you see different results when you add/remove that initialization block.
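
The effect of the starting distribution is visible before any training step. The sketch below pushes random inputs through a stack of `tanh` layers and compares the spread of the final activations under Xavier initialization and under a deliberately too-small normal initialization. The depth, width, and `std=0.01` are arbitrary choices for illustration, and the helper `final_activation_std` is hypothetical.

```python
import torch
import torch.nn as nn


def final_activation_std(init_fn, depth=10, width=256, seed=0):
    """Std of the last layer's activations for a tanh MLP initialized with `init_fn`."""
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width)
        init_fn(linear.weight)
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.Tanh()]
    model = nn.Sequential(*layers)
    with torch.no_grad():
        out = model(torch.randn(512, width))
    return out.std().item()


xavier_std = final_activation_std(nn.init.xavier_uniform_)
tiny_std = final_activation_std(lambda w: nn.init.normal_(w, std=0.01))

print(f"Xavier: {xavier_std:.4f}, tiny normal: {tiny_std:.2e}")
```

With the too-small weights the signal shrinks at every layer and has effectively vanished by the output, so the gradients reaching the early layers are correspondingly tiny. Xavier keeps the activations at a usable scale.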

---

**Summary**

* **If you don’t initialize** → PyTorch uses a built-in Kaiming-uniform-like scheme.
* **If you do initialize** → You can *tailor* it (Xavier vs. He) to your activation function.
* **Different results** happen because initialization controls how signals and gradients flow at the very start — changing the whole optimization trajectory.

---
