# BareBonesML Part 2: From Tensors to a Trainable Neural Network

In our [previous post](notebooks/01_autograd_engine.ipynb), we built an autograd engine. We have `Tensor` objects that can track their computational history and automatically compute gradients via backpropagation.

Today, we will use that engine to build the first pieces of a deep learning library. We'll move up one level of abstraction to create reusable **layers** and an **optimizer**.

Our goals for this post are:
1. Create a `Module` class inspired by `torch.nn.Module`.
2. Implement a `Linear` layer.
3. Build a `Sequential` container to easily stack our layers into a deep model.
4. Implement a Stochastic Gradient Descent (SGD) optimizer.
5. Train a simple model on a toy regression problem.

All the code for this post can be found in the `from_scratch/` directory of the [bare-bones-ml repository](https://github.com/devansh-lodha/bare-bones-ml).

## The `Module`: Our Neural Network Building Block

A `Module` is the core container in our neural network library. Its job is to keep track of all the learnable parameters (`Tensor`s with `requires_grad=True`) within it. A `Module` can also contain other `Module`s, allowing us to build complex, nested model architectures.

The magic happens in the `__setattr__` method. We override this special Python method so that whenever we assign a `Tensor` or another `Module` as an attribute, it gets automatically registered.

Here is the implementation from `from_scratch/nn.py`:

```python
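# from_scratch/nn.py
from typing import List

import numpy as np  # used by the Linear layer further down

# Import path as used in this post's notebook cells
from from_scratch.autograd.tensor import Tensor
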
class Module:
    def __init__(self):
        # Every module starts with two empty dictionaries
        self._parameters = {} # For Tensors
        self._modules = {}    # For other Modules

    def __setattr__(self, name: str, value):
        # This method is called EVERY time we do `self.some_name = some_value`,
        # including the two dictionary assignments in __init__ above.

        # We check the TYPE of the value we are assigning.
        if isinstance(value, Tensor):
            # If it's a Tensor, we add it to our internal `_parameters` dictionary.
            self._parameters[name] = value

        elif isinstance(value, Module):
            # If it's another Module (like a Linear layer), we add it to `_modules`.
            self._modules[name] = value

        # We still call the original, default Python __setattr__ to actually
        # perform the assignment (i.e., to make `self.layer1` exist).
        super().__setattr__(name, value)

    def parameters(self) -> List[Tensor]:
        """
        Returns a list of all parameters in this module AND all its sub-modules.
        """
        # 1. Start with the parameters defined directly in this module.
        params = list(self._parameters.values())
        
        # 2. Go into each registered sub-module...
        for module in self._modules.values():
            # ...and ask it for its parameters (which will also be a recursive call).
            params.extend(module.parameters())
            
        return params

    def zero_grad(self):
        """Resets the gradients of all parameters to None."""
        for p in self.parameters():
            p.grad = None
    
    def __call__(self, *args, **kwargs):
        """Allows the module to be called like a function, which runs the forward pass."""
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        """Forward pass. Must be implemented by subclasses."""
        raise NotImplementedError
```

### `__setattr__` and Automatic Parameter Registration

You might wonder how `model.parameters()` magically finds all the `Tensor`s from all the layers. This is not magic, but a clever use of a special Python method called `__setattr__`.

Whenever you assign an attribute in Python, like `self.weight = ...`, Python internally calls `self.__setattr__('weight', ...)`. We have *overridden* this method in our base `Module` class to act as a registrar.

Our custom `__setattr__` checks the type of the value being assigned. If it's a `Tensor`, it gets added to an internal `_parameters` dictionary. If it's another `Module`, it gets added to `_modules`.

The `parameters()` method then simply traverses this structure, collecting all the parameters from a module and recursively asking all of its sub-modules for their parameters. This elegant design allows us to define complex, nested architectures while the framework handles the tedious bookkeeping of all the learnable weights and biases for us.
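
To see the registration in action without the rest of the library, here is a minimal, self-contained sketch. The `Tensor` class below is a hypothetical stand-in (it only stores data), and `Leaf` and `Parent` are illustrative names that do not appear in the repository; the `Module` is a trimmed copy of the one above.

```python
class Tensor:
    """Hypothetical stand-in for the post's Tensor: holds data and a grad slot."""
    def __init__(self, data, requires_grad=False):
        self.data = data
        self.requires_grad = requires_grad
        self.grad = None


class Module:
    """Trimmed copy of the Module above: registration and traversal only."""
    def __init__(self):
        self._parameters = {}
        self._modules = {}

    def __setattr__(self, name, value):
        if isinstance(value, Tensor):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        super().__setattr__(name, value)

    def parameters(self):
        params = list(self._parameters.values())
        for module in self._modules.values():
            params.extend(module.parameters())
        return params


class Leaf(Module):
    def __init__(self):
        super().__init__()
        self.weight = Tensor([1.0, 2.0], requires_grad=True)
        self.bias = Tensor([0.0], requires_grad=True)


class Parent(Module):
    def __init__(self):
        super().__init__()
        self.scale = Tensor([1.0], requires_grad=True)
        self.child = Leaf()      # registered as a sub-module
        self.label = "parent"    # plain attribute: stored, but not registered


model = Parent()
registered_params = sorted(model._parameters)
registered_modules = sorted(model._modules)
total = len(model.parameters())

print(registered_params)   # ['scale']
print(registered_modules)  # ['child']
print(total)               # 3 (scale, plus the child's weight and bias)
```

Note that a subclass has to call `super().__init__()` before assigning any `Tensor` or `Module`; otherwise `_parameters` and `_modules` do not exist yet when `__setattr__` tries to write to them.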

## The `Linear` Layer

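With `Module` as our base, a `Linear` layer takes only a few lines. It performs the affine transformation $y = xW^T + b$, where each row of $x$ is one sample and $W$ is stored with shape `(output_size, input_size)`, which is why it is transposed in the forward pass. All it needs to do is: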
1.  Inherit from `Module`.
2.  In its `__init__`, create its learnable `weight` and `bias` Tensors. Because of our clever `__setattr__`, they are automatically registered as parameters.
3.  Define the `forward` method to perform the matrix multiplication and addition.

```python
# from_scratch/nn.py

class Linear(Module):
    """A standard fully-connected linear layer: y = xW^T + b"""
    def __init__(self, input_size: int, output_size: int):
        super().__init__() # This initializes the parent Module
        self.weight = Tensor(
            np.random.randn(output_size, input_size) * np.sqrt(2.0 / input_size), 
            requires_grad=True
        )
        self.bias = Tensor(np.zeros(output_size), requires_grad=True)

    def forward(self, x: Tensor) -> Tensor:
        return x @ self.weight.T + self.bias
```

### Design Deep Dive: Why He Initialization?

In the `Linear` layer, you'll notice we don't just use `np.random.randn()`. We scale it:
```python
self.weight = Tensor(
    np.random.randn(output_size, input_size) * np.sqrt(2.0 / input_size), 
    ...
)
```
This technique is called **He initialization** (also known as Kaiming initialization), named after Kaiming He, lead author of the 2015 paper that introduced it.

**The Problem: Vanishing and Exploding Gradients**

When we stack many layers, the variance of the outputs can change at each layer. If the weights are too small, the signal (and gradients) can shrink to zero as it passes through the network, which is known as the **vanishing gradient problem**. If the weights are too large, the signal can grow exponentially, leading to the **exploding gradient problem**.

**The Solution: Scaling by Fan-in**

He initialization sets the initial weights to just the right scale by accounting for the **fan-in** of the layer. The "fan-in" is simply the number of input connections to a neuron in that layer. For a `Linear` layer, the fan-in is its `input_size`.

The logic is that the output of a neuron is a sum of `fan_in` terms. The more terms you add together, the larger the variance of the sum will be. To counteract this and keep the variance stable across layers, we need to make the individual weights smaller for layers with more inputs.
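
A rough version of this variance argument can be written out. The simplifying assumptions here are mine rather than the post's: the weights $w_i$ are i.i.d. with zero mean and independent of the inputs $x_i$, and the previous layer's pre-activations $z_i$ are symmetric around zero before ReLU is applied.

$$
\begin{aligned}
y &= \sum_{i=1}^{\text{fan\_in}} w_i x_i, \qquad x_i = \mathrm{ReLU}(z_i) \\
\mathrm{Var}(y) &= \text{fan\_in} \cdot \mathrm{Var}(w) \cdot \mathbb{E}[x^2] \\
\mathbb{E}[x^2] &= \tfrac{1}{2}\,\mathrm{Var}(z) \\
\mathrm{Var}(y) = \mathrm{Var}(z) \;&\Longrightarrow\; \mathrm{Var}(w) = \frac{2}{\text{fan\_in}}
\end{aligned}
$$

The third line holds because ReLU zeroes the negative half of a symmetric distribution, which removes half of its second moment. Since `np.random.randn` draws values with unit variance, multiplying them by the square root of $2 / \text{fan\_in}$ produces weights with exactly this variance.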

The He initialization formula does exactly this:
$$
\text{scale} = \sqrt{\frac{2}{\text{fan\_in}}}
$$

By scaling our random weights by this factor, we help keep the signal at a stable scale in both the forward and backward pass, which makes much deeper networks trainable. The `2` in the numerator is derived for the ReLU activation function: ReLU zeroes out roughly half of its inputs, which halves the signal's second moment, and the factor of 2 compensates for that. This is why He initialization and ReLU are so often paired in modern networks.
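
The effect is easy to check numerically. The sketch below uses plain NumPy rather than our `Tensor` class, and the helper name `final_second_moment`, the layer width, and the depth are illustrative choices, not part of the library. It pushes random inputs through a stack of bias-free linear layers with ReLU and reports the mean squared activation at the end for three different weight scales.

```python
import numpy as np

rng = np.random.default_rng(0)


def final_second_moment(weight_scale, depth=10, width=512, batch=256):
    """Push random inputs through `depth` ReLU layers; return the mean squared activation."""
    x = rng.standard_normal((batch, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_scale
        x = np.maximum(x @ W.T, 0.0)  # linear layer without bias, then ReLU
    return float(np.mean(x ** 2))


width = 512
he = final_second_moment(np.sqrt(2.0 / width))  # He scaling
unscaled = final_second_moment(1.0)             # plain randn weights
tiny = final_second_moment(0.01)                # weights that are too small

print(f"He scale:   {he:.3e}")
print(f"Unscaled:   {unscaled:.3e}")
print(f"Scale 0.01: {tiny:.3e}")
```

With He scaling the value should stay on the order of 1 after ten layers. With unscaled weights it grows by a factor of roughly `fan_in / 2` per layer, and with a scale of 0.01 it shrinks by a similar kind of factor per layer, which is the exploding and vanishing behavior described above.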

## The `Sequential` Container

Models are often just a sequence of layers. A `Sequential` module is a container that takes other modules and applies them in order during the forward pass. This makes defining a standard feed-forward network clean and easy.

```python
# from_scratch/nn.py

class Sequential(Module):
    """A container for modules that will be applied in sequence."""
    def __init__(self, *modules: Module):
        super().__init__()
        for i, module in enumerate(modules):
            # The name of the submodule will be its index as a string
            self._modules[str(i)] = module

    def forward(self, x: Tensor) -> Tensor:
        """Passes the input through each module in order."""
        for module in self._modules.values():
            x = module(x)
        return x
```

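Now let's see how easy it is to define a 2-hidden-layer MLP. (The `ReLU` module used here is imported from `from_scratch.nn`; its implementation is not shown in this post.)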

```python
import sys
sys.path.append('../')

from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import Linear, Sequential, ReLU

# Define a 2-hidden-layer MLP using our library components
# Input (3 features) -> Hidden1 (4 neurons) -> ReLU -> Hidden2 (4 neurons) -> ReLU -> Output (1 neuron)
model = Sequential(
    Linear(input_size=3, output_size=4),
    ReLU(),
    Linear(input_size=4, output_size=4),
    ReLU(),
    Linear(input_size=4, output_size=1)
)

# Thanks to our Module's recursive parameter collection, this just works!
all_params = model.parameters()
print(f"Our Sequential model found {len(all_params)} parameter Tensors.")

# Let's inspect the number of parameters per layer.
# Weights are stored as (output_size, input_size).
# Layer 1: weight (4x3) + bias (4) = 12 + 4 = 16
# Layer 2: weight (4x4) + bias (4) = 16 + 4 = 20
# Layer 3: weight (1x4) + bias (1) = 4 + 1 = 5
# Total params = 16 + 20 + 5 = 41
num_params_total = sum(p.data.size for p in all_params)
print(f"Total number of learnable parameters: {num_params_total}")
```

```
Our Sequential model found 6 parameter Tensors.
Total number of learnable parameters: 41
```
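
The same count can be reproduced from the layer sizes alone, without building the model. This is a small standalone sketch; `count_linear_params` is a hypothetical helper, not a function from the library.

```python
def count_linear_params(layer_sizes):
    """Parameter count of each Linear layer in an MLP with the given layer sizes."""
    counts = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # weight has fan_out * fan_in entries, bias has fan_out entries
        counts.append(fan_out * fan_in + fan_out)
    return counts


per_layer = count_linear_params([3, 4, 4, 1])
print(per_layer)       # [16, 20, 5]
print(sum(per_layer))  # 41
```

The `ReLU` modules contribute nothing to the count because they hold no `Tensor` attributes, which is also why the model reports 6 parameter Tensors: one weight and one bias for each of the three `Linear` layers.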


## The Optimizer

The final piece of the puzzle is the **optimizer**. Its job is simple: it holds a reference to the model's parameters and updates them using their computed gradients.

We'll start with the most fundamental optimizer: **Stochastic Gradient Descent (SGD)**. Its update rule is `parameter -= learning_rate * gradient`: every parameter takes a small step against its gradient, which is the direction in which the loss decreases.

```python
# from_scratch/optim.py

class Optimizer:
    """Base class for all optimizers."""
    def __init__(self, params: List[Tensor], lr: float):
        self.params = params
        self.lr = lr

    def step(self):
        raise NotImplementedError

    def zero_grad(self):
        """Resets the gradients of all parameters."""
        for p in self.params:
            p.grad = None

class SGD(Optimizer):
    """Implements Stochastic Gradient Descent optimizer."""
    def __init__(self, params: List[Tensor], lr: float):
        super().__init__(params, lr)

    def step(self):
        """Performs a single optimization step."""
        for p in self.params:
            if p.grad is not None:
                p.data -= self.lr * p.grad
```
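
To get a feel for the update rule in isolation, here is a minimal sketch that applies it to a one-parameter problem using plain NumPy arrays instead of our `Tensor` class. The function `sgd_step` and the quadratic loss $(w - 3)^2$ are illustrative assumptions, chosen because the gradient $2(w - 3)$ is easy to write by hand.

```python
import numpy as np


def sgd_step(params, grads, lr):
    """In-place update p <- p - lr * g for each (parameter, gradient) pair."""
    for p, g in zip(params, grads):
        p -= lr * g  # modifies the NumPy array in place


w = np.array([0.0])  # start far from the minimum at w = 3
lr = 0.1
history = []

for _ in range(50):
    loss = float(((w - 3.0) ** 2).sum())
    grad = 2.0 * (w - 3.0)  # hand-derived gradient of (w - 3)^2
    history.append(loss)
    sgd_step([w], [grad], lr)

print(f"w after 50 steps: {w[0]:.4f}")
print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.2e}")
```

Each step shrinks the distance to the minimum by a constant factor of `1 - 2 * lr = 0.8`, so `w` approaches 3 geometrically. A learning rate above 1.0 would make that factor larger than 1 in magnitude for this loss, and the iterates would diverge.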

## The Full Training Loop

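We now have all the components to train a neural network. Let's put them together to solve a simple regression task: fitting data generated by the function $y = 2x_1 + 3x_2$. Note that every sample in the toy dataset below has $x_1 = x_2$, so the data only pins down the sum of the two weights (5), not the individual values 2 and 3.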

The training loop follows a standard recipe:
1.  Get model predictions (forward pass).
2.  Compute the loss (how wrong the predictions are).
3.  Compute the gradients (backward pass).
4.  Update the model parameters using the optimizer.
5.  Repeat.

```python
# 1. Imports and Data
import numpy as np
from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import Linear
from from_scratch.functional import mse_loss
from from_scratch.optim import SGD

# Toy dataset
# X shape: (num_samples, num_features)
X = Tensor(np.array([[1, 1], [2, 2], [3, 3], [4, 4]], dtype=np.float32))
# y = 2*x1 + 3*x2. For [1,1], y=5. For [2,2], y=10, etc.
y_true = Tensor(np.array([[5], [10], [15], [20]], dtype=np.float32))

# 2. Model, Loss, and Optimizer
model = Linear(input_size=2, output_size=1)
loss_function = mse_loss
optimizer = SGD(params=model.parameters(), lr=0.01)

# 3. The Training Loop
epochs = 20
print("--- Training Start ---")
for epoch in range(epochs):
    # a. Zero out gradients from the last step
    optimizer.zero_grad()

    # b. Forward pass: get model predictions
    predictions = model(X)

    # c. Compute the loss
    loss = loss_function(predictions, y_true)

    # d. Backward pass: compute gradients
    loss.backward()

    # e. Update weights
    optimizer.step()

    if epoch % 5 == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch}, Loss: {loss.data.item():.4f}")

# 4. Check the final predictions
print("\n--- Final Predictions vs True Values ---")
final_predictions = model(X)
for true, pred in zip(y_true.data, final_predictions.data):
    print(f"  True: {true[0]:.2f}, Predicted: {pred[0]:.2f}")
```

```
--- Training Start ---
Epoch 0, Loss: 328.5437
Epoch 5, Loss: 7.4500
Epoch 10, Loss: 0.3352
Epoch 15, Loss: 0.1724
Epoch 19, Loss: 0.1648

--- Final Predictions vs True Values ---
  True: 5.00, Predicted: 5.65
  True: 10.00, Predicted: 10.32
  True: 15.00, Predicted: 14.98
  True: 20.00, Predicted: 19.64
```
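
As a cross-check, the same loop can be written in plain NumPy with hand-derived gradients. This sketch assumes `mse_loss` is the mean of the squared errors (its implementation is not shown in this post), and the random seed and variable names are my own, so the numbers will differ from the run above. It also makes one property of this dataset visible: because $x_1 = x_2$ in every row, both weights always receive the same gradient, so the gap between them never changes during training.

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [3, 3], [4, 4]], dtype=np.float64)
y = np.array([[5], [10], [15], [20]], dtype=np.float64)

rng = np.random.default_rng(0)
w = rng.standard_normal((1, 2))  # same layout as Linear.weight: (output_size, input_size)
b = np.zeros(1)
initial_gap = w[0, 0] - w[0, 1]

lr, epochs = 0.01, 20
losses = []
for _ in range(epochs):
    pred = X @ w.T + b                 # forward pass, shape (4, 1)
    err = pred - y
    losses.append(float(np.mean(err ** 2)))

    grad_pred = 2.0 * err / err.size   # d(loss)/d(pred) for a mean-squared error
    grad_w = grad_pred.T @ X           # shape (1, 2), matches w
    grad_b = grad_pred.sum(axis=0)     # shape (1,), matches b

    w -= lr * grad_w                   # SGD update
    b -= lr * grad_b

final_gap = w[0, 0] - w[0, 1]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print(f"w1 - w2 before: {initial_gap:.4f}, after: {final_gap:.4f}")
```

To recover the individual coefficients 2 and 3, the dataset would need samples where $x_1$ and $x_2$ differ.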


## Conclusion

The loss drops from about 328.5 to 0.16 over 20 epochs, and the model's final predictions land within about 0.65 of the true values.

By creating a `Module` class, we were able to abstract away the low-level details of our autograd engine and build reusable `Linear` and `Sequential` layers. This is the essence of a deep learning library. We now have a clean, intuitive way to define model architectures.

In the next post, we'll expand our library to tackle a real-world classification problem, which will require us to implement a more advanced optimizer (`Adam`) and new loss and activation functions (`BinaryCrossEntropy` and `Sigmoid`).