
# The Big Picture: Building Blocks of a Neural Network

**1. Operations:**
- Neural networks perform computations, which can be broken down into individual "operations".
- Examples: matrix multiplication (to calculate weighted sums), adding biases, and applying activation functions.
- The `Operation` class is the base class for all these individual computations.

**2. Layers:**
- A layer is a collection of neurons. Each neuron applies a series of operations to its inputs.
- A neural network is made up of layers stacked together.
- The `Layer` class in our code organizes a set of Operations.

**3. Activation Functions:**

- These introduce non-linearity to the network, allowing it to learn complex relationships in data.
- Examples: Sigmoid, ReLU (Rectified Linear Unit), Tanh.
- In our code, activation functions are a type of Operation.
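
To see why the non-linearity matters, here is a minimal sketch (the shapes, seed, and variable names are assumptions chosen for illustration). Two stacked weight multiplications without an activation collapse into a single one, while inserting a sigmoid between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # 4 observations, 3 features (illustrative sizes)
W1 = rng.normal(size=(3, 5))   # first "layer" weights
W2 = rng.normal(size=(5, 2))   # second "layer" weights

# Without an activation, two layers are equivalent to one layer with weights W1 @ W2.
two_linear = (x @ W1) @ W2
one_linear = x @ (W1 @ W2)
collapses = bool(np.allclose(two_linear, one_linear))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# With a sigmoid in between, the result is no longer a single matrix product.
with_activation = sigmoid(x @ W1) @ W2
differs = not np.allclose(with_activation, one_linear)

print(collapses, differs)
```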

**4. Loss Function:**

- This measures how well the network is performing (how far off its predictions are from the actual values).
- The goal of training is to minimize this loss.
- The Loss class and its subclasses (like MeanSquaredError) calculate this.
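
As a small worked example (the numbers are made up for illustration), mean squared error and its gradient with respect to the predictions can be computed with the same formulas the `MeanSquaredError` class uses later in this notebook:

```python
import numpy as np

prediction = np.array([[2.5], [0.0], [2.0]])
target = np.array([[3.0], [-0.5], [2.0]])

# Squared errors are 0.25, 0.25 and 0.0, so the mean is 0.5 / 3.
mse = np.sum((prediction - target) ** 2) / prediction.shape[0]

# Gradient of the loss with respect to each prediction.
grad = 2.0 * (prediction - target) / prediction.shape[0]

print(mse)
print(grad)
```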

**5. Neural Network:**

- The overall architecture that connects layers together.
- It defines the flow of data through the network (forward pass) and the flow of gradients for learning (backward pass).
- The NeuralNetwork class orchestrates the layers and the loss function.

**6. Optimizer:**

- This algorithm updates the network's parameters (weights and biases) based on the calculated gradients to minimize the loss function.
- Example: Stochastic Gradient Descent (SGD).
- The `Optimizer` class (and `SGD`) handles this update process.
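
The update rule itself is short. The sketch below applies it to a toy one-parameter loss, `(w - 3)^2`, which is an assumption chosen only because its minimum is known. It is plain gradient descent: the "stochastic" part of SGD comes from computing gradients on mini-batches, which this toy omits.

```python
def grad(w):
    # Derivative of the toy loss (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0    # initial parameter value
lr = 0.1   # learning rate

for _ in range(100):
    w -= lr * grad(w)   # same update shape as SGD: param -= lr * param_grad

print(w)
```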

**7. Trainer:**

- This class manages the training loop, feeding data to the network, calculating the loss, updating parameters, and evaluating performance.
- The Trainer class encapsulates the training logic.
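
Before introducing the classes, the whole loop can be sketched in plain NumPy for the simplest possible model, a single linear unit. The data, learning rate, batch size, and epoch count below are assumptions for illustration; the `Trainer` class later in the notebook wraps the same steps around a `NeuralNetwork` and an `Optimizer`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 1))
y = 2.0 * X + 1.0                 # noiseless toy target: y = 2x + 1

W = np.zeros((1, 1))              # weight to learn
b = np.zeros((1, 1))              # bias to learn
lr, batch_size = 0.1, 16

for epoch in range(200):
    perm = rng.permutation(X.shape[0])          # shuffle once per epoch
    X_s, y_s = X[perm], y[perm]
    for start in range(0, X.shape[0], batch_size):
        xb = X_s[start:start + batch_size]
        yb = y_s[start:start + batch_size]
        pred = xb @ W + b                        # forward pass
        grad = 2.0 * (pred - yb) / xb.shape[0]   # gradient of MSE w.r.t. pred
        W -= lr * (xb.T @ grad)                  # parameter updates
        b -= lr * grad.sum(axis=0, keepdims=True)

final_loss = float(np.mean((X @ W + b - y) ** 2))   # evaluate on the full data
print(W[0, 0], b[0, 0], final_loss)
```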


# Helper Functions

In [None]:
import numpy as np
from numpy import ndarray
from typing import List, Tuple  # Import Tuple

def assert_same_shape(array: ndarray, array_grad: ndarray) -> None:
  """
  Asserts that two ndarrays have the same shape.
  """
  assert array.shape == array_grad.shape, \
    f"Two ndarrays should have the same shape; instead, first ndarray's shape is {array.shape} and second ndarray's shape is {array_grad.shape}."

def to_2d_np(a: ndarray, type: str = "col") -> ndarray:
    """
    Reshapes a 1D ndarray to a 2D ndarray.

    Args:
      a: The 1D ndarray to reshape.
      type:  "col" to make it a column vector (shape: (n, 1)),
              "row" to make it a row vector (shape: (1, n)).

    Returns:
        The reshaped 2D ndarray.
    """
    assert a.ndim == 1, "Input ndarray must be 1-dimensional."
    if type == "col":
        return a.reshape(-1, 1)  # -1 infers the number of rows
    elif type == "row":
        return a.reshape(1, -1)  # -1 infers the number of columns
    else:
        raise ValueError("Type must be 'col' or 'row'.")

# Let's say we have a 1D array representing some target values:
my_array_1d = np.array([1, 2, 3, 4])
print("Original array:")
print(my_array_1d)
print("Shape:", my_array_1d.shape)  # Output: (4,)  (A 1D array with 4 elements)

Original array:
[1 2 3 4]
Shape: (4,)


In [None]:
# Now, let's use to_2d_np to reshape it into a column vector:
my_array_2d_col = to_2d_np(my_array_1d, type="col")
print("\nReshaped to column vector:")
print(my_array_2d_col)
print("Shape:", my_array_2d_col.shape)  # Output: (4, 1) (A 2D array with 4 rows and 1 column)


Reshaped to column vector:
[[1]
 [2]
 [3]
 [4]]
Shape: (4, 1)


In [None]:
# Reshape to a row vector:
my_array_2d_row = to_2d_np(my_array_1d, type="row")
print("\nReshaped to row vector:")
print(my_array_2d_row)
print("Shape:", my_array_2d_row.shape)  # Output: (1, 4) (A 2D array with 1 row and 4 columns)


Reshaped to row vector:
[[1 2 3 4]]
Shape: (1, 4)


# Operations

In [None]:
class Operation:
  """
  base class for all operations in our Neural Network
  """
  def __init__(self):
    self.input_ = None

  def forward(self, input_: ndarray) -> ndarray:
    """
    Performs the forward pass of the operation.

    Args:
      input_: The input ndarray.

    Returns:
      The output ndarray.
    """

    self.input_ = input_
    return self._output()

  def backward(self, output_grad: ndarray) -> ndarray:
    """
    Performs the backward pass of the operation.

    Args:
      output_grad: The gradient of the loss with respect to the output of this operation.

    Returns:
      The gradient of the loss with respect to the input of this operation.
    """
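    # output_grad must match the shape of this operation's output, not its input
    assert_same_shape(self.output, output_grad)
    input_grad = self._input_grad(output_grad)
    # the gradient sent further back must match the shape of the input
    assert_same_shape(self.input_, input_grad)
    return input_grad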

  def _output(self) -> ndarray:
        """
        Calculates the output of the operation.
        This method needs to be implemented by the subclasses.
        """
        raise NotImplementedError

  def _input_grad(self, output_grad: ndarray) -> ndarray:
        """
        Calculates the gradient with respect to the input.
        This method needs to be implemented by the subclasses.
        """
        raise NotImplementedError

  @property
  def output(self) -> ndarray:
        """
        Returns the output of the operation, recomputed from the input
        stored during the forward pass.
        """
        return self._output()  # Recomputed on each access rather than cached, for simplicity

**ParamOperation Class**
- It provides a blueprint for operations that learn. In neural networks, weights and biases are the parameters that the network adjusts to improve its performance.
- It cleanly separates the calculation of gradients for inputs (_input_grad) and parameters (_param_grad), which are used for different purposes during backpropagation.
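
To make the two gradients concrete, here is a minimal sketch for a bias addition `y = x + b` (the array values are assumptions for illustration). The input gradient keeps the shape of the input, while the parameter gradient is summed over the batch so that it keeps the shape of the parameter.

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # batch of 3 observations, shape (3, 2)
b = np.array([[0.5, -0.5]])                          # bias, shape (1, 2)
output_grad = np.ones((3, 2))                        # pretend dL/dy is 1 everywhere

# dL/dx: passed back to the previous operation, same shape as x.
input_grad = output_grad.copy()

# dL/db: used to update b, summed over the batch so it has the same shape as b.
param_grad = output_grad.sum(axis=0, keepdims=True)

print(input_grad.shape, param_grad.shape)
```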

In [None]:
class ParamOperation(Operation):
    """
    An Operation with parameters.
    """

    def __init__(self, param: ndarray):
        """
        Initialize with a parameter.

        Args:
            param: The parameter (e.g., weights or biases).
        """
        super().__init__()
        self.param = param


    def backward(self, output_grad: ndarray) -> ndarray:
        """
        Performs the backward pass.
        Calculates the gradient with respect to the input (_input_grad)
        and the gradient with respect to the parameter (_param_grad).

        Args:
            output_grad: The gradient of the loss with respect to the output of this operation.

        Returns:
            The gradient of the loss with respect to the input of this operation.
        """
        assert_same_shape(self.output, output_grad)
        # Calculates the gradient with respect to the input (_input_grad)
        self.input_grad = self._input_grad(output_grad)
        # Calculates the gradient with respect to the parameter (_param_grad).
        self.param_grad = self._param_grad(output_grad)

        assert_same_shape(self.input_, self.input_grad)
        assert_same_shape(self.param, self.param_grad)

        return self.input_grad

    def _param_grad(self, output_grad: ndarray) -> ndarray:
        """
        Calculates the gradient with respect to the parameter.
        This method needs to be implemented by the subclasses.

        Args:
            output_grad: The gradient of the loss with respect to the output of this operation.

        Returns:
            The gradient of the loss with respect to the parameter.
        """
        raise NotImplementedError

# Block 3: The WeightMultiply Class

- The WeightMultiply operation performs the crucial matrix multiplication between the input and the layer's weights. This is a core computation in every neural network.
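
The backward pass for `y = xW` uses two formulas: the input gradient is `output_grad @ W^T` and the weight gradient is `x^T @ output_grad`. As a sanity check, the sketch below compares both against finite differences. The helper `numeric_grad`, the scalar `loss`, and the random shapes are assumptions introduced only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))             # 4 observations, 3 features
W = rng.normal(size=(3, 2))             # 3 features -> 2 outputs
output_grad = rng.normal(size=(4, 2))   # stand-in for dL/dy

def loss():
    # A scalar whose gradient with respect to y = xW is exactly output_grad.
    return np.sum((x @ W) * output_grad)

# Analytic gradients, as used in the WeightMultiply operation.
input_grad = output_grad @ W.T
param_grad = x.T @ output_grad

def numeric_grad(f, arr, eps=1e-6):
    # Central finite differences, perturbing one entry of arr at a time in place.
    g = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        orig = arr[idx]
        arr[idx] = orig + eps
        up = f()
        arr[idx] = orig - eps
        down = f()
        arr[idx] = orig
        g[idx] = (up - down) / (2 * eps)
    return g

num_input_grad = numeric_grad(loss, x)
num_param_grad = numeric_grad(loss, W)

print(np.allclose(input_grad, num_input_grad, atol=1e-6))
print(np.allclose(param_grad, num_param_grad, atol=1e-6))
```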


## Matrix Multiplication

In [None]:
# Input Data (Matrix X):

# Sales in (Thousands of Dollars)
X = np.array([
    [10, 20, 15],  # Region 1: Product A, B, C
    [25, 5, 12],   # Region 2
    [8, 18, 22]    # Region 3
])
# Shape of X: (3 regions, 3 products)

# Weight Matrix (Matrix W):
  # This matrix represents the "importance" or "contribution"
  # of each product to certain "sales factors"
  # (which are abstract features the network learns).
# For example, these factors could be things like
  # "market penetration," "customer demand," or "seasonal influence"
  # – things that aren't directly in our raw data.

# Importance of each product to sales factors
W = np.array([
    [0.5, 1.2],   # Product A: Factor 1, Factor 2
    [0.8, 0.3],   # Product B
    [0.1, 1.0]    # Product C
])
# Shape of W: (3 products, 2 sales factors)

# Resulting "Sales Factor" representation
Result = np.dot(X, W)

# Calculation for the first cell (Region 1, Factor 1):
# (10 * 0.5) + (20 * 0.8) + (15 * 0.1) = 5 + 16 + 1.5 = 22.5

# Result:
# [[22.5, 33.0],  # Region 1: Factor 1, Factor 2
#  [17.7, 43.5],  # Region 2
#  [20.6, 37.0]]  # Region 3
# Shape of Result: (3 regions, 2 sales factors)

In [None]:
class WeightMultiply(ParamOperation):
    """
    Weight multiplication operation for a neural network.
    """

    def __init__(self, W: ndarray):
        """
        Initialize Operation with self.param = W (weights).

        Args:
            W: The weight matrix.
        """
        super().__init__(W)

    def _output(self) -> ndarray:
        """
        Compute output (y = xW).

        Returns:
            The result of the matrix multiplication.
        """
        return np.dot(self.input_, self.param) # xW

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        """
        Compute input gradient (dL/dx = output_grad @ W^T).

        Args:
            output_grad: The gradient of the loss with respect to the output.

        Returns:
            The gradient of the loss with respect to the input.
        """
        return np.dot(output_grad, np.transpose(self.param, (1, 0)))


    def _param_grad(self, output_grad: ndarray) -> ndarray:
        """
        Compute parameter gradient (dL/dW = x^T @ output_grad).

        Args:
            output_grad: The gradient of the loss with respect to the output.

        Returns:
            The gradient of the loss with respect to the weights.
        """
        return np.dot(np.transpose(self.input_, (1, 0)), output_grad)

In [None]:
class BiasAdd(ParamOperation):
    """
    Compute bias addition.
    """

    def __init__(self, B: ndarray):
        """
        Initialize Operation with self.param = B (bias).
        Check appropriate shape.
        """
        assert B.shape[0] == 1
        super().__init__(B)

    def _output(self) -> ndarray:
        """
        Compute output (y = x + B).
        """
        return self.input_ + self.param

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        """
        Compute input gradient (dy/dx = 1 * output_grad = output_grad).
        """
        return np.ones_like(self.input_) * output_grad # ∂L/∂x = ∂L/∂y

    def _param_grad(self, output_grad: ndarray) -> ndarray:
        """
        Compute parameter gradient (dy/dB = sum(output_grad, axis=0)).
        """
        param_grad = np.ones_like(self.param) * output_grad
        return np.sum(param_grad, axis=0).reshape(1, param_grad.shape[1])  # ∂L/∂b

In [None]:
class Sigmoid(Operation):
    """
    Sigmoid activation function.
    """

    def __init__(self):
        """Pass"""
        super().__init__()

    def _output(self) -> ndarray:
        """
        Compute output (sigmoid(x)).
        """
        return 1.0 / (1.0 + np.exp(-1.0 * self.input_))

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        """
        Compute input gradient (dy/dx = sigmoid(x) * (1 - sigmoid(x)) * output_grad).
        """
        sigmoid_output = self.output  # sigmoid(x), recomputed from the stored input
        return sigmoid_output * (1.0 - sigmoid_output) * output_grad

# Layer

**Block 7: The Layer Class**

A `Layer` in a neural network is a collection of neurons. It encapsulates the operations that transform data as it passes through that level of the network.

1. Initialization (__init__):
- Store the number of neurons in the layer.
- Initialize lists to hold the layer's parameters (params), their gradients (param_grads), and the operations it performs (operations).
- Keep track of whether it's the "first" time the layer is called (self.first). This is important for setting up the layer's parameters based on the input shape.

2. Setup (_setup_layer):

- This is an abstract method (like _output and _input_grad in Operation). Specific layer types (like Dense) will implement this to define their operations (e.g., weight multiplication, bias addition, activation).
- It takes the input array (`input_`) as an argument, so it can read the input's shape and initialize things like the weight matrix with the correct dimensions.

3. Forward Pass (forward):

- If it's the first time the layer is called, call _setup_layer to initialize the operations.
- Store the input.
- Iterate through the layer's operations, passing the input through each one in sequence. The output of each operation becomes the input to the next.
- Store the final output of the layer.

4. Backward Pass (backward):

- Assert that the shape of the incoming output_grad matches the shape of the layer's output.
- Iterate through the layer's operations in reverse order, passing the output_grad backward through each operation. The input_grad from each operation becomes the output_grad for the previous one. This is backpropagation!
- Store the final input_grad (the gradient of the loss with respect to the input of this layer).
- Call self._param_grads() to extract the parameter gradients from the operations.


5. Parameter Extraction (_params):

- Extract the layer's parameters from its operations.

6. Parameter Gradient Extraction (_param_grads):

- Extract the layer's parameter gradients from its operations.
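
The forward and backward steps above can be sketched with two toy scalar operations. `Scale` and `Square` are hypothetical stand-ins, not classes from this notebook; they only show that the forward pass visits operations in order and the backward pass visits them in reverse, each one multiplying in its local derivative.

```python
class Scale:
    """Toy operation: y = k * x."""
    def __init__(self, k):
        self.k = k

    def forward(self, x):
        return self.k * x

    def backward(self, output_grad):
        return self.k * output_grad          # dy/dx = k

class Square:
    """Toy operation: y = x^2."""
    def forward(self, x):
        self.x = x                           # remember the input for the backward pass
        return x * x

    def backward(self, output_grad):
        return 2.0 * self.x * output_grad    # dy/dx = 2x

operations = [Scale(3.0), Square()]          # overall function: y = (3x)^2

# Forward pass: the output of each operation is the input to the next.
out = 2.0
for operation in operations:
    out = operation.forward(out)

# Backward pass: the same list in reverse, starting from dL/dy = 1.
grad = 1.0
for operation in reversed(operations):
    grad = operation.backward(grad)

print(out, grad)   # y = 36.0 and dy/dx = 18x = 36.0 at x = 2
```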


In [None]:
class Layer(object):
    """
    A "layer" of neurons in a neural network.
    """

    def __init__(self, neurons: int):
        """
        The number of "neurons" roughly corresponds to the "breadth" of the layer
        """
        self.neurons = neurons
        self.first = True  # Flag to indicate if it's the first forward pass
        self.params: List[ndarray] = []  # List to store parameters (e.g., weights, biases)
        self.param_grads: List[ndarray] = []  # List to store parameter gradients
        self.operations: List[Operation] = []  # List to store operations in the layer

    def _setup_layer(self, input_: ndarray) -> None:
        """
        The _setup_layer function must be implemented for each layer
        (e.g., Dense layer will set up WeightMultiply, BiasAdd, Activation)
        """
        raise NotImplementedError()  # Abstract method

    def forward(self, input_: ndarray) -> ndarray:
        """
        Passes input forward through a series of operations
        """
        if self.first:
          self._setup_layer(input_)
          self.first = False

        self.input_ = input_

        for operation in self.operations:
          input_ = operation.forward(input_)

        self.output = input_

        return self.output

    def backward(self, output_grad: ndarray) -> ndarray:
        """
        Passes output_grad backward through a series of operations
        Checks appropriate shapes
        """
        assert_same_shape(self.output, output_grad)

        for operation in reversed(self.operations):  # Iterate in reverse order
            output_grad = operation.backward(output_grad)  # Backpropagate through each operation

        input_grad = output_grad  # Store the final input gradient

        self._param_grads()  # Extract parameter gradients

        return input_grad


    def _params(self) -> List[ndarray]:
        """
        Extracts the _params from a layer's operations
        """
        self.params = []
        for operation in self.operations:
            if isinstance(operation, ParamOperation):  # Check if it's a ParamOperation
                self.params.append(operation.param)
        return self.params


    def _param_grads(self) -> List[ndarray]:
        """
        Extracts the _param_grads from a layer's operations
        """
        self.param_grads = []
        for operation in self.operations:
            if isinstance(operation, ParamOperation):  # Check if it's a ParamOperation
                self.param_grads.append(operation.param_grad)
        return self.param_grads

In [None]:
import numpy as np

class Dense(Layer):
    """
    A fully connected layer which inherits from "Layer"
    """

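    def __init__(self,
                 neurons: int,
                 activation: Operation = None):
        """
        Takes an activation operation upon initialization; defaults to a new
        Sigmoid per layer so that layers never share one Operation instance.
        """
        super().__init__(neurons)  # Call the Layer's constructor
        self.activation = activation if activation is not None else Sigmoid()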

    def _setup_layer(self, input_: ndarray) -> None:
        """
        Defines the operations of a fully connected layer.
        """
        # Initialize weights and biases
        self.params = []
        self.params.append(np.random.randn(input_.shape[1], self.neurons))  # Weights (W)
        self.params.append(np.random.randn(1, self.neurons))  # Biases (B)

        # Define the operations
        self.operations = [
            WeightMultiply(self.params[0]),  # input @ W
            BiasAdd(self.params[1]),         # input @ W + B
            self.activation                # activation(input @ W + B)
        ]

# Loss

In [None]:
class Loss(object):
  '''
  The "loss" of a neural network.
  '''

  def __init__(self):
    '''Pass'''
    pass

  def forward(self, prediction: ndarray, target: ndarray) -> float:
    '''
    Computes the actual loss value.
    '''
    assert_same_shape(prediction, target)

    self.prediction = prediction
    self.target = target

    loss_value = self._output()

    return loss_value

  def backward(self) -> ndarray:
    '''
    Computes gradient of the loss value with respect to the input to the
    loss function.
    '''
    self.input_grad = self._input_grad()

    assert_same_shape(self.prediction, self.input_grad)

    return self.input_grad

  def _output(self) -> float:
    '''
    Every subclass of "Loss" must implement the _output function.
    '''
    raise NotImplementedError()

  def _input_grad(self) -> ndarray:
    '''
    Every subclass of "Loss" must implement the _input_grad function.
    '''
    raise NotImplementedError()

In [None]:
class MeanSquaredError(Loss):

  def __init__(self):
    '''Pass'''
    super().__init__()

  def _output(self) -> float:
    '''
    Computes the per-observation squared error loss.
    '''
    loss = np.sum(np.power(self.prediction - self.target, 2)) / self.prediction.shape[0]

    return loss

  def _input_grad(self) -> ndarray:
    '''
    Computes the loss gradient with respect to the input for MSE loss.
    '''

    return 2.0 * (self.prediction - self.target) / self.prediction.shape[0]

In [None]:
from scipy.special import softmax  # row-wise softmax used in _output

class SoftmaxCrossEntropyLoss(Loss):
    def __init__(self, eps: float = 1e-9):
        super().__init__()
        self.eps = eps
        self.single_output = False

    def _output(self) -> float:

        # applying the softmax function to each row (observation)
        softmax_preds = softmax(self.prediction, axis=1)

        # clipping the softmax output to prevent numeric instability
        self.softmax_preds = np.clip(softmax_preds, self.eps, 1 - self.eps)

        # actual loss computation
        softmax_cross_entropy_loss = (
            -1.0 * self.target * np.log(self.softmax_preds) - \
                (1.0 - self.target) * np.log(1 - self.softmax_preds)
        )

        return np.sum(softmax_cross_entropy_loss)

    def _input_grad(self) -> ndarray:

        return self.softmax_preds - self.target

# Neural Network

In [None]:
# Receive X and y as inputs, both ndarrays.
# Feed X successively forward through each Layer.
# Use the Loss to produce the loss value and the loss gradient to be sent backward.
# Use the loss gradient as input to the network's backward method, which calculates the param_grads for each layer in the network.
# Expose params() and param_grads() so that an Optimizer can update each parameter using its learning rate and the newly calculated param_grads.

class NeuralNetwork(object):
  '''
  The class for a neural network.
  '''
  def __init__(self,
               layers: List[Layer],  # List of Layer instances (e.g., [Dense(13), Dense(1)]).
               loss: Loss,
               seed: float = 1):
    '''
    Neural networks need layers, and a loss.
    '''
    self.layers = layers  # applied in order during the forward pass
    self.loss = loss
    self.seed = seed
    if seed:
      for layer in self.layers:
        setattr(layer, 'seed', self.seed)

  def forward(self, x_batch: ndarray) -> ndarray:
    '''
    Passes input forward through a series of layers.
    '''
    x_out = x_batch
    for layer in self.layers:
      x_out = layer.forward(x_out)

    return x_out

  def backward(self, loss_grad: ndarray) -> None:
    '''
    Passes the loss gradient backward through a series of layers.
    '''

    grad = loss_grad
    for layer in reversed(self.layers):
      grad = layer.backward(grad)

    return None

  def train_batch(self,
                  x_batch: ndarray,
                  y_batch: ndarray) -> float:
    '''
    Passes data forward through the layers.
    Computes the loss.
    Passes data backward through the layers.
    '''
    predictions = self.forward(x_batch)
    loss = self.loss.forward(predictions, y_batch)
    self.backward(self.loss.backward())
    return loss

  #Provide iterators for parameters and their gradients across all layers.
  def params(self):
    '''
    Gets the parameters for the network.
    '''
    for layer in self.layers:
      yield from layer.params  # Yields parameters (e.g., W_1, b_1, W_2, b_2, ...).

  def param_grads(self):
    '''
    Gets the gradient of the loss with respect to the parameters for the
    network.
    Yields gradients (e.g., ∂L/∂W_1, ∂L/∂b_1, ...)
    '''
    for layer in self.layers:
      yield from layer.param_grads


# Optimizer and SGD

In [None]:
class Optimizer(object):
  '''
  Base class for a neural network optimizer.
  '''
  def __init__(self,
               lr: float = 0.01):
    '''
    Every optimizer must have an initial learning rate.
    '''
    self.lr = lr

  def step(self) -> None:
    '''
    Every optimizer must implement the "step" function.
    '''
    pass

In [None]:
class SGD(Optimizer):
  '''
  Stochastic gradient descent optimizer.
  '''
  def __init__(self,
                lr: float = 0.01) -> None:
    '''Pass'''
    super().__init__(lr)

  def step(self):
    '''
    For each parameter, adjust in the appropriate direction, with the magnitude of the adjustment
    based on the learning rate.
    '''
    for (param, param_grad) in zip(self.net.params(),
                                    self.net.param_grads()):

      param -= self.lr * param_grad

In [None]:
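from copy import deepcopy  # used by fit() to snapshot the network for early stopping

def permute_data(X: ndarray, y: ndarray) -> Tuple[ndarray, ndarray]:
    '''
    Shuffles X and y with the same random row permutation.
    Assumed helper: fit() calls permute_data, which this notebook does not define.
    '''
    perm = np.random.permutation(X.shape[0])
    return X[perm], y[perm]

class Trainer(object):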
    '''
    Trains a neural network
    '''
    def __init__(self,
                 net: NeuralNetwork,
                 optim: Optimizer) -> None:
        '''
        Requires a neural network and an optimizer in order for training to occur.
        Assign the neural network as an instance variable to the optimizer.
        '''
        self.net = net
        self.optim = optim
        self.best_loss = 1e9
        setattr(self.optim, 'net', self.net)

    def generate_batches(self,
                         X: ndarray,
                         y: ndarray,
                         size: int = 32) -> Tuple[ndarray]:
        '''
        Generates batches for training
        '''
        assert X.shape[0] == y.shape[0], \
        '''
        features and target must have the same number of rows, instead
        features has {0} and target has {1}
        '''.format(X.shape[0], y.shape[0])

        N = X.shape[0] # gets the total number of training examples (rows) in your dataset.

        for ii in range(0, N, size): # 0, 32, 64, 96...

            X_batch, y_batch = X[ii:ii+size], y[ii:ii+size]

            yield X_batch, y_batch

    def fit(self,
            X_train: ndarray, y_train: ndarray,
            X_test: ndarray, y_test: ndarray,
            epochs: int=100,
            eval_every: int=10,
            batch_size: int=32,
            seed: int = 1,
            restart: bool = True)-> None:
        '''
        Fits the neural network on the training data for a certain number of epochs.
        Every "eval_every" epochs, it evaluates the neural network on the testing data.
        '''
        np.random.seed(seed)
        if restart:
            for layer in self.net.layers:
                layer.first = True

            self.best_loss = 1e9

        for e in range(epochs):
            if (e+1) % eval_every == 0:
                # for early stopping
                last_model = deepcopy(self.net)
            X_train, y_train = permute_data(X_train, y_train)
            batch_generator = self.generate_batches(X_train, y_train,
                                                    batch_size)
            for ii, (X_batch, y_batch) in enumerate(batch_generator):
              self.net.train_batch(X_batch, y_batch)
              self.optim.step()

            if (e+1) % eval_every == 0:

                test_preds = self.net.forward(X_test)
                loss = self.net.loss.forward(test_preds, y_test)

                if loss < self.best_loss:
                    print(f"Validation loss after {e+1} epochs is {loss:.3f}")
                    self.best_loss = loss
                else:
                    print(f"""Loss increased after epoch {e+1}, final loss was {self.best_loss:.3f}, using the model from epoch {e+1-eval_every}""")
                    self.net = last_model
                    # ensure self.optim is still updating self.net
                    setattr(self.optim, 'net', self.net)
                    break

