In [9]:
!pip install ipynb

Collecting ipynb
  Downloading ipynb-0.5.1-py3-none-any.whl.metadata (303 bytes)
Downloading ipynb-0.5.1-py3-none-any.whl (6.9 kB)
Installing collected packages: ipynb
Successfully installed ipynb-0.5.1


To import code from another notebook the way we would import a module from a .py file in a regular IDE, we use the ipynb package:
from ipynb.fs.full.<notebook_name> import <function_name>

In [14]:
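# ipynb.fs.full runs the whole notebook (ActivationFunctions2.ipynb must be in
# the same directory) and exposes the names it defines.
from ipynb.fs.full.ActivationFunctions2 import ActivationFunction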

In [1]:
import numpy as np

In [2]:
class Layer:
    def __init__(self, input_size, output_size, activation_fn, activation_dfn):
        # Scale the random weights by sqrt(1 / input_size), a fan-in variant of
        # Xavier initialization (also known as LeCun initialization), instead of a
        # fixed factor such as 0.1. He initialization would use sqrt(2 / input_size).
        self.W = np.random.randn(output_size, input_size) * np.sqrt(1. / input_size)
        self.b = np.zeros((output_size, 1))
        
        self.activation_fn = activation_fn
        self.activation_dfn = activation_dfn
        
        # We need to store these for the backward pass
        self.A_prev = None
        self.Z = None

    def forward(self, A_prev):
        self.A_prev = A_prev
        # Z = W . A_prev + b
        self.Z = np.dot(self.W, A_prev) + self.b
        # A = activation(Z)
        self.A = self.activation_fn(self.Z)
        return self.A

    def backward(self, dA, learning_rate):
        # dA is the 'error signal' coming from the next layer
        
        # 1. Calculate how Z affected the loss (dZ)
        # dZ = dA * activation_derivative(Z)
        dZ = dA * self.activation_dfn(self.Z)
        
        # 2. Calculate how W affected the loss (dW)
        # dW = dZ . A_prev_transpose
        m = self.A_prev.shape[1]
        dW = np.dot(dZ, self.A_prev.T) / m
        db = np.sum(dZ, axis=1, keepdims=True) / m
        
        # 3. Calculate error for the PREVIOUS layer (dA_prev)
        dA_prev = np.dot(self.W.T, dZ)
        
        # 4. Update Weights and Biases (Gradient Descent)
        self.W -= learning_rate * dW
        self.b -= learning_rate * db
        
        return dA_prev
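
The comments in `__init__` name both He and Xavier initialization. As a rough sketch of how the two scalings differ, assuming the common fan-in conventions (standard deviation sqrt(1 / input_size) for the formula used in the class, sqrt(2 / input_size) for He), the snippet below draws both and compares their spread. The layer sizes and the seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen only for reproducibility
input_size, output_size = 512, 256    # hypothetical layer sizes

# Fan-in scaling, as in the Layer class: std = sqrt(1 / input_size)
W_fan_in = rng.standard_normal((output_size, input_size)) * np.sqrt(1.0 / input_size)

# He scaling (commonly paired with ReLU): std = sqrt(2 / input_size)
W_he = rng.standard_normal((output_size, input_size)) * np.sqrt(2.0 / input_size)

print("fan-in std:", W_fan_in.std(), "target:", np.sqrt(1.0 / input_size))
print("He std:    ", W_he.std(), "target:", np.sqrt(2.0 / input_size))
```

The He weights come out about sqrt(2) times wider, which is meant to compensate for ReLU zeroing out roughly half of the activations.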

### 3. Why did we do it this way? (Mastery Explanation)
A_prev: In the backward pass, to update the weight between Layer 1 and Layer 2, you need to know what the input was during the forward pass. That's why we "cache" (store) A_prev.
dA_prev: Notice the function returns dA_prev. This is how the "Chain" works. Layer 3 calculates its error and hands it back to Layer 2. Layer 2 uses it and hands its error back to Layer 1.
axis=1, keepdims=True: In the bias update (db), we sum up all the errors for a whole batch of data. keepdims ensures the shape stays (n, 1) so we don't accidentally turn a column into a flat list.
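
A small sketch of the keepdims point, using a made-up dZ for a layer with 2 neurons and a batch of 3 samples. It compares the shape of db with and without keepdims=True, and shows what broadcasting does when the flat version meets a column-shaped bias.

```python
import numpy as np

# Hypothetical dZ: n = 2 neurons, m = 3 samples
dZ = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])
m = dZ.shape[1]

db_keep = np.sum(dZ, axis=1, keepdims=True) / m   # stays a column, like self.b
db_flat = np.sum(dZ, axis=1) / m                  # collapses to a flat array

print(db_keep.shape)  # (2, 1)
print(db_flat.shape)  # (2,)

b = np.zeros((2, 1))
print((b - db_keep).shape)  # (2, 1)
print((b - db_flat).shape)  # (2, 2): broadcasting silently builds a matrix
```

With the in-place update used in the class (`self.b -= ...`), the flat version would not broadcast quietly; NumPy would raise an error because a (2, 2) result cannot be written back into a (2, 1) array.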

### 4. Mathematical Logic (The "Aha!" Moment)
In an interview, they might ask: "What is the derivative of the weights?"

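The answer is: Error Signal times Inputs, that is dW = dZ . A_prev_transpose, averaged over the batch (the same formula used in the backward method).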

If the input was 0, the weight had no effect on the error, so the gradient is 0.
If the input was huge, the weight had a huge effect, so the gradient is large.
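
To make this concrete, here is a minimal sketch with made-up numbers: one sample with three inputs (one of them 0, one of them large) and an error signal for two neurons. It applies the same dW formula as the backward method.

```python
import numpy as np

# Hypothetical values: 3 input features, 2 neurons, 1 sample (m = 1)
A_prev = np.array([[0.0], [1.0], [10.0]])   # first input is 0, third is large
dZ = np.array([[0.1], [-0.2]])              # error signal at this layer's Z
m = A_prev.shape[1]

# Same formula as in Layer.backward: one row per neuron, one column per input
dW = np.dot(dZ, A_prev.T) / m
print(dW)
```

The column of dW that belongs to the zero input contains only zeros, while the column for the input of 10 holds gradients ten times larger than the column for the input of 1.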


In [15]:
# Use the activation functions we wrote in Notebook 2
# Let's assume we have 3 input features, and we want 2 hidden neurons
layer1 = Layer(3, 2, ActivationFunction.relu, ActivationFunction.relu_derivative)

# Mock input data (3 features, 1 sample)
x_sample = np.array([[0.5], [0.1], [-0.2]])

# Run forward pass
output = layer1.forward(x_sample)
print("Output of Forward Pass:\n", output)

# Run backward pass (Assume error dA from next layer is 0.1)
dA_mock = np.array([[0.1], [0.1]])
prev_error = layer1.backward(dA_mock, learning_rate=0.01)
print("\nError passed back to input:\n", prev_error)

Output of Forward Pass:
 [[0.       ]
 [0.0891017]]

Error passed back to input:
 [[ 0.01081987]
 [ 0.01739602]
 [-0.00880316]]


Once you have run this, here is a question: in the Layer class, we initialized W with np.random.randn.
What would happen if we initialized all weights to zero? (This is a very famous interview question!)

### The Answer to the "Zero Initialization" Question:
If you initialize all weights to zero, you run into the symmetry problem.

1. Identical Output: Every neuron in the hidden layer will receive the same input and have the same weight (0). Therefore, they will all calculate the exact same output.
2. Identical Gradient: During backprop, every neuron will receive the exact same gradient update.
3. No Diversity: Even after 1,000 epochs of training, every neuron in that layer will still be identical to its neighbor.
4. Result: Each hidden layer of your "Neural Network" effectively collapses to a single neuron, because all of its neurons are doing the same thing.
   
Conclusion: We use random numbers (like np.random.randn) to "Break Symmetry." It gives each neuron a different "starting perspective" so they can learn different features (e.g., one neuron learns to find edges, another finds circles).
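
The symmetry argument can be checked with a small experiment. This is only a sketch under stated assumptions: it does not use the Layer class, but a tiny hand-written 2-layer sigmoid network trained on a toy OR dataset, and the helper name train_hidden_weights, the learning rate, and the step count are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_hidden_weights(W1, W2, X, Y, lr=0.5, steps=100):
    """Train a 2-layer sigmoid network with gradient descent; return the hidden weights."""
    W1, W2 = W1.copy(), W2.copy()
    b1 = np.zeros((W1.shape[0], 1))
    b2 = np.zeros((W2.shape[0], 1))
    m = X.shape[1]
    for _ in range(steps):
        # Forward pass
        A1 = sigmoid(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # Backward pass (cross-entropy loss with a sigmoid output gives dZ2 = A2 - Y)
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * A1 * (1.0 - A1)
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m
        # Gradient descent update
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return W1

# Toy dataset: logical OR on 2 inputs, 4 samples stored as columns
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 1.0]])

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# 3 hidden neurons, once with all-zero weights and once with random weights
W1_zero = train_hidden_weights(np.zeros((3, 2)), np.zeros((1, 3)), X, Y)
W1_rand = train_hidden_weights(rng.standard_normal((3, 2)) * np.sqrt(1.0 / 2),
                               rng.standard_normal((1, 3)) * np.sqrt(1.0 / 3), X, Y)

print("Zero init, hidden weights after training:\n", W1_zero)
print("Rows identical (zero init):  ", np.allclose(W1_zero, W1_zero[0]))
print("Rows identical (random init):", np.allclose(W1_rand, W1_rand[0]))
```

With zero initialization the hidden weights do move away from zero, but every row stays a copy of the others, so the three neurons never specialize. With random initialization the rows start different and stay different.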