### Binary Classification (Images of 0 and 1)  
Binary classification is a type of supervised learning where the goal is to assign an input to one of two categories. In the case of images of 0 and 1, a model learns to distinguish between these two digits. Given an input image, the model outputs a probability score, which is then thresholded to classify the image as either a 0 or a 1.

### Fundamentals of a Single Neuron  
A single neuron in a neural network is a fundamental unit that processes inputs and produces an output using three key components:  

- **Weights**: Each input is multiplied by a weight, which determines the importance of that input. The model learns optimal weight values during training.  
- **Bias**: A bias term is added to shift the output, allowing the neuron to better fit the data even when inputs are zero.  
- **Activation Function**: The weighted sum of inputs plus bias is passed through an activation function, such as the sigmoid function, to introduce non-linearity and output a probability (0 to 1) in the case of binary classification.

![Neuron](https://bpb-us-e1.wpmucdn.com/blogs.cornell.edu/dist/a/1688/files/2015/09/VqOpE-1c4xc4y.jpg)

### **Forward Propagation: Role, Intuition, and Formula**  

#### **Role**  
Forward propagation computes the neuron's output by applying learned weights, bias, and an activation function to the input data.

#### **Intuition**  
1. **Linear Combination**: Inputs are weighted and summed with a bias.  
2. **Non-Linearity**: The sigmoid function maps this sum to a probability between 0 and 1.  
3. **Prediction**: The output represents the probability of class 1.

#### **Formula**  
Linear transformation

$$
   Z = W X + b
$$


Activation (Sigmoid)  
$$
   A = \frac{1}{1 + e^{-Z}}
$$
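
The two formulas translate almost directly into NumPy. The sketch below is illustrative only: the standalone function `forward_prop(W, b, X)` and the toy values are assumptions, not the original class's code.

```python
import numpy as np


def forward_prop(W, b, X):
    """Return A = sigmoid(W X + b) for W (1, nx), X (nx, m), scalar b."""
    Z = np.matmul(W, X) + b      # linear transformation, shape (1, m)
    A = 1 / (1 + np.exp(-Z))     # sigmoid squashes Z into (0, 1)
    return A


# Hypothetical toy data: nx = 2 features, m = 3 examples
W = np.array([[0.5, -0.5]])
b = 0.0
X = np.array([[1.0, 0.0, 2.0],
              [1.0, 0.0, 0.0]])
A = forward_prop(W, b, X)
print(A.shape)  # (1, 3)
```

Each column of `X` is one example, so `A` holds one probability per example.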



### **Cost: Role, Intuition, and Formula**  

#### **Role**  
The cost function measures how well the neuron’s predictions match the true labels. It quantifies the error and guides weight updates during training.

#### **Intuition**  
- If predictions \( A \) are close to true labels \( Y \), the cost is low.  
- If predictions are incorrect, the cost increases.  
- The **log loss (binary cross-entropy)** is used for binary classification, penalizing incorrect confident predictions more heavily.

#### **Formula (Log Loss for Logistic Regression)**  

$$
J = -\frac{1}{m} \sum \left[ Y \log(A) + (1 - Y) \log(1 - A) \right]
$$

where:  
- $ Y $ → True labels, shape $ (1, m) $.  
- $ A $ → Predicted probabilities, shape $ (1, m) $. 
- $ m $ → Number of samples  
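
A possible NumPy version of the log loss is sketched below. The `eps` clipping is an assumption added here to keep `log` away from 0; it is not part of the formula above, and the sample predictions are made up for illustration.

```python
import numpy as np


def cost(Y, A, eps=1e-7):
    """Binary cross-entropy averaged over the m examples (columns)."""
    m = Y.shape[1]
    # eps is an assumption of this sketch: it avoids log(0)
    A = np.clip(A, eps, 1 - eps)
    losses = Y * np.log(A) + (1 - Y) * np.log(1 - A)
    return -np.sum(losses) / m


Y = np.array([[1, 0, 1]])
confident_right = np.array([[0.9, 0.1, 0.9]])
confident_wrong = np.array([[0.1, 0.9, 0.1]])
print(cost(Y, confident_right) < cost(Y, confident_wrong))  # True
```

The comparison shows the intuition from this section: confident but wrong predictions cost far more than confident and correct ones.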



### **Evaluation: Role, Intuition, and Process**  

#### **Role**  
Evaluation checks how well the neuron classifies inputs by generating predictions and computing the cost.

#### **Intuition**  
1. **Forward Propagation**: The neuron processes input \( X \) to compute predicted probabilities \( A \).  
2. **Cost Calculation**: Measures how far predictions \( A \) deviate from true labels \( Y \) using log loss.  
3. **Thresholding for Classification**: Since \( A \) represents probabilities, values \( \geq 0.5 \) are classified as 1, and values \( < 0.5 \) as 0. This converts continuous outputs into discrete class labels.  

#### **Process in Code**  
1. **Run forward propagation** → `A = self.forward_prop(X)`  
2. **Compute cost** → `cost = self.cost(Y, A)`  
3. **Make predictions** → `result = np.where(A >= 0.5, 1, 0)`  
4. **Return predictions and cost**  
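
Put together, the four steps might look like the standalone function below. It is a sketch under assumptions: the original uses methods on a class (`self.forward_prop`, `self.cost`), whereas here `W` and `b` are passed in explicitly and the toy data is hypothetical.

```python
import numpy as np


def evaluate(W, b, X, Y):
    """Return (predicted labels, cost) for one neuron."""
    A = 1 / (1 + np.exp(-(np.matmul(W, X) + b)))   # forward propagation
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    predictions = np.where(A >= 0.5, 1, 0)         # threshold at 0.5
    return predictions, cost


W = np.array([[2.0]])
b = -1.0
X = np.array([[0.0, 1.0, 2.0]])
Y = np.array([[0, 1, 1]])
predictions, cost = evaluate(W, b, X, Y)
print(predictions)  # [[0 1 1]]
```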


### **Backpropagation: Role, Intuition, and Process**  

#### **Role**  
Backpropagation computes the gradients (partial derivatives) of the cost function with respect to the weights and bias. **Gradient descent** then uses those gradients to adjust the parameters and reduce the prediction error.

---

#### **Intuition**  
1. **Error Signal**: The neuron computes the difference between predicted outputs \( A \) and true labels \( Y \) (i.e., \( A - Y \)).  
2. **Gradient Calculation**:  
   - **Weight Gradient**: Measures how much each weight contributes to the error. Computed using the derivative of the cost function with respect to weights.  
   - **Bias Gradient**: Measures the impact of the bias term on the error.  
3. **Parameter Update (Gradient Descent)**:  
   - Weights and bias are adjusted by moving them in the opposite direction of the gradients, scaled by the learning rate \( \alpha \).  
   - This reduces the cost function, improving predictions over time.  
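
The error signal \( A - Y \) is not an approximation: it follows from the chain rule when the sigmoid activation is paired with log loss. A short derivation for a single example (\( m = 1 \)), using the symbols defined in this document:

$$
\frac{\partial J}{\partial A} = -\frac{Y}{A} + \frac{1 - Y}{1 - A}, \qquad \frac{\partial A}{\partial Z} = A(1 - A)
$$

$$
\frac{\partial J}{\partial Z} = \frac{\partial J}{\partial A} \cdot \frac{\partial A}{\partial Z} = -Y(1 - A) + (1 - Y)A = A - Y
$$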

---

#### **Mathematical Formulation**  
Given:  
- \( m \) → Number of samples  
- \( W \) → Weight vector (shape \( (1, nx) \))  
- \( b \) → Bias (scalar)  
- \( X \) → Input matrix (shape \( (nx, m) \))  
- \( A \) → Predicted output (shape \( (1, m) \))  
- \( Y \) → True labels (shape \( (1, m) \))  
- \( \alpha \) → Learning rate  

1. **Compute Gradients:**  
   - **Weight Gradient**:  
$$
     \frac{\partial J}{\partial W} = \frac{1}{m} (A - Y) X^T
$$
   - **Bias Gradient**:  
$$
     \frac{\partial J}{\partial b} = \frac{1}{m} \sum (A - Y)
$$

2. **Update Parameters:**  
   - **Weight Update**:  
$$
     W = W - \alpha \frac{\partial J}{\partial W}
$$
   - **Bias Update**:  
$$
     b = b - \alpha \frac{\partial J}{\partial b}
$$

---

#### **Process in Code**  
1. **Compute Gradients:**  
   - `grad_w = 1 / m * np.matmul((A - Y), X.T)`  
   - `grad_b = 1 / m * np.sum(A - Y)`  
2. **Update Weights and Bias:**  
   - `self.__W -= alpha * grad_w`  
   - `self.__b -= alpha * grad_b`  
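
The same step can be sketched as a pure function. This is an assumption-laden illustration: the original updates the private attributes `self.__W` and `self.__b` in place, while this version returns the new values, and the toy data is hypothetical.

```python
import numpy as np


def gradient_descent(W, b, X, Y, A, alpha=0.05):
    """One gradient descent step; returns the updated (W, b)."""
    m = Y.shape[1]
    dZ = A - Y                           # error signal, shape (1, m)
    grad_w = np.matmul(dZ, X.T) / m      # shape (1, nx), same as W
    grad_b = np.sum(dZ) / m              # scalar
    return W - alpha * grad_w, b - alpha * grad_b


X = np.array([[0.0, 1.0, 2.0, 3.0]])
Y = np.array([[0, 0, 1, 1]])
W, b = np.zeros((1, 1)), 0.0
A = 1 / (1 + np.exp(-(np.matmul(W, X) + b)))   # every entry is 0.5 at the start
W, b = gradient_descent(W, b, X, Y, A, alpha=0.1)
print(W, b)
```

With these values the weight moves up, because larger inputs carry label 1, and the bias stays put, because the errors cancel out on average.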


### **Training the Neuron: High-Level Overview**  

#### **Role**  
The `train` method allows the neuron to learn from the training data by iterating over multiple cycles, improving its ability to make accurate predictions.

#### **Intuition**  
1. **Input Validation**: Ensures valid `iterations` and `alpha` values are provided.
2. **Iterative Learning**: Over multiple iterations:
   - **Forward Propagation**: Computes predictions.
   - **Backpropagation (Gradient Descent)**: Updates weights and bias to reduce error.
   - **Evaluation**: Assesses the current performance.
3. **Return**: After the specified number of iterations, return the final predictions and cost.
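
A compact sketch of such a training loop is shown below. Several details are assumptions made for illustration: the function form (the original is a `train` method), the zero initialization of `W`, the exact validation rules and messages, and the toy data.

```python
import numpy as np


def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))


def train(X, Y, iterations=5000, alpha=0.05):
    """Train a single neuron; returns (predictions, cost, W, b)."""
    # Input validation (rules and messages are assumptions of this sketch)
    if not isinstance(iterations, int) or iterations <= 0:
        raise ValueError("iterations must be a positive integer")
    if not isinstance(alpha, float) or alpha <= 0:
        raise ValueError("alpha must be a positive float")

    nx, m = X.shape
    W = np.zeros((1, nx))    # assumption: zero init; the original may differ
    b = 0.0
    for _ in range(iterations):
        A = sigmoid(np.matmul(W, X) + b)          # forward propagation
        dZ = A - Y                                # error signal
        W = W - alpha * np.matmul(dZ, X.T) / m    # gradient descent on W
        b = b - alpha * np.sum(dZ) / m            # gradient descent on b

    A = sigmoid(np.matmul(W, X) + b)              # final evaluation
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    return np.where(A >= 0.5, 1, 0), cost, W, b


X = np.array([[0.0, 1.0, 3.0, 4.0]])
Y = np.array([[0, 0, 1, 1]])
predictions, cost, W, b = train(X, Y, iterations=2000, alpha=0.5)
print(predictions, cost)
```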


### **Key Idea of Gradient Descent in the Code**
Gradient descent is an optimization algorithm used to minimize the cost function by updating the parameters (weights and biases) iteratively. The idea is to compute the gradient (partial derivatives) of the cost function with respect to each parameter and update them in the direction that reduces the cost.

In this **NeuralNetwork** class, gradient descent is implemented in the `gradient_descent` method. The goal is to adjust the weights (`W1`, `W2`) and biases (`b1`, `b2`) to improve the model's predictions over time.

---

### **Breaking Down the Multiplications**
The key mathematical operations in gradient descent involve partial derivatives of the cost function with respect to the parameters. Here’s the intuition behind each multiplication:

1. **Error at Output Layer (dz2)**
$$
   dz2 = A2 - Y
$$
   - `A2` is the predicted output.
   - `Y` is the actual label.
   - The subtraction calculates the difference between predicted and actual values, which is the error signal that needs to be backpropagated.

2. **Gradient of Weights in the Output Layer (dW2)**
$$
   dW2 = \frac{1}{m} \cdot (dz2 \cdot A1^T)
$$
   - `dz2` (error at the output layer) is multiplied by the activations of the hidden layer (`A1`).
   - This tells us how much each weight in `W2` contributed to the final error.
   - The division by `m` averages the gradient over all training examples.

3. **Gradient of Bias in the Output Layer (db2)**
$$
   db2 = \frac{1}{m} \sum dz2
$$
   - The bias is added to every example regardless of its input values, so its gradient is simply the error `dz2` averaged over all examples.

4. **Error at the Hidden Layer (dz1)**
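$$
   dz1 = (W2^T \cdot dz2) \odot A1 \odot (1 - A1)
$$
   - Backpropagating the error from the output layer to the hidden layer:
     - `W2^T · dz2` is a matrix product that propagates the error backwards through the output weights.
     - `A1 * (1 - A1)` is the derivative of the sigmoid activation function (chain rule in calculus). It is applied **element-wise** (the \( \odot \) in the formula), not as a matrix product.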

5. **Gradient of Weights in the Hidden Layer (dW1)**
$$
   dW1 = \frac{1}{m} \cdot (dz1 \cdot X^T)
$$
   - `dz1` is multiplied by the input values `X` to determine how much `W1` should change.

6. **Gradient of Bias in the Hidden Layer (db1)**
$$
   db1 = \frac{1}{m} \sum dz1
$$
   - Computed as for `db2`: sum `dz1` over the examples (one sum per hidden neuron, giving shape `nodes x 1`) and divide by `m`.

7. **Updating Weights and Biases**
$$
   W2 = W2 - \alpha dW2, \quad b2 = b2 - \alpha db2
$$
$$
   W1 = W1 - \alpha dW1, \quad b1 = b1 - \alpha db1
$$
   - The parameters are updated by moving them **in the opposite direction** of the gradient (scaled by the learning rate `alpha`).
   - This ensures the model minimizes the error step by step.
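
The seven steps can be sketched in NumPy as follows. This is an illustration under assumptions, not the class's actual `gradient_descent` method: the function signature, the `params` tuple, and the random toy data are all hypothetical. Note that `*` is element-wise while `np.matmul` is the matrix product.

```python
import numpy as np


def gradient_descent(X, Y, params, A1, A2, alpha=0.05):
    """One gradient descent step for a network with one hidden layer."""
    W1, b1, W2, b2 = params
    m = Y.shape[1]
    dz2 = A2 - Y                                      # (1, m)
    dW2 = np.matmul(dz2, A1.T) / m                    # (1, nodes)
    db2 = np.sum(dz2, axis=1, keepdims=True) / m      # (1, 1)
    dz1 = np.matmul(W2.T, dz2) * (A1 * (1 - A1))      # (nodes, m)
    dW1 = np.matmul(dz1, X.T) / m                     # (nodes, nx)
    db1 = np.sum(dz1, axis=1, keepdims=True) / m      # (nodes, 1)
    return (W1 - alpha * dW1, b1 - alpha * db1,
            W2 - alpha * dW2, b2 - alpha * db2)


# Hypothetical sizes: nx = 4 inputs, nodes = 3 hidden neurons, m = 5 examples
rng = np.random.default_rng(0)
nx, nodes, m = 4, 3, 5
X = rng.normal(size=(nx, m))
Y = rng.integers(0, 2, size=(1, m))
params = (rng.normal(size=(nodes, nx)), np.zeros((nodes, 1)),
          rng.normal(size=(1, nodes)), np.zeros((1, 1)))
W1, b1, W2, b2 = params
A1 = 1 / (1 + np.exp(-(np.matmul(W1, X) + b1)))       # hidden activations
A2 = 1 / (1 + np.exp(-(np.matmul(W2, A1) + b2)))      # output activations
new_W1, new_b1, new_W2, new_b2 = gradient_descent(X, Y, params, A1, A2)
print(new_W1.shape, new_b1.shape, new_W2.shape, new_b2.shape)
# (3, 4) (3, 1) (1, 3) (1, 1)
```

Every updated parameter keeps the shape of the original, which is a quick sanity check that the gradients were computed with the right products.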

---

### **Intuition for Why These Multiplications Are Needed**
- The weight gradients use **matrix multiplication** because a product such as `dz2 · A1^T` sums, for every weight at once, the contributions of all `m` training examples.
- The gradient of the activation function **scales the error** to properly adjust the influence of each neuron.
- **Backpropagation** efficiently distributes the error from the output layer to the hidden layer using derivatives.

The use of **transposed matrices** in gradient descent for neural networks is crucial for ensuring the correct dimensions of the matrix operations during the backpropagation step. Let’s break down why transposing the matrices is needed:

### **1. Matrix Multiplication Dimensions**
When performing backpropagation, we need to compute gradients for the weights, and these gradients involve matrix multiplication. Matrix multiplication follows specific rules about the dimensions of the matrices being multiplied. 

#### **For the weight gradient in the output layer:**
The gradient for the weights in the output layer is computed as:
$$
dW2 = \frac{1}{m} \cdot (dz2 \cdot A1^T)
$$
Where:
- `dz2` is the error (a vector of size `1 x m` where `m` is the number of training examples).
- `A1` is the activation from the hidden layer (a matrix of size `nodes x m`, where `nodes` is the number of neurons in the hidden layer).

Here’s the reasoning for the **transpose of A1**:

- `A1` is the activations from the hidden layer (size `nodes x m`).
- The gradient `dW2` must have the same shape as `W2` (`1 x nodes`), because each entry of `dW2` updates the corresponding weight.
- The product `dz2 · A1^T` pairs each error in `dz2` with the matching activations in `A1`. **Transposing `A1`** gives a matrix of size `m x nodes`, so multiplying `dz2` (size `1 x m`) by `A1^T` (size `m x nodes`) is valid and yields the gradient for `W2` of size `1 x nodes`.

Without the transpose, the product `dz2 · A1` would pair a `1 x m` matrix with a `nodes x m` matrix. Their inner dimensions (`m` and `nodes`) generally differ, so the multiplication is not defined.

#### **For the weight gradient in the hidden layer:**
The gradient for the weights in the hidden layer is computed as:
$$
dW1 = \frac{1}{m} \cdot (dz1 \cdot X^T)
$$
Where:
- `dz1` is the error propagated back to the hidden layer (a matrix of size `nodes x m`).
- `X` is the input data (a matrix of size `nx x m`, where `nx` is the number of input features).

The reasoning behind **transposing `X`** is similar to the previous case:

- `dz1` is of size `nodes x m` because there are `nodes` neurons in the hidden layer, and `m` is the number of training examples.
- The weights `W1` are of size `nodes x nx` (where `nx` is the number of input features).
- To compute the gradient of `W1`, you need to multiply `dz1` (which has size `nodes x m`) with `X^T` (the transpose of the input data, which has size `m x nx`).
- This multiplication produces a matrix of size `nodes x nx`, which is the correct shape for the gradient of `W1`.

Again, without transposing `X`, the product would pair a `nodes x m` matrix with an `nx x m` matrix, whose inner dimensions (`m` and `nx`) generally differ, so the multiplication is not defined.
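
The shape reasoning for both layers can be checked directly in NumPy. The sizes below are hypothetical and chosen so that no two dimensions coincide. The arrays are filled with ones because only their shapes matter here.

```python
import numpy as np

# Hypothetical sizes chosen so that no two dimensions coincide
nx, nodes, m = 4, 3, 5
dz2 = np.ones((1, m))
A1 = np.ones((nodes, m))
dz1 = np.ones((nodes, m))
X = np.ones((nx, m))

print(np.matmul(dz2, A1.T).shape)   # (1, 3) -> matches W2 (1, nodes)
print(np.matmul(dz1, X.T).shape)    # (3, 4) -> matches W1 (nodes, nx)

try:
    np.matmul(dz2, A1)              # (1, 5) @ (3, 5): inner dimensions differ
    transpose_needed = False
except ValueError:
    transpose_needed = True
print(transpose_needed)             # True
```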
