# Single Neuron Basics

This lecture provides an introduction to deep neural networks (DNNs), starting with the basic unit, a single neuron, and progressively covering the components, the forward and backward passes, and applications in healthcare.

### **Key Concepts:**

1. **Neuron Structure:**
   - A **neuron** is a computational unit that receives inputs, $x_1, x_2, \dots, x_n$, each associated with weights, $w_1, w_2, \dots, w_n$, and a bias term $b$.
   - The neuron computes an intermediate output, $z$, which is the **linear combination** of inputs and weights plus the bias: 
   $$z = \sum_{i=1}^{n} w_i \cdot x_i + b$$
   - This output $z$ is then passed through a **non-linear activation function** $g$, producing the final output $y$:
   $$y = g(z)$$
   - The goal is to learn the weights $w_i$ and bias term $b$ so that the output $y$ approximates a target value, which could be a binary label (e.g., disease present or absent) or a numerical regression value.

2. **Activation Functions:**
   Activation functions introduce non-linearity, allowing neural networks to model complex relationships. The most popular activation functions are:
   
   - **Sigmoid Function:**
     $$\sigma(x) = \frac{1}{1 + e^{-x}}$$
     - **Output Range:** (0, 1); the endpoints are approached but never reached.
     - **Use Case:** Often used for binary classification (e.g., predicting the probability of heart disease).
     - **Problem:** The **vanishing gradient problem** occurs when the function saturates (output close to 0 or 1), leading to gradients near zero, which hinders the training process.

   - **Tanh Function:**
     $$\text{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$
     - **Output Range:** (-1, 1)
     - **Use Case:** Similar to sigmoid, but with zero-centered outputs in (-1, 1).
     - **Problem:** Still suffers from the vanishing gradient problem, especially when inputs are far from zero.

   - **ReLU (Rectified Linear Unit):**
     $$\text{ReLU}(x) = \max(0, x)$$
     - **Output Range:** [0, ∞)
     - **Use Case:** Popular in deep networks due to its simplicity and efficiency.
     - **Advantage:** Does not suffer from the vanishing gradient problem for positive inputs, because the gradient there is exactly 1. For negative inputs the gradient is 0.

3. **Choosing Activation Functions:**
   - The choice of activation function depends on the specific task and neural network architecture.
   - **Sigmoid** is typically used in the output layer for binary classification tasks.
   - **Tanh** can be useful in hidden layers where the output needs to range between negative and positive values.
   - **ReLU** is commonly used in hidden layers due to its efficiency and ability to mitigate the vanishing gradient problem.
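
The saturation behaviour described above can be checked numerically. The following is a minimal NumPy sketch, not part of the lecture itself: the function names are illustrative, and treating the ReLU derivative at exactly 0 as 0 is an assumed convention.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    """ReLU: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; the value at x == 0 is taken as 0 (assumed convention)."""
    return (np.asarray(x) > 0).astype(float)

xs = np.array([-10.0, 0.0, 10.0])

# Gradients at a strongly negative, a zero, and a strongly positive input.
for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("relu", relu_grad)]:
    print(name, grad(xs))
```

For sigmoid and tanh the gradients at $x = \pm 10$ are close to zero, which is the saturation effect behind the vanishing gradient problem, while the ReLU gradient stays at 1 for the positive input.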


# Training a Single Neuron: SGD

This section looks more closely at how a single neuron operates and at how gradient descent and its variant, stochastic gradient descent (SGD), are used to train neural networks.

### **Single Neuron Operation:**

1. **Computation Breakdown:**
   - **Linear Combination:**
     $$z = \sum_{i=1}^{n} w_i \cdot x_i + b$$
     This computes an intermediate value $z$ by summing the weighted inputs and adding a bias term.
   - **Nonlinear Transformation:**
     $$y = g(z)$$
     The intermediate value $z$ is passed through an activation function $g$ to produce the final output $y$.

2. **Training the Neuron:**
   - **Loss Function:**
     The goal of training is to adjust the weights $w_i$ and bias $b$ so that the output $y$ is close to a target label $t$. A common loss function used is the squared loss:
     $$\text{Loss} = \frac{1}{2}(y - t)^2$$
     This measures the difference between the predicted output $y$ and the target $t$.
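
As a small worked example of the computation above, the sketch below evaluates a single neuron and its squared loss. It assumes a sigmoid activation for $g$; the input values, weights, and function names are made up for illustration.

```python
import numpy as np

def neuron_forward(x, w, b):
    """Return (z, y) for one neuron, assuming a sigmoid activation."""
    z = np.dot(w, x) + b              # linear combination of inputs and weights
    y = 1.0 / (1.0 + np.exp(-z))      # nonlinear transformation g(z)
    return z, y

def squared_loss(y, t):
    """Squared loss 0.5 * (y - t)^2 between output y and target t."""
    return 0.5 * (y - t) ** 2

# Illustrative values (not from the lecture).
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.0
t = 1.0

z, y = neuron_forward(x, w, b)   # z = 0.5*1 - 0.25*2 + 0 = 0, so y = 0.5
loss = squared_loss(y, t)        # 0.5 * (0.5 - 1)^2 = 0.125
print(z, y, loss)
```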

### **Optimization with Gradient Descent:**

1. **Gradient Descent Overview:**
   - **Objective:**
     To minimize the loss function by adjusting the model parameters (weights and biases) based on the gradient of the loss function.
   - **Steps:**
     1. **Define the Objective:**
        For a given model, specify how likely the observed data is under different parameter settings $\theta$. The log-likelihood is usually preferred for numerical stability and simplicity, and the loss to minimize is its negative:
        $$\text{Loss}(\theta) = -\log(\text{Likelihood}(\theta))$$
        Other losses, such as the squared loss above, are minimized in the same way.
     2. **Compute the Gradient:**
        Calculate the gradient (partial derivatives) of the loss with respect to each parameter. This measures how changes in the parameters affect the loss:
        $$\text{Gradient} = \frac{\partial\, \text{Loss}}{\partial \theta}$$
     3. **Update Parameters:**
        Adjust the parameters in the direction opposite to the gradient to reduce the loss:
        $$\theta_{\text{new}} = \theta_{\text{old}} - \eta \cdot \text{Gradient}$$
        Here, $\eta$ is the learning rate, a hyperparameter that controls the step size of each update.

2. **Stochastic Gradient Descent (SGD):**
   - **Challenge with Large Datasets:**
     Computing gradients over the entire dataset can be computationally expensive. SGD addresses this by updating parameters using a random subset of data points:
     - **Mini-Batch Gradient Descent:** Uses a subset (mini-batch) of the dataset to compute gradients.
     - **Online Gradient Descent:** Uses one data point at a time (stochastic updates).
   - **SGD Algorithm:**
     1. **Initialize Parameters:**
        Start with small random values for weights $w$ and bias $b$.
     2. **Iterate:**
        For each iteration, pick a data point or mini-batch, compute the gradient of the loss with respect to parameters, and update parameters:
        $$w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial \text{Loss}}{\partial w}$$
        $$b_{\text{new}} = b_{\text{old}} - \eta \cdot \frac{\partial \text{Loss}}{\partial b}$$
     3. **Repeat:**
        Continue until the parameters converge or a stopping criterion is met.
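
The SGD algorithm above can be sketched in NumPy for a single sigmoid neuron with squared loss. This is a sketch under several assumptions: the toy dataset is synthetic, the learning rate, batch size, and epoch count are arbitrary, and the gradient expression for the sigmoid and squared loss comes from applying the chain rule to the two formulas above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_loss(X, t, w, b):
    """Average squared loss of the neuron over a dataset."""
    y = sigmoid(X @ w + b)
    return float(np.mean(0.5 * (y - t) ** 2))

# Synthetic, linearly separable toy data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)

# 1. Initialize parameters with small random values.
w = rng.normal(scale=0.01, size=2)
b = 0.0
eta = 0.5          # learning rate (arbitrary choice)
batch_size = 20    # mini-batch size (arbitrary choice)

initial_loss = mean_loss(X, t, w, b)

# 2. Iterate: pick a mini-batch, compute gradients, update parameters.
for epoch in range(50):
    order = rng.permutation(len(X))          # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, tb = X[idx], t[idx]
        y = sigmoid(xb @ w + b)
        delta = (y - tb) * y * (1.0 - y)     # dLoss/dz for sigmoid + squared loss
        grad_w = xb.T @ delta / len(idx)     # gradient averaged over the batch
        grad_b = delta.mean()
        w -= eta * grad_w
        b -= eta * grad_b

# 3. A fixed number of epochs serves as the stopping criterion here.
final_loss = mean_loss(X, t, w, b)
print(initial_loss, final_loss)
```

Setting `batch_size = 1` gives the online variant, and setting it to `len(X)` recovers full-batch gradient descent.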

### **Summary:**

- **Single Neuron:**
  - Consists of a linear combination of inputs and a non-linear activation function.
  - Training involves minimizing a loss function to make predictions as close to the target values as possible.

- **Gradient Descent:**
  - An optimization technique used to find the parameter values that minimize the loss function.
  - Involves computing gradients and updating parameters iteratively.
  
- **Stochastic Gradient Descent (SGD):**
  - A variant of gradient descent that handles large datasets efficiently by updating parameters based on a subset of data points.

These concepts lay the groundwork for understanding how more complex neural networks are trained and optimized.

# Forward and Backward Computation

Training a single neuron alternates between a forward pass and a backward pass, followed by a parameter update. The breakdown below covers each step and how they work together to update the neuron’s parameters:

### **1. Forward Pass:**

**Objective:** Compute the output $y$ of the neuron and the loss.

- **Linear Combination:**
  $$
  z = \sum_{i=1}^{n} w_i \cdot x_i + b
  $$
  Here, $z$ is the weighted sum of inputs $x_i$ plus the bias term $b$.

- **Nonlinear Activation:**
  Apply an activation function $g$ to $z$:
  $$
  y = g(z)
  $$
  For example, if the activation function is a sigmoid:
  $$
  y = \frac{1}{1 + e^{-z}}
  $$

- **Loss Function:**
  Compute the loss $L$, which measures how far the output $y$ is from the target $t$. For a squared loss function:
  $$
  L = \frac{1}{2} (y - t)^2
  $$

### **2. Backward Pass:**

**Objective:** Compute the gradients of the loss with respect to each parameter and use these gradients to update the parameters.

- **Compute Gradients:**
  Using the chain rule of calculus, the gradient of the loss function $L$ with respect to each parameter $w_i$ and the bias term $b$ is calculated.

  **For Weight $w_i$:**

  1. **Derivative of Loss with Respect to Output $y$:**
     $$
     \frac{\partial L}{\partial y} = y - t
     $$
     Here, $L$ is the loss function, and $y$ is the neuron’s output.

  2. **Derivative of Output $y$ with Respect to $z$:**
     $$
     \frac{\partial y}{\partial z} = y \cdot (1 - y)
     $$
     This holds when $g$ is the sigmoid, since $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$; other activation functions have different derivatives.

  3. **Derivative of $z$ with Respect to $w_i$:**
     $$
     \frac{\partial z}{\partial w_i} = x_i
     $$
     Since $z$ is the weighted sum, the derivative with respect to $w_i$ is the input $x_i$.

  Combine these derivatives using the chain rule:
  $$
  \frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w_i}
  $$
  Substituting in:
  $$
  \frac{\partial L}{\partial w_i} = (y - t) \cdot y \cdot (1 - y) \cdot x_i
  $$

  **For Bias $b$:**

  1. **Derivative of Loss with Respect to Output $y$:**
     $$
     \frac{\partial L}{\partial y} = y - t
     $$

  2. **Derivative of Output $y$ with Respect to $z$:**
     $$
     \frac{\partial y}{\partial z} = y \cdot (1 - y)
     $$

  3. **Derivative of $z$ with Respect to Bias $b$:**
     $$
     \frac{\partial z}{\partial b} = 1
     $$
     The derivative of $z$ with respect to $b$ is 1 because $b$ is added directly to $z$.

  Combine these derivatives:
  $$
  \frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial b}
  $$
  Substituting in:
  $$
  \frac{\partial L}{\partial b} = (y - t) \cdot y \cdot (1 - y)
  $$
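
One way to gain confidence in the derived gradients is to compare them with finite-difference estimates. The sketch below assumes the sigmoid activation and squared loss used above; the input values and helper names are illustrative.

```python
import numpy as np

def forward(x, w, b, t):
    """Forward pass: returns the output y and the squared loss L."""
    z = np.dot(w, x) + b
    y = 1.0 / (1.0 + np.exp(-z))
    loss = 0.5 * (y - t) ** 2
    return y, loss

def backward(x, y, t):
    """Analytic gradients from the chain rule derived above."""
    delta = (y - t) * y * (1.0 - y)   # dL/dy * dy/dz
    grad_w = delta * x                # dz/dw_i = x_i
    grad_b = delta                    # dz/db = 1
    return grad_w, grad_b

# Illustrative values (not from the lecture).
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
b = 0.05
t = 1.0

y, loss = forward(x, w, b, t)
grad_w, grad_b = backward(x, y, t)

# Central finite differences as an independent check.
eps = 1e-6
num_grad_w = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    num_grad_w[i] = (forward(x, w_plus, b, t)[1] - forward(x, w_minus, b, t)[1]) / (2 * eps)
num_grad_b = (forward(x, w, b + eps, t)[1] - forward(x, w, b - eps, t)[1]) / (2 * eps)

print(grad_w, num_grad_w)
print(grad_b, num_grad_b)
```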

### **3. Update Parameters:**

Use the gradients computed from the backward pass to update the weights $w_i$ and bias $b$ using stochastic gradient descent (SGD):

- **Weight Update:**
  $$
  w_i \leftarrow w_i - \eta \cdot \frac{\partial L}{\partial w_i}
  $$
  where $\eta$ is the learning rate.

- **Bias Update:**
  $$
  b \leftarrow b - \eta \cdot \frac{\partial L}{\partial b}
  $$
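
A single update step can be sketched as follows, again assuming a sigmoid neuron with squared loss. The learning rate and data values are arbitrary choices for illustration.

```python
import numpy as np

def loss_and_grads(x, w, b, t):
    """Forward pass plus chain-rule gradients for a sigmoid neuron with squared loss."""
    z = np.dot(w, x) + b
    y = 1.0 / (1.0 + np.exp(-z))
    loss = 0.5 * (y - t) ** 2
    delta = (y - t) * y * (1.0 - y)
    return loss, delta * x, delta

# Illustrative values (not from the lecture).
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
b = 0.05
t = 1.0
eta = 0.5   # learning rate (arbitrary choice)

loss_before, grad_w, grad_b = loss_and_grads(x, w, b, t)

# Step against the gradient for both the weights and the bias.
w_new = w - eta * grad_w
b_new = b - eta * grad_b

loss_after, _, _ = loss_and_grads(x, w_new, b_new, t)
print(loss_before, loss_after)
```

For this example the loss drops after one step. With a learning rate that is too large, a step can overshoot and increase the loss instead.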

### **Automatic Differentiation:**

Modern deep learning frameworks handle the gradient computations automatically using automatic differentiation, which simplifies the implementation of neural network training. This allows practitioners to focus more on the architecture and less on the detailed computation of gradients.

### **Summary:**

- **Forward Pass:** Compute $z$, apply the activation function to get $y$, and calculate the loss $L$.
- **Backward Pass:** Compute gradients of $L$ with respect to parameters $w_i$ and $b$ using the chain rule.
- **Parameter Update:** Adjust parameters using gradients and learning rate.

This process of forward and backward passes allows the network to learn and adjust its weights and biases to minimize the loss function over time.

# Multilayer Neural Network

A neural network is built by connecting neurons across multiple layers, where each neuron's output becomes the input to the next layer. This structure allows the network to learn complex functions by stacking layers of neurons. Here’s a detailed breakdown of how to compute the forward pass in a multilayer neural network and how to represent these computations efficiently.

### **Network Structure**

Consider a simple multilayer neural network with the following layers:

1. **Input Layer (Layer 1):**
   - **Units:** $x_1, x_2, x_3$ (input features) and one bias unit.

2. **Hidden Layer (Layer 2):**
   - **Units:** $h_1, h_2, h_3$ (hidden neurons) and one bias unit.

3. **Output Layer (Layer 3):**
   - **Unit:** $y$ (output neuron).

### **Forward Pass Computation**

**Step-by-Step Forward Pass:**

1. **Compute Linear Combinations in Hidden Layer:**

   Each hidden unit computes a linear combination of inputs plus a bias term:
   $$
   z_j^{(2)} = \sum_{i=1}^{n} w_{ji}^{(1)} \cdot x_i + b_j^{(1)}
   $$
   - $z_j^{(2)}$ is the linear combination (pre-activation) of hidden unit $h_j$ in layer 2, computed from the $n$ inputs of layer 1.
   - $w_{ji}^{(1)}$ is the weight from input unit $x_i$ to hidden unit $h_j$.
   - $b_j^{(1)}$ is the bias term for hidden unit $h_j$.

2. **Apply Activation Function to Hidden Units:**

   Apply a nonlinear activation function $g^{(2)}$ (e.g., sigmoid or ReLU) to $z_j^{(2)}$:
   $$
   a_j^{(2)} = g^{(2)}(z_j^{(2)})
   $$
   - $a_j^{(2)}$ is the output of hidden unit $h_j$.

3. **Compute Linear Combination in Output Layer:**

   Use the outputs from the hidden layer as inputs to the output layer:
   $$
   z^{(3)} = \sum_{j=1}^{m} w_{j}^{(2)} \cdot a_j^{(2)} + b^{(2)}
   $$
   - $z^{(3)}$ is the linear combination for the output unit.
   - $w_{j}^{(2)}$ is the weight from hidden unit $h_j$ to the output unit $y$.
   - $b^{(2)}$ is the bias term for the output unit.

4. **Apply Activation Function to Output Unit:**

   Apply an activation function $g^{(3)}$ to $z^{(3)}$:
   $$
   y = g^{(3)}(z^{(3)})
   $$
   - $y$ is the final output of the network.

### **Matrix Notation for Efficiency**

For efficient computation, especially with large networks, we use matrix operations:

1. **Matrix Representation for Hidden Layer:**

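   - **Weights Matrix:** $W^{(1)}$ (matrix of weights connecting the input layer to the hidden layer; row $j$ holds the weights $w_{ji}^{(1)}$ into hidden unit $h_j$, so it is $3 \times 3$ in this example).
   - **Bias Vector:** $b^{(1)}$ (one entry per hidden unit).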

   Compute:
   $$
   z^{(2)} = W^{(1)} \cdot x + b^{(1)}
   $$
   Apply activation function element-wise:
   $$
   a^{(2)} = g^{(2)}(z^{(2)})
   $$

2. **Matrix Representation for Output Layer:**

   - **Weights Matrix:** $W^{(2)}$ (weights connecting the hidden layer to the output layer; with a single output unit it is a $1 \times 3$ row vector in this example).
   - **Bias Term:** $b^{(2)}$.

   Compute:
   $$
   z^{(3)} = W^{(2)} \cdot a^{(2)} + b^{(2)}
   $$
   Apply activation function:
   $$
   y = g^{(3)}(z^{(3)})
   $$
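
To illustrate that the matrix form computes the same values as the unit-by-unit sums, the sketch below runs both versions on the 3-3-1 network. Sigmoid activations in both layers and randomly drawn weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.2, -0.4, 0.9])               # three input features (made up)
W1 = rng.normal(scale=0.5, size=(3, 3))      # W1[j, i]: weight from x_i to h_j
b1 = rng.normal(scale=0.1, size=3)
W2 = rng.normal(scale=0.5, size=(1, 3))      # single output unit: 1 x 3
b2 = rng.normal(scale=0.1, size=1)

# Unit-by-unit version, following the summation formulas.
a2_loop = np.zeros(3)
for j in range(3):
    z_j = b1[j]
    for i in range(3):
        z_j += W1[j, i] * x[i]
    a2_loop[j] = sigmoid(z_j)
z3_loop = b2[0]
for j in range(3):
    z3_loop += W2[0, j] * a2_loop[j]
y_loop = sigmoid(z3_loop)

# Matrix version: one matrix-vector product per layer.
a2 = sigmoid(W1 @ x + b1)
y = sigmoid(W2 @ a2 + b2)                    # array of shape (1,)

print(y_loop, y[0])
```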

### **General Form for Multiple Layers**

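For a neural network with $L$ layers (here $L$ counts layers and is unrelated to the loss $L$ used earlier), take $a^{(1)} = x$. The forward pass from layer $l$ to layer $l+1$, for $l = 1, \dots, L-1$, is given by: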

1. **Linear Combination:**
   $$
   z^{(l+1)} = W^{(l)} \cdot a^{(l)} + b^{(l)}
   $$

2. **Activation Function:**
   $$
   a^{(l+1)} = g^{(l+1)}(z^{(l+1)})
   $$
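
The general form maps directly onto a loop over layers. The sketch below is one possible way to write it. The layer sizes, the choice of ReLU for hidden layers and sigmoid for the output, and the function name `forward` are all assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activations):
    """Apply z = W a + b followed by a = g(z), one layer at a time."""
    a = x                                    # a^(1) = x
    for W, b, g in zip(weights, biases, activations):
        z = W @ a + b                        # linear combination
        a = g(z)                             # element-wise activation
    return a

# Hypothetical architecture: 3 inputs, hidden layers of 4 and 2 units, 1 output.
sizes = [3, 4, 2, 1]
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
activations = [relu, relu, sigmoid]

x = np.array([0.2, -0.4, 0.9])
y = forward(x, weights, biases, activations)
print(y)
```

Each weight matrix has shape (units in the next layer, units in the current layer), so the same loop works for any number of layers.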

### **Summary**

- **Forward Pass:** Involves computing linear combinations and applying activation functions layer by layer.
- **Matrix Notation:** Used for efficient computation, particularly in large networks and when using hardware accelerators like GPUs.
- **General Form:** Provides a compact and scalable way to represent neural network computations.

This process of computing the forward pass is crucial for both scoring new data points and training the network using backpropagation. The matrix operations ensure that the computations are performed efficiently, leveraging modern computational resources.