# 02. Neural Networks

In this section, we will learn about **neural networks**, focusing on their architecture and training process.

In the next notebook, [03. Neural Networks in PyTorch](03_neural_networks_in_pytorch.ipynb), we will implement a neural network using PyTorch.

Let's say you want to predict if you will pass an exam `(1)` or not `(0)`. You have the following data:

- Hours studied (x1)
- How smart you are (x2)
- Previous knowledge (x3)
- Name (x4)

This means that our neural network will take four inputs: `x1`, `x2`, `x3`, and `x4`. The output will be a single value, which is either `0` (fail) or `1` (pass).

As you can probably guess, not all of these features are useful for predicting the outcome. For example, your name is not a good predictor of whether you will pass the exam or not.

Let's walk through the steps of how the neural network will work.

> Note: I will include videos and articles that explain the concepts in more detail in the last section of this notebook.


## Neural Network Structure

A neural network is made up of layers of neurons. Each neuron takes inputs, applies a transformation, and produces an output. The output of one layer becomes the input to the next layer.

To take our previous example, we can represent the neural network structure as follows:

<img src="../09_images/01-neural_network_structure.png" alt="Neural Network Structure" width="800">

The neural network has:

- **Input Layer**: The first layer that takes the inputs `x1`, `x2`, `x3`, and `x4`.
- **Hidden Layer**: The layer that processes the inputs and applies transformations. In this case, we have one neuron in the hidden layer, but we could have more neurons and more layers for a more complex model.
- **Output Layer**: The final layer that produces the output. In this case, we have one output neuron that gives the final prediction.

### Concrete Example

Let's consider a simple example to make this more tangible. Imagine we have a neural network with:

- 2 input features: Hours studied (x₁) and Previous knowledge (x₂)
- 1 hidden layer with 2 neurons
- 1 output neuron predicting exam pass/fail probability

For this example:

- Input values: x₁ = 3 (hours studied), x₂ = 7 (previous knowledge on a scale of 1-10)
- Weights from inputs to first hidden neuron: w₁₁ = 0.2, w₁₂ = 0.3
- Weights from inputs to second hidden neuron: w₂₁ = 0.1, w₂₂ = 0.4
- Bias for first hidden neuron: b₁ = 0.5
- Bias for second hidden neuron: b₂ = 0.1
- Weights from hidden neurons to output: w₃₁ = 0.6, w₃₂ = 0.8
- Bias for output neuron: b₃ = 0.2
- We'll use the sigmoid activation function: σ(x) = 1/(1 + e^(-x))

Step-by-step calculation:

1. **First hidden neuron:**

   - Weighted sum: (0.2 × 3) + (0.3 × 7) + 0.5 = 0.6 + 2.1 + 0.5 = 3.2
   - Apply activation: σ(3.2) ≈ 0.961

2. **Second hidden neuron:**

   - Weighted sum: (0.1 × 3) + (0.4 × 7) + 0.1 = 0.3 + 2.8 + 0.1 = 3.2
   - Apply activation: σ(3.2) ≈ 0.961

3. **Output neuron:**
   - Weighted sum: (0.6 × 0.961) + (0.8 × 0.961) + 0.2 = 0.577 + 0.769 + 0.2 = 1.546
   - Apply activation: σ(1.546) ≈ 0.824

So the neural network predicts an 82.4% probability of passing the exam with 3 hours of studying and a previous knowledge level of 7/10.

This example illustrates how information flows forward through the network, from inputs through hidden layers to the output.
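
As a quick check, the calculation above can be reproduced in a few lines of plain Python. This is only a sketch: the variable names are my own, and only the numbers come from the example.

```python
import math


def sigmoid(x):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Values taken from the example above
x1, x2 = 3.0, 7.0             # hours studied, previous knowledge
w11, w12, b1 = 0.2, 0.3, 0.5  # first hidden neuron
w21, w22, b2 = 0.1, 0.4, 0.1  # second hidden neuron
w31, w32, b3 = 0.6, 0.8, 0.2  # output neuron

# Hidden layer: weighted sum plus bias, then activation
h1 = sigmoid(w11 * x1 + w12 * x2 + b1)
h2 = sigmoid(w21 * x1 + w22 * x2 + b2)

# Output layer: the hidden activations are its inputs
y_hat = sigmoid(w31 * h1 + w32 * h2 + b3)

print(f"h1 = {h1:.3f}, h2 = {h2:.3f}, y_hat = {y_hat:.3f}")
# prints: h1 = 0.961, h2 = 0.961, y_hat = 0.824
```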

Now let's investigate how the neural network works step by step.


### Step 1: Initializing Weights and Biases

In a neural network, we have **[weights](https://www.geeksforgeeks.org/deep-learning/the-role-of-weights-and-bias-in-neural-networks/)** and **[biases](https://www.turing.com/kb/necessity-of-bias-in-neural-networks)**. Weights are the parameters that the model learns during training, and biases are added to the weighted sum of inputs to help the model fit the data better.

The weights and biases are initialized randomly at the beginning of the training process. For our example, we will have four weights (one for each input) and one bias.

The weights measure the importance of each input feature, while the bias allows the model to shift the output up or down. As mentioned earlier, not all features are useful for predicting the outcome, so the weights will adjust accordingly during training.

For example, if the weight for `x4` (name) is close to zero, it means that the name is not a useful feature for predicting the outcome. The model will learn to ignore it.
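
Random initialization for our four-input example could look like the sketch below. The range `[-0.5, 0.5]` and the fixed seed are illustrative assumptions on my part; real frameworks use more carefully designed initialization schemes.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n_inputs = 4  # x1, x2, x3, x4

# One small random weight per input; the range is an illustrative choice
weights = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
bias = random.uniform(-0.5, 0.5)

print("weights:", [round(w, 3) for w in weights])
print("bias:", round(bias, 3))
```

At this point the weights carry no information about the data. Training is what turns them into meaningful measures of feature importance.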

### Step 2: Forward Pass

In the forward pass, the inputs are multiplied by their corresponding weights, and the bias is added to the weighted sum. This is done for each neuron in the hidden layer.

The weighted sum of each hidden neuron is then passed through an activation function, which introduces non-linearity into the model. This allows the neural network to learn complex patterns in the data (see the Activation Functions section below for more detail).
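
For a single neuron with four inputs, the forward pass can be sketched as follows. All numbers here are made up for illustration, and I am assuming every feature (including the name) has already been converted to a number. Sigmoid is used as the activation to match the concrete example earlier.

```python
import math

# Hypothetical, already-numeric feature values for one student
x = [3.0, 0.6, 7.0, 0.0]  # x1..x4
w = [0.2, 0.1, 0.3, 0.0]  # one weight per input (made up)
b = 0.5                   # bias (made up)

# Weighted sum of the inputs plus the bias
z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# Sigmoid activation turns z into a value between 0 and 1
a = 1.0 / (1.0 + math.exp(-z))

print(f"z = {z:.2f}, a = {a:.3f}")
```

Note that the weight for `x4` is zero here, so that feature has no effect on the output.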

<img src="../09_images/01-weight_initialization.png" alt="Initializing Weights and Biases" width="800">


#### Activation Functions

Activation functions are a crucial component of neural networks. Each neuron first computes a weighted sum of its inputs plus a bias; the activation function is then applied to that value and determines how strongly the neuron is activated. Without activation functions, stacking layers would still produce only a linear function of the inputs, so the network could learn only linear relationships, which would severely limit its capabilities.

Some common activation functions include:

- **Sigmoid**: Maps any input value to a value between 0 and 1. It's useful for binary classification problems but suffers from the vanishing gradient problem.
  - Formula: σ(x) = 1 / (1 + e^(-x))
- **ReLU (Rectified Linear Unit)**: Outputs the input directly if it's positive, otherwise, it outputs zero. It's the most commonly used activation function because it's computationally efficient and helps mitigate the vanishing gradient problem.
  - Formula: f(x) = max(0, x)
- **Tanh (Hyperbolic Tangent)**: Similar to sigmoid but maps values between -1 and 1. It has stronger gradients than sigmoid.
  - Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- **Leaky ReLU**: A variant of ReLU that allows a small, non-zero gradient when the input is negative.
  - Formula: f(x) = x if x > 0, αx otherwise (where α is a small constant)
- **Softmax**: Often used in the output layer for multi-class classification problems, it converts a vector of values to a probability distribution.
  - Formula: softmax(x_i) = e^(x_i) / Σ e^(x_j) for all j

The choice of activation function depends on the specific task and layer of the neural network. For example, ReLU is commonly used in hidden layers, while sigmoid or softmax might be used in the output layer depending on the problem type.
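
The formulas above translate almost directly into Python. The sketch below uses only the standard library. The default `alpha=0.01` for Leaky ReLU is an assumed value for the "small constant", and subtracting the maximum in `softmax` is a common numerical-stability trick that does not change the result.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def tanh(x):
    return math.tanh(x)


def leaky_relu(x, alpha=0.01):
    # alpha is the small slope used for negative inputs (assumed value)
    return x if x > 0 else alpha * x


def softmax(values):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), relu(x), round(tanh(x), 3), leaky_relu(x))

print([round(p, 3) for p in softmax([1.0, 2.0, 3.0])])
```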



### Step 3: Calculating Loss

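After the forward pass, we need to calculate the loss, which measures how well the model's predictions match the actual labels. The loss function quantifies the difference between the predicted output and the true output as a single number: the lower the loss, the better the predictions. For a binary pass/fail problem like ours, a common choice is binary cross-entropy.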
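
To make this concrete, here is a sketch of one possible loss function. I am assuming binary cross-entropy, a common choice for 0/1 labels. The function name and the small `eps` clamp are my own additions.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one example: y_true is 0 or 1, y_pred is a probability."""
    # Clamp the prediction so log() never receives exactly 0 or 1
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# Prediction from the concrete example earlier, for a student who passed
loss_if_passed = binary_cross_entropy(1, 0.824)
# Same prediction, but the student actually failed: the loss is larger
loss_if_failed = binary_cross_entropy(0, 0.824)

print(round(loss_if_passed, 3), round(loss_if_failed, 3))
```

A confident prediction that turns out to be wrong is penalized much more heavily than a confident prediction that is right.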

### Step 4: Backward Pass

In the backward pass, we calculate the gradients of the loss with respect to the weights and biases. This is done using **backpropagation**, which is a method for calculating the gradients efficiently.

The gradients tell us how much the loss will change if we adjust the weights and biases. We use these gradients to update the weights and biases in the direction that reduces the loss.
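
Full backpropagation applies the chain rule layer by layer. The sketch below shows the idea for the simplest possible case: a single sigmoid neuron with binary cross-entropy loss (both are assumptions for this illustration, as are all the numbers). For that combination the chain rule simplifies to `dL/dz = p - y`. The finite-difference check at the end is only there to confirm the analytic gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Binary cross-entropy of one sigmoid neuron on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def gradients(w, b, x, y):
    """Chain rule for this setup: dL/dz = p - y, so dL/dw_i = (p - y) * x_i."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    dz = p - y
    return [dz * xi for xi in x], dz


# Made-up example: two inputs, true label 1 (pass)
x, y = [0.3, 0.7], 1
w, b = [0.2, -0.1], 0.0

grad_w, grad_b = gradients(w, b, x, y)

# Sanity check: nudge w[0] slightly and compare with the analytic gradient
h = 1e-6
numeric = (loss([w[0] + h, w[1]], b, x, y) - loss([w[0] - h, w[1]], b, x, y)) / (2 * h)

print("analytic dL/dw1:", round(grad_w[0], 6))
print("numeric  dL/dw1:", round(numeric, 6))
```

The negative gradient here means that increasing the weight would reduce the loss, which makes sense: the true label is 1 and the prediction is still below 1.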

### Step 5: Updating Weights and Biases

After calculating the gradients, we update the weights and biases using an optimization algorithm such as gradient descent, which moves each parameter a small step in the direction that reduces the loss. This process is repeated for many iterations, usually over multiple epochs (an epoch is one full pass over the training data), until the model converges and the loss reaches an acceptable level.
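
The simplest such update rule is plain gradient descent: subtract the gradient, scaled by a learning rate, from each parameter. The sketch below uses made-up parameters and gradients, and `learning_rate = 0.1` is just an illustrative value.

```python
learning_rate = 0.1  # step size; an illustrative value

# Made-up current parameters and gradients of the loss
weights = [0.2, -0.1]
bias = 0.0
grad_w = [-0.15, -0.35]
grad_b = -0.5

# Move each parameter a small step against its gradient
weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
bias = bias - learning_rate * grad_b

print(weights, bias)
```

Because the gradients are negative, subtracting them increases the parameters. Other optimizers build on this same idea with additional refinements.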

To summarize, the steps of training a neural network are:

1. Initialize weights and biases randomly.
2. Perform a forward pass to calculate the output.
3. Calculate the loss.
4. Perform a backward pass to calculate gradients.
5. Update weights and biases using the gradients.
6. Repeat steps 2-5 for multiple epochs until convergence.
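
Putting the six steps together, a minimal training loop might look like the sketch below. To keep it short, it trains a single sigmoid neuron (no hidden layer) on a tiny made-up dataset, assumes binary cross-entropy as the loss, and updates after every example. The dataset, learning rate, and epoch count are all illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, b, data):
    """Average binary cross-entropy over the dataset."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)


# Made-up data: (hours studied / 10, previous knowledge / 10) -> passed?
data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.3, 0.4], 0),
        ([0.6, 0.7], 1), ([0.8, 0.6], 1), ([0.9, 0.9], 1)]

# Step 1: initialize weights and bias randomly
random.seed(0)
w = [random.uniform(-0.5, 0.5) for _ in range(2)]
b = random.uniform(-0.5, 0.5)
learning_rate = 0.5

initial_loss = mean_loss(w, b, data)

for epoch in range(2000):  # Step 6: repeat for many epochs
    for x, y in data:
        # Step 2: forward pass
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        # Steps 3 and 4: for binary cross-entropy with a sigmoid output,
        # the gradient of the loss with respect to z is simply p - y
        dz = p - y
        # Step 5: update weights and bias against the gradient
        w = [wi - learning_rate * dz * xi for wi, xi in zip(w, x)]
        b -= learning_rate * dz

final_loss = mean_loss(w, b, data)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

In practice you would let a framework such as PyTorch compute the gradients for you, which is what the next notebook covers.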

<img src="../09_images/01-neural_network_complete.png" alt="Neural Network Steps" width="1000">


## Types of Neural Networks

Neural networks come in various architectures, each designed for specific types of problems. Here are some common types:

### Feedforward Neural Networks (FNN)

- The most basic type where information flows in one direction from input to output
- No cycles or loops in the network
- Used for classification and regression problems
- Example: The network we've been discussing so far is a simple feedforward network

### Convolutional Neural Networks (CNN)

- Specialized for processing grid-like data such as images
- Uses convolutional layers to extract features
- Employs pooling layers to reduce dimensionality
- Excellent for image classification, object detection, and computer vision tasks

### Recurrent Neural Networks (RNN)

- Designed for sequential data where order matters
- Contains loops allowing information persistence
- Can process inputs of variable length
- Used for natural language processing, time series prediction, and speech recognition
- Variants include LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks

### Generative Adversarial Networks (GAN)

- Consists of two networks: a generator and a discriminator competing against each other
- Generator creates synthetic data, discriminator evaluates its authenticity
- Used for generating realistic images, music, and text

### Transformer Networks

- Based on attention mechanisms rather than recurrence
- Processes all positions of an input sequence in parallel rather than one step at a time
- Revolutionized NLP tasks
- Examples include BERT, GPT models, and other large language models

### Self-Organizing Maps (SOM)

- Unsupervised learning networks that produce low-dimensional representations of input data
- Preserves topological properties of input space
- Used for dimensionality reduction and data visualization

Each of these architectures has its strengths and weaknesses, and the choice of network depends on the specific problem you're trying to solve.


## Extra Reading

For a good overview of neural networks, I would highly recommend going through these 4 videos by 3Blue1Brown:

1. [But what is a Neural Network?](https://www.youtube.com/watch?v=aircAruvnKk)
2. [Gradient Descent, How Neural Networks Learn](https://www.youtube.com/watch?v=IHZwWFHWa-w)
3. [Backpropagation, Intuitively](https://www.youtube.com/watch?v=Ilg3gGewQ5U)
4. [Backpropagation, Calculus](https://www.youtube.com/watch?v=tIeHLnjs5U8)

For loss functions, I would recommend reading the following articles:

- [Loss Functions Explained](https://medium.com/deep-learning-demystified/loss-functions-explained-3098e8ff2b27)
- [PyTorch Loss Functions: The Ultimate Guide](https://neptune.ai/blog/pytorch-loss-functions)
- [Understanding Loss Functions in Deep Learning](https://towardsdatascience.com/understanding-loss-functions-in-deep-learning-for-effective-model-training-5de13424c7d2)
- [Choosing the Right Loss Function for Your Neural Network](https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/)

For gradient descent, I would recommend reading the following articles:

- [Understanding Gradient Descent and Its Variants](https://towardsdatascience.com/understanding-gradient-descent-and-its-variants-1e5a3596ec3a)
- [Gradient Descent Algorithm and Its Variants](https://www.geeksforgeeks.org/gradient-descent-algorithm-and-its-variants/)
- [Gradient Descent Optimization Algorithms Explained](https://ruder.io/optimizing-gradient-descent/)
- [Visualizing and Understanding Gradient Descent](https://distill.pub/2017/momentum/)

For backpropagation, I would recommend reading the following articles:

- [Backpropagation in Neural Networks: A Step-by-Step Guide](https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd)
- [Mathematics of Backpropagation](https://en.wikipedia.org/wiki/Backpropagation)
- [Backpropagation - Visual and Interactive Explanation](https://developers.google.com/machine-learning/crash-course/backprop-scroll)

For activation functions:

- [Understanding Activation Functions in Neural Networks](https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0)
- [Activation Functions in Neural Networks: A Comprehensive Guide](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)
- [Visualizing Activation Functions](https://dashee87.github.io/deep%20learning/visualising-activation-functions-in-neural-networks/)
