### 1. Conceptual Overview of a Neural Network

A **Neural Network** is essentially a series of layers, where each layer applies a mathematical transformation to the output of the layer before it. The key building blocks of a neural network are:

#### **Input Layer**:
- The **input layer** is the first layer of the network and serves as the entry point for the data. The input data is passed into the network in the form of a vector (an array of numbers), where each element of the vector represents a feature of the data.
- For example, when processing an image, each pixel value might be an input; with tabular data, each column (feature) of the table would be an input.
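
As a minimal sketch of how a grid of pixels becomes an input vector, the snippet below flattens a small 2-D grid row by row. The 3×3 `image` values and the helper name `flatten` are illustrative assumptions, not part of any particular library.

```python
# Hypothetical 3x3 grayscale "image": each entry is a pixel intensity in [0, 1].
image = [
    [0.0, 0.5, 1.0],
    [0.2, 0.7, 0.1],
    [0.9, 0.3, 0.4],
]

def flatten(grid):
    """Concatenate the rows of a 2-D grid into one flat list (row-major order)."""
    return [pixel for row in grid for pixel in row]

x = flatten(image)  # input vector with 3 * 3 = 9 features
print(len(x), x)
```

Each of the nine entries of `x` then plays the role of one input feature.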

#### **Hidden Layers**:
- The **hidden layers** are the intermediate layers where most of the computation happens. Each hidden layer performs two key operations:
  1. **Linear Transformation (Matrix Multiplication)**: The input vector from the previous layer is multiplied by a matrix of weights, and a bias vector is added. This operation transforms the input data into a new space. Mathematically, this can be represented as:

     $$
     y = W \times x + b
     $$

     where $W$ is the weight matrix, $x$ is the input vector, $b$ is the bias vector, and $\times$ denotes the matrix-vector product.
  
  2. **Non-linear Activation**: After the linear transformation, a non-linear activation function (such as **ReLU**, which stands for Rectified Linear Unit) is applied to the output. This non-linearity is crucial because it allows the neural network to learn complex patterns and relationships in the data. Without it, the entire network would collapse into a single linear (affine) transformation, no matter how many layers are used. The ReLU function is defined as:

     $$
     \text{ReLU}(z) = \max(0, z)
     $$

     It replaces negative values with zero and keeps positive values unchanged, introducing the non-linearity needed to model more complex relationships.

- A neural network can have multiple hidden layers. They are called "hidden" because they sit between the input and output layers: their outputs are intermediate values that belong to neither the data fed in nor the final result.
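
The two operations of a hidden layer can be sketched in a few lines of dependency-free Python. The helper names (`matvec`, `relu`, `hidden_layer`) and the numeric values are assumptions chosen for illustration; a real model would typically use a tensor library such as PyTorch instead.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def relu(v):
    """Apply ReLU element-wise: negative entries become 0."""
    return [max(0.0, v_i) for v_i in v]

def hidden_layer(W, b, x):
    """Compute ReLU(W x + b) for one hidden layer."""
    y = [y_i + b_i for y_i, b_i in zip(matvec(W, x), b)]
    return relu(y)

# Illustrative values (assumptions): 3 inputs -> 2 hidden units.
W = [[1.0, -2.0, 0.5],
     [0.0, 1.0, -1.0]]
b = [0.5, -1.0]
x = [2.0, 1.0, 4.0]

a = hidden_layer(W, b, x)
print(a)  # [2.5, 0.0]
```

The second unit's pre-activation is negative (-4.0), so ReLU clamps it to zero, while the first unit's value (2.5) passes through unchanged.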

#### **Output Layer**:
- The **output layer** is the final layer of the neural network. It takes the transformed data from the hidden layers and outputs a result. The nature of this output depends on the task:
  - For **classification tasks**, where the goal is to categorize inputs (e.g., recognize handwritten digits), the output layer typically consists of neurons that output probabilities for each class. The **softmax** function is often used here to convert raw scores into probabilities.
  - For **regression tasks**, where the goal is to predict a continuous value (e.g., predicting house prices), the output layer might have a single neuron that outputs the predicted value directly.

#### Example:
Imagine you’re building a neural network to recognize handwritten digits (0-9):
- **Input Layer**: The network takes an image of a digit, represented as a grid of pixel values, and flattens it into an input vector (for example, a 28×28 grayscale image becomes a vector of 784 values).
- **Hidden Layers**: These layers will perform matrix multiplications and apply non-linear activations to detect important patterns (like curves, edges, etc.) in the digit image.
- **Output Layer**: The output layer will give probabilities for each possible digit (0-9), and the network will predict the digit that has the highest probability.
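
The digit example can be sketched end to end as a single forward pass. This is a toy under stated assumptions: the layer sizes (a 4×4 "image", 8 hidden units), the random untrained weights, and all helper names are hypothetical, so the predicted digit is meaningless; only the flow of data is the point.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def dense(W, b, x):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, vi) for vi in v]

def softmax(z):
    """Convert raw scores into probabilities (shifted by max(z) for stability)."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    """Untrained weights drawn uniformly from [-0.5, 0.5]."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

n_pixels, n_hidden, n_classes = 16, 8, 10  # assumed sizes; classes are digits 0-9
W1, b1 = random_matrix(n_hidden, n_pixels), [0.0] * n_hidden
W2, b2 = random_matrix(n_classes, n_hidden), [0.0] * n_classes

x = [random.random() for _ in range(n_pixels)]  # stand-in for flattened pixels
hidden = relu(dense(W1, b1, x))                 # hidden layer
probs = softmax(dense(W2, b2, hidden))          # output layer: one probability per digit
predicted_digit = max(range(n_classes), key=lambda k: probs[k])
print(predicted_digit, round(probs[predicted_digit], 3))
```

Training would adjust `W1`, `b1`, `W2`, and `b2` so that the highest probability lands on the correct digit.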

---

This **conceptual overview** helps us understand the flow of data through a neural network, where each layer transforms the input in a meaningful way, eventually leading to the desired output. By combining linear transformations and non-linear activations, neural networks can model highly complex functions.


![image.png](attachment:1f9295eb-aefc-4c15-b1ca-13d34eb7a141.png)

![image.png](attachment:954e57bf-1a96-4d3f-a12f-1e9b10ed2b4d.png)

Source: https://www.linkedin.com/pulse/explaining-multilayer-perceptrons-terms-general-matrix-ajit-jaokar-c5aje/

### Understanding Neural Networks as Linear Transformations

Neural networks, at their core, can be viewed as a series of **linear transformations**, each followed by a **non-linear activation**. Understanding this fundamental principle allows us to use linear algebra and PyTorch to develop and train efficient models.

Linear transformations in neural networks are often represented by **matrix multiplications**.

Suppose we have an input vector $x$ and a weight matrix $W$. The product of $W$ and $x$ is a new vector $y$, which can be expressed as:

$$
y = W \times x
$$

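If $W$ has $m$ rows and $n$ columns and $x$ has $n$ entries, then $y$ has $m$ entries. The transformation can therefore change the dimensionality of the data as well as rotate, scale, or project it, which is a significant aspect of how neural networks operate.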
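
A small sketch makes the change of dimension concrete. The matrix and vector below are made-up values; the only point is that a matrix with 2 rows and 3 columns maps a 3-entry input to a 2-entry output.

```python
# W has 2 rows and 3 columns, so it maps 3-dimensional inputs to 2-dimensional outputs.
W = [[1, 0, 2],
     [0, 3, 1]]
x = [1, 2, 3]

# Each output entry is the dot product of one row of W with x.
y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
print(len(x), "->", len(y), y)  # 3 -> 2 [7, 9]
```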

### Matrix Multiplication in Neural Networks

In a simple neural network layer, three quantities are involved:
- $W$ is the weight matrix associated with the layer.
- $b$ is the bias vector.
- $x$ is the input vector.

The output $y$ is computed by the expression:

$$
y = W \times x + b
$$

This operation combines a **linear transformation** with the addition of the **bias term**.
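
As a worked example with made-up numbers (the values of $W$, $x$, and $b$ below are assumptions chosen to keep the arithmetic simple), take:

$$
W = \begin{bmatrix} 1 & -2 \\ 3 & 1 \end{bmatrix}, \quad
x = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad
b = \begin{bmatrix} 1 \\ -10 \end{bmatrix}
$$

Then:

$$
y = W \times x + b
= \begin{bmatrix} 1 \cdot 2 + (-2) \cdot 1 \\ 3 \cdot 2 + 1 \cdot 1 \end{bmatrix} + \begin{bmatrix} 1 \\ -10 \end{bmatrix}
= \begin{bmatrix} 0 \\ 7 \end{bmatrix} + \begin{bmatrix} 1 \\ -10 \end{bmatrix}
= \begin{bmatrix} 1 \\ -3 \end{bmatrix}
$$

The matrix product mixes the input entries, and the bias then shifts each output entry independently.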

### Non-linear Activation Function

The application of a non-linear activation function, such as **ReLU** (Rectified Linear Unit), transforms the output further to introduce non-linearity:

$$
a = \text{ReLU}(y)
$$

The **ReLU** function is defined as:

$$
\text{ReLU}(z) = \max(0, z)
$$

This non-linearity is crucial for neural networks because it allows them to learn complex patterns that go beyond simple linear mappings.
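
To see why, consider stacking two layers with no activation in between. The subscripts 1 and 2 are introduced here only to tell the two layers apart:

$$
y = W_2 \times (W_1 \times x + b_1) + b_2 = (W_2 \times W_1) \times x + (W_2 \times b_1 + b_2)
$$

This has the same form as a single layer with weight matrix $W_2 \times W_1$ and bias vector $W_2 \times b_1 + b_2$. Without an activation between them, extra layers add no expressive power; inserting $\text{ReLU}$ between the two layers breaks this collapse.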


### Common Activation Functions in Neural Networks

Activation functions introduce non-linearity into neural networks, which allows the model to learn complex patterns. Here are some of the most commonly used activation functions:

#### **1. Sigmoid Activation Function**:
The **sigmoid** function maps any real-valued number to a value between 0 and 1. It is often used in binary classification tasks.

The sigmoid function is defined as:

$$
\text{Sigmoid}(z) = \frac{1}{1 + e^{-z}}
$$

- **Range**: (0, 1)
- **Pros**: Useful for probabilities in classification tasks.
- **Cons**: Can cause vanishing gradients when the input is large in magnitude (strongly positive or strongly negative): the curve flattens there, so its gradient is close to zero and learning slows down.
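
The saturation behaviour can be checked numerically. The sketch below is one possible implementation: the two-branch form is a common trick to avoid overflow in `exp`, and `sigmoid_grad` uses the standard identity that the derivative equals $s(1 - s)$ where $s$ is the sigmoid output. The function names are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: avoids overflow in exp() for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return e / (1.0 + e)

def sigmoid_grad(z):
    """Derivative of the sigmoid, computed as s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, 0.0, 10.0):
    print(z, round(sigmoid(z), 5), round(sigmoid_grad(z), 5))
```

The gradient peaks at 0.25 for `z = 0` and is already below 0.0001 at `z = ±10`, which is the vanishing-gradient effect in miniature.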

#### **2. Tanh (Hyperbolic Tangent) Activation Function**:
The **tanh** function is similar to the sigmoid but maps the input to values between -1 and 1. It is centered around 0, which often leads to faster convergence compared to the sigmoid function.

The tanh function is defined as:

$$
\text{Tanh}(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2 \times \text{Sigmoid}(2z) - 1
$$

- **Range**: (-1, 1)
- **Pros**: Zero-centered, which helps with optimization.
- **Cons**: Like sigmoid, it saturates for inputs that are large in magnitude (strongly positive or strongly negative), so it can also suffer from vanishing gradients.
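
The identity relating tanh and sigmoid is easy to verify numerically. In this sketch, `tanh_via_sigmoid` is a hypothetical helper name, and `math.tanh` from the standard library serves as the reference.

```python
import math

def sigmoid(z):
    """Plain sigmoid; adequate for the moderate inputs used below."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(z):
    """tanh(z) written as 2 * sigmoid(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # The two columns should agree up to floating-point rounding.
    print(z, round(math.tanh(z), 6), round(tanh_via_sigmoid(z), 6))
```

In other words, tanh is a sigmoid that has been rescaled and shifted so that its output is centered on zero.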

#### **3. ReLU (Rectified Linear Unit) Activation Function**:
The **ReLU** function is one of the most commonly used activation functions in deep learning because it is simple and efficient. It outputs the input directly if the input is positive; otherwise, it returns zero.

The ReLU function is defined as:

$$
\text{ReLU}(z) = \max(0, z)
$$

- **Range**: [0, ∞)
- **Pros**: Computationally efficient, and it mitigates the vanishing gradient problem because its gradient is exactly 1 for positive inputs.
- **Cons**: Can cause the **"dying ReLU problem"** where neurons get stuck during training, producing zero output for all inputs.
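
A scalar sketch of ReLU and its derivative shows both the benefit and the failure mode. The names `relu` and `relu_grad` are illustrative, and using a gradient of 0 at exactly `z == 0` is a convention assumed here (the derivative is undefined at that point).

```python
def relu(z):
    """ReLU: pass positive inputs through, clamp the rest to zero."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, otherwise 0 (0 at z == 0 by convention)."""
    return 1.0 if z > 0 else 0.0

inputs = [-3.0, -0.5, 0.0, 0.5, 3.0]
print([relu(z) for z in inputs])       # [0.0, 0.0, 0.0, 0.5, 3.0]
print([relu_grad(z) for z in inputs])  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

A neuron whose pre-activation is negative for every training input always has gradient 0, so its weights stop updating; that is the dying ReLU problem.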

#### **4. Leaky ReLU Activation Function**:
The **Leaky ReLU** is a variation of ReLU where a small, non-zero slope is used for negative inputs, helping to mitigate the dying ReLU problem.

The Leaky ReLU function is defined as:

$$
\text{Leaky ReLU}(z) = 
\begin{cases} 
z & \text{if } z > 0 \\
\alpha z & \text{if } z \leq 0 
\end{cases}
$$

where $\alpha$ is a small constant (often 0.01).

- **Range**: (-∞, ∞)
- **Pros**: Prevents neurons from "dying" by allowing small negative outputs.
- **Cons**: The slope for negative inputs is a hyperparameter that needs to be tuned.
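
A sketch of Leaky ReLU under the same conventions follows. The function names are illustrative, and the default `alpha=0.01` simply mirrors the value mentioned above.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: identity for z > 0, slope alpha otherwise."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Derivative: 1 for z > 0, alpha otherwise, so it is never exactly zero."""
    return 1.0 if z > 0 else alpha

print(leaky_relu(5.0), leaky_relu(-5.0), leaky_relu_grad(-5.0))
```

Because the gradient for negative inputs is `alpha` rather than 0, a neuron that receives only negative pre-activations can still update its weights, just slowly.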

#### **5. Softmax Activation Function**:
The **softmax** function is often used in the output layer of a neural network for multi-class classification. It converts raw scores (logits) into probabilities by normalizing the outputs to sum to 1.

The softmax function is defined as:

$$
\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
$$

- **Range**: (0, 1) for each output, with all outputs summing to 1.
- **Pros**: Useful for multi-class classification as it provides a probability distribution over classes.
- **Cons**: Can be computationally expensive when the number of classes is large.
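
A minimal softmax sketch follows. Subtracting the largest logit before exponentiating is a widely used stability trick that leaves the result unchanged, because the common factor cancels in the ratio; the `logits` values are made up for illustration.

```python
import math

def softmax(z):
    """Softmax over a list of logits, shifted by max(z) to avoid overflow in exp()."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # illustrative raw scores
probs = softmax(logits)
print([round(p, 3) for p in probs], sum(probs))
```

The largest logit receives the largest probability, and the outputs sum to 1 up to floating-point rounding.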

---

Choosing the right activation function gives a neural network the non-linearity it needs to solve complex tasks. Each function has its own advantages and disadvantages, and the choice depends on the specific problem being tackled.
