1. Sequential Network
Definition:
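A Sequential model is a linear stack of layers: the output of each layer is the input to the next. It is the simplest type of neural network model and suits plain feedforward networks that have no branches and no multiple inputs or outputs.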

2. Dense Layer
Definition:
A Dense layer, also known as a fully connected layer, is a layer in which each neuron receives input from all neurons of the previous layer. It applies a linear (affine) transformation W \cdot x + b to the input vector, usually followed by an activation function.

Properties:

Units: Number of neurons in the layer.
Input Dimension: The shape of the input data. Required only for the first layer.
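As a rough illustration of these two properties, the sketch below implements the forward computation of a dense layer in plain Python. The class name `DenseSketch` and its arguments `units` and `input_dim` are hypothetical names chosen for this example, not a library API, and the small random weights are only there for demonstration.

```python
import random


class DenseSketch:
    """Hypothetical fully connected layer: output = W . x + b (no activation)."""

    def __init__(self, units, input_dim, seed=0):
        rng = random.Random(seed)
        self.units = units
        self.input_dim = input_dim
        # One row of weights per neuron; each row holds one weight per input feature.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
                        for _ in range(units)]
        self.biases = [0.0] * units

    def forward(self, x):
        if len(x) != self.input_dim:
            raise ValueError("expected %d inputs, got %d" % (self.input_dim, len(x)))
        # Every neuron sees every input value, hence "fully connected".
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weights, self.biases)]


layer = DenseSketch(units=3, input_dim=4)
output = layer.forward([1.0, 2.0, 3.0, 4.0])
print(len(output))  # 3 values, one per neuron
```

With 3 units and 4 inputs, this layer holds 3 x 4 weights plus 3 biases, 15 trainable parameters in total.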
3. Weights and Initialization Techniques
Weights:
Weights are the parameters within the neural network that get updated during training. They connect the neurons of one layer to the neurons of the next layer.

Initialization Techniques:
Proper initialization is crucial for training efficiency and convergence. Common initialization methods include:

Zero Initialization:

Description: All weights are initialized to zero.
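Disadvantage: Every neuron in a layer computes the same output and receives the same gradient, so the symmetry is never broken and the layer learns as if it had a single neuron.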
Random Initialization:

Description: Weights are initialized to small random values.
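Disadvantage: If the scale is not matched to the layer size, activations and gradients can shrink or grow from layer to layer, which slows or stalls convergence.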
He Initialization:

Description: Weights are initialized based on a normal distribution with a mean of 0 and a variance of \frac{2}{\text{fan_in}}.
Usage: Suitable for ReLU and its variants.
Formula: W \sim \mathcal{N}(0, \sigma^2), \quad \sigma = \sqrt{\frac{2}{\text{fan_in}}}
Xavier/Glorot Initialization:

Description: Weights are initialized based on a normal distribution with a mean of 0 and a variance of \frac{2}{\text{fan_in} + \text{fan_out}}.
Usage: Suitable for sigmoid and tanh activations.
Formula: W \sim \mathcal{N}(0, \sigma^2), \quad \sigma = \sqrt{\frac{2}{\text{fan_in} + \text{fan_out}}}
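The He and Xavier/Glorot rules can be sketched in a few lines of plain Python. The function names `he_normal`, `glorot_normal`, and `sample_variance` are hypothetical helpers written for this example. Library implementations may differ in detail, for example by truncating the distribution or offering uniform variants.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def glorot_normal(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix with variance 2 / (fan_in + fan_out)."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def sample_variance(matrix):
    """Empirical variance of all entries, used only to check the samples."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
he_weights = he_normal(fan_in=200, fan_out=100, rng=rng)
glorot_weights = glorot_normal(fan_in=200, fan_out=100, rng=rng)

print(sample_variance(he_weights))      # should be close to 2 / 200 = 0.01
print(sample_variance(glorot_weights))  # should be close to 2 / 300
```

Note that `random.gauss` takes the standard deviation, not the variance, which is why the square root appears in the code.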
4. Activation Functions
Definition:
Activation functions introduce non-linearity into the network, allowing it to learn and model complex data.

Categories:

Sigmoid Activation Function:

Formula: \sigma(x) = \frac{1}{1 + e^{-x}}
Range: (0, 1)
Usage: Often used in the output layer of binary classification problems.
ReLU (Rectified Linear Unit):

Formula: f(x) = \max(0, x)
Range: [0, ∞)
Usage: Widely used in hidden layers for its simplicity and efficiency.
Tanh (Hyperbolic Tangent):

Formula: \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
Range: (-1, 1)
Usage: Often used in hidden layers. Its output is zero-centered, so negative inputs map to negative activations instead of being squashed toward 0.
Softmax:

Formula: \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
Range: (0, 1) for each output (the outputs sum to 1 across the output neurons)
Usage: Used in the output layer of multiclass classification problems.
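The four activation functions above translate almost directly into Python. This is a minimal sketch using only the standard library. The max-subtraction inside `softmax` is a common numerical-stability trick added here, not part of the formula itself, and the other functions are direct transcriptions that are not hardened against very large inputs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow.
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))   # 0.5
print(relu(-2.0))     # 0.0
print(tanh(0.0))      # 0.0
probs = softmax([1.0, 2.0, 3.0])
print(sum(probs))     # approximately 1.0
```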
5. Forward Propagation
Definition:
Forward propagation is the process of passing input data through the network to obtain an output prediction.

Steps:

Input Layer: Receive input data.
Weighted Sum: Calculate the weighted sum of inputs for each neuron in the next layer.
z = W \cdot x + b
Activation: Apply activation function.
a = \sigma(z)
Output Layer: Produce the final output after passing through all layers.
Equation:
For a single layer:
a^{[l]} = \sigma(W^{[l]} \cdot a^{[l-1]} + b^{[l]})
Where:

a^{[l]}: Activation of layer l
W^{[l]}: Weights of layer l
b^{[l]}: Biases of layer l
\sigma: Activation function
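The layer equation can be sketched as a loop that feeds each layer's activation into the next. The helper names `layer_forward` and `forward`, the 2-3-1 layout, and the hand-picked weights are all assumptions made for this example. A sigmoid is used at every layer only to keep the sketch short.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(W, b, a_prev, activation):
    """Compute a[l] = activation(W[l] . a[l-1] + b[l]) for one layer."""
    z = [sum(w * a for w, a in zip(row, a_prev)) + bias
         for row, bias in zip(W, b)]
    return [activation(v) for v in z]


def forward(layers, x):
    """Pass x through each (W, b) pair in order; a[0] is the input itself."""
    a = x
    for W, b in layers:
        a = layer_forward(W, b, a, sigmoid)
    return a


# Hypothetical 2-3-1 network with hand-picked weights.
layers = [
    ([[0.5, -0.5], [0.25, 0.75], [-1.0, 1.0]], [0.0, 0.1, -0.1]),  # hidden layer
    ([[1.0, -1.0, 0.5]], [0.0]),                                   # output layer
]
prediction = forward(layers, [1.0, 2.0])
print(prediction)  # a single value strictly between 0 and 1
```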
6. Backpropagation
Definition:
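Backpropagation is the process of computing the gradient of the loss function with respect to every weight and bias by propagating the error backwards through the network. These gradients are then used to update the weights and biases so as to minimize the loss.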

Steps:

Compute Loss: Calculate the difference between predicted and actual values using a loss function.
Calculate Gradients: Compute gradients of the loss function with respect to weights and biases using the chain rule.
Update Weights: Adjust weights and biases using the gradients and a learning rate.
Equations:

Loss Function Gradient:
\frac{\partial L}{\partial W^{[l]}}
\frac{\partial L}{\partial b^{[l]}}
Weight Update:
W^{[l]} = W^{[l]} - \alpha \cdot \frac{\partial L}{\partial W^{[l]}}
b^{[l]} = b^{[l]} - \alpha \cdot \frac{\partial L}{\partial b^{[l]}}
Where \alpha is the learning rate.
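To make the three steps concrete, the sketch below works through them for the smallest possible case: a single sigmoid neuron with one input and a squared-error loss. The loss choice, the starting values, and the helper names `loss` and `gradients` are assumptions made for illustration. A finite-difference check is included as a sanity test on the chain-rule gradient.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron: L = (a - y)^2."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2


def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw, and likewise for b."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)      # derivative of the sigmoid
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz    # dz/dw = x, dz/db = 1


w, b, x, y = 0.5, 0.0, 2.0, 1.0
alpha = 0.1  # learning rate

dw, db = gradients(w, b, x, y)

# Sanity check against a central finite difference.
eps = 1e-6
dw_numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

loss_before = loss(w, b, x, y)
w, b = w - alpha * dw, b - alpha * db   # gradient-descent update
loss_after = loss(w, b, x, y)
print(loss_before > loss_after)  # True: one step reduces the loss
```

In a multi-layer network the same chain rule is applied layer by layer, reusing the dL/dz term of each layer when computing the gradients of the layer before it.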

7. Optimizers
Definition:
Optimizers are algorithms used to adjust the weights of the network to minimize the loss function.

Categories:

Stochastic Gradient Descent (SGD):

Description: Updates the weights in the direction of the negative gradient of the loss, estimated on a single example or a mini-batch rather than on the full dataset.
Learning Rate: Hyperparameter that controls the step size.
Adam (Adaptive Moment Estimation):

Description: Combines the advantages of two other extensions of SGD, specifically AdaGrad and RMSProp.
Parameters: Learning rate, beta1, beta2.
RMSprop:

Description: Adapts the learning rate for each parameter by dividing each update by the root of a moving average of recent squared gradients.
Usage: Suitable for non-stationary objectives.
Adagrad:

Description: Adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones.
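The difference between a plain gradient step and an adaptive one can be seen on a toy problem. The sketch below minimizes the hypothetical objective f(w) = (w - 3)^2 with SGD and with Adam. The function names, step counts, and hyperparameter values are assumptions for this example, with the Adam defaults for beta1, beta2, and epsilon following commonly used values.

```python
import math


def grad(w):
    """Gradient of the toy objective f(w) = (w - 3)^2, minimized at w = 3."""
    return 2.0 * (w - 3.0)


def run_sgd(w, lr=0.1, steps=200):
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Both runs should approach the minimum at w = 3.
print(run_sgd(0.0))
print(run_adam(0.0))
```

RMSprop corresponds roughly to keeping only the `v` average, and Adagrad to replacing that moving average with a running sum of squared gradients.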
8. Loss Functions
Definition:
Loss functions measure how well the model's predictions match the true data labels. The choice of loss function depends on the type of problem being solved.

Types:

Mean Squared Error (MSE):

Usage: Regression problems.
Formula: \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Binary Crossentropy:

Usage: Binary classification problems.
Formula: \text{Binary Crossentropy} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]
Categorical Crossentropy:

Usage: Multiclass classification problems.
Formula: \text{Categorical Crossentropy} = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)
Note: For a single sample, i indexes the classes, y is the one-hot label vector, and \hat{y} is the predicted probability vector.
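The three loss formulas can be written as short Python functions. This is a minimal sketch: the `eps` clipping is an addition that keeps the logarithm finite when a prediction is exactly 0 or 1, and `categorical_crossentropy` is assumed to handle a single sample with a one-hot label.

```python
import math


def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def binary_crossentropy(y_true, y_pred, eps=1e-12):
    # Clip predictions away from 0 and 1 so log() stays finite.
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n


def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    # y_true is a one-hot vector, y_pred a probability vector (one sample).
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))                 # 4 / 3
print(binary_crossentropy([1, 0], [0.9, 0.1]))               # -log(0.9)
print(categorical_crossentropy([0, 1, 0], [0.2, 0.7, 0.1]))  # -log(0.7)
```

In each case a confident correct prediction gives a loss near 0, and the loss grows as the prediction moves away from the label.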
9. Metrics
Definition:
Metrics are used to evaluate the performance of the model. Unlike loss functions, metrics are not used for training the model but for monitoring and evaluating its performance.

Examples:

Accuracy:

Usage: Suitable for classification problems.
Description: Measures the percentage of correct predictions.
Precision:

Usage: Useful in binary classification to measure the accuracy of positive predictions.
Formula: \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
Recall:

Usage: Measures the ability of the model to find all relevant cases within a dataset.
Formula: \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
F1 Score:

Usage: Combines Precision and Recall into a single number (their harmonic mean); useful when both matter and the classes are imbalanced.
Formula: \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
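The metrics above can be computed directly from lists of true and predicted labels. The sketch below assumes binary labels with 1 as the positive class. The helper names and the example labels are made up for illustration, and returning 0.0 when a denominator is empty is a convention chosen here.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives (label 1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def precision_recall_f1(y_true, y_pred):
    tp, fp, fn = confusion_counts(y_true, y_pred)
    # Returning 0.0 for an empty denominator is a convention chosen for this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))             # 0.5
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.5, 0.6)
```

Here the model finds 3 of the 6 positives (recall 0.5) and 3 of its 4 positive predictions are correct (precision 0.75). The F1 score of 0.6 sits closer to the lower of the two, as expected of a harmonic mean.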