## Feature Engineering



<p align="center"><div class="alert alert-success" style="margin: 20px"><b>Feature engineering is the process of selecting, transforming, and creating new features from the raw data that can improve the performance of a machine learning model.</b></div></p>


<br>
<br>

<img src="../resources/FE.png" align='center'>

<br>
<br>

Common tasks:

- **Detecting and Handling Outliers**
- **Missing values Imputation**
- **Encoding Categorical Features**
- **Feature Scaling**
- **Feature Extraction / Extracting Information**
- **Combining Information**

In [15]:
sales = pd.read_csv('../datasets/Advertising Budget and Sales.csv')
sales

Unnamed: 0.1,Unnamed: 0,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($),Sales ($)
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9
...,...,...,...,...,...
195,196,38.2,3.7,13.8,7.6
196,197,94.2,4.9,8.1,9.7
197,198,177.0,9.3,6.4,12.8
198,199,283.6,42.0,66.2,25.5


### Detecting and Handling Outliers


<p align="center"><div class="alert alert-success" style="margin: 20px"><b>Outliers are data points that are significantly different from other data points in a dataset.</b></div></p>


<img src="../resources/outliers.jpg" height=500px width=500px align=left>
<img src="../resources/outliers.png" height=500px width=500px align=right>

### Detecting Outliers

- **Univariate Analysis:**
	- Z-score Method
	- IQR Method

- **Bivariate Analysis:**
	- Scatter Plot

#### Z-score Method:
- The Z-score method flags data points that lie more than three standard deviations from the mean, i.e. points with $|z| = |x - \mu| / \sigma > 3$.

#### IQR Method:
- The IQR method flags data points that fall below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$, where $IQR = Q_3 - Q_1$ is the interquartile range.


In [16]:
#Using Z_Score Method
#formula for Z_score method
z_scores = np.abs((sales - sales.mean()) / sales.std())

# identify outliers based on z-score threshold
threshold = 3
outliers = sales[(z_scores > threshold).any(axis=1)]
outliers

Unnamed: 0.1,Unnamed: 0,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($),Sales ($)
16,17,67.8,36.6,114.0,12.5
101,102,296.4,36.3,100.9,23.8


In [22]:
# first quartile (Q1)
Q1 = sales.quantile(0.25)

#  the third quartile (Q3)
Q3 = sales.quantile(0.75)

# interquartile range (IQR)
IQR = Q3 - Q1

# threshold for identifying outliers
threshold = 1.5

# identify outliers based on the threshold
outliers = sales[(sales < (Q1 - threshold * IQR)) | (sales > (Q3 + threshold * IQR))].stack().index.tolist()

outliers

[(16, 'Newspaper Ad Budget ($)'), (101, 'Newspaper Ad Budget ($)')]

### Handling Outliers

- **Removal/Trimming:** In this method, the outliers are identified and the rows containing them are deleted from the dataset. This is the simplest option, but it discards data, so it is best suited to cases where outliers are rare or are clearly measurement errors.

- **Imputation:** In this method, the outliers are replaced with a new value such as the mean, median, or mode of the dataset.

- **Winsorization:** In this method, the outliers are replaced with a value at a certain percentile. For example, if the 95th percentile is used, all values above the 95th percentile are replaced with the value at the 95th percentile.

- **Transformation:** In this method, the data is transformed in a way that reduces the effect of outliers. This can be done by applying a log transformation, a square root transformation, or a Box-Cox transformation.

- **Binning:** In this method, the data is divided into bins or intervals and the outliers are replaced with the upper or lower limit of the bin.

- **Clipping:** In this method, the outliers are replaced with a value at a certain threshold. For example, if the threshold is set at 3 standard deviations from the mean, all values above or below this threshold are replaced with the threshold value.
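Three of the strategies listed above (winsorization, clipping, and transformation) can be sketched with pandas and NumPy. This is a minimal illustration on a made-up series `s`, not on the `sales` data; the percentile and standard-deviation thresholds are the ones quoted in the list, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two obvious extremes (1 and 95)
s = pd.Series([1, 10, 11, 12, 12, 13, 14, 15, 16, 95], dtype=float)

# Winsorization: cap values at the 5th and 95th percentiles
lower, upper = s.quantile(0.05), s.quantile(0.95)
winsorized = s.clip(lower=lower, upper=upper)

# Clipping: cap values at mean +/- 3 standard deviations
mu, sigma = s.mean(), s.std()
clipped = s.clip(lower=mu - 3 * sigma, upper=mu + 3 * sigma)

# Transformation: log1p compresses the right tail (and is defined at 0)
logged = np.log1p(s)

print(winsorized.min(), winsorized.max())
```

All three keep every row, which is the main practical difference from trimming.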

#### Handling outliers using the mean

In [23]:
#here write your code
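One possible way to fill in this exercise is sketched below. It assumes the IQR rule is used for detection and that the replacement value is the mean of the non-outlying values (using the mean of all values, outliers included, is another reasonable reading). The series `col` is a made-up stand-in for a single numeric column.

```python
import pandas as pd

# Hypothetical stand-in for one numeric column; 250.0 is the planted outlier
col = pd.Series([22.1, 10.4, 9.3, 18.5, 12.9, 7.6, 9.7, 12.8, 25.5, 11.0,
                 14.2, 16.7, 13.1, 15.9, 250.0])

# Flag outliers with the 1.5 * IQR rule
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
is_outlier = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

# Replace flagged values with the mean of the remaining values
replacement = col[~is_outlier].mean()
cleaned = col.where(~is_outlier, replacement)

print(is_outlier.sum(), round(replacement, 2))
```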

#### Handling outliers using scikit-learn

In [25]:
from sklearn.preprocessing import MinMaxScaler, RobustScaler
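The two scalers imported above react very differently to extreme values. The sketch below compares them on a made-up feature `X`. Note that neither scaler removes outliers; `RobustScaler` only limits their influence on the scaling by using the median and IQR instead of the minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical single feature with one extreme value (300)
X = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [300.0]])

# MinMaxScaler uses min and max, so the outlier squeezes the other values near 0
minmax_scaled = MinMaxScaler().fit_transform(X)

# RobustScaler centers on the median and divides by the IQR,
# so the bulk of the data keeps a usable spread
robust_scaled = RobustScaler().fit_transform(X)

print(minmax_scaled.ravel().round(3))
print(robust_scaled.ravel().round(3))
```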

### Missing values Imputation

<img src="../resources/Imputation Techniques types.jpeg" height=500px width=400px align=left>
<img src="../resources/imputation.png" width=400px height=500px align=right>




<br>
<br>

<br>
<br>
<br>
<br>

<br>

<br>
<br>

<br>
<br>
<br>
<br>

<br>
<br>
<br>

- **Mean/median imputation:** In this method, missing values are replaced with the mean or median of the available values in the same column. This method is simple to implement and can work well when values are missing completely at random. However, it shrinks the variance of the column and can lead to biased models if the missingness is not random.

- **Mode imputation:** This method is used for categorical data and replaces missing values with the most common value in the same column. It is simple to implement but can also lead to biased models if the missingness is not random.

- **Hot-deck imputation:** This method replaces missing values with values from other, similar observations in the same dataset. It works well when values are missing at random but can lead to biased models if the missingness depends on the unobserved values themselves.

- **K-NN imputation:** In this method, missing values are replaced with the average (or most frequent value) of the K nearest neighbors in the same dataset. It can work well when the missing feature is correlated with the observed features but can lead to biased models if the missingness depends on the unobserved values themselves.

In [None]:
import numpy as np
import pandas as pd

# Creating a dataset of 10 rows and 6 columns
data = {'Feature_1': [12, 15, 20, 10, 30, 17, 25, 22, 19, 27],
        'Feature_2': [1500, 1700, 1800, 1200, 2000, 1600, np.nan, 1900, 1700, 2100],
        'Feature_3': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'A', 'C'],
        'Feature_4': [3, 7, 5, 9, 6, 4, 8, 2, np.nan, 7],
        'Feature_5': [0.75, 0.9, 1.2, 0.5, 1.5, 1.1, 1.3, 1.0, 0.8, 1.4],
        'Feature_6': [300, 200, 400, 500, 350, 250, 450, 600, 350, 550]}

df = pd.DataFrame(data)



In [26]:
from sklearn.impute import SimpleImputer

In [None]:
SimpleImputer()
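As a rough illustration of mean imputation and K-NN imputation, the sketch below applies scikit-learn's `SimpleImputer` and `KNNImputer` to a small made-up frame `toy`. The column names and `n_neighbors=2` are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Small hypothetical frame with one missing value per column
toy = pd.DataFrame({
    'x1': [1.0, 2.0, np.nan, 4.0, 5.0],
    'x2': [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Mean imputation: each NaN becomes its column mean
mean_imputer = SimpleImputer(strategy='mean')
toy_mean = pd.DataFrame(mean_imputer.fit_transform(toy), columns=toy.columns)

# K-NN imputation: each NaN becomes the average of that column over the
# k nearest rows, with distances computed on the columns both rows observe
knn_imputer = KNNImputer(n_neighbors=2)
toy_knn = pd.DataFrame(knn_imputer.fit_transform(toy), columns=toy.columns)

print(toy_mean)
print(toy_knn)
```

For categorical columns, `SimpleImputer(strategy='most_frequent')` gives mode imputation.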

### Encoding Categorical Features

<img src="../resources/encoding.png">

- **Label Encoding** is used when we have ordinal categorical variables, which means there is some order or ranking in the categories. In Label Encoding, each unique category is assigned an integer (0, 1, 2, and so on) based on its order or ranking. For example, if we have a variable "Size" with categories 'Small', 'Medium', and 'Large', then we can assign values of 0, 1, and 2 to these categories respectively. Label Encoding keeps the variable as a single column, which is convenient when there are many categories. However, if it is applied to categories that have no natural order, it introduces an arbitrary ordering that can mislead the model.

- **One-Hot Encoding** is used when we have nominal categorical variables, which means there is no order or ranking in the categories. In One-Hot Encoding, each unique category is converted into a new binary feature, where the presence of the category is represented by a value of 1 and its absence by a value of 0. For example, if we have a variable "Color" with categories 'Red', 'Green', and 'Blue', then we can create three new binary features 'Color_Red', 'Color_Green', and 'Color_Blue', where each feature has a value of 1 if the row belongs to that category and 0 otherwise. One-Hot Encoding preserves the category information without introducing an arbitrary ordering. However, with many categories it can lead to the curse of dimensionality: the number of features becomes very large, making the data computationally expensive and difficult to work with.

In [29]:
#code
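The two encodings described above could be sketched as follows, reusing the "Size" and "Color" examples from the text. The frame `items` is made up, and `OrdinalEncoder` is used here (rather than `LabelEncoder`) only because it lets the category order be stated explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame reusing the "Size" and "Color" examples
items = pd.DataFrame({
    'Size': ['Small', 'Large', 'Medium', 'Small'],
    'Color': ['Red', 'Green', 'Blue', 'Red'],
})

# Ordinal (label-style) encoding with the order given explicitly:
# Small -> 0, Medium -> 1, Large -> 2
size_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
items['Size_encoded'] = size_encoder.fit_transform(items[['Size']]).ravel().astype(int)

# One-hot encoding: one binary column per color
encoded = pd.get_dummies(items, columns=['Color'], dtype=int)

print(encoded)
```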

In [None]:
# Handling outliers in Feature_1
df['Feature_1'] = np.where(df['Feature_1'] > 25, 25, df['Feature_1'])


In [None]:
# Imputing missing values in Feature_2 with mean
df['Feature_2'] = df['Feature_2'].fillna(df['Feature_2'].mean())


In [None]:
# Encoding categorical features in Feature_3
df = pd.concat([df, pd.get_dummies(df['Feature_3'], prefix='Feature_3')], axis=1)
df.drop('Feature_3', axis=1, inplace=True)


In [None]:

# Scaling features with Min-Max scaling in Feature_5 and Feature_6
df['Feature_5'] = (df['Feature_5'] - df['Feature_5'].min()) / (df['Feature_5'].max() - df['Feature_5'].min())
df['Feature_6'] = (df['Feature_6'] - df['Feature_6'].min()) / (df['Feature_6'].max() - df['Feature_6'].min())


In [None]:
# Extracting information from Feature_1 by squaring the values
df['Feature_1_squared'] = df['Feature_1'] ** 2


In [None]:
# Combining information from Feature_4 and Feature_6
df['Feature_4_times_6'] = df['Feature_4'] * df['Feature_6']


In [4]:

df

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6
0,12,1500.0,A,3.0,0.75,300
1,15,1700.0,B,7.0,0.9,200
2,20,1800.0,A,5.0,1.2,400
3,10,1200.0,C,9.0,0.5,500
4,30,2000.0,B,6.0,1.5,350
5,17,1600.0,C,4.0,1.1,250
6,25,,A,8.0,1.3,450
7,22,1900.0,B,2.0,1.0,600
8,19,1700.0,A,,0.8,350
9,27,2100.0,C,7.0,1.4,550



 Here are some common tasks that are performed in feature engineering:



- **Feature Selection:** This involves selecting the most relevant features from the raw data that can improve the performance of the model. This can be done using techniques such as correlation analysis, mutual information, or statistical tests.


In [1]:
from sklearn.feature_selection import mutual_info_classif

In [4]:
#Correlation Analysis

In [2]:
# Load the data
data = pd.read_csv('data.csv')

# Compute the correlation matrix (numeric columns only)
corr_matrix = data.corr(numeric_only=True)

# Rank features by absolute correlation with the target, excluding the target itself
n = 10
target_corr = corr_matrix['target'].drop('target').abs()
top_features = target_corr.nlargest(n).index
selected_features = data[top_features]

In [3]:
#Mutual Information

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
import pandas as pd

# Load the data
data = pd.read_csv('data.csv')

# Select the top n features based on mutual information
n = 10
selector = SelectKBest(score_func=mutual_info_classif, k=n)
selected_features = selector.fit_transform(data.drop('target', axis=1), data['target'])


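- **Feature Scaling:** This involves rescaling the features to a similar range so that features with large magnitudes do not dominate distance-based or gradient-based models. This can be done using techniques such as normalization (rescaling to [0, 1]) or standardization (zero mean, unit variance).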



In [None]:
# Using Normalization
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the data
data = pd.read_csv('data.csv')

# Normalize the features
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(data.drop('target', axis=1))

In [None]:
#Using Standardization
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the data
data = pd.read_csv('data.csv')

# Standardize the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.drop('target', axis=1))

- **Feature Encoding:** This involves representing categorical features as numerical features that can be used by the model. This can be done using techniques such as one-hot encoding or label encoding.



In [5]:
#Using one-hot encoding
import pandas as pd

# Load the data
data = pd.read_csv('data.csv')

# One-hot encode the categorical features
encoded_features = pd.get_dummies(data, columns=['cat_feature'])


In [None]:
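# Using label encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the data
data = pd.read_csv('data.csv')

# Label encode the categorical feature
# (LabelEncoder assigns integers in sorted order of the labels and is intended
# for targets; OrdinalEncoder is the scikit-learn equivalent for input features)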
encoder = LabelEncoder()
encoded_features = data.copy()
encoded_features['cat_feature'] = encoder.fit_transform(data['cat_feature'])

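- **Feature Transformation:** This involves projecting the features into a different space that can improve the performance of the model. This can be done using techniques such as PCA (unsupervised; keeps the directions of maximum variance) or LDA (supervised; keeps the directions that best separate the classes).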



In [None]:
from sklearn.decomposition import PCA
import pandas as pd

# Load the data
data = pd.read_csv('data.csv')

# Apply PCA to the features
n_components = 10
pca = PCA(n_components=n_components)
transformed_features = pca.fit_transform(data.drop('target', axis=1))

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import pandas as pd

# Load the data
data = pd.read_csv('data.csv')

# Apply LDA to the features
# (n_components must be at most min(n_classes - 1, n_features))
n_components = 2
lda = LinearDiscriminantAnalysis(n_components=n_components)
transformed_features = lda.fit_transform(data.drop('target', axis=1), data['target'])

- **Feature Creation:** This involves creating new features from the raw data that can capture important patterns or relationships in the data. This can be done using techniques such as polynomial features or interaction features.


In [None]:
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

# Load the data
data = pd.read_csv('data.csv')

# Create polynomial features
degree = 2
poly = PolynomialFeatures(degree=degree)
new_features = poly.fit_transform(data.drop('target', axis=1))

In [None]:
import pandas as pd

# Load the data
data = pd.read_csv('data.csv')

# Create interaction features
new_feature = data['feature1'] * data['feature2']
data['new_feature'] = new_feature


- **Feature Extraction:** This involves extracting useful information from the raw data that can be used as features. This can be done using techniques such as text or image feature extraction.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Load the data
data = pd.read_csv('data.csv')

# Extract features from text data
vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(data['text_feature'])

In [None]:
import cv2
import numpy as np

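# Load the image (cv2.imread returns None if the file cannot be read)
img = cv2.imread('image.jpg')
if img is None:
    raise FileNotFoundError('Could not read image.jpg')

# Use the raw pixel values, flattened into a 1-D vector, as features
features = img.flatten()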


- **Feature Discretization:** This involves discretizing continuous features into categorical features that can be used by the model. This can be done using techniques such as binning or quantile-based discretization.

In [None]:
import pandas as pd

# Load the data
data = pd.read_csv('data.csv')

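# Discretize the numerical feature using fixed-width bins
# (include_lowest=True so that a value of exactly 0 falls into the first bin)
bins = [0, 10, 20, 30, 40, 50]
labels = [1, 2, 3, 4, 5]
data['binned_feature'] = pd.cut(data['numerical_feature'], bins=bins, labels=labels, include_lowest=True)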

In [None]:
import pandas as pd

# Load the data
data = pd.read_csv('data.csv')

# Discretize the numerical feature using quantile-based discretization
n_bins = 5
data['quantile_feature'] = pd.qcut(data['numerical_feature'], q=n_bins, labels=False)
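Because `data.csv` is only a placeholder, the difference between the two discretization methods is easier to see on a small made-up series. In the sketch below, `ages` and the choice of four quantile bins are assumptions for illustration.

```python
import pandas as pd

# Hypothetical values to discretize
ages = pd.Series([3, 12, 18, 25, 31, 38, 44, 50])

# Fixed-width bins: (0, 10], (10, 20], ..., (40, 50]
fixed = pd.cut(ages, bins=[0, 10, 20, 30, 40, 50], labels=[1, 2, 3, 4, 5])

# Quantile bins: each of the 4 bins receives the same number of rows
quartile = pd.qcut(ages, q=4, labels=False)

print(fixed.tolist())     # [1, 2, 2, 3, 4, 4, 5, 5]
print(quartile.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Fixed-width bins can end up with very uneven counts on skewed data, whereas quantile bins are balanced by construction.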

In [None]:
# Feature Engineering Pipeline

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the data
data = pd.read_csv('data.csv')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1), data['target'], test_size=0.2, random_state=42)

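# Define the feature engineering pipeline
# (only two numeric columns are passed in below, so PCA can keep at most
# 2 components and SelectKBest at most 2 features)
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('kbest', SelectKBest(score_func=mutual_info_classif, k=1))
])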

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['numeric_feature1', 'numeric_feature2']),
        ('cat', categorical_transformer, ['categorical_feature'])
    ])

# Define the final pipeline with feature engineering and machine learning model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Evaluate the pipeline on the testing data
score = pipeline.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(score * 100))
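Since `data.csv` is a placeholder, the pipeline above cannot be run as is. The following is a simplified, self-contained variant on synthetic data. The frame `demo`, its column names, and the way the target is generated are all assumptions, and the PCA and feature-selection steps are left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for data.csv (column names are hypothetical)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    'numeric_feature1': rng.normal(size=n),
    'numeric_feature2': rng.normal(size=n),
    'categorical_feature': rng.choice(['a', 'b', 'c'], size=n),
})
# The target depends on the first numeric feature plus some noise
demo['target'] = (demo['numeric_feature1'] + 0.3 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    demo.drop('target', axis=1), demo['target'], test_size=0.2, random_state=42)

# Scale the numeric columns and one-hot encode the categorical column
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['numeric_feature1', 'numeric_feature2']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['categorical_feature']),
])

demo_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])

# The preprocessing is fitted on the training split only, then reused on the test split
demo_pipeline.fit(X_train, y_train)
demo_score = demo_pipeline.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(demo_score * 100))
```

Keeping the preprocessing inside the `Pipeline` means the scaler and encoder never see the test split during fitting, which avoids data leakage.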

In [2]:
import pandas as pd
import numpy as np

### Neural Network Basics: Forward Propagation

Below is a step-by-step derivation of forward propagation for a neural network with one input layer, two hidden layers, and one output layer. Let's assume that there are $n^{[0]}$ input features, $n^{[1]}$ neurons in the first hidden layer, $n^{[2]}$ neurons in the second hidden layer, and $n^{[3]}$ neurons in the output layer. We will use the following notation:

- $a^{[0]} \in \mathbb{R}^{n^{[0]}}$ is the input vector
- $W^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}$ is the weight matrix for layer $l$
- $b^{[l]} \in \mathbb{R}^{n^{[l]}}$ is the bias vector for layer $l$
- $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$ is the pre-activation vector for layer $l$
- $a^{[l]} = g^{[l]}(z^{[l]})$ is the activation vector for layer $l$, where $g^{[l]}$ is the activation function for layer $l$


With this notation, the forward propagation algorithm for this neural network can be written as:

$$
\begin{align*}
z^{[1]} &= W^{[1]}a^{[0]} + b^{[1]} \\
a^{[1]} &= g^{[1]}(z^{[1]}) \\
z^{[2]} &= W^{[2]}a^{[1]} + b^{[2]} \\
a^{[2]} &= g^{[2]}(z^{[2]}) \\
z^{[3]} &= W^{[3]}a^{[2]} + b^{[3]} \\
a^{[3]} &= g^{[3]}(z^{[3]})
\end{align*}
$$

Now, let's derive the mathematical steps for each layer:

First Hidden Layer:

$$
\begin{align*}
z^{[1]} &= W^{[1]}a^{[0]} + b^{[1]} \\
&= \begin{bmatrix}
w_{11}^{[1]} & w_{12}^{[1]} & \cdots & w_{1n^{[0]}}^{[1]} \\
w_{21}^{[1]} & w_{22}^{[1]} & \cdots & w_{2n^{[0]}}^{[1]} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n^{[1]}1}^{[1]} & w_{n^{[1]}2}^{[1]} & \cdots & w_{n^{[1]}n^{[0]}}^{[1]}
\end{bmatrix}
\begin{bmatrix}
a_1^{[0]} \\
a_2^{[0]} \\
\vdots \\
a_{n^{[0]}}^{[0]}
\end{bmatrix}
+
\begin{bmatrix}
b_1^{[1]} \\
b_2^{[1]} \\
\vdots \\
b_{n^{[1]}}^{[1]}
\end{bmatrix} \\
&= \begin{bmatrix}
w_{11}^{[1]}a_1^{[0]} + w_{12}^{[1]}a_2^{[0]} + \cdots + w_{1n^{[0]}}^{[1]}a_{n^{[0]}}^{[0]} + b_1^{[1]} \\
w_{21}^{[1]}a_1^{[0]} + w_{22}^{[1]}a_2^{[0]} + \cdots + w_{2n^{[0]}}^{[1]}a_{n^{[0]}}^{[0]} + b_2^{[1]} \\
\vdots \\
w_{n^{[1]}1}^{[1]}a_1^{[0]} + w_{n^{[1]}2}^{[1]}a_2^{[0]} + \cdots + w_{n^{[1]}n^{[0]}}^{[1]}a_{n^{[0]}}^{[0]} + b_{n^{[1]}}^{[1]}
\end{bmatrix} \\
a^{[1]} &= g^{[1]}(z^{[1]}) \\
&= \begin{bmatrix}
g^{[1]}(z_1^{[1]}) \\
g^{[1]}(z_2^{[1]}) \\
\vdots \\
g^{[1]}(z_{n^{[1]}}^{[1]})
\end{bmatrix}
\end{align*}
$$

Second Hidden Layer:
$$
\begin{align*}
z^{[2]} &= W^{[2]}a^{[1]} + b^{[2]} \\
&= \begin{bmatrix}
w_{11}^{[2]} & w_{12}^{[2]} & \cdots & w_{1n^{[1]}}^{[2]} \\
w_{21}^{[2]} & w_{22}^{[2]} & \cdots & w_{2n^{[1]}}^{[2]} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n^{[2]}1}^{[2]} & w_{n^{[2]}2}^{[2]} & \cdots & w_{n^{[2]}n^{[1]}}^{[2]}
\end{bmatrix}
\begin{bmatrix}
a_1^{[1]} \\
a_2^{[1]} \\
\vdots \\
a_{n^{[1]}}^{[1]}
\end{bmatrix}
+
\begin{bmatrix}
b_1^{[2]} \\
b_2^{[2]} \\
\vdots \\
b_{n^{[2]}}^{[2]}
\end{bmatrix} \\
&= \begin{bmatrix}
w_{11}^{[2]}a_1^{[1]} + w_{12}^{[2]}a_2^{[1]} + \cdots + w_{1n^{[1]}}^{[2]}a_{n^{[1]}}^{[1]} + b_1^{[2]} \\
w_{21}^{[2]}a_1^{[1]} + w_{22}^{[2]}a_2^{[1]} + \cdots + w_{2n^{[1]}}^{[2]}a_{n^{[1]}}^{[1]} + b_2^{[2]} \\
\vdots \\
w_{n^{[2]}1}^{[2]}a_1^{[1]} + w_{n^{[2]}2}^{[2]}a_2^{[1]} + \cdots + w_{n^{[2]}n^{[1]}}^{[2]}a_{n^{[1]}}^{[1]} + b_{n^{[2]}}^{[2]}
\end{bmatrix} \\
a^{[2]} &= g^{[2]}(z^{[2]}) \\
&= \begin{bmatrix}
g^{[2]}(z_1^{[2]}) \\
g^{[2]}(z_2^{[2]}) \\
\vdots \\
g^{[2]}(z_{n^{[2]}}^{[2]})
\end{bmatrix}
\end{align*}
$$

Output Layer:
$$
\begin{align*}
z^{[3]} &= W^{[3]}a^{[2]} + b^{[3]} \\
&= \begin{bmatrix}
w_{11}^{[3]} & w_{12}^{[3]} & \cdots & w_{1n^{[2]}}^{[3]} \\
w_{21}^{[3]} & w_{22}^{[3]} & \cdots & w_{2n^{[2]}}^{[3]} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n^{[3]}1}^{[3]} & w_{n^{[3]}2}^{[3]} & \cdots & w_{n^{[3]}n^{[2]}}^{[3]}
\end{bmatrix}
\begin{bmatrix}
a_1^{[2]} \\
a_2^{[2]} \\
\vdots \\
a_{n^{[2]}}^{[2]}
\end{bmatrix}
+
\begin{bmatrix}
b_1^{[3]} \\
b_2^{[3]} \\
\vdots \\
b_{n^{[3]}}^{[3]}
\end{bmatrix} \\
&= \begin{bmatrix}
w_{11}^{[3]}a_1^{[2]} + w_{12}^{[3]}a_2^{[2]} + \cdots + w_{1n^{[2]}}^{[3]}a_{n^{[2]}}^{[2]} + b_1^{[3]} \\
w_{21}^{[3]}a_1^{[2]} + w_{22}^{[3]}a_2^{[2]} + \cdots + w_{2n^{[2]}}^{[3]}a_{n^{[2]}}^{[2]} + b_2^{[3]} \\
\vdots \\
w_{n^{[3]}1}^{[3]}a_1^{[2]} + w_{n^{[3]}2}^{[3]}a_2^{[2]} + \cdots + w_{n^{[3]}n^{[2]}}^{[3]}a_{n^{[2]}}^{[2]} + b_{n^{[3]}}^{[3]}
\end{bmatrix} \\
a^{[3]} &= g^{[3]}(z^{[3]}) \\
&= \begin{bmatrix}
g^{[3]}(z_1^{[3]}) \\
g^{[3]}(z_2^{[3]}) \\
\vdots \\
g^{[3]}(z_{n^{[3]}}^{[3]})
\end{bmatrix}
\end{align*}
$$

This completes the derivation for the forward propagation algorithm for a neural network with one input layer, two hidden layers, and one output layer.
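The derivation above translates almost line for line into NumPy. The sketch below is a minimal illustration rather than a reference implementation: the layer sizes, the random parameters, and the choice of a sigmoid for every $g^{[l]}$ are assumptions made only so the code runs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a0, weights, biases, activations):
    """Return the pre-activations z^[l] (l = 1..L) and activations a^[l] (l = 0..L)."""
    a = a0
    zs, acts = [], [a0]
    for W, b, g in zip(weights, biases, activations):
        z = W @ a + b   # z^[l] = W^[l] a^[l-1] + b^[l]
        a = g(z)        # a^[l] = g^[l](z^[l])
        zs.append(z)
        acts.append(a)
    return zs, acts

# Hypothetical layer sizes n^[0], ..., n^[3] and random parameters
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
weights = [rng.normal(size=(sizes[l], sizes[l - 1])) for l in range(1, 4)]
biases = [rng.normal(size=sizes[l]) for l in range(1, 4)]
a0 = rng.normal(size=sizes[0])

zs, acts = forward(a0, weights, biases, [sigmoid] * 3)
print([a.shape for a in acts])  # [(4,), (5,), (3,), (2,)]
```

Each `W @ a + b` is exactly the matrix-vector product plus bias written out entry by entry in the derivation.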

Below is a step-by-step derivation of backpropagation for a neural network with one input layer, two hidden layers, and one output layer. Let's assume that there are $n^{[0]}$ input features, $n^{[1]}$ neurons in the first hidden layer, $n^{[2]}$ neurons in the second hidden layer, and $n^{[3]}$ neurons in the output layer. We will use the following notation:

- $a^{[0]} \in \mathbb{R}^{n^{[0]}}$ is the input vector
- $W^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}$ is the weight matrix for layer $l$
- $b^{[l]} \in \mathbb{R}^{n^{[l]}}$ is the bias vector for layer $l$
- $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$ is the pre-activation vector for layer $l$
- $a^{[l]} = g^{[l]}(z^{[l]})$ is the activation vector for layer $l$, where $g^{[l]}$ is the activation function for layer $l$
- $y$ is the true output value for the given input $a^{[0]}$
- $J$ is the cost function, which measures the error between the predicted output $\hat{y}$ and the true output $y$

With this notation, the backpropagation algorithm for this neural network can be written as:

- Compute the output layer error: $\delta^{[3]} = \nabla_{\hat{y}} J \odot g^{[3]'}(z^{[3]})$
- Compute the second hidden layer error: $\delta^{[2]} = (W^{[3]})^T \delta^{[3]} \odot g^{[2]'}(z^{[2]})$
- Compute the first hidden layer error: $\delta^{[1]} = (W^{[2]})^T \delta^{[2]} \odot g^{[1]'}(z^{[1]})$
- Compute the gradients for the output layer weights and biases: $\nabla_{W^{[3]}} J = \delta^{[3]} (a^{[2]})^T$, $\nabla_{b^{[3]}} J = \delta^{[3]}$
- Compute the gradients for the second hidden layer weights and biases: $\nabla_{W^{[2]}} J = \delta^{[2]} (a^{[1]})^T$, $\nabla_{b^{[2]}} J = \delta^{[2]}$
- Compute the gradients for the first hidden layer weights and biases: $\nabla_{W^{[1]}} J = \delta^{[1]} (a^{[0]})^T$, $\nabla_{b^{[1]}} J = \delta^{[1]}$
- Update the weights and biases for all layers using the computed gradients and a learning rate $\alpha$:
        
$W^{[l]} := W^{[l]} - \alpha \nabla_{W^{[l]}} J$


$b^{[l]} := b^{[l]} - \alpha \nabla_{b^{[l]}} J$



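Now, let's work through each of these steps. For the output layer error we take the squared-error cost $J = \frac{1}{2}\lVert \hat{y} - y \rVert^2$, for which $\nabla_{\hat{y}} J = \hat{y} - y$: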

- Compute the output layer error:
    $$
    \begin{align*}
    \delta^{[3]} &= \nabla_{\hat{y}} J \odot g^{[3]'}(z^{[3]}) \\
    &= (\hat{y} - y) \odot g^{[3]'}(z^{[3]})
    \end{align*}
    $$

- Compute the second hidden layer error:
    $$
    \begin{align*}
    \delta^{[2]} &= (W^{[3]})^T \delta^{[3]} \odot g^{[2]'}(z^{[2]}) \\
    &= \left((W^{[3]})^T \delta^{[3]}\right) \odot g^{[2]'}(z^{[2]})
    \end{align*}
    $$

- Compute the first hidden layer error:
    $$
    \begin{align*}
    \delta^{[1]} &= (W^{[2]})^T \delta^{[2]} \odot g^{[1]'}(z^{[1]}) \\
    &= \left((W^{[2]})^T \delta^{[2]}\right) \odot g^{[1]'}(z^{[1]})
    \end{align*}
    $$

- Compute the gradients for the output layer weights and biases:
    $$
    \begin{align*}
    \nabla_{W^{[3]}} J &= \delta^{[3]} (a^{[2]})^T \\
    \nabla_{b^{[3]}} J &= \delta^{[3]}
    \end{align*}
    $$

- Compute the gradients for the second hidden layer weights and biases:
    $$
    \begin{align*}
    \nabla_{W^{[2]}} J &= \delta^{[2]} (a^{[1]})^T \\
    \nabla_{b^{[2]}} J &= \delta^{[2]}
    \end{align*}
    $$

- Compute the gradients for the first hidden layer weights and biases:
    $$
    \begin{align*}
    \nabla_{W^{[1]}} J &= \delta^{[1]} (a^{[0]})^T \\
    \nabla_{b^{[1]}} J &= \delta^{[1]}
    \end{align*}
    $$

- Update the weights and biases for all layers using the computed gradients and a learning rate $\alpha$:
    $$
    \begin{align*}
    W^{[l]} &:= W^{[l]} - \alpha \nabla_{W^{[l]}} J \\
    b^{[l]} &:= b^{[l]} - \alpha \nabla_{b^{[l]}} J
    \end{align*}
    $$
    for $l=1,2,3$.
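The backpropagation steps above can be checked numerically. The sketch below assumes a sigmoid activation in every layer (so $g'(z) = a(1 - a)$), the squared-error cost $J = \frac{1}{2}\lVert \hat{y} - y \rVert^2$, and hypothetical layer sizes 2, 3, 3, 1 with random parameters. It compares one backpropagated gradient entry against a finite-difference estimate and then takes a single gradient-descent step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a0, Ws, bs):
    zs, acts = [], [a0]
    for W, b in zip(Ws, bs):
        zs.append(W @ acts[-1] + b)
        acts.append(sigmoid(zs[-1]))
    return zs, acts

def cost(a0, y, Ws, bs):
    return 0.5 * np.sum((forward(a0, Ws, bs)[1][-1] - y) ** 2)

def backward(a0, y, Ws, bs):
    _, acts = forward(a0, Ws, bs)
    # Output error: (y_hat - y) * g'(z), with g'(z) = a (1 - a) for the sigmoid
    delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        # Ws[l] holds W^[l+1]; its gradient is delta^[l+1] (a^[l])^T
        grads_W.insert(0, np.outer(delta, acts[l]))
        grads_b.insert(0, delta)
        if l > 0:
            # Propagate back: ((W^[l+1])^T delta^[l+1]) * g'(z^[l])
            delta = (Ws[l].T @ delta) * acts[l] * (1 - acts[l])
    return grads_W, grads_b

rng = np.random.default_rng(0)
sizes = [2, 3, 3, 1]
Ws = [rng.normal(size=(sizes[l], sizes[l - 1])) for l in range(1, 4)]
bs = [rng.normal(size=sizes[l]) for l in range(1, 4)]
a0, y = rng.normal(size=2), np.array([1.0])

grads_W, grads_b = backward(a0, y, Ws, bs)

# Finite-difference check on one entry of W^[1]
eps = 1e-6
Ws_plus = [W.copy() for W in Ws]
Ws_plus[0][0, 0] += eps
Ws_minus = [W.copy() for W in Ws]
Ws_minus[0][0, 0] -= eps
numeric = (cost(a0, y, Ws_plus, bs) - cost(a0, y, Ws_minus, bs)) / (2 * eps)
print(numeric, grads_W[0][0, 0])

# One gradient-descent step with learning rate alpha
alpha = 0.1
Ws_new = [W - alpha * gW for W, gW in zip(Ws, grads_W)]
bs_new = [b - alpha * gb for b, gb in zip(bs, grads_b)]
```

Note the grouping in the propagation step: the matrix product $(W^{[l+1]})^T \delta^{[l+1]}$ is formed first and only then multiplied elementwise by $g'(z^{[l]})$, since only that order has matching dimensions.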


Below is the same backpropagation derivation specialized to a network with two input neurons, two hidden layers of three neurons each, and a single output neuron. Let's use the following notation:

- $a^{[0]} \in \mathbb{R}^{2}$ is the input vector
- $W^{[1]} \in \mathbb{R}^{3 \times 2}$ is the weight matrix for the first hidden layer
- $b^{[1]} \in \mathbb{R}^{3}$ is the bias vector for the first hidden layer
- $z^{[1]} = W^{[1]}a^{[0]} + b^{[1]}$ is the pre-activation vector for the first hidden layer
- $a^{[1]} = g^{[1]}(z^{[1]})$ is the activation vector for the first hidden layer, where $g^{[1]}$ is the activation function for the first hidden layer
- $W^{[2]} \in \mathbb{R}^{3 \times 3}$ is the weight matrix for the second hidden layer
- $b^{[2]} \in \mathbb{R}^{3}$ is the bias vector for the second hidden layer
- $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$ is the pre-activation vector for the second hidden layer
- $a^{[2]} = g^{[2]}(z^{[2]})$ is the activation vector for the second hidden layer, where $g^{[2]}$ is the activation function for the second hidden layer
- $W^{[3]} \in \mathbb{R}^{1 \times 3}$ is the weight matrix for the output layer
- $b^{[3]} \in \mathbb{R}$ is the bias (a scalar) for the output layer
- $z^{[3]} = W^{[3]}a^{[2]} + b^{[3]}$ is the pre-activation scalar for the output layer
- $y$ is the true output value for the given input $a^{[0]}$
- $J$ is the cost function, which measures the error between the predicted output $\hat{y}$ and the true output $y$

With this notation, the backpropagation algorithm for this neural network can be written as:

- Compute the output layer error: $\delta^{[3]} = \nabla_{\hat{y}} J \odot g^{[3]'}(z^{[3]})$
- Compute the second hidden layer error: $\delta^{[2]} = (W^{[3]})^T \delta^{[3]} \odot g^{[2]'}(z^{[2]})$
- Compute the first hidden layer error: $\delta^{[1]} = (W^{[2]})^T \delta^{[2]} \odot g^{[1]'}(z^{[1]})$
- Compute the gradients for the output layer weights and biases: $\nabla_{W^{[3]}} J = \delta^{[3]} (a^{[2]})^T$, $\nabla_{b^{[3]}} J = \delta^{[3]}$
- Compute the gradients for the second hidden layer weights and biases: $\nabla_{W^{[2]}} J = \delta^{[2]} (a^{[1]})^T$, $\nabla_{b^{[2]}} J = \delta^{[2]}$
- Compute the gradients for the first hidden layer weights and biases: $\nabla_{W^{[1]}} J = \delta^{[1]} (a^{[0]})^T$, $\nabla_{b^{[1]}} J = \delta^{[1]}$
- Update the weights and biases for all layers using the computed gradients and a learning rate $\alpha$:
        
$W^{[l]} := W^{[l]} - \alpha \nabla_{W^{[l]}} J$

$b^{[l]} := b^{[l]} - \alpha \nabla_{b^{[l]}} J$

Now, let's work through each of these steps, taking the squared-error cost $J = \frac{1}{2}(\hat{y} - y)^2$ so that $\nabla_{\hat{y}} J = \hat{y} - y$:

- Compute the output layer error:
    $$
    \begin{align*}
    \delta^{[3]} &= \nabla_{\hat{y}} J \odot g^{[3]'}(z^{[3]}) \\
    &= (\hat{y} - y) \odot g^{[3]'}(z^{[3]})
    \end{align*}
    $$

- Compute the second hidden layer error:
    $$
    \begin{align*}
    \delta^{[2]} &= (W^{[3]})^T \delta^{[3]} \odot g^{[2]'}(z^{[2]}) \\
    &= \left((W^{[3]})^T \delta^{[3]}\right) \odot g^{[2]'}(z^{[2]})
    \end{align*}
    $$

- Compute the first hidden layer error:
    $$
    \begin{align*}
    \delta^{[1]} &= (W^{[2]})^T \delta^{[2]} \odot g^{[1]'}(z^{[1]}) \\
    &= \left((W^{[2]})^T \delta^{[2]}\right) \odot g^{[1]'}(z^{[1]})
    \end{align*}
    $$

- Compute the gradients for the output layer weights and biases:
    $$
    \begin{align*}
    \nabla_{W^{[3]}} J &= \delta^{[3]} (a^{[2]})^T \\
    \nabla_{b^{[3]}} J &= \delta^{[3]}
    \end{align*}
    $$

- Compute the gradients for the second hidden layer weights and biases:
    $$
    \begin{align*}
    \nabla_{W^{[2]}} J &= \delta^{[2]} (a^{[1]})^T \\
    \nabla_{b^{[2]}} J &= \delta^{[2]}
    \end{align*}
    $$

- Compute the gradients for the first hidden layer weights and biases:
    $$
    \begin{align*}
    \nabla_{W^{[1]}} J &= \delta^{[1]} (a^{[0]})^T \\
    \nabla_{b^{[1]}} J &= \delta^{[1]}
    \end{align*}
    $$

- Update the weights and biases for all layers using the computed gradients and a learning rate $\alpha$:
    $$
    \begin{align*}
    W^{[l]} &:= W^{[l]} - \alpha \nabla_{W^{[l]}} J \\
    b^{[l]} &:= b^{[l]} - \alpha \nabla_{b^{[l]}} J
    \end{align*}
    $$
    for $l=1,2,3$.


Below is a step-by-step derivation of forward propagation for the same network: two input neurons, two hidden layers of three neurons each, and one output neuron. Let's use the same notation as above:

- $a^{[0]} \in \mathbb{R}^{2}$ is the input vector
- $W^{[1]} \in \mathbb{R}^{3 \times 2}$ is the weight matrix for the first hidden layer
- $b^{[1]} \in \mathbb{R}^{3}$ is the bias vector for the first hidden layer
- $z^{[1]} = W^{[1]}a^{[0]} + b^{[1]}$ is the pre-activation vector for the first hidden layer
- $a^{[1]} = g^{[1]}(z^{[1]})$ is the activation vector for the first hidden layer, where $g^{[1]}$ is the activation function for the first hidden layer
- $W^{[2]} \in \mathbb{R}^{3 \times 3}$ is the weight matrix for the second hidden layer
- $b^{[2]} \in \mathbb{R}^{3}$ is the bias vector for the second hidden layer
- $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$ is the pre-activation vector for the second hidden layer
- $a^{[2]} = g^{[2]}(z^{[2]})$ is the activation vector for the second hidden layer, where $g^{[2]}$ is the activation function for the second hidden layer
- $W^{[3]} \in \mathbb{R}^{1 \times 3}$ is the weight matrix for the output layer
- $b^{[3]} \in \mathbb{R}$ is the bias (a scalar) for the output layer
- $z^{[3]} = W^{[3]}a^{[2]} + b^{[3]}$ is the pre-activation scalar for the output layer
- $y$ is the true output value for the given input $a^{[0]}$

With this notation, the forward propagation algorithm for this neural network can be written as:

- Compute the pre-activation vector for the first hidden layer: $z^{[1]} = W^{[1]}a^{[0]} + b^{[1]}$
- Compute the activation vector for the first hidden layer: $a^{[1]} = g^{[1]}(z^{[1]})$
- Compute the pre-activation vector for the second hidden layer: $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$
- Compute the activation vector for the second hidden layer: $a^{[2]} = g^{[2]}(z^{[2]})$
- Compute the pre-activation scalar for the output layer: $z^{[3]} = W^{[3]}a^{[2]} + b^{[3]}$
- Compute the predicted output value: $\hat{y} = g^{[3]}(z^{[3]})$

Written out one step at a time:

- Compute the pre-activation vector for the first hidden layer:
    $$
    z^{[1]} = W^{[1]}a^{[0]} + b^{[1]}
    $$

- Compute the activation vector for the first hidden layer:
    $$
    a^{[1]} = g^{[1]}(z^{[1]})
    $$

- Compute the pre-activation vector for the second hidden layer:
    $$
    z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}
    $$

- Compute the activation vector for the second hidden layer:
    $$
    a^{[2]} = g^{[2]}(z^{[2]})
    $$

- Compute the pre-activation scalar for the output layer:
    $$
    z^{[3]} = W^{[3]}a^{[2]} + b^{[3]}
    $$

- Compute the predicted output value:
    $$
    \hat{y} = g^{[3]}(z^{[3]})
    $$


First, let's define the weight matrices and bias vectors for each layer. We'll use random numerical values for demonstration purposes.

\begin{align}
W_1 = \begin{bmatrix}
0.2 & 0.4 \\
0.1 & 0.3 \\
0.5 & 0.7
\end{bmatrix} &&
b_1 = \begin{bmatrix}
0.1 \\
0.2 \\
0.3
\end{bmatrix} \\
W_2 = \begin{bmatrix}
0.4 & 0.6 & 0.8 \\
0.3 & 0.5 & 0.7
\end{bmatrix} &&
b_2 = \begin{bmatrix}
0.4 \\
0.5
\end{bmatrix} \\
W_3 = \begin{bmatrix}
0.9 & 0.2 & 0.1 \\
0.3 & 0.8 & 0.5 \\
0.5 & 0.7 & 0.2
\end{bmatrix} &&
b_3 = \begin{bmatrix}
0.6 \\
0.7 \\
0.8
\end{bmatrix}
\end{align}

Next, let's assume we have an input vector $\mathbf{x}$ of size (2,1). We'll use the following values for demonstration purposes:

\begin{align}
\mathbf{x} = \begin{bmatrix}
0.9 \
0.1 \
\end{bmatrix}
\end{align}

The first step is to calculate the activations of the first hidden layer:

\begin{align}
\mathbf{z}_1 = W_1 \mathbf{x} + b_1 \
\mathbf{a}_1 = \sigma(\mathbf{z}_1)
\end{align}

where $\sigma$ is the sigmoid function:

\begin{align}
\sigma(z) = \frac{1}{1 + e^{-z}}
\end{align}

Substituting in the values:

\begin{align}
\mathbf{z}_1 = \begin{bmatrix}
0.2 & 0.4 \
0.1 & 0.3 \
0.5 & 0.7 \
\end{bmatrix} \begin{bmatrix}
0.9 \
0.1 \
\end{bmatrix} + \begin{bmatrix}
0.1 \
0.2 \
0.3 \
\end{bmatrix} = \begin{bmatrix}
0.74 \
0.92 \
1.78 \
\end{bmatrix} \
\mathbf{a}_1 = \sigma(\mathbf{z}_1) = \begin{bmatrix}
0.676 \
0.715 \
0.855 \
\end{bmatrix}
\end{align}

Next, we calculate the activations of the second hidden layer:

\begin{align}
\mathbf{z}_2 = W_2 \mathbf{a}_1 + b_2 \
\mathbf{a}_2 = \sigma(\mathbf{z}_2)
\end{align}

Substituting in the values:

\begin{align}
\mathbf{z}_2 = \begin{bmatrix}
0.4 & 0.6 & 0.8 \
0.3 & 0.5 & 0.7 \
\end{bmatrix} \begin{bmatrix}
0.676 \
0.715 \
0.855 \
\end{bmatrix} + \begin{bmatrix}
0.4 \
0.5 \
\end{bmatrix} = \begin{bmatrix}
1.265 \
1.450 \
\end{bmatrix} \
\mathbf{a}_2 = \sigma(\mathbf{z}_2) = \begin{bmatrix}
0.779 \
0.810 \
\end{bmatrix}
\end{align}

Finally, we calculate the activations of the output layer:

\begin{align}
\mathbf{z}_3 = W_3 \mathbf{a}_2 + b_3 \
\mathbf{a}_3 = \sigma(\mathbf{z}_3)
\end{align}

Substituting in the values:

\begin{align}
\mathbf{z}_3 = \begin{bmatrix}
0.9 & 0.2 & 0.1 \
0.3 & 0.8 & 0.5 \
0.5 & 0.7 & 0.2 \
\end{bmatrix} \begin{bmatrix}
0.779 \
0.810 \
\end{bmatrix} + \begin{bmatrix}
0.6 \
0.7 \
0.8 \
\end{bmatrix} = \begin{bmatrix}
1.743 \
1.862 \
1.357 \
\end{bmatrix} \
\mathbf{a}_3 = \sigma(\mathbf{z}_3) = \begin{bmatrix}
0.850 \
0.865 \
0.795 \
\end{bmatrix}
\end{align}

Therefore, the output of the neural network for the given input is:

\begin{align}
\mathbf{a}_3 = \begin{bmatrix}
0.850 \
0.865 \
0.795 \
\end{bmatrix}
\end{align}
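
A hand calculation like this is easy to get wrong, so it helps to reproduce it in NumPy. The sketch below assumes `W3` has shape (3, 2), because it multiplies the 2-element activation vector `a2`. The variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[0.2, 0.4],
               [0.1, 0.3],
               [0.5, 0.7]])
b1 = np.array([0.1, 0.2, 0.3])

W2 = np.array([[0.4, 0.6, 0.8],
               [0.3, 0.5, 0.7]])
b2 = np.array([0.4, 0.5])

# Assumption: W3 is 3x2 so that W3 @ a2 is defined for a 2-element a2.
W3 = np.array([[0.9, 0.2],
               [0.3, 0.8],
               [0.5, 0.7]])
b3 = np.array([0.6, 0.7, 0.8])

x = np.array([0.9, 0.1])

# Layer-by-layer forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
z3 = W3 @ a2 + b3
a3 = sigmoid(z3)

for name, value in [("z1", z1), ("a1", a1), ("z2", z2),
                    ("a2", a2), ("z3", z3), ("a3", a3)]:
    print(name, np.round(value, 4))
```

Printing every intermediate vector makes it possible to compare each step against the hand calculation, not only the final output.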


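Next, we walk through backpropagation on a small network with two inputs, a first hidden layer of two neurons, a second hidden layer of three neurons, and a single output neuron. The input to the network is a vector $\mathbf{x} = [x_1, x_2]^T$, and the output is a scalar $y$. We will use the sigmoid activation function $\sigma(z) = \frac{1}{1 + e^{-z}}$ for all the hidden and output neurons.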

First, we need to initialize the weights and biases of the network. Let $w_{ij}^{(l)}$ be the weight of the connection from neuron $i$ in layer $l-1$ to neuron $j$ in layer $l$. Let $b_j^{(l)}$ be the bias of neuron $j$ in layer $l$. We will use random initialization for the weights and biases. Let's assume that the initial weights and biases are:

$w_{11}^{(1)} = 0.2$, $w_{12}^{(1)} = -0.3$, $b_1^{(1)} = -0.4$, $w_{21}^{(1)} = 0.1$, $w_{22}^{(1)} = 0.4$, $b_2^{(1)} = 0.2$

$w_{11}^{(2)} = -0.5$, $w_{12}^{(2)} = 0.3$, $w_{13}^{(2)} = 0.1$, $b_1^{(2)} = 0.1$, $w_{21}^{(2)} = 0.2$, $w_{22}^{(2)} = -0.1$, $w_{23}^{(2)} = -0.2$, $b_2^{(2)} = -0.1$, $b_3^{(2)} = -0.1$

$w_{11}^{(3)} = 0.3$, $w_{21}^{(3)} = -0.2$, $w_{31}^{(3)} = -0.1$, $b^{(3)} = 0.2$

Next, we need to feed the input $\mathbf{x}$ forward through the network to compute the output $y$ and the activations of all the neurons. Let $z_j^{(l)}$ be the weighted sum of the inputs to neuron $j$ in layer $l$, and let $a_j^{(l)}$ be the activation of neuron $j$ in layer $l$. Then the forward pass equations are:

$z_1^{(1)} = w_{11}^{(1)}x_1 + w_{21}^{(1)}x_2 + b_1^{(1)} = 0.2 \times x_1 + 0.1 \times x_2 - 0.4$

$a_1^{(1)} = \sigma(z_1^{(1)})$

$z_2^{(1)} = w_{12}^{(1)}x_1 + w_{22}^{(1)}x_2 + b_2^{(1)} = -0.3 \times x_1 + 0.4 \times x_2 + 0.2$

$a_2^{(1)} = \sigma(z_2^{(1)})$

$z_1^{(2)} = w_{11}^{(2)}a_1^{(1)} + w_{21}^{(2)}a_2^{(1)} + b_1^{(2)} = -0.5a_1^{(1)} + 0.2a_2^{(1)} + 0.1$

$a_1^{(2)} = \sigma(z_1^{(2)})$

$z_2^{(2)} = w_{12}^{(2)}a_1^{(1)} + w_{22}^{(2)}a_2^{(1)} + b_2^{(2)} = 0.3a_1^{(1)} - 0.1a_2^{(1)} - 0.1$

$a_2^{(2)} = \sigma(z_2^{(2)})$

$z_3^{(2)} = w_{13}^{(2)}a_1^{(1)} + w_{23}^{(2)}a_2^{(1)} + b_3^{(2)} = 0.1a_1^{(1)} - 0.2a_2^{(1)} - 0.1$

$a_3^{(2)} = \sigma(z_3^{(2)})$

$z_1^{(3)} = w_{11}^{(3)}a_1^{(2)} + w_{21}^{(3)}a_2^{(2)} + w_{31}^{(3)}a_3^{(2)} + b^{(3)} = 0.3a_1^{(2)} - 0.2a_2^{(2)} - 0.1a_3^{(2)} + 0.2$

$a^{(3)} = \sigma(z_1^{(3)}) = y$
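
The forward pass equations above can be sketched in NumPy as follows. Each matrix stores `W[i, j]` $= w_{ij}$ (row = source neuron, column = destination neuron), so a layer's weighted sums are `W.T @ a + b`. The input values in `x` are hypothetical, and $b_3^{(2)} = -0.1$ is taken from the constant term in the $z_3^{(2)}$ equation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# W[i, j] is the weight from neuron i+1 in the previous layer
# to neuron j+1 in this layer (0-based array indices).
W1 = np.array([[0.2, -0.3],
               [0.1,  0.4]])
b1 = np.array([-0.4, 0.2])

W2 = np.array([[-0.5,  0.3,  0.1],
               [ 0.2, -0.1, -0.2]])
b2 = np.array([0.1, -0.1, -0.1])   # b_3^(2) = -0.1 read off the z_3^(2) equation

W3 = np.array([[ 0.3],
               [-0.2],
               [-0.1]])
b3 = np.array([0.2])

x = np.array([1.0, 0.5])           # hypothetical input [x_1, x_2]

z1 = W1.T @ x + b1                 # weighted sums, first hidden layer
a1 = sigmoid(z1)
z2 = W2.T @ a1 + b2                # weighted sums, second hidden layer
a2 = sigmoid(z2)
z3 = W3.T @ a2 + b3                # weighted sum, output neuron
y = sigmoid(z3)

print("z1 =", z1)
print("y  =", y[0])
```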

Now, we can compute the error between the predicted output $y$ and the true output $y_{\text{true}}$. Let $E$ be the squared error, scaled by $\frac{1}{2}$ so that the factor of 2 cancels when differentiating:

$E = \frac{1}{2}(y - y_{\text{true}})^2$

We want to minimize this error by adjusting the weights and biases of the network. To do this, we will use the backpropagation algorithm to compute the gradients of the error with respect to the weights and biases.

First, we compute the derivative of the error with respect to the output activation:

$\frac{\partial E}{\partial y} = y - y_{\text{true}}$

Next, we compute the derivative of the output activation with respect to its input:

$\frac{\partial y}{\partial z_1^{(3)}} = \sigma(z_1^{(3)})(1 - \sigma(z_1^{(3)}))$
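
The identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ used here can be checked numerically against a central finite difference. This is a small sanity-check sketch; the function name `sigmoid_prime`, the test points, and the step size `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Closed-form derivative: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-4.0, 4.0, 9)      # a few test points
eps = 1e-6
# Central difference approximation of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
max_diff = np.max(np.abs(numeric - sigmoid_prime(z)))
print(max_diff)
```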

Using the chain rule, we can compute the derivative of the error with respect to the input $z_1^{(3)}$:

$\frac{\partial E}{\partial z_1^{(3)}} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial z_1^{(3)}} = (y - y_{\text{true}}) \cdot \sigma(z_1^{(3)})(1 - \sigma(z_1^{(3)}))$

Since $z_1^{(3)} = \sum_i w_{i1}^{(3)} a_i^{(2)} + b^{(3)}$, we can now compute the derivatives of the error with respect to the weights and bias of the output layer:

$\frac{\partial E}{\partial w_{i1}^{(3)}} = \frac{\partial E}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial w_{i1}^{(3)}} = \frac{\partial E}{\partial z_1^{(3)}} \cdot a_i^{(2)}$

$\frac{\partial E}{\partial b^{(3)}} = \frac{\partial E}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial b^{(3)}} = \frac{\partial E}{\partial z_1^{(3)}}$

Now, we need to propagate the error derivatives backwards through the network to compute the gradients for the hidden layers. Let's start with the second hidden layer. We can compute the derivative of the error with respect to the input $z_j^{(2)}$ of neuron $j$ in the second hidden layer using the chain rule:

$\frac{\partial E}{\partial z_j^{(2)}} = \sum_k \frac{\partial E}{\partial z_k^{(3)}} \cdot \frac{\partial z_k^{(3)}}{\partial a_j^{(2)}} \cdot \frac{\partial a_j^{(2)}}{\partial z_j^{(2)}} = \sum_k \frac{\partial E}{\partial z_k^{(3)}} \cdot w_{jk}^{(3)} \cdot \sigma(z_j^{(2)})(1 - \sigma(z_j^{(2)}))$

Using the derivative of the activation function, we can compute the derivatives of the error with respect to the weights and biases in the second hidden layer:

$\frac{\partial E}{\partial w_{ij}^{(2)}} = \frac{\partial E}{\partial z_j^{(2)}} \cdot \frac{\partial z_j^{(2)}}{\partial w_{ij}^{(2)}} = \frac{\partial E}{\partial z_j^{(2)}} \cdot a_i^{(1)}$

$\frac{\partial E}{\partial b_j^{(2)}} = \frac{\partial E}{\partial z_j^{(2)}} \cdot \frac{\partial z_j^{(2)}}{\partial b_j^{(2)}} = \frac{\partial E}{\partial z_j^{(2)}}$

Similarly, we compute the derivatives for the first hidden layer. The derivative of the error with respect to the input $z_j^{(1)}$ of neuron $j$ in the first hidden layer follows from the chain rule:

$\frac{\partial E}{\partial z_j^{(1)}} = \sum_k \frac{\partial E}{\partial z_k^{(2)}} \cdot \frac{\partial z_k^{(2)}}{\partial a_j^{(1)}} \cdot \frac{\partial a_j^{(1)}}{\partial z_j^{(1)}} = \sum_k \frac{\partial E}{\partial z_k^{(2)}} \cdot w_{jk}^{(2)} \cdot \sigma(z_j^{(1)})(1 - \sigma(z_j^{(1)}))$

Using the derivative of the activation function, we can compute the derivatives of the error with respect to the weights and biases in the first hidden layer:

$\frac{\partial E}{\partial w_{ij}^{(1)}} = \frac{\partial E}{\partial z_j^{(1)}} \cdot \frac{\partial z_j^{(1)}}{\partial w_{ij}^{(1)}} = \frac{\partial E}{\partial z_j^{(1)}} \cdot x_i$

$\frac{\partial E}{\partial b_j^{(1)}} = \frac{\partial E}{\partial z_j^{(1)}} \cdot \frac{\partial z_j^{(1)}}{\partial b_j^{(1)}} = \frac{\partial E}{\partial z_j^{(1)}}$
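
The gradient formulas above can be sketched in NumPy and checked against a finite-difference estimate. This is an illustrative sketch under stated assumptions: the input `x` and target `y_true` are hypothetical, $b_3^{(2)} = -0.1$ is read off the $z_3^{(2)}$ equation, and matrices store `W[i, j]` $= w_{ij}$, so that $\partial E / \partial w_{ij} = a_i \, \partial E / \partial z_j$ becomes an outer product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    (W1, b1), (W2, b2), (W3, b3) = params
    a1 = sigmoid(W1.T @ x + b1)
    a2 = sigmoid(W2.T @ a1 + b2)
    y = sigmoid(W3.T @ a2 + b3)
    return a1, a2, y

def loss(x, y_true, params):
    y = forward(x, params)[2]
    return 0.5 * np.sum((y - y_true) ** 2)

def backward(x, y_true, params):
    (W1, b1), (W2, b2), (W3, b3) = params
    a1, a2, y = forward(x, params)
    d3 = (y - y_true) * y * (1.0 - y)       # dE/dz^(3)
    d2 = (W3 @ d3) * a2 * (1.0 - a2)        # dE/dz^(2): sum_k delta_k w_jk, times sigma'
    d1 = (W2 @ d2) * a1 * (1.0 - a1)        # dE/dz^(1)
    # Weight gradient: dE/dw_ij = a_i * delta_j; bias gradient: delta_j
    return [(np.outer(x, d1), d1),
            (np.outer(a1, d2), d2),
            (np.outer(a2, d3), d3)]

W1 = np.array([[0.2, -0.3], [0.1, 0.4]])
b1 = np.array([-0.4, 0.2])
W2 = np.array([[-0.5, 0.3, 0.1], [0.2, -0.1, -0.2]])
b2 = np.array([0.1, -0.1, -0.1])
W3 = np.array([[0.3], [-0.2], [-0.1]])
b3 = np.array([0.2])

x = np.array([1.0, 0.5])    # hypothetical input
y_true = 1.0                # hypothetical target

grads = backward(x, y_true, [(W1, b1), (W2, b2), (W3, b3)])

# Finite-difference check on a single weight, w_12^(2) (array index [0, 1])
eps = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 1] += eps
W2_minus[0, 1] -= eps
numeric = (loss(x, y_true, [(W1, b1), (W2_plus, b2), (W3, b3)])
           - loss(x, y_true, [(W1, b1), (W2_minus, b2), (W3, b3)])) / (2.0 * eps)
analytic = grads[1][0][0, 1]
print(analytic, numeric)
```

If the two printed numbers disagree noticeably, the usual cause is an index mix-up between $w_{ij}$ and $w_{ji}$.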

Now we have computed all the gradients necessary to update the weights and biases of the network. Let's use a learning rate $\alpha$ to control the size of the updates. The weight and bias updates are:

$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha \frac{\partial E}{\partial w_{ij}^{(l)}}$

$b_j^{(l)} \leftarrow b_j^{(l)} - \alpha \frac{\partial E}{\partial b_j^{(l)}}$

We repeat this process for multiple iterations until the error falls to an acceptable level.

Note that the above steps can be summarized as follows:

1. Initialize the weights and biases randomly.
2. Feed the input forward through the network to compute the output and activations.
3. Compute the error and its derivative with respect to the output activation.
4. Compute the derivatives of the error with respect to the weights and biases in the output layer.
5. Propagate the error derivatives backwards through the network to compute the derivatives for the hidden layers.
6. Use the gradients to update the weights and biases of the network.
7. Repeat steps 2-6 for multiple iterations until the error falls to an acceptable level.
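
The summary above can be sketched as a short training loop. This is a minimal illustration on a single training example, not a complete training procedure: the random seed, the initialization scale, the learning rate `alpha = 0.5`, the number of iterations, and the values of `x` and `y_true` are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Return the activations of every layer, starting with the input."""
    acts = [x]
    for W, b in params:
        acts.append(sigmoid(W.T @ acts[-1] + b))
    return acts

def backward(acts, y_true, params):
    """Return a list of (dE/dW, dE/db) pairs, one per layer."""
    y = acts[-1]
    delta = (y - y_true) * y * (1.0 - y)            # dE/dz at the output
    grads = []
    for layer in reversed(range(len(params))):
        a_prev = acts[layer]
        grads.append((np.outer(a_prev, delta), delta))
        if layer > 0:
            W = params[layer][0]
            # Propagate delta back through this layer's weights
            delta = (W @ delta) * a_prev * (1.0 - a_prev)
    return grads[::-1]

# Step 1: random initialization (2 inputs, hidden layers of 2 and 3, 1 output)
rng = np.random.default_rng(42)
sizes = [2, 2, 3, 1]
params = [(rng.normal(scale=0.5, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = np.array([1.0, 0.5])    # hypothetical training input
y_true = 1.0                # hypothetical target
alpha = 0.5                 # assumed learning rate

initial_loss = 0.5 * np.sum((forward(x, params)[-1] - y_true) ** 2)

for step in range(200):
    acts = forward(x, params)                       # step 2
    grads = backward(acts, y_true, params)          # steps 3-5
    params = [(W - alpha * dW, b - alpha * db)      # step 6
              for (W, b), (dW, db) in zip(params, grads)]

final_loss = 0.5 * np.sum((forward(x, params)[-1] - y_true) ** 2)
print(initial_loss, final_loss)
```

In practice the loop would iterate over a whole dataset, often in mini-batches, and stop based on a validation metric instead of a fixed iteration count.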
