# Artificial Neural Networks

Surprisingly, ANNs have been around for quite a while: they were first introduced back in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts.

ANNs entered a long winter, and by the 1990s other powerful machine learning techniques had been invented, such as support vector machines. These techniques seemed to offer better results and stronger theoretical foundations than ANNs, so the study of neural networks was put on hold.

We are now witnessing yet another wave of interest in ANNs, and there are a few good reasons to believe that this renewed interest will have a much more profound impact on our lives:

*   Relatively small tweaks in training algorithms have had a huge positive impact.

*   The tremendous increase in computing power since the 1990s now makes it possible to train large neural networks in a reasonable amount of time. This is in part due to Moore's law, but also thanks to the gaming industry, which has stimulated the production of powerful GPU cards by the millions.

*   There is now a huge quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.



## Biological Neurons

Birds inspired us to fly, burdock plants inspired Velcro, and nature has inspired countless more inventions. It seems only logical, then, to look at the brain's architecture for inspiration on how to build an intelligent machine. This is the logic that sparked artificial neural networks (ANNs), machine learning models inspired by the networks of biological neurons found in our brains.

Our brain consists of roughly 90 billion neurons. These neurons produce short electrical impulses called *action potentials*, which travel along the axons and make the synapses release chemical signals. When a neuron receives a sufficient amount of these signals within a few milliseconds, it fires its own electrical impulses.

Thus, individual biological neurons seem to behave in a simple way, but they're organized in a vast network of billions, with each neuron typically connected to thousands of other neurons. Highly complex computations can be performed by a network of fairly simple neurons, much like a complex anthill can emerge from the combined efforts of simple ants.

<figure>
<img src="https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png" alt="ai-ml-dl" width="600"/>
<figcaption>Fig.1 - The anatomy of a multipolar neuron, which has a single axon and many dendrites.</figcaption>
</figure>





## Logistic regression

### Introduction

Logistic regression is a statistical method used for binary classification tasks, where the aim is to predict the probability that an instance belongs to a particular class. It models the probability of the binary outcome using the logistic function, a sigmoid curve that maps any real-valued number into the range $(0, 1)$. The input features are combined linearly using weights, and the logistic function is then applied to the result to produce the probability. We can draw a computational graph to gain a clearer understanding.

<figure>
<img src="https://github.com/bbirke/ml-python/blob/main/images/logistic_regression.png?raw=true" alt="ai-ml-dl" width="600"/>
<figcaption>Fig.2 - A computational graph for logistic regression.</figcaption>
</figure>

The learnable parameters are, as we have seen with polynomial regression, the weights and the intercept (also called the bias).

$z$ is calculated by multiplying the transposed weight vector $W$ with the input vector $X$ and adding the intercept/bias $b$.

$$z = W^TX + b = \sum_{n = 0}^{N} w_n x_n + b = w_0 x_0 + w_1 x_1 + \dots + w_N x_N + b$$

The final output $\hat{y}$ is calculated as

$$\hat{y} = \sigma (z)$$

where $\sigma$ is the sigmoid function defined as

$$\sigma (z) = \frac{1}{1 + e^{-z}}$$

which returns values strictly between $0$ and $1$ for any real input.

<figure>
<img src="https://github.com/bbirke/ml-python/blob/main/images/sigmoid.svg?raw=true" alt="ai-ml-dl" width="600"/>
<figcaption>Fig.2 - The plot of the Sigmoid activation function.</figcaption>
</figure>

We use the sigmoid function because its output can be interpreted as a probability. Let's consider an example where we calculated $z = 4$ and pass $z$ to the sigmoid function. The result is roughly $0.98$, which means that the model assigns a probability of about $98 \%$ to class $1$.
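The $z = 4$ example can be checked numerically with a minimal sketch. The weights, inputs, and bias below are hypothetical values, chosen only so that the weighted sum comes out to $4$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, inputs, and bias chosen so that z = 4
w = np.array([1.0, 2.0])
x = np.array([2.0, 0.5])
b = 1.0

z = w @ x + b        # 1*2 + 2*0.5 + 1 = 4
y_hat = sigmoid(z)   # roughly 0.98

print(z, round(float(y_hat), 3))
```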

So how does the training process work for a logistic regression model?

### Training

We will walk through each step involved in training our model, using the example of classifying grayscale images as either containing a cat (output $1$) or not containing a cat (output $0$).

Our feature vector in this example contains the flattened pixel values of an image, so with images of size $32 \times 32$ our resulting feature vector would consist of $1024$ pixel values in the range of $[0, 255]$.

Each pixel value is multiplied by its associated weight. The weighted pixels and the bias are then summed up and used as input for the sigmoid function. The resulting value lies between $0$ and $1$ and can be compared with the true label by a loss (error) function.

All these steps, from the pixels to the calculation of the loss, are called **forward propagation**.
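A single forward pass for one image could look like the following sketch. The random image, the label, the scaling of pixels to $[0, 1]$, the random initial weights, and the choice of binary cross-entropy as the loss are all assumptions made for illustration; the text above does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 32x32 grayscale image with integer pixel values in [0, 255]
image = rng.integers(0, 256, size=(32, 32))
y = 1  # hypothetical label: 1 = cat, 0 = no cat

# Flatten to a 1024-element feature vector. Scaling to [0, 1] is an
# assumption added here to keep z in a numerically comfortable range.
x = image.reshape(-1) / 255.0

# Small random weights and a zero bias as an arbitrary starting point
w = rng.normal(scale=0.01, size=x.shape[0])
b = 0.0

z = w @ x + b                      # weighted sum of the pixels plus bias
y_hat = 1.0 / (1.0 + np.exp(-z))   # sigmoid: predicted probability of "cat"

# Binary cross-entropy loss (an assumed choice of loss function)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(x.shape, float(y_hat), float(loss))
```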

During **back propagation**, the gradient of the loss function with respect to each parameter (every weight and the bias) is computed, and each parameter is adjusted by a small step in the direction opposite to its gradient.

This cycle of forward and back propagation is repeated over the samples of our dataset until the error stops decreasing.
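The cycle of forward and back propagation can be sketched as a short training loop. This is a minimal illustration under several assumptions: the data is a small synthetic two-feature dataset rather than cat images, the loss is binary cross-entropy, all samples are used in every update (full-batch gradient descent), and the function name `train_logistic_regression`, the learning rate, and the number of epochs are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias with full-batch gradient descent (illustrative)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    losses = []
    eps = 1e-12  # avoids log(0)
    for _ in range(epochs):
        # Forward propagation: predictions and mean binary cross-entropy loss
        y_hat = sigmoid(X @ w + b)
        loss = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
        losses.append(loss)

        # Back propagation: gradients of the loss w.r.t. weights and bias
        error = y_hat - y
        grad_w = X.T @ error / m
        grad_b = error.mean()

        # Step against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

# Hypothetical synthetic data: label is 1 when the two features sum to > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, losses = train_logistic_regression(X, y)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(losses[0], losses[-1], accuracy)
```

The loss starts at about $\ln 2$ because the zero-initialized model predicts $0.5$ for every sample, and it decreases as the parameters are updated.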

## ANNs

So what does logistic regression have to do with neural networks? In simple terms, logistic regression can be seen as a single-layer neural network. In logistic regression, we compute a linear combination of the input features and pass it through a non-linear activation function (the sigmoid function) to obtain the output. This output represents the probability of belonging to a certain class.

Neural networks extend this concept by introducing one or more layers (**hidden layers**) between the input and output layers. Each layer consists of nodes (**neurons**). When every neuron in a layer is connected to every neuron in the subsequent layer, the layer is called fully connected, also known as a dense layer. Every node performs a computation similar to logistic regression: a weighted sum of its inputs plus a bias, followed by an activation function.

By stacking multiple layers and using non-linear activation functions such as ReLU or tanh, neural networks can capture highly non-linear relationships in the data.

<figure>
<img src="https://github.com/bbirke/ml-python/blob/main/images/ann.png?raw=true" alt="ai-ml-dl" width="600"/>
<figcaption>Fig.3 - A fully connected neural network.</figcaption>
</figure>
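A forward pass through a small fully connected network can be sketched as follows. The layer sizes (4 inputs, 5 hidden neurons, 1 output), the random weights, and the helper name `dense` are hypothetical and chosen only for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b, activation):
    """One fully connected layer: weighted sum plus bias, then activation."""
    return activation(W @ x + b)

rng = np.random.default_rng(0)

# Hypothetical network: 4 inputs -> 5 hidden neurons -> 1 output neuron
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)

h = dense(x, W1, b1, relu)         # hidden layer activations
y_hat = dense(h, W2, b2, sigmoid)  # output probability, as in logistic regression

print(h.shape, y_hat.shape)
```

Each row of a weight matrix holds the weights of one neuron, so the output neuron here is exactly a logistic regression applied to the hidden activations `h`.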

### Activation functions
In ANNs, the activation function serves the purpose of injecting non-linearity into the network. This enables the modeling of a response variable (target variable or class label) that varies non-linearly with its explanatory variables. Non-linear means that the output cannot be reproduced by a linear combination of the inputs.

In simpler terms, if a network lacks a non-linear activation function, it behaves like a single linear layer, even if it has multiple layers. This is because a composition of linear functions is itself a linear function, so the additional layers add no expressive power.
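This collapse can be verified numerically. The sketch below uses hypothetical random weights and shows that two stacked layers without an activation function give the same output as one layer with weights $W_2 W_1$ and bias $W_2 b_1 + b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical layers without any activation function
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping expressed as a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
single_layer = W @ x + b

print(np.allclose(two_layer, single_layer))
```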

The most commonly used activation functions are:

#### ReLU
The Rectified Linear Unit (ReLU) is defined as
<br>
<br>
$$f(x) = \text{max}(0,x)$$
<br>
<br>
<figure>
<img src="https://github.com/bbirke/ml-python/blob/main/images/relu.svg?raw=true" alt="ai-ml-dl" width="600"/>
<figcaption>Fig.4 - The plot of the ReLU activation function.</figcaption>
</figure>
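As a quick sketch, ReLU is a one-liner in NumPy. The input values are arbitrary examples.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
out = relu(x)

# Negative inputs become 0, non-negative inputs pass through unchanged
print(out)
```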

#### Sigmoid
The sigmoid or logistic activation function, which we have already seen, is defined as
<br>
<br>
$$\sigma (x) = \frac{1}{1 + e^{-x}}$$
<br>
<br>
<figure>
<img src="https://github.com/bbirke/ml-python/blob/main/images/sigmoid.svg?raw=true" alt="ai-ml-dl" width="600"/>
<figcaption>Fig.5 - The plot of the Sigmoid activation function.</figcaption>
</figure>

#### Hyperbolic tangent
The tanh activation function, also called the hyperbolic tangent activation function, produces output values between $-1$ and $1$ and is defined as
<br>
<br>
$$\text{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
<br>
<br>
<figure>
<img src="https://github.com/bbirke/ml-python/blob/main/images/tanh.svg?raw=true" alt="ai-ml-dl" width="600"/>
<figcaption>Fig.6 - The plot of the tanh activation function.</figcaption>
</figure>
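The definition can be checked against NumPy's built-in `np.tanh` with a short sketch. It also illustrates that tanh is a rescaled and shifted sigmoid, $\text{tanh}(x) = 2\sigma(2x) - 1$. The sample points are arbitrary.

```python
import numpy as np

def tanh(x):
    """Hyperbolic tangent written out from its definition."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)

matches_numpy = np.allclose(tanh(x), np.tanh(x))
matches_sigmoid = np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

print(matches_numpy, matches_sigmoid)
```

In practice `np.tanh` is preferable, since the hand-written formula overflows for inputs with a large absolute value.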

#### Softmax
The softmax function converts a vector of $K$ real numbers into a probability distribution over $K$ possible outcomes and is defined as

$$\text{softmax}(x)_i = \frac{e^{x_i}}{ \sum_{j=1}^{K} e^{x_j}}$$

for $i = 1,...,K$
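
A minimal softmax sketch in NumPy is shown below. Subtracting the maximum before exponentiating is a common numerical-stability measure added here as an assumption; it does not change the result because the common factor cancels in the fraction. The input scores are hypothetical.

```python
import numpy as np

def softmax(x):
    """Convert a vector of real numbers into probabilities that sum to 1."""
    # Shifting by the maximum leaves the result unchanged but avoids overflow
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([1.0, 2.0, 3.0])  # hypothetical raw scores for 3 classes
probs = softmax(scores)

print(probs.round(3))  # approximately 0.090, 0.245, 0.665
```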