# Artificial Neural Networks - Introduction:

ANN-based models consistently rank near the top in ML coding competitions and feature heavily in research. However, ANN implementation has faced a few bottlenecks over the roughly 50 years since ANN exploration began.
![image.png](attachment:image.png)

Deep learning has many applications:

![image.png](attachment:image.png)

1. Artificial neural networks are said to be inspired by the structure of the human brain.
2. So to understand ANN, we need to know how the information is transmitted within the brain and how brain neurons are wired.

The human brain is the basic analogy for a neural network system.

![image.png](attachment:image.png)

# Perceptron:

In this segment, you will study the basics of a simple device called the perceptron, which was the first step towards creating the large neural networks that we have developed today. Let's take an example to understand how a perceptron works.


Consider a sushi place you plan to visit this Saturday. There are various factors that would affect this decision, such as:

    The distance between the sushi place and your home
    The cost of the food they serve there
    The number of people accompanying you

You make such a decision based on multiple such factors. Also, each decision factor has a different ‘weight’, for example, the distance of the place might be more important than the number of people accompanying you. 

 

Perceptrons work in a similar manner. They take some signals as inputs and perform a set of simple calculations to arrive at a decision.

![image.png](attachment:image.png)

The perceptron takes a weighted sum of multiple inputs (plus a bias) as the cumulative input and applies an output function to this cumulative input to get the output, which then assists in making a decision.

![image.png](attachment:image.png)

Where, xi’s represent the inputs, wi’s represent the weights associated with inputs and b represents bias.

 Let’s say w and x are vectors representing weights and inputs as follows (note that, by default, a vector is assumed to be a column vector):

![image.png](attachment:image.png)

![image.png](attachment:image.png)

With the above dot product at hand, we can rewrite the cumulative input function as:
![image.png](attachment:image.png)

We then apply the step function to the cumulative input. According to the step function, if this cumulative sum of inputs is greater than 0, then the output is 1/yes; or else, it is 0/no.

![image.png](attachment:image.png)

If you want to block out any input, just set the respective weight (w) to zero.
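The perceptron described above can be sketched in a few lines of plain Python. This is only an illustration: the input values, weights and bias for the sushi decision are made-up numbers, and the names `step` and `perceptron` are hypothetical helpers, not part of any library.

```python
def step(z):
    """Step function: 1 (yes) if the cumulative input is greater than 0, else 0 (no)."""
    return 1 if z > 0 else 0


def perceptron(x, w, b):
    """Return the perceptron's decision for inputs x, weights w and bias b."""
    # Cumulative input: the dot product of w and x, plus the bias.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return step(z)


# Hypothetical, already-scaled inputs: [closeness, affordability, company]
x = [0.9, 0.4, 0.5]
w = [2.0, 1.0, 0.5]   # closeness matters most in this made-up example
b = -2.0              # bias: a general reluctance to go out

decision = perceptron(x, w, b)            # weighted sum is positive, so 1 (go)

# Setting a weight to zero blocks the corresponding input entirely.
w_blocked = [0.0, 1.0, 0.5]
decision_blocked = perceptron(x, w_blocked, b)   # closeness is ignored, so 0
```

With the first weight set to zero, changing the first input has no effect on the decision at all, which is exactly what "blocking" an input means.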

![image.png](attachment:image.png)

where a1, a2, a3 and a4 are activations from other neurons. There are different kinds of activation functions, and each of them is discussed in this notebook.

**NOTE:** a's represent the inputs, w's represent the weights associated with the inputs, and b represents the bias of the neuron.

Multiple artificial neurons in a neural network are arranged in different layers. The first layer is known as the input layer, and the last layer is called the output layer. The layers in between these two are the hidden layers.

The number of neurons in the input layer is equal to the number of attributes in the data set, and the number of neurons in the output layer is determined by the number of classes of the target variable **(for a classification problem)**.

For a **regression problem**, the number of neurons in the output layer would be 1 (a numeric variable). Take a look at the image given below to understand the topology of neural networks in the case of classification and regression problems. 

![image.png](attachment:image.png)

**NOTE:**
The number of hidden layers, the number of neurons in each hidden layer and the activation functions used in the neural network change according to the problem, and these details determine the topology or structure of the neural network.

**SPECIFICATIONS OF A NEURAL NETWORK:**

We need to specify the following to define a neural network:
1. The structure/topology, viewed as a graph:
    1. All nodes in the graph are neurons.
    2. All edges are interconnections.
2. The input layer
3. The output layer
4. The weights

    ![image.png](attachment:image.png)

5. The activation functions
6. The biases, where a bias is a constant added to the cumulative input into a neuron.

![image.png](attachment:image.png)

_**the number of neurons in the input layer is determined by the input given to the network, and the number of neurons in the output layer is equal to the number of classes (for a classification task) or is one (for a regression task)**_

### Applications of Neural Networks:
1. Speech recognition - Fourier coefficients, wavelet coefficients, etc. of a sound.
2. Text recognition - text can be converted into numeric form using **"word embeddings"**. Word embeddings can encode the semantics of words, i.e. two words with similar meanings get similar representations. **"One-hot encoding"** can also be used, but it is practical only for smaller vocabularies; for large vocabularies, it is not feasible. Also, OHE cannot capture semantics.
3. Image recognition - an image can be represented numerically as pixel values concatenated in row-major order, column-major order or any zigzag order, as long as the same order is used consistently.
4. Numerical features and so on.

**NOTE:** All inputs to a neural network **must** be numerical in nature, so we need to convert all of the above raw data into numerical values.

For different types of input data, you need to use different ways to convert the inputs into a numeric form. The most commonly used inputs for ANNs are as follows:

1. Structured data: The type of data that we use in standard machine learning algorithms with multiple features and available in two dimensions, such that the data can be represented in a tabular format, can be used as input for training ANNs. Such data can be stored in CSV files, MAT files, Excel files, etc. This is highly convenient because the input to an ANN is usually given as a numeric feature vector. Such structured data eases the process of feeding the input into the ANN. 

2. Text data: For text data, you can use a one-hot vector or word embeddings corresponding to a certain word. For example, in one hot vector encoding, if the vocabulary size is |V|, then you can represent the word wn as a one-hot vector of size |V| with '1' at the nth element with all other elements being zero. The problem with one-hot representation is that, usually, the vocabulary size |V| is huge, in tens of thousands at least; hence, it is often better to use word embeddings that are a lower-dimensional representation of each word.

3. Images: Images are naturally represented as arrays of numbers and can thus be fed into the network directly. These numbers are the raw pixels of an image. ‘Pixel’ is short for ‘picture element’. In images, pixels are arranged in rows and columns (an array of pixel elements). The figure given below shows the image of a handwritten 'zero' in the MNIST data set (black and white) and its corresponding representation in NumPy as an array of numbers. The pixel values are high where the intensity is high, i.e., the color is bright, while the values are low in the black regions, as shown below.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In a neural network, each pixel of the input image is a feature. For example, the image provided above is an 18 x 18 array. Hence, it will be fed as a vector of size 324 into the network. Note that the image given above is black and white (also called a grayscale image), and thus, each pixel has only one ‘channel’. If it were a colored image, i.e. an RGB (Red, Green and Blue) image, each pixel would have three channels, one each for red, green and blue. Hence, the number of neurons in the input layer would be 18 x 18 x 3 = 972. The three channels of an RGB image are shown below.

![image-2.png](attachment:image-2.png)
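The pixel-to-feature arithmetic above can be checked with a short sketch. The images here are dummy arrays filled with zeros (stand-ins for real data), and `flatten_row_major` is a hypothetical helper name.

```python
def flatten_row_major(image):
    """Concatenate the rows of a 2-D (grayscale) image into one feature vector."""
    return [pixel for row in image for pixel in row]


height, width = 18, 18

# A dummy 18 x 18 grayscale image: one value (channel) per pixel.
gray_image = [[0 for _ in range(width)] for _ in range(height)]
gray_features = flatten_row_major(gray_image)
print(len(gray_features))  # 324 input neurons

# A dummy RGB image: three channel values per pixel.
channels = 3
rgb_image = [[[0] * channels for _ in range(width)] for _ in range(height)]
rgb_features = [value for row in rgb_image for pixel in row for value in pixel]
print(len(rgb_features))  # 972 input neurons
```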

4. Speech: In the case of a speech/voice input, the basic input unit is the phoneme. Phonemes are the distinct units of speech in any language. The speech signal is in the form of waves, and to convert these waves into numeric inputs, you need to use Fourier transforms (you do not need to worry about this, as it belongs to an area of specialized mathematics that will not be covered in this course). Note that the input after conversion should be numeric, so that you are able to feed it into a neural network.
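Returning to the text-data case in item 2, the following is a minimal sketch of one-hot encoding. The five-word vocabulary is a made-up toy example and `one_hot` is a hypothetical helper; real vocabularies are far larger, which is exactly why the vectors become impractical.

```python
def one_hot(word, vocabulary):
    """Return a one-hot vector of size |V| with a 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


# Toy vocabulary (an assumption for illustration only).
vocabulary = ["cat", "dog", "sushi", "good", "great"]

good = one_hot("good", vocabulary)
great = one_hot("great", vocabulary)

# The dot product of two different one-hot vectors is always 0, so the
# encoding says nothing about "good" and "great" having similar meanings.
similarity = sum(a * b for a, b in zip(good, great))
```

Each vector has as many elements as there are words in the vocabulary, and every pair of distinct words looks equally unrelated. Word embeddings address both issues by using shorter, dense vectors.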


### Softmax function:

One of the commonly used output functions for classification is the softmax function, represented graphically as follows:

![image.png](attachment:image.png)


A softmax output is similar to what we get from a multiclass logistic function commonly used to compute the probability of an output belonging to one of the multiple classes. It is given by the following formula: 

![image.png](attachment:image.png)

where c is the number of classes or neurons in the output layer, x′ is the input to the network, and wi’s are the weights associated with the inputs.

Suppose the output layer has 3 neurons, all receiving the same input x', and the weights associated with the three classes are w0, w1 and w2. Then the probability of the input belonging to each of the classes can be determined as:
![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)
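The three-class case can be sketched numerically as follows. The input x' and the weight vectors w0, w1 and w2 are made-up values, and subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Shifting by the maximum avoids overflow and leaves the output unchanged.
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical input x' and one weight vector per class (w0, w1, w2).
x_prime = [1.0, 2.0]
weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]

# Score of class i is the dot product w_i . x'
scores = [sum(wi * xi for wi, xi in zip(w, x_prime)) for w in weights]
probs = softmax(scores)   # p0, p1, p2

predicted_class = probs.index(max(probs))
```

The class with the largest score also gets the largest probability, and the probabilities always add up to 1.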

For binary classification, the softmax function reduces to the sigmoid function:
![image.png](attachment:image.png)

In the case of a sigmoid output, there is only one neuron in the output layer because if there are two classes with probabilities p0 and p1, we know that p0+p1=1. Hence, we need to compute only one of p0 and p1. In other words, the sigmoid function is just a special case of the softmax function.

In fact, we can derive the sigmoid function from the softmax function, as shown below. Let's assume that the softmax function has two neurons with the following outputs:

![image.png](attachment:image.png)

Consider only p1 and divide both the numerator and the denominator by the numerator. We can now rewrite p1 as:

![image-2.png](attachment:image-2.png)

And, if we replace w1−w0 = some w, we get the sigmoid function. 
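Written out in full, and assuming the two outputs are the softmax expressions with weights w0 and w1 applied to the same input x′, the derivation reads:

$$
p_1 = \frac{e^{w_1 \cdot x'}}{e^{w_0 \cdot x'} + e^{w_1 \cdot x'}}
    = \frac{1}{1 + e^{(w_0 - w_1) \cdot x'}}
    = \frac{1}{1 + e^{-(w_1 - w_0) \cdot x'}}
    = \frac{1}{1 + e^{-w \cdot x'}}, \qquad w = w_1 - w_0
$$

The second step divides the numerator and the denominator by $e^{w_1 \cdot x'}$, and the last expression is the sigmoid function applied to $w \cdot x'$.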

### MNIST image recognition dataset:

![image.png](attachment:image.png)

## Workings of a single neuron:
What is an artificial neuron, and what does it do? What are activation functions, and what effect do the weight and bias terms have on the neuron?

The weight terms decide how much each input variable contributes to the output of the neuron, as seen in the figure below:
![image-3.png](attachment:image-3.png)

![image-2.png](attachment:image-2.png)

The value of z can lie anywhere between -∞ and ∞, so the activation function bounds this value to a perceivable range.

![image-4.png](attachment:image-4.png)

### Commonly used activation functions:
![image-9.png](attachment:image-9.png)


### Different activation functions:

The different activation functions of a neuron are covered in the section above. The figure below shows how the output of a neuron is calculated:

![image.png](attachment:image.png)

The activation functions introduce non-linearity in the network, thereby enabling the network to solve highly complex problems. Problems that take the help of neural networks require the ANN to recognise complex patterns and trends in the given data set. If we do not introduce non-linearity, the output will be a linear function of the input vector. This will not help us in understanding more complex patterns present in the data set. 

Also, non-linearity allows the task to be performed by a compact neural network with a minimal number of neurons instead of a large network of linear neurons.
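The claim that a network without non-linearity stays linear can be verified with a small sketch. The weights and biases below are arbitrary made-up numbers, and `matvec`, `matmul` and `add` are hypothetical helpers written in plain Python to keep the example dependency-free.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both given as lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


# Two layers with NO activation function (arbitrary example values).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 neurons
b1 = [0.5, 0.0, -1.0]
W2 = [[1.0, 0.0, 2.0]]                       # 3 neurons -> 1 output
b2 = [0.25]

x = [2.0, -3.0]
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same mapping collapses into a single linear layer:
# W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)
```

Both computations give the same output, so stacking layers without a non-linear activation adds no expressive power beyond a single linear layer.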

While choosing activation functions, you need to ensure that they are:

    Non-linear,
    Continuous, and
    Monotonically increasing.

#### Features of the various activation functions:

The features of these activation functions are as follows:

1. Sigmoid: When this type of function is applied, the output from the activation function is bound between 0 and 1 and is not centred around zero. A sigmoid activation function is usually used when we want to regularise the magnitude of the outputs we get from a neural network and ensure that this magnitude does not blow up.
2. Tanh (Hyperbolic Tangent): When this type of function is applied, the output is centred around 0 and bound between -1 and 1. This is unlike the sigmoid function, whose output is centred around 0.5 and is always positive.
3. ReLU (Rectified Linear Unit): The output of this activation function is linear in nature when the input is positive and the output is zero when the input is negative. This activation function allows the network to converge very quickly, and hence, its usage is computationally efficient. However, its use in neural networks does not help the network to learn when the values are negative.
4. Leaky ReLU (Leaky Rectified Linear Unit): This activation function is similar to ReLU. However, it enables the neural network to learn even when the values are negative. When the input to the function is negative, it dampens the magnitude, i.e., the input is multiplied with an epsilon factor that is usually a number less than one. On the other hand, when the input is positive, the function is linear and gives the input value as the output. We can control the parameter to allow how much ‘learning emphasis’ should be given to the negative value.
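The four activation functions listed above can be sketched in plain Python as follows. The default `epsilon=0.01` for leaky ReLU is an assumed, commonly used value, not one given in these notes.

```python
import math


def sigmoid(z):
    """Output bound between 0 and 1, centred around 0.5."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Output bound between -1 and 1, centred around 0."""
    return math.tanh(z)


def relu(z):
    """Linear for positive inputs, zero for negative inputs."""
    return max(0.0, z)


def leaky_relu(z, epsilon=0.01):
    """Like ReLU, but negative inputs are scaled by a small epsilon instead of zeroed."""
    return z if z > 0 else epsilon * z


# Compare the four functions on a negative, a zero and a positive input.
for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```

For the negative input, ReLU returns exactly zero while leaky ReLU returns a small negative value, which is what lets some learning signal pass through.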

## Parameter and Hyperparameters in Neural Networks:

**Q: Neural networks require rigorous training, but what does it mean to train neural networks? What are the parameters that the network learns during training, and what are the hyperparameters that you (as the network designer) need to specify beforehand?**

Recall that models such as linear regression and logistic regression are trained on their coefficients, i.e., the task is to find the optimal values of the coefficients to minimize a cost function.

Neural networks are no different; they are trained on weights and biases.

The model is trained until the optimum values of the weights and biases are found, on the basis of which we can select the model.

During training, the neural network learning algorithm fits various models to the training data and selects the best prediction model. The learning algorithm is trained with a fixed set of hyperparameters associated with the network structure. Some of the important hyperparameters to consider to decide the network structure are given below:

    Number of layers
    Number of neurons in the input, hidden and output layers
    Learning rate (the step size taken each time we update the weights and biases of an ANN)
    Number of epochs (the number of times the entire training data set passes through the neural network)

The purpose of training the learning algorithm is to obtain optimum weights and biases that form the parameters of the network.

Note: You will learn about hyperparameters such as learning rate and the number of epochs in the subsequent session. In this session, we will focus on the number of layers and the number of neurons in each layer.

![image.png](attachment:image.png)

## Assumptions for Simplifying Neural Network:

Commonly used neural network architectures make the following simplifying assumptions:

1. The neurons in an ANN are arranged in layers, and these layers are arranged sequentially.
2. The neurons within the same layer do not interact with each other.
3. The inputs are fed into the network through the input layer, and the outputs are sent out from the output layer.
4. Neurons in consecutive layers are densely connected, i.e., all neurons in layer l are connected to all neurons in layer l+1.
5. Every neuron in the neural network has a bias value associated with it, and each interconnection has a weight associated with it.
6. All neurons in a particular hidden layer use the same activation function. Different hidden layers can use different activation functions, but in a hidden layer, all neurons use the same activation function.
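Under these assumptions, the number of parameters follows directly from the layer sizes. The sketch below counts them for a hypothetical 324-16-10 network; it assumes that input-layer neurons only pass values on and therefore carry no bias, which is the usual convention but is not stated explicitly above.

```python
def count_parameters(layer_sizes):
    """Return (number of weights, number of biases) for a densely connected network."""
    # Dense connections: every neuron in layer l connects to every neuron in layer l+1.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # One bias per neuron in every layer after the input layer (assumption).
    biases = sum(layer_sizes[1:])
    return weights, biases


# Hypothetical topology: 324 inputs (18 x 18 pixels), 16 hidden neurons, 10 classes.
layer_sizes = [324, 16, 10]
weights, biases = count_parameters(layer_sizes)
print(weights, biases, weights + biases)  # 5344 26 5370
```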

![image-2.png](attachment:image-2.png)

#### Flow of information between layers:

![image.png](attachment:image.png)

In traditional neural networks, information is always fed forward into the next layer, which is why they are called **feedforward neural networks**. This means that there are no loops in the network, i.e., information is always fed forward, never fed backward.


Matrix representation of feedforward propagation:

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

### Feedforward Algorithm:

Pseudocode for a feedforward pass through the network for a single data point xi:

1. We initialise the variable h0 as the input: h0=xi
2. We loop through each of the layers computing the corresponding output for each layer, i.e., h(l). For l in [1,2,......,L]: ![image.png](attachment:image.png)
3. We compute the prediction p by applying an activation function to the output from the previous layer, i.e., we apply a function to hL, as shown below.  p=f(hL)
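The three steps above can be turned into a runnable sketch. Several things here are assumptions made for illustration: each layer is taken to compute an activation of (weights times previous output, plus bias), the hidden activation is sigmoid, the output function f is softmax, and the 2-3-2 network with its weights is entirely made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def dense(W, b, h):
    """Cumulative input of one layer: W h + b, one value per neuron."""
    return [sum(w * hi for w, hi in zip(row, h)) + bi for row, bi in zip(W, b)]


def feedforward(x, layers, activation=sigmoid, output_fn=softmax):
    """Run a single data point x through the network and return the prediction p."""
    h = x                                   # step 1: h0 = xi
    for W, b in layers:                     # step 2: loop over layers 1..L
        h = [activation(z) for z in dense(W, b, h)]
    return output_fn(h)                     # step 3: p = f(hL)


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 2 output neurons.
layers = [
    ([[0.2, -0.5], [0.7, 0.1], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.5, -0.4, 0.3], [-0.6, 0.2, 0.9]], [0.05, -0.05]),
]
p = feedforward([1.0, 2.0], layers)
```

The result `p` holds one probability per class. The whole pass is just repeated weighted sums and activations, with no loops back to earlier layers.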

#### Classification problem: using feedforward algorithm:
![image.png](attachment:image.png)

### Loss Function:

A loss function (or cost function) tells us how wrong the prediction of the neural network is by quantifying the error in the prediction.

A loss function or cost function is a function that maps an event or values of one or more variables onto a real number, intuitively representing some ‘cost’ associated with the ‘event’, as shown below:

![image.png](attachment:image.png)

Neural networks minimise the error in the prediction by optimising the loss function with respect to the parameters in the network. In other words, this optimisation is done by adjusting the weights and biases. We will see how this adjustment is done in subsequent sessions. For now, we will concentrate on how to compute the loss. 

In the case of regression, the most commonly used loss function is **MSE/RSS.**

In the case of classification, the most commonly used loss function is **Cross Entropy/Log Loss.**
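Both losses are simple to compute by hand. The sketch below uses the standard definitions of mean squared error and of cross entropy for a one-hot target; the example values are made up, and the small `eps` is an assumed safeguard against taking the logarithm of zero.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, p_pred))


# Regression: predictions close to the targets give a small loss.
regression_loss = mse([3.0, 5.0], [2.5, 5.5])

# Classification: the true class is class 1 (one-hot target).
target = [0, 1, 0]
confident_right = cross_entropy(target, [0.1, 0.8, 0.1])
confident_wrong = cross_entropy(target, [0.8, 0.1, 0.1])
```

The cross entropy is small when the network puts high probability on the true class and grows quickly when it confidently picks the wrong one.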

### What Is Learning in Neural Networks?

The training task is to compute the optimal weights and biases by minimising some cost function. How neural networks are trained is discussed here:

The standard way to train neural networks is the gradient descent algorithm, where the descent is towards a point that has the optimal weights and biases.

Once the topology of a NN is chosen, its architecture and activation functions are usually fixed, much like the hardware of a gadget. Tuning for optimum performance can then happen only through the 'software', i.e. the weights and biases.

**NOTE:**
1. For an n-dimensional setting, the gradient will be an n-dimensional vector and not a scalar.
2. The gradient vector is in the direction in which the loss value increases most rapidly.
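Since the gradient points in the direction of steepest increase, gradient descent moves the parameters a small step in the opposite direction. The sketch below applies this to a made-up two-parameter loss, L(w, b) = (w - 3)^2 + (b + 1)^2, whose minimum is known to be at w = 3, b = -1; the learning rate of 0.1 and the 100 steps are arbitrary choices for illustration.

```python
def loss(w, b):
    """Toy loss with its minimum at w = 3, b = -1 (an assumption for illustration)."""
    return (w - 3.0) ** 2 + (b + 1.0) ** 2


def gradient(w, b):
    """Gradient of the toy loss: a 2-dimensional vector, one entry per parameter."""
    return [2.0 * (w - 3.0), 2.0 * (b + 1.0)]


w, b = 0.0, 0.0          # arbitrary starting point
learning_rate = 0.1      # step size for each update
initial_loss = loss(w, b)

for _ in range(100):
    dw, db = gradient(w, b)
    # Step AGAINST the gradient, since the gradient points uphill.
    w -= learning_rate * dw
    b -= learning_rate * db

final_loss = loss(w, b)
```

After the loop, `w` and `b` are very close to 3 and -1, and the loss has dropped from its initial value to almost zero.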

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)