# **Neural Networks**

**What will you learn?**
1. **Introduction :** Intro to ANNs
2. **Why do we need NN?**
3. **Example with Linear Boundaries** : Negation, AND, OR
4. **Example with Non-Linear Boundaries** : XOR
5. **Terminology**
6. **Propagation**
7. **Cost Function**
8. **Multiclass Classification** : One Hot Encoding
9. **Sklearn Implementation** : MLPClassifier

##**Introduction**


Artificial Neural Networks are one of the main tools used in machine learning. As the “neural” part of their name suggests, they are brain-inspired systems which are intended to replicate the way that we humans learn. Neural networks consist of input and output layers, as well as (in most cases) a hidden layer consisting of units that transform the input into something that the output layer can use. They are excellent tools for finding patterns which are far too complex or numerous for a human programmer to extract and teach the machine to recognize.

For a basic idea of how a deep learning neural network learns, imagine a factory line. After the raw materials (the data set) are input, they are then passed down the conveyer belt, with each subsequent stop or layer extracting a different set of high-level features. If the network is intended to recognize an object, the first layer might analyze the brightness of its pixels.

The next layer could then identify any edges in the image, based on lines of similar pixels. After this, another layer may recognize textures and shapes, and so on. By the time the fourth or fifth layer is reached, the deep learning net will have created complex feature detectors. It can figure out that certain image elements (such as a pair of eyes, a nose, and a mouth) are commonly found together.

Once this is done, the researchers who have trained the network can give labels to the output, and then use backpropagation to correct any mistakes which have been made. After a while, the network can carry out its own classification tasks without needing humans to help every time.

<img src = "https://files.codingninjas.in/3nn-7659.gif" width = 800>



##**Why do we need NN?**

Neural Networks have been around even before machine learning gained pace. But they were thought to be computationally too heavy and hence, brushed aside.

A problem we faced with Logistic Regression was that, even though the decision function (Sigmoid) was non-linear, we got a linear decision boundary. We fixed this problem by adding dummy features with higher powers.

To do that, we had to experiment and decide the degree of the features we needed to add. Ideally, the model should find a suitable non-linear decision boundary on its own.

Logistic regression had the following structure:

<img src = "https://files.codingninjas.in/nn1-7661.jpg" width = 500>

The intuition behind Neural Networks is as follows:

<img src = "https://files.codingninjas.in/nn2-7662.jpg" width = 500>

So, here the final output will not be linear with respect to $x_1, x_2, x_0$.
The functions $f_1, f_2$ need not necessarily be Sigmoid. We can choose any function. Using this method we can create quite interesting decision boundaries without applying the dummy feature method.

##**Example with Linear Decision Boundaries**

To understand how to reach the boundaries, let's take a simple example.

###**Example 1 : Negation**

x | y | 
:---:|:---:|
1|0|
0|1|

<img src = "https://files.codingninjas.in/negation-7688.jpg" width = 450>

Here, the function used is 
$$\frac{1}{1 + e^{-z}}$$

We want to pick values of $w_0$ and $w_1$ so that we reach the correct answer.

**Case 1** : When $x = 0$, we want $y = 1$. So, we want $z\geq0$.
$$w_0 + w_1x \geq 0$$
$$w_0 \geq 0$$

This means we need to keep $w_0$ at a large positive value, so that the sigmoid output is very close to 1. Let's take $w_0 = 50$.

Hence, $z = 50$, which is what we wanted.

**Case 2** : When $x = 1$, we want $y = 0$. So, we want $z\leq0$.
$$w_0 + w_1x \leq 0$$
$$w_0 + w_1 \leq 0$$

Let's take $w_1 = -100$.

Hence, $z = 50 - 100 = -50$, which is what we wanted.
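The two cases above can be checked numerically. The following is a minimal sketch that uses the weights chosen in this example ($w_0 = 50$, $w_1 = -100$); the function names `sigmoid` and `negation` are our own and only serve the illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def negation(x, w0=50.0, w1=-100.0):
    """Single neuron with the weights chosen above: z = w0 + w1 * x."""
    z = w0 + w1 * x
    return sigmoid(z)

for x in (0, 1):
    # Rounding maps the sigmoid output to the nearest class label (0 or 1).
    print(x, round(negation(x)))
```

Because $|z| = 50$ in both cases, the sigmoid output is extremely close to 1 or 0, so rounding gives the negation table exactly.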

###**Example 2 : OR**

<img src = "https://files.codingninjas.in/or-7690.jpg" width = 500>

$x_1$ | $x_2$ | $y$ 
:---:|:---:|:---:|
0|0|0
1|0|1
0|1|1
1|1|1

Here, the function used is 
$$\frac{1}{1 + e^{-z}}$$

We want to pick values of $w_0$, $w_1$ and $w_2$ so that we reach the correct answer.

**Case 1** : When $x_1 = 0$ and $x_2 = 0$, we want $y = 0$.

Therefore,
$$ w_1x_1 + w_2x_2 + w_0(1) < 0 $$
$$ w_1(0) + w_2(0) + w_0 < 0 $$
$$ w_0 < 0 $$

Let's take $w_0 = -50$.

**Case 2** : When $x_1 = 1$ and $x_2 = 0$, we want $y = 1$.

Therefore,
$$ w_1x_1 + w_2x_2 + w_0(1) > 0 $$
$$ w_1(1) + w_2(0) + w_0 > 0 $$
$$ w_0 + w_1 > 0 $$

Let's take $w_1 = 100$.

**Case 3** : When $x_1 = 0$ and $x_2 = 1$, we want $y = 1$.

Therefore,
$$ w_1x_1 + w_2x_2 + w_0(1) > 0 $$
$$ w_1(0) + w_2(1) + w_0 > 0 $$
$$ w_0 + w_2 > 0 $$

Let's take $w_2 = 100$.

So, we can draw the table as :

$x_1$ | $x_2$ | $y$ | $z$ | $y_p$
:---:|:---:|:---:|:---:|:---:|
0|0|0|-50|0
1|0|1|50|1
0|1|1|50|1
1|1|1|150|1
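The table above can be reproduced with a few lines of code. This is a small sketch using the weights picked in the three cases ($w_0 = -50$, $w_1 = 100$, $w_2 = 100$); the helper name `neuron` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x1, x2, w0, w1, w2):
    """Return (z, predicted label) for a single sigmoid neuron."""
    z = w0 + w1 * x1 + w2 * x2
    return z, round(sigmoid(z))

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]
# Weights chosen in the OR example above.
or_rows = [neuron(x1, x2, w0=-50, w1=100, w2=100) for x1, x2 in inputs]

for (x1, x2), (z, y_p) in zip(inputs, or_rows):
    print(x1, x2, z, y_p)
```

The same `neuron` helper can be reused to check whatever weights you come up with for the AND gate.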

Try doing the calculations for **AND** Gate

$x_1$ | $x_2$ | $y$ 
:---:|:---:|:---:|
0|0|0
1|0|0
0|1|0
1|1|1

##**Example with Non Linear Decision Boundaries**

###**Example 1 : XOR**

$x_1$ | $x_2$ | XOR | AND | NOR
:---:|:---:|:---:|:---:|:---:|
0|0|0|0|1
1|0|1|0|0
0|1|1|0|0
1|1|0|1|0

If we look at the table closely, XOR is 1 exactly when the outputs of AND and NOR are both 0.
If either AND or NOR is 1, the output of XOR is 0.

So we can combine AND and NOR to reach XOR.
Taking AND to be $f_1$ and NOR to be $f_2$, we can say that NOR($f_1$, $f_2$) will give the desired output.

<img src = "https://files.codingninjas.in/xor-7691.jpg">

Verify the results with your own calculations.
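One way to verify the construction is to build the two-layer network in code. The sketch below assumes one possible set of weights for AND ($w_0 = -150$, $w_1 = w_2 = 100$) and NOR ($w_0 = 50$, $w_1 = w_2 = -100$); these values are our own choice and the weights in the figure may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit(w0, w1, w2):
    """Build a two-input sigmoid neuron with fixed weights (illustrative helper)."""
    def fire(a, b):
        # Rounding keeps the printed values readable; with weights this large
        # the raw sigmoid output is already extremely close to 0 or 1.
        return round(sigmoid(w0 + w1 * a + w2 * b))
    return fire

AND = unit(-150, 100, 100)   # assumed weights, one valid choice among many
NOR = unit(50, -100, -100)   # assumed weights, one valid choice among many

def xor(x1, x2):
    f1 = AND(x1, x2)      # hidden unit 1
    f2 = NOR(x1, x2)      # hidden unit 2
    return NOR(f1, f2)    # output unit combines the two hidden units

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, xor(x1, x2))
```

No single neuron of this kind can produce XOR, but stacking two layers of them does, which is the point of the hidden layer.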

##**Terminology**

**Neuron** : A single unit in any layer is called a neuron.

**Input Layer** : The input layer communicates with the external environment that presents a pattern to the neural network. Its only job is to take in the inputs. The input layer should represent the condition for which we are training the neural network. Every input neuron should represent some independent variable that has an influence on the output of the neural network.

**Hidden Layer** : A hidden layer is a collection of neurons with an activation function applied to them; it is an intermediate layer between the input layer and the output layer. Its job is to process the outputs of the previous layer, so it is the layer responsible for extracting the required features from the input data.


**Output Layer** : The output layer collects the information from the last hidden layer and presents it in the form the network was designed to produce. The pattern presented by the output layer can be traced back to the input layer. The number of neurons in the output layer should be directly related to the task the neural network is performing.


The weights for each neuron are found by a training algorithm.  
Let's say layer $j$ has $U_j$ units and layer $j+1$ has $U_{j+1}$ units. The number of weights between the two layers is $(U_j + 1) * U_{j+1}$, where the $+ 1$ accounts for the bias term. This is the number of parameters we need to train between two layers.  
Because a neural network has so many parameters, it can easily overfit, which we need to take care of.    
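To get a feel for how quickly the parameter count grows, the formula $(U_j + 1) * U_{j+1}$ can be summed over every pair of consecutive layers. The function name `count_parameters` and the example layer sizes below are our own illustration.

```python
def count_parameters(layer_sizes):
    """Total trainable parameters of a fully connected network.

    layer_sizes lists the units per layer, input layer first.
    Each pair of consecutive layers contributes (U_j + 1) * U_{j+1}.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 4 inputs, one hidden layer of 20 units, 3 outputs:
# (4 + 1) * 20 + (20 + 1) * 3 = 100 + 63 = 163
print(count_parameters([4, 20, 3]))
```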
    
    
What we need to decide is :
1. How many hidden layers do we want?
2. How many neurons should each layer have?
3. Which function should be applied in the hidden and output layers?


<img src = "https://files.codingninjas.in/network-7689.png">

##**Propagation**


###**Forward Propagation**

The input X provides the initial information, which then propagates through the hidden units at each layer and finally produces the output $\hat{y}$. This is called Forward Propagation. The architecture of the network entails determining its depth, width, and the activation functions used on each layer. Depth is the number of hidden layers. Width is the number of units (nodes) in each hidden layer; the dimensions of the input and output layers are fixed by the data, so we do not control them. There are several common activation functions, such as Rectified Linear Unit (ReLU), Sigmoid and Hyperbolic tangent (tanh). In practice, deeper networks have often been found to outperform shallower networks that simply have more hidden units, so adding depth is frequently worthwhile, although with diminishing returns and a higher training cost.  
The actual error is calculated at the output layer, since that is the only place where predictions can be compared with the true values.  
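Forward propagation can be sketched in a few lines of NumPy. This is a minimal illustration, assuming sigmoid activations on every layer, randomly initialised weights and made-up layer sizes (4 inputs, 5 hidden units, 3 outputs); none of these choices come from a specific library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Propagate X through every layer and return all layer activations."""
    a = X
    activations = [a]
    for W, b in zip(weights, biases):
        a = sigmoid(a @ W + b)   # weighted sum followed by the activation
        activations.append(a)
    return activations

rng = np.random.default_rng(0)
sizes = [4, 5, 3]                          # assumed layer sizes
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(sizes, sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

X = rng.normal(size=(2, 4))                # two example data points
activations = forward(X, weights, biases)
y_hat = activations[-1]                    # output of the last layer
print(y_hat.shape)
```

The intermediate activations are kept because backward propagation needs them later.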

###**Backward Propagation**

Backpropagation refers to the method of calculating the gradient of neural network parameters. In short, the method traverses the network in reverse order, from the output to the input layer, according to the chain rule from calculus. The algorithm stores any intermediate variables (partial derivatives) required while calculating the gradient with respect to some parameters.  
To update the weights, that is, to push the error back through the layers, we use Backward Propagation.  
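The chain rule can be made concrete for a small network. The sketch below assumes one hidden layer, sigmoid activations and a mean squared error loss; it computes the gradients by backward propagation and compares one of them with a numerical estimate. All names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, W1, b1, W2, b2):
    """Mean squared error of a one-hidden-layer sigmoid network."""
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return np.mean(np.sum((y - a2) ** 2, axis=1))

def gradients(X, y, W1, b1, W2, b2):
    """Gradients of the loss above, obtained with the chain rule."""
    m = X.shape[0]
    # Forward pass, keeping the intermediate activations.
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    # Backward pass: start at the output layer and move towards the input.
    delta2 = (2.0 / m) * (a2 - y) * a2 * (1.0 - a2)   # dLoss/dz2
    dW2 = a1.T @ delta2
    db2 = delta2.sum(axis=0)
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)        # dLoss/dz1
    dW1 = X.T @ delta1
    db1 = delta1.sum(axis=0)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=(5, 1)).astype(float)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

dW1, db1, dW2, db2 = gradients(X, y, W1, b1, W2, b2)

# Numerical check of a single entry using central differences.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(X, y, W1_plus, b1, W2, b2)
           - loss(X, y, W1_minus, b1, W2, b2)) / (2 * eps)
print(dW1[0, 0], numeric)

# One gradient descent step with an assumed learning rate.
learning_rate = 0.1
W1_new = W1 - learning_rate * dW1
W2_new = W2 - learning_rate * dW2
```

The analytic and numerical values should agree closely, which is a common sanity check when implementing backpropagation by hand.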

##**Cost Function**

For neural networks, as for any other algorithm, the cost function has the same general form. Here, 

$$ Cost = Error + \lambda Regularisation$$

For regularisation, we will use $\sum w_j^2$, which is the sum of the squares of all the weights.

Now, error function can be :

$$ (y_t - y_{pred})^2 $$

Therefore, 

$$ Cost = \frac{1}{m}\sum(y_t - y_{pred})^2 + \frac{\lambda}{2m} \sum w_j^2$$
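The cost formula above translates directly into code. This is a small sketch with made-up predictions and weight matrices; the function name `cost` and the value of `lam` ($\lambda$) are assumptions for illustration.

```python
import numpy as np

def cost(y_true, y_pred, weights, lam):
    """Mean squared error plus the (lambda / 2m) * sum of squared weights."""
    m = len(y_true)
    error = np.sum((y_true - y_pred) ** 2) / m
    regularisation = lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)
    return error + regularisation

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.2, 0.6, 0.1])
# Two small weight matrices standing in for the layers of a network.
weights = [np.array([[1.0, -2.0]]), np.array([[0.5], [0.5]])]

# Error term: (0.01 + 0.04 + 0.16 + 0.01) / 4 = 0.055
# Regularisation term: 0.1 / 8 * (1 + 4 + 0.25 + 0.25) = 0.06875
c = cost(y_true, y_pred, weights, lam=0.1)
print(c)
```

A larger $\lambda$ makes large weights more expensive, which pushes the network towards simpler decision boundaries.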

##**How to handle Multiclass Classification**



To handle this, the basic idea is to add additional weights. Everything else remains the same, but the output layer now has multiple units.

Let's assume that there are 3 classes that can be predicted.

<img src = "https://files.codingninjas.in/multiclass-7687.png">

Let's say the values predicted are:
$y_1 = 0.1$, $y_2 = 0.15$ and  $y_3 = 0.99$.

We will say that the data points belong to the max value class, in this case Class 3. The true value of the output will be in the form of a vector like [0, 0, 1].

Now, for the above model, the true label of each data point will also be in the form of a vector, just like the output.

If datapoint $x^1$ belongs to the 1st class, then its target vector is [1, 0, 0].


Similarly, if $x^2$ belongs to the 3rd class, then its target vector is [0, 0, 1].

Such a target is called a **One Hot Encoded** target.


Cost function changes to :

$$Cost = \sum^m_{i = 1} \sum^k_{j = 1} f(y_{i}^j(pred),\enspace y_i^j(true)) + \frac{\lambda}{2m} \sum w_j^2 $$

The extra summation $\sum^k_{j = 1}$ penalises every output unit whose value differs from the one hot encoding. Hence, the error is calculated not just for the unit of the correct class, but also for the other units. For example, if the true vector is [0, 1, 0] and we predict [0.8, 1, 0.1], the cost penalises the 0.8 and the 0.1, because every unit contributes to the sum.

The first summation runs over the **datapoints, of which there are m**, while the second runs over the **units in the output layer, of which there are k**. So if we have 3 units in the output layer, k will be 3. 
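The double summation can be illustrated with the [0, 1, 0] versus [0.8, 1, 0.1] example. The sketch below assumes that $f$ is the squared error, which is only one possible choice; the function name `multiclass_cost` is hypothetical.

```python
import numpy as np

def multiclass_cost(Y_true, Y_pred, weights, lam):
    """Sum f over all m datapoints and k output units, plus regularisation.

    f is taken to be the squared error here (an assumption for illustration).
    """
    m = Y_true.shape[0]
    per_unit = (Y_pred - Y_true) ** 2      # shape (m, k): one term per unit
    error = per_unit.sum()                 # sums over both i and j
    regularisation = lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)
    return error + regularisation, per_unit

Y_true = np.array([[0.0, 1.0, 0.0]])
Y_pred = np.array([[0.8, 1.0, 0.1]])

# No weights and lam = 0, so only the error term remains.
total, per_unit = multiclass_cost(Y_true, Y_pred, weights=[], lam=0.0)
print(per_unit)   # the 0.8 and 0.1 units are penalised, the correct unit is not
print(total)
```

Even though the unit for the correct class is exactly right, the cost is non-zero because the other two units are not close to 0.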

  
##### Explanation of the Multiclass   
So far we have been using **one vs one or one vs rest** for multiclass classification.  
With one vs rest, for example, that means training 10 different models for 10 different classes.  
That's not something we will be doing for neural networks.  
Let's take the example of a normal neural network with one unit in the output layer.  
The only thing that changes for multiclass classification is that **the output layer will have multiple units**.  
**So for binary classes, we will have only one output unit. For more than two classes, we will have n output units for n classes.  
For example, for 3 classes we will have 3 output units, and these three units are trained against binary targets in the form of One Hot Encoding.**  
So the changes will be  
1.  Change the training data Y to be **One Hot Encoded**.  
    If we are using an algorithm from a library, we usually don't have to do One Hot Encoding manually; it is done automatically.  
    If we are implementing it ourselves, we have to do it manually.  
   
2.  To predict, pick the class for which the Ypred value is maximum.  
    Let's say we have 3 different classes, $c_{1}$, $c_{2}$, $c_{3}$; whichever unit gives the highest output, we pick that class as the predicted class.  
   
   
**So let's say we have 2 classes, we will have only one unit in the output layer.  
If we have 3 classes, we will have 3 units in the output layer, if we have 4 classes, we will have 4 units, and so on for n classes.**
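Both changes, encoding the targets and picking the maximum output, are short to write by hand. The following sketch uses NumPy with 0-based class indices (so class 1 is index 0); the helper name `one_hot` is our own.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Turn integer class labels into one hot encoded target vectors."""
    encoded = np.zeros((len(labels), n_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# x^1 belongs to the 1st class, x^2 to the 3rd class (0-based indices 0 and 2).
labels = np.array([0, 2])
Y = one_hot(labels, 3)
print(Y)

# Prediction: the class whose output unit has the largest value.
outputs = np.array([[0.1, 0.15, 0.99]])
predicted = outputs.argmax(axis=1)
print(predicted)   # index 2, i.e. Class 3
```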


##**MLP Classifier in Sklearn**

scikit-learn's MLP classifier is not a very efficient implementation: it offers no GPU support and is not intended for large-scale applications. It is not advisable to use it for neural networks on large datasets or in an actual product.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

In [None]:
clf  =  MLPClassifier()  # Creating object 
iris = datasets.load_iris()  # Loading dataset
X = iris.data
Y = iris.target
xtrain, xtest, ytrain, ytest = train_test_split(X, Y) # here ytrain is not onehotencoded, classifier will take care of it.
clf.fit(xtrain, ytrain)  # Training neural network 



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

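When the cell above is run, sklearn may show a ConvergenceWarning (its text is not reproduced here), which states that the number of iterations has reached its maximum value.  
The optimizer ran for 200 iterations (the default `max_iter`) and has not converged yet.  
We can change that limit by changing the `max_iter` value.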

In [None]:
clf.score(xtest,ytest) # Obtaining score 

0.9210526315789473

In [None]:
clf.predict(xtest)   # results

array([1, 2, 1, 2, 0, 1, 2, 0, 2, 2, 2, 1, 1, 2, 2, 2, 0, 2, 2, 1, 0, 0,
       0, 1, 1, 2, 0, 2, 2, 1, 0, 0, 2, 2, 2, 0, 1, 2])

###**Important Parameters**  
**hidden_layer_sizes : tuple, length = n_layers - 2, default=(100,)**
The ith element represents the number of neurons in the ith hidden layer.  
We can change that. (100, 200) means we need two hidden layers, one with 100 units, second with 200 units.  
(100,20,30) means we need three hidden layers one with 100 units, second with 20 units and third with 30 units.  

**activation : {‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default=’relu’**
Activation function for the hidden layer. It decides which function the hidden layers use.  

‘identity’, no-op activation, useful to implement linear bottleneck, returns f(x) = x

‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).

‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x).

‘relu’, the rectified linear unit function, returns f(x) = max(0, x)

**batch_size : int, default=’auto’**
Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatch. When set to “auto”, batch_size=min(200, n_samples)  
  
**alpha : default = 0.0001**  
Alpha is the strength of the L2 regularization term.  
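The parameters above can be combined in a single constructor call. The values below (three hidden layers, `tanh`, `alpha=0.001`, `batch_size=32`, `max_iter=500`, `random_state=0`) are arbitrary choices for illustration, not recommended settings; depending on the run, sklearn may still emit a convergence warning.

```python
from sklearn import datasets
from sklearn.neural_network import MLPClassifier

X, y = datasets.load_iris(return_X_y=True)

clf = MLPClassifier(
    hidden_layer_sizes=(100, 20, 30),  # three hidden layers: 100, 20 and 30 units
    activation="tanh",                 # activation used by the hidden layers
    alpha=0.001,                       # L2 regularization strength
    batch_size=32,                     # minibatch size for the 'adam' solver
    max_iter=500,
    random_state=0,                    # fixed seed so runs are repeatable
)
clf.fit(X, y)

# One weight matrix per pair of consecutive layers (the coefs_ attribute).
shapes = [W.shape for W in clf.coefs_]
print(shapes)
print(clf.n_layers_)   # input layer + 3 hidden layers + output layer
```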
  


###**Important Attributes**
**coefs_ : list of shape (n_layers - 1,)**
The ith element in the list represents the weight matrix corresponding to layer i.

**intercepts_ : list of shape (n_layers - 1,)**
The ith element in the list represents the bias vector corresponding to layer i + 1.

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

In [2]:
clf  =  MLPClassifier()  # Creating object 
iris = datasets.load_iris()  # Loading dataset
X = iris.data
Y = iris.target
xtrain, xtest, ytrain, ytest = train_test_split(X, Y) # here ytrain is not onehotencoded, classifier will take care of it.
clf.fit(xtrain, ytrain)



MLPClassifier()

In [3]:
clf.score(xtest,ytest) # Obtaining score 

0.8947368421052632

**Changing the hidden layer sizes**

In [5]:
clf  =  MLPClassifier(hidden_layer_sizes = (20, ), max_iter = 3000)
clf.fit(xtrain,ytrain)

MLPClassifier(hidden_layer_sizes=(20,), max_iter=3000)

It didn't give the warning, which means *it has converged*.

In [6]:
clf.score(xtest, ytest)
# score has improved.

0.9210526315789473

**If we want, we can look at the individual weights on each of the layers**

In [7]:
clf.coefs_

[array([[-5.16135766e-11, -2.92928568e-03, -8.75015148e-06,
          9.35296053e-09, -1.08713292e-01,  2.37676333e-01,
          3.79424530e-01, -3.71204104e-23, -4.42015291e-02,
          4.34246834e-01, -9.86625363e-03, -4.62814074e-04,
          1.82839451e-01, -6.60680161e-11, -6.72082540e-01,
          6.07259913e-01,  3.68909660e-01,  3.95995222e-02,
          5.18347800e-01, -1.23804011e-01],
        [ 3.71739791e-07, -9.87218726e-20, -8.33757358e-04,
         -3.45959848e-04, -6.40156652e-03, -2.93585314e-01,
         -5.61521595e-02, -1.24305679e-02,  2.70458412e-01,
         -6.81375482e-01,  1.48380950e-09, -7.22315157e-08,
         -2.38304082e-01, -6.69055568e-03, -6.54957764e-01,
         -5.46062589e-01,  4.15171752e-01,  5.50467722e-01,
          9.08373849e-01, -3.30173859e-02],
        [-1.65612981e-24, -5.81158558e-06,  3.36151269e-08,
          1.34097253e-22, -1.79555883e-02,  3.75007087e-01,
         -4.28293375e-01, -1.89556690e-04,  3.90786649e-01,
          8.

##### How does our network look?  
We have 4 features, plus 1 bias, in the input layer.  
Then we gave it one hidden layer with 20 units.  
As we have 3 different classes in y, the output layer has 3 nodes.  

-----------------------------------------------------------------------------------------------------------------

`coefs_` does not include the biases; they are stored separately.  
So it does not include the connections from the bias term.  
Let's look at its size.  

In [8]:
len(clf.coefs_)

2

This means we have one set of coefficients for the hidden layer and one set for the output layer.  

In [9]:
clf.coefs_[0]

array([[-5.16135766e-11, -2.92928568e-03, -8.75015148e-06,
         9.35296053e-09, -1.08713292e-01,  2.37676333e-01,
         3.79424530e-01, -3.71204104e-23, -4.42015291e-02,
         4.34246834e-01, -9.86625363e-03, -4.62814074e-04,
         1.82839451e-01, -6.60680161e-11, -6.72082540e-01,
         6.07259913e-01,  3.68909660e-01,  3.95995222e-02,
         5.18347800e-01, -1.23804011e-01],
       [ 3.71739791e-07, -9.87218726e-20, -8.33757358e-04,
        -3.45959848e-04, -6.40156652e-03, -2.93585314e-01,
        -5.61521595e-02, -1.24305679e-02,  2.70458412e-01,
        -6.81375482e-01,  1.48380950e-09, -7.22315157e-08,
        -2.38304082e-01, -6.69055568e-03, -6.54957764e-01,
        -5.46062589e-01,  4.15171752e-01,  5.50467722e-01,
         9.08373849e-01, -3.30173859e-02],
       [-1.65612981e-24, -5.81158558e-06,  3.36151269e-08,
         1.34097253e-22, -1.79555883e-02,  3.75007087e-01,
        -4.28293375e-01, -1.89556690e-04,  3.90786649e-01,
         8.07819229e-01, -1.6

This seems like a 2d array.

In [11]:
clf.coefs_[0].shape

(4, 20)

That's what we were hoping for.  
We have 4 features and 20 hidden layer units, so the number of weights between them is 4 * 20.  

In [12]:
clf.coefs_[1].shape

(20, 3)

And that's exactly what we were hoping for.  
We have 20 hidden units and 3 output units.

**We can look at our biases which are the intercepts.**

In [15]:
clf.intercepts_

[array([-3.97568818e-01, -6.31740630e-02,  8.44952772e-04, -4.12808784e-01,
         3.46779375e-01, -3.02399438e-01,  4.85932874e-01, -8.81510310e-02,
         1.83274898e-01,  3.27191059e-01, -1.45746523e-01, -1.44636188e-01,
         1.10599340e-01, -6.37380441e-03, -1.49135358e+00,  1.79455993e-01,
        -2.60621616e-01,  5.10211669e-02,  1.10090031e+00, -4.00679157e-01]),
 array([-0.3366806 ,  0.37872429,  0.08606772])]

We must have 20 intercepts for the hidden layer (one bias per hidden unit) and 3 intercepts for the output layer (one bias per output unit).

In [16]:
clf.intercepts_[0].shape

(20,)

In [17]:
clf.intercepts_[1].shape

(3,)

These are exactly the shapes we expected.
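To see how `coefs_` and `intercepts_` fit together, we can redo the forward pass by hand and compare it with `predict_proba`. This sketch assumes the default `relu` hidden activation and relies on `MLPClassifier` using a softmax output for multiclass problems; `random_state=0` is added only to make the run repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.neural_network import MLPClassifier

X, y = datasets.load_iris(return_X_y=True)
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=3000, random_state=0)
clf.fit(X, y)

W1, W2 = clf.coefs_        # shapes (4, 20) and (20, 3)
b1, b2 = clf.intercepts_   # shapes (20,) and (3,)

# Hidden layer: weighted sum plus bias, followed by relu.
hidden = np.maximum(0, X @ W1 + b1)

# Output layer: weighted sum plus bias, followed by softmax.
scores = hidden @ W2 + b2
scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Predicted class = output unit with the highest value.
manual_pred = clf.classes_[probs.argmax(axis=1)]

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)   # (4 + 1) * 20 + (20 + 1) * 3
print(np.allclose(probs, clf.predict_proba(X)))
```

If the manual probabilities match `predict_proba`, the weights and biases have been interpreted correctly.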