# Artificial Neural Networks (ANNs)
Artificial neural networks (ANNs) are one of the main tools used in machine learning. As the “neural” part of their name suggests, they are brain-inspired systems, loosely modeled on the way humans learn. A neural network consists of an input layer, an output layer, and (in most cases) one or more hidden layers whose units transform the input into something the output layer can use. They are excellent tools for finding patterns that are too complex or too numerous for a human programmer to extract and teach the machine to recognize.
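
As a rough illustration of the layer structure described above, the following sketch pushes one input through a hidden layer and an output layer. The layer sizes (4, 5, 3), the random weights, and all variable names are assumptions chosen for illustration; nothing here is trained.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

x = rng.random((1, 4))               # one example with 4 input features (assumed size)
w_hidden = rng.normal(size=(4, 5))   # input layer -> hidden layer with 5 units (assumed size)
w_output = rng.normal(size=(5, 3))   # hidden layer -> output layer with 3 units (assumed size)

hidden = sigmoid(x @ w_hidden)       # the hidden units transform the input
output = sigmoid(hidden @ w_output)  # the output layer works on the hidden representation

print(hidden.shape, output.shape)
```

Each layer is just a matrix multiplication followed by a non-linear function; training is the process of finding useful values for the weight matrices.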


For a basic idea of how a deep learning neural network learns, imagine a factory line. After the raw materials (the data set) are input, they are passed down the conveyor belt, with each subsequent stop or layer extracting a different set of features, starting with simple ones and building up to higher-level ones. If the network is intended to recognize an object, the first layer might analyze the brightness of its pixels.

The next layer could then identify any edges in the image, based on lines of similar pixels. After this, another layer may recognize textures and shapes, and so on. By the time the fourth or fifth layer is reached, the deep learning net will have created complex feature detectors. It can figure out that certain image elements (such as a pair of eyes, a nose, and a mouth) are commonly found together.

Once this is done, the researchers who have trained the network can give labels to the output, and then use backpropagation to correct any mistakes which have been made. After a while, the network can carry out its own classification tasks without needing humans to help every time.

![title](3NN.GIF)

If you are already comfortable with the sklearn library, the code below uses its MLP (Multi-layer Perceptron) classifier.
For the documentation of the MLP classifier, visit
http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier

# Using the MLP classifier

In [4]:
from sklearn.neural_network import MLPClassifier
import sklearn.datasets as ds
from sklearn.model_selection import train_test_split
import numpy as np


In [5]:
clf = MLPClassifier()  # creating the classifier object
iris = ds.load_iris()  # loading the dataset
X = iris.data
Y = iris.target
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)  # random train/test split
clf.fit(xtrain, ytrain)  # training the neural network



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [6]:
clf.score(xtest,ytest) #obtaining score 

1.0

In [7]:
clf.predict(xtest)   # results

array([2, 2, 2, 0, 2, 2, 0, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 0, 0, 2, 1,
       0, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 0, 0, 2])
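
The split above is random, so the score changes from run to run, and the default `max_iter=200` may stop before the optimizer has converged. The sketch below is one possible repeatable variant, not the only way to do it. The fixed `random_state=0`, the `StandardScaler` step, the single hidden layer of 10 units, and `max_iter=2000` are all assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# stratify keeps the class proportions equal in both parts; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# scaling the features usually helps gradient-based training of an MLP
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)   # fraction of correctly classified test samples
predictions = model.predict(X_test)      # predicted class index for every test sample
print(accuracy)
```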

# How exactly do NN “learn” stuff?

In the same way that we learn from experience in our lives, neural networks require data to learn. In most cases, the more data that can be thrown at a neural network, the more accurate it will become. Think of it like any task you do over and over. Over time, you gradually get more efficient and make fewer mistakes.

When researchers or computer scientists set out to train a neural network, they typically divide their data into three sets. First is a training set, which helps the network establish the various weights between its nodes. After this, they fine-tune it using a validation data set. Finally, they’ll use a test set to see if it can successfully turn the input into the desired output.
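
A three-way split like the one described above can be sketched with two calls to `train_test_split`. The 60/20/20 proportions, the `random_state=0` seed, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # 150 samples

# first hold out 20% of the data as the final test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# then split the remainder: 25% of the remaining 80% is 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval)

print(len(X_train), len(X_val), len(X_test))
```

The training set is used to fit the weights, the validation set to compare settings such as the number of hidden units, and the test set only once at the end.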

# Do neural networks have any limitations?

One major challenge with neural networks is the long training time and the large amount of computation power required to train them. The biggest issue, however, is that neural networks are “black boxes”: the user feeds in data and receives answers. The user can fine-tune the answers, but does not have access to the exact decision-making process.

# Neural network and backpropagation

In [8]:
def sig(z):
    return 1/(1 + np.exp(-z))

![title](4NN.jpg)

In [9]:
def derivativeSig(sig_out):
    return sig_out*(1 - sig_out)
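
Note that `derivativeSig` takes the sigmoid's output, not its input. A quick way to convince yourself that it is correct is to compare it with a numerical derivative. The sketch below repeats the two functions so that it runs on its own; the test points and the step size `eps` are arbitrary choices.

```python
import numpy as np

def sig(z):
    return 1 / (1 + np.exp(-z))

def derivativeSig(sig_out):
    # expects the OUTPUT of the sigmoid, i.e. sig(z), not z itself
    return sig_out * (1 - sig_out)

z = np.linspace(-5, 5, 11)   # a few test points (arbitrary choice)
eps = 1e-6                   # small step for the central difference (arbitrary choice)

numeric = (sig(z + eps) - sig(z - eps)) / (2 * eps)   # numerical slope of sig at z
analytic = derivativeSig(sig(z))                      # formula applied to the sigmoid output

max_gap = np.max(np.abs(numeric - analytic))
print(max_gap)   # should be tiny if the formula is right
```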

In [10]:
X = np.array([[0,0,1],   # the last column is 1 for bias multiplication 
              [0,1,1],
              [1,0,1],
              [1,1,1]])  # trying to make neural network learn non linear decision boundary, see the graph of XOR 
Y = np.array([[0,1,1,0]]).T

![title](5NN.jpg)

![title](7NN.jpg)
Let's use the above-mentioned error function for the following unit:
![title](6NN.jpg)

Here the input has n features per training example, so n weights and 1 bias are needed to compute:
![title](8NN.jpg)

Suppose O1 applies sigmoid on this input and gives the output as y_predicted.

The sigmoid function applied here is called the activation of this perceptron. It can be replaced by another function such as tanh, ReLU, leaky ReLU, or even the identity function.
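
The alternative activations mentioned above can be sketched as follows. Unlike `derivativeSig`, the derivatives here are written in terms of the input `z`. The function names and the leaky ReLU slope of 0.01 are assumptions for illustration.

```python
import numpy as np

def tanh(z):
    return np.tanh(z)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # slope 1 for positive inputs, 0 otherwise (the value at exactly 0 is a convention)
    return (z > 0).astype(float)

def leaky_relu(z, slope=0.01):
    # like ReLU, but negative inputs keep a small slope (0.01 is an assumed default)
    return np.where(z > 0, z, slope * z)

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)

def identity(z):
    return z

def identity_grad(z):
    return np.ones_like(z, dtype=float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), leaky_relu(z), tanh(z), identity(z))
```

Swapping the activation changes both the forward pass and the derivative term used in the weight update.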

Now, using gradient descent to minimize the error E with respect to the weights, we proceed as follows:

![title](9NN.jpg)
It is given that O1 uses a sigmoid activation, and the derivative of the sigmoid with respect to its input is:

![title](4NN.jpg)

and since the output of O1 is y_predicted, we get



![title](1NN.jpg)
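
For reference, the chain of steps can be written out in text form. This is a sketch that assumes the error is half the sum of squared differences over the training examples, which is consistent with the update the code below performs; the symbols $x_{ji}$ (feature $i$ of example $j$), $\hat{y}_j$ (y_predicted for example $j$), and $\eta$ (learning rate) are notation introduced here.

```latex
\begin{aligned}
z_j &= \sum_{i} w_i \, x_{ji}, \qquad \hat{y}_j = \sigma(z_j) = \frac{1}{1 + e^{-z_j}} \\
E &= \frac{1}{2} \sum_{j} \left( \hat{y}_j - y_j \right)^2 \\
\frac{\partial E}{\partial w_i}
  &= \sum_{j} \frac{\partial E}{\partial \hat{y}_j}
     \cdot \frac{\partial \hat{y}_j}{\partial z_j}
     \cdot \frac{\partial z_j}{\partial w_i}
   = \sum_{j} \left( \hat{y}_j - y_j \right) \, \hat{y}_j \left( 1 - \hat{y}_j \right) \, x_{ji} \\
w_i &\leftarrow w_i - \eta \, \frac{\partial E}{\partial w_i}
\end{aligned}
```

The bias is treated as one more weight whose input is always 1, which is why `X` has a constant last column.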

In [11]:
weights = 2 * np.random.random((3,1)) - 1       # generating random weights between -1 and 1 
learning_rate = 0.1

weights


array([[-0.06855632],
       [-0.67335311],
       [-0.46686853]])

In [12]:
X.shape, weights.shape

((4, 3), (3, 1))

In [15]:
for _ in range(1000):
    output0 = X  # output of the 0th layer, i.e. the input layer, hence equal to X
    output1 = sig(np.dot(output0, weights))    # as mentioned above, the output of O1 is sigmoid applied on z,
                                               # the dot product of the input and weight matrices
    first_term = output1 - Y                   # y_pred - y_act
    second_term = derivativeSig(output1)       # sigmoid derivative, computed from the output of unit O1
    first_two = first_term * second_term
    changes = np.zeros((3, 1))                 # one gradient entry per weight (2 inputs + 1 bias)
    for i in range(3):                         # loop over the weights
        for j in range(4):                     # loop over the training examples
            changes[i][0] = changes[i][0] + first_two[j][0]*output0[j][i]

    # vectorized equivalent: changes = np.dot(output0.T, first_two)
    weights = weights - learning_rate*changes  # updating weights, shape stays (3, 1)

output1 = sig(np.dot(X, weights))  # for this XOR target the four predictions typically stay close to 0.5
weights,output1

Since the above network has only one layer, it cannot create a non-linear decision boundary, and hence the results are poor.
Now we will add one more layer and see whether the output changes.
(You should test the above result by changing the target to Y = np.array([[0,0,0,1]]).T, the AND function, which has a linear decision boundary.)
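
The suggested experiment can be sketched as below, using the AND target with a single sigmoid unit. The seed, the learning rate of 0.5, and the 10000 iterations are assumptions chosen so that training has plenty of time to settle; the update is the vectorized form of the single-layer rule.

```python
import numpy as np

def sig(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])            # last column is the constant bias input
Y_and = np.array([[0, 0, 0, 1]]).T   # AND: linearly separable target

rng = np.random.default_rng(0)           # fixed seed (assumption) for repeatability
weights = 2 * rng.random((3, 1)) - 1     # random weights between -1 and 1
learning_rate = 0.5                      # assumed value

for _ in range(10000):
    output = sig(X @ weights)
    delta = (output - Y_and) * output * (1 - output)   # (y_pred - y_act) * sigmoid derivative
    weights = weights - learning_rate * (X.T @ delta)  # gradient descent step

predictions = sig(X @ weights)
labels = (predictions > 0.5).astype(int)   # threshold at 0.5 to get class labels
print(predictions.ravel(), labels.ravel())
```

Because a single straight line can separate the AND outputs, one layer is enough here, in contrast to XOR.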

In [16]:
X

array([[0, 0, 1],
       [0, 1, 1],
       [1, 0, 1],
       [1, 1, 1]])

In [17]:
weights0 = 2* np.random.random((3,4)) - 1
weights1 = 2* np.random.random((4, 1)) - 1
learning_rate = 0.5
weights0,weights1

(array([[-0.35096536, -0.22815188,  0.61386183,  0.00161917],
        [ 0.83115685,  0.33966994,  0.05194455, -0.16465386],
        [ 0.82747126, -0.57970081,  0.97976553, -0.74731188]]),
 array([[ 0.33614741],
        [-0.51443281],
        [ 0.6471896 ],
        [-0.33725042]]))

Now the final error depends on O_k, which depends on z_k, which in turn depends on O_j. From the equation for z_k we get the required derivative by the chain rule. The rest is quite similar to the process for the single perceptron above.


![title](10NN.jpg)
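
Written out in text form, the two-layer update can be sketched as follows. It assumes the same half-squared-error $E$ and sigmoid activations in both layers, with $i$ indexing inputs, $j$ hidden units, and $k$ output units; the $\delta$ symbols mirror the "delta k", "error j", and "delta j" comments in the code below, and the sum over training examples is left implicit.

```latex
\begin{aligned}
\delta_k &= \frac{\partial E}{\partial z_k} = \left( O_k - y_k \right) O_k \left( 1 - O_k \right) \\
\frac{\partial E}{\partial w_{jk}} &= O_j \, \delta_k \\
\text{error}_j &= \frac{\partial E}{\partial O_j} = \sum_{k} \delta_k \, w_{jk} \\
\delta_j &= \text{error}_j \; O_j \left( 1 - O_j \right) \\
\frac{\partial E}{\partial w_{ij}} &= O_i \, \delta_j
\end{aligned}
```

Here $O_i$ is simply the input, so the gradient for the first weight matrix reuses the $\delta$ values propagated back from the output layer.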


In [18]:
for _ in range(5000):
    layer0 = X                              # input layer
    layer1 = sig(np.dot(layer0, weights0))  # output of layer 1: sigmoid applied on z1, the input of layer 1
    layer2 = sig(np.dot(layer1, weights1))  # output of layer 2: sigmoid applied on z2, the input of layer 2

    l2_error = layer2 - Y                         # y_pred - y_act
    l2_delta = l2_error * derivativeSig(layer2)   # delta k
    net_change2 = np.dot(layer1.T, l2_delta)      # gradient for weights1

    l1_error = l2_delta.dot(weights1.T)           # error j
    l1_delta = l1_error * derivativeSig(layer1)   # delta j
    net_change1 = np.dot(layer0.T, l1_delta)      # gradient for weights0

    weights0 = weights0 - learning_rate*net_change1
    weights1 = weights1 - learning_rate*net_change2
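
A common sanity check for hand-written backpropagation is to compare its gradients with numerical ones. The sketch below does this for the same two-layer architecture, assuming the error is half the sum of squared differences. The helper names (`loss`, `backprop_grads`, `numeric_grad`), the seed, and the step size are all introduced here for illustration.

```python
import numpy as np

def sig(z):
    return 1 / (1 + np.exp(-z))

def loss(w0, w1, X, Y):
    # half the sum of squared errors (assumed error function)
    layer2 = sig(sig(X @ w0) @ w1)
    return 0.5 * np.sum((layer2 - Y) ** 2)

def backprop_grads(w0, w1, X, Y):
    # same formulas as the training loop: delta k, error j, delta j
    layer1 = sig(X @ w0)
    layer2 = sig(layer1 @ w1)
    l2_delta = (layer2 - Y) * layer2 * (1 - layer2)
    l1_delta = (l2_delta @ w1.T) * layer1 * (1 - layer1)
    return X.T @ l1_delta, layer1.T @ l2_delta

def numeric_grad(f, w, eps=1e-5):
    # central difference: nudge one weight at a time and watch the loss
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        original = w[idx]
        w[idx] = original + eps
        up = f()
        w[idx] = original - eps
        down = f()
        w[idx] = original           # restore the weight
        grad[idx] = (up - down) / (2 * eps)
    return grad

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float).T

rng = np.random.default_rng(0)
w0 = 2 * rng.random((3, 4)) - 1
w1 = 2 * rng.random((4, 1)) - 1

g0, g1 = backprop_grads(w0, w1, X, Y)
n0 = numeric_grad(lambda: loss(w0, w1, X, Y), w0)
n1 = numeric_grad(lambda: loss(w0, w1, X, Y), w1)

print(np.max(np.abs(g0 - n0)), np.max(np.abs(g1 - n1)))   # both gaps should be tiny
```

If the two gradients disagree noticeably, the usual suspects are a transposed matrix or a derivative applied to the wrong quantity.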

In [19]:
layer0 = X
layer1 = sig(np.dot(layer0, weights0))
layer2 = sig(np.dot(layer1, weights1))
layer2 

array([[0.02108519],
       [0.97051032],
       [0.98152703],
       [0.02625783]])
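
To turn these outputs into class labels, one simple option is to threshold at 0.5. The sketch below copies the values printed above; the 0.5 threshold and the mean-squared-error summary are assumptions added for illustration.

```python
import numpy as np

# values copied from the output above
layer2 = np.array([[0.02108519],
                   [0.97051032],
                   [0.98152703],
                   [0.02625783]])
Y = np.array([[0, 1, 1, 0]]).T             # XOR targets

labels = (layer2 > 0.5).astype(int)        # 1 where the network output exceeds 0.5
mse = np.mean((layer2 - Y) ** 2)           # average squared gap to the targets

print(labels.ravel(), mse)
```

With the extra hidden layer, the thresholded outputs reproduce the XOR targets, which the single-layer network could not do.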