## Multi Layered Perceptron



### Description:

This chapter is the base of neural network. After reading this notebook you will get to know, what is neural network? History of the neural network. Building your own neural network from scratch.


### Overview

* Perceptron and its limitations
* Perceptron learning rule and Delta learning rule
* Multilayer perceptron
* Problem solving with the neural network
* Forward propagation
* Backward propagation



### Pre-requisites

- Calculus 
- Linear Algebra
- Statistics
- Python
- Supervised Machine Learning techniques (good to have)


### Learning Outcomes
* Introduction to perceptron  
* Perceptron learning rule
* Neural network implementation

## Chapter 1: Perceptrons

### Description: This chapter gives a brief introduction to perceptrons and how they work

### 1.1 BRAIN: Where it all began

***

**Deep Learning** has created a lot of buzz in recent times, mainly due to its overwhelming performance in tasks like recognizing faces, generating texts, predicting stock prices etc. Well, the motivation behind deep learning was the **human brain**. During the decade of 1940 many prominent neuroscientists were trying to model the brain (which was a pretty ambitious task considering the fact that they had extremely limited computing power). As with the brain, the building block for such a model was neuron or brain cells. Before proceeding further lets understand how a brain works.

***

**Working of human brain**

Brain is made up of cells called neurons. These neurons are interconnected within the brain wherein the **dendrite** receives input signals from neighboring neurons, which are then passed across the **cell body** and later onto the **axon**. The **axon** will output a signal which will be an input to another neuron and in this way the structure continues and ultimately leads to an interconnected network of neurons. 

<img src='images/neuron.jpg'>

The above image gives a picturesque representation of the working of neurons. But there is a specific condition under which one neuron can pass information onto subsequent neurons. **Only if the combination of the input signals to the neuron exceeds a certain threshold, then that neuron can pass information; in other words the neuron fires only then**. 


***

**Brain motivates the Artifical Neuron**

Drawing on this concept an artificial neuron was modelled up. Here the dendrites, the input connections to the neuron carry the attenuated or amplified input signals from other neighboring neurons. The signals are passed on to the neuron, where the input signals are summed up and then a decision is made as to what to output based on the total input received. For instance, for a binary threshold neuron, an output value of 1 is provided when the total input exceeds a pre-defined threshold; otherwise, the output stays at 0. 

<img src='images/ann.png'>

The brain equivalent of an artificial neuron is depicted above where: 
- Inputs signals to dendrites from neighboring neurons are $x_1, x_2, ...., x_n$. In Machine learning terminology these are the features.
- These inputs are combined in a weighted manner where weights are denoted by $w_1, w_2, ..., w_n$ and b. They $w\text{'s}$ are called **weights** and b is called the **bias**.
- The weighted combination is summed up at the cell body after which an operation called **activation** is carried out. You will learn more about activations in upcoming concepts (In fact this one will be covered in extensive detail).
- If this output is greater than some pre-defined threshold, then the neuron will fire; or in other words it will pass on information to subsequent neighboring neurons. 


**Formal definition of weights and bias**

* **`Weights`** represent the strength of the connection between units. If the weight from neuron 1 to neuron 2 has greater magnitude, it means that neuron 1 has greater influence over neuron 2. Weights near zero means changing this input will not change the output. In other words weights decide how much influence the input will have on the output.


* **`Bias`**'s role is similar to threshold. It determines whether or not or how much will neuron fire. It is letting us known when a neuron is meaningfully activated. **The addition of these biases ends up increasing the flexibility of a model to fit the given data. Bias makes sure that even when all the inputs  0’s there’s gonna be an activation in the neuron.**


***

### 1.2 Perceptron: The next BIG thing and its fall

***

**First breakthrough**

Modelling of the human brain with artificial neurons looked like an exciting challenge at those times and many scientists were trying to find the answer to the ultimate puzzle. If history is to believed then the first influential work was by two electrical engineers *Warren McCulloch* and *Walter Pitts*. Together they published a paper titled "*A Logical Calculus of the Ideas Immanent in Nervous Activity*" related to neural
networks, in 1943. The paper can be located [here](http://www.cs.cmu.edu/~epxing/Class/10715/reading/McCulloch.and.Pitts.pdf). The artificial neuron of this model is also referred to as the McCulloch-Pitts neuron and has the following properties:
- Neurons have a binary output state
- There are two types of input to the neurons: **excitatory** inputs and **inhibitory** inputs
- All excitatory inputs to the neuron have equal positive weights. If all the inputs to the neuron are excitatory and if the total input $\sum{w_ix_i} >= 0$, then the neuron will fire giving an output of 1
- In cases when any of the inhibitory inputs are active or $\sum{w_ix_i} < 0$ then the output will be 0

<img src='images/mpneuron.png'>


However, their configuration suffered from two major flaws:
- **Unability to learn weights**: The amount by which to combine the inputs to neurons could not be learnt by the so-called McCulloch-Pitts neuron. One must try out a variety of combinations by hit and trial to arrive at the optimum combination of weights.
- **Input binarization**: The input needs to be in a binary format so that they can be fed into a McCulloch-Pitts neuron. However, input binarization is a far-fetched thing to do as it leads to a loss of information.


***


**Invention of the Perceptron model**


The experiments with brain modelling went on and a paper titled "*The Perceptron—A Perceiving and
Recognizing Automaton*" by the authors *Frank Rosenblatt, Alexander Stieber* and *Robert H. Shatz* received significant attention, and for good reason. If you want to read the book, feel free to read it [here](https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf). This model borrowed its underlying concept from two places: the McCulloch-Pitts model of an artificial neuron and Hebbian learning rule of adjusting weights. It was an improvement of the McCulloch-Pitts in the sense that it eliminated its drawbacks. 

Firstly, inputs need not be binarized. But the **most important part was that the weights and bias can be learnt when training the model through an iterative process**. It was made with the objective of performing binary classification primarily. The artificial neuron in this model was named **perceptron** and was intended to be a machine, rather than a program, and while its first implementation was in software for the IBM 704, it was subsequently implemented in custom-built hardware as the "*Mark 1 perceptron*" This machine was designed for image recognition: it had an array of 400 photocells, randomly connected to the **perceptron** encoded in potentiometers, and weight updates during learning were performed by electric motors.

Its architecture is similar to that of that of McCulloch-Pitts model as shown below:
<img src='images/perceptron.png'>


***

**Downfall of the perceptron model**

Rosenblatt made numerous strong claims for his **perceptron** model. All of this came to a halt when *Marvin Minsky* and *Seymour A. Papert* wrote a book titled "*Perceptrons: An Introduction to Computational Geometry in 1969 (MIT Press)*, which showed the limitations of the Perceptron learning algorithm even on simple tasks such as developing the **XOR** Boolean function with a single perceptron. It led to most people active in the artificial neural network community believing that these limitations applied to all neural networks, and hence the research in artificial neural networks nearly halted for a decade, until the 1980s.

### 1.3 Can you predict churn of customers?

***

Lets take a quick detour from the theoretical aspect. Here, we will introduce the problem that you are trying to solve throughout this entire concept; firstly through perceptrons and then through multi-layer perceptrons. 


**What is the problem?**

The problem you will be working on is churn prediction which is a binary classification task. You have a dataset of a bank with $10000$ customers. The dataset contains lots of attributes of the customers measured over last 6 months such as `name`, `credit score`, `geography`, `age`, `tenure`, `balance` etc. Also, you know which of these customers stayed and left. Now, given a new customer profile, your task is to create a model out of this data that can predict whether an incoming new customer will **leave** or **stay**. 


**Why solve this?**

This problem is valuable to any customer-oriented organisations, e.g. *Should this person get a loan or not?* *Is this a fradulent transaction?* Therefore, **Customer Churn Modelling** finds its applications in varied sectors. Its used by telecom service providers, banks to name a few.


**Brief Explanation of Dataset and Features**
- `CreditScore(Discrete)`: Denotes the credit rating of the customer given by the bank 
- `Gender(Categorical)`: Gender of the customer
- `Age(Discrete)`: Age of the customer
- `Tenure(Discrete)`: Number of years spent with the organisation
- `Balance(Continuous)`: Account balance of the customer
- `NumOfProducts(Discrete)`: Number of products of the customer
- `HasCrCard(Categorical)`: Does the customer have a credit card?
- `IsActiveMember(Categorical)`: Is the customer an active member?
- `EstimatedSalary(Continuous)`: Estimate of the salary of the customer
- `Exited(Categorical)`: Has the customer exited from the organisation


**What do we want as outcome?**

Given the features about the Customer, we want to predict if the customer has exited from the organisation with the help of perceptrons and then with multi-layer perceptrons.

## Load the dataset

First load the data in order to have a high level overview. Quite possibly the simplest of all tasks that you will be peforming throughout the entire course :) 


### Instructions
- Load the csv file by passing `path` argument to `.read_csv()` method of pandas. Save it as `data`
- Use `.describe()` method of `data` to look at its properties
- Look at the first five observations of `data` using `.head()` method

In [5]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# file path
path = 'data/Churn_Modelling.csv'

# Code starts here

# load data
data = pd.read_csv(path)

# description of data
print(data.describe())
print('='*50)

# first five observations of data
data.head()

# Code ends here

         RowNumber    CustomerId   CreditScore           Age        Tenure  \
count  10000.00000  1.000000e+04  10000.000000  10000.000000  10000.000000   
mean    5000.50000  1.569094e+07    650.528800     38.921800      5.012800   
std     2886.89568  7.193619e+04     96.653299     10.487806      2.892174   
min        1.00000  1.556570e+07    350.000000     18.000000      0.000000   
25%     2500.75000  1.562853e+07    584.000000     32.000000      3.000000   
50%     5000.50000  1.569074e+07    652.000000     37.000000      5.000000   
75%     7500.25000  1.575323e+07    718.000000     44.000000      7.000000   
max    10000.00000  1.581569e+07    850.000000     92.000000     10.000000   

             Balance  NumOfProducts    HasCrCard  IsActiveMember  \
count   10000.000000   10000.000000  10000.00000    10000.000000   
mean    76485.889288       1.530200      0.70550        0.515100   
std     62397.405202       0.581654      0.45584        0.499797   
min         0.000000     

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


### 1.4 Perceptron Learning Rule

***

**What is the Perceptron Learning Rule**?

Despite the eventual failure of the perceptron, it is helpful to learn about perhaps the most important precursor to deep learning; the **perceptrons**. Modern advances in deep learning would not have been possible at all if the perceptron was not framed. You will eventually see a lot of similarities between the two later on. So, coming back to perceptrons, **they are linear binary classifiers that use a hyperplane
to separate two classes**. You will learn how it is making only linear classifications and unable to capture non-linear ones. And the learning algorithm by which it learns the optimum hyperplane or set of weights (if there exists such a set/hyperplane) is called the **perceptron learning rule**.


***


**Solving the problem with perceptrons**

Take a look at a snapshot of the data we have at our disposal where `Exited` is the variable we want to predict based on the set of features we have. `Exited` has two values; `1` means a customer has exited and `0` means he hasn't.

<img src="images/data.png"/>

Pretty clear that it is a binary classification problem. The perceptron will take a weighted linear combination of the features of the data and sum it up. Lets assume the weights to be $w_1, w_2, ..., w_n$ and features be $x_1, x_2, .., x_n$ with $m$ number of data points. Now if their linear combination is greater than some threshold (let it be denoted by $b$) then we say that the customer has exited i.e. the perceptron fires. Otherwise, the customer is retained or the perceptron fails to fire. Mathematically,

$$\text{If } \displaystyle\sum_{i=1}^{m}{w_ix_i} > \mathbf{b}, \text{ y(target)} = 1 \text{, else y(target) = } 0$$

- If $\displaystyle\sum_{i=1}^{m}{w_ix_i} > \mathbf{b}$, the data point belongs to positive class or the customer has exited
- If $\displaystyle\sum_{i=1}^{m}{w_ix_i} <= \mathbf{b}$, the data point belongs to negative class or the customer has stayed
- The equation of the decision boundary is $\displaystyle\sum_{i=1}^{m}{w_ix_i} = \mathbf{b}$

In our problem $x_i$s are the input features `customerId`, `age`, `gender` etc. and $y$ is the target.

Now the question is how to get the right combination of the weight terms $w_1, w_2, ...$ and the bias $b$. This is where the perceptron learning rule comes to the rescue! Lets see how.

***


**Optimum weights with `Perceptron Learning Rule`**

Before discussing the steps associated with the learning rule lets get used to the syntax of things we would be going forward with. $w$ is the weight vector and $x$ is the data point both belonging to $\mathbf{R^{n}}$. Also, let $b$ be any real number denoting the threshold and $y$ is the actual target with possible values of 1 and 0. The steps are outlined below:

- **Random initialization**: First initialize $w$s and $b$ randomly.
- **Predict single data point**: Pick a data point $x^{(i)}, y^{(i)}$ from all the positive and negative inputs. Calculate the the dot product $w.x^{(i)}$ and lets call it $y_p$. This is the predicted class, which is $1$ if $w.x^{(i)} > 0$ and $0$ if $w.x^{(i)} <= 0$ 
- **Update weights**: This step is where the magic happens. Update the weights and bias as given below:
    - If predicted class $y_p = 0$ and actual class $y^{(i)} = 1$, update the weight vector as $w = w + x^{(i)}$. Bias gets updated by $+1$
    - If predicted class $y_p = 1$ and actual class $y^{(i)} = 0$, update the weight vector as $w = w - x^{(i)}$. Bias gets updated by $-1$
    - If predicted class and actual class are equal i.e. $y_p == y^{(i)}$, no updates are required
- Repeat **STEP 2** and proceed to next data point.
- Stop when all data points have been correctly classified.

However not that **perceptron will only be able to classify the two classes properly if there exists a feasible weight vector** $w$ **that can linearly separate the two classes. In such cases, the Perceptron Convergence theorem guarantees convergence**.

***

**Pythonic implementation of Perceptron**

```python
# import packages
import numpy as np
from sklearn.metrics import accuracy_score


# OOP implementation of Perceptron
class Perceptron(object):
    
    # initialize with features, target, weigths, bias and number of iterations
    def __init__(self, X, y, w, b, epochs=1000):
        
        self.b = b
        self.w = w
        self.X = X
        self.y = y
        self.epochs = epochs
        
    # computes dot product
    def learn(self, x):
        return 1 if np.dot(self.w, x) >= self.b else 0
   
    
    # training process
    def fit(self):
        
        # dictionary to store accuracy values
        accuracy = {}
        # maximum accuracy 
        max_accuracy = 0
        # iterate 
        for epoch in range(self.epochs):
            # iterate over every data point
            for x,y in zip(self.X, self.y):
                # prediction for data point
                pred = self.learn(x)
                # weight update
                if pred == 0 and y == 1:
                    self.w += x
                    self.b += 1
                elif pred == 1 and y == 0:
                    self.w -= *x
                    self.b -= 1
            # store accuracy according to iteration number
            accuracy[epoch] = accuracy_score(self.predict(self.X), self.y)
            # display if new maximum training accuracy is achievedd
            if accuracy[epoch] > max_accuracy:
                print("Training accuracy at epoch {} is: {}".format(epoch, accuracy[epoch]))
                print("="*100)
                max_accuracy = accuracy[epoch]
                # checkpoint maximum accuracy weights and bias
                chkpt_w = self.w
                chkpt_b = self.b
        
            self.b = chkpt_b
            self.w = chkpt_w
        
        return max_accuracy, self.b, self.w
    
    # predict on new data
    def predict(self, test):
        # list to store predictions
        preds = []
        for row in test:
            y_pred = self.learn(row)
            preds.append(y_pred)
        return np.array(preds)
    
    # calculate accuracy
    def accuracy(self, X_test, y_test):
        _, self.b, self.w = self.fit()
        preds = self.predict(X_test)
        accuracy = accuracy_score(preds, y_test)
        return accuracy
```

## Making predictions with Perceptron

In this task you will apply perceptron learning rule to your churn prediction problem on preprocessed data. In order to concentrate more on applying the perceptron learning rule, we have split data into training and test features and targets as `X_train`, `X_test`, `y_train` and `y_test`

### Instructions
- Initialize weights as `w = np.ones()` with shape as `(X.shape[1], )` and bias `b` equal to `0`
- In the topic we gave you a code snippet for `Perceptron`. We encourage you to write it down here or else you can also copy-paste it. Name the class as `Perceptron`
- Initialize an object `perceptron` of class `Perceptron`
- Use `.accuracy()` method of `perceptron` object to calculate the accuracy over the test data. Save it as `acc` and print it out

In [6]:
# import packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# separate into features and target
X = data[["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"]]
y = data["Exited"]

# mean normalization and scaling
mean, std = np.mean(X), np.std(X)
X = (X - mean) / std
X = pd.concat([X, pd.get_dummies(data["Gender"], prefix="Gender", drop_first = True)], axis = 1)

# transform data according to the model input format
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .20, random_state= 9)


# Code starts here

# initialize weights and bias
w = np.ones(shape=(X.shape[1],))
b = 0

# class of perceptron
class Perceptron(object):
    
    # initialize with features, target, weigths, bias and number of iterations
    def __init__(self, X, y, w, b, epochs=100):
        
        self.b = b
        self.w = w
        self.X = X
        self.y = y
        self.epochs = epochs
        
    # computes dot product
    def learn(self, x):
        return 1 if np.dot(self.w, x) >= self.b else 0
   
    
    # training process
    def fit(self):
        
        # dictionary to store accuracy values
        accuracy = {}
        # maximum accuracy 
        max_accuracy = 0
        # iterate 
        for epoch in range(self.epochs):
            # iterate over every data point
            for x,y in zip(self.X, self.y):
                # prediction for data point
                pred = self.learn(x)
                # weight update
                if pred == 0 and y == 1:
                    self.w += x
                    self.b += 1
                elif pred == 1 and y == 0:
                    self.w -= x
                    self.b -= 1
            # store accuracy according to iteration number
            accuracy[epoch] = accuracy_score(self.predict(self.X), self.y)
            # display if new maximum training accuracy is achievedd
            if accuracy[epoch] > max_accuracy:
                print("Training accuracy at epoch {} is: {}".format(epoch, accuracy[epoch]))
                print("="*100)
                max_accuracy = accuracy[epoch]
                # checkpoint maximum accuracy weights and bias
                chkpt_w = self.w
                chkpt_b = self.b
        
            self.b = chkpt_b
            self.w = chkpt_w
        
        return max_accuracy, self.b, self.w
    
    # predict on new data
    def predict(self, test):
        # list to store predictions
        preds = []
        for row in test:
            y_pred = self.learn(row)
            preds.append(y_pred)
        return np.array(preds)
    
    # calculate accuracy
    def accuracy(self, X_test, y_test):
        _, self.b, self.w = self.fit()
        preds = self.predict(X_test)
        accuracy = accuracy_score(preds, y_test)
        return accuracy
    
# object of class
perceptron = Perceptron(X_train.values, y_train.values, w, b, 1000)

# accuracy of prediction 
acc = perceptron.accuracy(X_test.values, y_test.values)

# print accuracy
print(acc)

# Code ends here

Training accuracy at epoch 0 is: 0.20275
Training accuracy at epoch 2 is: 0.28
Training accuracy at epoch 4 is: 0.352375
Training accuracy at epoch 6 is: 0.386375
Training accuracy at epoch 8 is: 0.40825
Training accuracy at epoch 10 is: 0.424375
Training accuracy at epoch 12 is: 0.434875
Training accuracy at epoch 14 is: 0.4445
Training accuracy at epoch 16 is: 0.450875
Training accuracy at epoch 18 is: 0.454875
Training accuracy at epoch 20 is: 0.4575
Training accuracy at epoch 22 is: 0.45925
Training accuracy at epoch 24 is: 0.461375
Training accuracy at epoch 27 is: 0.476
Training accuracy at epoch 30 is: 0.476375
Training accuracy at epoch 34 is: 0.4765
Training accuracy at epoch 39 is: 0.47675
Training accuracy at epoch 45 is: 0.47875
Training accuracy at epoch 52 is: 0.480625
Training accuracy at epoch 59 is: 0.48075
Training accuracy at epoch 66 is: 0.481
Training accuracy at epoch 74 is: 0.482375
Training accuracy at epoch 83 is: 0.484125
Training accuracy at epoch 92 is: 0.48

Training accuracy at epoch 778 is: 0.500875
Training accuracy at epoch 808 is: 0.501125
Training accuracy at epoch 839 is: 0.50125
Training accuracy at epoch 871 is: 0.501625
Training accuracy at epoch 904 is: 0.50225
Training accuracy at epoch 937 is: 0.502375
Training accuracy at epoch 971 is: 0.5025
0.58


### 1.5 Why the perceptron learning rule works?

***

Okay you know how the weights get updated, but wondered why? Lets see how. For better explainability lets consider the situation where we have only two features $x_1$ and $x_2$ where $x \in \mathbf{R^{2}}$ and its corresponding weights are $w_1$ and $w_2$ where $w^{T} \in \mathbf{R^{2}}$. Lets also assume that the bias $b$ is indeed a weight $w_0$ with constant input of $1$. So, we can rewrite the equation for perceptron prediction as $\displaystyle\sum_{i=0}^{m}{w_ix_i}$. 

![](images/mlp_1.PNG)

The diagram above shows the representation and a few pointers about it:
- The decision boundary is given by $\displaystyle\sum_{i=0}^{m}{w_ix_i} = 0$. It is represented by a line.
- The weight vector $w$ is perpendicular to this decision boundary. If you want to know why, we highly encourage you to [link](https://stackoverflow.com/questions/10177330/why-is-weight-vector-orthogonal-to-decision-plane-in-neural-networks)


**Proof of convergence**

So you initialized weights randomly, made a prediction and then updated the weight. What happens next? Well, the weight vector moves from $w$ to $w'$ and the corresponding decision boundary changes. Lets consider the following example where you have positive points $p_1$, $p_2$ and $p_3$ colored blue and negative points $n_1$, $n_2$ and $n_3$ colored red. After first iteration you have the weight vector $w$ and decision boundary as $w^{T}x = 0$

![](images/mlp_2.PNG)

- **Case when prediction is negative but actual target is positive**: For this scenario weight is updated as $w' = w + x$. Assume the angle between $w$ and $x$ is $\alpha$. Since prediction is negative, its value must be between $90$ and $270$ degrees. After updating the weights the new weight is $w' = w + x$ and the new angle becomes lets say $\alpha_{new}$. The main reason why we are updating the weight vector is because we want to correctly predict the next time. Observe how weight update tries to achieve that:

    $$cos(\alpha_{\text{new}}) \propto  w_{\text{new}}.x$$

    $$=>cos(\alpha_{\text{new}}) \propto  {(w + x)}.x$$

    $$=> cos(\alpha_{\text{new}}) \propto  w.x  + x.x$$

    $$=> cos(\alpha_{\text{new}}) \propto  cos(\alpha)  + x^2$$

   $$=> cos(\alpha_{\text{new}}) >  cos(\alpha) $$

    This essentially means that the new angle $(\alpha_{\text{new}})$ shifts the weight vector to a position such that the cosine of the angle between the new weight vector and the input vector is greater than the previous cosine value; and repeat it until it becomes positive and we succesfully predict the input as positive.

In a similar manner you can reason for the case when prediction is positive but actual target is negative. We encourage you to reason behind it!

### 1.6 Limitations of perceptrons

***

The very first failure of perceptrons happened with a logical function. A logic gate/function is an elementary building block of a digital circuit. Most logic gates have two inputs and one output. At any given moment, every terminal is in one of the two binary conditions low ($0$) or high ($1$), represented by different voltage levels. First, you will look at **AND** and **OR** gates for which the perceptron can easily distinguish between different outputs ($0$ and $1$). Then you will see how it fails for **XOR** gate and the underlying reason behind it.

<img src="images/logic gate.jpeg" width="70%" />

**Inputs**: For this task we will have four data points $(0,0), (0,1), (1,0), (1,1)$ with respective targets $0, 0, 0, 1$ for **AND** gate and outputs $0, 1, 1, 1$ for **OR** gate. 


**AND gate with perceptron**

With perceptron the decision boundary looks somewhat like this where the red points represent output 0 and blue points represent output 1. The black line is the decision boundary obtained after weight updates.

<img src='images/and.png'>


**OR gate with perceptron**

The decision boundary for OR gate is obtained as shown below: <img src='images/or.png'> 

Notice the similarity between these two decision boundaries; **both are a line**. If you take more features, this line will become a hyperplane. A hyperplane is a linear decision boundary i.e. it can capture linear relationships between features. Now, lets look at the decision boundary for **XOR** gate.

<img src='images/xor_gate.png'>

Clearly, the decision boundary fails to separate the data points belonging to separate classes. The XOR problem can only be solved by a non-linear decision boundary; for example the below decision boundary:

<img src="images/mlp_hidden.png" width="70%" />

Ultimately it was proven that perceptrons failed for classification tasks which had non-linear relationships. A lot of real world use cases are non-linear in nature, and hence the perceptron model was put to rest.

## Chapter 2: Multilayer Perceptron

### Description: In this chapter, you will learn about the bedrock of neural networks; the multilayer perceptron.

### 2.1 Need to capture non-linear relationships

***

At the end of the last chapter we mentioned that a lot of real world use cases involve non-linear relationships. And since the perceptron always produces a decision boundary in the form of a hyperplane, it is unable to capture these non-linear relationships. Lets observe a similar non-linear relationship with the data itself. 


**Non-Linear relationship between two features**

Consider `Balance`($x_1$) and `EstimatedSalary` ($x_2$) features. The data points are plotted below with only these two features and it gives an indication of how the decision boundary should look like or approximate to. The blue circles represent the positive labels and the red crosses represent the negative labels. In our example, positive labels mean the `customer has exited` and negative labels mean `customer has stayed`. On making a scatterplot with these two features and then plotting it, the pink curve is what we might call as a "good" decision boundary.

<br>
<img src="images/non-linear.png"/>
<br> 

***

**Solving the XOR problem**

This **non-linear** trend is present almost in every real world problem. Therefore, we need a new learning algorithm which is good at capturing non linearities. This led to the invention of **multi-layered perceptrons**. As you have previously seen before with **XOR** gate, we can model the decision boundary something to have two hyperplanes in order to separate the classes. Requiring two hyperplanes to separate two classes is the equivalent of having a non-linear classifier.

If we take two perceptrons stacked in a layer, one capable of **AND** logic and the other capable of **OR** logic, then the network would be able to implement **XOR** logic. The perceptrons for getting **AND** and **OR** logics can be trained individually using perceptron learning rule but the network as a whole cannot perform **XOR** logic using the same rule.

<img src='images/xor_logic.png'>

### 2.2 Introducing non-linearity with activation functions

***

At the end of the previous topic we introduced an architecture consisting of two perceptrons which can be combined effectively to produce **XOR** logic. It is solving only a minor problem whereas the major concern is to capture **non-linear relationships**. Now, we will check if this configuration can capture a more complex non-linear relationship i.e. produce decision boundaries other than a hyperplane.


**Combining perceptrons linearly without activation functions**

Consider the figure below where you have two inputs $x_1$ and $x_2$, $w$ as the weigths, $b$ as the biases, $h$ as the hidden layer values and $y$ as the final output. 

<img src='images/need_nonlinear.png'>

Output of the hidden unit $\mathbf{h_1 = w_{11}x_1 + w_{21}x_2 + b_1}$

Output of the hidden unit $\mathbf{h_2 = w_{12}x_1 + w_{22}x_2 + b_2}$

Output $\mathbf{p_1 = w_1(w_{11}x_1 + w_{21}x_2 + b_1) + w_2(w_{12}x_1 + w_{22}x_2 + b_2) + b_3 = (w_1w_{11} + w_2w_{12})x_1 + (w_1w_{21} + w_2w_{22})x_2 + w_1b_1 + w_2b_2 + b_3}$

Since $p_1$ is a linear combination of its inputs, it will be unable to capture non-linearities within the data. So, how to get around this problem and arrive at a solution. Well there is a function that you had likely come across in machine learning (hint: Logistic Regression)


***

**Activation functions to the rescue**

Think of the scenario where instead of the linear output of $h_1$ and $h_2$ what if we apply a non-linear function? It will then lend our model the power to capture non-linear patterns within the data. **Therefore, activation functions are used after taking the weighted combination of inputs so as to introduce non-linearity to the function**. Lets take the most simple and widely used activation function of all; the sigmoid. 

$\text{Mathematically, it is given by the formula: }$ 
$g(x) = \mathbf{\frac{1}{1 + e^{-x}}}$

It resembles an S-shaped curve like the one shown below which indicates that this function is non-linear. This is very important if we want the network to capture the non-linearities in the data.

<img src="images/600px-Logistic-curve.svg.png" style="width:300px"/>


Lets also see mathematically whether after applying sigmoid activation function the output $p_1$ is a linear combination of features $x_1$ and $x_2$. 

Output $h_1 = \mathbf{\frac{1}{1 + e^{-(w_{11}x_1 + w_{21}x_2 + b_1)}}}$

Output $h_2 = \mathbf{\frac{1}{1 + e^{-(w_{12}x_1 + w_{22}x_2 + b_2)}}}$

Adding these two we have $p_1 = \mathbf{\frac{w_1}{1 + e^{-(w_{11}x_1 + w_{21}x_2 + b_1)}} + \frac{w_2}{1 + e^{-(w_{12}x_1 + w_{22}x_2 + b_2)}} + b_3}$

Now the output is clearly not a linear combination of the features and we can say that it can capture non-linearity to some extent within the data. Here we have introduced the sigmoid activation function. People out there use different types of activation functions like **ReLU**, **tanH**, **Leaky ReLU** etc. and you will learn more about them in the upcoming concepts.

**Sigmoid-styled perceptron**

Following is a Python OOP implementation of sigmoid perceptron where the prediction is now done by applying sigmoid function on top of the weighted combination of input data. The error function is taken as Squared Error and we minimize it using the gradient descent method to find the best weight and bias.

```python
# import packages
from tqdm import tqdm_notebook
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

# Code starts here

# initialize weights 
w = np.zeros(shape=(X.shape[1],))
b = 0

# OOP implementation of sigmoid neuron
class sigmoidNN:
    
    
    # initialize with weights, bias and number of epochs
    def __init__(self, w, b, epochs):
        np.random.seed(2)
        self.w = w
        self.b = b
        self.epochs = epochs
    
    
    # calculate dot product of weights and inputs
    def dot(self, x):
        return np.dot(self.w, x) + self.b
    
    
    # calculate sigmoid output
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-self.dot(x)))
    
    
    # calculate gradient of weights
    def grad_w(self, x, y):
        y_pred = self.sigmoid(x)
        return (y_pred - y) * x * y_pred * (1 - y_pred)
    
    
    # calculate gradient of bias
    def grad_b(self, x, y):
        y_pred = self.sigmoid(x)
        return (y_pred - y) * y_pred * (1 - y_pred)
    
    
    # fit on training data
    def fit(self, X, Y, lr=1, display_loss=False):
        L = []
        for _ in tqdm_notebook(range(self.epochs), total=self.epochs, unit="epoch"):
            preds = []
            dw, db = 0, 0
            for x,y in zip(X, Y):
                dw += self.grad_w(x, y)
                db += self.grad_b(x, y)
            self.w -= lr * dw
            self.b -= lr * db
            for row in X:
                pred = self.sigmoid(row)
                preds.append(pred)
            loss = mean_squared_error(np.array(preds).reshape(-1,1), np.array(Y).reshape(-1,1))
            L.append(loss)
        if display_loss:
            plt.plot(L)
            plt.xlabel('Epochs')
            plt.ylabel('MSE Loss')
            plt.show()
        return self.w, self.b
    
    
    def predict(self, X_train, X_test, Y_train):
        self.w, self.b = self.fit(X_train, Y_train)
        Y_pred = []
        for row in X_test:
            pred = self.sigmoid(row)
            Y_pred.append(pred)
        return np.array(Y_pred)
    

# instantiate object    
sigmoid_nn = sigmoidNN(w, b, 100)

# make predictions using sigmoid neuron
sigmoid_pred = sigmoid_nn.predict(X_train.values, X_test.values, y_train.values)

# binarize data
sigmoid_pred_bin = sigmoid_pred > 0.5

# accuracy
acc = accuracy_score(sigmoid_pred_bin, y_test)

# print accuracy score
print(acc)
```

This gives output:
```python
0.793
```

### 2.3 Introduction to multi-layered perceptrons

***

So now you have lended some non-linearity to the learning function with the help of sigmoid activation function. Do you think a couple of such perceptrons with sigmoid activations is sufficient? The answer is **NO**. Real world problems often exhibit very complex non-linear relationships which are extremely difficult to model around. In other words we need a more complex model that has the capacity to capture those non-linear trends. How to do it? 


**Architecture of MLP**

Its simple. You stack layers of perceptrons with activations. These perceptrons will capture trends during the training phase. This type of architecture is often referred to as Multi-layered perceptron and quite rightly it is called the **"hello world"** of deep learning. Take a look at the image below to look at its architecture.

<img src='images/multi_1.png' width="400">


An MLP has three basic components in the form of layers which contain a bunch of neurons interconnected between successive layers:
- **Input layer**: The leftmost layer in the above image that takes in inputs is referred to as the input layer. No activations are applied in this layer and the number of input neurons is equal to the number of features of the data.
- **Hidden layer**: Perhaps the most important component is the hidden layer. In the figure shown above there is a single hidden layer. But you can add up many such layers making it very deep. Usually if the number of hidden layers exceeeds 4 then the network is said to be deep, and otherwise shallow. Activation functions are applied to the neurons in the hidden layers o that non-linearity can be captured. 
- **Output layer**: The final layer producing the output is called the output layer and the problem statement at hand determines the type of activation function to use here. For ex: If it is a regression problem then there should be a single neuron with no activation function. In case of binary classification there can be either one neuron with sigmoid activation or two neuron with softmax activation. 


The interconnections between the neurons of successive layers are weighted which needs to be learnt during the training phase. It is done efficiently done through two processes: 
- **Forward propagation**
- **Backward propagation**

The next chapter details more on these two mechanisms and you will understand why and how **Backward propagation** forms the workhorse of any deep learning technique. 

## Chapter 3: Forward and Backward propagation in MLPs

### Description: In this chapter, we will see how an MLP makes predictions and learn the correct set of weights through forward and backward propagation

### 3.1 Predictions with FORWARD PROPAGATION

***

You know about perceptrons, weights and bias, activation functions and most importantly the architecture of a multi layer perceptron. In this topic you will see how a multi-layer perceptron predicts the output given the weights and bias, activation function of the network and the inputs.


**MLP architecture**: The architecture consists of three layers: one input, one hidden and one output layer.  
- **Input layer**: For the sake of simplicity consider 2 neurons in this layer. Remember that you do not use activations in this layer
- **Hidden layer**: Here there are three neurons with sigmoid activations. 
- **Output layer**: Since our problem statement is a binary classification problem (`Exited`/`NotExited`), we will use a single neuron in the output layer with sigmoid activation. The output represents the probability of a particular customer exiting.


***

**Workflow of obtaining best set of weights**

The workflow of MLP is pretty straightforward. It is sequentially described below:
- Initialize random set of weights and biases and predict output with **forward propagation**
- Declare something called **Loss Function** which is a measure of how badly the model is performing. Our goal is to minimize this loss and in order to move in the direction of decreasing loss we take the gradient of the **loss function with respect to the weights**. This gradient is then multiplied by a factor called **learning rate** and is then subtracted from the initial weight. This process is called **backward propagation** and is discussed thoroughly in the next topic.

***

**Forward Propagation**

The process by which any neural network makes predictions is called forward propagation. Its called **forward** because making predictions implies going from the input layers on the left to the output layer on the right. Before going into the nuts and bolts of this method, lets fix some conventions. 

- Consider we have $m$ number of observations with every observation having two features $x_1$ and $x_2$
- Every layer is represented by the letter **L** as superscript. For ex: input layer is given by superscript 1, hidden layer by 2 and output layer by 3
- Weights are matrices denoted by $w$'s. Weights and bias from input to hidden layer are given by $w^{(1)}$, $b^{(1)}$ and those from hidden to output layer are given by $w^{(2)}$ and $b^{(2)}$
- Linear combinations of weights and inputs given by $\sum{w.x} + b$ are represented by $z$
- After applying sigmoid activations these combinations are represented by $a = \mathbf{\frac{1}{1 + e^{-(\sum{w.x + b})}}}$


<img src='images/forwardprop.jpg'>

The network is shown in the picture above. Now what will be the dimension of the weight matrices $w^{(1)}$ and $w^{(2)}$? What is the dimension of the matrix containing the entire data? Lets answer these questions below:

- **Dimension of weight matrices**: For weights connecting layers $(L - 1)$ and $L$, the dimension of weight matrices is $(a\text{x}b)$ where $a$ and $b$ are the number of neurons in layers $(L - 1)$ and $L$ respectively. In our case dim$(w^{(1)}) = (2\text{x}3)$ and dim$(w^{(2)}) = (3\text{x}1)$. 

- **Weight matrices**: $\mathbf{w^{(1)} = \begin{bmatrix}
    w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\
    w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)}
  \end{bmatrix}}$
  
  and $\mathbf{w^{(2)} = \begin{bmatrix} w_{11}^{(2)} \\ w_{21}^{(2)} \\ w_{31}^{(2)}  \end{bmatrix}}$
    
- **Design matrix**: This is the matrix containing the entire set of data points. Its dimension will be $(n\text{x}m)$ where $m$ is the number of data points and $n$ is the number of features. It is given as       $\mathbf{X = \begin{bmatrix}  x_{1}^{(1)} & x_{2}^{(1)} \\ x_{1}^{(2)} & x_{2}^{(2)} \\ ... & ... \\ ... & ... \\ x_{1}^{(m)} & x_{2}^{(m)} \\ \end{bmatrix}}$. Here superscript represents the observation number.


**`Calculation from input to hidden layer`**
- First we will calculate $z^{(1)}$. It is given by the matrix multiplication of the design matrix and the weight vector i.e. $z^{(1)} = Xw^{(1)} + b^{(1)}$. Dimensions of $z^{(1)}$ will be $m\text{x}3$ in our case
- Then we apply sigmoid activation on $z^{(1)}$ i.e. $a^{(1)} = \sigma(z^{(1)})$. Its dimension remain the same as that of $z^{(1)}$. 
    
   Summarizing the operations, $\mathbf{z^{(1)} = \begin{bmatrix} x_{1}^{(1)} &  x_{2}^{(1)} \\ x_{1}^{(2)} &  x_{2}^{(2)} \\ .. & .. \\ .. & ..\\ x_{1}^{(m)} &  x_{2}^{(m)} \end{bmatrix} \begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\  w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)}\end{bmatrix} + b^{(1)}}$  and $\mathbf{a^{(1)} = \sigma(z^{(1)})}$

**`Calculation from hidden to output layer`**
- First we'll calculate $z^{(2)}$. It is given as the matrix multiplication of output of the hidden layer and the weight vector i.e. $z^{(2)} = a^{(1)}w^{(2)} + b^{(2)}$ with dimension $(m\text{x}1)$ in our case. 
- Next we apply the sigmoid activation given by $a^{(2)} = \sigma(z^{(2)})$

## Implement forward propagation

In this task you will implement forward propagation for our network wih a single hidden layer (3 neurons) and on output layer (1 neuron). The input layer has 13 neurons. These are saved as `INPUT_SIZE`, `HIDDEN_SIZE` and `OUTPUT_SIZE`

### Instructions
- Initialize weight `w_1` going from input layer to hidden layer as `np.ones(shape=(INPUT_SIZE, HIDDEN_SIZE)) * 0.05`. Similarly initialize `w_2` with relevant shape and multiply it by `0.05`
- Initialize bias `b_1` as `np.zeros((1, INPUT_SIZE))`. Similarly initialize `b_2`
- Define a function `sigmoid` which takes in a single argument `x` and returns the sigmoid equivalent of it
- Define a function `forward` which takes in five arguments: weights `w1` and `w2`, bias `b1` and `b2` and finally the features `X`
- From input to hidden layer, first thing to do is take matrix multiplication of weights and inputs going from input to hidden layer. You can do this as `z1 = X@w1 + b1`. Now apply sigmoid activation on top of it using `sigmoid` function and save it as `a1`. Similarly save `z2` and `a2`
- Make sure it returns the post activation value of the hidden layer and the final output of the network i.e. `a1` and `a2`
- Now time to test our initial predictions. Pass on `w_1`, `w_2`, `b_1`, `b_2` and `X_train` as arguments to `forward` function and save the final network output as `pred`. Print it out 

In [7]:
# size of layers
INPUT_SIZE = X_train.shape[1]
HIDDEN_SIZE = 3
OUTPUT_SIZE = 1
np.random.seed(22)

# Code starts here

# initialize weights and bias
w_1 = np.ones(shape=(INPUT_SIZE, HIDDEN_SIZE)) * 0.05
w_2 = np.ones(shape=(HIDDEN_SIZE, OUTPUT_SIZE)) * 0.05
b_1, b_2 = np.zeros(shape=(1, HIDDEN_SIZE)), np.zeros(shape=(1, OUTPUT_SIZE)) 

# function for sigmoid
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# function for forward propagation
def forward(w1, w2, b1, b2, X):
    
    # hidden layer
    z1 = X@w1 + b1
    a1 = sigmoid(z1)
    
    # output layer
    z2 = a1@w2 + b2
    a2 = sigmoid(z2)
    
    return a2, a1

pred, _ = forward(w_1, w_2, b_1, b_2, X_train.values)
print(pred)


# Code ends here

[[0.52028141]
 [0.52174097]
 [0.51779328]
 ...
 [0.52270479]
 [0.51694232]
 [0.52091934]]


### 3.2 Finding best weights with BACKWARD PROPAGATION

***

Here is the recap of the forward propagation 
$$z^{(1)} = Xw^{(1)} + b^{(1)}$$
$$a^{(1)} = \sigma(z^{(1)})$$
$$z^{(2)} = a^{(1)}w^{(2)} + b^{(2)}$$
$$a^{(2)} = \sigma(z^{(2)})$$


**Gradient Descent**

These operations give an output which is nothing better than a random guess since the weights themselves were chosen at random. Now its time to improve on this random prediction by some method. How to do that? Remember the approach of finding the best-fit line in linear regression. There is an analytical approach and there is an iterative approach. The iterative approach uses **gradient descent** to arrive at the optimum weights; which are the gradients across every dimension of the line in case of linear regression. In case you do not know about this method, below we have outlined an intuition of **gradient descent**. You will learn about its mathematics in detail in an upcoming concept and also about its different types.

Following steps are involved in **gradient descent**:
- Declare a loss function $J(\theta)$ which is a measure of the disparity between actual and predicted outcomes
- Now our main goal is to make this loss as minimum as possible. To achieve this we need to move in the steepest direction opposite the direction of increasing loss function. How do we do that mathematically? We take the gradient of the loss function with respect to every weight and then subtract this gradient multiplied by a learning rate from the original weight.
- Iterate this process until a desired criteria is fulfilled; for ex: loss is not decreasing further.


This approach of gradient descent works well if we have zero hidden layers. How to leverage its power so that it can be used for Multi-layer Perceptron as well? But even before that lets decide on the cost function. Since this is a classification problem, lets take the cross entropy to be the loss function. In case you forgot about cross entropy  . Mathematically it is given by $\mathbf{J(\theta) = \text{Loss}(y_{predicted}, y_{actual}) = -y_{actual}\log(y_{predicted}) - (1-y_{actual})\log(1-y_{predicted})}$


**BACKPROPAGATION**

Our main goal is to reduce the error and for that we need to change/update the predicted value. By decomposing predicted values into the basic elements we can find that weights are variable elements affecting the predictions. Now if we want to change the prediction value we need to change the weights. *How to update the weights in order to update the predicted values?* The answer is backpropagation. Backward Propagation is a mechanism used to update the weights using the power of gradient descent. **It calculates the gradients of the error function with respect to the neural network weights**. 

It is called backward because it moves in the direction opposite to forward propagation. In the first step loss/cost function is calculated, then the weights of the output layer are updated, and then those of the hidden layers are updated and it continues till all the weights are updated. This process is repeated iteratively till the loss change is negligible when we say that the weights are the optimum. 

For our network, we have one hidden layer and one output layer i.e. we have two sets of weights to be updated. Lets break it down:


**Weight update for** $\mathbf{w^{(2)}}$

Goal is to calculate $\mathbf{\frac{\delta J}{\delta w^{(2)}}}$. Now, the error (cross-entropy) depends on the predicted output $a^{(2)}$. The prediction in turn depends on the linear transformation $z^{(1)}$. So you need to calculate these three values
- Take derivative of error with respect to $a^{(2)}$
- Take derivative of $a^{(2)}$ with respect to $z^{(2)}$
- Take derivative of $z^{(2)}$ with respect to $w^{(2)}$

Finally you can have $\mathbf{\frac{\delta J}{\delta w^{(2)}} = \frac{\delta J}{\delta a^{(2)}} \frac{\delta a^{(2)}}{\delta z^{(2)}} \frac{\delta z^{(2)}}{\delta w^{(2)}} }$. This is nothing but the chain-rule of differential calculus.  

The final step would be to take a learning rate ($\eta$). This value determines the step size you want to take in the steepest direction of gradient descent. Finally, you update the weight $w^{(2)}$ as : $$\mathbf{w^{(2)} := w^{(2)} - \eta . \frac{\delta J}{\delta w^{(2)}}}$$

Finally it becomes $\mathbf{w^{(2)} := w^{(2)} - \eta(a^{(1)})^{T}(a^{(2)} - y)}$


**Weight update for** $\mathbf{w^{(1)}}$

We again take the help of the chain rule to compute this term

- First we need to take the derivative of the error $J$ wrt to $a^{(2)}$

$$\text{i.e }\mathbf{\frac{\delta J}{\delta a^{(2)}} =\frac{a^{(2)} - y}{a^{(2)}(1 - a^{(2)})}}$$

- Change in $a^{(2)}$ effects $z^{(2)}$ and so we need to take the derivative of $a^{(2)}$ w.r.t. $z^{(2)}$ 

i.e $$\text{i.e }\mathbf{\frac{\delta a^{(2)}}{\delta z^{(2)}} = a^{(2)}(1 - a^{(2)})} $$

- Now this change in $z^{(2)}$ effects $a^{(1)}$, so we next calculate the derivative of $z^{(2)}$ w.r.t. $a^{(1)}$

 $$\text{i.e }\mathbf{\frac{\delta z^{(2)}}{\delta a^{(2)}} = w^{(2)}} $$


- This leads to change in $z^{(1)}$, so we calculate derivative of $a^{(1)}$ w.r.t. $z^{(1)}$

 $$\text{i.e }\mathbf{\frac{\delta a^{(1)}}{\delta z^{(1)}} = a^{(1)}(1 - a^{(1)})} $$


- Finally change in $z^{(1)}$ effects $w^{(1)}$. So we need to calculate its derivative as well. 

i.e $$\mathbf{\frac{\delta z^{(1)}}{\delta w^{(1)}} = X}$$


Finally we update the weights in a fashion similar to that of $w^{(2)}$.

$$\mathbf{w^{(1)} :=  w^{(1)} - \eta \frac{\delta J} {\delta w^{(1)}}}$$


$\mathbf{w^{(1)} := w^{(1)}- \eta X^{T}((a^{(2)} - y)(w^{(2)})^{T}\circ a^{(1)}(1 - a^{(1)}))}$

## Implement backpropagation

In this task you will take the help of the `forward` funtion you had performed in the previous task and code up the backpropagation algorithm from scratch to find the best set of weights for our problem statement. For your convenience a function to calculate the cross-entropy named `cross_entropy` is already there. It takes in two arguments; `y_actual` and `y_hat` which are the actual target and predicted target values and returns the cross entropy value


### Instructions
- Define a function `backpropagate` which 10 arguments:
    - `w1` and `w2`: weights from input to hidden layer, weights from hidden to output layer
    - `b1` and `b2`: bias from input to hidden layer, bias from hidden to output layer
    - `X_train` and `X_test`: training and test set features
    - `y-train` and `y_test`: training and test set target
    - `epochs`: number of iterations over the entire training data
    - `lr`: learning rate
- Inside the function perform the following operations
    - Declare a variable `m` which stores the number of data points in the trainig set
    - Intialize a for loop which iterates over `range(epochs)`
    - Make use of `forward` function to compute the post sigmoid activations `a1` and `a2`
    - Then calculate its loss `curr_loss` with help of `cross_entropy` function. It will take argument `y_train` and `a2`
    - Now you need to cache some values to avoid computing them repeatedly. These are given as `da2 = a2 - y_train.values.reshape(-1,1)` and `da1 = np.multiply((da2@w_2.T), np.multiply(a1, 1-a1))`. 
    - Now calculate change of weight and bias starting from the outermost layer given by `dw2` and `db2`. For `dw2` it will be the matrix product of `(1/m)*a1.T` and `da2` and for `db2` it will simply be `(1/m)*da2`
    - Similarly calculate `dw1`as `(1/m)*X_train.values.T@(da1)` and `dw2` as `(1/m)*da1`
    - Now update `w1` and `w2` by subtracting the learning rate times their gradient from their previous values
    - Similarly do it for `b1` and `b2` by simply using `b -= np.sum(db)`
    - This function should return the new weights and bias `w1`, `w2`, `b1` and `b2`
- A function `predict` is also given for you which takes arguments `X_test` and `y_test` and returns the accuracy score. Make use of this function to calculate your accuracy over the test data. Save its output as `acc`. Print it out to have a look at your accuracy

In [8]:
# function for loss function 
def cross_entropy(y_actual, y_hat):
    return (1/y_hat.shape[0]) * np.sum (- np.multiply(y_actual.values.reshape(-1,1), np.log(y_hat)) - 
                    np.multiply((1 - y_actual.values.reshape(-1,1)), np.log(1 - y_hat)))

# function to score on unseen data
def predict(X_test, y_test):
    
    # finding best set of weights
    w1_new, w2_new, b1_new, b2_new = backpropagate(w_1, w_2, b_1, b_2, X_train, X_test, y_train, y_test, 4000, 0.01)
    
    # make predictions
    y_pred = forward(w1_new, w2_new, b1_new, b2_new, X_test.values)[0].flatten()
    
    # binarize it
    y_pred = y_pred > 0.5
    
    # calculate accuracy
    acc = accuracy_score(y_pred, y_test.values)
    
    return acc


# Code starts here

# function for backpropagation 
def backpropagate(w1, w2, b1, b2, X_train, X_test, y_train, y_test, epochs, lr):
    
    # number of data points
    m = X_train.shape[0]
    
    for epoch in range(epochs):
        
        # make predictions
        a2, a1 = forward(w1, w2, b1, b2, X_train.values)
        
        # calculate loss
        curr_loss = cross_entropy(y_train, a2)
        
        # cached terms
        da2 = a2 - y_train.values.reshape(-1,1)
        da1 = np.multiply((da2@w_2.T), np.multiply(a1, 1-a1))

        # gradient of w2 and b2
        dw2 = (1/m)*a1.T@da2
        db2 = (1/m)*da2

        # gradient of w1 and b1
        dw1 = (1/m)*X_train.values.T@(da1)
        db1 = (1/m)*da1

        # weight updates
        w1 -= lr*dw1
        w2 -= lr*dw2
        b1 -= lr*np.sum(db1)
        b2 -= lr*np.sum(db2)

        if not (epoch+1)%400:
            print("Loss at epoch {}".format(epoch+1), np.sum(curr_loss))
        
        
        
    return w1, w2, b1, b2

acc = predict(X_test, y_test)
print(acc)

Loss at epoch 400 0.5138458880207293
Loss at epoch 800 0.5021010899959656
Loss at epoch 1200 0.49944911104603334
Loss at epoch 1600 0.4973564679435289
Loss at epoch 2000 0.4951821911174639
Loss at epoch 2400 0.4928407432787319
Loss at epoch 2800 0.4903022621479986
Loss at epoch 3200 0.4875551667076658
Loss at epoch 3600 0.4846042081726663
Loss at epoch 4000 0.48147033005623313
0.789
