## Can a vanilla multi-layer perceptron (MLP) learn the XOR gate?

The point I am about to make is important, as it leads to a lot of confusion regarding where and why the bias is introduced:

With shallow models, you inject the bias term into the data matrix, giving you your [design matrix](https://en.wikipedia.org/wiki/Design_matrix). In the first notebook (linearly-separable-spaces.ipynb), we were working with a shallow model and hence introduced the bias column. 

When preparing data for a deep model, the bias column is left out of the data matrix. The biases are added **after** the input layer and sit on the target node(s), that is, the nodes receiving the signal. This is just a convention. You could shift all the biases back, such that there is one present for each input node (building the design matrix), but not present on the output node.

Hence the XOR gate input matrix becomes:

\begin{equation}
X = \begin{bmatrix}
1 & 1 & 1\\
1 & 1 & 0\\
1 & 0 & 1\\
1 & 0 & 0\\
\end{bmatrix}
\rightarrow
\begin{bmatrix}
1 & 1\\
1 & 0\\
0 & 1\\
0 & 0\\
\end{bmatrix}, 
y_{xor} = \begin{bmatrix}
0 \\
1 \\
1 \\
0 \\ \end{bmatrix}
\end{equation}
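The claim that the bias placement is only a convention can be checked numerically. The following is a minimal sketch, not part of the original notebook: the weight matrix `W` and bias row `b` are hypothetical values drawn at random purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR inputs without a bias column
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)

# Hypothetical layer parameters, used only for this illustration
W = rng.normal(size=(2, 2))   # weights: 2 inputs -> 2 receiving nodes
b = rng.normal(size=(1, 2))   # one bias per receiving node

# Convention A: the bias is added after the matrix product
out_separate = X @ W + b

# Convention B: the bias is folded into a design matrix
X_design = np.hstack((np.ones((X.shape[0], 1)), X))   # shape (4, 3)
W_aug = np.vstack((b, W))                             # shape (3, 2)
out_design = X_design @ W_aug

same = np.allclose(out_separate, out_design)
print(same)
```

Both conventions give the same pre-activations, because the column of ones multiplies the bias row of `W_aug` and contributes exactly `b` to every sample.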

The target architecture is:
![MLP](media/MLPxorTopology.png)

In [1]:
import numpy as np
import matplotlib.pyplot as plt

Input and target matrices:

In [17]:
X = np.array([
    [1,1],
    [1,0],
    [0,1],
    [0,0]
])
Y = np.array([
    [0],
    [1],
    [1],
    [0],
])

We will now define the MLP.
Notice the shape of the weights.

_Why does each weight matrix have 3 rows?_ 

One row for the bias and two for the incoming features (see the diagram).
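To make the row count concrete, here is a small shape trace through both layers. It is only a sketch: the weights are zero-valued placeholders and the activation is skipped, since neither affects the shapes.

```python
import numpy as np

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
ones = np.ones((X.shape[0], 1))

# Placeholder weights: only the shapes matter in this illustration
W1 = np.zeros((3, 2))
W2 = np.zeros((3, 1))

X_design = np.hstack((ones, X))             # (4, 3): bias column + 2 features
hidden = X_design @ W1                      # (4, 2): one column per hidden node
hidden_design = np.hstack((ones, hidden))   # (4, 3): bias column + 2 hidden values
output = hidden_design @ W2                 # (4, 1): one prediction per sample

shapes = [X_design.shape, hidden.shape, hidden_design.shape, output.shape]
print(shapes)
```

Each matrix product needs the number of weight rows to match the number of design-matrix columns, which is 2 incoming values plus 1 bias in both layers.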

In [193]:
class MLP:
    
    def __init__(self):
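        '''
        Initialise the weights of the network.
        3 layers (input, hidden, output); each weight matrix
        has one extra row for the bias:
        hidden layer weights: (3,2) shape 
        output weights:       (3,1) shape
        '''
        n_in = 2
        n_out = 1
        # Glorot (Xavier) normal scale: sqrt(2 / (fan_in + fan_out)).
        # The same scale is reused for both weight matrices.
        Glorot_weight_init_scale =  (2/( n_in + n_out))**(1/2)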
        
        self.W1 = np.random.normal(loc=0.0,
                                   scale=Glorot_weight_init_scale,
                                   size=(3,2))
        self.W2 = np.random.normal(loc=0.0,
                                   scale=Glorot_weight_init_scale,
                                   size=(3,1))
        
    def activation(self,z):
        '''
        Rectified linear unit
        '''
        return np.maximum(z,0)
    

    def forward(self,x):
        '''
        In TF, and most frameworks, the first (or zeroth, if you know you know) 
        dimension is the batch dimension.
        
        Suppose we had a SINGLE input, x, of shape (2,). This would 
        require us to inject the additional batch dimension.
        '''
        if len(x.shape) == 1:
            # testing for existence of batch dimension 
            x = x[np.newaxis,...]
            # this adds a new axis to the front of the vector/matrix/tensor
            # if you wanted to add to the end of it, do: 
            #  x = x[...,np.newaxis]
            
        b1 = np.ones(shape=(x.shape[0], 1))
        x1_design = np.hstack((b1,x))
        h1 = x1_design @ self.W1
        x2 = self.activation(h1)
        
        b2 = np.ones(shape=(x2.shape[0], 1))
        x2_design = np.hstack((b2,x2))
        h2 = x2_design @ self.W2
        
        # linear activation
        x3 = h2 
        
        return h1, x2, h2, x3         
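The batch-dimension handling in `forward` can be illustrated on its own. This short sketch uses a hypothetical single sample `x` and shows where `np.newaxis` places the new axis in each case.

```python
import numpy as np

# A single sample with 2 features and no batch dimension
x = np.array([1, 0])

# New axis at the front: a batch of 1 sample with 2 features
x_batched = x[np.newaxis, ...]

# New axis at the end: 2 rows with 1 column each (not what forward() needs)
x_column = x[..., np.newaxis]

print(x.shape, x_batched.shape, x_column.shape)
```

Only the first form matches the (batch, features) layout that the design-matrix construction in `forward` expects.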

Now that we have the MLP defined, let's set up the training routine. We need to fix some learning parameters beforehand:

In [23]:
MAX_EPOCHS = 100
LR = 0.1 

In [194]:
model = MLP()


loss_hist, W1_hist, W2_hist = [],[],[]
for epoch in range(MAX_EPOCHS):
    # forward pass
    h1, x2, h2, x3 = model.forward(X)
    
    # mean squared error over the 4 samples
    loss = np.mean((x3 - Y)**2)
    
    loss_hist.append(loss)
    
    # report every 20th epoch
    if epoch % 20 == 0:
        print(f"Epoch: {epoch}\nMSE: {loss:.4f}")
    
    # Backpropagation would go here. Note that many frameworks 
    # have implemented automatic differentiation techniques 
    # so that you don't need to concern yourself with 
    # the derivatives of activations and weight updates. 
    # As written, this loop never updates the weights,
    # so the loss stays constant.

Output (truncated): every printed epoch reports the same value, MSE: 2.5205. The training loop never updates the weights, so the loss cannot change.
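For reference, the missing backpropagation step could look roughly like the sketch below. It is an assumption-laden illustration rather than the notebook's own code: the helpers `add_bias` and `loss_and_grads` are hypothetical names, the loss is taken to be the plain mean squared error, the ReLU derivative at exactly zero is taken as 0, and plain full-batch gradient descent is used. With only two ReLU hidden units, reaching the XOR solution from a random start is not guaranteed.

```python
import numpy as np

def add_bias(a):
    """Prepend a column of ones (the bias input) to a 2-D array."""
    return np.hstack((np.ones((a.shape[0], 1)), a))

def loss_and_grads(W1, W2, X, Y):
    """Forward pass, MSE loss, and gradients with respect to W1 and W2."""
    x1_design = add_bias(X)
    h1 = x1_design @ W1
    x2 = np.maximum(h1, 0)            # ReLU
    x2_design = add_bias(x2)
    x3 = x2_design @ W2               # linear output

    loss = np.mean((x3 - Y) ** 2)

    # Backward pass (chain rule, layer by layer)
    d_x3 = 2.0 * (x3 - Y) / len(Y)    # dL/dx3
    grad_W2 = x2_design.T @ d_x3      # dL/dW2, shape (3, 1)
    d_x2 = (d_x3 @ W2.T)[:, 1:]       # drop the bias column
    d_h1 = d_x2 * (h1 > 0)            # ReLU derivative (0 at h1 == 0)
    grad_W1 = x1_design.T @ d_h1      # dL/dW1, shape (3, 2)
    return loss, grad_W1, grad_W2

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Same initialisation scale as the MLP class; the seed is an arbitrary choice
rng = np.random.default_rng(0)
scale = (2 / (2 + 1)) ** 0.5
W1 = rng.normal(0.0, scale, size=(3, 2))
W2 = rng.normal(0.0, scale, size=(3, 1))

MAX_EPOCHS = 100
LR = 0.1
loss_hist = []
for epoch in range(MAX_EPOCHS):
    loss, grad_W1, grad_W2 = loss_and_grads(W1, W2, X, Y)
    loss_hist.append(loss)
    # Gradient descent update
    W1 -= LR * grad_W1
    W2 -= LR * grad_W2
```

Keeping the gradients in a pure function makes them easy to compare against finite differences before trusting the training loop.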

(1, 2)

In [13]:
def set_to_optimal_weights(model):
    '''
    Sets weights of MLP to optimal set for XOR gate modelling.
    
    '''
    
    model.W1 = np.array([[0, -1],
                         [1,1],
                         [1,1]], dtype=float)
    model.W2 = np.array([[0],
                         [1],
                         [-2]], dtype=float)
    return model 


array([[1.],
       [1.]])
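The hand-picked weights in `set_to_optimal_weights` can be verified with a standalone forward pass. This sketch repeats the forward computation inline (bias column, ReLU hidden layer, linear output) instead of relying on the `MLP` class, so it runs on its own.

```python
import numpy as np

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights copied from set_to_optimal_weights; the first row of each is the bias
W1 = np.array([[0, -1],
               [1, 1],
               [1, 1]], dtype=float)
W2 = np.array([[0],
               [1],
               [-2]], dtype=float)

ones = np.ones((X.shape[0], 1))
h1 = np.hstack((ones, X)) @ W1        # hidden pre-activations
x2 = np.maximum(h1, 0)                # ReLU
pred = np.hstack((ones, x2)) @ W2     # linear output

print(pred.ravel())
print(np.array_equal(pred, Y))
```

The first hidden node computes `x1 + x2` and the second computes `relu(x1 + x2 - 1)`, so the output `h1 - 2*h2` is 0, 1, 1, 0 for the four inputs, which reproduces the XOR targets exactly.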