## Can a vanilla multi-layer perceptron (MLP) learn the XOR gate?

This is an important point, because there is a lot of confusion about where and why the bias is introduced:

With shallow models, you inject the bias term into the data matrix, giving you your [design matrix](https://en.wikipedia.org/wiki/Design_matrix). In the first notebook (linearly-separable-spaces.ipynb), we were working with a shallow model and hence introduced the bias column. 

When preparing data for a deep model, the biases are added **after** the input layer and sit on the receiving node(s). This is just a convention: you could shift the biases back so that a constant 1 is appended to the inputs of each layer (building a design matrix), leaving no separate bias on the receiving node.

Hence the XOR gate input matrix becomes:

\begin{equation}
X = \begin{bmatrix}
1 & 1 & 1\\
1 & 1 & 0\\
1 & 0 & 1\\
1 & 0 & 0\\
\end{bmatrix}
\rightarrow
\begin{bmatrix}
1 & 1\\
1 & 0\\
0 & 1\\
0 & 0\\
\end{bmatrix}, 
y_{xor} = \begin{bmatrix}
0 \\
1 \\
1 \\
0 \\ \end{bmatrix}
\end{equation}
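
The claim above that the two bias conventions are interchangeable can be checked numerically. The following is a minimal sketch, assuming a single linear layer with made-up weights `W` and bias `b` (both hypothetical, chosen only for illustration). Prepending a column of ones to `X` and stacking `b` on top of `W` should give the same pre-activations as adding `b` after the matrix product.

```python
import numpy as np

# XOR inputs, without a bias column
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)

# Hypothetical weights and bias for a layer with 2 inputs and 2 units
W = np.array([[0.5, -1.0], [2.0, 0.25]])   # shape (2, 2)
b = np.array([0.1, -0.3])                  # shape (2,)

# Convention 1: bias added after the matrix product
h_explicit = X @ W + b

# Convention 2: bias folded into the design matrix
ones = np.ones((X.shape[0], 1))
X_design = np.hstack((ones, X))            # shape (4, 3), ones column first
W_design = np.vstack((b, W))               # shape (3, 2), bias row first
h_design = X_design @ W_design

print(np.allclose(h_explicit, h_design))
```

The bias row must sit in the same position as the ones column (first, in this sketch) for the two forms to agree.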

In [1]:
import numpy as np
import matplotlib.pyplot as plt

Input and target matrices:

In [9]:
X = np.array([
    [1,1],
    [1,0],
    [0,1],
    [0,0]
])
Y = np.array([
    [0],
    [1],
    [1],
    [0],
])


(2, 2)

We will now define the MLP.
Notice the shape of the weight matrices.

_Why is it 3 rows for each?_ 

1 for the bias, and 2 for the features!
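
To make the row count concrete, here is a small shape check. It is only a sketch: the zero-filled weight arrays are placeholders (not trained or initialised values), and the activation is skipped because it does not change any shapes.

```python
import numpy as np

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)  # (4, 2)

# Placeholder weights: only the shapes matter here
W1 = np.zeros((3, 2))   # 1 bias row + 2 feature rows, 2 hidden units
W2 = np.zeros((3, 1))   # 1 bias row + 2 hidden-unit rows, 1 output

X_design = np.hstack((np.ones((4, 1)), X))      # (4, 3)
h1 = X_design @ W1                              # (4, 3) @ (3, 2) -> (4, 2)

H_design = np.hstack((np.ones((4, 1)), h1))     # (4, 3)
h2 = H_design @ W2                              # (4, 3) @ (3, 1) -> (4, 1)

print(X_design.shape, h1.shape, H_design.shape, h2.shape)
```

Each design matrix has one more column than the layer feeding it, which is why both weight matrices need a third row.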

In [None]:
class MLP:
    
    def __init__(self):
        '''
        Initialise the weights of the network 
        3 layers:
        hidden layer weights: (3,2) shape 
        output weights:       (3,1) shape
        '''
        n_in = 2
        n_out = 1
        Glorot_weight_init_scale =  (2/( n_in + n_out))**(1/2)
        
        self.W1 = np.random.normal(loc=0.0,
                                   scale=Glorot_weight_init_scale,
                                   size=(3,2))
        self.W2 = np.random.normal(loc=0.0,
                                   scale=Glorot_weight_init_scale,
                                   size=(3,1))
        
    def activation(self,z):
        '''
        Rectified linear unit
        '''
        return np.maximum(z,0)
    

    def forward(self,x):
        '''
        In TF, and most frameworks, the first (zeroth, when counting from 0)
        dimension is the batch dimension.
        
        Suppose we had a SINGLE input, x, of shape (2,). This would
        require us to inject the additional batch dimension.
        '''
        if len(x.shape) == 1:
            # testing for existence of batch dimension 
            x = x[np.newaxis,...]
            # this adds a new axis to the front of the vector/matrix/tensor
            # if you wanted to add to the end of it, do: 
            #  x = x[...,np.newaxis]
            
        b1 = np.ones(shape=(x.shape[0], 1))
        # np.hstack takes a single sequence of arrays, so pass a tuple
        x1_design = np.hstack((b1, x))
        h1 = x1_design @ self.W1
        x2 = self.activation(h1)
        
        b2 = np.ones(shape=(x2.shape[0], 1))
        # np.hstack takes a single sequence of arrays, so pass a tuple
        x2_design = np.hstack((b2, x2))
        h2 = x2_design @ self.W2
        
        # linear activation
        x3 = h2 
        
        return h1, x2, h2, x3

array([[ 1.13907358,  1.34213213],
       [ 0.1300233 , -0.10749791],
       [-0.86408775,  0.88440219]])

In [12]:
X[1,:][np.newaxis,...].shape

(1, 2)

In [13]:
np.ones(shape=(2, 1))

array([[1.],
       [1.]])
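
As a sanity check that this architecture can represent XOR at all, the sketch below runs the same forward computation with hand-picked weights instead of learned ones. The values of `W1` and `W2` and the helper `forward_with` are assumptions made for illustration (one well-known ReLU solution, written with the bias in the first row). They are not what training would necessarily find.

```python
import numpy as np

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Hand-picked weights, bias row first (assumed values, not learned)
W1 = np.array([[0.0, -1.0],    # biases of the two hidden units
               [1.0,  1.0],    # weights from input feature 1
               [1.0,  1.0]])   # weights from input feature 2
W2 = np.array([[0.0],          # output bias
               [1.0],          # weight from hidden unit 1
               [-2.0]])        # weight from hidden unit 2


def forward_with(x, W1, W2):
    """Two-layer forward pass: ReLU hidden layer, linear output."""
    ones = np.ones((x.shape[0], 1))
    h1 = np.hstack((ones, x)) @ W1       # hidden pre-activations
    x2 = np.maximum(h1, 0)               # ReLU
    h2 = np.hstack((ones, x2)) @ W2      # linear output
    return h2


y_hat = forward_with(X, W1, W2)
print(np.array_equal(y_hat, Y))
```

With these weights the output matches `Y` on all four inputs, so a single hidden layer of two ReLU units is enough to represent the XOR gate. Whether gradient descent reaches such a solution from a random initialisation is a separate question.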