# Learning XOR

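The XOR (exclusive or) function takes two binary inputs $x_1$ and $x_2$ and returns 1 when exactly one of them equals 1, and 0 otherwise. It provides the target function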
$$y = f^*(x)$$
that we want to learn. Our model provides a function
$$y = f(x;\theta)$$
and our learning algorithm adapts the parameters $\theta$ so that $f$ approximates $f^*$.
We want our network to perform correctly on the four points
$$X = \{ [0,0], [0,1], [1,0], [1,1] \}$$

We can treat this as a regression problem and use a mean-squared error (MSE) loss.
(*Note: in practice, MSE is usually a poor choice for binary targets; it is used here because it admits a closed-form solution.*)

Evaluated over the whole training set, the MSE loss is

$$
J(\theta) = \frac{1}{4} \sum_{x \in X} (f^*(x) - f(x;\theta))^2
$$
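
As a minimal sketch of this loss, the snippet below evaluates $J$ for an arbitrary candidate model on the four XOR points. The helper name `mse_loss` and the constant-output candidate are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# The four XOR inputs and their targets f*(x).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_true = np.array([0, 1, 1, 0])

def mse_loss(model, X, y_true):
    """Mean-squared error of `model` over the rows of X (hypothetical helper)."""
    y_pred = np.array([model(x) for x in X])
    return np.mean((y_true - y_pred) ** 2)

def constant_half(x):
    """Illustrative candidate: ignores its input and always outputs 0.5."""
    return 0.5

loss = mse_loss(constant_half, X, y_true)
print(loss)
```

Any callable that maps one input vector to a scalar can be passed as `model`, which makes it easy to compare candidate models on the same four points.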

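Now we must choose the form of our model $f(x;\theta)$. Suppose we choose a **linear model**, with $\theta$ consisting of a weight vector $w$ and a bias $b$:

$$
f(x;w,b) = x^T w + b = WX^T
$$
where $X = [x, 1]$ is the input augmented with a constant 1 and $W = [w, b]$ stacks the weights and the bias into one row vector. (Here $X$ denotes the augmented input, not the set of training points above.) Because the model is linear in its parameters and the loss is quadratic, we can minimize $J(\theta)$ in closed form using the **normal equations**.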

Solving the normal equations yields

$$
w = [0, 0], \qquad b = \frac{1}{2}
$$
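
To make this step concrete, the normal equations $(X^T X)\,W^T = X^T y$ can be written out by hand. The worked check below assumes a design matrix whose rows are the augmented inputs $[x_1, x_2, 1]$ for the four XOR points, and a target vector $y = [0, 1, 1, 0]^T$:

$$
X^T X =
\begin{bmatrix}
2 & 1 & 2 \\
1 & 2 & 2 \\
2 & 2 & 4
\end{bmatrix},
\qquad
X^T y =
\begin{bmatrix}
1 \\ 1 \\ 2
\end{bmatrix}
$$

Substituting $w = [0, 0]$ and $b = \frac{1}{2}$ satisfies all three equations:

$$
\begin{bmatrix}
2 & 1 & 2 \\
1 & 2 & 2 \\
2 & 2 & 4
\end{bmatrix}
\begin{bmatrix}
0 \\ 0 \\ \tfrac{1}{2}
\end{bmatrix}
=
\begin{bmatrix}
1 \\ 1 \\ 2
\end{bmatrix}
$$

Since $\det(X^T X) = 4 \neq 0$, this solution is unique.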

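```python
import numpy as np

# Design matrix: each row is an input [x1, x2] augmented with a constant 1 for the bias.
X = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
])

# XOR targets for the four inputs.
y = np.array([0, 1, 1, 0])

# Solve the normal equations (X^T X) W = X^T y for W = [w1, w2, b].
w1, w2, b = np.linalg.solve(X.T @ X, X.T @ y)

w = [w1, w2]
print(f'w = {w}, b = {b}')
```

```
w = [0.0, 0.0], b = 0.5
```

Thus the linear model outputs **0.5 everywhere**, so it cannot represent XOR.

```python
# Evaluate the fitted linear model on all four inputs.
W = np.array([w1, w2, b])
W @ X.T
```

```
array([0.5, 0.5, 0.5, 0.5])
```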


## Feature Space Transformation - Change in Perspective

![title](img/picture.png)

Solving the XOR problem by learning a representation. The bold numbers on the plot show the required outputs of the XOR function.

**(Left)** A linear model applied directly to the inputs cannot solve XOR:

* When $x_1 = 0$, the output must **increase** as $x_2$ increases.
* When $x_1 = 1$, the output must **decrease** as $x_2$ increases.

A linear model uses a fixed coefficient $w_2$ for $x_2$ and therefore cannot change its behavior depending on the value of $x_1$.

**(Right)** After transforming the input through a nonlinear feature space, a linear model *can* solve XOR.

Because a linear model cannot represent XOR, we require a model that can **learn a new feature space** where XOR becomes linearly separable.
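
The fixed-coefficient argument can be checked numerically. The sketch below uses arbitrary illustrative parameters (not values from the original) and measures how the output of a linear model changes as $x_2$ goes from 0 to 1, once with $x_1 = 0$ and once with $x_1 = 1$.

```python
import numpy as np

def linear_model(x, w, b):
    """Linear model f(x; w, b) = x^T w + b."""
    return x @ w + b

# Arbitrary illustrative parameters; any other values lead to the same conclusion.
w = np.array([2.0, -3.0])
b = 1.0

# Change in output as x2 goes from 0 to 1, with x1 held at 0 and then at 1.
slope_at_x1_0 = linear_model(np.array([0, 1]), w, b) - linear_model(np.array([0, 0]), w, b)
slope_at_x1_1 = linear_model(np.array([1, 1]), w, b) - linear_model(np.array([1, 0]), w, b)

# Both differences equal w[1], so they cannot have opposite signs as XOR requires.
print(slope_at_x1_0, slope_at_x1_1, w[1])
```

XOR needs the first difference to be $+1$ and the second to be $-1$, which no single value of $w_2$ can provide.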

To do this, we introduce a **feedforward network with one hidden layer** containing two hidden units.
This network computes a hidden vector

$$
h = f^{(1)}(x; W, c)
$$

and the output layer applies a linear model to the hidden units:

$$
y = f^{(2)}(h; w, b)
$$

Thus the full model is

$$
f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x))
$$

![title](img/picture2.png)

An example feedforward network for solving XOR, drawn in two styles.

- **(Left)** Every unit is shown explicitly.
- **(Right)** Each layer is represented as a vector node.

This network has one hidden layer with two units.

Most neural networks compute hidden units using an **affine transformation** followed by a nonlinear **activation function**:

$$
h = g(W^T x + c)
$$

Here, $W$ contains the weights, $c$ contains the biases, and the activation function $g$ is applied **element-wise**:

$$
h_i = g(x^T W_{:,i} + c_i)
$$
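
The vector form and the per-unit form describe the same computation. As a small sketch, the code below evaluates both and compares them; the function names, the parameter values, and the use of `np.tanh` as a stand-in for $g$ are assumptions made only for illustration.

```python
import numpy as np

def hidden_layer(x, W, c, g):
    """Vector form: h = g(W^T x + c), with g applied element-wise."""
    return g(W.T @ x + c)

def hidden_unit(x, W, c, g, i):
    """Per-unit form: h_i = g(x^T W[:, i] + c_i)."""
    return g(x @ W[:, i] + c[i])

# Illustrative values only: 2 inputs, 2 hidden units, tanh as a placeholder activation.
W = np.array([[1.0, -1.0],
              [0.5, 2.0]])
c = np.array([0.0, -1.0])
x = np.array([1.0, 2.0])

h = hidden_layer(x, W, c, np.tanh)
h_per_unit = np.array([hidden_unit(x, W, c, np.tanh, i) for i in range(W.shape[1])])

print(np.allclose(h, h_per_unit))
```

Column $i$ of $W$ holds the incoming weights of hidden unit $i$, which is why the per-unit form indexes `W[:, i]`.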

Modern networks typically use the **Rectified Linear Unit (ReLU)**:

$$
g(z) = \max\{0, z\}
$$

![title](img/picture3.png)

The rectified linear activation function. ReLU is the default recommended activation for most feedforward networks.
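
A minimal NumPy sketch of this activation follows; the function name `relu` and the sample inputs are illustrative choices.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max{0, z}, applied element-wise."""
    return np.maximum(0, z)

# Negative inputs map to 0; non-negative inputs pass through unchanged.
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))
```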

Now we can specify the complete network:

$$
f(x; W, c, w, b) = w^T \max\{0, W^T x + c\} + b
$$
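
Read as code, this expression is a hidden layer followed by a linear output layer. The sketch below evaluates it for a single input vector; the function name `network` and the parameter values are hypothetical, chosen so the result is easy to verify by hand, and they are not a solution to XOR.

```python
import numpy as np

def network(x, W, c, w, b):
    """f(x; W, c, w, b) = w^T max{0, W^T x + c} + b for a single input vector x."""
    h = np.maximum(0, W.T @ x + c)  # hidden layer: affine map followed by ReLU
    return w @ h + b                # output layer: linear model applied to h

# Illustrative parameters only: an identity first layer with no biases.
W = np.eye(2)
c = np.zeros(2)
w = np.array([1.0, 1.0])
b = 0.0

# ReLU zeroes the negative component, so only the first input contributes.
out = network(np.array([1.0, -2.0]), W, c, w, b)
print(out)
```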


### Solution to the XOR problem

We can now specify a solution to the XOR problem. Let the first-layer weight matrix and bias vector be:

$$
W =
\begin{bmatrix}
1 & 1 \\
1 & 1
\end{bmatrix}
$$

$$
c =
\begin{bmatrix}
0 \\
-1
\end{bmatrix}
$$

The output-layer parameters are

$$
w =
\begin{bmatrix}
1 \\
-2
\end{bmatrix},
\qquad
b = 0
$$

Now define the input matrix, with one training example per row:

$$
X =
\begin{bmatrix}
0 & 0 \\
0 & 1 \\
1 & 0 \\
1 & 1
\end{bmatrix}
$$

Because the examples are stored as rows, the hidden layer for the whole batch is $\max\{0, XW + c^T\}$, the row-wise counterpart of $\max\{0, W^T x + c\}$.

### Step 1: Compute the linear transformation ($XW$)

$$
XW =
\begin{bmatrix}
0 & 0 \\
1 & 1 \\
1 & 1 \\
2 & 2
\end{bmatrix}
$$

### Step 2: Add the bias vector

We add $c^T = [0, -1]$ to each row:

$$
XW + c^T =
\begin{bmatrix}
0 & -1 \\
1 & 0 \\
1 & 0 \\
2 & 1
\end{bmatrix}
$$

At this point all four examples still lie on a single line. Moving along that line, the target output must rise from 0 to 1 and then fall back to 0, which no linear model can do, so the nonlinearity is still needed.

### Step 3: Apply the ReLU activation

$$
H = \max\{0, XW + c^T\} =
\begin{bmatrix}
0 & 0 \\
1 & 0 \\
1 & 0 \\
2 & 1
\end{bmatrix}
$$

These hidden-layer features no longer lie on a single line, and a linear model can now solve XOR.

### Step 4: Compute the output layer

Multiply each hidden vector by $w = [1, -2]^T$ and add $b = 0$:

$$
Hw + b =
\begin{bmatrix}
0 \cdot 1 + 0 \cdot (-2) \\
1 \cdot 1 + 0 \cdot (-2) \\
1 \cdot 1 + 0 \cdot (-2) \\
2 \cdot 1 + 1 \cdot (-2)
\end{bmatrix}
=
\begin{bmatrix}
0 \\ 1 \\ 1 \\ 0
\end{bmatrix}
$$

This matches the XOR target pattern exactly:

$$
\text{XOR}(x_1, x_2) =
\begin{bmatrix}
0 \\ 1 \\ 1 \\ 0
\end{bmatrix}
$$

The network therefore produces the correct output for every example in the training set. The key point is that the hidden features

$$
H = \max\{0, XW + c^T\}
$$

move the points into a space where a linear model **can** represent XOR.
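
The claim that a linear model suffices in the hidden space can be checked numerically. The sketch below assumes the first-layer $W$ and $c$ given above, computes the hidden features for all four inputs, and then fits an output layer by least squares on those features. The variable names are illustrative, and the least-squares fit is just one convenient way to find output-layer parameters.

```python
import numpy as np

# First-layer parameters from the text.
W = np.array([[1, 1],
              [1, 1]])
c = np.array([0, -1])

# XOR inputs (one example per row) and targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hidden features max{0, XW + c}; c is broadcast across the rows.
H = np.maximum(0, X @ W + c)

# Fit a linear output layer on [h, 1] by least squares.
H_aug = np.hstack([H, np.ones((4, 1))])
params, *_ = np.linalg.lstsq(H_aug, y, rcond=None)
w_fit, b_fit = params[:2], params[2]

# Predictions of the fitted output layer on the four training points.
predictions = H_aug @ params
print(np.round(predictions, 6))
```

Up to floating-point error, the fitted output layer reproduces the XOR targets, with weights close to $[1, -2]$ and a bias close to 0. Running the same fit on the raw inputs instead of `H` gives the constant 0.5 model from the first part of this section.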
