# How Do We Learn Networks?

**(Run this cell to define useful LaTeX macros)**
\\[
\newcommand{\bigoh}[1]{\mathcal{O}\left(#1\right)}
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\\]

Okay, I've told you about how I can build any Boolean function out of a series of linear classifiers (each followed by a *nonlinearity* to clean up the output so it is a one or a zero).

I've shown you that it's important to have a series of them, because complicated Boolean formulas come from chaining together simpler parts like AND, OR, and NOT.

We can call this a *Boolean network*. What I want to do is turn an algorithm loose on a dataset and have it build a Boolean network for me.

The experimenter will define the *architecture* of the network. This includes the number of layers, and the number of logical units that are used at each layer. For instance, I might want to train a network of five layers with five gates each.

The algorithm needs to learn which gates to use at each layer. It needs to decide which gates in one layer get wired into which gates in the next layer. You could call this the *parameterization* of the network.

The question is: how will the algorithm learn this?

### A Naive Learning Algorithm

First, before we can decide how to learn from the data, we need a notion of "goodness of network."

The simplest possibility is to calculate the percentage of training examples on which the network produces the wrong output. Lower is better. This is called the *misclassification error*. (Equivalently, you can track the percentage it gets right, the *accuracy*, where higher is better.)
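Here is a minimal sketch of this measure, assuming the network's outputs and the desired outputs are available as arrays of zeros and ones. The function name and the example values are made up for illustration. It computes the fraction of examples the network gets wrong; one minus that is the fraction it gets right.

```python
import numpy as np

def misclassification_error(predictions, targets):
    """Fraction of examples where the predicted bit differs from the target bit."""
    predictions = np.asarray(predictions)
    targets = np.asarray(targets)
    return float(np.mean(predictions != targets))

# Hypothetical outputs of a network on five training examples.
predictions = [1, 0, 1, 1, 0]
targets = [1, 0, 0, 1, 1]

error = misclassification_error(predictions, targets)  # two of five are wrong
accuracy = 1.0 - error
```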

An important note: our network, for now, is *nonprobabilistic.* We can't use our cross entropy error yet.

For lack of any better idea, let's use an iterative procedure in which the algorithm randomly chooses a gate to change. It then randomly chooses a new candidate gate to switch it to. If this would improve the error, it makes the change. Otherwise it doesn't make the change.

I suppose if you change an AND or OR to a NOT, you must randomly decide which input to delete. Likewise, if you change a NOT to an AND or OR, you must randomly choose a new input to add.

Whatever. Just run this for a long time. Eventually, you will reach a point where no single change will have any positive impact. You have reached a local optimum. Whenever you suspect you might be in a local optimum, you can just iterate through all possible changes and check. If you really are, then stop.
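The loop above might be sketched roughly like this. Everything here is my own illustrative choice: the names, the representation of a gate as a `(kind, wires)` pair, and the convention that the first gate of the last layer is the network's output. To keep it short, a proposed change re-draws both the gate type and its wiring at random (rather than adding or deleting a single input), and the exhaustive local-optimum check is left out.

```python
import random

# Each gate kind maps to (number of inputs, function on bits).
GATES = {
    "AND": (2, lambda a, b: a & b),
    "OR": (2, lambda a, b: a | b),
    "NOT": (1, lambda a: 1 - a),
}

def random_gate(n_inputs, rng):
    """Pick a random gate kind and wire it to random units of the previous layer."""
    kind = rng.choice(sorted(GATES))
    arity = GATES[kind][0]
    wires = tuple(rng.randrange(n_inputs) for _ in range(arity))
    return kind, wires

def random_network(n_inputs, layer_sizes, rng):
    network, width = [], n_inputs
    for size in layer_sizes:
        network.append([random_gate(width, rng) for _ in range(size)])
        width = size
    return network

def evaluate(network, bits):
    values = list(bits)
    for layer in network:
        values = [GATES[kind][1](*(values[w] for w in wires)) for kind, wires in layer]
    return values[0]  # assumed convention: first gate of the last layer is the output

def error(network, examples):
    """Misclassification error: fraction of examples with the wrong output."""
    wrong = sum(evaluate(network, x) != y for x, y in examples)
    return wrong / len(examples)

def naive_search(examples, n_inputs, layer_sizes, steps, seed=0):
    rng = random.Random(seed)
    network = random_network(n_inputs, layer_sizes, rng)
    best = error(network, examples)
    for _ in range(steps):
        # Randomly choose a gate and a random replacement for it.
        li = rng.randrange(len(network))
        gi = rng.randrange(len(network[li]))
        width = n_inputs if li == 0 else len(network[li - 1])
        old = network[li][gi]
        network[li][gi] = random_gate(width, rng)
        new = error(network, examples)
        if new < best:
            best = new                # keep the change only if it improves the error
        else:
            network[li][gi] = old     # otherwise undo it
    return network, best

# Hypothetical dataset: the XOR truth table.
xor_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
network, best = naive_search(xor_examples, n_inputs=2, layer_sizes=[2, 2, 1], steps=2000)
```

Notice that every single proposal costs a full pass over the training set, whether or not it turns out to be useful.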

This algorithm is stupid and inefficient. The problem is that there is no intelligence or principle behind the proposed changes: they are random. It is expensive to propose and evaluate a change. So the proposal of useless changes is a waste of valuable time.

### Error Derivatives Give Direction

In the case of linear and logistic regression, the error derivative showed us in what direction to change each of the parameters. There was also an efficient, *vectorized* calculation of these derivatives. Many changes were made *simultaneously*. And all those proposed changes had a high likelihood of having a positive impact.

The fundamental reason that we can't use error derivatives to guide the optimization of our Boolean network is that all our choices are *discrete*. Derivatives are all about "what would be the incremental change in output if I made a small incremental change to the input?" Our output is discrete: it is always binary, so it cannot change incrementally. Likewise, our "inputs", which are the choice of gates and how to wire them together, are also discrete.

So we want to figure out how to tweak our Boolean network to make it *differentiable*. When we have done that, we will have a *Neural Network*.


### Eliminating Discrete Gate Choices And Wiring

Our choice of gates to use and which gates to wire to which others is discrete. Let's change that.

As mentioned, the AND, OR, and NOT gates are all representable by a linear classifier. A linear function has continuous parameters: the coefficients for the inputs, plus the intercept.

So let's do this. For each layer, we'll wire every "unit" into every unit of the next layer. So if there are $n$ units in the second layer, and $m$ units in the third layer, there are $nm$ connections.

Each unit in the third layer does a weighted sum of its inputs. The weights (and also an intercept term) are the parameters we have to choose. For each of the units in the third layer, there are $n$ weight parameters and $1$ intercept parameter.
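As a small sketch of that weighted sum, assuming NumPy and some made-up sizes and random values: I store the weights for the whole layer as an $m \times n$ matrix `W` (one row per unit in the third layer) and the intercepts as a vector `b`. These names are mine, not anything fixed.

```python
import numpy as np

n, m = 4, 3                      # hypothetical sizes of the second and third layers
rng = np.random.default_rng(0)

W = rng.normal(size=(m, n))      # row i holds the n weights of unit i in the third layer
b = rng.normal(size=m)           # one intercept per unit in the third layer
x = rng.uniform(size=n)          # outputs of the second layer

z = W @ x + b                    # all m weighted sums computed at once

n_connections = W.size           # n * m connections
n_params = W.size + b.size       # m * (n + 1) parameters in total
```

Computing every unit's weighted sum as one matrix-vector product is the same kind of vectorization we relied on for regression.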

Our original primitive Boolean operations like AND, OR, and NOT are still representable by choosing appropriate weights. However, we can now also capture relationships among three or more inputs.
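To make that concrete, here is one possible choice of weights, assuming a unit that outputs one when its weighted sum is positive and zero otherwise. The particular numbers are just one option among many, and the three-input "majority" unit is my own example of a relationship that no single two-input gate captures.

```python
import numpy as np
from itertools import product

def unit(weights, intercept, inputs):
    """Linear classifier followed by a step: 1 if the weighted sum is positive, else 0."""
    return int(np.dot(weights, inputs) + intercept > 0)

# One (of many) weight choices for each gate.
AND = ([1.0, 1.0], -1.5)              # positive only when both inputs are 1
OR = ([1.0, 1.0], -0.5)               # positive when at least one input is 1
NOT = ([-1.0], 0.5)                   # positive only when the input is 0
MAJORITY = ([1.0, 1.0, 1.0], -1.5)    # positive when at least two of three inputs are 1

and_table = [unit(*AND, bits) for bits in product([0, 1], repeat=2)]
or_table = [unit(*OR, bits) for bits in product([0, 1], repeat=2)]
not_table = [unit(*NOT, bits) for bits in product([0, 1], repeat=1)]
majority_table = [unit(*MAJORITY, bits) for bits in product([0, 1], repeat=3)]
```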


So we've swapped out discrete choice of wiring and gate type for a choice of continuous valued parameters.


### Make Nonlinearity Differentiable

We were previously "cleaning up" the output of a gate by using a function that mapped every negative value to zero, and every positive value to one. This is sometimes called the Heaviside step function.

The step function is undesirable. First, it is discontinuous, and doesn't have a derivative at $x = 0$. But that isn't the main problem.

The real problem is that the Heaviside function has a zero derivative everywhere else. Let me explain why this is a problem.

Say we are considering an example. For this example, a unit outputs zero. Let us say that it would be better if the unit were to output a one.

The way to make this happen is to tweak the parameters for this unit so that its overall input becomes positive.

However, $\fpartial{H}{\theta_{i, j}} = 0$, because the derivative of the Heaviside function is zero wherever it is defined. As far as the Heaviside function is concerned, there is no benefit to an incremental improvement to the input if it doesn't switch the function from zero to one.

Since derivatives are always about *marginal* changes, which are infinitesimal, a derivative-based method will never realize that changing the parameters could help.
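You can see this numerically. The sketch below uses made-up weights and a made-up input whose weighted sum is $-0.5$, and estimates the derivative of the unit's output with respect to each weight by a small finite difference. I've arbitrarily defined the step to be zero at an input of exactly zero.

```python
import numpy as np

def heaviside(z):
    """Step nonlinearity: 1 for positive input, 0 otherwise (0 at z == 0 by assumption)."""
    return 1.0 if z > 0 else 0.0

def unit_output(theta, x):
    return heaviside(float(np.dot(theta, x)))

theta = np.array([0.3, -0.8])   # hypothetical weights
x = np.array([1.0, 1.0])        # hypothetical input; the weighted sum is -0.5

eps = 1e-3
finite_diffs = []
for i in range(len(theta)):
    bump = np.zeros_like(theta)
    bump[i] = eps
    # Central difference: nudge one weight slightly up and slightly down.
    diff = (unit_output(theta + bump, x) - unit_output(theta - bump, x)) / (2 * eps)
    finite_diffs.append(diff)

# A large change does flip the unit, but no small change hints at it.
big_step = theta + np.array([1.0, 0.0])   # the weighted sum becomes +0.5
flipped = unit_output(big_step, x)
```

Every entry of `finite_diffs` is zero, even though a big enough move in the first weight flips the unit to one.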


Let's replace the Heaviside step function with the logistic function. The logistic function is a smooth function that moves from zero to one as you increase the value of its input.

You can think of the logistic function as a continuous version of the Heaviside function. The logistic function can see the value in marginal changes that move it closer to its ideal value.
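A quick sketch of the logistic function $\sigma(z) = 1 / (1 + e^{-z})$ and its derivative $\sigma(z)(1 - \sigma(z))$, assuming NumPy. The input value $-0.5$ is just an illustrative weighted sum for a unit that we would like to push toward one.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_derivative(z):
    s = logistic(z)
    return s * (1.0 - s)

z = -0.5                                  # hypothetical weighted sum of a unit
slope = logistic_derivative(z)            # strictly positive for every finite z

# Check the formula against a finite difference.
eps = 1e-6
numeric_slope = (logistic(z + eps) - logistic(z - eps)) / (2 * eps)
```

Unlike the step function, the slope here is never zero, so a small increase in the input always shows up as a small increase in the output.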

### The Output Is Now Continuous Valued

Now that we are applying the logistic function to the output of each unit, all the outputs of the units are continuous valued in the range of zero to one.

That's great, because we can now use our cross entropy error again. We can interpret the final output as the "probability" the model assigns to the desired result being a one.
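As a reminder, here is a sketch of the cross entropy error, assuming the usual binary form averaged over examples. The clipping constant `eps` is my own addition to avoid taking the log of zero, and the example probabilities are made up.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities p and 0/1 targets y."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

targets = [1, 0]
confident_right = cross_entropy([0.9, 0.1], targets)
unsure = cross_entropy([0.5, 0.5], targets)
confident_wrong = cross_entropy([0.1, 0.9], targets)
```

The error shrinks smoothly as the outputs move toward the desired values, which is exactly the kind of incremental signal the misclassification error could not give us.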


### Everything Is Differentiable

We are done for now! The model is now fully differentiable with respect to the parameters. Now we can use our typical gradient descent approach to train it.
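To show all the pieces working together, here is a rough sketch of gradient descent on a tiny network with one hidden layer, using NumPy and derivatives written out by hand with the chain rule. The dataset (XOR), the hidden layer size, the learning rate, and the number of steps are all arbitrary choices of mine. Whether it ends up fitting XOR well depends on the random starting point, so treat it as an illustration rather than a recipe.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    H = logistic(X @ W1 + b1)    # hidden unit outputs, shape (N, h)
    P = logistic(H @ W2 + b2)    # final output "probabilities", shape (N, 1)
    return H, P

def loss(params, X, Y):
    """Mean cross entropy error over the dataset."""
    _, P = forward(params, X)
    return float(-np.mean(Y * np.log(P) + (1 - Y) * np.log(1 - P)))

def gradients(params, X, Y):
    """Derivatives of the loss with respect to every parameter, via the chain rule."""
    W1, b1, W2, b2 = params
    H, P = forward(params, X)
    N = X.shape[0]
    dZ2 = (P - Y) / N            # derivative w.r.t. the output unit's weighted sum
    dW2 = H.T @ dZ2
    db2 = dZ2.sum(axis=0)
    dH = dZ2 @ W2.T              # pass the derivative back to the hidden outputs
    dZ1 = dH * H * (1 - H)       # logistic derivative is H * (1 - H)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return dW1, db1, dW2, db2

# Hypothetical dataset: XOR of two inputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)
params = [rng.normal(size=(2, 3)), np.zeros(3), rng.normal(size=(3, 1)), np.zeros(1)]

initial_loss = loss(params, X, Y)
learning_rate = 0.5
for _ in range(2000):
    grads = gradients(params, X, Y)
    # Every parameter is nudged at once, each in its own downhill direction.
    params = [p - learning_rate * g for p, g in zip(params, grads)]
final_loss = loss(params, X, Y)
```

Compare this with the naive algorithm: each step changes all the parameters simultaneously, and each change is pointed in a direction that should reduce the error.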

This is what a neural network is!