<h1>The Essence of Neural Networks</h1>

<p>Suppose we want to predict ($y$) whether you are going to fail ($0$) or pass ($1$) an exam. The inputs (a.k.a. <b>features</b>) are the things that might influence the prediction, including some that turn out not to matter, for instance:</p>

<ul>
	<li>How much did you study?  $\rightarrow x_1$</li>
	<li>How smart are you?  $\rightarrow x_2$</li>
	<li>Your previous knowledge $\rightarrow x_3$</li>
	<li>Your name $\rightarrow x_4$</li>
</ul>

<p>In this case, we can assume that $x_1$ has a strong influence on the prediction. $x_2$ and $x_3$ can help you in the exam, but they are not decisive. Finally, $x_4$ is not relevant to the prediction. This last observation gives us the notion of the importance level, or <b>weight</b>, of each variable.</p>

<p>Now we need an equation to get our probability. A first step is to combine the inputs as follows:</p>

$$z = x_1(w_1) + x_2(w_2) + x_3(w_3) + x_4(w_4)$$

<p>where each $w_i$ is the weight that we will assign to each variable:</p>
<ul>
	<li>$w_1 = 1$ because it is decisive for the prediction.</li>
	<li>$w_2 = 0.5$ because it can probably help.</li>
	<li>$w_3 = 0.2$ because it can help if you took a similar course.</li>
	<li>$w_4 = 0$ because it is in fact not important.</li>
</ul>

$$z = x_1(1) + x_2(0.5) + x_3(0.2) + x_4(0)$$
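<p>As a minimal sketch of this weighted sum in Python, assuming hypothetical feature values scaled to the range 0 to 1 (the values and the dictionary keys are made up for illustration; only the weights come from the text):</p>

```python
# Hypothetical feature values on a 0-to-1 scale (illustrative, not real data).
features = {"study": 0.9, "smartness": 0.6, "previous_knowledge": 0.5, "name": 0.3}

# Weights taken from the list above: w_1 = 1, w_2 = 0.5, w_3 = 0.2, w_4 = 0.
weights = {"study": 1.0, "smartness": 0.5, "previous_knowledge": 0.2, "name": 0.0}


def weighted_sum(x, w):
    """Return z = x_1*w_1 + ... + x_n*w_n for matching keys."""
    return sum(x[key] * w[key] for key in x)


z = weighted_sum(features, weights)
print(z)
```

<p>Changing the value stored under <code>"name"</code> leaves <code>z</code> unchanged, because its weight is $0$.</p>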

<p>Once we have $z$, we need a function that maps $z$ to our final result ($y$). For instance, $z$ may end up with a value of $2.8$, but we need a value between $0$ and $1$. Functions of this kind are called <b>Activation Functions</b> ($f(x)$), and we can model the whole system as follows:</p>

<img style="width: 400px;" src="img/neuron_model.png"/>

<p>We call this a <b>Neuron</b> or <b>The Perceptron Model</b>, and we can express it more formally as follows:</p>

$$y= f(\sum_{i=1}^{n}x_iw_i)$$

<p>Or, adding a <b>bias</b> term ($w_0$), we can express it as follows:</p>

$$y= f(w_0 + \sum_{i=1}^{n}x_iw_i)$$

<p>Finally, a slightly more convenient way of expressing this uses linear algebra:</p>

$$y= f(w_0 + \textbf{X}^T\textbf{W}) \\[10pt] \textbf{X} = \begin{bmatrix}
x_1\\ 
\vdots \\ 
x_n
\end{bmatrix} \quad \textrm{and} \quad \textbf{W} = \begin{bmatrix}
w_1\\ 
\vdots \\ 
w_n
\end{bmatrix}$$
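<p>The formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not name a specific activation function, so the sigmoid is used here as one common choice that maps any $z$ into a value between $0$ and $1$. The input values and the bias of $-1$ are hypothetical.</p>

```python
import math


def sigmoid(z):
    """One possible activation function (an assumption): squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, w0=0.0, activation=sigmoid):
    """Compute y = f(w0 + sum_i x_i * w_i) for input list x and weight list w."""
    z = w0 + sum(xi * wi for xi, wi in zip(x, w))
    return activation(z)


# A raw score such as z = 2.8 lands between 0 and 1 after the activation.
print(sigmoid(2.8))

# Hypothetical inputs with the weights from the example, plus a made-up bias.
x = [0.9, 0.6, 0.5, 0.3]
w = [1.0, 0.5, 0.2, 0.0]
y = neuron(x, w, w0=-1.0)
print(y)
```

<p>Passing the activation as a parameter keeps the weighted sum separate from the choice of $f$, so a different function can be swapped in without touching the rest of the neuron.</p>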





<h3>Activation Functions</h3>

<p>Broadly speaking, we can classify data into two categories: when the two classes can be separated by a straight line, we say that the data is <b>Linearly Separable</b> (A); and when the two classes can only be separated by a curve or a more complex function, we say that the data is <b>Non-Linearly Separable</b> (B).</p>

<img style="width: 450px;" src="img/separable_data.png"/>

<p>That's why we use activation functions in our perceptron model: the activation functions we use are non-linear and add flexibility to our model, which matters because most real-world data is non-linear too. In the few cases where the data is linearly separable we don't need a non-linear activation function, but most of the time we will.</p>
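<p>One way to see why the non-linearity matters is to chain two neurons. The sketch below is a toy example with made-up scalar weights: without a non-linear activation, the chained neurons still produce a straight line, whereas putting <code>tanh</code> (chosen here only as an example of a non-linear function) between them bends the output.</p>

```python
import math


def identity(z):
    """A 'no activation' placeholder: returns z unchanged."""
    return z


def neuron(x, w, w0, activation):
    """Single-input neuron: y = f(w0 + w * x)."""
    return activation(w0 + w * x)


def two_neurons(x, activation):
    """Chain two neurons; only the first one uses the given activation."""
    hidden = neuron(x, w=2.0, w0=1.0, activation=activation)
    return neuron(hidden, w=-3.0, w0=0.5, activation=identity)


def is_straight_line(f, xs=(-1.0, 0.0, 1.0)):
    """Check whether f has the same slope between equally spaced points."""
    a, b, c = (f(x) for x in xs)
    return math.isclose(b - a, c - b)


print(is_straight_line(lambda x: two_neurons(x, identity)))   # linear chain
print(is_straight_line(lambda x: two_neurons(x, math.tanh)))  # non-linear chain
```

<p>With the identity in place of an activation, the two neurons collapse into a single expression of the form $w_0 + xw$, so adding more of them gains nothing.</p>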

<p>The activation function appears in the same place in every neuron throughout the network, as we can see in the following figure:</p>

<img style="width: 550px;" src="img/activation_function.png"/>
