# A brief introduction to neural networks

This notebook is a very brief, simplified introduction to the basics of neural networks.  This notebook assumes you are familiar with linear algebra--primarily matrix multiplication, vectors, scalars, and the relationships between these--and calculus--primarily the concept of derivatives.  This notebook also assumes conceptual and mathematical familiarity with linear models, e.g., linear regression, logistic regression, and generalized linear models.

I'm trying to cram a _lot_ of information in here, because this is a huge topic. I'm going to have to leave out a _lot_ of stuff about the mathematics, intuitions, practical considerations, engineering, etc. of neural networks, otherwise we'd be here for a few weeks covering everything.

This set of notebooks will _not_ get you up and running with neural networks if you're starting from scratch.  You won't walk away from here knowing the details about how to build them and use them in practice.  Rather, my goal is to give you a solid foundation that you can easily build off of if you decide to pursue neural networks further.

# Linear models are special cases of neural networks

Every linear model you've ever used or read about--least squares regression, generalized linear models, hierarchical models, etc.--is a special case of a neural network.  Neural networks encompass far more exotic and interesting kinds of models than these, but we're going to work our way up to them by starting with basic linear models.

_Note:_ It's a bit misleading to say that "neural networks are generalized forms of linear models."  This is true, but it doesn't quite paint the whole picture.  Neural networks are an _extremely_ general mathematical formulation, and linear models are an _extremely_ specialized case.  Neural networks encompass linear models, but also many other kinds of models that bear no meaningful resemblance to linear models.  It's kind of like the relationship between _squares_ and _shapes:_ all squares are shapes, but there are shapes that look absolutely nothing like squares.

Let's start with a simple linear regression on two variables, and work our way up to neural networks:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2
$$

We can re-write this as a vector dot product, and the result will be mathematically identical:

$$
y = \begin{bmatrix}\beta_1 & \beta_2\end{bmatrix}\begin{bmatrix}x_1 \\ x_2\end{bmatrix} + \beta_0
$$

It's very easy to generalize this to any number of features, by just writing out general vectors and not specifying their exact size:

$$
y = \vec{\beta}\cdot\vec{x} + \beta_0
$$
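
As a minimal sketch of the two forms above, assuming NumPy and some made-up coefficient values (the numbers and variable names here are purely illustrative), the written-out sum and the dot product give the same prediction:

```python
import numpy as np

# Hypothetical coefficients and features, chosen only for illustration.
beta_0 = 0.5                      # intercept
beta = np.array([2.0, -1.0])      # [beta_1, beta_2]
x = np.array([3.0, 4.0])          # [x_1, x_2]

# The written-out form: y = beta_0 + beta_1 * x_1 + beta_2 * x_2
y_sum = beta_0 + beta[0] * x[0] + beta[1] * x[1]

# The dot-product form: y = beta . x + beta_0
y_dot = beta @ x + beta_0

print(y_sum, y_dot)  # both are 2.5
```

Adding more features only means making `beta` and `x` longer; the dot-product line does not change.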

## All linear models are matrices

Let's remember something critical from linear algebra: scalars (single numbers) and vectors are both special cases of matrices.  (Technically matrices are special cases of tensors, but we're not worrying about that--tensors are too much to get into today).  So:
- A matrix is a two-dimensional grid of numbers, organized into rows and columns.
- A vector (list of numbers) is a special case of a matrix: one where there's one row but multiple columns, _or_ one column but multiple rows.
- A scalar (single number) is a special case of a matrix where there's only one column _and_ one row (or equivalently: a vector with only one entry).

Since we're already dealing with vectors and dot products, we're firmly within the realm of linear algebra.  So let's replace our vectors in the above equation with matrices, and see if that leads us to any new and interesting insights:

$$
\mathrm{\textbf{Y}} = \mathrm{\textbf{AX}} + \mathrm{\textbf{B}}
$$

Where $\mathrm{\textbf{A}}$ is now our _weight matrix_ and $\mathrm{\textbf{B}}$ is called a _bias matrix_.  ("Bias" in this case is a technical term that just means "a constant offset;" the intercept in a linear regression is a bias in this sense.  The bias matrix generalizes this notion).  Typically, $\mathrm{\textbf{A}}$ is the only non-vector; $\mathrm{\textbf{X}}$ is usually a vector of features, meaning $\mathrm{\textbf{AX}}$ gives a vector; $\mathrm{\textbf{B}}$ and $\mathrm{\textbf{Y}}$ are thus typically vectors as well.  In principle, though, all of these could be matrices, or even multi-dimensional arrays, of arbitrary sizes.  It just tends to make the most sense if we work mostly in vectors, with the exception of $\mathrm{\textbf{A}}$.

Re-writing a linear model like this makes it easy to see a few things that will be very helpful later on:

1. If our weights are a matrix with multiple rows, then our output $\mathrm{\textbf{Y}}$ could have multiple entries (one per row of $\mathrm{\textbf{A}}$).
2. Since $\mathrm{\textbf{Y}}$ and $\mathrm{\textbf{X}}$ are both matrices, we could feed $\mathrm{\textbf{Y}}$ into another round of matrix multiplication.  In fact, we could do this a theoretically infinite number of times.  (Every time we do this, we call it a _layer_).  It would look something like this ($\mathrm{\textbf{Y}}_n$ is the final output):

$$
\begin{align}
\mathrm{\textbf{Y}}_1 &= \mathrm{\textbf{A}}_1\mathrm{\textbf{X}}   + \mathrm{\textbf{B}}_1\\
\mathrm{\textbf{Y}}_2 &= \mathrm{\textbf{A}}_2\mathrm{\textbf{Y}}_1 + \mathrm{\textbf{B}}_2\\
\dots\\
\mathrm{\textbf{Y}}_n &= \mathrm{\textbf{A}}_n\mathrm{\textbf{Y}}_{n-1}   + \mathrm{\textbf{B}}_n
\end{align}
$$

3. If $\mathrm{\textbf{Y}}$ has multiple entries, and we're doing a regression problem, we kind of have to do step 2 at least one more time, since we do want our final answer to be a scalar.
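
To make the shapes concrete, here is a small sketch in NumPy.  The sizes (3 features, 4 intermediate values, 1 output) and the random weights are arbitrary assumptions, chosen only to show how the output of one layer feeds the next:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=3)            # input vector with 3 features

# Layer 1: a 4x3 weight matrix maps 3 inputs to 4 intermediate values.
A1 = rng.normal(size=(4, 3))
B1 = rng.normal(size=4)
y1 = A1 @ x + B1

# Layer 2: a 1x4 weight matrix maps those 4 values down to one output.
A2 = rng.normal(size=(1, 4))
B2 = rng.normal(size=1)
y2 = A2 @ y1 + B2

print(y1.shape, y2.shape)  # (4,) (1,)
```

The number of rows in each weight matrix sets how many entries that layer's output has, which is why the last layer of a regression network has a single row.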

This opens us up to some interesting new ideas.  The neural network we described "transforms" our input features into some arbitrary number of intermediate representations, before eventually giving us our final output.  By contrast, the linear models you probably know and love will map directly from data to output, with _no_ intermediate transformations.  This is not too unprecedented, at least in broad strokes.  You might do (or have already done) a project where you do something like this:
1. Give your subjects a survey.
2. Calculate some measure or score from their responses, e.g., use it to assign each subject a value for some particular psychological construct.
3. Use those resulting scores as features in a regression.

Or:
1. Get some data.
2. Run PCA or another dimensionality reduction algorithm on it.
3. Use the transformed results as inputs to a regression.

A neural network is doing something very similar when there are multiple layers.  Each layer is potentially doing a mix of these things: learning new features based on old ones, and learning new projections/representations of the data.  And, the layers can do both of these (and some other things) at the same time!  Essentially, any transformation that can be represented exactly or approximately by matrix multiplication is something the layer might learn.  (E.g.: PCA ultimately boils down to matrix multiplication).

## From linear to non-linear: activation functions

There is only one thing missing from our formulation above.  A series of matrix multiplications and additions is great, but it's fundamentally just a big linear model, and _almost no data is truly linear._  So we haven't built a very general model if it can't deal with most kinds of data.  To allow a neural network to learn non-linear relationships, we introduce _activation functions,_ which we use to transform the outputs of each individual layer.  The Greek letter $\sigma$ is conventionally used to represent an activation function.  So, our final formulation of a single neural network layer is:

$$
\mathrm{\textbf{Y}}_n = \sigma\left(\mathrm{\textbf{A}}_n\mathrm{\textbf{X}} + \mathrm{\textbf{B}}_n\right)
$$

This turns out to be the same formulation as a Generalized Linear Model, with $\sigma$ in the above equation playing the role of the inverse link function.  But where in GLMs that function is almost always one of a few standard choices tied to an exponential-family distribution for the response, in neural networks $\sigma$ can be almost any function.
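
The claim that stacked layers without activation functions are still just one big linear model can be checked numerically.  This is a sketch with arbitrary random matrices (the shapes and names are assumptions for illustration): two purely linear layers collapse into a single weight matrix and bias, while putting a non-linear function between them (here, clipping negative values to zero) generally breaks that shortcut.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)
A1, B1 = rng.normal(size=(4, 3)), rng.normal(size=4)
A2, B2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Two linear layers applied one after the other.
two_layers = A2 @ (A1 @ x + B1) + B2

# The same computation as a single layer: A = A2 A1, B = A2 B1 + B2.
A_combined = A2 @ A1
B_combined = A2 @ B1 + B2
one_layer = A_combined @ x + B_combined

print(np.allclose(two_layers, one_layer))  # True

# With a non-linearity between the layers, the single-layer shortcut
# no longer applies in general.
with_nonlinearity = A2 @ np.maximum(A1 @ x + B1, 0) + B2
print(np.allclose(with_nonlinearity, one_layer))
```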

A typical neural network consists of multiple "chained" instances of these layers.  The activation function is applied _element-wise_ to the vector we get after matrix multiplication.  Some common activation functions:

- The Identity function: $\sigma(x) = x$.  Rarely used in practice, but often worth trying when you're developing a network.
- Rectified Linear Unit (ReLU): $\sigma(x) = \max(x, 0)$.  Sets all negative values to zero.  This is very common, and is usually a very sensible default.
- Hyperbolic tangent: $\sigma(x) = \tanh(x)$.  Fairly common, slower than ReLU, but allows learning extremely non-linear relationships very quickly.  This is in the family of *sigmoid functions.*
- Logistic function: $\sigma(x) = \frac{1}{1 + e^{-x}}$.  This is another sigmoid function that's a bit faster to compute than $\tanh(x)$, but it's often interchangeable.
    - Yes, this _is_ the same "logistic" as in "logistic regression!"  However, using this as your activation function doesn't mean you're building a neural network parallel to a classical logistic regression.  To do that, you need to use the logistic _loss function,_ which we'll talk about later.
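
Each of the activation functions listed above is a one-liner in NumPy.  The sketch below (the function names are my own; `np.tanh` is NumPy's built-in) applies each of them element-wise to a small example vector:

```python
import numpy as np

def identity(x):
    return x

def relu(x):
    # Element-wise max(x, 0): negative entries become zero.
    return np.maximum(x, 0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 2.0])
print(identity(z))
print(relu(z))       # [0. 0. 2.]
print(np.tanh(z))    # values squashed into (-1, 1)
print(logistic(z))   # values squashed into (0, 1)
```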

In principle, there's no reason each layer couldn't use a totally different activation function.  In practice, you'll usually see the same activation function used for all layers except the very last one (the output layer), and sometimes the very first one (the input layer).
    
So our more complete picture of a (still very, very basic) neural network is as follows.  I'm going to assume that our inputs and intermediate layer outputs are all vectors, since this is the most common situation in practice:

$$
\vec{x} = \begin{bmatrix} x_0 & x_1 & x_2 & \dots & x_n\end{bmatrix}\\
\vec{y_1} = \sigma_1\left(\mathrm{\textbf{A}}_1\vec{x} + \vec{\beta_1}\right)\\
\vec{y_2} = \sigma_2\left(\mathrm{\textbf{A}}_2\vec{y_1} + \vec{\beta_2}\right)\\
\dots\\
\vec{y_n} = \sigma_n\left(\mathrm{\textbf{A}}_n\vec{y_{n-1}} + \vec{\beta_n}\right)
$$

## A small example with numbers

Let's build an extremely simple network with two layers.  We'll feed a single data point through it and do the math by hand.  Here are the components to our network:

$$
\begin{align}
\vec{x} &= \begin{bmatrix}2 \\ 3\end{bmatrix}\\
\mathrm{\textbf{A}}_1 &= \begin{bmatrix}1 & -2 \\ -3 & 4\end{bmatrix}\\
\mathrm{\textbf{A}}_2 &= \begin{bmatrix}-1 & 0.5\end{bmatrix}\\
\vec{\beta_1} &= \begin{bmatrix}0 \\ 1\end{bmatrix}\\
\vec{\beta_2} &= \begin{bmatrix}-2\end{bmatrix}\\
\sigma_1 &= \mathrm{ReLU}\\
\sigma_2 &= \mathrm{Logistic}\\
\end{align}
$$

So, our network--written out layer by layer--looks like this.  For the first layer:

$$
\begin{align}
\vec{y_1} &= \mathrm{ReLU}\left(\mathrm{\textbf{A}}_1\vec{x} + \vec{\beta_1}\right)\\
&= \mathrm{ReLU}\left(\begin{bmatrix}1 & -2 \\ -3 & 4\end{bmatrix}\begin{bmatrix}2 \\ 3\end{bmatrix} + \begin{bmatrix}0 \\ 1\end{bmatrix}\right)\\
&= \mathrm{ReLU}\left(\begin{bmatrix}1\times2 + -2\times3 \\ -3\times2 + 4\times3\end{bmatrix} + \begin{bmatrix}0 \\ 1\end{bmatrix}\right)\\
&= \mathrm{ReLU}\left(\begin{bmatrix}-4 \\ 6\end{bmatrix} + \begin{bmatrix}0 \\ 1\end{bmatrix}\right)\\
&= \mathrm{ReLU}\left(\begin{bmatrix}-4 \\ 7\end{bmatrix}\right)\\
&= \begin{bmatrix}\max(0, -4) \\ \max(0, 7)\end{bmatrix}\\
&= \begin{bmatrix}0 \\ 7\end{bmatrix}
\end{align}
$$

And then for the output of our second layer:

$$
\begin{align}
\vec{y_2} &= \mathrm{Logistic}\left(\mathrm{\textbf{A}}_2\vec{y_1} + \vec{\beta_2}\right)\\
&= \mathrm{Logistic}\left(\begin{bmatrix}-1 & 0.5\end{bmatrix}\begin{bmatrix}0 \\ 7\end{bmatrix} + \begin{bmatrix}-2\end{bmatrix}\right)\\
&= \mathrm{Logistic}\left(\begin{bmatrix}-1\times0 + 0.5\times7\end{bmatrix} + \begin{bmatrix}-2\end{bmatrix}\right)\\
&= \mathrm{Logistic}\left(\begin{bmatrix}3.5\end{bmatrix} + \begin{bmatrix}-2\end{bmatrix}\right)\\
&= \mathrm{Logistic}\left(\begin{bmatrix}1.5\end{bmatrix}\right)\\
&= \begin{bmatrix}\frac{1}{1+e^{-1.5}}\end{bmatrix}\\
\\&\text{We can remove the brackets and vector symbol, since a matrix/vector with one entry is a scalar.}\\
y_2 &= \frac{1}{1+e^{-1.5}}\\
&\approx 0.818
\end{align}
$$

Fortunately, computers are absurdly fast at doing this kind of tedious math for us.  So you will _never_ do this sort of thing by hand, ever.
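
For comparison, here is the same two-layer calculation as a short NumPy sketch.  The weights, biases, and input are the ones defined above; the helper names `relu` and `logistic` are my own, and the second layer applies the logistic function as in the derivation:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 3.0])
A1 = np.array([[1.0, -2.0], [-3.0, 4.0]])
b1 = np.array([0.0, 1.0])
A2 = np.array([[-1.0, 0.5]])
b2 = np.array([-2.0])

y1 = relu(A1 @ x + b1)        # first layer: ReLU([-4, 7]) = [0, 7]
y2 = logistic(A2 @ y1 + b2)   # second layer: logistic(1.5)

print(y1)
print(y2)
```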

## But how do we train a neural network?

We train a neural network by using calculus.  Specifically, a technique called _gradient descent,_ which is an extremely general approach to _numeric optimization._  Basically, if we have a value that we want to make as small (or big) as possible, but we can't find an explicit formula for the right answer, we can often use a gradient descent method.  Essentially: we look at our current guess, take the derivative of the value we're optimizing with respect to that guess, and use the derivative to update the guess.  A close cousin of this idea, which you've probably run across if you've taken a calculus course, is Newton's Method for finding the roots of functions:

1. Start with a function, $f(x)$, that you want to find the roots (zeroes) of.
2. Pick a random starting value for $x$; call it $x_0$.  Chances are, $x_0$ is pretty far off from one of the roots.
3. To get closer to one of the roots of $f(x)$, "update" your guess to get the next, better guess, $x_1$: $x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}$ (where $f'(x)$ is the derivative of $f(x)$).
4. Repeat step 3 until the difference between one guess and the next gets really small.  (how small is up to you).
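
The four steps above can be sketched in a few lines of plain Python.  The function name `newton`, the tolerance, and the iteration cap are my own choices for illustration, and the example function $f(x) = x^2 - 2$ is just a convenient one whose positive root is $\sqrt{2}$:

```python
def newton(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Find a root of f starting from x0 using Newton's Method."""
    x = x0
    for _ in range(max_iter):
        x_next = x - f(x) / f_prime(x)   # the update from step 3
        if abs(x_next - x) < tol:        # step 4: stop when guesses barely change
            return x_next
        x = x_next
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)  # approximately 1.41421356
```

Note that this sketch would divide by zero if a guess ever landed exactly where $f'(x) = 0$; a more careful version would guard against that.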

It turns out that for any neural network, we can take the derivative of the prediction with respect to the weights and biases of every layer.  I'll leave it as an exercise for you to do yourself, but here are the steps to do that:
1. Replace all the numbers with variable names.  E.g.: your inputs $\vec{x}$ become $[x_1, x_2]$ rather than $[2, 3]$, and similarly for the weight and bias matrices.
2. Replace the activation functions with the identity function (just to make the math simpler--the same principle will apply to any activation function, but the math gets a lot messier).
3. Write out an explicit form for $y_2$ in terms of all the weights and biases and inputs from both layers.  You should get something that looks like a polynomial.
4. Take the derivative of this resulting equation with respect to one of the weights in the first layer, or one of the biases.  
    - You _could_ take the derivative with respect to the input values--there's no mathematical reason why not--but there isn't much point in doing that for a neural network.  We're taking the derivative so we can update a value; it doesn't make sense for us to edit our original data to make our network classify the original data better.
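
Once you've worked out a derivative by hand, one way to check it is to compare it against a numerical estimate.  The sketch below does this for a single weight, reusing the numbers from the worked example with identity activations (as in step 2).  The formula in the comment is what the chain rule gives for the top-left weight of the first layer; treat it as something to verify against your own derivation rather than as the full answer to the exercise.

```python
import numpy as np

x = np.array([2.0, 3.0])
A1 = np.array([[1.0, -2.0], [-3.0, 4.0]])
b1 = np.array([0.0, 1.0])
A2 = np.array([[-1.0, 0.5]])
b2 = np.array([-2.0])

def predict(A1_value):
    # Two layers with identity activations.
    return (A2 @ (A1_value @ x + b1) + b2)[0]

# Analytic derivative of y_2 with respect to the top-left weight of A1:
# d y_2 / d A1[0, 0] = A2[0, 0] * x[0]
analytic = A2[0, 0] * x[0]

# Numerical check: nudge that one weight and see how the output moves.
eps = 1e-6
A1_nudged = A1.copy()
A1_nudged[0, 0] += eps
numeric = (predict(A1_nudged) - predict(A1)) / eps

print(analytic, numeric)  # both close to -2.0
```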

# Objective functions: the things we want to make smaller

Let's add the final piece to our neural network puzzle.  Since we're using neural networks for prediction tasks, we ultimately want to make sure that their predictions line up with reality.  But, as we saw with scikit-learn, "how well they match up" is a question that can have many different answers for the same set of predictions.  For a regression model, we could measure:
1. Absolute error
1. Squared error
1. Log error
1. Brier score
1. Pinball loss (this is a generalization of mean absolute error)
1. Maximum error (this is very rarely used, but it tells us what the worst-case performance is for our model)
1. $R^2$ score

And so on, ad nauseam.  Really, any function that takes in two vectors and returns a scalar is a thing we _can_ use as a metric function, but the vast majority of these functions won't be of any real-world use or interest.  Once we've picked a metric, we want to build a model that does as well as possible on it.  Or, in terms of numeric optimization: we want to build a model that optimizes our chosen metric on our dataset.

Like with performance metrics, almost any function can be a loss function.  It just has to satisfy a few criteria:
- It should take a predicted and a ground-truth value as arguments, and return a single number that measures some notion of accuracy.
- It should be differentiable.  I.e., it should have a well-defined derivative.
- It should have one, and only one, global extremum (a minimum or maximum value).

Here is a _very_ small sampling of common loss functions:
- Squared error.  This is the same loss as ordinary least squares regression.
- Absolute error.  This is much more robust to outliers than squared error.  In simple linear models, this will predict the median, rather than mean, value.
- Pinball loss.  This generalizes absolute error, and allows the model to predict any arbitrary quantile of the target, not just the median.
- Log loss.  This is the same loss used by a standard logistic regression.
- Cross-entropy.  This compares the similarity of two discrete probability distributions, and it's often used for classification models.  (Classification models usually output a vector, with one entry for each possible class; whatever entry has the highest value is the class that gets predicted).
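
Several of these losses are short enough to write out directly.  The following is a sketch using the usual textbook definitions, averaged over observations; the function names, the `quantile` argument, and the small `eps` clipping in the log loss (to avoid taking the log of zero) are my own choices for illustration:

```python
import numpy as np

def squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def pinball_loss(y_true, y_pred, quantile=0.5):
    # Under-predictions are weighted by `quantile`, over-predictions by 1 - quantile.
    diff = y_true - y_pred
    return np.mean(np.maximum(quantile * diff, (quantile - 1) * diff))

def log_loss(y_true, p_pred, eps=1e-12):
    # y_true holds 0/1 labels, p_pred holds predicted probabilities of class 1.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(squared_error(y_true, y_pred))    # (0.25 + 0 + 1) / 3
print(absolute_error(y_true, y_pred))   # (0.5 + 0 + 1) / 3 = 0.5
print(pinball_loss(y_true, y_pred))     # 0.25: half the absolute error at quantile 0.5
```

All of them take a vector of true values and a vector of predictions and return a single number, which is the first criterion in the list above.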

A quick note on terminology: an _objective function_ is any function we want to either minimize or maximize.  A _loss function_ is any function we want to specifically _minimize_ (because we want to "minimize our losses").  In practice, there is no meaningful distinction here--any "higher is better" function can be turned into a loss function by multiplying its output by $-1$--so I'll use the two interchangeably.  "Loss function" is the more common term, though, since usually we have something we want our network to minimize.

In more formal math terms: we can write our neural network as $M(\theta, x)$.  $\theta$ represents all the model's parameters (weights and biases), and $x$ represents our inputs.  Our objective function will be $L(y, \hat{y})$, where $y$ is the ground-truth labels, and $\hat{y}$ is what our model predicted.  Then, the thing we actually want our neural network to do can be written very compactly:

$$
\mathrm{Solve}\quad\underset{\theta}{\operatorname{argmin}} L(y, M(\theta, x))
$$

Or, in plainer English: to train our network, we want to find the parameters (weights and biases) that minimize the distance between our predictions and the true labels, where "distance" is defined by our loss function $L$.

To do this, we follow a procedure very much like Newton's Method.  Start with random guesses for our parameters, then feed some data through.  Calculate the derivative of our loss function with respect to our parameters, given the data we just passed through the network, and use that to update all of our parameters.  Keep repeating this until we hit some _stopping criterion._
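
Here is a minimal sketch of that loop for the simplest possible "network": a single weight and a single bias, trained with squared error on synthetic data.  Everything specific here is an assumption for illustration: the data is generated from a known line, the parameters start at zero rather than at random values, the step size (`learning_rate`) is picked by hand, and the stopping criterion is just a fixed number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known line, y = 3x + 1, plus a little noise.
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0          # starting guesses for the parameters
learning_rate = 0.1      # step size; a value chosen by hand here

for step in range(500):
    y_hat = w * x + b                      # feed the data through the model
    error = y_hat - y
    grad_w = 2 * np.mean(error * x)        # d(mean squared error) / dw
    grad_b = 2 * np.mean(error)            # d(mean squared error) / db
    w -= learning_rate * grad_w            # move against the gradient
    b -= learning_rate * grad_b

print(w, b)  # should land close to 3 and 1
```

Unlike Newton's Method, the update here does not divide by a derivative; it takes a small step in the downhill direction, scaled by the learning rate.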

(Yes.  I am leaving out a _lot_ of details about the implementation of all of this, in favor of very, very high-level conceptual descriptions).

# Stopping criteria

Since neural networks, like Newton's Method, don't have a natural and obvious "stopping point," we have to define what makes a good stopping point.  There are generally two ways we do this:

1. _Convergence,_ often called _early stopping_ in the neural network world.  We set aside part of our data to use like a testing split (it doesn't get used for training), and every so often, we check how well our model is doing on this subset of the data.  When the performance hasn't improved for a while, we'll say the network is done.  (How much it needs to improve, and how long we'll give it to improve, are things you have to decide on).
2. _Compute budget._  You might set an explicit "budget" for how much compute time you'll allow the model to train.  E.g.: you only let it train for 100 _epochs_ (full passes over the dataset).  After that, you call it done.  Or, less commonly: the model only gets to train for so many minutes/hours/days before you call it done.

Usually, convergence is used, even for extremely long-running models.  Or at least, convergence is _checked._  If you stop your model after 100 epochs, it might not be done learning from your data (yes, it can sometimes take _thousands_ of passes over the dataset for the model to be "done learning").  But if you check how it's been doing on that held-out subset of your data, you can at least reason about how close to "done" it might be.
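
Both criteria can be combined in one training loop.  The sketch below uses the same kind of toy one-weight model and synthetic data as a stand-in for a real network; the names `patience` and `min_improvement`, and all of the specific values, are assumptions chosen for illustration.  Each pass uses the whole training split, so one update here is one epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=300)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=300)

# Hold out the last 100 points; they are never used to compute gradients.
x_train, y_train = x[:200], y[:200]
x_val, y_val = x[200:], y[200:]

w, b = 0.0, 0.0
learning_rate = 0.1
max_epochs = 1000        # compute budget: a hard cap on training
patience = 10            # how many epochs without improvement we tolerate
min_improvement = 1e-6   # how much better counts as "improved"

best_val_loss = np.inf
epochs_without_improvement = 0

for epoch in range(max_epochs):
    error = w * x_train + b - y_train
    w -= learning_rate * 2 * np.mean(error * x_train)
    b -= learning_rate * 2 * np.mean(error)

    # Check performance on the held-out data after every epoch.
    val_loss = np.mean((w * x_val + b - y_val) ** 2)
    if val_loss < best_val_loss - min_improvement:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        break  # convergence: validation loss has stalled

print(epoch + 1, best_val_loss)
```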

# Some miscellaneous notes

## Neural Networks versus linear models

Neural networks are the most general form of a linear model that we currently have.  This means:
- Every ordinary least squares regression is just a special case of a neural network.  (Specifically: one with a single matrix multiplication, using the squared error loss function, and no activation function).
- Every logistic regression is just a special case of a neural network.  (Specifically: one with a single matrix multiplication, using the log loss objective function, and a sigmoidal/logistic activation function).
- Etc.

However: this does not mean neural networks can, or should, replace other linear models.  There are lots of benefits to the special cases.  Namely, you get a lot of cool mathematical shortcuts for free!  (Unless you're a logistic regression; those are pretty much always solved numerically, similar to neural networks).  For example: if the assumptions of an ordinary least squares regression are met, then there's only a single equation you need to solve to get to the _exact_ best model parameters that will minimize the squared errors.  You only need to do one step to get to the right answer, whereas gradient descent and neural networks require many steps, and "when to stop" is always a choice that you have to make.  These special cases can also let you calculate things like p-values and confidence intervals, which you often cannot calculate from gradient descent methods.
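
As a sketch of that one-step shortcut, NumPy's `np.linalg.lstsq` solves a least squares problem directly.  The synthetic data below is an assumption for illustration (a known line plus noise); the point is that there is no loop, no learning rate, and no stopping criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# One call solves the least squares problem directly.
coefficients, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coefficients

print(intercept, slope)  # close to 1 and 3
```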

But there are benefits to the more general solution.  If you swap out your loss function--switching from squared to absolute error, for example--you don't need to change anything about how your model is trained.  The exact same math and code steps will apply with no modification.  This also holds true if you change the size or internal configurations of your network, like adding more layers, making intermediate vectors larger or smaller, changing activation functions, etc.  So you get extra flexibility, at the cost of speed and specialization.

## Neural networks versus other machine learning models

In most cases, you do not need a neural network.

Most of your problems can be solved by better feature engineering and parameter tuning with other, simpler models.  E.g.: I find that a random forest can always get within spitting distance of a neural network for most tasks that have reasonable density (i.e. not a lot of zero/missing values), are tabular data problems (as opposed to images/text), and are either pretty straightforward regression or classification problems with reasonably balanced classes.  That may sound like a lot of special cases, but that ends up being most of the problems I work on.  The added compute time and model complexity of neural networks is often not worth the tradeoff.

However, there are some problems where neural networks are an extremely natural fit.  Image processing comes to mind: convolutional neural networks (a kind we will not be covering) are empirically better and more flexible than almost all previous models, in basically every way.  Extremely large-scale problems also come to mind: since a neural network is fit _incrementally_, only looking at a small set of observations at a time, you don't need to have all that data loaded into RAM at once.  You can do the gradient descent step on one chunk of data, then discard it, and do something like query a database to get the next chunk of observations.  Many other machine learning models require your data to be entirely in-memory (though there are many that do not require this).

Neural networks also excel on large datasets that are either extremely sparse, or have extremely high-order interaction terms between features, or where it's not obvious what a good set of features is.  Things like language (a notoriously sparse kind of data), or audio data, or clickstream data are all good examples.

## Incremental learning and massive datasets

Neural networks excel in one area that almost every other model falls flat on: datasets of absurd size.  Meaning, tens or hundreds of millions of rows, and thousands or tens of thousands of columns.  Datasets like this will give you all sorts of problems if you try running even just a linear regression on them.
1. The data is too big.  Linear regressions usually want the data to be all in-memory, as do many other kinds of models.  Your computer will probably crash.
2. The data is all but guaranteed to be _extremely_ nonlinear, with that much of it and that many features.  A linear regression is just too simplistic for such complicated data.
3. Even if the data is linear: a linear regression will "saturate" pretty quickly (i.e.: it'll learn everything it's gonna learn), and you can't tweak it to make it learn more from the data.  The "ceiling" for what linear models can learn is extremely low.

These are exactly the areas where neural networks excel.  Because of how gradient descent is done--we only use a few observations at a time, not necessarily the whole dataset--we can have our full dataset living in a database or a file somewhere.  We only need to load up a few observations at a time to feed through the network, so we don't have to incur the huge RAM cost of loading our entire dataset into memory at once.  (I should note: there are many models that can do this; it isn't unique to neural networks.  You can, in fact, train a simple least squares regression using gradient descent, but that's rarely done in practice).
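
The sketch below illustrates the idea with a toy one-weight model.  The generator `batches` is a hypothetical stand-in for whatever would actually fetch the next chunk of observations from a database or file; here it just fabricates synthetic data from a known line.  Only one small batch exists in memory at any time.

```python
import numpy as np

def batches(n_batches, batch_size, seed=0):
    """Stand-in for reading chunks from a database or file, one at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        x = rng.uniform(-1, 1, size=batch_size)
        y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=batch_size)
        yield x, y   # only this chunk is in memory

w, b = 0.0, 0.0
learning_rate = 0.1

for x_batch, y_batch in batches(n_batches=2000, batch_size=32):
    error = w * x_batch + b - y_batch
    w -= learning_rate * 2 * np.mean(error * x_batch)
    b -= learning_rate * 2 * np.mean(error)
    # x_batch and y_batch can now be discarded before the next chunk loads

print(w, b)  # should hover near 3 and 1
```

Because each update sees only a small sample, the parameters jitter slightly around the best values instead of settling exactly on them.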

A neural network can also learn very complex relationships with just a few layers, and you can keep tweaking the number and size and activation functions of the layers to give it more "space" to store information.  So if your neural network isn't doing too well, you can always try making it bigger!

Since neural networks are just repeated application of matrix multiplication and activation functions, they also implicitly learn very high-order interactions between your input features.  A network with a few layers can easily learn fifth, sixth, and higher-order interaction terms--but it does this _implicitly,_ and via gradient descent, it also learns which interaction terms are important and which aren't.

## Some things we will not cover

The kind of neural network we've described above is called a _multi-layer perceptron_ (there are some historical reasons for this name that I won't be going into).  But there are many, many more kinds of networks.  There are also many more configurations and methodologies and tools and topics that we simply will not be covering, because the topics are just too big.  Here are some of the things we won't cover:

- Different neural architectures.
    - Recurrent
    - Convolutional
    - Transformers
    - Bidirectional layers
    - Embedding layers
    - Autoencoders
    - Hopfield networks
- How to pick a configuration of parameters:
    - Cross-validation/tuning for neural networks
    - How to reason about the number, sizes, and types of layers
    - How to reason about good loss functions
    - Differences between activation functions
    - The effect of batch size and learning rate
    - Differences between optimizers
- Optimizations
    - Making them run efficiently
    - Quantization
    - Pruning
    - Learning rate schedules
- Interpretations of neural networks:
    - What are they actually learning?
    - Fragility
    - Debugging
    - Interpretability of the predictions/what the model has learned
- Applications
    - Neural networks can be used everywhere a normal model can be used; but where they _should_ be used (versus other models) is a discussion for another time.
- Libraries and tools
    - We won't be going very deep into Keras or PyTorch; we'll only scratch the surface.
    - We won't talk about TensorFlow, Caffe, or other libraries.
    - We won't talk about how to pick a good neural network library; there are lots out there, and they all have their pros and cons, but we will not be discussing them here.
    
And even more stuff not listed.

If this is your first exposure to neural networks, you will not walk away with enough of an understanding to start using them productively and responsibly, outside of some pretty trivial networks.  (The kinds of networks that are so simple that a random forest can do at least as well as them, while running faster).  But, you _should_ walk away with enough of an understanding that you can start going further on your own.