## A Brief Introduction to Multilayer Perceptrons
A perceptron is a single-neuron model that was a precursor to larger neural networks. A perceptron receives a one-dimensional array of inputs and produces an output that can be used for a variety of ML tasks, the most fundamental of which is classification.

The building blocks of neural networks are artificial neurons. These are simple computational units that have weighted input signals and produce an output signal using an activation function.

The weights attached to input signals can be likened to the coefficients of a regression equation. These weights represent the level of importance of each input. Consequently, changes in some inputs will have a larger effect on the output of the perceptron than changes in others. Each neuron also has a bias, a scalar term added to the weighted sum that shifts it by a set amount, positive or negative.
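
As a minimal sketch of the weighted sum and bias described above, the following uses plain Python. The function name `neuron_output`, the example values, and the choice of a sigmoid activation are assumptions made for illustration only.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values: the second input carries the larger weight,
# so a change in it moves the output more than the same change in the first.
inputs = [1.0, 1.0]
weights = [0.1, 0.9]
bias = -0.5

activation = neuron_output(inputs, weights, bias)
```

Raising the second input from 1.0 to 2.0 shifts the output further than raising the first input by the same amount, which is what "level of importance" means in practice.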

Weights are often initialized to small random values, although more complex initialization schemes can be used. Larger weights tend to increase the complexity and fragility of the model, so it is desirable to keep weights small. Regularization techniques can be used to keep them small and to decrease the risk of overfitting.
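
A rough sketch of both ideas follows. The helper names `init_weights` and `l2_penalty`, the uniform range, and the penalty strength `lam` are illustrative assumptions; an L2 penalty is only one of several regularization techniques.

```python
import random

def init_weights(n_inputs, scale=0.05, seed=0):
    """Draw small random starting weights from [-scale, scale]."""
    rng = random.Random(seed)
    return [rng.uniform(-scale, scale) for _ in range(n_inputs)]

def l2_penalty(weights, lam=0.01):
    """L2 regularization term: grows with the squared size of the weights."""
    return lam * sum(w * w for w in weights)

initial_weights = init_weights(8)
penalty = l2_penalty(initial_weights)
```

Adding `penalty` to the training error makes large weights more costly, which nudges the optimizer toward smaller ones.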

### Activation

The weighted inputs are summed and passed through an **activation function**, sometimes called a **transfer function**. It is called an activation function because it governs the threshold at which the neuron is "activated" and the strength of the output signal. Different functions shape the output in different ways: the sigmoid function bounds values between 0 and 1, the hyperbolic tangent function (tanh) bounds values between -1 and 1, and ReLU (the rectified linear unit) sets negative values to 0 and passes positive values through unchanged. ReLU has generally been found to provide better results, and this may be due to its mimicry of the all-or-nothing activation systems found in biological neurology.
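
The three activation functions named above can be sketched in a few lines of plain Python. These are textbook definitions written for illustration, not the implementation of any particular library.

```python
import math

def sigmoid(z):
    """Squash z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squash z into the open interval (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Zero for negative z, z itself otherwise (no upper bound)."""
    return max(0.0, z)

# Applying each function to the same summed input shows the different ranges.
summed_input = -2.0
outputs = {
    "sigmoid": sigmoid(summed_input),
    "tanh": tanh(summed_input),
    "relu": relu(summed_input),
}
```

Note that this simple `sigmoid` can overflow for very large negative inputs; numerical libraries typically guard against that.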

### Network of Neurons

Neurons are arranged into networks. A row of neurons is called a **layer**, and one network can have multiple layers. The architecture of the neurons in the network is often called the **network topology**.

### Input or Visible Layers

The bottom layer that takes input from a dataset is called the **visible layer**, because it is the exposed part of the network. Fundamentally, it is simply the series of initial inputs: one value per column of the dataset for a given observation.

### Hidden Layers
Layers after the input layer are called **hidden layers** because they are not directly exposed to the input. Each neuron in a hidden layer transforms the outputs of the previous layer through its weights, bias, and activation function.
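
To illustrate how a hidden layer transforms the values it receives, here is a small sketch. The function `layer_forward`, the tanh activation, and all numbers are hypothetical choices for the example.

```python
import math

def layer_forward(inputs, weights, biases):
    """Compute one layer's outputs; weights[j] holds the incoming weights of neuron j."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(math.tanh(z))
    return outputs

# Visible layer: one value per dataset column for a single observation.
visible = [0.5, -1.0, 2.0]

# Hidden layer with two neurons, each connected to all three inputs.
hidden_weights = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]
hidden_biases = [0.0, 0.1]

hidden = layer_forward(visible, hidden_weights, hidden_biases)
```

Stacking further calls to `layer_forward`, each fed with the previous layer's outputs, gives a multilayer network.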

### Output Layer

The final layer is called the **output layer** and is responsible for producing a value or vector of values in the format required for the problem. Here are some examples:
- A regression problem may have a single output neuron and the neuron may have no activation function.
- A binary classification problem may have a single output neuron and use a sigmoid activation function to output a value between 0 and 1 that represents the probability that the observation belongs to the primary class. This can be turned into a crisp class value by using a threshold of 0.5, snapping values less than the threshold to 0 and all other values to 1.
- A multiclass classification problem may have multiple neurons in the output layer, one for each class (e.g. three neurons for the three classes in the famous iris flowers classification problem). In this case, a softmax activation function may be used to output the probability of the network predicting each of the class values. Selecting the output with the highest probability produces a crisp class value.
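
The two classification cases above can be sketched as follows. The helper names `to_class`, `softmax`, and `predict_class` and the example scores are assumptions for illustration.

```python
import math

def to_class(probability, threshold=0.5):
    """Binary case: snap a sigmoid output to a crisp 0 or 1."""
    return 0 if probability < threshold else 1

def softmax(scores):
    """Multiclass case: turn raw output scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores):
    """Index of the output neuron with the highest probability."""
    probabilities = softmax(scores)
    return max(range(len(probabilities)), key=probabilities.__getitem__)

# Hypothetical raw scores from three output neurons (one per iris class).
scores = [2.0, 1.0, 0.1]
probabilities = softmax(scores)
predicted = predict_class(scores)
```

A regression output needs neither step: the raw value of the single output neuron is the prediction.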

## Training Networks
Once configured, the neural network must be trained on the dataset.

### Data Preparation
- Data must be numerical
    - Categorical data can be converted via *one-hot encoding*.
    - The same one-hot encoding can be used on the output variable in classification problems with more than two classes.
    - Text data will need to be converted via TF-IDF or some other NLP encoding technique.
- Data must be similarly scaled
    - Min-max scaling (or any other scaling transformation) can be applied to the dataset to ensure consistency between inputs.
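
A minimal sketch of the two preparation steps listed above, written in plain Python. The function names and sample values are illustrative; in practice a library such as scikit-learn would typically handle this.

```python
def one_hot(values):
    """Encode categorical values as 0/1 vectors, one position per category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [
        [1 if index[value] == i else 0 for i in range(len(categories))]
        for value in values
    ]

def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

encoded = one_hot(["red", "green", "red"])
scaled = min_max_scale([10, 20, 30])
```

The minimum and maximum should be computed on the training data only and then reused for any later data, so that all inputs share the same scale.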
    
### Stochastic Gradient Descent
The classical training algorithm for neural networks is stochastic gradient descent. A single observation is exposed to the network, and the output is generated (called a **forward pass**). The output is compared to the expected output and an error is calculated. The error is propagated back through the network, one layer at a time, and the weights are updated according to the amount they contributed to the error (known as **backpropagation**). One round of updating the network for the entire training dataset is called an **epoch**. A network may be trained for any number of epochs.
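
The loop described above can be sketched for the simplest possible case, a single sigmoid neuron. With only one neuron there are no hidden layers to propagate the error through, so backpropagation reduces to one gradient step per observation. The function `train_sgd`, the logical-OR toy dataset, and the hyperparameter values are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(data, epochs=2000, learning_rate=0.5, seed=0):
    """Train one sigmoid neuron with per-observation weight updates."""
    rng = random.Random(seed)
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):  # one epoch = one pass over the whole training set
        rows = data[:]
        rng.shuffle(rows)  # visit the observations in random order
        for inputs, target in rows:
            # Forward pass: generate the output for one observation.
            z = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = sigmoid(z)
            # Compare with the expected output.
            error = target - output
            # Update each weight in proportion to its input's contribution
            # (the gradient step for a sigmoid neuron with cross-entropy loss).
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Toy dataset: logical OR of two inputs.
or_data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
weights, bias = train_sgd(or_data)
predictions = [
    round(sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias))
    for inputs, _ in or_data
]
```

In a multilayer network the same error signal is passed backward layer by layer using the chain rule, which is the part this single-neuron sketch leaves out.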

### Weight Updates
The weights can be updated from the errors calculated after each forward pass for each observation. This is called **online learning**. Alternatively, the errors can be aggregated across all results of all observations in the training set. This is called **batch learning** and is often more stable. 
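
One way to picture the difference is through the number of observations consumed before each update. The helper `iter_batches` below is a hypothetical sketch: a batch size of 1 corresponds to online learning, and a batch size equal to the dataset length corresponds to batch learning.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of rows, each holding at most batch_size items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

observations = [1, 2, 3, 4, 5]

online = list(iter_batches(observations, 1))                      # 5 updates per epoch
full_batch = list(iter_batches(observations, len(observations)))  # 1 update per epoch
mini_batches = list(iter_batches(observations, 2))                # in between
```

In each case the errors inside one batch would be aggregated (for example averaged) before the weights are changed.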

Because datasets are often large, the batch is commonly reduced to a small subset of observations, and an update is completed after each subset rather than after the whole training set. The amount by which the weights are updated is controlled by a configuration parameter called the **learning rate**, also called the **step size**. It controls how quickly a local minimum is approached. Keeping it small reduces the risk of a convergence failure, where a local minimum cannot be reached because each update overshoots the true local minimum of the aggregated errors. Two further refinements to the update are common:
- **Momentum** is a term that carries over a fraction of the previous weight update, allowing the weights to continue to change in the same direction even when there is less error being calculated.
- **Learning Rate Decay** is used to decrease the learning rate over epochs, allowing the network to make large changes to the weights at the beginning and smaller fine-tuning changes later in the training schedule.
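
Both refinements can be sketched in a few lines. The time-based decay formula and the classical momentum update used here are common textbook forms chosen as assumptions; other schedules and variants exist, and the function names are hypothetical.

```python
def decayed_learning_rate(initial_rate, decay, epoch):
    """Time-based decay: the rate shrinks as the epoch number grows."""
    return initial_rate / (1.0 + decay * epoch)

def momentum_step(weight, gradient, velocity, learning_rate, momentum=0.9):
    """Return the updated (weight, velocity) pair for one weight."""
    # The velocity keeps a fraction of the previous update and adds the new step.
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity

# First update: a gradient of 2.0 pushes the weight down.
weight, velocity = momentum_step(1.0, 2.0, 0.0, learning_rate=0.1)

# Second update: the gradient is now zero, yet the weight keeps moving
# in the same direction because of the stored velocity.
weight_after, velocity_after = momentum_step(weight, 0.0, velocity, learning_rate=0.1)

rate_at_start = decayed_learning_rate(0.1, decay=1.0, epoch=0)
rate_later = decayed_learning_rate(0.1, decay=1.0, epoch=1)
```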

### Prediction
Predictions are made by providing the input to the network and performing a forward pass, which generates an output that you can use as a prediction. Evaluating performance on a separate out-of-sample dataset (ideally randomly split from the original full dataset available) can reveal overfitting.
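
A sketch of the random split and a simple accuracy score is shown below. The helper names, the 25% test fraction, and the sample predictions are assumptions for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Randomly split rows into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predictions, targets):
    """Fraction of predictions that match the expected class."""
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)

train_rows, test_rows = train_test_split(list(range(20)))

# Hypothetical predictions compared against held-out targets.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

A training accuracy that is much higher than the accuracy on `test_rows` is a typical sign of overfitting.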

The network topology and the final set of weights are all that need to persist from the research environment into the production environment to make novel predictions.
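
As one possible sketch, the topology and weights could be stored as JSON. The dictionary layout, the key names, and the numbers below are hypothetical; deep learning libraries provide their own save and load formats.

```python
import json

# Hypothetical trained model: 2 inputs, 2 hidden neurons, 1 output neuron.
model = {
    "topology": [2, 2, 1],
    "weights": [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]],
    "biases": [[0.0, 0.1], [-0.05]],
}

# Research environment: serialize the model to a string (or a file).
serialized = json.dumps(model)

# Production environment: restore it and use it for forward passes.
restored = json.loads(serialized)
```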

## Developing a Simple Neural Network from the Pima Indians Onset of Diabetes Dataset