# What is training? 

Training a neural network is the process of finding (or attempting to find) a set of weights that minimizes the loss, a measure of the difference between the predicted output and the actual output. We usually start with an arbitrary (typically random) set of weights and iterate over the training examples to gradually zero in on a solution. Before we can train a network, we need to build it. This involves choosing the following hyperparameters and components:
- **Number of layers**
- **Number of neurons** in each layer
- **Activation Function**: It is applied to the output of a neuron. More on this below. 
- **Loss Function**: The loss function decides how to calculate the difference or the "error" between the predicted output and the real output of the training samples. 
- **Training Algorithm** (Optimizer): The optimizer decides how the weights will be updated.
- **Batch Size**: The number of input samples processed before each weight update. Very small batches give noisy gradient estimates and make poor use of parallel hardware, so training can take a long time. Very large batches need more memory and, unless the learning rate is retuned, may converge to solutions that generalize worse.
- **Learning Rate**: A hyperparameter that controls how much the weights change in each update step. If the learning rate is too small, convergence is slow. If it is too large, the updates can overshoot the minimum and training may never converge.
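
To make the effect of the learning rate concrete, here is a minimal sketch that runs plain gradient descent on a toy loss `L(w) = w**2`. The loss, the starting weight `w0`, the step count, and the three learning rates are illustrative assumptions, not values from a real network.

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    """Minimize the toy loss L(w) = w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                 # dL/dw for L(w) = w**2
        w = w - learning_rate * grad   # step against the gradient
    return w

# Hypothetical learning rates chosen to show three regimes.
for lr in (0.001, 0.1, 1.1):
    w_final = gradient_descent(lr)
    print(f"learning_rate={lr}: w after 20 steps = {w_final:.4f}")
```

With the smallest rate the weight barely moves away from its starting value, with the middle rate it ends up close to the minimum at `w = 0`, and with the largest rate each step overshoots and the weight grows in magnitude instead of shrinking.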

# Training Process
The training of a neural network is an iterative process. Each iteration consists of two passes: a **forward pass** and a **backward pass**.

__Forward Pass__

During the forward pass, the input sample is fed into the network, and its output is calculated. This output is compared to the expected output given by the sample's label (we're doing supervised learning), and the loss function (also called the cost function) calculates the error. Training the network can be thought of as an optimization problem, where we try to minimize the cost function.
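
The forward pass can be sketched in a few lines of plain Python. This is only an illustration: the 2-2-1 layer sizes, the sigmoid hidden layer, the squared-error loss, and all weight values are assumptions made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    """Forward pass of a tiny 2-2-1 network; returns hidden activations and prediction."""
    W1, b1, W2, b2 = params
    # Hidden layer: weighted sum plus bias, then the activation function.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    # Output layer: a single linear neuron.
    y_pred = sum(w * h for w, h in zip(W2, hidden)) + b2
    return hidden, y_pred

def squared_error(y_pred, y_true):
    """Loss for one sample: half the squared difference."""
    return 0.5 * (y_pred - y_true) ** 2

# Hypothetical weights and one labelled training sample.
params = (
    [[0.1, -0.2], [0.4, 0.3]],  # W1: one row of weights per hidden neuron
    [0.0, 0.0],                 # b1: hidden biases
    [0.5, -0.5],                # W2: output weights
    0.0,                        # b2: output bias
)
x, y_true = [1.0, 2.0], 1.0

hidden, y_pred = forward(x, params)
loss = squared_error(y_pred, y_true)
print(f"prediction={y_pred:.4f}, loss={loss:.4f}")
```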

__Backward Pass__

Now, during the backward pass, the error is propagated backward through the network. Using partial derivatives, we find out how much each weight contributes to the error, i.e. how much the loss changes in response to a small change in that weight. The optimizer then uses these gradients to update the weights. Take a look at the mathematics behind this [here](http://www.nunnlib.eu/home/mlp/back-propagation-algorithm).

![img](https://cdn-images-1.medium.com/max/1600/1*q1M7LGiDTirwU-4LcFq7_Q.png)
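
As a rough sketch of the backward pass, the example below applies the chain rule to a single sigmoid neuron with a squared-error loss and then performs one gradient-descent update. The neuron, the weights, the sample, and the learning rate are all hypothetical values chosen for illustration; a real network repeats the same chain-rule step layer by layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, b, x, y_true):
    """Squared-error loss of one sigmoid neuron on one sample."""
    y_pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return 0.5 * (y_pred - y_true) ** 2

def backward(w, b, x, y_true):
    """Return (dL/dw, dL/db) for one sigmoid neuron via the chain rule."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_pred = sigmoid(z)
    dL_dy = y_pred - y_true            # derivative of 0.5 * (y_pred - y_true)**2
    dy_dz = y_pred * (1.0 - y_pred)    # derivative of the sigmoid
    dL_dz = dL_dy * dy_dz
    grad_w = [dL_dz * xi for xi in x]  # dz/dw_i = x_i
    grad_b = dL_dz                     # dz/db = 1
    return grad_w, grad_b

# Hypothetical starting weights, training sample, and learning rate.
w, b = [0.5, -0.3], 0.1
x, y_true = [1.0, 2.0], 1.0
learning_rate = 0.5

loss_before = loss_fn(w, b, x, y_true)
grad_w, grad_b = backward(w, b, x, y_true)

# Gradient-descent update: move each parameter against its gradient.
w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
b = b - learning_rate * grad_b
loss_after = loss_fn(w, b, x, y_true)
print(f"loss before={loss_before:.4f}, after={loss_after:.4f}")
```

The loss after the update is lower than before it, which is the whole point of one training iteration.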

## Why do we need an activation function? 
The core operation that takes place at a neuron is a weighted sum of its inputs, to which the bias is added. Depending on the inputs, the result can be positive or negative and is unbounded in magnitude. More importantly, a weighted sum is a linear operation, and stacking linear layers only produces another linear function, so on its own the network could not learn non-linear relationships no matter how many layers it has. To overcome these issues, we apply an activation function to the output of each neuron.
An activation function introduces non-linearity and, depending on the function, can also limit the output range of a neuron. For the backprop algorithm to work properly, the activation function should be differentiable (at least almost everywhere) and have a useful, non-zero gradient.
Some popular activation functions are:
- tanh
- sigmoid
- signum (its gradient is zero almost everywhere, so it is not suitable for training with backpropagation)
- ReLU (Rectified Linear Unit), Leaky ReLU
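
For reference, the functions listed above can be written in plain Python as follows. This is a minimal sketch operating on single numbers; the `negative_slope` default of `0.01` for Leaky ReLU is a commonly used value and is an assumption here, not something fixed by the definition.

```python
import math

def tanh(z):
    """Squashes the input into the range (-1, 1)."""
    return math.tanh(z)

def sigmoid(z):
    """Squashes the input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def signum(z):
    """Returns -1, 0, or 1 depending on the sign of the input."""
    return (z > 0) - (z < 0)

def relu(z):
    """Passes positive inputs through and clips negative inputs to 0."""
    return max(0.0, z)

def leaky_relu(z, negative_slope=0.01):
    """Like ReLU, but keeps a small slope for negative inputs (assumed default)."""
    return z if z > 0 else negative_slope * z

for z in (-2.0, 0.0, 2.0):
    print(z, tanh(z), sigmoid(z), signum(z), relu(z), leaky_relu(z))
```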