# Deep learning math

Scalars: A scalar is a single quantity that you can think of as a number. In machine learning models, we can use scalar quantities to manipulate data, and we often modify them to improve our model’s accuracy. We can also represent data as scalar values depending on what dataset we are working with.

In [None]:
x = 5

Vectors: Vectors are arrays of numbers. In Python, we often represent vectors as NumPy arrays. Each value in the array can be identified by its index (its location within the array).

In [None]:
import numpy as np
x = np.array([1, 2, 3])

Matrices: Matrices are grids of information with rows and columns. We can index a matrix just like an array; however, when indexing on a matrix, we need two arguments: one for the row and one for the column.

In [None]:
x = np.array([[1,2,3],[4,5,6],[7,8,9]])
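
To make the indexing rules concrete, here is a small sketch. The names `vector`, `matrix`, `second_element`, and `row1_col2` are illustrative and not part of the original notebook. Note that NumPy indices start at 0.

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# A vector needs one index: index 1 is the second value.
second_element = vector[1]   # 2

# A matrix needs two indices: row first, then column.
row1_col2 = matrix[1, 2]     # row index 1, column index 2 -> 6

print(second_element, row1_col2)
```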

Scalars, vectors, and matrices are foundational objects in linear algebra. Understanding how they interact with each other and how they can be manipulated through matrix algebra is essential before diving into deep learning. This is because the data structure we use in deep learning is called a tensor, which is a generalized form of a vector and a matrix: a multidimensional array.

The shape of this tensor is (3, 2, 5), as outlined on the diagram. 

![image.png](attachment:image.png)
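
As a rough illustration, a tensor with the same shape as the one in the diagram can be built as a three-dimensional NumPy array. Filling it with zeros is an arbitrary choice made for this sketch. Only the shape matters here.

```python
import numpy as np

# A 3-dimensional array with shape (3, 2, 5); the zero values are placeholders.
tensor = np.zeros((3, 2, 5))

print(tensor.shape)  # (3, 2, 5)
print(tensor.ndim)   # 3
```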

See C:\Users\alan_\Documents\GitHub\Code-Academy\Python_For_Data_Science\8-Linear_Algebra\Lessons\LinearAlgebra.ipynb for more info

## Neural Networks Concept Overview

![image.png](attachment:image.png)

- The input layer contains the inputs (data points) from our dataset. Our input can have many different features, so in our input layer, each node represents a different input feature. For example, if we were working with a dataset of different types of food, some of our features might be size, shape, nutrition, etc., where the value for each of these features would be held in an input node.
- Hidden layers are layers that come between the input layer and the output layer. They introduce complexity into our neural network and help with the learning process. You can have as many hidden layers as you want in a neural network (including zero of them).
- The output layer is the final layer in our neural network. It produces the final result, so every neural network has exactly one output layer.
- Each layer in a neural network contains nodes. Nodes between each layer are connected by weights. These are the learning parameters of our neural network, determining the strength of the connection between each linked node.
- Between each pair of adjacent layers, we calculate the weighted sum of the node values and the weights. For example, from our input layer, we take the weighted sum of the inputs and our weights with the following equation:

![image-2.png](attachment:image-2.png)
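
The weighted sum described above can be sketched in NumPy as a matrix product. The input values, the weight values, and the layer sizes (three input features feeding two nodes) are made up for illustration. The equation in the image may also include terms, such as a bias, that this sketch leaves out.

```python
import numpy as np

# Hypothetical values for three input features.
inputs = np.array([0.5, 1.0, 2.0])

# Hypothetical weights: one row per input node, one column per node in the next layer.
weights = np.array([[0.1, 0.2],
                    [0.3, 0.4],
                    [0.5, 0.6]])

# Each entry is the sum of inputs[i] * weights[i, j] over all input nodes i.
weighted_sum = inputs @ weights   # approximately [1.35, 1.7]

print(weighted_sum)
```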


## Loss Functions

When the network outputs a value, we calculate its error using a loss function, which compares our predicted values with the actual values in the training data. Two commonly used loss functions are:

- Mean squared error, which is most likely familiar to you if you have come across linear regression.
- Cross-entropy loss, which is used for classification learning models rather than regression.
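
Both losses can be sketched in a few lines of NumPy. The function names and sample values are illustrative. The cross-entropy shown is the binary (two-class) form, which is only one of several variants, and the small `eps` clip is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true holds 0/1 labels; p_pred holds predicted probabilities of class 1.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Squared errors are 0, 0, and 4, so the mean is 4/3.
mse = mean_squared_error(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))

# Both predictions assign probability 0.9 to the correct class, so the loss is -log(0.9).
bce = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))

print(mse, bce)
```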

## Backpropagation

However, what if our output values are inaccurate?

This is where backpropagation and gradient descent come into play. Forward propagation feeds the input values through the hidden layers to the final output layer. Backpropagation computes the gradients of the loss function with respect to the weights, working backward from the output layer. An optimization algorithm known as gradient descent then uses these gradients to repeatedly update and refine the weights between neurons, minimizing our loss function.

By gradient, we mean the rate of change of our loss function with respect to its parameters. From this, backpropagation determines how much each weight contributes to the error measured by our loss function, and gradient descent updates our weight values accordingly to decrease this error.
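
For intuition, here is a minimal sketch of a gradient computed with the chain rule. It assumes a single linear neuron with a squared-error loss, far simpler than a real network. The values of `x`, `w`, and `y_true` are made up, and the finite-difference check is included only to confirm the analytic result.

```python
import numpy as np

x = np.array([1.0, 2.0])      # hypothetical input
w = np.array([0.5, -0.5])     # hypothetical weights
y_true = 1.0                  # hypothetical target

def loss(weights):
    y_pred = weights @ x
    return (y_pred - y_true) ** 2

# Chain rule: dL/dw = dL/dy_pred * dy_pred/dw = 2 * (y_pred - y_true) * x
y_pred = w @ x
grad = 2 * (y_pred - y_true) * x

# Numerical check: nudge each weight slightly and measure the change in loss.
step = 1e-6
numeric_grad = np.array([
    (loss(w + step * np.eye(2)[i]) - loss(w - step * np.eye(2)[i])) / (2 * step)
    for i in range(2)
])

print(grad, numeric_grad)
```

The second weight has the larger gradient magnitude because its input is larger, so changing it affects the error more.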

## Gradient descent

If we think about the concept graphically, we want to find the minimum point of our loss function because this is where our model’s error is lowest. If we start at a random point on our loss function, gradient descent will take “steps” in the “downhill direction”, that is, in the direction of the negative gradient. The size of each “step” depends on our learning rate. Choosing a good learning rate is important because it affects both the efficiency and accuracy of our results.

![image-3.png](attachment:image-3.png)
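
The stepping procedure can be sketched on a one-dimensional example. The loss `(w - 3) ** 2`, the starting point, the learning rate, and the number of steps are all assumptions chosen so that the minimum (at `w = 3`) is easy to verify by hand.

```python
def gradient_descent(grad_fn, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

def grad_fn(w):
    # Derivative of the illustrative loss (w - 3) ** 2.
    return 2 * (w - 3)

w_min = gradient_descent(grad_fn, start=0.0)
print(w_min)  # very close to 3
```

With this loss, a learning rate above 1.0 would make each step overshoot by more than it corrects, so the iterates would move away from the minimum instead of toward it.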

## Stochastic Gradient Descent

In deep learning models, we are often dealing with extremely large datasets. Because of this, performing backpropagation and gradient descent calculations on all of our data may be inefficient and computationally expensive no matter what learning rate we choose.

To solve this problem, a variation of gradient descent known as Stochastic Gradient Descent (SGD) was developed. Let’s say we have 100,000 data points and 5 parameters. If we did 1000 iterations (also known as epochs in Deep Learning) we would end up with 100000⋅5⋅1000 = 500,000,000 computations. We do not want our computer to do that many computations on top of the rest of the learning model; it will take forever.
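
The count above can be reproduced with a few lines of arithmetic. The variable names are illustrative. The second figure assumes the same rough counting model applied to a single data point per iteration, which is how SGD reduces the work.

```python
n_points = 100_000
n_params = 5
n_iterations = 1_000

# Every iteration touches every data point for every parameter.
full_batch_computations = n_points * n_params * n_iterations

# Same counting model, but with one data point per iteration.
single_point_computations = 1 * n_params * n_iterations

print(full_batch_computations)    # 500000000
print(single_point_computations)  # 5000
```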

This is where SGD comes into play. Instead of performing gradient descent on our entire dataset, we pick out a random data point to use at each iteration. This cuts back on computation time immensely while still yielding accurate results.
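
A minimal sketch of this idea, assuming a linear model with a squared-error loss and a synthetic, noise-free dataset. The dataset, `true_w`, the learning rate, and the iteration count are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free dataset generated from known weights.
X = rng.uniform(-1, 1, size=(1000, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w = np.zeros(2)
learning_rate = 0.1

for _ in range(2000):
    i = rng.integers(len(X))        # pick one random data point
    error = X[i] @ w - y[i]         # prediction error on that point only
    grad = 2 * error * X[i]         # gradient of the squared error w.r.t. w
    w -= learning_rate * grad

print(w)  # close to true_w
```

Each update uses a single row of `X`, so the cost per iteration does not grow with the size of the dataset.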

## More variants

There are also other variants of gradient descent, such as the Adam optimization algorithm and mini-batch gradient descent. Adam is an adaptive learning algorithm that finds individual learning rates for each parameter. Mini-batch gradient descent is similar to SGD except that instead of iterating on one data point at a time, we iterate on small batches of fixed size.
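
The per-parameter scaling in Adam can be sketched as follows. The function name `adam_minimize`, the quadratic test loss, and the step count are assumptions made for this example. The values `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are commonly used defaults rather than anything specified in this text.

```python
import numpy as np

def adam_minimize(grad_fn, w0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)   # running average of gradients
    v = np.zeros_like(w)   # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # correct the bias from starting at zero
        v_hat = v / (1 - beta2 ** t)
        # Each parameter's step is scaled by its own gradient history.
        w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Illustrative loss: sum((w - target) ** 2), whose gradient is 2 * (w - target).
target = np.array([3.0, -1.0])
w_final = adam_minimize(lambda w: 2 * (w - target), w0=[0.0, 0.0])
print(w_final)  # close to target
```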

Adam’s adaptive learning rates have made it a popular variant of gradient descent, and it is commonly used in deep learning models. Mini-batch gradient descent was developed as a trade-off between GD and SGD. Since mini-batch does not depend on just one training sample, its loss curve is much smoother and it is less affected by outliers and noisy data, which often makes it a better choice than SGD.
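
A minimal sketch of mini-batch gradient descent, again assuming a linear model with a squared-error loss on a synthetic, noise-free dataset. The batch size of 32, the learning rate, and the number of epochs are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free dataset generated from known weights.
X = rng.uniform(-1, 1, size=(1000, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w = np.zeros(2)
learning_rate = 0.1
batch_size = 32

for epoch in range(20):
    order = rng.permutation(len(X))              # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]  # indices of one mini-batch
        errors = X[batch] @ w - y[batch]
        # Average the gradient over the batch instead of using a single point.
        grad = 2 * X[batch].T @ errors / len(batch)
        w -= learning_rate * grad

print(w)  # close to true_w
```

Averaging over a batch is what smooths the updates: one unusual data point contributes only a fraction of each step.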

## Summary

- In backpropagation, the gradient of the loss function is calculated with respect to the weight parameters within a neural network.
- Gradient descent updates our weight parameters by iteratively minimizing our loss function to increase our model’s accuracy.
- Stochastic gradient descent is a variant of gradient descent, where instead of using all data points to update parameters, a random data point is selected.
- Adam optimization is a variant of SGD that allows for adaptive learning rates.
- Mini-batch gradient descent is a variant of GD that uses random batches of data to update parameters instead of a random datapoint.
