# Learning with Large Datasets

![title](pictures/ld_1.png)

![title](pictures/ld_2.png)

Summing over 100,000,000 examples at every gradient descent step is computationally very expensive.

__Sanity check:__ Before training on the full dataset, train on a small subset and plot the learning curves. If the algorithm shows __high variance__ when m is small (plot on the left), more data is likely to help, so training on the full 100,000,000 examples is worth the cost.

If the learning curves instead show __high bias__, adding more data will not help much on its own. Add features, or add hidden units to the neural network, before scaling up to the full dataset.
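The sanity check above can be sketched as follows. This is a minimal illustration, not the course's code: the synthetic data, the one-feature linear model, and the helper names `fit_line`, `half_mse` and `learning_curve` are all assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares fit of y = a + b * x."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    var_x = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return mean_y - b * mean_x, b

def half_mse(a, b, xs, ys):
    """Cost J = 1/(2m) * sum of squared errors."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

def learning_curve(train_x, train_y, cv_x, cv_y, sizes):
    """For each training set size m, return (m, J_train, J_cv)."""
    curve = []
    for m in sizes:
        a, b = fit_line(train_x[:m], train_y[:m])
        j_train = half_mse(a, b, train_x[:m], train_y[:m])
        j_cv = half_mse(a, b, cv_x, cv_y)
        curve.append((m, j_train, j_cv))
    return curve

# Hypothetical noisy data generated from y = 1 + 2x.
rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(1200)]
ys = [1 + 2 * x + rng.gauss(0, 0.3) for x in xs]
curve = learning_curve(xs[:1000], ys[:1000], xs[1000:], ys[1000:],
                       sizes=[2, 10, 100, 1000])
for m, j_train, j_cv in curve:
    print(m, round(j_train, 4), round(j_cv, 4))
```

A large gap between `j_cv` and `j_train` that narrows as m grows suggests high variance; two curves that are both high and close together suggest high bias.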

# Stochastic Gradient Descent

Let's use linear regression as an example; SGD can also be applied to other learning algorithms that are trained with gradient descent.

![title](pictures/ld_sgd_1.png)

The gradient term sums over all m training examples, so it is very expensive to compute when m is large.

![title](pictures/ld_sgd_2.png)

Now let's look at stochastic gradient descent (SGD), which updates the parameters using one training example at a time.

![title](pictures/ld_sgd_3.png)

![title](pictures/ld_sgd_4.png)
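As a rough sketch of SGD for linear regression: shuffle the examples, then update the parameters after every single example. The function name, the toy dataset, and the hyperparameter values below are assumptions chosen for illustration.

```python
import random

def sgd_linear_regression(X, y, alpha=0.01, epochs=50, seed=0):
    """Stochastic gradient descent: one parameter update per training example."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # randomly shuffle the training examples
        for i in order:
            prediction = sum(t * x for t, x in zip(theta, X[i]))
            error = prediction - y[i]
            # Update every parameter using this single example only.
            theta = [t - alpha * error * x for t, x in zip(theta, X[i])]
    return theta

# Toy data generated from y = 1 + 2x, with x0 = 1 as the intercept feature.
X = [[1.0, k / 10.0] for k in range(20)]
y = [1.0 + 2.0 * row[1] for row in X]
theta = sgd_linear_regression(X, y, alpha=0.1, epochs=200)
print(theta)  # close to [1.0, 2.0] on this noiseless data
```

Each update touches one example, so the parameters start improving long before a full pass over the data is complete.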

# Mini-Batch Gradient Descent

Mini-batch gradient descent can sometimes run faster than stochastic gradient descent. Each update uses a small batch of b examples instead of a single example, and the algorithm loops through all the batches in the dataset.

![title](pictures/ld_mbgd_1.png)

![title](pictures/ld_mbgd_2.png)

MBGD is likely to outperform SGD only if the implementation is well vectorized: the b examples in a batch can then be processed together, so each pass through the dataset finishes sooner.

(A typical choice is b = 10.)
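A minimal sketch of mini-batch gradient descent for linear regression, assuming b = 10 and the same kind of toy data as before. It uses plain Python loops for readability; a real implementation would compute each batch gradient with a vectorized library call, which is where the speed-up comes from.

```python
def minibatch_gradient_descent(X, y, alpha=0.1, batch_size=10, epochs=2000):
    """Gradient descent where each update averages the gradient over b examples."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(epochs):
        for start in range(0, m, batch_size):
            batch_X = X[start:start + batch_size]
            batch_y = y[start:start + batch_size]
            grad = [0.0] * n
            for xi, yi in zip(batch_X, batch_y):
                error = sum(t * x for t, x in zip(theta, xi)) - yi
                for j in range(n):
                    grad[j] += error * xi[j]
            # One update per mini-batch, using the batch-averaged gradient.
            theta = [t - alpha * g / len(batch_X) for t, g in zip(theta, grad)]
    return theta

# Toy data generated from y = 1 + 2x, split into two batches of 10.
X = [[1.0, k / 10.0] for k in range(20)]
y = [1.0 + 2.0 * row[1] for row in X]
theta = minibatch_gradient_descent(X, y)
print(theta)  # close to [1.0, 2.0] on this noiseless data
```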

# Convergence of Stochastic Gradient Descent

Checking for convergence in batch gradient descent & stochastic gradient descent.

![title](pictures/ld_csgd_1.png)

The cost decreases and plateaus over time. With a smaller learning rate the cost falls more slowly at first, but the final cost can be slightly lower. (top left)

Averaging the cost over a larger number of examples produces a smoother curve, but a new point is plotted less often, so feedback on progress is delayed. (top right)

A flat, oscillating curve can mean that the algorithm isn't learning much. Averaging over a larger number of examples may reveal a smoother, slowly decreasing trend that the noise was hiding. If the curve is still flat after increasing the number of examples averaged over, the algorithm is not learning properly; changing the learning rate or the features could help. (bottom left)

If the cost is increasing, the algorithm is diverging and the learning rate should be reduced. (bottom right)
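The curves discussed above can be produced roughly as follows: compute the cost on each example just before updating on it, and record the average every `window` examples. This is a sketch under assumptions; the function name, the `window` parameter, and the synthetic data are illustrative choices.

```python
import random

def sgd_with_cost_history(X, y, alpha=0.05, epochs=20, window=100, seed=0):
    """SGD for linear regression that also records the windowed average cost."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    order = list(range(len(X)))
    averaged_costs, recent = [], []
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            error = sum(t * x for t, x in zip(theta, X[i])) - y[i]
            recent.append(0.5 * error ** 2)  # cost measured BEFORE the update
            theta = [t - alpha * error * x for t, x in zip(theta, X[i])]
            if len(recent) == window:
                averaged_costs.append(sum(recent) / window)
                recent = []
    return theta, averaged_costs

# Hypothetical noisy data generated from y = 1 + 2x.
data_rng = random.Random(1)
X = [[1.0, data_rng.uniform(0, 1)] for _ in range(200)]
y = [1.0 + 2.0 * row[1] + data_rng.gauss(0, 0.1) for row in X]
theta, averaged_costs = sgd_with_cost_history(X, y)
print(len(averaged_costs), averaged_costs[0], averaged_costs[-1])
```

Plotting `averaged_costs` against its index gives one of the four curve shapes; a larger `window` gives a smoother curve with fewer points.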

![title](pictures/ld_csgd_2.png)

With a constant learning rate, the parameters do not converge exactly to the global minimum. They hover around it, as shown in the figure.

One way to address this is to decrease the learning rate slowly as the number of iterations increases. This is not very common, because the decay schedule introduces extra constants that need to be tuned. When it is tuned well, however, the parameters end up closer to the minimum.
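One commonly used decay schedule is alpha = const1 / (iterationNumber + const2). The sketch below assumes that form; the names `const1` and `const2` and their default values are illustrative and are exactly the extra constants that would need tuning.

```python
def decayed_learning_rate(iteration, const1=1.0, const2=10.0):
    """Learning rate that shrinks as the iteration number grows."""
    return const1 / (iteration + const2)

# The step size starts at const1 / const2 and decays towards zero.
for iteration in (0, 10, 100, 1000):
    print(iteration, decayed_learning_rate(iteration))
```

Inside an SGD loop, `alpha` would be recomputed from the running update count before each parameter update.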

![title](pictures/ld_csgd_3.png)

![title](pictures/ld_csgd_4.png)

# Online Learning

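Online learning means learning from a continuous stream of data: each example is used for one parameter update as it arrives, rather than being stored in a fixed training set.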

![title](pictures/ld_ol_1.png)

![title](pictures/ld_ol_2.png)
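A minimal sketch of online learning with logistic regression. The stream here is simulated, and the generator `simulated_stream`, its single feature, and the label rule are hypothetical stand-ins for real incoming events.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_update(theta, x, y, alpha=0.1):
    """One logistic regression gradient step on a single (x, y) pair."""
    error = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
    return [t - alpha * error * xi for t, xi in zip(theta, x)]

def simulated_stream(n_events, seed=0):
    """Hypothetical event stream: y = 1 is more likely when the feature is large."""
    rng = random.Random(seed)
    for _ in range(n_events):
        feature = rng.uniform(-1.0, 1.0)
        label = 1 if rng.random() < sigmoid(4.0 * feature) else 0
        yield [1.0, feature], label

theta = [0.0, 0.0]
for x, y in simulated_stream(5000):
    theta = online_update(theta, x, y)  # learn from the example, then discard it
print(theta)
```

Because no example is stored, the model keeps adapting if the behaviour of the stream changes over time.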

# MapReduce & Data Parallelism

MapReduce scales learning algorithms to datasets too large to handle with stochastic gradient descent on a single machine, for example m = 400,000,000 or more.

Split the dataset evenly across the available machines. Each machine computes the batch gradient descent sum over its own portion of the data. The partial sums are then combined and used to update the parameters.

![title](pictures/ld_mr_1.png)
![title](pictures/ld_mr_2.png)

The key question for MapReduce is whether the quantities the learning algorithm needs, such as the cost and its gradients, can be expressed as sums over the training set.

![title](pictures/ld_mr_3.png)
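Because the batch gradient is a sum over examples, it can be split into per-shard partial sums. The sketch below shows the idea for linear regression; the function names and `n_workers` are assumptions, and the map step runs sequentially here, whereas in practice each shard would be handled by a separate machine or core.

```python
def partial_gradient(theta, X_shard, y_shard):
    """Map step: sum of per-example gradients over one shard of the data."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X_shard, y_shard):
        error = sum(t * x for t, x in zip(theta, xi)) - yi
        for j, x in enumerate(xi):
            grad[j] += error * x
    return grad

def mapreduce_gradient_step(theta, X, y, alpha, n_workers=4):
    """One batch gradient descent step computed from per-shard partial sums."""
    m = len(X)
    shard_size = (m + n_workers - 1) // n_workers
    shards = [(X[s:s + shard_size], y[s:s + shard_size])
              for s in range(0, m, shard_size)]
    # Map: each shard could be sent to a different machine or core.
    partials = [partial_gradient(theta, Xs, ys) for Xs, ys in shards]
    # Reduce: add the partial sums, then apply the usual update.
    total = [sum(p[j] for p in partials) for j in range(len(theta))]
    return [t - alpha * g / m for t, g in zip(theta, total)]

X = [[1.0, k / 10.0] for k in range(20)]
y = [1.0 + 2.0 * row[1] for row in X]
split_step = mapreduce_gradient_step([0.0, 0.0], X, y, alpha=0.1, n_workers=4)
single_step = mapreduce_gradient_step([0.0, 0.0], X, y, alpha=0.1, n_workers=1)
print(split_step, single_step)  # the two agree up to floating-point rounding
```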

MapReduce can also be parallelised across the cores of a single multi-core machine.

![title](pictures/ld_mr_4.png)