# Week 10

## Large scale machine learning

Training a **low bias** algorithm on a **large dataset** is an effective way to get a **high performance** model, since more data mainly reduces variance.

Throughout this chapter we use linear regression as the running example, but everything also applies to logistic regression and neural networks.

### The problem of batch gradient descent

Suppose we want to run batch gradient descent for linear regression on a large dataset ($\text{m}$ of several million examples). As a reminder, the formulas are:

> $\displaystyle 
\begin{align} 
h_{\theta}(x) &= \sum_{j=0}^{n}\theta_j x_j \\
J_{train}(\theta) &= \frac{1}{2m} \sum_{i=1}^{m} \big(h_{\theta}(x^{(i)}) - y^{(i)} \big)^2 \\
\theta_{j} &:= \theta_{j} - \alpha \underbrace{\frac{1}{m}\sum_{i=1}^{m} \big(h_{\theta}(x^{(i)}) - y^{(i)} \big)x_{j}^{(i)}}_{\frac{\partial}{\partial \theta_{j}} J_{train}(\theta)}
\end{align}$

The sum in the batch gradient descent update runs over all $\text m$ examples at every iteration, which makes it a **computationally expensive procedure**.
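
To make the cost of that sum concrete, here is a minimal NumPy sketch of one batch update. The function name `batch_gradient_step` and the tiny dataset are assumptions made for illustration; they are not part of the course material.

```python
import numpy as np

def batch_gradient_step(theta, X, y, alpha):
    """One batch gradient descent update for linear regression.

    X has shape (m, n + 1) with a leading column of ones (x_0 = 1),
    y has shape (m,) and theta has shape (n + 1,).
    """
    m = X.shape[0]
    errors = X @ theta - y            # h_theta(x^(i)) - y^(i) for all m examples
    gradient = (X.T @ errors) / m     # the sum over all m examples, for every j
    return theta - alpha * gradient

# Tiny illustrative dataset (assumed): y = 1 + 2 * x, with x_0 = 1 in column 0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = np.zeros(2)
for _ in range(5000):
    theta = batch_gradient_step(theta, X, y, alpha=0.1)
```

Every call touches all rows of `X`, so with millions of examples each single update of $\theta$ requires a full pass over the data.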

Fortunately, several methods exist to optimize the calculation.

### Stochastic gradient descent

The cost is computed on a **single example** $(x^{(i)}, y^{(i)})$ at a time:

> $\displaystyle 
\begin{align} 
cost \big( \theta, (x^{(i)}, y^{(i)}) \big) &= \frac{1}{2}\big(h_{\theta}(x^{(i)}) - y^{(i)} \big)^2 \\
J_{train}(\theta) &= \frac{1}{m} \sum_{i=1}^{m} cost \big( \theta, (x^{(i)}, y^{(i)}) \big)
\end{align}$

1. Randomly shuffle the dataset
2. Repeat for $i = 1, \dots, m$, updating every $\theta_j$ ($j = 0, \dots, n$) from example $i$ alone:

> $\displaystyle 
\theta_{j} := \theta_{j} - \alpha \underbrace{\big(h_{\theta}(x^{(i)}) - y^{(i)} \big)x_{j}^{(i)}}_{\frac{\partial}{\partial \theta_{j}}cost \big( \theta, (x^{(i)}, y^{(i)}) \big)}$
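
The two steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the name `sgd_epoch`, the learning rate, the number of passes and the noiseless toy dataset are all assumptions.

```python
import numpy as np

def sgd_epoch(theta, X, y, alpha, rng):
    """One pass of stochastic gradient descent over the dataset."""
    m = X.shape[0]
    for i in rng.permutation(m):                # step 1: visit examples in random order
        error = X[i] @ theta - y[i]             # h_theta(x^(i)) - y^(i)
        theta = theta - alpha * error * X[i]    # step 2: update all theta_j from example i
    return theta

# Tiny illustrative dataset (assumed): y = 1 + 2 * x, with x_0 = 1 in column 0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(2000):                           # several passes over the data
    theta = sgd_epoch(theta, X, y, alpha=0.05, rng=rng)
```

Unlike the batch version, $\theta$ changes after every single example, so progress starts before the first pass over the data is finished.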

### Mini-batch gradient descent

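Each iteration uses $\text{b}$ examples, which sits between the single example of stochastic gradient descent and the $\text{m}$ examples of batch gradient descent. The mini-batch size $\text{b}$ is usually between 2 and 100.

Say $\text{b = 10}$ and $\text{m = 1000}$. For $\text{i = 1, 11, 21, ..., 991}$: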

> $\displaystyle 
\theta_{j} := \theta_{j} - \alpha \frac{1}{10} \sum_{k=i}^{i+9} \big(h_{\theta}(x^{(k)}) - y^{(k)} \big)x_{j}^{(k)}$
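
A possible NumPy version of this loop is sketched below, using the same $\text{b = 10}$ and $\text{m = 1000}$. The function name `minibatch_epoch`, the synthetic noiseless data and the learning rate are assumptions for illustration. Note that the code indexes from 0, so the batches start at 0, 10, ..., 990 rather than 1, 11, ..., 991.

```python
import numpy as np

def minibatch_epoch(theta, X, y, alpha, b):
    """One pass of mini-batch gradient descent with batches of size b."""
    m = X.shape[0]
    for start in range(0, m, b):                # 0, b, 2b, ... (0-based indexing)
        X_batch = X[start:start + b]
        y_batch = y[start:start + b]
        errors = X_batch @ theta - y_batch      # errors on the b examples only
        gradient = (X_batch.T @ errors) / len(y_batch)
        theta = theta - alpha * gradient        # vectorized update of every theta_j
    return theta

# Synthetic data (assumed): m = 1000 examples of y = 1 + 2 * x, no noise.
rng = np.random.default_rng(0)
m, b = 1000, 10
x = rng.uniform(0.0, 1.0, size=m)
X = np.column_stack([np.ones(m), x])
y = 1.0 + 2.0 * x

theta = np.zeros(2)
for _ in range(200):
    theta = minibatch_epoch(theta, X, y, alpha=0.5, b=b)
```

The benefit over updating on one example at a time comes from vectorization: the $\text{b}$ examples of a batch are processed in a single matrix product.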

### Stochastic gradient descent convergence

- During learning, compute $cost \big( \theta, (x^{(i)}, y^{(i)}) \big)$ before updating $\theta$ using $(x^{(i)}, y^{(i)})$.

- Every 1000 iterations (say), plot $cost \big( \theta, (x^{(i)}, y^{(i)}) \big)$ averaged over the last 1000 examples processed by the algorithm.

- The learning rate $\alpha$ is typically held constant, in which case $\theta$ keeps oscillating around the minimum. To make $\theta$ converge, we can slowly decrease $\alpha$ over time:
> $\displaystyle \alpha = \frac{const1}{IterationNumber + const2}$

The constants $const1$ and $const2$ need to be set by the user.
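
The monitoring procedure and the decaying learning rate can be combined as in the sketch below. Everything specific here is an assumption chosen for illustration: the function names, the synthetic noisy data, the window of 1000 iterations and the values `const1=50.0` and `const2=100.0`.

```python
import numpy as np

def decayed_alpha(iteration, const1, const2):
    """Learning rate alpha = const1 / (IterationNumber + const2)."""
    return const1 / (iteration + const2)

def sgd_with_monitoring(X, y, const1, const2, window, rng):
    """Single SGD pass that records the cost averaged over each window."""
    m, n_features = X.shape
    theta = np.zeros(n_features)
    recent_costs = []
    averaged_costs = []
    for iteration, i in enumerate(rng.permutation(m), start=1):
        error = X[i] @ theta - y[i]
        recent_costs.append(0.5 * error ** 2)   # cost computed BEFORE updating theta
        alpha = decayed_alpha(iteration, const1, const2)
        theta = theta - alpha * error * X[i]
        if iteration % window == 0:             # one point of the curve per window
            averaged_costs.append(float(np.mean(recent_costs)))
            recent_costs = []
    return theta, averaged_costs

# Synthetic data (assumed): y = 4 + 3 * x plus Gaussian noise.
rng = np.random.default_rng(0)
m = 5000
x = rng.uniform(0.0, 1.0, size=m)
X = np.column_stack([np.ones(m), x])
y = 4.0 + 3.0 * x + rng.normal(0.0, 0.1, size=m)

theta, averaged_costs = sgd_with_monitoring(
    X, y, const1=50.0, const2=100.0, window=1000, rng=rng
)
```

Plotting `averaged_costs` against the window index gives the convergence curve described above. A curve that is too noisy suggests a larger window, and a rising curve suggests a smaller learning rate.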

### MapReduce

For example, when the dataset is split across 4 machines, each machine computes the sum of $\big(h_{\theta}(x^{(i)}) - y^{(i)} \big)x_{j}^{(i)}$ over its own quarter of the data.

The four partial sums are then sent to a central server, which adds them up and applies the batch gradient descent update.

The splitting can also be done across the cores of a single multi-core machine.
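
The map and reduce steps can be simulated on one machine as in the sketch below. It is only an illustration of the arithmetic: the function names and the synthetic data are assumptions, and the "machines" are plain sequential function calls rather than a real cluster or process pool.

```python
import numpy as np

def partial_gradient_sum(theta, X_part, y_part):
    """Map step: sum of (h_theta(x) - y) * x over one machine's share of the data."""
    return X_part.T @ (X_part @ theta - y_part)

def mapreduce_gradient_step(theta, X, y, alpha, n_machines=4):
    """One batch gradient descent update built from per-machine partial sums."""
    m = X.shape[0]
    X_parts = np.array_split(X, n_machines)
    y_parts = np.array_split(y, n_machines)
    # Map: on a real cluster, each call would run on a different machine or core.
    partial_sums = [
        partial_gradient_sum(theta, X_part, y_part)
        for X_part, y_part in zip(X_parts, y_parts)
    ]
    # Reduce: the central server adds the partial sums and applies the update.
    gradient = np.sum(partial_sums, axis=0) / m
    return theta - alpha * gradient

# Synthetic data (assumed) to compare against the plain batch update.
rng = np.random.default_rng(0)
m = 400
X = np.column_stack([np.ones(m), rng.uniform(0.0, 1.0, size=m)])
y = rng.normal(size=m)
theta = np.array([0.5, -0.5])

split_update = mapreduce_gradient_step(theta, X, y, alpha=0.1)
full_update = theta - 0.1 * (X.T @ (X @ theta - y)) / m
```

Because the gradient is a plain sum over examples, `split_update` and `full_update` agree up to floating-point rounding. This is why the approach applies to any algorithm whose update can be written as a sum over the training set.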