# Large Scale Machine Learning

### Learning With Large Datasets

- Larger datasets are one of the primary reasons why Machine Learning algorithms perform so much better today than they did 10 years ago. More data means low-bias models can learn better.
- However, larger datasets are also computationally expensive.
- One sanity check is to randomly select a small subset (e.g. 1,000 examples when m = 100,000,000), train the model on it, and see how well it performs before scaling up to train on all m examples. If performance has already plateaued on the small subset, more data alone is unlikely to help.
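
The sanity check above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the synthetic dataset, the helper name `subset_sanity_check`, and the use of a least-squares fit as the model are all assumptions made for the example.

```python
import numpy as np

def subset_sanity_check(X, y, subset_size=1000, seed=0):
    """Fit on a random subset; return (train_cost, holdout_cost)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    train_idx = idx[:subset_size]
    holdout_idx = idx[subset_size:2 * subset_size]
    # A least-squares fit stands in for whatever model is being evaluated.
    theta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

    def cost(rows):
        residual = X[rows] @ theta - y[rows]
        return float(residual @ residual) / (2 * len(rows))

    return cost(train_idx), cost(holdout_idx)

# Hypothetical synthetic dataset: intercept column plus 3 features.
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100_000), rng.normal(size=(100_000, 3))])
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=100_000)

train_cost, holdout_cost = subset_sanity_check(X, y)
print(f"train cost: {train_cost:.4f}, holdout cost: {holdout_cost:.4f}")
```

If the two costs are already close on the subset, adding more examples will probably not change much; a large gap between them suggests that training on more data could help.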

### Stochastic Gradient Descent

- Consider Linear Regression with Gradient Descent:
    - Hypothesis: $h_\theta(x) = \sum \limits_{j=0}^n \theta_j x_j$
    - Cost Function: $J_{train}(\theta) = \frac{1}{2m}\sum \limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
    - Gradient Descent: $\theta_j := \theta_j - \alpha\frac{1}{m}\sum \limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$ (for every $j = 0,...,n$)
        - This type of gradient descent is called **Batch Gradient Descent** because we are looking at all of the training examples at the same time.
        - When m is very large, calculating the gradient is computationally expensive, because every single update requires a sum over all m examples.
- Stochastic Gradient Descent:
    - $cost(\theta,(x^{(i)},y^{(i)})) = \frac {1}{2}(h_\theta(x^{(i)})-y^{(i)})^2$
    - $J_{train}(\theta) = \frac{1}{m} \sum \limits_{i=1}^m cost(\theta,(x^{(i)},y^{(i)}))$
    - Process for Stochastic Gradient Descent:
        1. Randomly shuffle the dataset
        2. Repeat (usually a small number of passes over the data):
            - For $i = 1,...,m$:
                - $\theta_j := \theta_j - \alpha (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$ (for every $j = 0,...,n$)
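
The stochastic gradient descent steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function name `sgd_linear_regression`, the default values for `alpha` and `passes`, and the synthetic data are hypothetical choices for the example, not part of the notes.

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, passes=5, seed=0):
    """Stochastic gradient descent for linear regression.

    Assumes X already includes the intercept column x_0 = 1.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    order = rng.permutation(m)             # 1. randomly shuffle the dataset
    theta = np.zeros(n)
    for _ in range(passes):                # 2. repeat
        for i in order:                    #    for i = 1, ..., m
            error = X[i] @ theta - y[i]    #    h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]  #    update every theta_j at once
    return theta

# Hypothetical synthetic data: y = 1 + 2*x_1 - 3*x_2 + noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(2000), rng.normal(size=(2000, 2))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=2000)

theta = sgd_linear_regression(X, y)
print(theta)  # should land close to true_theta
```

Each update touches a single example, so $\theta$ starts improving long before a full pass over the data is finished, whereas batch gradient descent needs all m examples for one step.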

### Mini-batch Gradient Descent

- Batch Gradient Descent: Use all m examples in each iteration
- Stochastic Gradient Descent: Use 1 training example in each iteration
- Mini-batch Gradient Descent: Use b examples in each iteration
    - b = mini-batch size. A common choice is b = 10, with typical values in the range 2-100 depending on the number of training examples
    - If b = 10, then the update for the mini-batch starting at example $i$ is $\theta_j := \theta_j - \alpha\frac{1}{10}\sum \limits_{k=i}^{i+9}(h_\theta(x^{(k)})-y^{(k)})x_j^{(k)}$, after which $i$ advances by 10
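
A minimal sketch of the mini-batch update for linear regression, assuming NumPy; the function name `minibatch_gradient_descent`, the default hyperparameters, and the synthetic data are illustrative assumptions. The sum over the b examples is computed as one matrix product, which is where mini-batches can gain over processing one example at a time.

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.05, b=10, passes=5, seed=0):
    """Mini-batch gradient descent; X is assumed to include x_0 = 1."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    order = rng.permutation(m)
    X, y = X[order], y[order]              # shuffle once up front
    theta = np.zeros(n)
    for _ in range(passes):
        for i in range(0, m, b):           # i = 0, b, 2b, ...
            X_batch, y_batch = X[i:i + b], y[i:i + b]
            errors = X_batch @ theta - y_batch
            # Vectorized (1/b) * sum over the batch of error * x_j.
            gradient = X_batch.T @ errors / len(y_batch)
            theta -= alpha * gradient
    return theta

# Hypothetical synthetic data: y = 1 + 2*x_1 - 3*x_2 + noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(2000), rng.normal(size=(2000, 2))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=2000)

theta = minibatch_gradient_descent(X, y, b=10)
print(theta)  # should land close to true_theta
```

Dividing by `len(y_batch)` rather than `b` keeps the average correct when the final batch is smaller than b.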

### Stochastic Gradient Descent Convergence

- During Stochastic Gradient Descent, compute $cost(\theta,(x^{(i)},y^{(i)}))$ before updating $\theta$ using $(x^{(i)},y^{(i)})$
- Every 1000 iterations (say), plot $cost(\theta,(x^{(i)},y^{(i)}))$ averaged over the last 1000 examples processed by the algorithm.
- Learning rate $\alpha$ is typically held constant, in which case $\theta$ ends up oscillating near the minimum; it can instead be slowly decreased over time if we want $\theta$ to converge (e.g. $\alpha = \frac{\text{const1}}{\text{iter} + \text{const2}}$)
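
The monitoring procedure and the decaying learning rate can be combined in one sketch. The function name `sgd_with_monitoring`, the values chosen for `const1` and `const2`, and the synthetic data are assumptions made for illustration; in practice the returned averages would be plotted against the iteration count.

```python
import numpy as np

def sgd_with_monitoring(X, y, const1=5.0, const2=500.0, window=1000,
                        passes=5, seed=0):
    """SGD for linear regression that records windowed average costs."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    order = rng.permutation(m)
    theta = np.zeros(n)
    recent_costs, averaged_costs = [], []
    iteration = 0
    for _ in range(passes):
        for i in order:
            error = X[i] @ theta - y[i]
            recent_costs.append(0.5 * error ** 2)  # cost BEFORE the update
            alpha = const1 / (iteration + const2)  # slowly decreasing rate
            theta -= alpha * error * X[i]
            iteration += 1
            if len(recent_costs) == window:        # every `window` iterations
                averaged_costs.append(float(np.mean(recent_costs)))
                recent_costs.clear()
    return theta, averaged_costs

# Hypothetical synthetic data: y = 1 + 2*x_1 - 3*x_2 + noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(2000), rng.normal(size=(2000, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=2000)

theta, averaged_costs = sgd_with_monitoring(X, y)
for k, c in enumerate(averaged_costs, start=1):
    print(f"iterations {1000 * (k - 1) + 1}-{1000 * k}: average cost {c:.4f}")
```

A curve that trends downward and flattens suggests convergence. A very noisy curve may call for a larger averaging window, and a rising curve usually means the learning rate is too large.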

### Advanced Topics: Online Learning

- Example: A shipping service website where a user visits the site and specifies an origin and destination. We offer to ship their package for some asking price, and the user either chooses our service (y=1) or does not (y=0).
    - Features x capture properties of the user, the origin/destination, and the asking price. We want to learn $p(y=1|x;\theta)$ to optimize the price
    - Repeat forever:
        - Get (x,y) corresponding to the user.
        - Update $\theta$ using $(x,y)$, then discard the example (there is no fixed training set, so no index is needed):
            - $\theta_j := \theta_j - \alpha (h_\theta(x)-y)x_j$ (for every $j = 0,...,n$)
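
A minimal sketch of the online loop, assuming logistic regression as the hypothesis so that $h_\theta(x)$ estimates $p(y=1|x;\theta)$. The class name `OnlineLogisticRegression`, the simulated user stream, and the "true" parameters used to generate it are all hypothetical and exist only to make the example runnable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OnlineLogisticRegression:
    def __init__(self, n_features, alpha=0.1):
        self.theta = np.zeros(n_features)
        self.alpha = alpha

    def predict_proba(self, x):
        """Estimate p(y = 1 | x; theta)."""
        return sigmoid(x @ self.theta)

    def update(self, x, y):
        """One online step: theta_j := theta_j - alpha * (h(x) - y) * x_j."""
        self.theta -= self.alpha * (self.predict_proba(x) - y) * x

# Simulated stream (assumption): users are less likely to buy at higher prices.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0])       # intercept, price sensitivity
model = OnlineLogisticRegression(n_features=2)
for _ in range(20_000):                  # stands in for "repeat forever"
    price = rng.uniform(0.0, 2.0)
    x = np.array([1.0, price])           # x_0 = 1, x_1 = asking price
    y = float(rng.random() < sigmoid(x @ true_theta))
    model.update(x, y)                   # the example is then discarded

p_low = model.predict_proba(np.array([1.0, 0.2]))
p_high = model.predict_proba(np.array([1.0, 1.8]))
print(f"p(buy | low price) = {p_low:.2f}, p(buy | high price) = {p_high:.2f}")
```

Because each example is used once and then dropped, the model keeps adapting if user behaviour drifts over time.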
            
### Advanced Topics: Map-Reduce and Data Parallelism

- Consider batch gradient descent with m = 400 training examples, so that the update is $\theta_j := \theta_j - \alpha\frac{1}{400}\sum \limits_{i=1}^{400}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$. The sum can be split across four machines:
    - Machine 1: Use $(x^{(1)},y^{(1)}),...,(x^{(100)},y^{(100)})$
        - $temp_j^{(1)} = \sum\limits_{i=1}^{100} (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
    - Machine 2: Use $(x^{(101)},y^{(101)}),...,(x^{(200)},y^{(200)})$
        - $temp_j^{(2)} = \sum\limits_{i=101}^{200} (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
    - Machine 3: Use $(x^{(201)},y^{(201)}),...,(x^{(300)},y^{(300)})$
        - $temp_j^{(3)} = \sum\limits_{i=201}^{300} (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
    - Machine 4: Use $(x^{(301)},y^{(301)}),...,(x^{(400)},y^{(400)})$
        - $temp_j^{(4)} = \sum\limits_{i=301}^{400} (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
- Once the calculations are complete, each machine sends its $temp_j$ value to a master server, which computes the update $\theta_j := \theta_j - \alpha \frac{1}{400}(temp_j^{(1)} + temp_j^{(2)} + temp_j^{(3)} + temp_j^{(4)})$
- Map-reduce and summation over the training set
    - Many learning algorithms can be expressed as computing sums of functions over the training set
- Because many modern computers have multiple cores, the same map-reduce technique works on a single machine: split the training set into one subset per core, have each core compute its partial sum $temp_j^{(i)}$, and combine the results.
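
The split-and-sum scheme above can be sketched on a single machine. This is an illustration under assumptions: the names `partial_gradient_sum` and `mapreduce_gradient_step` are hypothetical, and a thread pool stands in for the four machines or cores. A real deployment would more likely use separate processes or a cluster framework.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def partial_gradient_sum(task):
    """Map step: sum of (h_theta(x) - y) * x over one slice (temp^(k))."""
    X_part, y_part, theta = task
    return X_part.T @ (X_part @ theta - y_part)

def mapreduce_gradient_step(X, y, theta, alpha, n_workers=4):
    """One batch gradient descent step with the sum split across workers."""
    m = len(y)
    X_parts = np.array_split(X, n_workers)
    y_parts = np.array_split(y, n_workers)
    tasks = [(Xp, yp, theta) for Xp, yp in zip(X_parts, y_parts)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        temps = list(pool.map(partial_gradient_sum, tasks))  # map
    return theta - alpha * sum(temps) / m                    # reduce

# Hypothetical data with m = 400 examples, as in the notes.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(400), rng.normal(size=(400, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=400)
theta = np.zeros(3)
alpha = 0.1

theta_parallel = mapreduce_gradient_step(X, y, theta, alpha)
theta_single = theta - alpha * X.T @ (X @ theta - y) / len(y)
print(np.allclose(theta_parallel, theta_single))
```

The parallel step matches the single-machine step because the gradient is a plain sum over examples, so the partial sums can be computed independently and added in any order.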