# Minibatch Stochastic Gradient Descent

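## Drawbacks

### (Naive) Gradient Descent
* Gradient descent is not particularly _data efficient_ when the data is highly similar: every update processes the full dataset, even though near-duplicate examples add little new information to the gradient.

### Stochastic Gradient Descent
* SGD is not particularly _computationally efficient_: processing one example at a time prevents CPUs and GPUs from exploiting the full power of vectorization.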

This suggests that there might be something in between, and in fact minibatch SGD takes the best of both worlds!
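
To make the middle ground concrete, here is a minimal sketch, not taken from the original notes, of a single update loop in which only `batch_size` changes between gradient descent, SGD, and minibatch SGD. The linear-regression data, the function name `minibatch_sgd`, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size, lr=0.1, epochs=20, seed=0):
    """Fit w in y ≈ X @ w with squared loss, using minibatches of the given size."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            # Gradient of the halved mean squared error over the minibatch only
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w

# Synthetic data (hypothetical): 1000 samples, known weights, small noise
rng = np.random.default_rng(42)
true_w = np.array([2.0, -3.4])
X = rng.normal(size=(1000, 2))
y = X @ true_w + 0.01 * rng.normal(size=1000)

w_gd = minibatch_sgd(X, y, batch_size=1000)  # full-batch gradient descent
w_sgd = minibatch_sgd(X, y, batch_size=1)    # plain SGD, one example per update
w_mb = minibatch_sgd(X, y, batch_size=100)   # minibatch SGD

for name, w in [("GD", w_gd), ("SGD", w_sgd), ("minibatch", w_mb)]:
    print(f"{name}: error {np.linalg.norm(w - true_w):.4f}")
```

With the same number of passes over the data, the full-batch run makes only one update per epoch, while the single-example run makes 1000 small updates per epoch that cannot be vectorized. A moderate `batch_size` sits between the two.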

## Vectorization and Caches

__At the heart of the decision to use minibatches is computational efficiency__. This is most easily understood when considering parallelization across multiple GPUs and multiple servers. In this case we need to send at least one image to each GPU. With 8 GPUs per server and 16 servers, that is 8 × 16 = 128 GPUs, so the minibatch size can be no smaller than 128.

In [1]:
%matplotlib inline
import time
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l

A = torch.zeros(256, 256)
B = torch.randn(256, 256)
C = torch.randn(256, 256)

### Class for benchmarking running time

In [2]:
class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return np.array(self.times).cumsum().tolist()

timer = Timer()

In [3]:
# Compute A = BC one element at a time
timer.start()
for i in range(256):
    for j in range(256):
        A[i, j] = torch.dot(B[i, :], C[:, j])
timer.stop()

0.6620998382568359

In [4]:
# Compute A = BC one column at a time
timer.start()
for j in range(256):
    A[:, j] = torch.mv(B, C[:, j])
timer.stop()

0.010849952697753906

In [5]:
# Compute A = BC in one go
timer.start()
A = torch.mm(B, C)
timer.stop()

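# Multiplying two 256 x 256 matrices takes about 2 * 256**3 ≈ 0.03 billion
# floating-point operations, so 0.03 / seconds gives Gigaflops
gigaflops = [0.03 / i for i in timer.times]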
print(f'performance in Gigaflops: element {gigaflops[0]:.3f}, '
      f'column {gigaflops[1]:.3f}, full {gigaflops[2]:.3f}')

performance in Gigaflops: element 0.045, column 2.765, full 3.848
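
The constant `0.03` in the cell above is the approximate amount of work, in billions of floating-point operations, needed for one 256 × 256 matrix product. The following sketch spells out that count and reapplies it to the running times printed earlier. The helper name `matmul_gflop` is an assumption introduced here, not part of the original notebook.

```python
def matmul_gflop(n, m, k):
    """Approximate work, in billions of floating-point operations,
    for an (n x m) @ (m x k) matrix product.

    Each of the n * k output entries needs m multiplications and m - 1
    additions, which is commonly rounded to 2 * m operations.
    """
    return 2 * n * m * k / 1e9

work = matmul_gflop(256, 256, 256)  # roughly 0.034, rounded to 0.03 above
print(f"work per product: {work:.4f} billion operations")

# Running times reported by the cells above, in seconds
times = {"element": 0.6620998382568359, "column": 0.010849952697753906}
for name, seconds in times.items():
    print(f"{name}: {0.03 / seconds:.3f} Gigaflops")
```

Because the amount of work is identical in all three variants, the differences in Gigaflops come entirely from how well each variant uses vectorization and the memory hierarchy.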


## Papers to read over the weekend

* [What Every Programmer Should Know About Memory by Ulrich Drepper](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf)
* [What Every Programmer Should Know About Memory MADE SIMPLE](https://lwn.net/Articles/250967/)
* [What Every Computer Scientist Should Know About Floating-Point Arithmetic PDF](https://docs.oracle.com/cd/E19957-01/800-7895/800-7895.pdf)
    - [What Every Programmer Should Know About Floating-Point Arithmetic or Why don't my numbers add up](https://floating-point-gui.de)
* [Cache Hierarchy](https://en.wikipedia.org/wiki/Cache_hierarchy)

## Minibatches

In [6]:
timer.start()
for j in range(0, 256, 64):
    A[:, j:j+64] = torch.mm(B, C[:, j:j+64])
timer.stop()
# Use the most recent timing so the result does not depend on how many cells ran before
print(f'performance in Gigaflops: block {0.03 / timer.times[-1]:.3f}')

performance in Gigaflops: block 6.177
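
The loop above can be wrapped in a small helper to check that splitting the product into column blocks changes only how the work is scheduled, not the result. This is a sketch under assumptions: the name `blocked_mm` and the tested block sizes are illustrative and do not come from the original notebook.

```python
import torch

def blocked_mm(B, C, block_size):
    """Compute B @ C one block of columns at a time."""
    A = torch.zeros(B.shape[0], C.shape[1])
    for j in range(0, C.shape[1], block_size):
        # Slicing past the end is safe, so the last block may be narrower
        A[:, j:j + block_size] = torch.mm(B, C[:, j:j + block_size])
    return A

torch.manual_seed(0)
B = torch.randn(256, 256)
C = torch.randn(256, 256)

full = torch.mm(B, C)
for block_size in (1, 64, 100, 256):
    # Results agree up to floating-point rounding for every block size
    assert torch.allclose(blocked_mm(B, C, block_size), full, atol=1e-3)
```

A block size of 1 corresponds to the column-at-a-time variant and a block size of 256 to the full product, so the minibatch case is the same computation at an intermediate granularity.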
