In [9]:
import torch

# Questionnaire

1. How is a greyscale image represented on a computer? How about a color image?

    - A greyscale image is stored as a 2-D array (height × width) in which each element is one pixel's brightness, typically an integer from 0 to 255.
    - A color image adds a third dimension: three such arrays stacked together, one each for the red, green, and blue channels.
1. How are the files and folders in the `MNIST_SAMPLE` dataset structured? Why?
    
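    - `MNIST_SAMPLE` consists of:
        - train: the training set, with one folder of "3" images and one folder of "7" images
        - valid: the validation set, laid out the same way but held out for evaluating the model
        - labels.csv: a label for each image, 0 for a 3 and 1 for a 7
    - Putting each class in its own folder, with separate training and validation folders, is a common layout for machine learning datasets: the folder name alone gives the label.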
1. Explain how the "pixel similarity" approach to classifying digits works.
    
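    - Pixel similarity classifies an image by how close it is, pixel by pixel, to an "ideal" version of each digit. We average all the training 3s into one image and all the 7s into another, measure the distance from a new image to each average (e.g. mean absolute difference or RMSE), and predict the digit whose average is closer.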
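
    A minimal sketch of the approach on made-up 2x2 "images" (the tensors and the helper names `mean_abs_dist` and `is_dark` are illustrative assumptions, not MNIST data): average each class into an "ideal" image, then pick the class whose ideal is closer.

    ```python
    import torch

    # Two tiny fake classes of 2x2 "images" (values invented for illustration).
    dark = torch.tensor([[[0., 0.], [0., 1.]], [[0., 1.], [0., 0.]]])
    bright = torch.tensor([[[9., 8.], [9., 9.]], [[8., 9.], [9., 8.]]])

    # The "ideal" image of each class is the pixel-wise mean over its examples.
    mean_dark, mean_bright = dark.mean(0), bright.mean(0)

    def mean_abs_dist(a, b):
        # Mean absolute pixel difference between two images.
        return (a - b).abs().mean()

    def is_dark(img):
        # Classify by whichever ideal image is closer.
        return bool(mean_abs_dist(img, mean_dark) < mean_abs_dist(img, mean_bright))

    new_img = torch.tensor([[1., 0.], [0., 0.]])
    print(is_dark(new_img))
    ```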
1. What is a list comprehension? Create one now that selects odd numbers from a list and doubles them.

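    - A list comprehension is the Pythonic way to build a new list from an iterable in a single expression, instead of a `for` loop with `append`. It can filter and transform items at the same time.
    ```python
        # Keep the odd numbers, then double each one
        arr = [i*2 for i in range(10) if i%2!=0]
    ```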
1. What is a "rank 3 tensor"?

    - A rank 3 tensor is a tensor with 3 dimensions (axes), so it takes 3 indices to reach a single scalar
    ```python
        t3[0][0][0]
    ```
1. What is the difference between tensor rank and shape? How do you get the rank from the shape?

    - Rank is the number of dimensions (axes) of a tensor
    - Shape is the length of each dimension, so the rank is the length of the shape
    ```python
        r = len(t.shape)
    ```
1. What are RMSE and L1 norm?
    
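    - RMSE is the root mean squared error: $RMSE = \sqrt{\frac{\sum^{N}{(y_{1} - y_{0})^2}}{N}}$
    - L1 norm is the mean absolute difference: $L1 = \frac{\sum^{N}{|y_{1} - y_{0}|}}{N}$
    - Here $y_{1}$ and $y_{0}$ are the two sets of values being compared and $N$ is how many there are. RMSE penalizes large differences more heavily than the L1 norm does.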
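
    Both formulas take a line each in PyTorch. The sketch below uses made-up `preds` and `targets` and cross-checks the results against `F.l1_loss` and `F.mse_loss`.

    ```python
    import torch
    import torch.nn.functional as F

    preds = torch.tensor([2.0, 4.0, 6.0])     # illustrative values
    targets = torch.tensor([1.0, 4.0, 8.0])

    l1 = (preds - targets).abs().mean()            # mean absolute difference
    rmse = ((preds - targets) ** 2).mean().sqrt()  # root mean squared error

    # The built-in losses compute the same quantities.
    l1_builtin = F.l1_loss(preds, targets)
    rmse_builtin = F.mse_loss(preds, targets).sqrt()
    ```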
1. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?

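    - Vectorization and broadcasting: express the calculation as operations on whole tensors or arrays, which run in optimized C (or on the GPU) instead of in the Python interpreter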
1. Create a 3x3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom right 4 numbers.
1. What is broadcasting?

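    - Broadcasting lets two tensors of different shapes (for example, one with a smaller rank) be combined in an element-wise operation, by treating the smaller one as if it were expanded to match the larger one, without copying its data. The rule is as follows:
        - Compare the shapes dimension by dimension, starting from the trailing (last) dimension
        - A dimension that is missing from the lower-rank tensor counts as 1
        - Two dimensions are compatible only if:
            - One of them is 1, or
            - They're equal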
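
    A small illustration of the rule, with shapes chosen only for demonstration: a `(3,)` row and a `(2, 1)` column are each stretched to match a `(2, 3)` matrix.

    ```python
    import torch

    m = torch.arange(6).reshape(2, 3)    # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
    row = torch.tensor([10, 20, 30])     # shape (3,): treated as (1, 3), repeated for each row
    col = torch.tensor([[100], [200]])   # shape (2, 1): repeated across each column

    added_row = m + row
    added_col = m + col

    # Shapes (2, 3) and (2,) are not compatible: trailing dimensions 3 and 2 differ.
    try:
        m + torch.tensor([1, 2])
        compatible = True
    except RuntimeError:
        compatible = False
    ```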
1. Are metrics generally calculated using the training set, or the validation set? Why?
    
    - Metrics should be calculated on the validation set, since measuring performance on unseen data is its purpose. The model may have overfitted the training set, so a metric computed there would overstate how well the model really does.
1. What is SGD?

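    - SGD (Stochastic Gradient Descent) is an optimization method that iteratively adjusts the parameters to reduce the loss, aiming for a minimum of the loss function (the global one if possible). Each update uses the gradient computed on a batch of the data rather than on the whole dataset, which would be plain Gradient Descent.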
1. Why does SGD use mini batches?

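    - Because a dataset is often so large that computing the gradient over all of it for every update would be very costly and slow. A mini-batch gives a good enough estimate of the gradient at a fraction of the cost, and batched computation makes good use of a GPU.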
1. What are the 7 steps in SGD for machine learning?
    
    - Initialize the weights
    - For each observation, use the weights to make a prediction
    - Based on the predictions, calculate how good the model is (its loss)
    - Calculate the gradient of the loss with respect to each weight
    - Update (step) all the weights using the gradient
    - Go back to step 2 and repeat
    - Stop when the training process ends (e.g. the model is good enough or time runs out)
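
    The seven steps can be sketched on a toy problem. Everything here is an assumption made for illustration: the data follows `y = 3 * x`, the model is a single weight `w`, and the loop stops after a fixed number of steps.

    ```python
    import torch

    torch.manual_seed(0)
    x = torch.linspace(-1, 1, 20)
    y = 3 * x                                 # made-up target: the true weight is 3

    w = torch.randn(1, requires_grad=True)    # 1. initialize
    lr = 0.5
    for step in range(100):                   # 6. repeat; 7. stop after 100 steps
        preds = w * x                         # 2. predict
        loss = ((preds - y) ** 2).mean()      # 3. loss (mean squared error)
        loss.backward()                       # 4. gradient of the loss w.r.t. w
        with torch.no_grad():
            w -= lr * w.grad                  # 5. step the weight
        w.grad.zero_()                        # clear the gradient for the next step
    ```

    After the loop, `w` should be very close to 3.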
1. How do we initialize the weights in a model?

    - Naive approach: random values
    - Better approaches scale the random values by the layer size: 
        - Xavier (for tanh): ![Xavier](https://miro.medium.com/max/1400/1*QIzXjH8uefVbcaycsjfdmw.png)
        - Kaiming (a==0 for ReLU): ![Kaiming](https://miro.medium.com/max/1032/0*DwUan_QhBFIKHFfy.png)
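
    Both schemes are available in `torch.nn.init`. The sketch below applies them to an arbitrary `nn.Linear(784, 30)` layer (the layer size is an assumption for illustration) and records the resulting standard deviations, which should be near $\sqrt{2/(fan_{in}+fan_{out})}$ for Xavier and $\sqrt{2/fan_{in}}$ for Kaiming with `a=0`.

    ```python
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    layer = nn.Linear(784, 30)     # fan_in = 784, fan_out = 30

    # Xavier (Glorot) initialization, commonly paired with tanh.
    nn.init.xavier_uniform_(layer.weight)
    xavier_std = layer.weight.std().item()

    # Kaiming (He) initialization with a=0, commonly paired with ReLU.
    nn.init.kaiming_normal_(layer.weight, a=0, nonlinearity='relu')
    kaiming_std = layer.weight.std().item()
    ```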
1. What is "loss"?

    - Loss is a function that measures how well our model is doing, based on the difference between the model's predictions and the targets. By convention, a lower loss means a better model.
1. Why can't we always use a high learning rate?
    
    - Because each update is the gradient multiplied by the learning rate. If that step is too large, we overshoot the minimum: the loss can get worse and diverge, or the weights can ping-pong around the minimum without ever settling.
    ![lr1](./assets/lr_high_1.png)
    ![lr2](./assets/lr_high_2.png)
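
    Both failure modes show up even on the simplest possible loss. The sketch below (a hypothetical helper `descend`, minimizing $f(w) = w^2$) compares a small, a borderline, and a large learning rate.

    ```python
    def descend(lr, steps=20, w=1.0):
        # Gradient descent on f(w) = w**2, whose gradient is 2*w.
        for _ in range(steps):
            w -= lr * 2 * w
        return w

    small = descend(lr=0.1)    # shrinks steadily toward the minimum at 0
    bouncy = descend(lr=1.0)   # jumps from w to -w on every step: ping-pong
    large = descend(lr=1.5)    # every step overshoots further: divergence
    ```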
1. What is a "gradient"?

    - The gradient tells us how much we have to change each weight to make our model better.
1. Do you need to know how to calculate gradients yourself?

    - Not necessarily: a framework such as PyTorch calculates them for us automatically
1. Why can't we use accuracy as a loss function?

    - Because accuracy only changes when a prediction flips from wrong to right (or vice versa). A small change in the weights usually leaves it unchanged, so its gradient is 0 almost everywhere and the model can't learn from it.
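
    A tiny numeric illustration, using a made-up one-weight model `sigmoid(w * x)` and four invented data points: nudging `w` leaves the accuracy untouched, while a loss in the spirit of the chapter's `mnist_loss` still improves.

    ```python
    import math

    xs = [-2.0, -1.0, 1.0, 2.0]   # invented inputs
    ys = [0, 0, 1, 1]             # invented targets

    def sigmoid(z):
        return 1 / (1 + math.exp(-z))

    def accuracy(w):
        # Fraction of points whose thresholded prediction matches the target.
        hits = sum((sigmoid(w * x) > 0.5) == bool(y) for x, y in zip(xs, ys))
        return hits / len(xs)

    def loss(w):
        # Mean distance between each prediction and its target.
        return sum(abs(sigmoid(w * x) - y) for x, y in zip(xs, ys)) / len(xs)

    same_accuracy = accuracy(1.0) == accuracy(1.01)   # no signal to learn from
    better_loss = loss(1.01) < loss(1.0)              # small but usable improvement
    ```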
1. Draw the sigmoid function. What is special about its shape?

    ![sigmoid](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png)
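
    What is special about the shape: it is a smooth S-curve that always increases and squashes any input into the range (0, 1), with 0 mapped to 0.5. A quick check with `torch.sigmoid` over an arbitrary range of inputs:

    ```python
    import torch

    x = torch.linspace(-6, 6, 13)   # the integers -6 .. 6
    y = torch.sigmoid(x)            # 1 / (1 + exp(-x))

    bounded = bool(((y > 0) & (y < 1)).all())    # every output lies in (0, 1)
    increasing = bool((y[1:] > y[:-1]).all())    # outputs only go up
    midpoint = y[6].item()                       # value at x = 0
    ```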
1. What is the difference between loss and metric?
    
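    - A **metric** (e.g. **accuracy**) is what we, as humans, use to judge the model. Accuracy is constant almost everywhere (except at the threshold, 0.5), so its derivative is zero almost everywhere.
    - A **loss function** is what drives the automated learning: when our weights result in slightly better predictions, it gives us a slightly better loss, so it has a useful gradient.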
1. What is the function to calculate new weights using a learning rate?

    $$w = w - lr*grad(w)$$
1. What does the `DataLoader` class do?

    - `DataLoader` wraps a dataset and serves it in batches according to options such as `batch_size` and `shuffle`, so we can iterate over the data with a simple `for` loop.
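
    A short sketch with PyTorch's `torch.utils.data.DataLoader` and a made-up dataset of ten `(x, y)` pairs: with `batch_size=4` it yields batches of 4, 4, and 2 items.

    ```python
    import torch
    from torch.utils.data import DataLoader

    # Any indexable collection of (x, y) pairs can act as a dataset; this one is invented.
    dataset = [(torch.tensor([float(i)]), i % 2) for i in range(10)]

    dl = DataLoader(dataset, batch_size=4, shuffle=True)
    batches = list(dl)
    xb, yb = batches[0]   # inputs and labels are collated into tensors
    ```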
1. Write pseudo-code showing the basic steps taken each epoch for SGD.

    ```
        for each epoch:
            for x, y in dataloader:
                preds = model(x)
                loss = loss_func(preds, y)
                grads = grad_func(loss)
                w -= lr * grads
    ```
1. Create a function which, if passed two arguments `[1,2,3,4]` and `'abcd'`, returns `[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]`. What is special about that output data structure?

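    ```python
        def func(a1, a2):
            # zip pairs items by position; list() materializes the lazy iterator
            return list(zip(a1, a2))
    ```
    - The output is a list of tuples, each pairing an input with its label. This is the structure of a PyTorch `Dataset`: indexing it returns an `(x, y)` tuple.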
1. What does `view` do in PyTorch?
    
    - It returns a new tensor with the specified shape that shares the same underlying memory as the tensor it was called on, so no data is copied.
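
    The shared memory is easy to observe: in this small sketch, writing through the view changes the original tensor too.

    ```python
    import torch

    t = torch.arange(6)    # shape (6,)
    v = t.view(2, 3)       # same data, seen as shape (2, 3)
    v[0, 0] = 100          # write through the view

    shares_memory = v.data_ptr() == t.data_ptr()
    ```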
1. What are the "bias" parameters in a neural network? Why do we need them?
    
    - The bias is a learnable value added to a neuron's weighted sum of inputs ($y = w \cdot x + b$), so that the weights alone do not decide everything about that activation. Without it, the output would always be 0 whenever the inputs are 0.
    - e.g. a movie is not rated "good" only because of its features (length, genre, ...); the audience may simply hate that movie for some other reason.
1. What does the `@` operator do in python?
    
    - Matrix multiplication
1. What does the `backward` method do?
    
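    - It calculates the gradient of the tensor it is called on (usually the loss) with respect to every tensor with `requires_grad=True` that was involved in computing it, and stores each result in that tensor's `.grad` attribute.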
1. Why do we have to zero the gradients?
    
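    - Because PyTorch accumulates gradients: each `backward` call adds to the values already stored in `.grad`, so gradients left over from the previous batch would be mixed into the next update.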
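
    A minimal demonstration of both `backward` and the accumulation, using a single made-up weight `w = 3` and the function $w^2$ (gradient $2w = 6$):

    ```python
    import torch

    w = torch.tensor(3.0, requires_grad=True)

    (w ** 2).backward()          # stores d(w^2)/dw = 6 in w.grad
    first = w.grad.item()

    (w ** 2).backward()          # a second call adds to it: 6 + 6
    accumulated = w.grad.item()

    w.grad.zero_()               # reset before the next backward pass
    (w ** 2).backward()
    after_zero = w.grad.item()
    ```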
1. What information do we have to pass to `Learner`?

    - DataLoaders, model, optimizer function, loss function, and optionally metrics
1. Show python or pseudo-code for the basic steps of a training loop.

    ```
        for each epoch:
            for x, y in dataloader:
                preds = model(x)
                loss = loss_func(preds, y)
                grads = grad_func(loss)
                w -= lr * grads
    ```    
1. What is "ReLU"? Draw a plot of it for values from `-2` to `+2`.

    - A nonlinear function that replaces every negative value with 0 and leaves positive values unchanged: 
    $$ReLU(x) = max(0, x)$$
1. What is an "activation function"?

    - An activation function is a nonlinear function applied to the output of a layer (e.g. ReLU or sigmoid). It is what lets a stack of linear layers represent more than a single linear function.
1. What's the difference between `F.relu` and `nn.ReLU`?

    - `F.relu` is a plain function that is called directly on a tensor
    - `nn.ReLU` is a module (a class) that must be instantiated first, e.g. for use in `nn.Sequential`
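
    The two forms compute the same values. The sketch below evaluates both on the range -2 to +2 and then uses the module form as a layer (the layer sizes are arbitrary, chosen for illustration).

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x = torch.linspace(-2, 2, 5)     # -2, -1, 0, 1, 2

    as_function = F.relu(x)          # called directly on a tensor
    as_module = nn.ReLU()(x)         # instantiated first, then called

    # Only the module form can be listed as a layer in nn.Sequential.
    model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 1))
    out = model(x)
    ```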
1. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?

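    - In practice, experiments with deep models show that stacking several layers, each followed by a nonlinearity, performs much better: a deeper model can reach the same accuracy with fewer parameters than a single very wide layer, and it is easier to train.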

9. Create a 3x3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom right 4 numbers.

In [10]:
array = torch.arange(1,10).reshape((3,3))
array = array * 2      # double every element (pow_(2) would square them instead)
array[1:, 1:]          # bottom-right 2x2 block

tensor([[10, 12],
        [16, 18]])

# Further research

1. Create your own implementation of Learner from scratch, based on the training loop shown in this chapter.

In [11]:
class MLearner:
    def __init__(self, dataloader, model, opt_func, loss_func, metrics):
        self.dl = dataloader
        self.mdl = model
        self.opt_func = opt_func
        self.opt = None
        self.loss_func = loss_func
        # Accept either a single metric function or a list of them
        self.metrics = [metrics] if callable(metrics) else list(metrics)
        
    def _calc_grad(self, x, y):
        preds = self.mdl(x)
        loss = self.loss_func(preds, y)
        loss.backward()
        
    def _train_epoch(self):
        for x, y in self.dl.train:            
            self._calc_grad(x, y)
            self.opt.step()
            self.opt.zero_grad()
    
    def _validate_epoch(self):
        res = {}
        with torch.no_grad():  # evaluation only: don't track gradients
            for metric in self.metrics:
                accs = [metric(self.mdl(xb), yb) for xb, yb in self.dl.valid]
                res[metric.__name__] = round(torch.stack(accs).mean().item(), 4)
        return res
    
    def fit(self, epochs, lr):
        self.opt = self.opt_func(self.mdl.parameters(), lr)
        for i in range(epochs):
            self._train_epoch()
            print(self._validate_epoch())

In [207]:
learn = MLearner(dls, nn.Linear(28*28,1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)

True
True


In [208]:
learn.fit(10, lr=lr)

{'batch_accuracy': 0.4932}
{'batch_accuracy': 0.9101}
{'batch_accuracy': 0.8145}
{'batch_accuracy': 0.9067}
{'batch_accuracy': 0.9316}
{'batch_accuracy': 0.9438}
{'batch_accuracy': 0.9555}
{'batch_accuracy': 0.9624}
{'batch_accuracy': 0.9648}
{'batch_accuracy': 0.9668}
