# 10-714 Homework 2

In this homework, you will be implementing a neural network library in the needle framework. Reminder: __you must save a copy in drive__.

In [None]:
# Code to set up the assignment
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/
!mkdir -p 10714
%cd /content/drive/MyDrive/10714
!git clone https://github.com/dlsys10714/hw2.git
%cd /content/drive/MyDrive/10714/hw2

!pip3 install --upgrade --no-deps git+https://github.com/dlsys10714/mugrade.git

This homework builds off of Homework 1. First, in your Homework 2 directory, go to the files `autograd.py`, `ops.py` and `numpy_backend.py` in the `python/needle` directory, and fill in the code between `### BEGIN YOUR SOLUTION` and `### END YOUR SOLUTION` with your solutions from Homework 1. 

__A note__: When copying over your solutions from the previous `numpy_backend.py`, append something like `.astype(inputs[0].dtype)` to your methods' return statements so that they don't change the dtype (in fact, you can use this exact snippet in all cases). This is needed because functions like `np.divide` may change their output type in order to present a more accurate answer, which can cause type conflicts in our current version of needle. Forcibly casting like this is not the optimal solution, and we will probably revisit it later, but it is an acceptable workaround for now.
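
To see why the cast matters, here is a small NumPy-only illustration (it does not use needle). Dividing a `float32` array by a `float64` array promotes the result to `float64`, and `.astype(...)` restores the dtype of the first input. The helper name `divide_keep_dtype` is made up for this sketch.

```python
import numpy as np

def divide_keep_dtype(a, b):
    """Hypothetical helper: divide, then cast back to the dtype of the first input."""
    return np.divide(a, b).astype(a.dtype)

a = np.ones(3, dtype=np.float32)
b = np.full(3, 2.0, dtype=np.float64)

promoted = np.divide(a, b)          # NumPy promotes float32 / float64 to float64
restored = divide_keep_dtype(a, b)  # cast back to float32

print(promoted.dtype, restored.dtype)
```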

In [1]:
import sys
sys.path.append('./python')
sys.path.append('./apps')

## Question 0 [5 points]

Before you begin implementing your Needle neural network library, you first have to implement a couple more ops. These are crucial ops for the loss function and optimizers that you will quickly find yourself needing below.

### PowerScalar

We will often need the elementwise power of a tensor raised to a scalar, for example, to compute the variance in BatchNorm. You will also need to implement the `__pow__` method in `Tensor` for this op.
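
As a rough guide to the math involved, the sketch below writes the forward and backward computations on plain NumPy arrays. The function names are assumptions for illustration only; your needle op will use the framework's own class structure.

```python
import numpy as np

def power_scalar_forward(x, n):
    # Elementwise x ** n for a scalar exponent n.
    return np.power(x, n)

def power_scalar_backward(out_grad, x, n):
    # d/dx x^n = n * x^(n-1), multiplied by the incoming gradient (chain rule).
    return out_grad * n * np.power(x, n - 1)

x = np.array([1.0, 2.0, 3.0])
y = power_scalar_forward(x, 2)
g = power_scalar_backward(np.ones_like(x), x, 2)
```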

In [10]:
!python3 -m pytest -v -k "op_power_scalar"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 95 deselected / 3 selected

tests/test_nn_and_optim.py::test_op_power_scalar_forward_1 PASSED        [ 33%]
tests/test_nn_and_optim.py::test_op_power_scalar_forward_2 PASSED        [ 66%]
tests/test_nn_and_optim.py::test_op_power_scalar_backward_1 PASSED       [100%]

hw2-env/lib/python3.7/site-packages/mugrade/mugrade.py:71
    @pytest.mark.hookwrapper



In [12]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "op_power_scalar"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting op_power_scalar...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
.



### LogSoftmax

LogSoftmax is useful for implementing the softmax loss, since the loss's forward and backward passes can both be written in terms of it. It is simply the log of the softmax function:

$$
\mathsf{LogSoftmax}(x) = \log\left(\frac{\exp(x)}{\sum_i \exp(x_i)}\right)
$$

Consider how you could simplify this expression.
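
One possible simplification, sketched here in plain NumPy rather than needle, uses $\log(\exp(x)/\sum_i \exp(x_i)) = x - \log\sum_i \exp(x_i)$ and subtracts the maximum before exponentiating to avoid overflow. Taking the reduction over the last axis is an assumption of this sketch.

```python
import numpy as np

def log_softmax(x):
    """Log-softmax over the last axis, computed in a numerically stable way."""
    # Shifting by the row maximum leaves the result unchanged but keeps exp() bounded.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

x = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
out = log_softmax(x)
```

Without the shift, `np.exp(1000.0)` overflows to `inf` and the second row would no longer be finite.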

In [46]:
!python3 -m pytest -v -k "op_logsoftmax"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 95 deselected / 3 selected

tests/test_nn_and_optim.py::test_op_logsoftmax_forward_1 PASSED          [ 33%]
tests/test_nn_and_optim.py::test_op_logsoftmax_stable_forward_1 PASSED   [ 66%]
tests/test_nn_and_optim.py::test_op_logsoftmax_backward_1 PASSED         [100%]

hw2-env/lib/python3.7/site-packages/mugrade/mugrade.py:71
    @pytest.mark.hookwrapper

tests/test_nn_and_optim.py::test_op_logsoftmax_stable_forward_1
    exp_x = np.exp(inputs[0])



In [4]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "op_logsoftmax"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting op_logsoftmax...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
.

tests/test_nn_and_optim.py::submit_op_logsoftmax
    exp_x = np.exp(inputs[0])



## Question 1 [5 points]

Before we can implement certain neural network library modules, we need an assortment
of methods to initialize tensors from various distributions.
In this question, you will implement different methods for weight initialization. Specifically, in `python/needle/init.py` implement the functions:
___
### Uniform
`needle.init.uniform(x, low=0.0, high=1.0)`

Fills the input Tensor with values drawn from the uniform distribution $\mathcal{U}(\text{low},\text{high})$.

##### Parameters
- `x` - Tensor
- `low` - lower bound of the uniform distribution
- `high` - upper bound of the uniform distribution
___

### Normal
`needle.init.normal(x, mean=0.0, std=1.0)`

Fills the input Tensor with values drawn from the normal distribution $\mathcal{N}(\text{mean},\text{std}^2)$.

##### Parameters
- `x` - Tensor
- `mean` - mean of the normal distribution
- `std` - standard deviation of the normal distribution
___

### Constant
`needle.init.constant(x, c=0.0)`

Fills the input Tensor with value `c`.

##### Parameters
- `x` - Tensor
- `c` - the value to fill the Tensor with
___

### Ones
`needle.init.ones(x)`

Fills the input Tensor with scalar value 1.

##### Parameters
- `x` - Tensor
___

### Zeros
`needle.init.zeros(x)`

Fills the input Tensor with scalar value 0.

##### Parameters
- `x` - Tensor
___

### Xavier uniform
`needle.init.xavier_uniform(x, gain=1.0)`

Fills the input Tensor with values according to the method described in [Understanding the difficulty of training deep feedforward neural networks](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), using a uniform distribution. The resulting Tensor will have values sampled from $\mathcal{U}(-a, a)$ where 
\begin{equation}
a = \text{gain} \times \sqrt{\frac{6}{\text{fan_in} + \text{fan_out}}}
\end{equation}

##### Parameters
- `x` - Tensor
- `gain` - optional scaling factor
___

### Xavier normal
`needle.init.xavier_normal(x, gain=1.0)`

Fills the input Tensor with values according to the method described in [Understanding the difficulty of training deep feedforward neural networks](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), using a normal distribution. The resulting Tensor will have values sampled from $\mathcal{N}(0, \text{std}^2)$ where 
\begin{equation}
\text{std} = \text{gain} \times \sqrt{\frac{2}{\text{fan_in} + \text{fan_out}}}
\end{equation}

##### Parameters
- `x` - Tensor
- `gain` - optional scaling factor
___
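
The two Xavier formulas above can be checked with a short NumPy sketch. It assumes a 2D weight of shape (`fan_in`, `fan_out`); the helper names are hypothetical and are not part of `needle.init`.

```python
import math
import numpy as np

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    # a = gain * sqrt(6 / (fan_in + fan_out))
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

def xavier_normal_std(fan_in, fan_out, gain=1.0):
    # std = gain * sqrt(2 / (fan_in + fan_out))
    return gain * math.sqrt(2.0 / (fan_in + fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 4, 8

a = xavier_uniform_bound(fan_in, fan_out)
w_uniform = rng.uniform(-a, a, size=(fan_in, fan_out))

std = xavier_normal_std(fan_in, fan_out)
w_normal = rng.normal(0.0, std, size=(fan_in, fan_out))
```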

### Kaiming uniform
`needle.init.kaiming_uniform(x, mode='fan_in', nonlinearity='relu')`

Fills the input Tensor with values according to the method described in [Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification](https://arxiv.org/pdf/1502.01852.pdf), using a uniform distribution. The resulting Tensor will have values sampled from $\mathcal{U}(-\text{bound}, \text{bound})$ where 
\begin{equation}
\text{bound} = \text{gain} \times \sqrt{\frac{3}{\text{fan_mode}}}
\end{equation}

Use the recommended gain value for ReLU: $\text{gain}=\sqrt{2}$.

##### Parameters
- `x` - Tensor
- `mode` - either `fan_in` or `fan_out`. Choosing `fan_in` preserves the magnitude of the variance of the weights in the forward pass. Choosing `fan_out` preserves the magnitudes in the backwards pass. 
- `nonlinearity` - the non-linear function
___

### Kaiming normal
`needle.init.kaiming_normal(x, mode='fan_in', nonlinearity='relu')`

Fills the input Tensor with values according to the method described in [Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification](https://arxiv.org/pdf/1502.01852.pdf), using a normal distribution. The resulting Tensor will have values sampled from $\mathcal{N}(0, \text{std}^2)$ where 
\begin{equation}
\text{std} = \frac{\text{gain}}{\sqrt{\text{fan_mode}}}
\end{equation}

Use the recommended gain value for ReLU: $\text{gain}=\sqrt{2}$.

##### Parameters
- `x` - Tensor
- `mode` - either `fan_in` or `fan_out`. Choosing `fan_in` preserves the magnitude of the variance of the weights in the forward pass. Choosing `fan_out` preserves the magnitudes in the backwards pass. 
- `nonlinearity` - the non-linear function
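
For the Kaiming variants, the only quantities to compute are the bound and the standard deviation. The sketch below evaluates both formulas with the ReLU gain $\sqrt{2}$; the helper names and the example fan sizes are assumptions for illustration.

```python
import math

def kaiming_uniform_bound(fan, gain=math.sqrt(2.0)):
    # bound = gain * sqrt(3 / fan), where fan is fan_in or fan_out depending on mode
    return gain * math.sqrt(3.0 / fan)

def kaiming_normal_std(fan, gain=math.sqrt(2.0)):
    # std = gain / sqrt(fan)
    return gain / math.sqrt(fan)

fan_in, fan_out = 6, 24
bound = kaiming_uniform_bound(fan_in)   # corresponds to mode='fan_in'
std = kaiming_normal_std(fan_out)       # corresponds to mode='fan_out'
```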

In [28]:
!python3 -m pytest -v -k "test_init"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 87 deselected / 11 selected

tests/test_nn_and_optim.py::test_init_uniform_1 PASSED                   [  9%]
tests/test_nn_and_optim.py::test_init_normal_1 PASSED                    [ 18%]
tests/test_nn_and_optim.py::test_init_constant_1 PASSED                  [ 27%]
tests/test_nn_and_optim.py::test_init_ones_1 PASSED                      [ 36%]
tests/test_nn_and_optim.py::test_init_zeros_1 PASSED                     [ 45%]
tests/test_nn_and_optim.py::test_init_kaiming_uniform_1 PASSED           [ 54%]
tests/test_nn_and_optim.py::test_init_kaiming_uniform_2 PASSED           [ 63%]
tests/test_nn_and_optim.py::test_init_kai

In [29]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "init"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting init...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
Grader test 6 passed
.



## Question 2 [30 points]

In this question, you will implement additional modules in `python/needle/nn.py`. Specifically, for the following modules described below, initialize any variables of the module in the constructor, and fill out the `forward` method. 
___

### Linear
`needle.nn.Linear(in_features, out_features, bias=True, device=None, dtype="float32")`

Applies a linear transformation to the incoming data: $y = xA^T + b$. The input shape is $(N, *, H_{in})$ where * means any number of additional dimensions and $H_{in}=\text{in_features}$. The output shape is $(N, *, H_{out})$ where all but the last dimension are the same shape as the input and $H_{out}=\text{out_features}$.

Be careful to explicitly broadcast the bias term to the correct shape -- Needle does not support implicit broadcasting.


##### Parameters
- `in_features` - size of each input sample
- `out_features` - size of each output sample
- `bias` - If set to `False`, the layer will not learn an additive bias.

##### Variables
- `weight` - the learnable weights of shape (`in_features`, `out_features`). The values are initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$ where $k=\frac{1}{\text{in_features}}$.
- `bias` - the learnable bias of shape (`out_features`). If `bias` is `True`, the values are initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$ where $k=\frac{1}{\text{in_features}}$.
___
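
The following NumPy sketch walks through the same computation outside of needle, for a 2D input only. Because the stored `weight` already has shape (`in_features`, `out_features`), the product is written `x @ weight`. The explicit `np.broadcast_to` call is only there to mirror the explicit broadcast that needle requires; all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, batch = 3, 2, 4

# Both parameters are drawn from U(-sqrt(k), sqrt(k)) with k = 1 / in_features.
k = 1.0 / in_features
weight = rng.uniform(-np.sqrt(k), np.sqrt(k), size=(in_features, out_features))
bias = rng.uniform(-np.sqrt(k), np.sqrt(k), size=(out_features,))

x = rng.normal(size=(batch, in_features))

# Broadcast the bias to the full output shape before adding it.
bias_b = np.broadcast_to(bias.reshape(1, out_features), (batch, out_features))
y = x @ weight + bias_b
```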

In [40]:
!python3 -m pytest -v -k "test_nn_linear"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 90 deselected / 8 selected

tests/test_nn_and_optim.py::test_nn_linear_weight_init_1 PASSED          [ 12%]
tests/test_nn_and_optim.py::test_nn_linear_bias_init_1 PASSED            [ 25%]
tests/test_nn_and_optim.py::test_nn_linear_forward_1 PASSED              [ 37%]
tests/test_nn_and_optim.py::test_nn_linear_forward_2 PASSED              [ 50%]
tests/test_nn_and_optim.py::test_nn_linear_forward_3 PASSED              [ 62%]
tests/test_nn_and_optim.py::test_nn_linear_backward_1 PASSED             [ 75%]
tests/test_nn_and_optim.py::test_nn_linear_backward_2 PASSED             [ 87%]
tests/test_nn_and_optim.py::test_nn_linea

In [41]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "nn_linear"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting nn_linear...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
Grader test 6 passed
.



### ReLU
`needle.nn.ReLU(device=None, dtype="float32")`

Applies the rectified linear unit function element-wise:
$\text{ReLU}(x) = \max(0, x)$.

If you have previously implemented ReLU's backwards pass in terms of itself, note that this is numerically unstable and will likely cause problems
down the line.
Instead, consider that we could write the derivative of ReLU as $I\{x>0\}$, where we arbitrarily decide that the derivative at $x=0$ is 0.
(This is a _subdifferentiable_ function.)
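
The indicator form of the derivative can be sketched on NumPy arrays as follows. The function names are assumptions; the key point is that the backward pass uses the original input `x`, not the output of the forward pass.

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0)

def relu_backward(out_grad, x):
    # I{x > 0}: gradient is passed through where x is positive, and is 0 elsewhere
    # (including at x == 0, by the convention chosen above).
    return out_grad * (x > 0)

x = np.array([-1.0, 0.0, 2.0])
y = relu_forward(x)
g = relu_backward(np.ones_like(x), x)
```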

___

In [31]:
!python3 -m pytest -v -k "test_nn_relu"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 96 deselected / 2 selected

tests/test_nn_and_optim.py::test_nn_relu_forward_1 PASSED                [ 50%]
tests/test_nn_and_optim.py::test_nn_relu_backward_1 PASSED               [100%]

hw2-env/lib/python3.7/site-packages/mugrade/mugrade.py:71
    @pytest.mark.hookwrapper



In [33]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "nn_relu"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting nn_relu...
Grader test 1 passed
Grader test 2 passed
.



### Sequential
`needle.nn.Sequential(*modules, device=None, dtype="float32")`

Applies a sequence of modules to the input (in the order that they were passed to the constructor) and returns the output of the last module.
These should be kept in a `.module` property: you should _not_ redefine any magic methods like `__getitem__`, as this may not be compatible with our tests.

##### Parameters
- `*modules` - any number of modules of type `needle.nn.Module`

___
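
The composition order can be illustrated with a toy class that chains plain Python callables. This is a hypothetical stand-in, not needle's `Sequential`; in particular, the attribute name `steps` is arbitrary here, whereas your implementation should use the property name given above.

```python
class ToySequential:
    """Hypothetical stand-in for illustration; not needle's implementation."""

    def __init__(self, *modules):
        self.steps = modules  # arbitrary attribute name for this sketch

    def __call__(self, x):
        # Feed the output of each callable into the next, in constructor order.
        for m in self.steps:
            x = m(x)
        return x

double = lambda x: 2 * x
add_one = lambda x: x + 1

model = ToySequential(double, add_one)
result = model(3)  # double first, then add one
```

Swapping the two arguments changes the result, which is a quick way to confirm the order of application.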

In [42]:
!python3 -m pytest -v -k "test_nn_sequential"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 96 deselected / 2 selected

tests/test_nn_and_optim.py::test_nn_sequential_forward_1 PASSED          [ 50%]
tests/test_nn_and_optim.py::test_nn_sequential_backward_1 PASSED         [100%]

hw2-env/lib/python3.7/site-packages/mugrade/mugrade.py:71
    @pytest.mark.hookwrapper



In [43]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "nn_sequential"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting nn_sequential...
Grader test 1 passed
Grader test 2 passed
.



### SoftmaxLoss
`needle.nn.SoftmaxLoss(device=None, dtype="float32")`

Applies the softmax loss as defined below (and as implemented in Homework 1), taking in as input a Tensor of logits and a Tensor of the true labels (expressed as a list of numbers, *not* one-hot encoded).

Note that you can now use the new `ops.one_hot` function instead of writing this yourself.
**Importantly**, you should implement your SoftmaxLoss in terms of the `LogSoftmaxOp` you implemented in Q0.
Also note that the equation below equals the negative log-softmax of $z$ evaluated at the true class $y$.

\begin{equation}
\ell_\text{softmax}(z,y) = \log \sum_{i=1}^k \exp z_i - z_y
\end{equation}

___
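
To make the connection to log-softmax concrete, here is a NumPy-only sketch that computes the loss through a one-hot mask and compares it with the formula above. Averaging over the batch is an assumption of this sketch, and the function names are illustrative rather than needle's API.

```python
import numpy as np

def log_softmax(z):
    shifted = z - z.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def softmax_loss(z, y):
    """Batch mean of log(sum_i exp z_i) - z_y, written via log-softmax."""
    num_classes = z.shape[1]
    one_hot = np.eye(num_classes)[y]  # integer labels -> one-hot rows
    # The one-hot mask selects the log-softmax entry of the true class.
    return -(one_hot * log_softmax(z)).sum(axis=1).mean()

z = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
y = np.array([0, 1])

loss = softmax_loss(z, y)
direct = np.mean(np.log(np.exp(z).sum(axis=1)) - z[np.arange(2), y])
```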

In [66]:
!python3 -m pytest -v -k "test_nn_softmax_loss"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 94 deselected / 4 selected

tests/test_nn_and_optim.py::test_nn_softmax_loss_forward_1 PASSED        [ 25%]
tests/test_nn_and_optim.py::test_nn_softmax_loss_forward_2 PASSED        [ 50%]
tests/test_nn_and_optim.py::test_nn_softmax_loss_backward_1 PASSED       [ 75%]
tests/test_nn_and_optim.py::test_nn_softmax_loss_backward_2 PASSED       [100%]

hw2-env/lib/python3.7/site-packages/mugrade/mugrade.py:71
    @pytest.mark.hookwrapper



In [67]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "nn_softmax_loss"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting nn_softmax_loss...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
.




### LayerNorm
`needle.nn.LayerNorm(dims, eps=1e-5, device=None, dtype="float32")`

Applies layer normalization over a mini-batch of inputs as described in the paper [Layer Normalization](https://arxiv.org/abs/1607.06450).

\begin{equation}
\hat{z}_{i+1} = \sigma_i(W_i^Tz_i + b_i) \\
z_{i+1} = \frac{\hat{z}_{i+1} - \text{E}[\hat{z}_{i+1}]}{(\text{Var}[\hat{z}_{i+1}]+\epsilon)^{1/2}}
\end{equation}

Note that the affine scaling parameters `weight` and `bias` here are shaped differently than for BatchNorm below.

##### Parameters
- `dims` - input shape
- `eps` - a value added to the denominator for numerical stability.

##### Variables
- `weight` - the learnable weights of shape `dims`, elements initialized to 1.
- `bias` - the learnable bias of shape `dims`, elements initialized to 0.
___
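
A minimal NumPy sketch of the normalization step is shown below. It assumes a 2D input of shape (batch, `dims`), normalizes each example over its last axis, and then applies `weight` and `bias` as an elementwise scale and shift. These choices are assumptions of the sketch, not a statement of how the needle module must be written.

```python
import numpy as np

def layer_norm(x, weight, bias, eps=1e-5):
    # Statistics are computed per example, across that example's features.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # divides by the number of features
    x_hat = (x - mean) / np.sqrt(var + eps)
    return weight * x_hat + bias

x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]])
dims = x.shape[-1]
out = layer_norm(x, np.ones(dims), np.zeros(dims))
```

With unit weight and zero bias, every row of `out` has mean close to 0 and standard deviation close to 1, regardless of the scale of the input row.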

In [22]:
!python3 -m pytest -v -k "test_nn_layernorm"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 90 deselected / 8 selected

tests/test_nn_and_optim.py::test_nn_layernorm_forward_1 PASSED           [ 12%]
tests/test_nn_and_optim.py::test_nn_layernorm_forward_2 PASSED           [ 25%]
tests/test_nn_and_optim.py::test_nn_layernorm_forward_3 PASSED           [ 37%]
tests/test_nn_and_optim.py::test_nn_layernorm_forward_4 PASSED           [ 50%]
tests/test_nn_and_optim.py::test_nn_layernorm_backward_1 PASSED          [ 62%]
tests/test_nn_and_optim.py::test_nn_layernorm_backward_2 PASSED          [ 75%]
tests/test_nn_and_optim.py::test_nn_layernorm_backward_3 PASSED          [ 87%]
tests/test_nn_and_optim.py::test_nn_layer

In [34]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "nn_layernorm"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting nn_layernorm...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 6 passed
Grader test 8 passed
.



### BatchNorm
`needle.nn.BatchNorm(dim, eps=1e-5, momentum=0.1, device=None, dtype="float32")`

Applies batch normalization over a mini-batch of inputs as described in the paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167). Unlike layer normalization, which normalizes across the features of each example, batch normalization normalizes each feature across the mini-batch. The module also maintains a running average of the mean and variance of every feature at each layer, $\hat{\mu}_{i+1}, \hat{\sigma}^2_{i+1}$, and at test time normalizes by these quantities:

\begin{equation}
(z_{i+1})_j = \frac{(\hat{z}_{i+1})_j - (\hat{\mu}_{i+1})_j}{((\hat{\sigma}^2_{i+1})_j+\epsilon)^{1/2}}
\end{equation}

You will additionally have to write the function `_child_modules` in `nn.py` in order to make
it possible to set a flag on a module and its children which indicates whether or not training is in progress;
that is, all modules have a property `training` which should be false after calling `model.eval()` on the module or one of its parents, and vice versa for `model.train()`. (There is a test for this in this section.)
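
The flag propagation described here can be sketched with a toy class hierarchy. Everything below is hypothetical: `ToyModule` and its `_children` method are stand-ins written for this example, and needle's real `Module` and `_child_modules` may be organized differently.

```python
class ToyModule:
    """Hypothetical minimal module; needle's real Module class differs."""

    def __init__(self):
        self.training = True

    def _children(self):
        # Recursively collect ToyModule instances stored as attributes,
        # including those nested inside lists or tuples.
        found = []

        def visit(value):
            if isinstance(value, ToyModule):
                found.append(value)
                found.extend(value._children())
            elif isinstance(value, (list, tuple)):
                for v in value:
                    visit(v)

        for value in self.__dict__.values():
            visit(value)
        return found

    def eval(self):
        self.training = False
        for m in self._children():
            m.training = False

    def train(self):
        self.training = True
        for m in self._children():
            m.training = True

class Block(ToyModule):
    def __init__(self):
        super().__init__()
        self.layers = [ToyModule(), ToyModule()]

class Net(ToyModule):
    def __init__(self):
        super().__init__()
        self.block = Block()

net = Net()
net.eval()
flags_after_eval = [m.training for m in [net] + net._children()]
net.train()
flags_after_train = [m.training for m in [net] + net._children()]
```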

BatchNorm uses the running estimates of mean and variance instead of batch statistics at test time, i.e.,
after `model.eval()` has been called and the BatchNorm layer's `training` flag is false.

**Important:** A small detail here is that our implementation of BatchNorm uses the *biased* estimate
of the variance during training, but computes a running estimate of the *unbiased* version. The biased estimate
divides by $N$, while the unbiased estimate divides by $N-1$.

To compute the running estimates, you can use the equation $$\hat{x}_{\text{new}} = (1 - m) \hat{x}_{\text{old}} + m x_{\text{observed}},$$
where $m$ is momentum.

##### Parameters
- `dim` - input dim
- `eps` - a value added to the denominator for numerical stability.
- `momentum` - the value used for the running mean and running variance computation.

##### Variables
- `weight` - the learnable weights of size `dim`, elements initialized to 1.
- `bias` - the learnable bias of size `dim`, elements initialized to 0.
- `running_mean` - the running mean used at evaluation time, elements initialized to 0.
- `running_var` - the running (unbiased) variance used at evaluation time, elements initialized to 1. 

___
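
The interplay between batch statistics, running estimates, and the biased and unbiased variance can be sketched in NumPy as follows. `ToyBatchNorm` is a made-up class that handles 2D inputs of shape (batch, `dim`) only, and is meant as an illustration of the rules above rather than a reference implementation.

```python
import numpy as np

class ToyBatchNorm:
    """NumPy-only sketch for inputs of shape (batch, dim); not needle's class."""

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps, self.momentum = eps, momentum
        self.weight, self.bias = np.ones(dim), np.zeros(dim)
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.training = True

    def __call__(self, x):
        if self.training:
            n = x.shape[0]
            mean = x.mean(axis=0)
            var = x.var(axis=0)            # biased estimate: divides by N
            unbiased = var * n / (n - 1)   # unbiased estimate: divides by N - 1
            m = self.momentum
            # Running estimates follow x_new = (1 - m) * x_old + m * x_observed.
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * unbiased
        else:
            # At evaluation time, use the stored running estimates instead.
            mean, var = self.running_mean, self.running_var
        return self.weight * (x - mean) / np.sqrt(var + self.eps) + self.bias

bn = ToyBatchNorm(dim=2)
x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
train_out = bn(x)       # normalizes with batch statistics, updates running stats
bn.training = False
eval_out = bn(x)        # normalizes with running statistics
```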

In [35]:
!python3 -m pytest -v -k "test_nn_batchnorm"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 77 deselected / 21 selected

tests/test_nn_and_optim.py::test_nn_batchnorm_check_model_eval_switches_training_flag_1 PASSED [  4%]
tests/test_nn_and_optim.py::test_nn_batchnorm_forward_1 PASSED           [  9%]
tests/test_nn_and_optim.py::test_nn_batchnorm_forward_2 PASSED           [ 14%]
tests/test_nn_and_optim.py::test_nn_batchnorm_forward_3 PASSED           [ 19%]
tests/test_nn_and_optim.py::test_nn_batchnorm_forward_4 PASSED           [ 23%]
tests/test_nn_and_optim.py::test_nn_batchnorm_forward_affine_1 PASSED    [ 28%]
tests/test_nn_and_optim.py::test_nn_batchnorm_forward_affine_2 PASSED    [ 33%]
tests/test_nn_and_o

In [36]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "nn_batchnorm"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting nn_batchnorm...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
Grader test 6 passed
Grader test 7 passed
Grader test 8 passed
Grader test 10 passed
Grader test 12 passed
Grader test 13 passed
Grader test 14 passed
Grader test 15 passed
Grader test 16 passed

### Dropout
`needle.nn.Dropout(drop_prob, device=None, dtype="float32")`

During training, randomly zeroes some of the elements of the input tensor with probability `drop_prob` using samples from a Bernoulli distribution. This has proven to be an effective technique for regularization and preventing the co-adaptation of neurons as described in the paper [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580). During evaluation the module simply computes an identity function. 

\begin{equation}
\hat{z}_{i+1} = \sigma_i (W_i^T z_i + b_i) \\
(z_{i+1})_j = 
    \begin{cases}
    (\hat{z}_{i+1})_j /(1-p) & \text{with probability } 1-p \\
    0 & \text{with probability } p \\
    \end{cases}
\end{equation}

**Important**: If the Dropout module has the flag `training=False`, you shouldn't drop out any elements. That is, dropout applies during training only, not during evaluation.

##### Parameters
- `drop_prob` - probability of an element to be zeroed. 

___
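
The case analysis above can be sketched with a NumPy mask. The function signature and the use of `np.random.default_rng` are assumptions made for this example; the needle module will generate and apply its mask through needle's own ops.

```python
import numpy as np

def dropout(x, drop_prob, training, rng):
    if not training:
        return x  # identity at evaluation time
    # keep is True with probability 1 - drop_prob.
    keep = rng.random(x.shape) >= drop_prob
    # Surviving elements are scaled by 1 / (1 - p) so the expected value is unchanged.
    return x * keep / (1.0 - drop_prob)

rng = np.random.default_rng(0)
x = np.ones((1000, 10))
train_out = dropout(x, 0.25, training=True, rng=rng)
eval_out = dropout(x, 0.25, training=False, rng=rng)
```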

In [41]:
!python3 -m pytest -v -k "test_nn_dropout"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 96 deselected / 2 selected

tests/test_nn_and_optim.py::test_nn_dropout_forward_1 PASSED             [ 50%]
tests/test_nn_and_optim.py::test_nn_dropout_backward_1 PASSED            [100%]

hw2-env/lib/python3.7/site-packages/mugrade/mugrade.py:71
    @pytest.mark.hookwrapper



In [42]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "nn_dropout"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting nn_dropout...
Grader test 1 passed


### Residual
`needle.nn.Residual(fn: Module, device=None, dtype="float32")`

Applies a residual or skip connection given module $\mathcal{F}$ and input Tensor $x$, returning $\mathcal{F}(x) + x$.
##### Parameters
- `fn` - module of type `needle.nn.Module`
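
As a quick illustration of $\mathcal{F}(x) + x$, the toy class below wraps any Python callable. It is a hypothetical sketch that works on plain numbers, not needle's `Residual`.

```python
class ToyResidual:
    """Hypothetical stand-in for illustration; not needle's implementation."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        # Add the input back onto the wrapped function's output.
        return self.fn(x) + x

square = lambda x: x * x
res = ToyResidual(square)
out = res(3.0)  # square(3.0) + 3.0
```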

In [43]:
!python3 -m pytest -v -k "test_nn_residual"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 96 deselected / 2 selected

tests/test_nn_and_optim.py::test_nn_residual_forward_1 PASSED            [ 50%]
tests/test_nn_and_optim.py::test_nn_residual_backward_1 PASSED           [100%]

hw2-env/lib/python3.7/site-packages/mugrade/mugrade.py:71
    @pytest.mark.hookwrapper



In [44]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "nn_residual"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting nn_residual...
Grader test 1 passed
[needle.Tensor(22.030226), needle.Tensor([[ 1.9047918  -0.64011514  2.8450792 ]
 [ 4.3380623   0.31210983  0.32276547]
 [ 0.5470216   3.0703483   0.32774067]
 [ 5.52239     3.0602338   0.4197983 ]]), needle.Tensor([[ 0.20479175 -0.7901151  -0.3549208 ]
 [-0.01193756 -0.5378902  -0.52723455]
 [ 0.49702162 -0.87965184  0.12774065]
 [ 0.57239026 -1.0397661  -0.13020168]]), needle.Tensor([ 0.09241457 -0.4535496   0.10754485]), needle.Tensor([[ 0.11237718 -0.33656555 -0.46246564]
 [-0.10435212 -0.08434059 -0.6347794 ]
 [ 0.40460706 -0.42610225  0.02019579]
 [ 0.47997567 -0.5862165  -0.23774654]]), needle.Tensor([[-0.19575776  0.02475643 -0.06805498]
 [-0.20877086  0.11185289 -0.36050615]
 [-0.20785536 -0.13363816 -0.04

## Question 3 [30 points]

Implement the `step` function of the following optimizers.
Make sure that your optimizers _don't_ modify the gradients of tensors in-place.

We have included some tests to ensure that you are not consuming excessive memory, which can happen if you are
not using `.data` or `.detach()` in the right places, thus building an increasingly large computational graph
(not just in the optimizers, but in the previous modules as well).
You can ignore these tests, which include the string `memory_check` in their names, at your own discretion.

___

### SGD
`needle.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=0.0)`

Implements stochastic gradient descent (optionally with momentum, shown as $\beta$ below). 

\begin{equation}
\begin{split}
    u_{t+1} &= \beta u_t + \nabla_\theta f(\theta_t) \\
    \theta_{t+1} &= \theta_t - \alpha u_{t+1}
\end{split}
\end{equation}

##### Parameters
- `params` - iterable of parameters of type `needle.nn.Parameter` to optimize
- `lr` (*float*) - learning rate
- `momentum` (*float*) - momentum factor
- `weight_decay` (*float*) - weight decay (L2 penalty)
___
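
The two update equations above can be sketched on plain NumPy arrays. This is a minimal illustration, not the needle implementation: the function name `sgd_step` is hypothetical, and it assumes weight decay is applied by adding `weight_decay * theta` to the gradient (the usual L2-penalty form).

```python
import numpy as np


def sgd_step(theta, grad, u, lr=0.01, momentum=0.9, weight_decay=0.0):
    """One SGD-with-momentum update; returns new (theta, u) without mutating inputs."""
    # Assumption: the L2 penalty is folded into the gradient.
    g = grad + weight_decay * theta
    # u_{t+1} = beta * u_t + grad
    u_next = momentum * u + g
    # theta_{t+1} = theta_t - alpha * u_{t+1}
    theta_next = theta - lr * u_next
    return theta_next, u_next


# Toy objective f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -1.0])
u = np.zeros_like(theta)
for _ in range(2):
    theta, u = sgd_step(theta, theta, u, lr=0.1, momentum=0.5)
```

Returning fresh arrays rather than writing into `grad` mirrors the requirement that the optimizer must not modify gradients in-place.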

In [1]:
!python3 -m pytest -v -k "test_optim_sgd"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 92 deselected / 6 selected

tests/test_nn_and_optim.py::test_optim_sgd_vanilla_1 PASSED              [ 16%]
tests/test_nn_and_optim.py::test_optim_sgd_momentum_1 PASSED             [ 33%]
tests/test_nn_and_optim.py::test_optim_sgd_weight_decay_1 PASSED         [ 50%]
tests/test_nn_and_optim.py::test_optim_sgd_momentum_weight_decay_1 PASSED [ 66%]
tests/test_nn_and_optim.py::test_optim_sgd_layernorm_residual_1 PASSED   [ 83%]
tests/test_nn_and_optim.py::test_optim_sgd_z_memory_check_1 PASSED       [100%]

hw2-env/lib/python3.7/site-packages/mugrade/mugrade.py:71
    @pytest.mark.hookwrapper



In [74]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "optim_sgd"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting optim_sgd...
Grader test 1 passed
Grader test 2 passed
F

_______________________________ submit_optim_sgd _______________________________

    def submit_optim_sgd():
    	mugrade.submit(learn_model_1d(48, 17, lambda z: nn.Sequential(nn.Linear(48, 32), nn.ReLU(), nn.Linear(32, 17)), ndl.optim.SGD, lr=0.03, momentum=0.0, epochs=2))
    	mugrade.submit(learn_model_1d(48, 16, lambda z: nn.Sequential(nn.Linear(48, 32), nn.ReLU(), nn.Linear(32,

### Adam
`needle.optim.Adam(params, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0)`

Implements the Adam algorithm, proposed in [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980). 

\begin{equation}
\begin{split}
u_{t+1} &= \beta_1 u_t + (1-\beta_1) \nabla_\theta f(\theta_t) \\
v_{t+1} &= \beta_2 v_t + (1-\beta_2) (\nabla_\theta f(\theta_t))^2 \\
\hat{u}_{t+1} &= u_{t+1} / (1 - \beta_1^t) \quad \text{(bias correction)} \\
\hat{v}_{t+1} &= v_{t+1} / (1 - \beta_2^t) \quad \text{(bias correction)}\\
\theta_{t+1} &= \theta_t - \alpha \hat{u}_{t+1}/(\hat{v}_{t+1}^{1/2}+\epsilon)
\end{split}
\end{equation}

**Important:** Pay attention to whether or not you are applying bias correction.

##### Parameters
- `params` - iterable of parameters of type `needle.nn.Parameter` to optimize
- `lr` (*float*) - learning rate
- `beta1` (*float*) - coefficient used for computing running average of gradient
- `beta2` (*float*) - coefficient used for computing running average of square of gradient
- `eps` (*float*) - term added to the denominator to improve numerical stability
- `bias_correction` (*bool*) - whether to use bias correction for $u, v$
- `weight_decay` (*float*) - weight decay (L2 penalty)
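
A minimal NumPy sketch of one Adam update follows. It is an illustration under stated assumptions rather than the needle implementation: the name `adam_step` is hypothetical, the step counter `t` is assumed to start at 1 for the first update, and weight decay is assumed to be added to the gradient as an L2 penalty.

```python
import numpy as np


def adam_step(theta, grad, u, v, t, lr=0.01, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, bias_correction=True):
    """One Adam update; returns new (theta, u, v) without mutating inputs."""
    # Assumption: the L2 penalty is folded into the gradient.
    g = grad + weight_decay * theta
    # Running averages of the gradient and of its elementwise square.
    u_next = beta1 * u + (1 - beta1) * g
    v_next = beta2 * v + (1 - beta2) * g ** 2
    if bias_correction:
        # Assumption: t is the 1-based count of updates performed so far.
        u_hat = u_next / (1 - beta1 ** t)
        v_hat = v_next / (1 - beta2 ** t)
    else:
        u_hat, v_hat = u_next, v_next
    theta_next = theta - lr * u_hat / (np.sqrt(v_hat) + eps)
    return theta_next, u_next, v_next


theta0 = np.array([1.0, -2.0])
grad0 = np.array([0.5, -4.0])
zeros = np.zeros_like(theta0)
theta1, u1, v1 = adam_step(theta0, grad0, zeros, zeros, t=1, lr=0.1)
```

With zero-initialized moments and bias correction enabled, the first step moves every coordinate by roughly `lr` in the direction opposite to the sign of its gradient, which makes a convenient sanity check.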

In [23]:
!python3 -m pytest -v -k "test_optim_adam"

platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/bowenc/dev/cmu/dlsys/hw2/hw2-env/bin/python3
cachedir: .pytest_cache
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 98 items / 91 deselected / 7 selected

tests/test_nn_and_optim.py::test_optim_adam_1 PASSED                     [ 14%]
tests/test_nn_and_optim.py::test_optim_adam_weight_decay_1 PASSED        [ 28%]
tests/test_nn_and_optim.py::test_optim_adam_batchnorm_1 PASSED           [ 42%]
tests/test_nn_and_optim.py::test_optim_adam_batchnorm_eval_mode_1 PASSED [ 57%]
tests/test_nn_and_optim.py::test_optim_adam_layernorm_1 PASSED           [ 71%]
tests/test_nn_and_optim.py::test_optim_adam_weight_decay_bias_correction_1 PASSED [ 85%]
tests/test_nn_and_optim.py::test_optim_adam_z_memory_check_1 PASSED      [100%]

hw2-env/lib/python3.7/site-pack

In [24]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "optim_adam"

submit
platform linux -- Python 3.7.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/bowenc/dev/cmu/dlsys/hw2
collected 22 items / 21 deselected / 1 selected

tests/test_nn_and_optim.py 
Submitting optim_adam...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
Grader test 6 passed
.



## Question 4 [10 points]

In this question, you will implement two data primitives: `needle.data.DataLoader` and `needle.data.Dataset`. `Dataset` stores the samples and their corresponding labels, and `DataLoader` wraps an iterable around the `Dataset` to enable easy access to the samples. 

For this question, you will be working in `python/needle/data.py`. First, copy your solution to `parse_mnist` from the previous homework into the `parse_mnist` function. 

### Transformations

First we will implement a few transformations that are helpful when working with images. We will stick with a horizontal flip and a random crop for now. Fill out the following functions in `data.py`.
___ 

#### FlipHorizontal
`needle.data.FlipHorizontal()`

Flips the image horizontally.
___

#### RandomCrop
`needle.data.RandomCrop(padding=3)`

Padding is added to all sides of the image, and then the image is cropped back to its original size at a random location. Returns an image the same size as the original image.

##### Parameters
- `padding` (*int*) - The padding on each border of the image.
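
The two transformations can be sketched as plain NumPy functions. This assumes images are stored as `H x W x C` arrays and that padding is zero-filled; the function names and the `rng` argument are illustrative and are not part of `needle.data`.

```python
import numpy as np


def flip_horizontal(img):
    """Mirror an H x W x C image left-to-right by reversing the width axis."""
    return img[:, ::-1, :]


def random_crop(img, padding=3, rng=None):
    """Zero-pad every side by `padding`, then crop back to H x W at a random offset."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = img.shape
    padded = np.pad(img, ((padding, padding), (padding, padding), (0, 0)))
    # Offsets range over 0..2*padding inclusive; offset == padding is the identity crop.
    top = rng.integers(0, 2 * padding + 1)
    left = rng.integers(0, 2 * padding + 1)
    return padded[top:top + h, left:left + w, :]


img = np.arange(12, dtype=np.float32).reshape(2, 3, 2)
flipped = flip_horizontal(img)
cropped = random_crop(img, padding=1, rng=np.random.default_rng(0))
```

Passing an explicit generator makes the crop reproducible, which helps when comparing against a test that fixes the random seed.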

In [None]:
!python3 -m pytest -v -k "flip_horizontal"
!python3 -m pytest -v -k "random_crop"

In [None]:
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "flip_horizontal"
!python3 -m mugrade submit 'eAzORHnc5JIEkr0dqEig' -k "random_crop"

### Dataset

Each `Dataset` subclass must implement three functions: `__init__`, `__len__`, and `__getitem__`. The `__init__` function initializes the images, labels, and transforms. The `__len__` function returns the number of samples in the dataset. The `__getitem__` function retrieves a sample from the dataset at a given index `idx`, calls the transform functions on the image (if applicable), and converts the image and label to numpy arrays (the data will be converted to Tensors elsewhere). Fill out these functions in the `MNISTDataset` class: 
___ 

### MNISTDataset
`needle.data.MNISTDataset(image_filesname, label_filesname)`

##### Parameters
- `image_filesname` - path of file containing images
- `label_filesname` - path of file containing labels
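
The three-method contract can be sketched with an in-memory dataset. The class below is a hypothetical stand-in that takes arrays directly instead of reading MNIST files, so it shows only the shape of `__init__`, `__len__`, and `__getitem__`, not the file parsing.

```python
import numpy as np


class ArrayDataset:
    """Toy dataset over in-memory arrays (illustrative, not needle.data.MNISTDataset)."""

    def __init__(self, images, labels, transforms=None):
        self.images = images
        self.labels = labels
        self.transforms = transforms or []

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        # Apply each transform in order to the image only; labels are left untouched.
        for transform in self.transforms:
            img = transform(img)
        return np.asarray(img), np.asarray(self.labels[idx])


images = np.zeros((4, 28, 28, 1), dtype=np.float32)
images[2] += 1.0
labels = np.array([7, 2, 1, 0], dtype=np.uint8)
ds = ArrayDataset(images, labels, transforms=[lambda x: x * 2.0])
sample_img, sample_label = ds[2]
```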


In [None]:
!python3 -m pytest -v -k "mnist_dataset"

In [None]:
!python3 -m mugrade submit 'YOUR_GRADER_KEY_HERE' -k "mnist_dataset"

### Sampling 

During training, we typically want to pass samples in mini-batches, and shuffle the data at each epoch to reduce model overfitting. The dataloader we will eventually build will be flexible enough to change how it samples points from a given dataset, as well as how it combines the data at a batch level. First, we define a `Sampler` class that will be subclassed to build different data sampling strategies. In this homework we will extend this and create a `SequentialSampler` and a `RandomSampler`. Furthermore, we typically batch our data. Therefore, we will also construct a `BatchSampler`, which will take a `Sampler` and return the sampled indexes as batches instead of individual indexes. Each iteration of a sampler will return an index, whereas each iteration of a batch sampler will return a list of indexes. Fill out the following classes in `python/needle/data.py`.
___

### SequentialSampler
`needle.data.SequentialSampler(data_source: needle.data.Dataset)`

Samples elements sequentially, always in the same order. 

##### Parameters
- `data_source` - `needle.data.Dataset` - dataset 
___ 

### RandomSampler
`needle.data.RandomSampler(data_source: Sized, replacement: bool = False, num_samples: Optional[int] = None)`

Samples elements randomly. If replacement is specified then you must also specify `num_samples`. If replacement is not specified, then the `data_source` size will be used as `num_samples` and all indices shuffled. 

##### Parameters
- `data_source` - `needle.data.Dataset` - dataset 
- `replacement` - `bool` - whether or not to use replacement when sampling from the dataset
- `num_samples` - `int` - if using replacement, how many samples to draw
___ 

### BatchSampler
`needle.data.BatchSampler(sampler: Union[Sampler, Iterable], batch_size: int, drop_last: bool)`

Given a sampler, batch the sampled indexes in sets of `batch_size`. If `drop_last` is set to `True`, and the final batch is not full, then it is dropped. 

##### Parameters
- `sampler` - `needle.data.Sampler` - the sampler whose output is to be batched
- `batch_size` - `int` - the batch size, or the set size to group the indexes in
- `drop_last` - `bool` - if set, the final batch is dropped if it is smaller than `batch_size`
___ 
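
The three sampling strategies can be sketched as standalone generators using only the standard library. The function names and the `seed` argument are illustrative assumptions; the homework classes wrap the same logic in `__iter__` methods.

```python
import random


def sequential_indices(n):
    """Yield 0..n-1 in order."""
    yield from range(n)


def random_indices(n, replacement=False, num_samples=None, seed=None):
    """Yield shuffled indices, or num_samples draws with replacement."""
    rng = random.Random(seed)
    if replacement:
        if num_samples is None:
            raise ValueError("num_samples is required when replacement=True")
        for _ in range(num_samples):
            yield rng.randrange(n)
    else:
        order = list(range(n))
        rng.shuffle(order)
        yield from order


def batch_indices(indices, batch_size, drop_last):
    """Group an index stream into lists of batch_size; optionally drop a short tail."""
    batch = []
    for idx in indices:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_last:
        yield batch


full = list(batch_indices(sequential_indices(7), batch_size=3, drop_last=False))
trimmed = list(batch_indices(sequential_indices(7), batch_size=3, drop_last=True))
shuffled = list(random_indices(7, seed=0))
```

Keeping batching separate from index generation is what lets the same batching code serve both the sequential and the random sampler.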


In [None]:
!python3 -m pytest -v -k "sequential_sampler"
!python3 -m pytest -v -k "random_sampler"
!python3 -m pytest -v -k "batch_sampler"

In [None]:
!python3 -m mugrade submit 'YOUR_GRADER_KEY_HERE' -k "sequential_sampler"
!python3 -m mugrade submit 'YOUR_GRADER_KEY_HERE' -k "random_sampler"
!python3 -m mugrade submit 'YOUR_GRADER_KEY_HERE' -k "batch_sampler"

### Collating

So far we have a `Dataset`, a `Sampler` to draw data points from the dataset, and we can even batch these data points by wrapping our `Sampler` in a `BatchSampler`. However, there is a final step we need to perform before serving data points: collation. Our data is not necessarily in the exact format or structure that we want to leverage during training, and these specifications can differ. Here we will write a collator function for the MNIST dataset. This function takes a batch of data and refactors it such that we return the input data and label separately, the outer dimension of these returned variables is the batch size, and they are of the proper data type (`Tensor` (*float64*)). Fill out the following function in `data.py`.
___
#### collate_mnist
`needle.data.collate_mnist(batch)`

Take a batch of data and transform it into a Tensor of the proper shape.

##### Parameters
- `batch` - a batch of data, can be of different types including a tuple or a list. 
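
A minimal collation sketch is shown here with NumPy arrays standing in for needle Tensors. The name `collate_pairs` is hypothetical, and it assumes each element of the batch is an `(image, label)` pair, as produced by a dataset's `__getitem__`.

```python
import numpy as np


def collate_pairs(batch):
    """Turn a list of (image, label) pairs into stacked float64 arrays."""
    images, labels = zip(*batch)
    # Stack along a new leading axis so the outer dimension is the batch size.
    x = np.stack([np.asarray(image, dtype=np.float64) for image in images])
    y = np.asarray(labels, dtype=np.float64)
    return x, y


batch = [(np.zeros((28, 28, 1)), 3), (np.ones((28, 28, 1)), 5)]
x, y = collate_pairs(batch)
```

In the homework, the two stacked arrays would then be wrapped in Tensors before being returned.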

In [None]:
!python3 -m pytest -v -k "collate_mnist"

In [None]:
!python3 -m mugrade submit 'YOUR_GRADER_KEY_HERE' -k "collate_mnist"

### DataLoader

We now have all of the components we need to create a dataloader. Using all of the prior objects and methods created in this question, we will build a dataloader object that combines them. We have provided a base `DataLoader` implementation that is flexible enough to include multiprocess loading of samples (without multithreading, data loading can become the bottleneck when using fast computation methods). For simplicity, we will only implement a single-threaded dataloader, but we hope that the "jump-off" points are clear for those interested in extending it to multiple threads. The same applies to the underlying dataset type. In this homework we focus on `Iterable` datasets, but `Map`-style datasets are also commonly used. Again, we hope that the base implementation makes it clear how one would extend this to work with `Map`-style datasets, although we only work with `Iterable` ones here. Look through the `DataLoader` class, as well as the classes it calls or depends on. Then fill out the `fetch` method of `_IterableDatasetFetcher`. 
___

### DataLoader
`needle.data.DataLoader(dataset: Dataset, batch_size: Optional[int] = 1, shuffle: bool = False, sampler: Union[Sampler, Iterable, None] = None, collate_fn: Optional = default_collate, drop_last: bool = False)`

Combine a dataset, sampler, and collator to serve datapoints from a dataset. 

##### Parameters
- `dataset` - `needle.data.Dataset` - a dataset 
- `batch_size` - `int` - what batch size to serve the data in 
- `shuffle` - `bool` - if a sampler is not provided, use this to choose between a `SequentialSampler` and a `RandomSampler`
- `collate_fn` - `Generic` - the function to use to collate the samples. For simplicity we will be using only the `collate_mnist` here, but a general collation function can be built. 
- `drop_last` - `bool` - whether or not to drop the last batch if it is not full. 
___ 

### \_IterableDatasetFetcher
`needle.data._IterableDatasetFetcher(dataset, collate_fn, drop_last)`

The `fetch` method of this class takes a batch of indexes (typically returned from a `Sampler`), acquires the data from the `dataset`, collates the data using the `collate_fn`, and then returns it for use.  

##### Parameters
- `dataset` - `needle.data.Dataset` - a dataset 
- `collate_fn` - `Generic` - the function to use to collate the samples. For simplicity we will be using only the `collate_mnist` here, but a general collation function can be built. 
- `drop_last` - `bool` - whether or not to drop the last batch if it is not full. 
___ 
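
The fetch step described above can be sketched independently of needle. The class name `ListFetcher` is hypothetical, a plain list stands in for the dataset, and `drop_last` is omitted because this sketch assumes the batch sampler has already handled short batches.

```python
class ListFetcher:
    """Looks up each index in the dataset, then collates the samples (illustrative)."""

    def __init__(self, dataset, collate_fn):
        self.dataset = dataset
        self.collate_fn = collate_fn

    def fetch(self, batch_indices):
        # Index the dataset once per requested index, preserving the sampler's order.
        samples = [self.dataset[i] for i in batch_indices]
        return self.collate_fn(samples)


# Hypothetical dataset of (feature, label) pairs and a collate function
# that splits the pairs into two parallel lists.
dataset = [(10 * i, i % 2) for i in range(6)]
fetcher = ListFetcher(dataset, collate_fn=lambda s: tuple(map(list, zip(*s))))
xs, ys = fetcher.fetch([4, 0, 5])
```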



In [None]:
!python3 -m pytest -v -k "test_dataloader"

In [None]:
!python3 -m mugrade submit 'YOUR_GRADER_KEY_HERE' -k "dataloader"

## Question 5 [20 points]

Now that you have implemented all the necessary components for our neural network library, let's build and train an MLP ResNet. For this question, you will be working in `apps/mlp_resnet.py`. First, fill out the functions `ResidualBlock` and `MLPResNet` as described below:

### ResidualBlock
`ResidualBlock(dim, hidden_dim, norm=nn.BatchNorm, drop_prob=0.9)`

Implements a residual block as follows:
![](figures/residualblock.png)
where the first linear layer has `in_features=dim` and `out_features=hidden_dim`, and the last linear layer has `out_features=dim`. Returns the block as type `nn.Module`. 

##### Parameters
- `dim` (*int*) - input dim
- `hidden_dim` (*int*) - hidden dim
- `norm` (*nn.Module*) - normalization method
- `drop_prob` (*float*) - dropout probability

___

### MLPResNet
`MLPResNet(dim, hidden_dim=100, num_blocks=3, num_classes=10, norm=nn.BatchNorm, drop_prob=0.1)`

Implements an MLP ResNet as follows:
![](figures/mlp_resnet.png)
where the first linear layer has `in_features=dim` and `out_features=hidden_dim`, and each ResidualBlock has `dim=hidden_dim` and `hidden_dim=hidden_dim//2`. Returns a network of type `nn.Module`. __NOTE__: please hard-code `drop_prob` to 0.1 (or ignore `drop_prob`) in this function due to an error in the reference implementation.

##### Parameters
- `dim` (*int*) - input dim
- `hidden_dim` (*int*) - hidden dim
- `num_blocks` (*int*) - number of ResidualBlocks
- `num_classes` (*int*) - number of classes
- `norm` (*nn.Module*) - normalization method
- `drop_prob` (*float*) - dropout probability (0.1)
___ 

Once you have the deep learning model architecture correct, let's train the network using our new neural network library components. Specifically, implement the functions `train_epoch`, `evaluate` and `train_mnist`. 

___
`train_epoch(dataloader, model, loss_fn, opt)`

Executes one epoch of training, iterating over the entire training dataset once (just like `nn_epoch` from previous homeworks). Returns the average accuracy (as a *float*) and the average loss over all samples (as a *float*). Sets the model to `training` mode at the beginning of the function.  

##### Parameters
- `dataloader` (*`needle.data.DataLoader`*) - dataloader returning samples from the training dataset
- `model` (*`needle.nn.Module`*) - neural network
- `loss_fn` (*`needle.nn.Module` type*) - loss function to optimize over
- `opt` (*`needle.optim.Optimizer`*) - optimizer instance

___
`evaluate(dataloader, model, loss_fn)`

Evaluates the model given a loss function on the entire test dataset. Returns the average accuracy (as a *float*) and the average loss over all samples (as a *float*). Sets the model to `eval` mode at the beginning of the function. 

##### Parameters
- `dataloader` (*`needle.data.DataLoader`*) - dataloader returning samples from the test dataset
- `model` (*`needle.nn.Module`*) - neural network 
- `loss_fn` (*`needle.nn.Module` type*) - loss function
___
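
Both `train_epoch` and `evaluate` return averages over all samples rather than over batches. The bookkeeping can be sketched as follows, assuming each batch provides logits, integer labels, and a mean per-batch loss as NumPy values; the function name `average_metrics` and this triple format are illustrative assumptions.

```python
import numpy as np


def average_metrics(batches):
    """Accumulate accuracy and loss over (logits, labels, mean_batch_loss) triples."""
    correct, loss_sum, seen = 0, 0.0, 0
    for logits, labels, batch_loss in batches:
        n = labels.shape[0]
        # A prediction is correct when the arg-max class matches the label.
        correct += int((np.argmax(logits, axis=1) == labels).sum())
        # Re-weight the mean batch loss by batch size so short batches count less.
        loss_sum += float(batch_loss) * n
        seen += n
    return correct / seen, loss_sum / seen


batches = [
    (np.array([[2.0, 1.0], [0.0, 3.0]]), np.array([0, 0]), 0.5),
    (np.array([[1.0, 4.0]]), np.array([1]), 2.0),
]
acc, avg_loss = average_metrics(batches)
```

Weighting by batch size matters when the last batch is smaller than the rest; a plain mean of per-batch means would over-count it.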

`train_mnist(batch_size=100, epochs=10, optimizer=ndl.optim.Adam, lr=0.001, weight_decay=0.001, hidden_dim=100, data_dir="data")`

Initializes a training dataloader (with `shuffle` set to `True`) and a test dataloader for MNIST data, and trains an `MLPResNet` using the given optimizer and the softmax loss for a given number of epochs. Returns a tuple of the training accuracy, training loss, test accuracy, test loss computed in the last epoch of training. If any parameters are not specified, use the default parameters.

##### Parameters
- `batch_size` (*int*) - batch size to use for train and test dataloader
- `epochs` (*int*) - number of epochs to train for
- `optimizer` (*`needle.optim.Optimizer` type*) - optimizer type to use
- `lr` (*float*) - learning rate 
- `weight_decay` (*float*) - weight decay
- `hidden_dim` (*int*) - hidden dim for `MLPResNet`
- `data_dir` (*str*) - directory containing MNIST image/label files


In [None]:
!python3 -m pytest -v -k "test_mlp"

In [None]:
!python3 -m mugrade submit 'YOUR_GRADER_KEY_HERE' -k "mlp_resnet"

We encourage you to experiment with the `mlp_resnet.py` training script.
You can uncomment the lines that print training statistics and investigate
the effect of using different initializers on the Linear layers,
increasing the dropout probability,
or adding transforms (via a list to the `transforms=` keyword argument of Dataset)
such as random cropping.