
These notes begin after the matrix multiplication lesson. That lesson already has decent notes, but from here on the notes live in a completely new Jupyter Notebook for better organization and retention.


# Forward Pass

## Imports

In [1]:
# Workaround to download the MNIST data: set root to './', run this cell, then proceed
!wget www.di.ens.fr/~lelarge/MNIST.tar.gz
!tar -zxvf MNIST.tar.gz

import torchvision
import torchvision.transforms as transforms
root = './'
torchvision.datasets.MNIST(root=root,download=True)

--2021-03-14 13:49:36--  http://www.di.ens.fr/~lelarge/MNIST.tar.gz
Resolving www.di.ens.fr (www.di.ens.fr)... 129.199.99.14
Connecting to www.di.ens.fr (www.di.ens.fr)|129.199.99.14|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.di.ens.fr/~lelarge/MNIST.tar.gz [following]
--2021-03-14 13:49:37--  https://www.di.ens.fr/~lelarge/MNIST.tar.gz
Connecting to www.di.ens.fr (www.di.ens.fr)|129.199.99.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘MNIST.tar.gz’

MNIST.tar.gz            [        <=>         ]  33.20M  5.96MB/s    in 16s     

2021-03-14 13:49:53 (2.09 MB/s) - ‘MNIST.tar.gz’ saved [34813078]

MNIST/
MNIST/raw/
MNIST/raw/train-labels-idx1-ubyte
MNIST/raw/t10k-labels-idx1-ubyte.gz
MNIST/raw/t10k-labels-idx1-ubyte
MNIST/raw/t10k-images-idx3-ubyte.gz
MNIST/raw/train-images-idx3-ubyte
MNIST/raw/train-labels-idx1-ubyte.gz
MNIST/raw/t10k-images-idx3-ubyte
MNIST/raw/tra

Dataset MNIST
    Number of datapoints: 60000
    Root location: ./
    Split: Train

In [2]:
!pip show torchvision

Name: torchvision
Version: 0.9.0+cu101
Summary: image and video datasets and models for torch deep learning
Home-page: https://github.com/pytorch/vision
Author: PyTorch Core Team
Author-email: soumith@pytorch.org
License: BSD
Location: /usr/local/lib/python3.7/dist-packages
Requires: torch, pillow, numpy
Required-by: fastai


In [3]:
import operator

def test(a,b,cmp,cname=None):
    if cname is None: cname=cmp.__name__
    assert cmp(a,b),f"{cname}:\n{a}\n{b}"

def test_eq(a,b): test(a,b,operator.eq,'==')

from pathlib import Path
from IPython.core.debugger import set_trace
from fastai import datasets
import pickle, gzip, math, torch, matplotlib as mpl
import matplotlib.pyplot as plt
from torch import tensor
import torch.nn.functional as F

def near(a,b): return torch.allclose(a, b, rtol=1e-3, atol=1e-5)
def test_near(a,b): test(a,b,near)

# Make MNIST data work on Google Colab
from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

In [4]:
from torch import nn

In [5]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

def get_data():
    import os
    import torchvision.datasets as datasets
    # root = '../data'

    if not os.path.exists(root):
        os.mkdir(root)
    train_set = datasets.MNIST(root=root, train=True, download=True)
    test_set = datasets.MNIST(root=root, train=False, download=True)
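    # .data / .targets replace the deprecated .train_data / .train_labels attributes
    x_train, x_valid = train_set.data.split([50000, 10000])
    y_train, y_valid = train_set.targets.split([50000, 10000])
    # flatten each 28x28 image into 784 values and scale the pixel values down to [0, 1)
    return (x_train.view(50000, -1) / 256.0), y_train.float(), (x_valid.view(10000, -1))/ 256.0, y_valid.float()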

# The geometric intuition: picture the x's scattered around a horizontal line (the mean), shift that line down to 0,
# then rescale the x's by dividing them by the standard deviation
def normalize(x, mean, std): return (x-mean)/std

In [6]:
x_train, y_train, x_valid, y_valid = get_data()



## Normalization

- We want the mean to be 0 and the standard deviation to be 1 for easier convergence, so we normalize.
- Notice that we use train_mean and train_std to normalize the validation data as well. The validation set has to be on the same scale as the training set, so it must go through exactly the same transformation.
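
A minimal sketch of that point, written in plain Python rather than PyTorch so every step is visible. The `train` and `valid` lists are made-up numbers for illustration, not MNIST pixels:

```python
import statistics

def normalize(xs, mean, std):
    """Shift by mean and scale by std, element by element."""
    return [(x - mean) / std for x in xs]

# Made-up data for illustration only.
train = [0.0, 2.0, 4.0, 6.0, 8.0]
valid = [1.0, 5.0, 9.0]

# The statistics come from the training set only.
train_mean = statistics.mean(train)
train_std = statistics.stdev(train)  # sample standard deviation

train_n = normalize(train, train_mean, train_std)
valid_n = normalize(valid, train_mean, train_std)  # same mean/std as for train

print(statistics.mean(train_n), statistics.stdev(train_n))
print(valid_n)
```

The normalized training data ends up with mean 0 and standard deviation 1. The validation data generally does not, and that is fine: what matters is that both sets were transformed identically.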

In [7]:
# before normalization
train_mean, train_std = x_train.mean(), x_train.std()
train_mean, train_std

(tensor(0.1304), tensor(0.3073))

In [8]:
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)

In [9]:
# after normalization
train_mean, train_std = x_train.mean(), x_train.std()
train_mean, train_std

(tensor(3.9162e-08), tensor(1.))

In [10]:
def assert_is_near_zero(a, threshold=1e-3): assert a.abs() < threshold, f"{a} is not near zero"
assert_is_near_zero(train_mean)

## Get shapes

In [11]:
num_samples, image_size = x_train.shape 
num_classes = y_train.max() + 1 
nh = 50

n = num_samples
m = image_size
c = num_classes

n, m, c, nh

(50000, 784, tensor(10.), 50)

## Initialization

- Initialization is *extremely* important. In the 2019 paper "Fixup Initialization: Residual Learning Without Normalization", Zhang et al. trained a 10,000-layer neural net WITHOUT normalization layers just by initializing everything carefully.

### Xavier Initialization

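- To perform the Xavier-style initialization used here, you scale the randomly drawn weights (not the input) by 1 / sqrt(num_input_units), which gives the weights a mean of 0 and a standard deviation of 1 / sqrt(m). With inputs of mean 0 and std 1, the layer's outputs then also come out with std close to 1.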
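
A rough simulation of why the 1 / sqrt(m) factor works, in plain Python. It is only a sketch under simplifying assumptions: inputs and weights are independent standard normal draws, and `m = 100` is chosen to keep the loop fast (the notebook uses m = 784):

```python
import math
import random

random.seed(0)
m = 100          # number of input units (fan-in), kept small so the loop is fast
n_samples = 2000

def output_std(weight_scale):
    """Std of sum_i x_i * w_i with x_i ~ N(0, 1) and w_i ~ N(0, weight_scale^2)."""
    outs = []
    for _ in range(n_samples):
        total = 0.0
        for _ in range(m):
            total += random.gauss(0, 1) * random.gauss(0, weight_scale)
        outs.append(total)
    mean = sum(outs) / n_samples
    var = sum((o - mean) ** 2 for o in outs) / (n_samples - 1)
    return math.sqrt(var)

std_unscaled = output_std(1.0)               # theory: sqrt(m) = 10
std_scaled = output_std(1.0 / math.sqrt(m))  # theory: 1
print(std_unscaled, std_scaled)
```

Each of the m products has variance equal to `weight_scale` squared, so the sum has variance m times that. Choosing `weight_scale = 1 / sqrt(m)` cancels the factor of m.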

In [12]:
def lin(x, w, b): return x@w + b

In [13]:
# Forward pass without scaling the initial weights
w1 = torch.randn(m, nh)
b1 = torch.zeros(nh)

t = lin(x_valid, w1, b1)
t.mean(), t.std() # terrible, you want ~ (0,1) (mean,std)

(tensor(0.2650), tensor(28.1635))

In [14]:
# Forward pass with Standard Xavier Init
w1 = torch.randn(m, nh) * math.sqrt(1/m)
b1 = torch.zeros(nh)

t = lin(x_valid, w1, b1)
t.mean(), t.std() # good

(tensor(0.1633), tensor(0.9979))

In [15]:
assert_is_near_zero(w1.mean())
assert_is_near_zero(w1.std() - 1/math.sqrt(m))

### Vanishing Activation/Gradient Problem

- Remember that after the matrix multiplication you pass the result through ReLU. Each time you do that, every activation below 0 is set to 0, which reduces the standard deviation. If your network is very deep, the standard deviation keeps shrinking layer after layer (possibly down to 0).

![Screen Shot 2021-03-02 at 9.20.03 PM.png](attachment:a3f59b8e-febb-4866-8494-d2338a2b0374.png)
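
To put numbers on the shrinkage, here is a small plain-Python simulation. It assumes idealized pre-activations drawn from a standard normal distribution, which the real MNIST activations only approximate:

```python
import math
import random

random.seed(0)
n_samples = 100_000

# Idealized pre-activations: mean 0, std 1.
z = [random.gauss(0, 1) for _ in range(n_samples)]
a = [max(v, 0.0) for v in z]  # ReLU

def mean_std(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)
    return mu, math.sqrt(var)

z_mean, z_std = mean_std(z)
a_mean, a_std = mean_std(a)
second_moment = sum(v * v for v in a) / n_samples  # mean of a^2
print(z_mean, z_std)
print(a_mean, a_std, second_moment)
```

For a standard normal input, theory gives a post-ReLU mean of 1/sqrt(2π) ≈ 0.399, a std of about 0.584, and a mean square of exactly 0.5. ReLU throws away half of the "energy" of the signal, which is the factor the 2 in Kaiming initialization compensates for.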

In [16]:
# clamp_min(n) means replace everything below n with n, in this case, relu means replacing everything negative with 0
# always try to use PyTorch function because they're generally implemented in C for you
def relu(x): return x.clamp_min(0.) 

t = relu(lin(x_valid, w1, b1)) 
t.mean(),t.std()

(tensor(0.4760), tensor(0.6493))

### Kaiming Initialization
- The problem with Xavier initialization is that it doesn't account for the ReLU, so it doesn't combat the vanishing activation/gradient problem very well.
- Kaiming initialization is almost identical to Xavier initialization but with a 2 in the numerator; it keeps the std of the activations after ReLU around 1 by drawing the weights with

$$\text{std} = \sqrt{\frac{2}{(1 + a^2) \times \text{fan_in}}}$$

- This was introduced in the paper that described the Imagenet-winning approach from *He et al*: [Delving Deep into Rectifiers](https://arxiv.org/abs/1502.01852), which was also the first paper that claimed "super-human performance" on Imagenet. Most importantly, it introduced Kaiming He initialization (along with the PReLU activation). Resnets were introduced in a separate, later paper by the same authors.

- So papers by competition winners are very good because they introduce MANY good ideas instead of just one tiny tweak.
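
The formula above can be evaluated directly. This is a small sketch; `kaiming_std` is a helper name made up for this note, and `a` is the negative slope of the activation (0 for a plain ReLU):

```python
import math

def kaiming_std(fan_in, a=0.0):
    """Weight std from the formula above; a is the negative slope (0 for plain ReLU)."""
    return math.sqrt(2.0 / ((1.0 + a ** 2) * fan_in))

fan_in = 784  # flattened 28x28 MNIST image

std_relu = kaiming_std(fan_in)           # a = 0: plain ReLU
std_linear = kaiming_std(fan_in, a=1.0)  # a = 1: the "activation" is the identity
print(std_relu, std_linear)
```

With `a = 1` the formula reduces to sqrt(1 / fan_in), the same scaling used for the Xavier-style initialization earlier. With `a = 0` the std is larger by a factor of sqrt(2).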

In [17]:
# Forward pass with Kaiming Initialization
torch.manual_seed(42)
w1 = torch.randn(m, nh) * math.sqrt(2/m)
b1 = torch.zeros(nh)

t = relu(lin(x_valid, w1, b1))
t.mean(), t.std() 

(tensor(0.6624), tensor(0.9097))

In [18]:
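# Forward pass with PyTorch's Kaiming Initialization, same thing
# (mode='fan_out' because PyTorch assumes weights are shaped (out, in), while w1 here is (in, out) = (m, nh),
# so PyTorch's "fan_out" is our m)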
from torch.nn.init import kaiming_normal_

w1 = torch.empty(m, nh)
b1 = torch.zeros(nh)

torch.manual_seed(42)
kaiming_normal_(w1, mode='fan_out')
t = relu(lin(x_valid, w1, b1))
t.mean(), t.std() 

(tensor(0.6624), tensor(0.9097))

- Note: Kaiming initialization is very good, but notice that the mean is still not zero, and we have good reasons to want it to be. So we can define our own new_relu to see if shifting the output helps bring the mean closer to zero. It's an intuitive thing to try, and papers are written from minor tweaks like this. Maybe it'll help a lot in practice.

In [19]:
def new_relu(x): return x.clamp_min(0.) - 0.5

In [20]:
# The new_relu seems to help!
torch.manual_seed(42)
w1 = torch.randn(m,nh) * math.sqrt(2./m)
t1 = new_relu(lin(x_valid, w1, b1))
t1.mean(), t1.std()

(tensor(0.1624), tensor(0.9097))

## Train a Model

In [21]:
torch.manual_seed(42)

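# Caution: torch.empty returns uninitialized memory, so w1 holds arbitrary values here.
# This is most likely why the loss in In [29] is astronomically large; In [33] initializes w1 with kaiming_normal_.
w1 = torch.empty(m, nh)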
b1 = torch.zeros(nh)
w2 = torch.randn(nh,1)/math.sqrt(nh)
b2 = torch.zeros(1)

def model(x):
    l1 = lin(x, w1, b1)
    l2 = relu(l1)
    l3 = lin(l2, w2, b2)
    return l3

In [22]:
%timeit -n 10 _=model(x_valid)

10 loops, best of 5: 7.7 ms per loop


In [23]:
assert model(x_valid).shape == torch.Size((x_valid.shape[0],1))

## Loss Function

- MSE is the wrong loss for a classification problem, but we use it for now just for simplicity's sake
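
A tiny worked example of the MSE, in plain Python with made-up predictions. It mirrors what the `mse` function in the next cell computes on tensors:

```python
def mse(outputs, targets):
    """Mean of the squared differences between predictions and targets."""
    assert len(outputs) == len(targets)
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

# Made-up predictions; the targets are digit labels treated as regression targets.
preds = [4.5, 0.5, 3.0]
labels = [5.0, 0.0, 4.0]
loss = mse(preds, labels)
print(loss)  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```

This also hints at why MSE is a poor fit here: it treats predicting 4 for a true 5 as a much smaller mistake than predicting 0, even though digit classes have no such ordering.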

In [24]:
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()

In [25]:
y_train, y_valid = y_train.float(), y_valid.float()

In [26]:
pred = model(x_train)

In [27]:
y_train.shape

torch.Size([50000])

In [28]:
pred.shape # not the same shape as y_train, so mse squeezes away the trailing dimension

torch.Size([50000, 1])

In [29]:
mse(pred, y_train)

tensor(1.7683e+29)

# Backward Pass

- During the backward pass, you calculate the gradient of the loss with respect to each of w1, b1, w2, b2
- For each of the functions below, we take the derivative of the loss with respect to a layer and store the result in that layer's .grad; in other words, x.grad stores dloss/dx. Note that x, the layer, is the denominator.
- BE CAREFUL RUNNING lin_grad ON A MACHINE WITH LITTLE MEMORY! It does not strictly need a GPU, but it builds an intermediate tensor of shape (n, m, nh), which for 50000 x 784 x 50 float32 values is roughly 7.8 GB.
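
The chain rule bookkeeping is easier to follow on a scalar version of the same network. The sketch below uses plain Python floats and made-up parameter values, and checks the hand-derived gradients against central finite differences:

```python
def forward(x, y, w1, b1, w2, b2):
    l1 = w1 * x + b1          # linear
    l2 = max(l1, 0.0)         # relu
    pred = w2 * l2 + b2       # linear
    loss = (pred - y) ** 2    # squared error for a single sample
    return l1, l2, pred, loss

def backward(x, y, w1, b1, w2, b2):
    l1, l2, pred, _ = forward(x, y, w1, b1, w2, b2)
    d_pred = 2.0 * (pred - y)                 # dloss/dpred
    d_w2, d_b2 = d_pred * l2, d_pred          # dloss/dw2, dloss/db2
    d_l2 = d_pred * w2                        # dloss/dl2
    d_l1 = d_l2 * (1.0 if l1 > 0 else 0.0)    # dloss/dl1 (relu passes or blocks)
    d_w1, d_b1 = d_l1 * x, d_l1               # dloss/dw1, dloss/db1
    return {"w1": d_w1, "b1": d_b1, "w2": d_w2, "b2": d_b2}

params = {"w1": 0.5, "b1": 0.1, "w2": -1.5, "b2": 0.2}  # illustrative values
x, y = 2.0, 1.0
grads = backward(x, y, **params)

# Numerical check: nudge one parameter at a time and watch the loss.
eps = 1e-6
numeric_grads = {}
for name in params:
    up = dict(params, **{name: params[name] + eps})
    down = dict(params, **{name: params[name] - eps})
    numeric_grads[name] = (forward(x, y, **up)[3] - forward(x, y, **down)[3]) / (2 * eps)
    print(name, grads[name], numeric_grads[name])
```

Each gradient is the upstream gradient times a local derivative, which is exactly the pattern in `mse_grad`, `relu_grad`, and `lin_grad` below, only with matrices instead of scalars.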

In [30]:
def mse_grad(inp, targ):
    # gradient of loss with respect to the previous layer, so it's pred.grad == dloss/dpred
    inp.grad = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]

In [31]:
def relu_grad(inp, out):
    # inp.grad == dloss/dinp == dout/dinp * dloss/dout  
    inp.grad = (inp > 0).float() * out.grad

In [32]:
def lin_grad(inp, out, w, b):
    # dloss / dinp
    inp.grad = out.grad @ w.t() 
    # dloss / dw
    w.grad = (inp.unsqueeze(-1) * out.grad.unsqueeze(1)).sum(0)
    # dloss / db
    b.grad = out.grad.sum(0)

# Full Pass: Forward + Backward 


In [33]:
from torch.nn import init
torch.manual_seed(42)

# Our forward + backward loop

w1 = torch.zeros(m,nh)
init.kaiming_normal_(w1, mode='fan_out')
b1 = torch.zeros(nh)
w2 = torch.randn(nh,1)/math.sqrt(nh)
b2 = torch.zeros(1)

def forward_and_backward(inp, target):
    # forward
    l1 = lin(inp, w1, b1)
    l2 = relu(l1)
    pred = lin(l2, w2, b2)
    loss = mse(pred, target)    
    
    # backward
    # pred.grad = dloss/dpred
    mse_grad(pred, target) 
    # l2.grad = dloss/dl2 = dloss/dpred * dpred/dl2
    # w2.grad = dloss/dw2 = dloss/dpred * dpred/dw2
    # b2.grad = dloss/db2 = dloss/dpred * dpred/db2
    lin_grad(l2, pred, w2, b2) 
    # l1.grad = dloss/dl1 = dloss/dl2 * dl2/dl1
    relu_grad(l1, l2)
    # x.grad = dloss/dx = dloss/dl1 * dl1/dx
    # w1.grad = dloss/dw1 = dloss/dl1 * dl1/dw1
    # b1.grad = dloss/db1 = dloss/dl1 * dl1/db1
    lin_grad(inp, l1, w1, b1)

forward_and_backward(x_train, y_train)

In [34]:
# pytorch's forward + backward loop
w1_2 = w1.clone().requires_grad_(True)
w2_2 = w2.clone().requires_grad_(True)
b1_2 = b1.clone().requires_grad_(True)
b2_2 = b2.clone().requires_grad_(True)
x_train_2 = x_train.clone().requires_grad_(True)

def forward(inp, targ):
    # forward pass:
    l1 = lin(inp, w1_2, b1_2)
    l2 = relu(l1)
    pred = lin(l2, w2_2, b2_2)
    # we don't actually need the loss in backward!
    return mse(pred, targ) 

loss = forward(x_train_2, y_train)
loss.backward()

In [35]:
test_near(w1_2.grad, w1.grad)
test_near(w2_2.grad, w2.grad)
test_near(b1_2.grad, b1.grad)
test_near(b2_2.grad, b2.grad)
test_near(x_train_2.grad, x_train.grad)

# Refactor Forward & Backward Functions into Same Classes

In [36]:
class Relu():
  def __call__(self, input):
    self.input = input
    self.output = input.clamp_min(0.) 
    return self.output

  def backward(self):
    self.input.grad = (self.input > 0).float() * self.output.grad

In [37]:
class Lin():
  def __init__(self, w, b): 
    self.w = w
    self.b = b

  def __call__(self, input):
    self.input = input
    self.output = input@self.w + self.b
    return self.output
  
  def backward(self):
    self.input.grad = self.output.grad @ self.w.t()
    self.w.grad = (self.input.unsqueeze(-1) * self.output.grad.unsqueeze(1)).sum(0)
    self.b.grad = self.output.grad.sum(0)

In [38]:
class Mse():
  def __call__(self, pred, targ):
    self.pred = pred
    self.targ = targ
    return (pred.squeeze(-1) - targ).pow(2).mean()

  def backward(self):
    self.pred.grad = 2. * (self.pred.squeeze() - self.targ).unsqueeze(-1) / self.pred.shape[0]

In [39]:
class Model:
  def __init__(self, w1, w2, b1, b2):
    self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
    self.loss = Mse()
  
  def __call__(self, x, target):
    for layer in self.layers:
      x = layer(x)
    return self.loss(x, target)
  
  def backward(self):
    self.loss.backward()
    for layer in reversed(self.layers):
      layer.backward()

In [40]:
w1.grad, b1.grad, w2.grad, b2.grad = [None for _ in range(4)]; print(w1.grad, b1.grad, w2.grad, b2.grad)
model = Model(w1, w2, b1, b2)

None None None None


In [41]:
# forward pass
%time loss = model(x_train, y_train)

CPU times: user 115 ms, sys: 0 ns, total: 115 ms
Wall time: 58.4 ms


In [42]:
# backward pass
%time model.backward()

CPU times: user 2.98 s, sys: 3.63 s, total: 6.61 s
Wall time: 3.43 s


In [43]:
test_near(w1_2.grad, w1.grad)
test_near(w2_2.grad, w2.grad)
test_near(b1_2.grad, b1.grad)
test_near(b2_2.grad, b2.grad)
test_near(x_train_2.grad, x_train.grad)

# Refactor out Repetitive Code

### Notice how we're basically recreating nn.Linear and nn.Module etc.
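
A stripped-down sketch of the pattern, in plain Python on scalars. `Value` and `Scale` are hypothetical names invented for this note (a stand-in for a tensor with a `.grad` slot and a one-weight linear layer). The base class only stores arguments and output, and each subclass supplies `forward` and `bwd`:

```python
class Value:
    """Hypothetical stand-in for a tensor: a number plus a slot for its gradient."""
    def __init__(self, data):
        self.data = data
        self.grad = None

class Module:
    def __call__(self, *args):
        self.args = args                      # remember the inputs
        self.output = self.forward(*args)     # remember the output
        return self.output

    def backward(self):
        self.bwd(self.output, *self.args)     # subclass fills in the .grad slots

class Scale(Module):
    """Multiply the input by a single learnable weight."""
    def __init__(self, w):
        self.w = w

    def forward(self, inp):
        return Value(inp.data * self.w.data)

    def bwd(self, out, inp):
        inp.grad = out.grad * self.w.data
        self.w.grad = out.grad * inp.data

class Relu(Module):
    def forward(self, inp):
        return Value(max(inp.data, 0.0))

    def bwd(self, out, inp):
        inp.grad = (1.0 if inp.data > 0 else 0.0) * out.grad

x, w = Value(3.0), Value(2.0)
layers = [Scale(w), Relu()]

h = x
for layer in layers:
    h = layer(h)

h.grad = 1.0                       # pretend dloss/doutput is 1
for layer in reversed(layers):
    layer.backward()

print(h.data, x.grad, w.grad)
```

The forward loop and the reversed backward loop have the same shape as the `Model` class in this section. Only the per-layer math differs.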

In [44]:
class Module():
  def __call__(self, *args):
    self.args = args
    self.output = self.forward(*args)
    return self.output

  def backward(self): self.bwd(self.output, *self.args)

In [45]:
'''
Module takes in all the arguments in the dunder method __call__, stores them, and passes them down to the
subclass-specific forward / bwd. There is no self.input, only self.args and self.output.
'''

# Relu(input), *args is input
class Relu(Module):
  def forward(self, input):
    return input.clamp_min(0.) 

  def bwd(self, output, input):
    input.grad = (input > 0).float() * output.grad

In [46]:
# Lin(input), *args is input
class Lin(Module):
  def __init__(self, w, b): 
    self.w, self.b = w, b

  def forward(self, input):
    return input@self.w + self.b
  
  def bwd(self, output, input):
    input.grad = output.grad @ self.w.t()
    # optimization via matrix multiplication instead of multiplication and summing (einsum could also be used to improve performance here)
    self.w.grad = input.t() @ output.grad
    self.b.grad = output.grad.sum(0)
    # print("input, w, b shapes: ", input.shape, self.w.shape, self.b.shape)
    # print("input.grad, w.grad, b.grad shapes: ", input.grad.shape, self.w.grad.shape, self.b.grad.shape)

In [47]:
# Mse(input), *args are pred and targ
class Mse(Module):
  def forward(self, pred, targ):
    return (pred.squeeze(-1) - targ).pow(2).mean()

  def bwd(self, out, pred, targ):
    pred.grad = 2. * (pred.squeeze() - targ).unsqueeze(-1) / pred.shape[0]


In [48]:
class Model:
  def __init__(self, w1, w2, b1, b2):
    self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
    self.loss = Mse()
  
  def __call__(self, x, target):
    for layer in self.layers:
      x = layer(x)
    return self.loss(x, target)
  
  def backward(self):
    self.loss.backward()
    for layer in reversed(self.layers):
      layer.backward()

In [49]:
w1.grad, b1.grad, w2.grad, b2.grad = [None for _ in range(4)]; print(w1.grad, b1.grad, w2.grad, b2.grad)
model = Model(w1, w2, b1, b2)

None None None None


In [50]:
# forward pass
%time loss = model(x_train, y_train)

CPU times: user 99.5 ms, sys: 0 ns, total: 99.5 ms
Wall time: 49.9 ms


In [51]:
# backward pass
%time model.backward()

CPU times: user 243 ms, sys: 29.8 ms, total: 273 ms
Wall time: 151 ms
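
The backward pass is much faster here than in the previous version because the weight gradient is now a single matrix product instead of a broadcast multiply followed by a sum. The plain-Python sketch below, with tiny made-up matrices, shows that the two formulations give the same numbers:

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Illustrative shapes: n=3 samples, m=2 inputs, nh=2 outputs.
inp = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # (n, m)
out_grad = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]     # (n, nh)

# Version 1: one (m, nh) outer product per sample, summed over the n samples.
n, m, nh = len(inp), len(inp[0]), len(out_grad[0])
w_grad_outer = [[sum(inp[s][i] * out_grad[s][j] for s in range(n)) for j in range(nh)]
                for i in range(m)]

# Version 2: a single matrix product, inp^T @ out_grad.
w_grad_matmul = matmul(transpose(inp), out_grad)
print(w_grad_outer)
print(w_grad_matmul)
```

In the tensor version, the broadcast form first materializes all n outer products as an (n, m, nh) tensor before summing, while the matmul form never builds that intermediate.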


In [52]:
test_near(w1_2.grad, w1.grad)
test_near(w2_2.grad, w2.grad)
test_near(b1_2.grad, b1.grad)
test_near(b2_2.grad, b2.grad)
test_near(x_train_2.grad, x_train.grad)

# Lesson 9 

# How to do Research

- If you see something weird in PyTorch code, don't assume it's correct. 
- When it comes to deep learning, no one fully knows what they're doing. A lot of code is the way it is for legacy reasons and may no longer be a good idea, so it's important to question the status quo.
- Re-implement from scratch, play with it, and run experiments to see if it actually gives better results (as Jeremy did in the 02a and 02b notebooks). If not, reach out to the team and ask why. You can publish your results by converting your Jupyter Notebook into a Gist with Gist It.


## Cross Entropy Loss

In [53]:
# use nn.Linear and don't write how to deal with loss into the model yet
class Model(nn.Module):
  def __init__(self, input_dim, hidden_dim, output_dim):
    super().__init__()
    self.layers = [nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, output_dim)]
  
  def __call__(self, x):
    for layer in self.layers:
      x = layer(x)
    return x

In [54]:
int(c)

10

In [55]:
model = Model(m, nh, int(c))

To get the cross entropy loss, we first pass the raw model outputs through a softmax to turn them into probabilities p(x). Then we compute $$ -\sum x\, \log p(x) $$ where x is the one-hot encoded target.
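
A worked example for a single image with three classes, in plain Python. The `logits` values are made up for illustration:

```python
import math

def softmax(logits):
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """-sum_i x_i * log(p_i), with x one-hot at target_index."""
    probs = softmax(logits)
    one_hot = [1.0 if i == target_index else 0.0 for i in range(len(logits))]
    return -sum(x * math.log(p) for x, p in zip(one_hot, probs))

logits = [2.0, 1.0, 0.1]   # made-up raw model outputs for three classes
target = 0
loss = cross_entropy(logits, target)
shortcut = -math.log(softmax(logits)[target])  # only the target term survives
print(loss, shortcut)
```

Because x is one-hot, every term of the sum except the target class is multiplied by 0. The loss is simply minus the log-probability the model assigns to the correct class, here about 0.417.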


In [56]:
def log_softmax(x): return (x.exp() / x.exp().sum(-1, keepdim=True)).log()

In [57]:
pred = model(x_train) # raw model outputs
pred = log_softmax(pred) # log of the softmax probabilities

Let's look at the targets for the first 3 images

In [58]:
y_train[:3]

tensor([5., 0., 4.])

And our model's predictions for them (keep in mind that only the target class matters, since we're dealing with one-hot vectors here)

In [59]:
print(pred[0][5])
print(pred[1][0])
print(pred[2][4])

tensor(-2.5021, grad_fn=<SelectBackward>)
tensor(-2.1583, grad_fn=<SelectBackward>)
tensor(-2.4839, grad_fn=<SelectBackward>)


In [60]:
# numpy / pytorch indexing trick to get them in one go; these are the entries of the log-softmax (log p(x)) that matter
pred[[0,1,2],[5,0,4]]

tensor([-2.5021, -2.1583, -2.4839], grad_fn=<IndexBackward>)

In [61]:
y_train.long()

tensor([5, 0, 4,  ..., 8, 4, 8])

In [62]:
# negative log likelihood loss
# pred[range(target.shape[0]), target] selects the predicted log-likelihoods for the correct classes; multiplied by the 1 in the one-hot target (and negated), that is the loss for each image.
# You then average all images. 
# target is a vector
def nll(pred, target):
  return - pred[range(target.shape[0]), target.long()].mean()

### Optimization for Numerical Stability

Computers cannot do perfect math: floating point divisions are only accurate to a certain number of digits, and two very large but different numbers can become indistinguishable. There are mathematical tricks we can use to make the computation more stable.

Here we optimize log_softmax in two ways

**1. Using the formula** $$\log \left ( \frac{a}{b} \right ) = \log(a) - \log(b)$$ 


**2. Use the LogSumExp trick** 

Reference to this great [article](https://nhigham.com/2021/01/05/what-is-the-log-sum-exp-function/).

Basically 

$$LSE \left (x \right ) = \log \left ( \sum_{j=1}^{n} e^{x_{j}} \right )$$ is called the Log Sum Exponential - it is useful because it approximates the max function max(x), but it is smooth / differentiable at all points unlike the max function. Note its property (it is very close to the max(x), bounded by a difference of log(n)): 

$$ \max \left (x \right ) \le LSE \left (x \right ) \le \max \left (x \right ) + \log \left (n \right )$$ 

which is derived by taking log of $$ e^{\max \left (x \right )} \le \sum_{j=1}^{n} e^{x_{j}} \le n e^{\max \left (x \right )}$$

In fact, if x = [0, t], LSE approximates the ReLU, which is max(t, 0). LSE([0, t]), which is just log(1 + e^t), is known as the softplus function. It is a smooth approximation of ReLU.
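
Both properties are easy to check numerically. A small sketch in plain Python, with arbitrary example values:

```python
import math

def lse(xs):
    """Log-sum-exp, written directly (fine for small inputs)."""
    return math.log(sum(math.exp(v) for v in xs))

xs = [1.0, 3.0, 2.5]  # arbitrary example values
lower, value, upper = max(xs), lse(xs), max(xs) + math.log(len(xs))
print(lower, value, upper)  # lower <= value <= upper

def softplus(t):
    return lse([0.0, t])  # log(1 + e^t)

for t in [-5.0, 0.0, 5.0]:
    print(t, max(t, 0.0), softplus(t))  # t, relu(t), softplus(t)
```

Far from 0, softplus is almost indistinguishable from ReLU. The largest gap is at t = 0, where softplus gives log(2) ≈ 0.693 instead of 0.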

Now, you shouldn't type $$\sum_{j=1}^{n} e^{x_{j}}$$ directly in PyTorch because **if x is big, you can get numerical overflow very fast: in float32, e^x already overflows once x exceeds about 88. Watch out for overflows any time you see exponentials!**

So we use this trick:

$$\log \left ( \sum_{j=1}^{n} e^{x_{j}} \right ) = \log \left ( e^{a} \sum_{j=1}^{n} e^{x_{j}-a} \right ) = a + \log \left ( \sum_{j=1}^{n} e^{x_{j}-a} \right )$$

where a is the maximum of the $x_{j}$.

Note that every exponent is now at most 0, so nothing can overflow, and any underflows are harmless.
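
A plain-Python sketch of the trick. `lse_naive` and `lse_stable` are names made up for this note, and the inputs are deliberately extreme:

```python
import math

def lse_naive(xs):
    return math.log(sum(math.exp(v) for v in xs))

def lse_stable(xs):
    a = max(xs)  # shift by the maximum so every exponent is <= 0
    return a + math.log(sum(math.exp(v - a) for v in xs))

small = [1.0, 2.0, 3.0]
big = [1000.0, 1001.0]

print(lse_naive(small), lse_stable(small))  # the two agree on small inputs

try:
    lse_naive(big)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True   # math.exp(1000) is too large for a float

print(naive_overflowed, lse_stable(big))
```

Python's `math.exp` raises an exception on overflow. In PyTorch the same overflow shows up as `inf` instead, which then silently poisons everything computed from it.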


In [132]:
def test_log_softmax_equal(func1, func2):
  loss1 = nll(func1(model(x_train)), y_train)
  loss2 = nll(func2(model(x_train)), y_train)
  test_near(loss1, loss2)

In [133]:
# first optimization, avoid a lot of divisions (and thus inaccuracies)
def log_softmax_1(x): 
  # the first term is really x.exp().log() but that's just x
  return x - (x.exp().sum(-1, keepdim=True)).log()

In [134]:
def logsumexp(x):
  a = x.max(-1).values
  return a + (x-a.unsqueeze(-1)).exp().sum(-1).log()
# test our logsumexp against pytorch's
pred = model(x_train)
test_near(logsumexp(pred), pred.logsumexp(-1))

# second optimization. As mentioned above, to avoid overflow, we do second optimization with LSE (log sum exponential) 
def log_softmax_2(x):
  lse = x.logsumexp(-1) # use PyTorch's native logsumexp
  return x - lse.unsqueeze(-1)

In [146]:
# test the 2 optimized log_softmax
test_log_softmax_equal(log_softmax, log_softmax_1)
test_log_softmax_equal(log_softmax, log_softmax_2)

# test our nll against pytorch's implementation
pred = model(x_train)
our_logsoftmax_nll_loss = nll(log_softmax(pred), y_train)
torch_logsoftmax_nll_loss = F.nll_loss(F.log_softmax(pred, -1), y_train.long())
torch_crossentropy_loss = F.cross_entropy(pred, y_train.long())

test_near(our_logsoftmax_nll_loss, torch_logsoftmax_nll_loss)
test_near(torch_logsoftmax_nll_loss, torch_crossentropy_loss)

Checkpoint Lesson 9 42:25