<center>

# Understanding `nn.Module`
</center>

> Jeremy Howard's explanation of `nn.Module` in PyTorch from the fast.ai course; clear and easy to follow.
> Original tutorial: https://pytorch.org/tutorials/beginner/nn_tutorial.html

# What is `torch.nn` ?

## 1. Load Data
> MNIST is used as the test data.

In [6]:
from pathlib import Path
import torch

In [2]:
PATH = Path("/root/.fastai/data/mnist/")
FILENAME = "mnist.pkl.gz"

In [3]:
import pickle
import gzip

with gzip.open((PATH/FILENAME).as_posix(),"rb") as f:
    ((x_train, y_train), (x_valid, y_valid),_) = pickle.load(f, encoding="latin-1")

In [4]:
x_train.shape, y_train.shape

((50000, 784), (50000,))

In [7]:
x_train, y_train, x_valid, y_valid = map(torch.tensor, (x_train, y_train, x_valid, y_valid))

In [9]:
y_train.unique()

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [10]:
N = x_train.shape[0]

### Parameters used throughout this notebook

In [24]:
bs = 64
epochs = 2
lr = 0.5
M = x_train.shape[1]

## 2. A first neural network built entirely by hand

Initialize the parameters, using Xavier initialization for `w`.

In [12]:
import math
w = torch.randn((M,10)) / math.sqrt(M)
w.requires_grad_()
b = torch.zeros(10, requires_grad=True)

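> Note the use of `requires_grad` here: it marks whether gradients should be computed for a tensor. Frozen layers need extra handling, since their parameters must have `requires_grad` set to `False`.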
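As a minimal sketch of the frozen-layer case mentioned above, the snippet below freezes the first layer of a small network and checks which parameters receive gradients. The two-layer `net` is a hypothetical stand-in and is not part of this notebook.

```python
import torch
from torch import nn

# Hypothetical two-layer network, used only to illustrate freezing.
net = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# Freeze the first layer: autograd stops computing gradients for it.
for p in net[0].parameters():
    p.requires_grad_(False)

out = net(torch.randn(5, 4)).sum()
out.backward()

frozen_has_grad = net[0].weight.grad is not None
trainable_has_grad = net[2].weight.grad is not None
print(frozen_has_grad, trainable_has_grad)  # False True
```

When an optimizer is used, a common companion step is to pass it only the parameters that still require gradients.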

In [13]:
w.shape, b.shape

(torch.Size([784, 10]), torch.Size([10]))

Build a multi-class linear classifier.

In [15]:
def model(x):
    out = x@w + b
    log_softmax = out - out.exp().sum(-1).log().unsqueeze(-1)
    return log_softmax

Test the model on one batch of data.

In [17]:
xb = x_train[:bs]
preds = model(xb)
preds[0]

tensor([-2.0676, -2.5316, -2.4059, -2.7309, -2.4479, -2.2261, -2.3194, -2.1424,
        -1.9971, -2.3776], grad_fn=<SelectBackward>)

In [18]:
preds.shape

torch.Size([64, 10])

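> Each row of the output holds one sample's score for every class. Because the model ends with log-softmax, these scores are log-probabilities.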
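To check that the hand-written formula in `model` really produces log-probabilities, the sketch below compares it with `torch.log_softmax` on random scores. The `scores` tensor is a made-up stand-in for `x @ w + b`.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(4, 10)  # stand-in for x @ w + b on a batch of 4

# Same formula as in model(): subtract log(sum(exp(scores))) from each row.
manual = scores - scores.exp().sum(-1).log().unsqueeze(-1)
builtin = torch.log_softmax(scores, dim=-1)

matches = torch.allclose(manual, builtin, atol=1e-6)
row_sums = manual.exp().sum(-1)  # exponentiated rows are probabilities

print(matches)
print(row_sums)
```

The built-in version is numerically safer for large scores, because the hand-written one can overflow in `exp()`.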

Use the negative log-likelihood as the loss.

In [19]:
def loss_func(input, target):
    return -input[range(target.shape[0]), target].mean()

In [20]:
yb = y_train[:bs]
print(loss_func(preds, yb))

tensor(2.3624, grad_fn=<NegBackward>)


In [21]:
def accuracy(out, yb):
    preds = torch.argmax(out, dim=1)
    return (preds == yb).float().mean()

In [22]:
accuracy(preds, yb)

tensor(0.1562)

In [23]:
yb.shape

torch.Size([64])

### Test NLL function
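> How to read the expression: for each sample, pick the score at the position of the true label, average these scores, and negate the result. The smaller the loss, the better.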

In [None]:
a = torch.randn(3,3)

In [None]:
b1 = torch.tensor([0,1,0])

In [None]:
b1.shape

In [None]:
a[range(3),b1]

In [None]:
-a[range(3),b1].mean()
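The same indexing trick can be checked against PyTorch's built-in `F.nll_loss`. This is a small sketch with hand-picked probabilities (assumed values, chosen so the result is easy to verify by hand).

```python
import torch
import torch.nn.functional as F

# Log-probabilities for 3 samples over 3 classes (illustrative values).
log_probs = torch.log(torch.tensor([[0.7, 0.2, 0.1],
                                    [0.1, 0.8, 0.1],
                                    [0.5, 0.3, 0.2]]))
target = torch.tensor([0, 1, 0])

# Row i, column target[i]: the log-probability assigned to the true class.
picked = log_probs[range(3), target]   # log(0.7), log(0.8), log(0.5)
manual_nll = -picked.mean()
builtin_nll = F.nll_loss(log_probs, target)

print(manual_nll, builtin_nll)
```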

### All together

In [25]:
model = model

for epoch in range(epochs):
    for i in range((N-1)//bs + 1):
        s_i = i*bs
        e_i = s_i + bs
        xb  = x_train[s_i:e_i]
        yb = y_train[s_i:e_i]
        preds = model(xb)
        loss = loss_func(preds, yb)
        loss.backward()

        with torch.no_grad():
            w -= w.grad * lr
            b -= b.grad * lr
            w.grad.zero_()
            b.grad.zero_()
    print("epoch %d with loss %f" %(epoch, loss))

epoch 0 with loss 0.402983
epoch 1 with loss 0.312470


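> 1. The gradients of `w` and `b` are zeroed so that they do not accumulate into the gradients of later iterations. 2. The update runs inside `torch.no_grad()` for a similar reason: the parameter update itself must not be recorded for the next gradient computation.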
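Both points can be seen on a tiny tensor. In this sketch `w_demo` is a hypothetical parameter, separate from the notebook's `w`.

```python
import torch

w_demo = torch.tensor([1.0, 2.0], requires_grad=True)

(w_demo ** 2).sum().backward()
first = w_demo.grad.clone()         # d/dw sum(w^2) = 2w -> [2, 4]

(w_demo ** 2).sum().backward()      # no zeroing: the new gradient is added
accumulated = w_demo.grad.clone()   # [4, 8]

w_demo.grad.zero_()
(w_demo ** 2).sum().backward()
after_zero = w_demo.grad.clone()    # back to [2, 4]

# The in-place update is done under no_grad so autograd does not track it.
with torch.no_grad():
    w_demo -= 0.1 * w_demo.grad     # [1 - 0.2, 2 - 0.4]

print(first, accumulated, after_zero, w_demo)
```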

In [26]:
print(loss_func(model(xb),yb), accuracy(model(xb),yb))

tensor(0.0817, grad_fn=<NegBackward>) tensor(1.)


## Update #1: Use the built-in loss function and parameter methods

In [27]:
import torch.nn.functional as F
loss_func = F.cross_entropy

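> NLL ---> cross entropy loss: `F.cross_entropy` combines log-softmax and negative log-likelihood in one call, so the model can return raw scores.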
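A short sketch of that relationship, on random stand-in data: `F.cross_entropy` on raw scores equals `F.nll_loss` on log-softmax output. It also shows why the hand-written `model`, which already applies log-softmax, still yields the same loss value: applying log-softmax to log-probabilities leaves them unchanged.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)              # raw scores (illustrative)
labels = torch.randint(0, 10, (8,))

two_step = F.nll_loss(F.log_softmax(logits, dim=1), labels)
one_step = F.cross_entropy(logits, labels)

# Feeding log-probabilities into cross_entropy gives the same value,
# because log_softmax of log-probabilities is the identity.
on_log_probs = F.cross_entropy(F.log_softmax(logits, dim=1), labels)

print(two_step, one_step, on_log_probs)
```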

In [28]:
print(loss_func(model(xb),yb))

tensor(0.0817, grad_fn=<NllLossBackward>)


In [29]:
from torch import nn
import math

In [30]:
class No_1_LG(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(784,10)/math.sqrt(784))
        self.b = nn.Parameter(torch.zeros(10))
        
    def forward(self, xb):
        return xb @ self.w + self.b

In [31]:
model = No_1_LG()

In [32]:
model.w.shape

torch.Size([784, 10])

In [33]:
print(loss_func(model(xb),yb))

tensor(2.2939, grad_fn=<NllLossBackward>)


In [34]:
def fit():
    for epoch in range(epochs):
        for i in range((N-1)//bs + 1):
            s_i = i*bs
            e_i = s_i + bs
            xb  = x_train[s_i:e_i]
            yb = y_train[s_i:e_i]
            preds = model(xb)
            loss = loss_func(preds, yb)
            acc = accuracy(preds, yb)
            loss.backward()

            with torch.no_grad():
                for p in model.parameters():
                    p -= p.grad * lr
                model.zero_grad()
        print("epoch %d with loss %f, accuracy %f" %(epoch, loss, acc))

In [35]:
fit()

epoch 0 with loss 0.399068, accuracy 0.937500
epoch 1 with loss 0.311083, accuracy 0.937500


## Update #2: Use `nn.Linear()`

In [36]:
class No_2_LN(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)
        
    def forward(self, xb):
        return self.lin(xb)

> The parameters are now managed by PyTorch; we only deal with inputs and outputs.
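To see what PyTorch is managing, the sketch below lists the parameters that `nn.Linear` registers. Note that the weight is stored as `(out_features, in_features)`, the transpose of the hand-written `w`. The names `lin` and `x` are local to this example.

```python
import torch
from torch import nn

lin = nn.Linear(784, 10)

# named_parameters() yields every registered Parameter with its name.
shapes = {name: tuple(p.shape) for name, p in lin.named_parameters()}
print(shapes)  # {'weight': (10, 784), 'bias': (10,)}

# The layer computes x @ weight.T + bias.
x = torch.randn(2, 784)
manual = x @ lin.weight.t() + lin.bias
same_output = torch.allclose(lin(x), manual, atol=1e-5)
```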

In [37]:
model = No_2_LN()
print(loss_func(model(xb), yb))

tensor(2.3847, grad_fn=<NllLossBackward>)


In [38]:
model.parameters

<bound method Module.parameters of No_2_LN(
  (lin): Linear(in_features=784, out_features=10, bias=True)
)>

In [39]:
fit()

epoch 0 with loss 0.406061, accuracy 0.937500
epoch 1 with loss 0.315676, accuracy 0.937500


## Update #3: optim

> We can use the `step` method of a PyTorch optimizer instead of updating every parameter by hand.
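For plain SGD without momentum, `step()` performs the same update as the manual loop `p -= p.grad * lr`. The sketch below checks this on a small hypothetical layer; `layer` and `sgd` are names local to this example.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
layer = nn.Linear(3, 1)
sgd = optim.SGD(layer.parameters(), lr=0.5)

loss = layer(torch.ones(4, 3)).pow(2).mean()
loss.backward()

# What the manual update would produce, computed before stepping.
expected = [p.detach() - 0.5 * p.grad for p in layer.parameters()]

sgd.step()       # updates every parameter in place
sgd.zero_grad()  # clears gradients for the next iteration

same = all(torch.allclose(p, e) for p, e in zip(layer.parameters(), expected))
print(same)
```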

In [40]:
from torch import optim

In [41]:
model = No_2_LN()
opt = optim.SGD(model.parameters(), lr=lr)
def fit():
    for epoch in range(epochs):
        for i in range((N-1)//bs + 1):
            s_i = i*bs
            e_i = s_i + bs
            xb  = x_train[s_i:e_i]
            yb = y_train[s_i:e_i]
            preds = model(xb)
            loss = loss_func(preds, yb)
            acc = accuracy(preds, yb)
            
            loss.backward()
            opt.step()
            opt.zero_grad()
            
        print("epoch %d with loss %f, accuracy %f" %(epoch, loss, acc))

In [42]:
fit()

epoch 0 with loss 0.404722, accuracy 0.937500
epoch 1 with loss 0.314789, accuracy 0.937500


## Update #4: Dataset and DataLoader

In [43]:
from torch.utils.data import TensorDataset, DataLoader

In [44]:
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs)

In [45]:
model = No_2_LN()
opt = optim.SGD(model.parameters(), lr=lr)
def fit():
    for epoch in range(epochs):
        for xb,yb in train_dl:
            preds = model(xb)
            loss = loss_func(preds, yb)
            acc = accuracy(preds, yb)
            
            loss.backward()
            opt.step()
            opt.zero_grad()
            
        print("epoch %d with loss %f, accuracy %f" %(epoch, loss, acc))

In [46]:
fit()

epoch 0 with loss 0.393464, accuracy 0.937500
epoch 1 with loss 0.307258, accuracy 0.937500


## Update #5: Add validation Set

In [47]:
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
val_ds = TensorDataset(x_valid, y_valid)
val_dl = DataLoader(val_ds, batch_size=bs*2)

In [48]:
model = No_2_LN()
opt = optim.SGD(model.parameters(), lr=lr)
def fit():
    for epoch in range(epochs):
        model.train()
        for xb,yb in train_dl:
            preds = model(xb)
            loss = loss_func(preds, yb)
            acc = accuracy(preds, yb)
            
            loss.backward()
            opt.step()
            opt.zero_grad()
        model.eval()
        with torch.no_grad():
            valid_loss = sum(loss_func(model(xb),yb) for xb, yb in val_dl )
            
        print("epoch %d with Train_loss %f | Val_Loss %f" %(epoch, loss, valid_loss/len(val_dl)))

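> The validation set needs no backpropagation, so no gradient information has to be stored. That frees memory, which is why a larger batch size can be used.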
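One detail worth noting: dividing the summed batch losses by `len(val_dl)` averages the batch means, which is slightly off when the last batch is smaller than the others. A possible refinement, sketched here on toy data (the names `toy_model` and `toy_dl` are assumptions for this example), weights each batch loss by its size.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
xs, ys = torch.randn(10, 5), torch.randint(0, 3, (10,))
toy_dl = DataLoader(TensorDataset(xs, ys), batch_size=4)  # batches of 4, 4, 2
toy_model = torch.nn.Linear(5, 3)

with torch.no_grad():
    results = [(F.cross_entropy(toy_model(xb), yb), len(xb)) for xb, yb in toy_dl]
    exact = F.cross_entropy(toy_model(xs), ys)  # loss over the whole set

losses, sizes = zip(*results)
mean_of_means = sum(losses) / len(losses)
weighted = sum(l * n for l, n in zip(losses, sizes)) / sum(sizes)

print(mean_of_means, weighted, exact)
```

The weighted average matches the loss over the full set; the plain mean of batch means generally does not.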

In [None]:
fit()

### Wrap Everything

In [None]:
def fit(model, epochs, loss_func, opt, train_dl, val_dl):
    
    for epoch in range(epochs):
        model.train()
        for xb,yb in train_dl:
            preds = model(xb)
            loss = loss_func(preds, yb)
            #acc = accuracy(preds, yb)
            if opt is not None:
                loss.backward()
                opt.step()
                opt.zero_grad()
            
        model.eval()
        with torch.no_grad():
            valid_loss = sum(loss_func(model(xb),yb) for xb, yb in val_dl)/len(val_dl)
            
        print("epoch %d with Train_loss %f | Val_Loss %f" %(epoch, loss, valid_loss))

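> `model.train()` and `model.eval()` act as mode switches that state which phase the following code belongs to. Layers such as BatchNorm and Dropout behave differently in training and evaluation, so they need this declaration.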
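The effect of the two modes is easy to observe with a standalone `nn.Dropout` layer. This is a minimal sketch that is not tied to the models in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # entries are zeroed at random, survivors scaled by 1/(1-p) = 2

drop.eval()
eval_out = drop(x)   # dropout is disabled: the input passes through unchanged

print(train_out.unique(), torch.equal(eval_out, x))
```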

In [49]:
def get_data(train_ds, val_ds , bs):
    return(
        DataLoader(train_ds, batch_size=bs, shuffle=True),
        DataLoader(val_ds, batch_size=bs*2),
    )

In [None]:
train_dl, valid_dl = get_data(train_ds, val_ds, bs)
model = No_2_LN()
opt = optim.SGD(model.parameters(), lr=lr)
fit(model, epochs, loss_func, opt, train_dl, valid_dl)

## Update #6: CNN

In [None]:
class No_3_CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1,16,kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(16,16,kernel_size=3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(16,10,kernel_size=3, stride=2, padding=1)
        #self.lin = nn.Linear(784, 10)
        
    def forward(self, xb):
        xb = xb.view(-1,1,28,28)
        xb = F.relu(self.conv1(xb))
        xb = F.relu(self.conv2(xb))
        xb = F.relu(self.conv3(xb))
        xb = F.avg_pool2d(xb, 4)
        return xb.view(-1, xb.size(1))

> `view` in PyTorch plays the same role as `reshape` in NumPy.
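A small sketch of how `view` behaves: it returns a tensor that shares memory with the original, and it requires a compatible memory layout. For tensors where that does not hold (for example after a transpose), `reshape` can be used instead, since it copies when needed. The tensors here are illustrative.

```python
import torch

flat = torch.arange(6.0)
grid = flat.view(2, 3)
grid[0, 0] = 100.0                  # writes through to flat: memory is shared
shares_memory = flat[0].item() == 100.0

# -1 lets PyTorch infer the batch dimension, as in forward().
batch = torch.zeros(5, 784).view(-1, 1, 28, 28)

# A transposed tensor is not contiguous, so view() refuses to flatten it.
t = grid.t()
try:
    t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True
flattened = t.reshape(6)            # reshape copies when a view is impossible

print(shares_memory, tuple(batch.shape), view_failed)
```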

In [None]:
lr = 0.1
model = No_3_CNN()
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

train_dl, valid_dl = get_data(train_ds, val_ds, bs)

In [None]:
fit(model, epochs, loss_func, opt, train_dl, valid_dl)

### Wrap a function as a PyTorch layer

In [None]:
class Lambda(nn.Module):
    def __init__(self, func):
        super().__init__()
        self.func = func
    def forward(self, x):
        return self.func(x)    

In [None]:
def preprocess(x):
    return x.view(-1, 1, 28 ,28)

Define the model with `nn.Sequential`

In [None]:
model = nn.Sequential(
    Lambda(preprocess),
    nn.Conv2d(1,16,kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16,16,kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16,10,kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AvgPool2d(4),
    Lambda(lambda x:x.view(x.size(0), -1))
)

In [None]:
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
fit(model, 1, loss_func, opt, train_dl, valid_dl)

## Update #7: Wrapping DataLoader

In [None]:
def preprocess(x,y):
    return x.view(-1,1,28,28), y

In [None]:
class W_DataLoader:
    def __init__(self, dl, func):
        self.dl = dl
        self.func = func
        
    def __len__(self):
        return len(self.dl)
    
    def __iter__(self):
        batches = iter(self.dl)
        for b in batches:
            yield (self.func(*b))

In [None]:
train_dl, valid_dl = get_data(train_ds, val_ds, bs)
train_dl = W_DataLoader(train_dl, preprocess)
valid_dl = W_DataLoader(valid_dl, preprocess)

In [None]:
len(train_dl)

## Update #8: Update Model
> Replacing `nn.AvgPool2d` with `nn.AdaptiveAvgPool2d` lets the model work with input of any size.

In [None]:
model = nn.Sequential(
    nn.Conv2d(1,16,kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16,16,kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16,10,kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    Lambda(lambda x:x.view(x.size(0), -1))
)
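To illustrate why the adaptive pooling layer makes the model independent of input size, the sketch below runs 28x28 and 64x64 inputs through both variants. `make_body` is a hypothetical helper, and `nn.Flatten()` stands in for the `Lambda` flattening layer so that the example is self-contained.

```python
import torch
from torch import nn

def make_body():
    # Same three stride-2 convolutions as the model above.
    return [
        nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    ]

fixed = nn.Sequential(*make_body(), nn.AvgPool2d(4), nn.Flatten())
adaptive = nn.Sequential(*make_body(), nn.AdaptiveAvgPool2d(1), nn.Flatten())

small = torch.randn(2, 1, 28, 28)   # 28 -> 14 -> 7 -> 4 after the convolutions
large = torch.randn(2, 1, 64, 64)   # 64 -> 32 -> 16 -> 8 after the convolutions

shapes = {
    "fixed_28": tuple(fixed(small).shape),
    "fixed_64": tuple(fixed(large).shape),
    "adaptive_28": tuple(adaptive(small).shape),
    "adaptive_64": tuple(adaptive(large).shape),
}
print(shapes)
```

With the fixed 4x4 pooling, the larger input leaves a 2x2 map per channel and the flattened output no longer has 10 columns; the adaptive layer always pools down to 1x1.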

In [None]:
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

In [None]:
fit(model, 1, loss_func, opt, train_dl, valid_dl)

## Update #9: Using GPU

In [None]:
torch.cuda.is_available()

In [None]:
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

> Update preprocess function

In [None]:
def preprocess(x, y):
    return x.view(-1, 1, 28, 28).to(dev), y.to(dev)

In [None]:
train_dl, valid_dl = get_data(train_ds, val_ds, bs)
train_dl = W_DataLoader(train_dl, preprocess)
valid_dl = W_DataLoader(valid_dl, preprocess)

In [None]:
model.to(dev)
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
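A point that is easy to miss: `Module.to` moves the parameters in place, while `Tensor.to` returns a new tensor that has to be reassigned (which is why `preprocess` returns the moved tensors). A minimal sketch, using a hypothetical `net` and falling back to the CPU when no GPU is present:

```python
import torch
from torch import nn

dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

net = nn.Linear(784, 10)
net.to(dev)                        # parameters are moved in place

xb = torch.randn(4, 784)
xb = xb.to(dev)                    # tensors must be reassigned

param_device = next(net.parameters()).device
out = net(xb)                      # works because both live on the same device
print(param_device, out.device)
```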

In [None]:
# do fit and test
fit(model, epochs, loss_func, opt, train_dl, valid_dl)

## Summary

 - **torch.nn**

   + ``Module``: creates a callable which behaves like a function, but can also
     contain state (such as neural net layer weights). It knows what ``Parameter``(s) it
     contains and can zero all their gradients, loop through them for weight updates, etc.
   + ``Parameter``: a wrapper for a tensor that tells a ``Module`` that it has weights
     that need updating during backprop. Only tensors with the `requires_grad` attribute set are updated.
   + ``functional``: a module (usually imported into the ``F`` namespace by convention)
     which contains activation functions, loss functions, etc., as well as non-stateful
     versions of layers such as convolutional and linear layers.
 - ``torch.optim``: contains optimizers such as ``SGD``, which update the weights
   of each ``Parameter`` when ``step()`` is called after the backward pass.
 - ``Dataset``: an abstract interface of objects with a ``__len__`` and a ``__getitem__``,
   including classes provided with PyTorch such as ``TensorDataset``.
 - ``DataLoader``: takes any ``Dataset`` and creates an iterator which returns batches of data.
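As a sketch of the ``Dataset`` interface described above, any class with ``__len__`` and ``__getitem__`` can be handed to a ``DataLoader``. ``SquaresDataset`` is a made-up example class, not something from the tutorial.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Toy dataset: item i is the pair (i, i squared)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        x = torch.tensor([float(idx)])
        return x, x ** 2

ds = SquaresDataset(10)
dl = DataLoader(ds, batch_size=4)

# The DataLoader stacks individual items into batches of 4, 4 and 2.
batch_shapes = [tuple(xb.shape) for xb, yb in dl]
print(len(ds), len(dl), batch_shapes)
```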