<a href="https://colab.research.google.com/github/asrjy/d2l-notes/blob/master/Chapter%206%20-%20Convolutional%20Neural%20Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convolutional Neural Networks

Flattening an image into a vector discards the spatial relationships between its pixels. To overcome this issue, we use CNNs instead of regular MLPs.

Because MLPs ignore spatial structure, they give equivalent results even if the pixels of every image are permuted in the same fixed way. This is not what we want.

Modern CNNs tend to be computationally efficient, require fewer parameters than fully connected networks, and are easy to parallelize across GPU cores. CNNs have also shown good performance on one-dimensional sequence data such as audio and time series, where recurrent neural networks are conventionally used.

## From Fully Connected Layers to Convolutions

For low-dimensional data where we lack the knowledge to construct architectures that exploit specific patterns of interaction among the features, an MLP may be the best we can do. However, for high-dimensional perceptual data, such structureless networks can grow unwieldy.

Convolutional Neural Networks are one way to exploit the structure in natural images. 

### Invariance

Spatial invariance allows CNNs to learn useful representations with fewer parameters. In essence, the model should recognize a pig even if it appears in the sky, or a plane even if it appears in water. Where the target is located does not matter; its presence is what we look for.

These ideas lead to the following design principles.

1 - In the earliest layers, our network should respond similarly to the same patch, regardless of where the patch is located in the image. A patch here means the part of the image that the network operates on. This principle is called translation invariance (or translation equivariance).

2 - The earliest layers of the network should focus on local regions, without regard for the contents of distant regions. This is called the locality principle. Eventually, these local representations are aggregated to make predictions about the image as a whole.

3 - As we proceed, deeper layers should capture longer-range features of the image, similar to higher-level vision in nature.

### Constraining the MLP

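Assume the image is represented as a two-dimensional tensor X and the hidden representation as a tensor H of the same shape, where (i, j) indexes a location in either one. A fully connected layer then needs a fourth-order weight tensor W and a bias U. Re-indexing with k = i + a and l = j + b, so that V(i, j, a, b) = W(i, j, i+a, j+b), each element of the hidden representation is calculated by summing over the pixels of X around (i, j), weighted by V(i, j, a, b):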

$$\begin{aligned} [\mathbf{H}]_{i, j} &= [\mathbf{U}]_{i, j} + \sum_k \sum_l [\mathsf{W}]_{i, j, k, l} [\mathbf{X}]_{k, l} \\ &= [\mathbf{U}]_{i, j} + \sum_a \sum_b [\mathsf{V}]_{i, j, a, b} [\mathbf{X}]_{i+a, j+b}. \end{aligned}$$

The indices a, b run over both positive and negative offsets, covering the entire image.

#### Translation Invariance

According to this principle, a shift in X should simply lead to a shift in the hidden representation H. This is only possible if V and U do not depend on (i, j). As a result, we write V(i, j, a, b) as just V(a, b), and U becomes a constant u.

Now the simplified representation is 

$[\mathbf{H}]_{i, j} = u + \sum_a\sum_b [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}$.

The above operation is called a convolution: we are effectively weighting the pixels at (i+a, j+b) in the vicinity of (i, j) with the coefficients V(a, b) to obtain the hidden representation.

V(a, b) requires far fewer parameters than V(i, j, a, b) since it no longer depends on i and j. 
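The shift behaviour that motivates the shared kernel can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper name `shared_kernel_layer`, the 10 x 10 image, and the shift of (2, 1) are all illustrative assumptions, and the offsets a, b start at 0 rather than being centered on (i, j).

```python
import torch

def shared_kernel_layer(X, V, u=0.0):
    # H[i, j] = u + sum_{a, b} V[a, b] * X[i + a, j + b], evaluated only where
    # the kernel fits inside X (offsets a, b start at 0 here for simplicity).
    kh, kw = V.shape
    H = torch.zeros(X.shape[0] - kh + 1, X.shape[1] - kw + 1)
    for i in range(H.shape[0]):
        for j in range(H.shape[1]):
            H[i, j] = u + (V * X[i:i + kh, j:j + kw]).sum()
    return H

torch.manual_seed(0)
V = torch.rand(3, 3)  # one kernel shared by every location

# A small pattern placed well inside an otherwise empty 10 x 10 image.
X = torch.zeros(10, 10)
X[3:5, 3:5] = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Shift the pattern down by 2 rows and right by 1 column.
X_shifted = torch.roll(X, shifts=(2, 1), dims=(0, 1))

H = shared_kernel_layer(X, V)
H_shifted = shared_kernel_layer(X_shifted, V)

# The hidden representation moves by the same amount as the input.
print(torch.allclose(H_shifted, torch.roll(H, shifts=(2, 1), dims=(0, 1))))
```

The pattern is kept away from the border so that `torch.roll` does not wrap any nonzero values around; near the boundary the equality would only hold approximately.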

#### Locality

According to this principle, we should not have to look very far from location (i, j) to get relevant information about what is going on at (i, j). This means that outside some range, |a| > Δ or |b| > Δ, we set V(a, b) = 0.

Now the hidden representation becomes, 

$[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}$.

The value of delta is typically smaller than 10. 

A layer that computes the above equation is called a convolutional layer. V is referred to as the convolution kernel or filter.

Without this layer, a single megapixel image would require billions of parameters (or more), but with the convolutional layer we require only a few hundred, without altering the dimensionality of either the inputs or the hidden representations.
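To make the savings concrete, here is a rough count. It is only a sketch under stated assumptions: the hidden representation is taken to have the same 1000 x 1000 shape as the input, and Δ = 5 is an illustrative value (the notes only say that Δ is typically smaller than 10).

```python
# Rough parameter counts for one layer on a 1000 x 1000 (one-megapixel) image.
# Assumption: the hidden representation has the same shape as the input.
h = w = 1000
delta = 5  # illustrative choice; any small value gives a similar picture

# Fully connected: one weight per (hidden location, input pixel) pair.
fully_connected = (h * w) * (h * w)

# Convolutional: one shared (2*delta + 1) x (2*delta + 1) kernel plus the bias u.
convolutional = (2 * delta + 1) ** 2 + 1

print(f"fully connected: {fully_connected:.1e} weights")
print(f"convolutional:   {convolutional} parameters")
```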

The cost of this reduction in parameters is that we only capture local information when determining the value of each hidden activation.

This inductive bias might not always agree with reality; if the images are not translation invariant, the model may struggle to fit the data.

### Convolutions

In mathematics, a convolution operation between two functions is the measure of overlap between f and g when g is flipped and shifted by x. 

$(f * g)(\mathbf{x}) = \int f(\mathbf{z}) g(\mathbf{x}-\mathbf{z}) d\mathbf{z}.$

When we are dealing with discrete objects, the integral becomes a sum.

$(f * g)(i) = \sum_a f(a) g(i-a).$

For two dimensional tensors, we have corresponding indices a and b for i and j. 

$(f * g)(i, j) = \sum_a\sum_b f(a, b) g(i-a, j-b).$

This is similar to the operation we arrived at before, except that it uses the differences (i-a, j-b) where ours uses the sums (i+a, j+b). The more proper name for the equation we got before is cross-correlation.
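The difference between the two operations is easiest to see in one dimension. The sketch below uses plain Python lists; the function names and the sample values are illustrative assumptions, and both functions only evaluate positions where the kernel fits entirely inside the input.

```python
def cross_correlate_1d(x, k):
    # y[i] = sum_a k[a] * x[i + a]
    n = len(x) - len(k) + 1
    return [sum(k[a] * x[i + a] for a in range(len(k))) for i in range(n)]

def convolve_1d(x, k):
    # Strict convolution over the same positions: the kernel is traversed
    # in reverse, y[i] = sum_a k[a] * x[i + (len(k) - 1) - a].
    m = len(k)
    n = len(x) - m + 1
    return [sum(k[a] * x[i + m - 1 - a] for a in range(m)) for i in range(n)]

x = [0, 1, 2, 3, 4]
k = [1, 2, 3]

print(cross_correlate_1d(x, k))        # [8, 14, 20]
print(convolve_1d(x, k))               # [4, 10, 16]
print(cross_correlate_1d(x, k[::-1]))  # [4, 10, 16]
```

Convolving with k gives the same result as cross-correlating with the flipped kernel, which is the only difference between the two operations.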

### Channels

To support multiple channels in both the inputs and the hidden representations, we add two more coordinates to V and one more to X. 

$[\mathsf{H}]_{i,j,d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_c [\mathsf{V}]_{a, b, c, d} [\mathsf{X}]_{i+a, j+b, c},$

where d indexes the output channels in the hidden representation. 

### Exercises 

1 - Assume that the size of the convolution kernel is $\Delta = 0$. Show that in this case the convolution kernel implements an MLP independently for each set of channels. This leads to the Network in Network architectures [Lin et al., 2013].

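Ans - With $\Delta = 0$ the sums over a and b collapse to the single term a = b = 0, so $[\mathsf{H}]_{i,j,d} = \sum_c [\mathsf{V}]_{0, 0, c, d} [\mathsf{X}]_{i, j, c}$. The neighbours of a pixel are not considered at all: the channel vector at each location is multiplied by the same weight matrix, which is exactly a fully connected layer applied independently at every location.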

2 - Audio data is often represented as a one-dimensional sequence. When might you want to impose locality and translation invariance for audio?

Ans - Imposing locality and translation invariance assumes that a sound pattern depends mostly on nearby samples and means the same thing regardless of when it occurs. This suits pattern recognition problems such as detecting a particular sound or keyword anywhere in a recording. It is less suitable when the meaning of a segment depends on long-range context or on its absolute position in time.

3 - Why might translation invariance not be a good idea after all? Give an example.

Ans - It is not a good idea when the meaning of a patch depends on its position. For example, when identifying a human face, if the positions of the eyes and mouth are exchanged, the image should probably no longer be classified as a face.

4 - Do you think that convolutional layers might also be applicable for text data? Which problems might you encounter with language?

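Ans - Convolutional layers assume translation invariance and locality. Applied to text as one-dimensional convolutions over the word sequence, they can pick up local patterns such as short phrases. The difficulties are that word order and position often change the meaning, and that words frequently depend on other words far away in the sentence, which a small local window cannot capture.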

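5 - What happens with convolutions when an object is at the boundary of an image?

Ans - The kernel cannot be centered on boundary pixels without reaching outside the image, so without padding the pixels near the border contribute to fewer outputs. The missing values around the border are supplied by padding, typically with zeros.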

## Convolutions for Images

### The Cross-Correlation Operation

If the input is of size $n_h \times n_w$ and the kernel is of size $k_h \times k_w$, the output after the cross-correlation operation is of size $(n_h-k_h+1) \times (n_w-k_w+1).$

This is because we need enough space to shift the kernel across the image. With suitable padding, the output keeps the size of the input, since the zeros added around the boundary give the kernel enough room to shift.

In [2]:
import torch
from torch import nn
from d2l import torch as d2l

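def corr2d(X, K):
  # 2D cross-correlation of input X with kernel K (no padding, stride 1)
  h, w = K.shape
  Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
  for i in range(Y.shape[0]):
    for j in range(Y.shape[1]):
      # Elementwise product of the current window with the kernel, then sum
      Y[i, j] = (X[i: i+h, j: j+w] * K).sum()
  return Y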

In [3]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)

tensor([[19., 25.],
        [37., 43.]])

### Convolutional Layers

A convolutional layer performs cross-correlation between the input and the kernel and adds a scalar bias to produce an output. The two parameters of a convolutional layer are the kernel and the bias. When training, we typically initialize the kernel randomly.

Below we implement a convolutional layer based on the corr2d function defined above.

In [5]:
class Conv2D(nn.Module):
  def __init__(self, kernel_size):
    super().__init__()
    self.weight = nn.Parameter(torch.rand(kernel_size))
    self.bias = nn.Parameter(torch.zeros(1))
  def forward(self, X):
    return corr2d(X, self.weight) + self.bias

### Object Edge Detection in Images

Detecting the edge of an object can be performed by finding the locations where the pixel values change.

In [6]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])

When we perform cross-correlation with a kernel of size 1x2, the output is 0 wherever the two horizontally adjacent elements in the patch are equal, i.e., at location (i, j) it calculates x(i, j) - x(i, j+1).

In [7]:
K = torch.tensor([[1.0, -1.0]])

In [8]:
Y = corr2d(X, K)
Y

tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])

In [9]:
corr2d(X.t(), K)

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])

This does not work because the kernel K can only detect vertical edges, not horizontal ones, and in the transposed image the edges are horizontal.
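One way to pick up the horizontal edges is to transpose the kernel as well. This is a small sketch rather than something from the original notes; `corr2d` is repeated here only so that the snippet runs on its own, and `Y_t` is a name introduced for illustration.

```python
import torch

def corr2d(X, K):
    # Same 2D cross-correlation as defined earlier in these notes.
    h, w = K.shape
    Y = torch.zeros(X.shape[0] - h + 1, X.shape[1] - w + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1.0, -1.0]])

# Transposing the image turns its vertical edges into horizontal ones, so the
# kernel is transposed too: it now compares each pixel with the one below it.
Y_t = corr2d(X.t(), K.t())
print(Y_t.shape)   # torch.Size([7, 6])
print(Y_t[:, 0])   # tensor([ 0.,  1.,  0.,  0.,  0., -1.,  0.])
```

Every column of `Y_t` is identical, so printing one column is enough to see the 1 and -1 responses at the two edges.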

### Learning a Kernel

Although the above approach is nice, it is hard to design every kernel by hand when working with bigger architectures. We want the kernels to be learned automatically from data.

First we construct a convolutional layer and initialize its kernel randomly. In each iteration, we use the squared error to compare Y with the output of the convolutional layer, then calculate the gradient to update the kernel.

In [10]:
# Convolutional layer with 1 output channel and a kernel of shape (1, 2). Ignoring the bias for now.
conv2d = nn.LazyConv2d(1, kernel_size = (1, 2), bias = False)

# (Example, Channel, Height, Width)
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2

for i in range(10):
  Y_hat = conv2d(X)
  l = (Y_hat - Y) ** 2
  conv2d.zero_grad()
  l.sum().backward()
  # Updating the kernel
  conv2d.weight.data[:] -= lr * conv2d.weight.grad
  if (i+1)%2 == 0:
    print(f"Epoch {i+1}, loss {l.sum():.3f}")

Epoch 2, loss 11.266
Epoch 4, loss 2.136
Epoch 6, loss 0.459
Epoch 8, loss 0.118
Epoch 10, loss 0.037




In [11]:
conv2d.weight.data.reshape((1, 2))

tensor([[ 1.0008, -0.9659]])

### Cross-Correlation and Convolution

To perform strict convolution instead of cross-correlation, we flip the kernel both horizontally and vertically and then perform cross-correlation with the input tensor.

Since kernels are learned from data, it does not matter whether a layer performs cross-correlation or strict convolution: the output of the layer remains the same.

That is, if a layer that performs cross-correlation learns the kernel K, then a layer that performs strict convolution learns a kernel K' that equals K once K' is flipped both horizontally and vertically.

### Feature Map and Receptive Field

The output of a convolutional layer is sometimes called a feature map. The receptive field of an element x in any layer is the set of all elements from all previous layers that can affect the calculation of x during forward propagation. The receptive field may be larger than the actual size of the input.
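The size of a receptive field can be computed layer by layer. The helper below is a hypothetical sketch, not taken from the notes: it tracks one spatial axis, takes each layer as a (kernel_size, stride) pair, and ignores padding.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of a single output element.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    size = 1  # a single element of the final layer
    jump = 1  # spacing, in input pixels, between neighbouring elements of the current layer
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
    return size

# Two stacked 2 x 2 convolutions with stride 1 see a 3 x 3 region of the input.
print(receptive_field([(2, 1), (2, 1)]))  # 3
# Three stacked 3 x 3 convolutions see a 7 x 7 region.
print(receptive_field([(3, 1)] * 3))      # 7
```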

## Padding and Stride 

Ten convolutional layers with 5x5 kernels, applied to a 240x240 image, reduce the output size to 200x200. Padding and strided convolutions offer more control over the size of the output.

### Padding

One straightforward solution to this issue is to add zeros around the image. If we add a total of $p_h$ rows of padding (roughly half on top and half on the bottom) and $p_w$ columns of padding (roughly half on the left and half on the right), the output shape is

$(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1).$

In many cases we set $p_h = k_h - 1$ and $p_w = k_w - 1$ to give the input and output the same height and width.

CNNs commonly use convolution kernels with odd height and width. With an odd kernel size, $k_h - 1$ and $k_w - 1$ are even, so the padding can be split equally between the two sides.
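The arithmetic above is easy to check in a few lines. This is a minimal sketch for one spatial axis with stride 1; the helper name `output_size` is an assumption introduced here, and `p` is the total padding along the axis, as in the formula.

```python
def output_size(n, k, p=0):
    # Output length along one axis for input length n, kernel length k and
    # p total rows (or columns) of padding, with stride 1: n - k + p + 1.
    return n - k + p + 1

# Ten unpadded 5 x 5 convolutions shrink a 240-pixel side to 200 pixels.
n = 240
for _ in range(10):
    n = output_size(n, k=5)
print(n)  # 200

# With p = k - 1 the size is preserved; for odd k this is p // 2 per side.
k = 5
p = k - 1
print(output_size(240, k, p), p // 2)  # 240 2
```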


In [12]:
import torch
from torch import nn

def comp_conv2d(conv2d, X):
  # conv2d requires images of 4 dimensions. Adding the example and channel dimension
  X = X.reshape((1, 1) + X.shape)
  Y = conv2d(X)
  # stripping the example and channel dimensions
  return Y.reshape(Y.shape[2:])

# 1 row and column are padded on either side, so total of 2 rows or columns are added
conv2d = nn.LazyConv2d(1, kernel_size = 3, padding = 1)

X = torch.rand(size = (8, 8))
comp_conv2d(conv2d, X).shape



torch.Size([8, 8])

In [14]:
# Using a kernel with different height and width, with different padding along each axis so that the output shape stays the same

conv2d = nn.LazyConv2d(1, kernel_size = (5, 3), padding = (2, 1))
comp_conv2d(conv2d, X).shape



torch.Size([8, 8])

### Stride 

For computational efficiency, we might want to slide the kernel window more than one step at a time, skipping the intermediate locations.

The number of rows and columns traversed per slide is called the stride. So far we have used a stride of 1.

If the stride height is $s_h$ and the stride width is $s_w$, the output shape is

$\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor.$

If we set $p_h = k_h - 1$ and $p_w = k_w - 1$, the output shape simplifies to

$\lfloor(n_h+s_h-1)/s_h\rfloor \times \lfloor(n_w+s_w-1)/s_w\rfloor$. 

In [15]:
conv2d = nn.LazyConv2d(1, kernel_size = 3, padding = 1, stride = 2)
comp_conv2d(conv2d, X).shape



torch.Size([4, 4])

In [17]:
conv2d = nn.LazyConv2d(1, kernel_size = (3, 5), padding = (0, 1), stride = (3, 4))
comp_conv2d(conv2d, X).shape



torch.Size([2, 2])
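The output shape formula can be checked against the two shapes printed in this section. The helper below is a sketch with an assumed name, `conv_output_shape`; note that `p` is the total padding per axis, i.e. twice the per-side `padding` argument passed to the layers.

```python
def conv_output_shape(n, k, p, s):
    """Output (height, width) from floor((n - k + p + s) / s) per axis.

    n, k, p, s are (height, width) pairs; p is the TOTAL padding per axis,
    i.e. twice the per-side `padding` argument used in the cells above.
    """
    return tuple((n_i - k_i + p_i + s_i) // s_i
                 for n_i, k_i, p_i, s_i in zip(n, k, p, s))

# kernel_size = 3, padding = 1 (2 in total), stride = 2 on an 8 x 8 input
print(conv_output_shape((8, 8), (3, 3), (2, 2), (2, 2)))  # (4, 4)
# kernel_size = (3, 5), padding = (0, 1), stride = (3, 4) on an 8 x 8 input
print(conv_output_shape((8, 8), (3, 5), (0, 2), (3, 4)))  # (2, 2)
```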

## Multiple Input and Multiple Output Channels

The notion of channels is as old as CNNs themselves (dating back to around 1995).

### Multiple Input Channels

The convolution kernel must have the same number of channels as the input. The cross-correlation results for the individual channels are then added together, yielding a single-channel output.


In [18]:
import torch
from d2l import torch as d2l

def corr2d_multi_channel(X, K):
  # Iterate over the 0th (channel) dimension of X and K, cross-correlate each channel pair, then sum the results
  return sum(d2l.corr2d(x, k) for x, k in zip(X, K))

In [19]:
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

corr2d_multi_channel(X, K)

tensor([[ 56.,  72.],
        [104., 120.]])

### Multiple Output Channels

In the above example, no matter the number of input channels, we end up with a single-channel output. But it is essential to have multiple output channels. In most popular neural networks, the number of channels increases as we go deeper into the architecture. This compensates for the reduced spatial resolution. Intuitively, each channel can be thought of as responding to a different set of features.

If ci and co are the numbers of input and output channels, and kh and kw are the kernel height and width, then to get a multi-channel output we create a kernel of shape ci x kh x kw for each of the co output channels. We concatenate them along the output channel dimension, so that the full kernel has shape co x ci x kh x kw.

In [20]:
def corr2d_multi_channel_kernel_output(X, K):
  # Iterate through the 0th dimension of 'K' and each time perform 
  # cross correlation operations with input 'X'. All of the results are stacked
  # together
  return torch.stack([corr2d_multi_channel(X, k) for k in K], 0)


In [21]:
K = torch.stack((K, K+1, K+2), 0)
K.shape

torch.Size([3, 2, 2, 2])

In [22]:
corr2d_multi_channel_kernel_output(X, K)

tensor([[[ 56.,  72.],
         [104., 120.]],

        [[ 76., 100.],
         [148., 172.]],

        [[ 96., 128.],
         [192., 224.]]])

### 1x1 Convolutional Layer

Because of its size, a 1x1 kernel cannot capture patterns or interactions between adjacent elements in the height and width dimensions. Its only computation occurs along the channel dimension.

It can be used to change the channel dimension from ci to co.

In [25]:
def corr2d_multi_in_out_1x1(X, K):
  c_i, h, w = X.shape
  c_o = K.shape[0]
  X = X.reshape((c_i, h * w))
  K = K.reshape((c_o, c_i))
  # Matrix Multiplication in the fully connected layer
  Y = torch.matmul(K, X)
  return Y.reshape((c_o, h, w))

In [26]:
X = torch.normal(0, 1, (3, 3, 3))
K = torch.normal(0, 1, (2, 3, 1, 1))

Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_channel_kernel_output(X, K)
assert float(torch.abs(Y1 - Y2).sum()) < 1e-6

Channels allow us to combine the best of both worlds: MLPs, which allow for significant nonlinearities, and convolutions, which allow for localized analysis of features. Channels allow a CNN to reason about multiple features, such as edge and shape detectors, at the same time. They also reduce the number of parameters compared to regular MLPs.

This flexibility comes at a price, though. For an image of size (h x w), the cost of computing a k x k convolution is O(h x w x k\*\*2). With ci input and co output channels, it becomes O(h x w x k\*\*2 x ci x co).
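To get a feel for how quickly this cost grows, the count can be written out directly. The function name `conv_cost` and the sizes (a 256 x 256 output map, a 5 x 5 kernel, 128 channels) are illustrative assumptions, not values from the notes; the function counts one multiply-accumulate per kernel weight per output element.

```python
def conv_cost(h, w, k, c_i=1, c_o=1):
    # Multiply-accumulate operations for a k x k convolution that produces an
    # h x w output map with c_i input channels and c_o output channels.
    return h * w * k * k * c_i * c_o

print(conv_cost(256, 256, 5))            # 1638400
print(conv_cost(256, 256, 5, 128, 128))  # 26843545600
```

Going from one channel to 128 input and 128 output channels multiplies the cost by 128 x 128 = 16384, which is why channel counts dominate the compute budget of deep CNNs.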

## Pooling

The deeper we go into the network, the larger the receptive field (relative to the input) to which each hidden node is sensitive. Reducing spatial resolution accelerates this process.

Pooling layers serve the dual purposes of mitigating the sensitivity of convolutional layers to location and of spatially downsampling representations.

### Maximum Pooling and Average Pooling

Like convolutional operators, pooling operators consist of a fixed-shape window that is slid over all regions of the input according to its stride, computing a single value for each location traversed by the window. Unlike convolutional layers, pooling layers have no parameters. Pooling operations are deterministic, i.e., they calculate either the maximum or the average of all elements in the pooling window. These operations are called maximum pooling and average pooling, respectively.

Average pooling is similar to downsampling an image. 

In [7]:
import torch 
from torch import nn
from d2l import torch as d2l

def pool2d(X, pool_size, mode = 'max'):
  p_h, p_w = pool_size
  Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
  for i in range(Y.shape[0]):
    for j in range(Y.shape[1]):
      if mode == 'max':
        Y[i, j] = X[i: i + p_h, j: j + p_w].max()
      elif mode == 'avg':
        Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
  return Y

In [8]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
pool2d(X, (2, 2))

tensor([[4., 5.],
        [7., 8.]])

In [9]:
pool2d(X, (2,2), 'avg')

tensor([[2., 3.],
        [5., 6.]])

### Padding and Stride

Similar to convolutional layers, pooling layers alter the output shape. We can again use padding and stride to control this change in the output shape.

In [10]:
X = torch.arange(16, dtype = torch.float32).reshape((1, 1, 4, 4))
X

tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])

In [11]:
pool2d = nn.MaxPool2d(3)
pool2d(X)

tensor([[[[10.]]]])

In [12]:
pool2d = nn.MaxPool2d(3, padding = 1, stride = 2)
pool2d(X)

tensor([[[[ 5.,  7.],
          [13., 15.]]]])

In [13]:
pool2d = nn.MaxPool2d((2, 3), stride = (2, 3), padding = (0, 1))
pool2d(X)

tensor([[[[ 5.,  7.],
          [13., 15.]]]])

### Multiple Channels

When dealing with inputs with multiple channels, the pooling layer pools each channel separately, unlike a convolutional layer, which sums the results over the input channels. This means the number of output channels is the same as the number of input channels.

In [14]:
X = torch.cat((X, X + 1), 1)
X

tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]],

         [[ 1.,  2.,  3.,  4.],
          [ 5.,  6.,  7.,  8.],
          [ 9., 10., 11., 12.],
          [13., 14., 15., 16.]]]])

In [15]:
pool2d = nn.MaxPool2d(3, padding = 1, stride = 2)
pool2d(X)

tensor([[[[ 5.,  7.],
          [13., 15.]],

         [[ 6.,  8.],
          [14., 16.]]]])

If the layer has sparse features with sharp regions, max pooling will likely retain the patterns, whereas average pooling will simply blur them.
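A tiny example makes the difference visible. This is an illustrative sketch: the 4 x 4 input with a single activation is made up for the purpose, and both layers use a 2 x 2 window with the default stride (equal to the window size).

```python
import torch
from torch import nn

# A sparse feature map: one strong activation on an otherwise empty 4 x 4 grid.
X = torch.zeros(1, 1, 4, 4)
X[0, 0, 1, 1] = 1.0

max_out = nn.MaxPool2d(2)(X)  # 2 x 2 windows; stride defaults to the window size
avg_out = nn.AvgPool2d(2)(X)

print(max_out.reshape(2, 2))  # the activation survives at full strength (1.0)
print(avg_out.reshape(2, 2))  # the activation is diluted to 0.25
```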

## Convolutional Neural Networks (LeNet)

LeNet is one of the first published CNNs to capture wide attention for its performance on CV tasks. 

### LeNet

It consists of two parts. 

1 - A convolutional encoder consisting of two convolutional layers.

2 - A dense block consisting of three fully connected layers. 

The basic units in each convolutional block are a convolutional layer, a sigmoid activation function, and a subsequent average pooling operation.

ReLUs and max pooling were not used because they had not yet been discovered at that time.

Each convolutional layer uses a 5x5 kernel. The number of channels increases with each block: the first convolutional layer has 6 output channels and the second has 16. Each 2x2 pooling operation with stride 2 reduces the dimensionality by a factor of 4 via spatial downsampling.

Before the output of the second convolutional block is passed to the dense block, it is flattened from 4 dimensions (example, channel, height, width) to 2 dimensions (example, features). LeNet's dense block has three fully connected layers with 120, 84, and 10 outputs, respectively.

In [22]:
import torch
from torch import nn
from d2l import torch as d2l
 
# def int_cnn(module):
#   if type(module) == nn.Linear or type(module) == nn.Conv2d:
#     nn.init.xavier_uniform(module.weight)

# class LeNet():
#   def __init__(self, lr = 0.1, num_classes = 10):
#     super().__init__()
#     self.net = nn.Sequential(
#         nn.LazyConv2d(6, kernel_size = 5, padding = 2), nn.Sigmoid(),
#         nn.AvgPool2d(kernel_size = 2, stride = 2),
#         nn.LazyConv2d(16, kernel_size = 5), nn.Sigmoid(),
#         nn.AvgPool2d(kernel_size = 2, stride = 2),
#         nn.Flatten(),
#         nn.LazyLinear(120), nn.Sigmoid(),
#         nn.LazyLinear(84), nn.Sigmoid(),
#         nn.LazyLinear(num_classes)
#     )

net = nn.Sequential(
  nn.Conv2d(1, 6, kernel_size = 5, padding = 2), nn.Sigmoid(),
  nn.AvgPool2d(kernel_size = 2, stride = 2),
  nn.Conv2d(6, 16, kernel_size = 5), nn.Sigmoid(),
  nn.AvgPool2d(kernel_size = 2, stride = 2),
  nn.Flatten(),
  nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
  nn.Linear(120, 84), nn.Sigmoid(),
  nn.Linear(84, 10)
)

In [23]:
X = torch.rand(size = (1, 1, 28, 28), dtype = torch.float32)
for layer in net:
  X = layer(X)
  print(layer.__class__.__name__, 'output shape: \t', X.shape)

Conv2d output shape: 	 torch.Size([1, 6, 28, 28])
Sigmoid output shape: 	 torch.Size([1, 6, 28, 28])
AvgPool2d output shape: 	 torch.Size([1, 6, 14, 14])
Conv2d output shape: 	 torch.Size([1, 16, 10, 10])
Sigmoid output shape: 	 torch.Size([1, 16, 10, 10])
AvgPool2d output shape: 	 torch.Size([1, 16, 5, 5])
Flatten output shape: 	 torch.Size([1, 400])
Linear output shape: 	 torch.Size([1, 120])
Sigmoid output shape: 	 torch.Size([1, 120])
Linear output shape: 	 torch.Size([1, 84])
Sigmoid output shape: 	 torch.Size([1, 84])
Linear output shape: 	 torch.Size([1, 10])
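The layer sizes above are enough to count the parameters of this network by hand. The sketch below uses two helper functions whose names are assumptions introduced here; it counts weights plus biases for each layer of the `net` defined earlier (the sigmoid, pooling, and flatten layers have no parameters).

```python
def conv_params(c_in, c_out, k):
    return c_out * c_in * k * k + c_out  # weights + one bias per output channel

def linear_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + one bias per output

lenet = {
    "conv1": conv_params(1, 6, 5),
    "conv2": conv_params(6, 16, 5),
    "fc1": linear_params(16 * 5 * 5, 120),
    "fc2": linear_params(120, 84),
    "fc3": linear_params(84, 10),
}
for name, count in lenet.items():
    print(f"{name}: {count}")
print("total:", sum(lenet.values()))  # total: 61706
```

Most of the parameters sit in the first fully connected layer, while the two convolutional layers together account for only a small fraction of the total.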


In [24]:
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size = batch_size)



In [25]:
def evaluate_accuracy_gpu(net, data_iter, device = None):
  # Compute the accuracy of a model on a dataset, running on `device`
  if isinstance(net, nn.Module):
    net.eval()  # Switch to evaluation mode
    if not device:
      device = next(iter(net.parameters())).device
  # Accumulates (number of correct predictions, number of predictions)
  metric = d2l.Accumulator(2)

  with torch.no_grad():
    for X, y in data_iter:
      if isinstance(X, list):
        X = [x.to(device) for x in X]
      else:
        X = X.to(device)
      y = y.to(device)
      metric.add(d2l.accuracy(net(X), y), y.numel())
  return metric[0]/metric[1]