# Convolution and ReLU

A convolutional classifier has two parts: a convolutional base and a head of dense layers. We learned that the job of the base is to extract visual features from an image, which the head then uses to classify the image. The two most important types of layers in the base are the convolutional layer with ReLU activation and the maximum pooling layer.

The feature extraction performed by the base consists of three basic operations:
- Filter an image for a particular feature (convolution)
- Detect that feature within the filtered image (ReLU)
- Condense the image to enhance the features (maximum pooling)

In [2]:
# Convolution filter: a convolutional layer carries out the filtering step
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # filters: integer, the dimensionality of the output space
    #   (the number of output filters in the convolution).
    # kernel_size: an integer or tuple/list of 2 integers specifying the height
    #   and width of the 2D convolution window. A single integer specifies the
    #   same value for all spatial dimensions.
    layers.Conv2D(filters=64, kernel_size=3)
])

We can understand these parameters by looking at their relationship to the weights and activations of the layer.
Weights - The weights a convnet learns during training are primarily contained in its convolutional layers. We call these weights kernels, and we can represent them as small arrays.

A kernel operates by scanning over an image and producing a weighted sum of pixel values. In this way, a kernel acts somewhat like a polarized lens, emphasizing or deemphasizing certain patterns of information.

Kernels define how a convolutional layer is connected to the layer that follows. A 3x3 kernel, for example, connects each neuron in the output to nine neurons in the input. By setting the dimensions of the kernels with kernel_size, you are telling the convnet how to form these connections.

The kernels in a convolutional layer determine what kinds of features it creates. During training, a convnet tries to learn what features it needs to solve the classification problem. This means finding the best values for its kernels.

Activations - The activations in the network we call feature maps. They are what result when we apply a filter to an image; they contain the visual features the kernel extracts. From the pattern of numbers in the kernel, you can tell the kinds of feature maps it creates. With the filters parameter, you tell the convolutional layer how many feature maps you want it to create as output.
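To make the link between filters, kernel_size, and the layer's weights concrete, the sketch below counts the trainable parameters of a convolutional layer. The helper name `conv2d_param_count` and the 3-channel input are assumptions made for illustration; the count assumes one bias per filter, which matches the Keras default of use_bias=True.

```python
def conv2d_param_count(in_channels, filters, kernel_size, use_bias=True):
    """Count trainable weights in a 2D convolutional layer (hypothetical helper)."""
    k_h, k_w = kernel_size
    # One k_h x k_w kernel per (input channel, output filter) pair
    weights = k_h * k_w * in_channels * filters
    # One scalar bias per output filter
    biases = filters if use_bias else 0
    return weights + biases

# A 3x3 layer with 64 filters applied to an (assumed) 3-channel RGB image
print(conv2d_param_count(in_channels=3, filters=64, kernel_size=(3, 3)))  # 1792
```

Each of the 64 filters produces one feature map, so raising filters increases both the number of feature maps and the number of kernels to learn.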

ReLU - After filtering, the feature maps pass through the activation function, here the rectifier function max(0, x).

In [None]:
# ReLU
model2 = keras.Sequential([
    layers.Conv2D(filters=64, kernel_size=3, activation='relu')
])

You could think of the activation function as scoring pixel values according to some measure of importance. The ReLU activation says that negative values are not important and so sets them to 0.
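As a small illustration of the Detect step, the sketch below applies the rectifier to a hand-written feature map using plain Python lists. The values in `filtered` are made up for the example and do not come from a real image.

```python
def relu(feature_map):
    """Replace every negative activation with 0 and keep the rest unchanged."""
    return [[max(0.0, value) for value in row] for row in feature_map]

# Made-up output of a convolution: positive where the feature is present
filtered = [[-2.0, 1.5, 0.0],
            [3.0, -0.5, -1.0]]

detected = relu(filtered)
print(detected)  # [[0.0, 1.5, 0.0], [3.0, 0.0, 0.0]]
```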

# Condense with Maximum Pooling


In [None]:
model3 = keras.Sequential([
    layers.Conv2D(filters=64, kernel_size=3),
    layers.MaxPool2D(pool_size=2)
])

A MaxPool2D layer is much like a Conv2D layer, except that it uses a simple maximum function instead of a kernel, with the pool_size parameter analogous to kernel_size. A MaxPool2D layer doesn't have any trainable weights like a convolutional layer does in its kernel, however.

Notice that after applying the ReLU function (Detect) the feature map ends up with a lot of "dead space," that is, large areas containing only 0's (the black areas in the image). Having to carry these 0 activations through the entire network would increase the size of the model without adding much useful information. Instead, we would like to condense the feature map to retain only the most useful part -- the feature itself.

This in fact is what maximum pooling does. Max pooling takes a patch of activations in the original feature map and replaces them with the maximum activation in that patch. When applied after the ReLU activation, it has the effect of "intensifying" features. The pooling step increases the proportion of active pixels to zero pixels.
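The claim that pooling raises the proportion of active pixels can be checked on a toy feature map. The following is a minimal sketch in plain Python, assuming a 2x2 window moved in steps of 2 with no padding; `max_pool2d` and `active_fraction` are illustrative helpers, not Keras functions.

```python
def max_pool2d(feature_map, pool=2, stride=2):
    """Max pooling without padding over a list-of-lists feature map."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(pool) for dj in range(pool))
            for j in range(0, w - pool + 1, stride)
        ]
        for i in range(0, h - pool + 1, stride)
    ]

def active_fraction(feature_map):
    """Share of activations that are greater than zero."""
    values = [v for row in feature_map for v in row]
    return sum(v > 0 for v in values) / len(values)

# Made-up ReLU output: mostly "dead space" with two active pixels
detected = [
    [0, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 0],
]
condensed = max_pool2d(detected)
print(condensed)  # [[3, 0], [0, 2]]
print(active_fraction(detected), active_fraction(condensed))  # 0.125 0.5
```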

In [None]:
import tensorflow as tf

# We'll use another one of the functions in tf.nn to apply the pooling step,
# tf.nn.pool. This is a Python function that does the same thing as the
# MaxPool2D layer you use when model building but, being a simple function,
# is easier to use directly.

image_condense = tf.nn.pool(
    input=image_detect, # image in the Detect step above
    window_shape=(2, 2),
    pooling_type='MAX',
    # we'll see what these do in the next lesson!
    strides=(2, 2),
    padding='SAME',
)

# Pooling: Maximum Pooling and Average Pooling
Often, as we process images, we want to gradually reduce the spatial resolution of our hidden representations, aggregating information so that the higher up we go in the network, the larger the receptive field (in the input) to which each hidden node is sensitive. Like convolutional layers, pooling operators consist of a fixed-shape window that is slid over all regions in the input according to its stride, computing a single output for each location traversed by the fixed-shape window (sometimes known as the pooling window). However, unlike the cross-correlation computation of the inputs and kernels in the convolutional layer, the pooling layer contains no parameters (there is no kernel). Instead, pooling operators are deterministic, typically calculating either the maximum or the average value of the elements in the pooling window. These operations are called maximum pooling (max pooling for short) and average pooling, respectively.

In [None]:
# In the code below, we implement the forward propagation of the pooling layer
# in the pool2d function. Here we have no kernel; the output is computed as
# either the maximum or the average of each region in the input.
import tensorflow as tf

def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    # Output shape for a stride of 1 and no padding
    Y = tf.Variable(tf.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1)))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j].assign(tf.reduce_max(X[i: i + p_h, j: j + p_w]))
            elif mode == 'avg':
                Y[i, j].assign(tf.reduce_mean(X[i: i + p_h, j: j + p_w]))
    return Y

# Translation Invariance
We called the zero-pixels "unimportant". Does this mean they carry no information at all? In fact, the zero-pixels carry positional information. The blank space still positions the feature within the image. When MaxPool2D removes some of these pixels, it removes some of the positional information in the feature map. This gives a convnet a property called translation invariance. This means that a convnet with maximum pooling will tend not to distinguish features by their location in the image. ("Translation" is the mathematical word for changing the position of something without rotating it or changing its shape or size.)
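A toy example can make the loss of positional information visible. The sketch below uses plain Python and assumes a 2x2 pooling window with a stride of 2; `single_pixel_map` is a hypothetical helper that builds a feature map with one active pixel. It also shows the limit of the effect: the invariance only holds for shifts that stay inside one pooling window.

```python
def max_pool2d(feature_map, pool=2, stride=2):
    """Max pooling without padding over a list-of-lists feature map."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(pool) for dj in range(pool))
            for j in range(0, w - pool + 1, stride)
        ]
        for i in range(0, h - pool + 1, stride)
    ]

def single_pixel_map(row, col, size=4):
    """Hypothetical helper: a size x size map with one active pixel."""
    return [[5 if (i, j) == (row, col) else 0 for j in range(size)]
            for i in range(size)]

original = single_pixel_map(0, 0)
small_shift = single_pixel_map(1, 1)  # moved, but still inside the same 2x2 window
large_shift = single_pixel_map(0, 2)  # moved into the neighboring window

print(max_pool2d(original) == max_pool2d(small_shift))  # True
print(max_pool2d(original) == max_pool2d(large_shift))  # False
```

This is why the text says a convnet "will tend not to" distinguish features by location: a single pooling layer only hides small displacements, and the effect accumulates as more pooling layers are stacked.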

# The Sliding Window
The convolution and pooling operations share a common feature: they are both performed over a sliding window. With convolution, this "window" is given by the dimensions of the kernel, the parameter kernel_size. With pooling, it is the pooling window, given by pool_size. There are two additional parameters affecting both convolution and pooling layers -- these are the strides of the window and whether to use padding at the image edges. The strides parameter says how far the window should move at each step, and the padding parameter describes how we handle the pixels at the edges of the input.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model4 = keras.Sequential([
    layers.Conv2D(
        filters=64,
        kernel_size=3,
        strides=1,
        padding='same',
        activation='relu'
    ),
    layers.MaxPool2D(
        pool_size=2,
        strides=1,
        padding='same'
    )
])

- Stride - The distance the window moves at each step is called the stride. We need to specify the stride in both dimensions of the image: one for moving left to right and one for moving top to bottom. Convolutional layers will most often have strides=(1, 1), because increasing the stride means that we miss out on potentially valuable information in our summary. Maximum pooling layers, however, will almost always have stride values greater than 1, like (2, 2) or (3, 3), but not larger than the window itself.

- Padding - One tricky issue when applying convolutional layers is that we tend to lose pixels on the perimeter of our image. Since we typically use small kernels, for any given convolution, we might only lose a few pixels, but this can add up as we apply many successive convolutional layers. One straightforward solution to this problem is to add extra pixels of filler around the boundary of our input image, thus increasing the effective size of the image.

CNNs commonly use convolution kernels with odd height and width values, such as 1, 3, 5, or 7. Choosing odd kernel sizes has the benefit that we can preserve the spatial dimensionality while padding with the same number of rows on top and bottom, and the same number of columns on left and right.
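The combined effect of kernel size, stride, and padding on the output size along one dimension can be sketched with a small helper. The function `conv_output_size` is an illustration, not part of Keras, and it assumes the same number of padding pixels `pad` is added on each side of the input; frameworks may pad asymmetrically in some configurations.

```python
def conv_output_size(n, k, stride=1, pad=0):
    """Output length along one dimension for input length n and window length k.

    Assumes `pad` pixels are added on each side (hypothetical helper).
    """
    return (n - k + 2 * pad) // stride + 1

print(conv_output_size(8, 3))                   # 6: no padding, two pixels are lost
print(conv_output_size(8, 3, pad=1))            # 8: one pixel of padding preserves the size
print(conv_output_size(8, 5, pad=2))            # 8: a 5-wide kernel needs two pixels per side
print(conv_output_size(8, 3, stride=2, pad=1))  # 4: a stride of 2 halves the size
```

The second and third calls illustrate why odd kernel sizes are convenient: with an odd k, padding (k - 1) / 2 pixels on each side keeps the output the same size as the input.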

# Convolutions for Images
In a convolutional layer, an input tensor and a kernel tensor are combined to produce an output tensor through a cross-correlation operation. When the convolution window slides to a certain position, the input subtensor contained in that window and the kernel tensor are multiplied elementwise, and the resulting tensor is summed up, yielding a single scalar value. This result gives the value of the output tensor at the corresponding location.

The output size is slightly smaller than the input size. Because the kernel has width and height greater than one, we can only properly compute the cross-correlation for locations where the kernel fits wholly within the image. For an input of shape $n_h \times n_w$ and a kernel of shape $k_h \times k_w$, the output shape is therefore

$$(n_h - k_h + 1) \times (n_w - k_w + 1).$$

In [None]:
import tensorflow as tf
from d2l import tensorflow as d2l

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape  # kernel height and width
    Y = tf.Variable(tf.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1)))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j].assign(tf.reduce_sum(
                X[i: i + h, j: j + w] * K
            ))
    return Y

# We can construct the input tensor X and the kernel tensor K to validate the
# output of the above implementation of the two-dimensional cross-correlation
# operation.
X = tf.constant([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])  # input, shape (3, 3)
K = tf.constant([[0.0, 1.0], [2.0, 3.0]])  # kernel, shape (2, 2)
corr2d(X, K)  # values: [[19., 25.], [37., 43.]]

# Convolutional layer
A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce an output. The two parameters of a convolutional layer are the kernel and the scalar bias. When training models based on convolutional layers, we typically initialize the kernels randomly, just as we would with a fully-connected layer.

In [None]:
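import tensorflow as tf

class Conv2D(tf.keras.layers.Layer):
    def __init__(self, kernel_size):
        super().__init__()
        # Shape (height, width) of the kernel, e.g. (1, 2)
        self.kernel_size = kernel_size

    def build(self, input_shape):
        # Keras calls build() with the input shape; the kernel shape comes
        # from the constructor argument instead
        initializer = tf.random_normal_initializer()
        self.weight = self.add_weight(
            name="w",
            shape=self.kernel_size,
            initializer=initializer
        )
        # Scalar bias added to every output element
        self.bias = self.add_weight(
            name="b",
            shape=(1,),
            initializer=initializer
        )

    def call(self, inputs):
        return corr2d(inputs, self.weight) + self.bias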

# Learning a Kernel
We first construct a convolutional layer and initialize its kernel as a random tensor. Next, in each iteration, we will use the squared error to compare Y with the output of the convolutional layer. We can then calculate the gradient to update the kernel. For the sake of simplicity, in the following we use the built-in class for two-dimensional convolutional layers and ignore the bias.

In [None]:
import tensorflow as tf

# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = tf.keras.layers.Conv2D(1, (1, 2), use_bias=False)

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, height, width, channel), where the batch
# size (number of examples in the batch) and the number of channels are both 1.
# X (a 6x8 input) and Y (the 6x7 target output) must already be defined
X = tf.reshape(X, (1, 6, 8, 1))
Y = tf.reshape(Y, (1, 6, 7, 1))
lr = 3e-2  # Learning rate

Y_hat = conv2d(X)  # The first call builds the layer and creates its weights
for i in range(10):
    with tf.GradientTape(watch_accessed_variables=False) as g:
        g.watch(conv2d.weights[0])
        Y_hat = conv2d(X)
        l = (abs(Y_hat - Y)) ** 2

    # Update the kernel
    update = tf.multiply(lr, g.gradient(l, conv2d.weights[0]))
    weights = conv2d.get_weights()
    weights[0] = conv2d.weights[0] - update
    conv2d.set_weights(weights)
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {tf.reduce_sum(l):.3f}')

In [None]:
tf.reshape(conv2d.get_weights()[0], (1, 2))
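The same learning procedure can be traced by hand in plain Python. The following is a minimal sketch under stated assumptions: the toy input (a 6x8 image of ones with a band of zeros), the target kernel [1, -1], and the all-zero starting kernel are chosen for this illustration and are not taken from the cells above. The gradient of the summed squared error is written out explicitly instead of using tf.GradientTape.

```python
def corr2d(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [
        [
            sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
            for j in range(len(X[0]) - w + 1)
        ]
        for i in range(len(X) - h + 1)
    ]

# Assumed toy data: a 6x8 image with vertical edges, and the output of a
# known 1x2 kernel that we then try to recover
X = [[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0] for _ in range(6)]
true_kernel = [[1.0, -1.0]]
Y = corr2d(X, true_kernel)  # shape 6x7

kernel = [[0.0, 0.0]]  # deterministic start (a real layer starts randomly)
lr = 3e-2              # learning rate

for epoch in range(30):
    Y_hat = corr2d(X, kernel)
    grad = [0.0, 0.0]
    for i in range(len(Y)):
        for j in range(len(Y[0])):
            err = Y_hat[i][j] - Y[i][j]
            # d/dw of err**2, for each of the two kernel entries
            grad[0] += 2 * err * X[i][j]
            grad[1] += 2 * err * X[i][j + 1]
    kernel = [[kernel[0][0] - lr * grad[0], kernel[0][1] - lr * grad[1]]]

print([round(v, 3) for v in kernel[0]])  # close to [1.0, -1.0]
```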

# Feature Map and Receptive Field
As described earlier, the convolutional layer output is sometimes called a feature map, as it can be regarded as the learned representations (features) in the spatial dimensions (e.g., width and height) to the subsequent layer. In CNNs, for any element $x$ of some layer, its receptive field refers to all the elements (from all the previous layers) that may affect the calculation of $x$ during the forward propagation.
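The size of a receptive field along one dimension can be worked out layer by layer. The sketch below uses a commonly cited recurrence for stacks of convolution and pooling layers without dilation; the helper `receptive_field` and the example layer stacks are assumptions made for illustration.

```python
def receptive_field(layers):
    """Receptive field size (one dimension) of an output element.

    `layers` is a list of (kernel_size, stride) pairs ordered from the
    input to the output. Hypothetical helper; dilation is not modeled.
    """
    size, jump = 1, 1
    for kernel_size, stride in layers:
        # Each layer widens the field by (kernel_size - 1) steps of the
        # current spacing between neighboring elements
        size += (kernel_size - 1) * jump
        jump *= stride
    return size

print(receptive_field([(3, 1), (3, 1)]))          # 5: two stacked 3x3 convolutions
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: conv, 2x2 pooling with stride 2, conv
```

The second example shows how a pooling layer with a stride of 2 lets the following convolution cover a wider region of the input.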

# Padding
In the following example, we create a two-dimensional convolutional layer with a kernel height and width of 3 and apply 1 pixel of padding on all sides. Given an input with a height and width of 8, we find that the height and width of the output is also 8. As with convolutional layers, pooling layers can also change the output shape. And as before, we can alter the operation to achieve a desired output shape by padding the input and adjusting the stride.

In [None]:
# We define a convenience function to calculate the convolutional layer. This
# function initializes the convolutional layer weights and performs
# corresponding dimensionality elevations and reductions on the input and
# output
def comp_conv2d(conv2d, X):
    # Here the two added 1s indicate that the batch size and the number of
    # channels are both 1
    X = tf.reshape(X, (1, ) + X.shape + (1, ))
    Y = conv2d(X)
    # Exclude the dimensions that do not interest us: examples and channels
    return tf.reshape(Y, Y.shape[1:3])

# Note that here 1 row or column is padded on either side, so a total of 2
# rows or columns are added
conv2d = tf.keras.layers.Conv2D(1, kernel_size=3, padding='same')
X = tf.random.uniform(shape=(8, 8))
comp_conv2d(conv2d, X).shape

In [None]:
# When the height and width of the convolution kernel are different, we can
# make the output and input have the same height and width by setting
# different padding numbers for height and width.
# Here, we use a convolution kernel with a height of 5 and a width of 3. The
# padding numbers on either side of the height and width are 2 and 1,
# respectively
conv2d = tf.keras.layers.Conv2D(1, kernel_size=(5, 3), padding='same')
comp_conv2d(conv2d, X).shape

# Stride
When computing the cross-correlation, we start with the convolution window at the upper-left corner of the input tensor, and then slide it over all locations both down and to the right. Below, we set the strides on both the height and width to 2, thus halving the input height and width.

In [None]:
conv2d = tf.keras.layers.Conv2D(1, kernel_size=3, padding='same', strides=2)
comp_conv2d(conv2d, X).shape

# Multiple Input Channels
When the input data contain multiple channels, we need to construct a convolution kernel with the same number of input channels as the input data, so that it can perform cross-correlation with the input data.

In [None]:
# We can implement cross-correlation operations with multiple input channels ourselves
def corr2d_multi_in(X, K):
    # Iterate through the 0th dimension (channel dimension) of `X` and `K`,
    # cross-correlate each pair, then sum the per-channel results
    return tf.reduce_sum([d2l.corr2d(x, k) for x, k in zip(X, K)], axis=0)

# We can construct the input tensor X and the kernel tensor K corresponding to the values to validate the output of the cross-correlation operation.
X = tf.constant([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = tf.constant([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

corr2d_multi_in(X, K)

# Multiple Output Channels
In cross-correlation operations, the result on each output channel is calculated from the convolution kernel corresponding to that output channel and takes input from all channels in the input tensor.

In [None]:
# We implement a cross-correlation function to calculate the output of multiple channels
def corr2d_multi_in_out(X, K):
    # Iterate through the 0th dimension of `K`, and each time, perform
    # cross-correlation operations with input `X`. All of the results are
    # stacked together
    return tf.stack([corr2d_multi_in(X, k) for k in K], 0)

# We construct a convolution kernel with 3 output channels by concatenating the kernel tensor K with K+1 (plus one for each element in K) and K+2.
K = tf.stack((K, K + 1, K + 2), 0)
K.shape

# Below, we perform cross-correlation operations on the input tensor X with the kernel tensor K. Now the output contains 3 channels. 
# The result of the first channel is consistent with the result of the previous input tensor X and the multi-input channel, single-output channel kernel.
corr2d_multi_in_out(X, K)
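For readers who want to check the channel arithmetic without TensorFlow, the following is a plain-Python mirror of the functions above, using the same X and K values. It is a sketch for verification only; the `shift` helper is hypothetical and stands in for the K + 1 and K + 2 tensor additions.

```python
def corr2d(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [
        [
            sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
            for j in range(len(X[0]) - w + 1)
        ]
        for i in range(len(X) - h + 1)
    ]

def corr2d_multi_in(X, K):
    """Cross-correlate each input channel with its kernel and sum the results."""
    per_channel = [corr2d(x, k) for x, k in zip(X, K)]
    rows, cols = len(per_channel[0]), len(per_channel[0][0])
    return [[sum(c[i][j] for c in per_channel) for j in range(cols)]
            for i in range(rows)]

def corr2d_multi_in_out(X, K):
    """One multi-input cross-correlation per output channel."""
    return [corr2d_multi_in(X, k) for k in K]

def shift(K, delta):
    """Hypothetical helper: add delta to every element of a 3D kernel."""
    return [[[v + delta for v in row] for row in channel] for channel in K]

X = [[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
     [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
K = [[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]]

out = corr2d_multi_in_out(X, [K, shift(K, 1), shift(K, 2)])
print(len(out), len(out[0]), len(out[0][0]))  # 3 2 2
print(out[0])  # [[56.0, 72.0], [104.0, 120.0]]
```

The first output channel equals the single-output result of corr2d_multi_in(X, K), since it uses the unshifted kernel.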

# Summary
- Taking the input elements in the pooling window, the maximum pooling operation assigns the maximum value as the output and the average pooling operation assigns the average value as the output.

- One of the major benefits of a pooling layer is to alleviate the excessive sensitivity of the convolutional layer to location.

- We can specify the padding and stride for the pooling layer.

- Maximum pooling, combined with a stride larger than 1, can be used to reduce the spatial dimensions (e.g., width and height).

- The pooling layer's number of output channels is the same as the number of input channels.