# Convolutions in Practice
This notebook is meant to introduce convolutional layers, with special emphasis on the relation between the dimension of the input tensor, the kernel size, the stride, the number of filters and the dimension of the output tensor.

In [1]:
import tensorflow as tf



In [2]:
from tensorflow.keras.layers import Input, Conv2D, ZeroPadding2D, Dense, Flatten, Layer
from tensorflow.keras.models import Model
from tensorflow.keras import metrics
from tensorflow.keras.datasets import mnist

We run the example on the MNIST dataset. Keras provides convenient access to several well-known datasets, including MNIST, CIFAR-10, CIFAR-100, IMDB and many others. See https://keras.io/api/datasets/ for documentation.

In [3]:
import numpy as np
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 1s 0us/step


MNIST images are grayscale, with pixel values in the range [0,255].
We convert them to floats and normalize them to the range [0,1].

In [4]:
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.

Two-dimensional convolutions expect inputs with three dimensions (plus an additional batch dimension): height, width, channels.
Since MNIST digits are grayscale and therefore have only two dimensions, we need to extend them with a channel dimension of size 1.

In [5]:
(n,h,w) = x_train.shape  # (samples, height, width)
x_train = x_train.reshape(n,h,w,1)
(n,h,w) = x_test.shape
x_test = x_test.reshape(n,h,w,1)
print(x_train.shape)
print(x_test.shape)

(60000, 28, 28, 1)
(10000, 28, 28, 1)
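An equivalent way to add the channel axis is `np.expand_dims` (or indexing with `np.newaxis`). The sketch below uses a small zero-filled stand-in array rather than the real data; the name `batch` and its size are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for a batch of 5 grayscale 28x28 images.
batch = np.zeros((5, 28, 28), dtype="float32")

# Append a trailing channel axis: (5, 28, 28) -> (5, 28, 28, 1).
with_channel = np.expand_dims(batch, axis=-1)

# Indexing with np.newaxis produces the same shape.
same_shape = batch[..., np.newaxis]

print(with_channel.shape, same_shape.shape)
```

Both forms avoid spelling out the existing dimensions, which can be convenient when the image size is not known in advance.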


MNIST labels are integers in the range [0,9]. Since the network produces a probability for each of these categories, comparing its output with the ground-truth distribution using categorical crossentropy (the traditional choice) requires converting each integer to its categorical (one-hot) encoding, e.g. with the `to_categorical` function in `keras.utils`.

Alternatively, we can use the so-called ["sparse categorical crossentropy"](https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy) loss function, which allows us to compare predictions directly with integer labels.

Categorical crossentropy and sparse categorical crossentropy are both loss functions commonly used in classification tasks in deep learning. They are similar in their purpose of measuring the difference between the predicted probability distribution and the true distribution of class labels, but they differ in how they handle the representation of the target labels.

### Categorical Crossentropy:

1. **Input**: The output of the neural network is typically a probability distribution over the classes, generated by the softmax activation function. Each value in the output vector represents the probability of the input belonging to the corresponding class.

2. **Target**: The target labels are provided in one-hot encoded format, where each target label is represented as a binary vector with a 1 in the position corresponding to the true class and 0s elsewhere.

3. **Loss Calculation**: Categorical crossentropy calculates the cross-entropy loss between the predicted probability distribution and the true one-hot encoded target labels.

4. **Loss Function**: The formula for categorical crossentropy loss for a single sample is:

   $$\text{Loss} = -\sum_{i} y_{\text{true}}[i] \cdot \log(\text{softmax}(y_{\text{pred}})[i])$$

   Where:
   - $y_{\text{true}}$ is the true one-hot encoded target label.
   - $\text{softmax}(y_{\text{pred}})$ is the predicted probability distribution over classes.

5. **Overall Loss**: The loss is averaged over all samples in the batch to obtain the overall loss for that batch.

### Sparse Categorical Crossentropy:

Sparse categorical crossentropy is particularly useful when the target labels are provided as integers rather than one-hot encoded vectors. It eliminates the need for explicit conversion of target labels to one-hot encoded format.

1. **Input**: Same as categorical crossentropy.

2. **Target**: The target labels are provided as integers, where each integer represents the index of the true class for each sample.

3. **Loss Calculation**: Sparse categorical crossentropy computes the same cross-entropy loss as the one-hot case, but it uses the integer label directly as an index into the predicted probability distribution, so no one-hot vector has to be built.

4. **Loss Function**: The formula for sparse categorical crossentropy loss for a single sample is similar to categorical crossentropy but handles integer labels directly:

   $$\text{Loss} = -\log(\text{softmax}(y_{\text{pred}})[c])$$

   Where:
   - $\text{softmax}(y_{\text{pred}})$ is the predicted probability distribution over classes.
   - $c$ is the index of the true class.

5. **Overall Loss**: The loss is averaged over all samples in the batch.

### Comparison:

- **Representation of Target Labels**: Categorical crossentropy requires target labels in one-hot encoded format, while sparse categorical crossentropy accepts integer labels directly.
- **Memory Efficiency**: Sparse categorical crossentropy is more memory-efficient, especially when dealing with a large number of classes, as it avoids the need to one-hot encode target labels.
- **Computation Efficiency**: Sparse categorical crossentropy might be slightly more computationally efficient because it avoids the explicit conversion of target labels to one-hot encoded format.
- **Usage**: Choose categorical crossentropy when target labels are in one-hot encoded format, and sparse categorical crossentropy when target labels are integers.
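To make the comparison concrete, here is a minimal NumPy sketch (not the Keras implementation) that evaluates both formulas on a made-up batch and shows that they yield the same value. The logits, the labels and the helper `softmax` are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for a batch of 2 samples and 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([0, 1])        # integer labels
one_hot = np.eye(3)[labels]      # one-hot version of the same labels

probs = softmax(logits)

# Categorical crossentropy: sum over classes, weighted by the one-hot target.
cce = -(one_hot * np.log(probs)).sum(axis=-1).mean()

# Sparse categorical crossentropy: pick the log-probability of the true class.
scce = -np.log(probs[np.arange(len(labels)), labels]).mean()

print(cce, scce)
```

The two numbers coincide because the one-hot target zeroes out every term of the sum except the one for the true class.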


In [6]:
# Only needed with (non-sparse) categorical crossentropy:
# y_train = tf.keras.utils.to_categorical(y_train)
# y_test = tf.keras.utils.to_categorical(y_test)
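For reference, the one-hot conversion that the commented-out lines would perform can be sketched in plain NumPy. This is only an illustration on a handful of made-up labels, not the Keras implementation; the names `labels`, `one_hot` and `recovered` are ours.

```python
import numpy as np

# Hypothetical integer labels in the range [0, 9].
labels = np.array([5, 0, 4, 1, 9])
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)   # (5, 10)
print(one_hot[0])      # 1.0 at index 5, zeros elsewhere

# argmax recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
```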

Let us now turn to the convolutional network. We define a simple network composed of three convolutional layers, followed by two Dense layers.

In [12]:
xin = Input(shape=(28,28,1))
x = Conv2D(16,(3,3),strides=(2,2),padding='valid')(xin)
x = Conv2D(32,(3,3),strides=(2,2),padding='valid')(x)
x = Conv2D(64,(3,3),strides=(2,2),padding='valid')(x)
x = Flatten()(x)
x = Dense(64, activation ='relu')(x)
res = Dense(10,activation = 'softmax')(x)

mynet = Model(inputs=xin,outputs=res)

#### Code Explanation
This code defines a convolutional neural network (CNN).

- `Input(shape=(28,28,1))`: This line creates an input layer for the network. The `shape` parameter specifies the shape of a single input sample, excluding the batch dimension. The network therefore receives 4D tensors of shape (batch_size, 28, 28, 1), where `batch_size` is the number of samples in each batch: 28x28 images with 1 channel (grayscale).
- `Conv2D(16, (3,3), strides=(2,2), padding='valid')`: This line creates a convolutional layer with 16 filters, each with a 3x3 kernel. 
    - The `strides` parameter determines the step size of the sliding window during convolution. Here, it's set to (2,2), meaning the filter moves 2 pixels at a time in both the height and width dimensions. 
    - The `padding` parameter determines how the input is padded. `'valid'` padding means no padding is added to the input.
    - `(xin)`: This layer is applied to the `xin` input layer defined earlier.
- Similar to the previous line, the next two lines create additional convolutional layers with increasing numbers of filters (32 and 64, respectively). Each layer applies convolution to the output of the previous layer (`x`), resulting in a sequence of feature maps.
- `Flatten()`: This line adds a flatten layer to the network. It reshapes the 3D output tensor from the previous convolutional layers into a 1D tensor, which is required as input for the subsequent fully connected layers.
- `Dense(64, activation='relu')`: This line adds a fully connected (dense) layer with 64 neurons and ReLU activation function. It connects every neuron in the previous layer to every neuron in this layer.
- `Dense(10, activation='softmax')`: This line adds another fully connected layer with 10 neurons (since there are 10 output classes) and softmax activation function. Softmax converts the output values into probabilities representing the likelihood of each class.
- `Model(inputs=xin, outputs=res)`: This line creates a Keras `Model` by specifying the input (`xin`) and output (`res`) layers. This model represents the entire neural network architecture defined above.
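To illustrate how the kernel size and the stride determine the size of a feature map, here is a naive single-channel, single-filter convolution in NumPy. It is a didactic sketch, not how Keras computes `Conv2D`; the function name `conv2d_valid` and the random inputs are assumptions made for the example. As in deep learning frameworks, the operation is technically a cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel, stride):
    """Naive single-channel 2D convolution (cross-correlation), no padding."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # The top-left corner of the window moves by `stride` pixels.
            r, c = i * stride, j * stride
            window = image[r:r + kh, c:c + kw]
            out[i, j] = np.sum(window * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28)).astype("float32")   # stand-in for one MNIST digit
kernel = rng.random((3, 3)).astype("float32")    # stand-in for one 3x3 filter

feature_map = conv2d_valid(image, kernel, stride=2)
print(feature_map.shape)   # (13, 13)
```

A real `Conv2D` layer repeats this for every filter, sums over all input channels and adds a bias per filter.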

#### Why don't the convolutional layers have an activation function?
In Keras, `Conv2D` uses `activation=None` by default, so the three convolutional layers above apply no activation. This fits the notebook's focus on shapes and parameter counts, but its consequences should be stated precisely:

1. **Convolution is linear**: A convolution is an element-wise multiplication and summation of filter weights and input values, plus a bias. This is an affine operation, so it does not introduce any non-linearity by itself.

2. **Learning Representations**: Activation functions are what allow a network to learn non-linear functions. A stack of convolutions with no activation in between collapses into a single affine map of the input, so the three layers here are no more expressive than one affine layer. The only non-linearities in this network are the ReLU in the first Dense layer and the final softmax.

3. **Gradient Flow**: Certain activation functions, such as ReLU, can suffer from the "dying ReLU" problem, where neurons become inactive during training and stop learning. Omitting the activation avoids this for those layers, but at the cost of losing the non-linearity; variants such as Leaky ReLU are the more common remedy.

4. **Model Flexibility**: Leaving `activation` unset allows the activation to be added as a separate layer, for instance after a normalization layer, which makes it easy to experiment with different architectures.

In practice, most CNNs apply a non-linear activation (typically ReLU) after each convolutional layer, and doing so is usually beneficial for learning complex representations. Whether and where to place activations ultimately depends on the specific problem being addressed and on the empirical performance of the model during training and evaluation.

Now let's have a look at the summary

In [13]:
mynet.summary()

- As we said earlier, in valid mode no padding is applied.

    Along each axis, the output dimension O is computed from the input dimension I using the formula O = floor((I-K)/S) + 1, where K is the kernel dimension and S is the stride.
    
    For all layers, K=3 and S=2. So, for the first convolution we pass from dimension 28 to dimension floor((28-3)/2)+1 = 13, then to dimension floor((13-3)/2)+1 = 6 and finally to dimension floor((6-3)/2)+1 = 2.

    - As an exercise, you can change "valid" to "same" and see what happens.


- The second important point is about the number of parameters. You must keep in mind that a kernel of dimension K1 x K2 has an actual dimension K1 x K2 x CI, where CI is the number of input channels: in other words, the kernel computes spatial and cross-channel correlations at the same time.

    So, for the first convolution, we have 3 x 3 x 1 + 1 = 10 parameters for each filter (1 for the bias), and since we are computing 16 filters, the number of parameters is 10 x 16 = 160.
    
    For the second convolution, each filter has 3 x 3 x 16 + 1 = 145 parameters, and since we have 32 filters, the total number of parameters is 145 x 32 = 4640.
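The two points above can be checked with a few lines of plain Python. This is a small sketch under the stated formulas; the helper names `conv_output_size` and `conv_params` are ours, and the `"same"` branch relies on the fact that Keras produces an output of size ceil(I/S) in that mode.

```python
def conv_output_size(size, kernel, stride, padding="valid"):
    """Output length along one axis of a convolution."""
    if padding == "valid":
        return (size - kernel) // stride + 1
    if padding == "same":
        # With 'same' padding the output size is ceil(size / stride).
        return -(-size // stride)
    raise ValueError(f"unknown padding: {padding}")

def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter, for a square kernel."""
    return (kernel * kernel * in_channels + 1) * filters

size, channels = 28, 1
for filters in (16, 32, 64):
    params = conv_params(3, channels, filters)
    size = conv_output_size(size, 3, 2)
    print(f"output {size}x{size}x{filters}, parameters {params}")
    channels = filters
```

For the three layers of the network this prints output sizes 13, 6 and 2, with 160, 4640 and 18496 parameters respectively, which can be compared against `mynet.summary()`.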



Let us now turn to training.

In addition to the optimizer and the loss, we also pass a "metrics" argument. Metrics are additional functions that are not used directly for training but allow us to monitor its progress. Here we use accuracy, in its sparse categorical variant: sparse because the targets are integer labels, and categorical because there are multiple classes.

In [14]:
mynet.compile(optimizer='adam',loss='sparse_categorical_crossentropy', metrics=[metrics.SparseCategoricalAccuracy()])
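As a rough illustration of what the sparse categorical accuracy metric measures, the following NumPy sketch computes it by hand on made-up predictions. It is not the Keras implementation, and the arrays `probs` and `labels` are hypothetical.

```python
import numpy as np

# Hypothetical predicted probabilities for 4 samples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
labels = np.array([0, 1, 2, 1])   # integer ground-truth labels

# A prediction is correct when the most probable class matches the label.
predicted = probs.argmax(axis=1)
accuracy = (predicted == labels).mean()

print(accuracy)   # 0.75
```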

In [15]:
mynet.fit(x_train, y_train, shuffle=True, epochs=10, batch_size=32, validation_data=(x_test,y_test))

Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 11s 5ms/step - loss: 0.4527 - sparse_categorical_accuracy: 0.8575 - val_loss: 0.1353 - val_sparse_categorical_accuracy: 0.9571
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 5ms/step - loss: 0.1360 - sparse_categorical_accuracy: 0.9592 - val_loss: 0.1243 - val_sparse_categorical_accuracy: 0.9603
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - loss: 0.1080 - sparse_categorical_accuracy: 0.9676 - val_loss: 0.1063 - val_sparse_categorical_accuracy: 0.9666
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - loss: 0.0912 - sparse_categorical_accuracy: 0.9716 - val_loss: 0.0971 - val_sparse_categorical_accuracy: 0.9707
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - loss: 0.0817 - sparse_categorical_accuracy: 0.9741 - val_loss: 0.0895 - val_sparse_categorical_accur

<keras.src.callbacks.history.History at 0x7b1628382e30>