# Chapter 14: Deep Computer Vision Using Convolutional Neural Networks

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Introduction

IBM’s Deep Blue supercomputer won its first game against the chess world champion Garry Kasparov in 1996 and defeated him in a full match in 1997, yet it wasn’t until fairly recently that computers were able to reliably perform seemingly trivial tasks such as detecting a puppy in a picture or recognizing spoken words. Why are these tasks so effortless to us humans? The answer lies in the fact that perception largely takes place outside the realm of our consciousness, within specialized visual, auditory, and other sensory modules in our brains. By the time sensory information reaches our consciousness, it is already adorned with high-level features.

Convolutional neural networks (CNNs) emerged from the study of the brain’s visual cortex, and they have been used in image recognition since the 1980s. In the last few years, thanks to the increase in computational power, the amount of available training data, and the tricks presented in Chapter 11 for training deep nets, CNNs have managed to achieve superhuman performance on some complex visual tasks. They power image search services, self-driving cars, automatic video classification systems, and more. Moreover, CNNs are not limited to visual perception: they are also successful at other tasks, such as voice recognition and natural language processing (NLP); however, we will focus on visual applications in this chapter.

We will start by exploring the architecture of CNNs, looking at their building blocks: convolutional layers and pooling layers. We will see how to implement them using TensorFlow and Keras. Then we will look at some of the best CNN architectures developed over the years, from LeNet-5 to ResNet and beyond. We will also discuss how to use pretrained models for transfer learning. Finally, we will tackle more complex problems such as object detection (finding multiple objects in an image and drawing bounding boxes around them) and semantic segmentation (classifying every single pixel in an image).

## 2. The Architecture of the Visual Cortex

David H. Hubel and Torsten Wiesel performed a series of experiments on cats in 1958 and 1959 (and a few years later on monkeys), giving crucial insights into the structure of the visual cortex. They showed that many neurons in the visual cortex have a **small local receptive field**, meaning they react only to visual stimuli located in a limited region of the visual field. The receptive fields of different neurons may overlap, and together they tile the whole visual field.

Moreover, the authors showed that some neurons react only to images of horizontal lines, while others react only to lines with different orientations (two neurons may have the same receptive field but react to different line orientations). They also noticed that some neurons have larger receptive fields, and they react to more complex patterns that are combinations of the lower-level patterns. These observations led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons. This powerful insight is at the core of CNNs.

## 3. Convolutional Layers

The most important building block of a CNN is the convolutional layer. Neurons in the first convolutional layer are not connected to every single pixel in the input image (like they were in the dense layers discussed in previous chapters), but only to pixels in their receptive fields. In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on small low-level features in the first hidden layer, then assemble them into larger higher-level features in the next hidden layer, and so on.

This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.

### Filters (Kernels)

A neuron’s weights can be represented as a small image the size of the receptive field. For example, if the receptive field is 3 × 3, the weights can be represented as a 3 × 3 image. This set of weights is called a **filter** (or **kernel**). If we have a filter that looks like a vertical white line on a black background, a neuron using these weights will ignore everything except vertical lines in its receptive field.
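
To make the vertical-line example concrete, here is a minimal sketch in plain Python (not the TensorFlow implementation). The two 3 × 3 patches and the helper `neuron_response` are hypothetical, chosen only to show that the weighted sum is large when the patch matches the filter and small otherwise.

```python
# A 3x3 filter that looks like a vertical white line on a black background.
vertical_filter = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]

def neuron_response(patch, kernel):
    """Weighted sum over one receptive field (no bias, no activation)."""
    return sum(
        patch[u][v] * kernel[u][v]
        for u in range(len(kernel))
        for v in range(len(kernel[0]))
    )

# Two hypothetical 3x3 patches with pixel intensities in {0, 1}.
vertical_patch = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
horizontal_patch = [
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]

vertical_score = neuron_response(vertical_patch, vertical_filter)
horizontal_score = neuron_response(horizontal_patch, vertical_filter)
print(vertical_score, horizontal_score)  # 3 1
```

The horizontal line still produces a small response because it crosses the central column of the filter at one pixel.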

### Stacking Feature Maps

A convolutional layer actually has multiple filters (you decide how many) and outputs one **feature map** per filter. Each feature map highlights the areas in the image that activate the filter the most.

### Equation of a Convolutional Layer

Let $z_{i,j,k}$ be the output of the neuron located in row $i$, column $j$ in feature map $k$ of the convolutional layer (layer $l$).

$$ z_{i,j,k} = b_k + \sum_{u=0}^{f_h-1} \sum_{v=0}^{f_w-1} \sum_{k'=0}^{f_{n'}-1} x_{i', j', k'} \cdot w_{u, v, k', k} $$

Where:
* $x_{i', j', k'}$ is the output of the neuron in layer $l-1$ at row $i'$, column $j'$, feature map $k'$ (or channel $k'$ if the previous layer is the input).
* $i' = i \times s_h + u$
* $j' = j \times s_w + v$
* $b_k$ is the bias term for feature map $k$ (in layer $l$).
* $w_{u, v, k', k}$ is the connection weight between any neuron in feature map $k$ of layer $l$ and its input located at row $u$, column $v$ (relative to the neuron’s receptive field), and feature map $k'$.
* $f_h, f_w$ are the height and width of the receptive field (kernel size).
* $s_h, s_w$ are the vertical and horizontal strides.
* $f_{n'}$ is the number of feature maps in the previous layer (layer $l-1$).
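
The equation can be transcribed almost literally into nested loops. The following is a slow, plain-Python sketch for illustration only; it assumes no padding, and the function name `conv_layer` and the toy input are not from the original text.

```python
def conv_layer(x, w, b, s_h=1, s_w=1):
    """Direct transcription of the equation, with no padding.

    x: input as nested lists indexed [row][col][channel]
    w: weights indexed [u][v][k_prev][k]
    b: one bias per output feature map
    """
    f_h, f_w = len(w), len(w[0])
    f_n_prev, f_n = len(w[0][0]), len(b)
    out_h = (len(x) - f_h) // s_h + 1
    out_w = (len(x[0]) - f_w) // s_w + 1
    z = [[[0.0] * f_n for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for k in range(f_n):
                total = b[k]
                for u in range(f_h):
                    for v in range(f_w):
                        for kp in range(f_n_prev):
                            # i' = i * s_h + u and j' = j * s_w + v
                            total += x[i * s_h + u][j * s_w + v][kp] * w[u][v][kp][k]
                z[i][j][k] = total
    return z

# Toy 4x4 single-channel input and one 2x2 filter of ones (sums each window).
x = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
w = [[[[1.0]], [[1.0]]], [[[1.0]], [[1.0]]]]
z = conv_layer(x, w, b=[0.5], s_h=2, s_w=2)
print(z)  # [[[10.5], [18.5]], [[42.5], [50.5]]]
```

Note that the same weights `w` and bias `b` are reused at every position `(i, j)`, which is why all neurons in a feature map detect the same pattern.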

### TensorFlow Implementation

In TensorFlow, each input image is typically represented as a 3D tensor of shape `[height, width, channels]`. A mini-batch is represented as a 4D tensor of shape `[batch_size, height, width, channels]`.

Let's first apply two hand-crafted filters to a pair of sample images using the low-level `tf.nn.conv2d` function.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
from sklearn.datasets import load_sample_image

# Load sample images
china = load_sample_image("china.jpg") / 255.0
flower = load_sample_image("flower.jpg") / 255.0
images = np.array([china, flower])
batch_size, height, width, channels = images.shape

# Create 2 filters
filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters[:, 3, :, 0] = 1  # vertical line
filters[3, :, :, 1] = 1  # horizontal line

# Apply the convolution with the low-level tf.nn.conv2d function
# strides=1: move the filters one pixel at a time, both vertically and horizontally
# padding="SAME": zero-pad as needed so that output size = ceil(input size / stride)
outputs = tf.nn.conv2d(images, filters, strides=1, padding="SAME")

print("Output shape:", outputs.shape)
# (2, 427, 640, 2): 2 images, same height/width, 2 feature maps (channels)
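
The output shape printed above follows from simple arithmetic. As a sketch, the helper below (a hypothetical name, not a TensorFlow function) applies the size rules documented for TensorFlow's `"SAME"` and `"VALID"` padding along one spatial axis.

```python
import math

def conv_output_size(input_size, kernel_size, stride, padding):
    """Output length along one spatial axis for "SAME" or "VALID" padding."""
    if padding == "SAME":
        return math.ceil(input_size / stride)
    if padding == "VALID":
        return math.ceil((input_size - kernel_size + 1) / stride)
    raise ValueError(f"unknown padding: {padding}")

# The sample images are 427 x 640 and the kernel is 7 x 7.
same_s1 = (conv_output_size(427, 7, 1, "SAME"), conv_output_size(640, 7, 1, "SAME"))
same_s2 = (conv_output_size(427, 7, 2, "SAME"), conv_output_size(640, 7, 2, "SAME"))
valid_s1 = (conv_output_size(427, 7, 1, "VALID"), conv_output_size(640, 7, 1, "VALID"))
print(same_s1, same_s2, valid_s1)  # (427, 640) (214, 320) (421, 634)
```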

Using `keras.layers.Conv2D` is much simpler for building models:

In [None]:
# filters=64: Create 64 different kernels, resulting in 64 feature maps.
# kernel_size=7: Receptive field is 7x7 pixels.
# activation="relu": Apply ReLU activation after the convolution.
# padding="same": Output spatial dimensions match input dimensions (if stride=1).
conv = keras.layers.Conv2D(filters=64, kernel_size=7, activation="relu", padding="same",
                           input_shape=[height, width, channels])

# Passing the images through the layer
output_k = conv(images)
print("Keras Conv2D Output Shape:", output_k.shape)
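
Because every position in a feature map shares the same kernel, the number of parameters in this layer does not depend on the image size. The sketch below counts them by hand and compares the result with a hypothetical dense layer of 64 neurons connected to every input value; both helper functions are illustrative names.

```python
def conv2d_param_count(kernel_h, kernel_w, in_channels, filters, use_bias=True):
    """Weights are shared across all positions, so image size does not matter."""
    weights = kernel_h * kernel_w * in_channels * filters
    return weights + (filters if use_bias else 0)

def dense_param_count(inputs, units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return inputs * units + units

conv_params = conv2d_param_count(7, 7, in_channels=3, filters=64)
dense_params = dense_param_count(427 * 640 * 3, units=64)
print(conv_params)   # 9472
print(dense_params)  # 52469824
```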

## 4. Pooling Layers

Once you understand how convolutional layers work, pooling layers are quite easy to grasp. Their goal is to **subsample** (i.e., shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting).

Just like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. However, a pooling neuron has no weights; all it does is aggregate the inputs using an aggregation function such as the max or the mean.

**Max Pooling:**
Outputs the maximum value in the receptive field. It introduces some invariance to small translations and, to a lesser extent, to small rotations and changes of scale.

In [None]:
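# Max pooling with a 2x2 pool size; strides defaults to pool_size, so the stride is 2.
# Height and width are each divided by 2 (rounded down with the default "valid" padding),
# so only one value out of four is kept: 75% of the input values are dropped.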
max_pool = keras.layers.MaxPooling2D(pool_size=2)

output_max = max_pool(images)
print("Max Pooling Output Shape:", output_max.shape)
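
The translation invariance mentioned above can be seen on a tiny example. This is a plain-Python sketch of 2x2 max pooling on a single channel; the two 4x4 inputs are hypothetical and differ only by moving each bright pixel by one position.

```python
def max_pool_2d(image, pool_size=2, stride=2):
    """Max pooling over a single-channel image given as nested lists (no padding)."""
    out_h = (len(image) - pool_size) // stride + 1
    out_w = (len(image[0]) - pool_size) // stride + 1
    return [
        [
            max(
                image[i * stride + u][j * stride + v]
                for u in range(pool_size)
                for v in range(pool_size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

original = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]
# Same pattern with each bright pixel moved by one position.
shifted = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]
pooled_original = max_pool_2d(original)
pooled_shifted = max_pool_2d(shifted)
print(pooled_original, pooled_shifted)  # [[9, 0], [0, 5]] [[9, 0], [0, 5]]
```

The invariance only holds while the shifted pixel stays inside the same pooling window; a shift that crosses a window boundary does change the output.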

**Average Pooling:**
Computes the mean of the values in the receptive field. It is less popular than max pooling today, because max pooling keeps only the strongest activation in each window and generally performs better in practice.

In [None]:
avg_pool = keras.layers.AveragePooling2D(pool_size=2)
output_avg = avg_pool(images)
print("Avg Pooling Output Shape:", output_avg.shape)
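
To see what "preserving the strongest feature" means, compare the two aggregation functions on a single window. The values below are hypothetical, and `pool_window` is an illustrative helper rather than a Keras function.

```python
def pool_window(window, mode):
    """Aggregate one receptive field with either the max or the mean."""
    values = [value for row in window for value in row]
    return max(values) if mode == "max" else sum(values) / len(values)

# A hypothetical 2x2 receptive field with one strong activation and three weak ones.
window = [
    [1, 0],
    [9, 2],
]
max_value = pool_window(window, "max")
mean_value = pool_window(window, "mean")
print(max_value, mean_value)  # 9 3.0
```

The mean dilutes the strong activation with its weak neighbors, while the max passes it on unchanged.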

## 5. CNN Architectures

Typical CNN architectures stack a few convolutional layers (each generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also gets deeper and deeper (i.e., with more feature maps). At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).
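
The "smaller but deeper" pattern can be traced with a few lines of bookkeeping. This sketch assumes "same"-padded stride-1 convolutions and unpadded pooling; the stack itself and its filter counts are hypothetical.

```python
def trace_shapes(input_shape, blocks):
    """Follow (height, width, channels) through conv/pool blocks.

    Assumes "same"-padded stride-1 convolutions (spatial size unchanged) and
    pooling with stride equal to the pool size and no padding (rounded down).
    """
    height, width, channels = input_shape
    shapes = [input_shape]
    for kind, value in blocks:
        if kind == "conv":
            channels = value  # value = number of filters
        elif kind == "pool":
            height, width = height // value, width // value
        else:
            raise ValueError(f"unknown block: {kind}")
        shapes.append((height, width, channels))
    return shapes

# Hypothetical stack for 28x28 grayscale images.
blocks = [("conv", 64), ("pool", 2),
          ("conv", 128), ("conv", 128), ("pool", 2),
          ("conv", 256), ("conv", 256), ("pool", 2)]
shapes = trace_shapes((28, 28, 1), blocks)
for shape in shapes:
    print(shape)
# The last line printed is (3, 3, 256)
```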

### Famous Architectures

1.  **LeNet-5 (1998):** Used for handwritten digit recognition (MNIST). Structure: Conv -> Avg Pool -> Conv -> Avg Pool -> Conv -> Dense -> Dense.
2.  **AlexNet (2012):** The winner of the ILSVRC 2012 challenge. Similar to LeNet-5 but much larger and deeper. It popularized ReLU activations in CNNs and used Dropout to reduce overfitting.
3.  **GoogLeNet (2014):** Introduced the *Inception module*, which allows the network to have filters of different sizes (1x1, 3x3, 5x5) operating in parallel at the same level. It effectively allows the model to choose the right kernel size for the task.
4.  **VGGNet (2014):** Known for its simplicity. It uses only 3x3 convolutions stacked on top of each other. Two 3x3 convolutions have a receptive field equivalent to one 5x5 convolution but with fewer parameters.
5.  **ResNet (2015):** The Residual Network. It made very deep networks trainable (the ILSVRC 2015 winning variant had 152 layers) by using **Skip Connections**. A skip connection adds the input of a block of layers to that block's output: $y = F(x) + x$. The block then only has to learn the residual $F(x)$, and the signal can propagate easily through the whole network.
6.  **Xception (2016):** A variant of the Inception architecture built on *depthwise separable convolutions*, which apply one spatial filter per input channel and then a 1x1 convolution that captures cross-channel patterns. This requires far fewer parameters and computations than a regular convolution.
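
Two of the claims in this list reduce to arithmetic: stacked 3x3 layers versus one 5x5 layer (VGGNet), and regular versus depthwise separable convolutions (Xception). The sketch below counts weights only (biases ignored) and assumes 64 input and 64 output channels, a hypothetical choice.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a regular convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

def separable_conv_weights(kernel_size, in_channels, out_channels):
    """One spatial filter per input channel, then a 1x1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise

channels = 64  # hypothetical: same number of input and output channels

# VGG-style: two stacked 3x3 layers versus a single 5x5 layer.
two_3x3 = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)

# Receptive field of n stacked stride-1 convolutions of size k: n * (k - 1) + 1.
receptive_field = 2 * (3 - 1) + 1

# Xception-style: regular 3x3 convolution versus its depthwise separable version.
regular_3x3 = conv_weights(3, channels, channels)
separable_3x3 = separable_conv_weights(3, channels, channels)

print(two_3x3, one_5x5, receptive_field)  # 73728 102400 5
print(regular_3x3, separable_3x3)         # 36864 4672
```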

### Implementing a ResNet-34 Using Keras

Most standard architectures are available in `keras.applications`, but building one from scratch is instructive. Here is how to build a ResNet-34 using the Subclassing API to define a Residual Unit.

In [None]:
class ResidualUnit(keras.layers.Layer):
    def __init__(self, filters, strides=1, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.activation = keras.activations.get(activation)
        # Main path: Two Conv2D layers with Batch Normalization
        self.main_layers = [
            keras.layers.Conv2D(filters, 3, strides=strides,
                                padding="same", use_bias=False),
            keras.layers.BatchNormalization(),
            self.activation,
            keras.layers.Conv2D(filters, 3, strides=1,
                                padding="same", use_bias=False),
            keras.layers.BatchNormalization()]
        # Skip path: If strides > 1, we need to shrink the input to match the output shape
        self.skip_layers = []
        if strides > 1:
            self.skip_layers = [
                keras.layers.Conv2D(filters, 1, strides=strides,
                                    padding="same", use_bias=False),
                keras.layers.BatchNormalization()]

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)
        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)
        return self.activation(Z + skip_Z)

model = keras.models.Sequential()
model.add(keras.layers.Conv2D(64, 7, strides=2, input_shape=[224, 224, 3],
                              padding="same", use_bias=False))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation("relu"))
model.add(keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"))
prev_filters = 64
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    strides = 1 if filters == prev_filters else 2
    model.add(ResidualUnit(filters, strides=strides))
    prev_filters = filters
model.add(keras.layers.GlobalAvgPool2D())
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(10, activation="softmax"))

# model.summary() # Uncomment to view the huge architecture
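
As a sanity check on the model above, the following sketch recounts the layers and traces the spatial size of the feature maps by hand. It assumes the usual counting convention (the 1x1 convolutions on the skip paths are not counted), and it mirrors the `[64] * 3 + [128] * 4 + [256] * 6 + [512] * 3` loop without building any Keras objects.

```python
import math

units_per_stage = {64: 3, 128: 4, 256: 6, 512: 3}

# Each residual unit has 2 convolutional layers on its main path; add the
# initial 7x7 convolution and the final dense layer.
layer_count = 1 + 2 * sum(units_per_stage.values()) + 1

# Spatial size: the stem halves it twice, then every increase in the
# number of filters halves it again ("same" padding rounds up).
size = 224
size = math.ceil(size / 2)  # 7x7 convolution, stride 2
size = math.ceil(size / 2)  # max pooling, stride 2
sizes = {}
prev_filters = 64
for filters in units_per_stage:
    if filters != prev_filters:
        size = math.ceil(size / 2)
    sizes[filters] = size
    prev_filters = filters

print(layer_count)  # 34
print(sizes)        # {64: 56, 128: 28, 256: 14, 512: 7}
```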

## 6. Using Pretrained Models from Keras

In general, you won't build ResNet or Xception from scratch. You will load a pretrained version from `keras.applications`. These models are usually pretrained on ImageNet (1000 classes).

In [None]:
# Load ResNet50 model pretrained on ImageNet
model = keras.applications.resnet50.ResNet50(weights="imagenet")

# Resize images to 224x224 as required by ResNet50
images_resized = tf.image.resize(images, [224, 224])

# Preprocess input: ResNet50's preprocess_input expects pixel values in the 0-255 range,
# hence the * 255 below (the images were scaled to 0-1 when they were loaded)
inputs = keras.applications.resnet50.preprocess_input(images_resized * 255)

# Predict
Y_proba = model.predict(inputs)

# Decode predictions into human readable labels
top_K = keras.applications.resnet50.decode_predictions(Y_proba, top=3)
for image_index in range(len(images)):
    print(f"Image #{image_index}")
    for class_id, name, y_proba in top_K[image_index]:
        print(f"  {class_id} - {name} {y_proba*100:.2f}%")
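
For intuition about the preprocessing step, here is a rough single-pixel sketch of what the ResNet50 `preprocess_input` function is documented to do: convert RGB to BGR, then zero-center each channel with respect to ImageNet, without rescaling. The mean values and the helper name are assumptions for illustration; in real code, call the library function.

```python
# Per-channel ImageNet means in BGR order (assumed values, for illustration).
IMAGENET_MEAN_BGR = (103.939, 116.779, 123.68)

def preprocess_pixel_caffe(rgb_pixel):
    """Reverse RGB to BGR, then subtract the per-channel mean; no rescaling."""
    r, g, b = rgb_pixel
    bgr = (b, g, r)
    return tuple(value - mean for value, mean in zip(bgr, IMAGENET_MEAN_BGR))

# The images were scaled to [0, 1], so they are multiplied by 255 first.
pixel_01 = (1.0, 0.5, 0.0)  # hypothetical RGB pixel in [0, 1]
pixel_255 = tuple(value * 255 for value in pixel_01)
processed = preprocess_pixel_caffe(pixel_255)
print(processed)  # roughly (-103.939, 10.721, 131.32)
```

Feeding 0-1 values directly would leave every channel far below its mean, which is why skipping the `* 255` step typically gives poor predictions.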

## 7. Pretrained Models for Transfer Learning

If you want to train an image classifier for a dataset that is not ImageNet (e.g., classifying flowers), you can use Transfer Learning. You load the pretrained model without the top layers (the classification head) and add your own.

### Procedure:
1.  Load the base model with `include_top=False`.
2.  Freeze the base model layers (make them non-trainable).
3.  Add a GlobalAveragePooling2D layer and a Dense output layer.
4.  Compile and train the model for a few epochs (training only the top layers).
5.  Unfreeze some of the top layers of the base model.
6.  Compile (with a lower learning rate) and fine-tune.

In [None]:
# 1. Load Base Model
base_model = keras.applications.Xception(weights="imagenet", include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
output = keras.layers.Dense(10, activation="softmax")(avg)
model = keras.models.Model(inputs=base_model.input, outputs=output)

# 2. Freeze Base Model
for layer in base_model.layers:
    layer.trainable = False

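# 3. Compile and Train (Head only)
# Inverse-time decay: lr = 0.2 / (1 + 0.01 * step). Recent Keras optimizers no
# longer support the `decay` argument, so a schedule is used instead.
lr_schedule = keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.2, decay_steps=1, decay_rate=0.01)
optimizer = keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
# history = model.fit(...) # Train for a few epochs

# 4. Unfreeze and Fine-tune (all base layers here; unfreezing only the top ones is also common)
for layer in base_model.layers:
    layer.trainable = True

# Recompile after changing `trainable`, with a much lower learning rate
lr_schedule = keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.01, decay_steps=1, decay_rate=0.001)
optimizer = keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
# history = model.fit(...) # Train for more epochs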
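
The decay settings in the two training phases shrink the learning rate as training progresses. Assuming the inverse-time rule `lr = lr0 / (1 + rate * step)` (the rule used by the older Keras `decay` argument), the sketch below shows how quickly each phase's learning rate falls; the chosen step counts are arbitrary.

```python
def inverse_time_decay(initial_lr, decay_rate, step):
    """Learning rate after `step` updates: lr0 / (1 + decay_rate * step)."""
    return initial_lr / (1.0 + decay_rate * step)

# Settings taken from the two training phases.
head_lrs = [inverse_time_decay(0.2, 0.01, step) for step in (0, 100, 1000)]
fine_tune_lrs = [inverse_time_decay(0.01, 0.001, step) for step in (0, 100, 1000)]
print([round(lr, 5) for lr in head_lrs])       # [0.2, 0.1, 0.01818]
print([round(lr, 5) for lr in fine_tune_lrs])  # [0.01, 0.00909, 0.005]
```

The fine-tuning phase starts 20 times lower than the head-only phase, which limits the damage to the pretrained weights once they become trainable.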

## 8. Classification and Localization

Localizing an object in a picture can be expressed as a regression task. We need to predict the bounding box: the horizontal and vertical coordinates of its center, plus its width and height ($x, y, w, h$).

We can build a model with two heads: one for classification (Softmax) and one for regression (MSE loss). The regression head typically uses 4 neurons (one for each coordinate).

In [None]:
# Example structure for Localization + Classification
base_model = keras.applications.Xception(weights="imagenet", include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)

class_output = keras.layers.Dense(10, activation="softmax", name="class_output")(avg)
loc_output = keras.layers.Dense(4, name="loc_output")(avg)

model = keras.models.Model(inputs=base_model.input, outputs=[class_output, loc_output])
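model.compile(loss={"class_output": "sparse_categorical_crossentropy", "loc_output": "mse"},
              loss_weights={"class_output": 0.8, "loc_output": 0.2},  # weight classification more heavily
              optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              metrics={"class_output": "accuracy"})  # accuracy is only meaningful for the class head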

A common metric for evaluating bounding box predictions is the **Intersection over Union (IoU)**: the area of the overlap between the predicted bounding box and the target bounding box, divided by the area of their union. The MSE works as a training loss, but it is a poor measure of how well two boxes match.
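
As a minimal sketch, IoU can be computed as follows for boxes given in the center format used in this section ($x, y, w, h$). The function names and the example boxes are hypothetical.

```python
def to_corners(box):
    """Convert (x_center, y_center, width, height) to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2

def iou(box_a, box_b):
    """Intersection area divided by union area of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0

target = (50, 50, 40, 40)     # hypothetical ground-truth box
predicted = (60, 50, 40, 40)  # same size, shifted 10 pixels to the right
print(iou(target, predicted))           # 0.6
print(iou(target, target))              # 1.0
print(iou(target, (200, 200, 10, 10)))  # 0.0
```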

## 9. Object Detection

Classifying and localizing a single object is one thing, but detecting multiple objects (where the number of objects varies) is harder.

**Sliding Windows:**
The old approach was to slide a small window across the image and run a CNN classifier on each crop. This is computationally expensive.

**Fully Convolutional Networks (FCN):**
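The idea is to replace the dense layers at the top of a CNN with convolutional layers. For example, a dense layer with 200 neurons on top of a convolutional layer that outputs 7x7 feature maps is equivalent to a convolutional layer with 200 filters of size 7x7 and "valid" padding: each filter covers the whole feature map exactly once and computes the same weighted sum as one dense neuron. Unlike the dense layer, the convolutional layer also accepts larger inputs, so the network can process images of any size and output a spatial map of class probabilities (a heatmap) rather than a single classification.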
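
The dense-to-convolution equivalence can be checked numerically on a toy case. This plain-Python sketch uses a 2x2 feature map with 3 channels and 2 neurons instead of the 7x7 map and 200 neurons of the example; all sizes and random values are hypothetical.

```python
import random

random.seed(0)
H, W, C, N = 2, 2, 3, 2  # toy sizes: 2x2 feature map, 3 channels, 2 neurons

feature_map = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
dense_weights = [[random.random() for _ in range(N)] for _ in range(H * W * C)]

# Dense layer: flatten the feature map (row-major), then one dot product per neuron.
flat = [feature_map[i][j][c] for i in range(H) for j in range(W) for c in range(C)]
dense_out = [sum(flat[p] * dense_weights[p][n] for p in range(len(flat))) for n in range(N)]

# Convolutional layer: N filters of size H x W with "valid" padding. Each filter
# is the corresponding dense weight column viewed as an (H, W, C) array.
def kernel(u, v, c, n):
    return dense_weights[(u * W + v) * C + c][n]

conv_out = [
    sum(feature_map[u][v][c] * kernel(u, v, c, n)
        for u in range(H) for v in range(W) for c in range(C))
    for n in range(N)
]

max_diff = max(abs(a - b) for a, b in zip(dense_out, conv_out))
print(max_diff)  # 0.0
```

On a feature map larger than the filter, the same filters would slide across it and produce one such output vector per position, which is what yields the heatmap.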

**YOLO (You Only Look Once):**
YOLO divides the image into a grid (e.g., 7x7). For each grid cell, it predicts $B$ bounding boxes and a confidence score for each box. It also predicts class probabilities. This happens in a single forward pass, making it extremely fast (real-time).
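
The size of the prediction tensor follows from the description above. This sketch assumes each box is encoded as 4 coordinates plus 1 confidence score and that class probabilities are predicted once per grid cell; the settings (7x7 grid, $B = 2$, 20 classes) are hypothetical, and other YOLO variants lay out their outputs differently.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Each box carries 4 coordinates plus 1 confidence score; each cell
    also carries one probability per class (shared by its boxes)."""
    values_per_cell = boxes_per_cell * (4 + 1) + num_classes
    return grid_size, grid_size, values_per_cell

# Hypothetical settings: 7x7 grid, B = 2 boxes per cell, 20 classes.
shape = yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20)
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)  # (7, 7, 30) 1470
```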

**SSD (Single Shot MultiBox Detector):**
Similar to YOLO but makes predictions at different scales using feature maps from different layers of the network. This allows it to detect objects of various sizes more effectively.

## 10. Semantic Segmentation

In Semantic Segmentation, each pixel is classified according to the class of the object it belongs to (e.g., road, car, pedestrian, building). Note that different objects of the same class are not distinguished (e.g., all cars are just "car" pixels). Instance segmentation distinguishes different objects of the same class.

**Architecture (FCN for Segmentation):**
We can use a Fully Convolutional Network. However, standard CNNs reduce spatial resolution (due to pooling and strides). We need to restore the resolution to match the input image size.

**Upsampling / Transposed Convolution:**
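We use **Transposed Convolutional Layers** (sometimes called deconvolution layers, although they do not perform a mathematical deconvolution) to upscale the feature maps. A transposed convolution is equivalent to stretching the input by inserting zeros between its values and then performing a regular convolution, which increases the spatial dimensions.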

**U-Net:**
The U-Net architecture is very popular for segmentation. It consists of an encoder (contracting path) and a decoder (expanding path). The key feature is **Skip Connections** that copy the high-resolution feature maps from the encoder to the corresponding layers in the decoder. This helps the decoder recover fine-grained spatial details that were lost during pooling.

In [None]:
# Example: Transposed Convolution for Upsampling
# Increasing dimensions from 7x7 to 14x14
inputs = tf.random.normal([1, 7, 7, 64])
transpose_conv = keras.layers.Conv2DTranspose(filters=32, kernel_size=3, strides=2,
                                              padding="same", activation="relu")
upsampled = transpose_conv(inputs)
print("Upsampled Shape:", upsampled.shape)
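
To show what a transposed convolution computes, here is a one-dimensional plain-Python sketch of the zero-insertion view described earlier, checked against the equivalent "scatter" view in which each input stamps a scaled copy of the kernel onto the output. The function names and values are hypothetical, and no output cropping is applied.

```python
def conv1d_valid(signal, kernel):
    """Plain sliding dot product (cross-correlation), stride 1, no padding."""
    k = len(kernel)
    return [sum(signal[i + u] * kernel[u] for u in range(k))
            for i in range(len(signal) - k + 1)]

def transposed_conv1d(inputs, kernel, stride):
    """Transposed convolution as zero insertion followed by a regular convolution."""
    k = len(kernel)
    # 1. Insert (stride - 1) zeros between consecutive inputs.
    stretched = []
    for index, value in enumerate(inputs):
        stretched.append(value)
        if index < len(inputs) - 1:
            stretched.extend([0.0] * (stride - 1))
    # 2. Pad both ends with k - 1 zeros so every input meets every kernel weight.
    padded = [0.0] * (k - 1) + stretched + [0.0] * (k - 1)
    # 3. Slide the flipped kernel over the padded signal.
    return conv1d_valid(padded, kernel[::-1])

def transposed_conv1d_scatter(inputs, kernel, stride):
    """Equivalent view: each input adds a scaled copy of the kernel to the output."""
    out = [0.0] * ((len(inputs) - 1) * stride + len(kernel))
    for i, value in enumerate(inputs):
        for u, weight in enumerate(kernel):
            out[i * stride + u] += value * weight
    return out

inputs = [1.0, 2.0, 3.0]
kernel = [1.0, 10.0, 100.0]
upsampled = transposed_conv1d(inputs, kernel, stride=2)
scattered = transposed_conv1d_scatter(inputs, kernel, stride=2)
print(upsampled)  # [1.0, 10.0, 102.0, 20.0, 203.0, 30.0, 300.0]
```

Without cropping, the output length is `(n - 1) * stride + kernel_size` (here 7). With `padding="same"`, Keras trims the result so that the output is exactly `stride` times the input size, which is why the 7x7 input becomes 14x14 in the cell above.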