# **CHAPTER 14**
# **Deep Computer Vision Using Convolutional Neural Networks**

**The Architecture of the Visual Cortex**

This subchapter explains the biological motivation behind Convolutional Neural Networks (CNNs). Studies of the mammalian visual cortex show that vision is processed hierarchically. Neurons in early visual areas respond to simple patterns such as edges and orientations, while neurons in deeper layers respond to more complex patterns such as shapes, textures, and eventually entire objects.
Each neuron responds only to a small region of the visual field, known as its receptive field. This local connectivity allows the brain to efficiently process visual information while preserving spatial relationships. Additionally, neurons with similar functions are grouped together, forming layers of increasing abstraction.
CNNs are designed to mimic this structure by:
•	Using local receptive fields through convolution
•	Sharing weights across spatial locations
•	Building hierarchical feature representations across layers
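Local connectivity and weight sharing, both borrowed from this biological structure, are a large part of why CNNs handle image data far more effectively than fully connected networks: they need far fewer parameters and keep the spatial layout of the image intact.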


**Convolutional Layers**

Convolutional layers are the core building blocks of CNNs. Instead of connecting every input pixel to every neuron, convolutional layers connect each neuron to a small spatial region of the input. This dramatically reduces the number of parameters and improves generalization.
Each convolutional layer applies several filters (kernels) that slide over the input image. At each spatial position, the filter computes a dot product between its weights and the corresponding input values. The result is a feature map that highlights where a specific pattern appears in the image.
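The sliding dot product described above can be sketched in plain NumPy. This is a minimal illustration, not how TensorFlow implements convolution: the helper name `conv2d_single` and the hand-written edge filter are assumptions made for the example (in a real CNN the filter weights are learned), and it handles a single-channel image with no padding.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide `kernel` over a 2D `image` (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Region of the input currently covered by the filter
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            # Dot product between the filter weights and the input values
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# 5x5 image: three dark columns (0) followed by two bright columns (1)
image = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)

# Hypothetical hand-written vertical edge detector, for illustration only
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

feature_map = conv2d_single(image, vertical_edge)
print(feature_map)
```

The feature map is zero where the filter sees a uniform region and positive where it overlaps the dark-to-bright transition, which is what "highlighting where a pattern appears" means in practice.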
Important concepts covered include:
•	Stride: how many pixels the filter moves at each step; a stride larger than 1 shrinks the output
•	Padding: whether zeros are added around the input border so that the output keeps the input's spatial dimensions
•	Depth: the number of filters in the layer, which equals the number of feature maps it outputs
Convolutional layers preserve spatial structure while extracting meaningful features.
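How stride and padding affect the output size can be worked out with a small helper. The sketch below assumes the two padding conventions used by TensorFlow ("valid" for no padding, "same" for zero padding); the function name `conv_output_size` is hypothetical.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial axis of a convolution or pooling layer."""
    if padding == "same":
        # Zero padding is added so that only the stride reduces the size
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the filter must fit entirely inside the input
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'valid' or 'same'")

print(conv_output_size(28, 7, stride=1, padding="same"))   # 28
print(conv_output_size(28, 3, stride=1, padding="valid"))  # 26
print(conv_output_size(224, 7, stride=2, padding="same"))  # 112
```

The same formula applies independently to height and width.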


**Filters**

Filters are small matrices of trainable parameters that learn to detect specific visual patterns. In early layers, filters typically learn simple features such as vertical or horizontal edges. As layers get deeper, filters learn more abstract patterns like corners, textures, and object parts.
Each filter is applied across the entire input image using the same weights. This concept, known as weight sharing, allows CNNs to detect the same feature regardless of its position in the image and greatly reduces the total number of parameters.
The chapter emphasizes that filters are not manually designed; they are automatically learned through backpropagation during training.
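To make the parameter savings from weight sharing concrete, the following sketch counts trainable parameters for a convolutional layer and for a fully connected layer that would produce an output of the same size. The layer sizes (a 28×28 grayscale input, 64 filters of size 7×7) are chosen for illustration, and the helper names are hypothetical.

```python
def conv_params(kernel_size, in_channels, n_filters):
    """Weights of all filters plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * n_filters + n_filters

def dense_params(n_inputs, n_units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units

height, width, channels = 28, 28, 1
n_filters, kernel_size = 64, 7

conv_count = conv_params(kernel_size, channels, n_filters)

# A dense layer producing the same 28 x 28 x 64 output values
dense_count = dense_params(height * width * channels,
                           height * width * n_filters)

print("conv layer :", conv_count)
print("dense layer:", dense_count)
print("ratio      :", dense_count // conv_count)
```

The convolutional layer's parameter count does not depend on the image size at all, only on the filter size, the number of input channels, and the number of filters.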


**Stacking Multiple Feature Maps**

Images often have multiple channels, such as red, green, and blue (RGB). Convolutional layers handle this by applying filters that span all input channels. Each filter produces a single feature map by combining information across channels.
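As a rough sketch of how one filter combines all input channels into a single feature map, the NumPy code below applies four random filters to a small random RGB image. The shapes and the helper name `conv2d_multi` are assumptions for illustration; stride is fixed to 1 and no padding is used.

```python
import numpy as np

def conv2d_multi(image, filters):
    """image: (height, width, in_channels)
    filters: (kernel_h, kernel_w, in_channels, n_filters)
    returns: (out_h, out_w, n_filters), one feature map per filter."""
    kh, kw, in_channels, n_filters = filters.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw, :]  # spans every input channel
            # Sum over height, width AND channels: one value per filter
            output[i, j, :] = np.tensordot(
                patch, filters, axes=([0, 1, 2], [0, 1, 2]))
    return output

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))                 # small RGB image
filters = rng.standard_normal((3, 3, 3, 4))   # four 3x3 filters over 3 channels

feature_maps = conv2d_multi(image, filters)
print(feature_maps.shape)  # (4, 4, 4)
```

The number of output channels equals the number of filters, regardless of how many channels the input has.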
As multiple convolutional layers are stacked, the network builds increasingly complex representations:
•	Early layers capture low-level features
•	Middle layers combine features into motifs
•	Deep layers recognize object-level patterns
This hierarchical stacking is one of the main reasons CNNs scale so well to complex visual tasks.


**Pooling Layers**

Pooling layers reduce the spatial resolution of feature maps while retaining the most important information. This helps:
•	Reduce computational cost
•	Control overfitting
•	Improve translation invariance
The most common pooling operation is max pooling, which selects the maximum value within a small window.
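Spatial max pooling can be illustrated with a few lines of NumPy. This is a minimal sketch assuming a single 2D feature map and non-overlapping windows (stride equal to the pool size); the function name `max_pool2d` here is a local helper, not the TensorFlow function.

```python
import numpy as np

def max_pool2d(feature_map, pool_size=2):
    """Non-overlapping max pooling over a 2D feature map."""
    h, w = feature_map.shape
    h_out, w_out = h // pool_size, w // pool_size
    # Drop rows/columns that do not fill a complete window
    trimmed = feature_map[:h_out * pool_size, :w_out * pool_size]
    # Split into (h_out, pool_size, w_out, pool_size) windows
    windows = trimmed.reshape(h_out, pool_size, w_out, pool_size)
    # Keep the maximum of each window
    return windows.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 6],
                        [2, 2, 7, 8]], dtype=float)

pooled = max_pool2d(feature_map)
print(pooled)
# [[4. 2.]
#  [2. 8.]]
```

Each 2×2 window is reduced to its largest value, so the output has half the height and half the width of the input.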


**TensorFlow Implementation**

In [4]:
import tensorflow as tf
import numpy as np

images = tf.constant(np.random.rand(1, 4, 6, 3), dtype=tf.float32)

output = tf.nn.max_pool2d(
    images,
    ksize=(1, 1, 1, 3),    # kernel size: pool across the 3 channels only
    strides=(1, 1, 1, 3),  # stride: step of 3 along the channel axis
    padding="VALID"        # either "VALID" or "SAME"
)

print(output)


tf.Tensor(
[[[[0.8506536 ]
   [0.72328687]
   [0.8799554 ]
   [0.9134423 ]
   [0.61638457]
   [0.99999356]]

  [[0.962648  ]
   [0.97197115]
   [0.8306107 ]
   [0.49877656]
   [0.93657494]
   [0.86493444]]

  [[0.5423574 ]
   [0.82106036]
   [0.7029755 ]
   [0.6428116 ]
   [0.2893866 ]
   [0.51773095]]

  [[0.9230914 ]
   [0.6528054 ]
   [0.96797824]
   [0.6367127 ]
   [0.654425  ]
   [0.45040113]]]], shape=(1, 4, 6, 1), dtype=float32)


In [5]:
from tensorflow import keras
import tensorflow as tf

# Example Lambda layer that applies max pooling along the depth (channel) axis
depth_pool = keras.layers.Lambda(
    lambda X: tf.nn.max_pool2d(
        X,
        ksize=(1, 1, 1, 3),    # kernel size: depth only
        strides=(1, 1, 1, 3),  # stride: step along the depth axis
        padding="VALID"        # no padding
    )
)


**CNN Architectures**

This section reviews influential CNN architectures that advanced the field of computer vision:
•	LeNet-5: one of the earliest CNNs, designed for digit recognition
•	AlexNet: demonstrated the power of deep CNNs on large datasets
•	VGGNet: used deep networks with small 3×3 filters
•	GoogLeNet (Inception): introduced multi-scale feature extraction
•	ResNet: addressed the degradation problem (very deep plain networks training worse than shallower ones) using skip connections
Each architecture introduced innovations that enabled deeper, more accurate networks.
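The idea behind a skip connection can be sketched with dense weights in NumPy. This is a simplified stand-in for a convolutional residual unit, with hypothetical names and no batch normalization; it only shows that when the main path outputs zeros, the block passes its (non-negative) input through unchanged.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Main path: two linear layers with a ReLU in between.
    The input is added back to the main-path output before the final ReLU."""
    z = relu(x @ w1)
    z = z @ w2
    return relu(z + x)  # skip connection

x = np.array([[0.5, 1.0, 2.0]])
zero_weights = np.zeros((3, 3))

# With all-zero weights the main path contributes nothing,
# so the block simply copies its input
output = residual_block(x, zero_weights, zero_weights)
print(output)  # [[0.5 1.  2. ]]
```

Because the block starts out close to the identity function, the layers only need to learn a correction on top of their input.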


In [6]:
from tensorflow import keras

model = keras.models.Sequential([
    # Declare the input shape explicitly: 28x28 grayscale images
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(64, 7, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),

    keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
    keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),

    keras.layers.Conv2D(256, 3, activation="relu", padding="same"),
    keras.layers.Conv2D(256, 3, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),

    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation="softmax")
])




**Implementing a ResNet-34 CNN Using Keras**

In [7]:
from tensorflow import keras

class ResidualUnit(keras.layers.Layer):
    def __init__(self, filters, strides=1, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.activation = keras.activations.get(activation)

        # Main path
        self.main_layers = [
            keras.layers.Conv2D(filters, 3, strides=strides,
                                padding="same", use_bias=False),
            keras.layers.BatchNormalization(),
            self.activation,
            keras.layers.Conv2D(filters, 3, strides=1,
                                padding="same", use_bias=False),
            keras.layers.BatchNormalization()
        ]

        # Skip connection path
        self.skip_layers = []
        if strides > 1:
            self.skip_layers = [
                keras.layers.Conv2D(filters, 1, strides=strides,
                                    padding="same", use_bias=False),
                keras.layers.BatchNormalization()
            ]

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)

        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)

        return self.activation(Z + skip_Z)


In [8]:
from tensorflow import keras

model = keras.models.Sequential()

# Stem
model.add(keras.layers.Input(shape=[224, 224, 3]))
model.add(
    keras.layers.Conv2D(64, 7, strides=2, padding="same", use_bias=False)
)
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation("relu"))
model.add(keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"))

# Residual blocks
prev_filters = 64
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    strides = 1 if filters == prev_filters else 2
    model.add(ResidualUnit(filters, strides=strides))
    prev_filters = filters

# Head
model.add(keras.layers.GlobalAvgPool2D())
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(10, activation="softmax"))
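As a sanity check on the loop above, the following pure-Python sketch recomputes which residual units downsample, how the spatial size shrinks for a 224×224 input, and where the "34" in ResNet-34 comes from. It assumes "same" padding throughout (as in the cells above) and counts only the stem convolution, the two convolutions per residual unit, and the final dense layer; the variable names are illustrative.

```python
import math

filters_per_unit = [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3

# Same rule as the model-building loop: stride 2 whenever the filter count changes
strides = []
prev_filters = 64
for filters in filters_per_unit:
    strides.append(1 if filters == prev_filters else 2)
    prev_filters = filters

downsampling_units = [i for i, s in enumerate(strides) if s == 2]

# Spatial size: stem convolution (stride 2), max pooling (stride 2),
# then one halving per downsampling residual unit
size = 224
size = math.ceil(size / 2)  # stem convolution
size = math.ceil(size / 2)  # max pooling
for s in strides:
    size = math.ceil(size / s)

# 1 stem convolution + 2 convolutions per residual unit + 1 dense layer
n_weight_layers = 1 + 2 * len(filters_per_unit) + 1

print(downsampling_units)  # [3, 7, 13]
print(size)                # 7
print(n_weight_layers)     # 34
```

The 1×1 convolutions on the skip paths are not included in this count.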


**Using Pretrained Models from Keras**

In [9]:
model = keras.applications.resnet50.ResNet50(weights="imagenet")

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels.h5
102967424/102967424 ━━━━━━━━━━━━━━━━━━━━ 1s 0us/step


In [10]:
images_resized = tf.image.resize(images, [224, 224])

In [11]:
inputs = keras.applications.resnet50.preprocess_input(images_resized * 255)
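The multiplication by 255 is there because the ResNet-50 `preprocess_input` function expects pixel values in the 0 to 255 range. To our understanding it then applies "caffe"-style preprocessing: channels are reordered from RGB to BGR and the ImageNet per-channel means are subtracted, with no further scaling. The NumPy sketch below mimics that behavior under this assumption; `caffe_style_preprocess` is a hypothetical helper, not part of Keras.

```python
import numpy as np

# ImageNet channel means in BGR order (assumed to match Keras's "caffe" mode)
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68])

def caffe_style_preprocess(rgb_images):
    """rgb_images: float array of shape (..., 3) with values in [0, 255]."""
    bgr = rgb_images[..., ::-1]        # reorder channels: RGB -> BGR
    return bgr - IMAGENET_BGR_MEANS    # zero-center each channel

# A single pure-red pixel
red_pixel = np.zeros((1, 1, 1, 3))
red_pixel[..., 0] = 255.0

preprocessed = caffe_style_preprocess(red_pixel)
print(preprocessed[0, 0, 0])
```

Each `keras.applications` model family has its own `preprocess_input`, so the function should always match the pretrained model being used.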

In [12]:
Y_proba = model.predict(inputs)

1/1 ━━━━━━━━━━━━━━━━━━━━ 4s 4s/step


In [13]:
from tensorflow import keras

top_K = keras.applications.resnet50.decode_predictions(Y_proba, top=3)

for image_index in range(len(images)):
    print("Image #{}".format(image_index))
    for class_id, name, y_prob in top_K[image_index]:
        print(" {} - {:12s} {:.2f}%".format(class_id, name, y_prob * 100))
    print()


Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/imagenet_class_index.json
[1m35363/35363[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step
Image #0
 n04328186 - stopwatch    23.72%
 n02708093 - analog_clock 23.03%
 n02783161 - ballpoint    8.99%
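Conceptually, decoding the top predictions amounts to sorting the class probabilities and looking up the matching names. The sketch below shows this with NumPy; the helper `top_k`, the probabilities, and the class names are made up for illustration and are not what `decode_predictions` uses internally.

```python
import numpy as np

def top_k(probas, class_names, k=3):
    """Return the k most probable (class name, probability) pairs."""
    best = np.argsort(probas)[::-1][:k]  # indices sorted by decreasing probability
    return [(class_names[i], float(probas[i])) for i in best]

# Hypothetical probabilities for a four-class problem
probas = np.array([0.05, 0.60, 0.10, 0.25])
class_names = ["daisy", "tulip", "rose", "sunflower"]

top = top_k(probas, class_names, k=3)
for name, proba in top:
    print(" {:12s} {:.2f}%".format(name, proba * 100))
```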



**Pretrained Models for Transfer Learning**

In [14]:
import tensorflow_datasets as tfds
dataset, info = tfds.load("tf_flowers", as_supervised=True, with_info=True)
dataset_size = info.splits["train"].num_examples # 3670
class_names = info.features["label"].names # ["dandelion", "daisy", ...]
n_classes = info.features["label"].num_classes # 5



Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/tf_flowers/3.0.1...


Dataset tf_flowers downloaded and prepared to /root/tensorflow_datasets/tf_flowers/3.0.1. Subsequent calls will reuse this data.


In [16]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds

In [18]:
test_set  = tfds.load("tf_flowers", split="train[:10%]", as_supervised=True)
valid_set = tfds.load("tf_flowers", split="train[10%:25%]", as_supervised=True)
train_set = tfds.load("tf_flowers", split="train[25%:]", as_supervised=True)


In [19]:
def preprocess(image, label):
    resized_image = tf.image.resize(image, [224, 224])
    final_image = keras.applications.xception.preprocess_input(resized_image)
    return final_image, label

batch_size = 32
train_set = train_set.shuffle(1000).map(preprocess).batch(batch_size).prefetch(1)
valid_set = valid_set.map(preprocess).batch(batch_size).prefetch(1)
test_set  = test_set.map(preprocess).batch(batch_size).prefetch(1)


**Classification and Localization**

Localizing an object in a picture can be expressed as a regression task, as discussed in Chapter 10: to predict a bounding box around the object, a common approach is to predict the horizontal and vertical coordinates of the object's center, as well as its height and width.
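Bounding boxes are often annotated by their corner coordinates, so they need converting to the center, height, and width form before being used as regression targets. A minimal sketch, assuming pixel coordinates with the origin at the top-left corner; the function name and the example values are hypothetical.

```python
def corners_to_center(x_min, y_min, x_max, y_max):
    """Convert corner coordinates to (center_x, center_y, height, width)."""
    width = x_max - x_min
    height = y_max - y_min
    center_x = x_min + width / 2
    center_y = y_min + height / 2
    return center_x, center_y, height, width

# Box spanning x from 10 to 50 and y from 20 to 100
target = corners_to_center(10, 20, 50, 100)
print(target)  # (30.0, 60.0, 80, 40)
```

In practice these four values are usually also divided by the image width and height so that they fall between 0 and 1.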

In [21]:
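# Create the optimizer before compiling. Recent Keras optimizers no longer
# accept the `decay` argument, so the same inverse-time decay is expressed
# as a learning rate schedule: lr = 0.01 / (1 + 0.001 * step)
lr_schedule = keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.01,  # adjust as needed
    decay_steps=1,
    decay_rate=0.001
)
optimizer = keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)

# Compile once the optimizer is ready. `model` is assumed to have two outputs
# (class probabilities and bounding box), with one loss per output.
model.compile(
    loss=["sparse_categorical_crossentropy", "mse"],
    loss_weights=[0.8, 0.2],  # weight classification more than localization
    optimizer=optimizer,
    metrics=["accuracy"]
)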
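The effect of `loss_weights` can be checked by hand: the total loss is the weighted sum of the classification loss and the bounding box loss. The NumPy sketch below reproduces that arithmetic for a single made-up example; the helper functions are simplified stand-ins for the Keras losses, and all values are hypothetical.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_proba):
    """Mean negative log-probability assigned to the true class."""
    true_class_proba = y_proba[np.arange(len(y_true)), y_true]
    return float(np.mean(-np.log(true_class_proba)))

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# One example: true class 0, predicted with probability 0.5
class_true = np.array([0])
class_proba = np.array([[0.5, 0.25, 0.25]])

# Bounding boxes as (center_x, center_y, height, width), normalized to [0, 1]
box_true = np.array([[0.5, 0.5, 0.2, 0.2]])
box_pred = np.array([[0.5, 0.5, 0.4, 0.4]])

class_loss = sparse_categorical_crossentropy(class_true, class_proba)
box_loss = mean_squared_error(box_true, box_pred)

# Same weights as loss_weights=[0.8, 0.2]
total_loss = 0.8 * class_loss + 0.2 * box_loss
print(class_loss, box_loss, total_loss)
```

The weights set how much each task influences the gradients, so a larger classification weight prioritizes getting the label right over tightening the box.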




**Summary**

Chapter 14 establishes CNNs as the foundation of modern computer vision. Key points include:
•	CNNs exploit spatial structure through convolutions
•	Pooling improves efficiency and robustness
•	Deep architectures learn hierarchical representations
•	Transfer learning and data augmentation are essential in practice
This chapter provides the conceptual and practical foundation for advanced vision tasks such as object detection, segmentation, and video analysis.
