# Chapter 14
## Deep Computer Vision Using Convolutional Neural Networks

In [1]:
# Utilities to load popular datasets and artificial data generators.
from sklearn.datasets import load_sample_images 

In [2]:
import tensorflow as tf

# Check that TensorFlow was installed correctly
print("TensorFlow version:", tf.__version__)
print("Available CPUs:", tf.config.list_physical_devices('CPU'))

2025-02-11 21:27:55.816752: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-02-11 21:27:55.939654: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-02-11 21:27:56.095394: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1739305676.210329    5175 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1739305676.236283    5175 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-11 21:27:56.449580: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU ins

TensorFlow version: 2.18.0
Available CPUs: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


2025-02-11 21:28:00.949075: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


In [4]:
images = load_sample_images()["images"]
# This layer crops the central portion of the images to a target size.
images = tf.keras.layers.CenterCrop(height=70, width=120)(images)
# A preprocessing layer which rescales input values to a new range.
images = tf.keras.layers.Rescaling(scale=1 / 255)(images)

In [5]:
# two sample images, height, width, RGB
images.shape

TensorShape([2, 70, 120, 3])

## 1. Building Block: Create a 2D Convolutional Layer
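- 32 filters with randomly initialized weights, each of size 7x7
- Padding "same": zero padding is added at the borders so that, with a stride of 1, the output keeps the height and width of the input
- Padding "valid": no padding is added; the filter is applied only at positions where it fits entirely inside the input, so the output shrinks
- Strides: the number of pixels the kernel moves between consecutive positions
- A convolutional layer performs a linear operation. Stacking multiple convolutional layers without an activation function would be equivalent to a single convolutional layer, so the network could not learn anything complex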
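
The last point can be checked with a minimal plain-Python sketch in one dimension. The helper names `correlate_valid` and `merge_kernels` are illustrative and not part of Keras, and the sketch assumes no bias and no activation function. It shows that applying two kernels in sequence gives the same result as applying one merged kernel once.

```python
def correlate_valid(signal, kernel):
    """Slide the kernel over the signal without padding ("valid")."""
    n, k = len(signal), len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(n - k + 1)
    ]


def merge_kernels(k1, k2):
    """Build the single kernel equivalent to applying k1 first, then k2."""
    merged = [0] * (len(k1) + len(k2) - 1)
    for i, a in enumerate(k1):
        for j, b in enumerate(k2):
            merged[i + j] += a * b
    return merged


signal = [3, 1, 4, 1, 5, 9, 2, 6]
k1 = [1, -2, 1]
k2 = [2, 0, -1]

# Two stacked linear "layers" ...
two_layers = correlate_valid(correlate_valid(signal, k1), k2)
# ... versus one layer with the merged kernel.
one_layer = correlate_valid(signal, merge_kernels(k1, k2))

print(two_layers == one_layer)  # True
```

Adding a nonlinear activation between the two layers breaks this equivalence, which is what allows a deeper stack to represent more than a single layer could.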

In [9]:
conv_layer = tf.keras.layers.Conv2D(filters=32, kernel_size=7, padding="same", strides=1)
fmaps = conv_layer(images)
fmaps.shape

TensorShape([2, 70, 120, 32])
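
The spatial part of this shape follows from the padding and stride settings. The sketch below uses a hypothetical helper, `conv_output_size`, that applies the usual output-size formulas for "same" and "valid" padding to the 70 x 120 images and the 7x7 kernel used here. It is plain Python rather than a Keras call, so treat it as a way to reason about shapes and not as the library's implementation.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial axis of a convolution."""
    if padding == "same":
        # Zero padding is added so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


height, width, kernel_size = 70, 120, 7
for padding in ("same", "valid"):
    for stride in (1, 2):
        h = conv_output_size(height, kernel_size, stride, padding)
        w = conv_output_size(width, kernel_size, stride, padding)
        print(f"padding={padding!r}, strides={stride}: {h} x {w}")
```

With padding "same" and a stride of 1 the helper returns 70 x 120, matching the feature map shape above. With padding "valid" the 7x7 kernel removes 6 pixels along each axis.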

In [10]:
kernels, biases = conv_layer.get_weights()
# [kernel_height,kernel_width,input_channels,output_channels]
print(kernels.shape)
# [output_channels]
print(biases.shape)

(7, 7, 3, 32)
(32,)
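
These two shapes determine the number of trainable parameters in the layer. As a quick sanity check, the arithmetic can be written out by hand. The variable names are illustrative, and the total should agree with what `conv_layer.count_params()` reports for the layer above.

```python
kernel_height, kernel_width = 7, 7
input_channels, output_channels = 3, 32

# One 7x7x3 kernel per output channel, plus one bias per output channel.
n_kernel_weights = kernel_height * kernel_width * input_channels * output_channels
n_biases = output_channels
n_params = n_kernel_weights + n_biases

print(n_kernel_weights, n_biases, n_params)  # 4704 32 4736
```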


### Hyperparameters
- filters
- kernel_size
- padding
- strides
- activation
- kernel_initializer

### Memory Requirements
- backpropagation needs the intermediate values computed during the forward pass, which requires a large amount of RAM
- during training, every layer contributes to RAM usage because its forward-pass outputs must be kept for the backward pass
- during inference (making predictions for new instances), only the current layer and the one before it need RAM; earlier layers can release their memory
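
A rough back-of-the-envelope sketch of this difference is shown below. It assumes float32 activations and a small hypothetical layer stack: only the first two shapes come from this notebook, the pooling layer and the second convolutional layer are made up for illustration. It counts activations only and ignores weights, gradients, optimizer state, and framework overhead, so the numbers are a lower bound and not a measurement.

```python
import math

BYTES_PER_FLOAT32 = 4


def activation_bytes(shape):
    """Memory needed to store one float32 tensor of the given shape."""
    return math.prod(shape) * BYTES_PER_FLOAT32


activation_shapes = [
    (2, 70, 120, 3),   # input batch from this notebook
    (2, 70, 120, 32),  # output of the Conv2D layer above
    (2, 35, 60, 32),   # hypothetical 2x2 max pooling layer
    (2, 35, 60, 64),   # hypothetical second Conv2D layer with 64 filters
]
sizes = [activation_bytes(shape) for shape in activation_shapes]

# Training: every activation is kept for the backward pass.
training_bytes = sum(sizes)
# Inference: only two consecutive activations must be alive at once.
inference_bytes = max(a + b for a, b in zip(sizes, sizes[1:]))

print(f"training:  {training_bytes / 2**20:.2f} MiB")
print(f"inference: {inference_bytes / 2**20:.2f} MiB")
```

The gap between the two numbers grows with network depth, because the training total sums over all layers while the inference peak only depends on the largest pair of neighbouring layers.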

## 2. Building Block: Pooling Layers
- goal: subsample (shrink) the data
- a pooling layer has no weights
- it just aggregates its inputs using an aggregation function such as mean or max
- **max pooling layer**
    - destructive: a 2x2 kernel with a stride of 2 drops 75% of the input values!
    - introduces some level of invariance to small translations
    - in some applications, such as semantic segmentation, this invariance is not desirable
    - semantic segmentation: the task of classifying each pixel in an image according to the object that pixel belongs to
    - there, if the input image is translated one pixel to the right, the output should also be translated one pixel to the right -> *equivariance*
- max pooling loses more information than average pooling but preserves the strongest features, and is therefore more popular
- pooling can also be performed along the depth axis
    - not as common
    - lets the CNN learn to be invariant to various features
    - e.g. for handwritten digits, it could ensure the same output under variations in rotation, thickness, brightness, skew, color, ...
- GlobalAveragePooling2D is another popular pooling layer
    - very destructive: outputs a single number per feature map (the mean of the whole feature map)
    - beneficial just before the output layer
In [14]:
# Create a max pooling layer. By default, pool_size is 2x2, strides equals pool_size (2), and padding is "valid" (no padding).
# Use AvgPool2D instead for average pooling.
max_pool = tf.keras.layers.MaxPool2D(pool_size=2)
# Global average pooling: get the mean intensity of each RGB channel for each image.
# The lambda averages over the spatial axes 1 (height) and 2 (width).
global_avg_pool = tf.keras.layers.Lambda(lambda X: tf.reduce_mean(X, axis=[1,2]))
global_avg_pool(images)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.64338624, 0.5971759 , 0.5824972 ],
       [0.76306933, 0.2601113 , 0.10849128]], dtype=float32)>
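
The output has one row per image and one column per channel. To make the axis reduction explicit, here is a minimal plain-Python sketch of the same computation on a tiny, made-up batch. The helper `global_average_pool` and the pixel values are illustrative and are not taken from the sample images.

```python
def global_average_pool(batch):
    """Average each channel over height and width: [batch, h, w, c] -> [batch, c]."""
    pooled = []
    for image in batch:
        n_pixels = len(image) * len(image[0])
        n_channels = len(image[0][0])
        pooled.append([
            sum(pixel[ch] for row in image for pixel in row) / n_pixels
            for ch in range(n_channels)
        ])
    return pooled


# One hypothetical 2x2 image with 3 channels.
batch = [[
    [[1.0, 0.0, 0.5], [0.0, 0.0, 0.5]],
    [[1.0, 1.0, 0.5], [0.0, 1.0, 0.5]],
]]

print(global_average_pool(batch))  # [[0.5, 0.5, 0.5]]
```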