# Using Convolutional Neural Networks for Computer Vision

Convolutional neural networks (CNNs) are a type of architecture used for computer vision, voice recognition and natural language processing. They are
inspired by the functioning of the human visual cortex. The most important building block of a CNN is the _convolutional layer_: neurons in the first convolutional layer are not connected to every single pixel in the input image (as they were in the layers discussed in previous chapters), but only to pixels in their receptive
fields. In turn, each neuron in the next layer is connected only to neurons located within a small rectangle in the first layer. This lets the network
concentrate on low-level features in the lower layers and assemble them into higher-level features in the following layers. A neuron located in row _i_,
column _j_ of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$ and columns $j$ to
$j + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field.  
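
To make the indexing concrete, here is a small sketch in plain Python. The helper name `receptive_field` is made up for illustration, and it assumes a stride of 1 with indices expressed in the (possibly zero-padded) previous layer. It lists the rows $i$ to $i + f_h - 1$ and columns $j$ to $j + f_w - 1$ that feed a given neuron:

```python
def receptive_field(i, j, f_h, f_w):
    """Return the row and column ranges of previous-layer neurons feeding neuron (i, j).

    Hypothetical helper for illustration only: assumes a stride of 1.
    """
    rows = range(i, i + f_h)  # rows i .. i + f_h - 1
    cols = range(j, j + f_w)  # columns j .. j + f_w - 1
    return rows, cols


rows, cols = receptive_field(i=2, j=5, f_h=3, f_w=3)
print(list(rows), list(cols))  # [2, 3, 4] [5, 6, 7]
```
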
The weights of a neuron can be thought of as a small image the size of the receptive field. Two simple examples of such sets of weights (called __filters__,
__convolutional kernels__, or just __kernels__) are the vertical filter, represented by a black square with a vertical white line in the middle, and the
horizontal filter, a black square with a horizontal white line in the middle. In practice, a convolutional layer has multiple filters (how many is up to us) and outputs one
feature map per filter (a __feature map__ highlights the features detected by the filter in different regions of the input image). Now let's look at how to
implement a convolutional layer with Keras:

In [1]:
from sklearn.datasets import load_sample_images
import tensorflow as tf

images = load_sample_images()["images"]
images = tf.keras.layers.CenterCrop(height=70, width=120)(images)
images = tf.keras.layers.Rescaling(scale=1 / 255)(images)

# Now we can create a convolutional layer and feed it our images.
# Without a nonlinear activation such as ReLU, stacked convolutional layers
# would remain a linear operation and could not learn complex patterns.
conv_layer = tf.keras.layers.Conv2D(filters=32, kernel_size=7, activation="relu")
fmaps = conv_layer(images)
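
The spatial size of the resulting feature maps follows from the kernel size, the stride and the padding. The sketch below is plain Python, not part of Keras: `conv_output_size` is a hypothetical helper that applies the usual `"valid"` and `"same"` size rules along one dimension. With the layer's default `"valid"` padding and stride 1, a 7×7 kernel on the 70×120 crops should give 64×114 feature maps, with 32 channels (one per filter):

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output size of a convolution along one spatial dimension.

    Hypothetical helper for illustration; it is not a Keras function.
    """
    if padding == "valid":
        # No zero padding: the kernel must fit entirely inside the input
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # Zero padding is added so that only the stride shrinks the output
        return math.ceil(input_size / stride)
    raise ValueError(f"Unknown padding: {padding!r}")


height = conv_output_size(70, kernel_size=7)
width = conv_output_size(120, kernel_size=7)
print(height, width)  # 64 114
```

Setting `padding="same"` on the layer would instead keep the 70×120 size.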

The second important component of CNNs is the __pooling layer__. Its goal is to _subsample_ (i.e., shrink) its input in order to reduce the
computational load and memory usage of the model. Pooling layers also introduce a degree of invariance to small translations. The most common type is
max pooling, where the input is divided into small sub-regions (non-overlapping when the stride equals the pool size) and the maximum value in each
sub-region is selected to form the pooled feature map. This operation preserves the most prominent features. Average pooling works the same way, except
that it computes the average of each sub-region instead of the maximum. It is less aggressive than max pooling and tends to smooth the feature map. Here is how
to create a max pooling layer with Keras:

In [None]:
max_pool = tf.keras.layers.MaxPool2D(pool_size=2) # To use average pooling we can use AveragePooling2D
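
To see what a pooling layer with `pool_size=2` computes, here is a minimal sketch in plain Python on a single 4×4 feature map. The function `max_pool_2x2` is a hypothetical helper written for this illustration, not the Keras implementation:

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D list of numbers.

    Rows and columns that do not fill a complete 2x2 window are dropped,
    which corresponds to "valid" padding.
    """
    pooled = []
    for r in range(0, len(fmap) - 1, 2):
        row = []
        for c in range(0, len(fmap[0]) - 1, 2):
            window = [fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1]]
            row.append(max(window))  # keep only the strongest activation
        pooled.append(row)
    return pooled


fmap = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
print(max_pool_2x2(fmap))  # [[6, 2], [2, 9]]
```

Replacing `max(window)` with `sum(window) / 4` would turn this sketch into average pooling.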

Now we can start building entire CNN architectures. A typical CNN architecture stacks a few convolutional layers (each generally followed by a ReLU
activation), then a pooling layer, then a few more convolutional layers, then another pooling layer, and so on, ending with a few dense layers that output the prediction.

In [None]:
from functools import partial

DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, padding="same", activation="relu", kernel_initializer="he_normal")
model = tf.keras.Sequential([
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    tf.keras.layers.MaxPool2D(),
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=128, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=64, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=10, activation="softmax")
])
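
It can help to trace how the tensor shape evolves through the convolutional part of this model. The sketch below is plain Python arithmetic rather than Keras code. It assumes what the model above specifies: `"same"` padding with stride 1 for the convolutions, and the default `MaxPool2D` settings (2×2 pooling with stride 2):

```python
height, width, channels = 28, 28, 1  # input_shape=[28, 28, 1]

# (layer kind, number of filters) in the order used by the model above
layers = [("conv", 64), ("pool", None), ("conv", 128), ("conv", 128), ("pool", None)]

for kind, filters in layers:
    if kind == "conv":
        # "same" padding and stride 1 keep the spatial size; only channels change
        channels = filters
    else:
        # 2x2 max pooling with stride 2 halves each spatial dimension
        height, width = height // 2, width // 2
    print(kind, (height, width, channels))

flattened = height * width * channels
print("flatten", flattened)  # flatten 6272
```

The `Flatten` layer therefore passes 7 × 7 × 128 = 6272 values to the first dense layer.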

## Building a ResNet CNN with Keras

ResNet (__Residual Network__) is a type of deep neural network architecture that addresses the problem of vanishing gradients in very deep networks,
allowing the training of much deeper networks than was previously feasible. The core idea of ResNet is to learn a "residual" mapping instead of
the direct mapping. If the desired underlying mapping is $H(x)$, ResNet reformulates it as $F(x) + x$, where $F(x) = H(x) - x$ is the residual
mapping that needs to be learned. To do this, ResNet introduces shortcut connections, or _skip connections_, which skip one or more layers and add the
input of a block directly to its output. Let's implement the residual unit, the building block of __ResNet-34__ (34 layers), with Keras:

In [None]:
DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, strides=1, padding="same", kernel_initializer="he_normal", use_bias=False)

class ResidualUnit(tf.keras.layers.Layer):
    def __init__(self, filters, strides=1, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.activation = tf.keras.activations.get(activation)
        # Main path: two 3x3 convolutions, each followed by batch normalization
        self.main_layers = [
            DefaultConv2D(filters, strides=strides),
            tf.keras.layers.BatchNormalization(),
            self.activation,
            DefaultConv2D(filters),
            tf.keras.layers.BatchNormalization()
        ]
        # Skip path: identity by default. When the main path downsamples (strides > 1),
        # a 1x1 convolution with the same stride makes the output shapes match.
        self.skip_layers = []
        if strides > 1:
            self.skip_layers = [
                DefaultConv2D(filters, kernel_size=1, strides=strides),
                tf.keras.layers.BatchNormalization()
            ]

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)
        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)
        # Add the skip connection to the main path, then apply the activation
        return self.activation(Z + skip_Z)
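
The residual unit is only the building block. The text above does not show how the units are stacked, so the sketch below assumes the standard ResNet-34 layout (3, 4, 6 and 3 units with 64, 128, 256 and 512 filters). It is plain Python: it only computes the `(filters, strides)` configuration of each unit and checks the layer count:

```python
# Assumed standard ResNet-34 layout: number of filters for each residual unit
filter_schedule = [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3

prev_filters = 64  # filters of the initial convolutional layer (assumed)
unit_configs = []
for filters in filter_schedule:
    # Downsample with stride 2 whenever the number of filters changes
    strides = 1 if filters == prev_filters else 2
    unit_configs.append((filters, strides))
    prev_filters = filters

# Each unit has 2 convolutional layers on its main path; add the initial
# convolutional layer and the final dense layer to get the usual count
n_weight_layers = 2 * len(unit_configs) + 2
print(len(unit_configs), n_weight_layers)  # 16 34
```

Each pair could then be passed to `ResidualUnit(filters, strides=strides)` and added to a `tf.keras.Sequential` model, after an initial convolution and pooling stage and before a global average pooling layer and a dense softmax output.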

## Classification and object localization

Localizing an object in a picture can be expressed as a regression task: to predict a bounding box around the object, a common approach is to predict the
horizontal and vertical coordinates of the object’s center, as well as its height and width. This does not require much change to the model; we just
need to add a second dense output layer with four units (on top of the global average pooling layer).

In [None]:
# n_classes (the number of classes in the dataset) is assumed to be defined earlier
base_model = tf.keras.applications.xception.Xception(weights="imagenet", include_top=False)
avg = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
# Two heads share the same features: one for the class, one for the bounding box
class_output = tf.keras.layers.Dense(n_classes, activation="softmax")(avg)
loc_output = tf.keras.layers.Dense(4)(avg)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
model = tf.keras.Model(inputs=base_model.input, outputs=[class_output, loc_output])
# One loss per output; loss_weights sets how much each contributes to the total loss
model.compile(loss=["sparse_categorical_crossentropy", "mse"], loss_weights=[0.8, 0.2], optimizer=optimizer, metrics=["accuracy"])
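
The `loss_weights` argument controls how the two losses are combined: the total loss minimized during training is the weighted sum of the per-output losses. A tiny sketch of that arithmetic, where the function name and the loss values are made up for illustration:

```python
def total_loss(class_loss, loc_loss, weights=(0.8, 0.2)):
    """Weighted sum of the two output losses, mirroring loss_weights=[0.8, 0.2]."""
    return weights[0] * class_loss + weights[1] * loc_loss


# Hypothetical loss values for one batch
print(round(total_loss(class_loss=1.0, loc_loss=0.5), 2))  # 0.9
```

With these weights the classification loss dominates, so the model is pushed to get the class right before refining the bounding box.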

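The problem we have now is that the dataset does not have bounding boxes around the flowers in the images. To solve this, we can annotate the images
ourselves with a labeling tool such as _ImgLab_ or _VGG Image Annotator_, or use crowdsourcing (paying freelancers to label the data for us). Note that the
bounding box values should be normalized, just like the input features, for example so that the coordinates, height and width all range from 0 to 1. The MSE
used above works as a training loss, but the most common metric for evaluating bounding box predictions is the _intersection over union_ (IoU): the area of
overlap between the predicted and the target bounding boxes, divided by the area of their union. Keras provides an IoU-based metric in _tf.keras.metrics.MeanIoU_.  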
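
Here is a minimal sketch of how intersection over union can be computed for a single pair of boxes, in plain Python. The helper names `center_to_corners` and `iou` are hypothetical. The box format (center coordinates, height and width, all normalized to the 0 to 1 range) is an assumption based on the localization approach described earlier:

```python
def center_to_corners(cx, cy, height, width):
    """Convert a (center_x, center_y, height, width) box to (x_min, y_min, x_max, y_max)."""
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping rectangle (0 if the boxes do not overlap)
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


true_box = center_to_corners(0.5, 0.5, height=0.4, width=0.4)
pred_box = center_to_corners(0.6, 0.5, height=0.4, width=0.4)  # shifted right by 0.1
print(round(iou(true_box, pred_box), 3))  # 0.6
```

An IoU of 1 means the predicted box matches the target exactly, and 0 means the boxes do not overlap at all.
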
If the image contains multiple objects 