<a href="https://colab.research.google.com/github/chaitragopalappa/MIE590-690D/blob/main/5_lab_NN_for_images.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

References/Source:
* Chapter 14, Probabilistic Machine Learning: An Introduction by Kevin Murphy  
* Dive into Deep Learning, by Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola https://d2l.ai/index.html

* [Deep learning with Python, Francois Chollet](https://sourestdeeds.github.io/pdf/Deep%20Learning%20with%20Python.pdf)
  * [GITHUB with Keras](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/README.md)

**Applications of ConvNet**

1. Image classification:
  * Single-label classification: assign one label to each image
  * Multi-label classification: tag all categories that an image belongs to,
    * Example: when we search for a photo in a photo app (e.g., Google Photos), it queries a very large multi-label classification model behind the scenes.

2. Image segmentation: “segment” or “partition” an image into different areas, with each area usually representing a category.
  * Example: when Zoom or Google Meet displays a custom background behind you in a video call, it’s using an image segmentation model to tell your face apart from what’s behind it, at pixel precision.
  * Applications: medical imaging (e.g., tumor detection), autonomous driving (e.g., road surface classification), remote sensing (e.g., flood monitoring, remote monitoring of road conditions).

3. Object detection: draw rectangles (called bounding boxes) around objects of interest in an image, and associate each rectangle with a class.
  * Example: a self-driving car could use an object-detection model to monitor cars, pedestrians, and signs in view of its cameras.
  * Applications: industrial (e.g., automated defect detection on production lines, inspecting products for flaws, and ensuring quality control); medical imaging (e.g., detecting anomalies, tumors, or specific structures in X-rays, MRIs, CT scans, and other medical images).

---



<img src="https://github.com/chaitragopalappa/MIE590-690D/blob/main/images/Chapter%2011%20-%20Deep%20Learning%20With%20Python.png?raw=true" height="400" width ="400">

*Source: Chapter 11: Deep Learning with Python, Francois Chollet*

---


# Exercise 1. Classification on MNIST data
Change to a GPU runtime: Runtime --> Change runtime type

```
import sys
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt

print('Python', sys.version)
print('TensorFlow', tf.__version__)
print('GPU available:', tf.config.list_physical_devices('GPU'))
#---------------------------------

import keras
from keras import layers

inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(filters=1, kernel_size=3, activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=1, kernel_size=3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

#----------------------------------
model.summary(line_length=80)
#-------------------------------------

from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_images, train_labels, epochs=5, batch_size=64)

#------------------------------------------
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.3f}")

#------------------------------------------
# show a few test images with predictions
preds = np.argmax(model.predict(test_images[:12]), axis=1)
fig, axes = plt.subplots(3,4, figsize=(10,6))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(test_images[i].squeeze(), cmap='gray')
    ax.set_title(f'pred: {preds[i]} / true: {test_labels[i]}')
    ax.axis('off')
plt.show()
```
----



1. Understand the code.
  * The code is written using the functional API (see keras.Model https://keras.io/api/models/model/ )
  * What is sparse categorical cross-entropy?
  * How many channels does the input image have? How many channels does the output of each Conv2D layer have? How many features are learnt in each layer?
  * For image classification, after the ConvNet layers extract features, the features are typically flattened and sent to a feed-forward neural network (FFNN) for classification. Why is there no flatten layer here?
  * How deep is the FFNN?
  * Is there any restriction on the layers of the FFNN?
  * What is the role of maxpooling layer?
2. Improve the model prediction - you can get >90% accuracy
  * Think about what could be modified to improve fit
  * Filters have dimension $H \times W \times C \times D$ (height, width, input channels, feature maps). What is the role of C and D? What should be the value of C and D? Increase the value of D.
  * Stack several convolution layers on top of each other (instead of alternating between convolution and pooling).
      * Stacking has the advantage of increasing the receptive field (e.g., stacking 3 convolution layers with $3 \times 3$ filters gives a receptive field of $3 \times 3$ in the first layer, $5 \times 5$ in the second, and $7 \times 7$ in the third layer).
      * The receptive field can also be increased by using a larger $7 \times 7$ filter in one layer.
      * Which one is better - stacking layers with smaller filters in each layer or one layer with larger filter - and why?
3. Play with the settings below - what is the role of each?
  * Number of Conv2D layers
  * Padding
  * Stride
  * Batchnorm
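
The receptive-field sizes quoted in item 2 can be checked with a small helper. This is a minimal sketch, not part of the lab code: the function name `receptive_field` and the `(kernel_size, stride)` layer description are assumptions made for illustration, and padding is ignored because it does not change the receptive-field size.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance, in input pixels, between neighbouring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# One, two and three stacked 3x3 convolutions (stride 1)
stacked = [receptive_field([(3, 1)] * n) for n in (1, 2, 3)]
print(stacked)  # [3, 5, 7]

# A single 7x7 convolution reaches the same receptive field in one layer
single = receptive_field([(7, 1)])
print(single)  # 7

# Conv 3x3 -> MaxPool 2x2 (stride 2) -> Conv 3x3, as in the Exercise 1 model
with_pool = receptive_field([(3, 1), (2, 2), (3, 1)])
print(with_pool)  # 8
```
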
---


# Exercise 2. Data augmentation for small samples and regularization

See 5b_Code_DataAugmentation.ipynb

Observe
1. Data processing and preparation
2. Data augmentation
  * Overfitting can be caused by having too few samples to learn from.
  * Data augmentation generates more training data from existing training samples by augmenting the samples via a number of random transformations that yield believable-looking images.
    * This helps expose the model to more aspects of the data so it can generalize better.
3. Regularization: Dropout layer

---



### Data Augmentation
```
data_augmentation_layers = [
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.2),
]

def data_augmentation(images, targets):
    for layer in data_augmentation_layers:
        images = layer(images)
    return images, targets

augmented_train_dataset = train_dataset.map(
    data_augmentation, num_parallel_calls=8
)
augmented_train_dataset = augmented_train_dataset.prefetch(tf.data.AUTOTUNE)
```

* RandomFlip("horizontal"): Applies horizontal flipping to a random 50% of the images that go through it.
* RandomRotation(0.1): Rotates the input images by a random value in the range [-10%, +10%] (these are fractions of a full circle; in degrees, the range would be [-36 degrees, +36 degrees]).
* RandomZoom(0.2): Zooms in or out of the image by a random factor in the range [-20%, +20%].
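
To make the first two bullets concrete, here is a minimal NumPy sketch of the idea behind a random horizontal flip and of the rotation-factor arithmetic. It is an illustration only, not the Keras implementation: the helper names `random_horizontal_flip` and `rotation_factor_to_degrees` and the toy batch are assumptions.

```python
import numpy as np

def random_horizontal_flip(images, rng, p=0.5):
    """Flip each image of a (N, H, W, C) batch left-right with probability p."""
    flip = rng.random(images.shape[0]) < p       # one coin toss per image
    out = images.copy()
    out[flip] = out[flip][:, :, ::-1, :]         # reverse the width axis
    return out, flip

def rotation_factor_to_degrees(factor):
    """Convert a fraction of a full circle to degrees."""
    return factor * 360.0

rng = np.random.default_rng(0)
batch = rng.random((8, 4, 4, 1))                 # toy batch of 8 tiny images
flipped, mask = random_horizontal_flip(batch, rng)

print("flipped images:", int(mask.sum()), "of", len(mask))
print("max rotation in degrees:", rotation_factor_to_degrees(0.1))  # approximately 36
```

Because the transformations are re-drawn every time a batch passes through the pipeline, the model rarely sees exactly the same image twice.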

---

# **Other forms of convolution: layers specific to image segmentation and object detection**
1. Dilated convolution
2. Transposed convolution
3. Depth-wise separable convolution

---

**Dilated convolution**
Convolution is an operation that combines the pixel values in a local neighborhood. By using striding and stacking many layers of convolution together, we can enlarge the receptive field of each neuron, which is the region of input space that each neuron responds to. However, we would need many layers to give each neuron enough context to cover the entire image (unless we used very large filters, which would be slow and require too many parameters).
An alternative is **convolution with holes**, also known by the French term **à trous algorithm**, and more recently called **dilated convolution**. This method simply takes every $r$'th input element when performing convolution, where $r$ is the rate or dilation factor.

In 1D convolution, convolving with a filter $w = [w_1, w_2, w_3]$ using rate $r = 2$ is equivalent to regular convolution with the zero-filled filter $\tilde{w} = [w_1, 0, w_2, 0, w_3]$.
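
The 1D equivalence can be verified numerically. The sketch below is illustrative: the helper names `dilated_conv1d` and `dilate_filter` are assumptions, and "convolution" is taken in the CNN sense (cross-correlation, valid padding, stride 1).

```python
import numpy as np

def dilated_conv1d(x, w, r):
    """Valid 1D dilated convolution: y[i] = sum_u w[u] * x[i + r*u]."""
    span = r * (len(w) - 1) + 1              # input elements covered by the filter
    n_out = len(x) - span + 1
    return np.array([sum(w[u] * x[i + r * u] for u in range(len(w)))
                     for i in range(n_out)])

def dilate_filter(w, r):
    """Insert r - 1 zeros ("holes") between consecutive filter taps."""
    w_dilated = np.zeros(r * (len(w) - 1) + 1)
    w_dilated[::r] = w
    return w_dilated

x = np.arange(10, dtype=float)
w = np.array([1.0, 2.0, 3.0])

y_dilated = dilated_conv1d(x, w, r=2)        # dilation applied to the input
w_holes = dilate_filter(w, r=2)              # [1, 0, 2, 0, 3]
y_regular = dilated_conv1d(x, w_holes, r=1)  # regular convolution, zero-filled filter

print(w_holes)
print(np.allclose(y_dilated, y_regular))     # True
```

Both versions give the same output, but the dilated form only stores and multiplies the three non-zero weights.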

In 2D convolution

$$
z_{i,j,d} = b_d +
\sum_{u=0}^{H-1}
\sum_{v=0}^{W-1}
\sum_{c=0}^{C-1}
x_{i + r u,\, j + r v,\, c} \, w_{u,v,c,d}
$$

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.23.png?raw=true" height="200" width ="400">

*Embedded from TextBook; Figure 14.23 Dilated Convolution: Dilated convolution with a 3x3 filter using rate 1, 2 and 3.*
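
The 2D formula above can be written out directly in NumPy. This is a slow, loop-based sketch for checking shapes and indices, assuming valid padding and stride 1; the function name `dilated_conv2d` and the random test tensors are assumptions, not part of the lab code.

```python
import numpy as np

def dilated_conv2d(x, w, b, r=1):
    """x: (H_in, W_in, C), w: (H, W, C, D), b: (D,). Returns z: (H_out, W_out, D)."""
    H, W, C, D = w.shape
    span_h, span_w = r * (H - 1) + 1, r * (W - 1) + 1   # effective filter size
    out_h = x.shape[0] - span_h + 1
    out_w = x.shape[1] - span_w + 1
    z = np.zeros((out_h, out_w, D))
    for i in range(out_h):
        for j in range(out_w):
            # Take every r-th pixel inside the window: x[i + r*u, j + r*v, c]
            patch = x[i:i + span_h:r, j:j + span_w:r, :]   # shape (H, W, C)
            # Sum over u, v, c for every output channel d
            z[i, j, :] = b + np.tensordot(patch, w, axes=3)
    return z

rng = np.random.default_rng(0)
x = rng.normal(size=(9, 9, 2))
w = rng.normal(size=(3, 3, 2, 4))
b = np.zeros(4)

shapes = {r: dilated_conv2d(x, w, b, r).shape for r in (1, 2, 3)}
print(shapes)  # {1: (7, 7, 4), 2: (5, 5, 4), 3: (3, 3, 4)}
```

With the same nine weights per channel, the $3 \times 3$ filter covers a $3 \times 3$, $5 \times 5$ and $7 \times 7$ window for rates 1, 2 and 3, which is why the valid output shrinks as the rate grows.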

---


**Transposed convolution**

In **convolution**, we reduce from a large input X to a small output Y by taking a weighted combination of the input pixels and the convolutional kernel K.

<img src="https://github.com/probml/probml-notebooks/blob/main/images/d2l-correlation.png?raw=true" height=200>

*Source: Embedded from the GitHub repo of the textbook (PML: An Introduction by Murphy). See the textbook for the original source.*

[Convolution in 2D Discrete Functions- GIF](https://en.wikipedia.org/wiki/Convolution#/media/File:2D_Convolution_Animation.gif)

In **transposed convolution**, we do the opposite, in order to produce a larger output from a smaller input.

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.24.png?raw=true" height="100" width ="800">

*Embedded from TextBook; Figure 14.24 Transposed convolution with 2x2 kernel.*

---

```
import numpy as np

# Convolution (valid, stride 1): large input X -> smaller output Y.
def conv(X, K):
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Weighted combination of the input patch and the kernel
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

# Transposed convolution: small input X -> larger output Y.
def trans_conv(X, K):
    h, w = K.shape
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # Each input pixel adds a scaled copy of the kernel to the output
            Y[i:i + h, j:j + w] += X[i, j] * K
    return Y
```
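
A small worked example may help show where the name "transposed" comes from. The sketch below re-defines `trans_conv` so that it runs on its own; the helper `conv_matrix` and the $2 \times 2$ test arrays are assumptions added for illustration. It writes a valid convolution as a matrix $W$ acting on the flattened input and checks that the transposed convolution is multiplication by $W^T$.

```python
import numpy as np

def trans_conv(X, K):
    """Transposed convolution (stride 1): each input pixel stamps a scaled kernel."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i:i + h, j:j + w] += X[i, j] * K
    return Y

def conv_matrix(K, in_shape):
    """Matrix W such that W @ X.ravel() equals the valid convolution of X with K."""
    h, w = K.shape
    out_h, out_w = in_shape[0] - h + 1, in_shape[1] - w + 1
    W = np.zeros((out_h * out_w, in_shape[0] * in_shape[1]))
    for i in range(out_h):
        for j in range(out_w):
            patch = np.zeros(in_shape)
            patch[i:i + h, j:j + w] = K      # kernel placed at position (i, j)
            W[i * out_w + j] = patch.ravel()
    return W

X = np.array([[0.0, 1.0], [2.0, 3.0]])
K = np.array([[0.0, 1.0], [2.0, 3.0]])

Y = trans_conv(X, K)
print(Y)
# [[ 0.  0.  1.]
#  [ 0.  4.  6.]
#  [ 4. 12.  9.]]

# The same result via the transpose of the convolution matrix for a 3x3 input
W = conv_matrix(K, in_shape=Y.shape)         # shape (4, 9)
Y_from_transpose = (W.T @ X.ravel()).reshape(Y.shape)
print(np.allclose(Y, Y_from_transpose))      # True
```

Note that the transposed convolution restores the shape of the original input, not its values; it is not an inverse of convolution.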

**Depthwise separable convolution**
* Standard convolution uses a filter of size $H \times W \times C \times D$, which requires a lot of data to learn and a lot of time to compute with.
* Depthwise separable convolution first convolves each input channel with a corresponding 2d filter $w$, and then maps these $C$ channels to $D$ channels using a $1 \times 1$ convolution $w'$:
$$
z_{i,j,d} = b_d +
\sum_{c=0}^{C-1}
w'_{c,d}
\sum_{u=0}^{H-1}
\sum_{v=0}^{W-1}
x_{i + u,\, j + v,\, c} \, w_{u,v,c}
$$

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.26.png?raw=true" height="200" width ="400">

*Embedded from TextBook; Figure 14.26 Depthwise separable convolutions: each of the C input channels undergoes a 2d convolution to produce C output channels, which get combined pointwise (via 1x1 convolution) to produce D output channels.*

* It is more efficient
  * Regular convolution of a $12 \times 12 \times 3$ input with a $5 \times 5 \times 3 \times 256$ filter gives an $8 \times 8 \times 256$ output (assuming valid convolution: $12 - 5 + 1 = 8$).
  * With separable convolution, we start with the $12 \times 12 \times 3$ input, convolve each of the 3 channels with its own $5 \times 5 \times 1 \times 1$ filter (across space but not channels) to get $8 \times 8 \times 3$, then pointwise convolve (across channels but not space) with a $1 \times 1 \times 3 \times 256$ filter to get an $8 \times 8 \times 256$ output. So the output has the same size as before, but the layer needs far fewer parameters ($5 \cdot 5 \cdot 3 + 3 \cdot 256 = 843$ weights instead of $5 \cdot 5 \cdot 3 \cdot 256 = 19{,}200$, ignoring biases) and much less compute.
* The efficiency of these convolutions makes them suitable for real-time applications like image classification and object detection on mobile devices and embedded systems, such as drones.
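
The savings in the $12 \times 12 \times 3$ example can be counted explicitly. The helpers below are a back-of-the-envelope sketch: the names `conv_cost` and `separable_cost` are assumptions, biases are ignored, and "compute" is measured as the number of multiplications for a valid, stride-1 convolution on a square input.

```python
def conv_cost(in_hw, kernel_hw, c_in, c_out):
    """Output size, weight count and multiplications of a regular convolution."""
    out_hw = in_hw - kernel_hw + 1
    weights = kernel_hw * kernel_hw * c_in * c_out
    mults = out_hw * out_hw * weights          # every weight is used at every position
    return out_hw, weights, mults

def separable_cost(in_hw, kernel_hw, c_in, c_out):
    """Same quantities for a depthwise convolution followed by a 1x1 convolution."""
    out_hw = in_hw - kernel_hw + 1
    depthwise_weights = kernel_hw * kernel_hw * c_in   # one 2d filter per input channel
    pointwise_weights = c_in * c_out                   # the 1x1 convolution
    weights = depthwise_weights + pointwise_weights
    mults = out_hw * out_hw * weights
    return out_hw, weights, mults

regular = conv_cost(12, 5, 3, 256)
separable = separable_cost(12, 5, 3, 256)

print(regular)    # (8, 19200, 1228800)
print(separable)  # (8, 843, 53952)
print(f"about {regular[1] / separable[1]:.1f}x fewer weights and multiplications")
```

Both layers produce an $8 \times 8 \times 256$ output, but the separable version uses roughly 23 times fewer weights and multiplications in this example.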

---

# Image segmentation

Image segmentation assigns a class to each pixel in an image, thus segmenting the image into different zones (such as “background” and “foreground” or “road,” “car,” and “sidewalk”).

* Semantic segmentation:  Each pixel is independently classified into a semantic category, like “cat.” If there are two cats in the image, the corresponding pixels are all mapped to the same generic “cat” category. Useful for segmentation of "stuff" (like sky, road).
* Instance segmentation: Parse out individual object instances. In an image with two cats in it, instance segmentation would distinguish between pixels belonging to “cat 1” and pixels belonging to “cat 2”. Useful for segmentation of "things" (like car, person)
* Panoptic segmentation:  Combine segmentation of "stuff" (like sky, road) with instance segmentation of "things" (like car, person). This is the most informative of all three segmentation types.

---


**Image Segmentation using a specific CNN architecture called U-Net**
* The goal is to predict the label and 2d shape mask of each object instance in the image. This can be done by applying a semantic segmentation model to each detected box, which labels each pixel in the box as foreground or background.
  * A **segmentation mask** is the image segmentation equivalent of a label: it’s an image the same size as the input image, with a single color channel where each integer value corresponds to the class of the corresponding pixel in the input image.
* A common way to tackle semantic segmentation is to use an encoder-decoder architecture.
  * The **encoder** uses standard convolution to map the input into a small 2d bottleneck, which captures high level properties of the input at a coarse spatial resolution. (This typically uses dilated convolution to capture a large field of view, i.e., more context.)
  * The **decoder** maps the small 2d bottleneck back to a full-sized output image using  **transposed convolution**. Since the bottleneck loses information, we can also add skip connections from input layers to output layers.

  <img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.29.png?raw=true" height="200" width ="400">

*Embedded from TextBook; Figure 14.29 Illustration of an encoder-decoder (aka U-net) CNN for semantic segmentation. The encoder uses convolution (which downsamples), and the decoder uses transposed convolution (which upsamples).*
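
The down- and upsampling described in the caption can be traced with simple arithmetic. This sketch assumes Keras-style `padding="same"` layers, where a stride-$s$ convolution maps a spatial size $n$ to $\lceil n/s \rceil$ and a stride-$s$ transposed convolution maps $n$ to $n \cdot s$; the helper names and the input sizes 200 and 100 are illustrative assumptions.

```python
import math

def conv_same(size, stride=1):
    """Spatial size after a convolution with padding='same'."""
    return math.ceil(size / stride)

def conv_transpose_same(size, stride=1):
    """Spatial size after a transposed convolution with padding='same'."""
    return size * stride

def trace_sizes(size, n_levels=3):
    """Sizes after n_levels stride-2 encoder steps, then n_levels decoder steps."""
    sizes = [size]
    for _ in range(n_levels):
        size = conv_same(size, stride=2)
        sizes.append(size)
    for _ in range(n_levels):
        size = conv_transpose_same(size, stride=2)
        sizes.append(size)
    return sizes

print(trace_sizes(200))  # [200, 100, 50, 25, 50, 100, 200]
print(trace_sizes(100))  # [100, 50, 25, 13, 26, 52, 104]
```

With three stride-2 levels the output only matches the input size when the input side is divisible by $2^3 = 8$, which is one reason segmentation inputs are usually resized to such a size.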

```
import keras
from keras.layers import Conv2D, Conv2DTranspose, Rescaling

def get_model(img_size, num_classes):
    # img_size is a (height, width) tuple; the inputs are RGB images
    inputs = keras.Input(shape=img_size + (3,))
    x = Rescaling(1.0 / 255)(inputs)  # scale pixel values to [0, 1]

    # Encoder: each stride-2 convolution halves the spatial size
    x = Conv2D(64, 3, strides=2, activation="relu", padding="same")(x)
    x = Conv2D(64, 3, activation="relu", padding="same")(x)
    x = Conv2D(128, 3, strides=2, activation="relu", padding="same")(x)
    x = Conv2D(128, 3, activation="relu", padding="same")(x)
    x = Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)
    x = Conv2D(256, 3, activation="relu", padding="same")(x)

    # Decoder: each stride-2 transposed convolution doubles the spatial size
    x = Conv2DTranspose(256, 3, activation="relu", padding="same")(x)
    x = Conv2DTranspose(256, 3, strides=2, activation="relu", padding="same")(x)
    x = Conv2DTranspose(128, 3, activation="relu", padding="same")(x)
    x = Conv2DTranspose(128, 3, strides=2, activation="relu", padding="same")(x)
    x = Conv2DTranspose(64, 3, activation="relu", padding="same")(x)
    x = Conv2DTranspose(64, 3, strides=2, activation="relu", padding="same")(x)

    # Per-pixel softmax over the classes
    outputs = Conv2D(num_classes, 3, activation="softmax", padding="same")(x)

    return keras.Model(inputs, outputs)

# img_size is assumed to be defined earlier as a (height, width) tuple
model = get_model(img_size=img_size, num_classes=3)
```

* It can be redrawn as below. Since the overall structure resembles the letter U, this is also known as a U-net


  <img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.30.png?raw=true" height="400" width ="800">

*Embedded from TextBook; Figure 14.30 Illustration of the U-Net model for semantic segmentation. Each blue box corresponds to a multi-channel feature map. The number of channels is shown on the top of the box, and the height/width is shown in the bottom left. White boxes denote copied feature maps. The different colored arrows correspond to different operations.*
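
A segmentation model with a per-pixel softmax output, such as `get_model` above, returns an array of class probabilities of shape (height, width, num_classes) for each image. The sketch below shows one common way to turn that into a segmentation mask by taking the most probable class at each pixel. The random `probs` array is a stand-in for a real model prediction, and the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, height, width = 3, 4, 5

# Stand-in for one model prediction: softmax probabilities per pixel
logits = rng.normal(size=(height, width, num_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Segmentation mask: one integer class id per pixel, single channel
mask = np.argmax(probs, axis=-1)

print(probs.shape)  # (4, 5, 3)
print(mask.shape)   # (4, 5)
print(mask)
```

The resulting `mask` has the same height and width as the input image and can be compared pixel by pixel with the ground-truth mask.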


# Exercise 3: Image segmentation

1. Review code 5c and read Chapter 11: https://deeplearningwithpython.io/chapters/chapter11_image-segmentation/
2. Review code 5d, which uses a pre-trained model.

Questions:
  * What is the metric used to evaluate the model?
  * What is a skip connection? Does this code use skip connections?
  
