# Convolutional Neural Networks Introduction
Generally follows and uses `Python Machine Learning 3rd Edition, Raschka, Chapter 15`.

- This will introduce basic concepts regarding convolutional neural networks and is by no means comprehensive.  
- Neural networks can fill multiple full courses, so this only scratches the surface.  
- See the Reading and Resources section for more information.



## Convolutional Networks - What are they?
First proposed by [Yann LeCun](https://papers.nips.cc/paper/1989/hash/53c3bce66e43be4f209556518c2fcb54-Abstract.html).

- In deep networks of traditional hidden units, each fully connected layer $l$ adds $(size_{l-1}+1)size_l$ parameters, e.g., adding a 1,000-unit layer after an existing 1,000-unit layer adds more than 1,000,000 parameters. That quickly becomes computationally challenging to train.  
- Common in image processing as it tries to mimic the visual cortex.  
    - We know that there are different layers that process information.  
    - Some layers detect edges, outlines, and straight lines.  
    - Other layers allow us to recognize more complex shapes.
- The cortex isolates particular parts of the image, e.g., which section has the cat, which section of the cat has the tail, ....  
- With images, most neighboring pixels represent the same thing, e.g., the individual blue pixels in the sky.  
- The edges of the objects, e.g., blue sky and white clouds, will generally be the only parts of the image with different neighbors.  
- Can be viewed as feature extractors, e.g., edges and blobs of images.  
- For color images, generally use 3 versions of the image, one in each of the primary color scales (RGB).  
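
To make the parameter-growth formula at the top of this list concrete, here is a minimal sketch. The helper name `dense_layer_params` is hypothetical and only illustrates the $(size_{l-1}+1)size_l$ count (weights plus one bias per unit); the layer sizes are example values.

```python
def dense_layer_params(size_prev: int, size_curr: int) -> int:
    """Weights plus biases for one fully connected layer: (size_prev + 1) * size_curr."""
    return (size_prev + 1) * size_curr


# Adding a 1,000-unit layer after an existing 1,000-unit layer
print(dense_layer_params(1000, 1000))   # 1001000

# First hidden layer (50 units) of an MLP on flattened 28 x 28 images
print(dense_layer_params(28 * 28, 50))  # 39250
```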

<img src='files/diagrams/architecture-cnn-en.jpeg' style='height: 500px'>

[Image Source: Stanford CS230 Convolutional Neural Networks cheatsheet](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks)

- [Convolutional layers](https://keras.io/api/layers/convolution_layers/convolution2d/) detect features by moving across the image in square patches and multiplying each patch against a filter; think of it as scanning the image and "summarizing" the blocks.  
- The number and size of the filters are specified in the convolutional layer, and the weights inside each filter are learned. These individual filters are concatenated into a kernel (i.e., matrix of weights).  
- Then the network (usually) pools these scans by taking the average or maximum. Pooling layers have no learnable parameters; they just summarize the output of the convolutional layers.

### Feature Extraction
Convolutional networks are sometimes referred to as feature extractors. The layers and filters are recoverable, so you can inspect them to understand which filter is identifying certain attributes.

- Low-level features are extracted first.  
- Later layers combine these low-level features into higher-level ones within the network.

<img src='diagrams/14_01_feature_map.png'>

[Image source: Python Machine Learning 3rd Edition, page 519](https://github.com/rasbt/machine-learning-book/blob/main/ch14/figures/14_01.png)

- A local block of pixels is a `local receptive field`. Each element of a feature map is connected only to its own small patch of pixels (`sparse connectivity`), so pixels far away from each other, e.g., the pixels near the dog's face and the pixels in the grass field, are not connected to the same element.  
- Filters and pooling extract the features for the model. The features are all `latent` or `salient`, and convolutional networks are effective at extracting them, given enough examples.

### Math of Convolution (discrete example)

$$
y = x * w \rightarrow y[i] = \sum_{k=-\infty}^{+\infty} x[i-k]\,w[k]
$$

- $x$: input vector.
- $w$: filter; its entries are learned parameters.
- $y$: output of the convolution.

See Raschka, pages 522-528, for more detail.
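
The sum above can be sketched directly in NumPy. This is a minimal illustration, assuming finite vectors with everything outside $x$ treated as zero (the "full" output); `conv1d_full` is a hypothetical helper name, and the filter values are hand-picked here rather than learned.

```python
import numpy as np


def conv1d_full(x, w):
    """Discrete convolution y[i] = sum_k x[i - k] * w[k]; values outside x count as zero."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n, m = len(x), len(w)
    y = np.zeros(n + m - 1)
    for i in range(n + m - 1):
        for k in range(m):
            if 0 <= i - k < n:          # skip terms that fall outside the input
                y[i] += x[i - k] * w[k]
    return y


x = [1, 2, 3, 4]
w = [1, 0, -1]                          # hand-picked; in a network these weights are learned
y = conv1d_full(x, w)
print(y)                                # values: 1, 2, 2, 2, -3, -4
print(np.allclose(y, np.convolve(x, w)))  # True: matches NumPy's built-in convolution
```

Note that deep learning libraries, Keras included, typically compute the layer as a cross-correlation (the filter is not flipped). Because the weights are learned, this makes no practical difference.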

## Inputs, Padding, and Strides

- The inputs (`n x n` matrix of local pixels) are isolated and effectively scanned.  
- Padding adds zeros around the border of the area where the filter is being applied.  
    - This can help define the boundaries between shapes/objects.  
    - This also gives each pixel the chance to be at the center when the filter is applied, which helps with feature detection in those areas of the image.
- Strides are how big the step is to the next patch of pixels.  
    - A larger stride decreases the size of the output while keeping a similar representation, so it can be used for compression.  
    - Generally, strides are done symmetrically.



<img src='files/diagrams/padding.png'>

[Source: Hundred Page Machine Learning Book](https://www.dropbox.com/s/uh48e6wjs4w13t5/Chapter6.pdf?dl=0)


<img src='files/diagrams/strides.png'>

Keras has options for both of these hyperparameters: the Conv2D layer takes `strides` and `padding` arguments. [See the documentation for the Conv2D layer](https://keras.io/api/layers/convolution_layers/convolution2d/). Explicit zero padding is also implemented in its own layer, [see the documentation for it](https://keras.io/api/layers/reshaping_layers/zero_padding2d/). 
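
How padding and strides change the output size can be sketched with the standard relation $o = \lfloor (n + 2p - m)/s \rfloor + 1$, where $n$ is the input width, $m$ the filter width, $p$ the padding on each side, and $s$ the stride. The relation is not stated above and is brought in here as an assumption; `conv_output_size` is a hypothetical helper, not a Keras function.

```python
def conv_output_size(n: int, m: int, p: int = 0, s: int = 1) -> int:
    """Output width for input width n, filter width m, padding p per side, stride s."""
    return (n + 2 * p - m) // s + 1


print(conv_output_size(28, 5))        # 24: no padding, stride 1
print(conv_output_size(28, 5, p=2))   # 28: padding keeps the size unchanged
print(conv_output_size(28, 5, s=2))   # 12: a larger stride shrinks the output
```

The first case (28 to 24) matches the output shape of the first `Conv2D` layer in the MNIST model summary later in this notebook.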

## Pooling (Aggregating)

<img src='diagrams/14_08_pooling.png' style='width: 700px'>

[Image Source: Python Machine Learning 3rd Edition, page 531](https://github.com/rasbt/machine-learning-book/blob/main/ch14/figures/14_08.png)

- Done on the feature maps produced by the filters.  
- Helps reduce variance among the local pixels between strides.  
- Creates more robust features since there is less sensitivity to variation.  
- Since it aggregates the convolutions, it leads to fewer features, which increases computational efficiency.  
- Generally included after a convolution step; however, convolutional networks have been built without pooling layers.
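
As a rough illustration of the aggregation described above, the following NumPy sketch pools a small feature map over non-overlapping 2 x 2 blocks. `pool2d` is a hypothetical helper written for this note, not a Keras API, and the feature map values are made up.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size blocks.

    Trailing rows/columns that do not fill a complete block are dropped.
    """
    fm = np.asarray(feature_map, dtype=float)
    rows, cols = fm.shape[0] // size, fm.shape[1] // size
    # Split into blocks: axis 1 and axis 3 index the positions inside each block
    blocks = fm[:rows * size, :cols * size].reshape(rows, size, cols, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))


fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]])

print(pool2d(fm, mode="max"))    # rows: [4, 2] and [2, 8]
print(pool2d(fm, mode="mean"))   # rows: [2.5, 1.0] and [1.25, 6.5]
```

The 4 x 4 input becomes a 2 x 2 output, which is where the reduction in features comes from.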

## Channels

- May have to work with multiple data layers (channels) that represent differing intensities of color.  
- $X_{N_1 \times N_2 \times C_{in}}$, where $C_{in}=3$ for RGB colors.  
- scikit-learn's MNIST data is greyscale, or $C_{in}=1$.  
- Convolutions are done separately for each channel, and the per-channel results are then summed into a single feature map.
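
A minimal NumPy sketch of the multi-channel case, assuming no padding, a stride of 1, and an unflipped filter (the cross-correlation form used by most deep learning libraries). The helper names and the random input are hypothetical stand-ins for a real RGB image and a learned filter.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Single-channel sliding-window product-sum (no padding, stride 1, filter not flipped)."""
    n1, n2 = image.shape
    m1, m2 = kernel.shape
    out = np.zeros((n1 - m1 + 1, n2 - m2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m1, j:j + m2] * kernel)
    return out


def conv2d_multichannel(image, kernel):
    """image: (N1, N2, C_in); kernel: (m1, m2, C_in). Per-channel results are summed."""
    return sum(conv2d_valid(image[:, :, c], kernel[:, :, c])
               for c in range(image.shape[2]))


rng = np.random.default_rng(0)
rgb = rng.random((6, 6, 3))      # stand-in for a small RGB image, C_in = 3
kernel = rng.random((3, 3, 3))   # one filter with a 3 x 3 slice per input channel
fmap = conv2d_multichannel(rgb, kernel)
print(fmap.shape)                # (4, 4): one feature map, regardless of C_in
```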

## Regularization

- $l_2$ regularization.  
- Dropout (see the introductory neural network lecture).  
    - Typically done on the later layers.  
    - Randomly drops neurons during training, so the remaining neurons have to account for the missing ones.  
    - Forces simplicity.  
    - Also a form of ensembling, since it randomly drops units and then averages.  
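
The dropout idea can be sketched in a few lines of NumPy. This is an illustrative "inverted dropout" implementation written for this note, not the Keras source: kept units are rescaled by `1 / (1 - rate)` during training so the expected activation is unchanged, which is also how the Keras `Dropout` layer is documented to behave. The function name and inputs are hypothetical.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Zero each unit with probability `rate`; rescale the rest by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return activations               # nothing is dropped at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(42)
a = np.ones((4, 1024))                   # stand-in for a batch of flattened activations
dropped = dropout(a, rate=0.5, rng=rng)

print(np.unique(dropped))                # only 0 (dropped) and 2 (kept, rescaled)
print(round(float(dropped.mean()), 1))   # the mean stays close to 1
```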

## MNIST

Architecture being used: 

<img src='diagrams/14_12_arch.png'>

[Image Source: Python Machine Learning 3rd Edition, Page 542](https://github.com/rasbt/machine-learning-book/tree/main/ch14)

> We will also include dropout just before the last (dense) layer.

In [12]:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import datetime

st = datetime.datetime.now()
# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)
# 28 x 28 = 784

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

en = datetime.datetime.now()
el = en - st

print(f'Elapsed time: {el}')

Elapsed time: 0:00:00.257845


In [13]:
x_train.shape

(60000, 28, 28)

In [14]:
y_train.shape

(60000,)

In [15]:
# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255

# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

print(f'{x_train.shape[0]:,} train samples')
print(f'{x_test.shape[0]:,} test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

print(f'Columns in y: {y_train.shape[1]:,}')

60,000 train samples
10,000 test samples
Columns in y: 10


In [16]:
y_train[:5, :]

array([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)

> We one-hot encoded ("exploded") the categories so we can use the categorical cross-entropy loss. If we hadn't done this, we would have needed to use the `sparse` version (`sparse_categorical_crossentropy`).

[Keras Loss Functions](https://keras.io/api/losses/)
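
The relationship between the two label formats can be sketched with plain NumPy. This is an illustration only, using the first five training labels shown above, and it does not call Keras.

```python
import numpy as np

num_classes = 10
labels = np.array([5, 0, 4, 1, 9])                      # integer labels ("sparse" format)

# One-hot encode: row i of the identity matrix is the encoding of class i
one_hot = np.eye(num_classes, dtype="float32")[labels]
print(one_hot.shape)                                    # (5, 10)

# argmax recovers the integer labels from the one-hot rows
recovered = one_hot.argmax(axis=1)
print(recovered)
```

With integer labels like `labels`, the model would be compiled with `loss="sparse_categorical_crossentropy"`. With one-hot rows like `one_hot`, `loss="categorical_crossentropy"` is the matching choice.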

In [17]:
model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(filters=32, kernel_size=(5, 5), activation="relu", strides=(1,1)),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(filters=64, kernel_size=(5, 5), activation="relu", strides=(1,1)),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ], name= 'Convolutional'
)

model.summary()

Model: "Convolutional"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_2 (Conv2D)            (None, 24, 24, 32)        832       
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 12, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 8, 8, 64)          51264     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 4, 4, 64)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1024)              0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)              

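> Note the number of parameters each model needs to learn
- 62,346 for the convolutional network as summarized above (832 + 51,264, plus 10,250 for the final dense layer). With 3 x 3 kernels in both convolutional layers, the same architecture would need only 34,826, despite having several more layers than the deep MLP.  
- About 42,000 for the network with two hidden 50-unit layers.  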
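
The per-layer counts in the model summary can be reproduced by hand. The sketch below assumes the usual counting (one weight per kernel entry per input channel, plus one bias per filter or unit); the helper names are hypothetical.

```python
def conv2d_params(kernel_h: int, kernel_w: int, c_in: int, c_out: int) -> int:
    """Weights (kernel_h * kernel_w * c_in per filter) plus one bias per filter."""
    return (kernel_h * kernel_w * c_in + 1) * c_out


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


conv1 = conv2d_params(5, 5, 1, 32)      # 832, as in the summary
conv2 = conv2d_params(5, 5, 32, 64)     # 51264, as in the summary
dense = dense_params(4 * 4 * 64, 10)    # 10250: flattened 4 x 4 x 64 input, 10 classes
print(conv1, conv2, dense, conv1 + conv2 + dense)

# MLP with two hidden 50-unit layers on flattened 28 x 28 inputs
mlp = dense_params(784, 50) + dense_params(50, 50) + dense_params(50, 10)
print(mlp)                              # 42310
```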

In [19]:
batch_size = 128
epochs = 15

model.compile(loss="categorical_crossentropy", 
              optimizer="adam", metrics=["accuracy"])

model.fit(x_train, y_train, 
          batch_size=batch_size, 
          epochs=epochs, validation_split=0.1)

Train on 54000 samples, validate on 6000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7f8d457ec390>

In [23]:
score = model.evaluate(x_test, y_test, verbose=0)
print(f'Test loss: {score[0]:.5f}')
print(f'Test accuracy: {score[1]:.5f}')

Test loss: 0.01789
Test accuracy: 0.99420


> Superior performance versus other ANNs we've tried on the MNIST data.

## Potential Data Issues

- May need to augment your data with rotations, zoom in/out, crops, intensity changes.  
- See Raschka, pages 550-564.
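
As a rough sketch of two of these augmentations (a small random shift via pad-and-crop, and an intensity change), the NumPy snippet below works on a single image scaled to [0, 1]. The function name, shift size, and intensity range are all hypothetical choices for illustration; recent Keras versions also ship preprocessing layers such as `RandomRotation` and `RandomZoom` for this purpose.

```python
import numpy as np


def random_shift_and_intensity(image, rng, max_shift=2, intensity_range=(0.8, 1.2)):
    """image: (H, W, C) array in [0, 1]. Returns a shifted, intensity-scaled copy."""
    h, w, _ = image.shape
    # Zero-pad the borders, then crop a random H x W window: a shift of up to max_shift pixels
    padded = np.pad(image, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)))
    top = rng.integers(0, 2 * max_shift + 1)
    left = rng.integers(0, 2 * max_shift + 1)
    cropped = padded[top:top + h, left:left + w, :]
    # Scale pixel intensities and keep them inside [0, 1]
    scale = rng.uniform(*intensity_range)
    return np.clip(cropped * scale, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.random((28, 28, 1)).astype("float32")   # stand-in for one scaled MNIST image
augmented = random_shift_and_intensity(image, rng)
print(augmented.shape)                              # (28, 28, 1)
```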

# Reading and Resources 
- [TensorFlow Clothing Classification](https://www.tensorflow.org/tutorials/keras/classification)  
- [Convolutional NN Example](https://www.tensorflow.org/tutorials/images/cnn)  
- [Stanford CS 230 on Convolutional Neural Networks](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks)
- [DeepLearning AI Videos](https://www.youtube.com/channel/UCcIXc5mJsHVYTZR1maL5l9w)  