# Convolutional Neural Networks

#### Group Members: Arhum Zafar, Rebecca Mercer, Abhiram Koganti

## Abstract

A Convolutional Neural Network (CNN) is a deep learning algorithm commonly applied to visual image analysis. CNNs use convolution and pooling to produce classifications. Convolution detects features (such as “edges”), and pooling reduces the resolution of the resulting feature maps by representing them in smaller matrices. Together, these feature-extraction steps make classification of large images faster and less memory-intensive than it would be with a fully connected deep network. CNNs have a wide range of applications in image and video recognition, image classification, natural language processing, and more. The intent of this project is to introduce the CNN algorithm and its fundamentals.

#### Learning Objectives
- Students will understand the motivation behind convolutional neural networks.
- Students will be able to perform the mathematics of convolution.
- Students will have exposure to the general architecture of convolutional neural networks.

Our background will lead students through the steps of a CNN from input through feature extraction to classification. We will focus on the mathematical techniques for performing convolution. In the activity, students will practice convolution on examples from the MNIST dataset that we have set up. Students will run one script that uses pooling and convolution for feature extraction and another that performs classification on the extracted feature maps.

## Background

### Introduction

Inputs to many machine learning models consist of features or attributes of specific instances. In deep learning, raw data, which is data without selected features, are used as inputs. The deep learning algorithm is responsible for extracting the features itself, while it continues to learn as it is used. Deep learning methods are very important to image recognition and other image processing applications due to their ability to work on very large amounts of data efficiently. <br>
<br>
Convolutional Neural Networks (CNNs) are deep learning methods that process large data inputs to recognize specific local patterns. Convolution and pooling layers are used to reduce the input into something computationally easier without losing important features.
As images are typically represented as matrices of pixels, classification of images involves outputting a label from an input image. A CNN produces this output by first examining low-level features (lines, curves, dots) and then examining higher-level features.

### Convolution

Convolution finds features (such as “edges” in an image) by sliding a small “filter” matrix across the larger input matrix (the image) and, at each position, multiplying the filter element-wise with the patch of the input it covers and summing the products. The result is then “pooled” to reduce the data to a single value representing the whole of a rectangular region of the input image. For example, after applying a filter that detects horizontal lines, a pooling step might take each 3x3 area of the result and turn it into a single value representing the likelihood that there is a horizontal line in that portion of the image. Pooling thus __reduces the volume of data.__ <br>
<br>
Filters may be responsible for identifying features like a green line, a horizontal gradient, or a blue dot in an image, but they may also operate on features that are harder to describe or for which there is no obvious human description. Because the filter is applied all over the input image, convolution is less sensitive to where a feature appears in the image, but it can still be very sensitive to other qualities of the image such as overall brightness, orientation, and color. This is why CNNs are often trained with images presented at multiple angles, saturations, and exposures, which yields better performance when it is time to classify new images.
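Training on varied versions of the same image can be sketched in a few lines of NumPy. The `augment` helper below is a hypothetical illustration, not part of this activity's scripts. It assumes a 2D grayscale image with values in [0, 1] and applies only rotations, a mirror flip, and a brightness change.

```python
import numpy as np

def augment(image, rng):
    """Return a few simple variants of a 2D grayscale image with values in [0, 1]."""
    variants = [image]
    # Rotations by 90, 180, and 270 degrees (np.rot90 rotates counterclockwise).
    for k in (1, 2, 3):
        variants.append(np.rot90(image, k))
    # Horizontal mirror image.
    variants.append(np.fliplr(image))
    # Random brightness ("exposure") change, clipped back to the valid range.
    factor = rng.uniform(0.7, 1.3)
    variants.append(np.clip(image * factor, 0.0, 1.0))
    return variants

rng = np.random.default_rng(0)
demo_image = rng.random((28, 28))   # stand-in for a real image
augmented = augment(demo_image, rng)
print(len(augmented), augmented[1].shape)
```

Which transformations are appropriate depends on the task. For handwritten digits, for instance, a large rotation or a flip can change what the image means, so milder changes are usually the safer choice.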


#### Convolution Algorithm

Convolution is the process of combining two signals to output a third signal. Convolution in CNNs combines an input matrix and a filter matrix to produce a feature map, as shown below<br>
#### Figure 1: Convolution [3]
![alt text](convalgo.png "Title")

At each position, the filter (the smaller matrix) is multiplied element-wise with the patch of the original matrix that it covers, and the products are summed to give one value in the output matrix. The rest of the output is filled in by sliding, also known as convolving, the filter across the image. A comparatively large number in the result indicates a greater likelihood that the feature associated with the filter exists at that location. <br>
<br>
The output of convolution is called a *feature map*. The feature map is a matrix in which large values represent the presence of the filter's associated feature and small or zero values represent the absence of that feature at specific locations.
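The sliding multiply-and-sum described above can be written as a minimal NumPy sketch. The function name `convolve2d_valid` and the horizontal-line filter are illustrative assumptions, not code from this activity's `conv.py`. As is usual in CNNs, the filter is applied without flipping it, and no padding is used.

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]
            # Element-wise product of the patch and the filter, then sum.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# A 5x5 image containing a horizontal line in its middle row.
line_image = np.zeros((5, 5))
line_image[2, :] = 1.0

# A hypothetical 3x3 filter that responds strongly to horizontal lines.
horizontal_filter = np.array([[-1.0, -1.0, -1.0],
                              [ 2.0,  2.0,  2.0],
                              [-1.0, -1.0, -1.0]])

line_feature_map = convolve2d_valid(line_image, horizontal_filter)
print(line_feature_map)
# [[-3. -3. -3.]
#  [ 6.  6.  6.]
#  [-3. -3. -3.]]
```

The middle row of the feature map holds the largest values because that is where the filter lines up with the horizontal line in the image.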


### Pooling

Pooling reduces the resolution of an image by representing it in a smaller matrix. In conjunction with convolution, pooling is responsible for down-sampling the outputted feature maps. The features of an image can be summarized in patches. Converting to a smaller representation helps reduce the size of the problem as subsequent layers can use smaller matrices.

#### Max Pooling vs Average Pooling

There are several methods for reducing the size of a matrix without losing the important details of the original. For feature maps, the important details are the large values, which tell us that a feature is present in that region; small or zero values indicate its absence. Max pooling and average pooling are two common methods for this task.

#### Figure 2: Common Types of Pooling [5]
![alt text](maxavgpool.png "Title") <br>


As shown above, max pooling reduces each subsection of the input (whose dimensions are known as the pool size) to its single largest value, whereas average pooling replaces the subsection with the average of its values. Between the two, max pooling is more commonly used in CNNs, as it keeps the strongest response to a feature in each region rather than diluting it with nearby small values.
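Both kinds of pooling can be compared on one small matrix. This is a minimal sketch, assuming a 2x2 pool size and a stride of 2. The `pool2d` helper and the sample values are made up for illustration and are not taken from Figure 2 or from `maxpool.py`.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Down-sample a 2D feature map with max or average pooling."""
    if mode == "max":
        reducer = np.max
    elif mode == "avg":
        reducer = np.mean
    else:
        raise ValueError("mode must be 'max' or 'avg'")
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Reduce one size-by-size window to a single value.
            pooled[i, j] = reducer(feature_map[r:r + size, c:c + size])
    return pooled

sample_map = np.array([[1, 3, 2, 4],
                       [5, 7, 6, 8],
                       [9, 2, 1, 0],
                       [3, 4, 5, 6]], dtype=float)

max_pooled = pool2d(sample_map, mode="max")
avg_pooled = pool2d(sample_map, mode="avg")
print(max_pooled)   # [[7. 8.] [9. 6.]]
print(avg_pooled)   # [[4.  5. ] [4.5 3. ]]
```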

### Convolutional Neural Networks

Convolutional Neural Networks make use of both convolution and pooling. Convolution represents the input as feature maps, while pooling down-samples the feature maps to summarize the features they contain. **In the architecture described here, a convolutional layer comes first and is followed by a pooling layer.**

#### Figure 3: The Architecture of a Convolutional Neural Network [4] <br>
<br>

![alt text](cnnarc.png "Title")


Repeated convolutions allow you to go from detecting very low-level features to high-level features. For example, using a 3x3 filter we might find short (3 pixel) sections of horizontal lines. After max pooling these, a subsequent 9x9 filter might detect the presence of longer (27 pixel) horizontal lines.
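Each convolution and pooling layer also changes the spatial size of what the next layer sees. A small sketch, using hypothetical helper names and assuming no padding unless stated, traces those sizes. With a 28x28 input it reproduces the 28 -> 26 -> 13 sizes noted in the comments of the Code section.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Width (or height) of the feature map produced by a convolution."""
    return (input_size + 2 * padding - filter_size) // stride + 1

def pool_output_size(input_size, pool_size, stride=None):
    """Width (or height) after pooling; the stride defaults to the pool size."""
    stride = pool_size if stride is None else stride
    return (input_size - pool_size) // stride + 1

input_width = 28
after_conv = conv_output_size(input_width, filter_size=3)   # 3x3 filter, stride 1
after_pool = pool_output_size(after_conv, pool_size=2)      # 2x2 pooling, stride 2
print(input_width, "->", after_conv, "->", after_pool)      # 28 -> 26 -> 13
```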

### Why CNNs?

CNNs reduce the volume of data that needs to be processed in later layers of the pipeline. They extract features that would otherwise require a fully connected deep neural network, which is larger and harder to train. Additionally, they are potentially more explainable, as we can see what a horizontal-line detection filter is doing and understand how it works. <br>
<br>
CNNs have many applications today. The most common application is image classification software, which can, for example, identify multiple faces in an image and learn the features unique to each. An example of this is facial recognition, which many of you may have encountered in Apple Photos, where images are sorted into albums by who appears in them, or in Facebook's suggested tags for photos. Lastly, CNNs are easy to parallelize using GPUs, FPGAs, or ASICs, allowing for more rapid computation than a neural network that doesn’t have such a regular, repetitive process.

### Ethical Considerations

CNNs, like all machine learning algorithms, are subject to bias in their input data. It is important to ensure that any training dataset includes a diverse set of the possible inputs. Otherwise, models can learn to prefer one group over another. *For example*, a hand dryer with a camera that is trained to recognize hands cannot be trained on images of white hands alone, and a CNN trained to recognize people needs to be trained on images of people in wheelchairs too. Machine learning models that are trained on biased or uniform training data learn to be biased models. Diversifying training data or selecting training data for fairness between particular groups can help to mitigate bias before it forms.

## Warm Up Questions
Solutions can be found at the bottom of the notebook.

#### 1) Why do we use Convolutional Neural Networks? (select all that apply)
a) The position of the object in the image to be detected can vary. <br>
b) The size of an input image can be too big for a simple neural network. <br>
c) CNNs are computationally less expensive <br>
d) CNNs are much cooler than a regular neural network :)

#### 2) What is the main purpose of the Convolution Layer? (select one)
a) To generate feature maps <br>
b) To make the image discernible <br>
c) To ease the computational cost

#### 3) Find the value of the 1st row, 2nd column of the matrix resulting from average pooling (2x2) on the below matrix:
![alt text](warm3.png "Title")



## Exercises
Solutions can be found at the bottom of the notebook.

## 1.
### a)  Let there be a sample input matrix, A, and a filter, B, shown below. Find the output feature map.
#### (You can assume the stride to be 1)

![alt text](image.png "Title")

### b)  Once you have obtained the feature map from part (a), you need to pool it to reduce the number of parameters in later layers, which shortens the training time and helps guard against overfitting. Using max pooling, find the pooled feature map. 
#### (You can assume a 2x2 window and a stride of 1)

### c) Why must the depth of the input parameter and filter be the same?

## 2.
### a) In CNNs, we process the image in different layers, instead of doing it in a single layer directly. Could you think of a possible advantage this gives CNNs over other networks (e.g. a fully connected network)? <br>
### b) How is the number of layers related to the size of an input image?

# Code


#### Make sure that all the ".py" files provided in the folder are in your Jupyter directory!
Press "Shift + Enter" to run each of the cells below. The computation should take no more than a couple of minutes.

In [15]:
# Importing images from the mnist datafile.
# Setting up the model.

import numpy as np
from conv import Conv3x3
from maxpool import MaxPool2
from softmax import Softmax
import gzip

f = gzip.open('t10k-images-idx3-ubyte.gz','r')
image_size = 28
num_images = 10000
f.read(16)  # skip the 16-byte IDX header (magic number, image count, rows, columns)
buf = f.read(image_size * image_size * num_images)
data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
test_images = data.reshape(num_images, image_size, image_size, 1)
test_images.shape

(10000, 28, 28, 1)
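MNIST image files in the IDX format begin with a header of four big-endian 32-bit integers (magic number 2051, image count, rows, columns) ahead of the pixel bytes. Instead of hard-coding the sizes, the header can be parsed directly. The `read_idx_images` helper below is a hypothetical sketch and is not part of the provided scripts. To stay self-contained it reads a tiny synthetic file built in memory rather than the real dataset.

```python
import gzip
import io
import struct

import numpy as np

def read_idx_images(fileobj):
    """Read an IDX3 image file object; return an array of shape (count, rows, cols)."""
    # Header: four big-endian unsigned 32-bit integers (16 bytes in total).
    magic, count, rows, cols = struct.unpack(">IIII", fileobj.read(16))
    if magic != 2051:
        raise ValueError("not an IDX3 image file (magic number %d)" % magic)
    pixels = np.frombuffer(fileobj.read(count * rows * cols), dtype=np.uint8)
    return pixels.reshape(count, rows, cols)

# Synthetic stand-in for a real file: two 3x3 "images" with pixel values 0..17.
header = struct.pack(">IIII", 2051, 2, 3, 3)
body = bytes(range(18))
fake_file = io.BytesIO(gzip.compress(header + body))

with gzip.open(fake_file, "rb") as demo_f:
    demo_images = read_idx_images(demo_f)
print(demo_images.shape)   # (2, 3, 3)
```

With the real data, the same function could be called on `gzip.open('t10k-images-idx3-ubyte.gz', 'rb')`.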

In [16]:
# Import labels from mnist datafile.

f = gzip.open('t10k-labels-idx1-ubyte.gz','r')
num_images = 10000
f.read(8)  # skip the 8-byte IDX header (magic number, label count)
buf = f.read(num_images)  # each label is stored as a single byte
data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)


In [17]:
# Processing Images and Labels

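labels = []
images = []
# Keep only the images labeled 0 or 1, so this is a two-class problem.
for i in range(len(data)):
    if data[i] == 1 or data[i] == 0:
        labels.append(data[i])
        # Drop the trailing channel axis: (28, 28, 1) -> (28, 28).
        images.append(np.squeeze(test_images[i], axis=2))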

In [18]:
import numpy as np
from conv import Conv3x3
from maxpool import MaxPool2
from softmax import Softmax
import gzip

# In the interest of time, we train on the first 900 of the 0/1 images
# and test on the remaining ones. Feel free to change this if you want.
train_images = images[:900]
train_labels = labels[:900]
test_images = images[900:]
test_labels = labels[900:]

conv = Conv3x3(8)                  # 28x28x1 -> 26x26x8
pool = MaxPool2()                  # 26x26x8 -> 13x13x8
softmax = Softmax(13 * 13 * 8, 10) # 13x13x8 -> 10

def forward(image, label):
  '''
  Completes a forward pass of the CNN and calculates the accuracy and
  cross-entropy loss.
  - image is a 2d numpy array
  - label is a digit
  '''
  # We transform the image from [0, 255] to [-0.5, 0.5] to make it easier
  # to work with. This is standard practice.
  out = conv.forward((image / 255) - 0.5)
  out = pool.forward(out)
  out = softmax.forward(out)

  # Calculate cross-entropy loss and accuracy. np.log() is the natural log.
  loss = -np.log(out[int(label)])
  acc = 1 if np.argmax(out) == label else 0

  return out, loss, acc

def train(im, label, lr=.005):
  '''
  Completes a full training step on the given image and label.
  Returns the cross-entropy loss and accuracy.
  - im is a 2d numpy array
  - label is a digit
  - lr is the learning rate
  '''
  # Forward
  out, loss, acc = forward(im, label)

  # Calculate initial gradient
  gradient = np.zeros(10)
  gradient[int(label)] = -1 / out[int(label)]

  # Backprop
  gradient = softmax.backprop(gradient, lr)
  gradient = pool.backprop(gradient)
  gradient = conv.backprop(gradient, lr)

  return loss, acc

print('MNIST CNN initialized!')

# Train the CNN for 3 epochs
for epoch in range(3):
  print('--- Epoch %d ---' % (epoch + 1))

  # Shuffle the training data
#   permutation = np.random.permutation(len(train_images))
#   train_images = train_images[permutation]
#   train_labels = train_labels[permutation]

  # Train!
  loss = 0
  num_correct = 0
  for i, (im, label) in enumerate(zip(train_images, train_labels)):
    if i % 100 == 99:
      print(
        '[Step %d] Past 100 steps: Average Loss %.3f | Accuracy: %d%%' %
        (i + 1, loss / 100, num_correct)
      )
      loss = 0
      num_correct = 0

    l, acc = train(im, label)
    loss += l
    num_correct += acc

# Test the CNN
print('\n--- Testing the CNN ---')
loss = 0
num_correct = 0
for im, label in zip(test_images, test_labels):
  _, l, acc = forward(im, label)
  loss += l
  num_correct += acc

num_tests = len(test_images)
print('Test Loss:', loss / num_tests)
print('Test Accuracy:', num_correct / num_tests)


MNIST CNN initialized!
--- Epoch 1 ---
[Step 100] Past 100 steps: Average Loss 0.533 | Accuracy: 86%
[Step 200] Past 100 steps: Average Loss 0.066 | Accuracy: 98%
[Step 300] Past 100 steps: Average Loss 0.015 | Accuracy: 100%
[Step 400] Past 100 steps: Average Loss 0.012 | Accuracy: 100%
[Step 500] Past 100 steps: Average Loss 0.036 | Accuracy: 99%
[Step 600] Past 100 steps: Average Loss 0.009 | Accuracy: 100%
[Step 700] Past 100 steps: Average Loss 0.011 | Accuracy: 100%
[Step 800] Past 100 steps: Average Loss 0.009 | Accuracy: 100%
[Step 900] Past 100 steps: Average Loss 0.006 | Accuracy: 100%
--- Epoch 2 ---
[Step 100] Past 100 steps: Average Loss 0.002 | Accuracy: 99%
[Step 200] Past 100 steps: Average Loss 0.012 | Accuracy: 100%
[Step 300] Past 100 steps: Average Loss 0.002 | Accuracy: 100%
[Step 400] Past 100 steps: Average Loss 0.002 | Accuracy: 100%
[Step 500] Past 100 steps: Average Loss 0.012 | Accuracy: 100%
[Step 600] Past 100 steps: Average Loss 0.003 | Accuracy: 100%
[Ste

#### Looks like we have some pretty high testing accuracies! - nice work :) <br>
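The `forward` function above computes the loss as the negative log of the probability the network assigns to the correct label. A minimal sketch of how a softmax output and that cross-entropy loss fit together follows. The `Softmax` class in `softmax.py` is not shown in this notebook, so the `softmax` function and the scores below are assumptions for illustration only.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in np.exp without changing the result.
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

def cross_entropy(probs, label):
    """Negative natural log of the probability given to the correct class."""
    return -np.log(probs[label])

demo_scores = np.array([2.0, 1.0, 0.1])   # hypothetical raw scores for 3 classes
demo_probs = softmax(demo_scores)
demo_loss = cross_entropy(demo_probs, label=0)

# Gradient of the loss with respect to the probabilities:
# zero everywhere except at the correct class, where it is -1 / p.
demo_gradient = np.zeros_like(demo_probs)
demo_gradient[0] = -1 / demo_probs[0]

print(demo_probs, demo_loss)
```

The loss is small when the correct class receives a probability close to 1 and grows as that probability shrinks, which is why the average loss in the training log falls as the accuracy rises.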

## Coding Question 
After running the code above, **how many images are in the training set?** <br>
*Hint: Look at the input databases!*



## Conclusion
Through this activity, we introduced and formulated the topic of Convolutional Neural Networks. By understanding concepts such as convolution and pooling, as well as the architecture of a CNN, we now have a better idea of how CNNs take in input images, assign importance to various aspects of each image, and differentiate one image from another, all of which leads to a better understanding of the content of an image. <br>
<br>
In the real world, CNNs are heavily used by companies with large amounts of data, which gives them an advantage over their competitors. In general, the more training data one can feed into a CNN, the more robust the CNN becomes when it comes time for use. Facebook and Instagram can use the photos of the billion users they have, Google can use its search data, and Amazon can use data from the millions of transactions that are processed each day.<br>
<br>
Now that you have understood Convolutional Neural Networks, you now know the magic behind it all!

## Activity Solutions

###  Solutions to Warm-Up Questions
1) Options A & B <br>
2) Option A <br>
3) Value = 5
 

### Solutions to Exercises
#### Problem 1

a) The feature map will be the below matrix:<br>
![alt text](sol1a.png "Title") <br>
As we slide the filter over the input matrix, we multiply each element of the filter by the corresponding element of the input matrix and sum the products, which results in a 3x3 matrix. 
<br>
<br>
b) The pooled matrix will be the 2x2 matrix shown below: <br>
![alt text](sol1b.png "Title")
<br>
c) Depth corresponds to the channels of the input images. The filter is multiplied element-wise with a patch of the input across every channel, so its depth must equal the input's depth to take all the characteristics of the input image into account.


#### Problem 2

a) Since CNNs have filters that extract features, which are then pooled, the number of weights that must be updated is significantly lower, and the computation is thus less expensive when it is divided into layers.

b) As the image gets larger, we use more layers to make the process less computationally expensive. A network with many such layers is also referred to as a Deep Convolutional Neural Network.

### Solution to Coding Question
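The training set contains 900 images (`train_images = images[:900]`), all of them digits 0 or 1 taken from the MNIST file loaded above. The only two numbers being compared are 0 and 1. As both of those numbers are distinct in shape, the model does pretty well.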

## References

[1] A. Deshpande, “A Beginner's Guide To Understanding Convolutional Neural Networks,” 
*A Beginner's Guide To Understanding Convolutional Neural Networks – Adit Deshpande – Engineering at Forward | UCLA CS '19.* [Online]. Available: https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/. [Accessed: 5-Dec-2019].
<br>
<br>
[2] 	K. Vu, “Beginner's Guide: Image Recognition And Deep Learning - DZone AI,” 
dzone.com, 29-Nov-2018. [Online]. Available: https://dzone.com/articles/beginners-guide-image-recognition-and-deeplearnin. [Accessed: 5-Dec-2019]. <br>
<br>
[3]	M. Basavarajaiah, “6 basic things to know about Convolution,” *Medium*, 02-Apr-2019. 
[Online]. Available: https://medium.com/@bdhuma/6-basic-things-to-know-about-convolution-daef5e1bc411. [Accessed: 7-Dec-2019].
<br>
<br>
[4]	S. Saha, “A Comprehensive Guide to Convolutional Neural Networks - the ELI5 way,” 
*Medium*, 17-Dec-2018. [Online]. Available: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53. [Accessed: 11-Dec-2019].
<br>
<br>
[5]	Yani, Muhamad & Irawan, S, & S.T., M.T.. (2019). “Application of Transfer Learning 
Using Convolutional Neural Network Method for Early Detection of Terry’s Nail.” 
*Journal of Physics: Conference Series.* 1201.012052. 10.1088/1742-6596/1201/1/012052.