# DSCI 572 Lecture 7

convolutions and CNNs

Terminology:

- convolution
- filter
- convolutional neural network / convolutional network / convolutional net / convnet
- fully connected
- pooling
- stride
- object recognition, object detection, object localization
- AlexNet

## Convolutions

- What is a convolution?
- What is a linear operator/transformation?
  - in the discrete world, anything that can be represented as a matrix multiplication
  - we have linear operators in the continuous world (like sum, derivative; hence the trick in the last lecture), but that is out of scope here
  - finite differencing is a linear operator. PCA is linear. 
- 1-D convolutions
- 2-D and higher-D convolutions
- boundary conditions: depending on how the edges are handled, the output might be slightly bigger than the input, slightly smaller, or the same size
  - these are the `mode` options (`full`, `valid`, `same`) in [scipy.signal.convolve2d](https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.signal.convolve2d.html)
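To make the points above concrete, here is a minimal NumPy sketch (the toy signal and the finite-difference filter are illustrative choices, not from the lab). It shows how the three boundary modes change the output length, and that the "valid" convolution is the same thing as multiplying by a matrix whose rows are shifted copies of the flipped filter.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])  # toy signal (illustrative)
w = np.array([1.0, -1.0])                 # finite-difference filter

# The three boundary options change the output length.
full = np.convolve(x, w, mode="full")     # length len(x) + len(w) - 1 = 6
same = np.convolve(x, w, mode="same")     # length len(x) = 5
valid = np.convolve(x, w, mode="valid")   # length len(x) - len(w) + 1 = 4

# The same "valid" convolution written as a matrix multiplication:
# each row holds the flipped filter, shifted one step to the right.
n, m = len(x), len(w)
A = np.zeros((n - m + 1, n))
for i in range(n - m + 1):
    A[i, i:i + m] = w[::-1]

print(valid)   # [1. 2. 3. 4.]
print(A @ x)   # [1. 2. 3. 4.]
```

Because the operation is a matrix multiplication, it is a linear operator; with this particular filter it reproduces finite differencing (`np.diff(x)`).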

- (optional) all of this extends to continuous scenarios, where convolution is an integral instead of a sum
- convolution is often written as $x \ast y$ 
- convolution is commutative: $x \ast y = y \ast x$
  - Nonetheless, we often have a "signal" and a "filter" and they have different interpretations
  - The filter is often small and has an interpretation like "highpass" or "lowpass"
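A small sketch of commutativity and of the "lowpass" / "highpass" interpretation, assuming a 4-point moving average as the lowpass filter and a 2-point difference as the highpass filter (both are illustrative choices). A constant signal is the lowest possible frequency and an alternating signal is the highest, so they make the behaviour easy to see.

```python
import numpy as np

constant = np.ones(12)                     # slowest-varying signal
alternating = (-1.0) ** np.arange(12)      # fastest-varying signal: 1, -1, 1, ...

lowpass = np.ones(4) / 4                   # moving average (illustrative)
highpass = np.array([1.0, -1.0])           # differencing (illustrative)

# Commutativity: swapping signal and filter gives the same result.
xy = np.convolve(alternating, lowpass)
yx = np.convolve(lowpass, alternating)
print(np.allclose(xy, yx))                 # True

# Lowpass keeps the constant signal and removes the alternating one.
low_const = np.convolve(constant, lowpass, mode="valid")     # all ones
low_alt = np.convolve(alternating, lowpass, mode="valid")    # all zeros

# Highpass does the opposite.
high_const = np.convolve(constant, highpass, mode="valid")   # all zeros
high_alt = np.convolve(alternating, highpass, mode="valid")  # +2 / -2
```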

- Note (for completeness, not required): there are fast implementations of convolution using the FFT (fast Fourier transform). For an $n\times n$ image and an $m\times m$ filter with $m \le n$, direct convolution takes $O(n^2m^2)$ time, while the FFT approach takes roughly $O(n^2\log n)$. For sufficiently big images and filters this is a big win, BUT there's a catch: the FFT computes a convolution with periodic boundary conditions, so other boundary conditions require zero-padding the inputs first.
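The periodic-boundary catch can be seen in 1-D with a short sketch using `np.fft` (random toy data; the sizes are arbitrary). Multiplying the transforms without padding gives a circular convolution, where the tail of the ordinary result wraps around onto the start. Zero-padding both inputs to length `len(x) + len(w) - 1` is one standard way to recover the ordinary "full" convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)   # toy signal
w = rng.normal(size=5)    # toy filter

direct = np.convolve(x, w, mode="full")   # ordinary convolution, length 68

# FFT with no extra padding: circular (periodic) convolution of length 64.
n = len(x)
circular = np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(w, n=n), n=n)

# The circular result is the direct one with its tail wrapped onto the start.
wrapped = direct[:n].copy()
wrapped[:len(w) - 1] += direct[n:]
print(np.allclose(circular, wrapped))     # True

# Zero-pad both inputs to the full output length to avoid the wrap-around.
L = len(x) + len(w) - 1
linear = np.fft.irfft(np.fft.rfft(x, n=L) * np.fft.rfft(w, n=L), n=L)
print(np.allclose(linear, direct))        # True
```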

- (Jump to lab and look at first convolution example, discuss.)

- Convolutions appear all over the field of _signal processing_, which is traditionally a discipline within electrical engineering. But they also show up in math, physics, CS, etc. A lot of communications theory is based on this stuff.



## What does this have to do with deep learning?

- Imagine a neural net with images as inputs. A $1000\times 1000$ image has 1 million features.
- Now say the next layer is 10% or even 1% that size. Even at 1% ($10^4$ units), a fully connected layer needs a weight matrix with $10^6\times10^4=10^{10}$ entries. That is not going to happen.
- Key insight: things happen "locally" in images. The top-left matters and the bottom-right matters, but they don't necessarily need to interact right away.
  - so we do some "local processing" on the different parts of the image, and then "report back" and start merging the information when we've reduced the dimension
  - this is the promise/dream of "deep learning": hierarchical abstractions like edges, curves, objects, higher and higher level "understanding"
- Key idea: use layers that are not fully connected (this was called "Dense" in Keras). Instead, have units in layer 2 that only get input from some _nearby units_ (pixels) in layer 1. 
- The above notion, with the same weights reused at every location, is precisely a convolution. Thus people talk about convolutions, but keep in mind it's just a not-fully-connected neural network. This means everything from before (gradients, tricks) carries over nicely. 
- But for computational reasons we don't form those giant matrices full of zeros! We just do convolutions. 
- The parameters (weights) are now the filters themselves. So we can interpret it as "learning the filters". It's all the same stuff. 
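The "giant matrix full of zeros" can be built explicitly for a tiny image, as a sketch under a few assumptions: a $6\times 6$ image, a $3\times 3$ filter, no padding, and no filter flip (deep learning libraries typically slide the filter without flipping it). The helper names `sliding_filter` and `as_dense_matrix` are hypothetical, made up for this illustration.

```python
import numpy as np

def sliding_filter(image, filt):
    """Apply filt at every position where it fits entirely inside image."""
    h, w = image.shape
    fh, fw = filt.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filt)
    return out

def as_dense_matrix(filt, image_shape):
    """Weight matrix of the equivalent not-fully-connected layer."""
    h, w = image_shape
    fh, fw = filt.shape
    oh, ow = h - fh + 1, w - fw + 1
    W = np.zeros((oh * ow, h * w))
    for i in range(oh):
        for j in range(ow):
            patch = np.zeros((h, w))
            patch[i:i + fh, j:j + fw] = filt   # only nearby pixels are connected
            W[i * ow + j] = patch.ravel()
    return W

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
filt = rng.normal(size=(3, 3))

W = as_dense_matrix(filt, image.shape)
out_matrix = (W @ image.ravel()).reshape(4, 4)
out_sliding = sliding_filter(image, filt)

print(np.allclose(out_matrix, out_sliding))   # True
print(W.shape, np.count_nonzero(W), filt.size)
```

The matrix has $16\times 36 = 576$ entries, of which only 144 are nonzero, and those 144 are copies of just 9 learnable weights. The sliding version never forms the matrix at all.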

## How it all works

- Let's say we start with an $m\times m$ image.
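- Let's say we have $k^{(1)}$ filters in the first layer, and each is of size $3\times 3$. 
- Then we now have $k^{(1)}$ of these (approximately) $m\times m$ images, one per filter
- We then _sum_ over these $k^{(1)}$ images to get back to one image: each filter in the next layer has one $3\times 3$ slice per input image, and the filtered images are added together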
  - You can think of this as a 3-d convolution, but it's not necessarily a helpful way of thinking
- Next, we often apply some sort of _pooling_ which _decreases the size_ of the representation
  - this is a downsampling operation 
  - A common approach is max pooling, which takes the maximum over, say, a region of $2\times 2$ pixels
  - The intuition has to do with invariances (don't care if it's shifted a bit, just want maximum signal)
  - There are other forms of pooling, like average pooling
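The steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions: random filters instead of learned ones, no padding, no activation function, no filter flip, and hypothetical helper names (`filter2d`, `max_pool`). It is not how Keras implements these layers.

```python
import numpy as np

def filter2d(image, filt):
    """Slide filt over image (no padding, no flip) and return the result."""
    h, w = image.shape
    fh, fw = filt.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filt)
    return out

def max_pool(image, size=2):
    """Max over non-overlapping size x size blocks (ragged edges dropped)."""
    h, w = image.shape
    h, w = h - h % size, w - w % size
    blocks = image[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(10, 10))        # m = 10
filters = rng.normal(size=(4, 3, 3))     # k = 4 filters of size 3x3

# First layer: one (approximately m x m) image per filter.
feature_maps = np.stack([filter2d(image, f) for f in filters])   # (4, 8, 8)

# Next layer: one 3x3 slice per input image, summed to get back to one image.
next_filter = rng.normal(size=(4, 3, 3))
summed = sum(filter2d(fm, f) for fm, f in zip(feature_maps, next_filter))  # (6, 6)

# Pooling: downsample by taking the max over 2x2 regions.
pooled = max_pool(summed)                                         # (3, 3)

print(feature_maps.shape, summed.shape, pooled.shape)
print(max_pool(np.arange(16).reshape(4, 4)))   # [[ 5  7] [13 15]]
```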


## Hyperparameters

There are a lot of hyperparameters. The big ones are 

- _number of filters at each layer_ and _filter size_, which replace the layer size in regular fully-connected networks. Then there are pooling hyperparameters. 
- There's also the _stride_ of the filters. For a regular convolution this is 1, but you may wish to slide the window more than 1 pixel each time (for example by the filter size for non-overlapping windows)
- We retain all the usual hyperparameters of activation function, regularization, initialization, etc.
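Filter size and stride together determine how big each layer's output is. A minimal sketch of the bookkeeping, assuming no padding and a 1-D view of one image dimension (the function names `conv_output_size` and `window_starts` are hypothetical):

```python
def conv_output_size(n, filter_size, stride=1):
    """Number of positions where the filter fits when sliding by `stride`."""
    return (n - filter_size) // stride + 1

def window_starts(n, filter_size, stride=1):
    """Left edge of every window the filter visits."""
    return list(range(0, n - filter_size + 1, stride))

print(conv_output_size(10, 3, stride=1))   # 8  (regular convolution)
print(conv_output_size(10, 3, stride=3))   # 3  (non-overlapping windows)
print(conv_output_size(10, 2, stride=2))   # 5  (e.g. 2x2 pooling along one axis)
print(window_starts(10, 3, stride=3))      # [0, 3, 6]
```

With stride equal to the filter size the windows do not overlap, which is the usual setup for pooling.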

## Visualizing filters

- This is something people love to do
- We often see something similar to [Gabor filters](https://en.wikipedia.org/wiki/Gabor_filter), which also show up in models of the human visual system
- See AlexNet paper linked below, Figure 3.
- The low layers are often similar in a lot of models
  - Maybe we only need to retrain later layers when transferring to different tasks?

## Keras example

Take a look at lab code, discuss

## Famous architectures

- Historically: Neocognitron, LeNet, HMAX
- More recently:
  - [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) (named after Alex Krizhevsky) set a standard in 2012.
  - [Inception / GoogLeNet](https://arxiv.org/pdf/1409.4842.pdf) (pun on Google & LeNet), 2014, 22 layers 

### Data sets

- There tend to be trendy datasets in object recognition that evolve over time as the field proceeds. 
- For example see this [compilation of MNIST results](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html)
- This is good because people can compare with each other
- But also bad because these data sets may not be important and we may overfit on them (optimization bias as a whole field of researchers)
  - there have been [disturbing findings](http://people.csail.mit.edu/torralba/research/bias/) about the limited transferability of models/insights to new data sets


## Things people do

- object detection
- object recognition
- image segmentation
- art (see links on course README)
- more recently, [GANs](https://en.wikipedia.org/wiki/Generative_adversarial_networks) for realistic image generation, see e.g. figure 2 of [this paper](http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf)

## More info / another look

See CPSC 340 slides [Neural networks](https://www.cs.ubc.ca/~schmidtm/Courses/340-F16/L29.pdf), [Deep learning](https://www.cs.ubc.ca/~schmidtm/Courses/340-F16/L30.pdf), [CNNs](https://www.cs.ubc.ca/~schmidtm/Courses/340-F16/L31.pdf), [More CNNs](https://www.cs.ubc.ca/~schmidtm/Courses/340-F16/L32.pdf)