# Deep Learning: Image Recognition
**Instructor:** Adam Geitgey

Thanks to deep learning, image recognition systems have improved and are now used for everything from searching photo libraries to generating text-based descriptions of photographs. In this course, learn how to build a deep neural network that can recognize objects in photographs. Find out how to adjust state-of-the-art deep neural networks to recognize new objects, without the need to retrain the network. Explore cloud-based image recognition APIs that you can use as an alternative to building your own systems. Learn the steps involved to start building and deploying your own image recognition system.

#### Build cutting-edge image recognition systems
* **Image recognition** is the ability for computers to look at a photograph and understand what's in the photograph
* In the last few years, researchers have made major breakthroughs in image recognition thanks to neural networks
* **Keras** is a high-level library for building neural networks in Python with only a few lines of code; built on top of either TensorFlow or Theano
* One of the most important things to configure in a neural network is activation functions
    * Before values flow from one layer to the next they pass through an activation function
    * **Activation functions** decide which inputs from the previous layer are important enough to feed to the next layer
* The final step of defining a neural network is to compile it by calling `model.compile()`; this tells Keras that we're done building the model and that we actually want to build it out in memory
* The optimizer algorithm is used to train the neural network
* The loss function is how the training process measures how right or how wrong your NN's predictions are
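
To make the roles of the activation function and the loss function concrete, here is a minimal NumPy sketch. It assumes `relu` as the activation and categorical cross-entropy as the loss; the notes above do not name a specific loss, so that choice and all example values are illustrative assumptions.

```python
import numpy as np

def relu(x):
    """ReLU activation: pass positive values through, replace negatives with 0."""
    return np.maximum(0.0, x)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one prediction: small when the true class gets a high probability."""
    y_pred = np.clip(y_pred, eps, 1.0)          # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)))

# Hypothetical values leaving one layer; the negative one is zeroed out
layer_output = np.array([-2.0, 0.5, 3.0])
activated = relu(layer_output)

# Hypothetical predictions for a sample whose correct class is index 1
y_true = np.array([0.0, 1.0, 0.0])
good_guess = np.array([0.1, 0.8, 0.1])
bad_guess = np.array([0.7, 0.2, 0.1])

print(activated)
print(categorical_crossentropy(y_true, good_guess))   # lower loss
print(categorical_crossentropy(y_true, bad_guess))    # higher loss
```

During training, the optimizer adjusts the weights to push this loss value down.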

#### Using Images as Input to a NN
* Bright points are closer to 255 and dark points are closer to 0
* We can think of a color image as a 3D array that is three layers deep (one layer each for the red, green, and blue channels); to feed this image into a NN, we need the NN to have 1 input node for every number in this 3D array (i.e., one per color value of each pixel)
* These numbers add up very quickly
* For a small **256 x 256 pixel image**, (by modern terms, a pretty tiny image):
    * We need 256 x 256 x 3 = **196,608 input nodes**
    * And that's just for the input layer
    * The number of nodes in the entire neural network will quickly grow into the millions
    * That's why using NNs for image processing is so computationally intensive
        * **Because of this, image recognition systems tend to use small image sizes**
        * It's very common to build image recognition systems that work with images that are **between 128 and 512 pixels wide.**
        * Any larger than that, and it gets too slow and requires too much memory
        * When working with larger images, we usually just scale them down to those smaller sizes before feeding them into the neural network 
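
The arithmetic above can be checked with a few lines of Python. The helper name `input_node_count` is made up for this sketch, and it assumes three color channels per pixel.

```python
def input_node_count(width, height, channels=3):
    """Number of input nodes needed: one per channel value of every pixel."""
    return width * height * channels

# Compare the image widths mentioned above (square images assumed)
for size in (128, 256, 512):
    print(f"{size} x {size} RGB image -> {input_node_count(size, size):,} input nodes")
```

Doubling the width and height multiplies the input size by four, which is why scaling images down pays off so quickly.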

### Recognizing Image Contents with a Neural Network
* During the **inference phase** the neural network will give us a **prediction**; this prediction will be in the form of a probability
* We can also build a single neural network that has more than one output

<img src='data/nn1.png' width="600" height="300" align="center"/>

* You can roughly think of the top (leftmost) layers as looking for simple patterns like lines and sharp edges, while the lower layers use the signals from the higher layers to look for more and more complex shapes and patterns
* With all the layers working together, the model can identify very complex objects
* That means that adding more layers to a NN tends to give it the capacity to learn more complex patterns and shapes; this is where the term **deep learning** originally came from
* **Deep learning** is just the idea that making models deeper by adding more capacity to them lets us recognize more complex patterns in data

### Adding convolution for translational invariance

* If we only train the NN with pictures of numbers that are perfectly centered, the NN will get confused if it sees anything else (for example, an uncentered "8")
* For example:

<img src='data/nn2.png' width="600" height="300" align="center"/>

<img src='data/nn3.png' width="600" height="300" align="center"/>

* The neural network won't be able to make a good prediction on the uncentered 8 from the lower example above (where the model is only trained on centered 8s). 
* But, the 8 could appear anywhere in the image; it could just as easily appear at the bottom, like this:

<img src='data/nn4.png' width="600" height="300" align="center"/>

* **We need to improve our neural network so that it can recognize objects in any position in the image.**
    * This is called **Translation invariance.**
    
#### Translation Invariance and Convolutional Layers
* **Translation invariance** is the idea that a machine learning model can recognize an object no matter whether it is moved (or *translated*) in the image.
* The solution is to add a new type of layer to our neural network: a **convolutional layer**
* Unlike a normal Dense layer, where every node is connected to every node in the previous layer, a convolutional layer breaks apart the image in a special way so that it can recognize the same object in different positions
* We do this by passing a small window (shown in orange below) over the image. 
* Each time it lands somewhere, we grab a new image tile; we repeat this until we've covered the entire image

<img src='data/nn5.png' width="600" height="300" align="center"/>

* Next, we pass each image tile through the same NN layer(s). Each tile will be processed the same way and will save a value each time
* In other words, we're turning the image into an array, where each entry in the array represents whether or not the neural network thinks a certain pattern appears at that part of the image
* Next, we'll repeat the exact process again, but this time we'll use a different set of weights on the nodes in our NN layer
* This will create another feature map that tells us whether or not a certain pattern appears in the image
* But because we're using different weights, it will be looking for a different pattern than the first time
* We can repeat this process several times until we have several layers in our new array 

<img src='data/nn6.png' width="600" height="300" align="center"/>

<img src='data/nn7.png' width="600" height="300" align="center"/>

* This turns our original array into a 3D array
* Each element in the array represents whether a certain pattern occurs at that location
* But because we are checking each tile of the original image, it doesn't matter where in the image a pattern occurs; we can find it anywhere
* This **3D array is what we'll feed into the next layer of the neural network.**
* It will use this information to determine which patterns are most important in determining the final output
* Adding a convolutional layer makes it possible for our neural network to be able to find the pattern, no matter where it appears in an image
* **Normally, we'll have several convolutional layers** that repeat the above process multiple times. 
* **The rough idea is that we keep squishing down the image with each convolutional layer while still capturing the most important information from it.** By the time we reach the output layer, the neural network will have been able to identify whether or not the object appeared
* Convolutional neural networks are the standard approach to building image recognition systems
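
The sliding-window process described above can be sketched in plain NumPy. This is a simplified illustration, not how Keras implements it: the "layer" applied to each tile is a single set of weights with no bias or activation, and the vertical-line pattern, image size, and helper names are all assumptions made for the example.

```python
import numpy as np

def feature_map(image, kernel):
    """Slide `kernel` over `image` one pixel at a time (no padding) and
    record how strongly each tile matches the kernel's pattern."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            tile = image[row:row + kh, col:col + kw]
            out[row, col] = np.sum(tile * kernel)   # same weights for every tile
    return out

# Hypothetical pattern to look for: a vertical line, 3 pixels tall
kernel = np.array([[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]], dtype=float)

def image_with_line(top, left, size=8):
    """Blank image containing one vertical line inside the 3x3 tile at (top, left)."""
    img = np.zeros((size, size))
    img[top:top + 3, left + 1] = 1.0
    return img

# The same weights find the line wherever it appears in the image
peaks = {}
for top, left in [(0, 0), (4, 3)]:
    fmap = feature_map(image_with_line(top, left), kernel)
    row, col = np.unravel_index(np.argmax(fmap), fmap.shape)
    peaks[(top, left)] = (int(row), int(col))
    print(f"line tile at {(top, left)} -> strongest match at {peaks[(top, left)]}")

# A second set of weights gives a second feature map; stacking them yields a 3D array
horizontal_kernel = kernel.T
maps = [feature_map(image_with_line(4, 3), k) for k in (kernel, horizontal_kernel)]
stacked = np.stack(maps, axis=-1)
print(stacked.shape)
```

The location of the strongest response moves with the line, but the weights never change, which is the property that gives convolutional layers their translation invariance.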

## 3. Designing a Deep Neural Network for Image Recognition

### Designing a neural network architecture for image recognition
* Before we start coding our image recognition NN, let's sketch out how a basic neural network works
* **A basic neural network comprised of all dense, or fully-connected, layers doesn't work efficiently for images because objects can appear in lots of different places in an image.**
* The solution is to add one or more convolution layers, which help us detect patterns no matter where they appear in our image
* **It can be very effective to place two or more convolutional layers in a row** so in our example we'll add them in pairs
* The convolutional layers are looking for patterns in our image and recording whether or not they found those patterns in each part of our image; but we don't usually need to know *where* in an image a pattern was found down to the specific pixel; it's good enough to know the rough location where it was found. To solve this problem we can use a technique called **max pooling**

#### Max Pooling

<img src='data/nn8.png' width="600" height="300" align="center"/>

* We could pass the above information (regarding whether or not a pixel corresponds to a cloud) directly to the rest of our neural network, but if we can reduce the amount of information that we pass to the next layer, it will make the neural network's job much easier (and faster)
* The idea of **max pooling** is to down sample the data by only passing on the most important bits

<img src='data/nn9.png' width="600" height="300" align="center"/>

* The idea above is that, by capturing the most important data (most extreme values), we'll get nearly the same result, but much more efficiently.
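
As a rough illustration of the idea, the sketch below applies 2 x 2 max pooling to a small feature map with NumPy. The input values and the helper name `max_pool_2x2` are made up for the example.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep only the largest value in each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2                 # drop an odd trailing row/column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# Hypothetical 4x4 feature map of pattern-match scores
fmap = np.array([[0.0, 0.1, 0.9, 0.2],
                 [0.3, 0.2, 0.1, 0.4],
                 [0.0, 0.0, 0.5, 0.1],
                 [0.7, 0.1, 0.2, 0.3]])

pooled = max_pool_2x2(fmap)
print(pooled)          # the 4x4 input becomes a 2x2 output
```

The output holds a quarter as many values as the input, yet the strongest response from each region is still there.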

#### Dropout
* **Dropout** is a technique to make the NN more robust and prevent overfitting
* The idea is that we add a dropout layer between other layers that will randomly throw away some of the data passing through by cutting some of the connections in the neural network

<img src='data/nn10.png' width="600" height="300" align="center"/>

* By randomly cutting connections with each training image, the neural network is forced to try harder to learn multiple ways to represent the same ideas (rather than *memorize* an image).
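
A minimal NumPy sketch of what a dropout layer does during training follows. It assumes the common "inverted dropout" variant, which rescales the surviving values so their expected total stays the same; that detail, the function name, and the example rate are assumptions not taken from the notes above.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Training-time dropout: zero a random fraction `rate` of the values and
    scale the survivors by 1 / (1 - rate) so the expected total is unchanged."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(seed=0)
activations = np.ones(10)               # hypothetical values flowing between layers

# A different random subset is dropped on every call (i.e., every training step)
print(dropout(activations, rate=0.25, rng=rng))
print(dropout(activations, rate=0.25, rng=rng))
```

Dropout is only applied while training; at prediction time all values pass through unchanged.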

<img src='data/nn11.png' width="600" height="300" align="center"/>

* If we want to make our network more powerful and able to recognize more complex images, we can add more layers to it
* But, instead of just adding layers randomly, we'll add more copies of our convolutional block.
* When all these layers are working together, we'll be able to detect complex objects 

<img src='data/nn12.png' width="600" height="300" align="center"/>

* This is a very typical design for an image recognition neural network, but it's also one of the most basic
* The latest designs involve branching pathways, shortcuts between groups of layers, and all sorts of other tricks, but they all build on these same basic ideas.

### Exploring the CIFAR-10 Data Set
* [See here](https://www.cs.toronto.edu/~kriz/cifar.html) for more detail on the CIFAR-10 dataset (and AlexNet)

#### Exploring your dataset
   * Always look through the data by hand
   * Check for obvious errors
   * Verify that the data makes sense
   
### Loading an image dataset
* The function `cifar10.load_data()` returns **four different arrays**
    * `x_train`
    * `y_train`
    * `x_test`
    * `y_test`
* **`(x_train, y_train), (x_test, y_test) = cifar10.load_data()`**
* **NNs work best when the data are floats between zero and one**:

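```
from keras.datasets import cifar10

# Load the data set
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Normalize data set to 0-to-1 range
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
```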

* cifar10 provides the labels for each class as values from 0 to 9, **but since we are creating a NN with 10 outputs, we need a separate expected value for each of those outputs. So we need to convert each label from a single number into an array with 10 elements.** In that array, one element should be set to one and the rest set to zero. 
* **This is something you'll almost always need to do with your training data, so Keras provides a helper function: `keras.utils.to_categorical()`**
    * To use this function, you just pass in your array with the labels (which in our case is `y_train`) along with the number of classes it has (which in our case is `10`)
    * `y_train = keras.utils.to_categorical(y_train, 10)`
    * `y_test = keras.utils.to_categorical(y_test, 10)`
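
To show what this conversion produces, here is a small NumPy sketch of one-hot encoding. It is an illustration of the idea rather than the Keras implementation; the helper name `one_hot` and the sample labels are made up.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 at the label's index."""
    labels = np.asarray(labels).reshape(-1)          # accept shape (n,) or (n, 1)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Hypothetical labels for three images, one class index (0-9) per image
sample_labels = np.array([[3], [0], [9]])
encoded_labels = one_hot(sample_labels, 10)
print(encoded_labels)
```

Each row has ten elements, matching the ten output nodes of the network, with a one in the position of the correct class.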

#### Dense Layers
* `relu` is the standard choice for activation function when working with images because it works well and is computationally efficient
* We'll need one node in the output layer for *each* object we want to detect
* **When doing classification with more than one kind of object, the output layer will almost always use a `softmax` activation function.**
    * The **softmax** activation function is a special function that $\star$ **makes sure all the output values from this layer add up to exactly one** $\star$
* When we're building a neural network and adding layers to it, it's helpful to print out a list of the layers in the neural network so far; we can do this by calling:
    * `model.summary()`
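
The softmax behavior described above can be sketched in a few lines of NumPy. The raw scores are made-up values standing in for an output layer's results before the activation is applied.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into positive values that sum to 1."""
    shifted = scores - np.max(scores)     # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

raw_scores = np.array([2.0, 1.0, 0.1])    # hypothetical output-layer values
probabilities = softmax(raw_scores)
print(probabilities, probabilities.sum())
```

Because the outputs sum to one, each value can be read as the probability the network assigns to that class.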
    
#### Convolution layers
* **To be able to recognize images efficiently, we'll add convolutional layers before our densely connected layers**
* **Note that Keras provides convolutional layers for different kinds of data, including 1D and 2D**
    * Since we're working with images, we'll want to add the Conv2D layer
    * For some data like sound waves, you can use Conv1D (but typically you'll use Conv2D)
* Parameters:
    * The first parameter is how many different filters should be in the layer
        * Each filter will be capable of detecting one pattern in the image (we'll start with 32, a power of 2)
    * Next, we need to pass in the size of the window that we'll use when creating image tiles from each image
        * By passing in the tuple `(3,3)`, we are selecting a 3 pixel x 3 pixel window
        * This window slides across the original image to create 3 x 3 tiles; when we do that, we have to decide what to do with the edges of the image. A 3 x 3 window can't be centered on the outermost pixels, so without padding the output ends up slightly smaller than the input. We can either accept that loss or we can add padding to the image. **Padding is just extra zeros added to the edge(s) of the image to make the math work out, and also to avoid losing info from the edges.**
    * To add extra padding that causes the image to retain its original size: **`padding="same"`**
    * Just like a normal Dense layer, convolutional layers also need an activation function and we almost always use the `relu` activation function because of its efficiency
* `model.add(Conv2D(32, (3, 3), padding="same", activation="relu", input_shape=(32, 32, 3)))`
* **Note:** Whenever we transition between convolutional layers and dense layers, we need to tell Keras that we're no longer working with 2D data
    * **To do that, we need to create a `Flatten()` layer**
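
The effect of the padding choice on output size can be worked out by hand. The sketch below assumes the window moves one pixel at a time (the Keras default stride); the helper name `conv_output_size` is made up for the example.

```python
import numpy as np

def conv_output_size(input_size, window, padding):
    """Output width/height for a window that moves one pixel at a time."""
    if padding == "same":
        return input_size                 # zeros are added so the size is kept
    return input_size - window + 1        # "valid": no padding, so the edges shrink

same_size = conv_output_size(32, 3, "same")
valid_size = conv_output_size(32, 3, "valid")
print(same_size, valid_size)

# What "same" padding amounts to for a 3x3 window: one ring of zeros around the image
image = np.ones((32, 32))
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape, conv_output_size(padded.shape[0], 3, "valid"))
```

Sliding a 3 x 3 window over the padded 34 x 34 image without further padding gives back a 32 x 32 result, which is why the padded layer keeps the original size.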
    

```
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten

# Create a model and add layers
model = Sequential()

model.add(Conv2D(32, (3, 3), padding="same", activation="relu", input_shape=(32, 32, 3)))
model.add(Conv2D(32, (3, 3), activation="relu"))

model.add(Conv2D(64, (3, 3), padding="same", activation="relu"))
model.add(Conv2D(64, (3, 3), activation="relu"))

model.add(Flatten())

model.add(Dense(512, activation="relu"))
model.add(Dense(10, activation="softmax"))

# Print a summary of the model
model.summary()
```

* Note that each layer also has a **total number of parameters** listed. This is the total number of weights in that layer (including bias)
* **The total params = the size or complexity of our NN**
* **Note too that the first Dense layer has 512 nodes. That number is a design choice; the `Flatten()` layer before it converts the 3D output of the last convolutional layer into the 1D array that a Dense layer expects.**
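
The per-layer parameter counts for the model above can be reproduced by hand. This sketch assumes 3 x 3 windows that move one pixel at a time, so each convolutional layer without padding trims 2 pixels from the width and height; the helper names are made up for the example.

```python
def conv2d_params(filters, window, in_channels):
    """Weights per filter (window * window * in_channels) plus one bias each."""
    return (window * window * in_channels + 1) * filters

def dense_params(nodes, inputs):
    """One weight per input for every node, plus one bias per node."""
    return (inputs + 1) * nodes

# Shapes follow the model above, starting from a 32x32x3 input
counts = {
    "conv 32 (same)": conv2d_params(32, 3, 3),     # output 32x32x32
    "conv 32":        conv2d_params(32, 3, 32),    # output 30x30x32
    "conv 64 (same)": conv2d_params(64, 3, 32),    # output 30x30x64
    "conv 64":        conv2d_params(64, 3, 64),    # output 28x28x64
}
flattened = 28 * 28 * 64                           # length of the Flatten() output
counts["dense 512"] = dense_params(512, flattened)
counts["dense 10"] = dense_params(10, 512)

for name, count in counts.items():
    print(f"{name:<16}{count:>12,}")
print(f"{'total':<16}{sum(counts.values()):>12,}")
```

Almost all of the parameters sit in the first Dense layer, because every one of its 512 nodes is connected to every value coming out of `Flatten()`.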

### Max pooling
* **Typically we'll do max pooling right after a block of convolutional layers**
* The only parameter that we have to pass into a max pooling layer is the size of the area we want to pool together (**`pool_size`**)

```
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

# Create a model and add layers
model = Sequential()

model.add(Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 3), activation="relu"))
model.add(Conv2D(32, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3), padding='same', activation="relu"))
model.add(Conv2D(64, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())
model.add(Dense(512, activation="relu"))
model.add(Dense(10, activation="softmax"))

# Print a summary of the model
model.summary()
```
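
To see how much the pooling layers shrink the data, the sketch below traces the array shape through the convolutional part of the model with and without pooling. It assumes 3 x 3 windows that move one pixel at a time and 2 x 2 pooling, as in the code above; the `trace_shapes` helper and its layer labels are made up for the example.

```python
def trace_shapes(input_shape, layers):
    """Follow (height, width, channels) through conv and pooling layers.
    Assumes 3x3 windows moving one pixel at a time and 2x2 pooling."""
    h, w, c = input_shape
    for kind, filters in layers:
        if kind == "conv_same":
            c = filters                       # size kept, channels = filter count
        elif kind == "conv_valid":
            h, w, c = h - 2, w - 2, filters   # no padding: lose 2 pixels per dimension
        elif kind == "pool":
            h, w = h // 2, w // 2             # 2x2 max pooling halves each dimension
    return h, w, c

conv_block_1 = [("conv_same", 32), ("conv_valid", 32)]
conv_block_2 = [("conv_same", 64), ("conv_valid", 64)]
pool = [("pool", None)]

without_pooling = trace_shapes((32, 32, 3), conv_block_1 + conv_block_2)
with_pooling = trace_shapes((32, 32, 3), conv_block_1 + pool + conv_block_2 + pool)

for label, (h, w, c) in [("without pooling", without_pooling),
                         ("with pooling", with_pooling)]:
    print(f"{label}: {h}x{w}x{c} -> {h * w * c:,} values into the first Dense layer")
```

With far fewer values reaching the first Dense layer, that layer needs far fewer weights, which should show up as a much smaller total in `model.summary()`.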

<img src='data/nn5.png' width="600" height="300" align="center"/>