From this graph we can see that each channel of an image patch is convolved against the corresponding channel of the filter.
When all of the products are calculated, they are summed up across all 3 channels. This yields a single number for each position of the filter over a patch of the image. This assumption should be verified.
![CNN for images with 3 channels](./images/snap_CNN_3_layers.png)
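One way to check the assumption is to compare a manual calculation with PyTorch's own convolution. The snippet below is a minimal sketch: the 5x5 patch size and the random values are arbitrary choices for illustration, not taken from the figure.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# One RGB patch (3 channels, 5x5) and one filter with a matching 3-channel depth.
patch = torch.randn(1, 3, 5, 5)    # (batch, channels, height, width)
kernel = torch.randn(1, 3, 5, 5)   # (out_channels, in_channels, height, width)

# Manual computation: multiply each channel of the patch by the matching
# channel of the filter, then sum all the products across the 3 channels.
per_channel = (patch[0] * kernel[0]).sum(dim=(1, 2))  # one partial sum per channel
manual = per_channel.sum()

# PyTorch's convolution (no bias) applied to the same patch.
conv_out = F.conv2d(patch, kernel)  # shape (1, 1, 1, 1): a single number

print(per_channel)
print(manual.item(), conv_out.item())
```

If the two printed numbers agree (up to floating-point rounding), the filter really does produce one value per patch by summing over all 3 channels.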



# PyTorch
## Define a Network Architecture

The various layers that make up any neural network are documented in the PyTorch `torch.nn` documentation. For a convolutional neural network, we'll use a simple series of layers:

    Convolutional layers
    Maxpooling layers
    Fully-connected (linear) layers

To define a neural network in PyTorch, you'll create and name a new neural network class, define the layers of the network in a function `__init__`, and define the feedforward behavior of the network that employs those initialized layers in the function `forward`, which takes in an input image tensor, `x`. The structure of such a class, called `Net`, is shown below.

Note: During training, PyTorch will be able to perform backpropagation by keeping track of the network's feedforward behavior and using autograd to calculate the update to the weights in the network.


Let's go over the details of what is happening in this code.
### Define the Layers in `__init__`

Convolutional and maxpooling layers are defined in `__init__`:

    # 1 input image channel (for grayscale images), 32 output channels/feature maps, 3x3 square convolution kernel
    self.conv1 = nn.Conv2d(1, 32, 3)

    # maxpool that uses a square window of kernel_size=2, stride=2
    self.pool = nn.MaxPool2d(2, 2)

### Refer to Layers in `forward`

These layers are then referred to in the `forward` function like this, in which the `conv1` layer has a ReLU activation applied to it before maxpooling is applied:

    x = self.pool(F.relu(self.conv1(x)))

Best practice is to place any layers whose weights will change during the training process in `__init__` and refer to them in the `forward` function. Any layers or functions that always behave in the same way, such as a pre-defined activation function, may appear in either `__init__` or `forward`; it is mostly a matter of style and readability.



import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self, n_classes):
        super(Net, self).__init__()

        # 1 input image channel (grayscale), 32 output channels/feature maps
        # 5x5 square convolution kernel
        self.conv1 = nn.Conv2d(1, 32, 5)

        # maxpool layer
        # pool with kernel_size=2, stride=2
        self.pool = nn.MaxPool2d(2, 2)

        # fully-connected layer
        # 32*4 input size: 32 feature maps of size 2x2 after pooling
        # (this assumes a small input image, e.g. 8x8; other sizes need a different value)
        # n_classes outputs (one score per class of image data)
        self.fc1 = nn.Linear(32*4, n_classes)

    # define the feedforward behavior
    def forward(self, x):
        # one conv/relu + pool layers
        x = self.pool(F.relu(self.conv1(x)))

        # prep for linear layer by flattening the feature maps into feature vectors
        x = x.view(x.size(0), -1)
        # linear layer 
        x = F.relu(self.fc1(x))

        # final output
        return x

# instantiate and print your Net
n_classes = 20 # example number of classes
net = Net(n_classes)
print(net)

Net(
  (conv1): Conv2d(1, 32, kernel_size=(5, 5), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=128, out_features=20, bias=True)
)
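The `32*4` passed to `fc1` only matches one particular input size. As a rough check, the sketch below applies the standard output-size formula, `(size + 2*padding - kernel_size) // stride + 1`, to the `conv1` and `pool` layers above. The helper names `conv_output_size` and `flattened_features` are hypothetical, introduced only for this illustration.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling window slides over `size` pixels."""
    return (size + 2 * padding - kernel_size) // stride + 1


def flattened_features(image_size, out_channels=32):
    """Number of inputs fc1 would need for a square image of side `image_size`."""
    after_conv = conv_output_size(image_size, kernel_size=5)            # conv1: 5x5, stride 1
    after_pool = conv_output_size(after_conv, kernel_size=2, stride=2)  # pool: 2x2, stride 2
    return out_channels * after_pool * after_pool


print(flattened_features(8))    # 8 -> 4 -> 2, so 32 * 2 * 2 = 128
print(flattened_features(28))   # 28 -> 24 -> 12, so 32 * 12 * 12 = 4608
```

So the network as printed would accept an 8x8 grayscale image; for a 28x28 image, `fc1` would need 4608 input features instead.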


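# Pooling
Pooling reduces the spatial dimensions of the feature maps, which lowers the number of parameters needed in later layers and helps reduce overfitting.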
![global average pooling converts each feature map into a single average value](./images/snap_pooling.png)
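To make the two kinds of pooling concrete, here is a minimal sketch comparing max pooling with global average pooling. The tensor sizes (6 feature maps of 8x8) are made up for illustration.

```python
import torch
import torch.nn as nn

# A batch of 1 image with 6 feature maps of size 8x8 (sizes chosen for illustration).
feature_maps = torch.arange(6 * 8 * 8, dtype=torch.float32).reshape(1, 6, 8, 8)

# Max pooling with a 2x2 window and stride 2 halves the height and width.
max_pooled = nn.MaxPool2d(kernel_size=2, stride=2)(feature_maps)

# Global average pooling reduces each feature map to its single mean value.
global_avg = nn.AdaptiveAvgPool2d(1)(feature_maps)

print(max_pooled.shape)   # torch.Size([1, 6, 4, 4])
print(global_avg.shape)   # torch.Size([1, 6, 1, 1])
```

Neither layer has learnable weights; both only shrink the data that later layers have to process.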

Consider an RGB picture of size (800,800) with three color channels. So, the input image has 800x800x3 pixel values.

Let's assume that, in a CNN, we apply filters to the image in the first layer. The filters have the shape (5,5,3,6), i.e. 6 filters of size 5x5x3. After convolution, assuming same padding, this results in an output of shape (800,800,6).

Let us assume we apply a ReLU activation to the output. Based on my understanding from the image below, ReLU is applied element-wise, i.e. 800x800x6 times, once to each of the output values.
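The dimensions above can be checked with a short sketch. Note that PyTorch uses a channel-first layout, so the same numbers appear in a different order: the image is (1, 3, 800, 800) and the filter bank is (6, 3, 5, 5). Using `padding=2` to get "same" padding for a 5x5 kernel is an assumption of this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# PyTorch stores images channel-first: (batch, channels, height, width).
image = torch.randn(1, 3, 800, 800)

# 6 filters of size 5x5 spanning all 3 input channels; padding=2 keeps 800x800.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, padding=2)

with torch.no_grad():
    pre_activation = conv(image)              # output before ReLU
    post_activation = F.relu(pre_activation)  # ReLU applied element-wise

print(conv.weight.shape)         # torch.Size([6, 3, 5, 5])
print(post_activation.shape)     # torch.Size([1, 6, 800, 800])
print(post_activation.numel())   # 3840000 = 800 * 800 * 6
```

In common usage, each of the 6 output channels is called one feature map, so this layer produces 6 of them; whether the term refers to the values before or after the activation varies between sources.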

Questions:

1) Please correct me if I made a wrong assumption above in terms of dimensions.

2) What is the feature map in the concrete example above? Is it the output (800x800x6) before the activation is applied, or the output (800x800x6) after the activation is applied?

3) How many feature maps do we have in the example above?

# Visualizing Activations of the First Layer
Visualizing the first layer is easier than visualizing deeper layers.
To start, you can sometimes simply look at the filters and see what is going on.
For example, a first-layer filter like this one is trying to detect horizontal edges:
![filter 1D](./images/snap_filter.png)
However, if you remember from the example above, each filter was (5,5,3). This means a filter is actually a color image with 3 layers, which can be thought of as an RGB image. The picture below shows an example of this for AlexNet.
Note that the stripe-like patterns detect edges at various angles: horizontal, diagonal, vertical, etc. Some filters show one color in one corner and a different color in another corner. These detect changes in color, which is another way of detecting object edges.
![filter 3D](./images/snap_fiter_color.png)


## Important Observation
Imagine you have many filters in your first layer. Assume one of them detects vertical edges, and assume your input image has many vertical edges. This means the feature map of this filter will have some high values (along the edges it detects). When the (5x5x3) filter moves over a vertical line, the sum of the products between the filter and the patch will be large, and ReLU(large value) = large value. This behavior is also described as "the filter is activated". So this filter will show patches of large values in its resulting 800x800 feature map (from the example above) after activation.

If a filter's pattern is absent from the image, the 800x800 feature map will be mainly dark, because ReLU(small or negative value) is small or 0. The result is dark, and when this feature map is convolved with the filters in the next layers, it produces small values.
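The observation above can be reproduced on a toy example. This is only a sketch: the 8x8 grayscale image and the hand-made Sobel-like 3x3 filters are illustrative stand-ins for a real image and learned 5x5x3 filters.

```python
import torch
import torch.nn.functional as F

# Toy grayscale image: left half dark (0), right half bright (1) -> one vertical edge.
image = torch.zeros(1, 1, 8, 8)
image[..., 4:] = 1.0

# Hand-made 3x3 vertical-edge filter (Sobel-like), an illustrative choice.
vertical = torch.tensor([[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]]).reshape(1, 1, 3, 3)
# Its transpose responds to horizontal edges, which this image does not contain.
horizontal = vertical.transpose(2, 3)

vertical_map = F.relu(F.conv2d(image, vertical))
horizontal_map = F.relu(F.conv2d(image, horizontal))

print(vertical_map[0, 0])     # value 4 in the two columns around the edge, 0 elsewhere
print(horizontal_map.max())   # tensor(0.): the pattern is absent, so the map stays dark
```

The vertical-edge filter is "activated" only where its pattern occurs, while the horizontal-edge filter produces an all-zero (dark) feature map.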

# Feature Map and Feature Vector - Visualizing Final Layer Outputs
The lecture uses the term feature vector for the inputs to fc6 and the outputs of fc8: essentially, outputs from deeper layers that are in vector form.
![AlexNet](./images/snap_alex_net.png)

### Last Layer

In addition to looking at the first layer(s) of a CNN, we can take the opposite approach, and look at the last linear layer in a model.

We know that the output of a classification CNN is a fully-connected class score layer, and one layer before that is a feature vector that represents the content of the input image in some way. This feature vector is produced after an input image has gone through all the layers in the CNN, and it contains enough distinguishing information to classify the image.

### Final Feature Vector

So, how can we understand what’s going on in this final feature vector? What kind of information has it distilled from an image?

To visualize what a vector represents about an image, we can compare it to other feature vectors, produced by the same CNN as it sees different input images. We can run a bunch of different images through a CNN and record the last feature vector for each image. This creates a feature space, where we can compare how similar these vectors are to one another.

We can measure vector closeness by looking at the nearest neighbors in feature space. In pixel space, the nearest neighbors of an image are simply the images that match its pixel values as closely as possible. So, an image of an orange basketball will closely match other orange basketballs or even other orange, round shapes like an orange fruit, as seen below.

### Nearest neighbors in feature space

In feature space, the nearest neighbors for a given feature vector are the vectors that most closely match that one; we typically compare these with a metric like MSE or L1 distance. The corresponding images may or may not have similar pixels, as nearest neighbors in pixel space do; instead they have very similar content, which the feature vector has distilled.

In short, to visualize the last layer in a CNN, we ask: which feature vectors are closest to one another and which images do those correspond to?
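A nearest-neighbor lookup in feature space might look like the sketch below. The feature vectors here are random stand-ins (in practice they would be recorded from the layer just before the class scores), the helper `nearest_neighbors` is a hypothetical name, and Euclidean (L2) distance is used, which ranks neighbors the same way as MSE.

```python
import torch

torch.manual_seed(0)

# Stand-in for recorded feature vectors: 100 images x 256 features.
features = torch.randn(100, 256)

# Make image 7 a slightly perturbed copy of image 0 so that a close neighbor exists.
features[7] = features[0] + 0.01 * torch.randn(256)


def nearest_neighbors(features, query_index, k=3):
    """Return indices of the k vectors closest (L2 distance) to the query vector."""
    query = features[query_index].unsqueeze(0)            # shape (1, 256)
    distances = torch.cdist(query, features).squeeze(0)   # shape (100,)
    distances[query_index] = float("inf")                 # ignore the query itself
    return torch.topk(distances, k, largest=False).indices


neighbors = nearest_neighbors(features, query_index=0)
print(neighbors)   # index 7 comes first
```

Looking up the images that belong to the returned indices then shows what the network considers similar content.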

And you can see an example of nearest neighbors in feature space, below; an image of a basketball that matches with other images of basketballs despite being a different color.

### Dimensionality reduction

Another method for visualizing this last layer in a CNN is to reduce the dimensionality of the final feature vector so that we can display it in 2D or 3D space.

For example, say we have a CNN that produces a 256-dimension vector (a list of 256 values). In this case, our task would be to reduce this 256-dimension vector into 2 dimensions that can then be plotted on an x-y axis. There are a few techniques that have been developed for compressing data like this.

### Principal Component Analysis

One is PCA, principal component analysis, which takes a high dimensional vector and compresses it down to two dimensions. It does this by looking at the feature space and creating two variables (x, y) that are linear functions of these features. These two variables are chosen to be uncorrelated with each other and to capture as much of the variance in the data as possible, which means that the produced x and y end up spreading the original feature data distribution out as widely as possible.
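As a rough illustration, PCA down to two dimensions can be written with a singular value decomposition. This is a minimal sketch on random stand-in data, not a recorded set of CNN features; the sizes (200 images, 256 features) are assumptions.

```python
import torch

torch.manual_seed(0)

# Stand-in for recorded feature vectors: 200 images x 256 features.
features = torch.randn(200, 256)

# Center the data, then take the SVD; the rows of Vh are the principal directions,
# ordered from the largest to the smallest variance.
centered = features - features.mean(dim=0, keepdim=True)
_, singular_values, Vh = torch.linalg.svd(centered, full_matrices=False)

# Project every 256-dimensional vector onto the top two principal directions.
points_2d = centered @ Vh[:2].T    # shape (200, 2): one (x, y) point per image

print(points_2d.shape)
print(points_2d.var(dim=0))        # x has at least as much variance as y
```

Each row of `points_2d` can then be plotted as a point on an x-y axis, colored by the class of the image it came from.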

### t-SNE

Another really powerful method for visualization is called t-SNE (pronounced tea-SNEE), which stands for t-distributed stochastic neighbor embedding. It’s a non-linear dimensionality reduction technique that, again, aims to separate data in a way that clusters similar data close together and separates differing data.

As an example, below is a t-SNE reduction done on the MNIST dataset, which is a dataset of thousands of 28x28 images, similar to FashionMNIST, where each image is one of 10 hand-written digits 0-9.

The 28x28 pixel space of each digit is compressed to 2 dimensions by t-SNE and you can see that this produces ten clusters, one for each type of digits in the dataset!


# Other Feature Visualization Techniques
Feature visualization is an active area of research and, before we move on, I'd like to give you an overview of some of the techniques that you might see in research or try to implement on your own!

### Occlusion Experiments

Occlusion means to block out or mask part of an image or object. For example, if you are looking at a person but their face is behind a book; this person's face is hidden (occluded). Occlusion can be used in feature visualization by blocking out selective parts of an image and seeing how a network responds.

The process for an occlusion experiment is as follows:

    1. Mask part of an image before feeding it into a trained CNN,
    2. Draw a heatmap of class scores for each masked image,
    3. Slide the masked area to a different spot and repeat steps 1 and 2.

The result should be a heatmap that shows the predicted class score of an image as a function of which part of the image was occluded. The reasoning is that if the predicted class for a partially occluded image differs from the true class, then the occluded area was likely very important!
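The occlusion steps can be sketched as follows. The tiny untrained network stands in for a trained CNN, and the function name `occlusion_heatmap`, the gray mask value 0.5, and the patch and stride sizes are all assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny untrained stand-in classifier; a real experiment would use a trained CNN.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).eval()


def occlusion_heatmap(model, image, target_class, patch=8, stride=8):
    """Score of `target_class` for each position of a gray occluding square."""
    _, height, width = image.shape
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    with torch.no_grad():
        for i in range(rows):
            for j in range(cols):
                occluded = image.clone()
                top, left = i * stride, j * stride
                occluded[:, top:top + patch, left:left + patch] = 0.5  # gray mask
                scores = model(occluded.unsqueeze(0))
                heatmap[i, j] = scores[0, target_class]
    return heatmap


image = torch.rand(3, 32, 32)
heatmap = occlusion_heatmap(model, image, target_class=3)
print(heatmap.shape)   # torch.Size([4, 4])
```

Positions where the score drops the most mark the regions the network relies on for that class.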


### Saliency Maps

Salience can be thought of as the importance of something, and for a given image, a saliency map asks: Which pixels are most important in classifying this image?

Not all pixels in an image are needed or relevant for classification. In the image of the elephant above, you don't need all the information in the image about the background and you may not even need all the detail about an elephant's skin texture; only the pixels that distinguish the elephant from any other animal are important.

Saliency maps aim to show these important pixels by computing the gradient of the class score with respect to the image pixels. A gradient is a measure of change, and so, the gradient of the class score with respect to the image pixels is a measure of how much a class score for an image changes if a pixel changes a little bit.


### Measuring change

A saliency map tells us, for each pixel in an input image, how the class output will change if we change its value slightly (by dp). If the class scores change a lot, then the pixel that experienced a change, dp, is important in the classification task.

Looking at the saliency map below, you can see that it identifies the most important pixels in classifying an image of a flower. These kinds of maps have even been used to perform image segmentation (imagine the map overlay acting as an image mask)!
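A basic gradient-based saliency map can be sketched with autograd. The network below is a tiny untrained stand-in for a trained classifier, and taking the largest absolute gradient across the color channels is one common convention, assumed here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny untrained stand-in classifier; a real saliency map would use a trained CNN.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).eval()

# The image must track gradients so that d(score)/d(pixel) can be computed.
image = torch.rand(1, 3, 32, 32, requires_grad=True)

scores = model(image)
target_class = scores.argmax(dim=1).item()

# Backpropagate the chosen class score down to the input pixels.
scores[0, target_class].backward()

# One value per pixel: the largest absolute gradient across the 3 color channels.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)   # shape (32, 32)
print(saliency.shape)
```

Bright values in `saliency` mark pixels where a small change dp would change the class score the most.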


### Guided Backpropagation

Similar to the process for constructing a saliency map, you can compute the gradients for mid-level neurons in a network with respect to the input pixels. Guided backpropagation looks at each pixel in an input image and asks: if we change its pixel value slightly, how will the output of a particular neuron or layer in the network change? If the expected output changes a lot, then the pixel that experienced the change is important to that particular layer.

This is very similar to the backpropagation steps for measuring the error between a network's output and its target and propagating it back through a network. Guided backpropagation tells us exactly which parts of the image patches that we’ve looked at activate a specific neuron/layer.
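A possible sketch of this idea uses backward hooks on the ReLU layers. In the commonly described formulation of guided backpropagation, which goes beyond what the text above states, negative gradients are set to zero at every ReLU during the backward pass; that is what the hook below does. The small untrained network, the hook name `keep_positive_gradients`, and the chosen neuron are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small untrained stand-in network; the mid-level layer here is the second conv + ReLU.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
).eval()


def keep_positive_gradients(module, grad_input, grad_output):
    """Backward hook: let only positive gradients flow back through a ReLU."""
    return (torch.clamp(grad_input[0], min=0.0),)


hooks = [
    layer.register_full_backward_hook(keep_positive_gradients)
    for layer in model
    if isinstance(layer, nn.ReLU)
]

image = torch.rand(1, 3, 32, 32, requires_grad=True)
activations = model(image)              # shape (1, 16, 32, 32)

# Pick one mid-level neuron: feature map 5, spatial position (16, 16).
activations[0, 5, 16, 16].backward()
guided_gradients = image.grad.clone()   # same shape as the input image

# Remove the hooks so the model behaves normally afterwards.
for hook in hooks:
    hook.remove()

print(guided_gradients.shape)
```

Only pixels inside the neuron's receptive field (a 5x5 window for two stacked 3x3 convolutions) can receive a non-zero gradient, which is why the result highlights a specific image patch.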

