## 1x1 Convolution

In [14]:
import numpy as np
import tensorflow as tf

print(tf.__version__)

# custom init with the seed set to 0 by default
def custom_init(shape, dtype=tf.float32, partition_info=None, seed=0):
    return tf.random_normal(shape, dtype=dtype, seed=seed)

def conv_1x1(x, num_outputs, init):
    return tf.layers.conv2d(x, num_outputs, 1, 1, kernel_initializer=init)

x = tf.constant(np.random.randn(1, 2, 2, 1), dtype=tf.float32)

# defines the number of output channels or kernels
num_outputs = 2

# For inputs of rank > 2, `tf.layers.dense` acts on the last axis only: conceptually it flattens
# the leading axes, applies the weights, and reshapes the result back to the original rank.
dense_out = tf.layers.dense(x, 
                            num_outputs, 
                            kernel_initializer=custom_init)
conv_out = conv_1x1(x, 
                    num_outputs, 
                    init=custom_init)

    
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    a = sess.run(dense_out)
    b = sess.run(conv_out)
    print("Dense Output =\n", a)
    print("Conv 1x1 Output =\n", b)

    print("Same output? =", np.allclose(a, b, atol=1.e-5))

1.3.0
Dense Output =
 [[[[-0.38790497  2.04511309]
   [ 0.75163925 -3.96279335]]

  [[ 0.63316619 -3.33817935]
   [-0.33753824  1.77956951]]]]
Conv 1x1 Output =
 [[[[-0.38790497  2.04511309]
   [ 0.75163925 -3.96279335]]

  [[ 0.63316619 -3.33817935]
   [-0.33753824  1.77956951]]]]
Same output? = True

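Why the two outputs match can be seen without TensorFlow. The following NumPy sketch applies the same weight matrix twice: once as a matrix product over the channel axis, and once as an explicit per-pixel 1x1 convolution. The helper names `dense_last_axis` and `conv_1x1_naive` are illustrative, not part of any library, and biases are omitted.

```python
import numpy as np

def dense_last_axis(x, w):
    """Dense layer without bias: matrix product over the last (channel) axis."""
    return np.matmul(x, w)

def conv_1x1_naive(x, w):
    """1x1 convolution without bias, written as explicit loops over pixels."""
    batch, height, width, _ = x.shape
    out = np.zeros((batch, height, width, w.shape[1]))
    for b in range(batch):
        for i in range(height):
            for j in range(width):
                # Each output pixel depends only on the same input pixel.
                out[b, i, j] = x[b, i, j] @ w
    return out

rng = np.random.RandomState(0)
x = rng.randn(1, 2, 2, 1)   # same NHWC shape as in the cell above
w = rng.randn(1, 2)         # 1 input channel -> 2 output channels

dense_out = dense_last_axis(x, w)
conv_out = conv_1x1_naive(x, w)
print(dense_out.shape)
print(np.allclose(dense_out, conv_out))
```

Both paths multiply every pixel's channel vector by the same matrix, which is why sharing the initializer seed is enough to make the TensorFlow outputs identical.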

## Transposed Convolution (deconvolution)

In [11]:
import tensorflow as tf
import numpy as np


def upsample(x):
    """
    Upsample x by a factor of two and return the result.
    :x: rank-4 Tensor
    :return: TF Operation
    """
    # Upsampling is done with `tf.layers.conv2d_transpose`.
    # The second argument, 3, is the number of kernels/output channels.
    num_kernels = 3
    # The third argument is the kernel size, (2, 2).
    # A kernel size of (1, 1) would give the same output shape.
    # With (3, 3), however, the output height and width would be (9, 9) under the default 'VALID' padding.
    kernel_size = (2, 2)
    # The fourth argument, the stride, is what takes the height and width from (4, 4) to (8, 8).
    # If this were a regular convolution, the output height and width would be (2, 2).
    num_strides = (2, 2)
    return tf.layers.conv2d_transpose(x, num_kernels, kernel_size, num_strides)


x = tf.constant(np.random.randn(1, 4, 4, 3), dtype=tf.float32)
conv = upsample(x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    result = sess.run(conv)

    print('Input Shape: {}'.format(x.get_shape()))
    print('Output Shape: {}'.format(result.shape))


Input Shape: (1, 4, 4, 3)
Output Shape: (1, 8, 8, 3)

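The shapes quoted in the comments follow from a simple length formula. The sketch below restates that arithmetic in plain Python, as we understand TensorFlow's behaviour for `conv2d_transpose`. The function name `transposed_conv_length` is made up for illustration.

```python
def transposed_conv_length(input_length, kernel_size, stride, padding="valid"):
    """Output length of a transposed convolution along one spatial axis."""
    output_length = input_length * stride
    if padding == "valid":
        # A kernel wider than the stride overhangs the last input position.
        output_length += max(kernel_size - stride, 0)
    elif padding != "same":
        raise ValueError("padding must be 'valid' or 'same'")
    return output_length

# Input height/width of 4 with stride 2, as in the cell above.
for kernel_size in (1, 2, 3):
    print(kernel_size, transposed_conv_length(4, kernel_size, stride=2))
```

With `padding="same"`, the output length is always `input_length * stride`, regardless of the kernel size.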

## Scene Understanding

### Bounding boxes

### Semantic Segmentation

### FCN-8 - Encoder
The FCN-8 architecture was developed at Berkeley, and many FCN models are derived from this FCN-8 implementation.

The encoder for FCN-8 is the VGG16 model pretrained on ImageNet for classification. The fully-connected layers are replaced by 1-by-1 convolutions. Here’s an example of going from a fully-connected layer to a 1-by-1 convolution in TensorFlow:
```python
num_classes = 2
output = tf.layers.dense(input, num_classes)
```
To:
```python
num_classes = 2
output = tf.layers.conv2d(input, num_classes, 1, strides=(1,1))
```
The third argument, 1, is the kernel size, meaning this is a 1-by-1 convolution. Thus far, we’ve downsampled the input image and extracted features using the VGG16 encoder. We’ve also replaced the fully-connected layers with 1-by-1 convolutional layers, preserving spatial information.

But this is just the encoder portion of the network. Next comes the decoder.
>Key observation is that fully connected layers in classification networks can be viewed as convolutions with kernels that cover their entire input regions. This is equivalent to evaluating the original classification network on overlapping input patches but is much more efficient because computation is shared over the overlapping regions of patches 
![](http://blog.qure.ai/assets/images/segmentation-review/FCN%20-%20illustration.png)

[A 2017 Guide to Semantic Segmentation with Deep Learning](http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review)

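The quoted observation can be checked on a toy example. In this NumPy sketch, a fully connected layer over a 3x3 patch is reinterpreted as a 3x3 kernel. Sliding that kernel over a larger image then yields one score per patch. The helper `conv_valid` and all sizes are illustrative assumptions, and only a single channel and a single output unit are modeled.

```python
import numpy as np

def conv_valid(image, kernel):
    """Single-channel 'valid' cross-correlation with stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.RandomState(0)
fc_weights = rng.randn(3 * 3)           # fully connected layer: 9 inputs -> 1 output
kernel = fc_weights.reshape(3, 3)       # the same weights viewed as a 3x3 kernel

patch = rng.randn(3, 3)
fc_score = patch.ravel() @ fc_weights   # classifier applied to one patch
conv_score = conv_valid(patch, kernel)  # shape (1, 1), same value

image = rng.randn(5, 5)
score_map = conv_valid(image, kernel)   # shape (3, 3): one score per 3x3 patch
print(np.isclose(fc_score, conv_score[0, 0]), score_map.shape)
```

Each entry of `score_map` equals the fully connected score of the corresponding patch, so the larger image is classified densely in a single pass.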
### FCN-8 - Decoder

To build the decoder portion of FCN-8, we’ll upsample the input to the original image size. The tensor produced by the final transposed convolutional layer will be 4-dimensional:
 - batch_size, 
 - original_height, 
 - original_width, 
 - num_classes
 
Let’s implement those transposed convolutions we discussed earlier as follows:
```python
output = tf.layers.conv2d_transpose(input, num_classes, 4, strides=(2, 2))
```
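Transposed convolutional layers increase the height and width of the 4D input tensor. With the default `padding='valid'`, a kernel size of 4 and a stride of 2 produce an output height of 2·H + 2; passing `padding='same'` gives exactly 2·H, which keeps the shapes aligned with the pooling layers used for the skip connections.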

>After convolutionalizing fully connected layers in a imagenet pretrained network like VGG, feature maps still need to be upsampled because of pooling operations in CNNs. Instead of using simple bilinear interpolation, deconvolutional layers **can learn the interpolation**. This layer is also known as upconvolution, full convolution, transposed convolution or fractionally-strided convolution.

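To make "learning the interpolation" concrete, here is a minimal single-channel NumPy sketch of a transposed convolution. It assumes 'valid' padding and a kernel at least as large as the stride, and the function name `conv2d_transpose_valid` is hypothetical. Each input value stamps a scaled copy of the kernel onto the output, so the kernel values decide how neighbouring pixels are blended.

```python
import numpy as np

def conv2d_transpose_valid(x, kernel, stride):
    """Single-channel transposed convolution with 'valid' padding."""
    in_h, in_w = x.shape
    k_h, k_w = kernel.shape
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = np.zeros((out_h, out_w))
    for i in range(in_h):
        for j in range(in_w):
            # Each input value adds a scaled copy of the kernel to the output.
            rows = slice(i * stride, i * stride + k_h)
            cols = slice(j * stride, j * stride + k_w)
            out[rows, cols] += x[i, j] * kernel
    return out

x = np.arange(4.0).reshape(2, 2)

# A 2x2 kernel of ones with stride 2 reproduces nearest-neighbour upsampling.
nearest = conv2d_transpose_valid(x, np.ones((2, 2)), stride=2)
print(nearest)

# A 4x4 kernel with stride 2 makes neighbouring stamps overlap and sum.
overlapping = conv2d_transpose_valid(x, np.ones((4, 4)), stride=2)
print(overlapping.shape)
```

A fixed kernel of ones gives a fixed interpolation scheme. In a trained network the kernel entries are ordinary weights, so the blending is whatever minimizes the loss.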

### Skip Connections
The final step is adding skip connections to the model. In order to do this we’ll combine the output of two layers. The first output is the output of the current layer. The second output is the output of a layer further back in the network, typically a pooling layer. In the following example we combine the result of the previous layer with the result of the 4th pooling layer through elementwise addition (`tf.add`).
```python
# make sure the shapes are the same!
input = tf.add(input, pool_4)
```
We can then follow this with another transposed convolution layer.
```python
input = tf.layers.conv2d_transpose(input, num_classes, 4, strides=(2, 2))
```
We’ll repeat this once more with the third pooling layer output.
```python
input = tf.add(input, pool_3)
input = tf.layers.conv2d_transpose(input, num_classes, 16, strides=(8, 8))
```

>However, upsampling (even with deconvolutional layers) produces coarse segmentation maps because of loss of information during pooling. Therefore, shortcut/skip connections are introduced from higher resolution feature maps.

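The "make sure the shapes are the same" comment is easiest to verify with a little shape bookkeeping. The plain-Python sketch below makes several assumptions that are not in the snippets above: a hypothetical 160x576 input, `padding='same'` for every transposed convolution, and skip tensors already reduced to `num_classes` channels (for example by a 1-by-1 convolution). In VGG16, the third and fourth pooling layers sit at 1/8 and 1/16 of the input resolution, and the encoder output at 1/32.

```python
def upsample_same(shape, stride, num_classes):
    """Shape after a transposed convolution with padding='same'."""
    batch, height, width, _ = shape
    return (batch, height * stride, width * stride, num_classes)

def add_skip(shape, skip_shape):
    """Elementwise addition only works when both shapes agree exactly."""
    if shape != skip_shape:
        raise ValueError("shape mismatch: {} vs {}".format(shape, skip_shape))
    return shape

num_classes = 2
height, width = 160, 576                     # hypothetical input size

# Skip tensors, assumed already projected to num_classes channels.
pool_3 = (1, height // 8, width // 8, num_classes)
pool_4 = (1, height // 16, width // 16, num_classes)
encoder_out = (1, height // 32, width // 32, num_classes)

x = upsample_same(encoder_out, 2, num_classes)   # 1/32 -> 1/16
x = add_skip(x, pool_4)
x = upsample_same(x, 2, num_classes)             # 1/16 -> 1/8
x = add_skip(x, pool_3)
x = upsample_same(x, 8, num_classes)             # 1/8 -> full resolution
print(x)
```

The strides 2, 2 and 8 multiply to 32, which undoes the encoder's total downsampling and returns a score map with the input's height and width.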
### FCN-8 - Classification & Loss

The final step is to define a loss. That way, we can approach training an FCN just like we would approach training a normal classification CNN.

In the case of an FCN, the goal is to assign each pixel to the appropriate class. We already know a good loss function for this setup: cross-entropy loss. Remember the output tensor is 4D, so we have to reshape it to 2D:

```python
...
logits = tf.reshape(input, (-1, num_classes))
```
`logits` is now a 2D tensor where each row represents a pixel and each column a class. From here we just use the standard cross-entropy loss:
```python
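# TensorFlow 1.x requires named arguments here; `labels` must have the same shape as `logits`.
cross_entropy_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))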
```
That’s it, we now have an end-to-end model for semantic segmentation. Time to get training!

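As a sanity check on the reshape-then-cross-entropy recipe, here is a NumPy sketch of the same computation. It assumes one-hot labels with the same NHWC layout as the network output. The function name `pixelwise_cross_entropy` and the toy sizes are illustrative only.

```python
import numpy as np

def pixelwise_cross_entropy(output, one_hot_labels):
    """Mean softmax cross-entropy over all pixels of a 4D NHWC score tensor."""
    num_classes = output.shape[-1]
    logits = output.reshape(-1, num_classes)          # one row per pixel
    labels = one_hot_labels.reshape(-1, num_classes)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1).mean()

num_classes = 2
output = np.zeros((1, 4, 4, num_classes))             # equal scores for every class
class_ids = np.random.RandomState(0).randint(num_classes, size=(1, 4, 4))
one_hot_labels = np.eye(num_classes)[class_ids]       # shape (1, 4, 4, 2)

loss = pixelwise_cross_entropy(output, one_hot_labels)
print(loss)
```

With equal scores for both classes, every pixel gets probability 1/2 for its true class, so the loss is ln 2 ≈ 0.693. This is a useful value to expect from an untrained two-class model.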
### Links
 - [carnd submission](https://github.com/dave-msk/CarND-Semantic-Segmentation-Submission)