# Q & A

## Questions

1. As mentioned in the first section of the paper, just curious what causes spatial transforms to apply "attention" to relevant sections of an image? Is it one of: "the localisation network, grid generator, and sampler" or the combination of all 3 of them that leads to attentiveness?

2. Why is Equation (2) of the paper the "attention" transform? Equation (2) being the matrix:

$$\begin{bmatrix}
s & 0 & t[x] \\
0 & s & t[y] \\
\end{bmatrix}$$

3. Following up on Section 3.4, how/why do spatial transformers minimize the overall cost function of their parent CNN during training?

4. Continuing with Section 3.4 again, why do spatial transformers limit the number of objects that can be modeled by a feed-forward network?

5. The conclusion section says spatial transforms learn without making changes to their parent CNN's cost function. But Section 3.4 (please see Question #3 above) seems to suggest that spatial transformers do indeed make changes to their parent CNN's cost function. So I'm confused. Please advise if I'm mixing up concepts here; maybe the authors mean one thing in Section 3.4's cost minimization discussion and something entirely different in the Conclusion section.

6. This seems really useful and cool. Do people know what orientation the object is? How do we know that we have to put it face up or face down? 


7. Can this be extended beyond just normalizing data to creating new data? Or to learn like a manifold?


## Responses

##### 1. As mentioned in the first section of the paper, just curious what causes spatial transforms to apply "attention" to relevant sections of an image? Is it one of: "the localisation network, grid generator, and sampler" or the combination of all 3 of them that leads to attentiveness?

It is a combination of all three. Let's consider them in the order in which gradients flow during backpropagation, from the output back toward the parameters.

1. sampler

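Given the input image $U$ and the spatially transformed output $V$, the interpolation kernel determines which input pixels $u_{ij}$ (with $(i, j)$ ranging over the $H \times W$ input grid) contribute to each output pixel, and with what weight. Those are the pixels the gradients pass through, scaled by their contributions.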

2. grid generator

This piece is really the definition of the spatial transformation. It specifies how $\theta$ relates $U$ and $V$ by mapping each output point $v_{ij}$ to a position in the input. The sampler then determines which $u_{ij}$ to associate with that position. In other words, the grid generator is what ties $\theta$ to $U$ and $V$.

3. localisation network

Gradients with respect to $\theta$ propagate back through the localisation network, allowing it to tune its weights and therefore the $\theta$ it predicts.
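
The forward pass through the grid generator and sampler can be sketched in NumPy as below. This is a minimal illustration, not the paper's implementation: the function names `affine_grid` and `bilinear_sample` are hypothetical, the image is assumed to be single-channel, coordinates are assumed to be normalized to $[-1, 1]$, and $\theta$ is fixed by hand instead of being predicted by a localisation network.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Grid generator: map every output pixel to a source location (x_s, y_s).

    theta is a 2x3 affine matrix acting on normalized coordinates in [-1, 1].
    """
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, out_h),
        np.linspace(-1.0, 1.0, out_w),
        indexing="ij",
    )
    # Homogeneous target coordinates, shape (3, out_h * out_w).
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    source = theta @ target  # shape (2, out_h * out_w)
    return source[0].reshape(out_h, out_w), source[1].reshape(out_h, out_w)

def bilinear_sample(U, x_s, y_s):
    """Sampler: bilinearly interpolate U at the normalized source locations."""
    in_h, in_w = U.shape
    # Convert normalized coordinates to pixel coordinates.
    x = (x_s + 1.0) * (in_w - 1) / 2.0
    y = (y_s + 1.0) * (in_h - 1) / 2.0
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    V = np.zeros(x.shape, dtype=float)
    # Only the four neighbours of each source location get a non-zero weight,
    # so only those input pixels would receive gradient from that output pixel.
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            w = np.maximum(0, 1 - np.abs(x - xi)) * np.maximum(0, 1 - np.abs(y - yi))
            inside = (xi >= 0) & (xi < in_w) & (yi >= 0) & (yi < in_h)
            V[inside] += w[inside] * U[yi[inside], xi[inside]]
    return V

# Usage: scale 0.5 with zero translation looks at the centre of the input.
U = np.arange(25, dtype=float).reshape(5, 5)
theta = np.array([[0.5, 0.0, 0.0],
                  [0.0, 0.5, 0.0]])
x_s, y_s = affine_grid(theta, 3, 3)
V = bilinear_sample(U, x_s, y_s)
```

With these values the 3x3 output `V` is the central 3x3 crop of `U`. The weights `w` are the "contributions" mentioned under the sampler: they decide how much of each output gradient is routed to each input pixel.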


##### 2. Why is Equation (2) of the paper the "attention" transform? Equation (2) being the matrix:

$$\begin{bmatrix}
s & 0 & t[x] \\
0 & s & t[y] \\
\end{bmatrix}$$

In [None]:
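import numpy as np
import matplotlib.pyplot as plt
import skimage.data
from skimage.transform import rescale, warp
import ipywidgets as widgets
from ipywidgets import interactive, Layout

N = 128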
square2 = np.zeros((N, N))
square2[60:100, 60:100] = 1
horse = rescale(skimage.data.horse(), .5, anti_aliasing=True, multichannel=False, mode='constant')
coffee = rescale(skimage.data.coffee(), 1.0, anti_aliasing=True, multichannel=True, mode='constant')
interp_dict = dict(
    [(y, x) for x, y in enumerate(['Nearest Neighbor', 'Linear', 'Quadratic', 'Cubic'])]
)
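def f(tx, ty, s, interp, image):
    """Warp `image` with the attention transform (isotropic scale s, translation tx, ty)."""
    fig = plt.figure(figsize=(12, 12), dpi=80, facecolor='w', edgecolor='k')

    # Attention transform in homogeneous coordinates: no rotation or shear.
    mat = np.array([
        [s, 0, tx],
        [0, s, ty],
        [0, 0, 1]
    ])
    xdim = image.shape[1]
    ydim = image.shape[0]
    # The output canvas is twice the input size, so its centre is (xdim, ydim).
    shiftR = np.array([
            [1, 0, -xdim],
            [0, 1, -ydim],
            [0, 0, 1]
        ])
    # Move the origin back to the centre of the input image.
    shiftL = np.array([
            [1, 0, xdim/2],
            [0, 1, ydim/2],
            [0, 0, 1]
        ])
    # warp() treats the matrix as the inverse map (output coords -> input coords).
    mat = shiftL @ mat @ shiftR

    # Only rows and columns are doubled; any channel axis is left alone.
    img = warp(image, mat, output_shape=[2 * d for d in image.shape[:2]],
               order=interp, mode='constant')
    if img.ndim == 2:
        plt.imshow(img, cmap='gray')
    else:
        plt.imshow(img)
    plt.grid()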

def reset_values(b):
    # Restore the sliders to the identity transform (no translation, unit scale).
    for child in plot.children:
        if not hasattr(child, 'description'):
            continue
        elif child.description in ['$t_x$', '$t_y$']:
            child.value = 0
        elif child.description in ['$s$']:
            child.value = 1.0

reset_button = widgets.Button(description = "Reset")
reset_button.on_click(reset_values)

# IntSlider needs an integer step.
x2 = widgets.IntSlider(min=-200, max=200, step=10, orientation='vertical', description='$t_x$')
y2 = widgets.IntSlider(min=-200, max=200, step=10, orientation='vertical', description='$t_y$')
s = widgets.FloatSlider(min=0, max=2.0, value=1, orientation='vertical', description='$s$')
# `interpolation` and `images` are assumed to be selection widgets defined elsewhere in the notebook.
plot = interactive(f, tx=x2, ty=y2, s=s, interp=interpolation, image=images)
layout = Layout(display='flex', flex_flow='row', justify_content='space-between')