## Task 5.1 Gradients are row vectors

Consider a function $f : \mathbb R^{n×1} → \mathbb R^{m×1}$ mapping n-dimensional column vectors onto m-dimensional column vectors.

For any $\mathbf x \in \mathbb R^{n×1}$, the derivative $f'(\mathbf x)$ is characterized by its linear approximation property
$$ f(\mathbf x + \mathbf h) \approx f(\mathbf x) + f'(\mathbf x) \cdot \mathbf h $$
for small $\mathbf h \in \mathbb R^{n×1}$.

Considering the dimension of the involved entities $\mathbf x$, $\mathbf h$, $f(\mathbf x)$, and $f'(\mathbf x)$, explain why the gradient, i.e. the derivative of a scalar function ($m=1$), is a row vector.

Let's look at the dimensions of the entities involved:

1. $\mathbf x$ and $\mathbf h$ both lie in $\mathbb{R}^{n\times 1}$, i.e. they are n-dimensional column vectors.
2. $f(\mathbf x)$ lies in $\mathbb{R}^{m\times 1}$, so for $m = 1$ the output is a scalar.
3. The derivative $f'(\mathbf x)$ represents a linear map from the n-dimensional input space to the m-dimensional output space. For the product $f'(\mathbf x) \cdot \mathbf h$ to be defined and to have the same dimension as $f(\mathbf x)$, $f'(\mathbf x)$ must be an $m \times n$ matrix. For $m = 1$ this is a $1 \times n$ matrix, i.e. a row vector. Its entries are the partial derivatives $\partial f / \partial x_j$ evaluated at $\mathbf x$.
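
The dimension argument can be checked numerically. The following is a minimal sketch, assuming NumPy and a hypothetical scalar function $f(\mathbf x) = \sum_j x_j^2$ chosen only for illustration; the helper `numerical_derivative` is likewise an assumption, not part of the task.

```python
import numpy as np

def f(x):
    # Hypothetical scalar function f: R^{n x 1} -> R (sum of squares)
    return float(np.sum(x ** 2))

def numerical_derivative(f, x, eps=1e-6):
    """Central-difference derivative of a scalar f at column vector x.

    Returned as a 1 x n row vector so that derivative @ h is defined.
    """
    n = x.shape[0]
    row = np.zeros((1, n))
    for j in range(n):
        d = np.zeros_like(x)
        d[j, 0] = eps
        row[0, j] = (f(x + d) - f(x - d)) / (2 * eps)
    return row

x = np.array([[1.0], [2.0], [3.0]])            # n x 1 column vector
h = 1e-3 * np.array([[1.0], [-2.0], [0.5]])    # small n x 1 perturbation

J = numerical_derivative(f, x)                 # shape (1, n): a row vector
linear = f(x) + (J @ h).item()                 # (1 x n) @ (n x 1) -> scalar

print(J.shape)                # (1, 3)
print(abs(f(x + h) - linear)) # small: only the second-order term remains
```

Storing the derivative as a column vector instead would make `J @ h` fail with a shape error, which is the same argument as in item 3.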

## Task 5.2 Gradient of a Bilinear form

Let $\mathbf x \in \mathbb R^n$ and $\mathbf y \in \mathbb R^m$ be usual column vectors.
The bilinear form $f(\mathbf x, \mathbf y, W) = \mathbf x^t W \mathbf y = \sum_i x_i \sum_j w_{ij} y_j$ yields a scalar.

1. What's the correct dimension of $W$?
2. Determine the dimension of the following derivatives:
   1. $\nabla_{\mathbf x} f$  (standard row-vector gradient)
   2. $\nabla_{\mathbf x^t} f \equiv \nabla^t_{\mathbf x} f$  (column-vector gradient)
   3. $\nabla_{\mathbf y} f$  (row-vector gradient)
   4. $\nabla_{\mathbf y^t} f \equiv \nabla^t_{\mathbf y} f$  (column-vector gradient)
3. Compute the derivatives

![image.png](attachment:image.png)
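
As a cross-check, the dimensions and derivatives can be verified numerically. This sketch assumes NumPy, randomly chosen values, and the standard results $W \in \mathbb R^{n \times m}$, $\nabla_{\mathbf x} f = (W \mathbf y)^t$ and $\nabla_{\mathbf y} f = \mathbf x^t W$; the helper `fd_row` is a hypothetical finite-difference routine introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2
x = rng.normal(size=(n, 1))   # column vector in R^n
y = rng.normal(size=(m, 1))   # column vector in R^m
W = rng.normal(size=(n, m))   # must be n x m for x^t W y to be a scalar

def f(x, y, W):
    return (x.T @ W @ y).item()

def fd_row(g, v, eps=1e-6):
    """Central differences of scalar g w.r.t. column vector v, as a row vector."""
    row = np.zeros((1, v.shape[0]))
    for j in range(v.shape[0]):
        d = np.zeros_like(v)
        d[j, 0] = eps
        row[0, j] = (g(v + d) - g(v - d)) / (2 * eps)
    return row

grad_x_row = (W @ y).T   # 1 x n, row-vector gradient w.r.t. x
grad_x_col = W @ y       # n x 1, column-vector gradient w.r.t. x
grad_y_row = x.T @ W     # 1 x m, row-vector gradient w.r.t. y
grad_y_col = W.T @ x     # m x 1, column-vector gradient w.r.t. y

print(grad_x_row.shape, grad_x_col.shape, grad_y_row.shape, grad_y_col.shape)
print(np.allclose(grad_x_row, fd_row(lambda v: f(v, y, W), x)))
print(np.allclose(grad_y_row, fd_row(lambda v: f(x, v, W), y)))
```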

## Task 5.3 Computation Graph

Consider the following computational graph:

![https://github.com/rhaschke/Neural-Networks/blob/master/backprop.svg](https://raw.githubusercontent.com/rhaschke/Neural-Networks/master/backprop.svg)

1. Write the computational graph as a formula: $y = \|W \mathbf x - \mathbf t\|^2$

2. Perform a forward pass for the graph, starting with the given values for $\mathbf W$, $\mathbf x$, and $\mathbf t$.
Denote the results directly in the graph, above the connecting arrow lines.


![image.png](attachment:image.png)
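
The forward pass can also be reproduced in code, one node at a time. This is a sketch assuming NumPy; the values of `W`, `x` and `t` below are hypothetical placeholders, not the values given in the graph.

```python
import numpy as np

# Hypothetical inputs (NOT the values from the exercise graph)
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([[1.0], [-1.0]])
t = np.array([[0.0], [1.0]])

z = W @ x                    # multiplication node: z = W x
u = -1 * t                   # "times -1" node:      u = -t
e = z + u                    # addition node:        e = z + u = W x - t
y = float(np.sum(e ** 2))    # squared norm node:    y = ||e||^2

print(e.ravel())  # [-1. -2.]
print(y)          # 5.0
```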


3. Determine the local gradients of all operation nodes, in component-wise notation, i.e.:
\begin{align}
+: &\qquad \frac{\partial e_i}{\partial z_j} = \frac{\partial(z_i + u_i)}{\partial z_j} =  \frac{\partial z_i}{\partial z_j} + \frac{\partial u_i}{\partial z_j} = \delta_{ij} &\quad &\frac{\partial e_i}{\partial u_j} = \delta_{ij} \\
\|\cdot\|^2: &\qquad \frac{\partial y}{\partial e_i} = \frac{\partial \sqrt{\sum_k e_k^2}^2}{\partial e_i} = \frac{\partial \sum_k e_k^2}{\partial e_i} = 2e_i\\
\times: &\qquad \frac{\partial z_i}{\partial x_j} = \frac{\partial (\sum_k W_{ik}x_k)}{\partial x_j} = \sum_k \frac{\partial (W_{ik}x_k)}{\partial x_j} = \frac{\partial (W_{ij}x_j)}{\partial x_j} = W_{ij}
&\quad &\frac{\partial z_i}{\partial{W_{jk}}} = \frac{\partial (\sum_l W_{il} x_l)}{\partial W_{jk}} = \sum_l \frac{\partial (W_{il} x_l)}{\partial W_{jk}} = \delta_{ij} x_k\\
\times\text{-}1: &\qquad \frac{\partial u_i}{\partial t_j} = \frac{\partial (-1 \cdot t_i)}{\partial t_j} = -\frac{\partial t_i}{\partial t_j} = -\delta_{ij}\\
\end{align}

4. Perform a full backward-pass, denoting the results in the graph below the connecting arrow lines.

![image-2.png](attachment:image-2.png)
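
The backward pass follows from chaining the local gradients from the output back to the inputs. The sketch below assumes NumPy and uses hypothetical values for `W`, `x` and `t` (not the ones from the graph); gradients with respect to vectors are kept as row vectors, and `dy_dW` is stored in the same shape as `W`.

```python
import numpy as np

# Hypothetical inputs (NOT the values from the exercise graph)
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([[1.0], [-1.0]])
t = np.array([[0.0], [1.0]])

# Forward pass
z = W @ x
u = -1 * t
e = z + u
y = float(np.sum(e ** 2))

# Backward pass, from the output node back to the inputs
dy_de = 2 * e.T          # squared norm: dy/de_i = 2 e_i          (1 x m)
dy_dz = dy_de            # addition: de_i/dz_j = delta_ij         (1 x m)
dy_du = dy_de            # addition: de_i/du_j = delta_ij         (1 x m)
dy_dt = -dy_du           # times -1: du_i/dt_j = -delta_ij        (1 x m)
dy_dx = dy_dz @ W        # multiplication: dz_i/dx_j = W_ij       (1 x n)
dy_dW = dy_dz.T @ x.T    # multiplication: dy/dW_jk = dy/dz_j x_k (m x n)

print(dy_dx)   # [[-14. -20.]]
print(dy_dt)   # [[2. 4.]]
print(dy_dW)   # [[-2.  2.] [-4.  4.]]
```

In closed form this gives $\partial y / \partial \mathbf x = 2 (W \mathbf x - \mathbf t)^t W$, which matches the chained result.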