In [35]:
# hide
from manim import *
# use -r width,height to make it bigger or smaller. Use .scale to scale it.

In [None]:
%%manim -qm -r 640,480 -v WARNING SetColumnColorsExample

# hide

# class SquareToCircle(Scene):
#    def construct(self):
#       square = Square()
#       circle = Circle()
#       circle.set_fill(PINK, opacity=0.5)
#       self.play(Create(square))
#       self.play(Transform(square, circle))
#       self.wait()
class SetColumnColorsExample(Scene):
    def construct(self):
        m = MathTex(r"\frac{dy}{dx}")
        m0 = Matrix([[r"\pi", 1], [-1, 3]],
        ).set_column_colors([RED], GREEN).scale(3)
        self.add(m0)
        self.add(m)

Taking the derivative of a scalar with respect to a scalar is one thing, but what about a vector and a vector, or a matrix and a matrix? What are the dimensions then?

In this post I hope to clarify some of these questions by applying one key concept over and over:

>The derivative $\frac{dy}{dx}$ tells you how much $y$ changes when you increase $x$ by a little bit.

$x$ and $y$ could be anything here: scalars, vectors, matrices, or tensors.

From this concept, we can already learn something about the shape of $\frac{dy}{dx}$. It must somehow tell us, for each dimension of $y$, how much a small increase in each dimension of $x$ would change it by.

## Scalar and a scalar
First let's see a concrete example of how this works with scalars (numbers). Consider the following function:

$$ y = x^2 $$

We know the derivative of this is:

$$ \frac{dy}{dx} = 2x $$

If we increase $x$ by a small amount $\epsilon$, $y$ will increase by approximately $2x\epsilon$. Applying our concept above, $\frac{dy}{dx}$ looks at each dimension of $y$ and tells us how much a small increase in $x$ would change it by. Since $y$ only has one dimension and $x$ only has one dimension, there's only one change to look at. So $\frac{dy}{dx}$ also has one dimension.
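As a quick numerical check of this idea, here is a minimal sketch in plain Python. The helper name `numerical_derivative` and the step size `eps` are my own choices for illustration. It nudges $x$ by a little bit in each direction and measures how much $y$ moves.

```python
def numerical_derivative(f, x, eps=1e-6):
    """Approximate dy/dx with a central difference."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def y(x):
    return x ** 2


x = 3.0
approx = numerical_derivative(y, x)
exact = 2 * x  # the analytic derivative of x^2

print(approx, exact)
```

The two printed numbers should agree to several decimal places, and the result is a single number, matching the one dimension we expected.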

## Vector and a scalar
Now let's consider a vector $\mathbf{y}$ and a scalar $x$ like so:


$$
\mathbf{y} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}
$$

From now on I'll write values like $a$, $b$, $c$ instead of using actual functions of $x$, which should keep things a little cleaner. Just imagine each of $a$, $b$, $c$ somehow depends on $x$. For example, maybe $a = x^2$.

Applying our concept again, $\frac{d\mathbf{y}}{dx}$ looks at each dimension of $\mathbf{y}$ to see how much a small increase in $x$ would change it by. Since there are 3 dimensions in $\mathbf{y}$ to look at and only one in $x$, we would expect $\frac{d\mathbf{y}}{dx}$ to have 3 dimensions. Each dimension tells us how much a small increase in $x$ will change that dimension of $\mathbf{y}$ by.

If we carry out this logic and do the math:
$$
\frac{d\mathbf{y}}{dx} = \begin{bmatrix} \frac{da}{dx} \\ \frac{db}{dx} \\ \frac{dc}{dx} \end{bmatrix}
$$

We see that $\frac{d\mathbf{y}}{dx}$ indeed has 3 dimensions as expected. The first one tells us how much the first dimension of $\mathbf{y}$ changes by when we increase $x$ by a small amount. The second tells us how much the second dimension of $\mathbf{y}$ changes by when we increase $x$ by a small amount. And same for the third.
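Here is the same idea as a small Python sketch. The particular functions chosen for $a$, $b$, and $c$ are hypothetical (any functions of $x$ would do), and the helper name is my own.

```python
def y(x):
    # Hypothetical components a, b, c, each depending on the scalar x.
    return [x ** 2, 3 * x, x ** 3]


def derivative_vector_by_scalar(f, x, eps=1e-6):
    """Central difference applied to every component of f(x)."""
    plus, minus = f(x + eps), f(x - eps)
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


dy_dx = derivative_vector_by_scalar(y, 2.0)
print(len(dy_dx))  # one entry per dimension of y
print(dy_dx)       # close to [2x, 3, 3x^2] evaluated at x = 2
```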

## Scalar and a vector
Now let's flip the above! Let's make $y$ a scalar and $\mathbf{x}$ a vector:

$$
\mathbf{x} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}
$$

We want to know:

$$
\frac{dy}{d\mathbf{x}}
$$

We can apply the same concept again. $\frac{dy}{d\mathbf{x}}$ looks at each dimension of $y$ and tells us how much a small increase in $\mathbf{x}$ will change it by. But this time $\mathbf{x}$ has 3 dimensions and $y$ has only one dimension. So, $\frac{dy}{d\mathbf{x}}$ will tell us how much the one dimension of $y$ is changed by _each_ of the 3 dimensions of $\mathbf{x}$. It will have 3 dimensions like so:

$$
\frac{dy}{d\mathbf{x}}= \begin{bmatrix} \frac{dy}{da} & \frac{dy}{db} & \frac{dy}{dc} \end{bmatrix}
$$

As an aside, you may be wondering why I wrote the derivative vector horizontally, and how to know whether it should be horizontal or vertical. I could have just as easily written this:

$$
\frac{dy}{d\mathbf{x}}= \begin{bmatrix} \frac{dy}{da} \\ \frac{dy}{db} \\ \frac{dy}{dc} \end{bmatrix}
$$

There are actually two [conventions](https://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions) for which way to write it, and not everyone applies them consistently.

I've picked the one where each row corresponds to a dimension of the derivative's "numerator" $y$, and each column corresponds to a dimension of the derivative's "denominator" $\mathbf{x}$ (this is known as the numerator layout). Since the numerator is a scalar and has only one dimension, there is only one row. And since the denominator has 3 dimensions, there are 3 columns. We'll see this again in the next section.
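To make the row layout concrete, here is a rough sketch that builds the derivative as a `1 x 3` nested list. The function `y` is a made-up example and the helper name is my own.

```python
def y(x):
    # Hypothetical scalar function of a 3-dimensional x = [a, b, c].
    a, b, c = x
    return a * b + c ** 2


def derivative_scalar_by_vector(f, x, eps=1e-6):
    """Return a 1 x len(x) nested list: one row for y, one column per entry of x."""
    row = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        row.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return [row]


dy_dx = derivative_scalar_by_vector(y, [1.0, 2.0, 3.0])
print(len(dy_dx), len(dy_dx[0]))  # number of rows, number of columns
```

Wrapping `row` in an outer list is what encodes the "one row" choice. Under the other convention you would return three rows with one entry each instead.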

## Vector and a vector
Now things are getting interesting! We'll make $\mathbf{x}$ and $\mathbf{y}$ both vectors now:

$$
\mathbf{x} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}
$$

$$
\mathbf{y} = \begin{bmatrix} q \\ r \end{bmatrix}
$$

And we want to know:

$$
\frac{d\mathbf{y}}{d\mathbf{x}}
$$

We can again apply the same concept! We want to know how much each dimension of $\mathbf{y}$ is changed by each dimension of $\mathbf{x}$. 

$\mathbf{y}$ has 2 dimensions, and for each of those 2 dimensions we need to check how much each of the 3 dimensions of $\mathbf{x}$ will change it. So we expect $\frac{d\mathbf{y}}{d\mathbf{x}}$ to have $2 \times 3 = 6$ different entries.
They are usually arranged into a matrix like so (again each row will represent a dimension of the numerator, and each column will represent a dimension of the denominator):

$$
\frac{d\mathbf{y}}{d\mathbf{x}} = 
\begin{bmatrix} 
\frac{dq}{da} & \frac{dq}{db} & \frac{dq}{dc} \\
\frac{dr}{da} & \frac{dr}{db} & \frac{dr}{dc}
\end{bmatrix}
$$


We'll describe the matrix's dimensionality as `2 x 3`, so it has 2 rows and 3 columns. One row for each dimension of $\mathbf{y}$, and one column for each dimension of $\mathbf{x}$.

In other words, if we look at position `(i, j)` in this matrix (e.g. `(2, 1)`, meaning row 2, column 1, which contains $\frac{dr}{da}$), it will tell us how much dimension `i` in $\mathbf{y}$ is changed by a small increase in dimension `j` of $\mathbf{x}$.

Fun fact: this particular matrix that you get from $\frac{d\mathbf{y}}{d\mathbf{x}}$ when $\mathbf{y}$ and $\mathbf{x}$ are both vectors has a special name, the [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant).
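A minimal sketch of building this matrix numerically is below. The formulas for $q$ and $r$ are hypothetical, and `jacobian` is my own helper rather than a library function. Note that Python indexes from 0, so position `(2, 1)` in the text is `J[1][0]` in the code.

```python
def y(x):
    # Hypothetical map from x = [a, b, c] to y = [q, r].
    a, b, c = x
    return [a * b, b + c ** 2]


def jacobian(f, x, eps=1e-6):
    """Entry [i][j] approximates d y_i / d x_j (rows follow y, columns follow x)."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac


J = jacobian(y, [1.0, 2.0, 3.0])
print(len(J), len(J[0]))  # 2 rows, 3 columns
print(J[1][0])            # approximates dr/da, which is 0 for this choice of r
```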

## Shifting Perspective
Let's pause here and look at another way of understanding the dimensions of the above scenarios.
* **Scalar by a scalar**: In this case $y$ and $x$ were both scalars (one dimension). So the dimension of the derivative was $\dim(y) \times \dim(x)$ = `1 x 1`
* **Vector by a scalar**: In this case, $\mathbf{y}$ was a 3 dimensional vector and $x$ was a scalar. So the dimension of the derivative was $\dim(y) \times \dim(x)$ = `3 x 1`
* **Scalar by a vector**: In this case, $y$ was a scalar but $\mathbf{x}$ was a 3 dimensional vector. So the dimension of the derivative was $\dim(y) \times \dim(x)$ = `1 x 3`
* **Vector by a vector**: In this case, $\mathbf{y}$ was a 2 dimensional vector and $\mathbf{x}$ was a 3 dimensional vector. So the dimension of the derivative was $\dim(y) \times \dim(x)$ = `2 x 3`

We just took the dimensions and stuck all of $y$'s dimensions first, followed by all of $x$'s. Moreover, when we have a shape like `2 x 3`, entry `(i, j)` tells us how much a small change in the `j`th dimension of $x$ will change the `i`th dimension of $y$ by.
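This bookkeeping rule is simple enough to write down as a tiny function. The name `derivative_shape` is hypothetical, and I'm representing a scalar as the shape `(1,)` so the results line up with the list above.

```python
def derivative_shape(y_shape, x_shape):
    """Shape of dy/dx under the rule: all of y's dimensions, then all of x's."""
    return tuple(y_shape) + tuple(x_shape)


print(derivative_shape((1,), (1,)))  # scalar by scalar -> (1, 1)
print(derivative_shape((1,), (3,)))  # scalar by vector -> (1, 3)
print(derivative_shape((2,), (3,)))  # vector by vector -> (2, 3)
```

Because the rule is plain concatenation, it keeps working when $y$ or $x$ has more than one axis.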

## Matrix by a vector
Now we can tackle more complicated things using this perspective shift. Consider:
* A `4 x 5` matrix $\mathbf{Y}$
* A 7 dimensional vector $\mathbf{x}$.

As usual, we want to know:
$$
\frac{d\mathbf{Y}}{d\mathbf{x}}
$$

$\frac{d\mathbf{Y}}{d\mathbf{x}}$ looks at each of the `4 * 5 = 20` values in $\mathbf{Y}$ and tells us how much each of the 7 values in $\mathbf{x}$ would change it. We can organize this in the same way we have been so far. But now instead of a 2D matrix like `2 x 3`, we will have a 3D _tensor_ with dimensions `4 x 5 x 7`. We can think of this tensor as just a 3D list.
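Taking "3D list" literally, here is a sketch that builds the tensor as nested Python lists. The function `Y` is a made-up example of a `4 x 5` matrix that depends on a 7 dimensional `x`, and the helper name is my own.

```python
def Y(x):
    # Hypothetical 4 x 5 matrix whose entries depend on a 7-dimensional x.
    return [[(i + 1) * x[j] + x[j + 2] for j in range(5)] for i in range(4)]


def derivative_matrix_by_vector(f, x, eps=1e-6):
    """T[i][j][k] approximates d Y[i][j] / d x[k]."""
    base = f(x)
    rows, cols = len(base), len(base[0])
    T = [[[0.0] * len(x) for _ in range(cols)] for _ in range(rows)]
    for k in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[k] += eps
        x_minus[k] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(rows):
            for j in range(cols):
                T[i][j][k] = (f_plus[i][j] - f_minus[i][j]) / (2 * eps)
    return T


x = [float(k) for k in range(7)]
T = derivative_matrix_by_vector(Y, x)
print(len(T), len(T[0]), len(T[0][0]))  # 4 5 7
```

Entry `T[i][j][k]` answers the usual question: how much does `Y[i][j]` change when we increase `x[k]` by a little bit?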

## Matrix by a matrix
Let's finish up with a matrix-matrix derivative. Consider:
* A `4 x 5` matrix $\mathbf{Y}$
* A `7 x 3` matrix $\mathbf{X}$.

As usual, we want to know:
$$
\frac{d\mathbf{Y}}{d\mathbf{X}}
$$

$\frac{d\mathbf{Y}}{d\mathbf{X}}$ looks at each of the `4 * 5 = 20` values in $\mathbf{Y}$ and tells us how much each of the `7 * 3 = 21` values in $\mathbf{X}$ would change it. We can organize this in the same way we have been so far. Our final 4D tensor will have the dimensions `4 x 5 x 7 x 3`.
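Written out in index notation (the index names $i, j, k, l$ are my own and don't appear elsewhere in this post), each entry of this tensor is just one scalar-by-scalar derivative:

$$
\left( \frac{d\mathbf{Y}}{d\mathbf{X}} \right)_{ijkl} = \frac{dY_{ij}}{dX_{kl}}, \quad i \in \{1, \dots, 4\},\; j \in \{1, \dots, 5\},\; k \in \{1, \dots, 7\},\; l \in \{1, \dots, 3\}
$$

That gives $4 \cdot 5 \cdot 7 \cdot 3 = 420$ entries in total: 21 for each of the 20 values in $\mathbf{Y}$.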


## Bonus: Machine Learning/Forward Layer
So now we should be well equipped to work out the dimensionality of any matrix/vector derivative. If you're an ML person, you might be reading this due to its relevance to optimization techniques like gradient descent. As a bonus, let's look at the dimensionalities involved in gradients for the forward layer of a neural network.

In a forward layer (leaving out the activation and bias term for simplicity), our model is just:

$$
\mathbf{Y} = \mathbf{X}\mathbf{W}
$$

In $\mathbf{X}$, each row represents one of `N` data points, each of which is `M` dimensional, so it is an `N x M` dimensional matrix. $\mathbf{W}$ is an `M x D` matrix that maps `M` dimensional things to `D` dimensional things. $\mathbf{Y}$ holds the outputs from this layer, and since there are `N` data points, we will have `N` sets of `D` dimensional outputs and this matrix will be `N x D` dimensional.

Assume we also have a loss function: 

$$
L = scalar(\mathbf{Y})
$$

For the sake of simplicity, it doesn't matter what $scalar$ means. All we need to know is it turns $\mathbf{Y}$ into a scalar that represents how incorrect our model's predictions are, which all loss functions do in one way or another.


Now, to optimize our weight matrix $\mathbf{W}$, we need to compute:
$$
\frac{dL}{d\mathbf{W}}
$$

From our analysis, we know the dimensionality of this. $L$ is a scalar, and $\mathbf{W}$ is an `M x D` dimensional matrix. The result will tell us how the scalar $L$ changes with each dimension of $\mathbf{W}$. Since $\mathbf{W}$ is `M x D`, $\frac{dL}{d\mathbf{W}}$ will also be `M x D`.
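We can sanity check this shape numerically. In the sketch below, the sum of squared outputs stands in for $scalar$. That choice is purely an assumption so we have something concrete to differentiate, and the sizes `N, M, D = 4, 3, 2` and the helper names are also my own.

```python
def matmul(A, B):
    """Plain-Python matrix product of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def loss(X, W):
    # Stand-in for scalar(Y): the sum of squared outputs (an assumption).
    Y = matmul(X, W)
    return sum(v ** 2 for row in Y for v in row)


def dL_dW(X, W, eps=1e-6):
    """G[m][d] approximates dL / dW[m][d], so G has the same shape as W."""
    G = [[0.0] * len(W[0]) for _ in W]
    for m in range(len(W)):
        for d in range(len(W[0])):
            W_plus = [row[:] for row in W]
            W_minus = [row[:] for row in W]
            W_plus[m][d] += eps
            W_minus[m][d] -= eps
            G[m][d] = (loss(X, W_plus) - loss(X, W_minus)) / (2 * eps)
    return G


N, M, D = 4, 3, 2
X = [[float(n + m) for m in range(M)] for n in range(N)]
W = [[0.1 * (m - d) for d in range(D)] for m in range(M)]
G = dL_dW(X, W)
print(len(G), len(G[0]))  # M rows, D columns
```

This brute-force approach nudges one weight at a time, which is far too slow for a real network, but it makes the shape of the result easy to see.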

Usually this quantity is computed using the chain rule, so:
$$
\frac{dL}{d\mathbf{W}} = \frac{dL}{d\mathbf{Y}} \frac{d\mathbf{Y}}{d\mathbf{W}}
$$

Looking at each component:
* $\frac{dL}{d\mathbf{Y}}$ is `N x D` dimensional since $L$ is a scalar and $\mathbf{Y}$ is `N x D`.
* $\frac{d\mathbf{Y}}{d\mathbf{W}}$ is `(N x D) x (M x D)` dimensional, since $\mathbf{Y}$ is `N x D` and $\mathbf{W}$ is `M x D`

But hold on a second.


# hide
In this matrix, position $(i, j)$ tells you how much the $i^{th}$ dimension of $\mathbf{y}$ changes when you change the $j^{th}$ dimension of $\mathbf{x}$ by a little bit. For example, position $(2, 1)$ (row 2 column 1), the one containing $\frac{dr}{da}$, tells us how much the 2nd dimension of $\mathbf{y}$ changes when we change the 1st dimension of $\mathbf{x}$ by a little bit.



We also have a loss function $L = \sum (\mathbf{y} - \mathbf{t})^2$, where $\mathbf{t}$ is an `N x 1` dimensional vector of target values, and $\mathbf{y}$ are the predictions from our model. The $\sum$ sums over all of the entries, so $L$ will just be a scalar.

where $\mathbf{W}$ is an `M x D` dimensional weight matrix and $\mathbf{X}$ is an `N x M` matrix.


In [21]:
import re

In [11]:
pattern="---\n.*\n---"
s = """
# Tensor Products and Gradients


---
layout: post
permalink: other-url
---
This post. And another line.

"""
re.search(pattern, s, re.DOTALL).group()

'---\nlayout: post\npermalink: other-url\n---'

In [13]:
d = "2022-05-something"
re.search("[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-", d)

In [34]:
pattern=r"(\$\$.+?\$\$)"
s = r"""
Now let's begin
$$
some \\ text \\
$$
and then \\
$$
hello \\ world
$$
ok
"""
for t in re.findall(pattern, s, re.DOTALL):
    tt = t.replace("\\\\", "\\\\\\")
    s = s.replace(t, tt)
print(s)


Now let's begin
$$
some \\\ text \\\
$$
and then \\
$$
hello \\\ world
$$
ok


