# Gradients and Tensors

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

There is a certain amount of mathematics we'll need before proceeding. This lecture will introduce you to two mathematical notions that are central to machine learning: **gradients** and **tensors**.

The gradient is a notion from calculus: it generalizes the derivative to functions of more than one variable.

## Derivatives in One Dimension

When you first learn calculus, you generally start with *differential* calculus, and you always start with functions of a single variable. Thus you learn derivative shortcut rules such as:

$\frac{d}{dx}[x^2] = 2x$

$\frac{d}{dx}[\sin(x)] = \cos(x)$

$\frac{d}{dx}[5^x] = 5^x\ln(5)$

And there is Lagrange's prime notation as well:

$f'(x) = 2x$ for $f(x) = x^2$

$f'(x) = \cos(x)$ for $f(x) = \sin(x)$

$f'(x) = 5^x\ln(5)$ for $f(x) = 5^x$
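As a quick sanity check, the three rules above can be compared against a numerical estimate of the slope. The sketch below is only illustrative: the helper `central_difference`, the step size `h`, and the test point `x0` are assumptions introduced here, not part of the lecture.

```python
import math

def central_difference(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.5  # arbitrary test point (an assumption; any real value works)

# Each entry pairs a numerical estimate with the value given by the shortcut rule.
checks = {
    "x^2": (central_difference(lambda x: x**2, x0), 2 * x0),
    "sin(x)": (central_difference(math.sin, x0), math.cos(x0)),
    "5^x": (central_difference(lambda x: 5**x, x0), 5**x0 * math.log(5)),
}

for name, (estimate, exact) in checks.items():
    print(f"{name}: numerical = {estimate:.6f}, rule = {exact:.6f}")
```

The two columns should agree to several decimal places, which is all a finite-difference estimate can promise.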

## Derivatives of Functions of Multiple Variables

Keep in mind that a derivative tells us the *rate of change* of a function for any value of its independent variables. To say e.g. that $2x$ is the derivative of $x^2$ is to say that, for any value of $x$, the rate of change, or *slope*, of the function $x^2$ has a value of $2x$. At $x=0$, the slope is $2\times 0 = 0$; at $x=1$, the slope is $2\times 1 = 2$; etc.

In [None]:
X = np.linspace(-3, 3, 13)
y = X**2

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(X, y)
ax.plot(X, 2*X - 1)
ax.set_title("The rate of change of the function $f(x) = x^2$ at $x=1$ is $f'(1) = 2$.");

But most of the functions we've been working with are functions of multiple variables. The cost function minimized in a multiple linear regression, for example, depends on several parameters, one for each predictor or "feature".

If we want to describe the rate of change of such a function, we can't simply say that the function changes at a particular rate for some particular value of a single variable, say $x_1$, because the function (and its rate of change) by definition also depends on other variables! What we can do instead is describe how the function changes *with respect to $x_1$*, and the way we do that is *by assuming that we hold the other variables constant*.

## Partial Differentiation

This is the idea behind **partial differentiation**. Consider the following function of two variables, which should be reminiscent of a multiple linear regression: $f(x_1, x_2) = 3x_1 + 5x_2$. For this function $f$, we could consider how its values change *with respect to $x_1$* or *with respect to $x_2$*. And to calculate these rates of change, we simply apply our familiar rules of differentiation to the relevant variable, *while treating any other variable as a constant*.

Thus we can calculate:

$\frac{\partial f}{\partial x_1} = 3$

$\frac{\partial f}{\partial x_2} = 5$.
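The "hold the other variables constant" recipe can also be tried numerically. The following is a minimal sketch on a hypothetical function `g` (chosen here for illustration because its partial derivatives are not constant); the helper `partial_derivative` and the evaluation point are likewise assumptions.

```python
def g(x1, x2):
    # Hypothetical example function (not from the text): g = x1^2 * x2 + 3 * x2
    return x1**2 * x2 + 3 * x2

def partial_derivative(func, point, index, h=1e-6):
    """Central-difference estimate of the partial derivative of func with
    respect to the variable at position `index`; every other variable is
    held at its value in `point`."""
    forward = list(point)
    backward = list(point)
    forward[index] += h
    backward[index] -= h
    return (func(*forward) - func(*backward)) / (2 * h)

point = (2.0, 5.0)
dg_dx1 = partial_derivative(g, point, index=0)  # by the rules: 2 * x1 * x2
dg_dx2 = partial_derivative(g, point, index=1)  # by the rules: x1**2 + 3
print(dg_dx1, dg_dx2)
```

Only one coordinate is nudged in each call, which is exactly what treating the other variable as a constant means.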

## The Gradient

The **gradient** collects *all* the partial derivatives of a function into a single *vector*. For our function $f$, the first partial derivative tells us that, as we hold $x_2$ fixed, the rate of change of $f$ with respect to $x_1$ is 3. That is, $f$ has a (constant) rate of change of 3 *in the $x_1$-direction*. The second partial derivative tells us that, as we hold $x_1$ fixed, the rate of change of $f$ with respect to $x_2$ is 5. That is, $f$ has a (constant) rate of change of 5 *in the $x_2$-direction*.

And so, for a function $f$ of $n$ variables $x_1, \dots, x_n$, the gradient is defined as follows:

$$\begin{aligned}
\nabla f &= \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\hat{x}_i \\
         &= \frac{\partial f}{\partial x_1}\hat{x}_1 + \dots + \frac{\partial f}{\partial x_n}\hat{x}_n
\end{aligned}$$

where $\hat{x}_i$ is the unit vector in the $x_i$-direction.

In the multivariate case, the gradient tells us how the function is changing **in each dimension**. A large value of the derivative with respect to a particular variable means that the gradient will have a large component in the corresponding direction. Therefore, **the gradient will point in the direction of steepest increase**.
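The steepest-increase claim can be probed with a small experiment: step a fixed distance in many directions and see which one raises the function the most. This is a sketch under stated assumptions; the function `g`, its hand-computed gradient `grad_g`, the point `p0`, and the step length are all hypothetical choices made for illustration.

```python
import numpy as np

def g(p):
    # Hypothetical example function (not from the text): g(x1, x2) = x1^2 + 3*x2
    return p[0] ** 2 + 3 * p[1]

def grad_g(p):
    # Partial derivatives worked out by hand: (2*x1, 3)
    return np.array([2 * p[0], 3.0])

p0 = np.array([1.0, 2.0])
grad_direction = grad_g(p0) / np.linalg.norm(grad_g(p0))

# Take a small step of fixed length in 360 evenly spaced directions
# and record how much g changes along each one.
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
directions = np.column_stack((np.cos(angles), np.sin(angles)))
step = 1e-3
changes = np.array([g(p0 + step * d) - g(p0) for d in directions])

best_direction = directions[np.argmax(changes)]
print("direction of largest increase:", best_direction)
print("normalized gradient:          ", grad_direction)
```

Up to the one-degree resolution of the direction grid, the two printed vectors should coincide.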

## An Example

Consider the function $z = x^2 + y^2$.

In [None]:
# We're making a grid of (x, y) points.
# Start by making multiple copies of our x vector, one per row.

x_row = X.reshape(1, 13)
x_plane = np.ones(13).reshape(13, 1) @ x_row

In [None]:
# Now do the same for y values, one copy per column.
# Rows then run from y = -3 to y = 3, matching the tick labels used below.

y_col = X.reshape(13, 1)
y_plane = y_col @ np.ones(13).reshape(1, 13)

In [None]:
z = x_plane**2 + y_plane**2

Let's plot $z$ as a heatmap:

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(z, xticklabels=X, yticklabels=X, annot=True, ax=ax, cmap="Blues");

The heatmap shows the values of $z = x^2 + y^2$. Note that the function has a global minimum at $(x, y) = (0, 0)$, indicated in the heatmap by the palest shade of blue. More generally, the shape of the function in $xyz$ space is a bowl or paraboloid, where the slope of the bowl increases as you get farther away from the vertex.

![paraboloid](images/512px-Paraboloid_of_Revolution.svg.png)

Image source: [Wikipedia](https://en.wikipedia.org/wiki/Paraboloid).

In particular, the gradient of $z$ is:

$\large\nabla z = \frac{\partial z}{\partial x}\hat{x} + \frac{\partial z}{\partial y}\hat{y} = 2x\hat{x} + 2y\hat{y}$, and so it's clear that, as we move farther away from $(0, 0)$, the magnitude of the gradient, $2\sqrt{x^2 + y^2}$, increases.
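The formula for $\nabla z$ can be compared with a numerical gradient taken directly from a grid of $z$ values. The sketch below rebuilds a comparable grid with `np.meshgrid` (an assumption made to keep the block self-contained; the names `x_grid` and `y_grid` are introduced here) and uses `np.gradient` for the finite differences.

```python
import numpy as np

xs = np.linspace(-3, 3, 13)
ys = np.linspace(-3, 3, 13)
x_grid, y_grid = np.meshgrid(xs, ys)  # rows vary in y, columns vary in x
z = x_grid**2 + y_grid**2

h = xs[1] - xs[0]  # grid spacing
# np.gradient returns one array per axis: axis 0 (rows, y) first,
# then axis 1 (columns, x).
dz_dy, dz_dx = np.gradient(z, h, h, edge_order=2)

print(np.allclose(dz_dx, 2 * x_grid), np.allclose(dz_dy, 2 * y_grid))

# The gradient's magnitude is twice the distance from the origin.
magnitude = np.hypot(dz_dx, dz_dy)
distance = np.hypot(x_grid, y_grid)
print(np.allclose(magnitude, 2 * distance))
```

The `edge_order=2` argument is used so that the one-sided differences at the border of the grid are also accurate for a quadratic function.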

## Gradient Descent

The observation that the gradient points in the direction of fastest increase is critical to the machine-learning technique of **gradient descent**:

First, if the gradient is pointing in the direction of fastest increase, then the *negative* of the gradient will point in the direction of fastest *decrease*. And second, if the function whose slope we're examining is a *cost* function for some modeling algorithm, where the independent variables are the model's parameters, then a sufficiently small adjustment of the values of our parameters in the direction of the negative gradient **will result in a lower value of the cost function** and, thus, a better model! We'll see applications of this idea soon.
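To make the idea concrete, here is a minimal sketch of gradient descent on the bowl $z = x^2 + y^2$, treating it as a stand-in cost function. The learning rate, starting point, and number of iterations are assumptions picked for illustration, not recommended settings.

```python
import numpy as np

def cost(p):
    # The bowl-shaped example function z = x^2 + y^2
    return p[0] ** 2 + p[1] ** 2

def grad_cost(p):
    # Its gradient, (2x, 2y)
    return 2 * p

learning_rate = 0.1            # assumed step size
p = np.array([3.0, -2.0])      # assumed starting point
history = [cost(p)]

for _ in range(50):            # assumed number of iterations
    p = p - learning_rate * grad_cost(p)  # step against the gradient
    history.append(cost(p))

print("final point:", p)
print("final cost: ", history[-1])
```

Each update moves the point a little way downhill, so the recorded cost shrinks at every step and the point approaches the minimum at $(0, 0)$.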

## Tensors

We turn now to tensors. A **tensor** is a notion from algebra; for our purposes, it generalizes the matrix to arrays with any number of dimensions.

## Matrices

The utility and relevance of matrices are already apparent. A **matrix** is a two-dimensional array of objects (generally numbers). Matrices arise naturally in the context of data that has a row-and-column structure. `NumPy` has a *matrix* class reserved for such two-dimensional objects, although the `NumPy` documentation no longer recommends using it:

In [None]:
matrix = np.matrix([[1, 2, 3], [4, 5, 6]])
matrix

## `NumPy` Arrays

But typically we'll be making use of `NumPy`'s *array* class.

In [None]:
array = np.array([[1, 2, 3], [4, 5, 6]])
array

There are some [subtle differences](https://stackoverflow.com/questions/4151128/what-are-the-differences-between-numpy-arrays-and-matrices-which-one-should-i-u) between these two, but one key difference is that arrays can have *any* number of dimensions.

In [None]:
type(array)

The type indicates here that this object is an *n*-dimensional array (`ndarray`), where *n* is the number of dimensions and is not limited to two.
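The number of dimensions of any array can be read from its `ndim` attribute, and its extent along each dimension from `shape`. A short sketch, with made-up arrays named `vector`, `table`, and `cube` for illustration:

```python
import numpy as np

vector = np.array([1, 2, 3])                # 1 dimension
table = np.array([[1, 2, 3], [4, 5, 6]])    # 2 dimensions
cube = np.zeros((2, 3, 4))                  # 3 dimensions

for name, arr in [("vector", vector), ("table", table), ("cube", cube)]:
    print(name, "ndim =", arr.ndim, "shape =", arr.shape)
```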

## Tensor Uses

### Matrix Uses

A matrix is a convenient way of representing a certain kind of *structure*. Consider a typical data table, where the columns represent features and the rows represent observations:

<table>
    <tr>
        <th></th>
        <th> feature1 </th>
        <th> feature2 </th>
        <th> feature3 </th>
    </tr>
    <tr>
        <th> observation1 </th>
        <td> obs1feat1 </td>
        <td> obs1feat2 </td>
        <td> obs1feat3 </td>
    </tr>
    <tr>
        <th> observation2 </th>
        <td> obs2feat1 </td>
        <td> obs2feat2 </td>
        <td> obs2feat3 </td>
    </tr>
    <tr>
        <th> observation3 </th>
        <td> obs3feat1 </td>
        <td> obs3feat2 </td>
        <td> obs3feat3 </td>
    </tr>
    <tr>
        <th> observation4 </th>
        <td> obs4feat1 </td>
        <td> obs4feat2 </td>
        <td> obs4feat3 </td>
    </tr>
    <tr>
        <th> observation5 </th>
        <td> obs5feat1 </td>
        <td> obs5feat2 </td>
        <td> obs5feat3 </td>
    </tr>
</table>

This sort of structure is useful for all sorts of data. When do we want or need higher dimensions?

### Uses for Higher-Dimensional Objects

#### Images

One very common use for tensors is in representing color images. To digitize an image, we'll chop it into little bits or *pixels*: The pixel in the top left will represent the extreme top-left of the image, the pixel in the top right will represent the extreme top-right of the image, etc.

But now *how* will each pixel represent a part of the image? The chief thing to capture is the *color* of the relevant part of the image (larger patterns of the image will emerge when we take the pixels in aggregate), and we can use our three color channels (red, green, blue) to represent any color we like. But that means that we need to associate each pixel with a *triple* of numbers. We can represent that with a tensor.
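Before loading a real photograph, it may help to see the structure on a made-up image small enough to print. The array `tiny_image` below is a hypothetical 2 x 3 picture, with one color triple per pixel stored along the last axis.

```python
import numpy as np

# A made-up 2 x 3 image: 2 rows, 3 columns, 3 color channels (R, G, B).
tiny_image = np.zeros((2, 3, 3), dtype=np.uint8)
tiny_image[:, 0, :] = [0, 255, 0]    # left column: pure green
tiny_image[:, 1, :] = [255, 0, 0]    # middle column: pure red
tiny_image[:, 2, :] = [0, 0, 255]    # right column: pure blue

print(tiny_image.shape)       # (rows, columns, channels)
print(tiny_image[0, 1, :])    # the RGB triple of the pixel in row 0, column 1
print(tiny_image[:, :, 2])    # the blue channel alone, a 2 x 3 matrix
```

Indexing with a row and a column picks out one pixel's triple, while fixing the last index picks out a single color channel for the whole image.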

Take the following image:

![paint](images/paint.jpeg)

Image source: depositphotos.com

Let's read it in with `Matplotlib`:

In [None]:
paint_digitized = plt.imread("images/paint.jpeg")
paint_digitized.shape

We should see high green numbers on the left (low column indices), high red numbers in the middle (middling column indices), and high blue numbers on the right (high column indices). The second index of the array selects the column, and the last index selects the color channel. Let's see:

In [None]:
# green
paint_digitized[10, 100, :]

In [None]:
# red
paint_digitized[10, 600, :]

In [None]:
# blue
paint_digitized[10, 800, :]

`pyplot` can also render an array of pixel values as an image, so we should be able to reproduce the image with the `imshow()` function:

In [None]:
plt.imshow(paint_digitized);

#### Gradient Fields

Another use for a tensor would be to represent the gradient we constructed above for the function $z = x^2 + y^2$! We could take our grid of $(x, y)$ points and associate each gridpoint with *both* the value of $\frac{\partial z}{\partial x}$ and the value of $\frac{\partial z}{\partial y}$ at that point.

In [None]:
grad_tensor = np.stack((2*x_plane, 2*y_plane))
print(grad_tensor.shape)
grad_tensor

Now each point in our grid is associated with a *pair* of values $(\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y})$.
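Because `np.stack` puts the new axis first by default, the pair for a given gridpoint is read by slicing across that leading axis. The sketch below rebuilds a comparable grid with `np.meshgrid` so that it runs on its own; the names `x_grid`, `y_grid`, and `grad`, and the chosen gridpoint, are assumptions for illustration.

```python
import numpy as np

xs = np.linspace(-3, 3, 13)
x_grid, y_grid = np.meshgrid(xs, xs)   # x varies along columns, y along rows

# Stack the two partial derivatives along a new leading axis: shape (2, 13, 13).
grad = np.stack((2 * x_grid, 2 * y_grid))

i, j = 2, 10                           # an arbitrary gridpoint (row, column)
print("(x, y) =", (x_grid[i, j], y_grid[i, j]))
print("gradient there =", grad[:, i, j])
```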

## `np.reshape()`

Sometimes bits of our data pipeline will expect data to have a certain shape, and it may therefore be necessary to transform data arrays to fit those predetermined structures. `NumPy`'s `reshape()` function will be invaluable for that purpose.

In [None]:
grad_tensor.shape

In [None]:
# grad_tensor as a row vector
grad_tensor.reshape(1, 338)

In [None]:
# grad_tensor as a column vector
grad_tensor.reshape(338, 1)

In [None]:
# grad_tensor as a five-dimensional object
grad_tensor.reshape(13, 13, 1, 1, 2)
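One caution when reshaping: `reshape()` keeps the elements in their original order and only changes how they are grouped, so it will not by itself move the "component" axis to the end. The sketch below uses a small stand-in array `t` (an assumption, shaped like a miniature gradient tensor) to contrast `reshape()` with `np.moveaxis()`, and also shows `-1` as a placeholder for a dimension that `NumPy` should infer.

```python
import numpy as np

t = np.arange(18).reshape(2, 3, 3)    # small stand-in for a (2, n, n) tensor

flat = t.reshape(-1)                  # -1 lets NumPy infer the length (18)
restored = flat.reshape(2, 3, 3)      # reshaping back recovers the original

# reshape regroups the same sequence of elements; moveaxis actually
# relocates the leading axis to the last position.
reshaped = t.reshape(3, 3, 2)
moved = np.moveaxis(t, 0, -1)

print(t[:, 0, 1])          # the two components stored for gridpoint (0, 1)
print(moved[0, 1, :])      # the same pair, now along the last axis
print(reshaped[0, 1, :])   # a different pair of entries
```

If a downstream step expects the two gradient components along the last axis, moving the axis is therefore the safer choice than reshaping.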