:label:sec_geometry-linear-algebraic-ops
In :numref:sec_linear-algebra, we encountered the basics of linear algebra
and saw how it could be used to express common operations for transforming our data.
Linear algebra is one of the key mathematical pillars
underlying much of the work that we do in deep learning
and in machine learning more broadly.
While :numref:sec_linear-algebra
contained enough machinery
to communicate the mechanics of modern deep learning models,
there is a lot more to the subject.
In this section, we will go deeper,
highlighting some geometric interpretations of linear algebra operations,
and introducing a few fundamental concepts, including eigenvalues and eigenvectors.
First, we need to discuss the two common geometric interpretations of vectors, as either points or directions in space. Fundamentally, a vector is a list of numbers such as the Python list below.
#@tab all
v = [1, 7, 0, 1]
Mathematicians most often write this as either a column or row vector, which is to say either as

$$
\mathbf{x} = \begin{bmatrix}1\\7\\0\\1\end{bmatrix},
$$

or

$$
\mathbf{x}^\top = \begin{bmatrix}1 & 7 & 0 & 1\end{bmatrix}.
$$
These often have different interpretations,
where data examples are column vectors
and weights used to form weighted sums are row vectors.
However, it can be beneficial to be flexible.
As we have described in :numref:sec_linear-algebra,
though a single vector's default orientation is a column vector,
for any matrix representing a tabular dataset,
treating each data example as a row vector
in the matrix
is more conventional.
Given a vector, the first interpretation
that we should give it is as a point in space.
In two or three dimensions, we can visualize these points
by using the components of the vectors to define
the location of the points in space compared
to a fixed reference called the origin. This can be seen in :numref:fig_grid.
This geometric point of view allows us to consider the problem on a more abstract level. No longer facing a seemingly insurmountable problem like classifying pictures as either cats or dogs, we can start to consider tasks abstractly as collections of points in space, picturing the task as discovering how to separate two distinct clusters of points.
In parallel, there is a second point of view
that people often take of vectors: as directions in space.
Not only can we think of the vector $\mathbf{v} = [3,2]^\top$ as the location $3$ units to the right and $2$ units up from the origin, we can also think of it as the direction itself: take $3$ steps to the right and $2$ steps up. In this way, we consider all the vectors in :numref:fig_arrow the same.
One of the benefits of this shift is that
we can make visual sense of the act of vector addition.
In particular, we follow the directions given by one vector,
and then follow the directions given by the other, as is seen in :numref:fig_add-vec.
Vector subtraction has a similar interpretation. By considering the identity that $\mathbf{u} = \mathbf{v} + (\mathbf{u}-\mathbf{v})$, we see that the vector $\mathbf{u}-\mathbf{v}$ is the direction that takes us from the point $\mathbf{v}$ to the point $\mathbf{u}$.
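As a quick numerical check of these two pictures (plain NumPy shown for illustration; the vectors here are arbitrary), adding component-wise matches "follow one direction, then the other," and $\mathbf{u}-\mathbf{v}$ is exactly the step that takes us from the point $\mathbf{v}$ to the point $\mathbf{u}$.

import numpy as np

u, v = np.array([1, 2]), np.array([3, -1])
# Following u and then v lands at the component-wise sum
sum_uv = u + v
# Starting at the point v and stepping by u - v recovers the point u
back_to_u = v + (u - v)
sum_uv, back_to_u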
As we saw in :numref:sec_linear-algebra, if we take two column vectors $\mathbf{u}$ and $\mathbf{v}$, we can form their dot product by computing

$$\mathbf{u}^\top\mathbf{v} = \sum_i u_i\cdot v_i.$$
:eqlabel:eq_dot_def

Because :eqref:eq_dot_def is symmetric, we will mirror the notation of classical multiplication and write

$$\mathbf{u}\cdot\mathbf{v} = \mathbf{u}^\top\mathbf{v} = \mathbf{v}^\top\mathbf{u},$$

to highlight the fact that exchanging the order of the vectors will yield the same answer.
The dot product :eqref:eq_dot_def also admits a geometric interpretation: it is closely related to the angle between two vectors. Consider the angle shown in :numref:fig_angle.
To start, let us consider two specific vectors:

$$\mathbf{v} = (r, 0) \; \text{and} \; \mathbf{w} = (s\cos(\theta), s\sin(\theta)).$$

The vector $\mathbf{v}$ is of length $r$ and runs parallel to the $x$-axis, while the vector $\mathbf{w}$ is of length $s$ and makes angle $\theta$ with the $x$-axis. If we compute the dot product of these two vectors, we see that

$$\mathbf{v}\cdot\mathbf{w} = rs\cos(\theta) = \|\mathbf{v}\|\|\mathbf{w}\|\cos(\theta).$$

With some simple algebraic manipulation, we can rearrange terms to obtain

$$\theta = \arccos\left(\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\right).$$

In short, for these two specific vectors, the dot product combined with the norms tells us the angle between them. This same fact is true in general. We will not derive the expression here, but if we consider writing $\mathbf{v}\cdot\mathbf{w}$ in two ways, once with the dot product and once geometrically using lengths and cosines, we obtain the full relationship. Indeed, for any two vectors $\mathbf{v}$ and $\mathbf{w}$, the angle between them is

$$\theta = \arccos\left(\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\right).$$
:eqlabel:eq_angle_forumla
This is a nice result since nothing in the computation references two dimensions. Indeed, we can use this in three or three million dimensions without issue.
As a simple example, let us see how to compute the angle between a pair of vectors:
%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mxnet import gluon, np, npx
npx.set_np()
def angle(v, w):
return np.arccos(v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w)))
angle(np.array([0, 1, 2]), np.array([2, 3, 4]))
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
from IPython import display
import torch
from torchvision import transforms
import torchvision
def angle(v, w):
return torch.acos(v.dot(w) / (torch.norm(v) * torch.norm(w)))
angle(torch.tensor([0, 1, 2], dtype=torch.float32), torch.tensor([2.0, 3, 4]))
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
from IPython import display
import tensorflow as tf
def angle(v, w):
return tf.acos(tf.tensordot(v, w, axes=1) / (tf.norm(v) * tf.norm(w)))
angle(tf.constant([0, 1, 2], dtype=tf.float32), tf.constant([2.0, 3, 4]))
We will not use it right now, but it is useful to know that we will refer to vectors for which the angle is $\pi/2$ (or equivalently $90^{\circ}$) as being orthogonal. By examining the equation above, we see that this happens when $\theta = \pi/2$, which is the same thing as $\cos(\theta) = 0$. The only way this can happen is if the dot product itself is zero, so two vectors are orthogonal if and only if $\mathbf{v}\cdot\mathbf{w} = 0$. This will prove to be a helpful formula when understanding objects geometrically.
It is reasonable to ask: why is computing the angle useful?
The answer comes in the kind of invariance we expect data to have.
Consider an image, and a duplicate image, where every pixel value is the same but the image is at, say, $10\%$ the brightness. The values of the individual pixels are then far from the original values, so the raw distance between the original image and the darker one can be large, even though the content is unchanged. However, because scaling a vector only changes its length and not its direction, the angle between the two images is zero, and angle-based comparisons treat them as essentially the same.
Examples like this are everywhere. In text, we might want the topic being discussed to not change if we write a document twice as long that says the same thing. For some encodings (such as counting the number of occurrences of words in some vocabulary), this corresponds to a doubling of the vector encoding the document, so again we can use the angle.
In ML contexts where the angle is employed to measure the closeness of two vectors, practitioners adopt the term cosine similarity to refer to the portion $$ \cos(\theta) = \frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}. $$
The cosine takes a maximum value of $1$ when the two vectors point in the same direction, a minimum value of $-1$ when they point in opposite directions, and a value of $0$ when the two vectors are orthogonal.
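As a small sketch (NumPy shown for illustration; the vectors are arbitrary), cosine similarity is just the quantity above, and it is unchanged when either vector is rescaled, which is exactly the invariance discussed for the darkened image and the doubled document.

import numpy as np

def cosine_similarity(v, w):
    # cos(theta) = v . w / (|v| |w|)
    return v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 3.0])
# Rescaling (e.g., dimming an image to 10% brightness) leaves the cosine at 1
cosine_similarity(v, 0.1 * v), cosine_similarity(v, np.array([-2.0, 0.0, 1.0]))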
In addition to working with vectors, another key object
that you must understand to go far in linear algebra
is the hyperplane, a generalization to higher dimensions
of a line (two dimensions) or of a plane (three dimensions).
In an $n$-dimensional vector space, a hyperplane has $n-1$ dimensions and divides the space into two half-spaces.
Let us start with an example. Suppose that we have a column vector $\mathbf{w}=[2,1]^\top$. We want to know, "what are the points $\mathbf{v}$ with $\mathbf{w}\cdot\mathbf{v} = 1$?" By recalling the connection between dot products and angles in :eqref:eq_angle_forumla, we can see that this is equivalent to
$$
\|\mathbf{v}\|\|\mathbf{w}\|\cos(\theta) = 1 \; \iff \; \|\mathbf{v}\|\cos(\theta) = \frac{1}{\|\mathbf{w}\|} = \frac{1}{\sqrt{5}}.
$$
If we consider the geometric meaning of this expression, we see that this is equivalent to saying that the length of the projection of $\mathbf{v}$ onto the direction of $\mathbf{w}$ is exactly $1/\|\mathbf{w}\|$, as is shown in :numref:fig_vector-project. The set of all points where this is true is a line at right angles to the vector $\mathbf{w}$. If we wanted, we could find the equation for this line and see that it is $2x + y = 1$, or equivalently $y = 1 - 2x$.
If we now look at what happens when we ask about the set of points with $\mathbf{w}\cdot\mathbf{v} > 1$ or $\mathbf{w}\cdot\mathbf{v} < 1$, we can see that these are cases where the projections are longer or shorter than $1/\|\mathbf{w}\|$, respectively. Thus, those two inequalities define either side of the line. In this way, we have found a way to cut our space into two halves, where all the points on one side have dot product below a threshold and the other side above, as we see in :numref:fig_space-division.
The story in higher dimensions is much the same. If we now take $\mathbf{w} = [1,2,3]^\top$ and ask about the points in three dimensions with $\mathbf{w}\cdot\mathbf{v} = 1$, we obtain a plane at right angles to the given vector $\mathbf{w}$. The two inequalities again define the two sides of the plane, as is shown in :numref:fig_higher-division.
While our ability to visualize runs out at this point,
nothing stops us from doing this in tens, hundreds, or billions of dimensions.
This occurs often when thinking about machine learned models.
For instance, we can understand linear classification models
like those from :numref:sec_softmax, as methods to find hyperplanes that separate the different target classes.
In this context, such hyperplanes are often referred to as decision planes.
The majority of deep learned classification models end
with a linear layer fed into a softmax,
so one can interpret the role of the deep neural network
to be to find a non-linear embedding such that the target classes
can be separated cleanly by hyperplanes.
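Schematically, such a model might look like the minimal PyTorch sketch below (a hypothetical, untrained architecture, not the one built next): a learned non-linear embedding followed by a single linear layer whose outputs, fed into a softmax, correspond to the decision planes.

import torch
from torch import nn

# A hypothetical classifier: non-linear embedding followed by one linear layer
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 64), nn.ReLU(),  # non-linear embedding
    nn.Linear(64, 10))                  # one score per class; rows act as hyperplanes
logits = net(torch.randn(1, 1, 28, 28))
torch.softmax(logits, dim=1).shape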
To give a hand-built example, notice that we can produce a reasonable model
to classify tiny images of t-shirts and trousers from the Fashion MNIST dataset
(seen in :numref:sec_fashion_mnist) by just taking the vector between their means to define the decision plane and eyeballing a crude threshold. First we will load the data and compute the averages.
# Load in the dataset
train = gluon.data.vision.FashionMNIST(train=True)
test = gluon.data.vision.FashionMNIST(train=False)
X_train_0 = np.stack([x[0] for x in train if x[1] == 0]).astype(float)
X_train_1 = np.stack([x[0] for x in train if x[1] == 1]).astype(float)
X_test = np.stack(
[x[0] for x in test if x[1] == 0 or x[1] == 1]).astype(float)
y_test = np.stack(
[x[1] for x in test if x[1] == 0 or x[1] == 1]).astype(float)
# Compute averages
ave_0 = np.mean(X_train_0, axis=0)
ave_1 = np.mean(X_train_1, axis=0)
#@tab pytorch
# Load in the dataset
trans = []
trans.append(transforms.ToTensor())
trans = transforms.Compose(trans)
train = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
train=True, download=True)
test = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
train=False, download=True)
X_train_0 = torch.stack(
[x[0] * 256 for x in train if x[1] == 0]).type(torch.float32)
X_train_1 = torch.stack(
[x[0] * 256 for x in train if x[1] == 1]).type(torch.float32)
X_test = torch.stack(
[x[0] * 256 for x in test if x[1] == 0 or x[1] == 1]).type(torch.float32)
y_test = torch.stack([torch.tensor(x[1]) for x in test
if x[1] == 0 or x[1] == 1]).type(torch.float32)
# Compute averages
ave_0 = torch.mean(X_train_0, axis=0)
ave_1 = torch.mean(X_train_1, axis=0)
#@tab tensorflow
# Load in the dataset
((train_images, train_labels), (
test_images, test_labels)) = tf.keras.datasets.fashion_mnist.load_data()
# The raw Keras images are already in the 0-255 range
X_train_0 = tf.cast(tf.stack(train_images[[i for i, label in enumerate(
    train_labels) if label == 0]]), dtype=tf.float32)
X_train_1 = tf.cast(tf.stack(train_images[[i for i, label in enumerate(
    train_labels) if label == 1]]), dtype=tf.float32)
X_test = tf.cast(tf.stack(test_images[[i for i, label in enumerate(
    test_labels) if label == 0 or label == 1]]), dtype=tf.float32)
y_test = tf.cast(tf.stack(test_labels[[i for i, label in enumerate(
    test_labels) if label == 0 or label == 1]]), dtype=tf.float32)
# Compute averages
ave_0 = tf.reduce_mean(X_train_0, axis=0)
ave_1 = tf.reduce_mean(X_train_1, axis=0)
It can be informative to examine these averages in detail, so let us plot what they look like. In this case, we see that the average indeed resembles a blurry image of a t-shirt.
#@tab mxnet, pytorch
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
#@tab tensorflow
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(tf.reshape(ave_0, (28, 28)), cmap='Greys')
d2l.plt.show()
In the second case, we again see that the average resembles a blurry image of trousers.
#@tab mxnet, pytorch
# Plot average trousers
d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
#@tab tensorflow
# Plot average trousers
d2l.plt.imshow(tf.reshape(ave_1, (28, 28)), cmap='Greys')
d2l.plt.show()
In a fully machine learned solution, we would learn the threshold from the dataset. In this case, we simply eyeball a threshold that looks good on the training data.
# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
predictions = X_test.reshape(2000, -1).dot(w.flatten()) > -1500000
# Accuracy
np.mean(predictions.astype(y_test.dtype) == y_test, dtype=np.float64)
#@tab pytorch
# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
# '@' is Matrix Multiplication operator in pytorch.
predictions = X_test.reshape(2000, -1) @ (w.flatten()) > -1500000
# Accuracy
torch.mean((predictions.type(y_test.dtype) == y_test).float(), dtype=torch.float64)
#@tab tensorflow
# Print test set accuracy with eyeballed threshold
w = tf.transpose(ave_1 - ave_0)
predictions = tf.reduce_sum(tf.reshape(X_test, (X_test.shape[0], -1)) *
                            tf.reshape(w, [-1]), axis=1) > -1500000
# Accuracy
tf.reduce_mean(
tf.cast(tf.cast(predictions, y_test.dtype) == y_test, tf.float32))
Through :numref:sec_linear-algebra
and the above discussions,
we have a solid understanding of the geometry of vectors, lengths, and angles.
However, there is one important object we have omitted discussing,
and that is a geometric understanding of linear transformations represented by matrices. Fully internalizing what matrices can do to transform data
between two potentially different high dimensional spaces takes significant practice,
and is beyond the scope of this appendix.
However, we can start building up intuition in two dimensions.
Suppose that we have some matrix

$$
\mathbf{A} = \begin{bmatrix}
a & b \\ c & d
\end{bmatrix}.
$$

If we want to apply this to an arbitrary vector $\mathbf{v} = [x, y]^\top$, we multiply and see that

$$
\begin{aligned}
\mathbf{A}\mathbf{v} & = \begin{bmatrix}a & b \\ c & d\end{bmatrix}\begin{bmatrix}x \\ y\end{bmatrix} \\
& = \begin{bmatrix}ax+by \\ cx+dy\end{bmatrix} \\
& = x\begin{bmatrix}a \\ c\end{bmatrix} + y\begin{bmatrix}b \\ d\end{bmatrix} \\
& = x\left\{\mathbf{A}\begin{bmatrix}1\\0\end{bmatrix}\right\} + y\left\{\mathbf{A}\begin{bmatrix}0\\1\end{bmatrix}\right\}.
\end{aligned}
$$

This may seem like an odd computation, where something clear became somewhat impenetrable. However, it tells us that we can write the way that a matrix transforms any vector in terms of how it transforms two specific vectors: $[1,0]^\top$ and $[0,1]^\top$.
Let us draw what happens when we use the specific matrix

$$
\mathbf{A} = \begin{bmatrix}
1 & 2 \\ -1 & 3
\end{bmatrix}.
$$

If we look at the specific vector $\mathbf{v} = [2, -1]^\top$, we see that this is $2\cdot[1,0]^\top + -1\cdot[0,1]^\top$, and thus we know that the matrix $\mathbf{A}$ will send it to $2(\mathbf{A}[1,0]^\top) + -1(\mathbf{A}[0,1]^\top) = 2[1, -1]^\top - [2,3]^\top = [0, -5]^\top$. If we follow this logic through carefully, say by considering the grid of all integer pairs of points, we see that what matrix multiplication does is skew, rotate, and scale the grid, as you can see in :numref:fig_grid-transform.
This is the most important intuitive point to internalize about linear transformations represented by matrices. Matrices are incapable of distorting some parts of space differently than others. All they can do is take the original coordinates on our space and skew, rotate, and scale them.
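We can verify this decomposition numerically. The sketch below (plain NumPy) applies the matrix $\mathbf{A}$ above to the two basis vectors and checks that $\mathbf{A}\mathbf{v}$ for $\mathbf{v} = [2, -1]^\top$ equals the corresponding combination of the transformed basis vectors.

import numpy as np

A = np.array([[1, 2], [-1, 3]])
e1, e2 = np.array([1, 0]), np.array([0, 1])
v = np.array([2, -1])
# A acts on any vector through its action on the two basis vectors
A.dot(v), 2 * A.dot(e1) + (-1) * A.dot(e2)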
Some distortions can be severe. For instance, the matrix

$$
\mathbf{B} = \begin{bmatrix}
2 & -1 \\ 4 & -2
\end{bmatrix}
$$

compresses the entire two-dimensional plane down to a single line.
Identifying and working with such transformations are the topic of a later section,
but geometrically we can see that this is fundamentally different
from the types of transformations we saw above.
For instance, the result of applying matrix $\mathbf{A}$ can be "bent back" to recover the original grid, whereas the result of applying $\mathbf{B}$ cannot, since we can no longer tell which original point a transformed point came from.

While this picture was for a $2\times2$ matrix, nothing prevents us from taking the lessons learned into higher dimensions.

Consider again the matrix

$$
\mathbf{B} = \begin{bmatrix}
2 & -1 \\ 4 & -2
\end{bmatrix}.
$$

This compresses the entire plane down to live on the single line $y = 2x$. If we write the two columns as $\mathbf{b}_1 = [2,4]^\top$ and $\mathbf{b}_2 = [-1,-2]^\top$, then every vector the matrix produces is a linear combination $a_1\mathbf{b}_1 + a_2\mathbf{b}_2$ of these columns, and since $\mathbf{b}_1 = -2\cdot\mathbf{b}_2$, any such combination can be written as $(a_2 - 2a_1)\mathbf{b}_2$.
This means that one of the columns is, in a sense, redundant
because it does not define a unique direction in space.
This should not surprise us too much
since we already saw that this matrix
collapses the entire plane down into a single line.
Moreover, we see that the linear dependence $\mathbf{b}_1 = -2\cdot\mathbf{b}_2$ captures it. To make this more symmetrical between the two vectors, we will write it as $\mathbf{b}_1 + 2\cdot\mathbf{b}_2 = 0$.

In general, we will say that a collection of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly dependent if there exist coefficients $a_1, \ldots, a_k$, not all equal to zero, so that

$$
\sum_{i=1}^k a_i\mathbf{v}_i = 0.
$$
In this case, we can solve for one of the vectors in terms of some combination of the others, and effectively render it redundant. Thus, a linear dependence in the columns of a matrix is a witness to the fact that our matrix is compressing the space down to some lower dimension. If there is no linear dependence we say the vectors are linearly independent. If the columns of a matrix are linearly independent, no compression occurs and the operation can be undone.
If we have a general $n\times m$ matrix, it is reasonable to ask what dimension space the matrix maps into. A concept known as the rank will be our answer. In particular, the rank of a matrix is the largest number of linearly independent columns among all subsets of its columns. For example, the matrix $\mathbf{B}$ above has rank one, since its two columns are linearly dependent, but either column by itself is not. To check the rank of a larger matrix by hand, we would need to examine ever larger subsets of columns for linear dependence and show that the biggest linearly independent subset we can find determines the rank.
This procedure, as described, is very inefficient. It requires looking at every subset of the columns of our given matrix, and thus is potentially exponential in the number of columns. Later we will see a more computationally efficient way to compute the rank of a matrix, but for now, this is sufficient to see that the concept is well defined and understand the meaning.
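In practice, we would lean on a library routine rather than checking subsets of columns by hand. As a quick sketch (NumPy shown; the other frameworks offer similar routines), np.linalg.matrix_rank reports the rank directly.

import numpy as np

B = np.array([[2, -1], [4, -2]])  # linearly dependent columns: rank 1
C = np.array([[1, 2], [-1, 3]])   # linearly independent columns: rank 2
np.linalg.matrix_rank(B), np.linalg.matrix_rank(C)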
We have seen above that multiplication by a matrix with linearly dependent columns
cannot be undone, i.e., there is no inverse operation that can always recover the input. However, multiplication by a full-rank matrix
(i.e., an $n \times n$ matrix $\mathbf{A}$ with rank $n$) can always be undone. Consider the matrix

$$
\mathbf{I} = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix},
$$
which is the matrix with ones along the diagonal, and zeros elsewhere.
We call this the identity matrix.
It is the matrix which leaves our data unchanged when applied.
To find a matrix which undoes what our matrix $\mathbf{A}$ has done, we want to find a matrix $\mathbf{A}^{-1}$ such that

$$
\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}.
$$

If we look at this as a system, we have $n \times n$ unknowns (the entries of $\mathbf{A}^{-1}$) and $n \times n$ equations (one for every entry of the product $\mathbf{A}^{-1}\mathbf{A}$ that must match $\mathbf{I}$), so we should generically expect a solution to exist. We call such a matrix $\mathbf{A}^{-1}$ the inverse of $\mathbf{A}$. For example, if $\mathbf{A}$ is the general $2 \times 2$ matrix

$$
\mathbf{A} = \begin{bmatrix}
a & b \\ c & d
\end{bmatrix},
$$

then, as long as $ad - bc \neq 0$, we can see that the inverse is

$$
\frac{1}{ad-bc}\begin{bmatrix}
d & -b \\ -c & a
\end{bmatrix}.
$$
We can test this by checking that multiplying by the inverse given by the formula above works in practice.
M = np.array([[1, 2], [1, 4]])
M_inv = np.array([[2, -1], [-0.5, 0.5]])
M_inv.dot(M)
#@tab pytorch
M = torch.tensor([[1, 2], [1, 4]], dtype=torch.float32)
M_inv = torch.tensor([[2, -1], [-0.5, 0.5]])
M_inv @ M
#@tab tensorflow
M = tf.constant([[1, 2], [1, 4]], dtype=tf.float32)
M_inv = tf.constant([[2, -1], [-0.5, 0.5]])
tf.matmul(M_inv, M)
While the inverse of a matrix is useful in theory, we must say that most of the time we do not wish to use the matrix inverse to solve a problem in practice. In general, there are far more numerically stable algorithms for solving linear equations like

$$
\mathbf{A}\mathbf{x} = \mathbf{b},
$$

than computing the inverse and multiplying to get

$$
\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}.
$$
Just as division by a small number can lead to numerical instability, so can inversion of a matrix which is close to having low rank.
Moreover, it is common that the matrix $\mathbf{A}$ is sparse, which is to say that it contains only a small number of non-zero values. If we were to explore examples, we would see that this does not mean the inverse is sparse: the inverse of a large sparse matrix is typically dense, making it far more expensive to store and apply.
While we do not have time to dive all the way into the thorny numerical issues frequently encountered when working with linear algebra, we want to provide you with some intuition about when to proceed with caution, and generally avoiding inversion in practice is a good rule of thumb.
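As a small illustration of this advice (NumPy shown; torch.linalg.solve and tf.linalg.solve play the same role), we can solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ directly rather than forming $\mathbf{A}^{-1}$ first, and check that the two answers agree on a well-conditioned example.

import numpy as np

A = np.array([[1.0, 2.0], [1.0, 4.0]])
b = np.array([3.0, 5.0])
# Prefer a linear solver over explicitly forming the inverse
x_solve = np.linalg.solve(A, b)
x_inv = np.linalg.inv(A).dot(b)
x_solve, np.allclose(x_solve, x_inv)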
The geometric view of linear algebra gives an intuitive way
to interpret a fundamental quantity known as the determinant.
Consider the grid image from before, but now with a highlighted region (:numref:fig_grid-filled).
Look at the highlighted square. This is a square with edges given by $(0, 1)$ and $(1, 0)$, and thus it has area one. After $\mathbf{A}$ transforms this square, we see that it becomes a parallelogram. There is no reason this parallelogram should have the same area that we started with, and indeed in the specific case shown here of

$$
\mathbf{A} = \begin{bmatrix}
1 & 2 \\ -1 & 3
\end{bmatrix},
$$

it is an exercise in coordinate geometry to compute the area of this parallelogram and obtain that the area is $5$.

In general, if we have a matrix

$$
\mathbf{A} = \begin{bmatrix}
a & b \\ c & d
\end{bmatrix},
$$

we can see with some computation that the area of the resulting parallelogram is $ad - bc$. This area is referred to as the determinant.
Let us check this quickly with some example code.
import numpy as np
np.linalg.det(np.array([[1, -1], [2, 3]]))
#@tab pytorch
torch.det(torch.tensor([[1, -1], [2, 3]], dtype=torch.float32))
#@tab tensorflow
tf.linalg.det(tf.constant([[1, -1], [2, 3]], dtype=tf.float32))
The eagle-eyed amongst us will notice that this expression can be zero or even negative. For the negative case, this is a matter of convention taken generally in mathematics: if the matrix flips the orientation of the figure, we say the area is negated. Let us see now that, when the determinant is zero, we learn more.
Let us consider the matrix $\mathbf{B}$ from before, with columns $[2, 4]^\top$ and $[-1, -2]^\top$. If we compute the determinant of this matrix, we get $2\cdot(-2) - (-1)\cdot4 = 0$. Given our understanding above, this makes sense: $\mathbf{B}$ compresses the unit square down to a line segment, which has zero area. Indeed, being compressed into a lower-dimensional space is the only way to have zero area after the transformation. Thus we see that a matrix is invertible if and only if its determinant is not equal to zero.
As a final comment, imagine that we have any figure drawn on the plane. Thinking like computer scientists, we can decompose that figure into a collection of little squares so that the area of the figure is in essence just the number of squares in the decomposition. If we now transform that figure by a matrix, we send each of these squares to parallelograms, each one of which has area given by the determinant. We see that for any figure, the determinant gives the (signed) factor by which a matrix scales its area.
Computing determinants for larger matrices can be laborious,
but the intuition is the same.
The determinant remains the factor by which $n\times n$ matrices scale $n$-dimensional volumes.
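As a quick numerical sketch (NumPy, with matrices chosen arbitrarily), the determinant of a $3\times 3$ matrix is the factor by which it scales volumes, and composing transformations multiplies these factors.

import numpy as np

A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 3.0]])
B = np.array([[2.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]])
# Volume scaling factors compose: det(AB) = det(A) * det(B)
np.linalg.det(A.dot(B)), np.linalg.det(A) * np.linalg.det(B)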
In :numref:sec_linear-algebra, the concept of tensors was introduced. In this section, we will dive more deeply into tensor contractions (the tensor equivalent of matrix multiplication), and see how they can provide a unified view on a number of matrix and vector operations.
With matrices and vectors we knew how to multiply them to transform data. We need a similar definition for tensors if they are to be useful to us. Think about matrix multiplication:

$$
\mathbf{C} = \mathbf{A}\mathbf{B},
$$

or equivalently

$$
c_{ij} = \sum_{k} a_{ik}b_{kj}.
$$

This pattern is one we can repeat for tensors. For tensors, there is no one case of what to sum over that can be universally chosen, so we need to specify exactly which indices we want to sum over. For instance, we could consider

$$
y_{il} = \sum_{jk} x_{ijkl}a_{jk}.
$$
Such a transformation is called a tensor contraction. It can represent a far more flexible family of transformations than matrix multiplication alone.
As an often-used notational simplification, we can notice that the sum is over exactly those indices that occur more than once in the expression; thus, people often work with Einstein notation, where the summation is implicitly taken over all repeated indices. This gives the compact expression

$$
y_{il} = x_{ijkl}a_{jk}.
$$
Let us see how many of the linear algebraic definitions we have seen before can be expressed in this compressed tensor notation:
- $\mathbf{v} \cdot \mathbf{w} = \sum_i v_iw_i$
- $\|\mathbf{v}\|_2^{2} = \sum_i v_iv_i$
- $(\mathbf{A}\mathbf{v})_i = \sum_j a_{ij}v_j$
- $(\mathbf{A}\mathbf{B})_{ik} = \sum_j a_{ij}b_{jk}$
- $\mathrm{tr}(\mathbf{A}) = \sum_i a_{ii}$
In this way, we can replace a myriad of specialized notations with short tensor expressions.
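We can check a few of these identities numerically. Below is a quick sketch using NumPy's einsum (PyTorch and TensorFlow provide the same function, as used later in this section).

import numpy as np

v, w = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])
(np.einsum('i, i ->', v, w),                        # dot product v . w
 np.einsum('ij, j -> i', A, np.array([1.0, 2.0])),  # matrix-vector product
 np.einsum('ii ->', A))                             # trace of A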
Tensors may flexibly be operated on in code as well.
As seen in :numref:sec_linear-algebra, we can create tensors as is shown below.
# Define tensors
B = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = np.array([[1, 2], [3, 4]])
v = np.array([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
#@tab pytorch
# Define tensors
B = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = torch.tensor([[1, 2], [3, 4]])
v = torch.tensor([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
#@tab tensorflow
# Define tensors
B = tf.constant([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = tf.constant([[1, 2], [3, 4]])
v = tf.constant([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
Einstein summation has been implemented directly in each framework via einsum. The indices that occur in the Einstein summation can be passed as a string, followed by the tensors that are being acted upon. For instance, to implement matrix multiplication, we can consider the Einstein summation seen above ($(\mathbf{A}\mathbf{v})_i = \sum_j a_{ij}v_j$) and strip out the indices themselves to get the implementation:
# Reimplement matrix multiplication
np.einsum("ij, j -> i", A, v), A.dot(v)
#@tab pytorch
# Reimplement matrix multiplication
torch.einsum("ij, j -> i", A, v), A@v
#@tab tensorflow
# Reimplement matrix multiplication
tf.einsum("ij, j -> i", A, v), tf.matmul(A, tf.reshape(v, (2, 1)))
This is a highly flexible notation. For instance, if we want to compute what would be traditionally written as

$$
c_{kl} = \sum_{ij} b_{ijk}a_{il}v_j,
$$

it can be implemented via Einstein summation as:
np.einsum("ijk, il, j -> kl", B, A, v)
#@tab pytorch
torch.einsum("ijk, il, j -> kl", B, A, v)
#@tab tensorflow
tf.einsum("ijk, il, j -> kl", B, A, v)
This notation is readable and efficient for humans,
but bulky if for whatever reason
we need to generate a tensor contraction programmatically.
For this reason, einsum provides an alternative notation in which each tensor is followed by a list of integer indices.
For example, the same tensor contraction can also be written as:
np.einsum(B, [0, 1, 2], A, [0, 3], v, [1], [2, 3])
#@tab pytorch
# PyTorch doesn't support this type of notation.
#@tab tensorflow
# TensorFlow doesn't support this type of notation.
Either notation allows for concise and efficient representation of tensor contractions in code.
- Vectors can be interpreted geometrically as either points or directions in space.
- Dot products define the notion of angle to arbitrarily high-dimensional spaces.
- Hyperplanes are high-dimensional generalizations of lines and planes. They can be used to define decision planes that are often used as the last step in a classification task.
- Matrix multiplication can be geometrically interpreted as a uniform distortion of the underlying coordinates. It represents a very restricted, but mathematically clean, way to transform vectors.
- Linear dependence is a way to tell when a collection of vectors are in a lower dimensional space than we would expect (say you have $3$ vectors living in a $2$-dimensional space). The rank of a matrix is the size of the largest subset of its columns that are linearly independent.
- When a matrix's inverse is defined, matrix inversion allows us to find another matrix that undoes the action of the first. Matrix inversion is useful in theory, but requires care in practice owing to numerical instability.
- Determinants allow us to measure how much a matrix expands or contracts a space. A nonzero determinant implies an invertible (non-singular) matrix and a zero-valued determinant means that the matrix is non-invertible (singular).
- Tensor contractions and Einstein summation provide for a neat and clean notation for expressing many of the computations that are seen in machine learning.
- What is the angle between $$\vec v_1 = \begin{bmatrix}1 \\ 0 \\ -1 \\ 2\end{bmatrix}, \qquad \vec v_2 = \begin{bmatrix}3 \\ 1 \\ 0 \\ 1\end{bmatrix}?$$
- True or false: $\begin{bmatrix}1 & 2\\0&1\end{bmatrix}$ and $\begin{bmatrix}1 & -2\\0&1\end{bmatrix}$ are inverses of one another?
- Suppose that we draw a shape in the plane with area $100\mathrm{m}^2$. What is the area after transforming the figure by the matrix $$\begin{bmatrix}2 & 3\\1 & 2\end{bmatrix}?$$
- Which of the following sets of vectors are linearly independent?
  - $\left\{\begin{pmatrix}1\\0\\-1\end{pmatrix}, \begin{pmatrix}2\\1\\-1\end{pmatrix}, \begin{pmatrix}3\\1\\1\end{pmatrix}\right\}$
  - $\left\{\begin{pmatrix}3\\1\\1\end{pmatrix}, \begin{pmatrix}1\\1\\1\end{pmatrix}, \begin{pmatrix}0\\0\\0\end{pmatrix}\right\}$
  - $\left\{\begin{pmatrix}1\\1\\0\end{pmatrix}, \begin{pmatrix}0\\1\\-1\end{pmatrix}, \begin{pmatrix}1\\0\\1\end{pmatrix}\right\}$
- Suppose that you have a matrix written as $A = \begin{bmatrix}c\\d\end{bmatrix}\cdot\begin{bmatrix}a & b\end{bmatrix}$ for some choice of values $a, b, c$, and $d$. True or false: the determinant of such a matrix is always $0$?
- The vectors $e_1 = \begin{bmatrix}1\\0\end{bmatrix}$ and $e_2 = \begin{bmatrix}0\\1\end{bmatrix}$ are orthogonal. What is the condition on a matrix $A$ so that $Ae_1$ and $Ae_2$ are orthogonal?
- How can you write $\mathrm{tr}(\mathbf{A}^4)$ in Einstein notation for an arbitrary matrix $A$?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: