---
title: Cache-Oblivious Algorithms
weight: 7
---
In the context of the external memory model, there are two types of efficient algorithms:

- *Cache-aware* algorithms that are efficient for known $B$ and $M$.
- *Cache-oblivious* algorithms that are efficient for any $B$ and $M$.
For example, external merge sorting is a cache-aware, but not cache-oblivious, algorithm: we need to know the memory characteristics of the system, namely the ratio of available memory to the block size, to pick the right number of sorted runs to merge at a time.
Cache-oblivious algorithms are interesting because they automatically become optimal for all memory levels in the cache hierarchy, and not just the one for which they were specifically tuned. In this article, we consider some of their applications in matrix calculations.
Assume we have a square matrix $A$ of size $N \times N$, stored as a row-major array, and we want to transpose it in place. The naive way to do it is to swap each pair of symmetric elements:
```c++
for (int i = 0; i < n; i++)
    for (int j = 0; j < i; j++)
        swap(a[j * N + i], a[i * N + j]);
```
Here we use a single pointer to the beginning of the memory region instead of a 2D array to be more explicit about the memory operations.
The I/O complexity of this code is $O(N^2)$: one of the two elements in each swap is read in column order, so its neighbors are $N$ elements apart, and (for large enough $N$) almost every such access fetches a new block.
The cache-oblivious algorithm relies on the following block matrix identity:

$$
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^T
= \begin{pmatrix} A^T & C^T \\ B^T & D^T \end{pmatrix}
$$
It lets us solve the problem recursively using a divide-and-conquer approach:
- Divide the input matrix into 4 smaller matrices.
- Transpose each one recursively.
- Combine results by swapping the corner result matrices.
Implementing divide-and-conquer on matrices is a bit more complex than on arrays, but the main idea is the same. Instead of copying submatrices explicitly, we want to use "views" into them, defined by a pointer to their top-left corner and the stride $N$ of the original matrix, and also switch to the naive method when the data starts fitting in the L1 cache (or just pick something small like $32$ as the cutoff):
```c++
void transpose(int *a, int n, int N) {
    if (n <= 32) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < i; j++)
                swap(a[i * N + j], a[j * N + i]);
    } else {
        int k = n / 2;

        transpose(a, k, N);
        transpose(a + k, k, N);
        transpose(a + k * N, k, N);
        transpose(a + k * N + k, k, N);

        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++)
                swap(a[i * N + (j + k)], a[(i + k) * N + j]);

        if (n & 1)
            for (int i = 0; i < n - 1; i++)
                swap(a[i * N + n - 1], a[(n - 1) * N + i]);
    }
}
```
The I/O complexity of the algorithm is $O(\frac{N^2}{B})$: the recursion bottoms out when a submatrix of size $n \times n$ fits in cache, transposing it takes $O(\frac{n^2}{B})$ block reads, and the $O(\frac{N^2}{n^2})$ base cases therefore add up to $O(\frac{N^2}{B})$ in total.
Adapting this code for the general case of non-square matrices is left as an exercise to the reader.
Next, let's consider something slightly more complex: matrix multiplication.
The naive algorithm just translates its definition into code:
```c++
// don't forget to initialize c[][] with zeroes
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            c[i * n + j] += a[i * n + k] * b[k * n + j];
```
It needs to access $O(n^3)$ blocks in total: the innermost loop walks down a column of $b$, so (unless the whole matrix fits in cache) almost every access to $b$ touches a new block.
One well-known optimization is to transpose $b$ first:
```c++
for (int i = 0; i < n; i++)
    for (int j = 0; j < i; j++)
        swap(b[j * n + i], b[i * n + j]);
// ^ or use our faster transpose from before
```
```c++
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            c[i * n + j] += a[i * n + k] * b[j * n + k]; // <- note the indices
```
Whether the transpose is done naively or with the cache-oblivious method we previously developed, the matrix multiplication with one of the matrices transposed works in $O(\frac{n^3}{B} + n^2)$: both matrices are now read sequentially, and the transpose itself costs at most $O(n^2)$.
It seems like we can't do better, but it turns out we can.
Cache-oblivious matrix multiplication relies on essentially the same trick as the transposition: we divide the data until it fits into the lowest cache (i.e., until $n^2 \leq M$) and use the block matrix identity

$$
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
= \begin{pmatrix}
A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\
A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}
\end{pmatrix}
$$
It is slightly harder to implement though because we now have a total of 8 recursive matrix multiplications:
```c++
void matmul(const float *a, const float *b, float *c, int n, int N) {
    if (n <= 32) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i * N + j] += a[i * N + k] * b[k * N + j];
    } else {
        int k = n / 2;

        // c11 = a11 b11 + a12 b21
        matmul(a, b, c, k, N);
        matmul(a + k, b + k * N, c, k, N);

        // c12 = a11 b12 + a12 b22
        matmul(a, b + k, c + k, k, N);
        matmul(a + k, b + k * N + k, c + k, k, N);

        // c21 = a21 b11 + a22 b21
        matmul(a + k * N, b, c + k * N, k, N);
        matmul(a + k * N + k, b + k * N, c + k * N, k, N);

        // c22 = a21 b12 + a22 b22
        matmul(a + k * N, b + k, c + k * N + k, k, N);
        matmul(a + k * N + k, b + k * N + k, c + k * N + k, k, N);

        // if n is odd, fix up the last row and column of c
        if (n & 1) {
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = (i < n - 1 && j < n - 1) ? n - 1 : 0; k < n; k++)
                        c[i * N + j] += a[i * N + k] * b[k * N + j];
        }
    }
}
```
Because there are many other factors at play here, we are not going to benchmark this implementation and instead just do its theoretical performance analysis in the external memory model.
The arithmetic complexity of the algorithm remains the same, because the recurrence

$$
T(n) = 8 \cdot T(n/2) + O(n^2)
$$

is solved by $T(n) = O(n^3)$.
It doesn't seem like we have "conquered" anything yet, but let's think about the I/O complexity:

$$
T(n) = \begin{cases}
O\left(\frac{n^2}{B}\right) & n^2 \leq M \\
8 \cdot T(n/2) + O\left(\frac{n^2}{B}\right) & \text{otherwise}
\end{cases}
$$

The recurrence is dominated by its $O((\frac{n}{\sqrt M})^3)$ base cases, each reading $O(\frac{M}{B})$ blocks, so the total I/O complexity is

$$
T(n) = O\left(\frac{n^3}{B \sqrt M}\right)
$$

This is better than the $O(\frac{n^3}{B})$ of the transposed-matrix approach by a factor of $\sqrt M$.
In a spirit similar to the Karatsuba algorithm, matrix multiplication can be decomposed in a way that involves only 7 multiplications of matrices of size $\frac{n}{2} \times \frac{n}{2}$, lowering the arithmetic complexity to $O(n^{\log_2 7}) \approx O(n^{2.81})$.
This technique, known as the Strassen algorithm, similarly splits each matrix into 4 blocks:

$$
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
= \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
$$

But then it computes 7 intermediate products of the submatrices:

$$
\begin{aligned}
D   &= (A_{11} + A_{22})(B_{11} + B_{22}) \\
D_1 &= (A_{21} + A_{22}) B_{11} \\
D_2 &= A_{11} (B_{12} - B_{22}) \\
D_3 &= A_{22} (B_{21} - B_{11}) \\
D_4 &= (A_{11} + A_{12}) B_{22} \\
D_5 &= (A_{21} - A_{11})(B_{11} + B_{12}) \\
D_6 &= (A_{12} - A_{22})(B_{21} + B_{22})
\end{aligned}
$$

and combines them to get the four blocks of the result:

$$
\begin{aligned}
C_{11} &= D + D_3 - D_4 + D_6 \\
C_{12} &= D_2 + D_4 \\
C_{21} &= D_1 + D_3 \\
C_{22} &= D + D_2 - D_1 + D_5
\end{aligned}
$$

You can verify these formulas with simple substitution if you feel like it.
As far as I know, none of the mainstream optimized linear algebra libraries use the Strassen algorithm, although there are some prototype implementations that are efficient for matrices larger than 2000 or so.
This technique can be, and actually has been, extended multiple times to reduce the asymptotic even further by considering more submatrix products. As of 2020, the current world record is $O(n^{2.3728596})$.
For a solid theoretical viewpoint, consider reading Cache-Oblivious Algorithms and Data Structures by Erik Demaine.