You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
One of the things they show is that the insertion of local matrix elements into the global matrix is expensive, primarily because one has to do a bisection search in each row for the column we want to add a local entry to. It turns out that we don't actually do a bisection search if one goes through the highly optimized path via AffineConstraints::distribute_local_to_global() (as one should) because that function passes an already-sorted array of entries for each row to SparseMatrix::add(). There, we have the following (slightly trimmed for clarity):
template <typename number>
template <typename number2>
void
SparseMatrix<number>::add(const size_type row,
const size_type n_cols,
const size_type *col_indices,
const number2 * values,
constbool elide_zero_values,
constbool col_indices_are_sorted)
{
if (elide_zero_values == false && col_indices_are_sorted == true &&
n_cols > 3)
{
const size_type *this_cols = &cols->colnums[cols->rowstart[row]];
const size_type row_length_1 = cols->row_length(row) - 1;
number * val_ptr = &val[cols->rowstart[row]];
if (m() == n())
{
// find diagonal and add it if foundAssert(this_cols[0] == row, ExcInternalError());
const size_type *diag_pos =
Utilities::lower_bound(col_indices, col_indices + n_cols, row);
const size_type diag = diag_pos - col_indices;
size_type post_diag = diag;
if (diag != n_cols && *diag_pos == row)
{
val_ptr[0] += *(values + (diag_pos - col_indices));
++post_diag;
}
// Add indices before diagonal. Because the input array// is sorted, and because the entries in this matrix row// are sorted, we can just linearly walk the colnums array// and the input array in parallel, stopping whenever the// former matches the column index of the next index in// the input array:
size_type counter = 1;
for (size_type i = 0; i < diag; ++i)
{
while (this_cols[counter] < col_indices[i] && // **** linear walk over entries
counter < row_length_1)
++counter;
Assert((this_cols[counter] == col_indices[i]) ||
(values[i] == number2()),
ExcInvalidIndex(row, col_indices[i]));
val_ptr[counter] += values[i];
}
// Then do the same to add indices after the diagonal:for (size_type i = post_diag; i < n_cols; ++i)
{
while (this_cols[counter] < col_indices[i] && // **** linear walk over entries
counter < row_length_1)
++counter;
Assert((this_cols[counter] == col_indices[i]) ||
(values[i] == number2()),
ExcInvalidIndex(row, col_indices[i]));
val_ptr[counter] += values[i];
}
...
This is pretty good, and I suspect that the linear forward search for the entry in the row that matches the next column index for which we want to add something is pretty efficient.
But I do wonder whether we could optimize this a bit more. For 3d problems (say, Stokes), adding the local matrix entries for a row that corresponds to a DoF on a face/edge/vertex touches only 1/2, 1/4, and 1/8 of the entries in the row, and there are many entries per row (~400 for Stokes Q2/Q1 elements in 3d). A more efficient approach may be to see whether the next index (or one of the two next indices) is right, and if not do a bisection search.
The text was updated successfully, but these errors were encountered:
I came across this paper https://dl.acm.org/doi/full/10.1145/3503925: "On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs". @kronbichler and @peterrum might be interested in it as well.
One of the things they show is that the insertion of local matrix elements into the global matrix is expensive, primarily because one has to do a bisection search in each row for the column we want to add a local entry to. It turns out that we don't actually do a bisection search if one goes through the highly optimized path via
AffineConstraints::distribute_local_to_global()
(as one should) because that function passes an already-sorted array of entries for each row toSparseMatrix::add()
. There, we have the following (slightly trimmed for clarity):This is pretty good, and I suspect that the linear forward search for the entry in the row that matches the next column index for which we want to add something is pretty efficient.
But I do wonder whether we could optimize this a bit more. For 3d problems (say, Stokes), adding the local matrix entries for a row that corresponds to a DoF on a face/edge/vertex touches only 1/2, 1/4, and 1/8 of the entries in the row, and there are many entries per row (~400 for Stokes Q2/Q1 elements in 3d). A more efficient approach may be to see whether the next index (or one of the two next indices) is right, and if not do a bisection search.
The text was updated successfully, but these errors were encountered: