
Optimize insertion into global sparse matrix during assembly #15048

Open
bangerth opened this issue Apr 7, 2023 · 1 comment
Comments

bangerth commented Apr 7, 2023

I came across this paper https://dl.acm.org/doi/full/10.1145/3503925: "On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs". @kronbichler and @peterrum might be interested in it as well.

One of the things they show is that the insertion of local matrix elements into the global matrix is expensive, primarily because one has to do a bisection search in each row for the column we want to add a local entry to. It turns out that we don't actually do a bisection search if one goes through the highly optimized path via AffineConstraints::distribute_local_to_global() (as one should) because that function passes an already-sorted array of entries for each row to SparseMatrix::add(). There, we have the following (slightly trimmed for clarity):

template <typename number>
template <typename number2>
void
SparseMatrix<number>::add(const size_type  row,
                          const size_type  n_cols,
                          const size_type *col_indices,
                          const number2 *  values,
                          const bool       elide_zero_values,
                          const bool       col_indices_are_sorted)
{
  if (elide_zero_values == false && col_indices_are_sorted == true &&
      n_cols > 3)
    {
      const size_type *this_cols    = &cols->colnums[cols->rowstart[row]];
      const size_type  row_length_1 = cols->row_length(row) - 1;
      number *         val_ptr      = &val[cols->rowstart[row]];

      if (m() == n())
        {
          // find diagonal and add it if found
          Assert(this_cols[0] == row, ExcInternalError());
          const size_type *diag_pos =
            Utilities::lower_bound(col_indices, col_indices + n_cols, row);
          const size_type diag      = diag_pos - col_indices;
          size_type       post_diag = diag;
          if (diag != n_cols && *diag_pos == row)
            {
              val_ptr[0] += *(values + (diag_pos - col_indices));
              ++post_diag;
            }

          // Add indices before diagonal. Because the input array
          // is sorted, and because the entries in this matrix row
          // are sorted, we can just linearly walk the colnums array
          // and the input array in parallel, stopping whenever the
          // former matches the column index of the next index in
          // the input array:
          size_type counter = 1;
          for (size_type i = 0; i < diag; ++i)
            {
              while (this_cols[counter] < col_indices[i] &&           // **** linear walk over entries
                     counter < row_length_1)
                ++counter;

              Assert((this_cols[counter] == col_indices[i]) ||
                       (values[i] == number2()),
                     ExcInvalidIndex(row, col_indices[i]));

              val_ptr[counter] += values[i];
            }

          // Then do the same to add indices after the diagonal:
          for (size_type i = post_diag; i < n_cols; ++i)
            {
              while (this_cols[counter] < col_indices[i] &&           // **** linear walk over entries
                     counter < row_length_1)
                ++counter;

              Assert((this_cols[counter] == col_indices[i]) ||
                       (values[i] == number2()),
                     ExcInvalidIndex(row, col_indices[i]));

              val_ptr[counter] += values[i];
            }
 ...

This is pretty good, and I suspect that the linear forward search for the row entry that matches the next column index we want to add to is quite efficient.

But I do wonder whether we could optimize this a bit more. For 3d problems (say, Stokes), adding the local matrix entries for a row that corresponds to a DoF on a face, edge, or vertex touches only 1/2, 1/4, or 1/8 of the entries in the row, respectively, and there are many entries per row (~400 for Stokes Q2/Q1 elements in 3d). A more efficient approach may be to check whether the next index (or one of the next two indices) is the right one, and if not fall back to a bisection search.
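To illustrate the idea, here is a minimal sketch (hypothetical code, not deal.II; the function name, signature, and `std::vector` interface are made up for illustration): for each sorted input column, probe the current and next row positions, and only bisect the not-yet-visited tail of the row on a miss. The precondition is the same as in `SparseMatrix::add()`: both index arrays are sorted, and every input column exists in the row.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: merge sorted local column indices into a sorted
// sparsity row. Instead of walking the row linearly entry by entry, probe
// the current and the next row position; if neither matches, fall back to
// a bisection search on the remaining tail of the row.
void add_row_entries_hybrid(const std::vector<std::size_t> &row_cols,
                            std::vector<double>            &row_vals,
                            const std::vector<std::size_t> &col_indices,
                            const std::vector<double>      &values)
{
  std::size_t pos = 0;
  for (std::size_t i = 0; i < col_indices.size(); ++i)
    {
      // Fast path: the sought column is at the current or the next position.
      if (row_cols[pos] != col_indices[i])
        {
          if (pos + 1 < row_cols.size() && row_cols[pos + 1] == col_indices[i])
            ++pos;
          else
            // Slow path: bisection on the tail [pos, end) of the row.
            pos = std::lower_bound(row_cols.begin() + pos,
                                   row_cols.end(),
                                   col_indices[i]) -
                  row_cols.begin();
        }
      assert(row_cols[pos] == col_indices[i]);
      row_vals[pos] += values[i];
    }
}
```

For dense insertions (input columns hitting consecutive row entries) this degenerates to the current linear walk with one extra comparison per entry; for sparse insertions each miss costs O(log tail length) rather than a long linear scan.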

bangerth commented Apr 7, 2023

The same applies in affine_constraints.templates.h in the following function:

    namespace dealiiSparseMatrix
    {
      template <typename SparseMatrixIterator, typename LocalType>
      static inline void
      add_value(const LocalType       value,
                const size_type       row,
                const size_type       column,
                SparseMatrixIterator &matrix_values)
      {
        (void)row;
        if (value != LocalType())
          {
            while (matrix_values->column() < column)
              ++matrix_values;
            Assert(matrix_values->column() == column,
                   typename SparseMatrix<
                     typename SparseMatrixIterator::MatrixType::value_type>::
                     ExcInvalidIndex(row, column));
            matrix_values->value() += value;
          }
      }
    } // namespace dealiiSparseMatrix
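Here, too, the linear `while` walk could be replaced by something sublinear. One option is a galloping (exponential) probe followed by a bisection on the bracketed range, so that a jump of distance d through the row costs O(log d) instead of O(d). A minimal sketch (hypothetical, pointer-based rather than iterator-based; the function name is made up):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: advance through a sorted range of column indices
// to the entry holding `column`, using a galloping probe (steps of
// 1, 2, 4, ...) to bracket the target, then a bisection search on the
// bracketed subrange. `cols` is the current position, `end` one past the
// last entry; `column` is assumed to exist at or after `cols`.
const std::size_t *advance_galloping(const std::size_t *cols,
                                     const std::size_t *end,
                                     const std::size_t  column)
{
  // Exponential probe: look 1, 2, 4, ... entries ahead until we reach
  // an entry >= column (or run off the end of the row).
  std::size_t       step = 1;
  const std::size_t *lo  = cols;
  while (lo + step < end && lo[step] < column)
    {
      lo += step;
      step *= 2;
    }
  // The target now lies in [lo, lo + step]; bisect that subrange.
  return std::lower_bound(lo, std::min(lo + step + 1, end), column);
}
```

Since successive calls in `add_value()` typically move the iterator only a few entries, the first probe usually succeeds immediately, and the bisection only kicks in for the long jumps that make the current linear walk expensive.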
