
Optimize insertion into global sparse matrix during assembly #15048

Open
bangerth opened this issue Apr 7, 2023 · 1 comment
Comments

bangerth commented Apr 7, 2023

I came across this paper https://dl.acm.org/doi/full/10.1145/3503925: "On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs". @kronbichler and @peterrum might be interested in it as well.

One of the things they show is that the insertion of local matrix elements into the global matrix is expensive, primarily because one has to do a bisection search in each row for the column we want to add a local entry to. It turns out that we don't actually do a bisection search if one goes through the highly optimized path via AffineConstraints::distribute_local_to_global() (as one should) because that function passes an already-sorted array of entries for each row to SparseMatrix::add(). There, we have the following (slightly trimmed for clarity):

template <typename number>
template <typename number2>
void
SparseMatrix<number>::add(const size_type  row,
                          const size_type  n_cols,
                          const size_type *col_indices,
                          const number2 *  values,
                          const bool       elide_zero_values,
                          const bool       col_indices_are_sorted)
{
  if (elide_zero_values == false && col_indices_are_sorted == true &&
      n_cols > 3)
    {
      const size_type *this_cols    = &cols->colnums[cols->rowstart[row]];
      const size_type  row_length_1 = cols->row_length(row) - 1;
      number *         val_ptr      = &val[cols->rowstart[row]];

      if (m() == n())
        {
          // find diagonal and add it if found
          Assert(this_cols[0] == row, ExcInternalError());
          const size_type *diag_pos =
            Utilities::lower_bound(col_indices, col_indices + n_cols, row);
          const size_type diag      = diag_pos - col_indices;
          size_type       post_diag = diag;
          if (diag != n_cols && *diag_pos == row)
            {
              val_ptr[0] += *(values + (diag_pos - col_indices));
              ++post_diag;
            }

          // Add indices before diagonal. Because the input array
          // is sorted, and because the entries in this matrix row
          // are sorted, we can just linearly walk the colnums array
          // and the input array in parallel, stopping whenever the
          // former matches the column index of the next index in
          // the input array:
          size_type counter = 1;
          for (size_type i = 0; i < diag; ++i)
            {
              while (this_cols[counter] < col_indices[i] &&           // **** linear walk over entries
                     counter < row_length_1)
                ++counter;

              Assert((this_cols[counter] == col_indices[i]) ||
                       (values[i] == number2()),
                     ExcInvalidIndex(row, col_indices[i]));

              val_ptr[counter] += values[i];
            }

          // Then do the same to add indices after the diagonal:
          for (size_type i = post_diag; i < n_cols; ++i)
            {
              while (this_cols[counter] < col_indices[i] &&           // **** linear walk over entries
                     counter < row_length_1)
                ++counter;

              Assert((this_cols[counter] == col_indices[i]) ||
                       (values[i] == number2()),
                     ExcInvalidIndex(row, col_indices[i]));

              val_ptr[counter] += values[i];
            }
 ...

This is pretty good, and I suspect that the linear forward search for the row entry that matches the next column index we want to add to is quite efficient.

But I do wonder whether we could optimize this a bit more. For 3d problems (say, Stokes), adding the local matrix entries for a row that corresponds to a DoF on a face, edge, or vertex touches only 1/2, 1/4, or 1/8 of the entries in the row, respectively, and there are many entries per row (~400 for Stokes Q2/Q1 elements in 3d). A more efficient approach may be to check whether the next index (or one of the next two indices) is the right one, and if not fall back to a bisection search.
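To illustrate the idea, here is a minimal sketch (hypothetical code, not deal.II; the function name, signature, and `std::vector` interface are made up for illustration): for each sorted input column, probe the current and next row positions, and only bisect the not-yet-visited tail of the row on a miss. The precondition is the same as in `SparseMatrix::add()`: both index arrays are sorted, and every input column exists in the row.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: merge sorted local column indices into a sorted
// sparsity row. Instead of walking the row linearly entry by entry, probe
// the current and the next row position; if neither matches, fall back to
// a bisection search on the remaining tail of the row.
void add_row_entries_hybrid(const std::vector<std::size_t> &row_cols,
                            std::vector<double>            &row_vals,
                            const std::vector<std::size_t> &col_indices,
                            const std::vector<double>      &values)
{
  std::size_t pos = 0;
  for (std::size_t i = 0; i < col_indices.size(); ++i)
    {
      // Fast path: the sought column is at the current or the next position.
      if (row_cols[pos] != col_indices[i])
        {
          if (pos + 1 < row_cols.size() && row_cols[pos + 1] == col_indices[i])
            ++pos;
          else
            // Slow path: bisection on the tail [pos, end) of the row.
            pos = std::lower_bound(row_cols.begin() + pos,
                                   row_cols.end(),
                                   col_indices[i]) -
                  row_cols.begin();
        }
      assert(row_cols[pos] == col_indices[i]);
      row_vals[pos] += values[i];
    }
}
```

For dense insertions (input columns hitting consecutive row entries) this degenerates to the current linear walk with one extra comparison per entry; for sparse insertions each miss costs O(log tail length) rather than a long linear scan.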

bangerth commented Apr 7, 2023

The same applies in affine_constraints.templates.h in the following function:

    namespace dealiiSparseMatrix
    {
      template <typename SparseMatrixIterator, typename LocalType>
      static inline void
      add_value(const LocalType       value,
                const size_type       row,
                const size_type       column,
                SparseMatrixIterator &matrix_values)
      {
        (void)row;
        if (value != LocalType())
          {
            while (matrix_values->column() < column)
              ++matrix_values;
            Assert(matrix_values->column() == column,
                   typename SparseMatrix<
                     typename SparseMatrixIterator::MatrixType::value_type>::
                     ExcInvalidIndex(row, column));
            matrix_values->value() += value;
          }
      }
    } // namespace dealiiSparseMatrix
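Here, too, the linear `while` walk could be replaced by something sublinear. One option is a galloping (exponential) probe followed by a bisection on the bracketed range, so that a jump of distance d through the row costs O(log d) instead of O(d). A minimal sketch (hypothetical, pointer-based rather than iterator-based; the function name is made up):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: advance through a sorted range of column indices
// to the entry holding `column`, using a galloping probe (steps of
// 1, 2, 4, ...) to bracket the target, then a bisection search on the
// bracketed subrange. `cols` is the current position, `end` one past the
// last entry; `column` is assumed to exist at or after `cols`.
const std::size_t *advance_galloping(const std::size_t *cols,
                                     const std::size_t *end,
                                     const std::size_t  column)
{
  // Exponential probe: look 1, 2, 4, ... entries ahead until we reach
  // an entry >= column (or run off the end of the row).
  std::size_t       step = 1;
  const std::size_t *lo  = cols;
  while (lo + step < end && lo[step] < column)
    {
      lo += step;
      step *= 2;
    }
  // The target now lies in [lo, lo + step]; bisect that subrange.
  return std::lower_bound(lo, std::min(lo + step + 1, end), column);
}
```

Since successive calls in `add_value()` typically move the iterator only a few entries, the first probe usually succeeds immediately, and the bisection only kicks in for the long jumps that make the current linear walk expensive.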
