Global coarsening: compress weights #13099

peterrum · 2021-12-19T17:09:49Z

No description provided.

peterrum · 2021-12-20T19:27:53Z

@kronbichler I am tempted to get rid of the short-cut path, since it makes the application of hanging-node constraints and the application of weights annoyingly complicated. What do you think?

dealii/include/deal.II/multigrid/mg_transfer_global_coarsening.templates.h

Lines 2526 to 2571 in cfa11cf

    
    // identity -> take short cut and work directly on global vectors
    if (scheme.prolongation_matrix.size() == 0 &&
        scheme.prolongation_matrix_1d.size() == 0)
      {
        for (unsigned int cell = 0; cell < scheme.n_coarse_cells;
             cell += n_lanes)
          {
            const unsigned int n_lanes_filled =
              (cell + n_lanes > scheme.n_coarse_cells) ?
                (scheme.n_coarse_cells - cell) :
                n_lanes;
            // read from source vector
            for (unsigned int v = 0; v < n_lanes_filled; ++v)
              {
                if ((scheme.n_dofs_per_cell_fine != 0) &&
                    (scheme.n_dofs_per_cell_coarse != 0))
                  {
                    if (fine_element_is_continuous)
                      for (unsigned int i = 0;
                           i < scheme.n_dofs_per_cell_fine;
                           ++i)
                        vec_fine_ptr->local_element(indices_fine[i]) +=
                          read_dof_values(indices_coarse[i], vec_coarse) *
                          weights[i];
                    else
                      for (unsigned int i = 0;
                           i < scheme.n_dofs_per_cell_fine;
                           ++i)
                        vec_fine_ptr->local_element(indices_fine[i]) +=
                          read_dof_values(indices_coarse[i], vec_coarse);
                  }
                indices_fine += scheme.n_dofs_per_cell_fine;
                indices_coarse += scheme.n_dofs_per_cell_coarse;
                if (fine_element_is_continuous)
                  weights += scheme.n_dofs_per_cell_fine;
              }
            if (fine_element_is_continuous)
              weights_compressed += 1;
          }
        continue;
      }
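
The short-cut path quoted here walks over the coarse cells in batches of `n_lanes` and treats a partially filled last batch via `n_lanes_filled`. A minimal standalone sketch of just that batching logic follows; the helper name `filled_lanes_per_batch` is hypothetical (not part of deal.II), and plain integers stand in for the vectorized types:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of deal.II): for each batch of `n_lanes`
// cells, return how many lanes of the batch actually hold a cell.
std::vector<unsigned int>
filled_lanes_per_batch(const unsigned int n_cells, const unsigned int n_lanes)
{
  assert(n_lanes > 0); // otherwise the loop below would never advance

  std::vector<unsigned int> result;
  for (unsigned int cell = 0; cell < n_cells; cell += n_lanes)
    // same value as the ternary expression in the quoted code:
    // a full batch, or whatever is left over at the end
    result.push_back(std::min(n_lanes, n_cells - cell));
  return result;
}
```

With 10 cells and 4 lanes, this yields two full batches and one batch with 2 filled lanes, which is the case the inner `v < n_lanes_filled` loop guards against.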

kronbichler · 2021-12-21T08:53:27Z

> I am tempted to get rid of the short-cut path [...]

I think I agree. Just so we do not miss anything, can you summarize the operations we need to go through in that case? I guess it is a single sum-factorization interpolation (at least if we take the right path), i.e., approximately dim * (n_points_fine)^dim n_points_coarse?
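
One way to make the estimate in this comment concrete is to count the multiply-adds of a sum-factorized interpolation sweep by sweep. The sketch below is an illustration under stated assumptions, not code from the PR: it assumes one 1D sweep per direction from `n_coarse` to `n_fine` points, reads the estimate as the product `dim * n_fine^dim * n_coarse`, and uses the hypothetical function names `sweep_cost` and `estimate`:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: multiply-add count of a sum-factorized interpolation,
// assuming one 1D sweep per direction (n_coarse -> n_fine points).
std::uint64_t
sweep_cost(const int dim, const std::uint64_t n_coarse, const std::uint64_t n_fine)
{
  assert(dim >= 1);
  std::uint64_t total = 0;
  for (int d = 0; d < dim; ++d)
    {
      // Sweep d produces n_fine^(d+1) * n_coarse^(dim-d-1) values, each of
      // which needs n_coarse multiply-adds.
      std::uint64_t cost = 1;
      for (int e = 0; e <= d; ++e)
        cost *= n_fine;
      for (int e = d; e < dim; ++e)
        cost *= n_coarse;
      total += cost;
    }
  return total;
}

// The estimate from the comment, read as dim * n_fine^dim * n_coarse.
std::uint64_t
estimate(const int dim, const std::uint64_t n_coarse, const std::uint64_t n_fine)
{
  std::uint64_t result = static_cast<std::uint64_t>(dim) * n_coarse;
  for (int e = 0; e < dim; ++e)
    result *= n_fine;
  return result;
}
```

Under these assumptions the estimate is an upper bound whenever `n_fine >= n_coarse`, because the last sweep is the most expensive one and the estimate charges that cost to every direction.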

peterrum · 2021-12-21T08:55:32Z

> I think I agree. Just so we do not miss anything, can you summarize the operations we need to go through in that case? I guess it is a single sum-factorization interpolation (at least if we take the right path), i.e., approximately dim * (n_points_fine)^dim n_points_coarse?

No, that would still be guarded. The only difference would be that the data is copied into a buffer in a vectorized form.

kronbichler

Looks good to me, apart from a few comments on the code. I agree we should remove the specialization as it does not contribute to faster execution (unless everything sits in L2 cache).

kronbichler · 2021-12-21T08:56:38Z

include/deal.II/multigrid/mg_transfer_global_coarsening.templates.h

+
+      for (const auto &scheme : transfer.schemes)
+        {
+          std::cout << scheme.degree_fine << std::endl;


Suggested change

std::cout << scheme.degree_fine << std::endl;

kronbichler · 2021-12-21T08:57:22Z

include/deal.II/multigrid/mg_transfer_global_coarsening.templates.h

+
+                      if (!set(mask[c][shift], weights[0]))
+                        return;
+


Suggested change

kronbichler · 2021-12-21T08:58:41Z

include/deal.II/multigrid/mg_transfer_global_coarsening.templates.h

+                        9 * degree_to_3[k] + 3 * degree_to_3[j];
+
+                      if (!set(mask[c][shift], weights[0]))
+                        return;


Can you explain why you want to return here? Shouldn't we simply continue in that case? Either way, I do not think I understand what exactly happens in this function and what is supposed to happen if set returns false, given the side effects inside the function on mask and multiple checks. Also, I think that some comment on what the loop over cell does would help.

kronbichler · 2021-12-21T09:07:35Z

include/deal.II/multigrid/mg_transfer_global_coarsening.templates.h

+    for (unsigned int c = 0; c < n_components; ++c)
+      for (int k = 0; k < (dim > 2 ? loop_length : 1); ++k)
+        for (int j = 0; j < (dim > 1 ? loop_length : 1); ++j)
+          {
+            const unsigned int shift = 9 * degree_to_3[k] + 3 * degree_to_3[j];
+            data[0] *= weights[shift];
+            // loop bound as int avoids compiler warnings in case loop_length
+            // == 1 (polynomial degree 0)
+            for (int i = 1; i < loop_length - 1; ++i)
+              data[i] *= weights[shift + 1];
+            data[loop_length - 1] *= weights[shift + 2];
+            data += loop_length;
+          }


Is this the same code as in the other MG transfer function?

dealii/source/multigrid/mg_transfer_matrix_free.cc

Lines 362 to 388 in cfa11cf

    weight_dofs_on_child(const VectorizedArray<Number> *weights,
                         const unsigned int             n_components,
                         const unsigned int             fe_degree,
                         VectorizedArray<Number> *      data)
    {
      Assert(fe_degree > 0, ExcNotImplemented());
      Assert(fe_degree < 100, ExcNotImplemented());
      const int loop_length = degree != -1 ? 2 * degree + 1 : 2 * fe_degree + 1;
      unsigned int degree_to_3[100];
      degree_to_3[0] = 0;
      for (int i = 1; i < loop_length - 1; ++i)
        degree_to_3[i] = 1;
      degree_to_3[loop_length - 1] = 2;
      for (unsigned int c = 0; c < n_components; ++c)
        for (int k = 0; k < (dim > 2 ? loop_length : 1); ++k)
          for (int j = 0; j < (dim > 1 ? loop_length : 1); ++j)
            {
              const unsigned int shift = 9 * degree_to_3[k] + 3 * degree_to_3[j];
              data[0] *= weights[shift];
              // loop bound as int avoids compiler warnings in case loop_length
              // == 1 (polynomial degree 0)
              for (int i = 1; i < loop_length - 1; ++i)
                data[i] *= weights[shift + 1];
              data[loop_length - 1] *= weights[shift + 2];
              data += loop_length;
            }
    }

I suggest moving the code to a common location and implementing it only once. Since this is a regular tensor-product variant, it might fit in include/deal.II/matrix_free/tensor_product_kernels.h.
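
For readers unfamiliar with the indexing in the quoted kernel: each 1D index is mapped to one of three classes (first point, interior, last point), so a cell needs only 3^dim weights, addressed as `9 * class(k) + 3 * class(j) + class(i)`. The following is a scalar sketch of that idea only; the name `weight_by_entity`, the use of `double` instead of `VectorizedArray<Number>`, and the single-component layout are simplifying assumptions, not the deal.II implementation:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar stand-in for the quoted kernel: scale each of the
// n^dim values of one component by one of 3^dim compressed weights.
template <int dim>
void
weight_by_entity(const std::array<double, 27> &weights,
                 const int                     n, // points per direction
                 std::vector<double> &         data)
{
  assert(n >= 2); // a first and a last point must exist in each direction

  // map a 1D index to 0 (first point), 1 (interior) or 2 (last point)
  const auto to_3 = [n](const int i) {
    return i == 0 ? 0 : (i == n - 1 ? 2 : 1);
  };

  std::size_t pos = 0;
  for (int k = 0; k < (dim > 2 ? n : 1); ++k)
    for (int j = 0; j < (dim > 1 ? n : 1); ++j)
      {
        const int shift = 9 * to_3(k) + 3 * to_3(j);
        for (int i = 0; i < n; ++i, ++pos)
          data[pos] *= weights[shift + to_3(i)];
      }
}
```

In this sketch, all interior points along a line share the weight at `shift + 1`, which is what allows the weights to be stored in compressed form instead of one value per degree of freedom.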

peterrum · 2021-12-21T11:09:48Z

@kronbichler I have made the changes!

kronbichler · 2021-12-21T11:36:46Z

include/deal.II/matrix_free/tensor_product_kernels.h

@@ -2586,6 +2586,36 @@ namespace internal
  }


+  template <int dim, typename Number>
+  inline void
+  weight_dofs_on_child(const VectorizedArray<Number> *weights,


Can we rename this into something more generic, e.g. weight_fe_q_dofs_by_entity? Also, I wonder if you would mind injecting the dimension loop_length via a template argument for the case this is supported, in order to reduce loop overhead?

> Also, I wonder if you would mind injecting the dimension loop_length via a template argument for the case this is supported, in order to reduce loop overhead?

I would have thought that inline is enough. But I can do the change.

I am not sure, but my experience is that the compiler feels tempted not to inline such functions (with more than 20 obvious instructions) because it sees the same code fit for multiple template arguments, which reduces the machine code size a bit. I won't blame any compiler for doing so (I am not sure whether this is the case here, though), because the compiler cannot know that different degrees will rarely be executed in close temporal proximity, which makes the code size reduction hardly effective at all.
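
The pattern under discussion, passing the loop length as a template argument where it is known and falling back to a run-time value otherwise, can be sketched as follows. This mirrors the `degree != -1 ? ... : ...` idiom in the quoted kernel, but the names `scale` and `scale_dispatch` and the set of instantiated lengths are hypothetical and chosen only for illustration:

```cpp
#include <cassert>

// Hypothetical kernel: `n_static` is the compile-time loop length, or -1 if
// the length is only known at run time.
template <int n_static>
inline void
scale(double *data, const double factor, const int n_runtime)
{
  // For n_static != -1 the compiler sees a constant trip count and can
  // unroll the loop; otherwise the run-time value is used.
  const int n = (n_static != -1) ? n_static : n_runtime;
  for (int i = 0; i < n; ++i)
    data[i] *= factor;
}

// Select a specialized instantiation for a few lengths and use the generic
// run-time path for everything else.
inline void
scale_dispatch(double *data, const double factor, const int n)
{
  assert(n >= 0);
  switch (n)
    {
      case 3:
        scale<3>(data, factor, n);
        break;
      case 5:
        scale<5>(data, factor, n);
        break;
      default:
        scale<-1>(data, factor, n);
    }
}
```

The trade-off is the one raised in the comment: each instantiation adds machine code, in exchange for a loop bound the compiler can treat as a constant.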

kronbichler · 2021-12-21T15:51:46Z

/rebuild

peterrum added the Multigrid label Dec 19, 2021

peterrum force-pushed the gc_weighting branch 2 times, most recently from 5aedbc7 to cfa11cf Compare December 20, 2021 19:23

peterrum changed the title ~~[WIP] Global coarsening: compress weights~~ Global coarsening: compress weights Dec 20, 2021

peterrum mentioned this pull request Dec 20, 2021

Use FEEvaluationHangingNodes in MGTwoLevelTransfer #13048

Merged

kronbichler reviewed Dec 21, 2021

peterrum force-pushed the gc_weighting branch from cfa11cf to 8cd520f Compare December 21, 2021 11:09

kronbichler reviewed Dec 21, 2021

kronbichler approved these changes Dec 21, 2021

peterrum force-pushed the gc_weighting branch from 8cd520f to a2962cd Compare December 21, 2021 12:26

kronbichler added ready to test Reviewed and ready to merge labels Dec 21, 2021

peterrum force-pushed the gc_weighting branch from a2962cd to 9f5e645 Compare December 21, 2021 19:35

Global coarsening: compress weights

bf201fd

peterrum force-pushed the gc_weighting branch from 9f5e645 to bf201fd Compare December 21, 2021 20:09

peterrum merged commit acabf31 into dealii:master Dec 22, 2021
