
mallocMC with alpaka is slow #183

Closed
psychocoderHPC opened this issue Jul 6, 2020 · 3 comments


@psychocoderHPC
Member

Since #173, where we added alpaka to mallocMC, the performance is lower than before.
I tested https://github.com/ComputationalRadiationPhysics/mallocMC/tree/31f9b3c3f1cf919a96312a30ab9aa223f41fed67 with the patches from #180.

The performance with PIConGPU for the spec example is 28% lower.

mallocMC with alpaka

./bin/picongpu_mallocmc -d 1 1 1 -g 128 128 128 -s 100 --periodic 1 1 1
...
 85 % =       85 | time elapsed:             3sec 889msec | avg time per step:  48msec
 90 % =       90 | time elapsed:             4sec 133msec | avg time per step:  48msec
 95 % =       95 | time elapsed:             4sec 376msec | avg time per step:  48msec
100 % =      100 | time elapsed:             4sec 619msec | avg time per step:  48msec
calculation  simulation time:  4sec 619msec = 4 sec
full simulation time: 18sec 365msec = 18 sec

current PIConGPU with old alpaka

./bin/picongpu -d 1 1 1 -g 128 128 128 -s 100 --periodic 1 1 1
...
 85 % =       85 | time elapsed:             3sec  26msec | avg time per step:  37msec
 90 % =       90 | time elapsed:             3sec 216msec | avg time per step:  38msec
 95 % =       95 | time elapsed:             3sec 407msec | avg time per step:  38msec
100 % =      100 | time elapsed:             3sec 597msec | avg time per step:  38msec
calculation  simulation time:  3sec 597msec = 3 sec

I am currently investigating where the performance decrease comes from.

@psychocoderHPC
Member Author

It looks like compile-time parameters are passed at runtime and loaded from global memory.

The PTX below shows a snippet where https://github.com/ComputationalRadiationPhysics/mallocMC/blob/31f9b3c3f1cf919a96312a30ab9aa223f41fed67/src/include/mallocMC/creationPolicies/Scatter_impl.hpp#L74-L98 is invoked:

	// Callseq Start 5
	{
	.reg .b32 temp_param_reg;
	// <end>}
	.param .b64 param0;
	st.param.b64	[param0+0], %rd199;
	.param .b64 retval0;
	call.uni (retval0), 
	_ZN8mallocMC15DeviceAllocatorINS_16CreationPolicies7ScatterIN8picongpu16DeviceHeapConfigENS1_11ScatterConf27DefaultScatterHashingParamsEEENS_20DistributionPolicies4NoopENS_11OOMPolicies10ReturnNullENS_17AlignmentPolicies6ShrinkINSC_12ShrinkConfig19DefaultShrinkConfigEEEE6mallocIN6alpaka3acc12AccGpuCudaRtISt17integral_constantImLm3EEjEEEEPvRKT_m, 
	(
	param0
	);
	ld.param.b64	%rd753, [retval0+0];
	
	//{
	}// Callseq End 5
	setp.ne.s64	%p31, %rd753, 0;
	@%p31 bra 	BB17_31;

@psychocoderHPC
Member Author

I forgot to remove the __noinline__ for CUDA, hence the strange behavior with the global loads.
Nevertheless, even without __noinline__ the code is still slow :-(

@psychocoderHPC
Member Author

It is not a mallocMC issue. The CMake CUDA module does not forward flags from CMAKE_CXX_FLAGS_*, so alpaka must be extended to handle these flags when the CUDA backend is used:

	string(TOUPPER "${CMAKE_BUILD_TYPE}" build_config)
	if(build_config)
	    # CMAKE_CXX_FLAGS_* are strings, not lists; to use them with
	    # foreach we must transform the string into a list.
	    string(REPLACE " " ";" CXX_FLAGS_AS_LIST ${CMAKE_CXX_FLAGS_${build_config}})
	    foreach(_flag ${CXX_FLAGS_AS_LIST})
	        list(APPEND CUDA_NVCC_FLAGS -Xcompiler=${_flag})
	    endforeach()
	endif()
