
mallocMC with alpaka is slow #183

Closed
psychocoderHPC opened this issue Jul 6, 2020 · 3 comments


@psychocoderHPC
Member

Since #173, where we added alpaka to mallocMC, the performance is lower than before.
I tested https://github.com/ComputationalRadiationPhysics/mallocMC/tree/31f9b3c3f1cf919a96312a30ab9aa223f41fed67 with the patches from #180.

The performance with PIConGPU for the spec example is 28% lower.

mallocMC with alpaka

./bin/picongpu_mallocmc -d 1 1 1 -g 128 128 128 -s 100 --periodic 1 1 1
...
 85 % =       85 | time elapsed:             3sec 889msec | avg time per step:  48msec
 90 % =       90 | time elapsed:             4sec 133msec | avg time per step:  48msec
 95 % =       95 | time elapsed:             4sec 376msec | avg time per step:  48msec
100 % =      100 | time elapsed:             4sec 619msec | avg time per step:  48msec
calculation  simulation time:  4sec 619msec = 4 sec
full simulation time: 18sec 365msec = 18 sec

current PIConGPU with old alpaka

./bin/picongpu -d 1 1 1 -g 128 128 128 -s 100 --periodic 1 1 1
...
 85 % =       85 | time elapsed:             3sec  26msec | avg time per step:  37msec
 90 % =       90 | time elapsed:             3sec 216msec | avg time per step:  38msec
 95 % =       95 | time elapsed:             3sec 407msec | avg time per step:  38msec
100 % =      100 | time elapsed:             3sec 597msec | avg time per step:  38msec
calculation  simulation time:  3sec 597msec = 3 sec

I am currently investigating where the performance decrease comes from.

@psychocoderHPC
Member Author

It looks like compile-time parameters are passed at runtime and loaded from global memory.

The PTX below shows a snippet where https://github.com/ComputationalRadiationPhysics/mallocMC/blob/31f9b3c3f1cf919a96312a30ab9aa223f41fed67/src/include/mallocMC/creationPolicies/Scatter_impl.hpp#L74-L98 is invoked:

	// Callseq Start 5
	{
	.reg .b32 temp_param_reg;
	// <end>}
	.param .b64 param0;
	st.param.b64	[param0+0], %rd199;
	.param .b64 retval0;
	call.uni (retval0), 
	_ZN8mallocMC15DeviceAllocatorINS_16CreationPolicies7ScatterIN8picongpu16DeviceHeapConfigENS1_11ScatterConf27DefaultScatterHashingParamsEEENS_20DistributionPolicies4NoopENS_11OOMPolicies10ReturnNullENS_17AlignmentPolicies6ShrinkINSC_12ShrinkConfig19DefaultShrinkConfigEEEE6mallocIN6alpaka3acc12AccGpuCudaRtISt17integral_constantImLm3EEjEEEEPvRKT_m, 
	(
	param0
	);
	ld.param.b64	%rd753, [retval0+0];
	
	//{
	}// Callseq End 5
	setp.ne.s64	%p31, %rd753, 0;
	@%p31 bra 	BB17_31;

@psychocoderHPC
Member Author

I forgot to remove the __noinline__ for CUDA, hence the strange behavior with the global loads.
Nevertheless, even without __noinline__ the code is still slow :-(

@psychocoderHPC
Member Author

It is not a mallocMC issue. The CMake CUDA module does not forward flags from CMAKE_CXX_FLAGS_*, so alpaka must be extended to handle these flags when the CUDA backend is used:

	string(TOUPPER "${CMAKE_BUILD_TYPE}" build_config)
	if(build_config)
	    # CMAKE_CXX_FLAGS_* are strings, not lists; to use them with
	    # foreach we must transform the string into a list.
	    string(REPLACE " " ";" CXX_FLAGS_AS_LIST ${CMAKE_CXX_FLAGS_${build_config}})
	    foreach(_flag ${CXX_FLAGS_AS_LIST})
	        list(APPEND CUDA_NVCC_FLAGS -Xcompiler=${_flag})
	    endforeach()
	endif()
