Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel argument order for cooperative launches is reversed #40

Closed
codecircuit opened this issue Feb 13, 2018 · 10 comments
Closed

Kernel argument order for cooperative launches is reversed #40

codecircuit opened this issue Feb 13, 2018 · 10 comments

Comments

@codecircuit
Copy link
Contributor

Compile and execute:

#include <iostream>

#include "cuda/api_wrappers.h"

__global__ void foo(int a, int b, int c) {
	if (threadIdx.x == 0) {
		printf("a = %d, b = %d, c = %d\n", a, b, c);
	}
	__syncthreads();
}


int main() {
	constexpr cuda::grid_block_dimension_t block_dim = 32;
	constexpr cuda::grid_dimension_t grid_dim = 1;
	int a = 1;
	int b = 2;
	int c = 3;
	std::cout << "Launching non-cooperative" << std::endl;
	auto config = cuda::make_launch_config(grid_dim, block_dim);
	cuda::enqueue_launch(cuda::thread_blocks_may_not_cooperate,
	                     foo,
	                     cuda::stream::default_stream_id,
	                     config,
	                     a, b, c);
	auto device = cuda::device::current::get();
	device.synchronize();

	std::cout << "Launching cooperative" << std::endl;
	cuda::enqueue_launch(cuda::thread_blocks_may_cooperate,
	                     foo,
	                     cuda::stream::default_stream_id,
	                     config,
	                     a, b, c);
	device.synchronize();
}

Expected output:

Launching non-cooperative
a = 1, b = 2, c = 3
Launching cooperative
a = 1, b = 2, c = 3

Actual output:

Launching non-cooperative
a = 1, b = 2, c = 3
Launching cooperative
a = 3, b = 2, c = 1
@eyalroz
Copy link
Owner

eyalroz commented Feb 13, 2018

Not seeing this with CUDA 9.1 and a CC 3.0 device, nor with CUDA 8.0 and a CC 6.1 device.

But - seeing this with CUDA 9.1 and a CC 6.1 device. Hmm, I have an idea.

@lahwaacz
Copy link

I'd say here is an undefined behaviour.

@eyalroz
Copy link
Owner

eyalroz commented Feb 13, 2018

@lahwaacz : This is a variation on a common and beloved trick by Sean Parent. But maybe somehow it's reversed...

@lahwaacz
Copy link

Right in the first post he says:

Very useful if you don't care about evaluation order

@eyalroz
Copy link
Owner

eyalroz commented Feb 13, 2018

@lahwaacz : You're right...

So now I have to generate an index sequence and use it to place each parameter in its correct position, all in C++11. Annoying :-(

@lahwaacz
Copy link

lahwaacz commented Feb 13, 2018

So now I have to generate an index sequence and use it to place each parameter in its correct position, all in C++11. Annoying :-(

What? No you don't... Why people insist on inventing some "cool" one-liners (or "few-liners") all the time instead of writing simple readable code is beyond me.

template <class F, class Arg>
void for_each_argument(F f, Arg&& arg) {
    f(std::forward<Arg>(arg));
}

template <class F, class Arg, class... Args>
void for_each_argument(F f, Arg&& arg, Args&&... args) {
    for_each_argument(f, std::forward<Arg>(arg));
    for_each_argument(f, std::forward<Args>(args)...);
}

Haven't bothered with trying to compile it, but you get the idea.

@eyalroz
Copy link
Owner

eyalroz commented Feb 13, 2018

@lahwaacz : Ah, right, ok, I guess I could do that. I was trying to be snazzy and not instantiate recursively all the way. But I guess it's not that critical - until people write kernels with 100 parameters and call them with many different types.

@lahwaacz
Copy link

I bet that after inlining all the lambdas the result will be essentially identical to your original version (which executed in the correct order). As for compilation time, I don't see how you could do less than O(N) instantiations when you actually need to call f with each of the N arguments. As a bonus to readability, my version features a reasonably small constant factor due to lack of almost any wrapping code.

@lahwaacz
Copy link

lahwaacz commented Feb 14, 2018

Since you seem obsessed with the number of instantiations, let me point out that your commit message in 46db0c4 is wrong in claiming that "[you] did reduce the number of instantiations to about one per 4 arguments rather than one-per-one". For example, if you call collect_argument_addresses with 4 arguments (generally each of a different type), the compiler also has to instantiate collect_argument_addresses for 2 arguments twice, and for each of them 2 instances of collect_argument_addresses with a single argument. That's 7 instantiations in total, which is exactly the same as what would give my implementation which seemed too naive to you. Both versions could be improved by replacing all recursive calls with the fixed number of parameters by the body of the single-argument function.

@eyalroz
Copy link
Owner

eyalroz commented Feb 14, 2018

I was mostly obsessing about larger number of parameters. But - maybe you're right. I'll go ahead and simplify the code like you suggested.

eyalroz pushed a commit that referenced this issue Feb 14, 2018
* Taking Jakub Klinkovsky's suggestion and simplifying the address collection code a bit; so - not trying to reduce the number of instantiations for super-high number of arguments, but actually reducing them for a smaller number of arguments relative to the previous commit (Not all the way though).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants