Kernel argument order for cooperative launches is reversed #40

codecircuit · 2018-02-13T12:00:46Z

Compile and execute:

#include <iostream>

#include "cuda/api_wrappers.h"

__global__ void foo(int a, int b, int c) {
	if (threadIdx.x == 0) {
		printf("a = %d, b = %d, c = %d\n", a, b, c);
	}
	__syncthreads();
}


int main() {
	constexpr cuda::grid_block_dimension_t block_dim = 32;
	constexpr cuda::grid_dimension_t grid_dim = 1;
	int a = 1;
	int b = 2;
	int c = 3;
	std::cout << "Launching non-cooperative" << std::endl;
	auto config = cuda::make_launch_config(grid_dim, block_dim);
	cuda::enqueue_launch(cuda::thread_blocks_may_not_cooperate,
	                     foo,
	                     cuda::stream::default_stream_id,
	                     config,
	                     a, b, c);
	auto device = cuda::device::current::get();
	device.synchronize();

	std::cout << "Launching cooperative" << std::endl;
	cuda::enqueue_launch(cuda::thread_blocks_may_cooperate,
	                     foo,
	                     cuda::stream::default_stream_id,
	                     config,
	                     a, b, c);
	device.synchronize();
}

Expected output:

Launching non-cooperative
a = 1, b = 2, c = 3
Launching cooperative
a = 1, b = 2, c = 3

Actual output:

Launching non-cooperative
a = 1, b = 2, c = 3
Launching cooperative
a = 3, b = 2, c = 1

The text was updated successfully, but these errors were encountered:

eyalroz · 2018-02-13T21:56:57Z

Not seeing this with CUDA 9.1 and a CC 3.0 device, nor with CUDA 8.0 and a CC 6.1 device.

But - seeing this with CUDA 9.1 and a CC 6.1 device. Hmm, I have an idea.

lahwaacz · 2018-02-13T22:18:38Z

I'd say here is an undefined behaviour.

eyalroz · 2018-02-13T22:29:40Z

@lahwaacz : This is a variation on a common and beloved trick by Sean Parent. But maybe somehow it's reversed...

lahwaacz · 2018-02-13T22:33:30Z

Right in the first post he says:

Very useful if you don't care about evaluation order

eyalroz · 2018-02-13T22:49:18Z

@lahwaacz : You're right...

So now I have to generate an index sequence and use it to place each parameter in its correct position, all in C++11. Annoying :-(

lahwaacz · 2018-02-13T23:07:33Z

So now I have to generate an index sequence and use it to place each parameter in its correct position, all in C++11. Annoying :-(

What? No you don't... Why people insist on inventing some "cool" one-liners (or "few-liners") all the time instead of writing simple readable code is beyond me.

template <class F, class Arg>
void for_each_argument(F f, Arg&& arg) {
    f(std::forward<Arg>(arg));
}

template <class F, class Arg, class... Args>
void for_each_argument(F f, Arg&& arg, Args&&... args) {
    for_each_argument(f, std::forward<Arg>(arg));
    for_each_argument(f, std::forward<Args>(args)...);
}

Haven't bothered with trying to compile it, but you get the idea.

eyalroz · 2018-02-13T23:12:22Z

@lahwaacz : Ah, right, ok, I guess I could do that. I was trying to be snazzy and not instantiate recursively all the way. But I guess it's not that critical - until people write kernels with 100 parameters and call them with many different types.

lahwaacz · 2018-02-13T23:29:12Z

I bet that after inlining all the lambdas the result will be essentially identical to your original version (which executed in the correct order). As for compilation time, I don't see how you could do less than O(N) instantiations when you actually need to call f with each of the N arguments. As a bonus to readability, my version features a reasonably small constant factor due to lack of almost any wrapping code.

lahwaacz · 2018-02-14T07:41:18Z

Since you seem obsessed with the number of instantiations, let me point out that your commit message in 46db0c4 is wrong in claiming that "[you] did reduce the number of instantiations to about one per 4 arguments rather than one-per-one". For example, if you call collect_argument_addresses with 4 arguments (generally each of a different type), the compiler also has to instantiate collect_argument_addresses for 2 arguments twice, and for each of them 2 instances of collect_argument_addresses with a single argument. That's 7 instantiations in total, which is exactly the same as what would give my implementation which seemed too naive to you. Both versions could be improved by replacing all recursive calls with the fixed number of parameters by the body of the single-argument function.

eyalroz · 2018-02-14T10:02:21Z

I was mostly obsessing about larger number of parameters. But - maybe you're right. I'll go ahead and simplify the code like you suggested.

* Taking Jakub Klinkovsky's suggestion and simplifying the address collection code a bit; so - not trying to reduce the number of instantiations for super-high number of arguments, but actually reducing them for a smaller number of arguments relative to the previous commit (Not all the way though).

eyalroz closed this as completed in 46db0c4 Feb 13, 2018

codecircuit mentioned this issue Feb 14, 2018

Kernel gets wrong argument values #42

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kernel argument order for cooperative launches is reversed #40

Kernel argument order for cooperative launches is reversed #40

codecircuit commented Feb 13, 2018

eyalroz commented Feb 13, 2018 •

edited

Loading

lahwaacz commented Feb 13, 2018

eyalroz commented Feb 13, 2018

lahwaacz commented Feb 13, 2018

eyalroz commented Feb 13, 2018

lahwaacz commented Feb 13, 2018 •

edited

Loading

eyalroz commented Feb 13, 2018

lahwaacz commented Feb 13, 2018

lahwaacz commented Feb 14, 2018 •

edited

Loading

eyalroz commented Feb 14, 2018

Kernel argument order for cooperative launches is reversed #40

Kernel argument order for cooperative launches is reversed #40

Comments

codecircuit commented Feb 13, 2018

eyalroz commented Feb 13, 2018 • edited Loading

lahwaacz commented Feb 13, 2018

eyalroz commented Feb 13, 2018

lahwaacz commented Feb 13, 2018

eyalroz commented Feb 13, 2018

lahwaacz commented Feb 13, 2018 • edited Loading

eyalroz commented Feb 13, 2018

lahwaacz commented Feb 13, 2018

lahwaacz commented Feb 14, 2018 • edited Loading

eyalroz commented Feb 14, 2018

eyalroz commented Feb 13, 2018 •

edited

Loading

lahwaacz commented Feb 13, 2018 •

edited

Loading

lahwaacz commented Feb 14, 2018 •

edited

Loading