PyOpenCL: values only computed correctly for first worker in the group #54

Closed
josiahw opened this issue Mar 12, 2018 · 5 comments

josiahw commented Mar 12, 2018

I'm testing out some basic PyOpenCL examples (PyOpenCL detects the library and parameters correctly). The demo benchmark looked like a nice simple kernel to try:
https://raw.githubusercontent.com/inducer/pyopencl/master/examples/benchmark.py

Only the first work-item in each work-group reliably gets its value written. Even when the kernel writes a constant 1 to the output, only the element belonging to the first work-item of each group ends up as 1; the rest of the values stay 0. This might be a problem with the write itself, or with the call to get_global_id(0).

A secondary issue is that the OpenCL driver only seems to accept work-group sizes that are powers of 2, so I cannot use 12 work-items per group here, only 8:
Traceback (most recent call last):
  File "opencl_test.py", line 91, in <module>
    exec_evt = prg.sum(queue, global_size, local_size, a_buf, b_buf, dest_buf)
  File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 1766, in __call__
    return self._enqueue(self, queue, global_size, local_size, *args, **kwargs)
  File "", line 90, in enqueue_knl_sum
  File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 1952, in enqueue_nd_range_kernel
    global_work_size, local_work_size, c_wait_for, num_wait_for))
  File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 664, in _handle_error
    raise e
pyopencl.cffi_cl.LogicError: clEnqueueNDRangeKernel failed: INVALID_WORK_GROUP_SIZE

I'm unsure whether there's an incorrect assumption somewhere in the PyOpenCL code, but this restriction reduces total throughput by a third (8 work-items per group instead of 12). With 8 work-items the result is:
gpu: 2.32745s
cpu: 0.469361066818s

doe300 (Owner) commented Mar 13, 2018

Can you please give more details, e.g. expected vs. actual results?

Also, about the work-group size: VC4CL only asserts that the local size divides the global size and is at most 12. So if the global size is a power of two, the local size needs to be a power of two as well. Any other restriction may be applied by PyOpenCL.
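The rule described in this comment (the local size must divide the global size and be at most 12) can be sketched in a few lines of Python. This is only an illustration of that rule; the helper name `largest_valid_local_size` and the example global sizes are hypothetical and are not part of VC4CL or PyOpenCL.

```python
def largest_valid_local_size(global_size, max_local_size=12):
    """Return the largest local size that divides global_size evenly
    and does not exceed max_local_size (12 is the limit quoted above)."""
    if global_size < 1:
        raise ValueError("global_size must be positive")
    # Try candidates from the largest allowed value downwards.
    for candidate in range(min(max_local_size, global_size), 0, -1):
        if global_size % candidate == 0:
            return candidate
    return 1  # not reached: 1 divides every positive integer


# Illustrative global sizes (assumed values, not taken from the issue):
print(largest_valid_local_size(1024))  # power of two, so 12 is rejected
print(largest_valid_local_size(1200))  # multiple of 12, so 12 is accepted
```

Under this rule, a power-of-two global size can never be split into groups of 12, which would explain why 8 is accepted while 12 fails with INVALID_WORK_GROUP_SIZE. Choosing a global size that is a multiple of 12 would be one way to use all 12 work-items per group.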

josiahw (Author) commented Mar 17, 2018

I just tested with a fresh code pull. The assignment is still an issue.

Here's the minimal kernel to reproduce:
__kernel void sum(__global float *c)
{
    int gid = get_global_id(0);
    c[gid] = 1.f;
}

Expected result: array of [1.0, 1.0, 1.0, 1.0, 1.0, 1.0... ]
Actual result: array of [1.0, 0.0, 0.0, 0.0, 0.0, 0.0... ]
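To check whether the wrong values follow the "only the first work-item in each group" pattern, the output copied back from the device can be grouped by local ID. The sketch below is a hypothetical pure-Python helper (the name `failing_local_ids` is made up for illustration), and the data it runs on is synthetic, not output from the actual device. It assumes a one-dimensional range with no global offset, so that the local ID is the global ID modulo the local size.

```python
def failing_local_ids(values, local_size, expected=1.0):
    """Count, per local ID, how many work-groups hold a wrong value.

    Assumes a 1-D NDRange without global offset, so that
    local_id == global_id % local_size.
    """
    failures = {}
    for global_id, value in enumerate(values):
        if value != expected:
            local_id = global_id % local_size
            failures[local_id] = failures.get(local_id, 0) + 1
    return failures


# Synthetic output mimicking the reported symptom: only the first
# element of each group of 8 was written (4 groups, 32 elements).
local_size = 8
values = [1.0 if i % local_size == 0 else 0.0 for i in range(32)]
print(failing_local_ids(values, local_size))
```

If local ID 0 never appears in the result while every other local ID fails in every group, the symptom matches the description in the first comment. An empty dictionary means every element holds the expected value.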

doe300 added the bug label Mar 17, 2018
doe300 self-assigned this Mar 17, 2018
doe300 added a commit to doe300/VC4CL that referenced this issue Mar 17, 2018
…now handles optimized implicit UNIFORMs correctly, see #27, doe300/VC4C#54
doe300 (Owner) commented Mar 17, 2018

I fixed some host-side issues regarding this, can you re-test?

josiahw (Author) commented Mar 17, 2018

I can confirm this is now fixed. :-)

doe300 (Owner) commented Mar 17, 2018

Very good:)

doe300 closed this as completed Mar 17, 2018