PyOpenCL: values only computed correctly for first worker in the group #54

josiahw · 2018-03-12T05:27:18Z

I'm testing out some basic PyOpenCL examples (PyOpenCL detects the library and parameters correctly). The demo benchmark looked like a nice simple kernel to try:
https://raw.githubusercontent.com/inducer/pyopencl/master/examples/benchmark.py

The value is only reliably assigned for the first worker in each group. Writing a constant value of 1 to the output results in 1s being written only for the first worker in each group (the rest of the values are 0). This might be a write issue, or it might be a problem with the call to get_global_id(0).

A secondary issue is that the OpenCL driver only allows work-group sizes that are powers of 2, so I cannot assign 12 workers here, only 8:
Traceback (most recent call last):
  File "opencl_test.py", line 91, in
    exec_evt = prg.sum(queue, global_size, local_size, a_buf, b_buf, dest_buf)
  File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 1766, in call
    return self._enqueue(self, queue, global_size, local_size, *args, **kwargs)
  File "", line 90, in enqueue_knl_sum
  File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 1952, in enqueue_nd_range_kernel
    global_work_size, local_work_size, c_wait_for, num_wait_for))
  File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 664, in _handle_error
    raise e
pyopencl.cffi_cl.LogicError: clEnqueueNDRangeKernel failed: INVALID_WORK_GROUP_SIZE

I'm unsure if there's an incorrect assumption in the pyopencl code somewhere... but this reduces total throughput by a third. With 8 workers the result is:
gpu: 2.32745s
cpu: 0.469361066818s

doe300 · 2018-03-13T18:21:38Z

Can you please give more details, e.g. expected vs. actual results?

Also, about the work-group size: VC4CL only requires that the local size divides the global size and is at most 12. So if the global size is a power of two, the local size needs to be one too. Any other restriction would have to come from pyopencl.
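
The rule above (the local size must divide the global size and be at most 12) can be sketched as a small host-side helper. This is only an illustration of that rule: `pick_local_size`, `round_up_global` and `MAX_LOCAL_SIZE` are hypothetical names, not part of VC4CL or PyOpenCL.

```python
# Hypothetical helpers, not part of VC4CL or PyOpenCL.
# Assumption (taken from the comment above): a local size is accepted
# when it is at most 12 and divides the global size evenly.
MAX_LOCAL_SIZE = 12


def pick_local_size(global_size, max_local=MAX_LOCAL_SIZE):
    """Return the largest local size <= max_local that divides global_size."""
    if global_size < 1:
        raise ValueError("global_size must be positive")
    for local in range(min(max_local, global_size), 0, -1):
        if global_size % local == 0:
            return local
    return 1  # not reached: 1 divides every positive integer


def round_up_global(global_size, local_size):
    """Round global_size up to the next multiple of local_size."""
    return ((global_size + local_size - 1) // local_size) * local_size


print(pick_local_size(1024))      # power-of-two global size
print(pick_local_size(1200))      # global size divisible by 12
print(round_up_global(1024, 12))  # padded global size for 12 workers
```

Under this assumption a global size of 1024 yields a local size of 8, which matches the behaviour reported above, while a global size that is a multiple of 12 would allow all 12 workers. Padding the global size up to a multiple of 12 is another option, but the kernel would then need a bounds check so the extra work-items do not write past the end of the buffer.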

josiahw · 2018-03-17T06:42:51Z

I just tested with a fresh code pull. The assignment is still an issue.

Here's the minimal kernel to reproduce:
__kernel void sum(__global float *c)
{
    int gid = get_global_id(0);
    c[gid] = 1.f;
}

Expected result: array of [1.0, 1.0, 1.0, 1.0, 1.0, 1.0... ]
Actual result: array of [1.0, 0.0, 0.0, 0.0, 0.0, 0.0... ]
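
For anyone wanting to check the same pattern, here is a minimal host-side sketch, assuming a one-dimensional launch with no global offset and a global size that is a multiple of the local size. `correct_by_local_id` and `run_kernel` are hypothetical names introduced for illustration. `run_kernel` needs PyOpenCL, NumPy and a working OpenCL device; the default sizes of 16 and 8 are arbitrary.

```python
KERNEL_SRC = """
__kernel void sum(__global float *c)
{
    int gid = get_global_id(0);
    c[gid] = 1.f;
}
"""


def correct_by_local_id(result, expected, local_size):
    """Count, per local id, how many work-items wrote the expected value.

    Assumes a 1D launch without a global offset, so that
    local id == global id % local_size.
    """
    counts = [0] * local_size
    for gid, value in enumerate(result):
        if value == expected:
            counts[gid % local_size] += 1
    return counts


def run_kernel(n=16, local_size=8):
    """Run the kernel above and return the output as a Python list."""
    # Imported here so the helper above also works without PyOpenCL.
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    result = np.zeros(n, dtype=np.float32)
    mf = cl.mem_flags
    # Initialise the device buffer with zeros so unwritten slots stay 0.
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=result)
    prg = cl.Program(ctx, KERNEL_SRC).build()
    prg.sum(queue, (n,), (local_size,), buf)
    cl.enqueue_copy(queue, result, buf)
    return result.tolist()


# Synthetic data mimicking the report: only the first worker of each
# of two groups of 8 wrote its value.
broken = ([1.0] + [0.0] * 7) * 2
print(correct_by_local_id(broken, 1.0, 8))
```

On a device, `correct_by_local_id(run_kernel(), 1.0, 8)` should return the number of groups for every local id when the kernel works. A non-zero count only at index 0 corresponds to the failure described here.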

doe300 · 2018-03-17T09:04:45Z

I fixed some host-side issues related to this. Can you re-test?

josiahw · 2018-03-17T09:54:39Z

I can confirm this is now fixed. :-)

doe300 · 2018-03-17T10:40:06Z

Very good:)

doe300 added a commit to doe300/VC4CL that referenced this issue Mar 14, 2018

Fixes host-side counting of UNIFORMs, see doe300/VC4C#44 and doe300/V…

af97b17

…C4C#54

doe300 added the bug label Mar 17, 2018

doe300 self-assigned this Mar 17, 2018

doe300 added a commit to doe300/VC4CL that referenced this issue Mar 17, 2018

Fixes UNIFORM base-pointer offset for additional QPUs, dump analyzer …

67c6753

…now handles optimized implicit UNIFORMs correctly, see #27, doe300/VC4C#54

doe300 closed this as completed Mar 17, 2018
