
Errors using CL_DEVICE_TYPE_CPU on Mac OSX Yosemite #158

Open
lajash opened this issue Jan 29, 2015 · 23 comments

@lajash

lajash commented Jan 29, 2015

Running the compiler_bug.cpp from the command line using the following :
g++ -o compiler_bug compiler_bug.cpp -std=c++0x -I OpenHeaders -framework OpenCL &&
./compiler_bug
Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz
OpenCL compilation error
CVMS_ERROR_COMPILER_FAILURE: CVMS compiler has crashed or hung building an element.
clBuildProgram

Running stencil operators on the CPU also causes crashes. The same code runs on the GPU.

I'm testing this on a Macbook Pro with an i7 CPU and a ATI Radeon HD 6750M GPU.

@ddemidov
Owner

Can you share the source of compiler_bug.cpp here?

@lajash
Author

lajash commented Jan 29, 2015

Hi Denis,

It's the one from your gist:
https://gist.github.com/ddemidov/8681608

Thanks,
Rajesh


@ddemidov
Owner

In that case it's a known issue (see #92). That issue has never been resolved as far as I know. I would raise a ticket with Apple support, since the code in the gist is standard OpenCL and should work.

Edit: Also, I don't have access to a Mac OS X machine, so there is not much I can do here.

@lajash
Author

lajash commented Jan 29, 2015

Ah ok.. I’ve looked at #92. Will change the filter to CL_DEVICE_TYPE_GPU for the Mac platform for now. Not sure if this applies to the new Macs though.

Thanks for your help,

Rajesh.


@lajash
Author

lajash commented Jan 29, 2015

Closing this issue as it seems to be a bug in the Apple OpenCL framework. FYI, the 10.10.2 update also does not fix this.

@lajash lajash closed this as completed Jan 29, 2015
@ddemidov
Owner

Thank you for the info! If you are able to find a workaround, I would be glad to either introduce it to vexcl or accept a pull request.

@lajash
Author

lajash commented Jan 29, 2015

Looking into it… will keep you updated.


@lajash
Author

lajash commented Jan 29, 2015

Denis,
Interestingly, the default programs provided in the Xcode samples all run fine on the CPU. Just looked at the compiler and program options and don't see anything special there either.

Could it be workgroup/queue related? Never mind, I'll check it myself without bothering you.

@lajash lajash reopened this Jan 29, 2015
@ddemidov
Owner

I think Apple's OpenCL implementation does not support workgroups of more than one workitem on CPUs (vexcl takes this restriction into account for kernels on CPU devices), but in the gist the kernel is never launched, since it fails at the compilation step.

@lajash
Author

lajash commented Jan 29, 2015

Here's what works on the CPU thus far ... (taken from your examples, of course):

#include <iostream>
#include <vector>
#include <string>
#include <stdexcept>

#define __CL_ENABLE_EXCEPTIONS
#include <vexcl/vexcl.hpp>

//---------------------------------------------------------------------------
int main() 
{
    const size_t n = 1024 * 1024;
    vex::Context ctx( vex::Filter::Type(CL_DEVICE_TYPE_CPU) );

    std::vector<double> a(n, 1.0);
    std::vector<double> c(n, 0.5);

    std::vector<double> results(n);

    vex::vector<double> A(ctx.queue(), a);
    vex::vector<double> B(ctx.queue(), n);
    vex::vector<double> C(ctx.queue(), c);

    A = (B + C) / 5;
    vex::copy(A, results);
    std::cout << "Size: " << results.size() << std::endl;
    std::cout << "First 2: " << results[0] << "," << results[1] << std::endl;

    VEX_FUNCTION(double, squared_radius, (double, x)(double, y),
    return x * x + y * y;
    );

    A = sqrt(squared_radius(B, C));
    vex::copy(A, results);
    std::cout << "Size: " << results.size() << std::endl;
    std::cout << "First 2: " << results[0] << "," << results[1] << std::endl;    

    VEX_STENCIL_OPERATOR(S, /*return type:*/double, /*window width:*/3, /*center:*/1,
    "return sin(X[0] - X[-1]) + sin(X[1] - X[0]);", ctx);
    A = S(A);

    vex::copy(A, results);
    std::cout << "Size: " << results.size() << std::endl;
    std::cout << "First 2: " << results[0] << "," << results[1] << std::endl;    
}

I'm going to try writing a few custom kernels that use the stencil window to see if I can break anything. Let me know if this makes any sense.

@ddemidov
Owner

I think it would make more sense to run unit tests distributed with vexcl. You can do this with

cd $VEXCL_ROOT
mkdir build
cd build
cmake ..
make
OCL_DEVICE=i7 VEXCL_SHOW_KERNELS=1 make test

After that the test log may be found at Testing/Temporary/LastTest.log. If you upload it to e.g. gist.github.com, we could concentrate on the failing tests.

@lajash
Author

lajash commented Jan 30, 2015

Hi Denis, here goes ...
https://gist.github.com/lajash/59d9a2f489d2aa05f1e9

@ddemidov
Owner

So the failing kernels I can see are:

  • In vector/multivector arithmetics:
#if defined(cl_khr_fp64)
#  pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
#  pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif

kernel void vexcl_vector_kernel
(
  ulong n,
  global double * prm_1
)
{
  ulong chunk_size  = (n + get_global_size(0) - 1) / get_global_size(0);
  ulong chunk_start = get_global_id(0) * chunk_size;
  ulong chunk_end   = chunk_start + chunk_size;
  if (n < chunk_end) chunk_end = n;
  for(ulong idx = chunk_start; idx < chunk_end; ++idx)
  {
    prm_1[idx] = 42;
  }
}

This is the kernel from the gist above. Btw, I've had another idea worth testing about this kernel, see below.

  • Boost.Compute integration example (sort function call), which fails due to wrong workgroup size (Apple only supports workgroups with a single item on CPUs). I would run unit tests from boost.compute and report any failures to @kylelutz.
  • FFT test, which could also be due to wrong workgroup size. I'll see if using a workgroup with a single item makes any sense there.

About the failing assignment test: it occurred to me this could be due to a type mismatch: prm_1[idx] is double, and 42 is int. Could you please try to compile and run the code from the gist again? I have updated it to use 42.0 instead of 42.

@lajash
Author

lajash commented Jan 30, 2015

Will do … will get back to you shortly…


@lajash
Author

lajash commented Jan 30, 2015

After using the updated compiler_bug.cpp from your gist,
Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz
OpenCL compilation error
CVMS_ERROR_SERVICE_FAILURE: CVMS compiler has crashed or hung managing the service.
clBuildProgram

So basically, it makes no difference.

@ddemidov
Owner

I don't see anything wrong with this kernel, and it does work with every other OpenCL platform I have access to. I believe there is nothing left to do here but open an issue with Apple support.

Regarding the FFT issue: it does work correctly when the workgroup size is set to 1. So could you please check whether the fft tests pass for you with branch issue-158-fft?

Note however that (according to examples/fft_benchmark.cpp) VexCL's implementation of FFT is about two orders of magnitude slower than fftw on a CPU, so there is probably no reason to use it with a CPU anyway.

@lajash
Author

lajash commented Jan 30, 2015

You're right ... there's an issue with the amount of data being transferred. See this gist: https://gist.github.com/lajash/1645b473676633b35d9e

clEnqueueNDRangeKernel issue with a larger dataset.

@lajash
Author

lajash commented Jan 30, 2015

Checking fft now ... you're right, it makes no sense to use vex::fft if no GPU is involved. Will you be working on optimizing it in the future?

@ddemidov
Owner

Re fft optimization: I don't think it makes sense when fftw is available. On a CPU one can just map the device memory to a host pointer and then use fftw (or any other host-side algorithm) on device vectors (see the example here). Also, the FFT implementation was provided by @neapel, so he could probably chime in here.

@lajash
Author

lajash commented Jan 30, 2015

New test log added here.... https://gist.github.com/lajash/991c1bd6a1fc9d3ffa95

Doesn't look like it fixed anything though. :(

@lajash
Author

lajash commented Jan 30, 2015

Quick update ... the code at https://gist.github.com/lajash/1645b473676633b35d9e runs on my Mac now ... just tried running it multiple times and voila, it runs in 1 out of 4 tries ... but it's extremely slow!! So it looks like it's Apple's ICD that may have an issue with the CPU. This works on all other platforms, I presume?

@ddemidov
Owner

Your result vector is 100 times smaller than A. So you should get an out-of-bounds error and a segfault here.

Stencil operators use a slow path when run on a CPU, which may be significant for wide stencils like yours. The function-based variant works 1.5× faster on my CPU than the stencil-based one, and is only slightly slower (about 15%) on a GPU.

@lajash
Author

lajash commented Jan 30, 2015

Thanks Denis …. will use function variants …

