Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] OpenMP does not work anymore with function call on cython 3.0 #5573

Closed
arthurlm opened this issue Jul 28, 2023 · 6 comments · Fixed by #5577
Closed

[BUG] OpenMP does not work anymore with function call on cython 3.0 #5573

arthurlm opened this issue Jul 28, 2023 · 6 comments · Fixed by #5577

Comments

@arthurlm
Copy link

Describe the bug

After upgrading from Cython 0.29.36 to Cython 3.0.0 produced code just hang infinitly when OpenMP is used and sub-function are called.

Looking at produced code for bellow lines:

for j in prange(J):
    foo_vec(x[:, j], y[:, j])

Using Cython 0.29.36 produced code is:

// for j in prange(J):

  __pyx_t_1 = __pyx_v_J;
  if ((1 == 0)) abort();
  {
      #if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
          #undef likely
          #undef unlikely
          #define likely(x)   (x)
          #define unlikely(x) (x)
      #endif
      __pyx_t_3 = (__pyx_t_1 - 0 + 1 - 1/abs(1)) / 1;
      if (__pyx_t_3 > 0)
      {
          #ifdef _OPENMP
          #pragma omp parallel
          #endif /* _OPENMP */
          {
              #ifdef _OPENMP
              #pragma omp for firstprivate(__pyx_v_j) lastprivate(__pyx_v_j)
              #endif /* _OPENMP */
              for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
                  {
                      __pyx_v_j = (Py_ssize_t)(0 + 1 * __pyx_t_2);

//     foo_vec(x[:, j], y[:, j])

                      __pyx_t_4.data = __pyx_v_x.data;
                      __pyx_t_4.memview = __pyx_v_x.memview;
                      __PYX_INC_MEMVIEW(&__pyx_t_4, 0);
                      __pyx_t_4.shape[0] = __pyx_v_x.shape[0];
__pyx_t_4.strides[0] = __pyx_v_x.strides[0];
    __pyx_t_4.suboffsets[0] = -1;

{
    Py_ssize_t __pyx_tmp_idx = __pyx_v_j;
    Py_ssize_t __pyx_tmp_stride = __pyx_v_x.strides[1];
        __pyx_t_4.data += __pyx_tmp_idx * __pyx_tmp_stride;
}

__pyx_t_5.data = __pyx_v_y.data;
                      __pyx_t_5.memview = __pyx_v_y.memview;
                      __PYX_INC_MEMVIEW(&__pyx_t_5, 0);
                      __pyx_t_5.shape[0] = __pyx_v_y.shape[0];
__pyx_t_5.strides[0] = __pyx_v_y.strides[0];
    __pyx_t_5.suboffsets[0] = -1;

{
    Py_ssize_t __pyx_tmp_idx = __pyx_v_j;
    Py_ssize_t __pyx_tmp_stride = __pyx_v_y.strides[1];
        __pyx_t_5.data += __pyx_tmp_idx * __pyx_tmp_stride;
}

__pyx_f_6my_ext_foo_vec(__pyx_t_4, __pyx_t_5);
                      __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
                      __pyx_t_4.memview = NULL;
                      __pyx_t_4.data = NULL;
                      __PYX_XDEC_MEMVIEW(&__pyx_t_5, 0);
                      __pyx_t_5.memview = NULL;
                      __pyx_t_5.data = NULL;
                  }
              }
          }
      }
  }
  #if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
      #undef likely
      #undef unlikely
      #define likely(x)   __builtin_expect(!!(x), 1)
      #define unlikely(x) __builtin_expect(!!(x), 0)
  #endif

Using Cython 3.0 produced code is:

// for j in prange(J):

  __pyx_t_1 = __pyx_v_J;
  {
      #if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
          #undef likely
          #undef unlikely
          #define likely(x)   (x)
          #define unlikely(x) (x)
      #endif
      __pyx_t_3 = (__pyx_t_1 - 0 + 1 - 1/abs(1)) / 1;
      if (__pyx_t_3 > 0)
      {
          #ifdef _OPENMP
          #pragma omp parallel
          #endif /* _OPENMP */
          {
              #ifdef _OPENMP
              #pragma omp for firstprivate(__pyx_v_j) lastprivate(__pyx_v_j)
              #endif /* _OPENMP */
              for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
                  {
                      __pyx_v_j = (Py_ssize_t)(0 + 1 * __pyx_t_2);

//    foo_vec(x[:, j], y[:, j])

                      __pyx_t_4.data = __pyx_v_x.data;
                      __pyx_t_4.memview = __pyx_v_x.memview;
                      __PYX_INC_MEMVIEW(&__pyx_t_4, 0);
                      __pyx_t_4.shape[0] = __pyx_v_x.shape[0];
__pyx_t_4.strides[0] = __pyx_v_x.strides[0];
    __pyx_t_4.suboffsets[0] = -1;

{
    Py_ssize_t __pyx_tmp_idx = __pyx_v_j;
    Py_ssize_t __pyx_tmp_stride = __pyx_v_x.strides[1];
        __pyx_t_4.data += __pyx_tmp_idx * __pyx_tmp_stride;
}

__pyx_t_5.data = __pyx_v_y.data;
                      __pyx_t_5.memview = __pyx_v_y.memview;
                      __PYX_INC_MEMVIEW(&__pyx_t_5, 0);
                      __pyx_t_5.shape[0] = __pyx_v_y.shape[0];
__pyx_t_5.strides[0] = __pyx_v_y.strides[0];
    __pyx_t_5.suboffsets[0] = -1;

{
    Py_ssize_t __pyx_tmp_idx = __pyx_v_j;
    Py_ssize_t __pyx_tmp_stride = __pyx_v_y.strides[1];
        __pyx_t_5.data += __pyx_tmp_idx * __pyx_tmp_stride;
}

__pyx_f_6my_ext_foo_vec(__pyx_t_4, __pyx_t_5); if (unlikely(__Pyx_ErrOccurredWithGIL())) __PYX_ERR(0, 40, __pyx_L5_error)
                      __PYX_XCLEAR_MEMVIEW(&__pyx_t_4, 0);
                      __pyx_t_4.memview = NULL; __pyx_t_4.data = NULL;
                      __PYX_XCLEAR_MEMVIEW(&__pyx_t_5, 0);
                      __pyx_t_5.memview = NULL; __pyx_t_5.data = NULL;
                      goto __pyx_L8;
                      __pyx_L5_error:;
                      {
                          #ifdef WITH_THREAD
                          PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
                          #endif
                          #ifdef _OPENMP
                          #pragma omp flush(__pyx_parallel_exc_type)
                          #endif /* _OPENMP */
                          if (!__pyx_parallel_exc_type) {
                            __Pyx_ErrFetchWithState(&__pyx_parallel_exc_type, &__pyx_parallel_exc_value, &__pyx_parallel_exc_tb);
                            __pyx_parallel_filename = __pyx_filename; __pyx_parallel_lineno = __pyx_lineno; __pyx_parallel_clineno = __pyx_clineno;
                            __Pyx_GOTREF(__pyx_parallel_exc_type);
                          }
                          #ifdef WITH_THREAD
                          __Pyx_PyGILState_Release(__pyx_gilstate_save);
                          #endif
                      }
                      __pyx_parallel_why = 4;
                      goto __pyx_L7;
                      __pyx_L7:;
                      #ifdef _OPENMP
                      #pragma omp critical(__pyx_parallel_lastprivates0)
                      #endif /* _OPENMP */
                      {
                          __pyx_parallel_temp0 = __pyx_v_j;
                      }
                      __pyx_L8:;
                      #ifdef _OPENMP
                      #pragma omp flush(__pyx_parallel_why)
                      #endif /* _OPENMP */
                  }
              }
              #ifdef _OPENMP
              Py_END_ALLOW_THREADS
              #else
{
#ifdef WITH_THREAD
              PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
              #endif
              #endif /* _OPENMP */
              /* Clean up any temporaries */
              __PYX_XCLEAR_MEMVIEW(&__pyx_t_4, 0);
              __pyx_t_4.memview = NULL; __pyx_t_4.data = NULL;
              __PYX_XCLEAR_MEMVIEW(&__pyx_t_5, 0);
              __pyx_t_5.memview = NULL; __pyx_t_5.data = NULL;
              #ifdef WITH_THREAD
              __Pyx_PyGILState_Release(__pyx_gilstate_save);
              #endif
              #ifndef _OPENMP
}
#endif /* _OPENMP */
          }
      }
      if (__pyx_parallel_exc_type) {
        /* This may have been overridden by a continue, break or return in another thread. Prefer the error. */
        __pyx_parallel_why = 4;
      }
      if (__pyx_parallel_why) {
        __pyx_v_j = __pyx_parallel_temp0;
        switch (__pyx_parallel_why) {
              case 4:
          {
              #ifdef WITH_THREAD
              PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
              #endif
              __Pyx_GIVEREF(__pyx_parallel_exc_type);
              __Pyx_ErrRestoreWithState(__pyx_parallel_exc_type, __pyx_parallel_exc_value, __pyx_parallel_exc_tb);
              __pyx_filename = __pyx_parallel_filename; __pyx_lineno = __pyx_parallel_lineno; __pyx_clineno = __pyx_parallel_clineno;
              #ifdef WITH_THREAD
              __Pyx_PyGILState_Release(__pyx_gilstate_save);
              #endif
          }
          goto __pyx_L1_error;
        }
      }
  }
  #if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
      #undef likely
      #undef unlikely
      #define likely(x)   __builtin_expect(!!(x), 1)
      #define unlikely(x) __builtin_expect(!!(x), 0)
  #endif

I am not expecting any GIL check / any other check in prange body.

Code to reproduce the behaviour:

my_ext.pyx :

cimport cython
from cython.parallel cimport prange

cimport numpy as np

np.import_array()


def foo(np.ndarray x not None):
    cdef int ndim = np.PyArray_NDIM(x)
    y = np.PyArray_EMPTY(ndim, np.PyArray_DIMS(x), np.NPY_FLOAT64, 0)

    if ndim == 1:
        foo_vec(x, y)
    elif ndim == 2:
        foo_mat(x, y)

    return y


@cython.boundscheck(False)
@cython.wraparound(False)
cdef void foo_vec(const double[:] x, double[:] y) nogil:
    cdef:
        Py_ssize_t I = x.shape[0]
        Py_ssize_t i

    for i in range(I):
        y[i] = x[i] * 2.0 + 1.0  # Do whatever computation on vec


@cython.boundscheck(False)
@cython.wraparound(False)
cdef void foo_mat(const double[:, :] x,  double[:, :] y) nogil:
    cdef:
        Py_ssize_t J = x.shape[1]
        Py_ssize_t j
    
    for j in prange(J):
        foo_vec(x[:, j], y[:, j])

setup.py :

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize(
        [Extension(
            "my_ext", ["my_ext.pyx"],
            include_dirs=[np.get_include()],
            extra_compile_args=["-fopenmp"],
            extra_link_args=["-fopenmp"],
            define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
        )],
        language_level=3,
        annotate=True,
    )
)

Expected behaviour

I am expecting code produced code that:

  • does not require any GIL check
  • works correctly on latest Cython version😀

OS

Linux

Python version

3.10.6

Cython version

3.0.0

Additional context

No response

@arthurlm
Copy link
Author

I finaly found that the noexcept is now required before nogil qualifier to tell cython that cdef function will not raise any exception.

It fix this issue.

@da-woods
Copy link
Contributor

da-woods commented Jul 28, 2023

@arthurlm This looks like a simpler duplicate of #5564. So thanks for the example - it'll hopefully be easier to investigate

It shouldn't be hanging without noexcept - this is a bug that I hope to fix in the near future. Although noexcept may well give a speed improvement.

@da-woods
Copy link
Contributor

da-woods commented Jul 29, 2023

Slightly alarmingly, I now don't understand how this ever worked....

As far as I can tell we release the GIL inside the omp parallel block (so we release the GIL for every thread), and we reacquire the GIL inside the the omp parallel block (as at the end we require every thread to hold the GIL).

This isn't true I think... ignore

da-woods added a commit to da-woods/cython that referenced this issue Jul 29, 2023
Essentially the problem is that a `nogil` function just promises
that the function *can* be called without the GIL, not that
it actually doesn't have the GIL.

If you call `prange`/`parallel` inside that function while still
holding the GIL, if any thread has finished the GIL will be held
(... I think it's probably more of a mess than this, but mostly
you get away with it...). If a thread is still running and tries
to acquire the GIL then it'll deadlock permenantly.

I've solved this by adding an implicit `with nogil` around any
`prange`/`parallel` that's in a `nogil` block but isn't
certain that it has the GIL.

Fixes cython#5573 and cython#5564.
@da-woods
Copy link
Contributor

Got to the bottom of this.

I believe there's a subtle error in your code: a nogil function says that it can be run without the GIL, but doesn't guarantee that it's run without the GIL. prange blocks must be run without the GIL (although you get away with it most of the time unless you hit something that needs to re-acquire the GIL). In your case you don't release the GIL before calling foo_mat, so one thread holds the GIL for the whole time.

Frankly that's so subtle that I'm not sure anyone can be expected to spot it. Therefore my proposed fix is to get rid of the requirement and make absolutely 100% sure that the GIL is released before entering a prange block.

@arthurlm
Copy link
Author

arthurlm commented Aug 1, 2023

Thanks for your answer and your fix.
gil / nogil / with gil / with nogil is more clear to me now.

However thinking a little bit more on this, do we really want to implicitly add with nogil block automatically ?
Does this should not be a compiler error instead ? And we should let the programmer decide where he want to add the block ?
For example, in more complex case: the prange / parallel block could be called from function that already have done the check (or also use prange). So always adding the check might cause a small performances slow-down.

I am not saying there is a perfect solution. I just want to explore all the possible fixes and improve my code 😀

@da-woods
Copy link
Contributor

da-woods commented Aug 2, 2023

However thinking a little bit more on this, do we really want to implicitly add with nogil block automatically ?
Does this should not be a compiler error instead?

I think if I was starting from scratch that might be how I'd do it. However, making it a compiler error now would break quite a bit of existing code, and probably require us to create a way to annotate a function to say "this function MUST be called without the GIL".

For example, in more complex case: the prange / parallel block could be called from function that already have done the check (or also use prange). So always adding the check might cause a small performances slow-down.

Yes, that's very true. I think it's likely low cost compared to setting up a parallel/prange block. But I could be wrong.

Again, we could provide some way for the user to turn off the check, but that makes it more complicated.

da-woods added a commit that referenced this issue Aug 21, 2023
* Fix issues with parallel/exception checked functions

Essentially the problem is that a `nogil` function just promises
that the function *can* be called without the GIL, not that
it actually doesn't have the GIL.

If you call `prange`/`parallel` inside that function while still
holding the GIL, if any thread has finished the GIL will be held
(... I think it's probably more of a mess than this, but mostly
you get away with it...). If a thread is still running and tries
to acquire the GIL then it'll deadlock permenantly.

I've solved this by adding an implicit `with nogil` around any
`prange`/`parallel` that's in a `nogil` block but isn't
certain that it has the GIL.

Fixes #5573 and #5564.

* Add extra field in pxd

* Separate comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants