Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing declaration of inner OpenMP privates #3348

Merged
merged 5 commits into from Feb 13, 2020

Conversation

fschlimb
Copy link
Contributor

@fschlimb fschlimb commented Feb 7, 2020

Currently inner/nested pranges are not translated to an active OpenMP parallel for.
The private temporary variables identified for these inner loops were only declared in the

#if 0
#pragma omp for private....
#endif

but not anywhere else. This apparently causes races.
The suggested code propagates the privatization to the outer construct so that it gets actually exposed.

@fschlimb
Copy link
Contributor Author

fschlimb commented Feb 7, 2020

@ogrisel

@fschlimb fschlimb closed this Feb 7, 2020
@fschlimb fschlimb reopened this Feb 7, 2020
@ogrisel
Copy link
Contributor

ogrisel commented Feb 7, 2020

Unfortunately I am not familiar enough with the internals of Cython to comment on this PR. But as I was pinged, let me just suggest to write a new non-regression test for this fix :)

@fschlimb
Copy link
Contributor Author

I just provided a much simpler fix, which simply ignore inner loops as temp-collecting blocks.

@scoder
Copy link
Contributor

scoder commented Feb 11, 2020

Ah, yes, that seems a very good way to resolve this.
Could you also add a test that shows the actual problem?

@fschlimb
Copy link
Contributor Author

Will think about a reliable test.

@fschlimb
Copy link
Contributor Author

I tried, but this is rather difficult to produce reliably. The difficulty is that the race is in a monolithic generated code; it cannot be influenced/changed from cython code. The region for possible errors is super small.
The bug fixed here might actually be the cause for this intermittent failure:

DISABLED. For some reason this fails intermittently on jenkins, with

Any suggestions appreciated.

@da-woods
Copy link
Contributor

Could you post an example of the code that fails intermittently?

@scoder
Copy link
Contributor

scoder commented Feb 12, 2020

I think it's already an improvement to have a test that at least exercises the code, even if it's not guaranteed to fail with a race condition. If enabling the disabled test that you pointed to is all it takes, and if the tests then proves to work on Jenkins, that's great, let's just try that!

@da-woods
Copy link
Contributor

I had a thought on testing - you should be able to tell if variables are private by their address:

from cython.parallel import prange

def test_inner_private():
    cdef int inner_private, outer_private
    cdef int* inner_addresses[2]
    cdef int* outer_addresses[2]
    cdef Py_ssize_t n, m
    for n in prange(2, num_threads=2, nogil=True):
        outer_private = 1
        outer_addresses[n] = &outer_private
        for m in prange(1):
            inner_private = 1
            inner_addresses[n] = &inner_private
    
    inner_are_the_same = inner_addresses[0] == inner_addresses[1]
    outer_are_the_same = outer_addresses[0] == outer_addresses[1]
    
    print(inner_are_the_same)
    print(outer_are_the_same)
    
    return inner_are_the_same == outer_are_the_same

At the minute both inner_private and outer_private look to correctly be private (so the test isn't helpful) but it's possible I haven't reproduce the bug this fixes.

@scoder
Copy link
Contributor

scoder commented Feb 12, 2020

@da-woods the problem is with temp variables, not user visible ones. That makes it difficult to do anything in user test code. What could fail (or … work, whatever) is having thread-specific calculations that require a temp variable and use the same temp name in the C code, and then collect the results to check if all threads produced their correct result or if any of them leaked.

@fschlimb
Copy link
Contributor Author

I enabled the test.

I can only reliably make something fail on my offload branch, where the parallel sections is offloaded to a GPU/device using #pragma omp target. The higher degree of parallelism on a GPU makes it more sensitive to issues like this. The PR will definitely include a test that usually failed without this fix. I made it an extra PR because it is not device specific but generally how OpenMP is used.

@fschlimb
Copy link
Contributor Author

@da-woods: the example which exposed the issue is used in #3342

DISABLED. For some reason this fails intermittently on jenkins, with
the first line of output being '0 0 0 0'. The generated code looks
awfully correct though... needs investigation

>> test_nested_break_continue()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it's this line that disabled the test (I had to take a second look, too). Note the >> at the beginning, which isn't Python's >>>.

@da-woods
Copy link
Contributor

Yes - I can see that __pyx_t_5 and __pyx_t_6 are shared between threads and shouldn't be. I can also see that it's a nightmare to test reliably. You could maybe create some macros in a cdef extern from * block that get the address of __pyx_t_5 and __pyx_t_6, then use my address trick, but that'd be sensitive to anything that renumbered the temps.

@fschlimb
Copy link
Contributor Author

oh, I see. Fixed it.

@da-woods
Copy link
Contributor

#cython: language = c++

from cython.parallel import prange

cdef extern from *:
    """
    class C {
    public:
        const C* address;
        const C& operator=(const C& rhs) {
            address = &rhs;
            return *this;
        }
    };
    
    C make_C() { 
        return C();
    }
    """
    cdef cppclass C:
        const C* address
        C()
    C make_C() nogil except +  # will generate a temp for exception checking



def test_inner_private():
    cdef C not_parallel[2]
    cdef C inner_vals[2]
    cdef C outer_vals[2]
    cdef C ignore
    cdef Py_ssize_t n, m
    
    for n in range(2):
        not_parallel[n] = make_C()
    assert not_parallel[0].address == not_parallel[1].address, "Addresses should be the same since they come from the same temp"
    
    for n in prange(2, num_threads=2, nogil=True):
        outer_vals[n] = make_C()
        for m in prange(1):
            ignore, inner_vals[n] = make_C(), make_C()
    
    inner_are_the_same = inner_vals[0].address == inner_vals[1].address
    outer_are_the_same = outer_vals[0].address == outer_vals[1].address
    
    assert outer_are_the_same == False  # currently works
    assert inner_are_the_same == False  # currently fails

Here's a test that I believe works. I've abused C++ assignment operators to get the addresses of the temporary variables. Private variables have different addresses in different threads so this should be reliable and not depend on race conditions.

@fschlimb
Copy link
Contributor Author

Cool. I just checked and it actually behaves as expected on master and this branch.

I am not enough familiar with all the intricate details of OpenMP to be 100% sure this must work according to the spec. Still, the only critical assumption made here seems to be that the 2 loop-iterations are actually executed on different threads.

I specified schedule='static' and chunksize=1 in the first parfor to make that more explicit and added the example to the run-tests. Hope this is the right way to do this.

@da-woods
Copy link
Contributor

I am not enough familiar with all the intricate details of OpenMP to be 100% sure this must work according to the spec

@fschlimb Honestly me neither. One option I considered was to do assert outer_are_the_same == inner_are_the_same since I think that's should definitely be true. If the test starts to be unreliable then it can always be removed.

I wonder if it should be in a separate file though, since it forces C++, and the other tests don't?

@scoder
Copy link
Contributor

scoder commented Feb 12, 2020

Yeah, adding C++ into the game makes this a bit heavy – and it also means that the C compilation mode can't be used to test this. OpenMP is a compiler feature, so that's an actual drawback.

It definitely needs a separate test file, because just adding it (and especially its #cython: directive) at the end will not enable C++ mode for the test file. You also need to tag the test as #tag: cpp to make it compile (only) in C++. We should definitely keep the existing tests configured as they are now.

However, now that we have this new test, I think we can keep it as a separate test file, and rely on the now-enabled old test to cover the C case.

@fschlimb
Copy link
Contributor Author

Ok, done.

@da-woods
Copy link
Contributor

@fschlimb It looks like you missed the from cython.parallel import prange. However, since it looks like it needs changing anyway, I've come up with a C-compatible version of the same test (abusing macros rather than assignment). That could go in the main file with the rest of the tests (and the C++ version could be deleted because it adds nothing new)

from cython.parallel import prange

cdef extern from *:
    """
    #define address_of_temp(store, temp) store = &temp
    #define address_of_temp2(store, ignore, temp) store = &temp
    
    double get_value() { 
        return 1.0;
    }
    """
    void address_of_temp(...) nogil
    void address_of_temp2(...) nogil
    double get_value() nogil except -1.0  # will generate a temp for exception checking



def test_inner_private():
    """
    Determines if a temp variable is private by taking its address in multiple threads
    and seeing if they're the same (thread private variables should have different
    addresses
    >>> test_inner_private()
    ok
    """
    cdef double* not_parallel[2]
    cdef double* inner_vals[2]
    cdef double* outer_vals[2]
    cdef Py_ssize_t n, m
    
    for n in range(2):
        address_of_temp(not_parallel[n], get_value())
    assert not_parallel[0] == not_parallel[1], "Addresses should be the same since they come from the same temp"
    
    for n in prange(2, num_threads=2, schedule='static', chunksize=1, nogil=True):
        address_of_temp(outer_vals[n], get_value())
        for m in prange(1):
            # second temp just ensures different numbering
            address_of_temp2(inner_vals[n], get_value(), get_value())
    
    inner_are_the_same = inner_vals[0] == inner_vals[1]
    outer_are_the_same = outer_vals[0] == outer_vals[1]
    
    assert outer_are_the_same == False, "Temporary variables in outer loop should be private"
    assert inner_are_the_same == False,  "Temporary variables in inner loop should be private"
    
    print('ok')

@fschlimb
Copy link
Contributor Author

Thanks @da-woods, this is really nice.

I modified the test so that it will succeed if OpenMP is not enabled with -fopenmp.

@scoder
Copy link
Contributor

scoder commented Feb 13, 2020

Lovely. I'll wait for travis to give its placet, and then this is ready to merge.
Thanks @fschlimb and @da-woods!

@scoder scoder merged commit 9995c70 into cython:master Feb 13, 2020
@scoder scoder modified the milestones: 3.0, 0.29.16 Feb 18, 2020
scoder pushed a commit that referenced this pull request Feb 18, 2020
* omp declare privates on outer loop, since inner loops are not 'omp parallel for'ified
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants