Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

fix custom operation in fork #14451

Merged
merged 6 commits into from
Mar 20, 2019
Merged

fix custom operation in fork #14451

merged 6 commits into from
Mar 20, 2019

Conversation

arcadiaphy
Copy link
Member

@arcadiaphy arcadiaphy commented Mar 16, 2019

Description

Fix #14396. Update pthread_atfork function to properly fork on custom operator.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Copy link
Member

@wkcn wkcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick fix! LGTM
I have two questions:
Does each process have independent threads?
Does the bug exist on Windows?

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Great work @arcadiaphy ! Would it be possible to add a test similar to the reproducible example you provided in the issue: #14396 ?

@arcadiaphy
Copy link
Member Author

arcadiaphy commented Mar 17, 2019

@wkcn For the two questions:

  1. Yes, each process has its independent threads. Fork only duplicates the caller thread, so we need to make sure all locking primitives are in valid states and re-create the threads in child process. The easiest way is to restart CustomOperator when fork happens just like Engine does.
  2. There is no fork on windows, so python use spawn method to create new process. I have no windows machine so I can only test on linux and mac with:import multiprocessing as mp; mp.set_start_method('spawn') It seems like python re-import mxnet in child process when spawning, so the bug doesn't exist at all.

@arcadiaphy
Copy link
Member Author

arcadiaphy commented Mar 17, 2019

I find another bug, the custom operator hangs when having unfinished async operations.

import mxnet as mx

class AdditionOP(mx.operator.CustomOp):
    def __init__(self):
        super(AdditionOP, self).__init__()
    def forward(self, is_train, req, in_data, out_data, aux):
        out_data[0][:] = in_data[0] + in_data[1]
    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        in_grad[0][:] = out_grad[0]
        in_grad[1][:] = out_grad[0]

@mx.operator.register("AdditionOP")
class AdditionOPProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(AdditionOPProp, self).__init__()
    def list_arguments(self):
        return ['a', 'b']
    def list_outputs(self):
        return ['output']
    def infer_shape(self, in_shape):
        return in_shape, [in_shape[0]]
    def create_operator(self, ctx, shapes, dtypes):
        return AdditionOP()

def foo():
    a = mx.nd.array([1, 2, 3])
    b = mx.nd.array([4, 5, 6])
    c = mx.nd.Custom(a, b, op_type='AdditionOP')

def main():
    # program hangs in exit
    foo()
    # add following code, program hangs in forking
    # from multiprocessing import Process
    # p = Process(target=foo)
    # p.start()
    # p.join()

if __name__ == '__main__':
    main()

I test the script on ubuntu, it hangs when calling python CFUNCTYPE function in c++. If you add mx.nd.waitall() in foo to wait all async operations, the bug disappears.

It seems that calling python function from c++ when destructing static variable or forking may lead to problems.

@arcadiaphy arcadiaphy force-pushed the pr_custom branch 2 times, most recently from c7871d6 to 5bd800c Compare March 17, 2019 13:57
@anirudh2290
Copy link
Member

@arcadiaphy waitall silently hides exceptions, it doesnt rethrow the exception yet. This should be fixed by #14397 . what happenswhen you add c.wait_to_read() in foo . does it still hang ?

@arcadiaphy
Copy link
Member Author

@anirudh2290 No hangs after adding c.wait_to_read(). It only happens in calling CFUNCTYPE when destructing CustomOperator or running pre-fork Stop function.

@karan6181
Copy link
Contributor

@mxnet-label-bot add [Bug, Operator]

pthread_atfork(
[]() {
CustomOperator::Get()->Stop();
Engine::Get()->Stop();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call Engine stop before customop stop. This will ensure that all ops pushed to engine have executed before stopping engine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take that back. Maybe that won't work since Stopping engine before custom op means there may be ops that havent been pushed to engine by custom op but it has stopped. Another option is to add waitall after the worker joined in CustomOp stop to make sure all pushed ops have completed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second option doesn't work either. When it goes into Stop function, the custom worker thread has already frozen on python function, leading to the blocking of worker join. I cannot think of a simple and graceful way to complete or ignore the unfinished custom operation other than manually add waiting command in python code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know why it is frozen on worker join ? The worker should just push the operator to engine and return.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very easy to understand the first situation. The destruction of static variable on engine exit happens when python interpreter has been shutdown, so something bad happens when you call CFUNCTYPE function at that time.
I push a new commit to move custom Stop function to MXNotifyShutdown, forcing custom function called before python shutdown, so the first situation is all right.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second situation of forking is so mysterious, I don't have a clue now. The custom Stop function is registered in prepare fork handler, everything should be OK at that time, but the worker thread just freezes.

@arcadiaphy
Copy link
Member Author

arcadiaphy commented Mar 18, 2019

I write a simple example to show the program freeze, which comes from calling CFUNCTYPE in thread when preparing fork.
fork example code

#include <iostream>
#include <thread>
#include <pthread.h>
#include "./lib.h"

callback *fun_python;

void set_python_callback(callback *func) {
  fun_python = func;
}

class Foo {
public:
  Foo() {
    pthread_atfork([]() {
      // call fun_python successfully
      std::cout << fun_python(5) << std::endl;
      // call fun_python failed in thread
      std::thread t([]() { std::cout << fun_python(10) << std::endl;});
      t.join();
    }, NULL, NULL);
  }
};

static Foo f;

@YutingZhang
Copy link
Contributor

@arcadiaphy Does this also related to https://discuss.mxnet.io/t/custom-operator-cause-stuck-in-multicard-train/1452
Basically, a CustomOP could randomly get stuck in a multi-GPU setting. I also observed the same problem in my current experiment.

@anirudh2290
Copy link
Member

I think this may be related to https://docs.python.org/3/c-api/init.html#non-python-created-threads .
Quoting from the doc:

If you need to call Python code from these threads (often this will be part of a callback API provided by the aforementioned third-party library), you must first register these threads with the interpreter by creating a thread state data structure, then acquiring the GIL, and finally storing their thread state pointer, before you can start using the Python/C API. When you are done, you should reset the thread state pointer, release the GIL, and finally free the thread state data structure.

We don't have any such handling for callbacks in custom ops.

@arcadiaphy
Copy link
Member Author

arcadiaphy commented Mar 19, 2019

@anirudh2290 The program blocks because of GIL. According to the following docs in your link, if fork locks GIL, then acquiring GIL in another thread and join them will cause deadlock.

Another important thing to note about threads is their behaviour in the face of the C fork() call. On most systems with fork(), after a process forks only the thread that issued the fork will exist. That also means any locks held by other threads will never be released. Python solves this for os.fork() by acquiring the locks it uses internally before the fork, and releasing them afterwards. In addition, it resets any Lock Objects in the child. When extending or embedding Python, there is no way to inform Python of additional (non-Python) locks that need to be acquired before or reset after a fork. OS facilities such as pthread_atfork() would need to be used to accomplish the same thing. Additionally, when extending or embedding Python, calling fork() directly rather than through os.fork() (and returning to or calling into Python) may result in a deadlock by one of Python’s internal locks being held by a thread that is defunct after the fork. PyOS_AfterFork_Child() tries to reset the necessary locks, but is not always able to.

Additional handling should be added to correctly run callbacks:

#include <iostream>
#include <thread>
#include <pthread.h>
#include <Python.h>
#include "./lib.h"

callback *fun_python;

void set_python_callback(callback *func) {
  fun_python = func;
}

class Foo {
public:
  Foo() {
    pthread_atfork([]() {
      // call fun_python successfully
      std::cout << fun_python(5) << std::endl;
      // call fun_python failed in thread
      Py_BEGIN_ALLOW_THREADS
      std::thread t([]() { std::cout << fun_python(10) << std::endl;});
      t.join();
      Py_END_ALLOW_THREADS
    }, NULL, NULL);
  }
};

static Foo f;

But adding this will make mxnet link against libpython, do we really want to do this?

@arcadiaphy
Copy link
Member Author

arcadiaphy commented Mar 19, 2019

@YutingZhang The problem you mentioned may be related to GIL, correctly handling GIL in c++ extension is a tricky problem, easy to cause deadlock. Now we have no such handling at all.

@anirudh2290
Copy link
Member

@arcadiaphy yes I don't see other option apart from depending on libpython if we have to fix these issues w.r.t custom op. What do others think ?

@arcadiaphy
Copy link
Member Author

@anirudh2290 If we don't want to depend on libpython, then this partial fix will work fine if you don't try to fork when having unfinished custom operation.

@anirudh2290
Copy link
Member

@arcadiaphy I think we can move forward with this PR as this is an improvement from what we have today. Can you document this as a limitation of using custom op with forking here: http://mxnet.incubator.apache.org/versions/master/tutorials/gluon/customop.html ,
http://mxnet.incubator.apache.org/versions/master/faq/new_op.html. I think custom op may cause issues for multi-gpu scenario because of not handling the callbacks correctly, but I haven't found a reproducible step yet. Once found I will open an issue.

@arcadiaphy arcadiaphy requested a review from szha as a code owner March 20, 2019 04:21
@arcadiaphy
Copy link
Member Author

@anirudh2290 Add some docs on custom op.

@arcadiaphy
Copy link
Member Author

arcadiaphy commented Mar 20, 2019

@anirudh2290 Docs updated, everything OK now.

@anirudh2290 anirudh2290 merged commit 4b1811c into apache:master Mar 20, 2019
@arcadiaphy arcadiaphy deleted the pr_custom branch March 20, 2019 16:25
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
* fix custom operation in fork

* add test

* fix custom stop

* swap order

* add docs

* update doc
ZhennanQin pushed a commit to ZhennanQin/incubator-mxnet that referenced this pull request Apr 3, 2019
* fix custom operation in fork

* add test

* fix custom stop

* swap order

* add docs

* update doc
nswamy pushed a commit that referenced this pull request Apr 5, 2019
* fix custom operation in fork

* add test

* fix custom stop

* swap order

* add docs

* update doc
@arcadiaphy arcadiaphy mentioned this pull request Apr 21, 2019
7 tasks
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* fix custom operation in fork

* add test

* fix custom stop

* swap order

* add docs

* update doc
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mx.nd.Custom not working in subprocess
6 participants