
ARROW-4025: [Python] Avoid pthread_once in ThreadPool #3177

Closed
wants to merge 19 commits

Conversation

@pcmoritz (Contributor) commented Dec 14, 2018

This seems to fix the TensorFlow/PyTorch crash for me.

Some disassembling (thanks for the hint @pitrou) brought up that pthread_once is called by the C++11 future library. Replacing it with boost seems to fix it (but unfortunately pulls in a number of other boost dependencies like atomic, chrono and date_time; I wonder if that is avoidable).

@@ -294,9 +294,14 @@ std::shared_ptr<ThreadPool> ThreadPool::MakeCpuThreadPool() {
return pool;
}

ThreadPool* arrow_global_thread_pool = nullptr;

@pcmoritz (Author, Contributor) Dec 14, 2018

This may not be needed any more, I'll check that.

@pcmoritz (Author, Contributor) Dec 16, 2018

(this turns out to actually not be needed)

@pcmoritz (Author, Contributor) commented Dec 14, 2018

To test this, the manylinux1 base image needs to be rerun (in arrow/python/manylinux1) with

docker build -t arrow_manylinux1_x86_64_base -f Dockerfile-x86_64_base .

and then wheels can be generated with

docker run --shm-size=2g --rm -t -i -v $PWD:/io -v $PWD/../../:/arrow arrow_manylinux1_x86_64_base:latest /io/build_arrow.sh
@pitrou (Contributor) commented Dec 15, 2018

Yuck. We don't want to bring in boost just to work around Tensorflow's standard violations.

If you think this should be solved, I encourage you to find another way.

ThreadPool* GetCpuThreadPool() {
static std::shared_ptr<ThreadPool> singleton = ThreadPool::MakeCpuThreadPool();
return singleton.get();
if (!arrow_global_thread_pool) {

@pitrou (Contributor) Dec 15, 2018

This is not thread-safe. There is a reason why I used a static singleton, and that reason is the thread safety of its initialization.

@fsaintjacques (Contributor) Dec 15, 2018

I think we can even demote it to a unique_ptr, since the object is never re-assigned.

@pcmoritz (Author, Contributor) Dec 16, 2018

Yeah, you are both right (this was just to try out if removing the static shared_ptr makes a difference)

static std::shared_ptr<ThreadPool> singleton = ThreadPool::MakeCpuThreadPool();
return singleton.get();
if (!arrow_global_thread_pool) {
arrow_global_thread_pool = new ThreadPool();

@pitrou (Contributor) Dec 15, 2018

Creates a memory leak (object is never destroyed, Valgrind will complain).

@wesm (Member) commented Dec 15, 2018

I'm also -0/-1 on eschewing C++11 future/thread -- that seems hardly sustainable. I think we need to engage the TF / PyTorch communities on this. It feels like they swiped the credit card, but sent the bill to our house.

@wesm (Member) commented Dec 15, 2018

It might mean that we have to create some kind of a pip wheel to hoist a modern C++ standard library into the environment (for PyTorch / TF to depend on) so they can avoid all this static linking insanity, similar to what's being done in conda now

@pcmoritz pcmoritz force-pushed the pcmoritz:global-threadpool branch from e2bd0d0 to 636390d Dec 15, 2018

@pcmoritz pcmoritz force-pushed the pcmoritz:global-threadpool branch from 05a729a to 62f9e82 Dec 15, 2018

@pcmoritz pcmoritz force-pushed the pcmoritz:global-threadpool branch from c4778dc to 4cba4c8 Dec 15, 2018

pcmoritz added some commits Dec 15, 2018

@pcmoritz (Author, Contributor) commented Dec 15, 2018

Thanks for the feedback @wesm @pitrou. Here are my thoughts:

This problem really needs to be fixed. If we don't fix it, doing deep learning with Arrow via TensorFlow or PyTorch will be severely hindered, and that's a non-starter certainly for Ray but I think also for the adoption of pyarrow in general. The workarounds we have had so far work in some cases, but they sometimes fail, and we obviously cannot afford that. Making TensorFlow and PyTorch manylinux1 compatible won't happen in the near future; it has been discussed several times and is non-trivial because of the need for GPU support and the NVIDIA toolchain. Plus, even if it were fixed tomorrow, older versions of TensorFlow would still be in use for various reasons. So I fear this needs to be fixed on our side.

Here are the solutions I can think of:

  1. Use boost::threads. While minimally invasive from the source code perspective I agree it is not a great solution.
  2. Change the API of the thread pool to get rid of futures (see below).
  3. Make the ThreadPool optional/configurable and provide an alternative that plays nicely with TensorFlow. This could be configurable from the python/C++ side at runtime or via a macro at compile time in which case it would be up to the application runtime (Ray in our case) to do something reasonable.
  4. See if the problem will be fixed with manylinux2010. If it uses a more modern libstdc++, the problem might be fixed there (in that case we would need to use a fork in Ray for the time being, because we need an immediate solution).

At least now that we know that the problem comes from std::call_once and the futures, we have a way of attacking it :)

Concerning 2), I have a sketch in this PR now. It's not ready yet and there are probably bugs at the moment, but it may give you an idea of what this would look like.

It would entail the following more limited API, which nevertheless covers the use cases of the ThreadPool we have at the moment: only functions returning void can be submitted to the thread pool, and they communicate with the caller via closures (arguments and return values) [alternatively, we limit ourselves to Status return values, as sketched in the PR]. Instead of being able to wait for the termination of each function individually, there would be a pool.Wait() that waits for all functions currently queued. This API is obviously not as nice as the current one, but we could keep it as a temporary workaround until manylinux2010 is ready.

Please let me know your thoughts / if you have further ideas on how to fix it. I'm willing to spend the time to make whatever approach we decide on work, but it really needs to be fixed :)

Best,
Philipp.

@pitrou (Contributor) commented Dec 15, 2018

Concerning 2), I have a sketch in this PR now. It's not ready yet and there are probably bugs at the moment, but it may give you an idea of what this would look like.

I just took a look... Do you realize the thread pool can be called from several threads? If you wait for the thread pool to be idle, you may be waiting much longer than necessary.

At least now that we know that the problem comes from std::call_once and the futures, we have a way of attacking it :)

What if some other problem comes up? What if loading Gandiva (which is going to bundle LLVM) side by side with Tensorflow also causes crashes?

See if the problem will be fixed with manylinux2010. If it uses a more modern libstdc++, the problem might be fixed there

It may indeed. Though I'm afraid that depends on the exact libstdc++ version Tensorflow builds against?

This problem really needs to be fixed. If we don't fix it, doing deep learning with arrow via TensorFlow or PyTorch will be severely hindered

I understand the sentiment, but it seems the only sane way to fix this on our side is to ask people to use conda packages, not wheels. If wheels are broken, just say so to users instead of piling up fragile workarounds in the Arrow codebase.

Actually, I would even go as far as asking distutils-sig if the non-conformant wheels can be removed from PyPI. PyPI has no policy for this currently, but it is an obvious detriment to the whole ecosystem that people are allowed to upload rogue wheels and other packages have to deal with the fallout.

@pitrou (Contributor) commented Dec 15, 2018

By the way, it seems manylinux2010 is almost there now: pypa/manylinux#179

(i.e. it's in pip master, but not any pip releases yet, AFAICT)

@@ -33,7 +33,10 @@ uint8_t* pointer_logical_and(const uint8_t* address, uintptr_t bits) {

// This function is just for avoiding MinGW-w64 32bit crash.
// See also: https://sourceforge.net/p/mingw-w64/bugs/767/
void* wrap_memcpy(void* dst, const void* src, size_t n) { return memcpy(dst, src, n); }
Status wrap_memcpy(void* dst, const void* src, size_t n) {

@fsaintjacques (Contributor) Dec 15, 2018

I wouldn't wrap this with a different return type. There's a non-negligible chance that gcc then can't apply the optimization of inlining memcpy as a single instruction. This can severely affect performance in code that does memcpy in loops.

I'd also add the static inline keywords.

@@ -145,6 +145,11 @@ Status ThreadPool::Shutdown(bool wait) {
return Status::OK();
}

void ThreadPool::Wait() {

@fsaintjacques (Contributor) Dec 15, 2018

The typical nomenclature for this is Join.

}

void Wait();

@fsaintjacques (Contributor) Dec 15, 2018

I'd also expose the WaitFor variant with chrono (see https://en.cppreference.com/w/cpp/thread/condition_variable/wait_for).

@@ -650,11 +650,12 @@ Status PlasmaClient::Impl::List(ObjectTable* objects) {
return ReadListReply(buffer.data(), buffer.size(), objects);
}

static void ComputeBlockHash(const unsigned char* data, int64_t nbytes, uint64_t* hash) {
static Status ComputeBlockHash(const unsigned char* data, int64_t nbytes, uint64_t* hash) {

@fsaintjacques (Contributor) Dec 15, 2018

Why?


Status st = SpawnReal(detail::packaged_task_wrapper<Result>(std::move(task)));
// Submit a callable for execution.
void Submit(std::function<Status()> func) {

@fsaintjacques (Contributor) Dec 15, 2018

I wouldn't enforce that every function must return Status; a thread pool shouldn't care about the logical data. The original code didn't enforce this: what's returned is the success or failure of submitting.

// std::call_once(utf8_initialized, internal::InitializeLargeTable);
// How should we do this? Maybe use
// https://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Nifty_Counter
internal::InitializeLargeTable();

@fsaintjacques (Contributor) Dec 15, 2018

Note that you can refactor the utf8 stuff into a simple struct that allocates the array on the heap. Then you use the same static + unique_ptr singleton technique and you'll get rid of the once_flag.

@pcmoritz (Author, Contributor) commented Dec 15, 2018

Thanks again for the further feedback! It seems to me changing the threadpool API is also suboptimal.

I'd suggest the following solution:

  • Keep the API as is, but instead of std::future expose arrow::Future
  • Introduce a flag that redirects the implementation of arrow::Future to either std::future or boost::future.

Then we can leave it to the user which option they prefer. One sane default might be that if boost is vendored (like in our wheels), boost::future will be used. This would fix the problem not only for Ray but also for vanilla pyarrow wheels.

In the future (if manylinux2010 and/or TensorFlow and pytorch solve the issue), we can remove the flag and just default to std::future.

@wesm (Member) commented Dec 15, 2018

This problem really needs to be fixed. If we don't fix it, doing deep learning with Arrow via TensorFlow or PyTorch will be severely hindered, and that's a non-starter certainly for Ray but I think also for the adoption of pyarrow in general. The workarounds we have had so far work in some cases, but they sometimes fail, and we obviously cannot afford that. Making TensorFlow and PyTorch manylinux1 compatible won't happen in the near future; it has been discussed several times and is non-trivial because of the need for GPU support and the NVIDIA toolchain. Plus, even if it were fixed tomorrow, older versions of TensorFlow would still be in use for various reasons. So I fear this needs to be fixed on our side.

I don't think it's reasonable for TF / PyTorch to build with ad hoc C++ toolchains and linker flags and expect the rest of the Python community to conform to their decisions with zero consultation. They need to engage other projects on these toolchain issues -- they have the money and the people to put in effort to be compatible with the rest of the ecosystem, so we need to hold them accountable. There aren't that many players in this ecosystem, and we will not allow ourselves to be pushed around.

I sympathize with the issues you are having, but we cannot treat these other open source projects as immutable entities. Let's talk with the development teams and come up with a solution so we can all use C++ in a consistent way and have working applications.

@wesm (Member) commented Dec 15, 2018

For the avoidance of ambiguity, @pcmoritz, I'm expecting you and @robertnishihara to take the lead on engaging with TF / PyTorch communities on this. Please let us know if we can help

@wesm (Member) commented Dec 15, 2018

Actually, I would even go as far as asking distutils-sig if the non-conformant wheels can be removed from PyPI.

Somehow we need to create some catalyst to cause the TF and PyTorch developers to take more responsibility for the toolchain problem they have created; this might be one way. Either way we need to escalate the issue and make a lot more people aware of what is going on

@wesm (Member) commented Dec 15, 2018

Aside: pyarrow is being installed more frequently than PyTorch and about 1/4 as often as TF on PyPI.

@fsaintjacques (Contributor) commented Dec 15, 2018

Do you want me to concoct a small work-around using purely atomics that would respect the google style guide?

@wesm (Member) commented Dec 15, 2018

I don't think it's worth investing any more effort in this until we open a dialogue about the toolchain issues with the relevant developer communities

@pcmoritz (Author, Contributor) commented Dec 16, 2018

Thanks for all the views! I'll send a post to the TensorFlow and PyTorch lists to make sure this issue is raised prominently. Hopefully the community can agree on a way forward that works for everyone. At least we should make sure that there will be compatibility with manylinux2010 in the future.

I'll also prepare a patch for Ray 0.6.1 to make sure this works in the short term for our users.

@robertnishihara (comment minimized)

@soumith commented Dec 16, 2018

@wesm thanks a lot for pushing to start a discussion and dialogue around these issues. PyTorch certainly isn't immutable and we don't expect to be treated that way. I've responded on the email thread that @pcmoritz started to get the discussion going and try to find a solution.

It might mean that we have to create some kind of a pip wheel to hoist a modern C++ standard library into the environment (for PyTorch / TF to depend on) so they can avoid all this static linking insanity, similar to what's being done in conda now

Just FYI, PyTorch hasn't statically linked the C++ standard library since January this year -- we've been trying our best to work within the constraints we have (in terms of CUDA and C++11) while supporting as old a Linux distro as we possibly can.

@wesm (Member) commented Feb 27, 2019

Closing this pending some guidance about how to proceed on creating compatible wheels
