
Why can't an "autonoread"-locked table be read from multiple threads? #1038

JSKenyon opened this issue Aug 4, 2020 · 16 comments

JSKenyon commented Aug 4, 2020

I am currently implementing a parallel/distributed calibration application and I am encountering slowdowns caused by reads being serialised between threads. To mitigate this, it would be ideal if I could:

  • read non-overlapping blocks of data from separate threads
  • ensure that python-casacore drops the GIL (I already have a proof of concept workaround for this part)

The issue is that reading an MS from multiple threads causes segfaults. I believe that this is likely a result of some internal caching mechanism but I would appreciate it if someone more familiar with casacore could explain precisely what is happening behind the scenes.
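For concreteness, the access pattern I am after looks roughly like the sketch below (the MS name, column and chunk size are placeholders, not my actual code). At present this is either serialised by the GIL or, if the GIL is dropped, liable to segfault as described above:

```python
# Minimal sketch of the access pattern described above: several threads read
# non-overlapping row blocks from the same open table. The MS name, column
# and chunk size are placeholders.
from concurrent.futures import ThreadPoolExecutor

from casacore.tables import table

ms = table("test.ms", ack=False)   # placeholder MS
nrow = ms.nrows()
nworkers = 4
chunk = (nrow + nworkers - 1) // nworkers

def read_block(startrow):
    # Each task reads a disjoint block of rows from the DATA column.
    n = min(chunk, nrow - startrow)
    return ms.getcol("DATA", startrow=startrow, nrow=n)

with ThreadPoolExecutor(max_workers=nworkers) as pool:
    blocks = list(pool.map(read_block, range(0, nrow, chunk)))

ms.close()
```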

If it is only the caching that causes problems, I would happily sacrifice the caching in order to have parallel reads (and implicitly parallel memory allocations). In my opinion, for the multi-TB measurement sets produced by instruments like MeerKAT this parallel read functionality is crucial.

I have already done a number of experiments, including using processes for parallel reads, but I desperately want to avoid multi-processing, as in my experience it is very difficult to maintain. What I have managed to do is read from multiple threads using a multi-MS. This is very cool, but I would prefer not to coerce the user into using a multi-MS.

To summarize, parallel reads using threads are high on my wish list!

@bennahugo

I would recommend using a producer-consumer pattern, with one thread streaming data off disk and the other threads performing work on that data. This is how I previously implemented parallel processing with casacore. I'm not sure if the locking mechanism is thread-safe, as you mentioned.
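Roughly along these lines (a sketch from memory, not the code I used; the column, chunk size and queue depth are arbitrary):

```python
# Rough sketch of the producer-consumer pattern described above: a single
# reader thread pulls chunks off disk, worker threads process them.
# Table/column names and chunk sizes are placeholders.
import queue
import threading

from casacore.tables import table

NWORKERS = 3
work_q = queue.Queue(maxsize=4)   # bounded so the reader can't run away
SENTINEL = None

def producer(ms_name, chunk=10000):
    ms = table(ms_name, ack=False)
    try:
        for start in range(0, ms.nrows(), chunk):
            n = min(chunk, ms.nrows() - start)
            work_q.put(ms.getcol("DATA", startrow=start, nrow=n))
    finally:
        ms.close()
    for _ in range(NWORKERS):
        work_q.put(SENTINEL)      # tell each worker to stop

def consumer():
    while True:
        data = work_q.get()
        if data is SENTINEL:
            break
        data *= 2                 # stand-in for real processing

workers = [threading.Thread(target=consumer) for _ in range(NWORKERS)]
for w in workers:
    w.start()
producer("test.ms")               # placeholder MS
for w in workers:
    w.join()
```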

JSKenyon commented Aug 4, 2020

That approach is highly sub-optimal based on my tests. It forces one thread to do all the memory allocation and becomes very slow. From what I have seen, the issue is not I/O bandwidth but memory allocation. Additionally, a producer-consumer framework is at odds with parallel/distributed frameworks such as dask. I have already observed slowdowns using dask-ms (which I believe implements a producer-consumer type approach, @sjperkins?), as a single thread cannot allocate fast enough to fully utilize the available compute resources.

Reading from multiple processes seems to scale linearly with the number of processes. Having this behaviour using threads would be a game changer in my opinion. I have already shown that it can work using threads for a multi-MS, provided one tweaks python-casacore to drop the GIL. So I really believe that this should be possible, possibly with only minor changes to casacore internals. Of course I may be wrong - I am not well acquainted with casacore's internals.
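For reference, the multi-process version that does scale for me looks roughly like this (a sketch only; the MS path, column and chunk size are placeholders). The key point is that every process opens its own handle on the table, so nothing is shared:

```python
# Sketch of the multi-process read that scales roughly linearly:
# every process opens its own handle on the MS and reads a disjoint block.
# The MS path, column and chunk size are placeholders.
from multiprocessing import Pool

from casacore.tables import table

MS_NAME = "test.ms"   # placeholder
CHUNK = 10000         # placeholder row chunk

def read_block(startrow):
    # Each process opens the table independently - nothing is shared.
    ms = table(MS_NAME, ack=False)
    try:
        n = min(CHUNK, ms.nrows() - startrow)
        return ms.getcol("DATA", startrow=startrow, nrow=n)
    finally:
        ms.close()

if __name__ == "__main__":
    ms = table(MS_NAME, ack=False)
    starts = list(range(0, ms.nrows(), CHUNK))
    ms.close()
    with Pool(processes=4) as pool:
        blocks = pool.map(read_block, starts)
```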


bennahugo commented Aug 4, 2020 via email


JSKenyon commented Aug 4, 2020

You are absolutely correct @bennahugo - that preallocation is substantially better. But reading from N processes is still roughly N times faster.

@bennahugo

It may be something to do with the prefetch and caching system employed in IncrementalStorageManagers. I'm not sure if there is any sort of mutex locking around that in casacore itself - it has been a while since I looked at that code, but it would make sense for a prefetch system to have race-condition protection. Someone with a more intimate knowledge of the storage managers should comment, but one of the other standard storage managers may not have this limitation. I'm not sure how to test it aside from writing a C++ test case that replicates the Python test, to see if this is indeed where the bottleneck lies. The alternative is to compile casacore with symbols and profile your Python application using perf, to trace the amount of time spent inside the internal casacore reading functions.

@sjperkins

From a dask-ms perspective, the fact that the python CASA table access methods do not release the GIL is of concern.

dask-ms allocates a single I/O thread per CASA table (including subtables), but since the GIL will effectively serialise any python CASA table calls, any benefit from accessing multiple tables in parallel threads is lost.

@bennahugo

That would explain it, so I guess this is primarily a python-casacore issue and not a locking issue further up? I think perhaps the easiest way forward is to write a C++ snippet that replicates the Python program's behaviour, compare the timings, and see where most of the time is spent in the casacore stack.


sjperkins commented Aug 5, 2020

It sounds as if there are 3 issues at play here:

  1. python-casacore retaining the GIL on table access calls.
  2. underlying thread-safety of accessing the same CASA table from multiple threads.
  3. allocation speed for large chunks of memory.

(1) can be solved by modifying python-casacore to drop the GIL during CASA table access.

Based on private communication and @JSKenyon's experiments, (2) is dangerous due to thread-safety issues.

(3) may be related to how the Linux kernel handles extremely large allocations.
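As a rough way of looking at (3), the sketch below separates the cost of reserving a large buffer from the cost of the kernel faulting the pages in on first touch (the size is arbitrary and this is only illustrative, not a claim about where the time goes inside casacore):

```python
# Illustrative timing of a large allocation: np.empty reserves virtual
# memory almost instantly, while writing to the buffer forces the kernel
# to fault the pages in, which is where the time tends to go.
import time

import numpy as np

size = 2 * 1024**3   # 2 GiB, arbitrary

t0 = time.perf_counter()
buf = np.empty(size, dtype=np.uint8)   # virtual reservation only
t1 = time.perf_counter()
buf[:] = 1                             # first touch: pages are faulted in
t2 = time.perf_counter()

print(f"allocate: {t1 - t0:.3f} s, first touch: {t2 - t1:.3f} s")
```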


sjperkins commented Aug 5, 2020

> I have already observed slowdowns using dask-ms (which I believe implements a producer-consumer type approach, @sjperkins?), as a single thread cannot allocate fast enough to fully utilize the available compute resources.

For clarification, dask-ms allocates buffers in multiple threads, but serialises all reads into (and writes from) those buffers in a single thread, for a single CASA table and its subtables. Another table and its subtables will be accessed in a separate thread.

The producer/consumer model is implicit in the use of a ThreadPoolExecutor to isolate access to the CASA table to a single thread.
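In very rough pseudo-code the pattern is something like the sketch below (illustrative only, not the actual dask-ms code; the column, shape and dtype are placeholders):

```python
# Illustrative sketch of the pattern described above: buffers are allocated
# in the calling threads, but every actual table access is funnelled through
# a single-worker ThreadPoolExecutor. Column name, shape and dtype are
# placeholders, not dask-ms internals.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from casacore.tables import table

ms = table("test.ms", ack=False)              # placeholder MS
io_pool = ThreadPoolExecutor(max_workers=1)   # one I/O thread per table

def read_into_buffer(startrow, nrow, nchan=64, ncorr=4):
    # Allocation happens in the caller's thread...
    buf = np.empty((nrow, nchan, ncorr), dtype=np.complex64)
    # ...but the read itself is serialised onto the single I/O thread.
    io_pool.submit(ms.getcolnp, "DATA", buf, startrow, nrow).result()
    return buf
```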

@bennahugo

(3) could be solved by playing around with the huge-page allocation settings and inspecting the translation lookaside buffer probes via perf to see if any gains can be made. I suspect that for our use case it might help to increase the page sizes. It may, however, be necessary to write a custom memory allocator to make full use of page-aligned memory accesses.


JSKenyon commented Aug 5, 2020

While huge pages might help, I am against it in general. Most users will not have the ability to muck around with those settings.

I would really appreciate it if @gervandiepen or @aroffringa could pitch in, particularly regarding the addition of different storage managers. This might all be a pointless debate if it has already been solved behind the scenes.


tammojan commented Aug 5, 2020

@gervandiepen is on vacation, I think he'll return on Monday.

@bennahugo

I dug up an old allocation test that tested this. It would be useful, though, to configure nodes that do MeerKAT (or ASKAP/LOFAR?) data reduction to make optimal use of the hugepage system. From an old email thread:

I cannot reproduce Simon's observation. Are you sure the kernel page caches aren't slowing you down? It might be worthwhile dropping them.

There is a small improvement (~1%) from pinning the memory and process to node 0. The default behaviour of the kernel is to try its best to keep memory on the same node as a process.

That said, there are a few optimizations to try. Foremost is to force a rebuild of numpy, which will make maximal use of the vector instruction set of the CPU. Maybe Bruce can comment, but I think the default -O3 optimizations only use the generic x86 instruction set? This should suffice to compile:

CFLAGS="-O3 -march=native"  pip install --no-binary :all: numpy==1.13.3 #same as system

Another trick is to enable huge pages - potentially wasteful in my contexts, but it could work for large-array-dominated applications. The change can be made permanent in GRUB, but to get it temporarily until the next reboot, do:

echo always > /sys/kernel/mm/transparent_hugepage/enabled

The final comparison is as follows:

[image: final timing comparison]


JSKenyon commented Aug 11, 2020

I have just put up a PR on python-casacore to demonstrate some of my findings regarding the GIL and parallel reads.

@JSKenyon

Just wanted to check in here - any more thoughts @tammojan, @aroffringa, @gervandiepen? Sorry to be a pest, but this is really something I want to dig into.
