
Why can't an "autonoread"-locked table be read from multiple threads? #1038

JSKenyon opened this issue Aug 4, 2020 · 16 comments

JSKenyon commented Aug 4, 2020

I am currently implementing a parallel/distributed calibration application and I am encountering slowdowns caused by reads being serialised between threads. To mitigate this, it would be ideal if I could:

  • read non-overlapping blocks of data from separate threads
  • ensure that python-casacore drops the GIL (I already have a proof of concept workaround for this part)

The issue is that reading an MS from multiple threads causes segfaults. I believe that this is likely a result of some internal caching mechanism but I would appreciate it if someone more familiar with casacore could explain precisely what is happening behind the scenes.
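For concreteness, the access pattern I am after looks roughly like the sketch below (the MS name, column and chunk size are placeholders, not my actual code). At present this is either serialised by the GIL or, if the GIL is dropped, liable to segfault as described above:

```python
# Minimal sketch of the access pattern described above: several threads read
# non-overlapping row blocks from the same open table. The MS name, column
# and chunk size are placeholders.
from concurrent.futures import ThreadPoolExecutor

from casacore.tables import table

ms = table("test.ms", ack=False)   # placeholder MS
nrow = ms.nrows()
nworkers = 4
chunk = (nrow + nworkers - 1) // nworkers

def read_block(startrow):
    # Each task reads a disjoint block of rows from the DATA column.
    n = min(chunk, nrow - startrow)
    return ms.getcol("DATA", startrow=startrow, nrow=n)

with ThreadPoolExecutor(max_workers=nworkers) as pool:
    blocks = list(pool.map(read_block, range(0, nrow, chunk)))

ms.close()
```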

If it is only the caching that causes problems, I would happily sacrifice the caching in order to have parallel reads (and implicitly parallel memory allocations). In my opinion, for the multi-TB measurement sets produced by instruments like MeerKAT this parallel read functionality is crucial.

I have already done a number of experiments, including using processes for parallel reads, but I desperately want to avoid multi-processing, as in my experience it is very difficult to maintain. What I have managed to do is read from multiple threads using a multi-MS. This is very cool, but I would prefer not to coerce the user into using a multi-MS.

To summarize, parallel reads using threads are high on my wish list!

@bennahugo

I would recommend using a producer-consumer pattern, with one thread streaming data off disk and the other threads performing work on that data. This is how I previously implemented parallel processing with casacore. I'm not sure if the locking mechanism is thread-safe, as you mentioned.
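Roughly along these lines (a sketch from memory, not the code I used; the column, chunk size and queue depth are arbitrary):

```python
# Rough sketch of the producer-consumer pattern described above: a single
# reader thread pulls chunks off disk, worker threads process them.
# Table/column names and chunk sizes are placeholders.
import queue
import threading

from casacore.tables import table

NWORKERS = 3
work_q = queue.Queue(maxsize=4)   # bounded so the reader can't run away
SENTINEL = None

def producer(ms_name, chunk=10000):
    ms = table(ms_name, ack=False)
    try:
        for start in range(0, ms.nrows(), chunk):
            n = min(chunk, ms.nrows() - start)
            work_q.put(ms.getcol("DATA", startrow=start, nrow=n))
    finally:
        ms.close()
    for _ in range(NWORKERS):
        work_q.put(SENTINEL)      # tell each worker to stop

def consumer():
    while True:
        data = work_q.get()
        if data is SENTINEL:
            break
        data *= 2                 # stand-in for real processing

workers = [threading.Thread(target=consumer) for _ in range(NWORKERS)]
for w in workers:
    w.start()
producer("test.ms")               # placeholder MS
for w in workers:
    w.join()
```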

JSKenyon commented Aug 4, 2020

That approach is highly sub-optimal based on my tests. It forces one thread to do all the memory allocation and becomes very slow. From what I have seen, the issue is not I/O bandwidth but memory allocation. Additionally, a producer-consumer framework is at odds with parallel/distributed frameworks such as dask. I have already observed slowdowns using dask-ms (which I believe implements a producer-consumer type approach, @sjperkins?), as a single thread cannot allocate fast enough to fully utilize the available compute resources.

Reading from multiple processes seems to scale linearly with the number of processes. Having this behaviour using threads would be a game changer in my opinion. I have already shown that it can work using threads for a multi-MS, provided one tweaks python-casacore to drop the GIL. So I really believe that this should be possible, possibly with only minor changes to casacore internals. Of course I may be wrong - I am not well acquainted with casacore's internals.
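For reference, the multi-process version that does scale for me looks roughly like this (a sketch only; the MS path, column and chunk size are placeholders). The key point is that every process opens its own handle on the table, so nothing is shared:

```python
# Sketch of the multi-process read that scales roughly linearly:
# every process opens its own handle on the MS and reads a disjoint block.
# The MS path, column and chunk size are placeholders.
from multiprocessing import Pool

from casacore.tables import table

MS_NAME = "test.ms"   # placeholder
CHUNK = 10000         # placeholder row chunk

def read_block(startrow):
    # Each process opens the table independently - nothing is shared.
    ms = table(MS_NAME, ack=False)
    try:
        n = min(CHUNK, ms.nrows() - startrow)
        return ms.getcol("DATA", startrow=startrow, nrow=n)
    finally:
        ms.close()

if __name__ == "__main__":
    ms = table(MS_NAME, ack=False)
    starts = list(range(0, ms.nrows(), CHUNK))
    ms.close()
    with Pool(processes=4) as pool:
        blocks = pool.map(read_block, starts)
```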


bennahugo commented Aug 4, 2020 via email


JSKenyon commented Aug 4, 2020

You are absolutely correct @bennahugo - that preallocation is substantially better. But reading from N processes is still roughly N times faster.

@bennahugo

It may be something to do with the prefetch and caching system employed in IncrementalStorageManagers. I'm not sure if there is any sort of mutex locking around that in casacore itself - it has been a while since I looked at that code, but it would make sense for a prefetch system to have race-condition protection. Someone with a more intimate knowledge of the storage managers should comment, but one of the other standard storage managers may not have this limitation. I'm not sure how to test it aside from writing a C++ test case that replicates the Python test, to see if this is indeed where the bottleneck lies. The alternative is to compile casacore with symbols and profile your Python application using perf, to trace the amount of time spent inside the internal casacore reading functions.

@sjperkins

From a dask-ms perspective, the fact that the python CASA table access methods do not release the GIL is of concern.

dask-ms allocates a single I/O thread per CASA table (including subtables), but since the GIL will effectively serialise any python CASA table calls, any benefit from accessing multiple tables in parallel threads is lost.

@bennahugo

That would explain it, so I guess this is primarily a python-casacore issue and not a locking issue further up? I think perhaps the easiest way forward is to write a C++ snippet that replicates the Python program's behaviour, compare the timings, and see where most of the time is spent in the casacore stack.


sjperkins commented Aug 5, 2020

It sounds as if there are 3 issues at play here:

  1. python-casacore retaining the GIL on table access calls.
  2. underlying thread-safety of accessing the same CASA table from multiple threads.
  3. allocation speed for large chunks of memory.

(1) can be solved by modifying python-casacore to drop the GIL during CASA table access.

Based on private communication and @JSKenyon's experiments, (2) is dangerous due to thread-safety issues.

(3) may be related to how the Linux kernel handles extremely large allocations.
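As a rough way of looking at (3), the sketch below separates the cost of reserving a large buffer from the cost of the kernel faulting the pages in on first touch (the size is arbitrary and this is only illustrative, not a claim about where the time goes inside casacore):

```python
# Illustrative timing of a large allocation: np.empty reserves virtual
# memory almost instantly, while writing to the buffer forces the kernel
# to fault the pages in, which is where the time tends to go.
import time

import numpy as np

size = 2 * 1024**3   # 2 GiB, arbitrary

t0 = time.perf_counter()
buf = np.empty(size, dtype=np.uint8)   # virtual reservation only
t1 = time.perf_counter()
buf[:] = 1                             # first touch: pages are faulted in
t2 = time.perf_counter()

print(f"allocate: {t1 - t0:.3f} s, first touch: {t2 - t1:.3f} s")
```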


sjperkins commented Aug 5, 2020

> I have already observed slowdowns using dask-ms (which I believe implements a producer-consumer type approach, @sjperkins?), as a single thread cannot allocate fast enough to fully utilize the available compute resources.

For clarification, dask-ms allocates buffers in multiple threads, but serialises all reads into (and writes from) those buffers in a single thread, for a single CASA table and its subtables. Another table and its subtables will be accessed in a separate thread.

The producer/consumer model is implicit in the use of a ThreadPoolExecutor to isolate access to the CASA table to a single thread.
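In very rough pseudo-code the pattern is something like the sketch below (illustrative only, not the actual dask-ms code; the column, shape and dtype are placeholders):

```python
# Illustrative sketch of the pattern described above: buffers are allocated
# in the calling threads, but every actual table access is funnelled through
# a single-worker ThreadPoolExecutor. Column name, shape and dtype are
# placeholders, not dask-ms internals.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from casacore.tables import table

ms = table("test.ms", ack=False)              # placeholder MS
io_pool = ThreadPoolExecutor(max_workers=1)   # one I/O thread per table

def read_into_buffer(startrow, nrow, nchan=64, ncorr=4):
    # Allocation happens in the caller's thread...
    buf = np.empty((nrow, nchan, ncorr), dtype=np.complex64)
    # ...but the read itself is serialised onto the single I/O thread.
    io_pool.submit(ms.getcolnp, "DATA", buf, startrow, nrow).result()
    return buf
```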

@bennahugo

(3) could be solved by playing around with the huge-page allocation settings and inspecting the translation lookaside buffer probes via perf to see if any gains can be made. I suspect that for our use case it might help to increase the page sizes. It may, however, be necessary to write a custom memory allocator to make full use of page-aligned memory accesses.


JSKenyon commented Aug 5, 2020

While huge pages might help, I am against it in general. Most users will not have the ability to muck around with those settings.

I would really appreciate it if @gervandiepen or @aroffringa could pitch in, particularly regarding the addition of different storage managers. This might all be a pointless debate if it has already been solved behind the scenes.


tammojan commented Aug 5, 2020

@gervandiepen is on vacation, I think he'll return on Monday.

@bennahugo

I dug up an old allocation test that tested this. It would be useful, though, to configure nodes that do MeerKAT (or ASKAP/LOFAR?) data reduction to make optimal use of the hugepage system. From an old email thread:

I cannot reproduce Simon's observation. Are you sure the kernel page caches aren't slowing you down? It might be worthwhile dropping them.

There is a small improvement (~1%) from pinning the memory and process to node 0. The default behaviour of the kernel is to try its best to keep memory on the same node as a process.

That said, there are a few optimizations to try. Foremost is to force a rebuild of numpy, which will make maximal use of the vector instruction set of the CPU. Maybe Bruce can comment, but I think the default -O3 optimizations only use the generic x86 instruction set? This should suffice to compile:

CFLAGS="-O3 -march=native"  pip install --no-binary :all: numpy==1.13.3 #same as system

Another trick is to enable huge pages - potentially wasteful in my contexts, but it could work for large-array-dominated applications. The change can be made permanent in GRUB, but to get it temporarily until the next reboot, do:

echo always > /sys/kernel/mm/transparent_hugepage/enabled

The final comparison is as follows:

[image: final timing comparison]


JSKenyon commented Aug 11, 2020

I have just put up a PR on python-casacore to demonstrate some of my findings regarding the GIL and parallel reads.

@JSKenyon

Just wanted to check in here - any more thoughts @tammojan, @aroffringa, @gervandiepen? Sorry to be a pest, but this is really something I want to dig into.
