Why can't an "autonoread"-locked table be read from multiple threads? #1038
Comments
I would recommend using a producer-consumer pattern, with one thread streaming data off disk and the other threads performing work on that data. This is how I previously implemented parallel processing with casacore. As you mentioned, I'm not sure whether the locking mechanism is thread safe.
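For illustration, a minimal sketch of that producer-consumer pattern, assuming python-casacore's `getcol`; the MS path, column name, and chunk size are placeholders:

```python
import queue
import threading
from casacore.tables import table

def produce(ms_path, column, chunk_rows, work_q, n_workers):
    # Single reader thread: stream row chunks from disk into the queue.
    t = table(ms_path, ack=False)
    nrows = t.nrows()
    for start in range(0, nrows, chunk_rows):
        nrow = min(chunk_rows, nrows - start)
        work_q.put(t.getcol(column, startrow=start, nrow=nrow))
    t.close()
    for _ in range(n_workers):
        work_q.put(None)  # sentinel: no more chunks

def consume(work_q):
    # Worker threads: process chunks as they arrive.
    while (chunk := work_q.get()) is not None:
        chunk.sum()  # placeholder for real per-chunk work

work_q = queue.Queue(maxsize=4)
n_workers = 4
workers = [threading.Thread(target=consume, args=(work_q,)) for _ in range(n_workers)]
producer = threading.Thread(target=produce,
                            args=("test.ms", "DATA", 10000, work_q, n_workers))
for w in workers:
    w.start()
producer.start()
producer.join()
for w in workers:
    w.join()
```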
That approach is highly sub-optimal based on my tests: it forces one thread to do all the memory allocation and becomes very slow. From what I have seen, the issue is not I/O bandwidth but memory allocation. Additionally, a producer-consumer framework is at odds with parallel/distributed frameworks such as dask. I have already observed slowdowns using dask-ms (which I believe implements a producer-consumer type approach, @sjperkins?), as a single thread cannot allocate fast enough to fully utilize the available compute resources. Reading from multiple processes seems to scale linearly with the number of processes; having this behaviour using threads would be a game changer in my opinion. I have already shown that it can work using threads for a multi-MS, provided one tweaks python-casacore to drop the GIL. So I really believe that this should be possible, possibly with only minor changes to casacore internals. Of course I may be wrong - I am not well acquainted with casacore's internals.
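A minimal sketch of the multi-process reading pattern mentioned above, assuming each worker opens its own handle on the table and reads a disjoint row range; the MS path, column, and process count are placeholders:

```python
import multiprocessing as mp
from casacore.tables import table

def read_chunk(args):
    # Each process opens its own table handle, so reads (and the
    # associated memory allocation) proceed fully in parallel.
    ms_path, column, start, nrow = args
    t = table(ms_path, ack=False)
    data = t.getcol(column, startrow=start, nrow=nrow)
    t.close()
    return data.shape  # avoid shipping large arrays back through pickling

if __name__ == "__main__":
    ms_path, column, n_procs = "test.ms", "DATA", 4
    t = table(ms_path, ack=False)
    n = t.nrows()
    t.close()
    step = -(-n // n_procs)  # ceiling division
    chunks = [(ms_path, column, s, min(step, n - s)) for s in range(0, n, step)]
    with mp.Pool(n_procs) as pool:
        print(pool.map(read_chunk, chunks))
```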
I have noticed that python-casacore (I am not sure if this originates in the underlying casacore) takes a long time to allocate memory. You can preallocate numpy arrays and read using getcolnp, which was significantly faster in our case (DDFacet with casacore 2.x a while back).
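A minimal sketch of that preallocation approach: `getcolnp` fills an existing numpy array in place rather than allocating a new one on every read. The column shape and dtype below are placeholders and would need to match the actual MS:

```python
import numpy as np
from casacore.tables import table

t = table("test.ms", ack=False)
nrow = 10000  # rows per read; placeholder
# Preallocate once, matching the column's dtype and per-row shape
# (e.g. DATA is typically complex64 with shape (nchan, ncorr)).
buf = np.empty((nrow, 64, 4), dtype=np.complex64)
# Full chunks only, for brevity; a real loop would handle the tail rows.
for start in range(0, t.nrows() - nrow + 1, nrow):
    t.getcolnp("DATA", buf, startrow=start, nrow=nrow)
    # ... process buf in place; no per-read allocation ...
t.close()
```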
You are absolutely correct @bennahugo - preallocation is substantially better. But reading from N processes is still roughly N times faster.
It may be something to do with the prefetch and caching system employed in the IncrementalStorageManager. I'm not sure if there is any sort of mutex locking around that in casacore itself - it has been a while since I looked at that code, but it would make sense for a prefetch system to have race-condition protection. Someone with a more intimate knowledge of the storage managers should comment, but one of the other standard storage managers may not have this limitation. I'm not sure how to test it, aside from writing a C++ test case that replicates the python test to see if this is indeed where the bottleneck lies. The alternative is to compile casacore with symbols and profile your python application using perf to trace the amount of time spent inside the internal casacore reading functions.
From a dask-ms perspective, the fact that the python CASA table access methods do not release the GIL is of concern. dask-ms allocates a single I/O thread per CASA table (including subtables), but since the GIL effectively serialises all python CASA table calls, any benefit from accessing multiple tables in parallel threads is lost.
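A small timing sketch of the effect described above, assuming several independent test tables exist: because the GIL is held inside the C++ calls, the threaded version would show little to no speedup over the sequential one (the paths are placeholders):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from casacore.tables import table

def read_table(path):
    # Each call opens its own handle, but the GIL held inside
    # python-casacore still serialises the reads across threads.
    t = table(path, ack=False)
    data = t.getcol("DATA")
    t.close()
    return data.shape

paths = ["test0.ms", "test1.ms", "test2.ms", "test3.ms"]

t0 = time.perf_counter()
for p in paths:
    read_table(p)
print("sequential:", time.perf_counter() - t0)

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    list(pool.map(read_table, paths))
print("threaded:  ", time.perf_counter() - t0)
```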
That would explain it - so I guess this is primarily a python-casacore issue rather than a locking issue further up? I think the easiest way forward is to write a C++ snippet that replicates the python program's behaviour, compare the timings, and see where most of the time is spent in the casacore stack.
It sounds as if there are three issues at play here:

(1) python-casacore holds the GIL during CASA table access;
(2) reading a single CASA table from multiple threads;
(3) slow memory allocation during reads.

(1) can be solved by modifying python-casacore to drop the GIL during CASA table access. Based on private communication and @JSKenyon's experiments, (2) is dangerous due to thread-safety issues. (3) may be related to extremely large allocations by the linux kernel.
For clarification, dask-ms allocates buffers in multiple threads, but serialises all reads into (and writes from) those buffers in a single thread, for a single CASA table and its subtables. Another table and its subtables will be serialised in another thread. The producer/consumer model is implicit in the use of a ThreadPoolExecutor to isolate access to each CASA table to a single thread.
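A minimal sketch of that isolation pattern (not dask-ms's actual code): a single-worker ThreadPoolExecutor per table serialises all access to that table, while different tables get different executors. The class and table paths are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from casacore.tables import table

class TableProxy:
    """Serialise all access to one CASA table through one dedicated thread."""

    def __init__(self, path):
        self._pool = ThreadPoolExecutor(max_workers=1)
        # Open the table on its dedicated thread.
        self._table = self._pool.submit(table, path, ack=False).result()

    def getcol(self, *args, **kw):
        # Every read runs on the table's dedicated thread.
        return self._pool.submit(self._table.getcol, *args, **kw)

main = TableProxy("test.ms")
spw = TableProxy("test.ms::SPECTRAL_WINDOW")
# Access to different tables proceeds on different threads,
# but the GIL still serialises the underlying C++ calls.
data = main.getcol("DATA").result()
freqs = spw.getcol("CHAN_FREQ").result()
```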
(3) could be solved by playing around with the huge-page allocation settings and inspecting the translation lookaside buffer (TLB) probes via perf to see if any gains can be made. I suspect that for our use case it might help to increase the page sizes. It may, however, be necessary to write a custom memory allocator to make full use of page-aligned memory accesses.
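As a rough sketch of the custom-allocator idea (an assumption on my part, not existing casacore behaviour), one can hand page-aligned buffers to `getcolnp` by over-allocating and slicing:

```python
import numpy as np

PAGE = 4096  # or 2 MiB when targeting huge pages

def aligned_empty(shape, dtype, alignment=PAGE):
    """Return an empty array whose data pointer is aligned to `alignment` bytes."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    # Over-allocate, then slice off the misaligned head; the slice keeps
    # the raw buffer alive via its .base reference.
    raw = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-raw.ctypes.data) % alignment
    return raw[offset:offset + nbytes].view(dtype).reshape(shape)

buf = aligned_empty((10000, 64, 4), np.complex64)
assert buf.ctypes.data % PAGE == 0
# buf can then be passed as the target array to table.getcolnp(...).
```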
While huge pages might help, I am against them in general - most users will not have the ability to muck around with those settings. I would really appreciate it if @gervandiepen or @aroffringa could pitch in, particularly regarding the different storage managers. This might all be a pointless debate if it has already been solved behind the scenes.
@gervandiepen is on vacation, I think he'll return on Monday.
I dug up an old allocation test that covered this. It would be useful, though, to configure nodes that do MeerKAT (or ASKAP/LOFAR?) data reduction to make optimal use of the hugepage system. From an old email thread:
I have just put up a PR on python-casacore to demonstrate some of my findings with regard to the GIL and parallel reads.
Just wanted to check in here - any more thoughts @tammojan, @aroffringa, @gervandiepen? Sorry to be a pest, but this is really something I want to dig into. |
I am currently implementing a parallel/distributed calibration application and I am encountering slowdowns caused by reads being serialised between threads. To mitigate this, it would be ideal if I could:

- read a single measurement set from multiple threads in parallel.
The issue is that reading an MS from multiple threads causes segfaults. I believe that this is likely a result of some internal caching mechanism, but I would appreciate it if someone more familiar with casacore could explain precisely what is happening behind the scenes.
If it is only the caching that causes problems, I would happily sacrifice the caching in order to have parallel reads (and, implicitly, parallel memory allocation). In my opinion, this parallel read functionality is crucial for the multi-TB measurement sets produced by instruments like MeerKAT.
I have already done a number of experiments, including using processes for parallel reads, but I desperately want to avoid multiprocessing as, in my experience, it is very difficult to maintain. What I have managed is to read from multiple threads using a multi-MS. This is very cool, but I would prefer not to coerce the user into using a multi-MS.
To summarize, parallel reads using threads is high on my wish list!
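As a rough illustration of the multi-MS workaround mentioned above, assuming the MS has already been partitioned into sub-MSs and a GIL-dropping python-casacore build, each thread opens and reads its own sub-MS; the paths are hypothetical:

```python
import threading
from casacore.tables import table

def read_sub_ms(path):
    # Each thread works on its own sub-MS, so no table handle is shared.
    t = table(path, ack=False)
    data = t.getcol("DATA")
    t.close()
    return data.shape

# Hypothetical sub-MS paths produced by partitioning the original MS.
sub_mss = ["test.ms/SUBMSS/test.0000.ms", "test.ms/SUBMSS/test.0001.ms"]
threads = [threading.Thread(target=read_sub_ms, args=(p,)) for p in sub_mss]
for th in threads:
    th.start()
for th in threads:
    th.join()
```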