
Enhancing parallel IO in Casacore Table Data System #729

Closed
JasonRuonanWang opened this issue Jun 8, 2018 · 21 comments

Comments

@JasonRuonanWang
Contributor

I don't know if this is the best place, but I was hoping to start some discussion on enhancing parallel IO for the table system as well as MeasurementSets.

A few years ago, when I was at ICRAR, I wrote a parallel storage manager for the casacore table system based on ADIOS (https://github.com/SKA-ScienceDataProcessor/AdiosStMan.git), which went into SKA-SDP as an early investigation into SKA-scale MeasurementSet experiments. When @gervandiepen was visiting us at ICRAR we talked about merging this AdiosStMan into master, but I was worried about the code quality because it was only a prototype rather than production software. @gervandiepen was also worried about introducing an MPI dependency into casacore. So eventually we dropped the plan.

Recently I have been working on updating the Adios Storage Manager using our new software ADIOS 2.0 (https://github.com/ornladios/ADIOS2), which we are developing at Oak Ridge National Lab under the US Department of Energy's Exascale Computing program. The motivation is that we are planning to run a full-scale SKA1 data pipeline simulation, which may require writing data into a single MeasurementSet table from multiple processes. If this is successful, we can revisit improving the code quality and merging our efforts into master.

However, the current parallel storage manager design is more of a hack than a generic solution, because we couldn't find parallel support in casacore's table layer API. As a result, we have to hack user applications and make them believe that each MPI rank is writing to its own independent table, while the Adios Storage Manager does dirty tricks underneath and actually writes the data into a single column file. This has worked so far, but it makes user applications very ugly and less maintainable.

I have roughly read through the table layer code trying to think of a solution. If we could add some MPI-aware code to the table creation procedure, so that only rank 0 actually writes the table metadata files while the other ranks hold the table metadata in memory without writing to disk, this would make the table system much friendlier to parallel storage managers as well as MPI-based applications.
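Roughly, what I have in mind is something like the following sketch. The two helpers are hypothetical and do not exist in casacore; the point is only that rank 0 touches the metadata files on disk while the other ranks build the same metadata in memory:

```cpp
// Sketch only -- the two helpers below are hypothetical, not casacore calls.
#include <mpi.h>
#include <string>

void writeTableMetadataFiles(const std::string& tableName);    // hypothetical
void buildTableMetadataInMemory(const std::string& tableName); // hypothetical

void createTableMpiAware(const std::string& tableName, MPI_Comm comm)
{
    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
        writeTableMetadataFiles(tableName);     // the normal serial path
    } else {
        buildTableMetadataInMemory(tableName);  // no file I/O on other ranks
    }
    // Ensure the metadata exists on disk before any rank lets its parallel
    // storage manager start writing column data.
    MPI_Barrier(comm);
}
```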

I am wondering if this is possible, or if there are any other suggestions or plans for enhancing parallel IO in casacore. Looking forward to any discussions or suggestions. Thank you.

@JasonRuonanWang
Contributor Author

JasonRuonanWang commented Jun 8, 2018

There is another approach I can think of: adding another mode to TableLock::LockOption. This mode would work as follows: whichever process comes first owns the table lock, and if the table is already locked, the other processes skip writing the table metadata file. This would also work for any MPI-based parallel IO storage managers or applications. The advantage over the previous approach is that it would not require introducing an MPI dependency into casacore.
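For illustration, application code could then look like this. The lock option name below is invented and does not exist in casacore; it only stands for the proposed mode:

```cpp
// Sketch only -- "ParallelWrite" is NOT an existing TableLock::LockOption
// value. It represents the proposed mode: the first process to acquire the
// lock writes the table metadata files, later processes skip that write.
#include <casacore/tables/Tables/Table.h>
#include <casacore/tables/Tables/TableLock.h>

casacore::Table openSharedForWriting(const casacore::String& name)
{
    casacore::TableLock lock(casacore::TableLock::ParallelWrite);  // hypothetical
    return casacore::Table(name, lock, casacore::Table::Update);
}
```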

Any suggestions?

@tammojan
Contributor

tammojan commented Jun 8, 2018

Hi Jason. We are currently working on a new version of the Measurement Set in a project called MSv3. This is a collaboration between NRAO and the SKA organisation. Perhaps you can give a presentation of your work in one of our next telecons (usually on Wednesdays at 10:00 EST) and discuss your suggestions?

@gervandiepen or @mpokorny, both on the MSv3 project, may want to comment on your suggestions.

@JasonRuonanWang
Contributor Author

@tammojan That's great. I would like to attend and talk about my work. I should be in the same time zone as NRAO which makes things easier. Could you send me details? My email is jason.ruonan.wang@gmail.com

Thanks.

@gervandiepen
Contributor

gervandiepen commented Jun 12, 2018 via email

@mpokorny
Contributor

Hi Jason,
FYI, over the last several months I've been working on methods using MPI-IO collective data access functions to read and write casacore table data columns, supporting a wide range of parallelization and data access patterns. My work is limited to reading and writing data columns of a uniform shape only, so is not (yet) anything like a casacore storage manager. One reason I took a path avoiding the casacore table system directly is that the CTDS API doesn't accommodate anything like MPI communicator handles, so I decided to avoid wedging my code into the CTDS just to do some exploration of parallel IO. As I understand it (perhaps improperly!), ADIOS uses MPI but not MPI-IO, and I can imagine your work might have similar issues integrating into the CTDS API that mine may have (eventually). I would hope that we might come up with a better design for integrating MPI into CTDS if we have more than one use case to consider.

@JasonRuonanWang
Contributor Author

@gervandiepen Yes sir. If an MPI dependency is not a big issue, then I would personally prefer the first approach too.

@JasonRuonanWang
Contributor Author

@mpokorny It's great to hear there are people working on the same issue. I hope there is something we can do together to make parallel IO a formal CTDS feature, so that we no longer each need to explore our own approaches.

There is MPI-IO in ADIOS, but I haven't been using it for CTDS, because the flexibility of writing different rows or columns asynchronously is an important feature of a table system, while MPI-IO is mainly designed for synchronous IO. So you are right to that extent. For the MPI communicator issue, what I did in the ADIOS storage manager is to pass the MPI communicator in the constructor of the AdiosStMan class, so it bypasses the table interface of CTDS. But this does not solve the problem I mentioned -- you can't create a table from multiple MPI ranks, because creating a table is a CTDS table feature rather than a storage manager feature. So it would definitely be better if CTDS natively supported MPI, so that future MPI-based apps can be written cleanly and stay compatible with serial code.
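For illustration, usage looks roughly like this. I am assuming a constructor overload that takes an MPI communicator (the exact signature in the AdiosStMan repository may differ), and the header path is a placeholder:

```cpp
// Usage sketch. The AdiosStMan constructor taking an MPI communicator and the
// header path are assumptions; the rest is standard CTDS column binding.
#include <mpi.h>
#include <casacore/tables/Tables/TableDesc.h>
#include <casacore/tables/Tables/ArrColDesc.h>
#include <casacore/tables/Tables/SetupNewTab.h>
#include <casacore/tables/Tables/Table.h>
#include "AdiosStMan.h"   // placeholder include path

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    casacore::TableDesc desc;
    desc.addColumn(casacore::ArrayColumnDesc<float>("DATA"));

    casacore::SetupNewTable setup("parallel.table", desc, casacore::Table::New);
    AdiosStMan stman(MPI_COMM_WORLD);   // communicator passed to the stman,
    setup.bindColumn("DATA", stman);    // bypassing the CTDS table interface
    casacore::Table table(setup);

    // ... each rank writes its own rows; underneath, the storage manager
    // maps them into a single shared ADIOS file ...

    MPI_Finalize();
    return 0;
}
```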

@JasonRuonanWang
Contributor Author

Opened #742, which adds MPI awareness to the table layer API. This feature, together with a parallel storage manager, enables serial casacore applications to write tables from multiple MPI processes without changing a single line of code.

@JasonRuonanWang
Contributor Author

According to what we agreed in the last telecon, the first step is to add MPI awareness to the table API, and the second step is to rewrite the table layer to the extent that it can use alternative storage backends besides Aips IO.

Several ideas for the second step have come up during the work of the first step. Basically we can do it in a few ways:

1. Introduce a middleware API at the Aips IO layer.

This would re-use the current Table implementation without rewriting anything at the Table layer. However, Aips IO is stream based, so the alternative IO backend would have to be implemented in the same way to stay compatible with the current Table implementation. This could limit the new IO backend in several ways, ranging from flexibility to being self-contained and self-describing.

2. Introduce a middleware API at the CTDS Table layer.

This would require rewriting the Table layer, which is a lot of work.

3. Implement MSv3 in a way that it can switch underlying storage backends (table implementations).

This is my preferred way. The MS is the main use case of CTDS, and probably the only one that potentially requires peta- to exa-scale I/O (and thus has to be parallel). At the same time, the nature of the MeasurementSet does not necessarily require it to be stored in a table system, and many features of CTDS are not really used by the MS. Therefore, re-implementing CTDS is not cost-efficient. A more cost-efficient way is to implement MSv3 so that it can sit on various I/O backends. It would then not only gain flexibility and functionality, such as streaming MS tables over the network, but could also get better parallel I/O performance by bypassing the table layer and interacting directly with newer-generation I/O systems such as ADIOS 2 (see the sketch below).
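To make option 3 concrete, a very rough interface sketch. All names here are invented, not an actual design; it only illustrates that applications see one MSv3 API while concrete backends plug in behind an abstract interface:

```cpp
// Invented names throughout -- only the shape of option 3 is illustrated.
#include <memory>
#include <string>
#include <vector>

class MSv3Backend {
public:
    virtual ~MSv3Backend() = default;
    virtual void createTable(const std::string& name) = 0;
    virtual void putColumn(const std::string& column,
                           const std::vector<double>& values) = 0;
    virtual std::vector<double> getColumn(const std::string& column) = 0;
};

// Possible concrete backends; only declared here.
class CtdsBackend;    // wraps the existing casacore table system
class Adios2Backend;  // parallel I/O via ADIOS 2

class MeasurementSetV3 {
public:
    explicit MeasurementSetV3(std::unique_ptr<MSv3Backend> backend)
        : backend_(std::move(backend)) {}
    // Applications only ever use this API; the backend is swappable.
private:
    std::unique_ptr<MSv3Backend> backend_;
};
```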

@mpokorny
Contributor

Jason, regarding your proposal 3, would it mean that programs that access measurement sets would then use something like the MeasurementSet API, without referring to any Aips IO or CTDS API? I could certainly see how that might open up many possibilities for I/O backends. Depending strongly, of course, on the design of any such new MeasurementSet API.

As an aside, I've been frequently annoyed by how much of the CASA code uses CTDS functionality rather than MS functionality. It strikes me as wrong architecturally (and conceptually), is inefficient in several ways (e.g., programmer time, lines of code), and makes CASA code very sensitive to low-level changes in I/O configuration such as the file system and CTDS system and file format parameters. Making the MS API more flexible and promoting it as the API for accessing MSs might help clean up CASA quite a lot. That might impose a bit of work on CASA, although I believe it would bring important benefits.

@JasonRuonanWang
Contributor Author

@mpokorny Yes, basically I think we are talking about the same thing. In this way, applications just interact with the MSv3 API, and MSv3 handles all underlying I/O operations without exposing the complexity to applications. As you said, this would open up many possibilities. For example, if you don't have a large local filesystem but you have a very fast network, you can produce MeasurementSets and stream them over the network. Or if you find the filesystem is the performance bottleneck, you can store data in something else such as object stores or block devices. Or if you don't have enough storage space anywhere and you don't mind losing some precision in the data, you can let the underlying I/O middleware compress the data in situ for you.

We have been discussing these scenarios a lot for SKA workflows, and if we consider MSv3 as something people will rely on for the next decade or two, these scenarios are very likely to become hard requirements. In the meantime, these are required not only by radio astronomy but also by many other research areas. So enabling MSv3 to take advantage of generic I/O middleware will pay off a lot in the long run.

@gervandiepen
Contributor

gervandiepen commented Jun 29, 2018 via email

@twillis449

Some of you might be interested in https://github.com/ska-sa/xarray-ms. On a smaller scale, I've successfully used Python multiprocessing to calculate visibilities in parallel for SKA simulations. Of course my sims only run for a 4-minute observation (0.14-second integrations) with one frequency channel and 197 antennas. But these sims revealed some subtle effects that will affect sources at large distances from the field centre.

@JasonRuonanWang
Contributor Author

@twillis449 Thanks for this. Is it using the multi-MS feature of CASA, which generates a number of physical tables and then concatenates them logically?

@sjperkins
Contributor

It's nice to see that xarray-ms is getting exposure outside SKA SA. At heart it's a dask layer around python casacore. I guess it would benefit from any enhancements to the underlying casa file system made here.

@sjperkins
Contributor

@twillis449 Are you aware of https://github.com/ska-sa/codex-africanus? The dask algorithm implementations are designed to take xarray-ms arrays.

@twillis449

@JasonRuonanWang No, basically I just get blocks of UV data from the UVW column and farm off visibility amp/phase calculations to sub-processes. When a sub-process is finished I catch the results and stick them back into the corresponding location in, e.g., the DATA or CORRECTED_DATA columns. @sjperkins I'm impressed with the stuff you people are developing in SKA-SA, but so far for my SKA1 sims I've been able to get by with 'toy' measurement sets. That may change if I start doing some simulations for the ngVLA array, where there are more antennas and longer baselines.

@sjperkins
Contributor

sjperkins commented Aug 7, 2018

With regard to MSv3: is there an intention to move the underlying storage to something resembling an object store (in the spirit of Ceph, for example)? MSv2 seems to rely on monolithic files, which are suboptimal for parallel file systems. While I'm no expert, it seems that separate files for individual chunks of data are the way forward on these systems. Of course, the MS already supports this to some degree via tiling and hypercolumns...
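For reference, the existing tiling mechanism is used roughly like this (a sketch; the tile shape is arbitrary, not a recommendation):

```cpp
// Sketch of the existing tiling support: the DATA hypercolumn is stored by
// TiledShapeStMan, so the data on disk is chunked into tiles rather than
// being one flat stream. Tile shape (corr, chan, row) is chosen arbitrarily.
#include <casacore/tables/Tables/TableDesc.h>
#include <casacore/tables/Tables/ArrColDesc.h>
#include <casacore/tables/Tables/SetupNewTab.h>
#include <casacore/tables/Tables/Table.h>
#include <casacore/tables/DataMan/TiledShapeStMan.h>
#include <casacore/casa/Arrays/IPosition.h>
#include <casacore/casa/Arrays/Vector.h>
#include <casacore/casa/BasicSL/Complex.h>
#include <casacore/casa/BasicSL/String.h>

casacore::Table makeTiledTable()
{
    using namespace casacore;

    TableDesc desc;
    desc.addColumn(ArrayColumnDesc<Complex>("DATA", 2));
    desc.defineHypercolumn("TiledData", 3, Vector<String>(1, "DATA"));

    SetupNewTable setup("tiled.table", desc, Table::New);
    TiledShapeStMan stman("TiledData", IPosition(3, 4, 64, 128));
    setup.bindColumn("DATA", stman);
    return Table(setup);
}
```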

@twillis449

SKA SDP must be doing a LOT of parallel processing. Anybody know what their plan / design for data files is?

@JasonRuonanWang
Contributor Author

@twillis449 @sjperkins

I was one of the people in SKA SDP doing research on parallel IO several years ago, and I came up with the ADIOS storage manager (https://www.sciencedirect.com/science/article/pii/S2213133716300452), which can write in parallel at the casacore table column layer. That was pretty much the only practical generic solution I could think of at the time, given the limited time and resources I had. However, the downside is also obvious: it can only parallelize IO at the column layer, not the table layer. To improve this, I recently submitted pull request #742 to make the table layer MPI-aware, so that it can better support parallel storage managers.

But ultimately the ideal solution for MSv3 is still to have the flexibility to swap the entire storage backend (both the table layer and the column layer) for something independent of casacore. Like @sjperkins said, this could be Ceph, or it could be ADIOS 2.0, whose development I am currently participating in at ORNL. Then users can take advantage of their existing hardware without changing the upper-layer application code. For instance, SA can continue using Ceph, Australia can use something else like Nyriad storage, and if I need to showcase a pipeline on one of our DOE supercomputers, I can easily switch to a Lustre mode. I have briefly discussed this idea with @gervandiepen and we basically agree in principle. The problem is really how much resource we can put into realizing these ideas.
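To show how little application-side code such a backend needs, a minimal parallel ADIOS 2 write looks roughly like this. The variable and file names are arbitrary, and this is independent of casacore; the same code runs on whatever storage the site provides by switching the ADIOS 2 engine and configuration rather than the application:

```cpp
// Minimal parallel ADIOS 2 write: each rank contributes its own slice of a
// one-dimensional global array to a single shared output.
#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const std::size_t nLocal = 1000;               // rows owned by this rank
    std::vector<float> data(nLocal, float(rank));  // dummy payload

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("ms-column");
    auto var = io.DefineVariable<float>(
        "DATA",
        {static_cast<std::size_t>(size) * nLocal},  // global shape
        {static_cast<std::size_t>(rank) * nLocal},  // this rank's offset
        {nLocal});                                  // this rank's count

    adios2::Engine engine = io.Open("data.bp", adios2::Mode::Write);
    engine.BeginStep();
    engine.Put(var, data.data());
    engine.EndStep();
    engine.Close();

    MPI_Finalize();
    return 0;
}
```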

@JasonRuonanWang
Contributor Author

I think this issue can be closed, since we have added Adios2StMan to casacore and modified the casacore table layer to support truly parallel I/O. We have used this improvement in our full-scale SKA1 data processing workflow with simulated data and verified it on the entire Summit supercomputer, at the time the fastest machine in the world. Results can be found at https://www.computer.org/csdl/proceedings-article/sc/2020/999800a011/1oeORuLNgAw
