
Enhancing parallel IO in Casacore Table Data System #729

Closed
JasonRuonanWang opened this issue Jun 8, 2018 · 21 comments

Comments

@JasonRuonanWang
Contributor

I don't know if this is the best place, but I was hoping to start some discussion on enhancing parallel IO for the table system as well as MeasurementSets.

A few years ago, when I was at ICRAR, I wrote a parallel storage manager for the casacore table system based on ADIOS (https://github.com/SKA-ScienceDataProcessor/AdiosStMan.git), which went into SKA-SDP as an early investigation into SKA-scale MeasurementSet experiments. When @gervandiepen was visiting us at ICRAR we talked about merging this AdiosStMan into master, but I was worried about the code quality because it was only a prototype rather than production software. @gervandiepen was also worried about introducing an MPI dependency into casacore. So eventually we dropped the plan.

Recently I have been working on updating the Adios Storage Manager using our new software ADIOS 2.0 (https://github.com/ornladios/ADIOS2), which we are developing at Oak Ridge National Lab under the US Department of Energy's Exascale Computing program. The motivation is that we are planning to run a full-scale SKA1 data pipeline simulation, which may require writing data into a single MeasurementSet table from multiple processes. If this is successful, we can revisit improving the code quality and merging our efforts into master.

However, the current parallel storage manager design is more of a hack than a generic solution, because we couldn't find parallel support in casacore's table layer API. As a result, we have to hack user applications and make them believe that each MPI rank is writing to its own independent table, while the Adios Storage Manager does dirty tricks underneath and actually writes the data into a single column file. This has worked so far, but it makes user applications very ugly and less maintainable.

I have roughly read through the table layer code trying to think of a solution. If we could add some MPI-aware code to the table creation procedure, so that only rank 0 actually writes the table metadata files while the other ranks hold the table metadata in memory without writing to disk, this would make the table system much friendlier to parallel storage managers as well as MPI-based applications.
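Roughly, what I have in mind is something like the following sketch. The two helpers are hypothetical and do not exist in casacore; the point is only that rank 0 touches the metadata files on disk while the other ranks build the same metadata in memory:

```cpp
// Sketch only -- the two helpers below are hypothetical, not casacore calls.
#include <mpi.h>
#include <string>

void writeTableMetadataFiles(const std::string& tableName);    // hypothetical
void buildTableMetadataInMemory(const std::string& tableName); // hypothetical

void createTableMpiAware(const std::string& tableName, MPI_Comm comm)
{
    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
        writeTableMetadataFiles(tableName);     // the normal serial path
    } else {
        buildTableMetadataInMemory(tableName);  // no file I/O on other ranks
    }
    // Ensure the metadata exists on disk before any rank lets its parallel
    // storage manager start writing column data.
    MPI_Barrier(comm);
}
```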

I am wondering if this is possible, or if there are any other suggestions or plans for enhancing parallel IO in casacore. Looking forward to any discussions or suggestions. Thank you.

@JasonRuonanWang
Contributor Author

JasonRuonanWang commented Jun 8, 2018

There is another approach I can think of: adding another mode to TableLock::LockOption. This mode would work as follows: whichever process comes first owns the table lock, and if the table is already locked, the other processes skip writing the table metadata file. This would also work for any MPI-based parallel IO storage managers or applications. The advantage over the previous approach is that it would not require introducing an MPI dependency into casacore.
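For illustration, application code could then look like this. The lock option name below is invented and does not exist in casacore; it only stands for the proposed mode:

```cpp
// Sketch only -- "ParallelWrite" is NOT an existing TableLock::LockOption
// value. It represents the proposed mode: the first process to acquire the
// lock writes the table metadata files, later processes skip that write.
#include <casacore/tables/Tables/Table.h>
#include <casacore/tables/Tables/TableLock.h>

casacore::Table openSharedForWriting(const casacore::String& name)
{
    casacore::TableLock lock(casacore::TableLock::ParallelWrite);  // hypothetical
    return casacore::Table(name, lock, casacore::Table::Update);
}
```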

Any suggestions?

@tammojan
Contributor

tammojan commented Jun 8, 2018

Hi Jason. We are currently working on a new version of the Measurement Set in a project called MSv3. This is a collaboration between NRAO and the SKA organisation. Perhaps you can give a presentation of your work in one of our next telecons (usually on Wednesdays at 10:00 EST) and discuss your suggestions?

@gervandiepen or @mpokorny, both on the MSv3 project, may want to comment on your suggestions.

@JasonRuonanWang
Contributor Author

@tammojan That's great. I would like to attend and talk about my work. I should be in the same time zone as NRAO which makes things easier. Could you send me details? My email is jason.ruonan.wang@gmail.com

Thanks.

@gervandiepen
Contributor

gervandiepen commented Jun 12, 2018 via email

@mpokorny
Contributor

Hi Jason,
FYI, over the last several months I've been working on methods using MPI-IO collective data access functions to read and write casacore table data columns, supporting a wide range of parallelization and data access patterns. My work is limited to reading and writing data columns of a uniform shape only, so is not (yet) anything like a casacore storage manager. One reason I took a path avoiding the casacore table system directly is that the CTDS API doesn't accommodate anything like MPI communicator handles, so I decided to avoid wedging my code into the CTDS just to do some exploration of parallel IO. As I understand it (perhaps improperly!), ADIOS uses MPI but not MPI-IO, and I can imagine your work might have similar issues integrating into the CTDS API that mine may have (eventually). I would hope that we might come up with a better design for integrating MPI into CTDS if we have more than one use case to consider.

@JasonRuonanWang
Contributor Author

@gervandiepen Yes sir. If an MPI dependency is not a big issue, then I would personally prefer the first approach too.

@JasonRuonanWang
Contributor Author

@mpokorny It's great to hear there are people working on the same issue. I hope there is something we can do together to make parallel IO a formal CTDS feature, so that we no longer each need to explore our own approaches.

There is MPI-IO in ADIOS, but I haven't been using it for CTDS, because the flexibility of writing different rows or columns asynchronously is an important feature of a table system, while MPI-IO is mainly designed for synchronous IO. So you are right to that extent. For the MPI communicator issue, what I did in the ADIOS storage manager is to pass the MPI communicator in the constructor of the AdiosStMan class, so it bypasses the table interface of CTDS. But this does not solve the problem I mentioned -- you can't create a table from multiple MPI ranks, because creating a table is a CTDS table feature rather than a storage manager feature. So it would definitely be better if CTDS natively supported MPI, so that future MPI-based apps can be written cleanly and stay compatible with serial code.
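For illustration, usage looks roughly like this. I am assuming a constructor overload that takes an MPI communicator (the exact signature in the AdiosStMan repository may differ), and the header path is a placeholder:

```cpp
// Usage sketch. The AdiosStMan constructor taking an MPI communicator and the
// header path are assumptions; the rest is standard CTDS column binding.
#include <mpi.h>
#include <casacore/tables/Tables/TableDesc.h>
#include <casacore/tables/Tables/ArrColDesc.h>
#include <casacore/tables/Tables/SetupNewTab.h>
#include <casacore/tables/Tables/Table.h>
#include "AdiosStMan.h"   // placeholder include path

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    casacore::TableDesc desc;
    desc.addColumn(casacore::ArrayColumnDesc<float>("DATA"));

    casacore::SetupNewTable setup("parallel.table", desc, casacore::Table::New);
    AdiosStMan stman(MPI_COMM_WORLD);   // communicator passed to the stman,
    setup.bindColumn("DATA", stman);    // bypassing the CTDS table interface
    casacore::Table table(setup);

    // ... each rank writes its own rows; underneath, the storage manager
    // maps them into a single shared ADIOS file ...

    MPI_Finalize();
    return 0;
}
```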

@JasonRuonanWang
Contributor Author

Opened #742, which adds MPI awareness to the table layer API. This feature, together with a parallel storage manager, enables serial casacore applications to write tables from multiple MPI processes without changing a single line of code.

@JasonRuonanWang
Contributor Author

According to what we agreed in the last telecon, the first step is to add MPI awareness to the table API, and the second step is to rewrite the table layer to the extent that it can use alternative storage backends besides Aips IO.

Several ideas for the second step have come up during the work of the first step. Basically we can do it in a few ways:

1. Introduce a middleware API at the Aips IO layer.

This would re-use the current Table implementation without rewriting anything at the Table layer. However, Aips IO is stream based, so the alternative IO backend would have to be implemented in the same way to stay compatible with the current Table implementation. This could limit the new IO backend in several ways, ranging from flexibility to being self-contained and self-describing.

2. Introduce a middleware API at the CTDS Table layer.

This would require rewriting the Table layer, which is a lot of work.

3. Implement MSv3 in a way that it can switch underlying storage backends (table implementations).

This is my preferred way. The MS is the main use case of CTDS, and probably the only one that potentially requires peta- to exa-scale I/O (and thus has to be parallel). At the same time, the nature of the MeasurementSet does not necessarily require it to be stored in a table system, and many features of CTDS are not really used by the MS. Therefore, re-implementing CTDS is not cost-efficient. A more cost-efficient way is to implement MSv3 so that it can sit on various I/O backends. It would then not only gain flexibility and functionality, such as streaming MS tables over the network, but could also get better parallel I/O performance by bypassing the table layer and interacting directly with newer-generation I/O systems such as ADIOS 2 (see the sketch below).
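To make option 3 concrete, a very rough interface sketch. All names here are invented, not an actual design; it only illustrates that applications see one MSv3 API while concrete backends plug in behind an abstract interface:

```cpp
// Invented names throughout -- only the shape of option 3 is illustrated.
#include <memory>
#include <string>
#include <vector>

class MSv3Backend {
public:
    virtual ~MSv3Backend() = default;
    virtual void createTable(const std::string& name) = 0;
    virtual void putColumn(const std::string& column,
                           const std::vector<double>& values) = 0;
    virtual std::vector<double> getColumn(const std::string& column) = 0;
};

// Possible concrete backends; only declared here.
class CtdsBackend;    // wraps the existing casacore table system
class Adios2Backend;  // parallel I/O via ADIOS 2

class MeasurementSetV3 {
public:
    explicit MeasurementSetV3(std::unique_ptr<MSv3Backend> backend)
        : backend_(std::move(backend)) {}
    // Applications only ever use this API; the backend is swappable.
private:
    std::unique_ptr<MSv3Backend> backend_;
};
```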

@mpokorny
Contributor

Jason, regarding your proposal 3, would it mean that programs that access measurement sets would then use something like the MeasurementSet API, without referring to any Aips IO or CTDS API? I could certainly see how that might open up many possibilities for I/O backends. Depending strongly, of course, on the design of any such new MeasurementSet API.

As an aside, I've been frequently annoyed by how much of the CASA code uses CTDS functionality rather than MS functionality. It strikes me as wrong architecturally (and conceptually), is inefficient in several ways (e.g., programmer time, lines of code), and makes CASA code very sensitive to low-level changes in I/O configuration such as the file system and CTDS system and file format parameters. Making the MS API more flexible and promoting it as the API for accessing MSs might help clean up CASA quite a lot. That might impose a bit of work on CASA, although I believe it would bring important benefits.

@JasonRuonanWang
Contributor Author

@mpokorny Yes, basically I think we are talking about the same thing. In this way, applications just interact with the MSv3 API, and MSv3 handles all underlying I/O operations without exposing the complexity to applications. As you said, this would open up many possibilities. For example, if you don't have a large local filesystem but you have a very fast network, you can produce MeasurementSets and stream them over the network. Or if you find the filesystem is the performance bottleneck, you can store data in something else such as object stores or block devices. Or if you don't have enough storage space anywhere and you don't mind losing some precision in the data, you can let the underlying I/O middleware compress the data in situ for you.

We have been discussing these scenarios a lot for SKA workflows, and if we consider MSv3 as something people will rely on for the next decade or two, these scenarios are very likely to become hard requirements. In the meantime, these are required not only by radio astronomy but also by many other research areas. So enabling MSv3 to take advantage of generic I/O middleware will pay off a lot in the long run.

@gervandiepen
Contributor

gervandiepen commented Jun 29, 2018 via email

@twillis449

Some of you might be interested in https://github.com/ska-sa/xarray-ms. On a smaller scale, I've successfully used Python multiprocessing to calculate visibilities in parallel for SKA simulations. Of course my sims only run for a 4-minute observation (0.14-second integrations) with one frequency channel and 197 antennas. But these sims revealed some subtle effects that will affect sources at large distances from the field centre.

@JasonRuonanWang
Contributor Author

@twillis449 Thanks for this. Is it using the multi-MS feature of CASA, which generates a number of physical tables and then concatenates them logically?

@sjperkins
Contributor

It's nice to see that xarray-ms is getting exposure outside SKA SA. At heart it's a dask layer around python casacore. I guess it would benefit from any enhancements to the underlying casa file system made here.

@sjperkins
Contributor

@twillis449 Are you aware of https://github.com/ska-sa/codex-africanus? The dask algorithm implementations are designed to take xarray-ms arrays.

@twillis449

@JasonRuonanWang No, basically I just get blocks of UV data from the UVW column and farm off visibility amp/phase calculations to sub-processes. When a sub-process is finished I catch the results and stick them back into the corresponding location in, e.g., the DATA or CORRECTED_DATA columns. @sjperkins I'm impressed with the stuff you people are developing in SKA-SA, but so far for my SKA1 sims I've been able to get by with 'toy' measurement sets. That may change if I start doing some simulations for the ngVLA array, where there are more antennas and longer baselines.

@sjperkins
Contributor

sjperkins commented Aug 7, 2018

With regard to MSv3: is there an intention to move the underlying storage to something resembling an object store (in the spirit of Ceph, for example)? MSv2 seems to rely on monolithic files, which are suboptimal for parallel file systems. While I'm no expert, it seems that separate files for individual chunks of data are the way forward on these systems. Of course, the MS already supports this to some degree via tiling and hypercolumns...
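For reference, the existing tiling mechanism is used roughly like this (a sketch; the tile shape is arbitrary, not a recommendation):

```cpp
// Sketch of the existing tiling support: the DATA hypercolumn is stored by
// TiledShapeStMan, so the data on disk is chunked into tiles rather than
// being one flat stream. Tile shape (corr, chan, row) is chosen arbitrarily.
#include <casacore/tables/Tables/TableDesc.h>
#include <casacore/tables/Tables/ArrColDesc.h>
#include <casacore/tables/Tables/SetupNewTab.h>
#include <casacore/tables/Tables/Table.h>
#include <casacore/tables/DataMan/TiledShapeStMan.h>
#include <casacore/casa/Arrays/IPosition.h>
#include <casacore/casa/Arrays/Vector.h>
#include <casacore/casa/BasicSL/Complex.h>
#include <casacore/casa/BasicSL/String.h>

casacore::Table makeTiledTable()
{
    using namespace casacore;

    TableDesc desc;
    desc.addColumn(ArrayColumnDesc<Complex>("DATA", 2));
    desc.defineHypercolumn("TiledData", 3, Vector<String>(1, "DATA"));

    SetupNewTable setup("tiled.table", desc, Table::New);
    TiledShapeStMan stman("TiledData", IPosition(3, 4, 64, 128));
    setup.bindColumn("DATA", stman);
    return Table(setup);
}
```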

@twillis449

SKA SDP must be doing a LOT of parallel processing. Anybody know what their plan / design for data files is?

@JasonRuonanWang
Contributor Author

@twillis449 @sjperkins

I was one of the people in SKA SDP doing research on parallel IO several years ago, and I came up with the ADIOS storage manager (https://www.sciencedirect.com/science/article/pii/S2213133716300452), which can write in parallel at the casacore table column layer. That was pretty much the only practical generic solution I could think of at the time, given the limited time and resources I had. However, the downside is also obvious: it can only parallelize IO at the column layer, not the table layer. To improve this, I recently submitted pull request #742 to make the table layer MPI-aware, so that it can better support parallel storage managers.

But ultimately the ideal solution for MSv3 is still to have the flexibility to swap the entire storage backend (both the table layer and the column layer) for something independent of casacore. Like @sjperkins said, this could be Ceph, or it could be ADIOS 2.0, whose development I am currently participating in at ORNL. Then users can take advantage of their existing hardware without changing the upper-layer application code. For instance, SA can continue using Ceph, Australia can use something else like Nyriad storage, and if I need to showcase a pipeline on one of our DOE supercomputers, I can easily switch to a Lustre mode. I have briefly discussed this idea with @gervandiepen and we basically agree in principle. The problem is really how much resource we can put into realizing these ideas.
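To show how little application-side code such a backend needs, a minimal parallel ADIOS 2 write looks roughly like this. The variable and file names are arbitrary, and this is independent of casacore; the same code runs on whatever storage the site provides by switching the ADIOS 2 engine and configuration rather than the application:

```cpp
// Minimal parallel ADIOS 2 write: each rank contributes its own slice of a
// one-dimensional global array to a single shared output.
#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const std::size_t nLocal = 1000;               // rows owned by this rank
    std::vector<float> data(nLocal, float(rank));  // dummy payload

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("ms-column");
    auto var = io.DefineVariable<float>(
        "DATA",
        {static_cast<std::size_t>(size) * nLocal},  // global shape
        {static_cast<std::size_t>(rank) * nLocal},  // this rank's offset
        {nLocal});                                  // this rank's count

    adios2::Engine engine = io.Open("data.bp", adios2::Mode::Write);
    engine.BeginStep();
    engine.Put(var, data.data());
    engine.EndStep();
    engine.Close();

    MPI_Finalize();
    return 0;
}
```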

@JasonRuonanWang
Contributor Author

I think this issue can be closed, since we have added Adios2StMan to casacore and modified the casacore table layer to support truly parallel I/O. We have used this improvement in our full-scale SKA1 data processing workflow with simulated data and verified it on the entire Summit supercomputer, at the time the fastest machine in the world. Results can be found at https://www.computer.org/csdl/proceedings-article/sc/2020/999800a011/1oeORuLNgAw
