
Make the analysis storage chunk equal to checkpoint interval in SAMS #912

Merged
1 commit merged on Mar 12, 2018

Conversation

@Lnaden (Contributor) commented Mar 12, 2018

This is an optimization for read/write by using the `checkpoint_interval`
for the analysis file, and 1 for the checkpoint file.

Supersedes and closes #861 by merging into SAMS
Fixes #860
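
To make the mechanism concrete, here is a minimal, hypothetical sketch (not the actual MultiStateReporter code) of how per-file chunking can be expressed with netCDF4-python's `chunksizes` argument: the analysis file chunks the iteration dimension by `checkpoint_interval`, while the checkpoint file uses a chunk of 1 iteration. Variable and dimension names below are illustrative only.

```python
# Illustrative sketch only: variable/dimension names and sizes are made up.
import netCDF4 as nc

def create_energy_variable(dataset, n_states, chunk_iterations):
    """Create an (iteration, state) variable chunked along the iteration axis."""
    dataset.createDimension('iteration', None)  # unlimited dimension
    dataset.createDimension('state', n_states)
    return dataset.createVariable(
        'energies', 'f8', ('iteration', 'state'),
        zlib=True,
        chunksizes=(chunk_iterations, n_states),
    )

checkpoint_interval = 50  # illustrative value

# Analysis file: chunk the iteration dimension by the checkpoint interval.
with nc.Dataset('analysis.nc', 'w') as analysis:
    create_energy_variable(analysis, n_states=8, chunk_iterations=checkpoint_interval)

# Checkpoint file: chunk of 1, since checkpoints are written one iteration at a time.
with nc.Dataset('checkpoint.nc', 'w') as checkpoint:
    create_energy_variable(checkpoint, n_states=8, chunk_iterations=1)
```
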
@Lnaden (Contributor, Author) commented Mar 12, 2018

This might need some more refinement to prevent the MPI file locking introduced in hdf5 1.10.x. We cannot use libnetcdf 4.6.0 without also upgrading netcdf4, which in turn updates hdf5.
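
(For reference, and not something this PR implements: one commonly cited workaround, assuming hdf5 >= 1.10.1, is to disable HDF5 file locking through its environment variable before the storage file is opened. A sketch is below; whether this is acceptable for our MPI use case would need testing.)

```python
# Hypothetical workaround sketch: disable hdf5 1.10.x file locking through the
# HDF5_USE_FILE_LOCKING environment variable. It must be set before the HDF5
# library initializes, i.e. before netCDF4 opens any file.
import os
os.environ.setdefault('HDF5_USE_FILE_LOCKING', 'FALSE')

import netCDF4  # imported only after the environment variable is set
```
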

@Lnaden (Contributor, Author) commented Mar 12, 2018

Adding discussion from slack conversation for documentation.

From @jchodera:
I'm wondering if we want to instead set the chunking interval to the number of iterations.
Consider this:

  • If you're intending to run a 10,000 iteration simulation, then you will need space to store that data anyway.
  • If you're using switch_experiment_every: 500, then the number of iterations will appear to be 500 when the storage file is created (I think?) so this chunking interval will be used instead.
  • In either case, the larger the chunking interval, presumably the more efficient writing and reading will be. We can do some timing (with repex for writing and analysis for reading) to find out.

We can still accept this PR, of course. It will likely be much better than no chunking!

From @Lnaden

if we want to instead set the chunking interval to the number of iterations.

I don’t think we should do that; I’m worried we would run into a memory problem at that point if we are not careful. I do think the checkpoint_interval is a bit too small, but I did not know what the correct chunk size should be. The guides I have read have all said something along the lines of…

The chunk size should be roughly equal to the size the data are manipulated in

There are no hard and fast rules, sadly. Chunk size is a trade-off: larger chunks speed up access to large blocks of data but slow down access to small pieces of it.

Maybe we should all sit down and brainstorm some more about what the correct chunk size is for the iteration dimension. Either way, it’s now a single variable we can tune in the MultiStateReporter class.

I’ll merge the PR anyway and we can refine it later as needed.
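
As a rough illustration of the timing @jchodera suggests above (repex-style sequential writes versus whole-trajectory reads), here is a hypothetical micro-benchmark sketch. It is not the actual repex or analysis code, and all file names and sizes are made up.

```python
# Illustrative micro-benchmark: how the iteration chunk size affects
# one-iteration-at-a-time writes and a whole-trajectory read.
import os
import time
import numpy as np
import netCDF4 as nc

def time_chunksize(chunk_iterations, n_iterations=2000, n_states=8):
    path = 'chunk_test.nc'
    with nc.Dataset(path, 'w') as ds:
        ds.createDimension('iteration', None)
        ds.createDimension('state', n_states)
        var = ds.createVariable('energies', 'f8', ('iteration', 'state'),
                                chunksizes=(chunk_iterations, n_states))
        start = time.time()
        for i in range(n_iterations):        # sequential, repex-style writes
            var[i, :] = np.random.rand(n_states)
        write_time = time.time() - start
    with nc.Dataset(path) as ds:
        start = time.time()
        _ = ds.variables['energies'][:, :]   # analysis-style whole read
        read_time = time.time() - start
    os.remove(path)
    return write_time, read_time

for chunk in (1, 50, 500):
    print(chunk, time_chunksize(chunk))
```
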

@Lnaden (Contributor, Author) commented Mar 12, 2018

@jchodera is correct in that the memory argument is not a good one. It is not memory limited.

Assuming chunksize of all iterations, there are a few problems I see:

  • Resuming would result in us having to load the entire simulation, which may be just as slow as the current scheme.
  • If someone sets up an effectively infinite simulation that is stopped by the online analysis check, we would need to come up with the logic for handling this.

Also, very large chunk sizes will slow down all write actions, since the whole dataset must be loaded for each write access. See the HDF Group's whitepaper on overly large chunk sizes (a good read):

Chunks are too large
It may be tempting to simply set the chunk size to be the same as the dataset size in order to enable compression on a contiguous dataset. However, this can have unintended consequences. Because the entire chunk must be read from disk and decompressed before performing any operations, this will impose a great performance penalty when operating on a small subset of the dataset if the cache is not large enough to hold the one-chunk dataset. In addition, if the dataset is large enough, since the entire chunk must be held in memory while compressing and decompressing, the operation could cause the operating system to page memory to disk, slowing down the entire system.

Paper can be found here

@jchodera (Member):

Great find! Ok, let's go with this for now!

@Lnaden (Contributor, Author) commented Mar 12, 2018

We'll want to optimize this some more. People with large checkpoint intervals may slow down their write speeds at runtime, so maybe set a maximum chunk size, just as an idea.
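
As a purely hypothetical illustration of that idea (not code from this PR), the iteration chunk could be derived from `checkpoint_interval` but capped at some maximum; the constant and function names below are made up.

```python
# Hypothetical sketch of capping the iteration chunk size; not part of
# MultiStateReporter.
MAX_ITERATION_CHUNK = 100  # illustrative cap, to be tuned by benchmarking

def iteration_chunk_size(checkpoint_interval):
    """Use the checkpoint interval as the chunk size, but never exceed the cap."""
    return max(1, min(checkpoint_interval, MAX_ITERATION_CHUNK))
```
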
