
Make the analysis storage chunk equal to checkpoint interval in SAMS #912

Merged
1 commit merged on Mar 12, 2018

Conversation

@Lnaden (Contributor) commented Mar 12, 2018

This is an optimization for read/write by using the `checkpoint_interval`
for the analysis file, and 1 for the checkpoint file.

Supersedes and closes #861 by merging into SAMS
Fixes #860
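
To make the mechanism concrete, here is a minimal, hypothetical sketch (not the actual MultiStateReporter code) of how per-file chunking can be expressed with netCDF4-python's `chunksizes` argument: the analysis file chunks the iteration dimension by `checkpoint_interval`, while the checkpoint file uses a chunk of 1 iteration. Variable and dimension names below are illustrative only.

```python
# Illustrative sketch only: variable/dimension names and sizes are made up.
import netCDF4 as nc

def create_energy_variable(dataset, n_states, chunk_iterations):
    """Create an (iteration, state) variable chunked along the iteration axis."""
    dataset.createDimension('iteration', None)  # unlimited dimension
    dataset.createDimension('state', n_states)
    return dataset.createVariable(
        'energies', 'f8', ('iteration', 'state'),
        zlib=True,
        chunksizes=(chunk_iterations, n_states),
    )

checkpoint_interval = 50  # illustrative value

# Analysis file: chunk the iteration dimension by the checkpoint interval.
with nc.Dataset('analysis.nc', 'w') as analysis:
    create_energy_variable(analysis, n_states=8, chunk_iterations=checkpoint_interval)

# Checkpoint file: chunk of 1, since checkpoints are written one iteration at a time.
with nc.Dataset('checkpoint.nc', 'w') as checkpoint:
    create_energy_variable(checkpoint, n_states=8, chunk_iterations=1)
```
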
@Lnaden (Contributor, Author) commented Mar 12, 2018

This might need some more refinement to prevent the MPI file locking introduced in hdf5 1.10.x. We cannot use libnetcdf 4.6.0 without also upgrading netcdf4, which in turn updates hdf5.
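
(For reference, and not something this PR implements: one commonly cited workaround, assuming hdf5 >= 1.10.1, is to disable HDF5 file locking through its environment variable before the storage file is opened. A sketch is below; whether this is acceptable for our MPI use case would need testing.)

```python
# Hypothetical workaround sketch: disable hdf5 1.10.x file locking through the
# HDF5_USE_FILE_LOCKING environment variable. It must be set before the HDF5
# library initializes, i.e. before netCDF4 opens any file.
import os
os.environ.setdefault('HDF5_USE_FILE_LOCKING', 'FALSE')

import netCDF4  # imported only after the environment variable is set
```
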

@Lnaden (Contributor, Author) commented Mar 12, 2018

Adding discussion from slack conversation for documentation.

From @jchodera:
I'm wondering if we want to instead set the chunking interval to the number of iterations.
Consider this:

  • If you're intending to run a 10,000 iteration simulation, then you will need space to store that data anyway.
  • If you're using switch_experiment_every: 500, then the number of iterations will appear to be 500 when the storage file is created (I think?) so this chunking interval will be used instead.
  • In either case, the larger the chunking interval, presumably the more efficient writing and reading will be. We can do some timing (with repex for writing and analysis for reading) to find out.

We can still accept this PR, of course. It will likely be much better than no chunking!

From @Lnaden

if we want to instead set the chunking interval to the number of iterations.

I don’t think we should do that; I’m worried we would run into a memory problem at that point if we are not careful. I do think the checkpoint_interval is a bit too small, but I did not know what the correct chunk size should be. The guides I have read have all said something along the lines of…

The chunk size should be roughly equal to the size the data are manipulated in

There are no hard and fast rules, sadly. Chunk size is a trade-off: larger chunks speed up access to large blocks of data but slow down access to small pieces of it.

Maybe we should all sit down and brainstorm some more about what the correct chunk size is for the iteration dimension. Either way, it’s now a single variable we can tune in the MultiStateReporter class.

I’ll merge the PR anyway and we can refine it later as needed.
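
As a rough illustration of the timing @jchodera suggests above (repex-style sequential writes versus whole-trajectory reads), here is a hypothetical micro-benchmark sketch. It is not the actual repex or analysis code, and all file names and sizes are made up.

```python
# Illustrative micro-benchmark: how the iteration chunk size affects
# one-iteration-at-a-time writes and a whole-trajectory read.
import os
import time
import numpy as np
import netCDF4 as nc

def time_chunksize(chunk_iterations, n_iterations=2000, n_states=8):
    path = 'chunk_test.nc'
    with nc.Dataset(path, 'w') as ds:
        ds.createDimension('iteration', None)
        ds.createDimension('state', n_states)
        var = ds.createVariable('energies', 'f8', ('iteration', 'state'),
                                chunksizes=(chunk_iterations, n_states))
        start = time.time()
        for i in range(n_iterations):        # sequential, repex-style writes
            var[i, :] = np.random.rand(n_states)
        write_time = time.time() - start
    with nc.Dataset(path) as ds:
        start = time.time()
        _ = ds.variables['energies'][:, :]   # analysis-style whole read
        read_time = time.time() - start
    os.remove(path)
    return write_time, read_time

for chunk in (1, 50, 500):
    print(chunk, time_chunksize(chunk))
```
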

@Lnaden (Contributor, Author) commented Mar 12, 2018

@jchodera is correct in that the memory argument is not a good one. It is not memory limited.

Assuming chunksize of all iterations, there are a few problems I see:

  • Resuming would result in us having to load the entire simulation, which may be just as slow as the current scheme.
  • If someone sets up an effectively infinite simulation that is stopped by the online analysis check, we would need to come up with the logic for handling this.

Also, very large chunk sizes will slow down all write actions, since the whole dataset must be loaded for each write access. See the HDF Group's whitepaper on overly large chunk sizes (a good read):

Chunks are too large
It may be tempting to simply set the chunk size to be the same as the dataset size in order to enable compression on a contiguous dataset. However, this can have unintended consequences. Because the entire chunk must be read from disk and decompressed before performing any operations, this will impose a great performance penalty when operating on a small subset of the dataset if the cache is not large enough to hold the one-chunk dataset. In addition, if the dataset is large enough, since the entire chunk must be held in memory while compressing and decompressing, the operation could cause the operating system to page memory to disk, slowing down the entire system.

Paper can be found here

@jchodera (Member):

Great find! Ok, let's go with this for now!

@Lnaden (Contributor, Author) commented Mar 12, 2018

We'll want to optimize this some more. People with large checkpoint intervals may slow down their write speeds at runtime, so maybe set a maximum chunk size, just as an idea.
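
As a purely hypothetical illustration of that idea (not code from this PR), the iteration chunk could be derived from `checkpoint_interval` but capped at some maximum; the constant and function names below are made up.

```python
# Hypothetical sketch of capping the iteration chunk size; not part of
# MultiStateReporter.
MAX_ITERATION_CHUNK = 100  # illustrative cap, to be tuned by benchmarking

def iteration_chunk_size(checkpoint_interval):
    """Use the checkpoint interval as the chunk size, but never exceed the cap."""
    return max(1, min(checkpoint_interval, MAX_ITERATION_CHUNK))
```
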
