
benchmarking index in different dbs #8

Open · gauteh wants to merge 5 commits into master
Conversation

gauteh (Owner) commented May 3, 2022

  • sqlite
  • heed
  • redis
  • sled
  • postgres?

Results on my laptop (sled is what is currently used in dars):

serialized size: 7359563
serialized size (x_wind_ml): 733527

test serde_db_heed::deserialize_meps_bincode                       ... bench:       7,856 ns/iter (+/- 42)
test serde_db_heed::deserialize_meps_bincode_only_read             ... bench:         105 ns/iter (+/- 0)
test serde_db_redis::deserialize_meps_bincode                      ... bench:  12,481,908 ns/iter (+/- 1,416,965)
test serde_db_redis::deserialize_meps_bincode_only_read            ... bench:   6,876,984 ns/iter (+/- 687,845)
test serde_db_redis::deserialize_meps_bincode_x_wind_ml            ... bench:     253,664 ns/iter (+/- 86,054)
test serde_db_sled::deserialize_meps_bincode_db_sled               ... bench:       7,999 ns/iter (+/- 121)
test serde_db_sled::deserialize_meps_bincode_only_read             ... bench:         197 ns/iter (+/- 1)
test serde_db_sqlite::deserialize_meps_bincode                     ... bench:   5,173,499 ns/iter (+/- 827,225)
test serde_db_sqlite::deserialize_meps_bincode_only_read           ... bench:   3,582,576 ns/iter (+/- 330,532)
test serde_db_sqlite::deserialize_meps_bincode_only_read_x_wind_ml ... bench:     143,666 ns/iter (+/- 43,035)

So the main cost is deserializing: reading and deserializing the full index from heed or sled takes about 8 µs per lookup, while the raw read alone is only 100–200 ns. That gives us some margin, since the total is currently dominated by deserialization.

One potentially useful optimization is to deserialize only the necessary datasets, but that requires splitting the datasets into separate db entries. Let's start with a single binary blob indexed on the lookup key (a minimal sketch follows below). Another interesting option is binary-layout for lower deserialization costs.
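A minimal sketch of the blob-plus-lookup-key approach, using sled and bincode as in the benchmarks above; the `Index` type and the key are made up for illustration and will differ from the real dars index:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical stand-in for the real dars dataset index.
#[derive(Serialize, Deserialize)]
struct Index {
    datasets: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db = sled::open("/tmp/dars-index")?;

    // Store the whole index as one binary blob, keyed on the dataset path.
    let idx = Index { datasets: vec!["x_wind_ml".into()] };
    db.insert("some/dataset.nc", bincode::serialize(&idx)?)?;

    // Lookup: the raw read is cheap (~100-200 ns in the benchmarks);
    // deserializing the full blob is what dominates the ~8 us per lookup.
    if let Some(raw) = db.get("some/dataset.nc")? {
        let idx: Index = bincode::deserialize(&raw)?;
        println!("{} datasets in index", idx.datasets.len());
    }

    Ok(())
}
```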

codecov-commenter commented May 4, 2022

Codecov Report

Merging #8 (5af91ff) into master (ac11b05) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master       #8   +/-   ##
=======================================
  Coverage   67.15%   67.15%           
=======================================
  Files          16       16           
  Lines         813      813           
=======================================
  Hits          546      546           
  Misses        267      267           


gauteh changed the title from "set up for benching index in different dbs" to "benchmarking index in different dbs" on May 4, 2022
gauteh (Owner) commented May 4, 2022

@magnusuMET: I think that if we are to use a distributed database, or even sqlite, we have to fetch only the relevant datasets (see the sketch at the end of this comment). And we might have to store the metadata closer as well (currently the metadata is kept in memory, but that requires reading the files on startup).

The benchmarks are somewhat flawed, because several of them rely on memory-mapped and cached variables, so these are best-case numbers. The difference between the full-deserialize and only-read versions is not real for e.g. redis; it is a cache artifact, and if one benchmark is turned off the other slows down correspondingly.
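A hedged sketch of what "fetch only the relevant datasets" could look like: one db entry per (file, variable) pair instead of one blob per file, so a query for x_wind_ml deserializes only that dataset's index. The key layout and the `DatasetIdx` type are invented for illustration:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical per-variable index entry; the real dars index differs.
#[derive(Serialize, Deserialize)]
struct DatasetIdx {
    chunks: Vec<(u64, u64)>, // (offset, size) pairs, for illustration
}

// Composite key "<file path>/<variable>", so a lookup touches only the
// requested dataset instead of the whole file index.
fn key(path: &str, var: &str) -> String {
    format!("{}/{}", path, var)
}

fn get_dataset(
    db: &sled::Db,
    path: &str,
    var: &str,
) -> Result<Option<DatasetIdx>, Box<dyn std::error::Error>> {
    Ok(match db.get(key(path, var))? {
        Some(raw) => Some(bincode::deserialize(&raw)?),
        None => None,
    })
}
```

The same key scheme would map naturally onto a redis key or an sqlite primary key if a distributed or external db is used.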

magnusuMET (Collaborator) commented

I would expect the index for a specific file to be created on demand. We could run a caching step on startup to ensure the most popular datasets are always available. Could that be a direction worth going in?

gauteh (Owner) commented May 4, 2022

Yeah, I guess that could work as well. It would be a critical point to get right for latency and performance, depending on use. I was hoping to keep as little logic as possible in the server, but I don't think this makes much of a difference: the server still needs to check the file's modified time against the cached index (a small sketch follows below). One open question is how to share the list of datasets and their locations between instances.
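For the modified-time check, something along these lines should be all the server needs (plain std::fs; where the cached timestamp comes from depends on how the index is stored):

```rust
use std::fs;
use std::io;
use std::time::SystemTime;

/// Returns true if an index built at `indexed_at` is still valid for the
/// file at `path`, i.e. the file has not been modified since.
fn index_is_fresh(path: &str, indexed_at: SystemTime) -> io::Result<bool> {
    let mtime = fs::metadata(path)?.modified()?;
    Ok(mtime <= indexed_at)
}
```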

gauteh (Owner) commented May 4, 2022

It takes about 29 ms to index a 1.5 GB file, but that will depend on how fast the disk is, the disk cache, etc.

gauteh (Owner) commented May 4, 2022

We still need to scan aggregated files and generate the DAS and DDS. I'm not sure how long those take, but they rely on rust-hdf5 and are therefore not multi-threaded, so they can't be run frequently.

magnusuMET (Collaborator) commented

I think we need to separate this into two parts: one concerning efficient serving of the data, the other concerning discovery.

When serving a dataset one should try to fetch the index from a db, fall back to disk, check the mtime of the file against the cached entry, and possibly update the cache for that dataset (expensive, blocking). Here a distributed in-memory (async?) db might be useful to avoid blocking on the chunk indexing, although the db would not need to be persistent or fully up to date (a sketch of this path follows after this comment).

Discovery needs to keep an up-to-date view of the filesystem to add or remove entries on the entry page. For that we could spawn a background thread that checks for changes every couple of minutes; maybe it should also create the DAS and DDS for all datasets?
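A rough sketch of that serving path, with hypothetical types (`Index`, `IndexStore`) standing in for the real ones: try the cache, verify it against the file's mtime, and only rebuild (expensive, blocking) on a miss or a stale entry:

```rust
use std::io;
use std::time::SystemTime;

// Hypothetical stand-ins for the real dars index and its store.
struct Index {
    built_at: SystemTime,
    // ... chunk offsets, shapes, etc.
}

trait IndexStore {
    fn get(&self, key: &str) -> Option<Index>;
    fn put(&self, key: &str, idx: &Index);
}

// Expensive, blocking: scan the HDF5 file and build the chunk index.
fn build_index_from_disk(path: &str) -> io::Result<Index> {
    let _ = path; // a real implementation reads the file here
    Ok(Index { built_at: SystemTime::now() })
}

fn serve_index(store: &dyn IndexStore, path: &str) -> io::Result<Index> {
    let mtime = std::fs::metadata(path)?.modified()?;

    match store.get(path) {
        // Cached entry is at least as new as the file: use it directly.
        Some(idx) if idx.built_at >= mtime => Ok(idx),
        // Missing or stale: rebuild from disk and refresh the cache.
        _ => {
            let idx = build_index_from_disk(path)?;
            store.put(path, &idx);
            Ok(idx)
        }
    }
}
```

The discovery part would then be a separate periodic task (or tool) that walks the filesystem and feeds new or changed files through the same rebuild step, possibly regenerating DAS and DDS at the same time.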

gauteh (Owner) commented May 4, 2022

Yes, I think you are right.

I have been thinking about splitting discovery out of the data-server, into either a separate scraper or even a different tool that can be used to insert or update datasets. At the moment there is no discovery; a previous version used inotify, but that doesn't work on NFS and is probably not very reliable with large numbers of files. Splitting it out should also make it easier to handle datasets that are stored in an object store and are discovered differently, or maybe even need to be registered.

At the risk of making things so complicated we never reach a functional level:

  • The data-server needs to handle discovering an out-of-date dataset (changed mtime, missing dataset): signal the server? Block while waiting for the update and return a delayed response?
  • A change in a dataset might require a cache purge in nginx if it sits in front of dars.
  • A change in a dataset also means that the DDS and DAS are out of date. If the change is discovered after the DAS/DDS responses but before the DODS response, the best we can do is return an error, but we don't track that. That may be an inherent limitation of the protocol.

gauteh (Owner) commented May 5, 2022

Maybe we should have a quick chat about this at some point? It will require some restructuring of the dars datasets.
