
Data-discovery and index #19

Open
gauteh opened this issue May 2, 2022 · 1 comment
Labels: enhancement, help wanted, question


gauteh commented May 2, 2022

In gauteh/hidefix#8 a couple of different DBs have been benchmarked. Deserializing the full index of a large file (4 GB) takes about 8 µs (on my laptop); the serialized index is about 8 MB and takes about 100-150 ns to read from memory-mapped local databases (sled, heed). Reading the same 8 MB blob from redis, sqlite or similar takes about 3 to 6 ms, which is maybe a bit too high. It would be interesting to also try postgres.
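For reference, a minimal sketch of the path being benchmarked: read an ~8 MB serialized index from an embedded store and deserialize it. The `Index` struct and key scheme below are placeholders, and it assumes the hidefix index is serde-serializable; sled and bincode stand in for whichever store/format is used.

```rust
// Minimal sketch of the benchmarked path: read a serialized index blob from an
// embedded key-value store and deserialize it.
use std::time::Instant;

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Index {
    // placeholder for chunk offsets, shapes, etc.
    chunks: Vec<(u64, u64)>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db = sled::open("/tmp/index-db")?;

    // Registration: serialize the index and store it under the dataset path.
    let idx = Index { chunks: vec![(0, 4096); 1_000] };
    db.insert("datasets/large-file.nc", bincode::serialize(&idx)?)?;

    // Request hot path: read the blob and deserialize it, timed.
    let start = Instant::now();
    let bytes = db.get("datasets/large-file.nc")?.expect("index not found");
    let _idx: Index = bincode::deserialize(&bytes)?;
    println!("read + deserialize took {:?}", start.elapsed());

    Ok(())
}
```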

  1. We need to keep data-discovery and dataset removal/update in mind:
  • I think datasets should be registered, not auto-discovered by the data-server: the registration could be run by another dedicated service that auto-detects/scrapes sources.
  • When a data-file turns out to be missing, or its mtime has changed, we return an error, possibly notifying the scraper-service (see the sketch after this list).
  2. I think we have to assume internal network latency is OK; I don't see how we can do much about that, except keeping communication to a minimum.
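As a rough illustration of that mtime check, a minimal sketch; the `DatasetError` type and the stored registration mtime are assumptions, not an existing API:

```rust
// Validate a registered dataset against the file system: error out (and let the
// caller notify the scraper service) if the file is missing or its mtime no
// longer matches the one recorded at registration time.
use std::fs;
use std::io::ErrorKind;
use std::path::Path;
use std::time::SystemTime;

#[derive(Debug)]
enum DatasetError {
    Missing,
    Modified,
    Io(std::io::Error),
}

fn validate_dataset(path: &Path, registered_mtime: SystemTime) -> Result<(), DatasetError> {
    let meta = match fs::metadata(path) {
        Ok(m) => m,
        Err(e) if e.kind() == ErrorKind::NotFound => return Err(DatasetError::Missing),
        Err(e) => return Err(DatasetError::Io(e)),
    };

    if meta.modified().map_err(DatasetError::Io)? != registered_mtime {
        return Err(DatasetError::Modified);
    }

    Ok(())
}
```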

A solution could be:

  • Keep a central DB with the index, DAS, DDS and list of datasets. This could be an SQL server or whatever; it is only written to by the scraper.
  • Each worker has a local cache of datasets (index, DAS, DDS), e.g. heed, or maybe even just in-memory. Rather than verifying against the central DB that a dataset still exists, the worker checks the mtime of the source on each request; if the mtime has changed, it updates its cache from the server (see the sketch after this list). In the case of NCML aggregates such changes will not be discovered this way.
  • When the central DB is changed, cache clearing is triggered at the workers. Retrieving new data from the central server is pretty cheap. This will handle NCML changes.
  • This will make it possible to extend to cloud data-sources, since the central DB would then point to e.g. an S3 URL.
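A sketch of the per-worker lookup described above, using a plain in-memory map for the local cache; the types and the `fetch_from_central_db` call are hypothetical:

```rust
// Serve a dataset from the local cache when the source mtime still matches,
// otherwise refresh the entry from the central database.
use std::collections::HashMap;
use std::time::SystemTime;

struct CachedDataset {
    mtime: SystemTime,
    index: Vec<u8>, // serialized hidefix index
    das: String,
    dds: String,
}

struct Worker {
    cache: HashMap<String, CachedDataset>,
}

impl Worker {
    fn dataset(&mut self, path: &str) -> std::io::Result<&CachedDataset> {
        let mtime = std::fs::metadata(path)?.modified()?;

        // Refresh the cache entry if it is missing or stale (mtime changed).
        let stale = self.cache.get(path).map_or(true, |c| c.mtime != mtime);
        if stale {
            let entry = fetch_from_central_db(path, mtime)?; // hypothetical RPC/SQL call
            self.cache.insert(path.to_string(), entry);
        }

        Ok(self.cache.get(path).unwrap())
    }
}

fn fetch_from_central_db(_path: &str, mtime: SystemTime) -> std::io::Result<CachedDataset> {
    // Placeholder: in the real setup this would query the central index/DAS/DDS store.
    Ok(CachedDataset { mtime, index: Vec::new(), das: String::new(), dds: String::new() })
}
```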

Unfortunately this complicates things significantly, but I don't see how to avoid it when scaling up. It would be nice to still support a stand-alone server that does not need a central db, but just caches locally and discovers datasets itself in some way. That would make it significantly easier to test the server out.

Some reasons:

  • Storing the full index of all datasets on every worker takes a lot of space and needs to be kept in sync.
  • Keeping the index on a network disk is probably too slow, and embedded databases like SQLite are still too slow, so a memory-mapped DB is needed anyway.
  • Indexing on-demand is too slow, especially for aggregated datasets.

Since data is usually on network disks, caching the data itself could possibly be done using a large file-system cache or maybe something like https://docs.rs/freqfs/latest/freqfs/index.html.

@magnusuMET

gauteh changed the title from "store index in database" to "Data-discovery and index" on Jul 11, 2022
gauteh added the enhancement, help wanted and question labels on Jul 11, 2022

gauteh commented Jan 9, 2023

Indexing is now usually so fast (2-300 ms) that it is not necessary to keep a full cache of it locally. To keep discovery fast, it is probably best not to pre-compute the DAS and DDS responses either. I think a good solution could be:

  • A central database server with:
    • List of datasets
    • Computed DAS and DDS (optionally, otherwise inserted on first access)
    • Stored hidefix index blob
    • Aggregated coordinate dimension
  • Each server has a fairly large local memory-mapped (or just in-memory, if there is enough RAM) LRU cache of hidefix indexes and DAS/DDS responses; on a miss, these are fetched from or computed on the database server.

Then we are less dependent on the latency to the database server, which seems to be in the 100s of ms range on a Kubernetes cluster, but we can still use a standard setup.
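A rough sketch of that layout, with one record per dataset in the central database and a bounded worker-side LRU cache; the field types, the `lru` crate usage and the fetch function are assumptions, not an existing API:

```rust
// One record per dataset in the central database, plus a bounded worker-side
// LRU cache of indexes and DAS/DDS responses.
use std::num::NonZeroUsize;

use lru::LruCache;

struct DatasetRecord {
    name: String,
    das: Option<String>,   // optionally pre-computed, otherwise filled on first access
    dds: Option<String>,
    index: Vec<u8>,        // serialized hidefix index blob
    coordinates: Vec<f64>, // aggregated coordinate dimension
}

struct WorkerCache {
    datasets: LruCache<String, DatasetRecord>,
}

impl WorkerCache {
    fn new(capacity: usize) -> Self {
        WorkerCache {
            datasets: LruCache::new(NonZeroUsize::new(capacity).unwrap()),
        }
    }

    /// Return a cached record, falling back to the central database on a miss.
    fn get_or_fetch(&mut self, name: &str) -> Option<&DatasetRecord> {
        if !self.datasets.contains(name) {
            let record = fetch_record_from_central_db(name)?; // hypothetical query
            self.datasets.put(name.to_string(), record);
        }
        self.datasets.get(name)
    }
}

fn fetch_record_from_central_db(_name: &str) -> Option<DatasetRecord> {
    // Placeholder for the real database query (SQL, gRPC, ...).
    None
}
```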
