
Add support for chunking #6

Open
danielballan opened this issue Apr 6, 2020 · 1 comment

Comments

@danielballan (Member)

For large batches, it would be convenient to be able to compress and transfer the packed data in chunks. To facilitate this, we need to encode which sets of Documents (msgpack or jsonl files) go with which sets of external files, so that if I have chunks up through N of the Documents and chunks up through N of the external files they reference, I have exactly the external files I need---no more, no less.

This has not been implemented, but it has been designed in detail in a conversation between @tacaswell and me, documented here.


We can keep the directory structure as it is now: just one directory, plus a sub-directory structure under external_files if the --copy-external flag is set. Instead of writing one documents_manifest.txt and one external_files_manifest_<root_hash>.txt per root, we can write N manifests for each: a documents_manifest_<i>.txt per chunk, and an external_files_manifest_<root_hash>_<i>.txt per chunk and root. A given external file should be listed only once globally across the whole set of external_files_manifest_<root_hash>_<i>.txt files: if a file is referenced by a Run in chunk x and by a Run in chunk y > x, it should be listed only in the manifest for chunk x.
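For illustration, here is a minimal sketch of how the per-chunk manifests could be written with global de-duplication of external files. The function name, the `chunks` iterable (pairs of document paths and external-file paths per chunk), and the exact filename pattern are assumptions for the sketch, not the databroker-pack API:

```python
from pathlib import Path

def write_chunked_manifests(directory, chunks, root_hash):
    """Write one documents manifest and (at most) one external-files
    manifest per chunk, listing each external file exactly once globally.
    (Hypothetical sketch, not the actual databroker-pack implementation.)"""
    directory = Path(directory)
    seen = set()  # external files already claimed by an earlier chunk
    for i, (document_paths, external_file_paths) in enumerate(chunks):
        docs_manifest = directory / f"documents_manifest_{i}.txt"
        docs_manifest.write_text("\n".join(document_paths) + "\n")
        # Only list files not referenced by an earlier chunk, so that a
        # file shared by chunks x and y > x appears only in chunk x.
        new_files = [p for p in external_file_paths if p not in seen]
        seen.update(new_files)
        if new_files:
            ext_manifest = directory / f"external_files_manifest_{root_hash}_{i}.txt"
            ext_manifest.write_text("\n".join(new_files) + "\n")
```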

The user can specify the chunk size in terms of number of Runs (given as an integer) or max byte size (given as a string like 10000000B or 10MB).
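A rough sketch of how such a spec might be parsed (the accepted units and the return convention are assumptions, not an implemented interface):

```python
import re

def parse_chunk_size(spec):
    """Return ("runs", n) for an integer spec, or ("bytes", n) for a
    string spec like "10000000B" or "10MB". (Sketch only.)"""
    if isinstance(spec, int):
        return ("runs", spec)
    match = re.fullmatch(r"(\d+)\s*(B|KB|MB|GB)", spec.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"Could not parse chunk size: {spec!r}")
    number, unit = int(match.group(1)), match.group(2).upper()
    multiplier = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}[unit]
    return ("bytes", number * multiplier)
```

For example, parse_chunk_size(100) would mean 100 Runs per chunk, while parse_chunk_size("10MB") would mean at most 10,000,000 bytes per chunk.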

The chunking and compression can be done separately, downstream. Only the first chunk should contain the catalog.yml. The chunks can be "un-tarred" into the same directory, as they will have no conflicting files. We could also incorporate optional tarring and compression into databroker-pack itself, but it needs to be possible to do it outside the tool for use cases where the large file transfer is handled separately.
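To show that the downstream step needs nothing from the tool itself, here is a sketch of tarring one chunk with only the standard library. It assumes the chunk's file list has already been read from the per-chunk manifests; the archive naming is also an assumption:

```python
import tarfile
from pathlib import Path

def tar_chunk(directory, chunk_index, file_paths):
    """Tar and compress one chunk's files. (Hypothetical sketch.)"""
    directory = Path(directory)
    if chunk_index == 0:
        # Only the first chunk carries the catalog.
        file_paths = ["catalog.yml", *file_paths]
    archive = directory / f"chunk_{chunk_index}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for relative_path in file_paths:
            # Store paths relative to the pack directory so all chunks
            # can later be un-tarred into the same directory without
            # conflicting files.
            tar.add(directory / relative_path, arcname=relative_path)
    return archive
```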

@danielballan (Member, Author)

Awaiting bluesky/suitcase-utils#39
