For large batches, it would be convenient to be able to compress and transfer the packed data in chunks. To facilitate this, we need to encode sets of Documents files (`msgpack` or `jsonl` files) that go with sets of external files, so that if I have Documents up to chunk N and external files up to chunk N, I have exactly the external files those Documents reference---no more and no less.
This has not been implemented, but it has been designed in detail in a conversation between myself and @tacaswell, documented here.
We can keep the directory structure as it is now: just one directory, plus a sub-directory structure under `external_files` if the `--copy-external` flag is set. Instead of writing one `documents_manifest.txt` and one `external_files_manifest_<root_hash>.txt` per root, we can write out N manifests for each: a `documents_manifest_<i>.txt` per chunk and an `external_files_manifest_<root_hash>_<i>.txt` per chunk and root. A given external file should only be listed once globally in the whole set of `external_files_manifest_<root_hash>_<i>.txt` files. If a file is referenced by a Run in chunk x and a Run in chunk y > x, it should only be listed in the manifest for chunk x.
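As a rough sketch of the bookkeeping this implies, assuming each Run can be reduced to the `(root_hash, filename)` pairs it references (the helper names and data layout below are hypothetical, not existing databroker-pack API):

```python
from collections import defaultdict


def assign_external_files(chunks):
    """Map each external file to the first chunk whose Runs reference it.

    ``chunks`` is a list of chunks; each chunk is an iterable of Runs, and
    each Run is represented here as an iterable of (root_hash, filename)
    pairs for the external files it references.
    """
    first_chunk = {}  # (root_hash, filename) -> index of first referencing chunk
    for i, runs in enumerate(chunks):
        for run in runs:
            for root_hash, filename in run:
                key = (root_hash, filename)
                if key not in first_chunk:  # only the first referencing chunk lists it
                    first_chunk[key] = i

    # Group into one manifest per (chunk, root): the contents of
    # external_files_manifest_<root_hash>_<i>.txt.
    manifests = defaultdict(list)
    for (root_hash, filename), i in first_chunk.items():
        manifests[(i, root_hash)].append(filename)
    return manifests


def write_manifests(manifests, directory="."):
    # Write each per-chunk, per-root manifest; sort filenames for reproducibility.
    for (i, root_hash), filenames in manifests.items():
        path = f"{directory}/external_files_manifest_{root_hash}_{i}.txt"
        with open(path, "w") as f:
            f.write("\n".join(sorted(filenames)) + "\n")
```

Because each file is assigned to the first chunk that needs it, the union of manifests 1 through N lists exactly the external files referenced by Documents chunks 1 through N, which is the invariant described above.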
The user can specify the chunk size in terms of number of Runs (given as an integer) or max byte size (given as a string like `10000000B` or `10MB`).
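One way that option could be interpreted (this parser and its unit table are illustrative only, not an existing databroker-pack interface):

```python
import re

# A bare integer means "N Runs per chunk"; a suffixed string means
# "maximum bytes per chunk".
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_chunk_size(value):
    """Return ('runs', n) or ('bytes', n)."""
    value = value.strip()
    if value.isdigit():
        return ("runs", int(value))
    match = re.fullmatch(r"(\d+)\s*([KMGT]?B)", value, re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized chunk size: {value!r}")
    number, unit = match.groups()
    return ("bytes", int(number) * _UNITS[unit.upper()])


# parse_chunk_size("500")      -> ('runs', 500)
# parse_chunk_size("10000000B") -> ('bytes', 10000000)
# parse_chunk_size("10MB")      -> ('bytes', 10000000)
```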
The chunking and compression can be done separately, downstream. Only the first chunk should contain the `catalog.yml`. The chunks can be "un-tarred" into the same directory, as they will have no conflicting files. We can also incorporate optional tarring and compression into `databroker-pack` itself, but it needs to be possible to do it outside the tool for use cases where the large file transfer is being handled separately.
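For the downstream case, reassembly could be as simple as extracting every chunk archive into one directory, e.g. with Python's `tarfile` (the archive names below are made up for illustration):

```python
import tarfile
from pathlib import Path


def untar_chunks(archives, destination):
    """Extract every chunk archive into the same directory.

    Since the chunks contain no conflicting paths (and only the first
    chunk carries catalog.yml), extracting them in any order reassembles
    the complete pack.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    for archive in archives:
        with tarfile.open(archive) as tar:  # compression auto-detected on read
            tar.extractall(path=destination)


# untar_chunks(["pack_chunk_0.tar.gz", "pack_chunk_1.tar.gz"], "my_pack")
```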