[core] implement extractor and shift to jsonl as internal format #33

Closed
6 of 7 tasks
rudolfix opened this issue Jun 23, 2022 · 0 comments

rudolfix commented Jun 23, 2022

Implement an extractor that will replace the ad hoc implementation in the Pipeline:

  • stream processing: never loads more than one iteration into memory at a time
  • stores data in jsonl (as a consequence of streaming), which will allow the unpacker to follow the same pattern
  • uses a thread pool for deferred iterators
  • generates a load_id that is preserved until the data is loaded
  • may drop the event count from the file name
  • optional: extract_many to extract many iterators in parallel
  • moves whole folders to the unpacker to be more atomic
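
A rough sketch of how the points above could fit together, not the actual implementation. The names `extract`, `resolve`, `extract_dir`, `unpack_dir` and the folder layout are hypothetical, and a "deferred" batch is assumed here to be a zero-argument callable. Note that the thread pool keeps a small bounded window of batches in flight, so this variant relaxes "one iteration at a time" to "a few iterations at a time".

```python
import json
import os
import uuid
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, Iterator, List, Union

Batch = List[Dict[str, Any]]
# Assumption: a deferred batch is a callable that returns the batch when invoked.
MaybeDeferred = Union[Batch, Callable[[], Batch]]


def resolve(batches: Iterable[MaybeDeferred], pool: ThreadPoolExecutor,
            window: int = 4) -> Iterator[Batch]:
    """Yield batches in order; deferred ones are evaluated in the thread pool.

    At most `window` batches are pending at any time, which bounds memory use.
    """
    pending: deque = deque()

    def pop() -> Batch:
        head = pending.popleft()
        return head.result() if isinstance(head, Future) else head

    for batch in batches:
        pending.append(pool.submit(batch) if callable(batch) else batch)
        if len(pending) >= window:
            yield pop()
    while pending:
        yield pop()


def extract(batches: Iterable[MaybeDeferred], table_name: str,
            extract_dir: str, unpack_dir: str, workers: int = 4) -> str:
    """Stream batches into a jsonl file, then hand the whole folder over."""
    # The load_id names the folder, so it travels with the data until loading.
    load_id = uuid.uuid4().hex
    staging = os.path.join(extract_dir, load_id)
    os.makedirs(staging)

    # The file name carries no event count, only the (hypothetical) table name.
    path = os.path.join(staging, f"{table_name}.jsonl")
    with ThreadPoolExecutor(max_workers=workers) as pool, \
            open(path, "w", encoding="utf-8") as f:
        for batch in resolve(batches, pool):
            for item in batch:
                # One JSON document per line: items are appended as they arrive.
                f.write(json.dumps(item) + "\n")

    # Renaming a folder is a single operation when source and destination are
    # on the same filesystem, so the unpacker never sees a half-written load.
    os.makedirs(unpack_dir, exist_ok=True)
    os.rename(staging, os.path.join(unpack_dir, load_id))
    return load_id
```

Writing jsonl is what makes the streaming possible: each item is appended as its own line, so neither the extractor nor a later reader has to hold a complete JSON array in memory.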

Implementation

@rudolfix rudolfix self-assigned this Jun 23, 2022