feat: Allow reading from stdin with schema inference #10
f63f110 to f3ce0c5
Sweet. Can you fix the clippy issues? Also, would it make sense to add this feature to the other tools as well?
f3ce0c5 to 47cdc08
Fixed the clippy warnings and refactored the error to be DRYer. Yep, would make sense to copy - but in that case it would be beneficial to factor the I don't have much experience with cargo, so maybe better if you or @lsh have a look at it?
Yeah, we could have a separate library. @lsh offered to take a look. For now, feel free to just have the same code copied so we can merge the pull request. Does that make sense or do you prefer to just support stdin for one tool until we make the library?
Looks good. Is there really no implementation of something like this in the standard library we can use?
One thing I would like to see is support for not having to explicitly say /dev/stdin. Could we just default to it when there is no other input file provided?
We may need to add explicit arguments to indicate what is output so that input can become optional. All of these should work
Asking the following for the sake of having a record of the decision. What is the desired behavior for:
i.e. do we just drop the stdin data?
Excellent - I was about to have a look at creating a
Not having to specify
Good question. A few options:
So for now my tendency would be to throw an error. Then maybe relax later.
I agree. Let's throw an error for now.
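For the record, here is one way the rule agreed above could look in code. It is only a sketch: `Input` and `resolve_input` are hypothetical names, not code from this pull request, and it assumes the ambiguous case is data piped on stdin while an input file is also named. The `stdin_is_piped` flag is likewise an assumption; it could be derived from `!std::io::stdin().is_terminal()`, although a non-terminal stdin does not always mean data was piped.

```rust
use std::path::PathBuf;

/// Where the tool should read from (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Input {
    Stdin,
    File(PathBuf),
}

/// Pick the input source from an optional path argument and a flag saying
/// whether stdin looks piped. Ambiguous or empty combinations are errors.
pub fn resolve_input(path: Option<&str>, stdin_is_piped: bool) -> Result<Input, String> {
    match (path, stdin_is_piped) {
        // Naming /dev/stdin explicitly keeps working as it does today.
        (Some("/dev/stdin"), _) => Ok(Input::Stdin),
        // Assumed ambiguous case: error out rather than silently drop stdin.
        (Some(p), true) => Err(format!(
            "input file `{p}` was given but data is also piped on stdin"
        )),
        (Some(p), false) => Ok(Input::File(PathBuf::from(p))),
        // No input file: default to stdin when something is piped in.
        (None, true) => Ok(Input::Stdin),
        (None, false) => Err("no input file given and nothing piped on stdin".to_string()),
    }
}
```

Starting with an error keeps the option open to relax the rule later without changing the behavior of commands that already work.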
Piped input does not support Seek out of the box
Seek is required to infer the schema

To work around this, we buffer the input iff input file
does not support seek
Only the number of lines actually used to infer the schema
are buffered to allow reading of files larger than memory

This works, because the arrow crate only seeks twice:
1. To check whether seek is supported at the start
2. To reset to the start of the file after schema inference

The seekable buffer wrapper is only used when necessary
There should be no performance penalty for currently supported
use cases

Use cases:
```sh
cat test.csv | csv2parquet /dev/stdin test.parquet
zstdcat test.csv.zst | csv2parquet /dev/stdin test.parquet
```

Resolves domoritz#3

feat: refactor SeekableReader into arrow-tools lib crate

Also refactor schema matching to make it less verbose
by using map_err instead of match, see json2parquet for before/after
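The buffering idea in the commit message can be sketched roughly as follows. This is not the `SeekableReader` from this pull request: `RewindableReader` is a hypothetical, simplified stand-in that records every byte read before the first rewind instead of counting lines. It also assumes the only seeks it has to serve are a position query and a rewind to offset 0.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical wrapper: remembers the bytes read so far so that a
/// non-seekable stream (such as a pipe) can be rewound to its start.
pub struct RewindableReader<R: Read> {
    inner: R,
    buffer: Vec<u8>, // bytes recorded before the first rewind
    pos: usize,      // logical position in the stream
    recording: bool, // false once the stream has been rewound
}

impl<R: Read> RewindableReader<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, buffer: Vec::new(), pos: 0, recording: true }
    }
}

impl<R: Read> Read for RewindableReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // After a rewind, replay the recorded bytes first.
        if self.pos < self.buffer.len() {
            let n = out.len().min(self.buffer.len() - self.pos);
            out[..n].copy_from_slice(&self.buffer[self.pos..self.pos + n]);
            self.pos += n;
            return Ok(n);
        }
        // Otherwise pull from the underlying stream, recording only
        // until the first rewind so memory use stays bounded.
        let n = self.inner.read(out)?;
        if self.recording {
            self.buffer.extend_from_slice(&out[..n]);
        }
        self.pos += n;
        Ok(n)
    }
}

impl<R: Read> Seek for RewindableReader<R> {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        match target {
            // Position query: nothing moves.
            SeekFrom::Current(0) => Ok(self.pos as u64),
            // Rewinding works only while everything read so far is buffered.
            SeekFrom::Start(0) if self.pos <= self.buffer.len() => {
                if self.pos > 0 {
                    self.recording = false;
                }
                self.pos = 0;
                Ok(0)
            }
            _ => Err(io::Error::new(
                io::ErrorKind::Unsupported,
                "only rewinding to the start is supported",
            )),
        }
    }
}
```

Because recording stops at the first rewind, only the bytes consumed up to that point stay in memory, and the rest of the input streams straight through.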
47cdc08 to b472f89
I force-pushed the work from #13 to here so that the comments stay in one place. Old commits/branches should still be archived on GitHub.
Great. Please update the docs and then we can merge this feature.
@corneliusroemer can you wrap up the pull request so that we can merge it?
@domoritz Sorry, yes! Can you resend the invitation - it expired after 7 days 🙃
Definitely. You can still push to this pull request even without the invitation.
@corneliusroemer can you wrap this up so we can merge and make a release? Or do you need help with anything?