Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Allow reading from stdin with schema inference #10

Merged
merged 2 commits into from
Apr 12, 2023

Conversation

corneliusroemer
Copy link
Collaborator

@corneliusroemer corneliusroemer commented Mar 4, 2023

Piped input does not support Seek out of the box
Seek is required to infer the schema
To work around this, we buffer the input iff input file
does not support seek
Only the number of lines actually used to infer the schema
are buffered to allow reading of files larger than memory
This works, because the arrow crate only seeks twice:

  1. To check whether seek is supported at the start
  2. To reset to the start of the file after schem inference

The seekable buffer wrapper is only used when necessary

There should be no performance penalty for currently supported
use cases

Use cases:

cat test.csv | csv2parquet /dev/stdin test.parquet
zstdcat test.csv.zst | csv2parquet /dev/stdin test.parquet

Resolves #3

Copy link
Owner

@domoritz domoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet. Can you fix the chipotle issues? Also, would it make sense to add this feature to the other tools as well?

@corneliusroemer
Copy link
Collaborator Author

Fixed the clippy warnings and refactored the error to be dryer.

Yep, would make sense to copy - but in that case it would be beneficial to factor the SeekableReader functionality into a separate lib crate so as to not repeat ourselves 4 times.

I don't have much experience with cargo, so maybe better if you or @lsh have a look at it?

@domoritz
Copy link
Owner

domoritz commented Mar 4, 2023

Yeah, we could have a separate library. @lsh offered to take a look. For now, feel free to just have the same code copied so we can merge the pull request. Does that make sense or do you prefer to just support stdin for one tool until we make the library?

Copy link
Owner

@domoritz domoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Is there really no implementation of something like this in the standard library we can use?

One thing I would like to see is support for not having to explicitly say /dev/stdin. Could we just default to it when there is no other input file provided?

@lsh
Copy link
Collaborator

lsh commented Mar 5, 2023

@domoritz Maybe the atty crate could help here?

@domoritz
Copy link
Owner

domoritz commented Mar 5, 2023

We may need to add explicit arguments to indicate what is output so that input can become optional. All of these should work

cat test.csv | csv2parquet /dev/stdin -o test.parquet
cat test.csv | csv2parquet -o test.parquet
cat test.csv | csv2parquet # to stdout

@lsh
Copy link
Collaborator

lsh commented Mar 5, 2023

All of these should work

cat test.csv | csv2parquet -o test.parquet

Asking the following for the sake of having a record of the decision.

What is the desired behavior for:

cat test.csv | csv2parquet random_file.csv -o test.parquet

i.e. do we just drop the stdin data?

@corneliusroemer
Copy link
Collaborator Author

Excellent - I was about to have a look at creating a lib.rs crate for shared functionality. Will review what you got up to @lsh. I also put the code up for Code Review on Stack Exchange Code Review, so may propose some changes from that.

Is there really no implementation of something like this in the standard library we can use?
Not that I'm aware of. When you tell people you want to make stdin seekable they suggest you don't. The reason we need to do this is because of upstream crates not having thought about the stdin use case - which is a bit of an edge case.

Not having to specify /dev/stdin and out-file every time is a great idea. I was about to suggest that when I played with "print schema only" and got annoyed that I had to pass an outfile that was never used.

What is the desired behavior for:
cat test.csv | csv2parquet random_file.csv -o test.parquet
i.e. do we just drop the stdin data?

Good question. A few options:

  1. Throw an error: "Only one input file accepted", this would be the conservative approach.
  2. Ignore one of the two - but it's not obvious which. I would only go this way if there's a strong convention, otherwise behaviour will be unexpected for a good share of users.
  3. Use both. This makes sense if we want to support reading multiple csv/json files into one parquet file. That's not an unreasonable extension. Say if you have lots of ndjson files, instead of catting them all together, why not pass all the paths to the CLI. But for this to make sense, we should first support multiple input files, decide how to deal with edge cases like if some or all files contain headers etc.

So for now my tendency would be to throw an error. Then maybe relax later.

@domoritz
Copy link
Owner

domoritz commented Mar 5, 2023

I agree. Let's throw an error for now.

Piped input does not support Seek out of the box
Seek is required to infer the schema
To work around this, we buffer the input iff input file
does not support seek
Only the number of lines actually used to infer the schema
are buffered to allow reading of files larger than memory
This works, because the arrow crate only seeks twice:
1. To check whether seek is supported at the start
2. To reset to the start of the file after schem inference

The seekable buffer wrapper is only used when necessary

There should be no performance penalty for currently supported
use cases

Use cases:
```sh
cat test.csv | csv2parquet /dev/stdin test.parquet
zstdcat test.csv.zst | csv2parquet /dev/stdin test.parquet
```

Resolves domoritz#3

feat: refactor SeekableReader into arrow-tools lib create

Also refactor schema matching to make it less verbose by using map_err
instead of match, see json2parquet for before/after
@corneliusroemer
Copy link
Collaborator Author

I force pushed the work from #13 to here so that the comments stay in one place. Old commits/branches should still be archived on Github

Copy link
Owner

@domoritz domoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Please update the docs and then we can merge this feature.

@domoritz
Copy link
Owner

@corneliusroemer can you wrap up the pull request so that we can merge it?

@corneliusroemer
Copy link
Collaborator Author

@domoritz Sorry, yes! Can you resend the invitation - it expired after 7 days 🙃

@domoritz
Copy link
Owner

Definitely. You can still push to this pull request even without the invitation.

@domoritz
Copy link
Owner

domoritz commented Apr 1, 2023

@corneliusroemer can you wrap this up so we can merge and make a release? Or do you need help with anything?

@domoritz domoritz merged commit 643bff1 into domoritz:main Apr 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

allow reading from stdin (ideally with schema inference)
3 participants