feat: Allow reading from stdin with schema inference #10

corneliusroemer · 2023-03-04T19:24:45Z

Piped input does not support Seek out of the box, but Seek is required to infer the schema.
To work around this, we buffer the input if and only if the input file does not support seek.
Only the lines actually used to infer the schema are buffered, so files larger than memory can still be read.
This works because the arrow crate only seeks twice:

1. To check whether seek is supported at the start
2. To reset to the start of the file after schema inference

The seekable buffer wrapper is only used when necessary, so there should be no performance penalty for currently supported use cases.
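
As a rough illustration of the buffering idea described above, here is a minimal sketch of a reader that records the start of a non-seekable stream so it can be rewound once inference is done. The name `RewindableReader` and the byte-based `limit` are assumptions made for this example (the description above buffers by lines used for inference); this is not the code from the PR.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical wrapper (not the PR's actual type): records the first
/// `limit` bytes read from a non-seekable source so that the stream can be
/// rewound to the start while everything consumed so far is still buffered.
pub struct RewindableReader<R: Read> {
    inner: R,
    buffer: Vec<u8>,  // prefix of the stream recorded so far
    pos: usize,       // replay position inside `buffer`
    limit: usize,     // maximum number of bytes to record
    overflowed: bool, // set once more than `limit` bytes were consumed
}

impl<R: Read> RewindableReader<R> {
    pub fn new(inner: R, limit: usize) -> Self {
        Self { inner, buffer: Vec::new(), pos: 0, limit, overflowed: false }
    }
}

impl<R: Read> Read for RewindableReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // After a rewind, replay recorded bytes before touching `inner`.
        if self.pos < self.buffer.len() {
            let n = out.len().min(self.buffer.len() - self.pos);
            out[..n].copy_from_slice(&self.buffer[self.pos..self.pos + n]);
            self.pos += n;
            return Ok(n);
        }
        let n = self.inner.read(out)?;
        if !self.overflowed && self.buffer.len() + n <= self.limit {
            // Still within budget: remember what was just handed out.
            self.buffer.extend_from_slice(&out[..n]);
            self.pos += n;
        } else {
            // Past the budget: stop recording and free the memory.
            self.overflowed = true;
            self.buffer = Vec::new();
            self.pos = 0;
        }
        Ok(n)
    }
}

impl<R: Read> Seek for RewindableReader<R> {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        match target {
            // Rewinding is only possible while the whole consumed prefix
            // is still held in `buffer`.
            SeekFrom::Start(0) if !self.overflowed => {
                self.pos = 0;
                Ok(0)
            }
            _ => Err(io::Error::new(
                io::ErrorKind::Unsupported,
                "can only rewind to the start while the input is still buffered",
            )),
        }
    }
}
```

Memory use in this sketch is bounded by `limit`, so the rest of a large input streams straight through after the rewind. A real implementation would pick the budget from the number of records used for inference rather than a fixed byte count.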

Use cases:

cat test.csv | csv2parquet /dev/stdin test.parquet
zstdcat test.csv.zst | csv2parquet /dev/stdin test.parquet

Resolves #3

domoritz

Sweet. Can you fix the clippy issues? Also, would it make sense to add this feature to the other tools as well?

corneliusroemer · 2023-03-04T20:01:02Z

Fixed the clippy warnings and refactored the error handling to be DRYer.

Yep, would make sense to copy - but in that case it would be beneficial to factor the SeekableReader functionality into a separate lib crate so as to not repeat ourselves 4 times.

I don't have much experience with cargo, so maybe better if you or @lsh have a look at it?

domoritz · 2023-03-04T22:55:42Z

Yeah, we could have a separate library. @lsh offered to take a look. For now, feel free to just have the same code copied so we can merge the pull request. Does that make sense or do you prefer to just support stdin for one tool until we make the library?

domoritz

Looks good. Is there really no implementation of something like this in the standard library we can use?

One thing I would like to see is support for not having to explicitly say /dev/stdin. Could we just default to it when there is no other input file provided?

lsh · 2023-03-05T00:46:09Z

@domoritz Maybe the atty crate could help here?

domoritz · 2023-03-05T02:12:44Z

We may need to add an explicit argument for the output so that the input can become optional. All of these should work:

cat test.csv | csv2parquet /dev/stdin -o test.parquet
cat test.csv | csv2parquet -o test.parquet
cat test.csv | csv2parquet # to stdout

lsh · 2023-03-05T04:41:21Z

> All of these should work
> cat test.csv | csv2parquet -o test.parquet

Asking the following for the sake of having a record of the decision.

What is the desired behavior for:

cat test.csv | csv2parquet random_file.csv -o test.parquet

i.e. do we just drop the stdin data?

corneliusroemer · 2023-03-05T06:13:21Z

Excellent - I was about to have a look at creating a lib.rs crate for shared functionality. Will review what you got up to @lsh. I also posted the code on Code Review Stack Exchange, so I may propose some changes based on that.

> Is there really no implementation of something like this in the standard library we can use?

Not that I'm aware of. When you tell people you want to make stdin seekable, they suggest you don't. We need to do this because upstream crates have not considered the stdin use case, which is a bit of an edge case.

Not having to specify /dev/stdin and out-file every time is a great idea. I was about to suggest that when I played with "print schema only" and got annoyed that I had to pass an outfile that was never used.

> What is the desired behavior for:
> cat test.csv | csv2parquet random_file.csv -o test.parquet
> i.e. do we just drop the stdin data?

Good question. A few options:

1. Throw an error: "Only one input file accepted". This would be the conservative approach.
2. Ignore one of the two - but it's not obvious which. I would only go this way if there's a strong convention, otherwise the behaviour will be unexpected for a good share of users.
3. Use both. This makes sense if we want to support reading multiple csv/json files into one parquet file. That's not an unreasonable extension: if you have lots of ndjson files, why not pass all the paths to the CLI instead of catting them all together? But for this to make sense, we should first support multiple input files and decide how to deal with edge cases, such as some or all files containing headers.

So for now my tendency would be to throw an error. Then maybe relax later.

domoritz · 2023-03-05T14:02:44Z

I agree. Let's throw an error for now.
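
The rule agreed on here, combined with the optional-input idea discussed earlier, could look roughly like the sketch below. The names `Input` and `resolve_input` are hypothetical and not taken from the tool. Stdin detection uses `std::io::IsTerminal` from the standard library (Rust 1.70+) instead of the atty crate. Treating "stdin is not a terminal" as "data is piped" is an assumption that can be wrong, for example when stdin is redirected from /dev/null.

```rust
use std::io::{self, IsTerminal};
use std::path::{Path, PathBuf};

/// Hypothetical description of where the tool should read from.
#[derive(Debug, PartialEq)]
pub enum Input {
    Stdin,
    File(PathBuf),
}

/// Assumption: stdin counts as "piped" whenever it is not a terminal.
pub fn stdin_is_piped() -> bool {
    !io::stdin().is_terminal()
}

/// Decide the input source. `stdin_is_piped` is a parameter so the rule
/// can be tested without a real pipe.
pub fn resolve_input(path: Option<&Path>, stdin_is_piped: bool) -> Result<Input, String> {
    match path {
        // Naming /dev/stdin explicitly keeps working as before.
        Some(p) if p == Path::new("/dev/stdin") => Ok(Input::Stdin),
        // A file argument plus piped data is ambiguous: refuse for now.
        Some(_) if stdin_is_piped => Err("Only one input file accepted".to_string()),
        Some(p) => Ok(Input::File(p.to_path_buf())),
        // No file argument: fall back to stdin if something is piped in.
        None if stdin_is_piped => Ok(Input::Stdin),
        None => Err("No input file given and nothing piped to stdin".to_string()),
    }
}
```

A caller would pass `stdin_is_piped()` as the second argument. Keeping detection separate from the decision makes it easy to relax the error later, for instance if multiple inputs become supported.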

feat: refactor SeekableReader into arrow-tools lib crate

Also refactor schema matching to make it less verbose by using map_err instead of match, see json2parquet for before/after
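
For readers unfamiliar with the map_err refactor mentioned in the commit message, here is a generic before/after sketch. The function names and the "record limit" option are invented for illustration. This is not the schema code from json2parquet.

```rust
// Before: a match that only exists to rewrap the error.
fn parse_limit_match(raw: &str) -> Result<usize, String> {
    match raw.parse::<usize>() {
        Ok(n) => Ok(n),
        Err(e) => Err(format!("invalid record limit {raw:?}: {e}")),
    }
}

// After: map_err transforms the error and leaves the Ok value untouched.
fn parse_limit_map_err(raw: &str) -> Result<usize, String> {
    raw.parse::<usize>()
        .map_err(|e| format!("invalid record limit {raw:?}: {e}"))
}
```

Both functions behave identically. The second one drops the boilerplate `Ok(n) => Ok(n)` arm.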

corneliusroemer · 2023-03-06T11:47:43Z

I force-pushed the work from #13 to here so that the comments stay in one place. Old commits/branches should still be archived on GitHub.

domoritz

Great. Please update the docs and then we can merge this feature.

domoritz · 2023-03-15T00:36:19Z

@corneliusroemer can you wrap up the pull request so that we can merge it?

corneliusroemer · 2023-03-15T12:44:41Z

@domoritz Sorry, yes! Can you resend the invitation - it expired after 7 days 🙃

domoritz · 2023-03-15T13:29:57Z

Definitely. You can still push to this pull request even without the invitation.

domoritz · 2023-04-01T04:03:55Z

@corneliusroemer can you wrap this up so we can merge and make a release? Or do you need help with anything?

corneliusroemer force-pushed the auto-schema-for-pipe branch from f63f110 to f3ce0c5 on March 4, 2023 19:26

domoritz reviewed Mar 4, 2023

corneliusroemer force-pushed the auto-schema-for-pipe branch from f3ce0c5 to 47cdc08 on March 4, 2023 19:56

domoritz reviewed Mar 4, 2023

lsh mentioned this pull request Mar 5, 2023

chore: Add arrow-tools crate for shared logic between utils #12

Merged

corneliusroemer mentioned this pull request Mar 5, 2023

When creating/inferring schema only, do not buffer stdin #14

Open

domoritz mentioned this pull request Mar 5, 2023

Refactor seekable reader into arrow-tools lib crate #13

Closed

corneliusroemer force-pushed the auto-schema-for-pipe branch from 47cdc08 to b472f89 on March 6, 2023 11:45

domoritz approved these changes Mar 6, 2023

Merge branch 'main' into auto-schema-for-pipe

11e869c

domoritz merged commit 643bff1 into domoritz:main Apr 12, 2023
