
Reading a parquet written to "/dev/stdout" from "/dev/stdin" using duckdb CLI and a pipe fails with Error: PipeFileSystem: GetLastModifiedTime is not implemented! #2826

Closed
mskyttner opened this issue Dec 20, 2021 · 8 comments · Fixed by #2845

Comments

@mskyttner

What happens?

Writing a Parquet file to /dev/stdout and reading it back from /dev/stdin using a pipe and chained duckdb CLI commands results in Error: Not implemented Error: PipeFileSystem: GetLastModifiedTime is not implemented!

To Reproduce

Use this bash oneliner at the command prompt to reproduce:

# read back duckdb CLI parquet output from /dev/stdout from /dev/stdin
R --quiet -e 'readr::write_csv(iris, "iris.csv")' && duckdb :memory: "copy (select * from read_csv_auto('iris.csv', HEADER=TRUE)) to '/dev/stdout' with (format 'parquet');" | duckdb :memory: "select * from read_parquet('/dev/stdin') limit 5;" ; duckdb -version

Outcome:

Error: Not implemented Error: PipeFileSystem: GetLastModifiedTime is not implemented!
v0.3.1 88aa81c6b

Related issues

Possibly related issue: #2296, which indicates it should be possible to chain duckdb commands together in a pipe.

Environment :

  • OS: [Any OS capable of running bash, which I guess is any of Mac/Win/Linux, but tested on Linux]
  • DuckDB Version: [duckdb CLI v0.3.1 88aa81c, downloaded from https://github.com/duckdb/duckdb/releases/download/v0.3.1/duckdb_cli-linux-amd64.zip]
  • DuckDB Client: [duckdb CLI]
@dforsber

This may also be related:

% duckdb --version
v0.3.1 88aa81c6b
% duckdb :memory: "COPY (SELECT * FROM parquet_scan('test.parquet') LIMIT 20) TO '/dev/stdout' WITH (FORMAT 'Parquet');" \
     | parquet-tools meta /dev/stdin
Error: Not implemented Error: PipeFileSystem: FileSync is not implemented!

@Mytherin
Collaborator

Reading a Parquet file efficiently from a stream is not possible, as reading a Parquet file requires jumping around the file randomly, which cannot be done in a streaming manner.

We could detect that we are reading from a stream and buffer the entire Parquet file in memory, but I am not sure if supporting it in that manner is even desirable. Parquet as a format is simply not intended for streaming consumption. Perhaps we should just improve the error message here instead.

Writing Parquet files to a stream should be possible, however.
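
Until a better error message lands, a minimal workaround sketch of the buffering approach described above, assuming a POSIX shell with mktemp (the temp-file handling is illustrative, not part of DuckDB): spool the pipe to an on-disk file so DuckDB can seek to the footer, then scan it as usual.

# buffer the piped Parquet bytes onto disk, then scan the now-seekable file
tmp=$(mktemp)
cat > "$tmp"
duckdb :memory: "SELECT * FROM read_parquet('$tmp') LIMIT 5;"
rm -f "$tmp"

Piping the write side of the original repro into this snippet sidesteps the /dev/stdin read entirely.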

@Alex-Monahan
Contributor

I've got a couple of questions, mostly for my own learning!

Would it make any difference if the parquet_scan was querying many smaller files with a glob syntax? Would streaming or buffering be helpful in that scenario?

Also, would a more memory-efficient workaround be to insert from the input Parquet files into a DuckDB table and then use the COPY command to create a Parquet file from that table? Or would that still read all of the Parquet files into memory at once?
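
For reference, a sketch of the table-based workaround being asked about, assuming the input Parquet files already sit on disk (the database, table, and file names are made up):

# stage the on-disk Parquet files into a persistent DuckDB database,
# then export a single Parquet file from the staged table
duckdb staging.duckdb "CREATE TABLE staged AS SELECT * FROM parquet_scan('in/*.parquet');"
duckdb staging.duckdb "COPY staged TO 'out.parquet' WITH (FORMAT 'parquet');"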

@dforsber

A couple of assumptions: Parquet files have row (batch) groups that, I assume, need to be read back and forth to match up column rows. But once a group is processed, it is not needed anymore, so there should be no need to read the whole file. And similarly, no need to write the whole file at once (I would assume). :)

@Mytherin
Collaborator

Our blog on Parquet files has an image displaying how Parquet files are structured internally.

Essentially the row groups come first, and the metadata sits only at the end of the file. If you read this file from a FIFO stream, you have no clue what the bytes you are reading mean until you get to the end of the stream and can actually read the metadata. That requires consuming the entirety of the FIFO stream and keeping it cached in memory, which rather defeats the point of the FIFO stream in the first place.

The file format is made for streaming writes, which is why the metadata sits at the end in the first place. All the data can be written first, and only after all the data is written does the metadata need to be written. This allows the file to be written without caching any significant portion of it, so it does work well for outputting to stdout.
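
For illustration, a minimal write-side example (the output file name is arbitrary; with the FileSync fix from #2845 this also works through a pipe):

# works: the Parquet bytes are produced sequentially, footer last,
# so the output can be redirected or piped
duckdb :memory: "COPY (SELECT 42 AS answer) TO '/dev/stdout' WITH (FORMAT 'parquet');" > answer.parquet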

> Also, would a more memory-efficient workaround be to insert from the input Parquet files into a DuckDB table and then use the COPY command to create a Parquet file from that table? Or would that still read all of the Parquet files into memory at once?

Perhaps this is the confusing part: you can stream-read Parquet files, just not from a FIFO pipe (e.g. stdin or stdout). The efficient way of working with Parquet files is to run parquet_scan on a file that sits on disk and process it that way.
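
Concretely, a sketch of that recommended pattern (iris.parquet stands in for any on-disk file):

# with a seekable file, DuckDB reads the footer first and then
# streams the row groups one at a time
duckdb :memory: "SELECT count(*) FROM parquet_scan('iris.parquet');"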

@mskyttner
Author

I now realize that this is probably a duplicate of #2127 which I forgot about (sorry!).

I don't have any ideas for how to improve the error message, if that is advisable. Maybe "Error: Not implemented Error: PipeFileSystem: Reading a parquet stream from /dev/stdin is not recommended since the metadata is located at the end of the stream, please first write the stream to disk or S3 object storage instead", but that might be a bit lengthy.

@Mytherin
Collaborator

@dforsber @mskyttner

Both these problems should now be fixed by #2845. There is a clear error message if you try to read a Parquet file from /dev/stdin, and the FileSync issue is resolved.

@dforsber

Thank you 👍🏻 .

Nothing would prevent sending the Parquet metadata from the end of the file first, along with the total file size, through the pipe if really needed, e.g. to make it possible to pipe Parquet data between DuckDB instances more efficiently. But that wouldn't be compatible with file redirection anyway, and I struggle to find a use case for it :)
