Reading a parquet written to "/dev/stdout" from "/dev/stdin" using duckdb CLI and a pipe fails with Error: PipeFileSystem: GetLastModifiedTime is not implemented! #2826
Comments
This may also be related:
Reading a Parquet file efficiently from a stream is not possible, as reading a Parquet file requires jumping around the file randomly, which cannot be done in a streaming manner. We could detect that we are reading from a stream and buffer the entire Parquet file in memory, but I am not sure that supporting it in that manner is even desirable. Parquet as a format is simply not intended for streaming consumption. Perhaps we should just improve the error message here instead. Writing Parquet files to a stream should be possible, however.
I've got a couple of questions, mostly for my own learning! Would it make any difference if the parquet_scan were querying many smaller files with a glob syntax? Would streaming or buffering be helpful in that scenario? Also, would a more memory-efficient workaround be to insert from the input Parquet files into a DuckDB table and then use the COPY command to create a Parquet file from that table? Or would that still read all of the Parquet files into memory at once?
A couple of assumptions. Parquet files have row (batch) groups that need to be read back and forth to match column rows, I assume. But once a group is processed, it is not needed anymore. Thus, there is no need to read the whole file. And similarly, no need to write the whole file at once (I would assume). :)
Our blog on Parquet files has an image displaying how Parquet files are structured internally. Essentially the row groups come first, and the metadata sits only at the end of the file. If you read this file from a FIFO stream, you have no clue what bytes you are reading until you get to the end of the FIFO stream and can actually read the metadata. This requires essentially consuming the entirety of the FIFO stream and keeping it cached in memory, which rather defeats the point of the FIFO stream in the first place.

The file format is made for streaming writes, which is why the metadata sits at the end in the first place. All the data can be written, and only after all the data is written does the metadata need to be written. This allows the file to be written without having to cache any significant portion of it. So it does work well for outputting to a stream.
Perhaps this is confusing: you can stream-read Parquet files, but not from a FIFO pipe (e.g. stdin or stdout). The efficient manner of working with Parquet files is to do
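The seek-to-the-footer behavior described above can be sketched in Python. This is only an illustration of why a pipe cannot be scanned, not DuckDB's actual reader: `footer_length` is a hypothetical helper, and the file written here is a minimal stand-in for a Parquet tail, not a real Parquet file.

```python
import os
import struct
import tempfile

# A Parquet file ends with a 4-byte little-endian footer length followed
# by the magic bytes b"PAR1", so a reader must first seek to the end of
# the file -- exactly the operation a FIFO pipe cannot support.
def footer_length(f):
    f.seek(-8, os.SEEK_END)                    # jump to the last 8 bytes
    length, magic = struct.unpack("<I4s", f.read(8))
    if magic != b"PAR1":
        raise ValueError("not a Parquet file")
    return length

# Works on a regular (seekable) file:
with tempfile.TemporaryFile() as f:
    # stand-in tail: 100 bytes of "footer", then its length, then the magic
    f.write(b"PAR1" + b"\x00" * 100 + struct.pack("<I", 100) + b"PAR1")
    print(footer_length(f))                    # 100

# Fails on a pipe (like /dev/stdin fed by `cmd | duckdb`), because a
# pipe is not seekable:
read_end, write_end = os.pipe()
with os.fdopen(read_end, "rb") as pipe_file:
    print(pipe_file.seekable())                # False
os.close(write_end)
```

The relative `seek` from `SEEK_END` is the crux: a regular file knows its own size, while a pipe's length is unknowable until the writer closes it, which is why the whole stream would have to be buffered first.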
I now realize that this is probably a duplicate of #2127, which I forgot about (sorry!). I don't have any ideas on how to improve the error message, if that is advisable. Maybe
Both of these problems should now be fixed in #2845. There is a clear error message if you try to read a Parquet file from /dev/stdin, and the FileSync issue is resolved.
Thank you 👍🏻. Nothing would prevent sending the Parquet metadata from the end of the file first, along with the total file size, through the pipe if really needed, e.g. to make it possible to pipe Parquet data between duckdb instances more efficiently. But that wouldn't be compatible with file redirection anyway, and I struggle to find a use case for it :)
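The metadata-first idea in the comment above could look roughly like the following sketch. Everything here is hypothetical — the framing format and the `frame`/`unframe` helpers are illustrations of the suggestion, not anything DuckDB implements: the sender prefixes the stream with the file's total size and its 8-byte tail, so the receiver knows where the metadata sits before the row groups arrive.

```python
import struct

FOOTER_TAIL = 8  # 4-byte footer length + 4-byte b"PAR1" magic

def frame(parquet_bytes: bytes) -> bytes:
    """Prefix a Parquet payload with its total size and its 8-byte tail,
    so a pipe reader learns the metadata location up front."""
    header = struct.pack("<Q", len(parquet_bytes)) + parquet_bytes[-FOOTER_TAIL:]
    return header + parquet_bytes

def unframe(framed: bytes):
    """Recover (total_size, tail, body) from a framed byte stream."""
    total, = struct.unpack("<Q", framed[:8])
    tail = framed[8:8 + FOOTER_TAIL]
    body = framed[16:16 + total]
    return total, tail, body

# Round trip on a stand-in payload (not a real Parquet file):
payload = b"PAR1" + b"\x00" * 32 + struct.pack("<I", 32) + b"PAR1"
total, tail, body = unframe(frame(payload))
assert total == len(payload) and tail == payload[-8:] and body == payload
```

As the commenter notes, such a side-channel framing would no longer be a plain Parquet byte stream, so ordinary file redirection of the output would stop working.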
What happens?
Writing a Parquet file to /dev/stdout and reading it again from /dev/stdin using a pipe and chained duckdb CLI commands results in:
Error: Not implemented Error: PipeFileSystem: GetLastModifiedTime is not implemented!
To Reproduce
Use this bash one-liner at the command prompt to reproduce:
Outcome:
Related issues
Possibly related issue: #2296, which indicates it should be possible to chain duckdb commands together in a pipe.
Environment: