
Reading a parquet written to "/dev/stdout" from "/dev/stdin" using duckdb CLI and a pipe fails with Error: PipeFileSystem: GetLastModifiedTime is not implemented! #2826

Closed
mskyttner opened this issue Dec 20, 2021 · 8 comments · Fixed by #2845

Comments

@mskyttner

What happens?

Writing a Parquet file to /dev/stdout and reading it back from /dev/stdin using a pipe and chained duckdb CLI commands results in Error: Not implemented Error: PipeFileSystem: GetLastModifiedTime is not implemented!

To Reproduce

Use this bash oneliner at the command prompt to reproduce:

# read back duckdb CLI parquet output from /dev/stdout from /dev/stdin
R --quiet -e 'readr::write_csv(iris, "iris.csv")' && duckdb :memory: "copy (select * from read_csv_auto('iris.csv', HEADER=TRUE)) to '/dev/stdout' with (format 'parquet');" | duckdb :memory: "select * from read_parquet('/dev/stdin') limit 5;" ; duckdb -version

Outcome:

Error: Not implemented Error: PipeFileSystem: GetLastModifiedTime is not implemented!
v0.3.1 88aa81c6b

Related issues

Possibly related issue: #2296, which indicates it should be possible to chain duckdb commands together in a pipe.

Environment :

  • OS: [Any OS capable of running bash, which I guess is any of Mac/Win/Linux, but tested on Linux]
  • DuckDB Version: [duckdb CLI v0.3.1 88aa81c, downloaded from https://github.com/duckdb/duckdb/releases/download/v0.3.1/duckdb_cli-linux-amd64.zip]
  • DuckDB Client: [duckdb CLI]
@dforsber

This may also be related:

% duckdb --version
v0.3.1 88aa81c6b
% duckdb :memory: "COPY (SELECT * FROM parquet_scan('test.parquet') LIMIT 20) TO '/dev/stdout' WITH (FORMAT 'Parquet');" \
     | parquet-tools meta /dev/stdin
Error: Not implemented Error: PipeFileSystem: FileSync is not implemented!

@Mytherin
Collaborator

Reading a Parquet file efficiently from a stream is not possible, as reading a Parquet file requires jumping around the file randomly, which cannot be done in a streaming manner.

We could detect that we are reading from a stream and buffer the entire Parquet file in memory, but I am not sure if supporting it in that manner is even desirable. Parquet as a format is simply not intended for streaming consumption. Perhaps we should just improve the error message here instead.

Writing Parquet files to a stream should be possible, however.
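
Until a better error message lands, a minimal workaround sketch of the buffering approach described above, assuming a POSIX shell with mktemp (the temp-file handling is illustrative, not part of DuckDB): spool the pipe to an on-disk file so DuckDB can seek to the footer, then scan it as usual.

# buffer the piped Parquet bytes onto disk, then scan the now-seekable file
tmp=$(mktemp)
cat > "$tmp"
duckdb :memory: "SELECT * FROM read_parquet('$tmp') LIMIT 5;"
rm -f "$tmp"

Piping the write side of the original repro into this snippet sidesteps the /dev/stdin read entirely.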

@Alex-Monahan
Contributor

I've got a couple of questions, mostly for my own learning!

Would it make any difference if the parquet_scan was querying many smaller files with a glob syntax? Would streaming or buffering be helpful in that scenario?

Also, would a more memory-efficient workaround be to insert from the input Parquet files into a DuckDB table and then use the COPY command to create a Parquet file from that table? Or would that still read all of the Parquet files into memory at once?
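
For reference, a sketch of the table-based workaround being asked about, assuming the input Parquet files already sit on disk (the database, table, and file names are made up):

# stage the on-disk Parquet files into a persistent DuckDB database,
# then export a single Parquet file from the staged table
duckdb staging.duckdb "CREATE TABLE staged AS SELECT * FROM parquet_scan('in/*.parquet');"
duckdb staging.duckdb "COPY staged TO 'out.parquet' WITH (FORMAT 'parquet');"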

@dforsber

A couple of assumptions: Parquet files have row (batch) groups that, I assume, need to be read back and forth to match up column rows. But once a group is processed, it is not needed anymore, so there should be no need to read the whole file. And similarly, no need to write the whole file at once (I would assume). :)

@Mytherin
Collaborator

Our blog on Parquet files has an image displaying how Parquet files are structured internally.

Essentially the row groups come first, and the metadata sits only at the end of the file. If you read this file from a FIFO stream, you have no clue what the bytes you are reading mean until you get to the end of the stream and can actually read the metadata. That requires consuming the entirety of the FIFO stream and keeping it cached in memory, which rather defeats the point of the FIFO stream in the first place.

The file format is made for streaming writes, which is why the metadata sits at the end in the first place. All the data can be written first, and only after all the data is written does the metadata need to be written. This allows the file to be written without caching any significant portion of it, so it does work well for outputting to stdout.
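
For illustration, a minimal write-side example (the output file name is arbitrary; with the FileSync fix from #2845 this also works through a pipe):

# works: the Parquet bytes are produced sequentially, footer last,
# so the output can be redirected or piped
duckdb :memory: "COPY (SELECT 42 AS answer) TO '/dev/stdout' WITH (FORMAT 'parquet');" > answer.parquet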

> Also, would a more memory-efficient workaround be to insert from the input Parquet files into a DuckDB table and then use the COPY command to create a Parquet file from that table? Or would that still read all of the Parquet files into memory at once?

Perhaps this is the confusing part: you can stream-read Parquet files, just not from a FIFO pipe (e.g. stdin or stdout). The efficient way of working with Parquet files is to run parquet_scan on a file that sits on disk and process it that way.
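
Concretely, a sketch of that recommended pattern (iris.parquet stands in for any on-disk file):

# with a seekable file, DuckDB reads the footer first and then
# streams the row groups one at a time
duckdb :memory: "SELECT count(*) FROM parquet_scan('iris.parquet');"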

@mskyttner
Author

I now realize that this is probably a duplicate of #2127 which I forgot about (sorry!).

I don't have any ideas for how to improve the error message, if that is advisable. Maybe "Error: Not implemented Error: PipeFileSystem: Reading a parquet stream from /dev/stdin is not recommended since the metadata is located at the end of the stream, please first write the stream to disk or S3 object storage instead", but that might be a bit lengthy.

@Mytherin
Collaborator

@dforsber @mskyttner

Both these problems should now be fixed by #2845. There is a clear error message if you try to read a Parquet file from /dev/stdin, and the FileSync issue is resolved.

@dforsber

Thank you 👍🏻 .

Nothing would prevent sending the Parquet metadata from the end of the file first, along with the total file size, through the pipe if really needed, e.g. to make it possible to pipe Parquet data between DuckDB instances more efficiently. But that wouldn't be compatible with file redirection anyway, and I struggle to find a use case for it :)
