Support reading of ZSTD files, and add support for writing GZIP and ZSTD files #2697

Mytherin · 2021-11-30T20:19:16Z

Implements #2538

This PR implements the ZStdFileSystem. It lives in the Parquet extension because the ZSTD library is shipped as part of that extension. This allows DuckDB to read ZSTD-compressed CSV files in a streaming manner. By default, the compression type is inferred from the file extension. It can also be provided explicitly as a parameter to the COPY statement and the read_csv function.

D select * from 'mtcars.csv.zst' limit 2;
┌───────────────┬──────┬─────┬───────┬─────┬──────┬───────┬───────┬────┬────┬──────┬──────┐
│     model     │ mpg  │ cyl │ disp  │ hp  │ drat │  wt   │ qsec  │ vs │ am │ gear │ carb │
├───────────────┼──────┼─────┼───────┼─────┼──────┼───────┼───────┼────┼────┼──────┼──────┤
│ Mazda RX4     │ 21.0 │ 6   │ 160.0 │ 110 │ 3.9  │ 2.62  │ 16.46 │ 0  │ 1  │ 4    │ 4    │
│ Mazda RX4 Wag │ 21.0 │ 6   │ 160.0 │ 110 │ 3.9  │ 2.875 │ 17.02 │ 0  │ 1  │ 4    │ 4    │
└───────────────┴──────┴─────┴───────┴─────┴──────┴───────┴───────┴────┴────┴──────┴──────┘
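The extension-based inference described above can be sketched roughly as follows. This is an illustration in Python, not DuckDB's actual implementation: the function name `infer_compression`, the mapping table, and the `"auto"` default are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical mapping for illustration; not DuckDB's internal table.
EXTENSION_TO_COMPRESSION = {".gz": "gzip", ".zst": "zstd"}


def infer_compression(path: str, explicit: str = "auto") -> str:
    """Return the compression type to use for `path`.

    An explicitly requested type wins. Otherwise the last file extension
    decides, and an unknown extension means "treat as uncompressed".
    """
    if explicit.lower() != "auto":
        return explicit.lower()
    suffix = Path(path).suffix.lower()  # e.g. ".zst" for "mtcars.csv.zst"
    return EXTENSION_TO_COMPRESSION.get(suffix, "none")
```

Under these assumptions, `infer_compression("mtcars.csv.zst")` picks ZSTD from the extension, while passing `explicit="gzip"` overrides whatever the file name suggests.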

This PR cleans up the GZipFileSystem by adding an underlying CompressedFileSystem that is used by both the GZip and ZStd file systems. This should also make it easier to add support for other compression formats in the future.
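To illustrate why a shared base layer makes new formats cheap to add, here is a minimal Python sketch of the idea. It is not the C++ code from this PR: the class names `CompressedWriter`, `GzipWriter`, and `Bz2Writer` are hypothetical, and bz2 merely stands in for "another format" because it is available in the Python standard library.

```python
import bz2
import zlib


class CompressedWriter:
    """Shared streaming-write logic; subclasses only supply the codec."""

    def __init__(self, sink):
        self.sink = sink                      # any object with write(bytes)
        self.codec = self.make_compressor()   # incremental compressor

    def make_compressor(self):
        raise NotImplementedError

    def write(self, data: bytes) -> None:
        # Compress one chunk and forward whatever output is ready.
        self.sink.write(self.codec.compress(data))

    def close(self) -> None:
        # Flush buffered data and the format trailer to the sink.
        self.sink.write(self.codec.flush())


class GzipWriter(CompressedWriter):
    def make_compressor(self):
        # wbits = 16 + MAX_WBITS asks zlib for a gzip header and trailer.
        return zlib.compressobj(wbits=16 + zlib.MAX_WBITS)


class Bz2Writer(CompressedWriter):
    def make_compressor(self):
        return bz2.BZ2Compressor()
```

In this sketch, supporting a further format means writing one small subclass, while chunking and flushing stay in the base class.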

ZSTD/GZIP Writing

In addition to read support for ZSTD files, this PR adds support for writing files compressed in both ZSTD and GZIP formats, e.g.:

COPY mtcars TO 'mtcars.csv.zst';
COPY mtcars TO 'mtcars.csv.gz';

-- or by specifying the compression manually
COPY mtcars TO 'mtcars.csv.zst' (COMPRESSION ZSTD);
COPY mtcars TO 'mtcars.csv.gz' (COMPRESSION GZIP);

Just like regular CSV files, compressed CSV files are written in a streaming manner, meaning the full data set does not need to fit in memory for the write.
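The streaming property can be illustrated outside DuckDB with a short Python sketch. The helper `write_csv_gz` is hypothetical and uses Python's standard gzip and csv modules rather than anything from this PR; it only shows the general pattern of compressing rows as they are produced.

```python
import csv
import gzip


def write_csv_gz(path, header, rows):
    """Stream `rows` (any iterable) into a gzip-compressed CSV file.

    Rows are pulled one at a time, so a generator can feed a data set
    that is larger than memory. Returns the number of data rows written.
    """
    count = 0
    # The with-block closes the file explicitly, which writes the gzip trailer.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

Passing a generator as `rows` keeps only the current row and the compressor's internal buffer in memory at any time.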

Mytherin added 7 commits November 30, 2021 14:58

Initial Support for ZSTD File System

b48b4fb

Add CompressedFileSystem class to unify support for reading/writing compressed files

75592bc

Unify gzip and zstd file system under the compressed file system

fd7b9a3

Streaming gzip write working

a54f954

Add support for writing ZSTD files to the ZSTD File System

7f2eda3

Clean up FileCompressionType parsing for COPY TO/FROM handling

ca37274

32KB buffer for gzip, instead of 1KB

8066658

Mytherin linked an issue Nov 30, 2021 that may be closed by this pull request

read CSV compressed with zstd #2538

Closed

Mytherin mentioned this pull request Nov 30, 2021

read CSV compressed with zstd #2538

Closed

Explicitly call Close on file in CSV write, and avoid throwing exceptions in destructor of file streams

089b242

Mytherin merged commit 3dfdc2f into duckdb:master Dec 1, 2021

Mytherin deleted the zstdfilesystem branch December 3, 2021 14:23

Mytherin mentioned this pull request Dec 18, 2021

Feature: support reading of zstd compressed csv files #2814

Closed

greg-finley mentioned this pull request Feb 27, 2023

Read using duckdb instead of BigQuery? greg-finley/lichess-bigquery#6

Open
