Support reading of ZSTD files, and add support for writing GZIP and ZSTD files #2697

Merged on Dec 1, 2021 (8 commits)

Mytherin
Collaborator

Implements #2538

This PR implements the ZStdFileSystem. It lives in the Parquet extension because the ZSTD library ships with that extension. With it, DuckDB can read ZSTD-compressed CSV files in a streaming manner. By default, the compression type is inferred from the file extension; it can also be passed explicitly as a parameter to the COPY statement and the read_csv function.

D select * from 'mtcars.csv.zst' limit 2;
┌───────────────┬──────┬─────┬───────┬─────┬──────┬───────┬───────┬────┬────┬──────┬──────┐
│     model     │ mpg  │ cyl │ disp  │ hp  │ drat │  wt   │ qsec  │ vs │ am │ gear │ carb │
├───────────────┼──────┼─────┼───────┼─────┼──────┼───────┼───────┼────┼────┼──────┼──────┤
│ Mazda RX4     │ 21.0 │ 6   │ 160.0 │ 110 │ 3.9  │ 2.62  │ 16.46 │ 0  │ 1  │ 4    │ 4    │
│ Mazda RX4 Wag │ 21.0 │ 6   │ 160.0 │ 110 │ 3.9  │ 2.875 │ 17.02 │ 0  │ 1  │ 4    │ 4    │
└───────────────┴──────┴─────┴───────┴─────┴──────┴───────┴───────┴────┴────┴──────┴──────┘
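
For files whose extension does not reveal the compression, the type can be given explicitly. A minimal sketch, assuming the reader function is `read_csv_auto` and the parameter is named `compression` (as in DuckDB's CSV reader; neither spelling is given in this description), and using a hypothetical file name:

```sql
-- 'mtcars_export.bin' is a hypothetical ZSTD-compressed CSV file whose
-- name gives no hint about its compression, so inference from the
-- extension cannot apply and the type is passed explicitly.
SELECT *
FROM read_csv_auto('mtcars_export.bin', compression='zstd')
LIMIT 2;
```

When the file name does end in `.zst` or `.gz`, the parameter can be left out and the extension decides.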

This PR cleans up the GZipFileSystem by adding an underlying CompressedFileSystem that is used by both the GZip and ZStd file systems. This should also make it easier to add support for other compression formats in the future.

ZSTD/GZIP Writing

In addition to read support for ZSTD files, this PR adds support for writing files compressed in both ZSTD and GZIP formats, e.g.:

COPY mtcars TO 'mtcars.csv.zst';
COPY mtcars TO 'mtcars.csv.gz';

-- or by specifying the compression manually
COPY mtcars TO 'mtcars.csv.zst' (COMPRESSION ZSTD);
COPY mtcars TO 'mtcars.csv.gz' (COMPRESSION GZIP);

Just like regular CSV files, compressed CSV files are written in a streaming manner, meaning the full data set does not need to fit in memory for the write.
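
A quick way to sanity-check a compressed write is to read the file straight back. A sketch, assuming a `mtcars` table like the one in the examples above already exists; the two counts should agree if the round trip is lossless:

```sql
-- Write the table as a GZIP-compressed CSV (type inferred from '.gz').
COPY mtcars TO 'mtcars.csv.gz';

-- Row count of the source table.
SELECT count(*) AS table_rows FROM mtcars;

-- Row count when the compressed file is read back.
SELECT count(*) AS file_rows FROM 'mtcars.csv.gz';
```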

Mytherin linked an issue on Nov 30, 2021 that may be closed by this pull request: read CSV compressed with zstd