large data loads - Format-specific variants #59

tom-clickhouse · 2023-10-01T12:31:54Z

The approach in the initial version of the data load script is generic, robust, and works for all formats. We can improve performance by exploiting file-type-specific knowledge and available metadata to avoid unnecessary reads.

For example, ClickHouse SQL queries can access an extensive set of Parquet metadata, which our script could potentially use. Parquet files typically store, for each numeric column, the minimum and maximum values per row group. Since 23.8, ClickHouse automatically uses this metadata at query time to speed up queries that filter on numeric columns in Parquet files. Our script could use it as an alternative to rowNumberInAllBlocks, which would allow parallel reading within a Parquet file.
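To make the idea concrete, here is a minimal sketch of how per-row-group min/max values could be turned into disjoint filter ranges, one per load batch. It is only an illustration under assumptions, not the script's actual logic: `RowGroupStats`, `Batch`, `plan_batches`, `to_filter`, and the `target_rows` parameter are hypothetical names, the statistics are assumed to have been fetched already (for example via ClickHouse's Parquet metadata access), and the key column is assumed to be an integer column without NULLs.

```python
from typing import List, NamedTuple


class RowGroupStats(NamedTuple):
    """Hypothetical container for one row group's statistics on the key column."""
    min_value: int
    max_value: int
    num_rows: int


class Batch(NamedTuple):
    """A closed value range [lo, hi] and the number of rows it covers."""
    lo: int
    hi: int
    rows: int


def plan_batches(stats: List[RowGroupStats], target_rows: int) -> List[Batch]:
    """Greedily merge row groups into disjoint value ranges.

    A row group joins the current batch if its range overlaps the batch
    (so the ranges stay disjoint) or if the batch still has room.
    """
    batches: List[Batch] = []
    for rg in sorted(stats, key=lambda s: s.min_value):
        if batches:
            last = batches[-1]
            overlaps = rg.min_value <= last.hi
            has_room = last.rows + rg.num_rows <= target_rows
            if overlaps or has_room:
                batches[-1] = Batch(
                    last.lo,
                    max(last.hi, rg.max_value),
                    last.rows + rg.num_rows,
                )
                continue
        batches.append(Batch(rg.min_value, rg.max_value, rg.num_rows))
    return batches


def to_filter(batch: Batch, column: str) -> str:
    """Render a batch as a SQL predicate on the (assumed) key column."""
    return f"{column} BETWEEN {batch.lo} AND {batch.hi}"


# Example: four row groups, two of which overlap in value range.
example_stats = [
    RowGroupStats(0, 99, 100),
    RowGroupStats(100, 199, 100),
    RowGroupStats(150, 300, 100),
    RowGroupStats(301, 400, 100),
]
example_batches = plan_batches(example_stats, target_rows=200)
```

Because the resulting ranges are disjoint, each batch could be loaded by an independent query with its own filter, and row groups whose min/max fall outside that filter can be skipped by the reader. Overlapping row groups are merged even when that exceeds `target_rows`, so the target is a soft limit in this sketch.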

tom-clickhouse added the enhancement New feature or request label Oct 1, 2023

tom-clickhouse self-assigned this Oct 1, 2023

tom-clickhouse closed this as completed Oct 26, 2023
