The approach taken by the initial version of the data load script is generic, robust, and works for all formats. We can still improve performance by exploiting file-type-specific knowledge and available metadata to avoid unnecessary reads.
For example, ClickHouse SQL queries can access an exhaustive list of Parquet metadata, which our script could potentially use. All numeric columns in a Parquet file have metadata describing the minimum and maximum values per row group. Since version 23.8, ClickHouse automatically exploits this metadata at query time to speed up queries that filter on numeric columns in Parquet files. Our script could use this as an alternative to `rowNumberInAllBlocks`, allowing parallel reading within a Parquet file.
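As a rough illustration of the idea (not the script's actual implementation), the sketch below turns per-row-group min/max statistics into disjoint value ranges, one per load batch. It assumes the statistics for one integer column without NULLs have already been fetched, for example through ClickHouse's `ParquetMetadata` input format. The helper names `plan_range_batches` and `batch_query`, the file `data.parquet`, and the column `id` are all hypothetical.

```python
from typing import List, Tuple


def plan_range_batches(
    row_groups: List[Tuple[int, int]], num_batches: int
) -> List[Tuple[int, int]]:
    """Split the column's value space into at most num_batches half-open
    [lo, hi) ranges whose boundaries fall on row-group minimum values.

    row_groups holds one (min, max) pair per row group for an integer column.
    The returned ranges are disjoint and together cover every value, so each
    row is loaded by exactly one batch even if row groups overlap.
    """
    if not row_groups:
        return []
    # Candidate cut points: the distinct minimum values of the row groups.
    cuts = sorted({lo for lo, _ in row_groups})
    # Exclusive upper bound for the last range (integer column assumed).
    upper = max(hi for _, hi in row_groups) + 1
    num_batches = max(1, min(num_batches, len(cuts)))
    step = -(-len(cuts) // num_batches)  # ceiling division
    starts = cuts[::step]
    ends = starts[1:] + [upper]
    return list(zip(starts, ends))


def batch_query(path: str, column: str, lo: int, hi: int) -> str:
    """Build the SELECT for one batch. The range filter lets ClickHouse skip
    row groups whose min/max statistics fall outside [lo, hi)."""
    return (
        f"SELECT * FROM file('{path}', Parquet) "
        f"WHERE {column} >= {lo} AND {column} < {hi}"
    )


if __name__ == "__main__":
    # Hypothetical statistics: four row groups of 100 consecutive ids each.
    stats = [(0, 99), (100, 199), (200, 299), (300, 399)]
    for lo, hi in plan_range_batches(stats, num_batches=2):
        print(batch_query("data.parquet", "id", lo, hi))
```

Each generated query can then be handed to a separate worker. Because the ranges are defined on column values rather than on row positions, correctness does not depend on read order, which is what makes this a candidate replacement for `rowNumberInAllBlocks`. Rows with NULL in the chosen column would match none of the ranges and would need a separate batch.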