large data loads - Format-specific variants #59

Closed
tom-clickhouse opened this issue Oct 1, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@tom-clickhouse
Collaborator

The approach taken by the initial version of the data load script is generic, robust, and works for all formats. We can still improve performance by exploiting format-specific knowledge and available metadata to avoid unnecessary reads.

For example, ClickHouse SQL queries can access an exhaustive list of Parquet metadata, which our script could potentially use. Numeric columns in a Parquet file typically carry metadata describing the minimum and maximum values per row group. Since version 23.8, ClickHouse automatically exploits this metadata at query time to speed up queries that filter on numeric columns in Parquet files. Our script could use this as an alternative to rowNumberInAllBlocks, allowing parallel reading within a Parquet file.
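
To make the idea concrete, here is a minimal sketch in plain Python of how per-row-group min/max statistics could drive both pruning and batch planning. It is not the script's actual implementation. `RowGroupStats`, `prune_row_groups`, and `plan_batches` are hypothetical names, and the statistics are assumed to have been fetched beforehand (for instance from the Parquet metadata that ClickHouse exposes).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical per-row-group statistics for one numeric column."""
    index: int         # position of the row group within the file
    num_rows: int      # number of rows stored in the row group
    min_value: float   # smallest value of the column in this row group
    max_value: float   # largest value of the column in this row group


def prune_row_groups(groups: List[RowGroupStats],
                     lower: float, upper: float) -> List[RowGroupStats]:
    """Keep only row groups whose [min, max] range overlaps [lower, upper]."""
    return [g for g in groups if g.max_value >= lower and g.min_value <= upper]


def plan_batches(groups: List[RowGroupStats],
                 max_rows_per_batch: int) -> List[List[int]]:
    """Greedily pack consecutive row groups into batches of bounded size.

    Each batch is a list of row group indexes that one worker could load
    independently of the other batches.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_rows = 0
    for g in groups:
        # Start a new batch when adding this row group would exceed the limit.
        if current and current_rows + g.num_rows > max_rows_per_batch:
            batches.append(current)
            current, current_rows = [], 0
        current.append(g.index)
        current_rows += g.num_rows
    if current:
        batches.append(current)
    return batches
```

A row group is skipped only when its value range cannot overlap the requested range, so pruning never drops matching rows. The batch boundaries fall on row group boundaries, which is what would let several workers read different parts of the same file in parallel.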

@tom-clickhouse tom-clickhouse added the enhancement New feature or request label Oct 1, 2023
@tom-clickhouse tom-clickhouse self-assigned this Oct 1, 2023