Following up on a discussion on Discord with @Tishj, I would like to share some experience with this kind of data type from my prior work, along with feedback on the upcoming variant data type. I previously worked at Datadog on Husky, which is the database underlying many products there. Husky implements something similar to variant as its core data type. Users send logs with any structure, including pathological structures.
Shredding Variant
A pathological structure could be a mistake by the user where they swapped the key and value in their logging code, producing UUID keys that all carry the same constant value. A naive approach to variant shredding allocates one column buffer per field, which in the pathological case means one column buffer per row. Even with smart allocation strategies, such as growing column buffers dynamically instead of starting with a large fixed size, the overhead of that many individual heap allocations will still hurt performance. Even a non-pathological workload in the domain of logging and observability will result in tens of thousands of unique columns over time, so it was worth investing in a solution here for Husky.
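To make the failure mode concrete, here is a minimal Python sketch of naive shredding over flat rows. The function name `naive_shred` and the list-per-field layout are assumptions for illustration only, not how Husky, Parquet, or DuckDB implement shredding.

```python
import uuid


def naive_shred(rows):
    """Shred flat dict rows into one column buffer (a list) per distinct key.

    Hypothetical helper for illustration: every previously unseen key
    allocates a new buffer, back-filled with None for earlier rows.
    """
    columns = {}
    for i, row in enumerate(rows):
        for key, value in row.items():
            # A new key costs a fresh allocation, padded to the current row.
            col = columns.setdefault(key, [None] * i)
            col.append(value)
        # Pad every column that this row did not define.
        for col in columns.values():
            if len(col) <= i:
                col.append(None)
    return columns


# Well-behaved logs: the same two keys on every row.
healthy = [{"level": "info", "service": "api"} for _ in range(1000)]

# Swapped key and value: a unique UUID key per row, constant value.
pathological = [{str(uuid.UUID(int=i)): "request_id"} for i in range(1000)]

healthy_columns = naive_shred(healthy)
pathological_columns = naive_shred(pathological)
print(len(healthy_columns), len(pathological_columns))
```

The well-behaved input ends up with two buffers, while the swapped input ends up with one buffer per row, each of which is almost entirely padding.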
Husky solved this by having a dense and sparse representation for columns, which I spoke about in my talk linked above. That is a somewhat different approach from the one variant takes in both Parquet and DuckDB's native format, but the same logic applies here: shredding every leaf field is not a good use of resources in many circumstances. The obvious answer is to let the user specify a budget, perhaps with a conservative default. How should that budget be allocated? This question is more application-specific, but I think for the default implementation there are three metrics that work a priori for selection:
Column density: how many rows define a value for this field (whether null or non-null)?
Type conflicts: which fields contain no type conflicts, and therefore can benefit from upper/lower bound and NDV statistics?
Column size: how would the size of this column compare if it were shredded and all the regular column compression techniques could apply?
These statistics can be used to make reasonable selections automatically. Potentially an area for research too!
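As a rough illustration of budgeted selection, the Python sketch below collects the three statistics for flat rows and picks the top fields. Everything in it is an assumption made for the example: the names `FieldStats`, `collect_stats`, and `choose_shredded`, the lexicographic ranking (density first, then conflict-free fields, then size), and the use of raw key and value bytes as a crude stand-in for the real shredded-versus-unshredded size comparison.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FieldStats:
    rows_defined: int = 0                      # rows that define the field (null or not)
    types: set = field(default_factory=set)    # distinct non-null value types seen
    raw_bytes: int = 0                         # crude size proxy, not a real estimate


def collect_stats(rows):
    """Gather per-field statistics from flat dict rows."""
    stats = defaultdict(FieldStats)
    for row in rows:
        for path, value in row.items():
            s = stats[path]
            s.rows_defined += 1
            if value is not None:
                s.types.add(type(value).__name__)
            s.raw_bytes += len(path) + len(repr(value))
    return stats


def choose_shredded(rows, budget):
    """Return at most `budget` field paths to shred, best candidates first."""
    stats = collect_stats(rows)
    total = len(rows)

    def score(item):
        _, s = item
        density = s.rows_defined / total
        conflict_free = len(s.types) <= 1
        # Assumed ranking: dense fields first, then conflict-free, then large.
        return (density, conflict_free, s.raw_bytes)

    ranked = sorted(stats.items(), key=score, reverse=True)
    return [path for path, _ in ranked[:budget]]


rows = []
for i in range(100):
    row = {"level": "info", "status": 200 if i % 2 == 0 else "200"}
    if i % 2 == 0:
        row["user"] = "alice"
    rows.append(row)

print(choose_shredded(rows, budget=2))
```

With a budget of two, the sparse `user` field stays unshredded, and the conflict-free `level` field ranks ahead of `status`, which mixes integers and strings. A real implementation would likely want weights or thresholds rather than a strict lexicographic order.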
ClickHouse also has a JSON data type which exposes parameters to control shredding: https://clickhouse.com/docs/sql-reference/data-types/newjson
Merging Variant
Merging shredded variant data is necessary for implementing compaction and vacuuming of files that contain variant. A naive merge of shredded variant data requires buffering all of it in memory. If you instead store the leaf fields sorted in a deterministic order, such as lexicographically by their path from the root, you can do a k-way merge where only one column page from each input needs to be in memory at a time. This is discussed in a blog post about Husky's compaction system.
This may not be possible for Parquet variant, as you do not control the spec, but for DuckDB variant there is still time to implement this!
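A minimal Python sketch of that k-way merge is below. It assumes each input exposes its row count and an iterator of `(path, values)` pairs already sorted by path, and it works on whole columns rather than pages to keep the example short. The function name `merge_shredded` and the input layout are hypothetical, not taken from Husky or DuckDB.

```python
import heapq
from itertools import groupby


def merge_shredded(inputs):
    """K-way merge of shredded inputs whose columns are sorted by path.

    inputs: list of (row_count, iterable of (path, values)), where each
    iterable yields paths in ascending lexicographic order.
    Yields (path, merged_values) in ascending path order, holding at most
    one column per input in memory at a time.
    """
    row_counts = [n for n, _ in inputs]

    def tagged(idx, columns):
        # Tag each column with its input index so rows land in the right place.
        for path, values in columns:
            yield path, idx, values

    streams = [tagged(i, cols) for i, (_, cols) in enumerate(inputs)]
    # heapq.merge pulls lazily, one pending column per input stream.
    merged = heapq.merge(*streams, key=lambda t: (t[0], t[1]))

    for path, group in groupby(merged, key=lambda t: t[0]):
        present = {idx: values for _, idx, values in group}
        out = []
        for idx, n in enumerate(row_counts):
            # Inputs that never defined this path contribute nulls.
            out.extend(present.get(idx, [None] * n))
        yield path, out


file_a = (2, [("a", [1, 2]), ("c", ["x", "y"])])
file_b = (1, [("b", [True]), ("c", ["z"])])

merged_columns = list(merge_shredded([file_a, file_b]))
for path, values in merged_columns:
    print(path, values)
```

Because every input agrees on the path order, each output column can be written out as soon as it is produced, and no input ever has to be revisited.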
Summary
This new data type has a lot of potential, and I'm very excited for it to be released!