Abstract and Lazy Vector representation #396
To expand on this: Vectors should have a separate type that represents the underlying encoding of the value, denoted in a
The vector type changes the internal representation of the vector, but not the logical representation. For example, a vector of type Vectors also receive a Vector types are not fixed in the pipeline; for example, a scan can return vectors of different types for the same column. An example of when this would happen is a column compressed with RLE. If the entire vector has a single value, it seems like a good idea that the scan returns a In the future, the vector types we want to implement are at the very least |
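To make the distinction between internal and logical representation concrete, here is a minimal sketch. The type names in the comment above were lost, so `VectorType::FLAT`, `VectorType::CONSTANT`, `RleRun`, and `ScanRle` are all hypothetical names chosen for illustration, not identifiers from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoding tags (assumed names, not taken from the project).
enum class VectorType { FLAT, CONSTANT };

// Simplified stand-in for a vector of 64-bit integers.
struct Vector {
    VectorType type;
    std::size_t count;          // logical number of values
    std::vector<int64_t> data;  // FLAT: count entries; CONSTANT: exactly one entry

    // The logical value at position i is the same regardless of the encoding.
    int64_t GetValue(std::size_t i) const {
        assert(i < count);
        return type == VectorType::CONSTANT ? data[0] : data[i];
    }
};

// Hypothetical RLE run: a value repeated `length` times.
struct RleRun {
    int64_t value;
    std::size_t length;
};

// Hypothetical scan: a single run covering the whole chunk is emitted as a
// CONSTANT vector, anything else is expanded into a FLAT vector.
Vector ScanRle(const std::vector<RleRun> &runs) {
    Vector result{VectorType::FLAT, 0, {}};
    if (runs.size() == 1) {
        result.type = VectorType::CONSTANT;
        result.count = runs[0].length;
        result.data.push_back(runs[0].value);
        return result;
    }
    for (const RleRun &run : runs) {
        result.data.insert(result.data.end(), run.length, run.value);
        result.count += run.length;
    }
    return result;
}
```

In this sketch the same scan can return either encoding for the same column, and callers that go through `GetValue` see identical logical contents either way.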
As for the deferred execution: vectors can also take the form of a lazy vector. A lazy vector does not hold any data yet, but knows how to obtain that data (e.g. by physically scanning a file or so). Lazy vectors are useful because they provide the advantages of late materialization in selective queries in an easy-to-implement manner. Consider, for example, the following query with a very selective semi join:

SELECT *
FROM lineitem
WHERE l_quantity = (SELECT MAX(l_quantity) FROM lineitem);

The query will only return a few rows, but requires all the columns in the final projection. Using lazy vectors we avoid having to scan and read the unnecessary vectors from disk/memory entirely. Note that lazy vectors are not just another The way I would implement lazy vectors is that vectors have a unique pointer to a

struct MaterializationInfo {
    virtual ~MaterializationInfo() {}
    virtual void Materialize(Vector &target) = 0;
};

For normal vectors this is an empty pointer. For lazy vectors this structure is used to materialize the vector by calling the overloaded Scans will always only emit lazy vectors, and they will be materialized as necessary throughout the plan. |
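The following is a rough sketch of how the pieces described above could fit together. Only `MaterializationInfo` and `Materialize` come from the comment; the simplified `Vector` layout, `IsLazy`, `EnsureMaterialized`, `ColumnScanInfo`, and `MakeLazyVector` are hypothetical names, and an in-memory column stands in for the file scan.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct Vector;

// Interface from the comment above: knows how to fill a vector on demand.
struct MaterializationInfo {
    virtual ~MaterializationInfo() {}
    virtual void Materialize(Vector &target) = 0;
};

// Simplified stand-in for the engine's vector (hypothetical layout).
struct Vector {
    std::vector<int64_t> data;
    std::unique_ptr<MaterializationInfo> lazy_info;  // empty for normal vectors

    bool IsLazy() const { return lazy_info != nullptr; }

    // Hypothetical helper: load the data on first use, then become a normal vector.
    void EnsureMaterialized() {
        if (lazy_info) {
            // Move the info out first so the vector is no longer lazy while it is filled.
            std::unique_ptr<MaterializationInfo> info = std::move(lazy_info);
            info->Materialize(*this);
        }
    }
};

// Hypothetical scan-side implementation: copies a row range out of an
// in-memory column and counts how often a physical read actually happened.
struct ColumnScanInfo : MaterializationInfo {
    const std::vector<int64_t> &column;
    std::size_t offset;
    std::size_t count;
    std::size_t &scan_counter;

    ColumnScanInfo(const std::vector<int64_t> &column, std::size_t offset,
                   std::size_t count, std::size_t &scan_counter)
        : column(column), offset(offset), count(count), scan_counter(scan_counter) {}

    void Materialize(Vector &target) override {
        assert(offset + count <= column.size());
        auto first = column.begin() + static_cast<std::ptrdiff_t>(offset);
        target.data.assign(first, first + static_cast<std::ptrdiff_t>(count));
        ++scan_counter;
    }
};

// Hypothetical scan entry point: emits a lazy vector without reading any data.
Vector MakeLazyVector(const std::vector<int64_t> &column, std::size_t offset,
                      std::size_t count, std::size_t &scan_counter) {
    Vector result;
    result.lazy_info.reset(new ColumnScanInfo(column, offset, count, scan_counter));
    return result;
}
```

An operator that needs the values would call `EnsureMaterialized()` before touching `data`; a vector that is filtered away before anyone asks for it never triggers the read.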
Abstract vector types are implemented now as of #409, lazy vectors remain. |
Lazy vectors are a very effective method to support late/JIT materialization. Significant performance opportunities are likely here. |
Late materialization via lazy vectors appears to be a very high-value performance opportunity. Without late materialization, each additional column in the select clause costs roughly 0.5 to 1.5 seconds at Scale Factor 100 (600 million rows). |
Ideally, including additional columns should incur a cost that is a function of the result set size, f(result set), rather than of the table size, f(rows_in_table). Does the current processing model support filtering first, then projecting second as the default materialization model? For a wide variety of queries, late/lazy materialization is the best default. |
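The "filter first, project second" idea can be sketched as below. This is an illustration only, not how the engine is implemented: `FilterEquals` and `Gather` are hypothetical helpers, and the `values_read` counter exists purely to show that the projection cost tracks the number of selected rows rather than the table size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Phase 1: evaluate the filter on the filter column only and return the
// positions of the matching rows (a selection vector).
std::vector<std::size_t> FilterEquals(const std::vector<int64_t> &column, int64_t needle) {
    std::vector<std::size_t> selection;
    for (std::size_t row = 0; row < column.size(); ++row) {
        if (column[row] == needle) {
            selection.push_back(row);
        }
    }
    return selection;
}

// Phase 2: read a projected column only at the selected positions.
// values_read is incremented once per value fetched from the column.
std::vector<int64_t> Gather(const std::vector<int64_t> &column,
                            const std::vector<std::size_t> &selection,
                            std::size_t &values_read) {
    std::vector<int64_t> result;
    result.reserve(selection.size());
    for (std::size_t row : selection) {
        assert(row < column.size());
        result.push_back(column[row]);
        ++values_read;
    }
    return result;
}
```

Each extra projected column adds one `Gather` call whose work is proportional to `selection.size()`, which is the f(result set) behaviour asked for above.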
Currently everything is early materialization, but late materialization in the form of lazy vectors is on the agenda (as is shown here). Lazy vectors only enable late materialization in the "primary" pipeline, though. For example, if you had a query in the form of:

SELECT COUNT(l_shipdate), COUNT(o_orderdate) FROM lineitem, orders WHERE l_orderkey=o_orderkey;

The |
Materializing o_orderdate in the HT is most efficient across a wide variety of cases. |
I think that the Lazy Vectors and the Vector Volcano model abstraction will support JIT materialization as described here:
Early materialization:
Implementation options on late materialization given f1, f2, fp3, fp4, p5, p6
Two Phase late materialization benefits are conditional in nature, and are useful when:
Two Phase can be 1-3 orders of magnitude less efficient when:
JIT Materialization is close to optimal under a wide variety of filter/selection conditions |
These are less issues than future feature discussions. Closing here. |
Vectors should be able to have differing encodings. Also, the materialisation of vectors should be deferrable.