
Abstract and Lazy Vector representation #396

Closed
hannes opened this issue Jan 22, 2020 · 10 comments

@hannes (Member) commented Jan 22, 2020

Vectors should be able to have differing encodings, and the materialisation of vectors should be deferrable.

@Mytherin (Collaborator)

To expand on this:

Vectors should have a separate type that represents the underlying encoding of the values, denoted by a VectorType enum. Initially we implement the following two types:

  • FLAT_VECTOR: A regular uncompressed ("flat") vector
  • CONSTANT_VECTOR: A single constant element

The vector type changes the internal representation of the vector, but not the logical representation. For example, a vector of type CONSTANT_VECTOR can logically have up to 1024 entries and hold a selection vector; however, the internal representation of this vector will always be a single constant value.

Vectors also receive a Flatten (or Normalize, the name is currently undecided) method that turns the vector from its current vector type into a FLAT_VECTOR. For the CONSTANT_VECTOR, for example, that means duplicating the constant value along its selection vector and for its current count. When a function that does not have specialized code for dealing with a specific encoding is encountered, it will call this method and turn the vector into a flat vector, effectively decompressing it.
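The VectorType/Flatten idea can be sketched as follows. This is a toy Python illustration of the mechanism only; the class names and fields are invented here and are not DuckDB's actual API:

```python
# Toy sketch of the proposed VectorType idea: a CONSTANT vector stores a
# single value internally but logically represents `count` entries;
# Flatten() materializes it into a regular uncompressed FLAT vector.
from enum import Enum, auto

class VectorType(Enum):
    FLAT_VECTOR = auto()
    CONSTANT_VECTOR = auto()

class Vector:
    def __init__(self, vtype, data, count):
        self.type = vtype
        self.data = data    # FLAT: list of `count` values; CONSTANT: one value
        self.count = count  # logical length, independent of the encoding

    def flatten(self):
        # Turn any encoding into an uncompressed FLAT vector; operators
        # without specialized code for an encoding fall back to this.
        if self.type == VectorType.CONSTANT_VECTOR:
            self.data = [self.data] * self.count  # duplicate the constant
            self.type = VectorType.FLAT_VECTOR
        return self

v = Vector(VectorType.CONSTANT_VECTOR, 42, 4)
v.flatten()
print(v.type, v.data)  # VectorType.FLAT_VECTOR [42, 42, 42, 42]
```

Note how the internal storage stays a single value until an operator actually needs the flat form, which is where the compression win comes from.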

Vector types are not fixed in the pipeline; for example, a scan can return vectors of different types for the same column. This can happen on a column compressed with RLE: if the entire vector holds a single value, the scan can return a CONSTANT_VECTOR; otherwise, it can return a vector of type RLE_VECTOR.

In the future, the vector types we want to implement are at the very least RLE_VECTOR, DICTIONARY_VECTOR and BIAS_VECTOR (Frame of Reference). These types will be gradually rolled out.
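The per-vector type choice described above can be sketched as a small decision function. This is a hypothetical illustration (the function name and run representation are invented for this sketch):

```python
# Toy sketch: a scan over an RLE-compressed column emits a CONSTANT vector
# when a single run covers the whole vector, and an RLE vector otherwise.
# `runs` is a list of (value, run_length) pairs covering the next
# `vector_size` rows of the column.
def scan_rle(runs, vector_size):
    if len(runs) == 1 and runs[0][1] >= vector_size:
        # One run spans the entire vector: emit its value as a constant.
        return ("CONSTANT_VECTOR", runs[0][0])
    # Multiple runs: keep the compressed run representation.
    return ("RLE_VECTOR", runs)

print(scan_rle([(7, 1024)], 1024))       # ('CONSTANT_VECTOR', 7)
print(scan_rle([(1, 10), (2, 5)], 15))   # ('RLE_VECTOR', [(1, 10), (2, 5)])
```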

@Mytherin (Collaborator) commented Jan 22, 2020

As for the deferred execution: vectors can also take the form of a lazy vector. A lazy vector does not hold any data yet, but knows how to obtain that data (e.g., by physically scanning a file). Lazy vectors are useful because they provide the advantages of late materialization in selective queries in an easy-to-implement manner.

Consider, for example, the following query with a very selective semi join:

SELECT *
FROM lineitem
WHERE l_quantity = (SELECT MAX(l_quantity) FROM lineitem);

The query will only return a few rows, but requires all the columns in the final projection. Using lazy vectors we avoid scanning and reading the unnecessary vectors from disk or memory entirely.

Note that lazy vectors are not just another VectorType: a lazy vector can be materialized into any type of vector, not only into a FLAT_VECTOR. Hence the materialization of a lazy vector should occur before the vector type is checked in an operation.

The way I would implement lazy vectors is that vectors have a unique pointer to a MaterializationInfo structure that looks as follows:

struct MaterializationInfo {
    virtual ~MaterializationInfo() {}

    virtual void Materialize(Vector &target) = 0;
};

For normal vectors this pointer is empty. For lazy vectors this structure is used to materialize the vector by calling the overloaded Materialize method (i.e., by calling info->Materialize(*this)). By making this an object instead of a plain function pointer, we can also attach any information needed to materialize the vector.

Scans will always only emit lazy vectors, and they will be materialized as necessary throughout the plan.
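The MaterializationInfo protocol sketched in the C++ snippet above can be mirrored in a toy Python version. The subclass name and the callback-based "scan" are invented for illustration; only the overall shape follows the proposal:

```python
# Toy Python mirror of the MaterializationInfo idea: a lazy vector carries
# an info object that knows how to fetch its data; eager vectors carry None.
class MaterializationInfo:
    def materialize(self, target):
        raise NotImplementedError  # corresponds to the pure virtual method

class ScanMaterializationInfo(MaterializationInfo):
    # Hypothetical subclass: holds whatever is needed to perform the scan.
    # Here the physical read is simulated by a callable.
    def __init__(self, read_column):
        self.read_column = read_column

    def materialize(self, target):
        target.data = self.read_column()  # perform the deferred read
        target.info = None                # the vector is no longer lazy

class Vector:
    def __init__(self, info=None):
        self.data = None
        self.info = info  # None for normal (eagerly materialized) vectors

    def ensure_materialized(self):
        # Must run before an operator inspects the vector's type/contents.
        if self.info is not None:
            self.info.materialize(self)

# A scan emits a lazy vector; no I/O happens until a consumer needs the data.
v = Vector(ScanMaterializationInfo(lambda: [1, 2, 3]))
v.ensure_materialized()
print(v.data)  # [1, 2, 3]
```

The design choice of an object rather than a bare function pointer shows up here as the subclass carrying its own state (the `read_column` callable) alongside the materialize behaviour.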

@Mytherin (Collaborator) commented Feb 4, 2020

Abstract vector types are implemented now as of #409, lazy vectors remain.

@jtommaney1

Lazy vectors are a very effective method to support late/JIT materialization.

Significant performance opportunities are likely here.

@jtommaney1

Late materialization via lazy vectors appears to be a very high value performance opportunity.

Without late materialization:

Each additional column in the select clause incurs roughly 0.5 to 1.5 seconds at Scale Factor 100 (600 million rows).

import time
from timeit import default_timer as timer

# `cur` is a DuckDB cursor opened against the SF100 database (setup not shown).
for sql in [
    "select count(l_shipdate) from lineitem where l_suppkey = 100000 and l_partkey = 100000",
    "select count(l_shipdate), count(l_discount) from lineitem where l_suppkey = 100000 and l_partkey = 100000",
    "select count(l_shipdate), count(l_discount),count(l_extendedprice) from lineitem where l_suppkey = 100000 and l_partkey = 100000",
    "select count(l_shipdate), count(l_discount),count(l_extendedprice),count(l_quantity) from lineitem where l_suppkey = 100000 and l_partkey = 100000",
]:
    print(sql)
    for DoP in [1, 2, 4, 8]:
        time.sleep(1)
        min_ela = 1000000
        max_ela = 0
        sum_ela = 0
        cur.execute("PRAGMA threads=" + str(DoP))
        for execs in range(0, 3):
            start = timer()
            result = cur.execute(sql).fetchall()
            ela = timer() - start
            min_ela = round(min(ela, min_ela), 3)
            max_ela = round(max(ela, max_ela), 3)
            sum_ela = sum_ela + ela
        print('DoP=' + str(DoP) + ' elapsed= ' + str(round(sum_ela / (execs + 1), 3)).ljust(5, '0')
              + ' (' + str(min_ela).ljust(5, '0') + '-' + str(max_ela).ljust(5, '0')
              + ') ' + str(result) + ' ' + ''.ljust(int(ela * 10), '*'))
select count(l_shipdate) from lineitem where l_suppkey = 100000 and l_partkey = 100000
DoP=1 elapsed= 3.730 (3.682-3.814) [(0,)] ************************************
DoP=2 elapsed= 4.272 (4.234-4.296) [(0,)] ******************************************
DoP=4 elapsed= 4.144 (4.131-4.157) [(0,)] *****************************************
DoP=8 elapsed= 4.809 (4.805-4.812) [(0,)] ************************************************
select count(l_shipdate), count(l_discount) from lineitem where l_suppkey = 100000 and l_partkey = 100000
DoP=1 elapsed= 4.242 (4.192-4.327) [(0, 0)] *****************************************
DoP=2 elapsed= 5.163 (5.160-5.169) [(0, 0)] ***************************************************
DoP=4 elapsed= 5.294 (5.275-5.306) [(0, 0)] ****************************************************
DoP=8 elapsed= 6.319 (6.266-6.347) [(0, 0)] ***************************************************************
select count(l_shipdate), count(l_discount),count(l_extendedprice) from lineitem where l_suppkey = 100000 and l_partkey = 100000
DoP=1 elapsed= 4.747 (4.700-4.831) [(0, 0, 0)] **********************************************
DoP=2 elapsed= 5.941 (5.909-5.978) [(0, 0, 0)] ***********************************************************
DoP=4 elapsed= 6.281 (6.061-6.485) [(0, 0, 0)] ************************************************************
DoP=8 elapsed= 7.759 (7.747-7.780) [(0, 0, 0)] *****************************************************************************
select count(l_shipdate), count(l_discount),count(l_extendedprice),count(l_quantity) from lineitem where l_suppkey = 100000 and l_partkey = 100000
DoP=1 elapsed= 5.212 (5.166-5.295) [(0, 0, 0, 0)] ***************************************************
DoP=2 elapsed= 6.722 (6.668-6.751) [(0, 0, 0, 0)] ******************************************************************
DoP=4 elapsed= 7.584 (7.564-7.604) [(0, 0, 0, 0)] ***************************************************************************
DoP=8 elapsed= 9.392 (9.383-9.402) [(0, 0, 0, 0)] **********************************************************************************************

@jtommaney1

Ideally, including additional columns should pay a cost that is f(result_set) rather than f(rows_in_table).

Does the current processing model support filtering first, then projecting second as the default materialization model?

For a wide variety of queries, late/lazy materialization is the best default.

@Mytherin (Collaborator)

Currently everything is early materialization, but late materialization in the form of lazy vectors is on the agenda (as is shown here). Lazy vectors only enable late materialization in the "primary" pipeline, though. For example, if you had a query in the form of:

SELECT COUNT(l_shipdate), COUNT(o_orderdate) FROM lineitem, orders WHERE l_orderkey=o_orderkey;

The o_orderdate column will be fully materialized in the hash table that is built for the hash join. The l_shipdate column (on the probe side) will be lazily materialized for matches that are found in the hash table.

@jtommaney1

Materializing o_orderdate in the HT is most efficient across a wide variety of cases.

@jtommaney1

I think that the Lazy Vectors and the Vector Volcano model abstraction will support JIT materialization as described here.

Abbreviations:

  • f* - Filter Column
  • p* - Projection Column
  • fp - Filter and Project
  • CS - ColumnScan
  • AF - Apply Filter
  • OS - OptimizedScan: access that may be modified by prior selectivity:
      - skip scan when 0 rows in chunk from previous access
      - random access of "few" rows in chunk from previous access (PRAGMA-driven)
      - ColumnScan with RID intersect

Early materialization:
  CS(o.fp1, o.p2) -> AF -> in-memory HT
  CS(l.f1, l.p2, l.p3) -> AF -> probe
Late materialization:
  CS(o.fp1) -> AF -> OS(o.p2) -> in-memory HT
  CS(l.f1) -> AF -> OS(l.p2, l.p3) -> probe

Implementation options for late materialization given f1, f2, fp3, fp4, p5, p6:
  • Two-phase: CS(f1, f2, fp3, fp4) -> AF -> OS(p5, p6)
  • Three-phase: CS(f1, f2) -> AF -> OS(fp3, fp4) -> AF -> OS(p5, p6)
  • JIT materialization: CS(f1) -> AF -> OS(f2) -> AF -> OS(fp3) -> AF -> OS(fp4) -> AF -> OS(p5, p6)

Two-phase late materialization's benefits are conditional in nature; it is useful when:

  • filters are not restrictive
  • data skipping is very restrictive (e.g., loaded by date, filtered by a narrow date range)
  • the ratio of projection columns to filter columns is high

Two-phase can be 1-3 orders of magnitude less efficient when:

  • there are restrictive filters not covered by data skipping (a common query pattern)

JIT materialization is close to optimal under a wide variety of filter/selection conditions.
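The OptimizedScan decision described above can be sketched as a small per-chunk dispatcher. The threshold and function names are invented for this sketch (the proposal says the "few rows" cutoff would be PRAGMA-driven), and in Python the three access modes only differ by label, but the control flow shows the idea:

```python
# Toy sketch of OptimizedScan (OS): after a prior filter, each chunk is
# either skipped (0 survivors), randomly accessed (few survivors), or
# column-scanned with a row-ID intersect (many survivors).
FEW_ROWS_THRESHOLD = 8  # hypothetical; PRAGMA-driven in the proposal

def optimized_scan(chunk, selected_rows):
    """Return (access_mode, values) for one chunk of a projection column.

    chunk: the column's values for this chunk
    selected_rows: row offsets that survived the previous filter
    """
    if not selected_rows:
        return "skip", []  # skip scan: nothing survived, no I/O at all
    if len(selected_rows) <= FEW_ROWS_THRESHOLD:
        # Random access: point-read just the surviving rows.
        return "random", [chunk[i] for i in selected_rows]
    # Column scan with RID intersect: read the chunk, keep matching rows.
    rids = set(selected_rows)
    return "scan", [v for i, v in enumerate(chunk) if i in rids]

print(optimized_scan(list(range(100)), [])[0])      # skip
print(optimized_scan(list(range(100)), [3, 7]))     # ('random', [3, 7])
```

Chaining one such call per column after each ApplyFilter step is what the JIT materialization pipeline above amounts to: each later column pays a cost proportional to the survivors so far, not to the table size.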

@hannes (Member, Author) commented Apr 30, 2022

These are less issues than future feature discussions. Closing here.

hannes closed this as completed Apr 30, 2022