
Abstract and Lazy Vector representation #396

Closed
hannes opened this issue Jan 22, 2020 · 10 comments

@hannes (Member) commented Jan 22, 2020

Vectors should be able to have differing encodings, and the materialisation of vectors should be deferrable.

@Mytherin (Collaborator)

To expand on this:

Vectors should have a separate type that represents the underlying encoding of the values, denoted by a VectorType enum. Initially we implement the following two types:

  • FLAT_VECTOR: A regular uncompressed ("flat") vector
  • CONSTANT_VECTOR: A single constant element

The vector type changes the internal representation of the vector, but not the logical representation. For example, a vector of type CONSTANT_VECTOR can logically have up to 1024 entries and hold a selection vector; however, the internal representation of this vector will always be a single constant value.

Vectors also receive a Flatten (or Normalize, the name is currently undecided) method that turns the vector from its current vector type into a FLAT_VECTOR. For the CONSTANT_VECTOR, for example, that means duplicating the constant value along its selection vector and for its current count. When a function that does not have specialized code for dealing with a specific encoding is encountered, it will call this method and turn the vector into a flat vector, effectively decompressing it.
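The VectorType/Flatten idea can be sketched as follows. This is a toy Python illustration of the mechanism only; the class names and fields are invented here and are not DuckDB's actual API:

```python
# Toy sketch of the proposed VectorType idea: a CONSTANT vector stores a
# single value internally but logically represents `count` entries;
# Flatten() materializes it into a regular uncompressed FLAT vector.
from enum import Enum, auto

class VectorType(Enum):
    FLAT_VECTOR = auto()
    CONSTANT_VECTOR = auto()

class Vector:
    def __init__(self, vtype, data, count):
        self.type = vtype
        self.data = data    # FLAT: list of `count` values; CONSTANT: one value
        self.count = count  # logical length, independent of the encoding

    def flatten(self):
        # Turn any encoding into an uncompressed FLAT vector; operators
        # without specialized code for an encoding fall back to this.
        if self.type == VectorType.CONSTANT_VECTOR:
            self.data = [self.data] * self.count  # duplicate the constant
            self.type = VectorType.FLAT_VECTOR
        return self

v = Vector(VectorType.CONSTANT_VECTOR, 42, 4)
v.flatten()
print(v.type, v.data)  # VectorType.FLAT_VECTOR [42, 42, 42, 42]
```

Note how the internal storage stays a single value until an operator actually needs the flat form, which is where the compression win comes from.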

Vector types are not fixed in the pipeline; for example, a scan can return vectors of different types for the same column. This can happen on a column compressed with RLE: if the entire vector holds a single value, the scan can return a CONSTANT_VECTOR; otherwise, it can return a vector of type RLE_VECTOR.

In the future, the vector types we want to implement are at the very least RLE_VECTOR, DICTIONARY_VECTOR and BIAS_VECTOR (Frame of Reference). These types will be gradually rolled out.
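The per-vector type choice described above can be sketched as a small decision function. This is a hypothetical illustration (the function name and run representation are invented for this sketch):

```python
# Toy sketch: a scan over an RLE-compressed column emits a CONSTANT vector
# when a single run covers the whole vector, and an RLE vector otherwise.
# `runs` is a list of (value, run_length) pairs covering the next
# `vector_size` rows of the column.
def scan_rle(runs, vector_size):
    if len(runs) == 1 and runs[0][1] >= vector_size:
        # One run spans the entire vector: emit its value as a constant.
        return ("CONSTANT_VECTOR", runs[0][0])
    # Multiple runs: keep the compressed run representation.
    return ("RLE_VECTOR", runs)

print(scan_rle([(7, 1024)], 1024))       # ('CONSTANT_VECTOR', 7)
print(scan_rle([(1, 10), (2, 5)], 15))   # ('RLE_VECTOR', [(1, 10), (2, 5)])
```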

@Mytherin (Collaborator) commented Jan 22, 2020

As for the deferred execution: vectors can also take the form of a lazy vector. A lazy vector does not hold any data yet, but knows how to obtain that data (e.g., by physically scanning a file). Lazy vectors are useful because they provide the advantages of late materialization in selective queries in an easy-to-implement manner.

Consider, for example, the following query with a very selective semi join:

SELECT *
FROM lineitem
WHERE l_quantity = (SELECT MAX(l_quantity) FROM lineitem);

The query will only return a few rows, but requires all the columns in the final projection. Using lazy vectors we avoid scanning and reading the unnecessary vectors from disk or memory entirely.

Note that lazy vectors are not just another VectorType: a lazy vector can be materialized into any type of vector, not only into a FLAT_VECTOR. Hence the materialization of a lazy vector should occur before the vector type is checked in an operation.

The way I would implement lazy vectors is that vectors have a unique pointer to a MaterializationInfo structure that looks as follows:

struct MaterializationInfo {
    virtual ~MaterializationInfo() {}

    virtual void Materialize(Vector &target) = 0;
};

For normal vectors this pointer is empty. For lazy vectors this structure is used to materialize the vector by calling the overloaded Materialize method (i.e., by calling info->Materialize(*this)). By making this an object instead of a plain function pointer, we can also attach any information needed to materialize the vector.

Scans will always only emit lazy vectors, and they will be materialized as necessary throughout the plan.
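The MaterializationInfo protocol sketched in the C++ snippet above can be mirrored in a toy Python version. The subclass name and the callback-based "scan" are invented for illustration; only the overall shape follows the proposal:

```python
# Toy Python mirror of the MaterializationInfo idea: a lazy vector carries
# an info object that knows how to fetch its data; eager vectors carry None.
class MaterializationInfo:
    def materialize(self, target):
        raise NotImplementedError  # corresponds to the pure virtual method

class ScanMaterializationInfo(MaterializationInfo):
    # Hypothetical subclass: holds whatever is needed to perform the scan.
    # Here the physical read is simulated by a callable.
    def __init__(self, read_column):
        self.read_column = read_column

    def materialize(self, target):
        target.data = self.read_column()  # perform the deferred read
        target.info = None                # the vector is no longer lazy

class Vector:
    def __init__(self, info=None):
        self.data = None
        self.info = info  # None for normal (eagerly materialized) vectors

    def ensure_materialized(self):
        # Must run before an operator inspects the vector's type/contents.
        if self.info is not None:
            self.info.materialize(self)

# A scan emits a lazy vector; no I/O happens until a consumer needs the data.
v = Vector(ScanMaterializationInfo(lambda: [1, 2, 3]))
v.ensure_materialized()
print(v.data)  # [1, 2, 3]
```

The design choice of an object rather than a bare function pointer shows up here as the subclass carrying its own state (the `read_column` callable) alongside the materialize behaviour.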

@Mytherin (Collaborator) commented Feb 4, 2020

Abstract vector types are implemented now as of #409, lazy vectors remain.

@jtommaney1

Lazy vectors are a very effective method to support late/JIT materialization.

Significant performance opportunities are likely here.

@jtommaney1

Late materialization via lazy vectors appears to be a very high value performance opportunity.

Without late materialization:

Each additional column in the select clause incurs roughly 0.5 to 1.5 seconds at Scale Factor 100 (600 million rows).

import time
from timeit import default_timer as timer

# `cur` is a DuckDB cursor opened against the SF100 database (setup not shown).
for sql in [
    "select count(l_shipdate) from lineitem where l_suppkey = 100000 and l_partkey = 100000",
    "select count(l_shipdate), count(l_discount) from lineitem where l_suppkey = 100000 and l_partkey = 100000",
    "select count(l_shipdate), count(l_discount),count(l_extendedprice) from lineitem where l_suppkey = 100000 and l_partkey = 100000",
    "select count(l_shipdate), count(l_discount),count(l_extendedprice),count(l_quantity) from lineitem where l_suppkey = 100000 and l_partkey = 100000",
]:
    print(sql)
    for DoP in [1, 2, 4, 8]:
        time.sleep(1)
        min_ela = 1000000
        max_ela = 0
        sum_ela = 0
        cur.execute("PRAGMA threads=" + str(DoP))
        for execs in range(0, 3):
            start = timer()
            result = cur.execute(sql).fetchall()
            ela = timer() - start
            min_ela = round(min(ela, min_ela), 3)
            max_ela = round(max(ela, max_ela), 3)
            sum_ela = sum_ela + ela
        print('DoP=' + str(DoP) + ' elapsed= ' + str(round(sum_ela / (execs + 1), 3)).ljust(5, '0')
              + ' (' + str(min_ela).ljust(5, '0') + '-' + str(max_ela).ljust(5, '0')
              + ') ' + str(result) + ' ' + ''.ljust(int(ela * 10), '*'))
select count(l_shipdate) from lineitem where l_suppkey = 100000 and l_partkey = 100000
DoP=1 elapsed= 3.730 (3.682-3.814) [(0,)] ************************************
DoP=2 elapsed= 4.272 (4.234-4.296) [(0,)] ******************************************
DoP=4 elapsed= 4.144 (4.131-4.157) [(0,)] *****************************************
DoP=8 elapsed= 4.809 (4.805-4.812) [(0,)] ************************************************
select count(l_shipdate), count(l_discount) from lineitem where l_suppkey = 100000 and l_partkey = 100000
DoP=1 elapsed= 4.242 (4.192-4.327) [(0, 0)] *****************************************
DoP=2 elapsed= 5.163 (5.160-5.169) [(0, 0)] ***************************************************
DoP=4 elapsed= 5.294 (5.275-5.306) [(0, 0)] ****************************************************
DoP=8 elapsed= 6.319 (6.266-6.347) [(0, 0)] ***************************************************************
select count(l_shipdate), count(l_discount),count(l_extendedprice) from lineitem where l_suppkey = 100000 and l_partkey = 100000
DoP=1 elapsed= 4.747 (4.700-4.831) [(0, 0, 0)] **********************************************
DoP=2 elapsed= 5.941 (5.909-5.978) [(0, 0, 0)] ***********************************************************
DoP=4 elapsed= 6.281 (6.061-6.485) [(0, 0, 0)] ************************************************************
DoP=8 elapsed= 7.759 (7.747-7.780) [(0, 0, 0)] *****************************************************************************
select count(l_shipdate), count(l_discount),count(l_extendedprice),count(l_quantity) from lineitem where l_suppkey = 100000 and l_partkey = 100000
DoP=1 elapsed= 5.212 (5.166-5.295) [(0, 0, 0, 0)] ***************************************************
DoP=2 elapsed= 6.722 (6.668-6.751) [(0, 0, 0, 0)] ******************************************************************
DoP=4 elapsed= 7.584 (7.564-7.604) [(0, 0, 0, 0)] ***************************************************************************
DoP=8 elapsed= 9.392 (9.383-9.402) [(0, 0, 0, 0)] **********************************************************************************************

@jtommaney1

Ideally, including additional columns should pay a cost that is f(result_set) rather than f(rows_in_table).

Does the current processing model support filtering first, then projecting second as the default materialization model?

For a wide variety of queries, late/lazy materialization is the best default.

@Mytherin (Collaborator)

Currently everything is early materialization, but late materialization in the form of lazy vectors is on the agenda (as is shown here). Lazy vectors only enable late materialization in the "primary" pipeline, though. For example, if you had a query in the form of:

SELECT COUNT(l_shipdate), COUNT(o_orderdate) FROM lineitem, orders WHERE l_orderkey=o_orderkey;

The o_orderdate column will be fully materialized in the hash table that is built for the hash join. The l_shipdate column (on the probe side) will be lazily materialized for matches that are found in the hash table.

@jtommaney1

Materializing o_orderdate in the HT is most efficient across a wide variety of cases.

@jtommaney1

I think that the Lazy Vectors and the Vector Volcano model abstraction will support JIT materialization as described here.

Abbreviations:

  • f* - Filter Column
  • p* - Projection Column
  • fp - Filter and Project
  • CS - ColumnScan
  • AF - Apply Filter
  • OS - OptimizedScan: access that may be modified by prior selectivity:
      - skip scan when 0 rows in chunk from previous access
      - random access of "few" rows in chunk from previous access (PRAGMA-driven)
      - ColumnScan with RID intersect

Early materialization:
  CS(o.fp1, o.p2) -> AF -> in-memory HT
  CS(l.f1, l.p2, l.p3) -> AF -> probe
Late materialization:
  CS(o.fp1) -> AF -> OS(o.p2) -> in-memory HT
  CS(l.f1) -> AF -> OS(l.p2, l.p3) -> probe

Implementation options for late materialization given f1, f2, fp3, fp4, p5, p6:
  • Two-phase: CS(f1, f2, fp3, fp4) -> AF -> OS(p5, p6)
  • Three-phase: CS(f1, f2) -> AF -> OS(fp3, fp4) -> AF -> OS(p5, p6)
  • JIT materialization: CS(f1) -> AF -> OS(f2) -> AF -> OS(fp3) -> AF -> OS(fp4) -> AF -> OS(p5, p6)

Two-phase late materialization's benefits are conditional in nature; it is useful when:

  • filters are not restrictive
  • data skipping is very restrictive (e.g., loaded by date, filtered by a narrow date range)
  • the ratio of projection columns to filter columns is high

Two-phase can be 1-3 orders of magnitude less efficient when:

  • there are restrictive filters not covered by data skipping (a common query pattern)

JIT materialization is close to optimal under a wide variety of filter/selection conditions.
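The OptimizedScan decision described above can be sketched as a small per-chunk dispatcher. The threshold and function names are invented for this sketch (the proposal says the "few rows" cutoff would be PRAGMA-driven), and in Python the three access modes only differ by label, but the control flow shows the idea:

```python
# Toy sketch of OptimizedScan (OS): after a prior filter, each chunk is
# either skipped (0 survivors), randomly accessed (few survivors), or
# column-scanned with a row-ID intersect (many survivors).
FEW_ROWS_THRESHOLD = 8  # hypothetical; PRAGMA-driven in the proposal

def optimized_scan(chunk, selected_rows):
    """Return (access_mode, values) for one chunk of a projection column.

    chunk: the column's values for this chunk
    selected_rows: row offsets that survived the previous filter
    """
    if not selected_rows:
        return "skip", []  # skip scan: nothing survived, no I/O at all
    if len(selected_rows) <= FEW_ROWS_THRESHOLD:
        # Random access: point-read just the surviving rows.
        return "random", [chunk[i] for i in selected_rows]
    # Column scan with RID intersect: read the chunk, keep matching rows.
    rids = set(selected_rows)
    return "scan", [v for i, v in enumerate(chunk) if i in rids]

print(optimized_scan(list(range(100)), [])[0])      # skip
print(optimized_scan(list(range(100)), [3, 7]))     # ('random', [3, 7])
```

Chaining one such call per column after each ApplyFilter step is what the JIT materialization pipeline above amounts to: each later column pays a cost proportional to the survivors so far, not to the table size.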

@hannes (Member, Author) commented Apr 30, 2022

These are less issues than future feature discussions. Closing here.

hannes closed this as completed Apr 30, 2022