Add support to write bucketed (but not partitioned) tables #9740

Closed

Conversation

@aditi-pandit (Collaborator) commented May 7, 2024

The Velox HiveConnector supports writing bucketed files only when they are partitioned as well. This leaves a feature gap with respect to Presto.

Presto behavior (for bucketed but not partitioned):

  • Supports CTAS into bucketed (but not partitioned) tables.
  • Cannot append/overwrite to existing bucketed tables (though can append to TEMPORARY ones).

CTAS into bucketed tables has become important because such tables are used for CTEs (WITH clauses).
Note: This PR handles only the CTAS case; there will be a separate PR for TEMPORARY tables. See prestodb/presto#19744 and prestodb/presto#22630.

### Background

#### TableWriter and TableFinish

Presto uses TableWriter plan nodes to perform the write operations. The TableWriter nodes run on the workers and write the input rows into data files (in a staging directory, before moving them to a target directory). The TableWriter node works in conjunction with a TableCommit node on the coordinator. The TableCommit node (TableFinishOperator) does the final renaming of the target directory and the commit to the metastore.
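
As a rough illustration of that two-step flow, the sketch below writes a file into a staging directory and then renames the staging directory into the target location. The paths and file name are made up for the example, and the real TableWriter/TableFinish logic (including the metastore commit) is far more involved.

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>

namespace fs = std::filesystem;

int main() {
  // Hypothetical paths; TableWriter writes data files under a staging directory.
  const fs::path staging = "/tmp/query_staging/lineitem_bucketed2";
  const fs::path target = "/tmp/warehouse/lineitem_bucketed2";

  fs::create_directories(staging);
  std::ofstream(staging / "000000_0_example") << "row data\n";

  // TableFinish-style commit: move the staged directory to the target path.
  fs::create_directories(target.parent_path());
  fs::remove_all(target); // Clear any previous demo output.
  fs::rename(staging, target);

  std::cout << "committed files to " << target << "\n";
  return 0;
}
```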

It is important to note that plans for bucketed tables involve a LocalExchange that brings all the data to a single driver for the TableWriter, so that it can bucket and write the data appropriately.

```
EXPLAIN CREATE TABLE lineitem_bucketed2(orderkey, partkey, suppkey, linenumber, quantity, ds) WITH (bucket_count = 10, bucketed_by = ARRAY['orderkey'], sorted_by = ARRAY['orderkey']) AS SELECT orderkey, partkey, suppkey, linenumber, quantity, '2021-12-20' FROM tpch.tiny.lineitem;
```

Plan with TableWriter and TableCommit nodes. Note the LocalExchange moving all data to a single driver.

```
- Output[PlanNodeId 7]
     - TableCommit[PlanNodeId 5][Optional[hive.tpch_bucketed.lineitem_bucketed2]] => [rows_23:bigint]
         - RemoteStreamingExchange[PlanNodeId 299][GATHER] => [rows:bigint, fragments:varbinary, commitcontext:varbinary]
             - TableWriter[PlanNodeId 6] => [rows:bigint, fragments:varbinary, commitcontext:varbinary]
                     orderkey := orderkey (1:194)  partkey := partkey (1:204) suppkey := suppkey (1:213) linenumber := linenumber (1:222) quantity := quantity (1:234) ds := expr (1:244)
                 - LocalExchange[PlanNodeId 330][SINGLE] () => [orderkey:bigint, partkey:bigint, suppkey:bigint, linenumber:integer, quantity:double, expr:varchar(10)] >
                         - RemoteStreamingExchange[PlanNodeId 298][REPARTITION] => [orderkey:bigint, partkey:bigint, suppkey:bigint, linenumber:integer, quantity:double, expr:varcha>
                              - ScanProject[PlanNodeId 0,187][table = TableHandle {connectorId='tpch', connectorHandle='lineitem:sf0.01', layout='Optional[lineitem:sf0.01]'}, project>
                                 expr := VARCHAR'2021-12-20' suppkey := tpch:suppkey (1:262) partkey := tpch:partkey (1:262) linenumber := tpch:linenumber (1:262) orderkey := tpch:orderkey (1:262) quantity := tpch:quantity (1:262)
```

The above command creates 10 files (the bucket count), as follows.

```
Aditis-MacBook-Pro:lineitem_bucketed aditipandit$ pwd
${DATA_DIR}/hive_data/tpch/lineitem_bucketed

Aditis-MacBook-Pro:lineitem_bucketed2 aditipandit$ ls
000000_0_20240507_221727_00018_73r2r
000003_0_20240507_221727_00018_73r2r
000006_0_20240507_221727_00018_73r2r
000009_0_20240507_221727_00018_73r2r
000001_0_20240507_221727_00018_73r2r
000004_0_20240507_221727_00018_73r2r
000007_0_20240507_221727_00018_73r2r
000002_0_20240507_221727_00018_73r2r
000005_0_20240507_221727_00018_73r2r
000008_0_20240507_221727_00018_73r2r
```
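
The zero-padded prefix of each file name is the bucket number. Below is a minimal sketch of that naming convention, assuming the `000000_0_<id>` pattern seen above; the suffix is just the id from the listing, not how Velox actually derives it.

```cpp
#include <cstdio>
#include <string>

// Builds a Hive-style bucket file name such as "000003_0_<id>".
// Illustrative only; HiveDataSink's real naming logic may differ.
std::string bucketFileName(int bucket, const std::string& id) {
  char prefix[16];
  std::snprintf(prefix, sizeof(prefix), "%06d_0_", bucket);
  return prefix + id;
}

int main() {
  for (int bucket = 0; bucket < 10; ++bucket) {
    std::puts(bucketFileName(bucket, "20240507_221727_00018_73r2r").c_str());
  }
  return 0;
}
```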

#### TableWriter output

The TableWriter output has three columns, with one fragment per individual target file. The format is presented here for completeness.
**There are no special changes for bucketed tables here. The only important difference is that the writePath/targetPath does not contain a partition directory.**

TableWriter output row type: `ROW<rows:BIGINT, fragments:VARBINARY, commitcontext:VARBINARY>`

| Rows | Fragments | CommitContext |
|--------|--------|--------|
| N (numPartitionUpdates) | NULL | TaskCommitContext |
| NULL | PartitionUpdate0 | |
| NULL | PartitionUpdate1 | |
| NULL | ... | |
| NULL | PartitionUpdateN | |

The fragments column contains JSON strings of PartitionUpdate in the following format:

```
{
  "Name": "ds=2022-08-06/partition=events_pcp_product_finder_product_similartiy__groupby__999999998000212604",
  "updateMode": "NEW",
  "writePath": "",
  "targetPath": "",
  "fileWriteInfos": [
    { "writeFileName": "", "targetFileName": "", "fileSize": 3517346970 },
    { "writeFileName": "", "targetFileName": "", "fileSize": 4314798687 }
  ],
  "rowCount": 3950431150,
  "inMemoryDataSizeInBytes": 4992001194927,
  "onDiskDataSizeInBytes": 1374893372141,
  "containsNumberedFileNames": false
}
```
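
For illustration only, here is a small sketch of how a consumer of the fragments column might skip the NULL cells and decode a PartitionUpdate. Using nlohmann/json is an assumption made for the example, not the library Presto or Velox actually use.

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <nlohmann/json.hpp>

// Decodes one cell of the fragments column. The cell is NULL on the row that
// carries the rows count, so callers pass std::nullopt for that row.
void printPartitionUpdate(const std::optional<std::string>& fragment) {
  if (!fragment) {
    return; // NULL fragment cell; nothing to decode.
  }
  const auto update = nlohmann::json::parse(*fragment);
  std::cout << "updateMode=" << update.at("updateMode").get<std::string>()
            << " rowCount=" << update.at("rowCount").get<long long>()
            << " files=" << update.at("fileWriteInfos").size() << "\n";
}

int main() {
  printPartitionUpdate(std::nullopt); // The rows row has a NULL fragment.
  printPartitionUpdate(std::string(
      R"({"updateMode":"NEW","rowCount":3950431150,"fileWriteInfos":[]})"));
  return 0;
}
```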

The commitcontext column is a constant vector containing the TaskCommitContext as a JSON string:

```
{
  "lifespan": "TaskWide",
  "taskId": "20220822_190126_00000_78c2f.1.0.0",
  "pageSinkCommitStrategy": "TASK_COMMIT",
  "lastPage": false
}
```

#### Empty buckets

The TableWriter generates PartitionUpdate messages only for the files it has written, so an empty bucket has no corresponding PartitionUpdate message.

If any buckets are missing PartitionUpdate output messages, the TableFinish operator fixes up the Hive metastore by creating empty files for those buckets. https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java#L1794
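
For intuition, here is a minimal sketch of how the missing buckets could be identified from the written file names, assuming the zero-padded bucket-number prefix shown earlier; the actual fix-up lives in Presto's HiveMetadata and is not reproduced here.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Returns the bucket numbers that have no written file, assuming each file
// name starts with a zero-padded bucket number (e.g. "000003_0_<id>").
std::vector<int> missingBuckets(
    const std::vector<std::string>& writtenFiles, int bucketCount) {
  std::set<int> written;
  for (const auto& file : writtenFiles) {
    written.insert(std::stoi(file.substr(0, 6)));
  }
  std::vector<int> missing;
  for (int bucket = 0; bucket < bucketCount; ++bucket) {
    if (written.count(bucket) == 0) {
      missing.push_back(bucket); // TableFinish would create an empty file here.
    }
  }
  return missing;
}

int main() {
  // Only buckets 0 and 2 were written; buckets 1 and 3..9 need empty files.
  for (int bucket : missingBuckets(
           {"000000_0_20240507_221727", "000002_0_20240507_221727"}, 10)) {
    std::cout << "missing bucket " << bucket << "\n";
  }
  return 0;
}
```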

### Design

As outlined above, all table writing happens in the TableWriter operator.

The TableWriter forwards the write to the HiveDataSink that the HiveConnector registers for it.

The HiveDataSink already supported bucketed (and partitioned) tables, so all the logic for wiring up bucket metadata and computing buckets already existed. The only missing piece was handling file names for bucketed-but-not-partitioned tables in the writer ids, and mapping each input row to the proper writer id when appending to the HiveDataSink. This PR fixes that.
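
To make the writer-id mapping concrete, here is a hypothetical sketch; the type and member names below are invented for illustration and are not HiveDataSink's actual API. For a bucketed but unpartitioned table there is no partition key, so the writer index for a row is simply its bucket id.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not HiveDataSink's real interface: assigns each input
// row to a writer for a bucketed, unpartitioned table. One writer (and one
// output file) exists per bucket.
struct BucketedWriterAssigner {
  int32_t bucketCount;

  // 'bucketHashes' stands in for the per-row hash of the bucketed_by
  // column(s) that the connector computes internally.
  std::vector<int32_t> assignWriters(
      const std::vector<uint64_t>& bucketHashes) const {
    std::vector<int32_t> writerIds(bucketHashes.size());
    for (size_t row = 0; row < bucketHashes.size(); ++row) {
      // No partition key, so the bucket id alone identifies the writer.
      writerIds[row] = static_cast<int32_t>(bucketHashes[row] % bucketCount);
    }
    return writerIds;
  }
};

int main() {
  BucketedWriterAssigner assigner{10};
  // Hashes 42, 1007 and 3 map to writers 2, 7 and 3 respectively.
  const std::vector<int32_t> writerIds = assigner.assignWriters({42, 1007, 3});
  return writerIds.size() == 3 ? 0 : 1;
}
```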


Note: The Prestissimo changes are in prestodb/presto#22737

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 7, 2024
@aditi-pandit aditi-pandit marked this pull request as draft May 7, 2024 23:19

netlify bot commented May 7, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 7eac4f6
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/664fbd5fbecd320008bccdc3

@Yuhta Yuhta requested a review from xiaoxmeng May 8, 2024 14:44
@aditi-pandit aditi-pandit force-pushed the bucketed_table branch 4 times, most recently from 96500ec to b10d786 Compare May 10, 2024 05:50
@aditi-pandit aditi-pandit force-pushed the bucketed_table branch 2 times, most recently from 3e71a3a to dad3c3f Compare May 14, 2024 05:35
@aditi-pandit aditi-pandit marked this pull request as ready for review May 14, 2024 05:37
@aditi-pandit aditi-pandit changed the title [Do not review]Add support to write bucketed (but not partitioned) tables Add support to write bucketed (but not partitioned) tables May 14, 2024
@xiaoxmeng (Contributor) left a comment

@aditi-pandit looks good % minors. Haven't looked at the test yet. Thanks!

Review comments on velox/connectors/hive/HiveDataSink.h and velox/connectors/hive/HiveDataSink.cpp (outdated, resolved).
@xiaoxmeng (Contributor) left a comment

@aditi-pandit LGTM. Thanks!

Review comments on velox/connectors/hive/HiveDataSink.cpp and velox/connectors/hive/HiveDataSink.h (outdated, resolved).
@facebook-github-bot (Contributor)

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@aditi-pandit (Collaborator, Author)

Thanks @xiaoxmeng. Addressed the review comments.

@kewang1024 (Contributor)

In the TableWriter output, it says N (numPartitionUpdates), but I remember from debugging that N is actually the number of rows being written.

@kewang1024 kewang1024 self-requested a review May 25, 2024 01:24
@facebook-github-bot (Contributor)

@xiaoxmeng merged this pull request in fe65ed0.


Conbench analyzed the 1 benchmark run on commit fe65ed0e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@aditi-pandit aditi-pandit deleted the bucketed_table branch May 27, 2024 23:52
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024

Pull Request resolved: facebookincubator#9740

Reviewed By: kewang1024

Differential Revision: D57748876

Pulled By: xiaoxmeng

fbshipit-source-id: 33bb77c6fce4d2519f3214e2fb93891f1f910716
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024