
[GLUTEN-3378][CORE] Datasource V2 data lake read support #3843

Merged: 5 commits into apache:main, Dec 5, 2023

Conversation

@liujiayi771 (Contributor) commented Nov 24, 2023

What changes were proposed in this pull request?

  • Implement Datasource V2 data lake read based on [VL] Unified design for data lake read support in Gluten + Velox #3378.
  • Introduce a gluten-iceberg module that defines IcebergScanTransformer, which extends BatchScanExecTransformer. Although IcebergLocalFilesNode is currently identical to LocalFilesNode, it is added in preparation for a future implementation of the "delete file" functionality.
  • ScanTransformerFactory is used to construct the various types of ScanTransformer. For Iceberg, IcebergScanTransformer is constructed through a service loader so that gluten-core does not depend on gluten-iceberg (a minimal sketch of this registration flow follows this list).
  • The logic that pushes Filter conditions down into the BatchScan runtimeFilters has been removed from the original code. It was unnecessary because BatchScan only supports DPP's runtime filters for partition filtering.
  • The logic for executing subqueries has been placed in the transformDynamicPruningExpr method, so both BatchScan and FileSourceScan can share it, as BatchScan also requires it.
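To make the service-loader idea above concrete, here is a minimal sketch of the registration flow. The trait name ScanTransformerRegister, its methods, and the factory object are hypothetical stand-ins chosen for illustration, not the exact API introduced by this PR.

```scala
import java.util.ServiceLoader

import scala.collection.JavaConverters._

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.v2.BatchScanExec

// Hypothetical registration trait, modeled on Spark's DataSourceRegister.
// A data lake module (e.g. gluten-iceberg) would ship an implementation plus a
// META-INF/services provider-configuration file, so gluten-core needs no
// compile-time dependency on that module.
trait ScanTransformerRegister {
  // Fully qualified class name of the DataSource V2 Scan this plugin handles.
  def scanClassName: String

  // Build the Gluten transformer that replaces the vanilla BatchScanExec.
  def createScanTransformer(batchScan: BatchScanExec): SparkPlan
}

object ScanTransformerFactorySketch {
  // Discover all registered transformers once, keyed by the Scan class name.
  private lazy val registered: Map[String, ScanTransformerRegister] =
    ServiceLoader
      .load(classOf[ScanTransformerRegister])
      .asScala
      .map(r => r.scanClassName -> r)
      .toMap

  def createBatchScanTransformer(batchScan: BatchScanExec): SparkPlan =
    registered.get(batchScan.scan.getClass.getName) match {
      case Some(register) => register.createScanTransformer(batchScan)
      case None           => batchScan // unknown scan type: keep the vanilla node
    }
}
```

The key point is that gluten-core only matches on the Scan's class name; everything Iceberg-specific stays behind the registered implementation.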

How was this patch tested?

  • Add VeloxTPCHIcebergSuite and VeloxIcebergSuite (a rough sketch of the kind of check these suites exercise follows this list).
  • The modifications to the interfaces can be verified by the existing CI.
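As a rough illustration of what such a suite exercises (not the actual suite code), the following standalone sketch reads an Iceberg table through Spark SQL; the catalog name, warehouse path, and table are made up, and it assumes the Iceberg Spark runtime and Gluten jars are on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object IcebergReadSmokeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("iceberg-read-smoke-check")
      .master("local[2]")
      // Hypothetical Iceberg Hadoop catalog, used only for this sketch.
      .config("spark.sql.catalog.iceberg_cat", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.iceberg_cat.type", "hadoop")
      .config("spark.sql.catalog.iceberg_cat.warehouse", "/tmp/iceberg-warehouse")
      .getOrCreate()

    spark.sql("CREATE NAMESPACE IF NOT EXISTS iceberg_cat.db")
    spark.sql("CREATE TABLE IF NOT EXISTS iceberg_cat.db.t (id BIGINT, name STRING) USING iceberg")
    spark.sql("INSERT INTO iceberg_cat.db.t VALUES (1, 'a'), (2, 'b')")

    val df = spark.sql("SELECT id, name FROM iceberg_cat.db.t WHERE id = 1")
    // With Gluten enabled, the executed plan is expected to contain an
    // IcebergScanTransformer node in place of the vanilla BatchScanExec.
    println(df.queryExecution.executedPlan)
    df.show()

    spark.stop()
  }
}
```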

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename the commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}


@liujiayi771 (Contributor, Author): @YannByron @yma11 @rui-mo Could you help review?


}
val scan = batchScanExec.scan
scan match {
case _ if scan.getClass.getName == IcebergScanClassName =>
Contributor:

IIUC, supporting a new Scan type (e.g. adding a non-Iceberg data source) will require modifying this match code, as well as the supportedBatchScan() method below.
Would it be possible to allow other plugins to register themselves, so that adding a new format does not require changing the ScanTransformerFactory code?

@liujiayi771 (Contributor, Author):

@rz-vastdata We can achieve this using the Service Loader, similar to DataSourceRegister in Spark. However, the current ScanTransformer in Gluten extends Spark's ScanExec and requires a constructor with parameters. This makes it a bit difficult to handle with Service Loader. I will think about whether there is another way to use the Service Loader.
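For reference, the Spark pattern mentioned here relies on zero-argument provider classes discovered through java.util.ServiceLoader, which is exactly the constraint being discussed; the provider class and alias below are made-up examples, not part of this PR.

```scala
import org.apache.spark.sql.sources.DataSourceRegister

// A hypothetical provider: the no-arg constructor is what makes ServiceLoader
// discovery possible, unlike the current ScanTransformers that take
// constructor parameters.
class ExampleLakeFormatRegister extends DataSourceRegister {
  override def shortName(): String = "examplelake"
}

// Registered through a provider-configuration file on the classpath:
//   META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
// containing the single line:
//   com.example.ExampleLakeFormatRegister
```

One common workaround is to load a no-arg register object via the service loader and let that object construct the parameterized transformer, which appears to be the direction the later DataSourceV2TransformerRegister change in this PR takes.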


public class IcebergLocalFilesNode extends LocalFilesNode {

class DeleteFile {
Contributor:
Not sure if it's used by this PR...

@liujiayi771 (Contributor, Author):
I didn't use it. This is a TODO.

@YannByron (Contributor):

My fault in the design. I noticed that there are many modifications that rename XXXExecTransformer to XXXTransformer. Maybe we can use XXXExecTransformer directly as these class names to avoid the renames.

@yma11 (Contributor) commented Nov 27, 2023, replying to the above:

+1

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.sources.BaseRelation

trait DatasourceScanTransformer extends BaseScanTransformer {
Contributor:

Is this added just to stay consistent with vanilla Spark? If so, I think we can drop this layer of the hierarchy so the inheritance is not so complicated.

@liujiayi771 (Contributor, Author):

Delta will extend this trait.

@@ -52,7 +52,7 @@ class HiveTableScanExecTransformer(
    relation: HiveTableRelation,
    partitionPruningPred: Seq[Expression])(session: SparkSession)
  extends HiveTableScanExec(requestedAttributes, relation, partitionPruningPred)(session)
-  with BasicScanExecTransformer {
+  with BaseScanTransformer {
Contributor:
Better to still use BasicScanExecTransformer as this change is not necessary.

@liujiayi771 (Contributor, Author):
OK. I will revert these changes first.

import org.apache.spark.sql.connector.read.InputPartition
import org.apache.spark.sql.types.StructType

trait BaseDataSource extends SupportFormat {
Contributor:

It seems BasicScanTransformer is the only implementation of SupportFormat; maybe you can add the fields directly?

@liujiayi771 (Contributor, Author):
Add this interface to BaseDataSource?

Contributor:
I mean SupportFormat.


// TODO: Add delete file support for MOR iceberg table

IcebergLocalFilesNode(
@yma11 (Contributor), Nov 27, 2023:

Will we also need to use the service loader for this LocalFilesNode serialization? I am not sure how much IcebergLocalFilesNode and DeltaLocalFilesNode will have in common.

@liujiayi771 (Contributor, Author):

No. It is not used by gluten-core. The incremental parts of the Delta and Iceberg definitions for MOR tables are expected to differ significantly.

Contributor:

There should be a toProtobuf() method for IcebergLocalFilesNode, and it will be called in gluten-core/substrait/../PlanNode for serialization. Or how do you plan to pass these newly added fields?

@liujiayi771 (Contributor, Author):

The current PR does not involve delete files. To support delete files in the future, we will need to implement a specific toProtobuf method and modify the proto file.


@@ -70,11 +70,11 @@ class HiveTableScanExecTransformer(

override def getPartitions: Seq[InputPartition] = partitions

override def getPartitionSchemas: StructType = relation.tableMeta.partitionSchema
@liujiayi771 (Contributor, Author):
@yma11 I think the names of these interfaces need to be changed. Here, you can refer to the corresponding interfaces in Spark, and we should not use plurals.
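For example, the accessor shown above would become `override def getPartitionSchema: StructType = relation.tableMeta.partitionSchema`, mirroring Spark's singular naming; getPartitionSchema here is only the suggested rename, not an existing interface.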


@liujiayi771 (Contributor, Author): @yma11 @YannByron @rz-vastdata I have replaced the reflection-based lookup with the service loader. Could you please help review?


liujiayi771 marked this pull request as ready for review on November 29, 2023, 12:22
* their data source v2 transformer. This allows users to give the data source v2 transformer alias
* as the format type over the fully qualified class name.
*/
trait DataSourceV2TransformerRegister {
@YannByron (Contributor), Nov 30, 2023:

For now it's OK. When we support other data sources based on V1, I'll update this.

@liujiayi771 (Contributor, Author), Nov 30, 2023:
Yes. DataSource V1 has different interface parameters.


private val dataSourceV2TransformerMap = new ConcurrentHashMap[String, Class[_]]()

def createFileSourceScanTransformer(
Contributor:

Can we combine the two createFileSourceScanTransformer methods?

@liujiayi771 (Contributor, Author):

They were initially combined, but we would need to add some optional Option parameters; see the sketch below.
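A sketch of the trade-off mentioned here: a single factory method with defaulted optional parameters in place of two overloads. The parameter list below is an approximation based on the call sites visible in this review, not the exact signature in the PR.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.{FileSourceScanExec, SparkPlan}

object FactorySignatureSketch {
  // Hypothetical combined signature: callers that previously used the second
  // overload simply omit the defaulted arguments.
  def createFileSourceScanTransformer(
      scanExec: FileSourceScanExec,
      reuseSubquery: Boolean,
      extraFilters: Seq[Expression] = Seq.empty,
      tableIdentifier: Option[TableIdentifier] = None): SparkPlan = {
    // Body elided; only the shape of the signature matters for this sketch.
    ???
  }
}
```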


liujiayi771 changed the title from "[WIP][GLUTEN-3378][CORE] Datasource V2 data lake read support" to "[GLUTEN-3378][CORE] Datasource V2 data lake read support" on Nov 30, 2023

rui-mo requested a review from zzcclp on December 1, 2023, 03:02
ScanTransformerFactory.createFileSourceScanTransformer(
fileSourceScan,
reuseSubquery,
extraFilters = leftFilters)
case batchScan: BatchScanExec =>
@rui-mo (Contributor), Dec 1, 2023:
Thanks for your fix. Can we avoid the extra filter pushdown for BatchScan by not calling applyFilterPushdownToScan for it?

https://github.com/oap-project/gluten/blob/c531abd94045db71a8f8ef692e5c5a80cbcd118f/gluten-core/src/main/scala/io/glutenproject/extension/ColumnarOverrides.scala#L140-L145
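In other words, the suggestion is to branch on the scan type before applying the pushdown rather than undoing it afterwards. The helper names below are stand-ins for the linked ColumnarOverrides code, stubbed out so the sketch is self-contained; they are not the actual Gluten methods.

```scala
import org.apache.spark.sql.execution.{FileSourceScanExec, SparkPlan}
import org.apache.spark.sql.execution.datasources.v2.BatchScanExec

object PushdownDispatchSketch {
  // Stand-ins for the real Gluten helpers; stubbed as identity here.
  private def applyFilterPushdownToScan(scan: FileSourceScanExec): SparkPlan = scan
  private def transformBatchScan(scan: BatchScanExec): SparkPlan = scan

  // Decide per scan type whether the extra filter pushdown applies, instead of
  // pushing filters into BatchScan and then ignoring them.
  def dispatch(plan: SparkPlan): SparkPlan = plan match {
    case scan: FileSourceScanExec =>
      applyFilterPushdownToScan(scan) // V1 file scans keep the extra pushdown
    case batchScan: BatchScanExec =>
      // BatchScan only supports DPP runtime filters for partition pruning,
      // so applyFilterPushdownToScan is skipped entirely.
      transformBatchScan(batchScan)
    case other => other
  }
}
```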

@liujiayi771 (Contributor, Author), Dec 2, 2023:
Fixed this in d879fd0.


@yma11 (Contributor) left a comment:
Hi @liujiayi771, thanks for your update!

yma11 merged commit a462434 into apache:main on Dec 5, 2023. 17 checks passed.
@GlutenPerfBot (Contributor):
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_master_12_05_2023_time.csv log/native_master_12_04_2023_94c91c55d_time.csv difference percentage
q1 34.72 34.65 -0.068 99.80%
q2 25.00 24.91 -0.099 99.61%
q3 37.98 36.38 -1.594 95.80%
q4 38.25 37.37 -0.886 97.69%
q5 72.06 72.63 0.567 100.79%
q6 5.37 6.87 1.504 128.03%
q7 82.30 85.44 3.137 103.81%
q8 86.96 87.86 0.910 101.05%
q9 126.92 124.52 -2.405 98.10%
q10 45.43 46.04 0.613 101.35%
q11 20.31 20.12 -0.183 99.10%
q12 27.02 26.71 -0.313 98.84%
q13 47.15 46.45 -0.698 98.52%
q14 19.01 14.55 -4.464 76.52%
q15 29.59 28.16 -1.428 95.17%
q16 15.81 15.75 -0.059 99.63%
q17 103.87 103.10 -0.768 99.26%
q18 150.70 150.63 -0.066 99.96%
q19 14.53 12.90 -1.626 88.80%
q20 28.18 27.73 -0.452 98.40%
q21 223.72 222.80 -0.925 99.59%
q22 13.17 13.06 -0.113 99.14%
total 1248.05 1238.63 -9.417 99.25%
