[GLUTEN-1582][CH] Improve txt/json read by use native engine by KevinyhZou · Pull Request #1584 · apache/gluten

KevinyhZou · 2023-05-09T03:53:08Z

What changes were proposed in this pull request?

Improve reading of txt/json format Hive tables by using the native engine.

(Fixes: #1582)

How was this patch tested?

unit tests, manual tests

This was tested on the ClickHouse backend. The tested query was: select l_linenumber, count(*) from linenumber_t where l_linenumber < 3 group by l_linenumber limit 3. The tested table has about 63,000,000 rows and is stored as textfile; the change achieved about a 100% improvement.

github-actions · 2023-05-09T03:53:22Z

#1582

github-actions · 2023-05-09T03:53:23Z

Run Gluten Clickhouse CI

lgbo-ustc · 2023-05-09T04:10:09Z

cpp-ch/local-engine/Storages/SubstraitSource/ReadBufferBuilder.cpp

        return read_buffer;
    }
+
+    std::pair<size_t, size_t> adjustFileReadStartAndEndPos(


Should this use ReadBuffer instead of the HDFS API directly?

ReadBuffer doesn't have APIs to adjust the file position; this would fit better in the HDFS ReadBuffer. @lgbo-ustc
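
The diff above shows only the signature of adjustFileReadStartAndEndPos, not its body. As a rough sketch of the general idea, assuming its job is to move a split's byte range onto row boundaries so that adjacent splits of a text file neither drop nor duplicate rows, the following works on an in-memory buffer rather than HDFS. The names alignToRowStart and adjustReadRange, and the rule that a split owns every row whose first byte falls inside it, are assumptions for illustration, not code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical helper: return the offset of the first row that starts at or
// after `pos`. Rows are assumed to be terminated by '\n'.
std::size_t alignToRowStart(std::string_view data, std::size_t pos)
{
    if (pos >= data.size())
        return data.size();
    // Offset 0, or an offset right after a newline, is already a row start.
    if (pos == 0 || data[pos - 1] == '\n')
        return pos;
    // Otherwise we are in the middle of a row: skip to the next row start.
    const std::size_t newline = data.find('\n', pos);
    return newline == std::string_view::npos ? data.size() : newline + 1;
}

// Hypothetical stand-in for the adjustment step: turn the raw split
// [start, start + length) into a row-aligned range [first, second).
std::pair<std::size_t, std::size_t>
adjustReadRange(std::string_view data, std::size_t start, std::size_t length)
{
    assert(start <= data.size());
    const std::size_t end = start + length;
    return {alignToRowStart(data, start), alignToRowStart(data, end)};
}
```

With the buffer "aa\nbbbb\ncc\n", the split (start 0, length 5) becomes [0, 8) and the next split (start 5, length 6) becomes [8, 11), so the two ranges meet at a single row boundary. Because both ends use the same alignment rule, a split that lies entirely inside one row collapses to an empty range.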

lgbo-ustc · 2023-05-09T04:14:02Z

gluten-core/src/main/resources/substrait/proto/substrait/algebra.proto

      // The length in byte to read from this item
      uint64 length = 8;

+      NamedStruct schema = 16;


The larger field id should go at the end.

github-actions · 2023-05-09T09:11:24Z

Run Gluten Clickhouse CI

github-actions · 2023-05-09T13:55:20Z

Run Gluten Clickhouse CI

github-actions · 2023-05-09T13:57:49Z

Run Gluten Clickhouse CI

github-actions · 2023-05-10T10:11:35Z

Run Gluten Clickhouse CI

github-actions · 2023-05-11T08:19:41Z

Run Gluten Clickhouse CI

github-actions · 2023-05-12T03:02:38Z

Run Gluten Clickhouse CI

taiyang-li · 2023-05-16T03:42:16Z

cpp-ch/local-engine/Storages/SubstraitSource/FormatFile.cpp

    }
 #endif
-
+    if (file.has_text())


The Hive text format relies on the USE_HIVE macro.
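
One possible reading of this comment, sketched under assumptions: the has_text() branch added after #endif would also be compiled in builds without Hive support, so it could be moved inside the guard. FileDescriptor and selectFormat are hypothetical stand-ins rather than code from the PR, and USE_HIVE is defined locally only so the sketch compiles on its own; in a real build the macro would come from the build system.

```cpp
#include <string>

// Normally provided by the build system; defined here only so that this
// standalone sketch compiles.
#ifndef USE_HIVE
#define USE_HIVE 1
#endif

// Hypothetical stand-in for the file descriptor used in the diff.
struct FileDescriptor
{
    bool is_text;
    bool has_text() const { return is_text; }
};

// Hypothetical dispatcher: the text branch exists only when Hive support
// is compiled in.
std::string selectFormat(const FileDescriptor & file)
{
#if USE_HIVE
    if (file.has_text())
        return "hive_text";
#endif
    (void)file; // avoids an unused-parameter warning when USE_HIVE is 0
    return "unsupported";
}
```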

loneylee · 2023-05-16T03:50:57Z

gluten-core/src/main/scala/io/glutenproject/extension/ColumnarOverrides.scala

-    // if the transformed is instance of ShuffleExchangeLike, so we need to remove it in AQE mode
-    // have tested gluten-it TPCH when AQE OFF
+      val childTransformer = BackendsApiManager.getSparkPlanExecApiInstance
+        .genHiveTableScanExecTransformer(plan.child)


Why not transform hiveExec in TransformPreOverrides, e.g. with case plan: HiveTableScanExec?

I moved this to TransformPreOverrides. Because HiveTableScanExec is a private class, it cannot be matched with case plan: HiveTableScanExec, so I use reflection instead.

github-actions · 2023-05-19T10:27:19Z

Run Gluten Clickhouse CI

github-actions · 2023-05-19T11:48:56Z

Run Gluten Clickhouse CI

zzcclp · 2023-05-24T06:59:28Z

gluten-core/src/main/scala/io/glutenproject/extension/ColumnarOverrides.scala

 // into native implementations.
 case class TransformPostOverrides(session: SparkSession, isAdaptiveContext: Boolean)
-    extends Rule[SparkPlan] {
+    extends Rule[SparkPlan] with Logging {


zzcclp · 2023-05-24T07:01:38Z

gluten-core/src/main/scala/io/glutenproject/extension/ColumnarOverrides.scala

-        logDebug(s"Transformation for ${p.getClass} is currently not supported.")
-        val children = plan.children.map(replaceWithTransformerPlan)
-        p.withNewChildren(children)
+        val planTransformer = BackendsApiManager.getSparkPlanExecApiInstance


Why transform HiveTableScanExec here?

BatchScanExec and FileSourceScanExec are both transformed in the TransformPreOverrides class, and HiveTableScanExec is similar to them, so I put it here. It cannot be matched with case HiveTableScanExec, so I transform it this way. Is there a better place for this transform?

zzcclp · 2023-05-24T07:02:34Z

...s-clickhouse/src/main/scala/io/glutenproject/backendsapi/clickhouse/CHSparkPlanExecApi.scala

+   * @return
+   */
+  override def genHiveTableScanExecTransformer(child: SparkPlan): BasicScanExecTransformer = {
+    if (!child.getClass.getSimpleName.equals("HiveTableScanExec")) {


Use 'isInstanceOf'?

HiveTableScanExec is a private class, so it cannot be used as child.isInstanceOf[HiveTableScanExec].

@KevinyhZou
HiveTableScanExec is private to the "hive" package, so it can be used inside the hive package to do the transform.
Also, BatchScanExec belongs to DS v2 and FileSourceScanExec belongs to DS v1. My suggestion is that HiveTableScanExec should use its own transform, like hiveexectransform. hiveexectransform can be a trait with BasicScanExecTransformer.

OK, it would be better to put the logic into a HiveExecTransformer, but TransformPreOverrides still needs to check whether the class is HiveTableScanExec, which has to be done by class name. I will try to do this. Thanks @loneylee

github-actions · 2023-05-24T07:29:48Z

Run Gluten Clickhouse CI

loneylee · 2023-05-25T03:04:56Z

cpp-ch/local-engine/Storages/SubstraitSource/HiveTextFormatFile.h

+
+namespace local_engine
+{
+class HiveTextFormatFile : public FormatFile


Can it be renamed to TextFormatFile? Since FileSourceScanExecTransformer can also read csv or text format, it may have the same parser.

github-actions · 2023-05-28T14:20:11Z

Run Gluten Clickhouse CI

github-actions · 2023-05-29T05:30:24Z

Run Gluten Clickhouse CI

github-actions · 2023-05-29T12:55:30Z

Run Gluten Clickhouse CI

github-actions · 2023-05-29T12:59:41Z

Run Gluten Clickhouse CI

github-actions · 2023-05-29T13:00:03Z

Run Gluten Clickhouse CI

zzcclp · 2023-05-30T12:23:51Z

backends-clickhouse/src/main/scala/io/glutenproject/backendsapi/clickhouse/CHMetricsApi.scala


+  override def genHiveTableScanTransformerMetrics(
+      sparkContext: SparkContext): Map[String, SQLMetric] =
+    Map(


Please modify the metrics according to the new genFileSourceScanTransformerMetrics.

zzcclp · 2023-05-30T12:24:24Z

backends-clickhouse/src/main/scala/io/glutenproject/metrics/HiveTableScanMetricsUpdater.scala

+import org.apache.spark.sql.execution.metric.SQLMetric
+import org.apache.spark.sql.utils.OASPackageBridge.InputMetricsWrapper
+
+class HiveTableScanMetricsUpdater(val metrics: Map[String, SQLMetric]) extends MetricsUpdater {


Refer to FileSourceScanMetricsUpdater.

zzcclp · 2023-05-30T12:24:56Z

Please rebase onto main.

github-actions · 2023-05-31T06:09:53Z

Run Gluten Clickhouse CI

github-actions · 2023-05-31T06:31:59Z

Run Gluten Clickhouse CI

zzcclp · 2023-05-31T06:56:16Z

backends-clickhouse/src/main/scala/io/glutenproject/metrics/HiveTableScanMetricsUpdater.scala

+import org.apache.spark.sql.execution.metric.SQLMetric
+import org.apache.spark.sql.utils.OASPackageBridge.InputMetricsWrapper
+
+class HiveTableScanMetricsUpdater(val metrics: Map[String, SQLMetric]) extends MetricsUpdater {


Add @transient before val metrics: Map[String, SQLMetric] to avoid serializing the metrics.

github-actions · 2023-05-31T07:29:15Z

Run Gluten Clickhouse CI

liuneng1994

LGTM

KevinyhZou marked this pull request as draft May 9, 2023 03:53

lgbo-ustc reviewed May 9, 2023

KevinyhZou force-pushed the improve_txt_native_engine_read branch from f1c46d3 to 2649e0c Compare May 10, 2023 10:11

KevinyhZou marked this pull request as ready for review May 12, 2023 03:29

taiyang-li reviewed May 16, 2023

loneylee reviewed May 16, 2023

KevinyhZou force-pushed the improve_txt_native_engine_read branch from c61ed9e to 3924f2e Compare May 19, 2023 10:27

KevinyhZou force-pushed the improve_txt_native_engine_read branch from 3924f2e to 1488f14 Compare May 19, 2023 11:48

zzcclp reviewed May 24, 2023

loneylee reviewed May 25, 2023

KevinyhZou force-pushed the improve_txt_native_engine_read branch from 65bcb29 to cc0b721 Compare May 29, 2023 05:30

zzcclp reviewed May 30, 2023

improve txt/json read by native engine

d0d2622

KevinyhZou force-pushed the improve_txt_native_engine_read branch from 69a7ba6 to d0d2622 Compare May 31, 2023 06:31

zzcclp reviewed May 31, 2023

review fix

f104f60

zzcclp requested a review from liuneng1994 May 31, 2023 08:14

liuneng1994 approved these changes Jun 2, 2023

liuneng1994 merged commit 946d0a1 into apache:main Jun 2, 2023
