[GLUTEN-1582][CH] Improve txt/json read by using native engine #1584
liuneng1994 merged 2 commits into apache:main from
Conversation
Run Gluten Clickhouse CI
    return read_buffer;
}

std::pair<size_t, size_t> adjustFileReadStartAndEndPos(
Should we use ReadBuffer instead of the HDFS API directly?

ReadBuffer doesn't have APIs for adjusting the file position; this logic would be better placed in the HDFS ReadBuffer. @lgbo-ustc
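The position adjustment under discussion is about splitting a text file on line boundaries. The following is a hypothetical Java sketch of the idea behind adjustFileReadStartAndEndPos, not the PR's actual C++ implementation; the method name and the byte-array stand-in for the file are assumptions.

```java
import java.nio.charset.StandardCharsets;

public class SplitAdjust {
    // Given the file bytes and a nominal [start, end) split, move start past
    // the first partial line (unless the split begins at offset 0) and extend
    // end to the next newline, so every line belongs to exactly one split.
    public static long[] adjustStartAndEnd(byte[] file, long start, long end) {
        long s = start;
        if (s > 0) {
            // Skip forward until the previous byte is a newline.
            while (s < file.length && file[(int) (s - 1)] != '\n') s++;
        }
        long e = end;
        // Extend so the line spanning the end boundary is fully included.
        while (e < file.length && file[(int) (e - 1)] != '\n') e++;
        return new long[]{s, e};
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        long[] r1 = adjustStartAndEnd(data, 0, 5); // boundary falls inside "bbb"
        long[] r2 = adjustStartAndEnd(data, 5, 12);
        System.out.println(r1[0] + "," + r1[1] + " " + r2[0] + "," + r2[1]);
        // prints 0,8 8,12 — the "bbb\n" line is read only by the first split
    }
}
```

This mirrors the convention used by Hadoop-style text input splits: a reader that does not start at offset 0 discards its first partial line, trusting the previous split to have consumed it.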
// The length in byte to read from this item
uint64 length = 8;

NamedStruct schema = 16;
Larger field ids should be placed at the end.
f1c46d3 to 2649e0c
}
#endif

if (file.has_text())
The Hive text format support relies on the USE_HIVE macro.
// if the transformed is instance of ShuffleExchangeLike, so we need to remove it in AQE mode
// have tested gluten-it TPCH when AQE OFF
val childTransformer = BackendsApiManager.getSparkPlanExecApiInstance
  .genHiveTableScanExecTransformer(plan.child)
Why not transform the HiveTableScanExec in TransformPreOverrides, like case plan: HiveTableScanExec?

I moved this to TransformPreOverrides. Since HiveTableScanExec is a private class, it can not be matched like case plan: HiveTableScanExec, so reflection is used instead.
c61ed9e to 3924f2e
3924f2e to 1488f14
// into native implementations.
case class TransformPostOverrides(session: SparkSession, isAdaptiveContext: Boolean)
  extends Rule[SparkPlan] {
  extends Rule[SparkPlan] with Logging {

logDebug(s"Transformation for ${p.getClass} is currently not supported.")
val children = plan.children.map(replaceWithTransformerPlan)
p.withNewChildren(children)
val planTransformer = BackendsApiManager.getSparkPlanExecApiInstance
Why is HiveTableScanExec transformed here?

Because BatchScanExec and FileSourceScanExec are both transformed in this TransformPreOverrides class, and HiveTableScanExec is similar to them, I put it here. It can not be matched as case HiveTableScanExec, so I transform it like this. Is there a better place for this transform?
 * @return
 */
override def genHiveTableScanExecTransformer(child: SparkPlan): BasicScanExecTransformer = {
  if (!child.getClass.getSimpleName.equals("HiveTableScanExec")) {
HiveTableScanExec is a private class, so it can not be checked like child.isInstanceOf[HiveTableScanExec].

@KevinyhZou
HiveTableScanExec is private to the "hive" package, so it can still be used inside that package to do the transform.
As for the others, BatchScanExec belongs to DS v2 and FileSourceScanExec to DS v1. My suggestion is that HiveTableScanExec should get its own transformer, e.g. a HiveExecTransformer that mixes in the BasicScanExecTransformer trait.

OK, it would be better to put the logic into a HiveExecTransformer, but TransformPreOverrides still needs to judge whether the class is HiveTableScanExec, which has to be done via its class name. I will try to do this, thanks @loneylee
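The workaround agreed on above, matching an inaccessible class by its runtime name instead of with instanceof or a Scala type pattern, can be illustrated with a small Java sketch. The nested HiveTableScanExec class here is only a stand-in for Spark's package-private class of the same name, which real code outside the hive package could not reference at all.

```java
public class NameCheck {
    // Stand-in for Spark's package-private HiveTableScanExec; outside its
    // package the real type is not visible, so instanceof is unavailable.
    public static class HiveTableScanExec {}

    // Match on the runtime class name instead of the (inaccessible) type,
    // the same trick the PR uses in TransformPreOverrides.
    public static boolean isHiveTableScan(Object plan) {
        return plan != null
            && plan.getClass().getSimpleName().equals("HiveTableScanExec");
    }

    public static void main(String[] args) {
        System.out.println(isHiveTableScan(new HiveTableScanExec())); // true
        System.out.println(isHiveTableScan("some other plan"));       // false
    }
}
```

The trade-off is that a name check is not refactoring-safe and matches any class with that simple name, which is why keeping it in one place (a dedicated transformer) is the cleaner design.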
namespace local_engine
{
class HiveTextFormatFile : public FormatFile
Can it be renamed to TextFormatFile? Since FileSourceScanExecTransformer can also read csv or text format, it may share the same parser.
65bcb29 to cc0b721
override def genHiveTableScanTransformerMetrics(
    sparkContext: SparkContext): Map[String, SQLMetric] =
  Map(
Please update the metrics according to the new genFileSourceScanTransformerMetrics.
import org.apache.spark.sql.execution.metric.SQLMetric
import org.apache.spark.sql.utils.OASPackageBridge.InputMetricsWrapper

class HiveTableScanMetricsUpdater(val metrics: Map[String, SQLMetric]) extends MetricsUpdater {
Refer to FileSourceScanMetricsUpdater.
Please rebase to main.
69a7ba6 to d0d2622
import org.apache.spark.sql.execution.metric.SQLMetric
import org.apache.spark.sql.utils.OASPackageBridge.InputMetricsWrapper

class HiveTableScanMetricsUpdater(val metrics: Map[String, SQLMetric]) extends MetricsUpdater {
Add @transient before val metrics: Map[String, SQLMetric] to avoid serializing the metrics.
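A minimal Java sketch of what marking the field transient buys here: a transient field is skipped by JVM serialization, so the driver-side metrics map is not shipped along with the serialized updater. The Updater class, the roundTrip helper, and the Map<String, Integer> stand-in for Map[String, SQLMetric] are illustrative assumptions, not the PR's actual types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;

public class TransientDemo {
    // Stand-in for HiveTableScanMetricsUpdater; the metrics map is marked
    // transient so Java serialization skips it.
    public static class Updater implements Serializable {
        public transient Map<String, Integer> metrics;
        public Updater(Map<String, Integer> metrics) { this.metrics = metrics; }
    }

    // Serialize and deserialize the updater, roughly what happens when Spark
    // ships a task closure from the driver to an executor.
    public static Updater roundTrip(Updater u) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(u);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (Updater) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Updater back = roundTrip(new Updater(Map.of("numOutputRows", 1)));
        System.out.println(back.metrics); // transient field is null after deserialization
    }
}
```

In Scala, the same effect is obtained by writing @transient on the constructor parameter, as the review suggests.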
What changes were proposed in this pull request?
Improve reading txt/json format hive tables by using the native engine.
(Fixes: #1582)
How was this patch tested?
Unit tests, manual tests.
This was tested on the ClickHouse backend. The tested query is
select l_linenumber, count(*) from linenumber_t where l_linenumber < 3 group by l_linenumber limit 3; the tested table has about 63,000,000 rows and is stored as textfile. The patch achieved about a 100% improvement.