Skip to content

[GLUTEN-1582][CH] Improve txt/json read by use native engine#1584

Merged
liuneng1994 merged 2 commits intoapache:mainfrom
KevinyhZou:improve_txt_native_engine_read
Jun 2, 2023
Merged

[GLUTEN-1582][CH] Improve txt/json read by use native engine#1584
liuneng1994 merged 2 commits intoapache:mainfrom
KevinyhZou:improve_txt_native_engine_read

Conversation

@KevinyhZou
Copy link
Contributor

@KevinyhZou KevinyhZou commented May 9, 2023

What changes were proposed in this pull request?

Improve read txt/json format hive table by use native engine

(Fixes: #1582)

How was this patch tested?

unit tests, manual tests

This is tested on clickhouse-backend. The tested query select l_linenumber,count(*) from linenumber_t where l_linenumber <3 group by l_linenumber limit 3 , the tested table has about 63000000 rows, and stored as textfile, which archived about 100% improvement.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

@github-actions
Copy link

github-actions bot commented May 9, 2023

#1582

@github-actions
Copy link

github-actions bot commented May 9, 2023

Run Gluten Clickhouse CI

@KevinyhZou KevinyhZou marked this pull request as draft May 9, 2023 03:53
return read_buffer;
}

std::pair<size_t, size_t> adjustFileReadStartAndEndPos(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use ReadBuffer instead of hdfs api direclty ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadBuffer don't have apis to adjust file position, and this should be better in HDFS ReadBuffer. @lgbo-ustc

// The length in byte to read from this item
uint64 length = 8;

NamedStruct schema = 16;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

larger id should be at the end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@github-actions
Copy link

github-actions bot commented May 9, 2023

Run Gluten Clickhouse CI

2 similar comments
@github-actions
Copy link

github-actions bot commented May 9, 2023

Run Gluten Clickhouse CI

@github-actions
Copy link

github-actions bot commented May 9, 2023

Run Gluten Clickhouse CI

@KevinyhZou KevinyhZou force-pushed the improve_txt_native_engine_read branch from f1c46d3 to 2649e0c Compare May 10, 2023 10:11
@github-actions
Copy link

Run Gluten Clickhouse CI

2 similar comments
@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@KevinyhZou KevinyhZou marked this pull request as ready for review May 12, 2023 03:29
}
#endif

if (file.has_text())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hive text format relies on macro USE_HIVE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// if the transformed is instance of ShuffleExchangeLike, so we need to remove it in AQE mode
// have tested gluten-it TPCH when AQE OFF
val childTransformer = BackendsApiManager.getSparkPlanExecApiInstance
.genHiveTableScanExecTransformer(plan.child)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not transform hiveExec in TransformPreOverrides, like case plan: HiveTableScanExec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this to TransformPreOverrides, and as HiveTableScanExec is a private class, so it can not be used like case plan: HiveTableScanExec , but use reflect instead

@KevinyhZou KevinyhZou force-pushed the improve_txt_native_engine_read branch from c61ed9e to 3924f2e Compare May 19, 2023 10:27
@github-actions
Copy link

Run Gluten Clickhouse CI

@KevinyhZou KevinyhZou force-pushed the improve_txt_native_engine_read branch from 3924f2e to 1488f14 Compare May 19, 2023 11:48
@github-actions
Copy link

Run Gluten Clickhouse CI

// into native implementations.
case class TransformPostOverrides(session: SparkSession, isAdaptiveContext: Boolean)
extends Rule[SparkPlan] {
extends Rule[SparkPlan] with Logging {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

logDebug(s"Transformation for ${p.getClass} is currently not supported.")
val children = plan.children.map(replaceWithTransformerPlan)
p.withNewChildren(children)
val planTransformer = BackendsApiManager.getSparkPlanExecApiInstance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why transforms HiveTableScanExec here ?

Copy link
Contributor Author

@KevinyhZou KevinyhZou May 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I see BatchScanExec, FileSourceScanExec are all thansform in this TransformPreOverrides class, HiveTableScanExec is similar to these, so I make it here. And it can not be used as case HiveTableScanExec, so I transform it like this. What better place should I make this transform ?

* @return
*/
override def genHiveTableScanExecTransformer(child: SparkPlan): BasicScanExecTransformer = {
if (!child.getClass.getSimpleName.equals("HiveTableScanExec")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use 'isInstanceOf' ?

Copy link
Contributor Author

@KevinyhZou KevinyhZou May 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HiveTableScanExec is a private class, it can not be used like child.isInstanceOf[HiveTableScanExec]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KevinyhZou
HiveTableScanExec is private in the "hive" package. It can be used in the hive package to do some transform.
Others, BatchScanExec is extended with DS v2, and FileScanExec is extended with DS v1. My suggestion is that HiveTableScanExec should use it's own transform like hiveexectransform. hiveexectransform can be trait with BasicScanExecTransformer

Copy link
Contributor Author

@KevinyhZou KevinyhZou May 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it will be better to implements this to put the logic into a HiveExecTransformer, but it still need to judge whether the class is HiveTableScanExec in TransformPreOverrides, which should use its class name. And I will try to do this. thanks @loneylee

@github-actions
Copy link

Run Gluten Clickhouse CI


namespace local_engine
{
class HiveTextFormatFile : public FormatFile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be renamed to TextFormatFile? Since FileSourceScanExecTransformer can also read csv or text format, it may have the same parser.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@github-actions
Copy link

Run Gluten Clickhouse CI

@KevinyhZou KevinyhZou force-pushed the improve_txt_native_engine_read branch from 65bcb29 to cc0b721 Compare May 29, 2023 05:30
@github-actions
Copy link

Run Gluten Clickhouse CI

3 similar comments
@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI


override def genHiveTableScanTransformerMetrics(
sparkContext: SparkContext): Map[String, SQLMetric] =
Map(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please modify metrics according to the new genFileSourceScanTransformerMetrics

import org.apache.spark.sql.execution.metric.SQLMetric
import org.apache.spark.sql.utils.OASPackageBridge.InputMetricsWrapper

class HiveTableScanMetricsUpdater(val metrics: Map[String, SQLMetric]) extends MetricsUpdater {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref to FileSourceScanMetricsUpdater

@zzcclp
Copy link
Contributor

zzcclp commented May 30, 2023

please rebase to main

@github-actions
Copy link

Run Gluten Clickhouse CI

@KevinyhZou KevinyhZou force-pushed the improve_txt_native_engine_read branch from 69a7ba6 to d0d2622 Compare May 31, 2023 06:31
@github-actions
Copy link

Run Gluten Clickhouse CI

import org.apache.spark.sql.execution.metric.SQLMetric
import org.apache.spark.sql.utils.OASPackageBridge.InputMetricsWrapper

class HiveTableScanMetricsUpdater(val metrics: Map[String, SQLMetric]) extends MetricsUpdater {
Copy link
Contributor

@zzcclp zzcclp May 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add @transient before val metrics: Map[String, SQLMetric] to avoid serialize metrics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@github-actions
Copy link

Run Gluten Clickhouse CI

@zzcclp zzcclp requested a review from liuneng1994 May 31, 2023 08:14
Copy link
Contributor

@liuneng1994 liuneng1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@liuneng1994 liuneng1994 merged commit 946d0a1 into apache:main Jun 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CH] Improve txt/json format hive table read by use native engine

6 participants