[GLUTEN-8379][VL] Support query trace by jinchengchenghh · Pull Request #8380 · apache/gluten

jinchengchenghh · 2024-12-31T02:50:03Z

Run the MicroBenchmark to generate a stage-level plan, then enable query trace in the benchmark to profile at the plan-node level. With query trace enabled, the benchmark replaces ValueStreamNode, which is hard to serialize, with ValuesNode. This replacement may become unnecessary once Velox query trace optimizes plan serialization so that only the plan node being profiled is serialized: facebookincubator/velox#12084

github-actions · 2024-12-31T02:50:19Z

#8379

github-actions · 2024-12-31T02:50:35Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-03T01:43:14Z

Run Gluten Clickhouse CI on x86

zhli1142015 · 2025-01-03T03:16:25Z

We are also looking into this. I tried it locally but encountered the error below. Would you fix it in your PR?

Caused by: org.apache.gluten.exception.GlutenException: Exception: VeloxUserError
Error Source: USER
Error Code: UNSUPPORTED
Reason: ValueStream plan node is not serializable
Retriable: False
Function: serialize
File: /var/git/incubator-gluten/cpp/velox/operators/plannodes/RowVectorStream.h
Line: 136
Stack trace:
# 0  facebook::velox::VeloxException::VeloxException(char const*, unsigned long, char const*, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, bool, facebook::velox::VeloxException::Type, std::basic_string_view<char, std::char_traits<char> >)
# 1  void facebook::velox::detail::veloxCheckFail<facebook::velox::VeloxUserError, char const*>(facebook::velox::detail::VeloxCheckFailArgs const&, char const*)
# 2  gluten::ValueStreamNode::serialize() const
# 3  facebook::velox::core::PlanNode::serialize() const
# 4  facebook::velox::core::TopNNode::serialize() const
# 5  facebook::velox::core::PlanNode::serialize() const
# 6  facebook::velox::core::ProjectNode::serialize() const
# 7  facebook::velox::exec::trace::TaskTraceMetadataWriter::write(std::shared_ptr<facebook::velox::core::QueryCtx> const&, std::shared_ptr<facebook::velox::core::PlanNode const> const&)
# 8  facebook::velox::exec::Task::maybeInitTrace()

jinchengchenghh · 2025-01-03T05:11:39Z

No, this PR only adds the config. The exception occurs because serializing ValueStreamNode to JSON is not supported yet. We could support it; it is not hard.

jinchengchenghh · 2025-01-03T05:19:17Z

QueryTrace only supports some operators; we need to extend it.

zhli1142015 · 2025-01-03T05:44:45Z

I think we might need to convert the exception below into a warning log to tolerate the issue you mentioned, where only some operators are supported.
https://github.com/facebookincubator/velox/blob/9da5fa7ee803540e330bc626a57388c9cbcad1b3/velox/exec/Operator.cpp#L130

zhli1142015 · 2025-01-03T05:45:23Z

And we also need to register Spark functions in velox_query_replayer.

jinchengchenghh · 2025-01-03T05:50:34Z

If you don't set the trace config, we won't run into this exception. The exception indicates you could not enable query trace in that case.

Looks like we still have more work to do to enable query trace in Gluten.

FelixYBW · 2025-01-03T22:12:47Z

@jinchengchenghh what's the difference or advantage compared with our microbenchmark tool?

jinchengchenghh · 2025-01-06T01:18:16Z

Our benchmark accepts a whole Velox plan, but query trace can target a specific plan node ID, so it can save the input of any plan node. @FelixYBW

github-actions · 2025-01-06T01:19:27Z

Run Gluten Clickhouse CI on x86

zhztheplayer · 2025-01-06T01:37:19Z

> Our benchmark accepts a whole Velox plan, but query trace can target a specific plan node ID, so it can save the input of any plan node. @FelixYBW

Can we consolidate the two functionalities? They seem to overlap more or less.

FelixYBW · 2025-01-06T08:42:03Z

> Can we consolidate the two functionalities? They seem to overlap more or less.

If it can save the input of any plan node and reproduce only that node, it will be useful for debugging.

Yohahaha · 2025-01-13T01:42:32Z

cpp/velox/compute/WholeStageResultIterator.cc

      configs[velox::core::QueryConfig::kSparkLegacyDateFormatter] = "false";
    }

+    const auto setIfExists = [&](const std::string& glutenKey, const std::string& veloxKey) {


If this function is common, should we declare it as a static member function?

Yes, good suggestion. I think the config handling needs a refactor. Currently, if a config is not set on the Java side, we write a default value into the Velox query config; we should change this so the config is not set at all.

Then, if a Velox config's default value changes, we pick up the change automatically, as long as we don't need a default that differs from Velox's.
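A minimal sketch of the idea above, written as a free function instead of a lambda. It forwards a value to the Velox-side map only when the Gluten-side key is present, so an unset key leaves Velox's own default in effect. The `ConfMap` alias and the extra map parameters are assumptions for illustration; the PR's `setIfExists` is a capturing lambda whose body is not shown in this thread.

```cpp
#include <string>
#include <unordered_map>

// Plain string map standing in for the real config containers (assumption).
using ConfMap = std::unordered_map<std::string, std::string>;

// Copies glutenConf[glutenKey] into veloxConf[veloxKey] only if the Gluten key
// was set. Returns true when a value was forwarded. When the key is absent,
// veloxConf is left untouched so Velox falls back to its own default.
inline bool setIfExists(
    const ConfMap& glutenConf,
    ConfMap& veloxConf,
    const std::string& glutenKey,
    const std::string& veloxKey) {
  const auto it = glutenConf.find(glutenKey);
  if (it == glutenConf.end()) {
    return false;
  }
  veloxConf[veloxKey] = it->second;
  return true;
}
```

With this shape, a changed Velox default needs no matching change on the Gluten side unless Gluten deliberately wants a different value.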

github-actions · 2025-01-14T04:19:41Z

Run Gluten ClickHouse CI on ARM

github-actions · 2025-01-15T02:33:16Z

Run Gluten ClickHouse CI on ARM

github-actions · 2025-01-15T13:10:43Z

Run Gluten ClickHouse CI on ARM

github-actions · 2025-01-16T01:33:13Z

Run Gluten ClickHouse CI on ARM

zhztheplayer

Is it possible to eventually get the generic benchmark and this feature merged in Gluten, given that both are invasive to the core execution code path? For example, the generic benchmark requires an individual parameter passed through the JNI bridge, and this feature requires changing the ValueStreamNode to ValuesNode. Do both features have their own users as of now?

zhztheplayer · 2025-01-16T05:36:25Z

cpp/velox/compute/WholeStageResultIterator.cc

+std::string getQueryId(const std::unordered_map<std::string, std::string>& confMap) {
+  auto it = confMap.find(kQueryTraceQueryId);
+  if (it != confMap.end()) {
+    return it->second;
+  }
+  return "";
+}
+


I'm a little confused by the empty query ID. Can we always give Velox a meaningful ID?

Can we use applicationId_stageId as the queryId?

Do we have applicationId passed through JNI? Or we could just align it with the task name somehow.

If applicationId is important for the feature and not yet passed through JNI, we could also add it and use it for the query context and task names / IDs in another PR. Thank you.

The queryId here is not the Spark query ID; it is the Velox query ID. Currently there is one Velox query per task, so I think we can use the task name as the query ID.

We don't pass the applicationId at the moment.
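A sketch of the fallback discussed above: return the configured query ID when one is set, otherwise reuse the task name so Velox never receives an empty ID. The extra `taskName` parameter and the key string are assumptions; the actual value of `kQueryTraceQueryId` is not shown in this thread.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical key string; the PR defines kQueryTraceQueryId elsewhere.
const std::string kQueryTraceQueryId = "query_trace_query_id";

// Returns the configured Velox query ID, or the task name when the config is
// missing or empty. There is one Velox query per task, so the task name is a
// usable stand-in.
std::string getQueryId(
    const std::unordered_map<std::string, std::string>& confMap,
    const std::string& taskName) {
  const auto it = confMap.find(kQueryTraceQueryId);
  if (it != confMap.end() && !it->second.empty()) {
    return it->second;
  }
  return taskName;
}
```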

zhztheplayer · 2025-01-16T05:44:37Z

cpp/velox/substrait/SubstraitToVeloxPlan.cc

+  VELOX_CHECK_LT(streamIdx, inputIters_.size(), "Could not find stream index {} in input iterator list.", streamIdx);
+  const auto iterator = inputIters_[streamIdx];
+  while (iterator->hasNext()) {
+    auto cb = VeloxColumnarBatch::from(defaultLeafVeloxMemoryPool().get(), iterator->next());


Is this memory managed by Spark, given that defaultLeafVeloxMemoryPool is global?

This is only used in the benchmark, so we don't need to track it.
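For readers following the hunk, here is a toy version of the drain loop: it pulls every batch out of one input stream so the batches can be held by a values-style leaf node. `Batch`, `BatchIterator`, `VectorBatchIterator`, and `drainStream` are hypothetical stand-ins, not Gluten types; the real code wraps each result in a `VeloxColumnarBatch` and checks the index with `VELOX_CHECK_LT`.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a columnar batch.
using Batch = std::vector<int>;

// Hypothetical interface mirroring the hasNext()/next() calls in the diff.
class BatchIterator {
 public:
  virtual ~BatchIterator() = default;
  virtual bool hasNext() = 0;
  virtual Batch next() = 0;
};

// In-memory iterator used only to exercise the sketch.
class VectorBatchIterator : public BatchIterator {
 public:
  explicit VectorBatchIterator(std::vector<Batch> batches)
      : batches_(std::move(batches)) {}
  bool hasNext() override { return pos_ < batches_.size(); }
  Batch next() override { return batches_[pos_++]; }

 private:
  std::vector<Batch> batches_;
  std::size_t pos_ = 0;
};

// Drains the stream at streamIdx into a vector of batches.
// Throws if the index is outside the input iterator list.
std::vector<Batch> drainStream(
    const std::vector<std::shared_ptr<BatchIterator>>& inputIters,
    std::size_t streamIdx) {
  if (streamIdx >= inputIters.size()) {
    throw std::out_of_range(
        "Could not find stream index " + std::to_string(streamIdx) +
        " in input iterator list.");
  }
  std::vector<Batch> batches;
  const auto& iterator = inputIters[streamIdx];
  while (iterator->hasNext()) {
    batches.push_back(iterator->next());
  }
  return batches;
}
```

Draining consumes the stream, which is acceptable in a benchmark but is one reason this path is not meant for regular execution.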

zhztheplayer · 2025-01-16T05:51:14Z

docs/developers/QueryTrace.md

+---
+layout: page
+title: How To Use Gluten
+nav_order: 1
+parent: Developer Overview
+---


The header needs to be updated.

@jinchengchenghh

nit: Would you update title (and perhaps other fields as well) in the header? Thanks.

duanmeng · 2025-01-16T06:42:27Z

@jinchengchenghh It is awesome to support query tracing in gluten. By the way, I am planning to support PlanNode serialization without leaf nodes to avoid serializing potential ValueStreamNode so that we can support more general query tracing in gluten.

github-actions · 2025-01-17T01:41:34Z

Run Gluten ClickHouse CI on ARM

jinchengchenghh · 2025-01-17T02:05:51Z

The GenericBenchmark is frequently used by us (CC @JkSelf). It is really useful for debugging on the C++ side, since we can use VS Code to debug step by step. Query trace is widely used in Velox for correctness verification and performance profiling, so I think we need to enable it in Gluten. With this optimization, #8380 (comment), we could drop the ValueStreamNode to ValuesNode replacement. @zhztheplayer

duanmeng · 2025-01-17T11:56:39Z

> The GenericBenchmark is frequently used by us (CC @JkSelf). It is really useful for debugging on the C++ side, since we can use VS Code to debug step by step. Query trace is widely used in Velox for correctness verification and performance profiling, so I think we need to enable it in Gluten. With this optimization, #8380 (comment), we could drop the ValueStreamNode to ValuesNode replacement. @zhztheplayer

As discussed offline with @xiaoxmeng and @jinchengchenghh, we could add simple ValueStreamNode serde methods that serialize only the ID and output type, and that create a ValueStreamNode with an empty ResultIterator on deserialization, because we do not need to trace the ValueStreamNode.

jinchengchenghh · 2025-01-17T23:30:27Z

I tried serializing ValueStreamNode with an empty payload, but deserialization failed.

Plan node deserializers are registered in Velox, but ValueStreamNode lives in Gluten, so we cannot do that.
https://github.com/facebookincubator/velox/blob/main/velox/core/PlanNode.cpp#L2716

root@sr249:/mnt/DP_disk1/code/velox/build/velox/tool/trace# ./velox_query_replayer  --root_dir /tmp/query_trace --task_id Gluten_Stage_0_TID_0_VTID_0 --query_id=query_1 --node_id=7 --summary
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0117 20:08:46.359496 2923199 HiveConnector.cpp:56] Hive connector test-hive created with maximum of 20000 cached file handles.
terminate called after throwing an instance of 'facebook::velox::VeloxUserError'
  what():  Exception: VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Deserialization function for class: ValueStreamNode is not registered
Retriable: False
Expression: registry.Has(name)
Function: deserialize
File: /mnt/DP_disk1/code/velox/./velox/common/serialization/Serializable.h
Line: 196
Stack trace:
Stack trace has been disabled. Use --velox_exception_user_stacktrace_enabled=true to enable it
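To illustrate the failure mode in the log above, here is a toy name-keyed deserializer registry. It is not Velox's serialization API; `Node`, `ValuesNode`, and `DeserializerRegistry` are hypothetical names. The point is that a process which registered only its own node classes cannot rebuild a class defined in another project, and lookup by class name fails at deserialization time.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical plan node base class.
struct Node {
  virtual ~Node() = default;
  virtual std::string name() const = 0;
};

// A node type the replaying binary knows about.
struct ValuesNode : Node {
  std::string name() const override { return "ValuesNode"; }
};

// Maps a serialized class name to a factory that rebuilds the node.
class DeserializerRegistry {
 public:
  using Factory = std::function<std::unique_ptr<Node>()>;

  void registerClass(const std::string& name, Factory factory) {
    factories_[name] = std::move(factory);
  }

  // Throws when no factory was registered under the given class name.
  std::unique_ptr<Node> deserialize(const std::string& name) const {
    const auto it = factories_.find(name);
    if (it == factories_.end()) {
      throw std::invalid_argument(
          "Deserialization function for class: " + name + " is not registered");
    }
    return it->second();
  }

 private:
  std::map<std::string, Factory> factories_;
};
```

In this toy model, a replayer that registers only `ValuesNode` throws on `"ValueStreamNode"`, which mirrors why an empty serialization alone does not help unless the reading side also knows the class.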

jinchengchenghh · 2025-01-20T02:16:20Z

Can you help review again? Thanks! @zhztheplayer

zhztheplayer

Some nits. Thank you!

zhztheplayer · 2025-01-20T02:47:43Z

shims/common/src/main/scala/org/apache/gluten/config/GlutenConfig.scala

+    .createWithDefault(false)
+
+  val QUERY_TRACE_DIR = buildConf("spark.gluten.sql.columnar.backend.velox.queryTraceDir")
+    .doc("Base dir of a query to store tracing data.")


What happens if one leaves this empty?

If the config is not set correctly, Velox throws an exception, as it does for other configs.

***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
W0120 15:09:59.768762 3040717 GenericBenchmark.cc:253] Setting CPU for thread 0 to 0
terminate called after throwing an instance of 'facebook::velox::VeloxUserError'
  what():  Exception: VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Query trace enabled but the trace dir is not set
Retriable: False
Expression: !queryConfig.queryTraceDir().empty()
Function: maybeMakeTraceConfig
File: /mnt/DP_disk1/code/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Task.cpp
Line: 3027
Stack trace:

Can we set a default of /tmp/query_trace?
BTW, if we want to expose the config to users, we need to remove the internal().

This config is internal only. Users will not set it by this config name; because query trace is enabled in the benchmark, users set it through FLAGS_XX in the benchmark.
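A small sketch of the default suggested in this thread, under the assumption that the fallback is applied before the config reaches Velox. `resolveTraceDir` and `kDefaultTraceDir` are hypothetical names, and the thread does not say whether the merged code applies such a default.

```cpp
#include <optional>
#include <string>

// Fallback directory proposed in the review (not confirmed as merged behavior).
const std::string kDefaultTraceDir = "/tmp/query_trace";

// Returns no value when tracing is off, the configured directory when one is
// set, and the fallback directory when tracing is on but the directory is empty.
std::optional<std::string> resolveTraceDir(
    bool traceEnabled,
    const std::string& configuredDir) {
  if (!traceEnabled) {
    return std::nullopt;
  }
  return configuredDir.empty() ? kDefaultTraceDir : configuredDir;
}
```

Without a fallback of this kind, an enabled trace with an empty directory reaches Velox and triggers the INVALID_ARGUMENT error shown above.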

github-actions · 2025-01-20T08:05:53Z

Run Gluten ClickHouse CI on ARM

github-actions · 2025-01-21T08:11:49Z

Run Gluten ClickHouse CI on ARM

FelixYBW · 2025-01-22T01:21:44Z

Thank you, Chengcheng for your effort!

github-actions bot added the CORE (works for Gluten Core) and VELOX labels Dec 31, 2024

jinchengchenghh force-pushed the trace branch from c3e87eb to ed2dd88 Compare January 3, 2025 01:42

jinchengchenghh force-pushed the trace branch from ed2dd88 to 69a5f31 Compare January 6, 2025 01:18

Yohahaha reviewed Jan 13, 2025

jinchengchenghh force-pushed the trace branch from 69a5f31 to f8b94a4 Compare January 14, 2025 04:19

github-actions bot added the DOCS label Jan 15, 2025

zhztheplayer reviewed Jan 16, 2025

zhztheplayer reviewed Jan 20, 2025

zhztheplayer approved these changes Jan 20, 2025

jinchengchenghh force-pushed the trace branch from 4ac6780 to f82e5a6 Compare January 21, 2025 08:11

support query trace

f82e5a6

jinchengchenghh merged commit 57fa103 into apache:main Jan 22, 2025
49 checks passed
