[GLUTEN-8379][VL] Support query trace#8380
Conversation
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
We are also looking into this. I tried it locally but encountered the error below. Would you fix it in your PR? |
|
No, I just add the config. The exception is because we don't support to serialize ValueStream Node to json, we could support it, it is not hard. |
|
QueryTrace only supports some operators, we need to extend it. |
|
I think we might need to convert the exception below into a warning log to tolerate the issue you mentioned, where only some operators are supported. |
|
And we also need to register spark fucntions in |
|
If you don't set the trace config, we won't run into this exception. The exception indicates you could not enable query trace in that case. Looks like we still have further more work to do to enable query trace in Gluten |
|
@jinchengchenghh what's the difference/advantage from our microbenchmark tool? |
|
Our benchmark accepts a total velox plan, but query trace can specify the plan node id, it can save the input of any plan node @FelixYBW |
|
Run Gluten Clickhouse CI on x86 |
Can we consolidate the two functionalities? They sound overlap more or less? |
If it can save the input of any plan node and reproduce the node only, it will be useful to debug. |
| configs[velox::core::QueryConfig::kSparkLegacyDateFormatter] = "false"; | ||
| } | ||
|
|
||
| const auto setIfExists = [&](const std::string& glutenKey, const std::string& veloxKey) { |
There was a problem hiding this comment.
if this func is common, should we declare it as static member function?
There was a problem hiding this comment.
Yes, good suggestion, I think the config needs to do the refactor, now if the config not set by java side, we will set a default value of velox config to velox query config, we should change to not set the config.
If velox config default value is changed, we can change automatic with them if we don't have different default value with velox.
69a5f31 to
f8b94a4
Compare
|
Run Gluten ClickHouse CI on ARM |
|
Run Gluten ClickHouse CI on ARM |
2 similar comments
|
Run Gluten ClickHouse CI on ARM |
|
Run Gluten ClickHouse CI on ARM |
There was a problem hiding this comment.
Is it possible we finally get the generic benchmark and this feature merged in Gluten? Since both are invasive to the core execution code path. E.g., Generic benchmark requires for a individual parameter passed through JNI bridge and this feature requires for changing the ValueStreamNode to ValueNode. Do both features have own users as of now?
| std::string getQueryId(const std::unordered_map<std::string, std::string>& confMap) { | ||
| auto it = confMap.find(kQueryTraceQueryId); | ||
| if (it != confMap.end()) { | ||
| return it->second; | ||
| } | ||
| return ""; | ||
| } | ||
|
|
There was a problem hiding this comment.
A little bit confused with the empty query ID. Can we always give a meaningful ID to Velox?
There was a problem hiding this comment.
Can we uses the applicationId_stageId as the queryId?
There was a problem hiding this comment.
Can we uses the applicationId_stageId as the queryId?
Do we have applicationId passed through JNI? Or could just align it with the task name somehow.
There was a problem hiding this comment.
If applicationId is important for the feature and not yet passed through JNI, could also add it and use it for query context and task names / IDs in another PR. Thank you.
There was a problem hiding this comment.
The queryId concept is not really query id, it is velox query id, not Spark queryId, now one query for one task, so I think we can use the same name with task name.
There was a problem hiding this comment.
Now we don't pass the applicationId
| VELOX_CHECK_LT(streamIdx, inputIters_.size(), "Could not find stream index {} in input iterator list.", streamIdx); | ||
| const auto iterator = inputIters_[streamIdx]; | ||
| while (iterator->hasNext()) { | ||
| auto cb = VeloxColumnarBatch::from(defaultLeafVeloxMemoryPool().get(), iterator->next()); |
There was a problem hiding this comment.
Is the memory managed by Spark? Given that defaultLeafVeloxMemoryPool is global.
There was a problem hiding this comment.
This is only used in benchmark, so we don't need to track it.
| --- | ||
| layout: page | ||
| title: How To Use Gluten | ||
| nav_order: 1 | ||
| parent: Developer Overview | ||
| --- |
There was a problem hiding this comment.
The header needs to update.
There was a problem hiding this comment.
nit: Would you update title (and perhaps other fields as well) in the header? Thanks.
|
@jinchengchenghh It is awesome to support query tracing in gluten. By the way, I am planning to support |
|
Run Gluten ClickHouse CI on ARM |
|
The GenericBenchmark is frequently used by us CC @JkSelf , it is really usefully for debug in cpp side, we can use vscode to do step-by-step debug. And query trace is widely used in Velox to do correctness verification and performance profiling, so I think we need to enable it in Gluten. and with this optimization #8380 (comment), we could drop the ValueStreamNode to ValusNode replace. @zhztheplayer |
As discussed with @xiaoxmeng and @jinchengchenghh offline, we could add simple |
|
I tried the serialize ValueStreamNode as empty serialization, but failed by deserialization. The plan node deserialization is registered in velox, but ValueStreamNode exists in gluten, so we cannot do that. |
|
Can you help review again? Thanks! @zhztheplayer |
| --- | ||
| layout: page | ||
| title: How To Use Gluten | ||
| nav_order: 1 | ||
| parent: Developer Overview | ||
| --- |
There was a problem hiding this comment.
nit: Would you update title (and perhaps other fields as well) in the header? Thanks.
| .createWithDefault(false) | ||
|
|
||
| val QUERY_TRACE_DIR = buildConf("spark.gluten.sql.columnar.backend.velox.queryTraceDir") | ||
| .doc("Base dir of a query to store tracing data.") |
There was a problem hiding this comment.
What happens if one leaves this empty?
There was a problem hiding this comment.
If the config is not set correctly, Velox will throw exception, so as other config.
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
W0120 15:09:59.768762 3040717 GenericBenchmark.cc:253] Setting CPU for thread 0 to 0
terminate called after throwing an instance of 'facebook::velox::VeloxUserError'
what(): Exception: VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Query trace enabled but the trace dir is not set
Retriable: False
Expression: !queryConfig.queryTraceDir().empty()
Function: maybeMakeTraceConfig
File: /mnt/DP_disk1/code/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Task.cpp
Line: 3027
Stack trace:
There was a problem hiding this comment.
Can we set it with default /tmp/query_trace?
BTW, if we want expose the config to user, we need remove the internal()
There was a problem hiding this comment.
This config is only used in internal, user will not set it by this config name, because we need to enable query trace in benchmark, so user set it by FLAGS_XX in benchmark
| std::string getQueryId(const std::unordered_map<std::string, std::string>& confMap) { | ||
| auto it = confMap.find(kQueryTraceQueryId); | ||
| if (it != confMap.end()) { | ||
| return it->second; | ||
| } | ||
| return ""; | ||
| } | ||
|
|
There was a problem hiding this comment.
Can we uses the applicationId_stageId as the queryId?
Do we have applicationId passed through JNI? Or could just align it with the task name somehow.
|
Run Gluten ClickHouse CI on ARM |
4ac6780 to
f82e5a6
Compare
|
Run Gluten ClickHouse CI on ARM |
|
Thank you, Chengcheng for your effort! |
Run the MicroBenchmark to generate stage level plan and then enable query trace in benchmark to profile node level query. Benchmark with query trace enabled replaces ValueStreamNode which is hard to serialize to ValuesNode. This issue may be fixed by plan serialization optimization that only serializes the plan node to profile in velox query trace. facebookincubator/velox#12084
Run the MicroBenchmark to generate stage level plan and then enable query trace in benchmark to profile node level query. Benchmark with query trace enabled replaces ValueStreamNode which is hard to serialize to ValuesNode. This issue may be fixed by plan serialization optimization that only serializes the plan node to profile in velox query trace. facebookincubator/velox#12084
Run the MicroBenchmark to generate stage level plan and then enable query trace in benchmark to profile node level query. Benchmark with query trace enabled replaces ValueStreamNode which is hard to serialize to ValuesNode. This issue may be fixed by plan serialization optimization that only serializes the plan node to profile in velox query trace. facebookincubator/velox#12084