
[Gluten-986] Remove unimportant metrics in Agg and Join operator and input metrics for all operator. #998

Merged
12 commits merged into apache:main on Feb 28, 2023

Conversation

@Yohahaha Yohahaha (Contributor) commented Feb 22, 2023

What changes were proposed in this pull request?

Agg should not track the metrics of preProject, postProject, and extractionNeeded: these nodes do not change the row count (though they may change total bytes). To keep the metrics clear, we should remove them.

An operator's input metrics and output metrics are equal, so we only need the output metrics; this also aligns with Spark.

https://github.com/facebookincubator/velox/blob/db4ea520893c30ac9197e75b88a3b85bfc801758/velox/exec/Driver.cpp#L346-L373
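As a minimal sketch of the intent (hypothetical metric names; the actual changes are in the diff hunks reviewed below), a trimmed metrics map keeps only the output-side counters plus the operator's own wall time:

import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Hypothetical sketch of a trimmed metrics map for the Agg operator:
// input metrics and pre/postProject metrics are dropped.
override lazy val metrics: Map[String, SQLMetric] = Map(
  "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
  "numOutputBatches" -> SQLMetrics.createMetric(sparkContext, "number of output batches"),
  "aggWallNanos" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime of aggregation"))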

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename the commit message and the pull request title in the following format?

[Gluten-${ISSUES_ID}] ${detailed message}
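For reference, this PR's final title follows that format: [Gluten-986] Remove unimportant metrics in Agg and Join operator and input metrics for all operator.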


@Yohahaha Yohahaha changed the title from "Remove unimportant metrics in Agg operator." to "[Gluten-986]Remove unimportant metrics in Agg operator." on Feb 22, 2023
@github-actions

#986

@zhztheplayer zhztheplayer changed the title from "[Gluten-986]Remove unimportant metrics in Agg operator." to "[Gluten-986] Remove unimportant metrics in Agg operator." on Feb 22, 2023
@Yohahaha Yohahaha changed the title from "[Gluten-986] Remove unimportant metrics in Agg operator." to "[WIP][Gluten-986] Remove unimportant metrics in Agg and Join operator." on Feb 22, 2023
@FelixYBW FelixYBW (Contributor) commented:

Some of the metrics are used in our analysis tool. Let me check later.

@Yohahaha Yohahaha changed the title from "[WIP][Gluten-986] Remove unimportant metrics in Agg and Join operator." to "[WIP][Gluten-986] Remove unimportant metrics in Agg and Join operator and input metrics for all operator." on Feb 23, 2023
@Yohahaha Yohahaha changed the title from "[WIP][Gluten-986] Remove unimportant metrics in Agg and Join operator and input metrics for all operator." to "[Gluten-986] Remove unimportant metrics in Agg and Join operator and input metrics for all operator." on Feb 23, 2023
"concatTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime_coalescebatch"),
"numInputBatches" -> SQLMetrics.createMetric(sparkContext, "number of input batches"),
"numOutputBatches" -> SQLMetrics.createMetric(sparkContext, "number of output batches"),
"collectTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "time to collect batch"),
Review comment (Contributor):

Our script needs totaltime as a prefix. Let's change this to "totaltime to collect batch".
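For instance, the rename applied to the line above would look like this (a sketch of the suggestion, not the merged code):

"collectTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime to collect batch"),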

"numInputBatches" -> SQLMetrics.createMetric(sparkContext, "number of input batches"),
"numOutputBatches" -> SQLMetrics.createMetric(sparkContext, "number of output batches"),
"collectTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "time to collect batch"),
"concatTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "time to coalesce batch"),
Review comment (Contributor):

Change this to "totaltime to coalesce batch".

"numOutputBatches" -> SQLMetrics.createMetric(sparkContext, "output_batches"),
"processTime" -> SQLMetrics.createTimingMetric(sparkContext, "totaltime_rowtoarrowcolumnar")
"numOutputBatches" -> SQLMetrics.createMetric(sparkContext, "number of output batches"),
"convertTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to convert")
Review comment (Contributor):

Change this to "totaltime to convert".

"aggWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_aggregation"),
sparkContext, "wall time"),
Review comment (Contributor):

Change this to "totaltime of aggregation". We need to break down the operator time by its name.
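Applied to the diff above, the suggestion would read (illustrative only):

"aggWallNanos" -> SQLMetrics.createNanoTimingMetric(
  sparkContext, "totaltime of aggregation"),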

@@ -180,11 +180,11 @@ case class ColumnarShuffleExchangeAdaptor(override val outputPartitioning: Parti
override lazy val metrics: Map[String, SQLMetric] = Map(
"dataSize" -> SQLMetrics.createSizeMetric(sparkContext, "data size"),
"bytesSpilled" -> SQLMetrics.createSizeMetric(sparkContext, "shuffle bytes spilled"),
"computePidTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime_computepid"),
"splitTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime_split"),
"computePidTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "compute pid time"),
Review comment (Contributor):

Since we moved computepid into the Velox pipeline, it's not used now. We can delete it.

"computePidTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime_computepid"),
"splitTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime_split"),
"computePidTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "compute pid time"),
"splitTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "split time"),
Review comment (Contributor):

"totaltime to split"

"spillTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "shuffle spill time"),
"compressTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime_compress"),
"prepareTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime_prepare"),
"compressTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "compress time"),
Review comment (Contributor):

Change this to "totaltime to compress".

"compressTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime_compress"),
"prepareTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "totaltime_prepare"),
"compressTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "compress time"),
"prepareTime" -> SQLMetrics.createNanoTimingMetric(sparkContext, "prepare time"),
Review comment (Contributor):

Change this to "totaltime to prepare".

"preProjectionCount" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "preProjection cpu wall time count"),
"preProjectionWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_preProjection"),
Review comment (Contributor):

Let's keep this one; maybe change it to "totaltime of preProjection".

"postProjectionCount" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "postProjection cpu wall time count"),
"postProjectionWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_postProjection"),
Review comment (Contributor):

Let's keep this one; change it to "totaltime of postProjection".

"buildCount" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "build side cpu wall time count"),
"buildWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_build_input"),
Review comment (Contributor):

We need this. It actually counts the Arrow-to-Velox conversion.

"streamCount" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "stream side cpu wall time count"),
"streamWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_stream_input"),
Review comment (Contributor):

Keep this; change it to "totaltime stream input".

"streamWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_stream_input"),
"streamVeloxToArrow" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_velox_to_arrow_converter"),
Review comment (Contributor):

Keep it; it only counts part of the conversion.

"streamPreProjectionCount" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "stream preProjection cpu wall time count"),
"streamPreProjectionWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_stream_preProjection"),
Review comment (Contributor):

Keep this.

"buildPreProjectionCount" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "build preProjection cpu wall time count"),
"buildPreProjectionWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_build_preProjection"),
Review comment (Contributor):

Keep it.

"hashBuildWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_hashbuild"),
sparkContext, "hash build wall time"),
Review comment (Contributor):

Change this to "totaltime of hash build".

"hashProbeWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_hashprobe"),
sparkContext, "hash probe wall time"),
Review comment (Contributor):

Change this to "totaltime of probe".

"postProjectionCount" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "postProjection cpu wall time count"),
"postProjectionWallNanos" -> SQLMetrics.createNanoTimingMetric(
sparkContext, "totaltime_postProjection"),
Review comment (Contributor):

Keep it; use "totaltime of postProjection".

sparkContext, "number of build preProjection memory allocations"),

"hashBuildInputRows" -> SQLMetrics.createMetric(
sparkContext, "number of hash build input rows"),
Review comment (Contributor):

Keep this. Then we needn't check the previous operator to get the input rows.

@@ -248,26 +136,16 @@ trait HashJoinLikeExecTransformer
"hashBuildSpilledFiles" -> SQLMetrics.createMetric(
sparkContext, "total spilled files of hash build"),

"hashProbeInputRows" -> SQLMetrics.createMetric(
sparkContext, "number of hash probe input rows"),
Review comment (Contributor):

Same here; let's keep this one as well.

"hashProbeRawInputRows" -> SQLMetrics.createMetric(
sparkContext, "number of hash probe raw input rows"),
"hashProbeRawInputBytes" -> SQLMetrics.createSizeMetric(
sparkContext, "number of hash probe raw input bytes"),
"hashProbeOutputRows" -> SQLMetrics.createMetric(
sparkContext, "number of hash probe output rows"),
"hashProbeOutputVectors" -> SQLMetrics.createMetric(
sparkContext, "number of hash probe output vectors"),
"hashProbeOutputBytes" -> SQLMetrics.createSizeMetric(
sparkContext, "number of hash probe output bytes"),
Review comment (Contributor):

Looks like it's broken; let's remove it. It currently shows:

number of hash probe output bytes -39,682,000,181,513,076,736 (495,594,061,042,182,848.0)

"postProjectionRawInputBytes" -> SQLMetrics.createSizeMetric(
sparkContext, "number of postProjection raw input bytes"),
"postProjectionOutputRows" -> SQLMetrics.createMetric(
sparkContext, "number of postProjection output rows"),
Review comment (Contributor):

Let's keep it. It's the final output row count from the hash join.

sparkContext, "number of postProjection raw input bytes"),
"postProjectionOutputRows" -> SQLMetrics.createMetric(
sparkContext, "number of postProjection output rows"),
"postProjectionOutputVectors" -> SQLMetrics.createMetric(
Review comment (Contributor):

And this one.

@FelixYBW FelixYBW (Contributor) commented Feb 25, 2023

We can remove the other metrics. Thank you for your cleanup, @Yohahaha.

@FelixYBW FelixYBW (Contributor) commented:

@waitinfuture OK with me.

@waitinfuture waitinfuture (Contributor) left a comment:

LGTM, thanks!

@FelixYBW FelixYBW merged commit c826496 into apache:main Feb 28, 2023
zhli1142015 pushed a commit to zhli1142015/gluten that referenced this pull request Mar 5, 2023
…input metrics for all operator. (apache#998)

clean up the metrics