[SPARK-34015][R] Fixing input timing in gapply #31021

WamBamBoozle · 2021-01-05T01:24:07Z

What changes were proposed in this pull request?

When SparkR runs at log level INFO, it prints a summary of how the worker spent its time processing each partition. A logic error causes that summary to over-report the time spent reading input rows.

In detail: the variable inputElap is used in the wider context to mark the end of reading rows, but in the code changed here it was reused as a local variable marking the start of compute time inside the loop over the groups in the partition. Each iteration therefore overwrote the end-of-read marker. The error is not observable when there is only one group per partition, which is the case in the unit tests.
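The reuse of the marker can be illustrated with a minimal sketch. This is not the code from worker.R: it is a Python stand-in driven by a fake clock, and the names `FakeClock`, `run_partition`, and `reuse_marker` are hypothetical. Only the idea comes from the description above: one variable marks the end of reading, and the buggy path overwrites it at the start of each group's compute.

```python
class FakeClock:
    """Deterministic clock so the example is reproducible (hypothetical helper)."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def run_partition(group_costs, read_cost, reuse_marker):
    """Simulate timing one partition; return the reported phase durations."""
    clock = FakeClock()
    start = clock.now
    clock.advance(read_cost)      # reading all input rows
    input_elap = clock.now        # intended meaning: end of reading
    compute_total = 0.0
    for cost in group_costs:
        if reuse_marker:
            # Buggy variant: the end-of-read marker doubles as the
            # per-group compute start, so it is overwritten each iteration.
            input_elap = clock.now
            group_start = input_elap
        else:
            # Fixed variant: a separate local marks the compute start.
            group_start = clock.now
        clock.advance(cost)       # computing on this group
        compute_total += clock.now - group_start
    return {"read-input": input_elap - start, "compute": compute_total}


buggy = run_partition([5.0, 5.0, 5.0], read_cost=10.0, reuse_marker=True)
fixed = run_partition([5.0, 5.0, 5.0], read_cost=10.0, reuse_marker=False)
print(buggy)  # {'read-input': 20.0, 'compute': 15.0}
print(fixed)  # {'read-input': 10.0, 'compute': 15.0}
```

With three groups, the buggy accounting attributes the first two groups' compute time to read-input. With a single group both variants report the same numbers, which is consistent with the unit tests not catching the problem.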

For our application, here's what a log entry looks like before these changes were applied:

20/10/09 04:08:58 INFO RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s

This suggests that we're spending more time reading rows than operating on them.

After these changes, it looks like this:

20/12/15 06:43:29 INFO RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s
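To compare the two entries, the share of total time attributed to read-input can be computed directly from the log lines. The sketch below is only an illustration: `parse_times` and `read_input_share` are hypothetical helpers, and the regular expression assumes the `name = value s` layout seen in the two lines quoted above. The entries come from different runs, so the comparison is indicative rather than exact.

```python
import re

BEFORE = ("20/10/09 04:08:58 INFO RRunner: Times: boot = 0.013 s, init = 0.005 s, "
          "broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, "
          "write-output = 0.020 s, total = 1021.546 s")
AFTER = ("20/12/15 06:43:29 INFO RRunner: Times: boot = 0.013 s, init = 0.010 s, "
         "broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, "
         "write-output = 0.045 s, total = 1812.553 s")


def parse_times(line):
    """Return {phase: seconds} for each 'name = value s' pair in the line."""
    pairs = re.findall(r"([a-z-]+) = ([0-9.]+) s", line)
    return {name: float(value) for name, value in pairs}


def read_input_share(line):
    """Fraction of the reported total that is attributed to read-input."""
    times = parse_times(line)
    return times["read-input"] / times["total"]


for label, line in (("before", BEFORE), ("after", AFTER)):
    print(f"{label}: read-input is {read_input_share(line):.1%} of total")
```

On these two lines, read-input accounts for roughly half of the reported total before the change and well under a tenth after it.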

Why are the changes needed?

Metrics shouldn't mislead.

Does this PR introduce any user-facing change?

No, aside from the timing summary no longer being misleading.

How was this patch tested?

Unit tests passed. Field test results seem plausible.

HyukjinKwon

Do you mind:

keeping the PR description template https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
filing a JIRA, see http://spark.apache.org/contributing.html
showing how you tested this logic, for example with the logs before and after this change

WamBamBoozle · 2021-01-05T18:19:46Z

@HyukjinKwon -- please have another look

HyukjinKwon · 2021-01-06T00:35:40Z

ok to test

SparkQA · 2021-01-06T01:49:57Z

Test build #133695 has finished for PR 31021 at commit cbad669.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-06T01:52:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38283/

SparkQA · 2021-01-06T02:28:01Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38283/

HyukjinKwon

LGTM

Closes #31021 from WamBamBoozle/input_timing.

Authored-by: Tom.Howland <Tom.Howland@target.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 3d8ee49)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

HyukjinKwon · 2021-01-06T02:40:46Z

Merged to master and branch-3.1

HyukjinKwon · 2021-01-06T02:42:03Z

Thanks, @WamBamBoozle.

fixing input timing in worker.R

f19cfb0

github-actions bot added the R label Jan 5, 2021

HyukjinKwon reviewed Jan 5, 2021

WamBamBoozle changed the title ~~fixing input timing in worker.R~~ [SPARK-34015][R] fixing input timing in worker.R Jan 5, 2021

fixing output timing too

cbad669

HyukjinKwon approved these changes Jan 6, 2021

HyukjinKwon changed the title ~~[SPARK-34015][R] fixing input timing in worker.R~~ [SPARK-34015][R] Fixing input timing in gapply Jan 6, 2021

HyukjinKwon closed this in 3d8ee49 Jan 6, 2021
