Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34015][R] Fixing input timing in gapply #31021

Closed
wants to merge 2 commits into from

Conversation

WamBamBoozle
Copy link

@WamBamBoozle WamBamBoozle commented Jan 5, 2021

What changes were proposed in this pull request?

When sparkR is run at log level INFO, a summary of how the worker spent its time processing the partition is printed. There is a logic error where it is over-reporting the time inputting rows.

In detail: the variable inputElap in a wider context is used to mark the end of reading rows, but in the part changed here it was used as a local variable for measuring the beginning of compute time in a loop over the groups in the partition. Thus, the error is not observable if there is only one group per partition, which is what you get in unit tests.

For our application, here's what a log entry looks like before these changes were applied:

20/10/09 04:08:58 INFO RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s

this indicates that we're spending more time reading rows than operating on the rows.

After these changes, it looks like this:

20/12/15 06:43:29 INFO RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s

Why are the changes needed?

Metrics shouldn't mislead?

Does this PR introduce any user-facing change?

Aside from no longer misleading, no

How was this patch tested?

unit tests passed. Field test results seem plausible

@github-actions github-actions bot added the R label Jan 5, 2021
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind:

@WamBamBoozle WamBamBoozle changed the title fixing input timing in worker.R [SPARK-34015][R] fixing input timing in worker.R Jan 5, 2021
@WamBamBoozle
Copy link
Author

@HyukjinKwon -- please have another look

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Test build #133695 has finished for PR 31021 at commit cbad669.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38283/

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38283/

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HyukjinKwon HyukjinKwon changed the title [SPARK-34015][R] fixing input timing in worker.R [SPARK-34015][R] Fixing input timing in gapply Jan 6, 2021
HyukjinKwon pushed a commit that referenced this pull request Jan 6, 2021
### What changes were proposed in this pull request?

When sparkR is run at log level INFO, a summary of how the worker spent its time processing the partition is printed. There is a logic error where it is over-reporting the time inputting rows.

In detail: the variable inputElap in a wider context is used to mark the end of reading rows, but in the part changed here it was used as a local variable for measuring the beginning of compute time in a loop over the groups in the partition. Thus, the error is not observable if there is only one group per partition, which is what you get in unit tests.

For our application, here's what a log entry looks like before these changes were applied:

`20/10/09 04:08:58 INFO RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s`

this indicates that we're spending more time reading rows than operating on the rows.

After these changes, it looks like this:

`20/12/15 06:43:29 INFO RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s
`
### Why are the changes needed?

Metrics shouldn't mislead?

### Does this PR introduce _any_ user-facing change?

Aside from no longer misleading, no

### How was this patch tested?

unit tests passed. Field test results seem plausible

Closes #31021 from WamBamBoozle/input_timing.

Authored-by: Tom.Howland <Tom.Howland@target.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 3d8ee49)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon
Copy link
Member

Merged to master and branch-3.1

@HyukjinKwon
Copy link
Member

Thanks, @WamBamBoozle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants