Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31360][WEBUI] Fix hung-up issue in StagePage #28136

Closed
wants to merge 1 commit into from

Conversation

sarutak
Copy link
Member

@sarutak sarutak commented Apr 6, 2020

What changes were proposed in this pull request?

This change (SPARK-31360) fixes a hung-up issue in StagePage.
StagePage will be hung-up with following operations.

  1. Run a job with shuffle.
    scala> sc.parallelize(1 to 10).map(x => (x, x)).reduceByKey(_ + _).collect

  2. Visit StagePage for the stage writing shuffle data and check Shuffle Write Time.

check-shuffle-write-time

  1. Run a job with no shuffle.
    scala> sc.parallelize(1 to 10).collect

  2. Visit StagePage for the last stage.

hungup

This issue is caused by following reason.

In stagepage.js, an array optionalColumns has indices for columns for optional metrics.
If a stage doesn't perform shuffle read or write, the corresponding indices are removed from the array.
StagePage doesn't try to create column for such metrics, even if the state of corresponding optional metrics are preserved as "visible".
But, if a stage doesn't perform both shuffle read and write, the index for Shuffle Write Time isn't removed.
In that case, StagePage tries to create a column for Shuffle Write Time even though there are no metrics for shuffle write, leading hungup.

Why are the changes needed?

This is a bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I tested with operations I explained above and confirmed that StagePage won't be hung-up.

@SparkQA
Copy link

SparkQA commented Apr 6, 2020

Test build #120871 has finished for PR 28136 at commit fb89c37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @sarutak and @srowen .
I also verified the bug. Merged to master.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

This change (SPARK-31360) fixes a hung-up issue in StagePage.
StagePage will be hung-up with following operations.

1. Run a job with shuffle.
`scala> sc.parallelize(1 to 10).map(x => (x, x)).reduceByKey(_ + _).collect`

2. Visit StagePage for the stage writing shuffle data and check `Shuffle Write Time`.
<img width="401" alt="check-shuffle-write-time" src="https://user-images.githubusercontent.com/4736016/78557730-4513e200-784c-11ea-8b42-a5053b9489a5.png">

3. Run a job with no shuffle.
`scala> sc.parallelize(1 to 10).collect`

4. Visit StagePage for the last stage.
<img width="956" alt="hungup" src="https://user-images.githubusercontent.com/4736016/78557746-4f35e080-784c-11ea-83e8-5db745b88535.png">

This issue is caused by following reason.

In stagepage.js, an array `optionalColumns` has indices for columns for optional metrics.
If a stage doesn't perform shuffle read or write, the corresponding indices are removed from the array.
StagePage doesn't try to create column for such metrics, even if the state of corresponding optional metrics are preserved as "visible".
But, if a stage doesn't perform both shuffle read and write, the index for `Shuffle Write Time` isn't removed.
In that case, StagePage tries to create a column for `Shuffle Write Time` even though there are no metrics for shuffle write, leading hungup.

### Why are the changes needed?

This is a bug.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I tested with operations I explained above and confirmed that StagePage won't be hung-up.

Closes apache#28136 from sarutak/fix-ui-hungup.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jun 25, 2020
### What changes were proposed in this pull request?

This change (SPARK-31360) fixes a hung-up issue in StagePage.
StagePage will be hung-up with following operations.

1. Run a job with shuffle.
`scala> sc.parallelize(1 to 10).map(x => (x, x)).reduceByKey(_ + _).collect`

2. Visit StagePage for the stage writing shuffle data and check `Shuffle Write Time`.
<img width="401" alt="check-shuffle-write-time" src="https://user-images.githubusercontent.com/4736016/78557730-4513e200-784c-11ea-8b42-a5053b9489a5.png">

3. Run a job with no shuffle.
`scala> sc.parallelize(1 to 10).collect`

4. Visit StagePage for the last stage.
<img width="956" alt="hungup" src="https://user-images.githubusercontent.com/4736016/78557746-4f35e080-784c-11ea-83e8-5db745b88535.png">

This issue is caused by following reason.

In stagepage.js, an array `optionalColumns` has indices for columns for optional metrics.
If a stage doesn't perform shuffle read or write, the corresponding indices are removed from the array.
StagePage doesn't try to create column for such metrics, even if the state of corresponding optional metrics are preserved as "visible".
But, if a stage doesn't perform both shuffle read and write, the index for `Shuffle Write Time` isn't removed.
In that case, StagePage tries to create a column for `Shuffle Write Time` even though there are no metrics for shuffle write, leading hungup.

### Why are the changes needed?

This is a bug.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I tested with operations I explained above and confirmed that StagePage won't be hung-up.

Closes apache#28136 from sarutak/fix-ui-hungup.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
### What changes were proposed in this pull request?

This change (SPARK-31360) fixes a hung-up issue in StagePage.
StagePage will be hung-up with following operations.

1. Run a job with shuffle.
`scala> sc.parallelize(1 to 10).map(x => (x, x)).reduceByKey(_ + _).collect`

2. Visit StagePage for the stage writing shuffle data and check `Shuffle Write Time`.
<img width="401" alt="check-shuffle-write-time" src="https://user-images.githubusercontent.com/4736016/78557730-4513e200-784c-11ea-8b42-a5053b9489a5.png">

3. Run a job with no shuffle.
`scala> sc.parallelize(1 to 10).collect`

4. Visit StagePage for the last stage.
<img width="956" alt="hungup" src="https://user-images.githubusercontent.com/4736016/78557746-4f35e080-784c-11ea-83e8-5db745b88535.png">

This issue is caused by following reason.

In stagepage.js, an array `optionalColumns` has indices for columns for optional metrics.
If a stage doesn't perform shuffle read or write, the corresponding indices are removed from the array.
StagePage doesn't try to create column for such metrics, even if the state of corresponding optional metrics are preserved as "visible".
But, if a stage doesn't perform both shuffle read and write, the index for `Shuffle Write Time` isn't removed.
In that case, StagePage tries to create a column for `Shuffle Write Time` even though there are no metrics for shuffle write, leading hungup.

### Why are the changes needed?

This is a bug.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I tested with operations I explained above and confirmed that StagePage won't be hung-up.

Closes apache#28136 from sarutak/fix-ui-hungup.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants