[SPARK-13071] [hadoop-2.7] Coalescing HadoopRDD overwrites existing input metrics #10973

Closed


andrewor14
Contributor

This issue is causing tests to fail consistently on master with Hadoop 2.6 / 2.7. The cause is that for Hadoop 2.5+, each call to HadoopRDD#compute overwrites the existing value of InputMetrics#bytesRead. In the case of coalesce, e.g.

sc.textFile(..., 4).coalesce(2).count()

we call compute multiple times in the same task, and each call overwrites the bytesRead value recorded by the previous calls to compute.
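To make the failure mode concrete, here is a toy model in plain Python. It is not Spark code: `InputMetrics`, `compute_overwriting`, `compute_accumulating` and `run_task` are hypothetical names invented for this sketch, and the partition sizes are made up. It only illustrates why overwriting on every compute call under-reports bytes read once partitions are coalesced, and how adding to the value that was already there avoids that.

```python
# Toy model (not Spark code): each task owns one InputMetrics object, and a
# coalesced task calls compute() once per parent partition it reads.
# All names and sizes below are hypothetical, chosen for illustration only.


class InputMetrics:
    def __init__(self):
        self.bytes_read = 0


def compute_overwriting(metrics, partition_bytes):
    # Problem pattern: the value for this partition replaces whatever
    # earlier compute() calls in the same task had already recorded.
    metrics.bytes_read = partition_bytes


def compute_accumulating(metrics, partition_bytes):
    # One way to avoid the loss: remember the value present when this call
    # started and report on top of it.
    existing = metrics.bytes_read
    metrics.bytes_read = existing + partition_bytes


def run_task(compute, partition_sizes):
    # A single task reading several parent partitions in sequence.
    metrics = InputMetrics()
    for size in partition_sizes:
        compute(metrics, size)
    return metrics.bytes_read


# 4 input partitions coalesced into 2 tasks of 2 partitions each.
tasks = [[100, 200], [300, 400]]
overwritten = sum(run_task(compute_overwriting, t) for t in tasks)
accumulated = sum(run_task(compute_accumulating, t) for t in tasks)
print(overwritten, accumulated)  # prints "600 1000"
```

In this model the overwriting variant reports only the last partition read by each task (200 + 400), while the accumulating variant reports all 1000 bytes.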

For a regression test, see the existing test "input metrics for old hadoop with coalesce" in InputOutputMetricsSuite. I did not add a new regression test because that is impossible without significant refactoring; there is a lot of duplicated code in this corner of Spark.

This was caused by #10835.

@andrewor14
Contributor Author

@yhuai @JoshRosen

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50335 has finished for PR 10973 at commit c5a97fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jan 29, 2016

test this please

@yhuai
Contributor

yhuai commented Jan 29, 2016

maybe also put hadoop2.7 in the title, so jenkins can test it?

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50348 has finished for PR 10973 at commit c5a97fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 andrewor14 changed the title [SPARK-13071] Coalescing HadoopRDD overwrites existing input metrics [SPARK-13071] [hadoop2.7] Coalescing HadoopRDD overwrites existing input metrics Jan 29, 2016
@andrewor14
Contributor Author

like this? I didn't know Jenkins can do that

@andrewor14
Copy link
Contributor Author

retest this please

@andrewor14 andrewor14 changed the title [SPARK-13071] [hadoop2.7] Coalescing HadoopRDD overwrites existing input metrics [SPARK-13071] [hadoop-2.7] Coalescing HadoopRDD overwrites existing input metrics Jan 29, 2016
@andrewor14
Contributor Author

let's try again. retest this please

@nongli
Contributor

nongli commented Jan 29, 2016

LGTM

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50394 has finished for PR 10973 at commit c5a97fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50395 has finished for PR 10973 at commit c5a97fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50408 has finished for PR 10973 at commit c5a97fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 29, 2016

Test build #2478 has finished for PR 10973 at commit c5a97fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

Merging into master.

@asfgit asfgit closed this in 12252d1 Jan 30, 2016
@andrewor14 andrewor14 deleted the fix-input-metrics-coalesce branch January 30, 2016 02:06