
[SPARK-10330] Use SparkHadoopUtil TaskAttemptContext reflection methods in more places #8499

Conversation

JoshRosen
Contributor

SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places.
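
For context, here is a minimal sketch of the kind of reflective accessor SparkHadoopUtil provides, modeled on the getConfigurationFromJobContext helper that shows up in the grep output further down; the actual method in Spark may differ in signature and error handling:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.JobContext

object HadoopConfAccess {
  // Resolve getConfiguration() via reflection on the runtime class instead of
  // compiling a direct call. A direct call bakes in either invokevirtual or
  // invokeinterface depending on whether the compile-time Hadoop API declares
  // JobContext/TaskAttemptContext as a class (1.x) or an interface (2.x), and
  // the mismatched opcode fails at runtime with IncompatibleClassChangeError.
  def getConfigurationFromJobContext(context: JobContext): Configuration = {
    val method = context.getClass.getMethod("getConfiguration")
    method.invoke(context).asInstanceOf[Configuration]
  }
}
```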

@JoshRosen
Contributor Author

I discovered this issue while trying to extend our spark-avro library to publish a single artifact that is compatible with multiple Hadoop versions. spark-avro itself works around these incompatibilities using the same reflection tricks, but its tests failed due to calls in Spark that did not use reflection; see databricks/spark-avro#79 for additional context / discussion.

I grepped to find all of the places using the non-reflective calls, but it's possible I missed a call site or two. I'm opening this PR now for testing and initial feedback.

@SparkQA

SparkQA commented Aug 28, 2015

Test build #41738 has finished for PR 8499 at commit 28d213d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

@vanzin, @srowen, does the lack of reflection in the existing code mean that builds created with the hadoop-provided profile will have broken data source features under certain deployed Hadoop versions?

@JoshRosen
Contributor Author

/cc @pwendell also.

@srowen
Member

srowen commented Aug 28, 2015

Yes, my understanding is that without reflection this would not work at runtime with Hadoop 1.x. At least, that was what prompted the original change.

@vanzin
Contributor

vanzin commented Aug 28, 2015

Yes, what Sean says. We may want to qualify that the "without hadoop" tarball is only compatible with Hadoop 2.x (and even if we fix the binary compat issues, some parts, such as the YARN backend, will only work on Hadoop 2.x anyway).

@JoshRosen
Contributor Author

I'd like to pull this into branch-1.5 since this change will simplify certain compatibility checks for the spark-avro library. Does anyone have review comments concerning the safety / correctness of these changes?

@vanzin
Contributor

vanzin commented Aug 28, 2015

The code looks fine, but do you want to pull this into 1.5.0? Wouldn't that mean a "-1" vote and a new rc?

@vanzin
Contributor

vanzin commented Aug 28, 2015

(Just to clarify: pushing to branch-1.5 now means the change will make it to 1.5.0 if there's a new rc. If it's not really meant to go into 1.5.0, we should wait until the release vote passes before pushing to the branch.)

@davies
Contributor

davies commented Aug 28, 2015

davies@localhost:~/work/spark$ git grep getConfiguration sql/ | grep -v getConfigurationFromJobContext
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala:  protected val serializableConf = new SerializableConfiguration(job.getConfiguration)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala:    job.getConfiguration.set("spark.sql.sources.writeJobUUID", uniqueWriteJobId.toString)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONRelation.scala:    val conf = job.getConfiguration
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala:    val conf = context.getConfiguration
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/DirectParquetOutputCommitter.scala:    val configuration = ContextUtil.getConfiguration(jobContext)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala:    val conf = ContextUtil.getConfiguration(job)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala:    val conf = job.getConfiguration
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala:    overrideMinSplitSize(parquetBlockSize, job.getConfiguration)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetTypesConverter.scala:    val conf = configuration.getOrElse(ContextUtil.getConfiguration(job))
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala:    job.getConfiguration match {
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala:    val conf = job.getConfiguration
davies@localhost:~/work/spark$ git grep getConfiguration core/ | grep -v getConfigurationFromJobContext
core/src/main/scala/org/apache/spark/SparkContext.scala:    val updateConf = job.getConfiguration
core/src/main/scala/org/apache/spark/SparkContext.scala:    val updateConf = job.getConfiguration
core/src/main/scala/org/apache/spark/SparkContext.scala:    val updatedConf = job.getConfiguration
core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:   * call `JobContext/TaskAttemptContext.getConfiguration`, it will generate different byte codes
core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:    val method = context.getClass.getMethod("getConfiguration")
core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala:    job.getConfiguration.set("mapred.output.dir", path)
core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala:    saveAsNewAPIHadoopDataset(job.getConfiguration)
core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala:    val wrappedConf = new SerializableConfiguration(job.getConfiguration)
core/src/main/scala/org/apache/spark/rdd/SqlNewHadoopRDD.scala:    job.getConfiguration
core/src/test/java/org/apache/spark/JavaAPISuite.java:      Text.class, new Job().getConfiguration());
core/src/test/scala/org/apache/spark/FileSuite.scala:    job.getConfiguration.set("mapred.output.dir", tempDir.getPath + "/outputDataset_new")
core/src/test/scala/org/apache/spark/FileSuite.scala:    randomRDD.saveAsNewAPIHadoopDataset(job.getConfiguration)

@JoshRosen
Contributor Author

I'll add a Scalastyle rule to catch these patterns and will fix these cases.
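
For illustration, a hedged sketch of how one of the call sites from the grep output above (the PairRDDFunctions line) could be rewritten; the exact change in the follow-up patch may differ:

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.deploy.SparkHadoopUtil

// Before: calls Job#getConfiguration directly, so the compile-time Hadoop
// version's invocation opcode ends up in the published bytecode.
def setOutputDirDirect(job: Job, path: String): Unit =
  job.getConfiguration.set("mapred.output.dir", path)

// After: go through the reflective accessor so the same artifact runs
// against both Hadoop 1.x and 2.x at runtime.
def setOutputDirReflective(job: Job, path: String): Unit =
  SparkHadoopUtil.get.getConfigurationFromJobContext(job).set("mapred.output.dir", path)
```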

@srowen
Member

srowen commented Aug 29, 2015

Regarding when to push to 1.5.x, my $0.02: in theory, if a change is suitable for 1.5.1 it's suitable for 1.5.0, and if it's too risky for 1.5.0, it's probably too risky for 1.5.1. In practice I appreciate there's an extra level of caution between RCs, but implementing that means either branching the branch so things can go into 1.5.1 that don't go into 1.5.0, or just holding back a couple of days. The RC process is still taking weeks in general, so the holding-off approach is not that tenable.

@marmbrus
Contributor

This is a common technique throughout the codebase, and broken compatibility is pretty annoying for library writers, so I'd be inclined to include this in branch-1.5 (before the next RC is cut).

@marmbrus
Contributor

Talked to @rxin, going to merge.

asfgit closed this in 6a6f3c9 on Aug 29, 2015
asfgit pushed a commit that referenced this pull request Aug 29, 2015
…ds in more places

SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8499 from JoshRosen/use-hadoop-reflection-in-more-places.

(cherry picked from commit 6a6f3c9)
Signed-off-by: Michael Armbrust <michael@databricks.com>
asfgit pushed a commit that referenced this pull request Sep 12, 2015
…obContext methods

This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8521 from JoshRosen/SPARK-10330-part2.
JoshRosen deleted the use-hadoop-reflection-in-more-places branch on October 16, 2015 at 21:48