
[SPARK-10330] Use SparkHadoopUtil TaskAttemptContext reflection methods in more places #8499

Conversation

JoshRosen
Contributor

SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places.
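
For context, here is a minimal sketch of the kind of reflective accessor SparkHadoopUtil provides, modeled on the getConfigurationFromJobContext helper that shows up in the grep output further down; the actual method in Spark may differ in signature and error handling:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.JobContext

object HadoopConfAccess {
  // Resolve getConfiguration() via reflection on the runtime class instead of
  // compiling a direct call. A direct call bakes in either invokevirtual or
  // invokeinterface depending on whether the compile-time Hadoop API declares
  // JobContext/TaskAttemptContext as a class (1.x) or an interface (2.x), and
  // the mismatched opcode fails at runtime with IncompatibleClassChangeError.
  def getConfigurationFromJobContext(context: JobContext): Configuration = {
    val method = context.getClass.getMethod("getConfiguration")
    method.invoke(context).asInstanceOf[Configuration]
  }
}
```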

@JoshRosen
Contributor Author

I discovered this issue while trying to extend our spark-avro library to publish a single artifact that is compatible with multiple Hadoop versions. spark-avro itself works around these incompatibilities using the same reflection tricks, but its tests failed due to calls in Spark that did not use reflection; see databricks/spark-avro#79 for additional context / discussion.

I grepped to find all of the places using the non-reflective calls, but it's possible I missed a call site or two. I'm opening this PR now for testing and initial feedback.

@SparkQA

SparkQA commented Aug 28, 2015

Test build #41738 has finished for PR 8499 at commit 28d213d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

@vanzin, @srowen, does the lack of reflection in the existing code mean that builds created with the hadoop-provided profile will have broken data source features under certain deployed Hadoop versions?

@JoshRosen
Contributor Author

/cc @pwendell also.

@srowen
Member

srowen commented Aug 28, 2015

Yes, my understanding is that without reflection this would not work at runtime with Hadoop 1.x. At least, that was what prompted the original change.

@vanzin
Contributor

vanzin commented Aug 28, 2015

Yes, what Sean says. We may want to qualify that the "without hadoop" tarball is only compatible with Hadoop 2.x (and even if we fix the binary compat issues, some parts, such as the YARN backend, will only work on Hadoop 2.x anyway).

@JoshRosen
Contributor Author

I'd like to pull this into branch-1.5 since this change will simplify certain compatibility checks for the spark-avro library. Does anyone have review comments concerning the safety / correctness of these changes?

@vanzin
Contributor

vanzin commented Aug 28, 2015

The code looks fine, but do you want to pull this into 1.5.0? Wouldn't that mean a "-1" vote and a new rc?

@vanzin
Contributor

vanzin commented Aug 28, 2015

(Just to clarify: pushing to branch-1.5 now means the change will make it to 1.5.0 if there's a new rc. If it's not really meant to go into 1.5.0, we should wait until the release vote passes before pushing to the branch.)

@davies
Contributor

davies commented Aug 28, 2015

davies@localhost:~/work/spark$ git grep getConfiguration sql/ | grep -v getConfigurationFromJobContext
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala:  protected val serializableConf = new SerializableConfiguration(job.getConfiguration)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala:    job.getConfiguration.set("spark.sql.sources.writeJobUUID", uniqueWriteJobId.toString)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONRelation.scala:    val conf = job.getConfiguration
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala:    val conf = context.getConfiguration
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/DirectParquetOutputCommitter.scala:    val configuration = ContextUtil.getConfiguration(jobContext)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala:    val conf = ContextUtil.getConfiguration(job)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala:    val conf = job.getConfiguration
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala:    overrideMinSplitSize(parquetBlockSize, job.getConfiguration)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetTypesConverter.scala:    val conf = configuration.getOrElse(ContextUtil.getConfiguration(job))
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala:    job.getConfiguration match {
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala:    val conf = job.getConfiguration
davies@localhost:~/work/spark$ git grep getConfiguration core/ | grep -v getConfigurationFromJobContext
core/src/main/scala/org/apache/spark/SparkContext.scala:    val updateConf = job.getConfiguration
core/src/main/scala/org/apache/spark/SparkContext.scala:    val updateConf = job.getConfiguration
core/src/main/scala/org/apache/spark/SparkContext.scala:    val updatedConf = job.getConfiguration
core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:   * call `JobContext/TaskAttemptContext.getConfiguration`, it will generate different byte codes
core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:    val method = context.getClass.getMethod("getConfiguration")
core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala:    job.getConfiguration.set("mapred.output.dir", path)
core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala:    saveAsNewAPIHadoopDataset(job.getConfiguration)
core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala:    val wrappedConf = new SerializableConfiguration(job.getConfiguration)
core/src/main/scala/org/apache/spark/rdd/SqlNewHadoopRDD.scala:    job.getConfiguration
core/src/test/java/org/apache/spark/JavaAPISuite.java:      Text.class, new Job().getConfiguration());
core/src/test/scala/org/apache/spark/FileSuite.scala:    job.getConfiguration.set("mapred.output.dir", tempDir.getPath + "/outputDataset_new")
core/src/test/scala/org/apache/spark/FileSuite.scala:    randomRDD.saveAsNewAPIHadoopDataset(job.getConfiguration)

@JoshRosen
Contributor Author

I'll add a Scalastyle rule to catch these patterns and will fix these cases.
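
For illustration, a hedged sketch of how one of the call sites from the grep output above (the PairRDDFunctions line) could be rewritten; the exact change in the follow-up patch may differ:

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.deploy.SparkHadoopUtil

// Before: calls Job#getConfiguration directly, so the compile-time Hadoop
// version's invocation opcode ends up in the published bytecode.
def setOutputDirDirect(job: Job, path: String): Unit =
  job.getConfiguration.set("mapred.output.dir", path)

// After: go through the reflective accessor so the same artifact runs
// against both Hadoop 1.x and 2.x at runtime.
def setOutputDirReflective(job: Job, path: String): Unit =
  SparkHadoopUtil.get.getConfigurationFromJobContext(job).set("mapred.output.dir", path)
```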

@srowen
Member

srowen commented Aug 29, 2015

Regarding when to push to 1.5.x, my $0.02: in theory, if a change is suitable for 1.5.1 it's suitable for 1.5.0, and if it's too risky for 1.5.0, it's probably too risky for 1.5.1. In practice I appreciate there's an extra level of caution between RCs, but implementing that means either branching the branch so things can go into 1.5.1 that don't go into 1.5.0, or just holding back a couple of days. The RC process is still taking weeks in general, so the holding-off approach is not that tenable.

@marmbrus
Contributor

This is a common technique throughout the codebase, and broken compatibility is pretty annoying for library writers, so I'd be inclined to include this in branch-1.5 (before the next RC is cut).

@marmbrus
Contributor

Talked to @rxin, going to merge.

asfgit closed this in 6a6f3c9 on Aug 29, 2015
asfgit pushed a commit that referenced this pull request Aug 29, 2015
…ds in more places

SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8499 from JoshRosen/use-hadoop-reflection-in-more-places.

(cherry picked from commit 6a6f3c9)
Signed-off-by: Michael Armbrust <michael@databricks.com>
asfgit pushed a commit that referenced this pull request Sep 12, 2015
…obContext methods

This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8521 from JoshRosen/SPARK-10330-part2.
JoshRosen deleted the use-hadoop-reflection-in-more-places branch on October 16, 2015 at 21:48