
[SPARK-3778] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs #4292

Closed

Conversation

tgravescs
Contributor

This was #2676.

https://issues.apache.org/jira/browse/SPARK-3778

This affects anyone trying to access secure HDFS with something like:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = {
  val hconf = new Configuration()
  hconf.set("mapred.input.dir", "mydir")
  hconf.set("textinputformat.record.delimiter", "\003432\n")
  sc.newAPIHadoopRDD(hconf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
}

@tgravescs
Contributor Author

@JoshRosen you had looked at this before, mind taking another look?

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26408 has started for PR 4292 at commit cf3b453.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26408 has finished for PR 4292 at commit cf3b453.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26408/

@vanzin
Contributor

vanzin commented Jan 30, 2015

+1

@harishreedharan
Contributor

+1

new NewHadoopRDD(this, fClass, kClass, vClass, conf)
// Add necessary security credentials to the JobConf. Required to access secure HDFS.
val jconf = new JobConf(conf)
SparkHadoopUtil.get.addCredentials(jconf)
Contributor


If the mode is not YARN, SparkHadoopUtil.addCredentials doesn't do anything, so this doesn't solve the problem in non-YARN mode.

Contributor


Since security is supported only in YARN mode, this should be fine.

Contributor


Yep, looks like addCredentials is implemented as a no-op, so this should be fine.
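
Roughly, the split the two comments above describe would look like this (a simplified sketch, not the actual Spark classes; the real code lives in SparkHadoopUtil and its YARN-specific subclass):

import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.security.UserGroupInformation

// Sketch only: the default implementation leaves the JobConf untouched (a no-op
// outside YARN), while the YARN-specific subclass merges the current user's
// delegation tokens into it.
class SparkHadoopUtilSketch {
  def addCredentials(conf: JobConf): Unit = {}
}

class YarnSparkHadoopUtilSketch extends SparkHadoopUtilSketch {
  override def addCredentials(conf: JobConf): Unit = {
    val userCreds = UserGroupInformation.getCurrentUser.getCredentials
    conf.getCredentials.mergeAll(userCreds)
  }
}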

@harishreedharan
Contributor

@pwendell - This is a small enough patch with relatively little risk. It would be great to merge this into 1.3.

@pwendell
Contributor

pwendell commented Feb 3, 2015

LGTM and seems very straightforward.

@JoshRosen
Contributor

LGTM, too, so I'm going to merge this into master (1.3.0). Thanks!

asfgit closed this in c31c36c on Feb 3, 2015
@JoshRosen
Contributor

Ugh, I just realized that this might regress behavior in some weird corner cases that arise from our shared mutable hadoopConfiguration. A common use case for sc.hadoopConfiguration is to pass credentials for S3 filesystems. The problem crops up when a user has already defined a bunch of RDDs and then mutates the configuration to pass credentials: in that case, I think this patch will break those user programs, because modifications to the hadoopConfiguration won't be reflected in the JobConf.
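
To make the corner case concrete, here is a sketch of the kind of program I mean (the input directory, the S3 keys, and the values are placeholders; the point is only the ordering of the calls):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// The input path is configured and the RDD is defined first; with this patch,
// a JobConf copy of the configuration is taken at definition time.
sc.hadoopConfiguration.set("mapred.input.dir", "s3n://bucket/dir")
val rdd = sc.newAPIHadoopRDD(sc.hadoopConfiguration,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

// The credentials are set afterwards on the shared, mutable hadoopConfiguration.
// Those mutations are not reflected in the JobConf copy made above.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "PLACEHOLDER")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "PLACEHOLDER")

rdd.count()  // may now fail to authenticate where it previously worked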

@JoshRosen
Contributor

For example:

15/02/02 22:49:12 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf, SequenceFileInputFormat, TextInputFormat}
import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf, SequenceFileInputFormat, TextInputFormat}

scala> import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.conf.Configuration

scala> val conf = new Configuration()
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml

scala> val jobConf = new JobConf(conf)
jobConf: org.apache.hadoop.mapred.JobConf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml

scala> jobConf.getInt("myInt", 0)
res3: Int = 0

scala> conf.setInt("myInt", 1)

scala> jobConf.getInt("myInt", 0)
res5: Int = 0

@pwendell
Contributor

pwendell commented Feb 3, 2015

@JoshRosen that's a great point, and it could regress behavior in ways that would be really hard for users to diagnose. @tgravescs, what about deferring the injection of the credentials until just before the conf is broadcast?
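
Something along these lines, just as a sketch of the idea (not the actual NewHadoopRDD code; a lazy val stands in here for the broadcast step):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.deploy.SparkHadoopUtil

// Sketch: keep the caller's Configuration as-is and only take the JobConf copy,
// with credentials added, at the last moment before it gets shipped to executors.
class DeferredJobConf(conf: Configuration) {
  lazy val jobConf: JobConf = {
    val jconf = new JobConf(conf)              // copy taken only when first needed
    SparkHadoopUtil.get.addCredentials(jconf)  // credentials injected at that point
    jconf
  }
}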

@pwendell
Contributor

pwendell commented Feb 3, 2015

We could also just leave it as-is and then do something like that if we find this is encountered by users.

@JoshRosen
Contributor

I found some previous discussion of this issue.

I'd say that expecting sc.hadoopConfiguration to be mutated by users after it's already been used to define RDDs isn't something we can or should realistically hope to support, because there are just too many ways it could break (e.g. defensive copying, serialization, etc.) and because it runs counter to user expectations around other types of Spark configuration (e.g. user modifications to SparkConf after creating a SparkContext will not take effect).
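
The SparkConf analogy is easy to see; a minimal illustration (the config key is a placeholder, and SparkContext works off a clone of the conf it is given):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("conf-mutation-example")
val sc = new SparkContext(conf)  // the context clones conf at this point

// A later modification to the original SparkConf is not seen by the running context,
// which is the same expectation being argued for sc.hadoopConfiguration above.
conf.set("spark.test.flag", "true")
sc.getConf.contains("spark.test.flag")  // false: the mutation had no effect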

@harishreedharan
Contributor

I think @JoshRosen is right. I don't think we need to worry about changes to the conf after the RDD has been defined. That makes sense.

@tgravescs
Contributor Author

Sorry for my delay, I was out last week.

I have to agree with @JoshRosen's last comment. If they have already created the RDD then I wouldn't expect any changes to the Hadoop configuration to apply.
