
[SPARK-3778] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs #4292

Closed

Conversation

tgravescs
Contributor

This was #2676.

https://issues.apache.org/jira/browse/SPARK-3778

This affects anyone trying to access secure HDFS with something like:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = {
  val hconf = new Configuration()
  hconf.set("mapred.input.dir", "mydir")
  hconf.set("textinputformat.record.delimiter", "\003432\n")
  sc.newAPIHadoopRDD(hconf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
}

@tgravescs
Contributor Author

@JoshRosen you had looked at this before, mind taking another look?

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26408 has started for PR 4292 at commit cf3b453.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26408 has finished for PR 4292 at commit cf3b453.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26408/

@vanzin
Contributor

vanzin commented Jan 30, 2015

+1

@harishreedharan
Contributor

+1

new NewHadoopRDD(this, fClass, kClass, vClass, conf)
// Add necessary security credentials to the JobConf. Required to access secure HDFS.
val jconf = new JobConf(conf)
SparkHadoopUtil.get.addCredentials(jconf)
Contributor


If the mode is not YARN, SparkHadoopUtil.addCredentials doesn't do anything, so this doesn't solve the problem in non-YARN mode.

Contributor


Since security is supported only in YARN mode, this should be fine.

Contributor


Yep, looks like addCredentials is implemented as a no-op, so this should be fine.
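
Roughly, the split the two comments above describe would look like this (a simplified sketch, not the actual Spark classes; the real code lives in SparkHadoopUtil and its YARN-specific subclass):

import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.security.UserGroupInformation

// Sketch only: the default implementation leaves the JobConf untouched (a no-op
// outside YARN), while the YARN-specific subclass merges the current user's
// delegation tokens into it.
class SparkHadoopUtilSketch {
  def addCredentials(conf: JobConf): Unit = {}
}

class YarnSparkHadoopUtilSketch extends SparkHadoopUtilSketch {
  override def addCredentials(conf: JobConf): Unit = {
    val userCreds = UserGroupInformation.getCurrentUser.getCredentials
    conf.getCredentials.mergeAll(userCreds)
  }
}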

@harishreedharan
Contributor

@pwendell - This is a small enough patch with relatively little risk. It would be great to merge this into 1.3.

@pwendell
Contributor

pwendell commented Feb 3, 2015

LGTM and seems very straightforward.

@JoshRosen
Contributor

LGTM, too, so I'm going to merge this into master (1.3.0). Thanks!

asfgit closed this in c31c36c on Feb 3, 2015
@JoshRosen
Contributor

Ugh, I just realized that this might regress behavior in some weird corner cases that arise from our shared mutable hadoopConfiguration. A common use case for sc.hadoopConfiguration is to pass credentials for S3 filesystems. The problem crops up when a user has already defined a bunch of RDDs and then mutates the configuration to pass credentials: in that case, I think this patch will break those user programs, because modifications to the hadoopConfiguration won't be reflected in the JobConf.
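
To make the corner case concrete, here is a sketch of the kind of program I mean (the input directory, the S3 keys, and the values are placeholders; the point is only the ordering of the calls):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// The input path is configured and the RDD is defined first; with this patch,
// a JobConf copy of the configuration is taken at definition time.
sc.hadoopConfiguration.set("mapred.input.dir", "s3n://bucket/dir")
val rdd = sc.newAPIHadoopRDD(sc.hadoopConfiguration,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

// The credentials are set afterwards on the shared, mutable hadoopConfiguration.
// Those mutations are not reflected in the JobConf copy made above.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "PLACEHOLDER")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "PLACEHOLDER")

rdd.count()  // may now fail to authenticate where it previously worked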

@JoshRosen
Contributor

For example:

15/02/02 22:49:12 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf, SequenceFileInputFormat, TextInputFormat}
import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf, SequenceFileInputFormat, TextInputFormat}

scala> import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.conf.Configuration

scala> val conf = new Configuration()
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml

scala> val jobConf = new JobConf(conf)
jobConf: org.apache.hadoop.mapred.JobConf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml

scala> jobConf.getInt("myInt", 0)
res3: Int = 0

scala> conf.setInt("myInt", 1)

scala> jobConf.getInt("myInt", 0)
res5: Int = 0

@pwendell
Contributor

pwendell commented Feb 3, 2015

@JoshRosen that's a great point, and it could regress behavior in ways that would be really hard for users to diagnose. @tgravescs, what about deferring the injection of the credentials until just before the conf is broadcast?
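
Something along these lines, just as a sketch of the idea (not the actual NewHadoopRDD code; a lazy val stands in here for the broadcast step):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.deploy.SparkHadoopUtil

// Sketch: keep the caller's Configuration as-is and only take the JobConf copy,
// with credentials added, at the last moment before it gets shipped to executors.
class DeferredJobConf(conf: Configuration) {
  lazy val jobConf: JobConf = {
    val jconf = new JobConf(conf)              // copy taken only when first needed
    SparkHadoopUtil.get.addCredentials(jconf)  // credentials injected at that point
    jconf
  }
}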

@pwendell
Contributor

pwendell commented Feb 3, 2015

We could also just leave it as-is and then do something like that if we find this is encountered by users.

@JoshRosen
Contributor

I found some previous discussion of this issue.

I'd say that expecting sc.hadoopConfiguration to be mutated by users after it's already been used to define RDDs isn't something we can or should realistically hope to support, because there are just too many ways it could break (e.g. defensive copying, serialization, etc.) and because it runs counter to user expectations around other types of Spark configuration (e.g. user modifications to SparkConf after creating a SparkContext will not take effect).
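
The SparkConf analogy is easy to see; a minimal illustration (the config key is a placeholder, and SparkContext works off a clone of the conf it is given):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("conf-mutation-example")
val sc = new SparkContext(conf)  // the context clones conf at this point

// A later modification to the original SparkConf is not seen by the running context,
// which is the same expectation being argued for sc.hadoopConfiguration above.
conf.set("spark.test.flag", "true")
sc.getConf.contains("spark.test.flag")  // false: the mutation had no effect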

@harishreedharan
Contributor

I think @JoshRosen is right. I don't think we need to worry about changes to the conf after the RDD has been defined. That makes sense.

@tgravescs
Contributor Author

Sorry for my delay, I was out last week.

I have to agree with @JoshRosen's last comment. If they have already created the RDD then I wouldn't expect any changes to the Hadoop configuration to apply.
