
[SPARK-28776][ML] SparkML Writer gets hadoop conf from session state #25505

Closed
wants to merge 1 commit

Conversation


@helenyugithub commented Aug 19, 2019

What changes were proposed in this pull request?

The SparkML writer gets the Hadoop conf from the session state instead of the Spark context.

Why are the changes needed?

This allows multiple sessions in the same context to have different Hadoop configurations.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested via pyspark.ml.tests.test_persistence.PersistenceTest.test_default_read_write
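
For illustration, a minimal sketch of the intended pattern (the helper and parameter names below are assumptions for this summary, not the literal diff from the ML writer code):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical helper showing where the Hadoop conf now comes from.
def resolveFileSystem(spark: SparkSession, path: String): FileSystem = {
  // Before: the shared, context-wide configuration was used
  //   val hadoopConf = spark.sparkContext.hadoopConfiguration
  // After: the session state builds a conf that folds in per-session overrides
  val hadoopConf = spark.sessionState.newHadoopConf()
  new Path(path).getFileSystem(hadoopConf)
}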

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28776] SparkML Writer gets hadoop conf from session state [SPARK-28776][ML] SparkML Writer gets hadoop conf from session state Aug 19, 2019
@helenyugithub
Author

@jkbradley you seem to have reviewed previous PRs to this file, like #18742. Would you be able to review this PR, or suggest someone else who could?

@srowen
Member

srowen commented Aug 20, 2019

Just for my reference, does this make it consistent with other similar code? I just want to understand the argument for why this needs to change: what problem does the current code cause?

@helenyugithub
Author

Hi @srowen,
I believe the general convention is to use the session state's Hadoop conf, e.g.:

val fs = FileSystem.get(spark.sessionState.newHadoopConf())

This also prevents correctness issues when multiple Spark sessions share the same context but want their own Hadoop conf (for example, one containing an authentication key that shouldn't be shared across users).
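
To make the multi-session concern concrete, here is a minimal sketch; the fs.s3a.access.key setting is just a placeholder for any per-user credential:

import org.apache.spark.sql.SparkSession

val base = SparkSession.builder().master("local[*]").getOrCreate()
// newSession() shares the SparkContext but gets its own session state and conf.
val userA = base.newSession()
val userB = base.newSession()

// Each session sets a credential that must not leak to the other user.
userA.conf.set("fs.s3a.access.key", "key-for-user-a")
userB.conf.set("fs.s3a.access.key", "key-for-user-b")

// sessionState.newHadoopConf() copies the context's Hadoop conf and then folds in
// the session's own conf entries, so each session sees only its own key;
// the shared sparkContext.hadoopConfiguration alone would not reflect these values.
assert(userA.sessionState.newHadoopConf().get("fs.s3a.access.key") == "key-for-user-a")
assert(userB.sessionState.newHadoopConf().get("fs.s3a.access.key") == "key-for-user-b")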

Member

@srowen left a comment

Sounds good to me. There are probably some more cases where a similar fix should be made then, but OK to address this isolated one.

@SparkQA

SparkQA commented Aug 20, 2019

Test build #4837 has finished for PR 25505 at commit 9c8f93c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@helenyugithub
Author

Thanks @srowen! Now that the tests have passed, is this PR good to merge?

@srowen
Member

srowen commented Aug 21, 2019

(I usually leave it open at least a day to catch any more comments)

bulldozer-bot bot pushed a commit to palantir/spark that referenced this pull request Aug 21, 2019
## Upstream SPARK-28776 ticket and PR link (if not applicable, explain)
apache#25505
## What changes were proposed in this pull request?
SparkML writer gets hadoop conf from session state, instead of the spark context.

## How was this patch tested?
Tested in pyspark.ml.tests.test_persistence.PersistenceTest test_default_read_write

Please review http://spark.apache.org/contributing.html before opening a pull request.
@srowen closed this in fb1f868 Aug 22, 2019
@srowen
Member

srowen commented Aug 22, 2019

Merged to master
