Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33101][ML][3.0] Make LibSVM format propagate Hadoop config from DS options to underlying HDFS file system #29986

Closed
wants to merge 1 commit into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Oct 9, 2020

What changes were proposed in this pull request?

Propagate LibSVM options to Hadoop configs in the LibSVM datasource.

Why are the changes needed?

There is a bug that when running:

spark.read.format("libsvm").options(conf).load(path)

The underlying file system will not receive the conf options.

Does this PR introduce any user-facing change?

Yes. After the changes, for example, users should read files from Azure Data Lake successfully:

def hadoopConf1() = Map[String, String](
  s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")
val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")

and not get the following exception because the settings above are not propagated to the filesystem:

java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
	at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
	at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
	at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
	at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)

How was this patch tested?

Added UT to LibSVMRelationSuite.

Authored-by: Max Gekk max.gekk@gmail.com
Signed-off-by: Dongjoon Hyun dhyun@apple.com
(cherry picked from commit 1234c66)
Signed-off-by: Max Gekk max.gekk@gmail.com

…options to underlying HDFS file system

Propagate LibSVM options to Hadoop configs in the LibSVM datasource.

There is a bug that when running:
```scala
spark.read.format("libsvm").options(conf).load(path)
```
The underlying file system will not receive the `conf` options.

Yes. After the changes, for example, users should read files from Azure Data Lake successfully:
```scala
def hadoopConf1() = Map[String, String](
  s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")
val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
```
and not get the following exception because the settings above are not propagated to the filesystem:
```java
java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
	at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
	at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
	at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
	at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
```

Added UT to `LibSVMRelationSuite`.

Closes apache#29984 from MaxGekk/ml-option-propagation.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 1234c66)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 9, 2020

@HyukjinKwon @dongjoon-hyun This PR does not have conflicts with branch-2.4.

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34188/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34188/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Test build #129583 has finished for PR 29986 at commit 5bde04f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to branch-3.0 and branch-2.4.

HyukjinKwon pushed a commit that referenced this pull request Oct 9, 2020
…m DS options to underlying HDFS file system

### What changes were proposed in this pull request?
Propagate LibSVM options to Hadoop configs in the LibSVM datasource.

### Why are the changes needed?
There is a bug that when running:
```scala
spark.read.format("libsvm").options(conf).load(path)
```
The underlying file system will not receive the `conf` options.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, for example, users should read files from Azure Data Lake successfully:
```scala
def hadoopConf1() = Map[String, String](
  s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")
val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
```
and not get the following exception because the settings above are not propagated to the filesystem:
```java
java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
	at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
	at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
	at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
	at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
```

### How was this patch tested?
Added UT to `LibSVMRelationSuite`.

Authored-by: Max Gekk <max.gekkgmail.com>
Signed-off-by: Dongjoon Hyun <dhyunapple.com>
(cherry picked from commit 1234c66)
Signed-off-by: Max Gekk <max.gekkgmail.com>

Closes #29986 from MaxGekk/ml-option-propagation-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Oct 9, 2020
…m DS options to underlying HDFS file system

### What changes were proposed in this pull request?
Propagate LibSVM options to Hadoop configs in the LibSVM datasource.

### Why are the changes needed?
There is a bug that when running:
```scala
spark.read.format("libsvm").options(conf).load(path)
```
The underlying file system will not receive the `conf` options.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, for example, users should read files from Azure Data Lake successfully:
```scala
def hadoopConf1() = Map[String, String](
  s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")
val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
```
and not get the following exception because the settings above are not propagated to the filesystem:
```java
java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
	at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
	at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
	at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
	at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
```

### How was this patch tested?
Added UT to `LibSVMRelationSuite`.

Authored-by: Max Gekk <max.gekkgmail.com>
Signed-off-by: Dongjoon Hyun <dhyunapple.com>
(cherry picked from commit 1234c66)
Signed-off-by: Max Gekk <max.gekkgmail.com>

Closes #29986 from MaxGekk/ml-option-propagation-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit dcffa56)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon closed this Oct 9, 2020
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…m DS options to underlying HDFS file system

### What changes were proposed in this pull request?
Propagate LibSVM options to Hadoop configs in the LibSVM datasource.

### Why are the changes needed?
There is a bug that when running:
```scala
spark.read.format("libsvm").options(conf).load(path)
```
The underlying file system will not receive the `conf` options.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, for example, users should read files from Azure Data Lake successfully:
```scala
def hadoopConf1() = Map[String, String](
  s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")
val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
```
and not get the following exception because the settings above are not propagated to the filesystem:
```java
java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
	at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
	at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
	at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
	at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
```

### How was this patch tested?
Added UT to `LibSVMRelationSuite`.

Authored-by: Max Gekk <max.gekkgmail.com>
Signed-off-by: Dongjoon Hyun <dhyunapple.com>
(cherry picked from commit 1234c66)
Signed-off-by: Max Gekk <max.gekkgmail.com>

Closes apache#29986 from MaxGekk/ml-option-propagation-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@MaxGekk MaxGekk deleted the ml-option-propagation-3.0 branch December 11, 2020 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants