Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-36680][SQL][FOLLOWUP] Files with options should be put into resolveDataSource function #47370

Closed
wants to merge 5 commits into from

Conversation

logze
Copy link

@logze logze commented Jul 16, 2024

What changes were proposed in this pull request?

When reading csv, json and other files, pass the options parameter to the rules.resolveDataSource method to make the options parameter effective.

This is a bug fix for #46707 @szehon-ho

Why are the changes needed?

For the following SQL, the options parameter passed in does not take effect. This is because the rules.resolveDataSource method does not pass the options parameter during the datasource construction process

 SELECT * FROM csv.`/test/data.csv` WITH (`header` = true, 'delimiter' = '|')

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test in SQLQuerySuite

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jul 16, 2024
Copy link
Contributor

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, minor nit. Thanks for supporting Files DS

sparkSession,
paths = Seq(ident.last),
className = ident.head,
options = unresolved.options.asCaseSensitiveMap.asScala.toMap)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove .asCaseSensitiveMap

Copy link
Author

@logze logze Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you , I have removed this asCaseSensitiveMap .

@logze
Copy link
Author

logze commented Jul 17, 2024

cc @huaxingao @cloud-fan could you review this PR? This is my first PR in the Spark community. Thank you :)

@@ -19,6 +19,8 @@ package org.apache.spark.sql.execution.datasources

import java.util.Locale

import scala.jdk.CollectionConverters.MapHasAsScala
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more common to just import scala.jdk.CollectionConverters._


test("SPARK-36680: Files hint options should be put into resolveDataSource function") {
val df1 = spark.range(100).toDF()
withTempPath(f => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

withTempPath { f =>

}

""".stripMargin
)
checkAnswer(df2, df1)
df2.queryExecution.analyzed foreach {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make the test more explicit, I suggest to do

val relations = df2.queryExecution.analyzed.collect {
  case LogicalRelation(fs: HadoopFsRelation, _, _, _) => fs
}
assert(relations.length == 1)
assert(relations.head.options == ...)

@logze
Copy link
Author

logze commented Jul 18, 2024

@cloud-fan Thank you very much, I modified the code according to your suggestion, can you take a look at it again?

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 1a428c1 Jul 18, 2024
@logze logze deleted the hint-options branch July 18, 2024 11:48
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…solveDataSource function

### What changes were proposed in this pull request?

When reading csv, json and other files, pass the options parameter to the rules.resolveDataSource method to make the options parameter effective.

This is a bug fix for [apache#46707](apache#46707) szehon-ho

### Why are the changes needed?

For the following SQL, the options parameter passed in does not take effect. This is because the rules.resolveDataSource method does not pass the options parameter during the datasource construction process

```
 SELECT * FROM csv.`/test/data.csv` WITH (`header` = true, 'delimiter' = '|')
 ```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test in SQLQuerySuite

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47370 from logze/hint-options.

Authored-by: lizongze <lizongze@xiaomi.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…solveDataSource function

### What changes were proposed in this pull request?

When reading csv, json and other files, pass the options parameter to the rules.resolveDataSource method to make the options parameter effective.

This is a bug fix for [apache#46707](apache#46707) szehon-ho

### Why are the changes needed?

For the following SQL, the options parameter passed in does not take effect. This is because the rules.resolveDataSource method does not pass the options parameter during the datasource construction process

```
 SELECT * FROM csv.`/test/data.csv` WITH (`header` = true, 'delimiter' = '|')
 ```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test in SQLQuerySuite

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47370 from logze/hint-options.

Authored-by: lizongze <lizongze@xiaomi.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants