[SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths when inferring schema #29663

MaxGekk · 2020-09-07T17:35:07Z

What changes were proposed in this pull request?

In this PR, I propose to fix an issue with the CSV and JSON data sources in Spark SQL that occurs when both of the following are true:

* no user-specified schema is provided
* some file paths contain escaped glob metacharacters, such as [, ], {, }, * etc.

Why are the changes needed?

To fix the issue where the following two queries try to read from the paths [abc].csv and [abc].json:

spark.read.csv("""/tmp/\[abc\].csv""").show
spark.read.json("""/tmp/\[abc\].json""").show

but fail with an exception:

org.apache.spark.sql.AnalysisException: Path does not exist: file:/tmp/[abc].csv;
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:722)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:392)
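
The failure mode can be illustrated without Spark. The sketch below is only an analogy: it uses the JDK's `java.nio` glob matcher rather than the Hadoop globbing that Spark relies on, and the `DoubleGlobSketch` object and its `matches` helper are hypothetical names introduced for illustration. It assumes the problem arises because a path that has already been resolved from an escaped pattern is interpreted as a glob pattern a second time during schema inference.

```scala
import java.nio.file.{FileSystems, Paths}

object DoubleGlobSketch {
  // Hypothetical helper: true when `fileName` matches `pattern`
  // under java.nio glob syntax (an analogy for Hadoop globbing).
  def matches(pattern: String, fileName: String): Boolean =
    FileSystems.getDefault
      .getPathMatcher("glob:" + pattern)
      .matches(Paths.get(fileName))

  def main(args: Array[String]): Unit = {
    val literalName = "[abc].csv"

    // First pass: the escaped pattern selects the file literally named [abc].csv.
    println(matches("""\[abc\].csv""", literalName))

    // Second pass: the already-resolved name is treated as a pattern again.
    // Now [abc] is a character class, so the literal file no longer matches...
    println(matches(literalName, literalName))

    // ...while an unrelated file such as a.csv would match.
    println(matches(literalName, "a.csv"))
  }
}
```

Under these assumptions, the second pass is what turns an existing file into a "Path does not exist" error, which is why skipping the repeated globbing during schema inference avoids the exception.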

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Added new test cases in DataFrameReaderWriterSuite.

SparkQA · 2020-09-07T20:47:45Z

Test build #128362 has finished for PR 29663 at commit 8d9cff6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-09-08T00:44:54Z

Merged to branch-2.4.

…aths when inferring schema

Closes #29663 from MaxGekk/globbing-paths-when-inferring-schema-2.4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

…aths with glob metacharacters

### What changes were proposed in this pull request?

In the PR, I propose to fix an issue with the LibSVM data source when both of the following are true:

* no user-specified schema
* some file paths contain escaped glob metacharacters, such as `[`, `]`, `{`, `}`, `*` etc.

The fix is a backport of #29675, and it is based on another bug fix for the CSV/JSON data sources, #29663.

### Why are the changes needed?

To fix the issue where the following query tries to read from the path `[abc]`:

```scala
spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
```

but fails with an exception:

```
Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
org.apache.spark.sql.AnalysisException: Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
  at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
  at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
  at scala.util.Success.$anonfun$map$1(Try.scala:255)
  at scala.util.Success.map(Try.scala:213)
```

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

Added UT to `LibSVMRelationSuite`.

Closes #29678 from MaxGekk/globbing-paths-when-inferring-schema-ml-2.4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

MaxGekk added 5 commits September 7, 2020 20:14

CSV/JSON data sources should avoid globbing paths when inferring schema

3615e10

Use map(_ == "true").getOrElse(true)

dd7c881

globPaths -> __globPaths__

3b26fea

Remove unrelated to CSV/JSON changes

ef34951

Avoid unnecessary changes

8d9cff6

probot-autolabeler bot added the SQL label Sep 7, 2020

MaxGekk mentioned this pull request Sep 7, 2020

[SPARK-32810][SQL] CSV/JSON data sources should avoid globbing paths when inferring schema #29659

Closed

HyukjinKwon approved these changes Sep 8, 2020

HyukjinKwon closed this Sep 8, 2020

This was referenced Sep 8, 2020

[SPARK-32815][ML][2.4] Fix LibSVM data source loading error on file paths with glob metacharacters #29676

Closed

[SPARK-32815][ML][2.4] Fix LibSVM data source loading error on file paths with glob metacharacters #29678

Closed
