[SPARK-32810][SQL] CSV/JSON data sources should avoid globbing paths when inferring schema #29659
Conversation
Thanks to @rednaxelafx for the provided fix. @HyukjinKwon @cloud-fan, could you review this PR?
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala
I have checked that the test from this PR fails in Spark 3.0 and 2.4 too. Also, the PR conflicts with branch-3.0 and branch-2.4.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
Test build #128351 has finished for PR 29659.
Test build #128353 has finished for PR 29659.
Test build #128354 has finished for PR 29659.
Merged to master.
…with glob metacharacters

### What changes were proposed in this pull request?
In the PR, I propose to fix an issue with the LibSVM datasource when both of the following are true:
* no user-specified schema
* some file paths contain escaped glob metacharacters, such as `[``]`, `{``}`, `*`, etc.

The fix is based on another bug fix for the CSV/JSON datasources, #29659.

### Why are the changes needed?
To fix the issue where the following query tries to read from the path `[abc]`:
```scala
spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
```
but ends up hitting an exception:
```
Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
org.apache.spark.sql.AnalysisException: Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
  at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
  at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
  at scala.util.Success.$anonfun$map$1(Try.scala:255)
  at scala.util.Success.map(Try.scala:213)
```

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added UT to `LibSVMRelationSuite`.

Closes #29670 from MaxGekk/globbing-paths-when-inferring-schema-ml.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…aths with glob metacharacters

The branch-3.0 backport of the LibSVM fix: the fix is a backport of #29670, and it is based on another bug fix for the CSV/JSON datasources, #29659. The description otherwise matches #29670; a unit test was added to `LibSVMRelationSuite`.

Closes #29675 from MaxGekk/globbing-paths-when-inferring-schema-ml-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
In the PR, I propose to fix an issue with the CSV and JSON data sources in Spark SQL when both of the following are true:
* no user-specified schema
* some file paths contain escaped glob metacharacters, such as `[``]`, `{``}`, `*`, etc.
Why are the changes needed?
To fix the issue where the following two queries try to read from the paths `[abc].csv` and `[abc].json` but end up hitting an exception.
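As in the LibSVM reproduction, the paths in these queries have their glob metacharacters escaped by hand (e.g. `/tmp/\[abc\].csv`). A minimal sketch of that escaping in plain Scala — `escapeGlobChars` is an illustrative helper, not a Spark API:

```scala
// Illustrative sketch: escape glob metacharacters in a literal path so that
// a glob-interpreting file API treats them as ordinary characters.
// `escapeGlobChars` is a hypothetical helper, not part of Spark's API.
object GlobEscape {
  private val metaChars = Set('{', '}', '[', ']', '*', '?')

  def escapeGlobChars(path: String): String =
    path.flatMap { c =>
      if (metaChars.contains(c)) s"\\$c" else c.toString
    }
}
```

For example, `GlobEscape.escapeGlobChars("/tmp/[abc].csv")` yields `/tmp/\[abc\].csv`, the escaped form the queries pass to `spark.read`.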
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Added new test cases in `DataFrameReaderWriterSuite`.
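The fix is ultimately about when globbing happens: during schema inference, the data source should not re-glob paths that have already been resolved, otherwise escaped metacharacters are interpreted a second time. A rough sketch of the kind of metacharacter check involved — simplified, ignoring escaping and Hadoop `Path` semantics, and not Spark's actual implementation:

```scala
// Rough sketch of a glob-detection check: a path is only worth globbing if it
// contains glob metacharacters. Simplified relative to Spark's real check
// (no handling of escaped characters or Hadoop Path semantics).
object GlobDetect {
  private val metaChars = "{}[]*?".toSet

  def isGlobPath(path: String): Boolean =
    path.exists(metaChars.contains)
}
```

Under this sketch, a resolved output file such as `/tmp/[abc]/part-00000.csv` would still look like a glob, which is why passing already-globbed paths through glob expansion again leads to the "Path does not exist" failure described above.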