[SPARK-32208][SQL] Spark SQL throw Illegal character exception when load certain abnormal path of HDFS #30707

Closed

southernriver
Contributor

What changes were proposed in this pull request?

In the distributed HDFS storage system, spaces and other special characters are allowed in a path:

hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000

When we load data using

org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.hive.orc.OrcFileFormat

an exception may be thrown, as shown below:

Caused by: java.net.URISyntaxException: Illegal character in path at index 136: hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parseHierarchical(URI.java:3105)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:252)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:250)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:256)
... 10 more
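
For reference, the failure can be reproduced with java.net.URI alone, without Spark or Hadoop. The sketch below is only an illustration: the class name, the helper firstIllegalIndex, and the shortened path are hypothetical and not part of this patch. It relies on the single-argument URI(String) constructor rejecting a raw space and reporting its position through URISyntaxException.getIndex().

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, not part of Spark or Hadoop.
final class RawSpaceInUri {

    /** Returns the index reported by URISyntaxException, or -1 if the string parses as a URI. */
    static int firstIllegalIndex(String s) {
        try {
            new URI(s);          // strict parsing: illegal characters are rejected
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex(); // position of the offending character
        }
    }

    public static void main(String[] args) {
        // Shortened, made-up path shaped like the one in the report.
        String raw = "hdfs://ns1/tmp/test_table=2020-06-17 18%3A00%3A00/part-00000";

        // The raw space is rejected; the reported index is the position of the space.
        System.out.println(firstIllegalIndex(raw) == raw.indexOf(' '));

        // With the space written as %20 the same string parses.
        System.out.println(firstIllegalIndex(raw.replace(" ", "%20")));
    }
}
```

The existing %3A sequences are already valid escapes, so only the raw space triggers the exception here.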

Hadoop's Path class provides several constructors for building a path:

https://github.com/apache/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java

We could fall back to constructing a Path from a String rather than from a URI.

Why are the changes needed?

It is reasonable for ParquetFileFormat and OrcFileFormat to support every path that HDFS allows.

Does this PR introduce any user-facing change?

No

How was this patch tested?

manual

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions github-actions bot added the SQL label Dec 10, 2020
@srowen
Member

srowen commented Dec 10, 2020

Doesn't that space have to be escaped in the URI?

@southernriver
Contributor Author

Doesn't that space have to be escaped in the URI?

hdfs://ns1/tmp2/ 2020 is a full location path.
When a Path is constructed from a URI, as below:

  /**
   * Construct a path from a URI
   */
  public Path(URI aUri) {
    uri = aUri.normalize();
  }
   /**
     * @param  str   The string to be parsed into a URI
     *
     * @throws  NullPointerException
     *          If {@code str} is {@code null}
     *
     * @throws  URISyntaxException
     *          If the given string violates RFC&nbsp;2396, as augmented
     *          by the above deviations
     */
    public URI(String str) throws URISyntaxException {
        new Parser(str).parse(false);
    }

a java.net.URISyntaxException is thrown.
We can then fall back to constructing a Path from the full location string, which contains a space.

/** Construct a path from a String.  Path strings are URIs, but with
   * unescaped elements and some additional normalization. */
  public Path(String pathString) throws IllegalArgumentException {
    checkPathArg( pathString );
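
To make the difference concrete without depending on Hadoop, here is a small sketch using only java.net.URI. It assumes that Path(String) ends up passing separate scheme, authority and path components to a multi-argument URI constructor; that is my reading of Path.java and should be checked against the linked source. The class and method names and the shortened path are hypothetical. The JDK documents that the multi-argument constructors quote illegal characters and always quote '%'.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration; the real logic lives in org.apache.hadoop.fs.Path.
final class QuotingConstructor {

    /** Builds a URI from unescaped components; illegal characters are quoted by the JDK. */
    static URI fromComponents(String scheme, String authority, String path) {
        try {
            return new URI(scheme, authority, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up directory name containing a literal space and literal "%3A".
        String path = "/tmp/test_table=2020-06-17 18%3A00%3A00/part-00000";
        URI u = fromComponents("hdfs", "ns1", path);

        // Raw form: space becomes %20 and each literal '%' becomes %25.
        System.out.println(u.getRawPath());

        // Decoded form is the original, unescaped path again.
        System.out.println(u.getPath().equals(path));
    }
}
```

Note that the literal "%3A" is treated as plain text here and is re-escaped to "%253A", which differs from parsing the same text with the single-argument URI(String) constructor.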

@@ -170,7 +170,15 @@ class OrcFileFormat
(file: PartitionedFile) => {
val conf = broadcastedConf.value.value

val filePath = new Path(new URI(file.filePath))
var path: Option[Path] = None
import scala.util.Try
Member

I would prefer to avoid relying on try-catch whenever possible.

Also, a space in a URI doesn't look correct. HDFS itself has complicated behaviour around paths and URIs, so let's not follow that behaviour unless it's known to be legitimate (is it?). file.filePath should be a URI, so the behaviour of new Path(new URI(file.filePath)) seems correct.

@srowen
Member

srowen commented Dec 11, 2020

@southernriver hdfs://ns1/tmp2/ 2020 is not a valid URI though. You'd have to escape the space, no? Does it work then?
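
A quick way to check what escaping does is to look at how java.net.URI decodes the result. The sketch below uses only the JDK; the class name, helper and shortened path are hypothetical, and it says nothing about what Spark or HDFS do with the value afterwards. It suggests that escaping the space alone parses, but a literal "%3A" in the directory name is then decoded to ":", so a literal '%' would also need to be written as %25 for the name to round-trip.

```java
import java.net.URI;

// Hypothetical illustration using only java.net.URI.
final class EscapedRoundTrip {

    /** Parses an already-escaped URI string and returns its decoded path component. */
    static String decodedPath(String uriString) {
        return URI.create(uriString).getPath();
    }

    public static void main(String[] args) {
        // Made-up directory whose name literally contains a space and "%3A".
        String dir = "/tmp/test_table=2020-06-17 18%3A00%3A00";

        // Only the space escaped: parses, but %3A is decoded to ':'.
        System.out.println(decodedPath("hdfs://ns1/tmp/test_table=2020-06-17%2018%3A00%3A00"));

        // Space and literal '%' both escaped: decodes back to the directory name.
        System.out.println(
            decodedPath("hdfs://ns1/tmp/test_table=2020-06-17%2018%253A00%253A00").equals(dir));
    }
}
```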

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 22, 2021
@github-actions github-actions bot closed this Mar 23, 2021