
[SPARK-24176][SQL] LOAD DATA can't identify wildcard in the hdfs file path #21285

Closed
wants to merge 1 commit into from


kevinyu98
Contributor

What changes were proposed in this pull request?

When the path in a LOAD DATA command contains a wildcard character such as "?", the path-related APIs (Hadoop Path and java.net.URI) do not parse it correctly. For example,
val srcPath = new Path(hdfsUri) in tables.scala returns the wrong result in the following cases:

  • hdfsUri = file:/user/testdemo1/user1/t??eddata60.txt
    srcPath = file:/user/testdemo1/user1/t
  • hdfsUri = file:/user/testdemo1/user1/?eddata60.txt
    srcPath = file:/user/testdemo1/user1/
    (the same problem occurs at val uriPath = uri.getPath())
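
The truncation in the two cases above can be reproduced with java.net.URI alone. The following is a minimal sketch, assuming the URI is built from a single string (as opposed to separate components); the object and method names are hypothetical and it does not involve Hadoop's Path or Spark's tables.scala. In a single-string URI, the first "?" ends the path component and starts the query component, so getPath returns only the part before it.

```scala
import java.net.URI

object QuestionMarkInUri {
  // Hypothetical helper: parse a single-string URI and return its
  // (path, query) components. The first '?' terminates the path.
  def pathAndQuery(s: String): (String, String) = {
    val uri = new URI(s)
    (uri.getPath, uri.getQuery)
  }

  def main(args: Array[String]): Unit = {
    // Everything after the first '?' lands in the query, not the path.
    println(pathAndQuery("file:/user/testdemo1/user1/t??eddata60.txt"))
    println(pathAndQuery("file:/user/testdemo1/user1/?eddata60.txt"))
  }
}
```

The remainder of the file name is not lost by the parser; it is simply stored in the query component, which code that reads only getPath never sees.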

LOAD DATA LOCAL works because the local case calls the utility Utils.resolveURI, which replaces "?" with "%3F", so the Path API does not truncate the file name.
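
The effect of escaping "?" as "%3F" can be sketched with java.net.URI's multi-argument constructor, which percent-encodes characters that are not legal in a path. This is an illustration only: escapedFileUri is a hypothetical helper, not Spark's Utils.resolveURI, and the "file" scheme is assumed for the example.

```scala
import java.net.URI

object EscapedQuestionMark {
  // Hypothetical helper (NOT Spark's Utils.resolveURI): build the URI from
  // separate components (scheme, host, path, fragment) so that java.net.URI
  // percent-encodes characters that are illegal in a path, including '?'.
  def escapedFileUri(path: String): URI = new URI("file", null, path, null)

  def main(args: Array[String]): Unit = {
    val uri = escapedFileUri("/user/testdemo1/user1/t??eddata60.txt")
    println(uri)         // string form, with each '?' written as %3F
    println(uri.getPath) // decoded path, with the full file name intact
  }
}
```

Because the "?" characters are encoded, the URI has no query component and getPath returns the complete file name.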

This fix uses Utils.resolveURI method for both local and non-local cases.

I ran a similar test on Hive, and Hive appears to have the same behavior.

hive> load data inpath 'hdfs:/tmp/?evin.txt' into table foo1;
FAILED: SemanticException Line 1:17 Invalid path ''hdfs:/tmp/?evin.txt'': No files matching path hdfs://stcindia-node-6.fyre.ibm.com:8020/tmp/%3Fevin.txt
hive> load data inpath 'hdfs:/tmp/k?evin.txt' into table foo1;
FAILED: SemanticException Line 1:17 Invalid path ''hdfs:/tmp/k?evin.txt'': No files matching path hdfs://stcindia-node-6.fyre.ibm.com:8020/tmp/k%3Fevin.txt
hive>

How was this patch tested?

Ran the unit tests locally and added new test cases.

@HyukjinKwon
Member

Is this a duplicate of #20611?

@HyukjinKwon
Member

cc @wzhfy and @sujith71955

@kevinyu98
Contributor Author

kevinyu98 commented May 10, 2018

@HyukjinKwon thanks for reviewing this PR. I hadn't noticed that PR until you pointed it out. If we plan to support wildcards in the LOAD DATA command, then we can close this PR.
But with its current code, the problem reported in this JIRA still exists, because in the non-local case the path is truncated after val srcPath = new Path(loadPath). I downloaded that code, and it still has the same issue this PR reports.
I created text1.txt on my local machine and then ran LOAD DATA:
load data inpath '/Users/qianyangyu/IdeaProjects/spark/??xt1.txt' into table foo1; succeeded
load data inpath '/Users/qianyangyu/IdeaProjects/spark/t?xt1.txt' into table foo1; failed
`spark-sql> load data inpath '/Users/qianyangyu/IdeaProjects/spark/??xt1.txt' into table foo1;
Time taken: 0.112 seconds

spark-sql> load data inpath '/Users/qianyangyu/IdeaProjects/spark/t?xt1.txt' into table foo1;
Error in query: LOAD DATA input path does not exist: /Users/qianyangyu/IdeaProjects/spark/t?xt1.txt;
`

@AmplabJenkins

Can one of the admins verify this patch?

@kevinyu98
Contributor Author

Closing this PR; PR #20611 has incorporated this fix.

@kevinyu98 kevinyu98 closed this Jul 9, 2018
3 participants