[SPARK-24176][SQL] LOAD DATA can't identify wildcard in the hdfs file path #21285

kevinyu98 · 2018-05-09T21:02:17Z

What changes were proposed in this pull request?

When the LOAD DATA command's path name contains wildcard characters (such as "?"), the path-related APIs (Hadoop and URI) cannot parse it correctly. For example,
val srcPath = new Path(hdfsUri) in tables.scala returns the wrong result in the following cases:

hdfsUri = file:/user/testdemo1/user1/t??eddata60.txt
srcPath = file:/user/testdemo1/user1/t
hdfsUri = file:/user/testdemo1/user1/?eddata60.txt
srcPath = file:/user/testdemo1/user1/
(The same problem exists at val uriPath = uri.getPath().)
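The truncation can be reproduced with java.net.URI alone. The following is a minimal sketch rather than the code in tables.scala; the object name UriTruncationDemo and the helper parse are made up for illustration. It shows that the single-argument URI constructor treats the first "?" as the start of the query component, so getPath stops there.

```scala
import java.net.URI

object UriTruncationDemo {
  // Hypothetical helper: parse the raw string with the single-argument
  // constructor, which splits the string at the first '?'.
  def parse(raw: String): URI = new URI(raw)

  def main(args: Array[String]): Unit = {
    val uri = parse("file:/user/testdemo1/user1/t??eddata60.txt")
    // Everything before the first '?' is reported as the path.
    println(s"path  = ${uri.getPath}")
    // The rest of the file name ends up in the query component.
    println(s"query = ${uri.getQuery}")
  }
}
```

Any code that later reads only the path component of such a URI loses the rest of the file name.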

LOAD DATA LOCAL works because the local case calls the utility Utils.resolveURI, which replaces "?" with "%3F", so the Path API does not truncate the file name.
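The effect of percent-encoding can be illustrated with plain java.net.URI. This is a sketch under assumptions, not Spark's Utils.resolveURI: the object UriEncodingDemo and the helper encodePath are hypothetical, and they use the multi-argument URI constructor, which quotes characters that are illegal in a path.

```scala
import java.net.URI

object UriEncodingDemo {
  // Hypothetical helper (not Spark's Utils.resolveURI): build the URI from
  // separate components so that '?' inside the path is quoted as %3F
  // instead of being parsed as the query delimiter.
  def encodePath(scheme: String, path: String): URI =
    new URI(scheme, null, path, null)

  def main(args: Array[String]): Unit = {
    val uri = encodePath("file", "/user/testdemo1/user1/t??eddata60.txt")
    println(uri)            // encoded form, with %3F in place of each '?'
    println(uri.getRawPath) // encoded path
    println(uri.getPath)    // decoded path, with the full file name intact
  }
}
```

Because the "?" characters are encoded before the string is parsed, the whole file name stays in the path component and nothing is moved into the query.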

This fix uses Utils.resolveURI method for both local and non-local cases.

I ran a similar test on Hive, and Hive appears to have the same behavior:

hive> load data inpath 'hdfs:/tmp/?evin.txt' into table foo1;
FAILED: SemanticException Line 1:17 Invalid path ''hdfs:/tmp/?evin.txt'': No files matching path hdfs://stcindia-node-6.fyre.ibm.com:8020/tmp/%3Fevin.txt
hive> load data inpath 'hdfs:/tmp/k?evin.txt' into table foo1;
FAILED: SemanticException Line 1:17 Invalid path ''hdfs:/tmp/k?evin.txt'': No files matching path hdfs://stcindia-node-6.fyre.ibm.com:8020/tmp/k%3Fevin.txt
hive>

How was this patch tested?

Ran the unit tests locally and added new test cases.

HyukjinKwon · 2018-05-10T02:43:36Z

is it a duplicate of #20611?

HyukjinKwon · 2018-05-10T02:45:20Z

cc @wzhfy and @sujith71955

kevinyu98 · 2018-05-10T06:42:11Z

@HyukjinKwon Thanks for reviewing this PR. I didn't notice that PR until you pointed it out. If we plan to support wildcards in the LOAD DATA command, then we can close this PR.
But with his current code, the problem reported in this JIRA still exists, because in the non-local case the path is truncated after val srcPath = new Path(loadPath). I downloaded his code, and it still has the same issue that this PR reports.
I created text1.txt on my local machine, then ran LOAD DATA:
load data inpath '/Users/qianyangyu/IdeaProjects/spark/??xt1.txt' into table foo1; succeeded
load data inpath '/Users/qianyangyu/IdeaProjects/spark/t?xt1.txt' into table foo1; failed
`spark-sql> load data inpath '/Users/qianyangyu/IdeaProjects/spark/??xt1.txt' into table foo1;
Time taken: 0.112 seconds

spark-sql> load data inpath '/Users/qianyangyu/IdeaProjects/spark/t?xt1.txt' into table foo1;
Error in query: LOAD DATA input path does not exist: /Users/qianyangyu/IdeaProjects/spark/t?xt1.txt;
`

AmplabJenkins · 2018-06-09T00:05:14Z

Can one of the admins verify this patch?

kevinyu98 · 2018-07-09T21:18:36Z

Closing this PR; PR #20611 has combined this fix into his.

resolve the path string for load data before using it

3c1a1cf

kevinyu98 mentioned this pull request May 10, 2018

[SPARK-23425][SQL]Support wildcard in HDFS path for load table command #20611

Closed

kevinyu98 closed this Jul 9, 2018
