
[SPARK-5741][SQL] Support the path contains comma in HiveContext #4532

Closed
wants to merge 3 commits into from

Conversation

watermen
Contributor

When running `select * from nzhang_part where hr = 'file,';`, it throws `java.lang.IllegalArgumentException: Can not create a Path from an empty string`, because the HDFS path contains a comma and `FileInputFormat.setInputPaths` splits paths on commas.

SQL

set hive.merge.mapfiles=true; 
set hive.merge.mapredfiles=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create table nzhang_part like srcpart;

insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08';

insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08';

insert overwrite table nzhang_part partition (ds='2010-08-15', hr) 
select * from (
select key, value, hr from srcpart where ds='2008-04-08'
union all
select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;

select * from nzhang_part where hr = 'file,';

Error Log

15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,']
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
at org.apache.hadoop.fs.Path.<init>(Path.java:135)
at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
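The failure mode can be reproduced without Hadoop using a simplified stand-in for the comma splitting that `FileInputFormat.setInputPaths` performs via `getPathStrings` (this is NOT Hadoop's actual code; per the review discussion below, that version split on commas whether escaped or not):

```java
import java.util.Arrays;
import java.util.List;

public class CommaSplitDemo {
    // Simplified sketch of the comma splitting done inside
    // FileInputFormat.setInputPaths(JobConf, String).
    static List<String> naiveSplitPaths(String commaSeparated) {
        // limit -1 keeps trailing empty strings, which is what triggers the bug
        return Arrays.asList(commaSeparated.split(",", -1));
    }

    public static void main(String[] args) {
        // A partition directory whose value ends in a comma, as in the bug report.
        String path = "hdfs://x.x.x.x:9000/user/hive/warehouse/"
                + "nzhang_part/ds=2010-08-15/hr=file,";
        List<String> pieces = naiveSplitPaths(path);
        // The trailing comma yields an empty last element; Hadoop's
        // Path.checkPathArg rejects "" with IllegalArgumentException.
        System.out.println("last piece is empty: "
                + pieces.get(pieces.size() - 1).isEmpty());
    }
}
```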

@AmplabJenkins

Can one of the admins verify this patch?

@watermen watermen changed the title [SPARK-5741][SQL] Support comma in path in HiveContext [SPARK-5741][SQL] Support the path contains comma in HiveContext Feb 11, 2015
@adrian-wang
Contributor

ok to test.

@@ -248,7 +249,7 @@ private[hive] object HadoopTableReader extends HiveInspectors {
    * instantiate a HadoopRDD.
    */
   def initializeLocalJobConfFunc(path: String, tableDesc: TableDesc)(jobConf: JobConf) {
-    FileInputFormat.setInputPaths(jobConf, path)
+    jobConf.set("mapred.input.dir", StringUtils.escapeString(path.toString()))
Contributor


Instead of setting the conf using the raw key, can we still use `FileInputFormat.setInputPaths`? Like

FileInputFormat.setInputPaths(jobConf, StringUtils.escapeString(path))

Contributor Author


We can't. For example, `hdfs://x.x.x.x:9000/user/hive/warehouse/nzhang_part/ds=2010-08-15/hr=file,` will be split into `hdfs://x.x.x.x:9000/user/hive/warehouse/nzhang_part/ds=2010-08-15/hr=file` and `""` by `FileInputFormat.getPathStrings`, and the empty string is then rejected by `Path.checkPathArg`:

if (path.length() == 0) {
    throw new IllegalArgumentException("Can not create a Path from an empty string");
}

See the call chain `FileInputFormat.setInputPaths` → `FileInputFormat.getPathStrings` → `Path.checkPathArg` in Hadoop for details.

Contributor


Oh, I see. `getPathStrings` does not really care whether a comma is escaped or not... Can we use `public static void setInputPaths(Job job, Path... inputPaths)`? I think it is better to avoid calling `set` directly with a string key; using the method seems more robust.
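The contrast between the two overloads can be sketched with hypothetical stand-ins (simplified illustrations, not Hadoop's actual implementations):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SetInputPathsDemo {
    // Stand-in for the String overload: splits on every comma, escaped
    // or not, so a path ending in ',' produces an empty entry.
    static List<String> setInputPathsFromString(String commaSeparated) {
        return Arrays.asList(commaSeparated.split(",", -1));
    }

    // Stand-in for the Path... overload: each argument is already one
    // complete path, so nothing is split and no empty string is created.
    static List<String> setInputPathsFromPaths(String... paths) {
        List<String> out = new ArrayList<>();
        for (String p : paths) {
            if (p.isEmpty()) {
                throw new IllegalArgumentException(
                        "Can not create a Path from an empty string");
            }
            out.add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        String p = "hdfs://x.x.x.x:9000/user/hive/warehouse/"
                + "nzhang_part/ds=2010-08-15/hr=file,";
        System.out.println(setInputPathsFromString(p).contains("")); // the bug
        System.out.println(setInputPathsFromPaths(p));               // path kept intact
    }
}
```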

@yhuai
Contributor

yhuai commented Feb 11, 2015

ok to test

@SparkQA

SparkQA commented Feb 11, 2015

Test build #27292 has started for PR 4532 at commit b788a72.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 11, 2015

Test build #27292 has finished for PR 4532 at commit b788a72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27292/
Test PASSed.

@scwf
Contributor

scwf commented Feb 12, 2015

lgtm

@watermen
Contributor Author

@yhuai Can you review it?

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28034 has started for PR 4532 at commit 9758ab1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28034 has finished for PR 4532 at commit 9758ab1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28034/
Test PASSed.

@watermen
Contributor Author

@yhuai Can you review the code for me?

@yhuai
Contributor

yhuai commented Feb 27, 2015

LGTM

@watermen
Contributor Author

watermen commented Mar 2, 2015

@marmbrus @rxin Can it be merged?

@marmbrus
Contributor

marmbrus commented Mar 2, 2015

Thanks! Merging to master and 1.3.

asfgit pushed a commit that referenced this pull request Mar 2, 2015
Author: q00251598 <qiyadong@huawei.com>

Closes #4532 from watermen/SPARK-5741 and squashes the following commits:

9758ab1 [q00251598] fix bug
1db1a1c [q00251598] use setInputPaths(Job job, Path... inputPaths)
b788a72 [q00251598] change FileInputFormat.setInputPaths to jobConf.set and add test suite

(cherry picked from commit 9ce12aa)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 9ce12aa Mar 2, 2015