
[SPARK-4131][SQL] support writing data into the filesystem from queries #4380

Closed
wants to merge 17 commits

Conversation

scwf
Contributor

@scwf scwf commented Feb 5, 2015

Support writing data into the filesystem from queries
syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory [ROW FORMAT row_format] STORED AS file_format SELECT ... FROM ...
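For illustration, a full statement in the proposed syntax might look like the following (the table `src`, the delimiter, and the output path are made-up examples, not from the PR):

```scala
// Hypothetical usage of the proposed syntax. The table name `src` and the
// output path are made-up illustrations, not taken from the PR.
val query =
  """INSERT OVERWRITE LOCAL DIRECTORY '/tmp/spark_out'
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |STORED AS TEXTFILE
    |SELECT key, value FROM src""".stripMargin

// With a HiveContext in scope this would be submitted as:
// hiveContext.sql(query)
```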

@scwf scwf changed the title [SPARK-4131] [SQL] [WIP] support writing data into the filesystem from queries [SPARK-4131][SQL][WIP] support writing data into the filesystem from queries Feb 5, 2015
@SparkQA

SparkQA commented Feb 5, 2015

Test build #26814 has finished for PR 4380 at commit cd1cd10.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class WriteToDirectory[T](
    • case class WriteToDirectory(
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")

execute()
logWarning("use execute collect")
Array.empty[Row]
}
Contributor Author

Seems this is not necessary.

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26918 has finished for PR 4380 at commit 8d4dfae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class WriteToDirectory[T](
    • case class WriteToDirectory(
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26919 has finished for PR 4380 at commit 3d8a460.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class WriteToDirectory[T](
    • case class WriteToDirectory(
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")

@scwf
Contributor Author

scwf commented Feb 7, 2015

/cc @liancheng can you help review this?

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26982 has finished for PR 4380 at commit d45e915.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class WriteToDirectory[T](
    • case class WriteToDirectory(
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26984 has finished for PR 4380 at commit 0561486.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DescribeCommand(
    • case class WriteToDirectory[T](
    • case class WriteToDirectory(
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26983 has finished for PR 4380 at commit 743a89d.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class WriteToDirectory[T](
    • case class WriteToDirectory(
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")

@scwf
Contributor Author

scwf commented Feb 7, 2015

retest this please

AttributeReference("col_name", StringType, nullable = false)(),
AttributeReference("data_type", StringType, nullable = false)(),
AttributeReference("comment", StringType, nullable = false)())
}
Contributor

Hi @scwf,
FYI, the DescribeCommand has been moved into sources.ddl.scala; can you remove it from here?

Contributor Author

Yeah, my bad, I should remove it from here.

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26990 has finished for PR 4380 at commit 0561486.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DescribeCommand(
    • case class WriteToDirectory[T](
    • case class WriteToDirectory(
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26993 has finished for PR 4380 at commit 8a17484.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class WriteToDirectory[T](
    • case class WriteToDirectory(
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")

@liancheng
Contributor

Hey @scwf, @yhuai is working on generalized write support for the data source API, which I believe covers your concerns here. Please refer to #4294 and #4446 for details.

@liancheng
Contributor

Actually @yhuai's work doesn't cover your concerns entirely, but you can build yours upon his. Basically it's a sql("<some query>").save("path", "some.data.source"). And to support arbitrary Hive SerDes, we can have a Hive SerDe data source, which can be merged into the Hive data source (much) later once we are able to extract functionalities of HiveContext into a separate data source.

@scwf
Contributor Author

scwf commented Feb 15, 2015

Yeah, got it, but that may be much later. Is it possible to let this in as a transition, since:
1. This syntax is a basic functional point in HiveQL and is useful to our customers.
2. This PR only adds the implementation in the hive subproject.

@marmbrus
Contributor

ping. any update here?

@scwf scwf changed the title [SPARK-4131][SQL][WIP] support writing data into the filesystem from queries [SPARK-4131][SQL] support writing data into the filesystem from queries Mar 19, 2015
@scwf
Contributor Author

scwf commented Mar 19, 2015

@marmbrus, according to @liancheng's suggestion this PR relies on the Hive SerDe data source (not implemented yet), so this version adds a `WriteToDirs` to implement the feature. Could you take a look and give some feedback?

serializer
}

// maybe we can make it as a common method, share with `InsertIntoHiveTable`
Contributor

+1

I think my biggest feedback is that this seems like a lot of new code, given we already have the ability to write data using arbitrary SerDes to a file system. Can we accomplish the same thing with only small changes to the parser and maybe a small logical / physical node? Or am I missing something?

Contributor Author

Yes, in InsertIntoHiveTable we can write data using arbitrary SerDes to a file system. But we cannot reuse InsertIntoHiveTable to share the same physical plan, since:
1. InsertIntoHiveTable and WriteToDirectory have different inputs.
2. InsertIntoHiveTable has complex logic to handle dynamic partitioning.

So I think we can extract a common interface (SaveAsHiveFile) to reuse the code for writing data to the file system.
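The refactoring described here can be sketched structurally. This is an illustrative skeleton only; the method bodies and signatures are invented placeholders, not the PR's actual code:

```scala
// Illustrative skeleton of the proposed refactoring: both operators mix in a
// shared SaveAsHiveFile trait so the file-writing code lives in one place.
// All bodies are stand-in placeholders, not the PR's real implementation.
trait SaveAsHiveFile {
  // Shared logic: write the rows of `data` out as Hive-formatted files.
  def saveAsHiveFile(data: Seq[String], path: String): String =
    s"wrote ${data.size} rows to $path"
}

// Writes into a Hive table (the real operator also handles partitioning).
case class InsertIntoHiveTable(table: String) extends SaveAsHiveFile {
  def run(data: Seq[String]): String = saveAsHiveFile(data, s"warehouse/$table")
}

// Writes directly to a user-supplied directory.
case class WriteToDirectory(path: String, isLocal: Boolean) extends SaveAsHiveFile {
  def run(data: Seq[String]): String = saveAsHiveFile(data, path)
}
```

Each operator keeps its own planning logic (dynamic partitioning stays in InsertIntoHiveTable), while the write path is shared through the trait.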

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30628 has started for PR 4380 at commit 932c37c.

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30633 has started for PR 4380 at commit 60e3f84.

@scwf
Contributor Author

scwf commented Apr 21, 2015

/cc @marmbrus

@SparkQA

SparkQA commented Apr 30, 2015

Test build #31424 has started for PR 4380 at commit bc8f71b.

@scwf
Contributor Author

scwf commented May 3, 2015

Updated. To summarize:
1. Get an unresolved plan WriteToDirectory(path, child, isLocal, extra: ASTNode) in HiveQl from the Hive AST.

2. Analyze WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, extra: ASTNode) into WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, desc: TableDesc) in the HiveContext analyzer.

3. Transform WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, desc: TableDesc) into execution.WriteToDirectory during query planning.

4. Extract a common interface, SaveAsHiveFile, to share the code for writing data to the file system.
/cc @marmbrus
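A minimal sketch of steps 1-3, using stand-in types for ASTNode and TableDesc (the real ones come from Hive), to show how the plan changes shape as it moves through analysis:

```scala
// Stand-in types: the real ASTNode and TableDesc come from Hive; these
// placeholders only illustrate the flow of the plan through analysis.
case class ASTNode(text: String)
case class TableDesc(serde: String)

// Step 1: the parser emits an unresolved plan still carrying the raw AST.
case class UnresolvedWriteToDirectory(path: String, isLocal: Boolean, extra: ASTNode)

// Step 2: the analyzer replaces the AST fragment with a concrete TableDesc.
case class ResolvedWriteToDirectory(path: String, isLocal: Boolean, desc: TableDesc)

// A toy analyzer rule; the serde choice here is a made-up placeholder.
def analyze(plan: UnresolvedWriteToDirectory): ResolvedWriteToDirectory = {
  val serde =
    if (plan.extra.text.contains("TOK_TABLEROWFORMAT")) "LazySimpleSerDe"
    else "default"
  ResolvedWriteToDirectory(plan.path, plan.isLocal, TableDesc(serde))
}
```

Step 3 would then map the resolved logical node onto a physical execution.WriteToDirectory node during planning.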

@scwf
Contributor Author

scwf commented May 6, 2015

ping

@SparkQA

SparkQA commented May 19, 2015

Test build #33078 has started for PR 4380 at commit 445df13.

// Wait until children are resolved.
case p: LogicalPlan if !p.childrenResolved => p

case WriteToDirectory(path, child, isLocal, extra: ASTNode) =>
Contributor

We shouldn't be holding onto the AST anymore, as this ties us a little too closely to a specific version of Hive. Instead look at how we are doing things in CreateTableAsSelect.

Also, please add a test case where you use this new operation on a Spark SQL temporary table.

Contributor Author

Yeah, thanks, I am updating this.

@SparkQA

SparkQA commented May 21, 2015

Test build #33232 has started for PR 4380 at commit 401fab8.

@SparkQA

SparkQA commented Jul 29, 2015

Test build #38789 has finished for PR 4380 at commit 33a2b0a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")
    • case class WriteToDirectory(

@scwf
Contributor Author

scwf commented Jul 29, 2015

retest this please

@SparkQA

SparkQA commented Jul 29, 2015

Test build #38811 has finished for PR 4380 at commit 33a2b0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 29, 2015

Test build #144 has finished for PR 4380 at commit 33a2b0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")
    • case class WriteToDirectory(

@SparkQA

SparkQA commented Aug 4, 2015

Test build #39628 has finished for PR 4380 at commit 438a209.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")
    • case class WriteToDirectory(

@SparkQA

SparkQA commented Aug 8, 2015

Test build #40233 has finished for PR 4380 at commit 085e427.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")
    • case class WriteToDirectory(

@rxin
Contributor

rxin commented Aug 11, 2015

@scwf looks like there is a real failure?

@scwf
Contributor Author

scwf commented Aug 12, 2015

@rxin yes, since we upgraded the Hive version to 1.2.1 we need to adapt the token tree in HiveQl; the old one is not correct for 1.2.1. Updated.

@scwf
Contributor Author

scwf commented Aug 12, 2015

retest this please

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40600 has finished for PR 4380 at commit 9cc8474.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • assert(valueClass != null, "Output value class not set")
    • assert(outputFileFormatClassName != null, "Output format class not set")
    • case class WriteToDirectory(

@litao-buptsse
Contributor

@scwf @liancheng Is there any plan to merge this PR into branch-1.5 soon? I think this feature is pretty useful.

@yhuai
Contributor

yhuai commented Sep 17, 2015

@litao-buptsse Since it is a new feature, I think we will not get it in branch 1.5. But, we can target 1.6.

@litao-buptsse
Contributor

@yhuai Got it, thank you very much.

@litao-buptsse
Contributor

@scwf @yhuai I applied this patch to my branch-1.5 code. It works!

But I found a bug: when I use the lowercase "local", it tries to insert into the HDFS file system.

insert overwrite local directory "/mypath" select ...

When I use the uppercase "LOCAL", it inserts into the local file system correctly.

insert overwrite LOCAL directory "/mypath" select ...

@litao-buptsse
Contributor

case Token(destinationToken(),
       Token("TOK_DIR", path :: formats) :: Nil) =>
  var isLocal = false
  formats.collect {
    case Token("LOCAL", others) =>
      isLocal = true
  }
  WriteToDirectory(
    BaseSemanticAnalyzer.unescapeSQLString(path.getText),
    query,
    isLocal,
    parseTableDesc(formats))

@scwf @yhuai For the code above, should we change the "LOCAL" match to be case-insensitive?

val LOCAL = "(?i)LOCAL".r

case Token(LOCAL(), others) =>
  isLocal = true
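This suggestion works because a Scala regex extractor built with the (?i) inline flag matches regardless of case. A self-contained sketch (the isLocalToken helper is a made-up stand-in for the Token pattern match in the parser):

```scala
// Demonstrates the suggested case-insensitive match for the LOCAL token.
// `tokenText` stands in for the Token(...).getText-style values the parser sees.
val LOCAL = "(?i)LOCAL".r

def isLocalToken(tokenText: String): Boolean = tokenText match {
  case LOCAL() => true  // matches "LOCAL", "local", "Local", ...
  case _       => false
}
```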

@litao-buptsse
Contributor

@scwf @yhuai It works in "local" mode, but not well in "yarn-client" mode.

15/09/19 18:03:48 ERROR thriftserver.SparkSQLDriver: Failed in [insert overwrite directory 'file:///search/tmp/0919' select query from custom.common_pc_pv where logdate='2015091905' limit 10]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 71, cloud101411240.wd.nm.ss.nop.sogou-op.org): org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/search/tmp/0919/_temporary/0/_temporary/attempt_201509191803_0005_m_000000_3 (exists=false, cwd=file:/search/hadoop02/yarn_local/usercache/spark/appcache/application_1442391298043_56782/container_1442391298043_56782_01_000008)
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
        at org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115)
        at org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:87)
        at org.apache.spark.sql.hive.SaveAsHiveFile$class.writeToFile$1(SaveAsHiveFile.scala:84)
        at org.apache.spark.sql.hive.SaveAsHiveFile$$anonfun$saveAsHiveFile$3.apply(SaveAsHiveFile.scala:68)
        at org.apache.spark.sql.hive.SaveAsHiveFile$$anonfun$saveAsHiveFile$3.apply(SaveAsHiveFile.scala:68)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Mkdirs failed to create file:/search/tmp/0919/_temporary/0/_temporary/attempt_201509191803_0005_m_000000_3 (exists=false, cwd=file:/search/hadoop02/yarn_local/usercache/spark/appcache/application_1442391298043_56782/container_1442391298043_56782_01_000008)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
        at org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
        ... 11 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)

@scwf
Contributor Author

scwf commented Sep 21, 2015

@litao-buptsse, I will update this soon, thanks.

@andrewor14
Contributor

@yhuai @liancheng is this still relevant?

@rxin
Contributor

rxin commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

@asfgit asfgit closed this in 7b4452b Dec 31, 2015
@litao-buptsse
Contributor

I think it's a useful feature and is widely used in Hive. Why not finish this feature and merge it into branch-1.6?
