New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4131][SQL] support writing data into the filesystem from queries #4380
Conversation
Test build #26814 has finished for PR 4380 at commit
|
execute() | ||
logWarning("use execute collect") | ||
Array.empty[Row] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems this is not necessary
Test build #26918 has finished for PR 4380 at commit
|
Test build #26919 has finished for PR 4380 at commit
|
/cc @liancheng can you help review this? |
Test build #26982 has finished for PR 4380 at commit
|
Test build #26984 has finished for PR 4380 at commit
|
Test build #26983 has finished for PR 4380 at commit
|
retest this please |
AttributeReference("col_name", StringType, nullable = false)(), | ||
AttributeReference("data_type", StringType, nullable = false)(), | ||
AttributeReference("comment", StringType, nullable = false)()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi, @scwf
FYI. The DescribeCommand
has been moved into sources.ddl.scala
, can remove this from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, my bad, should remove it from here
Test build #26990 has finished for PR 4380 at commit
|
Test build #26993 has finished for PR 4380 at commit
|
Actually @yhuai's work doesn't cover your concerns entirely, but you can build yours upon his. Basically it's a |
Yeah getit, but that maybe much later, is it possible to let this in for transition since |
ping. any update here? |
@marmbrus, according to liancheng's suggestion this PR rely on the Hive SerDe data source(not implemented), so now this version added a 'WriteToDirs ' to implement this feature. could you have a look and give some feed back ? |
serializer | ||
} | ||
|
||
// maybe we can make it as a common method, share with `InsertIntoHiveTable` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I think my biggest feedback is that this seems like a lot of new code, given we already have the ability to write data using arbitrary SerDes to a file system. Can we acomplish the same thing with only small changes to the parser and maybe a small logical / physical node? Or am i missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes in InsertIntoHiveTable we can write data using arbitrary SerDes to a file system. But we can not reuse InsertIntoHiveTable to share the same physical plan since
1 InsertIntoHiveTable and WriteToDirectory have different input
2 InsertIntoHiveTable has complex logical to handle dynamic partitioning
So i think we can extract a common interface(SaveAsHiveFile) to reuse the code of writing data to file system.
Test build #30628 has started for PR 4380 at commit |
Test build #30633 has started for PR 4380 at commit |
/cc @marmbrus |
Test build #31424 has started for PR 4380 at commit |
Updated, to summarize this: 2 Analyze WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, ASTNode) to WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, desc: TableDesc) in hive context analyzer 3 Transform WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, desc: TableDesc) to execution.WriteToDirectory when query planning 4 Extract a common Interface SaveAsHiveFile to share code of writting data to FS |
ping |
Test build #33078 has started for PR 4380 at commit |
// Wait until children are resolved. | ||
case p: LogicalPlan if !p.childrenResolved => p | ||
|
||
case WriteToDirectory(path, child, isLocal, extra: ASTNode) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't be holding onto the AST anymore, as this ties us a little too closely to a specific version of Hive. Instead look at how we are doing things in CreateTableAsSelect.
Also, please add a test case where you use this new operation on a Spark SQL temporary table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah thanks i am updating this
Test build #33232 has started for PR 4380 at commit |
Test build #38789 has finished for PR 4380 at commit
|
retest this please |
Test build #38811 has finished for PR 4380 at commit
|
Test build #144 has finished for PR 4380 at commit
|
Test build #39628 has finished for PR 4380 at commit
|
Test build #40233 has finished for PR 4380 at commit
|
@scwf looks like there is a real failure? |
@rxin yes, since we upgrade the hive version to 1.2.1, we should adapt the token tree in hiveql, the old one is not correct in 1.2.1. Updated |
retest this please |
Test build #40600 has finished for PR 4380 at commit
|
@scwf @liancheng Is there any plan to merge this PR to branch-1.5 recently? I think this feature is pretty useful. |
@litao-buptsse Since it is a new feature, I think we will not get it in branch 1.5. But, we can target 1.6. |
@yhuai Got it, thank you very much. |
@scwf @yhuai I apply this patch to my branch-1.5 code. It works! But I found a bug. When I use lower case "local", it will try to insert to hdfs file system.
When I use upper case "LOCAL", it insert to local file system correctly.
|
@scwf @yhuai For the code above, should change "LOCAL" to
|
@scwf @yhuai It works in "local" mode, but not well in "yarn-client" mode.
|
@litao-buptsse, i will update this soon thanks. |
@yhuai @liancheng is this still relevant? |
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks! |
I think it's a useful feature and widely used in hive. Why not finish this feature and merge it to branch-1.6? |
Support writing data into the filesystem from queries
syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory [ROW FORMAT row_format] STORED AS file_format SELECT ... FROM ...