[SPARK-14543] [SQL] Improve InsertIntoTable column resolution. #12313
Conversation
Force-pushed cf82d95 to c4c6820.
Test build #55558 has finished for PR 12313.
Force-pushed c4c6820 to f2186e5.
Test build #55616 has finished for PR 12313.
Force-pushed f2186e5 to c3e8561.
Test build #56239 has finished for PR 12313.
Force-pushed c3e8561 to 0a6ec77.
Test build #56386 has finished for PR 12313.
Force-pushed 0a6ec77 to 6c107ff.
Test build #56411 has finished for PR 12313.
Test build #56586 has finished for PR 12313.
Force-pushed fcacba5 to 7516ffd.
Test build #56605 has finished for PR 12313.
Test build #56708 has finished for PR 12313.
Force-pushed 42ff7b7 to c443b32.
Test build #56757 has finished for PR 12313.
Force-pushed c443b32 to ab23c9c.
Retest this please.
Test build #58156 has finished for PR 12313.
@liancheng, @cloud-fan, this commit is a follow-up to #12239 that fixes column resolution when writing to both Hive MetastoreRelations and HadoopFsRelations. Could you review it? I think it would be a good addition to 2.0.0 as well.
@@ -498,6 +499,117 @@ class Analyzer(
    }
  }

  val ResolveOutputColumns = new ResolveOutputColumns
  class ResolveOutputColumns extends Rule[LogicalPlan] {
We can simply use `object ResolveOutputColumns` here.
I'll fix this and the `protected` method. I think I was instantiating this rule multiple times at one point and this is left-over.
Force-pushed ab23c9c to 3a24e36.
@cloud-fan, @liancheng, thanks for reviewing! I've rebased on master and fixed your comments so far.
Test build #59894 has finished for PR 12313.
@@ -505,6 +506,117 @@ class Analyzer(
    }
  }

  object ResolveOutputColumns extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
Should use `plan.resolveOperators` here.
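Taken together, the two review suggestions (a singleton `object` and `resolveOperators` instead of `transform`) amount to something like the following sketch. This is simplified for illustration, not the PR's exact code; the pattern on `InsertIntoTable` is an assumption based on how the rule is described in this thread.

```scala
// Sketch only: a singleton analyzer rule using resolveOperators, which
// skips subtrees that have already been analyzed, unlike plan.transform.
object ResolveOutputColumns extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
    case insert: InsertIntoTable if insert.table.resolved && insert.child.resolved =>
      // ...resolve the child's output against insert.table.output here,
      // by name or by position, then return the rewritten insert...
      insert
  }
}
```

Using a singleton also avoids re-instantiating the rule on every `Analyzer` construction, which is what the `val ... = new ...` left-over did.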
@rdblue Thank you for updating the patch. I was out of town late last week and was busy at Spark Summit early this week. Sorry for my late reply. Having name-based resolution is very useful! Since this is a major improvement and we are pretty late in the 2.0 QA period, how about we re-target the work to 2.1? Also, I feel it will be good to have a short design doc to discuss the following:

Based on the design, we can break the work into multiple pieces and get them merged for the 2.1 release. What do you think?
This combines Hive's pre-insertion casts (without renames) that handle partitioning with the pre-insertion casts/renames in core. The combined rule, ResolveOutputColumns, will resolve columns by name or by position. Resolving by position detects cases where the number of columns is incorrect, or where the input columns are a permutation of the output columns, and fails. When resolving by name, each output column is located by name in the child plan. This handles cases where a subset of a data frame is written out.
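The two resolution strategies described above can be modeled in plain Scala, independent of Spark's internals. The names and types below are invented for this sketch, not taken from the PR:

```scala
// Illustrative model of by-position vs. by-name output column resolution.
case class Column(name: String)

def resolveByPosition(in: Seq[Column], out: Seq[Column]): Seq[Column] = {
  // A wrong column count is always an error.
  require(in.size == out.size,
    s"Expected ${out.size} columns but found ${in.size}")
  // Input columns that are a reordering of the output columns almost
  // certainly indicate a mistake, so fail rather than cast silently.
  require(in == out || in.map(_.name).sorted != out.map(_.name).sorted,
    "Input columns are a permutation of the output columns")
  in
}

def resolveByName(in: Seq[Column], out: Seq[Column]): Seq[Column] = {
  val byName = in.map(c => c.name -> c).toMap
  // Locate each output column by name in the input; a missing name fails.
  out.map(o => byName.getOrElse(o.name,
    sys.error(s"Cannot find input column '${o.name}'")))
}
```

By-name resolution is what lets a subset or reordering of a data frame's columns be written out safely.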
This PR now catches this problem during analysis and has a better error message. This commit updates the test for the new message and exception type.
Adding new arguments to InsertIntoTable requires changes to several files. Instead of adding a long list of optional args, this adds an options map, like the one passed to DataSource. Future options can be added and used only where they are needed.
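The options-map pattern described in that commit message can be sketched in plain Scala. The field names and the `byName` key here are illustrative assumptions, not the PR's actual API:

```scala
// Sketch: one options map instead of a growing list of optional arguments,
// mirroring the map that is passed to DataSource.
case class InsertIntoTable(
    tableName: String,
    options: Map[String, String] = Map.empty) {

  // Each consumer reads only the options it needs, with a default.
  def byName: Boolean = options.get("byName").exists(_.toBoolean)
}

val insert = InsertIntoTable("events", Map("byName" -> "true"))
```

New options can then be introduced without touching every call site that constructs the node.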
Force-pushed 302464b to 906e68d.
@yhuai, whatever release you want to target is fine with me, but I don't think we should block this on a design doc for cleaning up the

Since this PR doesn't change the public API, I don't think we should wait until we have a plan for the public API (design doc goal 1) to commit it. Similarly for the second goal: missing columns, extra columns, and metastore updates are beyond the scope here, when this can simply require that the number of columns matches, since that's the most conservative strategy (that's what is now implemented).

The remaining issue is whether it is a good idea to include the by-name code. I think the concern is that it may change based on the API design doc, but that's really unlikely. For example, adding the option to

I'm just trying to avoid dragging this out for a lot longer, or the work of splitting it up needlessly and having to keep rebasing the changes on master. I could be wrong, so if I'm missing something here, then please let me know (and thanks for being patient).

Also, I addressed @cloud-fan's comments and rebased on master.
Test build #60327 has finished for PR 12313.
Really like this PR! It removes one of the Hive-specific rules! : )
If this is too large for merging to 2.0, could @rdblue deliver a small fix for capturing the illegal user inputs? Thanks! |
…ases for by position resolution ## What changes were proposed in this pull request? This PR migrates some test cases introduced in #12313 as a follow-up of #13754 and #13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements. Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables. ## How was this patch tested? N/A Author: Cheng Lian <lian@databricks.com> Closes #13810 from liancheng/spark-16037-follow-up-tests. (cherry picked from commit f4a3d45) Signed-off-by: Yin Huai <yhuai@databricks.com>
Test build #75047 has finished for PR 12313.
Hi all, I just wonder where we are on this.

We were trying to get this in just before the 2.0 release, which was a bad time. We've just been maintaining it in our version, but I'm going to be rebasing it onto 2.1 soon, so I'll see what needs to be done to get it in upstream.

I see. Thank you.
Test build #82245 has finished for PR 12313.
This is addressed by #21305. |
What changes are proposed in this pull request?

This adds a rule that resolves output columns when inserting into a `LogicalPlan`. It catches cases where there are too many data columns and throws an `AnalysisException` rather than silently dropping the extra data. It also improves the error message when there are too few columns and warns when the output columns appear to be out of order.

It combines Hive's pre-insertion casts for `MetastoreRelation` with the pre-insert cast and rename for `LogicalRelations`. Both are now handled as a single `ResolveOutputColumns` step in the analyzer that implements the above improvements.

Casts are now UpCasts to avoid silently adding incorrect casts when columns are misaligned. Casts are still somewhat unsafe and should be fixed in a follow-up PR.

This is exposed on the `DataFrameWriter`.
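A hypothetical caller-side view of the by-name behavior, for illustration only. The option name `byName` and this wiring are assumptions, not the API this PR finally settled on:

```scala
// Hypothetical sketch: write a DataFrame whose columns are a subset or
// reordering of the target table's columns, resolved by name.
df.write
  .option("byName", "true") // option name is an assumption for illustration
  .insertInto("target_table")
```

Without by-name resolution, `insertInto` matches columns strictly by position, which is where the silent misalignment this PR guards against comes from.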
How was this patch tested?
This patch includes unit tests that exercise the cases outlined above.