[SPARK-14543] [SQL] Improve InsertIntoTable column resolution. #12313

rdblue · 2016-04-12T00:23:09Z

What changes are proposed in this pull request?

This updates the logic that resolves the output table's columns against the incoming LogicalPlan. It catches cases where there are too many data columns and throws an AnalysisException rather than silently dropping the extra data. It also improves the error message when there are too few columns and warns when the output columns appear to be out of order.
This combines the pre-insert casts for Hive's MetastoreRelation with the pre-insert cast and rename for LogicalRelations. Both are now handled as a single ResolveOutputColumns step in the analyzer that implements the above improvements. ~~Casts are now UpCasts to avoid silently adding incorrect casts when columns are misaligned.~~ Casts are still somewhat unsafe and should be fixed in a follow-up PR.
This adds a by-name column resolution strategy that matches output columns to the incoming data by name. ~~This is exposed on the DataFrameWriter.~~

How was this patch tested?

This patch includes unit tests that exercise the cases outlined above.

SparkQA · 2016-04-12T01:32:53Z

Test build #55558 has finished for PR 12313 at commit c4c6820.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ResolveOutputColumns extends Rule[LogicalPlan]

SparkQA · 2016-04-12T16:39:28Z

Test build #55616 has finished for PR 12313 at commit f2186e5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ResolveOutputColumns extends Rule[LogicalPlan]

SparkQA · 2016-04-19T19:34:56Z

Test build #56239 has finished for PR 12313 at commit c3e8561.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- class ResolveOutputColumns extends Rule[LogicalPlan]

SparkQA · 2016-04-20T19:34:34Z

Test build #56386 has finished for PR 12313 at commit 0a6ec77.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-20T22:57:09Z

Test build #56411 has finished for PR 12313 at commit 6c107ff.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-04-21T22:52:10Z

Test build #56586 has finished for PR 12313 at commit fcacba5.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-04-22T01:42:02Z

Test build #56605 has finished for PR 12313 at commit 7516ffd.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-04-22T19:15:20Z

Test build #56708 has finished for PR 12313 at commit 42ff7b7.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-04-23T00:23:54Z

Test build #56757 has finished for PR 12313 at commit c443b32.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2016-05-09T18:24:38Z

Retest this please.

SparkQA · 2016-05-09T20:03:57Z

Test build #58156 has finished for PR 12313 at commit ab23c9c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2016-05-09T20:08:50Z

@liancheng, @cloud-fan, this commit is a follow-up to #12239 that fixes column resolution when writing to both Hive MetastoreRelations and HadoopFsRelations. Could you review it? I think it would be a good addition to 2.0.0 as well.

cloud-fan · 2016-05-19T15:07:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -498,6 +499,117 @@ class Analyzer(
    }
  }

+  val ResolveOutputColumns = new ResolveOutputColumns
+  class ResolveOutputColumns extends Rule[LogicalPlan] {


We can simply use object ResolveOutputColumns here

I'll fix this and the protected method. I think I was instantiating this rule multiple times at one point and this is left-over.

rdblue · 2016-05-19T17:08:59Z

@cloud-fan, @liancheng, thanks for reviewing! I've rebased on master and fixed your comments so far.

SparkQA · 2016-06-03T01:23:03Z

Test build #59894 has finished for PR 12313 at commit 302464b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-06-05T19:20:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -505,6 +506,117 @@ class Analyzer(
    }
  }

+  object ResolveOutputColumns extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {


should use plan.resolveOperators
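For readers unfamiliar with the distinction: as far as I recall, in Spark 2.0 `resolveOperators` behaved like a bottom-up transform that skips subtrees already marked as analyzed, whereas `transform` revisits every node. The sketch below models that idea on a toy tree. `Plan`, `Relation`, `Insert`, and this `resolveOperators` helper are hypothetical stand-ins written for illustration, not Spark's classes.

```scala
object ResolveOperatorsSketch {
  // Toy plan nodes (hypothetical); `analyzed` marks a subtree as already resolved.
  sealed trait Plan { def analyzed: Boolean }
  final case class Relation(name: String, analyzed: Boolean = false) extends Plan
  final case class Insert(
      table: String,
      child: Plan,
      resolved: Boolean = false,
      analyzed: Boolean = false) extends Plan

  // Bottom-up rewrite that leaves subtrees marked as analyzed untouched.
  def resolveOperators(plan: Plan)(rule: PartialFunction[Plan, Plan]): Plan =
    if (plan.analyzed) {
      plan
    } else {
      // Rewrite children first, then offer the node itself to the rule.
      val withNewChildren = plan match {
        case i: Insert => i.copy(child = resolveOperators(i.child)(rule))
        case other => other
      }
      rule.applyOrElse(withNewChildren, (p: Plan) => p)
    }
}
```

With a rule such as `{ case i: Insert if !i.resolved => i.copy(resolved = true) }`, an unanalyzed `Insert` is rewritten, while one constructed with `analyzed = true` is returned unchanged.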

yhuai · 2016-06-10T19:06:55Z

@rdblue Thank you for updating the patch. I was out of town late last week and was busy with Spark Summit early this week. Sorry for my late reply. Having name-based resolution is very useful! Since this is a major improvement and we are pretty late in the 2.0 QA period, how about we re-target the work to 2.1?

Also, I feel it would be good to have a short design doc to discuss the following:

- DataFrameWriter and SQL interfaces, and how to switch between matching by ordinals and matching by names.
- Approaches for handling missing fields and new fields (both top-level fields and inner fields). It would be good to discuss the semantics when we have missing fields and new fields, as well as how to maintain the schema in the metastore.

Based on the design, we can break the work into multiple pieces and get them merged for the 2.1 release.

What do you think?

This combines Hive's pre-insertion casts (without renames) that handle partitioning with the pre-insertion casts/renames in core. The combined rule, ResolveOutputColumns, will resolve columns by name or by position. Resolving by position will detect cases where the number of columns is incorrect or where the input columns are a permutation of the output columns and fail. When resolving by name, each output column is located by name in the child plan. This handles cases where a subset of a data frame is written out.
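The two strategies in this commit message can be sketched without Spark. This is a minimal model, not the rule from the patch: `Column`, `Projection`, `byPosition`, and `byName` are hypothetical names, errors are returned as `Left` values instead of an AnalysisException, and a permuted input fails, following the commit message rather than the "warns" wording in the PR description.

```scala
object ResolveOutputColumnsSketch {
  // Hypothetical stand-ins for table and data attributes.
  final case class Column(name: String, dataType: String)
  // One output column: which input feeds it, its output name, and the cast target.
  final case class Projection(from: String, to: String, castTo: String)

  // By position: column counts must match, and a pure reordering of the
  // table's column names is rejected as a likely mistake.
  def byPosition(table: Seq[Column], data: Seq[Column]): Either[String, Seq[Projection]] = {
    val tableNames = table.map(_.name)
    val dataNames = data.map(_.name)
    if (data.size > table.size) {
      Left(s"Too many data columns: expected ${table.size}, found ${data.size}")
    } else if (data.size < table.size) {
      Left(s"Not enough data columns: expected ${table.size}, found ${data.size}")
    } else if (dataNames != tableNames && dataNames.sorted == tableNames.sorted) {
      Left("Data columns appear to be out of order: " + dataNames.mkString(", "))
    } else {
      // Pair columns by ordinal; each input is renamed and cast to the table column.
      Right(table.zip(data).map { case (out, in) => Projection(in.name, out.name, out.dataType) })
    }
  }

  // By name: every table column must exist in the data; extra data columns
  // are ignored, so a subset of a data frame can be written out.
  def byName(table: Seq[Column], data: Seq[Column]): Either[String, Seq[Projection]] = {
    val dataNames = data.map(_.name).toSet
    val missing = table.map(_.name).filterNot(dataNames.contains)
    if (missing.nonEmpty) {
      Left("Cannot find data columns: " + missing.mkString(", "))
    } else {
      Right(table.map(out => Projection(out.name, out.name, out.dataType)))
    }
  }
}
```

For a table `(id bigint, name string)`, `byPosition` accepts data columns `(a, b)`, rejects `(name, id)` as a permutation, and rejects any input with three columns. `byName` accepts `(name, extra, id)` and projects only `id` and `name`.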

This PR now catches this problem during analysis and has a better error message. This commit updates the test for the new message and exception type.

Adding new arguments to InsertIntoTable requires changes to several files. Instead of adding a long list of optional args, this adds an options map, like the one passed to DataSource. Future options can be added and used only where they are needed.
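A rough illustration of the options-map design, assuming a simplified node: `InsertIntoTableSketch` and the `matchByName` key are hypothetical and chosen only for this example; the real InsertIntoTable node carries more fields and the patch may use a different key.

```scala
object InsertOptionsSketch {
  // Hypothetical, simplified stand-in for the logical node.
  final case class InsertIntoTableSketch(
      table: String,
      overwrite: Boolean,
      options: Map[String, String] = Map.empty)

  // Hypothetical option key, not taken from the patch.
  val MatchByName = "matchByName"

  // Only code that cares about an option reads it; a missing key falls back
  // to the default behaviour (resolution by position).
  def matchByName(insert: InsertIntoTableSketch): Boolean =
    insert.options.get(MatchByName).exists(_.equalsIgnoreCase("true"))
}
```

Call sites that build the node without options keep compiling, which is the point of the map: a new option does not change the constructor signature.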

rdblue · 2016-06-11T00:33:53Z

@yhuai, whatever release you want to target is fine with me, but I don't think we should block this on a design doc for cleaning up the DataFrameWriter. I'm all for writing one and I plan to participate in that design, but the scope of that work is quite a bit larger than what this PR does.

Since this PR doesn't change the public API, I don't think we should wait until we have a plan for the public API (design doc goal 1) to commit it. Similarly for the second goal, missing columns, extra columns, and metastore updates are beyond the scope here, when this can simply require that the number of columns matches since that's the most conservative strategy (that's what is now implemented).

The remaining issue is whether it is a good idea to include the by-name code. I think the concern is that it may change based on the API design doc, but that's really unlikely. For example, adding the option to InsertIntoTable will be required unless we duplicate that logical node, which I think is an unlikely choice.

I'm just trying to avoid dragging this out for a lot longer, or the work of splitting it up needlessly and having to keep rebasing the changes on master. I could be wrong, so if I'm missing something here, then please let me know (and thanks for being patient).

Also, I addressed @cloud-fan's comments and rebased on master.

SparkQA · 2016-06-11T02:47:47Z

Test build #60327 has finished for PR 12313 at commit 906e68d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-06-11T16:14:45Z

Really like this PR! It removes one of the Hive-specific rules! : )

gatorsmile · 2016-06-12T23:58:06Z

If this is too large for merging to 2.0, could @rdblue deliver a small fix for capturing the illegal user inputs? Thanks!

…ases for by position resolution ## What changes were proposed in this pull request? This PR migrates some test cases introduced in #12313 as a follow-up of #13754 and #13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements. Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables. ## How was this patch tested? N/A Author: Cheng Lian <lian@databricks.com> Closes #13810 from liancheng/spark-16037-follow-up-tests. (cherry picked from commit f4a3d45) Signed-off-by: Yin Huai <yhuai@databricks.com>

SparkQA · 2017-03-22T13:39:47Z

Test build #75047 has finished for PR 12313 at commit 906e68d.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-05-11T12:23:34Z

Hi all, I just wonder where we are on this.

rdblue · 2017-05-11T16:28:03Z

We were trying to get this in just before the 2.0 release, which was a bad time. We've just been maintaining it in our version, but I'm going to be rebasing it on to 2.1 soon so I'll see what needs to be done to get it in upstream.

HyukjinKwon · 2017-05-11T23:42:44Z

I see. Thank you.

SparkQA · 2017-09-27T16:39:16Z

Test build #82245 has finished for PR 12313 at commit 906e68d.

This patch fails PySpark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

rdblue · 2018-07-26T18:48:00Z

This is addressed by #21305.

rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from cf82d95 to c4c6820 Compare April 12, 2016 00:23

rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from c4c6820 to f2186e5 Compare April 12, 2016 15:39

rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from f2186e5 to c3e8561 Compare April 19, 2016 17:44

rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from c3e8561 to 0a6ec77 Compare April 20, 2016 18:22

rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from 0a6ec77 to 6c107ff Compare April 20, 2016 21:16

rdblue mentioned this pull request Apr 21, 2016

[SPARK-14459] [SQL] Detect relation partitioning and adjust the logical plan #12239

Closed

rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from fcacba5 to 7516ffd Compare April 21, 2016 23:20

rdblue changed the title ~~[SPARK-14543] [SQL] Fix InsertIntoTable column resolution. (WIP)~~ [SPARK-14543] [SQL] Improve InsertIntoTable column resolution. Apr 22, 2016

rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from 42ff7b7 to c443b32 Compare April 23, 2016 00:17

rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from c443b32 to ab23c9c Compare May 9, 2016 18:05

cloud-fan reviewed May 19, 2016

rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from ab23c9c to 3a24e36 Compare May 19, 2016 17:07

cloud-fan reviewed Jun 5, 2016

rdblue added 7 commits June 10, 2016 16:58

SPARK-14543: Fix bad SQL in HiveQuerySuite test.

9a8cbc2

SPARK-14543: Update InsertSuite test for too few columns.

bb8e7e7

This PR now catches this problem during analysis and has a better error message. This commit updates the test for the new message and exception type.

SPARK-14543: Remove DataFrameWriter#byName.

c820846

SPARK-14543: Fix HiveCompatibilitySuite cases.

d577aed

SPARK-14543: Updates based on review feedback.

906e68d

rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from 302464b to 906e68d Compare June 11, 2016 00:33

gatorsmile mentioned this pull request Jun 11, 2016

[SPARK-15706] [SQL] Fix Wrong Answer when using IF NOT EXISTS in INSERT OVERWRITE for DYNAMIC PARTITION #13447

Closed

gatorsmile mentioned this pull request Jun 12, 2016

[SPARK-15907] [SQL] Issue Exceptions when Not Enough Input Columns for Dynamic Partitioning #13628

Closed

liancheng mentioned this pull request Jun 21, 2016

[SPARK-16037][SQL] Follow-up: add DataFrameWriter.insertInto() test cases for by position resolution #13810

Closed

rdblue closed this Jul 26, 2018
