[SPARK-14543] [SQL] Improve InsertIntoTable column resolution. #12313

Closed

rdblue
Contributor

@rdblue rdblue commented Apr 12, 2016

What changes are proposed in this pull request?

  1. This updates the logic that resolves the output table's columns from the incoming LogicalPlan. It catches cases where there are too many data columns and throws an AnalysisException rather than silently dropping the extra data. It also improves the error message when there are too few columns and warns when the output columns appear to be out of order.
  2. This combines the pre-insert casts for Hive's MetastoreRelation with the pre-insert cast and rename for LogicalRelations. Both are now handled as a single ResolveOutputColumns step in the analyzer that implements the above improvements. Casts are now UpCasts to avoid silently adding incorrect casts when columns are misaligned. Casts are still somewhat unsafe and should be fixed in a follow-up PR.
  3. This adds a by-name column resolution strategy that matches output columns to the incoming data by name. This is exposed on the DataFrameWriter:

How was this patch tested?

This patch includes unit tests that exercise the cases outlined above.

@rdblue rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from cf82d95 to c4c6820 Compare April 12, 2016 00:23
@SparkQA

SparkQA commented Apr 12, 2016

Test build #55558 has finished for PR 12313 at commit c4c6820.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ResolveOutputColumns extends Rule[LogicalPlan]

@rdblue rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from c4c6820 to f2186e5 Compare April 12, 2016 15:39
@SparkQA

SparkQA commented Apr 12, 2016

Test build #55616 has finished for PR 12313 at commit f2186e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ResolveOutputColumns extends Rule[LogicalPlan]

@rdblue rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from f2186e5 to c3e8561 Compare April 19, 2016 17:44
@SparkQA

SparkQA commented Apr 19, 2016

Test build #56239 has finished for PR 12313 at commit c3e8561.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class ResolveOutputColumns extends Rule[LogicalPlan]

@rdblue rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from c3e8561 to 0a6ec77 Compare April 20, 2016 18:22
@SparkQA

SparkQA commented Apr 20, 2016

Test build #56386 has finished for PR 12313 at commit 0a6ec77.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from 0a6ec77 to 6c107ff Compare April 20, 2016 21:16
@SparkQA

SparkQA commented Apr 20, 2016

Test build #56411 has finished for PR 12313 at commit 6c107ff.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 21, 2016

Test build #56586 has finished for PR 12313 at commit fcacba5.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from fcacba5 to 7516ffd Compare April 21, 2016 23:20
@SparkQA

SparkQA commented Apr 22, 2016

Test build #56605 has finished for PR 12313 at commit 7516ffd.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 22, 2016

Test build #56708 has finished for PR 12313 at commit 42ff7b7.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rdblue rdblue changed the title from "[SPARK-14543] [SQL] Fix InsertIntoTable column resolution. (WIP)" to "[SPARK-14543] [SQL] Improve InsertIntoTable column resolution." Apr 22, 2016
@rdblue rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from 42ff7b7 to c443b32 Compare April 23, 2016 00:17
@SparkQA

SparkQA commented Apr 23, 2016

Test build #56757 has finished for PR 12313 at commit c443b32.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from c443b32 to ab23c9c Compare May 9, 2016 18:05
@rdblue
Contributor Author

rdblue commented May 9, 2016

Retest this please.

@SparkQA

SparkQA commented May 9, 2016

Test build #58156 has finished for PR 12313 at commit ab23c9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented May 9, 2016

@liancheng, @cloud-fan, this commit is a follow-up to #12239 that fixes column resolution when writing to both Hive MetastoreRelations and HadoopFsRelations. Could you review it? I think it would be a good addition to 2.0.0 as well.

@@ -498,6 +499,117 @@ class Analyzer(
}
}

val ResolveOutputColumns = new ResolveOutputColumns
class ResolveOutputColumns extends Rule[LogicalPlan] {
Contributor

We can simply use object ResolveOutputColumns here
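
A minimal sketch of what this suggestion amounts to, assuming simplified stand-ins for Catalyst's types: `Plan`, `EmptyPlan`, the local `Rule` trait, and the wrapper `RuleStyles` are hypothetical names so the snippet compiles without Spark. It contrasts a class with a single instance held in a val against a singleton object.

```scala
object RuleStyles {
  // Hypothetical stand-ins for LogicalPlan and Rule, for illustration only.
  trait Plan
  case object EmptyPlan extends Plan
  trait Rule[T] { def apply(plan: T): T }

  // Style in the diff under review: a class plus a val holding its only instance.
  class ResolveOutputColumnsClass extends Rule[Plan] {
    def apply(plan: Plan): Plan = plan // no-op body, placeholder for the real logic
  }
  val resolveOutputColumnsInstance = new ResolveOutputColumnsClass

  // Suggested style: a singleton object, which needs no separate instantiation.
  object ResolveOutputColumns extends Rule[Plan] {
    def apply(plan: Plan): Plan = plan // no-op body, placeholder for the real logic
  }
}
```

Both forms are applied the same way, for example `RuleStyles.ResolveOutputColumns(RuleStyles.EmptyPlan)`; the object form simply removes the extra val when only one instance is ever needed.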

Contributor Author

I'll fix this and the protected method. I think I was instantiating this rule multiple times at one point and this is left-over.

@rdblue rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from ab23c9c to 3a24e36 Compare May 19, 2016 17:07
@rdblue
Contributor Author

rdblue commented May 19, 2016

@cloud-fan, @liancheng, thanks for reviewing! I've rebased on master and fixed your comments so far.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59894 has finished for PR 12313 at commit 302464b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -505,6 +506,117 @@ class Analyzer(
}
}

object ResolveOutputColumns extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
Contributor

should use plan.resolveOperators

@yhuai
Copy link
Contributor

yhuai commented Jun 10, 2016

@rdblue Thank you for updating the patch. I was out of town late last week and was busy with Spark Summit early this week. Sorry for my late reply. Having name-based resolution is very useful! Since this is a major improvement and we are pretty late in the 2.0 QA period, how about we re-target the work to 2.1?

Also, I feel it will be good to have a short design doc to discuss the following:

  1. DataFrameWriter and SQL interfaces and how to switch between matching by ordinals and matching by names.
  2. Approaches for handling missing fields and new fields (both top level fields and inner fields). It will be good to discuss the semantics when we have missing fields and new fields as well as how to maintain the schema in the metastore.

Based on the design, we can break the work into multiple pieces and get them merged for the 2.1 release.

What do you think?

This combines Hive's pre-insertion casts (without renames) that handle
partitioning with the pre-insertion casts/renames in core. The combined
rule, ResolveOutputColumns, will resolve columns by name or by position.
Resolving by position will detect cases where the number of columns is
incorrect or where the input columns are a permutation of the output
columns and fail. When resolving by name, each output column is located
by name in the child plan. This handles cases where a subset of a data
frame is written out.
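
The two strategies described in this commit message can be sketched roughly as follows. This is not the PR's implementation: `Column`, `OutputColumnResolution`, `byPosition`, and `byName` are hypothetical names, columns are modeled as plain name/type pairs rather than Catalyst attributes, and errors are returned as `Left` values instead of an AnalysisException.

```scala
// Hypothetical, simplified model of a column: just a name and a type label.
final case class Column(name: String, dataType: String)

object OutputColumnResolution {
  /** By position: the column counts must match, and a pure reordering of the names is rejected. */
  def byPosition(output: Seq[Column], input: Seq[Column]): Either[String, Seq[Column]] = {
    if (input.size != output.size) {
      Left(s"Expected ${output.size} columns but found ${input.size}")
    } else {
      val outNames = output.map(_.name)
      val inNames = input.map(_.name)
      // Same set of names in a different order suggests misaligned columns.
      if (outNames != inNames && outNames.sorted == inNames.sorted) {
        Left(s"Input columns appear to be reordered: ${inNames.mkString(", ")}")
      } else {
        Right(input)
      }
    }
  }

  /** By name: each output column is looked up in the input; extra input columns are ignored. */
  def byName(output: Seq[Column], input: Seq[Column]): Either[String, Seq[Column]] = {
    val inputByName = input.map(c => c.name -> c).toMap
    val missing = output.map(_.name).filterNot(inputByName.contains)
    if (missing.nonEmpty) Left(s"Missing columns: ${missing.mkString(", ")}")
    else Right(output.map(c => inputByName(c.name)))
  }
}
```

With an output schema of `id, name`, `byPosition` accepts an input of `id, name`, rejects `name, id` as a permutation, and rejects any input with a different column count. `byName` accepts `extra, name, id` and returns the columns in output order, which corresponds to writing out a subset of a data frame.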

This PR now catches this problem during analysis and has a better error
message. This commit updates the test for the new message and exception
type.

Adding new arguments to InsertIntoTable requires changes to several
files.  Instead of adding a long list of optional args, this adds an
options map, like the one passed to DataSource. Future options can
be added and used only where they are needed.
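
As a rough illustration of the options-map approach described above, the sketch below uses a hypothetical, heavily simplified node. `InsertIntoTableSketch`, its fields, and the `matchByName` option key are assumptions made for this example and are not taken from the PR.

```scala
// Hypothetical simplified node; the real InsertIntoTable carries different fields.
final case class InsertIntoTableSketch(
    table: String,
    overwrite: Boolean,
    options: Map[String, String] = Map.empty) {

  // Hypothetical option key, read only by the code that cares about it.
  def matchByName: Boolean =
    options.get("matchByName").exists(_.equalsIgnoreCase("true"))
}
```

Call sites that do not need the new behavior keep constructing the node as before, for example `InsertIntoTableSketch("t", overwrite = false)`, while a later option only requires a new key in the map rather than another constructor argument in every file that builds or pattern-matches the node.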
@rdblue rdblue force-pushed the SPARK-14543-fix-hive-write-cast-and-rename branch from 302464b to 906e68d Compare June 11, 2016 00:33
@rdblue
Contributor Author

rdblue commented Jun 11, 2016

@yhuai, whatever release you want to target is fine with me, but I don't think we should block this on a design doc for cleaning up the DataFrameWriter. I'm all for writing one and I plan to participate in that design, but the scope of that work is quite a bit larger than what this PR does.

Since this PR doesn't change the public API, I don't think we should wait until we have a plan for the public API (design doc goal 1) to commit it. Similarly for the second goal, missing columns, extra columns, and metastore updates are beyond the scope here, when this can simply require that the number of columns matches since that's the most conservative strategy (that's what is now implemented).

The remaining issue is whether it is a good idea to include the by-name code. I think the concern is that it may change based on the API design doc, but that's really unlikely. For example, adding the option to InsertIntoTable will be required unless we duplicate that logical node, which I think is an unlikely choice.

I'm just trying to avoid dragging this out for a lot longer, or the work of splitting it up needlessly and having to keep rebasing the changes on master. I could be wrong, so if I'm missing something here, then please let me know (and thanks for being patient).

Also, I addressed @cloud-fan's comments and rebased on master.

@SparkQA

SparkQA commented Jun 11, 2016

Test build #60327 has finished for PR 12313 at commit 906e68d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Really like this PR! It removes one of the Hive-specific rules! :)

@gatorsmile
Member

If this is too large to merge into 2.0, could @rdblue deliver a small fix for catching illegal user inputs? Thanks!

ghost pushed a commit to dbtsai/spark that referenced this pull request Jun 21, 2016
…ases for by position resolution

## What changes were proposed in this pull request?

This PR migrates some test cases introduced in apache#12313 as a follow-up of apache#13754 and apache#13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements.

Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables.

## How was this patch tested?

N/A

Author: Cheng Lian <lian@databricks.com>

Closes apache#13810 from liancheng/spark-16037-follow-up-tests.
asfgit pushed a commit that referenced this pull request Jun 21, 2016
…ases for by position resolution

## What changes were proposed in this pull request?

This PR migrates some test cases introduced in #12313 as a follow-up of #13754 and #13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements.

Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables.

## How was this patch tested?

N/A

Author: Cheng Lian <lian@databricks.com>

Closes #13810 from liancheng/spark-16037-follow-up-tests.

(cherry picked from commit f4a3d45)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@SparkQA

SparkQA commented Mar 22, 2017

Test build #75047 has finished for PR 12313 at commit 906e68d.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Hi all, I just wonder where we are on this.

@rdblue
Contributor Author

rdblue commented May 11, 2017

We were trying to get this in just before the 2.0 release, which was a bad time. We've just been maintaining it in our version, but I'm going to be rebasing it onto 2.1 soon, so I'll see what needs to be done to get it in upstream.

@HyukjinKwon
Member

I see. Thank you.

@SparkQA

SparkQA commented Sep 27, 2017

Test build #82245 has finished for PR 12313 at commit 906e68d.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Jul 26, 2018

This is addressed by #21305.

@rdblue rdblue closed this Jul 26, 2018