[SPARK-28730][SQL] Configurable type coercion policy for table insertion #25453
Conversation
)
  .stringConf
  .transform(_.toUpperCase(Locale.ROOT))
  .checkValues(StoreAssignmentPolicy.values.map(_.toString))
Here the configuration is an ENUM instead of a Boolean type. We will have a new policy "ANSI" after #25239 is finished.
CC @cloud-fan @maropu @rdblue @HyukjinKwon @mccheah
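As an aside for readers, here is a minimal standalone sketch of what an enum-valued policy setting can look like, using a plain Scala `Enumeration` and a hypothetical parse helper rather than Spark's internal `ConfigBuilder`; it only mirrors the `.stringConf.transform(...).checkValues(...)` pattern shown in the diff above.

```scala
import java.util.Locale

// Standalone sketch (not Spark's internal ConfigBuilder): an enumeration-valued
// setting that normalizes case and rejects unknown values.
object StoreAssignmentPolicySketch extends Enumeration {
  val LEGACY, STRICT = Value // an ANSI value is expected once #25239 lands
}

object PolicyConfSketch {
  def parse(raw: String): StoreAssignmentPolicySketch.Value = {
    val normalized = raw.toUpperCase(Locale.ROOT)
    require(
      StoreAssignmentPolicySketch.values.map(_.toString).contains(normalized),
      s"Invalid value for store assignment policy: $raw")
    StoreAssignmentPolicySketch.withName(normalized)
  }
}

// PolicyConfSketch.parse("legacy") == StoreAssignmentPolicySketch.LEGACY
// PolicyConfSketch.parse("ansi") currently throws IllegalArgumentException
```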
I believe this PR is the minimal effort to fix the Spark 3.0 blocker: DS v1 and v2 tables have inconsistent behaviors regarding table insertion. I think we all agree that we need to make table insertion behavior configurable, and IMO legacy mode should be the default instead of the strict mode, as it's the behavior in Spark 1.x and 2.x. For now I think the legacy mode is the most reasonable default, but this may change after we make more progress on new policies and fixing the "return-null" behavior. We can discuss changing the default at that time.
Test build #109108 has finished for PR 25453 at commit
Test build #109109 has finished for PR 25453 at commit
Test build #109347 has finished for PR 25453 at commit
Test build #109348 has finished for PR 25453 at commit
"strict. With legacy policy, Spark allows casting any value to any data type. " + | ||
"The legacy policy is the only behavior in Spark 2.x and it is compatible with Hive. " + | ||
"With strict policy, Spark doesn't allow any possible precision loss or data truncation " + | ||
"in type coercion, e.g. `int` and `long`, `float` -> `double` are not allowed." |
e.g. `int` to `long`, `timestamp` to `date` ...
} else {
  // run the type check first to ensure type errors are present
  val canWrite = DataType.canWrite(
    queryExpr.dataType, tableAttr.dataType, byName, useStrictRules,
Why does `DataType.canWrite` need to take the `useStrictRules` parameter? In this branch `useStrictRules` is true.
    conf: SQLConf,
    addError: String => Unit): Option[NamedExpression] = {

  val useStrictRules = conf.storeAssignmentPolicy == StoreAssignmentPolicy.STRICT
`useStrictRules` -> `useStrictRule`
I would prefer "rules".
If we will add another rule in follow-up activities, it'd be better to use pattern matching here instead of `if`?
what "rules" mean here? From the code we are applying the STRICT rule here. If you want to represent "non-legacy rules", I think it's better to write
val isLegacyMode = conf.storeAssignmentPolicy == StoreAssignmentPolicy.LEGACY
...
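For illustration, here is a toy sketch of the pattern-matching shape being discussed, using hypothetical local names (`Policy`, `resolveColumn`) rather than Spark's actual `SQLConf`/`TableOutputResolver` code; the point is that each policy gets its own case instead of a boolean flag threaded through the resolver.

```scala
// Toy sketch of the pattern-matching alternative, not the real resolver code.
object ResolutionSketch {
  object Policy extends Enumeration {
    val LEGACY, STRICT = Value
  }

  // Adding a future ANSI policy would be one more case here,
  // rather than another boolean parameter.
  def resolveColumn(policy: Policy.Value, source: String, target: String): String =
    policy match {
      case Policy.LEGACY => s"Cast($source as $target)"   // legacy: always insert a plain cast
      case Policy.STRICT => s"UpCast($source as $target)" // strict: type check, then upcast
    }
}

// ResolutionSketch.resolveColumn(ResolutionSketch.Policy.STRICT, "int", "long")
//   == "UpCast(int as long)"
```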
assertAnalysisError(parsedPlan, Seq(
  "Cannot write incompatible data to table", "'table-name'",
  "Cannot write nullable values to non-null column", "'x'", "'y'"))
withSQLConf(SQLConf.STORE_ASSIGNMENT_POLICY.key ->
We can set this config in `beforeAll` and unset it in `afterAll`.
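A minimal sketch of that suggestion, assuming a ScalaTest (3.1+) suite and using a plain mutable map as a stand-in for the suite's SQLConf; a real suite would call its conf's set/unset methods instead.

```scala
import scala.collection.mutable
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// Sketch only: `conf` below is a stand-in for the suite's SQLConf-like settings.
class StrictPolicySuite extends AnyFunSuite with BeforeAndAfterAll {
  private val conf = mutable.Map.empty[String, String]
  private val PolicyKey = "spark.sql.storeAssignmentPolicy"

  override def beforeAll(): Unit = {
    super.beforeAll()
    conf(PolicyKey) = "STRICT"   // set the policy once for every test in the suite
  }

  override def afterAll(): Unit = {
    conf.remove(PolicyKey)       // restore the default after the suite finishes
    super.afterAll()
  }

  test("policy is strict inside the suite") {
    assert(conf(PolicyKey) == "STRICT")
  }
}
```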
} else {
  // always add an UpCast. it will be removed in the optimizer if it is unnecessary.
  Some(Alias(
    UpCast(queryExpr, tableAttr.dataType), tableAttr.name
Shall we use `Cast` here? The upcast logic is already checked in `DataType.canWrite`. Then we can remove https://github.com/apache/spark/pull/25453/files#diff-7690f56bde3f7a3dd76fab9c136c1494R181
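To make the trade-off concrete, here is a toy model of "always wrap the query column in a cast and let a later pass drop it when it is a no-op"; the mini-expression classes are hypothetical, not catalyst's actual `Cast`/`UpCast` classes.

```scala
// Toy model of the "wrap unconditionally, simplify later" design choice.
object CastSketch {
  sealed trait Expr { def tpe: String }
  case class Col(name: String, tpe: String) extends Expr
  case class CastTo(child: Expr, tpe: String) extends Expr

  // Resolution step: unconditionally add a cast to the table column's type.
  def project(query: Expr, targetType: String): Expr = CastTo(query, targetType)

  // Optimizer-like step: remove the cast if the child already has that type.
  def simplify(e: Expr): Expr = e match {
    case CastTo(child, t) if child.tpe == t => child
    case other => other
  }
}

// CastSketch.simplify(CastSketch.project(CastSketch.Col("x", "int"), "int"))
//   == CastSketch.Col("x", "int")
```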
Test build #109342 has finished for PR 25453 at commit
@@ -2525,7 +2525,8 @@ class Analyzer(
   override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
     case append @ AppendData(table, query, isByName)
         if table.resolved && query.resolved && !append.outputResolved =>
-      val projection = resolveOutputColumns(table.name, table.output, query, isByName)
+      val projection =
+        TableOutputResolver.resolveOutputColumns(table.name, table.output, query, isByName, conf)
The code in lines 2528-2535, 2539-2546, and 2550-2557 is duplicated, so can we merge them into one rule by defining an extractor (`unapply`)?
btw, `ResolveOutputRelation` currently seems to be a thin wrapper for `TableOutputResolver`. Do we need to separate `TableOutputResolver` from `ResolveOutputRelation`?
> The code in lines 2528-2535, 2539-2546, and 2550-2557 is duplicated, so can we merge them into one rule by defining an extractor (`unapply`)?

Eventually we need to call `append.copy(query = projection)` and `overwrite.copy(query = projection)`, so we need to match all the `V2WriteCommand`s anyway. I think it is fine to keep the current code.
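For readers unfamiliar with the extractor idea, here is a toy sketch with hypothetical command classes standing in for the `V2WriteCommand` hierarchy; it also shows why, as noted above, copying the resolved projection back still requires matching each concrete type.

```scala
// Toy illustration only; these are not Spark's actual AppendData/Overwrite classes.
object ExtractorSketch {
  sealed trait WriteCommand { def table: String; def query: String }
  case class Append(table: String, query: String) extends WriteCommand
  case class Overwrite(table: String, query: String, deleteExpr: String) extends WriteCommand

  // One extractor exposes the pieces every write command shares.
  object ResolvableWrite {
    def unapply(cmd: WriteCommand): Option[(String, String)] = Some((cmd.table, cmd.query))
  }

  def resolve(cmd: WriteCommand): WriteCommand = cmd match {
    case ResolvableWrite(table, query) =>
      val projection = s"resolved($query -> $table)"
      // The shared extractor handles matching, but copying the resolved
      // projection back still needs a case per concrete command type.
      cmd match {
        case a: Append    => a.copy(query = projection)
        case o: Overwrite => o.copy(query = projection)
      }
  }
}
```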
> btw, ResolveOutputRelation currently seems to be a thin wrapper for TableOutputResolver. Do we need to separate TableOutputResolver from ResolveOutputRelation?

Otherwise, we can't access the method `resolveOutputColumns` in `PreprocessTableInsertion`.
> If we will add another rule in follow-up activities, it'd be better to use pattern matching here instead of `if`?

We still need
Test build #109424 has finished for PR 25453 at commit
@gengliangwang, in the description, you said:
Why is that? We know that v2 already introduces breaking behavior changes and we can't avoid them. We were previously okay with different behavior between v1 and v2, so I see no reason to support the legacy type coercion in the v2 path. I do think it makes sense to support the SQL standard type coercion along with strict type coercion in v2, though.
@@ -2525,7 +2525,8 @@ class Analyzer(
   override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
We might need to update the description for `ResolveOutputRelation` and refine "what's a safe cast?".
btw, we might need to move some parts of this description to `TableOutputResolver`.
I have removed the word "safe" in the comment. I think the comment "Detect plans that are not compatible with the output table and throw AnalysisException" already states that there is a type coercion check in the rule.
}

case other =>
  throw new AnalysisException(s"Unsupported store assignment policy: $other")
Do we need this? It seems we have already checked that the mode is valid: https://github.com/apache/spark/pull/25453/files#diff-9a6b543db706f1a90f790783d6930a13R1661
outputField

case StoreAssignmentPolicy.STRICT =>
  // run the type check first to ensure type errors are present
if (queryExpr.nullable && !tableAttr.nullable) {
  addError(s"Cannot write nullable values to non-null column '${tableAttr.name}'")
  None
} else {
  // run the type check first to ensure type errors are present
  val canWrite = DataType.canWrite(
    queryExpr.dataType, tableAttr.dataType, byName, conf.resolver, tableAttr.name, addError)
  if (canWrite) {
    outputField
  } else {
    None
  }
}
?
btw, we don't need the check `queryExpr.nullable && !tableAttr.nullable` in the other modes?
I think this is on purpose in the original code. Running `DataType.canWrite` can expose more errors.
> btw, we don't need the check `queryExpr.nullable && !tableAttr.nullable` in the other modes?
IIRC there is no such check in Spark 2.x
.copy(SQLConf.STORE_ASSIGNMENT_POLICY -> StoreAssignmentPolicy.STRICT)
val catalog = new SessionCatalog(new InMemoryCatalog, FunctionRegistry.builtin, conf)
catalog.createDatabase(
  CatalogDatabase("default", "", new URI("loc"), Map.empty),
How about creating a temporary dir for `loc`?
This is actually copied from `AnalysisTest`. I think it should be fine.
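For completeness, a one-line sketch of the temporary-directory alternative the reviewer mentioned (illustrative only, REPL/script style; the test keeps the literal `"loc"` URI).

```scala
import java.nio.file.Files

// Create a throwaway directory and use its URI as the database location.
val dbLocation = Files.createTempDirectory("spark-test-db").toUri
// pass `dbLocation` where the test currently uses `new URI("loc")`
```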
read/write are very basic functionalities and ideally we should make them consistent. I know that some commands like ...

Furthermore, we need to make it configurable, and it will be awkward to have 2 store assignment configs for v1 and v2 tables. To be honest I don't like the legacy mode, but I don't think it's possible to make strict mode the default. Breaking changes have degrees, and to me strict mode is too breaking. How about this: we add a hack in this PR, so that legacy mode is the default for v1 tables and strict mode is the default for v2 tables. The hack can be removed if we decide to make the ANSI SQL mode the default. Then we can still have a single config to configure the store assignment policy.
As per the discussion in the dev list, I think most of us agree that we should make the table insertion behavior configurable. So I assume that you are asking to have two table insertion flags for V1 and V2 data sources. I asked for your opinions one week ago in comment #25453 (comment) before I started the actual code changes. I hope we can move forward on this PR. The ANSI mode will be added right after this. Making the legacy policy the default is the safest for now. We can discuss the default policy after ANSI mode is added.
Test build #109470 has finished for PR 25453 at commit
Test build #109466 has finished for PR 25453 at commit
retest this please.
Test build #109478 has finished for PR 25453 at commit
I don't think that v2 should allow the legacy mode. ANSI and strict modes make sense, but carrying the legacy mode forward just to avoid an extra config property doesn't seem worth it. This mode is user-facing, while the configuration properties are primarily administrator settings. Reducing the administrator settings by one doesn't seem worth corrupting user data in the v2 code path. And the v1 code path will eventually be removed, along with a v1 setting for this mode. If v2 doesn't support legacy mode, then we can remove it as well.
I think it is fine to make ANSI mode the default -- assuming that's what the community votes for. But I don't think that carrying legacy mode forward is the right way to avoid making strict mode the default. If legacy mode is available then we will always need to support it, even when v1 is gone. If it is the default, then we can't actually make progress in this area by combining this with breaking changes for v2.
If we want to have just one property, I think what we can do is use different defaults for v1 and v2, but use the same configuration property. That property should only support configuring ANSI and strict modes. That way, setting the property sets the same mode for v1 and v2, but doesn't allow v2 to use the unsafe legacy mode. v1 would continue to default to the legacy mode (unless we want to replace that with ANSI) and v2 would default to whatever is decided by the vote on the dev list.
This is what I see in that comment:
That's not the same thing as making a setting that allows legacy mode with v2. I'm fine adding a way for v1 to use strict mode, although I doubt that's what we will want to ship in Spark 3.0 as a default.
Totally agree. I prefer using the ANSI mode as default as well. Disallowing the legacy mode in V2 also makes sense. I am glad that we came to an agreement on this :)
Any thoughts on whether we can get rid of legacy mode for v1 if we implement ANSI mode? Maybe we could use the 3.0 release to remove legacy mode.
+1; the approach looks pretty reasonable to me, too.
The general policy is to keep a legacy config for at least one release if a behavior change is made, so I think we can remove it in 3.1. I agree with forbidding legacy mode in v2. For now we can make v1 and v2 have different default policies. Once we make the ANSI policy the default, then v1 and v2 can still have the same default policy.
Test build #109564 has finished for PR 25453 at commit
Test build #109566 has finished for PR 25453 at commit
From this and other comments, it sounds like there is an assumption that ANSI mode will be the default. Just want to remind everyone that I think that decision requires a vote on the dev list.
Yes, we need a vote. #25239 should only add the ANSI SQL policy without changing the default. A separate PR is needed to change the default if the vote passes.
thanks, merging to master!
What changes were proposed in this pull request?
After all the discussions in the dev list: http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
Here I propose that we make the store assignment rules in the analyzer configurable, and that the behavior of V1 and V2 should be consistent.
When inserting a value into a column with a different data type, Spark will perform type coercion. After this PR, we support 2 policies for the type coercion rules: legacy and strict.
With the legacy policy, Spark allows casting any value to any data type. The legacy policy is the only behavior in Spark 2.x and it is compatible with Hive.
With the strict policy, Spark doesn't allow any possible precision loss or data truncation in type coercion, e.g. `int` and `long`, `float` -> `double` are not allowed.
Eventually, the "legacy" mode will be removed, so it is disallowed in data source V2.
To ensure backward compatibility with existing queries, the default store assignment policy for data source V1 is "legacy".
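To make the two policies concrete, here is a deliberately simplified sketch of the contrast; the rules below are assumptions for illustration only, not Spark's actual `DataType.canWrite` logic.

```scala
// Toy contrast: LEGACY accepts every assignment, STRICT rejects anything
// that is not on a (here, trivially small) lossless allow-list.
object PolicyContrastSketch {
  private val lossless = Set("int", "long", "float", "double", "string").map(t => (t, t))

  def canStore(source: String, target: String, policy: String): Boolean = policy match {
    case "LEGACY" => true                                 // a cast is always inserted; may yield null or truncate
    case "STRICT" => lossless.contains((source, target))  // only clearly safe assignments pass the analyzer
    case other    => throw new IllegalArgumentException(s"Unknown store assignment policy: $other")
  }
}

// PolicyContrastSketch.canStore("string", "int", "LEGACY") == true  (checked only at runtime)
// PolicyContrastSketch.canStore("string", "int", "STRICT") == false (rejected at analysis time)
```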
How was this patch tested?
Unit test