
[WIP][SPARK-27856][SQL] Only allow type upcasting when inserting table #24806

Open · wants to merge 16 commits into base: master

Conversation

@gengliangwang (Contributor) commented Jun 5, 2019

What changes were proposed in this pull request?

In Spark version 2.4 and earlier, when inserting into a table, Spark casts the data types of the input query to the data types of the target table by coercion. This can be very confusing when users make a mistake, e.g. writing string values to an int column: they get no error or warning and simply see null values in the target table.

Since Spark 3.0, by default only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while long -> int or string -> int are not. The old behaviour is preserved under a newly added configuration spark.sql.legacy.insertUnsafeCasts with a default value of false.
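
A minimal sketch of the intended difference (assuming a running SparkSession named spark with default settings; this is illustrative, not code from this PR):

// Spark 2.4 / legacy behaviour: the string is coerced to int, the cast fails, and a null is written.
spark.sql("CREATE TABLE t (i INT) USING parquet")
spark.sql("INSERT INTO t VALUES ('not a number')")   // succeeds; t contains a single null

// Behaviour proposed here: string -> int is not an up-cast,
// so the insert is rejected at analysis time.
spark.sql("INSERT INTO t VALUES ('not a number')")   // throws AnalysisException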

How was this patch tested?

Unit test

@gengliangwang (Contributor, Author) commented Jun 5, 2019

As @cloud-fan is on vacation for a few weeks, I am taking over #24721. I have updated the SQL configuration name and the migration guide.

@gatorsmile @ueshin @rdblue Please help review it.

@SparkQA commented Jun 5, 2019

Test build #106203 has finished for PR 24806 at commit b91498f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@@ -50,6 +50,8 @@ license: |

- In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, the returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully.

- In Spark version 2.4 and earlier, when inserting into a table, Spark will cast the data type of input query to the data type of target table by coercion. Since Spark 3.0, by default only upcasting is allowed when inserting data into table. E.g. `int` -> `long` and `int` -> `string` are allowed, while `long` -> `int` or `string` -> `int` are not allowed. The old behaviour is preserved under a newly added configuration `spark.sql.legacy.insertTable.typeCoercion` with a default value of `false`.

@rdblue (Contributor) commented Jun 5, 2019

I think the configuration name should be clearer. The name typeCoercion doesn't describe what's happening. How about spark.sql.legacy.insertUnsafeCasts? I think that is clear about what will actually happen when it is enabled.

* - Insert aliases when column names do not match
* - Detect plans that are not compatible with the output table and throw AnalysisException
*/
object ResolveOutputRelation extends Rule[LogicalPlan] {

@rdblue (Contributor) commented Jun 5, 2019

Why was this moved?

It is difficult to see whether anything changed in this class. If the move was not required, please move it back.

@gengliangwang (Author, Contributor) commented Jun 6, 2019

I moved it outside of the Analyzer class so that we can call its methods.

As @cloud-fan suggested in https://github.com/apache/spark/pull/24721/files#r287800626

@rdblue (Contributor) commented Jun 6, 2019

Were there any modifications other than moving this?

I think that the right way to expose those functions is to move them to a utility class, not to expose this rule itself.

@cloud-fan (Contributor) commented Jun 9, 2019

I'm OK with moving them to a utility class, but it's better to put analyzer/optimizer rules in their own files instead of in the Analyzer object (this can be done in another PR if we decide to create the util class in this PR).
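
A minimal sketch of the extraction being discussed (hypothetical object and method names, not the PR's actual code): the cast-and-alias helper lives in a standalone utility object so that both ResolveOutputRelation and the V1 insertion path can call it without exposing the rule itself.

import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast, NamedExpression}

// Hypothetical utility object; name and placement are illustrative only.
object TableOutputResolutionUtils {
  // Cast the input attribute to the expected column's type and rename it to match.
  def castAndRename(input: Attribute, expected: Attribute): NamedExpression = {
    if (input.dataType == expected.dataType && input.name == expected.name) {
      input
    } else {
      Alias(Cast(input, expected.dataType), expected.name)()
    }
  }
}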

case (DateType, TimestampType) => true

@rdblue (Contributor) commented Jun 5, 2019

Nit: please remove non-functional whitespace additions.

@SparkQA commented Jun 5, 2019

Test build #106206 has finished for PR 24806 at commit babb288.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
val newQuery = if (conf.getConf(SQLConf.LEGACY_INSERT_TABLE_TYPE_COERCION)) {
  DDLPreprocessingUtils.castAndRenameQueryOutput(insert.query, expectedColumns, conf)
} else {
  val errors = new mutable.ArrayBuffer[String]()

@rdblue (Contributor) commented Jun 5, 2019

Why not run ResolveOutputRelation?

@gengliangwang (Author, Contributor) commented Jun 6, 2019

This is for the V1 path; InsertIntoTable won't be matched in ResolveOutputRelation.

@rdblue (Contributor) commented Jun 6, 2019

If you aren't going to use the ResolveOutputRelation rule, can you explain exactly what behavior this code path should have and how it differs from ResolveOutputRelation?

@gengliangwang (Author, Contributor) commented Jun 10, 2019

I checked, and I think it is workable. I will try moving it into the ResolveOutputRelation rule.

@cloud-fan (Contributor) commented Jun 12, 2019

Have we moved it?

@gengliangwang (Author, Contributor) commented Jun 12, 2019

We need to figure out whether we can up-cast decimal to double/float.
If we can't, then maybe we can't continue this PR.
I have created a new PR for the upcasting: #24849

@rdblue (Contributor) commented Jun 12, 2019

@gengliangwang, why is decimal to double/float required? That should not be allowed in either v1 or v2, so I see no reason why we can't use the same rule for both.

@gengliangwang (Author, Contributor) commented Jun 12, 2019

For the SQL

create table t (d double);
insert into t values (10.0);

the literal 10.0 is a decimal in the Spark SQL parser.
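
An illustration of that point (assuming a running SparkSession named spark; not part of the PR):

// Exact numeric literals are parsed as decimals, so inserting 10.0 into a DOUBLE
// column requires a decimal -> double cast even though the value is exact.
spark.sql("SELECT 10.0").schema.head.dataType   // DecimalType(3,1)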

@@ -59,6 +59,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
TestHive.setConf(SQLConf.IN_MEMORY_PARTITION_PRUNING, true)
// Ensures that cross joins are enabled so that we can test them
TestHive.setConf(SQLConf.CROSS_JOINS_ENABLED, true)
// Force type coercion during table insertion
TestHive.setConf(SQLConf.LEGACY_INSERT_TABLE_TYPE_COERCION, true)

@rdblue (Contributor) commented Jun 5, 2019

Why is this required?

@gengliangwang (Author, Contributor) commented Jun 6, 2019

Not sure about that. Actually, I don't know how to run this test suite.

@cloud-fan (Contributor) commented Jun 9, 2019

This test is to make sure Spark has the same behavior as Hive; however, Hive always adds unsafe casts, so we need to turn on this config to make the tests pass.

@rdblue (Contributor) commented Jun 12, 2019

Hive automatically inserts unsafe casts?

@@ -374,15 +374,15 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
sql(s"CREATE TABLE test_partition (a STRING) PARTITIONED BY (b BIGINT, c STRING)")
sql(s"CREATE TABLE ptest (a STRING, b BIGINT, c STRING)")

val analyzedPlan = sql(
val optimizedPlan = sql(

@rdblue (Contributor) commented Jun 5, 2019

Why does this use the optimized plan now? So that checkAnalysis will run?

@gengliangwang (Author, Contributor) commented Jun 6, 2019

The Cast operator is eliminated by the optimization rule ConstantFolding.
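
A small illustration of why the assertion has to look at a later plan (assuming a SparkSession named spark; not code from this PR):

val df = spark.sql("SELECT CAST('5' AS BIGINT)")
df.queryExecution.analyzed        // the analyzed plan still contains a Cast expression
df.queryExecution.optimizedPlan   // ConstantFolding has collapsed the Cast into a literal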

@rdblue (Contributor) commented Jun 12, 2019

Why not update the test case with the Cast operator?

@SparkQA commented Jun 6, 2019

Test build #106230 has finished for PR 24806 at commit 85fa370.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
@gengliangwang (Contributor, Author) commented Jun 6, 2019

retest this please.

@SparkQA commented Jun 6, 2019

Test build #106233 has finished for PR 24806 at commit 85fa370.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Jun 6, 2019

Test build #106241 has finished for PR 24806 at commit d9e7e4e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-27856][SQL] Only allow type upcasting when inserting table [WIP][SPARK-27856][SQL] Only allow type upcasting when inserting table Jun 10, 2019

@SparkQA commented Jun 10, 2019

Test build #106354 has finished for PR 24806 at commit 75b3bf4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Jun 11, 2019

Test build #106388 has finished for PR 24806 at commit f182938.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@@ -89,6 +90,7 @@ case class DecimalType(precision: Int, scale: Int) extends FractionalType {
      (precision - scale) <= (dt.precision - dt.scale) && scale <= dt.scale
    case dt: IntegralType =>
      isTighterThan(DecimalType.forType(dt))
    // For DoubleType/FloatType, the value can be NaN, PositiveInfinity or NegativeInfinity.

@cloud-fan (Contributor) commented Jun 12, 2019

For isWiderThan, yes, it's not safe to cast float/double to decimal because of NaN and the infinities.

For isTighterThan, I think it's safe to cast decimal to float/double if the precision doesn't exceed what the target type can represent?

@gengliangwang (Author, Contributor) commented Jun 12, 2019

> For isTighterThan, I think it's safe to cast decimal to float/double if the precision doesn't exceed what the target type can represent?

Yes, it is. I was about to push a commit to fix the tests.
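
A hedged sketch of the idea under discussion (illustrative only, not the committed change): a decimal fits exactly into float/double when its precision does not exceed the target's significand.

import org.apache.spark.sql.types.{DataType, DecimalType, DoubleType, FloatType}

// Hypothetical helper (not the PR's code): roughly 7 significant decimal digits
// fit in a float significand and 15 in a double significand.
def decimalFitsIn(dec: DecimalType, target: DataType): Boolean = target match {
  case DoubleType => dec.precision <= 15
  case FloatType  => dec.precision <= 7
  case _          => false
}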

@@ -290,15 +290,14 @@ abstract class DataSourceV2AnalysisSuite extends AnalysisTest {
StructField("y", DoubleType))).toAttributes)

val query = TestRelation(StructType(Seq(
StructField("x", DoubleType),
StructField("x", FloatType),

@cloud-fan (Contributor) commented Jun 12, 2019

why this change?

@gengliangwang (Author, Contributor) commented Jun 12, 2019

In this test case, the error message ("Cannot safely cast", "'x'", "DoubleType to FloatType") is gone after the code changes, so I think we should make the test case simpler. The nullability error on the x column is enough.

@rdblue (Contributor) commented Jun 12, 2019

No, all errors should be reported: that's exactly what this test case is validating.

Attempting to write a double to a float column is an error that should be shown, even if there is a nullability error for the column as well.

@@ -440,16 +439,15 @@ abstract class DataSourceV2AnalysisSuite extends AnalysisTest {
StructField("y", DoubleType))).toAttributes)

val query = TestRelation(StructType(Seq(
StructField("x", DoubleType),
StructField("x", FloatType),

@cloud-fan (Contributor) commented Jun 12, 2019

ditto

@@ -230,11 +230,12 @@ class EncoderResolutionSuite extends PlanTest {
castSuccess[Long, String]
castSuccess[Int, java.math.BigDecimal]
castSuccess[Long, java.math.BigDecimal]
castSuccess[Double, java.math.BigDecimal]

@cloud-fan (Contributor) commented Jun 12, 2019

How is this supported?

case (DateType, TimestampType) => true
case (NullType, _) => false

@cloud-fan (Contributor) commented Jun 12, 2019

Why can't we upcast null to other nullable types? I think it's pretty common to write INSERT INTO tbl VALUES (1, null).

@gengliangwang (Author, Contributor) commented Jun 12, 2019

Here we can't know the nullability of the target type, so we should treat it as not nullable.
The case you mentioned is handled in https://github.com/apache/spark/pull/24806/files#diff-86e655772e8f7cab055d2c2451b52275R134.

@rdblue (Contributor) commented Jun 12, 2019

I agree with @cloud-fan that this should be allowed. Nullability is an additional check, but the types are compatible.
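
A minimal sketch of the behaviour being described here (hypothetical helpers, not this PR's code; Cast.canUpCast stands in for whatever up-cast check the PR ends up using): NullType is treated as type-compatible with any target, and nullability is reported by a separate check.

import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast}
import org.apache.spark.sql.types.{DataType, NullType}

// Type compatibility alone: a null input is compatible with any target type.
def typeCompatible(from: DataType, to: DataType): Boolean = (from, to) match {
  case (NullType, _) => true
  case _             => Cast.canUpCast(from, to)
}

// The separate nullability check: writing a possibly-null input into a
// non-nullable output column is still an error.
def canWriteColumn(input: Attribute, output: Attribute): Boolean =
  typeCompatible(input.dataType, output.dataType) && (output.nullable || !input.nullable)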

@gengliangwang (Contributor, Author) commented Jun 12, 2019

As discussed in #24849, we can't upcast decimal to double/float.
I think we might have to close this one; then we can't do the upcast-only check in the V2 write path either.

Or, we can special-case the conversion of decimal to double/float in DataType.canWrite.

There may be other solutions. What do you think? @cloud-fan @rdblue @gatorsmile
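
A hedged sketch of the second option (hypothetical helper, not a change this PR actually contains): allow decimal -> double/float during write validation and defer everything else to the up-cast check.

import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.types.{DataType, DecimalType, DoubleType, FloatType}

// Special-case decimal -> double/float so that literal inserts like
// INSERT INTO t VALUES (10.0) keep working, and use the up-cast rule otherwise.
def canWriteType(from: DataType, to: DataType): Boolean = (from, to) match {
  case (_: DecimalType, DoubleType | FloatType) => true
  case _                                        => Cast.canUpCast(from, to)
}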

@SparkQA commented Jun 12, 2019

Test build #106408 has finished for PR 24806 at commit 3e29491.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.