
[SPARK-33641][SQL] Invalidate new char/varchar types in public APIs that produce incorrect results #30586

Closed · wants to merge 15 commits

Conversation

@yaooqinn (Member) commented Dec 3, 2020

### What changes were proposed in this pull request?

In this PR, we propose to narrow the use cases of the char/varchar data types, some of which are invalid now or will be later.

### Why are the changes needed?

1. udf

```scala
scala> spark.udf.register("abcd", () => "12345", org.apache.spark.sql.types.VarcharType(2))

scala> spark.sql("select abcd()").show
scala.MatchError: CharType(2) (of class org.apache.spark.sql.types.VarcharType)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
  at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1741)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
  at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:611)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:606)
  ... 47 elided
```
2. spark.createDataFrame

```
scala> spark.createDataFrame(spark.read.text("README.md").rdd, new org.apache.spark.sql.types.StructType().add("c", "char(1)")).show
+--------------------+
|                   c|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Structured St...|
|                    |
|<https://spark.ap...|
|                    |
|[![Jenkins Build]...|
|[![AppVeyor Build...|
|[![PySpark Covera...|
|                    |
|                    |
```
3. reader.schema

```
scala> spark.read.schema("a varchar(2)").text("./README.md").show(100)
+--------------------+
|                   a|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
```
4. etc.
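
For illustration, a hedged sketch of the intended behavior after this change: these APIs fail fast instead of silently producing wrong results. The exact failure point is an assumption made for illustration; the error message is the one this PR adds, and `spark.sql.legacy.charVarcharAsString` restores the old behavior.

```scala
// Hedged sketch, not verbatim output: char/varchar outside a table schema is
// now rejected up front unless spark.sql.legacy.charVarcharAsString is true.
scala> spark.read.schema("a varchar(2)").text("./README.md").show(100)
org.apache.spark.sql.AnalysisException: char/varchar type can only be used in the table schema
```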

### Does this PR introduce _any_ user-facing change?

No, we intend to avoid a potential breaking change.

### How was this patch tested?

New tests.

@github-actions bot added the SQL label on Dec 3, 2020
@SparkQA commented Dec 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36703/

@SparkQA commented Dec 3, 2020

Test build #132103 has finished for PR 30586 at commit 183db1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36703/

@yaooqinn (Member, Author) commented Dec 3, 2020

cc @cloud-fan @maropu thanks~

@SparkQA commented Dec 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36735/

@SparkQA commented Dec 3, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36735/

@SparkQA commented Dec 3, 2020

Test build #132134 has finished for PR 30586 at commit 9f24463.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
 */
def failWithCharLikeTypes(schema: StructType): StructType = {
  if (schema.exists(f => hasCharVarchar(f.dataType) ||
      f.metadata.contains(CHAR_VARCHAR_TYPE_STRING_METADATA_KEY))) {
```
Contributor: when do we need to check the metadata?

Contributor: We can assume that users won't set the very internal __CHAR_VARCHAR_TYPE_STRING metadata key, and only check with the explicit char/varchar type.
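For reference, a minimal sketch of what an explicit-type-only check could look like; this is an assumed shape for illustration, not the verbatim Spark source:

```scala
import org.apache.spark.sql.types._

// Recursively look for explicit CharType/VarcharType nodes only; the internal
// __CHAR_VARCHAR_TYPE_STRING metadata key is deliberately not consulted.
def hasCharVarchar(dt: DataType): Boolean = dt match {
  case _: CharType | _: VarcharType => true
  case s: StructType => s.exists(f => hasCharVarchar(f.dataType))
  case a: ArrayType => hasCharVarchar(a.elementType)
  case m: MapType => hasCharVarchar(m.keyType) || hasCharVarchar(m.valueType)
  case _ => false
}
```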

```scala
 * @param dt the given struct type
 * @return
 */
def failOrWarnWithCharLikeType(dt: DataType, fail: Boolean = true): DataType = {
```
Contributor: let's have 2 methods instead of using a boolean flag.

```scala
def failIfHasCharVarchar(dt: DataType)...
def warnIfHasCharVarchar(dt: DataType)...
```

@SparkQA commented Dec 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36745/

@SparkQA commented Dec 3, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36745/

@SparkQA commented Dec 3, 2020

Test build #132141 has finished for PR 30586 at commit 148ee63.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2020

Test build #132144 has finished for PR 30586 at commit 6bbebf6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn changed the title from "[SPARK-33641][SQL] Invalidate new char-like type in public APIs that produce incorrect results" to "[SPARK-33641][SQL] Invalidate new char-like types in public APIs that produce incorrect results" on Dec 4, 2020
@SparkQA commented Dec 4, 2020

Test build #132182 has finished for PR 30586 at commit 7ccb99b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36782/

@SparkQA commented Dec 4, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36782/

```diff
-case s: UTF8String if s != null => DataType.fromDDL(s.toString)
+case s: UTF8String if s != null =>
+  val dataType = DataType.fromDDL(s.toString)
+  CharVarcharUtils.replaceCharVarcharWithString(dataType)
```
Contributor: shall we give a warning here?
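For context, a hedged sketch of what `CharVarcharUtils.replaceCharVarcharWithString` plausibly does — map every char/varchar node to string, recursively. This is an assumed shape for illustration, not the verbatim Spark source:

```scala
import org.apache.spark.sql.types._

// Replace char/varchar with string everywhere in a (possibly nested) type.
def replaceCharVarcharWithString(dt: DataType): DataType = dt match {
  case _: CharType | _: VarcharType => StringType
  case ArrayType(et, containsNull) =>
    ArrayType(replaceCharVarcharWithString(et), containsNull)
  case MapType(kt, vt, valueContainsNull) =>
    MapType(replaceCharVarcharWithString(kt),
      replaceCharVarcharWithString(vt), valueContainsNull)
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = replaceCharVarcharWithString(f.dataType))))
  case other => other
}
```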

@SparkQA commented Dec 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36792/

@SparkQA commented Dec 4, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36792/

@SparkQA commented Dec 4, 2020

Test build #132201 has finished for PR 30586 at commit f83c4af.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36863/

@SparkQA commented Dec 5, 2020

Test build #132262 has finished for PR 30586 at commit 2ef9a88.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OneVsRest(Estimator, _OneVsRestParams, HasParallelism, MLReadable, MLWritable):
  • class OneVsRestReader(MLReader):
  • class OneVsRestWriter(MLWriter):
  • class OneVsRestModel(Model, _OneVsRestParams, MLReadable, MLWritable):
  • class OneVsRestModelReader(MLReader):
  • class OneVsRestModelWriter(MLWriter):
  • class CrossValidatorReader(MLReader):
  • class CrossValidatorWriter(MLWriter):
  • class CrossValidatorModelReader(MLReader):
  • class CrossValidatorModelWriter(MLWriter):
  • class TrainValidationSplitReader(MLReader):
  • class TrainValidationSplitWriter(MLWriter):
  • class TrainValidationSplitModelReader(MLReader):
  • class TrainValidationSplitModelWriter(MLWriter):
  • case class TemporaryViewRelation(tableMeta: CatalogTable) extends LeafNode
  • case class CreateNamedStruct(children: Seq[Expression]) extends Expression with NoThrow
  • case class DateFromUnixDate(child: Expression) extends UnaryExpression
  • case class UnixDate(child: Expression) extends UnaryExpression
  • case class UnixSeconds(child: Expression) extends TimestampToLongBase
  • case class UnixMillis(child: Expression) extends TimestampToLongBase
  • case class UnixMicros(child: Expression) extends TimestampToLongBase
  • case class MakeDate(
  • class JDBCTableCatalog extends TableCatalog with SupportsNamespaces with Logging
  • class StreamingQueryListenerBus(sparkListenerBus: Option[LiveListenerBus])
  • case class StateSchemaNotCompatible(message: String) extends Exception(message)
  • class StateSchemaCompatibilityChecker(
  • class StreamingQueryHistoryServerPlugin extends AppHistoryServerPlugin
  • class StreamingQueryStatusStore(store: KVStore)

@SparkQA commented Dec 5, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36863/

@yaooqinn (Member, Author) commented Dec 6, 2020

retest this please

@SparkQA commented Dec 6, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36890/

@SparkQA commented Dec 6, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36890/

@SparkQA commented Dec 6, 2020

Test build #132289 has finished for PR 30586 at commit 2ef9a88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OneVsRest(Estimator, _OneVsRestParams, HasParallelism, MLReadable, MLWritable):
  • class OneVsRestReader(MLReader):
  • class OneVsRestWriter(MLWriter):
  • class OneVsRestModel(Model, _OneVsRestParams, MLReadable, MLWritable):
  • class OneVsRestModelReader(MLReader):
  • class OneVsRestModelWriter(MLWriter):
  • class CrossValidatorReader(MLReader):
  • class CrossValidatorWriter(MLWriter):
  • class CrossValidatorModelReader(MLReader):
  • class CrossValidatorModelWriter(MLWriter):
  • class TrainValidationSplitReader(MLReader):
  • class TrainValidationSplitWriter(MLWriter):
  • class TrainValidationSplitModelReader(MLReader):
  • class TrainValidationSplitModelWriter(MLWriter):
  • case class TemporaryViewRelation(tableMeta: CatalogTable) extends LeafNode
  • case class CreateNamedStruct(children: Seq[Expression]) extends Expression with NoThrow
  • case class DateFromUnixDate(child: Expression) extends UnaryExpression
  • case class UnixDate(child: Expression) extends UnaryExpression
  • case class UnixSeconds(child: Expression) extends TimestampToLongBase
  • case class UnixMillis(child: Expression) extends TimestampToLongBase
  • case class UnixMicros(child: Expression) extends TimestampToLongBase
  • case class MakeDate(
  • class JDBCTableCatalog extends TableCatalog with SupportsNamespaces with Logging
  • class StreamingQueryListenerBus(sparkListenerBus: Option[LiveListenerBus])
  • case class StateSchemaNotCompatible(message: String) extends Exception(message)
  • class StateSchemaCompatibilityChecker(
  • class StreamingQueryHistoryServerPlugin extends AppHistoryServerPlugin
  • class StreamingQueryStatusStore(store: KVStore)

```scala
 */
def failIfHasCharVarchar(dt: DataType): DataType = {
  if (!SQLConf.get.charVarcharAsString && hasCharVarchar(dt)) {
    throw new AnalysisException(s"char/varchar type can only be used in the table schema")
```
Member: nit: remove the s at the beginning.

```scala
val LEGACY_CHAR_VARCHAR_AS_STRING =
  buildConf("spark.sql.legacy.charVarcharAsString")
    .internal()
    .doc("When true, the parser will not fail to parse char and varchar type but treat it as" +
```
Member: It seems this config is not related only to the parser because failIfHasCharVarchar is used in many places. Could you update it accordingly?

Member (Author): It seems that they are all related to the parser, and char/varchar actually shouldn't exist after the parser phase.

Member: Ah, ok. If so, how about spark.sql. -> spark.sql.parser?

Contributor: for example, spark.read.schema(StructType) has no parser involved.
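Whatever the final name, usage would look like this — a hedged sketch based on the config doc above; the printed schema is an assumption:

```scala
// With the legacy flag on, char/varchar in non-table-schema APIs is accepted
// and silently treated as string, restoring the pre-3.1 behavior.
spark.conf.set("spark.sql.legacy.charVarcharAsString", "true")
spark.read.schema("a varchar(2)").text("./README.md").printSchema()
// root
//  |-- a: string (nullable = true)
```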

```scala
 */
def replaceCharVarcharWithStringForCast(dt: DataType): DataType = {
  if (hasCharVarchar(dt)) {
    logWarning("The Spark cast operator does not support char/varchar type and simply treat" +
```
Member: nit: treat -> treats

```scala
def replaceCharVarcharWithStringForCast(dt: DataType): DataType = {
  if (hasCharVarchar(dt)) {
    logWarning("The Spark cast operator does not support char/varchar type and simply treat" +
      " them as string type. Please use string type directly to avoid confusion.")
```
Member: Could you add tests for this warning msg by using LogAppender?
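A hedged sketch of such a test, following the LogAppender/withLogAppender pattern used elsewhere in Spark's test suites; the helper names, the log4j 1.x event accessors, and the surrounding suite (assumed to mix in SharedSparkSession) are assumptions:

```scala
// Sketch only: assumes a suite with SharedSparkSession and Spark's test-side
// LogAppender helper; asserts the cast path emitted the char/varchar warning.
test("SPARK-33641: warn when casting to char/varchar") {
  val logAppender = new LogAppender("char/varchar cast")
  withLogAppender(logAppender) {
    sql("SELECT CAST('12345' AS CHAR(2))").collect()
  }
  assert(logAppender.loggingEvents.exists(
    _.getRenderedMessage.contains("char/varchar type")))
}
```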

```scala
 */
def failIfHasCharVarchar(dt: DataType): DataType = {
  if (!SQLConf.get.charVarcharAsString && hasCharVarchar(dt)) {
    throw new AnalysisException("char/varchar type can only be used in the table schema")
```
Contributor: can we mention the config name?
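One way to mention it — a hedged sketch; the merged wording may differ:

```scala
// Sketch: reference the legacy flag via its ConfigEntry key so the message
// stays in sync with the config definition.
throw new AnalysisException(
  "char/varchar type can only be used in the table schema. You can set " +
  s"${SQLConf.LEGACY_CHAR_VARCHAR_AS_STRING.key} to true, so that Spark " +
  "treats it as string type, as in Spark 3.0 and earlier.")
```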

```scala
 * warning message if it has char or varchar types
 */
def replaceCharVarcharWithStringForCast(dt: DataType): DataType = {
  if (hasCharVarchar(dt)) {
```
Contributor: shall we skip warning if the config charVarcharAsString is true?
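If so, the guard might mirror failIfHasCharVarchar — a sketch of the reviewer's suggestion, not the merged code; the trailing delegation to replaceCharVarcharWithString is an assumption:

```scala
def replaceCharVarcharWithStringForCast(dt: DataType): DataType = {
  if (!SQLConf.get.charVarcharAsString && hasCharVarchar(dt)) {
    // Only warn when the legacy flag has not opted into the old behavior.
    logWarning("The Spark cast operator does not support char/varchar type and simply treats" +
      " them as string type. Please use string type directly to avoid confusion.")
  }
  replaceCharVarcharWithString(dt)
}
```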

@cloud-fan (Contributor) left a review comment:

LGTM except some minor comments.

@SparkQA commented Dec 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36934/

@SparkQA commented Dec 7, 2020

Test build #132319 has finished for PR 30586 at commit 1f47ce0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36934/

@yaooqinn (Member, Author) commented Dec 7, 2020

retest this please

@SparkQA commented Dec 7, 2020

Test build #132345 has finished for PR 30586 at commit 10544a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor): retest this please

@SparkQA commented Dec 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36945/

@SparkQA commented Dec 7, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36945/

@SparkQA commented Dec 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36953/

@SparkQA commented Dec 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36953/

@SparkQA commented Dec 7, 2020

Test build #132334 has finished for PR 30586 at commit 10544a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor): thanks, merging to master/3.1 as it's necessary for the char/varchar support.

@cloud-fan closed this in da72b87 on Dec 7, 2020
cloud-fan pushed a commit that referenced this pull request Dec 7, 2020
[SPARK-33641][SQL] Invalidate new char/varchar types in public APIs that produce incorrect results


Closes #30586 from yaooqinn/SPARK-33641.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit da72b87)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan (Contributor): After a second thought, although char/varchar is kind of hidden before 3.1, it's still possible for users to use it already. @yaooqinn can we create a followup PR to add a migration guide?

@SparkQA commented Dec 7, 2020

Test build #132353 has finished for PR 30586 at commit 10544a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member, Author) commented Dec 7, 2020

> After a second thought, although char/varchar is kind of hidden before 3.1, it's still possible for users to use it already. @yaooqinn can we create a followup PR to add a migration guide?

OK~
