[SPARK-33641][SQL] Invalidate new char/varchar types in public APIs that produce incorrect results #30586
Conversation
Kubernetes integration test starting
Test build #132103 has finished for PR 30586 at commit
Kubernetes integration test status failure
cc @cloud-fan @maropu thanks~
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #132134 has finished for PR 30586 at commit
```scala
 */
def failWithCharLikeTypes(schema: StructType): StructType = {
  if (schema.exists(f => hasCharVarchar(f.dataType) ||
      f.metadata.contains(CHAR_VARCHAR_TYPE_STRING_METADATA_KEY))) {
```
when do we need to check the metadata?
We can assume that users won't set the very internal __CHAR_VARCHAR_TYPE_STRING metadata key, and only check for the explicit char/varchar types.
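For illustration, a check that only looks at explicit char/varchar types and ignores the metadata key could be as simple as the following sketch (hasExplicitCharVarchar is a hypothetical name, not the PR's helper):

```scala
import org.apache.spark.sql.types._

// Sketch only: walks a DataType and reports whether an explicit CharType or
// VarcharType appears anywhere, ignoring the internal metadata key entirely.
def hasExplicitCharVarchar(dt: DataType): Boolean = dt match {
  case _: CharType | _: VarcharType => true
  case s: StructType => s.fields.exists(f => hasExplicitCharVarchar(f.dataType))
  case a: ArrayType => hasExplicitCharVarchar(a.elementType)
  case m: MapType => hasExplicitCharVarchar(m.keyType) || hasExplicitCharVarchar(m.valueType)
  case _ => false
}
```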
```scala
 * @param dt the given struct type
 * @return
 */
def failOrWarnWithCharLikeType(dt: DataType, fail: Boolean = true): DataType = {
```
let's have 2 methods instead of using a boolean flag:
`def failIfHasCharVarchar(dt: DataType)...`
`def warnIfHasCharVarchar(dt: DataType)...`
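A hedged sketch of that split (illustrative only; it leans on the hasCharVarchar / replaceCharVarcharWithString helpers and the charVarcharAsString conf added in this PR, and the CharVarcharUtils import path is an assumption):

```scala
import org.apache.spark.internal.Logging
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.util.CharVarcharUtils
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.DataType

object CharVarcharChecks extends Logging {
  // Variant for places where char/varchar must be rejected outright.
  def failIfHasCharVarchar(dt: DataType): DataType = {
    if (!SQLConf.get.charVarcharAsString && CharVarcharUtils.hasCharVarchar(dt)) {
      throw new AnalysisException("char/varchar type can only be used in the table schema")
    }
    CharVarcharUtils.replaceCharVarcharWithString(dt)
  }

  // Variant for places where char/varchar is tolerated but downgraded to string with a warning.
  def warnIfHasCharVarchar(dt: DataType): DataType = {
    if (CharVarcharUtils.hasCharVarchar(dt)) {
      logWarning("char/varchar type is not supported here and will be treated as string type.")
    }
    CharVarcharUtils.replaceCharVarcharWithString(dt)
  }
}
```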
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #132141 has finished for PR 30586 at commit
Test build #132144 has finished for PR 30586 at commit
Test build #132182 has finished for PR 30586 at commit
Kubernetes integration test starting
Kubernetes integration test status success
```diff
-case s: UTF8String if s != null => DataType.fromDDL(s.toString)
+case s: UTF8String if s != null =>
+  val dataType = DataType.fromDDL(s.toString)
+  CharVarcharUtils.replaceCharVarcharWithString(dataType)
```
shall we give a warning here?
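If a warning were added, the new branch could behave roughly like this self-contained sketch (the object name, import paths, and warning text are illustrative; it leans on the hasCharVarchar / replaceCharVarcharWithString helpers from this PR):

```scala
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util.CharVarcharUtils
import org.apache.spark.sql.types.DataType
import org.apache.spark.unsafe.types.UTF8String

object SchemaStringParsing extends Logging {
  // Parse a DDL schema string and downgrade char/varchar to string, warning when it happens.
  def parseAndReplace(s: UTF8String): DataType = {
    val dataType = DataType.fromDDL(s.toString)
    if (CharVarcharUtils.hasCharVarchar(dataType)) {
      logWarning("char/varchar in the schema string is not supported here and is read as string type.")
    }
    CharVarcharUtils.replaceCharVarcharWithString(dataType)
  }
}
```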
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #132201 has finished for PR 30586 at commit
Kubernetes integration test starting
Test build #132262 has finished for PR 30586 at commit
Kubernetes integration test status failure
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #132289 has finished for PR 30586 at commit
```scala
 */
def failIfHasCharVarchar(dt: DataType): DataType = {
  if (!SQLConf.get.charVarcharAsString && hasCharVarchar(dt)) {
    throw new AnalysisException(s"char/varchar type can only be used in the table schema")
```
nit: remove the `s` at the beginning.
```scala
val LEGACY_CHAR_VARCHAR_AS_STRING =
  buildConf("spark.sql.legacy.charVarcharAsString")
    .internal()
    .doc("When true, the parser will not fail to parse char and varchar type but treat it as" +
```
It seems this config is not related only to the parser, because `failIfHasCharVarchar` is used in many places. Could you update it accordingly?
seems that they are all related to the parser and actually char/varchar shouldn't exist after the parser phase
Ah, ok. If so, how about `spark.sql.` -> `spark.sql.parser`?
for example, `spark.read.schema(StructType)` has no parser involved.
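Given that, the doc text could be broadened beyond the parser. One possible wording, shown as it might sit inside SQLConf (the version/type/default lines are assumptions added for completeness):

```scala
val LEGACY_CHAR_VARCHAR_AS_STRING =
  buildConf("spark.sql.legacy.charVarcharAsString")
    .internal()
    // Illustrative rewording: covers DataFrameReader.schema, UDF return types, CAST, etc.,
    // not just the SQL parser.
    .doc("When true, Spark will not fail when char/varchar types are used outside a table " +
      "schema, e.g. in DataFrameReader.schema, UDF return types or CAST, and will treat " +
      "them as string type instead.")
    .version("3.1.0")
    .booleanConf
    .createWithDefault(false)
```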
```scala
 */
def replaceCharVarcharWithStringForCast(dt: DataType): DataType = {
  if (hasCharVarchar(dt)) {
    logWarning("The Spark cast operator does not support char/varchar type and simply treat" +
```
nit: treat -> treats
```scala
def replaceCharVarcharWithStringForCast(dt: DataType): DataType = {
  if (hasCharVarchar(dt)) {
    logWarning("The Spark cast operator does not support char/varchar type and simply treat" +
      " them as string type. Please use string type directly to avoid confusion.")
```
Could you add tests for this warning msg by using LogAppender?
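A rough sketch of such a test, assuming the LogAppender / withLogAppender helpers available to SparkFunSuite-based suites and a shared SQL session; the helper signatures and the exact triggering query are assumptions:

```scala
test("SPARK-33641: cast to char/varchar logs a warning") {
  // Capture log events emitted while the cast is analyzed.
  val logAppender = new LogAppender("char/varchar cast warning")
  withLogAppender(logAppender) {
    sql("SELECT CAST('abc' AS CHAR(5))").collect()
  }
  assert(logAppender.loggingEvents.exists(_.getRenderedMessage.contains(
    "The Spark cast operator does not support char/varchar type")))
}
```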
```scala
 */
def failIfHasCharVarchar(dt: DataType): DataType = {
  if (!SQLConf.get.charVarcharAsString && hasCharVarchar(dt)) {
    throw new AnalysisException("char/varchar type can only be used in the table schema")
```
can we mention the config name?
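For instance, the message could point at the legacy config (illustrative wording only, dropped into the failIfHasCharVarchar shown above):

```scala
// Inside failIfHasCharVarchar, point users at the escape-hatch config:
throw new AnalysisException(
  "char/varchar type can only be used in the table schema. You can set " +
  s"${SQLConf.LEGACY_CHAR_VARCHAR_AS_STRING.key} to true, so that Spark treats it " +
  "as string type, the same as Spark 3.0 and earlier.")
```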
```scala
 * warning message if it has char or varchar types
 */
def replaceCharVarcharWithStringForCast(dt: DataType): DataType = {
  if (hasCharVarchar(dt)) {
```
shall we skip the warning if the config `charVarcharAsString` is true?
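One way to honor that, sketched as a self-contained variant of the method in the diff above (the object name and CharVarcharUtils import path are assumptions):

```scala
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util.CharVarcharUtils
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.DataType

object CastTypeCleanup extends Logging {
  // Sketch: in legacy mode fall back to string silently; otherwise warn before falling back.
  def replaceCharVarcharWithStringForCast(dt: DataType): DataType = {
    if (SQLConf.get.charVarcharAsString) {
      CharVarcharUtils.replaceCharVarcharWithString(dt)
    } else if (CharVarcharUtils.hasCharVarchar(dt)) {
      logWarning("The Spark cast operator does not support char/varchar type and simply treats " +
        "them as string type. Please use string type directly to avoid confusion.")
      CharVarcharUtils.replaceCharVarcharWithString(dt)
    } else {
      dt
    }
  }
}
```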
LGTM except some minor comments.
Kubernetes integration test starting
Test build #132319 has finished for PR 30586 at commit
Kubernetes integration test status success
retest this please
Test build #132345 has finished for PR 30586 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status success
Test build #132334 has finished for PR 30586 at commit
thanks, merging to master/3.1 as it's necessary for the char/varchar support.
[SPARK-33641][SQL] Invalidate new char/varchar types in public APIs that produce incorrect results

### What changes were proposed in this pull request?

In this PR, we propose to narrow the use cases of the char/varchar data types that are invalid now or will be later.

### Why are the changes needed?

1. udf

```scala
scala> spark.udf.register("abcd", () => "12345", org.apache.spark.sql.types.VarcharType(2))

scala> spark.sql("select abcd()").show
scala.MatchError: CharType(2) (of class org.apache.spark.sql.types.VarcharType)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
  at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1741)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
  at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:611)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:606)
  ... 47 elided
```

2. spark.createDataFrame

```
scala> spark.createDataFrame(spark.read.text("README.md").rdd, new org.apache.spark.sql.types.StructType().add("c", "char(1)")).show
+--------------------+
|                   c|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Structured St...|
|                    |
|<https://spark.ap...|
|                    |
|[![Jenkins Build]...|
|[![AppVeyor Build...|
|[![PySpark Covera...|
|                    |
|                    |
```

3. reader.schema

```
scala> spark.read.schema("a varchar(2)").text("./README.md").show(100)
+--------------------+
|                   a|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
```

4. etc

### Does this PR introduce _any_ user-facing change?

NO, we intend to avoid potential breaking changes.

### How was this patch tested?

new tests

Closes #30586 from yaooqinn/SPARK-33641.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit da72b87)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
After a second thought, although char/varchar is kind of hidden before 3.1, it's still possible for users to use it already. @yaooqinn can we create a followup PR to add a migration guide?
Test build #132353 has finished for PR 30586 at commit