New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-47541][SQL] Collated strings in complex types supporting operations reverse, array_join, concat, map #45693
Conversation
@@ -1994,17 +1995,26 @@ case class Slice(x: Expression, start: Expression, length: Expression) | |||
case class ArrayJoin( | |||
array: Expression, | |||
delimiter: Expression, | |||
nullReplacement: Option[Expression]) extends Expression with ExpectsInputTypes { | |||
nullReplacement: Option[Expression]) extends ImplicitCastInputTypes with ExpectsInputTypes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You extend both Parent class and its Child. Remove ExpectsInputTypes.
private val commonType: DataType = { | ||
val elementType = array.dataType.asInstanceOf[ArrayType].elementType | ||
val s = nullReplacement match { | ||
case Some(replacement) => Seq(elementType, delimiter.dataType, replacement.dataType) | ||
case _ => Seq(elementType, delimiter.dataType) | ||
} | ||
TypeCoercion.findWiderCommonType(s).getOrElse(StringType) | ||
} | ||
|
||
override def inputTypes: Seq[AbstractDataType] = if (nullReplacement.isDefined) { | ||
Seq(ArrayType(StringType), StringType, StringType) | ||
Seq(ArrayType(commonType), commonType, commonType) | ||
} else { | ||
Seq(ArrayType(StringType), StringType) | ||
Seq(ArrayType(commonType), commonType) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit too much hustle to get expected input types. I would say these have to not depend on the parameters. Why not use StringTypeAnyCollation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing is. This what you do with findWiderCommonType is not correct. If delimiter has explicit collation, we would like to have that collation. In your case you would collect two different collations, where arrayType.elementType would go first and findWiderCommonType practically does not do anything with delimiter's type, but would keep the first element type.
@@ -2149,7 +2159,7 @@ case class ArrayJoin( | |||
} | |||
} | |||
|
|||
override def dataType: DataType = StringType | |||
override def dataType: DataType = commonType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use some type of pass through where we would say dataType = array.asInstanceOf[ArrayType].elementType? Implicit casting should solve the problem of collations.
override def inputTypes: Seq[AbstractDataType] = { | ||
val arrayType = array.dataType.asInstanceOf[ArrayType].elementType match { | ||
case _: StringType => array.dataType | ||
case _ => ArrayType(StringType) | ||
} | ||
if (nullReplacement.isDefined) { | ||
Seq(arrayType, StringTypeAnyCollation, StringTypeAnyCollation) | ||
} else { | ||
Seq(arrayType, StringTypeAnyCollation) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks better, but I am still concerned that we are setting inputTypes according to inputs that we get. If I understand the logic correctly, it should be independent. I would rather leave this as ArrayType only and then override checkInputTypes to do this check you performed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mihailom-db There are already examples of such checks performed in inputTypes
, here are few:
Lines 59 to 68 in 49b7dbc
override def inputTypes: Seq[AbstractDataType] = { | |
(left.dataType, right.dataType) match { | |
case (ArrayType(e1, hasNull1), ArrayType(e2, hasNull2)) => | |
TypeCoercion.findTightestCommonType(e1, e2) match { | |
case Some(dt) => Seq(ArrayType(dt, hasNull1), ArrayType(dt, hasNull2)) | |
case _ => Seq.empty | |
} | |
case _ => Seq.empty | |
} | |
} |
Lines 2341 to 2350 in 49b7dbc
override def inputTypes: Seq[AbstractDataType] = { | |
(left.dataType, right.dataType) match { | |
case (ArrayType(e1, hasNull), e2) => | |
TypeCoercion.findTightestCommonType(e1, e2) match { | |
case Some(dt) => Seq(ArrayType(dt, hasNull), dt) | |
case _ => Seq.empty | |
} | |
case _ => Seq.empty | |
} | |
} |
I see a potential issue in implicit conversions if I replace it with
ArrayType
without StringType information. Is there some kind of explicit conversion which will override this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrayJoin should be fine with only ArrayType. This is because TypeCoercion rules in FunctionArgumentConversion cast this argument to ArrayType(StringType) and catalyst rules will loop over this and cast it to proper collation if necessary.
@@ -36,7 +36,9 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria | |||
private lazy val keyToIndex = keyType match { | |||
// Binary type data is `byte[]`, which can't use `==` to check equality. | |||
case _: AtomicType | _: CalendarIntervalType | _: NullType | |||
if !keyType.isInstanceOf[BinaryType] => new java.util.HashMap[Any, Int]() | |||
if !keyType.isInstanceOf[BinaryType] && (!keyType.isInstanceOf[StringType] || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's wait for Collation Refactory PR for this. This should be stringType.isBinaryComparable?
Also, can you push StringType
in a separate case
. Relying on order evaluation of || is not a good thing (I don't know if scala standard enforces this). E.g. you can end up with keyType.asInstanceOf[StringType].isBinaryCollation
evaluating first which will throw if this is not a StringType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good to me in combination with #45383
@@ -2002,9 +2003,9 @@ case class ArrayJoin( | |||
this(array, delimiter, Some(nullReplacement)) | |||
|
|||
override def inputTypes: Seq[AbstractDataType] = if (nullReplacement.isDefined) { | |||
Seq(ArrayType(StringType), StringType, StringType) | |||
Seq(ArrayType, StringTypeAnyCollation, StringTypeAnyCollation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if the input is array of int? Before it will fail with type check, but now it seems different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need to override checkInputDataTypes
to check the array element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan There's a special case in FunctionArgumentConversion
which implicitly casts array parameter to array of strings #21620.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I'm not mistaken this behavior was allowed before and it still works as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for one issue
thanks, merging to master! |
### What changes were proposed in this pull request? Addition of abstract arraytype which accepts StringTypeCollated as elementType. Changes in this PR #45693 work for ArrayJoin, but will not work in general for other functions. This PR introduces a change to give an interface for all functions. Merge only after #45693. ### Why are the changes needed? This is needed in order to enable functions to use collated arrays. ### Does this PR introduce _any_ user-facing change? Yes, collation functions will work. ### How was this patch tested? Test for array_join added to `CollationSuite` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45891 from mihailom-db/SPARK-47736. Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @nikolamand-db , @dbatomic and @cloud-fan .
Unfortunately, this broke master
branch ANSI CI.
Please check the following and recover the ANSI CI environment.
$ SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly *CollationSuite -- -z complex"
...
[info] - Support operations on complex types containing collated strings *** FAILED *** (144 milliseconds)
[info] java.lang.RuntimeException: Max iterations (100) reached for batch Resolution, please set 'spark.sql.analyzer.maxIterations' to a larger value.
Since there are many Anyway, thank you all for working on this feature. |
I have a fix for this, will create a followup soon. |
What changes were proposed in this pull request?
Add proper support for complex types containing collated strings in operations reverse, array_join, concat, map (create). Examples:
Why are the changes needed?
To enable listed complex types operations support for collated strings.
Does this PR introduce any user-facing change?
Yes, results of listed complex types operations will return expected results to users.
How was this patch tested?
Added checks to collations suite.
Was this patch authored or co-authored using generative AI tooling?
No.