Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47541][SQL] Collated strings in complex types supporting operations reverse, array_join, concat, map #45693

Closed
wants to merge 9 commits into from

Conversation

nikolamand-db
Copy link
Contributor

What changes were proposed in this pull request?

Add proper support for complex types containing collated strings in operations reverse, array_join, concat, map (create). Examples:

select reverse('abc' collate utf8_binary_lcase);
select reverse(array('a' collate utf8_binary_lcase, 'b' collate utf8_binary_lcase));
select array_join(array('a' collate utf8_binary_lcase, 'b' collate utf8_binary_lcase), ', ' collate utf8_binary_lcase);
select concat('a' collate utf8_binary_lcase, 'b' collate utf8_binary_lcase);
select map('a' collate utf8_binary_lcase, 1, 'A' collate utf8_binary_lcase, 2);

Why are the changes needed?

To enable listed complex types operations support for collated strings.

Does this PR introduce any user-facing change?

Yes, results of listed complex types operations will return expected results to users.

How was this patch tested?

Added checks to collations suite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Mar 25, 2024
@@ -1994,17 +1995,26 @@ case class Slice(x: Expression, start: Expression, length: Expression)
case class ArrayJoin(
array: Expression,
delimiter: Expression,
nullReplacement: Option[Expression]) extends Expression with ExpectsInputTypes {
nullReplacement: Option[Expression]) extends ImplicitCastInputTypes with ExpectsInputTypes {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You extend both Parent class and its Child. Remove ExpectsInputTypes.

Comment on lines 2005 to 2018
private val commonType: DataType = {
val elementType = array.dataType.asInstanceOf[ArrayType].elementType
val s = nullReplacement match {
case Some(replacement) => Seq(elementType, delimiter.dataType, replacement.dataType)
case _ => Seq(elementType, delimiter.dataType)
}
TypeCoercion.findWiderCommonType(s).getOrElse(StringType)
}

override def inputTypes: Seq[AbstractDataType] = if (nullReplacement.isDefined) {
Seq(ArrayType(StringType), StringType, StringType)
Seq(ArrayType(commonType), commonType, commonType)
} else {
Seq(ArrayType(StringType), StringType)
Seq(ArrayType(commonType), commonType)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit too much hustle to get expected input types. I would say these have to not depend on the parameters. Why not use StringTypeAnyCollation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing is. This what you do with findWiderCommonType is not correct. If delimiter has explicit collation, we would like to have that collation. In your case you would collect two different collations, where arrayType.elementType would go first and findWiderCommonType practically does not do anything with delimiter's type, but would keep the first element type.

@@ -2149,7 +2159,7 @@ case class ArrayJoin(
}
}

override def dataType: DataType = StringType
override def dataType: DataType = commonType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use some type of pass through where we would say dataType = array.asInstanceOf[ArrayType].elementType? Implicit casting should solve the problem of collations.

Comment on lines 2005 to 2014
override def inputTypes: Seq[AbstractDataType] = {
val arrayType = array.dataType.asInstanceOf[ArrayType].elementType match {
case _: StringType => array.dataType
case _ => ArrayType(StringType)
}
if (nullReplacement.isDefined) {
Seq(arrayType, StringTypeAnyCollation, StringTypeAnyCollation)
} else {
Seq(arrayType, StringTypeAnyCollation)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better, but I am still concerned that we are setting inputTypes according to inputs that we get. If I understand the logic correctly, it should be independent. I would rather leave this as ArrayType only and then override checkInputTypes to do this check you performed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mihailom-db There are already examples of such checks performed in inputTypes, here are few:

override def inputTypes: Seq[AbstractDataType] = {
(left.dataType, right.dataType) match {
case (ArrayType(e1, hasNull1), ArrayType(e2, hasNull2)) =>
TypeCoercion.findTightestCommonType(e1, e2) match {
case Some(dt) => Seq(ArrayType(dt, hasNull1), ArrayType(dt, hasNull2))
case _ => Seq.empty
}
case _ => Seq.empty
}
}
override def inputTypes: Seq[AbstractDataType] = {
(left.dataType, right.dataType) match {
case (ArrayType(e1, hasNull), e2) =>
TypeCoercion.findTightestCommonType(e1, e2) match {
case Some(dt) => Seq(ArrayType(dt, hasNull), dt)
case _ => Seq.empty
}
case _ => Seq.empty
}
}

I see a potential issue in implicit conversions if I replace it with ArrayType without StringType information. Is there some kind of explicit conversion which will override this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayJoin should be fine with only ArrayType. This is because TypeCoercion rules in FunctionArgumentConversion cast this argument to ArrayType(StringType) and catalyst rules will loop over this and cast it to proper collation if necessary.

@@ -36,7 +36,9 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
private lazy val keyToIndex = keyType match {
// Binary type data is `byte[]`, which can't use `==` to check equality.
case _: AtomicType | _: CalendarIntervalType | _: NullType
if !keyType.isInstanceOf[BinaryType] => new java.util.HashMap[Any, Int]()
if !keyType.isInstanceOf[BinaryType] && (!keyType.isInstanceOf[StringType] ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wait for Collation Refactory PR for this. This should be stringType.isBinaryComparable?

Also, can you push StringType in a separate case. Relying on order evaluation of || is not a good thing (I don't know if scala standard enforces this). E.g. you can end up with keyType.asInstanceOf[StringType].isBinaryCollation evaluating first which will throw if this is not a StringType.

Copy link
Contributor

@mihailom-db mihailom-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me in combination with #45383

@@ -2002,9 +2003,9 @@ case class ArrayJoin(
this(array, delimiter, Some(nullReplacement))

override def inputTypes: Seq[AbstractDataType] = if (nullReplacement.isDefined) {
Seq(ArrayType(StringType), StringType, StringType)
Seq(ArrayType, StringTypeAnyCollation, StringTypeAnyCollation)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the input is array of int? Before it will fail with type check, but now it seems different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to override checkInputDataTypes to check the array element.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan There's a special case in FunctionArgumentConversion which implicitly casts array parameter to array of strings #21620.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I'm not mistaken this behavior was allowed before and it still works as expected.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one issue

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f576b85 Apr 8, 2024
cloud-fan pushed a commit that referenced this pull request Apr 11, 2024
### What changes were proposed in this pull request?
Addition of abstract arraytype which accepts StringTypeCollated as elementType. Changes in this PR #45693 work for ArrayJoin, but will not work in general for other functions. This PR introduces a change to give an interface for all functions. Merge only after #45693.

### Why are the changes needed?
This is needed in order to enable functions to use collated arrays.

### Does this PR introduce _any_ user-facing change?
Yes, collation functions will work.

### How was this patch tested?
Test for array_join added to `CollationSuite`

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45891 from mihailom-db/SPARK-47736.

Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @nikolamand-db , @dbatomic and @cloud-fan .

Unfortunately, this broke master branch ANSI CI.

Please check the following and recover the ANSI CI environment.

$ SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly *CollationSuite -- -z complex"
...
[info] - Support operations on complex types containing collated strings *** FAILED *** (144 milliseconds)
[info]   java.lang.RuntimeException: Max iterations (100) reached for batch Resolution, please set 'spark.sql.analyzer.maxIterations' to a larger value.

@dongjoon-hyun
Copy link
Member

Since there are many Collated-feature patch, I made a following PR to catch this kind of failure more eagerly.

Anyway, thank you all for working on this feature.

@mihailom-db
Copy link
Contributor

I have a fix for this, will create a followup soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants