Skip to content

Conversation

@ilicmarkodb
Copy link
Contributor

@ilicmarkodb ilicmarkodb commented Aug 27, 2024

Description

Changed string comparing from UTF16 to UTF8. This fixes comparison issues around the characters with surrogate pairs.

How was this patch tested?

Tests added to DefaultExpressionEvaluatorSuite.scala

@ilicmarkodb ilicmarkodb changed the title Fix string comparator Changed string comparing from UTF16 to UTF8. Aug 27, 2024
@ilicmarkodb ilicmarkodb changed the title Changed string comparing from UTF16 to UTF8. Changed string comparing from UTF16 to UTF8 Aug 27, 2024
@ilicmarkodb ilicmarkodb changed the title Changed string comparing from UTF16 to UTF8 [KERNEL] Changed string comparing from UTF16 to UTF8 Aug 27, 2024
Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this and fixing it.


import io.delta.golden.GoldenTableUtils.goldenTablePath
import io.delta.kernel.exceptions.{InvalidTableException, KernelException, TableNotFoundException}
import io.delta.kernel.expressions.{Literal, ScalarExpression}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated changes in this file?

Comment on lines 610 to 615
(
ofString("apples"),
ofString("oranges"),
ofString("apples"),
ofNull(StringType.STRING)
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same test case as the one above?

Comment on lines 616 to 621
(
ofString("abc"),
ofString("abcd"),
ofString("abc"),
ofNull(StringType.STRING)
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(
ofString("abc"),
ofString("abcd"),
ofString("abc"),
ofNull(StringType.STRING)
),
(ofString("abc"), ofString("abcd"), ofString("abc"), ofNull(STRING)),

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for other test cases.

ofString("\uD83C\uDF3C"),
ofString("\uFFFD"),
ofNull(StringType.STRING)
// scalastyle:on nonascii
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there more tests we could copy for Spark or Delta-Spark around other surrogate pairs and unsigned char comparison?

@ilicmarkodb ilicmarkodb requested a review from vkorukanti August 28, 2024 09:34
@vkorukanti vkorukanti changed the title [KERNEL] Changed string comparing from UTF16 to UTF8 [Kernel] Change string comparing from UTF16 to UTF8 Aug 28, 2024
@vkorukanti vkorukanti merged commit f468733 into delta-io:master Aug 28, 2024
vkorukanti pushed a commit to vkorukanti/delta that referenced this pull request Aug 30, 2024
Changed string comparing from UTF16 to UTF8. This fixes comparison
issues around the characters with surrogate pairs.

Tests added to `DefaultExpressionEvaluatorSuite.scala`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants