Skip to content

[CALCITE-6146] Target charset should be used when comparing two strings through CONVERT/TRANSLATE function during validation#3558

Merged
ILuffZhe merged 1 commit into
apache:mainfrom
ILuffZhe:CALCITE-6146
Dec 11, 2024
Merged

[CALCITE-6146] Target charset should be used when comparing two strings through CONVERT/TRANSLATE function during validation#3558
ILuffZhe merged 1 commit into
apache:mainfrom
ILuffZhe:CALCITE-6146

Conversation

@ILuffZhe
Copy link
Copy Markdown
Contributor

@ILuffZhe ILuffZhe commented Dec 2, 2023

No description provided.

return type;
}

@Nullable Charset getCharsetAfterConvert(SqlBasicCall call, @Nullable Charset typeCharset) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other functions which should preserve charsets?
For example, if you do string concatenation, or cast from CHAR to VARCHAR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, only CONVERT and TRANSLATE function explicitly use charset.
In MySQL, CAST function can be used like cast(name as CHAR CHARACTER SET utf8), but we don't support such form currently.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was asking about something like CAST(TRANSLATE("name" using BIG5) AS CHAR(5)).
Is the charset of the result of this expression correct?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The char(5) has the charset ISO-8859-1 preserved in SqlCollation, so the charset of CAST(TRANSLATE("name" using BIG5) AS CHAR(5)) should also be ISO-8859-1.
I add a test for this, please take a look.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was asking about something like CAST(TRANSLATE("name" using BIG5) AS CHAR(5)). Is the charset of the result of this expression correct?

The result charset of TRANSLATE("name" using BIG5) is BIG5, so the cast to CHAR(5) with ISO-8859-1 is not allowed.

with.query("select \"name\", \"empid\" from \"hr\".\"emps\"\n"
+ "where cast(convert(\"name\" using GBK) as char(5))=_BIG5'Eric'")
.throws_(
"Cannot apply = to the two different charsets ISO-8859-1 and Big5");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you didn't write this error message, but I think it can be improved.
I would write "Cannot apply operation '=' to strings with different charsets 'ISO-8859-1' and 'Big5'`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried a similar test as a SqlOperatorTest as well, just to make sure that it triggers there as well?
I don't think this change is JDBC specific.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you didn't write this error message, but I think it can be improved. I would write "Cannot apply operation '=' to strings with different charsets 'ISO-8859-1' and 'Big5'`

Agreed, and I've changed the message.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried a similar test as a SqlOperatorTest as well, just to make sure that it triggers there as well? I don't think this change is JDBC specific.

I've added some tests in SqlOperatorTest, however the tests testStringComparisonWithConvertFunc in JdbcTest still fail.
When creating a new JavaType with different charset(by createTypeWithCharsetAndCollation) during validation, the wrapped charset in this JavaType is overwritten through canonizing process due to the DATATYPE_CACHE.
I'm not sure this is a bug or specific designed, and I think considering the charset for JavaType's(java.lang.String) digest computing might be a way to go.

if (SqlTypeUtil.inCharFamily(operandType0)
&& SqlTypeUtil.inCharFamily(operandType1)) {
Charset cs0 = operandType0.getCharset();
if (call.operand(0) instanceof SqlBasicCall) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right place for this check?
Why does the operandType0 have the wrong charset?
Maybe the bug is in the place where the charset for operand0 type was inferred.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I've added inferReturnType for both CONVERT and TRANSLATE function.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Dec 3, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

100.0% 100.0% Coverage
0.0% 0.0% Duplication

@ILuffZhe
Copy link
Copy Markdown
Contributor Author

ILuffZhe commented Dec 5, 2024

@mihaibudiu I mark the PR to draft, and I'll let you know if anything updated in the next few days.

@ILuffZhe ILuffZhe marked this pull request as draft December 5, 2024 01:06
@ILuffZhe ILuffZhe marked this pull request as ready for review December 7, 2024 08:26
Comment thread core/src/test/java/org/apache/calcite/test/JdbcTest.java Outdated
@ILuffZhe ILuffZhe requested a review from mihaibudiu December 8, 2024 02:28
+ "where cast(convert(\"name\" using LATIN1) as char(5))='Eric'")
.returns("name=Eric; empid=200\n");
with.query("select \"name\", \"empid\" from \"hr\".\"emps\"\n"
+ "where cast(convert(\"name\" using GBK) as char(5))='Eric'")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this the expected result?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this the expected result?

The result of convert(\"name\" using GBK) has GBK charset while CHAR(5) has ISO-8859-1 charset, which is not allowed in SqlCastFunction.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not obvious. I would add that as a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've added comment in both JdbcTest and SqlOperatorTest


@Override public RelDataType deriveType(SqlValidator validator,
SqlValidatorScope scope, SqlCall call) {
// special case for TRANSLATE: don't need to derive type for Charsets
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not obvious what this comment refers to. Special compared to what?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got your point, I'll change it to don't need to derive type for Charsets in both CONVERT and TRANSLATE. WDYT?

@Test void testStringComparisonWithConvertFunc() {
final SqlOperatorFixture f = fixture();
f.setFor(SqlStdOperatorTable.CONVERT, VM_JAVA);
f.check("select 'a' as col\n"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example is confusing, because the column name is col and the literal is also col.
Can you change one of them?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example is confusing, because the column name is col and the literal is also col. Can you change one of them?

Sure, I've changed the column alia.

Copy link
Copy Markdown
Contributor

@mihaibudiu mihaibudiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never used these functions myself, but, assuming the semantics implemented is the correct one, this PR looks fine.

Can some of these results be validated on another DB?

@ILuffZhe
Copy link
Copy Markdown
Contributor Author

ILuffZhe commented Dec 9, 2024

I have never used these functions myself, but, assuming the semantics implemented is the correct one, this PR looks fine.

Can some of these results be validated on another DB?

Thank you for the review! @mihaibudiu

In earlier's PR, we've tested CONVERT(mysql) and TRANSLATE(bigquery) in function.iq , and you can take a look if it's satisfied.

BTW, I'll try to squash the commits if no more new comments come.

…gs through CONVERT/TRANSLATE function during validation
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Dec 9, 2024

@ILuffZhe ILuffZhe added the LGTM-will-merge-soon Overall PR looks OK. Only minor things left. label Dec 9, 2024
@ILuffZhe ILuffZhe merged commit 4948817 into apache:main Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

LGTM-will-merge-soon Overall PR looks OK. Only minor things left.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants