
[SPARK-47462][SQL] Align mappings of other unsigned numeric types with TINYINT in MySQLDialect #45588

Closed
yaooqinn wants to merge 3 commits into apache:master from yaooqinn:SPARK-47462

Conversation

@yaooqinn
Member

What changes were proposed in this pull request?

Align mappings of other unsigned numeric types with TINYINT in MySQLDialect. TINYINT maps to ByteType and TINYINT UNSIGNED maps to ShortType.

In this PR, we

  • map SMALLINT to ShortType and SMALLINT UNSIGNED to IntegerType. Without this, both map to IntegerType
  • map MEDIUMINT UNSIGNED to IntegerType, and leave MEDIUMINT as-is. Without this, MEDIUMINT UNSIGNED uses LongType

Other signed/unsigned types remain unchanged; for those, this PR only improves test coverage.
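The widening rule above can be sketched as a small lookup (a hypothetical model for illustration only; the real logic lives in MySQLDialect.getCatalystType and is structured differently): each MySQL integer type maps to the narrowest Catalyst type that holds it, and the UNSIGNED variant widens one step.

```scala
// Hypothetical model of the mapping described in this PR, not the actual
// MySQLDialect code. Catalyst types are represented as strings for brevity.
object MySqlIntMapping {
  def catalystType(mysqlType: String, unsigned: Boolean): String =
    (mysqlType.toUpperCase, unsigned) match {
      case ("TINYINT", false)   => "ByteType"     // -128..127 fits in Byte
      case ("TINYINT", true)    => "ShortType"    // 0..255 does not fit in Byte
      case ("SMALLINT", false)  => "ShortType"    // changed by this PR (was IntegerType)
      case ("SMALLINT", true)   => "IntegerType"  // 0..65535 does not fit in Short
      case ("MEDIUMINT", false) => "IntegerType"  // unchanged (as-is)
      case ("MEDIUMINT", true)  => "IntegerType"  // changed by this PR (was LongType)
      case other                => sys.error(s"not modeled here: $other")
    }
}
```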

Why are the changes needed?

Consistency and efficiency while reading MySQL numeric values

Does this PR introduce any user-facing change?

Yes, the mapping changes described in the first section.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Mar 19, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @yaooqinn. This looks correct and properly aligned with the previous change.

BTW, do you think there is a chance of a regression (or breaking change) due to the table schema change? Although I don't remember exactly, there was an incident before caused by a table schema change like this. I'm worried about that kind of situation.

@yaooqinn
Member Author

yaooqinn commented Mar 20, 2024

Hi @dongjoon-hyun

The regression you mentioned was introduced in SPARK-43049 and undone in SPARK-46478. SPARK-43049 changed the string -> varchar(255) mapping to string -> clob in the Oracle write path to accommodate longer character data. Regrettably, that modification caused a decline in performance.

In this PR, the changes happen in the read path. A table schema change on the Spark side can happen when users perform CTAS against MySQL, i.e. CREATE TABLE abc AS SELECT * FROM a_jdbc_table. The table abc will get a different schema after this PR.

It's important to keep in mind that the results of arithmetic operations can differ based on the type of data that is returned.

Since SPARK-45561 already had such impacts for TINYINT in Spark 3.5.1, it seems okay to extend to other types.
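On the arithmetic point: under two's-complement wrapping, the same value overflows earlier in a narrower type. The snippet below shows plain JVM semantics; Spark's non-ANSI mode wraps similarly, though that equivalence is an assumption worth verifying against your Spark version.

```scala
// Two's-complement wrap-around on the JVM: the same numeric value behaves
// differently depending on the width of the type that holds it.
val maxShort: Short = Short.MaxValue           // 32767
val wrapped: Short  = (maxShort + 1).toShort   // wraps around to Short.MinValue
val widened: Int    = maxShort + 1             // Int has room, so no wrap
```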

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 20, 2024

No, it was a slightly different issue. IIRC, a user read a table and tried to write it back (with overwrite?), and it broke their existing database schema. Their whole backend systems were broken, @yaooqinn.

Maybe we had better add a legacy configuration for this kind of potential schema change.

@yaooqinn
Member Author

Thank you @dongjoon-hyun

For the case that users read/write things in a roundtrip:

  • Before, we read a smallint(db) as int(spark) in getCatalystType, and then we write an IntegerType(spark) to smallint(db) in getJDBCType
  • After, we read a smallint(db) as short(spark) in getCatalystType, and then we write a ShortType(spark) to smallint(db) in getJDBCType

I'm not sure the existing behavior works well; it seems like a bug to me, and we don't have test cases for it.
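The roundtrip can be sketched with two hypothetical lookup functions (the real methods are MySQLDialect's getCatalystType and getJDBCType, and only the SMALLINT case from this discussion is modeled): before this PR the write side re-widened SMALLINT to an int column, after it the column type survives the roundtrip.

```scala
// Hypothetical sketch of the MySQL SMALLINT read/write roundtrip,
// before and after this PR. Types are represented as strings for brevity.
object Roundtrip {
  // Before: SMALLINT reads as IntegerType, which is written back as INT.
  def readBefore(dbType: String): String =
    if (dbType == "SMALLINT") "IntegerType" else dbType
  def writeBefore(catalyst: String): String =
    if (catalyst == "IntegerType") "INT" else catalyst

  // After: SMALLINT reads as ShortType, which is written back as SMALLINT.
  def readAfter(dbType: String): String =
    if (dbType == "SMALLINT") "ShortType" else dbType
  def writeAfter(catalyst: String): String =
    if (catalyst == "ShortType") "SMALLINT" else catalyst
}
```

The before path turns a SMALLINT column into INT on write-back (schema widened), while the after path returns SMALLINT (schema preserved).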

@dongjoon-hyun
Member

Is this correct?

Before, we read a smallint(db) as int(spark) in getCatalystType, and then we write an IntegerType(spark) to smallint(db) in getJDBCType

According to getCommonJDBCType, IntegerType(spark) seems to go to java.sql.Types.INTEGER instead of java.sql.Types.SMALLINT? Maybe I missed something in MySQLDialect?

case IntegerType => Option(JdbcType("INTEGER", java.sql.Types.INTEGER))
case LongType => Option(JdbcType("BIGINT", java.sql.Types.BIGINT))
case DoubleType => Option(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
case FloatType => Option(JdbcType("REAL", java.sql.Types.FLOAT))
case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT))

@yaooqinn
Member Author

It's incorrect; it's like we read a smallint and write back an int.

@dongjoon-hyun
Member

So,

  • Before: SIGNED SMALLINT(DB) -> SIGNED INT(SPARK) -> SIGNED INT(DB)?
  • After: SIGNED SMALLINT(DB) -> SIGNED SHORT(SPARK) -> SIGNED SMALLINT(DB)?

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM (Given that #45588 (comment))

Could you add a migration guide after all these PRs, @yaooqinn?

@dongjoon-hyun
Member

cc @cloud-fan and @HyukjinKwon

@dongjoon-hyun
Member

Thank you, @yaooqinn and @cloud-fan .
Merged to master for Apache Spark 4.0.0.

@yaooqinn yaooqinn deleted the SPARK-47462 branch March 21, 2024 02:10
@yaooqinn
Member Author

Thank you, @dongjoon-hyun and @cloud-fan.

I will send follow-ups for the migration guides.
