
[SPARK-47462][SQL] Align mappings of other unsigned numeric types with TINYINT in MySQLDialect #45588

Closed
yaooqinn wants to merge 3 commits into apache:master from yaooqinn:SPARK-47462

Conversation

@yaooqinn
Member

What changes were proposed in this pull request?

Align mappings of other unsigned numeric types with TINYINT in MySQLDialect. TINYINT maps to ByteType and TINYINT UNSIGNED maps to ShortType.

In this PR, we

  • map SMALLINT to ShortType and SMALLINT UNSIGNED to IntegerType. Without this, both map to IntegerType
  • map MEDIUMINT UNSIGNED to IntegerType, and leave MEDIUMINT as-is. Without this, MEDIUMINT UNSIGNED uses LongType

Other signed/unsigned types remain unchanged; for those, this PR only improves test coverage.
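The widening rule above can be sketched as a small lookup (a hypothetical model for illustration only; the real logic lives in MySQLDialect.getCatalystType and is structured differently): each MySQL integer type maps to the narrowest Catalyst type that holds it, and the UNSIGNED variant widens one step.

```scala
// Hypothetical model of the mapping described in this PR, not the actual
// MySQLDialect code. Catalyst types are represented as strings for brevity.
object MySqlIntMapping {
  def catalystType(mysqlType: String, unsigned: Boolean): String =
    (mysqlType.toUpperCase, unsigned) match {
      case ("TINYINT", false)   => "ByteType"     // -128..127 fits in Byte
      case ("TINYINT", true)    => "ShortType"    // 0..255 does not fit in Byte
      case ("SMALLINT", false)  => "ShortType"    // changed by this PR (was IntegerType)
      case ("SMALLINT", true)   => "IntegerType"  // 0..65535 does not fit in Short
      case ("MEDIUMINT", false) => "IntegerType"  // unchanged (as-is)
      case ("MEDIUMINT", true)  => "IntegerType"  // changed by this PR (was LongType)
      case other                => sys.error(s"not modeled here: $other")
    }
}
```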

Why are the changes needed?

Consistency and efficiency while reading MySQL numeric values

Does this PR introduce any user-facing change?

Yes, the mapping changes described in the first section.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Mar 19, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @yaooqinn. This looks correct and properly aligned with the previous change.

BTW, do you think there is a chance of a regression (or breaking change) due to the table schema change? Although I don't remember exactly, there was an incident before caused by a table schema change like this. I'm worried about that kind of situation.

@yaooqinn
Member Author

yaooqinn commented Mar 20, 2024

Hi @dongjoon-hyun

The regression you mentioned was introduced in SPARK-43049 and undone in SPARK-46478. SPARK-43049 changed the string -> varchar(255) mapping to string -> clob in the Oracle write path to accommodate longer character data. Regrettably, that modification caused a decline in performance.

In this PR, the changes happen in the read path. A table schema change on the Spark side can happen when users perform CTAS against MySQL, i.e. CREATE TABLE abc AS SELECT * FROM a_jdbc_table. The table abc will get a different schema after this PR.

It's important to keep in mind that the results of arithmetic operations can differ based on the type of data that is returned.

Since SPARK-45561 already had such impacts for TINYINT in Spark 3.5.1, it seems okay to extend to other types.
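On the arithmetic point: under two's-complement wrapping, the same value overflows earlier in a narrower type. The snippet below shows plain JVM semantics; Spark's non-ANSI mode wraps similarly, though that equivalence is an assumption worth verifying against your Spark version.

```scala
// Two's-complement wrap-around on the JVM: the same numeric value behaves
// differently depending on the width of the type that holds it.
val maxShort: Short = Short.MaxValue           // 32767
val wrapped: Short  = (maxShort + 1).toShort   // wraps around to Short.MinValue
val widened: Int    = maxShort + 1             // Int has room, so no wrap
```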

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 20, 2024

No, it was a slightly different issue. IIRC, a user read a table and tried to write it back (with overwrite?), and it broke their existing database schema. Their whole backend systems were broken, @yaooqinn.

Maybe we had better add a legacy configuration for this kind of potential schema change.

@yaooqinn
Member Author

Thank you @dongjoon-hyun

For the case that users read/write things in a roundtrip:

  • Before, we read a smallint(db) as int(spark) in getCatalystType, and then we write an IntegerType(spark) to smallint(db) in getJDBCType
  • After, we read a smallint(db) as short(spark) in getCatalystType, and then we write a ShortType(spark) to smallint(db) in getJDBCType

I'm not sure the existing behavior works well; it seems like a bug to me, and we don't have test cases for it.
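The roundtrip can be sketched with two hypothetical lookup functions (the real methods are MySQLDialect's getCatalystType and getJDBCType, and only the SMALLINT case from this discussion is modeled): before this PR the write side re-widened SMALLINT to an int column, after it the column type survives the roundtrip.

```scala
// Hypothetical sketch of the MySQL SMALLINT read/write roundtrip,
// before and after this PR. Types are represented as strings for brevity.
object Roundtrip {
  // Before: SMALLINT reads as IntegerType, which is written back as INT.
  def readBefore(dbType: String): String =
    if (dbType == "SMALLINT") "IntegerType" else dbType
  def writeBefore(catalyst: String): String =
    if (catalyst == "IntegerType") "INT" else catalyst

  // After: SMALLINT reads as ShortType, which is written back as SMALLINT.
  def readAfter(dbType: String): String =
    if (dbType == "SMALLINT") "ShortType" else dbType
  def writeAfter(catalyst: String): String =
    if (catalyst == "ShortType") "SMALLINT" else catalyst
}
```

The before path turns a SMALLINT column into INT on write-back (schema widened), while the after path returns SMALLINT (schema preserved).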

@dongjoon-hyun
Member

Is this correct?

Before, we read a smallint(db) as int(spark) in getCatalystType, and then we write an IntegerType(spark) to smallint(db) in getJDBCType

According to getCommonJDBCType, IntegerType(spark) seems to go to java.sql.Types.INTEGER instead of java.sql.Types.SMALLINT? Maybe I missed something in MySQLDialect?

case IntegerType => Option(JdbcType("INTEGER", java.sql.Types.INTEGER))
case LongType => Option(JdbcType("BIGINT", java.sql.Types.BIGINT))
case DoubleType => Option(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
case FloatType => Option(JdbcType("REAL", java.sql.Types.FLOAT))
case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT))

@yaooqinn
Member Author

It's incorrect; it's like we read a smallint and write back an int.

@dongjoon-hyun
Member

So,

  • Before: SIGNED SMALLINT(DB) -> SIGNED INT(SPARK) -> SIGNED INT(DB)?
  • After: SIGNED SMALLINT(DB) -> SIGNED SHORT(SPARK) -> SIGNED SMALLINT(DB)?

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM (Given that #45588 (comment))

Could you add a migration guide after all these PRs, @yaooqinn?

@dongjoon-hyun
Member

cc @cloud-fan and @HyukjinKwon

@dongjoon-hyun
Member

Thank you, @yaooqinn and @cloud-fan .
Merged to master for Apache Spark 4.0.0.

@yaooqinn yaooqinn deleted the SPARK-47462 branch March 21, 2024 02:10
@yaooqinn
Member Author

Thank you, @dongjoon-hyun and @cloud-fan.

I will send follow-ups for the migration guides.
