
[SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType #36499

Closed
wants to merge 16 commits

Conversation

Eugene-Mark
Contributor

@Eugene-Mark Eugene-Mark commented May 10, 2022

What changes were proposed in this pull request?

  • Implemented getCatalystType method in TeradataDialect
  • Handle Types.NUMERIC explicitly

Why are the changes needed?

When loading a table from Teradata, a column of Teradata Number type is converted to DecimalType(38,0), which loses the fractional part of the original data.

Does this PR introduce any user-facing change?

Yes. The Number type is now converted to DecimalType(38,18) when the reported scale is 0, so the fractional part is preserved.
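
For illustration, here is a hypothetical read of such a table through the JDBC data source (the URL, table name, and credentials are placeholders):

// Hypothetical example: load a Teradata table whose columns are declared as
// NUMBER with no explicit precision/scale.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:teradata://host/DATABASE=test_db")
  .option("dbtable", "test_db.test_table")
  .option("user", "user")
  .option("password", "password")
  .load()

// Before this change: such a column is inferred as DecimalType(38,0), so 1234.5678 is read as 1234.
// After this change:  it is inferred as DecimalType(38,18) and the fractional part is kept.
df.printSchema()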

How was this patch tested?

A unit test is added to JDBCSuite.scala.

@github-actions github-actions bot added the SQL label May 10, 2022
@Eugene-Mark
Contributor Author

@HyukjinKwon @srowen It would be appreciated if this PR could be reviewed when you get a chance. Thanks!

@srowen
Member

srowen commented May 16, 2022

I don't know anything about Teradata. Is it documented that this should be the result, and is it specific to Teradata?

@HyukjinKwon
Member

Yeah, I don't quite follow the change. Why do we need to change precision and scale?

@Eugene-Mark
Contributor Author

Eugene-Mark commented May 17, 2022

@srowen I'm not a Teradata expert either; I just invoke Teradata through Spark's JDBC API and ran into this issue. I couldn't find documentation explaining this behavior on the Teradata side. I printed the metadata in JdbcUtils.scala -> getSchema, which shows that fieldScale is already 0 before it is passed to downstream callers. The metadata comes from the ResultSet that Spark obtains right after executing statement.executeQuery with the query s"SELECT * FROM $table WHERE 1=0".
Maybe it's good enough to give the user a default DecimalType instead of a rounded integer until we find a better explanation of what happens on the Teradata side.
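
For reference, a minimal plain-JDBC sketch (with hypothetical connection details) of how the reported precision and scale can be inspected, using the same WHERE 1=0 trick Spark uses to fetch only the schema:

import java.sql.DriverManager

// Minimal sketch: print the type, precision and scale the Teradata JDBC driver
// reports for each column; this mirrors what JdbcUtils.getSchema receives.
// The URL, table name and credentials below are placeholders.
val conn = DriverManager.getConnection(
  "jdbc:teradata://host/DATABASE=test_db", "user", "password")
try {
  val rs = conn.createStatement().executeQuery(
    "SELECT * FROM test_db.test_table WHERE 1=0")
  val meta = rs.getMetaData
  for (i <- 1 to meta.getColumnCount) {
    println(s"${meta.getColumnName(i)}: type=${meta.getColumnType(i)}, " +
      s"precision=${meta.getPrecision(i)}, scale=${meta.getScale(i)}")
  }
} finally {
  conn.close()
}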

@Eugene-Mark
Contributor Author

Eugene-Mark commented May 17, 2022

@HyukjinKwon SPARK-38846 shows that Teradata's Number type loses its fractional part after being loaded into Spark. We found that the JDBC ResultSetMetaData reports a scale of 0 regardless of how the column is defined on the Teradata side. Inspired by how Spark handles the Oracle ResultSetMetaData inconsistency, I explicitly set the scale in TeradataDialect so that at least a default DecimalType with scale 18 is returned instead of a rounded integer.

@srowen
Member

srowen commented May 17, 2022

OK, I just wonder if this is specific to Teradata, or whether it can be changed elsewhere higher up in the abstraction layers.

But you're saying the scale/precision info is lost in the JDBC connector? Then I wonder what we can do, really; we need to know this to get it right.

@Eugene-Mark
Contributor Author

Eugene-Mark commented May 18, 2022

@srowen @HyukjinKwon I made some progress and things are clearer now. Let me summarize my recent findings.

For Teradata, per their official documentation, the Number type can be declared with implicit or explicit precision/scale:

  • Indicate NUMBER with the system limits for precision and scale:
      NUMBER, or equivalently NUMBER (*)
  • Limit the scale:
      NUMBER (*,m), where m specifies a scale in the range from 0 to 38
  • Limit the precision and scale:
      NUMBER (n,m)
      NUMBER (n), which is equivalent to NUMBER (n,0)

Spark uses resultSet.getMetaData to get the column metadata. However, the scale is lost when it is not explicitly defined on the Teradata side. For example, if we create a table like:

create set table test_db.test_table(id BIGINT, column1 NUMBER, column2 NUMBER) PRIMARY INDEX (id);

Both column1 and column2 lose their scale info, which means the reported scale equals 0. Later, on the getter side, the data is fetched using a schema without scale info, and the final DataFrame loses those columns' fractional parts. But per the Teradata guide, such a Number should use the system limits for precision and scale.

If we instead create the table and explicitly define the precision and scale of the Number type:

create set table test_db.test_table2 (id BIGINT, column1 NUMBER(20,10), column2 NUMBER(38,18)) PRIMARY INDEX (id);

Spark gets the scale info correctly in JdbcUtils.scala, and everything works fine per my manual test.

Things are much clearer now: when we use the query s"SELECT * FROM $table WHERE 1=0" to get the schema info from Teradata, the implicit scale of Number is lost.

By implementing getCatalystType in TeradataDialect, we give those implicit Number columns Spark's SYSTEM_DEFAULT DecimalType, i.e. precision = 38 and scale = 18, so that Spark's downstream getters do not drop the fractional part of the original data.

@srowen To answer your question directly, "is it specific to Teradata?": I'm afraid I can't give you a definite answer, since a Number type exists in many other databases and the behavior varies case by case. I don't have environments to test them one by one; some might return the scale info, some might not. However, with the current findings, we know that getCatalystType can do the job, and each dialect can handle its own corner cases of the Number type the way OracleDialect did.

So to handle the current Teradata issue, I suggest we let users know, via documentation or a log message, that when a Number is created without an explicit scale, Spark will treat it as DecimalType(38,18).

Note: the DecimalType(38,18) simply follows Spark's DecimalType definition val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18). However, I found that OracleDialect uses DecimalType(38, 10) instead. I'm neutral on this; please suggest which one should be used.

@Eugene-Mark
Contributor Author

@HyukjinKwon @srowen I just updated my latest comment with findings about the root cause of the issue and the current solution. Any comments are welcome, thanks!

@srowen
Member

srowen commented May 28, 2022

So if I create a NUMBER in Teradata without a scale, then it uses a system default scale. Do we know what that is?
I'm confused if Teradata doesn't record and return the actual scale used in the driver, because otherwise we have to guess here.
What if I have a NUMBER that is actually scale=0? This would be wrong.
I imagine your change is more correct, but I'm also aware this is a behavior change in a case where it seems like there is no correct answer.

Can a caller work around it in this case with a cast or does that not work?

@Eugene-Mark
Contributor Author

@srowen Thanks for your response. For the first part, "indicate NUMBER with the system limits for precision and scale", we didn't find further explanation. It sounds like the scale and precision are flexible depending on the user's input, but can't exceed the system limit. Since they're flexible, maybe Teradata just returns a scale of 0 to indicate that. (I actually think it would be better to return an invalid value, like -1, so that downstream callers like Spark could handle the case better.)
Until it's fixed on the Teradata side (or maybe never), the question is which behavior can be tolerated in more cases:

  1. A number like 1234.5678 is rounded to 1234 (current behavior)
  2. A number like 1234 is turned into 1234.0000

IMHO, the second option seems more reasonable.

As for "Can a caller work around it in this case with a cast or does that not work?": yes, a cast can be a workaround. However, it forces the user to track the precision and scale of each Number column, which becomes tedious when the query is complex with many columns, and it somewhat defeats the flexibility of the original Number(*) definition.
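
For completeness, one way a caller can apply such a cast at read time is the JDBC data source's customSchema option, which overrides the inferred type per column (a hypothetical example; the URL and column names are placeholders):

// Hypothetical workaround: override the inferred decimal types per column via
// `customSchema` so the fractional part is not truncated while reading.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:teradata://host/DATABASE=test_db")
  .option("dbtable", "test_db.test_table")
  .option("customSchema", "column1 DECIMAL(38, 18), column2 DECIMAL(38, 18)")
  .load()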

Anyway, I agree with you that it's hard to find a "correct" answer here; it's more of a tradeoff, and it also needs to be called out in the documentation.

@srowen
Member

srowen commented Jun 4, 2022

It sounds like the scale is just 'unknown' even on the Teradata side? That doesn't sound right. But then this isn't a Spark issue; or rather, no assumption we make in Spark is any more or less correct, no?

@Eugene-Mark
Contributor Author

For NUMBER(*) on Teradata, the scale is not fixed but adapts to the value; as they say, it's only constrained by the system limit. So the issue for Teradata is how to denote such a scale through JDBC; maybe the Teradata side thinks the value 0 is the best way to denote this flexibility.
The key question is how Spark should reflect the flexible scale from Teradata: round to an integer (current practice) or preserve the fractional part with a specified scale.

@srowen
Member

srowen commented Jun 5, 2022

I see, so we should interpret this as "maximum scale" or something in Spark? That seems OK, and if we're only confident about Teradata, this seems OK. Let's add a note in the release notes for the behavior change.

… for OracleDialect's suggesting default decimal type
@Eugene-Mark
Contributor Author

We do have a "maximum scale" defined in Spark, but it's not suitable for our case: the current MAX_SCALE is 38 and is used for things like boundary protection in Decimal's divide operator.
We need a relatively universal decimal type (one that can cover almost "all" kinds of Number values), and I think the current SYSTEM_DEFAULT is the right candidate given its usage in JavaTypeInference and ScalaReflection, where it's used as the default type for Decimal/Number:
val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18)

I suggest we also modify OracleDialect's default decimal. DecimalType(DecimalType.MAX_PRECISION, 10) is more of a magic type, since there is no clue as to why the scale should be 10. For the sake of consistency, maybe it's better to replace it with DecimalType.SYSTEM_DEFAULT.

@srowen
Member

srowen commented Jun 6, 2022

You're saying, basically, assume scale=18? That seems reasonable.
Or are you saying there needs to be an arbitrary precision type? I don't see how a DB would support that.
I'm hesitant to modify Oracle without knowing why it is how it is, and why it should change.

@Eugene-Mark
Contributor Author

Eugene-Mark commented Jun 6, 2022

Agreed that it's better not to modify the Oracle-related part; I just removed that from the commit.
Yes, I suggest we use scale = 18.
As for precision, when Number(*) or Number is used in Teradata, the precision returned from JDBC is 40, which is larger than Spark's max precision, so I also handled that case explicitly.


override def getCatalystType(
    sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
  if (sqlType == Types.NUMERIC) {
Member

OK, now down to nits. I would use sqlType match { for consistency. Also, return Some(...) when the argument is definitely non-null, like a constant, as in all cases here.

Contributor Author

Good point! Will modify accordingly.

  } else {
    // In Teradata, Number(*, scale) returns size (namely precision) as 40,
    // which conflicts with DecimalType.MAX_PRECISION.
    Option(DecimalType(Math.min(size, DecimalType.MAX_PRECISION), scale.toInt))
Member

What if precision = 40 and scale = 0? Do we need to entertain that possibility, or is scale=0 always going to mean default precision too?

Contributor Author

Thanks for this comment! It points to something more reasonable than my current approach. In Teradata, only Number(*)/Number, Number(*,scale), and Number(precision,scale) are valid forms, which means that when the scale is flexible, the precision returned must be 40. So we don't need to convert every scale = 0 column to the default decimal type; we only need to do it when precision = 40 is detected. That way we respect the user's explicit scale = 0 settings, e.g. Number(20,0) is converted to DecimalType(20,0).
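
A rough sketch of the refined mapping described above, as a TeradataDialect method (a simplified sketch, not necessarily the exact final diff; the scale is read from the metadata passed in by the JDBC layer):

import java.sql.Types
import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

// Sketch: only the "flexible" case, where Teradata reports precision 40 for an
// implicit NUMBER / NUMBER(*), falls back to DecimalType.SYSTEM_DEFAULT; an
// explicit declaration such as NUMBER(20,0) keeps its declared precision/scale.
override def getCatalystType(
    sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
  sqlType match {
    case Types.NUMERIC =>
      val scale = if (md != null) md.build().getLong("scale") else 0L
      if (size > DecimalType.MAX_PRECISION) {
        // Implicit NUMBER / NUMBER(*,m): the driver reports precision 40.
        if (scale == 0) Some(DecimalType.SYSTEM_DEFAULT)
        else Some(DecimalType(DecimalType.MAX_PRECISION, scale.toInt))
      } else {
        Some(DecimalType(size, scale.toInt))
      }
    case _ => None
  }
}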

@Eugene-Mark
Contributor Author

The test failure seems unrelated to the committed code; several recent PRs failed with the same error, like this one.

@Eugene-Mark
Contributor Author

Please kindly help relaunch the tests once the CI issue has been fixed, thanks!

@srowen
Member

srowen commented Jun 13, 2022

I think you have to retrigger on your end. Can you try re-running the jobs, or pushing a dummy empty commit?

@srowen
Member

srowen commented Jun 15, 2022

Hm, I think the doc build error is unrelated

@Eugene-Mark
Contributor Author

It's interesting that the previous commit could pass the test, and some other PRs pass it as well. I will try reverting some changes to see whether I can get it to pass.

@srowen
Member

srowen commented Jun 18, 2022

I think it's spurious and we can ignore it, but let's see one more time.

Member

@srowen srowen left a comment

Huh, well I am not sure why the doc tests are failing. I think it is unrelated, clearly. The last change we need here is a note in the migration guide for 3.4, indicating the change in behavior. I think it is a bug fix, but still non-trivial enough to note as a behavior change.

@github-actions github-actions bot added the DOCS label Jun 19, 2022
@Eugene-Mark
Contributor Author

Documentation updated; thanks for the valuable comments!

@@ -22,6 +22,11 @@ license: |
* Table of contents
{:toc}

## Upgrading from Spark SQL 3.3 to 3.4

Member

Is this note related to this change? The second one is.

Contributor Author

Thanks for pointing it out! I just removed the first bullet, which had been merged by mistake when resolving the doc conflicts.

Remove unrelated docs.
@srowen
Member

srowen commented Jun 20, 2022

Merged to master

@srowen srowen closed this in e31d072 Jun 20, 2022