
[SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType #36499

Closed
wants to merge 16 commits

Conversation

Eugene-Mark
Contributor

@Eugene-Mark Eugene-Mark commented May 10, 2022

What changes were proposed in this pull request?

  • Implemented getCatalystType method in TeradataDialect
  • Handle Types.NUMERIC explicitly

Why are the changes needed?

When loading a table from Teradata, a column of Teradata Number type is converted to DecimalType(38,0), which loses the fractional part of the original data.

Does this PR introduce any user-facing change?

Yes. The Number type is now converted to DecimalType(38,18) when the reported scale is 0, so the fractional part is preserved.
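
For illustration, here is a hypothetical read of such a table through the JDBC data source (the URL, table name, and credentials are placeholders):

// Hypothetical example: load a Teradata table whose columns are declared as
// NUMBER with no explicit precision/scale.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:teradata://host/DATABASE=test_db")
  .option("dbtable", "test_db.test_table")
  .option("user", "user")
  .option("password", "password")
  .load()

// Before this change: such a column is inferred as DecimalType(38,0), so 1234.5678 is read as 1234.
// After this change:  it is inferred as DecimalType(38,18) and the fractional part is kept.
df.printSchema()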

How was this patch tested?

A unit test is added to JDBCSuite.scala.

@github-actions github-actions bot added the SQL label May 10, 2022
@Eugene-Mark
Contributor Author

@HyukjinKwon @srowen It would be appreciated if this PR could be reviewed when you get a chance. Thanks!

@srowen
Member

srowen commented May 16, 2022

I don't know anything about Teradata. Is it documented that this should be the result, and is it specific to Teradata?

@HyukjinKwon
Member

Yeah, I don't quite follow the change. Why do we need to change precision and scale?

@Eugene-Mark
Contributor Author

Eugene-Mark commented May 17, 2022

@srowen I'm not a Teradata expert either; I just invoke Teradata through Spark's JDBC API and ran into this issue. I couldn't find documentation explaining this behavior on the Teradata side. I printed the metadata in JdbcUtils.scala -> getSchema, which shows that fieldScale is already 0 before it is passed to downstream callers. The metadata comes from the ResultSet that Spark obtains right after executing statement.executeQuery with the query s"SELECT * FROM $table WHERE 1=0".
Maybe it's good enough to give the user a default DecimalType instead of a rounded integer until we find a better explanation of what happens on the Teradata side.
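
For reference, a minimal plain-JDBC sketch (with hypothetical connection details) of how the reported precision and scale can be inspected, using the same WHERE 1=0 trick Spark uses to fetch only the schema:

import java.sql.DriverManager

// Minimal sketch: print the type, precision and scale the Teradata JDBC driver
// reports for each column; this mirrors what JdbcUtils.getSchema receives.
// The URL, table name and credentials below are placeholders.
val conn = DriverManager.getConnection(
  "jdbc:teradata://host/DATABASE=test_db", "user", "password")
try {
  val rs = conn.createStatement().executeQuery(
    "SELECT * FROM test_db.test_table WHERE 1=0")
  val meta = rs.getMetaData
  for (i <- 1 to meta.getColumnCount) {
    println(s"${meta.getColumnName(i)}: type=${meta.getColumnType(i)}, " +
      s"precision=${meta.getPrecision(i)}, scale=${meta.getScale(i)}")
  }
} finally {
  conn.close()
}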

@Eugene-Mark
Contributor Author

Eugene-Mark commented May 17, 2022

@HyukjinKwon SPARK-38846 shows that Teradata's Number type loses its fractional part after being loaded into Spark. We found that the JDBC ResultSetMetaData reports a scale of 0 regardless of how the column is defined on the Teradata side. Inspired by how Spark handles the Oracle ResultSetMetaData inconsistency, I explicitly set the scale in TeradataDialect so that at least a default DecimalType with scale 18 is returned instead of a rounded integer.

@srowen
Member

srowen commented May 17, 2022

OK, I just wonder if this is specific to Teradata, or whether it can be changed elsewhere higher up in the abstraction layers.

But you're saying the scale/precision info is lost in the JDBC connector? Then I wonder what we can do, really; we need to know this to get it right.

@Eugene-Mark
Contributor Author

Eugene-Mark commented May 18, 2022

@srowen @HyukjinKwon I made some progress and things are clearer now. Let me summarize my recent findings.

For Teradata, per their official documentation, the Number type can be declared with implicit or explicit precision/scale:

  • Indicate NUMBER with the system limits for precision and scale:
      NUMBER, or equivalently NUMBER (*)
  • Limit the scale:
      NUMBER (*,m), where m specifies a scale in the range from 0 to 38
  • Limit the precision and scale:
      NUMBER (n,m)
      NUMBER (n), which is equivalent to NUMBER (n,0)

Spark uses resultSet.getMetaData to get the column metadata. However, the scale is lost when it is not explicitly defined on the Teradata side. For example, if we create a table like:

create set table test_db.test_table(id BIGINT, column1 NUMBER, column2 NUMBER) PRIMARY INDEX (id);

Both column1 and column2 lose their scale info, which means the reported scale equals 0. Later, on the getter side, the data is fetched using a schema without scale info, and the final DataFrame loses those columns' fractional parts. But per the Teradata guide, such a Number should use the system limits for precision and scale.

If we instead create the table and explicitly define the precision and scale of the Number type:

create set table test_db.test_table2 (id BIGINT, column1 NUMBER(20,10), column2 NUMBER(38,18)) PRIMARY INDEX (id);

Spark gets the scale info correctly in JdbcUtils.scala, and everything works fine per my manual test.

Things are much clearer now: when we use the query s"SELECT * FROM $table WHERE 1=0" to get the schema info from Teradata, the implicit scale of Number is lost.

By implementing getCatalystType in TeradataDialect, we give those implicit Number columns Spark's SYSTEM_DEFAULT DecimalType, i.e. precision = 38 and scale = 18, so that Spark's downstream getters do not drop the fractional part of the original data.

@srowen To answer your question directly, "is it specific to Teradata?": I'm afraid I can't give you a definite answer, since a Number type exists in many other databases and the behavior varies case by case. I don't have environments to test them one by one; some might return the scale info, some might not. However, with the current findings, we know that getCatalystType can do the job, and each dialect can handle its own corner cases of the Number type the way OracleDialect did.

So to handle the current Teradata issue, I suggest we let users know, via documentation or a log message, that when a Number is created without an explicit scale, Spark will treat it as DecimalType(38,18).

Note: the DecimalType(38,18) simply follows Spark's DecimalType definition val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18). However, I found that OracleDialect uses DecimalType(38, 10) instead. I'm neutral on this; please suggest which one should be used.

@Eugene-Mark
Contributor Author

@HyukjinKwon @srowen I just updated my latest comment with findings about the root cause of the issue and the current solution. Any comments are welcome, thanks!

@srowen
Member

srowen commented May 28, 2022

So if I create a NUMBER in Teradata without a scale, then it uses a system default scale. Do we know what that is?
I'm confused if Teradata doesn't record and return the actual scale used in the driver, because otherwise we have to guess here.
What if I have a NUMBER that is actually scale=0? This would be wrong.
I imagine your change is more correct, but I'm also aware this is a behavior change in a case where it seems like there is no correct answer.

Can a caller work around it in this case with a cast or does that not work?

@Eugene-Mark
Contributor Author

@srowen Thanks for your response. For the first part, "indicate NUMBER with the system limits for precision and scale", we didn't find further explanation. It sounds like the scale and precision are flexible depending on the user's input, but can't exceed the system limit. Since they're flexible, maybe Teradata just returns a scale of 0 to indicate that. (I actually think it would be better to return an invalid value, like -1, so that downstream callers like Spark could handle the case better.)
Until it's fixed on the Teradata side (or maybe never), the question is which behavior can be tolerated in more cases:

  1. A number like 1234.5678 is rounded to 1234 (current behavior)
  2. A number like 1234 is turned into 1234.0000

IMHO, the second option seems more reasonable.

As for "Can a caller work around it in this case with a cast or does that not work?": yes, a cast can be a workaround. However, it forces the user to track the precision and scale of each Number column, which becomes tedious when the query is complex with many columns, and it somewhat defeats the flexibility of the original Number(*) definition.
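
For completeness, one way a caller can apply such a cast at read time is the JDBC data source's customSchema option, which overrides the inferred type per column (a hypothetical example; the URL and column names are placeholders):

// Hypothetical workaround: override the inferred decimal types per column via
// `customSchema` so the fractional part is not truncated while reading.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:teradata://host/DATABASE=test_db")
  .option("dbtable", "test_db.test_table")
  .option("customSchema", "column1 DECIMAL(38, 18), column2 DECIMAL(38, 18)")
  .load()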

Anyway, I agree with you that it's hard to find a "correct" answer here; it's more of a tradeoff, and it also needs to be called out in the documentation.

@srowen
Member

srowen commented Jun 4, 2022

It sounds like the scale is just 'unknown' even on the Teradata side? That doesn't sound right. But then this isn't a Spark issue; or rather, no assumption we make in Spark is any more or less correct, no?

@Eugene-Mark
Contributor Author

For NUMBER(*) on Teradata, the scale is not fixed but adapts to the value; as they say, it's only constrained by the system limit. So the issue for Teradata is how to denote such a scale through JDBC; maybe the Teradata side thinks the value 0 is the best way to denote this flexibility.
The key question is how Spark should reflect the flexible scale from Teradata: round to an integer (current practice) or preserve the fractional part with a specified scale.

@srowen
Member

srowen commented Jun 5, 2022

I see, so we should interpret this as "maximum scale" or something in Spark? That seems OK, and if we're only confident about Teradata, this seems OK. Let's add a note in the release notes for the behavior change.

… for OracleDialect's suggesting default decimal type
@Eugene-Mark
Contributor Author

We do have a "maximum scale" defined in Spark, but it's not suitable for our case: the current MAX_SCALE is 38 and is used for things like boundary protection in Decimal's divide operator.
We need a relatively universal decimal type (one that can cover almost "all" kinds of Number values), and I think the current SYSTEM_DEFAULT is the right candidate given its usage in JavaTypeInference and ScalaReflection, where it's used as the default type for Decimal/Number:
val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18)

I suggest we also modify OracleDialect's default decimal. DecimalType(DecimalType.MAX_PRECISION, 10) is more of a magic type, since there is no clue as to why the scale should be 10. For the sake of consistency, maybe it's better to replace it with DecimalType.SYSTEM_DEFAULT.

@srowen
Member

srowen commented Jun 6, 2022

You're saying, basically, assume scale=18? That seems reasonable.
Or are you saying there needs to be an arbitrary precision type? I don't see how a DB would support that.
I'm hesitant to modify Oracle without knowing why it is how it is, and why it should change.

@Eugene-Mark
Contributor Author

Eugene-Mark commented Jun 6, 2022

Agreed that it's better not to modify the Oracle-related part; I just removed that from the commit.
Yes, I suggest we use scale = 18.
As for precision, when Number(*) or Number is used in Teradata, the precision returned from JDBC is 40, which is larger than Spark's max precision, so I also handled that case explicitly.


override def getCatalystType(
    sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
  if (sqlType == Types.NUMERIC) {
Member

OK, now down to nits. I would use sqlType match { for consistency. Also, return Some(...) when the argument is definitely non-null, like a constant, as in all cases here.

Contributor Author

Good point! Will modify accordingly.

  } else {
    // In Teradata, Number(*, scale) returns size (namely precision) as 40,
    // which conflicts with DecimalType.MAX_PRECISION.
    Option(DecimalType(Math.min(size, DecimalType.MAX_PRECISION), scale.toInt))
Member

What if precision = 40 and scale = 0? Do we need to entertain that possibility, or is scale=0 always going to mean default precision too?

Contributor Author

Thanks for this comment! It points to something more reasonable than my current approach. In Teradata, only Number(*)/Number, Number(*,scale), and Number(precision,scale) are valid forms, which means that when the scale is flexible, the precision returned must be 40. So we don't need to convert every scale = 0 column to the default decimal type; we only need to do it when precision = 40 is detected. That way we respect the user's explicit scale = 0 settings, e.g. Number(20,0) is converted to DecimalType(20,0).
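
A rough sketch of the refined mapping described above, as a TeradataDialect method (a simplified sketch, not necessarily the exact final diff; the scale is read from the metadata passed in by the JDBC layer):

import java.sql.Types
import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

// Sketch: only the "flexible" case, where Teradata reports precision 40 for an
// implicit NUMBER / NUMBER(*), falls back to DecimalType.SYSTEM_DEFAULT; an
// explicit declaration such as NUMBER(20,0) keeps its declared precision/scale.
override def getCatalystType(
    sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
  sqlType match {
    case Types.NUMERIC =>
      val scale = if (md != null) md.build().getLong("scale") else 0L
      if (size > DecimalType.MAX_PRECISION) {
        // Implicit NUMBER / NUMBER(*,m): the driver reports precision 40.
        if (scale == 0) Some(DecimalType.SYSTEM_DEFAULT)
        else Some(DecimalType(DecimalType.MAX_PRECISION, scale.toInt))
      } else {
        Some(DecimalType(size, scale.toInt))
      }
    case _ => None
  }
}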

@Eugene-Mark
Contributor Author

The test failure seems unrelated to the committed code; several recent PRs failed with the same error, like this one.

@Eugene-Mark
Contributor Author

Please kindly help relaunch the tests once the CI issue has been fixed, thanks!

@srowen
Member

srowen commented Jun 13, 2022

I think you have to retrigger on your end. Can you try re-running the jobs, or pushing a dummy empty commit?

@srowen
Member

srowen commented Jun 15, 2022

Hm, I think the doc build error is unrelated

@Eugene-Mark
Contributor Author

It's interesting that the previous commit could pass the test, and some other PRs pass it as well. I will try reverting some changes to see whether I can get it to pass.

@srowen
Member

srowen commented Jun 18, 2022

I think it's spurious and we can ignore it, but let's see one more time.

Member

@srowen srowen left a comment

Huh, well I am not sure why the doc tests are failing. I think it is unrelated, clearly. The last change we need here is a note in the migration guide for 3.4, indicating the change in behavior. I think it is a bug fix, but still non-trivial enough to note as a behavior change.

@github-actions github-actions bot added the DOCS label Jun 19, 2022
@Eugene-Mark
Contributor Author

Documentation updated; thanks for the valuable comments!

@@ -22,6 +22,11 @@ license: |
* Table of contents
{:toc}

## Upgrading from Spark SQL 3.3 to 3.4

Member

Is this note related to this change? The second one is.

Contributor Author

Thanks for pointing it out! I just removed the first bullet, which had been merged by mistake when resolving the doc conflicts.

Remove unrelated docs.
@srowen
Member

srowen commented Jun 20, 2022

Merged to master

@srowen srowen closed this in e31d072 Jun 20, 2022