[SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType #36499
Conversation
@HyukjinKwon @srowen It would be appreciated if this PR could be reviewed soon. Thanks!
I don't know anything about Teradata - is it documented that this should be the result, and is it specific to Teradata?
Yeah, I don't quite follow the change. Why do we need to change precision and scale?
@srowen I'm also not a Teradata expert; I just invoke Teradata's API from Spark and ran into the issue. I didn't find a document explaining this on the Teradata side. I tried printing the metadata from JdbcUtils.scala -> getSchema, which indicates that the scale reported by the driver is 0 for these columns.
@HyukjinKwon The issue SPARK-38846 shows that the Number type of Teradata loses its fractional part after being loaded into Spark. We found that the JDBC metadata reports the scale as 0 in this case.
OK, I just wonder if this is specific to Teradata, or whether it can be changed elsewhere higher up in the abstraction layers. But you're saying the scale/precision info is lost in the JDBC connector? Then I wonder what we can do, really; we need to know this to get it right.
@srowen @HyukjinKwon I made some progress and things are clearer now. Let me summarize my recent findings.
For Teradata, per their official guide, Number can be declared as Number/Number(*), Number(*, scale), or Number(precision, scale); for the first form, precision and scale are flexible rather than fixed.
Spark decides the Catalyst type from the JDBC metadata in JdbcUtils.scala -> getSchema. If we create the table and explicitly define the scale, Spark gets the scale info correctly from that metadata. Things are much clearer now: when we query a Number column declared without an explicit scale, the driver reports the scale as 0, so the column is mapped to DecimalType(38,0) and the fractional part is lost.
By implementing getCatalystType in the Teradata dialect, we can override that default mapping.
@srowen To answer your question directly, "is it specific to Teradata?", I'm afraid I can't give you a definite answer, since I have only been able to verify the behavior with the Teradata JDBC driver.
So to handle the current Teradata issue, I suggest we let users know, via documentation or a log message, that when a Number is created without an explicit scale, Spark will treat it as DecimalType(38,18).
Note: the DecimalType(38,18) is just copied from Spark DecimalType's SYSTEM_DEFAULT.
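For clarity, a minimal sketch of the idea (the method shape follows JdbcDialect.getCatalystType; the metadata key "scale" and the exact condition are assumptions inferred from the later diff, not the committed code):

```scala
import java.sql.Types

import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

// Sketch only, not the exact committed diff: map a Teradata NUMBER column whose
// reported scale is 0 (no explicit scale declared) to Spark's system default
// decimal, DecimalType(38, 18), instead of DecimalType(38, 0).
def getCatalystType(
    sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
  if (sqlType == Types.NUMERIC) {
    // Assumes JdbcUtils has stored the reported scale in the metadata,
    // as the diff later in this PR suggests.
    val scale = md.build().getLong("scale")
    if (scale == 0) {
      Some(DecimalType.SYSTEM_DEFAULT) // DecimalType(38, 18)
    } else {
      Some(DecimalType(math.min(size, DecimalType.MAX_PRECISION), scale.toInt))
    }
  } else {
    None
  }
}
```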
@HyukjinKwon @srowen Just updated the latest comment with some findings about the root cause of the issue and the current solution. Any comments are welcome, thanks!
So if I create a NUMBER in Teradata without a scale, then it uses a system default scale. Do we know what that is? Can a caller work around it in this case with a cast, or does that not work?
@srowen Thanks for your response. For the first part: a Number created without an explicit scale does not use a single fixed system default scale; the scale is flexible and adapts per value.
As for "Can a caller work around it in this case with a cast or does that not work?". Yes, the cast can be a work around. However, it forces user to take care of the precision and scale of each Number column and it would be more tedious when query is complex with a lot columns to be taken care of. And it somehow goes against the flexibility of original Number(*) definition. Anyway, I agree with you that there seems hard to find a "correct" answer, more like a tradeoff and also needs document to mark it out. |
It sounds like the scale is just "unknown" even on the Teradata side? That doesn't sound right. But then this isn't a Spark issue, or at least no assumption we make in Spark is any more or less correct, no?
For NUMBER(*) on Teradata, the scale is not fixed but adapts to each value; as they put it, it's only constrained by the overall precision limit of the type.
I see, so we should interpret this as "maximum scale" or something in Spark? That seems OK, and if we're only confident about Teradata, this seems OK. Let's add a note in the release notes for the behavior change.
We have "maximum scale" defined in Spark, however, it's not suitable in our case. The current I suggest we also modify the OracleDialects default decimal. The |
You're saying, basically, assume scale=18? That seems reasonable.
Agreed with you that it's better not to modify the Oracle-related part; I just removed it from the commit.
override def getCatalystType(
    sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
  if (sqlType == Types.NUMERIC) {
OK, now down to nits. I would use sqlType match { ... } for consistency. Also return Some(...) when the argument is definitely non-null, like a constant, as in all cases here.
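A sketch of the suggested shape (mapNumeric is a made-up helper name, not code from the PR):

```scala
import java.sql.Types

import org.apache.spark.sql.types.{DataType, DecimalType}

// Sketch of the reviewer's suggestion: pattern match on sqlType and return
// Some(...) directly, since the argument is a non-null constant in every branch.
def mapNumeric(sqlType: Int, size: Int, scale: Int): Option[DataType] = sqlType match {
  case Types.NUMERIC =>
    if (scale == 0) Some(DecimalType.SYSTEM_DEFAULT)
    else Some(DecimalType(math.min(size, DecimalType.MAX_PRECISION), scale))
  case _ => None
}
```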
Good point! Will modify accordingly.
  } else {
    // In Teradata, Number(*, scale) will return size, namely precision, as 40,
    // which conflicts with DecimalType.MAX_PRECISION.
    Option(DecimalType(Math.min(size, DecimalType.MAX_PRECISION), scale.toInt))
What if precision = 40 and scale = 0? Do we need to entertain that possibility, or is scale = 0 always going to mean default precision too?
Thanks for this comment! It points me to something more reasonable than my current approach. In Teradata, only Number(*)/Number, Number(*, scale), and Number(precision, scale) are valid expressions, which means that when the scale is flexible, the returned precision must be 40. So we don't need to convert every scale = 0 column to the default decimal type; we only need to do it when precision = 40 is detected. This means we will respect the user's explicit scale = 0 settings, e.g. Number(20,0) will be converted to DecimalType(20,0).
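A sketch of that refinement (mapTeradataNumeric is an illustrative name; 40 is the precision the driver reports for the flexible Number forms, per the discussion above):

```scala
import java.sql.Types

import org.apache.spark.sql.types.{DataType, DecimalType}

// Illustrative only: fall back to the system default decimal only when the driver
// reports the "flexible" precision of 40; an explicit Number(20, 0) keeps
// DecimalType(20, 0).
def mapTeradataNumeric(sqlType: Int, size: Int, scale: Int): Option[DataType] = sqlType match {
  case Types.NUMERIC if size > DecimalType.MAX_PRECISION =>
    // Number, Number(*) or Number(*, scale): precision comes back as 40.
    if (scale == 0) Some(DecimalType.SYSTEM_DEFAULT)
    else Some(DecimalType(DecimalType.MAX_PRECISION, scale))
  case Types.NUMERIC => Some(DecimalType(size, scale))
  case _ => None
}
```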
The test failure seems to have no relationship with the committed code; several recent PRs failed with the same error, like this one.
Please kindly help relaunch the test once the CI issue has been fixed, thanks!
I think you have to retrigger on your end - can you try re-running the jobs, or push a dummy empty commit?
Hm, I think the doc build error is unrelated.
It's interesting that the previous commit could pass the test and some other PRs pass it as well. I will try reverting some changes to see whether I can pass it.
I think it's spurious, we can ignore it, but let's see one more time.
Huh, well I am not sure why the doc tests are failing. I think it is unrelated, clearly. The last change we need here is a note in the migration guide for 3.4, indicating the change in behavior. I think it is a bug fix, but still non-trivial enough to note as a behavior change.
Document updated, thanks for the previous valuable comments!
@@ -22,6 +22,11 @@ license: |
* Table of contents
{:toc}

## Upgrading from Spark SQL 3.3 to 3.4
Is this note related to this change? The second one is.
Thanks for pointing it out! I just removed the first bullet, which had been merged by mistake when resolving the doc conflicts.
Remove unrelated docs.
Merged to master
What changes were proposed in this pull request?
Add an explicit data mapping between the Teradata Number type and Spark's DecimalType in the JDBC dialect for Teradata, so that Number columns declared without an explicit scale are no longer read as DecimalType(38,0).
Why are the changes needed?
When loading a table from Teradata, if the type of a column in Teradata is Number, it will be converted to DecimalType(38,0), which loses the fractional part of the original data.
Does this PR introduce any user-facing change?
Yes, it will convert the Number type to DecimalType(38,18) if the scale is 0, so that the fractional part is kept in some way.
How was this patch tested?
UT is added to JDBCSuite.scala.
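A rough sketch of the kind of assertion such a test could make (the exact test in JDBCSuite.scala may differ; the URL is a placeholder):

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.JdbcDialects
import org.apache.spark.sql.types.{DecimalType, MetadataBuilder}

// Sketch only: look up the Teradata dialect and check the Number-without-scale mapping.
val teradata = JdbcDialects.get("jdbc:teradata://localhost/database")
val md = new MetadataBuilder().putLong("scale", 0)
assert(teradata.getCatalystType(Types.NUMERIC, "NUMBER", 40, md) == Some(DecimalType(38, 18)))
```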