New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-554: Float to timestamp schema evolution handles time/nanoseconds incorrectly #431
Conversation
@@ -1411,7 +1411,7 @@ public void setConvertVectorElement(int elementNum) { | |||
long wholeSec = (long) Math.floor(seconds); | |||
timestampColVector.time[elementNum] = wholeSec * 1000; | |||
timestampColVector.nanos[elementNum] = | |||
1_000_000 * (int) Math.round((seconds - wholeSec) * 1000); | |||
Math.max(0, 1_000_000 * (int) Math.round((seconds - wholeSec) * 1000)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abstractdog , I am confused on whether this is the correct fix for the issue. Is this negative value because of long overflow? @omalley , @prasanthj , @t3rmin4t0r , what would be the expected behavior in this case for ORC? Setting timestamp to null
? Using Long.MAX_VALUE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jcamachor: yes, this was about long overflow, so I agree, it should be handled accordingly in my opinion, the next commit is about this:
9c2b909
please take a look @omalley, @prasanthj , @t3rmin4t0r
please find the whole patch in single file here:
https://issues.apache.org/jira/secure/attachment/12981069/ORC-554.02.patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overflow case should become null. I know it feels weird, because in programming we use "null" to mean "no value." In SQL it means "error value," which you can see when you accidentally write "col1 = null" and get no matches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @omalley! so, do you mean that isNull[elementNum] = true is enough for the overflow case (without any other assignments in the timestamp col vector)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need that as well as cv.noNulls = false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me make the test case stronger along with the functional changes.
if (seconds >= Long.MAX_VALUE || seconds <= Long.MIN_VALUE) { // overflow | ||
timestampColVector.time[elementNum] = 0L; | ||
timestampColVector.nanos[elementNum] = 0; | ||
timestampColVector.isNull[elementNum] = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also need to set timestampColVector.noNulls = false.
} else { | ||
timestampColVector.time[elementNum] = wholeSec * 1000; | ||
timestampColVector.nanos[elementNum] = | ||
1_000_000 * (int) Math.round((seconds - wholeSec) * 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried that your patch has removed the protection keeping the value above 0.
timestampColVector.nanos[elementNum] = | ||
1_000_000 * (int) Math.round((seconds - wholeSec) * 1000); | ||
|
||
if (seconds >= Long.MAX_VALUE || seconds <= Long.MIN_VALUE) { // overflow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be seconds * 1000, since otherwise the time value can overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, let's multiple first and then check
RecordReader rows = reader.rows(rowOptions)) { | ||
assertTrue(rows.nextBatch(batchTimeStamp)); | ||
|
||
assertTrue(String.format("nanos should be > 0, instead it's: %d", t1.nanos[0]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case is much too loose. Let me take a pass at strengthening it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, test case is cleaner and covers+checks lots of values and use cases now, +1
Fixes apache#431 Signed-off-by: Owen O'Malley <omalley@apache.org>
Ok, take a look at https://github.com/omalley/orc/tree/orc-554 to see if it looks ok. |
double-checked testcase locally, it's working, I would only rename floatToTimeStampPositiveOverflow -> floatToTimeStampOverflow (similar to double) LGTM |
Fixes apache#431 Signed-off-by: Owen O'Malley <omalley@apache.org>
Fixes #431 Signed-off-by: Owen O'Malley <omalley@apache.org>
No description provided.