-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-51042][SQL] Read and write the month and days fields of intervals with one call in Unsafe* classes #49737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-51042][SQL] Read and write the month and days fields of intervals with one call in Unsafe* classes #49737
Conversation
…m.putLong() instead of two calls to Platform.putInt(). This makes writing intervals consistent with the way they are written and read in other places like UnsafeRow.java and should have the same or better performance compared to the original code. It also makes the code endian independent so it works on big and little endian platforms. After this change, all interval related tests are able to pass on big endian platforms. Signed-off-by: Jonathan Albrecht <jonathan.albrecht@ibm.com>
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is ok to align the implementation to UnsafeRow.java, and fix a big-endian issue.
|
+1, LGTM. Merging to master/4.0. |
…als with one call in Unsafe* classes ### What changes were proposed in this pull request? Write the month and days fields of intervals with one call to Platform.put/getLong() instead of two calls to Platform.put/getInt(). In commit ac07cea there was a performance improvement to reading a writing CalendarIntervals in UnsafeRow. This makes writing intervals consistent with UnsafeRow and has better performance compared to the original code. This also fixes big endian platforms where the old (two calls to getput) and new methods of reading and writing CalendarIntervals do not order the bytes in the same way. Currently CalendarInterval related tests in Catalyst and SQL are failing on big endian platforms. There is no effect on little endian platforms (byte order is not affected) except for performance improvement. ### Why are the changes needed? * Improves performance reading and writing CalendarIntervals in Unsafe* classes * Fixes big endian platforms where CalendarIntervals are not read or written correctly in Unsafe* classes ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests on big and little endian platforms ### Was this patch authored or co-authored using generative AI tooling? No Closes #49737 from jonathan-albrecht-ibm/master-endian-interval. Authored-by: Jonathan Albrecht <jonathan.albrecht@ibm.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit a79ba48) Signed-off-by: Max Gekk <max.gekk@gmail.com>
|
@jonathan-albrecht-ibm Congratulations with your first contribution to Apache Spark! |
|
@MaxGekk Thanks for reviewing and merging! |
|
@MaxGekk Would it be possible to merge this to branch-3.5 as well? I have also built and tested it on a local build of 3.5.4. More generally, should I mention that I'd like to merge a PR back as far as branch-3.5 in the PR description or in the JIRA or somewhere else? I didn't see any info on how to ask for which branches to merge to and not sure what the policies are. |
I think so. In the public doc we declare: see https://spark.apache.org/docs/latest/
Apparently, we can consider the changes as a bug fix. cc @dongjoon-hyun @cloud-fan WDYT? |
|
+1 to backport |
|
@MaxGekk, @cloud-fan Thanks for considering this to be backported. Should I open a new PR against branch-3.5? The cherry-pick of a79ba48 applies cleanly on branch-3.5 |
|
+1 for backporting. Thank you for pinging me, @MaxGekk . To @jonathan-albrecht-ibm , yes, we need a new PR in order to make it sure that it passes in |
…als with one call in Unsafe* classes ### What changes were proposed in this pull request? Write the month and days fields of intervals with one call to Platform.put/getLong() instead of two calls to Platform.put/getInt(). In commit ac07cea there was a performance improvement to reading a writing CalendarIntervals in UnsafeRow. This makes writing intervals consistent with UnsafeRow and has better performance compared to the original code. This also fixes big endian platforms where the old (two calls to getput) and new methods of reading and writing CalendarIntervals do not order the bytes in the same way. Currently CalendarInterval related tests in Catalyst and SQL are failing on big endian platforms. There is no effect on little endian platforms (byte order is not affected) except for performance improvement. ### Why are the changes needed? * Improves performance reading and writing CalendarIntervals in Unsafe* classes * Fixes big endian platforms where CalendarIntervals are not read or written correctly in Unsafe* classes ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests on big and little endian platforms ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#49737 from jonathan-albrecht-ibm/master-endian-interval. Authored-by: Jonathan Albrecht <jonathan.albrecht@ibm.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ntervals with one call in Unsafe* classes This is the branch-3.5 backport of #49737 ### What changes were proposed in this pull request? Write the month and days fields of intervals with one call to Platform.put/getLong() instead of two calls to Platform.put/getInt(). In commit ac07cea there was a performance improvement to reading a writing CalendarIntervals in UnsafeRow. This makes writing intervals consistent with UnsafeRow and has better performance compared to the original code. This also fixes big endian platforms where the old (two calls to getput) and new methods of reading and writing CalendarIntervals do not order the bytes in the same way. Currently CalendarInterval related tests in Catalyst and SQL are failing on big endian platforms. There is no effect on little endian platforms (byte order is not affected) except for performance improvement. ### Why are the changes needed? * Improves performance reading and writing CalendarIntervals in Unsafe* classes * Fixes big endian platforms where CalendarIntervals are not read or written correctly in Unsafe* classes ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests on big and little endian platforms ### Was this patch authored or co-authored using generative AI tooling? No Closes #49912 from jonathan-albrecht-ibm/branch-3.5-endian-interval. Authored-by: Jonathan Albrecht <jonathan.albrecht@ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…als with one call in Unsafe* classes ### What changes were proposed in this pull request? Write the month and days fields of intervals with one call to Platform.put/getLong() instead of two calls to Platform.put/getInt(). In commit 86d3e80 there was a performance improvement to reading a writing CalendarIntervals in UnsafeRow. This makes writing intervals consistent with UnsafeRow and has better performance compared to the original code. This also fixes big endian platforms where the old (two calls to getput) and new methods of reading and writing CalendarIntervals do not order the bytes in the same way. Currently CalendarInterval related tests in Catalyst and SQL are failing on big endian platforms. There is no effect on little endian platforms (byte order is not affected) except for performance improvement. ### Why are the changes needed? * Improves performance reading and writing CalendarIntervals in Unsafe* classes * Fixes big endian platforms where CalendarIntervals are not read or written correctly in Unsafe* classes ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests on big and little endian platforms ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#49737 from jonathan-albrecht-ibm/master-endian-interval. Authored-by: Jonathan Albrecht <jonathan.albrecht@ibm.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 05ebdf1) Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
Write the month and days fields of intervals with one call to Platform.put/getLong() instead of two calls to Platform.put/getInt().
In commit ac07cea there was a performance improvement to reading a writing CalendarIntervals in UnsafeRow. This makes writing intervals consistent with UnsafeRow and has better performance compared to the original code.
This also fixes big endian platforms where the old (two calls to getput) and new methods of reading and writing CalendarIntervals do not order the bytes in the same way. Currently CalendarInterval related tests in Catalyst and SQL are failing on big endian platforms.
There is no effect on little endian platforms (byte order is not affected) except for performance improvement.
Why are the changes needed?
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing unit tests on big and little endian platforms
Was this patch authored or co-authored using generative AI tooling?
No