New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1922: Blog post on JAVA vector changes #1419
Conversation
for adding new functionality or improving the existing infrastructure. | ||
|
||
So we evaluated the usage of templates for compile time code generation and | ||
decided not to use comlplex templates in some places by writing small amount of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comlplex -> complex
Addressed review comments. |
Thanks for writing this! Do you have some more detailed benchmarks on speed and memory usage with Dremio before and after the refactor, with graphs? That'd be cool I think. Other than that it seems to me that the normal spelling of Java is Java and not JAVA, any reason for capitalizing it? Capitalized looks a little old school like FORTRAN or LISP (the latter seems to have been replaced with Lisp nowadays, maybe same for Fortran). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @siddharthteotia , looks good! I have the same comments that have already been said, but it would be nice to elaborate on the performance testing memory analysis with some more details.
## Performance Testing | ||
|
||
We did performance testing downstream in Dremio and saw a 5% average improvement | ||
in Tpch queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An improvement in speedup right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler , Yes, that's correct. Let me see if I can provide some graphical comparison of performance numbers from Tpch queries before and after using new Java implementation.
W.r.t memory analysis, I don't think I have enough information to elaborate on the results because the work done as part of refactoring project is a precursor to a project in the pipeline -- https://issues.apache.org/jira/browse/ARROW-1807
This has the potential to reduce the heap usage substantially. Without the work done as part of refactoring (ARROW-1463), ARROW-1807 would not be possible.
Having said that, the new implementation has reduced the heap usage to some extent simply because there are less number of objects floating around in vectors.
Let me see what I can do here to elaborate on memory usage.
@pcmoritz , there was no particular reason to use JAVA instead of Java. I will correct it.
|
||
## Design Goals | ||
|
||
1. Improved Maintainability and Extensibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lower case Maintainability
and Extensibility
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
developers don't have to write a lot of duplicate code. | ||
|
||
However, we realized that over a period of time some specific JAVA | ||
templates became extremely complex with giant if-else blocks, poor code indentation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poor code indentation -> poor code style?
unnecessarily and using structures that could be substituted with better | ||
alternatives. | ||
|
||
**No performance overhead on hot code paths** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other two section "Improved Maintainability and Extensibility" and "Improved Heap Usage" are both capitalized, maybe make them more consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There were cases where branches could be avoided all together. | ||
|
||
In case of Nullable vectors, we were doing multiple checks to confirm if | ||
the value at a given position in the vector is NULL or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULL -> null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@siddharthteotia Thanks for the write up. Looks good! |
All, we will be doing a short blog post on performance improvements we have seen in Dremio with upcoming version of Arrow. We are gathering numbers systematically for specific SQL operators where substantial improvement is visible. As of now, I don't think I will be able to provide numbers immediately that are worth putting in the blog post. What do people think? |
I think we should decide on this PR soon. What do people think of my suggestion here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. If you wouldn't mind, I will carry this patch from here and add some small refinements (like mentioning the 0.8.0 release) before publishing. I'm traveling tomorrow so if it sounds OK this can go up Tuesday along with a "What's new in Arrow 0.8.0" posting that I can put together during the day tomorrow
Thanks @wesm. You can take over this patch. |
Change-Id: I719e4aa31c1cd999e718b67154286fdd0dd05a0d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks for writing this @siddharthteotia!
cc @wesm , @jacques-n , @BryanCutler , @icexelloss
A small post on recent improvements in JAVA vectors. Suggestions are welcome :)