Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1922: Blog post on JAVA vector changes #1419

Closed
wants to merge 4 commits into from

Conversation

siddharthteotia
Copy link
Contributor

cc @wesm , @jacques-n , @BryanCutler , @icexelloss

A small post on recent improvements in JAVA vectors. Suggestions are welcome :)

for adding new functionality or improving the existing infrastructure.

So we evaluated the usage of templates for compile time code generation and
decided not to use comlplex templates in some places by writing small amount of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comlplex -> complex

@siddharthteotia
Copy link
Contributor Author

Addressed review comments.

@pcmoritz
Copy link
Contributor

Thanks for writing this! Do you have some more detailed benchmarks on speed and memory usage with Dremio before and after the refactor, with graphs? That'd be cool I think.

Other than that it seems to me that the normal spelling of Java is Java and not JAVA, any reason for capitalizing it? Capitalized looks a little old school like FORTRAN or LISP (the latter seems to have been replaced with Lisp nowadays, maybe same for Fortran).

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @siddharthteotia , looks good! I have the same comments that have already been said, but it would be nice to elaborate on the performance testing memory analysis with some more details.

## Performance Testing

We did performance testing downstream in Dremio and saw a 5% average improvement
in Tpch queries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An improvement in speedup right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler , Yes, that's correct. Let me see if I can provide some graphical comparison of performance numbers from Tpch queries before and after using new Java implementation.

W.r.t memory analysis, I don't think I have enough information to elaborate on the results because the work done as part of refactoring project is a precursor to a project in the pipeline -- https://issues.apache.org/jira/browse/ARROW-1807

This has the potential to reduce the heap usage substantially. Without the work done as part of refactoring (ARROW-1463), ARROW-1807 would not be possible.

Having said that, the new implementation has reduced the heap usage to some extent simply because there are less number of objects floating around in vectors.

Let me see what I can do here to elaborate on memory usage.

@pcmoritz , there was no particular reason to use JAVA instead of Java. I will correct it.


## Design Goals

1. Improved Maintainability and Extensibility.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lower case Maintainability and Extensibility?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

developers don't have to write a lot of duplicate code.

However, we realized that over a period of time some specific JAVA
templates became extremely complex with giant if-else blocks, poor code indentation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poor code indentation -> poor code style?

unnecessarily and using structures that could be substituted with better
alternatives.

**No performance overhead on hot code paths**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other two section "Improved Maintainability and Extensibility" and "Improved Heap Usage" are both capitalized, maybe make them more consistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

There were cases where branches could be avoided all together.

In case of Nullable vectors, we were doing multiple checks to confirm if
the value at a given position in the vector is NULL or not.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NULL -> null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@icexelloss
Copy link
Contributor

icexelloss commented Dec 14, 2017

@siddharthteotia Thanks for the write up. Looks good!

@siddharthteotia
Copy link
Contributor Author

All, we will be doing a short blog post on performance improvements we have seen in Dremio with upcoming version of Arrow. We are gathering numbers systematically for specific SQL operators where substantial improvement is visible.

As of now, I don't think I will be able to provide numbers immediately that are worth putting in the blog post.

What do people think?

@siddharthteotia
Copy link
Contributor Author

I think we should decide on this PR soon. What do people think of my suggestion here?
#1419 (comment)

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. If you wouldn't mind, I will carry this patch from here and add some small refinements (like mentioning the 0.8.0 release) before publishing. I'm traveling tomorrow so if it sounds OK this can go up Tuesday along with a "What's new in Arrow 0.8.0" posting that I can put together during the day tomorrow

@siddharthteotia
Copy link
Contributor Author

Thanks @wesm. You can take over this patch.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks for writing this @siddharthteotia!

@wesm wesm closed this in d023b40 Dec 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants