IntSeries / IntMutableList for joins and filters #26

Closed
andrus opened this issue Apr 7, 2019 · 3 comments

andrus commented Apr 7, 2019

Let's create IntMutableList (an appendable collection of primitive "int" values) that can be converted to IntSeries, which is immutable.

While working with collections of primitives in Java is painful, there can be real performance gains. My prototype of the data structures above speeds up joins by roughly 25-30% when used for indexing joined DataFrames.

This task will switch joins and filters to an int-based implementation. Sorters and groupers will be switched separately, as this requires our own custom sorter.
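
To make the idea concrete, here is a minimal sketch of an appendable primitive list that can be frozen into an immutable series, plus a filter that collects matching row positions into it. All class and method names here (IntSeriesSketch, IntMutableListSketch, FilterSketch.filterIndex) are hypothetical and chosen for illustration only; this is not the prototype mentioned above and not the project's actual API.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

// Hypothetical immutable view over primitive ints (illustrative only).
final class IntSeriesSketch {
    private final int[] data;

    IntSeriesSketch(int[] data) {
        this.data = data;
    }

    int get(int i) {
        return data[i];
    }

    int size() {
        return data.length;
    }
}

// Hypothetical appendable list of primitive ints; no boxing to Integer.
final class IntMutableListSketch {
    private int[] data;
    private int size;

    IntMutableListSketch(int capacity) {
        this.data = new int[Math.max(capacity, 1)];
    }

    void add(int value) {
        // double the backing array when it is full
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    // Copies the filled part of the array, so later add() calls
    // cannot change the returned series.
    IntSeriesSketch toIntSeries() {
        return new IntSeriesSketch(Arrays.copyOf(data, size));
    }
}

final class FilterSketch {
    // Collects the positions of all values that match the predicate.
    static IntSeriesSketch filterIndex(int[] column, IntPredicate predicate) {
        IntMutableListSketch index = new IntMutableListSketch(8);
        for (int i = 0; i < column.length; i++) {
            if (predicate.test(column[i])) {
                index.add(i);
            }
        }
        return index.toIntSeries();
    }
}
```

In this sketch a filter produces an index of row positions rather than a list of boxed Integer objects; a join could build its left and right row indexes the same way.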


andrus commented Apr 8, 2019

Note that an implementation for joins is fairly straightforward. However, an implementation for the "sort" operation is trickier, as the JDK libraries do not support sorting an int[] with a custom Comparator. We will need to write our own sort algorithm.
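
For illustration, one possible shape of such a sorter is sketched below: a stable merge sort over an int[] driven by a primitive comparator. `Arrays.sort(int[])` only sorts in natural order, and the Comparator-based overloads require an object array such as Integer[]. The names IntComparatorSketch and IntSortSketch are hypothetical, and merge sort is just one reasonable choice here, not necessarily the algorithm the project ends up with.

```java
import java.util.Arrays;

// Hypothetical primitive counterpart of java.util.Comparator (illustrative only).
interface IntComparatorSketch {
    int compare(int a, int b);
}

final class IntSortSketch {

    // Sorts "values" in place, ordering elements with "comparator".
    // The sort is stable: elements that compare as equal keep their order.
    static void sort(int[] values, IntComparatorSketch comparator) {
        if (values.length < 2) {
            return;
        }
        int[] buffer = Arrays.copyOf(values, values.length);
        mergeSort(buffer, values, 0, values.length, comparator);
    }

    // Sorts the range [from, to) into "dst". On entry "src" and "dst" hold
    // the same elements in that range; "src" is used as scratch space.
    private static void mergeSort(int[] src, int[] dst, int from, int to, IntComparatorSketch c) {
        if (to - from < 2) {
            return;
        }
        int mid = (from + to) >>> 1;

        // sort each half into "src", swapping the roles of the two arrays
        mergeSort(dst, src, from, mid, c);
        mergeSort(dst, src, mid, to, c);

        // merge the two sorted halves of "src" into "dst"
        int i = from;
        int j = mid;
        for (int k = from; k < to; k++) {
            if (j >= to || (i < mid && c.compare(src[i], src[j]) <= 0)) {
                dst[k] = src[i++];
            } else {
                dst[k] = src[j++];
            }
        }
    }
}
```

A typical use would be sorting an array of row positions by the values of some column, for example `IntSortSketch.sort(positions, (a, b) -> names[a].compareTo(names[b]))`, where `positions` and `names` are assumed caller-side arrays.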

andrus added a commit that referenced this issue Apr 10, 2019
andrus added a commit that referenced this issue Apr 10, 2019
andrus changed the title from "IntSeries / IntMutableList - let's try using primitives" to "IntSeries / IntMutableList for joins and filters" Apr 10, 2019
andrus added this to the 0.6 milestone Apr 10, 2019

andrus commented Apr 10, 2019

Latest performance measurements:

  • Hash joins: 21-23% faster
  • Nested loop joins: 2-14% slower (why ?!!)
  • Filter: 37% faster


andrus commented Apr 10, 2019

After the related #27 was implemented, the numbers improved.

Latest performance measurements:

  • Hash joins: 33-35% faster
  • Nested loop joins: 3-13% slower
  • Filter: 37% faster
