Memory- and creation-optimized numeric Series #124

Open
andrus opened this issue Apr 18, 2021 · 0 comments

andrus commented Apr 18, 2021

Primitive Series (IntSeries, LongSeries, etc.) are much more efficient than their object counterparts (Series<Integer>, Series<Long>). They take up to 5x less memory per cell and provide opportunities to implement faster numeric operations, as they don't require (un)boxing. However, they cannot store null values.

Implementation

The idea here is to create null-aware numeric Series objects with performance closer to that of primitive Series. A prototype of IntegerSeries, implemented as two arrays (int[] for values and boolean[] for null tracking), has the following performance characteristics:

  1. vs. ObjectSeries<Integer>
  • 4x less memory used
  • 3x faster to create
  • "get" is 6 orders of magnitude slower due to boxing of ints. Since "get" accounts for only a small share of any real operation, overall operations are roughly 30% slower.
  2. vs. IntSeries
  • 25% more memory used
  • same creation speed
  • same "get" speed for boxed Numbers (we can't use fast "getInt" because nulls may be present)
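
The two-array layout described above can be sketched roughly as follows. This is only an illustration of the idea, not the prototype's code: the class name NullableIntSeries and its methods (of, isNull, get, sum) are hypothetical and are not part of the DFLib API.

```java
// Hypothetical sketch, not DFLib's actual class: a null-aware int column
// backed by two parallel arrays, one for values and one for null flags.
final class NullableIntSeries {

    private final int[] values;    // cell values; the slot is ignored where nulls[i] is true
    private final boolean[] nulls; // true marks a null cell

    NullableIntSeries(int[] values, boolean[] nulls) {
        if (values.length != nulls.length) {
            throw new IllegalArgumentException("values and nulls must have the same length");
        }
        this.values = values;
        this.nulls = nulls;
    }

    // Convenience factory: unboxes each input element once, at creation time
    static NullableIntSeries of(Integer... data) {
        int[] values = new int[data.length];
        boolean[] nulls = new boolean[data.length];
        for (int i = 0; i < data.length; i++) {
            if (data[i] == null) {
                nulls[i] = true;
            } else {
                values[i] = data[i];
            }
        }
        return new NullableIntSeries(values, nulls);
    }

    int size() {
        return values.length;
    }

    boolean isNull(int i) {
        return nulls[i];
    }

    // Object-style access: has to return a boxed Integer (or null) on every call
    Integer get(int i) {
        return nulls[i] ? null : Integer.valueOf(values[i]);
    }

    // Numeric operations can stay on primitives and simply skip null cells
    long sum() {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            if (!nulls[i]) {
                total += values[i];
            }
        }
        return total;
    }
}
```

With this sketch, NullableIntSeries.of(1, null, 3) reports get(1) as null and sum() as 4. As a rough check on the memory figures, assuming a 64-bit HotSpot JVM with compressed references (4-byte references, 16-byte Integer instances, 1 byte per boolean array element) and ignoring array headers: an object cell costs about 20 bytes, a primitive int cell 4 bytes, and a cell in the two-array layout 5 bytes, which is consistent with the 5x, 4x and 25% numbers quoted in this issue.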

Conclusions

The new type of Series and accumulators save a lot of memory, are much faster to create, and provide opportunities for creation-time optimization (if no nulls are found, IntSeries is created).

The downside is a slower "get" due to boxing, though of course the current IntSeries.get() is just as slow.
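
The creation-time optimization mentioned in the conclusions (falling back to a plain primitive Series when no nulls were seen) might look something like the accumulator below. All names here (NullableIntAccumulator, addInt, addNull, sawNulls, toValues, toNullMask) are hypothetical, and allocating the null mask lazily is an assumption of this sketch rather than something stated in the issue.

```java
import java.util.Arrays;

// Hypothetical sketch of an accumulator that collects ints and nulls,
// and remembers whether any null was ever appended.
final class NullableIntAccumulator {

    private int[] values = new int[8];
    private boolean[] nulls; // assumption: allocated only when the first null arrives
    private int size;

    void addInt(int value) {
        ensureCapacity();
        values[size++] = value;
    }

    void addNull() {
        ensureCapacity();
        if (nulls == null) {
            nulls = new boolean[values.length];
        }
        nulls[size++] = true;
    }

    // Doubles both arrays together so they always have the same capacity
    private void ensureCapacity() {
        if (size == values.length) {
            int newCapacity = size * 2;
            values = Arrays.copyOf(values, newCapacity);
            if (nulls != null) {
                nulls = Arrays.copyOf(nulls, newCapacity);
            }
        }
    }

    boolean sawNulls() {
        return nulls != null;
    }

    // Trimmed copy of the collected values
    int[] toValues() {
        return Arrays.copyOf(values, size);
    }

    // Trimmed null mask, or null if no nulls were added; in the latter case
    // the caller can build a plain primitive Series from toValues() alone
    boolean[] toNullMask() {
        return nulls == null ? null : Arrays.copyOf(nulls, size);
    }
}
```

A data adapter would call addInt or addNull per parsed cell, then check sawNulls() once at the end to decide which kind of Series to build, so columns without nulls pay nothing for null tracking.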

TODO

  • Implement numeric Series for Integer, Double, Long
  • Integrate them into various data adapters (CSV, DB, Avro)
  • (Integrate in the Expressions API, so that exps could take advantage of the faster primitive access)

andrus changed the title from "Memory- and access-optimized numeric Series" to "Memory- and creation-optimized numeric Series" on Jan 20, 2024