ENHANCEMENT: Performance ideas: pipelining and compression #15

Open
wheezil opened this issue Sep 5, 2019 · 1 comment

Comments

@wheezil
Contributor

wheezil commented Sep 5, 2019

Greetings! I am using the library and really like it. It is very flexible and performs quite well. However... there's always room for optimization. I noticed that activity on this project is low and most comments are very old, so I should ask first: should I be looking into a newer project that has effectively replaced this one?

That being said, my experience with C++-based sorting showed that two improvements can produce very significant results:

  • Pipelining: during the block sort phase, instead of doing the accumulate/blocksort/write steps in a procedural loop, fill each block and then lob it into an execution pipeline that runs the sort and the write as separate parallel tasks.
  • Compression: a light compression codec like Snappy can, in my experience, reduce temp space by 70% or so, and can result in faster I/O (especially if compression is done in parallel using the above pipeline technique).
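
As a rough illustration of the pipelining idea, here is a minimal sketch. It is not code from this library: `PipelinedBlockSort`, `sortIntoRuns`, `writeRun` and `readRun` are made-up names, the input is assumed to be plain text lines, and the thread counts are arbitrary. The calling thread only fills blocks; sorting and writing happen on separate executors.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not part of the library: the caller thread fills
// blocks, a pool sorts them, and a single writer thread stores each run.
final class PipelinedBlockSort {

    static List<Path> sortIntoRuns(Iterator<String> lines, int blockSize) {
        // Thread counts are assumptions; real code would make them tunable.
        ExecutorService sortPool = Executors.newFixedThreadPool(
                Math.max(1, Runtime.getRuntime().availableProcessors() - 1));
        ExecutorService writePool = Executors.newSingleThreadExecutor();
        List<CompletableFuture<Path>> pending = new ArrayList<>();
        try {
            while (lines.hasNext()) {
                // Stage 1 (caller thread): accumulate one block.
                List<String> block = new ArrayList<>(blockSize);
                while (lines.hasNext() && block.size() < blockSize) {
                    block.add(lines.next());
                }
                // Stage 2 (sort pool) hands its result to stage 3 (write pool).
                pending.add(CompletableFuture
                        .supplyAsync(() -> { Collections.sort(block); return block; }, sortPool)
                        .thenApplyAsync(PipelinedBlockSort::writeRun, writePool));
            }
            // Collect results in submission order, waiting for each run.
            List<Path> runs = new ArrayList<>();
            for (CompletableFuture<Path> f : pending) {
                runs.add(f.join());
            }
            return runs;
        } finally {
            sortPool.shutdown();
            writePool.shutdown();
        }
    }

    // Writes one sorted block to its own temp file, one line per entry.
    static Path writeRun(List<String> sorted) {
        try {
            Path file = Files.createTempFile("run-", ".txt");
            Files.write(file, sorted, StandardCharsets.UTF_8);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a run back; only here to make the sketch easy to try out.
    static List<String> readRun(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `PipelinedBlockSort.sortIntoRuns(lines.iterator(), 100_000)` would return one temp file per block, each sorted on its own and ready for the merge phase. Note that this sketch does not limit how many blocks are in flight, so a fast reader could queue up a lot of memory; a real version would probably need some back-pressure (a bounded queue or a semaphore).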

I haven't dug deep enough into the code to see if some of this is already supported; please tell me to RTFM if I've missed something.

Thanks,
john

@cowtowncoder
Owner

Hiya! The project is stable and mature, and I haven't needed changes myself, but there is no replacement that I know of (i.e. I have not written a newer package).

If you or anyone else is interested in experimenting with improvements -- performance, usability/ergonomics, configurability, interoperability -- I'd be happy to help in getting those integrated.
Right now I don't have a personal itch to work on things, but I still maintain it if someone were to find a bug, for example.

Now: on pipelining -- no support for it at this point. Someone actually did something like that for the LZF codec I wrote (https://github.com/ning/compress), and the main question there is probably how to model the way things should fit together, and how to expose tuning with respect to the number of threads to use and synchronization.

As to compression: I think that this is something that can be handled by allowing extensions and does not necessarily have to be part of the core package... although I can see how codecs that the JDK comes with (deflate/gzip) could maybe be supported out of the box, as the default implementation.
Alternatively this package could be made a multi-module Maven project so that extension compression codecs could be built from the same repo and just result in separate jar(s).

Both sound like nice extensions to me.
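
To make the codec-as-extension idea a bit more concrete, here is a minimal sketch using only the gzip streams that ship with the JDK. The `StreamCodec` interface, `GzipCodec` and `CodecDemo` are hypothetical names invented for illustration; they are not existing types in this package, and the actual extension point might well look different.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical extension point (an assumption, not an existing interface):
// a codec wraps the raw temp-file streams on the way out and on the way in.
interface StreamCodec {
    OutputStream wrap(OutputStream raw) throws IOException;
    InputStream unwrap(InputStream raw) throws IOException;
}

// Default implementation backed by the JDK's own gzip support.
final class GzipCodec implements StreamCodec {
    @Override
    public OutputStream wrap(OutputStream raw) throws IOException {
        return new GZIPOutputStream(raw);
    }

    @Override
    public InputStream unwrap(InputStream raw) throws IOException {
        return new GZIPInputStream(raw);
    }
}

// Round-trip helpers using in-memory buffers in place of temp files.
final class CodecDemo {
    static byte[] compress(StreamCodec codec, byte[] data) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            // Closing the wrapped stream flushes the gzip trailer.
            try (OutputStream out = codec.wrap(sink)) {
                out.write(data);
            }
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] decompress(StreamCodec codec, byte[] compressed) {
        try (InputStream in = codec.unwrap(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A Snappy or LZF codec could then live in a separate jar and implement the same interface, which is roughly what the multi-module layout would allow. How much space or time this saves depends heavily on the data, so it would need measuring.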
