AVRO-2247 - improved java reading performance with new reader #391

Merged
merged 2 commits into from
Feb 3, 2020

Conversation

unchuckable
Contributor

Cannot reopen the original PR (#354), since I've rebased to current master.

I've tried to address the points that @rstata brought up with my approach. The feature switch between the traditional and the newly suggested reader mechanism is now done inside GenericDatumReader. All tests provided with the avro project run smoothly (I stole @rstata's idea to trigger the tests an additional time with the feature switch enabled). I also fixed defaulting in a way that takes advantage of immutable values and only re-reads default objects with a distinct decoder when really required.

If there are any more things that need testing, please do give me a pointer.

Overall, the newly proposed reader spends extra time building a DatumReader, allowing it to perform the actual reading at a much higher rate. For all applications that are remotely "big data", that tradeoff should turn out highly beneficial.
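To make the build-once, read-many tradeoff concrete, here is a minimal sketch in plain Java. It is not the code from this PR: `FieldType`, `FieldReader`, `buildPlan` and the use of a `ByteBuffer` in place of an Avro decoder are all assumptions made for illustration. The interpreting path repeats the type dispatch for every field of every record, while the planned path does that dispatch once and then only runs the prepared steps.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical illustration only: none of these types are part of Avro.
final class PrecomputedReaderSketch {

  enum FieldType { INT, LONG, DOUBLE }

  /** One pre-resolved step that reads a single field from the buffer. */
  interface FieldReader {
    Object read(ByteBuffer in);
  }

  /** Type dispatch: the work we would like to do only once per schema. */
  static FieldReader resolve(FieldType type) {
    switch (type) {
      case INT:
        return in -> in.getInt();
      case LONG:
        return in -> in.getLong();
      case DOUBLE:
        return in -> in.getDouble();
      default:
        throw new IllegalArgumentException("unsupported type: " + type);
    }
  }

  /** Interpreting approach: repeats the dispatch for every field of every record. */
  static Object[] readInterpreted(List<FieldType> schema, ByteBuffer in) {
    Object[] record = new Object[schema.size()];
    for (int i = 0; i < record.length; i++) {
      record[i] = resolve(schema.get(i)).read(in);
    }
    return record;
  }

  /** Build phase: resolve every field once and keep the resulting steps. */
  static FieldReader[] buildPlan(List<FieldType> schema) {
    FieldReader[] plan = new FieldReader[schema.size()];
    for (int i = 0; i < plan.length; i++) {
      plan[i] = resolve(schema.get(i));
    }
    return plan;
  }

  /** Read phase: no dispatch left, just run the prepared steps in order. */
  static Object[] readPlanned(FieldReader[] plan, ByteBuffer in) {
    Object[] record = new Object[plan.length];
    for (int i = 0; i < plan.length; i++) {
      record[i] = plan[i].read(in);
    }
    return record;
  }
}
```

The plan costs something to build, so it only pays off when the same reader is reused for many records, which is the "big data" case described above.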

I also included a small module (benchmark) that uses JMH to test the performance of the proposed reader approach against the current generic reader. Using JMH should be preferable to Perf.java, since it allows benchmarks to be run in a controlled and statistically significant way.

As stated in the last PR, I'm open to any changes, fire ahead. It's the overall concept and its apparent reader performance gains that I'm chasing after, not having my implementation find its way into the main branch 1:1.

@rstata
Contributor

rstata commented Nov 26, 2018

I will play with it.

To get it to build, I added licenses to all the files:

https://github.com/rstata-projects/avro/tree/unchuckable-fast-avro

For some reason I can't issue a pull request to your fork, can you pull this change from my repo?

Also, how do you invoke the benchmark?

@unchuckable
Contributor Author

Hi, @rstata.

First of all, thanks for looking into it. It means a lot. I'm sorry about the license files; I totally forgot about them this time 😞

I pulled your change from your repo and pushed it into mine. No clue what's up with GitHub and the pull request there; if anybody has a pointer on what I would need to set in my repo, any advice is welcome.

Invoking the benchmark:
cd lang/java/benchmark
mvn clean package
java -jar target/benchmarks.jar (not the benchmark-1.9.0-SNAPSHOT)

By default, it will use 5 warmup iterations and 5 measurement iterations with 10 seconds each, and do all of that 5 times, which totals up to almost 3 hours, but it can easily be reduced to more reasonable limits (20 minutes), like:
java -jar target/benchmarks.jar -wi 3 -i 3 -f 1 (3 iterations for warmup and measurement and only 1 repetition)
Adding -e Building will exclude the building of the DatumReaders from the benchmark, which currently cuts the total evaluation time in half.

The current benchmark classes are only a small excerpt of the cases in Perf.java (but try to replicate them as closely as possible). I can gladly add more if it helps the project; it might make sense to move that to a different ticket though, I guess.

@rstata
Contributor

rstata commented Nov 28, 2018

I've run your code against Perf.java and uploaded the results here. This report contains two sets of results:

  • The "avro-2247 (calibration)" column presents the results of running the 2247 branch against itself three different times. These results are useful for understanding where the Perf.java benchmark tends to have a lot of internal variability. As an example, the BooleanRead/Write shows a lot of natural variability, which is something I've noticed in a lot of my previous performance testing.

  • The "avro-2274 (w/ custom coders) vs" column presents the result of running three different treatments against my avro-2274 branch. The three sub-columns here are as follows: "master" is the Apache Avro master branch (just prior to avro-2274 being merged into it); "2247 (off)" branch is the 2247 code with fast-coder turned off; "2247 (on)" is the 2247 branch with coders turned on.

The last sub-column of the "avro-2274 (...) vs" results is the most relevant. What we see here is a large number of record-related cases showing speedups of 20-30% and even more. This is very promising.

I am currently running the JMH-based benchmarks. These do not have an (obvious) mechanism for comparing the "before/after" performance of your proposed changes, but I will be interested in seeing if they do better in reducing the variance between runs.

I haven't inspected your code yet. I'll do that as well, and offer some opinions.

@unchuckable
Contributor Author

I agree that JMH will still be hard pressed for before/after comparisons, unless the change can be toggled with a feature switch at runtime (which fortunately is the case with the proposed change).

@unchuckable
Contributor Author

Am currently refactoring the code, to use the refactored Resolver of #395. Will post updates soon.

@iemejia iemejia added the Java Pull Requests for Java binding label Nov 29, 2018
@rstata
Contributor

rstata commented Nov 29, 2018

On the one hand, the performance results I posted a few days ago certainly demonstrate that there are some performance improvements to be had for GenericDatumReader.

On the other hand, this change introduces 2800 lines of new code that looks like it'd be tedious to maintain. Also, the comparison here isn't apples to apples, because the old code is more aggressive about reusing objects, and it attempts to apply conversions, which is pure overhead for the performance tests we're using but isn't in other cases. Finally, looking more closely at GenericDatumReader, it has built into it a BUNCH of "customization" points -- methods and objects that can be replaced to customize the reading process, all of which add overhead in the inner-most loop. It's not clear how much of the performance gains come from the pre-computation of actions versus simply getting rid of all these customization points.

I'm tempted to extend the AVRO-2275 work so that the Action-tree generated by Resolver is a complete mirror of the reader's schema (right now, it stops at DoNothing nodes, which for Unions in particular could be pretty high-up in the schema's tree). Then one could write a FastGenericDatumReader class that simply walks that tree to decode the object. I suspect the resulting code would be on the order of 100 lines and would capture almost all the speed found in this fast-avro patch. (And one could decorate the Action objects with any Conversions for LogicalTypes found in the reader's schema, making it quick and easy to apply conversions while doing the walk.)
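The idea of a reader that simply walks a reader-schema-shaped Action tree can be sketched as follows. This is only an illustration under stated assumptions: the `Action`, `ReadLong`, `ReadDouble` and `ReadRecord` classes are hypothetical stand-ins rather than the actions produced by Avro's Resolver, a `ByteBuffer` stands in for the decoder, and the `conversion` field merely mimics how a logical-type Conversion might be attached to a node.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: these Action classes are stand-ins, not Avro's Resolver actions.
final class ActionTreeWalkSketch {

  /** A node of the reader-schema-shaped tree. */
  abstract static class Action {
    // Stand-in for a logical-type Conversion decorating the node.
    Function<Object, Object> conversion = Function.identity();

    Action withConversion(Function<Object, Object> c) {
      this.conversion = c;
      return this;
    }
  }

  static final class ReadLong extends Action {}

  static final class ReadDouble extends Action {}

  static final class ReadRecord extends Action {
    final List<String> names;
    final List<Action> fields;

    ReadRecord(List<String> names, List<Action> fields) {
      this.names = names;
      this.fields = fields;
    }
  }

  /** Decodes one value by walking the tree; records become ordered maps. */
  static Object walk(Action action, ByteBuffer in) {
    Object value;
    if (action instanceof ReadLong) {
      value = in.getLong();
    } else if (action instanceof ReadDouble) {
      value = in.getDouble();
    } else if (action instanceof ReadRecord) {
      ReadRecord record = (ReadRecord) action;
      Map<String, Object> fields = new LinkedHashMap<>();
      for (int i = 0; i < record.names.size(); i++) {
        fields.put(record.names.get(i), walk(record.fields.get(i), in));
      }
      value = fields;
    } else {
      throw new IllegalArgumentException("unknown action: " + action);
    }
    // Apply the node's conversion (identity unless one was attached).
    return action.conversion.apply(value);
  }
}
```

Because every node already carries everything needed to decode its value, the walk never has to consult a separate schema object while reading.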

@rstata
Contributor

rstata commented Nov 30, 2018

@unchuckable -- send an email to "rstata - at - yahoo - . - com" to better coordinate. Thanks.

@unchuckable
Contributor Author

Updated PR to reflect current state of the work, based on Raymie's work in #395 (his changes are contained within this PR, too, and the overall code change of this PR has condensed down to less than 800 lines of code; his refactoring helped lots and made code a lot clearer!) and resolved merge conflicts with current main branch.

As the code comes with a feature switch and leaves most things entirely untouched when deactivated, I'd love to see it considered for inclusion in a near-future release, then it might be possible to get some feedback on the real-life effects of the proposed reader design.

Feedback welcome.

@scottcarey
Contributor

This is pretty awesome. I agree with @rstata that we can probably make this more maintainable with some structural changes. Making it "reader schema shaped" is probably ideal.

It is similar to ideas I had oh, ... a long time ago when I was heavily contributing to Avro. I discussed some of them with @cutting long ago.

The fundamental issue with the old approach is three-fold:

  1. we have to parse TWO data structures for each item we read -- the resolver, and the schema. We could instead have one data structure with all the 'instructions' at each step and do a lot less work per iteration.
  2. Even within the steps, there is a lot of repeated work that could be cached, like the issues with default values.
  3. Avro is currently framed in terms of "readers" and "writers". This is the wrong abstraction! Avro is REALLY about schema transformations: "from" and "to", not "read" and "write". For example, if you want to convert some data conforming to schema A from a SpecificRecord to a GenericRecord... you have to read specific, write to bytes, then read the bytes and write to generic. There is no reason you have to do that; you could have from=Specific and to=Generic. Right now, either from or to is always binary or json, represented by an encoder/decoder. Likewise, one could transform bytes that represent schema B at version 1 to schema B at version 2 without having to go 'through' some in-memory class representation. Anyhow, I digress. We got the abstractions wrong on day 1. I've imagined magically having 3 months of free time to re-write the whole thing (probably in Scala or Kotlin) along these lines, but real life is real life and for now that is fantasy.
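Point 2 above (caching repeated work such as default values) can be illustrated with a small sketch. It is not the code from this PR: `defaultSupplier`, `isImmutable` and the `decodeDefault` callback are hypothetical names, and the list of immutable types is a deliberately conservative assumption. The idea is that an immutable default is decoded once and shared across records, while a mutable default is produced afresh each time so that records never alias each other.

```java
import java.util.function.Supplier;

// Hypothetical sketch, not the code from this PR.
final class DefaultValueSketch {

  /** Conservative check: only types known to be immutable qualify. */
  static boolean isImmutable(Object value) {
    return value == null
        || value instanceof String
        || value instanceof Boolean
        || value instanceof Integer
        || value instanceof Long
        || value instanceof Float
        || value instanceof Double;
  }

  /**
   * Returns a supplier of a field's default value. The decodeDefault
   * callback (an assumption of this sketch) stands for the expensive
   * step of decoding the default from its serialized form.
   */
  static Supplier<Object> defaultSupplier(Supplier<Object> decodeDefault) {
    Object first = decodeDefault.get();
    if (isImmutable(first)) {
      // Safe to share: decode once, hand out the same instance forever.
      return () -> first;
    }
    // Mutable value: decode a fresh copy for every record.
    return decodeDefault;
  }
}
```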

I have thought of this sort of work as the first step. It is an "interpreter" of instructions for transforming data from schema A to schema B, given the operations available to scan the data at schema B in order. This translates to reading or writing bytes, or to going directly from one representation to another without bytes in between.

The next step would be to compile those JIT style, if needed. In some cases we might want the interpreter only, if the iteration count is not high enough; in others we can aggressively compile.

As for bytecode rewriting later, there are two options I see:

  1. Inline the 'instructions' that the 'interpreter' runs into some generated class.
  2. Simply 'monomorphize' the call sites and let the JVM profiler do the inlining. This means subclassing each instance in the tree and 'concretizing' any generic parameter to be a specific class. The JVM does call-site profiling, so as long as each of the 'instructions' appears to be a new, unique class (clone the bytecode) it will profile these independently.

The latter may be fastest after some warmup time (and is easier), but the former will be faster before the JIT has a chance to warm up.

@iemejia
Member

iemejia commented Mar 26, 2019

We introduced automatic code formatting. For more info see the "How to contribute" page. This probably affected this PR; can you please rebase it?

@unchuckable
Contributor Author

Hi there. Nice to see there's still interest in this. I had been planning to hold off on rebasing until @rstata finished his work on the resolver that I make use of, but since I haven't heard from Raymie in a long while, I think I'll try to get it all back into a mergeable state, both function- and formatting-wise.

Will hopefully find some time to do that soon.

@unchuckable
Contributor Author

Okay, I pulled myself together and rebased the whole work on the current master branch, incorporating only those parts of AVRO-2275 that have become a dependency for my code to work.

@scottcarey - I very much agree with your approach and your perspective. With the approach I present, individual FieldReader and ExecutionSteps could be moved from lambdas into specialized classes if they have the capability of replacing themselves/being replaced with dynamically compiled, optimized code. (Actually, in my first design, they all WERE specialized classes, but that made the code three times as big; overkill for the moment.) I'd very much prefer the second option of your comment, the monomorphizing, since it takes complexity off our code and burdens the JIT with a task it's proven to do quite well. And, TBH, if the code doesn't run often enough for the JIT to trigger on it, then it's not performance critical.

I'll be glad for comments, or (preferably) a positive merge decision 😉

@unchuckable
Contributor Author

Hello, fellas. Since this PR has been in a mergeable state but not considered for a long time, despite some interest from various sides, is there any work you'd like to see to get things considered for the main branch? Performance comparisons? Or should I just drop the matter? I'm willing to invest more work into it, but I'd need some pointers in case more work is required.

@rstata
Contributor

rstata commented Nov 13, 2019

I took a quick look. Seems much more maintainable. Can you post what the performance improvements are?

@unchuckable
Contributor Author

Thanks, Raymie, for taking the time to review the change. I'd suggest the following:

  • I'll address the points you made in your review.
  • Next week, I'll be back at work, where I might have the realistic data that I actually want to impact with the change. I'll make measurements and post them here.
  • I'll also either try to run the jmh tests in perf with the different implementation selected via system property, or add new tests to verify performance improvements.

I wanna verify performance improvements, since it has been quite some time since the original code was submitted, and performance impact might vary.

@unchuckable
Contributor Author

Alright, I just tested a merge of the code with the current master branch. All test cases validate. Since the feature is hidden behind a feature toggle (-Dorg.apache.avro.fastread=true / GenericData.setFastReaderEnabled(true)), there should be little risk adding it to the current master (and maybe even have it backported into 1.9.2 and/or 1.8.3).
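A runtime toggle of this kind can be sketched in a few lines of plain Java. Only the property name `org.apache.avro.fastread` is taken from the comment above; the class `FastReadToggleSketch`, its methods and the "fast"/"classic" labels are hypothetical and do not describe how Avro implements the switch.

```java
// Hypothetical sketch: only the property name comes from the comment above;
// the class and its methods are illustrative, not Avro's implementation.
final class FastReadToggleSketch {

  static final String FAST_READ_PROPERTY = "org.apache.avro.fastread";

  // Initialised from -Dorg.apache.avro.fastread=true; false when the property is unset.
  private static volatile boolean fastReaderEnabled = Boolean.getBoolean(FAST_READ_PROPERTY);

  /** Programmatic override of the system property. */
  static void setFastReaderEnabled(boolean enabled) {
    fastReaderEnabled = enabled;
  }

  static boolean isFastReaderEnabled() {
    return fastReaderEnabled;
  }

  /** Picks the code path when a reader is created; the old path stays untouched when off. */
  static String selectReader() {
    return isFastReaderEnabled() ? "fast" : "classic";
  }
}
```

Because the switch is evaluated at runtime, the same benchmark jar can be run once with and once without the flag to get a before/after comparison.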

I ran the Perf-Tests that are supposed to be influenced by the changes of this pull request.

Without FastRead (java -cp /tmp/perf.jar org.apache.avro.perf.Perf -test org.apache.avro.perf.test.generic.*.decode)

Benchmark                          Mode  Cnt        Score         Error  Units
GenericNestedFakeTest.decode      thrpt    3  3988216.741 ± 2893436.555  ops/s
GenericNestedTest.decode          thrpt    3  7515509.993 ± 2634921.762  ops/s
GenericStringTest.decode          thrpt    3  4188121.769 ± 4212835.121  ops/s
GenericTest.decode                thrpt    3  6925318.556 ±  351545.020  ops/s
GenericWithDefaultTest.decode     thrpt    3      329.709 ±      52.882  ops/s
GenericWithOutOfOrderTest.decode  thrpt    3  7015697.216 ±  165010.128  ops/s
GenericWithPromotionTest.decode   thrpt    3  5974462.244 ± 1027794.337  ops/s

With FastRead (java -Dorg.apache.avro.fastread=true -cp /tmp/perf.jar org.apache.avro.perf.Perf -test org.apache.avro.perf.test.generic.*.decode)

Benchmark                          Mode  Cnt         Score        Error  Units
GenericNestedFakeTest.decode      thrpt    3   5531218.252 ± 364891.613  ops/s (+38%)
GenericNestedTest.decode          thrpt    3   7662656.796 ± 710954.774  ops/s (+1%)
GenericStringTest.decode          thrpt    3   5621066.571 ± 242602.049  ops/s (+34%)
GenericTest.decode                thrpt    3   9688996.169 ± 756235.748  ops/s (+39%)
GenericWithDefaultTest.decode     thrpt    3       420.724 ±     49.087  ops/s (+27%)
GenericWithOutOfOrderTest.decode  thrpt    3  11333827.734 ± 183655.106  ops/s (+61%)
GenericWithPromotionTest.decode   thrpt    3   9664049.707 ± 546178.783  ops/s (+61%)

Observations:

  • GenericNestedTest and GenericNestedFakeTest should have their names swapped. GenericNestedFakeTest is not using a GenericDatumReader, hence the results are very much the same.
  • In most other cases, the changes of this pull request offer an optional speedup of between 25 and 60 percent in reading performance for GenericRecords. Some speedup should also be expected for SpecificRecords (though @rstata's custom coders should severely reduce the impact there).
  • As I said, the current implementation does not replace the current reader by default, but can for the moment be used as an experimental option (one that offers performance at the price of leaving the safe grounds of well-proven code).
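For reference, the percentages in the second table can be recomputed from the two sets of scores as (after / before - 1) × 100. The sketch below does that in plain Java; the class name `SpeedupSketch` is an assumption, while the numbers are copied from the tables above. The figures quoted in the table appear to be truncated rather than rounded, and with only three measurements per benchmark the error margins are wide, so the results are indicative at best.

```java
// Scores (ops/s) copied from the two result tables above.
final class SpeedupSketch {

  /** Relative throughput gain in percent. */
  static double speedupPercent(double before, double after) {
    return (after / before - 1.0) * 100.0;
  }

  public static void main(String[] args) {
    String[] names = {
      "GenericNestedFakeTest.decode", "GenericNestedTest.decode",
      "GenericStringTest.decode", "GenericTest.decode",
      "GenericWithDefaultTest.decode", "GenericWithOutOfOrderTest.decode",
      "GenericWithPromotionTest.decode"
    };
    double[] before = {
      3988216.741, 7515509.993, 4188121.769, 6925318.556,
      329.709, 7015697.216, 5974462.244
    };
    double[] after = {
      5531218.252, 7662656.796, 5621066.571, 9688996.169,
      420.724, 11333827.734, 9664049.707
    };
    for (int i = 0; i < names.length; i++) {
      System.out.printf("%-34s %+6.1f%%%n", names[i], speedupPercent(before[i], after[i]));
    }
  }
}
```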

I'll do the nit-fixes in a bit. Please get back to me with your opinion.

@unchuckable
Contributor Author

@rstata, anyone - comments? Or shall we rather just drop this one?

@RyanSkraba
Contributor

Hello! I'm pretty confident that this is safe to merge, especially having no effect when org.apache.avro.fastread is turned off. I've reproduced (informally) the performance benefits of turning it on, and the technique looks sound.

There was a lot of discussion of future work and theory in this and the old PR, so having this in master and/or a release would definitely be motivating and inspiring to continue on the path, as well as help find regressions (if any!) in the new reader.

I think one of the reasons this PR took so long is that it's tricky to change the APIs -- we love our stability and backwards compatibility! I think if we had AVRO-2083 or similar already in place, this would have been long accepted...

Thanks for this work! I'm going to send a message to the dev@ mailing list to see if we can get this a release sooner-rather-than-later!

@unchuckable
Contributor Author

Thank you, @RyanSkraba. If you think it's safe, it might even be interesting to backport the feature, behind a feature switch, to 1.8 or 1.9, to get more feedback on the approach from people who might be locked out of 1.9 by incompatibilities.

@RyanSkraba RyanSkraba merged commit 3ad0106 into apache:master Feb 3, 2020
RyanSkraba pushed a commit that referenced this pull request Feb 3, 2020
* AVRO-2247 - Add FastDatumReaderBuilder and dependencies (rebased)

* Addressed comments to pull request
@FelixGV

FelixGV commented Apr 11, 2020

FYI: if you are interested in Avro performance, I would recommend checking out the fast-serde module of our avro-util project: https://github.com/linkedin/avro-util

It is a fork of RTB House's fast-serde project, with a bunch of additional performance and compatibility improvements.

It does runtime code-gen to build tailored serializer and deserializer classes, one for each reader/writer pair. It works with GR and SR, and with Avro versions 1.4, 1.7 and 1.8. It can probably work with the other Avro versions as well (1.5, 1.6, 1.9), thanks to our compatibility helper module (also in the same project), but we simply haven't tested those versions yet.

Cheers.

-F

Labels
build Java Pull Requests for Java binding

6 participants