Skip to content

ARTEMIS-1476 HdrHistogram support on verbose SyncCalculation#1605

Closed
franz1981 wants to merge 1 commit intoapache:masterfrom
franz1981:sync_calc_histo
Closed

ARTEMIS-1476 HdrHistogram support on verbose SyncCalculation#1605
franz1981 wants to merge 1 commit intoapache:masterfrom
franz1981:sync_calc_histo

Conversation

@franz1981
Copy link
Contributor

@franz1981 franz1981 commented Oct 23, 2017

It includes:

  • improved documentation of SyncCalculation::syncTest
  • HdrHistogram "write" latencies support using verbose

@franz1981
Copy link
Contributor Author

One example of the possible usage of the results using http://hdrhistogram.github.io/HdrHistogram/plotFiles.html is:
histogram

for (int ntry = 0; ntry < tries; ntry++) {

//perform a gc on each test iteration to help cleanup of callbacks/garbage
System.gc();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't do this.

The previous version was pretty much garbage free... if you must do this, I wouldn't merge the commit at all.

Besides... System.gc() will start a gc in parallel, and you will be already measuring the next write iteration.. which will affect the performance.

Copy link
Contributor Author

@franz1981 franz1981 Oct 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'm addressing all these things, but I need feedback on that too :)
I see 2 options here:

  1. preallocate and cache the blocks IO callbacks (but could be a large number too) in order to be garbage free again
  2. let the IO callbacks to size the (young) GC regions in the first test but uses a better way (ie more reliable) to be sure GC won't affect the measurements

I'm pushing the second solution right now, using the code used by the official microbenchmark tool of the OpenJDK JMH (http://hg.openjdk.java.net/code-tools/jmh/file/e96cad1fc480/jmh-core/src/main/java/org/openjdk/jmh/runner/BaseRunner.java#l309) to perform reliable GC, but feel free to give me feedbacks and suggest me to try the first one too, although I'm not 100% sure could be a better way (cachin a large number of instances in not optimal!!).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always write the same thing. Why cache large numbers of Anything? Just write and send the same thing over and over.

This thing is settled. Not under production. We should leave this one as is IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not under production I mean. This is only used once to create the server. Just leave it be.

factory = new NIOSequentialFileFactory(datafolder, 1).setDatasync(datasync);
((NIOSequentialFileFactory) factory).disableBufferReuse();
if (bufferReuse) {
((NIOSequentialFileFactory) factory).enableBufferReuse();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also wrong... you are changing the semantic of this test...

We are measuring how much write the system can do... Using the buffer will defeat the purpose of the test...

Besides you are using the timeouts from where? this is not the intention of this calculation.

@clebertsuconic
Copy link
Contributor

You told me you would just add verbose output here.. not change semantics of the test for no reason.

There was a lot of testing around this... changing the semantics like this out the blank without the proper testing won't help much.

@clebertsuconic
Copy link
Contributor

This adds complexity that we don't need.. you're even requiring GCs to perform now.. the previous version wouldn't generate any garbage during the measurements (just simple buffer.write all the time.. always the same buffer).

@franz1981
Copy link
Contributor Author

@clebertsuconic
Good points! I've addressed all these things (with few other fixes), but I need feedbacks on the result.
I see 2 options here:
1) preallocate and cache the IO callbacks (but could be a large number too) in order to be garbage free again
2) let the IO callbacks to size the (young) GC regions in the first tests but uses a better way (ie more reliable) to be sure GC won't affect the measurements

I've pushed the second solution right now, using a method to perform reliable GCs (+ await them to happen) used in the official microbenchmark tool of the OpenJDK JMH too (http://hg.openjdk.java.net/code-tools/jmh/file/e96cad1fc480/jmh-core/src/main/java/org/openjdk/jmh/runner/BaseRunner.java#l309).

Feel free to give me feedbacks on it and if we need something similar in other parts (a more reliable way to perform/await GCs to happen) I can put it in our utlity classes too.
@michaelandrepearce @lulf I know you're pretty skilled in benchmarking too so please share your thoughts/opinions: are super wellcome!!!!

@franz1981 franz1981 force-pushed the sync_calc_histo branch 2 times, most recently from 64fa4b5 to 8138e53 Compare October 24, 2017 09:43
@franz1981
Copy link
Contributor Author

franz1981 commented Oct 24, 2017

@clebertsuconic As a note: now without --verbose it is garbage free as in the original version 👍
I've peformed few tests with/without --verbose to see if it impacts the results and it seems to be negligible:

  • with --verbose:
Your system can execute 0.5 syncs per millisecond
Your journal-buffer-timeout should be:2002000
  • without --verbose:
Your system can execute 0.5 syncs per millisecond
Your journal-buffer-timeout should be:2011000

@clebertsuconic
Copy link
Contributor

The issue is on the performWrite. You are creating buffers and callbacks for each write.

@franz1981
Copy link
Contributor Author

In the last version I've pushed I'm reusing the same buffer

@franz1981
Copy link
Contributor Author

franz1981 commented Oct 24, 2017

Here:

   private static SyncIOCompletion writeBlock(SequentialFile file,
                                              boolean syncWrites,
                                              final ByteBuffer bufferBlock,
                                              Supplier<? extends SyncIOCompletion> ioCallbackFactory) throws Exception {
      //reset it outside any latency measurement:
      //it is safe to do it with ASYNCIO and !syncWrites too, because it isn't used concurrently
      bufferBlock.position(0);
      SyncIOCompletion lastIOCallback = ioCallbackFactory.get();
      file.writeDirect(bufferBlock, true, lastIOCallback);

      if (syncWrites) {
         lastIOCallback.waitCompletion();
         lastIOCallback = null;
      }
      return lastIOCallback;
   }

<artifactId>commons-lang3</artifactId>
<version>${commons.lang.version}</version>
</dependency>
<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the license of this?

pom.xml Outdated
<groupId>org.hdrhistogram</groupId>
<artifactId>HdrHistogram</artifactId>
<version>${org.hdrhistogram.version}</version>
<!-- License: Apache 2.0 -->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe it is Apache license.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep: super good point!! It is not obvious at all...I'm checking it!!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems BSD 2 mmmmm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems compatible...@clebert wdyt?

@clebertsuconic
Copy link
Contributor

The license says public domain. And if you need a license BSD.

@franz1981
Copy link
Contributor Author

franz1981 commented Oct 24, 2017

@clebertsuconic I'm not a lot familar with these, but:

For users of this code who wish to consume it under the "BSD" license
rather than under the public domain or CC0 contribution text mentioned
above, the code found under this directory is also provided under the
following license (commonly referred to as the BSD 2-Clause License)

what kind of licence I've to report on the pom.xml?BSD-2 or PDDL?
Looking at other HdrHistogram users (Cassandra NOTICE) using the Apache 2.0 licence seems that there is no need to specify anything in the NOTICE about specific attribution terms too...

@clebertsuconic
Copy link
Contributor

Lets use it under free public domain. As they allow it.

@franz1981 franz1981 force-pushed the sync_calc_histo branch 2 times, most recently from c0ea557 to 999dd38 Compare October 24, 2017 13:43
@franz1981
Copy link
Contributor Author

@clebertsuconic ok!! I'm updating the PR with it!

@franz1981 franz1981 force-pushed the sync_calc_histo branch 2 times, most recently from 31ae9b8 to f357d78 Compare October 24, 2017 13:53
@franz1981
Copy link
Contributor Author

@clebertsuconic Done!

@michaelandrepearce
Copy link
Contributor

michaelandrepearce commented Oct 24, 2017

@franz1981 i really like the ability to get a histogram of the sync test so you can see the percentiles of the disk performance from activemq's view. I do share some concerns with @clebertsuconic that it is quite a re-write in the logic, this is quite a core tool to get the journal settings right, i know @clebertsuconic spent a lot of time with us previously getting it to work and fix issues.

To mitigate this concern, have we an ability to validate the result is the same before and after this change? e.g. some sort of test?

If we can't then at least could we manually check it on a few different disks which will perform differently, e.g. SAN, HDD, SSD and check the output manually.

The difference in previous in the conversation above, actually isn't negligible when talking about SSD over SATA or NVMe. eg. NVMe i believe buffer-timeout can be as low as 4000 on newest of kit.
diff in results 2011000-2002000 = 9000 which is double that of something like the NVMe.

in general +1 this though. HDR is fairly standard lib, we use it lots too here :)

@franz1981
Copy link
Contributor Author

@michaelandrepearce

To mitigate this concern, have we an ability to validate the result is the same before and after this change? e.g. some sort of test?

Right now the calculation of the timeout isn't changed at all while the test methodology (and the output of course) is changed only if --verbose is specified: it is always possible to "validate" the output (the output percentiles) turning off --verbose to see if the timeout estimation is changed/different.
AFAIK the initial journal tuning is performed without --verbose hence I'm expecting that nothing will change from the original version.

The other improvements (without affecting the no --verbose logic) are made to help the JIT to avoid OSR in order to get more reliable results when benchmarking vs temporary filesystem/very high speed devices/no fsync involved.

@michaelandrepearce
Copy link
Contributor

@franz1981, this makes sense and good to know +1, ill leave some comments in place in code in a sec, where looks like some bits still leaked into existing logic. and let you comment.

SequentialFileFactory factory = newFactory(datafolder, fsync, journalType, blockSize * blocks, maxAIO);

final Supplier<? extends SyncIOCompletion> ioCallbackFactory;
final Histogram writeLatencies;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be just in the if verbose section?

JournalType journalType) throws Exception {
SequentialFileFactory factory = newFactory(datafolder, fsync, journalType, blockSize * blocks, maxAIO);

final Supplier<? extends SyncIOCompletion> ioCallbackFactory;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be just in the verbose section?

Copy link
Contributor

@michaelandrepearce michaelandrepearce Oct 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

essentially i would be expecting to see no diff in the code, except an if (verbose) statement with all the change only in that section.
Currently there is too much diff to be sure the logic is the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I've provided a couple of fixes too :P: like the one on the file size with alignment != 1.
I've tried to have a common logic while grouping the differences in order to have a simpler and maintainable code: probably that's the reason why you are seeing such differences on the source code.
I've already evaluated (ie JitWatch on the rescue) such abstraction won't affect the performance of the 2 cases, but if you think that is better to put everything under a big if (verbose) I can understand your reasons too: that's why having 6 eyes on it is more than wellcome IMO :)

Copy link
Contributor

@michaelandrepearce michaelandrepearce Oct 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me trying to understand the code just for a review and understand the differences/changes right now is a bit hard, as too much has moved, its not impossible just is making it tricky to be sure.

If you're not too much against it, it might be easier to do the "if", then it is simply cleaner / easier from a review POV to see what is changed / get comfortable that actually the old behaviour is as is.

Then anyhow as we get more confident it might the if statement always equates to true anyhow and the other side of the if block is removed entirely.

If theres a couple of fixes in existing this can still be there, but then it be obvious what changes are done to fix as well.

Copy link
Contributor Author

@franz1981 franz1981 Oct 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me trying to understand the code just for a review and understand the differences/changes right now is a bit hard, as too much has moved, its not impossible just is making it tricky to be sure.

I think is normal: I've used the splitted view on github and it helps (vs the unified one), but probably it works just if you already know by heart the original code. Your review is a good way for me to have a more objective view of that so I will provide a cleaner separation of the 2 paths as a separate commit to be squashed if it will pass the reviews (by the end of the day; I've other stuff to do first :P).

Just for the sake of completeness, I will ask to @clebertsuconic too: he knows by heart the original code for sure!
@clebertsuconic wdyt about the last implementation? Have you found difficult to trace/understand how much has changed the not --verbose path vs the original?

@clebertsuconic
Copy link
Contributor

clebertsuconic commented Oct 25, 2017

I want to -1 this commit @franz1981..This is supposed to be a dumb piece of code.. making IO writes at garbage free.

For instance.. I told you to use the same buffer, and the same Completion over and over.. like it was before. There's not guarantees about the ordering won't be changed... trusting that just makes it more fragile.

If you just added the Histogram like you said you would.. it would be one thing.. but this logic is completely changed.. and I had spent a lot of time to make sure about these calculations, using different hardwares (disk types).

@franz1981
Copy link
Contributor Author

@clebertsuconic

This is supposed to be a dumb piece of code.. making IO writes at garbage free.

And in the not verbose case it is garbage free as it was originally :) Please use the latest commit to see it.

For instance.. I told you to use the same buffer, and the same Completion over and over.. like it was before.

In the not verbose case I've reused the buffer and the IOCompletion while in the verbose case only the buffer, because I haven't found a way to maintain for each write operation its own start time: if you see a better way to do that please help me to understand it 👍

If you just added the Histogram like you said you would.. it would be one thing.. but this logic is completely changed.. and I had spent a lot of time to make sure about these calculations, using different hardwares (disk types).

And it is the reason why I'm providing a new commit that split the 2 logic in a clean way in order to maintain the original 1:1 as it is (tiny bugs included).
Let me know if it is ok @michaelandrepearce too in order to have 6 eyes on it 👍

@clebertsuconic
Copy link
Contributor

I am using it...

I just don't understand why you would only use the last callback.. why so much complexity added?

if this complexity was added around say.. for instance... JournalImpl::show statistics.. I would go for it... but here at the sync calculation? I think this is being a lot of complexity for something that is supposed to be simple.

@clebertsuconic
Copy link
Contributor

clebertsuconic commented Oct 25, 2017

for instance.. you could have chosen to add the Histogram without making any other changes.. I don't understand why you wanted to do that? just to understand the code? (not that this would be a good reason). It just gets harder to maintain.

why you wanted to just wait the last callback for instance? why such change? and why to create many callbacks?

Why to add this complex System.gc code?

@franz1981
Copy link
Contributor Author

franz1981 commented Oct 25, 2017

@clebertsuconic

if this complexity was added around say.. for instance... JournalImpl::show statistics.. I would go for it... but here at the sync calculation? I think this is being a lot of complexity for something that is supposed to be simple.

+100 This is exactly the point mate :)
So do you think that is better to provide an ad hoc command (if not exists) for that (ie single write latencies measurements)?
This time I have trying to avoid having 2 tools similar but different, but I trust you if you see a different usage for it...

for instance.. you could have chosen to add the Histogram without making any other changes

It seems not possible (to me at least :(): each write request (with ASYNCIO and not syncWrites) need its own start time in order to compute the latency of the operation (and record it on the histogram).
How can you provide for each write its own start time with ASYNCIO and not syncWrites?

@franz1981
Copy link
Contributor Author

@clebertsuconic @michaelandrepearce Consider the last commit code the base for a "journal's show statistics" command: i will open a different JIRA with it if it seems ok and I will drop the timeout calculation too, the histograms are enough for it.

@clebertsuconic
Copy link
Contributor

clebertsuconic commented Oct 25, 2017

@franz1981 The Sync calculation is not part of the life cycle of the server running at all...
Think about it.. we first had nothings.. users had to figure out themselves based on their hardware specifications how many syncs their system supported, and then they would place it there.

At a later point, when we added create.. I added this sync tool to make an approximated auto-tune for the system. anything beyond that is out of context.

It doesn't really matter to add histograms at the sync-calculation, or to perform any writes...

However.. during the process (aka production)... it would matter...

Look at TimedBuffer::logRatesTimer...

there a Histogram would make a lot more sense... and actually that logRatesTimer doesn't belong into TimedBuffer maybe now that we can disable TimedBuffer optionally.

@franz1981
Copy link
Contributor Author

It doesn't really matter to add histograms at the sync-calculation, or to perform any writes...

That's the part I'm trying to understand: IMHO it matters to have a more detailed view of the sync test using standard de facto tools too as @michaelandrepearce has stated above in other comments.
While having it in production in real time (as another PR I've already pushed has provided) is a different but equally (maybe more) important purpose.
With IO, treating the latency just as 1/throughput risks to hide important informations: providing a test that shows the distribution of the percentiles could give to the expert user the chance to tune the buffer timeout based on the 99 percentile (taken from the real hw) instead of an average in order to be sure to met his SLA.

@clebertsuconic
Copy link
Contributor

wouldn't be easier to keep all the original code and just add the Histogram.. without changing how callbacks are happening for example.. (The reason why you needed GCs).

just use the original code... make simple changes... add the histogram (when --verbose).. and it's a lot easier to understand the impact of the changes.

@clebertsuconic
Copy link
Contributor

if you are actually measuring 99%.. do you really need to take out GCs from the calculation?

just keep it simple.. this is not simple as I see this.

@franz1981
Copy link
Contributor Author

franz1981 commented Oct 26, 2017

@clebertsuconic

wouldn't be easier to keep all the original code and just add the Histogram.. without changing how callbacks are happening for example.. (The reason why you needed GCs).

There are good reasons I can't do it and one is written below:

//it is necessary to allocate new callbacks because each async write has its own elapsed time

      final boolean asyncWrites = journalType == JournalType.ASYNCIO && !syncWrites;
      final Supplier<? extends SyncIOCompletion> ioCallbackFactory;
      if (asyncWrites) {
         //it is necessary to allocate new callbacks because each async write has its own elapsed time
         ioCallbackFactory = () -> new LatencyRecorderIOCallback(writeLatencies);
      } else {
         //i can reuse the same callback each time
         final ReusableIOCallback ioCallback = new ReusableIOCallback(writeLatencies);
         //uses will perform a countUp and reset start
         ioCallbackFactory = () -> {
            ioCallback.beforeWrite();
            return ioCallback;
         };
      }

The others was related to compilation of methods by the JIT compiler if you have > 10000 writes and very fast storages/no fsync: you need to add -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining to the command line to find it.

if you are actually measuring 99%.. do you really need to take out GCs from the calculation?

I'm providing all the percentiles offered by the HDR histogram and the GC is already taken out of the async measurements (if you have enough heap) by using the awaitGC method: it is already battle tested and defined in the official OpenJDK Java Microbenchmark Harness (http://openjdk.java.net/projects/code-tools/jmh/) here.

IMO could be a good idea to provide this feature it is to drop at all the timeout calculation if a user just want the latencies percentiles and let the original calculation being used otherwise. I simply don't know how/in what form expose such tool, @clebertsuconic @michaelandrepearce any idea?make sense?

@franz1981
Copy link
Contributor Author

franz1981 commented Oct 26, 2017

@clebertsuconic @michaelandrepearce

IMO could be a good idea to provide this feature it is to drop at all the timeout calculation if a user just want the latencies percentiles and let the original calculation being used otherwise. I simply don't know how/in what form expose such tool, @clebertsuconic @michaelandrepearce any idea?make sense?

I've provided an attempt to do that here: https://github.com/franz1981/activemq-artemis/tree/stat-journal-histogram while in the current PR there is the most recent version I can build that uses the original code.

@michaelandrepearce Now I think it could be reviewed with ease 👍

Today I'm doing other stuff so I will probably do something on that in the weekend...

@franz1981 franz1981 force-pushed the sync_calc_histo branch 4 times, most recently from ec33d57 to 2c36542 Compare October 26, 2017 13:19
clebertsuconic added a commit to clebertsuconic/artemis that referenced this pull request Oct 26, 2017

long[] result = new long[tries];

byte[] block = new byte[blockSize];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this being removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore, just noticed you moved this into allocateAlignedBlock

@michaelandrepearce
Copy link
Contributor

@franz1981 this is much cleaner +1, also its good to see the existing logic remains pretty much the same, just now with HDR being captured.

clebertsuconic added a commit to clebertsuconic/artemis that referenced this pull request Oct 26, 2017
@asfgit asfgit closed this in 9092221 Oct 26, 2017
@franz1981 franz1981 deleted the sync_calc_histo branch October 27, 2017 07:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants