New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-25998: Redo synchronization in SyncFuture #3382
Conversation
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
// Lock protecting the thread safe fields | ||
ReentrantLock doneLock; | ||
// Condition to wait on for client threads. | ||
Condition isDone; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: isDone is name of a method, not of a data member.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And name it something else because isDone method already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doneCondition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, done.
Looks good. |
0a970fb
to
f91693d
Compare
f91693d
to
caf0f57
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took another look. Still LGTM. Nits and one question. Otherwise +1 (FYI OOO for next week...)
@@ -363,7 +362,8 @@ private void syncCompleted(AsyncWriter writer, long processedTxid, long startTim | |||
// sync futures then just use the default one. | |||
private boolean isHsync(long beginTxid, long endTxid) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to note that above the methods are changed from getTxid to getTxId but then here we have beginTxid... super nit... but you might want them to be consistent? Just leave the getTxid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole Txid change mess was unintentional, undone.
@@ -1,5 +1,5 @@ | |||
/** | |||
* Licensed to the Apache Software Foundation (ASF) under one | |||
*Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super super super nit...^ Put space back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
/* | ||
* Fields below are created once at reset() and accessed without any lock. Should be ok as they | ||
* are immutable for this instance of sync future until it is reset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth it doing all this handling just so we reuse SyncFuture instances? Maybe a later experiment would be trying to create a SyncFuture every time? Just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, the whole SyncFuture caching seems unnecessary, I was discussing this with my colleagues at work. cc: @apurtell @d-c-manning
I don't have the full historical context here but if I were to guess I think it is because we don't want GC to (over) work with millions of these small sync future objects since they are created once per mutation. May be the latest versions of Java garbage collectors don't need these kind of optimizations, agree that this needs more performance analysis. Let me create a jira to perf analyze the removal of this.
} | ||
|
||
synchronized boolean isForceSync() { | ||
boolean isForceSync() { | ||
return forceSync; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be covered by the lock? The comment at head of the class says this data member should be covered by the lock? ("all below")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also this comment..
/*
* Fields below are created once at reset() and accessed without any lock. Should be ok as they
* are immutable for this instance of sync future until it is reset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I do not think using ReentrantLock could have much difference on performance comparing to synchronized.
Maybe the reason is because of now we create a ReentrantLock everytime so there are some side effects such as biased locking?
@@ -132,7 +131,7 @@ | |||
private static final Logger LOG = LoggerFactory.getLogger(AsyncFSWAL.class); | |||
|
|||
private static final Comparator<SyncFuture> SEQ_COMPARATOR = (o1, o2) -> { | |||
int c = Long.compare(o1.getTxid(), o2.getTxid()); | |||
int c = Long.compare(o1.getTxId(), o2.getTxId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? I think in most places we just use txid instead of txId...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And maybe for JDK8, we could use
Comparator.comparingLong(SyncFuture::getTxid).thenComparingInt(System::identityHashCode);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Txid change was unintentional, undone.
Comparator change: Done.
* The transaction id of this operation, monotonically increases. | ||
* Lock protecting the thread-safe fields. | ||
*/ | ||
ReentrantLock doneLock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/** | ||
* Condition to wait on for client threads. | ||
*/ | ||
Condition doneCondition; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if (t != null && t != Thread.currentThread()) { | ||
throw new IllegalStateException(); | ||
} | ||
t = Thread.currentThread(); | ||
if (!isDone()) { | ||
throw new IllegalStateException("" + txid + " " + Thread.currentThread()); | ||
} | ||
this.doneLock = new ReentrantLock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lock is also reset everytime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just being extra cautions, just in case there are any dangling locks and such from previous invocation but thinking about it again, that is probably not needed.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Agree its not just synchronized vs j.u.concurrent. It looks like synchronized is a teeny bit slower compared to concurrent implementations under contention (based on benchmarks, example) but in general there shouldn't be a substantial difference, especially with modern JVM versions. I don't think it is biased locking because synchronized inherently supports biased locking (and interestingly it is being removed) and that is not a differentiator. I've looked at the assembly of the JIT-ed code for this function but it was not obvious to me (I'm not an expert at reading x86 assembly, so may I have missed something too). Just looking at the flame graphs with/without change I can see that there is less contention around locks due to a reasons like no busy wait, instead there is a CV wait (so there is no constant monitor contention in a loop until woken up, there is no contention around certain fields like txid, forcesync and probably some help from reentrant lock being faster than object monitors. |
caf0f57
to
fd09ca7
Compare
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So with the newest patch, we could still see big performance improvements?
Interesting. Did you guys test it against a distributed HDFS cluster, or a local mini cluster?
synchronized boolean isThrowable() { | ||
return isDone() && getThrowable() != null; | ||
boolean isDone() { | ||
if (doneLock == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing... What if later a developer call this method before calling reset? Since now we will not create the lock and condition everytime in reset, let's just create it ion the constructor? So we do not need to test whether it is null any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, done
Throwable getThrowable() { | ||
doneLock.lock(); | ||
try { | ||
if (!isDone()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to just test this.doneTxid != NOT_DONE? Or introduce an unlocked isDone version may be called isDoneWithoutLock to call it in both isDone and getThrowable, so we do not need to lock in the isDone again here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a ReentrantLock, so in the same thread context, it doesn't acquire again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if you are saying we can access doneTxid without a lock for specific codepaths, long operations per JVM spec are not atomic since it is split into two 32bit operations, so we may see a corrupt value and hence can result in a wrong isDone() check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it will be acuired again, as you still need to call unlock one more time you call lock one more time. Reentrant means you can lock the lock multiple times in the same thread, but at least it will waste a bit more cycles.
I mean change the code like this:
boolean isDoneWithoutLock() {
return this.doneTxid != NOT_DONE;
}
boolean isDone() {
doneLock.lock();
try {
return inDoneWithoutLock();
} finally {
doneLock.unlock();
}
}
boolean done() {
doneLock.lock();
try {
if (inDoneWithoutLock()) {
return false;
}
...
} finally {
doneLock.unlock();
}
}
long get() {
doneLock.lock();
try {
while (!inDoneWithoutLock()) {
...
}
...
} finally {
doneLock.unlock();
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not include modifiers here, just an example, not the final code. The isDoneWithoutLock should be private...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, re-entrant acquisition of the lock should be very low penalty but yes doing this way shaves off a few cycles, will do.
fd09ca7
to
0458760
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Did you guys test it against a distributed HDFS cluster, or a local mini cluster?
Please see my comments on the jira.
Throwable getThrowable() { | ||
doneLock.lock(); | ||
try { | ||
if (!isDone()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if you are saying we can access doneTxid without a lock for specific codepaths, long operations per JVM spec are not atomic since it is split into two 32bit operations, so we may see a corrupt value and hence can result in a wrong isDone() check?
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Saw your last comment, glad to see there are performance increasing on a real cluster. What I mentioned above is the result for WALPE, I haven't seen you or @apurtell explicitly mentioned the environment, on a HDFS or just a local mini cluster or something else. As the performance number is too good for WALPE, I want to double confirm here. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to double confirm here.
Sure, unfortunately my tooling doesn't support branch-2 yet (WIP), so can't test AsyncWAL on a real distributed HDFS cluster but as I noted the gains reported in the jira are specifically with async wal and not the other implementation.
Throwable getThrowable() { | ||
doneLock.lock(); | ||
try { | ||
if (!isDone()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, re-entrant acquisition of the lock should be very low penalty but yes doing this way shaves off a few cycles, will do.
OK, no problem. I think the current information is enough to merge this PR. We could do more test in the future add fill the release note :) |
🎊 +1 overall
This message was automatically generated. |
0458760
to
255d58e
Compare
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No big problems. Just minor nits about whether to mark a field final.
And please fix the checkstyle issue? It is just a line length warning.
Thanks for the nice patch!
|
||
private boolean forceSync; | ||
|
||
SyncFuture() { | ||
this.doneLock = new ReentrantLock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two fields could be marked as final if we never change them after construction.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left only minor nits, +1
* The transaction id of this operation, monotonically increases. | ||
* Lock protecting the thread-safe fields. | ||
*/ | ||
private ReentrantLock doneLock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: keep it final?
/** | ||
* Condition to wait on for client threads. | ||
*/ | ||
private Condition doneCondition; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: final?
255d58e
to
b91c3b0
Compare
Currently uses coarse grained synchronized approach that seems to create a lot of contention. This patch - Uses a reentrant lock instead of synchronized monitor - Switches to a condition variable based waiting rather than busy wait - Removed synchronization for unnecessary fields Signed-off-by: Michael Stack <stack@apache.org> Signed-off-by: Andrew Purtell <apurtell@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Viraj Jasani <vjasani@apache.org>
b91c3b0
to
6bafb59
Compare
Currently uses coarse grained synchronized approach that seems to
create a lot of contention. This patch