-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDFS-17166. RBF: Throwing NoNamenodesAvailableException for a long time, when failover #5990
Conversation
💔 -1 overall
This message was automatically generated. |
@@ -83,14 +83,12 @@ public class MembershipNamenodeResolver | |||
/** Cached lookup of NN for block pool. Invalidated on cache refresh. */ | |||
private Map<String, List<? extends FederationNamenodeContext>> cacheBP; | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid these unrelated changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
.../main/java/org/apache/hadoop/hdfs/server/federation/resolver/MembershipNamenodeResolver.java
Show resolved
Hide resolved
public synchronized void shuffleCache(String nsId, FederationNamenodeContext namenode) { | ||
cacheNS.compute(Pair.of(nsId, false), (ns, namenodeContexts) -> { | ||
if (namenodeContexts != null | ||
&& namenodeContexts.size() > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isEmpty() and probably we want to check for > 1 as there's nothing to rotate.
We should actually do that outside, if there's 1 or less, don't do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for not judging outside here is that we should ensure that get cache and modify cache are atomic, considering the following situation:
- After Thread1 judges that the cache is not empty, it gets the ns no active in the cache
- Thread2 (StateStoreCacheUpdateService loadCache )clear the cache
- Thread3 processes the client request and finds that the cache is empty(MembershipNamenodeResolver.getNamenodesForNameserviceId()), then updates the cache. At this time, the ns - in the cache has active nn
- Thread1 rotates the previously acquired data of non-active nn and writes it into the cache, causing ns in the cache to have no active nn
so, guaranteed atomicity can ensure that the above 2 and 3 are recognized before modifying the cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true that >1 should be judged here instead of >0, done
|
||
for (RouterContext routerContext : cluster.getRouters()) { | ||
// Get the second namenode in the router cache and make it active | ||
List<? extends FederationNamenodeContext> ns0 = routerContext.getRouter().getNamenodeResolver().getNamenodesForNameserviceId("ns0", false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure this line is being flagged by checkstyle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
for (NamenodeHeartbeatService service : heartbeatServices) { | ||
service.periodicInvoke(); | ||
} | ||
assertTrue(cluster.getNamenode("ns0", nsId).getNamenode().getNameNodeState() == ACTIVE.ordinal()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertEquals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
da69e7f
to
29c98dc
Compare
🎊 +1 overall
This message was automatically generated. |
@simbadzina @slfan1989 Could you please help to review. |
🎊 +1 overall
This message was automatically generated. |
*/ | ||
@Override | ||
public synchronized void rotateCache(String nsId, FederationNamenodeContext namenode) { | ||
cacheNS.compute(Pair.of(nsId, false), (ns, namenodeContexts) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to refresh the cache for the key Pair.of(nsId, true)
at times.
You can extend the signature of rotateCache to
public synchronized void rotateCache(String nsId, FederationNamenodeContext namenode, boolean listObserversFirst)
And then call it with
this.namenodeResolver.rotateCache(nsId, namenode, shouldUseObserver);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
.../main/java/org/apache/hadoop/hdfs/server/federation/resolver/MembershipNamenodeResolver.java
Show resolved
Hide resolved
List<FederationNamenodeContext> rotatedNnContexts = new ArrayList<>(namenodeContexts); | ||
Collections.rotate(rotatedNnContexts, -1); | ||
return rotatedNnContexts; | ||
}else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for else. You can just return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
b349753
to
fd89331
Compare
@KeeProMise Thank you for your contribution, LGTM. The information described is very detailed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. There may be a some checkstyle errors though.
if(firstNamenodeContext.getState().equals(ACTIVE)) { | ||
return namenodeContexts; | ||
} | ||
/* | ||
* If the first nn in the cache at this time is not the nn | ||
* that needs to be lowered in priority, there is no need to rotate. | ||
* This happens when other threads have already rotated the cache. | ||
*/ | ||
if(firstNamenodeContext.getRpcAddress().equals(namenode.getRpcAddress())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may get checkstyle errors from missing whitespaces after the if
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
fd89331
to
e41f72c
Compare
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
e41f72c
to
ff81231
Compare
💔 -1 overall
This message was automatically generated. |
ff81231
to
209f588
Compare
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@KeeProMise Can you provide a link to the unit test that reported the error? I found it. |
|
🎊 +1 overall
This message was automatically generated. |
@slfan1989 @goiri @simbadzina "shadedclient error" has been fixed, If you have time, please help to continue the review or merge the PR, Thanks a lot~ |
🎊 +1 overall
This message was automatically generated. |
@@ -599,6 +599,9 @@ public Object invokeMethod( | |||
} | |||
LOG.error("Cannot get available namenode for {} {} error: {}", | |||
nsId, rpcAddress, ioe.getMessage()); | |||
if (this.namenodeResolver != null) { | |||
this.namenodeResolver.rotateCache(nsId, namenode, shouldUseObserver); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to add a log for people to understand what's is happening.
I don't know if it should be behind a config though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your review
- Added log (done)
- Since this is a bug not a feature, most users must need fix the bug by default, so I don’t think we need to add config here, I don't think there are a large number of users here who don't need this bug fixed. But if you think you need to add it, you can give a config name and I will add one later, but I personally think it is not necessary.
...op-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/MiniRouterDFSCluster.java
Show resolved
Hide resolved
@@ -397,6 +397,11 @@ public List<String> getMountPoints(String path) throws IOException { | |||
public void setRouterId(String router) { | |||
} | |||
|
|||
@Override | |||
public void rotateCache( | |||
String nsId, FederationNamenodeContext namenode, boolean listObserversFirst) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation is weird.
Take a look at the checkstyle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, 8 spaces become 6 spaces.
💔 -1 overall
This message was automatically generated. |
add log and javadoc
8846f60
to
d338de4
Compare
🎊 +1 overall
This message was automatically generated. |
Hi @goiri @simbadzina The latest CI report green here and you have approved this PR. If no more comments here, Please help to check in. Thanks. |
@goiri If you have time, please help to continue the review or merge the PR, Thanks a lot~
// Resolver to track active NNs
this.namenodeResolver = newActiveNamenodeResolver(
this.conf, this.stateStore);
if (this.namenodeResolver == null) {
throw new IOException("Cannot find namenode resolver.");
}
|
public synchronized void rotateCache( | ||
public void rotateCache( | ||
String nsId, FederationNamenodeContext namenode, boolean listObserversFirst) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cacheNS instance of ConcurrentHashMap, so the compute method is atomic, we can delete synchronized of the method rotateCache
if (this.namenodeResolver != null) { | ||
this.namenodeResolver.rotateCache(nsId, namenode, shouldUseObserver); | ||
LOG.info("Rotate cache of pair: <ns: {}, observer use: {}>", nsId, shouldUseObserver); | ||
} | ||
// Rotate cache so that client can retry the next namenode in the cache | ||
this.namenodeResolver.rotateCache(nsId, namenode, shouldUseObserver); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need to judge whether the namenodeResolver is empty, because the namenodeResolver is not null when the router is initialized
in Router.java line:178
this.namenodeResolver = newActiveNamenodeResolver(
this.conf, this.stateStore);
if (this.namenodeResolver == null) {
throw new IOException("Cannot find namenode resolver.");
}
.../main/java/org/apache/hadoop/hdfs/server/federation/resolver/MembershipNamenodeResolver.java
Show resolved
Hide resolved
🎊 +1 overall
This message was automatically generated. |
@goiri @simbadzina @slfan1989 If no more comments here, please help merge it, thanks |
…me, when failover (apache#5990)
…me, when failover (apache#5990) (Cherry-picked from c40a6bd) ACLOVERRIDE
… long time, when failover (apache#5990)" This reverts commit fdc11e3.
Description of PR
Provides a description of the bug, the cause of the bug and steps to reproduce it:https://issues.apache.org/jira/browse/HDFS-17166
How was this patch tested?
1. Unit testing
2. Comparison test in real environment
Suppose we have 2 clients [c1 c2], 2 routers [r1 r2] and a ns [ns60], the ns has 2 nn [nn6001 nn6002]
If both nn6001 and nn6002 are in standby state, the priority of nn6002 is higher than nn6001,
r1 uses the package that fixing the bug, r2 uses the original package which has the bug
c1 loops to send requests to r1, and c2 loops to send requests to r2, the request is related to ns60
Make both nn6001 and nn6002 in standby state
After the router reports that nn is in standby state, switch nn6001 to active
14:15:24 nn6001 is active
Check the log of router r1, after nn6001 switches to active, only NoNamenodesAvailableException is printed once
Check the log of router r2, and print NoNamenodesAvailableException for more than one minute after nn6001 switches to active
At 14:16:25, the client c2 accessing the router with the bug could not get the data, and the client c1 accessing the router after the bug was fixed could get the data normally:
c2's log:unable to access normally
c1's log:display the result correctly
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?