HDFS-16699: RBF: Update Observer NameNode state to Active when failover #4663

Open
wants to merge 1 commit into
base: trunk

Conversation

SanQiMax

…ver because of sockeTimeOut Exception

Description of PR

How was this patch tested?

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
| --- | --- | --- | --- | --- |
| +0 🆗 | reexec | 0m 44s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 1s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 ❌ | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 38m 45s | | trunk passed |
| +1 💚 | compile | 1m 1s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | compile | 0m 54s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | checkstyle | 0m 52s | | trunk passed |
| +1 💚 | mvnsite | 1m 3s | | trunk passed |
| +1 💚 | javadoc | 1m 8s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | javadoc | 1m 19s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 1m 48s | | trunk passed |
| +1 💚 | shadedclient | 21m 33s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 41s | | the patch passed |
| +1 💚 | compile | 0m 45s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | javac | 0m 45s | | the patch passed |
| +1 💚 | compile | 0m 40s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | javac | 0m 40s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 26s | | the patch passed |
| +1 💚 | mvnsite | 0m 42s | | the patch passed |
| +1 💚 | javadoc | 0m 40s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | javadoc | 0m 57s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 1m 26s | | the patch passed |
| +1 💚 | shadedclient | 20m 51s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 22m 16s | | hadoop-hdfs-rbf in the patch passed. |
| +1 💚 | asflicense | 0m 54s | | The patch does not generate ASF License warnings. |
| | | 121m 18s | | |
| Subsystem | Report/Notes |
| --- | --- |
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4663/1/artifact/out/Dockerfile |
| GITHUB PR | #4663 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 5875833b7199 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 1378a71 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4663/1/testReport/ |
| Max. process+thread count | 2189 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4663/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

Contributor

@ZanderXu ZanderXu left a comment

Thanks @SanQiMax for your PR.
Can you explain how to reproduce it in detail? Have you used Observer Read in RBF?

@SanQiMax
Author

SanQiMax commented Jul 31, 2022

Thanks @SanQiMax for your PR. Can you explain how to reproduce it in detail? Have you used Observer Read in RBF?

Reproduction details:
I deployed three Routers and two nameservices, ns1 and ns2. ns1 has three Observer NameNodes: ob1, ob2, and ob3.
ns1 handles a large number of connections to its NameNodes. When the first connection attempt to ob1 fails because of a network problem, the Router tries the next NameNode.
Catching the network exception sets failover to true, so if ob2 then responds successfully, this second Observer NameNode is updated to Active.
The situation can be reproduced by injecting a network failure on the Active NameNode or on an Observer NameNode so that the connection attempt throws an exception.
The following code is what sets failover to true:

} else if (isUnavailableException(ioe)) {
    if (this.rpcMonitor != null) {
       this.rpcMonitor.proxyOpFailureCommunicate(nsId);
    }
    failover = true;

isUnavailableException covers ConnectTimeoutException, EOFException, SocketException, and StandbyException.
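
A minimal sketch of the classification described above, not the actual RouterRpcClient code. ConnectTimeoutException and StandbyException are Hadoop classes; to keep the example self-contained they are replaced here by hypothetical stand-ins nested in the sketch class, and the real method may check further conditions.

```java
import java.io.EOFException;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.SocketException;

class UnavailableCheckSketch {
  // Hypothetical stand-ins for Hadoop's ConnectTimeoutException and StandbyException.
  static class ConnectTimeoutException extends IOException {
    ConnectTimeoutException(String msg) { super(msg); }
  }

  static class StandbyException extends IOException {
    StandbyException(String msg) { super(msg); }
  }

  // Returns true for the four exception types listed in the comment above.
  static boolean isUnavailableException(IOException ioe) {
    return ioe instanceof ConnectTimeoutException
        || ioe instanceof EOFException
        || ioe instanceof SocketException
        || ioe instanceof StandbyException;
  }

  public static void main(String[] args) {
    // A connect timeout is treated as "unavailable", so the caller would set failover = true.
    System.out.println(isUnavailableException(new ConnectTimeoutException("timed out")));
    // An unrelated IOException is not, so failover would stay false.
    System.out.println(isUnavailableException(new FileNotFoundException("missing")));
  }
}
```

Note that a plain network timeout on any NameNode, including an Observer, is enough to set the failover flag in this model.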

@SanQiMax
Author

Thanks @SanQiMax for your PR. Can you explain how to reproduce it in detail? Have you used Observer Read in RBF?

I added logging and built a jar. The resulting log output confirmed my guess.

@slfan1989
Contributor

slfan1989 commented Jul 31, 2022

@SanQiMax Thank you very much for your contribution. However, your description in JIRA needs to be clearer. You can refer to @ZanderXu's JIRA descriptions as an example; they are very well written.

@SanQiMax
Author

SanQiMax commented Aug 1, 2022

@SanQiMax Thank you very much for your contribution. However, your description in JIRA needs to be clearer. You can refer to @ZanderXu's JIRA descriptions as an example; they are very well written.

Before forwarding a read request, the Router obtains the list of all NNs (randomly sorted). For a write request, it obtains the list of Active, Standby, and Observer NNs. In both cases it tries the NNs in turn until the request is processed successfully.
In theory, suppose the Router is processing a read request, the first NN in the list is an Observer, and the attempt on that Observer fails with a ConnectTimeoutException. The Router then tries the second NN in the list. If the second NN is also an Observer and returns successfully, the logic shown in the following figure is executed and that NN's state is set to ACTIVE. When a write request is processed afterwards, the Router obtains the list of NNs in the ACTIVE state (at this point it may get multiple NNs in the ACTIVE state), and the request may be forwarded to a non-Active NN.
The same logic runs when the cluster has only Active and Standby NNs, but there it is self-correcting: even if a Standby NN is marked ACTIVE and a request is sent to it, the Standby NN does not process read or write requests, so the Router forwards the request to the Active NN. Once Observer NNs are added, however, a non-Active NN may be set to ACTIVE.
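
The scenario above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the Router's real code: the `State` enum, the `Namenode` class, and the `invoke` method are hypothetical stand-ins, an unreachable NameNode models the ConnectTimeoutException, and assigning `State.ACTIVE` models the call to `updateActiveNamenode`. The `standbyOnly` flag models the extra state check proposed in this PR.

```java
import java.util.ArrayList;
import java.util.List;

class ObserverFailoverSketch {
  // Hypothetical simplification of FederationNamenodeServiceState.
  enum State { ACTIVE, STANDBY, OBSERVER }

  // Hypothetical model of one NameNode as the Router sees it.
  static class Namenode {
    final String name;
    final boolean reachable;
    State state;

    Namenode(String name, State state, boolean reachable) {
      this.name = name;
      this.state = state;
      this.reachable = reachable;
    }
  }

  // Tries each NameNode in order and returns the name of the one that answered.
  // After a connection failure, the next NameNode that answers is recorded as ACTIVE;
  // with standbyOnly set, only a STANDBY NameNode is promoted.
  static String invoke(List<Namenode> namenodes, boolean standbyOnly) {
    boolean failover = false;
    for (Namenode nn : namenodes) {
      if (!nn.reachable) {
        failover = true; // models catching a ConnectTimeoutException
        continue;
      }
      if (failover && (!standbyOnly || State.STANDBY.equals(nn.state))) {
        nn.state = State.ACTIVE; // models updateActiveNamenode(nsId, address)
      }
      return nn.name;
    }
    return null; // no NameNode answered
  }

  static long countActive(List<Namenode> namenodes) {
    return namenodes.stream().filter(nn -> nn.state == State.ACTIVE).count();
  }

  // ob1 is unreachable, ob2 answers, nn1 is the real Active NameNode.
  static List<Namenode> cluster() {
    List<Namenode> nns = new ArrayList<>();
    nns.add(new Namenode("ob1", State.OBSERVER, false));
    nns.add(new Namenode("ob2", State.OBSERVER, true));
    nns.add(new Namenode("nn1", State.ACTIVE, true));
    return nns;
  }

  public static void main(String[] args) {
    List<Namenode> unguarded = cluster();
    invoke(unguarded, false);
    System.out.println("without check: " + countActive(unguarded) + " ACTIVE");

    List<Namenode> guarded = cluster();
    invoke(guarded, true);
    System.out.println("with check: " + countActive(guarded) + " ACTIVE");
  }
}
```

In this model the unguarded path leaves two NameNodes marked ACTIVE (ob2 and nn1), while the guarded path leaves only nn1.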

Member

@ayushtkn ayushtkn left a comment

RBF with observer reads isn't supported yet. I don't think we need to chase issues regarding Observer reads with RBF unless we finalise the solution for that

@@ -483,10 +483,11 @@ public Object invokeMethod(
         final Object proxy = client.getProxy();

         ret = invoke(nsId, 0, method, proxy, params);
-        if (failover) {
+        if (failover && namenode.getState().equals(FederationNamenodeServiceState.STANDBY)) {
Member

FederationNamenodeServiceState.STANDBY.equals(namenode.getState())

is usually safer: it cannot throw a NullPointerException if getState() returns null.
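
A minimal sketch of the difference between the two call orders when the state is null. The `State` enum here is a hypothetical stand-in for FederationNamenodeServiceState, and both helper methods are illustrative only.

```java
class NullSafeEqualsSketch {
  // Hypothetical stand-in for FederationNamenodeServiceState.
  enum State { ACTIVE, STANDBY, OBSERVER }

  // Constant on the left: a null argument simply yields false.
  static boolean isStandbyConstantFirst(State s) {
    return State.STANDBY.equals(s);
  }

  // Value on the left: a null value throws NullPointerException.
  static boolean isStandbyReceiverFirst(State s) {
    return s.equals(State.STANDBY);
  }

  public static void main(String[] args) {
    System.out.println(isStandbyConstantFirst(null)); // prints "false"
    try {
      isStandbyReceiverFirst(null);
    } catch (NullPointerException e) {
      System.out.println("NullPointerException with the value on the left");
    }
  }
}
```

For enum values, `==` is also null-safe and would behave the same as the constant-first form.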

           // Success on alternate server, update
           InetSocketAddress address = client.getAddress();
           namenodeResolver.updateActiveNamenode(nsId, address);
+          LOG.info("Update ActiveNameNode,nsId = {},rpcAddress = {}.", nsId, rpcAddress);
Member

Fix spaces in the log

@goiri goiri changed the title HDFS-16699:Router Update Observer NameNode state to Active when failo… HDFS-16699: RBF: Update Observer NameNode state to Active when failover Aug 2, 2022
@SanQiMax
Author

SanQiMax commented Aug 2, 2022

RBF with observer reads isn't supported yet. I don't think we need to chase issues regarding Observer reads with RBF unless we finalise the solution for that

OK, I got it. Thanks very much.

@xinglin
Contributor

xinglin commented Oct 13, 2022

If I am not wrong, reads from Observer NameNodes in RBF have been added in HDFS-16767.
