HA deployment of OM does not work on Kubernetes when it listens on an external IP #5032
-
cc @szetszwo
-
@mladjan-gadzic Please take a look.
-
@sokui I know you have quite a bit of experience with Ozone on k8s. Have you run into this problem?
-
Hi @GeorgeJahad, I have set up Ozone in k8s with Kerberos and HA enabled. I noticed that I do not have this line in my config. I am not sure whether it is the root cause in your case.
-
Hi @sokui, then inside the container you can execute
-
Hi @dantalian-pv, I tried to use
-
One thing you should be aware of is that in k8s, when a pod has just started, it takes some time for it to be listed behind a service. This means the FQDN
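One way to cope with a pod not yet being listed behind its service is to wait until the peer's hostname resolves before starting anything that depends on it. The following is a minimal sketch, not something taken from Ozone: `wait_for_dns` is a hypothetical helper, and the `resolve`, `sleep`, and `clock` parameters exist only so that the retry logic can be exercised without a real cluster.

```python
import socket
import time


def wait_for_dns(host, timeout=60.0, interval=2.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep,
                 clock=time.monotonic):
    """Poll until `host` resolves or `timeout` seconds have passed.

    Returns True once resolution succeeds, False if the deadline is
    reached first. `resolve`, `sleep`, and `clock` are injectable for
    testing (hypothetical parameters, not part of any Ozone tooling).
    """
    deadline = clock() + timeout
    while True:
        try:
            resolve(host, None)
            return True
        except socket.gaierror:
            # Name not published yet: give up at the deadline, else retry.
            if clock() >= deadline:
                return False
            sleep(interval)
```

A startup script could call something like `wait_for_dns("ozz-ozone-om-1.<headless-service>.test-namespace.svc.cluster.local")` before launching OM; the pod and namespace names follow the ones used in this thread, while the service part is a placeholder that depends on the actual deployment.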
-
Hi @sokui, I think I have multiple issues at the same time, although one may be the cause of the other.

First, I did some further research and noticed that when an RPC server starts listening on port

It looks like it hangs while trying to connect to the other nodes; because none of them reply via RPC, they all hang forever at this point. You can see this because neither the UI nor the other listeners have started, as shown in my first message (please see the top of this thread).

However, I think my issue comes from the RPC server in the container listening on the external IP address instead of the wildcard address. Because of the specific configuration of my Kubernetes cluster (which I cannot change), Kubernetes cannot redirect traffic to the RPC server inside the container.

So I am asking whether it is possible to configure the OM RPC server to listen on the wildcard address while still using the remote IP addresses of the other instances, configured for HA, to connect to them.

For comparison, here is a part of the ConfigMap for HA of SCM and the output of
bash-4.2$ ss -nltp
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:9876 *:* users:(("java",pid=7,fd=215))
LISTEN 0 256 *:9860 *:* users:(("java",pid=7,fd=202))
LISTEN 0 256 *:9861 *:* users:(("java",pid=7,fd=181))
LISTEN 0 4096 *:9894 *:* users:(("java",pid=7,fd=214))
LISTEN 0 4096 *:9895 *:* users:(("java",pid=7,fd=216))
LISTEN 0 256 *:9863 *:* users:(("java",pid=7,fd=192))
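To compare listeners across pods without reading the table by eye, the `ss -nltp` output above can be classified mechanically. This is a small sketch under the assumption that the output has the column layout shown here (local address in the fourth column); `parse_listeners` and `WILDCARD_HOSTS` are hypothetical names introduced for illustration.

```python
# Hosts that mean "listening on all interfaces" in ss output.
WILDCARD_HOSTS = {"*", "0.0.0.0", "::", "[::]"}


def parse_listeners(ss_output):
    """Return (host, port, is_wildcard) for each LISTEN row of `ss -nlt`.

    Assumes the local address is the fourth whitespace-separated field,
    as in the output quoted in this thread. Prompt and header lines are
    skipped because they do not start with LISTEN.
    """
    listeners = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        # Split on the last colon so IPv6 hosts such as [::] stay intact.
        host, _, port = fields[3].rpartition(":")
        listeners.append((host, int(port), host in WILDCARD_HOSTS))
    return listeners
```

Any row whose third element is `False` is a listener bound to one specific address, which is the situation described for the OM RPC port when HA is enabled.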
-
It took me some time to find the right combination of
Here is my current Kubernetes deployment with ConfigMap and Services:
With HA deployment
With the current deployment I have 2 different problems:
Even though at the
$> kubectl -n test-namespace port-forward pod/ozz-ozone-om-0 9862:rpc
Forwarding from 127.0.0.1:9862 -> 9862
Handling connection for 9862
E0731 11:06:20.134283 10843 portforward.go:407] an error occurred forwarding 9862 -> 9862: error forwarding port 9862 to pod 61a897a96023a3fa7592281207c88aab896bf9c29ae9f666e8a72f5af24cf05e, uid : exit status 1: 2023/07/31 09:06:20 socat[1328519] E connect(5, AF=2 127.0.0.1:9862, 16): Connection refused
E0731 11:06:20.134817 10843 portforward.go:233] lost connection to pod
And on the client side:
$> ./hdfs dfs -fs ofs://localhost:9862/ -ls -R /
2023-07-31 11:06:22,146 INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From localhost.localdomain/127.0.0.1 to localhost:9862 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy13.submitRequest over nodeId=null,nodeAddress=localhost:9862 after 1 failover attempts. Trying to failover after sleeping for 4000ms. Current retry count: 1.
Port forwarding is necessary for me; otherwise I have no way to upload files to the remote FS. And at the same time there is no issue to
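The `socat` error above shows the forwarded connection being made to 127.0.0.1:9862 inside the pod. The sketch below illustrates why that is refused when a server is bound to one specific address rather than the wildcard. It is only an illustration: it assumes Linux, where any 127.0.0.0/8 address can be bound, and uses 127.0.0.2 as a stand-in for the pod's external IP. The helper names are hypothetical.

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def loopback_reachable(bind_host):
    """Listen on `bind_host` (OS-chosen port) and try to reach it via 127.0.0.1."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        srv.bind((bind_host, 0))
        srv.listen(1)
        port = srv.getsockname()[1]
        return can_connect("127.0.0.1", port)
    finally:
        srv.close()


if __name__ == "__main__":
    # Wildcard bind: reachable through loopback, as port-forward needs.
    print("0.0.0.0  ->", loopback_reachable("0.0.0.0"))
    # Specific-address bind (stand-in for the pod IP): loopback is refused.
    print("127.0.0.2 ->", loopback_reachable("127.0.0.2"))
```

If the OM RPC server behaves like the second case, a connection to 127.0.0.1 inside the pod is refused even though the process is up and listening.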
-
By following the instructions on the documentation page https://ozone.apache.org/docs/1.3.0/feature/om-ha.html
I am trying to set up an HA deployment of Apache Ozone, including replication for OM.
A deployment with a single OM node works well, but a problem appears once ratis and replication are enabled.
When a single OM node is deployed, OM takes ozone.om.address = 0.0.0.0:9862 into account and listens on the wildcard IP. Kubernetes is then able to redirect traffic from the headless service to the application running in the Pod's container.
Command executed inside the container:
However, when ratis is enabled and service.ids are defined, OM takes the external IP as its listen address on bootstrap, and because of that Kubernetes can no longer redirect traffic through the headless service to the application in the Pod's container.
Command executed inside the container:
Is it possible to configure OM so that it always listens on the wildcard address but uses the configured node addresses to find the other replication nodes?
P.S. The Kubernetes configuration and logs are attached:
om_1.zip