rfc: Network Chaos on Kubernetes Service #36

Open · wants to merge 1 commit into base: main

Conversation

dengruilong

No description provided.

Signed-off-by: dengruilong <dengruilong@zuoyebang.com>
@STRRL
Member

STRRL commented Jan 24, 2022

rendered markdown

- It is reasonably clear how the feature would be implemented.
- Corner cases are dissected by example.
- How the feature is used. -->
I do not yet have a detailed design for this feature, but my rough proposal is this: move the network chaos experiment from Pods to Nodes, pass each Pod's info (such as IP, namespace, appname, etc.) through to the Nodes, and add tc/iptables rules on the Nodes according to the specified namespace/appname.
Member

I think we could apply tc/iptables rules in the node's network namespace for Service NetworkChaos.

The Linux network namespace is the basic mechanism for keeping isolation and controlling the blast radius, so I would prefer to keep using the Pod's network namespace for Pod NetworkChaos.

Member

This approach sounds great for ClusterIP and in-cluster network traffic. Another concern is that the network path for a LoadBalancer service might not be exactly the same as for a ClusterIP.

I have no idea how cloud providers treat the LoadBalancer service. I would be very glad if you could explain more about that, @dengruilong. ❤️

Member

@STRRL As we discussed in the last weekly meeting, the Service should only be allowed as the destination of the connection (e.g. the target selector for direction `to`). The network path doesn't matter, because from the point of view of the source pod, it only sees the target IP, no matter how the traffic reaches the other end.

The only task for us is to get the IP of the Service. For LoadBalancer, it seems that `.status.loadBalancer.ingress[].ip` and the port are reliable, and we could also add `.spec.clusterIP` (if it is not nil).
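A minimal sketch of that IP collection, under stated assumptions: the `Service` and `LoadBalancerIngress` types below are simplified, hypothetical stand-ins for the real Kubernetes API types, and `serviceTargetIPs` is an illustrative name, not an existing Chaos Mesh function.

```go
package main

import "fmt"

// LoadBalancerIngress is a hypothetical stand-in for one entry of
// .status.loadBalancer.ingress; real code would use k8s.io/api/core/v1.
type LoadBalancerIngress struct {
	IP       string
	Hostname string
}

// Service is a hypothetical, flattened view of the fields discussed above.
type Service struct {
	ClusterIP string                // mirrors .spec.clusterIP
	Ingress   []LoadBalancerIngress // mirrors .status.loadBalancer.ingress
}

// serviceTargetIPs collects the IPs a source pod might dial to reach the Service.
func serviceTargetIPs(svc Service) []string {
	var ips []string
	// Headless Services report "None" and have no virtual IP to match.
	if svc.ClusterIP != "" && svc.ClusterIP != "None" {
		ips = append(ips, svc.ClusterIP)
	}
	for _, ing := range svc.Ingress {
		// An ingress entry may carry only a hostname; skip entries without an IP.
		if ing.IP != "" {
			ips = append(ips, ing.IP)
		}
	}
	return ips
}

func main() {
	svc := Service{
		ClusterIP: "10.96.0.10",
		Ingress:   []LoadBalancerIngress{{IP: "203.0.113.7"}, {Hostname: "lb.example.com"}},
	}
	fmt.Println(serviceTargetIPs(svc))
}
```

Hostname-only ingress entries are skipped here; resolving them would be a separate design decision.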

Another question: I wonder whether service discovery works for the LoadBalancer service, and which IP it will return.

Member

STRRL, Jan 24, 2022

> @STRRL As we have discussed on the last weekly meeting, the Service should only be allowed for the destination of the connection (e.g. the target selector for direction `to`).

That could be the first stage for the feature "NetworkChaos on Kubernetes Service".

I wonder if we could do much more to inject chaos into the "Kubernetes Service", beyond network traffic inside the Kubernetes cluster. For example, an "external service" (not located in the Kubernetes cluster) that accesses this LoadBalancer service could also be affected by Chaos Mesh NetworkChaos.

Member

> not only network traffic in the Kubernetes cluster

It seems we could do that if we modify the tc/iptables rules on the Node.

Member

STRRL, Jan 24, 2022

> Another question, I wonder whether the service discovering works for the LoadBalancer service and which IP it will return.

> "Normal" (not headless) Services are assigned a DNS A or AAAA record, depending on the IP family of the service, for a name of the form my-svc.my-namespace.svc.cluster-domain.example. This resolves to the cluster IP of the Service.

As the documentation says, only the ClusterIP is mentioned, and it does behave that way (on my local minikube cluster). I did not dig into the code of the CoreDNS kubernetes plugin, so I am not sure whether there is any "magic" in it for resolving the external IP. :P
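As a small sketch of the name form quoted above: the helper `serviceDNSName` is hypothetical, and the cluster domain `cluster.local` is an assumption (a common default, but configurable per cluster).

```go
package main

import (
	"fmt"
	"strings"
)

// serviceDNSName builds <service>.<namespace>.svc.<cluster-domain>.
// The cluster domain is passed in because it differs between clusters.
func serviceDNSName(name, namespace, clusterDomain string) string {
	return strings.Join([]string{name, namespace, "svc", clusterDomain}, ".")
}

func main() {
	host := serviceDNSName("my-svc", "my-namespace", "cluster.local")
	fmt.Println(host)
	// Inside a cluster, net.LookupHost(host) could be used to see which
	// addresses the cluster DNS actually returns for this name.
}
```

Running the lookup from a pod would be one way to check which IP is returned for a LoadBalancer service on a given cluster.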

Author

> This way for ClusterIP and in-cluster network traffic sounds great. Another concerning thing is the networking path for a LoadBalancer service and ClusterIP might be not exactly the same.
>
> I have no idea about how cloud providers treat the LoadBalancer service. I am very glad if you could provide more introduction about that @dengruilong . ❤️

We didn't use a cloud provider's LoadBalancer, because it usually depends on provider-specific differences and is hard to turn into a universal solution. Instead, we used Ingress for north-south communication and Services for east-west communication (in fact, the Service is backed by IPVS, which performs well at the kernel level and is widely used in the K8s community).

I hope the information above helps address your concerns. Many thanks ❤️

Author

Any update on the service injection design? @STRRL

Member

STRRL, Feb 17, 2022

No new progress yet 🤔

This feature is not in chaos-mesh/chaos-mesh#2608. Maybe it needs more time to consider.

Member

If you are interested in helping us with the design and implementation, a PR is welcome! ❤️

STRRL changed the title from "submit rfc for ChaosMesh" to "rfc: Network Chaos on Kubernetes Service" on Jan 24, 2022
@STRRL
Member

STRRL commented Jan 24, 2022

Hi @dengruilong, I noticed that the "detailed design" part is not described in enough detail. I think we could help you enrich this part.

Would you mind us editing this RFC directly in the future?

@dengruilong
Author

dengruilong commented Jan 25, 2022

@STRRL Glad to hear that. It would be great if you could help enrich the detailed design.

@zicheqingluo

Chaos Mesh is a better K8s fault injection tool than other products, but with the trend toward multi-cloud and hybrid cloud, a fault injection tool confined to the K8s cluster cannot meet business disaster-tolerance needs. Its fault injection capability needs to surpass the competition for it to become a great fault injection product.
This is an excellent proposal, and it opened my mind.

@yxxhero

yxxhero commented May 13, 2022

We need this feature very much.

@andyblog

Nice, very nice.

@YangKeao
Member

@STRRL Another (rough) thought: would ingress network chaos work like a charm in this situation 🤔? If the NetworkChaos could be bidirectional, the traffic through the Service would also be matched on ingress.

(Though, I agree supporting service as the target is much more straightforward and intuitive.)

@zicheqingluo

zicheqingluo commented Oct 11, 2022 via email

6 participants