[WIP, READY FOR REVIEW] Integration of policies with services and the Internet access#609
Merged
brecode merged 44 commits into contiv:master on Mar 1, 2018
Conversation
added 30 commits on February 13, 2018 09:32
Pods should be able to access Kubernetes services (e.g. DNS) even if they are isolated from the kube-system namespace by the installed K8s network policies. The opposite direction, however, is not exempted: a policy may disallow a kube-system pod from contacting a pod in another namespace.
This commit implements source NATing for all traffic leaving the cluster network, which in effect opens up Internet access for all pods. The SNAT was included in the Service plugin in order to keep all the NAT-related configuration in one place. The solution is to add the IP address of the default-GW interface into the pool of VPP/NAT44 addresses and to enable post-routing on that interface. Traffic going between cluster nodes must not be NATed, otherwise the ACLs of the destination node would no longer match against pod IPs, but rather against node IPs, which breaks the semantics. External traffic can be separated from internal traffic only with the assistance of VXLANs, therefore SNAT is not supported and gets disabled in the L2-only mode.
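A minimal Go sketch of this decision logic, using hypothetical types (NAT44Config and EnableSNAT are illustrative placeholders, not the actual Contiv/VPP or vpp-agent API):

```go
package main

import (
	"fmt"
	"net"
)

// NAT44Config is a hypothetical, simplified model of the VPP/NAT44
// configuration managed by the Service plugin.
type NAT44Config struct {
	AddressPool    []net.IP // addresses used for source NATing
	PostRoutingIfs []string // interfaces with post-routing SNAT enabled
}

// EnableSNAT illustrates the approach described above: the IP address of
// the default-GW interface is added to the NAT44 address pool and
// post-routing is enabled on that interface. SNAT is skipped entirely in
// the L2-only mode, where inter-node traffic cannot be distinguished from
// traffic heading to the Internet.
func EnableSNAT(cfg *NAT44Config, gwIfName string, gwIfIP net.IP, vxlanEnabled bool) error {
	if !vxlanEnabled {
		// Without VXLANs, inter-node traffic would be NATed as well,
		// breaking policy matching on the destination node.
		return fmt.Errorf("SNAT is not supported in the L2-only mode")
	}
	cfg.AddressPool = append(cfg.AddressPool, gwIfIP)
	cfg.PostRoutingIfs = append(cfg.PostRoutingIfs, gwIfName)
	return nil
}

func main() {
	cfg := &NAT44Config{}
	if err := EnableSNAT(cfg, "GigabitEthernet0/8/0", net.ParseIP("192.168.1.10"), true); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("NAT44 pool: %v, post-routing on: %v\n", cfg.AddressPool, cfg.PostRoutingIfs)
}
```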
RendererCache combines the capabilities of the VPPTCP and ACL caches
under a unified interface.
The rules are grouped into tables (ContivRuleTable type) and the
configuration is represented as a list of local tables, applied
on the ingress or the egress side of pods, and a single global table,
applied on the interfaces connecting the node with the rest
of the cluster.
The list of local tables is minimalistic in the sense that pods with
the same set of rules will share the same local table. Whether shared
tables are installed in one instance or as separate copies for each
associated pod is up to the renderer (usually determined by
the capabilities of the destination network stack).
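One way such sharing can be realized is to key local tables by a deterministic ID computed from the ordered rule set, so that pods whose policies render to identical rules map to the same table. A hedged sketch (ruleSetID is a hypothetical helper, not the cache's actual implementation):

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// ruleSetID derives a deterministic identifier from an ordered list of
// rendered rules (represented simply as strings here). Pods whose policies
// render to the same ordered rule set obtain the same ID and can therefore
// share one local table.
func ruleSetID(rules []string) string {
	h := sha1.New()
	for _, r := range rules {
		h.Write([]byte(r))
		h.Write([]byte{0}) // separator to avoid ambiguous concatenation
	}
	return fmt.Sprintf("%x", h.Sum(nil))[:10]
}

func main() {
	web := []string{"permit tcp any -> 10.1.1.0/24:80", "deny any"}
	db := []string{"permit tcp any -> 10.1.1.0/24:80", "deny any"}
	fmt.Println(ruleSetID(web) == ruleSetID(db)) // true: one shared table
}
```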
All tables match only one side of the traffic - either ingress or
egress, depending on the cache orientation as selected in the Init method.
The cache combines the received ingress and egress Contiv rules
into the single chosen direction in a way that maintains the original
semantic (the global table is introduced to accomplish the task).
The rules are ordered in tables such that if rule *r1* matches a subset
of the traffic matched by *r2*, then r1 precedes r2 in the list.
This is the order in which the rules should be applied by the rule
matching algorithm in the destination network stack (otherwise the
more specific rules could be overshadowed and never matched).
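A minimal Go sketch of one ordering that satisfies this requirement for nested CIDR-based rules - sorting by combined prefix length, so that more specific rules come first (ContivRule is trimmed down here; the real type also carries the action, protocol and ports):

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// ContivRule is a hypothetical, trimmed-down rule.
type ContivRule struct {
	Name    string
	SrcNet  *net.IPNet
	DestNet *net.IPNet
}

// prefixLen sums the prefix lengths of both networks; for nested CIDRs,
// a rule matching a subset of another rule's traffic always has the
// greater (or equal) sum.
func prefixLen(r *ContivRule) int {
	s, _ := r.SrcNet.Mask.Size()
	d, _ := r.DestNet.Mask.Size()
	return s + d
}

// orderRules sorts rules so that more specific rules (longer prefixes)
// come first, preventing them from being overshadowed by a first-match
// algorithm in the destination network stack.
func orderRules(rules []*ContivRule) {
	sort.SliceStable(rules, func(i, j int) bool {
		return prefixLen(rules[i]) > prefixLen(rules[j])
	})
}

func main() {
	_, podNet, _ := net.ParseCIDR("10.1.1.3/32")
	_, anyNet, _ := net.ParseCIDR("0.0.0.0/0")
	rules := []*ContivRule{
		{Name: "deny-all", SrcNet: anyNet, DestNet: anyNet},
		{Name: "allow-pod", SrcNet: podNet, DestNet: anyNet},
	}
	orderRules(rules)
	for _, r := range rules {
		fmt.Println(r.Name) // allow-pod first, deny-all last
	}
}
```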
Two types of tables are distinguished:
1. Local table: should be applied to match against traffic leaving
(IngressOrientation) or entering (EgressOrientation)
a selected subset of pods.
Every pod has at most one local table installed at
any given time. For a given local table, the set
of rules is immutable. Different content is treated
as a new local table (and the original table may
get unassigned from some or all originally
associated pods).
A local table always has at least one rule; otherwise
it is simply not tracked or returned by the cache.
2. Global table: should be applied to match against traffic entering
(IngressOrientation) or leaving (EgressOrientation)
the node. There is always exactly one global table
installed (per node).
The global table may contain an empty set of rules
(meaning ALLOW-ALL). A simplified sketch of both table
types follows below.
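In Go terms, the two table types could be sketched roughly as follows (a hypothetical, simplified rendition; the real ContivRuleTable carries more state):

```go
package renderer

// ContivRule is a stub standing in for the real rule type (action,
// IP networks, protocol and port ranges are omitted here).
type ContivRule struct{}

// TableType distinguishes the two kinds of tables tracked by the cache.
type TableType int

const (
	// Local tables match traffic leaving (IngressOrientation) or
	// entering (EgressOrientation) a selected subset of pods.
	Local TableType = iota
	// Global tables match traffic entering (IngressOrientation) or
	// leaving (EgressOrientation) the node; exactly one is installed
	// per node.
	Global
)

// ContivRuleTable groups an ordered, immutable set of rules.
type ContivRuleTable struct {
	Type  TableType
	Rules []*ContivRule       // never empty for local tables; an empty
	                          // global table means ALLOW-ALL
	Pods  map[string]struct{} // pods sharing this local table
}
```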
The Update() method is still to-be-done.
In Resync we are not able to *easily* fully reconstruct the policy configuration, most notably the IP addresses of pods. For pods that no longer exist after the resync, the IP address should not be needed anyway, therefore it can be nil.
added 13 commits on February 23, 2018 17:26
…nstalled." This reverts commit 2049644. Based on https://github.com/ahmetb/kubernetes-network-policy-recipes/blob/master/11-deny-egress-traffic-from-an-application.md it is clear that policies should apply to the kube-system namespace just like to any other namespace.
Member
LGTM
brecode approved these changes on Mar 1, 2018
WORK IN PROGRESS: PLEASE DO NOT MERGE YET
TO-BE-DONE:
This pull request primarily includes a refactor of the policy rendering code, which was necessary to adapt to the limitations of the VPP/NAT plugin. For policies we always need to evaluate rules against the original local IP addresses, not the NATed addresses of services or of the node itself. Inbound ACLs, however, see traffic only after the NAT translation, which makes them unusable for this purpose. Previously we were using both directions, but now we combine ingress with egress and install all rules into outbound ACLs. Furthermore, to apply access control to the inter-node and pod-to-Internet traffic, we need to reflect the ingress policies into a "global" ACL, installed on the node's output interfaces, also on the outbound side.
A detailed algorithm description + diagrams depicting the order of VPP nodes will be part of the documentation.
Similar restrictions are also present in the VPPTCP stack - each pod has only a single "local" table of rules assigned (evaluated in the ingress direction), and the stack additionally provides a single "global" table, evaluated in the ingress direction for traffic entering the node.
The equivalent limitations of the VPPTCP stack and of ACL+NAT (just a different orientation of the tables) have allowed us to unify the cache and the rendering algorithm to a large degree between the two renderers. That is the second contribution of this pull request.
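To illustrate, the shared cache could expose the orientation choice roughly like this (a hedged sketch; RendererCache here is a trimmed-down stand-in for the actual type, though Init, IngressOrientation and EgressOrientation are named in the commit messages above):

```go
package renderer

// Orientation selects the single direction into which the cache combines
// the received ingress and egress Contiv rules.
type Orientation int

const (
	// IngressOrientation: local tables match traffic leaving pods,
	// the global table matches traffic entering the node (VPPTCP).
	IngressOrientation Orientation = iota
	// EgressOrientation: local tables match traffic entering pods,
	// the global table matches traffic leaving the node (ACL+NAT).
	EgressOrientation
)

// RendererCache is a hypothetical, trimmed-down view of the unified cache
// shared by both renderers; only the orientation-related part is shown.
type RendererCache struct {
	orientation Orientation
}

// Init selects the orientation; from this point on all tables produced
// by the cache match only the chosen side of the traffic.
func (rc *RendererCache) Init(orientation Orientation) {
	rc.orientation = orientation
}
```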
The third contribution is in the service plugin: the plugin now also installs SNAT configuration which allows Internet access from pods. The SNAT is configured on the physical interface which acts as the default GW (host-VPP interconnect is not supported for Internet access). The implementation is DHCP-aware.
The only issue is that we cannot SNAT inter-node traffic, otherwise policies on the destination node would be evaluated against the NATed address of the source node and not against the source pod. The solution is to split inter-node traffic from pod-to-Internet traffic. This is possible with VXLANs (inter-node traffic is encapsulated, whereas pod-to-Internet traffic is not), or by having an additional physical interface which acts as the default GW. With VXLANs disabled and only one physical interface available, SNAT therefore gets disabled (and needs to be performed by an external NAT device). This is a limitation for which we don't have a workaround at the moment.
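A conceptual Go sketch of the decision (purely illustrative - in VPP the separation is actually achieved by the VXLAN encapsulation itself, not by a per-packet lookup like this):

```go
package main

import (
	"fmt"
	"net"
)

// shouldSNAT illustrates the split described above: traffic destined for
// another node or pod of the cluster must keep its original source address
// (so that the destination node's policies still match pod IPs), while
// everything else is source-NATed on its way out. clusterNets stands for
// the pod and node subnets of the cluster.
func shouldSNAT(dst net.IP, clusterNets []*net.IPNet, vxlanEnabled bool) bool {
	if !vxlanEnabled {
		// L2-only mode: internal and external traffic cannot be told
		// apart on the output interface, hence SNAT stays disabled.
		return false
	}
	for _, n := range clusterNets {
		if n.Contains(dst) {
			return false // inter-node / pod-to-pod: leave untouched
		}
	}
	return true // pod-to-Internet: SNAT
}

func main() {
	_, podCIDR, _ := net.ParseCIDR("10.1.0.0/16")
	_, nodeCIDR, _ := net.ParseCIDR("192.168.16.0/24")
	nets := []*net.IPNet{podCIDR, nodeCIDR}
	fmt.Println(shouldSNAT(net.ParseIP("10.1.2.3"), nets, true)) // false
	fmt.Println(shouldSNAT(net.ParseIP("8.8.8.8"), nets, true))  // true
}
```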
Both policies and services (+SNAT) have resync fully implemented, i.e. restart scenarios are supported.