// Copyright 2018 Authors of Cilium
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// Package fqdn handles some of the DNS-based policy functions:
// - A DNS lookup cache used to populate toFQDNs rules in the policy layer.
// - A NameManager that coordinates distributing IPs to matching toFQDNs
// selectors.
// - A DNS Proxy that applies L7 DNS rules and populates the lookup cache with
// IPs from allowed/successful DNS lookups.
// - (deprecated) A DNS Poller that actively polls all L3 toFQDNs.MatchName
// entries and populates the DNS lookup cache.
//
// Note: Two distinct requests are handled here: the DNS lookup itself and the
// later connection to the domain resolved by that lookup.
//
// Proxy redirection and L3 policy calculations are handled by the datapath and
// policy layer, respectively.
//
// DNS data is tracked per-endpoint but collected globally in each cilium-agent
// when calculating policy. This differs from toEndpoints rules, which use
// cluster-global information, and toCIDR rules, which use static information
// in the policy. toServices rules are similar but they are cluster-global and
// have no TTL nor a distinct lookup request from the endpoint. Furthermore,
// toFQDNs cannot handle in-cluster IPs but toServices can.
//
// +-------------+ +----------------+ +---------+ +---------+
// | | | | | | | |
// | +<--+ NameManager +<-------+ | | |
// | | | | Update | | | |
// | Policy | +-------+--------+ Trigger| DNS | | |
// | Selectors | ^ | Proxy +<--->+ Network |
// | | | | | | |
// | | +-------+--------+ | | | |
// | | | DNS | | | | |
// | | | Lookup Cache +<-------+ | | |
// +------+------+ | | DNS +----+----+ +----+----+
// | +----------------+ Data ^ ^
// v | |
// +------+------+--------------------+ | |
// | | | | |
// | Datapath | | | |
// | | | DNS Lookup| |
// +-------------+ +<------------+ |
// | | |
// | Pod | |
// | | HTTP etc. |
// | +<----------------------------+
// | |
// +----------------------------------+
//
// === L7 DNS ===
// L7 DNS is handled by the DNS Proxy. The proxy is always running within
// cilium-agent but traffic is only redirected to it when an L7 rule includes a
// DNS section such as:
//
// - toEndpoints:
// toPorts:
// - ports:
// - port: "53"
// protocol: ANY
// rules:
// dns:
// - matchPattern: "*"
// - matchName: "cilium.io"
//
// These redirects are implemented by the datapath and the management logic is
// shared with other proxies in cilium (envoy and kafka). L7 DNS rules can
// apply to an endpoint from various policies and, if any allow a request, it
// will be forwarded to the original target of the DNS packet. This target is
// typically the nameserver configured in /etc/resolv.conf for the pod, which
// k8s sets automatically
// (https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-config).
// In the example above `matchPattern: "*"` allows all requests and makes
// `matchName: "cilium.io"` redundant.
// Notes:
// - The forwarded requests are sent from cilium-agent on the host interface
// and not from the endpoint.
// - Users must explicitly allow `*.*.svc.cluster.local.` in k8s clusters.
// This is not automatic.
// - L7 DNS rules are egress-only.
// - The proxy emits L7 cilium-monitor events: one for the request, an
// accept/reject event, and the final response.
//
// Apart from allowing or denying DNS requests, the DNS proxy is used to
// observe DNS lookups in order to then allow L3 connections with the response
// information. These connections must be separately allowed with toFQDNs L3
// rules. The example above is a common "visibility" policy that allows all
// requests but ensures that they traverse the proxy. This information is then
// placed in the per-Endpoint and global DNS lookup caches and propagates from
// there.
//
// === L3 DNS ===
// L3 DNS rules control L3 connections and not the DNS requests themselves.
// They rely on DNS lookup cache information, which must be populated by the
// DNS Proxy via an L7 DNS rule.
//
// - toFQDNs:
// - matchName: "my-remote-service.com"
// - matchPattern: "bucket.*.my-remote-service.com"
//
// IPs seen in a DNS response (i.e. the request was allowed by an L7 policy)
// that are also selected in a DNS L3 rule matchPattern or matchName have a /32
// or /128 CIDR identity created. This occurs when they are first passed to the
// toFQDN selectors from NameManager. These identities are not special in any
// way and can overlap with toCIDR rules in policies. They are placed in the
// node-local ipcache and in the policy map of each endpoint that is allowed to
// connect to them (i.e. defined in the L3 DNS rule).
// Notes:
// - Generally speaking, toFQDNs can only handle non-cluster IPs. In-cluster
// policy should use toEndpoints and toServices. This is partly historical, but
// also due to ipcache limitations when mapping IP->identity: Endpoint
// identities can clobber the FQDN IP identity.
// - Despite being tracked per-Endpoint, DNS lookup IPs are collected into a
// global cache. This is historical and can be changed.
// The original implementation created policy documents in the policy
// repository to represent the IPs being allowed and could not distinguish
// between endpoints. The current implementation uses selectors that also do
// not distinguish between Endpoints. There is some provision for this,
// however, and it just requires better plumbing in how we place data in the
// Endpoint's datapath.
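//
// As an illustration only (the actual matching code in cilium differs in
// detail, e.g. in which characters "*" may match), the selection step can be
// sketched as converting each matchName/matchPattern into an anchored regular
// expression and testing the name seen in an allowed DNS response against it:
//
//     import (
//         "regexp"
//         "strings"
//     )
//
//     // patternToRegexp converts a toFQDNs matchPattern into an anchored
//     // regexp, with "*" standing for any run of DNS name characters.
//     // A matchName is simply a pattern without "*".
//     func patternToRegexp(pattern string) *regexp.Regexp {
//         p := regexp.QuoteMeta(strings.ToLower(strings.TrimSuffix(pattern, ".")))
//         p = strings.ReplaceAll(p, `\*`, `[-a-z0-9.]*`)
//         return regexp.MustCompile("^" + p + "$")
//     }
//
//     // patternToRegexp("bucket.*.my-remote-service.com").
//     //     MatchString("bucket.eu-west-1.my-remote-service.com") // true
//     // patternToRegexp("bucket.*.my-remote-service.com").
//     //     MatchString("other.my-remote-service.com")            // false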
//
// === Caching, Long-Lived Connections & Garbage Collection ===
// DNS requests are distinct traffic from the connections that pods make with
// the response information. This makes it difficult to correlate one DNS
// lookup to a later connection; a pod may reuse the IPs in a DNS response an
// arbitrary time after the lookup occurred, even past the DNS TTL. The
// solution is multi-layered for historical reasons:
// - Keep a per-Endpoint cache that can be stored to disk and restored on
// startup. These caches apply TTL expiration and limit the IP count per domain.
// - Keep a global cache to combine all this DNS information and send it to the
// policy system. This cache applies TTL but not per-domain limits.
// This causes a DNS lookup in one endpoint to leak to another!
// - Track live connections allowed by DNS policy and delay expiring that data
// while the connection is open. If the policy itself is removed, however, the
// connection is interrupted.
//
// The same DNSCache type is used in all cases. DNSCache instances remain
// consistent regardless of the order in which updates are applied, and merging
// multiple caches is equivalent to applying their constituent updates
// individually. As a result, all DNS data is inserted into a single global
// cache from which the policy layer receives information. This is historical;
// per-Endpoint handling can be added. The data is internally tracked per IP
// because overlapping DNS responses may have different TTLs for IPs that
// appear in both.
// Notes:
// - The default configurable minimum TTL in the caches is 1 hour. This is
// mostly for identity stability, as short TTLs would cause more identity
// churn. This is mostly historical, as CIDR identities now have near-zero
// allocation overhead.
// - Deletes from a DNSCache currently occur only when the cilium API clears
// the cache or when the garbage collector evicts entries.
// - The combination of per-Endpoint and global caches must manage disparate
// behaviours of pods. The worst-case scenario is one where one pod makes many
// requests to a target with changing IPs (like S3) but another makes few
// requests that are long-lived. We need to ensure "fairness" where one does
// not starve the other. The limits in the per-Endpoint caches allow this, and
// the global cache acts as a collector across different Endpoints (without
// restrictions).
//
// Expiration of DNS data is handled by the dns-garbage-collector-job controller.
// Historically, the only expiration was TTL based and the per-Endpoint and
// global caches would expire data at the same time without added logic.
// This no longer holds when per-host IP limits are applied in the cache; these
// default to 50 IPs for a given domain, per Endpoint. To account for such
// evictions, the controller handles both TTL and IP-limit evictions, ensuring
// that the global cache is consistent with the per-Endpoint caches. The result
// is that the actual expiration is imprecise (TTL especially). The caches mark
// to-evict data internally and only do so on GC method calls from the
// controller.
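//
// As a simplified sketch only (all names here are hypothetical; the real
// DNSCache applies the same ideas with more detail), the per-Endpoint and
// global caches can be thought of as a per-IP expiry map with an optional
// per-domain limit and a GC step driven by the controller:
//
//     import (
//         "net/netip"
//         "time"
//     )
//
//     type toyCache struct {
//         perDomainLimit int                                 // 0 for the global cache: no cap
//         expires        map[string]map[netip.Addr]time.Time // name -> IP -> expiry
//     }
//
//     func newToyCache(perDomainLimit int) *toyCache {
//         return &toyCache{
//             perDomainLimit: perDomainLimit,
//             expires:        map[string]map[netip.Addr]time.Time{},
//         }
//     }
//
//     // Update records a lookup result. The caller is assumed to have already
//     // raised ttl to the configured minimum TTL. Tracking expiry per IP keeps
//     // merged caches consistent regardless of update order: the latest expiry
//     // always wins.
//     func (c *toyCache) Update(now time.Time, name string, ips []netip.Addr, ttl time.Duration) {
//         m := c.expires[name]
//         if m == nil {
//             m = map[netip.Addr]time.Time{}
//             c.expires[name] = m
//         }
//         for _, ip := range ips {
//             if e := now.Add(ttl); e.After(m[ip]) {
//                 m[ip] = e
//             }
//         }
//         // Per-domain IP limit (per-Endpoint caches only): evict the IP that
//         // expires soonest. These evictions are why the controller described
//         // above must also reconcile the global cache.
//         for c.perDomainLimit > 0 && len(m) > c.perDomainLimit {
//             var victim netip.Addr
//             var soonest time.Time
//             for ip, e := range m {
//                 if soonest.IsZero() || e.Before(soonest) {
//                     victim, soonest = ip, e
//                 }
//             }
//             delete(m, victim)
//         }
//     }
//
//     // GC drops expired entries; in this sketch it stands in for the eviction
//     // work done on behalf of the dns-garbage-collector-job controller.
//     func (c *toyCache) GC(now time.Time) {
//         for name, m := range c.expires {
//             for ip, e := range m {
//                 if !e.After(now) {
//                     delete(m, ip)
//                 }
//             }
//             if len(m) == 0 {
//                 delete(c.expires, name)
//             }
//         }
//     }
//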
// When DNS data is evicted from any per-Endpoint cache, for any reason, each
// IP is retained as a "zombie" in type fqdn.DNSZombieMapping. These "zombies"
// represent IPs that were previously associated with a resolved DNS name, but
// the DNS name is no longer known (for example because of TTL expiry).
// However, there may still be an active connection associated with the zombie
// IP.
// Externally, related options use the term "deferred connection delete".
// Zombies are tracked per IP for the endpoint they come from (with a default
// limit of 10000 set by defaults.ToFQDNsMaxDeferredConnectionDeletes). When
// the Connection Tracking garbage collector runs, it marks any zombie IP that
// correlates to a live connection by that endpoint as "alive". At the next
// iteration of the dns-garbage-collector-job controller, the not-live zombies
// are finally evicted. These IPs are then no longer placed into the
// global cache on behalf of this endpoint. Other endpoints may have live DNS
// TTLs or connections to the same IPs, however, so these IPs may be inserted
// into the global cache for the same domain or a different one (or both).
// Note: The CT GC has a variable run period. This ranges from 30s to 12 hours
// and is shorter when more connection churn is observed (the constants are
// ConntrackGCMinInterval, ConntrackGCMaxInterval and ConntrackGCMaxLRUInterval
// in package defaults).
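//
// The zombie lifecycle can be sketched in the same spirit (hypothetical names;
// the real type is fqdn.DNSZombieMapping and its handling differs in detail):
//
//     import "net/netip"
//
//     type zombie struct {
//         ip    netip.Addr
//         names []string // names that previously resolved to this IP
//         alive bool     // set by the CT GC when a live connection is seen
//     }
//
//     // On eviction from a per-Endpoint lookup cache, the IP becomes a zombie
//     // instead of being forgotten immediately ("deferred connection delete").
//     func evict(zombies map[netip.Addr]*zombie, ip netip.Addr, names []string) {
//         zombies[ip] = &zombie{ip: ip, names: names}
//     }
//
//     // The Connection Tracking GC marks zombies that still correlate with a
//     // live connection from this endpoint.
//     func markAlive(zombies map[netip.Addr]*zombie, liveIPs []netip.Addr) {
//         for _, ip := range liveIPs {
//             if z, ok := zombies[ip]; ok {
//                 z.alive = true
//             }
//         }
//     }
//
//     // The next dns-garbage-collector-job iteration evicts zombies that were
//     // not marked alive; in this sketch, survivors must be re-marked before
//     // the following run.
//     func gcZombies(zombies map[netip.Addr]*zombie) {
//         for ip, z := range zombies {
//             if !z.alive {
//                 delete(zombies, ip)
//             } else {
//                 z.alive = false
//             }
//         }
//     }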
//
// === Flow of DNS data ===
//
// +---------------------+
// | DNS Proxy |
// +----------+----------+
// |
// v
// +----------+----------+
// | per-EP Lookup Cache |
// +----------+----------+
// |
// v
// +----------+----------+
// | per-EP Zombie Cache |
// +----------+----------+
// |
// v
// +----------+----------+
// | Global DNS Cache |
// +----------+----------+
// |
// v
// +----------+----------+
// | NameManager |
// +----------+----------+
// |
// v
// +----------+----------+
// | Policy toFQDNs |
// | Selectors |
// +----------+----------+
// |
// v
// +----------+----------+
// | per-EP Datapath |
// +---------------------+
//
package fqdn