CI: upgrade test reveals corrupted userspace memory #5070

Closed
ianvernon opened this Issue Aug 1, 2018 · 18 comments

@ianvernon
Contributor

ianvernon commented Aug 1, 2018

Build link: https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/1340/testReport/junit/k8s-1/7/K8sUpdates_Updating_Cilium_stable_to_master/

Output:

Stacktrace

/home/jenkins/workspace/cilium-ginkgo_cilium_master-QVKDPVJ2A6E272IRUX7X6DCB5FBLCQDKRTOMHPZS7ZECX3RM5C3A/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:376
Expected
    <*errors.errorString | 0xc420e5ef20>: {
        s: "Timeout reached: timed out waiting for pods with filter -l zgroup=testapp to be ready",
    }
to be nil
/home/jenkins/workspace/cilium-ginkgo_cilium_master-QVKDPVJ2A6E272IRUX7X6DCB5FBLCQDKRTOMHPZS7ZECX3RM5C3A/src/github.com/cilium/cilium/test/k8sT/Updates.go:115
Standard Output

Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Cilium pods: [cilium-h61tx cilium-wrlx4]
Netpols loaded: 
CiliumNetworkPolicies loaded: 
Endpoint Policy Enforcement:
     cilium-health-k8s1                 =>   none
     cilium-health-k8s2                 =>   none
     kube-dns-2552250787-0k64t          =>   none
     prometheus-core-3506541099-6s1lr   =>   none
⚠️  Cilium agent "cilium-h61tx": Status: Ok  Health: Ok Nodes "k8s1 k8s2" ContinerRuntime: Ok Kubernetes: Ok KVstore: Ok Controllers: Total 25 Failed 25
Failed controllers:
 controller sync-IPv4-identity-mapping (44771) failure ""
controller cilium-health-ep failure ""
controller sync-IPv4-identity-mapping (3135) failure ""
controller resolve-identity-3135 failure ""
controller sync-identity-to-k8s-pod (3135) failure ""
controller sync-IPv4-identity-mapping (19270) failure ""
controller lxcmap-bpf-host-sync failure ""
controller resolve-identity-44771 failure ""
controller ipcache-bpf-garbage-collection failure ""
controller sync-identity-to-k8s-pod (44771) failure ""
controller sync-policymap-19270 failure ""
controller resolve-identity-19270 failure ""
controller sync-IPv6-identity-mapping (3135) failure ""
controller sync-to-k8s-ciliumendpoint (3135) failure ""
controller sync-to-k8s-ciliumendpoint (44771) failure ""
controller kvstore-etcd-session-renew failure ""
controller metricsmap-bpf-prom-sync failure ""
controller sync-policymap-44771 failure ""
controller sync-identity-to-k8s-pod (19270) failure ""
controller kvstore-lease-keepalive failure ""
controller sync-to-k8s-ciliumendpoint (19270) failure ""
controller sync-to-k8s-ciliumendpoint-gc (k8s2) failure ""
controller sync-IPv6-identity-mapping (19270) failure ""
controller sync-IPv6-identity-mapping (44771) failure ""
controller sync-policymap-3135 failure ""
⚠️  Cilium agent "cilium-wrlx4": Status: Ok  Health: Ok Nodes "k8s1 k8s2" ContinerRuntime: Ok Kubernetes: Ok KVstore: Ok Controllers: Total 18 Failed 18
Failed controllers:
 controller sync-IPv4-identity-mapping (44771) failure ""
controller cilium-health-ep failure ""
controller sync-IPv4-identity-mapping (3135) failure ""
controller resolve-identity-3135 failure ""
controller sync-identity-to-k8s-pod (3135) failure ""
controller sync-IPv4-identity-mapping (19270) failure ""
controller lxcmap-bpf-host-sync failure ""
controller resolve-identity-44771 failure ""
controller ipcache-bpf-garbage-collection failure ""
controller sync-identity-to-k8s-pod (44771) failure ""
controller sync-policymap-19270 failure ""
controller resolve-identity-19270 failure ""
controller sync-IPv6-identity-mapping (3135) failure ""
controller sync-to-k8s-ciliumendpoint (3135) failure ""
controller sync-to-k8s-ciliumendpoint (44771) failure ""
controller kvstore-etcd-session-renew failure ""
controller metricsmap-bpf-prom-sync failure ""
controller sync-policymap-44771 failure ""
controller sync-identity-to-k8s-pod (19270) failure ""
controller kvstore-lease-keepalive failure ""
controller sync-to-k8s-ciliumendpoint (19270) failure ""
controller sync-to-k8s-ciliumendpoint-gc (k8s2) failure ""
controller sync-IPv6-identity-mapping (19270) failure ""
controller sync-IPv6-identity-mapping (44771) failure ""
controller sync-policymap-3135 failure ""
controller sync-to-k8s-ciliumendpoint (40724) failure ""
controller sync-policymap-31680 failure ""
controller sync-identity-to-k8s-pod (40724) failure ""
controller sync-IPv4-identity-mapping (31680) failure ""
controller resolve-identity-40724 failure ""
controller kvstore-etcd-session-renew failure ""
controller sync-identity-to-k8s-pod (31680) failure ""
controller sync-IPv6-identity-mapping (40724) failure ""
controller sync-policymap-40724 failure ""
controller kvstore-lease-keepalive failure ""
controller sync-to-k8s-ciliumendpoint (31680) failure ""
controller lxcmap-bpf-host-sync failure ""
controller sync-IPv6-identity-mapping (31680) failure ""
controller sync-IPv4-identity-mapping (40724) failure ""
controller metricsmap-bpf-prom-sync failure ""
controller sync-to-k8s-ciliumendpoint-gc (k8s1) failure ""
controller ipcache-bpf-garbage-collection failure ""
controller cilium-health-ep failure ""

Standard Error

STEP: Checking that installed image is "cilium/cilium:v1.1.1"
STEP: Installing kube-dns
STEP: Creating some endpoints and L7 policy
=== Test Finished at 2018-07-31T10:36:17Z====
===================== TEST FAILED =====================
cmd: kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                               READY     STATUS             RESTARTS   AGE       IP              NODE
default       app1-1879020049-6jbz8              0/1       CrashLoopBackOff   5          5m        <none>          k8s1
default       app1-1879020049-9rxz6              0/1       CrashLoopBackOff   5          5m        <none>          k8s1
default       app2-2809035625-fkcb0              0/1       CrashLoopBackOff   5          5m        <none>          k8s1
default       app3-753781860-cccgg               0/1       CrashLoopBackOff   5          5m        <none>          k8s1
kube-system   cilium-h61tx                       1/1       Running            0          5m        192.168.36.12   k8s2
kube-system   cilium-wrlx4                       1/1       Running            0          5m        192.168.36.11   k8s1
kube-system   etcd-k8s1                          1/1       Running            0          27m       192.168.36.11   k8s1
kube-system   kube-apiserver-k8s1                1/1       Running            0          27m       192.168.36.11   k8s1
kube-system   kube-controller-manager-k8s1       1/1       Running            0          28m       192.168.36.11   k8s1
kube-system   kube-dns-2552250787-0k64t          3/3       Running            0          5m        10.10.1.190     k8s2
kube-system   kube-proxy-1hh2v                   1/1       Running            0          28m       192.168.36.11   k8s1
kube-system   kube-proxy-f28r3                   1/1       Running            0          21m       192.168.36.12   k8s2
kube-system   kube-scheduler-k8s1                1/1       Running            0          28m       192.168.36.11   k8s1
prometheus    prometheus-core-3506541099-6s1lr   1/1       Running            0          21m       10.10.0.24      k8s1

cmd: kubectl exec -n kube-system cilium-h61tx -- cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                        IPv6                   IPv4          STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                                          
3135       Disabled           Disabled          4          reserved:health                                    f00d::a0a:100:0:c3f    10.10.1.52    ready   
19270      Disabled           Disabled          3402       k8s:io.cilium.k8s.policy.serviceaccount=kube-dns   f00d::a0a:100:0:4b46   10.10.1.190   ready   
                                                           k8s:io.kubernetes.pod.namespace=kube-system                                                     
                                                           k8s:k8s-app=kube-dns                                                                            

cmd: kubectl exec -n kube-system cilium-wrlx4 -- cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                              IPv6                 IPv4         STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                                             
31680      Disabled           Disabled          33870      k8s:app=prometheus                                       f00d::a0a:0:0:7bc0   10.10.0.24   ready   
                                                           k8s:component=core                                                                                 
                                                           k8s:io.kubernetes.pod.namespace=prometheus                                                         
                                                           k8s:��������m.k8s.policy.serviceaccount=prometheus-k8s                                             
40724      Disabled           Disabled          4          reserved:health                                          f00d::a0a:0:0:9f14   10.10.0.54   ready   

Note that the labels for one of the pods contain "��������" - this looks suspiciously like memory corruption.

Logs:
395fcae9_K8sUpdates_Updating_Cilium_stable_to_master.zip

@ianvernon


Contributor

ianvernon commented Aug 1, 2018

Note that the app pods are in CrashLoopBackOff, per pods_status.txt in the above logs archive:

Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath			Type		Reason			Message
  ---------	--------	-----	----			-------------			--------	------			-------
  5m		5m		1	default-scheduler					Normal		Scheduled		Successfully assigned app2-2809035625-fkcb0 to k8s1
  5m		5m		1	kubelet, k8s1						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "app2-account-token-g1wv6" 
  5m		5m		1	kubelet, k8s1		spec.containers{app-frontend}	Warning		Failed			Error: failed to start container "app-frontend": Error response from daemon: {"message":"cannot join network of a non running container: 7ef20977854990da7179fc0539b9b0c2061be8314a19301f2c82c7092038f6f8"}
  4m		4m		1	kubelet, k8s1		spec.containers{app-frontend}	Warning		Failed			Error: failed to start container "app-frontend": Error response from daemon: {"message":"cannot join network of a non running container: 12361b8f1d72267da1756928f7a930a7a2a9a7a7b4675239d3fff39f0011fdbd"}
  4m		4m		1	kubelet, k8s1		spec.containers{app-frontend}	Warning		Failed			Error: failed to start container "app-frontend": Error response from daemon: {"message":"cannot join network of a non running container: 743398eeae6d8303f0a27227a1ec3d8acd70020378d2091414105b5753e26ca1"}
  3m		3m		1	kubelet, k8s1		spec.containers{app-frontend}	Warning		Failed			Error: failed to start container "app-frontend": Error response from daemon: {"message":"cannot join network of a non running container: 9a9f316e05c6bfb2f0140b0c8c5e3204f8778e3e03639d13b7511fd8a3573b5c"}
  3m		3m		1	kubelet, k8s1		spec.containers{app-frontend}	Warning		Failed			Error: failed to start container "app-frontend": Error response from daemon: {"message":"cannot join network of a non running container: b819e979ad36d4f9ff5d2e3dec2f7c522f1a77227c1fbea0657b4c4003911a13"}
  5m		1m		6	kubelet, k8s1		spec.containers{app-frontend}	Normal		Pulling			pulling image "docker.io/cilium/demo-client"
  1m		1m		1	kubelet, k8s1		spec.containers{app-frontend}	Warning		Failed			Error: failed to start container "app-frontend": Error response from daemon: {"message":"cannot join network of a non running container: 930b195519aeed97f9ff723abc7e23a0ba2428657b2f0aa69c5edc6b369687b5"}
  5m		1m		6	kubelet, k8s1		spec.containers{app-frontend}	Normal		Pulled			Successfully pulled image "docker.io/cilium/demo-client"
  5m		1m		6	kubelet, k8s1		spec.containers{app-frontend}	Normal		Created			Created container
  5m		1m		32	kubelet, k8s1						Normal		SandboxChanged		Pod sandbox changed, it will be killed and re-created.
  4m		12s		14	kubelet, k8s1		spec.containers{app-frontend}	Warning		BackOff			Back-off restarting failed container
  5m		12s		53	kubelet, k8s1						Warning		FailedSync		Error syncing pod
@ianvernon


Contributor

ianvernon commented Aug 1, 2018

cilium-wrlx4 is the Cilium pod on the same node (k8s1) as the app pods, but Cilium isn't managing them. The logs for cilium-wrlx4 are riddled with messages like the following:

2018-07-31T10:32:30.611639134Z level=debug msg="Waiting for endpoint representing container to appear" containerID=fcbf41b206 maxRetry=20 retry=8 subsys=workload-watcher willRetry=true
2018-07-31T10:32:30.6593194Z level=debug msg="Queueing container event" containerID=ea6991ee87 event=stop subsys=workload-watcher
2018-07-31T10:32:30.783370255Z level=debug msg="Unable to inspect container after container create event" containerID=a3df933a2e error="unable to inspect container 'a3df933a2e9fe5b8e4b52e407af819f2bd11b603900c999698bdd6baa3f6ddca': Error: No such container: a3df933a2e9fe5b8e4b52e407af819f2bd11b603900c999698bdd6baa3f6ddca" maxRetry=20 retry=5 subsys=workload-watcher willRetry=true

@ianvernon ianvernon added this to Needs triage in CI Failures via automation Aug 1, 2018

@ianvernon


Contributor

ianvernon commented Aug 1, 2018

The logs for the aforementioned Cilium pod show that the label is indeed corrupted:

2018-07-31T10:31:00.97784824Z level=debug msg="Evaluating context for source PolicyID" containerID=1864f0935d ctx="{0 0 <nil> [k8s:app=prometheus k8s:component=core k8s:\xda\x02\x00\x00\xda\x02\x00\x00m.k8s.policy.serviceaccount=prometheus-k8s k8s:io.kubernetes.pod.namespace=prometheus] [k8s:app=prometheus k8s:component=core k8s:\xda\x02\x00\x00\xda\x02\x00\x00m.k8s.policy.serviceaccount=prometheus-k8s k8s:io.kubernetes.pod.namespace=prometheus] []}" endpointID=31680 ipv4=10.10.0.24 ipv6="f00d::a0a:0:0:7bc0" k8sPodName=":" policyID=33870 policyRevision=0
@ianvernon


Contributor

ianvernon commented Aug 1, 2018

Actions needed:

  • Figure out why Cilium wasn't picking up the app pods and managing their networking
  • Figure out why the label for the prometheus service account appears to have been corrupted

@ianvernon ianvernon added priority/high and removed needs/triage labels Aug 1, 2018

@ianvernon


Contributor

ianvernon commented Aug 1, 2018

@rlenglet and I noted while pair debugging that we dump and parse the BPF LB maps using Go's unsafe pointers. We may be using these pointers improperly and corrupting memory.
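
To make the suspicion concrete, below is a minimal sketch of the kind of unsafe.Pointer conversion a map dump callback performs, and the size check that would catch a mismatch between the buffer the kernel filled and the struct userspace expects. This is not Cilium's actual lbmap code; the struct layout, field names, and values are hypothetical.

package main

import (
    "fmt"
    "unsafe"
)

// serviceValue is a hypothetical userspace mirror of a BPF LB map value;
// it is not Cilium's real lbmap layout. If the pinned map was created
// with a different value size, the buffer handed to the parse callback is
// a different length than this struct, and an unchecked cast reads (or
// lets later code write) adjacent memory.
type serviceValue struct {
    TargetIP   [4]byte
    TargetPort uint16
    RevNAT     uint16
    Weight     uint16
}

// parseValue reinterprets raw bytes from a map dump as a serviceValue,
// guarding the unsafe cast with an explicit size check.
func parseValue(buf []byte) (*serviceValue, error) {
    if uintptr(len(buf)) < unsafe.Sizeof(serviceValue{}) {
        return nil, fmt.Errorf("value too short: got %d bytes, need %d",
            len(buf), unsafe.Sizeof(serviceValue{}))
    }
    return (*serviceValue)(unsafe.Pointer(&buf[0])), nil
}

func main() {
    // 10 bytes laid out like serviceValue (values are made up).
    raw := []byte{10, 10, 0, 24, 0x82, 0x23, 0x02, 0x00, 0x00, 0x00}
    if v, err := parseValue(raw); err != nil {
        fmt.Println("refusing to parse:", err)
    } else {
        fmt.Printf("backend %v port %d revnat %d weight %d\n",
            v.TargetIP, v.TargetPort, v.RevNAT, v.Weight)
    }
}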

We see that right before we try to allocate an identity for the prometheus pod, we sync the BPF LB maps, which involves dumping them:

2018-07-31T10:31:00.963130361Z level=info msg="Syncing BPF LBMaps with daemon's LB maps..."
2018-07-31T10:31:00.96314726Z level=debug msg="parsing service mapping" bpfMapKey="10.97.155.228:9090" bpfMapValue="10.10.0.24:9090 (2)"
2018-07-31T10:31:00.963150116Z level=debug msg="converting ServiceKey and ServiceValue to frontend and backend" obj="10.10.0.24:9090 (2)" serviceID="10.97.155.228:9090"
2018-07-31T10:31:00.963152342Z level=debug msg="creating L3n4Addr for ServiceKey" serviceID="10.97.155.228:9090"
2018-07-31T10:31:00.963154324Z level=debug msg="created new L3n4Addr" ipAddr="{10.97.155.228 {TCP 9090}}"
2018-07-31T10:31:00.963156173Z level=debug msg="created new LBBackend" backend="{{10.10.0.24 {TCP 9090}} 0}"
2018-07-31T10:31:00.963158008Z level=debug msg="adding frontend and backend to SVCMap" backend="10.10.0.24:9090, weight: 0" backendIndex=1 frontend="10.97.155.228:9090"
2018-07-31T10:31:00.963165186Z level=debug msg="parsing service mapping" bpfMapKey="10.96.0.1:443" bpfMapValue="192.168.36.11:6443 (9)"

The next log shows the allocated identity for the endpoint. The labels are intact:

2018-07-31T10:31:00.963167409Z level=debug msg="Resolved identity" identity=33870 identityLabels="k8s:app=prometheus,k8s:component=core,k8s:io.cilium.k8s.policy.serviceaccount=prometheus-k8s,k8s:io.kubernetes.pod.namespace=prometheus" isNew=false

We perform more LBMap dumping / parsing-related operations after this:

2018-07-31T10:31:00.96399917Z level=debug msg="converting ServiceKey and ServiceValue to frontend and backend" obj="192.168.36.11:6443 (9)" serviceID="10.96.0.1:443"
2018-07-31T10:31:00.964001489Z level=debug msg="creating L3n4Addr for ServiceKey" serviceID="10.96.0.1:443"
2018-07-31T10:31:00.964003427Z level=debug msg="created new L3n4Addr" ipAddr="{10.96.0.1 {TCP 443}}"
2018-07-31T10:31:00.964007122Z level=debug msg="created new LBBackend" backend="{{192.168.36.11 {TCP 6443}} 0}"
2018-07-31T10:31:00.964009144Z level=debug msg="adding frontend and backend to SVCMap" backend="192.168.36.11:6443, weight: 0" backendIndex=1 frontend="10.96.0.1:443"
2018-07-31T10:31:00.964011176Z level=debug msg="parsing BPF revNAT mapping" bpfMapKey=2 bpfMapValue="10.97.155.228:9090"
2018-07-31T10:31:00.964013051Z level=debug msg="created new L3n4Addr" ipAddr="{10.97.155.228 {TCP 9090}}"
2018-07-31T10:31:00.964014926Z level=debug msg="parsing BPF revNAT mapping" bpfMapKey=9 bpfMapValue="10.96.0.1:443"
2018-07-31T10:31:00.964016814Z level=debug msg="created new L3n4Addr" ipAddr="{10.96.0.1 {TCP 443}}"
2018-07-31T10:31:00.964018668Z level=debug msg="iterating over services read from BPF LB Map and seeing if they have the same ID set in the KV store"
2018-07-31T10:31:00.964726467Z level=debug msg="Resolving service" l3n4Addr="{IP:10.97.155.228 L4Addr:{Protocol:TCP Port:9090}}"
2018-07-31T10:31:00.964733592Z level=debug msg="Updated controller" name="sync-identity-to-k8s-pod (31680)" subsys=controller uuid=d1c787ab-94ac-11e8-855e-08002707f0ef
2018-07-31T10:31:00.964735967Z level=debug msg="Updated controller" name="sync-IPv4-identity-mapping (31680)" subsys=controller uuid=d1c8214b-94ac-11e8-855e-08002707f0ef
2018-07-31T10:31:00.964738038Z level=debug msg="Updated controller" name="sync-IPv6-identity-mapping (31680)" subsys=controller uuid=d1c8224b-94ac-11e8-855e-08002707f0ef

The immediately following log message shows that the labels for the endpoint are corrupted:

2018-07-31T10:31:00.964740114Z level=info msg="Identity of endpoint changed" containerID=1864f0935d endpointID=31680 identity=33870 identityLabels="k8s:app=prometheus,k8s:component=core,k8s:io.kubernetes.pod.namespace=prometheus,k8s:\xda\x02\x00\x00\xda\x02\x00\x00m.k8s.policy.serviceaccount=prometheus-k8s" ipv4=10.10.0.24 ipv6="f00d::a0a:0:0:7bc0" k8sPodName=":" oldIdentity=33870 policyRevision=0
@ianvernon


Contributor

ianvernon commented Aug 1, 2018

@rlenglet and I also checked whether there was a mismatch between the BPF map definitions in the BPF code and in the cilium-agent userspace code, but could not find any for the IPv4 / IPv6 RevNat or LB maps.

This may require another glance to be sure...

@ianvernon


Contributor

ianvernon commented Aug 1, 2018

@joestringer faced a similar issue here:
#4886 (comment)

@ianvernon


Contributor

ianvernon commented Aug 1, 2018

\xda\x02\x00\x00\xda\x02\x00\x00 decodes to 730 (reading \xda\x02\x00\x00 as a little-endian 32-bit integer) or 55810 (reading \xda\x02 big-endian), but we couldn't find anything using those numbers in the logs :(
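
For reference, the two readings come from interpreting the same bytes with different widths and endianness; a quick check in Go:

package main

import (
    "encoding/binary"
    "fmt"
)

func main() {
    b := []byte{0xda, 0x02, 0x00, 0x00}
    fmt.Println(binary.LittleEndian.Uint32(b))  // 730
    fmt.Println(binary.BigEndian.Uint16(b[:2])) // 55810
}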

@ianvernon


Contributor

ianvernon commented Aug 1, 2018

The log message "iterating over services read from BPF LB Map and seeing if they have the same ID set in the KV store" occurs right after all of the maps are dumped:

if !option.Config.IPv4Disabled {
    // lbmap.RRSeq4Map is updated as part of Service4Map and does
    // not need separate dump.
    if err := lbmap.Service4Map.DumpWithCallback(parseSVCEntries); err != nil {
        log.WithError(err).Warn("error dumping Service4Map")
    }
    if err := lbmap.RevNat4Map.DumpWithCallback(parseRevNATEntries); err != nil {
        log.WithError(err).Warn("error dumping RevNat4Map")
    }
}
// lbmap.RRSeq6Map is updated as part of Service6Map and does not need
// separate dump.
if err := lbmap.Service6Map.DumpWithCallback(parseSVCEntries); err != nil {
    log.WithError(err).Warn("error dumping Service6Map")
}
if err := lbmap.RevNat6Map.DumpWithCallback(parseRevNATEntries); err != nil {
    log.WithError(err).Warn("error dumping RevNat6Map")
}
// Need to do this outside of parseSVCEntries to avoid deadlock, because we
// are modifying the BPF maps, and calling Dump on a Map RLocks the maps.
log.Debug("iterating over services read from BPF LB Map and seeing if they have the same ID set in the KV store")

I am highly suspicious that the Dump function is corrupting memory in userspace.

@joestringer


Contributor

joestringer commented Aug 2, 2018

One point to keep in mind here is that the failure is during an "upgrade" test, where latest was installed and then 1.1.1 was installed over the top. This issue is observed after the downgrade, and is on Cilium 1.1.1 (git commit 6094010).

I was wondering if a map with keys/values defined in one version might somehow be dumped by a subsequent version, causing corruption via an invalid-sized dump. This should be prevented by the map upgrade logic, though, and it appears that the map upgrade logic should be triggered if necessary for the LB maps via OpenOrCreate(). FWIW, in the logs there is no evidence that a map upgrade happens for the LB map types (and upon manual inspection, it doesn't seem like the LB maps have changed structure at all between these versions).

Looking closer at the logs I see no evidence that the 1.1.1 version is performing map "upgrade" for the CT map, which has been changed between 1.1.1 and master. This is a bit concerning and could explain the corruption we see, via the CT map dump rather than the LB map dumps.
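
To illustrate the suggested mechanism, here is a toy simulation (not Cilium code; the sizes are invented): a BPF map lookup or dump fills the caller's buffer with the value_size recorded when the map was created, so if 1.1.1 reads a CT map left behind by master with a larger value, the copy runs past the smaller buffer that 1.1.1 allocated and clobbers whatever happens to sit next to it on the heap.

package main

import "fmt"

// Sizes are invented for illustration only.
const (
    oldValueSize = 16 // value size the 1.1.1 userspace struct expects
    newValueSize = 24 // value size of the CT map left behind by master
)

func main() {
    // Simulate a slice of the agent's heap: a dump buffer sized for the
    // old struct, immediately followed by bytes that belong to something
    // else entirely (say, the backing array of a label string).
    heap := make([]byte, oldValueSize+8)
    neighbour := heap[oldValueSize:]
    copy(neighbour, "k8s:io.c")

    // A lookup/dump on the pinned map copies the kernel's value_size
    // bytes, regardless of what size the caller believed the value to be.
    kernelValue := make([]byte, newValueSize)
    for i := range kernelValue {
        kernelValue[i] = 0xda
    }
    copy(heap, kernelValue) // silently overwrites the neighbouring bytes

    fmt.Printf("neighbouring bytes after dump: % x\n", neighbour)
}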

@joestringer


Contributor

joestringer commented Aug 2, 2018

We don't seem to print any debug logs around the CT dump, so I'm not sure we can tell when the CT dump occurs and how it might impact the timeline presented by @ianvernon above:

$ git grep log pkg/maps/ctmap/
pkg/maps/ctmap/ctmap.go:        "github.com/cilium/cilium/pkg/logging"
pkg/maps/ctmap/ctmap.go:        "github.com/cilium/cilium/pkg/logging/logfields"
pkg/maps/ctmap/ctmap.go:        log = logging.DefaultLogger.WithField(logfields.LogSubsys, "map-ct")
pkg/maps/ctmap/ctmap.go:                                log.WithError(err).Errorf("Unable to delete CT entry %s", currentKey.String())
pkg/maps/ctmap/ctmap.go:                log.WithError(err).WithField("interrupted", interrupted).Warning(
pkg/maps/ctmap/ctmap.go:                                log.WithError(err).Errorf("Unable to delete CT entry %s", currentKey.String())
pkg/maps/ctmap/ctmap.go:                log.WithError(err).WithField("interrupted", interrupted).Warning(
pkg/maps/ctmap/ctmap.go:                // FIXME: GH-3239 LRU logic is not handling timeouts gracefully enough

@ianvernon ianvernon changed the title from CI: upgrade test failed waiting for pods to be ready to CI: upgrade test reveals corrupted userspace memory Aug 2, 2018

@ianvernon


Contributor

ianvernon commented Aug 2, 2018

Looking closer at the logs I see no evidence that the 1.1.1 version is performing map "upgrade" for the CT map, which has been changed between 1.1.1 and master. This is a bit concerning and could explain the corruption we see, via the CT map dump rather than the LB map dumps.

Great point. Is it feasible for us to migrate the conntrack map?

We don't seem to print any debug logs for CT dump window so I'm not sure we could know when the ct dump is occurring

I'll add these logs and post a PR so we can have more data if this occurs again.

@tgraf


Member

tgraf commented Aug 2, 2018

The CT table is created exclusively from the kernel side, so there is no upgrade. We'll have to delete the table manually if we want to force a re-creation.

@ianvernon


Contributor

ianvernon commented Aug 2, 2018

Even if the CT table is created in the kernel, can't we migrate it nonetheless? I do not understand why the fact that the map is created in the kernel affects whether or not we can upgrade it.

@tgraf


Member

tgraf commented Aug 3, 2018

We can migrate it, but we have to rename it and re-create it before any endpoints are restored. We can do it, but we are not doing it right now, as far as I know.

@ianvernon


Contributor

ianvernon commented Aug 3, 2018

Since the map is still being modified by the datapath (i.e., new connections being added / removed), is there a concern that we may lose entries added while we migrate the data? Thanks for answering my questions - I am not too familiar with this area.

@joestringer


Contributor

joestringer commented Aug 3, 2018

Probably worth clarifying: inside Cilium right now, the deletion/recreation of maps is referred to as "upgrading" the map (properties). @ianvernon, is this what you mean, or do you mean actual migration of the full map, including its contents?

As an example, the map property upgrade logs for ipcache are present in this failure:

2018-07-31T10:30:59.677360299Z level=info msg="Map type mismatch for BPF map" file-path=/sys/fs/bpf/tc/globals/cilium_ipcache new="Longest prefix match trie" old=Hash
2018-07-31T10:30:59.677409228Z level=info msg="Flags mismatch for BPF map" file-path=/sys/fs/bpf/tc/globals/cilium_ipcache new=1 old=0
2018-07-31T10:30:59.677414691Z level=info msg="Removing map to allow for property upgrade (expect map data loss)" file-path=/sys/fs/bpf/tc/globals/cilium_ipcache

I expect that in this situation we should delete and recreate the CT map as well, but we do not appear to be doing so. I don't know why.
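
For reference, a sketch of what such a delete-and-recreate ("property upgrade") check could look like for the CT map. The MapInfo type, the getPinnedMapInfo helper, and all sizes here are hypothetical stand-ins; the real check would read the pinned map's info from the kernel.

package main

import (
    "fmt"
    "os"
)

// MapInfo holds the properties compared during the check; hypothetical and
// simplified relative to what the kernel reports for a BPF map.
type MapInfo struct {
    Type       uint32
    KeySize    uint32
    ValueSize  uint32
    MaxEntries uint32
    Flags      uint32
}

// getPinnedMapInfo stands in for reading the pinned map's info from the
// kernel; hard-coded here so the sketch runs anywhere.
func getPinnedMapInfo(path string) (MapInfo, error) {
    return MapInfo{Type: 9, KeySize: 40, ValueSize: 24, MaxEntries: 65536}, nil
}

// checkAndUpgradeMap mirrors the behaviour described above: if the pinned
// map's properties differ from what this daemon expects, unpin it so the
// next open recreates it with the new properties (expect map data loss).
func checkAndUpgradeMap(path string, want MapInfo) error {
    have, err := getPinnedMapInfo(path)
    if err != nil {
        return err
    }
    if have != want {
        fmt.Printf("Removing %s to allow for property upgrade (expect map data loss)\n", path)
        return os.Remove(path)
    }
    return nil
}

func main() {
    want := MapInfo{Type: 9, KeySize: 40, ValueSize: 56, MaxEntries: 65536}
    if err := checkAndUpgradeMap("/sys/fs/bpf/tc/globals/cilium_ct4_global", want); err != nil {
        fmt.Println("upgrade check failed:", err)
    }
}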

@tgraf


Member

tgraf commented Aug 5, 2018

Two reasons why we haven't deleted the CT map so far:

  1. We never broke compatibility until now, so there was no reason to cause connection breakage on upgrade.
  2. So far, the map has always been created from kernel space, so the existing logic in OpenOrCreate() does not work.

Fixing this for the global CT map is simple; we can:

  1. Rename the old global CT map if it exists and create a new one with the new format
  2. Restore all endpoints
  3. Delete the old map

Fixing this for the per-endpoint CT map requires the same logic as above to be applied for each restored endpoint that uses a per-endpoint CT map (a sketch of the global-map sequence is below).
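
A minimal sketch of that sequence for the global map; every helper here is a hypothetical stand-in that only prints what it would do (the real operations would act on the pinned BPF map), so this shows the ordering rather than an implementation:

package main

import "fmt"

// The helpers below are stand-ins for pinned-map operations (rename the
// pin, create a map with the new format, delete the old pin); they only
// print what they would do so the sketch runs anywhere.
func renamePinnedMap(from, to string) error { fmt.Println("rename", from, "->", to); return nil }
func createMap(path string) error           { fmt.Println("create new-format map at", path); return nil }
func restoreEndpoints() error               { fmt.Println("restore all endpoints against the new map"); return nil }
func deletePinnedMap(path string) error     { fmt.Println("delete", path); return nil }

// migrateGlobalCTMap follows the sequence proposed above for the global
// conntrack map: keep the old map alive under a temporary name while
// endpoints are restored against the new one, then drop it.
func migrateGlobalCTMap(path string) error {
    old := path + "_old"
    if err := renamePinnedMap(path, old); err != nil {
        return err
    }
    if err := createMap(path); err != nil {
        return err
    }
    if err := restoreEndpoints(); err != nil {
        return err
    }
    return deletePinnedMap(old)
}

func main() {
    if err := migrateGlobalCTMap("/sys/fs/bpf/tc/globals/cilium_ct4_global"); err != nil {
        fmt.Println("migration failed:", err)
    }
}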

@tgraf tgraf added this to the 1.2-bugfix milestone Aug 5, 2018

@tgraf tgraf added this to Committed in 1.2 Aug 6, 2018

joestringer added a commit to joestringer/cilium that referenced this issue Aug 7, 2018

daemon: Upgrade CT map properties on startup
If the CT map has properties that differ from the target properties as
configured in the daemon code (for example, different value size), then
remove that map during endpoint restore.

When the map is removed, any existing BPF program that refers to the map
should continue to operate as normal. When the endpoints being restored
are regenerated, they will cause automatic recreation of the new
conntrack map (if it doesn't yet exist), and begin to use the new
endpoint.

Note that when an endpoint switches from the old CT map to the new CT
map, as it would in the above case, existing connections may be
temporarily disrupted.

Once all endpoints have finished using the old CT map, the kernel will
automatically relinquish its resources.

Fixes: cilium#5070

Signed-off-by: Joe Stringer <joe@covalent.io>

joestringer added further commits to joestringer/cilium that referenced this issue on Aug 8, Aug 9, and Aug 10, 2018, all carrying the same "daemon: Upgrade CT map properties on startup" message as above.

@tgraf tgraf closed this in #5118 Aug 10, 2018

CI Failures automation moved this from Needs triage to Closed Aug 10, 2018

1.2 automation moved this from Committed to Done Aug 10, 2018

tgraf added a commit that referenced this issue Aug 10, 2018

daemon: Upgrade CT map properties on startup

Backport commits with the same message ([ upstream commit fa12caf ]) were added by tgraf between Aug 11 and Aug 14, 2018 and by borkmann on Aug 13, 2018.

joestringer added a commit to joestringer/cilium that referenced this issue Sep 10, 2018

docs: Document safe downgrade to Cilium 1.0.
Downgrading to Cilium 1.0 from Cilium 1.2 or later without these new
steps is unsafe due to issue cilium#5070.

PR cilium#5503 introduces a safe path: Clear the BPF maps during downgrade.

Document the new option for safe downgrade to Cilium 1.0.

Signed-off-by: Joe Stringer <joe@covalent.io>

tgraf added a commit that referenced this issue Sep 12, 2018

docs: Document safe downgrade to Cilium 1.0.

Backport commits with the same message ([ upstream commit 2c03853 ]) were added by ianvernon on Sep 12 and Sep 13, 2018 and by tgraf on Sep 14, 2018.