Commits on May 16, 2019
  1. bpf: force recreation of regular ct entry upon service collision

    borkmann authored and tgraf committed May 16, 2019
    If a service was reused, meaning that, e.g. upon upgrade, the CT
    map was kept but different services were created for the same IP/port
    pair, then we first detect it in the service lookup via lb{4,6}_local().
    The ct_lookup{4,6}() from there will return a connection in CT_{ESTABLISHED,
    REPLY,RELATED} where we find that the rev-nat is mismatching via
    state->rev_nat_index != svc_v2->rev_nat_index. This forces the selection
    of a new slave of the actual (current) service. We then update state with
    the new backend_id and rev_nat_index, and propagate this info back into
    the CT_SERVICE entry, so it becomes persistent for the next service lookup.
    After that, we look up the newly selected backend via lb{4,6}_lookup_backend()
    and fix up the tuple address and skb with the new backend IP/port. Thus,
    the CT_SERVICE entry in the CT is updated/fixed for subsequent lookups.
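
    As a rough sketch in the datapath C (ct_lookup4() and the state
    fields are as referenced above; lb4_select_slave() and
    ct_update4_state() are illustrative names rather than the exact
    helpers), the service-side handling looks like:

      ret = ct_lookup4(map, tuple, skb, l4_off, CT_SERVICE, state, &monitor);
      if (ret == CT_ESTABLISHED &&
          state->rev_nat_index != svc_v2->rev_nat_index) {
              /* Service got recreated for this IP/port pair: select a
               * new slave of the current service and persist the choice
               * in the CT_SERVICE entry for subsequent lookups.
               */
              state->rev_nat_index = svc_v2->rev_nat_index;
              state->backend_id = lb4_select_slave(skb, svc_v2);
              ct_update4_state(map, tuple, state);
      }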
    
    Back in bpf_lxc (e.g. handle_ipv4_from_lxc()), this ct_state info is
    stored in ct_state_new. After lb4_local() has done its work, we do the
    CT lookup via ct_lookup4() for the tuple with the backend IP/port
    included. If it's a new connection (CT_NEW), then ct_create4() will
    propagate the info from ct_state_new into the ct entry/value
    (entry.rev_nat_index). If there is a stale entry in the CT due to a
    similarly stale, previously used service IP, then we may find a
    CT_{ESTABLISHED,REPLY,RELATED} case. In the latter two, we use the
    ct_state from the normal connection that has been looked up to then do
    the lb4_rev_nat() iff there was a rev-nat index in the ct_state.
    Similarly, for CT_ESTABLISHED, we'll do the lb4_rev_nat() through a
    different code path. We'll get CT_ESTABLISHED for outgoing connections
    from the endpoint itself, and CT_REPLY for packets coming back in.
    While we updated the services, the regular CT entries would still
    reuse the stale rev-nat index, which may be wrong. Thus, detect for
    outgoing CT_ESTABLISHED packets whether we have a mismatch in
    rev_nat_index, and if so, purge the whole entry from the CT table and
    force recreation of a new one.
    
    The following cases can be valid: the rev-nat index changes as explained
    above, or the backend_id changes for the case where in lb4_local() the
    backend was removed from underneath us and we were forced to select a
    completely new one, so the rev-nat stays the same but the backend_id
    changes. In the latter case, the tuple got updated in lb4_local() for
    the later lookup in ct_lookup4(). If that hit a stale entry, then we
    would also have a rev-nat index mismatch and force a purge of the entry
    from the CT table, so this is covered as well. In the case where there
    is no service for the given tuple, ct_state_new will have a rev-nat of
    0, and so will the ct_state from the lookup, so there won't be a
    mismatch.
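
    As a sketch (ct_delete4() here is illustrative for the purge), the
    CT_ESTABLISHED handling in handle_ipv4_from_lxc() then becomes:

      ret = ct_lookup4(get_ct_map4(&tuple), &tuple, skb, l4_off,
                       CT_EGRESS, &ct_state, &monitor);
      if (ret == CT_ESTABLISHED &&
          unlikely(ct_state.rev_nat_index != ct_state_new.rev_nat_index)) {
              /* Stale entry from a prior service on the same tuple:
               * purge it and treat the packet as CT_NEW so that
               * ct_create4() recreates the entry with the current
               * rev-nat index.
               */
              ct_delete4(get_ct_map4(&tuple), &tuple, skb);
              ret = CT_NEW;
      }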
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. bpf: do propagate backend, and rev nat to new entry

    borkmann authored and tgraf committed May 15, 2019
    If we don't do it, we'd always hit this path, and the skb hash could
    be slightly different, which could then select a different backend
    again.
    
    Also fix a tiny race window where we re-look up a backend upon
    entry creation.
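
    Illustratively (a sketch of the propagation in ct_create4(), with
    field names following the ct_state usage above):

      /* Carry over what lb4_local() already selected instead of
       * re-looking up a backend upon entry creation.
       */
      entry.rev_nat_index = ct_state->rev_nat_index;
      entry.backend_id = ct_state->backend_id;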
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on May 9, 2019
  1. docs: fix various spelling issues in kata gsg

    borkmann committed May 9, 2019
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. bpf: use double word for v6 addr copy and comparison

    borkmann committed Apr 11, 2019
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Apr 26, 2019
  1. ginkgo: adjust timeout to something more appropriate

    borkmann committed Apr 26, 2019
    Looking at the Cilium 1.5 main CI runs under ...
    
      https://jenkins.cilium.io/view/Cilium-v1.5/job/cilium-v1.5-standard/
    
    ... the timeout of 1h 15min is way too short. Runs did finish successfully
    in the range of 1h 10min to 1h 13min, for example, but others got aborted
    at 1h 16min. Increase it to 110 min so that we hit aborts less frequently,
    only if something is really off.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. bpf: make services available for host applications

    borkmann authored and brb committed Apr 24, 2019
    This enables service load balancing for host applications such that
    they can connect from the host ns to the backends provided by the
    Cilium service load balancer. The operation happens at the cgroup
    connect hooks: we check whether the connect destination matches one
    of the service VIPs and, if so, select an underlying backend IP/port
    in order to wire it up directly.
    
    This has the advantage that we only pay the cost once at connect time
    instead of xlating every packet in the BPF data path at the host. The
    connect hooks are not device specific; they are available since the
    v4.18 kernel. connect4 and connect6 cover TCP as well as connected UDP,
    and sendmsg4 and sendmsg6 cover unconnected UDP. Only for unconnected
    UDP is there a per-packet cost, at the sendmsg layer.
    
    iproute2 has been modified such that for the cgroup hooks we can reuse
    the same ELF parser in order to reuse the existing LB maps. iproute2
    will then temporarily pin the program into bpf fs and we'll do the
    final attachment via bpftool. The attach semantics are that a previously
    loaded program will be overridden at the cgroup hook. In the future,
    both can be done natively out of the Cilium agent itself.
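
    Conceptually, the connect4 hook then does the following (a simplified
    sketch; lb4_lookup_service()/lb4_lookup_backend() follow the naming of
    the existing LB code, and select_slave() is an illustrative stand-in
    for the slave selection):

      __section("cgroup/connect4")
      int sock4_connect(struct bpf_sock_addr *ctx)
      {
              struct lb4_key key = {
                      .address = ctx->user_ip4,
                      .dport = ctx->user_port,
              };
              struct lb4_service *svc;
              struct lb4_backend *backend;

              svc = lb4_lookup_service(&key);
              if (!svc)
                      return SYS_PROCEED; /* destination is no service VIP */
              backend = lb4_lookup_backend(select_slave(svc));
              if (!backend)
                      return SYS_PROCEED;
              /* Rewrite the destination once at connect time; the data
               * path then needs no per-packet translation.
               */
              ctx->user_ip4 = backend->address;
              ctx->user_port = backend->port;
              return SYS_PROCEED;
      }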
    
    On top of the following iproute2 commit (to be upstreamed):
    
      cilium/iproute2@de3ae7e
    
    Testing / example:
    
      # daemon/cilium-agent --kvstore consul --kvstore-opt consul.address=127.0.0.1:8500 --tunnel=disabled --auto-direct-node-routes=true --enable-ipv4=true --enable-ipv6=true --disable-envoy-version-check=true --enable-host-reachable-services=true
    
      # bpftool cgroup show /var/run/cilium/cgroupv2/
      ID       AttachType      AttachFlags     Name
      37       connect4
      35       connect6
      36       sendmsg4
      34       sendmsg6
    
      In host ns, IPv4 test:
      ----------------------
    
      # curl -4 127.0.0.1
      curl: (7) Failed to connect to 127.0.0.1 port 80: Connection refused
    
      # curl -4 172.217.11.14
      <HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
      <TITLE>301 Moved</TITLE></HEAD><BODY>
      [...]
    
      # cilium service update --frontend 127.0.0.1:80 --backends 172.217.11.14:80 --id 1
      Creating new service with id '1'
      Added service with 1 backends
    
      # cilium service list
      ID   Frontend       Backend
      1    127.0.0.1:80   1 => 172.217.11.14:80
    
      # curl -4 127.0.0.1
      <HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
      <TITLE>301 Moved</TITLE></HEAD><BODY>
      [...]
    
      In host ns, IPv6 test:
      ----------------------
    
      # curl -6 [::1]:80
      curl: (7) Failed to connect to ::1 port 80: Connection refused
    
      # curl -6 [2607:f8b0:4006:812::200e]:80
      <!DOCTYPE html>
      <html lang=en>
      [...]
    
      # cilium service update --frontend  [::1]:80 --backends [2607:f8b0:4006:812::200e]:80 --id 2
      Creating new service with id '2'
      Added service with 1 backends
    
      # cilium service list
      ID   Frontend     Backend
      1    127.0.0.1:80   1 => 172.217.11.14:80
      2    [::1]:80       1 => [2607:f8b0:4006:812::200e]:80
    
      # curl -6 [::1]:80
      <!DOCTYPE html>
      <html lang=en>
      [...]
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. cilium: split cgroups handling into own package

    borkmann authored and brb committed Apr 25, 2019
    Split it off from sockops and move it into its own package since it's
    going to be used outside of sockops in subsequent commits.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  4. cilium: update container runtime image to include iproute2 changes

    borkmann authored and brb committed Apr 25, 2019
    Pull in latest iproute2's static-data branch which includes:
    
      cilium/iproute2@de3ae7e
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Apr 19, 2019
  1. docs: clarify kernel version for BPF based masquerading

    borkmann authored and ianvernon committed Apr 18, 2019
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Apr 18, 2019
  1. cilium, template: add cilium_encrypt_state to ignored prefixes

    borkmann committed Apr 17, 2019
    Seen several warnings in the cilium agent log files like the following:
    
      2019-04-17T20:11:25.804796408Z level=warning msg="Skipping symbol substitution" subsys=elf symbol=cilium_encrypt_state
    
    The map is a global one, therefore add it to ignored prefixes to
    silence the warning.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Apr 17, 2019
  1. bpf, snat: dump external v4/v6 addresses more clearly into node config

    borkmann authored and tgraf committed Apr 17, 2019
    It's useful for debugging, so make it clear what the Cilium daemon has
    selected for node_config.h such that this can be introspected easily
    from the usual location (/var/run/cilium/state/globals/node_config.h).
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. node, address: fix bug where internal IP is selected over external

    borkmann authored and tgraf committed Apr 16, 2019
    Commit df63d9b unfortunately broke SNAT due to selecting a
    private IP with global scope over a public one. The interface I had
    specified had two globally scoped IPs; one of them had a 10.0.0.0/8
    prefix, the other one did not. However, since a prior node_config.h was
    present on the system, it selected the 10.0.0.0/8 one. As a consequence,
    NAT broke since host-local communication which was using the public
    address then got NATed using the private one, and thus the machine
    loses connectivity. Rework firstGlobal*Addr() to prefer public over
    private addresses. If a preferredIP is passed, then it's only the
    preferred pick within the set of public resp. private ones.
    
    For IPv6, add similar logic. Also, the findIPv6NodeAddr() function
    comment mentions an interface that can be passed, however there is
    no such parameter. Rework the IPv6 logic to reuse all of the IPv4
    logic such that the global scope address of a provided interface is
    picked if a device is configured. The fallback logic then first tries
    to find a reduced scope if a device was specified, and if that breaks
    down, the second fallback is to find the IP considering all interfaces
    with universe scope (and again, falling back to reduced scope before
    giving up completely).
    
    Fixes: df63d9b ("Node: Try to prioritize the InternalIPv[46] from restore.")
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. bpf, snat: select lru map if available otherwise fall back to htab

    borkmann authored and tgraf committed Apr 12, 2019
    Same principle as in the conntrack table. In case of evicting old NAT
    entries from active connections via the LRU mechanism, we could run
    into two scenarios wrt TCP connections: i) a packet coming in via
    ingress; if there is no state for the given connection, the packet
    will be dropped. ii) The node might resend, in which case we'll have
    an outgoing connection; in that case, we'll re-create a new mapping
    via snat_v{4,6}_new_mapping(). Try to retain the evicted 5-tuple
    mapping if possible such that there is a chance connections keep
    working before we try completely random ones. Lift the short interval
    for GC cleanup when NAT is enabled so as to not interfere with stale
    NAT entries in the LRU. The NAT hash table would normally piggy-back
    on GC.
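
    The map definition itself selects the type at compile time, roughly
    as follows (a sketch; HAVE_LRU_MAP_TYPE stands for whatever feature
    probe the agent sets on capable kernels, and the map/struct names are
    illustrative):

      struct bpf_elf_map __section_maps SNAT_MAPPING_IPV4 = {
      #ifdef HAVE_LRU_MAP_TYPE
              .type = BPF_MAP_TYPE_LRU_HASH,
      #else
              .type = BPF_MAP_TYPE_HASH,
      #endif
              .size_key = sizeof(struct ipv4_ct_tuple),
              .size_value = sizeof(struct ipv4_nat_entry),
              .max_elem = SNAT_MAPPING_IPV4_SIZE,
      };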
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  4. bpf, snat: reject unknown ethertypes early

    borkmann authored and tgraf committed Apr 14, 2019
    Move the test out of do_netdev() and into each entry point so we
    don't even attempt SNAT in the first place. This also fixes
    a verifier issue on older kernels where LLVM generates ctx+off
    in a register, which is not allowed.
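
    Sketch of an entry point with the early check (illustrative; the
    validate_ethertype() name follows the existing helper style, and the
    do_netdev() signature shown is an assumption):

      __section("from-netdev")
      int from_netdev(struct __sk_buff *skb)
      {
              __u16 proto;

              /* Bail out before any SNAT handling for ethertypes we
               * don't know about; this also keeps LLVM from emitting
               * the ctx+off pattern that older verifiers reject.
               */
              if (!validate_ethertype(skb, &proto))
                      return TC_ACT_OK;
              return do_netdev(skb, proto);
      }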
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  5. bpf, snat: add cilium monitor support for pre/post snat engine

    borkmann authored and tgraf committed Apr 13, 2019
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Apr 9, 2019
  1. cilium, bpf: fix panic when run with newer LLVM

    borkmann authored and ianvernon committed Apr 9, 2019
    While debugging another issue I ran into the following panic wrt
    ELF templating:
    
      # daemon/cilium-agent --kvstore consul --kvstore-opt consul.address=127.0.0.1:8500 --enable-ipv6=false --masquerade=true --auto-direct-node-routes=true --disable-envoy-version-check
      [...]
      panic: runtime error: index out of range
    
      goroutine 318 [running]:
      github.com/cilium/cilium/pkg/elf.(*symbols).extractFrom(0xc0006520d0, 0xc000f28320, 0xc000f28320, 0x0)
      	/root/go/src/github.com/cilium/cilium/pkg/elf/symbols.go:170 +0x1357
      github.com/cilium/cilium/pkg/elf.NewELF(0x317a580, 0xc00000f518, 0xc0001090a0, 0x0, 0x0, 0xc0001090a0)
      	/root/go/src/github.com/cilium/cilium/pkg/elf/elf.go:77 +0x145
      github.com/cilium/cilium/pkg/elf.Open(0xc000afd5c0, 0x52, 0xc0000a6c00, 0x7f42126307d8, 0xc0006f20f0)
      	/root/go/src/github.com/cilium/cilium/pkg/elf/elf.go:97 +0x1eb
      github.com/cilium/cilium/pkg/datapath/loader.CompileOrLoad(0x31b6f80, 0xc0000a6c00, 0x31db1c0, 0xc0006f20f0, 0xc000fdf7c0, 0x0, 0x0)
      	/root/go/src/github.com/cilium/cilium/pkg/datapath/loader/loader.go:123 +0xd6
      github.com/cilium/cilium/pkg/endpoint.(*Endpoint).realizeBPFState(0xc000d46000, 0xc000fdf680, 0xc000fdf888, 0xc000fdf680, 0xc0012d5ad8)
      	/root/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:488 +0x30d
      github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerateBPF(0xc000d46000, 0x31d7540, 0xc0007b21c0, 0xc000fdf680, 0x0, 0x0, 0x0, 0x0)
      	/root/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:415 +0x273
      github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate(0xc000d46000, 0x31d7540, 0xc0007b21c0, 0xc000fdf680, 0x0, 0x0)
      	/root/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:323 +0x704
      github.com/cilium/cilium/pkg/endpoint.(*EndpointRegenerationEvent).Handle(0xc00139d8c0, 0xc0000a6ba0)
      	/root/go/src/github.com/cilium/cilium/pkg/endpoint/events.go:54 +0x21b
      github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run.func1()
      	/root/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:236 +0x144
      sync.(*Once).Do(0xc0004f0e78, 0xc0010df540)
      	/usr/local/go/src/sync/once.go:44 +0xb3
      created by github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run
      	/root/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:225 +0xa9
      level=fatal msg="Agent pipe unexpectedly closed, shutting down" subsys=cilium-node-monitor
    
    The reason is that newer LLVM changed the BPF backend to emit symbols
    with OBJECT instead of NOTYPE symbol type. Also, BTF line info is not
    recognized by golang's default ELF parser, so accessing its section
    leads to the oob access panic (sym.Section is 64k here). Just
    skip these.
    
      # llc --version
      LLVM (http://llvm.org/):
      LLVM version 9.0.0svn
      Optimized build.
      Default target: x86_64-unknown-linux-gnu
      Host CPU: haswell
    
      Registered Targets:
        bpf    - BPF (host endian)
        bpfeb  - BPF (big endian)
        bpfel  - BPF (little endian)
        x86    - 32-bit X86: Pentium-Pro and above
        x86-64 - 64-bit X86: EM64T and AMD64
    
    Symtab from bpf_netdev:
    
      # readelf -a /var/run/cilium/state/bpf_netdev.o
      [...]
        30: 00000000000000a8    28 OBJECT  GLOBAL DEFAULT   14 cilium_calls_netdev_2
        31: 0000000000000000    28 OBJECT  GLOBAL DEFAULT   14 cilium_events
        32: 00000000000000c4    28 OBJECT  GLOBAL DEFAULT   14 cilium_ipcache
        33: 000000000000001c    28 OBJECT  GLOBAL DEFAULT   14 cilium_lxc
        34: 0000000000000038    28 OBJECT  GLOBAL DEFAULT   14 cilium_metrics
        35: 0000000000000054    28 OBJECT  GLOBAL DEFAULT   14 cilium_policy
        36: 0000000000000070    28 OBJECT  GLOBAL DEFAULT   14 cilium_policy_reserved_2
        37: 000000000000008c    28 OBJECT  GLOBAL DEFAULT   14 cilium_proxy4
      [...]
    
    Symtab from bpf_lxc:
    
      # readelf -a /var/run/cilium/state/templates/c4406a7451ccb067746c874b71aaf19266ce7212/bpf_lxc.o
      [...]
       285: 000000000000001c     4 OBJECT  GLOBAL DEFAULT   11 LXC_ID
       286: 0000000000000010     4 OBJECT  GLOBAL DEFAULT   11 LXC_IPV4
       287: 0000000000000000     4 OBJECT  GLOBAL DEFAULT   11 LXC_IP_1
       288: 0000000000000004     4 OBJECT  GLOBAL DEFAULT   11 LXC_IP_2
       289: 0000000000000008     4 OBJECT  GLOBAL DEFAULT   11 LXC_IP_3
       290: 000000000000000c     4 OBJECT  GLOBAL DEFAULT   11 LXC_IP_4
       291: 0000000000000014     4 OBJECT  GLOBAL DEFAULT   11 NODE_MAC_1
       292: 0000000000000018     4 OBJECT  GLOBAL DEFAULT   11 NODE_MAC_2
       293: 0000000000000020     4 OBJECT  GLOBAL DEFAULT   11 SECLABEL
       294: 0000000000000024     4 OBJECT  GLOBAL DEFAULT   11 SECLABEL_NB
       295: 0000000000000000     4 OBJECT  GLOBAL DEFAULT   13 ____license
      [...]
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Mar 29, 2019
  1. cilium: add cilium bpf nat cli commands

    borkmann committed Mar 28, 2019
    Add some CLI commands for inspecting and debugging BPF NAT.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. bpf: initial bpf-based masquerading support

    borkmann committed Mar 28, 2019
    This work adds a NAT engine in BPF which works together with
    Cilium's BPF-based connection tracker. It gets enabled when the
    following is set: --install-iptables-rules=false --masquerade=true.
    In this case, Cilium won't rely on iptables to perform the masquerading
    but instead uses its own, more efficient BPF-based implementation.
    
    Right now, at the beginning, this option is only made available for
    direct routing mode with ipvlan, but this restriction will be lifted
    in subsequent work for the veth-based datapath as well. In combination
    with ipvlan, this allows use of the more efficient L3 mode instead of
    relying on L3S! The faster L3 mode is otherwise not possible to use in
    combination with masquerading since it bypasses the Netfilter hooks.
    For the BPF-based implementation in this work that is not a problem
    since tc-based hooks are not bypassed.
    
    The NAT core is mostly stand-alone and in this work is integrated
    into the bpf_netdev program, but it could in the future also be
    integrated into others. The NAT code handles connections from
    containers/pods but also host-local ones. In case of the latter, it
    will integrate mappings into the same NAT table to check for potential
    collisions, and if there is one, it will map them to a different port
    just like any other connection from containers/pods. Connections
    originating from the local host need to be passed through the BPF-based
    connection tracker as well in order to manage the lifetime of NAT
    mappings. For this purpose, it reuses the global connection tracker.
    
    NAT mappings piggy-back on Cilium's CT garbage collector in order
    to remove related mappings once a connection is terminated. Support
    for both IPv4 and IPv6 has been implemented and tested in this work.
    The supported L4 protocols are TCP, UDP, ICMP and ICMPv6. For both
    ICMP protocols, echo request/reply are supported, and in case of
    collisions their request/reply identifier is translated. Other ICMP
    types are not implemented yet; for those we'd need to parse deeper
    into the packet in order to avoid leaking addressing information.
    This can be done in future work. The NAT engine first tries to retain
    ports/identifiers from the packet in order to avoid L4 packet rewrites,
    but in case of collisions with existing mappings it will select a
    different one. At L3, for local host connections, it avoids rewrites
    altogether if the tuple can be retained.
    
    The NAT engine works with older kernels as well since it doesn't
    rely on recently added helpers. Instead, it uses a subset of what
    Cilium already uses today. A queue/stack map was added to the kernel
    some time ago (https://lwn.net/Articles/768994/) with the motivation
    of implementing NAT through it, but we found that using this map is
    rather limiting, as a design with the help of that map would only
    allow for 64k NAT mappings, nothing beyond that. For the queue map
    case, the theory of operation would have been: fill the queue map
    with individual, unique port values as entries; upon port assignment,
    take an unused port out of the map to avoid collisions; and once the
    connection is terminated, put it back into the queue map. Aside from
    the limited number of mappings, this also has the issue that we'd be
    wasting quite some memory since we'd need to prealloc the map with
    64k entries a total of 6 times (TCP, UDP and ICMP for IPv4 and IPv6
    each). We instead rely on randomized port allocation, and in case of
    collisions we retry up to 16 times before giving up. The prandom-based
    allocation is uniformly distributed, and aside from overcoming the
    memory waste, we also allow for colliding port mappings as long as the
    tuple itself is still unique, thus avoiding the 64k limit we would
    otherwise have had with the queue map.
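
    The allocation loop is roughly the following (a sketch; apart from
    the get_prandom_u32() BPF helper, the function and lookup names are
    illustrative):

      static __always_inline int snat_v4_new_port(struct ipv4_ct_tuple *rtuple,
                                                  __u16 min, __u16 max)
      {
              __u16 range = max - min + 1;
              int i;

              for (i = 0; i < 16; i++) {
                      /* Uniformly distributed candidate port. */
                      rtuple->dport = bpf_htons(min + (get_prandom_u32() % range));
                      /* A colliding port is fine as long as the full
                       * reverse tuple stays unique in the NAT map.
                       */
                      if (!snat_v4_lookup(rtuple))
                              return 0;
              }
              return -1; /* give up after 16 retries */
      }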
    
    The Cilium daemon has two new commands for managing the NAT table
    similarly as with the connection tracking tables:
    
      cilium bpf nat list  - List all current NAT mappings
      cilium bpf nat flush - Remove all NAT mappings
    
    Both are quite useful for introspection and/or debugging. Example
    invocation with the BPF-based NAT engine:
    
    Start daemon with BPF-based masquerading:
    
      # ./daemon/cilium-agent --kvstore consul \
                              --kvstore-opt consul.address=127.0.0.1:8500 \
                              --datapath-mode=ipvlan \
                              --ipvlan-master-device=bond0 \
                              --tunnel=disabled \
                              --install-iptables-rules=false \      <-- Switch to BPF
                              --masquerade=true \                   <-- Masquerading
                              --auto-direct-node-routes=true \
                              --enable-ipv4=true \
                              --enable-ipv6=true \
                              --disable-envoy-version-check=true
    
    Check Cilium operates in ipvlan L3 mode:
    
      # ip -d a
      [...]
      277: cilium_host@bond0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
          link/ether 0c:c4:7a:86:29:f8 brd ff:ff:ff:ff:ff:ff promiscuity 0
          ipvlan  mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
          inet 10.155.0.1/32 scope link cilium_host
             valid_lft forever preferred_lft forever
          inet6 2604:1380:2000:3d00::3/128 scope global
             valid_lft forever preferred_lft forever
          inet6 fe80::cc4:7a00:e986:29f8/64 scope link
             valid_lft forever preferred_lft forever
    
    List current NAT mappings (ssh to node itself):
    
      # cilium bpf nat list
      TCP OUT 147.75.83.155:22 -> 188.63.217.235:51372 XLATE_SRC 147.75.83.155:22 Created=5sec HostLocal=1
      TCP IN 188.63.217.235:51372 -> 147.75.83.155:22 XLATE_DST 147.75.83.155:22 Created=5sec HostLocal=1
    
    Ping from container to outside world with ipvlan L3:
    
      # docker exec -ti server ping 1.1.1.1
      PING 1.1.1.1 (1.1.1.1): 56 data bytes
      64 bytes from 1.1.1.1: seq=0 ttl=58 time=1.037 ms
      64 bytes from 1.1.1.1: seq=1 ttl=58 time=1.100 ms
      [...]
    
    Again, list new NAT mappings:
    
      # cilium bpf nat list
      ICMP IN 1.1.1.1:0 -> 147.75.83.155:35328 XLATE_DST 10.155.53.137:35328 Created=4sec HostLocal=0
      TCP OUT 147.75.83.155:22 -> 188.63.217.235:51372 XLATE_SRC 147.75.83.155:22 Created=27sec HostLocal=1
      ICMP OUT 10.155.53.137:35328 -> 1.1.1.1:0 XLATE_SRC 147.75.83.155:35328 Created=4sec HostLocal=0
      TCP IN 223.111.139.244:33154 -> 147.75.83.155:22 XLATE_DST 147.75.83.155:22 Created=17sec HostLocal=1
      TCP OUT 147.75.83.155:22 -> 223.111.139.244:33154 XLATE_SRC 147.75.83.155:22 Created=17sec HostLocal=1
      TCP IN 188.63.217.235:51372 -> 147.75.83.155:22 XLATE_DST 147.75.83.155:22 Created=27sec HostLocal=1
    
    Test manually flushing NAT mappings (test/demo-only, not needed otherwise):
    
      # cilium bpf nat flush
      Flushed 8 entries from /sys/fs/bpf/tc/globals/cilium_snat_v4_external
      Flushed 0 entries from /sys/fs/bpf/tc/globals/cilium_snat_v6_external
    
    Ping from local host to outside world:
    
      # ping6 ipv6.google.com
      PING ipv6.google.com(ams16s29-in-x0e.1e100.net (2a00:1450:400e:804::200e)) 56 data bytes
      64 bytes from ams16s29-in-x0e.1e100.net (2a00:1450:400e:804::200e): icmp_seq=1 ttl=56 time=1.33 ms
      64 bytes from ams16s29-in-x0e.1e100.net (2a00:1450:400e:804::200e): icmp_seq=2 ttl=56 time=1.29 ms
      64 bytes from ams16s29-in-x0e.1e100.net (2a00:1450:400e:804::200e): icmp_seq=3 ttl=56 time=1.41 ms
      [...]
    
    List mappings again:
    
      # cilium bpf nat list
      [...]
      ICMPv6 OUT [2604:1380:2000:3d00::3]:20631 -> [2a00:1450:400e:804::200e]:0 XLATE_SRC [2604:1380:2000:3d00::3]:20631 Created=5sec HostLocal=1
      ICMPv6 IN [2a00:1450:400e:804::200e]:0 -> [2604:1380:2000:3d00::3]:20631 XLATE_DST [2604:1380:2000:3d00::3]:20631 Created=5sec HostLocal=1
    
    A flush of the CT triggers eviction of the NAT mappings as well. This,
    of course, works for an explicit flush (demoed here) as well as for
    automatic garbage collection:
    
      # cilium bpf ct flush global
      [...]
    
    List mappings after flush (only ssh to node itself remaining):
    
      # cilium bpf nat list
      TCP OUT 147.75.83.155:22 -> 188.63.217.235:51372 XLATE_SRC 147.75.83.155:22 Created=5sec HostLocal=1
      TCP IN 188.63.217.235:51372 -> 147.75.83.155:22 XLATE_DST 147.75.83.155:22 Created=5sec HostLocal=1
    
    The documentation under ipvlan has been updated; once we enable this
    setting for veth-based direct routing as well, we'll add its own
    getting started section.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. bpf: track icmp echo ids in conntrack entries

    borkmann committed Mar 28, 2019
    This is needed so that the conntrack garbage collection can later
    also properly flush NAT mappings where we track and rewrite the echo
    request/reply IDs instead of ports. Otherwise we would need separate
    garbage collection runs for ICMP in the NAT table only, which would
    be ugly. There should be no breakage from the CT side, since echo
    req/replies are short-lived and entries will be recreated, and we
    also add ICMP related entries on top.
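
    In tuple terms, the echo identifier simply takes the place of the
    ports, roughly (a sketch using the kernel's struct icmphdr; the
    actual code picks sport vs. dport based on direction):

      /* ICMP echo has no ports, so carry the echo identifier in the
       * CT tuple's port fields. CT and NAT garbage collection can
       * then treat ICMP entries just like TCP/UDP ones.
       */
      if (icmp.type == ICMP_ECHO || icmp.type == ICMP_ECHOREPLY) {
              tuple->sport = 0;
              tuple->dport = icmp.un.echo.id;
      }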
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  4. bpf: refactor ct maps into lib/{nat,conntrack_map}.h

    borkmann committed Mar 28, 2019
    We need to reuse the connection tracker from NAT as well, therefore
    move the map definitions out of bpf_lxc. Also, factor out the ct
    delete helpers since we'll add NAT deletion logic in there as well.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  5. maps, natmap: add a new NAT map to access mappings from daemon

    borkmann committed Mar 28, 2019
    Add a representation of the NAT map in order to i) dump and flush
    the map and ii) piggy-back on garbage collection from ctmap.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  6. maps, ctmap: refactor keys into tuple for later reuse outside of ct

    borkmann committed Mar 28, 2019
    They will later be reused for the NAT map as well; just the
    values will be different.
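
    The shared shape both CT and NAT key off is essentially the tuple
    already used by the datapath, e.g. for IPv4:

      struct ipv4_ct_tuple {
              __be32 daddr;
              __be32 saddr;
              __be16 dport;
              __be16 sport;
              __u8 nexthdr;
              __u8 flags;
      } __packed;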
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Mar 18, 2019
  1. ipsec, daemon: reject unsupported config options

    borkmann authored and tgraf committed Mar 18, 2019
    This avoids obscure startup errors in the Cilium daemon and/or false
    user expectations; thus, error out early with a clear error
    message.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. ipsec, doc: remove note on 1.4.1 release

    borkmann authored and tgraf committed Mar 18, 2019
    Replace it with 'upcoming release' since it hasn't been merged
    into 1.4.1 or 1.4.2 at this point and to avoid confusion for users
    following the guide.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. ipsec, bpf: fix build error when tunneling is disabled

    borkmann authored and tgraf committed Mar 16, 2019
    If we don't have the encap ifindex defined (ENCAP_IFINDEX), then we also
    cannot redirect to it. This compilation error is thrown instead:
    
      2019-03-16T08:38:33.30268603Z level=warning msg="/var/lib/cilium/bpf/bpf_netdev.c:548:11: warning: implicit declaration of function '__encap_and_redirect_with_nodeid' is invalid in C99 [-Wimplicit-function-declaration]" subsys=daemon
      2019-03-16T08:38:33.302694714Z level=warning msg="                        return __encap_and_redirect_with_nodeid(skb, tunnel_endpoint, seclabel, TRACE_PAYLOAD_LEN);" subsys=daemon
      2019-03-16T08:38:33.302702926Z level=warning msg="                               ^" subsys=daemon
      2019-03-16T08:38:33.302708262Z level=warning msg="1 warning generated." subsys=daemon
    
    Fixes: 3b62458 ("cilium: ipsec, add BPF datapath encryption direction")
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  4. daemon: fix conntrack map dump wrt addresses

    borkmann authored and tgraf committed Mar 18, 2019
    The CT dump currently shows swapped src/dst address entries even
    though it's correctly using the src address resp. dst address as data.
    
    The issue is that 7afe903 ("bpf: Global 5 tuple conntrack.") did
    not swap the initial tuple for the lookup when converting from the
    local to the global table, and all the current code right now works
    around this in order to not break the CT table during a version
    upgrade.
    
    Thus, the same needs to be done here for the dump. The issue became
    more apparent after aaf6ba3 ("ctmap: Fix order of CtKey{4,6} struct
    fields"), which might have been swapped on purpose but without
    further comment in the code on why it was swapped on the daemon side.
    
    In this case, reverting aaf6ba3 doesn't fully fix it either,
    since then the direction also needs to be swapped. Instead, make it
    less confusing and only swap what needs to be swapped, that is, the
    address parts, since in the datapath this is the only thing that
    should have been done but was missed back then. For the next major
    version upgrade (aka 2.0), this will be properly fixed (at the
    cost of a disruptive upgrade).
    
    Fixes: 7afe903 ("bpf: Global 5 tuple conntrack.")
    Fixes: aaf6ba3 ("ctmap: Fix order of CtKey{4,6} struct fields")
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Feb 27, 2019
  1. cilium: fix bailing out on auto-complete when v4/v6 ranges are specified

    borkmann authored and tgraf committed Feb 27, 2019
    The node.AutoComplete() call checks for ipv{4,6}AllocRange being set
    even though in the != AutoCIDR case we set this after node.AutoComplete(),
    meaning --ipv{4,6}-range is ignored and the daemon bails out nevertheless.
    
    Fixes: 90eed3a ("pkg: allocate first IP in IPv4 allocation range")
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Feb 13, 2019
  1. cilium, bpf: only account tx for egress direction

    borkmann committed Feb 13, 2019
    ... and not for services, which is buggy.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Feb 12, 2019
  1. doc, configmap: add missing entries

    borkmann authored and tgraf committed Feb 12, 2019
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Feb 8, 2019
  1. cilium, sockmap: unbreak --sockmap-enable mode

    borkmann authored and tgraf committed Feb 8, 2019
    Currently, operating Cilium in sockmap mode fails with the following
    error in the log:
    
      [...]
      level=error msg="Failed to pin map 0(cilium_sock_ops_map): exit status 255" subsys=sockops
      level=error msg="Failed to attach prog(145) to map(0): exit status 255" subsys=sockops
      [...]
    
    The reason is that we search for a sockhash map named 'cilium_sock_ops_map'
    via bpftool, but the kernel only has a sockhash map named 'test_cilium_soc'.
    In the kernel, map names have a max size of 16 bytes (incl. \0), therefore
    the Cilium daemon cannot find the map and ends up with a map ID of 0, which
    is invalid. Get it running again by naming it 'cilium_sock_ops'. The 'test_'
    prefix is also invalid since this is not a dummy header.
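
    For reference, the kernel-side constraint and the fix as a sketch
    (the map fields here are illustrative):

      /* include/uapi/linux/bpf.h: names are capped at 16 bytes
       * including the trailing NUL, i.e. 15 visible characters.
       */
      #define BPF_OBJ_NAME_LEN 16U

      /* "cilium_sock_ops_map" (19 chars) gets truncated on load, so a
       * later lookup by its full name finds nothing. "cilium_sock_ops"
       * is exactly 15 chars and fits.
       */
      struct bpf_elf_map __section_maps cilium_sock_ops = {
              .type = BPF_MAP_TYPE_SOCKHASH,
              .size_key = sizeof(struct sock_key),
              .size_value = sizeof(int),
              .max_elem = 65535,
      };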
    
    Fixes: 9e1c047 ("sockops: rename sock_ops_map to cilium_sock_ops_map")
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Feb 6, 2019
  1. docs: add initial networking gsg for ipvlan

    borkmann authored and aanm committed Feb 5, 2019
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. cilium, ipvlan: add config map support to examples

    borkmann authored and aanm committed Feb 6, 2019
    Auto-generated via:
    
      CILIUM_VERSION=latest make -C examples/kubernetes clean all
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. cilium, ipvlan: add config map support to templates

    borkmann authored and aanm committed Feb 6, 2019
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  4. docs: add note to minikube gsg for --kubernetes-version

    borkmann authored and aanm committed Feb 5, 2019
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commits on Feb 4, 2019
  1. cilium, docker: allow for v4 only config

    borkmann committed Feb 4, 2019
    GetNodeAddressing() does not set up the v6 config when running in
    v4-only mode.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>