Validate guides for v1.13 release #23051
Comments
Kind Quick Install - 163 seconds
Kind HTTP Policy - 4 mins

Will grab the …

EKS Quick Install - under 3 minutes once the cluster is created (which takes a lifetime)
EKS IPsec 😭
Status: tcpdump is not showing encrypted traffic, but lots of traffic is shown if esp is not specified
Taking a look at the IPsec guide... cc @lizrice
FYI, note you tested v1.12 and not v1.13-rc4. As per Chris's OP: "Please test using https://docs.cilium.io/en/v1.13.0-rc4."
The issues you found probably also affect v1.13 though.
We typically assume Cilium is uninstalled at the beginning of a guide. If we changed that guide to assume something else, we would need to change a lot of guides. Improving the CLI to not fail in this case is tracked at cilium/cilium-cli#205.
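For anyone following along, a minimal reset sketch between attempts (assumes cilium-cli is on the PATH; uninstall-then-reinstall is my suggestion here, not a quote from the guides):

$ cilium uninstall                       # remove any existing Cilium installation
$ cilium install --version v1.13.0-rc4   # reinstall the release under test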
I don't think that matters (anymore?). I've sent #23135 to update the guide.
#23135 will fix that as well.
I was able to reproduce that. I had to wait a while before the tcpdump command showed anything. This seems to be caused by stdout buffering. I've also updated the guide in #23135 to account for that.
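For anyone reproducing this, a hedged sketch of the workaround: tcpdump's -l flag line-buffers stdout, so matches print immediately instead of sitting in the buffer. The interface name cilium_vxlan assumes the default tunnel datapath, and tcpdump being available in the agent pod is also an assumption:

$ kubectl -n kube-system exec -ti ds/cilium -- tcpdump -l -n -i cilium_vxlan esp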
Ha, you're right, I forgot the argument when I had to uninstall and reinstall, sorry about that. Did it again; much improved by being able to run …
The provided links for AKS BYOCNI for "Cluster Mesh" and "Service Affinity" lead to 404.
@jspaleta you can find the clustermesh and service affinity pages from the new multi-cluster networking section: https://docs.cilium.io/en/v1.13.0-rc4/network/clustermesh/
AKS BYOCNI worksforme. I recorded the validation testing live: https://www.twitch.tv/videos/1710361420. One connectivity test error cleaned itself up on re-run.
Apologies for all the 404s. I quickly created this issue without going through and checking the links. I assumed the docs structure didn't change much, but that was not true 🙂. They should all be fixed now.
As I'm running through these to validate the corrections: if I come up with an enhancement idea, should I just file those as new issues?
The AKS BYOCNI quick start states the default datapath will be Encapsulation. But, if I'm understanding this correctly, config view is telling me Cilium is in the native routing datapath: https://docs.cilium.io/en/v1.13.0-rc4/network/concepts/routing/#native-routing. This also impacts the steps needed to get Service Mesh running on AKS, as encapsulation and native routing have slightly different preflight steps that need to be documented. So is this a documentation bug, or should cilium install really default to …
@jspaleta good spotting. Can you share the output from the Cilium install steps? This looks like a discrepancy between the docs and the behaviour of cilium-cli.

EDIT: At a glance, it looks like the BYOCNI instructions for helm (on this page) were updated the same way as the quick install, but cilium-cli was not changed to enable the BYOCNI options by default. We may need to revisit the way that cilium-cli behaves when attempting a fresh install into an AKS environment.
GKE Quick Install - just under 30 min (of which cluster install ~6 min, connectivity test ~17 min)
GKE Egress Gateway - works fine, but I'm adding some troubleshooting tips to the guide to help anyone who mis-types a label like I did
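One such tip, as a sketch (not necessarily the wording that landed in the guide): printing the pod labels makes a mistyped selector label easy to spot:

$ kubectl get pods -o wide --show-labels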
@joestringer I reran the quickstart instructions and confirmed config view now shows tunnel: vxlan and ipam: cluster-pool. But now I wonder, can we enhance the docs with a breadcrumb to help a user verify the AKS cluster was created with adequate settings prior to the cilium-cli install? Maybe a cilium-cli preflight mode that does the investigation and reports back what it would do for install settings? That sort of mode could probe the AKS setup and tell you if it's configured correctly for byocni or azure-ipam without attempting the install. Hmmm.
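For reference, a verification sketch; the key names come from the config view output quoted above:

$ cilium config view | grep -E 'tunnel|ipam'
# expected for the BYOCNI quick install:
#   ipam     cluster-pool
#   tunnel   vxlan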
@jspaleta In the installation output, the CLI will say what sort of datapath mode it's configuring Cilium for. We could improve the docs to say to check for that specific line in the output. The Azure autodetection sort of logic already exists in the Cilium CLI installation: https://github.com/cilium/cilium-cli/blob/03f744ff360e46030509904f89d7e4ffe3ac036f/install/autodetect.go#L84
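For example, a sketch of checking for that specific line (the exact output appears in the k3s terminal log later in this thread):

$ cilium install --version v1.13.0-rc4 2>&1 | tee install.log
$ grep 'Auto-detected datapath mode' install.log
# 🔮 Auto-detected datapath mode: tunnel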
Ran through the AKS Cluster Mesh specific instructions. I got one error running the connectivity tests at the tail end of the instructions. The … I re-ran the no-policies test by itself, and still got a failure.
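Re-running a single test can be done with the CLI's --test filter; a sketch, assuming the test name matches the one in the failure output:

$ cilium connectivity test --test no-policies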
The feature seems to be working as expected; please find the details of testing below.

Create the workload (using the clustermesh example workload).

Create a CNP policy without auth configuration:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "auth-policy"
spec:
  endpointSelector:
    matchLabels:
      name: x-wing
  egress:
  - toEndpoints:
    - matchLabels:
        name: rebel-base

Send the request from one pod to another.

Create a CNP policy with auth configuration (e.g. null):

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "auth-policy"
spec:
  endpointSelector:
    matchLabels:
      name: x-wing
  egress:
  - toEndpoints:
    - matchLabels:
        name: rebel-base
    auth:
      type: "null"

Send the request from one pod to another (see the sketch below).
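For completeness, "send the request from one pod to another" is along these lines — a sketch assuming the x-wing deployment and rebel-base service names from the clustermesh example workload:

$ kubectl exec -ti deployment/x-wing -- curl -s rebel-base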
A little more color on the connectivity errors I'm seeing. I think there is a missing Azure-specific instruction at https://docs.cilium.io/en/v1.13.0-rc4/network/clustermesh/aks-clustermesh-prep/, but I'm not sure what we need to add. Just noodling around with the peering create options… I think we need to enable gateway transit on the peering to get the pod-to-local-nodeport tests passing.
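What I'm experimenting with, as a sketch (the resource-group, vnet, and peering names are placeholders; az network vnet peering update accepts generic --set property updates):

$ az network vnet peering update \
    --resource-group cluster1-rg \
    --vnet-name cluster1-vnet \
    --name peering-to-cluster2 \
    --set allowGatewayTransit=true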
The CLI autodetects the right configuration to use based on the actual cluster it's working against, keeping with the CLI design idea that users get a transparent experience no matter the cluster or platform, hence why there are no explicit options documented to be passed ;)
FWIW I'm proposing a refactor of AKS installation instructions in #23304.
I'll run through the refactor to see if I see something different. But heads up: right now, following the BYOCNI AKS cluster mesh instructions, I'm hitting some rough edges. There's a bug or two lurking here with AKS.
Good news! The connectivity test failures for AKS BYOCNI are now understood; they are documentation/CLI tool bugs not specific to the AKS guidance. With that, I can confidently say the AKS BYOCNI guides are validated.
It should be OK as the instructions are the same, just moved around in a different structure to encourage BYOCNI over Azure IPAM. A fresh read is more than welcome though, I'd like to get your opinion as a user :)
k3s Quick Install - ~7-8 mins including cluster creation (timestamps below include connectivity test and reinstall of the correct version)

Feedback
Note: In all cases, you have the user export the KUBECONFIG variable …

Terminal Output
$ date
Thu Jan 26 07:07:34 UTC 2023
$ curl -sfL https://get.k3s.io | K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC=' --flannel-backend=none --disable-network-policy' sh -
[sudo] password for :
[INFO] Finding release for channel stable
[INFO] Using v1.25.5+k3s2 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.25.5+k3s2/sha256sum-amd64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.25.5+k3s2/k3s
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Creating /usr/local/bin/ctr symlink to k3s
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] systemd: Starting k3s
$ export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
$ CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/master/stable.txt)
CLI_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then CLI_ARCH=arm64; fi
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-${CLI_ARCH}.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-${CLI_ARCH}.tar.gz /usr/local/bin
rm cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 25.4M 100 25.4M 0 0 8084k 0 0:00:03 0:00:03 --:--:-- 10.4M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 92 100 92 0 0 473 0 --:--:-- --:--:-- --:--:-- 473
cilium-linux-amd64.tar.gz: OK
cilium
$ cilium install --version v1.13.0-rc4
🔮 Auto-detected Kubernetes kind: K3s
ℹ️ Using Cilium version 1.13.0-rc4
🔮 Auto-detected cluster name: default
🔮 Auto-detected datapath mode: tunnel
⚠️ Unable to list kubernetes api resources, try --api-versions if needed: %!w(*fmt.wrapError=&{failed to list api resources: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request 0xc00062a008})
ℹ️ helm template --namespace kube-system cilium cilium/cilium --version 1.13.0-rc4 --set cluster.id=0,cluster.name=default,encryption.nodeEncryption=false,kubeProxyReplacement=disabled,operator.replicas=1,serviceAccounts.cilium.name=cilium,serviceAccounts.operator.name=cilium-operator,tunnel=vxlan
ℹ️ Storing helm values file in kube-system/cilium-cli-helm-values Secret
🔑 Created CA in secret cilium-ca
🔑 Generating certificates for Hubble...
🚀 Creating Service accounts...
🚀 Creating Cluster roles...
🚀 Creating ConfigMap for Cilium version 1.13.0-rc4...
🚀 Creating Agent DaemonSet...
🚀 Creating Operator Deployment...
⌛ Waiting for Cilium to be installed and ready...
✅ Cilium was successfully installed! Run 'cilium status' to view installation health
$ cilium status --wait
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Hubble: disabled
\__/¯¯\__/ ClusterMesh: disabled
\__/
DaemonSet cilium Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
Containers: cilium Running: 1
cilium-operator Running: 1
Cluster Pods: 5/5 managed by Cilium
Image versions cilium quay.io/cilium/cilium:v1.13.0-rc4@sha256:32acd47fd9bea9c0045222ba5d27f5fe9ad06dabd572a80b870b1f0e68c0e928: 1
cilium-operator quay.io/cilium/operator-generic:v1.13.0-rc4@sha256:19f612d4f1052e26edf33e26f60d64d8fb6caed9f03692b85b429a4ef5d175b2: 1
$ date
Thu Jan 26 07:18:29 UTC 2023
This probably relates to many platforms, but the step … Should we always have the …
Host Firewall Feedback: This one's fine, but I think the doc as a whole could use some work: …
@tommyp1ckles Are you planning to send a PR?
I'd expect users to usually come with a specific env. in mind. But regardless, it's not something we do in any guide AFAIK.
Those two things don't seem to contradict each other 🤔 The second allows anything from the cluster; the first says that if something comes from outside the cluster, it will only be allowed on port TCP/22. |
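To make the two rules concrete, a hypothetical sketch of such a policy (not the guide's exact manifest; with the host firewall enabled, selecting the host endpoint makes unmatched ingress default-deny, and a rule with no from clause matches any source):

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: host-fw-sketch
spec:
  nodeSelector: {}          # select all host endpoints (hypothetical name/selector)
  ingress:
  - fromEntities:
    - cluster               # "the second": allow anything from within the cluster
  - toPorts:
    - ports:
      - port: "22"
        protocol: TCP       # "the first": any other source, only on TCP/22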
@pchaigno Yup, I'm putting together some changes.
Some terminals automatically escape characters for arguments.

Addresses: cilium#23051
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
* Add options for using cilium-cli or helm.
* Ensure cilium is ready prior to proceeding.

Addresses: cilium#23051
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Update: Checking on OLM releases for the RC so that I can test OKD.
RKE doc links to a broken anchor. PR: #23706
RKE install worked well.
RKE toEntities/kube-apiserver test failed, most likely because this is a one-node cluster (where the kube-apiserver runs on the same node, so its traffic presumably shows up as host; note the third policy below only differs from the second by additionally allowing the host entity). The policies used:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-deny
  namespace: default
spec:
  endpointSelector: {}
  ingress:
  - {}
  egress:
  - toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
      rules:
        dns:
        - matchPattern: "*"
---
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: kubeapi-server
  namespace: default
spec:
  endpointSelector: {}
  egress:
  - toEntities:
    - kube-apiserver
---
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: kubeapi-server
  namespace: default
spec:
  endpointSelector: {}
  egress:
  - toEntities:
    - kube-apiserver
    - host
Better fix for the RKE doc link: #23728
With regard to Openshift testing, I was able to bring up the cluster but hit a few snags with the cilium-olm.

1.6763978282953136e+09 ERROR helm.controller Release failed {"namespace": "cilium", "name": "cilium", "apiVersion": "cilium.io/v1alpha1", "kind": "CiliumConfig", "release": "cilium", "error": "failed to install release: rendered manifests contain a resource that already exists. Unable to continue with install: could not get information about the resource Role \"cilium-config-agent\" in namespace \"cilium\": roles.rbac.authorization.k8s.io \"cilium-config-agent\" is forbidden: User \"system:serviceaccount:cilium:cilium-olm\" cannot get resource \"roles\" in API group \"rbac.authorization.k8s.io\" in the namespace \"cilium\": RBAC: role.rbac.authorization.k8s.io \"leader-election\" not found"}

I manually patched the clusterrole for cilium-olm to include the ability to view roles and rolebindings:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: "2023-02-14T16:02:06Z"
  name: cilium-cilium-olm
  resourceVersion: "24707"
  uid: 8a51cc46-657d-484a-839d-629ee196e146
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - hostnetwork
  resources:
  - securitycontextconstraints
  verbs:
  - use
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  - roles
  - rolebindings
  verbs:
  - create
  - get
  - patch
  - update
  - delete
  - list
  - watch

This triggered the following:

I0214 18:36:08.422128 1 request.go:601] Waited for 1.047810343s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/storage.k8s.io/v1beta1?timeout=32s
1.676399770037502e+09 ERROR helm.controller Release failed {"namespace": "cilium", "name": "cilium", "apiVersion": "cilium.io/v1alpha1", "kind": "CiliumConfig", "release": "cilium", "error": "failed to install release: clusterroles.rbac.authorization.k8s.io \"cilium-operator\" is forbidden: user \"system:serviceaccount:cilium:cilium-olm\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:cilium\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"\"], Resources:[\"services/status\"], Verbs:[\"patch\"]}"}
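Based on that second error, the missing grant appears to be the rule below — a sketch of an addition to the cilium-cilium-olm ClusterRole above (Kubernetes RBAC escalation prevention requires cilium-olm to itself hold any permission it grants to cilium-operator):

- apiGroups:
  - ""
  resources:
  - services/status
  verbs:
  - patch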
@cmluciano Probably we need to fix this in the manifests then. cc @nathanjsweet for opinion |
- KPR probe mode has been removed in v1.13. Use the default setting for running the CNI tests.
- Update cilium-olm cluster role. Cilium v1.13 needs access to role, rolebindings, and service/status.

Ref: cilium/cilium#23051 (comment)
Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
Below is a set of platforms, each with a semi-randomly picked feature "getting started guide" link. The goal is to test each platform with at least one of the getting started guides. A secondary goal is to expose more people to the system.
Please test using https://docs.cilium.io/en/v1.13.0-rc4. For quick install guides, make sure to pass the version argument to the Cilium CLI so that it installs the correct version.
Also take note of how much time the Quick Install took and how much time testing the feature itself took. Ideally the Quick Install should take < 15 minutes.
Deadline: TBD
EKS (@lizrice)
GKE (@lizrice)
K3s (@tracypholmes)
Kind (@thebsdbox)
RKE (@raphink)
AKS BYOCNI (@jspaleta)
AKS Azure IPAM (@tommyp1ckles)
Openshift (@cmluciano)