Verify ARM64 support #468

Closed
6 tasks done
tjkirch opened this issue Oct 24, 2019 · 11 comments

@tjkirch
Contributor

tjkirch commented Oct 24, 2019

Wherever EKS and ECR support ARM64, we should aim for the same level of support.

The EKS ARM preview program provides images and ARM-specific Kubernetes resources. We should use these resources and verify that we can successfully launch Bottlerocket nodes on ARM instances.


Bottlerocket OS

  • Build and test Bottlerocket AMI (booted and runs pods)

Control container

Admin container

Update operator

Kubernetes node

@jahkeup
Member

jahkeup commented Jan 30, 2020

Changes to the CNI build process are needed to support ARM: the current build process neither sets GOARCH nor uses the arch-specific base container image. I'll get in touch with the maintainers to have these changes made, and I'll contribute the changes I made to get the build to succeed.
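
To illustrate the GOARCH point: `uname -m` reports aarch64 on these instances, while the Go toolchain calls the same architecture arm64, so a build script has to translate between the two. The following is only a minimal sketch, not the actual CNI build change; `goarch_for` is a hypothetical helper name and the `go build` invocation in the comment uses a placeholder package path.

```shell
# Hypothetical helper: map a machine architecture name (as printed by
# `uname -m`) to the value the Go toolchain expects in GOARCH.
goarch_for() {
    case "$1" in
        x86_64|amd64)  echo amd64 ;;
        aarch64|arm64) echo arm64 ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Example (not run here; the package path is a placeholder):
# GOOS=linux GOARCH="$(goarch_for aarch64)" go build ./...
```

Failing loudly on an unknown architecture, rather than defaulting to the host's, avoids silently producing a binary for the wrong target.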

I'll be testing out the changes in the mentioned PRs (and CNI build changes) to confirm the -arm64 suffix strategy works as expected before making any conclusions.

@jahkeup
Member

jahkeup commented Jan 30, 2020

Using admin and control containers built from #694 I was able to launch an instance and ssh in:

13:25:28 ❯ ssh ec2-user@54.70.132.152
The authenticity of host '54.70.132.152 (54.70.132.152)' can't be established.
ECDSA key fingerprint is SHA256:kQm800kZtGMXjillRl/je+IbTER4xi/LuxiKfiNTcHw.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.70.132.152' (ECDSA) to the list of known hosts.
Welcome to Thar's Handy Administrator Resources (the admin container)!

This container provides access to the Thar host filesystems (see
/.thar/rootfs) and contains common tools for inspection and troubleshooting.
It is based on Amazon Linux 2, and most things are in the same places you would
find them on an AL2 host.

To permit more intrusive troubleshooting, including actions that mutate the
running state of the Thar host, we provide a tool called "sheltie" (`sudo sheltie`).
When run, this tool drops you into a root shell in the Thar host's root filesystem.
[ec2-user@ip-172-31-46-18 ~]$ sudo sheltie
bash-5.0# ls -al
total 64
drwxr-xr-x  19 root root  4096 Jan 30 18:25 .
drwxr-xr-x  19 root root  4096 Jan 30 18:25 ..
drwxr-xr-x   3 root root  4096 Jan 30 18:25 aarch64-thar-linux-gnu
lrwxrwxrwx   1 root root    41 Jan 30 18:14 bin -> ./aarch64-thar-linux-gnu/sys-root/usr/bin
drwxr-xr-x   2 root root  4096 Jan 30 18:25 boot
drwxr-xr-x  14 root root  2800 Jan 30 21:22 dev
drwxr-xr-x  11 root root   580 Jan 30 21:27 etc
drwxr-xr-x   2 root root  4096 Jan 30 18:14 home
lrwxrwxrwx   1 root root    41 Jan 30 18:14 lib -> ./aarch64-thar-linux-gnu/sys-root/usr/lib
lrwxrwxrwx   1 root root    41 Jan 30 18:14 lib64 -> ./aarch64-thar-linux-gnu/sys-root/usr/lib
drwxr-xr-x   6 root root  4096 Jan 30 21:27 local
drwx------   2 root root 16384 Jan 30 18:25 lost+found
drwxr-xr-x   2 root root  4096 Jan 30 18:14 media
drwxr-xr-x   2 root root  4096 Jan 30 18:14 mnt
drwxr-xr-x   5 root root  4096 Jan 30 21:27 opt
dr-xr-xr-x 132 root root     0 Jan  1  1970 proc
drwxr-xr-x   2 root root  4096 Jan 30 18:14 root
drwxr-xr-x  16 root root   420 Jan 30 21:27 run
lrwxrwxrwx   1 root root    42 Jan 30 18:14 sbin -> ./aarch64-thar-linux-gnu/sys-root/usr/sbin
drwxr-xr-x   2 root root  4096 Jan 30 18:14 srv
dr-xr-xr-x  12 root root     0 Jan 30 21:22 sys
drwxrwxrwt   8 root root   160 Jan 30 21:27 tmp
lrwxrwxrwx   1 root root    37 Jan 30 18:14 usr -> ./aarch64-thar-linux-gnu/sys-root/usr
drwxr-xr-x   7 root root  4096 Jan 30 21:22 var
bash-5.0# uname -a
Linux ip-172-31-46-18.us-west-2.compute.internal 4.19.75 #1 SMP Thu Jan 30 18:15:30 UTC 2020 aarch64 GNU/Linux
bash-5.0# exit
[ec2-user@ip-172-31-46-18 ~]$ aws --region us-west-2 ec2 describe-images --image-id $(curl -s 169.254.169.254/latest/meta-data/ami-id)
{
    "Images": [
        {
            "VirtualizationType": "hvm",
            "Description": "thar-aarch64-aws-k8s-v0.2.1-35-g06ed3177",
            "Hypervisor": "xen",
            "EnaSupport": true,
            "SriovNetSupport": "simple",
            "ImageId": "ami-006248e261caed532",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda",
                    "Ebs": {
                        "SnapshotId": "snap-0d7dc3a7a56b32ad0",
                        "DeleteOnTermination": true,
                        "VolumeType": "gp2",
                        "VolumeSize": 2,
                        "Encrypted": false
                    }
                },
                {
                    "DeviceName": "/dev/xvdb",
                    "Ebs": {
                        "SnapshotId": "snap-0827fd0adffe7339c",
                        "DeleteOnTermination": true,
                        "VolumeType": "gp2",
                        "VolumeSize": 20,
                        "Encrypted": false
                    }
                }
            ],
            "Architecture": "arm64",
            "ImageLocation": "111111111111/a1sauce-thar-aarch64-aws-k8s-v0.2.1-35-g06ed3177",
            "RootDeviceType": "ebs",
            "OwnerId": "111111111111",
            "RootDeviceName": "/dev/xvda",
            "CreationDate": "2020-01-30T21:21:59.000Z",
            "Public": false,
            "ImageType": "machine",
            "Name": "a1sauce-thar-aarch64-aws-k8s-v0.2.1-35-g06ed3177"
        }
    ]
}

# amazon-linux-extras enable docker && sudo yum install -y containerd
-bash-4.2# sudo ctr -a /.thar/rootfs/run/host-containerd/containerd.sock c ls
CONTAINER    IMAGE                                                                                RUNTIME
admin        ecr.aws/arn:aws:ecr:us-west-2:111111111111:repository/thar-admin:9c7813c1-arm64      io.containerd.runtime.v1.linux
control      ecr.aws/arn:aws:ecr:us-west-2:111111111111:repository/thar-control:06ed3177-arm64    io.containerd.runtime.v1.linux

@jahkeup
Member

jahkeup commented Jan 30, 2020

There are some outstanding changes that I think are needed based on the output of the journal:

RNG initialization significantly delays the startup process (nearly four minutes pass in the journal below before "crng init done"), and the lack of console output during this time made me uneasy, because another service printed an error just before the gap:

...
Jan 30 21:22:29 localhost systemd[1]: Started wicked network nanny service.
Jan 30 21:22:29 localhost systemd[1]: Starting wicked managed network interfaces...
Jan 30 21:22:29 localhost wickedd-dhcp4[297]: eth0: Request to acquire DHCPv4 lease with UUID 1549335e-c914-0e00-2d01-000002000000
Jan 30 21:22:29 localhost wickedd-dhcp6[298]: eth0: Request to acquire DHCPv6 lease with UUID 1549335e-c914-0e00-2d01-000003000000 in mode auto
Jan 30 21:22:30 localhost wickedd-dhcp4[297]: eth0: Committed DHCPv4 lease with address 172.31.46.18 (lease time 3600 sec, renew in 1800 sec, rebind in 3150 sec)
Jan 30 21:22:59 localhost wicked[305]: eth0            setup-in-progress
Jan 30 21:22:59 localhost systemd[1]: Started wicked managed network interfaces.
Jan 30 21:22:59 localhost systemd[1]: Reached target Network.
Jan 30 21:22:59 localhost systemd[1]: Reached target Network is Online.
Jan 30 21:23:22 localhost systemd[1]: Starting What's Updog?...
Jan 30 21:23:22 localhost updog[308]: Failed to read config file /etc/updog.toml: No such file or directory (os error 2)
Jan 30 21:23:22 localhost systemd[1]: updog.service: Main process exited, code=exited, status=1/FAILURE
Jan 30 21:23:22 localhost systemd[1]: updog.service: Failed with result 'exit-code'.
Jan 30 21:23:22 localhost systemd[1]: Failed to start What's Updog?.

** no output, makes ^the above look suspect **

Jan 30 21:27:11 localhost kernel: random: crng init done
Jan 30 21:27:11 localhost kernel: random: 7 urandom warning(s) missed due to ratelimiting
...

host-ctr should try to find the architecture-specific image and fall back as a compatibility mechanism (as Kubernetes does, for example), so that a specific userdata.toml like this one isn't needed:

[settings.host-containers.admin]
enabled = true
source = "111111111111.dkr.ecr.us-west-2.amazonaws.com/thar-admin:9c7813c1-arm64"

[settings.host-containers.control]
enabled = true
source = "111111111111.dkr.ecr.us-west-2.amazonaws.com/thar-control:06ed3177-arm64"
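
The fallback idea can be sketched as "try the arch-specific reference first, then the generic one". This is an illustration only and not how host-ctr is implemented; `pull_first_available` is a hypothetical name, and its first argument is assumed to be any command that takes an image reference and exits non-zero when the pull fails.

```shell
# Try each candidate image reference in order and print the first one the
# given pull command accepts. "$1" is the pull command (an assumption: any
# command that takes one image reference and exits non-zero on failure).
pull_first_available() {
    pull_cmd=$1
    shift
    for ref in "$@"; do
        if "$pull_cmd" "$ref" >/dev/null 2>&1; then
            echo "$ref"
            return 0
        fi
    done
    echo "no candidate image could be pulled" >&2
    return 1
}

# Example (not run here; my_pull is a placeholder for a real pull command):
# pull_first_available my_pull \
#     "111111111111.dkr.ecr.us-west-2.amazonaws.com/thar-admin:9c7813c1-arm64" \
#     "111111111111.dkr.ecr.us-west-2.amazonaws.com/thar-admin:9c7813c1"
```

With a mechanism like this, the same generic source value could be used in the settings on every architecture.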

Applying the sysctl settings from release-sysctl.conf results in an error that needs investigation:

Jan 30 21:22:26 localhost systemd-sysctl[233]: Couldn't write '1' to 'kernel/unprivileged_bpf_disabled', ignoring: No such file or directory
Jan 30 21:22:26 localhost systemd[1]: Started Apply Kernel Variables.

Arch specific image names (a la #689, #694)

I misinterpreted Kubernetes' multiarch compatibility handling. The PRs need to be updated to create images named registry.example.com/myimage-ARCH:mytag; they currently use registry.example.com/myimage:mytag-ARCH, which I now know is wrong 😄 👍.
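
For illustration, the name-suffix scheme can be derived mechanically from a generic reference. This is a rough sketch with a hypothetical helper, `arch_image_ref`; it assumes a plain name or name:tag reference and does not handle digest references (name@sha256:...).

```shell
# Hypothetical helper: rewrite "registry/name:tag" as "registry/name-ARCH:tag".
# Only the last path component is inspected for a tag, so a registry port
# such as "localhost:5000" is not mistaken for one.
# Limitation: digest references ("name@sha256:...") are not supported.
arch_image_ref() {
    ref=$1
    arch=$2
    last=${ref##*/}    # final path component, e.g. "myimage:mytag"
    case "$last" in
        *:*) echo "${ref%:*}-${arch}:${last##*:}" ;;
        *)   echo "${ref}-${arch}" ;;
    esac
}

# Example:
# arch_image_ref registry.example.com/myimage:mytag arm64
```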

Pause container for k8s

We still need to get a pause container configured for use, though it's clear that the kubelet is trying the Kubernetes multiarch fallback name mentioned above, which is encouraging!

Jan 30 21:27:26 ip-172-31-46-18.us-west-2.compute.internal host-ctr[484]: time="2020-01-30T21:27:26Z" level=info msg="Pulling with Amazon ECR Resolver" ref="ecr.aws/arn:aws:ecr:us-west-2:602401143452:repository/eks/pause-arm64:3.1"

updog error on boot (I don't think this is arm64 specific)

We could add a condition to this service that prevents it from starting on first boot when there's no configuration file. If it doesn't support running without this file, then we should express that and prevent it from running, IMO:

# updog.service
[Unit]
...
ConditionPathExists=/etc/updog.toml
...

Otherwise, the console displays the following during boot:

Jan 30 21:23:22 localhost systemd[1]: Starting What's Updog?...
Jan 30 21:23:22 localhost updog[308]: Failed to read config file /etc/updog.toml: No such file or directory (os error 2)
Jan 30 21:23:22 localhost systemd[1]: updog.service: Main process exited, code=exited, status=1/FAILURE
Jan 30 21:23:22 localhost systemd[1]: updog.service: Failed with result 'exit-code'.
Jan 30 21:23:22 localhost systemd[1]: Failed to start What's Updog?.

@jahkeup
Member

jahkeup commented Jan 30, 2020

Here's the full journald output from the boot the snippets were taken from.

@jahkeup
Member

jahkeup commented Jan 30, 2020

@bcressey I think we might have to add some CONFIG settings to align the x86_64 and aarch64 builds. I'm just digging in, but these are the related CONFIGs that I found to differ:

[jakeev@ip-172-31-56-129:~/thar/kernel]$ ls ../kernel*.src.rpm; diff -aur <(cat config-x86_64 | awk '/^# CONFIG/ {print $2,$0} /^CONFIG/ {print $1,$0}' | sort | awk '{$1=""; print}') <(cat config-aarch64 | awk '/^# CONFIG/ {print $2,$0} /^CONFIG/ {print $1,$0}' | sort | awk '{$1=""; print}') | grep -i bpf
../kernel-4.19.75-28.73.amzn2.src.rpm
- CONFIG_BPF_EVENTS=y
  # CONFIG_BPFILTER is not set
- CONFIG_BPF_JIT_ALWAYS_ON=y
  CONFIG_BPF_JIT=y
- # CONFIG_BPF_KPROBE_OVERRIDE is not set
- CONFIG_BPF_STREAM_PARSER=y
- CONFIG_BPF_SYSCALL=y
+ # CONFIG_BPF_SYSCALL is not set
  CONFIG_BPF=y
- CONFIG_CGROUP_BPF=y
  CONFIG_HAVE_EBPF_JIT=y
  CONFIG_LWTUNNEL_BPF=y
- CONFIG_NET_ACT_BPF=m
+ # CONFIG_NET_ACT_BPF is not set
  CONFIG_NET_CLS_BPF=m
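
The same comparison can be narrowed to individual options. What follows is a minimal sketch, assuming kernel config files in the usual "CONFIG_FOO=y" / "# CONFIG_FOO is not set" format; `config_state` and `compare_options` are hypothetical helper names and not part of the build tooling. Note that the kernel.unprivileged_bpf_disabled sysctl is only present when CONFIG_BPF_SYSCALL is enabled, so the unset option on aarch64 would be consistent with the systemd-sysctl error reported earlier in this thread.

```shell
# Print the state of one kernel config option as recorded in a config file:
# its value ("y", "m", ...) or "unset" when it is absent or "is not set".
config_state() {
    file=$1
    opt=$2
    value=$(sed -n "s/^${opt}=//p" "$file" | head -n 1)
    if [ -n "$value" ]; then
        echo "$value"
    else
        echo "unset"
    fi
}

# Report the listed options whose state differs between two config files.
compare_options() {
    a=$1
    b=$2
    shift 2
    for opt in "$@"; do
        sa=$(config_state "$a" "$opt")
        sb=$(config_state "$b" "$opt")
        [ "$sa" = "$sb" ] || echo "$opt: $sa -> $sb"
    done
}

# Example (file names as used in the command above):
# compare_options config-x86_64 config-aarch64 CONFIG_BPF_SYSCALL CONFIG_NET_ACT_BPF
```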

@jahkeup
Member

jahkeup commented Jan 30, 2020

Re: updog error on boot

It appears that updog.service and updog.timer need some added dependencies and/or ordering to prevent them from starting before the configuration file is generated.

@jhaynes jhaynes added this to Researching in Bottlerocket Roadmap Feb 17, 2020
jahkeup added a commit that referenced this issue Feb 20, 2020
Builds will be performed for all architectures declared. The golang
toolchain can perform the compilation when the host's build of the
toolchain is configured to include target arch.

Related to #468

Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
@jhaynes jhaynes modified the milestones: Public Preview, GA Feb 21, 2020
jahkeup added a commit to bottlerocket-os/bottlerocket-update-operator that referenced this issue Feb 21, 2020
Builds will be performed for host architecture by default. The go
toolchain can perform the compilation when the host's build of the
toolchain is configured to include target arch. The build can be
made to explicitly target another architecture at the user's request.

Related to bottlerocket-os/bottlerocket#468

Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
jahkeup added a commit to bottlerocket-os/bottlerocket-update-operator that referenced this issue Feb 24, 2020
Builds will be performed for host architecture by default. The go
toolchain can perform the compilation when the host's build of the
toolchain is configured to include target arch. The build can be
made to explicitly target another architecture at the user's request.

Related to bottlerocket-os/bottlerocket#468

Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
@vielmetti

Good to see this work underway. It's been over a month since the last specific comment here; is it possible to get an update?

@jahkeup
Member

jahkeup commented Mar 13, 2020

We were able to verify that ARM builds of the OS and other components are possible and work:

However, this only covers a working Bottlerocket OS image; there are other changes that we depend on to make ARM cluster nodes usable.

We're tracking the public roadmap issue regarding ECR's multiarch support to address the need for architecture-specific settings.kubernetes.pod-infra-container-image values (and our other container images). Ideally, we'd have our images built for each architecture, distributed together with a multiarch manifest, and referenced uniformly in settings.*.
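
As a sketch of what "distributed together with a multiarch manifest" could look like, the helper below prints (rather than runs) the Docker CLI commands that assemble a manifest list from per-architecture images. It assumes the per-architecture images are named NAME-ARCH:TAG and that the Docker CLI's manifest subcommands are available; `manifest_commands` and the image names are hypothetical, and this is not necessarily how the project's images are published.

```shell
# Print the docker commands that would assemble and push a multi-arch
# manifest list "NAME:TAG" from per-architecture images "NAME-ARCH:TAG".
# Nothing is executed; the naming scheme is an assumption for this sketch.
manifest_commands() {
    name=$1
    tag=$2
    shift 2
    list="${name}:${tag}"
    members=""
    for arch in "$@"; do
        members="${members} ${name}-${arch}:${tag}"
    done
    echo "docker manifest create ${list}${members}"
    for arch in "$@"; do
        echo "docker manifest annotate --arch ${arch} ${list} ${name}-${arch}:${tag}"
    done
    echo "docker manifest push ${list}"
}

# Example:
# manifest_commands registry.example.com/myimage mytag amd64 arm64
```

Once such a list exists, a single reference (registry.example.com/myimage:mytag in the example) resolves to the right image on each architecture, which is what allows uniform values in settings.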

@jhaynes jhaynes moved this from Researching to Coming Soon in Bottlerocket Roadmap May 1, 2020
@jhaynes
Contributor

jhaynes commented May 1, 2020

ECR now has multiarch support!

@scottmalkie

EKS support for Graviton/Graviton2 went GA yesterday! Would be great to have "out of the box" multi-arch support for bottlerocket-based clusters.

@jahkeup
Member

jahkeup commented Aug 18, 2020

Thanks for checking in! Bottlerocket now has ARM images built and available for use!

The images can be queried from SSM with the method you'd use for x86_64 images, replacing x86_64 with arm64. For example, to fetch the latest k8s 1.17 image, the parameter name would be:

/aws/service/bottlerocket/aws-k8s-1.17/arm64/latest/image_id

Please see the Finding an AMI section of the docs for more details.
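
As a convenience, the parameter name can be built from its parts. This is a minimal sketch; `bottlerocket_ami_param` is a hypothetical helper, and the optional third argument (a specific version in place of latest) is an assumption that should be checked against the docs. The lookup in the comment uses the AWS CLI and needs credentials, so it is not run here.

```shell
# Hypothetical helper: build the SSM parameter name for a Bottlerocket AMI ID.
#   $1 = variant (e.g. aws-k8s-1.17), $2 = arch (x86_64 or arm64),
#   $3 = version (optional; defaults to "latest", other values are an assumption)
bottlerocket_ami_param() {
    variant=$1
    arch=$2
    version=${3:-latest}
    echo "/aws/service/bottlerocket/${variant}/${arch}/${version}/image_id"
}

# Example lookup (requires AWS credentials; not run here):
# aws ssm get-parameter --region us-west-2 \
#     --name "$(bottlerocket_ami_param aws-k8s-1.17 arm64)" \
#     --query Parameter.Value --output text
```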

@jahkeup jahkeup closed this as completed Aug 18, 2020
@jhaynes jhaynes moved this from Coming Soon to Just Shipped in Bottlerocket Roadmap Sep 1, 2020