Instances on k8s 1.26 image don't show up in SSM #3544

Closed
Shershebnev opened this issue Oct 21, 2023 · 5 comments
Labels
status/needs-triage Pending triage or re-evaluation type/bug Something isn't working

Comments

@Shershebnev

Probably related to #3525
I'm trying to follow this article: https://aws.amazon.com/blogs/containers/reduce-container-startup-time-on-amazon-eks-with-bottlerocket-data-volume/ (code: https://github.com/aws-samples/containers-blog-maelstrom/blob/main/bottlerocket-images-cache/snapshot.sh). It uses the aws-k8s-1.24 image by default (/aws/service/bottlerocket/aws-k8s-1.24/x86_64/latest/image_id), but since my EKS cluster is on 1.26, I changed the Bottlerocket image to 1.26 to match. With 1.24 everything works fine and the instance appears in SSM almost immediately, but with 1.26 it never shows up in SSM. (I also added an instance name to the CloudFormation stack, since I saw in an old issue that SSM skips instances without names.) I've also tried 1.25, which works fine, and 1.27, which also never shows up.
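
Switching between image variants comes down to changing one segment of the public SSM parameter path quoted above. As a minimal sketch (the helper names `bottlerocket_ami_param` and `resolve_bottlerocket_ami` are made up for illustration and are not part of snapshot.sh), the path can be built and resolved like this:

```shell
# Build the public SSM parameter path for a Bottlerocket variant.
# The layout mirrors the path quoted in this issue; the function name
# and argument order are illustrative assumptions.
bottlerocket_ami_param() {
  local variant="$1"            # e.g. aws-k8s-1.26 or aws-k8s-1.26-nvidia
  local arch="${2:-x86_64}"     # architecture segment, x86_64 by default
  printf '/aws/service/bottlerocket/%s/%s/latest/image_id\n' "$variant" "$arch"
}

# Resolve that parameter to an AMI ID.
# Needs AWS credentials and a configured region.
resolve_bottlerocket_ami() {
  aws ssm get-parameter \
    --name "$(bottlerocket_ami_param "$@")" \
    --query 'Parameter.Value' --output text
}
```

For example, `resolve_bottlerocket_ami aws-k8s-1.26` would print the AMI ID that the template's `AmiID` parameter resolves to in the current region.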

Image I'm using:
bottlerocket-aws-k8s-1.26-x86_64-v1.15.1-264e294c (latest for 1.26 currently)
bottlerocket-aws-k8s-1.26-x86_64-v1.14.3-764e37e4 (tried an older one as well)

What I expected to happen:
Instance shows up in SSM

What actually happened:
It never shows up

How to reproduce the problem:
Original CF stack - https://github.com/aws-samples/containers-blog-maelstrom/blob/main/bottlerocket-images-cache/ebs-snapshot-instance.yaml

CF stack with my modifications (instance name added and EBS volume size increased):
AWSTemplateFormatVersion: 2010-09-09
Description: Bottlerocket instance to snapshot data volume.

Parameters:
  AmiID:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Description: "The ID of the AMI."
    Default: /aws/service/bottlerocket/aws-k8s-1.26-nvidia/x86_64/latest/image_id
  InstanceType:
    Type: String
    Description: "EC2 instance type to launch"
    Default: t2.small

Resources:
  BottlerocketInstance:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: !Ref AmiID
      InstanceType: !Ref InstanceType
      IamInstanceProfile: !Ref BottlerocketNodeInstanceProfile
      BlockDeviceMappings:
        - DeviceName: "/dev/xvdb"
          Ebs:
            VolumeType: "gp3"
            VolumeSize: "100"
            DeleteOnTermination: "true"
            Encrypted: "true"
      UserData:
        Fn::Base64: |
          [settings.host-containers.admin]
          enabled = true
      Tags:
        - Key: "Name"
          Value: "bottlerocket-snapshot-instance"
  BottlerocketNodeInstanceProfile:
    Type: "AWS::IAM::InstanceProfile"
    Properties:
      Path: "/"
      Roles:
        - Ref: "BottlerocketNodeRole"
  BottlerocketNodeRole:
    Type: "AWS::IAM::Role"
    Properties:
      Path: /
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                !Sub "ec2.${AWS::URLSuffix}"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonSSMManagedInstanceCore"

Outputs:
  InstanceId:
    Value: !Ref BottlerocketInstance
    Description: Instance Id

Just the relevant parts of the original script from the GitHub link above:
CFN_STACK_NAME="Bottlerocket-ebs-snapshot"
INSTANCE_TYPE="g4dn.xlarge"
AMI_ID="/aws/service/bottlerocket/aws-k8s-1.26/x86_64/latest/image_id"
aws cloudformation deploy --stack-name $CFN_STACK_NAME --template-file ebs-snapshot-instance.yaml --capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides AmiID=$AMI_ID InstanceType=$INSTANCE_TYPE
INSTANCE_ID=$(aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='InstanceId'].OutputValue" --output text)


while [[ $(aws ssm describe-instance-information --filters "Key=InstanceIds,Values=$INSTANCE_ID" --query "InstanceInformationList[0].PingStatus" --output text) != "Online" ]]
do
   echo "waiting"
   sleep 5
done
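
The loop above never exits if the instance fails to register with SSM, which is exactly the failure described in this issue. A hedged variant with a deadline might look like the following; `wait_for_ssm_online`, the 600-second default, and the messages are assumptions for illustration, not part of the original script:

```shell
# Poll SSM until the instance reports "Online", giving up after a deadline.
# wait_for_ssm_online is a hypothetical helper, not part of snapshot.sh.
wait_for_ssm_online() {
  local instance_id="$1"
  local timeout="${2:-600}"   # seconds before giving up (assumed default)
  local interval="${3:-5}"    # seconds between polls, as in the original loop
  local waited=0
  local status

  while :; do
    # Same query as the original loop; a failed call counts as "not online yet".
    status=$(aws ssm describe-instance-information \
      --filters "Key=InstanceIds,Values=${instance_id}" \
      --query "InstanceInformationList[0].PingStatus" \
      --output text 2>/dev/null) || status=""
    if [ "$status" = "Online" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${waited}s; last PingStatus: ${status:-none}" >&2
      return 1
    fi
    echo "waiting (${waited}s elapsed)"
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}
```

Called as `wait_for_ssm_online "$INSTANCE_ID" 600 5 || exit 1`, the script would stop with an error instead of printing "waiting" indefinitely, which is a natural point to go and look at the instance's system log.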

When I switch from 1.26 to 1.24 or 1.25, everything works fine, but 1.27 doesn't show up either.
Just in case, I've also tried CPU-only instances as well as the NVIDIA variant of the image.

Since my purpose is to just use the data volume for docker images, should I just use 1.24/1.25 images?

@Shershebnev Shershebnev added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Oct 21, 2023
@zmrow
Contributor

zmrow commented Oct 23, 2023

Thanks for the report @Shershebnev !

Are any errors reported on the instance via the console using "Get system log", "Get instance screenshot", or "EC2 serial console"? I suspect that some additional user data and/or roles may be needed for the instance. I'm looking back over the changelogs to confirm this suspicion and explain the differences between k8s versions.

@Shershebnev
Author

The system log is empty, but the screenshot shows an encryption error:
i-010cd92721081bc01
That's on Intel-based instances (m6i.4xlarge)
I've also tried an AMD-based instance (m6a.4xlarge); it seems to be stuck on booting:
i-09bdddf62b644452c
I've also tried the oldest AMI I can see, 1.13.0 (bottlerocket-aws-k8s-1.26-x86_64-v1.13.0-f7a2e3cc), and it works fine: it shows the same encryption error, but it proceeds past it and appears in SSM almost immediately.
1.14.0, however, gets stuck.
At this point I realized that I already have EKS nodes that I switched to Bottlerocket, and they work fine on the latest AMI for 1.26, though on the NVIDIA variant (bottlerocket-aws-k8s-1.26-nvidia-x86_64-v1.15.1-264e294c); they appear in SSM as well. The only differences I could see are the /dev/xvda root volume size (4 GB vs 2 GB) and the EKS nodes being on the NVIDIA variant. I changed both on the standalone instance, and with that setup it seems to get past the encryption error but then still gets stuck:
i-0ce93f121a3bf8b3a
After some more waiting, I got a system log ending with:

[  305.391718] sundog[1858]: Setting generator 'pluto private-dns-name' failed with exit code 1 - stderr: Timed out retrieving private DNS name from EC2: deadline has elapsed
[FAILED] Failed to start User-specified setting generators.
See 'systemctl status sundog.service' for details.
[DEPEND] Dependency failed for Bottlerocket initial configuration complete.
[DEPEND] Dependency failed for Isolates configured.target.
[DEPEND] Dependency failed for Applies settings to create config files.
[DEPEND] Dependency failed for Send signal to CloudFormation Stack.
[DEPEND] Dependency failed for Sets the hostname.

i-0ce93f121a3bf8b3a.log
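
To pull only the failure lines out of a longer system log like the one above, a small filter can help. This is a sketch under assumptions: `failed_boot_units` is a made-up helper name, and the patterns only cover the markers visible in this log excerpt.

```shell
# Print only the failure-related lines from an EC2 system log.
# Reads the files given as arguments, or stdin when none are given.
# The patterns match the "[FAILED]" and "[DEPEND]" markers and the
# "failed with exit code" message seen in the excerpt above.
failed_boot_units() {
  cat "$@" | grep -E '\[(FAILED|DEPEND)\]|failed with exit code'
}

# Possible usage against a live instance (needs AWS credentials):
#   aws ec2 get-console-output --instance-id "$INSTANCE_ID" \
#     --query Output --output text | failed_boot_units
```

On the excerpt above this would keep the sundog line that names the failing generator, 'pluto private-dns-name', along with the units that failed as a consequence.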
I can confirm that DNS resolution is enabled on this VPC.
There seems to be a related issue, #3064, but my failing instances are in a public subnet, so this doesn't seem to be caused by what was going on there. My EKS nodes, which seem to work fine, are in private subnets.
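
One way to double-check the VPC DNS settings mentioned above from the command line is to read both DNS attributes. The helper name `vpc_dns_attributes` and its output format are assumptions made for this sketch:

```shell
# Print the two DNS-related attributes of a VPC on one line.
# vpc_dns_attributes is a hypothetical helper; needs AWS credentials
# with permission to call ec2:DescribeVpcAttribute.
vpc_dns_attributes() {
  local vpc_id="$1"
  local support
  local hostnames
  # describe-vpc-attribute accepts one attribute per call.
  support=$(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsSupport \
    --query 'EnableDnsSupport.Value' --output text) || return 1
  hostnames=$(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsHostnames \
    --query 'EnableDnsHostnames.Value' --output text) || return 1
  echo "enableDnsSupport=${support} enableDnsHostnames=${hostnames}"
}
```

If both values come back enabled, as reported here, the "Timed out retrieving private DNS name from EC2" message is more likely to have a cause other than the VPC's DNS configuration.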

This turned into quite a long post, sorry about that. In a nutshell:

  • When starting in a public subnet as standalone instances:
    • AMI version 1.13.0 seems to work fine and appears in SSM almost immediately, even with a 2 GB root volume.
    • AMI version 1.14.0 and later (including the latest version) gets stuck, either on the encryption error or, with the root volume increased to 4 GB, for several minutes before arriving at the DNS resolution error from the log above.
  • In EKS, however, nodes started in private subnets seem to work fine (the encryption error is still visible). They start with a 4 GB root volume. (I also find it strange that the default root volume size seems to differ, as I don't specify the root volume size in EKS explicitly.)

Hope this is helpful :)

@yeazelm
Contributor

yeazelm commented Oct 25, 2023

Related to #3525 (comment): I think we might need to add EC2 Describe Images access to the IAM role policies attached in https://github.com/aws-samples/containers-blog-maelstrom/blob/ee8e18c0bb170f625b86a59dfc0605e9c98cdee3/bottlerocket-images-cache/ebs-snapshot-instance.yaml#L44. For example, I have AmazonEKSWorkerNodePolicy attached with:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVolumes",
                "ec2:DescribeVolumesModifications",
                "ec2:DescribeVpcs",
                "eks:DescribeCluster"
            ],
            "Resource": "*"
        }
    ]
}

as the policy. This might be the missing piece. Can you try this and see if it resolves the issues with 1.26 coming up? If so, we can try and get this other repo updated to cover this permissions addition.
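
Before changing the template, it may be worth checking whether the instance role already allows a given action. The sketch below uses IAM policy simulation; `role_allows_action` is a made-up helper name, and choosing `ec2:DescribeInstances` as the action to test is an assumption based on the policy listed above and the private DNS name error in the system log.

```shell
# Return 0 when IAM policy simulation reports that the role may perform
# the action, 1 when it may not, and 2 when the simulation call fails.
# role_allows_action is a hypothetical helper; the caller needs
# iam:SimulatePrincipalPolicy permission.
role_allows_action() {
  local role_arn="$1"
  local action="$2"
  local decision
  decision=$(aws iam simulate-principal-policy \
    --policy-source-arn "$role_arn" \
    --action-names "$action" \
    --query 'EvaluationResults[0].EvalDecision' --output text) || return 2
  # EvalDecision is one of: allowed, explicitDeny, implicitDeny.
  [ "$decision" = "allowed" ]
}
```

A call such as `role_allows_action "$ROLE_ARN" ec2:DescribeInstances`, where `$ROLE_ARN` is the ARN of the role created by the stack, would then show whether the permission is missing.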

@Shershebnev
Author

I've tried with AmazonEC2ReadOnlyAccess AWS managed policy, everything works now on latest 1.26 🎉
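
For anyone wanting to try the same fix on an already-deployed stack, a rough sketch follows. It assumes the role's logical ID is `BottlerocketNodeRole`, as in the template earlier in this issue; `attach_ec2_readonly` is a made-up helper name. Attaching a policy outside CloudFormation causes stack drift, so adding the policy ARN to `ManagedPolicyArns` in the template is the more durable route.

```shell
# Attach the AmazonEC2ReadOnlyAccess managed policy to the role that the
# snapshot stack created. attach_ec2_readonly is a hypothetical helper.
attach_ec2_readonly() {
  local stack_name="$1"
  local partition="${2:-aws}"   # e.g. aws, aws-cn, aws-us-gov
  local role_name
  # For an AWS::IAM::Role resource, the physical resource ID is the role name.
  role_name=$(aws cloudformation describe-stack-resource \
    --stack-name "$stack_name" \
    --logical-resource-id BottlerocketNodeRole \
    --query 'StackResourceDetail.PhysicalResourceId' --output text) || return 1
  aws iam attach-role-policy \
    --role-name "$role_name" \
    --policy-arn "arn:${partition}:iam::aws:policy/AmazonEC2ReadOnlyAccess"
}
```

With the stack name from the script above, the call would be `attach_ec2_readonly Bottlerocket-ebs-snapshot`. The policy listed in the previous comment suggests a narrower set of `ec2:Describe*` actions may be enough, but the thread does not confirm the minimum.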

@yeazelm
Contributor

yeazelm commented Oct 30, 2023

Sounds great! Glad we got you sorted!

@yeazelm yeazelm closed this as completed Oct 30, 2023
3 participants