Add windows server 2019 packer template #546

Merged
merged 8 commits into from
Apr 16, 2019

Conversation

jeremiahsnapp
Contributor

@jeremiahsnapp jeremiahsnapp commented Mar 13, 2019

This adds a packer template that creates a Windows Server 2019 (with docker installed) AMI that is fully functional with elastic-ci-stack-for-aws cloudformation. It even works with Buildkite's docker plugins.

The following lists the few things I identified as missing when compared with the existing Amazon Linux 2 packer template.

  • buildkite-agent user account is not created
  • AuthorizedUsersUrl cloudformation setting does nothing
  • BuildkiteAdditionalSudoPermissions cloudformation setting does nothing because it has no context in windows
  • EnableDockerUserNamespaceRemap cloudformation setting does nothing because docker userns-remap functionality only works on linux
  • bk-check-disk-space.sh script (equivalent windows script is not created)
  • fix-buildkite-agent-builds-permissions script (equivalent windows script is not created but I'm not sure we need this on Windows)
  • docker-gc hourly cron job (equivalent windows scheduled task is not created)
  • docker-low-disk-gc hourly cron job (equivalent windows scheduled task is not created)
  • git-lfs is not explicitly installed but the output of choco install git makes me wonder if it actually installs it
  • goss is not installed because it is only supported on linux

To use the Windows AMI we download Buildkite's latest cloudformation yaml to aws-windows-stack.yml and replace the UserData section with the following content.

wget -O aws-windows-stack.yml https://s3.amazonaws.com/buildkite-aws-stack/latest/aws-stack.yml
      UserData:
        "Fn::Base64": !Sub
          - |
            <powershell>
            $Env:DOCKER_USERNS_REMAP="${EnableDockerUserNamespaceRemap}"
            $Env:DOCKER_EXPERIMENTAL="${EnableDockerExperimental}"
            powershell -file C:\buildkite-agent\bin\bk-configure-docker.ps1 >> C:\buildkite-agent\elastic-stack.log

            $Env:BUILDKITE_STACK_NAME="${AWS::StackName}"
            $Env:BUILDKITE_STACK_VERSION="v4.3.1"
            $Env:BUILDKITE_LAMBDA_AUTOSCALING="${LambdaAutoscaling}"
            $Env:BUILDKITE_SCALE_DOWN_PERIOD="${ScaleDownPeriod}"
            $Env:BUILDKITE_SECRETS_BUCKET="${LocalSecretsBucket}"
            $Env:BUILDKITE_AGENT_TOKEN="${BuildkiteAgentToken}"
            $Env:BUILDKITE_AGENTS_PER_INSTANCE="${AgentsPerInstance}"
            $Env:BUILDKITE_AGENT_TAGS="${BuildkiteAgentTags}"
            $Env:BUILDKITE_AGENT_TIMESTAMP_LINES="${BuildkiteAgentTimestampLines}"
            $Env:BUILDKITE_AGENT_EXPERIMENTS="${BuildkiteAgentExperiments}"
            $Env:BUILDKITE_AGENT_RELEASE="${BuildkiteAgentRelease}"
            $Env:BUILDKITE_QUEUE="${BuildkiteQueue}"
            $Env:BUILDKITE_AGENT_ENABLE_GIT_MIRRORS_EXPERIMENT="${EnableAgentGitMirrorsExperiment}"
            $Env:BUILDKITE_ORG_SLUG="${BuildkiteOrgSlug}"
            $Env:BUILDKITE_ELASTIC_BOOTSTRAP_SCRIPT="${BootstrapScriptUrl}"
            $Env:BUILDKITE_AUTHORIZED_USERS_URL="${AuthorizedUsersUrl}"
            $Env:BUILDKITE_ECR_POLICY="${ECRAccessPolicy}"
            $Env:BUILDKITE_LIFECYCLE_TOPIC="${AgentLifecycleTopic}"
            $Env:BUILDKITE_TERMINATE_INSTANCE_AFTER_JOB="${BuildkiteTerminateInstanceAfterJob}"
            $Env:BUILDKITE_TERMINATE_INSTANCE_AFTER_JOB_TIMEOUT="${BuildkiteTerminateInstanceAfterJobTimeout}"
            $Env:BUILDKITE_TERMINATE_INSTANCE_AFTER_JOB_DECREASE_DESIRED_CAPACITY="${BuildkiteTerminateInstanceAfterJobDecreaseDesiredCapacity}"
            $Env:BUILDKITE_ADDITIONAL_SUDO_PERMISSIONS="${BuildkiteAdditionalSudoPermissions}"
            $Env:BUILDKITE_WINDOWS_ADMINISTRATOR="${BuildkiteWindowsAdministrator}"
            $Env:AWS_DEFAULT_REGION="${AWS::Region}"
            $Env:SECRETS_PLUGIN_ENABLED="${EnableSecretsPlugin}"
            $Env:ECR_PLUGIN_ENABLED="${EnableECRPlugin}"
            $Env:DOCKER_LOGIN_PLUGIN_ENABLED="${EnableDockerLoginPlugin}"
            $Env:AWS_REGION="${AWS::Region}"
            powershell -file C:\buildkite-agent\bin\bk-install-elastic-stack.ps1 >> C:\buildkite-agent\elastic-stack.log
            </powershell>
          - LocalSecretsBucket:
              !If
                - CreateSecretsBucket
                - !Ref ManagedSecretsBucket
                - !Ref SecretsBucket
            LambdaAutoscaling:
              !If
                - UseLambdaAutoscaling
                - true
                - false
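
The manual step of swapping the UserData section could also be scripted. What follows is a minimal Python sketch, not part of this PR: `replace_userdata` is a hypothetical helper, and it assumes the template contains a `UserData:` key whose nested lines are indented deeper than the key itself. It edits raw text because a plain YAML loader such as PyYAML's `safe_load` rejects CloudFormation short-form tags like `!Sub` and `!If`.

```python
def replace_userdata(template_text: str, new_block: str) -> str:
    """Replace the first `UserData:` entry and its nested lines with new_block.

    new_block must already carry the same indentation as the key it replaces.
    """
    lines = template_text.splitlines()
    out = []
    i = 0
    replaced = False
    while i < len(lines):
        line = lines[i]
        stripped = line.lstrip()
        if not replaced and stripped.startswith("UserData:"):
            indent = len(line) - len(stripped)
            out.extend(new_block.splitlines())
            i += 1
            # Skip the old block: blank lines and anything indented deeper
            # than the UserData key belong to it.
            while i < len(lines):
                nxt = lines[i]
                if nxt.strip() and (len(nxt) - len(nxt.lstrip())) <= indent:
                    break
                i += 1
            replaced = True
            continue
        out.append(line)
        i += 1
    if not replaced:
        raise ValueError("no UserData key found in template")
    return "\n".join(out) + "\n"
```

Reading aws-windows-stack.yml, passing its text through this function with the PowerShell block above, and writing the result back would reproduce the hand edit. Blank lines that trail the old block are dropped along with it.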

Then we just use terraform's aws_cloudformation_stack resource, point its template_body at aws-windows-stack.yml and set other parameters appropriately. For example, the following shows the settings we use for our single-use windows queue. It has only one agent per instance and the agent only runs one job and then the instance terminates itself. We use this queue for jobs that must run on the host (not in docker). The ephemeral nature of the instances ensures each job starts with a clean environment.

resource "aws_cloudformation_stack" "buildkite_queue_single_use_windows_privileged" {
  name = "buildkite-single-use-windows-privileged"

  parameters {
    KeyName                                   = "my-key"
    BuildkiteAgentRelease                     = "stable"
    BuildkiteAgentToken                       = "${data.aws_s3_bucket_object.buildkite_agent_token.body}"
    BuildkiteAgentTags                        = "os=windows"
    BuildkiteTerminateInstanceAfterJob        = "true"
    BuildkiteTerminateInstanceAfterJobTimeout = 1800
    BuildkiteWindowsAdministrator             = "true"
    BuildkiteQueue                            = "single-use-windows-privileged"
    AgentsPerInstance                         = 1
    SecretsBucket                             = "my-secrets-bucket"
    ArtifactsBucket                           = ""
    BootstrapScriptUrl                        = "s3://my-secrets-bucket/buildkite_boot_windows.ps1"
    AuthorizedUsersUrl                        = ""
    VpcId                                     = "my-vpc_id"
    Subnets                                   = "my-subnet-id"
    AvailabilityZones                         = ""
    InstanceType                              = "c5.xlarge"
    EnableExperimentalLambdaBasedAutoscaling  = "true"
    SpotPrice                                 = "0"
    MaxSize                                   = 5
    MinSize                                   = 0
    ScaleDownPeriod                           = 1800
    InstanceCreationTimeout                   = "PT15M"
    RootVolumeName                            = "/dev/sda1"
    RootVolumeSize                            = 250
    SecurityGroupId                           = "my-sg-id"
    ImageId                                   = "${data.aws_ami.buildkite_windows.id}"
    ManagedPolicyARN                          = "my-managed-policy-arn"
    ECRAccessPolicy                           = "none"
    AssociatePublicIpAddress                  = "false"
    EnableSecretsPlugin                       = "true"
    EnableECRPlugin                           = "false"
    EnableDockerLoginPlugin                   = "true"
    EnableCostAllocationTags                  = "true"
    EnableDockerUserNamespaceRemap            = "false"
    CostAllocationTagName                     = "X-Application"
    CostAllocationTagValue                    = "buildkite"
  }

  template_body = "${file("cloudformation-templates/aws-windows-stack.yml")}"

  capabilities = ["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"]
}

We currently only use the following buildkite_boot_windows.ps1 bootstrap script to increase docker's storage-opts size setting to enable a larger container filesystem.

# Stop script execution when a non-terminating error occurs
$ErrorActionPreference = "Stop"

$dockerd_config = "C:\ProgramData\docker\config\daemon.json"

If (! (Test-Path $dockerd_config)) {
  Set-Content -Path $dockerd_config -Value "{}"
}

Get-Content $dockerd_config | jq 'if has(\"storage-opts\") then . else .\"storage-opts\"=[] end | .\"storage-opts\" |= map(select(startswith(\"size=\")|not)) + [\"size=120GB\"]' | Set-Content -Path $dockerd_config

Restart-Service docker
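
For readers less familiar with jq, the filter in the script above can be restated in Python. This is only an illustrative sketch of the same transformation, not code from the PR; `set_storage_size` is a hypothetical name. It makes sure `storage-opts` exists, drops any existing `size=` entry, and appends the new size, leaving every other daemon.json key untouched.

```python
import json


def set_storage_size(config: dict, size: str = "120GB") -> dict:
    """Return a copy of a daemon.json dict with storage-opts size set."""
    # Missing key behaves like an empty list, as in the jq `if has(...)` branch.
    opts = config.get("storage-opts", [])
    # Drop any previous size= option so only one remains.
    kept = [opt for opt in opts if not opt.startswith("size=")]
    return {**config, "storage-opts": kept + [f"size={size}"]}


def rewrite_daemon_json(text: str, size: str = "120GB") -> str:
    """Apply set_storage_size to the JSON text of a daemon.json file."""
    return json.dumps(set_storage_size(json.loads(text), size), indent=2)
```

Because an existing `size=` entry is removed before the new one is added, running the bootstrap more than once does not accumulate duplicate options.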

@petemounce

Over here, we had an entertaining time with setting up a local user account to run the bk agent service as (we chose NSSM as the manager). https://serverfault.com/questions/946882/how-to-programmatically-cause-a-new-windows-users-profile-to-be-created is relevant re: creating user profile directory.

"spot_price": "auto",
"spot_price_auto_product": "Windows (Amazon VPC)",
"user_data_file":"scripts/ec2-userdata.ps1",
"communicator": "winrm",


I'm reasonably sure that it is now possible to use openssh-win32 and packer with great success. https://operator-error.com/2018/04/16/windows-amis-with-even/

Contributor Author

This is interesting. I can see value in adding ssh access and enabling the AuthorizedUsersUrl stack setting which would automatically configure the .ssh/authorized_keys file.

https://github.com/buildkite/elastic-ci-stack-for-aws/blob/master/packer/conf/bin/bk-install-elastic-stack.sh#L127-L137

@jeremiahsnapp
Contributor Author

Thanks a lot @petemounce for the link about setting up a local user account. We backed out of our initial effort to run the agent using a buildkite-agent user account because nssm seemed to require the account to have a password. How did you deal with that? I don't set up enough Windows services to know if there's a better way than just creating a randomized password when building the AMI.

@petemounce

petemounce commented Mar 13, 2019

http://ilovepowershell.com/2018/05/28/awesome-and-simple-way-to-generate-random-passwords-with-powershell/ works ok.

We're actually provisioning our images (in GCE) via ansible, and so there we're using the lookup plugin - https://github.com/azavea/ansible-buildkite-agent/blob/develop/tasks/install-on-Windows.yml#L2-L14 has an example.

We create an ansible user for packer to use as follows, and delete it after a successful provisioning run before sysprep. I don't think at the time I had come across the post I'm linking above.

$ErrorActionPreference = 'Stop'; # stop on all errors
$Count = Get-Random -min 24 -max 32
$TempPassword = -join ((65..90) + (97..122) + (48..57) | Get-Random -Count $Count | % {[char]$_})
$UserName = "packeransible"
write-output "Making $UserName ..."
New-LocalUser -Name $UserName -PasswordNeverExpires -Password ($TempPassword | ConvertTo-SecureString -AsPlainText -Force) | out-null
write-output "Adding to Administrators ..."
Add-LocalGroupMember -Group "Administrators" -Member $UserName | out-null
write-output "Saving password to file ..."
set-content -path "$($env:WINDIR)/temp/host.password.txt" -value $TempPassword -NoNewLine
write-output "Finished."

We do that to work around packer-at-the-time not making the WinRMPassword available to its ansible provisioner. That's fixed now.

Edit: I misunderstood you. We create a randomised password for the buildkite-agent user, don't record it anywhere, and that's fine (for us).
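
As a cross-check of what the PowerShell snippet above produces, here is a rough Python equivalent. It is a sketch under assumptions, not the code used in either setup: `random_password` and `ALPHABET` are hypothetical names. `Get-Random -max 32` excludes the upper bound, so the script yields lengths 24 to 31, and `Get-Random -Count` picks from the 62-character set without repeats. The sketch keeps the same length range and character set but draws each character independently from the `secrets` module.

```python
import secrets
import string

# Same character set as the script: A-Z, a-z, 0-9.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def random_password(min_len: int = 24, max_len: int = 31) -> str:
    """Return a random alphanumeric password with min_len..max_len characters."""
    # randbelow(n) returns 0..n-1, so both bounds are inclusive here.
    length = min_len + secrets.randbelow(max_len - min_len + 1)
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Since the password is never recorded, the service account can only be used by whatever was configured with it at image build time, which matches the "don't record it anywhere" approach described above.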

@petemounce

We use windows' new openssh package to run ssh-agent windows service (which the package installs but doesn't auto-start unless you tell it to via a parameter) so we can avoid keeping ssh keys at rest on the filesystem. We have a powershell one-shot nssm service that runs on-boot to fetch an ssh key from somewhere... secret :p ... and loads that into the ssh-agent.

Through a painful trial and error process, I learned how to use icacls to set the appropriate filesystem permissions on the ssh key that the ssh-agent requires to be true before it will accept the load of the key. That's below.

One other piece of painfully won information is that the user who will be using the key(s) needs to be the one to load them - I wasn't able to load them as one user on behalf of the buildkite-agent user.

$write_to = "the path to the file on disk"

# https://superuser.com/questions/1296024/windows-ssh-permissions-for-private-key-are-too-open
# https://github.com/PowerShell/Win32-OpenSSH/wiki/Security-protection-of-various-files-in-Win32-OpenSSH
Write-Host "Setting filesystem permissions on key to allow it to be loaded to ssh-agent."
Write-Host "Giving ownership to $($username), running this script as $($env:username) (should match!)"
& icacls "$write_to"
& icacls "$write_to" /c /t /inheritance:d
& icacls "$write_to" /c /t /grant "$($username):F"
& icacls "$write_to" /c /t /remove Administrator BUILTIN\Administrators BUILTIN Everyone System Users
& icacls "$write_to"

Write-Host "Loading key to agent..."
& ssh-add "$($write_to)"

if ($LASTEXITCODE -ne 0) {
  throw "Failed to load key to ssh-agent."
}

# illustrate success
& ssh-add -L

# so the key material is not left on disk at rest, remove it.
Remove-Item "$write_to" -force

plugins-path="C:\buildkite-agent\plugins"
experiment="${Env:BUILDKITE_AGENT_EXPERIMENTS}"
priority=%n
shell=powershell
Contributor Author

I'd like to know how people feel about using shell=powershell. Does anyone think using the default cmd.exe would be better?

Contributor

In Windows, Powershell is the choice I'd expect.

Contributor

I guess it depends on what versions of Windows this template supports. If we want to be able to support older versions of windows where Powershell's availability and/or stability is suspect, we might want to make it configurable?

Contributor

I'd prefer CMD.exe, just because we don't yet support Powershell officially in the agent.

@jeremiahsnapp
Contributor Author

I want to make a note here that this PR does install lifecycled and configures a handler script but I'm not sure that it really does anything useful. My understanding is that when lifecycled receives a termination notification then it postpones the instance termination while it runs the handler script to gracefully stop the buildkite agents. The graceful agent shutdown allows the agents to finish any running jobs. During my testing on Windows the agent stopped immediately without finishing the running job.

Can someone confirm that the Windows agent isn't able to gracefully stop? If it can't then should we just remove lifecycled from this PR?

@jeremiahsnapp
Contributor Author

jeremiahsnapp commented Mar 18, 2019

> We use windows' new openssh package to run ssh-agent windows service (which the package installs but doesn't auto-start unless you tell it to via a parameter) so we can avoid keeping ssh keys at rest on the filesystem.

@petemounce the implementation we have in this Windows AMI PR uses the same preinstalled bash hooks and docker-login, ecr and secrets plugins that the Amazon Linux AMI uses. For example, the Windows buildkite agent is able to run the agent's bash environment hook which calls the secrets plugin's environment hook which downloads and configures the stack's build secrets. This includes downloading /private_ssh_key and/or /{pipeline-slug}/private_ssh_key from your s3 SecretsBucket and adding it to the Git for Windows ssh-agent without saving to disk. That makes the private key available for the buildkite agent to use for any subsequent work.

I like this implementation because it allows us to reuse as much pre-existing work as possible. Please let me know if you think we're missing anything in our implementation.

@petemounce

That sounds comprehensive. I'm not doing any of that because where I am uses GCP & vault, so had more wiring to do.

@jeremiahsnapp
Contributor Author

> I'm not doing any of that because where I am uses GCP & vault, so had more wiring to do.

That makes sense. We're transitioning to vault too but I'll keep this implementation the way it is for the sake of keeping it compatible with elastic-ci-stack-for-aws.

@petemounce

Sounds good to me. I'm sure at some stage someone will make a vault-integration for the AWS secrets-management thing(s??).

@lox
Contributor

lox commented Mar 24, 2019

Sorry for the slow response, have been on vacation, just catching up now! This looks awesome, will review in more depth. ❤️

@lox
Contributor

lox commented Mar 24, 2019

I'm torn on whether we should try and have this in the same repo. On one hand, there is a single spot to have things, but on the other hand it means a lot of config that might not apply to windows and it raises the bar for adding new features on the linux side.

The other option is to move this code over to a lightweight elastic stack focused on windows at https://github.com/buildkite/elastic-ci-stack-for-aws-windows.

Thoughts folks?

@petemounce

I think it's a better UX for contributors to have things in the same repo

  • it's easier to cross-compare things to see what's in sync and what's not
  • it's easier to review for consistency of approach

I don't think it's necessary for Windows & Linux to be in sync, features-wise, just because they're in the same place. Internally, CI-with-buildkite docs have a feature-matrix - we describe what we offer, then we have a tick or not for each platform, and we fill them in. Sets expectations fine, and shows progress.

@lox
Contributor

lox commented Mar 24, 2019

My other concern is it slows down iteration speed even further as we need to wait for CI for windows and linux. I guess we can do the mono-repo thing of detecting changes in subpaths.

@lox
Contributor

lox commented Mar 24, 2019

I'm leaning towards same repo presently, just thinking it through.

@toolmantim
Contributor

The other advantage of having it in the same repo is that this project can now say it's "Multiplatform" (Amazon Linux and Windows).

For each parameter description, we might need to include one of "(Linux and Windows)", "(Linux only)", or "(Windows only)" type thing?
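
One way to picture the labelling idea is a small lookup from parameter name to supported platforms. The sketch below is illustrative only: `PLATFORM_SUPPORT` and `platform_label` are hypothetical names, and the mapping is a reading of this thread (the PR description says AuthorizedUsersUrl, BuildkiteAdditionalSudoPermissions and EnableDockerUserNamespaceRemap do nothing on Windows), not metadata taken from the stack itself.

```python
# Hypothetical support table inferred from the PR description, not from the stack.
PLATFORM_SUPPORT = {
    "AuthorizedUsersUrl": {"linux"},
    "BuildkiteAdditionalSudoPermissions": {"linux"},
    "EnableDockerUserNamespaceRemap": {"linux"},
    "BuildkiteWindowsAdministrator": {"windows"},
    "BuildkiteQueue": {"linux", "windows"},
}


def platform_label(param: str) -> str:
    """Return the suffix to append to a parameter description."""
    platforms = PLATFORM_SUPPORT[param]
    if platforms == {"linux", "windows"}:
        return "(Linux and Windows)"
    if platforms == {"linux"}:
        return "(Linux only)"
    return "(Windows only)"
```

Keeping the table in one place would also give the feature matrix mentioned earlier in the thread a single source to render from.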

@tduffield
Contributor

My preference would be to have things in a single repository as well. Having something like packer/linux and packer/windows is a nice pattern that I've used elsewhere.

@lox
Contributor

lox commented Mar 26, 2019

Yup, cool, I agree, let's have this in the one repo. @jeremiahsnapp could we move things into a packer/linux vs packer/windows structure? Then perhaps we can merge it and I can work on getting the CI parts of things going.
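
With a packer/linux and packer/windows split, the "detect changes in subpaths" idea raised earlier could look roughly like the following. This is a hedged sketch, not the CI setup that was eventually built: `platforms_to_build` is a hypothetical function, and treating every path outside the two directories as shared is an assumption.

```python
def platforms_to_build(changed_paths):
    """Return the set of platforms whose image needs rebuilding."""
    platforms = set()
    for path in changed_paths:
        if path.startswith("packer/linux/"):
            platforms.add("linux")
        elif path.startswith("packer/windows/"):
            platforms.add("windows")
        else:
            # Assumption: anything else (templates, docs, CI config) is
            # shared, so a change there rebuilds both platforms.
            return {"linux", "windows"}
    return platforms
```

The list of changed paths would come from something like a diff against the base branch; a Windows-only change then skips the Linux build and vice versa.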

@jeremiahsnapp
Contributor Author

@lox I relocated the packer templates and squashed a bunch of the commits. Let me know if there's anything else I can do to help.

@tduffield
Contributor

@lox just wanted to check in on where we are with this. I've been staging some work internally with hopes to consume this.

@lox
Contributor

lox commented Apr 4, 2019

The plan at this stage is to get this merged in soon, but we're debating whether we want to do a major release first with the new fast autoscaling stuff. Will update soon.

Signed-off-by: Jeremiah Snapp <jeremiah@chef.io>
@jeremiahsnapp
Contributor Author

@lox I updated this to be compatible with the 4.3.1 stack so it works with the new lambda scaling as well as the git mirror experiment option. It also uses spawn to configure multiple agents per instance, which has been a really nice improvement, but we look forward to seeing buildkite/agent#985 get fixed so we can configure multiple agents per instance using spawn AND use the new lambda scaling.

@petemounce I also used some of your code example in your comments to create a buildkite-agent user in the Administrators group with a random password and configured the buildkite-agent service to run as that user.

@petemounce

@jeremiahsnapp why grant admin?

@jeremiahsnapp
Contributor Author

@petemounce I'm still developing our Windows AMIs for our testing purposes and I think some of our tests are needing admin privilege but I might just not know of alternative solutions to our needs yet.

Do you think it would be worth having it as a non-admin user by default and adding a cloudformation parameter that would allow us to choose to make it an admin user during instance startup if we wanted? Similar to the BuildkiteAdditionalSudoPermissions for linux.

@petemounce

Personally; yes, definitely.

@lox
Contributor

lox commented Apr 14, 2019

Yeah, I reckon that would be a good idea.

Signed-off-by: Jeremiah Snapp <jeremiah@chef.io>
@jeremiahsnapp
Contributor Author

Ok @lox and @petemounce, I added BuildkiteWindowsAdministrator to make adding the user account to the Administrators group optional. Let me know if you have other suggestions for the parameter name.

@lox
Contributor

lox commented Apr 15, 2019

Ok, lemme get a point release out today for the last of the 4.x series and then let's get this merged into master.

@lox
Contributor

lox commented Apr 16, 2019

Merging this in, thanks for all your hard work @jeremiahsnapp! 💪🏻

@lox lox merged commit b229b06 into buildkite:master Apr 16, 2019
@jeremiahsnapp jeremiahsnapp deleted the add-windows-server-2019-packer-template branch April 16, 2019 10:43
@lox
Contributor

lox commented Apr 17, 2019

FWIW, I've decided to remove the optional Windows Administrator setting in favour of always adding the user to the Admin group. The docker socket wasn't accessible to non-administrators, and with access to the docker socket you effectively have root access anyway.

@petemounce

Would it be reasonable to instead fold access to the docker socket into the flag, so it's possible to run without admin and, in so doing, not have the ability to use docker?

@lox
Contributor

lox commented Apr 17, 2019

Yeah, I'll give it a try.
