Skip to content

feat: Add ability to use custom terraform version#2942

Merged
gerhard merged 14 commits into
dagger:mainfrom
brittandeyoung:feat/terraform-version
Aug 18, 2022
Merged

feat: Add ability to use custom terraform version#2942
gerhard merged 14 commits into
dagger:mainfrom
brittandeyoung:feat/terraform-version

Conversation

@brittandeyoung
Copy link
Copy Markdown
Contributor

Updated all terraform actions and base "run" action to support specifying a custom version of terraform cli.

@brittandeyoung brittandeyoung force-pushed the feat/terraform-version branch 4 times, most recently from 6cd90c9 to 072ca8f Compare August 15, 2022 16:17
Updated all terraform actions and base "run" action to support specifying a custom version of terraform cli.

Signed-off-by: Brittan DeYoung <32572259+brittandeyoung@users.noreply.github.com>
@brittandeyoung brittandeyoung force-pushed the feat/terraform-version branch from 072ca8f to 6d9fc57 Compare August 15, 2022 16:45
We should always default to a specific version of Terraform, otherwise
users of this package may be surprised with new behaviour introduced in
a future latest version. Requiring everyone that uses this package to
default to a version sounds like a lot of busy work.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard gerhard force-pushed the feat/terraform-version branch from 58c0e14 to acddbba Compare August 15, 2022 22:49
@gerhard gerhard self-requested a review August 15, 2022 23:04
Copy link
Copy Markdown
Contributor

@gerhard gerhard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me - ready to merge. WDYT @marcosnils @jpadams?

Does it still work as expected for you @brittandeyoung after my changes?

Comment thread pkg/universe.dagger.io/alpha/terraform/run.cue Outdated
Signed-off-by: Brittan DeYoung <32572259+brittandeyoung@users.noreply.github.com>
@brittandeyoung brittandeyoung force-pushed the feat/terraform-version branch from 55a36a5 to 20e8536 Compare August 16, 2022 12:20
@brittandeyoung
Copy link
Copy Markdown
Contributor Author

@gerhard I am confused as to why there needs to be a default version specified in run when there is a default version specified in the image? I was also unaware you could pass in nested values without declaring them =) so thats cool! saves on repeated definitions. Please take a look at how in run I am conditionally passing in the version if one is set, then in the image the default value is used if a version is not provided. Doesn't this accomplish the same thing without declaring the default value twice?

@marcosnils Good catch! I have moved the input for version to the #Image definition. Please let me know if this resolves your concerns.

@brittandeyoung
Copy link
Copy Markdown
Contributor Author

brittandeyoung commented Aug 16, 2022

Additionally I would like to get thoughts on making the default version latest. I bring this up as this may be a great way to keep the Actions up to date with the latest version of the terraform cli. As long as this is documented, it is usually my experience that users will specify the version of terraform they want if they do not want to use latest. Otherwise developers will have to specify latest or a newer version of the specified default if they need functionality that is in a newer version that the default defined in the image action.

@gerhard @marcosnils

Having this as latest would prevent developers from needing to specify version if they are okay with using the latest version and okay with getting auto updates. Then hard coding to a specific version would be more of a conscience effort.

@marcosnils
Copy link
Copy Markdown
Contributor

@marcosnils Good catch! I have moved the input for version to the #Image definition. Please let me know if this resolves your concerns.

thx for updating this 🙌. I just realized that the reason that we didn't catch this in the tests is because we're not currently testing it. WDYT if we add the following action to also test the docker.#Image flow?

				imageWorkflow: {
			// Set a terraform version for testing specific version of terraform
			_image: docker.#Pull & {
				source: "hashicorp/terraform"
			}

			init: terraform.#Init & {
				container: #input: _image.output
				source: tfSource.output
			}

			workspaceNew: terraform.#Workspace & {
				container: #input: _image.output
				source: init.output
				subCmd: "new"
				name:   "TEST_WORKSPACE"
			}

			plan: terraform.#Plan & {
				container: #input: _image.output
				source: init.output
			}

			apply: terraform.#Apply & {
				container: #input: _image.output
				always: true
				source: plan.output
			}

			verify: #AssertFileRegex & {
				input: apply.output
				path:  "terraform.tfstate"
				regex: string | *"1\\.2\\.7"
			}

			destroy: terraform.#Destroy & {
				container: #input: _image.output
				source: apply.output
				container: #input: _image.output
			}
		}

Having this as latest would prevent developers from needing to specify version if they are okay with using the latest version and okay with getting auto updates. Then hard coding to a specific version would be more of a conscience effort.

I agree that having latest is very convenient for users that prefer to have the latest features without having to hardcode a version in their plans. Having said that, I'd open separate issue to discuss about this since currently a lot of our actions follow the same pattern of having a _#DefaultVersion definition and there's a reason that it was done in the first place. If we decide to change this, we should make the effort to make it work equally for all the actions and address it on a single PR if possible.

@brittandeyoung brittandeyoung force-pushed the feat/terraform-version branch from c9cd7ef to 2e11890 Compare August 17, 2022 12:14
@brittandeyoung
Copy link
Copy Markdown
Contributor Author

@marcosnils New test workflow has been added as suggested. I have also opened #2951 to start the discussion of using latest for the _#DefaultVersion.

Signed-off-by: Brittan DeYoung <32572259+brittandeyoung@users.noreply.github.com>
@brittandeyoung brittandeyoung force-pushed the feat/terraform-version branch from 2e11890 to c6e4d1f Compare August 17, 2022 12:53
Signed-off-by: Brittan DeYoung <32572259+brittandeyoung@users.noreply.github.com>
@brittandeyoung brittandeyoung requested a review from a team as a code owner August 17, 2022 13:31
@brittandeyoung
Copy link
Copy Markdown
Contributor Author

The Universe Tests were exceeding the git hub actions timeout value specified, So I increased it.

@grouville
Copy link
Copy Markdown
Member

grouville commented Aug 17, 2022

The Universe Tests were exceeding the git hub actions timeout value specified, So I increased it.

They shouldn't: in general, it times out when bats fails unproperly and a zombie continues to live whenever it shouldn't.
It can happen when a test is broken by the import path. Whenever you test it locally, how long does it take ?

@brittandeyoung
Copy link
Copy Markdown
Contributor Author

@grouville Unfortunately I am unable to run all the tests locally as I use a M1 Mac. There are tests that rely on docker images that do not have an arm based image. I have been relying on targeting the specific tests I am modifying locally.

@brittandeyoung
Copy link
Copy Markdown
Contributor Author

@grouville I have seen the Universe test run for close to 23 minutes in past PRs. With tests being added, it would not surprise me if it takes longer than 30 min to run in a PR. This is why I bumped the timeout to test on this PR.

@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 17, 2022

Thank you @brittandeyoung for sticking with this!

I have a few local changes which I am yet to push, but for now it's more important to address the existing comments in reverse order, starting with the most recent one first (bottom to top). Here comes the first one:

@brittandeyoung I have seen the Universe test run for close to 23 minutes in past PRs. With tests being added, it would not surprise me if it takes longer than 30 min to run in a PR. This is why I bumped the timeout to test on this PR.

This is a symptom of the shared networking available on the free GitHub Actions tier combined with the variable performance of the GitHub Actions Cache.

Locally, these tests consistently take less than 2 minutes to run. Here is my last run:

image

Instead of increasing the timeout, a better fix is to disable the GitHub Actions Caching. This will result in more consistent durations, which currently are around the 15 mins mark. This fix is already in main. This has more details: #2941 (comment). FTR:

image

My recommendation is to rebase this branch and drop the timeout increase commit.

Moving onto the next comment.

Updated all terraform actions and base "run" action to support specifying a custom version of terraform cli.

Signed-off-by: Brittan DeYoung <32572259+brittandeyoung@users.noreply.github.com>
gerhard and others added 6 commits August 17, 2022 10:34
We should always default to a specific version of Terraform, otherwise
users of this package may be surprised with new behaviour introduced in
a future latest version. Requiring everyone that uses this package to
default to a version sounds like a lot of busy work.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Signed-off-by: Brittan DeYoung <32572259+brittandeyoung@users.noreply.github.com>
Signed-off-by: Brittan DeYoung <32572259+brittandeyoung@users.noreply.github.com>
Signed-off-by: Brittan DeYoung <32572259+brittandeyoung@users.noreply.github.com>
…young/dagger into feat/terraform-version

Signed-off-by: Brittan DeYoung <32572259+brittandeyoung@users.noreply.github.com>
This reverts commit a782b7d.

Signed-off-by: Brittan DeYoung <32572259+brittandeyoung@users.noreply.github.com>
@brittandeyoung brittandeyoung force-pushed the feat/terraform-version branch from 428ee1e to 7b4b7c0 Compare August 17, 2022 14:38
@brittandeyoung
Copy link
Copy Markdown
Contributor Author

@gerhard I have pulled the latest from main and performed a rebase.

@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 17, 2022

@brittandeyoung Unfortunately I am unable to run all the tests locally as I use a M1 Mac. There are tests that rely on docker images that do not have an arm based image. I have been relying on targeting the specific tests I am modifying locally.

I do not use Docker for Mac since this is 8x slower than the Linux equivalent. FTR: https://docs.dagger.io/1200/local-dev

image

I personally run Docker on Linux & use DOCKER_HOST=tcp://100.81.87.121:2375 across all my Macs to connect to it via Tailscale. Technically, I could also use DOCKER_HOST=tcp://192.168.88.22:2375 since it's a physical host on the LAN, but the Tailscale IP ensures that this works consistently even when I'm remote, on the M1 Mac laptop.

Knowing the 8x slowdown, would it still be worth to you for us to invest time in getting these tests to work on Docker for Mac ARM?

@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 17, 2022

@brittandeyoung I have also opened #2951 to start the discussion of using latest for the _#DefaultVersion.

Thank you very much! That was going to be my suggestion too since this is slowly snowballing and I am keen on us getting it merged.

@brittandeyoung
Copy link
Copy Markdown
Contributor Author

@gerhard Thanks for the tips on running it remotely on a linux host. I do not feel these tests need to be "fixed". The examples are the powershell action. The docker image simply does not have a arm64 option. This will likely be the case for a long time. I will work on getting a dev environment using the remote DOCKER_HOST as suggested.

@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 17, 2022

@brittandeyoung Thanks for the tips on running it remotely on a linux host. I will work on getting a dev environment using the remote DOCKER_HOST as suggested.

🙌🏻

@brittandeyoung The examples are the powershell action. The docker image simply does not have a arm64 option. This will likely be the case for a long time.

The simplest thing that I can think of is to disable those tests if the platform is not x86_64. WDYT?

@brittandeyoung
Copy link
Copy Markdown
Contributor Author

The simplest thing that I can think of is to disable those tests if the platform is not x86_64. WDYT?

I think that might help for tests unrelated. But if the user is modifying this action, I think it would be better for them to get an error when running tests locally then it simply not running them.

Maybe there should be some documentation indicating potential issues running tests on ARM platforms instead?

@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 17, 2022

I think that might help for tests unrelated. But if the user is modifying this action, I think it would be better for them to get an error when running tests locally then it simply not running them.

Good point. I believe that is the case currently.

Maybe there should be some documentation indicating potential issues running tests on ARM platforms instead?

I would much rather we made these tests work for ARM. I don't know if we can add a Powershell arm64 image, but that sounds most reasonable.

@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 17, 2022

It looks like bats got "stuck": https://github.com/dagger/dagger/runs/7881156518?check_suite_focus=true#step:6:1097

image

Running this locally now, with an empty cache.

@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 17, 2022

Passes locally (Linux amd64) on first try:

image

Re-triggered the GitHub Action for now - not holding my breath.

This is my last update in this time-box. I might be able to resume later on today, otherwise tomorrow.

My takeaway from this session is that my task to provision a dedicated Docker Engine / GitHub Actions Runner just jumped to the top of my list. These tests should finish in less than 2 minutes. We are wasting a lot of time by not having this. Leaving it here for now. fyi @aluzzardi @samalba @shykes

@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 17, 2022

I continued digging into the make universe-test timeouts, and this is what I discovered on a fresh Debian 11 host:

image

Notice that we blow past the 4GB limit, start swapping, and then everything slows down to a crawl. While this host is limited to 4GB and the free GitHub Actions Runner has 7GB, I suspect that we are seeing the same issue in GitHub Actions.

FTR: my local Linux host where this completes in under 2 mins has 64GB of DDR4 and 8 CPUs (16 vCPUs). The fast NVMe drive (7GB/s & 500k IOPS) makes a big difference too.

Next step is to run make universe-test in a host limited to 8GB, which is closer to the GitHub Actions Runner spec and see what happens then. My assumption is that the same thing. Will continue tomorrow.

@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 18, 2022

32287s (9h) later, the integration tests are still running, and they are unlikely to finish since we are hitting some gRPC timeouts within BuildKit:

image

The above confirms that bumping the timeout to 60m was unlikely to work.

Before I bump the RAM to 8GB, I am curious what happens if I lower the number of parallel bats jobs.

Because we currently run 4 in parallel (meaning 4 dagger instances), I suspect that this makes this CUE issue cue-lang/cue#1795 far worse. Based on everything I know, I am fairly certain that we are hitting this specific issue. 👨🏻‍⚕️

@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 18, 2022

Even with 2 bats jobs, we end up with 3 dagger instances which do not fit in 4GB based on the work that they need to do. Notice the RSS taken by dagger processes (ignore the forks) during the test:

Screenshot 2022-08-18 at 11 49 31

And after make universe-test is cancelled:

Screenshot 2022-08-18 at 11 50 32

Since the GitHub Actions Runner has 7GB, I am going to bump my test host to 8GB and try again.

Instead of adopting multiple approaches, stick to a single one.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
When `dagger do test integration universe` runs, each job can use up to
1GB of RAM. There is an extra top-level job that spikes up to 3GB of
RAM. Combined, they exceed the 7GB free GitHub Actions Runner limit.
When this happens, the host starts swapping, everything slows down to a
crawl, and eventually the 30 minutes action limit gets exceeded and the
job gets cancelled.

The best approach right now seems to not parallelize by default. The
total time saving is insignificant - 20s for a >600s total in my test.
dagger#2942 has more details (including
lots of screenshots).

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard gerhard force-pushed the feat/terraform-version branch from 9f46ab9 to 23530f3 Compare August 18, 2022 17:36
@gerhard
Copy link
Copy Markdown
Contributor

gerhard commented Aug 18, 2022

Not running multiple universe tests in parallel should keep the memory utilisation within the free GitHub Actions runner resources.

Each job can use up to 1GB of RAM, and the top-level one that orchestrates all the jobs can use up to 3GB of RAM. This hits the 7GB limit, the runner starts swapping, everything slows down to a crawl, and eventually the 30 mins timeout gets hit. This is what that looks like in one picture:

Screenshot 2022-08-17 at 23 47 38

While the last commit makes the Experimental Universe Tests slower (9 vs 7 mins), as well as the Integration Tests slower (11 vs 7 mins), the Universe ones should now be more reliable going forward. They also seem to be slightly faster (14 vs 15 mins). This speed-up might be incidental.

There is a lot more that I discovered since I started looking at this. For now, this improves things enough to get this change merged. I intend to continue digging into this test-related / GitHub Actions issue next week. cc @aluzzardi @samalba @shykes

Thank you @brittandeyoung for your contribution & sticking with us as we pushed this through. If we did it right, your next contribution should be easier 😎


Leaving another screenshot for the Universe Tests issue here:

Screenshot 2022-08-17 at 23 48 34

Copy link
Copy Markdown
Contributor

@gerhard gerhard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants