
(custom-resources): empty onEvent handler zips being created, failing deploys #27342

Open
diranged opened this issue Sep 28, 2023 · 22 comments

@diranged

diranged commented Sep 28, 2023

Describe the bug

We recently started to see our integration tests failing, even though deploys were succeeding. The failures on the integration tests look like this:

sent 1,788 bytes  received 35 bytes  3,646.00 bytes/sec
total size is 1,680  speedup is 0.92
fatal: Not a valid object name integ
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/x/xxx/test/integ/constructs/xyz/integ.cluster.ts.snapshot/asset.9202bb21d52e07810fc1da0f6acf2dcb75a40a43a9a2efbcfc9ae39535c6260c.zip'
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/xxx/xxx/test/integ/constructs/xyz/integ.cluster.ts.snapshot/asset.8e18eb5caccd2617fb76e648fa6a35dc0ece98c4681942bc6861f41afdff6a1b.zip'
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/xxx/xxx/test/integ/constructs/xyz/integ.cluster.ts.snapshot/asset.e2277687077a2abf9ae1af1cc9565e6715e2ebb62f79ec53aa75a1af9298f642.zip'

 ❌ Deployment failed: Error: Failed to publish asset a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5:current_account-current_region
    at Deployments.publishSingleAsset (/home/runner/work/xxx/xxx/node_modules/aws-cdk/lib/index.js:446:11458)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.publishAsset (/home/runner/work/xxx/xxx/node_modules/aws-cdk/lib/index.js:446:151474)
    at async /home/runner/work/xxx/xxx/node_modules/aws-cdk/lib/index.js:446:136916
Failed to publish asset a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5:current_account-current_region
  FAILED     integ/constructs/xyz/integ.cluster-IntegTest/DefaultTest (undefined/us-east-1) 29.135s
      Integration test failed: TypeError [ERR_STREAM_NULL_VALUES]: May not write null values to stream

When we then look in our S3 bucket, we find a series of 22-byte zip files (22 bytes is exactly the size of a zip archive that contains no entries). These three screenshots are from three separate build attempts, all with fresh, empty cdk.out directories, and all after we had wiped out the S3 cache files:

[Screenshots: three S3 console views from separate build attempts showing the 22-byte asset zip files]

When we dug into it, it seems that these files are all related to the onEvent handlers for the custom-resource constructs. Going back in time a bit, it looks like these hash values show up at or around a9ed64f#diff-8bf3c7acb1f51f01631ea642163612a520b448b843d7514dc31ccc6f140c0753..

Attempts to fix

Roll back to 2.90.0 - success

We tried to roll back to 2.87.0, but our codebase would have required too many changes for that. We were able to roll back to 2.90.0, though, which is interestingly before several of the handlers were updated from Node16 to Node18.

When we rolled back to 2.90.0, the integration tests work fine.

Roll forward to 2.91.0 - success

Same as 2.90.0 - the tests work fine.

Roll forward to 2.92.0 - partial success

In https://github.com/aws/aws-cdk/releases/tag/v2.92.0, the custom-resources handler is bumped to use Node18 instead of Node16. That change creates the new asset hash a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5. This code mostly worked - however #26771 prevented us from fully testing the CDK construct for EKS.

Roll forward to 2.93.0 - success

In 2.93.0, we see the asset hash change from 3f579d6c1ab146cac713730c96809dd4a9c5d9750440fb835ab20fd6925e528c.zip -> 9202bb21d52e07810fc1da0f6acf2dcb75a40a43a9a2efbcfc9ae39535c6260c.zip. It seems that this release works just fine - though the tests are ongoing right now.

Roll forward to 2.94.0 - failure

It seems that the failure starts as soon as we hit the 2.94.0 release.

INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.9202bb21d52e07810fc1da0f6acf2dcb75a40a43a9a2efbcfc9ae39535c6260c.zip'
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.e2277687077a2abf9ae1af1cc9565e6715e2ebb62f79ec53aa75a1af9298f642.zip'
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/inframyapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.8e18eb5caccd2617fb76e648fa6a35dc0ece98c4681942bc6861f41afdff6a1b.zip'

Rolling back to 2.93.0 - success

Rolling back to 2.93.0 after the 2.94.0 failure immediately works... builds and integration tests pass again.

Expected Behavior

A few things here..

  1. I obviously don't expect the zip files to be created empty, causing problems.
  2. I would expect the files to be cleaned up or replaced when they are determined to be corrupt.

Current Behavior

As far as we can tell, once the corrupt file is created, there are some situations where it is uploaded to S3 (thus poisoning the cache), and other situations where the upload fails to begin with.

Reproduction Steps

Working on this ... we don't yet know exactly how to reproduce it.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.95.0+

Framework Version

No response

Node.js Version

18

OS

Linux and OSX

Language

Typescript

Language Version

No response

Other information

No response

@diranged
Author

Update... As I worked through testing different versions of the CDK libraries, I started setting cdkVersionPinning: true in my .projenrc.ts file to make it easier to ensure I was testing the right CDK libs. I tested 2.90.0, 2.91.0, 2.92.0, and 2.93.0 and continued to have success. This got me thinking though - the fact that I had to make sure my aws-cdk and aws-cdk-lib versions were matched by setting cdkVersionPinning: true led me to wonder whether there were other version mismatches.

I checked our .projenrc.ts file and we've got the following devDeps setup..

  devDeps: [
    // Integ-Tests
    '@aws-cdk/integ-tests-alpha@^2.94.0-alpha.0',
    '@aws-cdk/integ-runner@^2.94.0-alpha.0',
  ],

Seems innocuous - right? Just make sure that we have the integration tools at 2.94.0 or later? Well, this does an interesting thing to our yarn.lock file ... it makes us install TWO versions of the aws-cdk package (but one lib):

# yarn.lock off of our "main" branch where we were having problems
...
aws-cdk-lib@^2.1.0:
  version "2.98.0"
  resolved "https://registry.yarnpkg.com/aws-cdk-lib/-/aws-cdk-lib-2.98.0.tgz#b1a5cbfa95e630e0440bc025c6281402db98965d"
  integrity sha512-6APM6zVTCi59L/8lPX47DINlCD9ZG7OEQ28pD/ftmHZ8qC7AlBWwWqOfuSL+DyEbJBLcw3AZ2MLM1AMJPO+sVg==
  dependencies:
    "@aws-cdk/asset-awscli-v1" "^2.2.200"
    "@aws-cdk/asset-kubectl-v20" "^2.1.2"
    "@aws-cdk/asset-node-proxy-agent-v6" "^2.0.1"
    "@balena/dockerignore" "^1.0.2"
    case "1.6.3"
    fs-extra "^11.1.1"
    ignore "^5.2.4"
    jsonschema "^1.4.1"
    minimatch "^3.1.2"
    punycode "^2.3.0"
    semver "^7.5.4"
    table "^6.8.1"
    yaml "1.10.2"

aws-cdk@2.94.0:
  version "2.94.0"
  resolved "https://registry.yarnpkg.com/aws-cdk/-/aws-cdk-2.94.0.tgz#2bf7bc649f41e13b864fb8cfdbf42218786df95e"
  integrity sha512-9bJkzxFDYZDwPDfZi/DSUODn4HFRzuXWPhpFgIIgRykfT18P+iAIJ1AEhaaCmlqrrog5yQgN+2iYd9BwDsiBeg==
  optionalDependencies:
    fsevents "2.3.2"

aws-cdk@^2.1.0:
  version "2.98.0"
  resolved "https://registry.yarnpkg.com/aws-cdk/-/aws-cdk-2.98.0.tgz#eb624ce9ab43e920695c59c0f270fa2f40906e62"
  integrity sha512-K8WCstCTmJo7dOwzAfUxhWmRYs9FmtFMpKh0OkEOs7iJ1HsNvAOz2LUURkVMqINXgfhmqqjgK6PQxI4AfgOdGA==
  optionalDependencies:
    fsevents "2.3.2"

...

"@aws-cdk/integ-runner@^2.94.0-alpha.0":
  version "2.94.0-alpha.0"
  resolved "https://registry.yarnpkg.com/@aws-cdk/integ-runner/-/integ-runner-2.94.0-alpha.0.tgz#1ce93341773457218728e87dd0f6433b29e80dd3"
  integrity sha512-6KDCwmOKcMpoOGhQAHJi31K1fuF1eyHFXxWZ4+FjdyofUQW7C2VjDXSavccEuvWAMudkyJTqQkxHalmEZpVFHA==
  dependencies:
    aws-cdk "2.94.0"
  optionalDependencies:
    fsevents "2.3.2"

"@aws-cdk/integ-tests-alpha@^2.94.0-alpha.0":
  version "2.94.0-alpha.0"
  resolved "https://registry.yarnpkg.com/@aws-cdk/integ-tests-alpha/-/integ-tests-alpha-2.94.0-alpha.0.tgz#33ec7cc8ab39405c7951c04741062ae2b46598d9"
  integrity sha512-mWfNy1EhsqNGdeu6ZzjMw0adIHmU7cwyMJM/Tgb6+Z3yP8WJzne3g6R2qJVEWFTYZ/S37ZmN7Ekw04AMPZEr9Q==

You'll see that we have aws-cdk-lib@2.98.0 and then aws-cdk@2.98.0 AND aws-cdk@2.94.0. I wondered where the aws-cdk@2.94.0 dependency came from, and found it pinned in the @aws-cdk/integ-runner dependencies above. It strikes me as very likely that the version mismatches here are somehow causing us problems, especially since the pain only shows up when we run our integration tests and not when we're running our main deploys.

I'm beginning some tests where I force the aws-cdk/integ.* packages to be updated... but I am wondering: how do people handle this? What is the right way to make sure that all of the aws-cdk* libraries are matched?
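
As a quick sanity check while experimenting, something like the following can flag that kind of skew locally. This is a throwaway sketch, assuming a standard hoisted node_modules layout and the <cdkVersion>-alpha.0 convention for the alpha packages:

import * as fs from 'fs';

// Read the installed version of a package straight from node_modules.
function installedVersion(pkg: string): string {
  return JSON.parse(fs.readFileSync(`node_modules/${pkg}/package.json`, 'utf-8')).version;
}

const lib = installedVersion('aws-cdk-lib');
for (const pkg of ['aws-cdk', '@aws-cdk/integ-runner', '@aws-cdk/integ-tests-alpha']) {
  // Alpha packages are versioned as `<cdk version>-alpha.N`, so compare the version prefix only.
  const bare = installedVersion(pkg).replace(/-alpha\.\d+$/, '');
  if (bare !== lib) {
    console.warn(`Version skew: ${pkg}@${installedVersion(pkg)} vs aws-cdk-lib@${lib}`);
  }
}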

@peterwoodworth
Contributor

Thanks for posting an update @diranged,

I'm not entirely sure what you mean by the following:

I wondered where the aws-cdk@2.94.0 dependency came from, and found it in the file in the @aws-cdk/integ.... dependencies

I don't typically use projen, so it might be a bit tricky for me to try to reproduce this without more specific steps

@diranged
Author

@peterwoodworth Thanks for responding, let me try to explain better. We make heavy use of Projen... and while we use a custom construct with a bunch of defaults, a common CDK project for us will look something like this:

const project = new awscdk.AwsCdkTypeScriptApp({
  authorEmail: 'matt@....com',
  authorName: 'Matt Wise',
  authorUrl: 'https://github.com/ourorg/ourapp',
  cdkVersion: '2.99.0',

  defaultReleaseBranch: 'main',
  name: 'infra-thingy',
  projenrcTs: true,

  deps: [
    // src/constructs/aws_vpc
    '@aws-cdk/aws-lambda-python-alpha',
  ],

  devDeps: [
    // Integ-Tests
    '@aws-cdk/integ-tests-alpha',
    '@aws-cdk/integ-runner',
  ],

  gitignore: [
    // CDK Temp Data
    'cdk.out',
    'cdk.staging',

    // VIM
    '*.swp',
    '*.swo',

    // Integ-Tests write some files that do not need to be committed:
    'cdk.context.json',
    'read*lock',
    'test/integ/**/integ.*.snapshot/asset.*',
  ],

  githubOptions: {
    // https://projen.io/api/API.html#projen-github-pullrequestlint
    pullRequestLintOptions: {
      semanticTitleOptions: {
        requireScope: true,
      },
    },
  },
});

What I realized in this setup is that the package.json file gets set up with the aws-cdk and aws-cdk-lib libraries at matching versions ... but then the other @aws-cdk/* libraries are not upgraded along with them. I don't really understand the NodeJS Yarn/NPM dependency world and tooling terribly well, so forgive me for not having a better answer here.

What I see, though, is that the aws-cdk and aws-cdk-lib libraries are updated together (see the diff below from an automated PR generated by Dependabot) - and yet the other libs that have the -alpha.0 version suffix seem to be ignored by Dependabot and get left behind when we're doing upgrades. I think that ends up with us creating this version skew, which then creates weird behaviors.

Our dependabot looks like this:

  version: 2
  registries:
    github:
      type: npm-registry
      url: https://npm.pkg.github.com
      token: ${{ secrets.PROJEN_GITHUB_TOKEN }}
  updates:
    - package-ecosystem: npm
      versioning-strategy: lockfile-only
      directory: /
      schedule:
        interval: weekly
      registries:
        - github
      groups:
        awslibs:
          patterns:
            - "@aws*"
            - aws*
        node:
          patterns:
            - typescript*
            - "@types*"
            - eslint*
        projen:
          patterns:
            - projen*
            - jest
      open-pull-requests-limit: 50

We recently had a PR update come through like this:

[Screenshot: Dependabot PR diff showing aws-cdk and aws-cdk-lib bumped together]

Notice how aws-cdk and aws-cdk-lib are updated together? But what about our other @aws-cdk/* dependencies? They remained stuck on the old versions... does this have to do perhaps with the -alpha.0 suffix?

In the short term, I am tweaking our projenrc.ts file to do this instead:

project.addDevDeps(`@aws-cdk/aws-lambda-python-alpha@${project.cdkVersion}-alpha.0`);
project.addDevDeps(`@aws-cdk/integ-tests-alpha@${project.cdkVersion}-alpha.0`);
project.addDevDeps(`@aws-cdk/integ-runner@${project.cdkVersion}-alpha.0`);

However, this won't automatically catch updates during Dependabot PRs ... instead, it will only force the update when we change project.cdkVersion to a new version.

@diranged
Author

diranged commented Sep 28, 2023

@peterwoodworth,
Sorry for the misdirection - I don't believe version mismatching is the issue now. I got all of my versions cleaned up and went to the latest 2.99.0 code, and it's failing in exactly the same way:

 Waiting for 2 more (integ/constructs/aws-eks/integ.xx-cluster, integ/constructs/aws-vpc/integ.xx-vpc)
  CHANGED    integ/constructs/aws-vpc/integ.xx-vpc 87.258s
      Resources
[~] AWS::Lambda::Function myappawsvpccidrreservationfunctionFunction49D8A27A
 └─ [~] Code
     └─ [~] .S3Key:
         ├─ [-] 51321ab8e804a0977f16dc33e216352d6c44f47e138c4f0cd2d5744d1341c149.zip
         └─ [+] d63cae75bab1847aea143d10362f28d64b116d80390ed1f2ff2abbe8fb39ae85.zip
      Repro:
        env CDK_INTEG_ACCOUNT='12345678' CDK_INTEG_REGION='test-region' CDK_INTEG_HOSTED_ZONE_ID='Z23ABC4XYZL05B' CDK_INTEG_HOSTED_ZONE_NAME='example.com' CDK_INTEG_DOMAIN_NAME='*.example.com' CDK_INTEG_CERT_ARN='arn:aws:acm:test-region:12345678:certificate/86468209-a272-595d-b831-0efb6421265z' cdk synth -a 'node -r ts-node/register integ.xx-vpc.ts' -o 'test/integ/constructs/aws-vpc/cdk-integ.out.integ.xx-vpc.ts.snapshot' -c '@aws-cdk/aws-lambda:recognizeLayerVersion=true' -c '@aws-cdk/core:checkSecretUsage=true' -c '@aws-cdk/core:target-partitions=undefined' -c '@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver=true' -c '@aws-cdk/aws-ec2:uniqueImdsv2TemplateName=true' -c '@aws-cdk/aws-ecs:arnFormatIncludesClusterName=true' -c '@aws-cdk/aws-iam:minimizePolicies=true' -c '@aws-cdk/core:validateSnapshotRemovalPolicy=true' -c '@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName=true' -c '@aws-cdk/aws-s3:createDefaultLoggingPolicy=true' -c '@aws-cdk/aws-sns-subscriptions:restrictSq...
  CHANGED    integ/constructs/aws-vpc/integ.xx-vpc 87.259s
      Resources
[~] AWS::Lambda::Function myappawsvpccidrreservationfunctionFunction49D8A27A
 └─ [~] Code
     └─ [~] .S3Key:
         ├─ [-] 51321ab8e804a0977f16dc33e216352d6c44f47e138c4f0cd2d5744d1341c149.zip
         └─ [+] d63cae75bab1847aea143d10362f28d64b116d80390ed1f2ff2abbe8fb39ae85.zip
  CHANGED    integ/constructs/aws-eks/integ.xx-cluster 87.41s
      Resources
[~] AWS::CloudFormation::Stack NetworkPrepNestedStackNetworkPrepNestedStackResourceCF699613
 └─ [~] TemplateURL
     └─ [~] .Fn::Join:
         └─ @@ -13,6 +13,6 @@
            [ ]     {
            [ ]       "Fn::Sub": "cdk-hnb659fds-assets-${AWS::AccountId}-${AWS::Region}"
            [ ]     },
            [-]     "/3cdb2a9ae15a7becb87a176d255fc7d246c6d14a5ee3a24b404ce096f12d60f6.json"
            [+]     "/3bb2e502fc3e1a1db356acda7c546402936497d3d11fb5afe8c2b18b38f79ffd.json"
            [ ]   ]
            [ ] ]
      Repro:
        env CDK_INTEG_ACCOUNT='12345678' CDK_INTEG_REGION='test-region' CDK_INTEG_HOSTED_ZONE_ID='Z23ABC4XYZL05B' CDK_INTEG_HOSTED_ZONE_NAME='example.com' CDK_INTEG_DOMAIN_NAME='*.example.com' CDK_INTEG_CERT_ARN='arn:aws:acm:test-region:12345678:certificate/86468209-a272-595d-b831-0efb6421265z' cdk synth -a 'node -r ts-node/register integ.xx-cluster.ts' -o 'test/integ/constructs/aws-eks/cdk-integ.out.integ.xx-cluster.ts.snapshot' -c '@aws-cdk/aws-lambda:recognizeLayerVersion=true' -c '@aws-cdk/core:checkSecretUsage=true' -c '@aws-cdk/core:target-partitions=undefined' -c '@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver=true' -c '@aws-cdk/aws-ec2:uniqueImdsv2TemplateName=true' -c '@aws-cdk/aws-ecs:arnFormatIncludesClusterName=true' -c '@aws-cdk/aws-iam:minimizePolicies=true' -c '@aws-cdk/core:validateSnapshotRemovalPolicy=true' -c '@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName=true' -c '@aws-cdk/aws-s3:createDefaultLoggingPolicy=true' -c '@aws-cdk/aws-sns-subscriptions:re...
  CHANGED    integ/constructs/aws-eks/integ.xx-cluster 87.41s
      Resources
[~] AWS::Lambda::Function myappawsvpccidrreservationfunctionFunction49D8A27A
 └─ [~] Code
     └─ [~] .S3Key:
         ├─ [-] 51321ab8e804a0977f16dc33e216352d6c44f47e138c4f0cd2d5744d1341c149.zip
         └─ [+] d63cae75bab1847aea143d10362f28d64b116d80390ed1f2ff2abbe8fb39ae85.zip
Snapshot Results:
Tests:    2 failed, 2 total
Failed: /home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts
Failed: /home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-vpc/integ.xx-vpc.ts
Running integration tests for failed tests...
Running in parallel across regions: us-east-1, us-east-2, us-west-2
Running test /home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-vpc/integ.xx-vpc.ts in us-east-1
Running test /home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts in us-east-2
#0 building with "default" instance using docker driver
#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s
#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 1.32kB done
#2 DONE 0.0s
#3 [internal] load metadata for public.ecr.aws/sam/build-python3.11:latest
#3 DONE 0.1s
#4 [1/2] FROM public.ecr.aws/sam/build-python3.11@sha256:126e86be0b6e81b216a8ba9bb6450be40775488270b60a0c41fbf3f1f804d14f
#4 DONE 0.0s
#5 [2/2] RUN     python -m venv /usr/app/venv &&     mkdir /tmp/pip-cache &&     chmod -R 777 /tmp/pip-cache &&     pip install --upgrade pip &&     mkdir /tmp/poetry-cache &&     chmod -R 777 /tmp/poetry-cache &&     pip install pipenv==2022.4.8 poetry==1.5.1 &&     rm -rf /tmp/pip-cache/* /tmp/poetry-cache/*
#5 CACHED
#6 exporting to image
#6 exporting layers done
#6 writing image sha256:dd550a6e7461997228d1ffa85899a5654cf22654493d66697c9b092d3eec9135 done
#6 naming to docker.io/library/cdk-8203f0acdd036b5af721fd3c9720bf3dbf066f802f0ae5e63011727142bb2ef1 done
#6 DONE 0.0s
Bundling asset INFRA-MYAPP-CidrReservationTestStack/@infra-myapp--aws-vpc--cidr-reservation--function/Function/Code/Stage...
sending incremental file list
index.py
sent 1,788 bytes  received 35 bytes  3,646.00 bytes/sec
total size is 1,680  speedup is 0.92
Warning:  aws-cdk-lib.aws_ec2.VpcProps#cidr is deprecated.
  Use ipAddresses instead
  This API will be removed in the next major release.
#0 building with "default" instance using docker driver
#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s
#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 1.32kB done
#2 DONE 0.0s
#3 [internal] load metadata for public.ecr.aws/sam/build-python3.11:latest
#3 DONE 0.1s
#4 [1/2] FROM public.ecr.aws/sam/build-python3.11@sha256:126e86be0b6e81b216a8ba9bb6450be40775488270b60a0c41fbf3f1f804d14f
#4 DONE 0.0s
#5 [2/2] RUN     python -m venv /usr/app/venv &&     mkdir /tmp/pip-cache &&     chmod -R 777 /tmp/pip-cache &&     pip install --upgrade pip &&     mkdir /tmp/poetry-cache &&     chmod -R 777 /tmp/poetry-cache &&     pip install pipenv==2022.4.8 poetry==1.5.1 &&     rm -rf /tmp/pip-cache/* /tmp/poetry-cache/*
#5 CACHED
#6 exporting to image
#6 exporting layers done
#6 writing image sha256:dd550a6e7461997228d1ffa85899a5654cf22654493d66697c9b092d3eec9135 done
#6 naming to docker.io/library/cdk-8203f0acdd036b5af721fd3c9720bf3dbf066f802f0ae5e63011727142bb2ef1 done
#6 DONE 0.0s
Bundling asset INFRA-MYAPP-ClusterTest/NetworkPrep/@infra-myapp--aws-vpc--cidr-reservation--function/Function/Code/Stage...
#0 building with "default" instance using docker driver
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 1.32kB done
#1 DONE 0.0s
#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 0.0s
#3 [internal] load metadata for public.ecr.aws/sam/build-python3.11:latest
#3 DONE 0.1s
#4 [1/2] FROM public.ecr.aws/sam/build-python3.11@sha256:126e86be0b6e81b216a8ba9bb6450be40775488270b60a0c41fbf3f1f804d14f
#4 DONE 0.0s
#5 [2/2] RUN     python -m venv /usr/app/venv &&     mkdir /tmp/pip-cache &&     chmod -R 777 /tmp/pip-cache &&     pip install --upgrade pip &&     mkdir /tmp/poetry-cache &&     chmod -R 777 /tmp/poetry-cache &&     pip install pipenv==2022.4.8 poetry==1.5.1 &&     rm -rf /tmp/pip-cache/* /tmp/poetry-cache/*
#5 CACHED
#6 exporting to image
#6 exporting layers done
#6 writing image sha256:dd550a6e7461997228d1ffa85899a5654cf22654493d66697c9b092d3eec9135 done
#6 naming to docker.io/library/cdk-8203f0acdd036b5af721fd3c9720bf3dbf066f802f0ae5e63011727142bb2ef1 done
#6 DONE 0.0s
sending incremental file list
index.py
sent 1,788 bytes  received 35 bytes  3,646.00 bytes/sec
total size is 1,680  speedup is 0.92
fatal: Not a valid object name main
fatal: Not a valid object name main
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.8e18eb5caccd2617fb76e648fa6a35dc0ece98c4681942bc6861f41afdff6a1b.zip'
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.9202bb21d52e07810fc1da0f6acf2dcb75a40a43a9a2efbcfc9ae39535c6260c.zip'
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.e2277687077a2abf9ae1af1cc9565e6715e2ebb62f79ec53aa75a1af9298f642.zip'
 ❌ Deployment failed: Error: Failed to publish asset a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5:current_account-current_region
    at Deployments.publishSingleAsset (/home/runner/work/infra-myapp/infra-myapp/node_modules/aws-cdk/lib/index.js:470:11458)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.publishAsset (/home/runner/work/infra-myapp/infra-myapp/node_modules/aws-cdk/lib/index.js:470:177839)
    at async /home/runner/work/infra-myapp/infra-myapp/node_modules/aws-cdk/lib/index.js:470:163266
Failed to publish asset a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5:current_account-current_region
  FAILED     integ/constructs/aws-eks/integ.xx-cluster-IntegTest/DefaultTest (undefined/us-east-2) 24.01s
      Integration test failed: TypeError [ERR_STREAM_NULL_VALUES]: May not write null values to stream
Could not checkout snapshot directory test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot using these commands:
git merge-base HEAD main && git checkout {merge-base} -- integ.xx-cluster.ts.snapshot
error: Error: Command exited with status 128
    at exec2 (/home/runner/work/infra-myapp/infra-myapp/node_modules/@aws-cdk/integ-runner/lib/workers/extract/index.js:797377:11)
    at IntegTestRunner.checkoutSnapshot (/home/runner/work/infra-myapp/infra-myapp/node_modules/@aws-cdk/integ-runner/lib/workers/extract/index.js:804533:26)
    at IntegTestRunner.deploy (/home/runner/work/infra-myapp/infra-myapp/node_modules/@aws-cdk/integ-runner/lib/workers/extract/index.js:804831:18)
    at IntegTestRunner.runIntegTestCase (/home/runner/work/infra-myapp/infra-myapp/node_modules/@aws-cdk/integ-runner/lib/workers/extract/index.js:804610:37)
    at Function.integTestWorker (/home/runner/work/infra-myapp/infra-myapp/node_modules/@aws-cdk/integ-runner/lib/workers/extract/index.js:1107960:34)
    at MessagePort.<anonymous> (/home/runner/work/infra-myapp/infra-myapp/node_modules/@aws-cdk/integ-runner/lib/workers/extract/index.js:973:31)
    at [nodejs.internal.kHybridDispatch] (node:internal/event_target:757:20)
    at exports.emitMessage (node:internal/per_context/messageport:23:28)

I am going to start working back through the release versions to see which one fixes it...

@diranged
Author

I've updated the issue description above ... the failure seems to begin at 2.94.0 and run through 2.99.0. If we roll back to 2.93.0, it's fine.

@peterwoodworth
Contributor

Interestingly enough, this issue is also describing problems with testing when migrating to 2.94.0. They don't seem related, but there were some changes made to testing. If you could provide code to reproduce this, it would be very helpful 🙂

@diranged
Author

Thanks for pointing that out ... we also ran into some really strange issues that I thought were related to our own code (see https://cdk-dev.slack.com/archives/C017RG58LM8/p1695824193947209), but which I now suspect will be fixed by rolling back to 2.93.0. I've got some other work I have to complete first, but I'll try to post results back here. It definitely seems like something wonky happened in 2.94.0.

@rix0rrr
Contributor

rix0rrr commented Sep 29, 2023

@diranged, is there any way we could have a look at a repro of this?

@rix0rrr
Contributor

rix0rrr commented Sep 29, 2023

My best suspect at the moment is that #26910 would lead to an ordering problem.

@diranged if you could send me the manifest.json and *.assets.json from your cdk.out directory (in a failing setup), that might help.

@github-actions

github-actions bot commented Oct 4, 2023

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@diranged
Author

diranged commented Oct 4, 2023

keep-alive please..

@peterwoodworth
Contributor

@diranged is there any way you can provide a repro?

@diranged
Author

diranged commented Oct 4, 2023

@diranged is there any way you can provide a repro?

So if I open an AWS Support Ticket, I can upload the entire repo there along with instructions.. I haven't had time to craft a smaller example, unfortunately.

@peterwoodworth
Contributor

ok, in that case we can put this ticket on hold until we're able to get a reproduction

@diranged
Author

diranged commented Nov 5, 2023

@peterwoodworth,
I may have found the root cause of this issue - though I don't totally understand the behavior yet. We have the following configuration in our .gitignore. The intention is that we want most of our integration test data stored in git, but we don't want assets like the various Lambda functions stored - they're large, and that would be a waste of space in our opinion:

# .gitignore
...
test/integ/**/cdk-integ.out.*
test/integ/**/integ.*.snapshot/asset.*

Additionally, we have a Lifecycle Policy on our CDK S3 buckets that purges data - it was set to 30 days until recently, when we changed it to 1 day. Our thinking behind this policy was that this was ephemeral data that would just be regenerated as needed, so why risk having invalid or corrupt data in S3? I discovered though that after switching this to a 1 day expiration policy, virtually every integration test we have across multiple different CDK repos began to fail with the exact same behavior.

My theory right now is that the integration tests rely on the S3 data existing for the "previous asset versions" - so that the runner can bring the test stacks up into state A before issuing the update call to move them to state B. How does this sound so far? Is this how it works?

Our expectation was that the code would just regenerate any assets it was missing from S3.. but perhaps we're wrong?

@diranged
Author

diranged commented Jan 12, 2024

@peterwoodworth Hey - this is still an issue.. any thoughts on my comment/question? We noticed a second situation where this occurs as well.

Let's say that we have a change to the code that we know will pass ... changing Kubernetes manifest string from image:1.2 to image:1.3 ... in those cases, we don't always want to force a full integration test. So here's what happens:

  1. The user builds their PR, and runs integ-runner ... --dry-run which generates all the changed files in the integration test as if it ran.
  2. They commit the whole PR and we review it. Because there is no diff seen by our integration-test runner action, it moves on and doesn't require fresh integration tests.
  3. We merge the PR to main...
  4. A new PR is created that does need an integration test.. so we run the integ-runner tests..
  5. integ-runner tries to spin up the old state of the stacks based on the committed integration test files on main...
  6. The bring-up of the "old state" fails, though, because the local files in the git repository refer to S3-hosted files that were never uploaded (due to the dry run).

It really feels like the answer here is that the integ-runner command needs to know how to regenerate and upload ALL of the missing files (from the "previous" state to the "current" state) before trying to run the tests.

@pahud
Contributor

pahud commented Jan 30, 2024

Hi @diranged

As @rix0rrr mentioned above, we will need to look at the sample repo before we can identify the root cause or even reproduce it on our end. Are you able to provide a sample repo with the minimal code required to help us reproduce this issue?

@mrgrain
Contributor

mrgrain commented Jan 31, 2024

Our expectation was that the code would just regenerate any assets it was missing from S3.. but perhaps we're wrong?

@diranged This assumption is unfortunately not correct, as your second post also describes. By default, integ-runner will do a two-step deployment to verify not only that changes work, but also that the delta can be applied as an update to an existing stack:

  1. Deploy the previous version from the stored snapshot in git
  2. Synth & deploy the new version

It really feels like the answer here is that the integ-runner command needs to understand to generate and upload ALL missing files (from the "previous" state to the "current" state) before trying to run the tests..

integ-runner is currently not capable of doing this. There is even the question of whether this is possible at all, given that a lot of other factors outside the control of integ-runner could have changed: context data, package versions, the version of integ-runner itself.

What it does offer today is --disable-update-workflow to just forgo this system. You could possibly combine this with --no-clean to maintain an ongoing "integration test environment".


Either way, I think this needs to be documented better.

@mrgrain
Contributor

mrgrain commented Feb 2, 2024

@diranged Does including all files solve your problem?

Also, given that we have this in the README:

All snapshot files (i.e. *.snapshot/**) must be checked-in to version control. If not, changes cannot be compared across systems.

Could you elaborate on what led you to the assumption that you can ignore some files? Any suggestions on how we can improve the README?

@diranged
Author

@mrgrain,
So ... first, thank you for taking the time to respond, I really do appreciate it. Reading your comments, you are definitely right that I missed that note in the README, and it's pretty explicit (though I have an edge case I've run into, and I'll comment on it separately from this one to see if you have any ideas). I think the integ-runner code is really critical in larger CDK environments for executing realistic tests ... so we've worked really hard to adopt it as a default in most of our CDK projects. In fact, we've built a full GitHub-Action/PR-based workflow where integration tests are run by users via PR comments when they submit new PRs. With that type of setup, it's really critical that the tests are reliable, so that failures are truly related to the user's PR itself.

Improving the UX

Right off the bat ... I think that if the integ-runner command were able to proactively verify that all the assets exist before beginning a test, it would dramatically improve things and would probably have saved me tens of hours of debugging and troubleshooting. I imagine it could use the lookup role to verify that the resources (all the assets) exist, and if any are missing, throw a big red error telling the user that they can't do an "update workflow test". Thoughts? A rough sketch of the kind of check I have in mind is below.
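
For illustration, a rough sketch of that kind of pre-flight check (not actual integ-runner code). It assumes the cloud-assembly asset manifest shape I believe *.assets.json uses - files[<hash>].destinations[<id>] with bucketName/objectKey, where bucketName may contain ${AWS::AccountId}/${AWS::Region} placeholders - so treat the field access and the placeholder substitution below as assumptions:

import * as fs from 'fs';
import { S3Client, HeadObjectCommand } from '@aws-sdk/client-s3';

// Verify that every destination object referenced by an asset manifest exists in S3,
// and fail with a clear error listing the missing ones.
async function assertAssetsExist(assetManifestPath: string, account: string, region: string): Promise<void> {
  const s3 = new S3Client({});
  const manifest = JSON.parse(fs.readFileSync(assetManifestPath, 'utf-8'));
  const missing: string[] = [];

  for (const [hash, file] of Object.entries<any>(manifest.files ?? {})) {
    for (const dest of Object.values<any>(file.destinations ?? {})) {
      // Resolve the placeholders used in the bootstrap bucket name (simplified assumption).
      const bucket = String(dest.bucketName)
        .replace('${AWS::AccountId}', account)
        .replace('${AWS::Region}', region);
      try {
        await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: dest.objectKey }));
      } catch {
        missing.push(`${hash} -> s3://${bucket}/${dest.objectKey}`);
      }
    }
  }

  if (missing.length > 0) {
    throw new Error(`Cannot run the update workflow; these assets are missing from S3:\n${missing.join('\n')}`);
  }
}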

Snapshots in Git

I'm curious how you've seen this used in the past... when it comes to building small dedicated CDK libraries that do a specific thing, I can imagine that storing the full snapshots isn't really a big deal .. but in our case we're launching integration tests to spin up real Kubernetes clusters along with a dozen different Lambda functions. These functions and the assets get pretty big:

% du -sch cdk.out test
136M	cdk.out
 71M	test
207M	total

Do you realistically see people storing these in Git - and updating them? Virtually every AWS CDK release changes the Lambda handler code in some way, which causes new hashes to be generated, new functions to be built, and new assets to be created.

I'm not complaining ... just trying to figure out what the realistic pattern is here. Our NodeJS functions aren't too big - but we have a couple of Python functions that get big. For example:

 16K    MYREPO-NativeClusterTest.assets.json
 48K    MYREPO-NativeClusterTest.template.json
 16K    MYREPONativeClusterTestCleanupStackA5C06CE2.nested.template.json
 40K    MYREPONativeClusterTestContinuousDeployment28A15EF4.nested.template.json
 80K    MYREPONativeClusterTestCorePluginsBB9AD3A8.nested.template.json
 12K    MYREPONativeClusterTestDns05AFFC71.nested.template.json
8.0K    MYREPONativeClusterTestKubeSystemNodesF64F789A.nested.template.json
 20K    MYREPONativeClusterTestNetworkPrep159B41F9.nested.template.json
 44K    MYREPONativeClusterTestOcean639A0FD8.nested.template.json
4.0K    MYREPONativeClusterTestRemoteManagementD093FD97.nested.template.json
 96K    MYREPONativeClusterTestSupplementalPlugins7C1CEFC9.nested.template.json
 16K    MYREPONativeClusterTestVpc42B5454F.nested.template.json
8.0K    MYREPONativeClusterTestndawseksNdKubectlProvider5DDA391D.nested.template.json
4.0K    IntegTestDefaultTestDeployAssertE3E7D2A4.assets.json
4.0K    IntegTestDefaultTestDeployAssertE3E7D2A4.template.json
 24K    asset.1471fa6f2876749a13de79989efc6651c9768d3173ef5904947e87504f8d7069
1.1M    asset.283efd6aefae7121bcf6bd25901fcb60ecd8b58bcd34cb8b91d8d8fc5322f62c
 16M    asset.3322b7049fb0ed2b7cbb644a2ada8d1116ff80c32dca89e6ada846b5de26f961.zip
 12K    asset.350497850828a0108f064a8cb783dd16d04637d20593411e21cc5b4f9e485cd6
4.0K    asset.4e26bf2d0a26f2097fb2b261f22bb51e3f6b4b52635777b1e54edbd8e2d58c35
4.1M    asset.6d93bc9532045758cbb4e2faa3a244d1154fc78d517cecfb295d2f07889d1259
 20K    asset.7382a0addb9f34974a1ea6c6c9b063882af874828f366f5c93b2b7b64db15c94
8.0K    asset.78b70ad373a624989fdc7740e7aa19700d82dfc386c4bc849803634716c8fa4a
4.4M    asset.aa10d0626ba6f3587e40157ecf0f5e0879088a68b2477bf0ef8eb74045a2439a
 30M    asset.bdb2015ec68b53161d29e5910113dcb0b789ba26659fcfdcddddf8256bde19ef.zip
8.0K    asset.be971704b52836a95da4dc35cbeb928f60b51bd5f7b01f03ac731e05cdfccbaf
8.0K    asset.dd5711540f04e06aa955d7f4862fc04e8cdea464cb590dae91ed2976bb78098e
4.0K    cdk.out
4.0K    integ.json
 88K    manifest.json
592K    tree.json
 57M    total

In this particular case, asset.aa10d0626ba6f3587e40157ecf0f5e0879088a68b2477bf0ef8eb74045a2439a is a 4.4MB NodeJS file... where the majority of that space must be used by imported libraries. Then the other one is asset.bdb2015ec68b53161d29e5910113dcb0b789ba26659fcfdcddddf8256bde19ef.zip which is the Kubectl/Helm package.

General thoughts on the Integ Runner

I think it's an amazing tool ... I wish it got more love. I've opened a bunch of issues on it over the last year (#27437, #22804, #22329, #27445, #28549) ... they all kind of fall into the theme of better documentation, better examples, and improved errors/warnings that help developers actually understand the root cause of failures.

@diranged
Author

@mrgrain,
Quick follow-up edge case... we created an integ.config.json file that looked like this:

{
    "verbose": true,
    "directory": "test/integ",
    "update-on-failed": true,
    "parallel-regions": ["us-west-2"],
    // https://github.com/aws/aws-cdk/issues/27342#issuecomment-1793592083"
    "disable-update-workflow": true
}

We assumed that if there was any problem parsing the file, it would tell us ... there were no warnings, so we thought we were all good. (The // comment above isn't valid JSON, which is most likely what made the parse fail.) It turns out that this code silently catches the error, and then you don't get any of the behaviors you are expecting:

try {
  return JSON.parse(fs.readFileSync(fileName, { encoding: 'utf-8' }));
} catch {
  return {};
}

This bit us for a long time because we kept having intermittent failures; we thought we had set disable-update-workflow: true ... and yet the setting was never actually applied because of this failure to parse the JSON.
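
For what it's worth, a minimal sketch of the behavior we expected instead - fail loudly when the config file exists but cannot be parsed (just an illustration, not the actual integ-runner code):

import * as fs from 'fs';

function loadConfig(fileName: string): Record<string, unknown> {
  // Missing file: fall back to defaults.
  if (!fs.existsSync(fileName)) {
    return {};
  }
  try {
    return JSON.parse(fs.readFileSync(fileName, { encoding: 'utf-8' }));
  } catch (e) {
    // File is present but unparseable: surface the error instead of silently ignoring it.
    throw new Error(`Unable to parse ${fileName}: ${e}`);
  }
}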

@mrgrain
Contributor

mrgrain commented Feb 11, 2024

@diranged Re the config file: Yeah that's bad and should be fixed. Would you mind opening a separate issue for this one? Obviously PRs are also welcome.

I'll respond later to your other comment.
