Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ops] Fix GCS bucket access for future buildkite agents #174756

Merged
merged 24 commits into from
Feb 7, 2024

Conversation

delanni
Copy link
Contributor

@delanni delanni commented Jan 12, 2024

Summary

Once we're moving to the elastic-wide buildkite agents, and away from the kibana-buildkite-managed ones, we won't have default access to the buckets we used to use, as the assumed service account will differ.

Note: Although this will only be required in the new infra, but this change can be merged and expected to work properly in the current infra as well.

Solution

We've set up a central service-account with rights to impersonate other service accounts that have controlled access to individual buckets to minimize the reach and influence of individual accounts. See: https://github.com/elastic/kibana-operations/pull/51

several of the changes weren't tested, as they're part of CI tasks outside the PR build - will merge with caution and monitor the stability afterwards

TODO: add access, and assume account before other GCS bucket usages

  • storybook
  • coverage (.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh)
  • upload static site (.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh)
  • SO object migration (.buildkite/scripts/steps/archive_so_migration_snapshot.sh)
  • ES Snapshot manifest upload (.buildkite/scripts/steps/es_snapshots/create_manifest.ts)
  • Scalability? (.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh)
  • Benchmarking (.buildkite/scripts/steps/scalability/benchmarking.sh)
  • Webpack bundle analyzer (.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts)
  • Build chromium (x-pack/build_chromium/build.py) Not needed, as it's manual, and not a CI task

TODO: others

  • Remove manifest upload (.buildkite/scripts/steps/es_serverless/promote_es_serverless_image.sh)
  • Decide if we should merge with the CDN access: no, SRE is managing that account
  • Bazel remote cache seems to also rely on gcs - roles PR: https://github.com/elastic/kibana-operations/pull/56

Closes: https://github.com/elastic/kibana-operations/issues/29
Part of: https://github.com/elastic/kibana-operations/issues/15

@@ -151,6 +151,11 @@ export GCS_SA_CDN_QA_EMAIL
GCS_SA_CDN_QA_BUCKET="$(vault_get gcs-sa-cdn-qa bucket)"
export GCS_SA_CDN_QA_BUCKET

GOOGLE_APPLICATION_CREDENTIALS="$(mktemp -d)/kibana-gcloud-service-account.json"
Copy link
Contributor Author

@delanni delanni Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #173159 @jbudz set up a similar scheme of access to a bucket for CDN assets (just a few lines above).

What should our approach be here? Should we have specific service accounts for different usage sites, with well-controlled bucket access, and assume the specific ones right before the uploads? Or should we use a powerful service account for most of the GCS access throughout our build jobs?

The former solution allows for nicer access control over different buckets and gcloud features, the latter would allow us to activate the account in a central place, and not have to worry about locally assuming the role before usage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO we should go with the former one. Besides the nicer access control , its more explicit and a better operational/security practise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree with the general sentiment, that granular access control is nicer, but wouldn't this get over-board with all these sites requiring 1-1 extra service account, with their own access to an individual bucket?

 coverage (.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh)
 SO object migration (.buildkite/scripts/steps/archive_so_migration_snapshot.sh)
 Upload static site (.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh)
 ES Snapshot manifest upload (.buildkite/scripts/steps/es_snapshots/create_manifest.ts)
 Scalability? (.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh)
 Benchmarking (.buildkite/scripts/steps/scalability/benchmarking.sh)
 Webpack bundle analyzer (.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts)
 Build chromium (x-pack/build_chromium/build.py)
+ bazel cache

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sharing in case there's any interest:

https://cloud.google.com/iam/docs/best-practices-service-accounts#choose-when-to-use suggests a combination of both. Default application credentials with permission to impersonate service accounts for specific areas.

https://cloud.google.com/docs/authentication/use-service-account-impersonation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea, but it's not far from managing and assuming roles manually. We'll still have to manage (probably through terraform as you suggested) the sevice accounts - we'd save on generating/storing the keys though.
I'll see how difficult is it to get the default account set up with rights to impersonate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gcloud service account we get by default in the new agents is this one: vault-gcp-authentication@elastic-ci-prod.iam.gserviceaccount.com
- this probably shouldn't have all the rights to impersonate any other service accounts, it looks like it's limited to accessing vault.

So if we'd like to have a power-user who can impersonate the bucket-limited service accounts, we'd need +1 account. Hopefully, I could set it up so that the account is activated in the pre-command hooks, so at the call sites, one would just need to use an impersonation flag, or command - then we wouldn't need to store/retrieve all the individual keys for the service accounts, just know their names.

@jbudz / @mistic - does this plan look OK? If so, I'll go ahead and create, setup the new account with rights.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the plan, sounds good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbudz - should I also move the CDN account's use to this form of impersonation, and include it to the set of accounts managed together by terraform?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We discussed on slack, we're holding off for now)

@delanni
Copy link
Contributor Author

delanni commented Jan 26, 2024

/ci

@delanni
Copy link
Contributor Author

delanni commented Feb 2, 2024

It seems the service-account proxy + impersonation seems to work, however, the usage is not without friction. Observe this positive and negative example:

➜  kibana git:(fix-gcs-access-in-elastic-buildkite) ✗ gcloud auth list --filter="status=ACTIVE" --format="value(account)"                                                                 
kibana-ci-sa-proxy@elastic-kibana-ci.iam.gserviceaccount.com
➜  kibana git:(fix-gcs-access-in-elastic-buildkite) ✗ gcloud config get auth/impersonate_service_account            
kibana-ci-access-es-snapshots@elastic-kibana-ci.iam.gserviceaccount.com
➜  kibana git:(fix-gcs-access-in-elastic-buildkite) ✗ gcloud storage cp test.txt gs://kibana-ci-es-snapshots-daily/test-alex.txt
WARNING: This command is using service account impersonation. All API calls will be executed as [kibana-ci-access-es-snapshots@elastic-kibana-ci.iam.gserviceaccount.com].
WARNING: This command is using service account impersonation. All API calls will be executed as [kibana-ci-access-es-snapshots@elastic-kibana-ci.iam.gserviceaccount.com].
Copying file://test.txt to gs://kibana-ci-es-snapshots-daily/test-alex.txt
⠼WARNING: This command is using service account impersonation. All API calls will be executed as [kibana-ci-access-es-snapshots@elastic-kibana-ci.iam.gserviceaccount.com].                               
⠧WARNING: This command is using service account impersonation. All API calls will be executed as [kibana-ci-access-es-snapshots@elastic-kibana-ci.iam.gserviceaccount.com].                               
⠛WARNING: This command is using service account impersonation. All API calls will be executed as [kibana-ci-access-es-snapshots@elastic-kibana-ci.iam.gserviceaccount.com].                               
⠼WARNING: This command is using service account impersonation. All API calls will be executed as [kibana-ci-access-es-snapshots@elastic-kibana-ci.iam.gserviceaccount.com].                               
  Completed files 1/1 | 16.0B/16.0B                                                                                                                                                                       
➜  kibana git:(fix-gcs-access-in-elastic-buildkite) ✗ gcloud storage cp test.txt gs://ci-artifacts.kibana.dev/test-alex.txt     
WARNING: This command is using service account impersonation. All API calls will be executed as [kibana-ci-access-es-snapshots@elastic-kibana-ci.iam.gserviceaccount.com].
WARNING: This command is using service account impersonation. All API calls will be executed as [kibana-ci-access-es-snapshots@elastic-kibana-ci.iam.gserviceaccount.com].
ERROR: (gcloud.storage.cp) User [kibana-ci-sa-proxy@elastic-kibana-ci.iam.gserviceaccount.com] does not have permission to access b instance [ci-artifacts.kibana.dev] (or it may not exist): kibana-ci-access-es-snapshots@elastic-kibana-ci.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).

ℹ️ The first 2 commands show that my main identity was the sa-proxy and I was impersonating kibana-ci-access-es-snapshots to be able to copy the kibana-ci-es-snapshots-daily bucket.

⚠️ The 3rd command shows a successful copy to the bucket we granted access to through the impersonation. You can see, however, that the output is (even after several attempts) swarmed with warnings.

❌ The 4th command shows a failed copy, showing that the grant only applies to a select bucket. Once again, though, in the output the error shows the warnings and an error as if the call was being made with the sa-proxy account, which is a bit misleading.

@delanni delanni force-pushed the fix-gcs-access-in-elastic-buildkite branch 2 times, most recently from 86c4729 to ddd2540 Compare February 5, 2024 14:33
@delanni delanni force-pushed the fix-gcs-access-in-elastic-buildkite branch from ddd2540 to b12ce66 Compare February 5, 2024 14:34
@delanni
Copy link
Contributor Author

delanni commented Feb 5, 2024

/ci

1 similar comment
@delanni
Copy link
Contributor Author

delanni commented Feb 5, 2024

/ci

ARG0="${1:-}"

if [[ -z "$ARG0" ]]; then
echo "Usage: $0 <bucket_name|email>"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set up this script to make it easy to set up the impersonation before calling gcloud storage / gsutil commands. However, I didn't want the script caller to have to refer to e-mails that are associated with the service-accounts for the buckets.

Do you think it's okay to have this bucket name -> service account mapping (from line 49) to make it handier to use the script (with defaults/fallbacks for activating based on the email).

@delanni delanni marked this pull request as ready for review February 5, 2024 16:27
@delanni delanni requested a review from a team as a code owner February 5, 2024 16:27
@delanni delanni added release_note:fix backport:skip This commit does not require backporting labels Feb 5, 2024
@delanni delanni force-pushed the fix-gcs-access-in-elastic-buildkite branch from 0c71f47 to 46f7bed Compare February 6, 2024 09:15
@delanni delanni force-pushed the fix-gcs-access-in-elastic-buildkite branch from 7307f23 to c274d87 Compare February 6, 2024 14:10
@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #6 / cases security and spaces enabled: basic Common bulk_create_attachments alerts security_solution "after each" hook for "should add the case ID to the alert schema when the user has read access only"

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@delanni
Copy link
Contributor Author

delanni commented Feb 6, 2024

Merging this tomorrow (02-07) with more active time to monitor

@delanni delanni merged commit 455ee0a into elastic:main Feb 7, 2024
20 checks passed
jloleysens added a commit that referenced this pull request Feb 7, 2024
* main: (224 commits)
  [Http] Replace `buildNr` with `buildSha` in static asset paths (#175898)
  [Ops] Fix GCS bucket access for future buildkite agents (#174756)
  [api-docs] 2024-02-07 Daily api_docs build (#176362)
  skip flaky suite (#176002)
  skip failing es promotion suite (#176359)
  [Cloud Security] [Grouping] Add URL Params support to the grouping components (#175749)
  chore(NA): update versions after v8.12.2 bump (#176309)
  chore(NA): update versions after v7.17.19 bump (#176313)
  skip failing test suite (#176352)
  [SLO] Enable burn rate alert by default during creation via UI (#176317)
  [Fleet] Add the uptime capability to observability projects (#176285)
  [Security Solution][Endpoint] Fix Manifest Manger so that it works with large (>10k) (#174411)
  [ResponseOps] Alert creation delay based on user definition (#175851)
  [data views] Default field formatters based on field meta values (#174973)
  [Cloud Security]Detection Rules counter on Rules Flyout (#176041)
  [Security Solution] Data Quality Dashboard persistence (#175673)
  [Ent Search] Connector client copy cleanup (#176290)
  [ML] Anomaly Detection: Adds actions menu to anomaly markers in Single Metric Viewer chart. (#175556)
  [ML] Anomaly Detection: Fix `values-dots` colors (#176303)
  [Fleet] Logstash Output - being compliant to RFC-952 (#176298)
  ...
fkanout pushed a commit to fkanout/kibana that referenced this pull request Feb 7, 2024
## Summary
Once we're moving to the elastic-wide buildkite agents, and away from
the kibana-buildkite-managed ones, we won't have default access to the
buckets we used to use, as the assumed service account will differ.

**Note:** Although this will only be required in the new infra, but this
change can be merged and expected to work properly in the current infra
as well.

### Solution
We've set up a central service-account with rights to impersonate other
service accounts that have controlled access to individual buckets to
minimize the reach and influence of individual accounts. See:
elastic/kibana-operations#51

**several of the changes weren't tested, as they're part of CI tasks
outside the PR build** - will merge with caution and monitor the
stability afterwards

TODO:  _add access, and assume account before other GCS bucket usages_
- [x] storybook
- [x] coverage
(.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh)
- [x] upload static site
(.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh)
- [x] SO object migration
(.buildkite/scripts/steps/archive_so_migration_snapshot.sh)
- [x] ES Snapshot manifest upload
(.buildkite/scripts/steps/es_snapshots/create_manifest.ts)
- [x] Scalability?
(.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh)
- [x] Benchmarking
(.buildkite/scripts/steps/scalability/benchmarking.sh)
- [x] Webpack bundle analyzer
(.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts)
- [x] ~Build chromium (x-pack/build_chromium/build.py)~ Not needed, as
it's manual, and not a CI task

TODO:  _others_
- [x] Remove manifest upload
(.buildkite/scripts/steps/es_serverless/promote_es_serverless_image.sh)
- [x] Decide if we should merge with the CDN access: no, SRE is managing
that account
- [x] Bazel remote cache seems to also rely on gcs - roles PR:
elastic/kibana-operations#56


Closes: elastic/kibana-operations#29
Part of: elastic/kibana-operations#15
mistic added a commit that referenced this pull request Feb 13, 2024
…176781)

This PR solves a problem introduced at
#174756 where the addition into
the bash script was not being correctly loaded for the promote manifest
pipeline as it was not running in the original main cwd.
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this pull request Feb 15, 2024
## Summary
Once we're moving to the elastic-wide buildkite agents, and away from
the kibana-buildkite-managed ones, we won't have default access to the
buckets we used to use, as the assumed service account will differ.

**Note:** Although this will only be required in the new infra, but this
change can be merged and expected to work properly in the current infra
as well.

### Solution
We've set up a central service-account with rights to impersonate other
service accounts that have controlled access to individual buckets to
minimize the reach and influence of individual accounts. See:
elastic/kibana-operations#51

**several of the changes weren't tested, as they're part of CI tasks
outside the PR build** - will merge with caution and monitor the
stability afterwards

TODO:  _add access, and assume account before other GCS bucket usages_
- [x] storybook
- [x] coverage
(.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh)
- [x] upload static site
(.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh)
- [x] SO object migration
(.buildkite/scripts/steps/archive_so_migration_snapshot.sh)
- [x] ES Snapshot manifest upload
(.buildkite/scripts/steps/es_snapshots/create_manifest.ts)
- [x] Scalability?
(.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh)
- [x] Benchmarking
(.buildkite/scripts/steps/scalability/benchmarking.sh)
- [x] Webpack bundle analyzer
(.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts)
- [x] ~Build chromium (x-pack/build_chromium/build.py)~ Not needed, as
it's manual, and not a CI task

TODO:  _others_
- [x] Remove manifest upload
(.buildkite/scripts/steps/es_serverless/promote_es_serverless_image.sh)
- [x] Decide if we should merge with the CDN access: no, SRE is managing
that account
- [x] Bazel remote cache seems to also rely on gcs - roles PR:
elastic/kibana-operations#56


Closes: elastic/kibana-operations#29
Part of: elastic/kibana-operations#15
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this pull request Feb 15, 2024
…lastic#176781)

This PR solves a problem introduced at
elastic#174756 where the addition into
the bash script was not being correctly loaded for the promote manifest
pipeline as it was not running in the original main cwd.
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this pull request Feb 15, 2024
## Summary
Once we're moving to the elastic-wide buildkite agents, and away from
the kibana-buildkite-managed ones, we won't have default access to the
buckets we used to use, as the assumed service account will differ.

**Note:** Although this will only be required in the new infra, but this
change can be merged and expected to work properly in the current infra
as well.

### Solution
We've set up a central service-account with rights to impersonate other
service accounts that have controlled access to individual buckets to
minimize the reach and influence of individual accounts. See:
elastic/kibana-operations#51

**several of the changes weren't tested, as they're part of CI tasks
outside the PR build** - will merge with caution and monitor the
stability afterwards

TODO:  _add access, and assume account before other GCS bucket usages_
- [x] storybook
- [x] coverage
(.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh)
- [x] upload static site
(.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh)
- [x] SO object migration
(.buildkite/scripts/steps/archive_so_migration_snapshot.sh)
- [x] ES Snapshot manifest upload
(.buildkite/scripts/steps/es_snapshots/create_manifest.ts)
- [x] Scalability?
(.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh)
- [x] Benchmarking
(.buildkite/scripts/steps/scalability/benchmarking.sh)
- [x] Webpack bundle analyzer
(.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts)
- [x] ~Build chromium (x-pack/build_chromium/build.py)~ Not needed, as
it's manual, and not a CI task

TODO:  _others_
- [x] Remove manifest upload
(.buildkite/scripts/steps/es_serverless/promote_es_serverless_image.sh)
- [x] Decide if we should merge with the CDN access: no, SRE is managing
that account
- [x] Bazel remote cache seems to also rely on gcs - roles PR:
elastic/kibana-operations#56


Closes: elastic/kibana-operations#29
Part of: elastic/kibana-operations#15
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this pull request Feb 15, 2024
…lastic#176781)

This PR solves a problem introduced at
elastic#174756 where the addition into
the bash script was not being correctly loaded for the promote manifest
pipeline as it was not running in the original main cwd.
fkanout pushed a commit to fkanout/kibana that referenced this pull request Mar 4, 2024
## Summary
Once we're moving to the elastic-wide buildkite agents, and away from
the kibana-buildkite-managed ones, we won't have default access to the
buckets we used to use, as the assumed service account will differ.

**Note:** Although this will only be required in the new infra, but this
change can be merged and expected to work properly in the current infra
as well.

### Solution
We've set up a central service-account with rights to impersonate other
service accounts that have controlled access to individual buckets to
minimize the reach and influence of individual accounts. See:
elastic/kibana-operations#51

**several of the changes weren't tested, as they're part of CI tasks
outside the PR build** - will merge with caution and monitor the
stability afterwards

TODO:  _add access, and assume account before other GCS bucket usages_
- [x] storybook
- [x] coverage
(.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh)
- [x] upload static site
(.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh)
- [x] SO object migration
(.buildkite/scripts/steps/archive_so_migration_snapshot.sh)
- [x] ES Snapshot manifest upload
(.buildkite/scripts/steps/es_snapshots/create_manifest.ts)
- [x] Scalability?
(.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh)
- [x] Benchmarking
(.buildkite/scripts/steps/scalability/benchmarking.sh)
- [x] Webpack bundle analyzer
(.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts)
- [x] ~Build chromium (x-pack/build_chromium/build.py)~ Not needed, as
it's manual, and not a CI task

TODO:  _others_
- [x] Remove manifest upload
(.buildkite/scripts/steps/es_serverless/promote_es_serverless_image.sh)
- [x] Decide if we should merge with the CDN access: no, SRE is managing
that account
- [x] Bazel remote cache seems to also rely on gcs - roles PR:
elastic/kibana-operations#56


Closes: elastic/kibana-operations#29
Part of: elastic/kibana-operations#15
fkanout pushed a commit to fkanout/kibana that referenced this pull request Mar 4, 2024
…lastic#176781)

This PR solves a problem introduced at
elastic#174756 where the addition into
the bash script was not being correctly loaded for the promote manifest
pipeline as it was not running in the original main cwd.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants