-
Notifications
You must be signed in to change notification settings - Fork 8.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Ops] Fix GCS bucket access for future buildkite agents #174756
[Ops] Fix GCS bucket access for future buildkite agents #174756
Conversation
@@ -151,6 +151,11 @@ export GCS_SA_CDN_QA_EMAIL | |||
GCS_SA_CDN_QA_BUCKET="$(vault_get gcs-sa-cdn-qa bucket)" | |||
export GCS_SA_CDN_QA_BUCKET | |||
|
|||
GOOGLE_APPLICATION_CREDENTIALS="$(mktemp -d)/kibana-gcloud-service-account.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #173159 @jbudz set up a similar scheme of access to a bucket for CDN assets (just a few lines above).
What should our approach be here? Should we have specific service accounts for different usage sites, with well-controlled bucket access, and assume the specific ones right before the uploads? Or should we use a powerful service account for most of the GCS access throughout our build jobs?
The former solution allows for nicer access control over different buckets and gcloud
features, the latter would allow us to activate the account in a central place, and not have to worry about locally assuming the role before usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO we should go with the former one. Besides the nicer access control , its more explicit and a better operational/security practise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree with the general sentiment, that granular access control is nicer, but wouldn't this get over-board with all these sites requiring 1-1 extra service account, with their own access to an individual bucket?
coverage (.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh)
SO object migration (.buildkite/scripts/steps/archive_so_migration_snapshot.sh)
Upload static site (.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh)
ES Snapshot manifest upload (.buildkite/scripts/steps/es_snapshots/create_manifest.ts)
Scalability? (.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh)
Benchmarking (.buildkite/scripts/steps/scalability/benchmarking.sh)
Webpack bundle analyzer (.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts)
Build chromium (x-pack/build_chromium/build.py)
+ bazel cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sharing in case there's any interest:
https://cloud.google.com/iam/docs/best-practices-service-accounts#choose-when-to-use suggests a combination of both. Default application credentials with permission to impersonate service accounts for specific areas.
https://cloud.google.com/docs/authentication/use-service-account-impersonation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea, but it's not far from managing and assuming roles manually. We'll still have to manage (probably through terraform as you suggested) the sevice accounts - we'd save on generating/storing the keys though.
I'll see how difficult is it to get the default account set up with rights to impersonate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The gcloud
service account we get by default in the new agents is this one: vault-gcp-authentication@elastic-ci-prod.iam.gserviceaccount.com
- this probably shouldn't have all the rights to impersonate any other service accounts, it looks like it's limited to accessing vault.
So if we'd like to have a power-user who can impersonate the bucket-limited service accounts, we'd need +1 account. Hopefully, I could set it up so that the account is activated in the pre-command hooks, so at the call sites, one would just need to use an impersonation flag, or command - then we wouldn't need to store/retrieve all the individual keys for the service accounts, just know their names.
@jbudz / @mistic - does this plan look OK? If so, I'll go ahead and create, setup the new account with rights.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with the plan, sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbudz - should I also move the CDN account's use to this form of impersonation, and include it to the set of accounts managed together by terraform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(We discussed on slack, we're holding off for now)
/ci |
It seems the service-account proxy + impersonation seems to work, however, the usage is not without friction. Observe this positive and negative example:
ℹ️ The first 2 commands show that my main identity was the ❌ The 4th command shows a failed copy, showing that the grant only applies to a select bucket. Once again, though, in the output the error shows the warnings and an error as if the call was being made with the sa-proxy account, which is a bit misleading. |
…d by the elastic team, we changed agreements right after the implementation
86c4729
to
ddd2540
Compare
ddd2540
to
b12ce66
Compare
/ci |
1 similar comment
/ci |
ARG0="${1:-}" | ||
|
||
if [[ -z "$ARG0" ]]; then | ||
echo "Usage: $0 <bucket_name|email>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set up this script to make it easy to set up the impersonation before calling gcloud storage / gsutil
commands. However, I didn't want the script caller to have to refer to e-mails that are associated with the service-accounts for the buckets.
Do you think it's okay to have this bucket name -> service account mapping (from line 49) to make it handier to use the script (with defaults/fallbacks for activating based on the email).
0c71f47
to
46f7bed
Compare
7307f23
to
c274d87
Compare
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]
History
To update your PR or re-run it, just comment with: |
Merging this tomorrow (02-07) with more active time to monitor |
* main: (224 commits) [Http] Replace `buildNr` with `buildSha` in static asset paths (#175898) [Ops] Fix GCS bucket access for future buildkite agents (#174756) [api-docs] 2024-02-07 Daily api_docs build (#176362) skip flaky suite (#176002) skip failing es promotion suite (#176359) [Cloud Security] [Grouping] Add URL Params support to the grouping components (#175749) chore(NA): update versions after v8.12.2 bump (#176309) chore(NA): update versions after v7.17.19 bump (#176313) skip failing test suite (#176352) [SLO] Enable burn rate alert by default during creation via UI (#176317) [Fleet] Add the uptime capability to observability projects (#176285) [Security Solution][Endpoint] Fix Manifest Manger so that it works with large (>10k) (#174411) [ResponseOps] Alert creation delay based on user definition (#175851) [data views] Default field formatters based on field meta values (#174973) [Cloud Security]Detection Rules counter on Rules Flyout (#176041) [Security Solution] Data Quality Dashboard persistence (#175673) [Ent Search] Connector client copy cleanup (#176290) [ML] Anomaly Detection: Adds actions menu to anomaly markers in Single Metric Viewer chart. (#175556) [ML] Anomaly Detection: Fix `values-dots` colors (#176303) [Fleet] Logstash Output - being compliant to RFC-952 (#176298) ...
## Summary Once we're moving to the elastic-wide buildkite agents, and away from the kibana-buildkite-managed ones, we won't have default access to the buckets we used to use, as the assumed service account will differ. **Note:** Although this will only be required in the new infra, but this change can be merged and expected to work properly in the current infra as well. ### Solution We've set up a central service-account with rights to impersonate other service accounts that have controlled access to individual buckets to minimize the reach and influence of individual accounts. See: elastic/kibana-operations#51 **several of the changes weren't tested, as they're part of CI tasks outside the PR build** - will merge with caution and monitor the stability afterwards TODO: _add access, and assume account before other GCS bucket usages_ - [x] storybook - [x] coverage (.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh) - [x] upload static site (.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh) - [x] SO object migration (.buildkite/scripts/steps/archive_so_migration_snapshot.sh) - [x] ES Snapshot manifest upload (.buildkite/scripts/steps/es_snapshots/create_manifest.ts) - [x] Scalability? (.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh) - [x] Benchmarking (.buildkite/scripts/steps/scalability/benchmarking.sh) - [x] Webpack bundle analyzer (.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts) - [x] ~Build chromium (x-pack/build_chromium/build.py)~ Not needed, as it's manual, and not a CI task TODO: _others_ - [x] Remove manifest upload (.buildkite/scripts/steps/es_serverless/promote_es_serverless_image.sh) - [x] Decide if we should merge with the CDN access: no, SRE is managing that account - [x] Bazel remote cache seems to also rely on gcs - roles PR: elastic/kibana-operations#56 Closes: elastic/kibana-operations#29 Part of: elastic/kibana-operations#15
## Summary Once we're moving to the elastic-wide buildkite agents, and away from the kibana-buildkite-managed ones, we won't have default access to the buckets we used to use, as the assumed service account will differ. **Note:** Although this will only be required in the new infra, but this change can be merged and expected to work properly in the current infra as well. ### Solution We've set up a central service-account with rights to impersonate other service accounts that have controlled access to individual buckets to minimize the reach and influence of individual accounts. See: elastic/kibana-operations#51 **several of the changes weren't tested, as they're part of CI tasks outside the PR build** - will merge with caution and monitor the stability afterwards TODO: _add access, and assume account before other GCS bucket usages_ - [x] storybook - [x] coverage (.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh) - [x] upload static site (.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh) - [x] SO object migration (.buildkite/scripts/steps/archive_so_migration_snapshot.sh) - [x] ES Snapshot manifest upload (.buildkite/scripts/steps/es_snapshots/create_manifest.ts) - [x] Scalability? (.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh) - [x] Benchmarking (.buildkite/scripts/steps/scalability/benchmarking.sh) - [x] Webpack bundle analyzer (.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts) - [x] ~Build chromium (x-pack/build_chromium/build.py)~ Not needed, as it's manual, and not a CI task TODO: _others_ - [x] Remove manifest upload (.buildkite/scripts/steps/es_serverless/promote_es_serverless_image.sh) - [x] Decide if we should merge with the CDN access: no, SRE is managing that account - [x] Bazel remote cache seems to also rely on gcs - roles PR: elastic/kibana-operations#56 Closes: elastic/kibana-operations#29 Part of: elastic/kibana-operations#15
…lastic#176781) This PR solves a problem introduced at elastic#174756 where the addition into the bash script was not being correctly loaded for the promote manifest pipeline as it was not running in the original main cwd.
## Summary Once we're moving to the elastic-wide buildkite agents, and away from the kibana-buildkite-managed ones, we won't have default access to the buckets we used to use, as the assumed service account will differ. **Note:** Although this will only be required in the new infra, but this change can be merged and expected to work properly in the current infra as well. ### Solution We've set up a central service-account with rights to impersonate other service accounts that have controlled access to individual buckets to minimize the reach and influence of individual accounts. See: elastic/kibana-operations#51 **several of the changes weren't tested, as they're part of CI tasks outside the PR build** - will merge with caution and monitor the stability afterwards TODO: _add access, and assume account before other GCS bucket usages_ - [x] storybook - [x] coverage (.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh) - [x] upload static site (.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh) - [x] SO object migration (.buildkite/scripts/steps/archive_so_migration_snapshot.sh) - [x] ES Snapshot manifest upload (.buildkite/scripts/steps/es_snapshots/create_manifest.ts) - [x] Scalability? (.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh) - [x] Benchmarking (.buildkite/scripts/steps/scalability/benchmarking.sh) - [x] Webpack bundle analyzer (.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts) - [x] ~Build chromium (x-pack/build_chromium/build.py)~ Not needed, as it's manual, and not a CI task TODO: _others_ - [x] Remove manifest upload (.buildkite/scripts/steps/es_serverless/promote_es_serverless_image.sh) - [x] Decide if we should merge with the CDN access: no, SRE is managing that account - [x] Bazel remote cache seems to also rely on gcs - roles PR: elastic/kibana-operations#56 Closes: elastic/kibana-operations#29 Part of: elastic/kibana-operations#15
…lastic#176781) This PR solves a problem introduced at elastic#174756 where the addition into the bash script was not being correctly loaded for the promote manifest pipeline as it was not running in the original main cwd.
## Summary Once we're moving to the elastic-wide buildkite agents, and away from the kibana-buildkite-managed ones, we won't have default access to the buckets we used to use, as the assumed service account will differ. **Note:** Although this will only be required in the new infra, but this change can be merged and expected to work properly in the current infra as well. ### Solution We've set up a central service-account with rights to impersonate other service accounts that have controlled access to individual buckets to minimize the reach and influence of individual accounts. See: elastic/kibana-operations#51 **several of the changes weren't tested, as they're part of CI tasks outside the PR build** - will merge with caution and monitor the stability afterwards TODO: _add access, and assume account before other GCS bucket usages_ - [x] storybook - [x] coverage (.buildkite/scripts/steps/code_coverage/reporting/uploadPrevSha.sh) - [x] upload static site (.buildkite/scripts/steps/code_coverage/reporting/uploadStaticSite.sh) - [x] SO object migration (.buildkite/scripts/steps/archive_so_migration_snapshot.sh) - [x] ES Snapshot manifest upload (.buildkite/scripts/steps/es_snapshots/create_manifest.ts) - [x] Scalability? (.buildkite/scripts/steps/functional/scalability_dataset_extraction.sh) - [x] Benchmarking (.buildkite/scripts/steps/scalability/benchmarking.sh) - [x] Webpack bundle analyzer (.buildkite/scripts/steps/webpack_bundle_analyzer/upload.ts) - [x] ~Build chromium (x-pack/build_chromium/build.py)~ Not needed, as it's manual, and not a CI task TODO: _others_ - [x] Remove manifest upload (.buildkite/scripts/steps/es_serverless/promote_es_serverless_image.sh) - [x] Decide if we should merge with the CDN access: no, SRE is managing that account - [x] Bazel remote cache seems to also rely on gcs - roles PR: elastic/kibana-operations#56 Closes: elastic/kibana-operations#29 Part of: elastic/kibana-operations#15
…lastic#176781) This PR solves a problem introduced at elastic#174756 where the addition into the bash script was not being correctly loaded for the promote manifest pipeline as it was not running in the original main cwd.
Summary
Once we're moving to the elastic-wide buildkite agents, and away from the kibana-buildkite-managed ones, we won't have default access to the buckets we used to use, as the assumed service account will differ.
Note: Although this will only be required in the new infra, but this change can be merged and expected to work properly in the current infra as well.
Solution
We've set up a central service-account with rights to impersonate other service accounts that have controlled access to individual buckets to minimize the reach and influence of individual accounts. See: https://github.com/elastic/kibana-operations/pull/51
several of the changes weren't tested, as they're part of CI tasks outside the PR build - will merge with caution and monitor the stability afterwards
TODO: add access, and assume account before other GCS bucket usages
Build chromium (x-pack/build_chromium/build.py)Not needed, as it's manual, and not a CI taskTODO: others
Closes: https://github.com/elastic/kibana-operations/issues/29
Part of: https://github.com/elastic/kibana-operations/issues/15