bazel: use GCS remote cache #4050

Merged — 12 commits, merged into master on Aug 12, 2018
Conversation

@lizan (Member) commented Aug 3, 2018

Signed-off-by: Lizan Zhou <zlizan@google.com>

Partially addresses #3741

Description:
Use GCS remote cache for CI builds when key is provided.

Risk Level: Low
Testing: CI
Docs Changes:
Release Notes:
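
For context, enabling a GCS-backed remote cache for a Bazel build comes down to pointing the HTTP cache at the bucket and supplying credentials. A minimal sketch; the target, bucket URL, and key path below are illustrative placeholders, not the exact CI values:

# Minimal sketch, not the exact CI invocation; bucket URL and key path are placeholders.
bazel build //source/exe:envoy-static \
  --remote_http_cache=https://storage.googleapis.com/<your-cache-bucket>/ \
  --google_credentials=/path/to/gcp_service_account.json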

Signed-off-by: Lizan Zhou <zlizan@google.com>
rm -rf /tmp/gcp_service_account.json
}

if [[ ! -z "${GCP_SERVICE_ACCOUNT_KEY}" && ! -z "${BAZEL_REMOTE_CACHE}" ]]; then
Member:

Do you want to grab me on Slack so I can set up the env variables for you and we can test this?

Member Author:

I was able to add those env variables to CircleCI.

Signed-off-by: Lizan Zhou <zlizan@google.com>
@@ -18,6 +18,8 @@ export NUM_CPUS=8
# (see https://circleci.com/docs/2.0/executor-types/#using-machine)
export BAZEL_EXTRA_TEST_OPTIONS="--test_env=ENVOY_IP_TEST_VERSIONS=v4only"

export BAZEL_REMOTE_CACHE="${BAZEL_REMOTE_CACHE:-https://storage.googleapis.com/envoy-circleci-bazel-cache/}"
Contributor:

Wouldn't it be more appropriate to set this in .circleci/config.yml?

Member:

Yeah, can we make this end-user configurable and document how individual developers can also make use of it? Maybe in https://github.com/envoyproxy/envoy/blob/master/bazel/README.md.

Member:

(Specifically for when working outside a Docker image.)
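
As a rough illustration of the developer-facing usage being asked for here, an individual developer could opt in by exporting the same variables the CI scripts read. This is a sketch under the assumption that an entry point such as ci/do_ci.sh picks them up; the bucket URL and key path are placeholders:

# Hypothetical local opt-in; variable names mirror the CI scripts, values are placeholders.
export BAZEL_REMOTE_CACHE="https://storage.googleapis.com/<your-cache-bucket>/"
export GCP_SERVICE_ACCOUNT_KEY="$(base64 < /path/to/gcp_service_account.json)"
./ci/do_ci.sh bazel.release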

@htuch (Member) left a comment:

Awesome, excited to see this land; any idea of speedups observed so far?


gcp_service_account_cleanup() {
echo "Deleting service account key file..."
rm -rf /tmp/gcp_service_account.json
Member:

Use TMPDIR or something; it's probably bad to assume the existence of /tmp.

if [[ ! -z "${BAZEL_REMOTE_CACHE}" ]]; then

if [[ ! -z "${GCP_SERVICE_ACCOUNT_KEY}" ]]; then
echo "${GCP_SERVICE_ACCOUNT_KEY}" | base64 --decode > /tmp/gcp_service_account.json
Member:

mktemp

trap gcp_service_account_cleanup EXIT

export BAZEL_BUILD_EXTRA_OPTIONS="${BAZEL_BUILD_EXTRA_OPTIONS} \
--remote_http_cache=${BAZEL_REMOTE_CACHE} --google_credentials=/tmp/gcp_service_account.json"
Member:

Do we have to use the filesystem? If so, best to make sure that the file perms are appropriately restricted.

Member Author:

Yes; switching to mktemp also addresses the permissions.
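
A minimal sketch of the mktemp-based approach discussed above (function and variable names follow the earlier snippets; the final script may differ in detail):

# Sketch: mktemp creates the file with 0600 permissions, addressing the perms concern.
GCP_SERVICE_ACCOUNT_KEY_FILE="$(mktemp)"

gcp_service_account_cleanup() {
  echo "Deleting service account key file..."
  rm -f "${GCP_SERVICE_ACCOUNT_KEY_FILE}"
}
trap gcp_service_account_cleanup EXIT

echo "${GCP_SERVICE_ACCOUNT_KEY}" | base64 --decode > "${GCP_SERVICE_ACCOUNT_KEY_FILE}"
export BAZEL_BUILD_EXTRA_OPTIONS="${BAZEL_BUILD_EXTRA_OPTIONS} \
  --remote_http_cache=${BAZEL_REMOTE_CACHE} --google_credentials=${GCP_SERVICE_ACCOUNT_KEY_FILE}"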


@htuch self-assigned this Aug 5, 2018
Signed-off-by: Lizan Zhou <zlizan@google.com>
@lizan (Member Author) commented Aug 6, 2018

Performance can be tracked at https://circleci.com/gh/envoyproxy/envoy/tree/ci_gcs_remote_cache. It doesn't look like this helped a lot; perhaps the network is not fast enough compared to compilation? Needs more investigation.

@PiotrSikora (Contributor):

The cache hit ratio on that branch is pretty terrible, e.g. for release builds:

From https://circleci.com/gh/envoyproxy/envoy/79682 (bazel: use GCS remote cache):

INFO: 4323 processes: 7 remote cache hit, 4316 local.
INFO: 582 processes: 21 remote cache hit, 561 local.

~0.5% cache hit ratio, total time: 1:16:05.

From https://circleci.com/gh/envoyproxy/envoy/79704 (setup read-only cache ...):

INFO: 1738 processes: 68 remote cache hit, 1670 local.
INFO: 4323 processes: 420 remote cache hit, 3903 local.
INFO: 582 processes: 67 remote cache hit, 515 local.

~8% cache hit ratio, total time: 1:10:28.

From https://circleci.com/gh/envoyproxy/envoy/79745 (merge with upstream/master):

INFO: 1741 processes: 91 remote cache hit, 1650 local.
INFO: 4329 processes: 245 remote cache hit, 4084 local.
INFO: 582 processes: 64 remote cache hit, 518 local.

~6% cache hit ratio, total time: 1:11:56.

From https://circleci.com/gh/envoyproxy/envoy/80260 (move env variables to circleci.yaml):

INFO: 1741 processes: 123 remote cache hit, 1618 local.
INFO: 4329 processes: 887 remote cache hit, 3442 local.
INFO: 582 processes: 156 remote cache hit, 426 local.

~17.5% cache hit ratio, total time: 55:33.

The last one is quite surprising, because it should be close to 100%, so something seems busted.

Anyway, the issue with caching is that any changes to the environment, api/ protos, and/or include/ headers will effectively invalidate it. While it's really amazing for local development and/or developing 3rd-party filters (see: improvements in Istio Proxy with CircleCI's cache), I wouldn't have too high hopes for it on new PRs... I suspect the cache hit ratio for the initial commits in new PRs is going to be around 0-10%, but very high for subsequent commits addressing review comments (assuming no changes to api/ and include/). Ultimately, we'd need to deploy it to see how big an improvement there is in practice.

@PiotrSikora previously approved these changes Aug 6, 2018
@lizan (Member Author) commented Aug 7, 2018

The low cache hit rate is mostly due to:

WARNING: Error writing to the remote cache:
429 Too Many Requests
<?xml version='1.0' encoding='UTF-8'?><Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message><Details>Changing object envoy-circleci-bazel-cache/cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 too quickly, please slow down</Details></Error>

I would expect the hit rate to improve after multiple runs.

@lizan changed the title from "[WIP] bazel: use GCS remote cache" to "bazel: use GCS remote cache" on Aug 7, 2018
@lizan (Member Author) commented Aug 7, 2018

> I wouldn't have too high hopes for it on new PRs... I suspect the cache hit ratio for the initial commits in new PRs is going to be around 0-10%, but very high for subsequent commits addressing review comments (assuming no changes to api/ and include/).

This won't be true unless we expose the environment variables to PRs from forks (currently we don't, and I don't feel it is safe), so PRs from forks (like #4049, compared to this one) will have a read-only cache.
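
For illustration, one way to express that split with Bazel's flags is to keep reads enabled but skip uploads when no credentials are present. A sketch assuming the bucket allows anonymous reads and that the Bazel version in use supports --remote_upload_local_results; names reuse the mktemp sketch above:

# Sketch: read/write cache when a key is provided, read-only otherwise (e.g. PRs from forks).
if [[ -n "${GCP_SERVICE_ACCOUNT_KEY}" ]]; then
  export BAZEL_BUILD_EXTRA_OPTIONS="${BAZEL_BUILD_EXTRA_OPTIONS} \
    --remote_http_cache=${BAZEL_REMOTE_CACHE} --google_credentials=${GCP_SERVICE_ACCOUNT_KEY_FILE}"
else
  # No credentials: still read from the cache, but don't upload local results.
  export BAZEL_BUILD_EXTRA_OPTIONS="${BAZEL_BUILD_EXTRA_OPTIONS} \
    --remote_http_cache=${BAZEL_REMOTE_CACHE} --noremote_upload_local_results"
fi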

@PiotrSikora (Contributor):

> I would expect the hit rate to improve after multiple runs.

From https://circleci.com/gh/envoyproxy/envoy/80286 (merge with upstream/master):

INFO: 1741 processes: 141 remote cache hit, 1600 local.
INFO: 4351 processes: 806 remote cache hit, 3545 local.
INFO: 586 processes: 223 remote cache hit, 363 local.

~17.5% cache hit ratio (same as before), total time: 1:20:06.

> This won't be true unless we expose the environment variables to PRs from forks (currently we don't, and I don't feel it is safe), so PRs from forks (like #4049, compared to this one) will have a read-only cache.

Good point, and agreed wrt security, but looking at the PR history, it seems that only Lyft folks are using the main repo for PRs and everybody else is using private forks, so this approach won't result in huge improvements.

@lizan (Member Author) commented Aug 7, 2018

Mac seems to improve quite a bit during this PR; it might be because the Mac job runs first (it doesn't need to download a Docker image) and so uses up the GCS rate limit early.

https://circleci.com/gh/envoyproxy/envoy/80290 (total time: 55:35)

INFO: 2290 processes: 2084 remote cache hit, 206 darwin-sandbox.
INFO: 1846 processes: 1214 remote cache hit, 632 darwin-sandbox.
INFO: 578 processes: 289 remote cache hit, 289 darwin-sandbox.

https://circleci.com/gh/envoyproxy/envoy/80368 (total time: 26:35)

INFO: 2290 processes: 2287 remote cache hit, 3 darwin-sandbox.
INFO: 1846 processes: 1554 remote cache hit, 292 darwin-sandbox.
INFO: 578 processes: 573 remote cache hit, 5 darwin-sandbox.

@lizan (Member Author) commented Aug 7, 2018

> Good point, and agreed wrt security, but looking at the PR history, it seems that only Lyft folks are using the main repo for PRs and everybody else is using private forks, so this approach won't result in huge improvements.

The read-only build cache will be kept updated by master builds, so each PR (unless it changes include/) will make some use of the cache.

Signed-off-by: Lizan Zhou <zlizan@google.com>
@htuch (Member) left a comment:

Diff looks good, will continue discussion in PR on efficacy.

set -e

if [[ ! -z "${BAZEL_REMOTE_CACHE}" ]]; then

Member:

Nit: remove superfluous blank lines here and below.

@htuch (Member) commented Aug 7, 2018

I see evidence of rate limiting on the branch CI, e.g. https://circleci.com/gh/envoyproxy/envoy/80440. Is this killing our ability to cache effectively? Would we be better off with a dedicated cache server, e.g. Google Cloud Build? Should we be hitting the 200 object modifications per second limit at https://cloud.google.com/storage/quotas?

Signed-off-by: Lizan Zhou <zlizan@google.com>
@lizan (Member Author) commented Aug 7, 2018

@htuch the 200 per second limit is for object composition. Per https://cloud.google.com/storage/docs/request-rate, GCS doesn't really have a hard limit but needs to warm up, starting from 1000 req/s for writes and 5000 req/s for reads. I would expect that after we merge this, CI will use the cache for every commit (read/write from master builds, read-only from each PR build), so the storage will be warmer.

@htuch (Member) commented Aug 10, 2018

@lizan LGTM; is there anything holding us back on merging this?

@htuch previously approved these changes Aug 10, 2018
@lizan (Member Author) commented Aug 10, 2018

@htuch I was investigating the coverage drop for runs on CircleCI; it turns out the remote cache doesn't cache gcno files unless you run bazel coverage or pass --collect_code_coverage, but that would break the coverage script.

@lizan (Member Author) commented Aug 10, 2018

Disabled the cache for coverage; I'll address it in a follow-up. Let's merge this first and see how it goes on master.
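
As an illustration of "disabled the cache for coverage", the guard could look roughly like this; the CI_TARGET check is a hypothetical placeholder for however the coverage job is detected:

# Sketch: skip the remote cache for coverage builds, since the HTTP cache
# doesn't carry the .gcno files the coverage script needs.
if [[ -n "${BAZEL_REMOTE_CACHE}" && "${CI_TARGET}" != "bazel.coverage" ]]; then
  export BAZEL_BUILD_EXTRA_OPTIONS="${BAZEL_BUILD_EXTRA_OPTIONS} \
    --remote_http_cache=${BAZEL_REMOTE_CACHE}"
fi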

@htuch (Member) left a comment:

Look, up in the sky! It's a bird! It's a plane! It's Bazel remote caching!

@htuch merged commit 7309c14 into master on Aug 12, 2018
@htuch deleted the ci_gcs_remote_cache branch on August 12, 2018 at 01:22
snowp added a commit to snowp/envoy that referenced this pull request Aug 14, 2018
* origin/master: (38 commits)
  test: add tests for corner-cases around sending requests before run() starts or after run() ends. (envoyproxy#4114)
  perf: reduce the memory usage of LC Trie construction (envoyproxy#4117)
  test: moving redundant code in websocket_integration_test to utilities (envoyproxy#4127)
  test: make YamlLoadFromStringFail less picky about error msg. (envoyproxy#4141)
  rbac: add rbac network filter. (envoyproxy#4083)
  fuzz: route lookup and header finalization fuzzer. (envoyproxy#4116)
  Set content-type and content-length (envoyproxy#4113)
  fault: use FractionalPercent for percent (envoyproxy#3978)
  test: Fix inverted exact match logic in IntegrationTcpClient::waitForData() (envoyproxy#4134)
  Added cluster_name to load assignment config for static cluster (envoyproxy#4123)
  ssl: refactor ContextConfig to use TlsCertificateConfig (envoyproxy#4115)
  syscall: refactor OsSysCalls for deeper errno latching (envoyproxy#4111)
  thrift_proxy: fix oneway bugs (envoyproxy#4025)
  Do not crash when converting YAML to JSON fails (envoyproxy#4110)
  config: allow unknown fields flag (take 2) (envoyproxy#4096)
  Use a jittered backoff strategy for handling HdsDelegate stream/connection failures (envoyproxy#4108)
  bazel: use GCS remote cache (envoyproxy#4050)
  Add thread local cache of overload action states (envoyproxy#4090)
  Added TCP healthcheck capabilities to the HdsDelegate (envoyproxy#4079)
  secret: add secret provider interface and use it for TlsCertificates (envoyproxy#4086)
  ...

Signed-off-by: Snow Pettersen <snowp@squareup.com>