dev docs: Document how to fix cached state issues #7558

Open
teor2345 opened this issue Sep 14, 2023 · 9 comments
Labels
A-devops Area: Pipelines, CI/CD and Dockerfiles A-docs Area: Documentation C-enhancement Category: This is an improvement S-needs-triage Status: A bug report needs triage

Comments

@teor2345
Contributor

teor2345 commented Sep 14, 2023

Motivation

We want any developer to be able to fix CI issues. Sometimes fixing cached states is complex.

Rough Plan

  • the two ways to sync, full and update, and how to tell them apart in the list of cached states (an update has -u in its name)
  • how cached states are deleted: scheduled automatic cleanup and manual deletion
  • how to work out which cached states to delete: start from the PR errors, find the cached states those jobs used, then find any cached states generated from those states
  • checkpoint disks always upgrade from an older state version (currently 25.2), so the checkpoint tests might be the only tests that fail due to upgrade bugs
    • if the checkpoint state version was released more than 16 weeks ago, it is out of support, and we can fix that CI bug by rebuilding those states
  • anything else mentioned in this ticket's comments
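
The 16-week rule in the plan above could be checked with a small date calculation. This is only a sketch: the function name, the parameter names, and the example dates are hypothetical, and it assumes the support window is measured from the release date of the version that created the checkpoint state.

```python
from datetime import date, timedelta

# Support window taken from the plan above: 16 weeks.
SUPPORT_WINDOW = timedelta(weeks=16)


def is_out_of_support(released: date, today: date) -> bool:
    """Return True if `released` is more than 16 weeks before `today`.

    Assumption: the window starts at the release date of the version
    that produced the checkpoint cached state.
    """
    return today - released > SUPPORT_WINDOW


# Hypothetical dates, for illustration only.
print(is_out_of_support(date(2023, 5, 1), date(2023, 9, 14)))
print(is_out_of_support(date(2023, 8, 1), date(2023, 9, 14)))
```

If the check returns True, the plan above suggests rebuilding the checkpoint states rather than debugging the upgrade path.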
@teor2345 teor2345 added A-docs Area: Documentation A-devops Area: Pipelines, CI/CD and Dockerfiles C-enhancement Category: This is an improvement S-needs-triage Status: A bug report needs triage P-Medium ⚡ labels Sep 14, 2023
@teor2345
Contributor Author

Here is how to test a change to the cached state CI workflows:
https://github.com/ZcashFoundation/zebra/pull/7557/files#r1325230754

@teor2345
Contributor Author

Here is an example of a test that passes in Rust, but fails in the CI workflow:
https://github.com/ZcashFoundation/zebra/actions/runs/6165092427/job/16733752210?pr=7515

@teor2345
Contributor Author

teor2345 commented Sep 14, 2023

Here's how to manually rebuild a cached state:
#7507 (comment)

And what it looks like when the test passes:
#7507 (comment)

@teor2345
Contributor Author

Here's how to delete an invalid cached state:
#7555 (comment)

@mpguerra
Contributor

mpguerra commented Sep 14, 2023

I think we should document how to check a cached state's details to tell whether it might have a problem, so:

  • how to find and access the cached state
  • how to figure out when and how it was generated
  • how to extract other meaningful data from it

@teor2345
Contributor Author

teor2345 commented Sep 14, 2023

An example of how to find the cached state used by a test:

The cached state used by the failing PR is not from #7531. It was generated by the self-hosted runners branch (the branch name and commit come after the cache type prefix):
zebrad-cache-self-hosted--7c7e860-v25-mainnet-tip-075639

Here is where the cached state is logged in CI:
https://github.com/ZcashFoundation/zebra/actions/runs/6169241952/job/16744865315?pr=7349#step:10:133

That cached state was created using a full sync. Full syncs have different tags from updates (the start height, original version, and update flag -u are all different). Here is a link to the list of cached state images and their tags:
https://console.cloud.google.com/compute/images?tab=images&hl=en&project=zfnd-dev-zebra&pageState=(%22images%22:(%22s%22:%5B(%22i%22:%22creationTimestamp%22,%22s%22:%221%22),(%22i%22:%22creator%22,%22s%22:%221%22),(%22i%22:%22name%22,%22s%22:%220%22)%5D,%22f%22:%22%255B%257B_22k_22_3A_22Created%2520by_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22zfnd-dev-zebra_5C_22_22_2C_22i_22_3A_22creator_22%257D_2C%257B_22k_22_3A_22Name_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22self-hosted_5C_22_22_2C_22i_22_3A_22name_22%257D%255D%22))&pli=1

The tags also contain other useful data about the cached state, such as the heights, versions, and the name of the cached state it was updated from (once PR #7560 merges).
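
To make the naming convention concrete, here is a rough sketch that splits a cached state name into its parts. The field layout is an assumption inferred from the single example name in this comment, the position of the -u flag is a guess, and `parse_cached_state_name` is a hypothetical helper, not part of Zebra's CI. The meaning of the trailing digits is not stated here, so the sketch just calls them `suffix`.

```python
import re
from typing import Optional

# Assumed layout, inferred from one example name:
#   <cache type>-<branch>-<commit>-v<state version>-<network>-<kind>[-u]-<suffix>
NAME_PATTERN = re.compile(
    r"^(?P<cache_type>[a-z]+-cache)-"
    r"(?P<branch>.+)-"
    r"(?P<commit>[0-9a-f]{7})-"
    r"v(?P<state_version>\d+)-"
    r"(?P<network>[a-z]+)-"
    r"(?P<kind>[a-z]+)"
    r"(?P<update>-u)?-"  # assumed position of the update flag
    r"(?P<suffix>\d+)$"
)


def parse_cached_state_name(name: str) -> Optional[dict]:
    """Split a cached state name into fields, or return None if it does not match."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # An update has -u in its name; a full sync does not.
    fields["is_update"] = fields.pop("update") is not None
    return fields


info = parse_cached_state_name(
    "zebrad-cache-self-hosted--7c7e860-v25-mainnet-tip-075639"
)
print(info)
```

For the example name, the branch field comes out as `self-hosted-` and the commit as `7c7e860`, with `is_update` set to False, which matches the full sync described above.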

@teor2345
Contributor Author

Cached states are only saved if their height is at least 1000 blocks above the previous cached state (roughly 1 day of blocks).

Here's how to force a new cached state to be saved, regardless of its height:
#7560 (comment)
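
The save rule can be sketched as a simple comparison. This is an illustration, not the CI workflow's actual logic: the function name and the `force` parameter are hypothetical stand-ins for the forced save described in the linked comment.

```python
# Minimum height increase before a new cached state is saved (from the comment above).
MIN_HEIGHT_INCREASE = 1000


def should_save_cached_state(
    new_height: int, previous_height: int, force: bool = False
) -> bool:
    """Return True if a new cached state should be saved.

    `force` is a hypothetical flag that skips the height check.
    """
    if force:
        return True
    return new_height - previous_height >= MIN_HEIGHT_INCREASE


# Hypothetical heights, for illustration only.
print(should_save_cached_state(2_200_500, 2_200_000))
print(should_save_cached_state(2_201_000, 2_200_000))
print(should_save_cached_state(2_200_500, 2_200_000, force=True))
```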

@teor2345
Contributor Author

Sometimes cached states are saved with a very low number of blocks due to a Rust test or CI bug. This issue has come up recently with both Zebra and lightwalletd cached states.

Here's an analysis of one bug like that:
#7555 (comment)

@teor2345
Contributor Author

teor2345 commented Oct 5, 2023

Fix steps from ticket #7661:

  • Find all the lightwalletd cached states with heights less than 2 million
  • Work out which PR, branch, and CI job created them, and open a ticket to fix it
  • Optional: stop new invalid images from being created, by pausing CI on that PR or disabling that scheduled CI job (if that's possible)
  • Delete all the lightwalletd cached states with heights less than 2 million (they are all invalid)
  • Repeat these steps until the original bug is fixed, merged, and all PRs are updated with that fix
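
The first and fourth steps amount to filtering the image list by type and height. A minimal sketch, assuming the image list has already been exported into records with a type and a height; the `CachedState` record, its field names, and the sample entries are all hypothetical:

```python
from typing import List, NamedTuple

# Height threshold from the fix steps above: lightwalletd states below it are invalid.
MIN_VALID_LWD_HEIGHT = 2_000_000


class CachedState(NamedTuple):
    name: str    # image name (hypothetical examples below)
    kind: str    # "lightwalletd" or "zebrad" (assumed labels)
    height: int  # block height stored in the cached state


def invalid_lwd_states(states: List[CachedState]) -> List[CachedState]:
    """Return the lightwalletd cached states that are below the valid height."""
    return [
        state
        for state in states
        if state.kind == "lightwalletd" and state.height < MIN_VALID_LWD_HEIGHT
    ]


# Made-up records, for illustration only.
states = [
    CachedState("example-lwd-a", "lightwalletd", 1_500),
    CachedState("example-lwd-b", "lightwalletd", 2_230_000),
    CachedState("example-zebrad-a", "zebrad", 1_500),
]
for state in invalid_lwd_states(states):
    print(state.name, state.height)
```

The resulting list is what you would review before deleting anything, and its names are also the starting point for working out which PR, branch, and CI job created the invalid states.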
