fix: restore peribolos with reliable auth and drift detection #89
Merged: marcusburghardt merged 2 commits into complytime:main from marcusburghardt:opsx/fix-peribolos-implementation, May 8, 2026
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,127 @@ | ||
| # Workflow to detect drift between peribolos.yaml and actual GitHub org state. | ||
| # Runs weekly on Monday mornings. Opens or updates a GitHub issue when drift is detected. | ||
| name: Drift Detection | ||
|
|
||
| on: | ||
| schedule: | ||
| # Monday at 04:30 UTC — drift issues visible by EU morning (06:30 CEST / 05:30 CET), | ||
| # before daily reconciliation at 05:30 UTC | ||
| - cron: '30 4 * * 1' | ||
| workflow_dispatch: | ||
|
|
||
| jobs: | ||
| detect-drift: | ||
| if: github.repository_owner == 'complytime' | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 20 | ||
| permissions: | ||
| contents: read | ||
| issues: write | ||
| steps: | ||
| - name: Checkout complytime/.github repo | ||
| uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 | ||
|
|
||
| - name: Install Go | ||
| uses: actions/setup-go@4a3601121dd01d1626a1e23e37211e3254c1c06c # v6.4.0 | ||
| with: | ||
| go-version-file: './go.mod' | ||
|
|
||
| - name: Checkout peribolos source (kubernetes-sigs/prow) | ||
| uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 | ||
| with: | ||
| repository: kubernetes-sigs/prow | ||
|
|
||
| - name: Build peribolos | ||
| run: | | ||
| cd cmd/peribolos | ||
| go mod tidy | ||
| go build -o . | ||
| cp peribolos /tmp | ||
|
|
||
| - name: Generate GitHub App token | ||
| id: app-token | ||
| uses: actions/create-github-app-token@1b10c78c7865c340bc4f6099eb2f838309f1e8c3 # v3.1.1 | ||
| with: | ||
| client-id: ${{ secrets.COMPLYTIME_BOT_CLIENT_ID }} | ||
| private-key: ${{ secrets.COMPLYTIME_BOT_PRIVATE_KEY }} | ||
| owner: complytime | ||
|
|
||
| - name: Dump current org state | ||
| env: | ||
| APP_TOKEN: ${{ steps.app-token.outputs.token }} | ||
| run: | | ||
| set -o pipefail | ||
| /tmp/peribolos \ | ||
| --config-path peribolos.yaml \ | ||
| --require-self=false \ | ||
| --github-token-path <(printf '%s' "$APP_TOKEN") \ | ||
| --dump complytime \ | ||
| --dump-full 2>/tmp/peribolos-dump-stderr.log | yq -P 'sort_keys(..)' > /tmp/org-actual.yaml | ||
| yq -P 'sort_keys(..)' peribolos.yaml > /tmp/org-expected.yaml | ||
|
|
||
| - name: Compare org state | ||
| id: diff | ||
| run: | | ||
| if diff -u /tmp/org-expected.yaml /tmp/org-actual.yaml > /tmp/drift-diff.txt 2>&1; then | ||
| echo "drift=false" >> "$GITHUB_OUTPUT" | ||
| echo "No drift detected." | ||
| else | ||
| echo "drift=true" >> "$GITHUB_OUTPUT" | ||
| echo "Drift detected between peribolos.yaml and actual org state." | ||
| fi | ||
|
|
||
| - name: Check for existing drift issue | ||
| if: steps.diff.outputs.drift == 'true' | ||
| id: existing-issue | ||
| env: | ||
| GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| run: | | ||
| ISSUE_NUMBER=$(gh issue list \ | ||
| --label peribolos-drift \ | ||
| --state open \ | ||
| --limit 1 \ | ||
| --json number \ | ||
| --jq '.[0].number // empty') | ||
| echo "issue_number=${ISSUE_NUMBER}" >> "$GITHUB_OUTPUT" | ||
|
|
||
| - name: Create or update drift issue | ||
| if: steps.diff.outputs.drift == 'true' | ||
| env: | ||
| GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| ISSUE_NUMBER: ${{ steps.existing-issue.outputs.issue_number }} | ||
| WORKFLOW_URL: "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" | ||
| run: | | ||
| TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ) | ||
|
|
||
| { | ||
| echo "## Peribolos Drift Detected" | ||
| echo "" | ||
| echo "**Date**: ${TIMESTAMP}" | ||
| echo "**Workflow run**: ${WORKFLOW_URL}" | ||
| echo "" | ||
| echo "The actual GitHub org state differs from what is declared in \`peribolos.yaml\`." | ||
| echo "This may indicate manual changes were made via the GitHub UI." | ||
| echo "" | ||
| echo "### Diff" | ||
| echo "" | ||
| echo '```diff' | ||
| cat /tmp/drift-diff.txt | ||
| echo '```' | ||
| echo "" | ||
| echo "### Recommended Action" | ||
| echo "" | ||
| echo "- Review the diff to determine if the changes are intentional" | ||
| echo "- If unintentional: trigger a manual Peribolos apply via \`workflow_dispatch\`" | ||
| echo "- If intentional: update \`peribolos.yaml\` to match the desired state" | ||
| } > /tmp/issue-body.md | ||
|
|
||
| if [ -n "$ISSUE_NUMBER" ]; then | ||
| gh issue edit "$ISSUE_NUMBER" --body-file /tmp/issue-body.md | ||
| echo "Updated existing issue #${ISSUE_NUMBER}" | ||
| else | ||
| gh issue create \ | ||
| --title "Peribolos Drift Detected - $(date -u +%Y-%m-%d)" \ | ||
| --body-file /tmp/issue-body.md \ | ||
| --label peribolos-drift | ||
| echo "Created new drift issue" | ||
| fi | ||
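A note on the compare step above: `diff` exits 0 when the files match, 1 when they differ, and 2 or higher when it hits an error such as a missing input file. The `if diff ...; else` form treats every non-zero status as drift. The following is a minimal sketch of a variant that separates the three cases; `detect_drift` and the sample file contents are hypothetical and are not part of the merged workflow.

```shell
#!/usr/bin/env bash
# Sketch only: detect_drift is a hypothetical helper, not part of the workflow.
set -u

detect_drift() {
  local expected="$1" actual="$2" report="$3" status=0
  # Capture diff's exit status without aborting the script.
  diff -u "$expected" "$actual" > "$report" 2>&1 || status=$?
  case "$status" in
    0) echo "drift=false" ;;   # files are identical
    1) echo "drift=true" ;;    # files differ
    *) echo "drift=error" ;;   # diff itself failed (for example, a missing file)
  esac
}

# Sample data standing in for the sorted expected and actual org state.
workdir=$(mktemp -d)
printf 'admins:\n- alice\n- bob\n' > "$workdir/expected.yaml"
printf 'admins:\n- alice\n- bob\n' > "$workdir/same.yaml"
printf 'admins:\n- alice\n- mallory\n' > "$workdir/changed.yaml"

no_drift=$(detect_drift "$workdir/expected.yaml" "$workdir/same.yaml" "$workdir/report-same.txt")
with_drift=$(detect_drift "$workdir/expected.yaml" "$workdir/changed.yaml" "$workdir/report-changed.txt")
echo "$no_drift"
echo "$with_drift"
```

In the workflow as merged, the dump step runs under `set -o pipefail`, so a failed dump should stop the job before the comparison runs; the extra branch only matters if the compare step is ever reused without that guard.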
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| schema: spec-driven | ||
| created: 2026-05-07 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,95 @@ | ||
| ## Context | ||
|
|
||
| The `complytime` GitHub organization uses Peribolos (a Prow CLI tool) to manage org settings, teams, memberships, and repository permissions as code via `peribolos.yaml`. The implementation lives in the `complytime/.github` repository. | ||
|
|
||
| Current state: | ||
| - Peribolos has been silently failing since April 16, 2026 due to an expired GitHub App user access token | ||
| - The workflow pipeline masks failures: `peribolos ... 2>&1 | jq ...` swallows the exit code | ||
| - Every run since April 16 reports `success` while Peribolos exits with `fatal: Configuration failed: status code 404` | ||
| - The `complytime-bot` GitHub App is installed with correct permissions (`organization_administration: write`, `members: write`, `administration: write`) | ||
| - `pme-bot` is a regular user account (not an app bot) whose expired token was stored in `secrets.APP_ACCESS_TOKEN` | ||
| - The `testTeamMembers()` function in `config/config_test.go` is defined but never called from any test | ||
| - Org admins (`jpower432`, `marcusburghardt`) are listed as `members:` instead of `maintainers:` in several teams | ||
|
|
||
| ## Goals / Non-Goals | ||
|
|
||
| ### Goals | ||
|
|
||
| - Restore Peribolos to a working state with reliable, self-renewing authentication | ||
| - Make Peribolos failures visible (fail the workflow when Peribolos fails) | ||
| - Enable on-demand manual reapplication of org settings | ||
| - Enable daily automatic reconciliation | ||
| - Detect and alert when org state drifts from the declared config | ||
| - Fix config and test issues that allow invalid configurations | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| - Managing branch protection rules (requires `branchprotector`, a separate Prow tool) | ||
| - Managing GitHub Actions permissions, webhooks, or repository rulesets | ||
| - Adding new repository settings beyond what is currently declared (keep it minimalist per user preference) | ||
| - Migrating away from Peribolos to a different tool | ||
|
|
||
| ## Decisions | ||
|
|
||
| ### D1: Use `actions/create-github-app-token` for authentication | ||
|
|
||
| **Decision**: Replace the static `APP_ACCESS_TOKEN` secret with per-run installation tokens generated by `actions/create-github-app-token@v3` (SHA-pinned per existing workflow conventions). | ||
|
|
||
| **Rationale**: The existing `complytime-bot` GitHub App already has all required permissions. Installation tokens are generated fresh on each run (1-hour TTL, revoked automatically when the job ends), which eliminates token expiry as a failure mode. This is the GitHub-recommended approach (official action, 800+ stars). The user access token approach requires manual regeneration every 8 hours, and the device flow requires human interaction, so neither is suitable for CI. | ||
|
|
||
| **Alternatives considered**: A fine-grained PAT is simpler but still requires manual rotation. Refresh token rotation in CI is fragile and creates security risks (the workflow would modify its own secrets). | ||
|
|
||
| **Configuration**: | ||
| - `secrets.COMPLYTIME_BOT_CLIENT_ID` and `secrets.COMPLYTIME_BOT_PRIVATE_KEY` (already created) | ||
| - Token scoped to `owner: complytime` for org-wide access | ||
| - `skip-token-revoke: false` (default, auto-revoke after job) | ||
|
|
||
| **Migration plan**: | ||
| 1. Deploy the updated workflow while `APP_ACCESS_TOKEN` still exists as a fallback reference | ||
| 2. Run a manual `workflow_dispatch` dry-run to validate token generation works | ||
| 3. After one successful push-triggered run on `main`, mark `APP_ACCESS_TOKEN` as deprecated | ||
| 4. Remove `APP_ACCESS_TOKEN` after 3 successful runs | ||
|
|
||
| ### D2: Add `--require-self=false` to Peribolos | ||
|
|
||
| **Decision**: Disable the `--require-self` check (which defaults to `true`). | ||
|
|
||
| **Rationale**: The `--require-self` flag calls `GET /user` to verify the authenticated user is an org admin. Installation tokens cannot call `GET /user` (it's a user-only endpoint). This is the only Peribolos endpoint incompatible with installation tokens. The safety check is replaced by `--min-admins 2` and the app's own permission constraints (the App can only perform actions within its granted permission scopes). | ||
|
|
||
| ### D3: Fix pipeline exit code propagation | ||
|
|
||
| **Decision**: Add `set -o pipefail` to the shell step that runs Peribolos. | ||
|
|
||
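| **Rationale**: Without `pipefail`, the exit code of `peribolos ... | jq ...` is jq's exit code, not Peribolos' exit code. jq exits 0 whenever it parses its input successfully, including when that input is Peribolos' fatal error log, so the step reports success. This is the root cause of the silent failures since April 16. | ||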
|
|
||
| ### D4: Remove ghproxy from the apply workflow | ||
|
|
||
| **Decision**: Remove the ghproxy sidecar process from the workflow. | ||
|
|
||
| **Rationale**: Peribolos is not configured to route through ghproxy (no `--github-endpoint=http://localhost:8888` flag), so ghproxy runs but is never used. The warning `"It doesn't look like you are using ghproxy"` confirms this. For a small org (~12 repos, ~20 members, 5 teams), API rate limiting is not a concern. Removing it simplifies the workflow. | ||
|
|
||
| ### D5: Drift detection via `peribolos --dump` | ||
|
|
||
| **Decision**: Create a separate weekly scheduled workflow that runs `peribolos --dump complytime` to capture the actual org state, then diffs it against `peribolos.yaml`. The workflow opens a GitHub issue when drift is detected, or updates the existing open issue if there is one. | ||
|
|
||
| **Rationale**: Even with daily reconciliation, there is value in explicitly detecting drift. Weekly frequency is chosen because daily reconciliation handles remediation; drift detection catches persistent or reconciliation-resistant drift. A separate workflow keeps concerns isolated from the apply workflow. | ||
|
|
||
| ### D6: Trigger behavior matrix | ||
|
|
||
| The apply workflow supports multiple trigger types with different behaviors: | ||
|
|
||
| | Trigger | Token Gen | Peribolos Build | Apply (`--confirm`) | Dry-run only | | ||
| |---|---|---|---|---| | ||
| | `pull_request` | No | No | No | No (skip entirely) | | ||
| | `push` to `main` | Yes | Yes | Yes | No | | ||
| | `workflow_dispatch` (dry-run=false) | Yes | Yes | Yes | No | | ||
| | `workflow_dispatch` (dry-run=true) | Yes | Yes | No | Yes | | ||
| | `schedule` (daily cron) | Yes | Yes | Yes | No | | ||
|
|
||
| ## Risks / Trade-offs | ||
|
|
||
| - **[`--require-self=false` removes a safety check]** → Mitigated by `--min-admins 2` flag, which prevents accidental admin removal. The app's installation permissions constrain what can be changed. | ||
| - **[Installation token 1-hour TTL]** → Peribolos runs complete in ~2 minutes for this org size, so the token will not expire mid-run. | ||
| - **[Drift detection may be noisy]** → The detection workflow only opens or updates an issue; it does not auto-remediate. Org admins can triage and decide whether to reapply or update the config. | ||
| - **[Removing ghproxy]** → If the org grows significantly, rate limiting could become relevant. ghproxy can be re-added with proper `--github-endpoint` configuration if needed. | ||
| - **[Upstream Prow dependency]** → The workflow builds Peribolos from source via `kubernetes-sigs/prow`. If the repo is unavailable or the build breaks, all runs fail. Go module caching mitigates transient outages. |
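The exit code masking described in D3 can be reproduced locally. The sketch below is an illustration under stated assumptions, not the apply workflow itself: `fake_peribolos` is a hypothetical stand-in that prints a fatal log line and exits non-zero, and `cat` stands in for `jq`.

```shell
#!/usr/bin/env bash
# Sketch only: fake_peribolos is a hypothetical stand-in for a failing Peribolos run.
fake_peribolos() {
  echo '{"level":"fatal","msg":"Configuration failed: status code 404"}'
  return 1
}

# Without pipefail, the pipeline's status is that of the last command (cat),
# so the upstream failure is hidden.
set +o pipefail
status_without=0
fake_peribolos | cat > /dev/null || status_without=$?

# With pipefail, the pipeline's status is the rightmost non-zero status,
# so the upstream failure propagates.
set -o pipefail
status_with=0
fake_peribolos | cat > /dev/null || status_with=$?

echo "without pipefail: $status_without"
echo "with pipefail: $status_with"
```

Run under bash, this prints `without pipefail: 0` followed by `with pipefail: 1`, which mirrors the runs that reported `success` while Peribolos was failing.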