Skip to content

[Bug] _test_pat_against_repo treats transient GitHub failures as invalid PATs (test-query side of #1106) #1331

@Khaostica

Description

@Khaostica

Description

_test_pat_against_repo in gittensor/validator/pat_handler.py collapses every non-200 status and every requests.RequestException into a single permanent error return. Callers (handle_pat_check, handle_pat_broadcast) then treat any error as a permanent invalid-PAT result.

This is the same asymmetry that #1106 documented for the /user (identity) side — fixed by #1107 — but on the sibling GraphQL test-query path. #1106 explicitly scoped itself to /user and did not cover the test-query call, so the gap remains: a GitHub 5xx burst, DNS hiccup, connection reset, or read timeout during the test query is reported to miners as an invalid PAT.

After #1107:

  • Identity-check transient failure → pat_valid = None, retry-oriented message. ✅
  • Test-query transient failure → pat_valid = False, "PAT test query failed: ..." message. ❌

In gitt miner check, the second branch tells the miner their PAT is invalid during a temporary GitHub problem — prompting them to rotate a working PAT.

Steps to Reproduce

  1. Patch requests.post (the one inside gittensor.validator.pat_handler) to return a response with status_code=503.
  2. Store a valid PAT entry for (uid=1, hotkey='hotkey_1', github_id='github_42') via pat_storage.save_pat(...).
  3. Call handle_pat_check with a PatCheckSynapse whose dendrite.hotkey == 'hotkey_1'.
  4. Observe:
synapse.has_pat is True
synapse.pat_valid is False  # ← should be None (inconclusive)
synapse.rejection_reason == 'PAT test query failed: GitHub GraphQL API returned 503'

Same shape applies for requests.Timeout, requests.ConnectionError, and HTTP 500/502/504.

For the broadcast path, the analogous reproduction patches requests.post to raise/return-5xx and calls handle_pat_broadcast with a valid PAT; the broadcast is rejected with "PAT test query failed: ..." even though the PAT itself was never actually evaluated against the GraphQL response.

Expected Behavior

Mirror the convention #1107 established for the identity path. Transient transport failures should be surfaced as inconclusive — not as permanent invalid-PAT results.

handle_pat_check on a transient test-query failure:

synapse.has_pat = True
synapse.pat_valid = None
synapse.rejection_reason = 'GitHub API temporarily unavailable; retry the check in a few minutes.'

pat_valid=None already fits the CLI's _pat_check_row_category() mapping → no_response, not invalid_pat.

handle_pat_broadcast on a transient test-query failure: reject with a retry-oriented message that does not imply the PAT is invalid; do not store the PAT (the existing save_pat call is already gated on a successful test query, so no storage-side behavior change is needed — only the rejection message).

Classification suggestion:

Outcome error transient_failure
200, viewer present None False
HTTP 5xx (500-504) set True
requests.RequestException (timeout, ConnectionError, DNS) set True
HTTP 4xx set False
200 with errors[0] set False
200 with data.viewer == null (scope-less fine-grained PAT) set False

Actual Behavior

_test_pat_against_repo returns Optional[str]None on success, error string otherwise — with no transient distinction. Callers collapse every non-None return into:

  • handle_pat_check: synapse.pat_valid = False, synapse.rejection_reason = f'PAT test query failed: {test_error}'
  • handle_pat_broadcast: rejects with f'PAT test query failed: {test_error}'

Result: during a GitHub partial outage, miners running gitt miner check see their stored PAT marked invalid; miners running gitt miner post to broadcast a fresh PAT see the broadcast rejected with a message that points the finger at the PAT.

Environment

  • OS: any (issue is in validator-side handler logic, OS-independent)
  • Python version: matches project (3.10+)
  • Commit/Version: reproducible against test at the current HEAD (gittensor/validator/pat_handler.py:170-193)

Additional Context

  • Identity-side equivalent: #1106 (closed by #1107).
  • Recent unrelated refactor on the same function: #1164 (consolidated header construction; did not touch error-classification semantics).
  • Suggested fix mirrors GitHubCredentialValidation shape: return a PatTestResult dataclass with error: Optional[str] and transient_failure: bool so callers can branch the same way they do for validation.transient_failure after #1107.
  • Economic impact is UX rather than emissions — miners with already-stored PATs continue earning regardless of gitt miner check output. The harm is misleading miners into rotating working PATs and rejecting fresh broadcasts during transient outages.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions