Skip to content

Issue 35680 intermittent 401 on public pages after push publishing may be caused by empty permission list being cached during permission reference rebuild#35684

Merged
freddyDOTCMS merged 6 commits into
mainfrom
issue-35680-intermittent-401-on-public-pages-after-Push-Publishing-may-be-caused-by-empty-permission-list-being-cached-during-permission_reference-rebuild
May 13, 2026
Merged

Issue 35680 intermittent 401 on public pages after push publishing may be caused by empty permission list being cached during permission reference rebuild#35684
freddyDOTCMS merged 6 commits into
mainfrom
issue-35680-intermittent-401-on-public-pages-after-Push-Publishing-may-be-caused-by-empty-permission-list-being-cached-during-permission_reference-rebuild

Conversation

@freddyDOTCMS
Copy link
Copy Markdown
Member

@freddyDOTCMS freddyDOTCMS commented May 12, 2026

Summary

  • Guard the permissionCache.addToPermissionCache() call in PermissionBitFactoryImpl.loadPermissions() to skip caching when the walk-up returns an empty permission list
  • Add integration test to PermissionBitFactoryImplTest verifying that an empty walk-up result is not written to cache

Context

During Push Publishing, resetPermissionReferences() is called for every existing HTML page in the bundle. This deletes each page's permission_reference row and broadcasts a cluster-wide cache invalidation. When a request arrives before the reference is rebuilt, loadPermissions() falls through to the walk-up (_loadParentPermissions()), which calls page.getParentPermissionable() to traverse the parent hierarchy.

If getParentPermissionable() returns null at that instant (transient DB state, folder cache miss under load), the walk-up loop never executes and returns an empty list. The previous code cached that empty result unconditionally — causing every subsequent request for the page to be served [] from cache, bypassing the walk-up entirely, and resulting in a persistent 401 for anonymous users until the next resetPermissionReferences() evicted the entry.

The fix refuses to cache empty walk-up results. If the walk-up fails transiently, the next request retries it rather than being served a stale empty result.

Checklist

  • Tests
  • Translations
  • Security Implications Contemplated — this fix prevents intermittent anonymous access denial (401) caused by a transient permission cache poisoning during Push Publishing

This PR fixes: #35680

This PR fixes: #35680

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 12, 2026

Claude finished @freddyDOTCMS's task in 1m 55s —— View job


PR Review — current state

  • Read the diff
  • Inspect surrounding code
  • Compile findings

Scoping the review to the PR as it stands (narrow guard at loadPermissions). Not relitigating the cache-layer move you already declined.

Worth a second look

1. Comment overstates the pre-fix behaviorPermissionBitFactoryImpl.java:1723

// Skip caching empty results — a failed walk-up must be retried, not persisted as a permanent 401.

It wasn't permanent. PermissionCacheImpl.addToPermissionCache (line 57-61) wraps empty lists in EmptyPermissionsCacheEntry with EMPTY_PERMISSIONS_TTL (default 300s, PermissionCacheImpl.java:36-37), and getPermissionsFromCache evicts it on read once expired (line 82-87). So pre-fix this was a 5-minute 401 per affected page, not permanent. The fix is still strictly better (next read retries), but the rationale in the comment should reflect what was actually happening — otherwise the next maintainer will find the EmptyPermissionsCacheEntry machinery, conclude the comment is wrong, and possibly revert. Suggested: // Empty walk-up is transient bad state; don't seed the negative cache — let the next read retry.

2. The line-814 caller can still poison the cachePermissionBitFactoryImpl.java:807-816

bitPermissionsList = loadPermissions( permissionable );
permissionCache.addToPermissionCache( permissionable.getPermissionId(), bitPermissionsList );

loadPermissions still returns an empty list when the walk-up fails (line 1750) — the PR only skipped the internal cache write. Any flow that warms via addPermissionsToCache (e.g. cache-warm paths, certain admin paths) will still negative-cache the empty result via EmptyPermissionsCacheEntry for 5 min. Not a blocker given your scoping decision, but worth a tracking comment in the code so someone doesn't have to rediscover this if 35680 recurs.

3. Test will leak an orphan permission_reference rowPermissionBitFactoryImplTest.java:140

After getPermissions(orphan) returns, Step 4 in loadPermissions submits upsertPermissionReferences(permissionable, type, systemHost) asynchronously (default PERMISSION_REFERENCES_UPDATE_ASYNC=true, line 1734-1742). That writes a row keyed by your random UUID into permission_reference with no cleanup. Two fixes:

  • Set Config.setProperty("PERMISSION_REFERENCES_UPDATE_ASYNC", false) in @Before / reset in @After, OR
  • Delete the row by getPermissionId() in a finally.

Either is a few lines and stops slow DB pollution across CI runs.

4. Test asymmetry — no positive assertion

The test asserts the empty case isn't cached but never asserts the non-empty walk-up still is cached. A future change that "fixes" the guard too aggressively (e.g. someone moves the check up and accidentally short-circuits a non-empty result) would pass this test. Add a second case that creates a permissionable whose walk-up returns a real list and asserts cache.getPermissionsFromCache(id) is non-null.

Nits

  • assertEquals(Collections.emptyList(), perms) over assertTrue("...", perms.isEmpty()) — clearer failure message. (PermissionBitFactoryImplTest.java:143)
  • The orphan permissionable's getPermissionType() returns IHTMLPage.class.getCanonicalName() but the object is not an IHTMLPage instance, so resolvePermissionType (line 1836) takes the getPermissionType() fallback. Works, just a touch confusing — a comment or using Folder.class.getCanonicalName() for clarity would help.

Not flagging (already settled in thread)

  • Moving the guard into PermissionCacheImpl / removing EmptyPermissionsCacheEntry — agreed with your call to keep this PR narrow.
  • PR title attribution to permission_reference rebuild — that's the most likely trigger; root cause is any transient getParentPermissionable() == null. Fine as-is.

Net

The fix's direction is correct and minimal. The four items above are the ones I'd want addressed before merge — (1) and (3) are the most material (misleading comment, DB pollution), (2) and (4) are insurance against re-regression.

--- · branch ·

@freddyDOTCMS freddyDOTCMS marked this pull request as draft May 12, 2026 20:16
@freddyDOTCMS
Copy link
Copy Markdown
Member Author

1. There is already a negative-cache TTL mechanism that this PR bypasses rather than uses.

PermissionCacheImpl.addToPermissionCache (lines 57‑67) does not store empty lists as permissions = []. It wraps them in an EmptyPermissionsCacheEntry with EMPTY_PERMISSIONS_TTL (default 300s), and getPermissionsFromCache evicts the entry once expired. So the pre-fix behavior was already "negative cache the empty result for 5 min, then re-evaluate" — not the "permanent 401" the new code comment claims. After this PR the negative cache is effectively dead for the walk-up path, and re-evaluation happens on every request.

That changes the trade-off:

  • The intermittent 401 will clear on the next request instead of waiting up to 5 minutes — good.
  • Permissionables that legitimately have no inheritable permissions of their type will now redo the full Step‑3 walk-up + async upsertPermissionReferences on every read. Under load that's a measurable regression for the negative-result path the TTL existed to absorb.

Two cheaper alternatives worth considering:

  • Lower EMPTY_PERMISSIONS_TTL to a few seconds (or 0) only during/just-after PP rebuild, and invalidate affected entries in the rebuild path. This targets the root cause (stale cache after permission_reference rebuild) without giving up negative caching globally.
  • Push the isEmpty() skip into PermissionCacheImpl.addToPermissionCache and delete the unused EmptyPermissionsCacheEntry — at least the intent is in one place and there's no dead code.

@claude take account in your review thatin dotCMS every asset resolves through the hierarchy to the System Host which always carries
inheritable permissions — an empty result from _loadParentPermissions in a healthy environment is essentially theoretical

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 12, 2026

Claude finished @freddyDOTCMS's task in 1m 16s —— View job


Re-review given the "System Host always has inheritable perms" invariant

You're right — that invariant materially changes the analysis. Conceding where my prior review was off and tightening the parts that still stand.

Revised: Major #1 — negative-cache regression

Largely withdrawn. If the walk-up loop in _loadParentPermissions (PermissionBitFactoryImpl.java:1813-1819) is guaranteed to bottom out at System Host and System Host always carries inheritable perms of the relevant type, then an empty return from the walk-up is not a "legitimate negative result we want to absorb." It's a signal of transient bad state. Negative-caching it for 5 minutes via EmptyPermissionsCacheEntry was always wrong for this path — it just turned a flicker into a 5‑minute outage. So "we're losing useful negative caching" doesn't really apply here; we're losing only the negative caching of broken state, which is the whole point.

The one residual question is whether _loadParentPermissions can ever return empty in healthy code (not data): e.g., a contenttype/category/template whose getParentPermissionable() chain doesn't walk to a Host. If those exist, they'd now re-do Step‑3 + the async upsertPermissionReferences on every read. Quick grep of getParentPermissionable() implementations would confirm — if every concrete Permissionable either returns a parent or is System Host itself, point #1 is fully moot.

Still standing: Major #2 — line 814 inconsistency

This one isn't resolved by the invariant. addPermissionsToCache (PermissionBitFactoryImpl.java:807-816) does:

bitPermissionsList = loadPermissions( permissionable );
permissionCache.addToPermissionCache( permissionable.getPermissionId(), bitPermissionsList );

If loadPermissions hits the same transient walk-up failure and returns [] (which it now can, since the PR removed the cache write but not the empty return), this caller will still negative-cache the empty list via EmptyPermissionsCacheEntry — re-introducing the exact same 5‑minute poison for any flow that warms via addPermissionsToCache (cache warmup, certain admin paths).

Given the invariant ("empty walk-up = broken state, never desired"), the cleanest expression is to push the guard into the cache so it can't be reintroduced:

// PermissionCacheImpl.addToPermissionCache
protected List<Permission> addToPermissionCache(final String key, List<Permission> permissions) {
    if (permissions == null || permissions.isEmpty()) {
        return Collections.emptyList();   // refuse to poison cache with empty walk-up
    }
    cache.put(primaryGroup + key, permissions, primaryGroup);
    return permissions;
}

Then delete EmptyPermissionsCacheEntry + EMPTY_PERMISSIONS_TTL — they only exist to soften a misfeature that the invariant says shouldn't happen. The guard at line 1724 becomes redundant but harmless. Fix this →

Still standing: Major #3 — comment + root cause

  • The new comment at line 1723 ("must be retried, not persisted as a permanent 401") is now more defensible given the invariant — but call it what it is: an empty walk-up is transient bad state, retry-on-next-read is the recovery path. Wording suggestion: // Empty walk-up is always transient (System Host carries inheritable perms); don't cache the bad state — let the next read retry.
  • The PR title still attributes the cause to "permission_reference rebuild." Under the invariant, this is just the most common trigger; a transient getParentPermissionable() == null from any cause (folder cache miss, DB hiccup mid-traversal) would produce the same poisoning. The fix correctly addresses the cache write, not the rebuild path — which is the right scope, but a one-line note in the commit that we deliberately did not touch resetPermissionReferences() would help future readers.

Still standing minor points (unaffected by the invariant)

  • Test cleanup: anonymous Permissionable with random UUID will trigger upsertPermissionReferences async write to permission_reference — add a finally cleanup, or set PERMISSION_REFERENCES_UPDATE_ASYNC=false in the test.
  • Test asymmetry: assert both directions — non-empty walk-up is cached, empty is not.
  • Nit: assertEquals(Collections.emptyList(), perms) over assertTrue(... perms.isEmpty()) for a better failure message.

Net

With the invariant in mind, the fix's direction is unambiguously correct and the negative-cache concern largely dissolves. The one thing I'd still push to land is moving the guard into PermissionCacheImpl.addToPermissionCache so the line‑814 path can't quietly reintroduce the bug — and so the dead EmptyPermissionsCacheEntry machinery goes away with it.

--- · branch

@freddyDOTCMS
Copy link
Copy Markdown
Member Author

Claude finished @freddyDOTCMS's task in 1m 16s —— View job

Re-review given the "System Host always has inheritable perms" invariant

You're right — that invariant materially changes the analysis. Conceding where my prior review was off and tightening the parts that still stand.

Revised: Major #1 — negative-cache regression

Largely withdrawn. If the walk-up loop in _loadParentPermissions (PermissionBitFactoryImpl.java:1813-1819) is guaranteed to bottom out at System Host and System Host always carries inheritable perms of the relevant type, then an empty return from the walk-up is not a "legitimate negative result we want to absorb." It's a signal of transient bad state. Negative-caching it for 5 minutes via EmptyPermissionsCacheEntry was always wrong for this path — it just turned a flicker into a 5‑minute outage. So "we're losing useful negative caching" doesn't really apply here; we're losing only the negative caching of broken state, which is the whole point.

The one residual question is whether _loadParentPermissions can ever return empty in healthy code (not data): e.g., a contenttype/category/template whose getParentPermissionable() chain doesn't walk to a Host. If those exist, they'd now re-do Step‑3 + the async upsertPermissionReferences on every read. Quick grep of getParentPermissionable() implementations would confirm — if every concrete Permissionable either returns a parent or is System Host itself, point #1 is fully moot.

Still standing: Major #2 — line 814 inconsistency

This one isn't resolved by the invariant. addPermissionsToCache (PermissionBitFactoryImpl.java:807-816) does:

bitPermissionsList = loadPermissions( permissionable );
permissionCache.addToPermissionCache( permissionable.getPermissionId(), bitPermissionsList );

If loadPermissions hits the same transient walk-up failure and returns [] (which it now can, since the PR removed the cache write but not the empty return), this caller will still negative-cache the empty list via EmptyPermissionsCacheEntry — re-introducing the exact same 5‑minute poison for any flow that warms via addPermissionsToCache (cache warmup, certain admin paths).

Given the invariant ("empty walk-up = broken state, never desired"), the cleanest expression is to push the guard into the cache so it can't be reintroduced:

// PermissionCacheImpl.addToPermissionCache
protected List<Permission> addToPermissionCache(final String key, List<Permission> permissions) {
    if (permissions == null || permissions.isEmpty()) {
        return Collections.emptyList();   // refuse to poison cache with empty walk-up
    }
    cache.put(primaryGroup + key, permissions, primaryGroup);
    return permissions;
}

Then delete EmptyPermissionsCacheEntry + EMPTY_PERMISSIONS_TTL — they only exist to soften a misfeature that the invariant says shouldn't happen. The guard at line 1724 becomes redundant but harmless. Fix this →

Still standing: Major #3 — comment + root cause

  • The new comment at line 1723 ("must be retried, not persisted as a permanent 401") is now more defensible given the invariant — but call it what it is: an empty walk-up is transient bad state, retry-on-next-read is the recovery path. Wording suggestion: // Empty walk-up is always transient (System Host carries inheritable perms); don't cache the bad state — let the next read retry.
  • The PR title still attributes the cause to "permission_reference rebuild." Under the invariant, this is just the most common trigger; a transient getParentPermissionable() == null from any cause (folder cache miss, DB hiccup mid-traversal) would produce the same poisoning. The fix correctly addresses the cache write, not the rebuild path — which is the right scope, but a one-line note in the commit that we deliberately did not touch resetPermissionReferences() would help future readers.

Still standing minor points (unaffected by the invariant)

  • Test cleanup: anonymous Permissionable with random UUID will trigger upsertPermissionReferences async write to permission_reference — add a finally cleanup, or set PERMISSION_REFERENCES_UPDATE_ASYNC=false in the test.
  • Test asymmetry: assert both directions — non-empty walk-up is cached, empty is not.
  • Nit: assertEquals(Collections.emptyList(), perms) over assertTrue(... perms.isEmpty()) for a better failure message.

Net

With the invariant in mind, the fix's direction is unambiguously correct and the negative-cache concern largely dissolves. The one thing I'd still push to land is moving the guard into PermissionCacheImpl.addToPermissionCache so the line‑814 path can't quietly reintroduce the bug — and so the dead EmptyPermissionsCacheEntry machinery goes away with it.

--- · branch

@claude really we are not completly sure about what cause the problem so this is why is better no chance the return value for now

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 13, 2026

Claude finished @freddyDOTCMS's task in 28s —— View job


Acknowledged — keeping the change minimal is the right call

Fair point, and I think it's the right call given the uncertainty. Conceding the suggestion.

Why your reasoning holds up:

The proposal to move the guard into PermissionCacheImpl.addToPermissionCache and delete EmptyPermissionsCacheEntry + EMPTY_PERMISSIONS_TTL rests on a strong claim: an empty permission list is always transient bad state and should never be cached. That claim is only safe if you're confident:

  1. The walk-up always reaches System Host in healthy code (the invariant you cited)
  2. And no other code path legitimately stores an empty list via addToPermissionCache for reasons unrelated to the walk-up

Point 2 is the one we haven't verified. addToPermissionCache is called from at least three sites in PermissionBitFactoryImpl (lines 814, 1725, 1782) plus potentially elsewhere. Without auditing each caller, "refuse all empty writes at the cache layer" could silently break a path we haven't characterized — and since the production symptom is intermittent and was hard to reproduce, a regression there would be similarly hard to spot.

The current PR's scoping — guard only at line 1725, the one site we have evidence is causing the bug — is the conservative move. If the line-814 path turns out to also be implicated, that can be a follow-up PR with its own evidence. Tearing out EmptyPermissionsCacheEntry is a bigger refactor that needs its own justification independent of this fix.

Net: the PR as it stands is defensible — narrow fix targeting the observed symptom, leaves the broader cache machinery alone until we understand it better. Approving the scoping decision.

The minor test points (cleanup of the orphan permission_reference row, asserting non-empty is cached, assertEquals(emptyList, perms) over assertTrue(perms.isEmpty())) are still worth picking up since they're cheap and local, but they're nits, not blockers.

--- · branch

…-after-Push-Publishing-may-be-caused-by-empty-permission-list-being-cached-during-permission_reference-rebuild
@freddyDOTCMS freddyDOTCMS marked this pull request as ready for review May 13, 2026 16:04
@freddyDOTCMS freddyDOTCMS enabled auto-merge May 13, 2026 16:06
@freddyDOTCMS freddyDOTCMS added this pull request to the merge queue May 13, 2026
Merged via the queue into main with commit 9904446 May 13, 2026
51 checks passed
@freddyDOTCMS freddyDOTCMS deleted the issue-35680-intermittent-401-on-public-pages-after-Push-Publishing-may-be-caused-by-empty-permission-list-being-cached-during-permission_reference-rebuild branch May 13, 2026 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Projects

Status: No status

2 participants