Problem Statement
After a Push Publishing operation completes on a receiver node, certain public pages intermittently respond with a 401, redirecting anonymous users to
/dotCMS/login?referrer=/. The error does not resolve on its own — it persists until a content save or publish is performed directly on the production
environment, which triggers resetPermissionReferences() for that asset, evicts the bad cache entry, and allows the next request to successfully rebuild the
permission reference. The pages themselves have no configuration changes and their Host-level permissions remain intact throughout.
This issue was identified through a customer report (support ticket #34942). A full reproduction has not yet been achieved. The description below represents
our current best theory based on code analysis and database inspection of the affected environment.
What We Know For Sure
- Push Publishing calls permissionAPI.resetPermissionReferences(contentlet) inside internalCheckin() for every existing HTML page in the bundle. This deletes
each page's row from permission_reference and broadcasts a cluster-wide cache invalidation.
- The affected pages have no direct rows in the permission table. They rely entirely on permission_reference pointing to their Host, which correctly holds CMS
Anonymous / IHTMLPage / READ=1.
- The Host's permissions are never touched by this process.
- When permission_reference is cleared and a request arrives, PermissionBitFactoryImpl.loadPermissions() falls through to _loadParentPermissions() — a walk-up
through the parent hierarchy that should always recover the Host's permissions.
- The result of that walk-up — including an empty list — is written to cache unconditionally at line ~1723, before any validity check.
The Theory
The walk-up in _loadParentPermissions() calls page.getParentPermissionable(), which in Contentlet.java resolves the parent by calling fAPI.find(this.getFolder(),
systemUser, false). If that call returns null at that moment — due to transient load, a folder cache miss during the high-demand push window, or any internal
failure — the walk-up loop never executes and an empty permission list is returned.
Because the cache write at line ~1723 is unconditional, that empty list is stored in the permission cache. Every subsequent request for the affected page hits
the cache immediately, gets [], and CMSUrlUtil.isUnauthorizedAndHandleError() fires a 401 for CMS Anonymous. This persists until the cache entry is evicted and a
subsequent walk-up succeeds.
This is unconfirmed. We have not been able to reproduce it locally, and we have not ruled out other causes.
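The suspected sequence can be illustrated with a small toy model. This is only a sketch of the theory, not dotCMS code: the class, the fields, and the string-based permissions are all hypothetical stand-ins, and the real lookup and cache are far more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the suspected failure. None of these names are dotCMS classes. */
public class PoisonedCacheSketch {

    // Stand-in for the permission cache, keyed by asset id (hypothetical).
    private final Map<String, List<String>> cache = new HashMap<>();

    // When true, the parent lookup returns null, simulating a transient folder lookup failure.
    public boolean folderLookupFails = false;

    // Counts how many times the walk-up actually ran.
    public int walkUps = 0;

    /** Simplified walk-up: collects the Host's permission if a parent can be resolved. */
    public List<String> walkUp(String pageId) {
        walkUps++;
        List<String> found = new ArrayList<>();
        String parent = folderLookupFails ? null : "host";
        while (parent != null) {
            found.add("CMS Anonymous:READ");
            parent = null; // the Host is the top of this toy hierarchy
        }
        return found; // empty when the parent could not be resolved
    }

    /** Mirrors the unconditional cache write described above. */
    public List<String> loadPermissions(String pageId) {
        List<String> cached = cache.get(pageId);
        if (cached != null) {
            return cached;
        }
        List<String> result = walkUp(pageId);
        cache.put(pageId, result); // written even when result is empty
        return result;
    }

    /** Stand-in for the eviction that a save or publish triggers. */
    public void evict(String pageId) {
        cache.remove(pageId);
    }

    public static void main(String[] args) {
        PoisonedCacheSketch sketch = new PoisonedCacheSketch();
        sketch.folderLookupFails = true;
        System.out.println("during failure: " + sketch.loadPermissions("page-1"));
        sketch.folderLookupFails = false;
        System.out.println("after recovery: " + sketch.loadPermissions("page-1"));
        sketch.evict("page-1");
        System.out.println("after eviction: " + sketch.loadPermissions("page-1"));
    }
}
```

In this model the second call still returns the empty list even though the lookup has recovered, because the walk-up is never retried until the entry is evicted. That matches the observed symptom of the 401 persisting until a save or publish on the receiver.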
Proposed Fix
In PermissionBitFactoryImpl.loadPermissions() (~line 1723), guard the cache write so empty results are never persisted:
// Before
permissionCache.addToPermissionCache(permissionKey, permissionList);
// After
if (!permissionList.isEmpty()) {
    permissionCache.addToPermissionCache(permissionKey, permissionList);
}
If the theory is correct, this prevents a failed walk-up from being written to cache as a permanent empty result. If the theory is incorrect, the change is still
a safe defensive improvement — caching an empty permission list after a walk-up is never semantically valid under normal conditions, as every asset in dotCMS
resolves to the System Host which always carries inheritable permissions. The only downside is that a failed walk-up is retried on the next request instead of
being served from cache.
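The effect of the guard can be shown in isolation with a minimal, self-contained sketch. The class and method names below (GuardedCacheSketch, getOrLoad, isCached) are assumptions made for illustration and do not exist in dotCMS; the loader argument stands in for the walk-up.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical cache wrapper that refuses to store empty results. */
public class GuardedCacheSketch {

    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    /**
     * Returns the cached list if present; otherwise runs the loader and
     * caches its result only when that result is non-empty.
     */
    public List<String> getOrLoad(String key, Supplier<List<String>> loader) {
        List<String> cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        List<String> loaded = loader.get();
        if (!loaded.isEmpty()) {
            cache.put(key, loaded); // same guard as the proposed fix
        }
        return loaded;
    }

    public boolean isCached(String key) {
        return cache.containsKey(key);
    }

    public static void main(String[] args) {
        GuardedCacheSketch guarded = new GuardedCacheSketch();
        // First load fails (empty): nothing is cached, so the next call retries.
        guarded.getOrLoad("page-1", () -> List.of());
        System.out.println("cached after empty load: " + guarded.isCached("page-1"));
        guarded.getOrLoad("page-1", () -> List.of("CMS Anonymous:READ"));
        System.out.println("cached after good load: " + guarded.isCached("page-1"));
    }
}
```

With this shape, a failed load is visible to the caller for that one request, but it cannot outlive the transient condition that caused it.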
Steps to Reproduce
This bug has not been reproduced locally and is expected to be very difficult to reproduce in a controlled environment. Here is why:
The failure depends on a race condition between two concurrent events on the receiver node:
- Push Publishing clearing a page's permission_reference entry via resetPermissionReferences()
- An incoming HTTP request for that same page hitting loadPermissions() at the exact moment fAPI.find(folderId) fails to resolve the parent folder — likely due
to high load, a cache miss, or a transient DB state during the push window
Reproducing this locally would require:
- A multi-node cluster setup with the receiver node under realistic production load
- A bundle large enough to stress the async permission_reference rebuild thread pool (poolSize=1, maxPoolSize=5, queueCapacity=10000)
- Precise timing of an anonymous request to a page whose folder lookup transiently fails during the push
In practice, this combination appears to occur rarely: the customer's most recent occurrence came approximately 2 months after the previous one.
The fix is proposed on defensive correctness grounds, not on a confirmed reproduction. Caching an empty permission list after a walk-up is never semantically
valid: every asset in dotCMS resolves through the parent hierarchy to the System Host, which always carries inheritable permissions. Refusing to cache an empty
result means the walk-up is repeated on the next request until it returns a non-empty list; no other side effects are expected.
Acceptance Criteria
- PermissionBitFactoryImpl.loadPermissions() does not write an empty permission list to the permission cache when _loadParentPermissions() returns no results
- When a walk-up fails to find a parent (e.g., getParentPermissionable() returns null), the next request for that page retries the walk-up instead of being
served a cached empty result
- Pages with valid Host-level inheritable permissions for CMS Anonymous remain publicly accessible during and after a Push Publishing operation
- A unit test covers the scenario where _loadParentPermissions() returns empty: the result must not be written to cache
- An integration test or manual test verifies that a Push Publishing operation to a receiver node does not cause 401s on pages that inherit permissions from the
Host
- No regression in scenarios where an asset genuinely has no inheritable permissions — those cases should still be handled correctly (walk-up reruns on next
request rather than being permanently cached)
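The unit-test criterion above could take roughly the following shape. This is a sketch under stated assumptions: RecordingCache and load are hypothetical stand-ins for the permission cache and the guarded load path, and plain Java checks are used in place of whatever test framework and mocking setup the project actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Sketch of the two test cases; all names here are hypothetical. */
public class EmptyWalkUpTestSketch {

    /** Fake cache that records every key written to it. */
    public static class RecordingCache {
        public final List<String> writes = new ArrayList<>();

        public void add(String key, List<String> permissions) {
            writes.add(key);
        }
    }

    /** Stand-in for the guarded load path under test. */
    public static List<String> load(String key, Supplier<List<String>> walkUp, RecordingCache cache) {
        List<String> result = walkUp.get();
        if (!result.isEmpty()) {
            cache.add(key, result);
        }
        return result;
    }

    /** An empty walk-up result must be returned but never written to cache. */
    public static void emptyWalkUpIsNotCached() {
        RecordingCache cache = new RecordingCache();
        List<String> result = load("page-1", () -> List.of(), cache);
        if (!result.isEmpty() || !cache.writes.isEmpty()) {
            throw new AssertionError("empty walk-up result was cached");
        }
    }

    /** A non-empty walk-up result must still be cached exactly once. */
    public static void nonEmptyWalkUpIsCached() {
        RecordingCache cache = new RecordingCache();
        load("page-1", () -> List.of("CMS Anonymous:READ"), cache);
        if (cache.writes.size() != 1) {
            throw new AssertionError("non-empty walk-up result was not cached once");
        }
    }

    public static void main(String[] args) {
        emptyWalkUpIsNotCached();
        nonEmptyWalkUpIsCached();
        System.out.println("both checks passed");
    }
}
```

The second case guards against an over-broad fix that stops caching altogether, which would satisfy the first check while regressing normal lookups.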
dotCMS Version
latest in main
Severity
Medium - Some functionality impacted
Links
https://helpdesk.dotcms.com/a/tickets/34942