Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to register grain {Grain} in grain directory #8646

Closed
ximi522 opened this issue Sep 22, 2023 · 5 comments
Closed

Failed to register grain {Grain} in grain directory #8646

ximi522 opened this issue Sep 22, 2023 · 5 comments
Assignees

Comments

@ximi522
Copy link

ximi522 commented Sep 22, 2023

Assume I have two Silos, SiloA and SiloB, with GrainA registered on SiloA. After SiloB accesses GrainA, it retains the address in the LRU cache. Now, the cluster adds SiloC, and after GrainA is deactivated, it is activated again on SiloC. However, SiloB will always use the address in the LRU cache when accessing GrainA, that is, it will access SiloA, causing SiloA to continuously generate warnings of "Failed to register grain {Grain} in grain directory".
We investigated why this old address would always exist in SiloB's LRU cache, and found that when SiloB accesses SiloA but can't find GrainA, it will try to delete the invalid address of GrainA. However, the method called is InvalidateCache(GrainAddress address) instead of InvalidateCache(GrainId grainId). The InvalidateCache(GrainAddress address) method will use the Matches(GrainAddress other) method to determine whether the two addresses are equal,

        /// <summary>
        /// Two grain addresses match if they have equal <see cref="SiloAddress"/> and <see cref="GrainId"/> values
        /// and either one has a default <see cref="ActivationId"/> value or both have equal <see cref="ActivationId"/> values.
        /// </summary>
        /// <param name="other"> The other <see cref="GrainAddress"/> to compare this one with.</param>
        /// <returns> Returns <c>true</c> if the two <see cref="GrainAddress"/> are considered to match.</returns>
        public bool Matches(GrainAddress other)
        {
            return other is not null && _grainId.Equals(other._grainId) && (SiloAddress?.Equals(other.SiloAddress) ?? other.SiloAddress is null)
                && (_activationId.IsDefault || other._activationId.IsDefault || _activationId.Equals(other._activationId));
        }

but the _activationId.Equals(other._activationId) condition in this method always returns false, thus making it impossible to delete the invalid address of GrainA from the LRU cache.We suspect this might be a bug, because if the Grain is accessed frequently, it would unnecessarily waste the performance of the Silo.

@ghost ghost added the Needs: triage 🔍 label Sep 22, 2023
@franckyj
Copy link

I would really like this issue to be fixed. We have a continuous stream of these errors in our production (and others) environment.

Even though we have these warnings, the grains mentionned in the error look like they are OK, since we can query them. But the warnings are polluting our App Insight instance and costing more money in the process :)

The best way of fixing this would be to reproduce the problem with a minimal repo and create a PR for the fix.

@ReubenBond ReubenBond self-assigned this Oct 31, 2023
@ReubenBond
Copy link
Member

Thank you for the analysis, @ximi522, and the bump + offer to help @franckyj. I've discussed this with @benjaminpetit and we agree that a fix is in order.

We discussed a few options:

  1. Switch to InvalidateCache(GrainId), which can also invalidate correct cache entries.
  2. Replace the call to GrainAddress.Match with one which ignores ActivationId.
  3. Change Message.AddToCacheInvalidationHeader so that it strips the ActivationId.
  4. Include the valid GrainAddress when invalidating the cache, to prevent load on the directory when invalidation occurs. This would need to be backwards-compatible.

We decided on option 4, so:

  • Retain the invalid ActivationId for diagnostic purposes.
  • Change message serialization in a backwards-compatible manner to include the address of successfully registered activation.
  • Replace the invalid activation cache entry with the valid activation.

I'll create a PR

@ReubenBond
Copy link
Member

I've opened #8696

@ReubenBond
Copy link
Member

ReubenBond commented Nov 3, 2023

I've also opened #8704 to fix a related issue. I believe the issue is fixed by these PRs. We will prepare a release soon.

@ReubenBond
Copy link
Member

The v7.2.3 release which aims to fix this is now available, so I will close this but please open a new issue and reference this if you still encounter this issue: https://github.com/dotnet/orleans/releases/tag/v7.2.3

@github-actions github-actions bot locked and limited conversation to collaborators Dec 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants