
Fix Ingesters unable to re-join the cluster when unregister_on_shutdown=false + -ingester.heartbeat-period=0 #4366

Merged: 6 commits into cortexproject:master on Aug 3, 2021

Conversation

alanprot
Copy link
Member

What this PR does:

Addresses the problem stated in #4365

Which issue(s) this PR fixes:
Fixes #4365

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@alanprot alanprot changed the title unregister_on_shutdown=false + -ingester.heartbeat-period=0 seems to not be working Fix Ingesters unable to re-join the cluster when unregister_on_shutdown=false + -ingester.heartbeat-period=0 Jul 16, 2021
CHANGELOG.md (review thread resolved, outdated)
pkg/ring/lifecycler.go (review thread resolved)
@@ -588,6 +589,13 @@ func (i *Lifecycler) initRing(ctx context.Context) error {
i.setTokens(tokens)

level.Info(log.Logger).Log("msg", "existing entry found in ring", "state", i.GetState(), "tokens", len(tokens), "ring", i.RingName)

// If we flip the instance from LEAVING to ACTIVE and the heartbeat is disabled, we need to update KV here
if i.cfg.HeartbeatPeriod == 0 && ringDesc.Ingesters[i.ID].State == LEAVING && instanceDesc.State == ACTIVE {
Contributor
Question: are we going to hit the issue #4365 in the case the ingester crashes while in the PENDING or JOINING state in the ring and then restarts? Could you add unit tests to check that too, please?

Member Author

@alanprot alanprot Jul 16, 2021

Sure.

But I think we are fine there, as:

		if instanceDesc.State == JOINING {
			level.Warn(log.Logger).Log("msg", "instance found in ring as JOINING, setting to PENDING",
				"ring", i.RingName)
			instanceDesc.State = PENDING
			return ringDesc, true, nil
		}

We will still hit the issue if the number of tokens changes, though... but that is already the case today.
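The restart behavior being discussed can be modeled as a toy state table (illustrative only; the real initRing works on the full ring descriptor and also compares token counts, which is why a token-count change can still trigger the issue):

```go
package main

import "fmt"

// State is a simplified stand-in for the ring instance state.
type State string

const (
	PENDING State = "PENDING"
	JOINING State = "JOINING"
	ACTIVE  State = "ACTIVE"
	LEAVING State = "LEAVING"
)

// normalizeOnRestart models what happens when a restarted ingester finds
// its own existing entry in the ring: JOINING is reset to PENDING (the
// join never completed, as in the snippet quoted above), and LEAVING is
// flipped back to ACTIVE so the ingester can re-join. PENDING and ACTIVE
// are kept as-is, which is why a crash in those states does not
// reproduce #4365.
func normalizeOnRestart(found State) State {
	switch found {
	case JOINING:
		return PENDING
	case LEAVING:
		return ACTIVE
	default:
		return found
	}
}

func main() {
	for _, s := range []State{PENDING, JOINING, ACTIVE, LEAVING} {
		fmt.Printf("found %s -> restart as %s\n", s, normalizeOnRestart(s))
	}
}
```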

@alanprot alanprot force-pushed the 4365 branch 7 times, most recently from 6f69d75 to e27561c Compare July 16, 2021 19:34
@alanprot
Member Author

Hi @pracucci

Did you have a chance to look at the updated change?

Contributor

@bboreham bboreham left a comment

The change looks reasonable, but I got lost in the test.
I made a few comments, but I think I'm not getting to the crux of what it is doing.
E.g. how many ingesters are there supposed to be?

pkg/ring/lifecycler.go (review thread resolved, outdated)
require.NoError(t, err)
require.NoError(t, services.StartAndAwaitRunning(context.Background(), l2))

pool := func(condition func(*Desc) bool) map[string]InstanceDesc {
Contributor

I didn't get what "pool" means here. Can you pick a more descriptive name, or add a comment saying what this function does?

Member Author

It should be poll instead of pool :P
This function basically waits for a condition and then returns the actual state of the ingesters once the condition succeeds.

pkg/ring/lifecycler_test.go (review thread resolved, outdated)
pkg/ring/lifecycler_test.go (review thread resolved, outdated)
return ingesters
}

startIngesterAndWaitActive := func(lcConfig LifecyclerConfig) *Lifecycler {
Contributor

This function starts one ingester then also waits for two more to be active?
Since starting one ingester is repeated many times, could that be a separate function from the one that waits for two to be active?

Member Author

This function does not need to check ingester1; I did that just in case. Removing it.

But basically this function does what you stated: it starts ingester2 and waits for it to become active.

pkg/ring/lifecycler_test.go (review thread resolved)
pkg/ring/lifecycler.go (review thread resolved)
alanprot and others added 4 commits July 30, 2021 10:10
…tdown=false

Signed-off-by: Alan Protasio <approtas@amazon.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Alan Protasio <approtas@amazon.com>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Alan Protasio <approtas@amazon.com>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Alan Protasio <approtas@amazon.com>
@alanprot alanprot force-pushed the 4365 branch 2 times, most recently from 88353cb to 8111704 Compare July 30, 2021 17:11
Signed-off-by: Alan Protasio <approtas@amazon.com>
Contributor

@bboreham bboreham left a comment

OK, I think I get it now, but a bit more consistency in naming and usage would improve the clarity of the test.

pkg/ring/lifecycler_test.go (review thread resolved, outdated)

// Starts Ingester2 and waits for it to become active
startIngester2AndWaitActive := func(lcConfig LifecyclerConfig) *Lifecycler {
ingester, err := NewLifecycler(lcConfig, &noopFlushTransferer{}, "ingester", IngesterRingKey, true, nil)
Contributor

Why do we create a new Lifecycler here? What happens to the old one in l2?

Contributor

And why is it named ingester? Nearby we have a variable ingesters, which is a map[string]InstanceDesc.

Member Author

Why do we create a new Lifecycler here? What happens to the old one in l2?

I think I tried to reuse it and it did not work. But besides that, in order to simulate a brand-new ingester, I would assume that creating a new instance is the safest thing to do. Makes sense?

pkg/ring/lifecycler_test.go (review thread resolved)
pkg/ring/lifecycler_test.go (review thread resolved)
}

// Starts Ingester2 and waits for it to become active
startIngester2AndWaitActive := func(lcConfig LifecyclerConfig) *Lifecycler {
Contributor

There's nothing specific to Ingester 2 in this function, is there?
That is a consequence of passing a LifecyclerConfig with ID set to "ing2", I think.

Member Author

Done.

Signed-off-by: Alan Protasio <approtas@amazon.com>
Contributor

@bboreham bboreham left a comment

LGTM - sorry it took a while but I think the flow of the test is much clearer now.
Thanks!

Contributor

@pracucci pracucci left a comment

Good job, LGTM!

@pracucci pracucci merged commit 4405f9c into cortexproject:master Aug 3, 2021
@alanprot
Member Author

alanprot commented Aug 3, 2021

Thanks guys! 😄

alvinlin123 pushed a commit to ac1214/cortex that referenced this pull request Jan 14, 2022
…own=false` + `-ingester.heartbeat-period=0` (cortexproject#4366)

* Fix problem when ingester heartbeat is disabled and unregister_on_shutdown=false

Signed-off-by: Alan Protasio <approtas@amazon.com>

* Update CHANGELOG.md

Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Alan Protasio <approtas@amazon.com>

* Update pkg/ring/lifecycler.go

Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Alan Protasio <approtas@amazon.com>

* Update pkg/ring/lifecycler_test.go

Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Alan Protasio <approtas@amazon.com>

* Address comments

Signed-off-by: Alan Protasio <approtas@amazon.com>

* Addressing comments on test

Signed-off-by: Alan Protasio <approtas@amazon.com>

Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Successfully merging this pull request may close these issues.

unregister_on_shutdown=false + -ingester.heartbeat-period=0 seems to not be working.