Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix double stop components #3482

Merged
merged 6 commits into from
Oct 23, 2023
Merged

Conversation

pchila
Copy link
Contributor

@pchila pchila commented Sep 28, 2023

What does this PR do?

This PR prevents agent from stopping already stopped components.

Why is it important?

Calling stop on an already stopped component may have unforeseen effects especially if uninstall is involved

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila pchila added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Sep 28, 2023
@pchila pchila self-assigned this Sep 28, 2023
@mergify
Copy link
Contributor

mergify bot commented Sep 28, 2023

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@elasticmachine
Copy link
Collaborator

elasticmachine commented Sep 28, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-10-23T08:07:57.716+0000

  • Duration: 27 min 18 sec

Test stats 🧪

Test Results
Failed 0
Passed 6577
Skipped 59
Total 6636

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

elasticmachine commented Sep 28, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.824% (84/85) 👍
Files 67.105% (204/304) 👍
Classes 66.19% (370/559) 👍
Methods 53.583% (1174/2191) 👍
Lines 39.804% (13775/34607) 👍 0.037
Conditionals 100.0% (0/0) 💚

@pchila pchila marked this pull request as ready for review September 28, 2023 16:02
@pchila pchila requested a review from a team as a code owner September 28, 2023 16:02
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

s.log.Debugf("service %s is already stopping: skipping...")
return
}
stopping = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That this is never set to false doesn't seem right. What would happen if we stopped the service and then tried to start it again? This would be uninstalling the endpoint integration and then reinstalling it quickly, or perhaps reassigning the agent to a different policy.

The ignoreCheckins on this path is set to false again in the case actionStart: block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When discussing the fix with @blakerouse , I asked about recycling a service runtime object and we came to the conclusion that a stopped component is ultimately discarded (removed from the maps of running components in the runtime manager) after it eventually stops.

If this assumption does not hold, then we need to keep track of the installation status and if the object gets recycled we need to wait for the ongoing uninstall to finish, then reinstall it (it is done as part of the start()) and set up the component back properly.

@blakerouse any thoughts ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is code in this file that would only be executed if the object is not discarded, so at minimum the behavior needs to be confirmed and then documented. It may be that your and Blake's analysis is correct, but it isn't obvious to myself and others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is correct. Once a manager is stopped it is never restarted again. This follows the same pattern - https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/runtime.go#L190; once set it is not unset.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the case we should add a comment explaining this.

@pchila
Copy link
Contributor Author

pchila commented Sep 29, 2023

buildkite test this

@@ -705,6 +705,12 @@ func (m *Manager) update(model component.Model, teardown bool) error {
var stoppedWg sync.WaitGroup
stoppedWg.Add(len(stop))
for _, existing := range stop {
if existing.getLatest().State == client.UnitStateStopped {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of making this check here this could be changed to use the atomic that the componentRuntimeState is already setting.

s.shuttingDown.Store(true)

I think change line 190 in runtime.go to check if already set and do nothing would result in the same behavior. Removing the need to hold the lock that is being held when getLatest() is being called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in b233070

stoppedWg.Done()
continue
}
m.logger.Debugf("Stopping component %q", existing.id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do it here? Why not do it inside of existing.stop(? Seems safer, that way if the code is updated some where else to call existing.stop( the same logic would apply. Seems weird that it would be checked outside of the function that sets it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are referring to the log, it's there because that's where I landed when investigating the bug initially.
If you are referring to the stop() itself, the initial fix was only here in the manager without any modification to the service runtime object.
Would you prefer to have only for a service runtime object ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pchila No I am referring to this line function https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/runtime.go#L189 that is being called from https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/manager.go#L708 (which is what your changing here).

This runtime wraps the actual managers. So this change will apply to all managers.

I thinking the function could easily be changed to:

func (s *componentRuntimeState) stop(teardown bool, signed *component.Signed) error {
	if s.shuttingDown.Load() {
                // already stopping
		return nil
        }
	s.shuttingDown.Store(true)
	if teardown {
		return s.runtime.Teardown(signed)
	}
	return s.runtime.Stop()
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 2606c30

@mergify
Copy link
Contributor

mergify bot commented Oct 5, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-double-stop-components upstream/fix-double-stop-components
git merge upstream/main
git push upstream fix-double-stop-components

Copy link
Member

@AndersonQ AndersonQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you could add a test?

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for all the changes.

Comment on lines +190 to +193
if s.shuttingDown.Load() {
// already stopping
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can shuttingDown be set to false between this Load and the following Store?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from a component point of view it should be guarded, but from agent architecture standpoint it should not be concurrent and should be safe to have it as it is

@pchila
Copy link
Contributor Author

pchila commented Oct 17, 2023

Any chance you could add a test?

@AndersonQ I thought of that but for this precise timing I need a unit or a "narrow" integration test.
If the fix was in the runtime manager as it was initially, that could have been maybe easier. For service runtime I will have to check if I can figure something out without big refactor just for testing.

@pchila
Copy link
Contributor Author

pchila commented Oct 17, 2023

buildkite test this

@elastic-sonarqube
Copy link

@pchila pchila merged commit 97d9c80 into elastic:main Oct 23, 2023
21 checks passed
mergify bot pushed a commit that referenced this pull request Oct 23, 2023
* Skip stopping already stopped components

(cherry picked from commit 97d9c80)
pchila added a commit that referenced this pull request Oct 24, 2023
* Skip stopping already stopped components

(cherry picked from commit 97d9c80)

Co-authored-by: Paolo Chilà <paolo.chila@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.11.0 bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants