Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor bpm: generate uid for each process; remove identifier lock; remove blockingBuffer #2918

Merged
merged 11 commits into from
Feb 28, 2022

Conversation

Hexilee
Copy link
Member

@Hexilee Hexilee commented Feb 18, 2022

What problem does this PR solve?

Currently, the bpm locate managed processes by both PID and create time, because the OS may reuse PID, we must check the create time to avoid locating the wrong process.

However, the two-step locating is messy and the create time may be imprecise (we have fixed a relevant bug). In this PR, we generate random UID for each process to avoid id-reusing.

For example, to locate and kill a process, current code is complex:

	p, err := os.FindProcess(int(pid))
	if err != nil {
		log.Error(err, "unreachable path. `os.FindProcess` will never return an error on unix")
		return err
	}

	procState, err := process.NewProcess(int32(pid))
	if err != nil {
		// return successfully as the process has exited
		return nil
	}
	ct, err := procState.CreateTime()
	if err != nil {
		log.Error(err, "fail to read create time")
		// return successfully as the process has exited
		return nil
	}

	// There is a bug in calculating CreateTime in the new version of
	// gopsutils. This is a temporary solution before the upstream fixes it.
	if startTime-ct > 1000 || ct-startTime > 1000 {
		log.Info("process has already been killed", "startTime", ct, "expectedStartTime", startTime)
		// return successfully as the process has exited
		return nil
	}

	ppid, err := procState.Ppid()
	if err != nil {
		log.Error(err, "fail to read parent id")
		// return successfully as the process has exited
		return nil
	}
	if ppid != int32(os.Getpid()) {
		log.Info("process has already been killed", "ppid", ppid)
		// return successfully as the process has exited
		return nil
	}

	err = p.Signal(syscall.SIGTERM)

In this PR, we do that in the following code:

	proc, loaded := m.getProc(uid)
	if !loaded {
		return errors.Errorf("failed to find process with uid %s", uid)
	}

	if err := proc.cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return errors.Wrap(err, "send SIGTERM to process")
	}

Moreover, the identifier lock caused some dead-lock issues, like #2660 and #2661, so we remove the lock, in this PR, applying requests will fail fast by conflicts of identifiers instead of running into dead-locked.

What's changed and how it works?

To keep compatible with v2.0 and v2.1, current we need a mapping between the pid pair(pid and create time) and the UID.

In v2.2, we should store UID in the status of CRDs (like podiochaos, podhttpchaos and stresschaos), then deprecate the instance(PID) and startTime(create time). In future versions like v3, we can remove the instance(PID) and startTime(createTime) from CRDs.

Others

  • replace identifier lock with sync.Map
  • replace blocking buffer with os.Pipe

Related changes

  • Need to update chaos-mesh/website
  • Need to update Dashboard UI
  • Need to cheery-pick to release branches
    • release-2.1
    • release-2.0

Checklist

Tests

  • Unit test
  • E2E test
  • No code
  • Manual test (add steps below)

Side effects

  • Breaking backward compatibility

Release note

Please add a release note.

You can safely ignore this section if you don't think this PR needs a release note.

DCO

If you find the DCO check fails, please run commands like below (Depends on the actual situations. For example, if the failed commit isn't the most recent) to fix it:

git commit --amend --signoff
git push --force

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Feb 18, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Andrewmatilde
  • STRRL

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@Hexilee
Copy link
Member Author

Hexilee commented Feb 18, 2022

/run-e2e-tests

1 similar comment
@Hexilee
Copy link
Member Author

Hexilee commented Feb 18, 2022

/run-e2e-tests

@codecov
Copy link

codecov bot commented Feb 18, 2022

Codecov Report

Merging #2918 (8258385) into master (07508be) will increase coverage by 0.37%.
The diff coverage is 29.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2918      +/-   ##
==========================================
+ Coverage   36.88%   37.25%   +0.37%     
==========================================
  Files         111      110       -1     
  Lines        9487     9365     -122     
==========================================
- Hits         3499     3489      -10     
+ Misses       5666     5564     -102     
+ Partials      322      312      -10     
Impacted Files Coverage Δ
pkg/bpm/metrics.go 0.00% <0.00%> (ø)
pkg/chaosdaemon/httpchaos_server.go 0.00% <0.00%> (ø)
pkg/chaosdaemon/iochaos_server.go 0.00% <0.00%> (ø)
pkg/chaosdaemon/stress_server_linux.go 0.00% <0.00%> (ø)
pkg/bpm/bpm.go 53.92% <60.50%> (+19.87%) ⬆️
pkg/bpm/build_linux.go 34.09% <66.66%> (ø)
pkg/chaosdaemon/server.go 38.09% <100.00%> (+0.59%) ⬆️
.../workflow/controllers/workflow_entry_reconciler.go 48.14% <0.00%> (-4.77%) ⬇️
pkg/workflow/controllers/utils.go 85.71% <0.00%> (-1.59%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 07508be...8258385. Read the comment docs.

Signed-off-by: xixi <i@hexilee.me>
Signed-off-by: xixi <i@hexilee.me>
Signed-off-by: xixi <i@hexilee.me>
Signed-off-by: xixi <i@hexilee.me>
pkg/bpm/bpm.go Show resolved Hide resolved
pkg/bpm/bpm.go Outdated Show resolved Hide resolved
pkg/bpm/bpm.go Outdated Show resolved Hide resolved
pkg/chaosdaemon/httpchaos_server.go Show resolved Hide resolved
@Hexilee
Copy link
Member Author

Hexilee commented Feb 21, 2022

/run-e2e-tests

Signed-off-by: xixi <i@hexilee.me>
Signed-off-by: xixi <i@hexilee.me>
@Hexilee
Copy link
Member Author

Hexilee commented Feb 22, 2022

/run-e2e-tests

1 similar comment
@Hexilee
Copy link
Member Author

Hexilee commented Feb 22, 2022

/run-e2e-tests

Copy link
Member

@Andrewmatilde Andrewmatilde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Hexilee
Copy link
Member Author

Hexilee commented Feb 22, 2022

/run-e2e-tests

@Hexilee Hexilee changed the title refactor bpm: generate uid for each process; remove identifier lock refactor bpm: generate uid for each process; remove identifier lock; remove blockingBuffer Feb 22, 2022
Signed-off-by: xixi <i@hexilee.me>
Signed-off-by: xixi <i@hexilee.me>
Signed-off-by: xixi <i@hexilee.me>
Signed-off-by: xixi <i@hexilee.me>
Copy link
Member

@STRRL STRRL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! ❤️

My only concern is about the process lifecycle managed in BackgroundProcessManager: from starting a process to waiting for a process dead, the logic is splited into several methods. IMO, it would be much helpful if we could split the lifecycle of the process out from the uid things.

@STRRL
Copy link
Member

STRRL commented Feb 28, 2022

If you do not want to do more things in this PR in the future, I think we could merge it. @Hexilee

@Hexilee
Copy link
Member Author

Hexilee commented Feb 28, 2022

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 1138ef8

@Hexilee
Copy link
Member Author

Hexilee commented Feb 28, 2022

LGTM! ❤️

My only concern is about the process lifecycle managed in BackgroundProcessManager: from starting a process to waiting for a process dead, the logic is splited into several methods. IMO, it would be much helpful if we could split the lifecycle of the process out from the uid things.

Do you mean we shouldn't recycle processes by the uid?

@STRRL
Copy link
Member

STRRL commented Feb 28, 2022

LGTM! heart
My only concern is about the process lifecycle managed in BackgroundProcessManager: from starting a process to waiting for a process dead, the logic is splited into several methods. IMO, it would be much helpful if we could split the lifecycle of the process out from the uid things.

Do you mean we shouldn't recycle processes by the uid?

We should recycle the uid, else we would cause a resource leak.

But I find that the path from cmd.Wait() to proc.Stopped() is pretty long. I wonder if there is a way to make it easier. 🤣

@ti-chi-bot ti-chi-bot merged commit 01df458 into chaos-mesh:master Feb 28, 2022
igcherkaev pushed a commit to igcherkaev/chaos-mesh that referenced this pull request Mar 1, 2022
…remove blockingBuffer (chaos-mesh#2918)

* support uuid

Signed-off-by: xixi <i@hexilee.me>

* chaosdaemon fits new bpm

Signed-off-by: xixi <i@hexilee.me>

* save chaosdaemon/server

Signed-off-by: xixi <i@hexilee.me>

* make check

Signed-off-by: xixi <i@hexilee.me>

* add some comments

Signed-off-by: xixi <i@hexilee.me>

* fix stress chaos

Signed-off-by: xixi <i@hexilee.me>

* deprecate stdio lock and blockingbuffer

Signed-off-by: xixi <i@hexilee.me>

* remove buffers in bpm

Signed-off-by: xixi <i@hexilee.me>

* make check

Signed-off-by: xixi <i@hexilee.me>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this pull request Mar 7, 2022
* Support cgroup v2 for linux stress experiments

Signed-off-by: Igor Cherkaev <igor.cherkaev@copart.com>

* Update CHANGELOG.md

Signed-off-by: Igor Cherkaev <igor.cherkaev@copart.com>

* Trigger verify action when helm chart is modified (#2937)

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* ci: fix `upload-image` action  (#2935)

* chore: move fake_image to linux_amd64

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* chore: fix upload image

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* fix: remove deafult project env

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* chore: add FakeImage to fake_image

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* chore: update e2e image name

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* chore: fix e2e test

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* chore: fix e2e test

Signed-off-by: SiyuChen <ryougi201@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* fix(ui): pod phases should be first letter capitalized (#2915)

* fix(ui): pod phases should be first letter capitalized

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: unify label case

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* feat: OpenAPI to TypeScript API Client and Form Data (#2770)

* feat: OpenAPI to TypeScript API Client and Forms

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: license checker

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: go (verify)

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: add comments

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: reuse kubebuilder marks

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update according to RFC

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: prevent ui:form appearing in crds

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: add missing actions

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update lockfile

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: support PhysicalMachineChaos

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: merge commands into codegen

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: license checker

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: ci

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: supplement another ignore check

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update package info

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update README

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: gen HTTPChaos

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update changelog

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: remove nested ignores

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: remove remaining markers

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: typo

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: supplement comments

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* fix(workflow): no more event after accomplished (#2911)

Signed-off-by: imlonghao <git@imlonghao.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* helm: add TTL configuration on value.yaml and update helm docs (#2921)

* helm: update archived data's ttl

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* update helm doc

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* fix comments

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* fix comments

Signed-off-by: cwen0 <cwenyin0@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* fix wrong comment on workflow Dashboard HTTP API (#2947)

* fix wrong comment on workflow Dashboard HTTP API

Signed-off-by: xiang <xiang13225080@163.com>

* update swagger

Signed-off-by: xiang <xiang13225080@163.com>

* api: export one module for a group of api (#2824)

* move v1alpha1 module to api

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* fix: replaced localhost:5000 with ghcr.io as registry name (#2919)

* fix: replaced localhost:5000 with ghcr.io as registry name when building images

Signed-off-by: Rohan Kumar <rohankmr414@gmail.com>

* import fix

Signed-off-by: Rohan Kumar <rohankmr414@gmail.com>

* import fix

Signed-off-by: Rohan Kumar <rohankmr414@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* refactor bpm: generate uid for each process; remove identifier lock; remove blockingBuffer (#2918)

* support uuid

Signed-off-by: xixi <i@hexilee.me>

* chaosdaemon fits new bpm

Signed-off-by: xixi <i@hexilee.me>

* save chaosdaemon/server

Signed-off-by: xixi <i@hexilee.me>

* make check

Signed-off-by: xixi <i@hexilee.me>

* add some comments

Signed-off-by: xixi <i@hexilee.me>

* fix stress chaos

Signed-off-by: xixi <i@hexilee.me>

* deprecate stdio lock and blockingbuffer

Signed-off-by: xixi <i@hexilee.me>

* remove buffers in bpm

Signed-off-by: xixi <i@hexilee.me>

* make check

Signed-off-by: xixi <i@hexilee.me>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* log: add customized logger for daemon and bpm (#2902)

* add customized logger for daemon and bpm

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* extract the grpc metadata into the context

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* remove log parameter in killIOChaos

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add logs

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* remove log parameter in httpchaos server

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add the context comments for process builder

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* use the logger in arguments

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add back boilerplate

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* fix make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add auto compile github action (#2948)

* add auto compile github action

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* modify the name of action

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* remove comment expression

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add comments validation

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* rename the zst file name

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* auto detect buildenv and devenv

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* fix artifact url

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add debug information

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add cache and debug artifact url

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* fix bash script

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* modify cache step name

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* use another docker buildx driver

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add download bash script

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add boilerplate

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add some comments and help message

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* Support cgroup v2 for linux stress experiments

Signed-off-by: Igor Cherkaev <igor.cherkaev@copart.com>

* Update auto-generated code

Signed-off-by: Igor Cherkaev <igor.cherkaev@copart.com>

Co-authored-by: Siyu Chen <ryougi201@gmail.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Co-authored-by: Yue Yang <g1enyy0ung@gmail.com>
Co-authored-by: imlonghao <git@imlonghao.com>
Co-authored-by: CWen <cwenyin0@gmail.com>
Co-authored-by: WangXiang <xiang13225080@163.com>
Co-authored-by: YangKeao <yangkeao@chunibyo.icu>
Co-authored-by: Rohan Kumar <rohankmr414@gmail.com>
Co-authored-by: xixi <i@hexilee.me>
afzal442 pushed a commit to afzal442/chaos-mesh that referenced this pull request Mar 19, 2022
…remove blockingBuffer (chaos-mesh#2918)

* support uuid

Signed-off-by: xixi <i@hexilee.me>

* chaosdaemon fits new bpm

Signed-off-by: xixi <i@hexilee.me>

* save chaosdaemon/server

Signed-off-by: xixi <i@hexilee.me>

* make check

Signed-off-by: xixi <i@hexilee.me>

* add some comments

Signed-off-by: xixi <i@hexilee.me>

* fix stress chaos

Signed-off-by: xixi <i@hexilee.me>

* deprecate stdio lock and blockingbuffer

Signed-off-by: xixi <i@hexilee.me>

* remove buffers in bpm

Signed-off-by: xixi <i@hexilee.me>

* make check

Signed-off-by: xixi <i@hexilee.me>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants