Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the issue that io delay injected does not take effect #121

Merged
merged 9 commits into from Jan 13, 2020

Conversation

cwen0
Copy link
Member

@cwen0 cwen0 commented Jan 10, 2020

Signed-off-by: cwen0 cwenyin0@gmail.com

What problem does this PR solve?

  • Fix the issue that io delay injected does not take effect
  • Fix the config watcher can't restart
  • Update IOChaos document
  • Use the data directory as the mountpoint of the fuse server
  • No longer allow to inject chaosfs container dynamically and only support inject chaosfs container before the application created.
    If the application was created, the data directory of the application was not empty, we can't use 'fusermount' to mount it and we must rename it to fuse-data, but If we rename the data directory, there is a risk that the application will not start properly because the data directory cannot be found when the pod was restarted.

Check List

Tests

  • Manual test (add detailed scripts or steps below)
2020-01-10T15:56:31.578Z        INFO    fuse-server     Inject fault    {"context": {}}
2020-01-10T15:56:32.579Z        INFO    fuse-server     Inject fault    {"method": "release", "path": "last_tikv.toml"}
2020-01-10T15:56:32.579Z        INFO    fuse-server     Inject fault    {"context": {}}
2020-01-10T15:56:32.580Z        INFO    fuse-server     Inject fault    {"method": "getattr", "path": "MAILPATH"}
2020-01-10T15:56:32.580Z        INFO    fuse-server     Inject fault    {"context": {}}
2020-01-10T15:56:38.694Z        INFO    fuse-server     Inject fault    {"method": "getattr", "path": "snap"}
2020-01-10T15:56:38.694Z        INFO    fuse-server     Inject fault    {"context": {}}
2020-01-10T15:56:39.696Z        INFO    fuse-server     Inject fault    {"method": "opendir", "path": "snap"}
2020-01-10T15:56:39.696Z        INFO    fuse-server     Inject fault    {"context": {}}
2020-01-10T15:56:40.706Z        INFO    fuse-server     Inject fault    {"method": "getattr", "path": "db"}
2020-01-10T15:56:40.706Z        INFO    fuse-server     Inject fault    {"context": {}}
2020-01-10T15:56:41.708Z        INFO    fuse-server     Inject fault    {"method": "statfs", "path": "db"}
2020-01-10T15:56:41.708Z        INFO    fuse-server     Inject fault    {"context": {}}
2020-01-10T15:56:50.706Z        INFO    fuse-server     Inject fault    {"method": "getattr", "path": "db"}
2020-01-10T15:56:50.706Z        INFO    fuse-server     Inject fault    {"context": {}}
2020-01-10T15:56:51.441Z        INFO    fuse-server     recover all fault

Does this PR introduce a user-facing change?:

NONE 

No longer allow to inject `chaosfs` container dynamically and only support inject `chaosfs ` container before the application created. 

Signed-off-by: cwen0 <cwenyin0@gmail.com>
Signed-off-by: cwen0 <cwenyin0@gmail.com>
@cwen0 cwen0 added the type/bug Something isn't working label Jan 10, 2020
@zhouqiang-cl
Copy link
Contributor

This pr need queey's test. can you help to test it @mahjonp

if err := utils.UnsetIoInjection(ctx, r.Client, pod, iochaos); err != nil {
r.Log.Error(err, "failed to unset I/O injection",
return true, nil
}, cctx.Done())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When network is not ok, Poll loop and cctx timeout, wait.PollUntil will return timeout err or nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout err

Signed-off-by: cwen0 <cwenyin0@gmail.com>
@zhouqiang-cl zhouqiang-cl added type/bug-fix A bug needs to be fixed. and removed type/bug Something isn't working labels Jan 12, 2020
Copy link
Contributor

@zhouqiang-cl zhouqiang-cl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

scripts/init.sh Outdated
}

copy_scripts() {
echo $1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

polish codes?

// input is closed, so lets signal one last time if we have any pending unsignalled events
if signalled {
// send final event, so we dont miss the trailing event after input chan close
output <- struct{}{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on Coalescer is inaccurate

// it signals on output chan with the last value from input chan

and why remove closing the output channel?

Copy link
Member Author

@cwen0 cwen0 Jan 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closing output channel maybe better.

- -original=/var/lib/tikv/data
- -mountpoint=/var/lib/tikv/fuse-data
- -original=/var/lib/tikv/fuse-data
- -mountpoint=/var/lib/tikv/data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the arg name is misleading..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This directory is usually set to the same level directory as the original directory.
* **mountpoint**: defines the mountpoint to mount original directory.
Copy link
Contributor

@mahjonp mahjonp Jan 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this description is contradictory with how it is used actually?

        - -original=/var/lib/tikv/fuse-data
        - -mountpoint=/var/lib/tikv/data

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I will fix it.

@@ -187,6 +187,7 @@ func watchConfig(cfg *config.Config, stopCh <-chan struct{}) {
case <-stopCh:
close(sigChan)
return
default:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add the default branch? this may cause there exist multiple configWatchers that notify the eventsCh at the same time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this branch just used to restart configWatcher.Watch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood the stopCh before, I thought it was used by configWatcher to notify the outer goroutine that it had been finished.
If configWatcher.Watch or eventsCh first received the msg from stopCh before the outer goroutine, the configWatcher.Watch will be restarted forever because case <- stopCh at line 187 won't receive msg again? how to solve this problem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stopCh will be closed, so all receiver will receive this event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got, that's ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restarting is ok for ErrWatchChannelClosed case, is that ok for other err?

Copy link
Contributor

@mahjonp mahjonp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest lgtm

@mahjonp
Copy link
Contributor

mahjonp commented Jan 12, 2020

What's the reason of not support inject sidecar dynamically now?

Signed-off-by: cwen0 <cwenyin0@gmail.com>
@cwen0
Copy link
Member Author

cwen0 commented Jan 12, 2020

What's the reason of not support inject sidecar dynamically now?

If the application was created, the data directory of the application was not empty, we can't use fusermount to mount it and we must rename it to fuse-data, but If we rename the data directory, there is a risk that the application will not start properly because the data directory cannot be found when the pod was restarted.

@codecov-io
Copy link

codecov-io commented Jan 12, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@7ab71cc). Click here to learn what that means.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #121   +/-   ##
=========================================
  Coverage          ?   29.96%           
=========================================
  Files             ?       24           
  Lines             ?      801           
  Branches          ?        0           
=========================================
  Hits              ?      240           
  Misses            ?      530           
  Partials          ?       31
Impacted Files Coverage Δ
pkg/utils/coalescer.go 0% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ab71cc...9e436b2. Read the comment docs.

zhouqiang-cl
zhouqiang-cl previously approved these changes Jan 12, 2020
Copy link
Contributor

@zhouqiang-cl zhouqiang-cl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

you06
you06 previously approved these changes Jan 12, 2020
Copy link
Member

@you06 you06 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: cwen0 <cwenyin0@gmail.com>
@cwen0 cwen0 dismissed stale reviews from you06 and zhouqiang-cl via 9e436b2 January 13, 2020 00:06
Copy link
Contributor

@zhouqiang-cl zhouqiang-cl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@mahjonp mahjonp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@cwen0 cwen0 merged commit 60d091d into chaos-mesh:master Jan 13, 2020
@cwen0 cwen0 deleted the fix_iochaos branch January 13, 2020 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug-fix A bug needs to be fixed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants