Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: fix csi plugin concurrency issue on FuseRecovery and NodeUnpublishVolume #3448

Conversation

TrafalgarZZZ
Copy link
Member

@TrafalgarZZZ TrafalgarZZZ commented Sep 11, 2023

Ⅰ. Describe what this PR does

Fix potential concurrency issue when FuseRecovery and NodeUnpublishVolume process on a same target path.

Ⅱ. Does this pull request fix one issue?

fixes #3449

Ⅲ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅳ. Describe how to verify it

Ⅴ. Special notes for reviews

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
@TrafalgarZZZ
Copy link
Member Author

/test fluid-e2e

}
// targetPath is corrupted
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using if-else to make it easy to understand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
should, err := r.shouldRecover(point.MountPath)
if err != nil {
glog.Warningf("FuseRecovery: found path %s which is unable to recover due to error %v, skip it", point.MountPath, err)
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If continue here, the lock of mountPath will not release.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thx for pointing this!

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
@codecov
Copy link

codecov bot commented Sep 12, 2023

Codecov Report

Merging #3448 (1c22dbe) into master (3e8ba24) will decrease coverage by 0.04%.
The diff coverage is 26.82%.

❗ Current head 1c22dbe differs from pull request most recent head c143674. Consider uploading reports for the commit c143674 to get more accurate results

@@            Coverage Diff             @@
##           master    #3448      +/-   ##
==========================================
- Coverage   64.28%   64.24%   -0.04%     
==========================================
  Files         442      442              
  Lines       26420    26441      +21     
==========================================
+ Hits        16984    16988       +4     
- Misses       7430     7444      +14     
- Partials     2006     2009       +3     
Files Changed Coverage Δ
pkg/csi/recover/register.go 0.00% <0.00%> (ø)
pkg/csi/updatedbconf/register.go 0.00% <0.00%> (ø)
pkg/csi/recover/recover.go 55.55% <29.72%> (-6.73%) ⬇️

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
@TrafalgarZZZ
Copy link
Member Author

/test fluid-e2e

brokenMounts, err := mountinfo.GetBrokenMountPoints()
if err != nil {
glog.Error(err)
return
}

for _, point := range brokenMounts {
glog.V(4).Infof("Get broken mount point: %v", point)
if lock := r.locks.TryAcquire(point.MountPath); !lock {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using r.checkAndRecoverMounts() to wrap the logic in the loop? We can use defer r.locks.Release to make the logic simper.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the suggestion. Moved it into func doRecover

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
@sonarcloud
Copy link

sonarcloud bot commented Sep 12, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
9.2% 9.2% Duplication

@TrafalgarZZZ
Copy link
Member Author

/test fluid-e2e

Copy link
Collaborator

@cheyang cheyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Collaborator

@cheyang cheyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@fluid-e2e-bot
Copy link

fluid-e2e-bot bot commented Sep 12, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheyang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fluid-e2e-bot fluid-e2e-bot bot merged commit 5076ef0 into fluid-cloudnative:master Sep 12, 2023
10 checks passed
TrafalgarZZZ added a commit to TrafalgarZZZ/fluid that referenced this pull request Sep 13, 2023
…blishVolume (fluid-cloudnative#3448)

* Add comments for NodeUnpublishVolume

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Refactor NodeUnpublishVolume code

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* FuseRecovery uses volume locks to avoid race conditions

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Refactor node server with codes.Internal error code

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Rename CSI Config to RunningContext

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Fix github actions checks

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Fix lock release

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Refactor recover logic

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

---------

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
fluid-e2e-bot bot pushed a commit that referenced this pull request Sep 13, 2023
…and NodeUnpublishVolume (#3448) (#3453)

* Bugfix: ignore not connected error in NodeUnpublishVolume (#3445)

* ignore not connected error in NodeUnpublishVolume

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>

* fix check nil error

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>

* simplify error judgment

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>

---------

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>

* bugfix: fix csi plugin concurrency issue on FuseRecovery and NodeUnpublishVolume (#3448)

* Add comments for NodeUnpublishVolume

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Refactor NodeUnpublishVolume code

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* FuseRecovery uses volume locks to avoid race conditions

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Refactor node server with codes.Internal error code

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Rename CSI Config to RunningContext

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Fix github actions checks

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Fix lock release

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

* Refactor recover logic

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

---------

Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>

---------

Signed-off-by: wangshulin <wangshulin@smail.nju.edu.cn>
Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
Co-authored-by: wangshulin <89928606+wangshli@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Concurrency issue when enabling CSI FUSE Recovery feature
3 participants