This repository was archived by the owner on Dec 17, 2025. It is now read-only.

Conversation

@d-kuro commented Jun 3, 2020

refs: #50

If you pull in the code from #52, the race condition goes away.

# refs: https://github.com/argoproj/gitops-engine/pull/52
$ git status
On branch feature/fix-race-condition
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   pkg/sync/sync_context_test.go
        modified:   pkg/utils/kube/kubetest/mock.go

no changes added to commit (use "git add" and/or "git commit -a")

$ go test ./... --race
?       github.com/argoproj/gitops-engine       [no test files]
?       github.com/argoproj/gitops-engine/agent [no test files]
ok      github.com/argoproj/gitops-engine/pkg/cache     (cached)
?       github.com/argoproj/gitops-engine/pkg/cache/mocks       [no test files]
ok      github.com/argoproj/gitops-engine/pkg/diff      (cached)
?       github.com/argoproj/gitops-engine/pkg/engine    [no test files]
ok      github.com/argoproj/gitops-engine/pkg/health    (cached)
ok      github.com/argoproj/gitops-engine/pkg/sync      (cached)
ok      github.com/argoproj/gitops-engine/pkg/sync/common       (cached)
ok      github.com/argoproj/gitops-engine/pkg/sync/hook (cached)
ok      github.com/argoproj/gitops-engine/pkg/sync/hook/helm    (cached)
ok      github.com/argoproj/gitops-engine/pkg/sync/ignore       (cached)
ok      github.com/argoproj/gitops-engine/pkg/sync/resource     (cached)
ok      github.com/argoproj/gitops-engine/pkg/sync/syncwaves    (cached)
?       github.com/argoproj/gitops-engine/pkg/utils/errors      [no test files]
ok      github.com/argoproj/gitops-engine/pkg/utils/exec        (cached)
?       github.com/argoproj/gitops-engine/pkg/utils/io  [no test files]
?       github.com/argoproj/gitops-engine/pkg/utils/json        [no test files]
ok      github.com/argoproj/gitops-engine/pkg/utils/kube        (cached)
?       github.com/argoproj/gitops-engine/pkg/utils/kube/kubetest       [no test files]
?       github.com/argoproj/gitops-engine/pkg/utils/testing     [no test files]
?       github.com/argoproj/gitops-engine/pkg/utils/text        [no test files]
ok      github.com/argoproj/gitops-engine/pkg/utils/tracing     (cached)

Taking a sync.Mutex lock across the many functions involved was complex and error-prone, so I took the approach of using sync.Map instead.
I don't think this is the best possible solution, but I think it's good that the race conditions are gone.
If you have any comments, please feel free to leave them!
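For readers skimming the thread, this is the general shape of the change (a minimal standalone sketch, not the actual PR code; names and types are illustrative):

package main

import (
	"fmt"
	"sync"
)

func main() {
	// Instead of a plain map guarded by a mutex that every caller must
	// remember to take, sync.Map's Store/Load/Range are safe for
	// concurrent use by themselves.
	var resources sync.Map

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		i := i // copy the loop variable (needed before Go 1.22)
		wg.Add(1)
		go func() {
			defer wg.Done()
			resources.Store(fmt.Sprintf("res-%d", i), i) // concurrent writes, no explicit lock
		}()
	}
	wg.Wait()

	resources.Range(func(key, value interface{}) bool {
		fmt.Println(key, value) // concurrent iteration is also safe
		return true
	})
}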

@codecov bot commented Jun 3, 2020

Codecov Report

Merging #51 into master will decrease coverage by 0.09%.
The diff coverage is 57.33%.


@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
- Coverage   52.49%   52.40%   -0.10%     
==========================================
  Files          25       25              
  Lines        2585     2727     +142     
==========================================
+ Hits         1357     1429      +72     
- Misses       1100     1146      +46     
- Partials      128      152      +24     
Impacted Files            Coverage Δ
pkg/utils/kube/ctl.go      5.78% <0.00%>  (+0.09%) ⬆️
pkg/cache/cluster.go      50.89% <51.76%> (-2.68%) ⬇️
pkg/cache/resource.go     72.97% <53.84%> (-12.22%) ⬇️
pkg/sync/sync_context.go  64.38% <86.27%> (+1.45%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16598d5...01379a4.

@d-kuro force-pushed the feature/fix-race-condition branch from 2f102ac to d78cf09 on June 4, 2020 16:44
@d-kuro changed the title from "[WIP] Fix race condition" to "fix: race condition" on Jun 4, 2020
@d-kuro changed the title from "fix: race condition" to "fix: race condition (#50)" on Jun 4, 2020
@d-kuro marked this pull request as ready for review on June 4, 2020 16:54
@ash2k (Member) commented Jun 6, 2020

Hey, you beat me to it! I was also working on fixing these race conditions. I think your branch does more and does it better, as I haven't tried to simplify things as much - I'm not that familiar with the codebase. You can see what I've done here if you are interested: master...ash2k:ash2k/fixes

type apiMeta struct {
	lock            sync.Mutex
	namespaced      bool
	resourceVersion string
Member:

nit: this could be just an atomic.Value to simplify things.
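Sketched out, the suggestion could look roughly like this (illustrative: assumes atomic.Value from sync/atomic, and the version/setVersion helpers are hypothetical):

type apiMeta struct {
	namespaced      bool
	resourceVersion atomic.Value // holds a string; no mutex needed for this field
}

func (a *apiMeta) version() string {
	// Load returns nil before the first Store; the two-value type assertion
	// then yields "", preserving the zero-value semantics of the old field.
	v, _ := a.resourceVersion.Load().(string)
	return v
}

func (a *apiMeta) setVersion(v string) {
	a.resourceVersion.Store(v)
}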

Author:

Fixed
8c8713d

}

type Resources struct {
	sync.Map
Member:

It would probably be better to make it a field rather than embed it - currently all of the map's methods are promoted onto the Resources type, and perhaps this is not what you want?
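For illustration, the named-field variant could look like this (a sketch against the diff's Resources type; the Store/Load wrappers are hypothetical):

type Resources struct {
	m sync.Map // named field: the map's methods are not promoted onto Resources
}

func (r *Resources) Store(key kube.ResourceKey, value *Resource) {
	r.m.Store(key, value)
}

func (r *Resources) Load(key kube.ResourceKey) (*Resource, bool) {
	v, ok := r.m.Load(key)
	if !ok {
		return nil, false
	}
	return v.(*Resource), true // the type assertion lives in one place
}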

func runSynced(lock sync.Locker, action func() error) error {
	lock.Lock()
	defer lock.Unlock()
func runSynced(action func() error) error {
Member:

This can just be removed.

Author:

Thanks!
29224aa

// if there is anything that needs deleting, we are at best now in pending and
// want to return and wait for sync to be invoked again
runStateMutex.Lock()
runState = pending
Member:

I think apart from the data race that you are fixing here, there is also a logic error. One goroutine may set runState to failed but then another one to pending or the other way around. This non-deterministic behavior is probably not right, or is it? Please see 1e0e792 for a different approach.
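One way to make the transitions deterministic (a sketch of the idea only; the severity ordering is an assumption, and 1e0e792 may do this differently): order the states and only ever escalate, so pending can never overwrite failed:

// Hypothetical type name to avoid clashing with the runState variable.
type taskRunState int

const (
	successful taskRunState = iota
	pending
	failed
)

var (
	runStateMutex sync.Mutex
	runState      taskRunState
)

// setRunState only moves the state "up", so goroutines reporting different
// outcomes converge on the most severe one regardless of scheduling order.
func setRunState(newState taskRunState) {
	runStateMutex.Lock()
	defer runStateMutex.Unlock()
	if newState > runState {
		runState = newState
	}
}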

Author:

I cherry-picked from your commit.
Thank you!

6a762b3
c5d445f

	sc.setResourceResult(task, "", common.OperationFailed, fmt.Sprintf("Failed to delete: %v", err))
	terminateSuccessful = false
} else {
	sc.setResourceResult(task, "", common.OperationSucceeded, fmt.Sprintf("Deleted"))
Contributor:

👍

{
	var wg sync.WaitGroup
	for _, task := range pruneTasks {
		task := task
Contributor:

👍

Member:

This line is not needed as task is passed to the closure (line 737/742) and it uses t instead.
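For context (this applies to Go versions before 1.22): the copy only matters when the loop variable is captured by reference; passing it as an argument binds it per iteration. use() is a hypothetical stand-in:

// Captured by reference: every goroutine may observe the last value of
// task, so this form needs the task := task copy.
for _, task := range tasks {
	task := task
	go func() { use(task) }()
}

// Passed as an argument: t is a fresh copy per iteration, no shadowing needed.
for _, task := range tasks {
	go func(t *syncTask) { use(t) }(task)
}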

Author (@d-kuro, Jun 15, 2020):

I think that's been fixed with this commit.
c5d445f

@alexmt (Contributor) left a comment:

@ash2k, @d-kuro Thanks a lot for fixing the race conditions and improving the quality! The GitOps engine code is a lift-and-shift of Argo CD code and requires polishing.

The changes in sync_context.go fix real bugs. Great catch!

I apologize in advance for the delay in review. The updated code is very performance-sensitive and requires careful testing. Before merging it, please let us fix two issues that block the 1.6 GA release:

Next, we will need to run performance testing on internal Argo CD instances, so it might take a few days.

@d-kuro (Author) commented Jun 10, 2020

@ash2k @alexmt
Thank you for your comment!
I'm going to work over the weekend on the code fixes that were pointed out.

@ash2k (Member) commented Jun 10, 2020

@d-kuro 👍

I suggest always running go test -race. Argo CD's tests too!

@alexmt (Contributor) commented Jun 11, 2020

v1.6 testing looks good so far. We just want to run it internally for a few more days and then create the 1.6 GA release. As soon as 1.6 is out, I will test this PR's performance.

if runState == successful && createTasks.Any(func(t *syncTask) bool { return t.needsDeleting() }) {
	var wg sync.WaitGroup
	for _, task := range createTasks.Filter(func(t *syncTask) bool { return t.needsDeleting() }) {
		task := task
Member:

This line is not needed, same as above.

Author (@d-kuro, Jun 15, 2020):

I understand.
Thank you for the comments.

I think that's been fixed with this commit.
c5d445f

var pruneTasks syncTasks

for _, task := range tasks {
	task := task
Member:

I wonder why this is needed?

Author:

I think that's been fixed with this commit.
c5d445f

@d-kuro force-pushed the feature/fix-race-condition branch from d78cf09 to 7d7a436 on June 15, 2020 06:15
d-kuro and others added 5 commits June 15, 2020 15:16
Make runState a new type rather than a type alias
==================
WARNING: DATA RACE

Write at 0x00c0000d0b40 by goroutine 83:
  github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).runTasks.func5.1()
      /gitops-engine/pkg/sync/sync_context.go:786 +0x68c

Previous write at 0x00c0000d0b40 by goroutine 84:
  github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).runTasks.func5.1()
      /gitops-engine/pkg/sync/sync_context.go:786 +0x68c

Goroutine 83 (running) created at:
  github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).runTasks.func5()
      /gitops-engine/pkg/sync/sync_context.go:778 +0x165
  github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).runTasks()
      /gitops-engine/pkg/sync/sync_context.go:807 +0xb32
  github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).Sync()
      /gitops-engine/pkg/sync/sync_context.go:265 +0x1b3d
  github.com/argoproj/gitops-engine/pkg/sync.TestSyncFailureHookWithFailedSync()
      /gitops-engine/pkg/sync/sync_context_test.go:532 +0x4e5
  testing.tRunner()
      /usr/local/Cellar/go/1.14.3/libexec/src/testing/testing.go:991 +0x1eb

Goroutine 84 (running) created at:
  github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).runTasks.func5()
      /gitops-engine/pkg/sync/sync_context.go:778 +0x165
  github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).runTasks()
      /gitops-engine/pkg/sync/sync_context.go:807 +0xb32
  github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).Sync()
      /gitops-engine/pkg/sync/sync_context.go:265 +0x1b3d
  github.com/argoproj/gitops-engine/pkg/sync.TestSyncFailureHookWithFailedSync()
      /gitops-engine/pkg/sync/sync_context_test.go:532 +0x4e5
  testing.tRunner()
      /usr/local/Cellar/go/1.14.3/libexec/src/testing/testing.go:991 +0x1eb
==================
@ash2k (Member) left a comment:

I've left a few more comments. I think it would be very helpful to document which objects/methods are thread-safe and which are not.

I think it might not be the best idea to use sync.Map because it makes it harder to write code:

  • all those nil checks all over the place
  • lost type safety in loops (.Range() vs for)
  • it introduces a chance of "logical" data races where two methods don't exclude each other but both work on the same data concurrently (e.g. iterating over a map while mutating it), which leads to unforeseen results.

It might be better to just use a mutex and a normal map (sketched below). WDYT? Just my two cents.
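A sketch of the mutex-and-map alternative (illustrative type and method names, reusing kube.ResourceKey and *Resource from this PR):

type resourceCache struct {
	mu        sync.RWMutex
	resources map[kube.ResourceKey]*Resource
}

func (c *resourceCache) get(key kube.ResourceKey) (*Resource, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	r, ok := c.resources[key] // type-safe: no interface{} assertions or nil checks
	return r, ok
}

func (c *resourceCache) set(key kube.ResourceKey, r *Resource) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.resources[key] = r
}

Iteration becomes a plain for-range under the read lock, and a compound operation can hold the write lock for its whole duration, which rules out the "logical" races described above.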

	return version, nil
}

return "", fmt.Errorf("stored type is invalid in resourceVersion")
Member:

This cannot happen, and if it does it should be OK to panic because... it should never happen?
I'd do this instead:

str, _ := a.resourceVersion.Load().(string)
return str, nil

This will not error out, and it also preserves the current semantics where the field is an empty string if it was never set. It's also nicer not to have to check the error in every place where this method is called.

	r.Store(key, value)
}

func (r *Resources) LoadResources(key kube.ResourceKey) (*Resource, error) {
Member:

Same as above - panicking should be ok if something unexpected is in this private field. And the error is not checked in a few places below. Why return it then? :)

	c.apisMeta.Store(key, value)
}

func (c *clusterCache) LoadApisMeta(key schema.GroupKind) (*apiMeta, error) {
Member:

Same as above - panicking should be ok if something unexpected is in this private field.

	c.nsIndex.Store(key, value)
}

func (c *clusterCache) LoadNsIndex(key string) (*Resources, error) {
Member:

Same as above - panicking should be ok if something unexpected is in this private field.

if err != nil {
	return err
}
if info != nil {
Member:

This if can be inverted to avoid so much indentation. https://github.com/golang/go/wiki/CodeReviewComments#indent-error-flow
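Sketched generically:

// Instead of indenting the whole happy path:
if info != nil {
	// ... many lines ...
}

// invert the condition and return early, keeping the happy path flat:
if info == nil {
	return nil
}
// ... many lines at the outer indentation level ...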

c.lock.Lock()
defer c.lock.Unlock()
// before doing any work, check once again now that we have the lock, to see if it got
// synced between the first check and now
Member:

This comment and code above can be deleted now.

if len(res.OwnerRefs) == 0 {
	resources[res.ResourceKey()] = res
func (c *clusterCache) GetNamespaceTopLevelResources(namespace string) *Resources {
	res, _ := c.LoadNsIndex(namespace)
Member:

No error check here either. That's why I'm suggesting to just let it panic if the field contents are corrupted. The next line will panic anyway.

@d-kuro (Author) commented Jun 19, 2020

@ash2k

I think it might not be the best idea to use sync.Map because it makes it harder to write code:

I understand.
I also think sync.Map's Range() is awkward and should be improved.
Initially I chose sync.Map because locking around the map was complex, but now that the access code has been cleaned up, I think we can switch to the mutex-and-map approach.

I'm going to try changing it back from sync.Map to a plain map.
Thanks!

if len(ns) == 0 {
	delete(c.nsIndex, key.Namespace)
func (c *clusterCache) onNodeRemoved(key kube.ResourceKey) error {
	existing, err := c.resources.LoadResources(key)
Contributor:

I agree with @ash2k regarding sync.Map usage.

I think we still need a lock here to make sure the resources map and the namespace index stay synchronized.
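A sketch of what that could look like (assumes both plain maps are guarded by a single clusterCache mutex; the field types and simplified signature are illustrative):

func (c *clusterCache) onNodeRemoved(key kube.ResourceKey) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if _, ok := c.resources[key]; !ok {
		return
	}
	delete(c.resources, key)
	// The namespace index is updated under the same lock, so no reader can
	// ever observe c.resources and c.nsIndex disagreeing with each other.
	if ns, ok := c.nsIndex[key.Namespace]; ok {
		delete(ns, key)
		if len(ns) == 0 {
			delete(c.nsIndex, key.Namespace)
		}
	}
}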

	c.nsIndex.Store(key, value)
}

func (c *clusterCache) LoadNsIndex(key string) (*Resources, error) {
Contributor:

I think LoadNsIndex should be a private method. Please rename it to loadNsIndex.

	return cnt
}

func (c *clusterCache) StoreNsIndex(key string, value *Resources) {
Contributor:

Same as LoadNsIndex. Please rename to storeNsIndex.

@alexmt (Contributor) commented Jun 24, 2020

Hello @d-kuro. I'm very interested in completing and merging this PR. Please let me know if you have time to continue working on it.
I can help, e.g. I can send a PR to your fork that addresses the sync.Map-related comments. Let me know.

@d-kuro (Author) commented Jun 27, 2020

@alexmt
Sorry for the late reply.

I'm going to try migrating from sync.Map to map, as I wrote in the comment linked below:
#51 (comment)

I may ask you to follow up if I get stuck, but I'll try to do it myself for now.
Thanks for your concern!

@sonarqubecloud commented

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rated A)
Vulnerabilities: 0 (rated A), 0 Security Hotspots to review
Code Smells: 3 (rated A)

No coverage information
Duplication: 0.0%

@d-kuro (Author) commented Jun 27, 2020

@ash2k cc: @alexmt

I'm working on the migration from sync.Map to map,
but it runs into a race between ranging over the map and writing to it.

fatal error: concurrent map iteration and map write

Using RLock for map range access and Lock for writes is simple, but consider, for example, the following code:
https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/cluster.go#L241-L249

This function deadlocks, because we delete from the map inside a range loop that already holds the read lock.
Is there a good way to write this without using sync.Map?

// c.resources is "map[kube.ResourceKey]*Resource"

// Need RWMutex.RLock()
for key := range c.resources {
	if key.Kind != gk.Kind || key.Group != gk.Group || ns != "" && key.Namespace != ns {
		continue
	}

	if _, ok := objByKey[key]; !ok {
		// Need RWMutex.Lock()
		// "delete(c.resources, key)" in a function
		c.onNodeRemoved(key)
	}
}
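Two possible shapes for this without sync.Map (sketches only; they assume c.lock is a sync.RWMutex and reuse gk, ns, and objByKey from the snippet above):

// Option 1: take the write lock around the whole operation (e.g. in the
// caller). Deleting entries while ranging over a map is allowed in Go, so
// delete(c.resources, key) inside the loop is safe once we hold the lock.
c.lock.Lock()
for key := range c.resources {
	if key.Kind != gk.Kind || key.Group != gk.Group || ns != "" && key.Namespace != ns {
		continue
	}
	if _, ok := objByKey[key]; !ok {
		c.onNodeRemoved(key)
	}
}
c.lock.Unlock()

// Option 2: collect the stale keys under the read lock, then mutate
// afterwards (onNodeRemoved would itself take the write lock here).
c.lock.RLock()
var stale []kube.ResourceKey
for key := range c.resources {
	if _, ok := objByKey[key]; !ok {
		stale = append(stale, key)
	}
}
c.lock.RUnlock()
for _, key := range stale {
	c.onNodeRemoved(key)
}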

@ash2k (Member) commented Jun 28, 2020

@d-kuro Perhaps replaceResourceCache() can be left as is, but then watchEvents(), which calls it, can grab the write lock before calling it?

I'd like to suggest splitting this PR as it's getting quite big. Perhaps make all the changes to pkg/cache/cluster.go in a separate PR?

I've also looked at the clusterCache more closely, and I'm wondering why it was written using list+watch and a manual cache (the resources field) rather than using informers? It might be a good idea to refactor clusterCache to use informers instead - they support notifications and indexes, and are thoroughly tested by Kubernetes itself. WDYT?

@alexmt (Contributor) commented Jun 29, 2020

Thanks a lot for continuing to work on this, @d-kuro!

I've looked one more time at where replaceResourceCache is used. The method is executed in two places, and both hold the write lock:

c.replaceResourceCache(api.GroupKind, info.resourceVersion, items, ns)

c.replaceResourceCache(gk, "", []unstructured.Unstructured{}, ns)

I suspect the race condition happens in tests only: TestNamespaceModeReplace. I hope I'm not missing anything. If that's the case, we just need to take the lock in the TestNamespaceModeReplace test.

Agree with @ash2k about moving cluster.go changes into separate PRs.

@alexmt (Contributor) commented Jun 29, 2020

@ash2k we could not use the informers framework because informers cache the whole resource manifest. We only need the resource reference and parent links. Switching to informers would increase memory consumption; that was the main reason to use the List/Watch APIs manually.
But I agree that we are partially re-implementing functionality that exists in the informers framework.

@ash2k (Member) commented Jun 30, 2020

@alexmt

we could not use the informers framework because informers cache the whole resource manifest.

This is the default behavior, but it is configurable. If only certain fields are required, a custom Go type can be crafted to use for any watched resource type. That way only the needed fields are unmarshaled and nothing else. For example, there is a whole package just for working with metadata of objects. Tuned client and informer are available there.

However, memory consumption will still be higher in the above case vs how things currently work because e.g. all annotations will be stored in the informer's cache. Thank you for answering the question :)
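For reference, a rough sketch of that metadata-only machinery (the k8s.io/client-go/metadata and metadatainformer packages; the wiring here is illustrative, not how gitops-engine does it):

package example

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/metadata/metadatainformer"
	"k8s.io/client-go/rest"
)

func watchMetadataOnly(config *rest.Config, gvr schema.GroupVersionResource) error {
	// The metadata client decodes only PartialObjectMetadata (name, labels,
	// annotations, ownerReferences, ...), not the full resource manifest.
	client, err := metadata.NewForConfig(config)
	if err != nil {
		return err
	}
	factory := metadatainformer.NewSharedInformerFactory(client, 10*time.Minute)
	informer := factory.ForResource(gvr).Informer()
	_ = informer // add event handlers here, then start the factory
	return nil
}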

@alexmt (Contributor) commented Aug 28, 2020

The changes from this PR got split into several other PRs, and race detection is now enabled in the project. @d-kuro, @ash2k thanks a lot for driving it!

@alexmt closed this on Aug 28, 2020
@alswl (Contributor) commented Oct 28, 2021

@alexmt Hi, is there any progress? The master branch still does not contain the commits from this PR.

@alswl (Contributor) commented Nov 3, 2021

c.lock and c.listSemaphore caused two deadlock situations in big clusters (more than 500 API resources).

  1. c.lock is held over too large a scope in the EnsureSynced() function; we should remove c.lock and use sync.Map.
  2. The semaphore in c.listSemaphore is released with defer, which keeps it held for the whole callback() body, and in watchEvents() that body contains runSync() (sketched below).
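A sketch of the second situation (a hypothetical shape based on the description above; assumes c.listSemaphore is a golang.org/x/sync/semaphore Weighted, and listResources/callback are stand-ins):

// Problematic: defer keeps the semaphore slot held for the entire body,
// including callback(), which may itself need more slots further down.
if err := c.listSemaphore.Acquire(ctx, 1); err != nil {
	return err
}
defer c.listSemaphore.Release(1)
res, err := listResources()
if err != nil {
	return err
}
return callback(res) // still holding the slot here

// Safer: release the slot as soon as the guarded List call is done.
if err := c.listSemaphore.Acquire(ctx, 1); err != nil {
	return err
}
res, err := listResources()
c.listSemaphore.Release(1)
if err != nil {
	return err
}
return callback(res)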
