
Remove waitgroup handling from Stopping of harvesters #4336

Merged

merged 1 commit into elastic:master from ruflin:potential-race-fix on May 22, 2017

Conversation

@ruflin (Collaborator) commented May 17, 2017

While trying to reproduce #4332, it turned out that the waitgroup in the Stop part is not needed. Worse, it could cause issues.

The initial approach to fixing the race was to pass the waitgroup to the harvester. But this would change the harvester interface, which I would like to keep as simple as possible.

Previously, one of the problems was that the harvester was added to the registry early, which meant Stop could be called on a harvester that had not been started yet. This is now prevented by checking in Start whether the harvester was already stopped, before any callback methods are registered.
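A minimal sketch of that guard, assuming the harvester keeps a done channel that Stop closes (illustrative names, not necessarily the exact Beats code):

type Harvester struct {
  done chan struct{} // closed by Stop
}

func (h *Harvester) Start() {
  select {
  case <-h.done:
    // Stop was already called: register no callbacks, run nothing.
    return
  default:
  }
  // ... register callbacks and run the harvester loop ...
}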

go func() {
  defer func() {
    r.remove(h)
    r.wg.Done()
  }()
  r.add(h)

@urso (Collaborator) commented May 17, 2017
This has a very high chance of inserting the harvester into the registry after Start has returned. Can we run into a race with the prospector requiring an up-to-date registry?

@ruflin (Author, Collaborator) commented May 18, 2017

The only time the prospector interacts with the registry is when adding harvesters or when stopping them. The reason I moved the add inside the goroutine is that previously a harvester was added to the registry, and it could happen that it was started later but stopped before that. With this change that is less likely, but probably still possible. That is why I also added the check to the harvester: in case it was already stopped, it will not start and returns directly. In that case the harvester also does not have to be stopped, as it is already finished.

go func(h Harvester) {
  r.wg.Done()
  h.Stop()
}(hv)
}

@urso (Collaborator) commented May 17, 2017

Where is r.wg.Wait() executed? With r.wg.Wait() after this for loop, we can wait for all harvesters to finish. This would allow us to make h.Stop() non-blocking, e.g.:

func (r *Registry) Stop() {
  r.Lock()         // <- the lock on Start and Stop guarantees Stop can only be called after Start finishes
  defer r.Unlock()

  close(r.done)

  for _, h := range r.harvesters {
    h.Stop()
  }

  r.wg.Wait() // we wait here before releasing the lock to ensure Start can not be executed before Stop finishes
}

with

func (r *Registry) Start(h Harvester) {
  r.Lock()
  defer r.Unlock()

  // Make sure no new harvesters are started after Stop was called
  if !r.active() {
    return
  }

  r.wg.Add(1)
  r.harvesters[h.ID()] = h
  go func() {
    // <- note: because Stop holds the mutex and waits on `r.wg`, the harvester is removed from
    // the registry only after r.wg.Done is called. This potentially races with the prospector not
    // being able to re-start the harvester. Since the prospector periodically retries starting the
    // harvester, the fact that the race exists can be "ignored".
    defer r.remove(h)
    defer r.wg.Done()
    h.Start()
  }()
}

func (r *Registry) active() bool {
  select {
  case <-r.done:
    return false
  default:
    return true
  }
}

func (h *Harvester) Stop() {
  // Return immediately -> shutdown completes asynchronously.
  // Because `r.wg` is decremented only after the harvester's Start/Run has returned,
  // `registry.Stop` will block anyway.
  h.stopOnce.Do(func() {
    close(h.done)
  })
}

@ruflin (Author, Collaborator) commented May 18, 2017

It is waiting for completion in the defer statement on line 36. I quite like that Stop is actually blocking, because someone outside the harvester should not require knowledge of its inner workings. I need to check what we do with Stop() in other cases.
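A minimal sketch of such a blocking Stop, assuming the harvester tracks its own completion with an internal waitgroup (hypothetical fields, not the actual Beats code):

import "sync"

type Harvester struct {
  done     chan struct{}  // closed to request shutdown
  wg       sync.WaitGroup // incremented when the run loop starts
  stopOnce sync.Once
}

func (h *Harvester) Stop() {
  h.stopOnce.Do(func() { close(h.done) })
  h.wg.Wait() // blocks until the run loop has actually returned
}

This keeps the inner workings hidden: the caller only knows that once Stop returns, the harvester is done.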

@urso (Collaborator) commented May 19, 2017

The difference is in when it blocks and when the mutex is released. My proposal releases the mutex only after all harvesters are finished, so Start can only be called after Stop has returned.

@ruflin (Author, Collaborator) commented May 19, 2017

That's a good point. The issue with stopping now (at least in my implementation) is that as long as the lock is held, harvesters can't be removed from the list, as removing them also requires the lock.
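A sketch of that locking concern (hypothetical names, not the actual Beats code): if Stop holds the registry mutex while waiting on the waitgroup, and each harvester goroutine calls remove, which needs the same mutex, before wg.Done(), both sides block forever.

import "sync"

type Harvester interface {
  ID() string
  Stop()
}

type Registry struct {
  sync.Mutex
  wg         sync.WaitGroup
  harvesters map[string]Harvester
  done       chan struct{}
}

func (r *Registry) remove(h Harvester) {
  r.Lock() // blocks for as long as Stop holds the lock
  defer r.Unlock()
  delete(r.harvesters, h.ID())
}

func (r *Registry) Stop() {
  r.Lock()
  defer r.Unlock()
  close(r.done)
  for _, h := range r.harvesters {
    h.Stop()
  }
  r.wg.Wait() // deadlocks if goroutines still need the lock in remove before calling wg.Done
}

The ordering in the proposal above (defer r.remove(h) before defer r.wg.Done(), so Done runs first under LIFO) sidesteps this: Stop's Wait returns, the lock is released, and only then does remove acquire it.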

@ruflin (Author, Collaborator) commented May 19, 2017

Thinking again about the above: in my implementation, because of the lock, Start can only be called again once it is guaranteed that the channel is closed, which means no new harvesters will be started.

@ruflin force-pushed the ruflin:potential-race-fix branch 2 times, most recently from 25398f2 to 0698c90 on May 19, 2017

Remove waitgroup handling from Stopping of harvesters

@ruflin force-pushed the ruflin:potential-race-fix branch from dc7976f to 5842044 on May 22, 2017

@andrewkroh (Member) commented May 22, 2017

I'm going to merge this so that we can get some testing time on this code.

@andrewkroh merged commit 4aea294 into elastic:master on May 22, 2017

6 checks passed

- CLA: Commit author has signed the CLA
- beats-ci: Build finished.
- codecov/patch: 71.42% of diff hit (target 63.09%)
- codecov/project: Absolute coverage decreased by -0.01% but relative coverage increased by +8.33% compared to 8fb1cf9
- continuous-integration/appveyor/pr: AppVeyor build succeeded
- continuous-integration/travis-ci/pr: The Travis CI build passed

ramon-garcia added a commit to ramon-garcia/beats that referenced this pull request Dec 5, 2017

Remove waitgroup handling from Stopping of harvesters (elastic#4336)

athom added a commit to athom/beats that referenced this pull request Jan 25, 2018

Remove waitgroup handling from Stopping of harvesters (elastic#4336)