Properly stop logger during (re)connect failure #82

Merged
merged 3 commits into fluent:master from akerouanton:stop-async-connect on Oct 31, 2021

Conversation

@akerouanton (Contributor) commented Apr 22, 2020

PR #77 introduced a new parameter named ForceStopAsyncSend. It can be
used to tell the logger not to try to send all the log messages in its
buffer before closing. Without this parameter, the logger hangs
whenever it has logs to write and the target Fluentd server is down.

But this new parameter is not enough: the connection is currently lazily
initialized when the logger receives its first message. This blocks the
select that reads messages from the queue (until the connection is ready).
Moreover, the connection dialing uses an exponential back-off retry.

Because of that, the logger won't look for messages on the stopRunning
channel (the channel introduced by PR #77), either because it's
blocked by the Sleep used for the retry or because the connection
dialing is waiting for the dialing timeout to expire (e.g. the TCP timeout).

To fix these edge cases, the time.Sleep() used for the back-off retry has
been transformed into a time.After(), and a new stopAsyncConnect channel
has been introduced to interrupt it. Moreover, the dialer.Dial() call used
to connect to Fluentd has been replaced with dialer.DialContext(), and a
cancellable context is now used to stop the call to that method.

Finally, the Fluent.run() method has been adapted to wait for both new
messages on f.pending and the stop signal sent on the f.stopRunning channel.
Previously, both channels were read independently, in a procedural
fashion.

This fix is motivated by the issue described in: moby/moby#40063.
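
As a rough illustration of the approach (a sketch under assumed names, not the library's actual code), the back-off wait becomes a select over time.After() and a stop channel, so closing the logger interrupts the retry instead of waiting for a Sleep to finish:

package main

import (
	"errors"
	"fmt"
	"time"
)

// errIsClosing is an illustrative sentinel; the real logger may use a
// different error or mechanism.
var errIsClosing = errors.New("logger is closing")

// waitBeforeRetry waits for the back-off delay of the given attempt, but
// returns early with errIsClosing if the stop channel is closed meanwhile.
// A plain time.Sleep() could not be interrupted this way.
func waitBeforeRetry(attempt uint, stop <-chan struct{}) error {
	backoff := (100 * time.Millisecond) << attempt
	select {
	case <-time.After(backoff):
		return nil // delay elapsed, the caller may retry dialing
	case <-stop:
		return errIsClosing // the logger is closing, abort the retry loop
	}
}

func main() {
	stop := make(chan struct{})
	// Simulate Close() with ForceStopAsyncSend enabled after 200ms.
	go func() {
		time.Sleep(200 * time.Millisecond)
		close(stop)
	}()

	for attempt := uint(0); attempt < 5; attempt++ {
		if err := waitBeforeRetry(attempt, stop); err != nil {
			fmt.Println("stopped while waiting to retry:", err)
			return
		}
		fmt.Printf("would retry dialing now (attempt %d)\n", attempt+1)
	}
}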

@tagomoris (Member) commented:

The motivation is reasonable, so I'm positive about fixing this problem.
This kind of patch requires time to review, so I'll be able to review this change tomorrow or early next week.
But anyway, CI is now failing. @NiR- Could you take care of it?

@akerouanton force-pushed the stop-async-connect branch 3 times, most recently from 05f5f84 to 9978fb3, on April 23, 2020 17:38
@akerouanton (Contributor, Author) commented:

But anyway, CI is now failing. @NiR- Could you take care of it?

CI is fixed, but I'm still doing some manual testing to make sure everything is OK. I'll try to add more test cases 😉

@tagomoris (Member) commented:

@NiR- Great! Please let me know when you've finished that.

@akerouanton changed the title from "Try to asynchronously connect to Fluentd before writing" to "Properly stop logger during (re)connect failure" on May 21, 2020
@angulito commented Aug 3, 2020

Any news on this? I'm experiencing the same issue described in moby/moby#40063 with docker v19.03.6-ce. Thank you!

@akerouanton (Contributor, Author) commented:

@angulito We first want to merge #83, which introduces a new way of testing edge cases like the one identified and fixed by this PR.

The new test cases and the fix were first submitted in a single PR, but since the diff was quite large, I split it into two PRs to make it easier for the repo's maintainer(s) to review and accept these changes.

Hopefully this will be included in the next major Docker release. In the meantime, I created a logger plugin that includes this fix. It works like the original fluentd logger and is available at akerouanton/fluentd-async-logger.

@angulito commented Aug 3, 2020

Thank you for working hard on it! @akerouanton

@akerouanton (Contributor, Author) commented Nov 11, 2020

Hey @tagomoris, I took another look at this PR over the past few days and finally found out why the tests were failing on CI. I also overhauled the previous implementation to make the code substantially simpler.

CI was failing because of the tests with the RequestAck option enabled: the ack hashes contain a randomly generated uint and, despite the generator being implicitly seeded with 1 (as specified here), the ack hashes were not stable. I tried to explicitly seed the generator, but with no luck. Thus, I decided to introduce a new randomGenerator variable and initialize it with rand.Uint64. This variable is changed in tests to use a function that always returns 1. This way, we're sure the ack hashes are stable across runs and machines.
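
For illustration, a minimal sketch of that pattern (the newChunkID helper is hypothetical; only randomGenerator and rand.Uint64 come from the comment above):

package fluent

import "math/rand"

// randomGenerator produces the random component of ack chunk IDs.
// Production code keeps the default (rand.Uint64); tests override it with
//
//	randomGenerator = func() uint64 { return 1 }
//
// so the expected ack hashes stay stable across runs and machines.
var randomGenerator = rand.Uint64

// newChunkID is a hypothetical helper showing where the generator is used.
func newChunkID() uint64 {
	return randomGenerator()
}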

On the overhaul part, I decided to remove the newly introduced stopAsyncConnect channel and use a cancellable context instead. This context is passed from newWithDialer() to run() -> write() -> connectOrRetry() -> connect() -> DialContext(). This way, as soon as the context is cancelled, any in-progress dialing is stopped and the errIsClosing error is propagated up to run() by early-terminating the retriable code paths (connectOrRetry() and write()). Moreover, any subsequent call to write() will fail the same way (for instance if a log is dispatched through f.pending before the closing of f.stopRunning is dispatched), making the code more robust to channel dispatching order in run(). I also removed the drainEvents condition in run() since returning from that function and closing f.pending is already enough to discard queued messages.
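
A simplified sketch of that control flow (the struct and method signatures are assumptions; the real chain goes through run() as described above): once the context is cancelled, DialContext() returns immediately and the retriable paths short-circuit with errIsClosing:

package fluent

import (
	"context"
	"errors"
	"net"
)

var errIsClosing = errors.New("fluent logger is closing")

type logger struct {
	ctx    context.Context // cancelled when the logger is closed
	cancel context.CancelFunc
	conn   net.Conn
	dialer net.Dialer
}

func newLogger() *logger {
	ctx, cancel := context.WithCancel(context.Background())
	return &logger{ctx: ctx, cancel: cancel}
}

// Close cancels the context, which immediately unblocks any in-progress
// DialContext call; subsequent writes then fail with errIsClosing.
func (l *logger) Close() {
	l.cancel()
	if l.conn != nil {
		l.conn.Close()
	}
}

// connectOrRetry is heavily simplified (no back-off here): the point is that
// a cancelled context turns any dial error into errIsClosing, so callers
// stop retrying early.
func (l *logger) connectOrRetry(addr string) error {
	conn, err := l.dialer.DialContext(l.ctx, "tcp", addr)
	if err != nil {
		if l.ctx.Err() != nil {
			return errIsClosing
		}
		return err
	}
	l.conn = conn
	return nil
}

// write fails the same way for every call made after cancellation, whatever
// the order in which pending messages and the stop signal are served.
func (l *logger) write(addr string, msg []byte) error {
	if l.ctx.Err() != nil {
		return errIsClosing
	}
	if l.conn == nil {
		if err := l.connectOrRetry(addr); err != nil {
			return err
		}
	}
	_, err := l.conn.Write(msg)
	return err
}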

This PR is ready to be re-reviewed 🙂

@tagomoris (Member) commented:

I'll try to find time to review this change this week or the next.

fluent/fluent.go Outdated
// Unfortunately, using a seeded random number generator isn't enough to ensure
// ack hashes are deterministically generated during tests, thus we need to
// change randomGenerator value to ensure the hashes are stable during tests
// such that we can expect how the logger behaves with RequestAck option enabled.
@tagomoris (Member) commented:
How about replacing this comment with a short one like "this is a 'var' so the random generator can be replaced in tests", and writing the content of the above comment in the test file?
The content is a little complex, and it's almost only about testing.

@tagomoris (Member) commented:

I took a glance at this, and need more time.

@tagomoris (Member) commented:

I reviewed the updated change. The clear problem is that this code acquires the connection lock while writing messages and while waiting for the ack message. This causes terrible performance degradation if the ack option is enabled. Even without the ack option, the critical section for writing messages could cause performance problems.

And I found another problem (not introduced by this change): arriving ack messages may be out of order, but the current implementation doesn't consider such cases. 🤔

@jchanam commented Mar 22, 2021

Hi!

I'd like to know if you were able to move forward with this issue. We're still facing the problem described in moby/moby#40063.

Thanks!

@rothn commented May 17, 2021

We are having this issue as well. Do we know if there's anything preventing a merge?

@hakashya commented:

Hey @tagomoris,
To confirm, the problems that are preventing the merge are:

  1. Message writing critical section causing performance degradation.
  2. Unordered ack messages not being handled.

Hey @akerouanton,
I am just wondering whether you are still working on it?

@tagomoris (Member) commented:

@hakashya Unordered ack messages are not a problem introduced by this change. Of course, it would be good if we can solve that too.

@PettitWesley commented Sep 28, 2021

@akerouanton @tagomoris Hi, I'm Wesley, from the AWS containers team, and I'm also the AWS maintainer of fluent/fluent-bit. We need to get this change out since it fixes the linked Moby/Docker issue.

My proposal is that I will take over this fix. I will preserve @akerouanton's commit so that they get credit, and then add commits on top of it as needed to get this PR to a mergeable state.

To confirm, the issues blocking merge are:

  1. Rebase.
  2. "Message writing critical section causing performance degradation." I need to read through this code and try to understand it... can someone point me to the piece of the code where this issue is?

If I fix those, can we merge it?

I will begin work on this on Wednesday September 29th.

@tagomoris (Member) commented:

I'm totally OK with merging a change written by 2 contributors (it's almost the same as 2 independent pull requests) if it solves the problems.

@akerouanton (Contributor, Author) commented:

About the overview of the change: I'm not a big fan of long anonymous functions (writer in write()). In my opinion, it hurts readability.

I can rename the current write() method to writeOrRetry() and move the content of the writer closure into a new write() method if you want, like what I did with connect() and connectOrRetry() 🙂

@akerouanton (Contributor, Author) commented Oct 22, 2021

Btw, I did not move f.conn.SetWriteDeadline() into the connect() method as I proposed in the patch I posted as a comment 2 days ago. However, setting the deadline in the write() method could be the source of a bug in sync mode. For instance:

  • A 1st goroutine sends a message which takes a long time to send (e.g. TCP exponential backoff got triggered).
  • A 2nd goroutine sends a message while the 1st message is still in flight and sets a new write deadline, impacting the deadline of the 1st message.
  • A 3rd goroutine sends a message and sets a new deadline, again impacting the previous in-flight messages.
  • etc.

This is due to how SetWriteDeadline() works:

// SetWriteDeadline sets the deadline for future Write calls
// and any currently-blocked Write call.
// Even if write times out, it may return n > 0, indicating that
// some of the data was successfully written.
// A zero value for t means Write will not time out.
SetWriteDeadline(t time.Time) error

Of course this is limited to use cases where many goroutines are used to send logs, but I prefer to point it out.
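
A contrived sketch of that scenario (not code from this PR, and the address is an assumption): each goroutine resets the write deadline of the shared connection, which also moves the deadline of a Write() still blocked in another goroutine:

package main

import (
	"net"
	"sync"
	"time"
)

// send sets a per-message deadline and writes the payload. Because the
// deadline belongs to the connection, not to this call, it also applies to
// any Write that is currently blocked in another goroutine.
func send(conn net.Conn, payload []byte, wg *sync.WaitGroup) {
	defer wg.Done()
	_ = conn.SetWriteDeadline(time.Now().Add(3 * time.Second))
	_, _ = conn.Write(payload)
}

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:24224") // assumed Fluentd address
	if err != nil {
		return
	}
	defer conn.Close()

	var wg sync.WaitGroup
	wg.Add(3)
	go send(conn, []byte("msg-1"), &wg) // may be blocked (e.g. TCP backoff)...
	go send(conn, []byte("msg-2"), &wg) // ...while these two calls keep
	go send(conn, []byte("msg-3"), &wg) // moving its write deadline.
	wg.Wait()
}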

PR fluent#77 introduced a new parameter named ForceStopAsyncSend. It can be
used to tell the logger not to try to send all the log messages in its
buffer before closing. Without this parameter, the logger hangs
whenever it has logs to write and the target Fluentd server is down.

But this new parameter is not enough: the connection is currently lazily
initialized when the logger receives its first message. This blocks the
select that reads messages from the queue (until the connection is ready).
Moreover, the connection dialing uses an exponential back-off retry.

Because of that, the logger won't look for messages on the `stopRunning`
channel (the channel introduced by PR fluent#77), either because it's
blocked by the Sleep used for the retry or because the connection
dialing is waiting for the dialing timeout to expire (e.g. the TCP timeout).

To fix these edge cases, the time.Sleep() used for the back-off retry has
been transformed into a time.After(). Moreover, the dialer.Dial() call
used to connect to Fluentd has been replaced with dialer.DialContext(),
and a cancellable context is now used to stop the call to that method.
The associated cancel function is stored in Fluent and is called by
Close() when the ForceStopAsyncSend option is enabled.

Finally, the Fluent.run() method has been adapted to wait for both new
messages on f.pending and the stop signal sent on the f.stopRunning channel.
Previously, both channels were read independently, in a procedural
fashion.

This fix is motivated by the issue described in: moby/moby#40063.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
@akerouanton (Contributor, Author) commented:

I moved the body of the writer() anonymous function to write() and the retry logic to writeOrRetry(). Also, I replaced time.After() with time.NewTimer() plus a deferred Stop() call to ensure the underlying timer is released and can be garbage collected (cf. https://pkg.go.dev/time#After).
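
The time.After() to time.NewTimer() change roughly follows this pattern (a sketch with assumed names, not the PR's exact code); the deferred Stop() releases the timer even when the select exits through the stop channel:

package main

import (
	"fmt"
	"time"
)

// waitForRetry waits for the back-off delay, but can be interrupted by the
// stop channel. With time.After() the underlying timer would keep its
// resources until it fired; NewTimer plus a deferred Stop releases them as
// soon as the function returns.
func waitForRetry(delay time.Duration, stop <-chan struct{}) bool {
	timer := time.NewTimer(delay)
	defer timer.Stop()

	select {
	case <-timer.C:
		return true // back-off elapsed, the caller may retry
	case <-stop:
		return false // the logger is closing, abort the retry
	}
}

func main() {
	stop := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(stop) // simulate Close()
	}()
	fmt.Println("retry allowed:", waitForRetry(5*time.Second, stop))
}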

@tagomoris (Member) left a comment:
My 👀 couldn't find any clear problems with locking/consistency in this change. Great work!
I left some comments about comments and naming.

fluent/fluent.go Outdated
Comment on lines 544 to 543
// Here, we don't want to retry the write since connectOrRetry already
// retries Config.MaxRetry times to connect.
@tagomoris (Member) commented:
👍

Using deferred calls to unlock the muconn mutex ensures the mutex doesn't
stay locked if the library user has a panic-recovery mechanism in place
and f.connect(), f.conn.Write() or f.close() panics.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
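
For illustration, a minimal sketch of the deferred-unlock pattern this commit describes (only the muconn name comes from the message; the surrounding struct and method are assumptions):

package fluent

import (
	"net"
	"sync"
)

type Fluent struct {
	muconn sync.Mutex
	conn   net.Conn
}

// reconnect replaces the current connection. The deferred Unlock runs even if
// closing or dialing panics, so a caller that recovers from the panic is not
// left with muconn permanently locked.
func (f *Fluent) reconnect(addr string) error {
	f.muconn.Lock()
	defer f.muconn.Unlock()

	if f.conn != nil {
		f.conn.Close()
		f.conn = nil
	}
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	f.conn = conn
	return nil
}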
@tagomoris (Member) commented:

LGTM!
Let me merge this change and tag v1.7.0 to let users choose this fix only (or v1.6.3 for other recent changes).

@tagomoris merged commit f247034 into fluent:master on Oct 31, 2021
@tagomoris (Member) commented:

Done. Thank you for the great work!

@akerouanton deleted the stop-async-connect branch on October 31, 2021 08:51
akerouanton added a commit to akerouanton/docker that referenced this pull request Nov 1, 2021

Updates the fluent logger library to v1.7.0. The following PRs were merged
since the last bump:

* [Add callback for error handling when using
  async](fluent/fluent-logger-golang#97)
* [Fix panic when accessing unexported struct
  field](fluent/fluent-logger-golang#99)
* [Properly stop logger during (re)connect
  failure](fluent/fluent-logger-golang#82)

See https://github.com/fluent/fluent-logger-golang/compare/v1.6.1..v1.7.0

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
akerouanton added a commit to akerouanton/docker that referenced this pull request Dec 2, 2021

Updates the fluent logger library to v1.8.0. The following PRs/commits were
merged since the last bump:

* [Add callback for error handling when using
  async](fluent/fluent-logger-golang#97)
* [Fix panic when accessing unexported struct
  field](fluent/fluent-logger-golang#99)
* [Properly stop logger during (re)connect
  failure](fluent/fluent-logger-golang#82)
* [Support a TLS-enabled connection](fluent/fluent-logger-golang@e5d6aa1)

See https://github.com/fluent/fluent-logger-golang/compare/v1.6.1..v1.8.0

Signed-off-by: Albin Kerouanton <albinker@gmail.com>

The same v1.8.0 bump (with the same commit message) was also referenced by:

* docker-jenkins pushed a commit to docker-archive/docker-ce on Dec 2, 2021
* evol262 pushed a commit to evol262/moby on Jan 12, 2022
* PettitWesley added a commit to PettitWesley/moby on Jan 12, 2022 and pushed another on Jan 13, 2022
* lmbarros pushed commits to balena-os/balena-engine on Nov 22, 2022 (twice), Dec 8, 2022, Dec 21, 2022, and Feb 7, 2023