Properly stop logger during (re)connect failure #82

Merged
merged 3 commits into fluent:master from akerouanton:stop-async-connect on Oct 31, 2021

Conversation

@akerouanton (Contributor) commented Apr 22, 2020

PR #77 introduced a new parameter named ForceStopAsyncSend. It can be
used to tell the logger not to try to send all the log messages in its
buffer before closing. Without this parameter, the logger hangs
whenever it has logs to write and the target Fluentd server is down.

But this new parameter is not enough: the connection is currently lazily
initialized when the logger receives its first message. This blocks the
select that reads messages from the queue (until the connection is ready).
Moreover, the connection dialing uses an exponential back-off retry.

Because of that, the logger won't look for messages on the stopRunning
channel (the channel introduced by PR #77), either because it's
blocked by the Sleep used for the retry or because the connection
dialing is waiting for the dialing timeout to expire (e.g. the TCP timeout).

To fix these edge cases, the time.Sleep() used for the back-off retry has
been transformed into a time.After(), and a new stopAsyncConnect channel
has been introduced to interrupt it. Moreover, the dialer.Dial() call used
to connect to Fluentd has been replaced with dialer.DialContext(), and a
cancellable context is now used to stop the call to that method.

Finally, the Fluent.run() method has been adapted to wait for both new
messages on f.pending and the stop signal sent on the f.stopRunning channel.
Previously, both channels were read independently, in a procedural
fashion.

This fix is motivated by the issue described in: moby/moby#40063.
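
As a rough illustration of the approach (a sketch under assumed names, not the library's actual code), the back-off wait becomes a select over time.After() and a stop channel, so closing the logger interrupts the retry instead of waiting for a Sleep to finish:

package main

import (
	"errors"
	"fmt"
	"time"
)

// errIsClosing is an illustrative sentinel; the real logger may use a
// different error or mechanism.
var errIsClosing = errors.New("logger is closing")

// waitBeforeRetry waits for the back-off delay of the given attempt, but
// returns early with errIsClosing if the stop channel is closed meanwhile.
// A plain time.Sleep() could not be interrupted this way.
func waitBeforeRetry(attempt uint, stop <-chan struct{}) error {
	backoff := (100 * time.Millisecond) << attempt
	select {
	case <-time.After(backoff):
		return nil // delay elapsed, the caller may retry dialing
	case <-stop:
		return errIsClosing // the logger is closing, abort the retry loop
	}
}

func main() {
	stop := make(chan struct{})
	// Simulate Close() with ForceStopAsyncSend enabled after 200ms.
	go func() {
		time.Sleep(200 * time.Millisecond)
		close(stop)
	}()

	for attempt := uint(0); attempt < 5; attempt++ {
		if err := waitBeforeRetry(attempt, stop); err != nil {
			fmt.Println("stopped while waiting to retry:", err)
			return
		}
		fmt.Printf("would retry dialing now (attempt %d)\n", attempt+1)
	}
}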

@tagomoris (Member) commented:

The motivation is reasonable, so I'm positive about fixing this problem.
This kind of patch requires time to review, so I'll be able to review this change tomorrow or early next week.
But anyway, CI is now failing. @NiR- Could you take care of it?

@akerouanton force-pushed the stop-async-connect branch 3 times, most recently from 05f5f84 to 9978fb3, on April 23, 2020 17:38
@akerouanton (Contributor, Author) commented:

But anyway, CI is now failing. @NiR- Could you take care of it?

CI is fixed, but I'm still doing some manual testing to make sure everything is OK. I'll try to add more test cases 😉

@tagomoris (Member) commented:

@NiR- Great! Please let me know when you've finished that.

@akerouanton changed the title from "Try to asynchronously connect to Fluentd before writing" to "Properly stop logger during (re)connect failure" on May 21, 2020
@angulito commented Aug 3, 2020

Any news on this? I'm experiencing the same issue described in moby/moby#40063 with docker v19.03.6-ce. Thank you!

@akerouanton (Contributor, Author) commented:

@angulito We first want to merge #83, which introduces a new way of testing edge cases like the one identified and fixed by this PR.

The new test cases and the fix were first submitted in a single PR, but since the diff was quite large, I split it into two PRs to make it easier for the repo's maintainer(s) to review and accept these changes.

Hopefully this will be included in the next major Docker release. In the meantime, I created a logger plugin that includes this fix. It works like the original fluentd logger and is available at akerouanton/fluentd-async-logger.

@angulito commented Aug 3, 2020

Thank you for working hard on it! @akerouanton

@akerouanton (Contributor, Author) commented Nov 11, 2020

Hey @tagomoris, I took another look at this PR over the past few days and finally found out why the tests were failing on CI. I also overhauled the previous implementation to make the code substantially simpler.

CI was failing because of the tests with the RequestAck option enabled: the ack hashes contain a randomly generated uint and, despite the generator being implicitly seeded with 1 (as specified here), the ack hashes were not stable. I tried to explicitly seed the generator, but with no luck. Thus, I decided to introduce a new randomGenerator variable and initialize it with rand.Uint64. This variable is changed in tests to use a function that always returns 1. This way, we're sure the ack hashes are stable across runs and machines.
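
For illustration, a minimal sketch of that pattern (the newChunkID helper is hypothetical; only randomGenerator and rand.Uint64 come from the comment above):

package fluent

import "math/rand"

// randomGenerator produces the random component of ack chunk IDs.
// Production code keeps the default (rand.Uint64); tests override it with
//
//	randomGenerator = func() uint64 { return 1 }
//
// so the expected ack hashes stay stable across runs and machines.
var randomGenerator = rand.Uint64

// newChunkID is a hypothetical helper showing where the generator is used.
func newChunkID() uint64 {
	return randomGenerator()
}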

On the overhaul part, I decided to remove the newly introduced stopAsyncConnect channel and use a cancellable context instead. This context is passed from newWithDialer() to run() -> write() -> connectOrRetry() -> connect() -> DialContext(). This way, as soon as the context is cancelled, any in-progress dialing is stopped and the errIsClosing error is propagated up to run() by early-terminating the retriable code paths (connectOrRetry() and write()). Moreover, any subsequent call to write() will fail the same way (for instance if a log is dispatched through f.pending before the closing of f.stopRunning is dispatched), making the code more robust to channel dispatching order in run(). I also removed the drainEvents condition in run() since returning from that function and closing f.pending is already enough to discard queued messages.
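
A simplified sketch of that control flow (the struct and method signatures are assumptions; the real chain goes through run() as described above): once the context is cancelled, DialContext() returns immediately and the retriable paths short-circuit with errIsClosing:

package fluent

import (
	"context"
	"errors"
	"net"
)

var errIsClosing = errors.New("fluent logger is closing")

type logger struct {
	ctx    context.Context // cancelled when the logger is closed
	cancel context.CancelFunc
	conn   net.Conn
	dialer net.Dialer
}

func newLogger() *logger {
	ctx, cancel := context.WithCancel(context.Background())
	return &logger{ctx: ctx, cancel: cancel}
}

// Close cancels the context, which immediately unblocks any in-progress
// DialContext call; subsequent writes then fail with errIsClosing.
func (l *logger) Close() {
	l.cancel()
	if l.conn != nil {
		l.conn.Close()
	}
}

// connectOrRetry is heavily simplified (no back-off here): the point is that
// a cancelled context turns any dial error into errIsClosing, so callers
// stop retrying early.
func (l *logger) connectOrRetry(addr string) error {
	conn, err := l.dialer.DialContext(l.ctx, "tcp", addr)
	if err != nil {
		if l.ctx.Err() != nil {
			return errIsClosing
		}
		return err
	}
	l.conn = conn
	return nil
}

// write fails the same way for every call made after cancellation, whatever
// the order in which pending messages and the stop signal are served.
func (l *logger) write(addr string, msg []byte) error {
	if l.ctx.Err() != nil {
		return errIsClosing
	}
	if l.conn == nil {
		if err := l.connectOrRetry(addr); err != nil {
			return err
		}
	}
	_, err := l.conn.Write(msg)
	return err
}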

This PR is ready to be re-reviewed 🙂

@tagomoris (Member) commented:

I'll try to find time to review this change this week or the next.

fluent/fluent.go Outdated
// Unfortunately, using a seeded random number generator isn't enough to ensure
// ack hashes are deterministically generated during tests, thus we need to
// change randomGenerator value to ensure the hashes are stable during tests
// such that we can expect how the logger behaves with RequestAck option enabled.
@tagomoris (Member) commented:
How about replacing this comment with a short one like "this is a 'var' so the random generator can be replaced in tests", and writing the content of the above comment in the test file?
The content is a little complex, and it's almost only about testing.

@tagomoris (Member) commented:

I took a glance at this, and need more time.

@tagomoris (Member) commented:

I reviewed the updated change. The clear problem is that this code acquires the connection lock while writing messages and while waiting for the ack message. This causes terrible performance degradation if the ack option is enabled. Even without the ack option, the critical section for writing messages could cause performance problems.

And I found another problem (not introduced by this change): arriving ack messages may be out of order, but the current implementation doesn't consider such cases. 🤔

@jchanam commented Mar 22, 2021

Hi!

I'd like to know if you were able to move forward with this issue. We're still facing the problem described in moby/moby#40063.

Thanks!

@rothn commented May 17, 2021

We are having this issue as well. Do we know if there's anything preventing a merge?

@hakashya commented:

Hey @tagomoris,
To confirm, the problems that are preventing the merge are:

  1. Message writing critical section causing performance degradation.
  2. Unordered ack messages not being handled.

Hey @akerouanton,
I am just wondering whether you are still working on it?

@tagomoris (Member) commented:

@hakashya Unordered ack messages are not a problem introduced by this change. Of course, it would be good if we can solve that too.

@PettitWesley commented Sep 28, 2021

@akerouanton @tagomoris Hi, I'm Wesley, from the AWS containers team, and I'm also the AWS maintainer of fluent/fluent-bit. We need to get this change out since it fixes the linked Moby/Docker issue.

My proposal is that I will take over this fix. I will preserve @akerouanton's commit so that they get credit, and then add commits on top of it as needed to get this PR to a mergeable state.

To confirm, the issues blocking merge are:

  1. Rebase.
  2. "Message writing critical section causing performance degradation." I need to read through this code and try to understand it... can someone point me to the piece of the code where this issue is?

If I fix those, can we merge it?

I will begin work on this on Wednesday September 29th.

@tagomoris (Member) commented:

I'm totally OK with merging a change written by 2 contributors (it's almost the same as 2 independent pull requests) if it solves the problems.

@akerouanton (Contributor, Author) commented:

About the overview of the change: I'm not a big fan of long anonymous functions (writer in write()). In my opinion, it hurts readability.

I can rename the current write() method to writeOrRetry() and move the content of the writer closure into a new write() method if you want, like what I did with connect() and connectOrRetry() 🙂

@akerouanton (Contributor, Author) commented Oct 22, 2021

Btw, I did not move f.conn.SetWriteDeadline() into the connect() method as I proposed in the patch I posted as a comment 2 days ago. However, setting the deadline in the write() method could be the source of a bug in sync mode. For instance:

  • A 1st goroutine sends a message which takes a long time to send (e.g. TCP exponential backoff got triggered).
  • A 2nd goroutine sends a message while the 1st message is still in flight and sets a new write deadline, impacting the deadline of the 1st message.
  • A 3rd goroutine sends a message and sets a new deadline, again impacting the previous in-flight messages.
  • etc.

This is due to how SetWriteDeadline() works:

// SetWriteDeadline sets the deadline for future Write calls
// and any currently-blocked Write call.
// Even if write times out, it may return n > 0, indicating that
// some of the data was successfully written.
// A zero value for t means Write will not time out.
SetWriteDeadline(t time.Time) error

Of course this is limited to use cases where many goroutines are used to send logs, but I prefer to point it out.
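
A contrived sketch of that scenario (not code from this PR, and the address is an assumption): each goroutine resets the write deadline of the shared connection, which also moves the deadline of a Write() still blocked in another goroutine:

package main

import (
	"net"
	"sync"
	"time"
)

// send sets a per-message deadline and writes the payload. Because the
// deadline belongs to the connection, not to this call, it also applies to
// any Write that is currently blocked in another goroutine.
func send(conn net.Conn, payload []byte, wg *sync.WaitGroup) {
	defer wg.Done()
	_ = conn.SetWriteDeadline(time.Now().Add(3 * time.Second))
	_, _ = conn.Write(payload)
}

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:24224") // assumed Fluentd address
	if err != nil {
		return
	}
	defer conn.Close()

	var wg sync.WaitGroup
	wg.Add(3)
	go send(conn, []byte("msg-1"), &wg) // may be blocked (e.g. TCP backoff)...
	go send(conn, []byte("msg-2"), &wg) // ...while these two calls keep
	go send(conn, []byte("msg-3"), &wg) // moving its write deadline.
	wg.Wait()
}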

PR fluent#77 introduced a new parameter named ForceStopAsyncSend. It can be
used to tell the logger not to try to send all the log messages in its
buffer before closing. Without this parameter, the logger hangs
whenever it has logs to write and the target Fluentd server is down.

But this new parameter is not enough: the connection is currently lazily
initialized when the logger receives its first message. This blocks the
select that reads messages from the queue (until the connection is ready).
Moreover, the connection dialing uses an exponential back-off retry.

Because of that, the logger won't look for messages on the `stopRunning`
channel (the channel introduced by PR fluent#77), either because it's
blocked by the Sleep used for the retry or because the connection
dialing is waiting for the dialing timeout to expire (e.g. the TCP timeout).

To fix these edge cases, the time.Sleep() used for the back-off retry has
been transformed into a time.After(). Moreover, the dialer.Dial() call
used to connect to Fluentd has been replaced with dialer.DialContext(),
and a cancellable context is now used to stop the call to that method.
The associated cancel function is stored in Fluent and is called by
Close() when the ForceStopAsyncSend option is enabled.

Finally, the Fluent.run() method has been adapted to wait for both new
messages on f.pending and the stop signal sent on the f.stopRunning channel.
Previously, both channels were read independently, in a procedural
fashion.

This fix is motivated by the issue described in: moby/moby#40063.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
@akerouanton (Contributor, Author) commented:

I moved the body of the writer() anonymous function to write() and the retry logic to writeOrRetry(). Also, I replaced time.After() with time.NewTimer() plus a deferred Stop() call to ensure the underlying timer is released and can be garbage collected (cf. https://pkg.go.dev/time#After).
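
The time.After() to time.NewTimer() change roughly follows this pattern (a sketch with assumed names, not the PR's exact code); the deferred Stop() releases the timer even when the select exits through the stop channel:

package main

import (
	"fmt"
	"time"
)

// waitForRetry waits for the back-off delay, but can be interrupted by the
// stop channel. With time.After() the underlying timer would keep its
// resources until it fired; NewTimer plus a deferred Stop releases them as
// soon as the function returns.
func waitForRetry(delay time.Duration, stop <-chan struct{}) bool {
	timer := time.NewTimer(delay)
	defer timer.Stop()

	select {
	case <-timer.C:
		return true // back-off elapsed, the caller may retry
	case <-stop:
		return false // the logger is closing, abort the retry
	}
}

func main() {
	stop := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(stop) // simulate Close()
	}()
	fmt.Println("retry allowed:", waitForRetry(5*time.Second, stop))
}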

@tagomoris (Member) left a comment:
My 👀 couldn't find any clear problems with locking/consistency in this change. Great work!
I left some comments about comments and naming.

fluent/fluent.go Outdated
Comment on lines 544 to 543
// Here, we don't want to retry the write since connectOrRetry already
// retries Config.MaxRetry times to connect.
@tagomoris (Member) commented:
👍

Using deferred calls to unlock the muconn mutex ensures the mutex doesn't
stay locked if the library user has a panic-recovery mechanism in place
and f.connect(), f.conn.Write() or f.close() panics.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
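
For illustration, a minimal sketch of the deferred-unlock pattern this commit describes (only the muconn name comes from the message; the surrounding struct and method are assumptions):

package fluent

import (
	"net"
	"sync"
)

type Fluent struct {
	muconn sync.Mutex
	conn   net.Conn
}

// reconnect replaces the current connection. The deferred Unlock runs even if
// closing or dialing panics, so a caller that recovers from the panic is not
// left with muconn permanently locked.
func (f *Fluent) reconnect(addr string) error {
	f.muconn.Lock()
	defer f.muconn.Unlock()

	if f.conn != nil {
		f.conn.Close()
		f.conn = nil
	}
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	f.conn = conn
	return nil
}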
@tagomoris (Member) commented:

LGTM!
Let me merge this change and tag v1.7.0 to let users choose this fix only (or v1.6.3 for other recent changes).

@tagomoris merged commit f247034 into fluent:master on Oct 31, 2021
@tagomoris (Member) commented:

Done. Thank you for the great work!

@akerouanton deleted the stop-async-connect branch on October 31, 2021 08:51
akerouanton added a commit to akerouanton/docker that referenced this pull request Nov 1, 2021

Updates the fluent logger library to v1.7.0. The following PRs were merged
since the last bump:

* [Add callback for error handling when using
  async](fluent/fluent-logger-golang#97)
* [Fix panic when accessing unexported struct
  field](fluent/fluent-logger-golang#99)
* [Properly stop logger during (re)connect
  failure](fluent/fluent-logger-golang#82)

See https://github.com/fluent/fluent-logger-golang/compare/v1.6.1..v1.7.0

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
akerouanton added a commit to akerouanton/docker that referenced this pull request Dec 2, 2021

Updates the fluent logger library to v1.8.0. The following PRs/commits were
merged since the last bump:

* [Add callback for error handling when using
  async](fluent/fluent-logger-golang#97)
* [Fix panic when accessing unexported struct
  field](fluent/fluent-logger-golang#99)
* [Properly stop logger during (re)connect
  failure](fluent/fluent-logger-golang#82)
* [Support a TLS-enabled connection](fluent/fluent-logger-golang@e5d6aa1)

See https://github.com/fluent/fluent-logger-golang/compare/v1.6.1..v1.8.0

Signed-off-by: Albin Kerouanton <albinker@gmail.com>

The same v1.8.0 bump (with the same commit message) was also referenced by:

* docker-jenkins pushed a commit to docker-archive/docker-ce on Dec 2, 2021
* evol262 pushed a commit to evol262/moby on Jan 12, 2022
* PettitWesley added a commit to PettitWesley/moby on Jan 12, 2022 and pushed another on Jan 13, 2022
* lmbarros pushed commits to balena-os/balena-engine on Nov 22, 2022 (twice), Dec 8, 2022, Dec 21, 2022, and Feb 7, 2023