Add syslog batching implementation #491

nicklas-dohrn · 2024-02-13T08:36:30Z

Description

This is our proposal to implement syslog batching for sending logs via HTTPS.
It includes a switch between the normal one-log-per-request syslog mode and the batching mode, controlled by a query parameter on the drain URL.
Batching is enabled with the query parameter batching=true:

https://<your-drain-url>/syslog?batching=true

If you enable the syslog batching behaviour, it currently writes syslog batches in which the single messages are newline delimited (\n).
Currently, the batch size is hardwired to around 256 KB, which is already sufficient to speed up throughput by at least a factor of 10.
Making it configurable would be an option, but I have not seen the need so far.
Please let me know what you think of the current approach.
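
As an illustration of the behaviour described above, here is a minimal sketch, not the code from this PR. It assumes the switch is read from the drain URL's query string and that messages already end in a newline. The names batchingEnabled, buildBatches and maxBatchBytes, as well as the host drain.example.com, are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"net/url"
)

// maxBatchBytes mirrors the roughly 256 KB limit from the description
// (hypothetical name, chosen for this sketch).
const maxBatchBytes = 256 * 1024

// batchingEnabled reports whether the drain URL carries batching=true.
func batchingEnabled(drain string) (bool, error) {
	u, err := url.Parse(drain)
	if err != nil {
		return false, err
	}
	return u.Query().Get("batching") == "true", nil
}

// buildBatches packs newline-terminated messages into payloads of at most
// maxBytes each. A single message larger than maxBytes gets its own payload.
func buildBatches(msgs [][]byte, maxBytes int) [][]byte {
	var batches [][]byte
	var buf bytes.Buffer
	for _, m := range msgs {
		// Flush first if this message would push the batch over the limit.
		if buf.Len() > 0 && buf.Len()+len(m) > maxBytes {
			batches = append(batches, append([]byte(nil), buf.Bytes()...))
			buf.Reset()
		}
		buf.Write(m)
	}
	if buf.Len() > 0 {
		batches = append(batches, append([]byte(nil), buf.Bytes()...))
	}
	return batches
}

func main() {
	on, _ := batchingEnabled("https://drain.example.com/syslog?batching=true")
	batches := buildBatches([][]byte{[]byte("a\n"), []byte("b\n")}, maxBatchBytes)
	fmt.Printf("batching=%v batches=%d first=%q\n", on, len(batches), batches[0])
}
```

Each payload is copied out of the buffer before the buffer is reset, so earlier batches are not overwritten by later writes.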

linux-foundation-easycla · 2024-02-13T08:36:34Z

The committers listed above are authorized under a signed CLA.

✅ login: nicklas-dohrn / name: Nicklas Dohrn (2d2556e, 5170ba2, 52fbfbe, f206257, 35104c5, 48a13d8, 519dda6, 2fbae37, 3c9934f, 21666c8)

ctlong

In general, it looks fine to me. I don't see where it adds the newline character to delimit between syslog lines though...

Would love to see a demo at the next ARP WG meeting.

(I sort of disregarded that this was a POC at points in time and some of my comments are more implementation-focused, sorry about that 😅 )

src/pkg/egress/syslog/https.go

nicklas-dohrn · 2024-04-11T07:59:39Z

The newline is already part of the syslog messages; it is added by a method beforehand (linked for anyone curious):
msg := appendNewline(removeNulls(env.GetLog().Payload))

This is true for all possible syslog messages, so I do not even need to add it myself, which is really convenient.
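
For readers without the linked code at hand, the following sketch shows how helpers with these two names could behave. The bodies are an assumption made for illustration and are not copied from the repository. It also shows why a batch built from such messages needs no extra delimiter.

```go
package main

import (
	"bytes"
	"fmt"
)

// removeNulls drops NUL bytes from a payload (assumed behaviour).
func removeNulls(msg []byte) []byte {
	return bytes.ReplaceAll(msg, []byte{0}, nil)
}

// appendNewline makes sure the payload ends in exactly the newline it
// already has, or adds one if it is missing (assumed behaviour).
func appendNewline(msg []byte) []byte {
	if bytes.HasSuffix(msg, []byte("\n")) {
		return msg
	}
	return append(msg, '\n')
}

func main() {
	var batch bytes.Buffer
	for _, payload := range [][]byte{[]byte("first"), []byte("sec\x00ond\n")} {
		// Every message arrives newline-terminated, so writing them back to
		// back already yields a newline-delimited batch.
		batch.Write(appendNewline(removeNulls(payload)))
	}
	fmt.Printf("%q\n", batch.String())
}
```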

@ctlong

This addresses all the comments by @ctlong. It removes the unneeded if/else branching for adding to the message batch. It fixes the egressMetric to behave like the single-message implementation, so that erroneous logs are not counted.

nicklas-dohrn · 2024-04-11T08:29:17Z

Addressed all the comments and additions by @ctlong above. If sufficient, please close the threads :)

The refactor is mainly reshuffling. The new timer implementation makes it clearer what the actual logic is, and might also prevent some unresolvable states. It now only has two states:
- Running, if a batch is not yet full or time triggered
- Not running, if a batch was sent through either a time-based or a size-based trigger

ctlong

I've still some specific concerns, which I've left as comments in this review.

In general, the implementation looks fine, though I'm not sure that I understand the necessity of the new TriggerTimer struct.

src/pkg/egress/syslog/triggerTimer.go

src/pkg/egress/syslog/https.go

ctlong · 2024-04-17T17:29:41Z

@nicklas-dohrn can you please sign the CLA. We can't merge this unless you've done so.

chombium · 2024-04-17T18:52:56Z

@ctlong I will take care of the CLA. @nicklas-dohrn has to be added to one of our GitHub orgs.

This is a new approach to switching between http and http batching. It differs from the previous attempts only in this regard; apart from this change it contains only refactorings.

ctlong

Conceptually, I think this proof of concept is correct. Implementation-wise the timer still has some issues.

Once those are fixed, I would suggest rebasing this off #573 and testing the two changes together to see if it achieves the throughput you want. Then we're all ready for a real implementation (with tests).

🙏 Could you please also update the PR description, thanks.

src/pkg/egress/syslog/https_batch.go

ctlong · 2024-06-17T23:43:52Z

src/pkg/egress/syslog/https_batch.go

+
+const BATCHSIZE = 256 * 1024
+
+type HTTPSBatchWriter struct {


We understand the goal of the HTTPSBatchWriter to be:

Buffer & batch incoming envelopes.

If the batch reaches a certain size, flush the batch to a destination.

On some interval, flush the batch to a destination.

❓ Is that right?

Currently, HTTPSBatchWriter and TriggerTimer don't appear to do that. We think that code actually results in the following behaviour:

Add one message to the batch and sleep for some interval, then flush to a destination.

After that, add messages to the batch as they come in.

If the batch reaches a certain size, flush the batch to a destination.

➡️ Can you please have a look at adjusting this code.

Here are some examples of ways we've done batching in the past:

https://github.com/cloudfoundry/loggregator-agent-release/blob/main/src/pkg/otelcolclient/signal_batcher.go

https://github.com/cloudfoundry/go-loggregator/blob/main/ingress_client.go#L484-L525

First of all:
Yes, your understanding of what we are trying to do is right.

I have pushed a newer version that should comply with the envisioned behaviour.
Tests for the new version have also been added, confirming that the implementation complies with the wanted behaviour.
I am currently looking through your pointers to see if I can leverage some of the implementations shown there.

Regarding the code proposed:
The implementation for the signal batcher uses slices and an append structure.
It knows how long the batch is going to be, which speeds up the execution.
In contrast, the syslog batching feature does not know beforehand how long batches are going to be,
which gives bytes.Buffer the edge in speed:
https://stackoverflow.com/questions/39319024/builtin-append-vs-bytes-buffer-write

Also, the actual buffering code is pretty short, and the expected type for sending data with the http client used is []byte, so the choice still seems obvious to me.

ctlong · 2024-06-17T23:46:31Z

src/pkg/egress/syslog/https_batch.go

+	egrMsgCount  float64
+}
+
+func NewHTTPSBatchWriter(


Can you please add tests for this writer.

Some tests have already been added. Do we need tests for the things already done by the httpsWriter (error handling and the like, which is the same anyway)?

ctlong · 2024-06-17T23:49:39Z

src/pkg/egress/syslog/https_batch.go

+	c *Converter,
+) egress.WriteCloser {
+	client := httpClient(netConf, tlsConf)
+	binding.URL.Scheme = "https" // reset the scheme for usage to a valid http scheme


What's the purpose of changing the scheme here?

The scheme is used to differentiate between the different endpoints (https and https-batched).
If I do not change it back to https for sends, the queried URL will be https-batched://..., which simply does not work.

Does mutating the scheme here affect the metrics emitted?

I will check for the metrics being changed here.

src/pkg/egress/syslog/https_batch.go

acrmp · 2024-06-22T02:55:08Z

src/pkg/egress/syslog/https_batch.go

+	c *Converter,
+) egress.WriteCloser {
+	client := httpClient(netConf, tlsConf)
+	binding.URL.Scheme = "https" // reset the scheme for usage to a valid http scheme


Does mutating the scheme here affect the metrics emitted?

acrmp · 2024-06-22T02:58:05Z

src/pkg/egress/syslog/https_batch_test.go

+)
+
+var triggered = 0
+var string_to_1024_chars = "saljdflajsdssdfsdfljkfkajafjajlköflkjöjaklgljksdjlakljkflkjweljklkwjejlkfekljwlkjefjklwjklsdajkljklwerlkaskldgjksakjekjwrjkljasdjkgfkljwejklrkjlklasdkjlsadjlfjlkadfljkajklsdfjklslkdfjkllkjasdjkflsdlakfjklasldfkjlasdjfkjlsadlfjklaljsafjlslkjawjklerkjljklasjkdfjklwerjljalsdjkflwerjlkwejlkarjklalkklfsdjlfhkjsdfkhsewhkjjasdjfkhwkejrkjahjefkhkasdjhfkashfkjwehfkksadfjaskfkhjdshjfhewkjhasdfjdajskfjwehkfajkankaskjdfasdjhfkkjhjjkasdfjhkjahksdf"


I find the name here a little confusing. The string is ~440 bytes in length.

Yes, this might be confusing.
I just did it that way to avoid putting in too many characters to reach the 1024.
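
A test string of an exact length can also be generated instead of typed out. The sketch below uses strings.Repeat from the standard library. The helper name stringOfLen is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// stringOfLen returns an ASCII string that is exactly n bytes long.
func stringOfLen(n int) string {
	return strings.Repeat("a", n)
}

func main() {
	s := stringOfLen(1024)
	fmt.Println(len(s))
}
```

With a generated string, the variable name and the actual length cannot drift apart.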

src/pkg/egress/syslog/https_batch_test.go

acrmp · 2024-06-22T03:01:04Z

src/pkg/egress/syslog/https_batch_test.go

+			time.Sleep(99 * time.Millisecond)
+		}
+		time.Sleep(100 * time.Millisecond)
+		Expect(drain.messages).To(HaveLen(10))


In general in these tests can we use Eventually rather than relying on timing?

These are not final yet.
This test in particular is meant to check that the time window is exactly 1s, as defined in the class, and not something else.
Rereading it right now, it does not test exactly that, but an Eventually would not test whether the defined timings are adhered to.

src/pkg/egress/syslog/syslog_connector.go

src/pkg/egress/syslog/https_batch.go

nicklas-dohrn · 2024-06-26T04:57:22Z

I reimplemented the changes using a similar approach to what @ctlong proposed.
This makes the change much more concise, and also gets rid of all the data races reported by go test -race.
There are some race complaints left, but these only refer to the implementation of the tests themselves, which inherently are data races by design.
@ctlong and @acrmp, if your concerns above are addressed, please close everything that is done, so we can keep this organised.

nicklas-dohrn · 2024-08-05T10:45:47Z

I did some elaborate testing of the current and new approach for syslog batching, sending from our dev cf landscape with 4 diego cells and 4 loggregator agents to a cls instance.
I also tested #573 (HTTPS drains reuse and release fasthttp);
there were no major throughput improvements to be seen, and the cpu consumption of both the new and the old version is minimal because they are network bound.
There is some good news concerning batching: it considerably speeds up throughput and also reduces drops.
(see attached table for information)
"Concurrent" refers to a version of the https drain which I modified with a goroutine to allow usage of more than one cpu.
The current approach would increase the throughput considerably, but it results in a new issue:
If multiple applications on one diego cell bind to a cls instance, they share the throughput constraints of one cpu (I tested that this is indeed the behaviour).
This leads to bottleneck issues on bigger cf landscapes, as these use much bigger diego cells and consequently bigger loggregator instances with more cpu cores, so this approach does not scale at all.
@ctlong and @acrmp,
I would like to hear your thoughts on this issue, and how we can proceed with making this work at scale.

juergen-walter · 2024-08-05T11:12:53Z

@ctlong and @acrmp, we have many customers suffering from and complaining about log drops. We would highly appreciate it if this PR could be finalized and merged in a timely manner. We appreciate your efforts so far. Best regards.

chombium

Generally it looks fine, I found two little things.

I will wait on @ctlong for his review

src/pkg/egress/syslog/https_batch_test.go

src/pkg/egress/syslog/https_test.go

Add syslog batching poc implementation

2d2556e

Add switch functionality for bindings

5170ba2

nicklas-dohrn mentioned this pull request Mar 27, 2024

Allow to send multiple log messages in a single HTTP request #332

Open

nicklas-dohrn marked this pull request as ready for review April 4, 2024 05:19

nicklas-dohrn requested a review from a team as a code owner April 4, 2024 05:19

ctlong requested changes Apr 9, 2024

src/pkg/egress/syslog/https.go Outdated

src/pkg/egress/syslog/https.go Outdated

src/pkg/egress/syslog/https.go Outdated

src/pkg/egress/syslog/https.go Outdated

acrmp reviewed Apr 10, 2024

src/pkg/egress/syslog/https.go Outdated

nicklas-dohrn requested a review from ctlong April 11, 2024 08:28

ctlong requested changes Apr 17, 2024

src/pkg/egress/syslog/triggerTimer.go Outdated

src/pkg/egress/syslog/https.go Outdated

Refactor approach to use different protocol instead of parameter.

35104c5

This is a new approach to switching between http and http batching. It differs from the previous attempts only in this regard; apart from this change it contains only refactorings.

ctlong reviewed May 7, 2024

src/pkg/egress/syslog/https_batch.go Outdated

src/pkg/egress/syslog/https_batch.go

src/pkg/egress/syslog/https_batch.go Outdated

nicklas-dohrn force-pushed the main branch 2 times, most recently from fe08849 to c937231 Compare May 12, 2024 06:24

Fix trigger timer issues

21666c8

nicklas-dohrn force-pushed the main branch from c937231 to 21666c8 Compare May 12, 2024 06:32

ctlong requested changes Jun 17, 2024


nicklas-dohrn requested a review from ctlong June 20, 2024 04:47

Add tests and fix test related issues

3c9934f

nicklas-dohrn force-pushed the main branch from 2474ffb to 3c9934f Compare June 20, 2024 05:25

acrmp requested changes Jun 22, 2024


Change batch dispatch implementation

519dda6

nicklas-dohrn force-pushed the main branch from 3073a80 to 519dda6 Compare June 26, 2024 04:49

nicklas-dohrn requested a review from acrmp June 26, 2024 04:57

Add https-batch to allowed formats

52fbfbe

nicklas-dohrn force-pushed the main branch from 3a62859 to 52fbfbe Compare July 17, 2024 15:00

nicklas-dohrn changed the title ~~Add syslog batching poc implementation~~ Add syslog batching implementation Aug 5, 2024

chombium requested changes Aug 15, 2024

src/pkg/egress/syslog/https_batch_test.go Outdated

src/pkg/egress/syslog/https_test.go Outdated

Fix remarks from @chombium

2fbae37

nicklas-dohrn requested a review from chombium August 21, 2024 05:02


