Add syslog batching implementation #491

Open
wants to merge 10 commits into
base: main
from

Conversation

@nicklas-dohrn nicklas-dohrn commented Feb 13, 2024

Description

This is our proposal to implement syslog batching for sending logs via HTTPS.
It includes a switch between batching and the normal one-log-per-request syslog mode, controlled by a query parameter on the drain URL.
Batching is enabled with the query parameter batching=true:

https://<your-drain-url>/syslog?batching=true

If you enable syslog batching, it currently writes syslog batches in which the individual messages are newline-delimited (\n).
Currently, the batch size is hardwired to around 256 KB, which is already sufficient to speed up throughput by a factor of at least 10.
Making it configurable would be an option, but I have not seen the need so far.
Please let me know what you think of the current approach.
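
For illustration only, a minimal sketch of how a drain URL could be checked for the batching=true parameter described above. The helper name batchingEnabled and the example host are assumptions made for this sketch; they are not taken from the PR's code.

```go
package main

import (
	"fmt"
	"net/url"
)

// batchingEnabled reports whether a drain URL asks for batched delivery.
// The function name is hypothetical; only the "batching=true" query
// parameter comes from the PR description.
func batchingEnabled(drainURL string) (bool, error) {
	u, err := url.Parse(drainURL)
	if err != nil {
		return false, err
	}
	return u.Query().Get("batching") == "true", nil
}

func main() {
	for _, raw := range []string{
		"https://drain.example.com/syslog?batching=true",
		"https://drain.example.com/syslog",
	} {
		on, err := batchingEnabled(raw)
		fmt.Println(raw, on, err)
	}
}
```

Any value other than the literal "true" leaves the drain in the one-log-per-request mode in this sketch.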

linux-foundation-easycla bot commented Feb 13, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@nicklas-dohrn nicklas-dohrn marked this pull request as ready for review April 4, 2024 05:19
@nicklas-dohrn nicklas-dohrn requested a review from a team as a code owner April 4, 2024 05:19
Member

@ctlong ctlong left a comment

In general, it looks fine to me. I don't see where it adds the newline character to delimit between syslog lines though...

Would love to see a demo at the next ARP WG meeting.

(I sort of disregarded that this was a POC at points in time and some of my comments are more implementation-focused, sorry about that 😅 )

src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
@nicklas-dohrn
Author

The newline is already part of the syslog messages; it is added beforehand by another method (linked for anyone curious):
msg := appendNewline(removeNulls(env.GetLog().Payload))

This is true for all possible syslog messages, so I do not even need to add it myself, which is really convenient.
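
A minimal sketch of why messages that already end in a newline can simply be concatenated into a batch. ensureNewline is a hypothetical stand-in for the appendNewline helper quoted above (its real implementation is not shown in this thread), and joinBatch and splitBatch are illustrative names. The sketch assumes messages contain no embedded newlines.

```go
package main

import (
	"bytes"
	"fmt"
)

// ensureNewline is a hypothetical stand-in for appendNewline: it returns
// a copy of msg that ends in "\n", or msg itself if it already does.
func ensureNewline(msg []byte) []byte {
	if bytes.HasSuffix(msg, []byte("\n")) {
		return msg
	}
	out := make([]byte, 0, len(msg)+1)
	out = append(out, msg...)
	return append(out, '\n')
}

// joinBatch concatenates messages; because each one ends in "\n",
// the result is a newline-delimited batch with no extra separator logic.
func joinBatch(msgs [][]byte) []byte {
	var buf bytes.Buffer
	for _, m := range msgs {
		buf.Write(ensureNewline(m))
	}
	return buf.Bytes()
}

// splitBatch shows how a receiver could recover the individual messages.
func splitBatch(batch []byte) [][]byte {
	if len(batch) == 0 {
		return nil
	}
	return bytes.Split(bytes.TrimSuffix(batch, []byte("\n")), []byte("\n"))
}

func main() {
	batch := joinBatch([][]byte{[]byte("first"), []byte("second\n")})
	fmt.Printf("%q\n", batch)
	for _, m := range splitBatch(batch) {
		fmt.Printf("%q\n", m)
	}
}
```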

This addresses all the comments by @ctlong.
It removes the unneeded if/else branching for adding to the message
batch.
It fixes the egressMetric to behave like the single-message
implementation, so that erroneous logs are not counted.
@nicklas-dohrn
Author

Addressed all the comments and additions by @ctlong above. If sufficient, please close the threads :)

The refactor is mainly reshuffling

The new timer implementation makes the actual logic clearer
and might also prevent some unresolvable states.
It now only has two states:
- Running if a batch is not yet full or time triggered
- Not running if a batch was sent through either a time-based or a
size-based trigger
Member

@ctlong ctlong left a comment

I've still some specific concerns, which I've left as comments in this review.

In general, the implementation looks fine, though I'm not sure that I understand the necessity of the new TriggerTimer struct.

src/pkg/egress/syslog/triggerTimer.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
@ctlong
Member

ctlong commented Apr 17, 2024

@nicklas-dohrn can you please sign the CLA. We can't merge this unless you've done so.

@chombium
Contributor

@ctlong I will take care of the CLA. @nicklas-dohrn has to be added to one of our GitHub orgs.

This is a new approach to switch between http and http batching.
It differs from the previous attempts only in this regard;
apart from this change it contains only refactorings.
Member

@ctlong ctlong left a comment

Conceptually, I think this proof of concept is correct. Implementation-wise the timer still has some issues.

Once those are fixed, I would suggest rebasing this off #573 and testing the two changes together to see if it achieves the throughput you want. Then we're all ready for a real implementation (with tests).

🙏 Could you please also update the PR description, thanks.

src/pkg/egress/syslog/https_batch.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Outdated Show resolved Hide resolved

const BATCHSIZE = 256 * 1024

type HTTPSBatchWriter struct {
Member

We understand the goal of the HTTPSBatchWriter to be:

  • Buffer & batch incoming envelopes.
  • If the batch reaches a certain size, flush the batch to a destination.
  • On some interval, flush the batch to a destination.

❓ Is that right?

Currently, HTTPSBatchWriter and TriggerTimer don't appear to do that. We think the code actually results in the following behaviour:

  • Add one message to the batch and sleep for some interval, then flush to a destination.
  • After that, add messages to the batch as they come in.
  • If the batch reaches a certain size, flush the batch to a destination.

➡️ Can you please have a look at adjusting this code.

Here are some examples of ways we've done batching in the past:

Author

First thing:
Yes, your understanding of what we are trying to do seems right.

I have pushed a newer version that should comply with the envisioned behaviour.
Tests for the new version have also been added, confirming that the implementation complies with the intended behaviour.
I am currently looking through your pointers to see if I can leverage some of the implementations shown there.
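
The behaviour agreed on in this thread (buffer incoming messages, flush when the batch reaches a size limit, flush on an interval) can be sketched roughly as follows. This is a hypothetical illustration, not the HTTPSBatchWriter from the PR: the batcher type, its fields, and the flush callback are all assumptions made for the sketch.

```go
package main

import (
	"bytes"
	"fmt"
	"time"
)

// batcher is a hypothetical sketch of size- and interval-based batching.
type batcher struct {
	in       chan []byte
	done     chan struct{}
	maxBytes int
	interval time.Duration
	flush    func([]byte) // called with each completed batch
}

func newBatcher(maxBytes int, interval time.Duration, flush func([]byte)) *batcher {
	b := &batcher{
		in:       make(chan []byte),
		done:     make(chan struct{}),
		maxBytes: maxBytes,
		interval: interval,
		flush:    flush,
	}
	go b.run()
	return b
}

// run owns the buffer, so no locking is needed around it.
func (b *batcher) run() {
	defer close(b.done)
	var buf bytes.Buffer
	ticker := time.NewTicker(b.interval)
	defer ticker.Stop()

	send := func() {
		if buf.Len() == 0 {
			return // nothing buffered, skip empty flushes
		}
		out := make([]byte, buf.Len())
		copy(out, buf.Bytes()) // copy, because Reset reuses the memory
		buf.Reset()
		b.flush(out)
	}

	for {
		select {
		case msg, ok := <-b.in:
			if !ok {
				send() // flush what is left on Close
				return
			}
			buf.Write(msg)
			if buf.Len() >= b.maxBytes {
				send()
				ticker.Reset(b.interval) // restart the interval after a size flush
			}
		case <-ticker.C:
			send()
		}
	}
}

func (b *batcher) Write(msg []byte) { b.in <- msg }

// Close flushes the remaining messages and waits for the goroutine to exit.
func (b *batcher) Close() {
	close(b.in)
	<-b.done
}

func main() {
	b := newBatcher(16, 50*time.Millisecond, func(p []byte) {
		fmt.Printf("flush %d bytes: %q\n", len(p), p)
	})
	b.Write([]byte("first message\n"))
	b.Write([]byte("second message\n")) // crosses 16 bytes, triggers a size flush
	b.Write([]byte("third\n"))
	time.Sleep(100 * time.Millisecond) // lets the interval flush fire
	b.Close()
}
```

Keeping the buffer inside a single goroutine is one way to avoid data races between the size trigger and the time trigger; a mutex-guarded buffer would be another option.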

Author

@nicklas-dohrn nicklas-dohrn Jun 20, 2024

Regarding the code proposed:
The implementation for the signal batcher uses slices and an append structure.
It knows in advance how long the batch is going to be, which speeds up execution.
In contrast, the syslog batching feature does not know beforehand how long batches are going to be,
which gives the speed advantage to bytes.Buffer:
https://stackoverflow.com/questions/39319024/builtin-append-vs-bytes-buffer-write

Also, the actual buffering code is pretty short, and the type expected by the HTTP client for sending data is []byte, so the choice still seems obvious to me.
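
A short sketch of the buffering pattern argued for here, assuming a bytes.Buffer that is pre-sized once and then reused across batches. The batchBuffer wrapper and its methods are hypothetical names; batchSize mirrors the BATCHSIZE constant visible in the diff context.

```go
package main

import (
	"bytes"
	"fmt"
)

const batchSize = 256 * 1024 // mirrors BATCHSIZE from the diff context

// batchBuffer is a hypothetical wrapper used only for this illustration.
type batchBuffer struct {
	buf bytes.Buffer
}

func newBatchBuffer() *batchBuffer {
	b := &batchBuffer{}
	b.buf.Grow(batchSize) // reserve capacity once, up front
	return b
}

// add appends msg and reports whether the batch has reached batchSize.
func (b *batchBuffer) add(msg []byte) bool {
	b.buf.Write(msg)
	return b.buf.Len() >= batchSize
}

// take returns a copy of the batch as []byte and resets the buffer.
// The copy matters: Bytes() aliases memory that Reset makes reusable.
func (b *batchBuffer) take() []byte {
	out := append([]byte(nil), b.buf.Bytes()...)
	b.buf.Reset()
	return out
}

func main() {
	bb := newBatchBuffer()
	full := bb.add([]byte("<14>1 example syslog line\n"))
	fmt.Println("batch full:", full)
	fmt.Printf("%q\n", bb.take())
}
```

If the batch is sent synchronously before the next write, the copy in take could be dropped and Bytes() handed to the client directly.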

egrMsgCount float64
}

func NewHTTPSBatchWriter(
Member

Can you please add tests for this writer.

Author

Some tests have already been added. Do we need tests for the things already done by the httpsWriter (error handling and the like, which is the same anyway)?

c *Converter,
) egress.WriteCloser {
client := httpClient(netConf, tlsConf)
binding.URL.Scheme = "https" // reset the scheme for usage to a valid http scheme
Member

What's the purpose of changing the scheme here?

Author

The scheme is used to differentiate between the different endpoints (https and https-batched).
If I do not change it back to https for sends, the requested URL will be https-batched://..., which does not work.
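
One possible way to get a valid https URL for sending without touching the binding itself is to copy the URL first. This is only a sketch under that assumption; sendURL and mustParse are hypothetical helpers, and whether the PR's metrics read the binding's scheme is not established in this thread.

```go
package main

import (
	"fmt"
	"net/url"
)

// sendURL returns a copy of the binding URL with the scheme set to
// "https". The original URL is left untouched, so anything that reads
// the binding later (labels, metrics tags) still sees "https-batched".
func sendURL(binding *url.URL) *url.URL {
	u := *binding // shallow copy of the url.URL struct
	u.Scheme = "https"
	return &u
}

// mustParse is a small helper for the example; it panics on bad input.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	binding := mustParse("https-batched://drain.example.com/syslog")
	fmt.Println("send to:", sendURL(binding).String())
	fmt.Println("binding:", binding.String())
}
```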

Member

Does mutating the scheme here affect the metrics emitted?

Author

I will check for the metrics being changed here.

src/pkg/egress/syslog/https_batch.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Outdated Show resolved Hide resolved
)

var triggered = 0
var string_to_1024_chars = "saljdflajsdssdfsdfljkfkajafjajlköflkjöjaklgljksdjlakljkflkjweljklkwjejlkfekljwlkjefjklwjklsdajkljklwerlkaskldgjksakjekjwrjkljasdjkgfkljwejklrkjlklasdkjlsadjlfjlkadfljkajklsdfjklslkdfjkllkjasdjkflsdlakfjklasldfkjlasdjfkjlsadlfjklaljsafjlslkjawjklerkjljklasjkdfjklwerjljalsdjkflwerjlkwejlkarjklalkklfsdjlfhkjsdfkhsewhkjjasdjfkhwkejrkjahjefkhkasdjhfkashfkjwehfkksadfjaskfkhjdshjfhewkjhasdfjdajskfjwehkfajkankaskjdfasdjhfkkjhjjkasdfjhkjahksdf"
Member

I find the name here a little confusing. The string is ~440 bytes in length.

Author

Yes, this might be confusing.
I did it that way to avoid putting in far too many characters just to reach 1024.

src/pkg/egress/syslog/https_batch_test.go Show resolved Hide resolved
time.Sleep(99 * time.Millisecond)
}
time.Sleep(100 * time.Millisecond)
Expect(drain.messages).To(HaveLen(10))
Member

In general in these tests can we use Eventually rather than relying on timing?

Author

These are not final yet.
This test in particular is meant to check that the time window is exactly 1s, as defined in the class, and not something else.
Rereading it now, it does not test exactly that, but an Eventually would not test whether the defined timings are adhered to.
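
To illustrate the trade-off discussed in this thread: polling for a condition avoids fixed sleeps, and measuring the elapsed time alongside it still allows a bound on timing to be checked. The sketch below uses only the standard library; eventually and elapsedUntil are hypothetical helpers that imitate the idea behind Gomega's Eventually, not its API.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls cond until it returns true or timeout has elapsed.
func eventually(cond func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// elapsedUntil also reports how long the wait took, so a test can
// assert bounds on the timing instead of relying on fixed sleeps.
func elapsedUntil(cond func() bool, timeout, interval time.Duration) (time.Duration, bool) {
	start := time.Now()
	ok := eventually(cond, timeout, interval)
	return time.Since(start), ok
}

func main() {
	var flushed int64
	go func() {
		time.Sleep(20 * time.Millisecond) // simulated asynchronous flush
		atomic.StoreInt64(&flushed, 10)
	}()

	took, ok := elapsedUntil(func() bool {
		return atomic.LoadInt64(&flushed) == 10
	}, time.Second, time.Millisecond)
	fmt.Println("condition met:", ok, "after", took)
}
```

In a Ginkgo/Gomega suite the same idea would normally be expressed with Eventually for the "it happens" part and Consistently for the "it does not happen earlier" part.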

src/pkg/egress/syslog/syslog_connector.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Outdated Show resolved Hide resolved
@nicklas-dohrn
Author

I reimplemented the changes using an approach similar to what @ctlong proposed.
This makes the change far more concise and also gets rid of all the conflicts reported by go test -race.
There are some race complaints left, but these only refer to the implementation of the tests themselves, which are inherently data races by design.
@ctlong and @acrmp, if your concerns above are addressed, please close everything that is done, so we can keep this organised.

@nicklas-dohrn nicklas-dohrn requested a review from acrmp June 26, 2024 04:57
@nicklas-dohrn nicklas-dohrn changed the title Add syslog batching poc implementation Add syslog batching implementation Aug 5, 2024
@nicklas-dohrn
Author

I did some extensive testing of the current and the new approach for syslog batching, sending from our dev CF landscape with 4 Diego cells and 4 Loggregator agents to a cls instance.
I also tested #573 (HTTPS drains reuse and release fasthttp);
there were no major throughput improvements to be seen. CPU consumption of both the new and the old version is minimal because they are network bound.
There is some good news concerning batching: it considerably speeds up throughput and also reduces drops
(see attached table for information).
"Concurrent" refers to a version of the https drain that I modified with a goroutine to allow usage of more than one CPU.
The current approach would increase the throughput considerably, but this results in a new issue:
If multiple applications on one Diego cell bind to a cls instance, they share the throughput constraints of one CPU (I tested that this is indeed the behaviour).
This leads to bottlenecks on bigger CF landscapes, as these use much bigger Diego cells and consequently bigger Loggregator instances with more CPU cores, and this approach does not scale with them at all.
@ctlong and @acrmp,
I would like to hear your thoughts on this issue and how we can proceed to make this work at scale.
[image: the attached table of test results]
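
As a rough illustration of the "concurrent" variant mentioned above, batches could be fanned out to several sender goroutines so that more than one CPU core can be used. This is a sketch under assumptions, not the modified drain that was tested: sendConcurrently and countingSender are hypothetical names, and with more than one worker the order of batches is no longer guaranteed.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sendConcurrently distributes batches over a fixed number of worker
// goroutines and returns once every batch has been passed to send.
func sendConcurrently(batches [][]byte, workers int, send func([]byte)) {
	if workers < 1 {
		workers = 1
	}
	ch := make(chan []byte)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range ch {
				send(b)
			}
		}()
	}
	for _, b := range batches {
		ch <- b
	}
	close(ch)
	wg.Wait()
}

// countingSender returns a goroutine-safe stand-in for an HTTP send
// plus a function that reports the total number of bytes "sent".
func countingSender() (send func([]byte), total func() int) {
	var mu sync.Mutex
	n := 0
	send = func(b []byte) {
		mu.Lock()
		n += len(b)
		mu.Unlock()
	}
	total = func() int {
		mu.Lock()
		defer mu.Unlock()
		return n
	}
	return send, total
}

func main() {
	send, total := countingSender()
	batches := [][]byte{[]byte("batch one\n"), []byte("batch two\n"), []byte("batch three\n")}
	sendConcurrently(batches, runtime.NumCPU(), send)
	fmt.Println("bytes sent:", total())
}
```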

@juergen-walter

@ctlong and @acrmp, we have many customers suffering from and complaining about log drops. We would highly appreciate it if this PR could be finalized and merged in a timely manner. We appreciate your efforts so far. Best regards.

Contributor

@chombium chombium left a comment

Generally it looks fine; I found two little things.

I will wait on @ctlong for his review

src/pkg/egress/syslog/https_batch_test.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https_test.go Outdated Show resolved Hide resolved
Labels
None yet
Projects
Status: Waiting for Changes | Open for Contribution
Development

Successfully merging this pull request may close these issues.

None yet

5 participants