
Added support for InfluxDB batches of more than 1 and made the emission channel size configurable. #3937

Merged
merged 10 commits into concourse:master on Aug 13, 2019

Conversation

@rudolfv

commented May 29, 2019

@vito We still need to test the latest changes (making all the magic numbers configurable) against one of our concourse stacks, but I did manage to build a full release image in our slightly altered release pipeline. We are not running the upgrade or downgrade jobs, but everything else passes.

@rudolfv rudolfv force-pushed the rudolfv:feature/3674-influxdb-batching branch from f521b64 to 8c16e04 May 29, 2019

@vito vito requested a review from cirocosta May 29, 2019

@cirocosta
Member

left a comment

Thank you for the PR, @rudolfv !

I just added some questions / points 👍 I feel a bit bad that we didn't have some testing around this area before 😬 Do you think it'd be worth adding some to this logic?

Thanks!

@@ -66,7 +66,13 @@ var (
emissions chan eventEmission
)

func Initialize(logger lager.Logger, host string, attributes map[string]string) error {
func Initialize(logger lager.Logger, host string, attributes map[string]string, bufferSize int) error {

@cirocosta

cirocosta Jun 3, 2019

Member

As we always expect bufferSize to be a uint anyway, wdyt of changing the signature to take a uint instead of an int?

Suggested change
func Initialize(logger lager.Logger, host string, attributes map[string]string, bufferSize int) error {
func Initialize(logger lager.Logger, host string, attributes map[string]string, bufferSize uint32) error {

It seems to me that it'd be better to do the cast at the make(chan ... that comes later, rather than at the moment we pass the value to Initialize 🤔

@rudolfv

rudolfv Jun 6, 2019

Author

I will make this change - makes sense to me to constrain it with a uint32 and do the cast at the last possible moment.
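A minimal sketch of the agreed change, using the names from the diff above (the real Initialize also takes a logger, host, and attributes, omitted here): the parameter is uint32 so a negative size is unrepresentable, and the cast to int happens at the last possible moment, inside make.

```go
package main

import "fmt"

// eventEmission's fields are omitted; only the channel sizing matters here.
type eventEmission struct{}

var emissions chan eventEmission

// Initialize takes bufferSize as uint32, as agreed in this thread, and
// casts to int only where the channel is actually created.
func Initialize(bufferSize uint32) {
	emissions = make(chan eventEmission, int(bufferSize))
}

func main() {
	Initialize(5000)
	fmt.Println(cap(emissions)) // prints 5000: capacity equals the configured buffer size
}
```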

@@ -110,6 +116,9 @@ func Deinitialize(logger lager.Logger) {
}

func emit(logger lager.Logger, event Event) {
logger.Debug("emit-event", lager.Data{

@cirocosta

cirocosta Jun 3, 2019

Member

We already have a way of having the metrics emitted logging to stdout through the --emit-to-logs variable - did you want this logline for the purposes of checking if metrics are still being emitted regardless of the configured emitter, or for the same purposes as --emit-to-logs?

@rudolfv

rudolfv Jun 6, 2019

Author

This code is to check regardless of the configured emitter. If I remember correctly only one of the emitters can be configured at a time. So we cannot have --emit-to-logs and InfluxDB emission configured at the same time.

@vito

vito Aug 8, 2019

Member

I'm lukewarm on this; it seems like it'll be super noisy but I guess you can always filter it out. It's already at debug level but we have a decent amount of environments with debug logs enabled. Granted, we're used to maintaining filters to strip out the cruft.

If it was just added while you were developing and isn't useful anymore I would prefer we remove it, but if it was super useful in a pinch I'm fine with keeping it. 🤷‍♂

@@ -24,9 +26,22 @@ type InfluxDBConfig struct {
Password string `long:"influxdb-password" description:"InfluxDB server password."`

InsecureSkipVerify bool `long:"influxdb-insecure-skip-verify" description:"Skip SSL verification when emitting to InfluxDB."`

// https://github.com/influxdata/docs.influxdata.com/issues/454

@cirocosta

cirocosta Jun 3, 2019

Member

Thanks for giving us the context on "why 5000"! That's very helpful.

Wdyt of leaving that to the PR though? As one could always trace the 5000 back to this PR, I'd say leaving that in the PR's comments (or even in the commit message) would be enough 🤔 In the codebase, we don't have many other cases where we add references like this.

@rudolfv

rudolfv Jun 6, 2019

Author

I will remove the comments in the code and add as a comment here:

influxdata/docs.influxdata.com#454
https://docs.influxdata.com/influxdb/v0.13/write_protocols/write_syntax/#write-a-batch-of-points-with-curl
5000 seems to be the batch size recommended by the InfluxDB team

"influxdb-batch-duration": emitter.batchDuration,
"current-duration": duration,
})
go emitBatch(emitter, logger, batch)

@cirocosta

cirocosta Jun 3, 2019

Member

I'm not very sure about making each emission a goroutine 🤔 If I understood correctly, this makes the buffer "infinite" as there's never really going to exist something that throttles the writes. Is this statement right? Maybe I got something wrong 😬

@rudolfv

rudolfv Jun 6, 2019

Author

Making it a goroutine ensures that the construction of the batch points and writing to InfluxDB are immediately taken off the line of execution so that the emitter can continue to build up the next batch.

And the emitLoop function in emit.go still ensures that events are passed on in a synchronous way to the InfluxDBEmitter after they have been read from the channel.

So there will only ever be one open batch that will be closed and submitted for final processing when either the size or duration limit is reached. However, there can potentially be a number of closed batches that are in the process of being transformed into batch points or written to InfluxDB.

Does this answer your question? Or am I missing something?

@rudolfv

Author

commented Jun 6, 2019

> Thank you for the PR, @rudolfv !
>
> I just added some questions / points 👍 I feel a bit bad that we didn't have some testing around this area before 😬 Do you think it'd be worth adding some to this logic?
>
> Thanks!

@cirocosta I am busy working on a unit test for this logic. Also see my replies to the other comments.

influxclient "github.com/influxdata/influxdb1-client/v2"
)

// This is a copy of the github.com/influxdata/influxdb1-client/v2/client.Client interface whose sole purpose is

@rudolfv

rudolfv Jun 7, 2019

Author

I only added this file as a last resort. As stated in the counterfeiter docs for third party packages, I tried with
//go:generate counterfeiter github.com/influxdata/influxdb1-client/v2/client.Client
and various other permutations of that, but none of them worked.

Rudolf Visagie added some commits May 29, 2019

Rudolf Visagie
Added support for InfluxDB batches of more than 1 and made the emission channel buffer size configurable.

Signed-off-by: Rudolf Visagie <rudolf.visagie@adp.com>
Rudolf Visagie
Added InfluxDB emitter tests
Signed-off-by: Rudolf Visagie <rudolf.visagie@adp.com>
Rudolf Visagie
Re-organize the test spec to group tests under the same context
Signed-off-by: Rudolf Visagie <rudolf.visagie@adp.com>
Rudolf Visagie
Remove the batch size comments as per the PR review - the same info has been added to the PR comments

Signed-off-by: Rudolf Visagie <rudolf.visagie@adp.com>
Rudolf Visagie
Change the bufferSize parameter to uint32 and only cast when creating the channel

Signed-off-by: Rudolf Visagie <rudolf.visagie@adp.com>
Rudolf Visagie
Fix formatting
Signed-off-by: Rudolf Visagie <rudolf.visagie@adp.com>
Rudolf Visagie
Make a copy of the batch before submitting
Signed-off-by: Rudolf Visagie <rudolf.visagie@adp.com>

@rudolfv rudolfv force-pushed the rudolfv:feature/3674-influxdb-batching branch from 17d8b81 to 661bf9b Jun 7, 2019

@rudolfv

Author

commented Jun 10, 2019

@cirocosta This PR is ready for review again

@cirocosta

Member

commented Jun 17, 2019

Hi @rudolfv !

Sorry for the very long delay, I'll get back to it very soon!


Update (21 Jun, 2019): we still didn't get the time to come to it. It's still in our backlog though!

@rudolfv

Author

commented Jul 22, 2019

@cirocosta Any updates on this?

@rudolfv

Author

commented Jul 30, 2019

@vito @cirocosta I see the stale[bot] is about to close this. Do you have any updates for me?

@vito

Member

commented Jul 30, 2019

@rudolfv Sorry! 🙁 I'll poke the issue so the stale bot backs off. We'll be sure to get to this soon. @cirocosta has been pretty busy with other commitments, and I've at least been busy putting together our roadmap. Unfortunately we don't have very consistent bandwidth for PR reviews - we're trying to improve on this as a project but it's slow going. I poked him about it yesterday, I'll try to get one of us to wrap this up soon depending on who's available.

@rudolfv

Author

commented Jul 30, 2019

Thank you @vito!

@vito
Member

left a comment

a tentative review to keep the ball rolling 🙂 sorry for the wait


@@ -132,13 +141,19 @@ func emit(logger lager.Logger, event Event) {

select {
case emissions <- eventEmission{logger: logger, event: event}:
logger.Debug("emit-event-write-to-channel", lager.Data{

@vito

vito Aug 8, 2019

Member

This on the other hand seems like it'd be a bit too noisy. 😅 Would you be OK with removing either this one or the above log?

(Also interesting that the above log line will run even with a nil emitter. Not sure if that's useful. I guess you'd notice it's not configured more quickly, but it also means those logs will show up even in places where metrics are undesired.)

default:
logger.Error("queue-full", nil)
}
}

func emitLoop() {
for emission := range emissions {
emission.logger.Debug("emit-event-loop", lager.Data{

@vito

vito Aug 8, 2019

Member

Also noisy - wouldn't this mean there are 3 logs emitted for every metric?

}
}
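The select with a default branch in the hunk above is Go's standard non-blocking send: when the emissions buffer is full, the event is dropped and queue-full is logged instead of stalling the code path that produced the metric. A stripped-down sketch of that pattern (tryEmit is a hypothetical name; the real code logs rather than returning a bool):

```go
package main

import "fmt"

type eventEmission struct{ value int }

// A tiny buffer to make the overflow easy to demonstrate; the real channel
// is sized by the configurable bufferSize.
var emissions = make(chan eventEmission, 2)

// tryEmit returns false instead of blocking when the buffer is full; the
// real emit logs "queue-full" in that branch.
func tryEmit(e eventEmission) bool {
	select {
	case emissions <- e:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(tryEmit(eventEmission{1})) // true
	fmt.Println(tryEmit(eventEmission{2})) // true
	fmt.Println(tryEmit(eventEmission{3})) // false: buffer of 2 is full
}
```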

func (emitter *InfluxDBEmitter) SubmitBatch(logger lager.Logger) {

@vito

vito Aug 8, 2019

Member

Just checking, is this thread-safe? I guess it is because all of the emits come from an emit loop? 🤔

@rudolfv

rudolfv Aug 8, 2019

Author

@vito, yes, that is my understanding. The emitLoop will serialize the emission of events to the InfluxDB code. emitLoop -> emitter.Emit -> emitter.SubmitBatch
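That serialization can be sketched like this: a single emitLoop goroutine is the only reader of the channel, so everything it calls (emitter.Emit, and from there SubmitBatch) runs one emission at a time, with no mutex needed. The callback parameter is illustrative:

```go
package main

import "fmt"

type eventEmission struct{ value int }

var emissions = make(chan eventEmission, 100)

// emitLoop is the sole reader of the channel, so everything downstream of
// it (including SubmitBatch in the real code) is never entered concurrently.
func emitLoop(in <-chan eventEmission, emit func(eventEmission)) {
	for emission := range in {
		emit(emission)
	}
}

func main() {
	emissions <- eventEmission{value: 1}
	emissions <- eventEmission{value: 2}
	close(emissions)

	var order []int
	emitLoop(emissions, func(e eventEmission) { order = append(order, e.value) })
	fmt.Println(order) // prints [1 2]: processed strictly in arrival order
}
```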

copy(batchToSubmit, batch)
batch = make([]metric.Event, 0)
lastBatchTime = time.Now()
go emitBatch(emitter, logger, batchToSubmit)

@vito

vito Aug 8, 2019

Member

I guess the idea behind this is to prevent a giant batch from submitting too slowly and causing the queue to fill up. Makes sense, since that was the original problem. But I wonder if batching would help mitigate that in and of itself, since submitting 5k at a time might cause less backpressure than 1 event 5000 times.

Have you observed any issues with this yet at large scale? It seems like the goroutines could potentially pile up due to a slow consumer.

I think @cirocosta had similar concerns on an earlier revision. I'm fine with how it is for now, but in the future we may want to add a max-in-flight or something. I guess at the end of the day slow consumers are hard to avoid, and this is why we've been thinking of standardizing on Prometheus. 😅

@rudolfv

rudolfv Aug 8, 2019

Author

@vito We have not had any issues with this, but I understand your concerns. As a final test, I would suggest that once we have merged this and built a release candidate, we spin up a concourse cluster similar to our production one and run some load tests to give us a measure of confidence in all the 5.5 code changes.

@cirocosta

Member

commented Aug 8, 2019

thanks for looking at it @vito 🙌 sorry y'all for dropping the ball and taking so long on this :( :( :(

@rudolfv rudolfv requested a review from concourse/pivotal as a code owner Aug 8, 2019

Rudolf Visagie
Remove extraneous logging
Signed-off-by: Rudolf Visagie <rudolf.visagie@adp.com>

@rudolfv rudolfv force-pushed the rudolfv:feature/3674-influxdb-batching branch from 04e3e9e to 666c14b Aug 8, 2019

@rudolfv

Author

commented Aug 8, 2019

@vito I have removed all the extraneous logging. These are really all remnants of when I initially debugged the issue to find the exact point at which we were losing messages.


Also, from running it in our production pipeline I can confirm that having debug mode on creates an insane amount of logging whose usefulness is questionable. It makes sense to only add this again in a custom build if we need to troubleshoot something specific further down the line.

Rudolf Visagie
Removed unnecessary line break
Signed-off-by: Rudolf Visagie <rudolf.visagie@adp.com>

@rudolfv rudolfv force-pushed the rudolfv:feature/3674-influxdb-batching branch from dc5c349 to dc224ff Aug 8, 2019

@vito
Member

left a comment

👍 thanks for cleaning up those logs! i noticed one more that should probably be removed, after that I think it's good to go

@cirocosta np!

atc/metric/emitter/influxdb.go (review comment, resolved)
Rudolf Visagie
Removed development logging
Signed-off-by: Rudolf Visagie <rudolf.visagie@adp.com>
@vito

vito approved these changes Aug 13, 2019

Member

left a comment

thanks! good to go once CI passes. 👍

@vito vito merged commit 7216993 into concourse:master Aug 13, 2019

5 checks passed:

- DCO
- WIP (Ready for review)
- concourse-ci/testflight (Concourse CI build success)
- concourse-ci/unit (Concourse CI build success)
- concourse-ci/watsjs (Concourse CI build success)
@rudolfv

Author

commented Aug 13, 2019

Thanks, @vito. Any idea when the 5.5 release is targeted?

@vito

Member

commented Aug 13, 2019

Don't know yet; I would assume a week or two at worst. There are a few things left in the milestone.

@rudolfv

Author

commented Aug 13, 2019

Thanks @vito, that's good news!
