
Throughput question #374

Closed

dale-busse-av opened this issue Nov 19, 2020 · 15 comments

@dale-busse-av

Hello,

I have a question about general throughput. I have a log file that receives about 200 logs/second, with an average line length of about 225 characters. We run it on AWS (a c5n.4xlarge instance) with plenty of RAM/CPU to spare. When I run lc-admin and look at this particular file, the "completion" statistic sits at about 58 (my understanding is that this is the percentage of the file that has been read). Since I use logrotate and rotate every hour, the file gets deleted, but log-courier holds onto it and continues processing. I then have two files being processed: the orphaned one and the newly created one. However, this does take up space on the filesystem. So my questions (we are on version 2.0.6):

  1. Should log-courier be able to keep up with this? What kind of throughput should I expect?
  2. Any way to speed this up?

Thanks!

@driskell
Owner

Hello.

When you're looking at lc-admin, can you see what it says under the transport section?

The key metric is "pending payloads". If that always sticks at 10, it's essentially saying it has sent all it can (the default maximum is 10) to Logstash and is waiting for responses. If it's less than 10, then it's never waiting and is always sending. In all honesty, I've never seen a case where it was less than 10 while there were active harvesters ("completion" < 100), which would indicate Logstash running faster than logs can be read and sent.

Usually things are IO bound, and the slowest link is nearly always Logstash due to the grok/user-agent lookup/IP lookup work going on. Based on that, the usual way to speed things up is to look at the Logstash side and see if it needs more CPU, or at the ES ingestion side to see if IO/CPU headroom is there.

Regarding files held open - nowadays I see that as undesirable. When things are working fine, holding a file open for a few minutes after deletion to make sure everything was read and sent is a good thing; but if things aren't working fine, it will hold the file until all logs are sent, and that could mean many deleted files held open if Logstash is hitting resource limits. It's something I haven't got around to looking at, but log-courier should really have a setting to stop holding a deleted file open after a specific amount of time.

@dale-busse-av
Author

Hello, and thanks for the response. Wonderful software.

I must have an "older" version (PPA from Ubuntu) because I don't see that metric. This is the output I currently get from running a status:

```
0xc000eb0600:
    error: n/a
    harvester:
      codecs: none
      completion: 67.30813995648552
      current_offset: 67702147
      last_eof_offset: n/a
      last_known_size: 100585378
      processed_lines: 374686
      speed_bps: 126984.0948625534
      speed_lps: 689.7015856023845
      stale_bytes: 0
      status: alive
    id: 0xc000eb0600
    orphaned: no
    path: /var/log/mylog.log
    status: running
    type: file
```

Or is there a command line switch I need to add? Thanks again!

@driskell
Owner

Hmmm, would you be willing to post the full status output? Possibly it is called "publisher" rather than "transport".

@dale-busse-av
Author

Ah yes, I see it now:

```
  endpoints:
    logstash-1:5001:
      averageLatency: 1
      pendingPayloads: 8
      publishedLines: 60600878
      server: logstash-1:5001
      status: Active
    logstash-2:5001:
      averageLatency: 3
      pendingPayloads: 0
      publishedLines: 24883666
      server: logstash-2:5001
      status: Active
  status:
    pendingPayloads: 10
    publishedLines: 85483520
    speed: 707.028702900108
reload: callback
version: 2.0.6
```

OK, so if that stays at 10, then Logstash is my issue, correct?

@driskell
Owner

It looks like one endpoint isn't doing much, as it's got 0 pending - what's the "method" setting? I think the default is "random", which will only send to one endpoint at any one time and use any others as failover - if both are fine to be processing, it may be worth trying the "loadbalance" method.

https://github.com/driskell/log-courier/blob/master/docs/Configuration.md#method
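
For illustration only, a minimal sketch of the relevant part of the network section - the server addresses are taken from your status output, other required options (transport, SSL settings, etc.) are omitted, so check the Configuration.md above against your version:

```json
{
  "network": {
    "method": "loadbalance",
    "servers": [
      "logstash-1:5001",
      "logstash-2:5001"
    ]
  }
}
```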

It is reporting about 700 lines per second though, so maybe it is catching up? If you started log-courier with lots of full files, it might still be sending everything and could catch up eventually.

@dale-busse-av
Author

Method is "loadbalance". I ran a status again and now I see this:

```
  endpoints:
    logstash-1:5001:
      averageLatency: 2
      pendingPayloads: 5
      publishedLines: 61546030
      server: logstash-1:5001
      status: Active
    logstash-2:5001:
      averageLatency: 3
      pendingPayloads: 5
      publishedLines: 25255378
      server: logstash-2:5001
      status: Active
  status:
    pendingPayloads: 10
    publishedLines: 86800384
    speed: 0
reload: callback
version: 2.0.6
```

Would adding another Logstash instance help? Or, since pending payloads is at 10, would it not do much in this case?

@driskell
Owner

Ah, here we have speed 0 and, in essence, a momentarily blocked pipeline, so yes, I think there is some bottleneck on the Logstash side.

Adding another could help, but one thing I found helpful was to also optimise the processing on the Logstash side. Specifically, consider looking at the filters applied to events, like grok, as I know that tends to be the slowest. Some patterns can be much slower than others.
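
As a rough illustration (the field and pattern names here are made up, not from your pipeline), anchoring a grok pattern lets non-matching lines fail quickly instead of backtracking through the whole line:

```
filter {
  grok {
    # Anchored with ^ so lines that don't match fail fast
    match => { "message" => "^%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{GREEDYDATA:rest}" }
  }
}
```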

@driskell
Owner

Worth checking Logstash CPU etc. too, and the ES cluster CPU, to make sure you pinpoint where the restricted element is.

@dale-busse-av
Author

I tweaked my Logstash using the workers and batch size settings, and CPU/memory are doing just fine, so this leads me to believe the bottleneck is on the Elasticsearch side, since those nodes are CPU bound.
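
For reference, these are the standard Logstash pipeline options in logstash.yml that I mean; the values below are examples only, not what we settled on:

```yaml
# logstash.yml - example values only
pipeline.workers: 8       # defaults to the number of CPU cores on the host
pipeline.batch.size: 250  # events per worker batch (default 125)
```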

@dale-busse-av
Author

As a follow-up, when I run lc-admin to get a status, I see pending payloads maxed at 10, which, as stated above, means it is waiting on Logstash. I have 2 Logstash servers, load balanced, and both have CPUs (x4) running at about 50%; memory is not fully consumed either, and there is no IOWait. In your experience, could this also be a network issue?

@driskell
Owner

driskell commented Nov 23, 2020

You could try increasing the max pending payloads. Memory usage will increase, but perhaps the throughput is such that 10 is not enough. Essentially it dictates the maximum amount of data in transit at any one moment.
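
For example - a sketch only; the option sits in the network section of the configuration, and the value here is arbitrary, so verify the exact name against the Configuration.md for your version:

```json
{
  "network": {
    "max pending payloads": 20
  }
}
```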

@dale-busse-av
Author

The documentation says the default is 4. Is that correct, or is it 10? I'm assuming that since I see pendingPayloads: 10, I should increase it beyond 10. Thanks for the help!

@dale-busse-av
Author

Thank you for all the help! Great app.

@driskell
Owner

driskell commented Dec 2, 2020

@dale-busse-av No problem at all! Thanks for the patience. Did you manage to get things improved in the end?
You're right too - that default is incorrect. I'll remember to update that eventually.

@dale-busse-av
Author

Yep we did. Thanks to your help, I was able to determine my ES cluster was the bottleneck. Learned about log-courier and appreciated all the help. Thanks again!!
