
Throughput question #374

Closed

dale-busse-av opened this issue Nov 19, 2020 · 15 comments

@dale-busse-av

Hello,

I have a question about general throughput. I have a log file that receives about 200 logs/second, with an average line length of about 225 characters. We run it on AWS (a c5n.4xlarge instance) with plenty of RAM/CPU to spare. When I run lc-admin and look at this particular file, the "completion" statistic sits at about 58 (my understanding is that this is the percentage of the file that has been read). Since I use logrotate and rotate every hour, the file gets deleted, but log-courier holds onto it and continues processing. I then have two files being processed: the orphaned one and the newly created one. However, this does take up space on the filesystem. So my questions (we are on version 2.0.6):

  1. Should log-courier be able to keep up with this? What kind of throughput should I expect?
  2. Any way to speed this up?

Thanks!

@driskell
Owner

Hello.

When you're looking at lc-admin, can you see what it says under the transport section?

The key metric is "pending payloads". If that always sticks at 10, it's essentially saying it has sent all it can (the default maximum is 10) to Logstash and is waiting for responses. If it's less than 10, then it's never waiting and is always sending. In all honesty, I've never seen a case where it was less than 10 while there were active harvesters ("completion" < 100), which would indicate Logstash running faster than logs can be read and sent.

Usually things are IO bound, and the slowest link is nearly always Logstash due to the grok/user-agent lookup/IP lookup work going on. Based on that, the usual way to speed things up is to look at the Logstash side and see if it needs more CPU, or at the ES ingestion side to see if IO/CPU headroom is there.

Regarding files held open - nowadays I see that as undesirable. When things are working fine, holding a file open for a few minutes after deletion to make sure everything was read and sent is a good thing; but if things aren't working fine, it will hold the file until all logs are sent, and that could mean many deleted files held open if Logstash is hitting resource limits. It's something I haven't got around to looking at, but log-courier should really have a setting to stop holding a deleted file open after a specific amount of time.

@dale-busse-av
Author

Hello, and thanks for the response. Wonderful software.

I must have an "older" version (PPA from Ubuntu) because I don't see that metric. This is the output I currently get from running a status:

```
0xc000eb0600:
    error: n/a
    harvester:
      codecs: none
      completion: 67.30813995648552
      current_offset: 67702147
      last_eof_offset: n/a
      last_known_size: 100585378
      processed_lines: 374686
      speed_bps: 126984.0948625534
      speed_lps: 689.7015856023845
      stale_bytes: 0
      status: alive
    id: 0xc000eb0600
    orphaned: no
    path: /var/log/mylog.log
    status: running
    type: file
```

Or is there a command line switch I need to add? Thanks again!

@driskell
Owner

Hmmm, would you be willing to post the full status output? Possibly it is called "publisher" rather than "transport".

@dale-busse-av
Author

Ah yes, I see it now:

```
  endpoints:
    logstash-1:5001:
      averageLatency: 1
      pendingPayloads: 8
      publishedLines: 60600878
      server: logstash-1:5001
      status: Active
    logstash-2:5001:
      averageLatency: 3
      pendingPayloads: 0
      publishedLines: 24883666
      server: logstash-2:5001
      status: Active
  status:
    pendingPayloads: 10
    publishedLines: 85483520
    speed: 707.028702900108
reload: callback
version: 2.0.6
```

OK, so if that stays at 10, then Logstash is my issue, correct?

@driskell
Owner

It looks like one endpoint isn't doing much, as it's got 0 pending - what's the "method" setting? I think the default is "random", which will only send to one endpoint at any one time and use any others as failover - if both are fine to be processing, it may be worth trying the "loadbalance" method.

https://github.com/driskell/log-courier/blob/master/docs/Configuration.md#method
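
For illustration only, a minimal sketch of the relevant part of the network section - the server addresses are taken from your status output, other required options (transport, SSL settings, etc.) are omitted, so check the Configuration.md above against your version:

```json
{
  "network": {
    "method": "loadbalance",
    "servers": [
      "logstash-1:5001",
      "logstash-2:5001"
    ]
  }
}
```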

It is reporting about 700 lines per second though, so maybe it is catching up? If you started log-courier with lots of full files, it might still be sending everything and could catch up eventually.

@dale-busse-av
Author

Method is "loadbalance". I ran a status again and now I see this:

```
  endpoints:
    logstash-1:5001:
      averageLatency: 2
      pendingPayloads: 5
      publishedLines: 61546030
      server: logstash-1:5001
      status: Active
    logstash-2:5001:
      averageLatency: 3
      pendingPayloads: 5
      publishedLines: 25255378
      server: logstash-2:5001
      status: Active
  status:
    pendingPayloads: 10
    publishedLines: 86800384
    speed: 0
reload: callback
version: 2.0.6
```

Would adding another Logstash instance help? Or, since pending payloads is at 10, would it not do much in this case?

@driskell
Owner

Ah, here we have speed 0 and, in essence, a momentarily blocked pipeline, so yes, I think there is some bottleneck on the Logstash side.

Adding another could help, but one thing I found helpful was to also optimise the processing on the Logstash side. Specifically, consider looking at the filters applied to events, like grok, as I know that tends to be the slowest. Some patterns can be much slower than others.
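
As a rough illustration (the field and pattern names here are made up, not from your pipeline), anchoring a grok pattern lets non-matching lines fail quickly instead of backtracking through the whole line:

```
filter {
  grok {
    # Anchored with ^ so lines that don't match fail fast
    match => { "message" => "^%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{GREEDYDATA:rest}" }
  }
}
```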

@driskell
Owner

Worth checking Logstash CPU etc. too, and the ES cluster CPU, to make sure you pinpoint where the restricted element is.

@dale-busse-av
Author

I tweaked my Logstash using the workers and batch size settings, and CPU/memory are doing just fine, so this leads me to believe the bottleneck is on the Elasticsearch side, since those nodes are CPU bound.
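
For reference, these are the standard Logstash pipeline options in logstash.yml that I mean; the values below are examples only, not what we settled on:

```yaml
# logstash.yml - example values only
pipeline.workers: 8       # defaults to the number of CPU cores on the host
pipeline.batch.size: 250  # events per worker batch (default 125)
```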

@dale-busse-av
Author

As a follow-up, when I run lc-admin to get a status, I see pending payloads maxed at 10, which, as stated above, means it is waiting on Logstash. I have 2 Logstash servers, load balanced, and both have CPUs (x4) running at about 50%; memory is not fully consumed either, and there is no IOWait. In your experience, could this also be a network issue?

@driskell
Owner

driskell commented Nov 23, 2020

You could try increasing the max pending payloads. Memory usage will increase, but perhaps the throughput is such that 10 is not enough. Essentially it dictates the maximum amount of data in transit at any one moment.
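
For example - a sketch only; the option sits in the network section of the configuration, and the value here is arbitrary, so verify the exact name against the Configuration.md for your version:

```json
{
  "network": {
    "max pending payloads": 20
  }
}
```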

@dale-busse-av
Author

The documentation says the default is 4. Is that correct, or is it 10? I'm assuming that since I see pendingPayloads: 10, I should increase it beyond 10. Thanks for the help!

@dale-busse-av
Author

Thank you for all the help! Great app.

@driskell
Owner

driskell commented Dec 2, 2020

@dale-busse-av No problem at all! Thanks for the patience. Did you manage to get things improved in the end?
You're right too - that default is incorrect. I'll remember to update that eventually.

@dale-busse-av
Author

Yep we did. Thanks to your help, I was able to determine my ES cluster was the bottleneck. Learned about log-courier and appreciated all the help. Thanks again!!
