
grok_exporter v 0.2.6 resident memory rising #55

Closed · raokrutarth opened this issue Mar 14, 2019 · 22 comments

raokrutarth commented Mar 14, 2019

I have a fairly frequently updated log file, and I am using grok_exporter to count specific log entries in it. I see that the exporter's resident memory usage is quite high and growing. This happens about 3/4 of the time in my environment.

Graph:
[screenshot]

The config is:

global:
  config_version: 2
grok:
  additional_patterns:
  - BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
  - DATA .*?
  - NUMBER (?:%{BASE10NUM})
  - DAEMON [a-zA-Z_\-]+
  - NAELOGLEVEL LOG_[\S+]
  - MESSAGE [\S\s\S]?*
  - YEAR (?>\d\d){1,2}
  - HOUR (?:2[0123]|[01]?[0-9])
  - MINUTE (?:[0-5][0-9])
  - SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
  - TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
  - MONTHNUM (?:0?[1-9]|1[0-2])
  - MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])
  - ISO8601_TIMEZONE (?:Z|[+-]%{HOUR}(?::?%{MINUTE}))
  - TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
  - LEVEL (LOG_)?([Aa]lert|ALERT|[Tt]race|TRACE|[Dd]ebug|DEBUG|DBG|[Nn]otice|NOTICE|[Ii]nfo|INFO|[Ww]arn?(?:ing)?|WARN?(?:ING)?|[Ee]rr|ERR?(?:OR)?|[Cc](rit)+(?:ical)?|CRIT?(?:ICAL)?|[Ff]atal|FATAL|[Ss]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)
input:
  path: /var/log/messages
  poll_interval_seconds: 5
  type: file
metrics:
- help: all log entries separated by daemon
  labels:
    daemon: '{{.daemon}}'
  match: '%{TIMESTAMP_ISO8601:time} %{DAEMON:daemon}.*%{LEVEL:level}.*'
  name: benchmark_all_daemon_log_entry_counts
  type: counter
- help: count of panic entries in the log by go daemons
  labels:
    daemon: '{{.daemon}}'
  match: '%{TIMESTAMP_ISO8601:time} %{DAEMON:daemon}.*panic.*'
  name: benchmark_go_daemon_log_panic_counts
  type: counter

[other confidential counters]
fstab (Owner) commented Mar 19, 2019

This is a bug. Will investigate where this comes from. Thanks for reporting.

fstab added a commit that referenced this issue Mar 25, 2019
fstab (Owner) commented Mar 25, 2019

Found a memory leak in gsub. Pushed a fix and will make a bugfix release soon. Thanks for reporting!
Are you actually using gsub (in your [other confidential counters] config) or is there another memory leak I am not aware of?
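
For context, gsub here is grok_exporter's label template function. A typical use in a metric's labels section looks roughly like the sketch below; the field name and pattern are made up, and the argument order (value, regex, replacement) is assumed from memory of the docs, so treat it only as an illustration of "using gsub":

labels:
  # hypothetical example: normalise dashes in the captured daemon name
  daemon: '{{gsub .daemon "-" "_"}}'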

raokrutarth (Author) commented:

I'm using summary and gauge metrics. Do those call gsub in any way?

fstab (Owner) commented Apr 8, 2019

I released bugfix version 0.2.7. Could you check if this solves the issue?

raokrutarth (Author) commented:

I used the 0.2.7 release binary in my environment, but I am still seeing the leak. Currently, a bash script restarts the exporter every 1.5 hours.

[screenshot]

fstab (Owner) commented Apr 11, 2019

There's a buffer in grok_exporter where new log lines are cached before they are processed by the matchers. The load of this buffer can be monitored with the grok_exporter_line_buffer_peak_load metric. Could you monitor this metric and check if this is constantly growing? This would help me figure out where the memory leak happens.
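
For example, one way to keep an eye on that metric in Prometheus could be queries along these lines (a sketch only; if your grok_exporter version exposes the metric as a summary rather than a plain gauge, add a quantile selector such as {quantile="0.9"}):

# highest buffer load seen over the last hour
max_over_time(grok_exporter_line_buffer_peak_load[1h])

# a persistently positive slope suggests the buffer keeps growing
deriv(grok_exporter_line_buffer_peak_load[30m]) > 0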

raokrutarth (Author) commented:

[screenshot]

[screenshot]

raokrutarth (Author) commented:

I also noticed a different trend over a longer run (~2.5 days): the peaks are not as large later on.

[screenshot]

fstab (Owner) commented Apr 16, 2019

Thanks for the data. If I see the colors correctly, grok_exporter_line_buffer_peak_load is constantly increasing. That means log lines are written faster than grok_exporter processes them. The buffer of unprocessed log lines keeps growing, and grok_exporter eventually runs out of memory.

One thing you could do is figure out why processing takes so long. Sometimes a single regular expression takes most of the processing time. The metric grok_exporter_lines_processing_time_microseconds_total gives you the total processing time in microseconds per metric. With a Prometheus query like rate(grok_exporter_lines_processing_time_microseconds_total[5m]) you can evaluate that over the last 5 minutes.

Could you run that query to see if there is any metric that takes significantly longer than the others? Maybe we can optimize the regular expression for that metric to make processing faster.
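
For example, a query along these lines should surface the most expensive matchers (a sketch; topk simply ranks the per-metric series, so no particular label name is assumed):

# top 3 metrics by per-second processing time over the last 5 minutes
topk(3, rate(grok_exporter_lines_processing_time_microseconds_total[5m]))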

raokrutarth (Author) commented Apr 16, 2019

[screenshot]

I can take out all_daemon_log_entry_counts, but benchmark_ovsdb_logs is needed.

The metric is:

- help: ovsdb-server log entries
  labels:
    level: '{{.level}}'
  match: '%{TIMESTAMP_ISO8601:time} ovsdb-server(\[%{NUMBER}\])?: .*%{LEVEL:level}'
  name: benchmark_ovsdb_logs
  type: counter

The frequency of this log is quite high; any chance this can be optimized?

fstab (Owner) commented Apr 17, 2019

It doesn't look too bad. Some things you can try (a combined sketch follows the list):

  • As far as I can see you only use level as a label, so you could replace %{TIMESTAMP_ISO8601:time} with %{TIMESTAMP_ISO8601}, or even leave it out entirely.
  • (\[%{NUMBER}\])?: is optional and you don't use it, so you could remove it.
  • It might be faster to replace .* with .*?, depending on whether the level tends to appear closer to the beginning or the end of the line.
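
Combining those suggestions, the metric could look roughly like this (a sketch only, not a drop-in replacement; keep %{TIMESTAMP_ISO8601} at the front if you want the match tied to a timestamped line):

- help: ovsdb-server log entries
  labels:
    level: '{{.level}}'
  match: 'ovsdb-server.*?%{LEVEL:level}'
  name: benchmark_ovsdb_logs
  type: counter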

raokrutarth (Author) commented:

An optional flag like --max-rss=1G for the exporter could help. Once the cap is reached, it would drop older buffered lines to make room for new ones. Just a suggestion.

fstab added a commit that referenced this issue Apr 18, 2019
fstab (Owner) commented Apr 18, 2019

I didn't add a buffer limit so far because dropping log lines makes grok_exporter unreliable: you might miss fatal log messages.

On the other hand, if you have to restart grok_exporter you also might miss messages. So an optional flag is probably the lesser evil. I will look into it.

In the meantime I added a log message: if log lines are written faster than they are processed, grok_exporter will print a warning to the console.

raokrutarth (Author) commented Apr 19, 2019

Thanks for taking a look. I am not really watching grok_exporter's own log messages, because my environment can't take any action regardless of what they say. I also noticed that the memory spikes are growing faster. I can't verify whether the number of log messages has changed, but I thought the data might be useful.

[screenshot]

fstab (Owner) commented Apr 28, 2019

I started implementing a limit on the number of lines in the line buffer. Specifying a number of lines is not as convenient as specifying a maximum amount of memory, but it's easier to implement and better than running out of memory. I will comment here when it's done.

fstab (Owner) commented May 1, 2019

I implemented a config option for the input section in the configuration file:

input:
    type: file
    path: /var/log/messages
    max_lines_in_buffer: 100000

The example above would drop the buffered lines once the buffer reaches 100000 log lines. This is not ideal, because you lose 100000 log lines in monitoring, but it's better than running out of memory and crashing.

You need to build grok_exporter from source code to get this feature. It will be part of the next release.

fstab added a commit that referenced this issue May 1, 2019
raokrutarth (Author) commented:

Thanks!

Does the release script still need go version < 1.12?

If so, I'll have to wait for the release, since I'm using Go 1.12 and haven't figured out how to use two versions of Go simultaneously in my environment.

fstab (Owner) commented May 3, 2019

The release script works with go 1.12.

raokrutarth (Author) commented:

I can confirm the issue is resolved with the new fixes. The graph shows usage after the binary was swapped at 19:00.

[screenshot]

raokrutarth (Author) commented:

I started to see that a few counters are not getting incremented, even though tail -f on the log file shows these messages arriving at a fairly fast rate. I'm using https://grokconstructor.appspot.com/do/match to test my filter, and it shows the whole line matching.

fstab (Owner) commented May 10, 2019

@raokrutarth could you provide an example of a log line and the counter? I can have a look at why it's not incremented.

raokrutarth (Author) commented May 11, 2019

Re-posted the comment on #58. The memory issues are resolved with 0.2.8 and max_lines_in_buffer in the config.
