
Exporter enters high CPU state and stops processing log lines #58

Closed
kfreem02 opened this issue Apr 29, 2019 · 22 comments

@kfreem02

kfreem02 commented Apr 29, 2019

Issue observed with 0.2.6, 0.2.7, and a source build from master on April 25.

I have configured 3 instances of grok_exporter on CentOS. 2 instances run without issue, but 1 will only export metrics for a few minutes before entering a high CPU state. In the high CPU state the exporter still responds to scrape requests. The backlog grows at 3-4 rows per second. While the "bad" instance operates normally, its grok_exporter_lines_processing_time_microseconds_total is < 500 µs; this metric drops to 0 once the instance enters the high CPU state.

What debug information can I collect to help investigate this issue?

@kfreem02
Author

kfreem02 commented Apr 29, 2019

Config for the instance that enters high CPU state

global:
  config_version: 2
input:
  type: file
  path: /data/all_f5_asm_block.log
  readall: false # Read from the beginning of the file? False means we start at the end of the file and read only new lines.
grok:
  patterns_dir: /etc/grok/patterns
  additional_patterns:
metrics:
  - type: counter
    name: bigp_asm_block
    help: Count of ASM blocks by URI
    match: '%{SYSLOGTIMESTAMP:LoggedEventTime} %{HOSTNAME:instance} ASM:unit_hostname="%{HOSTNAME}",management_ip_address="%{IP:HostIP1}",management_ip_address_2="",http_class_name="%{PATH}",web_application_name="%{PATH:WebAppName}",policy_name="%{PATH}",policy_apply_date="%{TIMESTAMP_ISO8601}",violations="%{DATA:Violations}",support_id="%{INT}",request_status="%{DATA:Status}",response_code="%{INT}",ip_client="%{IP:ClientIP}",route_domain="%{DATA}",method="%{WORD}",protocol="%{WORD}",query_string="%{DATA}",x_forwarded_for_header_value="%{DATA}",sig_ids="%{DATA}",sig_names="%{DATA}",date_time="%{TIMESTAMP_ISO8601}",severity="%{LOGLEVEL:Severity}",attack_type="%{DATA:AttackType}",geo_location="%{DATA:GeoLocation}",ip_address_intelligence="%{DATA:IP_Intelligence}",username="%{DATA}",sub_violations="%{DATA:SubViolations}",%{DATA},uri="%{URIPATH:Uri}",%{GREEDYDATA}'
    labels:
      web_app_name: '{{.WebAppName}}'
      status: '{{.Status}}'
      severity: '{{.Severity}}'
      attack_type: '{{.AttackType}}'
      violations: '{{.Violations}}'
      uri: '{{.Uri}}'
      geo_location: '{{.GeoLocation}}'
      ip_intelligence: '{{.IP_Intelligence}}'
server:
  host: 0.0.0.0
  port: 9146

@kfreem02
Author

kfreem02 commented Apr 29, 2019

Screenshot of metrics for the misbehaving service:
[screenshot]

First entry in /var/log/messages after service start:
grok_exporter[35559]: time="2019-04-29T18:13:24Z" level=warning msg="Log lines are written faster than grok_exporter processes them. In the last minute there were constantly more than 1146 log lines in the buffer waiting to be processed. Check the built-in grok_exporter_lines_processing_time_microseconds_total metric to learn which metric takes most of the processing time."

@fstab
Owner

fstab commented May 1, 2019

This sounds like a hard-to-find bug. There are two main parts in grok_exporter: the first part handles file system events and reads log lines from log files; the second part does the pattern matching.

As a first step we need to find out which of those parts is the culprit. Could you configure poll_interval_seconds in your input section? This disables file system event processing and polls the log files every n seconds. It has drawbacks and should not be necessary, but it could give us a hint. If this makes the problem go away, the error is likely in the file system event processing. If the problem is still there with this configuration, the problem is likely in pattern matching.
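
For example, a minimal sketch of your input section with polling enabled; the 5-second interval is just an illustrative value:

input:
  type: file
  path: /data/all_f5_asm_block.log
  readall: false
  poll_interval_seconds: 5 # poll the file every 5 seconds instead of relying on file system events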

@raokrutarth

raokrutarth commented May 8, 2019

I'm seeing the same issue, so I tried setting poll_interval_seconds:
[screenshot: input section with poll_interval_seconds set]

But the metric, which generally fluctuates around 100-500 new entries per second, appears to have stalled: the total stops increasing.

[screenshot: counter graph with the total stalled]

Memory usage appears to be high, so the lines are definitely getting cached/stored/queued, but they are not getting matched.

[screenshot: memory usage graph]

I verified with http://grokconstructor.appspot.com/do/match that the filter produces a full-line match on the message.

@fstab
Owner

fstab commented May 10, 2019

I'd love to reproduce it. Could you post the match and an example line? Maybe if I write a small test that runs the line through the match 500 times/second I can reproduce it.

@raokrutarth

raokrutarth commented May 13, 2019

The logs are pretty large, so I've included a subset (I can't attach a file here; I can upload it elsewhere too):
Logs: https://justpaste.it/7i7qo

Filters with lower-than-expected match counts:

%{TIMESTAMP_ISO8601}\s+%{WORD}\s+xyz-tsdbd\[%{NUMBER}\]:\s+%{DATA}[Rr]eceived\s+new%{DATA}notif%{GREEDYDATA}

%{TIMESTAMP_ISO8601}\s+%{WORD}\s+xyz-tsdbd\[%{NUMBER}\]:\s+%{DATA}(([pP]rocessed%{DATA}notification)|(notification%{DATA}[Pp]rocessed))%{GREEDYDATA}

%{TIMESTAMP_ISO8601}\s+%{WORD}\s+xyz-tsdbd\[%{NUMBER}\]:\s+%{DATA}((dropp%{DATA}notification)|(notification%{DATA}dropp))%{GREEDYDATA}

additional_patterns:

  - DATA .*?
  - GREEDYDATA .*
  - WORD \b\w+\b

@fstab
Owner

fstab commented May 17, 2019

I still cannot reproduce it. I created a config.yaml file with the matches above as match1, match2, and match3, and stdin as input type. Then I copied the log lines into logfile.log, and ran the following to generate traffic:

(while true ; do cat logfile.log ; sleep 0.001 ; done) | grok_exporter -config ./config.yaml

With the Prometheus query rate(match1[5m]) I can see the number of matches per second. This is around 1.2k for match1, 5.1k for match2, and 2.4k for match3. These values are almost constant, even though the throughput is much higher than the 500 entries / second in your experiment. I don't experience any significant changes or lost matches. Memory consumption of grok_exporter is constant.
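
Sketch of that test config. The help texts, patterns_dir location, and port are illustrative assumptions; patterns_dir must provide the standard TIMESTAMP_ISO8601 and NUMBER patterns (for example the patterns directory shipped with the release), and the matches and additional patterns are the ones posted above:

global:
  config_version: 2
input:
  type: stdin
grok:
  patterns_dir: ./patterns # assumption: the logstash patterns bundled with the grok_exporter release
  additional_patterns:
    - DATA .*?
    - GREEDYDATA .*
    - WORD \b\w+\b
metrics:
  - type: counter
    name: match1
    help: received new notification (illustrative)
    match: '%{TIMESTAMP_ISO8601}\s+%{WORD}\s+xyz-tsdbd\[%{NUMBER}\]:\s+%{DATA}[Rr]eceived\s+new%{DATA}notif%{GREEDYDATA}'
  - type: counter
    name: match2
    help: notification processed (illustrative)
    match: '%{TIMESTAMP_ISO8601}\s+%{WORD}\s+xyz-tsdbd\[%{NUMBER}\]:\s+%{DATA}(([pP]rocessed%{DATA}notification)|(notification%{DATA}[Pp]rocessed))%{GREEDYDATA}'
  - type: counter
    name: match3
    help: notification dropped (illustrative)
    match: '%{TIMESTAMP_ISO8601}\s+%{WORD}\s+xyz-tsdbd\[%{NUMBER}\]:\s+%{DATA}((dropp%{DATA}notification)|(notification%{DATA}dropp))%{GREEDYDATA}'
server:
  port: 9144 # grok_exporter's default port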

Do you have any idea how I could reproduce the error?

@raokrutarth

I'm using file as input instead of stdin. Would you recommend using stdin?

@fstab
Owner

fstab commented May 18, 2019

No, I would not recommend it. It's just for debugging.

It would be interesting to know if the error persists with stdin. If switching to stdin solves the problem, then there must be a bug in the file input with poll_interval_seconds. Could you check if the error goes away when you use stdin, maybe with tail -F | grok_exporter ...?
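
The only config change needed for that test is the input section; a minimal sketch, with everything else left as it is:

input:
  type: stdin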

@raokrutarth

Set the input to stdin:
[screenshot: input section set to type: stdin]

Used the command tail -f /var/log/messages | ./grok_exporter -config ./nae-grok-autogen.yaml >& /dev/null & to run the process, and verified the exporter is running on the remote device.

After seeing multiple relevant log messages that should have matched, I verified that http://grokconstructor.appspot.com/do/match#result detects a match on each one.

But the counter below still showed a count of 0:

- help: count of incoming notifications in tsdbd
  match: '%{TIMESTAMP_ISO8601}\s+%{WORD}\s+xyz-tsdbd\[%{NUMBER}\]:\s+%{DATA}[Rr]eceived\s+new%{DATA}notif%{GREEDYDATA}'
  name: benchmark_tsdbd_incoming_notif_log
  type: counter

[screenshot: counter still at 0]

@raokrutarth

I have a fairly high number of counters and use summary metric types frequently as well. Could that have an impact on the matches?

@fstab
Copy link
Owner

fstab commented May 24, 2019

If you see the error with tail -f /var/log/messages | ./grok_exporter... maybe you can find some example log messages, put them into a logfile.log file, and reproduce the same error with (while true ; do cat logfile.log ; sleep 0.001 ; done) | grok_exporter....

Then you could attach the logfile.log file and your config.yaml here so I can reproduce the error on my machine.

@raokrutarth

logfile.log

config.txt

Unable to upload the config with the .yaml extension. Using .yaml during execution.

@raokrutarth

The log file is a very small subset of the actual log.

@fstab
Owner

fstab commented Jun 1, 2019

Thanks for the example files. grok_exporter uses the Oniguruma regular expression library for pattern matching. The grok_exporter release is statically linked to Oniguruma version 5.9.6. In your example, the benchmark_tsdbd_incoming_notif_log metric is 0 with the current grok_exporter release, but if I update Oniguruma to the current version 6.9.2 it works.

I will create a new release with an updated Oniguruma version; then your patterns should work. However, it will take two weeks.

In the meantime, you could either build grok_exporter from source using the latest Oniguruma library, or refactor your patterns so that they work with Oniguruma 5.9.6. However, I cannot tell you exactly what the breaking change between the two Oniguruma versions is.

@raokrutarth

raokrutarth commented Jun 3, 2019

Attempted to run release.sh with the following steps:

  • Started a new container with docker run --volume /ws/raokru/temp:/temp --volume $GOPATH/src:/go/src --network bridge -i -t golang:1.12.5-stretch /bin/bash
  • Downloaded Oniguruma release 6.9.2 in /ws/raokru/temp with wget https://github.com/kkos/oniguruma/releases/tag/v6.9.2
  • Installed the new Oniguruma library with root@9b6a283a32f6:/temp/onig-6.9.2# ./configure && make && make install. Got "Libraries have been installed in: /usr/local/lib" on success.
  • Initialized the grok_exporter Go libraries with git submodule update --init --recursive && go install .
  • Ran release.sh and got the errors below:
root@fd345cd03705:/go/src/github.com/fstab/grok_exporter# git submodule update --init --recursive && go install .
root@fd345cd03705:/go/src/github.com/fstab/grok_exporter# ./release.sh linux-amd64
?       github.com/fstab/grok_exporter  [no test files]
/tmp/go-build110982767/b144/config.test: error while loading shared libraries: libonig.so.5: cannot open shared object file: No such file or directory
FAIL    github.com/fstab/grok_exporter/config   0.002s
/tmp/go-build110982767/b152/config.test: error while loading shared libraries: libonig.so.5: cannot open shared object file: No such file or directory
FAIL    github.com/fstab/grok_exporter/config/v1        0.001s
/tmp/go-build110982767/b155/config.test: error while loading shared libraries: libonig.so.5: cannot open shared object file: No such file or directory
FAIL    github.com/fstab/grok_exporter/config/v2        0.000s
/tmp/go-build110982767/b158/exporter.test: error while loading shared libraries: libonig.so.5: cannot open shared object file: No such file or directory
FAIL    github.com/fstab/grok_exporter/exporter 0.001s
/tmp/go-build110982767/b161/oniguruma.test: error while loading shared libraries: libonig.so.5: cannot open shared object file: No such file or directory
FAIL    github.com/fstab/grok_exporter/oniguruma        0.001s
ok      github.com/fstab/grok_exporter/tailer   (cached)
?       github.com/fstab/grok_exporter/tailer/fswatcher [no test files]
ok      github.com/fstab/grok_exporter/tailer/glob      (cached)
/tmp/go-build110982767/b171/template.test: error while loading shared libraries: libonig.so.5: cannot open shared object file: No such file or directory
FAIL    github.com/fstab/grok_exporter/template 0.000s
root@fd345cd03705:/go/src/github.com/fstab/grok_exporter# 

@fstab
Owner

fstab commented Jun 4, 2019

The release.sh script is used to build the official releases. The official releases use Oniguruma version 5.9.6. If you want to build your own version from source code, you need to install Go and gcc, and build grok_exporter with go install . (this is also described here: https://github.com/fstab/grok_exporter#how-to-build-from-source).

@raokrutarth

The official golang:1.12.5 and golang:1.11.10 Docker images already have Go and gcc installed. From the errors I shared earlier, it appears the issue is related to the static linking, even though Oniguruma release 6.9.2 installed correctly. Please see my steps above.

@fstab
Owner

fstab commented Jun 6, 2019

If you want to build grok_exporter with your own Oniguruma version, please run go install . and not ./release.sh.

You could have a look at the commands in .travis.yml for reference. This runs with every commit and builds grok_exporter with different combinations: Oniguruma 5.9.6 and 6.7.0, static and dynamic linking. Results are here: https://travis-ci.org/fstab/grok_exporter/

I currently have limited Internet access, so I cannot check what's in the golang Docker image. But I can make a release with an updated Oniguruma version next week.

@raokrutarth

Makes sense. It'll be easier for me to wait since I don't have permissions to the /usr/... directories on my machine. Appreciate it!

@fstab
Owner

fstab commented Jun 28, 2019

Sorry that it took so long, but I finally found the time to make release 0.2.8. It uses Oniguruma 6.9.2.

@kfreem02
Author

kfreem02 commented Jul 3, 2019

I have installed 0.2.8 and am no longer seeing the issue. Thank you for the fix!
