packetbeat nightly151201180656 crashing psql #565

Closed
opb1978 opened this issue Dec 18, 2015 · 18 comments

@opb1978

commented Dec 18, 2015

just updated to packetbeat 1.0.1 and checked if the issue #342 is now fixed in this version.

after running for about 10 minutes I got this error in the log file:

2015-12-18T23:07:11.921545+01:00 somehost /usr/bin/packetbeat[13560]: log.go:114: Stacktrace:
/go/src/github.com/elastic/beats/libbeat/logp/log.go:114 (0x48c5c6)
/usr/local/go/src/runtime/asm_amd64.s:437 (0x47d8fe)
/usr/local/go/src/runtime/panic.go:423 (0x44d4f9)
/usr/local/go/src/runtime/panic.go:18 (0x44ba39)
/go/src/github.com/elastic/beats/packetbeat/protos/pgsql/pgsql.go:279 (0x512203)
/go/src/github.com/elastic/beats/packetbeat/protos/pgsql/pgsql.go:610 (0x5146de)
/go/src/github.com/elastic/beats/packetbeat/protos/pgsql/pgsql.go:707 (0x51515d)
/go/src/github.com/elastic/beats/packetbeat/protos/tcp/tcp.go:87 (0x521093)
/go/src/github.com/elastic/beats/packetbeat/protos/tcp/tcp.go:173 (0x5221cd)
/go/src/github.com/elastic/beats/packetbeat/decoder/decoder.go:136 (0x6c8ad1)
/go/src/github.com/elastic/beats/packetbeat/sniffer/sniffer.go:352 (0x5337a9)
/go/src/github.com/elastic/beats/packetbeat/packetbeat.go:212 (0x422f2b)
/usr/local/go/src/runtime/asm_amd64.s:1696 (0x47fc41)

Seems to still be a problem here.

I can do a capture again if needed!

@tsg

Collaborator

commented Dec 18, 2015

@opb1978 it would be really great if you could!

@opb1978

Author

commented Dec 18, 2015

no problem, should I send it again to @andrewkroh ? I would like to gpg encrypt the file before sending...

@tsg

Collaborator

commented Dec 18, 2015

Yeah, if you already have his gpg key, then that would be easiest. Thanks!

@andrewkroh

Member

commented Dec 18, 2015

I think I incorrectly tagged #342 with 1.0.1. I don't think the #494 fix was incorporated into 1.0.1, but is instead tagged with 1.1.0.

You could try the nightly to see if the bug is fixed there. We (mostly @urso) developed the fix based on the PCAP that was provided.

@opb1978

Author

commented Dec 18, 2015

I actually tried the nightly builds once before, but the version string in the nightly builds seems to be wrong and I did not want to repack the Debian package.

Maybe you could have a look into this some time. I will do the repack by hand now:

dpkg: error processing archive packetbeat_nightly.latest_amd64.deb (--install):
parsing file '/var/lib/dpkg/tmp.ci/control' near line 2 package 'packetbeat':
error in 'Version' field string 'nightly151201180656': version number does not start with digit
Errors were encountered while processing:
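
For anyone hitting the same dpkg error: the message says the Version field must start with a digit. A minimal sketch of a check and a workaround follows. The `0.0.0~` prefix and both function names are hypothetical choices for illustration, not what the packaging scripts or the reporter actually used.

```sh
#!/bin/sh
# Succeed if a version string starts with a digit, as dpkg requires.
starts_with_digit() {
    case "$1" in
        [0-9]*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Print a version dpkg should accept. The "0.0.0~" prefix is an
# arbitrary assumption; any prefix starting with a digit would do.
fix_version() {
    if starts_with_digit "$1"; then
        printf '%s\n' "$1"
    else
        printf '0.0.0~%s\n' "$1"
    fi
}

fix_version nightly151201180656
```

A manual repack could then unpack the archive with `dpkg-deb -R packetbeat_nightly.latest_amd64.deb pkg/`, put the new value into the Version field of `pkg/DEBIAN/control`, and rebuild with `dpkg-deb -b pkg/ packetbeat_repacked.deb` (the directory and output file names are placeholders).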

@andrewkroh

Member

commented Dec 18, 2015

Yeah, sorry, that is still an open issue. elastic/beats-packer#40

@opb1978 changed the title from "packetbeat 1.0.1 crashing psql" to "packetbeat nightly151201180656 crashing psql" Dec 21, 2015

@opb1978

Author

commented Dec 21, 2015

@andrewkroh I did a retest with nightly151201180656 and am still having a psql problem. I will send you a download link for the pcap file.

@andrewkroh

Member

commented Dec 21, 2015

We tried your PCAP and were not able to reproduce using the latest build from master. The nightly build that you used is from 2015-12-01 (based on the filename) and the fix was not introduced until 2015-12-10.

We only store the past 2 weeks of nightly builds, so did you possibly use a version that you had downloaded in the past?
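
As a rough illustration of reading the build date from the filename, here is a small sketch. It assumes the nightly version string is `nightly` followed by a `YYMMDDhhmmss` timestamp, which matches the 2015-12-01 reading above but is not documented in this thread; the function names are made up for the example.

```sh
#!/bin/sh
# Assumption: a nightly version is "nightly" + YYMMDDhhmmss.

# Print the build date of a nightly version string as YYYY-MM-DD.
nightly_date() {
    stamp=${1#nightly}
    yy=$(printf '%s' "$stamp" | cut -c1-2)
    mm=$(printf '%s' "$stamp" | cut -c3-4)
    dd=$(printf '%s' "$stamp" | cut -c5-6)
    printf '20%s-%s-%s\n' "$yy" "$mm" "$dd"
}

# Succeed if the nightly was built before the given YYYY-MM-DD date.
built_before() {
    build=$(nightly_date "$1" | sed 's/-//g')
    limit=$(printf '%s' "$2" | sed 's/-//g')
    [ "$build" -lt "$limit" ]
}

if built_before nightly151201180656 2015-12-10; then
    echo "nightly151201180656 ($(nightly_date nightly151201180656)) predates the fix"
fi
```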

@urso

Collaborator

commented Dec 21, 2015

@opb1978 please get the most recent nightly. I checked, and builds up to 2015-12-10 still reproduce the original panic. More recent builds should be fine.

@opb1978

Author

commented Dec 21, 2015

Sorry for the confusion, I will retry with the latest nightly build. You were right: I had downloaded it earlier and repacked the wrong version. Will update here soon!

@opb1978

Author

commented Dec 22, 2015

I did some retesting with the nightly build and got some errors again. @andrewkroh I sent you a download link yesterday.

@andrewkroh

Member

commented Dec 31, 2015

Right after I received the latest PCAP, I gave it a try (but I forgot to update this issue). I was not able to reproduce any panics with it.

urso also tried the PCAP and could not reproduce.

@andrewkroh

Member

commented Dec 31, 2015

We were just chatting about this and @urso came up with a theory that we should investigate further. It might explain why we cannot reproduce it from the PCAP while you are seeing an issue in production. If the pgsql transactions grow larger than 10MB, the stream is dropped. If there is some faulty state management at that point (i.e. the state is not reset properly), this could lead to problems.

@opb1978

Author

commented Dec 31, 2015

Do you need any more tests for tracking down this problem? We could also do some remote testing on our systems.

@urso

Collaborator

commented Jan 4, 2016

Thanks for your help. Unfortunately my theory was wrong.

To track down the issue I need a trace that reliably reproduces it, so I can minify the trace until I can identify the problem.

One can test a pcap in bash/zsh with:

$ packetbeat -e -N -t -I trace.pcap |& grep panic

We can build a small script that repeatedly captures a dump and tests it for a Stacktrace:

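#!/bin/sh

# Interface to capture on and pcap file to write; both can be overridden.
IFC=${1:-eth0}
PCAP=${2:-trace.pcap}

# Replay the pcap through packetbeat and print the number of log lines
# containing "Stacktrace".
check() {
    packetbeat -e -t -N -I "$1" 2>&1 | grep -c "Stacktrace"
}

while true; do
    # optionally use the timeout command instead (not used here, to stay
    # backwards compatible and support OS X)
    tcpdump -i "$IFC" -w "$PCAP" 'tcp port 5432 or tcp port 5431' &
    job=$!
    sleep 60
    kill "$job"; wait "$job"

    echo "check for Stacktrace"
    count=$(check "$PCAP")
    if [ "$count" -ge 1 ]; then
        echo "found Stacktrace. Quitting"
        break
    fi
done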

This script captures a trace for 60 seconds and then checks whether the trace generates an error by running packetbeat with -t -N -I trace.pcap. -N and -I guarantee that this packetbeat instance reads packets from the trace file only and does not forward any events to elasticsearch/logstash. You can run the script next to your running packetbeat instance (memory/disk/CPU will still be used to create the trace). Update the check function and time intervals if required.

@opb1978

Author

commented Jan 4, 2016

I have been running this script for hours now with no errors. If I start packetbeat again normally, the problem occurs after a few minutes.

I have been capturing on interface "any" because this is how packetbeat would be running. I have put the script into a screen, maybe it will produce the problem after some time.

Just a guess: maybe the problem occurs while transferring to elasticsearch? As I disabled the normal process (it was causing too many errors and SMS), we could start the replay of the pcap file with transfer to elasticsearch enabled. I can easily remove this sheds again.

@urso

Collaborator

commented Jan 4, 2016

Hmm... the bug seems to be hiding. The problem is that if we run with '-t', we alter timestamps and timing behavior.

So another option would be to modify the script to:

  1. remove '-t -N' from the check function => transactions are sent to elasticsearch + timing is more similar to the original capture.
  2. increase the capture duration: sleep $(($DURATION * 60))
  3. use interface 'any' (using any with tcpdump will not set the devices into promiscuous mode) or list all devices with '-i name'

Doing changes 1 and 2, the script will capture traffic for DURATION minutes and afterwards check the pcap for DURATION minutes.
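
Put together, changes 1 to 3 could look roughly like the sketch below. It is only an illustration under a few assumptions: DURATION is taken as an optional third argument with an arbitrary default of 10 minutes, and the helper names `capture_seconds`, `count_stacktraces` and `main` are invented here. The loop is wrapped in `main` so the helpers can be tried on their own.

```sh
#!/bin/sh
# Sketch of the modified capture-and-check loop (changes 1 to 3).
# DURATION is in minutes; the default of 10 is an arbitrary assumption.

IFC=${1:-any}
PCAP=${2:-trace.pcap}
DURATION=${3:-10}

# Convert a duration in minutes to seconds.
capture_seconds() {
    echo $(($1 * 60))
}

# Count the lines on stdin that mention a stack trace.
count_stacktraces() {
    grep -c "Stacktrace"
}

# Replay the pcap without -t and -N, so events go to the configured
# output and timing stays closer to the original capture.
check() {
    packetbeat -e -I "$1" 2>&1 | count_stacktraces
}

main() {
    while true; do
        tcpdump -i "$IFC" -w "$PCAP" 'tcp port 5432 or tcp port 5431' &
        job=$!
        sleep "$(capture_seconds "$DURATION")"
        kill "$job"; wait "$job"

        echo "check for Stacktrace"
        count=$(check "$PCAP")
        if [ "$count" -ge 1 ]; then
            echo "found Stacktrace. Quitting"
            break
        fi
    done
}
```

To actually run the loop, add `main` as the last line of the script and start it with, for example, the arguments `any trace.pcap 10`.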

@urso

Collaborator

commented Jan 22, 2016

Good news, I've finally been sent a trace that reproduces the error. I put quite some effort into hardening the pgsql parser today. See #825
