Skip to content
This repository has been archived by the owner on Jan 18, 2024. It is now read-only.

Massive delay at barnyard2-1.11 startup #54

Closed
MrSecure opened this issue Nov 9, 2012 · 7 comments
Closed

Massive delay at barnyard2-1.11 startup #54

MrSecure opened this issue Nov 9, 2012 · 7 comments

Comments

@MrSecure
Copy link

MrSecure commented Nov 9, 2012

I'm running snort + barnyard2 on FreeBSD, in a multi-site, multi-interface setup.
From a high level, there are multiple exit points being monitored (COA, COB, COC, etc.),
and multiple snort processes (one per interface) watching various aspects of the
installation's netorks (QQQ, RRR, SSS, etc). All of the RRR sensors feed into a
single DB for RRR; likewise for QQQ, SSS, etc. We have been working on a transition
from MySQL to PostgreSQL, and thus build barnyard2 with support for BOTH.

I upgraded our FreeBSD sensors to Snort 2.9.3.1 recently, and that pulled in barnyard2-1.10
from the FreeBSD ports tree. For the sites using PostgreSQL, the process died during
the initialization phase. Based on some reading of google groups & other mailing lists,
this appeared to be related to building the metadata cache in barnyard2, and seemed to
be fixed in 1.11 for several people. So, I duct taped the 1.11 code into the ports
framework, got a clean build, and installed the homegrown barnyard2-1.11 on my sensor.

As others have reported on several mailing lists, I noticed the massive spike at
startup as the cache is built (85-100% of one CPU for ~10 minutes). Once I let that
clear, I could see the monitored unified2 files were growing, but no alerts were
showing up in the DB for some sensors (RRR), but things were fine for others (QQQ, SSS).
All of this was observed on a single machine. I decided to take a fresh look in
the morning, since the alerts were not lost, just spooled to disk & needing to be
processed.

Chasing this down in the morning, I found the following:

Nov  8 19:46:57 lacrosse barnyard2[13455]: Daemon initialized, signaled parent pid: 13360
Nov  8 22:25:28 lacrosse barnyard2[13455]:         --== Initialization Complete ==--
Nov  8 22:25:28 lacrosse barnyard2[13455]: Waiting for new data

OVER 2.5 HOURS to complete initialization!

I'd like to help understand and fix whatever is causing this, but a 2 hour startup delay is not a viable situation. We run a large rule set, and update frequently, so this delay has a large impact on our operations.

I've included the things I believe to be critical to this report below; please let me know if there is anything else that I can provide to help.

MrSecure

[user@lacrosse]-~ > grep 13455 /var/log/all.log
Nov 8 19:46:57 lacrosse barnyard2[13455]: Daemon initialized, signaled parent pid: 13360
Nov 8 19:46:57 lacrosse barnyard2[13455]: PID path stat checked out ok, PID path set to /var/log/snort/RRR/
Nov 8 19:46:57 lacrosse barnyard2[13455]: Writing PID "13455" to file "/var/log/snort/RRR//barnyard2_spanRRR.pid"
Nov 8 19:49:48 lacrosse barnyard2[13455]: database: [SynchronizeEventId()]: Problems executing [SELECT MAX(cid) FROM icmphdr WHERE sid='4';]
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: compiled support for (postgresql)
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: configured to use mysql
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: schema version = 107
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: host = db1
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: user = snort_RRR
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: database name = SnortRRR
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: sensor name = RRR-COB
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: sensor id = 4
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: sensor cid = 1231
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: data encoding = hex
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: detail level = full
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: ignore_bpf = yes
Nov 8 22:25:28 lacrosse barnyard2[13455]: database: using the "log" facility
Nov 8 22:25:28 lacrosse barnyard2[13455]:
Nov 8 22:25:28 lacrosse barnyard2[13455]: --== Initialization Complete ==--
Nov 8 22:25:28 lacrosse barnyard2[13455]: Barnyard2 initialization completed successfully (pid=13455)
Nov 8 22:25:28 lacrosse barnyard2[13455]: Using waldo file '/var/log/snort/RRR/bylog.waldo': spool directory = /var/log/snort/RRR spool filebase = unified2 time_stamp = 1352412703 record_idx = 0
Nov 8 22:25:28 lacrosse barnyard2[13455]: Opened spool file '/var/log/snort/RRR/unified2.1352412703'
Nov 8 22:25:28 lacrosse barnyard2[13455]: Waiting for new data
[user@lacrosse]-~ > uname -rm
8.1-RELEASE-p7 amd64
[user@lacrosse]-~ > /usr/local/bin/barnyard2 --version

______ -> Barnyard2 <-
/ ,,_ \ Version 2.1.11 (Build 315)
|o" )~| By Ian Firns (SecurixLive): http://www.securixlive.com/

[user@lacrosse]-~ > grep -v ^# /usr/local/snort/etc/config-RRR/by2-COB.conf | tr -s "\n"

config reference_file: /usr/local/snort/etc/reference.config
config classification_file: /usr/local/snort/etc/classification.config
config gen_file: /usr/local/snort/etc/gen-msg.map
config sid_file: /usr/local/snort/etc/sid-msg.map
config daemon
config set_gid: snort
config set_uid: snort
input unified2
output database: log, mysql, host=db1 user=snort_RRR password=MOOT dbname=SnortRRR sensor_name=RRR-COB ignore_bpf=1
[user@lacrosse]-~ >
[user@lacrosse]-~ > wc -l /usr/local/snort/etc/sid-msg.map
29487 /usr/local/snort/etc/sid-msg.map
[user@lacrosse]-~ > wc -l /usr/local/snort/etc/gen-msg.map
391 /usr/local/snort/etc/gen-msg.map
[user@lacrosse]-~ > wc -l /usr/local/snort/etc/reference.config
16 /usr/local/snort/etc/reference.config
[user@lacrosse]-~ > wc -l /usr/local/snort/etc/classification.config
67 /usr/local/snort/etc/classification.config
[user@lacrosse]-~ >
[user@lacrosse]-~ > snort --version

,,_ -> Snort! <-
o" )~ Version 2.9.3.1 IPv6 GRE (Build 40) FreeBSD
'''' By Martin Roesch & The Snort Team: http://www.snort.org/snort/snort-team
Copyright (C) 1998-2012 Sourcefire, Inc., et al.
Using libpcap version 1.2.1
Using PCRE version: 8.30 2012-02-04
Using ZLIB version: 1.2.3

[user@lacrosse]-~ >

@binf
Copy link
Collaborator

binf commented Nov 9, 2012

How long has it been since you upgraded to 2-1.1x?

Before running 2-1.1X did you remove all entries in your reference and sig_reference table?

Whats the count of your signature table?

I am under the impression that you might have a huge sig_reference table from 2-1.9 serie and that could impact
the initialization process.

If you have multiple RRR barnyard2 instance stop all of em and once the delete in the database is done, start one,it might take a few to start up but then it should stabilize for other instance.

Also you have to consider the network latency betewen your barnyard2 instance and the database.

Also since your running pgsql you might want to cleanup those tables index and do a vacuum on reference,sig_reference.

and you can allways mail firnsy and me directly if you want the exchange to be faster

Cheers,
-elz

@MrSecure
Copy link
Author

MrSecure commented Nov 9, 2012

Making the 2-1.09 -> 2-1.1x transition triggered the problem ... the upgrade was yesterday.

I did no cleanup as part of the upgrade ... the typical warnings generated by FreeBSD's ports system made no mention of any cleanup needed. Effectively, barnyard's upgrade "came along for the ride" with snort's upgrade.

sig_reference table has 6026 entries
signature table has 3812 entries
reference table has 131350 entries

I did the stop barnyard; truncate table sig_reference; start barnyard dance for the SSS processes (PostgreSQL), and it made no real difference. That said, I have NOT restarted barnyard since ... as you can imagine, I'm a bit apprehensive.

FWIW, ping latency is under 1 ms between the most distant sensor (lacrosse) and the DB server; and the path is at least 1 Gbit/s between them.

So, aside from loosing out on the fully populated cache, is reverting to 2-1.09 going to cause me huge problems?

An update to the "published" release notes on securixlive.com with some guidance on what adjunct tasks should be done as part of the 1.9 -> 1.10+ would be tremendously helpful.

Thank you so much for the prompt response.

I'll take another shot at the stop/truncate/start this weekend to see if this is an upgrade cost or a restart cost.

@binf
Copy link
Collaborator

binf commented Nov 9, 2012

In your previous comment you mentionned RRR and now you mention SSS?

Now we do not distribute package thus its probably missing the README which if you would have had it
you would have seen the information.

As of latency that should be fair enough but i asked because it can play a role here.

You will loose proper transaction isolation, faster logging of events, some spooler fix.
You can still use 2-1.9 if you think that to insert a event you need 10 database call rather than one transaction.

Also note that in the postgresql world you will need to vacuum your tables if you delete alot of rows from an index.
There is a autovacuum process but somtimes depending on your setting it can take a long time to process so you
can allways manually vaccum.

And mysql is very different from postgresql tunning wise so i highly recommend that you drop in pgsql manual.

There is a definitively added cost yto startup and this is known but that added cost commes from benefits, now 2h seem's to me that there is something wrong somewhere, and that reference table with 131k entries ...might be part of it. Since part of the initialization cache will try to synchronize information built from *map files with the database.

Anyhow. keep us posted, and make sure you work on your good sensor ;)

-elz

@MrSecure
Copy link
Author

MrSecure commented Nov 9, 2012

Sorry for the confusion.... we're mid-migration, so it's a mess.
SSS - uses postgresql - I did the stop / truncate / start cycle, and it did not seem to make any difference.
RRR - uses mysql - have NOT done the stop / truncate / start cycle yet, based on there being no difference on SSS.

I will do the stop / truncate / start process on RRR & post the results back here.

@binf
Copy link
Collaborator

binf commented Nov 9, 2012

Its obvious that if the metrics where not the same for SSS and RRR that you didin't see the difference.

Let us know when you cleanup the database how it goes.

But i would encourage you to use our mailing list instead since the audiance is wider.

And somtimes this can enlighten some people who "google" for some similar issues.

-elz

@MrSecure
Copy link
Author

OK ... I stopped RRR's barnyard on all sensors, performed a 'TRUNCATE TABLE sig_reference' and 'TRUNCATE TABLE reference' for the RRR database (mysql), and then started the RRR barnyard process on each sensor, one at a time. The first one took 8 minutes; the following ones took around 4 minutes.

Also, doing a parallel barnyard restart on all RRR sensors, takes around 4 minutes.

After doing all of those, the RRR database contains 212 sig_reference records, and 24695 reference records.

Thank you for all of the help.

As a "frustrated user", I have a recommendation... changes this significant to the operation of any piece of software should use the version number to point out that care must be taken in the update process. Although the end result (data in and data out) in this case is not changed, this change is still a major functionality change. One other avenue that might be usable for changes like this is the use of an annoying config flag. For example, include a new config directive "upgrade_to_1_10_cleaned_reference_tables" which:

  • defaults to false in the compiled form (unless a compile-time flag is given)
  • is set true in the sample config file (right next to the documentation about exactly what needs to be done)
  • causes barnyard2 to log an informative error message & die when it's false

This would have saved me many hours trying to figure out why a simple, small version bump, with no notes in /usr/ports/UPDATING or in the output of 'make reinstall' (both FreeBSD specific).

Again, thank you so much for the excellent software, prompt response, and generous help in sorting out the problems I created for myself by not reading the release notes.

@binf
Copy link
Collaborator

binf commented Nov 10, 2012

Well i understand the frustration, but we do not support PORTS/Packages being made out of release.

I will allways encourage people who software in production to build it them self and package it them self if needed, especialy if its external software to the core OS, but that is just my opinion.

I would also like to highlight that the 2-1.10 release post made to the concerned mailing lists (snort-users,barnyard2-users) explicitely mentioned the issue as being one possible cause for slow start.

- Completely re-written database plugin for performance optimisation against the original DB schema.

NOTE: If you have intentions of running this new version we
highly recommended you to clean two databases table for better
performance: reference and sig_reference, not doing so will not
break anything but could slow the startup caching process).

I would also like to note that even tho there was some behavior change at initialization, the changes maintain 100% backward compatibility hence the "minor" version change.

Initialization behavior has changed and i would encourage you to join our mailing list, so that if you have issue / question we can promptly respond and enchance the response time.

On the other hand, you did not loose any events they are theorically just in backlog and you should be up to speed
within minutes of their initialization.

And if you encounter any issue, do not hesitate to drop on the ml's.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants