METRON-870: Add filtering by packet payload to the pcap query #541
Conversation
It appears that byteseek has an LGPL dependency in gnu trove, a primitive collections library. As a stopgap, I:
This is less than ideal in the long run, so I submitted a PR against the library and asked for a new release.
On second thought, the part of the
Hi, byteseek author here. I notice you're using the standard BoyerMooreHorspool searcher. Just to let you know that the HorspoolFinalFlagSearcher offers much better performance than the standard Horspool searcher - about 1.5 to 2 times faster in my tests - and uses no more memory or pre-processing time. It's essentially the same algorithm as Horspool, but re-arranges it a bit so that it only needs to verify a match when the algorithm detects that the last character matches the pattern, which can be done using a negative shift value (the "final flag"). Since verification is expensive to perform on each loop iteration, this performs much better in general.
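The "final flag" idea described above can be sketched in plain Java. To be clear, this is an illustration of the technique, not byteseek's actual implementation, and the class name is made up: the shift table stores a negative value for the pattern's last byte, so the scan loop only pays for full verification when the last byte lines up, at the cost of a single sign check per iteration.

```java
// Sketch of a Horspool search with the "final flag" optimization:
// a negative shift value marks the pattern's final byte, so a match
// is only verified when the last byte of the window matches.
class FinalFlagHorspool {

    // Returns the index of the first match of pattern in data, or -1.
    static int search(byte[] pattern, byte[] data) {
        final int m = pattern.length;
        final int[] shift = new int[256];
        for (int i = 0; i < 256; i++) {
            shift[i] = m;                        // default: shift by the whole pattern
        }
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }
        // "Final flag": negate the shift for the last pattern byte, so seeing
        // it triggers verification instead of an unconditional shift.
        shift[pattern[m - 1] & 0xFF] = -shift[pattern[m - 1] & 0xFF];

        int pos = m - 1;                         // index of the window's last byte
        while (pos < data.length) {
            int s = shift[data[pos] & 0xFF];
            if (s > 0) {
                pos += s;                        // last byte can't match: skip ahead
            } else {
                // Last byte matched; verify the rest of the pattern.
                int start = pos - (m - 1);
                int i = 0;
                while (i < m && pattern[i] == data[start + i]) {
                    i++;
                }
                if (i == m) {
                    return start;
                }
                pos += -s;                       // shift by the un-negated value
            }
        }
        return -1;
    }
}
```

The verification step runs only when the window's final byte equals the pattern's final byte; on all other iterations the loop is a single table lookup and an addition, which is where the speedup over the textbook Horspool loop comes from.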
Hey, thanks for that feedback @nishihatapalmer ! I adjusted to use the suggested searcher. I did have one more question: I'm looking to document the possible regexes available for binary search. Is there any documentation on the restrictions or capabilities of the compiled regexes?
There is a slightly out-of-date (note to self: update this!) syntax document at: It gives an overview of most of the syntax, but some of it is only usable by full regexes, not sequence matchers. In particular, it can only accept syntax which leads to a fixed-length expression, so these are excluded:
Shorthands defined in this document also do not currently function properly (e.g. [ascii]). Finally, note that inversion ^ functions differently to most regular expression syntaxes. The token being inverted is the following token, not the entire set. So most regex flavors would say something like [^ 01 02 03], meaning every byte except 01, 02 and 03. In byteseek this would be ^[ 01 02 03], as you are inverting the set. [ ^01 02 03] is also valid, except you are now specifying a set containing everything but 01 (which already covers 02 and 03). It's fairly easy to create a different parser if necessary, but most of the byteseek regex syntax is fairly standard, oriented towards bytes rather than strings as the default atomic unit. Any questions, please feel free to ask (and I really must update the syntax document!). Regards, Matt.
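The inversion semantics above can be made concrete by modeling a set expression as a table of which byte values match (plain Java, not byteseek; the class and method names are invented for illustration). `^[ 01 02 03]` inverts the whole set, so 253 of the 256 byte values match; `[ ^01 02 03]` is a set whose members are "everything but 01" plus 02 and 03, so 255 values match.

```java
import java.util.Arrays;

// Model a byte-set expression as a boolean[256]: set[b] is true
// if byte value b matches the expression.
class InversionDemo {

    static int count(boolean[] set) {
        int n = 0;
        for (boolean b : set) {
            if (b) n++;
        }
        return n;
    }

    // ^[ 01 02 03]: invert the whole set {01, 02, 03}.
    static boolean[] invertedSet() {
        boolean[] s = new boolean[256];
        Arrays.fill(s, true);
        s[0x01] = s[0x02] = s[0x03] = false;   // 253 values remain
        return s;
    }

    // [ ^01 02 03]: a set containing (everything but 01), plus 02 and 03.
    static boolean[] setWithInvertedMember() {
        boolean[] s = new boolean[256];
        Arrays.fill(s, true);                  // ^01 contributes every byte except 01...
        s[0x01] = false;
        s[0x02] = true;                        // ...and 02, 03 are already covered
        s[0x03] = true;
        return s;
    }
}
```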
When you say use regexes, do you mean use the regex syntax to create fixed length sequences, or do you mean use full regex functionality? Full regex exists using NFAs and DFAs, but needs testing, as I haven't looked at that part of byteseek for quite some time. |
Currently, I'm using the SequenceMatcher to compile a matching expression and then using a searcher to search the byte array for that expression (code is here). From what I can tell, this isn't using the NFA or DFA under the hood; is that wrong?
Correct, there's no NFA or DFA under the hood of the SequenceMatcher. You can create sequences using the regex syntax using the SequenceMatcherCompiler, as long as only syntax which creates fixed-length sequences is used. So you can match bytes (hex values), sets of bytes `[01 02 03]`, any bytes `.`, bitmasks, strings and case-insensitive strings, but not wildcards or optional bytes. For example: `01 ^02 'a string' [f0-ff] 'another string' [0a 0d]`. The RegexCompiler can accept the full regex syntax including `*`, `+`, `?`, and it does create NFAs - but this isn't tested.
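What a fixed-length sequence boils down to can be sketched in plain Java (again an illustration of the concept, not byteseek's internals; the names are invented): each position in the sequence is a set of acceptable byte values, so an expression like `01 ^02 [0a 0d]` is three positional sets of sizes 1, 255 and 2, and matching is just a positional lookup with no backtracking.

```java
import java.util.Arrays;

// A toy fixed-length sequence matcher: positions[i][b] is true if
// byte value b is acceptable at position i of the sequence.
class FixedLengthMatcher {
    private final boolean[][] positions;

    FixedLengthMatcher(boolean[][] positions) {
        this.positions = positions;
    }

    // A set matching only the given byte values, e.g. [0a 0d].
    static boolean[] only(int... values) {
        boolean[] s = new boolean[256];
        for (int v : values) s[v] = true;
        return s;
    }

    // A set matching everything except the given byte values, e.g. ^02.
    static boolean[] allBut(int... values) {
        boolean[] s = new boolean[256];
        Arrays.fill(s, true);
        for (int v : values) s[v] = false;
        return s;
    }

    // True if the sequence matches data starting at offset.
    boolean matchesAt(byte[] data, int offset) {
        if (offset < 0 || offset + positions.length > data.length) {
            return false;
        }
        for (int i = 0; i < positions.length; i++) {
            if (!positions[i][data[offset + i] & 0xFF]) {
                return false;
            }
        }
        return true;
    }
}
```

Because every position consumes exactly one byte, the total length is fixed at compile time, which is exactly why wildcards and optional bytes are excluded from the SequenceMatcher syntax.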
I have placed an updated syntax document as a markdown file syntax.md in the main byteseek project. Hopefully it's a little clearer than the old text file. As ever, any questions... https://github.com/nishihatapalmer/byteseek/blob/master/syntax.md |
Also added a SequenceMatcher syntax which tells you what is valid for a SequenceMatcher, rather than the full syntax. It can be found here: https://github.com/nishihatapalmer/byteseek/blob/master/sequencesyntax.md I notice you link to the old text file in one of your commits for the syntax definition. Would probably be worth updating to this file, since there's no ambiguity over what is valid and what isn't. |
Testing Plan

Preliminaries
Ensure Data Flows from the Indices

Ensure that with a basic full-dev we get data into elasticsearch.

(Optional) Free Up Space on the Virtual Machine

First, let's free up some headroom on the virtual machine. If you are running this on a
Install and start pycapa
Ensure pycapa can write to HDFS
Note that if your MR job fails because of a lack of user directory for
Regression Test

Fixed
Stellar
Binary Payload Search: Strings

Fixed
Stellar
Binary Payload Search: Hex Regex

Stellar

NOTE: To the astute reader, 0x1F90 in hex is 8080 in decimal.
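The note above is easy to check: port 8080 is 0x1F90 in hex, so in a packet's big-endian port field it appears as the two bytes `1F 90`, which is why a hex regex matching the port is written with those bytes. A quick sketch (the helper name is invented for illustration):

```java
// Convert a TCP/UDP port number to the two big-endian bytes it
// occupies in a packet header, e.g. 8080 -> {0x1F, 0x90}.
class PortBytes {
    static byte[] bigEndianPort(int port) {
        return new byte[]{ (byte) (port >> 8), (byte) (port & 0xFF) };
    }
}
```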
Taking a first glance through this and had a couple comments before I dig in a little further and spin things up. Can you flesh out the unit tests around the binary filtering? The converted unit tests are helpful, but it seems like there are probably more cases than are covered by the couple of additions involving. I'm not familiar with the library, but does this work if I provide things in a '0x' hex format? E.g. your example has
> Filtering can be done both by the packet header as well as via a binary regular expression
> which can be run on the packet payload itself. This filter can be specified via:
> * The `-pf` or `--packet_filter` options for the fixed query filter
> * The `BYTEARRAY_MATCH(pattern, data)` Stellar function.
Looks like this is supposed to be BYTEARRAY_MATCHER
yep
```java
@Override
public void configure(Iterable<Map.Entry<String, String>> config) {
  for (Map.Entry<String, String> kv : config) {
    if (kv.getKey().equals(Constants.Fields.DST_ADDR.getName())) {
      System.out.println("Processing: " + kv.getKey() + " => " + kv.getValue());
```
Are printlns appropriate here? Is there a reason it's not a Logger call?
It's in an MR job; those printlns are getting captured in the map stdout log. I can make them logger calls, though, if it's more comfortable.
Nah, leave them alone. I forgot about that bit of annoyance in MR jobs.
Yeah, no
+1, pending Travis. Was able to spin this up in full-dev and everything worked as expected. Thanks a lot for the contribution.
Contributor Comments
Currently we have the ability to filter packets in the pcap query tool by header information (src/dest ip/port). We should be able to filter by binary regex on the packets themselves.
Probably the state of the art and the goal to get to here is integration with Yara, but I'd like to iterate toward that solution for a couple of reasons:
`org.apache.metron.pcap.filter.PcapFilter`, not as a portion of an existing one.

That led me to look for a stop-gap that was simpler and had the following characteristics:
byteseek (an all-Java regex library that functions on byte arrays, not files) fit the bill without taking on all of a full-on Yara integration, and fit within our core abstractions better.
As such, the approach that I took is to provide the capability in both of the packet filters that we currently have in place:
* The fixed query filter, via `--packet_filter` or `-pf`, wherein you pass the binary regex.
* The `BYTEARRAY_MATCHER(pattern, packet)` Stellar function.
I have made a follow-on task to integrate with Yara at METRON-871.
Testing plan will be in the comments.
Pull Request Checklist
Thank you for submitting a contribution to Apache Metron (Incubating).
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.
In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:
For all changes:
For code changes:
Have you included steps to reproduce the behavior or problem that is being changed or addressed?
Have you included steps or a guide to how the change may be verified and tested manually?
Have you ensured that the full suite of tests and checks have been executed in the root incubating-metron folder via:
Have you written or updated unit tests and or integration tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?
For documentation related changes:
Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, then run the following commands and verify the changes via `site-book/target/site/index.html`:

Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.