METRON-870: Add filtering by packet payload to the pcap query #541
Conversation
It appears that byteseek has an LGPL dependency in gnu trove, a primitive collections library. As a stopgap, I:
This is less than ideal in the long run, so I submitted a PR against the library and asked for a new release.
On second thought, the part of the
Hi, byteseek author here. I notice you're using the standard BoyerMooreHorspool searcher. Just to let you know that the HorspoolFinalFlagSearcher offers much better performance than the standard Horspool searcher - about 1.5 to 2 times faster in my tests - and uses no more memory or pre-processing time. It's essentially the same algorithm as Horspool, but re-arranges it a bit so that it only needs to verify a match when the algorithm detects that the last character matches the pattern, which can be done using a negative shift value (the "final flag"). Since verification is expensive to perform on each loop iteration, this performs much better in general.
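The "final flag" idea described above can be sketched in plain Java. To be clear, this is an illustration of the technique, not byteseek's actual implementation, and the class name is made up: the shift table stores a negative value for the pattern's last byte, so the scan loop only pays for full verification when the last byte lines up, at the cost of a single sign check per iteration.

```java
// Sketch of a Horspool search with the "final flag" optimization:
// a negative shift value marks the pattern's final byte, so a match
// is only verified when the last byte of the window matches.
class FinalFlagHorspool {

    // Returns the index of the first match of pattern in data, or -1.
    static int search(byte[] pattern, byte[] data) {
        final int m = pattern.length;
        final int[] shift = new int[256];
        for (int i = 0; i < 256; i++) {
            shift[i] = m;                        // default: shift by the whole pattern
        }
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }
        // "Final flag": negate the shift for the last pattern byte, so seeing
        // it triggers verification instead of an unconditional shift.
        shift[pattern[m - 1] & 0xFF] = -shift[pattern[m - 1] & 0xFF];

        int pos = m - 1;                         // index of the window's last byte
        while (pos < data.length) {
            int s = shift[data[pos] & 0xFF];
            if (s > 0) {
                pos += s;                        // last byte can't match: skip ahead
            } else {
                // Last byte matched; verify the rest of the pattern.
                int start = pos - (m - 1);
                int i = 0;
                while (i < m && pattern[i] == data[start + i]) {
                    i++;
                }
                if (i == m) {
                    return start;
                }
                pos += -s;                       // shift by the un-negated value
            }
        }
        return -1;
    }
}
```

The verification step runs only when the window's final byte equals the pattern's final byte; on all other iterations the loop is a single table lookup and an addition, which is where the speedup over the textbook Horspool loop comes from.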
Hey, thanks for that feedback @nishihatapalmer ! I adjusted to use the suggested searcher. I did have one more question: I'm looking to document the possible regexes available for binary search. Is there any documentation on the restrictions or capabilities of the compiled regexes?
There is a slightly out-of-date (note to self: update this!) syntax document at: It gives an overview of most of the syntax, but some of it is only usable by full regexes, not sequence matchers. In particular, it can only accept syntax which leads to a fixed-length expression, so these are excluded:
Shorthands defined in this document also do not currently function properly (e.g. [ascii]). Finally, note that inversion ^ functions differently to most regular expression syntaxes. The token being inverted is the following token, not the entire set. So most regex flavors would say something like [^ 01 02 03], meaning every byte except 01, 02 and 03. In byteseek this would be ^[ 01 02 03], as you are inverting the set. [ ^01 02 03] is also valid, except you are now specifying a set containing everything but 01 (which already covers 02 and 03). It's fairly easy to create a different parser if necessary, but most of the byteseek regex syntax is fairly standard, oriented towards bytes rather than strings as the default atomic unit. Any questions, please feel free to ask (and I really must update the syntax document!). Regards, Matt.
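The inversion semantics above can be made concrete by modeling a set expression as a table of which byte values match (plain Java, not byteseek; the class and method names are invented for illustration). `^[ 01 02 03]` inverts the whole set, so 253 of the 256 byte values match; `[ ^01 02 03]` is a set whose members are "everything but 01" plus 02 and 03, so 255 values match.

```java
import java.util.Arrays;

// Model a byte-set expression as a boolean[256]: set[b] is true
// if byte value b matches the expression.
class InversionDemo {

    static int count(boolean[] set) {
        int n = 0;
        for (boolean b : set) {
            if (b) n++;
        }
        return n;
    }

    // ^[ 01 02 03]: invert the whole set {01, 02, 03}.
    static boolean[] invertedSet() {
        boolean[] s = new boolean[256];
        Arrays.fill(s, true);
        s[0x01] = s[0x02] = s[0x03] = false;   // 253 values remain
        return s;
    }

    // [ ^01 02 03]: a set containing (everything but 01), plus 02 and 03.
    static boolean[] setWithInvertedMember() {
        boolean[] s = new boolean[256];
        Arrays.fill(s, true);                  // ^01 contributes every byte except 01...
        s[0x01] = false;
        s[0x02] = true;                        // ...and 02, 03 are already covered
        s[0x03] = true;
        return s;
    }
}
```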
When you say use regexes, do you mean use the regex syntax to create fixed length sequences, or do you mean use full regex functionality? Full regex exists using NFAs and DFAs, but needs testing, as I haven't looked at that part of byteseek for quite some time. |
Currently, I'm using the SequenceMatcher to compile a matching expression and then using a searcher to search the byte array for that expression (code is here). From what I can tell, this isn't using the NFA or DFA under the hood; is that wrong?
Correct, there's no NFA or DFA under the hood of the SequenceMatcher. You can create sequences using the regex syntax using the SequenceMatcherCompiler, as long as only syntax which creates fixed-length sequences is used. So you can match bytes (hex values), sets of bytes `[01 02 03]`, any bytes `.`, bitmasks, strings and case-insensitive strings, but not wildcards or optional bytes. For example: `01 ^02 'a string' [f0-ff] 'another string' [0a 0d]`. The RegexCompiler can accept the full regex syntax including `*`, `+`, `?`, and it does create NFAs - but this isn't tested.
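What a fixed-length sequence boils down to can be sketched in plain Java (again an illustration of the concept, not byteseek's internals; the names are invented): each position in the sequence is a set of acceptable byte values, so an expression like `01 ^02 [0a 0d]` is three positional sets of sizes 1, 255 and 2, and matching is just a positional lookup with no backtracking.

```java
import java.util.Arrays;

// A toy fixed-length sequence matcher: positions[i][b] is true if
// byte value b is acceptable at position i of the sequence.
class FixedLengthMatcher {
    private final boolean[][] positions;

    FixedLengthMatcher(boolean[][] positions) {
        this.positions = positions;
    }

    // A set matching only the given byte values, e.g. [0a 0d].
    static boolean[] only(int... values) {
        boolean[] s = new boolean[256];
        for (int v : values) s[v] = true;
        return s;
    }

    // A set matching everything except the given byte values, e.g. ^02.
    static boolean[] allBut(int... values) {
        boolean[] s = new boolean[256];
        Arrays.fill(s, true);
        for (int v : values) s[v] = false;
        return s;
    }

    // True if the sequence matches data starting at offset.
    boolean matchesAt(byte[] data, int offset) {
        if (offset < 0 || offset + positions.length > data.length) {
            return false;
        }
        for (int i = 0; i < positions.length; i++) {
            if (!positions[i][data[offset + i] & 0xFF]) {
                return false;
            }
        }
        return true;
    }
}
```

Because every position consumes exactly one byte, the total length is fixed at compile time, which is exactly why wildcards and optional bytes are excluded from the SequenceMatcher syntax.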
I have placed an updated syntax document as a markdown file syntax.md in the main byteseek project. Hopefully it's a little clearer than the old text file. As ever, any questions... https://github.com/nishihatapalmer/byteseek/blob/master/syntax.md |
Also added a SequenceMatcher syntax which tells you what is valid for a SequenceMatcher, rather than the full syntax. It can be found here: https://github.com/nishihatapalmer/byteseek/blob/master/sequencesyntax.md I notice you link to the old text file in one of your commits for the syntax definition. Would probably be worth updating to this file, since there's no ambiguity over what is valid and what isn't. |
Testing Plan

Preliminaries
Ensure Data Flows from the Indices

Ensure that with a basic full-dev we get data into elasticsearch.

(Optional) Free Up Space on the Virtual Machine

First, let's free up some headroom on the virtual machine. If you are running this on a
Install and start pycapa
Ensure pycapa can write to HDFS
Note that if your MR job fails because of a lack of user directory for
Regression Test

Fixed
Stellar
Binary Payload Search: Strings

Fixed
Stellar
Binary Payload Search: Hex Regex

Stellar

NOTE: To the astute reader, 0x1F90 in hex is 8080 in decimal.
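The note above is easy to check: port 8080 is 0x1F90 in hex, so in a packet's big-endian port field it appears as the two bytes `1F 90`, which is why a hex regex matching the port is written with those bytes. A quick sketch (the helper name is invented for illustration):

```java
// Convert a TCP/UDP port number to the two big-endian bytes it
// occupies in a packet header, e.g. 8080 -> {0x1F, 0x90}.
class PortBytes {
    static byte[] bigEndianPort(int port) {
        return new byte[]{ (byte) (port >> 8), (byte) (port & 0xFF) };
    }
}
```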
Taking a first glance through this and had a couple comments before I dig in a little further and spin things up. Can you flesh out the unit tests around the binary filtering? The converted unit tests are helpful, but it seems like there are probably more cases than are covered by the couple of additions involving. I'm not familiar with the library, but does this work if I provide things in a '0x' hex format? E.g. your example has
> Filtering can be done both by the packet header as well as via a binary regular expression
> which can be run on the packet payload itself. This filter can be specified via:
> * The `-pf` or `--packet_filter` options for the fixed query filter
> * The `BYTEARRAY_MATCH(pattern, data)` Stellar function.
Looks like this is supposed to be BYTEARRAY_MATCHER
yep
```java
@Override
public void configure(Iterable<Map.Entry<String, String>> config) {
  for (Map.Entry<String, String> kv : config) {
    if (kv.getKey().equals(Constants.Fields.DST_ADDR.getName())) {
      System.out.println("Processing: " + kv.getKey() + " => " + kv.getValue());
```
Are printlns appropriate here? Is there a reason it's not a Logger call?
It's in an MR job; those printlns are getting captured in the map stdout log. I can make them logger calls, though, if it's more comfortable.
Nah, leave them alone. I forgot about that bit of annoyance in MR jobs.
Yeah, no
+1, pending Travis. Was able to spin this up in full-dev and everything worked as expected. Thanks a lot for the contribution.
Contributor Comments
Currently we have the ability to filter packets in the pcap query tool by header information (src/dest ip/port). We should be able to filter by binary regex on the packets themselves.
Probably the state of the art and the goal to get to here is integration with Yara, but I'd like to iterate toward that solution for a couple of reasons:
`org.apache.metron.pcap.filter.PcapFilter`, not as a portion of an existing one.

That led me to look for a stop-gap that was simpler and had the following characteristics:
byteseek (an all-Java regex library that functions on byte arrays, not files) fit the bill without taking on all of a full-on Yara integration, and fit within our core abstractions better.
As such, the approach that I took is to provide the capability in both of the packet filters that we currently have in place:
* The fixed query filter, via `--packet_filter` or `-pf`, wherein you pass the binary regex.
* The `BYTEARRAY_MATCHER(pattern, packet)` Stellar function.
I have made a follow-on task to integrate with Yara at METRON-871.
Testing plan will be in the comments.
Pull Request Checklist
Thank you for submitting a contribution to Apache Metron (Incubating).
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.
In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:
For all changes:
For code changes:
Have you included steps to reproduce the behavior or problem that is being changed or addressed?
Have you included steps or a guide to how the change may be verified and tested manually?
Have you ensured that the full suite of tests and checks have been executed in the root incubating-metron folder via:
Have you written or updated unit tests and or integration tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?
For documentation related changes:
Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, then run the following commands and verify the changes via `site-book/target/site/index.html`:

Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.