This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-870: Add filtering by packet payload to the pcap query #541

Closed
wants to merge 10 commits into from


cestella
Member

@cestella cestella commented Apr 21, 2017

Contributor Comments

Currently we have the ability to filter packets in the pcap query tool by header information (src/dest ip/port). We should be able to filter by binary regex on the packets themselves.

Probably the state of the art and the goal to get to here is integration with Yara, but I'd like to iterate toward that solution for a couple of reasons:

  • Yara is hard to integrate with in our stack.
    • It's C and, while the yara-java project does exist, it would make the build a bit of a pain and no longer platform agnostic (i.e. you'd have to build certain modules against the machines that you're running in the cluster). There are paths through that for sure, but it's more than I wanted to tackle just now.
  • The core abstraction of the obvious integration point, yara-java, is running Yara over a file, not a byte array. This would necessitate taking the performance penalty of JNI AND writing out every packet to a temporary file, then deleting it, in the MR job. I did not deem that a sensible approach.
  • Yara is a whole language, similar to Stellar. The point of integration would be as a proper org.apache.metron.pcap.filter.PcapFilter, not as a portion of an existing one.

That led me to look for a stop-gap that was simpler and had the following characteristics:

  • Worked within Java easily
  • Was permissively licensed
  • Functioned on byte arrays
  • Could do both hex regex as well as interpreting the byte array as a string (similar to Yara)

byteseek (an all-Java regex library that functions on byte arrays, not files) fit the bill without taking on the full cost of a Yara integration, and it fit within our core abstractions better.

As such, the approach that I took is to provide the capability in both of the packet filters that we currently have in place:

  • Fixed via a new command line option --packet_filter or -pf wherein you pass the binary regex.
    • This would restrict to a single pattern
  • Query via a new Stellar function BYTEARRAY_MATCHER(pattern, packet)
    • This allows you to compose multiple filters with logic operations to get closer to a Yara-esque feel via Stellar
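
The composition that the query form enables can be pictured as ordinary predicate logic over the raw packet bytes. The sketch below is not Metron or Stellar code: `PacketPredicateSketch`, `contains`, `hex`, and `samplePacket` are hypothetical names, and the naive scan only stands in for whatever searcher the real filter uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;

class PacketPredicateSketch {
    /** Hypothetical helper: true when needle occurs anywhere in the packet (naive scan). */
    static Predicate<byte[]> contains(byte[] needle) {
        return packet -> {
            for (int start = 0; start + needle.length <= packet.length; start++) {
                int i = 0;
                while (i < needle.length && packet[start + i] == needle[i]) {
                    i++;
                }
                if (i == needle.length) {
                    return true;
                }
            }
            return false;
        };
    }

    /** Hypothetical helper: decodes an even-length hex string such as "1F90" into bytes. */
    static byte[] hex(String digits) {
        byte[] out = new byte[digits.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Made-up packet bytes for the demo: 0x1F 0x90 followed by an ASCII request line. */
    static byte[] samplePacket() {
        byte[] payload = "GET /api/v1/persist/wizard-data".getBytes(StandardCharsets.US_ASCII);
        byte[] packet = new byte[2 + payload.length];
        packet[0] = 0x1F;
        packet[1] = (byte) 0x90;
        System.arraycopy(payload, 0, packet, 2, payload.length);
        return packet;
    }

    public static void main(String[] args) {
        // Two single-pattern filters joined with a logical AND, as a Stellar query might do.
        Predicate<byte[]> filter = contains(hex("1F90"))
            .and(contains("persist".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(filter.test(samplePacket()));
    }
}
```

The fixed option corresponds to a single `contains(...)` predicate, while the query option corresponds to combining several of them with `and`, `or`, and `negate`.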

I have made a follow-on task to integrate with Yara at METRON-871.

Testing plan will be in the comments.

Pull Request Checklist

Thank you for submitting a contribution to Apache Metron (Incubating).
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.

In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:

For all changes:

  • Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?

  • Have you included steps or a guide to how the change may be verified and tested manually?

  • Have you ensured that the full suite of tests and checks have been executed in the root incubating-metron folder via:

    mvn -q clean integration-test install && build_utils/verify_licenses.sh 
    
  • Have you written or updated unit tests and/or integration tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, then run the following commands and then verify the changes via site-book/target/site/index.html:

    cd site-book
    bin/generate-md.sh
    mvn site:site
    

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

@cestella
Member Author

It appears that byteseek has an LGPL dependency in GNU Trove, a primitive collections library. As a stopgap, I:

  • excluded the dependency
  • provided a translation layer inside of metron-pcap based on mahout-math (which is, in turn, loosely based on Colt, a primitive collections library)

This is less than ideal in the long-run, so I submitted a PR against the library and asked for a new release.

@cestella
Member Author

On second thought, the part of the byteseek library that we're using doesn't use trove, so we can safely exclude it without providing the colt translation layer. I have taken it out.

@nishihatapalmer

nishihatapalmer commented Apr 21, 2017

Hi, byteseek author here. I notice you're using the standard BoyerMooreHorspool searcher. Just to let you know that the HorspoolFinalFlagSearcher offers much better performance than the standard Horspool searcher - about 1.5 to 2 times faster in my tests - and uses no more memory or pre-processing time.

It's essentially the same algorithm as Horspool, but re-arranges it a bit so that it only needs to verify a match when the algorithm detects that the last character matches the pattern, which can be done using a negative shift value (the "final flag"). Since verification is expensive to perform on each loop iteration, this performs much better in general.
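
As a rough illustration of the final-flag idea described above, here is a minimal, self-contained sketch. It is not byteseek's `HorspoolFinalFlagSearcher` source; `HorspoolFinalFlagSketch` and `indexOf` are hypothetical names, and the details are one plausible reading of the description.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HorspoolFinalFlagSketch {
    /** Returns the first index of needle in haystack, or -1 if it does not occur. */
    static int indexOf(byte[] haystack, byte[] needle) {
        int m = needle.length;
        if (m == 0) {
            return 0;
        }
        // Standard Horspool shift table, built from every pattern byte except the last.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[needle[i] & 0xFF] = m - 1 - i;
        }
        // The "final flag": store the last byte's shift negated, so a non-positive
        // table entry means "the byte under the pattern's last position matches".
        int last = needle[m - 1] & 0xFF;
        shift[last] = -shift[last];

        int pos = m - 1; // index in haystack aligned with the pattern's last byte
        while (pos < haystack.length) {
            int s = shift[haystack[pos] & 0xFF];
            if (s > 0) {
                pos += s; // fast path: no verification needed
                continue;
            }
            // Last byte matches; verify the remaining m - 1 bytes right to left.
            int start = pos - (m - 1);
            int i = m - 2;
            while (i >= 0 && haystack[start + i] == needle[i]) {
                i--;
            }
            if (i < 0) {
                return start;
            }
            pos += -s; // recover the real shift from the flagged entry
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] haystack = "/api/v1/persist/wizard-data".getBytes(StandardCharsets.US_ASCII);
        byte[] needle = "persist".getBytes(StandardCharsets.US_ASCII);
        System.out.println(indexOf(haystack, needle)); // prints 8
    }
}
```

The inner verification loop only runs when the table lookup is non-positive, which is the source of the speedup over verifying on every iteration.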

@cestella
Member Author

Hey, thanks for that feedback @nishihatapalmer ! I adjusted to use the suggested searcher. I did have one more question: I'm looking to document the possible regexes available for binary search. Is there any documentation on the restrictions or capabilities of the compiled regexes?

@cestella cestella closed this Apr 21, 2017
@cestella cestella reopened this Apr 21, 2017
@nishihatapalmer

nishihatapalmer commented Apr 21, 2017

There is a slightly out of date (note to self: update this!) syntax document at:
https://github.com/nishihatapalmer/byteseek/blob/master/src/main/java/net/byteseek/parser/regex/Regular%20Expression%20syntax.txt

It gives an overview of most of the syntax, but some of it is only usable by full regexes, not sequence matchers. In particular it can only accept syntax which leads to a fixed length expression, so these are excluded:

*  zero to many
+ one to many
() groups
{n,m} n to m copies.
 X | Y alternatives.

Shorthands defined in this document also do not currently function properly (e.g. [ascii]).

Finally note that inversion ^ functions differently to most regular expression syntaxes. The token being inverted is the following token, not the entire set. So most regex would say something like [^ 01 02 03] meaning every byte except 01, 02 and 03. In byteseek this would be ^[ 01 02 03], as you are inverting the set. [ ^01 02 03] is also valid - except you are now specifying a set containing everything but 01 (which already covers 02 and 03).
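
To make the two readings of `^` concrete, the sketch below models byte sets with `java.util.BitSet`. It does not call byteseek; `InversionSketch` and its methods are hypothetical names used only to illustrate the set arithmetic described above.

```java
import java.util.BitSet;

class InversionSketch {
    /** Builds a set containing exactly the given byte values (0..255). */
    static BitSet setOf(int... values) {
        BitSet s = new BitSet(256);
        for (int v : values) {
            s.set(v);
        }
        return s;
    }

    /** Complements a set over the full byte range 0..255. */
    static BitSet invert(BitSet s) {
        BitSet r = (BitSet) s.clone();
        r.flip(0, 256);
        return r;
    }

    /** Models ^[01 02 03]: the whole set is inverted. */
    static BitSet invertedSet() {
        return invert(setOf(0x01, 0x02, 0x03));
    }

    /** Models [^01 02 03]: only 01 is inverted, then 02 and 03 are added to the set. */
    static BitSet setWithInvertedFirstMember() {
        BitSet r = invert(setOf(0x01));
        r.or(setOf(0x02, 0x03));
        return r;
    }

    public static void main(String[] args) {
        System.out.println(invertedSet().cardinality());                // prints 253
        System.out.println(setWithInvertedFirstMember().cardinality()); // prints 255
    }
}
```

The first form excludes three byte values; the second excludes only 01, because 02 and 03 are already members of "everything but 01".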

It's fairly easy to create a different parser if necessary, but most of byteseek regex syntax is fairly standard - but oriented towards bytes rather than strings as the default atomic unit.

Any questions please feel free to ask (and I really must update the syntax document!).

Regards,

Matt.

@nishihatapalmer

When you say use regexes, do you mean use the regex syntax to create fixed length sequences, or do you mean use full regex functionality? Full regex exists using NFAs and DFAs, but needs testing, as I haven't looked at that part of byteseek for quite some time.

@cestella
Member Author

Currently, I'm using the SequenceMatcher to compile a matching expression and then using a searcher to search in the byte array for that expression (code is here). From what I can tell, this isn't using the NFA or DFA under the hood, is that wrong?

@nishihatapalmer

Correct, there's no NFA or DFA under the hood of the SequenceMatcher.

You can create sequences using the regex syntax using the SequenceMatcherCompiler, as long as only syntax which creates fixed length sequences is used. So you can match bytes (hex values), sets of bytes [01 02 03], any bytes ., bitmasks, strings and case insensitive strings, but not wildcards or optional bytes. For example:

01 ^02 'a string' [f0-ff] 'another string' [0a 0d]
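
One way to picture why only fixed-length syntax is allowed: a sequence is simply one set of permitted byte values per position, so a match at a given offset is a straight position-by-position check. The sketch below is an illustration under that assumption, not byteseek's implementation; all names in it are hypothetical, and the string elements of the example are left out for brevity.

```java
import java.util.BitSet;

class FixedLengthSequenceSketch {
    /** A position that accepts exactly the listed byte values, e.g. 01 or [0a 0d]. */
    static BitSet only(int... values) {
        BitSet s = new BitSet(256);
        for (int v : values) {
            s.set(v);
        }
        return s;
    }

    /** A position that accepts an inclusive range, e.g. [f0-ff]. */
    static BitSet range(int lo, int hi) {
        BitSet s = new BitSet(256);
        s.set(lo, hi + 1);
        return s;
    }

    /** An inverted position, e.g. ^02. */
    static BitSet not(BitSet s) {
        BitSet r = (BitSet) s.clone();
        r.flip(0, 256);
        return r;
    }

    /** True when every position of the sequence accepts the byte at offset + i. */
    static boolean matchesAt(BitSet[] sequence, byte[] data, int offset) {
        if (offset < 0 || offset + sequence.length > data.length) {
            return false;
        }
        for (int i = 0; i < sequence.length; i++) {
            if (!sequence[i].get(data[offset + i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    /** Naive search: first offset where the sequence matches, or -1. */
    static int indexOf(BitSet[] sequence, byte[] data) {
        for (int offset = 0; offset + sequence.length <= data.length; offset++) {
            if (matchesAt(sequence, data, offset)) {
                return offset;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Models the byte-level part of the example: 01 ^02 [f0-ff] [0a 0d]
        BitSet[] sequence = { only(0x01), not(only(0x02)), range(0xF0, 0xFF), only(0x0A, 0x0D) };
        byte[] data = { 0x00, 0x01, 0x03, (byte) 0xF5, 0x0D };
        System.out.println(indexOf(sequence, data)); // prints 1
    }
}
```

Operators such as `*`, `+`, or `X | Y` would make the number of positions variable, which this per-position representation cannot express.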

The RegexCompiler can accept the full regex syntax including *, +, ?, and it does create NFAs - but this isn't tested.

@nishihatapalmer

I have placed an updated syntax document as a markdown file syntax.md in the main byteseek project. Hopefully it's a little clearer than the old text file. As ever, any questions...

https://github.com/nishihatapalmer/byteseek/blob/master/syntax.md

@nishihatapalmer

nishihatapalmer commented Apr 23, 2017

Also added a SequenceMatcher syntax which tells you what is valid for a SequenceMatcher, rather than the full syntax. It can be found here:

https://github.com/nishihatapalmer/byteseek/blob/master/sequencesyntax.md

I notice you link to the old text file in one of your commits for the syntax definition. Would probably be worth updating to this file, since there's no ambiguity over what is valid and what isn't.

@cestella cestella closed this Apr 24, 2017
@cestella cestella reopened this Apr 24, 2017
@cestella
Member Author

Testing Plan

Preliminaries

  • Please perform the following tests on the full-dev vagrant environment.
  • Set an environment variable to indicate METRON_HOME:
    export METRON_HOME=/usr/metron/0.4.0

Ensure Data Flows into the Indices

Ensure that with a basic full-dev we get data into the elasticsearch
indices and into HDFS.

(Optional) Free Up Space on the virtual machine

First, let's free up some headroom on the virtual machine. If you are running this on a
multinode cluster, you would not have to do this.

  • Stop and disable Metron in Ambari
  • Kill monit via service monit stop
  • From ambari, stop the metron service
  • Kill the sensors via service sensor-stubs stop

Install and start pycapa

# set env vars
export PYCAPA_HOME=/opt/pycapa
export PYTHON27_HOME=/opt/rh/python27/root

# Install these packages via yum (RHEL, CentOS)
yum -y install epel-release centos-release-scl 
yum -y install "@Development tools" python27 python27-scldevel python27-python-virtualenv libpcap-devel libselinux-python

# Setup directories
mkdir $PYCAPA_HOME && chmod 755 $PYCAPA_HOME

#Grab pycapa from git 
cd ~
git clone https://github.com/apache/incubator-metron.git
cp -R ~/incubator-metron/metron-sensors/pycapa* $PYCAPA_HOME

# Create virtualenv under $PYCAPA_HOME so the activate step below finds it
export LD_LIBRARY_PATH="/opt/rh/python27/root/usr/lib64"
${PYTHON27_HOME}/usr/bin/virtualenv ${PYCAPA_HOME}/pycapa-venv

# Build it
cd ${PYCAPA_HOME}/pycapa
# activate the virtualenv
source ${PYCAPA_HOME}/pycapa-venv/bin/activate
pip install -r requirements.txt
python setup.py install

# Run it
cd ${PYCAPA_HOME}/pycapa-venv/bin
pycapa --producer --topic pcap -i eth1 -k node1:6667

Ensure pycapa can write to HDFS

  • Ensure that /apps/metron/pcap exists and can be written to by the
    storm user. If not, then:
sudo su - hdfs
hadoop fs -mkdir -p /apps/metron/pcap
hadoop fs -chown metron:hadoop /apps/metron/pcap
hadoop fs -chmod 775 /apps/metron/pcap
exit
  • Start the pcap topology via $METRON_HOME/bin/start_pcap_topology.sh
  • Watch the topology in the Storm UI and kill the packet capture utility from before once the number of packets ingested is over 3k. Ensure that at least 3 files exist on HDFS by running hadoop fs -ls /apps/metron/pcap

Note that if your MR job fails because of a lack of user directory for root, then the following will create the directory appropriately:

sudo su - hdfs
hadoop fs -mkdir /user/root
hadoop fs -chown root:hadoop /user/root
hadoop fs -chmod 755 /user/root
exit

Regression Test

Fixed

  • Run a fixed pcap query by executing a command similar to the following:
$METRON_HOME/bin/pcap_query.sh fixed --ip_dst_port 8080 -st "20170425" -df "yyyyMMdd"
  • Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. pcap-data-20160617160549737+0000.pcap
  • Copy the files to your local machine and verify you can open them in Wireshark. Open the files and ensure that they contain only packets to the destination port of 8080.

Stellar

  • Run a Stellar pcap query by executing a command similar to the following:
$METRON_HOME/bin/pcap_query.sh query --query "ip_dst_port == 8080" -st "20170425" -df "yyyyMMdd"
  • Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. pcap-data-20160617160549737+0000.pcap
  • Copy the files to your local machine and verify you can open them in Wireshark. Open the files and ensure that they contain only packets to the destination port of 8080.

Binary Payload Search : Strings

Fixed

  • Run a fixed pcap query by executing a command similar to the following:
$METRON_HOME/bin/pcap_query.sh fixed --ip_dst_port 8080 --packet_filter "\`persist\`" -st "20170425" -df "yyyyMMdd"
  • Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. pcap-data-20160617160549737+0000.pcap
  • Copy the files to your local machine and verify you can open them in Wireshark. Open the files and ensure that they contain only packets to the destination port of 8080 and are API calls involving /api/v1/persist/wizard-data in Ambari.

Stellar

  • Run a Stellar pcap query by executing a command similar to the following:
$METRON_HOME/bin/pcap_query.sh query --query "ip_dst_port == 8080 && BYTEARRAY_MATCHER('\`persist\`', packet)" -st "20170425" -df "yyyyMMdd"
  • Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. pcap-data-20160617160549737+0000.pcap
  • Copy the files to your local machine and verify you can open them in Wireshark. Open the files and ensure that they contain only packets to the destination port of 8080 and are API calls involving /api/v1/persist/wizard-data in Ambari.

Binary Payload Search : Hex Regex

Stellar

NOTE: To the astute reader, 0x1F90 in hex is 8080 in decimal
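
For anyone who wants to check the arithmetic, the short sketch below converts the port number to the two big-endian bytes that appear on the wire. `PortHexSketch` and its helpers are hypothetical names, not part of Metron. Because the pattern is searched across the whole packet, the same two bytes could in principle also occur as a source port or inside the payload, so treat the hex form as an approximation of the `ip_dst_port == 8080` filter rather than an exact equivalent.

```java
import java.nio.ByteBuffer;

class PortHexSketch {
    /** Encodes a port as two bytes in network (big-endian) order. */
    static byte[] portBytes(int port) {
        // ByteBuffer is big-endian by default, which matches network byte order.
        return ByteBuffer.allocate(2).putShort((short) port).array();
    }

    /** Renders bytes as upper-case hex with no separators or 0x prefix. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(portBytes(8080)));      // prints 1F90
        System.out.println(Integer.parseInt("1F90", 16)); // prints 8080
    }
}
```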

  • Run a Stellar pcap query by executing a command similar to the following:
$METRON_HOME/bin/pcap_query.sh query --query "BYTEARRAY_MATCHER('1F90', packet) && BYTEARRAY_MATCHER('\`persist\`', packet)" -st "20170425" -df "yyyyMMdd"
  • Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. pcap-data-20160617160549737+0000.pcap
  • Copy the files to your local machine and verify you can open them in Wireshark. Open the files and ensure that they contain only packets to the destination port of 8080 and are API calls involving /api/v1/persist/wizard-data in Ambari.

@justinleet
Contributor

Taking a first glance through this and had a couple comments before I dig in a little further and spin things up.

Can you flesh out the unit tests around the binary filtering? The converted unit tests are helpful, but it seems like there's probably more cases than are covered by the couple additions involving.

I'm not familiar with the library, but does this work if I provide things in a '0x' hex format? E.g. your example has BYTEARRAY_MATCHER('1F90', packet) but will it still work if it's BYTEARRAY_MATCHER('0x1F90', packet)? I don't think it's necessary that it does, but I looked at the example and immediately suspected I'd have prepended '0x' to '1F90' out of pure habit.

Filtering can be done both by the packet header as well as via a binary regular expression
which can be run on the packet payload itself. This filter can be specified via:
* The `-pf` or `--packet_filter` options for the fixed query filter
* The `BYTEARRAY_MATCH(pattern, data)` Stellar function.
Contributor


Looks like this is supposed to be BYTEARRAY_MATCHER

Member Author


yep


@Override
public void configure(Iterable<Map.Entry<String, String>> config) {
for (Map.Entry<String, String> kv : config) {
if (kv.getKey().equals(Constants.Fields.DST_ADDR.getName())) {
System.out.println("Processing: " + kv.getKey() + " => " + kv.getValue());
Contributor


Are printlns appropriate here? Is there a reason it's not a Logger call?

Member Author


It's in a MR job, those printlns are getting captured in the map stdout log. I can make them logger logs though if it's more comfortable.

Contributor


Nah, leave them alone. I forgot about that bit of annoyance in MR jobs.

@cestella
Member Author

Yeah, no 0x for specifying the regex syntax. I'll update the function docs to point to the syntax guide. Also, I'm going to give a bit of a better effort at the testing too. Good catches all around.

@justinleet
Contributor

+1, pending Travis. Was able to spin this up on full dev and everything worked as expected. Thanks a lot for the contribution.

@asfgit asfgit closed this in bf2528f Apr 27, 2017
3 participants