This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-2274 Flatfile loader and summarizer mapreduce mode broken #1525

Closed
wants to merge 2 commits into from

Conversation

mmiklavc
Contributor

@mmiklavc mmiklavc commented Oct 3, 2019

Contributor Comments

https://issues.apache.org/jira/browse/METRON-2274

Test plan will come in the comments

Pull Request Checklist

For all changes:

  • Is there a JIRA ticket associated with this PR? If not, one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?

  • Have you included steps or a guide to how the change may be verified and tested manually?

  • Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:

    mvn -q clean integration-test install && dev-utilities/build-utils/verify_licenses.sh 
    
  • n/a Have you written or updated unit tests and/or integration tests to verify your changes?

  • n/a If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • n/a Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, run the following commands and then verify the changes via site-book/target/site/index.html:

    cd site-book
    mvn site
    
  • n/a Have you ensured that any documentation diagrams have been updated, along with their source files, using draw.io? See Metron Development Guidelines for instructions.

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

@mmiklavc
Contributor Author

mmiklavc commented Oct 3, 2019

Test Plan

Taken from:

  1. Flatfile loader - METRON-682: Unify and Improve the Flat File Loader #432 (comment)
  2. Flatfile summarizer - https://github.com/apache/metron/tree/master/use-cases/typosquat_detection#summarize

Preliminaries

  • Spin up the dev environment for CentOS 6 or 7
  • Running as root is fine
  • Root user needs a home dir in HDFS. You can do that as follows:
sudo -u hdfs hdfs dfs -mkdir /user/root
sudo -u hdfs hdfs dfs -chown root:root /user/root
  • Download the Alexa top 1m data set
wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
unzip top-1m.csv.zip
  • Stage import file
head -n 10000 top-1m.csv > top-10k.csv
hdfs dfs -put top-10k.csv /tmp
  • Truncate the HBase enrichment table
echo "truncate 'enrichment'" | hbase shell

Test the flatfile loader in MR mode

  • Create an extractor.json for the CSV data by editing extractor.json and pasting in these contents:
{
  "config" : {
    "columns" : {
      "domain" : 1,
      "rank" : 0
    },
    "indicator_column" : "domain",
    "type" : "alexa",
    "separator" : ","
  },
  "extractor" : "CSV"
}
  • Import from HDFS via MR
# import data into hbase 
$METRON_HOME/bin/flatfile_loader.sh -i /tmp/top-10k.csv -t enrichment -c t -e ./extractor.json -m MR
# count data written and verify it's 10k
echo "count 'enrichment'" | hbase shell
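
To make the extractor.json settings concrete, here is a minimal Python sketch of how a config like this could map one line of the Alexa CSV (rank, then domain) to named fields and pick the indicator. It is an illustration only, not Metron's CSV extractor: the extract function and the returned tuple shape are assumptions made for this example.

```python
import csv
import io

# Mirrors the "config" section of extractor.json from the test plan.
CONFIG = {
    "columns": {"domain": 1, "rank": 0},
    "indicator_column": "domain",
    "type": "alexa",
    "separator": ",",
}


def extract(line, config=CONFIG):
    """Hypothetical helper: map one CSV line to (indicator, type, fields)."""
    reader = csv.reader(io.StringIO(line), delimiter=config["separator"])
    tokens = next(reader)
    # Name each column by its zero-based position, as "columns" declares.
    fields = {name: tokens[idx] for name, idx in config["columns"].items()}
    # The indicator is the value of the column named by "indicator_column".
    return fields[config["indicator_column"]], config["type"], fields


indicator, enrichment_type, fields = extract("1,google.com")
print(indicator, enrichment_type, fields)
```

With the 10k-line input, each line yields one indicator, which is why the test plan expects a count of 10k in the enrichment table.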

Test the flatfile summarizer in MR mode

  • Create an extractor-count.json file and paste the following:
{
  "config" : {
    "columns" : {
       "rank" : 0,
       "domain" : 1
    },
    "value_transform" : {
       "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "0L",
    "state_update" : {
       "state" : "state + LENGTH( DOMAIN_TYPOSQUAT( domain ))"
    },
    "state_merge" : "REDUCE(states, (s, x) -> s + x, 0)",
    "separator" : ","
  },
  "extractor" : "CSV"
}
  • Create the summary from HDFS via MR
$METRON_HOME/bin/flatfile_summarizer.sh -i /tmp/top-10k.csv -e ~/extractor-count.json -p 5 -om CONSOLE -m MR
  • Verify you see a count in the output similar to the following:
Processing /root/top-10k.csv
19/10/03 21:19:56 WARN resolver.BaseFunctionResolver: Using System classloader
Processed 9999 - \
3478276
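
The state_init, state_update and state_merge settings in extractor-count.json describe a fold over each portion of the input followed by a merge of the partial states. The Python sketch below illustrates that shape under stated assumptions: score is a hypothetical stand-in for LENGTH(DOMAIN_TYPOSQUAT(domain)), remove_tld is only a rough stand-in for DOMAIN_REMOVE_TLD, the transform is assumed to run before the filter, and the way the input is split into chunks is invented for the example. It does not reproduce Metron's implementation or the numbers in the output above.

```python
from functools import reduce


def remove_tld(domain):
    # Rough stand-in for DOMAIN_REMOVE_TLD: drop the last dot-separated label.
    return domain.rsplit(".", 1)[0] if "." in domain else ""


def score(domain):
    # Hypothetical stand-in for LENGTH(DOMAIN_TYPOSQUAT(domain)).
    return len(domain)


def summarize_partition(lines):
    state = 0  # state_init: "0L"
    for line in lines:
        _rank, domain = line.split(",", 1)
        domain = remove_tld(domain)        # value_transform
        if len(domain) > 0:                # value_filter
            state = state + score(domain)  # state_update
    return state


def summarize(lines, partitions):
    # Assumed splitting scheme; each chunk is folded independently.
    chunks = [lines[i::partitions] for i in range(partitions)]
    states = [summarize_partition(chunk) for chunk in chunks]
    return reduce(lambda s, x: s + x, states, 0)  # state_merge


rows = [
    "1,google.com",
    "2,youtube.com",
    "3,facebook.com",
    "4,baidu.com",
    "5,wikipedia.org",
    "6,localhost",
]
print(summarize(rows, 5))
```

Because the merge is a plain sum, the result does not depend on how many chunks the input is split into, which is the property that lets the same config run in local or MR mode.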

@tigerquoll
Contributor

tigerquoll commented Oct 4, 2019

Confirmed working, with some minor tweaks to the test plan: the -i option seems to take a local file path rather than an HDFS path.

[root@node1 tmp]# $METRON_HOME/bin/flatfile_summarizer.sh -i /tmp/top-10k.csv -e ./extractor-count.json -p 5 -om CONSOLE -m MR
19/10/04 01:19:48 WARN extractor.TransformFilterExtractorDecorator: Unable to setup zookeeper client - zk_quorum url not provided. **This will limit some Stellar functionality**
Processing /tmp/top-10k.csv
19/10/04 01:19:49 WARN resolver.BaseFunctionResolver: Using System classloader
Processed 9999 - \
3473336

@nickwallen
Contributor

+1 Thanks for the fix.

@mmiklavc
Contributor Author

mmiklavc commented Oct 4, 2019

Actually good to test both, thanks @tigerquoll
