This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-2274 Flatfile loader and summarizer mapreduce mode broken #1525

Closed
wants to merge 2 commits into from

Conversation

mmiklavc
Contributor

@mmiklavc mmiklavc commented Oct 3, 2019

Contributor Comments

https://issues.apache.org/jira/browse/METRON-2274

Test plan will come in the comments

Pull Request Checklist

For all changes:

  • Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?

  • Have you included steps or a guide to how the change may be verified and tested manually?

  • Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:

    mvn -q clean integration-test install && dev-utilities/build-utils/verify_licenses.sh 
    
  • n/a Have you written or updated unit tests and or integration tests to verify your changes?

  • n/a If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • n/a Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, run the following commands and then verify the changes via site-book/target/site/index.html:

    cd site-book
    mvn site
    
  • n/a Have you ensured that any documentation diagrams have been updated, along with their source files, using draw.io? See Metron Development Guidelines for instructions.

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

@mmiklavc
Contributor Author

mmiklavc commented Oct 3, 2019

Test Plan

Taken from:

  1. Flatfile loader - METRON-682: Unify and Improve the Flat File Loader #432 (comment)
  2. Flatfile summarizer - https://github.com/apache/metron/tree/master/use-cases/typosquat_detection#summarize

Preliminaries

  • Spin up the full-dev environment for CentOS 6 or 7
  • Running as root is fine
  • The root user needs a home directory in HDFS. You can create one as follows:
sudo -u hdfs hdfs dfs -mkdir /user/root
sudo -u hdfs hdfs dfs -chown root:root /user/root
  • Download the Alexa top 1m data set
wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
unzip top-1m.csv.zip
  • Stage the import file in HDFS
head -n 10000 top-1m.csv > top-10k.csv
hdfs dfs -put top-10k.csv /tmp
  • Truncate the HBase enrichment table
echo "truncate 'enrichment'" | hbase shell

Test the flatfile loader in MR mode

  • Create an extractor.json file for the CSV data with these contents:
{
  "config" : {
    "columns" : {
      "domain" : 1,
      "rank" : 0
    },
    "indicator_column" : "domain",
    "type" : "alexa",
    "separator" : ","
  },
  "extractor" : "CSV"
}
  • Import from HDFS via MR
# import data into hbase 
$METRON_HOME/bin/flatfile_loader.sh -i /tmp/top-10k.csv -t enrichment -c t -e ./extractor.json -m MR
# count data written and verify it's 10k
echo "count 'enrichment'" | hbase shell
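
For reviewers less familiar with the extractor config, here is a minimal Python sketch of how the extractor.json above is meant to map one CSV line onto an enrichment record. This is not Metron code: the extract function is a hypothetical stand-in written only for illustration, and the sample line 1,example.com is made up to match the rank,domain layout that the column indexes assume.

```python
import csv
import io
import json

# Same contents as the extractor.json in the test plan above.
EXTRACTOR_JSON = """
{
  "config": {
    "columns": {"domain": 1, "rank": 0},
    "indicator_column": "domain",
    "type": "alexa",
    "separator": ","
  },
  "extractor": "CSV"
}
"""


def extract(line, extractor):
    """Hypothetical stand-in for the CSV extractor (illustration only).

    Returns (enrichment type, indicator, value map) for one input line.
    """
    cfg = extractor["config"]
    # Split the line on the configured separator.
    fields = next(csv.reader(io.StringIO(line), delimiter=cfg["separator"]))
    # Name each field using the column-name -> column-index mapping.
    value = {name: fields[idx] for name, idx in cfg["columns"].items()}
    # The indicator is whichever named column indicator_column points at.
    indicator = value[cfg["indicator_column"]]
    return cfg["type"], indicator, value


extractor = json.loads(EXTRACTOR_JSON)
print(extract("1,example.com", extractor))
```

With 10,000 input lines and one distinct domain per line, this mapping is why the count in the last step is expected to be 10k.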

Test the flatfile summarizer in MR mode

  • Create an extractor-count.json file and paste the following:
{
  "config" : {
    "columns" : {
      "rank" : 0,
      "domain" : 1
    },
    "value_transform" : {
      "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "0L",
    "state_update" : {
      "state" : "state + LENGTH( DOMAIN_TYPOSQUAT( domain ))"
    },
    "state_merge" : "REDUCE(states, (s, x) -> s + x, 0)",
    "separator" : ","
  },
  "extractor" : "CSV"
}
  • Create the summary from HDFS via MR
$METRON_HOME/bin/flatfile_summarizer.sh -i /tmp/top-10k.csv -e ./extractor-count.json -p 5 -om CONSOLE -m MR
  • Verify you see a count in the output similar to the following:
Processing /root/top-10k.csv
19/10/03 21:19:56 WARN resolver.BaseFunctionResolver: Using System classloader
Processed 9999 - \
3478276
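
To make the single number in that output easier to interpret, here is a rough Python sketch of the fold that extractor-count.json describes: state_init seeds each partial state, state_update adds to it per row, and state_merge sums the partial states. Everything below is an assumption for illustration only. remove_tld and typosquat_count are crude hypothetical stand-ins for the Stellar functions DOMAIN_REMOVE_TLD and LENGTH(DOMAIN_TYPOSQUAT(...)), the round-robin split is invented, and applying the transform before the filter is my reading of the config rather than something verified here.

```python
from functools import reduce


def remove_tld(domain):
    # Crude stand-in for DOMAIN_REMOVE_TLD: drop the last dot-separated label.
    return domain.rsplit(".", 1)[0] if "." in domain else domain


def typosquat_count(domain):
    # HYPOTHETICAL placeholder for LENGTH(DOMAIN_TYPOSQUAT(domain)).
    # The real function generates typosquatted variants; this just uses
    # the string length so the example produces a deterministic number.
    return len(domain)


def summarize_partition(rows):
    state = 0  # state_init: "0L"
    for _rank, domain in rows:
        domain = remove_tld(domain)  # value_transform
        if len(domain) > 0:  # value_filter
            state = state + typosquat_count(domain)  # state_update
    return state


def summarize(rows, partitions):
    # Assumed round-robin split; how the real tool divides work is not shown.
    chunks = [rows[i::partitions] for i in range(partitions)]
    states = [summarize_partition(chunk) for chunk in chunks]
    # state_merge: REDUCE(states, (s, x) -> s + x, 0)
    return reduce(lambda s, x: s + x, states, 0)


rows = [("1", "example.com"), ("2", "test.org"), ("3", "a.net")]
print(summarize(rows, partitions=1))
print(summarize(rows, partitions=5))
```

Because the merge is a plain sum, the total in this sketch does not depend on how the rows are partitioned, which is the property the MR run relies on.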

@tigerquoll
Contributor

tigerquoll commented Oct 4, 2019

Confirmed working, with some minor tweaks to the test plan: the -i option seems to take a local file path rather than an HDFS path.

[root@node1 tmp]# $METRON_HOME/bin/flatfile_summarizer.sh -i /tmp/top-10k.csv -e ./extractor-count.json -p 5 -om CONSOLE -m MR
19/10/04 01:19:48 WARN extractor.TransformFilterExtractorDecorator: Unable to setup zookeeper client - zk_quorum url not provided. **This will limit some Stellar functionality**
Processing /tmp/top-10k.csv
19/10/04 01:19:49 WARN resolver.BaseFunctionResolver: Using System classloader
Processed 9999 - \
3473336

@nickwallen
Contributor

+1 Thanks for the fix.

@mmiklavc
Contributor Author

mmiklavc commented Oct 4, 2019

Confirmed working, with some minor tweaks to the test plan: the -i option seems to take a local file path rather than an HDFS path.

[root@node1 tmp]# $METRON_HOME/bin/flatfile_summarizer.sh -i /tmp/top-10k.csv -e ./extractor-count.json -p 5 -om CONSOLE -m MR
19/10/04 01:19:48 WARN extractor.TransformFilterExtractorDecorator: Unable to setup zookeeper client - zk_quorum url not provided. **This will limit some Stellar functionality**
Processing /tmp/top-10k.csv
19/10/04 01:19:49 WARN resolver.BaseFunctionResolver: Using System classloader
Processed 9999 - \
3473336

Actually good to test both, thanks @tigerquoll
