
Conversation

@mmiklavc (Contributor) commented Mar 25, 2019

Contributor Comments

https://issues.apache.org/jira/browse/METRON-2050

We currently hard-code the list of enrichments available to the management UI within the REST API itself. This PR removes the hard-coded list and replaces it with a new approach using HBase coprocessors.

Testing and additional documentation to follow.
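
For context, the coprocessor approach attaches an observer to the enrichment HBase table so that the list of enrichment types is derived from the data itself rather than a hard-coded list in REST. Below is a minimal, hypothetical sketch of that idea against the HBase 1.x observer API - it is not the actual EnrichmentCoprocessor from this PR, and the helper methods are placeholders:

import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

public class EnrichmentListSketch extends BaseRegionObserver {

  // In-memory de-duplication so the list table is only written the first time a type is seen.
  private final Set<String> seenTypes = ConcurrentHashMap.newKeySet();

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx, Put put,
                      WALEdit edit, Durability durability) throws IOException {
    // Hypothetical helpers: a real implementation would decode the enrichment type
    // from the Metron row key format and persist it via a table provider.
    String enrichmentType = decodeEnrichmentType(put.getRow());
    if (seenTypes.add(enrichmentType)) {
      writeToEnrichmentListTable(enrichmentType);
    }
  }

  private String decodeEnrichmentType(byte[] rowKey) {
    return ""; // placeholder
  }

  private void writeToEnrichmentListTable(String enrichmentType) throws IOException {
    // placeholder: put a row keyed by enrichmentType into the enrichment list table
  }
}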

Pull Request Checklist

For all changes:

  • Is there a JIRA ticket associated with this PR? If not, one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?

  • Have you included steps or a guide to how the change may be verified and tested manually?

  • Have you ensured that the full suite of tests and checks has been executed in the root metron folder via:

    mvn -q clean integration-test install && dev-utilities/build-utils/verify_licenses.sh 
    
  • Have you written or updated unit tests and/or integration tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, run the following commands and then verify the changes via site-book/target/site/index.html:

    cd site-book
    mvn site
    

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci be set up for your personal repository so that your branches are built there before submitting a pull request.

return zkUrl;
}

private GlobalConfigService getGlobalConfigService(String zkUrl) {
Contributor:

We have an org.apache.metron.common.zookeeper.ZKConfigurationsCache abstraction that does the same thing. Does it make sense to reuse it here? I wouldn't say it's a requirement, because this is just a one-time load, but it could be useful in the future if we do need to access the global config after a coprocessor is started. Just something for you to consider.

Contributor Author:

I had considered that, but I wanted to be careful because, as you mentioned, this is a one-time load, and there could be adverse side effects unless it were explicitly changed to handle updates to the configured parameters. The other thing is that the ZKConfigurationsCache abstraction is tightly coupled to the concept of our topology types, e.g.

  • ENRICHMENT
  • PARSER
  • INDEXING
  • PROFILER

Any reference to the global config has to be through one of those configuration types. There is probably some refactoring that could be done there if we wanted, but it felt a bit ham-fisted to shoehorn that in for what amounts to about 10 fairly concise lines of code.
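
To give a sense of scale, those lines amount to a one-time read of the global config JSON out of ZooKeeper when the coprocessor starts. A rough sketch of that using Curator directly - the znode path and quorum address below are assumptions for illustration, not code from this PR:

import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class GlobalConfigOneTimeLoad {
  public static void main(String[] args) throws Exception {
    String zkUrl = "node1:2181"; // assumed full-dev ZooKeeper quorum
    CuratorFramework client =
        CuratorFrameworkFactory.newClient(zkUrl, new ExponentialBackoffRetry(1000, 3));
    try {
      client.start();
      // Assumed znode holding Metron's global config JSON.
      byte[] raw = client.getData().forPath("/metron/topology/global");
      System.out.println(new String(raw, StandardCharsets.UTF_8));
    } finally {
      client.close();
    }
  }
}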

Contributor:

As long as you are aware of that and made a conscious decision not to use it, it's fine with me.

<empty-value-valid>true</empty-value-valid>
</value-attributes>
</property>
<property>
Contributor:

Why do we have a property for the HDFS URL? Shouldn't we get this property from Ambari, similar to how we get the Zookeeper and Kafka URLs?

Contributor Author:

It actually took me a bit of effort to figure out how we were doing that. The way we grab the existing properties for Zookeeper and Kafka is via the service_advisor code.

You effectively grab the values at cluster install time and provide those as default recommended values for properties that we've defined in the MPack. Here is the path through that maze:

Defined in the env file, loaded into status_params first (so the status commands can use it), inherited by the OS-specific params file (for use during install), and modified during install by the service_advisor script.

Hopefully it makes sense why I did this (though not necessarily why this is all so esoteric).

Contributor:

I did not realize you were populating that in service_advisor.py. The Zookeeper and Kafka URLs are actually populated in params_linux.py here and here. I'm not sure which approach is better or what the tradeoffs are.

Contributor Author:

Hm, that's a good find - I did not know that existed. We also seem to be grabbing Zookeeper from here in the service_advisor.py for Solr, but that's apparently independent of our other ZK URL. Let me see if there's a way to grab HDFS in the same way. I prefer the approach you found.

Contributor Author:

@merrimanr - I played around with this a bit. The one benefit to the way I have it is that it allows the user to change the HDFS URL if there's an issue with it for any reason - it pulls the default recommended value from the property fs.defaultFS on install. My main concern with this property at all is namenode HA. I can't use exactly the same approach as was done for Storm and Zookeeper because I can't get the namenode port, i.e. hdfs_url = default("/clusterHostInfo/namenode_host", []) only gives me a host, no port, and there's no other property in clusterHostInfo to obtain it.

I was, however, able to find a modification to the service_advisor style that allowed me to grab it from inside the params files, config["configurations"]["core-site"]["fs.defaultFS"]. That gives me the full URL hdfs://node1:8020 as desired. The shortcoming here is that there is no way to change this whatsoever without modifying the Python files if there's an issue. But I believe this should still work with namenode HA just fine, per https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html

fs.defaultFS - the default path prefix used by the Hadoop FS client when none is given

Optionally, you may now configure the default path for Hadoop clients to use the new HA-enabled logical URI. If you used “mycluster” as the nameservice ID earlier, this will be the value of the authority portion of all of your HDFS paths. This may be configured like so, in your core-site.xml file:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>

I guess I don't have a strong opinion either way - both can work. Exposing the property means the user can change it if they need to, but they will also need to manage it manually in cases where the namenode nameservice ID changes. Not exposing it means they cannot change it at all; however, it will always pull the latest ID from any changes with namenode HA, etc. Which do you think is better? In light of HA probably working fine with fs.defaultFS, I'm leaning towards not exposing the property.
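
As an illustration of why reading fs.defaultFS is sufficient: the Hadoop FS client qualifies unqualified paths against whatever authority that property carries, whether a single namenode (hdfs://node1:8020) or an HA nameservice ID (hdfs://mycluster). A small sketch, not part of this PR, assuming core-site.xml is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
    System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

    FileSystem fs = FileSystem.get(conf);
    // The coprocessor jar path from this PR's test plan, qualified against fs.defaultFS.
    Path jar = new Path("/apps/metron/coprocessor/metron-hbase-server-0.7.1-uber.jar");
    System.out.println("resolves to: " + fs.makeQualified(jar));
  }
}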

Contributor:

I agree; I don't think we should expose the property. I think we should go with an approach where Ambari can consistently and reliably populate that setting with or without namenode HA enabled. As long as that condition is satisfied, I'm good with how you choose to do it.

Contributor Author:

@merrimanr - thanks for talking through this - I decided to get rid of it in favor of the backend approach. Testing now, and getting the test script together for you to start looking at as well.

*
* @return List of all row keys as Strings for this table.
*/
public List<String> readRecords() throws IOException {
Contributor:

Would it make sense to add some kind of configurable guard here or do you think a warning is enough?

Contributor Author:

What were you thinking? I had considered what to do about this. One option was simply putting this in its own class, usable only by the coprocessor, but I felt that might be a bit too heavy-handed and redundant. Adding an aggressive note was where I landed as a hopefully reasonable compromise, but I'm definitely open to suggestions.

Contributor:

I was thinking a configurable limit that would break out of the scan loop after a threshold is reached. It looks like there would be more work involved to pass in a config, though, so I think a note is fine.
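
For the record, the kind of guard discussed here is easy to sketch. A hypothetical variant of the row-key read, not code from this PR, assuming an HBase 1.x Table handle and a caller-supplied threshold:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class BoundedRowKeyReader {

  public static List<String> readRecords(Table table, int maxRecords) throws IOException {
    List<String> rowKeys = new ArrayList<>();
    Scan scan = new Scan();
    // Only row keys are needed, so skip the cell values.
    scan.setFilter(new FirstKeyOnlyFilter());
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result result : scanner) {
        if (rowKeys.size() >= maxRecords) {
          break; // guard: stop scanning once the threshold is reached
        }
        rowKeys.add(Bytes.toString(result.getRow()));
      }
    }
    return rowKeys;
  }
}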

@merrimanr (Contributor)

Great work so far. Just a couple of minor suggestions to consider before we start testing.

@mmiklavc (Contributor Author) commented Apr 8, 2019

FYI, I just rebased from the latest master to get the enrichment refactoring changes from #1368.

@mmiklavc changed the title from "Metron-2050: Automatically populate a list of enrichments from HBase" to "METRON-2050: Automatically populate a list of enrichments from HBase" on Apr 10, 2019
@mmiklavc (Contributor Author) commented Apr 10, 2019

Testing Plan

We need to verify that:

  1. The enrichment coprocessor loads as expected
  2. Sensor data still flows through the system as normal, from parsers through to indexing
  3. The new enrichment list table is populated when new enrichment types are added to the enrichment HBase table

TOC

  • Setup Test Environment
  • Verify Basics
  • Flatfile loader
  • Streaming enrichment
  • Final check

Setup Test Environment

  1. Build full dev: metron/metron-deployment/development/centos6$ vagrant up
  2. Log in to full dev: ssh root@node1, password "vagrant"
  3. Set some environment variables
    # the root and metron users will need to do this - add to the user's ~/.bashrc or source each time you switch to the user
    source /etc/default/metron
    

Verify Basics

HBase enrichment table setup with coprocessor

  1. Run the following command from the CLI - you should see the coprocessor in the table attributes. Ambari should set this up as part of the MPack installation. (An equivalent check using the HBase Java admin API is sketched after this list.)

    # echo "describe 'enrichment'" | hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Type "exit<RETURN>" to leave the HBase Shell
    Version 1.1.2.2.6.5.1050-37, r897822d4dd5956ca186974c10382e9094683fa29, Tue Dec 11 
    02:04:10 UTC 2018
    
    describe 'enrichment'
    Table enrichment is ENABLED
    enrichment, {TABLE_ATTRIBUTES => {METADATA => {'Coprocessor$1' => 
    'hdfs://node1:8020/apps/metron/coprocessor/metron-hbase-server-0.7.1-uber.jar|org.apache.metron.hbase.coprocessor.EnrichmentCoprocessor||zookeeperUrl=node1:2181'}
    }
    COLUMN FAMILIES DESCRIPTION
    {NAME => 't', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    1 row(s) in 0.3790 seconds 
    
  2. Ambari should provide 4 new options for configuring the enrichment list

    1. Enrichment List HBase Column Family
    2. Enrichment List HBase Coprocessor Implementation
    3. Enrichment List HBase Table Provider Implementation
    4. Enrichment List HBase Table
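
If you prefer to verify from code rather than the hbase shell, here is a small, hypothetical check using the HBase 1.x Java admin API (not part of this PR; it assumes hbase-site.xml is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CoprocessorCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = admin.getTableDescriptor(TableName.valueOf("enrichment"));
      // Should include org.apache.metron.hbase.coprocessor.EnrichmentCoprocessor
      System.out.println(desc.getCoprocessors());
    }
  }
}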

Pipeline still processes to indexing

Verify data is flowing through the system, from parsing to indexing

  1. Open Ambari and navigate to the Metron service http://node1:8080/#/main/services/METRON/summary
  2. Open the Alerts UI
  3. Verify alerts show up in the main UI - click the search icon (you may need to wait a moment for them to appear)
  4. Head back to Ambari and select the Kibana service http://node1:8080/#/main/services/KIBANA/summary
  5. Open the Kibana dashboard via the "Metron UI" option in the quick links
  6. Verify the dashboard is populating

Flatfile loader

Preliminaries

  1. Before we start adding enrichments, let's verify the enrichment_list table is empty

  2. Go to Swagger

  3. Click the sensor-enrichment-config-controller option.

  4. Click the GET /api/v1/sensor/enrichment/config/list/available/enrichments option.

  5. And finally click the "Try it out!" button. You should see an empty array returned in the response body.

  6. Now, let's perform an enrichment load. We'll do this as the metron user

    su - metron
    source /etc/default/metron
    
  7. Download the Alexa 1M dataset:

    wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
    unzip top-1m.csv.zip
    
  8. Stage import file

    head -n 10000 top-1m.csv > top-10k.csv
    # plop it on HDFS
    hdfs dfs -put top-10k.csv /tmp
    
  9. Create an extractor.json for the CSV data by pasting in these contents:

    {
      "config": {
        "columns": {
          "domain": 1,
          "rank": 0
        },
        "indicator_column": "domain",
        "separator": ",",
        "type": "alexa"
      },
      "extractor": "CSV"
    }
    

The extractor.json will get used by flatfile_loader.sh in the next step.

Import from HDFS via MR

# truncate hbase
echo "truncate 'enrichment'" | hbase shell
# import data into hbase 
$METRON_HOME/bin/flatfile_loader.sh -i /tmp/top-10k.csv -t enrichment -c t -e ./extractor.json -m MR
# count data written and verify it's 10k
echo "count 'enrichment'" | hbase shell

You should see a 10k count in the enrichment table. We'll add one more enrichment type source before checking our enrichment list.

Streaming Enrichment

  1. Switch back to root if you're still the metron user.

    [metron@node1 ~]$ exit
    
  2. Pull down latest config from Zookeeper

    $METRON_HOME/bin/zk_load_configs.sh -m PULL -o ${METRON_HOME}/config/zookeeper -z $ZOOKEEPER -f
    
  3. Create a file named user.json in the parser directory.

    touch ${METRON_HOME}/config/zookeeper/parsers/user.json
    
  4. Enter these contents:

    {
      "parserClassName" : "org.apache.metron.parsers.csv.CSVParser" ,
      "writerClassName" : "org.apache.metron.writer.hbase.SimpleHbaseEnrichmentWriter",
      "sensorTopic":"user",
      "parserConfig": {
        "shew.table" : "enrichment",
        "shew.cf" : "t",
        "shew.keyColumns" : "ip",
        "shew.enrichmentType" : "user",
        "columns" : {
          "user" : 0,
          "ip" : 1
        }
      }
    }
    
  5. Push the changes back up to Zookeeper

    $METRON_HOME/bin/zk_load_configs.sh -m PUSH -i $METRON_HOME/config/zookeeper/ -z $ZOOKEEPER
    
  6. Create the user Kafka topic

    ${HDP_HOME}/kafka-broker/bin/kafka-topics.sh --create --zookeeper $ZOOKEEPER --replication-factor 1 --partitions 1 --topic user
    
  7. Start the topology

    ${METRON_HOME}/bin/start_parser_topology.sh -s user -z $ZOOKEEPER
    
  8. Create a simple file named user.csv with a user mapping to an IP, e.g.

    echo "mmiklavcic,192.168.138.158" > user.csv
    
  9. Push the data to Kafka

    tail user.csv | ${HDP_HOME}/kafka-broker/bin/kafka-console-producer.sh --broker-list $BROKERLIST --topic user
    
  10. Verify data makes it to the enrichment table.

    echo "count 'enrichment'" | hbase shell
    

    There should be 10,001 records now.

Final check

  1. Check the Swagger UI again, following our earlier steps. You should now see an "alexa" and a "user" enrichment type returned in the enrichment list results (a scripted version of this check is sketched below).
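
For a scripted version of this check, the endpoint from step 4 of the preliminaries can be hit directly. A hypothetical example (not part of this PR) - the REST host, port, and credentials below are assumptions; substitute whatever your deployment uses:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EnrichmentListCheck {
  public static void main(String[] args) throws Exception {
    // Assumed REST location and basic-auth credentials.
    URL url = new URL("http://node1:8082/api/v1/sensor/enrichment/config/list/available/enrichments");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    String creds = Base64.getEncoder().encodeToString("user:password".getBytes(StandardCharsets.UTF_8));
    conn.setRequestProperty("Authorization", "Basic " + creds);
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // expect something like ["alexa","user"]
      }
    }
  }
}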

@merrimanr (Contributor)

I ran through the test plan and everything works as expected. This is a really nice addition. +1

@mmiklavc (Contributor Author)

@merrimanr - as I reviewed your global config doc change PR (#1376), I went to double-check that I hadn't missed anything in this PR and ended up tweaking the docs a bit. Can you look over my latest changes when you have a moment and confirm whether your +1 stands?

@merrimanr (Contributor)

Latest change looks good to me. My +1 stands.

@mmiklavc (Contributor Author)

Gah, looks like my recent metron-common changes from master conflict. Fixing.

@mmiklavc (Contributor Author)

Ok, latest commit fixes merge conflict with master - metron-common README improvements from #1376. I checked the site-book again just to be sure and the links all work.

@merrimanr (Contributor)

I checked the metron-common README and everything looks OK. +1

@asfgit closed this in 5709548 on Apr 12, 2019