METRON-2050: Automatically populate a list of enrichments from HBase #1365
Conversation
    return zkUrl;
  }

  private GlobalConfigService getGlobalConfigService(String zkUrl) {
We have an org.apache.metron.common.zookeeper.ZKConfigurationsCache abstraction that does the same thing. Does it make sense to reuse that here? I wouldn't say it's a requirement, because this is just a one-time load, but it could be useful in the future if we do need to access the global config after a coprocessor is started. Just something for you to consider.
I had considered that, but I wanted to be careful because, as you mentioned, this is a one-time load, and reusing the cache could have adverse side effects unless we explicitly changed it to handle updates to the configured parameters. The other thing is that the ZKConfigurationsCache abstraction is tightly coupled to the concept of our topology types, e.g.
- ENRICHMENT
- PARSER
- INDEXING
- PROFILER
Any reference to the global config has to be through one of those configuration types. There is probably some refactoring that could be done there if we wanted, but it felt a bit ham-fisted to shoehorn that in for what amounts to about 10 fairly concise lines of code.
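For reference, a minimal sketch of the kind of one-time load being discussed - reading the global config directly from Zookeeper with Curator instead of standing up a ZKConfigurationsCache. The znode path and JSON handling are assumptions based on Metron conventions, not this PR's exact code:

import java.util.Map;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import com.fasterxml.jackson.databind.ObjectMapper;

public class OneTimeGlobalConfigLoader {
  // Assumed location of the global config znode in Metron's Zookeeper layout.
  private static final String GLOBAL_CONFIG_ZNODE = "/metron/topology/global";

  @SuppressWarnings("unchecked")
  public static Map<String, Object> load(String zkUrl) throws Exception {
    try (CuratorFramework client = CuratorFrameworkFactory.newClient(
        zkUrl, new ExponentialBackoffRetry(1000, 3))) {
      client.start();
      byte[] raw = client.getData().forPath(GLOBAL_CONFIG_ZNODE);
      // The global config is stored as a single JSON map of settings.
      return new ObjectMapper().readValue(raw, Map.class);
    }
  }
}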
As long as you are aware of that and made a conscious decision not to use it, then it's fine with me.
<empty-value-valid>true</empty-value-valid>
</value-attributes>
</property>
<property>
Why do we have a property for the HDFS url? Shouldn't we get this property from Ambari similar to how we get Zookeeper and Kafka urls?
It took me a bit of effort to figure out how we were doing that, actually. The way we grab the existing properties for Zookeeper and Kafka is via the service_advisor code. See:
Line 130 in 96a66c8:
hdfsUrl = services["configurations"]["core-site"]["properties"]["fs.defaultFS"]

Line 142 in 96a66c8:
stormUIServerURL = stormUIProtocol + stormUIServerHost + ":" + stormUIServerPort
You effectively grab the values at cluster install time and provide those as default recommended values for properties that we've defined in the MPack. Here is the path through that maze:
Line 133 in 96a66c8:
<name>storm_rest_addr</name>

Line 111 in 96a66c8:
storm_rest_addr = config['configurations']['metron-env']['storm_rest_addr']

Line 157 in 96a66c8:
storm_rest_addr = status_params.storm_rest_addr

Line 144 in 96a66c8:
putMetronEnvProperty("storm_rest_addr",stormUIServerURL)
Defined in the env file, loaded into status_params first (so the status commands can use it), inherited by the OS-specific params file (for use during install), and modified during install by the service_advisor script.
Hopefully it makes sense why I did this (though not necessarily why this is all so esoteric)
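Condensed, that pattern looks roughly like the following service_advisor sketch. The method and helper names follow Ambari stack-advisor conventions and the lines quoted above; the hdfs_url property name is illustrative, not necessarily the MPack's actual property:

def recommendMetronEnvConfigurations(self, configurations, clusterData, services, hosts):
    # putProperty returns a setter bound to the metron-env config type.
    putMetronEnvProperty = self.putProperty(configurations, "metron-env", services)
    # Grab the cluster's value at install time from core-site...
    hdfs_url = services["configurations"]["core-site"]["properties"]["fs.defaultFS"]
    # ...and surface it as the default recommended value for an MPack property.
    putMetronEnvProperty("hdfs_url", hdfs_url)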
Hm, that's a good find - I did not know that existed. We also seem to be grabbing Zookeeper this way in the service_advisor.py for Solr, but that's apparently independent of our other zk url. Let me see if there's a way to grab HDFS in the same way. I prefer the approach you found.
@merrimanr - I played around with this a bit. The one benefit to the way I have it is that it allows the user to change the HDFS url if there's an issue with it for any reason - it will pull the default recommended value from the property fs.defaultFS on install.

My main concern with this property at all is namenode HA. I can't use the same approach as was done for storm and zookeeper because I can't get the namenode port, i.e.

hdfs_url = default("/clusterHostInfo/namenode_host", [])

only gives me a host, no port, and there's no other property in clusterHostInfo to obtain it. I was, however, able to find a modification to the service_advisor style that allowed me to grab it from inside the params files:

config["configurations"]["core-site"]["fs.defaultFS"]

That gives me the full url hdfs://node1:8020 as desired. The shortcoming here is that there is no way to change this whatsoever without modifying the python files if there's an issue. But I believe this should still work with namenode HA just fine, per https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
fs.defaultFS - the default path prefix used by the Hadoop FS client when none is given
Optionally, you may now configure the default path for Hadoop clients to use the new HA-enabled logical URI. If you used “mycluster” as the nameservice ID earlier, this will be the value of the authority portion of all of your HDFS paths. This may be configured like so, in your core-site.xml file:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
I guess I don't have a strong opinion either way - both can work. Exposing the property means the user can change it if they need to, but they will also need to manually manage it in cases where the namenode nameservice ID changes. Not exposing it means they cannot change it at all, but it will always pick up the latest ID after any namenode HA changes, etc. Which do you think is better? In light of HA likely working fine with fs.defaultFS, I'm leaning towards not exposing the property.
I agree, I don't think we should expose the property. I think we should go with an approach where Ambari can consistently and reliably populate that setting with or without namenode HA enabled. As long as that condition is satisfied, I'm good with how you choose to do it.
@merrimanr - thanks for talking through this - I decided to get rid of it in favor of the backend approach. Testing now and getting the test script together for you to start looking at as well.
 *
 * @return List of all row keys as Strings for this table.
 */
public List<String> readRecords() throws IOException {
Would it make sense to add some kind of configurable guard here or do you think a warning is enough?
What were you thinking? I had considered what to do about this. One option was simply putting this in its own class only usable by the coprocessor, but I felt that might be a bit too heavy-handed and redundant. Adding an aggressive note was where I landed as a hopefully reasonable compromise, but I'm definitely open to suggestions.
I was thinking a configurable limit that would break out of the scan loop after a threshold is reached. Looks like there would be more work involved though to pass in a config so I think a note is fine.
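For the record, the kind of guard discussed would look something like this sketch against the HBase 1.x client API (the threshold plumbing is illustrative; the PR kept the warning note instead):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GuardedRowKeyReader {
  public List<String> readRecords(Table table, int maxRecords) throws IOException {
    List<String> rowKeys = new ArrayList<>();
    try (ResultScanner scanner = table.getScanner(new Scan())) {
      for (Result result : scanner) {
        if (rowKeys.size() >= maxRecords) {
          // Threshold reached: break out of the scan loop rather than
          // walking the entire table.
          break;
        }
        rowKeys.add(Bytes.toString(result.getRow()));
      }
    }
    return rowKeys;
  }
}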
Great work so far. Just a couple minor suggestions to consider before we start testing.
… rest hbase user settings client.
…dynamically pull the HDFS host:port via Ambari in the status_advisor.py
…perties to single_node_vm and small_cluster Ansible scripts.
FYI, I just rebased from the latest master to get the enrichment refactoring changes from #1368.
Testing Plan

We need to verify:

Setup Test Environment

Verify Basics
HBase enrichment table setup with coprocessor

Pipeline still processes to indexing
Verify data is flowing through the system, from parsing to indexing

Flatfile loader
Preliminaries
The extractor.json will get used by flatfile_loader.sh in the next step (an example invocation is sketched below)

Import from HDFS via MR
You should see a 10k count in the enrichment table. We'll add one more source of enrichment type before checking our enrichment list.

Streaming Enrichment
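For the flatfile loader and MR import steps above, a hedged example invocation. The flag names follow the metron-data-management README; the input file, table name, and column family here are assumptions for illustration:

# Import a CSV into the 'enrichment' table via MapReduce (-m MR), using the
# extractor.json produced in the preliminaries step. Paths are illustrative.
$METRON_HOME/bin/flatfile_loader.sh \
  -e extractor.json \
  -t enrichment \
  -c t \
  -i /tmp/enrichment_data.csv \
  -m MR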
Final check
I ran through the test plan and everything works as expected. This is a really nice addition. +1
@merrimanr - as I reviewed your global config doc change PR (#1376), I went to double-check that I hadn't missed anything in this PR and ended up tweaking the docs a bit. Can you look over my latest changes when you have a moment and confirm whether your +1 stands?
Latest change looks good to me. My +1 stands.
Gah, looks like my recent metron-common changes from master conflict. Fixing.
Ok, latest commit fixes the merge conflict with master - metron-common README improvements from #1376. I checked the site-book again just to be sure and the links all work.
I checked the metron-common README and everything looks ok. +1









Contributor Comments
https://issues.apache.org/jira/browse/METRON-2050
We currently hard-code the list of enrichments available to the management UI within the REST API itself. This PR removes the hard-coded list and replaces it with a new approach using HBase coprocessors.
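At a high level, the coprocessor hooks writes to the enrichment table so the list of enrichment types can be derived from the data itself. Here is a minimal sketch using HBase 1.x's BaseRegionObserver; the class name, key decoding, and persistence step are illustrative assumptions, not this PR's actual code:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class EnrichmentListObserver extends BaseRegionObserver {
  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put,
      WALEdit edit, Durability durability) throws IOException {
    // After each write to the enrichment table, derive the enrichment type
    // from the row key and record it, so the REST API can serve a live list
    // of available enrichments instead of a hard-coded one.
    String enrichmentType = toEnrichmentType(put.getRow());
    // persistType(enrichmentType); // hypothetical: write to a tracking store
  }

  private String toEnrichmentType(byte[] rowKey) {
    // Placeholder decoding; the real code parses Metron's enrichment key format.
    return Bytes.toString(rowKey);
  }
}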
Testing and additional documentation to follow.
Pull Request Checklist
For all changes:
For code changes:
Have you included steps to reproduce the behavior or problem that is being changed or addressed?
Have you included steps or a guide to how the change may be verified and tested manually?
Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
Have you written or updated unit tests and or integration tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?
For documentation related changes:
Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, then run the following commands and verify the changes via site-book/target/site/index.html:

Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.