
METRON-283 Migrate Geo Enrichment outside of MySQL #421

Closed

Conversation

@justinleet (Contributor) commented Jan 20, 2017

MySQL Removed

Drops MySQL entirely from the project. This is done for a few reasons outlined in a discussion thread on the dev lists: they boil down to licensing, eliminating a single point of failure, and the ability to use MaxMind's libraries for handling GeoLite2 data, along with a couple of other concerns.

This PR covers dependencies, installation paths, READMEs, etc. The only pieces left are MySQLConfig and MySQLConfigTest, in case anyone wants to use them. The vast majority of the removed files / code come from simply stripping this out. If any traces of MySQL outside of these are found in review, they should almost certainly be removed.

This moves to a system that uses MaxMind's binary DB directly. Per their page:

The GeoLite2 databases are distributed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Our LICENSE file has been updated with the notification that we use their database (and some of their test data, which is Creative Commons Attribution-ShareAlike 3.0). Both of these licenses are acceptable for us, as stated on Apache Legal.

Let me know if that notification should be in a different spot or spots, and I can adjust it appropriately.

GeoLite2 Database

The main portion of the PR is in GeoLiteDatabase.java, which manages access to the GeoLite2 database.

Raw Database

The raw database is stored on HDFS, under /apps/metron/geo by default. If no explicit location is given, /apps/metron/geo/default/<db_filename> will be used. Otherwise, updates will use /apps/metron/geo/<millis>/<db_filename>. Given the low rate of churn on the DB (it's updated once per week) and the potential for replay use cases, I haven't implemented any pruning or anything fancy on top of this.
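For illustration only, here's a minimal sketch of the path scheme described above (the class and method names are hypothetical, not the PR's actual code):

public class GeoPathSketch {
  private static final String BASE_PATH = "/apps/metron/geo";

  // Default installs land in a stable "default" directory; updates land in a
  // millisecond-timestamped directory so readers of the old file aren't clobbered.
  static String resolveTargetPath(String dbFilename, boolean isDefault) {
    if (isDefault) {
      return BASE_PATH + "/default/" + dbFilename;
    }
    return BASE_PATH + "/" + System.currentTimeMillis() + "/" + dbFilename;
  }
}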

Updating DB

A script for updates, geo_enrichment_load.sh, is provided in metron-data-management. Usage details are provided in metron-data-management/README.md. Note that the original implementation didn't appear to have update capabilities at all.

The script will pull down a new copy of the GeoLite2 database. The source can be MaxMind's standard web address, some other hosted location, or even a file:// URL. Once the DB file is pulled down, the script pushes it to the appropriate HDFS location. Finally, it pulls down the global config and updates it with the new location. This does not require a topology restart.
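As a rough sketch of what the script does under the hood, assuming Hadoop's FileSystem API and Curator for the ZK write (the ZK path and all class/method names here are assumptions for illustration, not the script's actual code):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GeoLoadSketch {
  // Push a locally downloaded GeoLite2 file to HDFS, then point the global
  // config at the new location. Merging the new geo.hdfs.file entry into the
  // existing global config JSON is elided here.
  public static void loadAndUpdate(String localDb, String hdfsDest,
                                   String zkQuorum, byte[] updatedGlobalConfig)
      throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    fs.copyFromLocalFile(new Path(localDb), new Path(hdfsDest));

    CuratorFramework client = CuratorFrameworkFactory.newClient(
        zkQuorum, new ExponentialBackoffRetry(1000, 3));
    try {
      client.start();
      // The path "/metron/topology/global" is assumed here. Note this updates
      // ZK only; Ambari's copy of global.json is untouched.
      client.setData().forPath("/metron/topology/global", updatedGlobalConfig);
    } finally {
      client.close();
    }
  }
}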

Note that there have been conversations about how we manage config updates (specifically leaning towards Ambari). That hasn't been finalized, and we have two non-Ambari testing environments (quickdev and docker-metron), so this just hits ZK. Ambari is not updated by this script; it is the user's responsibility to update global.json.

This leads to a question people may have preferences on:

  • Do we want the script to always update? Should there be a flag to stage the file, but not update configs?

Code

GeoLiteDatabase is a singleton that allows the database to be updated when the global config is updated. It is (hopefully!) correctly locked to avoid threading issues when updating or reading from the DB (and I've been able to update without issues).
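The shape of that pattern, as a minimal sketch (illustrative, not the PR's exact code; it assumes MaxMind's DatabaseReader and a ReentrantReadWriteLock):

import com.maxmind.db.CHMCache;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.model.CityResponse;

import java.io.InputStream;
import java.net.InetAddress;
import java.util.Optional;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public enum GeoDatabaseSketch {
  INSTANCE;

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Lock readLock = lock.readLock();
  private final Lock writeLock = lock.writeLock();
  private DatabaseReader reader = null;

  // Swap in a new reader under the write lock, so in-flight reads drain first.
  public void update(InputStream newDb) throws Exception {
    writeLock.lock();
    try {
      DatabaseReader old = reader;
      reader = new DatabaseReader.Builder(newDb).withCache(new CHMCache()).build();
      if (old != null) {
        old.close();
      }
    } finally {
      writeLock.unlock();
    }
  }

  // Lookups share the read lock, so they can proceed concurrently between updates.
  public Optional<CityResponse> city(String ip) {
    readLock.lock();
    try {
      return Optional.ofNullable(
          reader == null ? null : reader.city(InetAddress.getByName(ip)));
    } catch (Exception e) {
      return Optional.empty();
    } finally {
      readLock.unlock();
    }
  }
}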

The various bolts have been updated to initialize the adapter so that it grabs the current data appropriately.

In addition, a Stellar function, GEO_GET(), has been provided, which takes an IPv4 address. It probably works with an IPv6 address, but I didn't really dig into it, given that the initial goal was to match parity.

Given the somewhat core nature of this, and my relative unfamiliarity going in with how all these pieces tie together, I'm definitely looking for feedback on how things are implemented, or if I missed conventions we've used in the code.

Testing

Unit testing is added for the database and Stellar portions of the code as needed. The DB testing uses one of MaxMind's published test DBs, because we can't create the binary format correctly ourselves. It does not use the full (20+ MB) version of the data, but rather a stripped-down version (on the order of several KB).

Three environments were tested during this. Having these three disparate environments makes cross-cutting features like this more complicated to test, so additional scrutiny would be merited (I would definitely like at least one person to run through one of these themselves and make sure it's transparent). Notably, quickdev requires the Ansible setup scripts to align; the mpack requires layout, internal configuration, and handling of additional files and ownership of scripts to work properly; and docker-metron requires essentially cheating the scripts and just running a wget on the file, because things aren't actually set up.

  • quickdev
  • Ambari Management Pack
  • docker-metron

QuickDev

Ansible scripts are updated. Running data through the topologies produced the expected enriched output.

Ambari Management Pack

RPMs were updated where needed. The config screen layout changed, and updates were made to properly handle configs and ownership. Ran Stellar on this install. Again, ran data through the topology.

Docker

Essentially this just involved cheating the scripts and running a wget on the GeoLite2 DB file, because there's no Hadoop. Ran through the instructions to run the topologies (which are a little different than the others because of Docker) and again was able to get data back out.

Additional Notes

  • As noted above, do we want the DB script to always update? Should there be a flag to stage the file, but not update configs? I primarily see this affecting the mpack because of the Ambari management behind it.
  • Increased withMaxTimeMS in the indexing integration test. This seems unrelated to my changes (and I believe it had been seen elsewhere), so if anybody has found the root cause, I can adjust my code appropriately. Outside issue (and the code is reverted in this PR); see: https://issues.apache.org/jira/browse/METRON-672.
  • LocID doesn't technically exist in the new data, and I suspect it was never meant to be relied upon anyway, outside of being a join key. The same applies to the new field that replaces it in this context. It seems like we were mostly just passing that field along because it was available, and it should probably be refactored to be more useful. I didn't take on that analysis here; this is a slightly more validated version of a gut feeling.
  • The newer form of the MaxMind info has more data available than the old source we were using. We should consider passing at least some of this data along. See MaxMind's What's New in GeoIP2. One field that leapt out at me as potentially interesting contains where an IP was registered, rather than just where the IP actually is. Another is the is_anonymous_proxy field, etc. I didn't validate whether everything new is in the free version of the dataset.

@justinleet (Contributor Author)

Expanding the Stellar GEO_GET function to allow for just returning a map of specified fields (if more than one), or just the direct value (if just one field requested) would be very nice.

e.g. GEO_GET(ip, ['country', 'city']) returns {"country": "US", "city": "Chicago"}, and GEO_GET(ip, ['country']) just returns 'US'
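A minimal sketch of that return-shaping logic (names are hypothetical, for illustration only):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GeoGetResultSketch {
  // One requested field yields the bare value; several yield a map of
  // field -> value, per the examples above.
  static Object shapeResult(Map<String, String> allFields, List<String> requested) {
    if (requested.size() == 1) {
      return allFields.get(requested.get(0));
    }
    Map<String, String> subset = new HashMap<>();
    for (String field : requested) {
      subset.put(field, allFields.get(field));
    }
    return subset;
  }
}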

@cestella (Member) left a comment

This looks great...first pass through and a few comments, but I love it so far!

@@ -210,6 +210,12 @@ This product bundles some test examples from the Stix project (metron-platform/m

This product bundles wait-for-it.sh, which is available under a "MIT Software License" license. For details, see https://github.com/vishnubob/wait-for-it

------------------------------------------------------------------------------------

Member

Could you make the license explicit here?

Contributor Author

Added header with license similar to the MIT one above

Contributor Author

Updated with another license. The testing DB used is under a ShareAlike 3.0 (not 4.0) license, so I explicitly called it out and added a link to the GitHub repo.

@@ -225,6 +163,8 @@ indexing.executors=0
kafka.zk={{ zookeeper_quorum }}
kafka.broker={{ kafka_brokers }}
kafka.start=WHERE_I_LEFT_OFF
##### GEO #####
geo.hdfs.file={{ geoip_hdfs_file }}
Member

Why is the geo IP data file in HDFS configured in flux rather than zookeeper?

Contributor Author

Shouldn't be there at all, deleted it.

@@ -81,7 +81,6 @@ public JSONObject enrich(CacheKey value) {
Map<String, Object> globalConfig = value.getConfig().getConfiguration();
Map<String, Object> sensorConfig = value.getConfig().getEnrichment().getConfig();
if(handler == null) {
  _LOG.trace("Stellar ConfigHandler is null.");
Member

Didn't like the trace statements there?

Contributor Author

Added back in. Didn't intend to drop them at all.

@@ -161,6 +167,7 @@ protected void initializeStellar() {
stellarContext = new Context.Builder()
    .with(Context.Capabilities.ZOOKEEPER_CLIENT, () -> client)
    .with(Context.Capabilities.GLOBAL_CONFIG, () -> getConfigurations().getGlobalConfig())
    .with(Context.Capabilities.GEO_IP, () -> GeoLiteDatabase.INSTANCE)
Member

Since the INSTANCE is static, is there a reason to not just refer to it directly in the Stellar function rather than adding a capability? I always envisioned capabilities to be for configs which were changing at runtime.

Contributor

Also, why is this in the GenericEnrichment bolt? Am I mistaken, or wouldn't this be better located in the GEO bolt?

Contributor Author

@cestella Capability was left over from earlier testing. Everything referred to the static instance anyway for precisely that reason. Dropped it entirely and removed from the enum.

@dlyle65535 There's no geo-specific bolt; it's an instance of GenericEnrichmentBolt. To the best of my knowledge (and correct me if I'm wrong), you can't grab global configs in the adapters. This specific one gets dropped, but the prepare() method references the static INSTANCE Casey refers to. Is there somewhere else that could live?

@@ -63,12 +65,15 @@ public ThreatIntelJoinBolt(String zookeeperUrl) {
@Override
public void prepare(Map map, TopologyContext topologyContext) {
  super.prepare(map, topologyContext);
  GeoLiteDatabase.INSTANCE.update((String)getConfigurations().getGlobalConfig().get(GeoLiteDatabase.GEO_HDFS_FILE));
Member

This can certainly stand now, but it opens the question as to whether we want to register some easier way to do this than remembering to call the right set of functions in all of the bolts that use stellar as a follow-on JIRA. Just a thought.

Contributor Author

I completely agree with this, particularly because this pattern shows up a couple of times, but I didn't see a particularly good way to avoid it.

I'm definitely open to suggestions if there's anything we want to do short term to make this a little better or clearer.

@justinleet (Contributor Author)

As a note to anyone coming in late: at least one comment (David's) is still relevant but hidden behind a collapsed, outdated diff. The line he commented on needed to be deleted, but the point he brought up is still under active discussion.

@justinleet (Contributor Author)

Forgot to comment on it, but the GEO_GET call was updated a bit ago per my comment about lists and fields. That change should also be reviewed and given any feedback.

@nickwallen (Contributor) left a comment

This is a really nice contribution. Love that we don't need MySQL any longer. This was a ton of work. Thanks, Justin.

@@ -317,6 +316,8 @@ This package installs the Metron Profiler %{metron_home}
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

%changelog
* Thu Jan 19 2017 Justin Leet <justinjleet@gmail.com> - 0.3.0
Contributor

Unrelated to this PR, but I thought it was un-Apache to put individuals' names and emails in the source code?

Contributor Author

I'm not even sure who we should talk to about how it should be handled. I'm more than happy to adjust the changelog as needed, but I need to know how to change it first.

@cestella (Member) commented Jan 25, 2017

It is true that Apache strongly recommends against (or maybe outright forbids) things like author tags in source code (see here for a discussion); the reasoning was mostly that author tags end up not being accurately representative.

For this situation, however, this is a changelog, so it doesn't have the problem of accurate representation that author tags do. That being said, it is redundant information because such information is stored in git.

I'd recommend doing a dev list discussion with a subject that starts with [MENTORS] on this to see what the ASF wants.

Contributor

I'd just yank that section or leave it blank if it's required. No point in figuring out how to keep it if we're agreed we don't need it.

Member

+1 to that @dlyle65535

Contributor Author

I sent out the email before I saw this, but I'm +1 on just dropping it. I'll pull it out in this PR, and whatever answer we get from the mentors will just be for our info.

Contributor

But so we're clear: with the changelog in place you can interrogate the RPM directly to find out what's changed ("rpm -q --changelog"). We'd lose that capability. I'm still good with omitting it, but I wanted to make sure everyone was aware of that consequence.

Member

Hmm, you're right @dlyle65535, we do lose some capability by not having the changelog in there. You know what would be cool: if we could generate the spec files as part of Maven's generate-sources phase and fill in the changelog with the information from git. Anyway, it's a best-of-both-worlds kind of thing, but not effort that should be done as part of this PR.

Contributor

I think that's where I'm at with it. How about we let @justinleet's mentor thread confirm/deny that it's okay to keep it and then handle it as a directed effort after?

Member

sounds good :)

@@ -586,7 +586,6 @@ include $RULE_PATH/community.rules
# include $RULE_PATH/malware-tools.rules
# include $RULE_PATH/misc.rules
# include $RULE_PATH/multimedia.rules
# include $RULE_PATH/mysql.rules
Contributor

I don't think that we actually want to remove the MySQL rules from Snort. These rules help Snort create alerts based on MySQL databases that may be running in the protected network.

Contributor Author

Good catch. I'll add it back in.

The shell script `$METRON_HOME/bin/geo_enrichment_load.sh` will retrieve MaxMind GeoLite2 data, load it into HDFS, and update the configuration.

THIS SCRIPT WILL NOT UPDATE AMBARI'S GLOBAL.JSON, JUST THE ZK CONFIGS. CHANGES WILL GO INTO EFFECT, BUT WILL NOT PERSIST PAST AN AMBARI RESTART UNTIL UPDATED THERE.

Contributor

This seems like something that could cause a really annoying problem for a user. Is there no simple action we can take in this PR to stop Ambari from clobbering things?

Obviously, deploying/updating the geo database in an Ambari "service action" would be ideal. Would be nice to have one way to do it that does not break on restart.

Or maybe we need to accept this for now and once we have Ansible deployments using the MPack, we can implement an Ambari "service action" for this?

Member

It's annoying, for sure. It's a persistent problem and one of the reasons for that discuss thread about how to handle configurations. I believe the conclusion arrived at there is that, in the future state, we should push changes through Ambari, which will then update ZooKeeper, thereby avoiding this unfortunate state.

Contributor Author

This gets into the whole "How do we manage configs?" discussion. Unfortunately, it's in a really awkward spot. I might be able to add a service action to sort of take care of it, but it still probably does the same end run around Ambari's management, except it goes through the UI. I don't know that there's a good solution to this until we unify our config management, which is the real answer here.

Given that I don't think there was originally support for updating the DB, I'm inclined to clean it up as part of unifying config management. It's ugly and I don't like it, but I can't come up with a good short-term alternative that isn't ugly for other reasons anyway.

  return null;
}
if(args.size() > 2) {
  return null;
Contributor

Should we throw an exception? If a user misuses the function, then let's tell them.

Contributor Author

Is there a well-defined expectation for what happens in cases like this? I don't think I've seen anything, and on reflection I'm pretty sure I just based that on what other functions did.

@cestella Can you shed some light on what perceptions/expectations exist for this case? Or shame my reading comprehension by pointing me to where we have them documented, if we do?

Member

So it is true that we do not tend to throw exceptions on some errors in Stellar functions for reasons that may or may not be valid. The original intent was to not stop topologies because an error has happened in a specific function. I happen to think that this isn't correct reasoning anymore, but that's a discussion for another thread.

In this specific case, that the user has passed in too many arguments, I think you could make a reasonable argument for a couple of different semantics:

  • do nothing, passing in too many arguments does no harm
  • throw an exception as Nick suggests
  • return null

I actually think I agree with Nick in this case; I'd rather err on the side of caution and throw an exception here, so we don't have people using functions incorrectly without realizing it.

Contributor Author

Now throwing an IllegalArgumentException, and I've adjusted the test accordingly.
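For illustration, a minimal sketch of that kind of fail-fast arity check (names are hypothetical, not the PR's exact code):

import java.util.List;

public class GeoGetArgCheckSketch {
  // Wrong arity is a "static" misuse, not a data-quality problem,
  // so fail fast instead of quietly returning null.
  static void validateArgs(List<Object> args) {
    if (args == null || args.isEmpty() || args.size() > 2) {
      throw new IllegalArgumentException(
          "GEO_GET expects 1 or 2 arguments: an IP and an optional list of fields");
    }
  }
}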

Contributor

It is a little tricky because we don't want a data quality issue to impact the system negatively, but at the same time failing fast can make the user experience much better.

I have been thinking about it this way. If it is a condition that can be triggered by a data quality issue, then try to be as forgiving as possible. If it is clearly a 'static' condition, like I passed the wrong number of arguments, then it is appropriate to fail-fast.

return addr.isAnyLocalAddress() || addr.isLoopbackAddress()
    || addr.isSiteLocalAddress() || addr.isMulticastAddress()
    || !ipvalidator.isValidInet4Address(ipStr);
}
Contributor

Method name may be a little misleading. It is more answering the question of whether an IP is eligible for geo-enrichment; not whether it is a local address.

Contributor Author

I agree, I'll go ahead and change the method name.
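For example, something along these lines (the new method name is hypothetical; it just illustrates naming the check after geo-enrichment eligibility, using the same commons-validator check as the diff above):

import org.apache.commons.validator.routines.InetAddressValidator;

import java.net.InetAddress;
import java.net.UnknownHostException;

public class GeoEligibilitySketch {
  private static final InetAddressValidator ipvalidator =
      InetAddressValidator.getInstance();

  // Hypothetical rename: the check answers "can this IP be geo-enriched?",
  // not merely "is it a local address?".
  static boolean isIneligibleForGeoEnrichment(String ipStr) throws UnknownHostException {
    InetAddress addr = InetAddress.getByName(ipStr);
    return addr.isAnyLocalAddress() || addr.isLoopbackAddress()
        || addr.isSiteLocalAddress() || addr.isMulticastAddress()
        || !ipvalidator.isValidInet4Address(ipStr);
  }
}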

"fields - Optional list of GeoIP fields to grab. Options are locID, country, city, postalCode, dmaCode, latitude, longitude, location_point"
}
,returns = "If a Single field is requested a string of the field, If multiple fields a map of string of the fields, and null otherwise"
)
Contributor

Seems useful. Would we ever deprecate the current way of doing enrichment in favor of just using Stellar and a function like this? May take some more work, but that might simplify things.

(Comment for future state; obviously outside the scope of your PR)

Member

Outside the scope of this PR, I agree, but I just wanted to chime in and agree with the sentiment. Now that we have geo IP enrichment, honestly, Stellar enrichments aren't missing any features that the other enrichments have. Worthy of a discussion, for sure.

Contributor Author

Do we want to kick off a discuss thread for this? Seems like we might as well start the discussion rather than potentially having it buried here. And the discussion is useful even before everyone's settled on this PR.

Contributor

Sure. Will do.

@@ -149,9 +154,10 @@ public JSONObject load(CacheKey key) throws Exception {
cache = CacheBuilder.newBuilder().maximumSize(maxCacheSize)
    .expireAfterWrite(maxTimeRetain, TimeUnit.MINUTES)
    .build(loader);
GeoLiteDatabase.INSTANCE.update((String)getConfigurations().getGlobalConfig().get(GeoLiteDatabase.GEO_HDFS_FILE));
boolean success = adapter.initializeAdapter();
Contributor

Does this mean that we will attempt to load the geo database into memory in every GenericEnrichmentBolt, even ones not doing geo-enrichment? Maybe that is one reason we have a singleton for the GeoLiteDatabase: to avoid repeated initialization?

Along the same lines, if I choose not to do geo-enrichment, do I still need to have a valid MaxMind file in HDFS? Or would that cause all GenericEnrichmentBolts to fail to initialize?

Would it make sense to update the EnrichmentAdapter interface to accept configuration values, so that this initialization occurs in the GeoAdapter where it makes more sense? Then only the bolts doing geo-enrichment attempt to initialize the geo data?

Contributor Author

You're right that if we adjust the enrichment adapter to accept configuration values, this can be pushed to the GeoAdapter, where I agree it makes more sense.

Are there concerns over changing the interface method directly, or would we prefer to give EnrichmentAdapter an initializeAdapter(Map<String, Object> config) with a default impl that calls initializeAdapter() with no args? The second option makes it a bit ugly that GeoAdapter still has the original initializeAdapter(), but it has the benefit of keeping the interface backwards compatible for anybody with custom adapters (is that a thing people have?).

And you are correct that there is an issue with initializing a missing DB (great catch!). If we make this change, I believe the only time you'd have an issue is if you set up a geo enrichment but didn't actually set up the DB, which is an entirely reasonable situation in which to raise an error.

@dlyle65535 Would this address your concerns about having the init in the GenericEnrichmentBolt in a satisfactory manner?

Contributor

@justinleet - I think that's a great solution, thanks!

import java.util.zip.GZIPInputStream;

public enum GeoLiteDatabase {
  INSTANCE;
Contributor

Why did you go with a singleton? I'm sure you thought through this and there are good reasons, so just want to understand. I just don't like the extra burden that using a singleton brings on (like the extra locking).

Member

So, I'll defend the singleton use:

  • We can reuse the same database across multiple storm tasks on the same storm worker
  • Even without a singleton, you'd have to deal with the locking in the Stellar function since those are effectively static; i.e., you're reading a database that could be updated or replaced out from under you.

I'll let Justin respond with his reasons, but those are the ones that came to my head.

Contributor Author

It's primarily because the database file could be used and accessed by multiple threads of execution, depending on Storm's parallelism. If multiple tasks are running, they could potentially start stomping on each other's data.

The burden shouldn't be bad, because usage primarily consists of reading. The alternative is having potentially every thread pull down its own version of the file and load it, while avoiding tripping up other threads.

Having said that, I am definitely open to other ideas if we have good alternatives.

@nickwallen (Contributor) left a comment

I wanted to look over the locking code one more time.

try {
  readLock.lock();
  addr = InetAddress.getByName(ip);
  CityResponse cityResponse = reader.city(addr);
Contributor

Would it make sense to constrain the scope of the read lock? Once we have a CityResponse object we don't need the lock any longer.
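A sketch of what that narrower critical section might look like, assuming the readLock and reader fields from the class under review (the surrounding method is hypothetical):

// Sketch only: hold the read lock just for the lookup itself.
public Map<String, String> lookupSketch(String ip) throws Exception {
  InetAddress addr = InetAddress.getByName(ip);   // no lock needed here
  CityResponse cityResponse;
  readLock.lock();
  try {
    cityResponse = reader.city(addr);             // only this touches shared state
  } finally {
    readLock.unlock();
  }
  // CityResponse is a plain value object, so building the result map
  // no longer blocks a pending database swap.
  Map<String, String> result = new HashMap<>();
  result.put("country", cityResponse.getCountry().getIsoCode());
  result.put("city", cityResponse.getCity().getName());
  return result;
}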

}

// Always update if we don't have a DatabaseReader
if (reader == null || !hdfsLoc.equals(hdfsFile)) {
Contributor

Does this null check need to be protected by a lock? I am really not sure.

Contributor Author

updateIfNecessary is actually synchronized, so it's already locked. Given that it's only run when the global config is updated, which is rare, the impact will be very low.
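Roughly, based on the fragment above (a sketch, not the complete method):

// Synchronizing the whole method makes the null/path check and the
// update atomic; it only runs on (rare) global config updates.
public synchronized void updateIfNecessary(Map<String, Object> globalConfig) {
  String hdfsFile = (String) globalConfig.get(GEO_HDFS_FILE);
  // Always update if we don't have a DatabaseReader
  if (reader == null || !hdfsLoc.equals(hdfsFile)) {
    update(hdfsFile);  // pulls the new file and swaps the reader
  }
}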

@justinleet (Contributor Author)

Updated to replace EnrichmentAdapter.initializeAdapter() with EnrichmentAdapter.initializeAdapter(Map<String, Object> config). Also added updateAdapter(Map<String, Object> config). Right now the configs only get used by GeoAdapter, but any adapter can use the config it is passed as needed. All implementing classes have the method updated, but simply ignore the config param. (The reworked contract is sketched after this comment.)

This removes GenericEnrichmentBolt's direct dependency on GeoLiteDatabase. The bolt simply calls init and update, delegating to each adapter what to do (if anything). Update is called during reloadCallback when the global config updates.

I'll want to spin it up again to further validate, but this addresses @nickwallen's catch about topologies that don't use geo needing the geo data to exist, and @dlyle65535's concern about GenericEnrichmentBolt needing to know about geo for everything. Now errors should only occur if a topology that uses geo data can't find it.
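As a sketch, the reworked adapter contract described in the comment above looks roughly like this (illustrative only; the real interface has more members):

import java.util.Map;

// Illustrative only: the actual EnrichmentAdapter interface has more methods.
public interface EnrichmentAdapterSketch<T> {
  // Adapters now receive config at init time, so geo-specific setup can
  // live in GeoAdapter instead of GenericEnrichmentBolt.
  boolean initializeAdapter(Map<String, Object> config);

  // Called from reloadCallback when the global config changes.
  void updateAdapter(Map<String, Object> config);
}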

@nickwallen (Contributor) left a comment

Looks great. +1

I ran it up on "Quick Dev" and geo-enrichment was working as expected.

@cestella (Member)

+1 as well, great job, Justin.

@dlyle65535 (Contributor)

Ditto +1. Thanks for all the hard work!

@justinleet (Contributor Author)

@cestella @dlyle65535 @nickwallen I added one more change to EnrichmentIntegrationTest so it doesn't create and destroy some tmp stuff that isn't used in the test. Let me know if there are any concerns.

@nickwallen (Contributor)

+1 looks good

@dlyle65535 (Contributor)

Still +1 @justinleet - thanks for checking.

@cestella (Member)

reaffirmed +1

@asfgit asfgit closed this in fa6f3df Jan 26, 2017
@justinleet justinleet deleted the geo_mmdb branch April 4, 2017 12:14