ingest-useragent plugin #19074

cwurm · 2016-06-24T22:00:29Z

This PR adds an ingest-useragent plugin, meant to replicate the functionality of logstash-filter-useragent.

This is the last ingest plugin missing to be able to replicate a standard pipeline for web logs with just the Ingest Node (e.g. as in the example for Apache logs).

The code is based on logstash-filter-useragent, the Ruby library it uses (uap-ruby) and the Java equivalent, uap-java. However, since the code in uap-java is kind of overengineered for this purpose (esp. way too many classes) I wrote the whole thing from scratch, so there shouldn't be any significant code share.

regexes.yaml, the file containing regular expressions for useragent, operating system and device detection is copied from uap-core, I added a license header to it. Doing it this way keeps this dependency in the code, but will require it to be updated manually from time to time. If a user requires a newer or even custom version of this file, they can copy it into the config/ directory of the plugin and specify the regex_file option (similar to database_file in ingest-geoip).

Some points that may be worth discussing:

Should regexes.yaml be part of this repo? ingest-geoip depends on some external database files that are in a separate repo under the elastic org and uploaded to Maven Central. I shied away from this because A) in this case it's a text file, not binary B) it's more straightforward having it in this repo, thereby being able to easily PR new versions, seeing all changes made to it in the git history, etc.
This plugin has one dependency: commons-collections. It's not a unique new one, repository-hdfs already depends on it. However, here only one class is used, LRUMap. It's important, as it dramatically improves performance (in my tests, it reduces processing time by almost 80%, would likely be even higher if measured without grok running in front). If an LRU cache already exists in Elasticsearch or one of its existing dependencies using that one might be better.
In UseragentParser there's two classes with public members and no methods, used solely as containers to be passed to UseragentProcessor. I'm happy to put getters and setters in place, I thought they'd unnecessarily clutter the code without adding any real value.
I tried running gradle check as per CONTRIBUTING.MD, however it fails due to _ttl no longer being available in 5.0. Since this happens because of tests outside of my code and I can see that those jobs are currently disabled on build.elastic.co I figured I'm fine. gradle assemble and gradle test on just this plugin as well as gradle assemble on elasticsearch/ works fine, and I tested the deployed plugin with real data.

Hope I didn't forget anything, let me know if I have or if I should change something.

jasontedor · 2016-06-24T22:51:35Z

If an LRU cache already exists in Elasticsearch or one of its existing dependencies using that one might be better.

We have one, org.elasticsearch.common.cache.Cache.

cwurm · 2016-06-24T23:30:36Z

Indeed, thanks. It can't be configured to only hold a certain number of elements though, can it? High-cardinality data would lead to higher memory usage then.

The cardinality in my smallish dataset of 300K web requests is about 5K different agent strings, in my tests an LRU cache with 1K elements is practically as good as one with 10K.

I tested it out, performance is as good as with LRUMap, so that's not a concern.

jasontedor · 2016-06-24T23:33:27Z

It can't be configured to only hold a certain number of elements though, can it?

Yes, it can. Set the maximum weight to the maximum number of elements that you want and set the weight function to map every element to weight 1.

rjernst · 2016-06-25T02:01:30Z

plugins/ingest-useragent/build.gradle

+
+bundlePlugin {
+  from("${project.projectDir}/src/main/resources") {
+    into 'config/'


Please don't do this. Plugins with static configuration files should be on their way out. This is a resource, and it should be loaded as a resource.

@rjernst What do you mean by "loaded as a resource"? class.getResource*?

I copied the bundlePlugin idea from ingest-geoip and thought it has the nice effect of letting users find the regex file in the config directory of the plugin, allowing them to easily check its contents, adapt it or replace/supplement it by another. Would there be a way to preserve that advantage?

I think what Ryan means here is that the regexes.yaml should just remain on the classpath and it is then loaded from there.

Maybe the grok processor should be used as an example here instead. There the default grok patterns are loaded from the classpath and custom patterns can be defined inside the processor itself. I think managing custom regexes would be easier that way?

The only reason ingest-geoip does this bundling is because of the size of the database files.

Yes, Martijn explained my thoughts exactly.

martijnvg · 2016-06-27T06:19:19Z

@cwurm Nice work!

I wonder if this processor can be placed in the ingest-common module instead of as separate plugin? Especially if this processor can handle its resource like the grok processor? The reasons that geoip and attachment are provided as plugin is because of the size its dependencies and the fact that these plugins require additional security permissions.

cwurm · 2016-06-27T16:02:43Z

@martijnvg I like the idea.

However, ingest-common doesn't have its own directory in config/. As you suggested earlier, we could load the default regex file from the classpath, but what about when a user wants to provide a custom regex file?

The user could put it into the top-level config directory, together with elasticsearch.yml and other standard files. Is there another feature that does that already? Would doing this be alright?

I guess we could have the user create a new subdirectory, yet it feels error-prone.

s1monw · 2016-06-27T16:06:23Z

The user could put it into the top-level config directory, together with elasticsearch.yml and other standard files. Is there another feature that does that already? Would doing this be alright?

yes that is the place to put things like this. you can specify a directory via some setting and that directory can be wherever. You are not bound to anything as long as the security manager allows to read from it. stuff like this must be a node setting but must not be part of the module/plugin

martijnvg · 2016-06-27T16:11:45Z

The user could put it into the top-level config directory, together with elasticsearch.yml and other standard files. Is there another feature that does that already? Would doing this be alright?

I think we should pass custom regexes via the processor configuration in the pipeline definition? This is also now how custom grok patterns are specified. So right now this doesn't allow sharing of custom regexes between pipelines, but we could add infrastructure for that in ingest, in a general manner, so that grok patterns can be shared too between pipelines. If we have this then managing custom regexes is much easier as it is available on all nodes via the cluster state.

cwurm · 2016-06-27T16:40:01Z

I think we should pass custom regexes via the processor configuration in the pipeline definition? This is also now how custom grok patterns are specified.

For grok this makes sense, but I don't think that is practical here.

We would provide a standard set of 800+ regexes. I'm pretty sure there's user agents it doesn't cover and going forward, this will only get worse as the release ages.

At the moment, in logstash-filter-useragent we allow the user to provide a custom file containing new patterns (similar in ingest-geoip for IPs). In practice, this will be a more current version of the same file (on average, it gets several updates per month).

I don't think pasting all of that into Sense / Console makes sense. There might be cases where somebody wants to add a specific user agent (e.g. some internal crawler or something), but even then it's impractical to add it separately, as order matters (first matching regex wins).

martijnvg · 2016-06-28T07:00:24Z

as order matters (first matching regex wins).

Ok, I didn't realise this. Then for now lets allow adding custom files via config/ingest-useragent directory and the out of the box file should then be provided via the jar file.

martijnvg · 2016-06-28T07:01:56Z

...ns/ingest-useragent/src/main/java/org/elasticsearch/ingest/useragent/UseragentProcessor.java

+import static org.elasticsearch.ingest.ConfigurationUtils.readStringProperty;
+import static org.elasticsearch.ingest.ConfigurationUtils.readIntProperty;
+
+public class UseragentProcessor extends AbstractProcessor {


Can the the processor and its factory be made final? This is consistent with other processors

oops factory is already a final class.

cwurm · 2016-06-29T17:28:37Z

Is anything still outstanding after my last commits?

martijnvg · 2016-06-30T06:58:39Z

plugins/ingest-useragent/src/main/java/org/elasticsearch/ingest/useragent/UserAgentCache.java

+    }
+
+    private static final class CompositeCacheKey {
+        private final String str1;


rename to parserName and userAgent?

martijnvg · 2016-06-30T07:42:47Z

...ingest-useragent/src/main/java/org/elasticsearch/ingest/useragent/IngestUserAgentPlugin.java

+                IngestUserAgentPlugin.class.getResourceAsStream("/regexes.yaml"), cache);
+        userAgentParsers.put(DEFAULT_PARSER_NAME, defaultParser);
+
+        if (Files.exists(userAgentConfigDirectory) == false && Files.isDirectory(userAgentConfigDirectory)) {


This directory is optional now, so the logic should be changed?

if (Files.exists(userAgentConfigDirectory)) { if (Files.isDirectory(userAgentConfigDirectory) == false) { throw new IllegalStateException( "the user agent config path [" + userAgentConfigDirectory + "] isn't a directory"); } PathMatcher pathMatcher = userAgentConfigDirectory.getFileSystem().getPathMatcher("glob:**.yaml"); try (Stream<Path> regexFiles = Files.find(userAgentConfigDirectory, 1, (path, attr) -> attr.isRegularFile() && pathMatcher.matches(path))) { Iterable<Path> iterable = regexFiles::iterator; for (Path path : iterable) { String parserName = path.getFileName().toString(); try (InputStream regexStream = Files.newInputStream(path, StandardOpenOption.READ)) { userAgentParsers.put(parserName, new UserAgentParser(parserName, regexStream, cache)); } } } }

Is it optional? I like the idea of allowing users to place their custom regex files wherever they want, but that would make it hard to parse them at node startup.

Right now we're trying to parse all *.yaml files in config/ingest-useragent, the same way ingest-geoip is doing it. Can we do it another way, while not allowing changes to files after startup to have any effect in the running instance?

Oh wait, you mean the directory doesn't have to exist if there's no custom regex file? Good point, would the user then have to create that directory? I kind of assumed it's being auto-created somewhere.

Yes, that is what I meant.

The user would need to create the directory, so it is important that this is documented too.

martijnvg · 2016-06-30T08:01:01Z

@cwurm The PR is looking good. I left some more comments. Also when running gradle clean check there are some checkstyle failure (some lines are too long).

martijnvg · 2016-06-30T08:31:49Z

We also need an integration test that tests using a custom regex file.

Add a custom test regex file to the plugin root directory.
Add the following snippet to the build.gradle file:

integTest {
  cluster {
    extraConfigFile 'ingest-useragent/test-regexes.yaml', 'test-regexes.yaml'
  }
}

Add a test case that creates a pipeline that sets regex_file setting to test-regexes.yaml.

rjernst · 2016-06-30T10:00:38Z

I would not put that file under src/test/resources because then it would be on the classpath, and the point would be to ensure finding it in the filesystem works (yes the proposed gradle config will add it to the config dir, but having it in both places leaves us unsure of whether it works bc of the classpath or filesystem).

martijnvg · 2016-06-30T10:16:34Z

@rjernst Makes sense, then that file should be added to the plugin root directory.

cwurm · 2016-06-30T11:20:07Z

@martijnvg UserAgentProcessorFactoryTests creates a second file in the file system and tests using that in testBuildRegexFile. Would that not be enough?

martijnvg · 2016-06-30T11:26:15Z

@cwurm No, it is should be separate file that doesn't endup on the classpath. This file should be simple and just contain a single line (dummy regex), just to prove that if a custom file specified things work as expected.

cwurm · 2016-06-30T15:39:04Z

@martijnvg Last few commits should have implemented remaining suggestions. The optional directory one is outstanding, I didn't quite understand that

martijnvg · 2016-06-30T15:47:24Z

@cwurm The config/ingest-useragent only exists if custom regex files are added, right? The logic in IngestUserAgentPlugin.createUserAgentParsers(...) could just be:

if (directoryExists) {
  // load the parsers based on custom files
}

cwurm · 2016-06-30T16:26:34Z

@martijnvg Right. Implemented in d4cf76d

martijnvg · 2016-06-30T20:23:20Z

@cwurm This looks great! The only thing I'm wondering now if we should move this processor in the ingest-common module. What do you think?

By default we don't rely on any additional config that needs to be packaged, the default regex file is part of jar file itself. Also this plugin doesn't depend an additional dependencies or require additional security permission (either both is usually a reason the package a feature as a plugin) and by moving this processor to the ingest-common module it becomes out-of-the-box available.

cwurm · 2016-06-30T21:34:39Z

@martijnvg I'm a little hesitant, because all the other plugins in ingest-common so far are very general-purpose, whereas this one is just applicable if you have user agent strings in your data, which most data doesn't. I'd be happy to have it in common though, so if you think that's a good idea let's do it.

martijnvg · 2016-07-01T06:11:03Z

@cwurm We can always move it to ingest-commen if we find out that a lot of times this plugin ends up being installed. LGTM!

cwurm · 2016-07-01T11:35:52Z

@martijnvg Thanks. I've made two more commits to what has proved to be the most "fragile" piece of the code during development and testing. Still looking good? Can I merge?

martijnvg · 2016-07-01T11:40:08Z

@cwurm 👍 yes, still looking great.

jasontedor · 2016-07-05T09:39:17Z

@cwurm Thanks for the great work in this PR.

rjernst reviewed Jun 25, 2016
View reviewed changes

martijnvg added >feature :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Jun 27, 2016

martijnvg reviewed Jun 28, 2016
View reviewed changes

martijnvg added the review label Jun 28, 2016

martijnvg reviewed Jun 30, 2016
View reviewed changes

martijnvg added the v5.0.0-alpha5 label Jun 30, 2016

Add ingest-useragent plugin

81c5bdf

cwurm force-pushed the ingest_useragent branch from bf0aa60 to 81c5bdf Compare July 1, 2016 13:47

cwurm merged commit 42addb5 into elastic:master Jul 1, 2016

BigFunger mentioned this pull request Jul 8, 2016

[Ingest] Editable Pipelines Prototype elastic/kibana#7684

Closed

28 tasks

ingest-useragent plugin #19074

ingest-useragent plugin #19074

Conversation

cwurm commented Jun 24, 2016

jasontedor commented Jun 24, 2016

cwurm commented Jun 24, 2016

jasontedor commented Jun 24, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

martijnvg commented Jun 27, 2016

cwurm commented Jun 27, 2016

s1monw commented Jun 27, 2016

martijnvg commented Jun 27, 2016

cwurm commented Jun 27, 2016

martijnvg commented Jun 28, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cwurm commented Jun 29, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

martijnvg commented Jun 30, 2016

martijnvg commented Jun 30, 2016 • edited

rjernst commented Jun 30, 2016

martijnvg commented Jun 30, 2016

cwurm commented Jun 30, 2016

martijnvg commented Jun 30, 2016

cwurm commented Jun 30, 2016 • edited

martijnvg commented Jun 30, 2016

cwurm commented Jun 30, 2016

martijnvg commented Jun 30, 2016

cwurm commented Jun 30, 2016

martijnvg commented Jul 1, 2016

cwurm commented Jul 1, 2016

martijnvg commented Jul 1, 2016

jasontedor commented Jul 5, 2016

martijnvg commented Jun 30, 2016 •

edited

cwurm commented Jun 30, 2016 •

edited