Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest-useragent plugin #19074

Merged
merged 1 commit into from Jul 1, 2016
Merged

ingest-useragent plugin #19074

merged 1 commit into from Jul 1, 2016

Conversation

cwurm
Copy link
Contributor

@cwurm cwurm commented Jun 24, 2016

This PR adds an ingest-useragent plugin, meant to replicate the functionality of logstash-filter-useragent.

This is the last ingest plugin missing to be able to replicate a standard pipeline for web logs with just the Ingest Node (e.g. as in the example for Apache logs).

The code is based on logstash-filter-useragent, the Ruby library it uses (uap-ruby) and the Java equivalent, uap-java. However, since the code in uap-java is kind of overengineered for this purpose (esp. way too many classes) I wrote the whole thing from scratch, so there shouldn't be any significant code share.

regexes.yaml, the file containing regular expressions for useragent, operating system and device detection is copied from uap-core, I added a license header to it. Doing it this way keeps this dependency in the code, but will require it to be updated manually from time to time. If a user requires a newer or even custom version of this file, they can copy it into the config/ directory of the plugin and specify the regex_file option (similar to database_file in ingest-geoip).

Some points that may be worth discussing:

  1. Should regexes.yaml be part of this repo? ingest-geoip depends on some external database files that are in a separate repo under the elastic org and uploaded to Maven Central. I shied away from this because A) in this case it's a text file, not binary B) it's more straightforward having it in this repo, thereby being able to easily PR new versions, seeing all changes made to it in the git history, etc.
  2. This plugin has one dependency: commons-collections. It's not a unique new one, repository-hdfs already depends on it. However, here only one class is used, LRUMap. It's important, as it dramatically improves performance (in my tests, it reduces processing time by almost 80%, would likely be even higher if measured without grok running in front). If an LRU cache already exists in Elasticsearch or one of its existing dependencies using that one might be better.
  3. In UseragentParser there's two classes with public members and no methods, used solely as containers to be passed to UseragentProcessor. I'm happy to put getters and setters in place, I thought they'd unnecessarily clutter the code without adding any real value.
  4. I tried running gradle check as per CONTRIBUTING.MD, however it fails due to _ttl no longer being available in 5.0. Since this happens because of tests outside of my code and I can see that those jobs are currently disabled on build.elastic.co I figured I'm fine. gradle assemble and gradle test on just this plugin as well as gradle assemble on elasticsearch/ works fine, and I tested the deployed plugin with real data.

Hope I didn't forget anything, let me know if I have or if I should change something.

@jasontedor
Copy link
Member

If an LRU cache already exists in Elasticsearch or one of its existing dependencies using that one might be better.

We have one, org.elasticsearch.common.cache.Cache.

@cwurm
Copy link
Contributor Author

cwurm commented Jun 24, 2016

Indeed, thanks. It can't be configured to only hold a certain number of elements though, can it? High-cardinality data would lead to higher memory usage then.

The cardinality in my smallish dataset of 300K web requests is about 5K different agent strings, in my tests an LRU cache with 1K elements is practically as good as one with 10K.

I tested it out, performance is as good as with LRUMap, so that's not a concern.

@jasontedor
Copy link
Member

It can't be configured to only hold a certain number of elements though, can it?

Yes, it can. Set the maximum weight to the maximum number of elements that you want and set the weight function to map every element to weight 1.


bundlePlugin {
from("${project.projectDir}/src/main/resources") {
into 'config/'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't do this. Plugins with static configuration files should be on their way out. This is a resource, and it should be loaded as a resource.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjernst What do you mean by "loaded as a resource"? class.getResource*?

I copied the bundlePlugin idea from ingest-geoip and thought it has the nice effect of letting users find the regex file in the config directory of the plugin, allowing them to easily check its contents, adapt it or replace/supplement it by another. Would there be a way to preserve that advantage?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what Ryan means here is that the regexes.yaml should just remain on the classpath and it is then loaded from there.

Maybe the grok processor should be used as an example here instead. There the default grok patterns are loaded from the classpath and custom patterns can be defined inside the processor itself. I think managing custom regexes would be easier that way?

The only reason ingest-geoip does this bundling is because of the size of the database files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Martijn explained my thoughts exactly.

@martijnvg
Copy link
Member

@cwurm Nice work!

I wonder if this processor can be placed in the ingest-common module instead of as separate plugin? Especially if this processor can handle its resource like the grok processor? The reasons that geoip and attachment are provided as plugin is because of the size its dependencies and the fact that these plugins require additional security permissions.

@martijnvg martijnvg added >feature :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Jun 27, 2016
@cwurm
Copy link
Contributor Author

cwurm commented Jun 27, 2016

@martijnvg I like the idea.

However, ingest-common doesn't have its own directory in config/. As you suggested earlier, we could load the default regex file from the classpath, but what about when a user wants to provide a custom regex file?

The user could put it into the top-level config directory, together with elasticsearch.yml and other standard files. Is there another feature that does that already? Would doing this be alright?

I guess we could have the user create a new subdirectory, yet it feels error-prone.

@s1monw
Copy link
Contributor

s1monw commented Jun 27, 2016

The user could put it into the top-level config directory, together with elasticsearch.yml and other standard files. Is there another feature that does that already? Would doing this be alright?

yes that is the place to put things like this. you can specify a directory via some setting and that directory can be wherever. You are not bound to anything as long as the security manager allows to read from it. stuff like this must be a node setting but must not be part of the module/plugin

@martijnvg
Copy link
Member

The user could put it into the top-level config directory, together with elasticsearch.yml and other standard files. Is there another feature that does that already? Would doing this be alright?

I think we should pass custom regexes via the processor configuration in the pipeline definition? This is also now how custom grok patterns are specified. So right now this doesn't allow sharing of custom regexes between pipelines, but we could add infrastructure for that in ingest, in a general manner, so that grok patterns can be shared too between pipelines. If we have this then managing custom regexes is much easier as it is available on all nodes via the cluster state.

@cwurm
Copy link
Contributor Author

cwurm commented Jun 27, 2016

I think we should pass custom regexes via the processor configuration in the pipeline definition? This is also now how custom grok patterns are specified.

For grok this makes sense, but I don't think that is practical here.

We would provide a standard set of 800+ regexes. I'm pretty sure there's user agents it doesn't cover and going forward, this will only get worse as the release ages.

At the moment, in logstash-filter-useragent we allow the user to provide a custom file containing new patterns (similar in ingest-geoip for IPs). In practice, this will be a more current version of the same file (on average, it gets several updates per month).

I don't think pasting all of that into Sense / Console makes sense. There might be cases where somebody wants to add a specific user agent (e.g. some internal crawler or something), but even then it's impractical to add it separately, as order matters (first matching regex wins).

@martijnvg
Copy link
Member

as order matters (first matching regex wins).

Ok, I didn't realise this. Then for now lets allow adding custom files via config/ingest-useragent directory and the out of the box file should then be provided via the jar file.

import static org.elasticsearch.ingest.ConfigurationUtils.readStringProperty;
import static org.elasticsearch.ingest.ConfigurationUtils.readIntProperty;

public class UseragentProcessor extends AbstractProcessor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the the processor and its factory be made final? This is consistent with other processors

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops factory is already a final class.

@cwurm
Copy link
Contributor Author

cwurm commented Jun 29, 2016

Is anything still outstanding after my last commits?

}

private static final class CompositeCacheKey {
private final String str1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to parserName and userAgent?

IngestUserAgentPlugin.class.getResourceAsStream("/regexes.yaml"), cache);
userAgentParsers.put(DEFAULT_PARSER_NAME, defaultParser);

if (Files.exists(userAgentConfigDirectory) == false && Files.isDirectory(userAgentConfigDirectory)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This directory is optional now, so the logic should be changed?

if (Files.exists(userAgentConfigDirectory)) {
            if (Files.isDirectory(userAgentConfigDirectory) == false) {
                throw new IllegalStateException(
                        "the user agent config path [" + userAgentConfigDirectory + "] isn't a directory");
            }

            PathMatcher pathMatcher = userAgentConfigDirectory.getFileSystem().getPathMatcher("glob:**.yaml");
            try (Stream<Path> regexFiles = Files.find(userAgentConfigDirectory, 1,
                    (path, attr) -> attr.isRegularFile() && pathMatcher.matches(path))) {
                Iterable<Path> iterable = regexFiles::iterator;
                for (Path path : iterable) {
                    String parserName = path.getFileName().toString();
                    try (InputStream regexStream = Files.newInputStream(path, StandardOpenOption.READ)) {
                        userAgentParsers.put(parserName, new UserAgentParser(parserName, regexStream, cache));
                    }
                }
            }
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it optional? I like the idea of allowing users to place their custom regex files wherever they want, but that would make it hard to parse them at node startup.

Right now we're trying to parse all *.yaml files in config/ingest-useragent, the same way ingest-geoip is doing it. Can we do it another way, while not allowing changes to files after startup to have any effect in the running instance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, you mean the directory doesn't have to exist if there's no custom regex file? Good point, would the user then have to create that directory? I kind of assumed it's being auto-created somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is what I meant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user would need to create the directory, so it is important that this is documented too.

@martijnvg
Copy link
Member

@cwurm The PR is looking good. I left some more comments. Also when running gradle clean check there are some checkstyle failure (some lines are too long).

@martijnvg
Copy link
Member

martijnvg commented Jun 30, 2016

We also need an integration test that tests using a custom regex file.

  1. Add a custom test regex file to the plugin root directory.
  2. Add the following snippet to the build.gradle file:
integTest {
  cluster {
    extraConfigFile 'ingest-useragent/test-regexes.yaml', 'test-regexes.yaml'
  }
}
  1. Add a test case that creates a pipeline that sets regex_file setting to test-regexes.yaml.

@rjernst
Copy link
Member

rjernst commented Jun 30, 2016

I would not put that file under src/test/resources because then it would be on the classpath, and the point would be to ensure finding it in the filesystem works (yes the proposed gradle config will add it to the config dir, but having it in both places leaves us unsure of whether it works bc of the classpath or filesystem).

@martijnvg
Copy link
Member

@rjernst Makes sense, then that file should be added to the plugin root directory.

@cwurm
Copy link
Contributor Author

cwurm commented Jun 30, 2016

@martijnvg UserAgentProcessorFactoryTests creates a second file in the file system and tests using that in testBuildRegexFile. Would that not be enough?

@martijnvg
Copy link
Member

@cwurm No, it is should be separate file that doesn't endup on the classpath. This file should be simple and just contain a single line (dummy regex), just to prove that if a custom file specified things work as expected.

@cwurm
Copy link
Contributor Author

cwurm commented Jun 30, 2016

@martijnvg Last few commits should have implemented remaining suggestions. The optional directory one is outstanding, I didn't quite understand that

@martijnvg
Copy link
Member

@cwurm The config/ingest-useragent only exists if custom regex files are added, right? The logic in IngestUserAgentPlugin.createUserAgentParsers(...) could just be:

if (directoryExists) {
  // load the parsers based on custom files
}

@cwurm
Copy link
Contributor Author

cwurm commented Jun 30, 2016

@martijnvg Right. Implemented in d4cf76d

@martijnvg
Copy link
Member

@cwurm This looks great! The only thing I'm wondering now if we should move this processor in the ingest-common module. What do you think?

By default we don't rely on any additional config that needs to be packaged, the default regex file is part of jar file itself. Also this plugin doesn't depend an additional dependencies or require additional security permission (either both is usually a reason the package a feature as a plugin) and by moving this processor to the ingest-common module it becomes out-of-the-box available.

@cwurm
Copy link
Contributor Author

cwurm commented Jun 30, 2016

@martijnvg I'm a little hesitant, because all the other plugins in ingest-common so far are very general-purpose, whereas this one is just applicable if you have user agent strings in your data, which most data doesn't. I'd be happy to have it in common though, so if you think that's a good idea let's do it.

@martijnvg
Copy link
Member

@cwurm We can always move it to ingest-commen if we find out that a lot of times this plugin ends up being installed. LGTM!

@cwurm
Copy link
Contributor Author

cwurm commented Jul 1, 2016

@martijnvg Thanks. I've made two more commits to what has proved to be the most "fragile" piece of the code during development and testing. Still looking good? Can I merge?

@martijnvg
Copy link
Member

@cwurm 👍 yes, still looking great.

@cwurm cwurm merged commit 42addb5 into elastic:master Jul 1, 2016
@jasontedor
Copy link
Member

@cwurm Thanks for the great work in this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >feature v5.0.0-alpha5
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants