ingest-useragent plugin #19074
We have one,
Indeed, thanks. It can't be configured to hold only a certain number of elements though, can it? High-cardinality data would then lead to higher memory usage. My smallish dataset of 300K web requests contains about 5K different agent strings; in my tests an LRU cache with 1K elements is practically as good as one with 10K. I tested it out, performance is as good as with
Yes, it can. Set the maximum weight to the maximum number of elements that you want and set the weight function to map every element to weight 1.
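To illustrate the suggestion above, here is a minimal sketch of a weight-bounded LRU cache. It is a stand-in built on `java.util.LinkedHashMap`, not the Elasticsearch cache class; the names `WeightBoundedLruCache`, `maximumWeight` and `weigher` are hypothetical. With a weigher that returns 1 for every entry, the maximum weight is simply the maximum number of elements.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongBiFunction;

// Hypothetical stand-in for a weight-bounded cache; values are assumed to be non-null.
final class WeightBoundedLruCache<K, V> {
    private final long maximumWeight;
    private final ToLongBiFunction<K, V> weigher;
    // accessOrder = true keeps the least recently used entry first in iteration order.
    private final LinkedHashMap<K, V> entries = new LinkedHashMap<>(16, 0.75f, true);
    private long totalWeight = 0;

    WeightBoundedLruCache(long maximumWeight, ToLongBiFunction<K, V> weigher) {
        this.maximumWeight = maximumWeight;
        this.weigher = weigher;
    }

    synchronized V get(K key) {
        return entries.get(key); // also marks the entry as most recently used
    }

    synchronized void put(K key, V value) {
        V previous = entries.put(key, value);
        if (previous != null) {
            totalWeight -= weigher.applyAsLong(key, previous);
        }
        totalWeight += weigher.applyAsLong(key, value);
        // Evict least recently used entries until the total weight fits again.
        Iterator<Map.Entry<K, V>> it = entries.entrySet().iterator();
        while (totalWeight > maximumWeight && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            totalWeight -= weigher.applyAsLong(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    synchronized int size() {
        return entries.size();
    }
}

final class WeightBoundedLruCacheDemo {
    public static void main(String[] args) {
        // Every entry weighs 1, so a maximum weight of 2 means "at most 2 elements".
        WeightBoundedLruCache<String, String> cache = new WeightBoundedLruCache<>(2, (k, v) -> 1L);
        cache.put("agent-a", "A");
        cache.put("agent-b", "B");
        cache.get("agent-a");      // agent-b is now the least recently used entry
        cache.put("agent-c", "C"); // evicts agent-b
        System.out.println(cache.get("agent-b")); // prints "null"
        System.out.println(cache.size());         // prints "2"
    }
}
```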
bundlePlugin {
    from("${project.projectDir}/src/main/resources") {
        into 'config/'
Please don't do this. Plugins with static configuration files should be on their way out. This is a resource, and it should be loaded as a resource.
@rjernst What do you mean by "loaded as a resource"? class.getResource*?
I copied the bundlePlugin idea from ingest-geoip and thought it has the nice effect of letting users find the regex file in the config directory of the plugin, allowing them to easily check its contents, adapt it or replace/supplement it by another. Would there be a way to preserve that advantage?
I think what Ryan means here is that the regexes.yaml should just remain on the classpath and it is then loaded from there.
Maybe the grok processor should be used as an example here instead. There the default grok patterns are loaded from the classpath and custom patterns can be defined inside the processor itself. I think managing custom regexes would be easier that way?
The only reason ingest-geoip does this bundling is because of the size of the database files.
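As a rough illustration of "loaded as a resource", the sketch below reads a file from the classpath via `Class.getResourceAsStream` and fails with a clear message when it is missing. The class name `ClasspathResourceSketch` and the helper `readResource` are hypothetical; only the resource name `/regexes.yaml` comes from this PR.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ClasspathResourceSketch {
    // Reads a classpath resource fully into a string; the anchor class decides which class loader is used.
    static String readResource(Class<?> anchor, String name) {
        try (InputStream in = anchor.getResourceAsStream(name)) {
            if (in == null) {
                // getResourceAsStream returns null rather than throwing when nothing is found.
                throw new IllegalStateException("resource [" + name + "] not found on the classpath");
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try {
            // Assumes a regexes.yaml at the classpath root, as in this plugin's jar.
            String yaml = readResource(ClasspathResourceSketch.class, "/regexes.yaml");
            System.out.println("loaded " + yaml.length() + " characters");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the default file travels inside the jar, nothing needs to be copied into a config directory for the default case.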
Yes, Martijn explained my thoughts exactly.
@cwurm Nice work! I wonder if this processor can be placed in the
@martijnvg I like the idea. However, ingest-common doesn't have its own directory in The user could put it into the top-level config directory, together with elasticsearch.yml and other standard files. Is there another feature that does that already? Would doing this be alright? I guess we could have the user create a new subdirectory, yet it feels error-prone.
yes that is the place to put things like this. you can specify a directory via some setting and that directory can be wherever. You are not bound to anything as long as the security manager allows to read from it. stuff like this must be a node setting but must not be part of the module/plugin
I think we should pass custom regexes via the processor configuration in the pipeline definition? This is also now how custom grok patterns are specified. So right now this doesn't allow sharing of custom regexes between pipelines, but we could add infrastructure for that in ingest, in a general manner, so that grok patterns can be shared too between pipelines. If we have this then managing custom regexes is much easier as it is available on all nodes via the cluster state.
For grok this makes sense, but I don't think that is practical here. We would provide a standard set of 800+ regexes. I'm pretty sure there's user agents it doesn't cover and going forward, this will only get worse as the release ages. At the moment, in I don't think pasting all of that into Sense / Console makes sense. There might be cases where somebody wants to add a specific user agent (e.g. some internal crawler or something), but even then it's impractical to add it separately, as order matters (first matching regex wins).
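To make the "first matching regex wins" point concrete, here is a small sketch. The two patterns and the names `InternalCrawler` and `GenericBrowser` are made up for illustration and are not taken from regexes.yaml; the only point is that the same user agent is classified differently depending on where a custom pattern sits in the list.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class FirstMatchWinsSketch {
    // Returns the name of the first pattern (in insertion order) that matches, or "Other" if none does.
    static String classify(Map<String, Pattern> orderedPatterns, String userAgent) {
        for (Map.Entry<String, Pattern> entry : orderedPatterns.entrySet()) {
            if (entry.getValue().matcher(userAgent).find()) {
                return entry.getKey();
            }
        }
        return "Other";
    }

    // Builds the same two hypothetical patterns, with the custom one either first or last.
    static Map<String, Pattern> patterns(boolean customFirst) {
        Map<String, Pattern> ordered = new LinkedHashMap<>();
        if (customFirst) {
            ordered.put("InternalCrawler", Pattern.compile("internal-crawler/\\d+"));
        }
        ordered.put("GenericBrowser", Pattern.compile("Mozilla/\\d+"));
        if (customFirst == false) {
            ordered.put("InternalCrawler", Pattern.compile("internal-crawler/\\d+"));
        }
        return ordered;
    }

    public static void main(String[] args) {
        String userAgent = "Mozilla/5.0 (compatible; internal-crawler/2.1)";
        System.out.println(classify(patterns(true), userAgent));  // prints "InternalCrawler"
        System.out.println(classify(patterns(false), userAgent)); // prints "GenericBrowser"
    }
}
```

Appending a custom regex after the standard set would therefore often have no effect, because a broader standard pattern matches first.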
Ok, I didn't realise this. Then for now let's allow adding custom files via
import static org.elasticsearch.ingest.ConfigurationUtils.readStringProperty;
import static org.elasticsearch.ingest.ConfigurationUtils.readIntProperty;

public class UseragentProcessor extends AbstractProcessor {
Can the processor and its factory be made final? This is consistent with other processors.
oops factory is already a final class.
Is anything still outstanding after my last commits?
}

private static final class CompositeCacheKey {
    private final String str1;
Rename to parserName and userAgent?
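A sketch of what such a key could look like with the suggested field names. The class name `CompositeCacheKeySketch` is hypothetical, and the PR's actual class may differ; the sketch only shows that both fields take part in `equals` and `hashCode`, so the same user agent string is cached separately per parser.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical version of the cache key with descriptive field names.
final class CompositeCacheKeySketch {
    private final String parserName;
    private final String userAgent;

    CompositeCacheKeySketch(String parserName, String userAgent) {
        this.parserName = Objects.requireNonNull(parserName);
        this.userAgent = Objects.requireNonNull(userAgent);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof CompositeCacheKeySketch)) {
            return false;
        }
        CompositeCacheKeySketch other = (CompositeCacheKeySketch) obj;
        return parserName.equals(other.parserName) && userAgent.equals(other.userAgent);
    }

    @Override
    public int hashCode() {
        return Objects.hash(parserName, userAgent);
    }

    public static void main(String[] args) {
        Map<CompositeCacheKeySketch, String> cache = new HashMap<>();
        cache.put(new CompositeCacheKeySketch("default", "Mozilla/5.0"), "parsed by default");
        // Same parser and user agent: found. Different parser: not found.
        System.out.println(cache.get(new CompositeCacheKeySketch("default", "Mozilla/5.0")));
        System.out.println(cache.get(new CompositeCacheKeySketch("custom.yaml", "Mozilla/5.0")));
    }
}
```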
IngestUserAgentPlugin.class.getResourceAsStream("/regexes.yaml"), cache);
userAgentParsers.put(DEFAULT_PARSER_NAME, defaultParser);

if (Files.exists(userAgentConfigDirectory) == false && Files.isDirectory(userAgentConfigDirectory)) {
This directory is optional now, so the logic should be changed?
if (Files.exists(userAgentConfigDirectory)) {
    if (Files.isDirectory(userAgentConfigDirectory) == false) {
        throw new IllegalStateException(
            "the user agent config path [" + userAgentConfigDirectory + "] isn't a directory");
    }
    PathMatcher pathMatcher = userAgentConfigDirectory.getFileSystem().getPathMatcher("glob:**.yaml");
    try (Stream<Path> regexFiles = Files.find(userAgentConfigDirectory, 1,
            (path, attr) -> attr.isRegularFile() && pathMatcher.matches(path))) {
        Iterable<Path> iterable = regexFiles::iterator;
        for (Path path : iterable) {
            String parserName = path.getFileName().toString();
            try (InputStream regexStream = Files.newInputStream(path, StandardOpenOption.READ)) {
                userAgentParsers.put(parserName, new UserAgentParser(parserName, regexStream, cache));
            }
        }
    }
}
Is it optional? I like the idea of allowing users to place their custom regex files wherever they want, but that would make it hard to parse them at node startup.
Right now we're trying to parse all *.yaml files in config/ingest-useragent, the same way ingest-geoip is doing it. Can we do it another way, while not allowing changes to files after startup to have any effect in the running instance?
Oh wait, you mean the directory doesn't have to exist if there's no custom regex file? Good point, would the user then have to create that directory? I kind of assumed it's being auto-created somewhere.
Yes, that is what I meant.
The user would need to create the directory, so it is important that this is documented too.
@cwurm The PR is looking good. I left some more comments. Also when running
We also need an integration test that tests using a custom regex file.
I would not put that file under src/test/resources because then it would be on the classpath, and the point would be to ensure finding it in the filesystem works (yes the proposed gradle config will add it to the config dir, but having it in both places leaves us unsure of whether it works bc of the classpath or filesystem).
@rjernst Makes sense, then that file should be added to the plugin root directory.
@martijnvg
@cwurm No, it should be a separate file that doesn't end up on the classpath. This file should be simple and just contain a single line (dummy regex), just to prove that if a custom file is specified, things work as expected.
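A rough sketch of what such a check could exercise: write a dummy regex file into a directory on the filesystem (not the classpath) and confirm that a scan for *.yaml files finds it, while a missing directory is treated as "no custom files". The class `CustomRegexFileSketch`, its helper methods and the file names are hypothetical, and the dummy file content is only a placeholder; the scan mirrors the `Files.find` logic suggested earlier in this thread, not the PR's actual test.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

final class CustomRegexFileSketch {
    // Lists the names of *.yaml files directly inside dir; a missing directory yields an empty list.
    static List<String> findRegexFiles(Path dir) {
        if (Files.exists(dir) == false) {
            return Collections.emptyList();
        }
        if (Files.isDirectory(dir) == false) {
            throw new IllegalStateException("the user agent config path [" + dir + "] isn't a directory");
        }
        PathMatcher matcher = dir.getFileSystem().getPathMatcher("glob:**.yaml");
        try (Stream<Path> files = Files.find(dir, 1,
                (path, attr) -> attr.isRegularFile() && matcher.matches(path))) {
            List<String> names = new ArrayList<>();
            files.forEach(path -> names.add(path.getFileName().toString()));
            Collections.sort(names);
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a temporary directory holding one dummy regex file and one unrelated file.
    static Path createDirWithDummyFile(String fileName) {
        try {
            Path dir = Files.createTempDirectory("ingest-useragent-sketch");
            String dummy = "user_agent_parsers:\n  - regex: 'dummy-agent'\n"; // placeholder content
            Files.write(dir.resolve(fileName), dummy.getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("notes.txt"), "not a regex file".getBytes(StandardCharsets.UTF_8));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = createDirWithDummyFile("test-regexes.yaml");
        System.out.println(findRegexFiles(dir));                    // prints "[test-regexes.yaml]"
        System.out.println(findRegexFiles(dir.resolve("missing"))); // prints "[]"
    }
}
```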
@martijnvg Last few commits should have implemented remaining suggestions. The optional directory one is outstanding, I didn't quite understand that
@cwurm The
@martijnvg Right. Implemented in d4cf76d
@cwurm This looks great! The only thing I'm wondering now is whether we should move this processor in the By default we don't rely on any additional config that needs to be packaged, the default regex file is part of the jar file itself. Also this plugin doesn't depend on additional dependencies or require additional security permissions (either or both is usually a reason to package a feature as a plugin) and by moving this processor to the
@martijnvg I'm a little hesitant, because all the other plugins in
@cwurm We can always move it to
@martijnvg Thanks. I've made two more commits to what has proved to be the most "fragile" piece of the code during development and testing. Still looking good? Can I merge?
@cwurm 👍 yes, still looking great.
@cwurm Thanks for the great work in this PR.
This PR adds an ingest-useragent plugin, meant to replicate the functionality of logstash-filter-useragent.
This is the last ingest plugin missing to be able to replicate a standard pipeline for web logs with just the Ingest Node (e.g. as in the example for Apache logs).
The code is based on logstash-filter-useragent, the Ruby library it uses (uap-ruby) and the Java equivalent, uap-java. However, since the code in uap-java is kind of overengineered for this purpose (esp. way too many classes), I wrote the whole thing from scratch, so there shouldn't be any significant code share.
regexes.yaml, the file containing regular expressions for user agent, operating system and device detection, is copied from uap-core; I added a license header to it. Doing it this way keeps this dependency in the code, but will require it to be updated manually from time to time. If a user requires a newer or even custom version of this file, they can copy it into the config/ directory of the plugin and specify the regex_file option (similar to database_file in ingest-geoip).
Some points that may be worth discussing:
- Should regexes.yaml be part of this repo? ingest-geoip depends on some external database files that are in a separate repo under the elastic org and uploaded to Maven Central. I shied away from this because A) in this case it's a text file, not binary, and B) it's more straightforward having it in this repo, thereby being able to easily PR new versions, see all changes made to it in the git history, etc.
- The dependency on commons-collections. It's not a unique new one, repository-hdfs already depends on it. However, here only one class is used, LRUMap. It's important, as it dramatically improves performance (in my tests, it reduces processing time by almost 80%, and the gain would likely be even higher if measured without grok running in front). If an LRU cache already exists in Elasticsearch or one of its existing dependencies, using that one might be better.
- In UseragentParser there's two classes with public members and no methods, used solely as containers to be passed to UseragentProcessor. I'm happy to put getters and setters in place; I thought they'd unnecessarily clutter the code without adding any real value.
- I ran gradle check as per CONTRIBUTING.MD, however it fails due to _ttl no longer being available in 5.0. Since this happens because of tests outside of my code and I can see that those jobs are currently disabled on build.elastic.co, I figured I'm fine.
- Running gradle assemble and gradle test on just this plugin as well as gradle assemble on elasticsearch/ works fine, and I tested the deployed plugin with real data.
Hope I didn't forget anything, let me know if I have or if I should change something.