
[ML] Add log structure finder functionality #32788

Merged
10 commits merged into elastic:master from log-structure-finder-library on Aug 15, 2018

Conversation

droberts195 (Contributor):

This change adds a library to ML that can be used to deduce a log
file's structure given only a sample of the log file.

Eventually this will be used to add an endpoint to ML to make the
functionality available to end users, but this will follow in a
separate change.

The functionality is split into a library so that it can also be
used by a command line tool without requiring the command line
tool to include all server code.

@droberts195 (Contributor Author):

I've marked this >non-issue as it's invisible to the end user in this state. It will get added to the release notes when the corresponding endpoint is added.

@benwtrent (Member) left a comment:

This was a lot to grok (pun intended). I have read through most of it once :D

}
}

if (wildcardRequired && others.stream().allMatch(String::isEmpty) == false) {
Member:

others.stream().noneMatch(String::isEmpty) is more readable.

Contributor Author:

I don't think others.stream().noneMatch(String::isEmpty) is logically equivalent. Consider the case where others contains one empty string and one non-empty string. allMatch(String::isEmpty) would be false, but so would noneMatch(String::isEmpty). So ditching the == false would yield the opposite result.

I think the logically equivalent condition is this:

others.stream().anyMatch(s -> s.isEmpty() == false)

Or alternatively if you don't like == false anywhere:

others.stream().filter(String::isEmpty).findFirst().isPresent()

But IntelliJ warns that this last one could be replaced with an anyMatch().

Since you found the allMatch() confusing I'll replace it with an anyMatch().
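For illustration, a minimal runnable sketch of the three predicates on exactly that case, one empty and one non-empty string (the list contents are made up for the example):

import java.util.Arrays;
import java.util.List;

public class StreamPredicateExample {
    public static void main(String[] args) {
        List<String> others = Arrays.asList("", "abc");
        // the original condition: not all strings are empty
        System.out.println(others.stream().allMatch(String::isEmpty) == false);   // true
        // not equivalent: false because the empty string is present
        System.out.println(others.stream().noneMatch(String::isEmpty));           // false
        // the equivalent rewrite: at least one string is non-empty
        System.out.println(others.stream().anyMatch(s -> s.isEmpty() == false));  // true
    }
}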

boolean wildcardRequired = true;
for (int i = 0; i < driver.length(); ++i) {
char ch = driver.charAt(i);
if (PUNCTUATION_OR_SPACE.indexOf(ch) >= 0 && others.stream().allMatch(other -> other.indexOf(ch) >= 0)) {
Member:

Whenever I see a loop like this, my initial gut reaction is "Why isn't this a HashSet?" Are we worried about premature optimization, or is there some Java thing I am missing?

Contributor Author:

I removed PUNCTUATION_OR_SPACE and NEEDS_ESCAPING and replaced them with a HashMap that enables both things to be checked with one lookup.
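A minimal sketch of that kind of single-lookup map (the name and the entries below are illustrative, not the full set used in the PR): membership in the map answers "is this punctuation or space?", and the value answers "does it need escaping in a regex?".

import java.util.HashMap;
import java.util.Map;

public class PunctuationLookupSketch {
    // Illustrative subset: character -> whether it must be escaped when building a regex.
    private static final Map<Character, Boolean> PUNCTUATION_OR_SPACE_NEEDS_ESCAPING = new HashMap<>();
    static {
        PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.put(' ', false);
        PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.put(',', false);
        PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.put(':', false);
        PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.put('.', true);
        PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.put('[', true);
    }

    static String escapeIfDelimiter(char ch) {
        Boolean needsEscaping = PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.get(ch);  // one lookup answers both questions
        if (needsEscaping == null) {
            return null;  // not punctuation or space
        }
        return needsEscaping ? "\\" + ch : String.valueOf(ch);
    }
}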


@Override
public boolean equals(Object other) {
if (other == null) {
Member:

Nit: add the equal-object predicate:

if (this == other) {
    return true;
}
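For reference, a sketch of the full pattern with that check first (the field comparison below is illustrative, not the actual fields):

@Override
public boolean equals(Object other) {
    if (this == other) {
        return true;
    }
    if (other == null || getClass() != other.getClass()) {
        return false;
    }
    LogFileStructure that = (LogFileStructure) other;
    // illustrative fields only
    return this.numLinesAnalyzed == that.numLinesAnalyzed
        && Objects.equals(this.charset, that.charset);   // java.util.Objects
}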

@dimitris-athanasiou (Contributor) left a comment:

A set of comments, written as I got a third of the way into the review.

* or more contributor license agreements. Licensed under the Elastic License;
* you may not use this file except in compliance with the Elastic License.
*/
package org.elasticsearch.xpack.ml.logstructure;
Contributor:

I wonder if the package and the library name should match, i.e. should both be logstructure or logstructurefinder. Just raising it to ensure we double-check this.

Contributor Author:

OK cool. I renamed the package to logstructurefinder.

import java.util.List;
import java.util.Map;

public abstract class AbstractLogFileStructureFinder {
Contributor:

I think we should consider dropping File from those class names. Whether a log is in a file or not is irrelevant, as the log could have been stored in Elasticsearch or some other storage.

private static final int KEYWORD_MAX_LEN = 256;
private static final int KEYWORD_MAX_SPACES = 5;

protected static Map<String, String> guessScalarMapping(List<String> explanation, String fieldName, Collection<String> fieldValues) {
Contributor:

I recommend storing explanation in a member field. Then we could also add a getExplanations or explain method to the interface of the structure finder, and a protected addExplanation(String) method for sub-classes to use. The benefit is that we wouldn't have to pass the explanation list around as an argument. Also, in theory, it would make it easy to, say, allow enabling/disabling explanations, etc.

Contributor Author:

It would be nicer in some ways, but it would cause quite a lot of side effects. At the moment explanation is built up by 3 pieces of code:

  1. The character set finder
  2. The high level structure determination (i.e. the *LogFileStructure::canCreateFromSample() methods)
  3. The detailed structure determination

The unit tests test static methods that update explanations without having to create concrete objects. And the unit tests also print the explanation or partial explanation for whatever they've tested, which means that for unit tests we want the explanation to be owned by the test code.

I think one solution would be to introduce an Explanation class that encapsulates the list of strings. Then that could be a member of AbstractLogFileStructureFinder, which would avoid passing it around the member functions of that hierarchy so much. But it would be documented that other classes might be adding to the same explanation.
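A minimal sketch of that Explanation wrapper idea (this class does not exist in the PR; the names are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Encapsulates the list of explanation strings so it can live as a member of the
// finder hierarchy while still being appended to by the character set finder and
// the high level structure determination code.
public class Explanation {

    private final List<String> steps = new ArrayList<>();

    public void add(String step) {
        steps.add(step);
    }

    public List<String> asList() {
        return Collections.unmodifiableList(steps);
    }

    @Override
    public String toString() {
        return String.join("\n", steps);
    }
}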

Contributor Author:

It's also tricky because the GrokPatternCreator needs to find mappings, relying on the fact that guessScalarMapping() is static.

private static final int KEYWORD_MAX_LEN = 256;
private static final int KEYWORD_MAX_SPACES = 5;

protected static Map<String, String> guessScalarMapping(List<String> explanation, String fieldName, Collection<String> fieldValues) {
Contributor:

Also, it would be nice to add a javadoc for this method.

Member:

+1 for Javadoc

}
}

else if (fieldValues.stream().allMatch(IP_GROK::match)) {
Contributor:

This else if is getting away from its if! Should it just be if?

Contributor Author:

It should be else if, but I'll remove the accidental extra blank line.


SortedMap<String, Object> mappings = new TreeMap<>();

if (sampleRecords != null) {
Contributor:

Also an unnecessary null check?

for (Map<String, ?> sampleRecord : sampleRecords) {
for (String fieldName : sampleRecord.keySet()) {
mappings.computeIfAbsent(fieldName, key -> guessMapping(explanation, fieldName,
sampleRecords.stream().flatMap(record -> {
Contributor:

Perhaps it would be more readable to store this in a local variable.

Contributor Author:

I think that would negate the benefit of using computeIfAbsent(). This is in an inner loop and if 1000 records all contain the same fields we don't want to do the complex processing 1000 times more than necessary.

If it's really hard to read as it stands I guess an alternative approach would be to first build a HashSet of all the unique field names across all sample records, then iterate through that to drive creation of the mappings for each field.
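A sketch of that alternative, assuming the surrounding method's variables (sampleRecords, mappings, explanation) and the guessMapping call shown above (both assumptions; the PR itself keeps the computeIfAbsent() form):

// Collect the distinct field names once, then guess a mapping exactly once per field.
Set<String> fieldNames = new LinkedHashSet<>();
for (Map<String, ?> sampleRecord : sampleRecords) {
    fieldNames.addAll(sampleRecord.keySet());
}
for (String fieldName : fieldNames) {
    List<Object> fieldValues = sampleRecords.stream()
        .map(record -> record.get(fieldName))
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
    mappings.put(fieldName, guessMapping(explanation, fieldName, fieldValues));
}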

return Collections.singletonMap(MAPPING_TYPE_SETTING, "object");
}
throw new IllegalArgumentException("Field [" + fieldName +
"] has both object and non-object values - this won't work with Elasticsearch");
Contributor:

Nit: rephrase to "this is not supported by Elasticsearch"?

}

/**
* This must only be called if {@link #matchesAll} returns <code>true</code>.
Contributor:

As this is only called in one place, and right after matchesAll was called, I think we could move the matchesAll check into this method and so remove the risk of it being called without fulfilling the matchesAll contract.

* matched section and the pieces before and after it. Recurse to find more matches in the pieces
* before and after and update the supplied string builder.
*/
private static void processCandidateAndSplit(List<String> explanation, Map<String, Integer> fieldNameCountStore,
Contributor:

The long arg lists in these methods make me wonder if we could refactor as an object whose members capture those parameters. Hard to come up with a more concrete suggestion without trying to do it, but this set of long arg lists and static methods makes me wonder if we could simplify by going more OO here. This class seems like it is doing a lot and could be split up into several smaller classes with reduced responsibilities.

Contributor Author:

Good idea - it's done in 37a92f7

This class got progressively more complex over time. When I started out it made sense to just have a couple of static methods but now you're right it needs to be an object with members.
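For illustration, the general direction of that refactor (the member and method names below are a simplified guess at the shape, not the actual code in 37a92f7):

import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The values previously threaded through every static call become members, so the
// per-candidate methods only need the arguments that genuinely vary per call.
public class GrokPatternCreator {

    private final List<String> explanation;
    private final Collection<String> sampleMessages;
    private final Map<String, Integer> fieldNameCountStore = new HashMap<>();
    private final StringBuilder overallGrokPatternBuilder = new StringBuilder();

    GrokPatternCreator(List<String> explanation, Collection<String> sampleMessages) {
        this.explanation = explanation;
        this.sampleMessages = sampleMessages;
    }

    // Formerly static with a long parameter list; as an instance method it can read
    // the shared state from the members above.
    private void processCandidateAndSplit(Collection<String> snippets) {
        // ... find the candidate match, recurse on the pieces before and after it,
        // and append to overallGrokPatternBuilder ...
    }
}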


/**
* The timestamp patterns are complex and it can be slow to prove they do not
* match anywhere in a long message. Many of the timestamps have similar and
Member:

'Many of the timestamps have similar and' -> 'are similar'

private static final List<Pattern> QUICK_RULE_OUT_PATTERNS = Arrays.asList(
Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2} "),
Pattern.compile("\\d \\d{2}:\\d{2}\\b"),
Pattern.compile(" \\d{2}:\\d{2}:\\d{2} ")
Member:

The first is YYYY-MM-DD and the last is a time of day, but I don't know what the second is checking. Can you add a short comment for each pattern?
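A sketch of the sort of per-pattern comments being asked for, based on my reading of the regexes (the comments actually added to the PR may be worded differently):

private static final List<Pattern> QUICK_RULE_OUT_PATTERNS = Arrays.asList(
    // e.g. "2018-08-13 " - an ISO8601-style YYYY-MM-DD date followed by a space
    Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2} "),
    // e.g. the "3 05:04" in "Aug 13 05:04:22" - the last digit of a day-of-month, a space, then HH:MM
    Pattern.compile("\\d \\d{2}:\\d{2}\\b"),
    // e.g. " 05:04:22 " - an HH:MM:SS time of day surrounded by spaces
    Pattern.compile(" \\d{2}:\\d{2}:\\d{2} ")
);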

if (rows.size() < 3) {
explanation.add("Too little data to accurately assess whether header is in sample - guessing it is");
} else {
isCsvHeaderInFile = isFirstRowUnusual(explanation, rows);
@benwtrent (Member) commented on Aug 13, 2018:

So, we want to automatically determine if a CSV header is given in the sample or not. Here are a couple of quick checks that we may want to try before doing expensive statistical checking and Levenshtein distances. Of course, this depends on how flexible we want to be.

  • Are all entries unique in the header row? You should not have duplicate header entries.
  • Do we want to support null header fields? If not, it stands to reason that none should be null in the sample.
  • Do we want to support just whitespace or empty string header fields?
  • Are all entries actual string values? I am not sure it is valid to have a header that is just a number or a blank string.

Contributor Author:

There's really no distinction between null and empty string, as CSV cannot distinguish them. Whether the values are null or empty string in the code depends on the CSV parser implementation and any subsequent post processing.

The only reason why I'm hesitant to outlaw empty fields in the header is that people have sent me examples of files that were CSV but where the header row was shorter than the data rows.

But the duplicates check is a good idea. I'll add an extra initial test that if the first row contains duplicate non-empty field values then it's not allowed to be a header row.

Contributor Author:

The duplicates test is added in cf2d675
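A sketch of the duplicates check described above (illustrative only; the actual change is in cf2d675):

import java.util.HashSet;
import java.util.Set;

// A row cannot be a header if it contains the same non-empty value more than once.
static boolean containsDuplicateNonEmptyValues(String[] firstRow) {
    Set<String> seen = new HashSet<>();
    for (String value : firstRow) {
        if (value != null && value.isEmpty() == false && seen.add(value) == false) {
            return true;
        }
    }
    return false;
}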

@dimitris-athanasiou (Contributor) left a comment:

Some more comments after looking at the JSON and XML finders.

private final String grokPattern;
private final List<String> timestampFormats;
private final String timestampField;
private final boolean needClientTimezone;
Contributor:

What is the reason this is not optional (not a Boolean)?

Contributor Author:

Because it makes sense to include this in the output in all cases, whereas for some of the other boolean flags they only make sense when a particular high level structure has been found. For example, has_header_row does not make sense at all for JSON.

public LogFileStructure build() {

if (numLinesAnalyzed <= 0) {
throw new IllegalArgumentException("Number of lines analyzed must be positive.");
Contributor:

Nit: In other classes we have such validations in the corresponding builder setter. Arguably there are different benefits to doing it either way. Just raising it to see what we think about the two approaches.

Contributor Author:

I put it here because the most likely error is failing to call the builder setter at all. This error would occur during development/adding a new log file structure. It won't be encountered by end users because they should never be building one of these objects.

public LogFileStructure build() {

if (numLinesAnalyzed <= 0) {
throw new IllegalArgumentException("Number of lines analyzed must be positive.");
Contributor:

Another inconsistency here is whether the error messages contain a natural description of the field vs. the actual field name.

Contributor Author:

I think it's academic in this case, because it's our code supplying the values rather than the end user. So this check won't trigger as a result of a user calling a REST endpoint unless there's a bug in our code.


JsonLogFileStructureFinder(List<String> explanation, String sample, String charsetName, Boolean hasByteOrderMarker) throws IOException {

List<Map<String, ?>> sampleRecords = new ArrayList<>();
Contributor:

It seems that the Finder classes' constructors are doing the bulk of the work here. I wonder if it would be best to avoid that by introducing a method that clearly communicates that processing is going to occur.

Contributor Author:

I thought about this, but then the member variables cannot be final, and if you have one of these objects there's a potential doubt about whether the processing method has been called or not.

Contributor:

Good point. The other idea would be to go with static factory methods. But I don't feel strongly about it.

Contributor Author:

Good idea - I moved the major chunks of processing to static factory methods in baf863f

Refactoring of GrokPatternCreator and explanation still outstanding
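For illustration, the general shape of the static-factory version (simplified; the real factory in baf863f parses and validates much more, and the helper names below are hypothetical):

// The heavy parsing happens in the factory, so the constructor only assigns
// already-computed values and the members can stay final.
static JsonLogFileStructureFinder makeJsonLogFileStructureFinder(List<String> explanation, String sample, String charsetName,
                                                                 Boolean hasByteOrderMarker) throws IOException {
    List<Map<String, ?>> sampleRecords = parseJsonSample(sample);                      // hypothetical helper
    LogFileStructure structure = buildStructure(explanation, sampleRecords,
        charsetName, hasByteOrderMarker);                                              // hypothetical helper
    return new JsonLogFileStructureFinder(sampleRecords, structure);
}

private JsonLogFileStructureFinder(List<Map<String, ?>> sampleRecords, LogFileStructure structure) {
    this.sampleRecords = sampleRecords;
    this.structure = structure;
}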
private static final int KEYWORD_MAX_LEN = 256;
private static final int KEYWORD_MAX_SPACES = 5;

protected static Map<String, String> guessScalarMapping(List<String> explanation, String fieldName, Collection<String> fieldValues) {
Member:

+1 for Javadoc

}

TimestampMatch singleMatch = null;
for (String fieldValue : fieldValues) {
Member:

Is this checking for a date mapping? It took me a while to figure that out; could you add a comment?

This for loop has an if ... else and two break statements; there must be a way to simplify it. Why not call TimestampFormatFinder.findFirstFullMatch on the first element of the list and, if that is non-null, perform the singleMatch.equals(TimestampFormatFinder... check on every other element?
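A sketch of that suggestion, assuming fieldValues is non-empty and that TimestampFormatFinder.findFirstFullMatch returns null when nothing matches (both assumptions, not verified against the PR):

Iterator<String> iter = fieldValues.iterator();
TimestampMatch singleMatch = TimestampFormatFinder.findFirstFullMatch(iter.next());
if (singleMatch != null) {
    while (iter.hasNext()) {
        if (singleMatch.equals(TimestampFormatFinder.findFirstFullMatch(iter.next())) == false) {
            singleMatch = null;   // the values do not all share one timestamp format
            break;
        }
    }
}
// a non-null singleMatch here means every sample value matched the same format, i.e. a date mapping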

if (shouldTrimFields != null) {
throw new IllegalArgumentException("Should trim fields may not be specified for [" + format + "] structures.");
}
// $FALL-THROUGH$
Member:

Curious, does this suppress warnings?

Contributor Author:

It does in Eclipse - see
https://stackoverflow.com/a/13145718

However, I had to add the @SuppressWarnings on the whole method anyway, as the Gradle build runs with -Werror, so this comment isn't actually necessary to suppress the warning. I can change it to a more human-readable // Fallthrough if you prefer.
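For reference, a small sketch of the two options being discussed (the switch content and method name are hypothetical; javac's lint name for this warning is "fallthrough"):

@SuppressWarnings("fallthrough")   // keeps the Gradle build happy under -Werror
private static void validateFormatSpecificFields(String format, Boolean shouldTrimFields) {
    switch (format) {
        case "xml":
            if (shouldTrimFields != null) {
                throw new IllegalArgumentException("Should trim fields may not be specified for [" + format + "] structures.");
            }
            // $FALL-THROUGH$  (Eclipse marker; a plain "// Fallthrough" comment reads better to some)
        case "json":
            // checks shared by the XML and JSON formats go here
            break;
        default:
            break;
    }
}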

for (int i = 0; i < spaceBytes.length && spaceEncodingContainsZeroByte == false; ++i) {
spaceEncodingContainsZeroByte = (spaceBytes[i] == 0);
}
if (containsZeroBytes && spaceEncodingContainsZeroByte == false) {
Member:

I'm learning a lot about character encoding


Tuple<Boolean, String[]> headerInfo = findHeaderFromSample(explanation, rows);
boolean isCsvHeaderInFile = headerInfo.v1();
String[] csvHeader = headerInfo.v2();
Member:

Nit: delineatedHeader or separatedValuesHeader?

Contributor Author:

I changed it to just header. csv still appears in a few places because SuperCSV has the same problem of using "CSV" in its name while being a generic delimited values parser.

@droberts195 (Contributor Author):

I think I've now addressed all the review comments so far.

The only thing I'm left wondering is whether, after baf863f, it would make sense to get rid of the abstract base classes and move the methods (which are all now static) into a new LogStructureUtils class, as it's a bit weird to have abstract base classes with only static methods. I could also move a static method from the GrokPatternCreator into the new utility class to break a circular dependency with the TextLogStructureFinder.

I think I'll make this change tomorrow unless anyone has a better idea and then hopefully the PR will be in a fit state to merge.

@droberts195 (Contributor Author):

@elasticmachine test this please

@dimitris-athanasiou (Contributor) left a comment:

LGTM

@droberts195 merged commit 5ba04e2 into elastic:master on Aug 15, 2018
@droberts195 deleted the log-structure-finder-library branch on August 15, 2018 at 17:04
droberts195 added a commit that referenced this pull request Aug 15, 2018
This change adds a library to ML that can be used to deduce a log
file's structure given only a sample of the log file.

Eventually this will be used to add an endpoint to ML to make the
functionality available to end users, but this will follow in a
separate change.

The functionality is split into a library so that it can also be
used by a command line tool without requiring the command line
tool to include all server code.
jasontedor added a commit that referenced this pull request Aug 15, 2018
* master:
  Fix global checkpoint listeners test
  HLRC: adding machine learning open job (#32860)
  [ML] Add log structure finder functionality (#32788)
  INGEST: Add Configuration Except. Data to Metdata (#32322)
jasontedor added a commit that referenced this pull request Aug 15, 2018
* 6.x:
  Fix global checkpoint listeners test
  HLRC: adding machine learning open job (#32860)
  [ML] Add log structure finder functionality (#32788)
  Use generic AcknowledgedResponse instead of extended classes (#32859)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Aug 24, 2018
…-checkpoint-poll

* 'master' of github.com:elastic/elasticsearch:
  Fix docs for fixed filename for heap dump path (elastic#32882)
  Painless: Special Case def (elastic#32871)
  AwaitFix FullClusterRestartIT#testRollupIDSchemeAfterRestart.
  [Test] Fix DuelScrollIT#testDuelIndexOrderQueryThenFetch
  HLRC: adding machine learning delete job (elastic#32820)
  [DOCS] Update WordPress plugins links (elastic#32194)
  Remove passphrase support from reload settings API (elastic#32889)
  AwaitFix AckIT.
  Mutes test in DuelScrollIT
  CharArraysTests: Fix test bug.
  [ML] Choose seconds to fix intermittent DatafeeedConfigTest failure
  Test: Fix unpredictive merges in DocumentSubsetReaderTests
  [DOCS] Clarify sentence in network-host.asciidoc (elastic#32429)
  Docs enhancement: added reference to cluster-level setting `search.default_allow_partial_results` (elastic#32810)
  [DOCS] Fixing cross doc link to Stack Overview security topic.
  Move CharArrays to core lib (elastic#32851)
  Fix global checkpoint listeners test
  HLRC: adding machine learning open job (elastic#32860)
  [ML] Add log structure finder functionality (elastic#32788)
  INGEST: Add Configuration Except. Data to Metdata (elastic#32322)
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Aug 28, 2018
@jimczi added the v7.0.0-beta1 label and removed the v7.0.0 label on Feb 7, 2019