[ML] ECS Grok patterns in the _text_structure/find_structure endpoint (#88982)

Also add support for new CATALINA/TOMCAT timestamp formats used by ECS Grok patterns

Relates #77065

Co-authored-by: David Roberts <dave.roberts@elastic.co>
elastic · Aug 4, 2022 · 188f887
1 parent c08111b
commit 188f887

Showing 19 changed files with 1,762 additions and 743 deletions.
diff --git a/docs/reference/text-structure/apis/find-structure.asciidoc b/docs/reference/text-structure/apis/find-structure.asciidoc
@@ -99,6 +99,16 @@ specified, the name of the timestamp field in the Grok pattern must match
 "timestamp". If `grok_pattern` is not specified, the structure finder creates a
 Grok pattern.
 
+`ecs_compatibility`::
+(Optional, string) The mode of compatibility with ECS-compliant Grok patterns.
+Use this parameter to specify whether to use ECS Grok patterns instead of
+legacy ones when the structure finder creates a Grok pattern. Valid values
+are `disabled` and `v1`. The default value is `disabled`. This setting primarily
+has an impact when a whole-message Grok pattern such as `%{CATALINALOG}`
+matches the input. If the structure finder identifies a common structure but
+cannot infer its meaning, generic field names such as `path`, `ipaddress`,
+`field1` and `field2` are used in the `grok_pattern` output, with the intention
+that a user who knows the meanings renames these fields before using the pattern.
 `has_header_row`::
 (Optional, Boolean) If you have set `format` to `delimited`, you can use this
 parameter to indicate whether the column names are in the first row of the text.
@@ -286,15 +296,16 @@ If the request does not encounter errors, you receive the following result:
   "charset" : "UTF-8", <4>
   "has_byte_order_marker" : false, <5>
   "format" : "ndjson", <6>
-  "timestamp_field" : "release_date", <7>
-  "joda_timestamp_formats" : [ <8>
+  "ecs_compatibility" : "disabled", <7>
+  "timestamp_field" : "release_date", <8>
+  "joda_timestamp_formats" : [ <9>
     "ISO8601"
   ],
-  "java_timestamp_formats" : [ <9>
+  "java_timestamp_formats" : [ <10>
     "ISO8601"
   ],
-  "need_client_timezone" : true, <10>
-  "mappings" : { <11>
+  "need_client_timezone" : true, <11>
+  "mappings" : { <12>
     "properties" : {
       "@timestamp" : {
         "type" : "date"
@@ -328,7 +339,7 @@ If the request does not encounter errors, you receive the following result:
       }
     ]
   },
-  "field_stats" : { <12>
+  "field_stats" : { <13>
     "author" : {
       "count" : 24,
       "cardinality" : 20,
@@ -536,19 +547,20 @@ may help diagnose parse errors or accidental uploads of the wrong text.
 <5> For UTF character encodings, `has_byte_order_marker` indicates whether the
 text begins with a byte order marker.
 <6> `format` is one of `ndjson`, `xml`, `delimited` or `semi_structured_text`.
-<7> The `timestamp_field` names the field considered most likely to be the
+<7> `ecs_compatibility` is either `disabled` or `v1`; it defaults to `disabled`.
+<8> The `timestamp_field` names the field considered most likely to be the
 primary timestamp of each document.
-<8> `joda_timestamp_formats` are used to tell {ls} how to parse timestamps.
-<9> `java_timestamp_formats` are the Java time formats recognized in the time
+<9> `joda_timestamp_formats` are used to tell {ls} how to parse timestamps.
+<10> `java_timestamp_formats` are the Java time formats recognized in the time
 fields. {es} mappings and ingest pipelines use this format.
-<10> If a timestamp format is detected that does not include a timezone,
+<11> If a timestamp format is detected that does not include a timezone,
 `need_client_timezone` will be `true`. The server that parses the text must
 therefore be told the correct timezone by the client.
-<11> `mappings` contains some suitable mappings for an index into which the data
+<12> `mappings` contains some suitable mappings for an index into which the data
 could be ingested. In this case, the `release_date` field has been given a
 `keyword` type as it is not considered specific enough to convert to the `date`
 type.
-<12> `field_stats` contains the most common values of each field, plus basic
+<13> `field_stats` contains the most common values of each field, plus basic
 numeric statistics for the numeric `page_count` field. This information may
 provide clues that the data needs to be cleaned or transformed prior to use by
 other {stack} functionality.
@@ -1534,7 +1546,8 @@ This is an example of analyzing an {es} log file:
 
 [source,js]
 ----
-curl -s -H "Content-Type: application/json" -XPOST "localhost:9200/_text_structure/find_structure?pretty" -T "$ES_HOME/logs/elasticsearch.log"
+curl -s -H "Content-Type: application/json" -XPOST \
+"localhost:9200/_text_structure/find_structure?pretty&ecs_compatibility=disabled" -T "$ES_HOME/logs/elasticsearch.log"
 ----
 // NOTCONSOLE
 // Not converting to console because this shows how curl can be used
@@ -1553,6 +1566,7 @@ this:
   "format" : "semi_structured_text", <1>
   "multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}", <2>
   "grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel}.*", <3>
+  "ecs_compatibility" : "disabled", <4>
   "timestamp_field" : "timestamp",
   "joda_timestamp_formats" : [
     "ISO8601"
@@ -1679,6 +1693,8 @@ in the first line of each multi-line log message.
 <3> A very simple `grok_pattern` has been created, which extracts the timestamp
 and recognizable fields that appear in every analyzed message. In this case the
 only field that was recognized beyond the timestamp was the log level.
+<4> The ECS Grok pattern compatibility mode used, which is either `disabled`
+(the default if not specified in the request) or `v1`.
 
 [discrete]
 [[find-structure-example-grok]]
@@ -1715,6 +1731,7 @@ this:
   "format" : "semi_structured_text",
   "multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}",
   "grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel} *\\]\\[%{JAVACLASS:class} *\\] \\[%{HOSTNAME:node}\\] %{JAVALOGMESSAGE:message}", <1>
+  "ecs_compatibility" : "disabled", <2>
   "timestamp_field" : "timestamp",
   "joda_timestamp_formats" : [
     "ISO8601"
@@ -1769,7 +1786,7 @@ this:
       }
     ]
   },
-  "field_stats" : { <2>
+  "field_stats" : { <3>
     "class" : {
       "count" : 53,
       "cardinality" : 14,
@@ -1945,7 +1962,9 @@ this:
 
 <1> The `grok_pattern` in the output is now the overridden one supplied in the
 query parameter.
-<2> The returned `field_stats` include entries for the fields from the
+<2> The ECS Grok pattern compatibility mode used, which is either `disabled`
+(the default if not specified in the request) or `v1`.
+<3> The returned `field_stats` include entries for the fields from the
 overridden `grok_pattern`.
 
 The URL escaping is hard, so if you are working interactively it is best to use
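
Note on the `ecs_compatibility` query parameter documented in this file: the following is a minimal Java sketch of how a client might build the request, assuming a node at `localhost:9200` as in the curl example. The class name `FindStructureRequestSketch`, the `build` helper and the local `MODES` list are hypothetical and not part of this commit; only the endpoint path, the parameter name, its two values and the `Content-Type` header come from the docs in this diff. The request is built but never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;

public class FindStructureRequestSketch {
    // Assumption: mirrors the two values the docs list as valid.
    static final List<String> MODES = List.of("disabled", "v1");

    // Hypothetical helper: builds (but does not send) a find_structure request.
    static HttpRequest build(String hostAndPort, String ecsCompatibility, String sample) {
        if (!MODES.contains(ecsCompatibility)) {
            throw new IllegalArgumentException(
                "[ecs_compatibility] must be one of " + MODES + " if specified");
        }
        URI uri = URI.create("http://" + hostAndPort
            + "/_text_structure/find_structure?pretty&ecs_compatibility=" + ecsCompatibility);
        return HttpRequest.newBuilder(uri)
            .header("Content-Type", "application/json") // same header as the curl example
            .POST(HttpRequest.BodyPublishers.ofString(sample))
            .build();
    }
}
```

Passing `"v1"` asks the structure finder to prefer ECS Grok patterns; any other value is rejected on the client side here, which loosely mirrors the server-side validation added in `FindStructureAction.Request#validate`.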

diff --git a/rest-api-spec/src/main/resources/rest-api-spec/api/text_structure.find_structure.json b/rest-api-spec/src/main/resources/rest-api-spec/api/text_structure.find_structure.json
@@ -74,6 +74,10 @@
         "type":"string",
         "description":"Optional parameter to specify the Grok pattern that should be used to extract fields from messages in a semi-structured text file"
       },
+      "ecs_compatibility":{
+        "type":"string",
+        "description":"Optional parameter to specify the compatibility mode with ECS Grok patterns - may be either 'v1' or 'disabled'"
+      },
       "timestamp_field":{
         "type":"string",
         "description":"Optional parameter to specify the timestamp field in the file"

diff --git a/x-pack/plugin/core/build.gradle b/x-pack/plugin/core/build.gradle
@@ -28,6 +28,7 @@ tasks.named("dependencyLicenses").configure {
 
 dependencies {
   compileOnly project(":server")
+  api project(':libs:elasticsearch-grok')
   api project(":libs:elasticsearch-ssl-config")
   api "org.apache.httpcomponents:httpclient:${versions.httpclient}"
   api "org.apache.httpcomponents:httpcore:${versions.httpcore}"

diff --git a/x-pack/plugin/core/src/main/java/module-info.java b/x-pack/plugin/core/src/main/java/module-info.java
@@ -8,6 +8,7 @@
 module org.elasticsearch.xcore {
     requires org.elasticsearch.cli;
     requires org.elasticsearch.base;
+    requires org.elasticsearch.grok;
     requires org.elasticsearch.server;
     requires org.elasticsearch.sslconfig;
     requires org.elasticsearch.xcontent;

diff --git a/.../src/main/java/org/elasticsearch/xpack/core/textstructure/action/FindStructureAction.java b/.../src/main/java/org/elasticsearch/xpack/core/textstructure/action/FindStructureAction.java
@@ -6,6 +6,7 @@
  */
 package org.elasticsearch.xpack.core.textstructure.action;
 
+import org.elasticsearch.Version;
 import org.elasticsearch.action.ActionRequest;
 import org.elasticsearch.action.ActionRequestValidationException;
 import org.elasticsearch.action.ActionResponse;
@@ -16,6 +17,7 @@
 import org.elasticsearch.common.io.stream.Writeable;
 import org.elasticsearch.common.xcontent.StatusToXContentObject;
 import org.elasticsearch.core.TimeValue;
+import org.elasticsearch.grok.Grok;
 import org.elasticsearch.rest.RestStatus;
 import org.elasticsearch.xcontent.ParseField;
 import org.elasticsearch.xcontent.XContentBuilder;
@@ -30,6 +32,8 @@
 import static org.elasticsearch.action.ValidateActions.addValidationError;
 
 public class FindStructureAction extends ActionType<FindStructureAction.Response> {
+    public static final String ECS_COMPATIBILITY_DISABLED = Grok.ECS_COMPATIBILITY_MODES[0];
+    public static final String ECS_COMPATIBILITY_V1 = Grok.ECS_COMPATIBILITY_MODES[1];
 
     public static final FindStructureAction INSTANCE = new FindStructureAction();
     public static final String NAME = "cluster:monitor/text_structure/findstructure";
@@ -107,6 +111,8 @@ public static class Request extends ActionRequest {
         public static final ParseField TIMESTAMP_FORMAT = new ParseField("timestamp_format");
         public static final ParseField TIMESTAMP_FIELD = TextStructure.TIMESTAMP_FIELD;
 
+        public static final ParseField ECS_COMPATIBILITY = TextStructure.ECS_COMPATIBILITY;
+
         private static final String ARG_INCOMPATIBLE_WITH_FORMAT_TEMPLATE = "[%s] may only be specified if ["
             + FORMAT.getPreferredName()
             + "] is [%s]";
@@ -122,6 +128,7 @@ public static class Request extends ActionRequest {
         private Character quote;
         private Boolean shouldTrimFields;
         private String grokPattern;
+        private String ecsCompatibility;
         private String timestampFormat;
         private String timestampField;
         private BytesReference sample;
@@ -141,6 +148,11 @@ public Request(StreamInput in) throws IOException {
             quote = in.readBoolean() ? (char) in.readVInt() : null;
             shouldTrimFields = in.readOptionalBoolean();
             grokPattern = in.readOptionalString();
+            if (in.getVersion().onOrAfter(Version.V_8_5_0)) {
+                ecsCompatibility = in.readOptionalString();
+            } else {
+                ecsCompatibility = null;
+            }
             timestampFormat = in.readOptionalString();
             timestampField = in.readOptionalString();
             sample = in.readBytesReference();
@@ -262,6 +274,14 @@ public void setGrokPattern(String grokPattern) {
             this.grokPattern = (grokPattern == null || grokPattern.isEmpty()) ? null : grokPattern;
         }
 
+        public String getEcsCompatibility() {
+            return ecsCompatibility;
+        }
+
+        public void setEcsCompatibility(String ecsCompatibility) {
+            this.ecsCompatibility = (ecsCompatibility == null || ecsCompatibility.isEmpty()) ? null : ecsCompatibility;
+        }
+
         public String getTimestampFormat() {
             return timestampFormat;
         }
@@ -338,6 +358,18 @@ public ActionRequestValidationException validate() {
                     );
                 }
             }
+
+            if (ecsCompatibility != null && Grok.isValidEcsCompatibilityMode(ecsCompatibility) == false) {
+                validationException = addValidationError(
+                    "["
+                        + ECS_COMPATIBILITY.getPreferredName()
+                        + "] must be one of ["
+                        + String.join(", ", Grok.ECS_COMPATIBILITY_MODES)
+                        + "] if specified",
+                    validationException
+                );
+            }
+
             if (sample == null || sample.length() == 0) {
                 validationException = addValidationError("sample must be specified", validationException);
             }
@@ -378,6 +410,9 @@ public void writeTo(StreamOutput out) throws IOException {
             }
             out.writeOptionalBoolean(shouldTrimFields);
             out.writeOptionalString(grokPattern);
+            if (out.getVersion().onOrAfter(Version.V_8_5_0)) {
+                out.writeOptionalString(ecsCompatibility);
+            }
             out.writeOptionalString(timestampFormat);
             out.writeOptionalString(timestampField);
             out.writeBytesReference(sample);
@@ -395,6 +430,7 @@ public int hashCode() {
                 hasHeaderRow,
                 delimiter,
                 grokPattern,
+                ecsCompatibility,
                 timestampFormat,
                 timestampField,
                 sample
@@ -422,6 +458,7 @@ public boolean equals(Object other) {
                 && Objects.equals(this.hasHeaderRow, that.hasHeaderRow)
                 && Objects.equals(this.delimiter, that.delimiter)
                 && Objects.equals(this.grokPattern, that.grokPattern)
+                && Objects.equals(this.ecsCompatibility, that.ecsCompatibility)
                 && Objects.equals(this.timestampFormat, that.timestampFormat)
                 && Objects.equals(this.timestampField, that.timestampField)
                 && Objects.equals(this.sample, that.sample);
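
Note on the `Version.V_8_5_0` checks in `Request(StreamInput)` and `writeTo`: the new field sits between `grokPattern` and `timestampFormat` on the wire, so both sides have to skip it when talking to an older node. The sketch below illustrates that pattern with plain `java.io` streams. Everything in it is an assumption made for illustration: `VersionGatedFieldSketch`, the integer `V_WITH_ECS` and the optional-string helpers are hypothetical stand-ins, not the Elasticsearch `StreamInput`/`StreamOutput` API or its real wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VersionGatedFieldSketch {
    // Hypothetical version id standing in for the release that introduced the field.
    static final int V_WITH_ECS = 80500;

    private static void writeOptionalString(DataOutputStream out, String s) throws IOException {
        out.writeBoolean(s != null); // presence flag, then the value if present
        if (s != null) {
            out.writeUTF(s);
        }
    }

    private static String readOptionalString(DataInputStream in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    // Writes three fields in order; the middle one only if the peer understands it.
    static byte[] write(int peerVersion, String grokPattern, String ecsCompatibility, String timestampFormat) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeOptionalString(out, grokPattern);
            if (peerVersion >= V_WITH_ECS) {
                writeOptionalString(out, ecsCompatibility);
            }
            writeOptionalString(out, timestampFormat);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back; an older peer never sent the middle field, so it stays null.
    static String[] read(int peerVersion, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String grokPattern = readOptionalString(in);
            String ecsCompatibility = peerVersion >= V_WITH_ECS ? readOptionalString(in) : null;
            String timestampFormat = readOptionalString(in);
            return new String[] { grokPattern, ecsCompatibility, timestampFormat };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The reader and writer must gate on the same version, otherwise the fields after the new one would be misaligned. That is why the diff adds a matching check in both the stream constructor and `writeTo`.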