Commit

[ML] ECS Grok patterns in the _text_structure/find_structure endpoint (#88982)

Also add support for new CATALINA/TOMCAT timestamp formats used by ECS Grok patterns

Relates #77065

Co-authored-by: David Roberts <dave.roberts@elastic.co>
edsavage and droberts195 committed Aug 4, 2022
1 parent c08111b commit 188f887
Showing 19 changed files with 1,762 additions and 743 deletions.
49 changes: 34 additions & 15 deletions docs/reference/text-structure/apis/find-structure.asciidoc
@@ -99,6 +99,16 @@ specified, the name of the timestamp field in the Grok pattern must match
"timestamp". If `grok_pattern` is not specified, the structure finder creates a
Grok pattern.

`ecs_compatibility`::
(Optional, string) The mode of compatibility with ECS compliant Grok patterns.
Use this parameter to specify whether to use ECS Grok patterns instead of
legacy ones when the structure finder creates a Grok pattern. Valid values
are `disabled` and `v1`. The default value is `disabled`. This setting primarily
has an impact when a whole message Grok pattern such as `%{CATALINALOG}`
matches the input. If the structure finder identifies a common structure but
has no idea of meaning then generic field names such as `path`, `ipaddress`,
`field1` and `field2` are used in the `grok_pattern` output, with the intention
that a user who knows the meanings rename these fields before using it.
`has_header_row`::
(Optional, Boolean) If you have set `format` to `delimited`, you can use this
parameter to indicate whether the column names are in the first row of the text.
@@ -286,15 +296,16 @@ If the request does not encounter errors, you receive the following result:
"charset" : "UTF-8", <4>
"has_byte_order_marker" : false, <5>
"format" : "ndjson", <6>
"timestamp_field" : "release_date", <7>
"joda_timestamp_formats" : [ <8>
"ecs_compatibility" : "disabled", <7>
"timestamp_field" : "release_date", <8>
"joda_timestamp_formats" : [ <9>
"ISO8601"
],
"java_timestamp_formats" : [ <9>
"java_timestamp_formats" : [ <10>
"ISO8601"
],
"need_client_timezone" : true, <10>
"mappings" : { <11>
"need_client_timezone" : true, <11>
"mappings" : { <12>
"properties" : {
"@timestamp" : {
"type" : "date"
@@ -328,7 +339,7 @@ If the request does not encounter errors, you receive the following result:
}
]
},
"field_stats" : { <12>
"field_stats" : { <13>
"author" : {
"count" : 24,
"cardinality" : 20,
@@ -536,19 +547,20 @@ may help diagnose parse errors or accidental uploads of the wrong text.
<5> For UTF character encodings, `has_byte_order_marker` indicates whether the
text begins with a byte order marker.
<6> `format` is one of `ndjson`, `xml`, `delimited` or `semi_structured_text`.
<7> The `timestamp_field` names the field considered most likely to be the
<7> `ecs_compatibility` is either `disabled` or `v1`, defaults to `disabled`.
<8> The `timestamp_field` names the field considered most likely to be the
primary timestamp of each document.
<8> `joda_timestamp_formats` are used to tell {ls} how to parse timestamps.
<9> `java_timestamp_formats` are the Java time formats recognized in the time
<9> `joda_timestamp_formats` are used to tell {ls} how to parse timestamps.
<10> `java_timestamp_formats` are the Java time formats recognized in the time
fields. {es} mappings and ingest pipelines use this format.
<10> If a timestamp format is detected that does not include a timezone,
<11> If a timestamp format is detected that does not include a timezone,
`need_client_timezone` will be `true`. The server that parses the text must
therefore be told the correct timezone by the client.
<11> `mappings` contains some suitable mappings for an index into which the data
<12> `mappings` contains some suitable mappings for an index into which the data
could be ingested. In this case, the `release_date` field has been given a
`keyword` type as it is not considered specific enough to convert to the `date`
type.
<12> `field_stats` contains the most common values of each field, plus basic
<13> `field_stats` contains the most common values of each field, plus basic
numeric statistics for the numeric `page_count` field. This information may
provide clues that the data needs to be cleaned or transformed prior to use by
other {stack} functionality.
@@ -1534,7 +1546,8 @@ This is an example of analyzing an {es} log file:

[source,js]
----
curl -s -H "Content-Type: application/json" -XPOST "localhost:9200/_text_structure/find_structure?pretty" -T "$ES_HOME/logs/elasticsearch.log"
curl -s -H "Content-Type: application/json" -XPOST
"localhost:9200/_text_structure/find_structure?pretty&ecs_compatibility=disabled" -T "$ES_HOME/logs/elasticsearch.log"
----
// NOTCONSOLE
// Not converting to console because this shows how curl can be used
@@ -1553,6 +1566,7 @@ this:
"format" : "semi_structured_text", <1>
"multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}", <2>
"grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel}.*", <3>
"ecs_compatibility" : "disabled", <4>
"timestamp_field" : "timestamp",
"joda_timestamp_formats" : [
"ISO8601"
@@ -1679,6 +1693,8 @@ in the first line of each multi-line log message.
<3> A very simple `grok_pattern` has been created, which extracts the timestamp
and recognizable fields that appear in every analyzed message. In this case the
only field that was recognized beyond the timestamp was the log level.
<4> The ECS Grok pattern compatibility mode used, may be one of either `disabled`
(the default if not specified in the request) or `v1`

[discrete]
[[find-structure-example-grok]]
@@ -1715,6 +1731,7 @@ this:
"format" : "semi_structured_text",
"multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}",
"grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel} *\\]\\[%{JAVACLASS:class} *\\] \\[%{HOSTNAME:node}\\] %{JAVALOGMESSAGE:message}", <1>
"ecs_compatibility" : "disabled", <2>
"timestamp_field" : "timestamp",
"joda_timestamp_formats" : [
"ISO8601"
@@ -1769,7 +1786,7 @@ this:
}
]
},
"field_stats" : { <2>
"field_stats" : { <3>
"class" : {
"count" : 53,
"cardinality" : 14,
@@ -1945,7 +1962,9 @@ this:

<1> The `grok_pattern` in the output is now the overridden one supplied in the
query parameter.
<2> The returned `field_stats` include entries for the fields from the
<2> The ECS Grok pattern compatibility mode used, may be one of either `disabled`
(the default if not specified in the request) or `v1`
<3> The returned `field_stats` include entries for the fields from the
overridden `grok_pattern`.

The URL escaping is hard, so if you are working interactively it is best to use
Original file line number Diff line number Diff line change
@@ -74,6 +74,10 @@
"type":"string",
"description":"Optional parameter to specify the Grok pattern that should be used to extract fields from messages in a semi-structured text file"
},
"ecs_compatibility":{
"type":"string",
"description":"Optional parameter to specify the compatibility mode with ECS Grok patterns - may be either 'v1' or 'disabled'"
},
"timestamp_field":{
"type":"string",
"description":"Optional parameter to specify the timestamp field in the file"
1 change: 1 addition & 0 deletions x-pack/plugin/core/build.gradle
@@ -28,6 +28,7 @@ tasks.named("dependencyLicenses").configure {

dependencies {
compileOnly project(":server")
api project(':libs:elasticsearch-grok')
api project(":libs:elasticsearch-ssl-config")
api "org.apache.httpcomponents:httpclient:${versions.httpclient}"
api "org.apache.httpcomponents:httpcore:${versions.httpcore}"
1 change: 1 addition & 0 deletions x-pack/plugin/core/src/main/java/module-info.java
@@ -8,6 +8,7 @@
module org.elasticsearch.xcore {
requires org.elasticsearch.cli;
requires org.elasticsearch.base;
requires org.elasticsearch.grok;
requires org.elasticsearch.server;
requires org.elasticsearch.sslconfig;
requires org.elasticsearch.xcontent;
Original file line number Diff line number Diff line change
@@ -6,6 +6,7 @@
*/
package org.elasticsearch.xpack.core.textstructure.action;

import org.elasticsearch.Version;
import org.elasticsearch.action.ActionRequest;
import org.elasticsearch.action.ActionRequestValidationException;
import org.elasticsearch.action.ActionResponse;
@@ -16,6 +17,7 @@
import org.elasticsearch.common.io.stream.Writeable;
import org.elasticsearch.common.xcontent.StatusToXContentObject;
import org.elasticsearch.core.TimeValue;
import org.elasticsearch.grok.Grok;
import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.xcontent.ParseField;
import org.elasticsearch.xcontent.XContentBuilder;
@@ -30,6 +32,8 @@
import static org.elasticsearch.action.ValidateActions.addValidationError;

public class FindStructureAction extends ActionType<FindStructureAction.Response> {
public static final String ECS_COMPATIBILITY_DISABLED = Grok.ECS_COMPATIBILITY_MODES[0];
public static final String ECS_COMPATIBILITY_V1 = Grok.ECS_COMPATIBILITY_MODES[1];

public static final FindStructureAction INSTANCE = new FindStructureAction();
public static final String NAME = "cluster:monitor/text_structure/findstructure";
@@ -107,6 +111,8 @@ public static class Request extends ActionRequest {
public static final ParseField TIMESTAMP_FORMAT = new ParseField("timestamp_format");
public static final ParseField TIMESTAMP_FIELD = TextStructure.TIMESTAMP_FIELD;

public static final ParseField ECS_COMPATIBILITY = TextStructure.ECS_COMPATIBILITY;

private static final String ARG_INCOMPATIBLE_WITH_FORMAT_TEMPLATE = "[%s] may only be specified if ["
+ FORMAT.getPreferredName()
+ "] is [%s]";
@@ -122,6 +128,7 @@ public static class Request extends ActionRequest {
private Character quote;
private Boolean shouldTrimFields;
private String grokPattern;
private String ecsCompatibility;
private String timestampFormat;
private String timestampField;
private BytesReference sample;
@@ -141,6 +148,11 @@ public Request(StreamInput in) throws IOException {
quote = in.readBoolean() ? (char) in.readVInt() : null;
shouldTrimFields = in.readOptionalBoolean();
grokPattern = in.readOptionalString();
if (in.getVersion().onOrAfter(Version.V_8_5_0)) {
ecsCompatibility = in.readOptionalString();
} else {
ecsCompatibility = null;
}
timestampFormat = in.readOptionalString();
timestampField = in.readOptionalString();
sample = in.readBytesReference();
@@ -262,6 +274,14 @@ public void setGrokPattern(String grokPattern) {
this.grokPattern = (grokPattern == null || grokPattern.isEmpty()) ? null : grokPattern;
}

public String getEcsCompatibility() {
return ecsCompatibility;
}

public void setEcsCompatibility(String ecsCompatibility) {
this.ecsCompatibility = (ecsCompatibility == null || ecsCompatibility.isEmpty()) ? null : ecsCompatibility;
}

public String getTimestampFormat() {
return timestampFormat;
}
@@ -338,6 +358,18 @@ public ActionRequestValidationException validate() {
);
}
}

if (ecsCompatibility != null && Grok.isValidEcsCompatibilityMode(ecsCompatibility) == false) {
validationException = addValidationError(
"["
+ ECS_COMPATIBILITY.getPreferredName()
+ "] must be one of ["
+ String.join(", ", Grok.ECS_COMPATIBILITY_MODES)
+ "] if specified",
validationException
);
}

if (sample == null || sample.length() == 0) {
validationException = addValidationError("sample must be specified", validationException);
}
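
The validation added in this hunk can be tried in isolation with a minimal sketch like the one below. It is not the committed code: `EcsCompatibilityCheck`, its local `ECS_COMPATIBILITY_MODES` list, and the `contains` check are stand-ins for `Grok.ECS_COMPATIBILITY_MODES` and `Grok.isValidEcsCompatibilityMode`, whose implementations are not shown in this diff. The parameter name `ecs_compatibility` in the message is assumed to be what `ECS_COMPATIBILITY.getPreferredName()` returns, based on the documentation hunk above.

```java
import java.util.List;
import java.util.Optional;

public class EcsCompatibilityCheck {
    // Assumption: stand-in for Grok.ECS_COMPATIBILITY_MODES, using the two
    // values the documentation lists ("disabled" and "v1").
    public static final List<String> ECS_COMPATIBILITY_MODES = List.of("disabled", "v1");

    // Mirrors setEcsCompatibility: null or empty is treated as "not specified".
    public static String normalize(String ecsCompatibility) {
        return (ecsCompatibility == null || ecsCompatibility.isEmpty()) ? null : ecsCompatibility;
    }

    // Returns an error message when a value is specified but is not a known
    // mode, and an empty Optional otherwise.
    public static Optional<String> validate(String ecsCompatibility) {
        String value = normalize(ecsCompatibility);
        if (value != null && ECS_COMPATIBILITY_MODES.contains(value) == false) {
            return Optional.of(
                "[ecs_compatibility] must be one of [" + String.join(", ", ECS_COMPATIBILITY_MODES) + "] if specified"
            );
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Exercise the unspecified, valid, and invalid cases.
        for (String candidate : new String[] { null, "", "disabled", "v1", "v2" }) {
            System.out.println(candidate + " -> " + validate(candidate).orElse("ok"));
        }
    }
}
```

Because an empty string is normalized to `null` before the check, an omitted parameter and an empty one behave the same way, and only a non-empty unknown value produces a validation error.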
@@ -378,6 +410,9 @@ public void writeTo(StreamOutput out) throws IOException {
}
out.writeOptionalBoolean(shouldTrimFields);
out.writeOptionalString(grokPattern);
if (out.getVersion().onOrAfter(Version.V_8_5_0)) {
out.writeOptionalString(ecsCompatibility);
}
out.writeOptionalString(timestampFormat);
out.writeOptionalString(timestampField);
out.writeBytesReference(sample);
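
The version checks in the `Request(StreamInput in)` constructor and in `writeTo` keep the wire format readable by nodes that predate the new field. The sketch below illustrates that pattern with plain `java.io` streams rather than the Elasticsearch `StreamInput`/`StreamOutput` classes. The class `VersionGatedField`, the integer `V_8_5_0`, and the helper methods are hypothetical names chosen for this illustration; only the idea of writing and reading the optional field when the peer version is on or after 8.5.0 comes from the diff.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class VersionGatedField {
    // Hypothetical numeric stand-in for Version.V_8_5_0.
    public static final int V_8_5_0 = 80500;

    // Presence flag followed by the value, similar in spirit to writeOptionalString.
    private static void writeOptionalString(DataOutputStream out, String value) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            out.writeUTF(value);
        }
    }

    private static String readOptionalString(DataInputStream in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    // Writes the fields in order; the ECS field is only sent to peers that know about it.
    public static byte[] write(int peerVersion, String grokPattern, String ecsCompatibility, String timestampFormat) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeOptionalString(out, grokPattern);
            if (peerVersion >= V_8_5_0) {
                writeOptionalString(out, ecsCompatibility);
            }
            writeOptionalString(out, timestampFormat);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields in the same order; older peers never sent the ECS field, so it stays null.
    public static String[] read(int peerVersion, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String grokPattern = readOptionalString(in);
            String ecsCompatibility = peerVersion >= V_8_5_0 ? readOptionalString(in) : null;
            String timestampFormat = readOptionalString(in);
            return new String[] { grokPattern, ecsCompatibility, timestampFormat };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int oldVersion = V_8_5_0 - 1;
        System.out.println(Arrays.toString(read(V_8_5_0, write(V_8_5_0, "%{LOGLEVEL:loglevel}", "v1", "ISO8601"))));
        System.out.println(Arrays.toString(read(oldVersion, write(oldVersion, "%{LOGLEVEL:loglevel}", "v1", "ISO8601"))));
    }
}
```

The reader and writer must apply the same version condition at the same position in the field order. If only one side gated the field, every field after it would be read from the wrong offset.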
@@ -395,6 +430,7 @@ public int hashCode() {
hasHeaderRow,
delimiter,
grokPattern,
ecsCompatibility,
timestampFormat,
timestampField,
sample
@@ -422,6 +458,7 @@ public boolean equals(Object other) {
&& Objects.equals(this.hasHeaderRow, that.hasHeaderRow)
&& Objects.equals(this.delimiter, that.delimiter)
&& Objects.equals(this.grokPattern, that.grokPattern)
&& Objects.equals(this.ecsCompatibility, that.ecsCompatibility)
&& Objects.equals(this.timestampFormat, that.timestampFormat)
&& Objects.equals(this.timestampField, that.timestampField)
&& Objects.equals(this.sample, that.sample);