
[Text Structure][ML] Improve multi-line start pattern recognition when no timestamps are present #79708

Open
jgowdyelastic opened this issue Oct 25, 2021 · 3 comments
Labels
>enhancement · :ml Machine learning · Team:ML (Meta label for the ML team)

Comments

@jgowdyelastic
Member

jgowdyelastic commented Oct 25, 2021

Two example files were supplied by a user:

  1. poc-noprob-310.csv
  2. poc-repro-311.csv

The find_structure endpoint will only produce a multiline_start_pattern when the first line of a multi-line document includes a field which it assumes is a date field.
In the examples provided, the very long (300+ character) number appears to be interpreted as a timestamp, and so a multiline_start_pattern is produced.
Without this multiline_start_pattern the file upload plugin in Kibana cannot correctly parse the record, as it will assume that the newline character in col2 is the end of the record.

Is it possible to produce a multiline_start_pattern when a date field does not exist on the first line of a multi-line message?
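To make the failure mode concrete, here is a minimal sketch (the two-column layout and field contents are made up, not taken from the attached files) showing why naive newline splitting breaks a record whose quoted field contains a newline, while a real CSV parser keeps the record intact:

```python
import csv
import io

# Hypothetical record whose col2 contains an embedded newline,
# mirroring the situation described in this issue.
raw = 'col1,col2\n12345,"first line\nsecond line"\n'

# Naive newline splitting treats the embedded newline as a record boundary:
# one header plus one record become three "lines".
print(len(raw.splitlines()))  # 3

# A CSV parser respects the quoting and keeps the record whole.
rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['col1', 'col2'], ['12345', 'first line\nsecond line']]
```

This is exactly the gap a multiline_start_pattern is meant to bridge when the file is split line by line rather than parsed as CSV.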

@jgowdyelastic jgowdyelastic added >enhancement :ml Machine learning labels Oct 25, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Oct 25, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@jgowdyelastic jgowdyelastic changed the title [Text Structure][ML] Problems creating multi-line start pattern [Text Structure][ML] Improve multi-line start pattern recognition when no timestamps are present Oct 25, 2021
@droberts195
Contributor

It's not completely true to say that there needs to be a timestamp present, but there does need to be a field that always appears on the first line of each CSV record. The current logic is described here (and the code is underneath this comment):

/**
* The multi-line start pattern is based on the first field in the line that is boolean, numeric
* or the detected timestamp, and consists of a pattern matching that field, preceded by wildcards
* to match any prior fields and to match the delimiters in between them.
*
* This is based on the observation that a boolean, numeric or timestamp field will not contain a
* newline.
*
* The approach works best when the chosen field is early in each record, ideally the very first
* field. It doesn't work when fields prior to the chosen field contain newlines in some of the
* records.
*/
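As an illustrative sketch only (the real implementation is Java inside Elasticsearch, and this helper name is invented), the documented heuristic amounts to something like:

```python
import re

def multiline_start_pattern(first_record_fields):
    """Sketch of the documented heuristic, not the actual Elasticsearch code:
    anchor the pattern on the first boolean or numeric field, preceded by
    wildcards that skip any earlier fields and the delimiters between them."""
    for index, field in enumerate(first_record_fields):
        if field.lower() in ("true", "false"):
            field_pattern = "(?:true|false)"
        elif re.fullmatch(r"-?\d+(?:\.\d+)?", field):
            field_pattern = r"-?\d+(?:\.\d+)?"
        else:
            continue  # string fields may contain newlines, so they are skipped
        # ".*?," wildcards match any prior fields and their delimiters
        return "^" + ".*?," * index + field_pattern
    return None  # no safe field found: the situation this issue describes

print(multiline_start_pattern(["abc", "42", "note"]))       # anchored on the number
print(multiline_start_pattern(["free text", "more text"]))  # None
```

The `None` case is the scenario from the issue description: every field before the first potentially multi-line string is itself a string.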

The next thought is of course to say this is crazy, just run the whole file through a CSV parser. However, the problem arises because of the way Elasticsearch and its data shippers have always worked by first splitting files into messages, then parsing these messages individually. Because of this approach "just run the whole file through a CSV parser" is not nearly as simple as it sounds within the Elastic ecosystem. The best way to accurately determine where one CSV record ends and the next starts is to use a proper CSV parser, not regexes. This was recognised in Logstash nearly 7 years ago - see elastic/logstash#2088 (comment). But the Filebeat/Ingest pipeline combo still relies on using regexes to split the file into messages, then running a CSV parser on each message separately. Filebeat does now actually have a built-in CSV parser - see elastic/beats#11753 - but still this only parses an individual field (which could be a full message) after the file has been split into messages.

This means that the text structure finder has a very difficult task if the CSV rows are such that the first field in each CSV record is a string field that could potentially be multiline. It needs to find something common about the bits of that field that always appear on the first line that could be used to detect the first line. Whatever regex is chosen must not then match any other line.

The workaround is to rearrange the column order so that a number, boolean or the detected date field comes before the first string field that might be multi-line.

If no such field can be found then we could potentially try to do something with the first string field, looking for commonality across all messages that never appears on follow-on lines. Before doing this it would be nice to have a concrete example of a real file showing that this approach would be useful, rather than the made-up examples attached.

The other potential change would obviously be in Filebeat/Elastic Agent: if that process could use a CSV parser to split the file into messages instead of just regexes, then the text structure finder could generate a config specifying to do that. That would mean opening an enhancement request against Filebeat/Elastic Agent, waiting for it to be implemented, and then progressing this issue.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Mar 17, 2022
This PR contains 3 fixes for the way multi-line start patterns are
created in the text structure endpoint:

1. For delimited files the multi-line start pattern used to be based
   on the first field that was a timestamp, boolean or number. This
   PR adds the option of a low cardinality keyword field as an
   alternative (i.e. an enum field effectively). It means there is
   more chance of a field early in each record being chosen as the
   mechanism for determining whether a line is the first line of a
   record.
2. The multi-line start pattern for delimited files now only permits
   the delimiter character between fields, not within quoted fields.
   Previously it was possible for the multi-line start pattern to
   match continuation lines. Unfortunately this may mean we can no
   longer determine a multi-line start pattern for files whose only
   suitable field is to the right of fields that sometimes contain
   commas, and the only solution in this case will be to reorder the
   columns before importing the data. Hopefully this problem will be
   very rare.
3. For semi-structured text log files there is now a cap on the
   complexity of the multi-line start pattern. It has been observed
   that the patterns generated for slightly malformed CSV files could
   run for days against the malformed lines of those files - the
   classic problem of a regex that doesn't match but nearly does doing
   lots of backtracking. We now throw an error in this situation and
   suggest overriding the format to delimited.

Relates elastic#79708
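Fix 2 can be illustrated with a small sketch (these regexes are invented for illustration, not the patterns Elasticsearch actually generates): an old-style bare wildcard before the anchor field can match a continuation line, while a pattern that only permits the delimiter between whole fields, treating quoted fields as opaque, cannot.

```python
import re

# Old style: any characters, then a comma, then the numeric anchor field.
old_style = re.compile(r'^.*?,\d+,')
# New style: exactly one quoted or unquoted field, then the anchor field.
new_style = re.compile(r'^(?:"[^"]*"|[^",\n]*),\d+,')

first_line = '"start of note,",42,ok'   # a genuine record start
continuation = 'rest of note",43,oops'  # tail of a quoted multi-line field

assert old_style.match(first_line) and new_style.match(first_line)
assert old_style.match(continuation)      # false positive on the continuation line
assert not new_style.match(continuation)  # quoted-field-aware pattern rejects it
```

The trade-off described above follows directly: the field-by-field pattern is stricter, so a file whose only safe anchor field sits to the right of comma-containing fields may get no pattern at all.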
@droberts195
Contributor

#85066 should help with this.

droberts195 added a commit that referenced this issue Mar 18, 2022
…85066)

Relates #79708
Fixes elastic/kibana#121966
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Mar 18, 2022
…lastic#85066)

Relates elastic#79708
Fixes elastic/kibana#121966
elasticsearchmachine pushed a commit that referenced this issue Mar 18, 2022
…85066) (#85100)

Relates #79708
Fixes elastic/kibana#121966