
[Text Structure][ML] Improve multi-line start pattern recognition when no timestamps are present #79708

Open
jgowdyelastic opened this issue Oct 25, 2021 · 3 comments
Labels
>enhancement · :ml Machine learning · Team:ML (Meta label for the ML team)

Comments

@jgowdyelastic
Member

jgowdyelastic commented Oct 25, 2021

Two example files were supplied by a user:

  1. poc-noprob-310.csv
  2. poc-repro-311.csv

The find_structure endpoint will only produce a multiline_start_pattern when the first line of a multi-line document includes a field which it assumes is a date field.
In the examples provided, the very long (300+ character) number appears to be interpreted as a timestamp, and so a multiline_start_pattern is produced.
Without this multiline_start_pattern the file upload plugin in Kibana cannot correctly parse the record, as it will assume that the newline character in col2 is the end of the record.

Is it possible to produce a multiline_start_pattern when a date field does not exist on the first line of a multi-line message?
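To make the failure mode concrete, here is a minimal sketch (the two-column layout and field contents are made up, not taken from the attached files) showing why naive newline splitting breaks a record whose quoted field contains a newline, while a real CSV parser keeps the record intact:

```python
import csv
import io

# Hypothetical record whose col2 contains an embedded newline,
# mirroring the situation described in this issue.
raw = 'col1,col2\n12345,"first line\nsecond line"\n'

# Naive newline splitting treats the embedded newline as a record boundary:
# one header plus one record become three "lines".
print(len(raw.splitlines()))  # 3

# A CSV parser respects the quoting and keeps the record whole.
rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['col1', 'col2'], ['12345', 'first line\nsecond line']]
```

This is exactly the gap a multiline_start_pattern is meant to bridge when the file is split line by line rather than parsed as CSV.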

@jgowdyelastic jgowdyelastic added >enhancement :ml Machine learning labels Oct 25, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Oct 25, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@jgowdyelastic jgowdyelastic changed the title [Text Structure][ML] Problems creating multi-line start pattern [Text Structure][ML] Improve multi-line start pattern recognition when no timestamps are present Oct 25, 2021
@droberts195
Contributor

It's not completely true to say that there needs to be a timestamp present, but there does need to be a field that always appears on the first line of each CSV record. The current logic is described here (and the code is underneath this comment):

/**
* The multi-line start pattern is based on the first field in the line that is boolean, numeric
* or the detected timestamp, and consists of a pattern matching that field, preceded by wildcards
* to match any prior fields and to match the delimiters in between them.
*
* This is based on the observation that a boolean, numeric or timestamp field will not contain a
* newline.
*
* The approach works best when the chosen field is early in each record, ideally the very first
* field. It doesn't work when fields prior to the chosen field contain newlines in some of the
* records.
*/
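As an illustrative sketch only (the real implementation is Java inside Elasticsearch, and this helper name is invented), the documented heuristic amounts to something like:

```python
import re

def multiline_start_pattern(first_record_fields):
    """Sketch of the documented heuristic, not the actual Elasticsearch code:
    anchor the pattern on the first boolean or numeric field, preceded by
    wildcards that skip any earlier fields and the delimiters between them."""
    for index, field in enumerate(first_record_fields):
        if field.lower() in ("true", "false"):
            field_pattern = "(?:true|false)"
        elif re.fullmatch(r"-?\d+(?:\.\d+)?", field):
            field_pattern = r"-?\d+(?:\.\d+)?"
        else:
            continue  # string fields may contain newlines, so they are skipped
        # ".*?," wildcards match any prior fields and their delimiters
        return "^" + ".*?," * index + field_pattern
    return None  # no safe field found: the situation this issue describes

print(multiline_start_pattern(["abc", "42", "note"]))       # anchored on the number
print(multiline_start_pattern(["free text", "more text"]))  # None
```

The `None` case is the scenario from the issue description: every field before the first potentially multi-line string is itself a string.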

The next thought is of course to say this is crazy, just run the whole file through a CSV parser. However, the problem arises because of the way Elasticsearch and its data shippers have always worked by first splitting files into messages, then parsing these messages individually. Because of this approach "just run the whole file through a CSV parser" is not nearly as simple as it sounds within the Elastic ecosystem. The best way to accurately determine where one CSV record ends and the next starts is to use a proper CSV parser, not regexes. This was recognised in Logstash nearly 7 years ago - see elastic/logstash#2088 (comment). But the Filebeat/Ingest pipeline combo still relies on using regexes to split the file into messages, then running a CSV parser on each message separately. Filebeat does now actually have a built-in CSV parser - see elastic/beats#11753 - but still this only parses an individual field (which could be a full message) after the file has been split into messages.

This means that the text structure finder has a very difficult task if the CSV rows are such that the first field in each CSV record is a string field that could potentially be multiline. It needs to find something common about the bits of that field that always appear on the first line that could be used to detect the first line. Whatever regex is chosen must not then match any other line.

The workaround is to rearrange the column order so that a number, boolean or the detected date field comes before the first string field that might be multi-line.

If no such field can be found then we could potentially try to do something with the first string field, looking for commonality across all messages that never appears on follow-on lines. Before doing this it would be nice to have a concrete example of a real file showing that this approach would be useful, rather than the made-up examples attached.

The other potential change would obviously be in Filebeat/Elastic Agent: if that process could use a CSV parser to split the file into messages instead of just regexes, then the text structure finder could generate a config specifying to do that. That would mean opening an enhancement request against Filebeat/Elastic Agent, waiting for it to be implemented, and then progressing this issue.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Mar 17, 2022
This PR contains 3 fixes for the way multi-line start patterns are
created in the text structure endpoint:

1. For delimited files the multi-line start pattern used to be based
   on the first field that was a timestamp, boolean or number. This
   PR adds the option of a low cardinality keyword field as an
   alternative (i.e. an enum field effectively). It means there is
   more chance of a field early in each record being chosen as the
   mechanism for determining whether a line is the first line of a
   record.
2. The multi-line start pattern for delimited files now only permits
   the delimiter character between fields, not within quoted fields.
   Previously it was possible for the multi-line start pattern to
   match continuation lines. Unfortunately this may mean we can no
   longer determine a multi-line start pattern for files whose only
   suitable field is to the right of fields that sometimes contain
   commas, and the only solution in this case will be to reorder the
   columns before importing the data. Hopefully this problem will be
   very rare.
3. For semi-structured text log files there is now a cap on the
   complexity of the multi-line start pattern. It has been observed
   that the patterns generated for slightly malformed CSV files could
   run for days against the malformed lines of those files - the
   classic problem of a regex that doesn't match but nearly does doing
   lots of backtracking. We now throw an error in this situation and
   suggest overriding the format to delimited.

Relates elastic#79708
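Fix 2 can be illustrated with a small sketch (these regexes are invented for illustration, not the patterns Elasticsearch actually generates): an old-style bare wildcard before the anchor field can match a continuation line, while a pattern that only permits the delimiter between whole fields, treating quoted fields as opaque, cannot.

```python
import re

# Old style: any characters, then a comma, then the numeric anchor field.
old_style = re.compile(r'^.*?,\d+,')
# New style: exactly one quoted or unquoted field, then the anchor field.
new_style = re.compile(r'^(?:"[^"]*"|[^",\n]*),\d+,')

first_line = '"start of note,",42,ok'   # a genuine record start
continuation = 'rest of note",43,oops'  # tail of a quoted multi-line field

assert old_style.match(first_line) and new_style.match(first_line)
assert old_style.match(continuation)      # false positive on the continuation line
assert not new_style.match(continuation)  # quoted-field-aware pattern rejects it
```

The trade-off described above follows directly: the field-by-field pattern is stricter, so a file whose only safe anchor field sits to the right of comma-containing fields may get no pattern at all.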
@droberts195
Contributor

#85066 should help with this.

droberts195 added a commit that referenced this issue Mar 18, 2022
…85066)

Relates #79708
Fixes elastic/kibana#121966
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Mar 18, 2022
…lastic#85066)

Relates elastic#79708
Fixes elastic/kibana#121966
elasticsearchmachine pushed a commit that referenced this issue Mar 18, 2022
…85066) (#85100)

Relates #79708
Fixes elastic/kibana#121966