[Text Structure][ML] Improve multi-line start pattern recognition when no timestamps are present #79708
Pinging @elastic/ml-core (Team:ML)
It's not completely true to say that there needs to be a timestamp present; rather, there needs to be a field that always appears on the first line of each CSV record. The current logic is described here (and the code is underneath this comment): Lines 739 to 750 in 68817d7
The next thought is of course to say this is crazy, and to just run the whole file through a CSV parser. However, the problem arises because of the way Elasticsearch and its data shippers have always worked: first splitting files into messages, then parsing those messages individually. Because of this approach, "just run the whole file through a CSV parser" is not nearly as simple as it sounds within the Elastic ecosystem.

The best way to accurately determine where one CSV record ends and the next starts is to use a proper CSV parser, not regexes. This was recognised in Logstash nearly 7 years ago - see elastic/logstash#2088 (comment). But the Filebeat/Ingest pipeline combo still relies on using regexes to split the file into messages, then running a CSV parser on each message separately. Filebeat does now have a built-in CSV parser - see elastic/beats#11753 - but this still only parses an individual field (which could be a full message) after the file has been split into messages.

This means that the text structure finder has a very difficult task if the first field in each CSV record is a string field that could potentially be multi-line. It needs to find something common about the parts of that field that always appear on the first line, which could then be used to detect the first line. Whatever regex is chosen must not match any other line.

The workaround is to rearrange the column order so that a column that comes before the first potentially multi-line string field is a number, boolean or the detected date field. If no such field can be found then we could potentially try to do something with the first string field, looking for commonality across all messages that never appears on follow-on lines of messages. Before doing this it would be nice to have a concrete example of a real file showing that this approach would be useful, instead of the made-up examples attached.
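To make the split-then-parse problem concrete, here is a minimal sketch (with a made-up two-column file, not one of the attached examples) contrasting naive newline splitting, as the shippers effectively do, with a real CSV parser that keeps a quoted newline inside one logical record:

```python
import csv
import io

# Hypothetical delimited file whose second record's string field spans two lines.
raw = 'id,message\n1,"first line\ncontinuation"\n2,"single line"\n'

# Naive newline splitting treats the continuation as a separate "message".
naive_messages = raw.splitlines()

# A proper CSV parser keeps the quoted newline inside one record.
records = list(csv.reader(io.StringIO(raw)))

print(len(naive_messages))  # 4 physical lines
print(len(records))         # 3 logical records (header + 2 rows)
```

The CSV parser recovers `"first line\ncontinuation"` as a single field, which is exactly the information a regex-based line splitter has no reliable way to infer.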
The other potential change would obviously be in Filebeat/Elastic Agent: if that process had the ability to use a CSV parser to split the file into messages instead of just using regexes, then the text structure finder could generate a config specifying to do that. So that would be a case of opening an enhancement against Filebeat/Elastic Agent, waiting for that to be implemented, and then progressing this issue.
This PR contains 3 fixes for the way multi-line start patterns are created in the text structure endpoint:

1. For delimited files the multi-line start pattern used to be based on the first field that was a timestamp, boolean or number. This PR adds the option of a low-cardinality keyword field as an alternative (effectively an enum field). This increases the chance that a field early in each record can be chosen as the mechanism for determining whether a line is the first line of a record.
2. The multi-line start pattern for delimited files now only permits the delimiter character between fields, not within quoted fields. Previously it was possible for the multi-line start pattern to match continuation lines. Unfortunately this may mean we can no longer determine a multi-line start pattern for files whose only suitable field is to the right of fields that sometimes contain commas, and the only solution in this case will be to reorder the columns before importing the data. Hopefully this problem will be very rare.
3. For semi-structured text log files there is now a cap on the complexity of the multi-line start pattern. It has been observed that the patterns generated for slightly malformed CSV files could run for days against the malformed lines of those files: the classic problem of a regex that doesn't match but nearly does doing lots of backtracking. We now throw an error in this situation and suggest overriding the format to delimited.

Relates elastic#79708
Fixes elastic/kibana#121966
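As an illustration of fix 1, here is a hedged sketch (the column names and log levels are made up, and the real generated pattern will differ) of how a start pattern anchored on a low-cardinality keyword first column can distinguish record-start lines from continuation lines:

```python
import re

# Hypothetical: the first column is a low-cardinality keyword field,
# effectively an enum of log levels, so the start pattern can anchor on it.
start_pattern = re.compile(r'^(?:ERROR|WARN|INFO),')

lines = [
    'ERROR,"stack trace line 1',  # first line of a record -> should match
    '  at com.example.Foo",42',   # continuation line -> must not match
    'INFO,"all good",7',          # first line of a record -> should match
]
matches = [bool(start_pattern.match(line)) for line in lines]
print(matches)  # [True, False, True]
```

A freeform first string column offers no such anchor, which is why rearranging columns so that a number, boolean, date or enum-like field comes first remains the workaround.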
#85066 should help with this.
Two example files were supplied by a user:

The `find_structure` endpoint will only produce a `multiline_start_pattern` when the first line of a multi-line document includes a field which it assumes is a date field. In the examples provided, it appears the very long (300+ char) number is seen as being a timestamp, and so a `multiline_start_pattern` is produced. Without this `multiline_start_pattern` the file upload plugin in Kibana cannot correctly parse the line, as it will assume that the newline character in `col2` is the end of the line. Is it possible to produce a `multiline_start_pattern` when a date field does not exist on the first line of a multi-line message?
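To show what the `multiline_start_pattern` buys the uploader, here is a minimal sketch (the record contents and the grouping loop are invented for illustration; the numeric first column stands in for whatever field the pattern anchors on) of regrouping physical lines into one logical message:

```python
import re

# Hypothetical record whose second column contains an embedded newline.
record = '12345,"col2 part one\ncol2 part two",done'

# A start pattern anchored on the numeric first column.
start_pattern = re.compile(r'^\d+,')

# Regroup physical lines: a new message begins at each start-pattern match.
messages, current = [], []
for line in record.split('\n'):
    if start_pattern.match(line) and current:
        messages.append('\n'.join(current))
        current = []
    current.append(line)
messages.append('\n'.join(current))

print(len(messages))  # the two physical lines regroup into one message
```

Without any start pattern the two physical lines would be treated as two broken messages, which is the failure mode the user describes.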