
[ML] Fixes for multi-line start patterns in text structure endpoint #85066

Merged

Conversation

@droberts195 (Contributor) commented Mar 17, 2022

This PR contains 3 fixes for the way multi-line start patterns are
created in the text structure endpoint:

  1. For delimited files the multi-line start pattern used to be based
     on the first field that was a timestamp, boolean or number. This
     PR adds the option of using a low cardinality keyword field as an
     alternative (effectively an enum field). This increases the chance
     that a field early in each record is chosen as the mechanism for
     determining whether a line is the first line of a record.
  2. The multi-line start pattern for delimited files now only permits
     the delimiter character between fields, not within quoted fields.
     Previously the multi-line start pattern could match continuation
     lines. Unfortunately this may mean we can no longer determine a
     multi-line start pattern for files whose only suitable field lies
     to the right of fields that sometimes contain commas; in that case
     the only solution is to reorder the columns before importing the
     data. Hopefully this problem will be very rare.
  3. For semi-structured text log files there is now a cap on the
     complexity of the multi-line start pattern. The patterns generated
     for slightly malformed CSV files have been observed to run for days
     against the malformed lines of those files: the classic problem of
     a regex that nearly matches, but does not, performing excessive
     backtracking. We now throw an error in this situation and suggest
     overriding the format to delimited.
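
As an illustration of point 2, the distinction can be sketched with a small regex (a hypothetical pattern, not the one the endpoint actually generates, assuming comma-delimited, double-quoted CSV): a field is either a quoted run, which may contain commas, or an unquoted run containing neither the delimiter nor a quote, so a comma can only match between fields.

```java
import java.util.regex.Pattern;

public class StartPatternSketch {
    // Hypothetical field pattern: a double-quoted run (with "" as an escaped
    // quote) or an unquoted run containing neither commas nor quotes.
    static final String FIELD = "(?:\"(?:[^\"]|\"\")*\"|[^,\"]*)";

    // Hypothetical multi-line start pattern: a first field, a comma that can
    // only be the real delimiter, then an ISO-style date in the second field.
    static final Pattern LINE_START =
        Pattern.compile(FIELD + ",\\d{4}-\\d{2}-\\d{2}.*");

    public static boolean startsRecord(String line) {
        return LINE_START.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Quoted first field containing a comma, then a timestamp field.
        System.out.println(startsRecord("\"a, b\",2022-03-17T10:00:00Z,msg")); // true
        // A continuation line whose embedded comma is not a delimiter.
        System.out.println(startsRecord("stack trace line, 1234 more"));       // false
    }
}
```

With a naive field pattern that allowed commas anywhere, the second line would also match, which is exactly how continuation lines were previously misidentified as record starts.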

Relates #79708
Fixes elastic/kibana#121966
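
The cap described in point 3 could be guarded along these lines (a minimal sketch with invented names and an invented threshold; the actual metric and cap used by the endpoint are not specified in this description): count the variable-width wildcard groups in a candidate pattern, since each one multiplies the backtracking a near-miss line can trigger, and reject patterns beyond the cap.

```java
public class PatternComplexityGuard {
    // Hypothetical threshold, not the value used by Elasticsearch.
    static final int MAX_WILDCARD_GROUPS = 5;

    // Rough proxy for backtracking risk: the number of ".*?" lazy wildcards,
    // since each adds a dimension of choice when a line nearly matches.
    static int wildcardGroups(String pattern) {
        int count = 0;
        int from = 0;
        while ((from = pattern.indexOf(".*?", from)) >= 0) {
            ++count;
            from += 3;
        }
        return count;
    }

    static void checkComplexity(String pattern) {
        if (wildcardGroups(pattern) > MAX_WILDCARD_GROUPS) {
            throw new IllegalArgumentException(
                "multi-line start pattern too complex; consider overriding the format to delimited"
            );
        }
    }

    public static void main(String[] args) {
        checkComplexity("\\[.*?\\] \\d{4}-\\d{2}-\\d{2}.*"); // one wildcard group: accepted
    }
}
```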

@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Mar 17, 2022
@elasticmachine (Collaborator):

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine (Collaborator):

Hi @droberts195, I've created a changelog YAML for you.

@droberts195 droberts195 added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Mar 17, 2022
    static int longestRun(List<?> sequence) {
        int maxSoFar = 0;
        for (int index = 0; index < sequence.size(); ++index) {
Contributor:
I think it would be a bit easier to follow the control flow if there was only one loop through the sequence:

    static int longestRun(List<?> sequence) {
        if (sequence.size() <= 1) {
            return sequence.size();
        }
        int maxSoFar = 0;
        int thisCount = 1;
        for (int index = 1; index < sequence.size(); ++index) {
            if (sequence.get(index).equals(sequence.get(index - 1))) {
                ++thisCount;
            } else {
                maxSoFar = Math.max(maxSoFar, thisCount);
                thisCount = 1;
            }
        }
        maxSoFar = Math.max(maxSoFar, thisCount);
        return maxSoFar;
    }
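
For reference, the suggested single-pass version behaves as expected on a few hand-checked inputs (a standalone usage sketch; the method body is the reviewer's suggestion above):

```java
import java.util.List;

public class LongestRunDemo {
    // Single-pass version: count the current run and fold it into the
    // maximum whenever the run ends, then once more after the loop.
    static int longestRun(List<?> sequence) {
        if (sequence.size() <= 1) {
            return sequence.size();
        }
        int maxSoFar = 0;
        int thisCount = 1;
        for (int index = 1; index < sequence.size(); ++index) {
            if (sequence.get(index).equals(sequence.get(index - 1))) {
                ++thisCount;
            } else {
                maxSoFar = Math.max(maxSoFar, thisCount);
                thisCount = 1;
            }
        }
        return Math.max(maxSoFar, thisCount);
    }

    public static void main(String[] args) {
        System.out.println(longestRun(List.of(1, 1, 2, 2, 2, 3))); // 3
        System.out.println(longestRun(List.of("a")));              // 1
        System.out.println(longestRun(List.of()));                 // 0
    }
}
```

The final Math.max outside the loop covers the case where the longest run is the suffix, which is one of the edge cases discussed below.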

    public void testLongestRun() {

        List<Integer> sequence = new ArrayList<>();
        for (int before = randomIntBetween(0, 41); before > 0; --before) {
Contributor:
It is kind of covered by the random, but you could add explicit test cases for:

  • the longest sequence is the prefix
  • the longest sequence is the suffix

@droberts195 (Contributor, author):
I made it more likely that these cases get tested by making each true 50% of the time. I tested all the scenarios locally and I think one test with higher probability of testing the edge cases should be sufficient to prevent regressions. There shouldn't be much need to change this code again.

@przemekwitek (Contributor) left a review:

LGTM

@droberts195 droberts195 merged commit 9c6659d into elastic:master Mar 18, 2022
@droberts195 droberts195 deleted the multi-line-start-pattern-improvements branch March 18, 2022 13:46
@elasticsearchmachine (Collaborator):

💔 Backport failed

Branch  Result
7.17    Commit could not be cherry-picked due to conflicts
8.1

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 85066`.

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Mar 18, 2022
…lastic#85066)

elasticsearchmachine pushed a commit that referenced this pull request Mar 18, 2022
…85066) (#85100)

elasticsearchmachine pushed a commit that referenced this pull request Mar 18, 2022
…85109)

Backport of #85066
Labels
auto-backport-and-merge (Automatically create backport pull requests and merge when ready), >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v7.17.2, v8.1.2, v8.2.0
Development

Successfully merging this pull request may close these issues.

[ML/Data Visualizer] Importing CSV file causes high CPU utilization on Chrome
4 participants