[ML] Fixes for multi-line start patterns in text structure endpoint #85066
Conversation
This PR contains 3 fixes for the way multi-line start patterns are created in the text structure endpoint:

1. For delimited files the multi-line start pattern used to be based on the first field that was a timestamp, boolean or number. This PR adds the option of a low cardinality keyword field as an alternative (i.e. effectively an enum field). It means there is more chance of a field early in each record being chosen as the mechanism for determining whether a line is the first line of a record.
2. The multi-line start pattern for delimited files now only permits the delimiter character between fields, not within quoted fields. Previously it was possible for the multi-line start pattern to match continuation lines. Unfortunately this may mean we can no longer determine a multi-line start pattern for files whose only suitable field is to the right of fields that sometimes contain commas; the only solution in this case will be to reorder the columns before importing the data. Hopefully this problem will be very rare.
3. For semi-structured text log files there is now a cap on the complexity of the multi-line start pattern. It has been observed that the patterns generated for slightly malformed CSV files could run for days against the malformed lines of those files - the classic problem of a regex that doesn't match but nearly does, doing lots of backtracking. We now throw an error in this situation and suggest overriding the format to delimited.

Relates elastic#79708
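Fix 1 can be pictured with a small standalone sketch. Everything here is illustrative only: the class name, the `maxDistinct` threshold, and the shape of the generated pattern are assumptions for the example, not the endpoint's actual heuristics.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LowCardinalitySketch {
    // Treat a column as enum-like if it has few distinct values.
    // The real endpoint's cardinality heuristics differ; this only
    // illustrates the idea of an "effectively enum" keyword field.
    static boolean isLowCardinality(List<String> columnValues, int maxDistinct) {
        Set<String> distinct = new LinkedHashSet<>(columnValues);
        return distinct.size() <= maxDistinct;
    }

    // Build a simple line-start alternation from the distinct values,
    // quoting each so regex metacharacters are matched literally.
    static Pattern multiLineStartPattern(List<String> columnValues) {
        Set<String> distinct = new LinkedHashSet<>(columnValues);
        String alternation = distinct.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        return Pattern.compile("^(?:" + alternation + "),");
    }

    public static void main(String[] args) {
        List<String> levels = Arrays.asList("INFO", "WARN", "INFO", "ERROR", "INFO");
        System.out.println(isLowCardinality(levels, 10));                // true
        Pattern start = multiLineStartPattern(levels);
        System.out.println(start.matcher("INFO,message one").find());    // true
        System.out.println(start.matcher("  continuation line").find()); // false
    }
}
```

A field like a log level, chosen this way, lets each record's first line be recognised even when later fields span multiple lines.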
Pinging @elastic/ml-core (Team:ML)
Hi @droberts195, I've created a changelog YAML for you.
*/
static int longestRun(List<?> sequence) {
    int maxSoFar = 0;
    for (int index = 0; index < sequence.size(); ++index) {
I think it would be a bit easier to follow the control flow if there was only one loop through the sequence:
static int longestRun(List<?> sequence) {
    if (sequence.size() <= 1) {
        return sequence.size();
    }
    int maxSoFar = 0;
    int thisCount = 1;
    for (int index = 1; index < sequence.size(); ++index) {
        if (sequence.get(index).equals(sequence.get(index - 1))) {
            ++thisCount;
        } else {
            maxSoFar = Math.max(maxSoFar, thisCount);
            thisCount = 1;
        }
    }
    maxSoFar = Math.max(maxSoFar, thisCount);
    return maxSoFar;
}
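For readers following along, the suggested single-pass version can be dropped into a small standalone file and sanity-checked. This is just a sketch mirroring the suggestion, not the actual Elasticsearch source; the class name and sample inputs are invented for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LongestRunSketch {
    // Single-pass version from the review suggestion above.
    static int longestRun(List<?> sequence) {
        if (sequence.size() <= 1) {
            return sequence.size();
        }
        int maxSoFar = 0;
        int thisCount = 1;
        for (int index = 1; index < sequence.size(); ++index) {
            if (sequence.get(index).equals(sequence.get(index - 1))) {
                ++thisCount;
            } else {
                maxSoFar = Math.max(maxSoFar, thisCount);
                thisCount = 1;
            }
        }
        return Math.max(maxSoFar, thisCount);
    }

    public static void main(String[] args) {
        System.out.println(longestRun(Collections.emptyList()));         // 0
        System.out.println(longestRun(Arrays.asList(7)));                 // 1
        System.out.println(longestRun(Arrays.asList(1, 1, 2, 2, 2, 3))); // 3
    }
}
```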
public void testLongestRun() {
    List<Integer> sequence = new ArrayList<>();
    for (int before = randomIntBetween(0, 41); before > 0; --before) {
It is kind of covered by the random, but you could add explicit test cases for:
- the longest sequence is the prefix
- the longest sequence is the suffix
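The two deterministic cases suggested above could look like this as a standalone sketch (the class name and inputs are invented; it repeats the single-pass implementation from earlier in the thread so it runs on its own, whereas the real tests live in the Elasticsearch test framework):

```java
import java.util.Arrays;
import java.util.List;

public class LongestRunEdgeCases {
    // Same single-pass logic as the review suggestion in this thread.
    static int longestRun(List<?> sequence) {
        if (sequence.size() <= 1) {
            return sequence.size();
        }
        int maxSoFar = 0;
        int thisCount = 1;
        for (int index = 1; index < sequence.size(); ++index) {
            if (sequence.get(index).equals(sequence.get(index - 1))) {
                ++thisCount;
            } else {
                maxSoFar = Math.max(maxSoFar, thisCount);
                thisCount = 1;
            }
        }
        return Math.max(maxSoFar, thisCount);
    }

    public static void main(String[] args) {
        // Longest run is the prefix: leading 9, 9, 9.
        System.out.println(longestRun(Arrays.asList(9, 9, 9, 1, 2)));    // 3
        // Longest run is the suffix: trailing 4, 4, 4, 4.
        System.out.println(longestRun(Arrays.asList(1, 2, 4, 4, 4, 4))); // 4
    }
}
```

The suffix case is the one the final `Math.max` after the loop exists to catch, so it is worth pinning down deterministically.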
I made it more likely that these cases get tested by making each true 50% of the time. I tested all the scenarios locally and I think one test with higher probability of testing the edge cases should be sufficient to prevent regressions. There shouldn't be much need to change this code again.
LGTM
💔 Backport failed
You can use sqren/backport to manually backport by running
…lastic#85066) Relates elastic#79708 Fixes elastic/kibana#121966
…85066) (#85100) Relates #79708 Fixes elastic/kibana#121966
…85109) Backport of #85066
Relates #79708
Fixes elastic/kibana#121966