
[ML] Fixes for multi-line start patterns in text structure endpoint #85066

Merged

Conversation

@droberts195 (Contributor) commented Mar 17, 2022

This PR contains 3 fixes for the way multi-line start patterns are
created in the text structure endpoint:

  1. For delimited files the multi-line start pattern used to be based
     on the first field that was a timestamp, boolean or number. This
     PR adds the option of using a low cardinality keyword field as an
     alternative (effectively an enum field). This increases the chance
     that a field early in each record is chosen as the mechanism for
     determining whether a line is the first line of a record.
  2. The multi-line start pattern for delimited files now only permits
     the delimiter character between fields, not within quoted fields.
     Previously the multi-line start pattern could match continuation
     lines. Unfortunately this may mean we can no longer determine a
     multi-line start pattern for files whose only suitable field lies
     to the right of fields that sometimes contain commas; in that case
     the only solution is to reorder the columns before importing the
     data. Hopefully this problem will be very rare.
  3. For semi-structured text log files there is now a cap on the
     complexity of the multi-line start pattern. The patterns generated
     for slightly malformed CSV files have been observed to run for days
     against the malformed lines of those files: the classic problem of
     a regex that nearly matches, but does not, performing excessive
     backtracking. We now throw an error in this situation and suggest
     overriding the format to delimited.
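
As an illustration of point 2, the distinction can be sketched with a small regex (a hypothetical pattern, not the one the endpoint actually generates, assuming comma-delimited, double-quoted CSV): a field is either a quoted run, which may contain commas, or an unquoted run containing neither the delimiter nor a quote, so a comma can only match between fields.

```java
import java.util.regex.Pattern;

public class StartPatternSketch {
    // Hypothetical field pattern: a double-quoted run (with "" as an escaped
    // quote) or an unquoted run containing neither commas nor quotes.
    static final String FIELD = "(?:\"(?:[^\"]|\"\")*\"|[^,\"]*)";

    // Hypothetical multi-line start pattern: a first field, a comma that can
    // only be the real delimiter, then an ISO-style date in the second field.
    static final Pattern LINE_START =
        Pattern.compile(FIELD + ",\\d{4}-\\d{2}-\\d{2}.*");

    public static boolean startsRecord(String line) {
        return LINE_START.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Quoted first field containing a comma, then a timestamp field.
        System.out.println(startsRecord("\"a, b\",2022-03-17T10:00:00Z,msg")); // true
        // A continuation line whose embedded comma is not a delimiter.
        System.out.println(startsRecord("stack trace line, 1234 more"));       // false
    }
}
```

With a naive field pattern that allowed commas anywhere, the second line would also match, which is exactly how continuation lines were previously misidentified as record starts.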

Relates #79708
Fixes elastic/kibana#121966
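
The cap described in point 3 could be guarded along these lines (a minimal sketch with invented names and an invented threshold; the actual metric and cap used by the endpoint are not specified in this description): count the variable-width wildcard groups in a candidate pattern, since each one multiplies the backtracking a near-miss line can trigger, and reject patterns beyond the cap.

```java
public class PatternComplexityGuard {
    // Hypothetical threshold, not the value used by Elasticsearch.
    static final int MAX_WILDCARD_GROUPS = 5;

    // Rough proxy for backtracking risk: the number of ".*?" lazy wildcards,
    // since each adds a dimension of choice when a line nearly matches.
    static int wildcardGroups(String pattern) {
        int count = 0;
        int from = 0;
        while ((from = pattern.indexOf(".*?", from)) >= 0) {
            ++count;
            from += 3;
        }
        return count;
    }

    static void checkComplexity(String pattern) {
        if (wildcardGroups(pattern) > MAX_WILDCARD_GROUPS) {
            throw new IllegalArgumentException(
                "multi-line start pattern too complex; consider overriding the format to delimited"
            );
        }
    }

    public static void main(String[] args) {
        checkComplexity("\\[.*?\\] \\d{4}-\\d{2}-\\d{2}.*"); // one wildcard group: accepted
    }
}
```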

@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Mar 17, 2022
@elasticmachine (Collaborator):

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine (Collaborator):

Hi @droberts195, I've created a changelog YAML for you.

@droberts195 droberts195 added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Mar 17, 2022
    static int longestRun(List<?> sequence) {
        int maxSoFar = 0;
        for (int index = 0; index < sequence.size(); ++index) {
Contributor:
I think it would be a bit easier to follow the control flow if there was only one loop through the sequence:

    static int longestRun(List<?> sequence) {
        if (sequence.size() <= 1) {
            return sequence.size();
        }
        int maxSoFar = 0;
        int thisCount = 1;
        for (int index = 1; index < sequence.size(); ++index) {
            if (sequence.get(index).equals(sequence.get(index - 1))) {
                ++thisCount;
            } else {
                maxSoFar = Math.max(maxSoFar, thisCount);
                thisCount = 1;
            }
        }
        maxSoFar = Math.max(maxSoFar, thisCount);
        return maxSoFar;
    }
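
For reference, the suggested single-pass version behaves as expected on a few hand-checked inputs (a standalone usage sketch; the method body is the reviewer's suggestion above):

```java
import java.util.List;

public class LongestRunDemo {
    // Single-pass version: count the current run and fold it into the
    // maximum whenever the run ends, then once more after the loop.
    static int longestRun(List<?> sequence) {
        if (sequence.size() <= 1) {
            return sequence.size();
        }
        int maxSoFar = 0;
        int thisCount = 1;
        for (int index = 1; index < sequence.size(); ++index) {
            if (sequence.get(index).equals(sequence.get(index - 1))) {
                ++thisCount;
            } else {
                maxSoFar = Math.max(maxSoFar, thisCount);
                thisCount = 1;
            }
        }
        return Math.max(maxSoFar, thisCount);
    }

    public static void main(String[] args) {
        System.out.println(longestRun(List.of(1, 1, 2, 2, 2, 3))); // 3
        System.out.println(longestRun(List.of("a")));              // 1
        System.out.println(longestRun(List.of()));                 // 0
    }
}
```

The final Math.max outside the loop covers the case where the longest run is the suffix, which is one of the edge cases discussed below.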

    public void testLongestRun() {

        List<Integer> sequence = new ArrayList<>();
        for (int before = randomIntBetween(0, 41); before > 0; --before) {
Contributor:
It is kind of covered by the random, but you could add explicit test cases for:

  • the longest sequence is the prefix
  • the longest sequence is the suffix

@droberts195 (Contributor, author):
I made it more likely that these cases get tested by making each true 50% of the time. I tested all the scenarios locally and I think one test with higher probability of testing the edge cases should be sufficient to prevent regressions. There shouldn't be much need to change this code again.

@przemekwitek (Contributor) left a review:

LGTM

@droberts195 droberts195 merged commit 9c6659d into elastic:master Mar 18, 2022
@droberts195 droberts195 deleted the multi-line-start-pattern-improvements branch March 18, 2022 13:46
@elasticsearchmachine (Collaborator):

💔 Backport failed

Branch  Result
7.17    Commit could not be cherry-picked due to conflicts
8.1

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 85066`.

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Mar 18, 2022
…lastic#85066)

elasticsearchmachine pushed a commit that referenced this pull request Mar 18, 2022
…85066) (#85100)

elasticsearchmachine pushed a commit that referenced this pull request Mar 18, 2022
…85109)

Backport of #85066
Labels
auto-backport-and-merge (Automatically create backport pull requests and merge when ready), >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v7.17.2, v8.1.2, v8.2.0
Development

Successfully merging this pull request may close these issues.

[ML/Data Visualizer] Importing CSV file causes high CPU utilization on Chrome
4 participants