[ML] Fixes for multi-line start patterns in text structure endpoint #85066

Merged

Commits on Mar 17, 2022

  1. [ML] Fixes for multi-line start patterns in text structure endpoint

    This PR contains 3 fixes for the way multi-line start patterns are
    created in the text structure endpoint:
    
    1. For delimited files, the multi-line start pattern used to be based
       on the first field that was a timestamp, boolean or number. This
       PR adds a low cardinality keyword field (effectively an enum
       field) as an alternative. This increases the chance that a field
       early in each record is chosen as the mechanism for determining
       whether a line is the first line of a record.
    2. The multi-line start pattern for delimited files now only permits
       the delimiter character between fields, not within quoted fields.
       Previously it was possible for the multi-line start pattern to
       match continuation lines. Unfortunately this may mean we can no
       longer determine a multi-line start pattern for files whose only
       suitable field is to the right of fields that sometimes contain
       commas, and the only solution in this case will be to reorder the
       columns before importing the data. Hopefully this problem will be
       very rare.
    3. For semi-structured text log files, there is now a cap on the
       complexity of the multi-line start pattern. It has been observed
       that the patterns generated for slightly malformed CSV files could
       run for days against the malformed lines of those files. This is
       the classic problem of a regex that nearly matches, but ultimately
       does not, doing lots of backtracking. We now throw an error in
       this situation and suggest overriding the format to delimited.
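
Fixes 1 and 2 above can be illustrated with a minimal Python sketch. This is not the code in this PR: the names `field_pattern` and `build_start_pattern`, the `max_distinct` threshold, and the cardinality heuristic are all hypothetical, and timestamp detection is omitted for brevity. It only shows the idea of choosing an early, well-typed field and allowing the delimiter between fields but never inside them.

```python
import re

NUMBER = r"[+-]?\d+(?:\.\d+)?"
BOOLEAN = r"(?:true|false)"


def field_pattern(values, max_distinct=5):
    """Regex for a column's sampled values, or None if the column is unsuitable."""
    if not values:
        return None
    if all(re.fullmatch(NUMBER, v) for v in values):
        return NUMBER
    if all(re.fullmatch(BOOLEAN, v) for v in values):
        return BOOLEAN
    distinct = sorted(set(values))
    # Assumed heuristic for a "low cardinality keyword" (enum-like) column:
    # few distinct single-word values, each seen at least twice on average.
    is_enum = (
        len(distinct) <= max_distinct
        and 2 * len(distinct) <= len(values)
        and all(re.fullmatch(r"\w+", v) for v in distinct)
    )
    if is_enum:
        return "(?:" + "|".join(re.escape(v) for v in distinct) + ")"
    return None


def build_start_pattern(columns, delimiter=",", quote='"'):
    """Compile a regex for the first line of a record, or return None."""
    d, q = re.escape(delimiter), re.escape(quote)
    # A field to the left of the chosen one is either a complete quoted field
    # on a single line, or unquoted text with no delimiter and no quote.
    skip_field = f"(?:{q}[^{q}\\n]*{q}|[^{q}{d}\\n]*){d}"
    prefix = "^"
    for values in columns:
        pattern = field_pattern(values)
        if pattern is not None:
            return re.compile(prefix + f"{q}?{pattern}{q}?(?:{d}|$)")
        prefix += skip_field
    return None


# Sampled column values: host, level, message (hypothetical data).
sample_columns = [
    ["web-01", "web-02", "db-01", "web-03"],
    ["INFO", "WARN", "INFO", "INFO"],
    ["disk check started\ndisk full, retrying,WARN", "ok", "up", "done"],
]
start_pattern = build_start_pattern(sample_columns)

# A looser pattern that lets ".*?" run across delimiters, for comparison.
loose_pattern = re.compile(r'^.*?,"?(?:INFO|WARN)"?(?:,|$)')
```

In this sketch the `host` column is unsuitable, so the enum-like `level` column is chosen. The strict pattern accepts `web-01,INFO,"disk check started` as a record start and rejects the continuation line `disk full, retrying,WARN"`, whereas the looser `.*?` pattern would accept both.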
    
    Relates elastic#79708
    droberts195 committed Mar 17, 2022
    f88d70f
  2. a54de77
  3. 1c342f2

Commits on Mar 18, 2022

  1. b899fbd
  2. Address review comments

    droberts195 committed Mar 18, 2022
    0b028ea