Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Improve multiline_start_pattern for CSV in find_file_structure #51737

Merged

Conversation

droberts195
Copy link
Contributor

The work to switch file upload over to treating delimited files
like semi-structured text and using the ingest pipeline for CSV
parsing makes the multi-line start pattern used for delimited
files much more critical than it used to be.

Previously it was always based on the time field, even if that
was towards the end of the columns, and no multi-line pattern
was created if no timestamp was detected.

This change improves the multi-line start pattern by:

  1. Never creating a multi-line pattern if the sample contained
    only single line records. This improves the import
    efficiency in a common case.
  2. Choosing the leftmost field that has a well-defined pattern,
    whether that be the time field or a boolean/numeric field.
    This reduces the risk of a field with newlines occurring
    earlier, and also means the algorithm doesn't automatically
    fail for data without a timestamp.

The work to switch file upload over to treating delimited files
like semi-structured text and using the ingest pipeline for CSV
parsing makes the multi-line start pattern used for delimited
files much more critical than it used to be.

Previously it was always based on the time field, even if that
was towards the end of the columns, and no multi-line pattern
was created if no timestamp was detected.

This change improves the multi-line start pattern by:

1. Never creating a multi-line pattern if the sample contained
   only single line records.  This improves the import
   efficiency in a common case.
2. Choosing the leftmost field that has a well-defined pattern,
   whether that be the time field or a boolean/numeric field.
   This reduces the risk of a field with newlines occurring
   earlier, and also means the algorithm doesn't automatically
   fail for data without a timestamp.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent benwtrent self-requested a review February 4, 2020 12:08
@droberts195 droberts195 merged commit 9521e4f into elastic:master Feb 4, 2020
@droberts195 droberts195 deleted the better_csv_multiline_start_pattern branch February 4, 2020 12:36
droberts195 added a commit that referenced this pull request Feb 4, 2020
…51737)

The work to switch file upload over to treating delimited files
like semi-structured text and using the ingest pipeline for CSV
parsing makes the multi-line start pattern used for delimited
files much more critical than it used to be.

Previously it was always based on the time field, even if that
was towards the end of the columns, and no multi-line pattern
was created if no timestamp was detected.

This change improves the multi-line start pattern by:

1. Never creating a multi-line pattern if the sample contained
   only single line records.  This improves the import
   efficiency in a common case.
2. Choosing the leftmost field that has a well-defined pattern,
   whether that be the time field or a boolean/numeric field.
   This reduces the risk of a field with newlines occurring
   earlier, and also means the algorithm doesn't automatically
   fail for data without a timestamp.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants