[ML] Improve multiline_start_pattern for CSV in find_file_structure #51737

droberts195 · 2020-01-31T12:03:28Z

The work to switch file upload over to treating delimited files
like semi-structured text and using the ingest pipeline for CSV
parsing makes the multi-line start pattern used for delimited
files much more critical than it used to be.

Previously it was always based on the time field, even if that
was towards the end of the columns, and no multi-line pattern
was created if no timestamp was detected.

This change improves the multi-line start pattern by:

1. Never creating a multi-line pattern if the sample contained
   only single line records. This improves the import
   efficiency in a common case.
2. Choosing the leftmost field that has a well-defined pattern,
   whether that be the time field or a boolean/numeric field.
   This reduces the risk of a field with newlines occurring
   earlier, and also means the algorithm doesn't automatically
   fail for data without a timestamp.
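
The two rules above can be sketched roughly as follows. This is an illustrative Python sketch only, not the code in this PR: the function name `multiline_start_pattern`, the `TYPE_PATTERNS` table, the type names, and the regex fragments are all assumptions made for the example, and quoting and escaping rules are handled far more loosely here than a real CSV parser would need.

```python
import re
from typing import Optional, Sequence

# Hypothetical regex fragments for column types whose values have a
# well-defined shape. These are assumptions for illustration only.
TYPE_PATTERNS = {
    "boolean": r"(?:true|false)",
    "long": r"[+-]?\d+",
    "double": r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?",
    "iso8601": r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}",
}


def multiline_start_pattern(
    column_types: Sequence[str],
    sample_records: Sequence[str],
    delimiter: str = ",",
) -> Optional[str]:
    """Return a regex that marks the first line of a record, or None."""
    # Rule 1: if every sampled record fits on one line, no pattern is needed.
    if not any("\n" in record for record in sample_records):
        return None

    delim = re.escape(delimiter)
    # Rule 2: anchor on the leftmost column with a well-defined pattern.
    for index, column_type in enumerate(column_types):
        fragment = TYPE_PATTERNS.get(column_type)
        if fragment is None:
            continue
        # Lazily skip the columns that precede the chosen one. The further
        # right the chosen column is, the more likely that one of these
        # skipped columns contains a newline and defeats the pattern.
        prefix = f"(?:.*?{delim})" * index
        return f'^{prefix}"?{fragment}"?(?:{delim}|$)'

    # No column has a recognisable shape, so no pattern can be built.
    return None


if __name__ == "__main__":
    records = [
        "1,hello,2020-01-31 12:03:28",
        '2,"multi\nline",2020-01-31 12:03:29',
    ]
    pattern = multiline_start_pattern(["long", "keyword", "iso8601"], records)
    for line in "\n".join(records).split("\n"):
        print(bool(re.match(pattern, line)), repr(line))
```

In this sketch the numeric first column is chosen even though a timestamp exists further right, so the continuation line `line",2020-01-31 12:03:29` is not mistaken for the start of a new record.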

elasticmachine · 2020-01-31T12:03:31Z

Pinging @elastic/ml-core (:ml)

droberts195 added >enhancement :ml Machine learning v8.0.0 v7.7.0 labels Jan 31, 2020

droberts195 added 2 commits February 4, 2020 08:49

Merge branch 'master' into better_csv_multiline_start_pattern

839127a

Improve comments

c06822d

benwtrent self-requested a review February 4, 2020 12:08

benwtrent approved these changes Feb 4, 2020

droberts195 merged commit 9521e4f into elastic:master Feb 4, 2020

droberts195 deleted the better_csv_multiline_start_pattern branch February 4, 2020 12:36

codebrain mentioned this pull request Apr 1, 2020

7.7.0 meta ticket (Part 2) elastic/elasticsearch-net#4533

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

droberts195 mentioned this pull request Jan 10, 2023

[ML] Multi-line start patterns for CSV from the text structure endpoint are fragile #92798

Open
