Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[ML] Fixes for multi-line start patterns in text structure endpoint (#85066) (#85100)

This PR contains 3 fixes for the way multi-line start patterns are created in the text structure endpoint:

1. For delimited files the multi-line start pattern used to be based on the first field that was a timestamp, boolean or number. This PR adds the option of a low cardinality keyword field as an alternative (i.e. an enum field effectively). It means there is more chance of a field early in each record being chosen as the mechanism for determining whether a line is the first line of a record.

2. The multi-line start pattern for delimited files now only permits the delimiter character between fields, not within quoted fields. Previously it was possible for the multi-line start pattern to match continuation lines. Unfortunately this may mean we can no longer determine a multi-line start pattern for files whose only suitable field is to the right of fields that sometimes contain commas, and the only solution in this case will be to reorder the columns before importing the data. Hopefully this problem will be very rare.

3. For semi-structured text log files there is now a cap on the complexity of the multi-line start pattern. It has been observed that the patterns generated for slightly malformed CSV files could run for days against the malformed lines of those files - the classic problem of a regex that doesn't match but nearly does doing lots of backtracking. We now throw an error in this situation and suggest overriding the format to delimited.

Relates #79708
Fixes elastic/kibana#121966
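The trade-off described in fix 2 can be illustrated with a small sketch. This is not the Elasticsearch implementation: the function names, the timestamp regex, and the sample lines are all hypothetical, and the two patterns are simplified stand-ins for "a field may contain anything" versus "a field may not contain the delimiter".

```python
import re

# Hypothetical, simplified timestamp shape used only for this illustration.
TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
DELIM = ","


def loose_start_pattern(fields_before: int) -> str:
    """Each preceding field is 'anything, lazily', so it can swallow delimiters."""
    return "^" + (".*?" + re.escape(DELIM)) * fields_before + TIMESTAMP


def strict_start_pattern(fields_before: int) -> str:
    """Each preceding field (optionally quoted) may not contain the delimiter."""
    field = '"?[^' + re.escape(DELIM) + ']*?"?'
    return "^" + (field + re.escape(DELIM)) * fields_before + TIMESTAMP


loose = re.compile(loose_start_pattern(1))
strict = re.compile(strict_start_pattern(1))

# Assumed layout: id, timestamp, quoted message that may span several lines.
first_line = '1,2022-03-17 10:00:01,"connection failed'
continuation = 'retry 2, next attempt at,2022-03-17 10:00:05"'

# Both patterns recognise a genuine first line of a record.
print(bool(loose.match(first_line)), bool(strict.match(first_line)))

# Only the loose pattern is fooled by a continuation line containing commas.
print(bool(loose.match(continuation)), bool(strict.match(continuation)))

# The cost of strictness: a genuine first line whose earlier field contains
# the delimiter inside quotes is no longer recognised.
quoted_comma = '"Smith, John",2022-03-17 10:00:01,ok'
print(bool(loose.match(quoted_comma)), bool(strict.match(quoted_comma)))
```

The last case corresponds to the limitation the commit message mentions: when the only suitable field sits to the right of fields that sometimes contain the delimiter, the stricter style of pattern cannot be relied on, and reordering the columns is the suggested workaround.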
1 parent d0925dd
commit 488cf48
Showing 6 changed files with 378 additions and 16 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
pr: 85066 | ||
summary: Fixes for multi-line start patterns in text structure endpoint | ||
area: Machine Learning | ||
type: bug | ||
issues: [] |