
[ML] File Data Visualiser - Option to skip bad rows/events/documents #31065

Open
markwalkom opened this issue Feb 14, 2019 · 2 comments

@markwalkom
Contributor

Describe the feature: Add a tick box that lets you discard/ignore any rows/documents/events that can't be processed, and then just report on them at the end of the process.

Describe a specific use case for the feature:
Sometimes files have summary rows at the bottom, holding things like sums of the columns above. These make sense in a spreadsheet, but they are easy to miss when exporting to CSV, and it should be easy to ignore rows like this.
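For illustration, a made-up export like the one below shows the shape of the problem: the final "Total" row doesn't match the structure of the data rows above it (all column names and values here are hypothetical):

```csv
date,region,orders,revenue
2019-02-01,EMEA,120,3400.50
2019-02-01,APAC,98,2750.00
2019-02-02,EMEA,131,3590.25
Total,,349,9740.75
```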

@markwalkom markwalkom added the :ml label Feb 14, 2019
@elasticmachine
Contributor

Pinging @elastic/ml-ui

@peteharverson peteharverson changed the title Data Visualiser - Option to skip bad rows/events/documents [ML] Data Visualiser - Option to skip bad rows/events/documents Feb 14, 2019
@peteharverson peteharverson added the Feature:Anomaly Detection ML anomaly detection label Feb 14, 2019
@jgowdyelastic jgowdyelastic changed the title [ML] Data Visualiser - Option to skip bad rows/events/documents [ML] File Data Visualiser - Option to skip bad rows/events/documents Feb 14, 2019
@droberts195
Contributor

There's a chicken-and-egg problem here, as the summary rows at the bottom of a CSV file created from a spreadsheet are likely to prevent the file being detected as CSV by the file structure finder in the first place. #29821 is a specific example of what happens when different rows in a CSV file contain different numbers of fields. So summary rows that don't fit the pattern of the rest of the file can cause the structure to be so badly misinterpreted that it's no longer simply a case of skipping a few rows - the UI wouldn't even know to parse the file as CSV.

One thing we could do is change the file structure finder endpoint in ES so that, if the structure is overridden to a delimited format, it tolerates different numbers of columns per row. I still think this is a minefield if the high level structure is not overridden, but if the user tells us the file is CSV we could do our best to parse it as such, even though some columns might end up with many nulls. We could also add the ability to detect a timestamp field when only a certain percentage of rows in a delimited file, say 95%, have the format in the same column. The ingest pipeline would then error on the rows that don't (i.e. the summary rows), but it could be made to discard these and carry on.
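One way the "discard and carry on" part could look is a per-processor `on_failure` handler that drops the offending document. This is only a rough sketch of that pattern, not what the file structure finder currently generates; the pipeline id, field name and timestamp formats below are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical pipeline: parse the timestamp column and silently drop any
# row (e.g. a summary row) whose value doesn't match the expected formats.
es.ingest.put_pipeline(
    id="csv-import-skip-bad-rows",  # hypothetical pipeline id
    body={
        "description": "Parse timestamps; drop rows that fail to parse",
        "processors": [
            {
                "date": {
                    "field": "timestamp",  # hypothetical column name
                    "formats": ["yyyy-MM-dd HH:mm:ss", "ISO8601"],
                    "on_failure": [
                        {"drop": {}}  # discard the bad row and carry on
                    ],
                }
            }
        ],
    },
)
```

Note that dropping alone wouldn't satisfy the "report on them at the end" part of the request; the UI would still need some way to count or surface the discarded rows.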

Most of the work here would be on the backend file structure finder endpoint. I opened elastic/elasticsearch#38890 for this.
