
[ML] File Data Visualiser - Option to skip bad rows/events/documents #31065

Open
markwalkom opened this issue Feb 14, 2019 · 2 comments

@markwalkom
Contributor

Describe the feature: Add a tick box that lets you discard/ignore any rows/documents/events that can't be processed, and then just report on them at the end of the process.

Describe a specific use case for the feature:
Sometimes files have summary rows at the bottom, holding things like sums of the columns above. These make sense in a spreadsheet, but they are easy to miss when exporting to CSV, and it should be easy to ignore rows like this.
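For illustration, a made-up export like the one below shows the shape of the problem: the final "Total" row doesn't match the structure of the data rows above it (all column names and values here are hypothetical):

```csv
date,region,orders,revenue
2019-02-01,EMEA,120,3400.50
2019-02-01,APAC,98,2750.00
2019-02-02,EMEA,131,3590.25
Total,,349,9740.75
```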

@markwalkom markwalkom added the :ml label Feb 14, 2019
@elasticmachine
Contributor

Pinging @elastic/ml-ui

@peteharverson peteharverson changed the title Data Visualiser - Option to skip bad rows/events/documents [ML] Data Visualiser - Option to skip bad rows/events/documents Feb 14, 2019
@peteharverson peteharverson added the Feature:Anomaly Detection ML anomaly detection label Feb 14, 2019
@jgowdyelastic jgowdyelastic changed the title [ML] Data Visualiser - Option to skip bad rows/events/documents [ML] File Data Visualiser - Option to skip bad rows/events/documents Feb 14, 2019
@droberts195
Contributor

There's a chicken-and-egg problem here, as the summary rows at the bottom of a CSV file created from a spreadsheet are likely to prevent the file being detected as CSV by the file structure finder in the first place. #29821 is a specific example of what happens when different rows in a CSV file contain different numbers of fields. So summary rows that don't fit the pattern of the rest of the file can cause the structure to be so badly misinterpreted that it's no longer simply a case of skipping a few rows - the UI wouldn't even know to parse the file as CSV.

One thing we could do is change the file structure finder endpoint in ES so that, if the structure is overridden to a delimited format, it tolerates different numbers of columns per row. I still think this is a minefield if the high level structure is not overridden, but if the user tells us the file is CSV we could do our best to parse it as such, even though some columns might end up with many nulls. We could also add the ability to detect a timestamp field when only a certain percentage of rows in a delimited file, say 95%, have the format in the same column. The ingest pipeline would then error on the rows that don't (i.e. the summary rows), but it could be made to discard these and carry on.
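One way the "discard and carry on" part could look is a per-processor `on_failure` handler that drops the offending document. This is only a rough sketch of that pattern, not what the file structure finder currently generates; the pipeline id, field name and timestamp formats below are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical pipeline: parse the timestamp column and silently drop any
# row (e.g. a summary row) whose value doesn't match the expected formats.
es.ingest.put_pipeline(
    id="csv-import-skip-bad-rows",  # hypothetical pipeline id
    body={
        "description": "Parse timestamps; drop rows that fail to parse",
        "processors": [
            {
                "date": {
                    "field": "timestamp",  # hypothetical column name
                    "formats": ["yyyy-MM-dd HH:mm:ss", "ISO8601"],
                    "on_failure": [
                        {"drop": {}}  # discard the bad row and carry on
                    ],
                }
            }
        ],
    },
)
```

Note that dropping alone wouldn't satisfy the "report on them at the end" part of the request; the UI would still need some way to count or surface the discarded rows.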

Most of the work here would be on the backend file structure finder endpoint. I opened elastic/elasticsearch#38890 for this.
