[AutoML v0.16.0] InferColumn doesn't work on tricky csv file #4460

Closed
LittleLittleCloud opened this issue Nov 8, 2019 · 4 comments

Comments

Contributor

@LittleLittleCloud LittleLittleCloud commented Nov 8, 2019

For some CSV files that contain double quotes inside a field, the InferColumns API does not work properly. This is probably because, when guessing the delimiter, AutoML also counts candidate delimiters that appear inside double-quoted fields, which should be ignored. (Or, when splitting lines, it treats a \n inside a double-quoted field as a line break.)

Steps to reproduce:
Download this dataset, then run the following:

using Microsoft.ML;
using Microsoft.ML.AutoML;

// TrainDataPath is the path to the downloaded CSV file.
MLContext mlContext = new MLContext();
var inputColumnInformation = new ColumnInformation();
inputColumnInformation.LabelColumnName = @"review_scores_rating";
var train = mlContext.Auto().InferColumns(TrainDataPath, inputColumnInformation);
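
To make the suspected failure mode concrete, here is a minimal sketch of quote-aware record splitting and delimiter counting. It is not AutoML's actual implementation; the `QuoteAwareCsv` class and its method names are hypothetical and only illustrate what "ignoring candidates inside double quotes" could look like.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, for illustration only (not part of ML.NET or AutoML).
public static class QuoteAwareCsv
{
    // Splits raw text into records, breaking on '\n' only outside double quotes.
    // An escaped quote ("") toggles the state twice, so it leaves it unchanged.
    public static List<string> SplitRecords(string text)
    {
        var records = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        foreach (char c in text)
        {
            if (c == '"') inQuotes = !inQuotes;
            if (c == '\n' && !inQuotes)
            {
                // End of a record; drop a trailing '\r' from CRLF line endings.
                records.Add(current.ToString().TrimEnd('\r'));
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        if (current.Length > 0) records.Add(current.ToString());
        return records;
    }

    // Counts occurrences of a candidate delimiter that fall outside double quotes.
    public static int CountDelimiters(string record, char delimiter)
    {
        int count = 0;
        bool inQuotes = false;
        foreach (char c in record)
        {
            if (c == '"') inQuotes = !inQuotes;
            else if (c == delimiter && !inQuotes) count++;
        }
        return count;
    }
}
```

For the input `"id,comment\n1,\"hello, world\nsecond line\"\n2,\"plain\"\n"`, `SplitRecords` returns three records, whereas a plain split on `\n` cuts the quoted field in two. `CountDelimiters` on the second record with `','` returns 1, because the comma inside the quotes is skipped.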

Member

@justinormont justinormont commented Nov 8, 2019

This is an issue with the TextLoader in ML.NET. It does not currently support escaped quotes in a quoted field.

The TextLoader has rather limited support for TSV/CSV files.

The issue is noted in the old repo: dotnet/machinelearning-automl#193 ("Infercolumn fails to parse new lines inside quoted text")


@vinodshanbhag :
Wikidetox fails in benchmarking because of this.
Tools like Excel are able to handle this.


@justinormont :
@CESARDELATORRE: what do you think about writing an example of converting a dataset from CSV/TSV to IDV?

The TextLoader cannot handle many CSV/TSV files. Using a more general reader and writing the result out as IDV would then allow the AutoML code to read the data in IDV format.

Basic example:

To be clear, this would be an example (docs/example code) of how a user could convert their data before it comes to AutoML. This would allow us to process files like the one this issue references.
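
A minimal sketch of such a conversion (not the example originally posted) might look like the following. It assumes the ML.NET 1.x API (`LoadFromEnumerable`, `SaveAsBinary`) and uses `Microsoft.VisualBasic.FileIO.TextFieldParser` as the "more general reader". The `ReviewRow` type, its two-column layout, and the `CsvToIdv` class are hypothetical; adjust them to the real dataset.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using Microsoft.ML;
using Microsoft.VisualBasic.FileIO;

// Hypothetical row type: one text column followed by one numeric label column.
public class ReviewRow
{
    public string Text { get; set; }
    public float Label { get; set; }
}

// Hypothetical helper, for illustration only.
public static class CsvToIdv
{
    // Reads rows with a general-purpose CSV parser that understands quoted
    // fields (embedded delimiters, escaped quotes, line breaks).
    public static IEnumerable<ReviewRow> ReadRows(string csvPath)
    {
        using (var parser = new TextFieldParser(csvPath))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            parser.ReadFields(); // skip the header row (assumed to be present)
            while (!parser.EndOfData)
            {
                string[] fields = parser.ReadFields();
                yield return new ReviewRow
                {
                    Text = fields[0],
                    Label = float.Parse(fields[1], CultureInfo.InvariantCulture),
                };
            }
        }
    }

    // Writes the parsed rows to an IDV (ML.NET binary) file.
    public static void ConvertToIdv(MLContext mlContext, string csvPath, string idvPath)
    {
        IDataView data = mlContext.Data.LoadFromEnumerable(ReadRows(csvPath));
        using (var stream = File.Create(idvPath))
        {
            mlContext.Data.SaveAsBinary(data, stream);
        }
    }
}
```

The resulting file can be read back with `mlContext.Data.LoadFromBinary(idvPath)`, and the returned `IDataView` handed to an AutoML experiment directly, so the TextLoader is never involved. It is worth checking the parser's handling of unusual rows (for example, blank lines inside quoted fields) against your own data.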


@CESARDELATORRE :
@justinormont - It's a good idea. However, this example should be a workaround for cases like that.
It might also be a good example because those "numeric value" issues occur in ML.NET 0.11 itself.

For instance, I was using another dataset yesterday (just migrating to ML.NET v0.11) where the column Label had values like:

  • "1"
  • "0"

ML.NET transformers were not able to convert those values to Boolean (every value became 0) or to Float (every value became NaN). See the issue I created in ML.NET:

#2824

Interestingly, those conversions to Boolean were working properly until ML.NET v0.10...

So, yes, this can be a good example. However, for AutoML, this example should be a workaround. In most cases, CSV/TSV files should be the default approach, since that is the most common type of dataset.


@justinormont :
@CESARDELATORRE - This is a side-effect of turning off quoting by default in ML.NET:
#2630

Non-issue:
I think AutoML will be unaffected by ML.NET changing its quoting defaults, as we sweep over both choices (and our heuristics default to using quoting when all else is equal). We should verify.

Issue:
AutoML will be affected by the TextLoader not supporting common TSV/CSV files. The proposed workaround above tells a user how to convert their TSV/CSV to IDV (bypassing the TextLoader).
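
On the quoting default discussed above: for files whose label values are wrapped in quotes (such as "1" and "0"), quoting can be enabled explicitly when loading. The sketch below assumes the ML.NET 1.x `LoadFromTextFile` signature; the `LabeledRow` type, its column layout, and the `QuotedLabelLoader` class are hypothetical.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type: a numeric label in column 0, free text in column 1.
public class LabeledRow
{
    [LoadColumn(0)] public float Label { get; set; }
    [LoadColumn(1)] public string Text { get; set; }
}

// Hypothetical helper, for illustration only.
public static class QuotedLabelLoader
{
    // Loads a comma-separated file with quoting turned on, so that a field
    // written as "1" is read as the value 1 rather than as literal quoted text.
    public static IDataView Load(MLContext mlContext, string path)
    {
        return mlContext.Data.LoadFromTextFile<LabeledRow>(
            path,
            separatorChar: ',',
            hasHeader: true,
            allowQuoting: true);
    }
}
```

This only addresses the quoting default. It does not help with files the TextLoader cannot parse at all, which is the case the IDV workaround targets.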

Contributor Author

@LittleLittleCloud LittleLittleCloud commented Nov 11, 2019

Good to know about it
Thanks!

Member

@harishsk harishsk commented Nov 13, 2019

@LittleLittleCloud I am assuming your question has been answered and am closing the issue. Please feel free to reopen the issue if you have more questions.

@harishsk harishsk closed this Nov 13, 2019

Member

@justinormont justinormont commented Nov 13, 2019

We may want to open an issue to improve the TextLoader to support more common TSV/CSV formats.
