[DataFrame] System.FormatException: Input string was not in a correct format on seemingly correct CSV file. #5884

Closed
ghost opened this issue Jul 23, 2021 · 5 comments
Labels
Microsoft.Data.Analysis All DataFrame related issues and PRs

ghost commented Jul 23, 2021

When running Microsoft.Data.Analysis 0.18.0 (the newest version, AFAIK), I am trying to load the following dataset, https://www.kaggle.com/c/titanic/data?select=test.csv, via DataFrame.LoadCsv("test.csv")

and I get

System.FormatException: Input string was not in a correct format.
   at System.Number.ThrowOverflowOrFormatException(ParsingStatus status, TypeCode type)
   at System.String.System.IConvertible.ToSingle(IFormatProvider provider)
   at System.Convert.ChangeType(Object value, Type conversionType, IFormatProvider provider)
   at Microsoft.Data.Analysis.DataFrame.Append(IEnumerable`1 row, Boolean inPlace)
   at Microsoft.Data.Analysis.DataFrame.ReadCsvLinesIntoDataFrame(WrappedStreamReaderOrStringReader wrappedReader, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn)
   at Microsoft.Data.Analysis.DataFrame.LoadCsv(String filename, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int32 numRows, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
   at Submission#58.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)

With the training data set this doesn't happen and the data loads just fine. For some reason, only the test set has an issue, and I can't figure out which part of the CSV is at fault. This dataset loads just fine with Python Pandas, for example, so I assume it's a bug in the .NET implementation.

It is my first time trying out dotnet for this kind of thing, so I haven't tried many other CSV files to see how common this problem is, but so far it seems to affect only this one specific file.

@michaelgsharp michaelgsharp added the Microsoft.Data.Analysis All DataFrame related issues and PRs label Jul 28, 2021
tombohub commented

I had the same problem when a CSV contained dashes. I believe the error happens when DataFrame cannot infer the column types, for example when numbers are mixed with strings.

Please provide the CSV, because that link is not opening.

ghost commented Aug 3, 2021

> I had the same problem when a CSV contained dashes. I believe the error happens when DataFrame cannot infer the column types, for example when numbers are mixed with strings.
>
> Please provide the CSV, because that link is not opening.

I can't share the file itself, as sharing Kaggle datasets freely is not permitted AFAIK. But I fixed the link; I made a stupid mistake when inserting it.

I'll check out your hint about mixed data types, though. Maybe that's the problem anyway. :)

sadukie commented Dec 9, 2021

The CSV file may be fine. However, I suspected that type inference might be involved, as I know some data tools infer column types from a sample of values. This is what I did:

  1. Download the CSV initially linked from Kaggle.
  2. Create a .NET Interactive notebook and import Microsoft.Data.Analysis 0.19.0.
  3. Call LoadCsv on the test.csv file as noted above.
  4. Get the error listed above.

So then I opened the CSV in Excel to look at the data patterns. System.String.System.IConvertible.ToSingle in the error message caught my attention, so I checked the columns that start with integer values and might switch to something else further down. The Ticket column stood out. To test my inference suspicion, I replaced the first ticket with a value that mixes letters, digits, and punctuation: A.330911. Sure enough, the CSV then loaded fine.

I haven't looked at the code here. But is type inference in play? If so, how many rows are the types inferred from? Will this be documented somewhere?
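
For anyone following along, here is a minimal sketch of how sample-based type inference can produce exactly this kind of FormatException. It is not the Microsoft.Data.Analysis implementation: InferenceSketch, GuessType, and Load are hypothetical names, the ticket values are made up, and the only assumption taken from the stack trace is that values end up going through Convert.ChangeType to a float.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class InferenceSketch
{
    // Guess a column type from the first `guessRows` values only:
    // float if every sampled value parses as a float, otherwise string.
    public static Type GuessType(IReadOnlyList<string> values, int guessRows)
    {
        bool allFloat = values
            .Take(guessRows)
            .All(v => float.TryParse(v, NumberStyles.Float, CultureInfo.InvariantCulture, out _));
        return allFloat ? typeof(float) : typeof(string);
    }

    // Convert every value (not just the sampled ones) to the guessed type.
    // This is where a value outside the sample can fail to convert.
    public static List<object> Load(IReadOnlyList<string> values, int guessRows)
    {
        Type guessed = GuessType(values, guessRows);
        return values
            .Select(v => Convert.ChangeType(v, guessed, CultureInfo.InvariantCulture))
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical ticket-like column: numeric at first, alphanumeric later.
        var tickets = new[] { "330911", "363272", "240276", "A.330911" };

        Console.WriteLine(GuessType(tickets, guessRows: 3)); // sample misses the 4th value
        Console.WriteLine(GuessType(tickets, guessRows: 4)); // sample includes it

        try
        {
            Load(tickets, guessRows: 3);
        }
        catch (FormatException ex)
        {
            Console.WriteLine($"FormatException: {ex.Message}");
        }
    }
}
```

In this sketch, a sample that only sees numeric values commits the column to float, and the first alphanumeric value after the sample fails to convert. Widening the sample so it includes that value makes the column a string column instead, which matches what I saw after editing the first ticket.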

sadukie commented Dec 9, 2021

@pbraunstorfer I did some digging in the docs, and it looks like you can tell it how many rows to use for type inference. This is the guessRows parameter on LoadCsv, and it defaults to 10 rows.

Unfortunately, the default of 10 rows just misses the first Ticket value that would have made the column a string column rather than a numeric one.

DataFrame.LoadCsv("test.csv", separator: ',', header: true, guessRows: 11); would work for that test.csv file.
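
If you want to find a suitable guessRows value for another file, a rough helper like the one below can locate the first value in a column that does not parse as a number. This is only a sketch: GuessRowsHelper and FirstNonNumericRow are hypothetical names, the values are made up, and treating blanks as missing is an assumption. The library may count sampled rows differently (for example, whether the header counts), so treat the result as a lower bound and add some margin.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class GuessRowsHelper
{
    // Returns the 1-based position of the first non-empty value that does not
    // parse as a float, or -1 if every non-empty value parses.
    public static int FirstNonNumericRow(IReadOnlyList<string> values)
    {
        for (int i = 0; i < values.Count; i++)
        {
            string v = values[i];
            if (string.IsNullOrEmpty(v))
            {
                continue; // assumption: blanks are missing values, not text
            }
            if (!float.TryParse(v, NumberStyles.Float, CultureInfo.InvariantCulture, out _))
            {
                return i + 1;
            }
        }
        return -1;
    }

    public static void Main()
    {
        // Hypothetical ticket-like column values, already extracted from a CSV.
        var tickets = new[] { "330911", "363272", "", "A.330911", "2657" };
        Console.WriteLine(FirstNonNumericRow(tickets)); // prints 4
    }
}
```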

luisquintanilla (Contributor) commented

Thanks for the discussion. Closing this issue for now since it can be resolved by setting the number of rows to use for inference.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 31, 2022