[DataFrame] System.FormatException: Input string was not in a correct format on seemingly correct CSV file. #5884
Comments
I had the same problem when my CSV contained dashes. I believe the error happens when DataFrame cannot infer the column types, for example when numbers are mixed with strings. Please provide the CSV, because that link is not opening.
I can't share the file itself, as sharing Kaggle datasets freely is not permitted AFAIK. But I fixed the link; I made a stupid mistake when inserting it. I'll check out your hint about mixed data types though. Maybe that's the problem anyway. :)
The CSV file may be fine. However, I suspected that there might be some type inferencing going on, as I know some data tools do type inferencing off of a sample of values. This is what I did:
So then I opened the CSV in Excel to look at the data patterns. I haven't looked at the code here. But is type inferencing in play? If so, how many rows are the types inferred from? Will this be documented somewhere?
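The failure mode being asked about can be sketched without the library. This is only an illustration of sample-based type inference, not the actual Microsoft.Data.Analysis implementation: the `column` data and the `GuessType` helper are hypothetical. The one grounded detail is that the stack trace in this issue goes through `Convert.ChangeType` and `ToSingle`, which is the call that throws here too.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical column: numeric in the first rows, non-numeric later on.
string[] column = { "1", "2", "3", "4", "n/a" };

// Only the first 4 values are sampled, so the guess is float.
Type guessed = GuessType(column, guessRows: 4);
Console.WriteLine($"Guessed type: {guessed.Name}");

try
{
    // Converting every value to the guessed type fails once "n/a" is reached.
    foreach (string value in column)
    {
        Convert.ChangeType(value, guessed, CultureInfo.InvariantCulture);
    }
}
catch (FormatException ex)
{
    Console.WriteLine($"{ex.GetType().Name}: {ex.Message}");
}

// Hypothetical helper: guess float if every sampled value parses, else string.
static Type GuessType(string[] values, int guessRows) =>
    values.Take(guessRows).All(v =>
        float.TryParse(v, NumberStyles.Float, CultureInfo.InvariantCulture, out _))
        ? typeof(float)
        : typeof(string);
```

With `guessRows: 5` the sample would include "n/a", the guess would fall back to string, and the conversion loop would not throw.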
@pbraunstorfer I did some digging in the docs, and it looks like you can tell it how many rows to do the type inferencing from. By default, it's 10 rows. It's the Unfortunately, the default of 10 rows just misses the
Thanks for the discussion. Closing this issue for now since it can be resolved by setting the number of rows to use for inference.
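For anyone landing here later, a minimal sketch of that workaround. It assumes the Microsoft.Data.Analysis package is referenced and that `test.csv` is in the working directory. The `guessRows` parameter name comes from the `LoadCsv` signature in the stack trace of this issue; the value 1000 is an arbitrary illustration, not a recommended setting.

```csharp
using System;
using Microsoft.Data.Analysis;

// guessRows sets how many leading rows are sampled to infer column types.
// 1000 is an arbitrary value chosen for illustration; it needs to be large
// enough to include the first row where a column stops looking numeric.
DataFrame df = DataFrame.LoadCsv("test.csv", guessRows: 1000);

Console.WriteLine($"Loaded {df.Rows.Count} rows and {df.Columns.Count} columns.");
```

The same signature also takes a `dataTypes` array, so passing the column types explicitly may be another way to sidestep inference, though that was not tried in this thread.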
So, when running Microsoft.Data.Analysis 0.18.0 (which is the newest AFAIK) I am trying to load the following dataset https://www.kaggle.com/c/titanic/data?select=test.csv via
DataFrame.LoadCsv("test.csv")
and I get
System.FormatException: Input string was not in a correct format.
   at System.Number.ThrowOverflowOrFormatException(ParsingStatus status, TypeCode type)
   at System.String.System.IConvertible.ToSingle(IFormatProvider provider)
   at System.Convert.ChangeType(Object value, Type conversionType, IFormatProvider provider)
   at Microsoft.Data.Analysis.DataFrame.Append(IEnumerable`1 row, Boolean inPlace)
   at Microsoft.Data.Analysis.DataFrame.ReadCsvLinesIntoDataFrame(WrappedStreamReaderOrStringReader wrappedReader, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn)
   at Microsoft.Data.Analysis.DataFrame.LoadCsv(String filename, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int32 numRows, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
   at Submission#58.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)
With the training data set this doesn't happen and the data loads just fine. For some reason, only the test set has an issue, and I can't figure out which part of the CSV is at fault. This dataset loads just fine with Python Pandas, for example, so I assume it's a bug in the .NET implementation.
It is my first time trying out dotnet for this kind of thing, so I haven't tried many other CSV files to see how common this problem is, but so far it seems to be only this one specific file.