-
Notifications
You must be signed in to change notification settings - Fork 347
Conversation
string[] columnNames = null, Type[] dataTypes = null, | ||
int numRows = -1, int guessRows = 10, | ||
bool addIndexColumn = false, Encoding encoding = null, | ||
CultureInfo cultureInfo = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to #2926 , here is the CultureInfo
exposed to the public API.
if (!csvStream.CanSeek) | ||
// if we have a comma separator and the cultureInfo has not been specified by the user, | ||
// we set it to Invariant Culture (since logically floats/doubles wouldn't be represented with commas in this case) | ||
if(separator.Equals(',') && cultureInfo is null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case where the CultureInfo
object is null and the file has a comma based separator, we change the CultureInfo to Invariant in order to handle Double/Single values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, for now I think this is ok since it's being more restrictive. I'm wondering if we can lose the separator.Equals(',')
part of the if condition safely, but we can do that later too.
} | ||
catch(FormatException) | ||
{ | ||
Console.Write($"Value \"{value}\" cannot be converted to type {column.DataType} (Column name: {column.Name}). "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to #2902 , give a message to the user what value can't be converted to which type. If Double/Single, assign a NaN
value instead. Otherwise throw a FormatException
as previously ( this also doesn't break unit test TestAppendRow
which checks for FormatException
).
|
||
namespace Microsoft.Data.Analysis.IO.Tests | ||
{ | ||
public partial class DataFrameIOTests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added two simple unit tests in relation to #2902
@@ -14,6 +14,94 @@ namespace Microsoft.Data.Analysis.Tests | |||
{ | |||
public partial class DataFrameTests | |||
{ | |||
internal static void VerifyColumnTypes(DataFrame df, bool testArrowStringColumn = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicated both in M.D.A and M.D.A.IO, since both unit test projects use this function. I found duplicating the function was the easiest work around for now.
if (column.DataType == typeof(double)) | ||
{ | ||
Console.WriteLine("Converting to Double.NaN instead."); | ||
value = Double.NaN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change, but do you think you could separate it out into it's own PR? Reason is that I don't think we've tested any of our existing APIs with NaN
values. So I'd like to get coverage by modifying DataFrameTests.cs:MakeDataFrameWithNumericColumns
to generate a row with Double.NaN
and Single.NaN
. That should automatically give us coverage over all the APIs
@@ -230,4 +230,5 @@ | |||
<CustomToolNamespace>Microsoft.Data</CustomToolNamespace> | |||
</EmbeddedResource> | |||
</ItemGroup> | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe revert this change as your modifying the PR?
…e code. Added robust functionality for handling type conversion errors and assigning NaN values. Added a CultureInfo in the public api for LoadCsv for handling floats/doubles with commas and dots. Added unit tests.
…it test. Restoring TargetFrameworks that were changed by accident.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor comments. I think this is looking pretty good. I rebased on to the latest upstream master(I hope you don't mind!) so you shouldn't have any merge conflicts.
Tagging @eerhardt to give this a glance as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple concerns with the proposal here.
- Loading and writing a .csv file to/from DataFrame seems like a pretty intrinsic operation. I don't think we want a separate assembly/NuGet package for this. Customers should be able to just get the
Microsoft.Data.Analysis
package and load a CSV. - I'm a bit concerned about taking a dependency on
Microsoft.VisualBasic.FileIO.TextFieldParser
.- It isn't available in
netstandard2.0
. - There are a few public blogs saying it isn't very performant, and it looks like it hasn't been updated or enhanced in a few years.
- It isn't available in
I think we should fold this functionality back M.D.A and investigate what we can do about removing the dependency on TextFieldParser
. (One option could be to port the code to C# and modify it as appropriate.)
If there is consensus on keeping only one project (M.D.A), then I can roll back the changes and open a PR that only closes #2926, and then another one for #2902. Another option could be that there is a high level public function in M.D.A which loads a csv and then all the helper functions are in the I.O. related one. That way there are two separations and the user can just call M.D.A. |
@chriss2401 I think rolling back the changes and fixing only #2926 and #2902 is a good idea! I really like the fixes for those 2 issues. Once my .NET 5 work is done, I'll work on a full fledged |
Sounds good! I opened a PR for #2902 , will afterwards open one for #2926. Closing this PR :) |
Hi @eerhardt and @pgovind ,
The following pull request adds the following features to
Microsoft.Data.Analysis
:pgovind:DataframeIO
)LoadCsv
function).CultureInfo
object as a input parameter in the public API with unit tests - related to #2926NaN
is assigned.Please review at your own convenience.
Christos.