Skip to content
This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

Robust dataframe io #2938

Closed
wants to merge 2 commits into from
Closed

Robust dataframe io #2938

wants to merge 2 commits into from

Conversation

chriss2401
Copy link

Hi @eerhardt and @pgovind ,

The following pull request adds the following features to Microsoft.Data.Analysis :

  • Separate M.D.A and M.D.A.I.O projects (pulled from pgovind:DataframeIO )
  • Removed M.D.A.I.O related duplicate code from M.D.A and moved needed code to IO project (i.e. main high level LoadCsv function).
  • Added a CultureInfo object as a input parameter in the public API with unit tests - related to #2926
  • Added better exception handling/message printing when type convertions fail with unit tests - related to #2902
    • If the type that fails is single/double, a corresponding NaN is assigned.

Please review at your own convenience.

Christos.

@dnfadmin
Copy link

dnfadmin commented Jun 21, 2020

CLA assistant check
All CLA requirements met.

string[] columnNames = null, Type[] dataTypes = null,
int numRows = -1, int guessRows = 10,
bool addIndexColumn = false, Encoding encoding = null,
CultureInfo cultureInfo = null)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to #2926 , here is the CultureInfo exposed to the public API.

if (!csvStream.CanSeek)
// if we have a comma separator and the cultureInfo has not been specified by the user,
// we set it to Invariant Culture (since logically floats/doubles wouldn't be represented with commas in this case)
if(separator.Equals(',') && cultureInfo is null)
Copy link
Author

@chriss2401 chriss2401 Jun 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where the CultureInfo object is null and the file has a comma based separator, we change the CultureInfo to Invariant in order to handle Double/Single values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, for now I think this is ok since it's being more restrictive. I'm wondering if we can lose the separator.Equals(',') part of the if condition safely, but we can do that later too.

}
catch(FormatException)
{
Console.Write($"Value \"{value}\" cannot be converted to type {column.DataType} (Column name: {column.Name}). ");
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to #2902 , give a message to the user what value can't be converted to which type. If Double/Single, assign a NaN value instead. Otherwise throw a FormatException as previously ( this also doesn't break unit test TestAppendRow which checks for FormatException ).


namespace Microsoft.Data.Analysis.IO.Tests
{
public partial class DataFrameIOTests
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added two simple unit tests in relation to #2902

@@ -14,6 +14,94 @@ namespace Microsoft.Data.Analysis.Tests
{
public partial class DataFrameTests
{
internal static void VerifyColumnTypes(DataFrame df, bool testArrowStringColumn = false)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is duplicated both in M.D.A and M.D.A.IO, since both unit test projects use this function. I found duplicating the function was the easiest work around for now.

if (column.DataType == typeof(double))
{
Console.WriteLine("Converting to Double.NaN instead.");
value = Double.NaN;
Copy link
Contributor

@pgovind pgovind Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change, but do you think you could separate it out into it's own PR? Reason is that I don't think we've tested any of our existing APIs with NaN values. So I'd like to get coverage by modifying DataFrameTests.cs:MakeDataFrameWithNumericColumns to generate a row with Double.NaN and Single.NaN. That should automatically give us coverage over all the APIs

@@ -230,4 +230,5 @@
<CustomToolNamespace>Microsoft.Data</CustomToolNamespace>
</EmbeddedResource>
</ItemGroup>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe revert this change as your modifying the PR?

…e code. Added robust functionality for handling type conversion errors and assigning NaN values. Added a CultureInfo in the public api for LoadCsv for handling floats/doubles with commas and dots. Added unit tests.
…it test. Restoring TargetFrameworks that were changed by accident.
Copy link
Contributor

@pgovind pgovind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor comments. I think this is looking pretty good. I rebased on to the latest upstream master(I hope you don't mind!) so you shouldn't have any merge conflicts.

Tagging @eerhardt to give this a glance as well.

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple concerns with the proposal here.

  1. Loading and writing a .csv file to/from DataFrame seems like a pretty intrinsic operation. I don't think we want a separate assembly/NuGet package for this. Customers should be able to just get the Microsoft.Data.Analysis package and load a CSV.
  2. I'm a bit concerned about taking a dependency on Microsoft.VisualBasic.FileIO.TextFieldParser.
    1. It isn't available in netstandard2.0.
    2. There are a few public blogs saying it isn't very performant, and it looks like it hasn't been updated or enhanced in a few years.

I think we should fold this functionality back M.D.A and investigate what we can do about removing the dependency on TextFieldParser. (One option could be to port the code to C# and modify it as appropriate.)

@chriss2401
Copy link
Author

Hi @pgovind and @eerhardt,

If there is consensus on keeping only one project (M.D.A), then I can roll back the changes and open a PR that only closes #2926, and then another one for #2902.

Another option could be that there is a high level public function in M.D.A which loads a csv and then all the helper functions are in the I.O. related one. That way there are two separations and the user can just call M.D.A.

@pgovind
Copy link
Contributor

pgovind commented Aug 10, 2020

@chriss2401 I think rolling back the changes and fixing only #2926 and #2902 is a good idea! I really like the fixes for those 2 issues. Once my .NET 5 work is done, I'll work on a full fledged LoadCsv method that should fix all our parsing concerns with LoadCsv

@chriss2401
Copy link
Author

@chriss2401 I think rolling back the changes and fixing only #2926 and #2902 is a good idea! I really like the fixes for those 2 issues. Once my .NET 5 work is done, I'll work on a full fledged LoadCsv method that should fix all our parsing concerns with LoadCsv

Sounds good! I opened a PR for #2902 , will afterwards open one for #2926. Closing this PR :)

@chriss2401 chriss2401 closed this Aug 15, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants