Robust dataframe io #2938

chriss2401 · 2020-06-21T11:39:01Z

Hi @eerhardt and @pgovind,

This pull request adds the following features to Microsoft.Data.Analysis:

- Separated the M.D.A and M.D.A.IO projects (pulled from pgovind:DataframeIO).
- Removed duplicated M.D.A.IO-related code from M.D.A and moved the needed code to the IO project (i.e. the main high-level LoadCsv function).
- Added a CultureInfo object as an input parameter in the public API, with unit tests (related to #2926).
- Added better exception handling and message printing when type conversions fail, with unit tests (related to #2902).
  - If the type that fails is single/double, a corresponding NaN is assigned.

Please review at your own convenience.

Christos.

dnfadmin · 2020-06-21T11:39:15Z

All CLA requirements met.

chriss2401 · 2020-06-21T13:14:30Z

src/Microsoft.Data.Analysis.IO/DataFrameIO.cs

+                                string[] columnNames = null, Type[] dataTypes = null,
+                                int numRows = -1, int guessRows = 10,
+                                bool addIndexColumn = false, Encoding encoding = null,
+                                CultureInfo cultureInfo = null)


Related to #2926: this is where the CultureInfo parameter is exposed in the public API.

chriss2401 · 2020-06-21T13:15:41Z

src/Microsoft.Data.Analysis.IO/DataFrameIO.cs

-            if (!csvStream.CanSeek)
+            // if we have a comma separator and the cultureInfo has not been specified by the user, 
+            // we set it to Invariant Culture (since logically floats/doubles wouldn't be represented with commas in this case)
+            if(separator.Equals(',') && cultureInfo is null)


When the CultureInfo object is null and the file uses a comma as the separator, we set the CultureInfo to Invariant so that Double/Single values are parsed correctly.

So, for now I think this is ok since it's being more restrictive. I'm wondering if we can lose the separator.Equals(',') part of the if condition safely, but we can do that later too.
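The fallback discussed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `ResolveCulture` and `ParseDouble` are hypothetical helpers, and falling back to `CultureInfo.CurrentCulture` when the separator is not a comma is an assumption, since the thread does not say what happens in that case.

```csharp
using System;
using System.Globalization;

public static class CultureFallbackSketch
{
    // Hypothetical helper: picks the culture used to parse numeric fields.
    public static CultureInfo ResolveCulture(char separator, CultureInfo cultureInfo)
    {
        // A comma separator with no user-supplied culture implies that decimals
        // cannot be written with commas, so the invariant culture is used.
        if (separator == ',' && cultureInfo is null)
        {
            return CultureInfo.InvariantCulture;
        }

        // Assumption: otherwise keep the caller's culture, or the current one if none was given.
        return cultureInfo ?? CultureInfo.CurrentCulture;
    }

    // Hypothetical helper: parses one field with the resolved culture.
    public static double ParseDouble(string field, char separator, CultureInfo cultureInfo)
    {
        return double.Parse(field, NumberStyles.Float, ResolveCulture(separator, cultureInfo));
    }
}
```

With a semicolon-separated file and an explicit culture such as `new CultureInfo("de-DE")`, a field like `3,14` would be parsed with a decimal comma, while a comma-separated file with no culture always parses `3.14` with a decimal point.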

chriss2401 · 2020-06-21T13:17:45Z

src/Microsoft.Data.Analysis/DataFrame.cs

+                        }
+                        catch(FormatException)
+                        {
+                            Console.Write($"Value \"{value}\" cannot be converted to type {column.DataType} (Column name: {column.Name}). ");


Related to #2902: tell the user which value can't be converted to which type. If the column type is Double/Single, assign a NaN value instead; otherwise throw a FormatException as before (this also doesn't break the unit test TestAppendRow, which checks for FormatException).
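As a rough, standalone sketch of that behavior (not the code in DataFrame.cs): `ConvertOrNaN` is a hypothetical helper, and placing the message inside the rethrown exception, rather than only printing it, is an assumption made for illustration.

```csharp
using System;
using System.Globalization;

public static class ConversionFallbackSketch
{
    // Hypothetical helper: converts a raw string to the column's type,
    // substituting NaN for floating-point columns when parsing fails.
    public static object ConvertOrNaN(string value, Type targetType, string columnName)
    {
        try
        {
            return Convert.ChangeType(value, targetType, CultureInfo.InvariantCulture);
        }
        catch (FormatException)
        {
            string message =
                $"Value \"{value}\" cannot be converted to type {targetType} (Column name: {columnName}).";

            if (targetType == typeof(double))
            {
                Console.WriteLine(message + " Converting to Double.NaN instead.");
                return double.NaN;
            }

            if (targetType == typeof(float))
            {
                Console.WriteLine(message + " Converting to Single.NaN instead.");
                return float.NaN;
            }

            // Other types have no NaN equivalent, so the failure is surfaced to the caller.
            throw new FormatException(message);
        }
    }
}
```

Calling `ConvertOrNaN("abc", typeof(double), "Price")` returns `double.NaN`, while the same input with `typeof(int)` still throws a FormatException, which is what a test like TestAppendRow would expect.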

chriss2401 · 2020-06-21T13:18:44Z

tests/Microsoft.Data.Analysis.IO.Tests/DataFrame.IOTests.CultureInfo.cs

+
+namespace Microsoft.Data.Analysis.IO.Tests
+{
+    public partial class DataFrameIOTests


Added two simple unit tests in relation to #2902

chriss2401 · 2020-06-21T13:19:58Z

tests/Microsoft.Data.Analysis.Tests/DataFrameTests.cs

@@ -14,6 +14,94 @@ namespace Microsoft.Data.Analysis.Tests
 {
    public partial class DataFrameTests
    {
+        internal static void VerifyColumnTypes(DataFrame df, bool testArrowStringColumn = false)


This is duplicated in both M.D.A and M.D.A.IO, since both unit test projects use this function. Duplicating the function was the easiest workaround for now.

pgovind · 2020-07-28T19:38:22Z

src/Microsoft.Data.Analysis/DataFrame.cs

+                            if (column.DataType == typeof(double))
+                            {
+                                Console.WriteLine("Converting to Double.NaN instead.");
+                                value = Double.NaN;


I like this change, but do you think you could separate it out into its own PR? The reason is that I don't think we've tested any of our existing APIs with NaN values. So I'd like to get coverage by modifying DataFrameTests.cs:MakeDataFrameWithNumericColumns to generate a row with Double.NaN and Single.NaN. That should automatically give us coverage over all the APIs.
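To illustrate why NaN rows deserve dedicated coverage, here is a small sketch using plain arrays instead of the DataFrame API. All names are hypothetical, and which NaN policy the DataFrame APIs should follow is not decided in this thread.

```csharp
using System;
using System.Linq;

public static class NaNBehaviorSketch
{
    // Hypothetical stand-in for a numeric column that contains one NaN row.
    public static double[] MakeValuesWithNaN() => new[] { 1.0, double.NaN, 3.0 };

    // NaN propagates through arithmetic, so a plain sum becomes NaN.
    public static double SumAll(double[] values) => values.Sum();

    // Skipping NaN is one possible policy; it is shown here only for contrast.
    public static double SumIgnoringNaN(double[] values) =>
        values.Where(v => !double.IsNaN(v)).Sum();

    // NaN never compares equal with ==, so element-wise comparisons
    // need double.IsNaN or Equals to detect it.
    public static bool EqualsOperator(double a, double b) => a == b;
}
```

Here `SumAll` returns NaN for the sample values, `SumIgnoringNaN` returns 4, and `EqualsOperator(double.NaN, double.NaN)` is false, so aggregate and comparison APIs can change behavior once a single NaN row is present.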

pgovind · 2020-07-28T19:39:22Z

src/Microsoft.Data.Analysis/Microsoft.Data.Analysis.csproj

@@ -230,4 +230,5 @@
      <CustomToolNamespace>Microsoft.Data</CustomToolNamespace>
    </EmbeddedResource>
  </ItemGroup>
+


Maybe revert this change as you're modifying the PR?

…e code. Added robust functionality for handling type conversion errors and assigning NaN values. Added a CultureInfo in the public api for LoadCsv for handling floats/doubles with commas and dots. Added unit tests.

pgovind

Just some minor comments. I think this is looking pretty good. I rebased onto the latest upstream master (I hope you don't mind!), so you shouldn't have any merge conflicts.

Tagging @eerhardt to give this a glance as well.

eerhardt

I have a couple concerns with the proposal here.

Loading and writing a .csv file to/from DataFrame seems like a pretty intrinsic operation. I don't think we want a separate assembly/NuGet package for this. Customers should be able to just get the Microsoft.Data.Analysis package and load a CSV.
I'm a bit concerned about taking a dependency on Microsoft.VisualBasic.FileIO.TextFieldParser.
1. It isn't available in netstandard2.0.
2. There are a few public blogs saying it isn't very performant, and it looks like it hasn't been updated or enhanced in a few years.

I think we should fold this functionality back into M.D.A and investigate what we can do about removing the dependency on TextFieldParser. (One option could be to port the code to C# and modify it as appropriate.)

chriss2401 · 2020-08-04T07:55:50Z

Hi @pgovind and @eerhardt,

If there is consensus on keeping only one project (M.D.A), then I can roll back the changes and open a PR that only closes #2926, and then another one for #2902.

Another option would be a high-level public function in M.D.A that loads a CSV, with all the helper functions living in the IO-related project. That keeps the two separated, and the user only needs to call M.D.A.

pgovind · 2020-08-10T17:51:53Z

@chriss2401 I think rolling back the changes and fixing only #2926 and #2902 is a good idea! I really like the fixes for those 2 issues. Once my .NET 5 work is done, I'll work on a full-fledged LoadCsv method that should fix all our parsing concerns with LoadCsv.

chriss2401 · 2020-08-15T08:32:39Z

> @chriss2401 I think rolling back the changes and fixing only #2926 and #2902 is a good idea! I really like the fixes for those 2 issues. Once my .NET 5 work is done, I'll work on a full-fledged LoadCsv method that should fix all our parsing concerns with LoadCsv.

Sounds good! I opened a PR for #2902 and will open one for #2926 afterwards. Closing this PR :)

chriss2401 mentioned this pull request Jun 21, 2020

New M.D.A.IO proj and improved LoadCsv #2927

Closed

pgovind reviewed Jul 28, 2020

chriss2401 added 2 commits July 28, 2020 12:46

Adding M.D.A.IO test project to .sln file and fixing TestAppendRow unit test. Restoring TargetFrameworks that were changed by accident. (d4ac04a)

pgovind mentioned this pull request Jul 28, 2020

Writing dataframe to txt doesn't work properly #2917

Closed

eerhardt suggested changes Jul 30, 2020

pgovind mentioned this pull request Aug 10, 2020

Add WriteCsv plus unit tests. #2947

Merged

chriss2401 closed this Aug 15, 2020
