VectorUdf with DataFrame + Arrow #277

pgovind · 2019-10-01T23:07:43Z

Minor changes to the UDF API to pass in and return corefxlab DataFrames
Accompanying unit test changes

Putting it up here to get initial thoughts. FxDataFrame's Arrow support means true zero-copy exchange of data.

@eerhardt @rapoth @imback82 . Not able to add reviewers for some reason :/

Can be reviewed now!

src/csharp/Microsoft.Spark.Worker.UnitTest/CommandExecutorTests.cs

pgovind · 2019-10-01T23:16:22Z

src/csharp/Microsoft.Spark.Worker.UnitTest/CommandExecutorTests.cs

@@ -601,10 +594,10 @@ Int64Array ConvertInt64s(Int64Array int64s)
                    Assert.Equal($"udf: {i}", stringArray.GetString(i));
                }

-                var longArray = (Int64Array)outputBatch.Column(1);
+                var doubleArray = (DoubleArray)outputBatch.Column(1);


The long -> double change comes from line 523 above: dataFrame.Column(1) + 100 returns a double column.

That seems wrong. PrimitiveColumn<long> + int should return a PrimitiveColumn<long>. Just like how:

long x = 0; var y = x + 5; y.GetType()

returns System.Int64.


Would you consider this something we HAVE to change? For example, long + float results in a float because the compiler knows how to make the conversions. The DataFrame doesn't attempt that kind of per-operation conversion, because each column's Memory has to be cloned in the conversion, which seems wasteful for a chain of operations. Instead, it converts once to a PrimitiveColumn<double> (or PrimitiveColumn<decimal>) when there is a type mismatch, so all subsequent operations can work without cloning.

The other alternative I'd considered was converting only when the underlying Memory had to change. PrimitiveColumn<long> + int, for example, would involve no conversion. I didn't think we should duplicate all of the compiler's conversion logic, however, so I defaulted to what we have now. Do you know of an easier or better way to do this?

I find it very surprising that adding, subtracting, or multiplying two integers would result in a floating-point number. IMO, that is something that should be changed.

I'm not all customers, though, so getting more thoughts and opinions here would be a good idea. Still, matching normal C# behavior seems like a reasonable default.
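The C# promotion rules this thread appeals to can be checked with a short sketch. It uses only scalar values and top-level statements (C# 9 or later), not the DataFrame column types, so it illustrates the compiler behavior being used as the reference point rather than anything the DataFrame library does.

```csharp
using System;

long x = 0;

// long + int: the int literal is widened to long, so the result stays integral.
var y = x + 5;

// long + float: the long operand is converted to float, so the result is float.
var z = x + 1.5f;

Console.WriteLine(y.GetType()); // System.Int64
Console.WriteLine(z.GetType()); // System.Single
```

Under these rules, an integer column plus an integer scalar would stay integral, which is the default being argued for above.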

eerhardt · 2019-10-02T17:11:51Z

Looks like a test is failing:

[xUnit.net 00:00:35.73]     Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestGroupedMapUdf [FAIL]
  X Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestGroupedMapUdf [1s 347ms]
  Error Message:
   Assert.Equal() Failure
Expected: 3
Actual:   5

examples/Microsoft.Spark.CSharp.Examples/Sql/VectorUdfs.cs

src/csharp/Microsoft.Spark/Microsoft.Spark.csproj

eerhardt · 2019-10-02T17:21:48Z

examples/Microsoft.Spark.CSharp.Examples/Sql/VectorUdfs.cs

@@ -8,6 +8,9 @@
 using Microsoft.Spark.Sql;
 using Microsoft.Spark.Sql.Types;
 using StructType = Microsoft.Spark.Sql.Types.StructType;
+using FxDataFrame = Microsoft.Data.DataFrame;


@imback82 @rapoth - do either of you have any thoughts/opinions on how to make this better? Using 2 types named DataFrame in Spark seems unfortunate. Is there a better term we can call the corefx DataFrame class?

src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs

src/csharp/Microsoft.Spark.E2ETest/IpcTests/Sql/DataFrameTests.cs

benchmark/csharp/Tpch/VectorFunctions.cs

src/csharp/Microsoft.Spark/Sql/ArrowArrayHelpers.cs

eng/Versions.props

pgovind · 2019-11-11T23:04:49Z

examples/Microsoft.Spark.CSharp.Examples/Sql/VectorUdfs.cs

 using Microsoft.Spark.Sql;
 using Microsoft.Spark.Sql.Types;
 using StructType = Microsoft.Spark.Sql.Types.StructType;
+using FxDataFrame = Microsoft.Data.Analysis.DataFrame;


@eerhardt question from before:

@imback82 @rapoth - do either of you have any thoughts/opinions on how to make this better? Using 2 types named DataFrame in Spark seems unfortunate. Is there a better term we can call the corefx DataFrame class?

We should think about this sooner rather than later. For example, Spark Notebooks was demoed at the last .NET Town Hall, and Dan asked whether the DataFrame he saw in the sample was a Spark DataFrame or the corefx DataFrame.

pgovind · 2019-11-11T23:21:38Z

Updated to address comments and to use the Microsoft.Data.Analysis package on NuGet.

eerhardt · 2019-11-12T18:23:27Z

benchmark/csharp/Tpch/VectorFunctions.cs

-            }
-
-            return builder.Build();
+            return (PrimitiveDataFrameColumn<double>)(price * (1 - discount) * (1 + tax));


Do we not have operator overloads for PrimitiveDataFrameColumn<T> that return the same type? I think that would make this code simpler so the user doesn't have to cast.

No, unfortunately. I'll file a bug.

examples/Microsoft.Spark.CSharp.Examples/README.md

examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/VectorDataFrameUdfs.cs

src/csharp/Microsoft.Spark.E2ETest/IpcTests/Sql/DataFrameTests.cs

imback82 · 2020-03-04T16:46:15Z

src/csharp/Microsoft.Spark.E2ETest/IpcTests/Sql/DataFrameTests.cs

+            Func<Column, Column> udf2 = ExperimentalDataFrameFunctions.VectorUdf<ArrowStringDataFrameColumn, ArrowStringDataFrameColumn>(
+                (strings) =>
+                {
+                    StringArray stringArray = (StringArray)ToArrowArray(


So do we need to first convert to an Arrow array and then back to a DataFrame column? Is there no way to go from ArrowStringDataFrameColumn to ArrowStringDataFrameColumn directly? cc: @eerhardt

I thought we had an Apply function that could be used here.
One of the biggest issues is going from the UTF-8 Arrow string format to the UTF-16 .NET string format, then back to the UTF-8 Arrow string format.
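A minimal sketch of the round trip described here, using only System.Text.Encoding. The byte arrays are hypothetical stand-ins for one value in an Arrow string buffer; no Arrow or DataFrame API is involved, and the uppercase transform is just a placeholder for whatever the UDF does.

```csharp
using System;
using System.Text;

// Stand-in for one value in a UTF-8 Arrow string buffer (an assumption for illustration).
byte[] utf8In = Encoding.UTF8.GetBytes("udf: 0");

// UTF-8 -> UTF-16: decoding allocates a new managed string.
string managed = Encoding.UTF8.GetString(utf8In);

// Placeholder for the user's string logic.
string transformed = managed.ToUpperInvariant();

// UTF-16 -> UTF-8: encoding allocates a new byte array for the output buffer.
byte[] utf8Out = Encoding.UTF8.GetBytes(transformed);

Console.WriteLine(Encoding.UTF8.GetString(utf8Out));
```

Each value pays for one decode and one encode, plus the intermediate string allocation, which is the overhead an API that stays in the Arrow representation would avoid.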


Yeah, it would be really nice if the user didn't have to go through the Arrow APIs and could live entirely in the FxDataFrame APIs.

We don't have it in 0.2.0 yet. However, we should release 0.3.0 soon, which will have this API. How about we accept this change for now, and I put up a new PR once 0.3.0 is out?

That sounds like a good plan to me.

Filed dotnet/corefxlab#2860

Sounds good to me as well.

src/csharp/Microsoft.Spark.E2ETest/IpcTests/Sql/DataFrameTests.cs

src/csharp/Microsoft.Spark.UnitTest/CommandSerDeTests.cs

src/csharp/Microsoft.Spark.Experimental/Sql/RelationalGroupedDatasetExtensions.cs

imback82 · 2020-03-04T17:04:29Z

src/csharp/Microsoft.Spark.UnitTest/TestUtils/ArrowTestUtils.cs

@@ -100,6 +108,16 @@ public static IArrowType GetArrowType<T>()
            throw new NotSupportedException($"Unknown type: {typeof(T)}");
        }

+        public static ArrowStringDataFrameColumn ToArrowStringDataFrameColumn(StringArray array)
+        {
+            return new ArrowStringDataFrameColumn("String",


Can we set the name to null? It's a bit confusing to give a name to a column built from an unnamed StringArray.

Actually, we can't. Arrow doesn't allow empty or null values for the column name. We run into this when we go back from DataFrame to Arrow record batches.

I see. So when we get ArrowStringDataFrameColumn in the UDF, is it already named? If so, the name is from the Arrow sent from JVM, right?

src/csharp/Microsoft.Spark.UnitTest/WorkerFunctionTests.cs

pgovind · 2020-03-04T21:38:41Z

Addressed comments and did a search/replace for styling changes (I think I got all of them).

imback82

I have some minor comments, but otherwise, LGTM. Thanks @pgovind!

src/csharp/Microsoft.Spark.UnitTest/UdfWrapperTests.cs

src/csharp/Microsoft.Spark.UnitTest/WorkerFunctionTests.cs

imback82 · 2020-03-05T01:50:38Z

src/csharp/Microsoft.Spark/Sql/ArrowArrayHelpers.cs

+            DataFrameColumn ret;
+            if (typeof(T) == typeof(PrimitiveDataFrameColumn<bool>))
+            {
+                ret = new PrimitiveDataFrameColumn<bool>("Empty");


Right, it doesn't change the behavior. I prefer returning here because then I don't have to scroll all the way down to see whether we are doing more to ret or not (and we can get rid of ret as well).
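The early-return shape being suggested might look like the following sketch. CreateEmptyColumn is a hypothetical name, and List<T> stands in for the DataFrame column types so the example runs without the Microsoft.Data.Analysis package; it is not the code from ArrowArrayHelpers.cs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: each branch returns immediately, so no 'ret' local is needed
// and a reader does not have to check whether the value is modified further down.
static object CreateEmptyColumn<T>()
{
    if (typeof(T) == typeof(List<bool>))
    {
        return new List<bool>();
    }

    if (typeof(T) == typeof(List<long>))
    {
        return new List<long>();
    }

    throw new NotSupportedException($"Unknown type: {typeof(T)}");
}

object column = CreateEmptyColumn<List<bool>>();
Console.WriteLine(column is List<bool>);
```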

src/csharp/Microsoft.Spark/Sql/ArrowUdfWrapper.cs

src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs

imback82 · 2020-03-05T02:04:27Z

Addressed comments and did a search/replace for styling changes (I think I got all of them).

I was probably reviewing while you were making changes. If you already fixed them, just resolve my comments. Thanks!

pgovind · 2020-03-06T06:21:15Z

I see. So when we get ArrowStringDataFrameColumn in the UDF, is it already named? If so, the name is from the Arrow sent from JVM, right?

Yup, it is already named with the Arrow data from the JVM. The Arrow schema defines a Field for a StringColumn that provides a Name.

imback82

one nit comment

src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs

imback82

LGTM. Thanks @pgovind!

pgovind · 2020-03-06T06:58:34Z

Thanks for the review @imback82. I know it was a time-consuming PR.

pgovind commented Oct 1, 2019

src/csharp/Microsoft.Spark.Worker.UnitTest/CommandExecutorTests.cs (outdated)

imback82 requested review from eerhardt, rapoth, imback82 and suhsteve October 1, 2019 23:25

eerhardt reviewed Oct 2, 2019

examples/Microsoft.Spark.CSharp.Examples/Sql/VectorUdfs.cs (outdated)

src/csharp/Microsoft.Spark/Microsoft.Spark.csproj (outdated)

src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs (outdated)

src/csharp/Microsoft.Spark.E2ETest/IpcTests/Sql/DataFrameTests.cs (outdated)

pgovind force-pushed the dataframe_branch branch from c400aaa to b8b845d Compare October 8, 2019 22:44

pgovind changed the title ~~VectorUdf with DataFrame + Arrow~~ WIP: VectorUdf with DataFrame + Arrow Oct 8, 2019

pgovind changed the title ~~WIP: VectorUdf with DataFrame + Arrow~~ VectorUdf with DataFrame + Arrow Oct 9, 2019

eerhardt reviewed Oct 11, 2019

benchmark/csharp/Tpch/VectorFunctions.cs (outdated)

pgovind force-pushed the dataframe_branch branch from 78b7157 to 16c29bc Compare October 11, 2019 20:06

pgovind commented Oct 11, 2019

examples/Microsoft.Spark.CSharp.Examples/Sql/VectorUdfs.cs (outdated)

suhsteve reviewed Oct 14, 2019

imback82 reviewed Oct 15, 2019

benchmark/csharp/Tpch/VectorFunctions.cs (outdated)

This was referenced Oct 15, 2019

WIP: Operator overloads dotnet/corefxlab#2752

Closed

More operator overloads dotnet/corefxlab#2753

Closed

pgovind commented Nov 11, 2019

eng/Versions.props (outdated)

pgovind force-pushed the dataframe_branch branch from 1e0e6d2 to 3249e0e Compare November 11, 2019 23:08

pgovind requested review from suhsteve, eerhardt and imback82 November 11, 2019 23:17

eerhardt reviewed Nov 12, 2019

Merge branch 'master' into dataframe_branch

3d32f58