ARROW-13129: [C#] Fix TableFromRecordBatches #10562
I'm not sure I understand the benefits here. Can you explain what was broken? Also, can you add a unit test? |
arrow/csharp/src/Apache.Arrow/Table.cs Lines 45 to 46 in e9fa304
The column array actually gets cleared in the new column as well since it's not a copy. |
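A minimal sketch of the aliasing problem described above, assuming a consumer that stores the list it receives rather than copying it. `Holder` and `AliasingDemo` are hypothetical names used only for illustration; they are not part of Apache.Arrow.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a type that keeps the list it is given (no copy).
public sealed class Holder
{
    private readonly IList<int> _items;
    public Holder(IList<int> items) => _items = items; // stores the reference only
    public int Count => _items.Count;
}

public static class AliasingDemo
{
    // Reuses one scratch list for every Holder and clears it after each use.
    public static int[] SharedListCounts()
    {
        var scratch = new List<int>();
        var holders = new List<Holder>();
        for (int i = 0; i < 2; i++)
        {
            scratch.Add(i);
            holders.Add(new Holder(scratch));
            scratch.Clear(); // also empties the list every Holder points at
        }
        return new[] { holders[0].Count, holders[1].Count };
    }

    // Allocates a fresh list per Holder, so nothing is shared.
    public static int[] FreshListCounts()
    {
        var holders = new List<Holder>();
        for (int i = 0; i < 2; i++)
        {
            var items = new List<int> { i };
            holders.Add(new Holder(items));
        }
        return new[] { holders[0].Count, holders[1].Count };
    }

    public static void Main()
    {
        int[] shared = SharedListCounts();
        int[] fresh = FreshListCounts();
        Console.WriteLine($"shared: {shared[0]},{shared[1]}"); // shared: 0,0
        Console.WriteLine($"fresh: {fresh[0]},{fresh[1]}");    // fresh: 1,1
    }
}
```

With the shared list, every holder ends up empty because they all reference the same cleared list; with a fresh list per holder, each keeps its own element.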
Alternatively, we could change

    public static Table TableFromRecordBatches(Schema schema, IList<RecordBatch> recordBatches)
    {
        int nBatches = recordBatches.Count;
        int nColumns = schema.Fields.Count;
        List<Column> columns = new List<Column>(nColumns);
        for (int icol = 0; icol < nColumns; icol++)
        {
            List<Array> columnArrays = new List<Array>(nBatches);
            for (int jj = 0; jj < nBatches; jj++)
            {
                columnArrays.Add(recordBatches[jj].Column(icol) as Array);
            }
            columns.Add(new Arrow.Column(schema.GetFieldByIndex(icol), columnArrays));
        }

And undo the

The reason I'd prefer that approach is because we have a bunch of other APIs that assume they take ownership of the array/list that is passed to them. The reason for that is to limit the amount of allocations. If someone builds up a single list, passes it into the

Thoughts? |
Honestly I find the semantics quite misleading, like the author of TableFromRecordBatches :) IMO these copies are harmless in terms of performance. It also enforces more coherence with the default constructor. Python does it too: arrow/python/pyarrow/table.pxi Lines 1591 to 1609 in 3ce67eb
|
Unfortunately, this isn't true. Allocating unnecessary objects puts pressure on the garbage collector. Check out this article about some of the performance improvements that were made by reducing allocations. |
!! The example with the GC called every 15ms looked more like a bug than anything else... Note though that in that benchmark there are probably as many allocations in pure CLR as in the old managed version. If there's no way to tell the user what's going on through the type system, that'll lead to bugs just like this one. I still value API "unsurpriseness" more than performance. It's up to you. I can - reluctantly :) - edit the code in TableFromRecordBatches |
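The trade-off being argued here can be sketched as two construction paths: one that copies its input and one that takes ownership of it. This is only an illustration under stated assumptions. `Chunks`, `CopyFrom`, and `TakeOwnership` are hypothetical names, not Arrow's API, and the sketch does not claim to be the pattern either commenter had in mind.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration only; not Arrow's ChunkedArray.
public sealed class Chunks
{
    private readonly List<int> _items;

    private Chunks(List<int> items) => _items = items;

    // Copying path: snapshots the input, so later changes to it are not observed.
    public static Chunks CopyFrom(IEnumerable<int> items) => new Chunks(items.ToList());

    // Ownership-taking path: no extra list is allocated, but the caller must not reuse the list.
    public static Chunks TakeOwnership(List<int> items) => new Chunks(items);

    public int Count => _items.Count;
}

public static class OwnershipDemo
{
    public static (int copied, int owned) CountsAfterCallerClears()
    {
        var source = new List<int> { 1, 2, 3 };
        Chunks copied = Chunks.CopyFrom(source);
        Chunks owned = Chunks.TakeOwnership(source);
        source.Clear(); // the caller breaks the ownership contract
        return (copied.Count, owned.Count);
    }

    public static void Main()
    {
        var (copied, owned) = CountsAfterCallerClears();
        Console.WriteLine($"copied={copied}, owned={owned}"); // copied=3, owned=0
    }
}
```

The copying path costs one extra list allocation per call and is immune to later mutation by the caller; the ownership path avoids the allocation but relies on a contract the type system does not express.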
As a benchmark, I just ran a slightly modified version of theirs using ToList (I also tried ToArray and a no-op). It certainly does not reveal any pathological issue on netcoreapp3.1:

    using System;
    using System.Linq;
    using System.Diagnostics;
    using System.Threading;

    class Program
    {
        public static void Main()
        {
            // Background thread that allocates a new list copy in a tight loop.
            new Thread(() =>
            {
                var a = new int[20].ToList();
                while (true) a = a.ToList();
            }) { IsBackground = true }.Start();

            // Main thread: time 10 forced collections, each followed by a 15 ms sleep.
            var sw = new Stopwatch();
            while (true)
            {
                sw.Restart();
                for (int i = 0; i < 10; i++)
                {
                    GC.Collect();
                    Thread.Sleep(15);
                }
                Console.WriteLine(sw.Elapsed.TotalSeconds);
            }
        }
    }

|
Take a look at all the changes we've been making in the dotnet/runtime libraries that reduce allocations: https://github.com/dotnet/runtime/pulls?q=is%3Apr+is%3Aclosed+allocation+. Even the article I linked says:
If you allocate less objects, the GC has less work to do.
I agree. Looking at the rest of the APIs that copy, they all take arrow/csharp/src/Apache.Arrow/Arrays/ArrayData.cs Lines 34 to 44 in 8e43f23
arrow/csharp/src/Apache.Arrow/RecordBatch.cs Lines 63 to 70 in 8e43f23
arrow/csharp/src/Apache.Arrow/Schema.cs Lines 39 to 48 in 8e43f23
But then we also have So how about following that same pattern here?
thoughts? |
I agree. I think the Is that ok ? |
I guess I don't see how the Flight assembly is relevant here. It doesn't use Table, Column, or ChunkedArray. But it is already IVT:
Note - what I am proposing above only applies to Table, Column, and ChunkedArray for this PR. I'm not saying we should go change a bunch of other APIs. |
@eerhardt @0x0L What is the status on this PR? Does either of you want to move it forward? |
Sorry, not really involved with C# anymore. I'll try to push the changes we discussed shortly |
In the end, I only fixed the behavior. |
LGTM. Thanks for the contribution.