Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13129: [C#] Fix TableFromRecordBatches #10562

Merged
merged 1 commit into from
Jun 1, 2022

Conversation

0x0L
Copy link
Contributor

@0x0L 0x0L commented Jun 19, 2021

No description provided.

@github-actions
Copy link

@eerhardt
Copy link
Contributor

I'm not sure I understand the benefits here. Can you explain what was broken?

Also, can you add a unit test?

@0x0L
Copy link
Contributor Author

0x0L commented Jun 30, 2021

columns.Add(new Arrow.Column(schema.GetFieldByIndex(icol), columnArrays));
columnArrays.Clear();

The column array actually gets cleared in the new column as well since it's not a copy.

@eerhardt
Copy link
Contributor

eerhardt commented Jul 6, 2021

Alternatively, we could change TableFromRecordBatches to be:

         public static Table TableFromRecordBatches(Schema schema, IList<RecordBatch> recordBatches)
        {
            int nBatches = recordBatches.Count;
            int nColumns = schema.Fields.Count;

            List<Column> columns = new List<Column>(nColumns);
            for (int icol = 0; icol < nColumns; icol++)
            {
                List<Array> columnArrays = new List<Array>(nBatches);
                for (int jj = 0; jj < nBatches; jj++)
                {
                    columnArrays.Add(recordBatches[jj].Column(icol) as Array);
                }
                columns.Add(new Arrow.Column(schema.GetFieldByIndex(icol), columnArrays));
            }

And undo the ToList() changes here.

The reason I'd prefer that approach is because we have a bunch of other APIs that assume they take ownership of the array/list that is passed to them. The reason for that is to limit the amount of allocations. If someone builds up a single list, passes it into the Table constructor, and then moves on, it is a waste for the Table constructor to make a copy of the list as well. It is confusing for some APIs to make a defensive copy, but others not to.

Thoughts?

@0x0L
Copy link
Contributor Author

0x0L commented Jul 6, 2021

Honestly I find the semantics quite misleading, like the author of TableFromRecordBatches :) IMO these copies are harmless in terms of performance. It also enforces more coherence with the default constructor.

Python does it too:

arrow/python/pyarrow/table.pxi

Lines 1591 to 1609 in 3ce67eb

cdef:
vector[shared_ptr[CChunkedArray]] columns
shared_ptr[CSchema] c_schema
int i, K = <int> len(arrays)
converted_arrays = _sanitize_arrays(arrays, names, schema, metadata,
&c_schema)
columns.reserve(K)
for item in converted_arrays:
if isinstance(item, Array):
columns.push_back(
make_shared[CChunkedArray](
(<Array> item).sp_array
)
)
elif isinstance(item, ChunkedArray):
columns.push_back((<ChunkedArray> item).sp_chunked_array)
else:

@eerhardt
Copy link
Contributor

eerhardt commented Jul 6, 2021

IMO these copies are harmless in terms of performance

Unfortunately, this isn't true. Allocating unnecessary objects puts pressure on the garbage collector. Check out this article about some of the performance improvements that were made by reducing allocations.

@0x0L
Copy link
Contributor Author

0x0L commented Jul 7, 2021

!! the example with GC called every 15ms looked more like a bug than something else... Note though that in that benchmark there are probably as many allocations in pure clr as in the old managed version.

If I there's no way to tell the user what's going on through the type system, that'll lead to bugs just like this one.
We may use ref parameters and null the ref we were given but that's still quite unsatisfactory.

I still value API "unsurpriseness" more than performance.

It's up to you. I can - reluctantly :) - edit the code in TableFromBatches

@0x0L
Copy link
Contributor Author

0x0L commented Jul 7, 2021

As a benchmark, I just ran a slightly modified version of their benchmark using ToList, also tried ToArray and no op. This certainly does not reveal any pathological issue in netcoreapp3.1

using System;
using System.Linq;
using System.Diagnostics;
using System.Threading;

class Program
{
    public static void Main()
    {
        new Thread(() =>
        {
            var a = new int[20].ToList();
            while (true) a = a.ToList();
        }) { IsBackground = true }.Start();

        var sw = new Stopwatch();
        while (true)
        {
            sw.Restart();
            for (int i = 0; i < 10; i++)
            {
                GC.Collect();
                Thread.Sleep(15);
            }
            Console.WriteLine(sw.Elapsed.TotalSeconds);
        }
    }
}

@eerhardt
Copy link
Contributor

eerhardt commented Jul 7, 2021

Take a look at all the changes we've been making in the dotnet/runtime libraries that reduce allocations: https://github.com/dotnet/runtime/pulls?q=is%3Apr+is%3Aclosed+allocation+. Even the article I linked says:

Lots of effort goes into reducing allocation, not because the act of allocating is itself particularly expensive, but because of the follow-on costs in cleaning up after those allocations via the garbage collector (GC).

If you allocate less objects, the GC has less work to do.

I still value API "unsurpriseness"

I agree. Looking at the rest of the APIs that copy, they all take IEnumerable<T> for the parameter, and then call ToList or ToArray. Here are some examples:

public ArrayData(
IArrowType dataType,
int length, int nullCount = 0, int offset = 0,
IEnumerable<ArrowBuffer> buffers = null, IEnumerable<ArrayData> children = null)
{
DataType = dataType ?? NullType.Default;
Length = length;
NullCount = nullCount;
Offset = offset;
Buffers = buffers?.ToArray();
Children = children?.ToArray();

public RecordBatch(Schema schema, IEnumerable<IArrowArray> data, int length)
{
if (length < 0)
{
throw new ArgumentOutOfRangeException(nameof(length));
}
_arrays = data?.ToList() ?? throw new ArgumentNullException(nameof(data));

public Schema(
IEnumerable<Field> fields,
IEnumerable<KeyValuePair<string, string>> metadata)
{
if (fields == null)
{
throw new ArgumentNullException(nameof(fields));
}
_fields = fields.ToList();

But then we also have internal APIs that take List<T> and the code takes ownership of that list without copying. This reduces allocations internally, while keeping the public API "unsurprising".

So how about following that same pattern here?

  • Change these APIs to take IEnumerable instead of IList
  • Internally when we create a Table, Column, ChunkedArray, we call the internal API that takes ownership of the list.

thoughts?

@0x0L
Copy link
Contributor Author

0x0L commented Jul 7, 2021

I agree. I think the internal access modifier should be used. I guess Flight assembly will probably have to be declared as friend.

Is that ok ?

@eerhardt
Copy link
Contributor

eerhardt commented Jul 7, 2021

I guess I don't see how the Flight assembly is relevant here. It doesn't use Table, Column, or ChunkedArray. But it is already IVT:

[assembly: InternalsVisibleTo("Apache.Arrow.Flight, PublicKey=0024000004800000940000000602000000240000525341310004000001000100e504183f6d470d6b67b6d19212be3e1f598f70c246a120194bc38130101d0c1853e4a0f2232cb12e37a7a90e707aabd38511dac4f25fcb0d691b2aa265900bf42de7f70468fc997551a40e1e0679b605aa2088a4a69e07c117e988f5b1738c570ee66997fba02485e7856a49eca5fd0706d09899b8312577cbb9034599fc92d4")]

Note - what I am proposing above only applies to Table, Column, and ChunkedArray for this PR. I'm not saying we should go change a bunch of other APIs.

@pitrou
Copy link
Member

pitrou commented May 4, 2022

@eerhardt @0x0L What is the status on this PR? Does either of you want to move it forward?
(also cc @wjones127 in case he's interested)

@0x0L
Copy link
Contributor Author

0x0L commented May 4, 2022

Sorry, not really involved with C# anymore. I'll try to push the changes we discussed shortly

@0x0L
Copy link
Contributor Author

0x0L commented May 4, 2022

In the end, I only fixed the behavior.
I'm not too sure where to stop with this ownership issue and it probably deserves more time to think about

Copy link
Contributor

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the contribution.

@eerhardt eerhardt merged commit 9b0afc3 into apache:master Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants