ARROW-10542: [C#][Flight] Add beginning on flight code for net core #8694

Closed
wants to merge 34 commits into from

@Ulimo Ulimo commented Nov 17, 2020

I closed the previous PR and opened this one because the original did not use InternalsVisibleTo and required a version bump of the Apache.Arrow project from netstandard1.3 to netstandard1.5.

This adds basic support for both a Flight server and a Flight client for .NET Core.
Sorry for the massive PR, but I wanted to have some basic functionality in place, with at least some tests that use the put and get flow, before creating a PR.
I hope this PR can help start a discussion on how the interfaces should look, and on whether this is an acceptable interface for Flight in .NET Core. It tries to mimic the original gRPC .NET Core interface as closely as possible, while mapping the classes from the network protocol to C# classes.

This implementation uses InternalsVisibleTo to avoid bumping the .NET Standard version from 1.3 to 1.5.
So the Apache.Arrow project still uses netstandard1.3.
All Flight code is in a separate project, Apache.Arrow.Flight.
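
As a rough illustration (not taken from the PR itself), granting a second assembly access to internal members usually takes a single assembly-level attribute in the project that owns the internals. The file name below is an assumption. If the owning assembly is strong-named, the attribute string would also need the friend assembly's public key.

```csharp
// AssemblyInfo.cs (hypothetical file name) inside the Apache.Arrow project.
using System.Runtime.CompilerServices;

// Lets code compiled into the Apache.Arrow.Flight assembly use types and
// members that Apache.Arrow declares as internal.
[assembly: InternalsVisibleTo("Apache.Arrow.Flight")]
```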

This also required changing the build version from 2.2 to 3.0.

The code does not include:

  • Handshake - the reason is that AspNetCore already contains features for authentication/authorization for gRPC. It can of course be added later.
  • DoExchange - I feel that more feedback/discussion is required before DoExchange can be implemented.

Note (this has since been fixed):
Sourcelink did not work when grpc.tools compiled the code in the build step, so I had to generate the gRPC code manually for Sourcelink to work. This means that there is a lot of extra auto-generated code in this PR.

cc: @eerhardt
It looks like you are the most active for C#. It would be great to get some feedback on whether this is similar to how you would implement it. If major changes are required, I am of course up for that as well.


Also made ToProtocol methods internal.

@eerhardt eerhardt left a comment

This is great work @Ulimo! Thanks so much for the contribution. This will greatly help the .NET Arrow ecosystem.

Here's some initial feedback on the structure. I will review the code deeper tomorrow as I get more time.

.github/workflows/csharp.yml (outdated, resolved)
.github/workflows/csharp.yml (outdated, resolved)
csharp/src/Apache.Arrow.Flight/Format/Flight.proto (outdated, resolved)
csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj (outdated, resolved)
csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj (outdated, resolved)
csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj (outdated, resolved)
Ulimo and others added 4 commits November 17, 2020 23:45
Co-authored-by: James Newton-King <james@newtonking.com>
This was required to remove dependency from Grpc package.

@eerhardt eerhardt left a comment

This is fantastic work, @Ulimo! I'm really excited to get this in, I've wanted a .NET Arrow Flight library for a while, but haven't had the chance to work on it.

I left a few suggestions I found while playing with the code. One last suggestion I'll make is to go through each class and make sure we only make public what actually needs to be public. Once we ship an API, it is hard to change or break it. So erring on the side of hiding something until it absolutely needs to be public is a good approach.

Client code is under client namespace.
Server code is under server namespace.
Added support for application metadata.
Removed exposure of properties not required for the user.

Ulimo commented Nov 22, 2020

@eerhardt first of all, thank you so much for your support and feedback, and sorry for this wall of text. Based on your last feedback that the API is hard to change in the future, I took a long sitting to go through all the classes and found some issues that I really needed to address.

This resulted in a small overhaul of the structure. I also went through the gRPC schema to make sure its information is actually exposed to the user, which resulted in some rewrites as well. Your feedback has been exactly what I needed to reach these conclusions on structure, so I am once again extremely thankful for it.

This of course increased the scope of the PR a bit again, but I hope it gives a more realistic look and feel for how the framework will behave, and shows whether the implemented design actually works.

Below is information on the work that has been done, along with some checklists to verify that all the required information is exposed, and why some classes are public.

Major changes

  • All exposed classes start with the prefix "Flight".
  • Generated code is now internal.
  • New project Apache.Arrow.Flight.AspNetCore. It is required because the generated code is internal. This project references grpc.aspnetcore.server, which targets netcoreapp3.1 or net5.0 only, and contains only the extensions that add Flight support to ASP.NET Core projects.
  • All Flight client methods now follow the gRPC API more closely and allow reading response headers for all calls.
  • Exposing application metadata required rewriting some of the logic in the client and server; read more in its section.
  • Moved client-specific classes under the .Client namespace.
  • Moved server-specific classes under the .Server namespace.
  • Internal classes are now under the .Internal namespace.
  • Only classes used by both client and server are now in the .Flight namespace.

Minor changes

  • Reduced multi-property exposure of ByteString values to a single ByteString property.
  • Exposed TotalBytes and TotalRecords in FlightInfo.
  • Moved the public abstract reader and writer classes to the internal namespace, since they can't be used by the user.
  • Added FlightServerRecordBatchStreamReader to expose FlightDescriptor only to the server, since the client won't use it.
  • Added FlightClientRecordBatchStreamReader, which just extends FlightRecordBatchStreamReader, which is now abstract.
  • StreamWriter made internal.
  • PutResult does not expose ArrowBuffer (which was taken from the Java project); instead it exposes ByteString like the other classes.
    This lets the user choose what type of metadata to return, and follows the other classes more closely.
  • Added a public static Empty on PutResult, which can be useful for implementations that do not use metadata and need to return an empty put result.
  • Added a new test case to get metadata.
  • Added a new test case to put metadata.
  • Added a new test case for get schema.
  • Added a new test case for DoAction.
  • Added a new test case for ListFlights.

Test coverage is now above 80% on the non-generated code.

Application metadata addition

To enable users to write and read application metadata, the class FlightRecordBatchStreamingCall was added. It
still follows the gRPC look and feel, but exposes FlightRecordBatchStreamReader as the response stream, from which the user can
get application metadata.

Application metadata can be read with each record batch. It is implemented as a list since, hypothetically, metadata can
be sent with the schema as well, because the messages sent currently follow this pattern:

Schema -> record batch -> record batch -> ... -> done

So the first message could contain metadata as well. When a user gets a new record batch the metadata is cleared, similar
to how the Java implementation does it.
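
The clearing behaviour described above can be sketched roughly as follows. This is a simplified, self-contained model and not the code from this PR: Message, MessageKind and MetadataTrackingReader are hypothetical names, and plain strings stand in for the schema, record batches and metadata bytes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one message on the wire.
public enum MessageKind { Schema, RecordBatch }

public sealed class Message
{
    public Message(MessageKind kind, string payload, string metadata = null)
    {
        Kind = kind;
        Payload = payload;
        Metadata = metadata;
    }

    public MessageKind Kind { get; }
    public string Payload { get; }
    public string Metadata { get; } // null when the message carries no metadata
}

// Hypothetical reader: collects metadata seen since the previous record batch.
public sealed class MetadataTrackingReader
{
    private readonly IEnumerator<Message> _messages;
    private readonly List<string> _metadata = new List<string>();

    public MetadataTrackingReader(IEnumerable<Message> messages)
    {
        _messages = messages.GetEnumerator();
    }

    public string Schema { get; private set; }
    public string Current { get; private set; }
    public IReadOnlyList<string> ApplicationMetadata => _metadata;

    public bool MoveNext()
    {
        // Metadata belonging to the previous record batch is dropped here.
        _metadata.Clear();
        while (_messages.MoveNext())
        {
            Message message = _messages.Current;
            if (message.Metadata != null) _metadata.Add(message.Metadata);
            if (message.Kind == MessageKind.Schema)
            {
                // The schema is not a batch; keep reading, but keep its metadata.
                Schema = message.Payload;
                continue;
            }
            Current = message.Payload;
            return true;
        }
        Current = null;
        return false;
    }
}

public static class Program
{
    public static void Main()
    {
        var reader = new MetadataTrackingReader(new[]
        {
            new Message(MessageKind.Schema, "schema", "schema-meta"),
            new Message(MessageKind.RecordBatch, "batch-1", "batch-1-meta"),
            new Message(MessageKind.RecordBatch, "batch-2"),
        });

        while (reader.MoveNext())
        {
            Console.WriteLine($"{reader.Current}: [{string.Join(", ", reader.ApplicationMetadata)}]");
        }
    }
}
```

In this model the first batch reports two metadata entries (one from the schema message, one of its own) and the second batch reports none, which is why a list rather than a single value is used.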

For writing, this is done by an additional method in FlightRecordBatchStreamWriter which looks like this:

public Task WriteAsync(RecordBatch message, ByteString applicationMetadata);

gRPC support

Here is a check of which gRPC methods and properties are exposed and can be read or used by a user of the framework, mostly to verify that everything that needs to be exposed is exposed at this time.

gRPC methods exposed or not exposed to the user

  • ListFlights - Exposed
  • GetFlightInfo - Exposed
  • GetSchema - Exposed
  • DoGet - Exposed
  • DoPut - Exposed
  • DoAction - Exposed
  • ListActions - Exposed
  • Handshake - Not exposed
  • DoExchange - Not exposed

gRPC properties exposed or not exposed to the user

HandshakeRequest

Handshake is not exposed at all

  • protocol_version - not exposed
  • payload - not exposed

HandshakeResponse

Handshake is not exposed at all

  • protocol_version - not exposed
  • payload - not exposed

BasicAuth

Basic auth is not exposed at all

  • username - not exposed
  • password - not exposed

ActionType

  • Type - exposed
  • Description - exposed

Criteria

  • Expression - exposed

Action

  • Type - exposed
  • Body - exposed

Result

  • Body - exposed

SchemaResult

Schema result is not mapped to its own type, but returns a schema directly

  • Schema - exposed

FlightDescriptor

  • Type - exposed
  • Path - exposed
  • Cmd - exposed

FlightInfo

  • Schema - exposed
  • FlightDescriptor - exposed
  • Endpoints - exposed
  • TotalRecords - exposed
  • TotalBytes - exposed

FlightEndpoint

  • Ticket - exposed
  • Locations - exposed

Location

  • Uri - exposed

Ticket

  • Ticket - exposed

FlightData

Flight data is not exposed as a class, but the data is exposed in getStream, startPut for client, and DoGet and DoPut for server.

  • FlightDescriptor
    • The client can only send the flight descriptor with flight data
    • The server can only read it from flight data
    • Desc: The descriptor of the data. This is only relevant when a client is starting a new DoPut stream.
  • Data header - used to read header of message, ex: schema, record batch message etc, not explicitly exposed and used internally only
  • Application Metadata
    • Client can write metadata through FlightClientRecordBatchStreamWriter
    • Client can read metadata through FlightRecordBatchStreamReader
    • Server can write metadata through FlightServerRecordBatchStreamWriter
    • Server can read metadata through FlightRecordBatchStreamReader
  • Data body - same as data header, exposure is done through RecordBatch

PutResult

  • Application metadata - exposed

Classes public or internal

Here is a list of classes, whether each is public or internal, and the motivation for making it public.

  • FlightRecordBatchStreamingCall - public, required since it exposes FlightRecordBatchStreamReader which is required to read metadata.
  • FlightRecordBatchStreamReader - public, required to allow read on schema, flight descriptor, application metadata
  • RecordBatcReaderImplementation - internal
  • FlightClientRecordBatchStreamWriter - public, implements CompleteAsync for clients, extends FlightRecordBatchStreamWriter which allows the user to write application metadata.
  • FlightDataStream - internal
  • FlightRecordBatchDuplexStreamingCall - public, exposes FlightClientRecordBatchStreamWriter which is required to write application metadata.
  • FlightRecordBatchStreamWriter - public abstract with private protected constructor, client and server implementations extend this one.
  • FlightServerRecordBatchStreamWriter - public, extends FlightRecordBatchStreamWriter which allows user to write application metadata.
  • SchemaWriter - internal
  • FlightAction - public, allows user to read/write type, and body in Actions.
  • FlightActionType - public, allows user to read/write action types with type and description from ListActions.
  • FlightClient - public, allows user to call all the different endpoints.
  • FlightCriteria - public, exposes Expression that servers can implement to filter result from ListFlights.
  • FlightDescriptor - public, exposes DescriptorType, Paths and Cmd for user.
  • FlightDescriptorType - public, contains descriptor types.
  • FlightEndpoint - public, exposes flight ticket and flight locations.
  • FlightInfo - public, exposes flight descriptor, schema, total bytes, total records and endpoints.
  • FlightLocation - public, exposes uri to the user.
  • FlightMessageSerializer - internal
  • FlightPutResult - public, received when doing doPut, exposes application metadata.
  • FlightResult - public, contains body from DoAction calls.
  • FlightServerImplementation - internal, forced to be internal because the generated code is internal.
  • FlightTicket - public, exposes the bytestring from the ticket to the user.
  • IFlightServer - public, interface to implement a flight server.
  • StreamReader - internal
  • StreamWriter - internal
  • FlightIEndpointRouteBuilderExtensions - public, allows user to map the flight endpoint in asp net core.
  • FlightIGrpcServerBuilderExtensions - public, allows the user to add their implementation of IFlightServer in a nicer way.


Ulimo commented Nov 23, 2020

@eerhardt to be honest, after I went through and really thought about the calls, I had one final thing I was not too happy with:
the record batch stream for get and put. I did not change it, since compared to moving files around this felt a tiny bit bigger.

The API itself feels quite easy to use, but I think I overspecified it for RecordBatch, and there will be trouble if we add dictionary support etc.
It also does not work well with DoExchange, which would require another implementation to work well.

The solution I can think of would reduce ease of use though (I think?). It would be:

  • New classes: FlightData (abstract), FlightDataSchema, FlightDataRecordBatch.
  • The stream returns FlightData.
  • Internal code still checks Flight "correctness" (schema as the first message, etc.).

FlightData structure:

public abstract class FlightData {
  public FlightDescriptor FlightDescriptor { get; }
  public FlightDataType Type { get; } // Which type of Arrow object this message contains
  public ByteString ApplicationMetadata { get; } // App metadata can change from a list in the stream reader to a single value here
  public abstract T GetValue<T>();
  public abstract void Accept(FlightDataVisitor visitor); // Maybe a visitor pattern?
}

public class FlightDataRecordBatch : FlightData {
  public RecordBatch RecordBatch { get; }
  public override T GetValue<T>() {
    // Only RecordBatch can be requested from this subtype
    if (typeof(T) != typeof(RecordBatch)) { throw new Exception(...); }
    return (T)(object)RecordBatch;
  }

  public override void Accept(...) {...}
}

enum FlightDataType {
  Schema = 1,
  RecordBatch = 2,
  ...
}

This stream solution could also have a FlightDataVisitor for the different types, and I think it allows
easier addition of new Arrow objects in the future. But it of course might be a bit harder to use.
It also follows the gRPC schema much more closely, which makes DoExchange simple to implement.
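
To make the visitor idea more concrete, here is a minimal self-contained sketch. It reuses the proposed names FlightData, FlightDataSchema, FlightDataRecordBatch and FlightDataVisitor, but everything else is an assumption for illustration: the payloads are reduced to a field-name array and a row count, and RowCountingVisitor is a hypothetical consumer.

```csharp
using System;
using System.Collections.Generic;

// Base visitor with no-op defaults, so a new message type can be added
// later without breaking existing visitors.
public abstract class FlightDataVisitor
{
    public virtual void Visit(FlightDataSchema schema) { }
    public virtual void Visit(FlightDataRecordBatch batch) { }
}

public abstract class FlightData
{
    public abstract void Accept(FlightDataVisitor visitor);
}

public sealed class FlightDataSchema : FlightData
{
    public FlightDataSchema(string[] fieldNames) { FieldNames = fieldNames; }
    public string[] FieldNames { get; } // simplified stand-in for an Arrow schema
    public override void Accept(FlightDataVisitor visitor) => visitor.Visit(this);
}

public sealed class FlightDataRecordBatch : FlightData
{
    public FlightDataRecordBatch(int rowCount) { RowCount = rowCount; }
    public int RowCount { get; } // simplified stand-in for a RecordBatch
    public override void Accept(FlightDataVisitor visitor) => visitor.Visit(this);
}

// Hypothetical consumer that only cares about record batches.
public sealed class RowCountingVisitor : FlightDataVisitor
{
    public int TotalRows { get; private set; }
    public override void Visit(FlightDataRecordBatch batch) => TotalRows += batch.RowCount;
}

public static class Program
{
    public static void Main()
    {
        var stream = new List<FlightData>
        {
            new FlightDataSchema(new[] { "id" }),
            new FlightDataRecordBatch(3),
            new FlightDataRecordBatch(5),
        };

        var visitor = new RowCountingVisitor();
        foreach (FlightData item in stream)
        {
            item.Accept(visitor); // dispatches to the overload matching the concrete type
        }

        Console.WriteLine(visitor.TotalRows); // prints 8
    }
}
```

With this shape, supporting a new kind of message would mean adding one subclass and one virtual Visit overload, while visitors that do not care about it stay unchanged.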

If not using a visitor, the loop would not be as "pretty", but it would still be functional and make it easier to add new types (at least in my opinion):

while(stream.MoveNext()) {
  if(stream.Current.Type == FlightDataType.RecordBatch) {
    var recordBatch = stream.Current.GetValue<RecordBatch>();
    // do stuff with record batch
  }
}

Would love to get your input, so the flight framework starts with a good foundation.

EDIT: Doing put operations would be a bit weirder with this though, since it would require the user to know to send the schema as the first message. I think the putStream should still be similar to the existing solution even with this, but with extra write commands such as Write(ArrowDictionary) etc.


@eerhardt eerhardt left a comment

Looking good. I dug a bit deeper into the code this time. I think this is close to merging. Just a few more comments.

csharp/src/Apache.Arrow.Flight/FlightInfo.cs (outdated, resolved)
csharp/src/Apache.Arrow.Flight/FlightDescriptor.cs (outdated, resolved)
csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs (outdated, resolved)

//
// Summary:
// Gets the call status if the call has already finished. Throws InvalidOperationException
Contributor:

How do callers know when the call has finished? Do they await on the ResponseHeadersAsync?

Contributor Author:

The caller ends the stream, so they call CompleteAsync on the request stream.

@Ulimo Ulimo requested a review from eerhardt November 24, 2020 15:10
@eerhardt

"but I think I overspecified it for RecordBatch, and there will be trouble if we add dictionary support etc."

I think this is OK for now. This is just the initial implementation. The Flight API won't necessarily be "stable" in its first release. Even in the C++/Python APIs it isn't stable yet, as far as I understand. There are probably many more places that will need to change to support dictionaries in the future.
Regarding what I said last week about being hard to change later - that was mostly around the concept that it is easier to hide things now, and expose them later if needed than it is to expose things now and hide them later.

@eerhardt

Do you think you can fill out this table with the current support?

https://github.com/apache/arrow/blob/master/docs/source/status.rst#flight-rpc

@Ulimo Ulimo requested a review from eerhardt November 25, 2020 14:52

@eerhardt eerhardt left a comment

This is great @Ulimo! Thanks for all the awesome work here.

I'll merge this today unless I hear other feedback.


Ulimo commented Nov 25, 2020

@eerhardt I just have one question: would it be possible to get this out as a preview NuGet package soonish?

@eerhardt eerhardt closed this in e883f26 Nov 28, 2020
@eerhardt

I believe the next round of releases is scheduled for sometime in January. (See this recent dev-list thread)
Would you need something earlier than that? If so, one option would be to build the NuGet package yourself (dotnet pack in the csharp folder) and push it to a private feed that you can use until an official one gets pushed to nuget.org.

Potential options for private feeds are Azure DevOps or myget.org.

@eerhardt

FYI - @Ulimo, the 3.0.0 nuget packages have been released. You can see them at:

https://www.nuget.org/packages/Apache.Arrow.Flight/
https://www.nuget.org/packages/Apache.Arrow.Flight.AspNetCore/

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Upgraded build version to 3.1 and C# version to 8.
Also added EmbedUntrackedSources so auto generated code will work with sourcelink.

This change is a requirement for apache#8694 and is isolating some of the changes that it required regarding upgrading net core version.

Closes apache#8702 from Ulimo/ARROW-10634_ci_build

Authored-by: Östman Alexander <alexander.ostman@sweco.se>
Signed-off-by: Eric Erhardt <eric.erhardt@microsoft.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Closes apache#8694 from Ulimo/master

Lead-authored-by: Östman Alexander <alexander.ostman@sweco.se>
Co-authored-by: Ulimo <alexander.ostman@hotmail.com>
Signed-off-by: Eric Erhardt <eric.erhardt@microsoft.com>
4 participants