Apache Arrow Flight transport speed improvement for list structures #31981

@asfimport

I just started testing Arrow Flight to send results from a GraphQL server with FlightServer() running on it.

GraphQL defines a schema for your data output, which can be mapped to an Arrow schema, so I thought it would make sense to try Arrow Flight to transport results instead of REST-style JSON records.

Arrow Flight was at least 66% faster in all cases, but its advantage did not grow as the number of child records increased (it dropped from 80% to 66%). I suspect that serializing structs or lists needs some improvement.

Below are the test scripts, followed by the discussion I opened.

https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_asgi_server.py
Standard ASGI server. Start it up with uvicorn --host=0.0.0.0 test_asgi_server:app

https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_arrow_flight_server.py
New Arrow Flight server. Start it up with python test_arrow_flight_server.py

https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_asgi_arrow_client.py
Benchmarking script. Pass in length of lists to test and server host.

 

Discussion from Ariadne GraphQL package.

mirumee/ariadne#867

With 10 records, Arrow Flight was 0.049 seconds faster (80% faster).
With 10,000 records, it was 0.109 seconds faster (66% faster).
With 10 million records, it was 54 seconds faster (66% faster).
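The figures above give only the time saved and the percentage, not the raw timings. The sketch below back-computes what the raw timings would be under one reading of "X% faster", namely that the percentage is the time saved divided by the JSON time. That reading is an assumption, not something stated in this report, so the derived numbers are illustrative only; `implied_timings` is a hypothetical helper.

```python
# (records, seconds saved, fraction faster) as reported above.
reported = [
    (10, 0.049, 0.80),
    (10_000, 0.109, 0.66),
    (10_000_000, 54.0, 0.66),
]


def implied_timings(saved: float, fraction: float):
    """Derive (json_seconds, flight_seconds).

    ASSUMPTION: fraction == saved / json_seconds. The report does not
    define the percentage, so this is only one possible interpretation.
    """
    json_seconds = saved / fraction
    flight_seconds = json_seconds - saved
    return json_seconds, flight_seconds


for records, saved, fraction in reported:
    json_s, flight_s = implied_timings(saved, fraction)
    print(
        f"{records:>10,} records: JSON ~{json_s:.3f}s, "
        f"Flight ~{flight_s:.3f}s, speedup ~{json_s / flight_s:.1f}x"
    )
```

Under this reading, a constant 66% corresponds to a constant speedup ratio, which means both transports would be growing at roughly the same rate with list length.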

Here is the data structure that is sent across the wire:

pyarrow.Table
data: struct<test_lists: struct<float_list: list<item: double>, int_list: list<item: int64>, length: int64, string_list: list<item: string>, time_spent: double>>
  child 0, test_lists: struct<float_list: list<item: double>, int_list: list<item: int64>, length: int64, string_list: list<item: string>, time_spent: double>
      child 0, float_list: list<item: double>
          child 0, item: double
      child 1, int_list: list<item: int64>
          child 0, item: int64
      child 2, length: int64
      child 3, string_list: list<item: string>
          child 0, item: string
      child 4, time_spent: double

data: [
  -- is_valid: all not null
  -- child 0 type: struct<float_list: list<item: double>, int_list: list<item: int64>, length: int64, string_list: list<item: string>, time_spent: double>
    -- is_valid: all not null
    -- child 0 type: list<item: double>
      [[13.500371672273381,17.747395152140353,28.973205439157457,1.361443415643098,19.029191125636135,14.62284718057391,18.44333922481529,7.906278860251386,14.402464768126993,5.826040531772251]]
    -- child 1 type: list<item: int64>
      [[23,3,21,15,20,4,10,16,23,25]]
    -- child 2 type: int64
      [10]
    -- child 3 type: list<item: string>
      [["qypsupwtxy","vrxptpspyt","qpvruwsuqq","ywwpyxrvrt","wswutpxxqv","tsyypstxvv","ytprpqsxsx","wtwsxvprvu","suwtrvqvwp","wtsrwywwty"]]
    -- child 4 type: double
      [0]]

Reporter: David Lee / @davlee1972

Note: This issue was originally created as ARROW-16629. Please see the migration documentation for further details.
