Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13742: [C++][Go] Expose Dataset API to Go via CGO #10995

Closed
wants to merge 15 commits into from

Conversation

zeroshade
Copy link
Member

This adds a new workflow to build the Dataset Go Module with CGO which first builds the C++ library before building the Go Dataset library and running the tests. This only exposes a basic ability to create a FilesystemDataset and scan it to get record batches. It can then be built upon more in order to add more functionality. I drew inspiration from the existing implementation of the C Data interface and the JNI Dataset library in order to build this but I'm looking for any feedback at all.

@pitrou This is what I was talking about on the dev mailing list the other day.

@github-actions
Copy link

@pitrou
Copy link
Member

pitrou commented Aug 30, 2021

Some high-level comments:

  1. since C has no namespacing, all names should be prefix by "Arrow" (for regular names) or "ARROW_" (for preprocessor macros)
  2. the proposed API is taking the C data interface as an inspiration. However, unless the intent is to allow different producers to provide the dataset API, this could be a more classical (idiomatic) C API.
  3. the dataset can be represented using the (experimental) C stream interface: https://arrow.apache.org/docs/format/CStreamInterface.html

In the end, the C API might look like this:

#include "arrow/c/abi.h"

struct ArrowDatasetFactory;

enum ArrowDatasetFormat {
    ARROW_DATASET_PARQUET, ARROW_DATASET_CSV, ARROW_DATASET_IPC
};

int ArrowDatasetFactoryFromUri(const char* uri,
                               struct ArrowDatasetFormat format,
                               struct ArrowDatasetFactory** out);

int ArrowDatasetFactoryInspect(struct ArrowDatasetFactory* factory,
                               int num_fragments_to_inspect,
                               struct ArrowSchema* out);

int ArrowDatasetFactoryCreateDataset(struct ArrowDatasetFactory* factory,
                                     struct ArrowSchema* optional_schema,
                                     struct ArrowArrayStream* out);

const char* ArrowDatasetFactoryGetLastError();

void ArrowDatasetFactoryDestroy(struct ArrowDatasetFactory*);

cc @bkietz for advice

#define DISABLE_INSPECT_FRAGMENTS 0

// Analagous to arrow::dataset::Scanner
struct Scanner {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @pitrou - the binding should be as minimal as possible since the classes of the dataset and compute APIs are experimental and changed frequently. In particular, we've discussed removing Scanner altogether.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair assessment. In particular since doing this, i've decided to try a different approach and see how it goes, see #11037 where I've implemented consumer functions for the C Data interface for go. I'm looking into trying to just do a minimal binding to the Compute API directly instead and the building on top of that. If that ends up being viable, I'll close out this PR in favor of that approach.

@zeroshade
Copy link
Member Author

Closing this in favor of #11037 and continuing that approach which ensures I'll be able to keep the bindings as minimal as possible and only bind directly to the compute api as minimally as I can rather than the entire dataset api.

@zeroshade zeroshade closed this Sep 9, 2021
asfgit pushed a commit that referenced this pull request Sep 17, 2021
This supercedes some portions of #10995 as I think it's better to centralize the cdata interface interactions in a specific module that is part of the arrow module rather than it being a part of the dataset module. So this should get reviewed and merged first.

CC @pitrou @emkornfield

Closes #11037 from zeroshade/go-cdata

Lead-authored-by: Matthew Topol <mtopol@factset.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Matt Topol <mtopol@factset.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
This supercedes some portions of apache#10995 as I think it's better to centralize the cdata interface interactions in a specific module that is part of the arrow module rather than it being a part of the dataset module. So this should get reviewed and merged first.

CC @pitrou @emkornfield

Closes apache#11037 from zeroshade/go-cdata

Lead-authored-by: Matthew Topol <mtopol@factset.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Matt Topol <mtopol@factset.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants