Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-14222: [C++] implement GCSFileSystem skeleton #11331

Closed

Conversation

coryan
Copy link
Contributor

@coryan coryan commented Oct 5, 2021

No description provided.

@github-actions
Copy link

github-actions bot commented Oct 5, 2021

static GCSOptions Anonymous();
};

class ARROW_EXPORT GCSFileSystem : public FileSystem {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit: GcsFileSystem? Also add docstring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering what was the style in this project. Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, google style, There are some cases where we've used the all caps spelling though (e.g. CSV)

@coryan coryan marked this pull request as ready for review October 6, 2021 01:12
@coryan
Copy link
Contributor Author

coryan commented Oct 7, 2021

Ping?


namespace gcs = google::cloud::storage;

google::cloud::Options AsGoogleCloudOptions(GcsOptions const& o) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const should be in front of GcsOptions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I did not mean to restart the "east const" vs. "const west" holy way, but muscle memory is what it is.

namespace gcs = google::cloud::storage;

google::cloud::Options AsGoogleCloudOptions(GcsOptions const& o) {
auto options = google::cloud::Options{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to prefer this over google::cloud::Options options; ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Almost Always Auto", another holy war. I can change it if you think it is confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the almost always use auto (my undertanding this is from effective C++) directly contradicts the style guide in use (which is only use auto if type is obvious). In this case type is obvious, it just seems like there is more text in this case.

google::cloud::Options AsGoogleCloudOptions(GcsOptions const& o) {
auto options = google::cloud::Options{};
if (!o.endpoint_override.empty()) {
auto scheme = o.scheme;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please spell out the type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

namespace internal {
// TODO(ARROW-1231) - during development only tests should create a GcsFileSystem.
// Remove, and provide a public API, before declaring the feature complete.
std::shared_ptr<GcsFileSystem> MakeGcsFileSystemForTest(const GcsOptions& options);
Copy link
Contributor

@emkornfield emkornfield Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO covers make a static method Make on GcsFileSystem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to make the TODO more explicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope that is what you had in mind.

namespace internal {

std::shared_ptr<GcsFileSystem> MakeGcsFileSystemForTest(const GcsOptions& options) {
return std::shared_ptr<GcsFileSystem>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe comment on why std::make_shared isn't used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

EXPECT_THAT(b.get(), NotNull());
EXPECT_EQ(b, b);

EXPECT_NE(a, b);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Equals here is mean logically equal, shouldn't this be equal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know what "logically equals" means. At a guess something like "modulo transient errors, both produce the same results"? If so, I am not sure there is a reliable way to do that: consider storage.googleapis.com vs. private.googleapis.com as endpoint overrides, different strings, but depending on your environment, exactly the same behavior (or not!).

Anyway, implemented something closer to the current implementation for the S3 FileSystem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I meant configuration parameters are equal. Before this looked like it was using solely memory address equality.

EXPECT_EQ(a, a);

auto b = internal::MakeGcsFileSystemForTest(GcsOptions{});
EXPECT_THAT(b.get(), NotNull());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think .get() is necessary here and above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh, fixed.

TEST(GCSFileSystem, Compare) {
auto a = internal::MakeGcsFileSystemForTest(GcsOptions{});
EXPECT_THAT(a.get(), NotNull());
EXPECT_EQ(a, a);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be comparing the objects EXPECT_EQ(*a, *a)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we want a->Equals() fixed.

@emkornfield
Copy link
Contributor

C++ looks good to me. I'll wait and see if there are any comments on CMake (I'll merge later today if nothing comes up)

@@ -801,7 +801,11 @@ endif()
set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS})

# boost::filesystem is needed for S3 and Flight tests as a boost::process dependency.
if(((ARROW_FLIGHT OR ARROW_S3) AND (ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)))
if(((ARROW_FLIGHT
OR ARROW_S3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the mac build is a transient error or somehow due to this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think is this change, as the build has -DARROW_GCS=OFF:

https://github.com/apache/arrow/pull/11331/checks?check_run_id=3830637295#step:9:62

and the change is (modulo reformatting) changing (ARROW_FLIGHT OR ARROW_S3) to (ARROW_FLIGHT OR ARROW_S3 OR ARROW_GCS). In addition, this change has been there since the first commit in the branch.

At a guess this is either a transient or the point in master where I based the branch has a problem, I can rebase if you think that would help.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, lets try a rebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the builds at master, I do not think a rebase would help, and it seems even less likely that these changes caused the build breaks:

https://github.com/apache/arrow/actions/runs/1317260919

https://github.com/apache/arrow/actions?query=event%3Apush+branch%3Amaster

@coryan coryan force-pushed the ARROW-14222-implement-gcsfs-skeleton branch from 6898c5a to ea0db36 Compare October 7, 2021 18:45
@coryan
Copy link
Contributor Author

coryan commented Oct 7, 2021

Rebased anyway. At least that makes it easier to understand what build problems are coming from where.

Comment on lines 3709 to 3715
include_directories(SYSTEM ${GOOGLE_CLOUD_CPP_INCLUDE_DIR})
get_target_property(absl_base_INCLUDE_DIR absl::base INTERFACE_INCLUDE_DIRECTORIES)
include_directories(SYSTEM ${absl_base_INCLUDE_DIR})
message(STATUS "Found google-cloud-cpp::storage static library: ${GOOGLE_CLOUD_CPP_STATIC_LIBRARY_STORAGE}"
)
message(STATUS "Found google-cloud-cpp::storage headers: ${GOOGLE_CLOUD_CPP_INCLUDE_DIR}"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you extract include dir and static library path information from google-cloud-cpp::storage?
If we use system Google Cloud Storage (find_package(google_cloud_cpp_storage) is succeeded case), GOOGLE_CLOUD_CPP_* aren't defined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just removed the static library, seems to be unused, and extracted the include dir from the target properties.

This is really not my problem, I will say it anyway: there is something wrong, or at least very unusual, about Arrow's usage of CMake. With modern CMake adding google-cloud-cpp::storage to the target_link_libraries() should define all the necessary command-line flags. If one must set include_directories() then some target is missing a dependency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. You're right.

We have some problems around it but we didn't dig into them yet...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. In case it helps, it might be this:

# Generate a single "objlib" from all C++ modules and link
# that "objlib" into each library kind, to avoid compiling twice
add_library(${LIB_NAME}_objlib OBJECT ${ARG_SOURCES})

the arrow_objlib target does not seem to link any dependencies, so when its object files are compiled, none of the required flags are set. You may want to add the dependencies to arrow_objlib and then things should "Just Work":tm:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be right. We should try it later. :-)

@coryan coryan force-pushed the ARROW-14222-implement-gcsfs-skeleton branch from ea0db36 to 44e77cb Compare October 8, 2021 15:04
@kou
Copy link
Member

kou commented Oct 8, 2021

CI failures on macOS will be fixed by #11374.

@emkornfield
Copy link
Contributor

@kou any more concerns, otherwise I think this can be merged.

@kou
Copy link
Member

kou commented Oct 8, 2021

Nothing. We can merge this.

@kou kou closed this in 1b0ea15 Oct 9, 2021
@ursabot
Copy link

ursabot commented Oct 9, 2021

Benchmark runs are scheduled for baseline = 00c2741 and contender = 1b0ea15. 1b0ea15 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@coryan coryan deleted the ARROW-14222-implement-gcsfs-skeleton branch October 9, 2021 03:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants