ARROW-14892: [Python][C++] GCS Bindings #12763

Merged
merged 85 commits into from
Jun 12, 2022

Conversation

emkornfield
Contributor

@emkornfield emkornfield commented Mar 31, 2022

Incorporate the GCS file system into Python, along with other bug fixes.

Bugs/Other changes:

  • Add GCS bindings mostly based on AWS bindings in Python and associated unit tests
  • Tell() was incorrect: it double counted the offset when the stream was constructed with one.
  • The define was never set in config.cmake, which meant FileSystemFromUri was never tested and did not compile; this is now fixed.
  • Refine the logic for GetFileInfo with a single path to recognize prefixes followed by a slash as a directory. This allows datasets to work as expected with a toy dataset generated on the local filesystem and copied to the cloud (I believe this is typical of how other systems write to GCS as well).
  • Switch the convention for creating directories to always end in "/" and use this as another indicator. From testing with a sample Iceberg table, this appears to be the convention used for Hive partitioning, so I assume it is common practice for other Hive-related writers (i.e. what we want to support).
  • Fix bug introduced in a5e45ce which caused failures when a deletion occurred on a bucket (not an object in the bucket).
  • Ensure output streams are closed on destruction (this is consistent with S3)
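
The two directory-related bullets above can be illustrated with a small, self-contained model. This is only a sketch in Python over an in-memory set of object names, not the C++ logic in gcsfs.cc; the function name `classify_path` and the sample object names are hypothetical.

```python
def classify_path(object_names, path):
    """Classify `path` as 'file', 'directory', or 'not_found'.

    Hypothetical model of the heuristics in the PR summary:
    - an object whose name is exactly `path` is a file;
    - an object named `path + "/"` is an explicit directory marker;
    - any object whose name starts with `path + "/"` implies a directory.
    """
    path = path.rstrip("/")
    if path in object_names:
        return "file"
    prefix = path + "/"
    if any(name.startswith(prefix) for name in object_names):
        # Covers both the explicit marker object ("dir/") and implicit
        # directories such as the parent of "dir/part-0.parquet".
        return "directory"
    return "not_found"


objects = {
    "dataset/year=2022/part-0.parquet",  # e.g. copied from a local filesystem
    "empty_dir/",                        # explicit directory marker
}

print(classify_path(objects, "dataset/year=2022"))
print(classify_path(objects, "empty_dir"))
```

With this model, a Hive-style partition directory is recognized even though no object named `dataset/year=2022` exists, and an empty directory is still visible because of its trailing-slash marker.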

@emkornfield emkornfield marked this pull request as draft March 31, 2022 08:26
@emkornfield
Contributor Author

@pitrou @kszucs this still isn't ready for review, but I have it compiling locally. I was wondering if there are Crossbow or other actions I should be taking to verify packaging?

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@emkornfield emkornfield marked this pull request as ready for review April 2, 2022 06:23
@emkornfield
Contributor Author

CC @coryan for C++ changes.

@@ -536,8 +546,7 @@ class GcsFileSystem::Impl {
                                                    gcs::ReadFromOffset offset) {
     auto stream = client_.ReadObject(path.bucket, path.object, generation, offset);
     ARROW_GCS_RETURN_NOT_OK(stream.status());
-    return std::make_shared<GcsInputStream>(std::move(stream), path, gcs::Generation(),
-                                            offset, client_);
+    return std::make_shared<GcsInputStream>(std::move(stream), path, generation, client_);
Contributor

Assume offset into this function is 1000. Without the offset parameter passed to GcsInputStream its Tell() function will return 0 when you are in fact reading byte 1000. That seems like the wrong semantics to me, but maybe it is the expected behavior?

Contributor Author

@emkornfield emkornfield Apr 2, 2022

Empirically, the underlying tell returns 1000. I was observing a doubling of the expected tell value. The Python test that found this wrote N bytes, then seeked to N/2 and tried reading. The read called the FS tell, which returned N, which caused zero bytes to be read.
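
The double counting described here can be reproduced with a toy model. This is a sketch only: `OffsetStream`, `BuggyWrapper`, and `FixedWrapper` are hypothetical Python stand-ins, not the GCS client or Arrow classes. The one assumption carried over from the discussion is that the underlying stream's tell already includes the starting offset.

```python
import io


class OffsetStream:
    """Stand-in for a ranged read whose tell() already includes the offset."""

    def __init__(self, data, offset):
        self._buf = io.BytesIO(data)
        self._buf.seek(offset)

    def tell(self):
        return self._buf.tell()


class BuggyWrapper:
    """Adds the offset a second time on top of the underlying position."""

    def __init__(self, stream, offset):
        self._stream = stream
        self._offset = offset

    def tell(self):
        return self._offset + self._stream.tell()  # double counts the offset


class FixedWrapper:
    """Delegates to the underlying stream, which already tracks the offset."""

    def __init__(self, stream):
        self._stream = stream

    def tell(self):
        return self._stream.tell()


data = bytes(range(10))          # N = 10 bytes "written"
half = len(data) // 2            # seek to N/2
buggy = BuggyWrapper(OffsetStream(data, half), half)
fixed = FixedWrapper(OffsetStream(data, half))

# Bytes a reader would think are left, computed as size - tell().
remaining_buggy = len(data) - buggy.tell()
remaining_fixed = len(data) - fixed.tell()
print(remaining_buggy, remaining_fixed)
```

In the buggy variant the reported position equals N, so a reader that computes the remaining bytes from it concludes there is nothing left to read.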

Contributor Author

I looked at the GCS code and it does appear to keep the ReadAt offset to return for tell, but I might have missed something.

Contributor

Ack. SGTM.

Member

@emkornfield I'm curious why this wasn't caught by the C++ tests. Is it possible to enhance the generic filesystem tests to cover this?

Contributor Author

@emkornfield emkornfield Apr 19, 2022

Oh, sorry. For the Tell case we can certainly add a test case. Filed ARROW-16226 to track this.

cpp/src/arrow/filesystem/gcsfs.cc (resolved)
@emkornfield
Contributor Author

emkornfield commented Apr 4, 2022

When running Docker locally, the Python tests seem to hang forever when attempting to reach the GCS testbench, which explains the timeouts. Not sure if this could be some sort of config missing for Docker? (All tests pass when run without Docker.)

This could also be a test-bench versioning issue.

It turns out this was testbench not getting installed properly into the conda env.

@emkornfield
Contributor Author

Also CC @jorisvandenbossche if you have time to look

@emkornfield
Contributor Author

@github-actions crossbow submit -g python

@emkornfield
Contributor Author

@github-actions crossbow submit -g wheel

@pitrou pitrou self-requested a review April 13, 2022 17:00
@emkornfield
Contributor Author

Ping @pitrou to see if you have time to review. Also @rok it seems you have been looking at GCS stuff recently.

@voutilad

FWIW, @emkornfield I had to revert your last commit (608b6ec) in order to get tests to work. It seems if the testbench suite doesn't properly initialize, the GcsFileSystem tests hang until the ctest timeout. (This is using archery docker build ubuntu-cpp.)

@coryan
Contributor

coryan commented Apr 15, 2022

> FWIW, @emkornfield I had to revert your last commit (608b6ec) in order to get tests to work. It seems if the testbench suite doesn't properly initialize,

You may be running into postmanlabs/httpbin#673 which we worked around in googleapis/storage-testbench#301 . The latest release should have these fixes.

Member

@pitrou pitrou left a comment

Thank you @emkornfield . You'll find a bunch of comments below.
Also, can you ensure you rebase on the latest git master?

@@ -128,6 +128,8 @@ jobs:
ARROW_GANDIVA: ON
ARROW_HDFS: ON
ARROW_HOME: /usr/local
# TODO(ARROW-16102): Enable this once we can figure out builds.
ARROW_GCS: OFF
Member

@emkornfield Did you mean to add this? ARROW-16102 is fixed already.

Contributor Author

Yes, ARROW-16102 was fixed between when this was posted for review and when it got reviewed. This will be removed in the rebase.

export PYARROW_TEST_HDFS
export PYARROW_TEST_ORC
export PYARROW_TEST_PARQUET
export PYARROW_TEST_S3

# DO NOT SUBMIT
Member

Why is this there? Is this PR actually ready? Did you perhaps forget to push some followup changes?

Contributor Author

This seemed like an ugly hack. I was hoping someone could point me to a better solution. I will try removing after the rebase to see if the tests now pass, as the prior install command (install_gcs_testbench.sh) has been updated slightly at HEAD

cpp/src/arrow/filesystem/gcsfs.h (outdated, resolved)
-struct GcsCredentials {
-  explicit GcsCredentials(std::shared_ptr<google::cloud::Credentials> c)
+struct GcsCredentialsHolder {
+  explicit GcsCredentialsHolder(std::shared_ptr<google::cloud::Credentials> c)
Member

I don't think this constructor is necessary, just let C++ define it implicitly for you?

Contributor Author

needed for std::make_shared

cpp/src/arrow/filesystem/gcsfs.cc (resolved)
python/pyarrow/_gcsfs.pyx (outdated, resolved)
python/pyarrow/_gcsfs.pyx (outdated, resolved)
python/pyarrow/_gcsfs.pyx (outdated, resolved)
python/pyarrow/_gcsfs.pyx (resolved)
try:
proc = subprocess.Popen(args, env=env)
except OSError:
pytest.skip('`gcs test bench` command cannot be located')
Member

Why this message? Is "gcs test bench" an actual command?

Contributor Author

Mostly copy and paste from S3; I rephrased slightly.
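
For readers unfamiliar with the pattern under discussion, here is a minimal sketch of launching an external helper and degrading gracefully when it cannot be started. The helper name `try_launch` and the command name are hypothetical; in a real pytest fixture the `None` branch would typically call `pytest.skip(...)` with a message naming the actual executable.

```python
import subprocess


def try_launch(args, env=None):
    """Return a running Popen, or None if the executable cannot be started."""
    try:
        return subprocess.Popen(args, env=env)
    except OSError:
        # FileNotFoundError and PermissionError are both subclasses of OSError.
        return None


# Hypothetical command name that is not expected to exist on any machine.
proc = try_launch(["no-such-gcs-testbench-command-12345"])
if proc is None:
    print("helper process could not be located; dependent tests would be skipped")
```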

@emkornfield
Contributor Author

@pitrou thanks for the thorough review. I think I addressed most comments with the exception of the cleanup of directory logic, which I will try to address in a little bit (I left TODO place markers there). I've also rebased, and changed the destructor on OutputStream to close the file if it isn't already closed.
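
The close-on-destruction behavior mentioned above can be sketched as a loose Python analogue. This is not the C++ OutputStream from the PR: the class below and its list-based sink are hypothetical, and the immediate finalization on `del` relies on CPython's reference counting.

```python
class OutputStream:
    """Hypothetical stream that only publishes buffered data when closed."""

    def __init__(self, sink):
        self._sink = sink        # a list standing in for the remote object
        self._pending = []
        self.closed = False

    def write(self, chunk):
        self._pending.append(chunk)

    def close(self):
        if self.closed:          # idempotent: safe to call more than once
            return
        self._sink.extend(self._pending)
        self._pending.clear()
        self.closed = True

    def __del__(self):
        # Fallback only; callers should still close explicitly.
        self.close()


sink = []
stream = OutputStream(sink)
stream.write(b"abc")
del stream                       # last reference dropped without close()
print(sink)
```

Making `close()` idempotent is what allows the finalizer to call it unconditionally without double flushing a stream the caller already closed.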

@emkornfield
Contributor Author

@pitrou I addressed the TODOs I left in there for simplification. Please let me know if the code is now easier to read; there was a bunch of superfluous code. I am still not sure why testbench needs to be installed twice.

Member

@pitrou pitrou left a comment

Thanks for the update @emkornfield !

python/pyarrow/tests/conftest.py (resolved)
python/pyarrow/tests/conftest.py (outdated, resolved)
python/pyarrow/_gcsfs.pyx (resolved)
@@ -183,6 +183,8 @@ def _apply_options(cmd, options):
@click.option("--with-r", default=None, type=BOOL,
help="Build the Arrow R extensions. This is not a CMake option, "
"it will toggle required options")
@click.option("--with-gcs", default=None, type=BOOL,
Member

It seems the option is duplicated now? (I'm curious that click doesn't complain about it)


 /// Options for the GcsFileSystem implementation.
 struct ARROW_EXPORT GcsOptions {
-  std::shared_ptr<GcsCredentials> credentials;
+  GcsCredentials credentials;
Member

Do you plan on solving this TODO here?


 /// Options for the GcsFileSystem implementation.
 struct ARROW_EXPORT GcsOptions {
-  std::shared_ptr<GcsCredentials> credentials;
+  GcsCredentials credentials;
Member

Note a default-constructed credentials could trivially be anonymous...

export PYARROW_TEST_HDFS
export PYARROW_TEST_ORC
export PYARROW_TEST_PARQUET
export PYARROW_TEST_S3

# Without this, install_gcs_test_bench.sh doesn't seem to be put in a
# place that the python env can find it.
python -m pip install "https://github.com/googleapis/storage-testbench/archive/v0.16.0.tar.gz"
Member

I left a comment about this in conftest.py below.

Contributor Author

I changed this to check for the existence of the script and use the existing install command. I tried multiple approaches with Docker but couldn't get the environment variable set.

export PYARROW_WITH_HDFS=${ARROW_HDFS:-ON}
export PYARROW_WITH_ORC=${ARROW_ORC:-OFF}
export PYARROW_WITH_PLASMA=${ARROW_PLASMA:-OFF}
export PYARROW_WITH_PARQUET=${ARROW_PARQUET:-OFF}
export PYARROW_WITH_PARQUET_ENCRYPTION=${PARQUET_REQUIRE_ENCRYPTION:-ON}
export PYARROW_WITH_GCS=${ARROW_GCS:-OFF}
Member

Seems like this one is duplicate?

Contributor Author

removed.

@@ -50,6 +50,7 @@ namespace bp = boost::process;
namespace gc = google::cloud;
Member

Can you add a test for FileSystemFromUri somewhere in this file?

Contributor Author

Added a test to make sure it can instantiate the file system. There are detailed tests for the FromUri implementations on GCS already.
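
As background on what a URI-based factory has to do before it can instantiate a filesystem, the following is a rough standard-library sketch of splitting a `gs://` URI into a bucket and an object path. It is not Arrow's parser: `split_gcs_uri` is a hypothetical helper, and restricting the scheme to `gs` is an assumption made for the example.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs":    # assumption: only the "gs" scheme is accepted
        raise ValueError(f"not a GCS URI: {uri!r}")
    # netloc holds the bucket; the path keeps a leading slash that we drop.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, object_path = split_gcs_uri("gs://my-bucket/data/file.parquet")
print(bucket, object_path)
```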

# ARROW_GCS: ON
ARROW_GCS: OFF
Member

Could you revert this change because ARROW_GCS is OFF by default in ci/scripts/cpp_build.sh?

Contributor Author

done.

@@ -45,6 +45,7 @@
#cmakedefine ARROW_IPC
#cmakedefine ARROW_JSON

#cmakedefine ARROW_GCS
Member

Contributor Author

done.

@@ -431,7 +431,7 @@ tasks:
{############################## Wheel OSX ####################################}

# enable S3 support from macOS 10.13 so we don't need to bundle curl, crypt and ssl
-{% for macos_version, macos_codename, arrow_s3 in [("10.9", "mavericks", "OFF"),
+{% for macos_version, macos_codename, arrow_s3_gcs in [("10.9", "mavericks", "OFF"),
Member

I have a strong opinion for this but I like adding a new variable for gcs for readability:

{% for macos_version, macos_codename, arrow_s3, arrow_gcs in [("10.9", "mavericks", "OFF", "OFF"),
                                                              ("10.13", "high-sierra", "ON", "ON")] %}

Contributor Author

done.

Member

Thanks!

(I had a typo: "I have a strong opinion" -> "I don't have a strong opinion")

cpp/src/arrow/filesystem/gcsfs.cc (outdated, resolved)
@emkornfield
Contributor Author

@github-actions crossbow submit -g wheel

@emkornfield
Contributor Author

@pitrou and @kou I believe I addressed all feedback you provided on the last review (sorry, I had to force push my branch). One sticking point seems to be the GCS testbench. CI/Crossbow is running now; there might be a few other places it needs to be installed.

@pitrou
Member

pitrou commented Jun 11, 2022

@github-actions crossbow submit wheel-macos-big-sur*

@github-actions

Revision: 922e4ef

Submitted crossbow builds: ursacomputing/crossbow @ actions-2176483478

Task Status
wheel-macos-big-sur-cp310-arm64 Github Actions
wheel-macos-big-sur-cp310-universal2 Github Actions
wheel-macos-big-sur-cp38-arm64 Github Actions
wheel-macos-big-sur-cp39-arm64 Github Actions
wheel-macos-big-sur-cp39-universal2 Github Actions

@pitrou
Member

pitrou commented Jun 11, 2022

Ok, the situation on our macOS wheel builds is a bit horrible, but it's pointless to try to improve it here, so I'll revert the last two commits.

@pitrou
Member

pitrou commented Jun 11, 2022

@github-actions crossbow submit wheel-macos-*

@github-actions

Revision: 0f1f1da

Submitted crossbow builds: ursacomputing/crossbow @ actions-041204a60b

Task Status
wheel-macos-big-sur-cp310-arm64 Github Actions
wheel-macos-big-sur-cp310-universal2 Github Actions
wheel-macos-big-sur-cp38-arm64 Github Actions
wheel-macos-big-sur-cp39-arm64 Github Actions
wheel-macos-big-sur-cp39-universal2 Github Actions
wheel-macos-high-sierra-cp310-amd64 Github Actions
wheel-macos-high-sierra-cp37-amd64 Github Actions
wheel-macos-high-sierra-cp38-amd64 Github Actions
wheel-macos-high-sierra-cp39-amd64 Github Actions
wheel-macos-mavericks-cp310-amd64 Github Actions
wheel-macos-mavericks-cp37-amd64 Github Actions
wheel-macos-mavericks-cp38-amd64 Github Actions
wheel-macos-mavericks-cp39-amd64 Github Actions

@pitrou
Member

pitrou commented Jun 11, 2022

@github-actions crossbow submit -g nightly

@github-actions

Revision: 0f1f1da

Submitted crossbow builds: ursacomputing/crossbow @ actions-8c3920b06e

Task Status
almalinux-8-amd64 Github Actions
almalinux-8-arm64 TravisCI
almalinux-9-amd64 Github Actions
almalinux-9-arm64 TravisCI
amazon-linux-2-amd64 Github Actions
centos-7-amd64 Github Actions
centos-8-stream-amd64 Github Actions
centos-8-stream-arm64 TravisCI
conan-maximum Github Actions
conan-minimum Github Actions
conda-clean Azure
conda-linux-gcc-py310-arm64 Azure
conda-linux-gcc-py310-cpu Azure
conda-linux-gcc-py310-cuda Azure
conda-linux-gcc-py310-ppc64le Azure
conda-linux-gcc-py37-arm64 Azure
conda-linux-gcc-py37-cpu-r40 Azure
conda-linux-gcc-py37-cpu-r41 Azure
conda-linux-gcc-py37-cuda Azure
conda-linux-gcc-py37-ppc64le Azure
conda-linux-gcc-py38-arm64 Azure
conda-linux-gcc-py38-cpu Azure
conda-linux-gcc-py38-cuda Azure
conda-linux-gcc-py38-ppc64le Azure
conda-linux-gcc-py39-arm64 Azure
conda-linux-gcc-py39-cpu Azure
conda-linux-gcc-py39-cuda Azure
conda-linux-gcc-py39-ppc64le Azure
conda-osx-arm64-clang-py310 Azure
conda-osx-arm64-clang-py38 Azure
conda-osx-arm64-clang-py39 Azure
conda-osx-clang-py310 Azure
conda-osx-clang-py37-r40 Azure
conda-osx-clang-py37-r41 Azure
conda-osx-clang-py38 Azure
conda-osx-clang-py39 Azure
conda-win-vs2017-py310 Azure
conda-win-vs2017-py37-r40 Azure
conda-win-vs2017-py37-r41 Azure
conda-win-vs2017-py38 Azure
conda-win-vs2017-py39 Azure
debian-bookworm-amd64 Github Actions
debian-bookworm-arm64 TravisCI
debian-bullseye-amd64 Github Actions
debian-bullseye-arm64 TravisCI
debian-buster-amd64 Github Actions
debian-buster-arm64 TravisCI
example-cpp-minimal-build-static Github Actions
example-cpp-minimal-build-static-system-dependency Github Actions
example-python-minimal-build-fedora-conda Github Actions
example-python-minimal-build-ubuntu-venv Github Actions
homebrew-cpp Github Actions
homebrew-r-autobrew Github Actions
homebrew-r-brew Github Actions
java-jars Github Actions
nuget Github Actions
python-sdist Github Actions
test-build-cpp-fuzz Github Actions
test-build-vcpkg-win Github Actions
test-conda-cpp Github Actions
test-conda-cpp-valgrind Azure
test-conda-python-3.10 Github Actions
test-conda-python-3.7 Github Actions
test-conda-python-3.7-hdfs-2.9.2 Github Actions
test-conda-python-3.7-hdfs-3.2.1 Github Actions
test-conda-python-3.7-kartothek-latest Github Actions
test-conda-python-3.7-kartothek-master Github Actions
test-conda-python-3.7-pandas-0.24 Github Actions
test-conda-python-3.7-pandas-latest Github Actions
test-conda-python-3.7-spark-v3.1.2 Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-hypothesis Github Actions
test-conda-python-3.8-pandas-latest Github Actions
test-conda-python-3.8-pandas-nightly Github Actions
test-conda-python-3.8-spark-v3.2.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-dask-latest Github Actions
test-conda-python-3.9-dask-master Github Actions
test-conda-python-3.9-pandas-master Github Actions
test-conda-python-3.9-spark-master Github Actions
test-debian-10-cpp-amd64 Github Actions
test-debian-10-cpp-i386 Github Actions
test-debian-11-cpp-amd64 Github Actions
test-debian-11-cpp-i386 Github Actions
test-debian-11-go-1.16 Azure
test-debian-11-python-3 Azure
test-debian-c-glib Github Actions
test-debian-ruby Github Actions
test-fedora-35-cpp Github Actions
test-fedora-35-python-3 Azure
test-fedora-r-clang-sanitizer Azure
test-r-arrow-backwards-compatibility Github Actions
test-r-depsource-bundled Azure
test-r-depsource-system Github Actions
test-r-dev-duckdb Github Actions
test-r-devdocs Github Actions
test-r-gcc-11 Github Actions
test-r-gcc-12 Github Actions
test-r-install-local Github Actions
test-r-linux-as-cran Github Actions
test-r-linux-rchk Github Actions
test-r-linux-valgrind Azure
test-r-minimal-build Azure
test-r-offline-maximal Github Actions
test-r-offline-minimal Azure
test-r-rhub-debian-gcc-devel-lto-latest Azure
test-r-rhub-debian-gcc-release-custom-ccache Azure
test-r-rhub-ubuntu-gcc-release-latest Azure
test-r-rocker-r-base-latest Azure
test-r-rstudio-r-base-4.1-opensuse153 Azure
test-r-rstudio-r-base-4.2-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-4.2-focal Azure
test-r-ubuntu-22.04 Github Actions
test-r-versions Github Actions
test-skyhook-integration Github Actions
test-ubuntu-18.04-cpp Github Actions
test-ubuntu-18.04-cpp-release Github Actions
test-ubuntu-18.04-cpp-static Github Actions
test-ubuntu-18.04-r-sanitizer Azure
test-ubuntu-20.04-cpp Github Actions
test-ubuntu-20.04-cpp-14 Github Actions
test-ubuntu-20.04-cpp-17 Github Actions
test-ubuntu-20.04-cpp-bundled Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-cpp Github Actions
test-ubuntu-c-glib Github Actions
test-ubuntu-default-docs Azure
test-ubuntu-ruby Github Actions
ubuntu-bionic-amd64 Github Actions
ubuntu-bionic-arm64 TravisCI
ubuntu-focal-amd64 Github Actions
ubuntu-focal-arm64 TravisCI
ubuntu-impish-amd64 Github Actions
ubuntu-impish-arm64 TravisCI
ubuntu-jammy-amd64 Github Actions
ubuntu-jammy-arm64 TravisCI
verify-rc-source-cpp-linux-almalinux-8-amd64 Github Actions
verify-rc-source-cpp-linux-conda-latest-amd64 Github Actions
verify-rc-source-cpp-linux-ubuntu-18.04-amd64 Github Actions
verify-rc-source-cpp-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-cpp-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-cpp-macos-amd64 Github Actions
verify-rc-source-cpp-macos-arm64 Github Actions
verify-rc-source-cpp-macos-conda-amd64 Github Actions
verify-rc-source-csharp-linux-almalinux-8-amd64 Github Actions
verify-rc-source-csharp-linux-conda-latest-amd64 Github Actions
verify-rc-source-csharp-linux-ubuntu-18.04-amd64 Github Actions
verify-rc-source-csharp-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-csharp-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-csharp-macos-amd64 Github Actions
verify-rc-source-csharp-macos-arm64 Github Actions
verify-rc-source-go-linux-almalinux-8-amd64 Github Actions
verify-rc-source-go-linux-conda-latest-amd64 Github Actions
verify-rc-source-go-linux-ubuntu-18.04-amd64 Github Actions
verify-rc-source-go-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-go-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-go-macos-amd64 Github Actions
verify-rc-source-go-macos-arm64 Github Actions
verify-rc-source-integration-linux-almalinux-8-amd64 Github Actions
verify-rc-source-integration-linux-conda-latest-amd64 Github Actions
verify-rc-source-integration-linux-ubuntu-18.04-amd64 Github Actions
verify-rc-source-integration-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-integration-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-integration-macos-amd64 Github Actions
verify-rc-source-integration-macos-arm64 Github Actions
verify-rc-source-integration-macos-conda-amd64 Github Actions
verify-rc-source-java-linux-almalinux-8-amd64 Github Actions
verify-rc-source-java-linux-conda-latest-amd64 Github Actions
verify-rc-source-java-linux-ubuntu-18.04-amd64 Github Actions
verify-rc-source-java-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-java-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-java-macos-amd64 Github Actions
verify-rc-source-js-linux-almalinux-8-amd64 Github Actions
verify-rc-source-js-linux-conda-latest-amd64 Github Actions
verify-rc-source-js-linux-ubuntu-18.04-amd64 Github Actions
verify-rc-source-js-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-js-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-js-macos-amd64 Github Actions
verify-rc-source-js-macos-arm64 Github Actions
verify-rc-source-python-linux-almalinux-8-amd64 Github Actions
verify-rc-source-python-linux-conda-latest-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-18.04-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-python-macos-amd64 Github Actions
verify-rc-source-python-macos-arm64 Github Actions
verify-rc-source-python-macos-conda-amd64 Github Actions
verify-rc-source-ruby-linux-almalinux-8-amd64 Github Actions
verify-rc-source-ruby-linux-conda-latest-amd64 Github Actions
verify-rc-source-ruby-linux-ubuntu-18.04-amd64 Github Actions
verify-rc-source-ruby-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-ruby-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-ruby-macos-amd64 Github Actions
verify-rc-source-ruby-macos-arm64 Github Actions
verify-rc-source-windows Github Actions
wheel-macos-big-sur-cp310-arm64 Github Actions
wheel-macos-big-sur-cp310-universal2 Github Actions
wheel-macos-big-sur-cp38-arm64 Github Actions
wheel-macos-big-sur-cp39-arm64 Github Actions
wheel-macos-big-sur-cp39-universal2 Github Actions
wheel-macos-high-sierra-cp310-amd64 Github Actions
wheel-macos-high-sierra-cp37-amd64 Github Actions
wheel-macos-high-sierra-cp38-amd64 Github Actions
wheel-macos-high-sierra-cp39-amd64 Github Actions
wheel-macos-mavericks-cp310-amd64 Github Actions
wheel-macos-mavericks-cp37-amd64 Github Actions
wheel-macos-mavericks-cp38-amd64 Github Actions
wheel-macos-mavericks-cp39-amd64 Github Actions
wheel-manylinux2010-cp310-amd64 Github Actions
wheel-manylinux2010-cp37-amd64 Github Actions
wheel-manylinux2010-cp38-amd64 Github Actions
wheel-manylinux2010-cp39-amd64 Github Actions
wheel-manylinux2014-cp310-amd64 Github Actions
wheel-manylinux2014-cp310-arm64 TravisCI
wheel-manylinux2014-cp37-amd64 Github Actions
wheel-manylinux2014-cp37-arm64 TravisCI
wheel-manylinux2014-cp38-amd64 Github Actions
wheel-manylinux2014-cp38-arm64 TravisCI
wheel-manylinux2014-cp39-amd64 Github Actions
wheel-manylinux2014-cp39-arm64 TravisCI
wheel-windows-cp310-amd64 Github Actions
wheel-windows-cp37-amd64 Github Actions
wheel-windows-cp38-amd64 Github Actions
wheel-windows-cp39-amd64 Github Actions

@kou
Member

kou commented Jun 11, 2022

We need to fix wheel-*-cp37-* failures such as wheel-manylinux2014-cp37-amd64:

https://github.com/ursacomputing/crossbow/runs/6842267198?check_suite_focus=true

=================================== FAILURES ===================================
_______________ TestConvertMetadata.test_rangeindex_doesnt_warn ________________
self = <pyarrow.tests.test_pandas.TestConvertMetadata object at 0x7f9961d90d90>
    def test_rangeindex_doesnt_warn(self):
        # ARROW-5606: pandas 0.25 deprecated private _start/stop/step
        # attributes -> can be removed if support < pd 0.25 is dropped
        df = pd.DataFrame(np.random.randn(4, 2), columns=['a', 'b'])
        with pytest.warns(None) as record:
            _check_pandas_roundtrip(df, preserve_index=True)
>       assert len(record) == 0
E       assert 4 == 0
E        +  where 4 = len(WarningsChecker(record=True))
usr/local/lib/python3.7/site-packages/pyarrow/tests/test_pandas.py:229: AssertionError
_______________ TestConvertMetadata.test_multiindex_doesnt_warn ________________
self = <pyarrow.tests.test_pandas.TestConvertMetadata object at 0x7f9961d2a710>
    def test_multiindex_doesnt_warn(self):
        # ARROW-3953: pandas 0.24 rename of MultiIndex labels to codes
        columns = pd.MultiIndex.from_arrays([['one', 'two'], ['X', 'Y']])
        df = pd.DataFrame([(1, 'a'), (2, 'b'), (3, 'c')], columns=columns)
        with pytest.warns(None) as record:
            _check_pandas_roundtrip(df, preserve_index=True)
>       assert len(record) == 0
E       assert 6 == 0
E        +  where 6 = len(WarningsChecker(record=True))
usr/local/lib/python3.7/site-packages/pyarrow/tests/test_pandas.py:280: AssertionError

@emkornfield
Contributor Author

These seem unrelated to this PR? If so, I can open a JIRA to track them.

@pitrou
Member

pitrou commented Jun 12, 2022

@kou Those seem unrelated to this PR. I need to do a last review pass and then this PR can be merged.

@pitrou
Member

pitrou commented Jun 12, 2022

@github-actions crossbow submit -g wheel-*-cp37-*

@github-actions

Invalid group(s) {'wheel-*-cp37-*'}. Must be one of {'linux-amd64', 'verify-rc', 'conda', 'nightly-packaging', 'packaging', 'vcpkg', 'linux', 'nightly', 'example', 'fuzz', 'r', 'verify-rc-source-macos', 'wheel', 'verify-rc-source', 'verify-rc-jars', 'cpp', 'integration', 'c-glib', 'ruby', 'python', 'test', 'homebrew', 'verify-rc-wheels', 'linux-arm64', 'example-cpp', 'conan', 'example-python', 'verify-rc-source-linux', 'nightly-release', 'nightly-tests', 'verify-rc-binaries'}
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/2482888639

@pitrou
Member

pitrou commented Jun 12, 2022

@github-actions crossbow submit wheel-*-cp37-*

@github-actions

Revision: b8336ef

Submitted crossbow builds: ursacomputing/crossbow @ actions-d604965942

Task Status
wheel-macos-high-sierra-cp37-amd64 Github Actions
wheel-macos-mavericks-cp37-amd64 Github Actions
wheel-manylinux2010-cp37-amd64 Github Actions
wheel-manylinux2014-cp37-amd64 Github Actions
wheel-manylinux2014-cp37-arm64 TravisCI
wheel-windows-cp37-amd64 Github Actions
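
The difference between the rejected `-g` submission and the one that produced this task list can be sketched as follows. This is an illustration only, assuming the bare argument is matched against task names with shell-style globbing similar to Python's `fnmatch`; it is not Crossbow's actual code, and the task and group lists below are abbreviated copies of names that appear in this thread.

```python
from fnmatch import fnmatchcase

# A few task names taken from the listings in this conversation.
tasks = [
    "wheel-macos-big-sur-cp39-arm64",
    "wheel-macos-high-sierra-cp37-amd64",
    "wheel-macos-mavericks-cp37-amd64",
    "wheel-manylinux2010-cp37-amd64",
    "wheel-manylinux2014-cp37-amd64",
    "wheel-manylinux2014-cp37-arm64",
    "wheel-manylinux2014-cp38-amd64",
    "wheel-windows-cp37-amd64",
]
# A few of the valid group names from the error message.
groups = {"wheel", "python", "nightly"}

pattern = "wheel-*-cp37-*"

# With -g the argument must be an exact group name, so a glob is rejected.
is_group = pattern in groups

# Without -g the argument acts as a pattern over individual task names.
selected = [t for t in tasks if fnmatchcase(t, pattern)]

print(is_group)
print(selected)
```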

@pitrou pitrou merged commit 7b5912d into apache:master Jun 12, 2022
@pitrou
Member

pitrou commented Jun 12, 2022

Ok, some JIRAs will probably have to be opened for the wheel test failures.

@kou
Member

kou commented Jun 12, 2022

Oh, sorry.

@emkornfield
Contributor Author

@kou @pitrou thank you very much for all your help on this one.

@grisaitis

for documentation, should i open a new issue? v excited about this :)

the docs source: https://github.com/apache/arrow/blob/master/docs/source/python/filesystems.rst

@kou
Member

kou commented Jun 20, 2022

Yes, please.
If you can work on it, please submit a pull request too.
