apache-airflow-providers-google 10.9.0 fails to list GCS objects #34909

@atrbgithub

Description

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

This affects Airflow 2.7.2. Version 10.9.0 of apache-airflow-providers-google appears to fail to list objects in GCS: in the example below, GCSHook().list() returns an empty list where files are expected.

Example to recreate:

pipenv --python 3.8
pipenv shell
pip install apache-airflow==2.7.2 apache-airflow-providers-google==10.9.0
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='google-cloud-platform://'

Then create the following python test file:

from airflow.providers.google.cloud.hooks.gcs import GCSHook

result = GCSHook().list(
    bucket_name='a-test-bucket',
    prefix="a/test/prefix",
    delimiter='.csv'
)

result = list(result)
print(result)

The output of this is:

[]

In a different pipenv environment, this works when using Airflow 2.7.1 and the 10.7.0 version of the provider:

pipenv --python 3.8
pipenv shell
pip install apache-airflow==2.7.1 apache-airflow-providers-google==10.7.0
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='google-cloud-platform://'

Use the same python test file as above. The output of this is a list of files as expected.

This appears to be the commit that broke things.

The hooks/gcs.py file can be patched in the following way, which appears to force the lazy loading to kick in:

            print("Forcing loading....")
            all_blobs = list(blobs)

            for blob in all_blobs:
                print(blob.name)

            if blobs.prefixes:
                ids.extend(blobs.prefixes)
            else:
                ids.extend(blob.name for blob in all_blobs)

            page_token = blobs.next_page_token

            if page_token is None:
                # empty next page token
                break
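To illustrate why forcing the iteration might matter, here is a minimal sketch that does not use the real GCS client. FakeBlobIterator and both list_ids_* functions are hypothetical names made up for this example. The sketch assumes that the iterator's prefixes attribute starts empty and is only filled in as pages are consumed, and that with a delimiter such as '.csv' the matching objects come back as prefixes rather than as blobs. Under those assumptions, checking prefixes before iterating yields an empty result, while iterating first yields the expected names.

```python
from typing import Iterator, List, Set, Tuple


class FakeBlobIterator:
    """Hypothetical stand-in for a lazy blob iterator (not the real GCS class).

    Assumption: `prefixes` is empty until pages are consumed by iteration.
    """

    def __init__(self, pages: List[Tuple[List[str], Set[str]]]) -> None:
        # Each page is (blob names on the page, prefixes reported for the page).
        self._pages = pages
        self.prefixes: Set[str] = set()

    def __iter__(self) -> Iterator[str]:
        for names, page_prefixes in self._pages:
            # Prefixes only become visible once the page has been fetched.
            self.prefixes.update(page_prefixes)
            yield from names


def list_ids_check_first(blobs: FakeBlobIterator) -> List[str]:
    """Inspect prefixes before iterating, as in the unpatched code shown above."""
    if blobs.prefixes:
        return sorted(blobs.prefixes)
    return [name for name in blobs]


def list_ids_iterate_first(blobs: FakeBlobIterator) -> List[str]:
    """Consume the iterator first, as the patch does with list(blobs)."""
    all_names = list(blobs)
    if blobs.prefixes:
        return sorted(blobs.prefixes)
    return all_names


# One page with no blobs: everything was rolled up into prefixes by the delimiter.
pages = [([], {"a/test/prefix/one.csv", "a/test/prefix/two.csv"})]

print(list_ids_check_first(FakeBlobIterator(pages)))    # []
print(list_ids_iterate_first(FakeBlobIterator(pages)))  # both .csv names
```

If the real iterator behaves the way this sketch assumes, that would be consistent with the empty list seen on 10.9.0 and with the patch restoring the expected output.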

Example patch file:

+++ gcs.py      2023-10-12 11:34:00.774206013 +0000
@@ -829,12 +829,19 @@
                     versions=versions,
                 )

+            print("Forcing loading....")
+            all_blobs = list(blobs)
+
+            for blob in all_blobs:
+                print(blob.name)
+
             if blobs.prefixes:
                 ids.extend(blobs.prefixes)
             else:
-                ids.extend(blob.name for blob in blobs)
+                ids.extend(blob.name for blob in all_blobs)

             page_token = blobs.next_page_token
+
             if page_token is None:
                 # empty next page token
                 break

What you think should happen instead

The provider should be able to list files in GCS.

How to reproduce

Please see above for the steps to reproduce.

Operating System

n/a

Versions of Apache Airflow Providers

Version 10.9.0 of the Google provider.

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
