🎉 New Source: GCS #23186

Merged
merged 17 commits into from
Mar 14, 2023

Conversation

tuanchris
Contributor

@tuanchris tuanchris commented Feb 17, 2023

What

Add a new GCS source. This connector will:

  • Support service account authentication
  • Get all files from a directory
  • Scan for CSV files
  • Generate streams for each of the CSV files
  • Read the CSV files
    #11135 @YowanR

image

How

  • Authenticate using the google-cloud-storage library and a service account
  • Create a GCS client and a bucket object, and use the list_blobs method to list all blobs
  • Filter .csv files
  • Read the first 0.1 MB of each file to create a JSON schema object
  • Use pandas to read the files into memory and write the records to the Airbyte streams
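
The listing and filtering steps above could look roughly like the sketch below. It is not the connector's actual code: `make_client` and `list_csv_blob_names` are hypothetical names, and passing the service account key as a JSON string is an assumption. The google-cloud-storage import is deferred so that the filtering helper can be tried without the library installed.

```python
import json
from typing import List


def make_client(service_account_json: str):
    """Build a GCS client from a service account key passed as a JSON string (assumed format)."""
    # Imported lazily so list_csv_blob_names can be exercised without the library.
    from google.cloud import storage
    from google.oauth2 import service_account

    info = json.loads(service_account_json)
    credentials = service_account.Credentials.from_service_account_info(info)
    return storage.Client(credentials=credentials, project=credentials.project_id)


def list_csv_blob_names(client, bucket_name: str, prefix: str = "") -> List[str]:
    """Return the names of blobs under `prefix` whose name ends in .csv (case-insensitive)."""
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    # Folder placeholder objects such as "test_folder/" are dropped by the suffix check.
    return [blob.name for blob in blobs if blob.name.lower().endswith(".csv")]
```

With a real key, `list_csv_blob_names(make_client(key_json), "my-bucket", "test_folder/")` would return only the CSV objects under that prefix; the bucket and prefix names here are placeholders.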

Future improvements to be made:

  • Add other authentication methods
  • Support file compression
  • Support other file formats: JSON, Parquet, etc.
  • Handle larger files
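
For the larger-files item, one possible direction is to stream rows instead of loading a whole file into memory. The sketch below is an illustration only: it swaps pandas for the standard library `csv` module to stay dependency-free, `iter_records` is a hypothetical name, and the emitted dicts are a simplified stand-in for Airbyte record messages (fields such as the emission timestamp are omitted).

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def iter_records(file_obj: TextIO, stream_name: str) -> Iterator[Dict]:
    """Yield one record-like dict per CSV row, reading the file lazily."""
    reader = csv.DictReader(file_obj)  # the first row is used as the header
    for row in reader:
        # Values stay as strings, matching an all-string schema.
        yield {"type": "RECORD", "record": {"stream": stream_name, "data": dict(row)}}


if __name__ == "__main__":
    sample = io.StringIO("actor_id,first_name\n1,PENELOPE\n2,NICK\n")
    for message in iter_records(sample, "actor"):
        print(message)
```

Because `iter_records` is a generator, memory use depends on the row size rather than the file size. Staying with pandas, passing `chunksize` to `pandas.read_csv` gives a similar effect.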


@tuanchris tuanchris changed the title from "Source gcs" to "🎉 New Source: GCS" Feb 17, 2023
@tuanchris tuanchris marked this pull request as ready for review February 17, 2023 03:37
@YowanR
Contributor

YowanR commented Feb 22, 2023

@sh4sh @natalyjazzviolin can you please take a look? 🙏

@marcosmarxm marcosmarxm self-assigned this Feb 27, 2023
Member

@marcosmarxm marcosmarxm left a comment


Hello @tuanchris 👋 thanks for the contribution. I tried to test the connector. It works when reading a single file but not a folder.

In my case I created the folder test_folder and added two CSV files.
image

The config below doesn't work:

{
  "gcs_path": "test_folder/",
  "gcs_bucket": "airbyte-integration-test-source-gcs",
  "service_account": "..."
}

The documentation and setup instructions for the connector are also missing.

As it stands, the connector is no different from the File source with the GCS option. If the connector can read a folder, it can probably be accepted; a strategy for incremental reads would be even better.

@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Feb 28, 2023
@tuanchris
Contributor Author

Thanks @marcosmarxm for the review. I have:

  • Updated the logic of get_gcs_blobs to filter only .csv files
  • Added docs

The connector should work with multiple files in GCS now. Here's my GCS structure:

image

And here are the discovery results:

{
  "type": "CATALOG",
  "catalog": {
    "streams": [
      {
        "name": "film",
        "json_schema": {
          "$schema": "http://json-schema.org/draft-07/schema#",
          "type": "object",
          "properties": {
            "film_id": { "type": "string" },
            "title": { "type": "string" },
            "release_year": { "type": "string" },
            "language_id": { "type": "string" },
            "rental_duration": { "type": "string" },
            "rental_rate": { "type": "string" },
            "replacement_cost": { "type": "string" },
            "rating": { "type": "string" }
          }
        },
        "supported_sync_modes": ["full_refresh"]
      },
      {
        "name": "actor",
        "json_schema": {
          "$schema": "http://json-schema.org/draft-07/schema#",
          "type": "object",
          "properties": {
            "actor_id": { "type": "string" },
            "first_name": { "type": "string" },
            "last_name": { "type": "string" }
          }
        },
        "supported_sync_modes": ["full_refresh"]
      }
    ]
  }
}
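
The catalog above types every column as a string. A minimal sketch of how such a stream entry could be derived from the first bytes of a CSV file is shown below. It is not the connector's implementation: `infer_stream` and `SAMPLE_BYTES` are hypothetical names, and naming the stream after the file name without its extension is an assumption based on the `film` and `actor` streams shown.

```python
import csv
import io
import posixpath
from typing import Dict

SAMPLE_BYTES = 100_000  # roughly the 0.1 MB sample mentioned in the PR description


def infer_stream(blob_name: str, sample: bytes) -> Dict:
    """Build a catalog-style stream entry from the header row found in `sample`."""
    text = sample[:SAMPLE_BYTES].decode("utf-8", errors="ignore")
    # Raises StopIteration if the sample is empty; a real connector would handle that.
    header = next(csv.reader(io.StringIO(text)))
    # Assumption: "test_folder/film.csv" maps to the stream name "film".
    stream_name = posixpath.splitext(posixpath.basename(blob_name))[0]
    return {
        "name": stream_name,
        "json_schema": {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "type": "object",
            "properties": {column: {"type": "string"} for column in header},
        },
        "supported_sync_modes": ["full_refresh"],
    }
```

Only the header row is needed for an all-string schema, so a sample truncated mid-row still works as long as the first line is complete.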

@lazebnyi lazebnyi requested a review from davydov-d March 1, 2023 01:27
Co-authored-by: sh4sh <6833405+sh4sh@users.noreply.github.com>
Collaborator

@davydov-d davydov-d left a comment


I've left one more comment, otherwise looks good to me

@davydov-d
Collaborator

@marcosmarxm please proceed with this PR when you have a chance

cc @sh4sh can you pick up this one please? I guess @marcosmarxm is unavailable

@marcosmarxm
Member

marcosmarxm commented Mar 14, 2023

/test connector=connectors/source-gcs

🕑 connectors/source-gcs https://github.com/airbytehq/airbyte/actions/runs/4418340804
✅ connectors/source-gcs https://github.com/airbytehq/airbyte/actions/runs/4418340804
Python tests coverage:

Name                     Stmts   Miss  Cover
--------------------------------------------
source_gcs/__init__.py       2      0   100%
source_gcs/helpers.py       33     20    39%
source_gcs/source.py        38     27    29%
--------------------------------------------
TOTAL                       73     47    36%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/plugin.py:63: Skipping TestIncremental.test_two_sequential_reads: This connector does not implement incremental sync
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:103: The previous connector image could not be retrieved.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:512: The previous connector image could not be retrieved.
======================== 32 passed, 3 skipped in 24.79s ========================

@marcosmarxm
Member

marcosmarxm commented Mar 14, 2023

/publish connector=connectors/source-gcs

🕑 Publishing the following connectors:
connectors/source-gcs
https://github.com/airbytehq/airbyte/actions/runs/4418407804


Connector Did it publish? Were definitions generated?
connectors/source-gcs

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

Member

@marcosmarxm marcosmarxm left a comment


Thanks @tuanchris

@marcosmarxm marcosmarxm merged commit 200b035 into airbytehq:master Mar 14, 2023
adriennevermorel pushed a commit to adriennevermorel/airbyte that referenced this pull request Mar 17, 2023
* initial commit

* fix test error

* Update get_gcs_blobs logic

* add docs

* Update source_definitions.yaml

* Update airbyte-integrations/connectors/source-gcs/source_gcs/source.py

Co-authored-by: sh4sh <6833405+sh4sh@users.noreply.github.com>

* Update airbyte-config/init/src/main/resources/seed/source_definitions.yaml

Co-authored-by: Denys Davydov <davydov.den18@gmail.com>

* Update airbyte-integrations/connectors/source-gcs/source_gcs/helpers.py

Co-authored-by: Denys Davydov <davydov.den18@gmail.com>

* Update airbyte-integrations/connectors/source-gcs/source_gcs/helpers.py

Co-authored-by: Denys Davydov <davydov.den18@gmail.com>

* update docker file for pandas package

* reimplement read_csv file

* add logic to filter selected streams

* close file_obj after reading

* fix format and tests

* add another stream

* auto-bump connector version

---------

Co-authored-by: Sunny <6833405+sh4sh@users.noreply.github.com>
Co-authored-by: Denys Davydov <davydov.den18@gmail.com>
Co-authored-by: marcosmarxm <marcosmarxm@gmail.com>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
erohmensing pushed a commit that referenced this pull request Mar 22, 2023
@jrolom jrolom added the contributor-program PRs submitted through the contributor program. label Apr 25, 2023
@renatodossantosleal

Hi! Any chance this connector could also support Parquet?

@marcosmarxm
Member

@renatodossantosleal Can you open a feature request?

Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/gcs contributor-program PRs submitted through the contributor program.

8 participants