diff --git a/airbyte-integrations/connectors/source-google-drive/.dockerignore b/airbyte-integrations/connectors/source-google-drive/.dockerignore new file mode 100644 index 0000000000000..648834f6fa95e --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/.dockerignore @@ -0,0 +1,7 @@ +* +!Dockerfile +!Dockerfile.test +!main.py +!source_google_drive +!setup.py +!secrets diff --git a/airbyte-integrations/connectors/source-google-drive/README.md b/airbyte-integrations/connectors/source-google-drive/README.md new file mode 100644 index 0000000000000..f32bd6216161d --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/README.md @@ -0,0 +1,126 @@ +# Google Drive Source + +This is the repository for the Google Drive source connector, written in Python. +For information about how to use this connector within Airbyte, see [the documentation](https://docs.airbyte.io/integrations/sources/google-drive). + +## Local development + +### Prerequisites +**To iterate on this connector, make sure to complete this prerequisites section.** + +#### Minimum Python version required `= 3.10.0` + +#### Build & Activate Virtual Environment and install dependencies +From this connector directory, create a virtual environment: +``` +python -m venv .venv +``` + +This will generate a virtualenv for this module in `.venv/`. Make sure this venv is active in your +development environment of choice. To activate it from the terminal, run: +``` +source .venv/bin/activate +pip install -r requirements.txt +pip install '.[tests]' +``` +If you are in an IDE, follow your IDE's instructions to activate the virtualenv. + +Note that while we are installing dependencies from `requirements.txt`, you should only edit `setup.py` for your dependencies. `requirements.txt` is +used for editable installs (`pip install -e`) to pull in Python dependencies from the monorepo and will call `setup.py`. +If this is mumbo jumbo to you, don't worry about it, just put your deps in `setup.py` but install using `pip install -r requirements.txt` and everything +should work as you expect. + +#### Building via Gradle +You can also build the connector in Gradle. This is typically used in CI and not needed for your development workflow. + +To build using Gradle, from the Airbyte repository root, run: +``` +./gradlew :airbyte-integrations:connectors:source-google-drive:build +``` + +#### Create credentials +**If you are a community contributor**, follow the instructions in the [documentation](https://docs.airbyte.io/integrations/sources/google-drive) +to generate the necessary credentials. Then create a file `secrets/config.json` conforming to the `source_google_drive/spec.json` file. +Note that any directory named `secrets` is gitignored across the entire Airbyte repo, so there is no danger of accidentally checking in sensitive information. +See `integration_tests/sample_config.json` for a sample config file. + +**If you are an Airbyte core member**, copy the credentials in Lastpass under the secret name `source google-drive test creds` +and place them into `secrets/config.json`. 
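For reference, `secrets/config.json` has to match the shape defined in `source_google_drive/spec.json`: a `folder_url`, a `credentials` object, and at least one entry in `streams`. The sketch below only illustrates that shape for a service-account setup with a single JSONL stream; the folder URL, project ID, and stream name are placeholders, and `integration_tests/sample_config.json` remains the authoritative sample:

```python
import json
import os

# Placeholder values: replace the folder URL and service account key with your own.
config = {
    "folder_url": "https://drive.google.com/drive/folders/<your-folder-id>",
    "credentials": {
        "auth_type": "Service",
        # The spec expects the service account key as a JSON string, not a nested object.
        "service_account_info": json.dumps({"type": "service_account", "project_id": "<your-project>"}),
    },
    "streams": [
        {
            "name": "my_jsonl_stream",
            "globs": ["**/*.jsonl"],
            "format": {"filetype": "jsonl"},
            "validation_policy": "Emit Record",
        }
    ],
}

os.makedirs("secrets", exist_ok=True)
with open("secrets/config.json", "w") as f:
    json.dump(config, f, indent=2)
```

An OAuth-based config instead uses `"auth_type": "Client"` with `client_id`, `client_secret`, and `refresh_token`; see the spec for the full list of options.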
+ +### Locally running the connector +``` +python main.py spec +python main.py check --config secrets/config.json +python main.py discover --config secrets/config.json +python main.py read --config secrets/config.json --catalog integration_tests/configured_catalog.json +``` + +### Locally running the connector docker image + +#### Build +Build the docker image via `airbyte-ci`: +``` +airbyte-ci connectors --name=source-google-drive build +``` + +#### Run +Then run any of the connector commands as follows: +``` +docker run --rm airbyte/source-google-drive:dev spec +docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-google-drive:dev check --config /secrets/config.json +docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-google-drive:dev discover --config /secrets/config.json +docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/integration_tests:/integration_tests airbyte/source-google-drive:dev read --config /secrets/config.json --catalog /integration_tests/configured_catalog.json +``` +## Testing +Make sure to familiarize yourself with [pytest test discovery](https://docs.pytest.org/en/latest/goodpractices.html#test-discovery) to know how your test files and methods should be named. +First install test dependencies into your virtual environment: +``` +pip install '.[tests]' +``` +### Unit Tests +To run unit tests locally, from the connector directory, run: +``` +python -m pytest unit_tests +``` + +### Integration Tests +There are two types of integration tests: Acceptance Tests (Airbyte's test suite for all source connectors) and custom integration tests (which are specific to this connector). +#### Custom Integration Tests +Place custom tests inside the `integration_tests/` folder, then, from the connector root, run +``` +python -m pytest integration_tests +``` +#### Acceptance Tests +Customize the `acceptance-test-config.yml` file to configure the tests. See [Connector Acceptance Tests](https://docs.airbyte.io/connector-development/testing-connectors/connector-acceptance-tests-reference) for more information. +If your connector requires creating or destroying resources for use during acceptance tests, create fixtures for them and place them inside `integration_tests/acceptance.py`. +To run the acceptance tests against a locally built image, from the connector root, run +``` +docker build . --no-cache -t airbyte/source-google-drive:dev \ +&& python -m pytest -p connector_acceptance_test.plugin +``` + +### Using Gradle to run tests +All commands should be run from the Airbyte project root. +To run unit tests: +``` +./gradlew :airbyte-integrations:connectors:source-google-drive:unitTest +``` +To run acceptance and custom integration tests: +``` +./gradlew :airbyte-integrations:connectors:source-google-drive:integrationTest +``` + +## Dependency Management +All of your dependencies should go in `setup.py`, NOT `requirements.txt`. The requirements file is only used to connect internal Airbyte dependencies in the monorepo for local development. +We split dependencies into two groups: +* dependencies required for your connector to work go in the `MAIN_REQUIREMENTS` list. +* dependencies required for testing go in the `TEST_REQUIREMENTS` list. + +### Publishing a new version of the connector +You've checked out the repo, implemented a million-dollar feature, and you're ready to share your changes with the world. Now what? +1. Make sure your changes are passing unit and integration tests. +1. 
Bump the connector version in `Dockerfile` -- just increment the value of the `LABEL io.airbyte.version` appropriately (we use [SemVer](https://semver.org/)). +1. Create a Pull Request. +1. Pat yourself on the back for being an awesome contributor. +1. Someone from Airbyte will take a look at your PR and iterate with you to merge it into master. diff --git a/airbyte-integrations/connectors/source-google-drive/acceptance-test-config.yml b/airbyte-integrations/connectors/source-google-drive/acceptance-test-config.yml new file mode 100644 index 0000000000000..3ed802d16e22c --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/acceptance-test-config.yml @@ -0,0 +1,38 @@ +acceptance_tests: + basic_read: + tests: + - config_path: secrets/config.json + expect_records: + path: integration_tests/expected_records.jsonl + exact_order: true + timeout_seconds: 1800 + expect_trace_message_on_failure: false + + connection: + tests: + - config_path: secrets/config.json + status: succeed + - config_path: secrets/oauth_config.json + status: succeed + - config_path: integration_tests/invalid_config.json + status: failed + discovery: + tests: + - config_path: secrets/config.json + full_refresh: + tests: + - config_path: secrets/config.json + configured_catalog_path: integration_tests/configured_catalog.json + timeout_seconds: 1800 + + incremental: + tests: + - config_path: secrets/config.json + configured_catalog_path: integration_tests/configured_catalog.json + future_state: + future_state_path: integration_tests/abnormal_state.json + timeout_seconds: 1800 + spec: + tests: + - spec_path: integration_tests/spec.json +connector_image: airbyte/source-google-drive:dev diff --git a/airbyte-integrations/connectors/source-google-drive/acceptance-test-docker.sh b/airbyte-integrations/connectors/source-google-drive/acceptance-test-docker.sh new file mode 100755 index 0000000000000..5797d20fe9a78 --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/acceptance-test-docker.sh @@ -0,0 +1,2 @@ +#!/usr/bin/env sh +source "$(git rev-parse --show-toplevel)/airbyte-integrations/bases/connector-acceptance-test/acceptance-test-docker.sh" diff --git a/airbyte-integrations/connectors/source-google-drive/examples/configured_catalog.json b/airbyte-integrations/connectors/source-google-drive/examples/configured_catalog.json new file mode 100644 index 0000000000000..31f1bb6e02ee4 --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/examples/configured_catalog.json @@ -0,0 +1,13 @@ +{ + "streams": [ + { + "stream": { + "name": "my_file_stream", + "json_schema": {}, + "supported_sync_modes": ["full_refresh", "incremental"] + }, + "sync_mode": "incremental", + "destination_sync_mode": "overwrite" + } + ] +} diff --git a/airbyte-integrations/connectors/source-google-drive/examples/state.json b/airbyte-integrations/connectors/source-google-drive/examples/state.json new file mode 100644 index 0000000000000..341d344b3c0cb --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/examples/state.json @@ -0,0 +1,16 @@ +[ + { + "type": "STREAM", + "stream": { + "stream_descriptor": { + "name": "my_file_stream" + }, + "stream_state": { + "history": { + "https://drive.google.com/file/d/1kbXDc8UFkQcWCRfKKtSjCtlgD7EUXzuM": "2023-10-16T14:51:33.000000Z" + }, + "_ab_source_file_last_modified": "2023-10-16T14:51:33.000000Z_https://drive.google.com/file/d/1kbXDc8UFkQcWCRfKKtSjCtlgD7EUXzuM" + } + } + } +] diff --git a/airbyte-integrations/connectors/source-google-drive/icon.svg 
b/airbyte-integrations/connectors/source-google-drive/icon.svg new file mode 100644 index 0000000000000..1133b624babf2 --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/icon.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/airbyte-integrations/connectors/source-google-drive/integration_tests/__init__.py b/airbyte-integrations/connectors/source-google-drive/integration_tests/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/airbyte-integrations/connectors/source-google-drive/integration_tests/abnormal_state.json b/airbyte-integrations/connectors/source-google-drive/integration_tests/abnormal_state.json new file mode 100644 index 0000000000000..8a7cc94f2bbbd --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/integration_tests/abnormal_state.json @@ -0,0 +1,35 @@ +[ + { + "type": "STREAM", + "stream": { + "stream_state": { + "history": { + "test.jsonl": "2023-10-16T15:16:07.574000Z", + "subfolder/test2.jsonl": "2023-10-19T10:43:57.000000Z" + }, + "_ab_source_file_last_modified": "2023-10-19T10:43:57.000000Z_subfolder/test2.jsonl" + }, + "stream_descriptor": { + "name": "test" + } + } + }, + { + "type": "STREAM", + "stream": { + "stream_descriptor": { + "name": "test_unstructured" + }, + "stream_state": { + "_ab_source_file_last_modified": "2023-10-27T10:26:09.076000Z_testdoc_presentation.pdf", + "history": { + "testdoc_google": "2023-10-27T09:27:54.134000Z", + "testdoc_docx.docx": "2023-10-27T09:45:54.919000Z", + "testdoc_pdf.pdf": "2023-10-27T09:45:59.945000Z", + "testdoc_ocr_pdf.pdf": "2023-10-27T09:46:05.349000Z", + "testdoc_presentation": "2023-10-27T10:26:09.076000Z" + } + } + } + } +] diff --git a/airbyte-integrations/connectors/source-google-drive/integration_tests/acceptance.py b/airbyte-integrations/connectors/source-google-drive/integration_tests/acceptance.py new file mode 100644 index 0000000000000..6b0c294530cd2 --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/integration_tests/acceptance.py @@ -0,0 +1,16 @@ +# +# Copyright (c) 2023 Airbyte, Inc., all rights reserved. 
+# + + +from typing import Iterable + +import pytest + +pytest_plugins = ("connector_acceptance_test.plugin",) + + +@pytest.fixture(scope="session", autouse=True) +def connector_setup() -> Iterable[None]: + """This fixture is a placeholder for external resources that acceptance test might require.""" + yield diff --git a/airbyte-integrations/connectors/source-google-drive/integration_tests/configured_catalog.json b/airbyte-integrations/connectors/source-google-drive/integration_tests/configured_catalog.json new file mode 100644 index 0000000000000..de9b2ddee6a87 --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/integration_tests/configured_catalog.json @@ -0,0 +1,60 @@ +{ + "streams": [ + { + "stream": { + "name": "test", + "json_schema": { + "type": "object", + "properties": { + "y": { + "type": ["null", "integer"] + }, + "x": { + "type": ["null", "integer"] + }, + "_ab_source_file_last_modified": { + "type": "string", + "format": "date-time" + }, + "_ab_source_file_url": { + "type": "string" + } + } + }, + "supported_sync_modes": ["full_refresh", "incremental"], + "source_defined_cursor": true, + "default_cursor_field": ["_ab_source_file_last_modified"] + }, + "sync_mode": "incremental", + "destination_sync_mode": "append" + }, + { + "stream": { + "name": "test_unstructured", + "json_schema": { + "type": "object", + "properties": { + "document_key": { + "type": ["null", "integer"] + }, + "content": { + "type": ["null", "integer"] + }, + "_ab_source_file_last_modified": { + "type": "string", + "format": "date-time" + }, + "_ab_source_file_url": { + "type": "string" + } + } + }, + "supported_sync_modes": ["full_refresh", "incremental"], + "source_defined_cursor": true, + "default_cursor_field": ["_ab_source_file_last_modified"] + }, + "sync_mode": "incremental", + "destination_sync_mode": "append" + } + ] +} diff --git a/airbyte-integrations/connectors/source-google-drive/integration_tests/expected_records.jsonl b/airbyte-integrations/connectors/source-google-drive/integration_tests/expected_records.jsonl new file mode 100644 index 0000000000000..77e061a23dfd2 --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/integration_tests/expected_records.jsonl @@ -0,0 +1,9 @@ +{"stream": "test", "data": {"x": 999, "_ab_source_file_last_modified": "2023-10-16T15:16:07.574000Z", "_ab_source_file_url": "test.jsonl"}, "emitted_at": 162727468000} +{"stream": "test", "data": {"x": 9999, "_ab_source_file_last_modified": "2023-10-16T15:16:07.574000Z", "_ab_source_file_url": "test.jsonl"}, "emitted_at": 162727468000} +{"stream": "test", "data": {"y": 9999, "_ab_source_file_last_modified": "2023-10-19T10:43:57.000000Z", "_ab_source_file_url": "subfolder/test2.jsonl"}, "emitted_at": 162727468000} +{"stream": "test", "data": {"y": 123, "_ab_source_file_last_modified": "2023-10-19T10:43:57.000000Z", "_ab_source_file_url": "subfolder/test2.jsonl"}, "emitted_at": 162727468000} +{"stream": "test_unstructured", "data": {"content": "# Heading\n\nThis is the content which is not just a single word", "document_key": "testdoc_google", "_ab_source_file_last_modified": "2023-10-27T09:27:54.134000Z", "_ab_source_file_url": "testdoc_google"}, "emitted_at": 1698400261074} +{"stream": "test_unstructured", "data": {"content": "# Heading\n\nThis is the content which is not just a single word", "document_key": "testdoc_docx.docx", "_ab_source_file_last_modified": "2023-10-27T09:45:54.919000Z", "_ab_source_file_url": "testdoc_docx.docx"}, "emitted_at": 1698400261867} +{"stream": 
"test_unstructured", "data": {"content": "# Heading\n\nThis is the content which is not just a single word", "document_key": "testdoc_pdf.pdf", "_ab_source_file_last_modified": "2023-10-27T09:45:59.945000Z", "_ab_source_file_url": "testdoc_pdf.pdf"}, "emitted_at": 1698400264556} +{"stream": "test_unstructured", "data": {"content": "This is a test", "document_key": "testdoc_ocr_pdf.pdf", "_ab_source_file_last_modified": "2023-10-27T09:46:05.349000Z", "_ab_source_file_url": "testdoc_ocr_pdf.pdf"}, "emitted_at": 1698400267184} +{"stream": "test_unstructured", "data": {"content": "This is a test", "document_key": "testdoc_presentation", "_ab_source_file_last_modified": "2023-10-27T10:26:09.076000Z", "_ab_source_file_url": "testdoc_presentation"}, "emitted_at": 1698402779268} \ No newline at end of file diff --git a/airbyte-integrations/connectors/source-google-drive/integration_tests/invalid_config.json b/airbyte-integrations/connectors/source-google-drive/integration_tests/invalid_config.json new file mode 100644 index 0000000000000..eca04cd9ad7df --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/integration_tests/invalid_config.json @@ -0,0 +1,19 @@ +{ + "folder_url": "https://drive.google.com/drive/folders/yyy", + "credentials": { + "auth_type": "Service", + "service_account_info": "abc" + }, + "streams": [ + { + "name": "test", + "globs": ["**/*.jsonl"], + "format": { + "filetype": "jsonl" + }, + "schemaless": false, + "validation_policy": "Emit Record", + "days_to_sync_if_history_is_full": 3 + } + ] +} diff --git a/airbyte-integrations/connectors/source-google-drive/integration_tests/spec.json b/airbyte-integrations/connectors/source-google-drive/integration_tests/spec.json new file mode 100644 index 0000000000000..95ecc1a0b4b3f --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/integration_tests/spec.json @@ -0,0 +1,398 @@ +{ + "documentationUrl": "https://docs.airbyte.com/integrations/sources/google-drive", + "connectionSpecification": { + "title": "Google Drive Source Spec", + "description": "Used during spec; allows the developer to configure the cloud provider specific options\nthat are needed when users configure a file-based source.", + "type": "object", + "properties": { + "start_date": { + "title": "Start Date", + "description": "UTC date and time in the format 2017-01-25T00:00:00.000000Z. Any file modified before this date will not be replicated.", + "examples": ["2021-01-01T00:00:00.000000Z"], + "format": "date-time", + "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{6}Z$", + "pattern_descriptor": "YYYY-MM-DDTHH:mm:ss.SSSSSSZ", + "order": 1, + "type": "string" + }, + "streams": { + "title": "The list of streams to sync", + "description": "Each instance of this configuration defines a stream. Use this to define which files belong in the stream, their format, and how they should be parsed and validated. When sending data to warehouse destination such as Snowflake or BigQuery, each stream is a separate table.", + "order": 10, + "type": "array", + "items": { + "title": "FileBasedStreamConfig", + "type": "object", + "properties": { + "name": { + "title": "Name", + "description": "The name of the stream.", + "type": "string" + }, + "globs": { + "title": "Globs", + "description": "The pattern used to specify which files should be selected from the file system. 
For more information on glob pattern matching look here.", + "type": "array", + "items": { + "type": "string" + } + }, + "validation_policy": { + "title": "Validation Policy", + "description": "The name of the validation policy that dictates sync behavior when a record does not adhere to the stream schema.", + "default": "Emit Record", + "enum": ["Emit Record", "Skip Record", "Wait for Discover"] + }, + "input_schema": { + "title": "Input Schema", + "description": "The schema that will be used to validate records extracted from the file. This will override the stream schema that is auto-detected from incoming files.", + "type": "string" + }, + "primary_key": { + "title": "Primary Key", + "description": "The column or columns (for a composite key) that serves as the unique identifier of a record.", + "type": "string" + }, + "days_to_sync_if_history_is_full": { + "title": "Days To Sync If History Is Full", + "description": "When the state history of the file store is full, syncs will only read files that were last modified in the provided day range.", + "default": 3, + "type": "integer" + }, + "format": { + "title": "Format", + "description": "The configuration options that are used to alter how to read incoming files that deviate from the standard formatting.", + "type": "object", + "oneOf": [ + { + "title": "Avro Format", + "type": "object", + "properties": { + "filetype": { + "title": "Filetype", + "default": "avro", + "const": "avro", + "type": "string" + }, + "double_as_string": { + "title": "Convert Double Fields to Strings", + "description": "Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision because there can be a loss precision when handling floating point numbers.", + "default": false, + "type": "boolean" + } + } + }, + { + "title": "CSV Format", + "type": "object", + "properties": { + "filetype": { + "title": "Filetype", + "default": "csv", + "const": "csv", + "type": "string" + }, + "delimiter": { + "title": "Delimiter", + "description": "The character delimiting individual cells in the CSV data. This may only be a 1-character string. For tab-delimited data enter '\\t'.", + "default": ",", + "type": "string" + }, + "quote_char": { + "title": "Quote Character", + "description": "The character used for quoting CSV values. To disallow quoting, make this field blank.", + "default": "\"", + "type": "string" + }, + "escape_char": { + "title": "Escape Character", + "description": "The character used for escaping special characters. To disallow escaping, leave this field blank.", + "type": "string" + }, + "encoding": { + "title": "Encoding", + "description": "The character encoding of the CSV data. Leave blank to default to UTF8. See list of python encodings for allowable options.", + "default": "utf8", + "type": "string" + }, + "double_quote": { + "title": "Double Quote", + "description": "Whether two quotes in a quoted CSV value denote a single quote in the data.", + "default": true, + "type": "boolean" + }, + "null_values": { + "title": "Null Values", + "description": "A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.", + "default": [], + "type": "array", + "items": { + "type": "string" + }, + "uniqueItems": true + }, + "strings_can_be_null": { + "title": "Strings Can Be Null", + "description": "Whether strings can be interpreted as null values. 
If true, strings that match the null_values set will be interpreted as null. If false, strings that match the null_values set will be interpreted as the string itself.", + "default": true, + "type": "boolean" + }, + "skip_rows_before_header": { + "title": "Skip Rows Before Header", + "description": "The number of rows to skip before the header row. For example, if the header row is on the 3rd row, enter 2 in this field.", + "default": 0, + "type": "integer" + }, + "skip_rows_after_header": { + "title": "Skip Rows After Header", + "description": "The number of rows to skip after the header row.", + "default": 0, + "type": "integer" + }, + "header_definition": { + "title": "CSV Header Definition", + "description": "How headers will be defined. `User Provided` assumes the CSV does not have a header row and uses the headers provided and `Autogenerated` assumes the CSV does not have a header row and the CDK will generate headers using for `f{i}` where `i` is the index starting from 0. Else, the default behavior is to use the header from the CSV file. If a user wants to autogenerate or provide column names for a CSV having headers, they can skip rows.", + "default": { + "header_definition_type": "From CSV" + }, + "oneOf": [ + { + "title": "From CSV", + "type": "object", + "properties": { + "header_definition_type": { + "title": "Header Definition Type", + "default": "From CSV", + "const": "From CSV", + "type": "string" + } + } + }, + { + "title": "Autogenerated", + "type": "object", + "properties": { + "header_definition_type": { + "title": "Header Definition Type", + "default": "Autogenerated", + "const": "Autogenerated", + "type": "string" + } + } + }, + { + "title": "User Provided", + "type": "object", + "properties": { + "header_definition_type": { + "title": "Header Definition Type", + "default": "User Provided", + "const": "User Provided", + "type": "string" + }, + "column_names": { + "title": "Column Names", + "description": "The column names that will be used while emitting the CSV records", + "type": "array", + "items": { + "type": "string" + } + } + }, + "required": ["column_names"] + } + ], + "type": "object" + }, + "true_values": { + "title": "True Values", + "description": "A set of case-sensitive strings that should be interpreted as true values.", + "default": ["y", "yes", "t", "true", "on", "1"], + "type": "array", + "items": { + "type": "string" + }, + "uniqueItems": true + }, + "false_values": { + "title": "False Values", + "description": "A set of case-sensitive strings that should be interpreted as false values.", + "default": ["n", "no", "f", "false", "off", "0"], + "type": "array", + "items": { + "type": "string" + }, + "uniqueItems": true + } + } + }, + { + "title": "Jsonl Format", + "type": "object", + "properties": { + "filetype": { + "title": "Filetype", + "default": "jsonl", + "const": "jsonl", + "type": "string" + } + } + }, + { + "title": "Parquet Format", + "type": "object", + "properties": { + "filetype": { + "title": "Filetype", + "default": "parquet", + "const": "parquet", + "type": "string" + }, + "decimal_as_float": { + "title": "Convert Decimal Fields to Floats", + "description": "Whether to convert decimal fields to floats. 
There is a loss of precision when converting decimals to floats, so this is not recommended.", + "default": false, + "type": "boolean" + } + } + }, + { + "title": "Document File Type Format (Experimental)", + "type": "object", + "properties": { + "filetype": { + "title": "Filetype", + "default": "unstructured", + "const": "unstructured", + "type": "string" + } + }, + "description": "Extract text from document formats (.pdf, .docx, .md, .pptx) and emit as one record per file." + } + ] + }, + "schemaless": { + "title": "Schemaless", + "description": "When enabled, syncs will not validate or structure records against the stream's schema.", + "default": false, + "type": "boolean" + } + }, + "required": ["name", "format"] + } + }, + "folder_url": { + "title": "Folder Url", + "description": "URL for the folder you want to sync. Using individual streams and glob patterns, it's possible to only sync a subset of all files located in the folder.", + "examples": [ + "https://drive.google.com/drive/folders/1Xaz0vXXXX2enKnNYU5qSt9NS70gvMyYn" + ], + "order": 0, + "type": "string" + }, + "credentials": { + "title": "Authentication", + "description": "Credentials for connecting to the Google Drive API", + "type": "object", + "oneOf": [ + { + "title": "Authenticate via Google (OAuth)", + "type": "object", + "properties": { + "auth_type": { + "title": "Auth Type", + "default": "Client", + "const": "Client", + "enum": ["Client"], + "type": "string" + }, + "client_id": { + "title": "Client ID", + "description": "Client ID for the Google Drive API", + "airbyte_secret": true, + "type": "string" + }, + "client_secret": { + "title": "Client Secret", + "description": "Client Secret for the Google Drive API", + "airbyte_secret": true, + "type": "string" + }, + "refresh_token": { + "title": "Refresh Token", + "description": "Refresh Token for the Google Drive API", + "airbyte_secret": true, + "type": "string" + } + }, + "required": ["client_id", "client_secret", "refresh_token"] + }, + { + "title": "Service Account Key Authentication", + "type": "object", + "properties": { + "auth_type": { + "title": "Auth Type", + "default": "Service", + "const": "Service", + "enum": ["Service"], + "type": "string" + }, + "service_account_info": { + "title": "Service Account Information", + "description": "The JSON key of the service account to use for authorization. 
Read more here.", + "airbyte_secret": true, + "type": "string" + } + }, + "required": ["service_account_info"] + } + ] + } + }, + "required": ["streams", "folder_url", "credentials"] + }, + "advanced_auth": { + "auth_flow_type": "oauth2.0", + "predicate_key": ["credentials", "auth_type"], + "predicate_value": "Client", + "oauth_config_specification": { + "complete_oauth_output_specification": { + "type": "object", + "additionalProperties": false, + "properties": { + "refresh_token": { + "type": "string", + "path_in_connector_config": ["credentials", "refresh_token"] + } + } + }, + "complete_oauth_server_input_specification": { + "type": "object", + "additionalProperties": false, + "properties": { + "client_id": { + "type": "string" + }, + "client_secret": { + "type": "string" + } + } + }, + "complete_oauth_server_output_specification": { + "type": "object", + "additionalProperties": false, + "properties": { + "client_id": { + "type": "string", + "path_in_connector_config": ["credentials", "client_id"] + }, + "client_secret": { + "type": "string", + "path_in_connector_config": ["credentials", "client_secret"] + } + } + } + } + } +} diff --git a/airbyte-integrations/connectors/source-google-drive/main.py b/airbyte-integrations/connectors/source-google-drive/main.py new file mode 100644 index 0000000000000..4d051be2ff54d --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/main.py @@ -0,0 +1,16 @@ +# +# Copyright (c) 2023 Airbyte, Inc., all rights reserved. +# + + +import sys + +from airbyte_cdk import AirbyteEntrypoint +from airbyte_cdk.entrypoint import launch +from source_google_drive import SourceGoogleDrive + +if __name__ == "__main__": + args = sys.argv[1:] + catalog_path = AirbyteEntrypoint.extract_catalog(args) + source = SourceGoogleDrive(catalog_path) + launch(source, args) diff --git a/airbyte-integrations/connectors/source-google-drive/metadata.yaml b/airbyte-integrations/connectors/source-google-drive/metadata.yaml new file mode 100644 index 0000000000000..4559f077b1e1c --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/metadata.yaml @@ -0,0 +1,26 @@ +data: + allowedHosts: + hosts: + - "www.googleapis.com" + connectorBuildOptions: + baseImage: docker.io/airbyte/python-connector-base:1.2.0@sha256:c22a9d97464b69d6ef01898edf3f8612dc11614f05a84984451dde195f337db9 + connectorSubtype: file + connectorType: source + definitionId: 9f8dda77-1048-4368-815b-269bf54ee9b8 + dockerImageTag: 0.0.1 + dockerRepository: airbyte/source-google-drive + githubIssueLabel: source-google-drive + icon: google-drive.svg + license: ELv2 + name: Google Drive + registries: + cloud: + enabled: true + oss: + enabled: true + releaseStage: alpha + documentationUrl: https://docs.airbyte.com/integrations/sources/google-drive + tags: + - language:python + supportLevel: community +metadataSpecVersion: "1.0" diff --git a/airbyte-integrations/connectors/source-google-drive/requirements.txt b/airbyte-integrations/connectors/source-google-drive/requirements.txt new file mode 100644 index 0000000000000..d6e1198b1ab1f --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/requirements.txt @@ -0,0 +1 @@ +-e . diff --git a/airbyte-integrations/connectors/source-google-drive/setup.py b/airbyte-integrations/connectors/source-google-drive/setup.py new file mode 100644 index 0000000000000..5928a62fc48d6 --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/setup.py @@ -0,0 +1,32 @@ +# +# Copyright (c) 2023 Airbyte, Inc., all rights reserved. 
+# + + +from setuptools import find_packages, setup + +MAIN_REQUIREMENTS = [ + "airbyte-cdk[file-based]>=0.52.9", + "google-api-python-client==2.104.0", + "google-auth-httplib2==0.1.1", + "google-auth-oauthlib==1.1.0", + "google-api-python-client-stubs==1.18.0", +] + +TEST_REQUIREMENTS = [ + "pytest-mock~=3.6.1", + "pytest~=6.1", +] + +setup( + name="source_google_drive", + description="Source implementation for Google Drive.", + author="Airbyte", + author_email="contact@airbyte.io", + packages=find_packages(), + install_requires=MAIN_REQUIREMENTS, + package_data={"": ["*.json", "schemas/*.json", "schemas/shared/*.json"]}, + extras_require={ + "tests": TEST_REQUIREMENTS, + }, +) diff --git a/airbyte-integrations/connectors/source-google-drive/source_google_drive/__init__.py b/airbyte-integrations/connectors/source-google-drive/source_google_drive/__init__.py new file mode 100644 index 0000000000000..db5171dbaa577 --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/source_google_drive/__init__.py @@ -0,0 +1,6 @@ +# +# Copyright (c) 2023 Airbyte, Inc., all rights reserved. +# +from .source import SourceGoogleDrive + +__all__ = ["SourceGoogleDrive"] diff --git a/airbyte-integrations/connectors/source-google-drive/source_google_drive/source.py b/airbyte-integrations/connectors/source-google-drive/source_google_drive/source.py new file mode 100644 index 0000000000000..fe49fba7fe8ce --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/source_google_drive/source.py @@ -0,0 +1,55 @@ +# +# Copyright (c) 2023 Airbyte, Inc., all rights reserved. +# +from typing import Any + +from airbyte_cdk.models import AdvancedAuth, ConnectorSpecification, OAuthConfigSpecification +from airbyte_cdk.sources.file_based.file_based_source import FileBasedSource +from airbyte_cdk.sources.file_based.stream.cursor.default_file_based_cursor import DefaultFileBasedCursor +from source_google_drive.spec import SourceGoogleDriveSpec +from source_google_drive.stream_reader import SourceGoogleDriveStreamReader + + +class SourceGoogleDrive(FileBasedSource): + def __init__(self, catalog_path: str): + super().__init__( + stream_reader=SourceGoogleDriveStreamReader(), + spec_class=SourceGoogleDriveSpec, + catalog_path=catalog_path, + cursor_cls=DefaultFileBasedCursor, + ) + + def spec(self, *args: Any, **kwargs: Any) -> ConnectorSpecification: + """ + Returns the specification describing what fields can be configured by a user when setting up a file-based source. 
+ """ + + return ConnectorSpecification( + documentationUrl=self.spec_class.documentation_url(), + connectionSpecification=self.spec_class.schema(), + advanced_auth=AdvancedAuth( + auth_flow_type="oauth2.0", + predicate_key=["credentials", "auth_type"], + predicate_value="Client", + oauth_config_specification=OAuthConfigSpecification( + complete_oauth_output_specification={ + "type": "object", + "additionalProperties": False, + "properties": {"refresh_token": {"type": "string", "path_in_connector_config": ["credentials", "refresh_token"]}}, + }, + complete_oauth_server_input_specification={ + "type": "object", + "additionalProperties": False, + "properties": {"client_id": {"type": "string"}, "client_secret": {"type": "string"}}, + }, + complete_oauth_server_output_specification={ + "type": "object", + "additionalProperties": False, + "properties": { + "client_id": {"type": "string", "path_in_connector_config": ["credentials", "client_id"]}, + "client_secret": {"type": "string", "path_in_connector_config": ["credentials", "client_secret"]}, + }, + }, + ), + ), + ) diff --git a/airbyte-integrations/connectors/source-google-drive/source_google_drive/spec.py b/airbyte-integrations/connectors/source-google-drive/source_google_drive/spec.py new file mode 100644 index 0000000000000..47809289db15c --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/source_google_drive/spec.py @@ -0,0 +1,83 @@ +# +# Copyright (c) 2023 Airbyte, Inc., all rights reserved. +# + + +from typing import Any, Dict, Literal, Union + +import dpath.util +from airbyte_cdk.sources.file_based.config.abstract_file_based_spec import AbstractFileBasedSpec +from pydantic import BaseModel, Field + + +class OAuthCredentials(BaseModel): + class Config: + title = "Authenticate via Google (OAuth)" + + auth_type: Literal["Client"] = Field("Client", const=True) + client_id: str = Field( + title="Client ID", + description="Client ID for the Google Drive API", + airbyte_secret=True, + ) + client_secret: str = Field( + title="Client Secret", + description="Client Secret for the Google Drive API", + airbyte_secret=True, + ) + refresh_token: str = Field( + title="Refresh Token", + description="Refresh Token for the Google Drive API", + airbyte_secret=True, + ) + + +class ServiceAccountCredentials(BaseModel): + class Config: + title = "Service Account Key Authentication" + + auth_type: Literal["Service"] = Field("Service", const=True) + service_account_info: str = Field( + title="Service Account Information", + description='The JSON key of the service account to use for authorization. Read more here.', + airbyte_secret=True, + ) + + +class SourceGoogleDriveSpec(AbstractFileBasedSpec, BaseModel): + class Config: + title = "Google Drive Source Spec" + + folder_url: str = Field( + description="URL for the folder you want to sync. 
Using individual streams and glob patterns, it's possible to only sync a subset of all files located in the folder.", + examples=["https://drive.google.com/drive/folders/1Xaz0vXXXX2enKnNYU5qSt9NS70gvMyYn"], + order=0, + ) + + credentials: Union[OAuthCredentials, ServiceAccountCredentials] = Field( + title="Authentication", description="Credentials for connecting to the Google Drive API", discriminator="auth_type", type="object" + ) + + @classmethod + def documentation_url(cls) -> str: + return "https://docs.airbyte.com/integrations/sources/google-drive" + + @staticmethod + def remove_discriminator(schema: dict) -> None: + """pydantic adds "discriminator" to the schema for oneOfs, which is not treated right by the platform as we inline all references""" + dpath.util.delete(schema, "properties/*/discriminator") + + @classmethod + def schema(cls, *args: Any, **kwargs: Any) -> Dict[str, Any]: + """ + Generates the mapping comprised of the config fields + """ + schema = super().schema(*args, **kwargs) + + cls.remove_discriminator(schema) + + # Remove legacy settings + dpath.util.delete(schema, "properties/streams/items/properties/legacy_prefix") + dpath.util.delete(schema, "properties/streams/items/properties/format/oneOf/*/properties/inference_type") + + return schema diff --git a/airbyte-integrations/connectors/source-google-drive/source_google_drive/stream_reader.py b/airbyte-integrations/connectors/source-google-drive/source_google_drive/stream_reader.py new file mode 100644 index 0000000000000..ae33a54ab015d --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/source_google_drive/stream_reader.py @@ -0,0 +1,194 @@ +# +# Copyright (c) 2023 Airbyte, Inc., all rights reserved. +# + +import io +import json +import logging +import re +from datetime import datetime +from io import IOBase +from typing import Iterable, List, Optional, Set + +from airbyte_cdk.sources.file_based.file_based_stream_reader import AbstractFileBasedStreamReader, FileReadMode +from airbyte_cdk.sources.file_based.remote_file import RemoteFile +from airbyte_cdk.utils.traced_exception import AirbyteTracedException, FailureType +from google.oauth2 import credentials, service_account +from googleapiclient.discovery import build +from googleapiclient.http import MediaIoBaseDownload + +from .spec import SourceGoogleDriveSpec + +FOLDER_MIME_TYPE = "application/vnd.google-apps.folder" +GOOGLE_DOC_MIME_TYPE = "application/vnd.google-apps.document" +EXPORTABLE_DOCUMENTS_MIME_TYPES = [ + GOOGLE_DOC_MIME_TYPE, + "application/vnd.google-apps.presentation", + "application/vnd.google-apps.drawing", +] + + +class GoogleDriveRemoteFile(RemoteFile): + id: str + # The mime type of the file as returned by the Google Drive API + # This is not the same as the mime type when opened by the parser (e.g. google docs is exported as docx) + original_mime_type: str + + +class SourceGoogleDriveStreamReader(AbstractFileBasedStreamReader): + def __init__(self): + super().__init__() + self._drive_service = None + + @property + def config(self) -> SourceGoogleDriveSpec: + return self._config + + @config.setter + def config(self, value: SourceGoogleDriveSpec): + """ + FileBasedSource reads the config from disk and parses it, and once parsed, the source sets the config on its StreamReader. + + Note: FileBasedSource only requires the keys defined in the abstract config, whereas concrete implementations of StreamReader + will require keys that (for example) allow it to authenticate with the 3rd party. 
+ + Therefore, concrete implementations of AbstractFileBasedStreamReader's config setter should assert that `value` is of the correct + config type for that type of StreamReader. + """ + assert isinstance(value, SourceGoogleDriveSpec) + self._config = value + + @property + def google_drive_service(self): + if self.config is None: + # We shouldn't hit this; config should always get set before attempting to + # list or read files. + raise ValueError("Source config is missing; cannot create the Google Drive client.") + try: + if self._drive_service is None: + if self.config.credentials.auth_type == "Client": + creds = credentials.Credentials.from_authorized_user_info(self.config.credentials.dict()) + else: + creds = service_account.Credentials.from_service_account_info(json.loads(self.config.credentials.service_account_info)) + self._drive_service = build("drive", "v3", credentials=creds) + except Exception as e: + raise AirbyteTracedException( + internal_message=str(e), + message="Could not authenticate with Google Drive. Please check your credentials.", + failure_type=FailureType.config_error, + exception=e, + ) + + return self._drive_service + + def get_matching_files(self, globs: List[str], prefix: Optional[str], logger: logging.Logger) -> Iterable[RemoteFile]: + """ + Get all files matching the specified glob patterns. + """ + service = self.google_drive_service + root_folder_id = self._get_folder_id(self.config.folder_url) + # ignore prefix argument as it's legacy only and this is a new connector + prefixes = self.get_prefixes_from_globs(globs) + + folder_id_queue = [("", root_folder_id)] + seen: Set[str] = set() + while len(folder_id_queue) > 0: + (path, folder_id) = folder_id_queue.pop() + # fetch all files in this folder (1000 is the max page size) + request = service.files().list( + q=f"'{folder_id}' in parents", pageSize=1000, fields="nextPageToken, files(id, name, modifiedTime, mimeType)" + ) + while True: + results = request.execute() + new_files = results.get("files", []) + for new_file in new_files: + # It's possible files and folders are linked up multiple times, this prevents us from getting stuck in a loop + if new_file["id"] in seen: + continue + seen.add(new_file["id"]) + file_name = path + new_file["name"] + if new_file["mimeType"] == FOLDER_MIME_TYPE: + folder_name = f"{file_name}/" + # check prefix matching in both directions to handle + prefix_matches_folder_name = any(prefix.startswith(folder_name) for prefix in prefixes) + folder_name_matches_prefix = any(folder_name.startswith(prefix) for prefix in prefixes) + if prefix_matches_folder_name or folder_name_matches_prefix or len(prefixes) == 0: + folder_id_queue.append((folder_name, new_file["id"])) + continue + else: + last_modified = datetime.strptime(new_file["modifiedTime"], "%Y-%m-%dT%H:%M:%S.%fZ") + original_mime_type = new_file["mimeType"] + mime_type = ( + self._get_export_mime_type(original_mime_type) + if self._is_exportable_document(original_mime_type) + else original_mime_type + ) + remote_file = GoogleDriveRemoteFile( + uri=file_name, + last_modified=last_modified, + id=new_file["id"], + original_mime_type=original_mime_type, + mime_type=mime_type, + ) + if self.file_matches_globs(remote_file, globs): + yield remote_file + request = service.files().list_next(request, results) + if request is None: + break + + def _get_folder_id(self, url): + # Regular expression pattern to check the URL structure and extract the ID + pattern = r"^https://drive\.google\.com/drive/folders/([a-zA-Z0-9_-]+)$" + + # Find the 
pattern in the URL + match = re.search(pattern, url) + + if match: + # The matched group is the ID + drive_id = match.group(1) + return drive_id + else: + # If no match is found + raise ValueError(f"Could not extract folder ID from {url}") + + def _is_exportable_document(self, mime_type: str): + """ + Returns true if the given file is a Google App document that can be exported. + """ + return mime_type in EXPORTABLE_DOCUMENTS_MIME_TYPES + + def open_file(self, file: GoogleDriveRemoteFile, mode: FileReadMode, encoding: Optional[str], logger: logging.Logger) -> IOBase: + if self._is_exportable_document(file.original_mime_type): + if mode == FileReadMode.READ: + raise ValueError( + "Google Docs/Drawings/Presentations can only be processed using the document file type format. Please set the format accordingly or adjust the glob pattern." + ) + request = self.google_drive_service.files().export_media(fileId=file.id, mimeType=file.mime_type) + else: + request = self.google_drive_service.files().get_media(fileId=file.id) + handle = io.BytesIO() + downloader = MediaIoBaseDownload(handle, request) + done = False + while done is False: + _, done = downloader.next_chunk() + + handle.seek(0) + + if mode == FileReadMode.READ_BINARY: + return handle + else: + # repack the bytes into a string with the right encoding + text_handle = io.StringIO(handle.read().decode(encoding or "utf-8")) + handle.close() + return text_handle + + def _get_export_mime_type(self, original_mime_type: str): + """ + Returns the mime type to export Google App documents as. + + Google Docs are exported as Docx to preserve as much formatting as possible, everything else goes through PDF. + """ + if original_mime_type.startswith(GOOGLE_DOC_MIME_TYPE): + return "application/vnd.openxmlformats-officedocument.wordprocessingml.document" + else: + return "application/pdf" diff --git a/airbyte-integrations/connectors/source-google-drive/unit_tests/test_reader.py b/airbyte-integrations/connectors/source-google-drive/unit_tests/test_reader.py new file mode 100644 index 0000000000000..c3b095d3a5d18 --- /dev/null +++ b/airbyte-integrations/connectors/source-google-drive/unit_tests/test_reader.py @@ -0,0 +1,643 @@ +# +# Copyright (c) 2023 Airbyte, Inc., all rights reserved. 
+# + +import datetime +from unittest.mock import MagicMock, call, patch + +import pytest +from airbyte_cdk.sources.file_based.config.file_based_stream_config import FileBasedStreamConfig +from airbyte_cdk.sources.file_based.config.jsonl_format import JsonlFormat +from airbyte_cdk.sources.file_based.file_based_stream_reader import FileReadMode +from source_google_drive.spec import ServiceAccountCredentials, SourceGoogleDriveSpec +from source_google_drive.stream_reader import GoogleDriveRemoteFile, SourceGoogleDriveStreamReader + + +def create_reader( + config=SourceGoogleDriveSpec( + folder_url="https://drive.google.com/drive/folders/1Z2Q3", + streams=[FileBasedStreamConfig(name="test", format=JsonlFormat())], + credentials=ServiceAccountCredentials(auth_type="Service", service_account_info='{"test": "abc"}'), + ) +): + reader = SourceGoogleDriveStreamReader() + reader.config = config + + return reader + + +def flatten_list(list_of_lists): + return [item for sublist in list_of_lists for item in sublist] + + +@pytest.mark.parametrize( + "glob, listing_results, matched_files", + [ + pytest.param( + "*", + [[{"files": [{"id": "abc", "mimeType": "text/csv", "name": "test.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}]}]], + [ + GoogleDriveRemoteFile( + uri="test.csv", + id="abc", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ) + ], + id="Single file", + ), + pytest.param( + "*", + [ + [ + { + "files": [ + {"id": "abc", "mimeType": "text/csv", "name": "test.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + {"id": "def", "mimeType": "text/csv", "name": "another_file.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + ] + }, + ] + ], + [ + GoogleDriveRemoteFile( + uri="test.csv", + id="abc", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + GoogleDriveRemoteFile( + uri="another_file.csv", + id="def", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + ], + id="Multiple files", + ), + pytest.param( + "*", + [ + [ + {"files": [{"id": "abc", "mimeType": "text/csv", "name": "test.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}]}, + { + "files": [ + {"id": "def", "mimeType": "text/csv", "name": "another_file.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"} + ] + }, + ] + ], + [ + GoogleDriveRemoteFile( + uri="test.csv", + id="abc", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + GoogleDriveRemoteFile( + uri="another_file.csv", + id="def", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + ], + id="Multiple pages", + ), + pytest.param( + "*", + [ + [ + {"files": []}, + ] + ], + [], + id="No files", + ), + pytest.param( + "**/*", + [ + [ + { + "files": [ + {"id": "abc", "mimeType": "text/csv", "name": "test.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + { + "id": "sub", + "mimeType": "application/vnd.google-apps.folder", + "name": "subfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + [ + # second request is for requesting the subfolder + { + "files": [ + {"id": "def", "mimeType": "text/csv", "name": "another_file.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + { + "id": "subsub", + "mimeType": "application/vnd.google-apps.folder", + "name": "subsubfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + [ + # third request is for requesting 
the subsubfolder + { + "files": [ + { + "id": "ghi", + "mimeType": "text/csv", + "name": "yet_another_file.csv", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + ], + [ + GoogleDriveRemoteFile( + uri="test.csv", + id="abc", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + GoogleDriveRemoteFile( + uri="subfolder/another_file.csv", + id="def", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + GoogleDriveRemoteFile( + uri="subfolder/subsubfolder/yet_another_file.csv", + id="ghi", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + ], + id="Nested directories", + ), + pytest.param( + "**/*", + [ + [ + { + "files": [ + {"id": "abc", "mimeType": "text/csv", "name": "test.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + { + "id": "sub", + "mimeType": "application/vnd.google-apps.folder", + "name": "subfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + [ + # second request is for requesting the subfolder + { + "files": [ + {"id": "abc", "mimeType": "text/csv", "name": "test.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + { + "id": "subsub", + "mimeType": "application/vnd.google-apps.folder", + "name": "subsubfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + [ + # third request is for requesting the subsubfolder + { + "files": [ + {"id": "abc", "mimeType": "text/csv", "name": "test.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + { + "id": "sub", + "mimeType": "application/vnd.google-apps.folder", + "name": "link_to_subfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + ], + [ + GoogleDriveRemoteFile( + uri="test.csv", + id="abc", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + ], + id="Duplicates", + ), + pytest.param( + "subfolder/**/*.csv", + [ + [ + { + "files": [ + {"id": "abc", "mimeType": "text/csv", "name": "test.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + { + "id": "sub", + "mimeType": "application/vnd.google-apps.folder", + "name": "subfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + [ + # second request is for requesting the subfolder + { + "files": [ + {"id": "def", "mimeType": "text/csv", "name": "another_file.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + { + "id": "ghi", + "mimeType": "text/jsonl", + "name": "non_matching.jsonl", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + ], + [ + GoogleDriveRemoteFile( + uri="subfolder/another_file.csv", + id="def", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + ], + id="Glob matching and subdirectories", + ), + pytest.param( + "subfolder/*.csv", + [ + [ + { + "files": [ + {"id": "abc", "mimeType": "text/csv", "name": "test.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + { + "id": "sub", + "mimeType": "application/vnd.google-apps.folder", + "name": "subfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + # This won't get queued because it has no chance of matching the glob + { + "id": "sub", + "mimeType": "application/vnd.google-apps.folder", + "name": "ignored_subfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + [ + # second request is for requesting the subfolder + { + "files": [ + {"id": "def", "mimeType": "text/csv", "name": 
"another_file.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + # This will get queued because it matches the prefix (event though it can't match the glob) + { + "id": "subsub", + "mimeType": "application/vnd.google-apps.folder", + "name": "subsubfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + [ + # third request is for requesting the subsubfolder + { + "files": [ + { + "id": "ghi", + "mimeType": "text/csv", + "name": "yet_another_file.csv", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + ], + [ + GoogleDriveRemoteFile( + uri="subfolder/another_file.csv", + id="def", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + ], + id="Glob matching and ignoring most subdirectories that can't be matched", + ), + pytest.param( + "subfolder/subsubfolder/*.csv", + [ + [ + { + "files": [ + {"id": "abc", "mimeType": "text/csv", "name": "test.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + { + "id": "sub", + "mimeType": "application/vnd.google-apps.folder", + "name": "subfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + [ + # second request is for requesting the subfolder + { + "files": [ + {"id": "def", "mimeType": "text/csv", "name": "another_file.csv", "modifiedTime": "2021-01-01T00:00:00.000Z"}, + # This will get queued because it matches the prefix (event though it can't match the glob) + { + "id": "subsub", + "mimeType": "application/vnd.google-apps.folder", + "name": "subsubfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + [ + # third request is for requesting the subsubfolder + { + "files": [ + { + "id": "ghi", + "mimeType": "text/csv", + "name": "yet_another_file.csv", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + # This will get queued because it matches the prefix (event though it can't match the glob) + { + "id": "subsubsub", + "mimeType": "application/vnd.google-apps.folder", + "name": "ignored_subsubsubfolder", + "modifiedTime": "2021-01-01T00:00:00.000Z", + }, + ] + }, + ], + [{"files": []}], + ], + [ + GoogleDriveRemoteFile( + uri="subfolder/subsubfolder/yet_another_file.csv", + id="ghi", + mime_type="text/csv", + original_mime_type="text/csv", + last_modified=datetime.datetime(2021, 1, 1), + ), + ], + id="Glob matching and ignoring subdirectories that can't be matched, multiple levels", + ), + pytest.param( + "*", + [ + [ + { + "files": [ + { + "id": "abc", + "mimeType": "application/vnd.google-apps.document", + "name": "MyDoc", + "modifiedTime": "2021-01-01T00:00:00.000Z", + } + ] + } + ] + ], + [ + GoogleDriveRemoteFile( + uri="MyDoc", + id="abc", + original_mime_type="application/vnd.google-apps.document", + mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document", + last_modified=datetime.datetime(2021, 1, 1), + ) + ], + id="Google Doc as docx", + ), + pytest.param( + "*", + [ + [ + { + "files": [ + { + "id": "abc", + "mimeType": "application/vnd.google-apps.presentation", + "name": "MySlides", + "modifiedTime": "2021-01-01T00:00:00.000Z", + } + ] + } + ] + ], + [ + GoogleDriveRemoteFile( + uri="MySlides", + id="abc", + original_mime_type="application/vnd.google-apps.presentation", + mime_type="application/pdf", + last_modified=datetime.datetime(2021, 1, 1), + ) + ], + id="Presentation as pdf", + ), + pytest.param( + "*", + [ + [ + { + "files": [ + { + "id": "abc", + "mimeType": "application/vnd.google-apps.drawing", + "name": "MyDrawing", + "modifiedTime": "2021-01-01T00:00:00.000Z", + 
} + ] + } + ] + ], + [ + GoogleDriveRemoteFile( + uri="MyDrawing", + id="abc", + original_mime_type="application/vnd.google-apps.drawing", + mime_type="application/pdf", + last_modified=datetime.datetime(2021, 1, 1), + ) + ], + id="Drawing as pdf", + ), + pytest.param( + "*", + [ + [ + { + "files": [ + { + "id": "abc", + "mimeType": "application/vnd.google-apps.video", + "name": "MyVideo", + "modifiedTime": "2021-01-01T00:00:00.000Z", + } + ] + } + ] + ], + [ + GoogleDriveRemoteFile( + uri="MyVideo", + id="abc", + original_mime_type="application/vnd.google-apps.video", + mime_type="application/vnd.google-apps.video", + last_modified=datetime.datetime(2021, 1, 1), + ) + ], + id="Other google file types as is", + ), + ], +) +@patch("source_google_drive.stream_reader.service_account") +@patch("source_google_drive.stream_reader.build") +def test_matching_files(mock_build_service, mock_service_account, glob, listing_results, matched_files): + mock_request = MagicMock() + # execute returns all results from all pages for all listings + flattened_results = flatten_list(listing_results) + + mock_request.execute.side_effect = flattened_results + files_service = MagicMock() + files_service.list.return_value = mock_request + # list next returns a new fake "request" for each page and None at the end of each page (simulating the end of the listing like the Google Drive API behaves in practice) + files_service.list_next.side_effect = flatten_list( + [[*[mock_request for _ in range(len(listing) - 1)], None] for listing in listing_results] + ) + drive_service = MagicMock() + drive_service.files.return_value = files_service + mock_build_service.return_value = drive_service + + reader = create_reader() + + found_files = list(reader.get_matching_files([glob], None, MagicMock())) + assert files_service.list.call_count == len(listing_results) + assert matched_files == found_files + assert files_service.list_next.call_count == len(flattened_results) + + +@pytest.mark.parametrize( + "file, file_content, mode, expect_export, expected_mime_type, expected_read, expect_raise", + [ + pytest.param( + GoogleDriveRemoteFile( + uri="avro_file", id="abc", mime_type="text/csv", original_mime_type="text/csv", last_modified=datetime.datetime(2021, 1, 1) + ), + b"test", + FileReadMode.READ_BINARY, + False, + None, + b"test", + False, + id="Read binary file", + ), + pytest.param( + GoogleDriveRemoteFile( + uri="test.csv", id="abc", mime_type="text/csv", original_mime_type="text/csv", last_modified=datetime.datetime(2021, 1, 1) + ), + b"test", + FileReadMode.READ, + False, + None, + "test", + False, + id="Read text file", + ), + pytest.param( + GoogleDriveRemoteFile( + uri="abc", + id="abc", + mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document", + original_mime_type="application/vnd.google-apps.document", + last_modified=datetime.datetime(2021, 1, 1), + ), + b"test", + FileReadMode.READ_BINARY, + True, + "application/vnd.openxmlformats-officedocument.wordprocessingml.document", + b"test", + False, + id="Read google doc as binary file with export", + ), + ], +) +@patch("source_google_drive.stream_reader.MediaIoBaseDownload") +@patch("source_google_drive.stream_reader.service_account") +@patch("source_google_drive.stream_reader.build") +def test_open_file( + mock_build_service, + mock_service_account, + mock_basedownload, + file, + file_content, + mode, + expect_export, + expected_mime_type, + expected_read, + expect_raise, +): + mock_request = MagicMock() + mock_downloader = MagicMock() + + def 
mock_next_chunk():
+ handle = mock_basedownload.call_args[0][0]
+ if handle.tell() > 0:
+ return (None, True)
+ else:
+ handle.write(file_content)
+ return (None, False)
+
+ mock_downloader.next_chunk.side_effect = mock_next_chunk
+
+ mock_basedownload.return_value = mock_downloader
+
+ files_service = MagicMock()
+ if expect_export:
+ files_service.export_media.return_value = mock_request
+ else:
+ files_service.get_media.return_value = mock_request
+ drive_service = MagicMock()
+ drive_service.files.return_value = files_service
+ mock_build_service.return_value = drive_service
+
+ if expect_raise:
+ with pytest.raises(ValueError):
+ create_reader().open_file(file, mode, None, MagicMock()).read()
+ else:
+ assert expected_read == create_reader().open_file(file, mode, None, MagicMock()).read()
+ assert mock_downloader.next_chunk.call_count == 2
+ if expect_export:
+ files_service.export_media.assert_has_calls([call(fileId=file.id, mimeType=expected_mime_type)])
+ else:
+ files_service.get_media.assert_has_calls([call(fileId=file.id)])
diff --git a/docs/integrations/sources/google-drive.md b/docs/integrations/sources/google-drive.md
new file mode 100644
index 0000000000000..aedb783b2c7ef
--- /dev/null
+++ b/docs/integrations/sources/google-drive.md
@@ -0,0 +1,251 @@
+# Google Drive
+
+This page contains the setup guide and reference information for the Google Drive source connector.
+
+:::info
+The Google Drive source connector pulls data from a single folder in Google Drive. Subfolders are recursively included in the sync, so all files in the specified folder and all of its subfolders will be considered.
+:::
+
+## Prerequisites
+
+- Drive folder link - The link to the Google Drive folder you want to sync files from (includes files located in subfolders)
+
+- **For Airbyte Cloud:** A Google Workspace user with access to the folder
+
+
+- **For Airbyte Open Source:**
+ - A GCP project
+ - Enable the Google Drive API in your GCP project
+ - Service Account Key with access to the folder you want to replicate
+
+
+## Setup guide
+
+The Google Drive source connector supports authentication via either OAuth or Service Account Key Authentication.
+
+For **Airbyte Cloud** users, we highly recommend using OAuth, as it significantly simplifies the setup process and allows you to authenticate [directly from the Airbyte UI](#set-up-the-google-drive-source-connector-in-airbyte).
+
+
+
+
+
+For **Airbyte Open Source** users, we recommend using Service Account Key Authentication. Follow the steps below to create a service account, generate a key, and enable the Google Drive API.
+
+:::note
+If you prefer to use OAuth for authentication with **Airbyte Open Source**, you can follow [Google's OAuth instructions](https://developers.google.com/identity/protocols/oauth2) to create an authentication app. Be sure to set the scopes to `https://www.googleapis.com/auth/drive.readonly`. You will need to obtain your client ID, client secret, and refresh token for the connector setup.
+:::
+
+### Set up the service account key (Airbyte Open Source)
+
+#### Create a service account
+
+1. Open the [Service Accounts page](https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts) in your Google Cloud console.
+2. Select an existing project, or create a new project.
+3. At the top of the page, click **+ Create service account**.
+4. Enter a name and description for the service account, then click **Create and Continue**.
+5. Under **Service account permissions**, select the roles to grant to the service account, then click **Continue**. We recommend the **Viewer** role.
+
+#### Generate a key
+
+1. Go to the [API Console/Credentials](https://console.cloud.google.com/apis/credentials) page and click on the email address of the service account you just created.
+2. In the **Keys** tab, click **+ Add key**, then click **Create new key**.
+3. Select **JSON** as the Key type. This will generate and download the JSON key file that you'll use for authentication. Click **Continue**.
+
+#### Enable the Google Drive API
+
+1. Go to the [API Console/Library](https://console.cloud.google.com/apis/library) page.
+2. Make sure you have selected the correct project from the top.
+3. Find and select the **Google Drive API**.
+4. Click **ENABLE**.
+
+If your folder is viewable by anyone with its link, no further action is needed. If not, give your service account access to your folder. Check out [this video](https://youtu.be/GyomEw5a2NQ) for how to do this.
+
+
+
+### Set up the Google Drive source connector in Airbyte
+
+To set up Google Drive as a source in Airbyte:
+
+1. [Log in to your Airbyte Cloud](https://cloud.airbyte.com/workspaces) or Airbyte Open Source account.
+2. In the left navigation bar, click **Sources**. In the top-right corner, click **+ New source**.
+3. Find and select **Google Drive** from the list of available sources.
+4. For **Source name**, enter a name to help you identify this source.
+5. Select your authentication method:
+
+
+
+#### For Airbyte Cloud
+
+- **(Recommended)** Select **Authenticate via Google (OAuth)** from the Authentication dropdown, click **Sign in with Google** and complete the authentication workflow.
+
+
+
+
+#### For Airbyte Open Source
+
+- **(Recommended)** Select **Service Account Key Authentication** from the dropdown and enter your Google Cloud service account key in JSON format (a short sketch for verifying the key locally is shown after these steps):
+
+ ```js
+ { "type": "service_account", "project_id": "YOUR_PROJECT_ID", "private_key_id": "YOUR_PRIVATE_KEY", ... }
+ ```
+
+- To authenticate your Google account via OAuth, select **Authenticate via Google (OAuth)** from the dropdown and enter your Google application's client ID, client secret, and refresh token.
+
+
+
+6. For **Folder Link**, enter the link to the Google Drive folder. To get the link, navigate to the folder you want to sync in the Google Drive UI, and copy the current URL.
+7. Configure the optional **Start Date** parameter that marks a starting date and time in UTC for data replication. Any files that have _not_ been modified since this specified date/time will _not_ be replicated. Use the provided datepicker (recommended) or enter the desired date programmatically in the format `YYYY-MM-DDTHH:mm:ssZ`. Leaving this field blank will replicate data from all files that have not been excluded by the **Path Pattern** and **Path Prefix**.
+8. Click **Set up source** and wait for the tests to complete.
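+
+Optionally, you can confirm outside of Airbyte that the key works and that the folder is visible to the service account. The sketch below uses the [google-api-python-client](https://github.com/googleapis/google-api-python-client) and google-auth libraries; the `service_account.json` file name and the `FOLDER_ID` placeholder are only examples:
+
+```python
+from google.oauth2 import service_account
+from googleapiclient.discovery import build
+
+SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]
+FOLDER_ID = "YOUR_FOLDER_ID"  # the last path segment of the folder link
+
+# Load the downloaded JSON key and build a read-only Drive client.
+credentials = service_account.Credentials.from_service_account_file("service_account.json", scopes=SCOPES)
+drive = build("drive", "v3", credentials=credentials)
+
+# List a few files in the folder. An empty result usually means the folder
+# has not been shared with the service account's email address.
+response = drive.files().list(
+    q=f"'{FOLDER_ID}' in parents and trashed = false",
+    fields="files(id, name, mimeType, modifiedTime)",
+    pageSize=10,
+).execute()
+for file in response.get("files", []):
+    print(file["name"], file["mimeType"], file["modifiedTime"])
+```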
+
+## Supported sync modes
+
+The Google Drive source connector supports the following [sync modes](https://docs.airbyte.com/cloud/core-concepts#connection-sync-modes):
+
+| Feature | Supported? |
+| :--------------------------------------------- | :--------- |
+| Full Refresh Sync | Yes |
+| Incremental Sync | Yes |
+| Replicate Incremental Deletes | No |
+| Replicate Multiple Files \(pattern matching\) | Yes |
+| Replicate Multiple Streams \(distinct tables\) | Yes |
+| Namespaces | No |
+
+## Path Patterns
+
+\(tl;dr -> path pattern syntax using [wcmatch.glob](https://facelessuser.github.io/wcmatch/glob/). GLOBSTAR and SPLIT flags are enabled.\)
+
+This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. This enables:
+
+- Referencing many files with just one pattern, e.g. `**` would indicate every file in the folder.
+- Referencing future files that don't exist yet \(and therefore don't have a specific path\).
+
+You must provide a path pattern. You can also provide many patterns split with \| for more complex directory layouts.
+
+Each path pattern is a reference from the _root_ of the folder, so don't include the root folder name itself in the pattern\(s\).
+
+Some example patterns:
+
+- `**` : match everything.
+- `**/*.csv` : match all files with specific extension.
+- `myFolder/**/*.csv` : match all csv files anywhere under myFolder.
+- `*/**` : match everything at least one folder deep.
+- `*/*/*/**` : match everything at least three folders deep.
+- `**/file.*|**/file` : match every file called "file" with any extension \(or no extension\).
+- `x/*/y/*` : match all files that sit in sub-folder x -> any folder -> folder y.
+- `**/prefix*.csv` : match all csv files with specific prefix.
+- `**/prefix*.parquet` : match all parquet files with specific prefix.
+
+Let's look at a specific example, matching the following folder layout (`MyFolder` is the folder specified in the connector config as the root folder, which the patterns are relative to):
+
+```text
+MyFolder
+    -> log_files
+    -> some_table_files
+        -> part1.csv
+        -> part2.csv
+    -> images
+    -> more_table_files
+        -> part3.csv
+    -> extras
+        -> misc
+            -> another_part1.csv
+```
+
+We want to pick up part1.csv, part2.csv and part3.csv \(excluding another_part1.csv for now\). We could do this a few different ways:
+
+- We could pick up every csv file called "partX" with the single pattern `**/part*.csv`.
+- To be a bit more robust, we could use the dual pattern `some_table_files/*.csv|more_table_files/*.csv` to pick up relevant files only from those exact folders.
+- We could achieve the above in a single pattern by using the pattern `*table_files/*.csv`. This could however cause problems in the future if new unexpected folders started being created.
+- We can also recursively wildcard, so adding the pattern `extras/**/*.csv` would pick up any csv files nested in folders below "extras", such as "extras/misc/another_part1.csv".
+
+As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
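+
+If you want to sanity-check a pattern before running a sync, you can reproduce the matching behaviour locally with the same [wcmatch.glob](https://facelessuser.github.io/wcmatch/glob/) library mentioned above. The snippet below is only an illustration; the paths are made up and written relative to the configured root folder:
+
+```python
+from wcmatch import glob
+
+# GLOBSTAR lets ** span multiple folders; SPLIT allows several patterns joined by |.
+FLAGS = glob.GLOBSTAR | glob.SPLIT
+
+paths = [
+    "some_table_files/part1.csv",
+    "more_table_files/part3.csv",
+    "extras/misc/another_part1.csv",
+    "log_files/images",
+]
+
+for path in paths:
+    print(path, glob.globmatch(path, "**/part*.csv|extras/**/*.csv", flags=FLAGS))
+```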
+
+## User Schema
+
+When using the Avro, Jsonl, CSV or Parquet format, you can provide a schema to use for the output stream. **Note that this doesn't apply to the experimental Document file type format.**
+
+Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the folder matching your path pattern and suffix. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:
+
+- You only care about a specific known subset of the columns. The other columns would all still be included, but packed into the `_ab_additional_properties` map.
+- Your initial dataset is quite small \(in terms of number of records\), and you think the automatic type inference from this sample might not be representative of the data in the future.
+- You want to purposely define types for every column.
+- You know the names of columns that will be added to future data and want to include these in the core schema as columns rather than have them appear in the `_ab_additional_properties` map.
+
+Or any other reason! The schema must be provided as valid JSON as a map of `{"column": "datatype"}` where each datatype is one of:
+
+- string
+- number
+- integer
+- object
+- array
+- boolean
+- null
+
+For example:
+
+- `{"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}`
+- `{"username": "string", "friends": "array", "information": "object"}`
+
+## File Format Settings
+
+### CSV
+
+Since CSV files are effectively plain text, providing specific reader options is often required for correct parsing of the files. These settings are applied when a CSV is created or exported, so please ensure that this process happens consistently over time.
+
+- **Header Definition**: How headers will be defined. `User Provided` assumes the CSV does not have a header row and uses the column names you provide, while `Autogenerated` assumes the CSV does not have a header row and generates headers of the form `f{i}`, where `i` is the column index starting from 0. Otherwise, the default behavior is to use the header row from the CSV file. If you want to autogenerate or provide column names for a CSV that does have a header row, set a value for the "Skip rows before header" option to ignore the header row.
+- **Delimiter**: Even though CSV is an acronym for Comma Separated Values, it is used more generally as a term for flat file data that may or may not be comma separated. The delimiter field lets you specify which character acts as the separator. To use [tab-delimiters](https://en.wikipedia.org/wiki/Tab-separated_values), you can set this value to `\t`. By default, this value is set to `,`.
+- **Double Quote**: This option determines whether two quotes in a quoted CSV value denote a single quote in the data. Set to True by default.
+- **Encoding**: Some data may use a different character set \(typically when different alphabets are involved\). See the [list of allowable encodings here](https://docs.python.org/3/library/codecs.html#standard-encodings). By default, this is set to `utf8`.
+- **Escape Character**: An escape character can be used to prefix a reserved character and ensure correct parsing. A commonly used character is the backslash (`\`). For example, given the following data:
+
+```
+Product,Description,Price
+Jeans,"Navy Blue, Bootcut, 34\"",49.99
+```
+
+The backslash (`\`) is used directly before the second double quote (`"`) to indicate that it is _not_ the closing quote for the field, but rather a literal double quote character that should be included in the value (in this example, denoting the size of the jeans in inches: `34"` ).
+
+Leaving this field blank (default option) will disallow escaping. A worked parsing example is shown after this list.
+
+- **False Values**: A set of case-sensitive strings that should be interpreted as false values.
+- **Null Values**: A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.
+- **Quote Character**: In some cases, data values may contain instances of reserved characters \(like a comma, if that's the delimiter\). CSVs can handle this by wrapping a value in defined quote characters so that on read it can parse it correctly. By default, this is set to `"`.
+- **Skip Rows After Header**: The number of rows to skip after the header row.
+- **Skip Rows Before Header**: The number of rows to skip before the header row.
+- **Strings Can Be Null**: Whether strings can be interpreted as null values. If true, strings that match the null_values set will be interpreted as null. If false, strings that match the null_values set will be interpreted as the string itself.
+- **True Values**: A set of case-sensitive strings that should be interpreted as true values.
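+
+To see how the quoting and escaping options work together, here is a small sketch that parses the escape character example above with Python's standard `csv` module. It is only an approximation for illustration, not the connector's actual parser:
+
+```python
+import csv
+import io
+
+# Same sample as above: a backslash escapes the embedded double quote.
+data = 'Product,Description,Price\nJeans,"Navy Blue, Bootcut, 34\\"",49.99\n'
+
+reader = csv.reader(
+    io.StringIO(data),
+    delimiter=",",      # Delimiter
+    quotechar='"',      # Quote Character
+    escapechar="\\",    # Escape Character
+    doublequote=False,  # Double Quote
+)
+for row in reader:
+    print(row)
+# ['Product', 'Description', 'Price']
+# ['Jeans', 'Navy Blue, Bootcut, 34"', '49.99']
+```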
+
+
+### Parquet
+
+Apache Parquet is a column-oriented data storage format of the Apache Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. At the moment, partitioned parquet datasets are unsupported. The following settings are available:
+
+- **Convert Decimal Fields to Floats**: Whether to convert decimal fields to floats. There is a loss of precision when converting decimals to floats, so this is not recommended.
+
+### Avro
+
+The Avro parser uses the [Fastavro library](https://fastavro.readthedocs.io/en/latest/). The following settings are available:
+- **Convert Double Fields to Strings**: Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision, because there can be a loss of precision when handling floating point numbers.
+
+### JSONL
+
+There are currently no options for JSONL parsing.
+
+### Document File Type Format (Experimental)
+
+:::warning
+The Document file type format is currently an experimental feature and not subject to SLAs. Use at your own risk.
+:::
+
+The Document file type format is a special format that allows you to extract text from Markdown, PDF, Word, Powerpoint and Google documents. If selected, the connector will extract text from the documents and output it as a single field named `content`. The `document_key` field will hold a unique identifier for the processed file, which can be used as a primary key. The content of the document will contain markdown formatting converted from the original file format. Each file matching the defined glob pattern needs to either be a markdown (`md`), PDF (`pdf`) or Docx (`docx`) file.
+
+One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination, as each destination has different limitations for string fields.
+
+Before parsing each document, the connector exports Google Document files to Docx format internally. Google Sheets, Google Slides, and drawings are internally exported and parsed by the connector as PDFs.
+
+## Changelog
+
+| Version | Date | Pull Request | Subject |
+|---------|------------|----------------------------------------------------------|-----------------------------------------------------------------------------------|
+| 0.0.1 | 2023-11-02 | [31458](https://github.com/airbytehq/airbyte/pull/31458) | Initial Google Drive source |
+