diff --git a/docs/.gitbook/assets/azure_shell_vm_overview.png b/docs/.gitbook/assets/azure_shell_vm_overview.png
deleted file mode 100644
index bdae4f34bf541..0000000000000
Binary files a/docs/.gitbook/assets/azure_shell_vm_overview.png and /dev/null differ
diff --git a/docs/.gitbook/assets/change-to-per-week (1).png b/docs/.gitbook/assets/change-to-per-week (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/change-to-per-week (1).png
rename to docs/.gitbook/assets/change-to-per-week (3) (1).png
diff --git a/docs/.gitbook/assets/change-to-per-week (2).png b/docs/.gitbook/assets/change-to-per-week (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/change-to-per-week (2).png
rename to docs/.gitbook/assets/change-to-per-week (3) (2).png
diff --git a/docs/.gitbook/assets/change-to-per-week.png b/docs/.gitbook/assets/change-to-per-week (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/change-to-per-week.png
rename to docs/.gitbook/assets/change-to-per-week (3) (3).png
diff --git a/docs/.gitbook/assets/datasources (1).png b/docs/.gitbook/assets/datasources (4) (1).png
similarity index 100%
rename from docs/.gitbook/assets/datasources (1).png
rename to docs/.gitbook/assets/datasources (4) (1).png
diff --git a/docs/.gitbook/assets/datasources (2).png b/docs/.gitbook/assets/datasources (4) (2).png
similarity index 100%
rename from docs/.gitbook/assets/datasources (2).png
rename to docs/.gitbook/assets/datasources (4) (2).png
diff --git a/docs/.gitbook/assets/datasources (3).png b/docs/.gitbook/assets/datasources (4) (3).png
similarity index 100%
rename from docs/.gitbook/assets/datasources (3).png
rename to docs/.gitbook/assets/datasources (4) (3).png
diff --git a/docs/.gitbook/assets/datasources.png b/docs/.gitbook/assets/datasources (4) (4).png
similarity index 100%
rename from docs/.gitbook/assets/datasources.png
rename to docs/.gitbook/assets/datasources (4) (4).png
diff --git a/docs/.gitbook/assets/duration-spent-in-weekly-webinars (1).png b/docs/.gitbook/assets/duration-spent-in-weekly-webinars (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/duration-spent-in-weekly-webinars (1).png
rename to docs/.gitbook/assets/duration-spent-in-weekly-webinars (3) (1).png
diff --git a/docs/.gitbook/assets/duration-spent-in-weekly-webinars (2).png b/docs/.gitbook/assets/duration-spent-in-weekly-webinars (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/duration-spent-in-weekly-webinars (2).png
rename to docs/.gitbook/assets/duration-spent-in-weekly-webinars (3) (2).png
diff --git a/docs/.gitbook/assets/duration-spent-in-weekly-webinars.png b/docs/.gitbook/assets/duration-spent-in-weekly-webinars (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/duration-spent-in-weekly-webinars.png
rename to docs/.gitbook/assets/duration-spent-in-weekly-webinars (3) (3).png
diff --git a/docs/.gitbook/assets/evolution-of-meetings-per-week (1).png b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/evolution-of-meetings-per-week (1).png
rename to docs/.gitbook/assets/evolution-of-meetings-per-week (3) (1).png
diff --git a/docs/.gitbook/assets/evolution-of-meetings-per-week (2).png b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/evolution-of-meetings-per-week (2).png
rename to docs/.gitbook/assets/evolution-of-meetings-per-week (3) (2).png
diff --git a/docs/.gitbook/assets/evolution-of-meetings-per-week.png b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/evolution-of-meetings-per-week.png
rename to docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3).png
diff --git a/docs/.gitbook/assets/launch (1).png b/docs/.gitbook/assets/launch (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/launch (1).png
rename to docs/.gitbook/assets/launch (3) (1).png
diff --git a/docs/.gitbook/assets/launch (2).png b/docs/.gitbook/assets/launch (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/launch (2).png
rename to docs/.gitbook/assets/launch (3) (2).png
diff --git a/docs/.gitbook/assets/launch.png b/docs/.gitbook/assets/launch (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/launch.png
rename to docs/.gitbook/assets/launch (3) (3).png
diff --git a/docs/.gitbook/assets/meetings-participant-ranked (1).png b/docs/.gitbook/assets/meetings-participant-ranked (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/meetings-participant-ranked (1).png
rename to docs/.gitbook/assets/meetings-participant-ranked (3) (1).png
diff --git a/docs/.gitbook/assets/meetings-participant-ranked (2).png b/docs/.gitbook/assets/meetings-participant-ranked (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/meetings-participant-ranked (2).png
rename to docs/.gitbook/assets/meetings-participant-ranked (3) (2).png
diff --git a/docs/.gitbook/assets/meetings-participant-ranked.png b/docs/.gitbook/assets/meetings-participant-ranked (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/meetings-participant-ranked.png
rename to docs/.gitbook/assets/meetings-participant-ranked (3) (3).png
diff --git a/docs/.gitbook/assets/postgres_credentials (1).png b/docs/.gitbook/assets/postgres_credentials (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/postgres_credentials (1).png
rename to docs/.gitbook/assets/postgres_credentials (3) (1).png
diff --git a/docs/.gitbook/assets/postgres_credentials (2).png b/docs/.gitbook/assets/postgres_credentials (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/postgres_credentials (2).png
rename to docs/.gitbook/assets/postgres_credentials (3) (2).png
diff --git a/docs/.gitbook/assets/postgres_credentials.png b/docs/.gitbook/assets/postgres_credentials (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/postgres_credentials.png
rename to docs/.gitbook/assets/postgres_credentials (3) (3).png
diff --git a/docs/.gitbook/assets/schema (1).png b/docs/.gitbook/assets/schema (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/schema (1).png
rename to docs/.gitbook/assets/schema (3) (1).png
diff --git a/docs/.gitbook/assets/schema (2).png b/docs/.gitbook/assets/schema (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/schema (2).png
rename to docs/.gitbook/assets/schema (3) (2).png
diff --git a/docs/.gitbook/assets/schema.png b/docs/.gitbook/assets/schema (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/schema.png
rename to docs/.gitbook/assets/schema (3) (3).png
diff --git a/docs/.gitbook/assets/setup-successful (1).png b/docs/.gitbook/assets/setup-successful (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/setup-successful (1).png
rename to docs/.gitbook/assets/setup-successful (3) (1).png
diff --git a/docs/.gitbook/assets/setup-successful (2).png b/docs/.gitbook/assets/setup-successful (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/setup-successful (2).png
rename to docs/.gitbook/assets/setup-successful (3) (2).png
diff --git a/docs/.gitbook/assets/setup-successful.png b/docs/.gitbook/assets/setup-successful (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/setup-successful.png
rename to docs/.gitbook/assets/setup-successful (3) (3).png
diff --git a/docs/.gitbook/assets/sync-screen (1).png b/docs/.gitbook/assets/sync-screen (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/sync-screen (1).png
rename to docs/.gitbook/assets/sync-screen (3) (1).png
diff --git a/docs/.gitbook/assets/sync-screen (2).png b/docs/.gitbook/assets/sync-screen (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/sync-screen (2).png
rename to docs/.gitbook/assets/sync-screen (3) (2).png
diff --git a/docs/.gitbook/assets/sync-screen.png b/docs/.gitbook/assets/sync-screen (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/sync-screen.png
rename to docs/.gitbook/assets/sync-screen (3) (3).png
diff --git a/docs/.gitbook/assets/tableau-dashboard (1).png b/docs/.gitbook/assets/tableau-dashboard (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/tableau-dashboard (1).png
rename to docs/.gitbook/assets/tableau-dashboard (3) (1).png
diff --git a/docs/.gitbook/assets/tableau-dashboard (2).png b/docs/.gitbook/assets/tableau-dashboard (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/tableau-dashboard (2).png
rename to docs/.gitbook/assets/tableau-dashboard (3) (2).png
diff --git a/docs/.gitbook/assets/tableau-dashboard.png b/docs/.gitbook/assets/tableau-dashboard (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/tableau-dashboard.png
rename to docs/.gitbook/assets/tableau-dashboard (3) (3).png
diff --git a/docs/.gitbook/assets/zoom-marketplace-build-screen (1).png b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (1).png
similarity index 100%
rename from docs/.gitbook/assets/zoom-marketplace-build-screen (1).png
rename to docs/.gitbook/assets/zoom-marketplace-build-screen (3) (1).png
diff --git a/docs/.gitbook/assets/zoom-marketplace-build-screen (2).png b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (2).png
similarity index 100%
rename from docs/.gitbook/assets/zoom-marketplace-build-screen (2).png
rename to docs/.gitbook/assets/zoom-marketplace-build-screen (3) (2).png
diff --git a/docs/.gitbook/assets/zoom-marketplace-build-screen.png b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3).png
similarity index 100%
rename from docs/.gitbook/assets/zoom-marketplace-build-screen.png
rename to docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3).png
diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index c71e4385a4160..ddf750b89770a 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -27,7 +27,7 @@
 * [Local Deployment](deploying-airbyte/local-deployment.md)
 * [On AWS \(EC2\)](deploying-airbyte/on-aws-ec2.md)
 * [On GCP \(Compute Engine\)](deploying-airbyte/on-gcp-compute-engine.md)
-* [On Azure\(VM)](deploying-airbyte/on-azure-vm-cloud-shell.md)
+* [On Azure\(VM\)](deploying-airbyte/on-azure-vm-cloud-shell.md)
 * [On Kubernetes \(Alpha\)](deploying-airbyte/on-kubernetes.md)
 * [On AWS ECS \(Coming Soon\)](deploying-airbyte/on-aws-ecs.md)
 * [Connector Catalog](integrations/README.md)
@@ -65,7 +65,7 @@
 * [MySQL](integrations/sources/mysql.md)
 * [Oracle DB](integrations/sources/oracle.md)
 * [Plaid](integrations/sources/plaid.md)
-* [PokéAPI](integrations/sources/pokeapi.md)
+* [PokéAPI](integrations/sources/pokeapi.md)
 * [Postgres](integrations/sources/postgres.md)
 * [Quickbooks](integrations/sources/quickbooks.md)
 * [Recurly](integrations/sources/recurly.md)
@@ -95,26 +95,26 @@
 * [Contributing to Airbyte](contributing-to-airbyte/README.md)
 * [Code of Conduct](contributing-to-airbyte/code-of-conduct.md)
 * [Developing Locally](contributing-to-airbyte/developing-locally.md)
-* [Connector Development Kit (Python)](../airbyte-cdk/python/README.md)
-* [Concepts](../airbyte-cdk/python/docs/concepts/README.md)
-* [Basic Concepts](../airbyte-cdk/python/docs/concepts/basic-concepts.md)
-* [Full Refresh Streams](../airbyte-cdk/python/docs/concepts/full-refresh-stream.md)
-* [Incremental Streams](../airbyte-cdk/python/docs/concepts/incremental-stream.md)
-* [HTTP-API-based Connectors](../airbyte-cdk/python/docs/concepts/http-streams.md)
-* [Python Concepts](../airbyte-cdk/python/docs/concepts/python-concepts.md)
-* [Stream Slices](../airbyte-cdk/python/docs/concepts/stream_slices.md)
-* [Tutorials](../airbyte-cdk/python/docs/tutorials/README.md)
-* [Speedrun: Creating a Source with the CDK](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-any-percent/cdk-speedrun.md)
-* [Creating an HTTP API Source](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-python-http/README.md)
-* [Getting Started](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-python-http/0-getting-started.md)
-* [Step 1: Creating the Source](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-python-http/1-creating-the-source.md)
-* [Step 2: Install Dependencies](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-python-http/2-install-dependencies.md)
-* [Step 3: Define Inputs](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-python-http/3-define-inputs.md)
-* [Step 4: Connection Checking](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-python-http/4-connection-checking.md)
-* [Step 5: Declare the Schema](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-python-http/5-declare-schema.md)
-* [Step 6: Read Data](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-python-http/6-read-data.md)
-* [Step 7: Use the Connector in Airbyte](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-python-http/7-use-connector-in-airbyte.md)
-* [Step 8: Test Connector](../airbyte-cdk/python/docs/tutorials/cdk-tutorial-python-http/8-test-your-connector.md)
+* [Connector Development Kit \(Python\)](contributing-to-airbyte/python/README.md)
+* [Concepts](contributing-to-airbyte/python/concepts/README.md)
+* [Basic Concepts](contributing-to-airbyte/python/concepts/basic-concepts.md)
+* [Full Refresh Streams](contributing-to-airbyte/python/concepts/full-refresh-stream.md)
+* [Incremental Streams](contributing-to-airbyte/python/concepts/incremental-stream.md)
+* [HTTP-API-based Connectors](contributing-to-airbyte/python/concepts/http-streams.md)
+* [Python Concepts](contributing-to-airbyte/python/concepts/python-concepts.md)
+* [Stream Slices](contributing-to-airbyte/python/concepts/stream_slices.md)
+* [Tutorials](contributing-to-airbyte/python/tutorials/README.md)
+* [Speedrun: Creating a Source with the CDK](contributing-to-airbyte/python/tutorials/cdk-speedrun.md)
+* [Creating an HTTP API Source](contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/README.md)
+* [Getting Started](contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/0-getting-started.md)
+* [Step 1: Creating the Source](contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/1-creating-the-source.md)
+* [Step 2: Install Dependencies](contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/2-install-dependencies.md)
+* [Step 3: Define Inputs](contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/3-define-inputs.md)
+* [Step 4: Connection Checking](contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/4-connection-checking.md)
+* [Step 5: Declare the Schema](contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/5-declare-schema.md)
+* [Step 6: Read Data](contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/6-read-data.md)
+* [Step 7: Use the Connector in Airbyte](contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/7-use-connector-in-airbyte.md)
+* [Step 8: Test Connector](contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/8-test-your-connector.md)
 * [Developing Connectors](contributing-to-airbyte/building-new-connector/README.md)
 * [Best Practices](contributing-to-airbyte/building-new-connector/best-practices.md)
 * [Java Connectors](contributing-to-airbyte/building-new-connector/java-connectors.md)
diff --git a/docs/contributing-to-airbyte/developing-locally.md b/docs/contributing-to-airbyte/developing-locally.md
index 6e16627ea759d..147399f024d2c 100644
--- a/docs/contributing-to-airbyte/developing-locally.md
+++ b/docs/contributing-to-airbyte/developing-locally.md
@@ -84,7 +84,7 @@ VERSION=dev docker-compose up
 
 ## Run formatting automation/tests
 
-To format code in the repo, simply run `./gradlew format` at the base of the repo.
+To format code in the repo, simply run `./gradlew format` at the base of the repo.
 
 Note: If you are contributing a Python file without imports or function definitions, place the following comment at the top of your file:
diff --git a/docs/contributing-to-airbyte/python/README.md b/docs/contributing-to-airbyte/python/README.md
new file mode 100644
index 0000000000000..9da7ab4d2d31e
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/README.md
@@ -0,0 +1,87 @@
+# Connector Development Kit \(Python\)
+
+The Airbyte Python CDK is a framework for rapidly developing production-grade Airbyte connectors. The CDK currently offers helpers specifically for creating Airbyte source connectors for:
+
+* HTTP APIs \(REST APIs, GraphQL, etc.\)
+* Singer Taps
+* Generic Python sources \(anything not covered by the above\)
+
+The CDK improves the developer experience by providing basic implementation structure and abstracting away low-level glue boilerplate.
+
+This document is a general introduction to the CDK. Readers should have basic familiarity with the [Airbyte Specification](https://docs.airbyte.io/architecture/airbyte-specification) before proceeding.
+
+## Getting Started
+
+Generate an empty connector using the code generator. First clone the Airbyte repository, then from the repository root run
+
+```text
+cd airbyte-integrations/connector-templates/generator
+npm run generate
+```
+
+then follow the interactive prompt. Next, find all `TODO`s in the generated project directory -- they're accompanied by lots of comments explaining what you'll need to do in order to implement your connector. Upon completing all TODOs properly, you should have a functioning connector.
+
+Additionally, you can follow [this tutorial](https://github.com/airbytehq/airbyte/tree/184dab77ebfbc00c69eea9e34b7db29c79a9e6d1/airbyte-cdk/python/docs/tutorials/http_api_source.md) for a complete walkthrough of creating an HTTP connector using the Airbyte CDK.
+
+### Concepts & Documentation
+
+See the [concepts docs](concepts/) for a tour through what the API offers.
+
+### Example Connectors
+
+**HTTP Connectors**:
+
+* [Exchangerates API](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-exchange-rates/source_exchange_rates/source.py)
+* [Stripe](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/source.py)
+* [Slack](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-slack/source_slack/source.py)
+
+**Singer connectors**:
+
+* [Salesforce](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce-singer/source_salesforce_singer/source.py)
+* [Github](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-github-singer/source_github_singer/source.py)
+
+**Simple Python connectors using the barebones `Source` abstraction**:
+
+* [Google Sheets](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-google-sheets/google_sheets_source/google_sheets_source.py)
+* [Mailchimp](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-mailchimp/source_mailchimp/source.py)
+
+## Contributing
+
+### First time setup
+
+We assume `python` points to python >=3.7.
+
+Set up a virtual env:
+
+```text
+python -m venv .venv
+source .venv/bin/activate
+pip install -e ".[dev]" # [dev] installs development-only dependencies
+```
+
+#### Iteration
+
+* Iterate on the code locally
+* Run tests via `pytest -s unit_tests`
+* Perform static type checks using `mypy airbyte_cdk`. `MyPy` configuration is in `.mypy.ini`.
+* The `type_check_and_test.sh` script bundles both type checking and testing in one convenient command. Feel free to use it!
+
+#### Testing
+
+All tests are located in the `unit_tests` directory. Run them with `pytest --cov=airbyte_cdk unit_tests/`, which also prints a test coverage report.
+
+#### Publishing a new version to PyPi
+
+1. Bump the package version in `setup.py`
+2. Open a PR
+3. An Airbyte member must comment `/publish-cdk --dry-run=`. Dry runs publish to test.pypi.org.
+
+## Coming Soon
+
+* Full OAuth 2.0 support \(including refresh token issuing flow via UI or CLI\)
+* Airbyte Java HTTP CDK
+* CDK for Async HTTP endpoints \(request-poll-wait style endpoints\)
+* CDK for other protocols
+* General CDK for Destinations
+* Don't see a feature you need? [Create an issue and let us know how we can help!](https://github.com/airbytehq/airbyte/issues/new/choose)
+
diff --git a/docs/contributing-to-airbyte/python/concepts/README.md b/docs/contributing-to-airbyte/python/concepts/README.md
new file mode 100644
index 0000000000000..14a34f632d85b
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/concepts/README.md
@@ -0,0 +1,32 @@
+# Concepts
+
+This concepts section serves as a general introduction to the Python CDK. Readers will certainly benefit from a deeper understanding of the [Airbyte Specification](https://docs.airbyte.io/architecture/airbyte-specification) before proceeding, but we do a quick overview of it in our basic concepts guide below.
+
+## Basic Concepts
+
+If you want to learn more about the classes required to implement an Airbyte Source, head to our [basic concepts doc](basic-concepts.md).
+
+## Full Refresh Streams
+
+If you have questions or are running into issues creating your first full refresh stream, head over to our [full refresh stream doc](full-refresh-stream.md). If you have questions about implementing a `path` or `parse_response` function, this doc is for you.
+
+## Incremental Streams
+
+Having trouble figuring out how to write a `stream_slices` function or aren't sure what a `cursor_field` is? Head to our [incremental stream doc](incremental-stream.md).
+
+## Practical Tips
+
+Airbyte recommends using the CDK template generator to develop with the CDK. The generator creates all the required scaffolding, with convenient TODOs, allowing developers to truly focus on implementing the API.
+
+For tips on useful Python knowledge, see the [Python Concepts](python-concepts.md) page.
+
+You can find a complete walkthrough for implementing an HTTP source connector in [this tutorial](https://github.com/airbytehq/airbyte/tree/4a397d25247db77a7b78783d26dae35bc3900f59/airbyte-cdk/python/docs/tutorials/http_api_source.md).
+
+## Examples
+
+Those interested in getting their hands dirty can check out implemented APIs:
+
+* [Exchange Rates API](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-exchange-rates/source_exchange_rates/source.py) \(Incremental\)
+* [Stripe API](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/source.py) \(Incremental and Full-Refresh\)
+* [Slack API](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-slack/source_slack/source.py) \(Incremental and Full-Refresh\)
+
diff --git a/docs/contributing-to-airbyte/python/concepts/basic-concepts.md b/docs/contributing-to-airbyte/python/concepts/basic-concepts.md
new file mode 100644
index 0000000000000..cc0df7b41986c
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/concepts/basic-concepts.md
@@ -0,0 +1,66 @@
+# Basic Concepts
+
+## The Airbyte Specification
+
+As a quick recap, the Airbyte Specification requires an Airbyte Source to support 4 distinct operations:
+
+1. `Spec` - The required configuration needed to interact with the underlying technical system, e.g. database information, authentication information, etc.
+2. `Check` - Validate that the provided configuration is valid and grants sufficient permissions to perform all required operations on the Source.
+3. `Discover` - Discover the Source's schema. This lets users select which subset of the data to sync.
+4. `Read` - Perform the actual syncing process. Data is read from the Source, parsed into `AirbyteRecordMessage`s, and sent to the Airbyte Destination. Depending on how the Source is implemented, this sync can be incremental or a full refresh.
+
+A core concept discussed here is the **Source**.
+
+The Source contains one or more **Streams** \(or **Airbyte Streams**\). A **Stream** is the other key concept for understanding how Airbyte models the data syncing process. A **Stream** models the logical data groups that make up the larger **Source**. If the **Source** is an RDBMS, each **Stream** is a table. In a REST API setting, each **Stream** corresponds to one resource within the API, e.g. a **Stripe Source** would have one **Stream** for `Transactions`, one for `Charges`, and so on.
+
+## The `Source` class
+
+Airbyte provides abstract base classes which make it much easier to perform certain categories of tasks, e.g. `HttpStream` makes it easy to create HTTP API-based streams. However, if those do not satisfy your use case \(for example, if you're pulling data from a relational database\), you can always directly implement the Airbyte Protocol by subclassing the CDK's `Source` class.
+
+Note that while this is the most flexible way to implement a source connector, it is also the most toilsome, as you will be required to manually manage state, input validation, correct conformance to the Airbyte Protocol message formats, and more. We recommend using a higher-level subclass such as `AbstractSource` \(described below\) unless you cannot fulfill your use case otherwise.
+
+## The `AbstractSource` Object
+
+`AbstractSource` is a more opinionated implementation of `Source`. It implements `Source`'s 4 methods as follows:
+
+`Spec` and `Check` are the `AbstractSource`'s simplest operations.
+
+`Spec` returns a checked-in JSON schema file specifying the required configuration. The `AbstractSource` looks for a file named `spec.json` in the module's root by default. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-exchange-rates/source_exchange_rates/spec.json).
+
+`Check` delegates to the `AbstractSource`'s `check_connection` function. The function's `config` parameter contains the user-provided configuration, specified in the `spec.json` returned by `Spec`. `check_connection` uses this configuration to validate access and permissioning. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-exchange-rates/source_exchange_rates/source.py#L90) from the same Exchange Rates API.
+
+### The `Stream` Abstract Base Class
+
+An `AbstractSource` also owns a set of `Stream`s. This is populated via the `AbstractSource`'s `streams` [function](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py#L63). `Discover` and `Read` rely on this populated set.
+
+`Discover` returns an `AirbyteCatalog` representing all the distinct resources the underlying API supports. Here is the [entrypoint](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py#L74) for those interested in reading the code.
+
+`Read` creates an in-memory stream reading from each of the `AbstractSource`'s streams. Here is the [entrypoint](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py#L90) for those interested.
+
+As the code examples show, the `AbstractSource` delegates to the set of `Stream`s it owns to fulfill both `Discover` and `Read`. Thus, implementing `AbstractSource`'s `streams` function is required when using the CDK.
+
+A summary of what we've covered so far on how to use the Airbyte CDK:
+
+* A concrete implementation of the `AbstractSource` object is required.
+* This involves:
+  1. Implementing the `check_connection` function.
+  2. Creating the appropriate `Stream` classes and returning them in the `streams` function.
+  3. Placing the above-mentioned `spec.json` file in the right place.
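+
+To make this summary concrete, here is a hedged sketch of such a minimal implementation; the source class, stream class, and `api_key` config field are all hypothetical names, not a prescribed layout:
+
+```python
+from typing import Any, Iterable, List, Mapping, Tuple
+
+from airbyte_cdk.sources import AbstractSource
+from airbyte_cdk.sources.streams import Stream
+
+
+class Employees(Stream):
+    # The CDK derives the stream name "employees" from the class name and, by
+    # default, looks for its schema in schemas/employees.json.
+    primary_key = None
+
+    def read_records(self, sync_mode, cursor_field=None, stream_slice=None, stream_state=None) -> Iterable[Mapping[str, Any]]:
+        yield {"id": 1, "name": "example"}  # stand-in for real reading logic
+
+
+class ExampleSource(AbstractSource):
+    def check_connection(self, logger, config) -> Tuple[bool, Any]:
+        # 1. Validate the user-provided config, e.g. by checking required fields.
+        if not config.get("api_key"):
+            return False, "api_key is required"
+        return True, None
+
+    def streams(self, config: Mapping[str, Any]) -> List[Stream]:
+        # 2. Return one configured instance of each Stream the source exposes.
+        return [Employees()]
+```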
+
+## HTTP Streams
+
+We've covered how the `AbstractSource` works with the `Stream` interface in order to fulfill the Airbyte Specification. Although developers are welcome to implement their own object, the CDK saves developers the hassle of doing so in the case of HTTP APIs with the [`HTTPStream`](http-streams.md) object.
+
diff --git a/docs/contributing-to-airbyte/python/concepts/full-refresh-stream.md b/docs/contributing-to-airbyte/python/concepts/full-refresh-stream.md
new file mode 100644
index 0000000000000..c8c0eb1d2aad3
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/concepts/full-refresh-stream.md
@@ -0,0 +1,43 @@
+# Full Refresh Streams
+
+As mentioned in the [Basic Concepts Overview](basic-concepts.md), a `Stream` is the atomic unit for reading data from a Source. A stream can read data from anywhere: a relational database, an API, or even a scraped web page! \(although that might be stretching the limits of what a connector should do\).
+
+To implement a stream, there are two minimum requirements:
+
+1. Define the stream's schema
+2. Implement the logic for reading records from the underlying data source
+
+## Defining the stream's schema
+
+Your connector must describe the schema of each stream it can output using [JSONSchema](https://json-schema.org).
+
+The simplest way to do this is to describe the schema of your streams using one `.json` file per stream. You can also dynamically generate the schema of your stream in code, or you can combine both approaches: start with a `.json` file and dynamically add properties to it.
+
+The schema of a stream is the return value of `Stream.get_json_schema`.
+
+### Static schemas
+
+By default, `Stream.get_json_schema` reads a `.json` file in the `schemas/` directory whose name is equal to the value of the `Stream.name` property. In turn, `Stream.name` by default returns the name of the class in snake case. Therefore, if you have a class `class EmployeeBenefits(HttpStream)`, the default behavior will look for a file called `schemas/employee_benefits.json`. You can override any of these behaviors as you need.
+
+Important note: any objects referenced via `$ref` should be placed in the `shared/` directory in their own `.json` files.
+
+### Dynamic schemas
+
+If you'd rather define your schema in code, override `Stream.get_json_schema` in your stream class to return a `dict` describing the schema using [JSONSchema](https://json-schema.org).
+
+### Dynamically modifying static schemas
+
+Place a `.json` file in the `schemas` folder containing the basic schema as described in the static schemas section. Then, override `Stream.get_json_schema` to run the default behavior, edit the returned value, and return the edited value:
+
+```python
+def get_json_schema(self):
+    schema = super().get_json_schema()
+    schema['dynamically_determined_property'] = "property"
+    return schema
+```
+
+## Reading records from the data source
+
+The only method required to implement a `Stream` is `Stream.read_records`. Given some information about how the stream should be read, this method should output an iterable object containing records from the data source. We recommend using generators as they are very efficient with regard to memory.
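+
+For example, here is a hedged sketch of the reading half of a minimal stream; the endpoint and record shape are hypothetical, and the schema is assumed to live in `schemas/customers.json`:
+
+```python
+from typing import Any, Iterable, Mapping
+
+import requests
+from airbyte_cdk.sources.streams import Stream
+
+
+class Customers(Stream):
+    primary_key = "id"
+
+    def read_records(self, sync_mode, cursor_field=None, stream_slice=None, stream_state=None) -> Iterable[Mapping[str, Any]]:
+        # Hypothetical endpoint that returns a JSON array of customer records.
+        response = requests.get("https://api.example.com/v1/customers")
+        # Yielding records one at a time keeps memory usage flat regardless of
+        # how large the result set is.
+        for record in response.json():
+            yield record
+```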
+
+## Incremental Streams
+
+We highly recommend implementing incremental syncs when feasible. See the [incremental streams page](incremental-stream.md) for more information.
+
diff --git a/docs/contributing-to-airbyte/python/concepts/http-streams.md b/docs/contributing-to-airbyte/python/concepts/http-streams.md
new file mode 100644
index 0000000000000..e088417cd3338
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/concepts/http-streams.md
@@ -0,0 +1,50 @@
+# HTTP-API-based Connectors
+
+The CDK offers base classes that greatly simplify writing HTTP API-based connectors. Some of the most useful features include helper functionality for:
+
+* Authentication \(basic auth, OAuth2, or any custom auth method\)
+* Pagination
+* Handling rate limiting with static or dynamic backoff timing
+
+All these features have sane off-the-shelf defaults but are completely customizable depending on your use case. They can also be combined with other stream features described in the [full refresh streams](full-refresh-stream.md) and [incremental streams](incremental-stream.md) sections.
+
+## Overview of HTTP Streams
+
+Just like any general HTTP request, the basic `HTTPStream` requires a URL to perform the request, and instructions on how to parse the resulting response.
+
+The full request path is broken up into two parts, the base URL and the path. This makes it easy for developers to create a Source-specific base `HTTPStream` class, with the base URL filled in, and individual streams for each available HTTP resource. The [Stripe CDK implementation](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/source.py) is a concrete example of this pattern.
+
+The base URL is set via the `url_base` property, while the path is set by implementing the abstract `path` function.
+
+The `parse_response` function instructs the stream how to parse the API response. It returns an `Iterable`, whose elements are each later transformed into an `AirbyteRecordMessage`. API routes whose response contains a single record generally have a `parse_response` function that returns a list of just that one response. Routes that return a list of records usually have a `parse_response` function that returns the received list with all its elements. Pulling the data out of the response is sufficient; any deserialization is handled by the CDK.
+
+Lastly, the `HTTPStream` must describe the schema of the records it outputs using JsonSchema. The simplest way to do this is by placing a `.json` file per stream in the `schemas` directory in the generated python module. The name of the `.json` file must match the lower snake case name of the corresponding Stream. Here are [examples](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-stripe/source_stripe/schemas) from the Stripe API.
+
+You can also dynamically set your schema. See the [schema docs](full-refresh-stream.md#defining-the-streams-schema) for more information.
+
+These four elements - the `url_base` property, the `path` function, the `parse_response` function and the schema file - are the bare minimum required to implement the `HTTPStream`, and can be seen in the same [Stripe example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/source.py#L38).
+
+This basic implementation gives us a Full-Refresh Airbyte Stream. We say Full-Refresh since the stream does not have state and will always indiscriminately read all data from the underlying API resource.
+
+## Authentication
+
+The CDK supports Basic and OAuth2.0 authentication via the `TokenAuthenticator` and `Oauth2Authenticator` classes respectively. Both authentication strategies are identical in that they place the API token in the `Authorization` header. The `Oauth2Authenticator` goes an additional step further and has mechanisms to, given a refresh token, refresh the current access token. Note that the `Oauth2Authenticator` currently only supports refresh tokens and not the full OAuth2.0 loop.
+
+Using either authenticator is as simple as passing the created authenticator into the relevant `HTTPStream` constructor. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/source.py#L242) from the Stripe API.
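+
+For instance, a hedged sketch of wiring an authenticator into a stream from a source's `streams` method \(`ExampleStream` and the `api_key` config field are hypothetical\):
+
+```python
+from typing import Any, List, Mapping
+
+from airbyte_cdk.sources.streams import Stream
+from airbyte_cdk.sources.streams.http.auth import TokenAuthenticator
+
+
+def streams(self, config: Mapping[str, Any]) -> List[Stream]:
+    # Places "Authorization: Bearer <token>" on every request the stream makes.
+    auth = TokenAuthenticator(token=config["api_key"])
+    # ExampleStream is a hypothetical HttpStream subclass defined elsewhere.
+    return [ExampleStream(authenticator=auth)]
+```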
+
+## Pagination
+
+Most APIs return large result sets in pages. The CDK accommodates paging via the `next_page_token` function. This function is meant to extract the next page "token" from the latest response. The contents of a "token" are completely up to the developer: it can be an ID, a page number, a partial URL, etc. The CDK will continue making requests as long as the `next_page_token` function returns non-`None` results. The token can then be used in `request_params` and other methods in `HttpStream` to page through API responses. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/source.py#L41) from the Stripe API.
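+
+As an illustration, here is a hedged sketch of both halves of the loop against a hypothetical API that returns an opaque cursor in its JSON body:
+
+```python
+from typing import Any, Mapping, MutableMapping, Optional
+
+import requests
+
+
+def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
+    # "next_cursor" is a hypothetical field; returning None tells the CDK
+    # there are no more pages to request.
+    cursor = response.json().get("next_cursor")
+    return {"cursor": cursor} if cursor else None
+
+def request_params(self, stream_state: Mapping[str, Any], stream_slice: Mapping[str, Any] = None, next_page_token: Mapping[str, Any] = None) -> MutableMapping[str, Any]:
+    # The token returned above is handed back here when building the next request.
+    return dict(next_page_token or {})
+```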
+
+## Rate Limiting
+
+The CDK, by default, will conduct exponential backoff on the HTTP code 429 and any 5XX exceptions, and fail after 5 tries.
+
+Retries are governed by the `should_retry` and the `backoff_time` methods. Override these methods to customize retry behavior. Here is an [example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-slack/source_slack/source.py#L72) from the Slack API.
+
+Note that Airbyte will always attempt to make as many requests as possible and only slow down if there are errors. It is not currently possible to specify a rate limit Airbyte should adhere to when making requests.
+
+### Stream Slicing
+
+When implementing [stream slicing](incremental-stream.md#streamstream_slices) in an `HTTPStream`, each slice is equivalent to an HTTP request; the stream will make one request per element returned by the `stream_slices` function. The current slice being read is passed into every other method in `HttpStream`, e.g. `request_params`, `request_headers`, `path`, etc., to be injected into a request. This allows functions like `request_params` and `path` to read the input slice and return the appropriate value.
+
diff --git a/docs/contributing-to-airbyte/python/concepts/incremental-stream.md b/docs/contributing-to-airbyte/python/concepts/incremental-stream.md
new file mode 100644
index 0000000000000..12a1c04bb622e
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/concepts/incremental-stream.md
@@ -0,0 +1,46 @@
+# Incremental Streams
+
+An incremental Stream is a stream which reads data incrementally. That is, it only reads data that was generated or updated since the last time it ran, and is thus far more efficient than a stream which reads all the source data every time it runs. If possible, developers are encouraged to implement incremental streams to reduce sync times and resource usage.
+
+Several new pieces, along with a few other optional concepts, are essential to understand how incrementality works with the CDK:
+
+* `AirbyteStateMessage`
+* cursor fields
+* `Stream.get_updated_state`
+
+### `AirbyteStateMessage`
+
+The `AirbyteStateMessage` persists state between syncs, and allows a new sync to pick up from where the previous sync last finished. See the [incremental sync guide](https://docs.airbyte.io/understanding-airbyte/connections/incremental-append) for more information.
+
+### Cursor fields
+
+The `cursor_field` refers to the field in the stream's output records used to determine the "recency" or ordering of records. An example is a `created_at` or `updated_at` field in an API or DB table.
+
+Cursor fields can be input by the user \(e.g. a user can choose to use an auto-incrementing `id` column in a DB table\) or they can be defined by the source, e.g. an API that defines `updated_at` as what determines the ordering of records.
+
+In the context of the CDK, setting the `Stream.cursor_field` property to any value informs the framework that this stream is incremental.
+
+### `Stream.get_updated_state`
+
+This function helps the CDK figure out the latest state for every record output by the connector \(as returned by the `Stream.read_records` method\). This allows a sync to resume from where the previous sync last stopped, regardless of success or failure. This function typically compares the cursor field of the state object with that of the latest record, picking the more recent of the two.
+
+### `Stream.stream_slices`
+
+The above methods can optionally be paired with the `stream_slices` function to granularly control exactly when state is saved. Conceptually, a Stream Slice is a subset of the records in a stream which represents the smallest unit of data which can be re-synced. Once a full slice is read, an `AirbyteStateMessage` will be output, causing state to be saved. If a connector fails while reading the Nth slice of a stream, then the next time it retries, it will begin reading at the beginning of the Nth slice again, rather than re-reading slices `1...N-1`.
+
+A Slice object is not typed, and the developer is free to include any information necessary to make the request. This function is called when the `Stream` is about to be read. Typically, the `stream_slices` function, via inspecting the state object, generates a Slice for every request to be made.
+
+As an example, suppose an API is able to dispense data hourly. If the last sync was exactly 24 hours ago, we can either make one API call retrieving all data at once, or make 24 calls each retrieving an hour's worth of data. In the latter case, the `stream_slices` function sees that the previous state contains yesterday's timestamp, and returns a list of 24 slices, each with a different hourly timestamp to be used when creating requests. If the stream fails halfway through \(at the 12th slice\), then the next time it starts reading, it will read from the beginning of the 12th slice.
+
+For a more in-depth description of stream slicing, see the [Stream Slices guide](stream_slices.md).
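+
+Concretely, the required pieces might look like the following hedged sketch, assuming a hypothetical `updated_at` cursor field on each record \(`ExampleStream` is a hypothetical base stream\):
+
+```python
+from typing import Any, Mapping, MutableMapping
+
+
+class IncrementalExampleStream(ExampleStream):
+    # ExampleStream is a hypothetical base stream, e.g. an HttpStream subclass.
+    # Setting cursor_field marks this stream as incremental.
+    cursor_field = "updated_at"
+
+    def get_updated_state(self, current_stream_state: MutableMapping[str, Any], latest_record: Mapping[str, Any]) -> Mapping[str, Any]:
+        # Keep whichever cursor value is more recent: the saved state's or the
+        # latest record's.
+        latest = latest_record.get(self.cursor_field, "")
+        current = current_stream_state.get(self.cursor_field, "")
+        return {self.cursor_field: max(latest, current)}
+```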
+
+## Conclusion
+
+In summary, an incremental stream requires:
+
+* the `cursor_field` property
+* the `get_updated_state` function
+* optionally, the `stream_slices` function
+
diff --git a/docs/contributing-to-airbyte/python/concepts/python-concepts.md b/docs/contributing-to-airbyte/python/concepts/python-concepts.md
new file mode 100644
index 0000000000000..0b97f2ae3c49f
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/concepts/python-concepts.md
@@ -0,0 +1,59 @@
+# Python Concepts
+
+The Airbyte CDK makes use of various not-so-obvious Python concepts. You might want to revisit these concepts as you implement your connector:
+
+## Abstract Classes [ABCs \(AbstractBaseClasses\)](https://docs.python.org/3/library/abc.html) and [abstractmethods](https://docs.python.org/3/library/abc.html#abc.abstractmethod)
+
+You'll want a strong understanding of these as the central API concepts require extending and using them.
+
+## [Keyword Arguments](https://realpython.com/python-kwargs-and-args/)
+
+You'll often see this referred to as `**kwargs` in the code.
+
+## [Properties](https://www.freecodecamp.org/news/python-property-decorator/)
+
+Note that there are two ways of defining properties: statically and dynamically.
+
+### Statically
+
+```python
+from abc import ABC, abstractmethod
+
+class Employee(ABC):
+    @property
+    @abstractmethod
+    def job_title(self):
+        """Returns this employee's job title."""
+
+class Pilot(Employee):
+    job_title = "pilot"
+```
+
+Notice how statically defining properties in this manner looks the same as defining variables. You can then reference this property as follows:
+
+```python
+pilot = Pilot()
+print(pilot.job_title) # pilot
+```
+
+### Dynamically
+
+You can also run arbitrary code to get the value of a property. For example:
+
+```python
+from abc import ABC, abstractmethod
+
+class Employee(ABC):
+    @property
+    @abstractmethod
+    def job_title(self):
+        """Returns this employee's job title."""
+
+class Pilot(Employee):
+    @property
+    def job_title(self):
+        # You can run any arbitrary code and return its result
+        return "pilot"
+```
+
+## [Generators](https://wiki.python.org/moin/Generators)
+
+Generators are basically iterators over arbitrary source data. They are handy because their syntax is extremely concise, and they feel just like any other list or collection when you work with them in code.
+
+If you see `yield` anywhere in the code -- that's a generator at work.
+
diff --git a/docs/contributing-to-airbyte/python/concepts/stream_slices.md b/docs/contributing-to-airbyte/python/concepts/stream_slices.md
new file mode 100644
index 0000000000000..21fa6148f06e5
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/concepts/stream_slices.md
@@ -0,0 +1,28 @@
+# Stream Slices
+
+## Stream Slicing
+
+A Stream Slice is a subset of the records in a stream.
+
+When a stream is being read incrementally, slices can be used to control when state is saved.
+
+When slicing is enabled, a state message will be output by the connector after reading every slice. Slicing is completely optional and is provided as a way for connectors to checkpoint state in a more granular way than basic interval-based state checkpointing. Slicing is typically used when reading a large amount of data or when the underlying data source imposes strict rate limits that make it difficult to re-read the same data over and over again. That said, interval-based checkpointing is compatible with slicing, with one difference: intervals are counted within a slice rather than across all records. In other words, the counter used to determine if the interval has been reached \(e.g. every 10k records\) resets at the beginning of every slice.
+
+The relationship between records in a slice is up to the developer, but slices are typically used to implement date-based checkpointing, for example to group records generated within a particular hour, day, or month.
+
+Slices can be hard-coded or generated dynamically \(e.g. by making a query\).
+
+The only restriction imposed on slices is that they must be described with a list of `dict`s returned from the `Stream.stream_slices()` method, where each `dict` describes a slice. The `dict`s may have any schema, and are passed as input to each stream's `read_records` method. This way, the connector can read the current slice description \(the input `dict`\) and use it to make queries as needed.
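+
+For example, a hedged sketch of a `stream_slices` implementation that produces one slice per day; the `date` key and `start_date` attribute are hypothetical choices, not a required schema:
+
+```python
+from datetime import datetime, timedelta
+from typing import Any, Iterable, Mapping, Optional
+
+
+def stream_slices(self, stream_state: Mapping[str, Any] = None, **kwargs) -> Iterable[Optional[Mapping[str, Any]]]:
+    # Resume from the saved cursor if present, otherwise from the configured start.
+    start = datetime.strptime(stream_state["date"], "%Y-%m-%d") if stream_state else self.start_date
+    slices = []
+    while start < datetime.now():
+        # Each dict is one slice, i.e. one unit of re-syncable work.
+        slices.append({"date": start.strftime("%Y-%m-%d")})
+        start += timedelta(days=1)
+    return slices
+```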
+
+### Use cases
+
+If your use case only requires saving state based on an interval, e.g. every 10,000 records, and nothing more sophisticated, then slicing is not necessary and you can instead set the `state_checkpoint_interval` property on a stream.
+
+#### The Slack connector: time-based slicing for large datasets
+
+Slack is a chat platform for businesses. Collectively, a company can easily post tens or hundreds of thousands of messages in a single Slack instance per day. So when writing a connector to pull chats from Slack, it's easy to run into rate limits, or for the sync to take a very long time to complete because of the large amount of data. So we want a way to frequently "save" which data we already read from the connector so that if there is a halfway failure, we pick up reading where we left off. In addition, the Slack API does not return messages sorted by timestamp, so we cannot use `state_checkpoint_interval`s.
+
+This is a great use case for stream slicing. The `messages` stream, which outputs one record per chat message, can slice records by time, e.g. hourly. It implements this by specifying the beginning and end timestamp of each hour that it wants to pull data from. Then, after all the records in a given hour \(i.e. slice\) have been read, the connector outputs a STATE message to indicate that state should be saved. This way, if the connector ever fails during a sync \(for example if the API goes down\), it will re-read at most one hour's worth of messages.
+
+See the implementation of the Slack connector [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-slack/source_slack/source.py).
+
diff --git a/docs/contributing-to-airbyte/python/tutorials/README.md b/docs/contributing-to-airbyte/python/tutorials/README.md
new file mode 100644
index 0000000000000..84ce15b788618
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/tutorials/README.md
@@ -0,0 +1,2 @@
+# Tutorials
+
diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-speedrun.md b/docs/contributing-to-airbyte/python/tutorials/cdk-speedrun.md
new file mode 100644
index 0000000000000..f1fa7b932b92d
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/tutorials/cdk-speedrun.md
@@ -0,0 +1,229 @@
+# Speedrun: Creating a Source with the CDK
+
+## CDK Speedrun \(HTTP API Source Creation [Any%](https://en.wikipedia.org/wiki/Speedrun#:~:text=Any%25%2C%20or%20fastest%20completion%2C,the%20game%20to%20its%20fullest.&text=Specific%20requirements%20for%20a%20100,different%20depending%20on%20the%20game.) Route\)
+
+This is a blazing-fast guide to building an HTTP source connector. Think of it as the TL;DR version of [this tutorial](https://github.com/airbytehq/airbyte/tree/576d932ddbd748713ad5cbeabebd3c21cd0148ad/airbyte-cdk/python/docs/cdk-tutorial-python-http.md).
+
+## Dependencies
+
+1. Python >= 3.7
+2. Docker
+3. NodeJS
+
+#### Generate the Template
+
+```bash
+$ cd airbyte-integrations/connector-templates/generator # start from repo root
+$ npm install
+$ npm run generate
+```
+
+Select the `Python HTTP API Source` and name it `python-http-example`.
+
+#### Create Dev Environment
+
+```bash
+cd ../../connectors/source-python-http-example
+python -m venv .venv # Create a virtual environment in the .venv directory
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+
+### Define Connector Inputs
+
+```bash
+cd source_python_http_example
+```
+
+We're working with the Exchange Rates API, so we need to define our input schema to reflect that. Open the `spec.json` file here and replace it with:
+
+```javascript
+{
+  "documentationUrl": "https://docs.airbyte.io/integrations/sources/exchangeratesapi",
+  "connectionSpecification": {
+    "$schema": "http://json-schema.org/draft-07/schema#",
+    "title": "Python Http Tutorial Spec",
+    "type": "object",
+    "required": ["start_date", "base"],
+    "additionalProperties": false,
+    "properties": {
+      "start_date": {
+        "type": "string",
+        "description": "Start getting data from that date.",
+        "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}$",
+        "examples": ["%Y-%m-%d"]
+      },
+      "base": {
+        "type": "string",
+        "examples": ["USD", "EUR"],
+        "description": "ISO reference currency. See here."
+      }
+    }
+  }
+}
+```
+
+Ok, let's write a function that checks the inputs we just defined. Nuke the `source.py` file. Now add this code to it. For a crucial time skip, we're going to define all the imports we need in the future here.
+
+```python
+from datetime import datetime, timedelta
+from typing import Any, Iterable, List, Mapping, MutableMapping, Optional, Tuple
+
+import requests
+from airbyte_cdk.sources import AbstractSource
+from airbyte_cdk.sources.streams import Stream
+from airbyte_cdk.sources.streams.http import HttpStream
+from airbyte_cdk.sources.streams.http.auth import NoAuth
+
+class SourcePythonHttpExample(AbstractSource):
+    def check_connection(self, logger, config) -> Tuple[bool, Any]:
+        accepted_currencies = {
+            "USD",
+            "JPY",
+            "BGN",
+            "CZK",
+            "DKK",
+        }  # there are more currencies but let's assume these are the only allowed ones
+        input_currency = config["base"]
+        if input_currency not in accepted_currencies:
+            return False, f"Input currency {input_currency} is invalid. Please input one of the following currencies: {accepted_currencies}"
+        else:
+            return True, None
+
+    def streams(self, config: Mapping[str, Any]) -> List[Stream]:
+        auth = NoAuth()  # this API needs no authentication
+        # Parse the date from a string into a datetime object
+        start_date = datetime.strptime(config["start_date"], "%Y-%m-%d")
+        return [ExchangeRates(authenticator=auth, base=config["base"], start_date=start_date)]
+```
+
+Test it.
+
+```bash
+cd ..
+mkdir sample_files
+echo '{"start_date": "2021-04-01", "base": "USD"}' > sample_files/config.json
+echo '{"start_date": "2021-04-01", "base": "BTC"}' > sample_files/invalid_config.json
+python main.py check --config sample_files/config.json
+python main.py check --config sample_files/invalid_config.json
+```
+
+Expected output:
+
+```text
+> python main.py check --config sample_files/config.json
+{"type": "CONNECTION_STATUS", "connectionStatus": {"status": "SUCCEEDED"}}
+
+> python main.py check --config sample_files/invalid_config.json
+{"type": "CONNECTION_STATUS", "connectionStatus": {"status": "FAILED", "message": "Input currency BTC is invalid. Please input one of the following currencies: {'DKK', 'USD', 'CZK', 'BGN', 'JPY'}"}}
+```
+
+### Define your Stream
+
+In your `source.py` file, add this `ExchangeRates` class. This stream represents an endpoint you want to hit.
+
+```python
+from airbyte_cdk.sources.streams.http import HttpStream
+
+class ExchangeRates(HttpStream):
+    url_base = "https://api.ratesapi.io/"
+
+    # Set this as a noop.
+    primary_key = None
+
+    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
+        # The API does not offer pagination, so we return None to indicate there are no more pages in the response
+        return None
+
+    def path(self) -> str:
+        return ""  # TODO
+
+    def parse_response(self) -> Iterable[Mapping]:
+        return None  # TODO
+```
+
+Now download [this file](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/docs/tutorials/http_api_source_assets/exchange_rates.json). Name it `exchange_rates.json` and place it in `/source_python_http_example/schemas`. It defines your output schema.
+
+Test your discover function. You should receive a fairly large JSON object in return.
+
+```bash
+python main.py discover --config sample_files/config.json
+```
+
+### Reading Data from the Source
+
+Update your `ExchangeRates` class to implement the required functions as follows:
+
+```python
+class ExchangeRates(HttpStream):
+    url_base = "https://api.ratesapi.io/"
+
+    primary_key = None
+
+    def __init__(self, base: str, **kwargs):
+        super().__init__()
+        self.base = base
+
+    def path(
+        self,
+        stream_state: Mapping[str, Any] = None,
+        stream_slice: Mapping[str, Any] = None,
+        next_page_token: Mapping[str, Any] = None
+    ) -> str:
+        # The "/latest" path gives us the latest currency exchange rates
+        return "latest"
+
+    def request_params(
+        self,
+        stream_state: Mapping[str, Any],
+        stream_slice: Mapping[str, Any] = None,
+        next_page_token: Mapping[str, Any] = None,
+    ) -> MutableMapping[str, Any]:
+        # The api requires that we include the base currency as a query param so we do that in this method
+        return {'base': self.base}
+
+    def parse_response(
+        self,
+        response: requests.Response,
+        stream_state: Mapping[str, Any],
+        stream_slice: Mapping[str, Any] = None,
+        next_page_token: Mapping[str, Any] = None,
+    ) -> Iterable[Mapping]:
+        # The response is a simple JSON whose schema matches our stream's schema exactly,
+        # so we just return a list containing the response
+        return [response.json()]
+
+    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
+        # The API does not offer pagination,
+        # so we return None to indicate there are no more pages in the response
+        return None
+```
+
+Update the `streams` method in your `SourcePythonHttpExample` class to pass the currency base from the config into the stream:
+
+```python
+def streams(self, config: Mapping[str, Any]) -> List[Stream]:
+    auth = NoAuth()
+    return [ExchangeRates(authenticator=auth, base=config['base'])]
+```
+
+We now need a catalog that defines all of our streams. We only have one, `ExchangeRates`. Download that file [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/docs/tutorials/http_api_source_assets/configured_catalog.json). Place it in `/sample_files`, named `configured_catalog.json`.
+
+Let's read some data.
+
+```bash
+python main.py read --config sample_files/config.json --catalog sample_files/configured_catalog.json
+```
+
+If all goes well, containerize it so you can use it in the UI:
+
+```bash
+docker build . -t airbyte/source-python-http-example:dev
+```
+
+You're done. Stop the clock :\)
+
diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/0-getting-started.md b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/0-getting-started.md
new file mode 100644
index 0000000000000..55344cb8eed82
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/0-getting-started.md
@@ -0,0 +1,27 @@
+# Getting Started
+
+## Summary
+
+This is a step-by-step guide for how to create an Airbyte source in Python to read data from an HTTP API. We'll be using the Exchange Rates API as an example since it is simple and demonstrates a lot of the capabilities of the CDK.
+
+## Requirements
+
+* Python >= 3.7
+* Docker
+* NodeJS \(only used to generate the connector\). We'll remove the NodeJS dependency soon.
+
+All the commands below assume that `python` points to a version of python >=3.7.9. On some systems, `python` points to a Python2 installation and `python3` points to Python3. If this is the case on your machine, substitute all `python` commands in this guide with `python3`.
+
+## Checklist
+
+* Step 1: Create the source using the template
+* Step 2: Install dependencies for the new source
+* Step 3: Define the inputs needed by your connector
+* Step 4: Implement connection checking
+* Step 5: Declare the schema of your streams
+* Step 6: Implement functionality for reading your streams
+* Step 7: Use the connector in Airbyte
+* Step 8: Write unit tests or integration tests
+
+Each step of the checklist is explained in more detail on the following pages. At the end of the tutorial, we also mention how you can submit the connector to be included with the general Airbyte release.
+
diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/1-creating-the-source.md b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/1-creating-the-source.md
new file mode 100644
index 0000000000000..bd74f27eaabe2
--- /dev/null
+++ b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/1-creating-the-source.md
@@ -0,0 +1,15 @@
+# Step 1: Creating the Source
+
+Airbyte provides a code generator which bootstraps the scaffolding for our connector.
+
+```bash
+$ cd airbyte-integrations/connector-templates/generator # assumes you are starting from the root of the Airbyte project.
+# Install NPM from https://www.npmjs.com/get-npm if you don't have it
+$ npm install
+$ npm run generate
+```
+
+Select the `Python HTTP API Source` template and then input the name of your connector. For this walk-through we will refer to our source as `python-http-example`. The finalized source code for this tutorial can be found [here](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-python-http-tutorial).
The finalized source code for this tutorial can be found [here](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-python-http-tutorial). + +The source we will build in this tutorial will pull data from the [Rates API](https://ratesapi.io/), a free and open API which documents historical exchange rates for fiat currencies. + diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/2-install-dependencies.md b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/2-install-dependencies.md new file mode 100644 index 0000000000000..468713097ad4f --- /dev/null +++ b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/2-install-dependencies.md @@ -0,0 +1,78 @@ +# Step 2: Install Dependencies + +Now that you've generated the module, let's navigate to its directory and install dependencies: + +```text +cd ../../connectors/source- +python -m venv .venv # Create a virtual environment in the .venv directory +source .venv/bin/activate # enable the venv +pip install -r requirements.txt +``` + +This step sets up the initial python environment. **All** subsequent `python` or `pip` commands assume you have activated your virtual environment. + +Let's verify everything is working as intended. Run: + +```text +python main.py spec +``` + +You should see some output: + +```text +{"type": "SPEC", "spec": {"documentationUrl": "https://docsurl.com", "connectionSpecification": {"$schema": "http://json-schema.org/draft-07/schema#", "title": "Python Http Tutorial Spec", "type": "object", "required": ["TODO"], "additionalProperties": false, "properties": {"TODO: This schema defines the configuration required for the source. This usually involves metadata such as database and/or authentication information.": {"type": "string", "description": "describe me"}}}}} +``` + +We just ran Airbyte Protocol's `spec` command! We'll talk more about this later, but this is a simple sanity check to make sure everything is wired up correctly. + +Note that the `main.py` file is a simple script that makes it easy to run your connector. Its invocation format is `python main.py [args]`. See the module's generated `README.md` for the commands it supports. + +## Notes on iteration cycle + +### Dependencies + +Python dependencies for your source should be declared in `airbyte-integrations/connectors/source-/setup.py` in the `install_requires` field. You will notice that a couple of Airbyte dependencies are already declared there. Do not remove these; they give your source access to the helper interfaces provided by the generator. + +You may notice that there is a `requirements.txt` in your source's directory as well. Don't edit this. It is autogenerated and used to provide Airbyte dependencies. All your dependencies should be declared in `setup.py`. + +### Development Environment + +The commands we ran above created a [Python virtual environment](https://docs.python.org/3/tutorial/venv.html) for your source. If you want your IDE to auto complete and resolve dependencies properly, point it at the virtual env `airbyte-integrations/connectors/source-/.venv`. Also anytime you change the dependencies in the `setup.py` make sure to re-run `pip install -r requirements.txt`. + +### Iterating on your implementation + +There are two ways we recommend iterating on a source. Consider using whichever one matches your style. + +**Run the source using python** + +You'll notice in your source's directory that there is a python file called `main.py`. 
This file exists as convenience for development. You run it to test that your source works: + +```text +# from airbyte-integrations/connectors/source- +python main.py spec +python main.py check --config secrets/config.json +python main.py discover --config secrets/config.json +python main.py read --config secrets/config.json --catalog sample_files/configured_catalog.json +``` + +The nice thing about this approach is that you can iterate completely within python. The downside is that you are not quite running your source as it will actually be run by Airbyte. Specifically, you're not running it from within the docker container that will house it. + +**Run the source using docker** + +If you want to run your source exactly as it will be run by Airbyte \(i.e. within a docker container\), you can use the following commands from the connector module directory \(`airbyte-integrations/connectors/source-python-http-example`\): + +```text +# First build the container +docker build . -t airbyte/source-:dev + +# Then use the following commands to run it +docker run --rm airbyte/source-python-http-example:dev spec +docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-python-http-example:dev check --config /secrets/config.json +docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-python-http-example:dev discover --config /secrets/config.json +docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/sample_files:/sample_files airbyte/source-python-http-example:dev read --config /secrets/config.json --catalog /sample_files/configured_catalog.json +``` + +Note: Each time you make a change to your implementation you need to re-build the connector image via `docker build . -t airbyte/source-:dev`. This ensures the new python code is added into the docker container. + +The nice thing about this approach is that you are running your source exactly as it will be run by Airbyte. The tradeoff is iteration is slightly slower, as the connector is re-built between each change. + diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/3-define-inputs.md b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/3-define-inputs.md new file mode 100644 index 0000000000000..222b6511c5cef --- /dev/null +++ b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/3-define-inputs.md @@ -0,0 +1,43 @@ +# Step 3: Define Inputs + +Each connector declares the inputs it needs to read data from the underlying data source. This is the Airbyte Protocol's `spec` operation. + +The simplest way to implement this is by creating a `.json` file in `source_/spec.json` which describes your connector's inputs according to the [ConnectorSpecification](https://github.com/airbytehq/airbyte/blob/master/airbyte-protocol/models/src/main/resources/airbyte_protocol/airbyte_protocol.yaml#L211) schema. This is a good place to start when developing your source. Using JsonSchema, define what the inputs are \(e.g. username and password\). Here's [an example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-freshdesk/source_freshdesk/spec.json) of what the `spec.json` looks like for the Freshdesk API source. + +For more details on what the spec is, you can read about the Airbyte Protocol [here](https://docs.airbyte.io/understanding-airbyte/airbyte-specification). + +The generated code that Airbyte provides, handles implementing the `spec` method for you. It assumes that there will be a file called `spec.json` in the same directory as `source.py`. 
If you have declared the necessary JsonSchema in `spec.json` you should be done with this step. + +Given that we'll pulling currency data for our example source, we'll define the following `spec.json`: + +```text +{ + "documentationUrl": "https://docs.airbyte.io/integrations/sources/exchangeratesapi", + "connectionSpecification": { + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "Python Http Tutorial Spec", + "type": "object", + "required": ["start_date", "currency_base"], + "additionalProperties": false, + "properties": { + "start_date": { + "type": "string", + "description": "Start getting data from that date.", + "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}$", + "examples": ["%Y-%m-%d"] + }, + "base": { + "type": "string", + "examples": ["USD", "EUR"] + "description": "ISO reference currency. See here." + } + } + } +} +``` + +In addition to metadata, we define two inputs: + +* `start_date`: The beginning date to start tracking currency exchange rates from +* `base`: The currency whose rates we're interested in tracking + diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/4-connection-checking.md b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/4-connection-checking.md new file mode 100644 index 0000000000000..cb037c36a1b8e --- /dev/null +++ b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/4-connection-checking.md @@ -0,0 +1,60 @@ +# Step 4: Connection Checking + +The second operation in the Airbyte Protocol that we'll implement is the `check` operation. + +This operation verifies that the input configuration supplied by the user can be used to connect to the underlying data source. Note that this user-supplied configuration has the values described in the `spec.json` filled in. In other words if the `spec.json` said that the source requires a `username` and `password` the config object might be `{ "username": "airbyte", "password": "password123" }`. You should then implement something that returns a json object reporting, given the credentials in the config, whether we were able to connect to the source. + +In our case, this is a fairly trivial check since the API requires no credentials. Instead, let's verify that the user-input `base` currency is a legitimate currency. In `source.py` we'll find the following autogenerated source: + +```python +class SourcePythonHttpTutorial(AbstractSource): + + def check_connection(self, logger, config) -> Tuple[bool, any]: + """ + TODO: Implement a connection check to validate that the user-provided config can be used to connect to the underlying API + + See https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/source.py#L232 + for an example. + + :param config: the user-input config object conforming the connector's spec.json + :param logger: logger object + :return Tuple[bool, any]: (True, None) if the input config can be used to connect to the API successfully, (False, error) otherwise. + """ + return True, None + +... +``` + +Following the docstring instructions, we'll change the implementation to verify that the input currency is a real currency: + +```python + def check_connection(self, logger, config) -> Tuple[bool, any]: + accepted_currencies = {"USD", "JPY", "BGN", "CZK", "DKK"} # assume these are the only allowed currencies + input_currency = config['base'] + if input_currency not in accepted_currencies: + return False, f"Input currency {input_currency} is invalid. 
Please input one of the following currencies: {accepted_currencies}" + else: + return True, None +``` + +Let's test out this implementation by creating two objects: a valid and an invalid config and attempt to give them as input to the connector + +```text +echo '{"start_date": "2021-04-01", "base": "USD"}' > sample_files/config.json +echo '{"start_date": "2021-04-01", "base": "BTC"}' > sample_files/invalid_config.json +python main.py check --config sample_files/config.json +python main.py check --config sample_files/invalid_config.json +``` + +You should see output like the following: + +```text +> python main.py check --config sample_files/config.json +{"type": "CONNECTION_STATUS", "connectionStatus": {"status": "SUCCEEDED"}} + +> python main.py check --config sample_files/invalid_config.json +{"type": "CONNECTION_STATUS", "connectionStatus": {"status": "FAILED", "message": "Input currency BTC is invalid. Please input one of the following currencies: {'DKK', 'USD', 'CZK', 'BGN', 'JPY'}"}} +``` + +While developing, we recommend storing configs which contain secrets in `secrets/config.json` because the `secrets` directory is gitignored by default. + diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/5-declare-schema.md b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/5-declare-schema.md new file mode 100644 index 0000000000000..d032d5ae2ecef --- /dev/null +++ b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/5-declare-schema.md @@ -0,0 +1,75 @@ +# Step 5: Declare the Schema + +The `discover` method of the Airbyte Protocol returns an `AirbyteCatalog`: an object which declares all the streams output by a connector and their schemas. It also declares the sync modes supported by the stream \(full refresh or incremental\). See the [catalog tutorial](https://docs.airbyte.io/tutorials/tutorials/beginners-guide-to-catalog) for more information. + +This is a simple task with the Airbyte CDK. For each stream in our connector we'll need to: 1. Create a python `class` in `source.py` which extends `HttpStream` 2. Place a `.json` file in the `source_/schemas/` directory. The name of the file should be the snake\_case name of the stream whose schema it describes, and its contents should be the JsonSchema describing the output from that stream. + +Let's create a class in `source.py` which extends `HttpStream`. You'll notice there are classes with extensive comments describing what needs to be done to implement various connector features. Feel free to read these classes as needed. But for the purposes of this tutorial, let's assume that we are adding classes from scratch either by deleting those generated classes or editing them to match the implementation below. + +We'll begin by creating a stream to represent the data that we're pulling from the Exchange Rates API: + +```python +class ExchangeRates(HttpStream): + url_base = "https://api.ratesapi.io/" + + # Set this as a noop. 
+ primary_key = None + + def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]: + # The API does not offer pagination, so we return None to indicate there are no more pages in the response + return None + + def path( + self, + stream_state: Mapping[str, Any] = None, + stream_slice: Mapping[str, Any] = None, + next_page_token: Mapping[str, Any] = None + ) -> str: + return "" # TODO + + def parse_response( + self, + response: requests.Response, + stream_state: Mapping[str, Any], + stream_slice: Mapping[str, Any] = None, + next_page_token: Mapping[str, Any] = None, + ) -> Iterable[Mapping]: + return None # TODO +``` + +Note that this implementation is entirely empty -- we haven't actually done anything. We'll come back to this in the next step. But for now we just want to declare the schema of this stream. We'll declare this as a stream that the connector outputs by returning it from the `streams` method: + +```python +from airbyte_cdk.sources.streams.http.auth import NoAuth + +class SourcePythonHttpTutorial(AbstractSource): + + def check_connection(self, logger, config) -> Tuple[bool, any]: + ... + + def streams(self, config: Mapping[str, Any]) -> List[Stream]: + # NoAuth just means there is no authentication required for this API and is included for completeness. + # Skip passing an authenticator if no authentication is required. + # Other authenticators are available for API token-based auth and Oauth2. + auth = NoAuth() + return [ExchangeRates(authenticator=auth)] +``` + +Having created this stream in code, we'll put a file `exchange_rates.json` in the `schemas/` folder. You can download the JSON file describing the output schema [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/docs/tutorials/http_api_source_assets/exchange_rates.json) for convenience and place it in `schemas/`. + +With `.json` schema file in place, let's see if the connector can now find this schema and produce a valid catalog: + +```text +python main.py discover --config sample_files/config.json +``` + +you should see some output like: + +```text +{"type": "CATALOG", "catalog": {"streams": [{"name": "exchange_rates", "json_schema": {"$schema": "http://json-schema.org/draft-04/schema#", "type": "object", "properties": {"base": {"type": "string"}, "rates": {"type": "object", "properties": {"GBP": {"type": "number"}, "HKD": {"type": "number"}, "IDR": {"type": "number"}, "PHP": {"type": "number"}, "LVL": {"type": "number"}, "INR": {"type": "number"}, "CHF": {"type": "number"}, "MXN": {"type": "number"}, "SGD": {"type": "number"}, "CZK": {"type": "number"}, "THB": {"type": "number"}, "BGN": {"type": "number"}, "EUR": {"type": "number"}, "MYR": {"type": "number"}, "NOK": {"type": "number"}, "CNY": {"type": "number"}, "HRK": {"type": "number"}, "PLN": {"type": "number"}, "LTL": {"type": "number"}, "TRY": {"type": "number"}, "ZAR": {"type": "number"}, "CAD": {"type": "number"}, "BRL": {"type": "number"}, "RON": {"type": "number"}, "DKK": {"type": "number"}, "NZD": {"type": "number"}, "EEK": {"type": "number"}, "JPY": {"type": "number"}, "RUB": {"type": "number"}, "KRW": {"type": "number"}, "USD": {"type": "number"}, "AUD": {"type": "number"}, "HUF": {"type": "number"}, "SEK": {"type": "number"}}}, "date": {"type": "string"}}}, "supported_sync_modes": ["full_refresh"]}]}} +``` + +It's that simple! Now the connector knows how to declare your connector's stream's schema. 
We declare only one stream since our source is simple, but the principle is exactly the same if you had many streams. + +You can also dynamically define schemas, but that's beyond the scope of this tutorial. See the [schema docs](../../concepts/full-refresh-stream.md#defining-the-streams-schema) for more information. + diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/6-read-data.md b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/6-read-data.md new file mode 100644 index 0000000000000..bb5485abdc465 --- /dev/null +++ b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/6-read-data.md @@ -0,0 +1,217 @@ +# Step 6: Read Data + +Describing schemas is good and all, but at some point we have to start reading data! So let's get to work. But before, let's describe what we're about to do: + +The `HttpStream` superclass, like described in the [concepts documentation](../../concepts/http-streams.md), is facilitating reading data from HTTP endpoints. It contains built-in functions or helpers for: + +* authentication +* pagination +* handling rate limiting or transient errors +* and other useful functionality + +In order for it to be able to do this, we have to provide it with a few inputs: + +* the URL base and path of the endpoint we'd like to hit +* how to parse the response from the API +* how to perform pagination + +Optionally, we can provide additional inputs to customize requests: + +* request parameters and headers +* how to recognize rate limit errors, and how long to wait \(by default it retries 429 and 5XX errors using exponential backoff\) +* HTTP method and request body if applicable + +There are many other customizable options - you can find them in the [`airbyte_cdk.sources.streams.http.HttpStream`](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py) class. + +So in order to read data from the exchange rates API, we'll fill out the necessary information for the stream to do its work. First, we'll implement a basic read that just reads the last day's exchange rates, then we'll implement incremental sync using stream slicing. 
+ +Let's begin by pulling data for the last day's rates by using the `/latest` endpoint: + +```python +class ExchangeRates(HttpStream): + url_base = "https://api.ratesapi.io/" + + primary_key = None + + def __init__(self, base: str, **kwargs): + super().__init__() + self.base = base + + + def path( + self, + stream_state: Mapping[str, Any] = None, + stream_slice: Mapping[str, Any] = None, + next_page_token: Mapping[str, Any] = None + ) -> str: + # The "/latest" path gives us the latest currency exchange rates + return "latest" + + def request_params( + self, + stream_state: Mapping[str, Any], + stream_slice: Mapping[str, Any] = None, + next_page_token: Mapping[str, Any] = None, + ) -> MutableMapping[str, Any]: + # The api requires that we include the base currency as a query param so we do that in this method + return {'base': self.base} + + def parse_response( + self, + response: requests.Response, + stream_state: Mapping[str, Any], + stream_slice: Mapping[str, Any] = None, + next_page_token: Mapping[str, Any] = None, + ) -> Iterable[Mapping]: + # The response is a simple JSON whose schema matches our stream's schema exactly, + # so we just return a list containing the response + return [response.json()] + + def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]: + # The API does not offer pagination, + # so we return None to indicate there are no more pages in the response + return None +``` + +This may look big, but that's just because there are lots of \(unused, for now\) parameters in these methods \(those can be hidden with Python's `**kwargs`, but don't worry about it for now\). Really we just added a few lines of "significant" code: 1. Added a constructor `__init__` which stores the `base` currency to query for. 2. `return {'base': self.base}` to add the `?base=` query parameter to the request based on the `base` input by the user. 3. `return [response.json()]` to parse the response from the API to match the schema of our schema `.json` file. 4. `return "latest"` to indicate that we want to hit the `/latest` endpoint of the API to get the latest exchange rate data. + +Let's also pass the `base` parameter input by the user to the stream class: + +```python +def streams(self, config: Mapping[str, Any]) -> List[Stream]: + auth = NoAuth() + return [ExchangeRates(authenticator=auth, base=config['base'])] +``` + +We're now ready to query the API! + +To do this, we'll need a [ConfiguredCatalog](https://docs.airbyte.io/tutorials/tutorials/beginners-guide-to-catalog). We've prepared one [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/docs/tutorials/http_api_source_assets/configured_catalog.json) -- download this and place it in `sample_files/configured_catalog.json`. 
Then run: + +```text + python main.py read --config sample_files/config.json --catalog sample_files/configured_catalog.json +``` + +you should see some output lines, one of which is a record from the API: + +```text +{"type": "RECORD", "record": {"stream": "exchange_rates", "data": {"base": "USD", "rates": {"GBP": 0.7196938353, "HKD": 7.7597848573, "IDR": 14482.4824162185, "ILS": 3.2412081092, "DKK": 6.1532478279, "INR": 74.7852709971, "CHF": 0.915763343, "MXN": 19.8439387671, "CZK": 21.3545717832, "SGD": 1.3261894911, "THB": 31.4398014067, "HRK": 6.2599917253, "EUR": 0.8274720728, "MYR": 4.0979726934, "NOK": 8.3043442284, "CNY": 6.4856433595, "BGN": 1.61836988, "PHP": 48.3516756309, "PLN": 3.770872983, "ZAR": 14.2690111709, "CAD": 1.2436905254, "ISK": 124.9482829954, "BRL": 5.4526272238, "RON": 4.0738932561, "NZD": 1.3841125362, "TRY": 8.3101365329, "JPY": 108.0182043856, "RUB": 74.9555647497, "KRW": 1111.7583781547, "USD": 1.0, "AUD": 1.2840711626, "HUF": 300.6206040546, "SEK": 8.3829540753}, "date": "2021-04-26"}, "emitted_at": 1619498062000}} +``` + +There we have it - a stream which reads data in just a few lines of code! + +We theoretically _could_ stop here and call it a connector. But let's give adding incremental sync a shot. + +## Adding incremental sync + +To add incremental sync, we'll do a few things: 1. Pass the `start_date` param input by the user into the stream. 2. Declare the stream's `cursor_field`. 3. Implement the `get_updated_state` method. 4. Implement the `stream_slices` method. 5. Update the `path` method to specify the date to pull exchange rates for. 6. Update the configured catalog to use `incremental` sync when we're testing the stream. + +We'll describe what each of these methods do below. Before we begin, it may help to familiarize yourself with how incremental sync works in Airbyte by reading the [docs on incremental](https://docs.airbyte.io/architecture/connections/incremental-append). + +To keep things concise, we'll only show functions as we edit them one by one. + +Let's get the easy parts out of the way and pass the `start_date`: + +```python +def streams(self, config: Mapping[str, Any]) -> List[Stream]: + auth = NoAuth() + # Parse the date from a string into a datetime object + start_date = datetime.strptime(config['start_date'], '%Y-%m-%d') + return [ExchangeRates(authenticator=auth, base=config['base'], start_date=start_date)] +``` + +Let's also add this parameter to the constructor and declare the `cursor_field`: + +```python +from datetime import datetime, timedelta + + +class ExchangeRates(HttpStream): + url_base = "https://api.ratesapi.io/" + cursor_field = "date" + primary_key = "date" + + def __init__(self, base: str, start_date: datetime, **kwargs): + super().__init__() + self.base = base + self.start_date = start_date +``` + +Declaring the `cursor_field` informs the framework that this stream now supports incremental sync. The next time you run `python main_dev.py discover --config sample_files/config.json` you'll find that the `supported_sync_modes` field now also contains `incremental`. + +But we're not quite done with supporting incremental, we have to actually emit state! We'll structure our state object very simply: it will be a `dict` whose single key is `'date'` and value is the date of the last day we synced data from. For example, `{'date': '2021-04-26'}` indicates the connector previously read data up until April 26th and therefore shouldn't re-read anything before April 26th. 
+ +Let's do this by implementing the `get_updated_state` method inside the `ExchangeRates` class. + +```python + def get_updated_state(self, current_stream_state: MutableMapping[str, Any], latest_record: Mapping[str, Any]) -> Mapping[str, any]: + # This method is called once for each record returned from the API to compare the cursor field value in that record with the current state + # we then return an updated state object. If this is the first time we run a sync or no state was passed, current_stream_state will be None. + if current_stream_state is not None and 'date' in current_stream_state: + current_parsed_date = datetime.strptime(current_stream_state['date'], '%Y-%m-%d') + latest_record_date = datetime.strptime(latest_record['date'], '%Y-%m-%d') + return {'date': max(current_parsed_date, latest_record_date).strftime('%Y-%m-%d')} + else: + return {'date': self.start_date.strftime('%Y-%m-%d')} +``` + +This implementation compares the date from the latest record with the date in the current state and takes the maximum as the "new" state object. + +We'll implement the `stream_slices` method to return a list of the dates for which we should pull data based on the stream state if it exists: + +```python + def _chunk_date_range(self, start_date: datetime) -> List[Mapping[str, any]]: + """ + Returns a list of each day between the start date and now. + The return value is a list of dicts {'date': date_string}. + """ + dates = [] + while start_date < datetime.now(): + dates.append({'date': start_date.strftime('%Y-%m-%d')}) + start_date += timedelta(days=1) + return dates + + def stream_slices(self, sync_mode, cursor_field: List[str] = None, stream_state: Mapping[str, Any] = None) -> Iterable[ + Optional[Mapping[str, any]]]: + start_date = datetime.strptime(stream_state['date'], '%Y-%m-%d') if stream_state and 'date' in stream_state else self.start_date + return self._chunk_date_range(start_date) +``` + +Each slice will cause an HTTP request to be made to the API. We can then use the information present in the `stream_slice` parameter \(a single element from the list we constructed in `stream_slices` above\) to set other configurations for the outgoing request like `path` or `request_params`. For more info about stream slicing, see [the slicing docs](../../concepts/stream_slices.md). + +In order to pull data for a specific date, the Exchange Rates API requires that we pass the date as the path component of the URL. Let's override the `path` method to achieve this: + +```python +def path(self, stream_state: Mapping[str, Any] = None, stream_slice: Mapping[str, Any] = None, next_page_token: Mapping[str, Any] = None) -> str: + return stream_slice['date'] +``` + +With these changes, your implementation should look like the file [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-python-http-tutorial/source_python_http_tutorial/source.py). + +The last thing we need to do is change the `sync_mode` field in the `sample_files/configured_catalog.json` to `incremental`: + +```text +"sync_mode": "incremental", +``` + +We should now have a working implementation of incremental sync! + +Let's try it out: + +```text +python main.py read --config sample_files/config.json --catalog sample_files/configured_catalog.json +``` + +You should see a bunch of `RECORD` messages and `STATE` messages. 
To verify that incremental sync is working, pass the input state back to the connector and run it again: + +```text +# Save the latest state to sample_files/state.json +python main.py read --config sample_files/config.json --catalog sample_files/configured_catalog.json | grep STATE | tail -n 1 | jq .state.data > sample_files/state.json + +# Run a read operation with the latest state message +python main.py read --config sample_files/config.json --catalog sample_files/configured_catalog.json --state sample_files/state.json +``` + +You should see that only the record from the last date is being synced! This is acceptable behavior, since Airbyte requires at-least-once delivery of records, so repeating the last record twice is OK. + +With that, we've implemented incremental sync for our connector! + diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/7-use-connector-in-airbyte.md b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/7-use-connector-in-airbyte.md new file mode 100644 index 0000000000000..691b4892a9169 --- /dev/null +++ b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/7-use-connector-in-airbyte.md @@ -0,0 +1,6 @@ +# Step 7: Use the Connector in Airbyte + +To use your connector in your own installation of Airbyte, build the docker image for your container by running `docker build . -t airbyte/source-python-http-example:dev`. Then, follow the instructions from the [building a toy source tutorial](https://docs.airbyte.io/tutorials/tutorials/toy-connector#use-the-connector-in-the-airbyte-ui) for using the connector in the Airbyte UI, replacing the name as appropriate. + +Note: your built docker image must be accessible to the `docker` daemon running on the Airbyte node. If you're doing this tutorial locally, these instructions are sufficient. Otherwise you may need to push your Docker image to Dockerhub. + diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/8-test-your-connector.md b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/8-test-your-connector.md new file mode 100644 index 0000000000000..d2918e72985c1 --- /dev/null +++ b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/8-test-your-connector.md @@ -0,0 +1,18 @@ +# Step 8: Test Connector + +## Unit Tests + +Add any relevant unit tests to the `unit_tests` directory. Unit tests should **not** depend on any secrets. + +You can run the tests using `python -m pytest -s unit_tests` + +## Integration Tests + +Place any integration tests in the `integration_tests` directory such that they can be [discovered by pytest](https://docs.pytest.org/en/reorganize-docs/new-docs/user/naming_conventions.html). + +## Standard Tests + +Standard tests are a fixed set of tests Airbyte provides that every Airbyte source connector must pass. While they're only required if you intend to submit your connector to Airbyte, you might find them helpful in any case. 
See [Testing your connectors](https://docs.airbyte.io/contributing-to-airbyte/building-new-connector/testing-connectors) + +If you want to submit this connector to become a default connector within Airbyte, follow steps 8 onwards from the [Python source checklist](https://docs.airbyte.io/tutorials/tutorials/building-a-python-source#step-8-set-up-standard-tests) + diff --git a/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/README.md b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/README.md new file mode 100644 index 0000000000000..466943b269ca7 --- /dev/null +++ b/docs/contributing-to-airbyte/python/tutorials/cdk-tutorial-python-http/README.md @@ -0,0 +1,2 @@ +# Creating an HTTP API Source + diff --git a/docs/deploying-airbyte/local-deployment.md b/docs/deploying-airbyte/local-deployment.md index 0c8eb52590c0a..05c23fc73727c 100644 --- a/docs/deploying-airbyte/local-deployment.md +++ b/docs/deploying-airbyte/local-deployment.md @@ -20,13 +20,10 @@ docker-compose up ## Deploy on Windows -We recommend following [this guide](https://docs.docker.com/docker-for-windows/install/) to install Docker on Windows. -After installing the WSL 2 backend and Docker you should be able to run containers using Windows PowerShell. -Additionally, as we note frequently, you will need `docker-compose` to build Airbyte from source. -The suggested guide already installs `docker-compose` on Windows. -Instead of cloning the repo, you can alternatively download the latest Airbyte release [here](https://github.com/airbytehq/airbyte/releases). -Unzip the downloaded file, access the unzipped file using PowerShell terminal, and run `docker-compose up`. -After this, you should see the Airbyte containers in the Docker application as in the image below. +We recommend following [this guide](https://docs.docker.com/docker-for-windows/install/) to install Docker on Windows. After installing the WSL 2 backend and Docker you should be able to run containers using Windows PowerShell. Additionally, as we note frequently, you will need `docker-compose` to build Airbyte from source. The suggested guide already installs `docker-compose` on Windows. + +Instead of cloning the repo, you can alternatively download the latest Airbyte release [here](https://github.com/airbytehq/airbyte/releases). Unzip the downloaded file, access the unzipped file using PowerShell terminal, and run `docker-compose up`. After this, you should see the Airbyte containers in the Docker application as in the image below. 
+ ![](../.gitbook/assets/airbyte_deploy_windows_docker.png) ## Troubleshooting diff --git a/docs/deploying-airbyte/on-azure-vm-cloud-shell.md b/docs/deploying-airbyte/on-azure-vm-cloud-shell.md index 0a16a9707bf0a..c4d667af54301 100644 --- a/docs/deploying-airbyte/on-azure-vm-cloud-shell.md +++ b/docs/deploying-airbyte/on-azure-vm-cloud-shell.md @@ -1,4 +1,4 @@ -# On Azure Virtual Machine - Cloud Shell +# On Azure\(VM\) {% hint style="info" %} The instructions have been tested on `Azure VM Linux (ubuntu 18.04)` @@ -12,7 +12,7 @@ Launch cloud shell by going to [https://shell.azure.com/bash](https://shell.azur ## Create a new virtual machine -#### Create resource group +### Create resource group ```bash # Inside Azure cloud shell @@ -20,9 +20,10 @@ rgName=airbyte-demo rgLocation=eastus az group create --name $rgName --location $rgLocation ``` + ![](../.gitbook/assets/azure_shell_create_rg.png) -#### Create virtual machine +### Create virtual machine ```bash # Inside Azure cloud shell @@ -37,26 +38,26 @@ publicIp=$(az vm create --resource-group $rgName \ echo $publicIp ``` -This step will create a virtual machine and add a user account named `byteuser`. The ``--generate-ssh-keys`` option will generate a new ssh key and put it to the default key location (~/.ssh) +This step will create a virtual machine and add a user account named `byteuser`. The `--generate-ssh-keys` option will generate a new ssh key and put it to the default key location \(~/.ssh\) -**Note: Copy the ``publicIpAddress`` output, you will need this address later to connect from your workstation.** +**Note: Copy the `publicIpAddress` output, you will need this address later to connect from your workstation.** ![](../.gitbook/assets/azure_shell_create_vm.png) -#### Download SSH key +### Download SSH key ```bash # Inside Azure cloud shell download ~/.ssh/id_rsa ``` -Above command will generate download link and give you pop-up on right bottom side, click on `Click here to download your file.` to download private key. -Note: Save this file, you will need it to connect to your VM in [Connect to Airbyte](#connect-to-airbyte) step. + +Above command will generate download link and give you pop-up on right bottom side, click on `Click here to download your file.` to download private key. Note: Save this file, you will need it to connect to your VM in [Connect to Airbyte](on-azure-vm-cloud-shell.md#connect-to-airbyte) step. 
![](../.gitbook/assets/azure_shell_download_ssh_key.png) -#### Connect to virtual machine +### Connect to virtual machine -- Connect to virtual machine +* Connect to virtual machine ```bash # Inside Azure cloud shell @@ -65,8 +66,8 @@ ssh $userName@$publicIp ## Install environment -- Install `docker` - +* Install `docker` + ```bash # Inside Azure cloud shell sudo apt-get update -y @@ -78,7 +79,7 @@ sudo apt-get install docker-ce docker-ce-cli -y sudo usermod -a -G docker $USER ``` -- Install `docker-compose` +* Install `docker-compose` ```bash # Inside Azure cloud shell @@ -86,13 +87,15 @@ sudo wget https://github.com/docker/compose/releases/download/1.26.2/docker-comp sudo chmod +x /usr/local/bin/docker-compose docker-compose --version ``` -- Close the ssh connection to ensure the group modification is taken into account + +* Close the ssh connection to ensure the group modification is taken into account ```bash # Inside Azure cloud shell logout ``` -- Reconnect to virtual machine + +* Reconnect to virtual machine ```bash # Inside Azure cloud shell @@ -118,18 +121,18 @@ For security reasons, we strongly recommend to not expose Airbyte on Internet av This part assumes that you have access to a terminal on your workstation {% endhint %} -- Create ssh tunnels for port 8000 (the static web server) and port 8001 (the api server) -```bash -# Inside your workstation terminal -# 1. Replace $SSH_KEY with private key path downloaded from earlier steps -# 2. Replace $INSTANCE_IP with publicIpAddress noted from earlier steps -ssh -i $SSH_KEY -L 8000:localhost:8000 -L 8001:localhost:8001 -N -f byteuser@$INSTANCE_IP -``` -- Just visit http://localhost:8000 in your browser and start moving some data! - +* Create ssh tunnels for port 8000 \(the static web server\) and port 8001 \(the api server\) + ```bash + # Inside your workstation terminal + # 1. Replace $SSH_KEY with private key path downloaded from earlier steps + # 2. Replace $INSTANCE_IP with publicIpAddress noted from earlier steps + ssh -i $SSH_KEY -L 8000:localhost:8000 -L 8001:localhost:8001 -N -f byteuser@$INSTANCE_IP + ``` +* Just visit [http://localhost:8000](http://localhost:8000) in your browser and start moving some data! ## Troubleshooting If you encounter any issues, just connect to our [Slack](https://slack.airbyte.io). Our community will help! We also have a [FAQ](../faq/technical-support.md) section in our docs for common problems. + diff --git a/docs/examples/zoom-activity-dashboard.md b/docs/examples/zoom-activity-dashboard.md index e46f9102c9f6a..bdabbca83736e 100644 --- a/docs/examples/zoom-activity-dashboard.md +++ b/docs/examples/zoom-activity-dashboard.md @@ -56,7 +56,7 @@ Once you are in, you need to click on the **Develop** dropdown and then click on Clicking on **Build App** for the first time will display a modal for you to accept the Zoom’s API license and terms of use. Do accept if you agree and you will be presented with the below screen. -![](../.gitbook/assets/zoom-marketplace-build-screen%20%281%29.png) +![](../.gitbook/assets/zoom-marketplace-build-screen%20%283%29%20%281%29.png) Select **JWT** as the app you want to build and click on the **Create** button on the card. You will be presented with a modal to enter the app name; type in `airbyte-zoom`. @@ -86,7 +86,7 @@ So let’s go back to the Airbyte web UI and provide it with the JWT token we co Now click on the **Set up source** button. You will see the below success message when the connection is made successfully. 
-![](../.gitbook/assets/setup-successful%20%282%29.png) +![](../.gitbook/assets/setup-successful%20%283%29%20%282%29.png) And you will be taken to the page to add your destination. @@ -106,7 +106,7 @@ This will spin a docker container and persist the data we will be replicating in Now, let’s supply the above credentials to the Airbyte UI requiring those credentials. -![](../.gitbook/assets/postgres_credentials.png) +![](../.gitbook/assets/postgres_credentials%20%283%29.png) Then click on the **Set up destination** button. @@ -114,17 +114,17 @@ After the connection has been made to your PostgreSQL database successfully, Air Leave all the fields checked. -![](../.gitbook/assets/schema.png) +![](../.gitbook/assets/schema%20%283%29.png) Select a **Sync frequency** of **manual** and then click on **Set up connection**. After successfully making the connection, you will see your PostgreSQL destination. Click on the Launch button to start the data replication. -![](../.gitbook/assets/launch%20%281%29.png) +![](../.gitbook/assets/launch%20%283%29%20%281%29.png) Then click on the **airbyte-zoom-destination** to see the Sync page. -![](../.gitbook/assets/sync-screen%20%283%29.png) +![](../.gitbook/assets/sync-screen%20%283%29%20%283%29.png) Syncing should take a few minutes or longer depending on the size of the data being replicated. Once Airbyte is done replicating the data, you will get a **succeeded** status. @@ -144,11 +144,11 @@ Go ahead and install Tableau on your machine. After the installation is complete Once your activation is successful, you will see your Tableau dashboard. -![](../.gitbook/assets/tableau-dashboard%20%281%29.png) +![](../.gitbook/assets/tableau-dashboard%20%283%29%20%281%29.png) On the sidebar menu under the **To a Server** section, click on the **More…** menu. You will see a list of datasource connectors you can connect Tableau with. -![](../.gitbook/assets/datasources%20%284%29.png) +![](../.gitbook/assets/datasources%20%284%29%20%284%29.png) Select **PostgreSQL** and you will be presented with a connection credentials modal. @@ -186,7 +186,7 @@ Next, drag **Created At** to **Columns**. Currently, we get the Created At in **YEAR**, but per our requirement we want them in Weeks, so right click on the **YEAR\(Created At\)** and choose **Week Number**. -![](../.gitbook/assets/change-to-per-week.png) +![](../.gitbook/assets/change-to-per-week%20%283%29.png) Tableau should now look like this: @@ -194,7 +194,7 @@ Tableau should now look like this: Now, to finish up, we need to add the **meetings\(Count\) measure** Tableau already calculated for us in the **Rows** section. So drag **meetings\(Count\)** onto the Columns section to complete the chart. -![](../.gitbook/assets/evolution-of-meetings-per-week%20%282%29.png) +![](../.gitbook/assets/evolution-of-meetings-per-week%20%283%29%20%282%29.png) And now we are done with the very first chart. Let's save the sheet and create a new Dashboard that we will add this sheet to as well as the others we will be creating. @@ -232,7 +232,7 @@ Then click on apply. Finally, drag the **Created At** fields \(make sure it’s To get this chart, we need to create a relationship between the **meetings table** and the `report_meeting_participants` table. You can do this by dragging the `report_meeting_participants` table in as a source alongside the **meetings** table and relate both via the **meeting id**. 
Then you will be able to create a new worksheet that looks like this: -![](../.gitbook/assets/meetings-participant-ranked.png) +![](../.gitbook/assets/meetings-participant-ranked%20%283%29.png) Note: To achieve the ranking, we simply use the sort menu icon on the top menu bar. @@ -246,7 +246,7 @@ The rest of the charts will be needing the **webinars** and `report_webinar_part For this chart, as for the meeting’s counterpart, we will get a calculated field off the Duration field to get the **Webinar Duration in Hours**, and then plot **Created At** against the **Sum of Webinar Duration in Hours**, as shown in the screenshot below. Note: Make sure you create a new sheet for each of these graphs. -![](../.gitbook/assets/duration-spent-in-weekly-webinars%20%283%29.png) +![](../.gitbook/assets/duration-spent-in-weekly-webinars%20%283%29%20%283%29.png) ### Evolution of the number of participants for all webinars per week diff --git a/docs/integrations/README.md b/docs/integrations/README.md index 81b05df535e73..5441e86621975 100644 --- a/docs/integrations/README.md +++ b/docs/integrations/README.md @@ -1,2 +1,2 @@ -# Connectors +# Connector Catalog diff --git a/docs/integrations/custom-connectors.md b/docs/integrations/custom-connectors.md index c0e5ea2e1c9df..1bfbb90b04fd6 100644 --- a/docs/integrations/custom-connectors.md +++ b/docs/integrations/custom-connectors.md @@ -36,8 +36,9 @@ Note that this new connector could just be an updated version of an existing con ## Upgrading a connector -To upgrade your connector version, go to the admin panel in the left hand side of the UI, find this connector in the list, and input the latest connector version. +To upgrade your connector version, go to the admin panel in the left hand side of the UI, find this connector in the list, and input the latest connector version. ![](../.gitbook/assets/upgrading_connector_admin_panel.png) -To browse the available connector versions, simply click on the relevant link in the `Image` column to navigate to the connector's DockerHub page. From there, simply click on the `Tags` section in the top bar. \ No newline at end of file +To browse the available connector versions, simply click on the relevant link in the `Image` column to navigate to the connector's DockerHub page. From there, simply click on the `Tags` section in the top bar. + diff --git a/docs/integrations/destinations/mysql.md b/docs/integrations/destinations/mysql.md index d129d2c00d0d2..a0206daf13116 100644 --- a/docs/integrations/destinations/mysql.md +++ b/docs/integrations/destinations/mysql.md @@ -46,9 +46,7 @@ MySQL doesn't differentiate between a database and schema. A database is essenti ### Setup the MySQL destination in Airbyte -Before setting up MySQL destination in Airbyte, you need to set the [local_infile](https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_local_infile) -system variable to true. You can do this by running the query `SET GLOBAL local_infile = true` with a user with [SYSTEM_VARIABLES_ADMIN](https://dev.mysql.com/doc/refman/8.0/en/privileges-provided.html#priv_system-variables-admin) permission. -This is required cause Airbyte uses `LOAD DATA LOCAL INFILE` to load data into table. +Before setting up MySQL destination in Airbyte, you need to set the [local\_infile](https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_local_infile) system variable to true. 
You can do this by running the query `SET GLOBAL local_infile = true` with a user with [SYSTEM\_VARIABLES\_ADMIN](https://dev.mysql.com/doc/refman/8.0/en/privileges-provided.html#priv_system-variables-admin) permission. This is required cause Airbyte uses `LOAD DATA LOCAL INFILE` to load data into table. You should now have all the requirements needed to configure MySQL as a destination in the UI. You'll need the following information to configure the MySQL destination: @@ -56,4 +54,5 @@ You should now have all the requirements needed to configure MySQL as a destinat * **Port** * **Username** * **Password** -* **Database** \ No newline at end of file +* **Database** + diff --git a/docs/integrations/destinations/snowflake.md b/docs/integrations/destinations/snowflake.md index 8ec210b6b42cc..e581380619593 100644 --- a/docs/integrations/destinations/snowflake.md +++ b/docs/integrations/destinations/snowflake.md @@ -144,17 +144,19 @@ When an identifier is double-quoted, it is stored and resolved exactly as entere Therefore, Airbyte Snowflake destination will create tables and schemas using the Unquoted identifiers when possible or fallback to Quoted Identifiers if the names are containing special characters. ## Cloud Storage Staging + By default, Airbyte uses batches of `INSERT` commands to add data to a temporary table before copying it over to the final table in Snowflake. This is too slow for larger/multi-GB replications. For those larger replications we recommend configuring using cloud storage to allow batch writes and loading. ### AWS S3 For AWS S3, you will need to create a bucket and provide credentials to access the bucket. We recommend creating a bucket that is only used for Airbyte to stage data to Snowflake. Airbyte needs read/write access to interact with this bucket. -### Google Cloud Storage (GCS) +### Google Cloud Storage \(GCS\) First you will need to create a GCS bucket. Then you will need to run the script below: + * You must run the script as the account admin for Snowflake. * You should replace `AIRBYTE_ROLE` with the role you used for Airbyte's Snowflake configuration. * Replace `YOURBUCKETNAME` with your bucket name @@ -162,7 +164,8 @@ Then you will need to run the script below: * `gcs_airbyte_integration` must be used The script: -``` + +```text create storage INTEGRATION gcs_airbyte_integration TYPE = EXTERNAL_STAGE STORAGE_PROVIDER = GCS @@ -179,6 +182,7 @@ GRANT USAGE ON stage gcs_airbyte_stage TO ROLE AIRBYTE_ROLE; DESC STORAGE INTEGRATION gcs_airbyte_integration; ``` -The final query should show a `STORAGE_GCP_SERVICE_ACCOUNT` property with an email as the property value. +The final query should show a `STORAGE_GCP_SERVICE_ACCOUNT` property with an email as the property value. Finally, you need to add read/write permissions to your bucket with that email. + diff --git a/docs/integrations/sources/clickhouse.md b/docs/integrations/sources/clickhouse.md index 0344a3d69ea3a..d0ee9531ddd38 100644 --- a/docs/integrations/sources/clickhouse.md +++ b/docs/integrations/sources/clickhouse.md @@ -53,4 +53,5 @@ GRANT SELECT ON .* TO 'airbyte'@'%'; You can limit this grant down to specific tables instead of the whole database. Note that to replicate data from multiple ClickHouse databases, you can re-run the command above to grant access to all the relevant schemas, but you'll need to set up multiple sources connecting to the same db on multiple schemas. -Your database user should now be ready for use with Airbyte. 
\ No newline at end of file +Your database user should now be ready for use with Airbyte. + diff --git a/docs/integrations/sources/facebook-marketing.md b/docs/integrations/sources/facebook-marketing.md index 8d10a198f4943..4d77f67d3e658 100644 --- a/docs/integrations/sources/facebook-marketing.md +++ b/docs/integrations/sources/facebook-marketing.md @@ -50,7 +50,7 @@ For more information, see the [Facebook Insights API documentation. ](https://de Facebook heavily throttles API tokens generated from Facebook Apps by default, making it infeasible to use such a token for syncs with Airbyte. To be able to use this connector without your syncs taking days due to rate limiting follow the instructions in the Setup Guide below to access better rate limits. {% endhint %} -See Facebook's [documentation on rate limiting](https://developers.facebook.com/docs/marketing-api/overview/authorization/#access-levels) for more information on requesting a quota upgrade. +See Facebook's [documentation on rate limiting](https://developers.facebook.com/docs/marketing-api/overview/authorization/#access-levels) for more information on requesting a quota upgrade. ## Getting started diff --git a/docs/integrations/sources/google-search-console.md b/docs/integrations/sources/google-search-console.md index a99de148648b3..08eb61225e3fb 100644 --- a/docs/integrations/sources/google-search-console.md +++ b/docs/integrations/sources/google-search-console.md @@ -4,7 +4,7 @@ The Google Search Console source supports both Full Refresh and Incremental syncs. You can choose if this connector will copy only the new or updated data, or all rows in the tables and columns you set up for replication, every time a sync is run. -This source wraps the [Singer Google Search Console Tap](https://github.com/singer-io/tap-google-search-console). +This source wraps the [Singer Google Search Console Tap](https://github.com/singer-io/tap-google-search-console). ### Output schema @@ -17,7 +17,7 @@ This Source is capable of syncing the following Streams: * [Performance report date](https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query) * [Performance report device](https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query) * [Performance report page](https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query) -* [Performance report query (keyword)](https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query) +* [Performance report query \(keyword\)](https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query) ### Data type mapping @@ -44,15 +44,17 @@ This connector attempts to back off gracefully when it hits Reports API's rate l ## Getting started ### Requirements + * Credentials to a Google Service Account with delegated Domain Wide Authority * Email address of the workspace admin which created the Service Account ### Create a Service Account with delegated domain wide authority -Follow the Google Documentation for performing [Delegating domain-wide authority](https://developers.google.com/identity/protocols/oauth2/service-account#delegatingauthority) to create a Service account with delegated domain wide authority. This account must be created by an administrator of the Google Workspace. 
-Please make sure to grant the following OAuth scopes to the service user: + +Follow the Google Documentation for performing [Delegating domain-wide authority](https://developers.google.com/identity/protocols/oauth2/service-account#delegatingauthority) to create a Service account with delegated domain wide authority. This account must be created by an administrator of the Google Workspace. Please make sure to grant the following OAuth scopes to the service user: 1. `https://www.googleapis.com/auth/webmasters.readonly` -At the end of this process, you should have JSON credentials to this Google Service Account. +At the end of this process, you should have JSON credentials to this Google Service Account. + +You should now be ready to use the Google Workspace Admin Reports API connector in Airbyte. -You should now be ready to use the Google Workspace Admin Reports API connector in Airbyte. \ No newline at end of file diff --git a/docs/integrations/sources/pokeapi.md b/docs/integrations/sources/pokeapi.md index 26fd98205bf02..98249678b5909 100644 --- a/docs/integrations/sources/pokeapi.md +++ b/docs/integrations/sources/pokeapi.md @@ -2,12 +2,14 @@ ## Overview -The PokéAPI source currently supports only Full Refresh syncs. +The PokéAPI source currently supports only Full Refresh syncs. This source uses the fully open [PokéAPI](https://pokeapi.co/docs/v2#info) to serve and retrieve information about Pokémon. This connector should be primarily used for educational purposes or for getting a trial source up and running without needing any dependencies. + ### Output schema Currently, only one output stream is available from this source, which is the Pokémon output stream. This schema is defined [here](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-pokeapi/schemas/pokemon.json). + ### Data type mapping The PokéAPI uses the same [JSONSchema](https://json-schema.org/understanding-json-schema/reference/index.html) types that Airbyte uses internally \(`string`, `date-time`, `object`, `array`, `boolean`, `integer`, and `number`\), so no type conversions happen as part of this source. @@ -24,8 +26,9 @@ The PokéAPI uses the same [JSONSchema](https://json-schema.org/understanding-js ### Performance considerations -According to the API's [fair use policy](https://pokeapi.co/docs/v2#fairuse), please make sure to cache resources retrieved from the PokéAPI wherever possible. +According to the API's [fair use policy](https://pokeapi.co/docs/v2#fairuse), please make sure to cache resources retrieved from the PokéAPI wherever possible. ## Dependencies/Requirements -- As this API is fully open and is not rate-limited, no authentication or rate-limiting is performed, so you can use this connector right out of the box without any further configuration. +* As this API is fully open and is not rate-limited, no authentication or rate-limiting is performed, so you can use this connector right out of the box without any further configuration. + diff --git a/docs/integrations/sources/postgres.md b/docs/integrations/sources/postgres.md index c3d925880bd3a..dd66f7cc3f81c 100644 --- a/docs/integrations/sources/postgres.md +++ b/docs/integrations/sources/postgres.md @@ -158,7 +158,7 @@ This slot **must** use `pgoutput`. #### Create publications and replication identities for tables -For each table you want to replicate with CDC, you should add the replication identity (the method of distinguishing between rows) first. 
diff --git a/docs/integrations/sources/postgres.md b/docs/integrations/sources/postgres.md
index c3d925880bd3a..dd66f7cc3f81c 100644
--- a/docs/integrations/sources/postgres.md
+++ b/docs/integrations/sources/postgres.md
@@ -158,7 +158,7 @@ This slot **must** use `pgoutput`.
 
 #### Create publications and replication identities for tables
 
-For each table you want to replicate with CDC, you should add the replication identity (the method of distinguishing between rows) first. We recommend using `ALTER TABLE tbl1 REPLICA IDENTITY DEFAULT;` to use primary keys to distinguish between rows. After setting the replication identity, you will need to run `CREATE PUBLICATION airbyte_publication FOR TABLE <tbl1, tbl2, tbl3>;`. This publication name is customizable. **You must add the replication identity before creating the publication. Otherwise, `ALTER`/`UPDATE`/`DELETE` statements may fail if Postgres cannot determine how to uniquely identify rows.** Please refer to the [Postgres docs](https://www.postgresql.org/docs/10/sql-alterpublication.html) if you need to add or remove tables from your publication in the future.
+For each table you want to replicate with CDC, you should add the replication identity \(the method of distinguishing between rows\) first. We recommend using `ALTER TABLE tbl1 REPLICA IDENTITY DEFAULT;` to use primary keys to distinguish between rows. After setting the replication identity, you will need to run `CREATE PUBLICATION airbyte_publication FOR TABLE <tbl1, tbl2, tbl3>;`. This publication name is customizable. **You must add the replication identity before creating the publication. Otherwise, `ALTER`/`UPDATE`/`DELETE` statements may fail if Postgres cannot determine how to uniquely identify rows.** Please refer to the [Postgres docs](https://www.postgresql.org/docs/10/sql-alterpublication.html) if you need to add or remove tables from your publication in the future.
 
 The UI currently allows selecting any tables for CDC. If a table is selected that is not part of the publication, it will not be replicated even though it is selected. If a table is part of the publication but does not have a replication identity, that replication identity will be created automatically on the first run if the Airbyte user has the necessary permissions.
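For reference, the two CDC preparation steps might look like this when scripted, in order; a sketch assuming the `psycopg2` package, with placeholder connection details and example table names (running the same statements in `psql` works equally well):

```python
import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(
    host="localhost", dbname="mydb", user="postgres", password="secret"
)
conn.autocommit = True

with conn.cursor() as cur:
    # 1. Set the replication identity first, so rows can be distinguished
    #    by primary key.
    for table in ("users", "orders"):  # example table names
        cur.execute(f"ALTER TABLE {table} REPLICA IDENTITY DEFAULT;")

    # 2. Then create the publication over the same tables; the publication
    #    name is customizable.
    cur.execute(
        "CREATE PUBLICATION airbyte_publication FOR TABLE users, orders;"
    )

conn.close()
```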
diff --git a/docs/integrations/sources/stripe.md b/docs/integrations/sources/stripe.md
index 15d2a1c3eb7a2..678cdba073dc0 100644
--- a/docs/integrations/sources/stripe.md
+++ b/docs/integrations/sources/stripe.md
@@ -8,24 +8,24 @@ The Stripe source supports both Full Refresh and Incremental syncs. You can choo
 
 This Source is capable of syncing the following core Streams:
 
-* [Balance Transactions](https://stripe.com/docs/api/balance_transactions/list) (Incremental)
+* [Balance Transactions](https://stripe.com/docs/api/balance_transactions/list) \(Incremental\)
 * [Bank accounts](https://stripe.com/docs/api/customer_bank_accounts/list)
-* [Charges](https://stripe.com/docs/api/charges/list) (Incremental)
-* [Coupons](https://stripe.com/docs/api/coupons/list) (Incremental)
-* [Customers](https://stripe.com/docs/api/customers/list) (Incremental)
-* [Customer Balance Transactions](https://stripe.com/docs/api/customer_balance_transactions/list) (Incremental)
-* [Disputes](https://stripe.com/docs/api/disputes/list) (Incremental)
-* [Events](https://stripe.com/docs/api/events/list) (Incremental)
-* [Invoices](https://stripe.com/docs/api/invoices/list) (Incremental)
-* [Invoice Items](https://stripe.com/docs/api/invoiceitems/list) (Incremental)
+* [Charges](https://stripe.com/docs/api/charges/list) \(Incremental\)
+* [Coupons](https://stripe.com/docs/api/coupons/list) \(Incremental\)
+* [Customers](https://stripe.com/docs/api/customers/list) \(Incremental\)
+* [Customer Balance Transactions](https://stripe.com/docs/api/customer_balance_transactions/list) \(Incremental\)
+* [Disputes](https://stripe.com/docs/api/disputes/list) \(Incremental\)
+* [Events](https://stripe.com/docs/api/events/list) \(Incremental\)
+* [Invoices](https://stripe.com/docs/api/invoices/list) \(Incremental\)
+* [Invoice Items](https://stripe.com/docs/api/invoiceitems/list) \(Incremental\)
 * [Invoice Line Items](https://stripe.com/docs/api/invoices/invoice_lines)
-* [Payouts](https://stripe.com/docs/api/payouts/list) (Incremental)
-* [Plans](https://stripe.com/docs/api/plans/list) (Incremental)
-* [Products](https://stripe.com/docs/api/products/list) (Incremental)
-* [Refunds](https://stripe.com/docs/api/refunds/list) (Incremental)
-* [Subscriptions](https://stripe.com/docs/api/subscriptions/list) (Incremental)
+* [Payouts](https://stripe.com/docs/api/payouts/list) \(Incremental\)
+* [Plans](https://stripe.com/docs/api/plans/list) \(Incremental\)
+* [Products](https://stripe.com/docs/api/products/list) \(Incremental\)
+* [Refunds](https://stripe.com/docs/api/refunds/list) \(Incremental\)
+* [Subscriptions](https://stripe.com/docs/api/subscriptions/list) \(Incremental\)
 * [Subscription Items](https://stripe.com/docs/api/subscription_items/list)
-* [Transfers](https://stripe.com/docs/api/transfers/list) (Incremental)
+* [Transfers](https://stripe.com/docs/api/transfers/list) \(Incremental\)
 
 ### Data type mapping
diff --git a/docs/integrations/sources/zendesk-chat.md b/docs/integrations/sources/zendesk-chat.md
index 32b2d2c89de76..b3a6ba8b073dd 100644
--- a/docs/integrations/sources/zendesk-chat.md
+++ b/docs/integrations/sources/zendesk-chat.md
@@ -11,12 +11,12 @@ This source can sync data for the [Zendesk Chat API](https://developer.zendesk.c
 This Source is capable of syncing the following core Streams:
 
 * [Accounts](https://developer.zendesk.com/rest_api/docs/chat/accounts#show-account)
-* [Agents](https://developer.zendesk.com/rest_api/docs/chat/agents#list-agents) (Incremental)
-* [Agent Timelines](https://developer.zendesk.com/rest_api/docs/chat/incremental_export#incremental-agent-timeline-export) (Incremental)
+* [Agents](https://developer.zendesk.com/rest_api/docs/chat/agents#list-agents) \(Incremental\)
+* [Agent Timelines](https://developer.zendesk.com/rest_api/docs/chat/incremental_export#incremental-agent-timeline-export) \(Incremental\)
 * [Chats](https://developer.zendesk.com/rest_api/docs/chat/chats#list-chats)
 * [Shortcuts](https://developer.zendesk.com/rest_api/docs/chat/shortcuts#list-shortcuts)
 * [Triggers](https://developer.zendesk.com/rest_api/docs/chat/triggers#list-triggers)
-* [Bans](https://developer.zendesk.com/rest_api/docs/chat/bans#list-bans) (Incremental)
+* [Bans](https://developer.zendesk.com/rest_api/docs/chat/bans#list-bans) \(Incremental\)
 * [Departments](https://developer.zendesk.com/rest_api/docs/chat/departments#list-departments)
 * [Goals](https://developer.zendesk.com/rest_api/docs/chat/goals#list-goals)
 * [Skills](https://developer.zendesk.com/rest_api/docs/chat/skills#list-skills)
@@ -38,7 +38,7 @@ This Source is capable of syncing the following core Streams:
 | :--- | :--- | :--- |
 | Full Refresh Sync | Yes |  |
 | Incremental Sync | Yes |  |
-| SSL connection | Yes |  |
+| SSL connection | Yes |  | 
 
 ### Performance considerations
@@ -57,3 +57,4 @@ The Zendesk connector should not run into Zendesk API limitations under normal u
 Generate an Access Token as described in the [Zendesk Chat docs](https://developer.zendesk.com/rest_api/docs/chat/auth).
 
 We recommend creating a restricted, read-only key specifically for Airbyte access. This will allow you to control which resources Airbyte should be able to access.
+
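To confirm a newly generated token works, and is scoped the way you expect, before configuring the source, a quick read-only call is enough; a sketch assuming the `requests` package, with a placeholder token:

```python
import requests

ACCESS_TOKEN = "your-zendesk-chat-access-token"  # placeholder

response = requests.get(
    "https://www.zopim.com/api/v2/chats",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
response.raise_for_status()

# A restricted, read-only token should be able to list chats but nothing more.
print(len(response.json().get("chats", [])), "chats visible to this token")
```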
 ## Airbyte Core Releases
diff --git a/docs/project-overview/changelog/connectors.md b/docs/project-overview/changelog/connectors.md
index caf0c942bf712..34ae4de88f70c 100644
--- a/docs/project-overview/changelog/connectors.md
+++ b/docs/project-overview/changelog/connectors.md
@@ -16,28 +16,32 @@ Check out our [connector roadmap](https://github.com/airbytehq/airbyte/projects/
 
 1 new destination: [**MySQL**](https://docs.airbyte.io/integrations/destinations/mysql)
 
-2 new sources:
+2 new sources: 
+
 * [**Google Search Console**](https://docs.airbyte.io/integrations/sources/google-search-console)
-* [**PokeAPI**](https://docs.airbyte.io/integrations/sources/pokeapi) (talking about long tail and having fun ;))
+* [**PokeAPI**](https://docs.airbyte.io/integrations/sources/pokeapi) \(talking about the long tail and having fun ;\)\)
 
 Progress on connectors:
-* **Zoom**: bugfix on declaring correct types to match data coming from API ([#3159](https://github.com/airbytehq/airbyte/pull/3159)), thanks to [vovavovavovavova](https://github.com/vovavovavovavova)
-* **Smartsheets**: bugfix on gracefully handling empty cell values ([#3337](https://github.com/airbytehq/airbyte/pull/3337)), thanks to [Nathan Nowack](https://github.com/zzstoatzz)
-* **Stripe**: fix date property name, only add connected account header when set, and set primary key (#3210), thanks to [Nathan Yergler](https://github.com/nyergler)
+
+* **Zoom**: bugfix on declaring correct types to match data coming from the API \([\#3159](https://github.com/airbytehq/airbyte/pull/3159)\), thanks to [vovavovavovavova](https://github.com/vovavovavovavova)
+* **Smartsheets**: bugfix on gracefully handling empty cell values \([\#3337](https://github.com/airbytehq/airbyte/pull/3337)\), thanks to [Nathan Nowack](https://github.com/zzstoatzz)
+* **Stripe**: fix date property name, only add the connected account header when set, and set the primary key \(\#3210\), thanks to [Nathan Yergler](https://github.com/nyergler)
 
 ## 05/04/2021
 
-2 new sources:
+2 new sources: 
+
 * [**Smartsheets**](https://docs.airbyte.io/integrations/sources/smartsheets), thanks to [Nathan Nowack](https://github.com/zzstoatzz)
 * [**Zendesk Chat**](https://docs.airbyte.io/integrations/sources/zendesk-chat)
 
 Progress on connectors:
-* **Appstore**: bugfix private key handling in the UI ([#3201](https://github.com/airbytehq/airbyte/pull/3201))
-* **Facebook marketing**: Wait longer (5 min) for async jobs to start ([#3116](https://github.com/airbytehq/airbyte/pull/3116)), thanks to [Max Krog](https://github.com/MaxKrog)
-* **Stripe**: support reading data from connected accounts (#3121), and 2 new streams with Refunds & Bank Accounts ([#3030](https://github.com/airbytehq/airbyte/pull/3030)) ([#3086](https://github.com/airbytehq/airbyte/pull/3086))
-* **Redshift destination**: Ignore records that are too big (instead of failing) ([#2988](https://github.com/airbytehq/airbyte/pull/2988))
-* **MongoDB**: add supporting TLS and Replica Sets ([#3111](https://github.com/airbytehq/airbyte/pull/3111))
-* **HTTP sources**: bugfix on handling array responses gracefully ([#3008](https://github.com/airbytehq/airbyte/pull/3008))
+
+* **Appstore**: bugfix for private key handling in the UI \([\#3201](https://github.com/airbytehq/airbyte/pull/3201)\)
+* **Facebook marketing**: wait longer \(5 min\) for async jobs to start \([\#3116](https://github.com/airbytehq/airbyte/pull/3116)\), thanks to [Max Krog](https://github.com/MaxKrog)
+* **Stripe**: support reading data from connected accounts \(\#3121\), and 2 new streams with Refunds & Bank Accounts \([\#3030](https://github.com/airbytehq/airbyte/pull/3030)\) \([\#3086](https://github.com/airbytehq/airbyte/pull/3086)\)
+* **Redshift destination**: ignore records that are too big \(instead of failing\) \([\#2988](https://github.com/airbytehq/airbyte/pull/2988)\)
+* **MongoDB**: add support for TLS and Replica Sets \([\#3111](https://github.com/airbytehq/airbyte/pull/3111)\)
+* **HTTP sources**: bugfix on handling array responses gracefully \([\#3008](https://github.com/airbytehq/airbyte/pull/3008)\)
 
 ## 04/27/2021
diff --git a/docs/project-overview/changelog/platform.md b/docs/project-overview/changelog/platform.md
index d1d0dcf45e630..e888ece14138a 100644
--- a/docs/project-overview/changelog/platform.md
+++ b/docs/project-overview/changelog/platform.md
@@ -13,10 +13,9 @@ If you're interested in our progress on the Airbyte platform, please read below!
 ## [05-11-2021 - 0.22.3](https://github.com/airbytehq/airbyte/releases/tag/v0.22.3-alpha)
 
 * Bump K8s deployment version to latest stable version, thanks to [Coetzee van Staden](https://github.com/coetzeevs)
-* Added tutorial to deploy Airbyte on Azure VM ([#3171](https://github.com/airbytehq/airbyte/pull/3171)), thanks to [geekwhocodes](https://github.com/geekwhocodes)
+* Added tutorial to deploy Airbyte on an Azure VM \([\#3171](https://github.com/airbytehq/airbyte/pull/3171)\), thanks to [geekwhocodes](https://github.com/geekwhocodes)
 * Progress on checkpointing to support rate limits better
-* Upgrade normalization to use dbt from docker images ([#3186](https://github.com/airbytehq/airbyte/pull/3186))
-
+* Upgrade normalization to use dbt from Docker images \([\#3186](https://github.com/airbytehq/airbyte/pull/3186)\)
 
 ## [05-04-2021 - 0.22.2](https://github.com/airbytehq/airbyte/releases/tag/v0.22.2-alpha)