PyAirbyte
PyAirbyte brings the power of Airbyte to every Python developer. PyAirbyte provides a set of utilities to use Airbyte connectors in Python. It is meant to be used in situations where setting up an Airbyte server or cloud account is not possible or desirable.
Getting Started
Watch this Getting Started Loom video or run one of our Quickstart tutorials below to see how you can use PyAirbyte in your Python code.
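If you prefer to start from code, the minimal sketch below shows the typical flow: install a source, select streams, read into the default local cache, and pull results into pandas. The source-faker connector and its count option are used purely as an illustrative example.

import airbyte as ab

# Install (if needed) and configure a source connector.
source = ab.get_source(
    "source-faker",
    config={"count": 1_000},  # example config; each connector defines its own options
    install_if_missing=True,
)

# Verify the connection and config, then select which streams to read.
source.check()
source.select_all_streams()

# Read into the default local DuckDB cache and inspect each stream.
result = source.read()
for stream_name, dataset in result.items():
    print(stream_name, len(dataset.to_pandas()))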
Secrets Management
PyAirbyte can auto-import secrets from the following sources:
- Environment variables.
- Variables defined in a local .env ("Dotenv") file.
- Google Colab secrets.
- Manual entry via getpass.
_Note: You can also build your own secret manager by subclassing the CustomSecretManager implementation. For more information, see the airbyte.secrets.CustomSecretManager class definition._
Retrieving Secrets
import airbyte as ab

source = ab.get_source("source-github")
source.set_config(
    {
        "credentials": {
            "personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN"),
        }
    }
)
By default, PyAirbyte will search all available secrets sources. The get_secret() function also accepts an optional sources argument of specific source names (SecretSourceEnum) and/or secret manager objects to check.
By default, PyAirbyte will prompt the user for any requested secrets that are not provided via other secret managers. You can disable this prompt by passing allow_prompt=False to get_secret().
For more information, see the airbyte.secrets module.
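For example, the following sketch restricts the lookup to environment variables and a local .env file, and disables the interactive prompt:

import airbyte as ab
from airbyte.secrets import SecretSourceEnum

# Only check environment variables and the local .env file; never prompt.
token = ab.get_secret(
    "GITHUB_PERSONAL_ACCESS_TOKEN",
    sources=[SecretSourceEnum.ENV, SecretSourceEnum.DOTENV],
    allow_prompt=False,
)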
Secrets Auto-Discovery
If you have a secret matching an expected name, PyAirbyte will automatically use it. For example, if you have a secret named GITHUB_PERSONAL_ACCESS_TOKEN, PyAirbyte will automatically use it when configuring the GitHub source.
The naming convention for secrets is {CONNECTOR_NAME}_{PROPERTY_NAME}, for instance SNOWFLAKE_PASSWORD and BIGQUERY_CREDENTIALS_PATH.
PyAirbyte will also auto-discover secrets for interop with hosted Airbyte: AIRBYTE_CLOUD_API_URL, AIRBYTE_CLOUD_API_KEY, etc.
Connector compatibility
To make a connector compatible with PyAirbyte, the following requirements must be met:
- The connector must be a Python package, with a pyproject.toml or a setup.py file.
- In the package, there must be a run.py file that contains a run method. This method should read arguments from the command line, and run the connector with them, outputting messages to stdout.
- The pyproject.toml or setup.py file must specify a command line entry point for the run method called source-<connector name>. This is usually done by adding a console_scripts section to the pyproject.toml file, or an entry_points section to the setup.py file. For example:
[tool.poetry.scripts]
source-my-connector = "my_connector.run:run"
setup(
...
entry_points={
'console_scripts': [
'source-my-connector = my_connector.run:run',
],
},
...
)
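For reference, a run.py for a CDK-based source typically looks roughly like the following. The MySourceConnector class name and module layout are illustrative assumptions; only the run entry point and the Airbyte messages written to stdout are required by PyAirbyte.

# run.py (illustrative sketch for a CDK-based source)
import sys

from airbyte_cdk.entrypoint import launch

from .source import MySourceConnector  # hypothetical connector class


def run() -> None:
    # Parse the CLI args (spec/check/discover/read) and emit Airbyte messages to stdout.
    source = MySourceConnector()
    launch(source, sys.argv[1:])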
To publish a connector to PyPI, specify the pypi
section in the metadata.yaml
file. For example:
data:
# ...
remoteRegistries:
pypi:
enabled: true
packageName: "airbyte-source-my-connector"
Validating source connectors
To validate a source connector for compliance, the airbyte-lib-validate-source
script can be used. It can be used like this:
airbyte-lib-validate-source --connector-dir . --sample-config secrets/config.json
The script will install the python package in the provided directory, and run the connector against the provided config. The config should be a valid JSON file, with the same structure as the one that would be provided to the connector in Airbyte. The script will exit with a non-zero exit code if the connector fails to run.
For a more lightweight check, the --validate-install-only
flag can be used. This will only check that the connector can be installed and returns a spec; no sample config is required.
Contributing
To learn how you can contribute to PyAirbyte, please see our PyAirbyte Contributors Guide.
Frequently Asked Questions
1. Does PyAirbyte replace Airbyte? No.
2. What is the PyAirbyte cache? Is it a destination? Yes, you can think of it as a built-in destination implementation, but we avoid the word "destination" in our docs to prevent confusion with our certified destinations list here.
3. Does PyAirbyte work with data orchestration frameworks like Airflow, Dagster, and Snowpark? Yes, it should. Please give it a try and report any problems you see. Also, drop us a note if it works for you!
4. Can I use PyAirbyte to develop or test when developing Airbyte sources? Yes, you can, but only for Python-based sources.
5. Can I develop traditional ETL pipelines with PyAirbyte? Yes. Just pick the cache type matching the destination - like SnowflakeCache for landing data in Snowflake. (See the sketch after this list.)
6. Can PyAirbyte import a connector from a local directory that has Python project files, or does it have to be installed via pip? Yes, PyAirbyte can use any local install that has a CLI - and will automatically find connectors by name if they are on PATH.
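As a sketch of item 5 above, the example below lands data in a local DuckDB file; a warehouse cache such as SnowflakeCache follows the same pattern with its own connection parameters. The source and its config are illustrative.

import airbyte as ab

# Pick the cache type that matches where the data should land (a DuckDB file here).
cache = ab.DuckDBCache(db_path="warehouse/analytics.duckdb")

source = ab.get_source("source-faker", config={"count": 1_000})
source.select_all_streams()
source.read(cache=cache)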
Changelog and Release Notes
For a version history and list of all changes, please see our GitHub Releases page.
API Reference
# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
"""PyAirbyte brings Airbyte ELT to every Python developer.

.. include:: ../README.md

## API Reference

"""
from __future__ import annotations

from airbyte import caches, cloud, datasets, documents, exceptions, results, secrets, sources
from airbyte.caches.bigquery import BigQueryCache
from airbyte.caches.duckdb import DuckDBCache
from airbyte.caches.util import get_default_cache, new_local_cache
from airbyte.datasets import CachedDataset
from airbyte.records import StreamRecord
from airbyte.results import ReadResult
from airbyte.secrets import SecretSourceEnum, get_secret
from airbyte.sources import registry
from airbyte.sources.base import Source
from airbyte.sources.registry import get_available_connectors
from airbyte.sources.util import get_source


__all__ = [
    # Modules
    "cloud",
    "caches",
    "datasets",
    "documents",
    "exceptions",
    "records",
    "registry",
    "results",
    "secrets",
    "sources",
    # Factories
    "get_available_connectors",
    "get_default_cache",
    "get_secret",
    "get_source",
    "new_local_cache",
    # Classes
    "BigQueryCache",
    "CachedDataset",
    "DuckDBCache",
    "ReadResult",
    "SecretSourceEnum",
    "Source",
    "StreamRecord",
]

__docformat__ = "google"
def get_available_connectors() -> list[str]:
    """Return a list of all available connectors.

    Connectors will be returned in alphabetical order, with the standard prefix "source-".
    """
    return sorted(
        conn.name for conn in _get_registry_cache().values() if conn.pypi_package_name is not None
    )
Return a list of all available connectors.
Connectors will be returned in alphabetical order, with the standard prefix "source-".
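For example, to check whether a given connector name is available:

import airbyte as ab

connectors = ab.get_available_connectors()
print(f"{len(connectors)} connectors available")
print("source-github" in connectors)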
def get_default_cache() -> DuckDBCache:
    """Get a local cache for storing data, using the default database path.

    Cache files are stored in the `.cache` directory, relative to the current
    working directory.
    """
    cache_dir = Path("./.cache/default_cache")
    return DuckDBCache(
        db_path=cache_dir / "default_cache.duckdb",
        cache_dir=cache_dir,
    )
Get a local cache for storing data, using the default database path.
Cache files are stored in the .cache directory, relative to the current working directory.
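A short usage sketch; this default cache is also what read() uses when no cache is passed explicitly:

import airbyte as ab

cache = ab.get_default_cache()
print(cache.db_path)  # ./.cache/default_cache/default_cache.duckdb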
15def get_secret( 16 secret_name: str, 17 /, 18 *, 19 sources: list[SecretManager | SecretSourceEnum] | None = None, 20 allow_prompt: bool = True, 21 **kwargs: dict[str, Any], 22) -> SecretString: 23 """Get a secret from the environment. 24 25 The optional `sources` argument of enum type `SecretSourceEnum` or list of `SecretSourceEnum` 26 options. If left blank, all available sources will be checked. If a list of `SecretSourceEnum` 27 entries is passed, then the sources will be checked using the provided ordering. 28 29 If `allow_prompt` is `True` or if SecretSourceEnum.PROMPT is declared in the `source` arg, then 30 the user will be prompted to enter the secret if it is not found in any of the other sources. 31 """ 32 if "source" in kwargs: 33 warnings.warn( 34 message="The `source` argument is deprecated. Use the `sources` argument instead.", 35 category=DeprecationWarning, 36 stacklevel=2, 37 ) 38 sources = kwargs.pop("source") # type: ignore [assignment] 39 40 available_sources: dict[str, SecretManager] = {} 41 for available_source in _get_secret_sources(): 42 # Add available sources to the dict. Order matters. 43 available_sources[available_source.name] = available_source 44 45 if sources is None: 46 # If ANY is in the list, then we don't need to check any other sources. 47 # This is the default behavior. 48 sources = list(available_sources.values()) 49 50 elif not isinstance(sources, list): 51 sources = [sources] # type: ignore [unreachable] # This is a 'just in case' catch. 52 53 # Replace any SecretSourceEnum strings with the matching SecretManager object 54 for source in sources: 55 if isinstance(source, SecretSourceEnum): 56 if source not in available_sources: 57 raise exc.PyAirbyteInputError( 58 guidance="Invalid secret source name.", 59 input_value=source, 60 context={ 61 "Available Sources": list(available_sources.keys()), 62 }, 63 ) 64 65 sources[sources.index(source)] = available_sources[source] 66 67 secret_managers = cast(list[SecretManager], sources) 68 69 if SecretSourceEnum.PROMPT in secret_managers: 70 prompt_source = secret_managers.pop( 71 # Mis-typed, but okay here since we have equality logic for the enum comparison: 72 secret_managers.index(SecretSourceEnum.PROMPT), # type: ignore [arg-type] 73 ) 74 75 if allow_prompt: 76 # Always check prompt last. Add it to the end of the list. 77 secret_managers.append(prompt_source) 78 79 for secret_mgr in secret_managers: 80 val = secret_mgr.get_secret(secret_name) 81 if val: 82 return SecretString(val) 83 84 raise exc.PyAirbyteSecretNotFoundError( 85 secret_name=secret_name, 86 sources=[str(s) for s in available_sources], 87 )
Get a secret from the environment.
The optional sources argument accepts an enum of type SecretSourceEnum or a list of SecretSourceEnum options. If left blank, all available sources will be checked. If a list of SecretSourceEnum entries is passed, then the sources will be checked using the provided ordering.
If allow_prompt is True or if SecretSourceEnum.PROMPT is declared in the source arg, then the user will be prompted to enter the secret if it is not found in any of the other sources.
46def get_source( 47 name: str, 48 config: dict[str, Any] | None = None, 49 *, 50 streams: str | list[str] | None = None, 51 version: str | None = None, 52 pip_url: str | None = None, 53 local_executable: Path | str | None = None, 54 install_if_missing: bool = True, 55) -> Source: 56 """Get a connector by name and version. 57 58 Args: 59 name: connector name 60 config: connector config - if not provided, you need to set it later via the set_config 61 method. 62 streams: list of stream names to select for reading. If set to "*", all streams will be 63 selected. If not provided, you can set it later via the `select_streams()` or 64 `select_all_streams()` method. 65 version: connector version - if not provided, the currently installed version will be used. 66 If no version is installed, the latest available version will be used. The version can 67 also be set to "latest" to force the use of the latest available version. 68 pip_url: connector pip URL - if not provided, the pip url will be inferred from the 69 connector name. 70 local_executable: If set, the connector will be assumed to already be installed and will be 71 executed using this path or executable name. Otherwise, the connector will be installed 72 automatically in a virtual environment. 73 install_if_missing: Whether to install the connector if it is not available locally. This 74 parameter is ignored when local_executable is set. 75 """ 76 if local_executable: 77 if pip_url: 78 raise exc.PyAirbyteInputError( 79 message="Param 'pip_url' is not supported when 'local_executable' is set." 80 ) 81 if version: 82 raise exc.PyAirbyteInputError( 83 message="Param 'version' is not supported when 'local_executable' is set." 84 ) 85 86 if isinstance(local_executable, str): 87 if "/" in local_executable or "\\" in local_executable: 88 # Assume this is a path 89 local_executable = Path(local_executable).absolute() 90 else: 91 which_executable: str | None = None 92 which_executable = shutil.which(local_executable) 93 if not which_executable and sys.platform == "win32": 94 # Try with the .exe extension 95 local_executable = f"{local_executable}.exe" 96 which_executable = shutil.which(local_executable) 97 98 if which_executable is None: 99 raise exc.AirbyteConnectorExecutableNotFoundError( 100 connector_name=name, 101 context={ 102 "executable": local_executable, 103 "working_directory": Path.cwd().absolute(), 104 }, 105 ) from FileNotFoundError(local_executable) 106 local_executable = Path(which_executable).absolute() 107 108 print(f"Using local `{name}` executable: {local_executable!s}") 109 return Source( 110 name=name, 111 config=config, 112 streams=streams, 113 executor=PathExecutor( 114 name=name, 115 path=local_executable, 116 ), 117 ) 118 119 # else: we are installing a connector in a virtual environment: 120 121 metadata: ConnectorMetadata | None = None 122 try: 123 metadata = get_connector_metadata(name) 124 except exc.AirbyteConnectorNotRegisteredError as ex: 125 if not pip_url: 126 log_install_state(name, state=EventState.FAILED, exception=ex) 127 # We don't have a pip url or registry entry, so we can't install the connector 128 raise 129 130 try: 131 executor = VenvExecutor( 132 name=name, 133 metadata=metadata, 134 target_version=version, 135 pip_url=pip_url, 136 ) 137 if install_if_missing: 138 executor.ensure_installation() 139 140 return Source( 141 name=name, 142 config=config, 143 streams=streams, 144 executor=executor, 145 ) 146 except Exception as e: 147 log_install_state(name, state=EventState.FAILED, exception=e) 148 raise
Get a connector by name and version.
Arguments:
- name: connector name
- config: connector config - if not provided, you need to set it later via the set_config method.
- streams: list of stream names to select for reading. If set to "*", all streams will be selected. If not provided, you can set it later via the select_streams() or select_all_streams() method.
- version: connector version - if not provided, the currently installed version will be used. If no version is installed, the latest available version will be used. The version can also be set to "latest" to force the use of the latest available version.
- pip_url: connector pip URL - if not provided, the pip url will be inferred from the connector name.
- local_executable: If set, the connector will be assumed to already be installed and will be executed using this path or executable name. Otherwise, the connector will be installed automatically in a virtual environment.
- install_if_missing: Whether to install the connector if it is not available locally. This parameter is ignored when local_executable is set.
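Two hedged usage sketches, one pinning the version in a managed virtual environment and one pointing at an executable already on PATH:

import airbyte as ab

# Install the latest available version into a managed virtual environment.
source = ab.get_source("source-github", version="latest", streams="*")

# Or use an executable that is already installed and on PATH.
local_source = ab.get_source(
    "source-github",
    local_executable="source-github",  # resolved via PATH; a full path also works
)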
28def new_local_cache( 29 cache_name: str | None = None, 30 cache_dir: str | Path | None = None, 31 *, 32 cleanup: bool = True, 33) -> DuckDBCache: 34 """Get a local cache for storing data, using a name string to seed the path. 35 36 Args: 37 cache_name: Name to use for the cache. Defaults to None. 38 cache_dir: Root directory to store the cache in. Defaults to None. 39 cleanup: Whether to clean up temporary files. Defaults to True. 40 41 Cache files are stored in the `.cache` directory, relative to the current 42 working directory. 43 """ 44 if cache_name: 45 if " " in cache_name: 46 raise exc.PyAirbyteInputError( 47 message="Cache name cannot contain spaces.", 48 input_value=cache_name, 49 ) 50 51 if not cache_name.replace("_", "").isalnum(): 52 raise exc.PyAirbyteInputError( 53 message="Cache name can only contain alphanumeric characters and underscores.", 54 input_value=cache_name, 55 ) 56 57 cache_name = cache_name or str(ulid.ULID()) 58 cache_dir = cache_dir or Path(f"./.cache/{cache_name}") 59 if not isinstance(cache_dir, Path): 60 cache_dir = Path(cache_dir) 61 62 return DuckDBCache( 63 db_path=cache_dir / f"db_{cache_name}.duckdb", 64 cache_dir=cache_dir, 65 cleanup=cleanup, 66 )
Get a local cache for storing data, using a name string to seed the path.
Arguments:
- cache_name: Name to use for the cache. Defaults to None.
- cache_dir: Root directory to store the cache in. Defaults to None.
- cleanup: Whether to clean up temporary files. Defaults to True.
Cache files are stored in the .cache directory, relative to the current working directory.
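For instance, to keep data from separate experiments in separate DuckDB files (the cache name is illustrative):

import airbyte as ab

cache = ab.new_local_cache(cache_name="march_analysis", cleanup=False)
# Files land under ./.cache/march_analysis/db_march_analysis.duckdb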
37class BigQueryCache(CacheBase): 38 """The BigQuery cache implementation.""" 39 40 project_name: str 41 """The name of the project to use. In BigQuery, this is equivalent to the database name.""" 42 43 dataset_name: str = "airbyte_raw" 44 """The name of the dataset to use. In BigQuery, this is equivalent to the schema name.""" 45 46 credentials_path: Optional[str] = None 47 """The path to the credentials file to use. 48 If not passed, falls back to the default inferred from the environment.""" 49 50 _sql_processor_class: type[BigQuerySqlProcessor] = BigQuerySqlProcessor 51 52 @root_validator(pre=True) 53 @classmethod 54 def set_schema_name(cls, values: dict[str, Any]) -> dict[str, Any]: 55 dataset_name = values.get("dataset_name") 56 if dataset_name is None: 57 raise ValueError("dataset_name must be defined") # noqa: TRY003 58 values["schema_name"] = dataset_name 59 return values 60 61 @overrides 62 def get_database_name(self) -> str: 63 """Return the name of the database. For BigQuery, this is the project name.""" 64 return self.project_name 65 66 @overrides 67 def get_sql_alchemy_url(self) -> str: 68 """Return the SQLAlchemy URL to use.""" 69 url: URL = make_url(f"bigquery://{self.project_name!s}") 70 if self.credentials_path: 71 url = url.update_query_dict({"credentials_path": self.credentials_path}) 72 73 return str(url)
The BigQuery cache implementation.
The name of the project to use. In BigQuery, this is equivalent to the database name.
The name of the dataset to use. In BigQuery, this is equivalent to the schema name.
The path to the credentials file to use. If not passed, falls back to the default inferred from the environment.
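A minimal construction sketch; the project, dataset, and credentials path are placeholders:

import airbyte as ab

cache = ab.BigQueryCache(
    project_name="my-gcp-project",  # BigQuery project (acts as the database name)
    dataset_name="airbyte_raw",     # BigQuery dataset (acts as the schema name)
    credentials_path="secrets/bq-service-account.json",  # optional; falls back to the environment default
)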
    @root_validator(pre=True)
    @classmethod
    def set_schema_name(cls, values: dict[str, Any]) -> dict[str, Any]:
        dataset_name = values.get("dataset_name")
        if dataset_name is None:
            raise ValueError("dataset_name must be defined")  # noqa: TRY003
        values["schema_name"] = dataset_name
        return values
    @overrides
    def get_database_name(self) -> str:
        """Return the name of the database. For BigQuery, this is the project name."""
        return self.project_name
Return the name of the database. For BigQuery, this is the project name.
    @overrides
    def get_sql_alchemy_url(self) -> str:
        """Return the SQLAlchemy URL to use."""
        url: URL = make_url(f"bigquery://{self.project_name!s}")
        if self.credentials_path:
            url = url.update_query_dict({"credentials_path": self.credentials_path})

        return str(url)
Return the SQLAlchemy URL to use.
127class CachedDataset(SQLDataset): 128 """A dataset backed by a SQL table cache. 129 130 Because this dataset includes all records from the underlying table, we also expose the 131 underlying table as a SQLAlchemy Table object. 132 """ 133 134 def __init__( 135 self, 136 cache: CacheBase, 137 stream_name: str, 138 ) -> None: 139 """We construct the query statement by selecting all columns from the table. 140 141 This prevents the need to scan the table schema to construct the query statement. 142 """ 143 table_name = cache.processor.get_sql_table_name(stream_name) 144 schema_name = cache.schema_name 145 query = select("*").select_from(text(f"{schema_name}.{table_name}")) 146 super().__init__( 147 cache=cache, 148 stream_name=stream_name, 149 query_statement=query, 150 ) 151 152 @overrides 153 def to_pandas(self) -> DataFrame: 154 """Return the underlying dataset data as a pandas DataFrame.""" 155 return self._cache.processor.get_pandas_dataframe(self._stream_name) 156 157 def to_sql_table(self) -> Table: 158 """Return the underlying SQL table as a SQLAlchemy Table object.""" 159 return self._cache.processor.get_sql_table(self.stream_name) 160 161 def __eq__(self, value: object) -> bool: 162 """Return True if the value is a CachedDataset with the same cache and stream name. 163 164 In the case of CachedDataset objects, we can simply compare the cache and stream name. 165 166 Note that this equality check is only supported on CachedDataset objects and not for 167 the base SQLDataset implementation. This is because of the complexity and computational 168 cost of comparing two arbitrary SQL queries that could be bound to different variables, 169 as well as the chance that two queries can be syntactically equivalent without being 170 text-wise equivalent. 171 """ 172 if not isinstance(value, SQLDataset): 173 return False 174 175 if self._cache is not value._cache: 176 return False 177 178 if self._stream_name != value._stream_name: 179 return False 180 181 return True
A dataset backed by a SQL table cache.
Because this dataset includes all records from the underlying table, we also expose the underlying table as a SQLAlchemy Table object.
    def __init__(
        self,
        cache: CacheBase,
        stream_name: str,
    ) -> None:
        """We construct the query statement by selecting all columns from the table.

        This prevents the need to scan the table schema to construct the query statement.
        """
        table_name = cache.processor.get_sql_table_name(stream_name)
        schema_name = cache.schema_name
        query = select("*").select_from(text(f"{schema_name}.{table_name}"))
        super().__init__(
            cache=cache,
            stream_name=stream_name,
            query_statement=query,
        )
We construct the query statement by selecting all columns from the table.
This prevents the need to scan the table schema to construct the query statement.
    @overrides
    def to_pandas(self) -> DataFrame:
        """Return the underlying dataset data as a pandas DataFrame."""
        return self._cache.processor.get_pandas_dataframe(self._stream_name)
Return the underlying dataset data as a pandas DataFrame.
    def to_sql_table(self) -> Table:
        """Return the underlying SQL table as a SQLAlchemy Table object."""
        return self._cache.processor.get_sql_table(self.stream_name)
Return the underlying SQL table as a SQLAlchemy Table object.
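A usage sketch; the source, its config, and the "users" stream name are illustrative:

import airbyte as ab

source = ab.get_source("source-faker", config={"count": 100}, streams="*")
result = source.read()

dataset = result["users"]       # CachedDataset for the "users" stream
df = dataset.to_pandas()        # full table as a pandas DataFrame
table = dataset.to_sql_table()  # SQLAlchemy Table object for ad-hoc SQL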
Inherited Members
- airbyte.datasets._sql.SQLDataset
- stream_name
- with_filter
- airbyte.datasets._base.DatasetBase
- to_documents
class DuckDBCache(CacheBase):
    """A DuckDB cache."""

    db_path: Union[Path, str]
    """Normally db_path is a Path object.

    The database name will be inferred from the file name. For example, given a `db_path` of
    `/path/to/my/my_db.duckdb`, the database name is `my_db`.
    """

    schema_name: str = "main"
    """The name of the schema to write to. Defaults to "main"."""

    _sql_processor_class = DuckDBSqlProcessor

    @overrides
    def get_sql_alchemy_url(self) -> str:
        """Return the SQLAlchemy URL to use."""
        # return f"duckdb:///{self.db_path}?schema={self.schema_name}"
        return f"duckdb:///{self.db_path!s}"

    @overrides
    def get_database_name(self) -> str:
        """Return the name of the database."""
        if self.db_path == ":memory:":
            return "memory"

        # Split the path on the appropriate separator ("/" or "\")
        split_on: Literal["/", "\\"] = "\\" if "\\" in str(self.db_path) else "/"

        # Return the file name without the extension
        return str(self.db_path).split(sep=split_on)[-1].split(".")[0]
A DuckDB cache.
Normally db_path is a Path object.
The database name will be inferred from the file name. For example, given a db_path of /path/to/my/my_db.duckdb, the database name is my_db.
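For example:

import airbyte as ab

cache = ab.DuckDBCache(db_path="/path/to/my/my_db.duckdb", schema_name="main")
print(cache.get_database_name())  # "my_db"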
    @overrides
    def get_sql_alchemy_url(self) -> str:
        """Return the SQLAlchemy URL to use."""
        # return f"duckdb:///{self.db_path}?schema={self.schema_name}"
        return f"duckdb:///{self.db_path!s}"
Return the SQLAlchemy URL to use.
    @overrides
    def get_database_name(self) -> str:
        """Return the name of the database."""
        if self.db_path == ":memory:":
            return "memory"

        # Split the path on the appropriate separator ("/" or "\")
        split_on: Literal["/", "\\"] = "\\" if "\\" in str(self.db_path) else "/"

        # Return the file name without the extension
        return str(self.db_path).split(sep=split_on)[-1].split(".")[0]
Return the name of the database.
class ReadResult(Mapping[str, CachedDataset]):
    def __init__(
        self,
        processed_records: int,
        cache: CacheBase,
        processed_streams: list[str],
    ) -> None:
        self.processed_records = processed_records
        self._cache = cache
        self._processed_streams = processed_streams

    def __getitem__(self, stream: str) -> CachedDataset:
        if stream not in self._processed_streams:
            raise KeyError(stream)

        return CachedDataset(self._cache, stream)

    def __contains__(self, stream: object) -> bool:
        if not isinstance(stream, str):
            return False

        return stream in self._processed_streams

    def __iter__(self) -> Iterator[str]:
        return self._processed_streams.__iter__()

    def __len__(self) -> int:
        return len(self._processed_streams)

    def get_sql_engine(self) -> Engine:
        return self._cache.get_sql_engine()

    @property
    def streams(self) -> Mapping[str, CachedDataset]:
        return {
            stream_name: CachedDataset(self._cache, stream_name)
            for stream_name in self._processed_streams
        }

    @property
    def cache(self) -> CacheBase:
        return self._cache
A mapping of stream names to CachedDataset objects, returned by Source.read().
In addition to the standard Mapping interface, the result exposes the number of processed records, the underlying cache (via the cache property), and a SQLAlchemy engine for that cache (via get_sql_engine()).
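A usage sketch of the mapping interface; the source, its config, and the "users" stream name are illustrative:

import airbyte as ab

source = ab.get_source("source-faker", config={"count": 100}, streams="*")
result = source.read()

print(result.processed_records)  # number of records processed by this read
print(list(result))              # stream names available in the result

if "users" in result:
    df = result["users"].to_pandas()

engine = result.get_sql_engine()  # SQLAlchemy engine for the underlying cache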
class SecretSourceEnum(str, Enum):
    ENV = "env"
    DOTENV = "dotenv"
    GOOGLE_COLAB = "google_colab"
    GOOGLE_GSM = "google_gsm"  # Not enabled by default

    PROMPT = "prompt"
An enumeration of the secret sources PyAirbyte can check: environment variables, local .env files, Google Colab secrets, Google Secret Manager (not enabled by default), and interactive prompting.
59class Source: 60 """A class representing a source that can be called.""" 61 62 def __init__( 63 self, 64 executor: Executor, 65 name: str, 66 config: dict[str, Any] | None = None, 67 streams: str | list[str] | None = None, 68 *, 69 validate: bool = False, 70 ) -> None: 71 """Initialize the source. 72 73 If config is provided, it will be validated against the spec if validate is True. 74 """ 75 self.executor = executor 76 self.name = name 77 self._processed_records = 0 78 self._config_dict: dict[str, Any] | None = None 79 self._last_log_messages: list[str] = [] 80 self._discovered_catalog: AirbyteCatalog | None = None 81 self._spec: ConnectorSpecification | None = None 82 self._selected_stream_names: list[str] = [] 83 if config is not None: 84 self.set_config(config, validate=validate) 85 if streams is not None: 86 self.select_streams(streams) 87 88 self._deployed_api_root: str | None = None 89 self._deployed_workspace_id: str | None = None 90 self._deployed_source_id: str | None = None 91 self._deployed_connection_id: str | None = None 92 93 def set_streams(self, streams: list[str]) -> None: 94 """Deprecated. See select_streams().""" 95 warnings.warn( 96 "The 'set_streams' method is deprecated and will be removed in a future version. " 97 "Please use the 'select_streams' method instead.", 98 DeprecationWarning, 99 stacklevel=2, 100 ) 101 self.select_streams(streams) 102 103 def select_all_streams(self) -> None: 104 """Select all streams. 105 106 This is a more streamlined equivalent to: 107 > source.select_streams(source.get_available_streams()). 108 """ 109 self._selected_stream_names = self.get_available_streams() 110 111 def select_streams(self, streams: str | list[str]) -> None: 112 """Select the stream names that should be read from the connector. 113 114 Args: 115 - streams: A list of stream names to select. If set to "*", all streams will be selected. 116 117 Currently, if this is not set, all streams will be read. 118 """ 119 if streams == "*": 120 self.select_all_streams() 121 return 122 123 if isinstance(streams, str): 124 # If a single stream is provided, convert it to a one-item list 125 streams = [streams] 126 127 available_streams = self.get_available_streams() 128 for stream in streams: 129 if stream not in available_streams: 130 raise exc.AirbyteStreamNotFoundError( 131 stream_name=stream, 132 connector_name=self.name, 133 available_streams=available_streams, 134 ) 135 self._selected_stream_names = streams 136 137 def get_selected_streams(self) -> list[str]: 138 """Get the selected streams. 139 140 If no streams are selected, return an empty list. 141 """ 142 return self._selected_stream_names 143 144 def set_config( 145 self, 146 config: dict[str, Any], 147 *, 148 validate: bool = True, 149 ) -> None: 150 """Set the config for the connector. 151 152 If validate is True, raise an exception if the config fails validation. 153 154 If validate is False, validation will be deferred until check() or validate_config() 155 is called. 156 """ 157 if validate: 158 self.validate_config(config) 159 160 self._config_dict = config 161 162 def get_config(self) -> dict[str, Any]: 163 """Get the config for the connector.""" 164 return self._config 165 166 @property 167 def _config(self) -> dict[str, Any]: 168 if self._config_dict is None: 169 raise exc.AirbyteConnectorConfigurationMissingError( 170 guidance="Provide via get_source() or set_config()" 171 ) 172 return self._config_dict 173 174 def _discover(self) -> AirbyteCatalog: 175 """Call discover on the connector. 
176 177 This involves the following steps: 178 * Write the config to a temporary file 179 * execute the connector with discover --config <config_file> 180 * Listen to the messages and return the first AirbyteCatalog that comes along. 181 * Make sure the subprocess is killed when the function returns. 182 """ 183 with as_temp_files([self._config]) as [config_file]: 184 for msg in self._execute(["discover", "--config", config_file]): 185 if msg.type == Type.CATALOG and msg.catalog: 186 return msg.catalog 187 raise exc.AirbyteConnectorMissingCatalogError( 188 log_text=self._last_log_messages, 189 ) 190 191 def validate_config(self, config: dict[str, Any] | None = None) -> None: 192 """Validate the config against the spec. 193 194 If config is not provided, the already-set config will be validated. 195 """ 196 spec = self._get_spec(force_refresh=False) 197 config = self._config if config is None else config 198 try: 199 jsonschema.validate(config, spec.connectionSpecification) 200 log_config_validation_result( 201 name=self.name, 202 state=EventState.SUCCEEDED, 203 ) 204 except jsonschema.ValidationError as ex: 205 validation_ex = exc.AirbyteConnectorValidationFailedError( 206 message="The provided config is not valid.", 207 context={ 208 "error_message": ex.message, 209 "error_path": ex.path, 210 "error_instance": ex.instance, 211 "error_schema": ex.schema, 212 }, 213 ) 214 log_config_validation_result( 215 name=self.name, 216 state=EventState.FAILED, 217 exception=validation_ex, 218 ) 219 raise validation_ex from ex 220 221 def get_available_streams(self) -> list[str]: 222 """Get the available streams from the spec.""" 223 return [s.name for s in self.discovered_catalog.streams] 224 225 def _get_spec(self, *, force_refresh: bool = False) -> ConnectorSpecification: 226 """Call spec on the connector. 227 228 This involves the following steps: 229 * execute the connector with spec 230 * Listen to the messages and return the first AirbyteCatalog that comes along. 231 * Make sure the subprocess is killed when the function returns. 232 """ 233 if force_refresh or self._spec is None: 234 for msg in self._execute(["spec"]): 235 if msg.type == Type.SPEC and msg.spec: 236 self._spec = msg.spec 237 break 238 239 if self._spec: 240 return self._spec 241 242 raise exc.AirbyteConnectorMissingSpecError( 243 log_text=self._last_log_messages, 244 ) 245 246 @property 247 def config_spec(self) -> dict[str, Any]: 248 """Generate a configuration spec for this connector, as a JSON Schema definition. 249 250 This function generates a JSON Schema dictionary with configuration specs for the 251 current connector, as a dictionary. 252 253 Returns: 254 dict: The JSON Schema configuration spec as a dictionary. 255 """ 256 return self._get_spec(force_refresh=True).connectionSpecification 257 258 def print_config_spec( 259 self, 260 format: Literal["yaml", "json"] = "yaml", # noqa: A002 261 *, 262 output_file: Path | str | None = None, 263 ) -> None: 264 """Print the configuration spec for this connector. 265 266 Args: 267 - format: The format to print the spec in. Must be "yaml" or "json". 268 - output_file: Optional. If set, the spec will be written to the given file path. Otherwise, 269 it will be printed to the console. 270 """ 271 if format not in ["yaml", "json"]: 272 raise exc.PyAirbyteInputError( 273 message="Invalid format. 
Expected 'yaml' or 'json'", 274 input_value=format, 275 ) 276 if isinstance(output_file, str): 277 output_file = Path(output_file) 278 279 if format == "yaml": 280 content = yaml.dump(self.config_spec, indent=2) 281 elif format == "json": 282 content = json.dumps(self.config_spec, indent=2) 283 284 if output_file: 285 output_file.write_text(content) 286 return 287 288 syntax_highlighted = Syntax(content, format) 289 print(syntax_highlighted) 290 291 @property 292 def _yaml_spec(self) -> str: 293 """Get the spec as a yaml string. 294 295 For now, the primary use case is for writing and debugging a valid config for a source. 296 297 This is private for now because we probably want better polish before exposing this 298 as a stable interface. This will also get easier when we have docs links with this info 299 for each connector. 300 """ 301 spec_obj: ConnectorSpecification = self._get_spec() 302 spec_dict = spec_obj.dict(exclude_unset=True) 303 # convert to a yaml string 304 return yaml.dump(spec_dict) 305 306 @property 307 def docs_url(self) -> str: 308 """Get the URL to the connector's documentation.""" 309 # TODO: Replace with docs URL from metadata when available 310 return "https://docs.airbyte.com/integrations/sources/" + self.name.lower().replace( 311 "source-", "" 312 ) 313 314 @property 315 def discovered_catalog(self) -> AirbyteCatalog: 316 """Get the raw catalog for the given streams. 317 318 If the catalog is not yet known, we call discover to get it. 319 """ 320 if self._discovered_catalog is None: 321 self._discovered_catalog = self._discover() 322 323 return self._discovered_catalog 324 325 @property 326 def configured_catalog(self) -> ConfiguredAirbyteCatalog: 327 """Get the configured catalog for the given streams. 328 329 If the raw catalog is not yet known, we call discover to get it. 330 331 If no specific streams are selected, we return a catalog that syncs all available streams. 332 333 TODO: We should consider disabling by default the streams that the connector would 334 disable by default. (For instance, streams that require a premium license are sometimes 335 disabled by default within the connector.) 336 """ 337 # Ensure discovered catalog is cached before we start 338 _ = self.discovered_catalog 339 340 # Filter for selected streams if set, otherwise use all available streams: 341 streams_filter: list[str] = self._selected_stream_names or self.get_available_streams() 342 343 return ConfiguredAirbyteCatalog( 344 streams=[ 345 ConfiguredAirbyteStream( 346 stream=stream, 347 destination_sync_mode=DestinationSyncMode.overwrite, 348 primary_key=stream.source_defined_primary_key, 349 # TODO: The below assumes all sources can coalesce from incremental sync to 350 # full_table as needed. 
CDK supports this, so it might be safe: 351 sync_mode=SyncMode.incremental, 352 ) 353 for stream in self.discovered_catalog.streams 354 if stream.name in streams_filter 355 ], 356 ) 357 358 def get_stream_json_schema(self, stream_name: str) -> dict[str, Any]: 359 """Return the JSON Schema spec for the specified stream name.""" 360 catalog: AirbyteCatalog = self.discovered_catalog 361 found: list[AirbyteStream] = [ 362 stream for stream in catalog.streams if stream.name == stream_name 363 ] 364 365 if len(found) == 0: 366 raise exc.PyAirbyteInputError( 367 message="Stream name does not exist in catalog.", 368 input_value=stream_name, 369 ) 370 371 if len(found) > 1: 372 raise exc.PyAirbyteInternalError( 373 message="Duplicate streams found with the same name.", 374 context={ 375 "found_streams": found, 376 }, 377 ) 378 379 return found[0].json_schema 380 381 def get_records(self, stream: str) -> LazyDataset: 382 """Read a stream from the connector. 383 384 This involves the following steps: 385 * Call discover to get the catalog 386 * Generate a configured catalog that syncs the given stream in full_refresh mode 387 * Write the configured catalog and the config to a temporary file 388 * execute the connector with read --config <config_file> --catalog <catalog_file> 389 * Listen to the messages and return the first AirbyteRecordMessages that come along. 390 * Make sure the subprocess is killed when the function returns. 391 """ 392 discovered_catalog: AirbyteCatalog = self.discovered_catalog 393 configured_catalog = ConfiguredAirbyteCatalog( 394 streams=[ 395 ConfiguredAirbyteStream( 396 stream=s, 397 sync_mode=SyncMode.full_refresh, 398 destination_sync_mode=DestinationSyncMode.overwrite, 399 ) 400 for s in discovered_catalog.streams 401 if s.name == stream 402 ], 403 ) 404 if len(configured_catalog.streams) == 0: 405 raise exc.PyAirbyteInputError( 406 message="Requested stream does not exist.", 407 context={ 408 "stream": stream, 409 "available_streams": self.get_available_streams(), 410 "connector_name": self.name, 411 }, 412 ) from KeyError(stream) 413 414 configured_stream = configured_catalog.streams[0] 415 all_properties = cast( 416 list[str], list(configured_stream.stream.json_schema["properties"].keys()) 417 ) 418 419 def _with_logging(records: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]: 420 self._log_sync_start(cache=None) 421 yield from records 422 self._log_sync_success(cache=None) 423 424 iterator: Iterator[dict[str, Any]] = _with_logging( 425 records=( # Generator comprehension yields StreamRecord objects for each record 426 StreamRecord.from_record_message( 427 record_message=record.record, 428 expected_keys=all_properties, 429 prune_extra_fields=True, 430 ) 431 for record in self._read_with_catalog(configured_catalog) 432 if record.record 433 ) 434 ) 435 return LazyDataset( 436 iterator, 437 stream_metadata=configured_stream, 438 ) 439 440 def get_documents( 441 self, 442 stream: str, 443 title_property: str | None = None, 444 content_properties: list[str] | None = None, 445 metadata_properties: list[str] | None = None, 446 *, 447 render_metadata: bool = False, 448 ) -> Iterable[Document]: 449 """Read a stream from the connector and return the records as documents. 450 451 If metadata_properties is not set, all properties that are not content will be added to 452 the metadata. 453 454 If render_metadata is True, metadata will be rendered in the document, as well as the 455 the main content. 
456 """ 457 return self.get_records(stream).to_documents( 458 title_property=title_property, 459 content_properties=content_properties, 460 metadata_properties=metadata_properties, 461 render_metadata=render_metadata, 462 ) 463 464 def check(self) -> None: 465 """Call check on the connector. 466 467 This involves the following steps: 468 * Write the config to a temporary file 469 * execute the connector with check --config <config_file> 470 * Listen to the messages and return the first AirbyteCatalog that comes along. 471 * Make sure the subprocess is killed when the function returns. 472 """ 473 with as_temp_files([self._config]) as [config_file]: 474 try: 475 for msg in self._execute(["check", "--config", config_file]): 476 if msg.type == Type.CONNECTION_STATUS and msg.connectionStatus: 477 if msg.connectionStatus.status != Status.FAILED: 478 print(f"Connection check succeeded for `{self.name}`.") 479 log_source_check_result( 480 name=self.name, 481 state=EventState.SUCCEEDED, 482 ) 483 return 484 485 log_source_check_result( 486 name=self.name, 487 state=EventState.FAILED, 488 ) 489 raise exc.AirbyteConnectorCheckFailedError( 490 help_url=self.docs_url, 491 context={ 492 "failure_reason": msg.connectionStatus.message, 493 }, 494 ) 495 raise exc.AirbyteConnectorCheckFailedError(log_text=self._last_log_messages) 496 except exc.AirbyteConnectorReadError as ex: 497 raise exc.AirbyteConnectorCheckFailedError( 498 message="The connector failed to check the connection.", 499 log_text=ex.log_text, 500 ) from ex 501 502 def install(self) -> None: 503 """Install the connector if it is not yet installed.""" 504 self.executor.install() 505 print("For configuration instructions, see: \n" f"{self.docs_url}#reference\n") 506 507 def uninstall(self) -> None: 508 """Uninstall the connector if it is installed. 509 510 This only works if the use_local_install flag wasn't used and installation is managed by 511 PyAirbyte. 512 """ 513 self.executor.uninstall() 514 515 def _read_with_catalog( 516 self, 517 catalog: ConfiguredAirbyteCatalog, 518 state: list[AirbyteStateMessage] | None = None, 519 ) -> Iterator[AirbyteMessage]: 520 """Call read on the connector. 521 522 This involves the following steps: 523 * Write the config to a temporary file 524 * execute the connector with read --config <config_file> --catalog <catalog_file> 525 * Listen to the messages and return the AirbyteRecordMessages that come along. 526 * Send out telemetry on the performed sync (with information about which source was used and 527 the type of the cache) 528 """ 529 self._processed_records = 0 # Reset the counter before we start 530 with as_temp_files( 531 [ 532 self._config, 533 catalog.json(), 534 json.dumps(state) if state else "[]", 535 ] 536 ) as [ 537 config_file, 538 catalog_file, 539 state_file, 540 ]: 541 yield from self._tally_records( 542 self._execute( 543 [ 544 "read", 545 "--config", 546 config_file, 547 "--catalog", 548 catalog_file, 549 "--state", 550 state_file, 551 ], 552 ) 553 ) 554 555 def _add_to_logs(self, message: str) -> None: 556 self._last_log_messages.append(message) 557 self._last_log_messages = self._last_log_messages[-10:] 558 559 def _execute(self, args: list[str]) -> Iterator[AirbyteMessage]: 560 """Execute the connector with the given arguments. 561 562 This involves the following steps: 563 * Locate the right venv. 
It is called ".venv-<connector_name>" 564 * Spawn a subprocess with .venv-<connector_name>/bin/<connector-name> <args> 565 * Read the output line by line of the subprocess and serialize them AirbyteMessage objects. 566 Drop if not valid. 567 """ 568 # Fail early if the connector is not installed. 569 self.executor.ensure_installation(auto_fix=False) 570 571 try: 572 self._last_log_messages = [] 573 for line in self.executor.execute(args): 574 try: 575 message = AirbyteMessage.parse_raw(line) 576 if message.type is Type.RECORD: 577 self._processed_records += 1 578 if message.type == Type.LOG: 579 self._add_to_logs(message.log.message) 580 if message.type == Type.TRACE and message.trace.type == TraceType.ERROR: 581 self._add_to_logs(message.trace.error.message) 582 yield message 583 except Exception: 584 self._add_to_logs(line) 585 except Exception as e: 586 raise exc.AirbyteConnectorReadError( 587 log_text=self._last_log_messages, 588 ) from e 589 590 def _tally_records( 591 self, 592 messages: Iterable[AirbyteMessage], 593 ) -> Generator[AirbyteMessage, Any, None]: 594 """This method simply tallies the number of records processed and yields the messages.""" 595 self._processed_records = 0 # Reset the counter before we start 596 progress.reset(len(self._selected_stream_names or [])) 597 598 for message in messages: 599 yield message 600 progress.log_records_read(new_total_count=self._processed_records) 601 602 def _log_sync_start( 603 self, 604 *, 605 cache: CacheBase | None, 606 ) -> None: 607 """Log the start of a sync operation.""" 608 print(f"Started `{self.name}` read operation at {pendulum.now().format('HH:mm:ss')}...") 609 send_telemetry( 610 source=self, 611 cache=cache, 612 state=EventState.STARTED, 613 event_type=EventType.SYNC, 614 ) 615 616 def _log_sync_success( 617 self, 618 *, 619 cache: CacheBase | None, 620 ) -> None: 621 """Log the success of a sync operation.""" 622 print(f"Completed `{self.name}` read operation at {pendulum.now().format('HH:mm:ss')}.") 623 send_telemetry( 624 source=self, 625 cache=cache, 626 state=EventState.SUCCEEDED, 627 number_of_records=self._processed_records, 628 event_type=EventType.SYNC, 629 ) 630 631 def _log_sync_failure( 632 self, 633 *, 634 cache: CacheBase | None, 635 exception: Exception, 636 ) -> None: 637 """Log the failure of a sync operation.""" 638 print(f"Failed `{self.name}` read operation at {pendulum.now().format('HH:mm:ss')}.") 639 send_telemetry( 640 state=EventState.FAILED, 641 source=self, 642 cache=cache, 643 number_of_records=self._processed_records, 644 exception=exception, 645 event_type=EventType.SYNC, 646 ) 647 648 def read( 649 self, 650 cache: CacheBase | None = None, 651 *, 652 streams: str | list[str] | None = None, 653 write_strategy: str | WriteStrategy = WriteStrategy.AUTO, 654 force_full_refresh: bool = False, 655 skip_validation: bool = False, 656 ) -> ReadResult: 657 """Read from the connector and write to the cache. 658 659 Args: 660 cache: The cache to write to. If None, a default cache will be used. 661 write_strategy: The strategy to use when writing to the cache. If a string, it must be 662 one of "append", "upsert", "replace", or "auto". If a WriteStrategy, it must be one 663 of WriteStrategy.APPEND, WriteStrategy.UPSERT, WriteStrategy.REPLACE, or 664 WriteStrategy.AUTO. 665 streams: Optional if already set. A list of stream names to select for reading. If set 666 to "*", all streams will be selected. 667 force_full_refresh: If True, the source will operate in full refresh mode. 
Otherwise, 668 streams will be read in incremental mode if supported by the connector. This option 669 must be True when using the "replace" strategy. 670 """ 671 if write_strategy == WriteStrategy.REPLACE and not force_full_refresh: 672 warnings.warn( 673 message=( 674 "Using `REPLACE` strategy without also setting `full_refresh_mode=True` " 675 "could result in data loss. " 676 "To silence this warning, use the following: " 677 'warnings.filterwarnings("ignore", ' 678 'category="airbyte.warnings.PyAirbyteDataLossWarning")`' 679 ), 680 category=PyAirbyteDataLossWarning, 681 stacklevel=1, 682 ) 683 if cache is None: 684 cache = get_default_cache() 685 686 if isinstance(write_strategy, str): 687 try: 688 write_strategy = WriteStrategy(write_strategy) 689 except ValueError: 690 raise exc.PyAirbyteInputError( 691 message="Invalid strategy", 692 context={ 693 "write_strategy": write_strategy, 694 "available_strategies": [s.value for s in WriteStrategy], 695 }, 696 ) from None 697 698 if streams: 699 self.select_streams(streams) 700 701 if not self._selected_stream_names: 702 raise exc.PyAirbyteNoStreamsSelectedError( 703 connector_name=self.name, 704 available_streams=self.get_available_streams(), 705 ) 706 707 cache.processor.register_source( 708 source_name=self.name, 709 incoming_source_catalog=self.configured_catalog, 710 stream_names=set(self._selected_stream_names), 711 ) 712 713 state = ( 714 cache._get_state( # noqa: SLF001 # Private method until we have a public API for it. 715 source_name=self.name, 716 streams=self._selected_stream_names, 717 ) 718 if not force_full_refresh 719 else None 720 ) 721 if not skip_validation: 722 self.validate_config() 723 724 self._log_sync_start(cache=cache) 725 try: 726 cache.processor.process_airbyte_messages( 727 self._read_with_catalog( 728 catalog=self.configured_catalog, 729 state=state, 730 ), 731 write_strategy=write_strategy, 732 ) 733 except Exception as ex: 734 self._log_sync_failure(cache=cache, exception=ex) 735 raise exc.AirbyteConnectorFailedError( 736 log_text=self._last_log_messages, 737 ) from ex 738 739 self._log_sync_success(cache=cache) 740 return ReadResult( 741 processed_records=self._processed_records, 742 cache=cache, 743 processed_streams=[stream.stream.name for stream in self.configured_catalog.streams], 744 )
A class representing a source that can be called.
62 def __init__( 63 self, 64 executor: Executor, 65 name: str, 66 config: dict[str, Any] | None = None, 67 streams: str | list[str] | None = None, 68 *, 69 validate: bool = False, 70 ) -> None: 71 """Initialize the source. 72 73 If config is provided, it will be validated against the spec if validate is True. 74 """ 75 self.executor = executor 76 self.name = name 77 self._processed_records = 0 78 self._config_dict: dict[str, Any] | None = None 79 self._last_log_messages: list[str] = [] 80 self._discovered_catalog: AirbyteCatalog | None = None 81 self._spec: ConnectorSpecification | None = None 82 self._selected_stream_names: list[str] = [] 83 if config is not None: 84 self.set_config(config, validate=validate) 85 if streams is not None: 86 self.select_streams(streams) 87 88 self._deployed_api_root: str | None = None 89 self._deployed_workspace_id: str | None = None 90 self._deployed_source_id: str | None = None 91 self._deployed_connection_id: str | None = None
Initialize the source.
If config is provided, it will be validated against the spec if validate is True.
    def set_streams(self, streams: list[str]) -> None:
        """Deprecated. See select_streams()."""
        warnings.warn(
            "The 'set_streams' method is deprecated and will be removed in a future version. "
            "Please use the 'select_streams' method instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.select_streams(streams)
Deprecated. See select_streams().
    def select_all_streams(self) -> None:
        """Select all streams.

        This is a more streamlined equivalent to:
        > source.select_streams(source.get_available_streams()).
        """
        self._selected_stream_names = self.get_available_streams()
Select all streams.
This is a more streamlined equivalent to:
source.select_streams(source.get_available_streams()).
111 def select_streams(self, streams: str | list[str]) -> None: 112 """Select the stream names that should be read from the connector. 113 114 Args: 115 - streams: A list of stream names to select. If set to "*", all streams will be selected. 116 117 Currently, if this is not set, all streams will be read. 118 """ 119 if streams == "*": 120 self.select_all_streams() 121 return 122 123 if isinstance(streams, str): 124 # If a single stream is provided, convert it to a one-item list 125 streams = [streams] 126 127 available_streams = self.get_available_streams() 128 for stream in streams: 129 if stream not in available_streams: 130 raise exc.AirbyteStreamNotFoundError( 131 stream_name=stream, 132 connector_name=self.name, 133 available_streams=available_streams, 134 ) 135 self._selected_stream_names = streams
Select the stream names that should be read from the connector.
Args:
- streams: A list of stream names to select. If set to "*", all streams will be selected.
Currently, if this is not set, all streams will be read.
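A short usage sketch (the source and stream names are illustrative):

import airbyte as ab

source = ab.get_source("source-faker", config={"count": 100})
source.select_streams(["users", "purchases"])  # explicit subset
source.select_streams("*")                     # or select everything
print(source.get_selected_streams())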
    def get_selected_streams(self) -> list[str]:
        """Get the selected streams.

        If no streams are selected, return an empty list.
        """
        return self._selected_stream_names
Get the selected streams.
If no streams are selected, return an empty list.
144 def set_config( 145 self, 146 config: dict[str, Any], 147 *, 148 validate: bool = True, 149 ) -> None: 150 """Set the config for the connector. 151 152 If validate is True, raise an exception if the config fails validation. 153 154 If validate is False, validation will be deferred until check() or validate_config() 155 is called. 156 """ 157 if validate: 158 self.validate_config(config) 159 160 self._config_dict = config
Set the config for the connector.
If validate is True, raise an exception if the config fails validation.
If validate is False, validation will be deferred until check() or validate_config() is called.
    def get_config(self) -> dict[str, Any]:
        """Get the config for the connector."""
        return self._config
Get the config for the connector.
191 def validate_config(self, config: dict[str, Any] | None = None) -> None: 192 """Validate the config against the spec. 193 194 If config is not provided, the already-set config will be validated. 195 """ 196 spec = self._get_spec(force_refresh=False) 197 config = self._config if config is None else config 198 try: 199 jsonschema.validate(config, spec.connectionSpecification) 200 log_config_validation_result( 201 name=self.name, 202 state=EventState.SUCCEEDED, 203 ) 204 except jsonschema.ValidationError as ex: 205 validation_ex = exc.AirbyteConnectorValidationFailedError( 206 message="The provided config is not valid.", 207 context={ 208 "error_message": ex.message, 209 "error_path": ex.path, 210 "error_instance": ex.instance, 211 "error_schema": ex.schema, 212 }, 213 ) 214 log_config_validation_result( 215 name=self.name, 216 state=EventState.FAILED, 217 exception=validation_ex, 218 ) 219 raise validation_ex from ex
Validate the config against the spec.
If config is not provided, the already-set config will be validated.
    def get_available_streams(self) -> list[str]:
        """Get the available streams from the spec."""
        return [s.name for s in self.discovered_catalog.streams]
Get the available streams from the spec.
246 @property 247 def config_spec(self) -> dict[str, Any]: 248 """Generate a configuration spec for this connector, as a JSON Schema definition. 249 250 This function generates a JSON Schema dictionary with configuration specs for the 251 current connector, as a dictionary. 252 253 Returns: 254 dict: The JSON Schema configuration spec as a dictionary. 255 """ 256 return self._get_spec(force_refresh=True).connectionSpecification
Generate a configuration spec for this connector, as a JSON Schema definition.
This function generates a JSON Schema dictionary with configuration specs for the current connector, as a dictionary.
Returns:
dict: The JSON Schema configuration spec as a dictionary.
258 def print_config_spec( 259 self, 260 format: Literal["yaml", "json"] = "yaml", # noqa: A002 261 *, 262 output_file: Path | str | None = None, 263 ) -> None: 264 """Print the configuration spec for this connector. 265 266 Args: 267 - format: The format to print the spec in. Must be "yaml" or "json". 268 - output_file: Optional. If set, the spec will be written to the given file path. Otherwise, 269 it will be printed to the console. 270 """ 271 if format not in ["yaml", "json"]: 272 raise exc.PyAirbyteInputError( 273 message="Invalid format. Expected 'yaml' or 'json'", 274 input_value=format, 275 ) 276 if isinstance(output_file, str): 277 output_file = Path(output_file) 278 279 if format == "yaml": 280 content = yaml.dump(self.config_spec, indent=2) 281 elif format == "json": 282 content = json.dumps(self.config_spec, indent=2) 283 284 if output_file: 285 output_file.write_text(content) 286 return 287 288 syntax_highlighted = Syntax(content, format) 289 print(syntax_highlighted)
Print the configuration spec for this connector.
Args:
- format: The format to print the spec in. Must be "yaml" or "json".
- output_file: Optional. If set, the spec will be written to the given file path. Otherwise, it will be printed to the console.
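For example, to write the spec to a JSON file instead of printing it:

import airbyte as ab

source = ab.get_source("source-github")
source.print_config_spec(format="json", output_file="github_spec.json")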
    @property
    def docs_url(self) -> str:
        """Get the URL to the connector's documentation."""
        # TODO: Replace with docs URL from metadata when available
        return "https://docs.airbyte.com/integrations/sources/" + self.name.lower().replace(
            "source-", ""
        )
Get the URL to the connector's documentation.
@property
def discovered_catalog(self) -> AirbyteCatalog:
    """Get the raw catalog for the given streams.

    If the catalog is not yet known, we call discover to get it.
    """
    if self._discovered_catalog is None:
        self._discovered_catalog = self._discover()

    return self._discovered_catalog
Get the raw catalog for the given streams.
If the catalog is not yet known, we call discover to get it.
@property
def configured_catalog(self) -> ConfiguredAirbyteCatalog:
    """Get the configured catalog for the given streams.

    If the raw catalog is not yet known, we call discover to get it.

    If no specific streams are selected, we return a catalog that syncs all available streams.

    TODO: We should consider disabling by default the streams that the connector would
    disable by default. (For instance, streams that require a premium license are sometimes
    disabled by default within the connector.)
    """
    # Ensure discovered catalog is cached before we start
    _ = self.discovered_catalog

    # Filter for selected streams if set, otherwise use all available streams:
    streams_filter: list[str] = self._selected_stream_names or self.get_available_streams()

    return ConfiguredAirbyteCatalog(
        streams=[
            ConfiguredAirbyteStream(
                stream=stream,
                destination_sync_mode=DestinationSyncMode.overwrite,
                primary_key=stream.source_defined_primary_key,
                # TODO: The below assumes all sources can coalesce from incremental sync to
                # full_table as needed. CDK supports this, so it might be safe:
                sync_mode=SyncMode.incremental,
            )
            for stream in self.discovered_catalog.streams
            if stream.name in streams_filter
        ],
    )
Get the configured catalog for the given streams.
If the raw catalog is not yet known, we call discover to get it.
If no specific streams are selected, we return a catalog that syncs all available streams.
TODO: We should consider disabling by default the streams that the connector would disable by default. (For instance, streams that require a premium license are sometimes disabled by default within the connector.)
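The effect of stream selection on the configured catalog can be inspected directly. A sketch, where the stream name "users" is illustrative and select_streams is the selection method referenced by read() below:

source.select_streams(["users"])
for configured_stream in source.configured_catalog.streams:
    print(configured_stream.stream.name, configured_stream.sync_mode)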
def get_stream_json_schema(self, stream_name: str) -> dict[str, Any]:
    """Return the JSON Schema spec for the specified stream name."""
    catalog: AirbyteCatalog = self.discovered_catalog
    found: list[AirbyteStream] = [
        stream for stream in catalog.streams if stream.name == stream_name
    ]

    if len(found) == 0:
        raise exc.PyAirbyteInputError(
            message="Stream name does not exist in catalog.",
            input_value=stream_name,
        )

    if len(found) > 1:
        raise exc.PyAirbyteInternalError(
            message="Duplicate streams found with the same name.",
            context={
                "found_streams": found,
            },
        )

    return found[0].json_schema
Return the JSON Schema spec for the specified stream name.
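For example, a sketch where "users" is an illustrative stream name:

schema = source.get_stream_json_schema("users")
print(sorted(schema["properties"].keys()))  # column names declared by the stream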
def get_records(self, stream: str) -> LazyDataset:
    """Read a stream from the connector.

    This involves the following steps:
    * Call discover to get the catalog
    * Generate a configured catalog that syncs the given stream in full_refresh mode
    * Write the configured catalog and the config to a temporary file
    * execute the connector with read --config <config_file> --catalog <catalog_file>
    * Listen to the messages and return the first AirbyteRecordMessages that come along.
    * Make sure the subprocess is killed when the function returns.
    """
    discovered_catalog: AirbyteCatalog = self.discovered_catalog
    configured_catalog = ConfiguredAirbyteCatalog(
        streams=[
            ConfiguredAirbyteStream(
                stream=s,
                sync_mode=SyncMode.full_refresh,
                destination_sync_mode=DestinationSyncMode.overwrite,
            )
            for s in discovered_catalog.streams
            if s.name == stream
        ],
    )
    if len(configured_catalog.streams) == 0:
        raise exc.PyAirbyteInputError(
            message="Requested stream does not exist.",
            context={
                "stream": stream,
                "available_streams": self.get_available_streams(),
                "connector_name": self.name,
            },
        ) from KeyError(stream)

    configured_stream = configured_catalog.streams[0]
    all_properties = cast(
        list[str], list(configured_stream.stream.json_schema["properties"].keys())
    )

    def _with_logging(records: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
        self._log_sync_start(cache=None)
        yield from records
        self._log_sync_success(cache=None)

    iterator: Iterator[dict[str, Any]] = _with_logging(
        records=(  # Generator comprehension yields StreamRecord objects for each record
            StreamRecord.from_record_message(
                record_message=record.record,
                expected_keys=all_properties,
                prune_extra_fields=True,
            )
            for record in self._read_with_catalog(configured_catalog)
            if record.record
        )
    )
    return LazyDataset(
        iterator,
        stream_metadata=configured_stream,
    )
Read a stream from the connector.
This involves the following steps:
- Call discover to get the catalog
- Generate a configured catalog that syncs the given stream in full_refresh mode
- Write the configured catalog and the config to a temporary file
- Execute the connector with read --config <config_file> --catalog <catalog_file>
- Listen to the messages and return the first AirbyteRecordMessage objects that come along.
- Make sure the subprocess is killed when the function returns.
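A minimal sketch of lazy iteration over a single stream ("users" is illustrative); no cache is involved, and records arrive as the subprocess emits them:

for record in source.get_records("users"):
    print(record)
    break  # stop after the first record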
def get_documents(
    self,
    stream: str,
    title_property: str | None = None,
    content_properties: list[str] | None = None,
    metadata_properties: list[str] | None = None,
    *,
    render_metadata: bool = False,
) -> Iterable[Document]:
    """Read a stream from the connector and return the records as documents.

    If metadata_properties is not set, all properties that are not content will be added to
    the metadata.

    If render_metadata is True, metadata will be rendered in the document, as well as the
    the main content.
    """
    return self.get_records(stream).to_documents(
        title_property=title_property,
        content_properties=content_properties,
        metadata_properties=metadata_properties,
        render_metadata=render_metadata,
    )
Read a stream from the connector and return the records as documents.
If metadata_properties is not set, all properties that are not content will be added to the metadata.
If render_metadata is True, metadata will be rendered in the document along with the main content.
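For example, records can be converted to documents for downstream LLM or search use. A sketch in which the stream and property names are illustrative:

docs = source.get_documents(
    stream="users",
    title_property="name",
    content_properties=["bio"],
    render_metadata=True,  # include metadata alongside the main content
)
for doc in docs:
    print(doc)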
def check(self) -> None:
    """Call check on the connector.

    This involves the following steps:
    * Write the config to a temporary file
    * execute the connector with check --config <config_file>
    * Listen to the messages and return the first AirbyteCatalog that comes along.
    * Make sure the subprocess is killed when the function returns.
    """
    with as_temp_files([self._config]) as [config_file]:
        try:
            for msg in self._execute(["check", "--config", config_file]):
                if msg.type == Type.CONNECTION_STATUS and msg.connectionStatus:
                    if msg.connectionStatus.status != Status.FAILED:
                        print(f"Connection check succeeded for `{self.name}`.")
                        log_source_check_result(
                            name=self.name,
                            state=EventState.SUCCEEDED,
                        )
                        return

                    log_source_check_result(
                        name=self.name,
                        state=EventState.FAILED,
                    )
                    raise exc.AirbyteConnectorCheckFailedError(
                        help_url=self.docs_url,
                        context={
                            "failure_reason": msg.connectionStatus.message,
                        },
                    )
            raise exc.AirbyteConnectorCheckFailedError(log_text=self._last_log_messages)
        except exc.AirbyteConnectorReadError as ex:
            raise exc.AirbyteConnectorCheckFailedError(
                message="The connector failed to check the connection.",
                log_text=ex.log_text,
            ) from ex
Call check on the connector.
This involves the following steps:
- Write the config to a temporary file
- Execute the connector with check --config <config_file>
- Listen to the messages and return the first connection status message that comes along.
- Make sure the subprocess is killed when the function returns.
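A typical call site wraps check() in a try/except. A sketch, where the airbyte.exceptions module path is assumed from the exc. prefix used in the listing above:

from airbyte import exceptions as exc

try:
    source.check()  # prints a success message when the connection check passes
except exc.AirbyteConnectorCheckFailedError as ex:
    print(f"Connection check failed: {ex}")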
def install(self) -> None:
    """Install the connector if it is not yet installed."""
    self.executor.install()
    print("For configuration instructions, see: \n" f"{self.docs_url}#reference\n")
Install the connector if it is not yet installed.
def uninstall(self) -> None:
    """Uninstall the connector if it is installed.

    This only works if the use_local_install flag wasn't used and installation is managed by
    PyAirbyte.
    """
    self.executor.uninstall()
Uninstall the connector if it is installed.
This only works if the use_local_install flag wasn't used and installation is managed by PyAirbyte.
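Both are thin wrappers around the connector's executor, as the listings above show (sketch):

source.install()    # no-op if the connector is already installed
source.uninstall()  # only removes installations that PyAirbyte itself manages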
def read(
    self,
    cache: CacheBase | None = None,
    *,
    streams: str | list[str] | None = None,
    write_strategy: str | WriteStrategy = WriteStrategy.AUTO,
    force_full_refresh: bool = False,
    skip_validation: bool = False,
) -> ReadResult:
    """Read from the connector and write to the cache.

    Args:
        cache: The cache to write to. If None, a default cache will be used.
        write_strategy: The strategy to use when writing to the cache. If a string, it must be
            one of "append", "upsert", "replace", or "auto". If a WriteStrategy, it must be one
            of WriteStrategy.APPEND, WriteStrategy.UPSERT, WriteStrategy.REPLACE, or
            WriteStrategy.AUTO.
        streams: Optional if already set. A list of stream names to select for reading. If set
            to "*", all streams will be selected.
        force_full_refresh: If True, the source will operate in full refresh mode. Otherwise,
            streams will be read in incremental mode if supported by the connector. This option
            must be True when using the "replace" strategy.
    """
    if write_strategy == WriteStrategy.REPLACE and not force_full_refresh:
        warnings.warn(
            message=(
                "Using `REPLACE` strategy without also setting `full_refresh_mode=True` "
                "could result in data loss. "
                "To silence this warning, use the following: "
                'warnings.filterwarnings("ignore", '
                'category="airbyte.warnings.PyAirbyteDataLossWarning")`'
            ),
            category=PyAirbyteDataLossWarning,
            stacklevel=1,
        )
    if cache is None:
        cache = get_default_cache()

    if isinstance(write_strategy, str):
        try:
            write_strategy = WriteStrategy(write_strategy)
        except ValueError:
            raise exc.PyAirbyteInputError(
                message="Invalid strategy",
                context={
                    "write_strategy": write_strategy,
                    "available_strategies": [s.value for s in WriteStrategy],
                },
            ) from None

    if streams:
        self.select_streams(streams)

    if not self._selected_stream_names:
        raise exc.PyAirbyteNoStreamsSelectedError(
            connector_name=self.name,
            available_streams=self.get_available_streams(),
        )

    cache.processor.register_source(
        source_name=self.name,
        incoming_source_catalog=self.configured_catalog,
        stream_names=set(self._selected_stream_names),
    )

    state = (
        cache._get_state(  # noqa: SLF001  # Private method until we have a public API for it.
            source_name=self.name,
            streams=self._selected_stream_names,
        )
        if not force_full_refresh
        else None
    )
    if not skip_validation:
        self.validate_config()

    self._log_sync_start(cache=cache)
    try:
        cache.processor.process_airbyte_messages(
            self._read_with_catalog(
                catalog=self.configured_catalog,
                state=state,
            ),
            write_strategy=write_strategy,
        )
    except Exception as ex:
        self._log_sync_failure(cache=cache, exception=ex)
        raise exc.AirbyteConnectorFailedError(
            log_text=self._last_log_messages,
        ) from ex

    self._log_sync_success(cache=cache)
    return ReadResult(
        processed_records=self._processed_records,
        cache=cache,
        processed_streams=[stream.stream.name for stream in self.configured_catalog.streams],
    )
Read from the connector and write to the cache.
Arguments:
- cache: The cache to write to. If None, a default cache will be used.
- write_strategy: The strategy to use when writing to the cache. If a string, it must be one of "append", "upsert", "replace", or "auto". If a WriteStrategy, it must be one of WriteStrategy.APPEND, WriteStrategy.UPSERT, WriteStrategy.REPLACE, or WriteStrategy.AUTO.
- streams: Optional if already set. A list of stream names to select for reading. If set to "*", all streams will be selected.
- force_full_refresh: If True, the source will operate in full refresh mode. Otherwise, streams will be read in incremental mode if supported by the connector. This option must be True when using the "replace" strategy.
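A minimal end-to-end sketch: the default cache is used when no cache argument is given, and the processed_records and processed_streams attributes are assumed from the ReadResult construction shown in the listing above.

result = source.read(
    streams="*",             # select all available streams
    write_strategy="auto",   # or "append", "upsert", "replace"
)
print(result.processed_records)
print(result.processed_streams)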
class StreamRecord(dict[str, Any]):
    """The StreamRecord class is a case-aware, case-insensitive dictionary implementation.

    It has these behaviors:
    - When a key is retrieved, deleted, or checked for existence, it is always checked in a
      case-insensitive manner.
    - The original case is stored in a separate dictionary, so that the original case can be
      retrieved when needed.
    - Because it is subclassed from `dict`, the `StreamRecord` class can be passed as a normal
      Python dictionary.
    - In addition to the properties of the stream's records, the dictionary also stores the Airbyte
      metadata columns: `_airbyte_raw_id`, `_airbyte_extracted_at`, and `_airbyte_meta`.

    This behavior mirrors how a case-aware, case-insensitive SQL database would handle column
    references.

    There are two ways this class can store keys internally:
    - If normalize_keys is True, the keys are normalized using the given normalizer.
    - If normalize_keys is False, the original case of the keys is stored.

    In regards to missing values, the dictionary accepts an 'expected_keys' input. When set, the
    dictionary will be initialized with the given keys. If a key is not found in the input data, it
    will be initialized with a value of None. When provided, the 'expected_keys' input will also
    determine the original case of the keys.
    """

    def _display_case(self, key: str) -> str:
        """Return the original case of the key."""
        return self._pretty_case_keys[self._normalizer.normalize(key)]

    def _index_case(self, key: str) -> str:
        """Return the internal case of the key.

        If normalize_keys is True, return the normalized key.
        Otherwise, return the original case of the key.
        """
        if self._normalize_keys:
            return self._normalizer.normalize(key)

        return self._display_case(key)

    @classmethod
    def from_record_message(
        cls,
        record_message: AirbyteRecordMessage,
        *,
        prune_extra_fields: bool,
        normalize_keys: bool = True,
        normalizer: type[NameNormalizerBase] | None = None,
        expected_keys: list[str] | None = None,
    ) -> StreamRecord:
        """Return a StreamRecord from a RecordMessage."""
        data_dict: dict[str, Any] = record_message.data.copy()
        data_dict[AB_RAW_ID_COLUMN] = str(ulid.ULID())
        data_dict[AB_EXTRACTED_AT_COLUMN] = datetime.fromtimestamp(
            record_message.emitted_at / 1000, tz=pytz.utc
        )
        data_dict[AB_META_COLUMN] = {}

        return cls(
            from_dict=data_dict,
            prune_extra_fields=prune_extra_fields,
            normalize_keys=normalize_keys,
            normalizer=normalizer,
            expected_keys=expected_keys,
        )

    def __init__(
        self,
        from_dict: dict,
        *,
        prune_extra_fields: bool,
        normalize_keys: bool = True,
        normalizer: type[NameNormalizerBase] | None = None,
        expected_keys: list[str] | None = None,
    ) -> None:
        """Initialize the dictionary with the given data.

        Args:
        - normalize_keys: If `True`, the keys will be normalized using the given normalizer.
        - expected_keys: If provided, the dictionary will be initialized with these given keys.
        - expected_keys: If provided and `prune_extra_fields` is True, then unexpected fields
          will be removed. This option is ignored if `expected_keys` is not provided.
        """
        # If no normalizer is provided, use LowerCaseNormalizer.
        self._normalize_keys = normalize_keys
        self._normalizer: type[NameNormalizerBase] = normalizer or LowerCaseNormalizer

        # If no expected keys are provided, use all keys from the input dictionary.
        if not expected_keys:
            expected_keys = list(from_dict.keys())
            prune_extra_fields = False  # No expected keys provided.
        else:
            expected_keys = list(expected_keys)

        for internal_col in AB_INTERNAL_COLUMNS:
            if internal_col not in expected_keys:
                expected_keys.append(internal_col)

        # Store a lookup from normalized keys to pretty cased (originally cased) keys.
        self._pretty_case_keys: dict[str, str] = {
            self._normalizer.normalize(pretty_case.lower()): pretty_case
            for pretty_case in expected_keys
        }

        if normalize_keys:
            index_keys = [self._normalizer.normalize(key) for key in expected_keys]
        else:
            index_keys = expected_keys

        self.update({k: None for k in index_keys})  # Start by initializing all values to None
        for k, v in from_dict.items():
            index_cased_key = self._index_case(k)
            if prune_extra_fields and index_cased_key not in index_keys:
                # Dropping undeclared field
                continue

            self[index_cased_key] = v

    def __getitem__(self, key: str) -> Any:  # noqa: ANN401
        if super().__contains__(key):
            return super().__getitem__(key)

        if super().__contains__(self._index_case(key)):
            return super().__getitem__(self._index_case(key))

        raise KeyError(key)

    def __setitem__(self, key: str, value: Any) -> None:  # noqa: ANN401
        if super().__contains__(key):
            super().__setitem__(key, value)
            return

        if super().__contains__(self._index_case(key)):
            super().__setitem__(self._index_case(key), value)
            return

        # Store the pretty cased (originally cased) key:
        self._pretty_case_keys[self._normalizer.normalize(key)] = key

        # Store the data with the normalized key:
        super().__setitem__(self._index_case(key), value)

    def __delitem__(self, key: str) -> None:
        if super().__contains__(key):
            super().__delitem__(key)
            return

        if super().__contains__(self._index_case(key)):
            super().__delitem__(self._index_case(key))
            return

        raise KeyError(key)

    def __contains__(self, key: object) -> bool:
        assert isinstance(key, str), "Key must be a string."
        return super().__contains__(key) or super().__contains__(self._index_case(key))

    def __iter__(self) -> Any:  # noqa: ANN401
        return iter(super().__iter__())

    def __len__(self) -> int:
        return super().__len__()

    def __eq__(self, other: object) -> bool:
        if isinstance(other, StreamRecord):
            return dict(self) == dict(other)

        if isinstance(other, dict):
            return {k.lower(): v for k, v in self.items()} == {
                k.lower(): v for k, v in other.items()
            }
        return False
The StreamRecord class is a case-aware, case-insensitive dictionary implementation.
It has these behaviors:
- When a key is retrieved, deleted, or checked for existence, it is always checked in a case-insensitive manner.
- The original case is stored in a separate dictionary, so that the original case can be retrieved when needed.
- Because it is subclassed from `dict`, the `StreamRecord` class can be passed as a normal Python dictionary.
- In addition to the properties of the stream's records, the dictionary also stores the Airbyte metadata columns: `_airbyte_raw_id`, `_airbyte_extracted_at`, and `_airbyte_meta`.
This behavior mirrors how a case-aware, case-insensitive SQL database would handle column references.
There are two ways this class can store keys internally:
- If normalize_keys is True, the keys are normalized using the given normalizer.
- If normalize_keys is False, the original case of the keys is stored.
In regards to missing values, the dictionary accepts an 'expected_keys' input. When set, the dictionary will be initialized with the given keys. If a key is not found in the input data, it will be initialized with a value of None. When provided, the 'expected_keys' input will also determine the original case of the keys.
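A short sketch of the case-insensitive behavior; the airbyte.records import path is assumed for illustration:

from airbyte.records import StreamRecord  # import path assumed

rec = StreamRecord(
    from_dict={"UserId": 123, "Email": "jane@example.com"},
    prune_extra_fields=False,
)
assert rec["userid"] == 123   # lookups are case-insensitive
assert "EMAIL" in rec         # so are membership checks
print(dict(rec))              # behaves like a normal dict elsewhere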
The `from_record_message()` classmethod (shown in the class source above) returns a `StreamRecord` from an `AirbyteRecordMessage`.
Inherited Members
- builtins.dict
- get
- setdefault
- pop
- popitem
- keys
- items
- values
- update
- fromkeys
- clear
- copy