airbyte.records

PyAirbyte Records module.

Understanding record handling in PyAirbyte

PyAirbyte models record handling after Airbyte's "Destination V2" ("Dv2") record handling. This includes the implementation details below.

Field Name Normalization

  1. PyAirbyte normalizes top-level record keys to lowercase, replacing spaces and hyphens with underscores.
  2. PyAirbyte does not normalize nested keys on sub-properties.

For example, the following record:

{
    "My-Field": "value",
    "Nested": {
        "MySubField": "value"
    }
}

Would be normalized to:

{
    "my_field": "value",
    "nested": {
        "MySubField": "value"
    }
}
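The rule above can be sketched in a few lines of Python. This is a minimal illustration under the stated rules only, not PyAirbyte's actual normalizer; the helper name `normalize_top_level_keys` is hypothetical.

```python
from typing import Any


def normalize_top_level_keys(record: dict[str, Any]) -> dict[str, Any]:
    """Lowercase top-level keys and replace spaces and hyphens with underscores.

    Hypothetical helper for illustration. Nested dictionaries are passed
    through untouched, so their keys keep their original casing.
    """
    return {
        key.lower().replace(" ", "_").replace("-", "_"): value
        for key, value in record.items()
    }


record = {"My-Field": "value", "Nested": {"MySubField": "value"}}
normalized = normalize_top_level_keys(record)
print(normalized)
```

Only the outer dictionary is rebuilt, which is why `MySubField` keeps its original name.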

Table Name Normalization

Similar to column handling, PyAirbyte normalizes table names to the lowercase version of the stream name and may remove or normalize special characters.
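As a rough sketch of what such a table-name rule could look like: the exact treatment of special characters is not specified above, so the regular expression below is an assumption, and `normalize_table_name` is a hypothetical helper rather than a PyAirbyte function.

```python
import re


def normalize_table_name(stream_name: str) -> str:
    """Return a lowercase table name for a stream (hypothetical helper)."""
    lowered = stream_name.lower()
    # Assumption: each run of characters outside [a-z0-9_] becomes one underscore.
    return re.sub(r"[^a-z0-9_]+", "_", lowered)


table_name = normalize_table_name("My Stream-Name")
print(table_name)
```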

Airbyte-Managed Metadata Columns

PyAirbyte adds the following columns to every record:

  • _airbyte_raw_id: A unique identifier for the record.
  • _airbyte_extracted_at: The time the record was extracted.
  • _airbyte_meta: A dictionary of extra metadata about the record.

The names of these columns are included in the airbyte.constants module for programmatic reference.
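The following sketch shows one way the three metadata columns could be attached to a record. It is an illustration under assumptions: the column names are taken from the `StreamRecord` docstring and defined locally instead of being imported from `airbyte.constants`, the helper `with_metadata` is hypothetical, and a UUID stands in for the ULID that the module source generates.

```python
import uuid
from datetime import datetime, timezone
from typing import Any

# Local stand-ins for the names that airbyte.constants exposes (assumed values).
AB_RAW_ID_COLUMN = "_airbyte_raw_id"
AB_EXTRACTED_AT_COLUMN = "_airbyte_extracted_at"
AB_META_COLUMN = "_airbyte_meta"


def with_metadata(data: dict[str, Any], emitted_at_ms: int) -> dict[str, Any]:
    """Return a copy of `data` with the three metadata columns added."""
    enriched = dict(data)
    # The module source uses a ULID here; a UUID is swapped in to stay stdlib-only.
    enriched[AB_RAW_ID_COLUMN] = str(uuid.uuid4())
    # The emitted-at value is in milliseconds since the epoch; convert to UTC.
    enriched[AB_EXTRACTED_AT_COLUMN] = datetime.fromtimestamp(
        emitted_at_ms / 1000, tz=timezone.utc
    )
    enriched[AB_META_COLUMN] = {}
    return enriched


row = with_metadata({"id": 1}, emitted_at_ms=1_700_000_000_000)
print(sorted(row))
```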

Schema Evolution

PyAirbyte supports a very basic form of schema evolution:

  1. New columns are always added to cache tables automatically whenever an arriving record contains a property that is not yet present in the cache table.
  2. Column types will not be modified or expanded to fit changed types in the source catalog.
    • If column types change, we recommend that users manually alter the column types.
  3. At any time, users can run a full sync with a WriteStrategy of 'replace'. This will create a fresh table from scratch and then swap the old and new tables after table sync is complete.
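The first rule (auto-adding columns) can be sketched against an in-memory SQLite database. This is an illustration only and not how PyAirbyte's caches are implemented; the helper `add_missing_columns`, the `users` table, and the choice of `TEXT` for every new column are assumptions.

```python
import sqlite3
from typing import Any


def add_missing_columns(
    conn: sqlite3.Connection, table: str, record: dict[str, Any]
) -> list[str]:
    """Add a TEXT column for each record key the table lacks (hypothetical helper)."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    added = []
    for column in record:
        if column not in existing:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}" TEXT')
            added.append(column)
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
added = add_missing_columns(
    conn, "users", {"id": 1, "name": "Ada", "email": "ada@example.com"}
)
columns = [row[1] for row in conn.execute('PRAGMA table_info("users")')]
print(added, columns)
```

Existing columns are never altered in this sketch, which matches the second rule: a changed type in the source is left for the user to resolve manually.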

# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
"""PyAirbyte Records module.

## Understanding record handling in PyAirbyte

PyAirbyte models record handling after Airbyte's "Destination V2" ("Dv2") record handling. This
includes the implementation details below.

### Field Name Normalization

1. PyAirbyte normalizes top-level record keys to lowercase, replacing spaces and hyphens with
   underscores.
2. PyAirbyte does not normalize nested keys on sub-properties.

For example, the following record:

```json
{
    "My-Field": "value",
    "Nested": {
        "MySubField": "value"
    }
}
```

Would be normalized to:

```json
{
    "my_field": "value",
    "nested": {
        "MySubField": "value"
    }
}
```

### Table Name Normalization

Similar to column handling, PyAirbyte normalizes table names to the lowercase version of the stream
name and may remove or normalize special characters.

### Airbyte-Managed Metadata Columns

PyAirbyte adds the following columns to every record:

- `_airbyte_raw_id`: A unique identifier for the record.
- `_airbyte_extracted_at`: The time the record was extracted.
- `_airbyte_meta`: A dictionary of extra metadata about the record.

The names of these columns are included in the `airbyte.constants` module for programmatic
reference.

## Schema Evolution

PyAirbyte supports a very basic form of schema evolution:

1. New columns are always added to cache tables automatically whenever an arriving record
   contains a property that is not yet present in the cache table.
2. Column types will not be modified or expanded to fit changed types in the source catalog.
   - If column types change, we recommend that users manually alter the column types.
3. At any time, users can run a full sync with a `WriteStrategy` of 'replace'. This will create a
   fresh table from scratch and then swap the old and new tables after table sync is complete.

---

"""

from __future__ import annotations

from datetime import datetime
from typing import TYPE_CHECKING, Any

import pytz
import ulid

from airbyte._util.name_normalizers import LowerCaseNormalizer, NameNormalizerBase
from airbyte.constants import (
    AB_EXTRACTED_AT_COLUMN,
    AB_INTERNAL_COLUMNS,
    AB_META_COLUMN,
    AB_RAW_ID_COLUMN,
)


if TYPE_CHECKING:
    from airbyte_protocol.models import (
        AirbyteRecordMessage,
    )


class StreamRecord(dict[str, Any]):
    """The StreamRecord class is a case-aware, case-insensitive dictionary implementation.

    It has these behaviors:
    - When a key is retrieved, deleted, or checked for existence, it is always checked in a
      case-insensitive manner.
    - The original case is stored in a separate dictionary, so that the original case can be
      retrieved when needed.
    - Because it is subclassed from `dict`, the `StreamRecord` class can be passed as a normal
      Python dictionary.
    - In addition to the properties of the stream's records, the dictionary also stores the Airbyte
      metadata columns: `_airbyte_raw_id`, `_airbyte_extracted_at`, and `_airbyte_meta`.

    This behavior mirrors how a case-aware, case-insensitive SQL database would handle column
    references.

    There are two ways this class can store keys internally:
    - If normalize_keys is True, the keys are normalized using the given normalizer.
    - If normalize_keys is False, the original case of the keys is stored.

    With regard to missing values, the dictionary accepts an 'expected_keys' input. When set, the
    dictionary will be initialized with the given keys. If a key is not found in the input data, it
    will be initialized with a value of None. When provided, the 'expected_keys' input will also
    determine the original case of the keys.
    """

    def _display_case(self, key: str) -> str:
        """Return the original case of the key."""
        return self._pretty_case_keys[self._normalizer.normalize(key)]

    def _index_case(self, key: str) -> str:
        """Return the internal case of the key.

        If normalize_keys is True, return the normalized key.
        Otherwise, return the original case of the key.
        """
        if self._normalize_keys:
            return self._normalizer.normalize(key)

        return self._display_case(key)

    @classmethod
    def from_record_message(
        cls,
        record_message: AirbyteRecordMessage,
        *,
        prune_extra_fields: bool,
        normalize_keys: bool = True,
        normalizer: type[NameNormalizerBase] | None = None,
        expected_keys: list[str] | None = None,
    ) -> StreamRecord:
        """Return a StreamRecord from a RecordMessage."""
        data_dict: dict[str, Any] = record_message.data.copy()
        data_dict[AB_RAW_ID_COLUMN] = str(ulid.ULID())
        data_dict[AB_EXTRACTED_AT_COLUMN] = datetime.fromtimestamp(
            record_message.emitted_at / 1000, tz=pytz.utc
        )
        data_dict[AB_META_COLUMN] = {}

        return cls(
            from_dict=data_dict,
            prune_extra_fields=prune_extra_fields,
            normalize_keys=normalize_keys,
            normalizer=normalizer,
            expected_keys=expected_keys,
        )

    def __init__(
        self,
        from_dict: dict,
        *,
        prune_extra_fields: bool,
        normalize_keys: bool = True,
        normalizer: type[NameNormalizerBase] | None = None,
        expected_keys: list[str] | None = None,
    ) -> None:
        """Initialize the dictionary with the given data.

        Args:
        - normalize_keys: If `True`, the keys will be normalized using the given normalizer.
        - expected_keys: If provided, the dictionary will be initialized with these given keys.
        - prune_extra_fields: If `True` and `expected_keys` is provided, then unexpected fields
          will be removed. This option is ignored if `expected_keys` is not provided.
        """
        # If no normalizer is provided, use LowerCaseNormalizer.
        self._normalize_keys = normalize_keys
        self._normalizer: type[NameNormalizerBase] = normalizer or LowerCaseNormalizer

        # If no expected keys are provided, use all keys from the input dictionary.
        if not expected_keys:
            expected_keys = list(from_dict.keys())
            prune_extra_fields = False  # No expected keys provided.
        else:
            expected_keys = list(expected_keys)

        for internal_col in AB_INTERNAL_COLUMNS:
            if internal_col not in expected_keys:
                expected_keys.append(internal_col)

        # Store a lookup from normalized keys to pretty cased (originally cased) keys.
        self._pretty_case_keys: dict[str, str] = {
            self._normalizer.normalize(pretty_case.lower()): pretty_case
            for pretty_case in expected_keys
        }

        if normalize_keys:
            index_keys = [self._normalizer.normalize(key) for key in expected_keys]
        else:
            index_keys = expected_keys

        self.update({k: None for k in index_keys})  # Start by initializing all values to None
        for k, v in from_dict.items():
            index_cased_key = self._index_case(k)
            if prune_extra_fields and index_cased_key not in index_keys:
                # Dropping undeclared field
                continue

            self[index_cased_key] = v

    def __getitem__(self, key: str) -> Any:  # noqa: ANN401
        if super().__contains__(key):
            return super().__getitem__(key)

        if super().__contains__(self._index_case(key)):
            return super().__getitem__(self._index_case(key))

        raise KeyError(key)

    def __setitem__(self, key: str, value: Any) -> None:  # noqa: ANN401
        if super().__contains__(key):
            super().__setitem__(key, value)
            return

        if super().__contains__(self._index_case(key)):
            super().__setitem__(self._index_case(key), value)
            return

        # Store the pretty cased (originally cased) key:
        self._pretty_case_keys[self._normalizer.normalize(key)] = key

        # Store the data with the normalized key:
        super().__setitem__(self._index_case(key), value)

    def __delitem__(self, key: str) -> None:
        if super().__contains__(key):
            super().__delitem__(key)
            return

        if super().__contains__(self._index_case(key)):
            super().__delitem__(self._index_case(key))
            return

        raise KeyError(key)

    def __contains__(self, key: object) -> bool:
        assert isinstance(key, str), "Key must be a string."
        return super().__contains__(key) or super().__contains__(self._index_case(key))

    def __iter__(self) -> Any:  # noqa: ANN401
        return iter(super().__iter__())

    def __len__(self) -> int:
        return super().__len__()

    def __eq__(self, other: object) -> bool:
        if isinstance(other, StreamRecord):
            return dict(self) == dict(other)

        if isinstance(other, dict):
            return {k.lower(): v for k, v in self.items()} == {
                k.lower(): v for k, v in other.items()
            }
        return False

class StreamRecord(dict[str, typing.Any]):

The StreamRecord class is a case-aware, case-insensitive dictionary implementation.

It has these behaviors:

  • When a key is retrieved, deleted, or checked for existence, it is always checked in a case-insensitive manner.
  • The original case is stored in a separate dictionary, so that the original case can be retrieved when needed.
  • Because it is subclassed from dict, the StreamRecord class can be passed as a normal Python dictionary.
  • In addition to the properties of the stream's records, the dictionary also stores the Airbyte metadata columns: _airbyte_raw_id, _airbyte_extracted_at, and _airbyte_meta.

This behavior mirrors how a case-aware, case-insensitive SQL database would handle column references.

There are two ways this class can store keys internally:

  • If normalize_keys is True, the keys are normalized using the given normalizer.
  • If normalize_keys is False, the original case of the keys is stored.

With regard to missing values, the dictionary accepts an 'expected_keys' input. When set, the dictionary will be initialized with the given keys. If a key is not found in the input data, it will be initialized with a value of None. When provided, the 'expected_keys' input will also determine the original case of the keys.
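To make the described behaviors concrete, here is a deliberately simplified stand-in. It is not the `StreamRecord` implementation: the class name `CaseInsensitiveRecord` and its `display_case` method are hypothetical, plain lowercasing replaces the configurable normalizer, and the metadata columns and pruning are left out.

```python
from typing import Any, Optional


class CaseInsensitiveRecord(dict):
    """Hypothetical, simplified stand-in for StreamRecord (illustration only)."""

    def __init__(
        self, data: dict[str, Any], expected_keys: Optional[list[str]] = None
    ) -> None:
        super().__init__()
        keys = list(expected_keys) if expected_keys else list(data)
        # Map each lowercase key to its original ("pretty") casing.
        self._pretty_case = {key.lower(): key for key in keys}
        # Every expected key starts as None so that missing values are explicit.
        for key in keys:
            super().__setitem__(key.lower(), None)
        for key, value in data.items():
            self[key] = value

    def __getitem__(self, key: str) -> Any:
        return super().__getitem__(key.lower())

    def __setitem__(self, key: str, value: Any) -> None:
        # Remember the first casing seen for a key, then store it lowercased.
        self._pretty_case.setdefault(key.lower(), key)
        super().__setitem__(key.lower(), value)

    def __contains__(self, key: object) -> bool:
        return isinstance(key, str) and super().__contains__(key.lower())

    def display_case(self, key: str) -> str:
        """Return the original casing recorded for `key`."""
        return self._pretty_case[key.lower()]


record = CaseInsensitiveRecord({"UserName": "ada"}, expected_keys=["UserName", "Email"])
print(record["username"], record["EMAIL"], record.display_case("username"))
```

Lookups succeed regardless of casing, `Email` is present with a value of `None` because it was expected but missing from the input, and the original casing stays recoverable through the separate lookup dictionary.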

@classmethod
def from_record_message(cls, record_message: airbyte_protocol.models.airbyte_protocol.AirbyteRecordMessage, *, prune_extra_fields: bool, normalize_keys: bool = True, normalizer: type[airbyte._util.name_normalizers.NameNormalizerBase] | None = None, expected_keys: list[str] | None = None) -> StreamRecord:

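Return a StreamRecord built from an AirbyteRecordMessage. The record's data is copied and the three Airbyte metadata columns are added: a newly generated ULID as the raw ID, the extraction time derived from the message's emitted_at value (milliseconds since the epoch, converted to UTC), and an empty metadata dictionary.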

Inherited Members
builtins.dict
get
setdefault
pop
popitem
keys
items
values
update
fromkeys
clear
copy