
Commit

Add details to docs (#4244)
ChristopheDuong committed Jun 21, 2021
1 parent e99cc59 commit 55d3b4d
Showing 3 changed files with 84 additions and 110 deletions.
86 changes: 36 additions & 50 deletions docs/understanding-airbyte/connections/incremental-append.md
@@ -22,51 +22,43 @@

As mentioned above, the delta from a sync will be _appended_ to the existing data.

Assume that `updated_at` is our `cursor_field`. Let's say the following data already exists in our data warehouse.

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | false | 1754 |
| Marie Antoinette | false | 1755 |

In the next sync, the delta contains the following record:

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVII | false | 1785 |

At the end of this incremental sync, the data warehouse would now contain:

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | false | 1754 |
| Marie Antoinette | false | 1755 |
| Louis XVII | false | 1785 |
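
In an append-only sync, the destination simply inserts the delta rows; nothing is updated in place. A minimal sketch of that write, assuming an illustrative destination table named `persons` (not Airbyte's actual raw-table layout):

```sql
-- The delta from the sync is appended as new rows; existing rows are untouched.
INSERT INTO persons (name, deceased, updated_at)
VALUES ('Louis XVII', false, 1785);
```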

### Updating a Record

Let's assume that our warehouse contains all the data that it did at the end of the previous section. Now, unfortunately, the king and queen lose their heads. Let's see that delta:

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | true | 1793 |
| Marie Antoinette | true | 1793 |

The output we expect to see in the warehouse is as follows:

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | false | 1754 |
| Marie Antoinette | false | 1755 |
| Louis XVII | false | 1785 |
| Louis XVI | true | 1793 |
| Marie Antoinette | true | 1793 |
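
Because updates arrive as brand-new rows, reading the *current* state back out of an append-only table takes a little work. A sketch of one common approach, assuming the same illustrative `persons` table with `name` as the unique key:

```sql
-- Keep only the most recent row per key: rank rows within each name by
-- updated_at (newest first) and keep rank 1.
SELECT name, deceased, updated_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY updated_at DESC) AS rn
    FROM persons
) ranked
WHERE rn = 1;
```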

## Source-Defined Cursor

@@ -108,33 +100,27 @@

select * from table where cursor_field > 'last_sync_max_cursor_field_value'

Let's say the following data already exists in our data warehouse.

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | false | 1754 |
| Marie Antoinette | false | 1755 |

At the start of the next sync, the source data contains the following updated record:

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | true | 1754 |

At the end of the second incremental sync, the data warehouse would still contain only the data from the first sync, because the delta record did not provide a valid value for the cursor field (its cursor value is not greater than the last sync's max value, `1754 < 1755`), so it is not emitted by the source as a new or modified record.

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | false | 1754 |
| Marie Antoinette | false | 1755 |
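
To see why the update is skipped, apply the cursor filter from above with the saved state. A sketch, again with the illustrative `persons` source table:

```sql
-- State saved after the first sync: max(updated_at) = 1755.
-- The modified row has updated_at = 1754, so this query returns no rows
-- and the change is never emitted to the destination.
SELECT *
FROM persons
WHERE updated_at > 1755;
```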

Similarly, if multiple modifications are made to the same records between syncs and the sync frequency is not granular enough (for example, set to every 24h), intermediate modifications to the data are not detected and emitted. Only the state of the data at the time the sync runs is reflected in the destination.

Those concerns could be solved by using a different incremental approach based on binary logs or Write-Ahead Logs (WAL), also called [Change Data Capture (CDC)](../cdc.md).

The current behavior of **Incremental** cannot handle source schema changes yet, for example, when a column is added, renamed, or deleted from an existing table. It is recommended to trigger a [Full refresh - Overwrite](full-refresh-overwrite.md) to correctly replicate the data to the destination with the new schema.

100 changes: 41 additions & 59 deletions docs/understanding-airbyte/connections/incremental-deduped-history.md
@@ -30,65 +30,53 @@

As mentioned above, the delta from a sync will be _appended_ to the existing history.

Assume that `updated_at` is our `cursor_field` and `name` is the `primary_key`. Let's say the following data already exists in our data warehouse.

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | false | 1754 |
| Marie Antoinette | false | 1755 |

In the next sync, the delta contains the following record:

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVII | false | 1785 |

At the end of this incremental sync, the data warehouse would now contain:

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | false | 1754 |
| Marie Antoinette | false | 1755 |
| Louis XVII | false | 1785 |

### Updating a Record

Let's assume that our warehouse contains all the data that it did at the end of the previous section. Now, unfortunately, the king and queen lose their heads. Let's see that delta:

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | true | 1793 |
| Marie Antoinette | true | 1793 |

The output we expect to see in the warehouse is as follows:

In the history table:

| name | deceased | updated_at | start_at | end_at |
| :--- | :--- | :--- | :--- | :--- |
| Louis XVI | false | 1754 | 1754 | 1793 |
| Louis XVI | true | 1793 | 1793 | NULL |
| Louis XVII | false | 1785 | 1785 | NULL |
| Marie Antoinette | false | 1755 | 1755 | 1793 |
| Marie Antoinette | true | 1793 | 1793 | NULL |
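
One way to derive the `start_at`/`end_at` columns from the appended versions is a window function over the cursor field. A sketch with illustrative names (`persons_raw` for the appended rows), not the exact SQL Airbyte's normalization generates:

```sql
-- Each version is active from its own cursor value until the next version's;
-- LEAD() returns NULL for the latest version, leaving end_at open.
SELECT
    name,
    deceased,
    updated_at,
    updated_at AS start_at,
    LEAD(updated_at) OVER (PARTITION BY name ORDER BY updated_at) AS end_at
FROM persons_raw;
```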

In the final de-duplicated table:

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | true | 1793 |
| Louis XVII | false | 1785 |
| Marie Antoinette | true | 1793 |
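
The de-duplicated table then falls out of the history table by keeping only the currently active version of each record; a sketch under the same assumed names:

```sql
-- Rows with no successor (end_at IS NULL) are the latest version per primary key.
SELECT name, deceased, updated_at
FROM persons_history
WHERE end_at IS NULL;
```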

## Source-Defined Cursor

@@ -134,33 +122,27 @@

select * from table where cursor_field > 'last_sync_max_cursor_field_value'

Let's say the following data already exists in our data warehouse.

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | false | 1754 |
| Marie Antoinette | false | 1755 |

At the start of the next sync, the source data contains the following updated record:

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | true | 1754 |

At the end of the second incremental sync, the data warehouse would still contain only the data from the first sync, because the delta record did not provide a valid value for the cursor field (its cursor value is not greater than the last sync's max value, `1754 < 1755`), so it is not emitted by the source as a new or modified record.

| name | deceased | updated_at |
| :--- | :--- | :--- |
| Louis XVI | false | 1754 |
| Marie Antoinette | false | 1755 |

Similarly, if multiple modifications are made to the same records between syncs and the sync frequency is not granular enough (for example, set to every 24h), intermediate modifications to the data are not detected and emitted. Only the state of the data at the time the sync runs is reflected in the destination.

Those concerns could be solved by using a different incremental approach based on binary logs or Write-Ahead Logs (WAL), also called [Change Data Capture (CDC)](../cdc.md).

The current behavior of **Incremental** cannot handle source schema changes yet, for example, when a column is added, renamed, or deleted from an existing table. It is recommended to trigger a [Full refresh - Overwrite](full-refresh-overwrite.md) to correctly replicate the data to the destination with the new schema.

8 changes: 7 additions & 1 deletion docs/understanding-airbyte/namespaces.md
@@ -16,7 +16,13 @@

If the Destination does not support namespaces, the [namespace field](https://gi

## Destination namespace configuration

As part of the [connections sync settings](connections/README.md), it is possible to configure the namespace used by:
1. destination connectors: to store the `_airbyte_raw_*` tables.
2. basic normalization: to store the final normalized tables.

Note that custom transformation outputs are not affected by Airbyte's namespace settings: it is up to the custom dbt project's configuration, and how it is written, to handle its [custom schemas](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-schemas). In this case, the default target schema for dbt will always be the destination namespace.
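
For reference, dbt routes models to schemas through its `generate_schema_name` macro; the sketch below mirrors dbt's documented default behavior (shown for illustration, not something Airbyte configures for you):

```sql
-- macros/generate_schema_name.sql
-- Default dbt behavior: use the target schema (here, the destination
-- namespace), appending the model's custom schema when one is set.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```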

Available options for namespace configuration are:

### - Mirror source structure

