## Idea

A primary server behaves like a regular single-machine server with continuous WAL archiving configured.

The archiving process ships WAL data to one or more standby servers, which run in a continuous recovery state.

Whenever the primary server faces a disaster, a standby server is ready to take over within moments and become the new primary. Doing this without human intervention requires external tooling, because PostgreSQL itself does not detect the failure of the primary or trigger the promotion.

## Risk

Just like the backup / recovery mechanism, there is a risk of data loss in the time window between transaction commit and WAL archiving.

The `archive_timeout` parameter bounds this window, and the streaming method shrinks it even further.

## Architecture

### High Level

<img src="./helpers/wal-replication.png" alt="drawing"/>

### Primary

**File Based**

The primary simply performs continuous WAL archiving to a location that the standby machine can access.

### Standby Mode

A server enters standby mode (standby loop) if a `standby.signal` file exists in the data directory when the server is started.

Standby mode is exited and the server switches to normal operation when `pg_ctl promote` is run, or `pg_promote()` is called.

<img src="./helpers/Replication - Standby Loop.png" alt="drawing"/>
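
The standby loop can be sketched as a small simulation. This is a simplified model under stated assumptions, not PostgreSQL's implementation: the three sources are hypothetical callables that return the WAL record at a given position or `None`, and `max_rounds` exists only so the sketch terminates.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical source type: returns the WAL record at a position, or None.
WalSource = Callable[[int], Optional[str]]


def standby_loop(archive: WalSource,
                 pg_wal: WalSource,
                 streaming: Optional[WalSource],
                 promote_requested: Callable[[], bool],
                 max_rounds: int = 100) -> List[Tuple[str, str]]:
    """Replay WAL until promotion is requested; return (source, record) pairs."""
    replayed: List[Tuple[str, str]] = []
    position = 0
    sources = [("archive", archive), ("pg_wal", pg_wal)]
    if streaming is not None:            # streaming is optional
        sources.append(("streaming", streaming))
    for _ in range(max_rounds):          # a real server loops until stopped
        if promote_requested():          # pg_ctl promote / pg_promote()
            break
        for name, fetch in sources:
            # Drain each source in order before falling through to the next.
            while (record := fetch(position)) is not None:
                replayed.append((name, record))
                position += 1
    return replayed
```

Passing `dict.get` of a small dictionary as each source is enough to see the order: archived records are replayed first, then anything left in `pg_wal`, and only then streamed records.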

### Streaming

<img src="./helpers/streaming-replication-architecture.png" alt="drawing" width="700"/>

1. Start the primary and standby servers.
1. The standby server starts the startup process.
1. The standby server starts a walreceiver process.
1. The walreceiver sends a connection request to the primary server. If the primary server is not running, the walreceiver sends these requests periodically.
1. When the primary server receives a connection request, it starts a walsender process and a TCP connection is established between the walsender and walreceiver.
1. Handshake: the walreceiver sends the latest LSN (Log Sequence Number) of standby’s database cluster.
1. If the standby’s latest LSN is less than the primary’s latest LSN (standby’s LSN < primary’s LSN), the walsender sends the WAL data between the two LSNs. This data comes from the WAL segments stored in the primary’s `pg_wal` subdirectory. The standby server then replays the received WAL data. In this phase the standby catches up with the primary, so it is called catch-up.
1. Streaming Replication begins to work.
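
The LSN comparison in the handshake can be illustrated with a short sketch. An LSN is displayed as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit byte position); the function names below are illustrative and not part of any PostgreSQL API.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/3000060' to a 64-bit byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def catch_up_bytes(standby_lsn: str, primary_lsn: str) -> int:
    """Bytes of WAL the walsender still has to send (0 if caught up)."""
    return max(0, parse_lsn(primary_lsn) - parse_lsn(standby_lsn))


print(catch_up_bytes("0/3000000", "0/3000060"))  # 96
```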

### Cascading Replication

Idea: let a standby act as the upstream source of the primary's changes for other standbys, so that they do not have to connect to the primary directly.

<img src="./helpers/cascading-replication.png" alt="drawing" height="500"/>

## Streaming In Depth

### Replication Slots

#### Functionality

Replication slots provide an automated way to ensure that the primary does not remove WAL segments until they have been received by all standbys, and that the primary does not remove rows which could cause a recovery conflict even when the standby is disconnected.

This feature retains exactly the amount of WAL that the standby servers need to keep up with changes on the primary.

#### Caveats

**!!!Be Careful!!!**

A common cause of a full disk is a replication slot that belongs to a problematic standby server. In the worst case the standby is simply idle or disconnected and never advances the slot.

The primary keeps an unbounded amount of WAL for such a slot (unless `max_slot_wal_keep_size`, available since PostgreSQL 13, caps it), so the disk can fill up very quickly, especially when the WAL file size is larger than the default `16MB` and `archive_timeout` is set to a low value.

Even a seemingly idle database can fill a large disk this way; see [an example on Amazon RDS](https://www.morling.dev/blog/insatiable-postgres-replication-slot/).
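
A back-of-the-envelope sketch shows how quickly this adds up. It is an upper-bound estimate under two assumptions: every `archive_timeout` interval sees at least a little activity, so a segment switch is forced each time, and every switched segment occupies its full size on disk. The function name is illustrative.

```python
def idle_slot_wal_gib_per_day(segment_size_mb: int = 16,
                              archive_timeout_s: int = 60) -> float:
    """Upper-bound estimate (GiB/day) of WAL retained for a stalled slot."""
    if archive_timeout_s <= 0:
        raise ValueError("archive_timeout must be positive for this estimate")
    switches_per_day = 86_400 // archive_timeout_s   # forced segment switches
    return switches_per_day * segment_size_mb / 1024


print(idle_slot_wal_gib_per_day(16, 60))    # 22.5
print(idle_slot_wal_gib_per_day(64, 300))   # 18.0
```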

### Streaming VS File Based

Method  |   Data Loss Due To Disaster    |   Data Loss Due To WAL Recycling |   Direct Communication
----    |   -------------------------    |   ------------------------------ |   ---------
File Based  |   Potentially larger, because a WAL file that has not been archived yet is lost   |   Cannot happen, since a WAL file is recycled only after it has been archived  |   Not mandatory (the archive can live on a shared-access server)
Streaming   |   Very small, down to less than a few seconds of changes            |   Can happen if the standby cannot keep up (replication slots prevent this)                 |   Mandatory

Probably the best approach in terms of durability and availability is to use both file-based and streaming replication. The standby loop first restores from the archived files and then switches to streaming (if configured), so each method is used in the scenario it suits.

### Synchronous Replication

#### Idea

Full durability: remove the risk of data loss by requiring an acknowledgement from the standby before each transaction commit is confirmed.

#### Architecture

<img src="./helpers/sync-wal-replication.png" alt="drawing" width="900"/>

1. The backend process writes and flushes WAL data to a WAL segment file.
1. The walsender process sends the WAL data written into the WAL segment to the walreceiver process.
1. After sending the WAL data, the backend process continues to wait for an ACK response from the standby server.
1. The walreceiver on the standby server writes the received WAL data into the standby’s WAL segment using the write() system call, and returns an ACK response to the walsender.
1. The walreceiver flushes the WAL data to the WAL segment using a system call such as fsync(), returns another ACK response to the walsender, and informs the startup process that the WAL data has been updated.
1. The startup process replays the WAL data, which has been written to the WAL segment.
1. The walsender releases the latch of the backend process on receiving the ACK response from the walreceiver, and then the backend process’s commit or abort action is completed. The timing of the latch release depends on the parameter `synchronous_commit`. If it is `on` (default), the latch is released when the ACK of step (5) is received, whereas if it is `remote_write`, the latch is released when the ACK of step (4) is received.

(a) - The standby periodically sends a heartbeat ACK so that the primary knows the standby's current state and which WAL records it still needs to send.

#### ACK

Metadata about the standby state:
- Latest written LSN
- Latest flushed LSN
- Latest replayed LSN
- Timestamp
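
As a rough sketch of how these fields relate to the latch release described above, the ACK can be modelled as a small record. The class and function names are hypothetical, and LSNs are plain integers here for simplicity.

```python
from dataclasses import dataclass


@dataclass
class StandbyAck:
    write_lsn: int    # latest LSN written on the standby
    flush_lsn: int    # latest LSN flushed to durable storage
    replay_lsn: int   # latest LSN replayed by the startup process
    timestamp: float  # when the standby sent the reply


def can_release_commit(synchronous_commit: str, commit_lsn: int,
                       ack: StandbyAck) -> bool:
    """Return True once the standby has progressed far enough for the commit."""
    if synchronous_commit in ("off", "local"):
        return True                      # the commit never waits for the standby
    progress = {
        "remote_write": ack.write_lsn,   # ACK after write()
        "on": ack.flush_lsn,             # ACK after flush
        "remote_apply": ack.replay_lsn,  # ACK after replay
    }
    return progress[synchronous_commit] >= commit_lsn
```

With an ACK reporting write 200, flush 150 and replay 100, a commit at LSN 180 would be released under `remote_write` but would keep waiting under `on` and `remote_apply`.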

#### Problem

Problem: naturally, this comes with a latency penalty. A commit is confirmed only after the standby has acknowledged it, so every commit becomes a two-step process. The latency of every committing transaction is at least the round-trip time between primary and standby.

That's why *By Default: Replication is `Asynchronous`*!

#### Multiple Synchronous Replication

You can specify multiple synchronous standbys and make the primary wait until all of them have acknowledged the commit, which of course adds considerable overhead.

Without compromising on high availability, you can list more candidate standbys than are actually synchronous at any one time. If something happens to a synchronous standby, the next candidate takes over; it is basically an HA solution for the standbys.

You can define the takeover order with a priority list. Here the first two connected standbys in the list are synchronous, and `s3` takes over if one of them fails:

`synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'`

Or wait for acknowledgements from any two of the listed standbys (quorum-based):

`synchronous_standby_names = 'ANY 2 (s1, s2, s3)'`

<img src="./helpers/potential-sync-replication.png" alt="drawing" width="900"/>
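
The difference between the two methods can be sketched in a few lines. This is a simplified model, not PostgreSQL's parser: it handles only the `FIRST n (...)` and `ANY n (...)` forms shown above, and the function names are illustrative.

```python
import re
from typing import List, Set, Tuple


def parse_sync_names(value: str) -> Tuple[str, int, List[str]]:
    """Parse 'FIRST n (a, b)' or 'ANY n (a, b)' into (method, n, names)."""
    match = re.fullmatch(r"\s*(FIRST|ANY)\s+(\d+)\s*\((.*)\)\s*", value, re.I)
    if match is None:
        raise ValueError(f"unsupported value: {value!r}")
    method, count, names = match.groups()
    return method.upper(), int(count), [n.strip() for n in names.split(",")]


def commit_confirmed(value: str, connected: List[str], acked: Set[str]) -> bool:
    """Decide whether a commit may be confirmed given the standbys' ACKs."""
    method, count, names = parse_sync_names(value)
    if method == "FIRST":
        # The first `count` connected standbys, in list order, are synchronous.
        sync = [n for n in names if n in connected][:count]
        return len(sync) == count and all(n in acked for n in sync)
    # ANY: a quorum of `count` listed standbys must have acknowledged.
    return len([n for n in names if n in acked]) >= count


connected = ["s1", "s2", "s3"]
print(commit_confirmed("FIRST 2 (s1, s2, s3)", connected, {"s1", "s3"}))  # False
print(commit_confirmed("ANY 2 (s1, s2, s3)", connected, {"s1", "s3"}))    # True
```

With `FIRST 2`, the commit waits for `s2` because it is a higher-priority connected standby than `s3`; with `ANY 2`, two acknowledgements from any listed standbys are enough.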

#### Best Practice

1. Since `synchronous_commit` can be set at a very fine granularity (per transaction and above), the best practice is to make only the business-critical transactions / applications commit synchronously and let the others commit asynchronously.
1. Keep the number of synchronous standbys to a minimum: configure a few synchronous standbys and combine them with asynchronous cascading replication or priority-based synchronous replication.
1. Make sure that the network bandwidth can keep up with the WAL generation rate.
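
The first point can be sketched as a small helper that builds the SQL for one transaction. `SET LOCAL` limits the setting to the current transaction; the helper name and the `payments` table are hypothetical and only serve the illustration.

```python
VALID_LEVELS = {"remote_apply", "on", "remote_write", "local", "off"}


def wrap_transaction(statements, synchronous_commit="on"):
    """Return the SQL for one transaction with its own synchronous_commit."""
    if synchronous_commit not in VALID_LEVELS:
        raise ValueError(f"invalid synchronous_commit: {synchronous_commit!r}")
    return [
        "BEGIN;",
        # SET LOCAL lasts only until the end of the current transaction.
        f"SET LOCAL synchronous_commit TO '{synchronous_commit}';",
        *statements,
        "COMMIT;",
    ]


# Hypothetical business-critical write that waits for the standby to apply it.
critical = wrap_transaction(["INSERT INTO payments VALUES (1, 100);"],
                            "remote_apply")
print("\n".join(critical))
```

A less critical write could use `local` or `off` in the same way, so that only the transactions that need it pay the synchronous latency.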

#### Configuration

`synchronous_commit`

Replication Options:

synchronous_commit setting   | local durable commit    |   standby durable commit after PG crash   |   standby durable commit after OS crash   |   standby query consistency
----------------     | --------------------    |   -------------------------------------   |   -------------------------------------   |   -------------------------
remote_apply                    |   `Yes` |   `Yes` |   `Yes` |   `Yes`
on (flushed to durable storage) |   `Yes` |   `Yes` |   `Yes` |   `No`
remote_write                    |   `Yes` |   `Yes` |   `No`  |   `No`
local                           |   `Yes` |   `No`  |   `No`  |   `No`

### Important Notes

#### Similarity

The primary and standby servers should be as similar as possible in almost every respect (except their location).
- Mount points for tablespaces created on the primary must already exist on the standby
- PostgreSQL version
- Hardware architecture (replicating between 32-bit and 64-bit systems will not work)
- Hardware (strongly advised)