digoal
2024-01-13
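PostgreSQL , PolarDB , DuckDB , online backup , recovery , checkpoint , lsn , startlsn , stoplsn , controlfile , control file , minRecoveryPoint
I have recently written several posts on online backup and recovery of database tablespaces.
"Why does PostgreSQL support tablespace-based online backup and complete recovery?"
"How does PostgreSQL support selective tablespace (Selectivity Tablespace) backup and point-in-time recovery (PITR)?"
"PostgreSQL recovery target introduce" explains the concepts behind the PITR recovery parameters in detail.
"PostgreSQL Selectivity Tablespace PITR - partial tablespace recovery"
This post answers three detailed questions: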
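1. From which WAL LSN does recovery of an online backup start?
pg_start_backup runs a checkpoint. Recovery starts from that checkpoint's start LSN (its REDO pointer), which is written to the backup_label file.
This is the starting position because the full-page writes (FPWs) that recovery needs are found in the WAL from this point onward.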
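As a small illustration of how that start position can be read back, here is a minimal C sketch that parses the START WAL LOCATION line of a backup_label file into a 64-bit LSN. The helper name `parse_start_wal_location` is hypothetical and is not part of PostgreSQL; the line format follows the appendStringInfo call in do_pg_start_backup quoted later in this post, and any LSN values fed to it are made-up samples.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a PostgreSQL function): parse a line of the form
 *   "START WAL LOCATION: %X/%X (file %s)"
 * and store the LSN as a 64-bit value (high 32 bits / low 32 bits).
 * Returns 1 on success, 0 if the line does not match.
 */
int parse_start_wal_location(const char *line, uint64_t *lsn)
{
    unsigned int hi, lo;

    assert(line != NULL && lsn != NULL);

    /* sscanf returns the number of converted fields; both halves are required */
    if (sscanf(line, "START WAL LOCATION: %X/%X", &hi, &lo) != 2)
        return 0;

    *lsn = ((uint64_t) hi << 32) | lo;
    return 1;
}
```

Recovery would begin replaying WAL at the LSN returned here. The sketch ignores the other backup_label fields (checkpoint location, timeline, and so on).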
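2. Up to which WAL LSN must an online backup be recovered, at a minimum, before the database is consistent? (Why would it be inconsistent at all? Disk writes continue during the backup. For example, a block may be copied while it is only half written, so the copy of that block in the backup may be half new and half old. A copy-on-write file system such as ZFS does not have this torn-page problem, which is why FPW can be turned off on it.)
We do not know which data blocks in the backup files might be inconsistent (unless block checksums are used).
We do not know where in the WAL the FPW for a given block is, unless we scan the WAL sequentially from the start point until we reach that block's FPW.
For a block in the copied backup files, the FPW generated by its first modification after the checkpoint may lie as late as the stop point in the WAL, so only by recovering up to that WAL LSN can we guarantee that the database is consistent.
The stop point is the LSN of the marker record that pg_stop_backup writes to the WAL.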
DEMO:
t1 pg_start_backup
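run a checkpoint
write backup_label (the checkpoint's REDO start LSN is written to backup_label)
t2 copy start // online backup: file copy begins
t3 copy stop // online backup: file copy ends
t4 pg_stop_backup
Where is the consistency point? Answer: it is actually t3, but the database only knows about t4.
The database has no way of knowing which WAL LSN corresponds to t3. Only when pg_stop_backup is called is an end-of-backup record (which also carries the corresponding start backup LSN) written to the WAL.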
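The rule that follows from this timeline can be written as a one-line check. The C sketch below is only an illustration under simplifying assumptions: `Lsn` and `reached_consistency` are hypothetical names, and PostgreSQL's real consistency check during recovery involves more state than a single comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical stand-in for PostgreSQL's XLogRecPtr */

/*
 * A restored online backup may be treated as consistent only once WAL
 * replay has reached the backup stop LSN (t4 in the timeline above).
 * The true consistency point (t3) is not recorded anywhere, and t3 <= t4,
 * so replaying up to t4 is always sufficient.
 */
bool reached_consistency(Lsn start_lsn, Lsn stop_lsn, Lsn replayed_upto)
{
    assert(start_lsn <= stop_lsn);   /* replay starts at or before the stop point */
    return replayed_upto >= stop_lsn;
}
```

A recovery target that lies before the stop LSN would therefore have to be rejected, because some copied block might still be waiting for its FPW.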
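3. What is the difference between taking an online backup on the primary and on a standby?
The first difference concerns stopping the backup.
If the online backup is taken on the primary, calling pg_stop_backup writes an end-of-backup record (which also carries the corresponding start backup LSN) to the WAL.
If the online backup is taken on a standby, the end position is ControlFile->minRecoveryPoint. A database in recovery keeps advancing this position; it is the LSN that must at least be reached during recovery before the database is consistent.
The second difference concerns starting the backup.
The primary supports exclusive online backups as well as non-exclusive ones (for example pg_basebackup). A standby supports only non-exclusive backups, as the following check in the source shows: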
/*
* Currently only non-exclusive backup can be taken during recovery.
*/
if (backup_started_in_recovery && exclusive)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is in progress"),
errhint("WAL control functions cannot be executed during recovery.")));
src/backend/access/transam/xlog.c
/*
* do_pg_start_backup
*
* Utility function called at the start of an online backup. It creates the
* necessary starting checkpoint and constructs the backup label file.
*
* There are two kind of backups: exclusive and non-exclusive. An exclusive
* backup is started with pg_start_backup(), and there can be only one active
* at a time. The backup and tablespace map files of an exclusive backup are
* written to $PGDATA/backup_label and $PGDATA/tablespace_map, and they are
* removed by pg_stop_backup().
*
* A non-exclusive backup is used for the streaming base backups (see
* src/backend/replication/basebackup.c). The difference to exclusive backups
* is that the backup label and tablespace map files are not written to disk.
* Instead, their would-be contents are returned in *labelfile and *tblspcmapfile,
* and the caller is responsible for including them in the backup archive as
* 'backup_label' and 'tablespace_map'. There can be many non-exclusive backups
* active at the same time, and they don't conflict with an exclusive backup
* either.
*
* labelfile and tblspcmapfile must be passed as NULL when starting an
* exclusive backup, and as initially-empty StringInfos for a non-exclusive
* backup.
*
* If "tablespaces" isn't NULL, it receives a list of tablespaceinfo structs
* describing the cluster's tablespaces.
*
* tblspcmapfile is required mainly for tar format in windows as native windows
* utilities are not able to create symlinks while extracting files from tar.
* However for consistency, the same is used for all platforms.
*
* Returns the minimum WAL location that must be present to restore from this
* backup, and the corresponding timeline ID in *starttli_p.
*
* Every successfully started non-exclusive backup must be stopped by calling
* do_pg_stop_backup() or do_pg_abort_backup().
*
* It is the responsibility of the caller of this function to verify the
* permissions of the calling user!
*/
XLogRecPtr
do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
StringInfo labelfile, List **tablespaces,
StringInfo tblspcmapfile)
{
...
/*
* Now we need to fetch the checkpoint record location, and also
* its REDO pointer. The oldest point in WAL that would be needed
* to restore starting from the checkpoint is precisely the REDO
* pointer.
*/
LWLockAcquire(ControlFileLock, LW_SHARED);
checkpointloc = ControlFile->checkPoint;
startpoint = ControlFile->checkPointCopy.redo;
starttli = ControlFile->checkPointCopy.ThisTimeLineID;
checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
LWLockRelease(ControlFileLock);
if (backup_started_in_recovery)
{
XLogRecPtr recptr;
/*
* Check to see if all WAL replayed during online backup
* (i.e., since last restartpoint used as backup starting
* checkpoint) contain full-page writes.
*/
SpinLockAcquire(&XLogCtl->info_lck);
recptr = XLogCtl->lastFpwDisableRecPtr;
SpinLockRelease(&XLogCtl->info_lck);
if (!checkpointfpw || startpoint <= recptr)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("WAL generated with full_page_writes=off was replayed "
"since last restartpoint"),
errhint("This means that the backup being taken on the standby "
"is corrupt and should not be used. "
"Enable full_page_writes and run CHECKPOINT on the primary, "
"and then try an online backup again.")));
...
/* Use the log timezone here, not the session timezone */
stamp_time = (pg_time_t) time(NULL);
pg_strftime(strfbuf, sizeof(strfbuf),
"%Y-%m-%d %H:%M:%S %Z",
pg_localtime(&stamp_time, log_timezone));
appendStringInfo(labelfile, "START WAL LOCATION: %X/%X (file %s)\n",
LSN_FORMAT_ARGS(startpoint), xlogfilename);
appendStringInfo(labelfile, "CHECKPOINT LOCATION: %X/%X\n",
LSN_FORMAT_ARGS(checkpointloc));
appendStringInfo(labelfile, "BACKUP METHOD: %s\n",
exclusive ? "pg_start_backup" : "streamed");
appendStringInfo(labelfile, "BACKUP FROM: %s\n",
backup_started_in_recovery ? "standby" : "primary");
appendStringInfo(labelfile, "START TIME: %s\n", strfbuf);
appendStringInfo(labelfile, "LABEL: %s\n", backupidstr);
appendStringInfo(labelfile, "START TIMELINE: %u\n", starttli);
/*
* do_pg_stop_backup
*
* Utility function called at the end of an online backup. It cleans up the
* backup state and can optionally wait for WAL segments to be archived.
*
* If labelfile is NULL, this stops an exclusive backup. Otherwise this stops
* the non-exclusive backup specified by 'labelfile'.
*
* Returns the last WAL location that must be present to restore from this
* backup, and the corresponding timeline ID in *stoptli_p.
*
* It is the responsibility of the caller of this function to verify the
* permissions of the calling user!
*/
XLogRecPtr
do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
{
...
/*
* During recovery, we don't write an end-of-backup record. We assume that
* pg_control was backed up last and its minimum recovery point can be
* available as the backup end location. Since we don't have an
* end-of-backup record, we use the pg_control value to check whether
* we've reached the end of backup when starting recovery from this
* backup. We have no way of checking if pg_control wasn't backed up last
* however.
*
* We don't force a switch to new WAL file but it is still possible to
* wait for all the required files to be archived if waitforarchive is
* true. This is okay if we use the backup to start a standby and fetch
* the missing WAL using streaming replication. But in the case of an
* archive recovery, a user should set waitforarchive to true and wait for
* them to be archived to ensure that all the required files are
* available.
*
* We return the current minimum recovery point as the backup end
* location. Note that it can be greater than the exact backup end
* location if the minimum recovery point is updated after the backup of
* pg_control. This is harmless for current uses.
*
* XXX currently a backup history file is for informational and debug
* purposes only. It's not essential for an online backup. Furthermore,
* even if it's created, it will not be archived during recovery because
* an archiver is not invoked. So it doesn't seem worthwhile to write a
* backup history file during recovery.
*/
if (backup_started_in_recovery)
{
stoppoint = ControlFile->minRecoveryPoint;
...
else
{
/*
* Write the backup-end xlog record
*/
XLogBeginInsert();
XLogRegisterData((char *) (&startpoint), sizeof(startpoint));
stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END);
...
/*
* Advance minRecoveryPoint in control file.
*
* If we crash during recovery, we must reach this point again before the
* database is consistent.
*
* If 'force' is true, 'lsn' argument is ignored. Otherwise, minRecoveryPoint
* is only updated if it's not already greater than or equal to 'lsn'.
*/
...
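What is the minimum WAL needed for recovery?
- t0 write backup_label (the real time may be slightly earlier than t0; it is the moment at which the startpoint LSN recorded in backup_label was generated)
- t1 write blockX fpw to WAL (the first modification of a block after a checkpoint must write an FPW)
- t2 write blockX to disk (before the first modification is written to disk, the database guarantees that the FPW produced by that modification has already been written to the WAL; otherwise the block is not written out)
- t2 copy blockX for backup (assume this happens at the same moment, so a partial write is copied)
- t3 write backup stoppoint to WAL
The WAL between t0 and t3 is the minimum required to guarantee that the database can be recovered to a consistent state.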
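To make the t0 ~ t3 range concrete, the C sketch below maps a start LSN and a stop LSN to the first and last WAL segment files that must be kept with the backup. It assumes the default 16MB WAL segment size; the function names are hypothetical. The arithmetic mirrors how PostgreSQL names WAL files (timeline, then segment number split into two 8-digit hex fields), and the stop side uses the segment containing the byte just before the stop LSN, which to my understanding matches what pg_stop_backup reports.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define WAL_SEG_SIZE (16u * 1024 * 1024)   /* assumption: default 16MB segments */

/* Segment that contains the backup start LSN (t0). */
uint64_t first_required_segment(uint64_t start_lsn)
{
    return start_lsn / WAL_SEG_SIZE;
}

/* Segment that contains the last byte before the backup stop LSN (t3). */
uint64_t last_required_segment(uint64_t stop_lsn)
{
    assert(stop_lsn > 0);
    return (stop_lsn - 1) / WAL_SEG_SIZE;
}

/* Format the 24-hex-digit WAL file name for a timeline and segment number. */
void wal_file_name(char *buf, size_t len, uint32_t tli, uint64_t segno)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / WAL_SEG_SIZE;

    snprintf(buf, len, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / segs_per_id),
             (unsigned int) (segno % segs_per_id));
}
```

Every segment from first_required_segment(start) through last_required_segment(stop), on the backup's timeline, has to be available for the restore to reach consistency.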