Three Details of PostgreSQL Online Backup and Recovery, and the Principles Behind Them

Author

digoal

Date

2024-01-13

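Background

I recently wrote several articles on online tablespace backup and recovery.

"Why does PostgreSQL support tablespace-based online backup and complete recovery?"

"How does PostgreSQL support selective tablespace (Selectivity Tablespace) backup and point-in-time recovery (PITR)?"

"PostgreSQL recovery target introduce" explains the concepts behind the PITR recovery parameters in detail.

"PostgreSQL Selectivity Tablespace PITR - partial tablespace recovery"

This article answers three detailed questions: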

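1. From which WAL LSN does recovery of an online backup start?

pg_start_backup performs a checkpoint, and the REDO start LSN of that checkpoint (the startpoint) is written to the backup_label file as START WAL LOCATION. Recovery starts from that position.

The reason is that every FPW (full-page write) the backup needs is found at or after this position.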
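As a minimal sketch of reading that start position back, the helper below parses the first line of a backup_label file, using the "START WAL LOCATION: %X/%X" format that appears in the reference code later in this article. The function name parse_start_wal_location and the Lsn typedef are assumptions made for illustration; they are not PostgreSQL APIs.

```c
#include <stdint.h>
#include <stdio.h>

/* An LSN is a 64-bit WAL byte position, printed as two 32-bit hex halves. */
typedef uint64_t Lsn;

/*
 * Hypothetical helper (not a PostgreSQL function): extract the start LSN
 * from the "START WAL LOCATION" line of a backup_label file.
 * Returns 1 on success, 0 if the line does not match.
 */
int parse_start_wal_location(const char *line, Lsn *start_lsn)
{
    unsigned int hi, lo;

    if (sscanf(line, "START WAL LOCATION: %X/%X", &hi, &lo) != 2)
        return 0;

    /* Combine the high and low 32-bit halves into one 64-bit position. */
    *start_lsn = ((Lsn) hi << 32) | lo;
    return 1;
}
```

The resulting 64-bit value can be compared directly with other LSNs, which is what the later steps of recovery rely on.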

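2. What is the minimum WAL LSN to which an online backup must be recovered before the database is consistent? (Why would it be inconsistent at all? Disk writes continue while the backup runs. For example, a block may be copied when it is only half written, so the copy of that block in the backup may be half new and half old. A copy-on-write file system such as ZFS does not have this problem, which is why FPW can be turned off there.)

We do not know which data blocks in the backup files might be inconsistent (unless block checksums are used).

We also do not know where the FPW for a given block is, short of scanning the WAL sequentially from the startpoint until that block's FPW turns up.

For a block in the copied backup files, the FPW produced by its first modification after the checkpoint may sit as late as the stoppoint in the WAL, so only recovering to that WAL LSN guarantees that the database is consistent.

The stoppoint is the LSN of the marker that pg_stop_backup writes to the WAL.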

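DEMO:

t1 pg_start_backup  
   run checkpoint   
   write backup_label  (the checkpoint's REDO start LSN is written into backup_label)    
  
t2 copy start // online backup: file copy begins  
  
t3 copy stop // online backup: file copy ends   
  
t4 pg_stop_backup

Where is the consistency point? Answer: it is actually t3, but the database only knows about t4.

The database has no way of knowing which WAL LSN corresponds to t3. Only when pg_stop_backup is called is an end marker (together with the corresponding start backup LSN) written to the WAL.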
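The idea that replay must reach the end marker that matches this backup can be sketched as follows. This is a simplified model under stated assumptions: the WalRecord struct and find_consistency_point are hypothetical and do not reflect PostgreSQL's real WAL record layout or replay code. The only detail taken from the source is that the end marker carries the backup's start LSN as its payload.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

/* Simplified, hypothetical model of a WAL record; not PostgreSQL's layout. */
typedef struct {
    Lsn end_lsn;          /* position just after this record */
    int is_backup_end;    /* nonzero for a backup-end marker */
    Lsn backup_start_lsn; /* marker payload: the start LSN of its backup */
} WalRecord;

/*
 * Walk the records in replay order and return the LSN at which a backup
 * that began at start_lsn becomes consistent, or 0 if its end marker
 * never appears in the given records.
 */
Lsn find_consistency_point(const WalRecord *recs, size_t n, Lsn start_lsn)
{
    for (size_t i = 0; i < n; i++) {
        /* Markers that belong to other backups are skipped. */
        if (recs[i].is_backup_end && recs[i].backup_start_lsn == start_lsn)
            return recs[i].end_lsn;
    }
    return 0;
}
```

Matching on the start LSN matters because several non-exclusive backups can run at the same time, each producing its own end marker.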

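3. How does an online backup taken on the primary differ from one taken on a standby?

The first difference concerns stop backup.

When the backup is taken on the primary, calling pg_stop_backup writes an end marker (together with the corresponding start backup LSN) to the WAL.

When the backup is taken on a standby, the end position is ControlFile->minRecoveryPoint. A database in recovery keeps advancing this position, and it is the minimum LSN that recovery must reach before the database is consistent.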
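A minimal sketch of that "only moves forward" behaviour, and of a standby backup reading it as its stop position, is shown below. The ControlState struct and both function names are assumptions made for illustration. The sketch covers only the non-forced update described in the reference code comment and leaves out locking and the 'force' case.

```c
#include <stdint.h>

typedef uint64_t Lsn;

/* Hypothetical stand-in for the relevant part of the control file. */
typedef struct {
    Lsn min_recovery_point;
} ControlState;

/*
 * Advance the minimum recovery point, but never move it backwards:
 * it is updated only if it is not already greater than or equal to lsn.
 */
void advance_min_recovery_point(ControlState *cs, Lsn lsn)
{
    if (cs->min_recovery_point < lsn)
        cs->min_recovery_point = lsn;
}

/* On a standby, the backup stop position is the current minimum recovery point. */
Lsn standby_backup_stop_point(const ControlState *cs)
{
    return cs->min_recovery_point;
}
```

Because the value keeps advancing while the backup runs, the stop position read at the end can be later than the exact end of the copy, which the source comment describes as harmless.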

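The second difference concerns start backup.

The primary supports exclusive online backups as well as non-exclusive ones (for example pg_basebackup). A standby supports only non-exclusive backups, as the following check in the source shows: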

        /*  
         * Currently only non-exclusive backup can be taken during recovery.  
         */  
        if (backup_started_in_recovery && exclusive)  
                ereport(ERROR,  
                                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),  
                                 errmsg("recovery is in progress"),  
                                 errhint("WAL control functions cannot be executed during recovery.")));

Reference code

src/backend/access/transam/xlog.c

/*  
 * do_pg_start_backup  
 *  
 * Utility function called at the start of an online backup. It creates the  
 * necessary starting checkpoint and constructs the backup label file.  
 *  
 * There are two kind of backups: exclusive and non-exclusive. An exclusive  
 * backup is started with pg_start_backup(), and there can be only one active  
 * at a time. The backup and tablespace map files of an exclusive backup are  
 * written to $PGDATA/backup_label and $PGDATA/tablespace_map, and they are  
 * removed by pg_stop_backup().  
 *  
 * A non-exclusive backup is used for the streaming base backups (see  
 * src/backend/replication/basebackup.c). The difference to exclusive backups  
 * is that the backup label and tablespace map files are not written to disk.  
 * Instead, their would-be contents are returned in *labelfile and *tblspcmapfile,  
 * and the caller is responsible for including them in the backup archive as  
 * 'backup_label' and 'tablespace_map'. There can be many non-exclusive backups  
 * active at the same time, and they don't conflict with an exclusive backup  
 * either.  
 *  
 * labelfile and tblspcmapfile must be passed as NULL when starting an  
 * exclusive backup, and as initially-empty StringInfos for a non-exclusive  
 * backup.  
 *  
 * If "tablespaces" isn't NULL, it receives a list of tablespaceinfo structs  
 * describing the cluster's tablespaces.  
 *  
 * tblspcmapfile is required mainly for tar format in windows as native windows  
 * utilities are not able to create symlinks while extracting files from tar.  
 * However for consistency, the same is used for all platforms.  
 *  
 * Returns the minimum WAL location that must be present to restore from this  
 * backup, and the corresponding timeline ID in *starttli_p.  
 *  
 * Every successfully started non-exclusive backup must be stopped by calling  
 * do_pg_stop_backup() or do_pg_abort_backup().  
 *  
 * It is the responsibility of the caller of this function to verify the  
 * permissions of the calling user!  
 */  
XLogRecPtr  
do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,  
                                   StringInfo labelfile, List **tablespaces,  
                                   StringInfo tblspcmapfile)  
{  
...  
  
                        /*  
                         * Now we need to fetch the checkpoint record location, and also  
                         * its REDO pointer.  The oldest point in WAL that would be needed  
                         * to restore starting from the checkpoint is precisely the REDO  
                         * pointer.  
                         */  
                        LWLockAcquire(ControlFileLock, LW_SHARED);  
                        checkpointloc = ControlFile->checkPoint;  
                        startpoint = ControlFile->checkPointCopy.redo;  
                        starttli = ControlFile->checkPointCopy.ThisTimeLineID;  
                        checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;  
                        LWLockRelease(ControlFileLock);  
  
                        if (backup_started_in_recovery)  
                        {  
                                XLogRecPtr      recptr;  
  
                                /*  
                                 * Check to see if all WAL replayed during online backup  
                                 * (i.e., since last restartpoint used as backup starting  
                                 * checkpoint) contain full-page writes.  
                                 */  
                                SpinLockAcquire(&XLogCtl->info_lck);  
                                recptr = XLogCtl->lastFpwDisableRecPtr;  
                                SpinLockRelease(&XLogCtl->info_lck);  
  
                                if (!checkpointfpw || startpoint <= recptr)  
                                        ereport(ERROR,  
                                                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),  
                                                         errmsg("WAL generated with full_page_writes=off was replayed "  
                                                                        "since last restartpoint"),  
                                                         errhint("This means that the backup being taken on the standby "  
                                                                         "is corrupt and should not be used. "  
                                                                         "Enable full_page_writes and run CHECKPOINT on the primary, "  
                                                                         "and then try an online backup again.")));  
  
...  
  
  
                /* Use the log timezone here, not the session timezone */  
                stamp_time = (pg_time_t) time(NULL);  
                pg_strftime(strfbuf, sizeof(strfbuf),  
                                        "%Y-%m-%d %H:%M:%S %Z",  
                                        pg_localtime(&stamp_time, log_timezone));  
                appendStringInfo(labelfile, "START WAL LOCATION: %X/%X (file %s)\n",  
                                                 LSN_FORMAT_ARGS(startpoint), xlogfilename);  
                appendStringInfo(labelfile, "CHECKPOINT LOCATION: %X/%X\n",  
                                                 LSN_FORMAT_ARGS(checkpointloc));  
                appendStringInfo(labelfile, "BACKUP METHOD: %s\n",  
                                                 exclusive ? "pg_start_backup" : "streamed");  
                appendStringInfo(labelfile, "BACKUP FROM: %s\n",  
                                                 backup_started_in_recovery ? "standby" : "primary");  
                appendStringInfo(labelfile, "START TIME: %s\n", strfbuf);  
                appendStringInfo(labelfile, "LABEL: %s\n", backupidstr);  
                appendStringInfo(labelfile, "START TIMELINE: %u\n", starttli);  
  
  
  
/*  
 * do_pg_stop_backup  
 *  
 * Utility function called at the end of an online backup. It cleans up the  
 * backup state and can optionally wait for WAL segments to be archived.  
 *  
 * If labelfile is NULL, this stops an exclusive backup. Otherwise this stops  
 * the non-exclusive backup specified by 'labelfile'.  
 *  
 * Returns the last WAL location that must be present to restore from this  
 * backup, and the corresponding timeline ID in *stoptli_p.  
 *  
 * It is the responsibility of the caller of this function to verify the  
 * permissions of the calling user!  
 */  
XLogRecPtr  
do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)  
{  
  
  
...  
  
        /*  
         * During recovery, we don't write an end-of-backup record. We assume that  
         * pg_control was backed up last and its minimum recovery point can be  
         * available as the backup end location. Since we don't have an  
         * end-of-backup record, we use the pg_control value to check whether  
         * we've reached the end of backup when starting recovery from this  
         * backup. We have no way of checking if pg_control wasn't backed up last  
         * however.  
         *  
         * We don't force a switch to new WAL file but it is still possible to  
         * wait for all the required files to be archived if waitforarchive is  
         * true. This is okay if we use the backup to start a standby and fetch  
         * the missing WAL using streaming replication. But in the case of an  
         * archive recovery, a user should set waitforarchive to true and wait for  
         * them to be archived to ensure that all the required files are  
         * available.  
         *  
         * We return the current minimum recovery point as the backup end  
         * location. Note that it can be greater than the exact backup end  
         * location if the minimum recovery point is updated after the backup of  
         * pg_control. This is harmless for current uses.  
         *  
         * XXX currently a backup history file is for informational and debug  
         * purposes only. It's not essential for an online backup. Furthermore,  
         * even if it's created, it will not be archived during recovery because  
         * an archiver is not invoked. So it doesn't seem worthwhile to write a  
         * backup history file during recovery.  
         */  
        if (backup_started_in_recovery)  
        {  
  
                stoppoint = ControlFile->minRecoveryPoint;  
  
...   
        else  
        {  
                /*  
                 * Write the backup-end xlog record  
                 */  
                XLogBeginInsert();  
                XLogRegisterData((char *) (&startpoint), sizeof(startpoint));  
                stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END);   
...

/*  
 * Advance minRecoveryPoint in control file.  
 *  
 * If we crash during recovery, we must reach this point again before the  
 * database is consistent.  
 *  
 * If 'force' is true, 'lsn' argument is ignored. Otherwise, minRecoveryPoint  
 * is only updated if it's not already greater than or equal to 'lsn'.  
 */  
  
 ...

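What is the minimum WAL needed for recovery?

t0 write backup_label (the actual time may be slightly earlier than t0; this is the moment the startpoint LSN recorded in backup_label was created)
t1 write blockX fpw to WAL (after a checkpoint, the first modification of a block must write an FPW)
t2 write blockX to disk (before the first modification is written to disk, the database makes sure the FPW it produced has already been written to WAL; otherwise the block is not written to disk)
t2 copy blockX for backup (assume this happens at the same moment, producing a partial write)
t3 write backup stoppoint to WAL

The WAL between t0 and t3 is the minimum required to guarantee that the database can be recovered to a consistent state.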
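To turn that LSN range into the WAL segment files that must be kept, a sketch along the following lines may help. It assumes the default 16 MB WAL segment size and that the whole range lies on a single timeline. The helper names are hypothetical; the file name layout (timeline, then high and low parts of the segment number, each as 8 hex digits) follows the usual PostgreSQL WAL file naming.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;

/* Assumption: default WAL segment size of 16 MB. */
#define WAL_SEGMENT_SIZE UINT64_C(16777216)

/* Segment that contains the first byte needed for recovery. */
uint64_t first_needed_segment(Lsn start_lsn)
{
    return start_lsn / WAL_SEGMENT_SIZE;
}

/*
 * Segment that contains the last byte needed. stop_lsn points just past
 * the end marker, so the byte before it decides the segment.
 */
uint64_t last_needed_segment(Lsn stop_lsn)
{
    return (stop_lsn - 1) / WAL_SEGMENT_SIZE;
}

/* Write the 24-character WAL file name for segno on timeline tli into buf. */
void wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / WAL_SEGMENT_SIZE;

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / segs_per_id),
             (unsigned int) (segno % segs_per_id));
}
```

Every segment from first_needed_segment(start) through last_needed_segment(stop) has to be available, in the archive or in pg_wal, for the backup to be restorable.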
