3 Details of PostgreSQL Online Backup & Recovery, and the Principles Behind Them

Author

digoal

Date

2024-01-13

Tags

PostgreSQL , PolarDB , DuckDB , online backup , recovery , checkpoint , lsn , startlsn , stoplsn , controlfile , control file , minRecoveryPoint


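Background

I recently wrote several posts on online backup and recovery of database tablespaces.

"Why does PostgreSQL support tablespace-based online backup and complete recovery?"

"How does PostgreSQL support selective tablespace (Selectivity Tablespace) backup and point-in-time recovery (PITR)?"

"PostgreSQL recovery target introduce" explains the concepts behind the PITR recovery parameters in detail.

"PostgreSQL Selectivity Tablespace PITR - partial tablespace recovery"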

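This post answers 3 detailed questions:

1. From which WAL LSN does recovery of an online backup start?

pg_start_backup performs a checkpoint. Recovery starts from that checkpoint's REDO start LSN, which is written to the backup_label file (the START WAL LOCATION line).

The reason is that the full-page writes (FPW) needed for recovery begin at this position.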
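To make the start position concrete, here is a minimal sketch that reads the START WAL LOCATION line of a backup_label file and turns it into a 64-bit LSN. The line format follows the appendStringInfo call quoted in the reference code further down; the helper name parse_start_lsn and the sample values in the usage note are assumptions for illustration, not part of PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse a "START WAL LOCATION: %X/%X (file ...)" line from backup_label
 * into a 64-bit LSN (high 32 bits before the slash, low 32 bits after it).
 * Returns 1 on success, 0 if the line has a different shape.
 * parse_start_lsn is a hypothetical helper, not a PostgreSQL function.
 */
int
parse_start_lsn(const char *line, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;

    assert(line != NULL && lsn != NULL);

    if (sscanf(line, "START WAL LOCATION: %X/%X", &hi, &lo) != 2)
        return 0;

    *lsn = ((uint64_t) hi << 32) | lo;
    return 1;
}
```

For example, passing the (made-up) line `START WAL LOCATION: 0/3000028 (file 000000010000000000000003)` yields the LSN 0x3000028, which is where WAL replay for this backup would have to begin.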

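2. Up to which WAL LSN must an online backup be recovered, at minimum, before the database is consistent? (Why would it be inconsistent at all? Disk writes continue while the backup runs. For example, a block may be copied while it is only half written, so that block in the backup may be half new and half old. A copy-on-write file system such as ZFS does not have this problem, which is why FPW can be turned off there.)

We do not know which data blocks in the backup files may be inconsistent (unless block checksums are used).

We do not know where in the WAL the FPW for a given block is, other than by scanning the WAL sequentially from startpoint until the FPW for that block is found.

For a block in the copied backup files, the FPW produced by its first modification after the checkpoint may lie as late as the stoppoint position in the WAL, so only by recovering to that WAL LSN can we guarantee that the database is consistent.

stoppoint is the LSN of the marker that pg_stop_backup writes to the WAL.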
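The torn-block scenario and its repair can be sketched as a small simulation. This is only an illustration under simplifying assumptions: a block is a flat byte array, the copy races with the write exactly at the halfway point, and the function names and the DEMO_BLCKSZ constant are hypothetical, not PostgreSQL code.

```c
#include <assert.h>
#include <string.h>

#define DEMO_BLCKSZ 8192    /* assumed block size (PostgreSQL's default) */

/*
 * Simulate a backup copy that races with a block write: the first half of
 * the copied block is the new version, the second half is still the old one.
 */
void
copy_torn_page(unsigned char *backup, const unsigned char *old_page,
               const unsigned char *new_page)
{
    assert(backup != NULL && old_page != NULL && new_page != NULL);
    memcpy(backup, new_page, DEMO_BLCKSZ / 2);
    memcpy(backup + DEMO_BLCKSZ / 2, old_page + DEMO_BLCKSZ / 2,
           DEMO_BLCKSZ / 2);
}

/*
 * Replaying a full-page image overwrites the whole block, so whatever
 * mixture of old and new bytes the backup held no longer matters.
 */
void
replay_full_page_image(unsigned char *backup, const unsigned char *fpw_image)
{
    assert(backup != NULL && fpw_image != NULL);
    memcpy(backup, fpw_image, DEMO_BLCKSZ);
}

/* Return 1 if the two blocks are byte-for-byte identical, else 0. */
int
pages_equal(const unsigned char *a, const unsigned char *b)
{
    return memcmp(a, b, DEMO_BLCKSZ) == 0;
}
```

After copy_torn_page the backup block matches neither the old nor the new version; after replay_full_page_image it matches the image stored in the WAL. Since recovery cannot tell in advance which blocks are torn, it has to replay far enough to be sure every such image has been applied.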

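DEMO:

t1 pg_start_backup  
   run checkpoint   
   write backup_label  (the checkpoint's REDO start LSN is written to backup_label)    
  
t2 copy start // online backup: copy begins  
  
t3 copy stop // online backup: copy ends   
  
t4 pg_stop_backup  

Where is the consistency point? Answer: it is actually t3, but the database only knows about t4.

The database has no way of knowing which WAL LSN corresponds to t3. Only when pg_stop_backup is called is an end marker (carrying the corresponding start backup LSN) written to the WAL.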

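3. What is the difference between taking an online backup on the primary and on a standby?

The first difference concerns stop backup.

If the online backup is taken on the primary, calling pg_stop_backup writes an end marker (carrying the corresponding start backup LSN) to the WAL.

If the online backup is taken on a standby, the end position is ControlFile->minRecoveryPoint. A database in recovery keeps advancing this position; it is the LSN that must at least be reached before the database is consistent.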

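The second difference concerns start backup.

The primary supports exclusive online backups as well as non-exclusive ones (for example pg_basebackup). A standby supports only non-exclusive backups, as the following check shows. (The code quoted here predates PostgreSQL 15, which removed exclusive backup mode.)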

        /*  
         * Currently only non-exclusive backup can be taken during recovery.  
         */  
        if (backup_started_in_recovery && exclusive)  
                ereport(ERROR,  
                                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),  
                                 errmsg("recovery is in progress"),  
                                 errhint("WAL control functions cannot be executed during recovery.")));  

Reference code

src/backend/access/transam/xlog.c

/*  
 * do_pg_start_backup  
 *  
 * Utility function called at the start of an online backup. It creates the  
 * necessary starting checkpoint and constructs the backup label file.  
 *  
 * There are two kind of backups: exclusive and non-exclusive. An exclusive  
 * backup is started with pg_start_backup(), and there can be only one active  
 * at a time. The backup and tablespace map files of an exclusive backup are  
 * written to $PGDATA/backup_label and $PGDATA/tablespace_map, and they are  
 * removed by pg_stop_backup().  
 *  
 * A non-exclusive backup is used for the streaming base backups (see  
 * src/backend/replication/basebackup.c). The difference to exclusive backups  
 * is that the backup label and tablespace map files are not written to disk.  
 * Instead, their would-be contents are returned in *labelfile and *tblspcmapfile,  
 * and the caller is responsible for including them in the backup archive as  
 * 'backup_label' and 'tablespace_map'. There can be many non-exclusive backups  
 * active at the same time, and they don't conflict with an exclusive backup  
 * either.  
 *  
 * labelfile and tblspcmapfile must be passed as NULL when starting an  
 * exclusive backup, and as initially-empty StringInfos for a non-exclusive  
 * backup.  
 *  
 * If "tablespaces" isn't NULL, it receives a list of tablespaceinfo structs  
 * describing the cluster's tablespaces.  
 *  
 * tblspcmapfile is required mainly for tar format in windows as native windows  
 * utilities are not able to create symlinks while extracting files from tar.  
 * However for consistency, the same is used for all platforms.  
 *  
 * Returns the minimum WAL location that must be present to restore from this  
 * backup, and the corresponding timeline ID in *starttli_p.  
 *  
 * Every successfully started non-exclusive backup must be stopped by calling  
 * do_pg_stop_backup() or do_pg_abort_backup().  
 *  
 * It is the responsibility of the caller of this function to verify the  
 * permissions of the calling user!  
 */  
XLogRecPtr  
do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,  
                                   StringInfo labelfile, List **tablespaces,  
                                   StringInfo tblspcmapfile)  
{  
...  
  
                        /*  
                         * Now we need to fetch the checkpoint record location, and also  
                         * its REDO pointer.  The oldest point in WAL that would be needed  
                         * to restore starting from the checkpoint is precisely the REDO  
                         * pointer.  
                         */  
                        LWLockAcquire(ControlFileLock, LW_SHARED);  
                        checkpointloc = ControlFile->checkPoint;  
                        startpoint = ControlFile->checkPointCopy.redo;  
                        starttli = ControlFile->checkPointCopy.ThisTimeLineID;  
                        checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;  
                        LWLockRelease(ControlFileLock);  
  
                        if (backup_started_in_recovery)  
                        {  
                                XLogRecPtr      recptr;  
  
                                /*  
                                 * Check to see if all WAL replayed during online backup  
                                 * (i.e., since last restartpoint used as backup starting  
                                 * checkpoint) contain full-page writes.  
                                 */  
                                SpinLockAcquire(&XLogCtl->info_lck);  
                                recptr = XLogCtl->lastFpwDisableRecPtr;  
                                SpinLockRelease(&XLogCtl->info_lck);  
  
                                if (!checkpointfpw || startpoint <= recptr)  
                                        ereport(ERROR,  
                                                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),  
                                                         errmsg("WAL generated with full_page_writes=off was replayed "  
                                                                        "since last restartpoint"),  
                                                         errhint("This means that the backup being taken on the standby "  
                                                                         "is corrupt and should not be used. "  
                                                                         "Enable full_page_writes and run CHECKPOINT on the primary, "  
                                                                         "and then try an online backup again.")));  
  
...  
  
  
                /* Use the log timezone here, not the session timezone */  
                stamp_time = (pg_time_t) time(NULL);  
                pg_strftime(strfbuf, sizeof(strfbuf),  
                                        "%Y-%m-%d %H:%M:%S %Z",  
                                        pg_localtime(&stamp_time, log_timezone));  
                appendStringInfo(labelfile, "START WAL LOCATION: %X/%X (file %s)\n",  
                                                 LSN_FORMAT_ARGS(startpoint), xlogfilename);  
                appendStringInfo(labelfile, "CHECKPOINT LOCATION: %X/%X\n",  
                                                 LSN_FORMAT_ARGS(checkpointloc));  
                appendStringInfo(labelfile, "BACKUP METHOD: %s\n",  
                                                 exclusive ? "pg_start_backup" : "streamed");  
                appendStringInfo(labelfile, "BACKUP FROM: %s\n",  
                                                 backup_started_in_recovery ? "standby" : "primary");  
                appendStringInfo(labelfile, "START TIME: %s\n", strfbuf);  
                appendStringInfo(labelfile, "LABEL: %s\n", backupidstr);  
                appendStringInfo(labelfile, "START TIMELINE: %u\n", starttli);  
  
  
  
/*  
 * do_pg_stop_backup  
 *  
 * Utility function called at the end of an online backup. It cleans up the  
 * backup state and can optionally wait for WAL segments to be archived.  
 *  
 * If labelfile is NULL, this stops an exclusive backup. Otherwise this stops  
 * the non-exclusive backup specified by 'labelfile'.  
 *  
 * Returns the last WAL location that must be present to restore from this  
 * backup, and the corresponding timeline ID in *stoptli_p.  
 *  
 * It is the responsibility of the caller of this function to verify the  
 * permissions of the calling user!  
 */  
XLogRecPtr  
do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)  
{  
  
  
...  
  
        /*  
         * During recovery, we don't write an end-of-backup record. We assume that  
         * pg_control was backed up last and its minimum recovery point can be  
         * available as the backup end location. Since we don't have an  
         * end-of-backup record, we use the pg_control value to check whether  
         * we've reached the end of backup when starting recovery from this  
         * backup. We have no way of checking if pg_control wasn't backed up last  
         * however.  
         *  
         * We don't force a switch to new WAL file but it is still possible to  
         * wait for all the required files to be archived if waitforarchive is  
         * true. This is okay if we use the backup to start a standby and fetch  
         * the missing WAL using streaming replication. But in the case of an  
         * archive recovery, a user should set waitforarchive to true and wait for  
         * them to be archived to ensure that all the required files are  
         * available.  
         *  
         * We return the current minimum recovery point as the backup end  
         * location. Note that it can be greater than the exact backup end  
         * location if the minimum recovery point is updated after the backup of  
         * pg_control. This is harmless for current uses.  
         *  
         * XXX currently a backup history file is for informational and debug  
         * purposes only. It's not essential for an online backup. Furthermore,  
         * even if it's created, it will not be archived during recovery because  
         * an archiver is not invoked. So it doesn't seem worthwhile to write a  
         * backup history file during recovery.  
         */  
        if (backup_started_in_recovery)  
        {  
  
                stoppoint = ControlFile->minRecoveryPoint;  
  
...   
        else  
        {  
                /*  
                 * Write the backup-end xlog record  
                 */  
                XLogBeginInsert();  
                XLogRegisterData((char *) (&startpoint), sizeof(startpoint));  
                stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END);   
...    
/*  
 * Advance minRecoveryPoint in control file.  
 *  
 * If we crash during recovery, we must reach this point again before the  
 * database is consistent.  
 *  
 * If 'force' is true, 'lsn' argument is ignored. Otherwise, minRecoveryPoint  
 * is only updated if it's not already greater than or equal to 'lsn'.  
 */  
  
 ...  

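What is the minimum WAL required for recovery?

  • t0 write backup_label (the actual time may be slightly earlier than t0; it refers to the moment the startpoint LSN recorded in backup_label was created)
  • t1 write blockX fpw to WAL (after a checkpoint, the first modification of a block must write an FPW)
  • t2 write blockX to disk (before the first modification is written to disk, the database guarantees that the FPW produced by that modification has already been written to the WAL; otherwise the block is not written to disk)
  • t2 copy blockX for backup (assume this happens at the same moment, causing a partial write)
  • t3 write backup stoppoint to WAL

The WAL between t0 and t3 is the minimum required to guarantee that the database can be recovered to a consistent state.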
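As a rough illustration of what "the WAL between t0 and t3" means in terms of files, the sketch below maps a start and stop LSN to WAL segment numbers and file names. It assumes the default 16 MB segment size and the usual 24-hex-digit naming (timeline, then the segment number split into two 32-bit parts); it also assumes the stop LSN points just past the backup-end record, so the last needed segment is the one holding the byte before it. All function names and the DEMO_WAL_SEGSZ constant are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_WAL_SEGSZ (16u * 1024 * 1024)  /* assumed default: 16 MB */

/* Segment that contains the byte at 'lsn' (used for the start LSN). */
uint64_t
lsn_to_segno(uint64_t lsn)
{
    return lsn / DEMO_WAL_SEGSZ;
}

/* Segment that contains the byte just before 'lsn' (used for the stop LSN). */
uint64_t
lsn_to_prev_segno(uint64_t lsn)
{
    assert(lsn > 0);
    return (lsn - 1) / DEMO_WAL_SEGSZ;
}

/*
 * Format a WAL file name into buf (needs at least 25 bytes).
 * Returns the number of characters snprintf reports, 24 on success.
 */
int
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / DEMO_WAL_SEGSZ;

    return snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
                    tli,
                    (uint32_t) (segno / segs_per_id),
                    (uint32_t) (segno % segs_per_id));
}

/* Number of WAL segments needed to replay from start_lsn up to stop_lsn. */
uint64_t
segments_needed(uint64_t start_lsn, uint64_t stop_lsn)
{
    assert(stop_lsn > start_lsn);
    return lsn_to_prev_segno(stop_lsn) - lsn_to_segno(start_lsn) + 1;
}
```

With a made-up start LSN of 0/3000028 and stop LSN of 0/5000100, segments 3 through 5 are needed, so all three files must be available in the archive or pg_wal before the restored backup can reach consistency.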
