digoal
2015-09-06
PostgreSQL , 并发事务 , 共享行锁 , pg_multixact
multixact是管理共享行锁的基础代码,一个MultiXactId对应一个MultiXactMember数组,每个MultiXactMember包含事务号以及该事务号对应的flag(MultiXactStatus)。
src/include/access/multixact.h
typedef struct MultiXactMember
{
TransactionId xid;
MultiXactStatus status;
} MultiXactMember;
typedef struct xl_multixact_create
{
MultiXactId mid; /* new MultiXact's ID */
MultiXactOffset moff; /* its starting offset in members file */
int32 nmembers; /* number of member XIDs */
MultiXactMember members[FLEXIBLE_ARRAY_MEMBER];
} xl_multixact_create;
src/backend/access/transam/multixact.c
/*-------------------------------------------------------------------------
*
* multixact.c
* PostgreSQL multi-transaction-log manager
*
* The pg_multixact manager is a pg_clog-like manager that stores an array of
* MultiXactMember for each MultiXactId. It is a fundamental part of the
* shared-row-lock implementation. Each MultiXactMember is comprised of a
* TransactionId and a set of flag bits. The name is a bit historical:
* originally, a MultiXactId consisted of more than one TransactionId (except
* in rare corner cases), hence "multi". Nowadays, however, it's perfectly
* legitimate to have MultiXactIds that only include a single Xid.
*
* The meaning of the flag bits is opaque to this module, but they are mostly
* used in heapam.c to identify lock modes that each of the member transactions
* is holding on any given tuple. This module just contains support to store
* and retrieve the arrays.
*
* We use two SLRU areas, one for storing the offsets at which the data
* starts for each MultiXactId in the other one. This trick allows us to
* store variable length arrays of TransactionIds. (We could alternatively
* use one area containing counts and TransactionIds, with valid MultiXactId
* values pointing at slots containing counts; but that way seems less robust
* since it would get completely confused if someone inquired about a bogus
* MultiXactId that pointed to an intermediate slot containing an XID.)
*
* XLOG interactions: this module generates an XLOG record whenever a new
* OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
* whenever a new MultiXactId is defined. This allows us to completely
* rebuild the data entered since the last checkpoint during XLOG replay.
* Because this is possible, we need not follow the normal rule of
* "write WAL before data"; the only correctness guarantee needed is that
* we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
* checkpoint is considered complete. If a page does make it to disk ahead
* of corresponding WAL records, it will be forcibly zeroed before use anyway.
* Therefore, we don't need to mark our pages with LSN information; we have
* enough synchronization already.
*
* Like clog.c, and unlike subtrans.c, we have to preserve state across
* crashes and ensure that MXID and offset numbering increases monotonically
* across a crash. We do this in the same way as it's done for transaction
* IDs: the WAL record is guaranteed to contain evidence of every MXID we
* could need to worry about, and we just make sure that at the end of
* replay, the next-MXID and next-offset counters are at least as large as
* anything we saw during replay.
*
* We are able to remove segments no longer necessary by carefully tracking
* each table's used values: during vacuum, any multixact older than a certain
* value is removed; the cutoff value is stored in pg_class. The minimum value
* across all tables in each database is stored in pg_database, and the global
* minimum across all databases is part of pg_control and is kept in shared
* memory. At checkpoint time, after the value is known flushed in WAL, any
* files that correspond to multixacts older than that value are removed.
* (These files are also removed when a restartpoint is executed.)
*
* When new multixactid values are to be created, care is taken that the
* counter does not fall within the wraparound horizon considering the global
* minimum value.
*
* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* src/backend/access/transam/multixact.c
*
*-------------------------------------------------------------------------
*/
因为每个MultiXactMember包含事务号xid以及该事务号对应的flag(MultiXactStatus),PG使用1个字节存储flag,为了对其,每4个字节和4个XID作为一个MultiXactMember组(刚好20个字节,含4个MultiXactMember)
同样multixact page和BLOCKSZ一样大,所以如果是8K的block size,可以包含409个MultiXactMember组。
如果你的某一条记录被409个事务同时加共享行锁,需要消耗一个8KB的multixact page。
/*
* The situation for members is a bit more complex: we store one byte of
* additional flag bits for each TransactionId. To do this without getting
* into alignment issues, we store four bytes of flags, and then the
* corresponding 4 Xids. Each such 5-word (20-byte) set we call a "group", and
* are stored as a whole in pages. Thus, with 8kB BLCKSZ, we keep 409 groups
* per page. This wastes 12 bytes per page, but that's OK -- simplicity (and
* performance) trumps space efficiency here.
*
* Note that the "offset" macros work with byte offset, not array indexes, so
* arithmetic must be done using "char *" pointers.
*/
/* We need eight bits per xact, so one xact fits in a byte */
#define MXACT_MEMBER_BITS_PER_XACT 8
#define MXACT_MEMBER_FLAGS_PER_BYTE 1
#define MXACT_MEMBER_XACT_BITMASK ((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
/* how many full bytes of flags are there in a group? */
#define MULTIXACT_FLAGBYTES_PER_GROUP 4
#define MULTIXACT_MEMBERS_PER_MEMBERGROUP \
(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
/* size in bytes of a complete group */
#define MULTIXACT_MEMBERGROUP_SIZE \
(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
#define MULTIXACT_MEMBERS_PER_PAGE \
(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
multixact头文件:
有效的multixactid从1开始,最大为0xFFFFFFFF(和xid一致)。
src/include/access/multixact.h
/*
* The first two MultiXactId values are reserved to store the truncation Xid
* and epoch of the first segment, so we start assigning multixact values from
* 2.
*/
#define InvalidMultiXactId ((MultiXactId) 0)
#define FirstMultiXactId ((MultiXactId) 1)
#define MaxMultiXactId ((MultiXactId) 0xFFFFFFFF)
#define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)
/* Number of SLRU buffers to use for multixact */
#define NUM_MXACTOFFSET_BUFFERS 8
#define NUM_MXACTMEMBER_BUFFERS 16
目前使用1个字节存储flag,即MultiXactStatus,现在支持6个状态值。
可以看到这些状态涉及6种行锁操作。
FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE
an update that doesn't touch "key" columns
other updates, and delete
/*
* Possible multixact lock modes ("status"). The first four modes are for
* tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
* next two are used for update and delete modes.
*/
typedef enum
{
MultiXactStatusForKeyShare = 0x00,
MultiXactStatusForShare = 0x01,
MultiXactStatusForNoKeyUpdate = 0x02,
MultiXactStatusForUpdate = 0x03,
/* an update that doesn't touch "key" columns */
MultiXactStatusNoKeyUpdate = 0x04,
/* other updates, and delete */
MultiXactStatusUpdate = 0x05
} MultiXactStatus;
src/include/access/multixact.h
src/backend/access/transam/multixact.c
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.