Support fast ALTER TABLE ADD COLUMN for appendonly row-oriented tables
Since 16828d5, heap tables can avoid a whole-table rewrite in the case of
ADD COLUMN with a non-NULL default value. They achieve that by storing the
evaluated default values in pg_attribute. For any existing tuple, such a new
attribute is regarded as "missing", and when we read the tuple, we fill in the
missing attributes using the values stored in pg_attribute. Any new tuples
being inserted store the value physically on file.
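
For illustration, a minimal SQL sketch of the pre-existing heap behavior this
builds on (the table and column names are made up for the example;
atthasmissing/attmissingval are the pg_attribute fields involved):

    CREATE TABLE heap_t (a int);
    INSERT INTO heap_t SELECT generate_series(1, 100);

    -- No table rewrite: the evaluated default is stashed in pg_attribute
    -- and existing rows are patched up at read time.
    ALTER TABLE heap_t ADD COLUMN b int DEFAULT 42;

    SELECT attname, atthasmissing, attmissingval
    FROM pg_attribute
    WHERE attrelid = 'heap_t'::regclass AND attname = 'b';
    --  attname | atthasmissing | attmissingval
    -- ---------+---------------+---------------
    --  b       | t             | {42}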

Now we do the same for ao_row tables. The key difficulty we faced is that,
unlike heap tuples, the memtuples used by ao_row tables do not store the
number of attributes in their header. Therefore, we cannot tell directly how
many attributes are missing.

The idea (kudos to @ashwinstar) to solve this is that, during ADD COLUMN, we
record the current last row number in each segment of the AO table. Since row
numbers are assigned monotonically, this value tells us whether a given
memtuple carries the corresponding attribute value or not. Therefore, we can
figure out how many attributes are present in the memtuple and how many are
missing. The algorithm is (see `AppendOnlyExecutorReadBlock_BindingInit`, and
the worked example after this list):
1. the number of stored attributes = the largest column number `colno` whose
   `last_row_number` is smaller than or equal to the row number of the memtuple
   we are reading;
2. the number of missing attributes = the total number of attributes minus the
   number of stored attributes.
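
As a worked example of step 1 (a sketch with made-up numbers, mirroring the
example in the AppendOnlyExecutorReadBlock_BindingInit comment): suppose the
recorded last row numbers in one segment are {100, 200, 1000, 2000} for
attnums 1..4, and we read a memtuple with row number 1500:

    SELECT count(*) AS stored_attnums
    FROM unnest(ARRAY[100, 200, 1000, 2000])
         WITH ORDINALITY AS t(last_row_number, colno)
    WHERE t.last_row_number <= 1500;
    -- stored_attnums = 3: attnums 1..3 are physically present in the
    -- memtuple, while attnum 4 must be filled in as a "missing" attribute.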

We store the last row numbers in a new field `lastrownums` in
pg_attribute_encoding. It is only populated by ADD COLUMN. During a table
rewrite, we remove all those entries because the `lastrownums` won't be
needed anymore.
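
A sketch of what this looks like in the catalog after a fast ADD COLUMN (the
table name is made up; there is one array slot per segment file):

    CREATE TABLE ao_t (a int) USING ao_row;
    INSERT INTO ao_t SELECT generate_series(1, 100);
    ALTER TABLE ao_t ADD COLUMN b int;

    -- Only the newly added column records last row numbers.
    SELECT attnum, lastrownums
    FROM pg_attribute_encoding
    WHERE attrelid = 'ao_t'::regclass
    ORDER BY attnum;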

Other notable things:
1. Whenever we remove fastsequence entries (as in TRUNCATE), we erase the
   lastrownums field. We do this instead of removing the pg_attribute_encoding
   entries entirely because we want to support the same optimization for CO
   tables in the future, and we certainly couldn't remove pg_attribute_encoding
   entries for CO tables in TRUNCATE: CO tables still need them for the encoding
   options (pg_attribute_encoding.attoptions). So better to do the right thing now.
2. We do not store the invalid fastsequence number '0' in lastrownums, apart
   from the initial one for RESERVED_SEGNO (segno=0). This saves space in the
   catalog, because in many cases only the first few segments are used. Note
   that this is only possible because these two assumptions hold:
     a. we only pick unused segments from low segno to high (see choose_new_segfile());
     b. once a segment is used, it only ever has a fastsequence number greater than 0.
   If these assumptions are broken some day, we will have to store everything
   in the lastrownums field, or figure out some other way to save space.
3. We can now update pg_attribute_encoding more than once in a single command
   (e.g. when we drop an AO table, we drop the pg_attribute_encoding entries
   first, and then remove the gp_fastsequence entries, which tries to clear
   pg_attribute_encoding again). So RemoveAttributeEncodingsByRelid() now
   increments the command counter. However, that causes (or reveals) two issues
   with ATSETAM (see the repro sketch after this list):
     a. In swap_relation_files(), we call RemoveAttributeEncodingsByRelid after
        we've swapped/transferred the pg_appendonly entries (ATAOEntries). So when
        we invalidate caches as part of the command counter increment, we might
        fail to find the pg_appendonly entry for a table whose pg_appendonly entry
        we have removed but whose relam we haven't changed yet. Basically, we must
        not increment the command counter between ATAOEntries and changing relam.
        Added a comment for that.
     b. After we've swapped relam and updated pg_class, we have to increment the
        command counter so that whatever follows sees the table under the proper
        AM. Otherwise problems occur: e.g. after an AO->heap conversion, when we
        reindex the table, we might still see it as AO and fail to build the
        relation descriptor. This is revealed by the RemoveAttributeEncodingsByRelid
        change because it made the pg_appendonly change visible but not the relam
        change.
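
A user-level sketch of the command path behind issue (b) (ATSETAM is
ALTER TABLE ... SET ACCESS METHOD; the table name is made up):

    CREATE TABLE t (a int) USING ao_row;
    CREATE INDEX t_a_idx ON t (a);
    INSERT INTO t SELECT generate_series(1, 100);

    -- Swaps relam in pg_class and then rebuilds the table's indexes; without
    -- the extra command counter increment, the reindex could still see the
    -- table as AO and fail to build the relation descriptor.
    ALTER TABLE t SET ACCESS METHOD heap;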

Performance-wise, there is some overhead added to calculate the number of
stored attributes, but it is not excessive, for two reasons:
1. The time is proportional to the total number of attributes in the table,
  which is usually not large and has an upper limit (1600). For each attribute
  number, all we need is a pointer dereference and a numeric comparison.
2. While scanning a varblock, we only need to do the above work once, since
  the number of stored attributes cannot change within a varblock.

Added two sets of tests in regress and isolation2, for single-client and
concurrency testing respectively.
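
The single-client tests exercise roughly this shape of workload (a sketch;
the behave scenario below uses the same table and pattern):

    CREATE TABLE errlastrownum (a int) USING ao_row;
    INSERT INTO errlastrownum SELECT * FROM generate_series(1, 100);

    -- Fast path: the 100 existing rows are not rewritten.
    ALTER TABLE errlastrownum ADD COLUMN newcol int;

    -- Old rows read newcol as a "missing" attribute (NULL here);
    -- new rows store it physically.
    INSERT INTO errlastrownum VALUES (101, 7);
    SELECT count(*) FROM errlastrownum WHERE newcol IS NULL;  -- 100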

Fix #14929
huansong committed May 24, 2023
1 parent 3dbbe82 commit d3dcb9b
Showing 28 changed files with 1,518 additions and 112 deletions.
89 changes: 89 additions & 0 deletions gpMgmt/bin/gpcheckcat
@@ -1512,6 +1512,82 @@ def checkAOSegVpinfo():

    processThread(threads)

class checkAOLastrownumThread(execThread):
    def __init__(self, cfg, db):
        execThread.__init__(self, cfg, db, None)

    # pg_attribute_encoding.lastrownums[segno], if exists, should have a corresponding entry in gp_fastsequence with
    # an objid same as segno. And the value of pg_attribute_encoding.lastrownums[segno] should fall in the range of
    # [0, {last_sequence}] where {last_sequence} is the current gp_fastsequence value with the corresponding objid.
    # Note that objmod starts from 0 but the array index starts from 1.
    def run(self):
        aolastrownum_query = """
            SELECT
                c.relname,
                ao.relid,
                ae.attnum,
                ae.lastrownums,
                f.objmod,
                f.last_sequence,
                ae.lastrownums[f.objmod + 1] AS lastrownum
            FROM
                pg_attribute_encoding ae
                JOIN pg_appendonly ao ON ae.attrelid = ao.relid
                LEFT JOIN gp_fastsequence f ON ao.segrelid = f.objid
                JOIN pg_class c ON ao.relid = c.oid
            WHERE
                f.last_sequence IS NULL
                OR f.last_sequence < ae.lastrownums[f.objmod + 1]
                OR ae.lastrownums[f.objmod + 1] < 0;
        """

        try:
            # Execute the query
            curs = self.db.query(aolastrownum_query)
            nrows = curs.ntuples()

            if nrows == 0:
                logger.info('[OK] AO lastrownums check for pg_attribute_encoding')
            else:
                GV.checkStatus = False
                # we could not fix this issue automatically
                setError(ERROR_NOREPAIR)
                logger.info('[FAIL] AO lastrownums check for pg_attribute_encoding')
                for relname, relid, attnum, lastrownums, objmod, last_sequence, last_rownum in curs.getresult():
                    logger.error("  found inconsistent last_rownum {rownum} with last_sequence {seqnum} of aoseg {segno} for table '{relname}' attribute {attnum} on segment {content}"
                                 .format(rownum=last_rownum,
                                         seqnum=last_sequence,
                                         segno=objmod,
                                         relname=relname,
                                         attnum=attnum,
                                         content=self.cfg['content']))

        except Exception as e:
            GV.checkStatus = False
            self.error = e

# for test "ao_lastrownums"
def checkAOLastrownums():
    threads = []
    i = 1
    # parallelise check
    for dbid in GV.cfg:
        cfg = GV.cfg[dbid]
        db_connection = connect2(cfg)
        thread = checkAOLastrownumThread(cfg, db_connection)
        thread.start()
        logger.debug('launching check thread %s for dbid %i' %
                     (thread.name, dbid))
        threads.append(thread)

        if (i % GV.opt['-B']) == 0:
            processThread(threads)
            threads = []

        i += 1

    processThread(threads)

# -------------------------------------------------------------------------------

# Exclude these tuples from the catalog table scan
@@ -1619,6 +1695,11 @@ def checkTableInconsistentEntry(cat):
    columns = list(map(lambda b: b.replace("stxdmcv","stxdmcv::text"), columns))
    castcols = list(map(lambda b: b.replace("stxdmcv","stxdmcv::text"), castcols))

    # pg_attribute_encoding.lastrownums could have various values on segments, ignore them
    if catname == 'pg_attribute_encoding':
        columns.remove('lastrownums')
        castcols.remove('lastrownums')

    if cat.tableHasConsistentOids():
        qry = inconsistentEntryQuery(GV.max_content, catname, ['oid'], columns, castcols)
    else:
@@ -2045,6 +2126,14 @@ all_checks = {
        "version": 'main',
        "order": 15,
        "online": False
    },
    "ao_lastrownums":
    {
        "description": "Check that lastrownums in pg_attribute_encoding is consistent with gp_fastsequence",
        "fn": lambda: checkAOLastrownums(),
        "version": 'main',
        "order": 16,
        "online": False
    }
}

20 changes: 20 additions & 0 deletions gpMgmt/test/behave/mgmt_utils/gpcheckcat.feature
@@ -125,6 +125,26 @@ Feature: gpcheckcat tests
    Then gpcheckcat should print "Extra" to stdout
    And gpcheckcat should print "Table miss_attr_db4.public.heap_table.1" to stdout

  Scenario: gpcheckcat should report inconsistent pg_attribute_encoding.lastrownums values with gp_fastsequence
    Given database "errorneous_lastrownums" is dropped and recreated
    And the user runs "psql errorneous_lastrownums -c "create table errlastrownum(a int) using ao_row; insert into errlastrownum select * from generate_series(1,100);""
    And the user runs "psql errorneous_lastrownums -c "alter table errlastrownum add column newcol int;""
    When the user runs "gpcheckcat -R ao_lastrownums errorneous_lastrownums"
    Then gpcheckcat should return a return code of 0
    When the user runs sql "set allow_system_table_mods=on; update gp_fastsequence set last_sequence = 0 where last_sequence > 0;" in "errorneous_lastrownums" on first primary segment
    When the user runs "gpcheckcat -R ao_lastrownums errorneous_lastrownums"
    Then gpcheckcat should return a return code of 3
    And gpcheckcat should print "Failed test\(s\) that are not reported here: ao_lastrownums" to stdout
    Given database "errorneous_lastrownums" is dropped and recreated
    And the user runs "psql errorneous_lastrownums -c "create table errlastrownum(a int) using ao_row; insert into errlastrownum select * from generate_series(1,10);""
    And the user runs "psql errorneous_lastrownums -c "alter table errlastrownum add column newcol int;""
    When the user runs "gpcheckcat -R ao_lastrownums errorneous_lastrownums"
    Then gpcheckcat should return a return code of 0
    Then the user runs sql "set allow_system_table_mods=on; delete from gp_fastsequence where last_sequence > 0;" in "errorneous_lastrownums" on first primary segment
    When the user runs "gpcheckcat -R ao_lastrownums errorneous_lastrownums"
    Then gpcheckcat should return a return code of 3
    And gpcheckcat should print "Failed test\(s\) that are not reported here: ao_lastrownums" to stdout

  Scenario: gpcheckcat should report and repair owner errors and produce timestamped repair scripts
    Given database "owner_db1" is dropped and recreated
    And database "owner_db2" is dropped and recreated
2 changes: 1 addition & 1 deletion src/backend/access/appendonly/appendonly_compaction.c
@@ -433,7 +433,7 @@ AppendOnlySegmentFileFullCompaction(Relation aorel,
	tupDesc = RelationGetDescr(aorel);
	slot = MakeSingleTupleTableSlot(tupDesc, &TTSOpsVirtual);
	slot->tts_tableOid = RelationGetRelid(aorel);
-	mt_bind = create_memtuple_binding(tupDesc);
+	mt_bind = create_memtuple_binding(tupDesc, tupDesc->natts);

	/*
	 * We need a ResultRelInfo and an EState so we can use the regular
104 changes: 90 additions & 14 deletions src/backend/access/appendonly/appendonlyam.c
@@ -662,6 +662,16 @@ AppendOnlyExecutorReadBlock_GetBlockInfo(AppendOnlyStorageRead *storageRead,
	executorReadBlock->headerOffsetInFile =
		AppendOnlyStorageRead_CurrentHeaderOffsetInFile(storageRead);

	/* Start curLargestAttnum from 1; it will be updated in AppendOnlyExecutorReadBlock_BindingInit() */
	executorReadBlock->curLargestAttnum = 1;

	/* mt_bind should be recreated for the new block */
	if (executorReadBlock->mt_bind)
	{
		destroy_memtuple_binding(executorReadBlock->mt_bind);
		executorReadBlock->mt_bind = NULL;
	}

	/* UNDONE: Check blockFirstRowNum */

	return true;
@@ -707,6 +717,10 @@ AppendOnlyExecutorReadBlock_Init(AppendOnlyExecutorReadBlock *executorReadBlock,
	executorReadBlock->storageRead = storageRead;
	executorReadBlock->memoryContext = memoryContext;

	Assert(relation);	/* should have a valid relation */
	executorReadBlock->attnum_to_rownum = GetAttnumToLastrownumMapping(RelationGetRelid(relation),
																	   RelationGetNumberOfAttributes(relation));

	MemoryContextSwitchTo(oldcontext);

@@ -733,6 +747,12 @@ AppendOnlyExecutorReadBlock_Finish(AppendOnlyExecutorReadBlock *executorReadBlock)
		pfree(executorReadBlock->mt_bind);
		executorReadBlock->mt_bind = NULL;
	}

	if (executorReadBlock->attnum_to_rownum)
	{
		pfree(executorReadBlock->attnum_to_rownum);
		executorReadBlock->attnum_to_rownum = NULL;
	}
}

static void
@@ -741,22 +761,78 @@ AppendOnlyExecutorReadBlock_ResetCounts(AppendOnlyExecutorReadBlock *executorReadBlock)
	executorReadBlock->totalRowsScanned = 0;
}

/*
 * Initialize the memtuple attribute bindings.
 *
 * Here, we figure out how many attributes are physically stored in the
 * memtuple based on the row number. Any row with a row number larger than
 * the pg_attribute_encoding.lastrownums number associated with the attribute
 * and the current segno should have the attribute physically stored in the
 * memtuple. For example, imagine we have this attnum-to-rownum mapping for
 * segno=1:
 *   (attnum=1, lastrownums=100)
 *   (attnum=2, lastrownums=200)
 *   (attnum=3, lastrownums=1000)
 *   (attnum=4, lastrownums=2000)
 * Assuming we are reading a memtuple with row number = 1500, we know that the
 * first three attributes should be physically stored in the memtuple, but the
 * fourth attribute and onwards are not.
 *
 * So if lastrownum=0 for an attribute and segment pair, it effectively
 * indicates that all rows in the segment carry that attribute in the on-disk
 * memtuple.
 *
 * Note that the attnum_to_rownum array is laid out attribute-major, so the
 * above mapping will be represented in attnum_to_rownum as (assuming no other
 * segno is being used):
 * [
 *   0, 100, 0, ...(125 zeroes)..., <-- for attnum=1
 *   0, 200, 0, ...(125 zeroes)..., <-- for attnum=2
 *   0, 1000, 0, ...(125 zeroes)..., <-- for attnum=3
 *   0, 2000, 0, ...(125 zeroes)..., <-- for attnum=4
 *   0, ...(all zeroes)...
 * ]
 */
static void
-AOExecutorReadBlockBindingInit(AppendOnlyExecutorReadBlock *executorReadBlock,
-							   TupleTableSlot *slot)
+AppendOnlyExecutorReadBlock_BindingInit(AppendOnlyExecutorReadBlock *executorReadBlock,
+										TupleTableSlot *slot,
+										int64 rowNum)
{
	int			segno = executorReadBlock->segmentFileNum;
	int			largestAttnum = executorReadBlock->curLargestAttnum;
	MemoryContext oldContext;

	/* for any row to be read, there's at least one column datum in the row */
	Assert(largestAttnum > 0);
	Assert(executorReadBlock->attnum_to_rownum != NULL);

	/* Find the first attnum that has a larger lastrownum than rowNum. */
	while (largestAttnum < slot->tts_tupleDescriptor->natts &&
		   rowNum >= executorReadBlock->attnum_to_rownum[largestAttnum * MAX_AOREL_CONCURRENCY + segno])
		largestAttnum++;

	/*
	 * If we have already created the bindings and the largest attnum has not
	 * changed, we do not need to recreate the bindings.
	 */
	if (executorReadBlock->mt_bind && largestAttnum == executorReadBlock->curLargestAttnum)
		return;

	/* Otherwise, we have to create/recreate the bindings */
	oldContext = MemoryContextSwitchTo(executorReadBlock->memoryContext);

	/* destroy the previous bindings */
	if (executorReadBlock->mt_bind)
		destroy_memtuple_binding(executorReadBlock->mt_bind);

	/*
	 * MemTupleBinding should be created from the slot's tuple descriptor
-	 * and not from the tuple descriptor in the relation. These could be
-	 * different. One example is alter table rewrite.
+	 * (plus the expected largest attnum that we calculated above). We should
+	 * not use the relation's tuple descriptor, which could be different in
+	 * cases like an ALTER TABLE rewrite.
	 */
-	if (!executorReadBlock->mt_bind)
-	{
-		oldContext = MemoryContextSwitchTo(executorReadBlock->memoryContext);
-		executorReadBlock->mt_bind = create_memtuple_binding(slot->tts_tupleDescriptor);
-		MemoryContextSwitchTo(oldContext);
-	}
+	executorReadBlock->mt_bind = create_memtuple_binding(slot->tts_tupleDescriptor, largestAttnum);
+	MemoryContextSwitchTo(oldContext);

	executorReadBlock->curLargestAttnum = largestAttnum;
}


@@ -780,14 +856,14 @@ AppendOnlyExecutorReadBlock_ProcessTuple(AppendOnlyExecutorReadBlock *executorReadBlock,
	AOTupleIdInit(aoTupleId, executorReadBlock->segmentFileNum, rowNum);

-	if (slot)
-		AOExecutorReadBlockBindingInit(executorReadBlock, slot);
-
	/*
	 * Is it legal to call this function with NULL slot? The
	 * HeapKeyTestUsingSlot call below assumes that the slot is not NULL.
	 */
	Assert(slot);

+	AppendOnlyExecutorReadBlock_BindingInit(executorReadBlock, slot, rowNum);
+
	{
		bool		shouldFree = false;
@@ -2827,7 +2903,7 @@ appendonly_insert_init(Relation rel, int segno, int64 num_rows)
	 */
	aoInsertDesc->appendOnlyMetaDataSnapshot = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));

-	aoInsertDesc->mt_bind = create_memtuple_binding(RelationGetDescr(rel));
+	aoInsertDesc->mt_bind = create_memtuple_binding(RelationGetDescr(rel), RelationGetNumberOfAttributes(rel));

	aoInsertDesc->appendFile = -1;
	aoInsertDesc->appendFilePathNameMaxLen = AOSegmentFilePathNameLen(rel) + 1;
2 changes: 1 addition & 1 deletion src/backend/access/appendonly/appendonlyam_handler.c
@@ -1290,7 +1290,7 @@ appendonly_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,

	SIMPLE_FAULT_INJECTOR("cluster_ao_seq_scan_begin");

-	mt_bind = create_memtuple_binding(oldTupDesc);
+	mt_bind = create_memtuple_binding(oldTupDesc, oldTupDesc->natts);

	while (appendonly_getnextslot(aoscandesc, ForwardScanDirection, slot))
	{
