Support fast ALTER TABLE ADD COLUMN for appendonly row-oriented tables
Since 16828d5, heap tables can avoid a whole-table rewrite in the case of
ADD COLUMN with a non-NULL default value. They achieve that by storing the
evaluated default values in pg_attribute. For any existing tuple, such a new
attribute is regarded as "missing", and when we read the tuple, we fill in the
missing attributes using the values stored in pg_attribute. Any new tuples
being inserted store the value physically on file.
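
For illustration, a minimal SQL sketch of the pre-existing heap behavior this
builds on (the table and column names are made up for the example;
atthasmissing/attmissingval are the pg_attribute fields involved):

    CREATE TABLE heap_t (a int);
    INSERT INTO heap_t SELECT generate_series(1, 100);

    -- No table rewrite: the evaluated default is stashed in pg_attribute
    -- and existing rows are patched up at read time.
    ALTER TABLE heap_t ADD COLUMN b int DEFAULT 42;

    SELECT attname, atthasmissing, attmissingval
    FROM pg_attribute
    WHERE attrelid = 'heap_t'::regclass AND attname = 'b';
    --  attname | atthasmissing | attmissingval
    -- ---------+---------------+---------------
    --  b       | t             | {42}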

Now we do the same for ao_row tables. The key difficulty we faced is that,
unlike heap tuples, the memtuples used by ao_row tables do not store the
number of attributes in their header. Therefore, we cannot tell directly how
many attributes are missing.

The idea (kudos to @ashwinstar) to solve this is that, during ADD COLUMN, we
record the current last row number in each segment of the AO table. Since row
numbers are assigned monotonically, this value tells us whether a given
memtuple carries the corresponding attribute value or not. Therefore, we can
figure out how many attributes are present in the memtuple and how many are
missing. The algorithm is (see `AppendOnlyExecutorReadBlock_BindingInit`, and
the worked example after this list):
1. the number of stored attributes = the largest column number `colno` whose
   `last_row_number` is smaller than or equal to the row number of the memtuple
   we are reading;
2. the number of missing attributes = the total number of attributes minus the
   number of stored attributes.
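
As a worked example of step 1 (a sketch with made-up numbers, mirroring the
example in the AppendOnlyExecutorReadBlock_BindingInit comment): suppose the
recorded last row numbers in one segment are {100, 200, 1000, 2000} for
attnums 1..4, and we read a memtuple with row number 1500:

    SELECT count(*) AS stored_attnums
    FROM unnest(ARRAY[100, 200, 1000, 2000])
         WITH ORDINALITY AS t(last_row_number, colno)
    WHERE t.last_row_number <= 1500;
    -- stored_attnums = 3: attnums 1..3 are physically present in the
    -- memtuple, while attnum 4 must be filled in as a "missing" attribute.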

We store the last row numbers in a new field `lastrownums` in
pg_attribute_encoding. It is only populated by ADD COLUMN. During a table
rewrite, we remove all those entries because the `lastrownums` won't be
needed anymore.
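
A sketch of what this looks like in the catalog after a fast ADD COLUMN (the
table name is made up; there is one array slot per segment file):

    CREATE TABLE ao_t (a int) USING ao_row;
    INSERT INTO ao_t SELECT generate_series(1, 100);
    ALTER TABLE ao_t ADD COLUMN b int;

    -- Only the newly added column records last row numbers.
    SELECT attnum, lastrownums
    FROM pg_attribute_encoding
    WHERE attrelid = 'ao_t'::regclass
    ORDER BY attnum;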

Other notable things:
1. Whenever we remove fastsequence entries (as in TRUNCATE), we erase the
   lastrownums field. We do this instead of removing the pg_attribute_encoding
   entries entirely because we want to support the same optimization for CO
   tables in the future, and we certainly couldn't remove pg_attribute_encoding
   entries for CO tables in TRUNCATE: CO tables still need them for the encoding
   options (pg_attribute_encoding.attoptions). So better to do the right thing now.
2. We do not store the invalid fastsequence number '0' in lastrownums, apart
   from the initial one for RESERVED_SEGNO (segno=0). This saves space in the
   catalog, because in many cases only the first few segments are used. Note
   that this is only possible because these two assumptions hold:
     a. we only pick unused segments from low segno to high (see choose_new_segfile());
     b. once a segment is used, it only ever has a fastsequence number greater than 0.
   If these assumptions are broken some day, we will have to store everything
   in the lastrownums field, or figure out some other way to save space.
3. We can now update pg_attribute_encoding more than once in a single command
   (e.g. when we drop an AO table, we drop the pg_attribute_encoding entries
   first, and then remove the gp_fastsequence entries, which tries to clear
   pg_attribute_encoding again). So RemoveAttributeEncodingsByRelid() now
   increments the command counter. However, that causes (or reveals) two issues
   with ATSETAM (see the repro sketch after this list):
     a. In swap_relation_files(), we call RemoveAttributeEncodingsByRelid after
        we've swapped/transferred the pg_appendonly entries (ATAOEntries). So when
        we invalidate caches as part of the command counter increment, we might
        fail to find the pg_appendonly entry for a table whose pg_appendonly entry
        we have removed but whose relam we haven't changed yet. Basically, we must
        not increment the command counter between ATAOEntries and changing relam.
        Added a comment for that.
     b. After we've swapped relam and updated pg_class, we have to increment the
        command counter so that whatever follows sees the table under the proper
        AM. Otherwise problems occur: e.g. after an AO->heap conversion, when we
        reindex the table, we might still see it as AO and fail to build the
        relation descriptor. This is revealed by the RemoveAttributeEncodingsByRelid
        change because it made the pg_appendonly change visible but not the relam
        change.
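
A user-level sketch of the command path behind issue (b) (ATSETAM is
ALTER TABLE ... SET ACCESS METHOD; the table name is made up):

    CREATE TABLE t (a int) USING ao_row;
    CREATE INDEX t_a_idx ON t (a);
    INSERT INTO t SELECT generate_series(1, 100);

    -- Swaps relam in pg_class and then rebuilds the table's indexes; without
    -- the extra command counter increment, the reindex could still see the
    -- table as AO and fail to build the relation descriptor.
    ALTER TABLE t SET ACCESS METHOD heap;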

Performance-wise, there is some overhead added to calculate the number of
stored attributes, but it is not excessive, for two reasons:
1. The time is proportional to the total number of attributes in the table,
  which is usually not large and has an upper limit (1600). For each attribute
  number, all we need is a pointer dereference and a numeric comparison.
2. While scanning a varblock, we only need to do the above work once, since
  the number of stored attributes cannot change within a varblock.

Added two sets of tests in regress and isolation2, for single-client and
concurrency testing respectively.
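
The single-client tests exercise roughly this shape of workload (a sketch;
the behave scenario below uses the same table and pattern):

    CREATE TABLE errlastrownum (a int) USING ao_row;
    INSERT INTO errlastrownum SELECT * FROM generate_series(1, 100);

    -- Fast path: the 100 existing rows are not rewritten.
    ALTER TABLE errlastrownum ADD COLUMN newcol int;

    -- Old rows read newcol as a "missing" attribute (NULL here);
    -- new rows store it physically.
    INSERT INTO errlastrownum VALUES (101, 7);
    SELECT count(*) FROM errlastrownum WHERE newcol IS NULL;  -- 100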

Fix #14929
huansong committed May 24, 2023
1 parent 3dbbe82 commit d3dcb9b
Showing 28 changed files with 1,518 additions and 112 deletions.
89 changes: 89 additions & 0 deletions gpMgmt/bin/gpcheckcat
@@ -1512,6 +1512,82 @@ def checkAOSegVpinfo():

    processThread(threads)

class checkAOLastrownumThread(execThread):
    def __init__(self, cfg, db):
        execThread.__init__(self, cfg, db, None)

    # pg_attribute_encoding.lastrownums[segno], if exists, should have a corresponding entry in gp_fastsequence with
    # an objid same as segno. And the value of pg_attribute_encoding.lastrownums[segno] should fall in the range of
    # [0, {last_sequence}] where {last_sequence} is the current gp_fastsequence value with the corresponding objid.
    # Note that objmod starts from 0 but the array index starts from 1.
    def run(self):
        aolastrownum_query = """
            SELECT
                c.relname,
                ao.relid,
                ae.attnum,
                ae.lastrownums,
                f.objmod,
                f.last_sequence,
                ae.lastrownums[f.objmod + 1] AS lastrownum
            FROM
                pg_attribute_encoding ae
                JOIN pg_appendonly ao ON ae.attrelid = ao.relid
                LEFT JOIN gp_fastsequence f ON ao.segrelid = f.objid
                JOIN pg_class c ON ao.relid = c.oid
            WHERE
                f.last_sequence IS NULL
                OR f.last_sequence < ae.lastrownums[f.objmod + 1]
                OR ae.lastrownums[f.objmod + 1] < 0;
        """

        try:
            # Execute the query
            curs = self.db.query(aolastrownum_query)
            nrows = curs.ntuples()

            if nrows == 0:
                logger.info('[OK] AO lastrownums check for pg_attribute_encoding')
            else:
                GV.checkStatus = False
                # we could not fix this issue automatically
                setError(ERROR_NOREPAIR)
                logger.info('[FAIL] AO lastrownums check for pg_attribute_encoding')
                for relname, relid, attnum, lastrownums, objmod, last_sequence, last_rownum in curs.getresult():
                    logger.error("  found inconsistent last_rownum {rownum} with last_sequence {seqnum} of aoseg {segno} for table '{relname}' attribute {attnum} on segment {content}"
                                 .format(rownum=last_rownum,
                                         seqnum=last_sequence,
                                         segno=objmod,
                                         relname=relname,
                                         attnum=attnum,
                                         content=self.cfg['content']))

        except Exception as e:
            GV.checkStatus = False
            self.error = e

# for test "ao_lastrownums"
def checkAOLastrownums():
    threads = []
    i = 1
    # parallelise check
    for dbid in GV.cfg:
        cfg = GV.cfg[dbid]
        db_connection = connect2(cfg)
        thread = checkAOLastrownumThread(cfg, db_connection)
        thread.start()
        logger.debug('launching check thread %s for dbid %i' %
                     (thread.name, dbid))
        threads.append(thread)

        if (i % GV.opt['-B']) == 0:
            processThread(threads)
            threads = []

        i += 1

    processThread(threads)

# -------------------------------------------------------------------------------

# Exclude these tuples from the catalog table scan
@@ -1619,6 +1695,11 @@ def checkTableInconsistentEntry(cat):
    columns = list(map(lambda b: b.replace("stxdmcv","stxdmcv::text"), columns))
    castcols = list(map(lambda b: b.replace("stxdmcv","stxdmcv::text"), castcols))

    # pg_attribute_encoding.lastrownums could have various values on segments, ignore them
    if catname == 'pg_attribute_encoding':
        columns.remove('lastrownums')
        castcols.remove('lastrownums')

    if cat.tableHasConsistentOids():
        qry = inconsistentEntryQuery(GV.max_content, catname, ['oid'], columns, castcols)
    else:
@@ -2045,6 +2126,14 @@ all_checks = {
        "version": 'main',
        "order": 15,
        "online": False
    },
    "ao_lastrownums":
    {
        "description": "Check that lastrownums in pg_attribute_encoding is consistent with gp_fastsequence",
        "fn": lambda: checkAOLastrownums(),
        "version": 'main',
        "order": 16,
        "online": False
    }
}

20 changes: 20 additions & 0 deletions gpMgmt/test/behave/mgmt_utils/gpcheckcat.feature
@@ -125,6 +125,26 @@ Feature: gpcheckcat tests
    Then gpcheckcat should print "Extra" to stdout
    And gpcheckcat should print "Table miss_attr_db4.public.heap_table.1" to stdout

  Scenario: gpcheckcat should report inconsistent pg_attribute_encoding.lastrownums values with gp_fastsequence
    Given database "errorneous_lastrownums" is dropped and recreated
    And the user runs "psql errorneous_lastrownums -c "create table errlastrownum(a int) using ao_row; insert into errlastrownum select * from generate_series(1,100);""
    And the user runs "psql errorneous_lastrownums -c "alter table errlastrownum add column newcol int;""
    When the user runs "gpcheckcat -R ao_lastrownums errorneous_lastrownums"
    Then gpcheckcat should return a return code of 0
    When the user runs sql "set allow_system_table_mods=on; update gp_fastsequence set last_sequence = 0 where last_sequence > 0;" in "errorneous_lastrownums" on first primary segment
    When the user runs "gpcheckcat -R ao_lastrownums errorneous_lastrownums"
    Then gpcheckcat should return a return code of 3
    And gpcheckcat should print "Failed test\(s\) that are not reported here: ao_lastrownums" to stdout
    Given database "errorneous_lastrownums" is dropped and recreated
    And the user runs "psql errorneous_lastrownums -c "create table errlastrownum(a int) using ao_row; insert into errlastrownum select * from generate_series(1,10);""
    And the user runs "psql errorneous_lastrownums -c "alter table errlastrownum add column newcol int;""
    When the user runs "gpcheckcat -R ao_lastrownums errorneous_lastrownums"
    Then gpcheckcat should return a return code of 0
    Then the user runs sql "set allow_system_table_mods=on; delete from gp_fastsequence where last_sequence > 0;" in "errorneous_lastrownums" on first primary segment
    When the user runs "gpcheckcat -R ao_lastrownums errorneous_lastrownums"
    Then gpcheckcat should return a return code of 3
    And gpcheckcat should print "Failed test\(s\) that are not reported here: ao_lastrownums" to stdout

  Scenario: gpcheckcat should report and repair owner errors and produce timestamped repair scripts
    Given database "owner_db1" is dropped and recreated
    And database "owner_db2" is dropped and recreated
2 changes: 1 addition & 1 deletion src/backend/access/appendonly/appendonly_compaction.c
@@ -433,7 +433,7 @@ AppendOnlySegmentFileFullCompaction(Relation aorel,
	tupDesc = RelationGetDescr(aorel);
	slot = MakeSingleTupleTableSlot(tupDesc, &TTSOpsVirtual);
	slot->tts_tableOid = RelationGetRelid(aorel);
-	mt_bind = create_memtuple_binding(tupDesc);
+	mt_bind = create_memtuple_binding(tupDesc, tupDesc->natts);

	/*
	 * We need a ResultRelInfo and an EState so we can use the regular
104 changes: 90 additions & 14 deletions src/backend/access/appendonly/appendonlyam.c
@@ -662,6 +662,16 @@ AppendOnlyExecutorReadBlock_GetBlockInfo(AppendOnlyStorageRead *storageRead,
	executorReadBlock->headerOffsetInFile =
		AppendOnlyStorageRead_CurrentHeaderOffsetInFile(storageRead);

	/* Start curLargestAttnum from 1; it will be updated in AppendOnlyExecutorReadBlock_BindingInit() */
	executorReadBlock->curLargestAttnum = 1;

	/* mt_bind should be recreated for the new block */
	if (executorReadBlock->mt_bind)
	{
		destroy_memtuple_binding(executorReadBlock->mt_bind);
		executorReadBlock->mt_bind = NULL;
	}

	/* UNDONE: Check blockFirstRowNum */

	return true;
@@ -707,6 +717,10 @@ AppendOnlyExecutorReadBlock_Init(AppendOnlyExecutorReadBlock *executorReadBlock,
	executorReadBlock->storageRead = storageRead;
	executorReadBlock->memoryContext = memoryContext;

	Assert(relation);	/* should have a valid relation */
	executorReadBlock->attnum_to_rownum = GetAttnumToLastrownumMapping(RelationGetRelid(relation),
																	   RelationGetNumberOfAttributes(relation));

	MemoryContextSwitchTo(oldcontext);

@@ -733,6 +747,12 @@ AppendOnlyExecutorReadBlock_Finish(AppendOnlyExecutorReadBlock *executorReadBlock)
		pfree(executorReadBlock->mt_bind);
		executorReadBlock->mt_bind = NULL;
	}

	if (executorReadBlock->attnum_to_rownum)
	{
		pfree(executorReadBlock->attnum_to_rownum);
		executorReadBlock->attnum_to_rownum = NULL;
	}
}

static void
@@ -741,22 +761,78 @@ AppendOnlyExecutorReadBlock_ResetCounts(AppendOnlyExecutorReadBlock *executorReadBlock)
	executorReadBlock->totalRowsScanned = 0;
}

/*
 * Initialize the memtuple attribute bindings.
 *
 * Here, we figure out how many attributes are physically stored in the
 * memtuple based on the row number. Any row with a row number larger than
 * the pg_attribute_encoding.lastrownums number associated with the attribute
 * and the current segno should have the attribute physically stored in the
 * memtuple. For example, imagine we have this attnum-to-rownum mapping for
 * segno=1:
 *   (attnum=1, lastrownums=100)
 *   (attnum=2, lastrownums=200)
 *   (attnum=3, lastrownums=1000)
 *   (attnum=4, lastrownums=2000)
 * Assuming we are reading a memtuple with row number = 1500, we know that the
 * first three attributes should be physically stored in the memtuple, but the
 * fourth attribute and onwards are not.
 *
 * So if lastrownum=0 for an attribute and segment pair, it effectively
 * indicates that all rows in the segment carry that attribute in the on-disk
 * memtuple.
 *
 * Note that the attnum_to_rownum array is laid out attribute-major, so the
 * above mapping will be represented in attnum_to_rownum as (assuming no other
 * segno is being used):
 * [
 *   0, 100, 0, ...(125 zeroes)..., <-- for attnum=1
 *   0, 200, 0, ...(125 zeroes)..., <-- for attnum=2
 *   0, 1000, 0, ...(125 zeroes)..., <-- for attnum=3
 *   0, 2000, 0, ...(125 zeroes)..., <-- for attnum=4
 *   0, ...(all zeroes)...
 * ]
 */
static void
-AOExecutorReadBlockBindingInit(AppendOnlyExecutorReadBlock *executorReadBlock,
-							   TupleTableSlot *slot)
+AppendOnlyExecutorReadBlock_BindingInit(AppendOnlyExecutorReadBlock *executorReadBlock,
+										TupleTableSlot *slot,
+										int64 rowNum)
{
	int			segno = executorReadBlock->segmentFileNum;
	int			largestAttnum = executorReadBlock->curLargestAttnum;
	MemoryContext oldContext;

	/* for any row to be read, there's at least one column datum in the row */
	Assert(largestAttnum > 0);
	Assert(executorReadBlock->attnum_to_rownum != NULL);

	/* Find the first attnum that has a larger lastrownum than rowNum. */
	while (largestAttnum < slot->tts_tupleDescriptor->natts &&
		   rowNum >= executorReadBlock->attnum_to_rownum[largestAttnum * MAX_AOREL_CONCURRENCY + segno])
		largestAttnum++;

	/*
	 * If we have already created the bindings and the largest attnum has not
	 * changed, we do not need to recreate the bindings.
	 */
	if (executorReadBlock->mt_bind && largestAttnum == executorReadBlock->curLargestAttnum)
		return;

	/* Otherwise, we have to create/recreate the bindings */
	oldContext = MemoryContextSwitchTo(executorReadBlock->memoryContext);

	/* destroy the previous bindings */
	if (executorReadBlock->mt_bind)
		destroy_memtuple_binding(executorReadBlock->mt_bind);

	/*
	 * MemTupleBinding should be created from the slot's tuple descriptor
-	 * and not from the tuple descriptor in the relation. These could be
-	 * different. One example is alter table rewrite.
+	 * (plus the expected largest attnum that we calculated above). We should
+	 * not use the relation's tuple descriptor, which could be different in
+	 * cases like an ALTER TABLE rewrite.
	 */
-	if (!executorReadBlock->mt_bind)
-	{
-		oldContext = MemoryContextSwitchTo(executorReadBlock->memoryContext);
-		executorReadBlock->mt_bind = create_memtuple_binding(slot->tts_tupleDescriptor);
-		MemoryContextSwitchTo(oldContext);
-	}
+	executorReadBlock->mt_bind = create_memtuple_binding(slot->tts_tupleDescriptor, largestAttnum);
+	MemoryContextSwitchTo(oldContext);

	executorReadBlock->curLargestAttnum = largestAttnum;
}


@@ -780,14 +856,14 @@ AppendOnlyExecutorReadBlock_ProcessTuple(AppendOnlyExecutorReadBlock *executorReadBlock,
	AOTupleIdInit(aoTupleId, executorReadBlock->segmentFileNum, rowNum);

-	if (slot)
-		AOExecutorReadBlockBindingInit(executorReadBlock, slot);
-
	/*
	 * Is it legal to call this function with NULL slot? The
	 * HeapKeyTestUsingSlot call below assumes that the slot is not NULL.
	 */
	Assert(slot);

+	AppendOnlyExecutorReadBlock_BindingInit(executorReadBlock, slot, rowNum);
+
	{
		bool		shouldFree = false;
@@ -2827,7 +2903,7 @@ appendonly_insert_init(Relation rel, int segno, int64 num_rows)
	 */
	aoInsertDesc->appendOnlyMetaDataSnapshot = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));

-	aoInsertDesc->mt_bind = create_memtuple_binding(RelationGetDescr(rel));
+	aoInsertDesc->mt_bind = create_memtuple_binding(RelationGetDescr(rel), RelationGetNumberOfAttributes(rel));

	aoInsertDesc->appendFile = -1;
	aoInsertDesc->appendFilePathNameMaxLen = AOSegmentFilePathNameLen(rel) + 1;
2 changes: 1 addition & 1 deletion src/backend/access/appendonly/appendonlyam_handler.c
@@ -1290,7 +1290,7 @@ appendonly_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,

	SIMPLE_FAULT_INJECTOR("cluster_ao_seq_scan_begin");

-	mt_bind = create_memtuple_binding(oldTupDesc);
+	mt_bind = create_memtuple_binding(oldTupDesc, oldTupDesc->natts);

	while (appendonly_getnextslot(aoscandesc, ForwardScanDirection, slot))
	{
