Support fast ALTER TABLE ADD COLUMN for appendonly row-oriented tables
Since 16828d5, heap tables can avoid a whole-table rewrite for ADD COLUMN with a non-NULL default value. They achieve that by storing the evaluated default value in pg_attribute. For any existing tuple, such a new attribute is regarded as "missing", and when we read the tuple we fill the missing attributes with the values stored in pg_attribute. Any newly inserted tuple stores the value physically on file.

This commit does the same for ao_row tables. The key difficulty is that, unlike heap tuples, the memtuples used by ao_row tables do not store the number of attributes in their header, so we do not know how many attributes are missing.

The idea (kudos to @ashwinstar) to solve this is that, during ADD COLUMN, we record the current last row number in each segment of the AO table. Since row numbers are assigned monotonically, this value tells us whether a given memtuple carries the corresponding attribute value or not, and hence how many attributes are present in the memtuple and how many are missing. The algorithm (see `AOExecutorReadBlockBindingInit`) is:

1. the number of stored attributes = the largest column number `colno` whose `last_row_number` is smaller than or equal to the row number of the memtuple we are reading;
2. the number of missing attributes = the total number of attributes minus the number of stored attributes.

We store the last row numbers in a new field `lastrownums` in pg_attribute_encoding. It is only populated by ADD COLUMN. During a table rewrite, we remove all those entries because the `lastrownums` are no longer needed.

Other notable things:

1. Whenever we remove fastsequence entries (as in TRUNCATE), we erase the lastrownums field.
We do this instead of removing the pg_attribute_encoding entries entirely because we want to support the same optimization for CO tables in the future, and we certainly couldn't remove pg_attribute_encoding entries for CO tables in TRUNCATE, since CO tables still need them for the encoding options (pg_attribute_encoding.attoptions). So better to do the right thing now.

2. We do not store any invalid fastsequence number '0' in lastrownums besides the initial one for RESERVED_SEGNO (segno=0). This saves space in the catalog because in many cases only the first few segments are used. Note that this works only because two assumptions hold:
   a. we only pick unused segments from low segno to high (see choose_new_segfile());
   b. once a segment is used, its fastsequence number is always greater than 0.
If these assumptions are ever broken, we would have to store everything in the lastrownums field, or find some other way to save space.

3. We may now update pg_attribute_encoding more than once in a single command (e.g. when we drop an AO table, we drop the pg_attribute_encoding entries first, and then remove the gp_fastsequence entries, which tries to clear pg_attribute_encoding again). So we now increment the command counter in RemoveAttributeEncodingsByRelid(). However, that causes (or reveals) two issues with ATSETAM:
   a. In swap_relation_files(), we call RemoveAttributeEncodingsByRelid after we've swapped/transferred the pg_appendonly entries (ATAOEntries). So when we invalidate caches as part of the command counter increment, we might fail to find the pg_appendonly entry for a table whose pg_appendonly entry we have already removed but whose relam we haven't changed yet. Basically, we must not increment the command counter between ATAOEntries and changing relam. Added a comment for that.
   b. After we have swapped relam and updated pg_class, we have to increment the command counter so that whatever follows sees the table with the proper AM.
Otherwise, problems occur: e.g. after an AO->heap conversion, when we reindex the table we might still see it as AO and fail to build the relation descriptor. This is revealed by the RemoveAttributeEncodingsByRelid change, because we made the pg_appendonly change visible but not the relam change.

Performance-wise, there is some overhead added to calculate the number of stored attributes, but it is not excessive, for two reasons:

1. The time is proportional to the total number of attributes in the table, which is usually small and has an upper limit (1600). For each attribute number, all we need is a pointer dereference and a numeric comparison.
2. While scanning a varblock, we only need to do this work once, since the number of stored attributes cannot change within a varblock.

Added two sets of tests in regress/isolation2 for single-client and concurrency testing, respectively.

Fix #14929