Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use correct snapshot when scanning columnar tables #5154

Merged
merged 4 commits into from
Sep 2, 2021

Conversation

onurctirtir
Copy link
Member

@onurctirtir onurctirtir commented Jul 30, 2021

Fixes #5180.

i) Instead of using xact snapshot, use the snapshot provided to columnarAM
when scanning table. This is the main purpose of this pr, also see #5154 (comment).

ii) Let me also explain the tricky part of this pr (bf4dfad):

To avoid visibility issues around pending writes that are force-flushed
to disk when scanning the table in the same xact, update curcid of given
snapshot before starting to scan the columnar table.

The reason behind such visibility issues is that we increment command counter
after modifying columnar metadata tables when flushing pending writes.

Now that we don't always use xact snapshot to scan a columnar table,
writes that we just flushed might not be visible to the query that made
pending writes flushed to disk since curcid of provided snapshot would
become smaller than the command id being used when modifying columnar
metadata tables.

To give an example, before bf4dfad, below was a possible scenario
after the changes done in i):

CREATE TABLE t(a int, b int) USING columnar;
BEGIN;
  INSERT INTO t VALUES (5, 10);

  SELECT * FROM t;
  ┌───┬───┐
  │ a │ b │
  ├───┼───┤
  └───┴───┘
  (0 rows)

  SELECT * FROM t;
  ┌───┬────┐
  │ a │ b  │
  ├───┼────┤
  │ 510 │
  └───┴────┘
  (1 row)

@codecov
Copy link

codecov bot commented Jul 30, 2021

Codecov Report

Merging #5154 (1a9cb20) into master (262f893) will decrease coverage by 0.00%.
The diff coverage is 85.71%.

❗ Current head 1a9cb20 differs from pull request most recent head 21235c1. Consider uploading reports for the commit 21235c1 to get more accurate results

@@            Coverage Diff             @@
##           master    #5154      +/-   ##
==========================================
- Coverage   94.20%   94.20%   -0.01%     
==========================================
  Files         214      214              
  Lines       42121    42133      +12     
==========================================
+ Hits        39682    39693      +11     
- Misses       2439     2440       +1     

@onurctirtir onurctirtir force-pushed the col/optimize-index-read branch from 45da562 to 73058d3 Compare August 2, 2021 08:02
Base automatically changed from col/optimize-index-read to master August 2, 2021 08:06
@onurctirtir onurctirtir force-pushed the col/use-correct-snapshot branch from 5b1ba38 to fba4998 Compare August 2, 2021 08:54
@onurctirtir
Copy link
Member Author

onurctirtir commented Aug 3, 2021

Doing some manual testing to see how the changes in this pr would affect the things and found a bug that happens on master but not on this branch:

drop table test;
create table test (a int unique) using columnar;

-- s1
begin;
insert into test select s from generate_series(1,2) s;
select * from test; -- force flush write state

-- s2
insert into test select s from generate_series(1,4) s;
ERROR:  unexpected chunk group count

This is because, on master, we use the snapshot provided to columnar_index_fetch_tuple to determine number of chunk groups that stripe has. However, again on master, ReadChunkGroupRowCounts passes snapshot as NULL (probably means using catalog snapshot) when scanning chunk group metadata and above error is thrown.

@onurctirtir onurctirtir added the bug label Aug 3, 2021
@onurctirtir onurctirtir force-pushed the col/use-correct-snapshot branch from 7a27802 to bd8771a Compare August 16, 2021 12:43
@onurctirtir onurctirtir force-pushed the col/use-correct-snapshot branch 8 times, most recently from 9969605 to c4fcb01 Compare August 19, 2021 10:08
Copy link
Contributor

@jeff-davis jeff-davis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review my comment and see if there's a way to clarify the code a bit. Other than that, it looks good.

Comment on lines 273 to 279
PushCopiedSnapshot(snapshot);

/* now our snapshot is the active one */
UpdateActiveSnapshotCommandId();
snapshot = GetActiveSnapshot();

snapshotCreatedByUs = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little hard for me to follow what's happening here. If we aren't using GetActiveSnapshot() anywhere, why not just pop it unconditionally after updating the command ID? And I don't think EndRead is guaranteed to be called, so that means we can end up leaving a snapshot on the stack for no reason.

And if using the active stack is useful, why not always use it, so we don't have to keep track of whether to pop it or not? That would make one code path rather than two, which is more likely to be exercised thoroughly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hy not just pop it unconditionally after updating the command ID?

Agree, this makes a lot of sense, thank you.

Copy link
Member Author

@onurctirtir onurctirtir Aug 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I guess pushing our snapshot to stack and not keeping it in readState seem to be the better one at first glance. That way, we can use GetActiveSnapshot() during read operation.
In that case, we would need to call PopActiveSnapshot in EndRead but I'm not sure what happens if EndRead is not guaranteed to be called.
Moreover, if snapshot is not an MVCC snapshot, then we don't push a new snapshot as the active one.
In that case, I think we would still need to keep snapshot in readState to use during the read operation.

If we go with the other option, i.e.: keeping snapshot in readState and immediately calling PopActiveSnapshot in BeginRead, I think this time we should make sure to Register / Unregister snapshot in BeginRead / EndRead. But then what happens if EndRead doesn't get called ?

For this reason, I think we should choose the second option, but not sure how to make sure we unregister the snapshot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like ColumnarEndRead will be callled even if the executor node doesn't run to completion (e.g. limit or exists query). I thought I remembered a case where it might not be called, but I'm not seeing it now.

So the Push/Update/Pop/Register approach seems best. Let me know if you run into a problem case.

@jeff-davis
Copy link
Contributor

Also, I was just thinking about the WITH HOLD cursor case. I don't think we even have tests for that, but it seems to basically work now. We might need to be careful about what resource owner owns any snapshot, etc.

Comment on lines 273 to 279
PushCopiedSnapshot(snapshot);

/* now our snapshot is the active one */
UpdateActiveSnapshotCommandId();
snapshot = GetActiveSnapshot();

snapshotCreatedByUs = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like ColumnarEndRead will be callled even if the executor node doesn't run to completion (e.g. limit or exists query). I thought I remembered a case where it might not be called, but I'm not seeing it now.

So the Push/Update/Pop/Register approach seems best. Let me know if you run into a problem case.

@onurctirtir
Copy link
Member Author

onurctirtir commented Aug 31, 2021

Also, I was just thinking about the WITH HOLD cursor case. I don't think we even have tests for that, but it seems to basically work now. We might need to be careful about what resource owner owns any snapshot, etc.

This comment made me do some testing with cursors, looks like changes in this pr doesn't help with the cases where FETCH ALL force-flushes pending writes. Note that below scenario is also reproduceable on older versions (at least on v10.1.2).
The root cause is that we build the readState when reading the first row, so we don't use the snaphot provided when running DECLARE command.

Even if we've called init_columnar_read_state in begin_read, we would still encounter with this issue when doing index-scan. This is because, index_begin_read doesn't provide a snapshot. For this reason, we have to use the snapshot passed to index_fetch_tuple there, which first calls init_columnar_read_state.

I think I should open issue without attempting to solve this in this pr, what do you think @jeff-davis ?

DROP TABLE columnar_table;
CREATE TABLE columnar_table(q1 int8, q2 int8) USING columnar;
INSERT INTO columnar_table VALUES (1, 2), (3, 4);

BEGIN;
  DECLARE columnar_cursor CURSOR WITH HOLD FOR
  SELECT * FROM columnar_table ORDER BY 1,2;

  INSERT INTO columnar_table VALUES (5, 6), (7, 8);

  -- bad, shows (5, 6) and (7, 8) as well :/
  -- heapAM would only show (1, 2) & (3, 4)  
  FETCH ALL FROM columnar_cursor;
┌────┬────┐
│ q1 │ q2 │
├────┼────┤
│  12 │
│  34 │
│  56 │
│  78 │
└────┴────┘
(4 rows)
COMMIT;

Comment on lines +275 to +283
/* now our snapshot is the active one */
UpdateActiveSnapshotCommandId();
snapshot = GetActiveSnapshot();
RegisterSnapshot(snapshot);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment that this doesn't fix the issue with cursors, open an issue as well

Instead of using xact snapshot, use the snapshot provided
to columnarAM when scanning table.
Seems that we always increment the command counter right after
finishing metadata table modification.

For this reason, it makes sense to call CommandCounterIncrement
within FinishModifyRelation.
In next commit, we will adjust curcid of the snapshot being used when
scanning the columnar table.

However, for index scan, snapshot is provided not when beginning scan
but within fetch-tuple call.

For this reason, start flushing pending writes in init_columnar_read_state
since this seem to be a prerequisite step that needs to be done before
scanning a columnar table regardless of the scan method being used.
Before starting to scan a columnar table, we always flush the pending
writes to disk.

However, we increment command counter after modifying metadata tables.

On the other hand, now that we _don't always use_ xact snapshot to scan
a columnar table, writes that we just flushed might not be visible to
the query that just flushed pending writes to disk since curcid of
provided snapshot would become smaller than the command id being used
when modifying metadata tables.

To give an example, before this change, below was a possible scenario
due to the changes that we made to use the correct snapshot.

```sql
CREATE TABLE t(a int, b int) USING columnar;
BEGIN;
  INSERT INTO t VALUES (5, 10);

  SELECT * FROM t;
  ┌───┬───┐
  │ a │ b │
  ├───┼───┤
  └───┴───┘
  (0 rows)

  SELECT * FROM t;
  ┌───┬────┐
  │ a │ b  │
  ├───┼────┤
  │ 5 │ 10 │
  └───┴────┘
  (1 row)
```
@onurctirtir onurctirtir force-pushed the col/use-correct-snapshot branch from bffcdda to bf4dfad Compare September 2, 2021 08:12
@onurctirtir onurctirtir merged commit 0bf2920 into master Sep 2, 2021
@onurctirtir onurctirtir deleted the col/use-correct-snapshot branch September 2, 2021 08:20
@onurctirtir onurctirtir added this to the 10.2 Release milestone Sep 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Columnar self-insert sometimes runs forever
2 participants