Propagate DDL commands to workers through 2PC #513

lithp · 2016-05-10T07:34:17Z

(update to issue to track DDL propagation through 2PC here)

Citus 5.0 propagates Alter Table and Create Index commands to worker nodes. We implemented this feature using Citus' current replication model. We also decided to switch to using 2PC (or pg_paxos) once the metadata propagation changes were implemented.

One drawback to the current approach is when ALTER TABLE ... SET NOT NULL fails it marks shards inactive:

postgres=# ALTER TABLE ddl_testing ALTER COLUMN address SET NOT NULL;
WARNING:  could not receive query results from [snip]
DETAIL:  Client error: column "address" contains null values
WARNING:  could not apply command on shard 102025 on [snip]
DETAIL:  Shard placement will be marked as inactive.
ALTER TABLE

If the problem was caused by the user and not by a failed node we probably shouldn't mark the placements inactive. We should instead somehow error out like we do when you try to create a duplicate index.

The text was updated successfully, but these errors were encountered:

ozgune · 2016-05-10T22:41:13Z

@lithp -- we're planning on improving DDL propagation logic to make sure that we don't have partial failures.

Do you think #25 or #265 captures the issue mentioned here?

lithp · 2016-05-11T07:32:37Z

Yep, #25, or better DDL propogation, relates to this issue. Added a comment over there.

ozgune · 2016-06-01T18:56:23Z

This issue is also related to #19. We're prioritizing this issue assuming that we could use a shared infrastructure for #19 and this one (#513).

marcocitus · 2016-06-01T23:57:20Z

I think the existing 2PC infrastructure for COPY and master_modify_multiple_shards (see multi_transaction.c) is suitable for this. Metadata propagation for masterless will also reuse some of that infrastructure.

metdos · 2016-06-02T09:00:27Z

Also related to inactive shards problem here #480.

ozgune · 2016-06-09T21:39:29Z

I'm copy/pasting an internal email thread as additional context to this issue.

Marco: Speculation: A DDL command might have been blocked on something and then failed when the node crashed, which currently causes it to mark the placement as inactive, since it's no longer in sync with the other shards and an operator needs to step in to apply the command manually. In 5.2, we'll have DDL via 2PC, which partially that problem.

Lukas: You are right - I was indeed creating an index around that time. It might have been a DDL command that was running, not (just) COPY.

I'm not sure on logs, since the server was replaced after the crash, and we don't store historic per-node logs elsewhere right now (afaik, Daniel can confirm).

aamederen · 2016-06-17T13:50:01Z

After discussions with @marcocitus and @sumedhpathak, we have decided our approach to this issue.

The resolution of this issue could in the future also address other DDL related issues such as #480, #131, #356, #357, #265 and #192.

In the solution, we will focus on 2PC while also handling #480. Besides,

We will open a new connection per shard, instead of per worker, for considering future work about Improve error messages on failed unique indexes #480.
We will use xact handler for abort/commit mechanism, since it will make it easier to handle Make shard creation atomic #265 in the future.

Fixes #513 Fixes #480 This change modifies the DDL Propagation logic so that DDL queries are propagated via 2-Phase Commit protocol. This way, failures during the execution of distributed DDL commands will not leave the table in an intermediate state. The workflow of the successful case is this: 1. Open individual connections to all shard placements 2. Send `BEGIN; SELECT worker_apply_shard_ddl_command(<shardId>, <DDL Command>)` to all connections, one by one, in a serial manner. 3. Send `PREPARE TRANSCATION <transaction_id>` to all connections. 4. Sedn `COMMIT` to all connections. Failure cases: - If a worker problem occurs before sending of all DDL commands is finished, then all changes are rolled back. - If a worker problem occurs after all DDL commands are sent but not after `PREPARE TRANSACTION` commands are finished, then all changes are rolled back. However, if a worker node is failed, then the prepared transactions in that worker should be rolled back manually. - If a worker problem occurs during `COMMIT PREPARED` statements are being sent, then the prepared transactions on the failed workers should be commited manually. - If master fails before the first 'PREPARE TRANSACTION' is sent, then nothing is changed on workers. - If master fails during `PREPARE TRANSACTION` commands are being sent, then the prepared transactions on workers should be rolled back manually. - If master fails during `COMMIT PREPARED` or `ROLLBACK PREPARED` commands are being sent, then the remaining prepared transactions on the workers should be handled manually.

Fixes #513 Fixes #480 This change modifies the DDL Propagation logic so that DDL queries are propagated via 2-Phase Commit protocol. This way, failures during the execution of distributed DDL commands will not leave the table in an intermediate state. The workflow of the successful case is this: 1. Open individual connections to all shard placements and send `BEGIN` 2. Send `SELECT worker_apply_shard_ddl_command(<shardId>, <DDL Command>)` to all connections, one by one, in a serial manner. 3. Send `PREPARE TRANSCATION <transaction_id>` to all connections. 4. Sedn `COMMIT` to all connections. Failure cases: - If a worker problem occurs before sending of all DDL commands is finished, then all changes are rolled back. - If a worker problem occurs after all DDL commands are sent but not after `PREPARE TRANSACTION` commands are finished, then all changes are rolled back. However, if a worker node is failed, then the prepared transactions in that worker should be rolled back manually. - If a worker problem occurs during `COMMIT PREPARED` statements are being sent, then the prepared transactions on the failed workers should be commited manually. - If master fails before the first 'PREPARE TRANSACTION' is sent, then nothing is changed on workers. - If master fails during `PREPARE TRANSACTION` commands are being sent, then the prepared transactions on workers should be rolled back manually. - If master fails during `COMMIT PREPARED` or `ROLLBACK PREPARED` commands are being sent, then the remaining prepared transactions on the workers should be handled manually.

Fixes #513 This change modifies the DDL Propagation logic so that DDL queries are propagated via 2-Phase Commit protocol. This way, failures during the execution of distributed DDL commands will not leave the table in an intermediate state. The workflow of the successful case is this: 1. Open individual connections to all shard placements and send `BEGIN` 2. Send `SELECT worker_apply_shard_ddl_command(<shardId>, <DDL Command>)` to all connections, one by one, in a serial manner. 3. Send `PREPARE TRANSCATION <transaction_id>` to all connections. 4. Sedn `COMMIT` to all connections. Failure cases: - If a worker problem occurs before sending of all DDL commands is finished, then all changes are rolled back. - If a worker problem occurs after all DDL commands are sent but not after `PREPARE TRANSACTION` commands are finished, then all changes are rolled back. However, if a worker node is failed, then the prepared transactions in that worker should be rolled back manually. - If a worker problem occurs during `COMMIT PREPARED` statements are being sent, then the prepared transactions on the failed workers should be commited manually. - If master fails before the first 'PREPARE TRANSACTION' is sent, then nothing is changed on workers. - If master fails during `PREPARE TRANSACTION` commands are being sent, then the prepared transactions on workers should be rolled back manually. - If master fails during `COMMIT PREPARED` or `ROLLBACK PREPARED` commands are being sent, then the remaining prepared transactions on the workers should be handled manually. This change also helps with #480, since failed DDL changes no longer mark failed placements as inactive.

marcocitus · 2016-07-15T13:27:44Z

We are making 2 changes to the way DDL commands are used.

The first is that we will prevent DDL in a transaction block, since that would likely lead to various issues in combination with other commands. Currently we allow DDL Commands in transaction blocks, though actually doing so would be very dangerous. It is probably a 1 week task to properly support multi-statement DDL transactions.

The second is that it will error out if max_prepared_transactions is not set on the workers, since the DDL propagation always uses 2PC. We initially considered reusing multi_shard_commit_protocol which defaults to 1pc, but recovery from commit failures is difficult in that case. In MX we have a function to recover from 2PC failures, which would take 2-3 days to integrate.

This means you cannot CREATE INDEX on a distributed table without setting max_prepared_transactions on the workers to >0

Any strong objections to these changes?

Fixes #513 This change modifies the DDL Propagation logic so that DDL queries are propagated via 2-Phase Commit protocol. This way, failures during the execution of distributed DDL commands will not leave the table in an intermediate state and the pending prepared transactions can be commited manually. DDL commands are not allowed inside other transaction blocks or functions. DDL commands are performed with 2PC regardless of the value of `citus.multi_shard_commit_protocol` parameter. The workflow of the successful case is this: 1. Open individual connections to all shard placements and send `BEGIN` 2. Send `SELECT worker_apply_shard_ddl_command(<shardId>, <DDL Command>)` to all connections, one by one, in a serial manner. 3. Send `PREPARE TRANSCATION <transaction_id>` to all connections. 4. Sedn `COMMIT` to all connections. Failure cases: - If a worker problem occurs before sending of all DDL commands is finished, then all changes are rolled back. - If a worker problem occurs after all DDL commands are sent but not after `PREPARE TRANSACTION` commands are finished, then all changes are rolled back. However, if a worker node is failed, then the prepared transactions in that worker should be rolled back manually. - If a worker problem occurs during `COMMIT PREPARED` statements are being sent, then the prepared transactions on the failed workers should be commited manually. - If master fails before the first 'PREPARE TRANSACTION' is sent, then nothing is changed on workers. - If master fails during `PREPARE TRANSACTION` commands are being sent, then the prepared transactions on workers should be rolled back manually. - If master fails during `COMMIT PREPARED` or `ROLLBACK PREPARED` commands are being sent, then the remaining prepared transactions on the workers should be handled manually. This change also helps with #480, since failed DDL changes no longer mark failed placements as inactive.

lithp added the bug label May 10, 2016

lithp mentioned this issue May 11, 2016

Propagate DDL commands to workers v2 #25

Closed

ozgune changed the title ~~When ALTER TABLE ... SET NOT NULL fails it marks shards inactive~~ Propagate DDL commands to workers through 2PC Jun 1, 2016

ozgune added this to the 5.2 Release milestone Jun 1, 2016

ozgune added the 1-2 weeks label Jun 1, 2016

sumedhpathak added the planned label Jun 4, 2016

sumedhpathak mentioned this issue Jun 7, 2016

Improve error messages on failed unique indexes #480

Closed

aamederen self-assigned this Jun 8, 2016

aamederen mentioned this issue Jun 22, 2016

Propagate DDL Commands with 2PC #618

Merged

aamederen added needs review and removed planned labels Jun 22, 2016

aamederen closed this as completed in #618 Jul 19, 2016

aamederen removed the needs review label Jul 19, 2016

ozgune mentioned this issue Jul 23, 2016

Make shard creation atomic #265

Closed

ozgune mentioned this issue Aug 1, 2017

DDL commands (Alter Table, Create Index) display Notice messages that may confuse users #1531

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Propagate DDL commands to workers through 2PC #513

Propagate DDL commands to workers through 2PC #513

lithp commented May 10, 2016 •

edited by ozgune

ozgune commented May 10, 2016 •

edited

lithp commented May 11, 2016

ozgune commented Jun 1, 2016

marcocitus commented Jun 1, 2016 •

edited

metdos commented Jun 2, 2016

ozgune commented Jun 9, 2016

aamederen commented Jun 17, 2016 •

edited by ozgune

marcocitus commented Jul 15, 2016 •

edited

Propagate DDL commands to workers through 2PC #513

Propagate DDL commands to workers through 2PC #513

Comments

lithp commented May 10, 2016 • edited by ozgune

ozgune commented May 10, 2016 • edited

lithp commented May 11, 2016

ozgune commented Jun 1, 2016

marcocitus commented Jun 1, 2016 • edited

metdos commented Jun 2, 2016

ozgune commented Jun 9, 2016

aamederen commented Jun 17, 2016 • edited by ozgune

marcocitus commented Jul 15, 2016 • edited

lithp commented May 10, 2016 •

edited by ozgune

ozgune commented May 10, 2016 •

edited

marcocitus commented Jun 1, 2016 •

edited

aamederen commented Jun 17, 2016 •

edited by ozgune

marcocitus commented Jul 15, 2016 •

edited