This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

swap, swap/chain, contracts/swap: add transaction queue #2124

Open · wants to merge 21 commits into base: master

Conversation

@ralph-pichler (Member) commented Mar 3, 2020:

This PR introduces a central component, the TxScheduler, which is in charge of sending transactions, waiting for their results and ensuring those results are successfully processed. In the future this component is supposed to take care of more chain-related tasks.

The TxScheduler is an interface, currently implemented only by the TxQueue, which executes transactions in sequence (see #2006; the next transaction is only sent after the previous one has confirmed) in order to avoid most nonce-related issues (#1929).

The general idea behind the TxScheduler is as follows (an interface sketch follows the list):

  • A component which wishes to send a transaction does not do so directly; instead it creates a chain.TxRequest and schedules it with the TxScheduler, which returns an assigned id and takes care of the rest.
  • Every request has an associated handlerID which specifies which handler to notify of events for this request. A component should register itself on startup as the handler for the handlerIDs it uses.
  • The TxScheduler will execute the requests at some point and notify the appropriate handler. If the handler function fails, the notification does not count as delivered and will be retried in the future. This guarantee is preserved even across restarts of the swarm client. The idea is that other places in the code which need to send transactions no longer need to be concerned with issues like IO errors, network problems or client restarts: they just queue the request and are guaranteed to be notified of its result at some point.
  • When scheduling the request, the component can also attach extra data which is stored alongside the request data. The purpose is to provide meta-information about the request that is stored within the same atomic write.
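
To make the design concrete, here is a minimal sketch of what such an interface could look like. This is illustrative only: apart from TxScheduler and chain.TxRequest, all identifiers (the method names, the handler interface and the TxRequest fields) are assumptions, not the PR's actual code.

package chain

import "math/big"

// TxRequest describes a transaction to be sent (fields abbreviated for this sketch).
type TxRequest struct {
	To    [20]byte
	Data  []byte
	Value *big.Int
}

// TxRequestHandler is notified of the outcome of a scheduled request.
// A non-nil return value means the notification does not count as delivered
// and will be retried later, even across client restarts.
type TxRequestHandler interface {
	NotifySuccess(id uint64) error
	NotifyCancelled(id uint64) error
	NotifyStatusUnknown(id uint64) error
}

// TxScheduler is the central component in charge of sending transactions.
type TxScheduler interface {
	// SetHandler registers the handler for handlerID; a component should
	// call this on startup for every handlerID it uses.
	SetHandler(handlerID string, handler TxRequestHandler)
	// ScheduleRequest stores the request and the caller's extra data in a
	// single atomic write and returns the assigned request id.
	ScheduleRequest(handlerID string, request TxRequest, extraData interface{}) (id uint64, err error)
	Start()
	Stop()
}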

For the TxQueue, transactions are processed the following way (a simplified sketch of this loop follows the list):

  • Scheduled requests go into a queue which is persisted to disk to ensure nothing is lost on node shutdown or crashes. For this, the PersistentQueue was introduced as a helper structure.
  • The queue then processes those requests in a loop:
    • It takes a request from the queue and makes it the active request
    • It sets gas limit, gas price (if necessary) and the nonce and signs the transaction
    • It sends the transaction to the backend
    • It waits for a receipt to be available
  • If the node shuts down, it continues processing the active request on the next startup.
  • If anything in the process fails prior to sending, the transaction counts as cancelled; otherwise its status counts as "unknown".
  • On IO / decode errors the queue terminates. Some of these might be recoverable after a while, and future PRs might attempt to restart the queue at some point.
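
As a reading aid, here is a simplified, self-contained sketch of that loop. The txQueue type and its function fields are stand-ins for the real implementation (which persists its state and works with real Ethereum transactions); only the control flow mirrors the steps above.

package chain

import "context"

type request struct{ id uint64 }

type txQueue struct {
	nextRequest func(context.Context) (*request, error) // take the next request and make it the active one
	prepare     func(*request) error                    // set gas limit, gas price and nonce; sign the transaction
	send        func(context.Context, *request) error   // hand the signed transaction to the backend
	waitReceipt func(context.Context, *request) error   // block until a receipt is available
	notify      func(id uint64, status string)          // deliver the result to the registered handler
}

func (q *txQueue) loop(ctx context.Context) {
	for {
		req, err := q.nextRequest(ctx)
		if err != nil {
			return // IO / decode errors terminate the queue
		}
		if err := q.prepare(req); err != nil {
			q.notify(req.id, "cancelled") // failed before sending
			continue
		}
		if err := q.send(ctx, req); err != nil {
			q.notify(req.id, "status unknown") // the tx may or may not have reached the network
			continue
		}
		if err := q.waitReceipt(ctx, req); err != nil {
			q.notify(req.id, "status unknown")
			continue
		}
		q.notify(req.id, "confirmed")
	}
}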

On the cashing-out side, the CashoutProcessor now accepts a CashoutResultHandler which it notifies of the cashout result. This is usually Swap. During tests this handler is overridden to keep track of cashed-out cheques. This mechanism replaces the cashDone channel on the backend and therefore obsoletes the global cashCheque function variable and the setupContractTest function.
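
A sketch of what that handler hook might look like; the method name and the stub result types here are placeholders, not necessarily the PR's exact signatures.

package swap

// CashoutRequest and CashoutResult stand in for the real types, whose fields
// are elided in this sketch.
type CashoutRequest struct{}
type CashoutResult struct{}

// CashoutResultHandler is notified once a cashout transaction has been
// processed. In production this is the Swap object; tests substitute their
// own implementation to keep track of cashed-out cheques.
type CashoutResultHandler interface {
	// Returning an error means the notification will be redelivered later.
	HandleCashoutResult(request *CashoutRequest, result *CashoutResult) error
}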

In this PR only the cashout transactions for the chequebook contract go through this mechanism. This was done to keep the PR small. Smaller future PRs should:

  • handle the error cases in the cashout processor (status unknown / cancelled)
  • move the deposit transactions to this mechanism
  • add confirmation monitoring
  • ensure the node is synced when generating transactions
  • add the ability for requests to expire (so that txs are not suddenly sent months later), OR the easier alternative of just marking all non-pending requests as cancelled on startup; the initiator then has to decide in the handler whether or not to resend.

This PR is quite large, so it might be useful to look at the commits individually. The PR is split into three main commits: the first introduces the PersistentQueue, the second implements the actual queue and the third integrates it with the cashout transactions.

closes #2006
closes #2005
closes #1929
closes #1634

@ralph-pichler ralph-pichler self-assigned this Mar 3, 2020
@ralph-pichler ralph-pichler requested a review from janos March 3, 2020 20:23
@ralph-pichler ralph-pichler changed the title swap, contract/swap: add transaction queue swap, swap/chain, contracts/swap: add transaction queue Mar 3, 2020
@Eknir (Contributor) left a comment:

Amazing work!!

My comments are mainly requests for explanation, but at points I also note that you have something defined but not implemented. Perhaps, in these cases, it is better to leave the definition out and add it when you actually need it (especially since this PR will be used as a reference by developers who are going to integrate the queue with other blockchain interactions in Swarm).

I do think that a markdown file, perhaps with diagrams, would help to understand the architecture of this PR and assist developers who are going to work with this code. Maybe @vojtechsimetka could help with this.

After you have addressed my comments and clarified things for me, I would like to go over this PR once again.

Review comments on: contracts/swap/swap.go, swap/chain/txscheduler.go, swap/cashout.go, swap/chain/persistentqueue.go, swap/chain/persistentqueue_test.go, swap/chain/txqueue.go
@janos (Member) left a comment:

This is quite a long PR and most of the implementation is pretty great. 👏 My comments are related to lock handling, which I think can be better designed.

// A lock must be held and kept until after the trigger function was called or the batch write failed
func (pq *PersistentQueue) Queue(b *state.StoreBatch, v interface{}) (key string, trigger func(), err error) {
// the nonce guarantees keys don't collide if multiple transactions are queued in the same second
pq.nonce++
Member:

A possible data race on the nonce field on concurrent Queue calls. There is a comment about the lock, but it would be nicer to have an API where locking is implicit.


lock.Lock()
key, exists, err = pq.Peek(i)
if exists {
return key, nil
Member:

A deadlock if exists, as the lock is not unlocked.


lock.Lock()
key, exists, err := pq.Peek(i)
if exists {
return key, nil
Member:

A deadlock if exists, as the lock is not unlocked.


Comment on lines 85 to 88
// No lock should be held when this is called. Only a single call to Next may be active at any time
// If the key is not "", the value exists, the supplied lock was acquired and must be released by the caller after processing the item
// The supplied lock should be the same that is used for the other functions
func (pq *PersistentQueue) Next(ctx context.Context, i interface{}, lock *sync.Mutex) (key string, err error) {
Member:

Lock handling is a bit strange. The function can return leaving the lock either unlocked or locked. I think the lock should be internal to the implementation, not exposed through the package API.

I think it would be better to protect the queue with an internal lock than to rely on the queue user to do the locking. With wrong usage it is easy to unlock an already-unlocked lock or to cause a deadlock.

Batch processing and writing require the lock, but that could be encapsulated by different functions.
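
For illustration, a sketch of the internal-lock design suggested here, with made-up names. This is not a drop-in replacement, since it ignores the batch/trigger coordination discussed below; it only shows how the nonce race disappears when the mutex lives inside the queue.

package chain

import (
	"fmt"
	"sync"
)

type internalQueue struct {
	mu    sync.Mutex
	nonce uint64
	keys  []string
}

// Queue is safe for concurrent use; callers never touch the lock.
func (q *internalQueue) Queue(v interface{}) (key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.nonce++ // no data race: nonce is only touched under q.mu
	key = fmt.Sprintf("%016x", q.nonce)
	q.keys = append(q.keys, key)
	return key
}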

Member Author:

This locking design is a leftover from an earlier version of the code in which the request queue and a notification queue were modified at the same time, and I wanted to avoid holding three locks of three different objects at once, as the risk of deadlock seemed high. That case no longer exists in this version, so perhaps a lock can be put back into pq. I will attempt a redesign next week.

Contributor:

agree with @janos here.

we can have a separate PR for this

@ralph-pichler (Member Author) commented Mar 27, 2020:

I would suggest leaving it as is for now. Using any of the pq functions without holding the main txqueue lock is always wrong. A lock managed internally by the pq is insufficient: to make sure that batches don't overlap and trigger signals are not missed, locking is required beyond the scope of single functions. I also tried various approaches that made the locking part of the batch itself, but those only complicated things further.

I think for now we should consider persistentQueue (which is now unexported; it was never supposed to be used elsewhere anyway) a helper structure that is used exclusively by the txqueue and therefore shares its lock. I think we should merge this to finish work on this codebase and, if necessary, consider further redesigns when migrating transaction sending to bee.

Member:

My opinion is that we should not share the lock this way, as it creates code that is hard to maintain. I would not like to approve it in this state.

If we are focusing on bee and will not add new features to the swarm repo, we do not need to merge this PR now; we can leave it for the bee project.

Member Author:

Fair enough. I'll leave this PR open for further experimentation. Then we can either still merge it at some point in the future or continue the redesign in a PR on the bee repo (although I assume it will take a while until we get to tx sending there).


count := 200

var errout error // stores the last error that occurred in one of the routines
Member:

errout can race, as two goroutines may set it at the same time.
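
A sketch of one way to avoid such a race: guard the shared variable with a mutex. The actual fix in the follow-up commit may differ; the failure here is simulated.

package main

import (
	"errors"
	"fmt"
	"sync"
)

func main() {
	count := 200
	var (
		mu     sync.Mutex
		errout error // last error that occurred in one of the goroutines
		wg     sync.WaitGroup
	)
	for i := 0; i < count; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i%50 == 0 { // stand-in for a real failure in the routine
				mu.Lock()
				errout = errors.New("simulated failure")
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(errout)
}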

Member Author:

fixed

@@ -0,0 +1,7 @@
package chain
Member:

Add copyright header to every new file in this package.

Member Author:

done

func (txq *TxQueue) waitForNextRequest() (requestMetadata *txRequestData, err error) {
var id uint64
// get the id of the next request in the queue
key, err := txq.requestQueue.Next(txq.ctx, &id, &txq.lock)
Member:

This also relates to the lock usage. I have no concrete suggestion for how to implement it differently, but tracking lock state across TxQueue and PersistentQueue may make deadlocks or data races quite hard to debug.


@@ -126,6 +129,7 @@ func newTestSwap(t *testing.T, key *ecdsa.PrivateKey, backend *swapTestBackend)
usedBackend = newTestBackend(t)
}
swap, dir := newBaseTestSwap(t, key, usedBackend)
swap.txScheduler.Start()
Member:

Check for the returned error.

Member Author:

It doesn't return an error.

@mortelli (Contributor) left a comment:

a few questions from my first pass through the code:

  • could you give a conceptual example of another struct that would implement the TxScheduler interface, other than TxQueue? (i want to make sure i understand the difference between these 2 in terms of responsibilities)
  • why does persistentQueue have a prefix, if all entries have their own separate keys? is it the idea to have multiple persistentQueue structs in the same state.Store? is this the case already?
  • i understand the situation of having a transaction with an unknown status, but why is there a func to actually notify this? would this take place in the future, when we allow transactions to expire, or is it already happening?
  • regarding future PRs: can you please explain what the node's actions would be in terms of confirmation monitoring? would this basically be issue #1633 (Wait for sufficient amount of transaction confirmations)?

i definitely will review this PR again (even if it is merged before i manage to do so) as i would like to have a more in-depth understanding of some of the code here.

looks good so far though 👍

Review comments on: contracts/swap/swap.go, swap/chain/persistentqueue.go
// It returns the generated key and a trigger function which must be called once the batch was successfully written
// This only returns an error if the encoding fails, which is an unrecoverable error
// A lock must be held and kept until after the trigger function was called or the batch write failed
func (pq *PersistentQueue) Queue(b *state.StoreBatch, v interface{}) (key string, trigger func(), err error) {
Contributor:

as a developer, i'm not sure this comment

// call trigger function after writing to the batch to prevent undefined behaviour

is really clear about what to do here.

but it would at least be a sign that i would have to be careful when using these functions
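
To spell out the intended order of operations, here is a self-contained sketch with simplified stand-in types (not the real state.StoreBatch API): lock, queue into the batch, write the batch, and only then fire the trigger.

package chain

import "sync"

// Simplified stand-ins for the real store and batch types.
type storeBatch struct{ writes []string }
type store struct{ data []string }

func (s *store) writeBatch(b *storeBatch) error {
	s.data = append(s.data, b.writes...)
	return nil
}

type persistentQueue struct{ lock *sync.Mutex }

// queue mimics PersistentQueue.Queue: it records the entry in the batch and
// returns a trigger that may only be called after a successful batch write.
func (pq *persistentQueue) queue(b *storeBatch, v string) (key string, trigger func(), err error) {
	b.writes = append(b.writes, v)
	return "key", func() { /* wake up a waiting Next call */ }, nil
}

// enqueue shows the required order of operations.
func enqueue(pq *persistentQueue, s *store, v string) error {
	pq.lock.Lock()
	defer pq.lock.Unlock() // hold the lock until after trigger or a failed write

	b := new(storeBatch)
	_, trigger, err := pq.queue(b, v)
	if err != nil {
		return err // encoding failed; nothing was queued
	}
	if err := s.writeBatch(b); err != nil {
		return err // batch write failed: trigger must NOT be called
	}
	trigger() // only after the batch was successfully written
	return nil
}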

Review comments on: swap/chain/persistentqueue.go, swap/chain/txqueue.go
}

// ToSignedTx returns a signed types.Transaction for the given request and nonce
func (request *TxRequest) ToSignedTx(nonce uint64, opts *bind.TransactOpts) (*types.Transaction, error) {
Contributor:

i am for ToSignedTx since it operates on the receiver request

Review comments on: contracts/swap/swap.go, swap/cashout_test.go
@ralph-pichler (Member Author):

@mortelli

could you give a conceptual example of another struct that would implement the TxScheduler interface, other than TxQueue? (i want to make sure i understand the difference between these 2 in terms of responsibilities)

An alternative would be a scheduler which tracks the nonce count locally and allows parallel requests instead of queueing. Another one might be a simple mock for testing the rest of the code without running the entire queue mechanism (see the sketch below).
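
For the mock case, a minimal sketch, reusing the hypothetical interface names from the sketch near the top of this page; the real test mock may look different.

package chain

// mockTxScheduler records scheduled requests and lets a test decide when and
// how to notify the handlers.
type mockTxScheduler struct {
	handlers map[string]TxRequestHandler
	requests map[uint64]TxRequest
	nextID   uint64
}

func (m *mockTxScheduler) SetHandler(handlerID string, h TxRequestHandler) {
	if m.handlers == nil {
		m.handlers = make(map[string]TxRequestHandler)
	}
	m.handlers[handlerID] = h
}

func (m *mockTxScheduler) ScheduleRequest(handlerID string, r TxRequest, extra interface{}) (uint64, error) {
	if m.requests == nil {
		m.requests = make(map[uint64]TxRequest)
	}
	m.nextID++
	m.requests[m.nextID] = r // never actually sent; the test inspects this
	return m.nextID, nil
}

func (m *mockTxScheduler) Start() {}
func (m *mockTxScheduler) Stop()  {}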

why does persistentQueue have a prefix, if all entries have their own separate keys? is it the idea to have multiple persistentQueue structs in the same state.Store? is this the case already?

Yes, there are already multiple in the same store now: the request queue plus one notification queue per handler. Also, this is the same state store that swap uses in production, and we want to avoid key collisions.

i understand the situation of having a transaction with an unknown status, but why is there a func to actually notify this? would this take place in the future, when we allow transactions to expire, or is it already happening?

This can already happen now if transactions don't confirm in time or backend.SendTransaction fails. The notification exists so the sending component can react accordingly. What it does then depends on the transaction (e.g. we would never reattempt a deposit, but retrying a cashout transaction with unknown status might be reasonable).

regarding future PRs: can you please explain what the node's actions would be in terms of confirmation monitoring? would this be basically issue #1633?

Not fully sure about that yet. There would at least be a notification once the required number of confirmations has been reached.
