Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subnet knows about pending tx, but has stopped producing blocks #1115

Closed
tactical-retreat opened this issue Mar 5, 2024 · 18 comments
Closed
Labels
bug Something isn't working

Comments

@tactical-retreat
Copy link
Contributor

Describe the bug

I have 3 validator nodes that I registered prior to BLS signatures being required for registration. Since I want to use AWM, I had to wipe the staking info from the nodes, generate new node IDs, and re-register as validators.

To avoid having insufficient stake, I registered the nodes with 2000 stake instead of the default 20. Everything went pretty smoothly and the subnet came back up. I can fetch blocks. I can submit transactions.

But no new blocks are being produced, and there are no errors or clarifying info in the logs. The validators are definitely aware of the pending transaction. I logged into a validator and used cast to send some ether around:

INFO [03-05|01:29:09.174] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/internal/ethapi/api.go:1899: Submitted transaction hash=0xb0a87e8908229f5ff66523f52785a2bd768786428362476489c6d101c57bed17 from=0xc83F0Ddc642F3663190dfA40Ae95743CB831Fd97 nonce=23 recipient=0xc83F0Ddc642F3663190dfA40Ae95743CB831Fd97 value=1,000,000,000,000,000,000 type=2 gasFeeCap=40,000,000,000 gasTipCap=40,000,000,000 gasPrice=40,000,000,000
INFO [03-05|01:29:32.677] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/rpc/handler.go:526: Served eth_sendRawTransaction reqid=5 execTime="66.64µs"   procTime="68.05µs"   totalTime="68.19µs"  err="already known"

To Reproduce

Not sure how to reproduce this. or how to debug further.

I'm hoping there's some obvious thing I borked that can be fixed with a magical flag =(

@tactical-retreat tactical-retreat added the bug Something isn't working label Mar 5, 2024
@ceyonur
Copy link
Collaborator

ceyonur commented Mar 5, 2024

Tx has a nonce of 23. Did your chain accepted other 22 txs from this sender? What is your chain config? Could you send startup logs?

@tactical-retreat
Copy link
Contributor Author

Yeah, the nonce isn't it. I tried with several different mechanisms and several different accounts, including the faucet, ferdy flip dapp, and finally cast directly.

Startup logs for validator: https://gist.github.com/tactical-retreat/5692eca862f7156602d77e3b2135aa98

@tactical-retreat
Copy link
Contributor Author

I have no chain config for this subnet, just an upgrade file (to enable the warp precompile and to force set the teleporter code at the right location).

Genesis file is here: https://github.com/DeepWaterStudios/ff-subnet-public/blob/main/subnet-configuration/genesis-files/ferdy_testnet.json

@ceyonur
Copy link
Collaborator

ceyonur commented Mar 5, 2024

Could you run this and see what's reported there (assuming localhost;9650)?

curl --location 'http://localhost:9650/ext/bc/e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE/rpc' \
--header 'Content-Type: application/json' \
--data '{
    "jsonrpc": "2.0",
    "method": "eth_baseFee",
    "params": [],
    "id": 1
}'

@tactical-retreat
Copy link
Contributor Author

tactical-retreat commented Mar 5, 2024

$ curl --location 'http://localhost:9650/ext/bc/e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE/rpc' \
--header 'Content-Type: application/json' \
--data '{
    "jsonrpc": "2.0",
    "method": "eth_baseFee",
    "params": [],
    "id": 1
}'
{"jsonrpc":"2.0","id":1,"result":"0x37e11d600"}

FYI the RPC endpoint is https://testnet-rpc.ferdyflip.xyz/rpc if you want to test anything yourself.

There's no explorer yet but I've been using https://devnet.routescan.io/ with that endpoint.

@ceyonur
Copy link
Collaborator

ceyonur commented Mar 5, 2024

Do you see anything in validator logs? Could you send a validator's log file (if they're all showing the same thing)?

@tactical-retreat
Copy link
Contributor Author

Validator startup logs were posted above, nothing exciting in them. Logs are very boring after that.

If I send a tx:

cast send --rpc-url=http://localhost:9650/ext/bc/e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE/rpc --private-key=$KEY 0xc83F0Ddc642F3663190dfA40Ae95743CB831Fd97 --value="1 ether" --gas-price="50 gwei

I see:

Mar 05 16:04:57 v-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[50880]: INFO [03-05|16:04:57.383] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/internal/ethapi/api.go:1899: Submitted transaction hash=0x5330b05ca7373641985e23d9a4f917bd7383f164ef9f6e5c3ebe53261ee2725b from=0xc83F0Ddc642F3663190dfA40Ae95743CB831Fd97 nonce=23 recipient=0xc83F0Ddc642F3663190dfA40Ae95743CB831Fd97 value=1,000,000,000,000,000,000 type=2 gasFeeCap=50,000,000,000 gasTipCap=50,000,000,000 gasPrice=50,000,000,000

Nothing after that. If I resend I get:

Mar 05 16:06:05 v-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[50880]: INFO [03-05|16:06:05.788] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/rpc/handler.go:526: Served eth_sendRawTransaction reqid=5 execTime="66.13µs"  procTime="67.7µs"   totalTime="67.85µs"  err="already known"

@tactical-retreat
Copy link
Contributor Author

And to clarify what I mean by boring, just some C Chain output, mostly warp messages:

0000000000000000000000e35cce502b4bf974d517636bd66e7b2472d63ec9000000000000000000000000000000000000000000000000016345785d8a00000000000000000000000000000000000000000000 warpMessageID=2361VPcTjTCTy1GtFQc8YsHi9jaD2QSt3KDW6bteEkzAcTGDo4
Mar 05 15:28:45 v-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[50880]: INFO [03-05|15:28:45.877] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:1497: Resetting chain preference number=30,574,861 hash=cdc710..d1516a drop=1 dropfrom=5f7dd9..fc9eea add=1 addfrom=c3feca..2f298b
Mar 05 15:53:47 v-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[50880]: INFO [03-05|15:53:47.355] <C Chain> github.com/ava-labs/coreth/precompile/contracts/warp/config.go:119: Accepted warp unsigned message 

@tactical-retreat
Copy link
Contributor Author

I added a chain config with logging set to debug and I now see these entries, which again confirm that it knows about pending tx. But nothing that indicates why there are no new blocks.

Mar 06 18:08:51 v-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[54310]: DEBUG[03-06|18:08:51.111] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/peer/network.go:459: processing AppGossip from node nodeID=NodeID-KyP1LyscPyThYCEbN1KXekaeNgCtUgFhq msg="EthTxsGossip(Len=253)"
Mar 06 18:09:26 v-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[54310]: DEBUG[03-06|18:09:26.790] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/peer/network.go:459: processing AppGossip from node nodeID=NodeID-ACQt5KkctKfdLSNB6bfu7dVzmsyqpm3z2 msg="EthTxsGossip(Len=253)"
Mar 06 18:09:30 v-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[54310]: DEBUG[03-06|18:09:30.558] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/peer/network.go:459: processing AppGossip from node nodeID=NodeID-Lzo8r6TFVKqcmL5QxboPG4kuNZ5ECJooS msg="EthTxsGossip(Len=129)"

@tactical-retreat
Copy link
Contributor Author

Actually, I see that my repo (https://github.com/tactical-retreat/ferdynet-subnet-evm) is one commit behind head, and the head commit has a lot of stuff related to gossip. I'll look into redeploying a new VM.

@tactical-retreat
Copy link
Contributor Author

OK I sync'd redeployed, and had the faucet send some tokens. On the RPC node I see:

Mar 07 00:39:02 rpc-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[1091]: INFO [03-07|00:39:02.612] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/core/txpool/legacypool/legacypool.go:894: Setting new local account address=0x2352D20fC81225c8ECD8f6FaA1B37F24FEd450c9
Mar 07 00:39:02 rpc-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[1091]: INFO [03-07|00:39:02.613] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/internal/ethapi/api.go:1953: Submitted transaction hash=0xbe248f7c522fee7aa1246353997edd1e74bbf873002bd4a0e6560dd376596a7f from=0x2352D20fC81225c8ECD8f6FaA1B37F24FEd450c9 nonce=1364 recipient=0xB5d6CdEc20963882A0d3DEF5a60AfD3223eC48c4 value=20,000,000,000,000,000,000 type=2 gasFeeCap=100,000,000,000 gasTipCap=10,000,000,000 gasPrice=100,000,000,000

On the validator I see:

Mar 07 00:39:02 v-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[1081]: DEBUG[03-07|00:39:02.619] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/peer/network.go:443: forwarding AppGossip to SDK network nodeID=NodeID-Lzo8r6TFVKqcmL5QxboPG4kuNZ5ECJooS gossipLen=128 err="unknown codec version"
Mar 07 00:39:06 v-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[1081]: DEBUG[03-07|00:39:06.202] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/core/txpool/legacypool/legacypool.go:389: Transaction pool status report executable=1 queued=0 stales=0
Mar 07 00:39:32 v-fuji-1.us-east1-c.c.ferdyflip-testnet.internal avalanchego[1081]: DEBUG[03-07|00:39:32.717] <e3k9c6RKgouPWtyQkr1ULC2PSFVu6BZdFefXZXJfwJ1zjWsmE Chain> github.com/ava-labs/subnet-evm/peer/network.go:443: forwarding AppGossip to SDK network nodeID=NodeID-Lzo8r6TFVKqcmL5QxboPG4kuNZ5ECJooS gossipLen=128 err="unknown codec version"

@ceyonur
Copy link
Collaborator

ceyonur commented Mar 7, 2024

All of your validators are running v0.6.0> and with same upgrade.json files right? Could you send one of validator's debugs logs (full logs)?

This shows a completely different chain: https://devnet.routescan.io/block/1. (last accepted block no is 8).

@tactical-retreat
Copy link
Contributor Author

Yes. I upgraded them yesterday to 1.11.2 + the subnet-evm compiled from head. They all have the same upgrade.json.

I don't have the full logs from yesterday's run, sorry. I can probably restore to that if necessary.

But I went ahead and recreated the subnet (FerdyNet2 now) to see if the issue was that my DB was borked, which is why you see a different chain now. The new chain is working fine, so apparently that's not it.

The next thing I'm going to do is repeat the circumstances which got me into the previous problem, wipe the certificates, regenerate, re-add validators with 2000 weight. Will see if that makes this reproducable.

@tactical-retreat
Copy link
Contributor Author

OK, it seems like this is reproducible.

Startup logs: https://gist.github.com/tactical-retreat/05b073ac017d1554f41c1d7de640f95d

Reproduction instructions:

  1. create subnet with 3 nodes validating with default weight (20)
  2. send at least one tx
  3. take down all validators, delete staking information
  4. restart avalanchego (no longer validating)
  5. use new information to register as fuji validators
  6. register new node ids as subnet validators with stake 2000
  7. try and send a tx

Tx will hang forever.

@ceyonur
Copy link
Collaborator

ceyonur commented Mar 7, 2024

3- Did you just delete the staking information or whole database?
5- What new information? Staking certs?
6- Did older validator nodes started validating with same db or did you start new validator nodes from scratch?

@tactical-retreat
Copy link
Contributor Author

3 - just the staking information

cd /mnt/avax-db/config/staking
mkdir .backup2; mv s* .backup2/
sudo systemctl restart avalanchego

5 - node id, bls key, pop. the stuff that core validation site requires that comes out of this command

curl -X POST --data '{
    "jsonrpc":"2.0",
    "id"     :1,
    "method" :"info.getNodeID"
}' -H 'content-type:application/json;' 127.0.0.1:9650/ext/info

6 - I used the exact same machines with the exact same db that they had when i stopped avalanchego

@ceyonur
Copy link
Collaborator

ceyonur commented May 5, 2024

We have recently introduced few changes that addresses the issue. Could you try with the recent versions of Subnet-EVM (0.6.4)and Avalanchego (1.11.5)?

@darioush
Copy link
Collaborator

@tactical-retreat did this fix your issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

No branches or pull requests

3 participants