node consistently 12 hours behind #127
Hi there! Hm, can you please share more details about the environment your nodes are running in, along with specs? |
Hi, I am running docker containers on AWS EC2 instances.
This is the .env.mainnet file:
OP_NODE_L1_ETH_RPC=${L1_RPC_URL}
OP_NODE_L2_ENGINE_AUTH=/tmp/engine-auth-jwt
This is the docker file:
services:
This is the latest entry from the git log:
|
Apologies for the formatting above; not sure what's happened to it. |
Hi, |
We've had some large gas-consuming blocks recently, which can cause underprovisioned nodes to occasionally fall behind. Usually the bottleneck is due to the storage device (SSD / NVME). |
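If it helps anyone narrow this down, one hedged way to check whether storage is the bottleneck (assuming a Linux host with the sysstat package installed) is to watch per-device I/O while the node is syncing; sustained utilization near 100% or high await times on the data volume point to the disk:
iostat -xm 5    # extended device stats in MB, refreshed every 5 seconds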
Same problem all of a sudden. Been running a Base node for months without issues, but now it cannot get to 100% sync anymore. My storage device isn't the problem, since I'm running a 4TB WD_BLACK SN850X PCIe 4.0 SSD on an Intel Core i9-13900KS. Currently on block 5651609 of 5673372 with no sign of catching up. |
@Toeplitz Thanks for following up with more info. Is your node still running? Can you please share more details regarding how far back it currently is? |
My issue isn't intermittent. Both my nodes are constantly 12 hours behind. I have been monitoring the disk usage on datadog and it is definitely not the bottleneck here. |
That is really odd. Re: "can't find next L1 block info (yet)", could it be that the L1 node you are using is behind? |
I just checked my L1 nodes and they are fully synced, and have been without any issues for a while now, so I don't think it is that. |
I managed to get in sync by starting from scratch and syncing up from a downloaded snapshot. Seems like something got stuck with my old node. |
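For anyone else going this route, a rough sketch of a snapshot restore (SNAPSHOT_URL, DATADIR, and the gzipped-tarball format are placeholders/assumptions; use the actual snapshot location from the Base docs, and stop op-geth before extracting):
wget -c "$SNAPSHOT_URL" -O snapshot.tar.gz
tar -xzvf snapshot.tar.gz -C "$DATADIR"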
I have tried restoring from a snapshot before but it didn't work at the time. Will try it again and let you know how I get on. |
the "exactly 12 hours behind" issue seems like it must be config or network related. 12 hours is the max-sequencer drift window, which suggests L1 chain derivation is failing. |
What are the recommended storage specs? I'm running a gp3 volume on AWS at 16,000 IOPS and 1,000 MB/s and our node is still delayed a couple of times a day; the longest delay lasted about 4 hours last Sunday. |
Were you ever able to get your node to fully sync and not fall behind? Is your node still delayed occasionally? When it happens, how far back do you think it is falling behind, on average? I know you mentioned 4 hours but just wanted to confirm if that was an outlier or if it is usually about that length of time. Are you running other nodes in tandem in the same environment? The specs you mentioned should be sufficient. Aside from that, you'll need at least 1.75 TB to sync mainnet. |
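A quick, hedged sanity check of disk headroom (the paths are hypothetical; point them at your own volume and datadir):
df -h /data                 # free space on the volume holding the datadir
du -sh /data/op-geth/geth   # current size of the chain data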
I ended up resyncing from the snapshot, which worked. At one point, though, I had three instances that were all consistently 12 hours behind with no obvious indication as to why. |
We've not been able to replicate. Have you modified the start script in any way (other than by pointing it to your own L1 node)? |
@roberto-bayardo I use the default start script except |
Heya, I was helping another person with a similar issue and just wanted to follow up with an update alongside a recent comment on #172 (comment):
Could you please try setting --l1.trustrpc (OP_NODE_L1_TRUST_RPC) to false and let us know if it resolves the problem? |
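In the env-file setup described earlier in this thread, that would be (a minimal sketch; the env var and the flag are two ways to set the same option):
OP_NODE_L1_TRUST_RPC=false
# or, when passing flags to op-node directly:
# --l1.trustrpc=false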
#172 (comment) Right now I have the exact same problem on my mainnet Optimism archive node v1.4.2. I have tried all the settings, including removing "--l1.trustrpc" or changing the L1 provider; the node syncs very fast until it reaches the block roughly 12 hours behind the live tip, and then it becomes very slow and stays 12 hours behind. I created another node from scratch to see if it happens again. node-op-node-1
node-op-geth-1
|
A few days back, our Base node requested too much from the L1 RPC and got rate limited. It then got into this issue of being consistently about 12 hours behind the current block; this is the 3rd day now. We changed --l1.trustrpc / OP_NODE_L1_TRUST_RPC to false, but the issue still has not been resolved after almost a day. We are trying to resync from the latest snapshot, but it is 7 days old. Please post a more recent snapshot soon. |
We're in the same situation: 12 hours behind, impossible to catch up. In the log
and this repeats over and over. Tried with 1.3.2 and 1.4.3-rc.3 (op-geth 1.101304.2) |
I had been facing this error for three days. Then I downloaded the Optimism release separately, without touching the datadir. My services had to be changed, and some new flags were added that were not present in my old setup. After that I kept op-node off and only started op-geth.
go run ./op-wheel/cmd engine set-forkchoice --unsafe=114685543 --safe=114685543 --finalized=114685543 --engine=http://localhost:8551 --engine.jwt-secret=/root/op-node-optimism/optimism/op-node/jwt.txt
At last I restarted op-node and now my node has started syncing at full speed. It might take a few hours to sync, but the logs have definitely changed and I am sure it will resync. |
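In script form, the sequence described above looks roughly like this (a hedged sketch: the service names, block number, and paths are placeholders for whatever your setup uses):
systemctl stop op-node    # hypothetical service name; keep op-node down
systemctl start op-geth   # op-geth alone, so its engine API stays reachable
# from the root of the optimism monorepo checkout:
go run ./op-wheel/cmd engine set-forkchoice \
  --unsafe=<current-block> --safe=<current-block> --finalized=<current-block> \
  --engine=http://localhost:8551 --engine.jwt-secret=<path-to-jwt.txt>
systemctl start op-node   # then bring op-node back and let it resume derivation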
interesting. Had no other choice, so I removed op-geth data and downloaded snaphost (after painful 3-4 days, as it was slow as hell), so will be recovering from snapshot. |
@zainirfan13 I got this error
|
@MrFrogoz Are you on branch "op-node/v1.4.2"? For me it worked when I built this branch. |
@tmeinlschmidt Every time an Optimism update comes out I just pray that this time it syncs peacefully after the upgrade, and trust me, it's always painful. |
@zainirfan13 I downloaded the repo and checked out the same version, still not working
|
Try running the command outside the op-wheel folder like I shared; also make sure you have built in the main folder using
After this, run the command "go run ./op-wheel/cmd engine set-forkchoice --unsafe=114696811 --safe=114696811 --finalized=114696811 --engine=http://localhost:50005 --engine.jwt-secret=/data/shared/jwt.txt" in the optimism folder. Make sure you are using the correct block number, i.e. the block your op-geth is currently at. |
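To find that block number, one hedged option is to ask op-geth directly (assuming its HTTP RPC is reachable on localhost:8545; the port in your setup may differ):
curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  http://localhost:8545
# the result is hex; convert it to decimal before using it for --unsafe/--safe/--finalized, e.g.
printf '%d\n' 0x6d5f2ab   # hypothetical value returned above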
I also had the same problem on 2024-01-12, but it was because I was still running the old version v0.3.1. After I upgraded to v0.6.1 and used the latest snapshot, I no longer had this problem. |
Building outside of the op-wheel folder, inside the root of the repo, doesn't build op-wheel for me. I had to build op-node and op-wheel separately. Even then, I get the same error
|
Ahh, using the engine API means you have to use the engine port, the default being 8551. The command went through without any log output, and after restarting I don't see it walking back forever anymore. Hopefully I'll reach the tip after it syncs. |
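For anyone hitting the same port confusion: the engine endpoint is op-geth's authrpc listener, configured via upstream geth's authrpc flags (a hedged sketch showing the default values; other flags omitted):
op-geth <other flags> --authrpc.addr 127.0.0.1 --authrpc.port 8551 --authrpc.jwtsecret /path/to/jwt.txt
# set-forkchoice's --engine URL should point at that authrpc port, using the same JWT secret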
I encountered the same problem in my private L2 OP Stack network. After I set the block_time to 6 in rollup.json, the problem was resolved. |
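For other private OP Stack deployments, a hedged way to double-check the timing parameters in the rollup config (assuming jq is installed and rollup.json is in the current directory):
jq '{block_time, max_sequencer_drift, seq_window_size}' rollup.json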
Hey all, I put together a quick video in case it might help troubleshoot. TL;DW: sometimes if a required upgrade is missed, either on Base mainnet, Base Sepolia (or even on Sepolia), the local node can fork off onto its own chain. To confirm if this is the case you can use the
If you confirm that your block hash isn't valid, then it's best to:
Hope that helps. If you're still experiencing a 12h drift after the above, please let me know and I'm happy to try to help troubleshoot further. 2024-02-26.at.15.33.27-converted.mp4 |
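One hedged way to check for a fork (not necessarily the tool referenced in the video) is to compare your node's block hash at some recent height against a trusted public RPC such as https://mainnet.base.org, assuming the local node's HTTP RPC is on localhost:8545:
BLOCK=0xbc614e   # hypothetical block number, hex-encoded
curl -s -X POST -H 'Content-Type: application/json' \
  --data "{\"jsonrpc\":\"2.0\",\"method\":\"eth_getBlockByNumber\",\"params\":[\"$BLOCK\", false],\"id\":1}" \
  http://localhost:8545 | jq -r .result.hash
curl -s -X POST -H 'Content-Type: application/json' \
  --data "{\"jsonrpc\":\"2.0\",\"method\":\"eth_getBlockByNumber\",\"params\":[\"$BLOCK\", false],\"id\":1}" \
  https://mainnet.base.org | jq -r .result.hash
# if the two hashes differ, the local node has forked and needs to be rolled back or resynced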
Since it has been some time without any new updates, closing this issue. If anyone is still encountering this problem, please come to Base Discord and open a #developer-support ticket. We can take a closer look there. 👍 |
I have multiple nodes which are syncing but are consistently 12 hours (~21k blocks) behind. There is nothing obvious in the logs from what I can see. On both instances where this has happened, the instance actually syncs in good time, but once it catches up to about 12 hours behind it slows down and stays slow. Below are the logs for my node and geth.
node
t=2023-10-20T14:06:49+0000 lvl=info msg="Sync progress" reason="processed safe block derived from L1" l2_finalized=0x7bcbef0c6fc2813d65a6ff8f3a8b355d66f6345d38dfabc52c393180161affe0:5488451 l2_safe=0x75f75f5de826110603239f82bd9070ef1723122b4d1a0bd7797def5a0eb0cdd7:5488949 l2_unsafe=0x75f75f5de826110603239f82bd9070ef1723122b4d1a0bd7797def5a0eb0cdd7:5488949 l2_engineSyncTarget=0x75f75f5de826110603239f82bd9070ef1723122b4d1a0bd7797def5a0eb0cdd7:5488949 l2_time=1,697,767,245 l1_derived=0x102a221bd61eba1c7a4aefdc952a613f37d98f04ae616b5e09f1ce77f428cadd:18392017
t=2023-10-20T14:06:49+0000 lvl=warn msg="ignoring batch with mismatching parent hash" batch_index=0 batch_timestamp=1,697,767,247 parent_hash=0x6cba462977ab51f2ba86cd960c1aaaaf40e1dcf10c2dff21f05c09907b57e111 batch_epoch=0x5177bbb4fc159ffccdd7a30efcb90340245537e071632b00f0ec45bac1c19d16:18388407 txs=7 current_safe_head=0x75f75f5de826110603239f82bd9070ef1723122b4d1a0bd7797def5a0eb0cdd7
t=2023-10-20T14:06:49+0000 lvl=warn msg="dropping batch" batch_timestamp=1,697,767,247 parent_hash=0x6cba462977ab51f2ba86cd960c1aaaaf40e1dcf10c2dff21f05c09907b57e111 batch_epoch=0x5177bbb4fc159ffccdd7a30efcb90340245537e071632b00f0ec45bac1c19d16:18388407 txs=7 l2_safe_head=0x75f75f5de826110603239f82bd9070ef1723122b4d1a0bd7797def5a0eb0cdd7:5488949 l2_safe_head_time=1,697,767,245
t=2023-10-20T14:06:51+0000 lvl=info msg="Received signed execution payload from p2p" id=0xc537b59cadb1faf30fbc5351019584f6f29f77c5261501436ec30fb0fe73acb1:5510732 peer=16Uiu2HAmTup6mra5PBDNxo38dPqSZrJVdbkRaFmsBSeVYuBbfoPt
t=2023-10-20T14:06:51+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0xc537b59cadb1faf30fbc5351019584f6f29f77c5261501436ec30fb0fe73acb1:5510732
t=2023-10-20T14:06:53+0000 lvl=info msg="Received signed execution payload from p2p" id=0x11432e2b42964a1b271e4a27ea9a1bd2811ab71aa182c2afabcf00345f28de65:5510733 peer=16Uiu2HAmTup6mra5PBDNxo38dPqSZrJVdbkRaFmsBSeVYuBbfoPt
t=2023-10-20T14:06:53+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0x11432e2b42964a1b271e4a27ea9a1bd2811ab71aa182c2afabcf00345f28de65:5510733
t=2023-10-20T14:06:55+0000 lvl=info msg="Received signed execution payload from p2p" id=0x994c17ff2e7b9828ede0d0a54a2b0c27d7ff760738f06c24fcfd59af531bf480:5510734 peer=16Uiu2HAmTup6mra5PBDNxo38dPqSZrJVdbkRaFmsBSeVYuBbfoPt
t=2023-10-20T14:06:55+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0x994c17ff2e7b9828ede0d0a54a2b0c27d7ff760738f06c24fcfd59af531bf480:5510734
t=2023-10-20T14:06:57+0000 lvl=info msg="Received signed execution payload from p2p" id=0x68a9266751a6a7137b5940432ea4e8307be83660271c094f26f052ec2010469d:5510735 peer=16Uiu2HAmTup6mra5PBDNxo38dPqSZrJVdbkRaFmsBSeVYuBbfoPt
t=2023-10-20T14:06:57+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0x68a9266751a6a7137b5940432ea4e8307be83660271c094f26f052ec2010469d:5510735
t=2023-10-20T14:06:59+0000 lvl=info msg="Received signed execution payload from p2p" id=0x18d2d7e65902af386210f03af30beecf29112cc6d401f4c0615001d39b35f143:5510736 peer=16Uiu2HAmTup6mra5PBDNxo38dPqSZrJVdbkRaFmsBSeVYuBbfoPt
t=2023-10-20T14:06:59+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0x18d2d7e65902af386210f03af30beecf29112cc6d401f4c0615001d39b35f143:5510736
geth
INFO [10-20|14:07:49.178] Chain head was updated number=5,488,975 hash=02cbd0..9c29ad root=454d2c..79e5a5 elapsed="56.359µs" age=12h6m12s
INFO [10-20|14:07:49.179] Starting work on payload id=0xd8738b4081f03f1c
INFO [10-20|14:07:49.180] Imported new potential chain segment number=5,488,976 hash=20c509..4b57c9 blocks=1 txs=1 mgas=0.047 elapsed="680.647µs" mgasps=68.924 age=12h6m10s dirty=0.00B
INFO [10-20|14:07:49.181] Chain head was updated number=5,488,976 hash=20c509..4b57c9 root=91478a..4cc74f elapsed="58.409µs" age=12h6m10s
INFO [10-20|14:07:49.182] Starting work on payload id=0x595a99110dbba427
INFO [10-20|14:07:49.183] Imported new potential chain segment number=5,488,977 hash=965a0f..a2de8c blocks=1 txs=1 mgas=0.047 elapsed="608.423µs" mgasps=77.106 age=12h6m8s dirty=0.00B
INFO [10-20|14:07:49.184] Chain head was updated number=5,488,977 hash=965a0f..a2de8c root=30acb3..b8737b elapsed="54.629µs" age=12h6m8s
INFO [10-20|14:07:49.185] Starting work on payload id=0x37aead7a2cb71706
INFO [10-20|14:07:49.186] Imported new potential chain segment number=5,488,978 hash=34c7ff..026a16 blocks=1 txs=1 mgas=0.047 elapsed="603.345µs" mgasps=77.755 age=12h6m6s dirty=0.00B
INFO [10-20|14:07:49.186] Chain head was updated number=5,488,978 hash=34c7ff..026a16 root=c5c135..9b6c31 elapsed="56.8µs" age=12h6m6s
INFO [10-20|14:07:49.188] Starting work on payload id=0x9fc452ac14108431
INFO [10-20|14:07:49.189] Imported new potential chain segment number=5,488,979 hash=1ed5ef..e2f20c blocks=1 txs=1 mgas=0.047 elapsed="562.039µs" mgasps=83.469 age=12h6m4s dirty=0.00B
INFO [10-20|14:07:49.189] Chain head was updated number=5,488,979 hash=1ed5ef..e2f20c root=88c3e2..114e8d elapsed="51.821µs" age=12h6m4s
This wasn't an issue until about a week ago. I have checked the Discord for similar issues but cannot find any.