geth node is consistently behind the mainnet #16218
Comments
I'd recommend against running with verbosity 4 all the time; that's too much. We just ran with |
Dedicated 4 CPUs, 8 GB RAM, Ubuntu 16.04, for two days. (I tried running on multiple servers; three servers hit db errors, the others got stuck chasing the last 65 blocks.) |
Same here. I run few or no other apps on the machine, maybe some browser instances. Used geth --fast --cache=1024 for the initial sync; it caught up in roughly 10 h. @karalabe I tried using --cache=2048 and even 4096. Same result as above as with 1024; with 4096 the OS starts swapping after a while for some reason, even though not all memory is being used (memory leak?). A sample log dump indicates only the speed of block acquisition. Geth is currently a day behind: It's a bit strange, though, that I can see my address's state on etherscan (no outgoing transactions), but this is not shown in the wallet: "If your balance doesn't seem updated, make sure that you are in sync with the network." Not what I expect as an occasional end user. The wallet is telling me it doesn't know whether it's synced or not? |
In my case, after the "Tom and Jerry" chase went on for quite a long time, it is synced now (eth.syncing = false). Just let it keep running and chasing (don't turn it off). |
@b-f- Geth requires an SSD currently. If you only have an HDD, please use the light client. |
Hello! I am having the exact same issue as cue0083. I am running Geth 1.8.1 on a Raspberry Pi 3 (yes, I know people are saying that you cannot run a full Ethereum node on a Raspberry Pi any more, but I have not read any convincing arguments why a fast sync should not work).

The CPU utilization (checking with top) is around 10%-20% on average, so that does not seem to be an issue. The RAM is more or less fully allocated (which you would expect with only 1G), but there is a 2G swap. The I/O utilization (checking with iotop) does not really show anything exceptional either, and the 128GB SD card (SanDisk) has some 50GB still free (while trimming/erasing now and then to minimize the amount of garbage that the wear-levelling mechanism potentially has to move around). Network bandwidth utilization is not at any alarming level either (as reported by my router), although I can see that there is more or less constant communication going on between the node and peer(s).

I am running a fast sync that appears to stop (i.e. 'eth.syncing' starts returning 'false') when it is 65 blocks behind the highest block and then just "hangs", making no apparent further progress. When I restart Geth it will relatively quickly catch up to 65 blocks behind the new highest block and do some state downloading until it eventually "stops" again and 'eth.syncing' starts returning 'false'.

Now, the curious thing is that when I check the stack trace after the syncing has "stopped" (i.e. 'eth.syncing' starts returning 'false'), I can see that there still appears to be syncing going on with a single peer: '(*Downloader).spawnSync' is still waiting on a channel (and has been for 1029 minutes), while the various goroutines attached to the channel ('fetchHeaders', 'fetchReceipts', 'fetchBodies' and so on) do NOT appear to "hang" on their selects but rather seem busy downloading data (as indicated by 'fetchParts' apparently being called repeatedly).

So, the question in my head is now: why does Geth appear to still be in syncing mode while 'eth.syncing' is returning 'false', and what is my node doing being stuck on the same peer for hours downloading data that then appears to be just thrown away? I have attached a stack trace and also ten minutes' worth of Go trace if that could be of any use. Thanks! |
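For anyone wanting to reproduce this kind of diagnosis, here is a minimal sketch of how to capture a goroutine dump from a running node. It assumes geth exposes its default IPC endpoint on Linux; adjust the path if the node runs in Docker or with a custom --datadir.

```sh
# Attach to the running node and dump all goroutine stacks via the
# console's debug API, without stopping the node.
geth attach ~/.ethereum/geth.ipc --exec 'debug.stacks()' > stacks.txt

# Alternative: the Go runtime prints every goroutine on SIGQUIT,
# but note that this also terminates the process.
kill -QUIT "$(pidof geth)"
```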
Syncing Ethereum is a pain point for many people, so I'll try to detail what's happening behind the scenes so there might be a bit less confusion.

The current default mode of sync for Geth is called fast sync. Instead of starting from the genesis block and reprocessing all the transactions that ever occurred (which could take weeks), fast sync downloads the blocks and only verifies the associated proofs-of-work. Downloading all the blocks is a straightforward and fast procedure and will relatively quickly reassemble the entire chain. Many people falsely assume that because they have the blocks, they are in sync. Unfortunately this is not the case, since no transaction was executed, so we do not have any account state available (i.e. balances, nonces, smart contract code and data). These need to be downloaded separately and cross-checked with the latest blocks. This phase is called the state trie download and it actually runs concurrently with the block downloads; alas, it takes a lot longer nowadays than downloading the blocks.

So, what's the state trie? In the Ethereum mainnet, there are a ton of accounts already, which track the balance, nonce, etc. of each user/contract. The accounts themselves are however insufficient to run a node; they need to be cryptographically linked to each block so that nodes can actually verify that the accounts are not tampered with. This cryptographic linking is done by creating a tree data structure above the accounts, each level aggregating the layer below it into an ever smaller layer, until you reach the single root. This gigantic data structure containing all the accounts and the intermediate cryptographic proofs is called the state trie.

Ok, so why does this pose a problem? This trie data structure is an intricate interlink of hundreds of millions of tiny cryptographic proofs (trie nodes). To truly have a synchronized node, you need to download all the account data, as well as all the tiny cryptographic proofs to verify that no one in the network is trying to cheat you. This itself is already a crazy number of data items. The part where it gets even messier is that this data is constantly morphing: at every block (15s), about 1000 nodes are deleted from this trie and about 2000 new ones are added. This means your node needs to synchronize a dataset that is changing 200 times per second. The worst part is that while you are synchronizing, the network is moving forward, and state that you began to download might disappear while you're downloading, so your node needs to constantly follow the network while trying to gather all the recent data. But until you actually do gather all the data, your local node is not usable, since it cannot cryptographically prove anything about any accounts. If you see that you are 64 blocks behind mainnet, you aren't yet synchronized, not even close. You are just done with the block download phase and still running the state downloads. You can see this yourself via the seemingly endless "Imported new state entries" log messages.

Q: The node just hangs on importing state entries?!
A: The node doesn't hang, it just doesn't know how large the state trie is in advance, so it keeps on going and going and going until it discovers and downloads the entire thing. The reason is that a block in Ethereum only contains the state root, a single hash of the root node. When the node begins synchronizing, it knows about exactly 1 node and tries to download it. That node can refer to up to 16 new nodes, so in the next step, we'll know about 16 new nodes and try to download those. As we go along the download, most of the nodes will reference new ones that we didn't know about until then. This is why you might be tempted to think it's stuck on the same numbers. It is not; rather, it's discovering and downloading the trie as it goes along.

Q: I'm stuck at 64 blocks behind mainnet?!
A: As explained above, you are not stuck, just finished with the block download phase and waiting for the state download phase to complete too. This latter phase nowadays takes a lot longer than just getting the blocks.

Q: Why does downloading the state take so long? I have good bandwidth.
A: State sync is mostly limited by disk IO, not bandwidth. The state trie in Ethereum contains hundreds of millions of nodes, most of which take the form of a single hash referencing up to 16 other hashes. This is a horrible way to store data on a disk, because there's almost no structure in it, just random numbers referencing even more random numbers. This makes any underlying database weep, as it cannot optimize storing and looking up the data in any meaningful way. Not only is storing the data very suboptimal, but due to the 200 modifications per second and the pruning of past data, we cannot even download it in a properly pre-processed way to make it import faster without the underlying database shuffling it around too much. The end result is that even a fast sync nowadays incurs a huge disk IO cost, which is too much for a mechanical hard drive.

Q: Wait, so I can't run a full node on an HDD?
A: Unfortunately not. Doing a fast sync on an HDD will take more time than you're willing to wait with the current data schema. Even if you do wait it out, an HDD will not be able to keep up with the read/write requirements of transaction processing on mainnet. You should however be able to run a light client on an HDD with minimal impact on system resources. If you wish to run a full node, however, an SSD is your only option. |
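To make the "discovering the trie as it goes" answer above concrete, here is a highly simplified sketch. It is not geth's actual downloader code; the Hash type and the fetchNode helper are invented stand-ins for the real network machinery. It shows why the syncer cannot know the total amount of work up front: every node it fetches can reveal up to 16 children it has never seen before.

```go
package main

// Simplified illustration of state-trie discovery during fast sync:
// the syncer starts out knowing only the state root hash from a block
// header, and learns about more trie nodes only by downloading nodes.

type Hash [32]byte

// fetchNode stands in for a network request to a peer; it would return
// the hashes of the children referenced by the downloaded trie node.
func fetchNode(h Hash) []Hash {
	// ... retrieval, hash verification and disk write omitted ...
	return nil
}

func syncState(stateRoot Hash) {
	pending := []Hash{stateRoot} // known but not yet downloaded ("pending" in the logs)
	done := make(map[Hash]bool)  // already downloaded nodes

	for len(pending) > 0 {
		// Pop one known-but-missing node and download it.
		h := pending[len(pending)-1]
		pending = pending[:len(pending)-1]
		if done[h] {
			continue
		}
		done[h] = true

		// Every downloaded node may reference up to 16 children that were
		// unknown until now, so the pending set keeps being refilled while
		// the download runs -- hence no reliable progress percentage.
		pending = append(pending, fetchNode(h)...)
	}
}

func main() {
	syncState(Hash{}) // placeholder; a real root hash comes from a block header
}
```

The real downloader does this concurrently across many peers and persists every verified node to the database, which is where the random-access disk cost discussed in the Q&A above comes from.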
Thank you for the very thorough explanation. You should put the fast sync requirement (fast-IO SSD) in the readme so people don't lose time. |
Hello! Yes, I realize that there are tons of states hanging under each block (header) but the thing that was puzzling me was that while geth appears to still be syncing (as the stack trace seems to indicate) the 'eth.syncing' call is returning 'false'. I think this is what causes people to feel that geth "hangs", because it claims it is not syncing any more. Please see the stack trace and other debug info in my previous post. Thanks! |
This is a problem not of an HDD, but of the design of geth and the choice of using a NoSQL database like LevelDB. But anyway, you can solve the current problem by replacing LevelDB with another, more powerful NoSQL database engine, or by increasing the cache size of LevelDB. LevelDB doesn't have to send SEEKs to the HDD every time it has to read something: if you cache the entire state, the READ operations would not be necessary at all, and for WRITE operations you can combine the requests for the same physical block. So, in theory you should be able to run an Ethereum node on an HDD, because it is not a hardware limitation, it is a limitation of the design. The average SEEK time of an HDD is 9 milliseconds; doing some basic math, you can execute 111 WRITE or READ operations per second without a bottleneck. Ethereum has a rate of 15 transactions per second, so if every transaction does 4 READs and 3 WRITEs you would be able to use an HDD. Now, this is if no cache mechanisms are used, but with a cache the performance would accelerate dramatically, like a 500x or 1000x speedup, depending on the RAM you would be willing to consume. Today's programmers write software without knowing much about hardware; that's why users are suffering. |
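Spelling out the back-of-the-envelope arithmetic from that comment (this only reproduces the commenter's own assumed numbers: 9 ms seeks, 15 TPS, 4 reads and 3 writes per transaction; it says nothing about what geth actually does per transaction):

```go
package main

import "fmt"

func main() {
	const seekMs = 9.0        // assumed average HDD seek time
	budget := 1000.0 / seekMs // ~111 random IOs per second
	const tps = 15.0          // assumed mainnet transaction rate
	const iosPerTx = 4 + 3    // assumed 4 reads + 3 writes per transaction
	needed := tps * iosPerTx  // ~105 IOs per second required
	fmt.Printf("HDD budget: %.0f IO/s, needed: %.0f IO/s\n", budget, needed)
}
```

As the replies below point out, the per-block transaction count and the per-transaction IO assumptions are exactly the disputed part of this argument.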
Apparently today's internet users post comments without knowing much about the issue they comment on.
|
Oh, and an Ethereum block currently contains about 175 txs, and it's processed in about 200ms, so the true throughput is 875 TPS, but PoW mining and block propagation prevent pushing all hardware to its limits. |
We are having the same issue, syncing for more than 24 hours. How long until the state download finishes?
Geth
Any suggestions? |
Getting this issue with v1.8.4-stable-2423ae01. Running on an AWS m5.large (2 vCPUs, 8 GB of RAM) with a 300GB general purpose SSD attached. Command (initially used "fast" for the first day):
Watching the logs, it seems like it just continually imports new state entries. I've had it running for about 4 days and had to restart a few times because the logs showed no activity for hours. It always seems to be "almost" caught up:
What's the best way to figure out the issue? I can't imagine it should take 4 days to sync up. Edit: Opened up port 30303 TCP and went from a max of 8 peers to 24 now. |
So, if I understand the INFO logs about importing new state entries, I'll actually be done syncing the trie state when the pending number quits bouncing up and down and eventually gets to zero? I'm assuming (a dangerous thing, I know) that the pending count is the number of entries that are linked to but not yet loaded. Mine is currently fluctuating between 7,000 and 9,000 on testnet. I've processed 28,527,973 so far. |
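A quick way to watch that progress without tailing the logs is to poll the console over IPC (a sketch, assuming a local node with the default IPC endpoint; in this geth era the syncing object includes pulledStates and knownStates alongside the block numbers):

```sh
# Returns an object with currentBlock, highestBlock, pulledStates and
# knownStates while syncing, and false once the node is fully in sync
# and importing new blocks one by one.
geth attach ~/.ethereum/geth.ipc --exec 'eth.syncing'

# Peer count, useful for checking whether opening port 30303 helped.
geth attach ~/.ethereum/geth.ipc --exec 'net.peerCount'
```

Note that knownStates keeps growing as the trie is discovered, so it is not a fixed target to count down from.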
After upgrading to version 1.8.6 my chain synced up after about a day (I wasn't watching it closely, so it might have been faster). |
@kingjerod: Can you provide some hardware info (especially SSD/HDD)? |
Running it on an AWS m5.large with Ubuntu 16.04, which has 2 virtual CPUs and 8 GB of RAM. Running it with a general purpose SSD volume provisioned to 300GB. Currently synced and it's using 90GB. I think the latest version still needs some work, because it's the only process running on the server and it has almost maxed out both cores (CPU at 193%), and memory is at 6GB after running for 45 minutes. If you're trying to run this on a laptop, be prepared to have your computer crawl. |
Thanks for the info. It seems ETH syncing is still a hard task. |
@bobbieltd If you just need the http rpc I would recommend https://infura.io |
There is no way to get synced with an HDD; it is only enough to sync with Ropsten. @kingjerod's recommendation is a good one for RPC. |
@karalabe do you think that, when relying on shared services like AWS, dedicated hardware tenancy can reduce the slowdown? My "box" can handle 3000 IOPS, but there are still days with frequent phasing in and out of sync. I'm wondering if this is an issue of my cloud service neighbors hogging bandwidth. I'll contact my cloud service as well to figure this out. |
How much space would I need to run a full node - I've read somewhere that the size of the blockchain was about 1TB, but others talk about 100GB - big difference... |
@dyvel It depends on whether you do a full sync or a fast sync. If you do a full sync it might fit inside 1TB. If you don't care about the history of transactions, a fast sync will work and I imagine might fit in under 100GB. |
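If you want to see what your own node is currently using, a simple check (assuming the default data directory on Linux; point it at your --datadir otherwise):

```sh
# Size of geth's chain database under the default data directory.
du -sh ~/.ethereum/geth/chaindata
```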
As this ticket mostly concerns the performance, and is not directly a flaw in the code to be fixed (other than, make it faster), I'm going to close this ticket. Feel free to open a new ticket if there's something I missed. |
Question for @karalabe or anyone else who knows: what is it exactly that's in a block? What do you actually have from the network when you "have a block"? As far as the tries go, I assume the state trie is separate, but are the transaction and transaction receipt tries included in the "block download" or are they also separate somehow? I assume that at the very least the transaction trie is included, but I'm unsure whether the transaction receipt trie is also downloaded or whether it's derived by replaying transactions from the transaction trie. |
very informative messages! |
Hey, I mistakenly set my cache to 1024 at the start of the sync and it has already been a couple of days.
I am glad that, two and a half years later, you finally understood what I said. As I am reading the Geth 1.9 release notes, they say:
"The discovery and optimization of a quadratic CPU and disk IO complexity, originating from the Go implementation of LevelDB. This caused Geth to be starved and stalled, exponentially getting worse as the database grew. Huge shoutout to Gary Rong for his relentless efforts, especially as his work is beneficial to the entire Go community."
There is still a lot of room to improve the performance of trie storage, starting from the design. |
Hi,
Thank you for the update. As a system-level C programmer I dislike 'popular' languages like Go, and the packaged 'goods' that come with them. As the design of the block does not change, there is probably no need to use a standard database in the first place.
Happy New Year!
Paul.
|
I'm trying to sync a full node on a server with 10TB+ of HDD and 448GB of SSD. |
Hi there,
System information
Geth version: 1.8.1
OS & Version: CoreOS (running geth in Docker container)
Hardware: m5.large
Expected behaviour
In the article for the Iceberg release (https://blog.ethereum.org/2018/02/14/geth-1-8-iceberg%C2%B9/), the author said that they were able to complete a fast sync in ~3 hours using an m4.2xlarge instance. Could you (the devs) tell me what command line options you used for that test?
Actual behaviour
I am running on similar hardware and my geth node has been 65 blocks behind the mainnet for the past 2 days.
Steps to reproduce the behaviour
Launch the geth container on CoreOS. Geth should run with these command line options:
--verbosity 4 --metrics --maxpeers=50 --ipcdisable --v5disc --rpc --rpcvhosts=* --port=30303 --syncmode fast --cache=2048 --rpcaddr 0.0.0.0 --rpcport 8545 --rpcapi db,eth,net,web3,personal,admin,txpool,debug
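For comparison, here is the same invocation with the verbosity lowered to 3, as suggested in the replies; this is only a sketch, with every other flag kept exactly as reported above (the vhosts wildcard is quoted here for the shell):

```sh
geth --verbosity 3 --metrics --maxpeers=50 --ipcdisable --v5disc \
  --rpc --rpcvhosts '*' --port=30303 --syncmode fast --cache=2048 \
  --rpcaddr 0.0.0.0 --rpcport 8545 \
  --rpcapi db,eth,net,web3,personal,admin,txpool,debug
```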
Backtrace
geth.log