This repository has been archived by the owner on Apr 4, 2024. It is now read-only.

Unstable eth websocket event subscriptions #1745

Closed
CRossel87a opened this issue Apr 9, 2023 · 22 comments

Comments

@CRossel87a

CRossel87a commented Apr 9, 2023

Hi,

I have reproduced this bug with the Evmos, Cronos and Canto clients, and have found Ethermint to be the common factor.
For my app I use multiple long-duration event subscriptions to newHeads and log topics. It is a consistent and provable problem across multiple versions, multiple binaries, and multiple builds (external ones and my own). It occurs both when the client has a dedicated server and when it is shared with other clients.

When the problem occurs, the clients will accept a new subscription and respond with a subscription id, but then not actually push any events. The problem typically starts after running for 2-6 hours, and it seems to occur more frequently when on-chain event activity is lower. Reconnecting to the client on every 2-minute timeout does not solve the problem; only a forced client restart does.

One temporary fix I have used is to program a periodic restart of the clients every 2-3 hours. That seems to "minimize the damage".

Steps to Reproduce

New blocks:
wscat -c ws://127.0.0.1:8546
{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params":["newHeads"]}

For Uniswap2 Sync events:
wscat -c ws://127.0.0.1:8546
{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params": ["logs", {"topics": ["0x1c411e9a96e071241c2f21f7726b17ae89e3cab4c78be50e062b03a9fffbbad1"]}]}

@CRossel87a
Author

Hi,

ref. https://github.com/evmos/evmos/releases/tag/v12.1.2

This did not solve the problem.

@CRossel87a
Author

CRossel87a commented Apr 16, 2023

The public Evmos RPC websocket server is broken: it does not emit all Uniswap2 Sync events.

Replication:
wscat -c wss://eth.bd.evmos.org:8546
{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params": ["logs", {"topics": ["0x1c411e9a96e071241c2f21f7726b17ae89e3cab4c78be50e062b03a9fffbbad1"]}]}

then do a test trade on Diffusion

@mmsqe
Contributor

mmsqe commented Apr 27, 2023

{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params": ["logs", {"topics": ["0x1c411e9a96e071241c2f21f7726b17ae89e3cab4c78be50e062b03a9fffbbad1"]}]}

I can't reproduce it yet. May I ask whether the events are delayed, or whether no new events arrive at all?
(screenshot attached)

@CRossel87a
Author

{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params": ["logs", {"topics": ["0x1c411e9a96e071241c2f21f7726b17ae89e3cab4c78be50e062b03a9fffbbad1"]}]}

I can't reproduce it yet. May I ask whether the events are delayed, or whether no new events arrive at all? (screenshot attached)

Hi,

Websocket subscriptions are still not working correctly with v12.1.2. After a long period of polling, typically a few hours, the node still stops emitting events, even when the connection is refreshed on 2-minute timeouts. Would it help you to know there are multiple concurrent subscriptions running?

@CRossel87a
Author

I have been able to break the following subscription:

{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params": ["logs", {"topics": ["0x1c411e9a96e071241c2f21f7726b17ae89e3cab4c78be50e062b03a9fffbbad1"]}]}

I had not been subscribing to the public RPC lately, so I think you waited too long to try this. If you open 2-3 concurrent subscriptions, the public RPC will break after a few hours.

@mmsqe
Contributor

mmsqe commented May 9, 2023

May I ask if any events are received after a reconnect? If yes, it would be a subscription issue on the client side; if no, it would be a websocket issue on the server side.

@CRossel87a
Author

CRossel87a commented May 9, 2023

May I ask if any events are received after a reconnect? If yes, it would be a subscription issue on the client side; if no, it would be a websocket issue on the server side.

Upon reconnecting while "broken", you can send eth_subscribe and you will obtain a subscription id in response, so the connection itself is working. However, after obtaining that id, no further logs are pushed over the connection.
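
For reference, a minimal probe that demonstrates the symptom (a sketch only; it assumes the github.com/gorilla/websocket package and reuses the endpoint and topic from the reproduction steps above): it subscribes and prints everything the node pushes, so the "subscription id but no events" state is easy to spot.

package main

import (
    "fmt"
    "log"

    "github.com/gorilla/websocket"
)

func main() {
    conn, _, err := websocket.DefaultDialer.Dial("ws://127.0.0.1:8546", nil)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    sub := `{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params":["logs",{"topics":["0x1c411e9a96e071241c2f21f7726b17ae89e3cab4c78be50e062b03a9fffbbad1"]}]}`
    if err := conn.WriteMessage(websocket.TextMessage, []byte(sub)); err != nil {
        log.Fatal(err)
    }

    // The first message should be the subscription id; in the broken state,
    // that id arrives but nothing is ever printed after it.
    for {
        _, msg, err := conn.ReadMessage()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(string(msg))
    }
}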

If you are working on this, please PM me directly at chris.rossel@gmail.com so that I can help you move quickly to get it solved. I am having production interruptions on every Ethermint chain, whereas I have no issues with any geth, Erigon, Moonbeam, or Polkadot-based nodes. I am sure this is damaging Cosmos DeFi significantly.

@CRossel87a
Author

The lack of events after a confirmed subscription id is replicated across multiple listening clients, such as a direct wscat query, Python websockets, or Rust tokio-tungstenite. I am 100% certain this issue is server side. Remember that the error only occurs after an hour or more of polling; typically a node restart helps for a period.

@CRossel87a
Author

I set the log level to 'error' and got these:

May 09 13:47:09 v2202211186789206564 cantod[10998]: 1:47PM ERR failed to add event topic to event bus error="topic already registered" api=websocket-server module=websocket-client topic="tm.event='Tx' AND message.module='evm'"

May 09 13:54:17 v2202211186789206564 cantod[11161]: 1:54PM ERR failed to add event topic to event bus error="topic already registered" api=websocket-server module=websocket-client topic="tm.event='Tx' AND message.module='evm'"

After these errors, no more events were pushed out.

@VictorTrustyDev
Contributor

VictorTrustyDev commented May 9, 2023

I think you found the correct error message.

In filter_system.go

ch := make(chan coretypes.ResultEvent)
es.topicChans[f.event] = ch
if err := es.eventBus.AddTopic(f.event, ch); err != nil {
    es.logger.Error("failed to add event topic to event bus", "topic", f.event, "error", err.Error())
}

Look: es.topicChans[f.event] gets replaced by the new channel ch even when AddTopic fails.
I think the flow should be AddTopic first and, if it succeeds, replace/add the channel.

Just a few lines of comment; I'll let someone else fix it if my guess is correct.
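
A minimal sketch of that reordering, based on the snippet above (just an illustration of the suggestion, not a tested fix):

ch := make(chan coretypes.ResultEvent)
if err := es.eventBus.AddTopic(f.event, ch); err != nil {
    es.logger.Error("failed to add event topic to event bus", "topic", f.event, "error", err.Error())
} else {
    // Only install the new channel once the event bus has accepted the topic,
    // so a failed AddTopic cannot orphan a channel that is still receiving events.
    es.topicChans[f.event] = ch
}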

@CRossel87a
Author

CRossel87a commented May 9, 2023

Great! I hope that's it. I am currently debugging on my Canto node, which is running an older version. It seems the newHeads subscription works while logs do not, so the problem is likely isolated to topic subscriptions.

@CRossel87a
Author

(screenshot of the attempted patch attached)

I attempted this quickfix; sadly, it did not solve it.

@mmsqe
Contributor

mmsqe commented May 9, 2023

@CRossel87a May I ask the steps you used to verify it is not solved?

@CRossel87a
Author

CRossel87a commented May 9, 2023

@CRossel87a May I ask the steps you used to verify it is not solved?

@mmsqe Same steps as described above. However, I think the reason you have not been able to reproduce it is that it seems to happen when a program attempts to create multiple subscriptions at the same time, such as an app subscribing to 3 topics at once over 3 separate connections. I am now adding time between subscriptions to see whether this produces stability, testing the hypothesis that there is some multi-threaded collision.
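
To illustrate that scenario, a rough concurrency sketch (again assuming github.com/gorilla/websocket; the newHeads payload is a stand-in for the app's three real topics): it opens three connections and sends an eth_subscribe on each at the same moment, then logs everything pushed per connection.

package main

import (
    "log"
    "sync"

    "github.com/gorilla/websocket"
)

func main() {
    // Stand-in payload; the real app subscribes to 3 different topics.
    sub := `{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params":["newHeads"]}`

    var wg sync.WaitGroup
    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            conn, _, err := websocket.DefaultDialer.Dial("ws://127.0.0.1:8546", nil)
            if err != nil {
                log.Printf("conn %d: dial: %v", id, err)
                return
            }
            defer conn.Close()
            if err := conn.WriteMessage(websocket.TextMessage, []byte(sub)); err != nil {
                log.Printf("conn %d: subscribe: %v", id, err)
                return
            }
            // A connection that goes silent right after its subscription id
            // mirrors the reported symptom.
            for {
                _, msg, err := conn.ReadMessage()
                if err != nil {
                    log.Printf("conn %d: closed: %v", id, err)
                    return
                }
                log.Printf("conn %d: %s", id, msg)
            }
        }(i)
    }
    wg.Wait()
}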

@VictorTrustyDev
Contributor

How about also adding es.index[][] = f in the else block?

@VictorTrustyDev
Contributor

Or back up the existing channel (if any) and restore it in case of err != nil.
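
A rough sketch of that backup/restore idea against the snippet above (again just an illustration, not a tested fix):

prev, hadPrev := es.topicChans[f.event]
ch := make(chan coretypes.ResultEvent)
es.topicChans[f.event] = ch
if err := es.eventBus.AddTopic(f.event, ch); err != nil {
    es.logger.Error("failed to add event topic to event bus", "topic", f.event, "error", err.Error())
    // Restore whichever channel was working before, or drop the entry entirely.
    if hadPrev {
        es.topicChans[f.event] = prev
    } else {
        delete(es.topicChans, f.event)
    }
}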

@CRossel87a
Author

Or back up the existing channel (if any) and restore it in case of err != nil.

I am a bit out of my element with Golang :) Just hacking around

@mmsqe
Contributor

mmsqe commented May 9, 2023

@CRossel87a May I ask the steps you used to verify it is not solved?

@mmsqe Same steps as described above. However, I think the reason you have not been able to reproduce it is that it seems to happen when a program attempts to create multiple subscriptions at the same time, such as an app subscribing to 3 topics at once over 3 separate connections. I am now adding time between subscriptions to see whether this produces stability, testing the hypothesis that there is some multi-threaded collision.

If the channel gets changed somewhere when concurrent subscribes happen, your fix should work. I'll try to add more tests for it.

@DefiDebauchery

DefiDebauchery commented May 9, 2023

Maybe I'm misunderstanding the context of the last few messages, but I just want to note that this bug shouldn't have anything to do with a collision on concurrent subscriptions. A single subscription (with no other connections) will also fail after a time.

@CRossel87a
Author

@mmsqe @VictorTrustyDev I think we have the solution! The websocket event subscriptions have now been persistent and stable for two days running.

@VictorTrustyDev
Contributor

VictorTrustyDev commented May 12, 2023

Nice, how did you fix it?

@github-actions

This issue is stale because it has been open 45 days with no activity. Remove Status: Stale label or comment or this will be closed in 7 days.
