Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dwallet RPC connection to daemon time out and doesn't reconnect #210

Closed
chappjc opened this issue Apr 25, 2016 · 14 comments
Closed

dwallet RPC connection to daemon time out and doesn't reconnect #210

chappjc opened this issue Apr 25, 2016 · 14 comments
Labels

Comments

@chappjc
Copy link
Member

chappjc commented Apr 25, 2016

15:39:13 2016-04-25 [INF] WLLT: Connecting block 0000000023709697effd8bd72a0723fac6269e1a9f54d5c062443616b0b235d4, height 61831
15:42:13 2016-04-25 [ERR] CHNS: Timeout waiting for session RPC

2 wallets on the same machine connected to the same dcrd (testne) have been hung for 2+ hours for no clear reason. The daemon is still running. dcrwallet doesn't want to retry the connection.

Restarding dcrd did not bring either dcrwallet back to life. CTRL+C on the wallets hang with:

[WRN] DCRW: Stopping legacy RPC server...
[INF] RPCS: Disconnected websocket client .............:58763

@girino
Copy link

girino commented Apr 25, 2016

same thing here. "dcrctl --wallet getinfo" would return:
-4: the client has been shutdown

"dcrctl --wallet stop" would log/print shutdown messages and stop it listening for RPC, but the process would not finish. I had to "kill" it manually.

@cjepson
Copy link
Contributor

cjepson commented Apr 26, 2016

In the future, if you get a hang like this run

killall -SIGABRT dcrwallet

That will give us a stack trace to follow up on.

@chappjc
Copy link
Member Author

chappjc commented Apr 26, 2016

Ah, yes, that was the same message! I forgot to get a trace because I was in a hurry.

But it was something caused be the daemon, since both wallets that were connected hung up. I'll check the daemon log at that time.

@girino
Copy link

girino commented Apr 26, 2016

I have recorded the dcrd log, but it was not on -d debug, so there is'nt much information, just:

13:16:05 2016-04-24 [ERR] RPCS: Websocket receive error from 127.0.0.1:51798: websocket: close 1006 unexpected EOF 
13:16:05 2016-04-24 [INF] RPCS: Disconnected websocket client 127.0.0.1:51798 

@girino
Copy link

girino commented Apr 26, 2016

@cjepson i tried this killall -SIGABRT, but the wallet just quit, no stack trace.
Possibly the stacktrace goes to stdout/stderr?
In that case, it was lost. I'll run it with "nohup" so i can recover the stdout/err output next time. (again, this happened on my solo stake mining wallet, not on the stakepool).

@chappjc
Copy link
Member Author

chappjc commented Apr 26, 2016

You might consider tmux, with the logging plugin. https://github.com/tmux-plugins/tmux-logging.
It's very easy to install and get used to.

@jcvernaleo
Copy link
Member

jcvernaleo commented Apr 26, 2016

@chappjc tmux suggestion is a very good one. I basically never run dcrd/wallet (or btcd) not in a tmux window. An alternative to that logging plugin (which I hadn't seen before) is to just define a shortcut to tell tmux to write the scrollback to a file. In my ~/.tmux.conf I have the following line:

# Save scrollback
bind-key s capture-pane -S -4500 \; save-buffer ~/tmp/tmux.log

That way s (after whatever the key you have for normal tmux commands) will save the scrollback to the file ~/tmp/tmux.log

@chappjc
Copy link
Member Author

chappjc commented Apr 27, 2016

@jcvevernaleo with the plugin, you can start and stop logging (unique file each time) with Shift+p. To dump the entire scrollback it's Alt+Shift+p.
I generally include the install in any deploy script:

@chappjc chappjc closed this as completed Apr 27, 2016
@chappjc
Copy link
Member Author

chappjc commented Apr 27, 2016

Oops, didn't mean to close... working on my phone. :)
The setup I was trying to paste:

git clone https://github.com/tmux-plugins/tmux-logging ~/clone/path

Add to end of .tmux.conf:

run-shell ~/clone/path/logging.tmux

@chappjc chappjc reopened this Apr 27, 2016
@girino
Copy link

girino commented Apr 28, 2016

I got a stacktrace with SIGABRT. i couldn't figure out how to attach it here, so i put it here: http://pastebin.com/ZGErbbaW

@jcvernaleo jcvernaleo added the bug label Apr 28, 2016
@cjepson
Copy link
Contributor

cjepson commented May 1, 2016

@girino In your case it looks like it sends to a channel (the chainntfns channel) when the channel isn't handling ntfns after "Timeout waiting for session RPC".

goroutine 2463 [chan send, 279 minutes]:
github.com/decred/dcrwallet/chain.(_RPCClient).(github.com/decred/dcrwallet
/chain.onStakeDifficulty)-fm(0xc831448720, 0x5adb, 0xe28c591b)
/opt/go/src/github.com/decred/dcrwallet/chain/chain.go:89 +0xae
github.com/decred/dcrrpcclient.(_Client).handleNotification(0xc8200772c0, 0xc8204720c0)

https://github.com/decred/dcrwallet/blob/master/chain/chain.go#L388-L392

The lock up might be because the channel is unbuffered and handling stops when the RPC server fails to connect, then the channel overflows and gridlocks. Maybe @jrick can shed some light on this issue.

If the channel is unbuffered, the sender blocks until the receiver has received the value. If the channel has a buffer, the sender blocks only until the value has been copied to the buffer; if the buffer is full, this means waiting until some receiver has retrieved a value.

https://golang.org/doc/effective_go.html#channels

@chappjc
Copy link
Member Author

chappjc commented May 2, 2016

Perhaps it could have a timeout with a select statement and case: <-time.After(10 * time.Second): or something similar. (Sorry I don't have a trace, BTW). Anyway, this is needed in another project I'm looking at, so I'm toying with solutions like this there.

@jrick
Copy link
Member

jrick commented May 2, 2016

The lock up might be because the channel is unbuffered and handling stops when the RPC server fails to connect, then the channel overflows and gridlocks.

Yes. Look at all of the other on* methods. You'll see that they select on not just the notification send, also reading the quit channel.

@jrick
Copy link
Member

jrick commented May 2, 2016

@alexlyp this isn't fixed yet (although a hang discovered by it was). The real fix is to make the dcrd json-rpc server process websocket requests asynchronously.

Although, this is more of a dcrd issue than a dcrwallet one, so don't feel the need to reopen it. I have a btcd issue and PR for this, and we can create a dcrd issue for it as well if we want to keep track of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants