lost volume server always after Volume xyz becomes unwritable #499

Closed
ingardm opened this issue May 23, 2017 · 4 comments


@ingardm
Contributor

ingardm commented May 23, 2017

Hi

We're seeing very frequent volume server disconnects. I've tried running the master with different pulseSeconds. The latest attempt was the master with pulseSeconds=60 and the volume servers with pulseSeconds=1 (based on #408).
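(For context, a rough sketch of the timing involved: the master declares a volume server lost when it has not heard a heartbeat within some multiple of its pulse interval, so the pulseSeconds chosen on each side directly controls how quickly "lost volume server" fires. The types, names, and thresholds below are hypothetical illustrations, not SeaweedFS's actual code.)

```go
// Hypothetical sketch of a master-side liveness check: if a volume server's last
// heartbeat is older than some multiple of the master's pulse interval, the server
// is declared lost. Names (livenessChecker, sweep, tolerance) are made up here.
package main

import (
	"fmt"
	"sync"
	"time"
)

type livenessChecker struct {
	mu        sync.Mutex
	lastSeen  map[string]time.Time // volume server address -> last heartbeat time
	pulse     time.Duration        // the master's -pulseSeconds
	tolerance int                  // assumed: how many missed pulses before "lost"
}

// Heartbeat records that a volume server just reported in.
func (c *livenessChecker) Heartbeat(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastSeen[addr] = time.Now()
}

// sweep declares any server silent for longer than pulse*tolerance as lost.
func (c *livenessChecker) sweep() {
	c.mu.Lock()
	defer c.mu.Unlock()
	deadline := c.pulse * time.Duration(c.tolerance)
	for addr, seen := range c.lastSeen {
		if time.Since(seen) > deadline {
			fmt.Println("lost volume server", addr)
			delete(c.lastSeen, addr) // the master then removes its volumes from the topology
		}
	}
}

func main() {
	c := &livenessChecker{
		lastSeen: map[string]time.Time{
			"10.1.14.22:8082": time.Now().Add(-3 * time.Minute), // silent for a while
			"10.1.14.27:8080": time.Now(),                       // just reported in
		},
		pulse:     60 * time.Second, // master started with pulseSeconds=60
		tolerance: 2,                // hypothetical: two missed pulses => lost
	}
	c.sweep() // prints: lost volume server 10.1.14.22:8082
}
```

The logs later in this issue suggest the heartbeat stream itself broke rather than the pulse simply being too slow, but the interval still bounds how fast the master reacts.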

As far as I can tell, it always happens after volumes become unwritable:

I0523 20:52:43 12295 volume_layout.go:203] Volume 188233 becomes unwritable
I0523 20:52:43 12295 volume_layout.go:203] Volume 188232 becomes unwritable
I0523 20:52:43 12295 volume_layout.go:203] Volume 188234 becomes unwritable
I0523 20:52:43 12295 volume_layout.go:203] Volume 188235 becomes unwritable
I0523 20:52:43 12295 volume_layout.go:203] Volume 188236 becomes unwritable
I0523 20:52:44 12295 master_grpc_server.go:63] lost volume server 10.1.14.22:8082
I0523 20:52:44 12295 topology_event_handling.go:52] Removing Volume 1241 from the dead volume server 10.1.14.22:8082
I0523 20:52:44 12295 volume_layout.go:227] Volume 1241 has 0 replica, less than required 2
I0523 20:52:44 12295 topology_event_handling.go:52] Removing Volume 1541 from the dead volume server 10.1.14.22:8082
I0523 20:52:44 12295 volume_layout.go:227] Volume 1541 has 1 replica, less than required 2
.
.
I0523 20:53:59 12295 node.go:237] topo:DefaultDataCenter:DefaultRack removes 10.1.14.22:8082
.
.
I0523 20:54:00 12295 volume_growth.go:205] Created Volume 188240 on 10.1.14.27:8080
I0523 20:54:00 12295 volume_growth.go:205] Created Volume 188241 on topo:DefaultDataCenter:DefaultRack:10.1.14.24:8081
I0523 20:54:00 12295 volume_growth.go:205] Created Volume 188241 on topo:DefaultDataCenter:DefaultRack:10.1.14.23:8081
I0523 20:54:00 12295 volume_growth.go:205] Created Volume 188242 on topo:DefaultDataCenter:DefaultRack:10.1.14.23:8083
I0523 20:54:00 12295 volume_growth.go:205] Created Volume 188242 on topo:DefaultDataCenter:DefaultRack:10.1.14.27:8083
I0523 20:54:00 12295 node.go:223] topo:DefaultDataCenter:DefaultRack adds child 10.1.14.22:8082
I0523 20:54:00 12295 master_grpc_server.go:36] added volume server 10.1.14.22:8082
I0523 20:54:00 12295 node.go:223] topo:DefaultDataCenter:DefaultRack adds child 10.1.14.24:8083
I0523 20:54:00 12295 master_grpc_server.go:36] added volume server 10.1.14.24:8083
I0523 20:54:00 12295 volume_layout.go:203] Volume 187989 becomes unwritable
I0523 20:54:00 12295 volume_layout.go:203] Volume 187519 becomes unwritable
I0523 20:54:01 12295 node.go:223] topo:DefaultDataCenter:DefaultRack adds child 10.1.14.27:8080
I0523 20:54:01 12295 master_grpc_server.go:36] added volume server 10.1.14.27:8080
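(For readers skimming the log: the removal sequence above corresponds to the master dropping every volume held by the now-dead server and flagging volumes whose remaining replica count falls below the required replication. The following is only an illustrative sketch with hypothetical names, not the code in topology_event_handling.go / volume_layout.go.)

```go
// Hypothetical sketch of the bookkeeping the master log above shows: when a volume
// server is declared dead, each volume it held is removed from the layout, and any
// volume whose remaining replica count drops below the configured replication is logged.
package main

import "fmt"

type volumeLayout struct {
	replication int                 // required replica count, e.g. 2
	locations   map[uint32][]string // volume id -> servers holding a replica
}

func (vl *volumeLayout) removeDeadServer(addr string) {
	for vid, servers := range vl.locations {
		kept := make([]string, 0, len(servers))
		removed := false
		for _, s := range servers {
			if s == addr {
				removed = true
				continue
			}
			kept = append(kept, s)
		}
		if !removed {
			continue // this volume had no replica on the dead server
		}
		vl.locations[vid] = kept
		fmt.Printf("Removing Volume %d from the dead volume server %s\n", vid, addr)
		if len(kept) < vl.replication {
			fmt.Printf("Volume %d has %d replica, less than required %d\n", vid, len(kept), vl.replication)
		}
	}
}

func main() {
	vl := &volumeLayout{
		replication: 2,
		locations: map[uint32][]string{
			1241: {"10.1.14.22:8082"},                    // only copy was on the lost server
			1541: {"10.1.14.22:8082", "10.1.14.23:8081"}, // one replica survives
		},
	}
	vl.removeDeadServer("10.1.14.22:8082")
}
```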

@ingardm
Contributor Author

ingardm commented May 23, 2017

Volume server 10.1.14.22:8082 logs from the same time:
I0523 20:52:43 17403 store.go:279] volume 188234 size 31454086100 will exceed limit 31457280000
I0523 20:52:43 17403 store.go:279] volume 188234 size 31455191006 will exceed limit 31457280000
I0523 20:52:43 17403 store.go:279] volume 188234 size 31455977716 will exceed limit 31457280000
I0523 20:52:43 17403 store.go:279] volume 188234 size 31457099306 will exceed limit 31457280000
I0523 20:52:43 17403 store.go:279] volume 188234 size 31457248214 will exceed limit 31457280000
I0523 20:52:43 17403 store.go:279] volume 188234 size 31457365656 will exceed limit 31457280000
.
.
(lots of these lines)
.
I0523 20:52:44 17403 store.go:279] volume 188236 size 31486811227 will exceed limit 31457280000
I0523 20:52:44 17403 store.go:279] volume 188235 size 31489459644 will exceed limit 31457280000
I0523 20:52:44 17403 store.go:279] volume 188236 size 31488599353 will exceed limit 31457280000
I0523 20:52:44 17403 store.go:279] volume 188235 size 31491156486 will exceed limit 31457280000
I0523 20:53:59 17403 store.go:292] error when reporting size: EOF
I0523 20:53:59 17403 store.go:292] error when reporting size: EOF
I0523 20:53:59 17403 store.go:292] error when reporting size: EOF
I0523 20:53:59 17403 store.go:292] error when reporting size: EOF
I0523 20:53:59 17403 store.go:292] error when reporting size: EOF
.
.
(lots of these lines also)
.
I0523 20:53:59 17403 volume_grpc_client.go:91] Volume Server Failed to talk with master 10.1.14.28:9333: EOF
.
I0523 20:53:59 17403 volume_grpc_client.go:25] heartbeat error: EOF
.
I0523 20:53:59 17403 store.go:292] error when reporting size: rpc error: code = Internal desc = transport is closing
I0523 20:54:00 17403 store.go:35] Resetting master nodes: nodes:[10.1.14.28:9333 10.1.14.28:9333], leader:
I0523 20:54:00 17403 volume_grpc_client.go:52] Heartbeat to 10.1.14.28:9333
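(The tail of this log suggests the long-lived stream to the master broke: every size report starts failing with EOF, then the volume server resets its master list and re-establishes the heartbeat. A minimal, hypothetical sketch of that reconnect cycle, not the actual volume_grpc_client.go, looks roughly like this:)

```go
// Hypothetical sketch: the volume server pushes size reports/heartbeats over one
// long-lived stream to the master; when that stream returns EOF, reporting fails
// until the client resets its masters and dials again, mirroring the
// "error when reporting size: EOF" -> "Resetting master nodes" -> "Heartbeat to ..."
// sequence in the log above. All types and names here are made up.
package main

import (
	"errors"
	"fmt"
	"io"
	"time"
)

// masterStream stands in for the long-lived heartbeat stream to the master.
type masterStream interface {
	Send(msg string) error
	Close() error
}

// brokenStream simulates the master side having closed the connection.
type brokenStream struct{}

func (brokenStream) Send(string) error { return io.EOF }
func (brokenStream) Close() error      { return nil }

// heartbeatLoop sends periodic reports; when the stream breaks it logs the EOF,
// resets its view of the masters, and dials again.
func heartbeatLoop(masters []string, dial func(addr string) (masterStream, error), rounds int) {
	for attempt := 0; attempt < rounds; attempt++ {
		addr := masters[attempt%len(masters)]
		fmt.Println("Heartbeat to", addr)
		stream, err := dial(addr)
		if err != nil {
			continue
		}
		for {
			if err := stream.Send("volume sizes ..."); err != nil {
				if errors.Is(err, io.EOF) {
					fmt.Println("error when reporting size: EOF")
				}
				_ = stream.Close()
				break
			}
			time.Sleep(time.Second) // the volume server's pulseSeconds
		}
		fmt.Println("Resetting master nodes:", masters)
	}
}

func main() {
	heartbeatLoop([]string{"10.1.14.28:9333"},
		func(string) (masterStream, error) { return brokenStream{}, nil },
		2) // two attempts, just to show the reconnect cycle
}
```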

chrislusf added a commit that referenced this issue May 24, 2017
@chrislusf
Collaborator

Added a possible fix.

@ingardm
Contributor Author

ingardm commented May 24, 2017

We're rerunning our tests now and so far so good :) I'll report back later when the test completes.

@chrislusf
Collaborator

closing.

This issue was closed.