Garbage Collection crashes with timeout #991

Closed
rainmaker opened this issue Oct 9, 2014 · 4 comments

@rainmaker

Hello,

We have been experimenting with various GC scenarios and configurations, and have run into the same crash repeatedly.

2014-10-09 17:38:50.069 [error] <0.436.0> Supervisor riak_cs_delete_fsm_sup had child undefined started with {riak_cs_delete_fsm,start_link,undefined} at <0.4353.0> exit with reason timeout in context child_terminated
2014-10-09 17:38:50.070 [error] <0.154.0> Supervisor riak_cs_sup had child riak_cs_gc_d started with riak_cs_gc_d:start_link() at <0.430.0> exit with reason timeout in context child_terminated

In all our experiments, we used s3cmd to interact with our 3-node RiakCS cluster. We have tried modulating these factors:

  • Size of file to upload and delete. Generally, GC of one or many files totaling <50MB happens successfully, while >50MB total usually results in this crash.
  • RAM - We tried 4G to 16G of RAM with no discernible difference in GC performance
  • gc_batch_size - Tried from the default value (1000) down to 1. It seemed like reducing the batch size improved our success rate, but still didn't help with the larger GCs
  • gc_max_workers - Tried between 1 and 20. Reducing concurrency by setting to 1 or setting delete_concurrency to 1 did not seem to improve things.
  • Number of files - Uploading and deleting a single 200MB file seemed to have a similar success rate to uploading and deleting 200 1MB files
  • The Github issues mentioned below indicate that the deletion of a multi-part upload is more error-prone. Disabling multi-part upload in the s3cmd client improved the reliability of GC (it succeeded more often), but did not fix it altogether (it still timed out on multiple 200MB files)
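
A minimal Python sketch of how the workloads in the list above could be scripted for reproduction. It only builds the s3cmd command lines (`put`, `del`, and the `--disable-multipart` option) and does not run them. The bucket name `gc-test` and the file naming scheme are hypothetical, and this is not the script the reporters used.

```python
def per_file_mb(total_mb, file_count):
    """Size of each file so that file_count files add up to total_mb."""
    if file_count <= 0 or total_mb % file_count != 0:
        raise ValueError("total_mb must divide evenly across file_count")
    return total_mb // file_count


def build_commands(bucket, file_count, disable_multipart=False):
    """Return (put_cmds, del_cmds) as argv lists suitable for subprocess.run."""
    put_cmds, del_cmds = [], []
    for i in range(file_count):
        name = f"gc-test-{i:03d}.bin"      # hypothetical local file name
        target = f"s3://{bucket}/{name}"
        put = ["s3cmd", "put"]
        if disable_multipart:
            put.append("--disable-multipart")
        put += [name, target]
        put_cmds.append(put)
        del_cmds.append(["s3cmd", "del", target])
    return put_cmds, del_cmds


# The two workloads compared in the list: one 200MB file vs. 200 1MB files.
single = build_commands("gc-test", 1)
many = build_commands("gc-test", 200, disable_multipart=True)
```

Each argv list can be passed to `subprocess.run(cmd, check=True)` once the local files exist. Garbage collection then has to be observed separately on the RiakCS nodes.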

We searched to see if our issue had been discovered before, and came across #949, #946, and #827. Since these were marked as fixed for RiakCS 1.5.1, we downloaded the new release and tried many of the same scenarios. Unfortunately, there appears to be no significant difference in GC between RiakCS 1.5.0 and 1.5.1.

https://gist.github.com/rmasand/4a4c0975ad5b494c1c90 contains our config and our logs.

We would appreciate either a recommendation of a workaround in 1.5.0, or suggestions as to why this still appears broken in 1.5.1.

Raina & @robdimsdale
Cloud Foundry Services

@rainmaker changed the title from "Garbage Collection" to "Garbage Collection crashes with timeout" on Oct 9, 2014
@kuenishi
Contributor

How is the resource usage on the machines? Is anything saturated? I think the timeout is caused by Riak being slow under heavy load, which can be internal or external.
As for internal load, garbage collection in Riak involves a lot of reads, so I would recommend checking our tuning guide to make sure the cluster is correctly tuned in terms of hardware, network, and kernel settings.

@rainmaker
Author

Hello,

As far as we can tell, the resources of the machine are not under stress during garbage collection. We are running RiakCS on a VM, which has 2 CPUs, 4G RAM, and 10G disk space. We tried increasing the RAM to 16G at one point, as well as monitoring I/O using iostat. Increasing the resources did not prevent the crash, and we didn't see anything significant from the iostat results. Is the GC timeout at all configurable?

Thanks,

Raina && @ruthie
Cloud Foundry Services

@tartemov

tartemov commented Nov 6, 2014

We had the same issue. We resolved it by setting the parameter +zdbbl 32768 in riak/vm.args.
I think Basho should restore this parameter to the RiakCS 1.5.x documentation (http://docs.basho.com/riakcs/latest/cookbooks/configuration/Configuring-Riak/#Other-Riak-Settings)
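
For context, `+zdbbl` is the Erlang VM flag for the distribution buffer busy limit, given in kilobytes, so 32768 corresponds to 32 MB. Below is a minimal Python sketch for checking whether a given vm.args sets it. The helper name `read_zdbbl` and the sample text are illustrative assumptions, not part of Riak or RiakCS.

```python
def read_zdbbl(vm_args_text):
    """Return the +zdbbl value in kilobytes, or None if the flag is not set.

    If the flag appears more than once, the last occurrence is returned.
    """
    value = None
    for line in vm_args_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop '#' comments
        parts = line.split()
        if len(parts) == 2 and parts[0] == "+zdbbl":
            value = int(parts[1])
    return value


# Hypothetical vm.args excerpt for illustration only.
sample = """
## Distribution buffer busy limit
+zdbbl 32768
"""

limit_kb = read_zdbbl(sample)
limit_mb = limit_kb // 1024
```

To check a real node, read the file contents first, for example `read_zdbbl(open(path).read())`, where `path` points at that node's vm.args.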

@kuenishi
Contributor

I'm going to close this, as I hear the problem is solved. If there are any additional issues or questions, don't hesitate to add them here.
