Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faktory cannot boot, because Redis is slow to load (?) #225

Closed
ttilberg opened this issue May 27, 2019 · 9 comments
Closed

Faktory cannot boot, because Redis is slow to load (?) #225

ttilberg opened this issue May 27, 2019 · 9 comments

Comments

@ttilberg
Copy link
Contributor

Faktory is unable to startup. I receive error:

# /faktory -b 0.0.0.0:7419 -w 0.0.0.0:7420 -e production
Faktory 1.0.0
Copyright © 2019 Contributed Systems LLC
Licensed under the GNU Public License 3.0
I 2019-05-27T21:44:41.340Z Initializing redis storage at /var/lib/faktory/db, socket /var/lib/faktory/db/redis.sock
E 2019-05-27T21:44:41.362Z error=LOADING Redis is loading the dataset in memory Unable to create Faktory server

I'm using the Docker configuration, storing data in a separate docker volume:

/var/lib/faktory/db # ls -al
total 576632
drwxr-xr-x    2 root     root          4096 May 27 21:44 .
drwxr-xr-x    3 root     root          4096 Oct 17  2018 ..
-rw-r--r--    1 root     root     498190843 May 27 11:09 faktory.rdb
-rw-r--r--    1 root     root       1294928 May 27 21:44 redis.log
srwx------    1 root     root             0 May 27 21:44 redis.sock
-rw-r--r--    1 root     root      90962163 May 27 21:27 temp-11.rdb

This seems to be the relevant area of code I'm getting to: https://github.com/contribsys/faktory/blob/master/storage/redis.go#L48

I get the impression that there is not currently a mechanism to handle the LOADING status and try again. I don't know Go at all, so I'm kind of guessing. I see this being an issue in some other projects:

Here's some background about arriving here:

I had a system issue this weekend where a data cleanup job processor was not running and a lot of jobs were queued totaling a somewhat sizable amount of data. Looking at various logs, I believe I ran out of memory at some point, as I have chunk of lines like this in the Redis log:

11:M 27 May 21:26:57.060 # Can't save in background: fork: Out of memory

I saw the following error on my worker processes:

ReplyError: ERR MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.

This made me think I also ran out of disk space, but disk space appeared to be okay on the host (around 50%)

I've doubled the amount of memory the host VM has available, but Faktory still will not boot. The redis log generated when attempting to boot is below, which seems to show Redis is getting up, but there's a lot of warnings and I'm not totally fluent in what these are implying. Looking at older logs, I feel that these may even be common in a successful run of Faktory.

12:C 27 May 22:02:04.460 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
12:C 27 May 22:02:04.460 # Redis version=4.0.11, bits=64, commit=bca38d14, modified=0, pid=12, just started
12:C 27 May 22:02:04.460 # Configuration loaded
12:M 27 May 22:02:04.461 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
12:M 27 May 22:02:04.461 # Server initialized
12:M 27 May 22:02:04.461 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
12:M 27 May 22:02:04.461 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
@ttilberg
Copy link
Contributor Author

I just copied my database volume and tried booting redis-server referencing the Faktory data to try to learn more. It took a little over 5 seconds to load Redis, which makes me feels stronger about my assumption earlier about waiting for Redis to load:

/var/lib/faktory/db2 # redis-server --dbfilename faktory.rdb --dir /var/lib/faktory/db2
71:C 27 May 22:28:30.375 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
71:C 27 May 22:28:30.375 # Redis version=4.0.11, bits=64, commit=bca38d14, modified=0, pid=71, just started
71:C 27 May 22:28:30.375 # Configuration loaded
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 4.0.11 (bca38d14/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 71
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

71:M 27 May 22:28:30.376 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
71:M 27 May 22:28:30.376 # Server initialized
71:M 27 May 22:28:30.376 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
71:M 27 May 22:28:30.376 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
71:M 27 May 22:28:36.131 * DB loaded from disk: 5.755 seconds
71:M 27 May 22:28:36.131 * Ready to accept connections


@mperham
Copy link
Collaborator

mperham commented May 27, 2019

If you can get me a copy of that database I'll see if I can reproduce the issue. Master has a fix for this LOADING issue and I'd like to verify it fixes the problem before releasing 1.0.1.

@ttilberg
Copy link
Contributor Author

Oh, hey! There it is! I didn't keep reading down because I was feeling a bit lost in the sauce. Happy to send you that file in the morning, I don't currently have remove access.

Now that's a fast fix!

@ttilberg
Copy link
Contributor Author

@mperham I've emailed you the files requested to verify ac4cd80. Thanks for your work!

@mperham
Copy link
Collaborator

mperham commented May 28, 2019

🎉 THANK YOU! 🎉 The fix has been confirmed and will be in 1.0.1.

I 2019-05-28T17:35:22.116Z Initializing redis storage at /Users/mikeperham/.faktory/db, socket /Users/mikeperham/.faktory/db/redis.sock
D 2019-05-28T17:35:22.118Z Booting Redis: /usr/local/bin/redis-server /tmp/redis.conf --unixsocket /Users/mikeperham/.faktory/db/redis.sock --loglevel notice --dir /Users/mikeperham/.faktory/db --logfile /Users/mikeperham/.faktory/db/redis.log
D 2019-05-28T17:35:22.142Z Redis booted in 21.832508ms
I 2019-05-28T17:35:22.162Z Faktory is waiting for Redis to load...
I 2019-05-28T17:35:23.177Z Faktory is waiting for Redis to load...
I 2019-05-28T17:35:24.188Z Faktory is waiting for Redis to load...
I 2019-05-28T17:35:25.200Z Faktory is waiting for Redis to load...
I 2019-05-28T17:35:26.221Z Faktory is waiting for Redis to load...
D 2019-05-28T17:35:27.228Z Running Redis v5.0.3

@mperham mperham closed this as completed May 28, 2019
@phawzy
Copy link

phawzy commented Jul 10, 2019

Hello @mperham
in the fix ac4cd80
you hardcoded the 60 seconds period and i have a 6.3 Gb rdb file that redis cant load in the memory in that period so the faktory service can start
can we make this period configurable ?

@mperham
Copy link
Collaborator

mperham commented Jul 10, 2019

@phawzy I would consider it an error that the machine can't load its dataset in under a minute. IMO that signals that you need to upgrade to a faster machine or trim down the size of your jobs. I don't want to make this a config option -- the user should not have to care about this -- and I don't want to extend the timeout any further because it's arbitrary, what do I extend it to: 2 min, 5 min, 10 min?

A 6GB dataset indicates many millions of jobs or you are loading too much state into Faktory when you should be storing that state in another persistent store, like S3.

@cjbottaro
Copy link
Contributor

Oofa this just bit us. Background job processing is the bread and butter of our app and we just ported over a huge portion of our Sidekiq jobs to Faktory.

I had to build a custom Faktory binary and deploy it. It took 66 seconds for Redis to boot.

Previously we used Sidekiq with four 16 gb Redis ElastiCache instances (just to give you a use case datapoint).

cc @mperham @phawzy

@mperham
Copy link
Collaborator

mperham commented Jul 6, 2021

There's never a right value for a maximum timeout. I've lengthened the value on master from 60 to 600.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants