fix(redis): fix master failure crashing the server #258

Merged: 2 commits merged on Nov 29, 2021

samuelmasse
Contributor

There was a problem when using Redis with Sentinel (a master and slaves): a master switch could crash the server. The server was set to throw an exception if connection attempts failed more than 10 times. I removed this limit, so the server now retries connecting to Redis indefinitely.
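
To illustrate the change described above, here is a minimal sketch, not the actual diff from this PR. It assumes the limit lived in a retry callback of the shape ioredis accepts through its `retryStrategy` option (a function that receives the attempt count and returns the delay in milliseconds before the next attempt). The function names, the 50 ms step, and the 2000 ms cap are hypothetical.

```typescript
// Hypothetical values: the real delays used by the server are not shown in this PR.
const STEP_MS = 50;
const MAX_DELAY_MS = 2000;

// Before (as described): give up by throwing once more than 10 attempts have failed.
// An exception thrown here that nobody catches is what takes the process down.
function boundedRetry(times: number): number {
  if (times > 10) {
    throw new Error(`Unable to connect to Redis after ${times - 1} attempts`);
  }
  return Math.min(times * STEP_MS, MAX_DELAY_MS);
}

// After (as described): never give up, only cap the delay between attempts,
// so the client keeps retrying until a new master has been promoted.
function unboundedRetry(times: number): number {
  return Math.min(times * STEP_MS, MAX_DELAY_MS);
}

// Assumed usage with ioredis (not executed here):
//   new Redis({ ...options, retryStrategy: unboundedRetry })
```

With the unbounded variant, a failover that takes longer than ten attempts only produces repeated connection errors in the log instead of a restart.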

Closes DEV-2002

@linear

linear bot commented Nov 25, 2021

DEV-2002 (BUG) Redis connection doesn't recover after master switching in Redis Sentinel (botpress/borpress botpress/v12#1549)

Describe the bug
When using a Redis Sentinel architecture, whenever the current master goes down and there is a master switch to one of the slaves, Botpress crashes and restarts in order to recover. This is unlike the database connection, which just keeps throwing errors and reconnects when the server is back up again.

Error after shutting down the current Master (tested using docker):

11/08/2021 21:33:33.288 [Messaging] Redis Id is 308054
11/08/2021 21:33:33.338 [Messaging] Launcher Using channels: messenger, slack, teams, telegram, twilio, discord, smooch, vonage
11/08/2021 21:33:33.341 [Messaging] Launcher Messaging is listening at: http://localhost:3100
11/08/2021 21:33:33.541 [NLU] Launcher NLU Server is ready at http://localhost:3200/
11/08/2021 21:33:34.014 [Studio] Launcher ===========================================================================
                                                                        Botpress Studio
                                                           Version 0.0.41 - Build 20211022-0200_BIN
                                          ===========================================================================
11/08/2021 21:33:34.017 [Studio] JobService ClientId: iXNDk_6SEK5ML_pxMh63d
11/08/2021 21:33:34.025 [Studio] Server Loaded 10 modules
11/08/2021 21:33:34.060 [Studio] CMS Loaded 11 content types
11/08/2021 21:33:34.112 [Studio] Server Discovered 1 bot, mounting it...
11/08/2021 21:33:34.155 [Studio] Server Started in 140ms
11/08/2021 21:33:34.155 [Studio] Launcher Studio is listening at: http://localhost:4000
11/08/2021 21:33:34.337 Server Local Action Server will only run in experimental mode
11/08/2021 21:33:34.345 Server Started in 3446ms
11/08/2021 21:33:34.346 Launcher Botpress is listening at: http://localhost:3000
11/08/2021 21:33:34.346 Launcher Botpress is exposed at: http://localhost:3000
[ioredis] Unhandled error event: Error: connect ECONNREFUSED 192.168.100.2:7010
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)
[ioredis] Unhandled error event: Error: connect ECONNREFUSED 192.168.100.2:7010
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)
[ioredis] Unhandled error event: Error: connect ECONNREFUSED 192.168.100.2:7010
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)
[ioredis] Unhandled error event: Error: connect ECONNREFUSED 192.168.100.2:7010
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)
[ioredis] Unhandled error event: Error: connect ECONNREFUSED 192.168.100.2:7010
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)
11/08/2021 21:34:02.005 Launcher Unhandled Rejection [Error, connect ECONNREFUSED 192.168.100.2:7010]
STACK TRACE
Error: connect ECONNREFUSED 192.168.100.2:7010
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)
11/08/2021 21:34:02.012 Cluster [web] Restarting process...
11/08/2021 21:34:03.682 Launcher ===========================================================================
                                                               Botpress Server
                                     Version 12.26.6 - Build 20211022-0303_BIN - OS: linux ubuntu_18_04
                                 ===========================================================================
11/08/2021 21:34:03.683 Launcher App Data Dir: "/root/botpress"
11/08/2021 21:34:03.683 Launcher Using 10 modules
                        ⦿ analytics
                        ⦿ basic-skills
                        ⦿ builtin
                        ⦿ channel-web
                        ⦿ code-editor
                        ⦿ examples
                        ⦿ hitlnext
                        ⦿ misunderstood
                        ⦿ nlu
                        ⦿ testing
                        ⊝ bot-improvement (disabled)
                        ⊝ broadcast (disabled)
                        ⊝ google-speech (disabled)
                        ⊝ hitl (disabled)
                        ⊝ libraries (disabled)
                        ⊝ ndu (disabled)
                        ⊝ uipath (disabled)
..............

Here Botpress is able to recover after the master switches to one of the slaves, but only after automatically restarting the server. This seems to be related to an unhandled exception.

To Reproduce
Steps to reproduce the behavior:

  1. Set up the following Redis Sentinel Architecture:

1 Master -> 2 Slaves -> 3 sentinels

  2. Start Botpress, configuring the 3 sentinels and the master.

Example config:

Master Name: mymaster
Master Port: 7010

Slave1 Port: 7021
Slave2 Port: 7022

Sentinel1 Port: 26379
Sentinel2 Port: 26380
Sentinel3 Port: 26381

-e REDIS_URL=undefined
-e REDIS_OPTIONS='{"sentinels": [ { "host": "host.docker.internal", "port": 26379 }, { "host": "host.docker.internal", "port": 26380 }, { "host": "host.docker.internal", "port": 26381 } ], "name": "mymaster", "sentinelPassword": 12345, "password": 12345}'
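
As a rough sketch of how a value like the `REDIS_OPTIONS` string above could be turned into client options, the snippet below parses the JSON and checks the Sentinel fields. The `sentinels`, `name`, `password`, and `sentinelPassword` keys are the ones used in the example config; the `parseRedisOptions` function and its interfaces are hypothetical and are not taken from the Botpress code base. Because the example passes the passwords as JSON numbers, the sketch converts them to strings, on the assumption that the client expects string credentials.

```typescript
// Hypothetical types mirroring the keys used in the example config.
interface SentinelAddress {
  host: string;
  port: number;
}

interface SentinelOptions {
  sentinels: SentinelAddress[];
  name: string;
  password?: string;
  sentinelPassword?: string;
}

// Parse and validate a REDIS_OPTIONS-style JSON string (hypothetical helper).
function parseRedisOptions(raw: string): SentinelOptions {
  const parsed = JSON.parse(raw);
  if (!Array.isArray(parsed.sentinels) || typeof parsed.name !== "string") {
    throw new Error("REDIS_OPTIONS must contain a 'sentinels' array and a 'name' string");
  }
  // Normalise a credential that may have been written as a JSON number.
  const asString = (value: unknown): string | undefined =>
    value === undefined || value === null ? undefined : String(value);
  return {
    sentinels: parsed.sentinels.map((s: any) => ({ host: String(s.host), port: Number(s.port) })),
    name: parsed.name,
    password: asString(parsed.password),
    sentinelPassword: asString(parsed.sentinelPassword),
  };
}
```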

Expected behavior
Botpress should keep throwing connection errors without restarting, and reconnect to the new Redis master when it becomes available.

Environment (please complete the following information):

  • OS: Windows, using docker containers for Botpress, Database, and Redis (Master, Slaves, and Sentinels)
  • Botpress Version 12.26.6

Additional context
Resources on High Availability with Redis Sentinel: https://youtu.be/85HzpIk7Mq8?t=116

botpress/borpress botpress/v12#1549 by @davidvitora

@laurentlp
Contributor

@samuelmasse Can you please give me the steps required to reproduce the issue so I can test it on my end?

@samuelmasse samuelmasse merged commit f745ae4 into master Nov 29, 2021
@samuelmasse samuelmasse deleted the sm-fix-redis-sentinel branch November 29, 2021 18:55