Redis 3.2.3 crashed by signal: 11 #3796

Open
damaru00 opened this Issue Feb 8, 2017 · 20 comments


@damaru00
damaru00 commented Feb 8, 2017 edited

We have a big Redis node in a cluster, and we're resharding it to other nodes. While doing that, we repeatedly encountered the error reported below, which forced us to manually kill the instance with kill -9 and reload it.

In normal operation the node has been working for months without any problems. The only special thing about it is that there are some very big keys to migrate; for those we usually expect a timeout during migration, but we had never before seen this error, which forces us to restart the node.

resharding log:
Moving slot 7929 from : to :: .................$
Moving slot 7930 from : to :: .................$
Moving slot 7931 from : to :: .................$
Moving slot 7932 from : to :: .................$
[ERR] Calling MIGRATE: Connection timed out

redis log:

=== REDIS BUG REPORT START: Cut & paste starting from here ===
29395:M 31 Jan 22:16:03.248 # Redis 3.2.3 crashed by signal: 11
29395:M 31 Jan 22:16:03.248 # Crashed running the instuction at: 0x468d4d
29395:M 31 Jan 22:16:03.248 # Accessing address: (nil)
29395:M 31 Jan 22:16:03.248 # Failed assertion: <no assertion failed> (<no file>:0)

------ STACK TRACE ------
EIP:
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](migrateCloseSocket+0x5d)[0x468d4d]

Backtrace:
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](logStackTrace+0x29)[0x45e699]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](sigsegvHandler+0xaa)[0x45ebca]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7ffec32ce340]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](migrateCloseSocket+0x5d)[0x468d4d]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](migrateCommand+0x6d6)[0x469526]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](call+0x85)[0x426fb5]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](processCommand+0x367)[0x42a0e7]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](processInputBuffer+0x105)[0x436d05]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](aeProcessEvents+0x218)[0x421428]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](aeMain+0x2b)[0x4216db]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](main+0x410)[0x41e690]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7ffec2f19ec5]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster][0x41e902]

------ INFO OUTPUT ------
# Server
redis_version:3.2.3
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:4992f89db2d932d
redis_mode:cluster
os:Linux 3.13.0-37-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.2
process_id:29395
run_id:ea2b51c12beba3394b49febc23e62ddda85ddf7d
tcp_port:7010
uptime_in_seconds:218901
uptime_in_days:2
hz:10
lru_clock:9505952
executable:/home/cybranding/redis-3.2.3/redis-stable/src/redis-server
config_file:/etc/redis_cluster_client2/redis-3.2.3/7010/redis.conf

# Clients
connected_clients:7
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:252229967880
used_memory_human:234.91G
used_memory_rss:226196897792
used_memory_rss_human:210.66G
used_memory_peak:261549868192
used_memory_peak_human:243.59G
total_system_memory:405738954752
total_system_memory_human:377.87G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:0.90
mem_allocator:jemalloc-4.0.3

# Persistence
loading:0
rdb_changes_since_last_save:2441620072
rdb_bgsave_in_progress:0
rdb_last_save_time:1485682059
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:78198652011
aof_base_size:71141120009
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

# Stats
total_connections_received:50546
total_commands_processed:56092302
instantaneous_ops_per_sec:470
total_net_input_bytes:8820578118
total_net_output_bytes:15465464075
instantaneous_input_kbps:65.53
instantaneous_output_kbps:39.50
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:70160
evicted_keys:0
keyspace_hits:25360054
keyspace_misses:16707970
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:1

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:17440.44
used_cpu_user:20364.37
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Commandstats
cmdstat_get:calls=3013601,usec=69399477,usec_per_call=23.03
cmdstat_setex:calls=93465,usec=29887340,usec_per_call=319.77
cmdstat_incr:calls=6164797,usec=48002289,usec_per_call=7.79
cmdstat_sadd:calls=6404,usec=171138,usec_per_call=26.72
cmdstat_sismember:calls=1920066,usec=57777586,usec_per_call=30.09
cmdstat_zincrby:calls=29150890,usec=1841876323,usec_per_call=63.18
cmdstat_zrevrange:calls=298356,usec=100612284,usec_per_call=337.22
cmdstat_zscore:calls=2641113,usec=185788620,usec_per_call=70.34
cmdstat_expire:calls=10839539,usec=70516036,usec_per_call=6.51
cmdstat_ping:calls=69,usec=114,usec_per_call=1.65
cmdstat_info:calls=50357,usec=8779396,usec_per_call=174.34
cmdstat_cluster:calls=1034173,usec=262678023,usec_per_call=254.00
cmdstat_migrate:calls=879469,usec=6675930023,usec_per_call=7590.86
cmdstat_command:calls=3,usec=10340,usec_per_call=3446.67

# Cluster
cluster_enabled:1

# Keyspace
db0:keys=141374638,expires=30371,avg_ttl=45048480
hash_init_value: 1485278005

------ CLIENT LIST OUTPUT ------
id=50534 addr=127.0.0.1:43431 fd=10 name= age=828 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=zincrby
id=50535 addr=127.0.0.1:43432 fd=11 name= age=828 idle=7 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sismember
id=50541 addr=127.0.0.1:58608 fd=16 name= age=346 idle=49 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=get
id=50108 addr=127.0.0.1:55292 fd=18 name= age=39041 idle=6 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=zrevrange
id=50546 addr=127.0.0.1:50654 fd=19 name= age=108 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=migrate
id=50543 addr=127.0.0.1:60246 fd=13 name= age=264 idle=264 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=info
id=50544 addr=127.0.0.1:54748 fd=17 name= age=165 idle=159 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=cluster

------ CURRENT CLIENT INFO ------
id=50546 addr=127.0.0.1:50654 fd=19 name= age=108 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=migrate
argv[0]: 'DEL'
argv[1]: 'nht.tw.urls.dec:쩐다'

------ REGISTERS ------
29395:M 31 Jan 22:16:03.283 # 
RAX:00007fcc4d3f30e3 RBX:b8e0a2b8e097b8e0
RCX:0000000000000002 RDX:0000000000000001
RDI:00007fcc4d3f30f9 RSI:00000000004e7857
RBP:00007fdf418f6430 RSP:00007fff08e86490
R8 :0000000000000005 R9 :00007ffec2a0d5a0
R10:0000000000000002 R11:00007ffec2c00180
R12:0000000000000001 R13:00007fc0e9df1d40
R14:0000000000000002 R15:0000000000000000
RIP:0000000000468d4d EFL:0000000000010202
CSGSFS:0000000000000033
29395:M 31 Jan 22:16:03.283 # (00007fff08e8649f) -> 0000000000000058
29395:M 31 Jan 22:16:03.283 # (00007fff08e8649e) -> 0000000004af8cbe
29395:M 31 Jan 22:16:03.284 # (00007fff08e8649d) -> 0000000000000000
29395:M 31 Jan 22:16:03.284 # (00007fff08e8649c) -> 00007fc07adec0f0
29395:M 31 Jan 22:16:03.284 # (00007fff08e8649b) -> 0000000000000000
29395:M 31 Jan 22:16:03.284 # (00007fff08e8649a) -> 00007ffec39f5690
29395:M 31 Jan 22:16:03.284 # (00007fff08e86499) -> 0000000100000000
29395:M 31 Jan 22:16:03.284 # (00007fff08e86498) -> 00007fc000000002
29395:M 31 Jan 22:16:03.284 # (00007fff08e86497) -> 00007fc0ee22c610
29395:M 31 Jan 22:16:03.284 # (00007fff08e86496) -> 00007fec6ad43fa0
29395:M 31 Jan 22:16:03.284 # (00007fff08e86495) -> 00007fc00000000a
29395:M 31 Jan 22:16:03.284 # (00007fff08e86494) -> 00007fc000000000
29395:M 31 Jan 22:16:03.284 # (00007fff08e86493) -> 0000000000469526
29395:M 31 Jan 22:16:03.284 # (00007fff08e86492) -> 0000000000000001
29395:M 31 Jan 22:16:03.284 # (00007fff08e86491) -> 00007fc07adec0f0
29395:M 31 Jan 22:16:03.284 # (00007fff08e86490) -> 00007fc0e9df1d40

------ FAST MEMORY TEST ------
29395:M 31 Jan 22:16:03.284 # Bio thread for job type #0 terminated
29395:M 31 Jan 22:16:03.285 # Bio thread for job type #1 terminated
*** Preparing to test memory region 724000 (94208 bytes)
*** Preparing to test memory region 1907000 (135168 bytes)
*** Preparing to test memory region 7fc053400000 (268131368960 bytes)
*** Preparing to test memory region 7ffec13ff000 (8388608 bytes)
*** Preparing to test memory region 7ffec1c00000 (14680064 bytes)
*** Preparing to test memory region 7ffec2a00000 (4194304 bytes)
*** Preparing to test memory region 7ffec32b9000 (20480 bytes)
*** Preparing to test memory region 7ffec34d8000 (16384 bytes)
*** Preparing to test memory region 7ffec39f5000 (16384 bytes)
*** Preparing to test memory region 7ffec3a01000 (4096 bytes)
*** Preparing to test memory region 7ffec3a02000 (8192 bytes)
*** Preparing to test memory region 7ffec3a06000 (4096 bytes)
@antirez
Owner
antirez commented Feb 8, 2017

Hello, are you still able to reproduce the problem in your environment? That would be very useful. Thanks!

@antirez antirez added this to the Urgent milestone Feb 8, 2017
@danmaz74
Contributor
danmaz74 commented Feb 8, 2017

Hi Salvatore

We stopped resharding the data because this happened quite a few times, but we haven't changed anything since then, so I'm pretty sure it would happen again if we tried. Did you publish anything new in the meantime? We could restart the resharding, but since it takes a few hours to reload this oversized instance after a crash, I'd rather not do it just to see if it will crash again...

Let me know if you want any additional details.

@antirez
Owner
antirez commented Feb 8, 2017

@danmaz74 that totally makes sense; let me investigate the issue in the next few hours so I can come back with a more concrete proposal. I absolutely don't want to create issues in your environment if we can avoid it! Thanks.

@antirez
Owner
antirez commented Feb 8, 2017

@danmaz74 could you please upload your redis-server binary (the same one that caused the crash) here, renamed as a .png file? It is useful in order to understand exactly where migrateCloseSocket() crashed. Thank you.

@danmaz74
Contributor
danmaz74 commented Feb 8, 2017

@antirez if I got you right, we should restart the instance using this binary, which will give you more debugging info, and then recreate the situation where this bug happened?

BTW I guess there should have been an attachment in your latest post, but I can't see any. You can send it directly to dan at cybranding.com or contact me on skype @danmaz74 if you want to interact faster.

@antirez
Owner
antirez commented Feb 8, 2017
@danmaz74
Contributor
danmaz74 commented Feb 8, 2017

Got it now - it's attached - remove .zip :)

redis-server.bz2.zip

@antirez
Owner
antirez commented Feb 8, 2017
@antirez
Owner
antirez commented Feb 8, 2017

UPDATE: disassembling the redis-server executable, it is quite clear that the problem is that the port parameter of the function is NULL. It should be easy to understand why, so I'm going to figure out when this happens.

@danmaz74
Contributor
danmaz74 commented Feb 8, 2017

Great, we'll wait for an update about that. Thx

@antirez
Owner
antirez commented Feb 9, 2017

Hello again @danmaz74 and @damaru00. I just tracked down the cause of the bug and I'm preparing a fix. It looks like, regardless of the bug that crashed the server, the migration is not going to work anyway because of a connection error happening while the big key is being migrated, but we can investigate that later as a second step, after you upgrade the instance to a version that no longer crashes :-)

@antirez antirez added a commit that referenced this issue Feb 9, 2017
@antirez Fix MIGRATE closing of cached socket on error.
After investigating issue #3796, it was discovered that MIGRATE
could call migrateCloseSocket() after the original MIGRATE c->argv
was already rewritten as a DEL operation. As a result the host/port
passed to migrateCloseSocket() could be anything, often a NULL pointer
that gets dereferenced, crashing the server.

Now the socket is closed at an earlier time when there is a socket
error in a later stage where no retry will be performed, before we
rewrite the argument vector. Moreover a check was added so that later,
in the socket_err label, there is no further attempt at closing the
socket if the argument was rewritten.

This fix should resolve the bug reported in #3796.
f917e0d
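
For readers following along, here is a toy C sketch of the failure mode and of the guard the commit above describes. This is NOT the actual Redis source: the function names are real but the structures, control flow, and host/port values are invented for illustration; it only shows why calling migrateCloseSocket() with an already rewritten argument vector ends up dereferencing a NULL port.

    /* Toy illustration only -- not the real migrateCommand(). */
    #include <stdio.h>

    static void migrateCloseSocket(const char *host, const char *port) {
        /* The real function builds a "host:port" lookup key for the
         * cached-sockets table, so a NULL port is dereferenced here:
         * that is the "Accessing address: (nil)" SIGSEGV in the report. */
        printf("closing cached socket %s:%s\n", host, port ? port : "(nil)");
    }

    static void migrateCommand(const char *argv[], int late_socket_error) {
        int argv_rewritten = 0;

        /* ... connect to the cached socket, send RESTORE, read replies ... */

        /* Once keys are transferred, MIGRATE rewrites itself as a DEL of
         * the moved keys for replication/AOF: from this point argv[1] and
         * argv[2] no longer hold the target host and port. */
        argv[1] = "somekey";
        argv[2] = NULL;
        argv_rewritten = 1;

        /* A socket error can still be detected after that rewrite,
         * e.g. while handling the remaining keys or the last reply. */
        if (late_socket_error) goto socket_err;
        return;

    socket_err:
        /* Buggy version: migrateCloseSocket(argv[1], argv[2]) was called
         * here unconditionally, passing a key name and a NULL pointer.
         * The fix closes the socket earlier, while argv is still intact,
         * and adds the guard below for the rewritten case. */
        if (!argv_rewritten) migrateCloseSocket(argv[1], argv[2]);
    }

    int main(void) {
        const char *argv[] = { "MIGRATE", "10.0.0.2", "7011", "somekey" };
        migrateCommand(argv, 1 /* pretend a late socket error happened */);
        return 0;
    }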
@antirez antirez added a commit that referenced this issue Feb 9, 2017
@antirez Fix MIGRATE closing of cached socket on error.
After investigating issue #3796, it was discovered that MIGRATE
could call migrateCloseSocket() after the original MIGRATE c->argv
was already rewritten as a DEL operation. As a result the host/port
passed to migrateCloseSocket() could be anything, often a NULL pointer
that gets dereferenced, crashing the server.

Now the socket is closed at an earlier time when there is a socket
error in a later stage where no retry will be performed, before we
rewrite the argument vector. Moreover a check was added so that later,
in the socket_err label, there is no further attempt at closing the
socket if the argument was rewritten.

This fix should resolve the bug reported in #3796.
8d55aeb
@antirez antirez added a commit that referenced this issue Feb 9, 2017
@antirez Fix MIGRATE closing of cached socket on error.
After investigating issue #3796, it was discovered that MIGRATE
could call migrateCloseSocket() after the original MIGRATE c->argv
was already rewritten as a DEL operation. As a result the host/port
passed to migrateCloseSocket() could be anything, often a NULL pointer
that gets dereferenced, crashing the server.

Now the socket is closed at an earlier time when there is a socket
error in a later stage where no retry will be performed, before we
rewrite the argument vector. Moreover a check was added so that later,
in the socket_err label, there is no further attempt at closing the
socket if the argument was rewritten.

This fix should resolve the bug reported in #3796.
33fad43
@antirez
Owner
antirez commented Feb 9, 2017

@danmaz74 & @damaru00 a fix is now available in all the Redis branches. I suggest upgrading the instance to the latest commit of 3.2, or just cherry-picking the commit that fixes the issue (referenced above), recompiling, and retrying. I think there will still be the socket error that triggered the crash: either a write error, a timeout, or something similar. But the box should not crash, and the socket error can be investigated at that point.

@danmaz74
Contributor
danmaz74 commented Feb 9, 2017

Thanks @antirez, we should be able to test this tomorrow.

We were already aware that there would be timeout errors with some slots: in this instance (which is the original one for the project, started 4 years ago) we created some really big keys when we had just started out and didn't realize that would be a mistake down the line. What we were doing was manually patching redis-trib so that it would skip those slots and proceed with the ones containing smaller keys.

I was even thinking about creating a more general solution for this kind of situation and proposing it as a pull request, but I'm not sure if it's common enough. What do you think?

@antirez
Owner
antirez commented Feb 9, 2017

Hello @danmaz74, there is absolutely no issue in having very large keys; Redis Cluster is designed to handle them, even if during reshardings they can block the servers for some time, but that's just a temporary effect: it should NOT fail like this. Today I'll retry with very large keys to see what happens, but the timeout is designed to fire not on the total transfer time of the key, but on single small amounts of data, so this is unexpected. Large keys are not the problem; we just need to understand what is causing this error in your environment. BTW the error is not a "timeout" error AFAIK, since the timeout is handled in a specific way and is outside the code path that caused the bug you experienced.

TLDR: once you install the patch and restart, we can get more clues. Please also make sure to install the latest 3.2.x, since another non-trivial Cluster bug is fixed in that version.
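
(As a side note on the timeout semantics mentioned above, here is a generic sketch, not the Redis implementation, of a per-chunk timeout: the timer applies to each individual socket operation, so a huge key that keeps flowing never trips it, while a stall longer than the timeout does. Function and parameter names are invented for the example.)

    #include <poll.h>
    #include <unistd.h>
    #include <errno.h>
    #include <stddef.h>

    /* Write len bytes, tolerating up to timeout_ms of inactivity per
     * chunk rather than bounding the total transfer time. */
    static int write_with_per_chunk_timeout(int fd, const char *buf,
                                            size_t len, int timeout_ms) {
        while (len > 0) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int ready = poll(&pfd, 1, timeout_ms);   /* timer restarts here */
            if (ready <= 0) return -1;                /* stall or poll error */
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == EINTR) continue;
                return -1;                            /* socket error */
            }
            buf += n;
            len -= (size_t)n;
        }
        return 0;   /* a slow but steady peer completes, however big the key */
    }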

@danmaz74
Contributor
danmaz74 commented Feb 9, 2017

I'm really glad to hear that we should be able to move very large keys too, so I guess that's a different issue in itself. Just to give you some more context, we increased the timeout to 5 minutes and the nodes are on the same local network (datacenter), but we still experienced lots of timeouts - before we started experiencing crashes. We'll see with this new version.

Is it necessary to also upgrade all the other nodes for the test? Or can we just upgrade the source one (the one that crashes) for now?

@antirez
Owner
antirez commented Feb 9, 2017

No need to upgrade everything @danmaz74. I hope I can try to move some very large keys tomorrow to see what happens and why there is a timeout. However I want to stress that, apparently, what you are getting is not a timeout per se, but an unspecified "socket error". We should be able to better understand this socket error once you upgrade, but I would like to reproduce it ASAP.

@danmaz74
Contributor

@antirez an update: we deployed this and are running redis-trib. So far no freezes, only timeouts. I would say let's wait until tomorrow; if there are no freezes we can close the issue.

@antirez
Owner
antirez commented Feb 13, 2017

Looks like good news! Thanks @danmaz74. Now we have to understand why you get the timeouts. Can you show me exactly what you get?

@uqs uqs pushed a commit to freebsd/freebsd-ports that referenced this issue Feb 14, 2017
osa Upgrade from 3.2.7 to 3.2.8.
<ChangeLog>

Upgrade urgency CRITICAL: This release reverts back the Jemalloc upgrade
                          that is believed to potentially cause a server
                          deadlock. A MIGRATE crash is also fixed.

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See antirez/redis#3799.
   We reverted back to the previously used Jemalloc versions and plan
   to upgrade Jemalloc again after having more info about the
   cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   antirez/redis#3796.

</ChangeLog>


git-svn-id: svn+ssh://svn.freebsd.org/ports/head@434063 35697150-7ecd-e111-bb59-0022644237b5
065c1a8
@uqs uqs pushed a commit to freebsd/freebsd-ports that referenced this issue Feb 14, 2017
@osokin osokin Upgrade from 3.2.7 to 3.2.8.
<ChangeLog>

Upgrade urgency CRITICAL: This release reverts back the Jemalloc upgrade
                          that is believed to potentially cause a server
                          deadlock. A MIGRATE crash is also fixed.

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See antirez/redis#3799.
   We reverted back to the previously used Jemalloc versions and plan
   to upgrade Jemalloc again after having more info about the
   cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   antirez/redis#3796.

</ChangeLog>
0ec9c53
@spinlock spinlock added a commit to CodisLabs/codis that referenced this issue Feb 14, 2017
@spinlock spinlock extern: upgrade redis-3.2.7 to redis-3.2.8
    There're 2 bug fixes in redis-3.2.8:
    1. antirez/redis#3799
        Downgrade jemalloc-4.4.0 to jemalloc 4.0.3
    2. antirez/redis#3796
        Fix a crash in command MIGRATE
7c5cebb
@damaru00

@antirez we encountered this error during resharding:

Moving slot 8302 from from "" to "" ................................................................................$
Moving slot 8303 from from "" to "" ................................................................................$
Moving slot 8304 from from "" to "" ................................................................................$
[ERR] Calling MIGRATE: IOERR error or timeout reading to target instance

We fixed it using the command ./redis-trib.rb fix ip:port and restarted the resharding. We also encountered this error:

Moving slot 8302 from from "" to "" ................................................................................$
Moving slot 8303 from from "" to "" ................................................................................$
Moving slot 8304 from from "" to "" ................................................................................$
[ERR] Calling MIGRATE: ERR Target instance replied with error: CLUSTERDOWN The cluster is down

But when we tried to fix it with ./redis-trib.rb fix ip:port, it gave us this error:

Check for open slots...
[WARNING] Node has slots in importing state (8304).
[WARNING] Node has slots in migrating state (8304).
[WARNING] The following slots are open: 8304
Fixing open slot 8304
Set as migrating in:
Set as importing in:
Moving slot 8304 from to
[ERR] Calling MIGRATE: ERR Target instance replied with error: CLUSTERDOWN The cluster is down

We temporarily fixed the situation by running the redis-cli command 'CLUSTER SETSLOT <slot> STABLE' on both nodes (source and target) to clear any importing/migrating state from the hash slot, and then restarted the resharding, manually excluding the slot which was previously creating problems - it probably contains some key which is too big.
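
(For reference, assuming slot 8304 and with the node addresses redacted as above, the per-node workaround looks roughly like this:)

    redis-cli -h <source-ip> -p <source-port> CLUSTER SETSLOT 8304 STABLE
    redis-cli -h <target-ip> -p <target-port> CLUSTER SETSLOT 8304 STABLE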

Redis log on the target node when the resharding error "[ERR] Calling MIGRATE: ERR Target instance replied with error: CLUSTERDOWN The cluster is down" occurred:

29729:M 14 Feb 07:01:06.498 * FAIL message received from 8b6accb0259089f4f5fc3942b34fb6b7fcbde33e about 6f70203705a1f26b561f39a600930f7b22dfeb98
29729:M 14 Feb 07:01:06.498 # Cluster state changed: fail
29729:M 14 Feb 07:01:31.008 * Clear FAIL state for node 6f70203705a1f26b561f39a600930f7b22dfeb98: is reachable again and nobody is serving its s$
29729:M 14 Feb 07:01:31.008 # Cluster state changed: ok
29729:M 14 Feb 07:02:38.159 * Marking node 6f70203705a1f26b561f39a600930f7b22dfeb98 as failing (quorum reached).
29729:M 14 Feb 07:02:38.159 # Cluster state changed: fail
29729:M 14 Feb 07:02:58.205 * Clear FAIL state for node 6f70203705a1f26b561f39a600930f7b22dfeb98: is reachable again and nobody is serving its s$
29729:M 14 Feb 07:02:58.205 # Cluster state changed: ok
29729:M 14 Feb 09:18:18.740 * FAIL message received from 5b887a2fc38eade4b6366b4d1de2926733e082d2 about 6f70203705a1f26b561f39a600930f7b22dfeb98
29729:M 14 Feb 09:18:18.740 # Cluster state changed: fail
29729:M 14 Feb 09:18:41.648 * Clear FAIL state for node 6f70203705a1f26b561f39a600930f7b22dfeb98: is reachable again and nobody is serving its s$
29729:M 14 Feb 09:18:41.648 # Cluster state changed: ok

The resharding error "[ERR] Calling MIGRATE: IOERR error or timeout reading to target instance" didn't produce anything in the Redis log on either the source or the target node when it was triggered.

@danmaz74
Contributor

@antirez so the great news is that we didn't experience any new crash: I would say we can consider this issue fixed!

Regarding the timeout, @damaru00 will soon post a new report, but it looks like there is nothing specific about the timeout in the logs. Please take a look at what we have, and let us know if you have any idea where to look to get more information for you.

By the way, would you like us to open a separate issue for the timeout problem? No problem for us.

@mat813 mat813 pushed a commit to mat813/freebsd-ports that referenced this issue Feb 14, 2017
osa Upgrade from 3.2.7 to 3.2.8.
<ChangeLog>

Upgrade urgency CRITICAL: This release reverts back the Jemalloc upgrade
                          that is believed to potentially cause a server
                          deadlock. A MIGRATE crash is also fixed.

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See antirez/redis#3799.
   We reverted back to the previously used Jemalloc versions and plan
   to upgrade Jemalloc again after having more info about the
   cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   antirez/redis#3796.

</ChangeLog>


git-svn-id: https://svn.freebsd.org/ports/head@434063 35697150-7ecd-e111-bb59-0022644237b5
3b0dcd8
@jsonn jsonn pushed a commit to jsonn/pkgsrc that referenced this issue Feb 14, 2017
fhajny Update databases/redis to 3.2.8.
================================================================================
Redis 3.2.8     Released Sun Feb 12 16:11:18 CET 2017
================================================================================

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See antirez/redis#3799.
   We reverted back to the previously used Jemalloc versions and plan
   to upgrade Jemalloc again after having more info about the
   cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   antirez/redis#3796.

================================================================================
Redis 3.2.7     Released Tue Jan 31 16:21:41 CET 2017
================================================================================

Main bugs fixes and improvements in this release:

1. MIGRATE could incorrectly move keys between Redis Cluster nodes by turning
   keys with an expire set into persisting keys. This bug was introduced with
   the multiple-keys migration recently. It is now fixed. Only applies to
   Redis Cluster users that use the resharding features of Redis Cluster.

2. As Redis 4.0 beta and the unstable branch already did (for some months at
   this point), Redis 3.2.7 also aliases the Host: and POST commands to QUIT
   avoiding to process the remaining pipeline if there are pending commands.
   This is a security protection against a "Cross Scripting" attack, that
   usually involves trying to feed Redis with HTTP in order to execute commands.
   Example: a developer is running a local copy of Redis for development
   purposes. She also runs a web browser in the same computer. The web browser
   could send an HTTP request to http://127.0.0.1:6379 in order to access the
   Redis instance, since a specially crafted HTTP request may also be partially
   valid Redis protocol. However if POST and Host: break the connection, this
   problem should be avoided. IMPORTANT: It is important to realize that it
   is not impossible that another way will be found to talk with a localhost
   Redis using a Cross Protocol attack not involving sending POST or Host: so
   this is only a layer of protection but not a definitive fix for this class
   of issues.

3. A ziplist bug that could cause data corruption, could crash the server and
   MAY ALSO HAVE SECURITY IMPLICATIONS was fixed. The bug looks complex to
   exploit, but attacks always get worse, never better (cit). The bug is very
   very hard to catch in practice, it required manual analysis of the ziplist
   code in order to be found. However it is also possible that rarely it
   happened in the wild. Upgrading is required if you use LINSERT and other
   in-the-middle list manipulation commands.

4. We upgraded to Jemalloc 4.4.0 since the version we used to ship with Redis
   was an early 4.0 release of Jemalloc. This version may have several
   improvements including the ability to better reclaim/use the memory of
   system.
3b52612
@JackieXie168 JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Feb 17, 2017
@antirez Fix MIGRATE closing of cached socket on error.
After investigating issue #3796, it was discovered that MIGRATE
could call migrateCloseSocket() after the original MIGRATE c->argv
was already rewritten as a DEL operation. As a result the host/port
passed to migrateCloseSocket() could be anything, often a NULL pointer
that gets dereferenced, crashing the server.

Now the socket is closed at an earlier time when there is a socket
error in a later stage where no retry will be performed, before we
rewrite the argument vector. Moreover a check was added so that later,
in the socket_err label, there is no further attempt at closing the
socket if the argument was rewritten.

This fix should resolve the bug reported in #3796.
1195253