
Memory spike with concurrent operations #276

Closed

ktsakalozos opened this issue Dec 14, 2020 · 16 comments

@ktsakalozos
Member

When hitting dqlite-demo with multiple concurrent requests I can pretty reliably reproduce a memory spike of several GBs.

To reproduce, first add an extra HTTP verb handler to dqlite-demo:

diff --git a/cmd/dqlite-demo/dqlite-demo.go b/cmd/dqlite-demo/dqlite-demo.go
index 49f8197..deb4a91 100644
--- a/cmd/dqlite-demo/dqlite-demo.go
+++ b/cmd/dqlite-demo/dqlite-demo.go
@@ -77,6 +77,15 @@ Complete documentation is available at https://github.com/canonical/go-dqlite`,
                                        if _, err := db.Exec(update, key, value); err != nil {
                                                result = fmt.Sprintf("Error: %s", err.Error())
                                        }
+                               case "POST":
+                                       result = "yes!"
+                                       for i := 0; i < 10000; i++ {
+                                               value := fmt.Sprintf("This is some data %d", i)
+                                               if _, err := db.Exec(update, key, value); err != nil {
+                                                       result = fmt.Sprintf("Error: %s %d", err.Error(), i)
+                                                       break
+                                               }
+                                       }
                                default:
                                        result = fmt.Sprintf("Error: unsupported method %q", r.Method)

Set up a three-node cluster as described in the go-dqlite README [1].

In two terminals, trigger the new operations:

while true ; do curl -X POST -d "foo=bar" http://localhost:8001/mykey1 ; done

And

while true ; do curl -X POST -d "foo=bar" http://localhost:8001/mykey2 ; done

Let it run for a few minutes; about half the time the memory usage of one of the dqlite-demo processes will spike, e.g. [2].

FYI, @sevangelatos

[1] https://github.com/canonical/go-dqlite#demo
[2] https://pasteboard.co/JET1frM.jpg

@ktsakalozos
Member Author

Trying to emulate slow/fast disks results in the same behavior.

Create a ramdisk [1]:

sudo mkdir /mnt/ramdisk
sudo mount -t tmpfs -o rw,size=8G tmpfs /mnt/ramdisk

Have the majority of nodes on a slow disk:

rm -rf /tmp/dqlite-demo ; 
rm -rf /mnt/ramdisk/tmp/dqlite-demo ;  
dqlite-demo --api 127.0.0.1:8001 --db 127.0.0.1:9001 -D /mnt/ramdisk/tmp/dqlite-demo/ & 
sleep 5 ; dqlite-demo --api 127.0.0.1:8002 --db 127.0.0.1:9002 --join 127.0.0.1:9001 -D /tmp/dqlite-demo & 
sleep 5 ; dqlite-demo --api 127.0.0.1:8003 --db 127.0.0.1:9003 --join 127.0.0.1:9001 -D /tmp/dqlite-demo &

Within a few minutes you will see the memory usage spike.

I would expect that, with the majority of nodes on slow disks, the entire system would function at a slow pace, but not produce this huge memory spike that leads to a node going down.

When a single node is placed on a slow disk, that node also ends up using too much memory. This is not good behavior either; there should be mechanisms to protect slow nodes from being overwhelmed.

[1] https://linuxhint.com/ramdisk_ubuntu_1804/

@ktsakalozos
Member Author

In a setup where the leader and one voter are on a ramdisk and one node runs on a slow file system, the leader seems to overwhelm the slow node, whose memory consumption grows quickly.

Here is the setup I am referring to:

rm -rf /tmp/dqlite-demo ; 
rm -rf /mnt/ramdisk/tmp/dqlite-demo ;  
dqlite-demo --api 127.0.0.1:8001 --db 127.0.0.1:9001 -D /mnt/ramdisk/tmp/dqlite-demo/ & 
sleep 5 ; dqlite-demo --api 127.0.0.1:8002 --db 127.0.0.1:9002 --join 127.0.0.1:9001 -D /mnt/ramdisk/tmp/dqlite-demo/ & 
sleep 5 ; dqlite-demo --api 127.0.0.1:8003 --db 127.0.0.1:9003 --join 127.0.0.1:9001 -D /tmp/dqlite-demo &

And create some load:

while true ; do curl -X POST -d "foo=bar" http://localhost:8001/mykey1 ; done

The slow node will not be able to keep up. Every time the leader needs to send a heartbeat (e.g. [1]) it sends [2] all the entries of the log [3] that the slow node is missing. This quickly piles up. Maybe implementing the TODO at [4] would solve the issue. However, when I tried to force an upper limit of 100 entries on each send attempt [3] I hit another strange error: the slow node would segfault when it was trying to serve a snapshot.

[1] https://github.com/canonical/raft/blob/master/src/progress.c#L132
[2] https://github.com/canonical/raft/blob/master/src/replication.c#L352
[3] https://github.com/canonical/raft/blob/master/src/replication.c#L99
[4] https://github.com/canonical/raft/blob/master/src/replication.c#L98
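
For illustration, here is a minimal sketch of the kind of upper limit described above; it is generic C, not the actual libraft code, and MAX_APPEND_ENTRIES / entries_to_send are made-up names:

#include <stddef.h>

/* Hypothetical cap on how many entries go into a single AppendEntries
 * message; 100 matches the value used in the experiment above. */
#define MAX_APPEND_ENTRIES 100

/* Given the next index to send to a follower and the leader's last log
 * index, decide how many entries to include in the next message. The cap
 * keeps a lagging follower from being handed the whole missing tail of
 * the log in one go. */
static size_t entries_to_send(unsigned long long next_index,
                              unsigned long long last_index)
{
    if (next_index > last_index) {
        return 0; /* Nothing to replicate: send a pure heartbeat. */
    }
    size_t n = (size_t)(last_index - next_index + 1);
    return n > MAX_APPEND_ENTRIES ? MAX_APPEND_ENTRIES : n;
}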

@ktsakalozos
Member Author

Following up on the previous comment #276 (comment), I took the naive approach of placing a maximum limit (100) on the number of entries the leader sends to the followers [1, 2]. This stopped the high memory consumption (in the order of GBs) that was causing the machine to swap, but it breaks the cluster in two ways.

In normal operation the leader sends the entries to the followers. In our case the lagging/failing node cannot keep up, so eventually the leader needs to send a snapshot, because the entry that needs to be sent is no longer in the log [3]. On the lagging node, replicationInstallSnapshot [4] is called, causing a snapshot_put call [5, 6], and we expect a UvBarrier [7] followed by a chain of calls to uvSnapshotPutBarrierCb [8], uvSnapshotPutStart, uvSnapshotPutAfterWorkCb, uvSnapshotPutFinish and installSnapshotCb [9]. I believe installSnapshotCb is what frees the snapshot put request [10].

  • The first behavior I saw is that in uvSnapshotPutBarrierCb the put request is NULL [11], causing a segfault. This happens because uvSnapshotPutBarrierCb is called again after installSnapshotCb, via uvFinalizeAfterWorkCb, and this second call finds the snapshot put request already freed (a generic sketch of this pattern follows the list).

  • The second behavior is caused by the follower deciding it wants to become a candidate before the snapshot finishes. While the leader waits for a follower to apply a snapshot it stops sending heartbeats (see the TODO in [12]). Because of this, the follower decides to convertToCandidate, and this causes installSnapshotCb to break [10] with a segfault.
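
To make the first crash concrete, here is a generic C sketch of the use-after-free shape described above (the structs and callbacks are invented for illustration, they are not the raft/libuv types): clearing the back-pointer when the request is released turns the spurious second callback into a no-op.

#include <stdio.h>
#include <stdlib.h>

/* Invented types: a barrier holds a pointer to a pending "snapshot put"
 * request, and the barrier callback may fire a second time after the
 * request has already completed and been freed. */
struct put_req
{
    int id;
};

struct barrier
{
    struct put_req *put; /* NULL once the request has been freed */
};

static void put_req_finish(struct barrier *b)
{
    free(b->put);
    b->put = NULL; /* guard: drop the dangling pointer */
}

static void barrier_cb(struct barrier *b)
{
    if (b->put == NULL) {
        /* Spurious second invocation: without this check we would
         * dereference freed memory, as in the segfault above. */
        return;
    }
    printf("completing put request %d\n", b->put->id);
    put_req_finish(b);
}

int main(void)
{
    struct barrier b;
    b.put = malloc(sizeof *b.put);
    b.put->id = 1;
    barrier_cb(&b); /* first call: does the work and frees the request */
    barrier_cb(&b); /* second call, as in the bug: now a harmless no-op */
    return 0;
}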

I do not have a good enough grasp of raft and the implementation, so it is possible I misunderstand how a proper patch should be formed. Based on the couple of TODOs I encountered along this execution path, I think this behavior is something we have considered, but maybe we consider it a rare case?

Also, the naive approach of limiting the number of entries we send from the leader to the followers might not be the right thing to do. Maybe we need to quickly remove the lagging node instead of continuing to try to revive it.

Any feedback would be much appreciated. Thank you.

[1] logAcquireMax https://github.com/ktsakalozos/raft/blob/chaos/src/log.c#L751
[2] sendAppendEntries https://github.com/ktsakalozos/raft/blob/chaos/src/replication.c#L102
[3] https://github.com/ktsakalozos/raft/blob/chaos/src/replication.c#L354
[4] replicationInstallSnapshot https://github.com/ktsakalozos/raft/blob/chaos/src/replication.c#L1242
[5] snapshot_put call https://github.com/ktsakalozos/raft/blob/chaos/src/replication.c#L1313
[6] UvSnapshotPut https://github.com/ktsakalozos/raft/blob/chaos/src/uv_snapshot.c#L532
[7] UvBarrier https://github.com/ktsakalozos/raft/blob/chaos/src/uv_snapshot.c#L592
[8] uvSnapshotPutBarrierCb https://github.com/ktsakalozos/raft/blob/chaos/src/uv_snapshot.c#L506
[9] installSnapshotCb https://github.com/ktsakalozos/raft/blob/chaos/src/replication.c#L1167
[10] installSnapshotCb free request https://github.com/ktsakalozos/raft/blob/chaos/src/replication.c#L1238
[11] uvSnapshotPutBarrierCb put request https://github.com/ktsakalozos/raft/blob/chaos/src/uv_snapshot.c#L512
[12] https://github.com/ktsakalozos/raft/blob/chaos/src/progress.c#L130

@freeekanayaka
Contributor

This is precious information, keep up the good work!

@freeekanayaka
Contributor

I think for whoever takes this up next, this might be a great starting point.

@stgraber
Contributor

Thanks to the great work from @ktsakalozos, I wrote a script which seems to reproduce the issue reasonably easily:

#!/bin/sh -eux
DIR=$(pwd)

# Install dependencies
apt-get install build-essential autoconf make libtool pkg-config libsqlite3-dev libuv1-dev --yes
snap install go --classic
umount -l data/db1 || true
umount -l data/db2 || true
rm -rf raft dqlite go db* data
pkill -9 dqlite-demo || true

# Get and build the code
git clone --depth=1 https://github.com/canonical/raft
cd raft
autoreconf -i
./configure
make
cd ..

git clone --depth=1 https://github.com/canonical/dqlite
cd dqlite
autoreconf -i && \
PKG_CONFIG_PATH="${DIR}/raft/" ./configure
make CFLAGS="-I${DIR}/raft/include/" LDFLAGS="-L${DIR}/raft/.libs/"
cd ..

# Environment
export CGO_CFLAGS="-I/root/raft/include/ -I/root/dqlite/include/"
export CGO_LDFLAGS="-L/root/raft/.libs -L/root/dqlite/.libs/"
export CGO_LDFLAGS_ALLOW="-Wl,-wrap,pthread_create"

# Build go-dqlite
go get -tags libsqlite3 github.com/canonical/go-dqlite/cmd/dqlite
go get -tags libsqlite3 github.com/canonical/go-dqlite/cmd/dqlite-demo

# Update dqlite-demo
(
cat << EOF
H4sIAAAAAAACA41RwW7bMAw9x1/BGQiQwLbsbGmyugvQbehxaLHsnskWlQiVRVeWixZF/72yq2LZ
Dkt1EEnh8T3qUSgpIcv2ygHP60bk4k4rh5nAho5ztieoTgAiZQQ+wPJcfl6crxn7VFRnUn6ERVGs
lssoy7KTGlGSJKd1Li8hW6/TFST+XqzA19+paTU6BEF136Bx3CkyoDrg91xpXmkE7uDgXNuVee7/
e+grVlOT19yQUTXX+Z6yV6HfaQST4SgJuxTQWig3ICp29YD1rG8Fd5jCLT6mcM91j/OLEfNhA0Zp
eArNE4tdrx1sQDaObVurjJOz+MpasiVMu3hkZmM9m89D13OUDKHmHUJ8c739FZevL3/oYkEG4/A6
DpDCbphQUe+UZj+Ri69azyz7RuJxHoCSLKgBVcCFT74MaykKn3rHnwJmYBsgfw087UATtZupiMN/
U1BvrO+z6F+DQu8ph2CUPDLpWHhSWeS3b1WwzcfRRoGSe+YymPpfnd50fduSdSigQXcgAdM7r2vZ
j7Hym4leAAaM3qEnAwAA
EOF
) | base64 -d | gzip -d | patch ~/go/src/github.com/canonical/go-dqlite/cmd/dqlite-demo/dqlite-demo.go
go get -tags libsqlite3 github.com/canonical/go-dqlite/cmd/dqlite-demo
set +x

echo ""
echo "Please set the following in your environment (possibly ~/.bashrc)"
echo "export GODEBUG=madvdontneed=1"
echo "export LD_LIBRARY_PATH=\"${DIR}/raft/.libs/:${DIR}/dqlite/.libs/\""
echo ""
echo ""

# Start the test
export LD_LIBRARY_PATH="/root/raft/.libs/:/root/dqlite/.libs/"
export GODEBUG=madvdontneed=1
echo "=> Spawning database"
mkdir data

mkdir data/db1
mount -t tmpfs tmpfs data/db1
go/bin/dqlite-demo --dir data/db1 --api 127.0.0.1:8001 --db 127.0.0.1:9001 > /dev/null 2>&1 &
echo $! > db1.pid

mkdir data/db2
mount -t tmpfs tmpfs data/db2
go/bin/dqlite-demo --dir data/db2 --api 127.0.0.1:8002 --db 127.0.0.1:9002 --join 127.0.0.1:9001 > /dev/null 2>&1 &
echo $! > db2.pid

go/bin/dqlite-demo --dir data/db3 --api 127.0.0.1:8003 --db 127.0.0.1:9003 --join 127.0.0.1:9001 > /dev/null 2>&1 &
echo $! > db3.pid
sleep 10s

echo "=> Spawning test loops"
test_loop() {
	for i in $(seq 100); do
		OUT=$(curl -X POST -d "foo=${i}" "http://localhost:8001/${1}" 2>/dev/null) || sleep 1m
		[ "${OUT}" = "done" ] || sleep 1m
	done
	echo "Loop ${1} exited"
}
test_loop mykey1 &
test_loop mykey2 &
sleep 10s

echo "=> Monitor the result"
while true; do
	RSS1=$(grep RSS /proc/$(cat db1.pid)/status | awk '{print $2}')
	RSS2=$(grep RSS /proc/$(cat db2.pid)/status | awk '{print $2}')
	RSS3=$(grep RSS /proc/$(cat db3.pid)/status | awk '{print $2}')
	echo "==> $(date +%Y%m%d-%H%M): db1=${RSS1} db2=${RSS2} db3=${RSS3}"
	go/bin/dqlite -s 127.0.0.1:9001 demo "SELECT * FROM model;"
	sleep 1m
done

This builds the stack almost entirely from source, runs it with the aggressive memory release mechanism (GODEBUG=madvdontneed=1) and runs two of the daemons on tmpfs while the third runs on whatever the underlying storage is.

It then performs up to 200 attempts at writing 10000 records to the database, tracks and prints the RSS of each daemon every minute, and shows progress through the DB records present.

Testing this on reasonably beefy arm64 hardware (as we were told arm64 was seeing this issue more easily), we can pretty quickly notice the non-tmpfs daemon using more memory than the tmpfs ones; this usually starts at around 1.5x and goes to 2, 3, 5 times their usage over the next few minutes. In some cases we get much more dramatic memory usage leading to OOM of that daemon (in that scenario, we see a doubling every minute or so).

It seems the theory of a transaction backlog preventing memory release by the slow process is quite likely, and it also explains why we couldn't find any sign of a traditional memory leak. A slow voter can be caused by anything from differing I/O speeds, as is the case here, to network congestion or CPU/scheduling difficulties.

LXD performs very few writes to the database, so we probably never hit a situation where a transaction backlog could build up and cause this. And when we performed specific stress tests to reproduce a memory leak, we were doing it on extremely fast systems to speed up reproduction.

Some options I can think of:

  • Some way for a follower that can't keep up to demote itself, transferring the role to one that hopefully can handle the load?
  • Maybe have a more aggressive mechanism for an out-of-date follower to resync? I don't know how our logic works currently, but maybe the right thing to do is to flush whatever we have from memory, pull the latest snapshot and then request the entries from that point? That would at least allow for recovery once the load goes away?
  • A way to throttle the number of log entries getting committed, effectively making it possible to configure a maximum transaction rate on the cluster that matches what a slow writer can sustain? (A generic sketch of such a throttle follows this list.)
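
For the third option, the throttle could be as simple as a token bucket in front of the commit path. The sketch below is generic C under that assumption; struct throttle and throttle_allow are made-up names, not dqlite or raft API.

#include <stdbool.h>
#include <time.h>

/* Generic token bucket: at most `rate` transactions per second on average,
 * with bursts up to `burst`. Initialize `last` to the current time and
 * `tokens` to `burst` before first use. This only illustrates how a
 * configurable maximum transaction rate could be enforced before
 * submitting a commit. */
struct throttle
{
    double rate;   /* tokens added per second */
    double burst;  /* maximum number of stored tokens */
    double tokens; /* currently available tokens */
    struct timespec last;
};

static double elapsed_seconds(const struct timespec *from,
                              const struct timespec *to)
{
    return (double)(to->tv_sec - from->tv_sec) +
           (double)(to->tv_nsec - from->tv_nsec) / 1e9;
}

/* Returns true if one more transaction may be submitted now; false means
 * the caller should back off and retry later (back pressure). */
static bool throttle_allow(struct throttle *t)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    t->tokens += t->rate * elapsed_seconds(&t->last, &now);
    if (t->tokens > t->burst) {
        t->tokens = t->burst;
    }
    t->last = now;
    if (t->tokens < 1.0) {
        return false;
    }
    t->tokens -= 1.0;
    return true;
}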

I'm also wondering how this is handled (if it is) in other Raft implementations, as in, can we cause OVN or other such codebases to fail in a similar way?

@freeekanayaka
Contributor

freeekanayaka commented Jan 16, 2021

Thanks to the great work from @ktsakalozos, I wrote a script which seems to reproduce the issue reasonably easily:

Nice job!

I'm also wondering how this is handled (if it is) in other Raft implementations, as in, can we cause OVN or other such codebases to fail in a similar way?

This is usually handled with some kind of "flow control" and "back pressure" mechanisms (similar to TCP). That is implemented in our libraft code as well, at least in some rudimentary form, but at this point I suspect it might be buggy or might need improvements. The gist of that logic is in the progress.c file and the associated call sites in replication.c. It pretty much mimics etcd's raft implementation, but maybe it needs refinement.
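
To sketch the general idea (this is generic C in the spirit of etcd-style probe/pipeline flow control, not the actual progress.c structures; follower_progress, may_send and the cap are invented): the leader keeps per-follower state and only streams more entries while the unacknowledged amount stays under a bound, falling back to one-at-a-time probing when the follower stops acknowledging.

#include <stdbool.h>

/* Invented per-follower flow-control state, loosely modelled on the
 * probe/pipeline distinction used by etcd's raft. */
enum follower_state
{
    FOLLOWER_PROBE,    /* progress unknown: send one entry at a time */
    FOLLOWER_PIPELINE  /* follower keeping up: stream, bounded in flight */
};

struct follower_progress
{
    enum follower_state state;
    unsigned long long match_index; /* highest index the follower persisted */
    unsigned long long next_index;  /* next index to send */
    unsigned in_flight;             /* entries sent but not yet acknowledged */
    unsigned max_in_flight;         /* back-pressure cap */
};

/* May we send more entries to this follower right now? */
static bool may_send(const struct follower_progress *p)
{
    if (p->state == FOLLOWER_PROBE) {
        return p->in_flight == 0; /* one outstanding probe at a time */
    }
    return p->in_flight < p->max_in_flight;
}

/* On a successful AppendEntries result: advance indexes, shrink the
 * in-flight window and switch to pipelining. */
static void on_append_result(struct follower_progress *p,
                             unsigned long long acked_index,
                             unsigned acked_entries)
{
    if (acked_index > p->match_index) {
        p->match_index = acked_index;
        p->next_index = acked_index + 1;
    }
    p->in_flight = p->in_flight >= acked_entries ? p->in_flight - acked_entries : 0;
    p->state = FOLLOWER_PIPELINE;
}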

@stgraber
Contributor

=> Spawning database
=> Spawning test loops
=> Monitor the result
==> 20210116-2005: db1=84816 db2=82812 db3=143460
mykey2|foo=1 loop=1
mykey1|foo=1 loop=8261
==> 20210116-2006: db1=99668 db2=94132 db3=144940
mykey2|foo=1 loop=1
mykey1|foo=6 loop=68
==> 20210116-2007: db1=102556 db2=97340 db3=193888
mykey2|foo=1 loop=1
mykey1|foo=10 loop=4690
==> 20210116-2008: db1=103404 db2=97420 db3=224368
mykey2|foo=1 loop=1
mykey1|foo=14 loop=6936
==> 20210116-2009: db1=102852 db2=97720 db3=243956
mykey1|foo=18 loop=4000
mykey2|foo=5 loop=8961
==> 20210116-2010: db1=103476 db2=103164 db3=253300
mykey1|foo=18 loop=4000
mykey2|foo=10 loop=4141
==> 20210116-2011: db1=103292 db2=104700 db3=253548
mykey1|foo=18 loop=4000
mykey2|foo=14 loop=3849
==> 20210116-2012: db1=103088 db2=105540 db3=280212
mykey1|foo=18 loop=4000
mykey2|foo=18 loop=7742
==> 20210116-2013: db1=103180 db2=105360 db3=280608
mykey1|foo=18 loop=4000
mykey2|foo=23 loop=3495
==> 20210116-2014: db1=103008 db2=105724 db3=266572
mykey1|foo=18 loop=4000
mykey2|foo=27 loop=1574
==> 20210116-2015: db1=103440 db2=105588 db3=287264
mykey2|foo=29 loop=8914
mykey1|foo=24 loop=6381
==> 20210116-2016: db1=103820 db2=107648 db3=299184
mykey2|foo=29 loop=8914
mykey1|foo=27 loop=9001
==> 20210116-2017: db1=103384 db2=103028 db3=893816
mykey2|foo=29 loop=8914
mykey1|foo=31 loop=2726
==> 20210116-2018: db1=103604 db2=103356 db3=948840
mykey2|foo=29 loop=8914
mykey1|foo=35 loop=539
==> 20210116-2019: db1=103428 db2=103200 db3=1040316
mykey2|foo=29 loop=8914
mykey1|foo=39 loop=394
==> 20210116-2020: db1=103316 db2=103300 db3=1040780
mykey2|foo=29 loop=8914
mykey1|foo=43 loop=1273
==> 20210116-2021: db1=103800 db2=103164 db3=1052052
mykey1|foo=46 loop=4029
mykey2|foo=35 loop=7099
==> 20210116-2022: db1=104216 db2=103500 db3=1502032
mykey1|foo=46 loop=4029
mykey2|foo=39 loop=824
==> 20210116-2023: db1=104556 db2=103868 db3=1502288
mykey1|foo=46 loop=4029
mykey2|foo=43 loop=272
==> 20210116-2024: db1=104460 db2=103984 db3=1502288
mykey1|foo=46 loop=4029
mykey2|foo=46 loop=8016
==> 20210116-2025: db1=104232 db2=104092 db3=1502288
mykey1|foo=46 loop=4029
mykey2|foo=50 loop=7948
==> 20210116-2026: db1=104432 db2=103928 db3=1502536
mykey2|foo=54 loop=2041
mykey1|foo=51 loop=6801
==> 20210116-2027: db1=104464 db2=103764 db3=1502536
mykey2|foo=54 loop=2041
mykey1|foo=55 loop=8063
==> 20210116-2028: db1=104592 db2=103852 db3=1502788
mykey1|foo=59 loop=1825
mykey2|foo=56 loop=7116
==> 20210116-2029: db1=104480 db2=104560 db3=1502788
mykey1|foo=59 loop=1825
mykey2|foo=60 loop=2190
==> 20210116-2030: db1=104608 db2=107912 db3=1659980
mykey1|foo=59 loop=1825
mykey2|foo=63 loop=6609
==> 20210116-2031: db1=104388 db2=107752 db3=1659980
mykey1|foo=59 loop=1825
mykey2|foo=67 loop=7487
==> 20210116-2032: db1=104868 db2=108180 db3=1660244
mykey1|foo=59 loop=1825
mykey2|foo=71 loop=7343
==> 20210116-2033: db1=104192 db2=108364 db3=1660492
mykey1|foo=59 loop=1825
mykey2|foo=75 loop=7992
==> 20210116-2034: db1=104464 db2=106768 db3=1661084
mykey1|foo=59 loop=1825
mykey2|foo=80 loop=122
==> 20210116-2035: db1=104712 db2=106608 db3=1661084
mykey2|foo=83 loop=3509
mykey1|foo=66 loop=6794
==> 20210116-2036: db1=105244 db2=106864 db3=2746356
mykey2|foo=83 loop=3509
mykey1|foo=69 loop=8433
==> 20210116-2037: db1=1675772 db2=50560 db3=5906416
mykey2|foo=85 loop=0
mykey1|foo=71 loop=3476
grep: /proc/131700/status: No such file or directory
==> 20210116-2038: db1=1963344 db2=94956 db3=
mykey2|foo=85 loop=0
mykey1|foo=72 loop=6673

That's an example of it going high very fast and hitting OOM.

@ktsakalozos
Member Author

Apart from the leak on db3 resulting in OOM, the run shown above by @stgraber displays two more behaviors that need investigation.

  1. The memory consumption of db1 and db2 starts at around 84MB and reaches 105MB within 30 minutes. This slow leaking of memory is present in healthy clusters as well: when, for example, all three databases are on tmpfs, some of them still show an increasing memory footprint.

  2. At 20210116-2037 the cluster fell apart, and it is not obvious to me why. Looking at db2=50560, it is as if db2 had to re-read the database into memory, while db1=1675772 shows db1 going into an unhealthy state. I am not sure what happened there, but I would not expect one lagging, misbehaving node to take down the cluster.

@stgraber
Contributor

It may be interesting to run while also logging .leader output, to see if there is a change in leadership right after the OOM too; maybe that would explain why db1/db2 behave weirdly.

@stgraber
Contributor

For the 84MB to 105MB increase, in all test runs I see the daemon stabilizing once it reaches 110MB or so.
As it's not really climbing past that, and the growth is fast at the beginning before levelling off, this could simply be an artifact of memory fragmentation. I wouldn't worry too much about this for now, given that it happens under very unusual parallel load (two loops pushing 10k transactions every time we hit them).

The db3 behavior, and the behavior of the other databases when db3 falls over, is what we need to focus on for now.
Once we've found a solution for that, we should re-run the same test over an extended timespan to see whether we are indeed seeing a constant leak or whether things plateau as they appear to currently.

@MathieuBordere
Contributor

MathieuBordere commented Jan 29, 2021

I was triggered by @ktsakalozos' reply, i.e. the fact that heartbeat messages contain all the missing log entries, because I would expect heartbeat messages to contain no log entries.

...

The slow node will not be able to keep up. Every time the leader needs to send a heartbeat (e.g. [1]) it sends [2] all the entries of the log [3] that the slow node is missing. This quickly piles up. Maybe implementing the TODO at [4] would solve the issue. However, when I tried to force an upper limit of 100 entries on each send attempt [3] I hit another strange error: the slow node would segfault when it was trying to serve a snapshot.

[1] https://github.com/canonical/raft/blob/master/src/progress.c#L132
[2] https://github.com/canonical/raft/blob/master/src/replication.c#L352

I naively made some changes so that heartbeat messages don't carry any log entries, and the memory spike issue went away (though some unit tests started failing); see the changes in MathieuBordere/raft@a3ef291 (don't mind the TODOs and the structure of the code, I was only testing an idea).
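
The gist of that change, as a generic sketch (this is not the code in the commit above; struct append_entries and make_heartbeat are invented): a heartbeat is just an AppendEntries message with zero entries, so it refreshes the follower's election timer and advertises the commit index without re-shipping the missing tail of the log.

#include <stddef.h>

/* Invented shape of an AppendEntries message, for illustration only. */
struct append_entries
{
    unsigned long long term;
    unsigned long long prev_log_index;
    unsigned long long prev_log_term;
    unsigned long long commit_index;
    const void *entries; /* batch of entries, NULL for a heartbeat */
    size_t n_entries;    /* 0 for a heartbeat */
};

/* A heartbeat only proves leadership and advertises the commit index;
 * the follower's missing entries travel on the normal replication path. */
static struct append_entries make_heartbeat(unsigned long long term,
                                            unsigned long long prev_index,
                                            unsigned long long prev_term,
                                            unsigned long long commit_index)
{
    struct append_entries msg = {
        .term = term,
        .prev_log_index = prev_index,
        .prev_log_term = prev_term,
        .commit_index = commit_index,
        .entries = NULL,
        .n_entries = 0,
    };
    return msg;
}

With that change in place, a run of the test script looks like this: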

=> Spawning database
=> Spawning test loops
=> Monitor the result
==> 20210129-0932: db1=92256 db2=90516 db3=94052 leader=127.0.0.1:9001
mykey2|foo=1 loop=1
mykey1|foo=2 loop=2013
==> 20210129-0933: db1=103976 db2=95372 db3=101828 leader=127.0.0.1:9001
mykey2|foo=1 loop=1
mykey1|foo=7 loop=1309
==> 20210129-0934: db1=104548 db2=95296 db3=102132 leader=127.0.0.1:9001
mykey2|foo=1 loop=1
mykey1|foo=11 loop=2187
...
==> 20210129-1012: db1=106600 db2=110440 db3=104252 leader=127.0.0.1:9001
mykey2|foo=100 loop=9999
mykey1|foo=94 loop=8024
==> 20210129-1013: db1=107244 db2=110280 db3=104252 leader=127.0.0.1:9001
mykey2|foo=100 loop=9999
mykey1|foo=98 loop=8903
Loop mykey1 exited
==> 20210129-1014: db1=106392 db2=110208 db3=104252 leader=127.0.0.1:9001
mykey2|foo=100 loop=9999
mykey1|foo=100 loop=9999

But there were also runs where db3 memory usage immediately went to 1.2GB but remained stable until the end of the test.

Trying to fix one of the failing unit tests, replication/sendFollowupHeartbeat, by changing https://github.com/MathieuBordere/raft/blob/a3ef291a51d33101bd7e9c9c698b8b41664a20fa/src/replication.c#L784 to

replicationProgress(r, i, false, false);

reintroduces an increasing memory issue (don't know if it's the same one).

Maybe this information can give someone inspiration about the root cause of this issue.

@MathieuBordere
Contributor

Found a memory leak, fixed it in https://github.com/MathieuBordere/raft/tree/mbordere/memory_leak_during_snapshot_install.
First couple of runs of the test script look promising.

@MathieuBordere
Contributor

MathieuBordere commented Feb 4, 2021

The memory spike seems to occur when disk IO has been slow for a while for whatever reason.
If the log is not persisted to disk, the follower cannot inform the leader that it can recover those entries upon failure, so the leader keeps on sending more and more entries in every AppendEntries RPC until the follower can confirm it has persisted the older entries.

This can result in a bunch of large allocations in uvServerAllocCb.

See the log excerpt below, reflecting allocations in uvServerAllocCb, where
he alloc = memory allocation for the header in bytes
pl alloc = memory allocation for the payload in bytes
ae = AppendEntries
ae recv = some return value (not important here)

It's not uncommon to see 200+ allocations of 20MB in a row. The problem with these allocations is that they contain almost all the same entries. Most of the time the newer AppendEntries RPC contains only 1 new entry, but we allocate a batch of 20MB for it (the batch includes the entries received in the previous RPC too). That new entry, the one not present in the previous RPC, is added to the log, where it retains a reference to the 20MB batch. The batch is only freed once the entry is removed from the log, so we quickly end up with a lot of log entries each keeping a large allocation alive; a couple of hundred such 20MB batches is already roughly 4GB, which matches the multi-GB spikes observed, and the system runs out of memory.

he alloc 37552
pl alloc 19300384
ae with n_entries 2344 index 631810 term 1
ae recv 0
he alloc 37568
pl alloc 19308632
ae with n_entries 2345 index 631810 term 1
ae recv 0
he alloc 37584
pl alloc 19316880
ae with n_entries 2346 index 631810 term 1
ae recv 0
he alloc 37600
pl alloc 19325128
ae with n_entries 2347 index 631810 term 1
ae recv 0
he alloc 37616
pl alloc 19333376
ae with n_entries 2348 index 631810 term 1
ae recv 0
he alloc 37632
pl alloc 19341624
ae with n_entries 2349 index 631810 term 1
...
repeats a couple of hundred times.

One approach that I tried (in draft) is to copy the new entries in the log to a new (small) memory chunk and free the batch immediately; this seems to solve the memory spike without breaking other functionality.
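
A simplified sketch of that idea in generic C (not the draft patch itself; struct entry and entry_detach_from_batch are invented names): copy the entry's payload into a right-sized allocation so the big receive batch can be freed as soon as the RPC has been processed, instead of living as long as the longest-lived entry that points into it.

#include <stdlib.h>
#include <string.h>

/* Generic sketch: an entry received over the wire points into a large
 * shared receive batch. Copying the payload out lets the batch be freed
 * immediately. */
struct entry
{
    void *buf;   /* entry payload */
    size_t len;
    void *batch; /* shared batch the payload points into, or NULL if owned */
};

/* Detach the entry from the shared batch by copying its payload. Returns 0
 * on success, -1 on allocation failure (the entry is left untouched). */
static int entry_detach_from_batch(struct entry *e)
{
    if (e->batch == NULL) {
        return 0; /* already owns its memory */
    }
    void *copy = malloc(e->len);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, e->buf, e->len);
    e->buf = copy;
    e->batch = NULL; /* the caller may now free the batch */
    return 0;
}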

Another approach would probably be to decrease the number of entries sent in an RPC. However, this could still lead to spikes if we receive a couple of hundred in a row; on the other hand, not placing a limit on the number of sent entries can also lead to large allocations. I guess the practical limit on the number of entries that can be sent in one go is lower than the raft snapshot threshold + snapshot trailing amount (need to double-check this).

I quite like the approach of freeing the batch immediately (effectively getting rid of the batch logic) as it seems to solve the issue, but I'm open to suggestions.

@freeekanayaka
Contributor

Nice breakdown. Just to be sure I understand correctly:

  • Leader L sends 1 entry at index N to follower F
  • F is slow and does not reply to L in a timely manner
  • L doesn't hear back from F, so it sends F the entry at index N again plus a bunch of new entries, let's say 20 more
  • F is still slow at writing to disk, so it still does not acknowledge the entry at index N
  • L still doesn't hear back from F, so it sends F the entry at index N again plus further new entries, let's say it's now 40 more
  • F is still receiving fine, it's just slow at writing to disk, so it has used both batches that L sent to grow its log. However, since a batch is not released until all of its entries are unreferenced, the 2nd batch contains 21 entries that duplicate the 21 entries of the first batch, and those duplicates consume memory.
  • The situation repeats.

This sounds like a very pathological case that could only happen with really slow I/O (which stays slow for a long time), but I guess that's what is happening for one reason or another.

I think that limiting the number of entries that the leader sends in those cases would make sense. You can of course free the batch as well, but I presume that would require performing one allocation per entry and then copying all the data. The batch mechanism was designed to avoid that copy, but I never really measured its performance impact, so maybe it's minor.

@MathieuBordere
Contributor

Yes, you're correct. I will go for the non-batch approach, which seems the least risky for now, and do some measurements.
