Memory spike with concurrent operations #276
Trying to emulate slow/fast disks results in the same behavior. Create a ramdisk [1]:
Have the majority of nodes on a slow disk:
Within a few minutes you will see the memory usage spike. I would expect that, since the majority of nodes are on slow disks, the entire system would function at a slow pace, but not produce this huge memory spike that leads to a node going down. When placing a single node on a slow disk, that node also ends up using too much memory, which is not good behavior either. There should be mechanisms to protect slow nodes from being overwhelmed.
In a setup where the leader and one voter are on a ramdisk and a third node is running on a slow file system, the leader seems to overwhelm the slow node. The slow node's memory consumption grows quickly. Here is the setup I am referring to:
And create some load:
The slow node will not be able to keep up. Every time the leader needs to send a heartbeat (e.g. [1]) it sends [2] all the entries of the log [3] that the slow node is missing. This quickly piles up. Maybe implementing the TODO at [4] would solve the issue. However, when I tried to force an upper limit of 100 entries on each send attempt [3] I hit another strange error: the slow node would segfault when it was trying to serve a snapshot. [1] https://github.com/canonical/raft/blob/master/src/progress.c#L132
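For illustration, here is a minimal sketch of the idea behind that TODO, i.e. capping how many entries a single AppendEntries send may carry. The names (`entry_view`, `select_entries`, `MAX_APPEND_ENTRIES`) are hypothetical; this is not the libraft code, just the shape of the technique:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified view of a log entry: only its index matters here. */
struct entry_view {
    unsigned long long index;
};

#define MAX_APPEND_ENTRIES 100 /* upper bound per AppendEntries message */

/* Select at most MAX_APPEND_ENTRIES entries starting at next_index.
 * log[0..n_log) is the in-memory log; returns how many entries to send
 * and sets *start to the first one. */
static size_t select_entries(const struct entry_view *log, size_t n_log,
                             unsigned long long first_log_index,
                             unsigned long long next_index,
                             const struct entry_view **start)
{
    if (next_index < first_log_index ||
        next_index >= first_log_index + n_log) {
        *start = NULL; /* follower needs a snapshot, or is already up to date */
        return 0;
    }
    size_t offset = (size_t)(next_index - first_log_index);
    size_t available = n_log - offset;
    *start = &log[offset];
    return available > MAX_APPEND_ENTRIES ? MAX_APPEND_ENTRIES : available;
}

int main(void)
{
    struct entry_view log[500];
    for (size_t i = 0; i < 500; i++) {
        log[i].index = 1 + i;
    }
    const struct entry_view *start;
    /* A follower that is far behind: next_index = 10, the log holds 1..500. */
    size_t n = select_entries(log, 500, 1, 10, &start);
    printf("sending %zu entries starting at index %llu\n", n, start->index);
    return 0;
}
```

Without such a cap, the same call simply returns everything from next_index to the end of the log, which is what piles up against a slow follower.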
Following up on the previous comment #276 (comment), I took the naive approach of placing a max limit (100) on the number of entries the leader sends to the followers [1, 2]. This stopped the high memory consumption (in the order of GBs) that was causing the machine to swap, but it breaks the cluster in two ways. In normal operation, the leader sends the entries to the followers. In our case the lagging/failing node cannot keep up. Eventually the leader needs to send a snapshot, as the entry that needs to be sent is no longer in the log [3]. On the lagging node replicationInstallSnapshot [4] is called, causing a snapshot_put call [5, 6], and we expect it to create a UvBarrier and a chain of calls to uvSnapshotPutBarrierCb, uvSnapshotPutStart, uvSnapshotPutAfterWorkCb, uvSnapshotPutFinish, installSnapshotCb. The installSnapshotCb, I believe, does the freeing of the snapshot put request [10].
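As a side note on that chain: the ownership pattern being described, where an asynchronous request is only released in its final callback, looks roughly like the sketch below (all names are illustrative, not the libraft API). If the chain is interrupted, or the final callback runs against an already-released request, a crash like the observed segfault becomes plausible.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative async-request pattern (hypothetical names): the request owns a
 * heap buffer and is only freed by its final completion callback. */
struct put_req {
    void (*cb)(struct put_req *req, int status);
    char *data;
};

static void put_finish(struct put_req *req, int status)
{
    /* Final step of the chain: report status, then release everything
     * that was allocated when the request was submitted. */
    printf("snapshot put finished with status %d\n", status);
    free(req->data);
    free(req);
}

static void put_start(struct put_req *req)
{
    /* In the real code several intermediate callbacks run here (barrier,
     * work queue, after-work); the invariant is that exactly one path
     * eventually invokes the final callback, exactly once. */
    req->cb(req, 0);
}

int main(void)
{
    struct put_req *req = malloc(sizeof *req);
    if (req == NULL) {
        return 1;
    }
    req->cb = put_finish;
    req->data = malloc(64);
    put_start(req);
    return 0;
}
```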
I do not have a good enough grasp of Raft and this implementation, so it is possible I misunderstand how a proper patch should be formed. Based on the couple of TODOs I encountered in this execution path, I think this behavior is something that has been considered, but perhaps treated as a rare case. Also, the naive approach of limiting the number of entries we send from the leader to the followers might not be the right thing to do; maybe we need to quickly remove the lagging node instead of continuing to try to revive it. Any feedback would be much appreciated. Thank you. [1] logAcquireMax https://github.com/ktsakalozos/raft/blob/chaos/src/log.c#L751
This is precious information, keep up the good work!
I think for whoever takes this up next, this might be a great starting point.
Thanks to the great work from @ktsakalozos, I wrote a script which seems to reproduce the issue reasonably easily:
#!/bin/sh -eux
DIR=$(pwd)
# Install dependencies
apt-get install build-essential autoconf make libtool pkg-config libsqlite3-dev libuv1-dev --yes
snap install go --classic
umount -l data/db1 || true
umount -l data/db2 || true
rm -rf raft dqlite go db* data
pkill -9 dqlite-demo || true
# Get and build the code
git clone --depth=1 https://github.com/canonical/raft
cd raft
autoreconf -i
./configure
make
cd ..
git clone --depth=1 https://github.com/canonical/dqlite
cd dqlite
autoreconf -i && \
PKG_CONFIG_PATH="${DIR}/raft/" ./configure
make CFLAGS="-I${DIR}/raft/include/" LDFLAGS="-L${DIR}/raft/.libs/"
cd ..
# Environment
export CGO_CFLAGS="-I/root/raft/include/ -I/root/dqlite/include/"
export CGO_LDFLAGS="-L/root/raft/.libs -L/root/dqlite/.libs/"
export CGO_LDFLAGS_ALLOW="-Wl,-wrap,pthread_create"
# Build go-dqlite
go get -tags libsqlite3 github.com/canonical/go-dqlite/cmd/dqlite
go get -tags libsqlite3 github.com/canonical/go-dqlite/cmd/dqlite-demo
# Update dqlite-demo
(
cat << EOF
H4sIAAAAAAACA41RwW7bMAw9x1/BGQiQwLbsbGmyugvQbehxaLHsnskWlQiVRVeWixZF/72yq2LZ
Dkt1EEnh8T3qUSgpIcv2ygHP60bk4k4rh5nAho5ztieoTgAiZQQ+wPJcfl6crxn7VFRnUn6ERVGs
lssoy7KTGlGSJKd1Li8hW6/TFST+XqzA19+paTU6BEF136Bx3CkyoDrg91xpXmkE7uDgXNuVee7/
e+grVlOT19yQUTXX+Z6yV6HfaQST4SgJuxTQWig3ICp29YD1rG8Fd5jCLT6mcM91j/OLEfNhA0Zp
eArNE4tdrx1sQDaObVurjJOz+MpasiVMu3hkZmM9m89D13OUDKHmHUJ8c739FZevL3/oYkEG4/A6
DpDCbphQUe+UZj+Ri69azyz7RuJxHoCSLKgBVcCFT74MaykKn3rHnwJmYBsgfw087UATtZupiMN/
U1BvrO+z6F+DQu8ph2CUPDLpWHhSWeS3b1WwzcfRRoGSe+YymPpfnd50fduSdSigQXcgAdM7r2vZ
j7Hym4leAAaM3qEnAwAA
EOF
) | base64 -d | gzip -d | patch ~/go/src/github.com/canonical/go-dqlite/cmd/dqlite-demo/dqlite-demo.go
go get -tags libsqlite3 github.com/canonical/go-dqlite/cmd/dqlite-demo
set +x
echo ""
echo "Please set the following in your environment (possibly ~/.bashrc)"
echo "export GODEBUG=madvdontneed=1"
echo "export LD_LIBRARY_PATH=\"${DIR}/raft/.libs/:${DIR}/dqlite/.libs/\""
echo ""
echo ""
# Start the test
export LD_LIBRARY_PATH="/root/raft/.libs/:/root/dqlite/.libs/"
export GODEBUG=madvdontneed=1
echo "=> Spawning database"
mkdir data
mkdir data/db1
mount -t tmpfs tmpfs data/db1
go/bin/dqlite-demo --dir data/db1 --api 127.0.0.1:8001 --db 127.0.0.1:9001 > /dev/null 2>&1 &
echo $! > db1.pid
mkdir data/db2
mount -t tmpfs tmpfs data/db2
go/bin/dqlite-demo --dir data/db2 --api 127.0.0.1:8002 --db 127.0.0.1:9002 --join 127.0.0.1:9001 > /dev/null 2>&1 &
echo $! > db2.pid
go/bin/dqlite-demo --dir data/db3 --api 127.0.0.1:8003 --db 127.0.0.1:9003 --join 127.0.0.1:9001 > /dev/null 2>&1 &
echo $! > db3.pid
sleep 10s
echo "=> Spawning test loops"
test_loop() {
for i in $(seq 100); do
OUT=$(curl -X POST -d "foo=${i}" "http://localhost:8001/${1}" 2>/dev/null) || sleep 1m
[ "${OUT}" = "done" ] || sleep 1m
done
echo "Loop ${1} exited"
}
test_loop mykey1 &
test_loop mykey2 &
sleep 10s
echo "=> Monitor the result"
while true; do
RSS1=$(grep RSS /proc/$(cat db1.pid)/status | awk '{print $2}')
RSS2=$(grep RSS /proc/$(cat db2.pid)/status | awk '{print $2}')
RSS3=$(grep RSS /proc/$(cat db3.pid)/status | awk '{print $2}')
echo "==> $(date +%Y%m%d-%H%M): db1=${RSS1} db2=${RSS2} db3=${RSS3}"
go/bin/dqlite -s 127.0.0.1:9001 demo "SELECT * FROM model;"
sleep 1m
done

This builds the stack almost entirely from source, runs it with the aggressive memory release mechanism, and runs two of the daemons on tmpfs while one runs on whatever the underlying storage usually is. It then performs up to 200 attempts at writing 10000 records into the database, tracking and printing the RSS every minute as well as showing progress through the DB records present. Testing this on reasonably beefy arm64 hardware (as we were told arm64 was seeing this issue more easily), we can pretty quickly notice the non-tmpfs daemon using more memory than the tmpfs ones; this usually starts at around 1.5x and grows to 2, 3, 5 times the usage over the next few minutes. In some cases we get much more dramatic memory usage leading to OOM of that daemon (in such a scenario, we see a doubling every minute or so). It seems the theory that a transaction backlog prevents memory release by the slow process is quite likely correct, and it also explains why we could not find any sign of a traditional memory leak. A slow voter can be caused by anything from differing I/O speeds, as is the case here, to network congestion or CPU/scheduling difficulties. LXD performs very few writes to the database, so we probably never hit a situation where we could build a transaction backlog that would cause this. And when we performed specific stress tests to reproduce a memory leak, we were doing them on extremely fast systems to speed up reproduction. Some options I can think of:
I'm also wondering how this is handled (if it is) in other Raft implementations, as in, can we cause OVN or other such codebases to fail in a similar way?
Nice job!
This is usually handled with some kind of "flow control" and "back pressure" mechanism (similar to TCP). That is implemented in our libraft code as well, at least in some rudimentary form, but at this point I suspect it might be buggy or might need improvements. The gist of that logic is in the progress tracking code (src/progress.c, linked earlier in this thread).
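For illustration only (the names `follower`, `send_budget`, and `WINDOW` are made up; this is not the libraft implementation), a minimal sketch of the flow-control / back-pressure idea: the leader tracks how much it has sent to each follower that has not yet been acknowledged, and stops shipping entries once a window is exhausted, falling back to plain heartbeats until acks come back.

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative per-follower flow-control state (hypothetical names). */
struct follower {
    unsigned long long next_index;  /* next entry to replicate */
    unsigned long long match_index; /* highest acknowledged entry */
    size_t in_flight;               /* entries sent but not yet acked */
};

#define WINDOW 256 /* max unacknowledged entries per follower */

/* Decide how many entries we may send now; 0 means heartbeat only. */
static size_t send_budget(const struct follower *f, unsigned long long last_index)
{
    if (f->in_flight >= WINDOW) {
        return 0; /* back-pressure: wait for acks before sending more */
    }
    unsigned long long pending = last_index >= f->next_index
                                     ? last_index - f->next_index + 1
                                     : 0;
    size_t room = WINDOW - f->in_flight;
    return pending < room ? (size_t)pending : room;
}

int main(void)
{
    struct follower slow = {.next_index = 10, .match_index = 9, .in_flight = 200};
    struct follower fast = {.next_index = 1000, .match_index = 999, .in_flight = 0};
    unsigned long long last_index = 1050;
    printf("slow follower budget: %zu\n", send_budget(&slow, last_index));
    printf("fast follower budget: %zu\n", send_budget(&fast, last_index));
    return 0;
}
```

The point is that a follower whose disk cannot keep up naturally stops receiving new data instead of accumulating an ever-growing backlog on both sides.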
That's an example of it going high very fast and hitting OOM.
Apart from the leak on db3 resulting in OOM, the run shown above by @stgraber displays two more behavioral aspects that need investigation.
It may be interesting to run with logging enabled.
For the 84MB to 105MB increase, in all test runs I see the daemon stabilizing once it reaches 110MB or so. The db3 behavior, and the behavior of the other databases when db3 goes down, is what we need to focus on for now.
I was triggered by @ktsakalozos' reply, i.e. the fact that heartbeat messages contain all missing log entries, because I would expect heartbeat messages to contain no log entries.
I naively made some changes to not send any log entries in heartbeat messages and the memory spike issue went away (though some unit tests started failing); see the changes in MathieuBordere/raft@a3ef291 (don't mind the TODOs and the structure of the code, I was only testing an idea).
But there were also runs where db3 memory usage immediately went to 1.2GB and then remained stable until the end of the test. Trying to fix one of the failing unit tests reintroduces an increasing-memory issue (I don't know if it's the same one). Maybe this information can give someone inspiration on the root cause of this issue.
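For context, the change being experimented with above corresponds to the textbook Raft heartbeat: an AppendEntries message that carries prev_log_index/prev_log_term and the leader's commit index but zero entries. A minimal sketch with made-up types (not the libraft structs):

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical message shape; not the libraft struct. */
struct append_entries {
    unsigned long long term;
    unsigned long long prev_log_index;
    unsigned long long prev_log_term;
    unsigned long long leader_commit;
    const void *entries; /* NULL for a pure heartbeat */
    size_t n_entries;
};

/* A heartbeat asserts leadership and advances the commit index
 * without shipping any log data. */
static struct append_entries make_heartbeat(unsigned long long term,
                                            unsigned long long prev_index,
                                            unsigned long long prev_term,
                                            unsigned long long commit)
{
    struct append_entries msg = {
        .term = term,
        .prev_log_index = prev_index,
        .prev_log_term = prev_term,
        .leader_commit = commit,
        .entries = NULL,
        .n_entries = 0, /* the key difference from a replication message */
    };
    return msg;
}

int main(void)
{
    struct append_entries hb = make_heartbeat(7, 1200, 7, 1195);
    printf("heartbeat: prev_index=%llu entries=%zu\n",
           hb.prev_log_index, hb.n_entries);
    return 0;
}
```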
Found a memory leak, fixed it in https://github.com/MathieuBordere/raft/tree/mbordere/memory_leak_during_snapshot_install.
The memory spike seems to occur when disk I/O has been slow for a while, for whatever reason. This can result in a bunch of large allocations for the received AppendEntries batches. See the table below reflecting these allocations; it's not uncommon to see 200+ allocations of 20MB in a row. The problem with these allocations is that they contain almost all the same entries. Most of the time the newer AppendEntries RPC only contains 1 new entry, but we allocate a whole new batch for everything it carries.
One approach that I tried (in draft) is to copy the new entries in the log to a new (small) memory chunk and free the batch immediately; this seems to solve the memory spike without breaking other functionality. Another approach would probably be to decrease the number of entries sent in an RPC; however, this could still lead to spikes if we receive a couple of hundred in a row, while on the other hand not placing a limit on the number of sent entries can also lead to large allocations. I guess the practical limit for the number of entries that can be sent in one go is lower than the raft snapshot threshold + snapshot trailing amount (I need to double-check this). I quite like the approach of freeing the batch immediately (effectively getting rid of the batch logic) as it seems to solve the issue, but I'm open to suggestions.
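A rough sketch of the first approach (copy, then free); the names here are hypothetical and this is not the actual draft patch: each new entry is copied out of the received batch into its own small buffer, so the large batch allocation can be released right away.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry: points either into a shared batch or into its own buffer. */
struct entry {
    void *buf;
    size_t len;
};

/* Copy each newly appended entry out of the received batch into its own small
 * allocation, so the (potentially huge) batch buffer can be freed immediately
 * instead of staying alive for as long as the log references any entry in it. */
static int detach_entries(struct entry *entries, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        void *copy = malloc(entries[i].len);
        if (copy == NULL) {
            return -1;
        }
        memcpy(copy, entries[i].buf, entries[i].len);
        entries[i].buf = copy; /* the entry now owns its own buffer */
    }
    return 0;
}

int main(void)
{
    /* Simulate a 20MB batch of which only the last 512 bytes are a new entry. */
    size_t batch_len = 20 * 1024 * 1024;
    char *batch = calloc(1, batch_len);
    if (batch == NULL) {
        return 1;
    }
    struct entry new_entries[1] = {{.buf = batch + batch_len - 512, .len = 512}};
    if (detach_entries(new_entries, 1) != 0) {
        free(batch);
        return 1;
    }
    free(batch); /* safe: the log no longer points into the batch */
    printf("new entry copied into a %zu-byte buffer\n", new_entries[0].len);
    free(new_entries[0].buf);
    return 0;
}
```

The trade-off, as noted in the reply below, is one extra allocation and copy per entry, which is exactly what the batch design originally avoided.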
Nice breakdown. Just to be sure I understand correctly:
This sounds like a very pathological case that could happen only with really slow I/O (which keeps being slow for a long time), but I guess that's what is happening for one reason or another. I think that limiting the amount of entries that the leader sends in those cases would make sense. You can of course free the batch as well, but I presume that would require performing one allocation per entry and then copying all the data. The batch thing was designed to avoid that copy, but I never really measured its performance impact, so maybe it's minor.
Yes, you're correct. I will go for the non-batch approach, as it seems the least risky for now, and do some measurements.
When hitting the dqlite-demo with multiple concurrent requests I can pretty much reliably reproduce a memory spike of GBs. To reproduce, first add an extra HTTP verb handler on the dqlite-demo:
Set up a three-node cluster as described in the go-dqlite README [1].
In two terminals, trigger the new operations:
And
Let it run for a few minutes and about half of the time the memory usage of one of the dqlite-demo processes will spike, e.g. [2].
FYI, @sevangelatos
[1] https://github.com/canonical/go-dqlite#demo
[2] https://pasteboard.co/JET1frM.jpg