This is a short, self-contained, correct example SSCCE for debugging purposes from a much larger project.
This program runs a simple database SELECT
query intended to pull a large
quantity of results (100k+). This issue I see is that the program will
randomly fail with a PacketOutOfOrder
error.
This error only occurs on the release target. The debug target ran 355 consecutive times with zero errors.
If the query can run once successfully, I expect it to run successfully every time, barring any external factors (hardware failures, unreliable network, etc).
-
Grab your favorite MySQL derivative and spin up a database server. I've tested against MySQL 5.5 and MariaDB 10.1.21.
-
Add a test user to your database. If you prefer more security, specify your host IP instead of the host wildcard
'%'
.CREATE USER 'testbot'@'%' IDENTIFIED BY 'testbot'; GRANT ALL PRIVILEGES ON `fake_data`.* TO 'testbot'@'%';
-
Add a database for testing and put some fake data into it...
-
Edit the global variables in
tools/fake-data-gen.py
script. Changedb_host
anddb_port
to reflect the database settings for your environment. -
Install
python 2.7
andpip
. -
Install two dependencies for the fake data script:
-
pip install faker
-
pip install mysql-connector==2.1.4
. The latest version of this package would not install for me.
-
-
Run
python2 ./tools/fake-data-gen.py
. If no errors are reported, you should see afake_data
database on your MySQL server with aproducts
table full of random data.
-
-
Run the binary.
export DSN="mysql://testbot:testbot@host:port/fake_data" RUST_LOG=bug_test_01=debug,mysql_async=debug cargo run --release --bin bug-test-01
-
Observe that the programs fails with an error similar to:
DEBUG:bug_test_01: running thread 'main' panicked at 'error resolving future: Error(PacketOutOfOrder, State { next_error: None, backtrace: None })', /checkout/src/libcore/result.rs:860 note: Run with `RUST_BACKTRACE=1` for a backtrace.
If you do not see this error, rerun the binary until you do. Like I said, the error occurs randomly. To improve your chances of the error occurring, remove the
RUST_LOG
environment variable to disable debugging print statements, or only removemysql_async=debug
from that log list.
$ rustc --version --verbose
rustc 1.19.0-nightly (f062832b2 2017-06-07)
binary: rustc
commit-hash: f062832b208e94f2f0f26ed7fb5c48c172069fbe
commit-date: 2017-06-07
host: x86_64-unknown-linux-gnu
release: 1.19.0-nightly
LLVM version: 4.0
$ uname -a
Linux 4.10.11-1-ARCH #1 SMP PREEMPT Tue Apr 18 08:39:42 CEST 2017 x86_64 GNU/Linux
On the very same machine I run cargo run --release --bin bug-test-01
, I fired
up Wireshark and captured the network packets to and from the
database server on port 3306.
I then analyzed the SQL protocol in the packet capture and
found no sequence numbers out of order, and other than the program closing the
TCP connection early with RST packets after the Packet out of order
error,
everything appears to transferring smoothly from the database server to my
computer's network interface card (NIC). NOTE, if you choose to pursue this
exercise yourself, the MySQL dissector is invaluable, but buggy; you may run
into "Malformed MySQL Packets" in Wireshark. I recommend compiling Wireshark
from scratch with the patch mentioned in this bug
report to disable SQL protocol compression detection.
I also used the command below (tshark
is a commandline utility provided with
Wireshark) to print all the MySQL packet sequence numbers captured. My
captures showed all sequence numbers in order.
./tshark -r PATH-TO-CAPTURE-FILE -T fields -e mysql.packet_number | egrep -v '^$' | sed 's/,/\n/g' | less
I did this, and the problem stopped manifesting. I forked
mysql_async
and added a debug!()
statement in
src/conn/futures/read_packet.rs:43
--it prints with every attempt to
process/validate each MySQL packet--and in src/proto.rs:304
--this prints with
every attempt to parse each MySQL packet.
Apparently, giving the event loop (that decodes the packets) the extra bit of
work to print to stderr
delays the program just enough that it doesn't fail
(so long as I have RUST_LOG=mysql_async=debug
enabled.
This indicates to me that I have a race condition of sorts in my program. Not
necessarily in my code, but I rather suspect it's in the libraries I'm using.
Either mysql_async
itself, or the tokio-rs
/futures
/mio
framework.
While this example spins up a second thread, all the database work is done in
one thread (specifically the thread created by tokio_core::reactor::Core
. My
second thread just waits for a result.
Besides, isn't Rust supposed to prevent race conditions from occurring? At
least so far as my code is concerned, it's 100% Rust with no unsafe
blocks.
Heck, while debugging a separate issue with this project, I introduced myself
to the perftools
on Linux and perf report
showed that my program spent over
50% of its time zeroing the 4kB local stack buffer used by
mysql_async::io::Stream
. So I'm not even sure where the garbage
value corrupting the MySQL packet sequence number could be coming from, if not
the tokio-rs
libraries further down the software stack, (e.g.
tokio::net::TcpStream
).
Where do I go from here to isolate and trap this bug?
I don't use gdb
because I'm a novice at best with that program. I can print
variables, step, continue, and show backtraces, but I'm not productive with the
debugger. If I use it, I'm stepping through every line in the program and
that's a huge time sink, not to mention that slowing down the rate at which the
binary processes network traffic prevents the problem from occurring.
I'm not comfortable side-stepping the issue. Teaching my code to retry a fixed number of times and give up is an option, but it complicates my code and reduces its utility. If I can identify the cause and fix it, I'd rather do that. I've pretty much exhausted all the tricks in my toolbox, and I am stalled, so here's hoping someone out there has options in their toolbox to share.
I've also tried reading the source code for mysql_async
, the various
tokior-rs
libraries I use, and it's a bit beyond my capacity to read that
code and infer the why's and contexts and invariant assumptions that
established such code. The very little I've gleaned from the tokio-rs
guides
is that tokio-rs
has a layered approach to I/O, so if the protocol handler
isn't given enough buffered data to decode the next packet/message, then its
return value is used to indicate that the buffer wasn't used and needs more
data.
Copyright (c) 2017 Justin Charette (boxofrox)