Automatic Bad Connection Retry #681

Merged: 20 commits into denisenkom:master from feature/autoretry on Jul 28, 2021

Conversation

fineol
Contributor

@fineol fineol commented Jul 20, 2021

This pull request resolves #586. It is the result of collaboration by @kardianos, @tc-hib, @shueybubbles, and myself.

First, it adds an optional parameter named "disableretry" to the connection string. When set to "true", this parameter configures go-mssqldb's retry policy to match that requested in issue #275: failed queries immediately return a detailed error. When set to "false", this parameter configures the retry policy to match that requested in issue #586: failed queries are automatically retried as per the standard library's database/sql retry logic.

For versions of Go less than 1.18, the default value for "disableretry" is "true", maintaining backwards compatibility.
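For illustration, here is a minimal sketch of opening a connection with the new parameter; the host, credentials, and database below are placeholders, not part of this change:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/denisenkom/go-mssqldb"
    )

    func main() {
        // Placeholder host, credentials, and database; "disableretry=false"
        // opts in to automatic retry via database/sql's standard retry logic.
        db, err := sql.Open("sqlserver",
            "sqlserver://user:pass@localhost?database=master&disableretry=false")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
    }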

Second, starting with Go 1.18 when (we hope) kardianos' change proposal 333949 is accepted, this pull request wraps errors on bad connections with a new RetryableError error that allows both automatic retry and detailed errors, satisfying both issue #275 and #586. At that time the default value for "disableretry" will change to "false".

The change is designed to be mostly backwards compatible. Prior to Go 1.18 the default behavior will be identical to current behavior with the exception that additional information will be returned for server errors.

Subsequent to Go 1.18 it will be identical in spirit to current behavior, but not 100% the same. Automatic retries that are transparent to the caller will be allowed, and when all retries fail, the returned errors will show the same Error() messages as now but will have to be unwrapped by anyone comparing errors directly. I think this is a reasonable compromise to restore standard Go retry behavior to go-mssqldb.
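As a concrete sketch of the unwrapping point: a caller that compares errors directly would switch to errors.As, which traverses wrapped errors, rather than an equality check. mssql.Error is the driver's existing server-error type; the exact shape of the retryable wrapper depends on the final implementation.

    package example

    import (
        "context"
        "database/sql"
        "errors"
        "log"

        mssql "github.com/denisenkom/go-mssqldb"
    )

    // queryOnce sketches post-Go-1.18 error handling: the server error may
    // arrive wrapped in a retryable error, so compare with errors.As, not ==.
    func queryOnce(ctx context.Context, db *sql.DB) error {
        rows, err := db.QueryContext(ctx, "SELECT 1")
        if err != nil {
            var sqlErr mssql.Error
            if errors.As(err, &sqlErr) { // unwraps through any wrapper
                log.Printf("server error %d: %s", sqlErr.Number, sqlErr.Message)
            }
            return err
        }
        return rows.Close()
    }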

I have tested on both Go 1.16.5 and a mock of Go 1.18 built from kardianos' change proposal 333949.

I still have some questions and don't expect that this pull request will be accepted as is, but I wanted to open it now to facilitate discussion of it. Some open questions are:

  1. Is Go 1.18 the right time to change the default retry behavior, or should that be done now (a more breaking change)?
  2. Are we comfortable building in a dependency now on Go 1.18 for something that, while by all rights it ought to be included, isn't yet guaranteed to be included in 1.18?

@shueybubbles
Contributor

On the testing front: in my efforts to reproduce builds/tests with "old go" versions using Docker, I've found that some tests like TestDisconnect2 are very sensitive to the environment.
E.g., I create a .connstr file on my Windows host that has the enlistment and run this:

docker run -v i:\git\go-mssqldb:/go-mssqldb -w /go-mssqldb golang:1.16 go test ./...

If the server is remote, TestDisconnect2 will fail. If the server is the instance running on the host and .connstr has the IP address for it, the test passes.

Are all the AppVeyor build SQL instances local? If so, it's potentially hiding quite a few bugs.

@fineol
Contributor Author

fineol commented Jul 20, 2021

Yes, the disconnect tests are very sensitive to environment and more. The new TestDisconnect3 I wrote uses the same mechanism as TestDisconnect2 (mssql.dialerInterrupt and mssql.connInterrupt). And TestDisconnect3 is randomly succeeding and failing in the AppVeyor tests. One environment may fail in one run while a virtually identical environment succeeds in that same run. Then in the next run they could be reversed.

The random failures make more sense to me for TestDisconnect2 than for 3. Test 2 relies on the relative sequence of two different waits/sleeps, and I could see those occasionally getting out of order for a remote server. But I wrote test 3 with the goal of disconnecting synchronously.

The main test routine exercises a connection and then sends a message on the "disrupt" channel. An interrupter routine waits for that signal with a blocking read from the "disrupt" channel, then breaks all connections it has already dialed (including the connection the main test routine already used), and finishes by signaling back through a "waitDisrupt" channel. Finally, the main test routine waits on a blocking read of the "waitDisrupt" channel, and after it gets the signal it attempts the second query on the connection, which should fail due to the interruption.
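A minimal sketch of that synchronization, with hypothetical stand-ins for the test's real query and interrupt operations:

    package main

    // Hypothetical stand-ins for the test's real operations.
    func runFirstQuery()    { /* exercise the pooled connection */ }
    func breakConnections() { /* sever every connection dialed so far */ }
    func runSecondQuery()   { /* should fail on the broken connection */ }

    func main() {
        disrupt := make(chan struct{})
        waitDisrupt := make(chan struct{})

        go func() { // interrupter routine
            <-disrupt // block until the main routine signals
            breakConnections()
            waitDisrupt <- struct{}{} // confirm the break completed
        }()

        runFirstQuery()
        disrupt <- struct{}{} // ask the interrupter to break connections
        <-waitDisrupt         // nothing proceeds until the break is done
        runSecondQuery()
    }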

That was the plan, at least. But it isn't working, and I don't know why. I believe the channel synchronization is correct, and if it is, the connection must already be broken by the time the second query is exercised. If we accept that, then it means something else is using the original connection, purging it from the pool after it is disrupted, so the second query executes on a new connection that hasn't been disrupted. I have no idea what could be doing that and have no visibility into the pool to find out.

Are all the AppVeyor build SQL instances local? If so, it's potentially hiding quite a few bugs.

Ironically, the AppVeyor instances appear to be local:

"(local)\%SQLINSTANCE%"

My instance, by contrast, is slightly beyond local: it runs on a separate VM on the same virtual subnet. But I have never been able to duplicate the AppVeyor errors in my test setup.

All of these problems, as best as I can tell, are related to the difficult task of simulating network failures in unit tests, not the actual driver code.

@shueybubbles
Contributor

I have a proposed fix for TestDisconnect2 in #682.

@fineol
Contributor Author

fineol commented Jul 21, 2021

I was running my tests on Linux where I had no problems, but I saw that AppVeyor runs its tests on Windows so I switched to a Windows dev box and was able to duplicate the AppVeyor problem. My current understanding is that a combination of things causes the issue:

  1. Windows returns two IP addresses for the host "(local)" as specified in the AppVeyor tests (Linux returns just one address for "(local)", and both Windows and Linux naturally return only one if you specify a dotted decimal address directly).
  2. The mssql.dialConnection function in tds.go dials connections in parallel if there are multiple IP addresses.
  3. The createDialer function implemented in TestDisconnect3 returns a new dialer each time it is called (as it should), but all of the dialers listen on the same disrupt channel.

As a result, the dialer that handles the disrupt signal may or may not be the same dialer that connected to SQL Server first and supplied/controls the connection being used in the test. Therefore, there is a roughly 50-50 chance that the connection being tested will actually be disrupted. That correlates well with the results I was seeing on AppVeyor.

I will need to modify the test to ensure that all dialers get the "disrupt" message and reply back before proceeding with the second query, as sketched below.
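One illustrative way to do that, though not necessarily the exact fix: close the channel so every dialer's blocked receive unblocks at once, and count acknowledgements with a sync.WaitGroup.

    package main

    import "sync"

    func main() {
        disrupt := make(chan struct{})
        var acked sync.WaitGroup

        newDialer := func() {
            acked.Add(1)
            go func() {
                defer acked.Done()
                <-disrupt // a closed channel releases every receiver at once
                // ... break the connections this dialer created ...
            }()
        }

        newDialer()
        newDialer() // parallel dial can create more than one dialer

        close(disrupt) // broadcast the disrupt signal to all dialers
        acked.Wait()   // proceed only after every dialer has replied
    }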

@shueybubbles
Contributor

Good find.
I tend not to use "localhost" in my testing just because I run the tests in Docker containers using a mounted enlistment, so SQL Server is never running locally to the tests.
Should the driver optimize for this and only use 127.0.0.1 for localhost? What's the benefit of a parallel dial in the localhost case?

@codecov

codecov bot commented Jul 21, 2021

Codecov Report

Merging #681 (2360af0) into master (a78da2a) will increase coverage by 0.73%.
The diff coverage is 96.87%.


@@            Coverage Diff             @@
##           master     #681      +/-   ##
==========================================
+ Coverage   70.19%   70.92%   +0.73%     
==========================================
  Files          23       24       +1     
  Lines        5163     5177      +14     
==========================================
+ Hits         3624     3672      +48     
+ Misses       1310     1279      -31     
+ Partials      229      226       -3     
Impacted Files       Coverage Δ
bulkcopy.go           57.20% <0.00%>   (ø)
error.go             100.00% <100.00%> (+25.00%) ⬆️
mssql.go              91.46% <100.00%> (+4.19%)  ⬆️
mssql_go118pre.go    100.00% <100.00%> (ø)
token.go              61.23% <100.00%> (+0.75%)  ⬆️
tds.go                65.22% <0.00%>   (-0.54%)  ⬇️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a78da2a...2360af0.

@fineol
Contributor Author

fineol commented Jul 21, 2021

My changes to the tests' createDialer function fixed the problem when running TestDisconnect3 in a Windows environment with the server set to "(local)". All test dialers now get the signal to interrupt the connections they created, so the test connection is now reliably broken when intended.

I saw an unrelated data race failure in TestDisconnect1, a test that I neither wrote nor modified. Other pull requests show the same data race in that same test, supporting my contention that it has nothing to do with this pull request. I don't plan to investigate and fix that race at this time.

Now that AppVeyor has passed this pull request, the code coverage report shows it needs some improvement. I can work on that, but I would like to know that there is a chance this pull request will be accepted before investing more in it.

Also, I don't understand the coverage report for tds.go. This pull request made no changes to either tds.go or tds_test.go. Furthermore, the default error behavior hasn't changed, and the tests in tds_test.go only check whether err is nil or not nil anyway, so there shouldn't be any side effects that change the tds tests. Finally, the tds.go coverage change report shows many "changes" where a prior line is listed as having lost coverage while a subsequent line has gained it, even though the subsequent line cannot be reached without first executing the prior line. That doesn't make any sense to me.

@fineol
Contributor Author

fineol commented Jul 21, 2021

@shueybubbles:

I tend not to use "localhost" in my testing just because I run the tests in Docker containers using a mounted enlistment, so SQL Server is never running locally to the tests.

I wasn't running tests with the database server on the test box either, which, combined with the fact that my test box was Linux, is why I didn't see this error earlier.

Should the driver optimize for this and only use 127.0.0.1 for localhost?

Probably not. I think it is best to let the IP address lookup use standard mechanisms.

What's the benefit of a parallel dial in the localhost case?

The code states it is to speed up connections when there are multiple IP addresses by trying them all at once and using the first to answer. The stated use case is Always On availability groups. I think "(local)" just got caught in the crossfire due to a Windows idiosyncrasy when resolving "(local)", but I am not an authority:

go-mssqldb/tds.go

Lines 853 to 874 in 9e2f833

    // SQL Server AlwaysOn Availability Group Listeners are bound by DNS to a
    // list of IP addresses. So if there is more than one, try them all and
    // use the first one that allows a connection.
    func dialConnection(ctx context.Context, c *Connector, p msdsn.Config) (conn net.Conn, err error) {
        var ips []net.IP
        ip := net.ParseIP(p.Host)
        if ip == nil {
            ips, err = net.LookupIP(p.Host)
            if err != nil {
                return
            }
        } else {
            ips = []net.IP{ip}
        }
        if len(ips) == 1 {
            d := c.getDialer(&p)
            addr := net.JoinHostPort(ips[0].String(), strconv.Itoa(int(resolveServerPort(p.Port))))
            conn, err = d.DialContext(ctx, "tcp", addr)
        } else {
            // Try Dials in parallel to avoid waiting for timeouts.
            connChan := make(chan net.Conn, len(ips))
@kardianos kardianos merged commit 0730d50 into denisenkom:master Jul 28, 2021
@fineol fineol deleted the feature/autoretry branch September 23, 2021 20:53
Successfully merging this pull request may close these issues:

Queries that fail due to bad connections are not automatically retried (problem with Issue #275 changes)