Automatic Bad Connection Retry #681

Merged: 20 commits into denisenkom:master from feature/autoretry on Jul 28, 2021

Conversation

fineol
Contributor

@fineol fineol commented Jul 20, 2021

This pull request resolves #586. It is the result of collaboration by @kardianos, @tc-hib, @shueybubbles, and myself.

First, it adds an optional parameter named "disableretry" to the connection string. When set to "true", this parameter configures go-mssqldb's retry policy to match that requested in issue #275: failed queries immediately return a detailed error. When set to "false", this parameter configures the retry policy to match that requested in issue #586: failed queries are automatically retried as per the standard library's database/sql retry logic.

For versions of Go less than 1.18, the default value for "disableretry" is "true", maintaining backwards compatibility.
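For illustration, here is a minimal sketch of opening a connection with the new parameter; the host, credentials, and database below are placeholders, not part of this change:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/denisenkom/go-mssqldb"
    )

    func main() {
        // Placeholder host, credentials, and database; "disableretry=false"
        // opts in to automatic retry via database/sql's standard retry logic.
        db, err := sql.Open("sqlserver",
            "sqlserver://user:pass@localhost?database=master&disableretry=false")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
    }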

Second, starting with Go 1.18 when (we hope) kardianos' change proposal 333949 is accepted, this pull request wraps errors on bad connections with a new RetryableError error that allows both automatic retry and detailed errors, satisfying both issue #275 and #586. At that time the default value for "disableretry" will change to "false".

The change is designed to be mostly backwards compatible. Prior to Go 1.18 the default behavior will be identical to current behavior with the exception that additional information will be returned for server errors.

Subsequent to Go 1.18 it will be identical in spirit to current behavior, but not 100% the same. Automatic retries that are transparent to the caller will be allowed, and when all retries fail, the returned errors will show the same Error() messages as now but will have to be unwrapped by anyone comparing errors directly. I think this is a reasonable compromise to restore standard Go retry behavior to go-mssqldb.
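As a concrete sketch of the unwrapping point: a caller that compares errors directly would switch to errors.As, which traverses wrapped errors, rather than an equality check. mssql.Error is the driver's existing server-error type; the exact shape of the retryable wrapper depends on the final implementation.

    package example

    import (
        "context"
        "database/sql"
        "errors"
        "log"

        mssql "github.com/denisenkom/go-mssqldb"
    )

    // queryOnce sketches post-Go-1.18 error handling: the server error may
    // arrive wrapped in a retryable error, so compare with errors.As, not ==.
    func queryOnce(ctx context.Context, db *sql.DB) error {
        rows, err := db.QueryContext(ctx, "SELECT 1")
        if err != nil {
            var sqlErr mssql.Error
            if errors.As(err, &sqlErr) { // unwraps through any wrapper
                log.Printf("server error %d: %s", sqlErr.Number, sqlErr.Message)
            }
            return err
        }
        return rows.Close()
    }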

I have tested on both Go 1.16.5 and a mock of Go 1.18 built from kardianos' change proposal 333949.

I still have some questions and don't expect that this pull request will be accepted as is, but I wanted to open it now to facilitate discussion of it. Some open questions are:

  1. Is Go 1.18 the right time to change the default retry behavior, or should that be done now (a more breaking change)?
  2. Are we comfortable building in a dependency now on Go 1.18 for something that, while by all rights it ought to be included, isn't yet guaranteed to be included in 1.18?

@shueybubbles
Contributor

On the testing front: in my efforts to reproduce builds/tests with "old go" versions using Docker, I've found that some tests like TestDisconnect2 are very sensitive to the environment.
E.g., I create a .connstr file on my Windows host that has the enlistment and run this:

docker run -v i:\git\go-mssqldb:/go-mssqldb -w /go-mssqldb golang:1.16 go test ./...

If the server is remote, TestDisconnect2 will fail. If the server is the instance running on the host and .connstr has the IP address for it, the test passes.

Are all the AppVeyor build SQL instances local? If so, it's potentially hiding quite a few bugs.

@fineol
Contributor Author

fineol commented Jul 20, 2021

Yes, the disconnect tests are very sensitive to environment and more. The new TestDisconnect3 I wrote uses the same mechanism as TestDisconnect2 (mssql.dialerInterrupt and mssql.connInterrupt). And TestDisconnect3 is randomly succeeding and failing in the AppVeyor tests. One environment may fail in one run while a virtually identical environment succeeds in that same run. Then in the next run they could be reversed.

The random failures make more sense to me for TestDisconnect2 than for 3. Test 2 relies on the relative sequence of two different waits/sleeps, and I could see those occasionally getting out of order for a remote server. But I wrote test 3 with the goal of disconnecting synchronously.

The main test routine exercises a connection and then sends a message on the "disrupt" channel. An interrupter routine waits for that signal with a blocking read from the "disrupt" channel, then breaks all connections it has already dialed (including the connection the main test routine already used), and finishes by signaling back through a "waitDisrupt" channel. Finally, the main test routine waits on a blocking read of the "waitDisrupt" channel, and after it gets the signal it attempts the second query on the connection, which should fail due to the interruption.
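A minimal sketch of that synchronization, with hypothetical stand-ins for the test's real query and interrupt operations:

    package main

    // Hypothetical stand-ins for the test's real operations.
    func runFirstQuery()    { /* exercise the pooled connection */ }
    func breakConnections() { /* sever every connection dialed so far */ }
    func runSecondQuery()   { /* should fail on the broken connection */ }

    func main() {
        disrupt := make(chan struct{})
        waitDisrupt := make(chan struct{})

        go func() { // interrupter routine
            <-disrupt // block until the main routine signals
            breakConnections()
            waitDisrupt <- struct{}{} // confirm the break completed
        }()

        runFirstQuery()
        disrupt <- struct{}{} // ask the interrupter to break connections
        <-waitDisrupt         // nothing proceeds until the break is done
        runSecondQuery()
    }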

That was the plan, at least. But it isn't working, and I don't know why. I believe the channel synchronization is correct, and if it is, the connection must already be broken by the time the second query is exercised. If we accept that, then it means something else is using the original connection, purging it from the pool after it is disrupted, so the second query executes on a new connection that hasn't been disrupted. I have no idea what could be doing that and have no visibility into the pool to find out.

Are all the AppVeyor build SQL instances local? If so, it's potentially hiding quite a few bugs.

Ironically, the AppVeyor instances appear to be local:

"(local)\%SQLINSTANCE%"

My instance, by contrast, is slightly beyond local: it runs on a separate VM on the same virtual subnet. But I have never been able to duplicate the AppVeyor errors in my test setup.

All of these problems, as best as I can tell, are related to the difficult task of simulating network failures in unit tests, not the actual driver code.

@shueybubbles
Contributor

I have a proposed fix for TestDisconnect2 in #682.

@fineol
Contributor Author

fineol commented Jul 21, 2021

I was running my tests on Linux where I had no problems, but I saw that AppVeyor runs its tests on Windows so I switched to a Windows dev box and was able to duplicate the AppVeyor problem. My current understanding is that a combination of things causes the issue:

  1. Windows returns two IP addresses for the host "(local)" as specified in the AppVeyor tests (Linux returns just one address for "(local)", and both Windows and Linux naturally return only one if you specify a dotted decimal address directly).
  2. The mssql.dialConnection function in tds.go dials connections in parallel if there are multiple IP addresses.
  3. The createDialer function implemented in TestDisconnect3 returns a new dialer each time it is called (as it should), but all of the dialers listen on the same disrupt channel.

As a result, the dialer that handles the disrupt signal may or may not be the same dialer that connected to SQL Server first and supplied/controls the connection being used in the test. Therefore, there is a roughly 50-50 chance that the connection being tested will actually be disrupted. That correlates well with the results I was seeing on AppVeyor.

I will need to modify the test to ensure that all dialers get the "disrupt" message and reply back before proceeding with the second query, as sketched below.
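One illustrative way to do that, though not necessarily the exact fix: close the channel so every dialer's blocked receive unblocks at once, and count acknowledgements with a sync.WaitGroup.

    package main

    import "sync"

    func main() {
        disrupt := make(chan struct{})
        var acked sync.WaitGroup

        newDialer := func() {
            acked.Add(1)
            go func() {
                defer acked.Done()
                <-disrupt // a closed channel releases every receiver at once
                // ... break the connections this dialer created ...
            }()
        }

        newDialer()
        newDialer() // parallel dial can create more than one dialer

        close(disrupt) // broadcast the disrupt signal to all dialers
        acked.Wait()   // proceed only after every dialer has replied
    }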

@shueybubbles
Contributor

Good find.
I tend not to use "localhost" in my testing just because I run the tests in Docker containers using a mounted enlistment, so SQL Server is never running locally to the tests.
Should the driver optimize for this and only use 127.0.0.1 for localhost? What's the benefit of a parallel dial in the localhost case?

@codecov

codecov bot commented Jul 21, 2021

Codecov Report

Merging #681 (2360af0) into master (a78da2a) will increase coverage by 0.73%.
The diff coverage is 96.87%.


@@            Coverage Diff             @@
##           master     #681      +/-   ##
==========================================
+ Coverage   70.19%   70.92%   +0.73%     
==========================================
  Files          23       24       +1     
  Lines        5163     5177      +14     
==========================================
+ Hits         3624     3672      +48     
+ Misses       1310     1279      -31     
+ Partials      229      226       -3     
Impacted Files       Coverage Δ
bulkcopy.go           57.20% <0.00%>   (ø)
error.go             100.00% <100.00%> (+25.00%) ⬆️
mssql.go              91.46% <100.00%> (+4.19%)  ⬆️
mssql_go118pre.go    100.00% <100.00%> (ø)
token.go              61.23% <100.00%> (+0.75%)  ⬆️
tds.go                65.22% <0.00%>   (-0.54%)  ⬇️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a78da2a...2360af0.

@fineol
Contributor Author

fineol commented Jul 21, 2021

My changes to the tests' createDialer function fixed the problem when running TestDisconnect3 in a Windows environment with the server set to "(local)". All test dialers now get the signal to interrupt the connections they created, so the test connection is now reliably broken when intended.

I saw an unrelated data race failure in TestDisconnect1, a test that I neither wrote nor modified. Other pull requests show the same data race in that same test, supporting my contention that it has nothing to do with this pull request. I don't plan to investigate and fix that race at this time.

Now that AppVeyor has passed this pull request, the code coverage report shows it needs some improvement. I can work on that, but I would like to know that there is a chance this pull request will be accepted before investing more in it.

Also, I don't understand the coverage report for tds.go. This pull request made no changes to either tds.go or tds_test.go. Furthermore, the default error behavior hasn't changed, and the tests in tds_test.go only check whether err is nil or not nil anyway, so there shouldn't be any side effects that change the tds tests. Finally, the tds.go coverage change report shows many "changes" where a prior line is listed as having lost coverage while a subsequent line has gained it, even though the subsequent line cannot be reached without first executing the prior line. That doesn't make any sense to me.

@fineol
Contributor Author

fineol commented Jul 21, 2021

@shueybubbles:

I tend not to use "localhost" in my testing just because I run the tests in Docker containers using a mounted enlistment, so SQL Server is never running locally to the tests.

I wasn't running tests with the database server on the test box either, which, combined with the fact that my test box was Linux, is why I didn't see this error earlier.

Should the driver optimize for this and only use 127.0.0.1 for localhost?

Probably not. I think it is best to let the IP address lookup use standard mechanisms.

What's the benefit of a parallel dial in the localhost case?

The code states it is to speed up connections when there are multiple IP addresses by trying them all at once and using the first to answer. The stated use case is Always On availability groups. I think "(local)" just got caught in the crossfire due to a Windows idiosyncrasy when resolving "(local)", but I am not an authority:

go-mssqldb/tds.go

Lines 853 to 874 in 9e2f833

    // SQL Server AlwaysOn Availability Group Listeners are bound by DNS to a
    // list of IP addresses. So if there is more than one, try them all and
    // use the first one that allows a connection.
    func dialConnection(ctx context.Context, c *Connector, p msdsn.Config) (conn net.Conn, err error) {
        var ips []net.IP
        ip := net.ParseIP(p.Host)
        if ip == nil {
            ips, err = net.LookupIP(p.Host)
            if err != nil {
                return
            }
        } else {
            ips = []net.IP{ip}
        }
        if len(ips) == 1 {
            d := c.getDialer(&p)
            addr := net.JoinHostPort(ips[0].String(), strconv.Itoa(int(resolveServerPort(p.Port))))
            conn, err = d.DialContext(ctx, "tcp", addr)
        } else {
            // Try Dials in parallel to avoid waiting for timeouts.
            connChan := make(chan net.Conn, len(ips))
@kardianos kardianos merged commit 0730d50 into denisenkom:master Jul 28, 2021
@fineol fineol deleted the feature/autoretry branch September 23, 2021 20:53
Successfully merging this pull request may close these issues:

Queries that fail due to bad connections are not automatically retried (problem with Issue #275 changes)