Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

THRIFT-5186: Rewrite address resolution in T{Nonblocking,}ServerSocket [REVIEW WANTED] #2151

Closed
wants to merge 1 commit into from

Conversation

ulidtko
Copy link
Contributor

@ulidtko ulidtko commented May 21, 2020

This is in followup to #2124. Passing AI_ADDRCONFIG hint to getaddrinfo() breaks Thrift servers when there's no network (127.0.0.1 loopback only). Not passing AI_ADDRCONFIG exposes deficient logic of processing getaddrinfo() results. This PR fixes the latter.

Please see the commit messages for the details of why&how.

I'm also including a quick drive-by fix of a compile warning related to Boost.Test deprecations, as a separate commit. Split off to separate PR.

Since I only tested on Linux — am very keen to see what AppVeyor and Travis CI's will have to say about this change (hopefully, some errors which point out my blunders).

Meanwhile — anyone interested is welcome to review. @emmenlau I think you'd want to take a look; this patch certainly needs another pair of eyes.

  • Did you create an Apache Jira ticket? (not required for trivial changes)
  • If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
  • Did you squash your changes to a single commit? (not required, but preferred)
  • Did you do your best to avoid breaking changes? If one was needed, did you label the Jira ticket with "Breaking-Change"?

@emmenlau
Copy link
Member

Nice work! I've tried to test it right away but it conflicts with another PR about Unix domain sockets that I submitted a few weeks back. I'll try to give it a shot sometime soon, but its not as straightforward as I'd hoped.

@ulidtko
Copy link
Contributor Author

ulidtko commented May 21, 2020

I've made sure this applies to the latest master by rebasing on top of it today. So you can probably make a local branch or something.

How Unix domain sockets are done here, baffled me as well. Glad that someone tackles this. I you Mario will merge that before me — I will rebase, no problem.

@ulidtko
Copy link
Contributor Author

ulidtko commented May 21, 2020

Linux CI got green. Appveyor though...

C:\projects\thrift\lib\cpp\src\thrift/transport/TSocketUtils.h(99): error C2440: 'return': cannot convert from 'WCHAR *' to 'std::basic_string<char,std::char_traits<char>,std::allocator<char>>' (compiling source file C:\projects\thrift\lib\cpp\src\thrift\transport\TServerSocket.cpp) [C:\projects\build\MSVC2017\x64\lib\cpp\thrift.vcxproj]
  C:\projects\thrift\lib\cpp\src\thrift/transport/TSocketUtils.h(99): note: No constructor could take the source type, or constructor overload resolution was ambiguous (compiling source file C:\projects\thrift\lib\cpp\src\thrift\transport\TServerSocket.cpp)
C:\projects\thrift\lib\cpp\src\thrift/transport/TSocketUtils.h(122): error C2065: 'EAI_SYSTEM': undeclared identifier (compiling source file C:\projects\thrift\lib\cpp\src\thrift\transport\TServerSocket.cpp) [C:\projects\build\MSVC2017\x64\lib\cpp\thrift.vcxproj]

That's what I got here, with the relevant code:

        virtual std::string message(int code) const override {
            return THRIFT_GAI_STRERROR(code);
        }
#  ifdef _WIN32_WCE
#    define THRIFT_GAI_STRERROR(...) thrift_wstr2str(gai_strerrorW(__VA_ARGS__))
#  else
#    define THRIFT_GAI_STRERROR gai_strerrorA
#  endif

Jeez, what a mess.

@ulidtko
Copy link
Contributor Author

ulidtko commented May 22, 2020

Unhandled SIMPLE-ERROR in thread #<SB-THREAD:THREAD "main thread" RUNNING
                                    {10005205B3}>:
  Couldn't execute "/root/.cache/common-lisp/sbcl-1.5.3-linux-x64/thrift/src/lib/cl/externals/software/clon-1.0b24/termio/sbcl/a.out": Text file busy

What... This Travis failure looks like a fluke. I'll just retry.

Weird that my EAI_SYSTEM commit fixes both windows issues; AppVeyor should've continued to fail — but it suddently turned green.

@emmenlau
Copy link
Member

What... This Travis failure looks like a fluke. I'll just retry.

The whole build system has way more false positives than it should. If you can spot a problem and fix it along the way, more than welcome. In any case, re-try is often a good idea :-)

@ulidtko ulidtko force-pushed the jira5186-getaddrinfo-rewrite branch from 6925094 to 3d4ae28 Compare May 22, 2020 08:39
@ulidtko
Copy link
Contributor Author

ulidtko commented May 22, 2020

The whole build system has way more false positives than it should. If you can spot a problem and fix it along the way, more than welcome. In any case, re-try is often a good idea :-)

Agreed, but well, racing in build systems and test runners is very very common, in my experience. And quite often, catching those races is orders of magnitude harder that just retrying :)

@ulidtko
Copy link
Contributor Author

ulidtko commented May 22, 2020

Yay! All CI's green ✔️

Please review & merge.

@emmenlau
Copy link
Member

Awesome!

@emmenlau
Copy link
Member

Now we have to find someone that can review this monster :-) :-) :-)

@Jens-G
Copy link
Member

Jens-G commented May 26, 2020

Now we have to find someone that can review this monster :-) :-) :-)

Recommend to ask on the mailing list. That's really a big one.

@ulidtko ulidtko force-pushed the jira5186-getaddrinfo-rewrite branch from 3d4ae28 to 6b8e935 Compare May 28, 2020 09:53
@ulidtko
Copy link
Contributor Author

ulidtko commented May 28, 2020

+474, -265 isn't really that big for C++ (which is known to be quite verbose).

And most of this diff wouldn't exist, had the classes not been choke-full of copypaste.

The main logical change is pretty simple, and I describe it thoroughly in the commit message: the retry loop of failed bind() is just made to do a little bit more. Sure, there's a lot of second-order detail; but the logic of main change is very simple in fact.

@ulidtko
Copy link
Contributor Author

ulidtko commented May 28, 2020

@Jens-G you mean the thrift-dev mail list?.. Honestly, I don't see it in a healthy state.

Look at the last month: the archive is just a dump of robotic notifications from Jira and GitHub. Among 3 pages of mails, just 2 threads are humans talking. That's way too low SNR, certainly lower than on GitHub.

So what's the point? We're already in pretty good place to discuss the patch: here, in this PR on GitHub.

@ulidtko ulidtko changed the title THRIFT-5186: Rewrite address resolution in T{Nonblocking,}ServerSocket THRIFT-5186: Rewrite address resolution in T{Nonblocking,}ServerSocket [REVIEW WANTED] May 28, 2020
@ulidtko ulidtko requested a review from Jens-G May 28, 2020 10:20
@Jens-G
Copy link
Member

Jens-G commented May 28, 2020

So what's the point?

I only made a proposal. You can do whatever you find helpful.

@Jens-G Jens-G removed their request for review May 28, 2020 13:55
@ulidtko ulidtko force-pushed the jira5186-getaddrinfo-rewrite branch from 6b8e935 to 6572215 Compare May 29, 2020 09:14
@ulidtko
Copy link
Contributor Author

ulidtko commented May 29, 2020

Ok, @emmenlau has a point. It's weird, but I've removed the copyright line.

Also in the last rebase:

  • add entry about this patch to CHANGES.md;
  • clang-format the code I'm adding.

@ulidtko ulidtko force-pushed the jira5186-getaddrinfo-rewrite branch from 6572215 to f97e8f9 Compare June 9, 2020 03:11
@ulidtko
Copy link
Contributor Author

ulidtko commented Jun 9, 2020

Rebased once again. The unrelated commit dropped (Boost unit-test warnings, see #2164).

@ulidtko
Copy link
Contributor Author

ulidtko commented Jun 9, 2020

Travis error was transient, during docker build in Autotools (Ubuntu Xenial) job:

Err:1 https://deb.nodesource.com/node_10.x xenial/main amd64 nodejs amd64 10.21.0-1nodesource1

  Connection timed out after 120000 milliseconds

E: Failed to fetch https://deb.nodesource.com/node_10.x/pool/main/n/nodejs/nodejs_10.21.0-1nodesource1_amd64.deb  Connection timed out after 120000 milliseconds

This caused two other jobs to cancel; Return code: -1 (negative values indicate kill by signal).

I'm just going to kick it again with no-change restart.

@ulidtko ulidtko force-pushed the jira5186-getaddrinfo-rewrite branch from f97e8f9 to eb14595 Compare June 9, 2020 09:37
@ulidtko
Copy link
Contributor Author

ulidtko commented Jun 9, 2020

Same thing again.

curl: (7) Failed to connect to downloads.sourceforge.net port 443: Connection timed out

Travis is doing way more uncached docker builds than it should.

Client: cpp
Patch: Max Ulidtko

My previous patch (9b9567b) has exposed
an issue shared by TServerSocket and TNonblockingServerSocket:
the results of getaddrinfo() call aren't used "in full" -- just a single
address is picked, and then bind() is retried in a loop on that single address.

This leads to poor results if we start varying the network conditions.
Like I show in the Jira issue, this is normal:

    [root@04dd07b70038 /]# ping -6 localhost
    ping: connect: Cannot assign requested address

-- same with `ping -6 $(hostname)` -- a hostname may resolve to IPv6 address
which we might not be able to connect to; and vice-versa, connecting to
IPv4 addresses may fail on IPv6-only systems.

The solution is to iterate over what getaddrinfo() returns while retrying
failed bind(). This is what this patch implements. Server-side analogue
of this behavior of curl:

    > curl -v localhost:8001/index.do
    * Trying ::1:8001...
    * connect to ::1 port 8001 failed: Connection refused
    * Trying 127.0.0.1:8001...
    * Connected to localhost (127.0.0.1) port 8001 (#0)
    > GET /index.do HTTP/1.1
    [...]

To achieve that, I throw away the TGetAddrInfoWrapper, and roll a proper one.

The bind-retrying loop had to be restructured: since sa_family will vary
from retry to retry, the socket() call has to be within the loop.

All the setsockopt() business factored away into side-methods: ~hundred
repetitive #ifdef-rich lines in the middle of important loop do get in the way.

This patch has quite a lot of code movement; git config diff.colormoved zebra
is strongly advised to the reader.

Tested manually on Linux, by running CMake's `make test` in 3 Docker environments:
 1) loopback-only (--net=none)
 2) classic IPv4-only bridge
 3) dual IPv6-IPv4 bridge
The only failing test was 88 - PythonTestSSLSocket, which currently also
fails on master for me (due to NULL cipher being unexpectedly accepted).
@ulidtko ulidtko force-pushed the jira5186-getaddrinfo-rewrite branch from eb14595 to edc8533 Compare June 9, 2020 12:08
@emmenlau
Copy link
Member

emmenlau commented Jun 9, 2020

Travis is doing way more uncached docker builds than it should.

Sorry that you are bit by this. Its a known problem, I've brought it up on the mailing list recently. Sadly, there is no solution yet. For me, sometimes 100% of builds failed, so please don't get too hung up on this. And sorry again!

@RandyAbernethy
Copy link
Contributor

With a diff that big I could have easily missed something but sometimes code needs an overhaul. Code changes looked clean to me. If CI is passing: +1

@Jens-G Jens-G closed this in dabfea2 Jun 9, 2020
@ulidtko
Copy link
Contributor Author

ulidtko commented Jun 10, 2020

Huge thanks everyone! 🎆

@ulidtko ulidtko deleted the jira5186-getaddrinfo-rewrite branch June 10, 2020 07:55
@emmenlau
Copy link
Member

Awesome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants