TLS connections aborting #40

Closed
snej opened this issue Mar 27, 2020 · 12 comments

Comments

@snej
Contributor

snej commented Mar 27, 2020

Reported by victorberger on the forum:

All of this was found on macOS, building Lite-C from source (dev branch) with CMake, and is easily reproducible:

Sometimes, but not always, a given TLS replication is interrupted partway through and displays the following error. In these cases, the replication never finishes.
Other times, the same replication with exactly the same parameters finishes completely. The DB being transferred is not large and system resources are not being strained.

14:43:27.625480| [DB]: {DB#106}==> litecore::SQLiteDataFile /Users/X/X.cblite2/db.sqlite3 @0x7fd77018b260
14:43:27.625514| [DB]: {DB#106} Opening database
14:43:27.629679| [Sync]: {Inserter#107}==> litecore::repl::Inserter ->wss://X:443/X-qa/_blipsync @0x7fd76dc70908
14:43:27.629706| [Sync]: {Inserter#107} Inserted  10 revs in   4.37ms ( 2287/sec) of which 99.7% was commit
TLS: >>> BIO returning MBEDTLS_ERR_SSL_WANT_READ
TLS: >>> mbedtls_socket returning EWOULDBLOCK
14:43:27.692882| [WS]: {BuiltInWebSocket#108}==> litecore::websocket::BuiltInWebSocket wss://X:443/X-qa/_blipsync @0x7fd76dc6a1a0
14:43:27.692905| [WS] WARNING: {BuiltInWebSocket#108} Unexpected or unclean socket disconnect! (reason=errno, code=0)
14:43:27.692958| [Sync]: {Repl#98} Connection closed with WebSocket status 1006: "" (state=2)
14:43:27.693015| [Sync] ERROR: {Repl#98} Got LiteCore error: WebSocket error 1006 "connection closed abnormally"
14:43:27.713996| [Sync]: {Inserter#107} Inserted   8 revs in   1.63ms ( 4906/sec) of which 99.3% was commit
14:43:27.726504| [Sync] ERROR: {C4Replicator#99} State: busy, progress=72.00%, error=WebSocket error 1006 "connection closed abnormally"

If I try a TLS replication with any replication options set, however, replication does not start and the error is different. Here are examples using username/password authentication and replication of a specific channel. All of the endpoints used are known to be valid.
Username/password:

14:45:03.302449| [Sync]: {Repl#127} Pull=continuous, Options={{auth:{password:"********", username:"X"}}}
14:45:03.302557| [Sync]: {C4Replicator#128}==> c4Internal::C4RemoteReplicator 0x7fd76fa1f6f0 @0x7fd76fa1f6f0
14:45:03.302574| [Sync]: {C4Replicator#128} Starting Replicator {Repl#127}
14:45:03.306847| [Sync]: {Repl#127} Scanning for pre-existing conflicts...
14:45:03.306888| [Sync]: {Repl#127} Found 0 conflicted docs in 0.004 sec
14:45:03.307861| [Sync]: {Repl#127} No local checkpoint 'cp-nEFf3hlcC8OkYXWTw6aTInjD0BQ='
14:45:03.904799| [WS]: {BuiltInWebSocket#129}==> litecore::websocket::BuiltInWebSocket wss://X:443/X/_blipsync @0x7fd76fa1faa0
14:45:03.904830| [WS] WARNING: {BuiltInWebSocket#129} Unexpected or unclean socket disconnect! (reason=WebSocket status, code=401)
14:45:03.904922| [Sync]: {Repl#127} Connection closed with WebSocket status 401: "(unknown HTTP status)" (state=1)
14:45:03.905021| [Sync] ERROR: {Repl#127} Got LiteCore error: WebSocket error 401 "(unknown HTTP status)"
14:45:03.905049| [Sync] ERROR: {C4Replicator#128} State: connecting, progress=0.00%, error=WebSocket error 401 "(unknown HTTP status)"

Specific channel:

14:47:03.952163| [Sync]: {Repl#137} Pull=continuous, Options={{channels:["X"]}}
14:47:03.952250| [Sync]: {C4Replicator#138}==> c4Internal::C4RemoteReplicator 0x7fd76dcbdff0 @0x7fd76dcbdff0
14:47:03.952264| [Sync]: {C4Replicator#138} Starting Replicator {Repl#137}
14:47:03.952473| [Sync]: {Repl#137} Scanning for pre-existing conflicts...
14:47:03.956809| [Sync]: {Repl#137} Found 0 conflicted docs in 0.004 sec
14:47:03.957219| [Sync]: {Repl#137} No local checkpoint 'cp-GMTjAMBpyDuxd0rl7szz4LWh0vw='
14:47:04.505299| [WS]: {BuiltInWebSocket#139}==> litecore::websocket::BuiltInWebSocket wss://X:443/X/_blipsync @0x7fd76dcd4720
14:47:04.505324| [WS] WARNING: {BuiltInWebSocket#139} Unexpected or unclean socket disconnect! (reason=WebSocket status, code=401)
14:47:04.505369| [Sync]: {Repl#137} Connection closed with WebSocket status 401: "(unknown HTTP status)" (state=1)
14:47:04.505419| [Sync] ERROR: {Repl#137} Got LiteCore error: WebSocket error 401 "(unknown HTTP status)"
14:47:04.505464| [Sync] ERROR: {C4Replicator#138} State: connecting, progress=0.00%, error=WebSocket error 401 "(unknown HTTP status)"
@snej
Contributor Author

snej commented Mar 27, 2020

@borrrden commented:

The username/password ones are just an HTTP 401, which means the server considers the username/password combination it received to be invalid.
The message should not be "(unknown HTTP status)", though.
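
For illustration, the kind of status-to-message lookup that would give 401 a readable description can be sketched as below. This is only a sketch: `httpReasonPhrase` is a hypothetical helper, not LiteCore's actual function, and the table covers just a few codes that can plausibly come back from a WebSocket handshake.

```cpp
#include <string>

// Hypothetical helper (not LiteCore's API): map an HTTP status code
// to its standard reason phrase. Only a handful of codes are listed.
std::string httpReasonPhrase(int status) {
    switch (status) {
        case 101: return "Switching Protocols";
        case 400: return "Bad Request";
        case 401: return "Unauthorized";
        case 403: return "Forbidden";
        case 404: return "Not Found";
        case 500: return "Internal Server Error";
        case 503: return "Service Unavailable";
        // Fallback for anything not in the table above.
        default:  return "(unknown HTTP status)";
    }
}
```

With such a table, the 401 in the logs above would read "Unauthorized" rather than falling through to the fallback string.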

@snej
Contributor Author

snej commented Mar 27, 2020

... and having filed this here, I realize it's actually a LiteCore issue. Oh well, let's leave it here for now since that's where it manifests.

@victorbergeronsemi
Contributor

Thank you @snej.
Before commenting on the issue of TLS connections aborting (which I'll try to isolate further and then report back), I wanted to clarify the second issue (HTTP 401 status when trying to authenticate with login/password).
To isolate this problem I'm hosting a CB server + Sync Gateway running the following config:

{
  "log": ["*"],
  "databases": {
    "wss-pull": {
      "server": "http://localhost:8091",
      "bucket": "wss-pull",
      "username": "sync_gateway",
      "password": "sync_gateway",
      "enable_shared_bucket_access": true,
      "import_docs": "continuous",
      "users": {
        "GUEST": { "disabled": true, "admin_channels": ["*"] },
        "sync_gateway" : {"password":"sync_gateway", "admin_channels": ["*"] }
      },
      "num_index_replicas": 0
    }
  }
}

and have the following simple driver code (it makes no difference if I use the C++ wrapper):

#include <chrono>    // std::chrono::milliseconds, used in the polling loop
#include <iostream>
#include <thread>
#include "fleece/Fleece.hh"
#include "fleece/Mutable.hh"
#include "CouchbaseLite.hh"

int main() {
    CBLError error;
    CBLDatabaseConfiguration configDB = {"", kCBLDatabase_Create};
    CBLDatabase* db = CBLDatabase_Open("wss-pull", &configDB, &error);
    CBLReplicatorConfiguration config = {};
    CBLReplicator *repl = nullptr;
    config.database = db;
    config.replicatorType = kCBLReplicatorTypePull;
    config.endpoint = CBLEndpoint_NewWithURL("ws://localhost:4984/wss-pull");
    config.authenticator = CBLAuth_NewBasic("sync_gateway", "sync_gateway");
    repl = CBLReplicator_New(&config, &error);
    CBLReplicator_Start(repl);
    CBLReplicatorStatus status;
    while ((status = CBLReplicator_Status(repl)).activity != kCBLReplicatorStopped) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    std::cerr << "Finished with activity=" << status.activity
            << ", error=(" << status.error.domain << "/" << status.error.code << ")\n";
    return 0;
}

Running the above gives me

15:52:39.544054| [WS]: {BuiltInWebSocket#7}==> litecore::websocket::BuiltInWebSocket ws://localhost:4984/wss-pull/_blipsync @0x7fe5be4050b0
15:52:39.544077| [WS] WARNING: {BuiltInWebSocket#7} Unexpected or unclean socket disconnect! (reason=WebSocket status, code=401)
15:52:39.544122| [Sync]: {Repl#5} Connection closed with WebSocket status 401: "(unknown HTTP status)" (state=1)
15:52:39.544211| [Sync] ERROR: {Repl#5} Got LiteCore error: WebSocket error 401 "(unknown HTTP status)"
15:52:39.544236| [Sync] ERROR: {C4Replicator#6} State: connecting, progress=0.00%, error=WebSocket error 401 "(unknown HTTP status)"
15:52:39.544255| [Sync] WARNING: No listener to receive error from CBLReplicator 0x7fe5be404b90: WebSocket error 401 "(unknown HTTP status)"
15:52:39.544272| [Sync]: {Repl#5} now stopped
15:52:39.544300| [Sync]: BLIP sent 0 msgs (0 bytes), rcvd 0 msgs (0 bytes) in 0.002 sec. Max outbox depth was 0, avg nan
15:52:39.544555| [DB]: {DB#4} Closing database

and the Sync Gateway terminal shows

2020-03-27T15:52:39.543-07:00 [INF] HTTP:  #011: GET /wss-pull/_blipsync
2020-03-27T15:52:39.543-07:00 [INF] HTTP: #011:     --> 401 Login required  (0.3 ms)

Could you please tell me if you can spot anything wrong with the code above? I believe the server, Sync Gateway, and credentials are set up correctly: I can replicate and read the DB contents with username/password using a similar driver program built against an older version of the LiteCore API. Also, if I remove the authentication requirements from the SG sync function and remove the username/password from the C++ code, it works fine.

Thanks again.

@snej
Contributor Author

snej commented Mar 30, 2020

I've fixed the HTTP auth issue on the dev branch; it was a pretty trivial bug.

@snej
Contributor Author

snej commented Mar 30, 2020

Note to self: I've been lazy and only testing with SG on localhost. Need to set it up elsewhere and test with that.

@victorbergeronsemi
Contributor

victorbergeronsemi commented Mar 30, 2020

Thank you very much. I'll give it a try and report back. I'll come back with more info about the TLS connections and the compiling troubles I had on Windows.

Also, something could be wrong with the LiteCore submodule tagged on the dev branch: commit 0b1681492eda4378305daeaf94853bbe389b52e4 is showing as a 404 for me.

Edit: commit no longer showing as 404.

snej added a commit to couchbase/couchbase-lite-core that referenced this issue Mar 30, 2020
snej added a commit that referenced this issue Mar 30, 2020
Updated LiteCore to pick up the fix.
Fixes #40
@snej
Contributor Author

snej commented Mar 30, 2020

The bug showed up immediately once I set up SG on an iMac in the next room and ran the TLS unit tests. It was a LiteCore issue: the BuiltInWebSocket class was misinterpreting a zero-byte read as an error, when it just means TLS wasn't able to decode any bytes from the ciphertext it had just read. After I fixed it to ignore the 0-byte read, everything appears to work fine.

Fix is on the dev branch, if you'd like to try it out.
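
As a rough illustration of the distinction described above, a read result can be classified so that a zero-byte read is treated as "nothing decoded yet, try again" rather than as a failure. This is a sketch under stated assumptions, not the actual LiteCore change: `ReadOutcome`, `classifyRead`, and the `wouldBlock` / `peerClosed` flags are hypothetical names standing in for whatever the socket layer reports.

```cpp
// Hypothetical classification of a single read from a TLS stream.
enum class ReadOutcome { Data, Retry, Closed, Error };

// n          : plaintext bytes returned by the read (negative on failure)
// wouldBlock : the failure only meant "no data available yet"
//              (e.g. EWOULDBLOCK / MBEDTLS_ERR_SSL_WANT_READ in the logs above)
// peerClosed : the TLS layer reported an orderly close (assumed to be
//              signalled separately from the byte count)
ReadOutcome classifyRead(long n, bool wouldBlock, bool peerClosed) {
    if (n > 0)      return ReadOutcome::Data;    // got plaintext
    if (peerClosed) return ReadOutcome::Closed;  // clean shutdown
    if (n == 0)     return ReadOutcome::Retry;   // ciphertext read, no plaintext yet
    if (wouldBlock) return ReadOutcome::Retry;   // wait for the socket, then retry
    return ReadOutcome::Error;                   // genuine failure
}
```

The point of the sketch is only the third branch: a read that yields zero plaintext bytes without a close notification is not an error and should not tear down the WebSocket.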

@victorbergeronsemi
Contributor

Thank you very much @snej. Just tried out the dev branch and everything worked well, no problems at all for me on Mac anymore.

Is there still active development going on for LiteC for Windows? I'm still having the issues on the Windows side that I mentioned on the original thread on the CB forum. These are also all related to DB replication / TLS connections (the last of them was an issue with certificates).

To me, the issues on Windows also look like small issues with some newer socket/network code. I've tested both the dev branch and the fix/ci_windows_etc branch, which seems to have several fixes but hasn't been touched since December (there's also an older PR).

@borrrden
Member

borrrden commented Apr 1, 2020

Development is not abandoned, but is subject to constraints on time as it is not an officially supported platform (yet), so supported platforms take priority. I know that's a lame answer, so sorry about that.

@victorbergeronsemi
Contributor

Thank you for the reply @borrrden @snej.
Great news: I've been able to solve all of the problems I was still facing, even on Windows.

I did this by:

  1. Integrating the changes made to the branch fix/ci_windows_etc into the current dev branch.
  2. Fixing this bug with the TLS certificate code on Windows which I'm quite certain exists. The suggestion that was made later in that thread:

the read_system_root_certs method is concatenating a bunch of DER together, and mbedTLS can’t parse that. It can parse concatenated PEM (ascii) certs, but only one DER.

was correct. I found a (slightly hacky) way to convert each certificate from DER to PEM format before concatenating to the string stream, and that fixed the problem. All worked fine from then on.
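
A DER-to-PEM conversion of the kind described here can be sketched as follows. This is not the code from the eventual PR, just a minimal illustration: `base64Encode` and `derToPem` are hypothetical helper names, and the sketch assumes each certificate is available as a byte vector. PEM is simply the base64 of the DER bytes, wrapped at 64 characters and framed by BEGIN/END lines, so the resulting strings can be concatenated safely.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Base64-encode raw bytes (standard alphabet, '=' padding).
std::string base64Encode(const std::vector<uint8_t>& in) {
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve((in.size() + 2) / 3 * 4);
    std::size_t i = 0;
    // Full 3-byte groups become 4 output characters.
    for (; i + 2 < in.size(); i += 3) {
        uint32_t v = (uint32_t(in[i]) << 16) | (uint32_t(in[i + 1]) << 8) | in[i + 2];
        out += kAlphabet[(v >> 18) & 63];
        out += kAlphabet[(v >> 12) & 63];
        out += kAlphabet[(v >> 6) & 63];
        out += kAlphabet[v & 63];
    }
    // Trailing 1 or 2 bytes are padded with '='.
    std::size_t rem = in.size() - i;
    if (rem == 1) {
        uint32_t v = uint32_t(in[i]) << 16;
        out += kAlphabet[(v >> 18) & 63];
        out += kAlphabet[(v >> 12) & 63];
        out += "==";
    } else if (rem == 2) {
        uint32_t v = (uint32_t(in[i]) << 16) | (uint32_t(in[i + 1]) << 8);
        out += kAlphabet[(v >> 18) & 63];
        out += kAlphabet[(v >> 12) & 63];
        out += kAlphabet[(v >> 6) & 63];
        out += '=';
    }
    return out;
}

// Wrap one DER-encoded certificate as a PEM block (64-character lines).
std::string derToPem(const std::vector<uint8_t>& der) {
    const std::string b64 = base64Encode(der);
    std::string pem = "-----BEGIN CERTIFICATE-----\n";
    for (std::size_t pos = 0; pos < b64.size(); pos += 64) {
        pem += b64.substr(pos, 64);
        pem += '\n';
    }
    pem += "-----END CERTIFICATE-----\n";
    return pem;
}
```

Appending the result of `derToPem` for each system root certificate to one string yields a concatenated PEM bundle, which is the form the quoted comment says mbedTLS can parse.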

I certainly understand the development constraints. Do you have a timeline for finishing/integrating this into dev/master? I'd be glad to make a PR for this since I have the fixes - but I'm not quite sure what the best way of fixing the issue with the certificate format is.

Thanks again!
Victor Berger

@snej
Contributor Author

snej commented Apr 3, 2020

Thanks, Victor! I'd love to get a PR with that fix, even if it's only a temporary workaround.

@victorbergeronsemi
Contributor

Thanks @snej @borrrden.

The first step for this is ready: I found a clean solution for the certificate format issue and opened a PR on the sockpp repository.

3 participants