3.7 streaming transactions: ArangoError: cluster internal HTTP connection broken #699

Closed
mikestaub opened this issue Nov 15, 2020 · 28 comments
Labels
ArangoDB Issue concerns ArangoDB, not the driver.

Comments

@mikestaub

This branch was working on 3.6.3, but is now failing on 3.7.3

https://github.com/mikestaub/arangojs/pull/1/files

Would it be possible to include a default Oasis cluster in the integration tests so these types of regressions could be caught earlier?
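
For reference, the failing pattern is a streaming transaction. A minimal sketch with arangojs 7 is shown below; the endpoint, database, and collection names are placeholders, not taken from the branch above.

import { Database } from "arangojs";

// Hypothetical endpoint and collection, for illustration only.
const db = new Database({ url: "http://localhost:8529", databaseName: "test" });

async function saveInTransaction() {
  const users = db.collection("users");
  // beginTransaction() starts a streaming transaction on one coordinator;
  // step() and commit() then reference the returned transaction id.
  const trx = await db.beginTransaction({ write: [users] });
  try {
    await trx.step(() => users.save({ name: "alice" }));
    await trx.commit();
  } catch (err) {
    await trx.abort();
    throw err;
  }
}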

@dothebart
Contributor

Hi,
If you make this a PR to arangojs, it will become part of the nightly tests that the CI runs.
Thanks for digging deeper into this.

@mikestaub
Author

@dothebart what 3 URIs should I use for the Oasis cluster in my PR?

@dothebart
Contributor

@pluma can you give a hint for this?

@pluma
Contributor

pluma commented Nov 17, 2020

@dothebart I've added support for passing multiple URLs with commas via ab866b0. Check the changes to CONTRIBUTING.md in particular.

Note that this will result in acquireHostList being called, which in my case returns IPv6 URLs that won't be deduplicated if you use an alias like localhost. This also means you can just append a single comma to your TEST_ARANGODB_URL to opt into cluster mode.

Cluster mode always enables round robin.
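
For reference, a sketch of what this looks like from the driver side (the coordinator URLs are placeholders); arangojs accepts an array of URLs plus a load-balancing strategy, and acquireHostList() asks the server for the full coordinator list:

import { Database } from "arangojs";

async function connectToCluster(): Promise<Database> {
  // Placeholder coordinator URLs; in the test setup these come from
  // TEST_ARANGODB_URL split on commas.
  const db = new Database({
    url: [
      "http://arangodb-coordinator1:8529",
      "http://arangodb-coordinator2:8529",
      "http://arangodb-coordinator3:8529",
    ],
    loadBalancingStrategy: "ROUND_ROBIN",
  });
  // Ask the cluster for its current coordinator endpoints; these may be IPv6
  // URLs that are not deduplicated against aliases such as localhost.
  await db.acquireHostList();
  return db;
}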

@fceller
Collaborator

fceller commented Nov 24, 2020

Hi @mikestaub ,

I hope you are doing well. Alan and Willi are working on extending the automatic testing to catch these issues more easily.

What is the current status? Is it blocking you from moving to 3.7 or did you work around it?

best Frank

@mikestaub
Author

@fceller this is blocking me from upgrading to 3.7 but it is not urgent as 3.6 is working well.

@dothebart
Contributor

Hm, running the tests with 3 coordinators doesn't reproduce this.
@mikestaub, can you share a bit more detail on the environment in which you're running into this?

@mikestaub
Author

@dothebart here is the docker-compose.yml file I am using:

version: "3"

services:

  nginx:
    image: nginx:1.17.9
    container_name: arangodb-proxy
    depends_on:
      - arangodb-coordinator1
      - arangodb-coordinator2
      - arangodb-coordinator3
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports:
      - 8529:80

  arangodb-coordinator1:
    restart: on-failure
    container_name: arangodb-coordinator1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator1:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator1:/var/lib/arangodb3

  arangodb-coordinator2:
    restart: on-failure
    container_name: arangodb-coordinator2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator2:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator2:/var/lib/arangodb3

  arangodb-coordinator3:
    restart: on-failure
    container_name: arangodb-coordinator3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator3:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator3:/var/lib/arangodb3

  arangodb-agency1:
    restart: on-failure
    container_name: arangodb-agency1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --foxx.queues false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    volumes:
      - arangodb-agency1:/var/lib/arangodb3

  arangodb-agency2:
    restart: on-failure
    container_name: arangodb-agency2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    depends_on:
      - arangodb-agency1
    volumes:
      - arangodb-agency2:/var/lib/arangodb3

  arangodb-agency3:
    restart: on-failure
    container_name: arangodb-agency3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency3:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    depends_on:
      - arangodb-agency1
    volumes:
      - arangodb-agency3:/var/lib/arangodb3

  arangodb-dbserver1:
    restart: on-failure
    container_name: arangodb-dbserver1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver1:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary1
    volumes:
      - arangodb-dbserver1:/var/lib/arangodb3

  arangodb-dbserver2:
    restart: on-failure
    container_name: arangodb-dbserver2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver2:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary2
    volumes:
      - arangodb-dbserver2:/var/lib/arangodb3

  arangodb-dbserver3:
    restart: on-failure
    container_name: arangodb-dbserver3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver3:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary3
    volumes:
      - arangodb-dbserver3:/var/lib/arangodb3

volumes:
  arangodb-agency1:
  arangodb-agency2:
  arangodb-agency3:
  arangodb-dbserver1:
  arangodb-dbserver2:
  arangodb-dbserver3:
  arangodb-coordinator1:
  arangodb-coordinator2:
  arangodb-coordinator3:

@dothebart
Contributor

This seems to be missing the nginx config file?

@mikestaub
Author

This is the nginx.conf:

worker_processes 1;

events {
  worker_connections 1024;
}

http {
  upstream arangodb-servers {
    server arangodb-coordinator1:8529;
    server arangodb-coordinator2:8529;
    server arangodb-coordinator3:8529;
  }

  server {
    listen 80;
    location / {
      proxy_pass http://arangodb-servers;
    }
  }
}

@pluma added the "ArangoDB Issue concerns ArangoDB, not the driver." label on Dec 2, 2020
@dothebart
Contributor

Ok,
@ajanikow was able to narrow down the actual cause.
nginx in HTTP proxy mode attempts to "fix" the empty PUT requests that are used for cursors.
When the coordinators later need to forward the request, they fail to do so correctly.

The immediate fix is to configure nginx as a TCP proxy instead of an HTTP proxy by swapping the http section to:

stream  {
  upstream arangodb-servers {
    server arangodb-coordinator1:8529;
    server arangodb-coordinator2:8529;
    server arangodb-coordinator3:8529;
  }
  server {
    listen 80;
    proxy_pass arangodb-servers;
  }
}

We will dig deeper into the real reason later.

@mikestaub
Author

@dothebart any updates on the root cause? I am seeing these errors in Oasis on v3.7.5, even though I assume Envoy uses TCP, not HTTP.

@dothebart
Contributor

Hi,
sorry, I have been busy with release QA.
Hm, @ajanikow assured me that Oasis shouldn't have these issues?

@ajanikow

Hello!

The "cluster internal HTTP connection broken" problem in Oasis may be related to something else. Only TCP forwarding is used at all levels there, so an issue caused by an invalid body should not occur.

Can you create an Oasis issue? Then we will be able to look at your deployment (we will check the internal cause).

Best Regards,
Adam.

@mikestaub
Author

@ajanikow I think the long-term solution is to provide a way to run the Oasis cluster locally, so I can run my integration tests against it and be confident it will work once deployed. Either with a docker-compose file or k8s Helm charts.

@dothebart
Contributor

Ok, the current situation is that ArangoDB forwards all [most] HTTP headers it gets from one coordinator to the coordinator that owns the cursor.
In your setup this includes connection: close. That starts an unwanted chain of reactions which, in the current devel branch, no longer leads to the actual error we see further down the road.

However, not forwarding the connection header in the first place (since the cluster should use connection keep-alive for performance reasons) fixes this problem, without fixing the very last point where the error occurs.

This bugfix is going to be part of the upcoming 3.7.6 release.

@dothebart added this to the ArangoDB 3.7.6 milestone on Dec 23, 2020
@mikestaub
Author

@dothebart great, thanks for tracking this down. In the meantime, can I manually remove that header from the requests being sent by arangojs? Do you have an ETA for when 3.7.6 will be available on Oasis?

@dothebart
Contributor

dothebart commented Dec 24, 2020

At least in your test case this header is added by the nginx proxy; as @ajanikow pointed out, using it in TCP mode also keeps the situation from appearing.
The problem is the PUT request without a body, which makes nginx fall back to HTTP/1.0, and that implies connection: close.
As @ajanikow also said, Oasis should have no nginx in HTTP mode, so if there are more problems there, they aren't the same as the docker-compose ones.

@mikestaub
Author

I think it might also be a timing issue in the arangojs task queue. After adding this to my Database config, the issue disappeared:

agentOptions: {
  keepAlive: true,
  keepAliveMsecs: 50000,
  maxSockets: 1, // TODO: remove this
},
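
A sketch of where this sits in the full connection config (the endpoint and database name are placeholders); agentOptions is passed through to the underlying Node.js http/https Agent:

import { Database } from "arangojs";

const db = new Database({
  url: "https://example.arangodb.cloud:8529", // placeholder Oasis endpoint
  databaseName: "mydb",                       // placeholder database
  agentOptions: {
    keepAlive: true,
    keepAliveMsecs: 50000,
    maxSockets: 1, // workaround: serialize requests over a single connection
  },
});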

@dothebart
Contributor

Hi, Happy New Year ;)
It's still unclear to me how and why you should be hit by this. Can we have some more details on the overall environment?
How and from where do you connect to the Oasis cluster? Is this your local workstation, and is there maybe a transparent proxy in the way (company network or telco provider)? What does the TCP traceroute look like? Does the instance live near you?

If that is the case, is the issue reproducible when you use a cloud VM near your Oasis cluster?

@mikestaub
Author

Happy new year!

The error was still appearing in my local env (not Oasis), even with TCP routing enabled. You should be able to reproduce it with that docker-compose file. I assume that docker-compose file is a good approximation of an Oasis setup. I saw the same errors when running on Lambda connecting to Oasis.

I actually think it may be a bug in arangojs, as it might be firing the HTTP requests in the wrong order (committing the transaction before it was created).

@mikestaub
Author

I just confirmed that this issue is still present in 3.7.7.

@dothebart
Contributor

We can probably close this as a duplicate of #702 (comment), since that issue now contains a more precise description and test cases of the actual behaviour. WDYT?

@pluma
Contributor

pluma commented Mar 8, 2021

arangojs 7.3.0 changes behavior related to maxSockets, so you may want to try with that version in either case.

@mikestaub
Author

I still have to set maxSockets=1 with arangojs@7.5.0 and arangodb@3.7.7

@dothebart
Contributor

Hi,
since all of our changes to improve this haven't fully resolved the situation yet, can you please open a Jira issue, ideally with a code sample reproducing this?
Please add a reference to this GitHub issue as well.

@pluma
Contributor

pluma commented Oct 20, 2021

I'm closing this due to inactivity. Please follow the directions provided above if the problem still persists.

@pluma closed this as completed on Oct 20, 2021