3.7 streaming transactions: ArangoError: cluster internal HTTP connection broken #699

Closed
mikestaub opened this issue Nov 15, 2020 · 28 comments
Labels
ArangoDB Issue concerns ArangoDB, not the driver.

Comments

@mikestaub

This branch was working on 3.6.3, but is now failing on 3.7.3

https://github.com/mikestaub/arangojs/pull/1/files

Would it be possible to include a default Oasis cluster in the integration tests so these types of regressions could be caught earlier?
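
For reference, the failing pattern is a streaming transaction. A minimal sketch with arangojs 7 is shown below; the endpoint, database, and collection names are placeholders, not taken from the branch above.

import { Database } from "arangojs";

// Hypothetical endpoint and collection, for illustration only.
const db = new Database({ url: "http://localhost:8529", databaseName: "test" });

async function saveInTransaction() {
  const users = db.collection("users");
  // beginTransaction() starts a streaming transaction on one coordinator;
  // step() and commit() then reference the returned transaction id.
  const trx = await db.beginTransaction({ write: [users] });
  try {
    await trx.step(() => users.save({ name: "alice" }));
    await trx.commit();
  } catch (err) {
    await trx.abort();
    throw err;
  }
}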

@dothebart
Contributor

Hi,
If you make this a PR to arangojs, it will become part of the nightly tests that the CI runs.
Thanks for digging deeper into this.

@mikestaub
Author

@dothebart what 3 URIs should I use for the Oasis cluster in my PR?

@dothebart
Contributor

@pluma can you give a hint for this?

@pluma
Contributor

pluma commented Nov 17, 2020

@dothebart I've added support for passing multiple URLs with commas via ab866b0. Check the changes to CONTRIBUTING.md in particular.

Note that this will result in acquireHostList being called, which in my case returns IPv6 URLs that won't be deduplicated if you use an alias like localhost. This also means you can just append a single comma to your TEST_ARANGODB_URL to opt into cluster mode.

Cluster mode always enables round robin.
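
For reference, a sketch of what this looks like from the driver side (the coordinator URLs are placeholders); arangojs accepts an array of URLs plus a load-balancing strategy, and acquireHostList() asks the server for the full coordinator list:

import { Database } from "arangojs";

async function connectToCluster(): Promise<Database> {
  // Placeholder coordinator URLs; in the test setup these come from
  // TEST_ARANGODB_URL split on commas.
  const db = new Database({
    url: [
      "http://arangodb-coordinator1:8529",
      "http://arangodb-coordinator2:8529",
      "http://arangodb-coordinator3:8529",
    ],
    loadBalancingStrategy: "ROUND_ROBIN",
  });
  // Ask the cluster for its current coordinator endpoints; these may be IPv6
  // URLs that are not deduplicated against aliases such as localhost.
  await db.acquireHostList();
  return db;
}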

@fceller
Collaborator

fceller commented Nov 24, 2020

Hi @mikestaub ,

I hope you are doing well. Alan and Willi are working on extending the automatic testing to catch these issues more easily.

What is the current status? Is it blocking you from moving to 3.7 or did you work around it?

best Frank

@mikestaub
Author

@fceller this is blocking me from upgrading to 3.7 but it is not urgent as 3.6 is working well.

@dothebart
Contributor

Hm, running the tests with 3 coordinators doesn't reproduce this.
@mikestaub, can you share a bit more detail on the environment in which you're running into this?

@mikestaub
Author

@dothebart here is the docker-compose.yml file I am using:

version: "3"

services:

  nginx:
    image: nginx:1.17.9
    container_name: arangodb-proxy
    depends_on:
      - arangodb-coordinator1
      - arangodb-coordinator2
      - arangodb-coordinator3
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports:
      - 8529:80

  arangodb-coordinator1:
    restart: on-failure
    container_name: arangodb-coordinator1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator1:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator1:/var/lib/arangodb3

  arangodb-coordinator2:
    restart: on-failure
    container_name: arangodb-coordinator2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator2:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator2:/var/lib/arangodb3

  arangodb-coordinator3:
    restart: on-failure
    container_name: arangodb-coordinator3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.max-number-of-shards 1
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-coordinator3:8529
      --cluster.my-role COORDINATOR
    volumes:
      - arangodb-coordinator3:/var/lib/arangodb3

  arangodb-agency1:
    restart: on-failure
    container_name: arangodb-agency1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --foxx.queues false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    volumes:
      - arangodb-agency1:/var/lib/arangodb3

  arangodb-agency2:
    restart: on-failure
    container_name: arangodb-agency2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    depends_on:
      - arangodb-agency1
    volumes:
      - arangodb-agency2:/var/lib/arangodb3

  arangodb-agency3:
    restart: on-failure
    container_name: arangodb-agency3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics false
      --agency.size 3
      --agency.supervision true
      --agency.activate true
      --agency.my-address tcp://arangodb-agency3:8529
      --agency.endpoint tcp://arangodb-agency1:8529
      --agency.endpoint tcp://arangodb-agency2:8529
      --agency.endpoint tcp://arangodb-agency3:8529
    depends_on:
      - arangodb-agency1
    volumes:
      - arangodb-agency3:/var/lib/arangodb3

  arangodb-dbserver1:
    restart: on-failure
    container_name: arangodb-dbserver1
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver1:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary1
    volumes:
      - arangodb-dbserver1:/var/lib/arangodb3

  arangodb-dbserver2:
    restart: on-failure
    container_name: arangodb-dbserver2
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver2:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary2
    volumes:
      - arangodb-dbserver2:/var/lib/arangodb3

  arangodb-dbserver3:
    restart: on-failure
    container_name: arangodb-dbserver3
    image: arangodb/arangodb:3.7.3
    environment:
      - ARANGO_NO_AUTH=1
    command: >
      --server.endpoint tcp://0.0.0.0:8529
      --server.authentication false
      --server.statistics true
      --cluster.min-replication-factor 3
      --cluster.max-replication-factor 3
      --cluster.force-one-shard true
      --cluster.agency-endpoint tcp://arangodb-agency1:8529
      --cluster.agency-endpoint tcp://arangodb-agency2:8529
      --cluster.agency-endpoint tcp://arangodb-agency3:8529
      --cluster.my-address tcp://arangodb-dbserver3:8529
      --cluster.my-role PRIMARY
      --database.directory /var/lib/arangodb3/primary3
    volumes:
      - arangodb-dbserver3:/var/lib/arangodb3

volumes:
  arangodb-agency1:
  arangodb-agency2:
  arangodb-agency3:
  arangodb-dbserver1:
  arangodb-dbserver2:
  arangodb-dbserver3:
  arangodb-coordinator1:
  arangodb-coordinator2:
  arangodb-coordinator3:

@dothebart
Contributor

This seems to be missing the nginx config file?

@mikestaub
Author

This is the nginx.conf:

worker_processes 1;

events {
  worker_connections 1024;
}

http {
  upstream arangodb-servers {
    server arangodb-coordinator1:8529;
    server arangodb-coordinator2:8529;
    server arangodb-coordinator3:8529;
  }

  server {
    listen 80;
    location / {
      proxy_pass http://arangodb-servers;
    }
  }
}

@pluma added the "ArangoDB Issue concerns ArangoDB, not the driver." label on Dec 2, 2020
@dothebart
Contributor

Ok,
@ajanikow was able to narrow down the actual cause.
nginx in HTTP proxy mode attempts to "fix" the empty PUT requests that are used for cursors.
When the coordinators later need to forward the request, they fail to do so correctly.

The immediate fix is to configure nginx as a TCP proxy instead of an HTTP proxy by swapping the http section to:

stream  {
  upstream arangodb-servers {
    server arangodb-coordinator1:8529;
    server arangodb-coordinator2:8529;
    server arangodb-coordinator3:8529;
  }
  server {
    listen 80;
    proxy_pass arangodb-servers;
  }
}

We will dig deeper into the real reason later.

@mikestaub
Author

@dothebart any updates on the root cause? I am seeing these errors in Oasis on v3.7.5, even though I assume Envoy uses TCP, not HTTP.

@dothebart
Contributor

Hi,
sorry, I have been busy with release QA.
Hm, @ajanikow assured me that Oasis shouldn't have these issues?

@ajanikow

Hello!

The "cluster internal HTTP connection broken" problem in Oasis may be related to something else. Only TCP forwarding is used at all levels there, so an issue caused by an invalid body should not occur.

Can you create an Oasis issue? Then we will be able to look at your deployment (we will check the internal cause).

Best Regards,
Adam.

@mikestaub
Author

@ajanikow I think the long-term solution is to provide a way to run the Oasis cluster locally, so I can run my integration tests against it and be confident it will work once deployed. Either with a docker-compose file or k8s Helm charts.

@dothebart
Contributor

Ok, the current situation is that ArangoDB forwards all [most] HTTP headers it gets from one coordinator to the coordinator that owns the cursor.
In your setup this includes connection: close. That starts an unwanted chain of reactions which, in the current devel branch, no longer leads to the actual error we see further down the road.

However, not forwarding the connection header in the first place (since the cluster should use connection keep-alive for performance reasons) fixes this problem, without fixing the very last point where the error occurs.

This bugfix is going to be part of the upcoming 3.7.6 release.

@dothebart added this to the ArangoDB 3.7.6 milestone on Dec 23, 2020
@mikestaub
Author

@dothebart great, thanks for tracking this down. In the meantime, can I manually remove that header from the requests being sent by arangojs? Do you have an ETA for when 3.7.6 will be available on Oasis?

@dothebart
Contributor

dothebart commented Dec 24, 2020

At least in your test case this header is added by the nginx proxy; as @ajanikow pointed out, using it in TCP mode also keeps the situation from appearing.
The problem is the PUT request without a body, which makes nginx fall back to HTTP/1.0, and that implies connection: close.
As @ajanikow also said, Oasis should have no nginx in HTTP mode, so if there are more problems there, they aren't the same as the docker-compose ones.

@mikestaub
Author

I think it might also be a timing issue in the arangojs task queue. After adding this to my Database config, the issue disappeared:

agentOptions: {
  keepAlive: true,
  keepAliveMsecs: 50000,
  maxSockets: 1, // TODO: remove this
},
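
A sketch of where this sits in the full connection config (the endpoint and database name are placeholders); agentOptions is passed through to the underlying Node.js http/https Agent:

import { Database } from "arangojs";

const db = new Database({
  url: "https://example.arangodb.cloud:8529", // placeholder Oasis endpoint
  databaseName: "mydb",                       // placeholder database
  agentOptions: {
    keepAlive: true,
    keepAliveMsecs: 50000,
    maxSockets: 1, // workaround: serialize requests over a single connection
  },
});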

@dothebart
Contributor

Hi, Happy New Year ;)
It's still unclear to me how and why you should be hit by this. Can we have some more details on the overall environment?
How and from where do you connect to the Oasis cluster? Is this your local workstation, and is there maybe a transparent proxy in the way (company network or telco provider)? What does the TCP traceroute look like? Does the instance live near you?

If that is the case, is the issue reproducible when you use a cloud VM near your Oasis cluster?

@mikestaub
Author

Happy new year!

The error was still appearing in my local env (not Oasis), even with TCP routing enabled. You should be able to reproduce it with that docker-compose file. I assume that docker-compose file is a good approximation of an Oasis setup. I saw the same errors when running on Lambda connecting to Oasis.

I actually think it may be a bug in arangojs, as it might be firing the HTTP requests in the wrong order (committing the transaction before it was created).

@mikestaub
Author

I just confirmed that this issue is still present in 3.7.7.

@dothebart
Contributor

We can probably close this as a duplicate of #702 (comment), since that issue now contains a more precise description and test cases of the actual behaviour. WDYT?

@pluma
Contributor

pluma commented Mar 8, 2021

arangojs 7.3.0 changes behavior related to maxSockets, so you may want to try with that version in either case.

@mikestaub
Author

I still have to set maxSockets=1 with arangojs@7.5.0 and arangodb@3.7.7

@dothebart
Contributor

Hi,
since all of our changes to improve this haven't fully resolved the situation yet, can you please open a Jira issue, ideally with a code sample reproducing this?
Please add a reference to this GitHub issue as well.

@pluma
Contributor

pluma commented Oct 20, 2021

I'm closing this due to inactivity. Please follow the directions provided above if the problem still persists.

@pluma closed this as completed on Oct 20, 2021