
Worker (.net version) is notoriously flaky (workaround included) #138

Closed
BretFisher opened this issue Jul 23, 2019 · 4 comments


BretFisher commented Jul 23, 2019

Description

There's something in the .NET version of the worker app that causes it to randomly fail to connect to the db or redis services on startup, which results in a crash. Over the last year I've had thousands of students use this app to learn Docker and Swarm, and one of the most common issues is this worker failing on startup. It happens on all modern Docker versions, across platforms, and there is no common pattern to why it fails.

Today in testing, we killed the broken service, which had been re-creating its failing task over and over; once the service was recreated it worked, with no changes to how we created it. See the log below for typical behavior. The stack trace says a lookup can't be resolved, but doesn't show which hostname, so I can't tell what it thinks the problem is. From all the cases I've seen and the testing I've done, it's not related to other services being down or to general network issues.

This also happens for Kubernetes, as seen by other issues reported in this repo.

Hundreds of people have reported this problem to me, and deploying the Java version of the worker fixes it.

Workaround

Deploy the Java version of the worker, which I have built here: bretfisher/examplevotingapp_worker:java
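
In a compose/stack file, the workaround amounts to pointing the worker service at that image (the `worker` service name follows this repo's compose files; adjust if yours differs):

```yaml
services:
  worker:
    image: bretfisher/examplevotingapp_worker:java
```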

Steps to reproduce the issue, if relevant:
In swarm:

  1. Deploy all voting app services except the worker. Ensure they work as they should.
  2. Deploy the .NET version of the worker, using the examples in this repo's README.
  3. Some deploys will randomly fail with a "no such device or address" stack trace.
  4. The service will create a new task, which may or may not work.

Describe the results you received:

Notice below that we created a worker service with two replicas: one replica works, while the other fails, then gets re-created on the same node and works the second time. Whether a replica fails, and which one, is random.

➜  vote git:(master) ✗ docker service logs 413rw4tamzd6
vote_worker.2.itk0wfpt22gh@node3    | Waiting for db
vote_worker.2.itk0wfpt22gh@node3    | Connected to db
vote_worker.2.itk0wfpt22gh@node3    | Found redis at 10.0.4.7
vote_worker.2.itk0wfpt22gh@node3    | Connecting to redis
vote_worker.2.itk0wfpt22gh@node3    | Processing vote for 'a' by 'fb54d895d481b473'
vote_worker.2.tv9i1viknli2@node3    | System.AggregateException: One or more errors occurred. (No such device or address) ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException: No such device or address
vote_worker.2.tv9i1viknli2@node3    |    at System.Net.Dns.InternalGetHostByName(String hostName, Boolean includeIPv6)
vote_worker.2.tv9i1viknli2@node3    |    at System.Net.Dns.ResolveCallback(Object context)
vote_worker.2.tv9i1viknli2@node3    | --- End of stack trace from previous location where exception was thrown ---
vote_worker.2.tv9i1viknli2@node3    |    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
vote_worker.2.tv9i1viknli2@node3    |    at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
vote_worker.2.tv9i1viknli2@node3    |    at System.Net.Dns.EndGetHostAddresses(IAsyncResult asyncResult)
vote_worker.2.tv9i1viknli2@node3    |    at System.Net.Dns.<>c.<GetHostAddressesAsync>b__25_1(IAsyncResult asyncResult)
vote_worker.2.tv9i1viknli2@node3    |    at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
vote_worker.2.tv9i1viknli2@node3    |    --- End of inner exception stack trace ---
vote_worker.2.tv9i1viknli2@node3    |    at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
vote_worker.2.tv9i1viknli2@node3    |    at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
vote_worker.2.tv9i1viknli2@node3    |    at Npgsql.NpgsqlConnector.RawOpen(NpgsqlTimeout timeout)
vote_worker.2.tv9i1viknli2@node3    |    at Npgsql.NpgsqlConnector.Open(NpgsqlTimeout timeout)
vote_worker.2.tv9i1viknli2@node3    |    at Npgsql.ConnectorPool.Allocate(NpgsqlConnection conn, NpgsqlTimeout timeout)
vote_worker.2.tv9i1viknli2@node3    |    at Npgsql.NpgsqlConnection.OpenInternal()
vote_worker.2.tv9i1viknli2@node3    |    at Worker.Program.OpenDbConnection(String connectionString) in /code/src/Worker/Program.cs:line 78
vote_worker.2.tv9i1viknli2@node3    |    at Worker.Program.Main(String[] args) in /code/src/Worker/Program.cs:line 19
vote_worker.2.tv9i1viknli2@node3    | ---> (Inner Exception #0) System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (0x00000005): No such device or address
vote_worker.2.tv9i1viknli2@node3    |    at System.Net.Dns.InternalGetHostByName(String hostName, Boolean includeIPv6)
vote_worker.2.tv9i1viknli2@node3    |    at System.Net.Dns.ResolveCallback(Object context)
vote_worker.2.tv9i1viknli2@node3    | --- End of stack trace from previous location where exception was thrown ---
vote_worker.2.tv9i1viknli2@node3    |    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
vote_worker.2.tv9i1viknli2@node3    |    at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
vote_worker.2.tv9i1viknli2@node3    |    at System.Net.Dns.EndGetHostAddresses(IAsyncResult asyncResult)
vote_worker.2.tv9i1viknli2@node3    |    at System.Net.Dns.<>c.<GetHostAddressesAsync>b__25_1(IAsyncResult asyncResult)
vote_worker.2.tv9i1viknli2@node3    |    at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)<---
vote_worker.2.tv9i1viknli2@node3    |
vote_worker.1.hjt41aigtbwq@node2    | Waiting for db
vote_worker.1.hjt41aigtbwq@node2    | Waiting for db
vote_worker.1.hjt41aigtbwq@node2    | Waiting for db
vote_worker.1.hjt41aigtbwq@node2    | Waiting for db
vote_worker.1.hjt41aigtbwq@node2    | Waiting for db
vote_worker.1.hjt41aigtbwq@node2    | Connected to db
vote_worker.1.hjt41aigtbwq@node2    | Found redis at 10.0.4.7
vote_worker.1.hjt41aigtbwq@node2    | Connecting to redis
vote_worker.1.hjt41aigtbwq@node2    | Processing vote for 'b' by 'fb54d895d481b473'
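
The failure mode in this log is a one-shot DNS lookup that throws before the worker's "Waiting for db" retry loop can help: the exception escapes `Main` and the task dies. The general defensive pattern is to retry the resolution itself. A minimal sketch in Python (the actual worker is C#, and `resolve_with_retry` is a hypothetical helper, not code from this repo):

```python
import socket
import time

def resolve_with_retry(hostname, attempts=5, delay=1.0,
                       resolve=socket.gethostbyname):
    """Retry DNS resolution so a transient failure at startup
    doesn't crash the whole process."""
    for attempt in range(1, attempts + 1):
        try:
            return resolve(hostname)
        except OSError as exc:
            if attempt == attempts:
                raise  # out of retries; surface the real error
            print(f"Lookup of {hostname!r} failed ({exc}); retrying")
            time.sleep(delay)
```

The same idea applies to the initial db and redis connections: treat any startup-time network error as transient and retry with a delay, rather than letting the first failure abort the process.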

Describe the results you expected:

Worker always works :)

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

root@node3:~# docker version
Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:44:24 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

root@node3:~# docker info
Containers: 20
 Running: 0
 Paused: 0
 Stopped: 20
Images: 38
Server Version: 18.09.5
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: 1zyd50hmid7dado3s02tago7y
 Is Manager: true
 ClusterID: spa40crzsqgtn4pix7hj4sac1
 Managers: 3
 Nodes: 3
 Default Address Pool: 10.0.0.0/8
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 134.209.46.216
 Manager Addresses:
  134.209.46.216:2377
  165.227.220.117:2377
  68.183.159.208:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-154-generic
Operating System: Ubuntu 16.04.6 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 992.1MiB
Name: node3
ID: UA2M:37YB:EPX6:BKDU:WMNA:3F4A:GVE6:X3GD:Q3UK:YKYF:6J2H:QQNC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Additional environment details (AWS, Docker for Mac, Docker for Windows, VirtualBox, physical, etc.):

This has happened on Docker Desktop, Digital Ocean, Docker Toolbox on VirtualBox, and more.


logoff commented Aug 22, 2019

Duplicate of #136 (but adding much more info and a workaround)

I'm doing Bret's course (120% useful 😉) and I can confirm the same behaviour with current Docker CE version 19.03.1:

docker version:

Client: Docker Engine - Community
 Version:           19.03.1
 API version:       1.40
 Go version:        go1.12.5
 Git commit:        74b1e89
 Built:             Thu Jul 25 21:21:05 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.1
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.5
  Git commit:       74b1e89e8a
  Built:            Thu Jul 25 21:27:55 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

docker info:

Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 19.03.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: vpe4ire3d489jbp4pqwrpcxmf
  Is Manager: true
  ClusterID: gzt5kws74f8vu9oneudbhlqnf
  Managers: 1
  Nodes: 3
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 192.168.99.100
  Manager Addresses:
   192.168.99.100:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.134-boot2docker
 Operating System: Boot2Docker 19.03.1 (TCL 10.1)
 OSType: linux
 Architecture: x86_64
 CPUs: 1
 Total Memory: 989.5MiB
 Name: node1
 ID: 25IG:VCNG:ICH2:K4K5:OQMM:H5HP:O25K:6UD6:TIPU:F27K:FJOV:TTBU
 Docker Root Dir: /mnt/sda1/var/lib/docker
 Debug Mode: false
 Username: logoff
 Registry: https://index.docker.io/v1/
 Labels:
  provider=virtualbox
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

@BretFisher
Member Author

One issue that has caused this is the lack of a correct depends_on entry for the worker in the compose files. I've created PR #141 to resolve that, which should ensure that both the redis and db services (and their DNS entries) are created before the .NET worker starts.
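
As a sketch, the compose change amounts to (service names as used in this repo's files):

```yaml
services:
  worker:
    depends_on:
      - redis
      - db
```

Note that depends_on only orders startup for docker-compose; `docker stack deploy` on swarm ignores it, so in-app retry logic still matters there.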

@misterjtc

I must have an older version of your repo, because the worker is listed before the db, and I had this issue. Creating the db service before the worker fixed it for me!

@mikesir87
Member

I'm thinking this issue should be good to go now. Thoughts @BretFisher ?
