
Client connection using AdoNetClustering fails on Docker container #5158

Closed
fduman opened this issue Nov 2, 2018 · 23 comments

@fduman
Contributor

fduman commented Nov 2, 2018

  • Used Orleans 2.1.0.
  • Running on .NET Core 2.1 in a Docker Linux container.

orleans.log
(frontend is the Orleans client app, backend is the silo, and the DB is PostgreSQL)

Using the StaticClustering configuration works perfectly.

I saw a similar issue at microsoft/service-fabric-issues#1182
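
For reference, here is a minimal sketch of the two client clustering configurations being compared. The cluster/service ids, connection string, endpoint, and the Npgsql invariant are placeholders and assumptions, not values taken from the attached log:

```csharp
using System.Net;
using Orleans;
using Orleans.Configuration;
using Orleans.Hosting;

// ADO.NET (PostgreSQL) clustering -- the configuration that fails here.
var adoNetClient = new ClientBuilder()
    .Configure<ClusterOptions>(o => { o.ClusterId = "dev"; o.ServiceId = "frontend"; })
    .UseAdoNetClustering(o =>
    {
        o.Invariant = "Npgsql";                              // PostgreSQL ADO.NET provider (assumption)
        o.ConnectionString = "<postgres connection string>"; // placeholder
    })
    .Build();

// Static clustering -- the configuration originally reported as working.
var staticClient = new ClientBuilder()
    .Configure<ClusterOptions>(o => { o.ClusterId = "dev"; o.ServiceId = "frontend"; })
    .UseStaticClustering(new IPEndPoint(IPAddress.Parse("192.168.32.3"), 30000))
    .Build();
```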

@sergeybykov
Contributor

Does the frontend have network access to the silo gateway endpoint 192.168.32.3:30000? It complains that it couldn't connect to the endpoint:

Connection to 192.168.32.3:30000 could not be established in 00:00:05

@sergeybykov sergeybykov added this to the Triage milestone Nov 2, 2018
@fduman
Contributor Author

fduman commented Nov 2, 2018

The frontend has access to the backend; they are on the same Docker network. It is interesting that the backend silo says “Remote socket closed while receiving connection preamble data from endpoint 192.168.32.4:34992” for the frontend connection.

@benjaminpetit
Member

Can you post your client and silos configurations?

@fduman
Contributor Author

fduman commented Nov 21, 2018

@benjaminpetit I prepared a sample for the issue. You can execute "docker-compose up" in the sample's root folder, then hit http://[Your docker machine IP]:8081/api/values

The web application can't connect to the Silo for the first time.

My Docker Version is 18.06.1-ce, build e68fc7a

I use the default Docker machine ISO on my Mac.

OrleansDockerTest.zip

@fduman
Contributor Author

fduman commented Nov 21, 2018

I said there was no problem with static clustering configuration, but I was wrong.

@fduman
Contributor Author

fduman commented Dec 9, 2018

@benjaminpetit Is there anything I can do for the issue?

@sergeybykov
Contributor

@fduman @benjaminpetit is on leave until early February. I'm not sure how much we can help you with running on Mac. Asking in https://gitter.im/dotnet/orleans might be a more expedient way to get help with that.

@Gandii

Gandii commented Dec 14, 2018

@fduman I had the same issue with an almost identical setup while playing around with the Orleans Docker sample, and realized that the silo was registering its address as 127.0.0.1 in the membership table instead of its Docker network IP.

Setting

.Configure<EndpointOptions>(options => options.AdvertisedIPAddress = ip)

fixes it; then use any method to get the silo's/container's IP, or even set a fixed IP for the container:

```csharp
// Option 1: parse a fixed container IP
IPAddress.TryParse("172.20.0.4", out var ip);

// Option 2: resolve the container's own IPv4 address instead
var name = Dns.GetHostName(); // container id
ip = Dns.GetHostEntry(name).AddressList
    .FirstOrDefault(x => x.AddressFamily == AddressFamily.InterNetwork);
```
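
For context, here is a minimal sketch of how that option could be wired into a silo host. It assumes the Orleans 2.x SiloHostBuilder; the ports and the DNS-based lookup are illustrative and not taken from this issue:

```csharp
using System.Linq;
using System.Net;
using System.Net.Sockets;
using Orleans.Configuration;
using Orleans.Hosting;

// Resolve the container's own IPv4 address and advertise it to the cluster,
// so the membership table records the Docker network IP instead of 127.0.0.1.
var containerIp = Dns.GetHostEntry(Dns.GetHostName()).AddressList
    .FirstOrDefault(a => a.AddressFamily == AddressFamily.InterNetwork);

var silo = new SiloHostBuilder()
    .Configure<EndpointOptions>(options =>
    {
        options.AdvertisedIPAddress = containerIp; // Docker network IP, not 127.0.0.1
        options.SiloPort = 11111;                  // default silo-to-silo port
        options.GatewayPort = 30000;               // default client gateway port
    })
    // ... clustering provider, application parts, logging, etc. as in your setup
    .Build();
```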

@tcz717

tcz717 commented Dec 14, 2018

> @fduman I had the same issue with an almost identical setup while playing around with the Orleans Docker sample, and realized that the silo was registering its address as 127.0.0.1 in the membership table instead of its Docker network IP.
>
> Setting `.Configure<EndpointOptions>(options => options.AdvertisedIPAddress = ip)` fixes it; then use any method to get the silo's/container's IP, or even set a fixed IP for the container.

I had the same issue, but I was running directly on Linux without Docker. Even so, my client cannot connect to the silo when the client runs on Linux; however, it worked when the client ran on Windows.

@benjaminpetit
Member

@tcz717 firewall issue maybe?

@sergeybykov
Contributor

@fduman Has this been resolved?

@sergeybykov
Contributor

Closing due to inactivity. Feel free to reopen if needed.

@tcz717

tcz717 commented Apr 22, 2019

This issue still exists after upgrading to 2.3.0. Here are the context and logs for my situation.

I deploy my services using Docker and docker-compose; silo is the server and gateway is the client. The docker-compose.yml file is:

version: '3'

services:
  gateway:
    image: gateway
    ports:
     - "20170:20170"
    depends_on:
     - "silo"
    build:
      context: .
      dockerfile: ./SimCivil.Gate/Dockerfile

  silo:
    image: silo
    ports:
     - "30000:30000"
    build:
      context: .
      dockerfile: ./SimCivil.Orleans.Server/Dockerfile

Then, after startup, the server and the client each log errors reporting a connection timeout. The server log is:

silo_1 Application started. Press Ctrl+C to shut down.
silo_1 Hosting environment: Development
silo_1 Content root path: /silo/
silo_1 warn: Orleans.Runtime.Messaging.IncomingMessageAcceptor[101308]
silo_1       Exception receiving connection preamble data from endpoint 172.27.0.3:38149
silo_1 System.Net.Sockets.SocketException (110): Connection timed out
silo_1    at Orleans.Runtime.Messaging.IncomingMessageAcceptor.ReadFromSocket(Socket sock, Int32 expected)
silo_1 warn: Orleans.Runtime.Messaging.IncomingMessageAcceptor[101308]
silo_1       Exception receiving connection preamble data from endpoint 172.27.0.3:42243
silo_1 System.Net.Sockets.SocketException (110): Connection timed out
silo_1    at Orleans.Runtime.Messaging.IncomingMessageAcceptor.ReadFromSocket(Socket sock, Int32 expected)

The client log is

info: Orleans.OutsideRuntimeClient[100313]
       ---------- Initializing OutsideRuntimeClient on 1de835d99218 at 172.27.0.3 Client Id = *cli/ec1d3faa ----------
info: Orleans.OutsideRuntimeClient[100314]
       ---------- Starting OutsideRuntimeClient with runtime Version='2.3.0. Commit Hash: 790b81bf52dbb8b6a0391771b72bd3763230a27e (Release).' in AppDomain=<AppDomain.Id=1, AppDomain.FriendlyName=SimCivil.Gate>
fail: Orleans.OutsideRuntimeClient[101313]
       Could not find any gateway in global::Orleans.Clustering.DynamoDB.DynamoDBGatewayListProvider. Orleans client cannot initialize.
crit: SimCivil.Gate.Gate[0]
       Silo connecting fails
info: Orleans.OutsideRuntimeClient[100313]
       ---------- Initializing OutsideRuntimeClient on 1de835d99218 at 172.27.0.3 Client Id = *cli/9ce78a91 ----------
info: Orleans.OutsideRuntimeClient[100314]
       ---------- Starting OutsideRuntimeClient with runtime Version='2.3.0. Commit Hash: 790b81bf52dbb8b6a0391771b72bd3763230a27e (Release).' in AppDomain=<AppDomain.Id=1, AppDomain.FriendlyName=SimCivil.Gate>
info: Orleans.Messaging.GatewayManager[101309]
       Found 1 knownGateways from Gateway listProvider [gwy.tcp://172.27.0.2:30000/293529013]
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.27.0.2:30000/293529013[100178]
       Unable to connect to gateway at address gwy.tcp://172.27.0.2:30000/293529013 on trial 0 (Exception: Connection to 172.27.0.2:30000 could not be established in 00:00:10)
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.27.0.2:30000/293529013[100913]
       Marking gateway at address gwy.tcp://172.27.0.2:30000/293529013 as Disconnected
info: Orleans.Messaging.GatewayManager[101309]
       Refreshed the live Gateway list. Found 1 gateways from Gateway listProvider: [gwy.tcp://172.27.0.2:30000/293529013]. Picked only known live out of them. Now has 1 live Gateways: [gwy.tcp://172.27.0.2:30000/293529013]. Previous refresh time was = 04/21/2019 07:50:30
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.27.0.2:30000/293529013[100178]
       Unable to connect to gateway at address gwy.tcp://172.27.0.2:30000/293529013 on trial 1 (Exception: Connection to 172.27.0.2:30000 could not be established in 00:00:10)
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.27.0.2:30000/293529013[100913]
       Marking gateway at address gwy.tcp://172.27.0.2:30000/293529013 as Disconnected
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.27.0.2:30000/293529013[100912]
       Marking gateway at address gwy.tcp://172.27.0.2:30000/293529013 as Dead in my client local gateway list.
warn: Orleans.Messaging.GatewayManager[101321]
       All gateways have previously been marked as dead. Clearing the list of dead gateways to expedite reconnection.
info: Orleans.Messaging.GatewayManager[101309]
       Refreshed the live Gateway list. Found 1 gateways from Gateway listProvider: [gwy.tcp://172.27.0.2:30000/293529013]. Picked only known live out of them. Now has 1 live Gateways: [gwy.tcp://172.27.0.2:30000/293529013]. Previous refresh time was = 04/21/2019 07:50:41
 
 Unhandled Exception: Orleans.Runtime.OrleansMessageRejectionException: Target silo S172.27.0.2:30000:293529013 is unavailable
    at Orleans.OrleansTaskExtentions.<ToTypedTask>g__ConvertAsync|4_0[T](Task`1 asyncTask)
    at Orleans.OutsideRuntimeClient.<>c__DisplayClass56_0.<<StartInternal>b__2>d.MoveNext()
 --- End of stack trace from previous location where exception was thrown ---
    at Orleans.OutsideRuntimeClient.<StartInternal>g__ExecuteWithRetries|56_3(Func`1 task, Func`2 shouldRetry)
    at Orleans.OutsideRuntimeClient.StartInternal(Func`2 retryFilter)
    at Orleans.OutsideRuntimeClient.Start(Func`2 retryFilter)
    at Orleans.ClusterClient.Connect(Func`2 retryFilter)
    at SimCivil.Gate.Gate.Main(String[] args) in /SimCivil.Gate/Gate.cs:line 87
    at SimCivil.Gate.Gate.<Main>(String[] args)

I used tcpdump to capture the packets on the server side: https://drive.google.com/open?id=12PmVHXVjWmiCPLiAj-FiV4sDmKOsZ791 According to the capture, the server closed the connection right after the three-way handshake. You can also check my Dockerfiles here: https://github.com/tcz717/SimCivil/blob/feature/inception/SimCivil.Orleans.Server/Dockerfile https://github.com/tcz717/SimCivil/blob/feature/inception/SimCivil.Gate/Dockerfile

@benjaminpetit
Member

@tcz717 Here you are mapping the silo gateway port (30000) onto your host. This is not a good idea; Orleans communications should be isolated from external traffic.

What clustering provider are you using? How do you configure your client and your silos?

@tcz717

tcz717 commented Apr 23, 2019

@benjaminpetit You are right, but I only opened this port for debugging. As for clustering, I first used the local development clustering and now I am using DynamoDB. Neither of them worked, even though both run normally in a non-Docker environment, so I suppose this issue is not related to the clustering provider.

@tcz717

tcz717 commented Apr 23, 2019

One interesting thing is that when I compile and build the client using Mono, the silo server and the client can communicate in Docker without changing any other configuration such as the Docker network or clustering.

@benjaminpetit
Member

Maybe something related to the IP published in the membership table?

@tcz717

tcz717 commented Apr 23, 2019

I checked the field in DynamoDB and the recorded address is the same as the container IP, because I set AdvertisedIPAddress to Dns.GetHostAddresses(Dns.GetHostName())[0].

@abhith

abhith commented May 5, 2019

Experiencing a similar issue.

Silo and client communication was working fine in Windows containers, but when we switched to Linux containers, the client was unable to communicate with the silos.

Our configuration:

XXX.Web/Dockerfile

---
EXPOSE 80
EXPOSE 443
---

XXX.SiloHost/Dockerfile

---
EXPOSE 8090
---

docker-compose.yml

version: '3.4'

services:
  XXX.web:
    image: ${DOCKER_REGISTRY-}XXXweb
    build:
      context: .
      dockerfile: XXX.Web/Dockerfile
    depends_on:
      - XXX.silo
  XXX.silo:
    image: ${DOCKER_REGISTRY-}XXXsilo
    build:
      context: .
      dockerfile: XXX.SiloHost/Dockerfile
  XXX.dashboard:
    image: ${DOCKER_REGISTRY-}XXXsilo
    ports:
      - "8090:8090"
    depends_on:
      - XXX.silo

docker-compose.override.yml

version: '3.4'

services:
  XXX.web:
    environment:
      - ASPNETCORE_ENVIRONMENT=Development
      - ASPNETCORE_URLS=https://+:443;http://+:80
      - ASPNETCORE_HTTPS_PORT=44305
    ports:
      - "50556:80"
      - "44305:443"
    volumes:
      - ${APPDATA}/ASP.NET/Https:/root/.aspnet/https:ro
      - ${APPDATA}/Microsoft/UserSecrets:/root/.microsoft/usersecrets:ro

Following are the containers created by `docker-compose up`:

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
54087c9a0a9a XXXweb "dotnet XXX.Web.dll" 20 minutes ago Up 11 minutes 0.0.0.0:50556->80/tcp, 0.0.0.0:44305->443/tcp
a8390b891c05 XXXsilo "dotnet XXX.SiloHo…" 20 minutes ago Up 11 minutes 0.0.0.0:8090->8090/tcp
3adbc04c12a0 XXXsilo "dotnet XXX.SiloHo…" 20 minutes ago Up 11 minutes 8090/tcp

We are using SQL Server for clustering, and the registered silo entries are:

DeploymentId Address Port Generation SiloName HostName Status ProxyPort SuspectTimes StartTime IAmAliveTime
dev 172.18.0.4 30065 294747788 Silo_46d67 a8390b891c05 3 20094 NULL 2019-05-05 10:23:08.320 2019-05-05 10:43:10.180
dev 172.18.0.2 30069 294747785 Silo_792b9 3adbc04c12a0 3 20019 NULL 2019-05-05 10:23:05.263 2019-05-05 10:43:10.673

And the network inspect result for the running containers,

docker network inspect XXX_default
[
    {
        "Name": "XXX_default",
        "Id": "d2f01f3a6998c95f8157aabfff5f6535378760e51d1572302d079bb33b9e912b",
        "Created": "2019-05-05T08:10:06.6194202Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "3adbc04c12a0185a67da8a21b476654785da4f022998db11cfc075abece360a8": {
                "Name": "XXX_XXX.silo_1",
                "EndpointID": "79ec02ac924acad618833968ff5026d48394e47cc099c55c66fb84fd4bf87d89",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            },
            "54087c9a0a9ab4c3756c1581f92ab5abad152f8aa8beefa0963b49ca382be4e5": {
                "Name": "XXX_XXX.web_1",
                "EndpointID": "0c3ad891c11e830f17e67ca1ae082c30db2e031ba204f671768a6405d96bfdee",
                "MacAddress": "02:42:ac:12:00:03",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            },
            "a8390b891c05a9939c500901044667169fa113e247b84f7bf3b5ed400357f4d4": {
                "Name": "XXX_XXX.dashboard_1",
                "EndpointID": "d9155862139875451cae487f7d26bb7399f282995a50e875108233e1d8147d95",
                "MacAddress": "02:42:ac:12:00:04",
                "IPv4Address": "172.18.0.4/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {
            "com.docker.compose.network": "default",
            "com.docker.compose.project": "XXX",
            "com.docker.compose.version": "1.23.2"
        }
    }
]

And the client log is,

info: Orleans.Messaging.GatewayManager[101309]
      Found 2 knownGateways from Gateway listProvider [gwy.tcp://172.18.0.2:20019/294747785, gwy.tcp://172.18.0.4:20094/294747788]
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.4:20094/294747788[100178]
      Unable to connect to gateway at address gwy.tcp://172.18.0.4:20094/294747788 on trial 0 (Exception: Connection to 172.18.0.4:20094 could not be established in 00:00:05)
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.4:20094/294747788[100913]
      Marking gateway at address gwy.tcp://172.18.0.4:20094/294747788 as Disconnected
info: Orleans.Messaging.GatewayManager[101309]
      Refreshed the live Gateway list. Found 2 gateways from Gateway listProvider: [gwy.tcp://172.18.0.2:20019/294747785, gwy.tcp://172.18.0.4:20094/294747788]. Picked only known live out of them. Now has 2 live Gateways: [gwy.tcp://172.18.0.2:20019/294747785, gwy.tcp://172.18.0.4:20094/294747788]. Previous refresh time was = 05/05/2019 10:24:02
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.4:20094/294747788[100178]
      Unable to connect to gateway at address gwy.tcp://172.18.0.4:20094/294747788 on trial 1 (Exception: Connection to 172.18.0.4:20094 could not be established in 00:00:05)
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.4:20094/294747788[100913]
      Marking gateway at address gwy.tcp://172.18.0.4:20094/294747788 as Disconnected
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.4:20094/294747788[100912]
      Marking gateway at address gwy.tcp://172.18.0.4:20094/294747788 as Dead in my client local gateway list.

And we can see two active silos in the dashboard.

@abhith

abhith commented May 6, 2019

I was able to solve the issue by changing the IClusterClient DI configuration code in the client Startup.

Previously it was:

        public void ConfigureServices(IServiceCollection services)
        {
           ...
            services.AddSingleton(CreateClusterClient);

            ...
        }

        private IClusterClient CreateClusterClient(IServiceProvider serviceProvider)
        {
            var client = new ClientBuilder()
               ... 
                .Build();

            StartClientWithRetries(client).Wait();
            return client;
        }

        private static async Task StartClientWithRetries(IClusterClient client)
        {
            for (var i = 0; i < 5; i++)
            {
                try
                {
                    await client.Connect();
                    return;
                }
                catch (Exception)
                {
                    // ignored
                }

                await Task.Delay(TimeSpan.FromSeconds(5));
            }
        }

After referring to orleans/Samples/2.0/docker-aspnet-core/API/Startup.cs, I changed CreateClusterClient to:

        private IClusterClient CreateClusterClient(IServiceProvider serviceProvider)
        {
            var log = serviceProvider.GetService<ILogger<Startup>>();

            var client = new ClientBuilder()
                ...
                .Build();

            client.Connect(RetryFilter).GetAwaiter().GetResult();
            return client;

            async Task<bool> RetryFilter(Exception exception)
            {
                log?.LogWarning("Exception while attempting to connect to Orleans cluster: {Exception}", exception);
                await Task.Delay(TimeSpan.FromSeconds(2));
                return true;
            }
        }

And it WORKED.

My client log is now,

info: Orleans.OutsideRuntimeClient[100313]
      ---------- Initializing OutsideRuntimeClient on 36fe3430f20c at 172.18.0.5 Client Id = *cli/ee1682a1 ----------
info: Orleans.OutsideRuntimeClient[100314]
      ---------- Starting OutsideRuntimeClient with runtime Version='2.3.1. Commit Hash: b577f8d9f29cc6c563106486c784b4c5c3a65217 (Release).' in AppDomain=<AppDomain.Id=1, AppDomain.FriendlyName=Empay.Web>
info: Orleans.Messaging.GatewayManager[101309]
      Found 2 knownGateways from Gateway listProvider [gwy.tcp://172.18.0.3:20092/294835088, gwy.tcp://172.18.0.4:20059/294835088]
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.4:20059/294835088[100178]
      Unable to connect to gateway at address gwy.tcp://172.18.0.4:20059/294835088 on trial 0 (Exception: Connection to 172.18.0.4:20059 could not be established in 00:00:05)
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.4:20059/294835088[100913]
      Marking gateway at address gwy.tcp://172.18.0.4:20059/294835088 as Disconnected
info: Orleans.Messaging.GatewayManager[101309]
      Refreshed the live Gateway list. Found 2 gateways from Gateway listProvider: [gwy.tcp://172.18.0.3:20092/294835088, gwy.tcp://172.18.0.4:20059/294835088]. Picked only known live out of them. Now has 2 live Gateways: [gwy.tcp://172.18.0.3:20092/294835088, gwy.tcp://172.18.0.4:20059/294835088]. Previous refresh time was = 05/06/2019 10:38:42
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.4:20059/294835088[100178]
      Unable to connect to gateway at address gwy.tcp://172.18.0.4:20059/294835088 on trial 1 (Exception: Connection to 172.18.0.4:20059 could not be established in 00:00:05)
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.4:20059/294835088[100913]
      Marking gateway at address gwy.tcp://172.18.0.4:20059/294835088 as Disconnected
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.4:20059/294835088[100912]
      Marking gateway at address gwy.tcp://172.18.0.4:20059/294835088 as Dead in my client local gateway list.
info: Orleans.Messaging.GatewayManager[101309]
      Refreshed the live Gateway list. Found 2 gateways from Gateway listProvider: [gwy.tcp://172.18.0.3:20092/294835088, gwy.tcp://172.18.0.4:20059/294835088]. Picked only known live out of them. Now has 1 live Gateways: [gwy.tcp://172.18.0.3:20092/294835088]. Previous refresh time was = 05/06/2019 10:38:48
warn: Empay.Web.Startup[0]
      Exception while attempting to connect to Orleans cluster: Orleans.Runtime.OrleansMessageRejectionException: Target silo S172.18.0.4:20059:294835088 is unavailable
         at Orleans.OutsideRuntimeClient.<>c__DisplayClass56_0.<<StartInternal>b__2>d.MoveNext()
      --- End of stack trace from previous location where exception was thrown ---
         at Orleans.OutsideRuntimeClient.<StartInternal>g__ExecuteWithRetries|56_3(Func`1 task, Func`2 shouldRetry)
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.3:20092/294835088[100178]
      Unable to connect to gateway at address gwy.tcp://172.18.0.3:20092/294835088 on trial 0 (Exception: Connection to 172.18.0.3:20092 could not be established in 00:00:05)
warn: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.3:20092/294835088[100913]
      Marking gateway at address gwy.tcp://172.18.0.3:20092/294835088 as Disconnected
info: Orleans.Messaging.GatewayManager[101309]
      Refreshed the live Gateway list. Found 2 gateways from Gateway listProvider: [gwy.tcp://172.18.0.3:20092/294835088, gwy.tcp://172.18.0.4:20059/294835088]. Picked only known live out of them. Now has 1 live Gateways: [gwy.tcp://172.18.0.3:20092/294835088]. Previous refresh time was = 05/06/2019 10:38:52
info: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.3:20092/294835088[100908]
      Connected to gateway at address gwy.tcp://172.18.0.3:20092/294835088 on trial 1.
info: Messaging.GatewayClientReceiver/gwy.tcp://172.18.0.3:20092/294835088[0]
      Starting AsyncAgent Messaging.GatewayClientReceiver/gwy.tcp://172.18.0.3:20092/294835088 on managed thread 5
info: Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.3:20092/294835088[0]
      Starting AsyncAgent Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.3:20092/294835088 on managed thread 5
info: Orleans.Threading.ThreadPoolThread[0]
      Starting thread Messaging.GatewayClientReceiver/gwy.tcp://172.18.0.3:20092/2948350880 on managed thread 22
info: Orleans.Threading.ThreadPoolThread[0]
      Starting thread Messaging.GatewayConnection/GatewayClientSender_gwy.tcp://172.18.0.3:20092/2948350880 on managed thread 23
info: Orleans.OutsideRuntimeClient[100929]
      ---------- Started OutsideRuntimeClient with Global Client ID: C172.18.0.5:0:-294835121*cli/ee1682a1@463f1bbb, client GUID ID:
*cli/ee1682a1
info: Orleans.ClientOptionsLogger[0]
      Configuration Orleans.Configuration.ClientMessagingOptions:
      ClientSenderBuckets: 8192
      PreferredFamily: InterNetwork
      NetworkInterfaceName:
      ResponseTimeout: 00:00:30
      ResponseTimeoutWithDebugger: 00:30:00
      MaxResendCount: 0
      ResendOnTimeout: False
      DropExpiredMessages: True
      BufferPoolBufferSize: 4096
      BufferPoolMaxSize: 10000
      BufferPoolPreallocationSize: 250
      PropagateActivityId: False
      LargeMessageWarningThreshold: 85000

info: Orleans.ClientOptionsLogger[0]
      Configuration Orleans.Configuration.ClusterOptions:
      ClusterId: dev
      ServiceId: EmpayServiceApp

info: Orleans.ClientOptionsLogger[0]
      Configuration Orleans.Configuration.NetworkingOptions:
      OpenConnectionTimeout: 00:00:05
      MaxSocketAge: 10675199.02:48:05.4775807
      MaxSockets: 500

info: Orleans.ClientOptionsLogger[0]
      Configuration Orleans.Configuration.StatisticsOptions:
      PerfCountersWriteInterval: 00:00:30
      LogWriteInterval: 00:05:00
      CollectionLevel: Info

@tcz717

tcz717 commented Sep 19, 2019

I found this issue still exists even on an Azure Ubuntu VM (no container; the server and client are on the same machine).

Microsoft.Orleans.Clustering.DynamoDB (≥ 2.4.2)
Microsoft.Orleans.OrleansAzureUtils (≥ 2.4.2)
Microsoft.Orleans.OrleansRuntime (≥ 2.4.2)
Microsoft.Orleans.OrleansTelemetryConsumers.AI (≥ 2.4.2)
Microsoft.Orleans.Server (≥ 2.4.2)
.NET Core SDK (reflecting any global.json):
 Version:   2.2.402
 Commit:    c7f2f96116

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  18.04
 OS Platform: Linux
 RID:         ubuntu.18.04-x64
 Base Path:   /home/tpdt/azagent/_work/_tool/dotnet/sdk/2.2.402/

Host (useful for support):
  Version: 2.2.7
  Commit:  b1e29ae826

.NET Core SDKs installed:
  2.2.402 [/home/tpdt/azagent/_work/_tool/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.2.7 [/home/tpdt/azagent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.2.7 [/home/tpdt/azagent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.2.7 [/home/tpdt/azagent/_work/_tool/dotnet/shared/Microsoft.NETCore.App]

@tcz717

tcz717 commented Sep 19, 2019

After researching the source code, I found a bug in SocketManager.Connect: https://github.com/dotnet/orleans/blob/2.4.3/src/Orleans.Core/Messaging/SocketManager.cs#L202

According to the documentation, ConnectAsync(SocketAsyncEventArgs) returns a bool:

true if the I/O operation is pending. The Completed event on the e parameter will be raised upon completion of the operation.

false if the I/O operation completed synchronously. In this case, the Completed event on the e parameter will not be raised and the e object passed as a parameter may be examined immediately after the method call returns to retrieve the result of the operation.

On a Linux host, the corefx source also says:

    // The asynchronous socket operations here generally do the following:
    // (1) If the operation queue is Ready (queue is empty), try to perform the operation immediately, non-blocking.
    // If this completes (i.e. does not return EWOULDBLOCK), then we return the results immediately
    // for both success (SocketError.Success) or failure.
    // No callback will happen; callers are expected to handle these synchronous completions themselves.

That means that if the operation completes synchronously, the Completed callback is never invoked, which leaves AutoResetEvent.WaitOne waiting until it times out. That's why the server did not receive the preamble and timed out here.

I see you plan to rewrite the network stack in 3.0.0; I don't know whether that will fix this bug. But for the 2.x versions, a possible fix is to change the Connect method https://github.com/dotnet/orleans/blob/2.4.3/src/Orleans.Core/Messaging/SocketManager.cs#L196-L209 like this:

        internal static void Connect(Socket s, IPEndPoint endPoint, TimeSpan connectionTimeout)
        {
            var signal = new AutoResetEvent(false);
            var e = new SocketAsyncEventArgs();
            e.RemoteEndPoint = endPoint;
            e.Completed += (sender, eventArgs) => signal.Set();
            bool pending = s.ConnectAsync(e);

            if (pending && !signal.WaitOne(connectionTimeout))
                throw new TimeoutException($"Connection to {endPoint} could not be established in {connectionTimeout}");

            if (e.SocketError != SocketError.Success || !s.Connected)
                throw new OrleansException($"Could not connect to {endPoint}: {e.SocketError}");
        }
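
To illustrate the behaviour this change guards against, here is a small standalone sketch (my own, not from the Orleans code base) showing that when ConnectAsync completes synchronously the Completed event is never raised, so an unconditional WaitOne would block until the timeout:

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;

// A local listener so the connect can succeed immediately.
var listener = new TcpListener(IPAddress.Loopback, 0);
listener.Start();
var endPoint = (IPEndPoint)listener.LocalEndpoint;

var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
var signal = new AutoResetEvent(false);
var e = new SocketAsyncEventArgs { RemoteEndPoint = endPoint };
e.Completed += (sender, args) => signal.Set();

// On Linux this can return false (completed synchronously); in that case the
// Completed event is never raised, so waiting on the signal unconditionally
// would block until the timeout even though the socket is already connected.
bool pending = socket.ConnectAsync(e);
Console.WriteLine($"pending = {pending}");

// The proposed fix: only wait for the signal when the operation is pending.
if (pending && !signal.WaitOne(TimeSpan.FromSeconds(5)))
    throw new TimeoutException($"Connection to {endPoint} could not be established in time");

Console.WriteLine($"SocketError = {e.SocketError}, Connected = {socket.Connected}");
```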

@benjaminpetit
Member

Fixed

@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
@ghost ghost added the stale Issues with no activity for the past 6 months label Dec 7, 2021