Multiple network interfaces fail to initialize correctly in EC2. #992
Update: We managed to get SSH to the public IP of the default eth0 working consistently by setting eth0's default gateway with a unit:
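The unit itself isn't quoted above; a minimal sketch of what such a oneshot unit could look like, assuming a placeholder subnet gateway of 10.0.0.1 (use the gateway of eth0's VPC subnet):

```ini
# Hypothetical unit -- not the one from this thread. The gateway address
# 10.0.0.1 is a placeholder for the gateway of eth0's VPC subnet.
[Unit]
Description=Set the default gateway via eth0
Requires=systemd-networkd.service
After=systemd-networkd.service

[Service]
Type=oneshot
# "replace" keeps the unit idempotent if a default route already exists.
ExecStart=/usr/bin/ip route replace default via 10.0.0.1 dev eth0

[Install]
WantedBy=multi-user.target
```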
The only issue we have right now is that the Datadog agent can't talk out to the Datadog service about two out of three times:
The Datadog agent is started by this unit:
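The unit file itself is not preserved in this thread; a rough sketch of a Docker-based unit for the agent, where the image name, container name, and the /etc/datadog.env file holding API_KEY are assumptions rather than details from the original setup:

```ini
# Illustrative sketch only -- image, container name, and environment file are assumed.
[Unit]
Description=Datadog agent
Requires=docker.service
After=docker.service

[Service]
Restart=always
EnvironmentFile=/etc/datadog.env
ExecStartPre=-/usr/bin/docker rm -f dd-agent
ExecStart=/usr/bin/docker run --name dd-agent -e API_KEY=${API_KEY} datadog/docker-dd-agent

[Install]
WantedBy=multi-user.target
```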
Interestingly enough, every time the container gets one particular IP, the agent can't talk out to the Datadog service, but when the IP issued is 172.17.42.2, the service is reachable.
The most appropriate method for configuring those interfaces is to provide your own .network configs. The "Match" section can be used to selectively apply configs to the various interfaces.
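For example (hypothetical file names and MAC address), one .network file per interface, selected via [Match]:

```ini
# /etc/systemd/network/10-public.network -- match the public interface by name
[Match]
Name=eth0

[Network]
DHCP=yes

# /etc/systemd/network/20-private.network -- match the second interface by MAC
# (placeholder address)
[Match]
MACAddress=0a:00:27:00:00:01

[Network]
DHCP=yes
```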
@mik373 Were you able to get this working with the networkd configs?
I can't use the static IP configs for two reasons:
You should be able to define a .network config for each interface that enables DHCP. For the public interface's gateway, use a lower routing metric to ensure egress packets deterministically use that interface.
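A sketch of that idea, assuming the interfaces are named eth0 and eth1 and that eth0 is the public one; the RouteMetric values are arbitrary, what matters is that eth0's is lower:

```ini
# /etc/systemd/network/10-eth0.network -- public interface, preferred default route
[Match]
Name=eth0

[Network]
DHCP=yes

[DHCP]
RouteMetric=512

# /etc/systemd/network/20-eth1.network -- secondary interface, de-preferred route
[Match]
Name=eth1

[Network]
DHCP=yes

[DHCP]
RouteMetric=2048
```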
So my etcd cluster with my config works about 80% of the time. The other 20%, the interfaces are initialized in an order that creates asymmetric IP routes and the cluster members can't dial each other.
@mik373 Sorry, I just noticed there was an open question from you. No, nothing special is needed on CoreOS for SSH to work. Are you still having trouble with this?
I am having the same or a very similar issue. My setup is fairly similar: I have a bunch of instances with a single network interface to start with, and then a daemon attaches an additional ENI (eth1). I found that systemd-networkd fails to bring up eth1 properly. I believe I am hitting this issue: systemd/systemd#1784. So I have the following hack to make sure that eth1 comes up:
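The hack itself isn't shown above; one common shape for such a workaround is a oneshot unit that bounces the link after networkd starts, for example (hypothetical, not the original unit):

```ini
# Hypothetical workaround: bounce eth1 once networkd is up so it gets configured.
[Unit]
Description=Force eth1 to be reconfigured by systemd-networkd
Requires=systemd-networkd.service
After=systemd-networkd.service

[Service]
Type=oneshot
ExecStart=/usr/bin/ip link set eth1 down
ExecStart=/usr/bin/ip link set eth1 up

[Install]
WantedBy=multi-user.target
```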
The other problem I just noticed is that if I reboot an instance which has two ENIs (eth0 and eth1), the instance comes up with no working network, apart from eth1 because of the above hack. This is quite a serious problem, because it prevents us from using CoreOS with more than one network interface on EC2.
I don't know if this can help anyone, but I have instances on AWS with two interfaces. I was having the same problem: when eth1 became active and the machine then rebooted, I would lose network connectivity. The second interface adds another route and it messes with your eth0 setup, so I added this to my /etc/systemd/network:
I believe that using static IPs with a higher route metric can also help you avoid losing connectivity.
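The files dropped into /etc/systemd/network aren't preserved in this comment; a sketch of the static-IP-plus-metric idea, with placeholder addresses, where eth1's default route gets a much higher metric so eth0's route keeps winning:

```ini
# /etc/systemd/network/20-eth1.network -- hypothetical; addresses are placeholders
[Match]
Name=eth1

[Network]
Address=10.0.1.50/24

[Route]
Gateway=10.0.1.1
Metric=4096
```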
@vaijab can you give this another shot with the latest Alpha? That ships with a much newer version of systemd. @marcovnyc's suggestion to set the route metric is also interesting and might help out. I haven't had a chance to look into this yet.
Thanks @crawford. This is what I have in my user-data to make it work:
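The original user-data isn't preserved here; a minimal cloud-config sketch in the same spirit, dropping networkd units that enable DHCP and bias the default route toward eth0 (interface names and metric values are assumptions):

```yaml
#cloud-config
# Illustrative sketch only -- not the user-data from this comment.
coreos:
  units:
    - name: 10-eth0.network
      content: |
        [Match]
        Name=eth0

        [Network]
        DHCP=yes

        [DHCP]
        RouteMetric=512
    - name: 20-eth1.network
      content: |
        [Match]
        Name=eth1

        [Network]
        DHCP=yes

        [DHCP]
        RouteMetric=2048
```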
Is this issue still present with systemd 231?
Closing due to inactivity.
This is still an issue in 1911.4.0 as far as I can tell.
We've found this is an issue when using CoreOS (1911.3.0 at time of writing) with https://github.com/aws/amazon-vpc-cni-k8s/ in EC2. When enough pods are scheduled onto an instance, additional interfaces/ENIs are created. Pod IPs are drawn from a pool of secondary IPs attached to each interface as an implementation detail of the Amazon VPC CNI. These new interfaces learn default routes via DHCP with a metric of 1024. After a reboot, the order of the default routes is undetermined and the node is then unreachable via the primary interface's address. We are currently working around this by lowering the metric for the primary interface's default route.
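One way to express that workaround (a sketch, assuming the primary interface is eth0 and relying on the fact that lexically earlier .network files win when several match): a specific file for eth0 with a low metric and a catch-all for the ENIs the CNI attaches later:

```ini
# /etc/systemd/network/00-eth0.network -- primary interface, low metric
[Match]
Name=eth0

[Network]
DHCP=yes

[DHCP]
RouteMetric=512

# /etc/systemd/network/50-secondary.network -- catch-all for ENIs added later.
# eth0 also matches eth*, but 00-eth0.network sorts first and takes precedence.
[Match]
Name=eth*

[Network]
DHCP=yes

[DHCP]
RouteMetric=2048
```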
Is
@yuwata
@bgilbert Thanks. I'd like to ask one more thing: please provide the results of
BTW, if you think this is a bug in networkd or udevd, then please open a new issue in systemd and provide debugging logs of the daemons: booting with
Not sure, but systemd/systemd#11881 may fix this issue.
Hi experts,
This still seems to be a problem in CoreOS-stable-2191.5.0-hvm (ami-038cea5071a5ee580).
Scenario:
The debug log for the case where I can't SSH is below:
It seems like a race condition in how CoreOS determines and initializes eth0 and eth1 causes the routing to be broken.