Failed to get management ip for cbr0 network with host-gw mode on windows. #1066

Closed
song-jiang opened this issue Nov 21, 2018 · 14 comments
@song-jiang

Flanneld fails to get the management IP and panics.

Expected Behavior

Flanneld should get the management IP successfully.

Current Behavior

Running flanneld.exe on a Windows node (AWS Windows_Server-1803-English-Core-ContainersLatest-2018.08.15, Kubernetes 1.11.4, host-gw mode) hits a runtime error.

I1121 15:51:15.716306    4200 main.go:450] Searching for interface using 172.20.54.40
I1121 15:51:16.087282    4200 main.go:527] Using interface with name vEthernet (Ethernet 2) and address 172.20.54.40
I1121 15:51:16.089271    4200 main.go:544] Defaulting external address to interface address (172.20.54.40)
I1121 15:51:16.094280    4200 kube.go:126] Waiting 10m0s for node controller to sync
I1121 15:51:16.096278    4200 kube.go:309] Starting kube subnet manager
I1121 15:51:17.097267    4200 kube.go:133] Node controller sync successful
I1121 15:51:17.098264    4200 main.go:244] Created subnet manager: Kubernetes Subnet Manager - ec2amaz-p9t8oq5
I1121 15:51:17.101265    4200 main.go:247] Installing signal handlers
I1121 15:51:17.102265    4200 main.go:386] Found network config - Backend type: host-gw
I1121 15:51:17.103264    4200 hostgw_windows.go:73] HOST-GW config: {Name:cbr0 DNSServerList:}
I1121 15:51:17.121265    4200 hostgw_windows.go:157] Attempting to create HNSNetwork {"Name":"cbr0","Type":"L2Bridge","Subnets":[{"AddressPrefix":"10.244.11.0/24","GatewayAddress":"10.244.11.1"}]}
I1121 15:51:17.126268    4200 hostgw_windows.go:164] Waiting to get ManagementIP from HNSNetwork cbr0
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x18 pc=0x10c8cfa]

goroutine 1 [running]:
main.main()
        /home/song/go/src/github.com/coreos/flannel/main.go:297 +0xbea

Possible Solution

  1. Create and attach an endpoint to the cbr0 network before checking the management IP.
  2. Fix the code to return a non-nil error if checking for the management IP times out. Currently, on timeout, RegisterNetwork still returns a nil error.
    https://github.com/coreos/flannel/blob/master/backend/hostgw/hostgw_windows.go#L170
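
The second point can be sketched roughly as follows. This is a minimal, standalone illustration and not the actual flannel code: the `network` type, `fakeGetter`, `pollInterval`, and `waitForManagementIP` are hypothetical names introduced here. The idea is only that the polling helper returns a non-nil error when the deadline passes, so the caller never dereferences a nil result.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// network is a hypothetical stand-in for the HNS network object; only the
// field relevant to this issue is modeled.
type network struct {
	ManagementIP string
}

// errTimeout is returned when no management IP shows up before the deadline.
var errTimeout = errors.New("timeout, failed to get management IP")

// pollInterval is an illustrative polling interval, not a value from flannel.
const pollInterval = 10 * time.Millisecond

// waitForManagementIP calls get repeatedly until it reports a network with a
// non-empty ManagementIP, or until timeout has elapsed. On timeout it returns
// a nil network together with a non-nil error.
func waitForManagementIP(get func() (*network, error), interval, timeout time.Duration) (*network, error) {
	deadline := time.Now().Add(timeout)
	for {
		n, err := get()
		if err == nil && n != nil && n.ManagementIP != "" {
			return n, nil
		}
		if time.Now().After(deadline) {
			if err != nil {
				return nil, fmt.Errorf("%w: last error: %v", errTimeout, err)
			}
			return nil, errTimeout
		}
		time.Sleep(interval)
	}
}

// fakeGetter simulates a network that reports an empty ManagementIP for the
// first emptyCalls calls and then reports ip.
func fakeGetter(emptyCalls int, ip string) func() (*network, error) {
	calls := 0
	return func() (*network, error) {
		calls++
		if calls <= emptyCalls {
			return &network{}, nil
		}
		return &network{ManagementIP: ip}, nil
	}
}

func main() {
	n, err := waitForManagementIP(fakeGetter(2, "172.20.54.40"), pollInterval, time.Second)
	if err != nil {
		// The caller must stop here instead of using n.
		fmt.Println("error registering network:", err)
		return
	}
	fmt.Println("management IP:", n.ManagementIP)
}
```

With this shape, the caller can check the error and exit cleanly rather than panic on a nil network.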

Steps to Reproduce (for bugs)

  1. Start flanneld.exe

Context

Your Environment

  • Flannel version: current master
  • Backend used (e.g. vxlan or udp): host-gw
  • Etcd version:
  • Kubernetes version (if used): 1.11.4
  • Operating System and version: windows
  • Link to your project (optional):
@ekochen

ekochen commented Nov 27, 2018

Waiting for the Network to be created
Waiting for the Network to be created
I1127 10:24:28.829686 3668 kube.go:133] Node controller sync successful
I1127 10:24:28.829686 3668 main.go:244] Created subnet manager: Kubernetes Subnet Manager - windows-kube-139
I1127 10:24:28.832687 3668 main.go:247] Installing signal handlers
I1127 10:24:28.832687 3668 main.go:386] Found network config - Backend type: host-gw
I1127 10:24:28.833674 3668 hostgw_windows.go:73] HOST-GW config: {Name:cbr0 DNSServerList:}
I1127 10:24:28.853822 3668 hostgw_windows.go:157] Attempting to create HNSNetwork {"Name":"cbr0","Type":"L2Bridge","Subnets":[{"AddressPrefix":"10.20.4.0/24","GatewayAddress":"10.20.4.1"}]}
I1127 10:24:28.859820 3668 hostgw_windows.go:164] Waiting to get ManagementIP from HNSNetwork cbr0
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x18 pc=0x10e43ea]

goroutine 1 [running]:
main.main()
/home/kasubra/repo/gopath/src/github.com/coreos/flannel/main.go:297 +0xbfa

I am also hitting this issue. Is there a workaround?

@JohPa8696

Getting the same issue deploying Kubernetes 1.12.2. It was working one week ago.

Waiting for the Network to be created
I1129 16:07:23.313913    3220 main.go:450] Searching for interface using 10.204.35.207
Waiting for the Network to be created
I1129 16:07:24.439939    3220 main.go:527] Using interface with name vEthernet (Ethernet) and address 10.204.35.207
I1129 16:07:24.443546    3220 main.go:544] Defaulting external address to interface address (10.204.35.207)
I1129 16:07:24.455545    3220 kube.go:126] Waiting 10m0s for node controller to sync
I1129 16:07:24.459540    3220 kube.go:309] Starting kube subnet manager
Waiting for the Network to be created
I1129 16:07:25.481545    3220 kube.go:133] Node controller sync successful
I1129 16:07:25.488548    3220 main.go:244] Created subnet manager: Kubernetes Subnet Manager - kubewindowsminion1
I1129 16:07:25.489547    3220 main.go:247] Installing signal handlers
I1129 16:07:25.492547    3220 main.go:386] Found network config - Backend type: host-gw
I1129 16:07:25.494540    3220 hostgw_windows.go:73] HOST-GW config: {Name:cbr0 DNSServerList:}
I1129 16:07:25.640833    3220 hostgw_windows.go:157] Attempting to create HNSNetwork {"Name":"cbr0","Type":"L2Bridge","Subnets":[{"AddressPrefix":"10.244.5.0/24","GatewayAddress":"10.244.5.1"}]}
I1129 16:07:25.666035    3220 hostgw_windows.go:164] Waiting to get ManagementIP from HNSNetwork cbr0
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x18 pc=0x10e43ea]

goroutine 1 [running]:
main.main()
        /home/kasubra/repo/gopath/src/github.com/coreos/flannel/main.go:297 +0xbfa

@titilambert

Hello,
I applied this patch but I still have an issue. The network interface doesn't get an IP.
Should I force it?

Here is the output of flannel:

...
I1130 20:11:28.512420    3760 kube.go:133] Node controller sync successful
I1130 20:11:28.512420    3760 main.go:244] Created subnet manager: Kubernetes Subnet Manager - dmn0sueik6l24d6
I1130 20:11:28.513406    3760 main.go:247] Installing signal handlers
I1130 20:11:28.513406    3760 main.go:386] Found network config - Backend type: host-gw
I1130 20:11:28.513406    3760 hostgw_windows.go:73] HOST-GW config: {Name:cbr0 DNSServerList:}
I1130 20:11:28.529410    3760 hostgw_windows.go:157] Attempting to create HNSNetwork {"Name":"cbr0","Type":"L2Bridge","Subnets":[{"AddressPrefix":"10.201.18.0/24","GatewayAddress":"10.201.18.1"}]}
I1130 20:11:28.535418    3760 hostgw_windows.go:163] Created HNSNetwork cbr0
I1130 20:11:28.539408    3760 hostgw_windows.go:192] Attempting to create bridge HNSEndpoint &{Id: Name:cbr0_ep VirtualNetwork:0dfc5298-3449-4843-975c-4dfccdcb537e VirtualNetworkName: Policies:[] MacAddress: IPAddress:10.201.18.2 DNSSuffix: DNSServerList: GatewayAddress: EnableInternalDNS:false DisableICC:false PrefixLength:0 IsRemoteEndpoint:false Namespace:<nil>}
I1130 20:11:28.545420    3760 hostgw_windows.go:197] Created bridge HNSEndpoint cbr0_ep
I1130 20:11:28.545420    3760 hostgw_windows.go:201] Waiting to attach bridge endpoint cbr0_ep to host
I1130 20:11:29.321437    3760 hostgw_windows.go:209] Attached bridge endpoint cbr0_ep to host successfully
I1130 20:11:29.336431    3760 hostgw_windows.go:212] Waiting to get ManagementIP from HNSNetwork cbr0
E1130 20:11:34.356677    3760 main.go:289] Error registering network: timeout, failed to get management IP from HNSNetwork cbr0: timed out waiting for the condition
I1130 20:11:34.357665    3760 main.go:366] Stopping shutdownHandler...

@song-jiang
Author

Hmm... The patch works for me. I'm not sure whether waiting 5 seconds is enough. If you get more results, please let me know.

@titilambert

I confirm 5 seconds is too low. I set it to 25 seconds and it's working. I'm on Azure, which may explain the delay...
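
Since the required wait seems to differ between environments (5 seconds in one setup, 25 seconds on Azure), one option would be to make the timeout configurable instead of hard-coding it. The following is only a sketch under assumptions: the environment variable `FLANNEL_MGMT_IP_TIMEOUT`, the function `managementIPTimeout`, and the constant `defaultTimeout` are hypothetical and do not exist in flannel.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultTimeout mirrors the 5 second wait discussed in this thread.
const defaultTimeout = 5 * time.Second

// managementIPTimeout parses raw (for example "25s" or "1m") as a duration.
// It falls back to defaultTimeout when raw is empty, malformed, or not positive.
func managementIPTimeout(raw string) time.Duration {
	if raw == "" {
		return defaultTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultTimeout
	}
	return d
}

func main() {
	// FLANNEL_MGMT_IP_TIMEOUT is a hypothetical variable name used only here.
	timeout := managementIPTimeout(os.Getenv("FLANNEL_MGMT_IP_TIMEOUT"))
	fmt.Println("waiting up to", timeout, "for the management IP")
}
```

Falling back to the default on bad input keeps a typo in the setting from disabling the wait entirely.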

barnettZQG pushed a commit to barnettZQG/flannel that referenced this issue Dec 8, 2018
Fix nil  pointer error where get ManagementIP timeout, and extended get ManagementIP wait time to 30s
Fix issues flannel-io#1066
@barnettZQG
Contributor

The Microsoft docs say:

l2bridge:
      Requires: When this mode is used in a virtualization scenario (container host is a VM) MAC address spoofing is required.

My environment is a VM. I tried changing the wait time to one minute, but the system still could not allocate an IP. I don't know whether this is related to the requirement described above.

@barnettZQG
Contributor

If the IP is not allocated after a certain period of time, restarting the machine does help.

@davemeier

davemeier commented Jan 10, 2019

Is the flanneld.exe updated anywhere for the master flannel code, or do I need to build it myself?

[Edit] I have flanneld.exe building successfully now - #1081

@JohnJAS
Contributor

JohnJAS commented Jan 23, 2019

I haven't seen the patch for flanneld.exe, and I failed to build flanneld.exe by following building.md.
Is there any workaround for this issue? Or is there anywhere I can get the patched flanneld.exe?

[Edit] I created a PR to improve the documentation on how to build flanneld.exe. #1089

@JohnJAS
Contributor

JohnJAS commented Jan 24, 2019

Hello,
I applied this patch but I still have an issue. The network interface doesn't get an IP.
Should I force it?

Here is the output of flannel:

...
I1130 20:11:28.512420    3760 kube.go:133] Node controller sync successful
I1130 20:11:28.512420    3760 main.go:244] Created subnet manager: Kubernetes Subnet Manager - dmn0sueik6l24d6
I1130 20:11:28.513406    3760 main.go:247] Installing signal handlers
I1130 20:11:28.513406    3760 main.go:386] Found network config - Backend type: host-gw
I1130 20:11:28.513406    3760 hostgw_windows.go:73] HOST-GW config: {Name:cbr0 DNSServerList:}
I1130 20:11:28.529410    3760 hostgw_windows.go:157] Attempting to create HNSNetwork {"Name":"cbr0","Type":"L2Bridge","Subnets":[{"AddressPrefix":"10.201.18.0/24","GatewayAddress":"10.201.18.1"}]}
I1130 20:11:28.535418    3760 hostgw_windows.go:163] Created HNSNetwork cbr0
I1130 20:11:28.539408    3760 hostgw_windows.go:192] Attempting to create bridge HNSEndpoint &{Id: Name:cbr0_ep VirtualNetwork:0dfc5298-3449-4843-975c-4dfccdcb537e VirtualNetworkName: Policies:[] MacAddress: IPAddress:10.201.18.2 DNSSuffix: DNSServerList: GatewayAddress: EnableInternalDNS:false DisableICC:false PrefixLength:0 IsRemoteEndpoint:false Namespace:<nil>}
I1130 20:11:28.545420    3760 hostgw_windows.go:197] Created bridge HNSEndpoint cbr0_ep
I1130 20:11:28.545420    3760 hostgw_windows.go:201] Waiting to attach bridge endpoint cbr0_ep to host
I1130 20:11:29.321437    3760 hostgw_windows.go:209] Attached bridge endpoint cbr0_ep to host successfully
I1130 20:11:29.336431    3760 hostgw_windows.go:212] Waiting to get ManagementIP from HNSNetwork cbr0
E1130 20:11:34.356677    3760 main.go:289] Error registering network: timeout, failed to get management IP from HNSNetwork cbr0: timed out waiting for the condition
I1130 20:11:34.357665    3760 main.go:366] Stopping shutdownHandler...

+1 Same issue after using the patched flanneld.exe.

@davemeier

@JohnJAS - I get the same problem in my environment. I'm working in AWS and my Windows node is an 1803 instance. I also had to modify the Get-MgmtSubnet function in the SDN start-kubelet.ps1 script, because that code was failing. I am still getting the "Error registering network: timeout, failed to get management IP from HNSNetwork cbr0: timed out waiting for the condition" problem.

There is no "ManagementIP" element showing when I look at cbr0:

PS C:\k> Get-HnsNetwork -Id 207cd47b-66af-419f-98c2-c0699927f9b5

ActivityId : 16605ab1-ba02-4009-a984-375bded9b3cd
CurrentEndpointCount : 2
Extensions : {@{Id=e7c3b2f0-f3c5-48df-af2b-10fed6d72e7a; IsEnabled=False; Name=Microsoft Windows Filtering Platform}, @{Id=e9b59cfa-2be1-4b21-828f-b6fbdbddc017; IsEnabled=True; Name=Microsoft Azure VFP Switch Extension}, @{Id=ea24cd6c-d17a-4348-9190-09f0d5be83dd; IsEnabled=False; Name=Microsoft NDIS Capture}}
ID : 207cd47b-66af-419f-98c2-c0699927f9b5
LayerResources : @{AllocationOrder=3; Allocators=System.Object[]; ID=146d1cbc-74d4-4437-ac1e-8cee81dff07b; PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0; parentId=00000000-0000-0000-0000-000000000000}
LayeredOn : 7383cb85-bde7-452f-acdb-0e728689a8d5
MacPools : {@{EndMacAddress=00-15-5D-F5-DF-FF; StartMacAddress=00-15-5D-F5-D0-00}}
MaxConcurrentEndpoints : 2
Name : cbr0
Policies : {}
Resources : @{AllocationOrder=0; ID=16605ab1-ba02-4009-a984-375bded9b3cd; PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0; parentId=146d1cbc-74d4-4437-ac1e-8cee81dff07b}
State : 1
Subnets : {@{AddressPrefix=10.244.1.0/24; GatewayAddress=10.244.1.1}}
TotalEndpoints : 2
Type : L2Bridge
Version : 30064771074
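
The same check (is a ManagementIP present on the network or not) could be done programmatically on a JSON dump of the network. The sketch below assumes the network has been serialized to JSON, for example with PowerShell's ConvertTo-Json. The `hnsNetwork` struct and `hasManagementIP` function are hypothetical and model only the few fields needed here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hnsNetwork is a hypothetical, trimmed-down view of an HNS network. Fields
// that are absent from the JSON simply stay at their zero value.
type hnsNetwork struct {
	Name         string `json:"Name"`
	Type         string `json:"Type"`
	ManagementIP string `json:"ManagementIP"`
}

// hasManagementIP reports whether the JSON-encoded network carries a
// non-empty ManagementIP.
func hasManagementIP(raw []byte) (bool, error) {
	var n hnsNetwork
	if err := json.Unmarshal(raw, &n); err != nil {
		return false, fmt.Errorf("decoding network: %w", err)
	}
	return n.ManagementIP != "", nil
}

func main() {
	// Sample modeled on the cbr0 output above, which has no ManagementIP.
	sample := []byte(`{"Name":"cbr0","Type":"L2Bridge","Subnets":[{"AddressPrefix":"10.244.1.0/24","GatewayAddress":"10.244.1.1"}]}`)
	ok, err := hasManagementIP(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("has ManagementIP:", ok)
}
```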

@davemeier

After moving to Windows 1809 on AWS, I no longer have any problems with flannel or with the cbr0 network. In addition to changing platforms, I have also made sure that my Linux and Windows EC2 instances are on the same subnet, and I have disabled source/destination checks. I found these suggestions on the following web page: https://rancher.com/docs/rancher/v2.x/en/cluster-provisioning/rke-clusters/windows-clusters/

My Windows pods are now working correctly and can be accessed successfully through the service.
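
For anyone who wants to verify the "same subnet" condition mentioned above, here is a small sketch. The subnet and the second address are made-up example values, and `sameSubnet` is a hypothetical helper. It only checks address membership and says nothing about source/destination checks, which are an EC2 instance setting.

```go
package main

import (
	"fmt"
	"net"
)

// sameSubnet reports whether every address in ips falls inside cidr.
func sameSubnet(cidr string, ips ...string) (bool, error) {
	_, subnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return false, fmt.Errorf("invalid subnet %q: %w", cidr, err)
	}
	for _, s := range ips {
		ip := net.ParseIP(s)
		if ip == nil {
			return false, fmt.Errorf("invalid IP address %q", s)
		}
		if !subnet.Contains(ip) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// 172.20.32.0/19 and 172.20.40.12 are assumed example values.
	ok, err := sameSubnet("172.20.32.0/19", "172.20.54.40", "172.20.40.12")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("nodes share the subnet:", ok)
}
```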

@daschott

Indeed, this issue is reproducible on 1803. With the same configuration, it seems to work for 1809. We are going to investigate what the diff is and post back here.

@stale

stale bot commented Jan 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 26, 2023
@stale stale bot closed this as completed Feb 16, 2023