This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

[kubernetes] unable to create cluster with custom vnet #120

Closed
jpoon opened this issue Nov 24, 2016 · 36 comments

Comments

@jpoon
Contributor

jpoon commented Nov 24, 2016

What happened:

When a k8s cluster is created in an existing vnet, the cluster is unable to create routes in the Azure route table and is therefore unable to schedule any pods.

How to reproduce it:

  1. Create a custom vnet
  2. Configure the template and deploy

When the cluster is up, the nodes report as ready:

gfadmin@k8s-master-35738843-0:~$ kubectl get nodes
NAME                        STATUS                     AGE
k8s-agentpool1-35738843-0   Ready                      16h
k8s-agentpool1-35738843-1   Ready                      16h
k8s-agentpool1-35738843-2   Ready                      16h
k8s-master-35738843-0       Ready,SchedulingDisabled   16h

However, the NetworkUnavailable condition is True, with the message "RouteController failed to create a route":

gfadmin@k8s-master-35738843-0:~$ kubectl describe node k8s-master-35738843-0
Name:                   k8s-master-35738843-0
Labels:                 beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/instance-type=Standard_D2_v2
                        beta.kubernetes.io/os=linux
                        failure-domain.beta.kubernetes.io/region=westus
                        failure-domain.beta.kubernetes.io/zone=0
                        kubernetes.io/hostname=k8s-master-35738843-0
Taints:                 <none>
CreationTimestamp:      Wed, 23 Nov 2016 18:40:52 +0000
Phase:
Conditions:
  Type                  Status  LastHeartbeatTime                       LastTransitionTime                        Reason                          Message
  ----                  ------  -----------------                       ------------------                        ------                          -------
  OutOfDisk             False   Thu, 24 Nov 2016 11:02:41 +0000         Wed, 23 Nov 2016 18:40:52 +0000   KubeletHasSufficientDisk        kubelet has sufficient disk space available
  MemoryPressure        False   Thu, 24 Nov 2016 11:02:41 +0000         Wed, 23 Nov 2016 18:40:52 +0000   KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure          False   Thu, 24 Nov 2016 11:02:41 +0000         Wed, 23 Nov 2016 18:40:52 +0000   KubeletHasNoDiskPressure        kubelet has no disk pressure
  Ready                 True    Thu, 24 Nov 2016 11:02:41 +0000         Wed, 23 Nov 2016 18:40:52 +0000   KubeletReady                    kubelet is posting ready status
  NetworkUnavailable    True    Thu, 24 Nov 2016 11:02:47 +0000         Thu, 24 Nov 2016 11:02:47 +0000   NoRouteCreated                  RouteController failed tocreate a route

Looking at the kube-controller logs (/var/log/containers):

routecontroller.go:132] Could not create route 5cb8901d-b1ac-11e6-89eb-000d3a32ff9f 10.244.2.0/24 for node k8s-master-35738843-0 after 38.691596ms: network.SubnetsClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code=\"ResourceNotFound\" Message=\"The Resource 'Microsoft.Network/virtualNetworks/subscriptions' under resource group 'ACSRG2' was not found.\"\n","stream":"stderr","time":"2016-11-23T18:51:29.914307462Z"}

Notice that the error message contains a malformed resource: Microsoft.Network/virtualNetworks/subscriptions.

Workaround

We traced this to /etc/kubernetes/azure.json, where unqualified names are expected for both the vnet and the subnet. Instead, the fully-qualified names are present:

{
    ...
    "subnetName": "/subscriptions/76aabf62-fa6e-41ac-a2f3-5532b22811b5/resourceGroups/ACSRG2/providers/Microsoft.Network/virtualNetworks/k8s-vnet-test/subnets/k8s-subnet-test",
    "securityGroupName": "...",
    "vnetName": "/subscriptions/76aabf62-fa6e-41ac-a2f3-5532b22811b5/resourceGroups/ACSRG2/providers/Microsoft.Network/virtualNetworks/k8s-vnet-test",
    ...
}

After changing the subnet and vnet to unqualified names and restarting kubelet, the routes are created and things are back to normal.
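A quick way to spot this condition is to check whether the name fields in azure.json contain a slash. The following is a minimal sketch, not part of acs-engine: it assumes only the three field names quoted above, the helper `qualifiedFields` is hypothetical, and the `example-nsg` value is a placeholder because the real security group name is elided in the snippet.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// cloudConfig covers only the azure.json fields quoted in this issue.
// Other keys in the real file are ignored by json.Unmarshal.
type cloudConfig struct {
	SubnetName        string `json:"subnetName"`
	SecurityGroupName string `json:"securityGroupName"`
	VnetName          string `json:"vnetName"`
}

// qualifiedFields (hypothetical helper) returns the keys whose values
// contain a "/" and so look like full resource IDs, not short names.
func qualifiedFields(raw []byte) ([]string, error) {
	var cfg cloudConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	fields := []struct{ key, val string }{
		{"subnetName", cfg.SubnetName},
		{"securityGroupName", cfg.SecurityGroupName},
		{"vnetName", cfg.VnetName},
	}
	var bad []string
	for _, f := range fields {
		if strings.Contains(f.val, "/") {
			bad = append(bad, f.key)
		}
	}
	return bad, nil
}

func main() {
	// Sample input modeled on the snippet above; "example-nsg" is a placeholder.
	raw := []byte(`{
		"subnetName": "/subscriptions/76aabf62-fa6e-41ac-a2f3-5532b22811b5/resourceGroups/ACSRG2/providers/Microsoft.Network/virtualNetworks/k8s-vnet-test/subnets/k8s-subnet-test",
		"securityGroupName": "example-nsg",
		"vnetName": "/subscriptions/76aabf62-fa6e-41ac-a2f3-5532b22811b5/resourceGroups/ACSRG2/providers/Microsoft.Network/virtualNetworks/k8s-vnet-test"
	}`)
	bad, err := qualifiedFields(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("fully-qualified fields:", bad)
}
```

On a real master node the input would presumably be read from /etc/kubernetes/azure.json instead of an inline string.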

Much of the credit in debugging this goes to @jamesbak.

@jpoon
Contributor Author

jpoon commented Nov 24, 2016

cc @colemickens. I can provide private keys to help debug, but it should be fairly easy to get a repro

@colemickens
Contributor

colemickens commented Nov 24, 2016

We go straight to the route table that is listed in the config file, but we also check the subnet to see if it's properly configured.

Options:

  1. Skip the subnet check
  2. Take a special vnetResourceGroup that can override. (Can you reference a vnet in a different sub? If so, we also need `vnetSubscriptionId`.)
  3. Start versioning the config with a nested struct, or two-pass decode, and then start using full identifiers everywhere. This means more util functions to rip the identifiers apart since the SDK APIs address resources by individual, separate strings of the inner identifiers.

I think #1 might possibly be the right thing to do, depending on if we can support multiple subnets of machines with same route table. If we can, then subnetName is sort of meaningless. We really just care about configuring the appropriate route table in the Routes implementation. At some point we have to expect the user set things up correctly.
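For reference, the "two-pass decode" in option 3 could look roughly like the sketch below. It is only an illustration under stated assumptions: the `version` key, the `v2` value, the nested `network.subnetID` field, and the function `subnetRef` are all hypothetical and do not exist in the current config format. Only `vnetName` and `subnetName` come from this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// versionProbe is the first pass: read only a (hypothetical) version key.
type versionProbe struct {
	Version string `json:"version"`
}

// configV1 mirrors the current flat layout, which holds short names.
type configV1 struct {
	VnetName   string `json:"vnetName"`
	SubnetName string `json:"subnetName"`
}

// configV2 is a hypothetical nested layout that stores a full resource ID.
type configV2 struct {
	Network struct {
		SubnetID string `json:"subnetID"`
	} `json:"network"`
}

// subnetRef decodes raw in two passes. It returns the subnet reference and
// whether that reference is a full resource ID (v2) or a short name (v1).
func subnetRef(raw []byte) (string, bool, error) {
	var probe versionProbe
	if err := json.Unmarshal(raw, &probe); err != nil {
		return "", false, err
	}
	switch probe.Version {
	case "": // unversioned files are treated as the current format
		var c configV1
		if err := json.Unmarshal(raw, &c); err != nil {
			return "", false, err
		}
		return c.SubnetName, false, nil
	case "v2":
		var c configV2
		if err := json.Unmarshal(raw, &c); err != nil {
			return "", false, err
		}
		return c.Network.SubnetID, true, nil
	default:
		return "", false, fmt.Errorf("unsupported config version %q", probe.Version)
	}
}

func main() {
	ref, full, err := subnetRef([]byte(`{"vnetName":"k8s-vnet-test","subnetName":"k8s-subnet-test"}`))
	fmt.Println(ref, full, err)
}
```

The second pass re-parses the same bytes, which keeps existing unversioned files working while a newer layout is introduced.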

CC: @brendandburns for any thoughts.

@colemickens
Contributor

This presents another question, though: where does the route table live? We need to find out whether the route table for the subnet in the existing vnet can live in both resource groups, only one, etc. If it can live in the existing vnet's resource group, then we need to support the full identifier string for the routeTable field anyway...

@mogthesprog

Just came back here to report the same issue (finally got around to looking at it from #99, sorry for the delay). My two cents: full resource IDs feel like the Azure-idiomatic way of declaring resources, which makes me wonder whether your point 3 makes more sense here.

@jpoon
Contributor Author

jpoon commented Nov 30, 2016

Thanks @colemickens. Would it not be easiest to modify the azure.json that ACS-Engine generates and puts on the master node (i.e., the workaround that I mentioned above)?

I think this is what you meant by option 3. As deploying Kubernetes under a custom VNET never worked, there would be no need to start versioning this config... yet.

@colemickens
Contributor

@jpoon How would that help anything? The problem is that the code assumes that all resources are in the same resource group specified in the config file.

@jpoon
Contributor Author

jpoon commented Nov 30, 2016

In our case, everything is under the same resource group. Are there situations where people would deploy a VNET under a separate resource group?

@colemickens
Contributor

I had assumed as much, but I don't actually know that as a solid fact. Is that something you would have data on?

@rgardler or @sauryadas for customers who want to deploy clusters (particularly Kubernetes) into existing vnets... are they generally putting the cluster into the existing resource group, or are the existing vnet and new cluster typically in different resource groups?

@jpoon
Contributor Author

jpoon commented Nov 30, 2016

As custom vnets don't work at all, would it be reasonable to do a quick fix to support things in the same RG?

@colemickens
Contributor

Because the apimodel takes vnetSubnetID as the full identifier string (which I agree with), this "quick fix" needs two template functions: one to extract the vnet name, and another to extract the subnet name.

So that in the template we can write {{ subnetNameFromId .VnetSubnetID }} (very pseudocode-y).
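A hedged sketch of what such template functions might look like with Go's `text/template` follows. The function name `subnetNameFromId` is taken from the pseudocode above; `vnetNameFromId`, `segmentAfter`, `render`, and the template fragment are hypothetical and are not the code from any acs-engine branch.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// segmentAfter walks the ID as key/value pairs and returns the value that
// follows key, or "" if key is not present.
func segmentAfter(id, key string) string {
	parts := strings.Split(strings.Trim(id, "/"), "/")
	for i := 0; i+1 < len(parts); i += 2 {
		if strings.EqualFold(parts[i], key) {
			return parts[i+1]
		}
	}
	return ""
}

// fragment is a hypothetical piece of the azure.json template.
const fragment = `"subnetName": "{{ subnetNameFromId .VnetSubnetID }}", "vnetName": "{{ vnetNameFromId .VnetSubnetID }}"`

// render executes the fragment with both helper functions registered.
func render(vnetSubnetID string) (string, error) {
	funcs := template.FuncMap{
		"subnetNameFromId": func(id string) string { return segmentAfter(id, "subnets") },
		"vnetNameFromId":   func(id string) string { return segmentAfter(id, "virtualNetworks") },
	}
	// Funcs must be registered before Parse so the names resolve.
	tmpl, err := template.New("azure.json").Funcs(funcs).Parse(fragment)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	data := struct{ VnetSubnetID string }{vnetSubnetID}
	if err := tmpl.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	id := "/subscriptions/76aabf62-fa6e-41ac-a2f3-5532b22811b5/resourceGroups/ACSRG2/providers/Microsoft.Network/virtualNetworks/k8s-vnet-test/subnets/k8s-subnet-test"
	s, err := render(id)
	if err != nil {
		fmt.Println("render error:", err)
		return
	}
	fmt.Println(s)
}
```

Looking segments up by their key rather than by a fixed position keeps the helpers independent of how long the ID prefix is.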

PRs are very welcome, I'm not going to be able to get to this for a while.

@colemickens
Contributor

I have a branch here that might fix this issue for deployments into a single RG. I don't have an easy way of testing it. Is anyone here willing to give it a shot? https://github.com/colemickens/acs-engine/tree/colemickens-pr-fix-custom-vnet

@lmickh

lmickh commented Jan 10, 2017

@colemickens I tested the branch with a config very similar to the custom vnet example. New single RG, new vnet, and so on. Basically just changed the vnet and RG names. The result after applying the templates was the same as before. Both vnetName and subnetName are fully qualified in /etc/kubernetes/azure.json. Not sure why that is the case.

@colemickens
Contributor

I think I fixed it. Could I get you to pull, rebuild and try again? Thanks so much.

@lmickh

lmickh commented Jan 10, 2017

No dice. Error on the deployment:

At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details. {
  "error": {
    "code": "InvalidTemplate",
    "message": "Unable to process template language expressions for resource '/subscriptions/<snip>/am23-kube01/providers/Microsoft.Compute/virtualMachines/k8s-agentpool2-14283094-0/extensions/cse0' at line '1' and column '66850'. 'The template variable 'subnetName' is not found. Please see https://aka.ms/arm-template/#variables for usage details.'"
  }
}

@colemickens
Contributor

Okay, I pushed another iteration up, let me know if you try it (thanks for guinea-pigging it, I really appreciate it).

@lmickh

lmickh commented Jan 11, 2017

Hmm, something went wrong again. Both the masters and the agents look like this now:

    "location": "eastus2",
    "subnetName": "subnets",
    "securityGroupName": "k8s-master-14283094-nsg",
    "vnetName": "virtualNetworks",

@colemickens
Contributor

@lmickh ha, off by one. Just pushed another one if you want to try.
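The branch's code is not shown in this thread, so the following is only one plausible illustration of how an off-by-one could produce "subnets" and "virtualNetworks". It assumes the names are picked by position from the end of the split resource ID; the helper `fromEnd` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// fromEnd returns the n-th path segment counted from the end (n=1 is the
// last segment), or "" if the ID has fewer than n segments.
func fromEnd(id string, n int) string {
	parts := strings.Split(strings.Trim(id, "/"), "/")
	if n < 1 || n > len(parts) {
		return ""
	}
	return parts[len(parts)-n]
}

func main() {
	id := "/subscriptions/76aabf62-fa6e-41ac-a2f3-5532b22811b5/resourceGroups/ACSRG2/providers/Microsoft.Network/virtualNetworks/k8s-vnet-test/subnets/k8s-subnet-test"

	// Off by one: these positions hold the resource *type* segments.
	fmt.Println(fromEnd(id, 2), fromEnd(id, 4)) // subnets virtualNetworks

	// Shifted by one: these positions hold the resource *names*.
	fmt.Println(fromEnd(id, 1), fromEnd(id, 3)) // k8s-subnet-test k8s-vnet-test
}
```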

@lmickh

lmickh commented Jan 12, 2017

@colemickens I'm not seeing a new commit on that branch.

@colemickens
Contributor

colemickens commented Jan 12, 2017 via email

@colemickens
Contributor

Sorry @lmickh apparently I commit --amended last night, but forgot to push. I've just pushed the change up now...

@lmickh

lmickh commented Jan 16, 2017

Latest one worked. The short names are listed in azure.json properly and all hosts were able to create routes.

@colemickens
Contributor

I'd missed your reply, @lmickh. Thanks very much for dogfooding for me and confirming.

@anhowe
Contributor

anhowe commented Jan 23, 2017

Cole is fixing in #172

@MoTAUser

@colemickens, we still have the issue from @jpoon's initial post. We've spent several days trying to deploy an ACS cluster with k8s in a custom VNET. Even with the fix in #172, it's still not working for us.
Is there anything else we need to consider, or is this issue still in progress?

@colemickens
Contributor

@MoTAUser Can you elaborate? I've had other people report it's working.

ACS does not support custom vnet, only ACS-Engine does...

@Hupka

Hupka commented Jan 31, 2017

Hi @colemickens,
I am working on the same project as @MoTAUser and will try to give you a little more information. We are fairly new to this topic, so please point out if the information we give is insufficient. We are really eager to get this to work.

  • One month ago we figured out that ACS from the Azure Marketplace won't work for us, so we have worked only with ACS-Engine since then.
  • We have a custom vnet set up within the subscription to meet our corporation's requirements. We dedicated a subnet to the ACS deployment. We temporarily modified the routing table to get Kubernetes and Docker successfully installed on the deployed machines and to avoid proxy issues.
  • We have set up a service principal that has the Contributor role on our subscription.
  • We deploy the ACS + k8s cluster in the same resource group as the custom VNET.
  • The deployment runs successfully; agents and masters are tied to the same subnet.
  • Kubernetes is successfully installed, but no routes are created.

Do you see anything suspicious so far?

Regards,
Adrian

@colemickens
Contributor

Heh, yes, the last one of course: "kubernetes is successfully installed, but no routes are created."

Does the apiserver actually start running, or no? That will help me guess as to why routes aren't being created.

@Hupka

Hupka commented Jan 31, 2017

Hey, thanks for your reply. Unfortunately we are sitting here in Germany and aren't at work anymore and can't access the azure resource. We are going to dig into this again tomorrow morning. Anything else we should look out for? Do you want to have logs of any kind?

@colemickens
Contributor

The full logs of kube-controller-manager (kubectl logs --namespace=kube-system kube-controller-manager-<whatever>) and kubelet (journalctl -u kubelet) from the master will probably come in handy.

@MoTAUser

MoTAUser commented Feb 1, 2017

It seems apiserver is running normally.
Here are the log files including our acs-engine manifest (kubernetesvnet.json), and the azure deployment jsons (azuredeploy.json, azuredeploy.parameter.json).

colemickens-[kubernetes] unable to create cluster with custom vnet #120 logs.zip

@colemickens
Contributor

This is now fixed by the merging of #172.

@colemickens
Contributor

@MoTAUser Please file a new issue if you're still having issues after updating ACS-Engine, rebuilding and redeploying. Thanks.

@rrajadevops

@jpoon I created a custom ACS cluster today. Could you please share the steps for connecting to my k8s cluster using kubectl?
Regards,
Raja

@rrajadevops

@jpoon @colemickens @mogthesprog @lmickh can anyone please share the steps to expose the pods in our custom ACS cluster through a LoadBalancer service?

When I try to create the LB service, I get the error below.

Events:
  Type     Reason                      Age              From                Message
  ----     ------                      ---              ----                -------
  Normal   EnsuringLoadBalancer        1s (x6 over 2m)  service-controller  Ensuring load balancer
  Warning  CreatingLoadBalancerFailed  1s (x6 over 2m)  service-controller  Error creating load balancer (will retry): Failed to ensure load balancer for service ge-dashboard-test/pythonservice: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/e9b70815-40be-4610-b24c-**********/resourceGroups/ge-dashboard-test/providers/Microsoft.Network/loadBalancers/ge-dashboard-test-internal?api-version=2017-03-01: StatusCode=0 -- Original Error: adal: Refresh request failed. Status Code = '400'

Regards,
Raja

@jpoon
Contributor Author

jpoon commented Apr 4, 2018

Most likely a bad service principal: https://docs.microsoft.com/en-us/azure/aks/kubernetes-service-principal
