NAT Gateway Support #43

Closed · 3 tasks done
dkistner opened this issue Feb 21, 2020 · 12 comments
Labels: kind/enhancement, kind/roadmap
@dkistner (Member) commented Feb 21, 2020

What would you like to be added:
Azure will soon offer a NAT service (currently in a private preview), and there are scenarios where users need a dedicated NAT service, e.g. whitelisting scenarios which require stable IP(s) for egress connections initiated within the cluster.

Currently, all egress traffic from a Gardener-managed Azure cluster is routed via the cluster load balancer.

As the NAT gateway will come with additional costs, I would recommend integrating it as an optional feature and making it configurable for users.

As the NAT gateway always requires at least one public IP assigned, I would propose to make it possible for users to pass their public IP address(es) or public IP address range(s) to the extension via .spec.providerConfig.networks.natGateway.ipAddresses[] | .ipAddressRanges[]. Only if both lists are empty would the Gardener extension create one public IP and assign it to the service.

spec:
  type: azure
  ...
  providerConfig:
    apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureConfig
    networks:
      natGateway:
        enabled: <true|false>
        ipAddresses:
        - name: my-ip-resource-name
          resourceGroup: my-ip-resource-group
        ipAddressRanges:
        - name: my-iprange-resource-name
          resourceGroup: my-iprange-resource-group
    ...

Later on, when we go on with multiple AvailabilitySet support, we will probably need to make the NAT gateway required. In that case .spec.providerConfig.networks.natGateway.enabled needs to always be true.

Why is this needed:
Support scenarios which require a dedicated NAT service, e.g. whitelisting scenarios.


cc @vlerenc, @AndreasBurger, @MSSedusch

@dkistner dkistner added the kind/enhancement Enhancement, improvement, extension label Feb 21, 2020
@dkistner dkistner self-assigned this Feb 21, 2020
@MSSedusch commented:
Should we allow customers to define how many NAT gateways to create, and in which zone (and in which subnet)?

networks:
      natGateways:
      - name: nat1
        zone: 1
        ipAddresses:
        - name: my-ip-resource-name
          resourceGroup: my-ip-resource-group
        ipAddressRanges:
        - name: my-iprange-resource-name
          resourceGroup: my-iprange-resource-group
        subnet: subnet1
      - name: nat2
        zone: 2
        ipAddresses:
        - name: my-ip-resource-name
          resourceGroup: my-ip-resource-group
        ipAddressRanges:
        - name: my-iprange-resource-name
          resourceGroup: my-iprange-resource-group
        subnet: subnet2

@dkistner (Member Author) commented:
Hmm, we discussed that; the main benefit is redundancy, right? Meaning if the NAT gateway fails in one zone, only the machines in that zone lose egress, while those in the other zones are still fine.

On the other hand, this would not be possible for AvailabilitySet-based clusters (distribution across fault domains is not possible?) and would come with higher costs, because we would then need one NAT gateway per zone instead of only one per cluster.

@dkistner (Member Author) commented Mar 4, 2020

Just to summarise it.

The current network setup for Azure Shoot clusters consists of one subnet within a virtual network (VNet); machines assigned to an AvailabilitySet, or machines distributed across zones, can be attached to that subnet.

This approach has some implications for the natgateway integration:

  • The NAT gateway needs to be deployed into one zone. This means it is not deployed zone-redundantly like the Standard LoadBalancer, which makes it a potential single point of failure. This is the case anyway for AvailabilitySet-based clusters, as they always have just one subnet.
  • Egress connections initiated from machines in a different zone than the NAT gateway need to route their traffic to the zone hosting the NAT gateway before it can go to the Internet. This could have a latency impact.
  • We don't know the SLA, costs, and traffic costs of the NAT gateway compared to the Standard LoadBalancer.

Btw, if we decide to go with multiple NAT gateways later on, meaning one per zone, we could easily extend the suggested structure by adding zone information to each IP address or IP address range. The Gardener Azure extension controller would then only create public IPs for zones where the user does not specify an IP or IP range.

    networks:
      natGateway:
        enabled: <true|false>
        ipAddresses:
        - name: my-ip-resource-name
          resourceGroup: my-ip-resource-group
          zone: 1
        ipAddressRanges:
        - name: my-iprange-resource-name
          resourceGroup: my-iprange-resource-group
          zone: 2

@dkistner (Member Author) commented Mar 6, 2020

Another finding: attaching the NAT gateway to a subnet which has a Basic LoadBalancer assigned is not possible.

That would mean if we want to enable NAT gateways for AvailabilitySet-based clusters, we would need to switch to Standard LoadBalancers for this type of cluster as well (which we want to do anyway), and the NAT gateway would be mandatory for this type of cluster (different from zone-based clusters, which can work without a NAT gateway). In other words, a zone-based cluster would only need a Standard LoadBalancer, while an AvailabilitySet-based cluster would need a Standard LoadBalancer plus a NAT gateway.

@dkistner (Member Author) commented:

It seems there is a bug in the azurerm Terraform provider which prevents the Terraformer from detaching and deleting the Gardener-managed public IP assigned to the NAT gateway. With this issue, the Terraformer won't be able to detach and delete the public IP (created by Gardener in case the user does not provide one) if the user later wants to assign their own public IPs/IP ranges. See here: hashicorp/terraform-provider-azurerm#6052

@dkistner (Member Author) commented Mar 11, 2020

I propose to go on with the Azure NatGateway integration in several steps.

  1. NatGateway integration with one public IP attached by Gardener (no bring-your-own IP for the NatGateway in the first step due to hashicorp/terraform-provider-azurerm#6052 "NatGateway: Public IP is not detached if intended to be deleted") and only for zoned clusters, see the sketch below.
  2. Enable bring-your-own IP for zoned clusters (once hashicorp/terraform-provider-azurerm#6052 is fixed and a new version of the azurerm Terraform provider is released).
  3. Enable NatGateway for AvailabilitySet-based/non-zoned clusters, once the Standard LoadBalancer is integrated for non-zoned clusters (larger effort due to LoadBalancer migration etc.). This will probably require deploying the NatGateway mandatorily (non-optional) in addition to the Standard LoadBalancer.
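
For illustration, step 1 would roughly correspond to a minimal configuration like the following. This is just a sketch based on the structure proposed above; in this first step no user-provided IPs or IP ranges would be accepted, Gardener creates and attaches the single public IP itself:

spec:
  type: azure
  ...
  providerConfig:
    apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureConfig
    networks:
      natGateway:
        enabled: true  # Gardener-managed public IP; zoned clusters only in step 1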

@vlerenc (Member) commented Mar 12, 2020

Thanks @dkistner.

Since users that deploy to multiple AZs do so for availability reasons, we should definitely have one NAT GW per zone; otherwise we introduce singletons and break the main motivation for going with AZs in the first place. We would never be able to explain/motivate this decision when push comes to shove.

Enforcing NAT GWs for AvSet-based clusters is acceptable, because we don't want to go on with AvSet-based clusters as they are today, for two main reasons:

  1. When an AvSet gets a nervous breakdown, it's similar to losing a zone (not really, but if the cluster is set up this way, at least kind of), and the rest of the cluster does not go down as well.
  2. When the user wants to modernise a cluster, we need another AvSet for the new hardware.

So, if the logical chain is as follows: multiple AvSets require STD LBs that require NAT GWs, then so be it. The above goals justify this, or else AvSet clusters remain very brittle/troubled. We want to change that more than anything else here (frankly, even more than stable outbound IPs).

The plan above also makes a lot of sense with regard to the TF bug. Maybe MS could help fix it, or we could, or we could use the native SDKs, but "sitting it out" is absolutely fair, especially given all the work we have to do.

@dkistner (Member Author) commented:
With the current network setup for Azure Shoots we have only one subnet, and machines distributed across several zones can be attached to it. So far this wasn't an issue, as the Standard LoadBalancer used for ingress and egress (which is still the default, as the NatGateway should be optional) is automatically deployed zone-redundantly by Azure. An Azure subnet can currently only be associated with one NatGateway, which means that attaching multiple NatGateways in different zones to the same subnet is not possible.

So at the moment I see only two possibilities to enable zone-redundant NatGateways:

  1. A dedicated subnet for each zone, similar to what we have for AWS. This is a larger effort, as we need migration logic to move machines from one subnet to another.
  2. Microsoft provides an automated zone-redundant way to deploy the NatGateway, similar to what exists for the Standard LoadBalancer. This is not yet there (the NatGateway itself is not even GA yet) and we don't know if and when something like that will be available.

So for now I see only option 1 as a short-term solution, albeit with more effort, to enable redundant NatGateways. That's probably also the only option which Gardener fully controls (see the sketch below). Option 2 would mean waiting and going on without NatGateway HA for now. This will be an optional feature anyway, so we can argue: if zone-redundant, reliable NATing is required, please go with the Standard LoadBalancer.
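
Just to illustrate what option 1 could look like, below is a rough, purely hypothetical sketch of a per-zone layout; the zones structure and the field names (zones, cidr) are illustrative only and not part of the current InfrastructureConfig:

    networks:
      zones:
      - name: 1
        cidr: 10.250.0.0/24  # hypothetical dedicated subnet for zone 1
        natGateway:
          enabled: true
          ipAddresses:
          - name: my-ip-resource-name
            resourceGroup: my-ip-resource-group
      - name: 2
        cidr: 10.250.1.0/24  # hypothetical dedicated subnet for zone 2
        natGateway:
          enabled: true  # no ipAddresses given; Gardener would create a public IP for this zone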

But as I mentioned, AvailabilitySets are also a valid HA mechanism for machines, and in this case we will always have just one subnet and therefore only one NatGateway (except if multiple NatGateways are ever allowed to be attached to one subnet).

So how to proceed? @vlerenc, @rfranzke WDYT?

@vlerenc (Member) commented Mar 12, 2020

Well, "(1) larger effort" and "I see only (1) as short term solution" doesn't fir together for me. ;-)

Considering what you wrote, I would then do (2), i.e. sit it out and hope nobody escalates before MS/Azure changes that. If MS/Azure offers cross-AZ subnets and has "zone-aware LBs" then I would expect that for all resources as well. AWS does it differently and scopes by zone.

@dkistner (Member Author) commented:
Well, "(1) larger effort" and "I see only (1) as short term solution" doesn't fir together for me. ;-)

:D Sorry for the misleading statements. I meant I see more implementation effort for (1) because of the change in the network layout and the migration logic to move machines from one subnet to another. Of course we could do that. With short term I mean maybe within weeks; for (2) I do not know, I can only estimate and would guess months...

I'm also for (2) in general, because we still have the Standard LoadBalancer with zone redundancy, which should be sufficient for most cases.

@larsdannecker commented:
(quoting @vlerenc's comment above)

Hi @vlerenc, what about data centers that require AvSets?

@ghost ghost added the lifecycle/stale Nobody worked on this for 6 months (will further age) label May 18, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 18, 2020
@vlerenc vlerenc added roadmap/cloud-sap and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Oct 17, 2020
@vlerenc vlerenc changed the title NAT Gateway integration NAT Gateway Support Oct 18, 2020
@dkistner dkistner added this to the 2021-Q1 milestone Nov 5, 2020
@gardener gardener deleted a comment from vlerenc Nov 24, 2020
@dkistner (Member Author) commented:
Step 3, making the NatGateway usable in combination with AvailabilitySets, will probably not be implemented, as we are planning to deprecate AvailabilitySet-based clusters and replace them with clusters based on VirtualMachineScaleSet Orchestration mode VMs (VMO) in the mid term. Those clusters will be compatible with the NatGateway out of the box.
