
v2 <-> v3 TlsContext causes temporary downtime #13864

Closed
howardjohn opened this issue Nov 2, 2020 · 4 comments
Assignees: lizan
Labels: area/sds (SDS related), bug, stale (stalebot believes this issue/PR has not been touched recently)

Comments

@howardjohn (Contributor)

Title: v2 <-> v3 TlsContext causes temporary downtime

Description:
Switching the TLS context API version on a cluster causes temporary downtime.

Apply this change (full cluster below) to a v2 cluster via xDS:

```diff
-           "@type": "type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext",
+           "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
```

A few requests fail with: client disconnected, failure reason: TLS error: Secret is not supplied by SDS.

Repro steps:
I can easily reproduce this with the Istio control plane by just swapping out the version. I assume it could be reproduced with file-based xDS, but I haven't produced a minimal reproducer yet; a hypothetical sketch follows.
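An untested sketch of what such a file-based repro might look like (all paths, ports, and names below are hypothetical, not taken from this issue): run Envoy with a path-based CDS source whose file contains a DiscoveryResponse wrapping the full cluster shown below, then rewrite the `@type` in place while sending traffic.

```yaml
# Untested sketch of a file-based repro; paths, ports, and node IDs are hypothetical.
node:
  id: repro-node
  cluster: repro
dynamic_resources:
  cds_config:
    path: /tmp/cds.yaml  # a DiscoveryResponse wrapping the "Full cluster" below
static_resources:
  clusters:
  - name: sds-grpc  # the SDS server referenced by the cluster's sdsConfig
    type: STATIC
    connect_timeout: 1s
    http2_protocol_options: {}
    load_assignment:
      cluster_name: sds-grpc
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 15012
# (listener/route config to drive traffic through the cluster omitted)
```

The expectation, matching the report above, is that flipping the type URL in /tmp/cds.yaml between the v2 and v3 forms while requests are in flight would surface the same "Secret is not supplied by SDS" error until the secret is re-fetched.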

Full cluster:

```json
{
    "transportSocketMatches": [
        {
            "name": "tlsMode-istio",
            "match": {
                "tlsMode": "istio"
            },
            "transportSocket": {
                "name": "envoy.transport_sockets.tls",
                "typedConfig": {
                    "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
                    "commonTlsContext": {
                        "tlsCertificateSdsSecretConfigs": [
                            {
                                "name": "default",
                                "sdsConfig": {
                                    "apiConfigSource": {
                                        "apiType": "GRPC",
                                        "grpcServices": [
                                            {
                                                "envoyGrpc": {
                                                    "clusterName": "sds-grpc"
                                                }
                                            }
                                        ]
                                    }
                                }
                            }
                        ],
                        "combinedValidationContext": {
                            "defaultValidationContext": {
                                "matchSubjectAltNames": [
                                    {
                                        "exact": "spiffe://cluster.local/ns/testyomesh/sa/default"
                                    }
                                ]
                            },
                            "validationContextSdsSecretConfig": {
                                "name": "ROOTCA",
                                "sdsConfig": {
                                    "apiConfigSource": {
                                        "apiType": "GRPC",
                                        "grpcServices": [
                                            {
                                                "envoyGrpc": {
                                                    "clusterName": "sds-grpc"
                                                }
                                            }
                                        ]
                                    }
                                }
                            }
                        },
                        "alpnProtocols": [
                            "istio-peer-exchange",
                            "istio"
                        ]
                    },
                    "sni": "outbound_.80_._.testyomesh-1.testyomesh.svc.cluster.local"
                }
            }
        },
        {
            "name": "tlsMode-disabled",
            "match": {},
            "transportSocket": {
                "name": "envoy.transport_sockets.raw_buffer"
            }
        }
    ],
    "name": "outbound|80||testyomesh-1.testyomesh.svc.cluster.local",
    "type": "EDS",
    "edsClusterConfig": {
        "edsConfig": {
            "ads": {}
        },
        "serviceName": "outbound|80||testyomesh-1.testyomesh.svc.cluster.local"
    },
    "connectTimeout": "10s",
    "circuitBreakers": {
        "thresholds": [
            {
                "maxConnections": 4294967295,
                "maxPendingRequests": 4294967295,
                "maxRequests": 4294967295,
                "maxRetries": 4294967295
            }
        ]
    },
    "filters": [
        {
            "name": "istio.metadata_exchange",
            "typedConfig": {
                "@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
                "typeUrl": "type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange",
                "value": {
                    "protocol": "istio-peer-exchange"
                }
            }
        }
    ]
}
```
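One detail worth calling out: the sdsConfig above leaves the SDS transport and resource API versions implicit. Below is a sketch of the same block with the versions pinned explicitly; these fields do exist on the v3 ConfigSource/ApiConfigSource, but whether pinning them avoids the re-fetch described in this issue is untested.

```yaml
# Sketch: the same sdsConfig with API versions pinned explicitly.
# Untested whether this avoids the "Secret is not supplied by SDS" window.
sds_config:
  resource_api_version: V3
  api_config_source:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: sds-grpc
```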
howardjohn added the bug and triage (Issue requires triage) labels on Nov 2, 2020
howardjohn added a commit to howardjohn/istio that referenced this issue on Nov 2, 2020: "Do not switch TLS version on 1.6 -> 1.7 upgrade" (same message as the istio/istio commit below).
istio-testing pushed a commit to istio/istio that referenced this issue on Nov 2, 2020

* Do not switch TLS version on 1.6 -> 1.7 upgrade

See #28120
See envoyproxy/envoy#13864

This resolves a downtime event on in-place upgrades from 1.6 to 1.7: a
couple of seconds of 503s.

This is intentionally sent only to the 1.7 branch, as it is only
relevant there.

Please note this feature flag ships on by default. We have two choices:

* Off by default. Anyone upgrading from 1.6 to 1.7 will continue to get
downtime unless they read the release notes and add the flag.
* On by default. Anyone with 1.7 already deployed, but that still has
1.6 proxies, will incur downtime unless they read the release notes and
remove the flag.

I have chosen on by default, as the set of people with 1.6 proxies and a
1.7.x Istiod upgrading to 1.7.5 seems far smaller than the set impacted
by "off by default", and the mitigation is the same. Additionally, for
those who are impacted, the impact will be limited to the proxies on
1.6, which is presumably not 100% of proxies, whereas in the "off by
default" case ALL proxies are on 1.6 and thus impacted.

* fix nil

* Fix initial fetch
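Concretely, the flag described above (left unnamed here, since the commit does not name it) controls which of the two type URLs from the diff at the top of this issue Istiod emits in the cluster's transport socket. An abridged sketch of the two variants:

```yaml
# With the flag on (the default), Istiod keeps emitting the v2 type URL:
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
    # ... commonTlsContext as in the full cluster above ...
---
# With the flag off, Istiod emits the v3 type URL:
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    # ... commonTlsContext as in the full cluster above ...
```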
lizan self-assigned this on Nov 2, 2020
mattklein123 added the area/sds (SDS related) and help wanted (Needs help!) labels and removed the triage (Issue requires triage) label on Nov 3, 2020
@lizan (Member) commented Nov 5, 2020

@howardjohn I could only reproduce this behavior (with a mock config) on the 1.14.x branch, i.e. Istio proxy 1.6.x, but not with 1.15.x, 1.16.x, or latest master. Can you confirm that's the case? If so, the upgrade path would be to update to the 1.7.x proxy first, then migrate to the v3 transport socket.

I'll try to figure out which commit exactly fixes this and report back.

lizan removed the help wanted (Needs help!) label on Nov 5, 2020
github-actions bot commented Dec 9, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label on Dec 9, 2020
mattklein123 removed the stale (stalebot believes this issue/PR has not been touched recently) label on Dec 9, 2020
github-actions bot commented Jan 8, 2021

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label on Jan 8, 2021
github-actions bot commented

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
